
BlackWall

A Safety Line Against Rogue AI

Overview

BlackWall intro video (click the image in the repository to watch)
Project video preview (click the image in the repository to watch)

Abstract

The increasing integration of social media and conversational AI into daily life has intensified concerns about the spread of harmful, illegal, and psychologically sensitive content, including suicidal ideation, self-harm, depression, and other forms of negative influence. Recent cases of emotional attachment to chatbots, and instances in which AI systems have unintentionally misled users on mental health issues, highlight the urgency of reliable safety mechanisms. This paper presents BlackWall, a domain-aware and interpretable framework designed to identify, assess, and rank high-risk content across online platforms. By operating across heterogeneous data sources and providing transparent risk explanations, BlackWall supports early intervention, responsible moderation, and safer human–AI interaction. The framework aims to contribute toward ethically grounded content safety systems that mitigate psychological harm while preserving transparency and trust.

The BlackWall project focuses on developing a reliable and interpretable artificial intelligence system for identifying and assessing suicidal ideation alongside other harmful, illegal, and psychologically sensitive content, including self-harm, depression, and broader forms of negative or misleading influence. Rather than operating on social media platforms directly, BlackWall is trained and evaluated on curated datasets originating from online environments, with the primary objective of preventing intelligent systems from generating, reinforcing, or amplifying dangerous content.

The system is designed not only to distinguish between low-risk and high-risk content, but also to provide transparent and explainable risk assessments that can support mental health professionals and safety moderators, and inform the development of responsible AI-driven tools.
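For illustration only, the sketch below shows the kind of structured record such an explainable assessment could produce. The field names are assumptions made for this example, not BlackWall's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RiskAssessment:
    """Hypothetical structure for a single explainable risk assessment."""
    text_id: str
    risk_label: str          # e.g. "Potential Suicide Post" or "Not Suicide Post"
    severity: int            # 0-5 scale used by the Reddit-style annotations
    confidence: float        # model probability for the predicted label
    rationale_spans: List[str] = field(default_factory=list)  # phrases that drove the score
```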

BlackWall is evaluated on a Twitter-derived dataset of short text samples labeled as “Not Suicide Post” or “Potential Suicide Post”, and on a Reddit SuicideWatch–derived dataset of longer-form texts annotated with a severity score from 0 to 5, reflecting increasing levels of suicide risk.
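As a minimal sketch of how the two label schemes could be handled side by side, the snippet below maps the binary Twitter labels and the 0–5 Reddit severity scores onto a common risk field. The file names, column names, and the binarization threshold are assumptions for illustration, not part of the released datasets.

```python
import pandas as pd

# Binary labels used by the Twitter-derived dataset.
TWITTER_LABELS = {"Not Suicide Post": 0, "Potential Suicide Post": 1}

def load_twitter(path: str) -> pd.DataFrame:
    """Load the short-text Twitter-derived dataset (assumed columns: text, label)."""
    df = pd.read_csv(path)
    df["risk"] = df["label"].map(TWITTER_LABELS)   # 0 = low risk, 1 = potential risk
    df["domain"] = "twitter"
    return df[["text", "risk", "domain"]]

def load_reddit(path: str) -> pd.DataFrame:
    """Load the longer-form Reddit SuicideWatch dataset (assumed columns: text, severity)."""
    df = pd.read_csv(path)
    # Collapse the 0-5 severity scale to the same binary risk field for joint training;
    # the >= 1 threshold is an assumption, not a project-defined cut-off.
    df["risk"] = (df["severity"] >= 1).astype(int)
    df["domain"] = "reddit"
    return df[["text", "risk", "domain"]]

if __name__ == "__main__":
    data = pd.concat([load_twitter("twitter.csv"), load_reddit("reddit.csv")])
    print(data["domain"].value_counts())
```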

To improve robustness under distribution shifts between heterogeneous data sources, BlackWall incorporates Domain Adversarial Training (DAT) to reduce domain-specific bias and encourage domain-invariant representations. In addition, we include a PAC-oriented evaluation and conditioning strategy to assess whether learned decision rules generalize consistently under standard learnability assumptions, supporting stable performance as data scale and domains vary.
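Domain Adversarial Training is typically realized with a gradient reversal layer in the style of DANN (Ganin et al.). The PyTorch sketch below shows that general pattern with arbitrarily chosen layer sizes; it illustrates the technique rather than the exact BlackWall architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainAdversarialModel(nn.Module):
    """Shared encoder feeding a risk classifier and an adversarial domain classifier."""
    def __init__(self, in_dim: int = 768, hidden: int = 256, n_domains: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.risk_head = nn.Linear(hidden, 2)             # low risk vs. high risk
        self.domain_head = nn.Linear(hidden, n_domains)   # e.g. Twitter vs. Reddit

    def forward(self, x: torch.Tensor, lambd: float = 1.0):
        z = self.encoder(x)
        risk_logits = self.risk_head(z)
        # The domain head only receives reversed gradients, pushing the encoder
        # toward domain-invariant features.
        domain_logits = self.domain_head(GradReverse.apply(z, lambd))
        return risk_logits, domain_logits
```

During training, the sum of the risk-classification loss and the domain-classification loss is minimized; because the encoder sees only reversed gradients from the domain head, it is driven toward representations that predict risk well while carrying little domain-specific signal.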

Overall, BlackWall aims to demonstrate how domain-robust and explainable AI can serve as a protective mechanism against unsafe or rogue behavior in intelligent systems, contributing to safer human–AI interaction while adhering to ethical and technical robustness requirements.