<a href="https://colab.research.google.com/github/kristenvaccaro/CSE190-HW5/blob/main/Week_5_Recourse_and_%22The_Algorithm%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated Toxicity Detection

In this assignment you will consider interactions around an automated toxic content (e.g., hate speech) detection system.

## Part 1: Building and testing a toxic content prediction tool

In [44]:
# Start by making sure the data is available (e.g., by uploading it to Colab).
# Then load the toxic.tsv file (the data is tab-separated rather than comma-separated).
# You can see details about the data here:
import pandas as pd

df = pd.read_csv('toxic.tsv', sep='\t')
labels = df['label'].tolist()
messages = df['message'].tolist()

In [45]:
# While it's generally good practice to check what is in your data after loading it,
# the fact that this data contains toxic content means you may want to limit
# your exploration of the raw data. The first five rows should be fine to view, though.

messages[:5]
#labels[:5]

['butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt butt vbutt vbutt butt v v butt v butt butt butt vbutt vbutt',
 'okay i see. you accused me of being one of them!',
 "It was just pointed out (and I checked it in Guralnik's Careless Whisper)that Presley did indeed die whilst on the toilet.  If you feel it's important then add away.",
 '"   Please do not vandalize pages, as you did with this edit to Los Angeles Lakers. If you continue to do so, you will be blocked from editing.    "',
 '"  Summary GA review – see WP:WIAGA for criteria  Is it reasonably well written? A. Prose quality:     B. MoS compliance:    Is it factually accurate and verifiable? A. References to sources:     B. Citation of reliable sources where necessary:     C. No original research:    Is it broad in its coverage? A. Major aspects:     B. Focused:    Is it neutral? Fair representation without bias:    Is it stable?  No edit wars, etc:    Does it contain images to illus

In [52]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Note that the decision tree needs numerical data,
# so we create a pipeline in which any text is converted to numerical features.
# Here we use TF-IDF (term frequency-inverse document frequency).
# You can read a nice overview of the concept here: https://www.tidytextmining.com/tfidf.html
text_clf = Pipeline([
    ('tfidf', TfidfVectorizer()),  # Convert text to numerical features
    ('clf', DecisionTreeClassifier()),  # Train the decision tree classifier
                                        # You might switch this line to
                                        # ('clf', DecisionTreeClassifier(max_depth=5))
                                        # for the visualization described below
])
text_clf.fit(messages, labels)

In [48]:
# You now have a content moderation system that can predict whether a new string is toxic or not
new_text = ["what an idiot."]
predictions = text_clf.predict(new_text)
print(predictions)

[ True]


In [7]:
# If you'd like to visualize the tree and see the word at each node,
# you can use the following code.
# NOTE: You may want to restrict the depth of the decision tree for this kind
# of visualization (it can be big!), which you can do by using
# ('clf', DecisionTreeClassifier(max_depth=5)) within the pipeline.
# But make sure you remove the max_depth setting for the rest of the assignment.
from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(text_clf['clf'],
                                out_file=None,
                                feature_names=text_clf['tfidf'].get_feature_names_out(),
                                class_names=['Negative', 'Positive'],
                                filled=True,
                                rounded=True,
                                special_characters=True)

graph = graphviz.Source(dot_data)
graph.render("text_decision_tree")  # Saves the tree as a PDF file

'text_decision_tree.pdf'

In [49]:
# Now you can start investigating that model.
# For example, test out how it labels Standard American English (SAE)
# vs. how it labels African American Vernacular English (AAVE)
# The two files aave_samples.txt and sae_samples.txt contain paired statements:
# each line in one file is equivalent to the same line in the other.

sae_df = pd.read_csv('sae_samples.txt', sep='\t', header=None, names=["message"])
aave_df = pd.read_csv('aave_samples.txt', sep='\t', header=None, names=["message"])

sae_messages = sae_df["message"].tolist()
aave_messages = aave_df["message"].tolist()

In [64]:
# Make a prediction for each message in the sae_messages and aave_messages lists


How much more likely is the model to predict AAVE speech as toxic than SAE speech?
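
One possible way to put a number on this, sketched under assumptions: compare the fraction of messages flagged as toxic in each list and take the ratio. The helper names `toxic_rate` and `rate_ratio` and the example prediction lists are hypothetical. The lists are not real model output.

```python
import numpy as np

def toxic_rate(predictions):
    """Fraction of predictions that are True (i.e., flagged as toxic)."""
    return float(np.mean(np.asarray(predictions, dtype=bool)))

def rate_ratio(preds_a, preds_b):
    """How many times more often list b is flagged than list a."""
    rate_a = toxic_rate(preds_a)
    rate_b = toxic_rate(preds_b)
    if rate_a == 0:
        # Avoid dividing by zero when nothing in list a is flagged
        return float("inf") if rate_b > 0 else float("nan")
    return rate_b / rate_a

# Hypothetical predictions, for illustration only
example_sae_preds = [False, False, True, False]
example_aave_preds = [True, False, True, False]

print(toxic_rate(example_sae_preds))                      # 0.25
print(toxic_rate(example_aave_preds))                     # 0.5
print(rate_ratio(example_sae_preds, example_aave_preds))  # 2.0
```

A ratio is only one summary. Depending on your argument, the difference between the two rates may be just as informative.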

In [63]:
# Look at how many toxic and non-toxic predictions your model made for
# SAE and AAVE messages.
# Keep in mind that True/False does not indicate whether the label is correct; it simply corresponds to toxic/non-toxic.
# Also keep in mind each row in the SAE data contains equivalent information
# to the same row in the AAVE data.
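
Because the rows are paired, you can count not only how many messages are flagged in each list but also how the two predictions for the same row relate. Here is a minimal sketch with made-up prediction arrays (not real model output). The variable names are placeholders.

```python
import numpy as np

# Hypothetical paired predictions: position i in both arrays refers to the same statement
example_sae_preds = np.array([False, False, True, False, False])
example_aave_preds = np.array([True, False, True, False, True])

# Totals for each list
print("SAE toxic:", int(example_sae_preds.sum()), "of", len(example_sae_preds))
print("AAVE toxic:", int(example_aave_preds.sum()), "of", len(example_aave_preds))

# Paired breakdown: what happened to the same statement in each variety
paired_counts = {
    "toxic in both": int(np.sum(example_sae_preds & example_aave_preds)),
    "toxic in AAVE only": int(np.sum(~example_sae_preds & example_aave_preds)),
    "toxic in SAE only": int(np.sum(example_sae_preds & ~example_aave_preds)),
    "toxic in neither": int(np.sum(~example_sae_preds & ~example_aave_preds)),
}
print(paired_counts)
```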


Identify 5-10 examples of messages that are labeled toxic in AAVE but non-toxic in SAE.

Do those examples help you understand what has gone wrong with your model?
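
A sketch of one way to pull out such pairs, assuming you already have one boolean prediction per row for each list. The message strings below are placeholders rather than real data, and the prediction arrays are made up.

```python
import numpy as np

# Placeholder strings standing in for the paired lines of the two files
example_sae_messages = ["sae line 0", "sae line 1", "sae line 2", "sae line 3"]
example_aave_messages = ["aave line 0", "aave line 1", "aave line 2", "aave line 3"]

# Hypothetical predictions, for illustration only
example_sae_preds = np.array([False, False, True, False])
example_aave_preds = np.array([True, False, True, True])

# Row indices where the AAVE version is flagged but the SAE version is not
flipped = np.flatnonzero(example_aave_preds & ~example_sae_preds)

# Show at most 10 pairs side by side
for i in flipped[:10]:
    print(i, "| SAE:", example_sae_messages[i], "| AAVE:", example_aave_messages[i])
```

Reading the two versions next to each other makes it easier to spot which words differ between them.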




*Click here to write your text response*

It is possible to access the tree structure in scikit-learn and manually modify it to remove specific nodes. Do you think this would be a good way to solve the problems you've uncovered?

*Click here to write your text response*

For a much more thorough analysis, you could read the article this assignment draws on: https://aclanthology.org/P19-1163.pdf

## Part 2: Recourse (15 pts)

Machine learning classifiers are often used to detect hate speech online. For example, the bot pictured below was designed by [Jasper Stephenson](https://www.jasperstephenson.com) to help manage abusive content on Discord.
![SafeSpaceBot](https://storage.googleapis.com/3lix-images/1rFcSntbispfYHagAX129_qcoHpbqfmsNn1P67Ncjg4I/kix.s5cl9igwh4bf-large.jpg)

Suppose that you are designing an interface for a social media site. The machine learning team has already developed a hate speech detection model.

Instead of simply banning users or content, your team wants to nudge users to improve their behavior. You are tasked with designing a way to warn users before they post content that the model deems harmful and to give them an opportunity to alter the offending content.

**Provide three sketches and explain how your designs provide recourse to users.**

*Each sketch is worth five points.*

Please ensure each sketch does the following:

- Articulate your goal or goals for the design. (1-2 sentences)
- Specify which criteria for recourse (diverse, sparse, plausible, actionable) your design achieves and how it does so. (1-2 sentences)
- Label your sketch. Indicate how your design provides recourse and where/how it achieves the criteria for recourse.
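
If it helps to ground your sketches, here is a rough illustration of what a sparse, actionable suggestion could look like computationally. It assumes "sparse" means changing as little of the message as possible, and it tries deleting one word at a time. `toy_is_toxic` is a hypothetical keyword-based stand-in, not the team's model, and `single_word_edits` is an illustrative helper name.

```python
def toy_is_toxic(text):
    """Hypothetical stand-in for a real model: flags text containing a blocklisted word."""
    toy_blocklist = {"idiot"}
    return any(word.strip(".,!?").lower() in toy_blocklist for word in text.split())

def single_word_edits(text, is_toxic):
    """Return (removed_word, rewritten_text) pairs where dropping one word avoids the flag."""
    words = text.split()
    suggestions = []
    for i in range(len(words)):
        candidate = " ".join(words[:i] + words[i + 1:])
        # Keep only non-empty rewrites that the classifier no longer flags
        if candidate and not is_toxic(candidate):
            suggestions.append((words[i], candidate))
    return suggestions

print(single_word_edits("what an idiot.", toy_is_toxic))
```

An interface could highlight the removed word as the part of the message that triggered the warning. Whether the resulting rewrite is plausible to the user is a separate design question.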

## Part 3: Contestability (5 pts)

Like all machine learning models, your team's hate speech classifier is imperfect and will make mistakes. Sometimes a mistake will be an isolated incident.

Sometimes it may be a systematic problem. For example, you might consistently miss a type of hate speech (e.g., hate speech towards a group that you didn't have in your training data or [for a group that your policy does not protect](https://www.propublica.org/article/facebook-hate-speech-censorship-internal-documents-algorithms)). Or you might consistently label something as hate speech when it's not (e.g., [when a community is reclaiming a term](https://www.npr.org/sections/publiceditor/2019/08/21/752330316/a-former-slur-is-reclaimed-and-listeners-have-mixed-feelings)).

You are tasked with designing an interface to help users discover possible problems with the hate speech classifier and advocate for the social media platform to change how the classifier works.

**Provide one sketch and explain how your design supports contestability.**

Please ensure your sketch does the following:

1. Articulate your goal or goals for your design. These goals should be specific, like a task or component of contestability that you will focus on. Contestability is defined as providing mechanisms for users to "understand, construct, shape and challenge model predictions". *As an example, you could choose to focus on the aspect of "challenging" the model and set the goal: "build consensus in a community that the problem exists".*

2. Specify at least three criteria for someone to succeed at the goal. Note that unlike recourse, these criteria are not being defined in class. You should think about what appropriate criteria are. *As an example, if the goal you choose is "building consensus in a community that the problem exists," you might define the criteria as: 1) focusing on concrete examples/instances of the problem, 2) providing a signal of how much agreement there is, 3) encouraging disagreements to be civil and constructive.*

3. Label your sketch. Indicate where/how your design achieves the criteria identified above.

*Note: Novel and creative approaches are encouraged.*




