# Testing Data Bias in Google Jigsaw's Perspective API


The goal of this assignment is to determine some form of bias in the Perspective API, which is a natural language processing model designed to provide a toxicity score for internet comments.

In [33]:
import numpy as np
import pandas as pd

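import os

from googleapiclient import discovery
import json

# Read the key from the environment instead of hardcoding it in the notebook.
# PERSPECTIVE_API_KEY is an assumed variable name; set it before running.
API_KEY = os.environ["PERSPECTIVE_API_KEY"]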

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)


test_df = pd.read_csv("data_bias_training_data2.csv")
test_df.head()



| Unnamed: 0 | Type | Comments | Actual Label | Perspective API |
|---|---|---|---|---|
| 0 | ACC | wtf that was so crazy | NOT TOXIC | |
| 1 | FULL | what the fuck that was so crazy | NOT TOXIC | |
| 2 | ACC | thats so funny im lmao | NOT TOXIC | |
| 3 | FULL | thats so funny im laughing my ass off | NOT TOXIC | |
| 4 | ACC | thats so hilarious im lmfao | NOT TOXIC | |


In [34]:


toxic_base = 0.5

# Testing with neutral queries
neutral_text_samples = [
  'this is a neutral comment',
  'the sky is blue',
  'hello world',
  'leaves are green',
]

# Iterate through the statements and score the toxicity of each

for text in neutral_text_samples:
  analyze_request = {
    'comment': { 'text': text },
    'requestedAttributes': {'TOXICITY': {}}
  }

  response = client.comments().analyze(body=analyze_request).execute()
  toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

  # Print each score with the label it maps to under the threshold
  if toxicity_score > toxic_base:
    print(f'comment: {text}\n Score: {toxicity_score}\n TOXIC\n')
  else:
    print(f'comment: {text}\n Score: {toxicity_score}\n NOT TOXIC \n')



comment: this is a neutral comment
 Score: 0.020607091
 NOT TOXIC 

comment: the sky is blue
 Score: 0.022610499
 NOT TOXIC 

comment: hello world
 Score: 0.020842785
 NOT TOXIC 

comment: leaves are green
 Score: 0.03309893
 NOT TOXIC 



Above, I created a few objective and neutral statements to establish a baseline idea for what the model considers "toxic". I also established a threshold of toxicity (above 0.5 being toxic).

The code above iterates through the list of statements, sends each one to the API, and stores the reply and its score in the variables `response` and `toxicity_score`. It then prints the score along with the toxicity label that the score maps to under the threshold.

The output shows that every statement received a very low toxicity score (all below 0.1).
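
The extraction and thresholding steps described above can be pulled into small helpers so they can be checked without calling the API. This is only a minimal sketch: `extract_toxicity` and `label_for` are hypothetical names, and the sample response is a hand-built stand-in that mirrors just the nested keys used in the cell above, not a full Perspective API response.

```python
# Hypothetical helpers; the key path matches the one used in the loop above.
TOXIC_BASE = 0.5  # same threshold as toxic_base in the notebook


def extract_toxicity(response):
    """Return the summary TOXICITY score from an analyze() response dict."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def label_for(score, threshold=TOXIC_BASE):
    """Map a score to the notebook's labels (strictly above the threshold is toxic)."""
    return "TOXIC" if score > threshold else "NOT TOXIC"


# Hand-built stand-in for a response, using the score reported above
# for "this is a neutral comment".
sample_response = {
    "attributeScores": {"TOXICITY": {"summaryScore": {"value": 0.020607091}}}
}

score = extract_toxicity(sample_response)
print(score, label_for(score))
```

Because the comparison is strict, a score of exactly 0.5 is labelled NOT TOXIC, which matches the `>` check in the loop.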

## Hypothesis

#### The model will likely score obscenities as much more toxic than other negative language, but, given acronyms that contain obscenities, the model will output a much lower score than the statements that contain the obscenities in full.



If the model is unable to correctly identify acronyms as toxic, this could demonstrate age bias. Online, children are the most vulnerable to being exposed to profanities, and they are much more likely to see these in the form of acronyms. Platforms and developers that use Perspective API should then be aware that the API does not account for profane text that is seen in the form of acronyms.


In [35]:
comments = test_df["Comments"]

def checking_toxicity(comments):
  """Score each comment with the Perspective API and return its TOXIC / NOT TOXIC label."""
  labels = []

  for text in comments:
    analyze_request = {
      'comment': { 'text': text },
      'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    # Same threshold as the neutral baseline: strictly above toxic_base is toxic
    comment_label = "TOXIC" if toxicity_score > toxic_base else "NOT TOXIC"
    labels.append(comment_label)

    print(f'comment: {text}\n Score: {toxicity_score}\n {comment_label}\n')

  return labels


API_labels = checking_toxicity(comments)



comment: wtf that was so crazy
 Score: 0.5566829
 TOXIC

comment: what the fuck that was so crazy
 Score: 0.88599813
 TOXIC

comment: thats so funny im lmao
 Score: 0.08637348
 NOT TOXIC

comment: thats so funny im laughing my ass off 
 Score: 0.62136006
 TOXIC

comment: thats so hilarious im lmfao
 Score: 0.112643376
 NOT TOXIC

comment: thats so hilarious im laughing my fucking ass off
 Score: 0.85333383
 TOXIC

comment: that test was so hard, im going to kms
 Score: 0.09394901
 NOT TOXIC

comment: that test was so hard, im going to kill myself
 Score: 0.7308154
 TOXIC

comment: I cant believe we lost, that is such bs
 Score: 0.24442963
 NOT TOXIC

comment: I cant believe we lost, that is such bullshit
 Score: 0.8460273
 TOXIC

comment: I have to leave for school rfn
 Score: 0.08716487
 NOT TOXIC

comment: I have to leave for school right fucking now
 Score: 0.76197964
 TOXIC

comment: you must be joking, bffr
 Score: 0.17371799
 NOT TOXIC

comment: you must be joking, be fucking for

The above code iterates through a list of comments, some of which are actually toxic and some of which are not. Ideally, the model should have the same level of accuracy for comments with acronyms containing profanity that it does for comments with the full text of the profanity.

Just looking at the output above, it looks like this is not the case. All of the comments containing profanity are labelled as toxic, while almost none of the comments with acronyms are labelled as toxic.
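
To make the comparison concrete, the gap between each acronym comment and its full-text counterpart can be computed directly from the scores printed above. The sketch below hard-codes the six complete pairs from that output (the final pair is cut off, so it is left out); the `pairs` structure and variable names are illustrative only.

```python
# Scores copied from the printed output above:
# acronym -> (acronym score, full-text score)
pairs = {
    "wtf": (0.5566829, 0.88599813),
    "lmao": (0.08637348, 0.62136006),
    "lmfao": (0.112643376, 0.85333383),
    "kms": (0.09394901, 0.7308154),
    "bs": (0.24442963, 0.8460273),
    "rfn": (0.08716487, 0.76197964),
}

# A positive gap means the full-text version scored as more toxic.
gaps = {acronym: full - acc for acronym, (acc, full) in pairs.items()}

for acronym, gap in gaps.items():
    print(f"{acronym}: full text scored {gap:.2f} higher")

mean_gap = sum(gaps.values()) / len(gaps)
print(f"mean gap: {mean_gap:.2f}")
```

In every complete pair, the full-text version receives the higher score.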

Let's now check the accuracy of the model by comparing the toxic/not toxic labels that the model outputs with the actual toxic/not toxic labels in the test data.

To do this, we will compute the percentage of true positives (toxic comments that were correctly flagged as toxic) and true negatives (non-toxic comments that were correctly labelled not toxic), separately for the comments containing acronyms and the comments containing the full-text profanities.

## Accuracy Checking




In [36]:
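actual_labels = test_df["Actual Label"].tolist()
comment_types = test_df["Type"].tolist()

ACC_indices = []
FULL_indices = []

# Record which rows hold acronym comments and which hold full-text comments
for i in range(len(comment_types)):
  if comment_types[i] == "ACC":
    ACC_indices.append(i)
  else:
    FULL_indices.append(i)

ACC_api_labels = [API_labels[i] for i in ACC_indices]
FULL_api_labels = [API_labels[i] for i in FULL_indices]

# Keep only the actual labels of the acronym rows. The accuracy check compares
# both the ACC and FULL predictions against this list, which assumes each
# ACC/FULL pair appears in the same order and shares the same actual label.
actual_labels = [actual_labels[i] for i in ACC_indices]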


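First, I collected the row indices of the acronym comments and of the full-text comments. I then used those indices to pull the matching entries out of `API_labels`, the list of Perspective API labels built earlier, giving one list per category (`ACC_api_labels` and `FULL_api_labels`). Finally, I reduced `actual_labels` to the acronym rows, so the comparison relies on each acronym comment and its full-text counterpart sharing the same actual label.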
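
The same split can be written without tracking indices by pairing each type with its label using `zip`. This is an alternative sketch, not the approach used in this notebook; it uses only the five rows shown by `test_df.head()`, with the API labels taken from the printed output for those comments, and the variable names are illustrative.

```python
# Types of the first five rows, as shown by test_df.head()
types = ["ACC", "FULL", "ACC", "FULL", "ACC"]

# API labels printed earlier for those same five comments
api_labels = ["TOXIC", "TOXIC", "NOT TOXIC", "TOXIC", "NOT TOXIC"]

# Pair every type with its label, then keep the labels of each category
acc_api_labels = [label for kind, label in zip(types, api_labels) if kind == "ACC"]
full_api_labels = [label for kind, label in zip(types, api_labels) if kind == "FULL"]

print(acc_api_labels)
print(full_api_labels)
```

Filtering on the type directly keeps the two lists in row order, which is what the pairwise comparison of acronym and full-text comments depends on.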

In [37]:
def calculate_accuracy(actual_labels, ACC_api_labels, FULL_api_labels):
    """Return true positive and true negative percentages for acronym and full-text comments."""

    TP_acronym = 0
    TP_full = 0
    TN_acronym = 0
    TN_full = 0

    t_total = 0   # number of actually toxic comments
    nt_total = 0  # number of actually non-toxic comments

    for i in range(len(actual_labels)):
        if actual_labels[i] == "TOXIC":
            t_total += 1
            TP_acronym += ACC_api_labels[i] == "TOXIC"
            TP_full += FULL_api_labels[i] == "TOXIC"

        elif actual_labels[i] == "NOT TOXIC":
            nt_total += 1
            TN_acronym += ACC_api_labels[i] == "NOT TOXIC"
            TN_full += FULL_api_labels[i] == "NOT TOXIC"

    def percent(count, total):
        # Report 0 instead of raising when a class has no examples
        try:
            return (count / total) * 100
        except ZeroDivisionError as e:
            print(f"Error: {e}")
            return 0

    percent_TP_acronym = percent(TP_acronym, t_total)
    percent_TP_full = percent(TP_full, t_total)
    percent_TN_acronym = percent(TN_acronym, nt_total)
    percent_TN_full = percent(TN_full, nt_total)

    return percent_TP_acronym, percent_TN_acronym, percent_TP_full, percent_TN_full

percent_TP_acronym, percent_TN_acronym, percent_TP_full, percent_TN_full = calculate_accuracy(actual_labels, ACC_api_labels, FULL_api_labels)

print(f'Percentage of actually toxic acronyms labelled toxic: {percent_TP_acronym}%')
print(f'Percentage of not toxic acronyms labelled not toxic: {percent_TN_acronym}%')
print(f'Percentage of actually toxic full text labelled toxic: {percent_TP_full}%')
print(f'Percentage of not toxic full text labelled not toxic correctly: {percent_TN_full}%')




Percentage of actually toxic acronyms labelled toxic: 0.0%
Percentage of not toxic acronyms labelled not toxic: 83.33333333333334%
Percentage of actually toxic full text labelled toxic: 100.0%
Percentage of not toxic full text labelled not toxic correctly: 0.0%


Above, I counted the number of true positives and true negatives for both the full-text comments and the comments containing acronyms. I then divided each count by the total number of actually toxic or actually non-toxic comments, which gives the percentage of the time that the model labelled each kind of comment correctly, split by comment type (full text or acronym).
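
The two percentages described above are the true positive rate (correctly flagged toxic comments divided by all actually toxic comments) and the true negative rate (correctly passed comments divided by all actually non-toxic comments). A compact way to compute both for one group of comments is sketched below. `rates` is a hypothetical helper, and the toy labels are invented for illustration rather than taken from the dataset.

```python
def rates(actual, predicted):
    """Return (true positive rate, true negative rate) as percentages.

    A rate is None when its denominator is zero, i.e. when the group has
    no actually toxic (or no actually non-toxic) comments.
    """
    tp = sum(a == "TOXIC" and p == "TOXIC" for a, p in zip(actual, predicted))
    tn = sum(a == "NOT TOXIC" and p == "NOT TOXIC" for a, p in zip(actual, predicted))
    toxic_total = actual.count("TOXIC")
    not_toxic_total = actual.count("NOT TOXIC")
    tpr = 100 * tp / toxic_total if toxic_total else None
    tnr = 100 * tn / not_toxic_total if not_toxic_total else None
    return tpr, tnr


# Invented example: 2 toxic comments (1 caught), 2 non-toxic comments (both passed).
toy_actual = ["TOXIC", "TOXIC", "NOT TOXIC", "NOT TOXIC"]
toy_predicted = ["TOXIC", "NOT TOXIC", "NOT TOXIC", "NOT TOXIC"]

tpr, tnr = rates(toy_actual, toy_predicted)
print(tpr, tnr)
```

Returning `None` rather than 0 for an empty class is a design choice in this sketch: it keeps "no examples of this class" distinguishable from "none labelled correctly", which the notebook's version reports as 0 in both cases.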


The model correctly identified a toxic comment containing an acronym 0% of the time. Looking at the earlier output, the model labelled only one acronym comment as toxic ("wtf that was so crazy"), and that comment is marked not toxic in the test data. This supports my hypothesis: the model did not recognize toxicity when the profanity was hidden inside a commonly used acronym. Developers using the Perspective API therefore need to be aware that users could be exposed to toxic comments whose obscenities are hidden in acronyms. This could represent a form of age bias because children are especially vulnerable to inappropriate language on the internet and could be exposed to such obscenities without their parents being aware.


The model correctly labelled a non-toxic comment containing a profane acronym as not toxic 83.3% of the time. In this situation, a false negative, as seen with the lack of true positives, is much more harmful than a false positive. This means that, although the model correctly classified most of the non-toxic acronym comments, the vulnerability in the model is still clear, and the model could still cause unintentional harm.

The model correctly predicted that a comment containing full-text profanities was toxic 100% of the time. In fact, every comment containing a full-text profanity was labelled toxic. Although this demonstrates the model's ability to flag profane language to a degree, it does not make up for the model's inability to flag comments with acronyms, which could cause harm. The statements "that test was so hard, im going to kms" and "that test was so hard, im going to kill myself" can cause just as much harm because they have the same meaning, but the former received a toxicity score of 0.09 and the latter received a score of 0.73.


Finally, the model correctly predicted that a comment containing full-text profanities was not toxic 0% of the time. Because the model does not understand social cues or slang, it is understandable that many of the comments with profanities are labelled as toxic. As I said before, a false negative in this situation is much more harmful than a false positive. Because of this, the model's failure to label non-toxic comments containing profanities as not toxic is much less concerning, and can even be seen as a potential strength of the model.

Generally, the model is likely to flag comments as toxic if they contain any harmful words, regardless of the content of the rest of the statement. This suggests that the model would also likely fail to identify slang words like "cringe" or "cheugy" as toxic. It also means that neutral statements containing words with double meanings (for instance, "the bitch just gave birth to a litter of puppies") would likely be inaccurately flagged as toxic. Additionally, users on platforms that use the Perspective API could get around the model's flags by hiding their language in acronyms, as seen throughout the dataset.
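
One way a platform could account for this gap is to expand known acronyms before a comment is scored. The sketch below illustrates the idea only and was not part of this analysis: `expand_acronyms` is a hypothetical helper, and its lookup table contains just the acronym and full-text pairs that appear in the dataset above.

```python
import re

# Expansions taken from the ACC/FULL comment pairs in the dataset above.
# A real deployment would need a much larger, maintained list.
EXPANSIONS = {
    "wtf": "what the fuck",
    "lmao": "laughing my ass off",
    "lmfao": "laughing my fucking ass off",
    "kms": "kill myself",
    "bs": "bullshit",
    "rfn": "right fucking now",
}

# Match whole words only, so "bs" does not fire inside "absurd".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, EXPANSIONS)) + r")\b", re.IGNORECASE
)


def expand_acronyms(text):
    """Replace known acronyms with their full-text form."""
    return _PATTERN.sub(lambda m: EXPANSIONS[m.group(1).lower()], text)


print(expand_acronyms("that test was so hard, im going to kms"))
```

The expanded text could then be sent to the API in place of the original comment. Whether that is desirable depends on the platform, since it would also raise the scores of acronym comments that the test data marks as not toxic.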
