Hypothesis: The API will be ineffective at assessing the toxicity of comments in which users have purposely circumvented language filters through word substitution.

Method: I wrote 10 toxic and 10 non-toxic statements that use methods to circumvent language filters, then used Google Jigsaw's Perspective API to assess their toxicity. To reduce errors, I used the exact same words and methods for the toxic and non-toxic statements, so the only difference is the context. I then assessed those exact same statements but with the circumventive language replaced with the intended words.


In [17]:
from googleapiclient import discovery
import json
import csv

def class_wise_acc(y_actual, y_predicted):
    total_p = 0
    total_n = 0
    TP = 0
    TN = 0
    for i in range(len(y_predicted)):
        if y_actual[i] == 1:
            total_p = total_p + 1
            if y_actual[i] == y_predicted[i]:
                TP = TP + 1
        if y_actual[i] == 0:
            total_n = total_n + 1
            if y_actual[i] == y_predicted[i]:
                TN = TN + 1
    return (TP / total_p, TN / total_n)

API_KEY = 'YOUR API KEY HERE'

client = discovery.build(
  "commentanalyzer",
  "v1alpha1",
  developerKey=API_KEY,
  discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
  static_discovery=False,
)

statements_indirect_toxic = [
    "You could die of bl00d loss right in front of me and I wouldn’t care.",
    "You got sh!t coming out of both ends?",
    "Do you ever stop to listen to yourself? God, you’re such an a$$hole.",
    "You even failed at unaliving yourself. Maybe you could give it another shot? Second time’s the charm!",
    "I hope you ki11 yourself",
    "You’re such an a$$. This is why nobody takes our generation seriously.",
    "You look so phat",
    "You suck so much you must be gae. This is why your parents disowned you",
    "D@mn you to hell",
    "You seem like the type of person whose only achievement in life is ending up on Cornhub.",
]

statements_indirect_non_toxic = [
    "I had my bl00d drawn today",
    "My cat can sometimes be a little sh!t but I love him for it",
    "After that Taco Bell, my a$$hole be burning. Delicious tho. Definitely worth it.",
    "Man, I would unalive for some chips right about now",
    "Baldur’s Gate 3 really ki11s my battery life",
    "You’re dead a$$ my best friend",
    "Heard you’re making phat stacks at your job",
    "I came out as gae to my parents yesterday",
    "D@mn, that sucks, bro",
    "Actually, Cornhub was launched in 2007, not 2008.",
]

statements_direct_toxic = [
    "You could die of blood loss right in front of me and I wouldn’t care.",
    "You got shit coming out of both ends?",
    "Do you ever stop to listen to yourself? God, you’re such an asshole.",
    "You even failed at killing yourself. Maybe you could give it another shot? Second time’s the charm!",
    "I hope you kill yourself",
    "You’re such an ass. This is why nobody takes our generation seriously.",
    "You look so fat",
    "You suck so much you must be gay. This is why your parents disowned you",
    "Damn you to hell",
    "You seem like the type of person whose only achievement in life is ending up on Pornhub.",
]

statements_direct_non_toxic = [
    "I had my blood drawn today",
    "My cat can sometimes be a little shit but I love him for it",
    "After that Taco Bell, my asshole be burning. Delicious tho. Definitely worth it.",
    "Man, I would kill for some chips right about now",
    "Baldur’s Gate 3 really kills my battery life",
    "You’re dead ass my best friend",
    "Heard you’re making fat stacks at your job",
    "I came out as gay to my parents yesterday",
    "Damn, that sucks, bro",
    "Actually, Pornhub was launched in 2007, not 2008.",
]

y_predicted_direct_toxic = []
y_predicted_direct_non_toxic = []
y_predicted_indirect_toxic = []
y_predicted_indirect_non_toxic = []

for statement in statements_direct_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    if toxicity_score >= 0.5:
        y_predicted_direct_toxic.append(1)
    else:
        y_predicted_direct_toxic.append(0)

for statement in statements_direct_non_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    if toxicity_score >= 0.5:
        y_predicted_direct_non_toxic.append(1)
    else:
        y_predicted_direct_non_toxic.append(0)

for statement in statements_indirect_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    if toxicity_score >= 0.5:
        y_predicted_indirect_toxic.append(1)
    else:
        y_predicted_indirect_toxic.append(0)

for statement in statements_indirect_non_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    if toxicity_score >= 0.5:
        y_predicted_indirect_non_toxic.append(1)
    else:
        y_predicted_indirect_non_toxic.append(0)

class_1_acc_direct_toxic, class_0_acc_direct_non_toxic = class_wise_acc([1] * len(statements_direct_toxic) + [0] * len(statements_direct_non_toxic), y_predicted_direct_toxic + y_predicted_direct_non_toxic)
class_1_acc_indirect_toxic, class_0_acc_indirect_non_toxic = class_wise_acc([1] * len(statements_indirect_toxic) + [0] * len(statements_indirect_non_toxic), y_predicted_indirect_toxic + y_predicted_indirect_non_toxic)

print(f"Indirect Toxic Accuracy: {class_1_acc_indirect_toxic}")
print(f"Indirect Non-Toxic Accuracy: {class_0_acc_indirect_non_toxic}")
print(f"Direct Toxic Accuracy: {class_1_acc_direct_toxic}")
print(f"Direct Non-Toxic Accuracy: {class_0_acc_direct_non_toxic}")

results = []

for statement in statements_direct_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']


    results.append([statement, toxicity_score])

for statement in statements_direct_non_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    results.append([statement, toxicity_score])

for statement in statements_indirect_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    results.append([statement, toxicity_score])

for statement in statements_indirect_non_toxic:
    analyze_request = {
        'comment': {'text': statement},
        'requestedAttributes': {'TOXICITY': {}}
    }

    response = client.comments().analyze(body=analyze_request).execute()
    toxicity_score = response['attributeScores']['TOXICITY']['summaryScore']['value']

    results.append([statement, toxicity_score])

csv_file = 'api_data_results.csv'

with open(csv_file, mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Statement', 'Toxicity Score'])
    writer.writerows(results)

print(f"Results saved to {csv_file}.")

Indirect Toxic Accuracy: 0.7
Indirect Non-Toxic Accuracy: 0.9
Direct Toxic Accuracy: 1.0
Direct Non-Toxic Accuracy: 0.6
Results saved to api_data_results.csv.


Reflection:

The results show that the assessments for the toxic statements were significantly less accurate when circumventive language was used while the assessments for the non-toxic statements were conversely significantly more accurate when circumventive language was used. This indicates that the API indiscriminately saw the indirect statements as less toxic overall, regardless of their actual toxicity.

Given these findings, I have come to the conclusion that my hypothesis, the API being ineffective at assessing the toxicity of comments in which users have purposely circumvented language filters through word substitution, was supported, as the algorithm just saw the indirect statements as less toxic overall rather than accurately flagging them as toxic or non toxic.

What surprised me was that the API assessed the direct non-toxic statements much less accurately than I expected it to, with only a 60% success rate. I expected it to be able to detect differences in toxicity based on tone. My theory for this low accuracy, and the API’s bias regarding language toxicity as a whole, is that it isn’t great at picking up on context. I believe that it looks at the specific words and their most common usage to determine toxicity; for example, profanity is commonly used in toxic language, so statements that include it are likely to be flagged as toxic. However, this indicates that non-toxic statements using profanity and such language could be falsely flagged as toxic, which is reflected in the results. The other side of the coin for the API focusing more on words rather than the context in which they are used is that when users find ways to use circumventive language that obscures profane words and the like, the API can easily be fooled by the lack of harsh words due to its relative inability to pick up on human context.

A question about machine learning that arises from this theory is if it is possible to teach an algorithm such as that to pick up on human context and such subtleties of language. If so, how?
