# Capstone Project - Tan Kelvin (TP063098)

# ParlAI Dialogue Safety Model with Emoticons and Internet Slangs Translation

# Research Questions
1. What limitations do current state-of-the-art natural language processing tools face in handling toxic comments?
2. How could translating emoticons and Internet slang improve the performance of the ParlAI Dialogue Safety model on the Wikipedia Toxic Comments and ChatEval Twitter datasets?
3. How could the proposed classification model improve the performance of the existing ParlAI Twitter model on the ChatEval Twitter dataset?

# Sampling

## Wikipedia Toxic Comments
Obtained from https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

In [1]:
import pandas as pd

In [2]:
wiki_train_df = pd.read_csv('wikipedia-toxic-comment-train.csv')
wiki_test_df = pd.read_csv('wikipidea-toxic-comment-test.csv')

In [4]:
wiki_train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [5]:
wiki_test_df.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


## ChatEval Twitter
Obtained from https://chateval.org/

In [18]:
import csv

# ChatEval's dataset comes as a .txt file, so it needs to be converted to CSV first.
# sep='\0' is used as a separator that should not occur in the tweets,
# so each line is read into a single column.
chateval_txt = pd.read_csv("twitter.txt", header=None, sep='\0')
chateval_txt.columns = ['tweet']
chateval_txt.to_csv('chat_eval_twitter.csv', index=False)

In [19]:
chateval_tweet_df = pd.read_csv('chat_eval_twitter.csv')
chateval_tweet_df.head()

Unnamed: 0,tweet
0,when you find so you decide to stay in her cla...
1,not a single reporter noticed that hillary's e...
2,jim schwartz has faced 2 teams as defensive co...
3,go gettum jared!
4,arsenal didn't deserve to win either so?
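
The conversion above relies on a separator that never appears in the data. As a minimal sketch of an alternative, the standard `csv` module can write one row per line directly. The function name `txt_to_csv` and its parameters are hypothetical, and skipping empty lines is an assumption about the desired behaviour.

```python
import csv

def txt_to_csv(txt_path, csv_path, column="tweet"):
    """Write each non-empty line of txt_path as one row of a single-column CSV."""
    with open(txt_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow([column])          # header row
        for line in src:
            line = line.rstrip("\n")
            if line:                       # assumption: blank lines carry no tweet
                writer.writerow([line])
```

Calling `txt_to_csv("twitter.txt", "chat_eval_twitter.csv")` would then produce a file that `pd.read_csv` can load as before; the `csv` writer takes care of quoting lines that contain commas or quotation marks.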


# Exploring

## Statistical exploration

In [23]:
import re

def count_words(string):
    """Use the re module to count the number of words in a string."""
    return len(re.findall(r'\w+', string))
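
Because `\w+` stops at an apostrophe, a contraction such as "don't" is counted as two words. The sketch below illustrates this by comparing the same regular expression with a plain whitespace split; `count_whitespace_tokens` is a hypothetical helper added only for the comparison.

```python
import re

def count_words(string):
    # Same definition as in the notebook: runs of word characters.
    return len(re.findall(r'\w+', string))

def count_whitespace_tokens(string):
    # Hypothetical alternative: count whitespace-separated tokens instead.
    return len(string.split())

sample = "I don't anonymously edit articles at all."
print(count_words(sample))              # 8 ("don" and "t" are counted separately)
print(count_whitespace_tokens(sample))  # 7
```

Which count is preferable depends on whether contractions are expanded before counting.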

In [24]:
wiki_train_df['words_num'] = wiki_train_df['comment_text'].apply(count_words)
wiki_train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,words_num
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,50
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,20
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,44
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,114
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,14


In [26]:
wiki_test_df['words_num'] = wiki_test_df['comment_text'].apply(count_words)
wiki_test_df.head()

Unnamed: 0,id,comment_text,words_num
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,75
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,10
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",5
3,00017563c3f7919a,":If you have a look back at the source, the in...",39
4,00017695ad8997eb,I don't anonymously edit articles at all.,8


In [28]:
chateval_tweet_df['words_num'] = chateval_tweet_df['tweet'].apply(count_words)
chateval_tweet_df.head()

Unnamed: 0,tweet,words_num
0,when you find so you decide to stay in her cla...,11
1,not a single reporter noticed that hillary's e...,22
2,jim schwartz has faced 2 teams as defensive co...,18
3,go gettum jared!,3
4,arsenal didn't deserve to win either so?,8


In [31]:
from lexical_diversity import lex_div as ld

def count_ld(string):
    """Use the lexical_diversity module to compute the type-token ratio (TTR) of a string."""
    flt = ld.flemmatize(string)
    return ld.ttr(flt)
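
The type-token ratio is the number of distinct tokens divided by the total number of tokens. A minimal sketch of the idea without the library is shown below; `simple_ttr` is a hypothetical helper that lowercases and splits on whitespace and does no lemmatization, so its values will not match `count_ld` exactly.

```python
def simple_ttr(text):
    """Type-token ratio over lowercased whitespace tokens (simplified, no lemmatization)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # assumption: treat empty text as zero diversity
    return len(set(tokens)) / len(tokens)

print(simple_ttr("go gettum jared!"))           # 3 distinct / 3 tokens
print(simple_ttr("the cat saw the other cat"))  # 4 distinct / 6 tokens
```

Short texts rarely repeat a word, so their ratio tends towards 1.0. This should be kept in mind when comparing tweets with longer Wikipedia comments.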

In [32]:
wiki_train_df['lex_div'] = wiki_train_df['comment_text'].apply(count_ld)
wiki_train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,words_num,lex_div
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,50,0.953488
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,20,1.0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,44,0.880952
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,114,0.684685
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,14,0.923077


In [33]:
wiki_test_df['lex_div'] = wiki_test_df['comment_text'].apply(count_ld)
wiki_test_df.head()

Unnamed: 0,id,comment_text,words_num,lex_div
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,75,0.805556
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,10,0.833333
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",5,0.818182
3,00017563c3f7919a,":If you have a look back at the source, the in...",39,0.710526
4,00017695ad8997eb,I don't anonymously edit articles at all.,8,1.0


In [34]:
chateval_tweet_df['lex_div'] = chateval_tweet_df['tweet'].apply(count_ld)
chateval_tweet_df.head()

Unnamed: 0,tweet,words_num,lex_div
0,when you find so you decide to stay in her cla...,11,0.916667
1,not a single reporter noticed that hillary's e...,22,0.9
2,jim schwartz has faced 2 teams as defensive co...,18,0.941176
3,go gettum jared!,3,1.0
4,arsenal didn't deserve to win either so?,8,1.0


In [71]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Add the spacymoji component at the start of the spaCy pipeline
nlp.add_pipe("emoji", first=True)

def count_emoji(text):
    """Return the number of emoji that spacymoji finds in a string (0 if there are none)."""
    doc = nlp(text)
    return len(doc._.emoji)

print(count_emoji("🏴 I will display 😜 and 😀"))

3
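
As a rough, dependency-free cross-check of the count above, the sketch below counts characters in the Unicode category `So` (Symbol, other). This is a heuristic and not what spacymoji does: it also counts symbols such as © or ™, and it counts each regional-indicator character of a flag separately. The name `count_symbol_chars` is hypothetical.

```python
import unicodedata

def count_symbol_chars(text):
    """Rough emoji count: characters whose Unicode category is 'So' (Symbol, other)."""
    return sum(1 for ch in text if unicodedata.category(ch) == "So")

# Same sentence as in the cell above, with the emoji written as escapes:
# U+1F3F4 (black flag), U+1F61C (winking face with tongue), U+1F600 (grinning face)
print(count_symbol_chars("\U0001F3F4 I will display \U0001F61C and \U0001F600"))  # 3
```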


In [36]:
wiki_train_df['emojis_num'] = wiki_train_df['comment_text'].apply(count_emoji)
wiki_train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,words_num,lex_div,emojis_num
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0,50,0.953488,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0,20,1.0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0,44,0.880952,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0,114,0.684685,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0,14,0.923077,0


In [37]:
wiki_test_df['emojis_num'] = wiki_test_df['comment_text'].apply(count_emoji)
wiki_test_df.head()

Unnamed: 0,id,comment_text,words_num,lex_div,emojis_num
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,75,0.805556,0
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,10,0.833333,0
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",5,0.818182,0
3,00017563c3f7919a,":If you have a look back at the source, the in...",39,0.710526,0
4,00017695ad8997eb,I don't anonymously edit articles at all.,8,1.0,0


In [38]:
chateval_tweet_df['emojis_num'] = chateval_tweet_df['tweet'].apply(count_emoji)
chateval_tweet_df.head()

Unnamed: 0,tweet,words_num,lex_div,emojis_num
0,when you find so you decide to stay in her cla...,11,0.916667,2
1,not a single reporter noticed that hillary's e...,22,0.9,0
2,jim schwartz has faced 2 teams as defensive co...,18,0.941176,0
3,go gettum jared!,3,1.0,0
4,arsenal didn't deserve to win either so?,8,1.0,0


In [69]:
import re

def decontract(phrase):
    """Use the re module to substitute contractions with their full forms."""
    # specific cases, handled before the general n't rule
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general rules
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # note: this also rewrites possessives, e.g. "hillary's" becomes "hillary is"
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

print(decontract("Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."))
print(decontract("Oh no he didn't. I can't and I won't. I'll know what I'm gonna do."))

Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.
Oh no he did not. I can not and I will not. I will know what I am gonna do.
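
The general `'s` rule cannot tell "that's" from a possessive such as "hillary's". One possible refinement, sketched below under the assumption that a short list of host words is acceptable, expands `'s` only after words where it almost always means "is". The list `IS_HOSTS` and the function `expand_is` are hypothetical and not exhaustive.

```python
import re

# Assumed, non-exhaustive list of words after which 's usually means "is".
IS_HOSTS = ("it", "that", "what", "how", "he", "she", "there", "here", "who", "where")
IS_PATTERN = re.compile(r"\b(" + "|".join(IS_HOSTS) + r")'s\b", flags=re.IGNORECASE)

def expand_is(phrase):
    """Expand 's to ' is' only after the host words listed above."""
    return IS_PATTERN.sub(r"\1 is", phrase)

print(expand_is("That's interesting, but hillary's emails are not."))
# That is interesting, but hillary's emails are not.
```

Possessives are left untouched, at the cost of missing contractions after words that are not in the list.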


In [114]:
import json

# downloadslangs.py has already scraped the Internet slang website into ShortendText.json
with open('ShortendText.json') as slangs_json:
    slangs_dict = json.load(slangs_json)

def replace_slang(phrase):
    """
    Replace each whitespace-separated token found in slangs_dict with its meaning.
    Tokens with attached punctuation (e.g. "lol,") are not matched yet.
    """
    return ' '.join(slangs_dict.get(word, word) for word in phrase.split())

print(replace_slang("brb lol I dunno so bb"))
print(replace_slang("brb lol, I dunno so bb"))

be right back laughing out loud I I don't know so bye bye
be right back lol, I I don't know so bye bye
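
The second output shows that "lol," is left untranslated because the comma is part of the token. A minimal sketch of one way to handle this is to look up only the word part of each token with `re.sub` and leave the punctuation in place. The dictionary `sample_slangs` is a tiny stand-in for `slangs_dict` (the real entries come from ShortendText.json), and `replace_slang_tokens` is a hypothetical name.

```python
import re

# Tiny stand-in for slangs_dict, for illustration only.
sample_slangs = {"brb": "be right back", "lol": "laughing out loud", "bb": "bye bye"}

# A word is a run of letters or digits, optionally with internal apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")

def replace_slang_tokens(phrase, slangs):
    """Replace every word found in `slangs`, leaving surrounding punctuation untouched."""
    def lookup(match):
        word = match.group(0)
        return slangs.get(word, word)
    return WORD_PATTERN.sub(lookup, phrase)

print(replace_slang_tokens("brb lol, I dunno so bb!", sample_slangs))
# be right back laughing out loud, I dunno so bye bye!
```

Like the original function, the lookup is case-sensitive, so "LOL" would need to be lowercased first if the dictionary keys are lowercase.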
