# Toxic App - DSC478 Final Project

#### Authors: *Jeffrey Bocek, Xuyang Ji & Anna-Lisa Vu*

## Project Overview 

Conversations about topics people care about can be difficult to sustain online under the threat of abuse and harassment. An unsafe online environment not only causes many individuals to refrain from expressing themselves and seeking out diverse opinions, but has also left platforms struggling to facilitate discussion effectively, leading many communities to limit or shut down user comment sections entirely.


To help foster healthier online communities, our team developed a Python application focused on comment toxicity detection. The app can be used as a third-party library or extension for social media sites or other public sites where users are allowed to leave comments. With various clustering and predictive models stored in the backend, the app lets users check the toxicity level of a specific query and revise it, helping to maintain a more respectful online community.

## Dataset Description
The provided dataset contains a large number of Wikipedia comments that have been labeled by human raters for six types of toxic behavior: toxic, severe toxic, obscene, threat, insult, and identity hate.

#### Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random
import operator #for sorting 
from sklearn import preprocessing # for normalization 
from collections import Counter #finding the majority 

#Text Preprocessing 
import re # for number removal 
import string # for punctuation removal 

import nltk 
## for stopword removal
from nltk.corpus import stopwords
nltk.download('stopwords')
stopWords= stopwords.words('english')
## lemmatization 
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('punkt')
lemmatizer= WordNetLemmatizer()
from nltk.tokenize import word_tokenize

import pickle  #save variables to file

import matplotlib.gridspec as gridspec
import matplotlib.cm as cm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

In [None]:
train_df = pd.read_csv('train.csv')
test_comments = pd.read_csv('test.csv')
test_lab= pd.read_csv('test_labels.csv',header=0, na_values=-1)
train_df.describe()

In [None]:
test_df = pd.merge(test_comments, test_lab, on="id")
test_df.head(3)

### Comment Length 
Looking at the histogram of comment lengths for clean and toxic comments, clean comments tend to be approximately one-fourth longer than toxic comments on average. Examining random samples shows that many clean comments are long, well-crafted responses.

In [None]:
comm= train_df[train_df.iloc[:,2:].sum(axis=1)==0]["comment_text"].str.split().apply(len)
toxic_comm= train_df[train_df.iloc[:,2:].sum(axis=1)!=0]["comment_text"].str.split().apply(len)
plt.hist([comm, toxic_comm], bins = 40, color =('c','m'), label=("Clean Comment", "Toxic Comment"))
plt.legend(loc='best')
plt.title("Distribution for Comment Lengths")
plt.xlim(0,600)
plt.xlabel("Word Counts")
plt.ylabel("Comment Count")
plt.show()
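
The length difference can also be checked numerically rather than read off the histogram. The following is a minimal sketch on a toy frame (the `toy` data and its two label columns are illustrative assumptions, not the real `train_df`); it compares mean word counts for clean and toxic comments.

```python
import pandas as pd

# Toy stand-in for train_df; the real frame has six label columns.
toy = pd.DataFrame({
    "comment_text": [
        "thanks for the detailed and well sourced explanation",
        "you are an idiot",
        "please see the talk page before reverting again",
        "shut up",
    ],
    "toxic": [0, 1, 0, 1],
    "insult": [0, 1, 0, 0],
})

label_cols = ["toxic", "insult"]
word_counts = toy["comment_text"].str.split().str.len()
is_clean = toy[label_cols].sum(axis=1) == 0  # no label set means clean

mean_clean = word_counts[is_clean].mean()
mean_toxic = word_counts[~is_clean].mean()
print(f"mean words (clean): {mean_clean:.1f}")
print(f"mean words (toxic): {mean_toxic:.1f}")
print(f"clean / toxic ratio: {mean_clean / mean_toxic:.2f}")
```

On the real data, `label_cols` would be the six label columns and the ratio can be compared directly against the one-fourth figure quoted above.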

#### Class Imbalance 
The training set comprises 159,571 Wikipedia comments. Each comment consists of a text input feature and six binary labels that categorize it as toxic, severe_toxic, obscene, threat, insult, or identity_hate. The following figures show how these labels are distributed throughout the dataset, including comments with multiple labels. There are no missing values in the training set; however, the extremely low mean value of each label indicates that the majority of comments are clean (non-toxic). In other words, comments are not evenly distributed across classes, and class imbalance is present. The clean comment ratio in the training set is 89.8%. The corresponding figure computed for the test set is 58.2%, although that calculation counts test rows whose labels are -1 (read as missing values) rather than rows whose labels are all zero.

The breakdown shows that while most comments carrying one of the other labels are also labeled toxic, not all of them are. Only "severe_toxic" is a strict subcategory of "toxic", so labeling errors can reasonably be ruled out as the explanation. This observation indicates that "toxic" is not an overarching label, but rather one category among several that overlaps considerably with the others. Because of this overlap, training a classifier on the individual labels in the raw dataset would likely be difficult. The ambiguity surrounding the label assignments, and the absence of clear definitions for them, is why we opted to use aggregate labels of general toxicity level, named "non_toxic", "mild_toxicity", "toxic", and "severe_toxic", as the targets going forward.

In [None]:
# Toxic vs. clean comment 
clean_comm= (train_df.iloc[:,2:].sum(axis=1)==0).sum(axis=0)
# NOTE: this counts test rows with missing labels (-1 in the CSV, read as NaN via na_values=-1)
clean_test= len(test_lab[test_lab.isna().any(axis=1)])
print("The clean comment ratio in training set is:",(round(clean_comm/len(train_df),3)))
print("The clean comment ratio in test set is:",(round(clean_test/len(test_lab),3)))
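
The test-set figure above is computed from rows that contain missing labels. A minimal sketch of separating those rows from rows that are labeled all-zero, on a toy frame (`toy_lab` and its two label columns are illustrative assumptions standing in for `test_lab`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_labels.csv after reading with na_values=-1.
toy_lab = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "toxic": [0, 1, np.nan, 0, np.nan],
    "insult": [0, 0, np.nan, 0, np.nan],
})
label_cols = ["toxic", "insult"]

# Rows with any missing label carry no usable annotation.
unlabeled = toy_lab[label_cols].isna().any(axis=1)
labeled = toy_lab[~unlabeled]

# Among labeled rows, a comment is clean when every label is 0.
clean = labeled[label_cols].sum(axis=1) == 0

unlabeled_share = unlabeled.mean()
clean_share = clean.mean()
print("share of rows without labels:", round(unlabeled_share, 3))
print("clean ratio among labeled rows:", round(clean_share, 3))
```

Reporting the two shares separately keeps the test-set clean ratio comparable to the training-set one.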

In [None]:
#Ratio of comments in each toxic class

categories= list(train_df.columns.values[2:])
                 
counts=[]
for category in categories:
    count=train_df[category].sum()
    ratio= round(count/len(train_df),3)
    counts.append((category,count,ratio))
category_stat = pd.DataFrame(counts, columns=["Class","Counts","Percentage"])
category_stat
ax= sns.barplot(x="Class",y="Counts",data=category_stat)
ax.bar_label(ax.containers[0])
plt.title("Comments in Each Toxic Class")
plt.ylabel('Comment Counts')
plt.xlabel('Comment Class ')
ax.tick_params(labelsize=7)

In [None]:
#Comments belong to multiple classes by getting rowSums 
multiClass_comm= train_df.iloc[:,2:].sum(axis=1).value_counts()
multiClass= multiClass_comm[multiClass_comm.index>1]
ax=sns.barplot(x=multiClass.index, y=multiClass.values)
ax.bar_label(ax.containers[0])
plt.title("Comments Belong to Multiple Classes")
plt.ylabel('Comment Counts')
plt.xlabel('Classes Counts')
ax.tick_params(labelsize=7)
plt.show()
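
The statement that only "severe_toxic" is fully contained in "toxic" can be verified directly. The sketch below uses toy data and a hypothetical helper, `share_also_toxic`, to compute the fraction of rows with a given label that are also marked toxic; a value of 1.0 indicates a strict subcategory.

```python
import pandas as pd

# Toy label matrix; the real check would run on train_df.
toy = pd.DataFrame({
    "toxic":        [1, 1, 0, 0, 1],
    "severe_toxic": [1, 0, 0, 0, 0],
    "insult":       [1, 0, 1, 0, 1],
})

def share_also_toxic(df, label):
    """Fraction of rows carrying `label` that are also marked toxic."""
    rows = df[df[label] == 1]
    return (rows["toxic"] == 1).mean()

severe_share = share_also_toxic(toy, "severe_toxic")
insult_share = share_also_toxic(toy, "insult")
print("severe_toxic also toxic:", severe_share)
print("insult also toxic:", round(insult_share, 3))
```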

### Data Preprocessing 

#### I. Class Resampling
As the first step toward ensuring data quality, the reconfigure_categories function transforms the multi-labeled dataset into four mutually exclusive classes. Because toxic and severe_toxic comments represent different degrees of the same behavior, we first set the toxic column to 1 for every severe_toxic comment, and later remove the severe_toxic comments from the toxic class. The newly created non_toxic column marks comments whose target columns are all 0. Since labels are nested under each other in some cases, comments with a specific toxicity type (insult, obscene, identity_hate, or threat) are flagged in a new column called toxicity_defined. Finally, the new mild_toxicity column marks comments that have a toxicity type defined but are not labeled toxic or severe_toxic. The key assumption is that there is no significant correlation among the following levels.
- Level 1: non_toxic comment
- Level 2: mild_toxicity comment
- Level 3: toxic comment
- Level 4: severe_toxic comment

In [None]:
pd.options.mode.chained_assignment = None
def reconfigure_categories(train_df, test_lab):
    'add categories to df'

    def plot_df_distributions(df, position, chart_title):
        'make a bar chart of df distributions'
        target_column =list(df.columns[position:-1])
        labelC = df[target_column].sum()
        plt.figure(figsize=(10,7))
        ax = sns.barplot(x=labelC.index, y=labelC.values,dodge=False)
        ax.set_yscale('log')
        ax.tick_params(labelsize=7)
        plt.title(chart_title)
        for i in ax.containers:
            ax.bar_label(i,)
    
    def config_train(train_df, position):
        target_col = list(train_df.columns[position:])
        train_df.loc[train_df['severe_toxic']==1,'toxic'] =1
        train_df['non_toxic'] = 1-train_df[target_col].max(axis=1)
        train_df['toxicity_defined'] = train_df[['insult', 'obscene', 'identity_hate', 'threat']].max(axis=1)
        train_df['toxicity_undefined'] = 0  # initializing all values to 0
        # for each comment that's not toxicity defined but labelled as toxic, labeled them as 1 in new column
        train_df.loc[(train_df['toxicity_defined'] == 0) & (train_df['toxic'] == 1), 'toxicity_undefined'] = 1
        train_df['mild_toxicity'] = 0
        train_df.loc[(train_df['toxicity_defined'] == 1) & (train_df['toxic'] == 0) & (train_df['severe_toxic'] == 0),
                'mild_toxicity'] = 1

        train_df['toxicity_level']=0 #initializing a column of 0
        train_df.loc[(train_df['non_toxic']==1),'toxicity_level']=1
        train_df.loc[(train_df['mild_toxicity']==1),'toxicity_level']=2
        train_df.loc[(train_df['toxic']==1)& (train_df['severe_toxic']==0),'toxicity_level']=3
        train_df.loc[(train_df['severe_toxic']==1), 'toxicity_level']=4

        #drop rows with toxicity level undefined
        train_df= train_df[train_df['toxicity_level']!= 0]

        #make sure the toxic comments does not include severe toxic comments
        train_df.loc[train_df['severe_toxic'] == 1, 'toxic'] = 0

        #train_df = train_df[train_df.toxicity_level !=2]
        #train_df['toxicity_level'] = train_df['toxicity_level'].replace([3,4], [2,3])
        #train_df = train_df.drop(columns=['mild_toxicity'])
        #print(train_df['toxicity_level'].value_counts())

        return train_df

    def config_test_lab(test_lab, position):
        test_lab = test_lab[~test_lab.isnull().any(axis=1)]
        target_col = list(test_lab.columns[position:])
        #fill toxic column of severe toxic comments with 1 
        test_lab.loc[test_lab['severe_toxic']==1,'toxic'] =1

        test_lab['non_toxic'] = 1-test_lab[target_col].max(axis=1)
        test_lab['toxicity_defined'] = test_lab[['insult', 'obscene', 'identity_hate', 'threat']].max(axis=1) 
        test_lab['toxicity_undefined'] = 0  # initializing all values to 0
        test_lab.loc[(test_lab['toxicity_defined'] == 0) & (test_lab['toxic'] == 1), 'toxicity_undefined'] = 1
        test_lab['mild_toxicity'] = 0
        test_lab.loc[(test_lab['toxicity_defined'] == 1) & (test_lab['toxic'] == 0) & (test_lab['severe_toxic'] == 0),
                'mild_toxicity'] = 1

        test_lab['toxicity_level']=0 #initializing a column of 0
        test_lab.loc[(test_lab['non_toxic']==1),'toxicity_level']=1
        test_lab.loc[(test_lab['mild_toxicity']==1),'toxicity_level']=2
        test_lab.loc[(test_lab['toxic']==1)& (test_lab['severe_toxic']==0),'toxicity_level']=3
        test_lab.loc[(test_lab['severe_toxic']==1), 'toxicity_level']=4

        test_lab= test_lab[test_lab['toxicity_level']!= 0]

        #make sure the toxic comments does not include severe toxic comments
        test_lab.loc[test_lab['severe_toxic'] == 1, 'toxic'] = 0

        #test_lab = test_lab[test_lab.toxicity_level !=2]
        #test_lab['toxicity_level'] = test_lab['toxicity_level'].replace([3,4], [2,3])
        #test_lab = test_lab.drop(columns=['mild_toxicity'])
        #print(test_lab['toxicity_level'].value_counts())
        return test_lab
 
    train_df = config_train(train_df, 2)
    test_lab = config_test_lab(test_lab, 2)

    plot_df_distributions(train_df, 2, "Training Data Class Distribution")
    plot_df_distributions(test_lab, 2, "Test Data Class Distribution")
    
    return train_df, test_lab

In [None]:
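# test_df (comments merged with labels) is reconfigured in place of the raw label frame,
# so the later cells that read the new columns from test_df can find them
train_df, test_df = reconfigure_categories(train_df, test_df)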
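
The level assignment inside reconfigure_categories can be condensed into a single ordered rule list. The following is a sketch on toy rows, not the notebook's implementation; it uses `np.select`, which applies the first matching condition, so the order severe_toxic, toxic, typed-only reproduces levels 4, 3, and 2, with level 1 as the default.

```python
import numpy as np
import pandas as pd

# One toy row per expected level (1 through 4).
toy = pd.DataFrame({
    "toxic":         [0, 0, 1, 1],
    "severe_toxic":  [0, 0, 0, 1],
    "obscene":       [0, 1, 1, 1],
    "threat":        [0, 0, 0, 0],
    "insult":        [0, 0, 0, 1],
    "identity_hate": [0, 0, 0, 0],
})

# A comment is "typed" when at least one specific toxicity type is set.
typed = toy[["obscene", "threat", "insult", "identity_hate"]].max(axis=1) == 1

conditions = [
    toy["severe_toxic"] == 1,  # level 4
    toy["toxic"] == 1,         # level 3 (severe rows were already matched)
    typed,                     # level 2: typed but neither toxic nor severe
]
toy["toxicity_level"] = np.select(conditions, [4, 3, 2], default=1)
print(toy["toxicity_level"].tolist())
```

Running a handful of hand-built rows like these through the real function is a quick way to confirm that each label combination lands in the intended level.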

Because the non_toxic class significantly outnumbers the other classes, feeding the four-class dataset directly to classifiers can bias them in favor of the majority class. Hence, we resampled the dataset by downsampling the abundant classes (non_toxic and, to a lesser degree, toxic) while keeping all rows in the minority classes. Doing so also significantly improves run time.

In [None]:
def get_subsample4(data, nontox_frac, toxic_frac):
    
    data=data[['comment_text', 'non_toxic','mild_toxicity','toxic','severe_toxic','toxicity_level']]
    non_toxic= data[data['toxicity_level']==1].sample(frac=nontox_frac,random_state=961)
    toxic= data[data['toxicity_level']==3].sample(frac=toxic_frac,random_state=961)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    dataB= pd.concat([non_toxic, toxic])
    data= pd.concat([dataB, data[(data['toxicity_level']==2)|(data['toxicity_level']==4)]])
    return data

In [None]:
# the fractions below correspond to each toxicity level.  They can be adjusted accordingly.
train_df4 = get_subsample4(train_df, 0.01, .10)
test_df4 = get_subsample4(test_df, 0.01, .10)
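
The same per-class downsampling can be written so that the fractions are passed as a mapping instead of positional arguments. This is a sketch under stated assumptions: `subsample_by_level` is a hypothetical helper, not part of the notebook, and the toy frame only mimics the `toxicity_level` column.

```python
import pandas as pd

def subsample_by_level(df, fractions, level_col="toxicity_level", seed=961):
    """Sample each level at its own fraction; levels not listed are kept whole."""
    parts = []
    for level, group in df.groupby(level_col):
        frac = fractions.get(level, 1.0)
        parts.append(group.sample(frac=frac, random_state=seed))
    return pd.concat(parts)

# Toy frame: 100 non_toxic (1), 20 toxic (3), 10 severe_toxic (4) rows.
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(130)],
    "toxicity_level": [1] * 100 + [3] * 20 + [4] * 10,
})

sampled = subsample_by_level(toy, {1: 0.1, 3: 0.5})
counts = sampled["toxicity_level"].value_counts().to_dict()
print(counts)
```

Keeping the fractions in one dictionary makes it easier to record which sampling rates produced a given stored dataset.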

In [None]:
train_df4=train_df[['comment_text','non_toxic','mild_toxicity','toxic','severe_toxic','toxicity_level']]
nonToxic= train_df4[train_df4['toxicity_level']==1].sample(frac=0.01,random_state=961)
toxic = train_df4[train_df4['toxicity_level']==3].sample(frac=0.1,random_state=961)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
trainB= pd.concat([nonToxic, toxic])
train_df4= pd.concat([trainB, train_df4[(train_df4['toxicity_level']==2)|(train_df4['toxicity_level']==4)]])
targetB= list(train_df4.columns.drop(['comment_text','toxicity_level']))
labelB= train_df4[targetB].sum()
plt.figure(figsize=(5,5))
ax = sns.barplot(x=labelB.index, y=labelB.values)
ax.tick_params(labelsize=7)

#### II. Text Preprocessing 

The diversity and sheer volume of social media comments make it difficult to capture the underlying trends and characteristics of comments at different toxicity levels. In particular, keeping every word makes the dimensionality of each text extremely high, which makes classification more challenging. Properly preprocessing the data to reduce noise in the text may improve classifier performance and speed up classification, thereby facilitating real-time sentiment analysis. Hence, the text preprocessing pipeline below consists of several stages: punctuation removal, lowercasing, number removal, whitespace stripping, tokenization, stop-word removal, and lemmatization.

In [None]:
#Removing punctuations'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
def remove_punc(comment):
    nonPunc="".join([letter for letter in comment if letter not in string.punctuation])
    return nonPunc

#Lowering the text
def toLower(comment):
    return comment.lower()

#Removing numbers 
def replace_numbers(comment):
    """Remove every run of digits from the comment string."""
    return re.sub(r'\d+', '', comment)

#Removing whitespaces
def remove_space(comment):
    return comment.strip()

#Tokenization
def text2word(comment):
    return word_tokenize(comment)

#Removing Stop words
def remove_stopW(words,stopWords):
    return [word for word in words if word not in stopWords]

#Lemmatization
def lematizer(words):
    lemmatizer = WordNetLemmatizer()
    lemm_comm= [lemmatizer.lemmatize(word) for word in words]
    return lemm_comm

def lematizer_verb(words):
    lemmatizer = WordNetLemmatizer()
    lemm_comm= [lemmatizer.lemmatize(word,"v") for word in words]
    return lemm_comm


def clean_comment(comment):
    comment= remove_punc(comment)
    comment= toLower(comment)
    comment= replace_numbers(comment)
    comment= remove_space(comment)
    words=text2word(comment)
    words=remove_stopW(words,stopWords)
    words=lematizer(words)
    words=lematizer_verb(words)
    
    return ' '.join(words)

In [None]:
train_df4['comment_text'] = train_df4['comment_text'].apply(lambda x: clean_comment(x))
test_df['comment_text'] = test_df['comment_text'].apply(lambda x: clean_comment(x))
train_df4= train_df4.reset_index(drop=True)
train_df4.head(5)
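
To see what the pipeline does to a single comment without downloading any NLTK resources, the following simplified sketch reproduces the punctuation, case, digit, and stop-word steps using only the standard library. It is an approximation: `simple_clean` and `TOY_STOPWORDS` are hypothetical, a whitespace split stands in for `word_tokenize`, and lemmatization is omitted.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "you"}

def simple_clean(comment, stop_words=TOY_STOPWORDS):
    """Punctuation, case, digit, and stop-word handling only (no lemmatization)."""
    comment = comment.translate(str.maketrans("", "", string.punctuation))
    comment = comment.lower()
    comment = re.sub(r"\d+", "", comment)
    words = comment.split()  # whitespace split stands in for word_tokenize
    return " ".join(w for w in words if w not in stop_words)

cleaned = simple_clean("You ARE the 2nd person to revert this, STOP!!!")
print(cleaned)
```

The output contains the fragment "nd", left over from "2nd" once the digit is removed. Digit removal in the full pipeline works the same way, so similar fragments can be expected in the cleaned comments.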

#### III. Data Quality Assessment 
In the final step of data preprocessing, we examined the data quality of the four classes using a Bag of Words model, where each row represents a specific text in the corpus and each column represents a word in the vocabulary. Based on the following plot of top terms per toxicity level, there is no significant difference between non_toxic and mild_toxicity. Hence, given the small size of the mild_toxicity class and its low level of uniqueness, we decided to drop mild_toxicity to improve model quality.

In [None]:
def topNwords(data,topN=5):
    vec= CountVectorizer(stop_words='english',analyzer='word').fit(data)
    bagOfW= vec.transform(data)
    sumWord= bagOfW.sum(axis=0)
    wordFreq= [(word,sumWord[0,idx])for word, idx in vec.vocabulary_.items()]
    wordFreq= sorted(wordFreq,key=lambda x:x[1], reverse=True)
    topTerm= pd.DataFrame(wordFreq[:topN], columns=["Term","Count"])
    return topTerm

def topTermByClass(data, topN=10):
    ''' Expects data with two columns: comment_text and toxicity_level.
    Plots the top terms for each toxicity level and returns a list of dataframes,
    one per level, each containing the topN terms and their raw counts.'''
    dfs=[]
    for level in np.unique(data['toxicity_level']):
        idx= data['toxicity_level'].index[data['toxicity_level']==level]
        topTerm= topNwords(data.loc[idx]['comment_text'],topN=topN)
        dfs.append(topTerm)
    #print("DFS:"+str(dfs))
    type=['non_toxic', 'mild_toxicity', 'toxic','severe_toxic']
    plt.figure(figsize=(10,8))
    plt.suptitle("Top Terms Per Toxicity Level",fontsize=10)
    plt.rcParams["figure.autolayout"] = True
    gridspec.GridSpec(2,2)
    for termClass in range(2):
        plt.subplot2grid((2,2),(0,termClass))
        sns.barplot(x=dfs[termClass]['Term'],y=dfs[termClass]['Count'])
        plt.title(("Toxicity Class:{"+type[termClass]+"}"),fontsize=8)
        plt.xlabel('Word', fontsize=7)
        plt.xticks(fontsize=7)
        plt.yticks(fontsize=7)
        plt.ylabel('Count', fontsize=7)
    row2Ind= 2
    #print("dfs[row2Ind]['Term']"+str(dfs[row2Ind]['Term']))
    #print("dfs[row2Ind]['Count']"+str(dfs[row2Ind]['Count']))
    for row2 in range(2):
        plt.subplot2grid((2,2),(1,row2))
        sns.barplot(x=dfs[row2Ind]['Term'],y=dfs[row2Ind]['Count'])
        plt.title(("Toxicity Class: {"+type[row2Ind]+"}"),fontsize=8)
        plt.xlabel('Word', fontsize=7)
        plt.xticks(fontsize=7)
        plt.yticks(fontsize=7)
        plt.ylabel('Count', fontsize=7)
        row2Ind +=1
    plt.show()
    return dfs

In [None]:
trainSub= train_df4[['comment_text','toxicity_level']]
x= topTermByClass(trainSub,10)
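
The idea behind the per-class term counts can be illustrated without scikit-learn. The sketch below builds a bag-of-words count per toxicity level with `collections.Counter` on toy, already-cleaned comments (the `rows` data is an illustrative assumption); `CountVectorizer` in the cell above does the equivalent work on the real data.

```python
from collections import Counter, defaultdict

# Toy (cleaned comment, toxicity_level) pairs.
rows = [
    ("thanks helpful edit", 1),
    ("helpful source thanks", 1),
    ("stupid edit stupid", 3),
    ("stupid idiot", 3),
]

# One Counter per level: term -> number of occurrences in that level.
term_counts = defaultdict(Counter)
for text, level in rows:
    term_counts[level].update(text.split())

top_terms = {level: counts.most_common(2) for level, counts in term_counts.items()}
for level, terms in top_terms.items():
    print(level, terms)
```

Two classes whose top-term lists largely coincide, as reported above for non_toxic and mild_toxicity, give a count-based classifier little to separate them with.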


#### IV. Final Resampling

In [None]:
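def removeMild(data):
    'drop mild_toxicity rows and renumber the remaining levels to 1-3'
    data = data[data.toxicity_level !=2].copy()
    # relabel using the frame passed in, not the global train_df
    data['toxicity_level'] = data['toxicity_level'].replace([3,4], [2,3])
    data = data.drop(columns=['mild_toxicity'])       
    data=data[['comment_text', 'non_toxic','toxic','severe_toxic','toxicity_level']]
    return data

In [None]:
train_df = removeMild(train_df)
test_df = removeMild(test_df)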

In [None]:
def get_subsample(data, nontox_frac, toxic_frac, severetoxic_frac, chart_title):
    
    data=data[['comment_text', 'non_toxic','toxic','severe_toxic','toxicity_level']]

    non_toxic= data[data['toxicity_level']==1].sample(frac=nontox_frac,random_state=961)
    #print("Non toxic shape:"+str(nonToxic.shape))

    toxic= data[data['toxicity_level']==2].sample(frac=toxic_frac,random_state=961)
    #print("Toxic shape:"+str(nonToxic.shape))

    severe_toxic = data[data['toxicity_level']==3].sample(frac=severetoxic_frac,random_state=961)
    #print("Sever Toxic shape:"+str(toxic.shape))

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    data = pd.concat([non_toxic, toxic, severe_toxic])
    print(data[['toxicity_level']].value_counts())

    #make sure the toxic comments does not include severe toxic comments
    data.loc[data['severe_toxic'] == 1, 'toxic'] = 0

    target= list(data.columns.drop(['comment_text','toxicity_level']))
    label= data[target].sum()
    plt.figure(figsize=(5,5))
    ax = sns.barplot(x=label.index, y=label.values)
    ax.tick_params(labelsize=7)
    plt.title(chart_title)
    for i in ax.containers:
        ax.bar_label(i,)

    return data

In [None]:
def clean_comments(train_df, test_df):
    'clean comments in dataframes'

    train_df['comment_text'] = train_df['comment_text'].apply(lambda x: clean_comment(x))
    test_df['comment_text'] = test_df['comment_text'].apply(lambda x: clean_comment(x))
    return train_df, test_df

In [None]:
def reset_index(df):
    'reset index of df'

    df= df.reset_index(drop=True)
    return df

In [None]:
def store_dfs(file_name, dfs):
    'store dataframes to file'

    clean_data = dfs
    with open(file_name, 'wb') as my_file_object:
        pickle.dump(clean_data, my_file_object)

### Final Set of Steps to Clean Data and Reduce Data Sets

In [None]:
# the fractions below correspond to each toxicity level.  They can be adjusted accordingly.
train_df = get_subsample(train_df, 0.01, .10, 1.0, "Distribution of Training Data")
test_df = get_subsample(test_df, 0.01, .10, 1.0, "Distribution of Testing Data")

print('Train df shape = ' +str(train_df.shape))
print('Test df shape = ' +str(test_df.shape))

train_df, test_df = clean_comments(train_df, test_df)
#test
#train_df["comment_text"].iloc[2]

# reset_index returns a new frame, so the result must be assigned back
train_df = reset_index(train_df)
test_df = reset_index(test_df)

#can input file name to correspond with percent of distributions
store_dfs('clean_data1.p', [train_df, test_df])
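
A later notebook can read the stored frames back with the mirror-image call. The sketch below is illustrative only: `load_dfs` is a hypothetical helper that is not defined in this notebook, and plain lists stand in for `[train_df, test_df]`, since DataFrames are pickled and unpickled the same way.

```python
import os
import pickle
import tempfile

def load_dfs(file_name):
    """Read back whatever object was written with pickle.dump."""
    with open(file_name, "rb") as fh:
        return pickle.load(fh)

# Plain lists stand in for [train_df, test_df].
payload = [["train row"], ["test row"]]
path = os.path.join(tempfile.mkdtemp(), "clean_data_demo.p")
with open(path, "wb") as fh:
    pickle.dump(payload, fh)

# Unpack in the same order the objects were stored.
train_part, test_part = load_dfs(path)
print(train_part, test_part)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.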


In [None]:
train_df["comment_text"].iloc[2]

In [None]:
test_df["comment_text"].iloc[2]

In [None]:
test_lab.head()