# Toxic Comment Classification

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

The dataset can be found on Kaggle:

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

# Libraries

In [76]:
import pandas as pd
import numpy as np
import re
import string
import itertools as it

import nltk
from nltk.corpus import stopwords                  # module for stop words that come with NLTK
from nltk.stem.wordnet import WordNetLemmatizer    # module for lemmatization
from nltk import word_tokenize, pos_tag            # tokenization and Part of Speech tagging

nltk.download('stopwords') #stopwords used to preprocess the corpus

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jeremy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
stopwords_english = stopwords.words('english') # a list of English stopwords

lemmatizer = WordNetLemmatizer()               # object whose .lemmatize() returns the lemma of a word
                                               # ("rocks" => "rock"; "was" => "be" when called with pos='v')

# Import Data

## File descriptions
* train.csv - the training set, contains comments with their binary labels
* test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* sample_submission.csv - a sample submission file in the correct format
* test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)

In [15]:
train_data = pd.read_csv("Data/train.csv")
test_data = pd.read_csv('Data/test.csv')

In [16]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


In [17]:
train_data.isna().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [18]:
train_data.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


As we can see, the training dataset contains:
* the comment ID
* the raw text
* the different categories of toxicity

In [19]:
# Let's check some comments
for i in range(10):
    print(train_data['comment_text'][i])
    print('---------------')

Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
---------------
D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)
---------------
Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
---------------
"
More
I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any prefere

In [20]:
# Let's look at test.csv
test_data.head(10)

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.
5,0001ea8717f6de06,Thank you for understanding. I think very high...
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...
7,000247e83dcc1211,:Dear god this site is horrible.
8,00025358d4737918,""" \n Only a fool can believe in such numbers. ..."
9,00026d1092fe71cc,== Double Redirects == \n\n When fixing double...


Here we only have the IDs and the comments, with no labels.

# Clean the corpus

In [21]:
# List of punctuation marks we could remove (currently unused: preprocess relies on string.punctuation)
# Note that this list would keep ! in the corpus
# punc = '''()-[]{};:'"\,<>./?@#$%^&*_~'''

In [22]:
# Let's define a function that preprocesses a text

def preprocess(corpus):
    
    '''
    Lowercase each text and remove hyperlinks, punctuation, words containing numbers and stopwords.
    Input : an iterable of strings
    Output : a generator yielding one cleaned string per input text
    '''

    stop_set = set(stopwords_english)                                     # Set membership tests are faster than list lookups

    for text in corpus:

        text = text.lower()                                               # Lowercase
        text = re.sub(r'https?://[^\s\n\r]+', '', text)                   # Remove links
        text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # Remove punctuation
        text = re.sub(r'\w*\d\w*', '', text)                              # Remove words containing numbers
    
        yield ' '.join(word for word in text.split(' ') if word not in stop_set) # Yield the cleaned text

In [23]:
# tokens = [lemmatizer.lemmatize(token) for token in word_tokenize(text) if token not in stopwords_english]  # Lemmatize and transform a string into a list of tokens

In [26]:
%%time

# We save the cleaned comments in a list to be easily manipulated
clean_comments = list(preprocess(train_data['comment_text']))

Wall time: 35.7 s


In [25]:
for i in range(10):
    print(clean_comments[i])
    print('------------')

explanation
why edits made username hardcore metallica fan reverted werent vandalisms closure gas voted new york dolls fac please dont remove template talk page since im retired 
------------
daww matches background colour im seemingly stuck thanks  talk  january   utc
------------
hey man im really trying edit war guy constantly removing relevant information talking edits instead talk page seems care formatting actual info
------------

more
i cant make real suggestions improvement  wondered section statistics later subsection types accidents  think references may need tidying exact format ie date format etc later noone else first  preferences formatting style references want please let know

there appears backlog articles review guess may delay reviewer turns listed relevant form eg wikipediagoodarticlenominationstransport  
------------
sir hero chance remember page thats
------------


congratulations well use tools well  · talk 
------------
cocksucker piss around work
-----------

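Some meaningless words and typos remain, and contractions such as "dont", "im" or "werent" survive because punctuation is stripped before the stopword filter runs. The cleaning could be improved, but it is good enough for a first pass.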
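To see what each cleaning step in `preprocess` contributes, here is a small trace on a made-up comment. The sample text and the three-word stopword set are assumptions for illustration only; the notebook itself uses NLTK's full English list.

```python
import re
import string

# Hypothetical comment and a tiny stand-in stopword set (not NLTK's list)
sample = "Check https://example.com NOW!!! User123 said it's the best."
toy_stopwords = {"the", "its", "now"}

step1 = sample.lower()                                              # lowercase
step2 = re.sub(r'https?://[^\s\n\r]+', '', step1)                   # remove links
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)   # remove punctuation
step4 = re.sub(r'\w*\d\w*', '', step3)                              # remove words containing digits
cleaned = ' '.join(w for w in step4.split(' ') if w not in toy_stopwords)

for step in (step1, step2, step3, step4, cleaned):
    print(repr(step))

# Removed tokens leave empty strings behind, so runs of spaces remain;
# splitting on any whitespace collapses them
tokens = cleaned.split()
print(tokens)
```

The leftover runs of spaces explain the double spaces visible in the cleaned comments. They are harmless for `CountVectorizer`, whose default tokenizer extracts words with a regular expression and ignores whitespace.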



# Word Embeddings

In this section, we will try out several word embedding methods.

Let's start with the simplest one: bag-of-words (BOW).

The BOW method counts how often each word of a fixed vocabulary occurs in each document.  
Rather than building it from scratch, we can use scikit-learn's CountVectorizer.
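As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand with `collections.Counter` on a made-up three-document corpus. It only mimics the idea; `CountVectorizer` uses its own tokenizer and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the competition data
toy_corpus = ["the cat sat", "the cat sat on the cat", "dogs bark"]

# The vocabulary is the sorted set of all words seen in the corpus
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

# One row per document, one column per vocabulary word, each cell a word count
rows = []
for doc in toy_corpus:
    word_counts = Counter(doc.split())
    rows.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Most cells are zero even in this tiny example, which is why the real document-term matrix is stored in a sparse format.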

In [87]:
%%time

# Bag-of-words
vectorizer = CountVectorizer(min_df=3, max_df=0.9)  # Drop words present in fewer than min_df documents or in more than 90% of all documents
bow = vectorizer.fit_transform(clean_comments)      # Returns a document-term matrix (n_samples, n_features)

Wall time: 7.86 s


We have generated a sparse matrix of word counts per document, in which only a small fraction of the elements are non-zero (the *stored elements* in the representation below).  

*Note*  
It appears that using a generator (produced with `yield` in the preprocessing function *preprocess*) speeds up the preprocessing, but *fit_transform* then takes longer.
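The `min_df` and `max_df` arguments passed to `CountVectorizer` above filter the vocabulary by document frequency. The sketch below reproduces that rule by hand on a made-up corpus: an integer `min_df` is an absolute number of documents, while a float `max_df` is a proportion of the corpus. The documents are hypothetical.

```python
# Hypothetical toy corpus of 5 short documents
docs = ["spam spam ham", "spam eggs", "spam ham", "spam eggs ham", "spam tea"]
min_df, max_df = 3, 0.9   # same values as in the notebook cell

# Document frequency: in how many documents each word appears at least once
doc_freq = {}
for doc in docs:
    for word in set(doc.split()):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# Keep words that are neither too rare nor too common
kept = sorted(
    word for word, df in doc_freq.items()
    if df >= min_df and df / len(docs) <= max_df
)
print(doc_freq)
print(kept)
```

Here "spam" is dropped because it appears in every document, and "eggs" and "tea" are dropped because they appear in fewer than three.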

In [89]:
# Let's take a look at the features / vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = vectorizer.get_feature_names_out().tolist()
print(feature_names[:30])
print('------------')
print(feature_names[100:130])
print('------------')
print(feature_names[1000:1030])
print('------------')
print(feature_names[10000:10030])

['aa', 'aaa', 'aaand', 'aac', 'aachen', 'aah', 'aaliyah', 'aamir', 'aan', 'aand', 'aang', 'aap', 'aaps', 'aar', 'aardvark', 'aarem', 'aaron', 'aarons', 'aas', 'aatalk', 'aau', 'aave', 'ab', 'aba', 'aback', 'abad', 'abaddon', 'abandon', 'abandoned', 'abandoning']
------------
['abolish', 'abolished', 'abolishing', 'abolition', 'abolitionist', 'abolitionists', 'abomb', 'abominable', 'abomination', 'abominations', 'aboriginal', 'aboriginals', 'aborigine', 'aborigines', 'abort', 'aborted', 'abortion', 'abortions', 'abot', 'abotu', 'abou', 'aboumekhael', 'abound', 'abounds', 'abour', 'about', 'aboutcom', 'abouth', 'abouti', 'above']
------------
['agreement', 'agreements', 'agrees', 'agress', 'agressing', 'agression', 'agressive', 'agressively', 'agressor', 'agricultural', 'agriculture', 'agriculturists', 'agrizoophobia', 'aground', 'ags', 'aguilera', 'aguri', 'agw', 'ah', 'aha', 'ahaha', 'ahahahahaha', 'aharon', 'ahd', 'ahead', 'ahem', 'ahh', 'ahhh', 'ahhhh', 'ahhrelief']
------------
['co



As we can see, some words are not in the English dictionary, or are simply grotesque.  
However, filtering these words out would also remove "toxic" words such as "cocksucker".  
We will therefore leave these outliers in the dataset for the moment.  

In [90]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# vec = TfidfVectorizer()
# vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=word_tokenize,
#                min_df=3, max_df=0.9, strip_accents='unicode', use_idf=1,
#                smooth_idf=1, sublinear_tf=1 )
# trn_term_doc = vec.fit_transform(clean_comments)

# Models

## Train & Test Preparation 

In [83]:
# # We scale the data
# scaler = StandardScaler(with_mean=False).fit(bow)
# scaled_bow = scaler.transform(bow)

In [91]:
# Let's define the target, i.e. the labels assigned by human raters
target = train_data[['toxic', 'severe_toxic', 'obscene', 'threat','insult', 'identity_hate']]
target = np.array(target) #transform dataframe into array

In [96]:
# We create the train and test sets using train_test_split
train_x, test_x, train_y, test_y = train_test_split(bow, target, test_size=0.20, random_state=0)

## Naïve Bayes & Logistic regression

We define the Naïve Bayes relation

In [None]:
def NB(y_i,y):
    p = bow

In [None]:
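def pr(x, y_i, y):
    # x: document-term matrix, y: binary label vector, y_i: class of interest (0 or 1)
    # Returns the smoothed per-word count over the documents labelled y_i
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)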
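The notebook does not show how `pr` is used next. One common choice, assumed here, is the Naïve Bayes log-count ratio: the log of the smoothed count for the positive class divided by that of the negative class, which can then re-weight the bag-of-words features before fitting a linear model. The sketch uses a variant of `pr` that takes the matrix explicitly and a made-up dense matrix.

```python
import numpy as np

def pr(x, y_i, y):
    """Smoothed per-word count for documents whose label equals y_i."""
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Hypothetical toy data: 4 documents, 3 words; documents 0 and 1 are labelled toxic
x = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

r = np.log(pr(x, 1, y) / pr(x, 0, y))   # log-count ratio, one value per word
x_nb = x * r                            # features re-weighted by the ratio
print(r)
```

A positive ratio marks a word seen mostly in toxic documents, a negative one a word seen mostly in non-toxic documents, and a value near zero a word that does not discriminate.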

In [100]:
# Fit on the first label column only ('toxic')
logistic_model = LogisticRegression(C=4, dual=True, solver='liblinear', max_iter=400).fit(train_x, train_y[:, 0])
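The model above covers only the `toxic` column, while the competition asks for a probability for each of the six labels. A minimal sketch of one way to extend it, assuming one independent binary classifier per label, is shown below. The data is synthetic and stands in for `train_x` and `train_y`; `dual=True` is left out for simplicity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

label_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Synthetic stand-in for the bag-of-words matrix and the label array
# (assumption: the notebook's train_x / train_y would be used instead)
rng = np.random.default_rng(0)
toy_x = rng.poisson(0.5, size=(200, 20))                  # fake word counts
toy_y = (toy_x[:, :len(label_names)] > 0).astype(int)     # fake labels, one column per category

# Fit one independent binary classifier per label column
models = {}
for j, name in enumerate(label_names):
    models[name] = LogisticRegression(C=4, solver='liblinear', max_iter=400).fit(toy_x, toy_y[:, j])

# Probability of the positive class for each label, stacked as (n_samples, n_labels)
probabilities = np.column_stack(
    [models[name].predict_proba(toy_x)[:, 1] for name in label_names]
)
print(probabilities.shape)
```

Training the labels independently ignores correlations between them (for example between `toxic` and `obscene`), but it is a simple baseline to start from.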



In [74]:
# Share of training comments labelled 'toxic'
train_y[:,0].sum()/len(train_y)

0.09551450773955004
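The ratio above means that fewer than one training comment in ten is labelled `toxic`. The sketch below, on made-up labels with a similar proportion, suggests why plain accuracy can be misleading with this kind of imbalance: a model that never flags anything still scores about 90%. The counts are hypothetical.

```python
# Hypothetical labels with roughly the proportion measured above: 96 toxic out of 1000
labels = [1] * 96 + [0] * 904
predictions = [0] * len(labels)          # trivial model: always predict "not toxic"

# Accuracy: share of predictions that match the label
correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of toxic comments that the model actually finds
true_positives = sum(p == 1 and t == 1 for p, t in zip(predictions, labels))
recall = true_positives / sum(labels)

print(accuracy, recall)
```

A metric that looks at how well toxic comments are ranked or retrieved, such as recall or ROC AUC, is likely to be more informative here than accuracy alone.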