# This is the first solution to the Toxic Comments challenge as part of my Machine Learning Capstone project.  


Steps that we will go through are as follows:

1. Import and explore the data
2. Process the data into a format that we can train a model with
3. Train a model
4. Use our model to make predictions on our test sets (we have two: one is 20% of our training data, the other is the testing set provided by the Kaggle competition)
5. View our accuracy, precision, recall and F1 scores
6. Submit our testing set predictions to the Kaggle competition to retrieve our mean column-wise ROC AUC score


In [1]:
# 1. Data importation and exploration

import pandas as pd

test_data = pd.read_csv('test.csv')  # this is our testing set, without labels (provided by Kaggle in csv format)
train_data = pd.read_csv('train.csv')  # this is our training set, with labels (provided by Kaggle in csv format)

print(train_data.head(n=10))

print('Training set data points:  ' + format(len(train_data)))

                 id                                       comment_text  toxic  \
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   
5  00025465d4725e87  "\n\nCongratulations from me as well, use the ...      0   
6  0002bcb3da6cb337       COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK      1   
7  00031b1e95af7921  Your vandalism to the Matt Shirvington article...      0   
8  00037261f536c51d  Sorry if the word 'nonsense' was offensive to ...      0   
9  00040093b2687caa  alignment on this subject and which are contra...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0          

We can see that there are 159,571 training points provided.  Each training point has a unique id and 6 labels that indicate whether it falls into each category of toxic content.  From the first 10 entries we can see that point 6 is classified as toxic, severely toxic, obscene and insulting.  Let's take a closer look at training point 6.

In [2]:
print(train_data.iloc[6])

id                                           0002bcb3da6cb337
comment_text     COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
toxic                                                       1
severe_toxic                                                1
obscene                                                     1
threat                                                      0
insult                                                      1
identity_hate                                               0
Name: 6, dtype: object


In [3]:
print(train_data.iloc[4])

id                                                0001d958c54c6e35
comment_text     You, sir, are my hero. Any chance you remember...
toxic                                                            0
severe_toxic                                                     0
obscene                                                          0
threat                                                           0
insult                                                           0
identity_hate                                                    0
Name: 4, dtype: object


Now we will look at the testing data provided

In [4]:
print(test_data.head(n=10))

                 id                                       comment_text
0  00001cee341fdb12  Yo bitch Ja Rule is more succesful then you'll...
1  0000247867823ef7  == From RfC == \n\n The title is fine as it is...
2  00013b17ad220c46  " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3  00017563c3f7919a  :If you have a look back at the source, the in...
4  00017695ad8997eb          I don't anonymously edit articles at all.
5  0001ea8717f6de06  Thank you for understanding. I think very high...
6  00024115d4cbde0f  Please do not add nonsense to Wikipedia. Such ...
7  000247e83dcc1211                   :Dear god this site is horrible.
8  00025358d4737918  " \n Only a fool can believe in such numbers. ...
9  00026d1092fe71cc  == Double Redirects == \n\n When fixing double...


We can see here that the testing data does not include labels, so we will partition our training set in order to verify our model's performance.

In [5]:
# 2. Process the data into a format that we can train a model with

# we will separate our training data into features and labels

features = train_data['comment_text']
label_columns = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
labels = train_data[label_columns]

print(features.shape)
print(labels.shape)

print(features.head(n=10))
print(labels.head(n=10))

(159571,)
(159571, 6)
0    Explanation\nWhy the edits made under my usern...
1    D'aww! He matches this background colour I'm s...
2    Hey man, I'm really not trying to edit war. It...
3    "\nMore\nI can't make any real suggestions on ...
4    You, sir, are my hero. Any chance you remember...
5    "\n\nCongratulations from me as well, use the ...
6         COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
7    Your vandalism to the Matt Shirvington article...
8    Sorry if the word 'nonsense' was offensive to ...
9    alignment on this subject and which are contra...
Name: comment_text, dtype: object
   toxic  severe_toxic  obscene  threat  insult  identity_hate
0      0             0        0       0       0              0
1      0             0        0       0       0              0
2      0             0        0       0       0              0
3      0             0        0       0       0              0
4      0             0        0       0       0              0
5      0        

Now let us see a breakdown of our classification labels

In [6]:
train_data.describe()

          toxic  severe_toxic   obscene    threat    insult  identity_hate
count  159571.0      159571.0  159571.0  159571.0  159571.0       159571.0
mean   0.095844      0.009996  0.052948  0.002996  0.049364       0.008805
std    0.294379      0.099477  0.223931   0.05465  0.216627        0.09342
min         0.0           0.0       0.0       0.0       0.0            0.0
25%         0.0           0.0       0.0       0.0       0.0            0.0
50%         0.0           0.0       0.0       0.0       0.0            0.0
75%         0.0           0.0       0.0       0.0       0.0            0.0
max         1.0           1.0       1.0       1.0       1.0            1.0


In [9]:
for i in label_columns:
    print (i)
    print (train_data[i].value_counts())

toxic
0    144277
1     15294
Name: toxic, dtype: int64
severe_toxic
0    157976
1      1595
Name: severe_toxic, dtype: int64
obscene
0    151122
1      8449
Name: obscene, dtype: int64
threat
0    159093
1       478
Name: threat, dtype: int64
insult
0    151694
1      7877
Name: insult, dtype: int64
identity_hate
0    158166
1      1405
Name: identity_hate, dtype: int64
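
The counts above imply a heavy class imbalance. As a rough sketch of what that means for accuracy, the snippet below takes the positive counts printed above and computes, for each label, the positive rate and the accuracy that a trivial classifier predicting 0 for every comment would reach on the full training set. The helper name `imbalance_summary` is hypothetical and only used for illustration.

```python
# Positive counts copied from the value_counts() output above.
positives = {
    "toxic": 15294,
    "severe_toxic": 1595,
    "obscene": 8449,
    "threat": 478,
    "insult": 7877,
    "identity_hate": 1405,
}
n_comments = 159571  # total number of training comments


def imbalance_summary(positives, n_total):
    """Return {label: (positive_rate, all_zeros_accuracy)} for each label."""
    summary = {}
    for label, count in positives.items():
        rate = count / n_total
        # A classifier that always predicts 0 is right on every negative example.
        summary[label] = (rate, 1.0 - rate)
    return summary


summary = imbalance_summary(positives, n_comments)
for label, (rate, baseline) in summary.items():
    print(f"{label:>14}: positive rate {rate:.4%}, all-zeros accuracy {baseline:.4%}")
```

For the rarest label, threat, the all-zeros baseline is already above 99%, so accuracy on its own says little about how well a model finds toxic comments.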


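Based on these label counts, we will ignore words that occur too frequently.  Words that appear in a large share of comments will not help us classify comments into these labels, because they are more than likely trivial for determining these classifications.  In the vectorizer below this is done with max_df=0.32, which CountVectorizer interprets as a proportion of documents: any word that appears in more than 32% of the training comments is dropped.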
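
To see how document-frequency cutoffs behave, here is a minimal sketch on a made-up four-sentence corpus (the sentences and the thresholds 0.5 and 2 are assumptions chosen for illustration, not the values used in this notebook). A float `max_df` is read as a proportion of documents and an integer `min_df` as an absolute number of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

# max_df=0.5 drops words found in more than 50% of the documents ("the").
# min_df=2 drops words found in fewer than 2 documents ("mat", "log", ...).
vect = CountVectorizer(max_df=0.5, min_df=2)
counts = vect.fit_transform(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.shape)  # (number of documents, number of kept words)
```

Only words that appear in at least two documents but in no more than half of them survive, which is the same filtering idea applied to the comments here.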

In [10]:
from sklearn.model_selection import train_test_split

# we will split the training data into training and testing data (80% will be training)
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, random_state=42)

In [12]:
from sklearn.feature_extraction.text import CountVectorizer


# CountVectorizer builds a vocabulary from the words in our training set and
# counts how many times each word occurs in each comment
# it also lowercases the text and ignores punctuation and extra spacing

count_vect = CountVectorizer(stop_words='english', max_df = 0.32, min_df=3) 
# stop_words removes common English words (and, or, if, etc.), max_df ignores words that occur in more than 32% of comments,
# min_df is the minimum number of comments a word must appear in to be kept as a feature

X_train_counts = count_vect.fit_transform(X_train)


print(len(count_vect.vocabulary_))  # number of distinct words kept as features

46062


We have 46062 features (unique words) that will be considered when building our model

In [13]:
# 3. We will train our model
# 4. Use our model to make predictions on our test sets (we have two: one is 20% of our training data, the other is the testing set provided by the Kaggle competition)
# 5. View our accuracy, precision, recall and F1 scores

from sklearn.naive_bayes import MultinomialNB # this is our classifier (Naive Bayes)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # these are our metrics

predictions = pd.DataFrame(columns=label_columns)  # hold-out predictions, one column per label
test_results = pd.DataFrame(columns=['id'] + label_columns)  # predictions for the Kaggle submission

test_data_transform = count_vect.transform(test_data['comment_text'])  # Kaggle testing set, transformed with our fitted CountVectorizer
X_test_transform = count_vect.transform(X_test)  # our hold-out set (20% of training data), transformed the same way

for i in label_columns:
    clf = MultinomialNB().fit(X_train_counts, y_train[i])
    predictions[i] = clf.predict(X_test_transform)
    test_results[i] = clf.predict(test_data_transform)
    print (i)
    print('Accuracy score: ', format(accuracy_score(y_test[i], predictions[i])))
    print('Precision score: ', format(precision_score(y_test[i], predictions[i])))
    print('Recall score: ', format(recall_score(y_test[i], predictions[i])))
    print('F1 score: ', format(f1_score(y_test[i], predictions[i])))

test_results['id'] = test_data['id']
    

toxic
Accuracy score:  0.9472975090083033
Precision score:  0.721042471042471
Recall score:  0.7333115183246073
F1 score:  0.7271252433484748
severe_toxic
Accuracy score:  0.9842080526398246
Precision score:  0.34925864909390447
Recall score:  0.660436137071651
F1 score:  0.45689655172413796
obscene
Accuracy score:  0.9664421118596271
Precision score:  0.6646216768916156
Recall score:  0.7580174927113703
F1 score:  0.708253881776083
threat
Accuracy score:  0.9949240169199436
Precision score:  0.11403508771929824
Recall score:  0.17567567567567569
F1 score:  0.1382978723404255
insult
Accuracy score:  0.9605514648284506
Precision score:  0.5954814416352878
Recall score:  0.6858736059479554
F1 score:  0.6374892024186583
identity_hate
Accuracy score:  0.9843333855553815
Precision score:  0.25821596244131456
Recall score:  0.3741496598639456
F1 score:  0.3055555555555556
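
To make the four scores concrete, the sketch below computes them by hand from confusion-matrix counts. The counts are not printed anywhere in this notebook: they are inferred values that happen to reproduce the threat scores above, assuming the hold-out set contains 31,915 comments (the 20% split of 159,571). The function name `binary_scores` is hypothetical.

```python
def binary_scores(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of everything predicted positive, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything truly positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed counts, chosen to be consistent with the printed threat scores.
scores = binary_scores(tp=13, fp=101, fn=61, tn=31740)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

The true negatives dominate the total, which is why accuracy stays near 1 even though most predicted threats are wrong and most real threats are missed.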


In [14]:
# now we will export our predictions for the Kaggle testing set to a csv file

test_results.to_csv('naive_bayes.csv', index=False)

This resulted in a mean column-wise ROC AUC score of 0.7808 in the Kaggle competition.  We can see that although accuracy is high, our precision, recall and F1 scores are low.  There is also a correlation between the number of positive examples of each label in the training data and the scores: the more examples we had of a label, the higher its scores.  We will attempt other models to improve our score.
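
One thing that may be worth trying before changing models: ROC AUC is computed from how the scores rank the comments, so hard 0/1 predictions from `predict` throw away information that `predict_proba` keeps. The sketch below is not the notebook's pipeline. It uses small synthetic count data (an assumption, generated with NumPy) only to show how both kinds of output can be scored with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in for word counts: 400 "comments", 5 "words" (illustrative only).
rng = np.random.default_rng(0)
n = 400
y = (rng.random(n) < 0.2).astype(int)          # roughly 20% positives
X = rng.poisson(lam=1.0, size=(n, 5))
X[:, 0] += y * rng.poisson(lam=2.0, size=n)    # word 0 is more common in positives

# Simple split: first 300 rows for training, last 100 for evaluation.
clf = MultinomialNB().fit(X[:300], y[:300])

hard = clf.predict(X[300:])                    # 0/1 labels
proba = clf.predict_proba(X[300:])[:, 1]       # probability of the positive class

auc_hard = roc_auc_score(y[300:], hard)
auc_proba = roc_auc_score(y[300:], proba)
print("ROC AUC from hard labels:  ", auc_hard)
print("ROC AUC from probabilities:", auc_proba)
```

In the loop above, the equivalent change would be to store `clf.predict_proba(test_data_transform)[:, 1]` in `test_results[i]`. Whether this improves the Kaggle score for this model would need to be checked with a new submission.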