# Toxic Comment Challenge
* This is the well-known "Toxic Comment Classification Challenge" from Kaggle.
* It will be challenging, but given your experience with the Trump tweets project, I'm confident you'll pull it off.

# Data about Data 

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
* *`toxic`*
* *`severe_toxic`*
* *`obscene`*
* *`threat`*
* *`insult`*
* *`identity_hate`*


* **train.csv** - the training set, contains comments with their binary labels
* **test.csv** - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* **sample_submission.csv** - a sample submission file in the correct format
* **test_labels.csv** - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)
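
Because rows marked with -1 were not scored, they need to be filtered out before comparing predictions against `test_labels.csv`. A minimal sketch, assuming a toy frame with made-up ids in place of the real file (the names `test_labels`, `label_cols` and `scored` are illustrative only):

```python
import pandas as pd

# Toy stand-in for test_labels.csv; the ids and values here are made up.
test_labels = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "toxic": [0, -1, 1, -1],
    "severe_toxic": [0, -1, 0, -1],
    "obscene": [0, -1, 1, -1],
    "threat": [0, -1, 0, -1],
    "insult": [0, -1, 1, -1],
    "identity_hate": [0, -1, 0, -1],
})

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Keep only the rows that were actually scored (no -1 in any label column).
scored = test_labels[(test_labels[label_cols] != -1).all(axis=1)]
print(scored["id"].tolist())
```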

___

***Things to do***
- Perform the data pre-processing.
- Perform EDA
- If there's an empty `Comment`, drop it.
- Remove punctuation
- Convert the collection of raw documents to a matrix of TF-IDF features using sklearn's `TfidfVectorizer`
- Build sparse feature matrices for the training and test sets using `sparse.hstack()`
- Create a zero-filled array `preds` (via `np.zeros`) with one row per test comment
- Fit a `LogisticRegression` model
- Use the model's `predict_proba()` to compute the probabilities and store them in the prediction array
- Convert the `preds` array to a pandas DataFrame and set the column names
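
The modelling steps in the list above can be sketched end to end on a tiny made-up dataset. This is only an illustration under assumptions: the toy comments, the shortened two-label list, and the choice to stack a word-level and a character-level TF-IDF matrix are not taken from the competition data or from a reference solution.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

label_cols = ["toxic", "obscene"]  # shortened label list for the toy example

# Made-up stand-ins for train.csv and test.csv.
train = pd.DataFrame({
    "comment_text": ["you are an idiot", "thanks for the helpful edit",
                     "what a stupid idiot", "nice work on the article",
                     "damn stupid page", "please see the talk page"],
    "toxic":   [1, 0, 1, 0, 1, 0],
    "obscene": [0, 0, 0, 0, 1, 0],
})
test = pd.DataFrame({"id": ["t1", "t2"],
                     "comment_text": ["stupid idiot", "helpful article"]})

# Word-level and character-level TF-IDF, fitted on the training text only.
word_vec = TfidfVectorizer(analyzer="word")
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_train = sparse.hstack([word_vec.fit_transform(train["comment_text"]),
                         char_vec.fit_transform(train["comment_text"])]).tocsr()
X_test = sparse.hstack([word_vec.transform(test["comment_text"]),
                        char_vec.transform(test["comment_text"])]).tocsr()

# One binary LogisticRegression per label; column 1 of predict_proba is P(label = 1).
preds = np.zeros((len(test), len(label_cols)))
for i, col in enumerate(label_cols):
    model = LogisticRegression(C=1.0, max_iter=1000)
    model.fit(X_train, train[col])
    preds[:, i] = model.predict_proba(X_test)[:, 1]

# Turn the prediction array into a submission-shaped DataFrame.
submission = pd.DataFrame(preds, columns=label_cols)
submission.insert(0, "id", test["id"])
print(submission)
```

Fitting one model per label keeps each problem binary, which is what `LogisticRegression` expects when given a single target column.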
 
***What will be new***
- Almost everything is new. 
 
***What will be tricky***
- `TfidfVectorizer` will be tricky, but you can refer to the sklearn documentation [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
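
As a small warm-up, here is a sketch of `TfidfVectorizer` on a made-up three-document corpus (the corpus and variable names are illustrative only). It shows the two calls that matter most: `fit_transform` on the training text and `transform` on new text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran away"]  # made-up corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)     # learns vocabulary and IDF weights, returns a sparse matrix

print(X.shape)                         # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned terms

# New text goes through transform(), not fit_transform(), so it reuses the training vocabulary.
X_new = vectorizer.transform(["the bird sat"])
print(X_new.shape)
```

Words that were never seen during fitting (here "bird") are ignored by `transform`, so the test matrix always has the same number of columns as the training matrix.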

In [1]:
import numpy as np 
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy import sparse

In [42]:
# Sample output
pd.read_csv('sample_submission.csv').head(5)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
1,0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
2,00013b17ad220c46,0.5,0.5,0.5,0.5,0.5,0.5
3,00017563c3f7919a,0.5,0.5,0.5,0.5,0.5,0.5
4,00017695ad8997eb,0.5,0.5,0.5,0.5,0.5,0.5


In [43]:
# Import train
df = pd.read_csv('train.csv')
df.head(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [44]:
# Lowercase, remove punctuation, and add a column with the text length (token count)

df['comment_text'] = df['comment_text'].str.lower()
# regex=True is needed on pandas >= 2.0, where str.replace defaults to literal matching
df['comment_text'] = df['comment_text'].str.replace(r'[\W_]+', ' ', regex=True)
df['text_length'] = df['comment_text'].str.split(r'[\W_]+').str.len()
df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0,50
1,000103f0d9cfb60f,d aww he matches this background colour i m se...,0,0,0,0,0,0,21
2,000113f07ec002fd,hey man i m really not trying to edit war it s...,0,0,0,0,0,0,45
3,0001b41b1c6bb37e,more i can t make any real suggestions on imp...,0,0,0,0,0,0,118
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0,15
5,00025465d4725e87,congratulations from me as well use the tools...,0,0,0,0,0,0,12
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0,8
7,00031b1e95af7921,your vandalism to the matt shirvington article...,0,0,0,0,0,0,22
8,00037261f536c51d,sorry if the word nonsense was offensive to yo...,0,0,0,0,0,0,90
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0,12


In [45]:
# remove stop words

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in stop_words]))
    return removed_stop_words

df['clean_comment'] = df['comment_text']
df['clean_comment'] = remove_stop_words(df['clean_comment'])
df.head(10)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length,clean_comment
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0,50,explanation edits made username hardcore metal...
1,000103f0d9cfb60f,d aww he matches this background colour i m se...,0,0,0,0,0,0,21,aww matches background colour seemingly stuck ...
2,000113f07ec002fd,hey man i m really not trying to edit war it s...,0,0,0,0,0,0,45,hey man really trying edit war guy constantly ...
3,0001b41b1c6bb37e,more i can t make any real suggestions on imp...,0,0,0,0,0,0,118,make real suggestions improvement wondered sec...
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0,15,sir hero chance remember page
5,00025465d4725e87,congratulations from me as well use the tools...,0,0,0,0,0,0,12,congratulations well use tools well talk
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0,8,cocksucker piss around work
7,00031b1e95af7921,your vandalism to the matt shirvington article...,0,0,0,0,0,0,22,vandalism matt shirvington article reverted pl...
8,00037261f536c51d,sorry if the word nonsense was offensive to yo...,0,0,0,0,0,0,90,sorry word nonsense offensive anyway intending...
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0,12,alignment subject contrary dulithgow
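
The to-do list also asks for empty comments to be dropped. After punctuation and stop-word removal some comments may end up empty, so this is a reasonable place to do it. A minimal sketch on a made-up frame (`df_demo` is a hypothetical name used to avoid touching `df`):

```python
import pandas as pd

# Made-up stand-in for the cleaned comments, including empty and missing entries.
df_demo = pd.DataFrame(
    {"clean_comment": ["explanation edits made", "", "   ", None, "sir hero chance"]}
)

# Treat missing values and whitespace-only strings as empty, then drop them.
is_empty = df_demo["clean_comment"].fillna("").str.strip() == ""
df_demo = df_demo[~is_empty].reset_index(drop=True)
print(len(df_demo))
```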


In [37]:
# numeric_only=True skips the text columns (required on pandas >= 2.0)
df.corr(numeric_only=True)

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length
toxic,1.0,0.308619,0.676515,0.157058,0.647518,0.266009,-0.051482
severe_toxic,0.308619,1.0,0.403014,0.123601,0.375807,0.2016,0.010148
obscene,0.676515,0.403014,1.0,0.141179,0.741272,0.286867,-0.04061
threat,0.157058,0.123601,0.141179,1.0,0.150022,0.115128,-0.006366
insult,0.647518,0.375807,0.741272,0.150022,1.0,0.337736,-0.04238
identity_hate,0.266009,0.2016,0.286867,0.115128,0.337736,1.0,-0.014014
text_length,-0.051482,0.010148,-0.04061,-0.006366,-0.04238,-0.014014,1.0


In [80]:
df2 = pd.read_csv('test.csv')

df2['comment_text'] = df2['comment_text'].str.lower() 
df2['comment_text'] = df2['comment_text'].str.replace(r'[\W_]+', ' ', regex=True)

# Add zero-filled placeholder label columns to the test set
for col in ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']:
    df2[col] = 0

In [None]:
# Remove stop words from the test comments, reusing remove_stop_words() defined above
df2['clean_comment'] = remove_stop_words(df2['comment_text'])
df2.head(10)

In [69]:
# Drop the id and raw comment columns; the cleaned text and the labels remain
train = df.drop(['id','comment_text'], axis=1)

In [70]:

X_train = train['clean_comment']
X_test = df2['clean_comment']

y_train = train[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]
y_test = df2[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]

X_train.shape, y_train.shape, X_test.shape

((159571,), (159571, 6), (153164,))

In [78]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    # MultinomialNB accepts a single target column, so wrap it to fit one model per label
    ('classifier', OneVsRestClassifier(MultinomialNB())),
])


In [79]:
pipeline.fit(X_train, y_train)

**Fitting the Classifier**

The next step is to create a pipeline that combines the preprocessor created above with a classifier. In this case I have used a simple RandomForestClassifier to start with.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

You can then simply call the fit method on the raw data and the preprocessing steps will be applied followed by training the classifier.

In [None]:
rf.fit(X_train, y_train)

To predict on new data it is as simple as calling the predict method and the preprocessing steps will be applied followed by the prediction.

In [None]:
y_pred = rf.predict(X_test)

**Model Selection**

A pipeline can also be used during the model selection process. The following example code loops through a number of scikit-learn classifiers applying the transformations and training the model.

In [None]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]
for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
    pipe.fit(X_train, y_train)   
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test))

The pipeline can also be used in grid search to find the best performing parameters. To do this you first need to create a parameter grid for your chosen model. One important thing to note is that each parameter name must be prefixed with the name you gave the classifier step of your pipeline, followed by a double underscore. In my code above I called this step `classifier`, so I have added `classifier__` to each parameter. Next I created a grid search object which includes the original pipeline. When I then call fit, the transformations are applied to the data before a cross-validated grid search is performed over the parameter grid.
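
The naming rule can be tried out on a small self-contained example. This sketch is an assumption-laden stand-in: it uses made-up texts, a `TfidfVectorizer` step named `tfidf`, and `LogisticRegression` instead of the random forest, purely to show that `<step name>__<parameter>` works for any step of a pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data: four toxic-looking and four neutral comments.
texts = ["you stupid idiot", "thanks for the edit", "what an idiot", "nice article",
         "stupid page", "see the talk page", "idiot vandal", "helpful edit thanks"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter name>".
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "classifier__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the held-out fold never influences the vocabulary or IDF weights.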

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__n_estimators': [200, 500],
    # 'auto' was removed for RandomForestClassifier in scikit-learn 1.3; it was equivalent to 'sqrt'
    'classifier__max_features': ['sqrt', 'log2'],
    'classifier__max_depth': [4, 5, 6, 7, 8],
    'classifier__criterion': ['gini', 'entropy']}
CV = GridSearchCV(rf, param_grid, n_jobs=1)

CV.fit(X_train, y_train)
print(CV.best_params_)
print(CV.best_score_)

I am working quite a lot on scikit-learn machine learning projects at the moment. Before I started to use pipelines, I would find that when I went back to a project, even after only a short time, I had trouble following the workflow again. Pipelines have really helped me to put together projects that are both easily repeatable and extensible. I hope that this guide helps others who are interested in learning how to use them.