# Toxic Comment Challenge
* This is the well-known "Toxic Comment Challenge" from Kaggle.
* It will be challenging, but given your experience with the Trump Tweets exercise, I'm confident you'll pull it off.

# Data about Data 

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
* *`toxic`*
* *`severe_toxic`*
* *`obscene`*
* *`threat`*
* *`insult`*
* *`identity_hate`*


* **train.csv** - the training set, contains comments with their binary labels
* **test.csv** - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* **sample_submission.csv** - a sample submission file in the correct format
* **test_labels.csv** - labels for the test data; a value of -1 indicates the comment was not used for scoring (note: this file was added after the competition closed)

___

***Things to do***
- Perform the data pre-processing.
- Perform EDA
- If there's an empty `Comment`, drop it.
- Remove punctuation
- Convert the collection of raw documents to a matrix of TF-IDF features using sklearn's `TfidfVectorizer`
- Build a sparse matrix with the required features for the training and test sets using the `sparse.hstack()` method
- Create an empty (`np.zeros`) array `preds` sized to the test set
- Fit a `LogisticRegression` model
- Using `model.predict_proba()` of `LogisticRegression`, calculate the probabilities and store them in the prediction array.
- Convert the `preds` array to a pandas DataFrame and set the column names.
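
The steps above can be sketched end to end on a toy corpus. This is a minimal sketch, not the reference solution: the four example comments, the two label columns, and the choice to stack word-level and character-level TF-IDF features are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for train.csv / test.csv (hypothetical data).
train = pd.DataFrame({
    "comment_text": ["you are an idiot", "thanks for the helpful edit",
                     "what a stupid idiot", "nice work on the article"],
    "toxic":  [1, 0, 1, 0],
    "insult": [0, 0, 1, 0],
})
test = pd.DataFrame({"comment_text": ["stupid edit", "helpful article"]})
labels = ["toxic", "insult"]

# Fit each vectorizer on the training text only, then reuse it on the test text.
word_vec = TfidfVectorizer(analyzer="word")
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_train = sparse.hstack([word_vec.fit_transform(train["comment_text"]),
                         char_vec.fit_transform(train["comment_text"])]).tocsr()
X_test = sparse.hstack([word_vec.transform(test["comment_text"]),
                        char_vec.transform(test["comment_text"])]).tocsr()

# One model per label: fit, then store P(label == 1) in that label's column.
preds = np.zeros((X_test.shape[0], len(labels)))
for i, label in enumerate(labels):
    model = LogisticRegression()
    model.fit(X_train, train[label])
    preds[:, i] = model.predict_proba(X_test)[:, 1]

submission = pd.DataFrame(preds, columns=labels)
print(submission)
```

The key point is that the model is fit inside the loop, once per label, so each column of `preds` comes from its own classifier.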
 
***What will be new***
- Almost everything is new. 
 
***What will be tricky***
- `TfidfVectorizer` will be tricky, but you can refer to the sklearn documentation [HERE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [1]:
import numpy as np 
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from scipy import sparse

In [2]:
# Sample output
pd.read_csv('sample_submission.csv').head(5)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
1,0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
2,00013b17ad220c46,0.5,0.5,0.5,0.5,0.5,0.5
3,00017563c3f7919a,0.5,0.5,0.5,0.5,0.5,0.5
4,00017695ad8997eb,0.5,0.5,0.5,0.5,0.5,0.5


In [3]:
# Import train
df = pd.read_csv('train.csv')
df.head(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
# lowercase, remove punctuation, and add a column with the text length in words

df['comment_text'] = df['comment_text'].str.lower()
# regex=True is required in pandas >= 2.0, where str.replace defaults to a literal match
df['comment_text'] = df['comment_text'].str.replace(r'[\W_]+', ' ', regex=True)
df['text_length'] = df['comment_text'].str.split(r'[\W_]+').str.len()
df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0,50
1,000103f0d9cfb60f,d aww he matches this background colour i m se...,0,0,0,0,0,0,21
2,000113f07ec002fd,hey man i m really not trying to edit war it s...,0,0,0,0,0,0,45
3,0001b41b1c6bb37e,more i can t make any real suggestions on imp...,0,0,0,0,0,0,118
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0,15
5,00025465d4725e87,congratulations from me as well use the tools...,0,0,0,0,0,0,12
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0,8
7,00031b1e95af7921,your vandalism to the matt shirvington article...,0,0,0,0,0,0,22
8,00037261f536c51d,sorry if the word nonsense was offensive to yo...,0,0,0,0,0,0,90
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0,12
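
The pattern `[\W_]+` only does its job when it is treated as a regular expression; in pandas 2.0 and later, `Series.str.replace` defaults to `regex=False`, so the flag has to be passed explicitly. A minimal sketch on two made-up comments follows; the `.str.strip()` call is an addition for illustration and is not part of the cell above.

```python
import pandas as pd

# Two made-up comments (hypothetical data).
s = pd.Series(["D'aww! He matches this_colour...", "Hey man, I'm NOT trying!!"])

# Lowercase, then collapse every run of non-word characters or underscores into one space.
clean = s.str.lower().str.replace(r"[\W_]+", " ", regex=True).str.strip()

# Token count per comment (whitespace split, so no empty trailing token).
n_words = clean.str.split().str.len()

print(clean.tolist())
print(n_words.tolist())
```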


In [5]:
# remove stop words

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def remove_stop_words(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split() 
                      if word not in stop_words]))
    return removed_stop_words

df['clean_comment'] = df['comment_text']
df['clean_comment'] = remove_stop_words(df['clean_comment'])
df.head(10)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length,clean_comment
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0,50,explanation edits made username hardcore metal...
1,000103f0d9cfb60f,d aww he matches this background colour i m se...,0,0,0,0,0,0,21,aww matches background colour seemingly stuck ...
2,000113f07ec002fd,hey man i m really not trying to edit war it s...,0,0,0,0,0,0,45,hey man really trying edit war guy constantly ...
3,0001b41b1c6bb37e,more i can t make any real suggestions on imp...,0,0,0,0,0,0,118,make real suggestions improvement wondered sec...
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0,15,sir hero chance remember page
5,00025465d4725e87,congratulations from me as well use the tools...,0,0,0,0,0,0,12,congratulations well use tools well talk
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0,8,cocksucker piss around work
7,00031b1e95af7921,your vandalism to the matt shirvington article...,0,0,0,0,0,0,22,vandalism matt shirvington article reverted pl...
8,00037261f536c51d,sorry if the word nonsense was offensive to yo...,0,0,0,0,0,0,90,sorry word nonsense offensive anyway intending...
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0,12,alignment subject contrary dulithgow


In [6]:
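# correlate the numeric columns only; recent pandas versions raise on text columns
df.select_dtypes('number').corr()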

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate,text_length
toxic,1.0,0.308619,0.676515,0.157058,0.647518,0.266009,-0.051482
severe_toxic,0.308619,1.0,0.403014,0.123601,0.375807,0.2016,0.010148
obscene,0.676515,0.403014,1.0,0.141179,0.741272,0.286867,-0.04061
threat,0.157058,0.123601,0.141179,1.0,0.150022,0.115128,-0.006366
insult,0.647518,0.375807,0.741272,0.150022,1.0,0.337736,-0.04238
identity_hate,0.266009,0.2016,0.286867,0.115128,0.337736,1.0,-0.014014
text_length,-0.051482,0.010148,-0.04061,-0.006366,-0.04238,-0.014014,1.0


In [7]:
df2 = pd.read_csv('test.csv')

df2['comment_text'] = df2['comment_text'].str.lower() 
df2['comment_text'] = df2['comment_text'].str.replace(r'[\W_]+', ' ', regex=True)

df2['toxic'] = 0
df2['severe_toxic'] = 0
df2['obscene'] = 0
df2['threat'] = 0
df2['insult'] = 0
df2['identity_hate'] = 0

In [8]:
df2

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,yo bitch ja rule is more succesful then you ll...,0,0,0,0,0,0
1,0000247867823ef7,from rfc the title is fine as it is imo,0,0,0,0,0,0
2,00013b17ad220c46,sources zawe ashton on lapland,0,0,0,0,0,0
3,00017563c3f7919a,if you have a look back at the source the inf...,0,0,0,0,0,0
4,00017695ad8997eb,i don t anonymously edit articles at all,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...
153159,fffcd0960ee309b5,i totally agree this stuff is nothing but too...,0,0,0,0,0,0
153160,fffd7a9a6eb32c16,throw from out field to home plate does it ge...,0,0,0,0,0,0
153161,fffda9e8d6fafa9e,okinotorishima categories i see your changes ...,0,0,0,0,0,0
153162,fffe8f1340a79fc2,one of the founding nations of the eu germany...,0,0,0,0,0,0


In [9]:
# remove stop words (reuses stop_words and remove_stop_words defined in In [5])

df2['clean_comment'] = remove_stop_words(df2['comment_text'])
df2.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,clean_comment
0,00001cee341fdb12,yo bitch ja rule is more succesful then you ll...,0,0,0,0,0,0,yo bitch ja rule succesful ever whats hating s...
1,0000247867823ef7,from rfc the title is fine as it is imo,0,0,0,0,0,0,rfc title fine imo
2,00013b17ad220c46,sources zawe ashton on lapland,0,0,0,0,0,0,sources zawe ashton lapland
3,00017563c3f7919a,if you have a look back at the source the inf...,0,0,0,0,0,0,look back source information updated correct f...
4,00017695ad8997eb,i don t anonymously edit articles at all,0,0,0,0,0,0,anonymously edit articles
5,0001ea8717f6de06,thank you for understanding i think very highl...,0,0,0,0,0,0,thank understanding think highly would revert ...
6,00024115d4cbde0f,please do not add nonsense to wikipedia such e...,0,0,0,0,0,0,please add nonsense wikipedia edits considered...
7,000247e83dcc1211,dear god this site is horrible,0,0,0,0,0,0,dear god site horrible
8,00025358d4737918,only a fool can believe in such numbers the c...,0,0,0,0,0,0,fool believe numbers correct number lies 10 00...
9,00026d1092fe71cc,double redirects when fixing double redirects...,0,0,0,0,0,0,double redirects fixing double redirects blank...


In [10]:
train = df.drop(['id','comment_text'], axis=1)
test =  df2.drop(['id','comment_text'], axis=1)

In [11]:

X_train = train['clean_comment']
y_train = train[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]

X_test = test['clean_comment']
y_test = test[['toxic','severe_toxic','obscene','threat','insult','identity_hate']]

X_train.shape, y_train.shape, X_test.shape

((159571,), (159571, 6), (153164,))

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('bow', CountVectorizer()),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
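
The task list asks for `TfidfVectorizer`, while this pipeline uses `CountVectorizer` followed by `TfidfTransformer`. With default settings the two routes produce the same matrix, which the following sketch checks on three made-up comments (the `docs` list is hypothetical).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

docs = ["thanks for the edit", "stupid edit", "nice article thanks"]  # made-up comments

# Route 1: raw counts first, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: both steps in a single estimator.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```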


In [13]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.metrics import confusion_matrix,classification_report


col = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for i, j in enumerate(col):

    print('===Fit '+j)
    print('   Fitting model')
    pipeline.fit(X_train, y_train[j])
    
    # Accuracy on the training data only; the test predictions are filled in afterwards
    print("   " + j + " train score: %.3f" % pipeline.score(X_train, y_train[j]))

# The train scores below look good, but they are training accuracy, and each fit overwrites the previous one

      

===Fit toxic
   Fitting model
   toxic train score: 0.926
===Fit severe_toxic
   Fitting model
   severe_toxic train score: 0.990
===Fit obscene
   Fitting model
   obscene train score: 0.955
===Fit threat
   Fitting model
   threat train score: 0.997
===Fit insult
   Fitting model
   insult train score: 0.954
===Fit identity_hate
   Fitting model
   identity_hate train score: 0.991
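
Accuracy measured on the training data can look strong for rare labels even when a model seldom predicts the positive class. One possible check is to score on comments the model has not seen. The sketch below uses cross-validated ROC AUC on a small made-up corpus; the data, the `cv=3` setting, and the choice of metric are assumptions for illustration and are not part of the notebook above.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Made-up corpus: 6 toxic and 12 clean comments (hypothetical data).
texts = ["you stupid idiot", "what an idiot", "shut up you moron",
         "stupid moron", "idiot go away", "you are a stupid moron"] + \
        ["thanks for the edit", "nice article", "please see the talk page",
         "good source thanks", "i fixed the link", "see the article history"] * 2
y = np.array([1] * 6 + [0] * 12)

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# Stratified 3-fold cross-validation; each fold is scored on held-out comments.
scores = cross_val_score(pipeline, texts, y, cv=3, scoring="roc_auc")
print(scores.mean())
```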


In [17]:

preds = np.zeros((len(test), len(col)))

for i, j in enumerate(col):
    print('fit', j)
    preds[:,i] = pipeline.predict_proba(X_test)[:,1]
    
    
# Here I am trying to create an array with 6 columns (toxic, severe_toxic, etc.) that holds the predictions for the test file.
# The loop runs without errors.

# HOWEVER, the output is not good:
  # the numbers do not make sense: the max value is 0.72 with an average of 0.0012
  # all columns display the same values
  # the 1st line (id 00001cee341fdb12) should be classified as at least a little bit toxic (it scores 0.0000109895287988836)
# Likely cause: nothing is refit in this loop, so every column comes from the pipeline as last fit in In [13] (identity_hate).

    

fit toxic
fit severe_toxic
fit obscene
fit threat
fit insult
fit identity_hate


In [18]:
preds



array([[1.09895288e-05, 1.09895288e-05, 1.09895288e-05, 1.09895288e-05,
        1.09895288e-05, 1.09895288e-05],
       [1.62353833e-05, 1.62353833e-05, 1.62353833e-05, 1.62353833e-05,
        1.62353833e-05, 1.62353833e-05],
       [9.25549243e-03, 9.25549243e-03, 9.25549243e-03, 9.25549243e-03,
        9.25549243e-03, 9.25549243e-03],
       ...,
       [3.63198350e-06, 3.63198350e-06, 3.63198350e-06, 3.63198350e-06,
        3.63198350e-06, 3.63198350e-06],
       [3.51731042e-07, 3.51731042e-07, 3.51731042e-07, 3.51731042e-07,
        3.51731042e-07, 3.51731042e-07],
       [1.09095396e-05, 1.09095396e-05, 1.09095396e-05, 1.09095396e-05,
        1.09095396e-05, 1.09095396e-05]])
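
One way to get a distinct column per label is to refit the pipeline inside the prediction loop, since a fitted pipeline only holds the model for the last label it was trained on. The following is a minimal sketch on a made-up corpus so that it runs on its own; with the real data, `X_train`, `y_train`, `X_test` and `col` from the cells above would take the place of the toy variables.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for X_train, y_train and X_test (hypothetical data).
X_train = pd.Series(["you stupid idiot", "i will hurt you", "thanks for the edit",
                     "nice article", "stupid article", "i will find you idiot"])
y_train = pd.DataFrame({"insult": [1, 0, 0, 0, 1, 1],
                        "threat": [0, 1, 0, 0, 0, 1]})
X_test = pd.Series(["stupid idiot", "i will hurt you", "nice edit"])
col = ["insult", "threat"]

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

preds = np.zeros((len(X_test), len(col)))
for i, j in enumerate(col):
    # Refit on this label *before* predicting, so column i belongs to label j.
    pipeline.fit(X_train, y_train[j])
    preds[:, i] = pipeline.predict_proba(X_test)[:, 1]

print(pd.DataFrame(preds, columns=col).round(3))
```

Because the fit and the prediction happen in the same iteration, the columns no longer repeat one model's output.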

In [19]:
subm = pd.read_csv('sample_submission.csv')

submid = pd.DataFrame({'id': subm["id"]})
submission = pd.concat([submid, pd.DataFrame(preds, columns = col)], axis=1)
submission.to_csv('feat_lr_2cols.csv', index=False)

