# This notebook was completed using Python 3

# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in the [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussions on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

In [2]:
import pandas as pd
import urllib.request
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import numpy as np
import sys

In [3]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 

def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
## download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
## download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [4]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [5]:
len(annotations['rev_id'].unique())

115864

In [6]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [7]:
# join labels and comments
comments['attack'] = labels
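
The labeling and join steps above can be illustrated on a toy example. This is a minimal sketch: the `toy_annotations` and `toy_comments` frames are hypothetical stand-ins for the real tables, and only the `groupby(...).mean() > 0.5` pattern and the index-aligned assignment come from the notebook.

```python
import pandas as pd

# Hypothetical toy annotations: three annotators per comment.
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 1, 2, 2, 2],
    'attack': [1, 1, 0, 0, 1, 0],
})

# Hypothetical toy comments indexed by rev_id, deliberately in a different order.
toy_comments = pd.DataFrame(
    {'comment': ['second comment', 'first comment']},
    index=pd.Index([2, 1], name='rev_id'),
)

# Fraction of annotators who flagged each comment, thresholded at 0.5.
toy_labels = toy_annotations.groupby('rev_id')['attack'].mean() > 0.5

# Column assignment aligns on the index (rev_id), not on row position.
toy_comments['attack'] = toy_labels
print(toy_comments)
```

Because the comparison is a strict `>`, a comment whose annotators are split exactly in half is labeled as a non-attack.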

# QUESTIONS + CODES

**a. What are the text cleaning methods you tried? What are the ones you have included in the final code?**

I tried removing special characters such as `%^&()<>?/\|}{~:`. I replaced each of them with a space, split every comment into a list of words, and then joined the words back into a sentence with single spaces and no leading or trailing whitespace. This method is included in the final code.

I also tried removing all the stop words during cleaning, but that did not work out, so I left stop-word removal to the hyperparameter tuning process.

In [7]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

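# remove special characters: every character in this string, including the
# square brackets, is replaced by a space (raw string keeps the backslash literal)
specialChar = r'[=@_!#$%^&*()<>?/\|}{~:`]'
for char in specialChar:
    comments['comment'] = comments['comment'].apply(lambda x: x.replace(char, " "))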

# split the untrimmed sentences into arrays of words to remove the space
comments['comment'] = comments['comment'].apply(lambda x: x.split())

# join the words to create new sentences and trim trailing, leading spaces 
s = ' '
comments['comment'] = comments['comment'].apply(lambda x: s.join(x))
comments['comment'] = comments['comment'].apply(lambda x: x.strip())
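
The cleaning steps above could also be collected into one function built on a regular expression. This is a minimal sketch rather than the code that produced the results in this notebook: the helper name `clean_comment` and the sample string are hypothetical, and the character class is meant to list the same characters as `specialChar`.

```python
import re

# Same characters as specialChar, written as a regex character class.
# The square brackets and the backslash have to be escaped inside the class.
SPECIAL_CHARS = re.compile(r'[\[\]=@_!#$%^&*()<>?/\\|}{~:`]')

def clean_comment(text):
    """Hypothetical helper: replace tokens and special characters with spaces."""
    # Replace the placeholder tokens first, before underscores are stripped.
    text = text.replace('NEWLINE_TOKEN', ' ').replace('TAB_TOKEN', ' ')
    text = SPECIAL_CHARS.sub(' ', text)
    # split() with no arguments discards every run of whitespace, so the
    # joined result has single spaces and no leading or trailing whitespace.
    return ' '.join(text.split())

sample = 'NEWLINE_TOKENHello   [world]!TAB_TOKEN (see: a/b)  '
print(clean_comment(sample))
```

It would be applied with `comments['comment'].apply(clean_comment)`, which makes one pass over the column instead of one pass per character.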

In [8]:
comments.query('attack')['comment'].head(10)

rev_id
801279                           Iraq is not good USA is bad
2702703    fuck off you little asshole. If you want to ta...
4632658          i have a dick, its bigger than yours hahaha
6545332    renault you sad little bpy for driving a renau...
6545351    renault you sad little bo for driving a renaul...
7977970    34, 30 Nov 2004 UTC Because you like to accuse...
8359431    You are not worth the effort. You are arguing ...
8724028    Yes, complain to your rabbi and then go shoot ...
8845700                     i am using the sandbox, ass wipe
8845736    GOD DAMN GOD DAMN it fuckers, i am using the G...
Name: comment, dtype: object

**b. What are the features you considered using? What features did you use in the final code?** 

I first tried a bag-of-words representation with word n-grams and character n-grams, individually and together. I also tried 2 different extraction methods (CountVectorizer followed by TfidfTransformer, or TfidfVectorizer alone). I did not use non-text features such as "year" because I did not think the year of a comment would help with classifying attacks.
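
Using word and character n-grams "together" can be done with scikit-learn's `FeatureUnion`, which concatenates the columns of several vectorizers. The sketch below uses a hypothetical three-sentence corpus and illustrative `ngram_range` values. It only shows how the two feature blocks end up side by side.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, for illustration only.
docs = ['you are wrong', 'thanks for the edit', 'you are great']

word_vec = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))
char_vec = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
features = FeatureUnion([('word', word_vec), ('char', char_vec)])

X = features.fit_transform(docs)

# Look up the fitted vectorizers by name to count the columns each contributes.
fitted = dict(features.transformer_list)
n_word = len(fitted['word'].vocabulary_)
n_char = len(fitted['char'].vocabulary_)
print(X.shape, n_word, n_char)
```

The combined matrix has one row per document and `n_word + n_char` columns, so a classifier placed after the union sees both kinds of n-gram.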

**c. What optimizations did you add in your code, if any?**

Optimizations mostly occurred during the text cleaning process. I also experimented with different numbers of folds for KFold in order to find one that was efficient in both time and space.

**d. What are the ML methods you tried out, and what were your best results with each method? Which was the best ML method you saw before tuning hyperparameters?**

These are the classifiers that I used:
1. LinearSVC
1. LinearDiscriminantAnalysis
1. MLPClassifier
1. RandomForestClassifier
1. MultinomialNB
1. SGDClassifier

The best ML method was SGDClassifier (**code below**), as it returned the highest ROC AUC score among all the methods I tried: 0.949, although that is still lower than the strawman's score. LinearSVC and MLPClassifier took a very long time to run, while LinearDiscriminantAnalysis made the kernel die. The best scores for MLPClassifier, RandomForestClassifier, and MultinomialNB were 0.930, 0.903, and 0.942 respectively.
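
A comparison like the one described above can be written as a loop over candidate models that share the same features and folds. This is a sketch under stated assumptions: the texts and labels are a hypothetical toy data set, only two of the classifiers are included to keep it fast, and `StratifiedKFold` is swapped in for the notebook's `KFold` so that every small fold contains both classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = attack, 0 = not an attack.
texts = ['you are an idiot', 'thanks for the helpful edit',
         'shut up you fool', 'please see the talk page',
         'you are a stupid fool', 'i fixed the citation',
         'what an idiot you are', 'the article looks good now'] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

candidates = {
    'LogisticRegression': LogisticRegression(),
    'MultinomialNB': MultinomialNB(),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, model in candidates.items():
    # Same vectorizer settings and same folds for every candidate.
    pipe = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='roc_auc')
    results[name] = scores.mean()
    print('%s: mean ROC AUC %.3f' % (name, results[name]))
```

The toy data is trivially separable, so the printed numbers say nothing about the real task. The point is only that each model is scored on identical splits.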

## SGDClassifier Model With KFold 5-fold

In [15]:
# SGDClassifier Model With KFold 

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
              'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
              'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
              'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
              'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
              'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
              'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
              'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
              'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
              'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

X = np.array(comments['comment'])
y = np.array(comments['attack'])

for train_comments, test_comments in kfold.split(X):
    X_train, X_test = X[train_comments], X[test_comments]
    y_train, y_test = y[train_comments], y[test_comments]

    classifier = SGDClassifier(loss='log')  # logistic loss; scikit-learn >= 1.1 calls this 'log_loss'


    clf = Pipeline([('vect', CountVectorizer(analyzer='word', stop_words=stop_words)),
                    ('tfidf', TfidfTransformer()), 
                    ("clf", classifier)])
        

    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)
    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))

Test ROC AUC: 0.946
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20387
        True       0.96      0.34      0.50      2786

    accuracy                           0.92     23173
   macro avg       0.94      0.67      0.73     23173
weighted avg       0.92      0.92      0.90     23173

[[20348    39]
 [ 1838   948]]
Test ROC AUC: 0.943
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20458
        True       0.96      0.34      0.51      2715

    accuracy                           0.92     23173
   macro avg       0.94      0.67      0.73     23173
weighted avg       0.92      0.92      0.90     23173

[[20421    37]
 [ 1779   936]]
Test ROC AUC: 0.949
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20465
        True       0.96      0.34      0.51      2708

    accuracy                           0.92     23173
   mac

**e. What hyper-parameter tuning did you do, and by how many percentage points did your accuracy go up?**

I experimented with the parameters for the classifier (alpha), for the CountVectorizer (ngram_range, max_features, min_df, the 'word' analyzer, stop_words), and for the TfidfTransformer (norm, smooth_idf, sublinear_tf, use_idf). Most parameters had only 1 or 2 candidate values because I did not want to substantially increase the running time.
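
The cost of a grid search grows with the product of the number of values per parameter, which is why each list was kept short. The sketch below counts the candidates for the grid used in the 4-fold tuning cell of this notebook. The three-word `stop_words` list is a placeholder, since only the number of values per parameter matters for the count.

```python
from sklearn.model_selection import ParameterGrid

stop_words = ['the', 'a', 'an']  # placeholder; the notebook uses a longer list

param_grid = dict(
    vect__analyzer=['word'],
    vect__max_features=[10000, 20000],
    vect__ngram_range=[(1, 2), (1, 3)],
    vect__stop_words=[stop_words],
    vect__min_df=[1, 2],
    tfidf__norm=['l2', None],
    tfidf__smooth_idf=[True, False],
    tfidf__sublinear_tf=[True, False],
    tfidf__use_idf=[True, False],
    clf__alpha=[1e-2, 1e-3],
)

# One candidate per combination; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(param_grid))
n_folds = 4
print(n_candidates, n_candidates * n_folds)
```

Eight parameters with two values each give 2^8 = 256 candidates, and 4 folds give 1024 fits, which matches the "Fitting 4 folds for each of 256 candidates" line printed by GridSearchCV. Adding one more two-valued parameter would double the work.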

For the list of stop words, I used the list from the following source: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/.

My accuracy remains the same as the strawman figure (0.94) but some other metrics improved, which I will talk about in part g. 

**f. What did you learn from the different metrics? Did you try cross-validation?**

To get the best model before hyperparameter tuning, I tried KFold cross-validation with 5 folds, which splits the data and target into 5 train/test sets. This was a reasonable number because every comment is used for testing once while the running time stays manageable. When tuning hyperparameters, I used GridSearchCV with 4-fold cross-validation for moderately good results with a faster runtime.
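
When GridSearchCV is run without a `scoring` argument it ranks candidates with the estimator's own `score` method, which is accuracy for a classifier. Passing `scoring='roc_auc'` makes `best_score_` a cross-validated ROC AUC instead. The sketch below shows this on a hypothetical toy data set with a deliberately tiny grid. It is not the search that was run in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = attack, 0 = not an attack.
texts = ['you are an idiot', 'thanks for the helpful edit',
         'shut up you fool', 'please see the talk page',
         'you are a stupid fool', 'i fixed the citation',
         'what an idiot you are', 'the article looks good now'] * 5
labels = [1, 0, 1, 0, 1, 0, 1, 0] * 5

pipe = Pipeline([('vect', TfidfVectorizer()),
                 ('clf', LogisticRegression())])

small_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0],
}

# scoring='roc_auc' makes best_score_ the mean ROC AUC over the 4 folds.
grid = GridSearchCV(pipe, param_grid=small_grid, cv=4, scoring='roc_auc')
grid.fit(texts, labels)
print(grid.best_params_, grid.best_score_)
```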

The metrics varied from fold to fold. For each model I picked the fold with the highest ROC AUC score, and then compared these scores to determine which model was best. I also checked whether accuracy, precision, or recall improved.
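
Besides the best fold, the mean and spread across folds are a common way to summarize cross-validation, because the best single fold tends to be an optimistic estimate. As a small worked example, the values below are the five per-fold ROC AUC scores printed by the 5-fold LogisticRegression run in this notebook.

```python
import numpy as np

# Per-fold test ROC AUC values from the "Strawman with KFold 5-fold" output.
fold_auc = np.array([0.960, 0.961, 0.963, 0.959, 0.953])

best_auc = fold_auc.max()
mean_auc = fold_auc.mean()
std_auc = fold_auc.std()

print('best fold: %.3f' % best_auc)
print('mean: %.4f, std: %.4f' % (mean_auc, std_auc))
```

Here the best fold is 0.963 while the mean over the five folds is about 0.959.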

**g. What are your best final Result Metrics? By how much is it better than the strawman figure? Which model gave you this performance?**

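The best final result metrics are shown below. The model that gave the best score was SGDClassifier tuned with GridSearchCV (4-fold). Its best grid-search score was 0.941. Note that `best_score_` is the mean cross-validated accuracy, because GridSearchCV uses the estimator's default scorer unless `scoring` is set, so it is not directly comparable to the strawman's test ROC AUC of 0.957. However, there were some improvements in other metrics.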

Accuracy remains the same at 0.94. The precision for "False" comments (comments classified as non-attack) went up by 0.01. For "True" comments (comments classified as attack), recall went up by 0.05 and the F1 score went from 0.69 to 0.72.

As a result, in the confusion matrix, the final result did slightly better for "True" comments and slightly worse for "False" comments than the strawman figure.
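
The link between a confusion matrix and the precision, recall, and F1 figures can be checked by hand. The sketch below uses the confusion matrix printed for the original strawman in this notebook, read with scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix printed for the original strawman:
# rows are true labels (False, True), columns are predicted labels.
cm = np.array([[20281, 141],
               [1234, 1522]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the comments predicted as attacks, how many were attacks
recall = tp / (tp + fn)      # of the real attacks, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print('precision %.2f, recall %.2f, f1 %.2f, accuracy %.2f'
      % (precision, recall, f1, accuracy))
```

These reproduce the "True" row of the strawman's classification report (0.92, 0.55, 0.69) and its accuracy of 0.94. The 1234 missed attacks in the bottom-left cell are what keep recall low even though accuracy looks high.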

Overall, the LogisticRegression model used in the strawman is still the best one. Originally its ROC AUC score was 0.957. However, I ran it with 5-fold KFold, using combined word and character TF-IDF features, and its best fold reached 0.963.

## FINAL RESULTS: Hyperparameters tuning for SGDClassifier, CV=4 

In [None]:
# Hyperparameters tuning for SGDClassifier, CV=4 

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

classifier = SGDClassifier(loss='log')  # logistic loss; scikit-learn >= 1.1 calls this 'log_loss'

stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
              'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
              'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
              'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
              'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
              'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
              'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
              'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
              'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
              'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

pipeline = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer(norm = 'l2')), 
                ("clf", classifier)])

param_grid = dict(
                vect__analyzer=['word'],
                vect__max_features=[10000,20000],
                vect__ngram_range=[(1,2),(1,3)],
                vect__stop_words=[stop_words],
                vect__min_df=[1,2],
                tfidf__norm=['l2', None],
                tfidf__smooth_idf=[True,False],
                tfidf__sublinear_tf=[True,False],
                tfidf__use_idf=[True,False],
                clf__alpha=[1e-2, 1e-3],
                )

grid_search = GridSearchCV(pipeline, cv=4,param_grid=param_grid, verbose=10, n_jobs=-1)
if __name__ == "__main__":
    # fit on TRAINING data
    grid_search.fit(train_comments['comment'], train_comments['attack'])

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
                            
    sys.stdout.flush()
    
# Run the grid_search transforms+prediction with best parameters on test data
y_pred = grid_search.predict(test_comments['comment'])
y_test = test_comments['attack']

auc = roc_auc_score(test_comments['attack'], grid_search.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# Get reports and metrics
print("Classification Report")
print(classification_report(y_test, y_pred))
                        
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
                              
p,r,f1,support = precision_recall_fscore_support(y_test, y_pred, average='binary')





Fitting 4 folds for each of 256 candidates, totalling 1024 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   22.5s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   24.9s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  3.1min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  6.5min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  8.4min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed:  

## Original strawman

In [35]:
# Original Strawman

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', LogisticRegression()),
])
clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)
trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
print(classification_report(trueVals, predictedVals))
print(confusion_matrix(trueVals,predictedVals))





Test ROC AUC: 0.957
              precision    recall  f1-score   support

       False       0.94      0.99      0.97     20422
        True       0.92      0.55      0.69      2756

    accuracy                           0.94     23178
   macro avg       0.93      0.77      0.83     23178
weighted avg       0.94      0.94      0.93     23178

[[20281   141]
 [ 1234  1522]]


## Strawman with KFold 5-fold

In [9]:
# Strawman with KFold 5-fold

stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
              'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
              'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
              'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
              'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
              'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
              'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
              'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
              'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
              'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

X = np.array(comments['comment'])
y = np.array(comments['attack'])

for train_comments, test_comments in kfold.split(X):
    X_train, X_test = X[train_comments], X[test_comments]
    y_train, y_test = y[train_comments], y[test_comments]


    classifier = LogisticRegression()

    vectorizerW = TfidfVectorizer(analyzer='word', stop_words=stop_words)
    vectorizerC = TfidfVectorizer(analyzer='char')
    combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
    clf = Pipeline([("features", combined_features), ("clf", classifier)])


    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)

    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))




Test ROC AUC: 0.960
              precision    recall  f1-score   support

       False       0.95      0.99      0.97     20387
        True       0.89      0.59      0.71      2786

    accuracy                           0.94     23173
   macro avg       0.92      0.79      0.84     23173
weighted avg       0.94      0.94      0.94     23173

[[20183   204]
 [ 1147  1639]]




Test ROC AUC: 0.961
              precision    recall  f1-score   support

       False       0.95      0.99      0.97     20458
        True       0.90      0.59      0.71      2715

    accuracy                           0.94     23173
   macro avg       0.93      0.79      0.84     23173
weighted avg       0.94      0.94      0.94     23173

[[20287   171]
 [ 1118  1597]]




Test ROC AUC: 0.963
              precision    recall  f1-score   support

       False       0.95      0.99      0.97     20465
        True       0.89      0.60      0.72      2708

    accuracy                           0.94     23173
   macro avg       0.92      0.80      0.84     23173
weighted avg       0.94      0.94      0.94     23173

[[20261   204]
 [ 1080  1628]]




Test ROC AUC: 0.959
              precision    recall  f1-score   support

       False       0.95      0.99      0.97     20486
        True       0.89      0.58      0.70      2687

    accuracy                           0.94     23173
   macro avg       0.92      0.78      0.84     23173
weighted avg       0.94      0.94      0.94     23173

[[20301   185]
 [ 1134  1553]]




Test ROC AUC: 0.953
              precision    recall  f1-score   support

       False       0.95      0.99      0.97     20478
        True       0.88      0.58      0.70      2694

    accuracy                           0.94     23172
   macro avg       0.91      0.78      0.83     23172
weighted avg       0.94      0.94      0.94     23172

[[20259   219]
 [ 1139  1555]]


**h. What is the most interesting thing you learned from doing the report?**

The most interesting thing was trying different models to figure out which one gave me the best results. They were fun to play with, and I also enjoyed tweaking the models to find the best one.

**i. What was the hardest thing to do?**

The hardest thing was trying to piece all the information together. I did not feel prepared even after watching the 2 lectures on sklearn and spending a significant amount of time looking through tutorials. However, I eventually started to get a feel for it after trying out different models. Another difficulty was that some models took a very long time to run: I had to stop the kernel once because one model had been running for almost 2 hours.

In [26]:
# Hyperparameter tuning for SGDClassifier Take 2

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

classifier = SGDClassifier(loss='log')  # logistic loss; scikit-learn >= 1.1 calls this 'log_loss'

stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
              'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
              'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
              'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
              'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
              'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
              'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
              'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
              'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
              'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

pipeline = Pipeline([('vect', CountVectorizer()),
                ('tfidf', TfidfTransformer(norm = 'l2')), 
                ("clf", classifier)])

param_grid = dict(
                vect__analyzer=['word'],
                vect__max_features=[5000,10000],
                vect__ngram_range=[(1,2),(1,3)],
                vect__stop_words=[stop_words],
                #vect__min_df=[1,2],
                tfidf__norm=['l1', 'l2', None],
                tfidf__smooth_idf=[True,False],
                tfidf__sublinear_tf=[True,False],
                tfidf__use_idf=[True,False],
                clf__alpha=[1e-2, 1e-3],
                #clf__class_weight=['balanced', None]
                )

grid_search = GridSearchCV(pipeline, cv=3,param_grid=param_grid, verbose=10, n_jobs=-1)
if __name__ == "__main__":
    # fit on TRAINING data
    grid_search.fit(train_comments['comment'], train_comments['attack'])

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
                            #print("\n")
    sys.stdout.flush()
# Run the grid_search transforms+prediction with best parameters on test data
y_pred = grid_search.predict(test_comments['comment'])
y_test = test_comments['attack']

auc = roc_auc_score(test_comments['attack'], grid_search.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# Get reports and metrics
print("Classification Report")
print(classification_report(y_test, y_pred))
                        
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
                              
p,r,f1,support = precision_recall_fscore_support(y_test, y_pred, average='binary')




Fitting 3 folds for each of 192 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   37.5s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   58.6s
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  8.3min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed:  

Best score: 0.941
Best parameters set:
	clf__alpha: 0.01
	tfidf__norm: None
	tfidf__smooth_idf: False
	tfidf__sublinear_tf: True
	tfidf__use_idf: True
	vect__analyzer: 'word'
	vect__max_features: 10000
	vect__ngram_range: (1, 3)
	vect__stop_words: ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not'

In [44]:
# Take 4
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

classifier = SGDClassifier(loss='log')  # logistic loss; scikit-learn >= 1.1 calls this 'log_loss'

stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
              'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
              'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
              'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
              'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
              'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
              'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
              'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
              'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
              'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']

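# Note: TfidfVectorizer already applies TF-IDF weighting, so the TfidfTransformer
# step below re-weights features that are already TF-IDF values.
pipeline = Pipeline([('vect', TfidfVectorizer()),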
                ('tfidf', TfidfTransformer()), 
                ("clf", classifier)])

param_grid = dict(
                vect__analyzer=['word','char'],
                vect__max_features=[10000,20000],
                vect__ngram_range=[(1,2),(1,3)],
                vect__stop_words=[stop_words],
                tfidf__norm=['l2', None],
                tfidf__smooth_idf=[True,False],
                tfidf__sublinear_tf=[True,False],
                tfidf__use_idf=[True,False],
                clf__alpha=[1e-2, 1e-3],
                #vect__ngram_range=[(1,2),(1,3)],
                #vect__stop_words=[stop_words],
                vect__min_df=[3,4],
                #tfidf__norm=['l1', 'l2', None],
                #tfidf__smooth_idf=[True,False],
                #tfidf__sublinear_tf=[True,False],
                #tfidf__use_idf=[True,False],
                #clf__alpha=[1e-2, 1e-3],
                clf__class_weight=['balanced',None]
                )

grid_search = GridSearchCV(pipeline, cv=2,param_grid=param_grid, verbose=10, n_jobs=-1)
if __name__ == "__main__":
    # fit on TRAINING data
    grid_search.fit(train_comments['comment'], train_comments['attack'])

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
                            #print("\n")
    sys.stdout.flush()
# Run the grid_search transforms+prediction with best parameters on test data
y_pred = grid_search.predict(test_comments['comment'])
y_test = test_comments['attack']

auc = roc_auc_score(test_comments['attack'], grid_search.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# Get reports and metrics
print("Classification Report")
print(classification_report(y_test, y_pred))
                        
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
                              
p,r,f1,support = precision_recall_fscore_support(y_test, y_pred, average='binary')



Fitting 2 folds for each of 1024 candidates, totalling 2048 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   18.6s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   31.4s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  5.0min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  9.2min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  9.9min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed: 11.7min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 12.5min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed: 1

Best score: 0.939
Best parameters set:
	clf__alpha: 0.01
	clf__class_weight: 'balanced'
	tfidf__norm: 'l2'
	tfidf__smooth_idf: True
	tfidf__sublinear_tf: False
	tfidf__use_idf: True
	vect__analyzer: 'word'
	vect__max_features: 20000
	vect__min_df: 3
	vect__ngram_range: (1, 2)
	vect__stop_words: ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 