# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in the [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has an aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussions on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

In [16]:
import pandas as pd
import urllib.request
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import numpy as np

In [2]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 

def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
# The downloads are commented out; uncomment them if the TSV files are not already on disk.
# download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
# download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')


In [3]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [4]:
len(annotations['rev_id'].unique())

115864

In [5]:
# label a comment as an attack if the majority of annotators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5
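
The majority-vote rule in the cell above can be checked on a small example. This is a minimal sketch on made-up annotations (the three `rev_id` values and their votes are hypothetical, not taken from the data set); it shows that `mean() > 0.5` is a strict majority, so a tied vote is labeled `False`.

```python
import pandas as pd

# Hypothetical annotations: three revisions, each judged by several annotators.
annotations = pd.DataFrame({
    "rev_id": [1, 1, 1, 2, 2, 2, 2, 3, 3],
    "attack": [1, 1, 0, 1, 0, 0, 1, 0, 0],
})

# Fraction of annotators who flagged each revision as an attack.
attack_rate = annotations.groupby("rev_id")["attack"].mean()

# Strict majority: the 2-2 tie on rev_id 2 is labeled False.
labels = attack_rate > 0.5
print(labels)
```

If ties should count as attacks, `>= 0.5` would be the alternative; the notebook uses the strict version.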

In [6]:
# join labels and comments
comments['attack'] = labels

In [7]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# remove special characters (the square brackets in the string are stripped too,
# because the loop visits every character of the string)
specialChar = r'[=@_!#$%^&*()<>?/\|}{~:`]'
for char in specialChar:
    comments['comment'] = comments['comment'].apply(lambda x: x.replace(char, " "))

# collapse runs of whitespace and trim leading/trailing spaces
comments['comment'] = comments['comment'].apply(lambda x: ' '.join(x.split()))

In [8]:
comments.query('attack')['comment'].head(10)

rev_id
801279                           Iraq is not good USA is bad
2702703    fuck off you little asshole. If you want to ta...
4632658          i have a dick, its bigger than yours hahaha
6545332    renault you sad little bpy for driving a renau...
6545351    renault you sad little bo for driving a renaul...
7977970    34, 30 Nov 2004 UTC Because you like to accuse...
8359431    You are not worth the effort. You are arguing ...
8724028    Yes, complain to your rabbi and then go shoot ...
8845700                     i am using the sandbox, ass wipe
8845736    GOD DAMN GOD DAMN it fuckers, i am using the G...
Name: comment, dtype: object

In [19]:
# fit a simple text classifier
# With FeatureUnion, only param: analyzer='word' and analyzer='char', AUC = 0.959
# With CountVect + TfidfTransformer, AUC = 0.957
# No FeatureUnion, 1 Vectorizer AUC = 0.958
# With FeatureUnion, AUC = 0.958


train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

# clf = Pipeline([
    #('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    #('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LogisticRegression()),
#])

classifier = LogisticRegression()
# vectorizer = TfidfVectorizer()
# stop_words = ['ourselves', 'hers', 'between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out',
#               'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 
#               'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him',
#               'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don',
#               'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 
#               'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 
#               'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 
#               'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 
#               'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being',
#               'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than']
vectorizerW = TfidfVectorizer(analyzer='word')
vectorizerC = TfidfVectorizer(analyzer='char')
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
clf = Pipeline([("features", combined_features), ("clf", classifier)])


clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
print(classification_report(trueVals, predictedVals))
print(confusion_matrix(trueVals,predictedVals))





Test ROC AUC: 0.959
              precision    recall  f1-score   support

       False       0.94      0.99      0.97     20422
        True       0.91      0.56      0.69      2756

    accuracy                           0.94     23178
   macro avg       0.93      0.77      0.83     23178
weighted avg       0.94      0.94      0.93     23178

[[20268   154]
 [ 1226  1530]]
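
The figures in the classification report for the `True` class follow directly from the confusion matrix printed above. The sketch below redoes that arithmetic, assuming scikit-learn's convention that rows are true labels and columns are predictions, both ordered `[False, True]`.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[20268, 154],
               [1226, 1530]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the comments flagged as attacks, how many were attacks
recall = tp / (tp + fn)      # of the real attacks, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The high accuracy mostly reflects the class imbalance (2756 attacks out of 23178 test comments); recall shows that a little under half of the attacks are missed at the default decision threshold.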


In [None]:
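# SVM model. Took too long to finish.
# Note: the code below uses SVC (kernel SVM), not LinearSVC. It is wrapped in
# CalibratedClassifierCV so that predict_proba is available.
from sklearn.calibration import CalibratedClassifierCV

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

svm = SVC()
classifier = CalibratedClassifierCV(svm)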
vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
clf = Pipeline([("features", combined_features), ("clf", classifier)])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
print(classification_report(trueVals, predictedVals))
print(confusion_matrix(trueVals,predictedVals))



In [None]:
# LinearDiscriminantAnalysis. Kernel died
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

classifier = LinearDiscriminantAnalysis()
vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])

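from sklearn.preprocessing import FunctionTransformer

# LinearDiscriminantAnalysis needs dense input, so the sparse TF-IDF matrix is
# converted to a dense array first. This is the likely cause of the kernel
# dying: the dense matrix is very large.
clf = Pipeline([("features", combined_features), 
                ("trans", FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
                ("clf", classifier)])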

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
print(classification_report(trueVals, predictedVals))
print(confusion_matrix(trueVals,predictedVals))





In [20]:
# MLPClassifier. Also slow 
# AUC: 0.930

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

classifier = MLPClassifier()
vectorizerW = TfidfVectorizer(analyzer='word')
vectorizerC = TfidfVectorizer(analyzer='char')
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
clf = Pipeline([("features", combined_features), ("clf", classifier)])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

trueVals = test_comments['attack']
predictedVals = clf.predict(test_comments['comment'])
print(classification_report(trueVals, predictedVals))
print(confusion_matrix(trueVals,predictedVals))




Test ROC AUC: 0.930
              precision    recall  f1-score   support

       False       0.95      0.98      0.96     20422
        True       0.78      0.64      0.70      2756

    accuracy                           0.94     23178
   macro avg       0.87      0.81      0.83     23178
weighted avg       0.93      0.94      0.93     23178

[[19939   483]
 [  994  1762]]


In [None]:
# RandomForestClassifier
# No FeatureUnion, 1 Vectorizer AUC = 0.894
# With FeatureUnion, AUC = 0.874
# With CountVectorizer + TfidfTransformer AUC = 0.904
# With FeatureUnion, only param: analyzer='word' and analyzer='char', AUC = 0.868

kfold = KFold(n_splits=5, shuffle=True, random_state=1)

X = np.array(comments['comment'])
y = np.array(comments['attack'])

# kfold.split yields arrays of row indices, not DataFrames
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    classifier = RandomForestClassifier()
    vectorizerW = TfidfVectorizer(analyzer='word')
    vectorizerC = TfidfVectorizer(analyzer='char')
    combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
    clf = Pipeline([("features", combined_features), ("clf", classifier)])


# vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
# vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
# combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
# clf = Pipeline([('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
                # ('tfidf', TfidfTransformer(norm = 'l2')), 
                # ("clf", classifier)])

    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    # clf = clf.fit(train_comments['comment'], train_comments['attack'])
    # auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)
    # trueVals = test_comments['attack']
    # predictedVals = clf.predict(test_comments['comment'])
    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))


In [45]:
# MultinomialNB
# With FeatureUnion, only param: analyzer='word' and analyzer='char', AUC = 0.858
# No FeatureUnion, 1 Vectorizer AUC = 0.837
# With FeatureUnion, AUC = 0.858 
# With CountVectorizer + TfidfTransformer AUC = 0.936



kfold = KFold(n_splits=5, shuffle=True, random_state=1)

X = np.array(comments['comment'])
y = np.array(comments['attack'])

# kfold.split yields arrays of row indices, not DataFrames
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    classifier = MultinomialNB()
# vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
# vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
# combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
# clf = Pipeline([("features", combined_features), ("clf", classifier)])

    clf = Pipeline([('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
                    ('tfidf', TfidfTransformer(norm = 'l2')), 
                    ("clf", classifier)])

    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    # clf = clf.fit(train_comments['comment'], train_comments['attack'])
    # auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)
    # trueVals = test_comments['attack']
    # predictedVals = clf.predict(test_comments['comment'])
    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))

Test ROC AUC: 0.939
              precision    recall  f1-score   support

       False       0.94      0.99      0.96     20387
        True       0.85      0.55      0.67      2786

    accuracy                           0.93     23173
   macro avg       0.90      0.77      0.82     23173
weighted avg       0.93      0.93      0.93     23173

[[20125   262]
 [ 1253  1533]]
Test ROC AUC: 0.935
              precision    recall  f1-score   support

       False       0.94      0.99      0.96     20458
        True       0.85      0.54      0.66      2715

    accuracy                           0.94     23173
   macro avg       0.90      0.77      0.81     23173
weighted avg       0.93      0.94      0.93     23173

[[20200   258]
 [ 1238  1477]]
Test ROC AUC: 0.940
              precision    recall  f1-score   support

       False       0.94      0.99      0.96     20465
        True       0.84      0.56      0.67      2708

    accuracy                           0.94     23173
   mac

In [47]:
# SGDClassifier
# With FeatureUnion, only param: analyzer='word' and analyzer='char', AUC = 0.939
# With CountVectorizer + TfidfTransformer AUC = 0.944
# With FeatureUnion, AUC = 0.939
# No FeatureUnion, 1 Vectorizer AUC = 0.940 

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
# train_comments = comments.query("split=='train'")
# test_comments = comments.query("split=='test'")

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

X = np.array(comments['comment'])
y = np.array(comments['attack'])

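# kfold.split yields arrays of row indices, not DataFrames
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # loss='log' is logistic regression; scikit-learn 1.1+ spells it 'log_loss'
    classifier = SGDClassifier(loss='log')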
# vectorizer = TfidfVectorizer()
# vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
# vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
# combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
# clf = Pipeline([("vect", vectorizer), ("clf", classifier)])

    clf = Pipeline([('vect', CountVectorizer(analyzer='word')),
                    ('tfidf', TfidfTransformer()), 
                    ("clf", classifier)])
        

    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    # clf = clf.fit(train_comments['comment'], train_comments['attack'])
    # auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)
    # trueVals = test_comments['attack']
    # predictedVals = clf.predict(test_comments['comment'])
    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))

Test ROC AUC: 0.940
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20387
        True       0.94      0.38      0.54      2786

    accuracy                           0.92     23173
   macro avg       0.93      0.69      0.75     23173
weighted avg       0.92      0.92      0.91     23173

[[20320    67]
 [ 1726  1060]]
Test ROC AUC: 0.941
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20458
        True       0.94      0.37      0.53      2715

    accuracy                           0.92     23173
   macro avg       0.93      0.69      0.75     23173
weighted avg       0.92      0.92      0.91     23173

[[20390    68]
 [ 1699  1016]]
Test ROC AUC: 0.944
              precision    recall  f1-score   support

       False       0.92      1.00      0.96     20465
        True       0.95      0.39      0.55      2708

    accuracy                           0.93     23173
   mac

In [49]:
# SGDClassifier
# With FeatureUnion, only param: analyzer='word' and analyzer='char'

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
# train_comments = comments.query("split=='train'")
# test_comments = comments.query("split=='test'")

# x_train = train_comments['comment']
# y_train = train_comments['attack']
# x_test = test_comments['comment']
# y_test = test_comments['attack']

X = np.array(comments['comment'])
y = np.array(comments['attack'])

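# kfold.split yields arrays of row indices, not DataFrames
for train_idx, test_idx in kfold.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # loss='log' is logistic regression; scikit-learn 1.1+ spells it 'log_loss'
    classifier = SGDClassifier(loss='log')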
    vectorizerW = TfidfVectorizer(analyzer='word')
    vectorizerC = TfidfVectorizer(analyzer='char')
    combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
    clf = Pipeline([("features", combined_features), ("clf", classifier)])

    #clf = Pipeline([('vect', CountVectorizer(analyzer='word')),
                    #('tfidf', TfidfTransformer()), 
                    #("clf", classifier)])
        

    clf = clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    # clf = clf.fit(train_comments['comment'], train_comments['attack'])
    # auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
    print('Test ROC AUC: %.3f' %auc)
    trueVals = y_test
    predictedVals = clf.predict(X_test)
    # trueVals = test_comments['attack']
    # predictedVals = clf.predict(test_comments['comment'])
    print(classification_report(trueVals, predictedVals))
    print(confusion_matrix(trueVals,predictedVals))

Test ROC AUC: 0.937
              precision    recall  f1-score   support

       False       0.92      0.99      0.96     20387
        True       0.91      0.41      0.56      2786

    accuracy                           0.92     23173
   macro avg       0.92      0.70      0.76     23173
weighted avg       0.92      0.92      0.91     23173

[[20275   112]
 [ 1647  1139]]
Test ROC AUC: 0.940
              precision    recall  f1-score   support

       False       0.93      1.00      0.96     20458
        True       0.93      0.41      0.57      2715

    accuracy                           0.93     23173
   macro avg       0.93      0.70      0.76     23173
weighted avg       0.93      0.93      0.91     23173

[[20368    90]
 [ 1604  1111]]
Test ROC AUC: 0.944
              precision    recall  f1-score   support

       False       0.93      0.99      0.96     20465
        True       0.92      0.43      0.59      2708

    accuracy                           0.93     23173
   mac

In [None]:
# Hyperparameter tuning for SGDClassifier Take 1

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")



# X = np.array(comments['comment'])
# y = np.array(comments['attack'])

# for train_comments, test_comments in kfold.split(X):
#     X_train, X_test = X[train_comments], X[test_comments]
#     y_train, y_test = y[train_comments], y[test_comments]

X_train = train_comments['comment']
y_train = train_comments['attack']
X_test = test_comments['comment']
y_test = test_comments['attack']

# loss='log' is logistic regression; scikit-learn 1.1+ spells it 'log_loss'
classifier = SGDClassifier(loss='log')
vectorizerW = TfidfVectorizer(analyzer='word')
vectorizerC = TfidfVectorizer(analyzer='char')
combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])
pipeline = Pipeline([("features", combined_features), ("clf", classifier)])
param_grid = dict(
            features__word__max_features=[2000,4000],
            features__char__max_features=[2000,4000],
            features__word__min_df=[2,3],
            features__char__min_df=[2,3],
            features__word__ngram_range=[(1,2), (1,3)],
            features__char__ngram_range=[(3,3), (3,4), (4,4)],
            #features__word__stop_words=['english', None],
            features__word__lowercase=[True],
            features__char__lowercase=[True],
            clf__alpha=[1e-2, 1e-3]
            )

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=3,verbose=10, n_jobs=-1)
if __name__ == "__main__":
# fit on TRAINING data
    grid_search.fit(X_train, y_train)

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
                            #print("\n")
    
# Run the grid_search transforms+prediction with best parameters on test data
y_pred = grid_search.predict(X_test)

 
# Get reports and metrics
print("Classification Report")
print(classification_report(y_test, y_pred))
                        
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
                              
p,r,f1,support = precision_recall_fscore_support(y_test, y_pred, average='binary')

Fitting 3 folds for each of 192 candidates, totalling 576 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  7.2min
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed: 12.7min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed: 15.6min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed: 19.0min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed: 23.0min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed: 25.6min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed: 28.4min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 32.7min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed: 3

In [17]:
# Hyperparameter tuning for SGDClassifier Take 2

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")

# loss='log' is logistic regression; scikit-learn 1.1+ spells it 'log_loss'
classifier = SGDClassifier(loss='log')

# vectorizerW = TfidfVectorizer(min_df=1, analyzer='word', stop_words=None, lowercase=True)
# vectorizerC = TfidfVectorizer(min_df=1, analyzer='char', stop_words=None, lowercase=True)
# combined_features = FeatureUnion([("word", vectorizerW), ("char", vectorizerC)])

#clf = Pipeline([("vect", vectorizer), ("clf", classifier)])


# countVect = CountVectorizer(max_features = 10000, ngram_range = (1,2))
# tfidfTrans = TfidfTransformer(norm = 'l2')
# combined_features = FeatureUnion([("vect", countVect), ("tfidf", tfidfTrans)])
# pipeline = Pipeline([('features', combined_features), 
#                 ("classifier", classifier)])

pipeline = Pipeline([('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
                ('tfidf', TfidfTransformer(norm = 'l2')), 
                ("clf", classifier)])

param_grid = dict(
#                 features__word__max_features=[2000,4000],
#                 features__char__max_features=[2000,4000],
#                 features__word__min_df=[2,3],
#                 features__char__min_df=[2,3],
#                 features__word__ngram_range=[(1,2), (1,3)],
#                 features__char__ngram_range=[(3,3), (3,4), (4,4)],
                # features__word__stop_words=[stop_words, 'english', None],
#                 features__word__lowercase=[True],
#                 features__char__lowercase=[True],
                vect__ngram_range=[(1,2),(1,3)],
                tfidf__use_idf=[True,False],
                clf__alpha=[1e-2, 1e-3]
                )

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10, n_jobs=-1)
if __name__ == "__main__":
# fit on TRAINING data
    grid_search.fit(train_comments['comment'], train_comments['attack'])

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(param_grid.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
                            #print("\n")
    
# Run the grid_search transforms+prediction with best parameters on test data
y_pred = grid_search.predict(test_comments['comment'])
y_test = test_comments['attack']

 
# Get reports and metrics
print("Classification Report")
print(classification_report(y_test, y_pred))
                        
print("Confusion Matrix")
print(confusion_matrix(y_test, y_pred))
                              
p,r,f1,support = precision_recall_fscore_support(y_test, y_pred, average='binary')




[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.


Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   23.0s
[Parallel(n_jobs=-1)]: Done   4 out of  24 | elapsed:   24.2s remaining:  2.0min
[Parallel(n_jobs=-1)]: Done   7 out of  24 | elapsed:   48.8s remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  10 out of  24 | elapsed:   51.0s remaining:  1.2min
[Parallel(n_jobs=-1)]: Done  13 out of  24 | elapsed:   53.4s remaining:   45.2s
[Parallel(n_jobs=-1)]: Done  16 out of  24 | elapsed:  1.2min remaining:   34.8s
[Parallel(n_jobs=-1)]: Done  19 out of  24 | elapsed:  1.2min remaining:   18.8s
[Parallel(n_jobs=-1)]: Done  22 out of  24 | elapsed:  1.4min remaining:    7.8s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  1.5min finished


Best score: 0.899
Best parameters set:
	clf__alpha: 0.001
	tfidf__use_idf: False
	vect__ngram_range: (1, 3)
Classification Report
              precision    recall  f1-score   support

       False       0.89      1.00      0.94     20422
        True       0.92      0.12      0.21      2756

    accuracy                           0.89     23178
   macro avg       0.91      0.56      0.58     23178
weighted avg       0.90      0.89      0.86     23178

Confusion Matrix
[[20394    28]
 [ 2421   335]]
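
The grid search above does not pass `scoring`, so `GridSearchCV` ranks candidates by the classifier's default score, which is accuracy. With roughly 12% attacks in the test split, a model that rarely predicts `True` can still score about 0.89, which matches the low recall in the report. One possible alternative, sketched here on a tiny made-up corpus, is to rank candidates by ROC AUC. The texts, labels, and grid values below are assumptions for illustration only, and `LogisticRegression` stands in for the notebook's `SGDClassifier`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the comments; 1 = attack, 0 = not.
nice = ["thanks for the helpful edit", "great work on this article",
        "please add a source for this claim", "i agree with your change"]
nasty = ["you are a stupid idiot", "shut up you moron",
         "nobody wants your stupid edits", "get lost idiot"]
texts = (nice + nasty) * 5
y = ([0] * len(nice) + [1] * len(nasty)) * 5

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Illustrative grid; the values are not tuned for the real data.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid=param_grid,
    scoring="roc_auc",  # rank candidates by ROC AUC instead of accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
)
grid_search.fit(texts, y)

print(grid_search.best_params_)
print("Best CV ROC AUC: %.3f" % grid_search.best_score_)
```

Using the same metric for model selection and for the final report also makes the "Best score" line directly comparable with the `Test ROC AUC` values printed in the earlier cells.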


In [99]:
# correctly classify nice comment
clf.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [100]:
# correctly classify nasty comment
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])

# This notebook was completed using Python 3

**a. What are the text cleaning methods you tried? What are the ones you have included in the final code?**
I removed special characters such as `%^&()<>?/\|}{~:` by replacing each one with a space. I then split each comment into a list of words and joined the words back together with single spaces, which also removes leading and trailing whitespace. This method is included in the final code.

I also tried removing stop words, but it did not work out, so I left stop-word handling for the hyperparameter tuning part.
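
The cleaning steps described above could also be collected into a single helper. This is a sketch rather than the code used in the notebook: the name `clean_comment` and the sample string are made up, and the regular expression is an assumed equivalent of the character-by-character loop.

```python
import re

# Same characters the notebook strips, written as a regex character class
# (the square brackets and the backslash are included).
SPECIAL_CHARS = re.compile(r"[\[\]=@_!#$%^&*()<>?/\\|}{~:`]")

def clean_comment(text):
    """Replace placeholder tokens and special characters with spaces,
    then collapse runs of whitespace."""
    # The tokens must be replaced before '_' is stripped, as in the notebook.
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    text = SPECIAL_CHARS.sub(" ", text)
    return " ".join(text.split())

print(clean_comment("NEWLINE_TOKEN== Hello ==TAB_TOKEN(see [[Talk:Foo]])  thanks!"))
```

It could be applied with `comments['comment'].apply(clean_comment)`; a single regex pass avoids one full pass over the column per special character.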

**b. What are the features you considered using? What features did you use in the final code?** 
I used a bag-of-words representation with word and character n-grams, both individually and combined. I also tried two extraction methods: CountVectorizer followed by TfidfTransformer, and TfidfVectorizer. I did not use non-text features such as "year" because I did not think the year of a comment would help classify attacks.
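
As a rough illustration of these feature choices, the sketch below uses a made-up three-document corpus (an assumption, not the Wikipedia data). It checks that `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as `TfidfVectorizer` when both use default settings, and that a `FeatureUnion` of word and character vectorizers simply places the two feature sets side by side.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import FeatureUnion, make_pipeline

# Hypothetical toy corpus.
docs = ["you are great", "you are awful", "great article thanks"]

# Route 1: raw counts followed by TF-IDF weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(docs)

# Route 2: the single-step equivalent.
X_one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(X_two_step.toarray(), X_one_step.toarray()))

# Word-level and character-level TF-IDF features side by side.
word = TfidfVectorizer(analyzer="word")
char = TfidfVectorizer(analyzer="char")
union = FeatureUnion([("word", word), ("char", char)])
X_union = union.fit_transform(docs)

n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)
print(X_union.shape, n_word, n_char)
```

The two routes only differ in the notebook because they were configured differently, for example `max_features=10000` and `ngram_range=(1,2)` on the `CountVectorizer` route.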

**c. What optimizations did you add in your code, if any?**


**d. What are the ML methods you tried out, and what were your best results with each method? Which was the best ML method you saw before tuning hyperparameters?**
These are the classifiers that I used:
1. LinearSVC
1. LinearDiscriminantAnalysis
1. MLPClassifier
1. RandomForestClassifier
1. MultinomialNB
1. SGDClassifier

The best method was SGDClassifier, which returned the highest ROC AUC among the methods I tried (0.944), although this is still lower than that of the strawman code. LinearSVC and MLPClassifier took a very long time to run, and LinearDiscriminantAnalysis made the kernel die. The best ROC AUC scores for MLPClassifier, RandomForestClassifier, and MultinomialNB were 0.930, 0.903, and 0.936, respectively.

**e. What hyper-parameter tuning did you do, and by how many percentage points did your accuracy go up?**

**f. What did you learn from the different metrics? Did you try cross-validation?**
To pick the best model before hyperparameter tuning, I used KFold cross-validation with 5 folds. The data and labels are split into 5 equal parts; each part is used once as the test set while the other 4 are used for training.
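
The splitting can be seen on a small example. This sketch uses ten made-up samples with an imbalanced label vector (both are assumptions for illustration). It also shows `StratifiedKFold`, which was not used in the notebook but keeps the class ratio similar in every fold and may be worth considering for imbalanced labels.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(10)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # imbalanced, like the attack labels

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
seen = []
for train_idx, test_idx in kfold.split(X):
    # Each fold holds out 2 of the 10 samples and trains on the other 8.
    print("train:", train_idx, "test:", test_idx)
    seen.extend(test_idx)

# Every sample is used for testing exactly once across the 5 folds.
print(sorted(seen) == list(range(10)))

# StratifiedKFold (not used in the notebook) keeps the class ratio in each fold.
positives_per_fold = []
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    positives_per_fold.append(int(y[test_idx].sum()))
print("positives in each test fold:", positives_per_fold)
```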

**g. What are your best final Result Metrics? By how much is it better than the strawman figure? Which model gave you this performance?**

**h. What is the most interesting thing you learned from doing the report?**
The most interesting thing was trying different models to figure out which one gave me the best results.

**i. What was the hardest thing to do?**
The hardest thing was trying to piece all the information together. I did not feel prepared even after watching the two lectures on sklearn and spending a significant amount of time on tutorials. However, I eventually got a feel for it after trying out different models. Another difficulty was that some models took a very long time to run; I had to stop the kernel because one model ran for almost 2 hours.