# Wikipedia Talk Data - Getting Started

This notebook gives an introduction to working with the various data sets in [Wikipedia
Talk](https://figshare.com/projects/Wikipedia_Talk/16731) project on Figshare. The release includes:

1. a large historical corpus of discussion comments on Wikipedia talk pages
2. a sample of over 100k comments with human labels for whether the comment contains a personal attack
3. a sample of over 100k comments with human labels for whether the comment has aggressive tone

Please refer to our [wiki](https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) for documentation of the schema of each data set and our [research paper](https://arxiv.org/abs/1610.08914) for documentation on the data collection and modeling methodology. 

In this notebook we show how to build a simple classifier for detecting personal attacks and apply the classifier to a random sample of the comment corpus to see whether discussions on user pages have more personal attacks than discussion on article pages.

## Building a classifier for personal attacks
In this section we will train a simple bag-of-words classifier for personal attacks using the [Wikipedia Talk Labels: Personal Attacks]() data set.

In [2]:
import re, string, nltk
import pandas as pd
import urllib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from nltk.corpus import stopwords

In [3]:
# download annotated comments and annotations

ANNOTATED_COMMENTS_URL = 'https://ndownloader.figshare.com/files/7554634' 
ANNOTATIONS_URL = 'https://ndownloader.figshare.com/files/7554637' 


def download_file(url, fname):
    urllib.request.urlretrieve(url, fname)

                
download_file(ANNOTATED_COMMENTS_URL, 'attack_annotated_comments.tsv')
download_file(ANNOTATIONS_URL, 'attack_annotations.tsv')

In [4]:
comments = pd.read_csv('attack_annotated_comments.tsv', sep = '\t', index_col = 0)
annotations = pd.read_csv('attack_annotations.tsv',  sep = '\t')

In [5]:
len(annotations['rev_id'].unique())

115864

In [6]:
# labels a comment as an atack if the majority of annoatators did so
labels = annotations.groupby('rev_id')['attack'].mean() > 0.5

In [7]:
# join labels and comments
comments['attack'] = labels

In [8]:
# remove newline and tab tokens
comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
#remove all the punctuations
comments['comment'] = comments['comment'].apply(lambda x: x.replace("/[^a-z \d \' \. \, \" \: \; \? \- \!  \s]/i", " "))
#remove all the hyperlinks
comments['comment'] = comments['comment'].apply(lambda x: x.replace(r"http\S+", ""))
comments['comment'] = comments['comment'].apply(lambda x: x.replace("[0-9]", ""))
comments.head()

Unnamed: 0_level_0,comment,year,logged_in,ns,sample,split,attack
rev_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
37675,`- This is not ``creative``. Those are the di...,2002,False,article,random,train,False
44816,` :: the term ``standard model`` is itself le...,2002,False,article,random,train,False
49851,"True or false, the situation as of March 200...",2002,False,article,random,train,False
89320,"Next, maybe you could work on being less cond...",2002,True,article,random,dev,False
93890,This page will need disambiguation.,2002,True,article,random,train,False


In [9]:
comments.query('attack')['comment'].head()

rev_id
801279             Iraq is not good  ===  ===  USA is bad   
2702703      ____ fuck off you little asshole. If you wan...
4632658         i have a dick, its bigger than yours! hahaha
6545332      == renault ==  you sad little bpy for drivin...
6545351      == renault ==  you sad little bo for driving...
Name: comment, dtype: object

In [10]:
#FINAL SUBMITTED ONE
# fit a simple text classifier

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


#English Stemmer to stem words to basic form.
from nltk.stem.snowball import EnglishStemmer



stemmer = EnglishStemmer()
#analyzer an instance to build analyzer for countvectorizer
analyzer = CountVectorizer().build_analyzer()

#method for getting stem words
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

#param grid to be used for tuning the C parameter using GridSearchCV
#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

clf = Pipeline([
    #Feature Extraction Models-Tfidfvectorizer and countvectorizer
    #('vect', TfidfVectorizer(max_df=1.0, min_df=1, analyzer=stemmed_words)),
    ('vect', CountVectorizer(max_df = 1.0, min_df=1, max_features=10000, ngram_range=(1,4), 
                             analyzer=stemmed_words, strip_accents={'ascii'})),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LinearSVC(random_state=0)),
    ('clf', LogisticRegression(penalty='l2', dual=False)),
    #('clf',GridSearchCV(LogisticRegression(penalty='l2'), param_grid))
    #('clf', RandomForestClassifier(n_estimators=200,max_depth=, random_state=1)),
    #('clf', MLPClassifier()),
])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
predicted_values = clf.predict(test_comments['comment'])
precision_recall_fscore = precision_recall_fscore_support(test_comments['attack'], predicted_values)
cmat = confusion_matrix(test_comments['attack'], predicted_values)
print(classification_report(test_comments['attack'], predicted_values))
             
print('Test ROC AUC: %.3f' %auc)
print('precision, recall, fscore: ', precision_recall_fscore)
print('confusion matrix: ', cmat)

             precision    recall  f1-score   support

      False       0.95      0.99      0.97     20422
       True       0.92      0.58      0.71      2756

avg / total       0.94      0.94      0.94     23178

Test ROC AUC: 0.962
precision, recall, fscore:  (array([0.94549354, 0.91681109]), array([0.99294878, 0.57583454]), array([0.96864028, 0.70737687]), array([20422,  2756], dtype=int64))
confusion matrix:  [[20278   144]
 [ 1169  1587]]


In [11]:
#TRIED OUT RANDOM CLASSIFIER. I TRIED OUT WITH MANY DIFFERENT HYPERPARAMETERS as stated in readme. BUT INCLUDING JUST THE ONE 
#WHICH GAVE THE BEST RESULT (that was 0.957)

train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


#English Stemmer to stem words to basic form.
from nltk.stem.snowball import EnglishStemmer



stemmer = EnglishStemmer()
#analyzer an instance to build analyzer for countvectorizer
analyzer = CountVectorizer().build_analyzer()

#method for getting stem words
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

#param grid to be used for tuning the C parameter using GridSearchCV
#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

clf = Pipeline([
    #Feature Extraction Models-Tfidfvectorizer and countvectorizer
    #('vect', TfidfVectorizer(max_df=1.0, min_df=1, analyzer=stemmed_words)),
    ('vect', CountVectorizer(max_df = 1.0, min_df=1, max_features=10000, ngram_range=(1,4), 
                             analyzer=stemmed_words, strip_accents={'ascii'})),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LogisticRegression(penalty='l2', dual=False)),
    #('clf',GridSearchCV(LogisticRegression(penalty='l2'), param_grid))
    ('clf', RandomForestClassifier(n_estimators=200,max_depth=100, random_state=1)),
    #('clf', MLPClassifier()),
])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict_proba(test_comments['comment'])[:, 1])
predicted_values = clf.predict(test_comments['comment'])
precision_recall_fscore = precision_recall_fscore_support(test_comments['attack'], predicted_values)
cmat = confusion_matrix(test_comments['attack'], predicted_values)
print(classification_report(test_comments['attack'], predicted_values))
             
print('Test ROC AUC: %.3f' %auc)
print('precision, recall, fscore: ', precision_recall_fscore)
print('confusion matrix: ', cmat)

             precision    recall  f1-score   support

      False       0.93      1.00      0.96     20422
       True       0.95      0.46      0.62      2756

avg / total       0.93      0.93      0.92     23178

Test ROC AUC: 0.957
precision, recall, fscore:  (array([0.93137524, 0.95151515]), array([0.99686612, 0.45573295]), array([0.96300851, 0.61629048]), array([20422,  2756], dtype=int64))
confusion matrix:  [[20358    64]
 [ 1500  1256]]


In [15]:
#TRIED OUT Linear SVC. ROC Score=0.817
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


#English Stemmer to stem words to basic form.
from nltk.stem.snowball import EnglishStemmer



stemmer = EnglishStemmer()
#analyzer an instance to build analyzer for countvectorizer
analyzer = CountVectorizer().build_analyzer()

#method for getting stem words
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

#param grid to be used for tuning the C parameter using GridSearchCV
#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

clf = Pipeline([
    #Feature Extraction Models-Tfidfvectorizer and countvectorizer
    #('vect', TfidfVectorizer(max_df=1.0, min_df=1, analyzer=stemmed_words)),
    ('vect', CountVectorizer(max_df = 1.0, min_df=1, max_features=10000, ngram_range=(1,4), 
                             analyzer=stemmed_words, strip_accents={'ascii'})),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LogisticRegression(penalty='l2', dual=False)),
    #('clf',GridSearchCV(LogisticRegression(penalty='l2'), param_grid))
    #('clf', RandomForestClassifier(n_estimators=200,max_depth=100, random_state=1)),
    ('clf', svm.LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     verbose=0))
    #('clf', MLPClassifier()),
])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict(test_comments['comment']))
predicted_values = clf.predict(test_comments['comment'])
precision_recall_fscore = precision_recall_fscore_support(test_comments['attack'], predicted_values)
cmat = confusion_matrix(test_comments['attack'], predicted_values)
print(classification_report(test_comments['attack'], predicted_values))
             
print('Test ROC AUC: %.3f' %auc)
print('precision, recall, fscore: ', precision_recall_fscore)
print('confusion matrix: ', cmat)

             precision    recall  f1-score   support

      False       0.95      0.99      0.97     20422
       True       0.88      0.65      0.74      2756

avg / total       0.94      0.95      0.94     23178

Test ROC AUC: 0.817
precision, recall, fscore:  (array([0.95397569, 0.87530682]), array([0.98756243, 0.6469521 ]), array([0.97047855, 0.74400167]), array([20422,  2756], dtype=int64))
confusion matrix:  [[20168   254]
 [  973  1783]]


In [18]:
#TRIED OUT Multinomial NB. ROC score=0.718
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


#English Stemmer to stem words to basic form.
from nltk.stem.snowball import EnglishStemmer



stemmer = EnglishStemmer()
#analyzer an instance to build analyzer for countvectorizer
analyzer = CountVectorizer().build_analyzer()

#method for getting stem words
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

#param grid to be used for tuning the C parameter using GridSearchCV
#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

clf = Pipeline([
    #Feature Extraction Models-Tfidfvectorizer and countvectorizer
    #('vect', TfidfVectorizer(max_df=1.0, min_df=1, analyzer=stemmed_words)),
    ('vect', CountVectorizer(max_df = 1.0, min_df=1, max_features=10000, ngram_range=(1,4), 
                             analyzer=stemmed_words, strip_accents={'ascii'})),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LogisticRegression(penalty='l2', dual=False)),
    #('clf',GridSearchCV(LogisticRegression(penalty='l2'), param_grid))
    #('clf', RandomForestClassifier(n_estimators=200,max_depth=100, random_state=1)),
    #('clf', svm.LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
    # intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     #multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     #verbose=0))
    ('clf', MultinomialNB()),
    #('clf', MLPClassifier()),
])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict(test_comments['comment']))
predicted_values = clf.predict(test_comments['comment'])
precision_recall_fscore = precision_recall_fscore_support(test_comments['attack'], predicted_values)
cmat = confusion_matrix(test_comments['attack'], predicted_values)
print(classification_report(test_comments['attack'], predicted_values))
             
print('Test ROC AUC: %.3f' %auc)
print('precision, recall, fscore: ', precision_recall_fscore)
print('confusion matrix: ', cmat)

             precision    recall  f1-score   support

      False       0.93      1.00      0.96     20422
       True       0.94      0.44      0.60      2756

avg / total       0.93      0.93      0.92     23178

Test ROC AUC: 0.718
precision, recall, fscore:  (array([0.92934137, 0.94158879]), array([0.99632749, 0.43867925]), array([0.96166934, 0.59851485]), array([20422,  2756], dtype=int64))
confusion matrix:  [[20347    75]
 [ 1547  1209]]


In [20]:
#TRIED OUT MLP Classifier. ROC SCORE=0.815
train_comments = comments.query("split=='train'")
test_comments = comments.query("split=='test'")


#English Stemmer to stem words to basic form.
from nltk.stem.snowball import EnglishStemmer



stemmer = EnglishStemmer()
#analyzer an instance to build analyzer for countvectorizer
analyzer = CountVectorizer().build_analyzer()

#method for getting stem words
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

#param grid to be used for tuning the C parameter using GridSearchCV
#param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000] }

clf = Pipeline([
    #Feature Extraction Models-Tfidfvectorizer and countvectorizer
    #('vect', TfidfVectorizer(max_df=1.0, min_df=1, analyzer=stemmed_words)),
    ('vect', CountVectorizer(max_df = 1.0, min_df=1, max_features=10000, ngram_range=(1,4), 
                             analyzer=stemmed_words, strip_accents={'ascii'})),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    #('clf', LogisticRegression(penalty='l2', dual=False)),
    #('clf',GridSearchCV(LogisticRegression(penalty='l2'), param_grid))
    #('clf', RandomForestClassifier(n_estimators=200,max_depth=100, random_state=1)),
    #('clf', svm.LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     #intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     #multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
     #verbose=0))
    ('clf', MLPClassifier()),
])

clf = clf.fit(train_comments['comment'], train_comments['attack'])
auc = roc_auc_score(test_comments['attack'], clf.predict(test_comments['comment']))
predicted_values = clf.predict(test_comments['comment'])
precision_recall_fscore = precision_recall_fscore_support(test_comments['attack'], predicted_values)
cmat = confusion_matrix(test_comments['attack'], predicted_values)
print(classification_report(test_comments['attack'], predicted_values))
             
print('Test ROC AUC: %.3f' %auc)
print('precision, recall, fscore: ', precision_recall_fscore)
print('confusion matrix: ', cmat)

             precision    recall  f1-score   support

      False       0.95      0.97      0.96     20422
       True       0.75      0.66      0.70      2756

avg / total       0.93      0.93      0.93     23178

Test ROC AUC: 0.815
precision, recall, fscore:  (array([0.95469427, 0.75373754]), array([0.97096269, 0.65856313]), array([0.96275976, 0.70294345]), array([20422,  2756], dtype=int64))
confusion matrix:  [[19829   593]
 [  941  1815]]


In [11]:
# correctly classify nice comment
clf.predict(['Thanks for you contribution, you did a great job!'])

array([False])

In [12]:
# correctly classify nasty comment
clf.predict(['People as stupid as you should not edit Wikipedia!'])

array([ True])

##Questions and Answers
1. What are the text cleaning methods you tried? What are the ones you have included in the final code?
Ans:
I did text cleaning to remove any numbers from file, all the hyperlinks from the comments file, digits from the file, I removed the stop words from the comments file. Out of all other methods, the last method proved out to be the most useful and impactful.

2. What are the features you considered using ? What features did you use in the final code?
Ans: I worked with both TfIdfVectorizer and CountVectorizer but CountVectorizer was giving better results , thus I tried using CountVectorizer with different parameters, max_df, min_df, and by using analyser to use English stemmer , and also strip_accents{“ascii”} to take just ascii values
3.  What optimizations did you add in your code ? 
Ans: I performed data cleaning by removing any numbers, hyperlinks, new lines, tabs. I also tried to perform stemming using English Stemmer. That really improved my results. 
Once I got good results with Logistic Regression, I tried tuning Logistic Regression’s hyperparameters - C, Dual.  Also Tried to get the optimum value of C using GridSearchCV. 
4. What are the ML methods you tried out, and what were your best results with each method? Which was the best ML method you saw before tuning hyperparameters?
Ans: I tried out MultinomialNB, SVM, LinearSVC, Logistic Regression, Random Forest Classifier, Multiperceptron
MultinomialNB: 0.932
SVM: 0.942
Random Forest: 0.726(maxdepth = 2)
0.813(Maxdepth =5)
0.833(Maxdepth=8)
0.874(n_estimators=20)
0.879(n_estimators=25, max_depth=10)
0.924(n_estimators=25, max_depth=20)
0.934(n_estimators=80, max_depth=30)
0.942(n_estimators=100, max_depth=50)
0.952(n_estimators=200, max_depth=100)


MLPClassifier:
Default =0.500
Logistic Regression (with tfidfvectoriser): 0.954
Logistic Regression(with countvectorizer): 0.957
Logistic Regression(With countvectorizer and max_features=100000) : 0.959

I also tried using GridSearchCV with Logistic Regression, but It decreased my score from 0.962 to just 0.957

The ML method that gave the best results was Logistic Regression. 



5. What hyperparameter tuning did you do ?
I tried tweaking the C value of  0.1, 100, 1000 But all of this declined the roc score from 0.962. 
I also tried using GridSearchCV. But that also didn’t help in increasing the score.
When I included the hyperparameters dual=True and penalty=’l2’ , Even tried using dual =False, 
But it just doesnot increase score at all. It was stuck at 0.962.  


6. What did you learn from the different metrics? Did you try cross-validation?
Ans: precision and recall values shows the measure of relevance of the system. And support helped me know about the number of samples of true response that lies in that class. 

7. What are your best final Result Metrics? By how much is it better than the strawman figure? Which model gave you this performance?
Ans: My best result was ruc score 0.962. Which is 0.005 better than strawman figure. Logistic regression used with snowball stemmer and countvectorizer where max features are 1000000 gave me that result. 

8. What was the hardest thing to do ?
Ans: The hardest thing was having no background about Machine learning and thus struggling to figure out the ML Algorithms and understanding what it does was really challenging. 


## Prevalence of personal attacks by namespace
In this section we use our classifier in conjunction with the [Wikipedia Talk Corpus](https://figshare.com/articles/Wikipedia_Talk_Corpus/4264973) to see if personal attacks are more common on user talk or article talk page discussions. In our paper we show that the model is not biased by namespace.

In [None]:
import os
import re
from scipy.stats import bernoulli
% matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# download and untar data

USER_TALK_CORPUS_2004_URL = 'https://ndownloader.figshare.com/files/6982061'
ARTICLE_TALK_CORPUS_2004_URL = 'https://ndownloader.figshare.com/files/7038050'

download_file(USER_TALK_CORPUS_2004_URL, 'comments_user_2004.tar.gz')
download_file(ARTICLE_TALK_CORPUS_2004_URL,  'comments_article_2004.tar.gz')

os.system('tar -xzf comments_user_2004.tar.gz')
os.system('tar -xzf comments_article_2004.tar.gz')

In [None]:
# helper for collecting a sample of comments for a given ns and year from 
def load_no_bot_no_admin(ns, year, prob = 0.1):
    
    dfs = []
    
    data_dir = "comments_%s_%d" % (ns, year)
    for _, _, filenames in os.walk(data_dir):
        for filename in filenames:
            if re.match("chunk_\d*.tsv", filename):
                df = pd.read_csv(os.path.join(data_dir, filename), sep = "\t")
                df['include'] = bernoulli.rvs(prob, size=df.shape[0])
                df = df.query("bot == 0 and admin == 0 and include == 1")
                dfs.append(df)
                
    sample = pd.concat(dfs)
    sample['ns'] = ns
    sample['year'] = year
    
    return sample

In [None]:
# collect a random sample of comments from 2004 for each namespace
corpus_user = load_no_bot_no_admin('user', 2004)
corpus_article = load_no_bot_no_admin('article', 2004)
corpus = pd.concat([corpus_user, corpus_article])

In [None]:
# Apply model
corpus['comment'] = corpus['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
corpus['comment'] = corpus['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
corpus['attack'] = clf.predict_proba(corpus['comment'])[:,1] > 0.425 # see paper

In [None]:
# plot prevalence per ns

sns.pointplot(data = corpus, x = 'ns', y = 'attack')
plt.ylabel("Attack fraction")
plt.xlabel("Dicussion namespace")

Attacks are far more prevalent in the user talk namespace.