<h1>Waseem Revisited</h1>

In this project, we reproduce the experiment conducted in <a href=https://www.aclweb.org/anthology/N16-2013.pdf>Waseem 2016</a> to detect hate speech in Tweets, but with newly retrieved and annotated test data. Our hypothesis is that a model trained on the original data will be less accurate on the new test data than on the original test data. We reproduced the original method for retrieving Tweets using the following search terms, which are particularly prone to appearing in hate speech: “MKR”, “asian drive”, “feminazi”, “immigrant”, “n\*\*\*r”, “sjw”, “WomenAgainstFeminism”, “blameonenotall”, “islam terrorism”, “notallmen”, “victimcard”, “victim card”, “arab terror”, “gamergate”, “jsil”, “racecard”, and “race card”. We also used stratified sampling so that the new test data, drawn from Tweets posted in the first three months of 2021, has a representative distribution of these terms. Even though we performed many aspects of the experiment in the same way, the results were markedly different. First, consider the distribution of terms.

<h3>Data</h3>

<h5>Retrieving Tweets</h5>

Our team used Indiana University's OSoMe service to retrieve Tweets posted between 01/03/2021 and 04/03/2021 (a three-month window), using the keywords and keyphrases above as search queries. We took the following steps to produce the data file for sampling:
<ul>
    <li>Retrieve .gzip files from OSoMe.</li>
    <li>Extract .gzip files to retrieve JSON files.</li>
    <li>Merge all JSON files.
    <ul>
        <li>Only include Tweet IDs and extended Tweet content.</li>
        <li>Skip truncated Tweets (which applied to many retweets).</li>
    </ul></li>
    <li>Remove Tweets with duplicate content using pandas.</li>
    <li>Remove non-English Tweets using langdetect.</li>
    <li>Write the results to a CSV file.</li></ul>
<br>
This yielded the included tweets.csv file. The total number of Tweets is as follows:

In [4]:
import pandas as pd
print("Reading CSV...")
df = pd.read_csv("tweets/tweets.csv", encoding='utf8')
print("Done.")
print()
print("Number of Tweets:")
print("----------------")
print(len(df))

Reading CSV...
Done.

Number of Tweets:
----------------
100082
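
The retrieval steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the script we actually ran: it assumes each .gz file holds one JSON Tweet per line, that records use Twitter API v1.1-style fields (`id_str`, `truncated`, `extended_tweet.full_text`), and it swaps pandas de-duplication for a plain set. The names `extract_tweet`, `merge_records`, `write_csv`, and the `is_english` predicate are hypothetical.

```python
import csv
import gzip
import json


def extract_tweet(record):
    """Return (tweet_id, text), or None if the Tweet is truncated.

    Assumption: Twitter API v1.1-style fields ('id_str', 'truncated',
    'extended_tweet' -> 'full_text'). The real OSoMe schema may differ.
    """
    extended = record.get("extended_tweet")
    if extended and "full_text" in extended:
        return record["id_str"], extended["full_text"]
    if record.get("truncated"):
        return None  # Truncated with no extended content: skip.
    return record["id_str"], record.get("full_text") or record.get("text", "")


def merge_records(gz_paths, is_english=lambda text: True):
    """Merge gzipped JSON-lines files into de-duplicated (id, text) rows.

    `is_english` is a pluggable language filter; by default it keeps everything.
    """
    seen_texts = set()
    rows = []
    for path in gz_paths:
        with gzip.open(path, mode="rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                tweet = extract_tweet(json.loads(line))
                if tweet is None:
                    continue
                tweet_id, text = tweet
                # Drop Tweets with duplicate content and non-English Tweets.
                if text in seen_texts or not is_english(text):
                    continue
                seen_texts.add(text)
                rows.append((tweet_id, text))
    return rows


def write_csv(rows, out_path):
    """Write rows with the column names used in tweets.csv."""
    with open(out_path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Tweet ID", "Tweet Content"])
        writer.writerows(rows)
```

To mirror the langdetect step, one could pass something like `is_english=lambda t: detect(t) == "en"` after `from langdetect import detect`, keeping in mind that `detect` raises an exception on text with no detectable features.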


We can find the proportion of each keyword or keyphrase with the following two functions, counting only one term per Tweet:

In [7]:
import numpy as np
import math

# Case-sensitive strings.
kws = ['MKR',
       'feminazi',
       'immigrant',
       'nigger',
       'sjw',
       'WomenAgainstFeminism',
       'blameonenotall',
       'notallmen',
       'victimcard',
       'gamergate',
       'jsil',
       'racecard',
       'race card',
       'asian drive',
       'islam terrorism',
       'arab terror']

In [39]:
def create_label(data, kws, col):
    """Add a 'Keyword' column holding the first keyword in kws found in each Tweet.

    Note: this modifies `data` in place and also returns it.
    """
    kw_list = []

    for _, row in data.iterrows():
        # Matching is case-sensitive; count only the first match, in kws order.
        # None marks the absence of any term (e.g., strings may be in the wrong case).
        match = next((kw for kw in kws if kw in row[col]), None)
        kw_list.append(match)

    data['Keyword'] = np.array(kw_list)

    return data
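
To make the "one term per Tweet" rule concrete, here is a small pure-Python sketch of the same first-match logic on made-up strings. The helper name `first_keyword` and the example texts are hypothetical; the point is that the label depends on the order of `kws`, not on where the term appears in the Tweet, and that matching is case-sensitive.

```python
def first_keyword(text, kws):
    """Return the first entry of kws found in text (case-sensitive), else None."""
    for kw in kws:
        if kw in text:
            return kw
    return None


# A shortened keyword list, in the same relative order as above.
demo_kws = ['MKR', 'immigrant', 'sjw', 'race card']

# 'sjw' appears first in the text, but 'immigrant' comes first in demo_kws.
label_a = first_keyword("the sjw crowd on immigrant rights", demo_kws)

# Wrong case: neither 'Race Card' nor 'mkr' matches, so the label is None.
label_b = first_keyword("playing the Race Card again", demo_kws)
label_c = first_keyword("mkr tonight", demo_kws)

print(label_a, label_b, label_c)
```

Tweets labeled None this way are the ones later dropped with `dropna`.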

In [43]:
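def key_proportions(data, kws, col):
    """Count the rows of data[col] containing each keyword and report percentages.

    'Proportion' is the unrounded percentage of all len(data) rows (including
    rows with no keyword); '%' is the same value formatted to two decimals.
    """
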
    kw_counts = {}
    kw_percentages = {}
    kw_proportions = {}
    data_size = len(data)
    
    for kw in kws:
        count = data[col].str.contains(kw, na=False).sum()
        kw_counts[kw] = count
    
    for key in kw_counts:
        kw_percentages[key] = "%.2f" % float(kw_counts[key]*100/data_size)
        kw_proportions[key] = float(kw_counts[key]*100/data_size)

    # Create a table with the results.
    table = pd.DataFrame(kw_counts.items())
    table[2] = table[0].map(kw_proportions)
    table[3] = table[0].map(kw_percentages)
    table.columns = ['Keyword','Count','Proportion','%']
    
    return table

In [58]:
labeled_data = create_label(df, kws, "Tweet Content")
labeled_data = labeled_data.dropna(subset=['Keyword']) # Get rid of NaN values.
labeled_data

Unnamed: 0,Tweet ID,Tweet Content,Keyword
0,1366176476563914756,Why is this nigger driving the car?,nigger
2,1366177438753988613,y’all have no clue the RAGE that would consume...,nigger
3,1366177499143634945,One white gyal call me a nigger at a party she...,nigger
4,1366177543200641032,The world's biggest promotion has begun\n\n$AR...,MKR
6,1366179153372979209,@sham786364 @bobpockrass @NASCARONFOX The real...,immigrant
...,...,...,...
100077,1366132457767129095,@michaellooby2 @jackh1092 @Josh_HW Yeah they w...,race card
100078,1366134724389986308,@bmorepolo5 @MidwestBake @flockxfans You're pu...,race card
100079,1366134770095251461,@kapitalkeyz1 @rosaluxemburgs Lol nah. Legally...,race card
100080,1366149052530843650,It's the race card all over again.,race card


In [59]:
key_proportions(df, kws, "Keyword")

Unnamed: 0,Keyword,Count,Proportion,%
0,MKR,4273,4.269499,4.27
1,feminazi,776,0.775364,0.78
2,immigrant,45905,45.867389,45.87
3,nigger,8053,8.046402,8.05
4,sjw,3383,3.380228,3.38
5,WomenAgainstFeminism,0,0.0,0.0
6,blameonenotall,0,0.0,0.0
7,notallmen,178,0.177854,0.18
8,victimcard,34,0.033972,0.03
9,gamergate,1413,1.411842,1.41


<h5>Sampling Tweets</h5>

To get a representative sample of the data obtained above, we performed stratified sampling using the Python library pandas.

In [68]:
N = 601 # Sample size. Set higher than needed for rounding errors.

groups = labeled_data.groupby('Keyword', group_keys=False)
proportional_groups = groups.apply(lambda x: x.sample(int(np.rint((N*len(x)/(len(labeled_data)))))))
sample = proportional_groups.sample(frac=1)
sample.reset_index(drop=True)

Unnamed: 0,Tweet ID,Tweet Content,Keyword
0,1346938089051189248,She definitely wanted to say nigger 😂,nigger
1,1371527461968736263,This is the fate we all have been faced with i...,immigrant
2,1352853571079966725,‘Leftists are weak’ seems to be a popular phra...,sjw
3,1373644115246735363,@achillistyy @anti_functional @boardhopping im...,immigrant
4,1352152292234035201,Looking for an excellent binary exchange? 👀 Us...,MKR
...,...,...,...
596,1369301209065140229,Still not familiar with the immigrant agenda i...,immigrant
597,1346727273345511431,Me )in some kids playing hide and go seek some...,nigger
598,1373858172197163009,@chrisbritt01 Ur race card has expired.,race card
599,1346587076616200194,"I totally get that this was a joke, and yet I ...",immigrant
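
The comment on N above mentions rounding errors. As a rough illustration (with toy stratum sizes, not our real keyword counts), the sketch below shows how rounding each stratum's share independently can make the total drift from the target and can round a very small stratum down to zero. The function name `proportional_allocation` is hypothetical, and Python's built-in `round` stands in for `np.rint`; both round halves to the nearest even integer.

```python
def proportional_allocation(counts, n):
    """Allocate n samples across strata in proportion to their size,
    rounding each stratum's share to the nearest integer."""
    total = sum(counts.values())
    return {name: round(n * size / total) for name, size in counts.items()}


# Toy stratum sizes (assumed for illustration only).
toy_counts = {"a": 4040, "b": 3030, "c": 2926, "d": 4}

alloc = proportional_allocation(toy_counts, 100)
print(alloc)                # per-stratum sample sizes
print(sum(alloc.values()))  # may differ from the requested 100
```

Here the requested 100 samples come out as 99, and stratum "d" receives none, which is why it helps to set N slightly above the number of Tweets actually needed.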


Now we can verify that the sample proportions are close to the desired proportions. It is difficult to get the *exact* proportions given the much smaller size of the sample. Note also that the earlier table divides each count by all 100082 Tweets, including those with no keyword match, whereas the sample is drawn only from labeled Tweets, so the percentages below are uniformly higher. <b>Note:</b> this is not the exact sample found in the /data directory, and it was not used for the actual experiment; it only demonstrates the method. Regardless, the proportions are roughly the same for both samples.

In [69]:
key_proportions(sample, kws, "Keyword")

Unnamed: 0,Keyword,Count,Proportion,%
0,MKR,34,5.657238,5.66
1,feminazi,6,0.998336,1.0
2,immigrant,360,59.900166,59.9
3,nigger,63,10.482529,10.48
4,sjw,27,4.492512,4.49
5,WomenAgainstFeminism,0,0.0,0.0
6,blameonenotall,0,0.0,0.0
7,notallmen,1,0.166389,0.17
8,victimcard,0,0.0,0.0
9,gamergate,11,1.830283,1.83
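
As a rough check of the relative proportions, the sketch below divides each sample percentage by the corresponding full-data percentage, using values copied from the two tables above for a handful of keywords. If stratification worked, the ratio should be roughly constant across keywords, even though the two tables use different denominators. The variable names are hypothetical.

```python
# Percentages copied from the full-data table and the sample table above.
full_pct = {
    "MKR": 4.269499,
    "feminazi": 0.775364,
    "immigrant": 45.867389,
    "sjw": 3.380228,
    "gamergate": 1.411842,
}
sample_pct = {
    "MKR": 5.657238,
    "feminazi": 0.998336,
    "immigrant": 59.900166,
    "sjw": 4.492512,
    "gamergate": 1.830283,
}

# Ratio of sample share to full-data share for each keyword.
ratios = {kw: sample_pct[kw] / full_pct[kw] for kw in full_pct}

for kw, ratio in ratios.items():
    print(f"{kw:10s} {ratio:.3f}")
```

For these keywords the ratios all fall close to 1.3, so the keywords keep roughly the same proportions relative to one another in the sample.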


<h5>Comparing with the Original Data</h5>

*Forthcoming.* We found it more difficult to obtain accurate proportions for the Waseem test data.

Some aspects of the experiment were not performed in the same way as in the original, as described below.

<h3>Methodological Differences</h3>

<h5>Data Retrieval</h5>

Our overall project comprised three teams of graduate students with three members each. We conducted data retrieval independently of one another and combined the final samples. The other teams used slightly different or smaller time windows and had a slightly different distribution of keywords. The Waseem data was collected *over the course* of two months, with no specified timeframe for the Tweets themselves.

<h5>Annotators</h5>

The original annotations were done by two annotators with outside expert verification. In our case, each of the three teams retrieved a unique set of 600 Tweets for a total of 1800 Tweets. Each member of each team was then solely responsible for annotating a subset of 200 Tweets, with no inter-rater reliability metric or expert verification. The resulting label distribution for our test data was 297 Tweets labeled as hate speech and 1490 labeled as none, a ratio of roughly 1:5. Compare this to the original test data distribution of 496 Tweets labeled as hate speech and 1076 labeled as none, a drastically different ratio of roughly 1:2.
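
The two ratios can be checked with a few lines of arithmetic. This is only a worked calculation on the label counts stated above; the helper name `none_per_hate` is hypothetical.

```python
def none_per_hate(hate, none):
    """Number of 'none' Tweets per hate speech Tweet."""
    return none / hate


# Label counts as stated in the text.
new_hate, new_none = 297, 1490      # our test data (1787 labeled Tweets)
orig_hate, orig_none = 496, 1076    # original Waseem test data (1572 Tweets)

new_ratio = none_per_hate(new_hate, new_none)
orig_ratio = none_per_hate(orig_hate, orig_none)

new_share = new_hate / (new_hate + new_none)
orig_share = orig_hate / (orig_hate + orig_none)

print(f"new test data:      1:{new_ratio:.2f} ({new_share:.1%} hate speech)")
print(f"original test data: 1:{orig_ratio:.2f} ({orig_share:.1%} hate speech)")
```

In other words, hate speech makes up roughly a sixth of the new test data but nearly a third of the original test data.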

<h5>Definitions</h5>

Our three teams did not confer about a governing definition of hate speech, although we did discuss it as a class. Our team relied on the definition given in <a href=https://dl.acm.org/doi/abs/10.1145/3232676>Fortuna 2018</a>:
<br>
<br>
<blockquote>
Hate speech is language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humour is used.
</blockquote>
<br>
This definition, along with its associated criteria, differs from that given by Waseem, thus creating some differences in annotations. For example, the following Tweet was tagged as hate speech in the original test data:
<br>
<br>
<blockquote>
Islam is worse than the Nazi party ever was.
</blockquote>
<br>
Fortuna, by contrast, states that <q>speak[ing] badly about countries or religions (e.g., France, Portugal, Catholicism, Islam) is allowed in general, but discrimination is not allowed based on these categories.</q> We therefore did not tag similar Tweets as hate speech.

<h3>Libraries Used</h3>

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier as rf
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import numpy as np
import preprocessor as p
import sys

<h3>Preprocessing</h3>

Preprocessing is minimal: only URLs are removed. Other steps, such as removing stop words, are instead treated as hyperparameters for GridSearchCV to search over.

In [3]:
p.set_options(p.OPT.URL)

In [9]:
def preprocess(data):
    cleaned_data = []
    for line in data:
        cleaned_data.append(p.clean(line))
    return cleaned_data

def get_data(path):
    with open(path, mode='r', encoding = 'utf-8') as f:
        data = list(f)
    return preprocess(data)

<h3>Model Definitions</h3>

<h5>Logistic Regression</h5>

In [46]:
def LR(train_X, train_y, test_X, test_y):
    
    #Define the logistic regression (LR) pipeline.
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', LogisticRegression(random_state=0))
    ])
    
    #Define the optional parameters to be tried with LR.
    params = {
        'vect__stop_words': [None, 'english'],
        'vect__ngram_range': [(1,1),(2,2),(1,2),(2,3)],
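        # Note: with scikit-learn's default 'lbfgs' solver, only the 'l2' penalty is
        # supported, so the 'l1' and 'elasticnet' candidates likely fail to fit
        # and are scored as NaN by GridSearchCV.
        'clf__penalty': ('l1','l2','elasticnet'),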
        'clf__C': (1,2,3),
        'clf__fit_intercept': (True, False),
        'tfidf__use_idf': (True, False)
    }
    
    #Try optional parameters with GridSearchCV
    print("Running logistic regression...")
    gs_clf = GridSearchCV(text_clf, params, cv=5, n_jobs=-1)
    gs_clf.fit(train_X, train_y)
    preds = gs_clf.predict(test_X)
    print("Done.")
    
    #Get the best parameters.
    best_params = gs_clf.best_params_
    print()
    print("Best parameters:")
    print("---------------")
    for key in best_params.keys():
        print(key + ': ' + str(best_params[key]))
    print()
    
    #Get the classification report.
    report = classification_report(test_y, preds, digits=6)
    print("Classification report:")
    print("---------------------")
    print(report)

<h5>Support Vector Machine</h5>

In [47]:
def SVM(train_X, train_y, test_X, test_y):
    #Define the support vector machine (SVM) pipeline.
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', svm.SVC(gamma='scale'))
    ])
    
    #Define the optional parameters to be tried with SVM.
    params = {
        'vect__stop_words': [None, 'english'],
        'vect__ngram_range':[(1,1),(2,2),(1,2),(2,3)],
        'clf__C':(1, 2, 3),
        'clf__kernel': ('linear','sigmoid','rbf'),
        'clf__shrinking':(True, False),
        'tfidf__use_idf':(True, False)
    }
    
    #Try optional parameters with GridSearchCV
    print("Running support vector classifier...")
    gs_clf = GridSearchCV(text_clf, params, cv=5, n_jobs=-1)
    gs_clf.fit(train_X, train_y)
    preds = gs_clf.predict(test_X)
    print("Done.")
    
    #Get the best parameters.
    best_params = gs_clf.best_params_
    print()
    print("Best parameters:")
    print("---------------")
    for key in best_params.keys():
        print(key + ': ' + str(best_params[key]))
    print()
    
    #Get the classification report.
    report = classification_report(test_y, preds, digits=6)
    print("Classification report:")
    print("---------------------")
    print(report)

<h5>Random Forest</h5>

In [48]:
def RF(train_X, train_y, test_X, test_y):
    #Define the random forest (RF) pipeline.
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', rf())
    ])
    
    #Define the optional parameters to be tried with RF.
    params = {
        'vect__stop_words': [None, 'english'],
        'vect__ngram_range':[(1,1),(2,2),(1,2),(2,3)],
        'clf__n_estimators': (10, 30, 50, 100),
        'clf__criterion':('gini','entropy'),
        'clf__max_features':('sqrt','log2'),
        'tfidf__use_idf':(True, False)
    }
    
    #Try optional parameters with GridSearchCV
    print("Running random forest...")
    gs_clf = GridSearchCV(text_clf, params, cv=5, n_jobs=-1)
    gs_clf.fit(train_X, train_y)
    preds = gs_clf.predict(test_X)
    print("Done.")
    
    #Get the best parameters.
    best_params = gs_clf.best_params_
    print()
    print("Best parameters:")
    print("---------------")
    for key in best_params.keys():
        print(key + ': ' + str(best_params[key]))
    print()
    
    #Get the classification report.
    report = classification_report(test_y, preds, digits=6)
    print("Classification report:")
    print("---------------------")
    print(report)

<h5>Naive Bayes</h5>

In [49]:
def NB(train_X, train_y, test_X, test_y):
    #Define the naive bayes (NB) pipeline.
    text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB())
    ])
    
    #Define the optional parameters to be tried with NB.
    params = {
        'vect__stop_words': [None, 'english'],
        'vect__ngram_range':[(1,1),(2,2),(1,2),(2,3)],
        'clf__alpha': (0.000001,1),
        'clf__fit_prior':(True, False),
        'tfidf__use_idf':(True, False)
    }
    
    #Try optional parameters with GridSearchCV
    print("Running naive bayes...")
    gs_clf = GridSearchCV(text_clf, params, cv=5, n_jobs=-1)
    gs_clf.fit(train_X, train_y)
    preds = gs_clf.predict(test_X)
    print("Done.")
    
    #Get the best parameters.
    best_params = gs_clf.best_params_
    print()
    print("Best parameters:")
    print("---------------")
    for key in best_params.keys():
        print(key + ': ' + str(best_params[key]))
    print()
    
    #Get the classification report.
    report = classification_report(test_y, preds, digits=6)
    print("Classification report:")
    print("---------------------")
    print(report)
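
To get a feel for the cost of each search, the sketch below counts the parameter combinations in the four grids defined above and multiplies by the number of cross-validation folds. The grid sizes are read off the `params` dictionaries; the names `grid_sizes` and `CV_FOLDS` are hypothetical, and the count ignores the single final refit that GridSearchCV performs on the best combination by default.

```python
from math import prod

CV_FOLDS = 5  # Matches cv=5 in the GridSearchCV calls above.

# Number of candidate values per hyperparameter, per model.
grid_sizes = {
    "LR":  {"stop_words": 2, "ngram_range": 4, "penalty": 3, "C": 3,
            "fit_intercept": 2, "use_idf": 2},
    "SVM": {"stop_words": 2, "ngram_range": 4, "C": 3, "kernel": 3,
            "shrinking": 2, "use_idf": 2},
    "RF":  {"stop_words": 2, "ngram_range": 4, "n_estimators": 4,
            "criterion": 2, "max_features": 2, "use_idf": 2},
    "NB":  {"stop_words": 2, "ngram_range": 4, "alpha": 2,
            "fit_prior": 2, "use_idf": 2},
}

# Combinations per model, and total cross-validated fits per model.
combos = {name: prod(sizes.values()) for name, sizes in grid_sizes.items()}
fits = {name: n * CV_FOLDS for name, n in combos.items()}

for name in grid_sizes:
    print(f"{name}: {combos[name]} combinations, {fits[name]} fits")
```

This gives 288 combinations for LR and SVM, 256 for RF, and 64 for NB, or 1440, 1440, 1280, and 320 cross-validated fits respectively.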

<h2>Original Data</h2>

Import the relevant data. This is the original training and testing data used in Waseem 2016. 'X' corresponds to Tweet contents while 'y' corresponds to their labels. Judging by the support counts in the reports below, class 1 is hate speech and class 2 is none.

In [13]:
print("Loading original data...")
train_X = get_data('data/waseemtrain.txt')
test_X = get_data('data/waseemtest.txt')
train_y = get_data('data/waseemtrainGold.txt')
test_y = get_data('data/waseemtestGold.txt')
print("Done.")

Loading original data...
Done.


<h2>Original Model and Results</h2>

In [51]:
LR(train_X, train_y, test_X, test_y)

Running logistic regression...
Done.

Best parameters:
---------------
clf__C: 1
clf__fit_intercept: True
clf__penalty: l2
tfidf__use_idf: True
vect__ngram_range: (2, 3)
vect__stop_words: None

Classification report:
---------------------
              precision    recall  f1-score   support

           1   0.876471  0.300403  0.447447       496
           2   0.752496  0.980483  0.851493      1076

    accuracy                       0.765903      1572
   macro avg   0.814484  0.640443  0.649470      1572
weighted avg   0.791613  0.765903  0.724008      1572



<h2>New Data</h2>

Using the same training data, test on newly obtained and annotated Tweets to compare the results.

In [12]:
print("Loading new test data...")
test_new_X = get_data('data/waseem_new_test.txt')
test_new_y = get_data('data/waseem_new_testGold.txt')
print("Done.")

Loading new test data...
Done.


<h2>New Models and Results</h2>

<h5>Retrying Logistic Regression</h5>

In [53]:
LR(train_X, train_y, test_new_X, test_new_y)

Running logistic regression...
Done.

Best parameters:
---------------
clf__C: 1
clf__fit_intercept: True
clf__penalty: l2
tfidf__use_idf: True
vect__ngram_range: (2, 3)
vect__stop_words: None

Classification report:
---------------------
              precision    recall  f1-score   support

           1   0.416667  0.033670  0.062305       297
           2   0.837209  0.990604  0.907470      1490

    accuracy                       0.831561      1787
   macro avg   0.626938  0.512137  0.484888      1787
weighted avg   0.767315  0.831561  0.767003      1787



<h5>Trying Support Vector Machine</h5>

In [54]:
SVM(train_X, train_y, test_new_X, test_new_y)

Running support vector classifier...
Done.

Best parameters:
---------------
clf__C: 1
clf__kernel: rbf
clf__shrinking: True
tfidf__use_idf: False
vect__ngram_range: (2, 2)
vect__stop_words: None

Classification report:
---------------------
              precision    recall  f1-score   support

           1   0.325000  0.043771  0.077151       297
           2   0.837436  0.981879  0.903923      1490

    accuracy                       0.825965      1787
   macro avg   0.581218  0.512825  0.490537      1787
weighted avg   0.752269  0.825965  0.766514      1787



<h5>Trying Random Forest</h5>

In [56]:
RF(train_X, train_y, test_new_X, test_new_y)

Running random forest...
Done.

Best parameters:
---------------
clf__criterion: entropy
clf__max_features: log2
clf__n_estimators: 50
tfidf__use_idf: True
vect__ngram_range: (2, 3)
vect__stop_words: None

Classification report:
---------------------
              precision    recall  f1-score   support

           1   0.333333  0.006734  0.013201       297
           2   0.834363  0.997315  0.908591      1490

    accuracy                       0.832680      1787
   macro avg   0.583848  0.502025  0.460896      1787
weighted avg   0.751091  0.832680  0.759777      1787



<h5>Trying Naive Bayes</h5>

In [57]:
NB(train_X, train_y, test_new_X, test_new_y)

Running naive bayes...
Done.

Best parameters:
---------------
clf__alpha: 1
clf__fit_prior: True
tfidf__use_idf: False
vect__ngram_range: (2, 3)
vect__stop_words: None

Classification report:
---------------------
              precision    recall  f1-score   support

           1   0.437500  0.023569  0.044728       297
           2   0.836251  0.993960  0.908310      1490

    accuracy                       0.832680      1787
   macro avg   0.636875  0.508764  0.476519      1787
weighted avg   0.769978  0.832680  0.764783      1787
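
One way to read the accuracy figures above is against a majority-class baseline, that is, the accuracy of always predicting the larger class. The sketch below is an added sanity check rather than part of the original experiment; it uses only the support counts and accuracies printed in the reports, and the helper name `majority_baseline` is hypothetical.

```python
def majority_baseline(supports):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(supports.values()) / sum(supports.values())


# Support counts from the classification reports above.
orig_baseline = majority_baseline({"1": 496, "2": 1076})
new_baseline = majority_baseline({"1": 297, "2": 1490})

# Accuracies as reported above.
orig_lr_accuracy = 0.765903
new_accuracy = {"LR": 0.831561, "SVM": 0.825965, "RF": 0.832680, "NB": 0.832680}

print(f"original test data: baseline {orig_baseline:.4f}, LR {orig_lr_accuracy:.4f}")
print(f"new test data:      baseline {new_baseline:.4f}")
for name, acc in new_accuracy.items():
    print(f"  {name}: {acc:.4f} ({acc - new_baseline:+.4f} vs. baseline)")
```

On the original test data, logistic regression beats the baseline of about 0.6845, whereas on the new test data every model lands slightly below the baseline of about 0.8338. This suggests the higher raw accuracy on the new data mostly reflects its class imbalance, which is consistent with the very low recall for class 1.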

