# Vaccination-Stance Classification

This Jupyter notebook contains the Python code used to generate the results reported in the paper. The code was written in Python 3 and relies on the nltk, pandas, and scikit-learn libraries (numpy and scipy are also imported). It classifies vaccine-related tweets into three classes: pro-vaccine, anti-vaccine, and neutral.

To perform the classification, the tweets are preprocessed as follows:
- We first apply NLTK's TweetTokenizer to convert each tweet to lower case and segment it into tokens (words, hashtags, and mentions).
- Stopwords (NLTK's English list plus the custom list in stopwords.txt) are removed from the extracted tokens.
- We then apply scikit-learn's CountVectorizer to count the frequency of each unigram and bigram.
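
The three steps above can be illustrated without any libraries. The sketch below is a simplified stand-in rather than the notebook's pipeline: whitespace splitting replaces TweetTokenizer, `STOPWORDS` is a tiny hypothetical list, and `ngram_counts` is a hypothetical helper that mimics only the unigram-plus-bigram counting of CountVectorizer (it ignores the vectorizer's token pattern and its document-frequency filters).

```python
from collections import Counter

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# NLTK's English list plus a custom stopwords.txt file.
STOPWORDS = {"if", "do", "not", "the", "a"}

def simple_preprocess(tweet):
    """Lowercase, split on whitespace (a stand-in for TweetTokenizer), drop stopwords."""
    return [tok for tok in tweet.lower().split() if tok not in STOPWORDS]

def ngram_counts(tokens, n_min=1, n_max=2):
    """Count every n-gram of length n_min..n_max, with tokens joined by spaces."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

tokens = simple_preprocess("If #vaccines do NOT cause autism")
counts = ngram_counts(tokens)
print(tokens)
print(dict(counts))
```

Applied to the opening words of the first tweet in the dataset, the toy pipeline keeps three tokens and counts three unigrams and two bigrams.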

After preprocessing, we train scikit-learn's implementation of L1-regularized logistic regression with its default regularization parameter (C = 1). The classifier is evaluated using 10-fold cross-validation. To handle the imbalanced class distribution, the smaller classes in each training fold are oversampled (sampled with replacement) until they match the size of the largest class; the test folds are left unchanged. Results are reported as overall accuracy and as per-class precision, recall, and F-measure.
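
As a rough illustration of the oversampling step, the sketch below balances a toy training set using only the standard library. The function name `oversample`, the toy rows, and the seed are assumptions made for this example; the notebook itself does the equivalent with pandas `groupby` and `sample(..., replace=True)`.

```python
import random
from collections import Counter, defaultdict

def oversample(rows, labels, seed=1):
    """Append random duplicates of minority-class rows until all classes
    are as large as the largest class. The original rows are kept as-is."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = list(rows), list(labels)
    for label, members in by_class.items():
        # Draw with replacement; the largest class receives zero extra rows.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels

# Toy training fold: three neutral (0), two pro (1), one anti (-1) tweets.
rows = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = [0, 0, 0, 1, 1, -1]
new_rows, new_labels = oversample(rows, labels)
print(Counter(new_labels))
```

Oversampling only the training fold means duplicated tweets never appear in the fold used for testing, so the reported test accuracy is measured on the original class distribution.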

### Configuration Parameters

In [1]:
datadir = '../data/'
resultdir = '../results/'

# Feature extraction parameter
ngrams = (1,2)                # extract unigrams and bigrams

# Classification parameters
oversampling = True           # balance classes by oversampling the smaller classes in the training set
numFolds = 10                 # number of folds for cross-validation

### Subroutine Definitions

In [2]:
import nltk
#nltk.download('stopwords')

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
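import re
from functools import lru_cache

@lru_cache(maxsize=None)
def getStopwords():
    # NLTK's English stopwords plus the project-specific list; cached so the
    # file is read once instead of once per tweet
    stopwordlist = set(stopwords.words('english'))
    with open(datadir + 'stopwords.txt', 'r') as f:
        for line in f:
            stopwordlist.add(line.rstrip('\n'))

    return stopwordlist

def preprocess(sentence):
    # Lowercase, tokenize, drop stopwords, and re-join into a space-separated string
    tkz = TweetTokenizer(preserve_case=False)
    stop_words = getStopwords()
    temp = []
    for word in tkz.tokenize(sentence.lower()):
        if word != '' and word not in stop_words:
            temp.append(word)
    separator = ' '
    return separator.join(temp)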

def getAllFeatures(sp_mat, features):
    vocab = dict([(value, key) for key, value in features.items()])
    result = []
    for i in range(sp_mat.shape[0]):
        p, q = sp_mat[i].nonzero()
        temp = ' '.join([vocab[q[j]] for j in range(len(q))])
        result.append(temp)
    return result

### Data Loading

In [3]:
import pandas as pd

rawdata = pd.read_csv(datadir + "pro_anti.csv",encoding = 'utf-8', header = 'infer')
print(rawdata.shape)

print('Pro-vs-anti class distribution:')
distrib = rawdata['class'].value_counts()
print(distrib)
probs = distrib/sum(distrib)
print(probs)

rawdata

(5611, 2)
Pro-vs-anti class distribution:
 0    2422
 1    1639
-1    1550
Name: class, dtype: int64
 0    0.431652
 1    0.292105
-1    0.276243
Name: class, dtype: float64


|      | X | class |
|------|---|-------|
| 0    | If #vaccines do NOT cause autism CDC should ... | -1 |
| 1    | First question asked when someone dies in a c... | 1 |
| 2    | @CollChris @PedsGeekMD @nnebeluk @somedocs @w... | 1 |
| 3    | "Oh my God I can't believe we did what we di... | -1 |
| 4    | Vaccine Safety Study Act (HR 3615) reintroduc... | 1 |
| ...  | ... | ... |
| 5606 | Here we are pharmaceutical companies coercing ... | -1 |
| 5607 | @puddleg Don't know! I was trying to find proo... | -1 |
| 5608 | @amandpms @TheCollectiveQ I was told the same ... | -1 |
| 5609 | @HerbsandDirt Just went to the "pediatrician" ... | -1 |


### Data Preprocessing

In [4]:
data = rawdata.copy()
data['X'] = data['X'].apply(preprocess)
data

|      | X | class |
|------|---|-------|
| 0    | #vaccines cause autism cdc qualms studying aut... | -1 |
| 1    | first question asked someone dies car crash ? ... | 1 |
| 2    | @collchris @pedsgeekmd @nnebeluk @somedocs @we... | 1 |
| 3    | " oh god believe . " it'd terrible say words p... | -1 |
| 4    | vaccine safety study act ( hr 3615 ) reintrodu... | 1 |
| ...  | ... | ... |
| 5606 | pharmaceutical companies coercing governments ... | -1 |
| 5607 | @puddleg know ! trying find proof mmr vaccine ... | -1 |
| 5608 | @amandpms @thecollectiveq told thing 1993 gave... | -1 |
| 5609 | @herbsanddirt went " pediatrician " said 1 yea... | -1 |


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Ignore terms found in more than 95% of tweets; min_df is passed as a proportion of the corpus
vectorizer = CountVectorizer(max_df=0.95, min_df=2/data.shape[0], ngram_range=ngrams)
X = vectorizer.fit_transform(data['X'].values)
features = vectorizer.vocabulary_
Y = rawdata['class']
X.shape

(5611, 15948)

In [6]:
print('Class distribution:')
distrib = Y.value_counts()
print(distrib)

probs = distrib/sum(distrib)
print(probs)

Class distribution:
 0    2422
 1    1639
-1    1550
Name: class, dtype: int64
 0    0.431652
 1    0.292105
-1    0.276243
Name: class, dtype: float64


### Model Building and Evaluation

#### Logistic Regression

In [7]:
import numpy as np
from sklearn.model_selection import KFold
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

kf = KFold(n_splits=numFolds, shuffle=True, random_state=1)
fold = 0
Ypred = Y.copy()

for train_index, test_index in kf.split(X):
    fold += 1
    print('\nFold %d:' % (fold))
    
    train = pd.DataFrame(X[train_index].toarray())
    train['class'] = Y[train_index].tolist()
    
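    if oversampling:
        # Resample each smaller class with replacement up to the size of the largest class
        max_size = train['class'].value_counts().max()

        lst = [train]
        for class_index, group in train.groupby('class'):
            lst.append(group.sample(max_size-len(group), replace=True))
        balanced = pd.concat(lst)   # separate name so the preprocessed `data` frame is not overwritten
    else:
        balanced = train
        
    Y_train = balanced['class']
    X_train = balanced.drop(['class'],axis=1)
    X_train = csr_matrix(X_train)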

    clf = LogisticRegression(verbose=1, solver='liblinear',random_state=1,penalty='l1',max_iter=5000) 
    clf.fit(X_train,Y_train)

    pred_train = clf.predict(X_train)
    pred_test = clf.predict(X[test_index])
    Ypred[test_index] = pred_test

    print('\nTrain accuracy:' + str(accuracy_score(Y_train, pred_train)))
    print('Test accuracy:' + str(accuracy_score(Y[test_index], pred_test)))
    
    # Fix the label order so every fold yields a 3x3 matrix (rows and columns: -1, 0, 1)
    fold_cm = confusion_matrix(Y[test_index], pred_test, labels=[-1, 0, 1])
    cm = fold_cm if fold == 1 else cm + fold_cm
    
print("Confusion Matrix:")
print(cm)
print("Accuracy =", sum(np.diag(cm))/sum(sum(cm)))
print("Micro F1 =", f1_score(Y, Ypred, average='micro'))
print("Macro F1 =", f1_score(Y, Ypred, average='macro'))
print("Weighted F1 =", f1_score(Y, Ypred, average='weighted'))
print("Accuracy =", accuracy_score(Y, Ypred))


Fold 1:
[LibLinear]
Train accuracy:0.9791570881226054
Test accuracy:0.8879003558718861

Fold 2:
[LibLinear]
Train accuracy:0.9801334546557476
Test accuracy:0.8894830659536542

Fold 3:
[LibLinear]
Train accuracy:0.9791953495487227
Test accuracy:0.8966131907308378

Fold 4:
[LibLinear]
Train accuracy:0.9792913023469857
Test accuracy:0.910873440285205

Fold 5:
[LibLinear]
Train accuracy:0.9803113553113553
Test accuracy:0.9037433155080213

Fold 6:
[LibLinear]
Train accuracy:0.9774297558728696
Test accuracy:0.9162210338680927

Fold 7:
[LibLinear]
Train accuracy:0.9806166056166056
Test accuracy:0.8948306595365418

Fold 8:
[LibLinear]
Train accuracy:0.9801934592353754
Test accuracy:0.9162210338680927

Fold 9:
[LibLinear]
Train accuracy:0.9819240196078431
Test accuracy:0.8823529411764706

Fold 10:
[LibLinear]
Train accuracy:0.9801859472641365
Test accuracy:0.9144385026737968
Confusion Matrix:
[[1344   40  166]
 [  25 2349   48]
 [ 175  100 1364]]
Accuracy = 0.9012653715915167
Micro F1 = 0.9012

In [8]:
# Rows of cm are true classes and columns are predicted classes, ordered -1, 0, 1
for i, label in enumerate([-1, 0, 1]):
    prec = cm[i][i]/cm[:,i].sum()      # correct predictions / all predictions of this class
    recall = cm[i][i]/cm[i,:].sum()    # correct predictions / all true members of this class
    f1 = 2*prec*recall/(prec + recall)
    print('Class %d:' % label)
    print('   Precision =', prec)
    print('   Recall =', recall)
    print('   F-measure =', f1)

rawdata['Predicted'] = Ypred
rawdata['Processed'] = pd.DataFrame(getAllFeatures(X, features))
rawdata.to_csv(resultdir + 'pro_vs_anti_predictions.csv', index=False)

Class -1:
   Precision = 0.8704663212435233
   Recall = 0.8670967741935484
   F-measure = 0.8687782805429864
Class 0:
   Precision = 0.9437525110486139
   Recall = 0.9698596201486375
   F-measure = 0.9566279780085523
Class 1:
   Precision = 0.8643852978453739
   Recall = 0.8322147651006712
   F-measure = 0.8479950264221325
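
The per-class figures above follow directly from the pooled confusion matrix printed earlier, where rows are true classes and columns are predicted classes in the order -1, 0, 1. As a cross-check, the sketch below recomputes them in plain Python from the matrix values copied out of the notebook output; `per_class_metrics` is a hypothetical helper name used only for this example.

```python
# Pooled confusion matrix copied from the notebook output
# (rows = true class, columns = predicted class, order: -1, 0, 1).
cm = [[1344,   40,  166],
      [  25, 2349,   48],
      [ 175,  100, 1364]]
labels = [-1, 0, 1]

def per_class_metrics(cm, i):
    """Return (precision, recall, F-measure) for the class at index i."""
    tp = cm[i][i]
    predicted = sum(row[i] for row in cm)   # column sum: everything predicted as class i
    actual = sum(cm[i])                     # row sum: everything truly in class i
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, i) for i, label in enumerate(labels)}
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f) in metrics.items():
    print(f"class {label:>2}: precision={p:.4f} recall={r:.4f} f1={f:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Precision divides the diagonal entry by its column sum and recall divides it by its row sum, which is why the neutral class, with the fewest off-diagonal counts relative to its size, scores highest on both.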
