# Lab 02: Stance Detection with Logistic Regression and Word Embeddings using scikit-learn

In this lab session we will implement a very simple Logistic Regression classifier for stance detection using scikit-learn (https://scikit-learn.org).

Stance detection consists of classifying a given document as expressing an AGAINST, FAVOR or neutral attitude/stance with respect to a given topic (the neutral class is labeled NONE in the data). In this particular lab, we use the Task A data from the SemEval 2016 Twitter dataset for stance detection: https://alt.qcri.org/semeval2016/task6/

Scikit-learn allows you to quickly experiment with a large number of machine learning algorithms in low-resource environments (in comparison to neural network approaches). It also provides many utilities to process data and to evaluate and visualize the obtained results.

Unlike other toolkits we will see during the course, scikit-learn is a library with an easy-to-use API, ideal for quick experimentation with a large variety of models and algorithms. It is usually a good starting point for classification tasks.


## ADD DOCUMENTATION

+ TODO: Add inline documentation to the notebook cells, so that you learn what each of them does.


## Functions for loading and pre-processing the corpus

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
%cd /content/drive/MyDrive/LAP/Subjects/AP1/labs

/content/drive/MyDrive/LAP/Subjects/AP1/labs


In [31]:
import pandas as pd
import nltk
import numpy as np

nltk.download('stopwords')

# load data
def load_data(fnames):
    data = []
    for fname in fnames:
        data.append(pd.read_csv(fname, sep='\t', encoding='utf-8'))
    data = pd.concat(data)
    targets = set(data['Target'])
    return data, list(targets)

def tokenized_tweets(df):
    """Function to tokenize the tweets using the tokenizer from NLTK"""
    tknzr = nltk.TweetTokenizer()
    df['Tokenized_tweet'] = df['Tweet'].apply(tknzr.tokenize)
    return df

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# ASSIGNMENT 1

+ TODO: describe how the function that loads the word embeddings works: what its input is, what its output is, and how the output is obtained. HINT: Inspect the GloVe Twitter embedding file used in Assignment 2.
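
As a starting point for that inspection, here is a minimal sketch of the text layout that GloVe files use (one token per line, followed by its space-separated components) and of how a `pd.read_csv` call like the one in `read_glove` turns it into a dictionary. The three-dimensional vectors and the name `toy_glove` are invented for illustration; they are not taken from the real file.

```python
import csv
import io

import pandas as pd

# Hypothetical three-dimensional vectors in the GloVe text layout:
# one token per line, followed by its space-separated components.
toy_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "\" 0.7 0.8 0.9\n"
)

# csv.QUOTE_NONE is the constant behind quoting=3: a token such as '"'
# is read literally instead of opening a quoted field.
df = pd.read_csv(toy_file, sep=" ", quoting=csv.QUOTE_NONE, header=None, index_col=0)

# Each row becomes one dictionary entry: token -> NumPy array of components.
toy_glove = {token: row.values for token, row in df.iterrows()}

print(sorted(toy_glove))
print(toy_glove["cat"])
```

The dictionary lookup is what makes the later vectorization step cheap: each token of a tweet is a single key access.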

In [32]:
from nltk.corpus import stopwords
import string
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer

def preprocess(data, tokenize=True, remove_stopwords=True, remove_none=True):
    '''
    Preprocess the tweets: tokenize, drop stopwords and punctuation-only tokens,
    and optionally drop the rows whose stance is NONE.
    '''
    if tokenize:
        data = tokenized_tweets(data)
        data['Clean_tweet'] = data['Tokenized_tweet']
    if remove_stopwords:
        # the NLTK list is lowercase and the comparison is case-sensitive,
        # so capitalized stopwords such as "I" or "And" are kept
        stop = set(stopwords.words('english'))
        data['Clean_tweet'] = data['Clean_tweet'].apply(lambda sentence: [word for word in sentence if word not in stop])
        # drop tokens made up only of punctuation characters
        data['Clean_tweet'] = data['Clean_tweet'].apply(lambda sentence: [word for word in sentence if not all(c in string.punctuation for c in word)])
    if remove_none:
        data = data[data['Stance'] != 'NONE']
    return data[['Target', 'Clean_tweet', 'Stance']]
    
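def read_glove(path, dim):
    '''
    Read the GloVe vectors stored at path into a dict mapping each word to a NumPy array.
    The dim argument is not used: the dimension follows from the file itself.
    '''
    # quoting=3 (csv.QUOTE_NONE) keeps quote characters as ordinary tokens
    df = pd.read_csv(path, sep=" ", quoting=3, header=None, index_col=0)
    glove = {key: val.values for key, val in df.T.items()}
    return glove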

# TODO: provide a description of this function. HINT: it helps to look at the glove twitter word embeddings
# used below.    
def gloveVectorize(glove, text):
    '''
    The inputs are glove vectors and text.
    It gets the pretrained glove vectors of the words in the input text.
    Out of vocabulary words are removed.
    The final output vector is the average of the vectors.
    '''
    dim = len(glove["the"])  # embedding size, read off a word assumed to be in the vocabulary
    X = np.zeros((len(text), dim))
    for text_id, t in enumerate(text):
        # keep only in-vocabulary words
        words = [w for w in t if w in glove]
        if words:
            # average the word vectors; a tweet with no known word keeps its zero row
            X[text_id, :] = np.mean([glove[w] for w in words], axis=0)
    return X

def encode_labels(labels):
    enc = preprocessing.LabelEncoder()
    encoded = enc.fit_transform(labels)
    decoded = enc.inverse_transform(encoded)
    return encoded, decoded

def data_as_numpy(data):
    return np.asarray(data['Clean_tweet']), np.asarray(data['Stance'])
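
The averaging step performed by `gloveVectorize` can be checked by hand on a tiny example. This is a minimal sketch with invented two-dimensional vectors; `toy_glove` and `average_vector` are hypothetical names used only here, and the function handles a single tweet rather than a whole corpus.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, invented for illustration only.
toy_glove = {
    "the": np.array([1.0, 0.0]),
    "good": np.array([0.0, 2.0]),
    "idea": np.array([2.0, 4.0]),
}

def average_vector(tokens, embeddings):
    """Mean of the known tokens' vectors; a zero vector if no token is known."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# "#unknown" is out of vocabulary, so only "good" and "idea" are averaged.
mixed = average_vector(["good", "idea", "#unknown"], toy_glove)
# No token is known here, so the result is the zero vector.
empty = average_vector(["#only", "#hashtags"], toy_glove)

print(mixed)
print(empty)
```

Because out-of-vocabulary tokens are dropped before averaging, a tweet made up mostly of hashtags and mentions may be represented by very few words.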

In [33]:
# TASK A in-target supervised: 
# load train / test
folder = "stance-semeval2016"
trial_file = f"../datasets/{folder}/semeval2016-task6-trialdata.utf-8.txt"
train_file = f"../datasets/{folder}/semeval2016-task6-trainingdata.utf-8.txt"
test_file = f"../datasets/{folder}/SemEval2016-Task6-subtaskA-testdata-gold.txt"

training_data, targets = load_data([train_file, trial_file])
test_data, _ = load_data([test_file])

# show original training data
display(training_data)

#training_data[training_data['Target'] == targets[0]].head()

# preprocess
training_preproc_data = preprocess(training_data, remove_none=False)
test_preproc_data = preprocess(test_data, remove_none=False )

# show clean training data
display(training_preproc_data)

Unnamed: 0,ID,Target,Tweet,Stance
0,101,Atheism,dear lord thank u for all of ur blessings forg...,AGAINST
1,102,Atheism,"Blessed are the peacemakers, for they shall be...",AGAINST
2,103,Atheism,I am not conformed to this world. I am transfo...,AGAINST
3,104,Atheism,Salah should be prayed with #focus and #unders...,AGAINST
4,105,Atheism,And stay in your houses and do not display you...,AGAINST
...,...,...,...,...
95,96,Legalization of Abortion,@Corey_Frizzell @PEILiberalParty and most Isla...,NONE
96,97,Legalization of Abortion,@Docjp Pressure? It's their job and they are f...,NONE
97,98,Legalization of Abortion,I love how #liberals only accuse #conservative...,AGAINST
98,99,Legalization of Abortion,Help your friend figure out how they're going ...,NONE


Unnamed: 0,Target,Clean_tweet,Stance
0,Atheism,"[dear, lord, thank, u, ur, blessings, forgive,...",AGAINST
1,Atheism,"[Blessed, peacemakers, shall, called, children...",AGAINST
2,Atheism,"[I, conformed, world, I, transformed, renewing...",AGAINST
3,Atheism,"[Salah, prayed, #focus, #understanding, #Allah...",AGAINST
4,Atheism,"[And, stay, houses, display, like, times, igno...",AGAINST
...,...,...,...
95,Legalization of Abortion,"[@Corey_Frizzell, @PEILiberalParty, Islanders,...",NONE
96,Legalization of Abortion,"[@Docjp, Pressure, It's, job, failing, miserab...",NONE
97,Legalization of Abortion,"[I, love, #liberals, accuse, #conservatives, w...",AGAINST
98,Legalization of Abortion,"[Help, friend, figure, they're, going, get, st...",NONE


# ASSIGNMENT 2

+ TODO try with different word embeddings. For example:
  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz (1.2GB)
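
One caveat when pointing a GloVe-style reader at a fastText `.vec` file: that text format starts with a header line holding the vocabulary size and the dimension, which GloVe files do not have. Below is a minimal sketch on a hypothetical in-memory file, assuming the header only needs to be skipped (`skiprows=1`). `read_vec` is an illustrative name, not part of the lab code.

```python
import csv
import io

import pandas as pd

def read_vec(handle):
    """Read word vectors from a .vec-style text file whose first line is a '<count> <dim>' header."""
    df = pd.read_csv(handle, sep=" ", quoting=csv.QUOTE_NONE, header=None,
                     index_col=0, skiprows=1)
    # Drop columns that are entirely empty (e.g. produced by a trailing space on each line).
    df = df.dropna(axis=1, how="all")
    return {token: row.values for token, row in df.iterrows()}

# Hypothetical file content: header line, then two 3-dimensional vectors.
toy_vec = io.StringIO(
    "2 3\n"
    "hello 0.1 0.2 0.3\n"
    "world 0.4 0.5 0.6\n"
)
vectors = read_vec(toy_vec)
print(len(vectors), vectors["world"])
```

A file of this size (1.2GB compressed) may also need considerably more memory than the 25-dimensional GloVe file.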


In [35]:
#TODO try with the different embeddings in resources directory and see which one obtains better results

# set path for the GloVe Twitter embeddings and load them
pretrained_wv_path = '../resources/glove.twitter.27B.25d.txt.gz'
glove = read_glove(pretrained_wv_path, 25)  # this file holds 25-dimensional vectors

# path for the fastText embeddings (loading is left commented out)
pretrained_wv_fasttext_path = '../resources/cc.en.300.vec.gz'
# glove_fasttext = read_glove(pretrained_wv_fasttext_path, 300)

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score

def fit_lr(train_x, train_y, c=1.0, weights='equal'):
    # 'equal' means no class re-weighting, which is scikit-learn's class_weight=None
    class_weight = None if weights == 'equal' else weights
    logreg = Pipeline([
        ("scaler", StandardScaler()),
        ("logit", LogisticRegression(C=c, solver="lbfgs", class_weight=class_weight, max_iter=1000))
    ])
    logreg.fit(train_x, train_y)
    return logreg
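
The later cells call this function with `weights='balanced'`. As a minimal sketch of what that setting means in scikit-learn, the weights can be reproduced by hand with the formula `n_samples / (n_classes * count_per_class)`. The label array `toy_y` is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical encoded labels: class 0 is twice as frequent as classes 1 and 2.
toy_y = np.array([0, 0, 0, 0, 1, 1, 2, 2])
classes = np.unique(toy_y)

# The "balanced" heuristic: n_samples / (n_classes * count_per_class)
manual = len(toy_y) / (len(classes) * np.bincount(toy_y))
from_sklearn = compute_class_weight("balanced", classes=classes, y=toy_y)

print(manual)
print(from_sklearn)
```

Rare classes receive larger weights, so their errors cost more during training.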

# ASSIGNMENT 3

+ TODO: evaluate accuracy, F1 macro, F1 micro using sklearn functions. HINT: Check previous labs.
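
A minimal sketch of the three metrics on invented predictions, to make the role of the `labels` argument concrete. The lists `gold` and `pred` are hypothetical. Restricting the F1 average to AGAINST and FAVOR mirrors the `labels=[0, 1]` argument used in the cell below.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.preprocessing import LabelEncoder

# Hypothetical gold and predicted stances, invented for illustration.
gold = ["AGAINST", "AGAINST", "AGAINST", "FAVOR", "FAVOR", "NONE", "NONE", "NONE"]
pred = ["AGAINST", "AGAINST", "FAVOR", "FAVOR", "NONE", "NONE", "NONE", "AGAINST"]

# LabelEncoder sorts the class names, so AGAINST -> 0, FAVOR -> 1, NONE -> 2.
enc = LabelEncoder().fit(gold)
y_true, y_pred = enc.transform(gold), enc.transform(pred)

acc = accuracy_score(y_true, y_pred)
# labels=[0, 1] averages F1 over AGAINST and FAVOR only; NONE is left out of the average.
f1_macro = f1_score(y_true, y_pred, average="macro", labels=[0, 1])
f1_micro = f1_score(y_true, y_pred, average="micro", labels=[0, 1])

print(dict(zip(enc.classes_, range(len(enc.classes_)))))
print(acc, f1_macro, f1_micro)
```

Macro F1 averages the per-class scores, while micro F1 pools the true positives, false positives and false negatives of the selected classes before computing a single score.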

In [42]:
best_c = 1.2
class_weight = 'balanced'

# prediction map
label_map = {0: 'AGAINST',
             1: 'FAVOR',
             2: 'NONE'}

predictions = pd.DataFrame()

training_texts, training_labels = data_as_numpy(training_preproc_data)
train_x = gloveVectorize(glove, training_texts)
train_y, labels = encode_labels(training_labels)

test_texts, test_labels = data_as_numpy(test_preproc_data)
test_x = gloveVectorize(glove, test_texts)
test_y, labels = encode_labels(test_labels)

# fit model
logreg = fit_lr(train_x, train_y, c=best_c, weights=class_weight)

# predict
stance = logreg.predict(test_x)
stance_probs = logreg.predict_proba(test_x)

# TODO evaluate accuracy, F1 macro, F1 micro using sklearn functions

# .copy() gives independent frames, so adding a column does not trigger pandas' SettingWithCopyWarning
predictions = test_data[['ID', 'Target', 'Tweet']].copy()
predictions_probs = test_data[['ID', 'Target', 'Tweet']].copy()
predictions['Stance'] = [label_map[s] for s in stance]
predictions_probs['Stance'] = list(stance_probs)

predictions = predictions.sort_values(by='ID')
predictions_probs = predictions_probs.sort_values(by='ID')

print("Accuracy:", accuracy_score(test_y, stance))
# labels=[0, 1] restricts the F1 average to AGAINST and FAVOR; NONE is left out of the average
print("F1 macro:", f1_score(test_y, stance, average="macro", labels=[0, 1]))
print("F1 micro:", f1_score(test_y, stance, average="micro", labels=[0, 1]))
display(predictions)

Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469


Unnamed: 0,ID,Target,Tweet,Stance
0,10001,Atheism,He who exalts himself shall be humbled; a...,AGAINST
1,10002,Atheism,RT @prayerbullets: I remove Nehushtan -previou...,NONE
2,10003,Atheism,@Brainman365 @heidtjj @BenjaminLives I have so...,AGAINST
3,10004,Atheism,#God is utterly powerless without Human interv...,FAVOR
4,10005,Atheism,@David_Cameron Miracles of #Multiculturalism...,NONE
...,...,...,...,...
1244,11245,Legalization of Abortion,@MetalheadMonty @tom_six I followed him before...,AGAINST
1245,11246,Legalization of Abortion,"For he who avenges blood remembers, he does no...",AGAINST
1246,11247,Legalization of Abortion,Life is sacred on all levels. Abortion does no...,AGAINST
1247,11248,Legalization of Abortion,"@ravensymone U refer to ""WE"" which =""YOU"" & a ...",FAVOR


# (BONUS) ASSIGNMENT 4

Check the Feature-based lab and perform the following steps:

+ TODO: Change code from above to use SVM instead of logistic regression.
+ TODO modify code to run over every target in the dataframes for training and test.

Write a table to:
+ TODO: Compare results with logistic regression and SVM using word embeddings on the test set.
+ TODO: Compare results with feature-based SVM on the test set.

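With averaged GloVe vectors as features, the SVM outperforms logistic regression by roughly 2.5 points of F1 macro (about 0.495 vs. 0.470). The feature-based SVM obtains much better results than this embedding-based SVM for most targets; the only exception is the target Climate Change is a Real Concern. Note that the loops below do not filter the data by `target`, so every row of the results table is computed over the full training and test sets; the logistic regression scores are therefore identical across targets.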
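
Restricting a run to one target comes down to a boolean mask on the `Target` column, applied to both the training and the test frame before vectorizing. Here is a minimal sketch on a hypothetical miniature frame; the rows are invented and only the column names match the preprocessed data.

```python
import pandas as pd

# Hypothetical miniature version of the preprocessed frame (same column names).
toy = pd.DataFrame({
    "Target": ["Atheism", "Atheism", "Hillary Clinton"],
    "Clean_tweet": [["dear", "lord"], ["blessed"], ["vote"]],
    "Stance": ["AGAINST", "FAVOR", "NONE"],
})

per_target = {}
for target in sorted(set(toy["Target"])):
    # Boolean mask: keep only the rows annotated for the current target.
    subset = toy[toy["Target"] == target]
    per_target[target] = subset
    print(target, len(subset))
```

With real data, each subset would then go through `data_as_numpy`, `gloveVectorize` and `encode_labels` exactly as the full frames do.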

In [49]:
from sklearn.svm import LinearSVC
def fit_svm(train_x, train_y, c=1.0, weights='equal'):
    # 'equal' means no class re-weighting, which is scikit-learn's class_weight=None
    class_weight = None if weights == 'equal' else weights
    svm = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", LinearSVC(C=c, class_weight=class_weight, max_iter=1000))
    ])
    svm.fit(train_x, train_y)
    return svm

In [50]:
best_c = 1.2
class_weight = 'balanced'

# prediction map
label_map = {0: 'AGAINST',
             1: 'FAVOR',
             2: 'NONE'}

results = {}
for target in targets:
    results[target] = {'svm': {}, 'logreg': {}}

for target in targets:
    print('Running experiments for "{}"'.format(target))
    predictions = pd.DataFrame()

    training_texts, training_labels = data_as_numpy(training_preproc_data)
    train_x = gloveVectorize(glove, training_texts)
    train_y, labels = encode_labels(training_labels)

    test_texts, test_labels = data_as_numpy(test_preproc_data)
    test_x = gloveVectorize(glove, test_texts)
    test_y, labels = encode_labels(test_labels)

    # fit model
    svm = fit_svm(train_x, train_y, c=best_c, weights=class_weight)

    # predict
    stance = svm.predict(test_x)

    # TODO evaluate accuracy, F1 macro, F1 micro using sklearn functions

    predictions = test_data.loc[:, ['ID', 'Target', 'Tweet']]
    predictions['Stance'] = [label_map[s] for s in stance]

    predictions = predictions.sort_values(by='ID')

    accuracy = accuracy_score(test_y, stance)
    f1_macro = f1_score(test_y, stance, average="macro", labels=[0, 1])
    f1_micro = f1_score(test_y, stance, average="micro", labels=[0, 1])
    results[target]["svm"] = {"accuracy": accuracy, "f1_macro": f1_macro, "f1_micro": f1_micro}
    print("Accuracy:", accuracy)
    print("F1 macro:", f1_macro)
    print("F1 micro:", f1_micro)

Running experiments for "Climate Change is a Real Concern"
Accuracy: 0.5132105684547638
F1 macro: 0.4987915535956581
F1 micro: 0.5668724279835392
Running experiments for "Hillary Clinton"
Accuracy: 0.5132105684547638
F1 macro: 0.4962636222106902
F1 micro: 0.5661348430262482
Running experiments for "Atheism"
Accuracy: 0.510808646917534
F1 macro: 0.4942069331212302
F1 micro: 0.5634961439588689
Running experiments for "Feminist Movement"
Accuracy: 0.5124099279423538
F1 macro: 0.4965364804587935
F1 micro: 0.5651055069480185
Running experiments for "Legalization of Abortion"
Accuracy: 0.5116092874299439
F1 macro: 0.4949161529793862
F1 micro: 0.5645244215938304


In [51]:
for target in targets:
    print('Running experiments for "{}"'.format(target))
    predictions = pd.DataFrame()

    training_texts, training_labels = data_as_numpy(training_preproc_data)
    train_x = gloveVectorize(glove, training_texts)
    train_y, labels = encode_labels(training_labels)

    test_texts, test_labels = data_as_numpy(test_preproc_data)
    test_x = gloveVectorize(glove, test_texts)
    test_y, labels = encode_labels(test_labels)

    # fit model
    logreg = fit_lr(train_x, train_y, c=best_c, weights=class_weight)

    # predict
    stance = logreg.predict(test_x)

    # TODO evaluate accuracy, F1 macro, F1 micro using sklearn functions

    predictions = test_data.loc[:, ['ID', 'Target', 'Tweet']]
    predictions['Stance'] = [label_map[s] for s in stance]

    predictions = predictions.sort_values(by='ID')

    accuracy = accuracy_score(test_y, stance)
    f1_macro = f1_score(test_y, stance, average="macro", labels=[0, 1])
    f1_micro = f1_score(test_y, stance, average="micro", labels=[0, 1])
    results[target]["logreg"] = {"accuracy": accuracy, "f1_macro": f1_macro, "f1_micro": f1_micro}
    print("Accuracy:", accuracy)
    print("F1 macro:", f1_macro)
    print("F1 micro:", f1_micro)

Running experiments for "Climate Change is a Real Concern"
Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469
Running experiments for "Hillary Clinton"
Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469
Running experiments for "Atheism"
Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469
Running experiments for "Feminist Movement"
Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469
Running experiments for "Legalization of Abortion"
Accuracy: 0.45396317053642915
F1 macro: 0.47028490522635824
F1 micro: 0.4897959183673469


In [52]:
from collections import defaultdict
data_results = defaultdict(list)
for target, values in results.items():
    for model, scores in values.items():
        data_results["target"].append(target)
        data_results["model"].append(model)
        data_results["accuracy"].append(scores["accuracy"])
        data_results["f1_macro"].append(scores["f1_macro"])
        data_results["f1_micro"].append(scores["f1_micro"])
        data_results["C"].append(best_c)  # record the value actually used instead of a hard-coded string

df_results = pd.DataFrame(data_results)
df_results.sort_values(by=['target', 'model'])

Unnamed: 0,target,model,accuracy,f1_macro,f1_micro,C
5,Atheism,logreg,0.453963,0.470285,0.489796,1.2
4,Atheism,svm,0.510809,0.494207,0.563496,1.2
1,Climate Change is a Real Concern,logreg,0.453963,0.470285,0.489796,1.2
0,Climate Change is a Real Concern,svm,0.513211,0.498792,0.566872,1.2
7,Feminist Movement,logreg,0.453963,0.470285,0.489796,1.2
6,Feminist Movement,svm,0.51241,0.496536,0.565106,1.2
3,Hillary Clinton,logreg,0.453963,0.470285,0.489796,1.2
2,Hillary Clinton,svm,0.513211,0.496264,0.566135,1.2
9,Legalization of Abortion,logreg,0.453963,0.470285,0.489796,1.2
8,Legalization of Abortion,svm,0.511609,0.494916,0.564524,1.2
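
For a side-by-side comparison, the long table can be pivoted so that each model gets its own column. This is a minimal sketch using the F1 macro values of two targets copied from the table above. The names `long_df`, `wide` and `svm_minus_logreg` are illustrative.

```python
import pandas as pd

# F1 macro values copied from two targets of the results table.
rows = [
    {"target": "Atheism", "model": "logreg", "f1_macro": 0.470285},
    {"target": "Atheism", "model": "svm", "f1_macro": 0.494207},
    {"target": "Hillary Clinton", "model": "logreg", "f1_macro": 0.470285},
    {"target": "Hillary Clinton", "model": "svm", "f1_macro": 0.496264},
]
long_df = pd.DataFrame(rows)

# One row per target, one column per model.
wide = long_df.pivot(index="target", columns="model", values="f1_macro")
# Difference between the two models for each target.
wide["svm_minus_logreg"] = wide["svm"] - wide["logreg"]
print(wide.round(3))
```

The same call should work on `df_results` directly, since it already has `target`, `model` and `f1_macro` columns.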
