# Classification

## Install packages

In [None]:
!pip install asent
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install gensim

## Import packages

In [3]:
from asent import lexicons
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import gensim.downloader
from sklearn.linear_model import LogisticRegression

## Load and split data

The data is loaded using the [```asent```](https://github.com/KennethEnevoldsen/asent) package. It contains 7504 words, each with a continuous sentiment label that was constructed by [two annotators](https://ojs.aaai.org/index.php/ICWSM/article/view/14550).

I have split the data into a training and a test set containing 80% and 20% of the data, respectively. It is important that we all use the same random state, to ensure we get the same split. That way, we can compare our models on the held-out test set next week.

In [4]:
lex = pd.DataFrame(lexicons.get("lexicon_en_v1").items(), columns=["word", "sentiment"])

train, test = train_test_split(lex, test_size=0.2, random_state=42)
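As a quick sanity check on why the shared random state matters, the sketch below splits a small made-up list twice with the same `random_state`, and once with a different one. The toy `words` list and the seed `7` are illustrative assumptions and are not part of the lexicon.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the lexicon (hypothetical data, for illustration only)
words = [f"word_{i}" for i in range(20)]

# Same random_state -> the same split on every run
train_a, test_a = train_test_split(words, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(words, test_size=0.2, random_state=42)

# A different random_state will usually give a different split
train_c, test_c = train_test_split(words, test_size=0.2, random_state=7)

print(test_a == test_b)           # True
print(len(train_a), len(test_a))  # 16 4
```

If anyone changes the seed, their test set may overlap with someone else's training set, and the models would no longer be comparable.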

## Preprocessing and feature generation

To make the task a bit simpler, we binarise the sentiment label. We consider all words with a sentiment score above 0 as positive (1) and all words with a sentiment score below 0 as negative (0). Note that the code below also assigns 0 to any word whose score is exactly 0.

In [7]:
y = [1 if x>0 else 0 for x in train["sentiment"]]
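The same rule can also be written as a vectorised comparison. The sketch below uses a handful of made-up scores (not taken from the lexicon) to show how the rule treats the edge case of a score of exactly 0.

```python
import numpy as np

# Made-up sentiment scores (illustrative, not taken from the lexicon)
scores = np.array([1.5, -2.1, 0.0, 0.1, -0.1])

# Same rule as the list comprehension above: strictly positive -> 1, otherwise 0
labels = (scores > 0).astype(int)

print(labels.tolist())  # [1, 0, 0, 1, 0]
```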

In [8]:
print(train["sentiment"].to_list())
print(y)

[1.5, 1.3, 2.8, 0.8, 0.6, 0.1, -2.1, 0.5, -1.0, 2.2, 2.3, 3.0, -1.9, 2.1, -1.9, -0.1, -0.9, -1.6, -1.8, 2.4, -2.8, 1.4, -2.1, -1.9, -1.0, -3.4, 1.3, -1.6, 0.3, -2.1, -1.2, -1.5, 1.1, 2.8, 1.6, -1.7, -2.3, 0.8, 0.1, -2.8, -2.3, -1.6, -0.9, -1.0, 1.9, 1.5, 1.9, -1.6, 2.6, 2.6, -0.8, -1.6, -2.1, 2.2, -0.5, 3.1, 1.8, 1.6, 0.7, 0.3, -0.2, -1.3, -2.4, -0.8, -1.2, -2.6, 1.8, -0.9, 1.8, -1.1, -2.2, 1.5, 2.8, -1.2, -1.5, -1.2, 2.9, 1.6, 0.4, -2.2, -1.2, 1.3, 2.2, 1.9, 2.3, -2.7, 2.2, -1.2, -1.2, 0.2, -1.8, -2.9, -1.6, -1.7, 0.5, 2.1, 1.6, 2.5, 1.4, 0.5, 0.9, 0.8, -0.1, 1.8, -0.9, -2.1, -3.1, -2.0, -2.4, 0.6, 1.3, 1.6, 0.9, 1.8, -2.1, 1.9, 1.6, -1.8, 1.2, -1.2, -1.6, -1.6, -1.0, -2.2, 2.0, -0.2, -1.4, 2.0, -1.3, 1.7, -1.9, -1.9, 2.2, -1.0, -1.6, -1.1, 0.1, 1.1, 1.7, -2.0, 1.2, -1.7, 2.7, -0.5, -0.9, -1.5, 1.6, 1.1, -1.3, -2.7, 1.2, 2.2, -0.4, -1.2, -0.8, -2.1, 1.4, 1.1, -1.3, -2.0, 1.3, -1.2, 1.3, 0.2, -0.7, -2.1, -1.6, 1.0, -1.9, -1.0, 1.9, 1.6, -1.2, 0.1, 1.5, -1.9, -0.9, 1.5, -2.0, -2.3, -1.3

Now we have our labels for training, and we can start selecting what features to use to predict the sentiment of the words. I'm going to use the ```glove``` embeddings from earlier, but feel free to use any other embeddings or features you think will work well.

In [9]:
embeddings = gensim.downloader.load("glove-wiki-gigaword-300")

Some words are not in the embedding model vocabulary, so we need to decide how to represent them instead. I'm going to use a zero vector, but another common approach is to use the average of the embeddings of all other words (mean imputation).

In [10]:
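# Look up each word's vector; words missing from the vocabulary get a zero vector.
# key_to_index is a dict, so the membership test is fast (index_to_key is a list).
features = [embeddings[word] if word in embeddings.key_to_index else np.zeros(shape=300) for word in train["word"]]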
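For comparison, mean imputation could look roughly like the sketch below. To keep it small and runnable without downloading a model, it uses a hypothetical dictionary of 3-dimensional vectors (`toy_embeddings`) in place of the GloVe model, and it takes the mean over the words in the list that do have a vector, which is one of several reasonable choices.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the GloVe model
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([-1.0, 0.0, 0.0]),
}
words = ["good", "bad", "zzzunknown"]

# Mean of the vectors of the words that do have an embedding
known = [toy_embeddings[w] for w in words if w in toy_embeddings]
mean_vector = np.mean(known, axis=0)

# Fall back to the mean vector instead of a zero vector
features_mean = np.array([toy_embeddings.get(w, mean_vector) for w in words])

print(features_mean[2])  # [0. 0. 1.]
```

With the real model, the same idea should carry over by replacing the dictionary lookups with a membership check on the embedding model and indexing into it.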

In [11]:
features

[array([-0.52907  ,  0.16877  ,  0.12275  ,  0.090413 ,  0.41462  ,
        -0.26423  , -0.18562  , -0.11718  ,  0.35143  ,  0.60676  ,
        -0.36296  ,  0.85104  , -0.0052894, -0.23895  ,  0.31248  ,
         0.73722  ,  0.33408  ,  0.66874  , -0.047936 ,  0.021481 ,
        -0.61175  ,  0.54261  , -0.52763  , -0.31914  ,  0.37443  ,
        -0.43761  , -0.32753  ,  0.63802  ,  0.25847  , -0.14036  ,
        -0.25623  , -0.81164  ,  0.7146   ,  0.099997 ,  0.69452  ,
         0.22954  , -0.65204  ,  0.02937  ,  0.10346  ,  0.01238  ,
        -0.24625  ,  0.42367  ,  0.41618  , -0.40403  , -0.33678  ,
         0.43976  , -0.059189 ,  0.094295 , -0.30645  , -0.36796  ,
        -0.16531  , -0.47487  ,  0.19546  ,  0.039917 ,  0.60604  ,
        -0.36435  , -0.15549  , -0.24495  , -0.040387 ,  0.52361  ,
        -0.38576  ,  0.72895  , -0.35242  ,  0.10104  , -0.1106   ,
         0.6565   , -0.21597  ,  0.18875  ,  0.13656  ,  0.76231  ,
        -0.33465  , -0.16709  , -0.10284  ,  0.2

Now we just need to transform the features to a format that the classifier can use. I'm going to use a simple logistic regression model from ```sklearn```, but you can use any other classifier you think will work well.

In [12]:
X = np.array(features)

In [13]:
clf = LogisticRegression(random_state=42)

Once you have specified your model, you can fit it to the training data.

In [14]:
clf.fit(X, y)

The fitted classifier has a ```score()``` method that takes features and true labels and returns the accuracy. It uses the fitted model to predict labels from the features and reports the share of predictions that match the true labels.

In [15]:
clf.score(X, y)

0.7881059470264867
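To make concrete what ```score()``` computes, here is a minimal sketch on synthetic data. The names `X_toy`, `y_toy` and `clf_toy` are made up for this example, and the data comes from `make_classification` rather than from the lexicon.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the embedding features (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

clf_toy = LogisticRegression(random_state=42).fit(X_toy, y_toy)

# Accuracy computed by hand: share of predictions that match the true labels
manual_accuracy = np.mean(clf_toy.predict(X_toy) == y_toy)

print(np.isclose(manual_accuracy, clf_toy.score(X_toy, y_toy)))  # True
```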

Now you have a sentiment classifier! If there indeed is a strong relationship between the sentiment of a word and its embedding, the relationships learned during training should generalise to the test set, which we will test next week.

For now, you can iteratively improve the model by tweaking different parts of the pipeline and re-evaluating the performance on the training set. Here are some things you can try:

- try embedding models trained on different data (e.g., you can get an overview of the embedding models that are available through the gensim API by running ```gensim.downloader.info()["models"].keys()```)
- try mean imputation
- try to change the parameters of the model (you can find an overview of the parameters in the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html))
- try different models (you can find an overview of the sklearn supervised learning library [here](https://scikit-learn.org/stable/supervised_learning.html))
- try to use the continuous sentiment scores as labels - can you get similar performance? (why/why not?)
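
For the last suggestion, keep in mind that a classifier such as logistic regression expects discrete classes, so the continuous scores call for a regression model instead. The sketch below is one possible setup, not the intended solution: it assumes `Ridge` as the regressor, uses synthetic features and scores in place of the real data, and binarises the predictions afterwards so the result can be compared with a classifier's accuracy.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Synthetic features and continuous scores (illustrative stand-ins for X and the sentiment column)
X_toy = rng.normal(size=(200, 10))
true_weights = rng.normal(size=10)
scores_toy = X_toy @ true_weights + rng.normal(scale=0.5, size=200)

# Fit a regression model on the continuous scores
reg = Ridge(alpha=1.0).fit(X_toy, scores_toy)

# Binarise predictions with the same rule as the labels (> 0 is positive)
predicted_labels = (reg.predict(X_toy) > 0).astype(int)
true_labels = (scores_toy > 0).astype(int)

# Share of words whose predicted sign matches the true sign
accuracy = np.mean(predicted_labels == true_labels)
print(accuracy)
```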

In [16]:
gensim.downloader.info()["models"].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])