# Classification

## Install packages

In [1]:
!pip install asent
!pip install pandas
!pip install numpy
!pip install scikit-learn
!pip install gensim


## Import packages

In [2]:
from asent import lexicons
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import gensim.downloader
from sklearn.linear_model import LogisticRegression

## Load and split data

The data is loaded using the [```asent```](https://github.com/KennethEnevoldsen/asent) package. It contains 7504 words, each with a continuous sentiment label constructed by [two annotators](https://ojs.aaai.org/index.php/ICWSM/article/view/14550).

I have split the data into a training and a test set containing 80% and 20% of the data, respectively. It is important that we all use the same random state, to ensure we get the same split. That way, we can compare our models on the held-out test set next week.

In [3]:
lex = pd.DataFrame(lexicons.get("lexicon_en_v1").items(), columns=["word", "sentiment"])

train, test = train_test_split(lex, test_size=0.2, random_state=42)
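
To see why a shared random state matters, here is a minimal sketch on a made-up toy lexicon (the words and scores in ```toy_lex``` are invented for illustration and are not taken from the real data). Calling ```train_test_split``` twice with the same ```random_state``` should select exactly the same rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the lexicon; words and scores are made up for illustration.
toy_lex = pd.DataFrame({
    "word": ["good", "bad", "great", "awful", "fine",
             "poor", "nice", "sad", "happy", "grim"],
    "sentiment": [1.9, -2.5, 3.1, -2.0, 0.8, -2.1, 1.8, -2.1, 2.7, -1.5],
})

# Two independent calls with the same random_state pick the same rows.
train_a, test_a = train_test_split(toy_lex, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy_lex, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```

With a different ```random_state``` (or none at all), the rows in each split would generally differ, and results on the test set would no longer be comparable between people.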

## Preprocessing and feature generation

To make the task a bit simpler, we binarise the sentiment label. We consider all words with a sentiment score above 0 as positive (1) and all other words as negative (0). Note that the condition ```x>0``` in the cell below also assigns a word with a score of exactly 0 to the negative class.

In [4]:
y = [1 if x>0 else 0 for x in train["sentiment"]]
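
As an aside, the same binarisation can be written as a vectorised NumPy expression. This is only a sketch on made-up scores (the ```scores``` array is hypothetical), and it follows the same rule as the list comprehension above: anything not strictly above 0 becomes 0.

```python
import numpy as np

# Made-up sentiment scores, including an exact zero and values close to zero.
scores = np.array([1.5, -2.1, 0.0, 0.1, -0.1])

# Strictly positive scores become 1, everything else (including 0.0) becomes 0.
labels = (scores > 0).astype(int)

print(labels)
```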

In [5]:
print(train["sentiment"].to_list())
print(y)

[1.5, 1.3, 2.8, 0.8, 0.6, 0.1, -2.1, 0.5, -1.0, 2.2, 2.3, 3.0, -1.9, 2.1, -1.9, -0.1, -0.9, -1.6, -1.8, 2.4, -2.8, 1.4, -2.1, -1.9, -1.0, -3.4, 1.3, -1.6, 0.3, -2.1, -1.2, -1.5, 1.1, 2.8, 1.6, -1.7, -2.3, 0.8, 0.1, -2.8, -2.3, -1.6, -0.9, -1.0, 1.9, 1.5, 1.9, -1.6, 2.6, 2.6, -0.8, -1.6, -2.1, 2.2, -0.5, 3.1, 1.8, 1.6, 0.7, 0.3, -0.2, -1.3, -2.4, -0.8, -1.2, -2.6, 1.8, -0.9, 1.8, -1.1, -2.2, 1.5, 2.8, -1.2, -1.5, -1.2, 2.9, 1.6, 0.4, -2.2, -1.2, 1.3, 2.2, 1.9, 2.3, -2.7, 2.2, -1.2, -1.2, 0.2, -1.8, -2.9, -1.6, -1.7, 0.5, 2.1, 1.6, 2.5, 1.4, 0.5, 0.9, 0.8, -0.1, 1.8, -0.9, -2.1, -3.1, -2.0, -2.4, 0.6, 1.3, 1.6, 0.9, 1.8, -2.1, 1.9, 1.6, -1.8, 1.2, -1.2, -1.6, -1.6, -1.0, -2.2, 2.0, -0.2, -1.4, 2.0, -1.3, 1.7, -1.9, -1.9, 2.2, -1.0, -1.6, -1.1, 0.1, 1.1, 1.7, -2.0, 1.2, -1.7, 2.7, -0.5, -0.9, -1.5, 1.6, 1.1, -1.3, -2.7, 1.2, 2.2, -0.4, -1.2, -0.8, -2.1, 1.4, 1.1, -1.3, -2.0, 1.3, -1.2, 1.3, 0.2, -0.7, -2.1, -1.6, 1.0, -1.9, -1.0, 1.9, 1.6, -1.2, 0.1, 1.5, -1.9, -0.9, 1.5, -2.0, -2.3, -1.3

Now we have our labels for training, and we can start selecting which features to use to predict the sentiment of the words. I originally used the ```glove``` embeddings from earlier, but the cell below loads ```fasttext-wiki-news-subwords-300``` instead; feel free to use any other embeddings or features you think will work well.

In [25]:
gensim.downloader.info()["models"].keys()

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])

In [26]:
embeddings = gensim.downloader.load("fasttext-wiki-news-subwords-300")



Some words are not in the embedding model vocabulary, so we need to decide how to represent them instead. I'm going to use a zero vector, but another common approach is to use the average of the embeddings of all other words (mean imputation).

In [27]:
# key_to_index is a dict, so the membership test is fast; out-of-vocabulary words get a zero vector
features = [embeddings[w] if w in embeddings.key_to_index else np.zeros(shape=300) for w in train["word"]]
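
For comparison, mean imputation could look roughly like the sketch below. It uses a hypothetical dictionary of 3-dimensional vectors (```toy_embeddings```) in place of the gensim model so that it runs without downloading anything. With the real model, the lookup would use ```embeddings[w]``` and the vocabulary check from the cell above.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the gensim model.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "bad": np.array([-1.0, 2.0, 0.0]),
}
words = ["good", "zzz", "bad"]  # "zzz" is out of vocabulary

# Average the vectors of all words that do have an embedding.
known = [toy_embeddings[w] for w in words if w in toy_embeddings]
mean_vector = np.mean(known, axis=0)

# Out-of-vocabulary words get the mean vector instead of a zero vector.
features_mean = np.array([toy_embeddings.get(w, mean_vector) for w in words])

print(features_mean)
```

The mean should be computed from the training words only, and then reused for the test words, so that no information from the test set leaks into the features.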

In [28]:
features

[array([ 1.4775e-02,  5.5670e-03,  1.4796e-02,  1.5721e-03, -1.1169e-03,
        -4.1300e-02,  1.8318e-02, -9.0461e-02, -5.6574e-02, -1.1036e-01,
         8.3429e-02,  4.8504e-02, -3.2571e-02,  5.2753e-03,  9.1695e-03,
         1.1335e-02,  7.3180e-02,  3.3428e-02,  4.0374e-02,  3.1039e-02,
         2.8389e-02,  7.2744e-02,  1.6404e-02,  3.5439e-02,  9.4774e-03,
        -9.1862e-03,  2.0759e-02, -6.7962e-02, -2.7318e-02,  4.3881e-04,
        -3.9674e-02, -3.9086e-02, -1.8275e-02, -1.7510e-02, -2.1018e-02,
        -5.2563e-02,  2.1687e-02,  4.4847e-02,  3.6356e-02,  1.6613e-02,
         3.2873e-02, -1.1656e-01, -8.6073e-02, -3.8387e-02, -8.9766e-02,
         6.2010e-03,  5.0888e-02,  2.1526e-02,  5.4192e-02, -1.3813e-02,
        -9.4073e-03,  2.5594e-02, -3.0182e-02,  1.0263e-02,  3.2720e-02,
        -8.3105e-02, -1.8027e-02,  3.2253e-02,  1.9499e-02,  3.2904e-02,
        -3.5459e-02, -1.5877e-02,  1.0960e-02,  1.2649e-02,  2.1588e-02,
         2.8503e-02,  3.0607e-02,  3.1749e-02,  4.3

Now we just need to transform the features to a format that the classifier can use. A simple logistic regression model from ```sklearn``` is a good starting point; the cell below uses a ```DecisionTreeClassifier``` instead, and you can swap in any other classifier you think will work well.

In [29]:
X = np.array(features)

In [33]:
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

clf = DecisionTreeClassifier(random_state=42)

Once you have specified your model, you can fit it to the training data.

In [34]:
clf.fit(X, y)
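
Because scikit-learn classifiers share the same ```fit()``` / ```score()``` interface, swapping models only requires changing one line. The sketch below illustrates this on synthetic data (```X_toy``` and ```y_toy``` are randomly generated and unrelated to the lexicon); the choice of candidates and their settings is only an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and labels, generated only for this illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=20, random_state=42),
}

# Every classifier is fitted and scored with exactly the same calls.
train_scores = {
    name: model.fit(X_toy, y_toy).score(X_toy, y_toy)
    for name, model in candidates.items()
}

for name, score in train_scores.items():
    print(name, round(score, 2))
```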

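Scikit-learn classifiers have a ```score()``` method that takes features and true labels and returns an accuracy score. It uses the fitted model to predict labels from the features and compares them to the true labels. Because the cell below scores the model on the data it was trained on, the result is the training accuracy, which tends to overestimate how well the model generalises.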

In [36]:
clf.score(X, y)

# Training accuracy by embedding model and classifier:
# glove model:
#   LogisticRegression      0.78
#   MLPClassifier           0.87
#   DecisionTreeClassifier  0.87
#   RandomForestClassifier  0.87
# fasttext-wiki-news-subwords-300:
#   RandomForestClassifier  0.9
#   DecisionTreeClassifier  0.9




0.9072130601365984
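
To make the behaviour of ```score()``` concrete, the sketch below checks on synthetic data (randomly generated, not the lexicon) that it matches the fraction of correct predictions computed by hand and with ```accuracy_score```.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

toy_clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Three ways of computing the same number: the share of correct predictions.
via_score = toy_clf.score(X_toy, y_toy)
via_metric = accuracy_score(y_toy, toy_clf.predict(X_toy))
via_numpy = np.mean(toy_clf.predict(X_toy) == y_toy)

print(via_score, via_metric, via_numpy)
```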

In [41]:
### ... extra: evaluate classifier
from sklearn import metrics

# create y_test and X_test from the held-out test split
y_test = [1 if x > 0 else 0 for x in test["sentiment"]]
test_features = [embeddings[w] if w in embeddings.key_to_index else np.zeros(shape=300) for w in test["word"]]
X_test = np.array(test_features)

# create predictions
y_pred = clf.predict(X_test)

# per-class precision, recall and F1 on the test set
classifier_metrics = metrics.classification_report(y_test, y_pred)

print('classifier metrics:')
print(classifier_metrics)



classifier metrics:
              precision    recall  f1-score   support

           0       0.72      0.80      0.76       866
           1       0.68      0.58      0.63       635

    accuracy                           0.71      1501
   macro avg       0.70      0.69      0.69      1501
weighted avg       0.70      0.71      0.70      1501
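
The precision and recall columns in a report like this can be derived from the confusion matrix. The sketch below does so for the positive class on made-up labels (```y_true_toy``` and ```y_pred_toy``` are hypothetical and unrelated to the results above).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions, for illustration only.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: share of predicted positives that are truly positive.
precision = tp / (tp + fp)
# Recall: share of true positives that the model found.
recall = tp / (tp + fn)

print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
```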



Now you have a sentiment classifier! If there indeed is a strong relationship between the sentiment of a word and its embedding, the relationships learned during training should generalise to the test set, which we will test next week.

For now, you can iteratively improve the model by tweaking different parts of the pipeline and re-evaluating the performance on the training set. Here are some parameters you can try to change:

- try embedding models trained on different data (e.g., you can get an overview of the embedding models that are available through the gensim API by running ```gensim.downloader.info()["models"].keys()```)
- try mean imputation
- try to change the parameters of the model (you can find an overview of the parameters in the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html))
- try different models (you can find an overview of the sklearn supervised learning library [here](https://scikit-learn.org/stable/supervised_learning.html))
- try to use the continuous sentiment scores as labels - can you get similar performance? (why/why not?)
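
For the last suggestion, one possible approach (an assumption, not the only option) is to fit a regression model on the continuous scores and then compare the sign of the predicted score with the sign of the true score, which makes the result comparable to the binary accuracy. The sketch below uses ```Ridge``` on synthetic data; ```X_syn``` and ```scores_syn``` are randomly generated stand-ins for the embeddings and sentiment scores.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic "embeddings" and continuous "sentiment scores", for illustration only.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
w_syn = rng.normal(size=10)
scores_syn = X_syn @ w_syn + rng.normal(scale=0.5, size=200)

X_tr, X_te, s_tr, s_te = train_test_split(
    X_syn, scores_syn, test_size=0.2, random_state=42
)

# Fit a regression model directly on the continuous scores.
reg = Ridge(alpha=1.0).fit(X_tr, s_tr)
pred_scores = reg.predict(X_te)

# R^2 measures how well the continuous scores are predicted.
r2 = reg.score(X_te, s_te)
# Thresholding at 0 turns the regression output into a binary prediction.
sign_accuracy = np.mean((pred_scores > 0) == (s_te > 0))

print(round(r2, 2), round(sign_accuracy, 2))
```

Note that for regressors ```score()``` returns R² rather than accuracy, so the two numbers are not directly comparable without the thresholding step.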