Notebook prepared by Henrique Lopes Cardoso (hlc@fe.up.pt).

# WORD EMBEDDINGS FOR CLASSIFICATION


## Pretrained word embeddings

We can make use of pretrained word embeddings to represent our input text in a classification problem. Let's try it out with the embeddings we've trained in the word embeddings notebook, which have the advantage of having been trained on data that is similar to our classification task's data (reviews). You could try other embeddings (such as those available in [Gensim](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html)).
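If you want to try other embeddings, a minimal sketch using Gensim's downloader API could look like the following. The helper name `load_pretrained_vectors` is hypothetical, and `glove-wiki-gigaword-50` is just one of the pretrained sets the downloader lists. Those vectors have 50 dimensions rather than the 150 used in this notebook, so the feature sizes discussed later would change accordingly.

```python
def load_pretrained_vectors(name="glove-wiki-gigaword-50"):
    """Download (on first use) and load a pretrained KeyedVectors model.

    Assumption: `name` is one of the models listed by gensim.downloader.info().
    """
    # imported inside the function so that merely defining the helper
    # triggers neither the gensim import nor any download
    import gensim.downloader as api

    return api.load(name)


# Example usage (downloads the model the first time it is called):
# wv = load_pretrained_vectors()
# print(wv.vector_size)
```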


In [1]:
import gensim

wv = gensim.models.KeyedVectors.load("../reviews/reviews_wv/reviews_wv")

Let's load data for our classification task.


In [2]:
import pandas as pd
import re

# Importing the dataset
dataset = pd.read_csv("../data/restaurant_reviews.tsv", delimiter="\t", quoting=3)

dataset


| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
| ... | ... | ... |
| 995 | I think food should have flavor and texture an... | 0 |
| 996 | Appetite instantly gone. | 0 |
| 997 | Overall I was not impressed and would not go b... | 0 |
| 998 | The whole experience was underwhelming, and I ... | 0 |


To end up with tokens (words) in a form we can look up embeddings for, we'll limit ourselves to lower-case alphabetic sequences. For that, we do some preprocessing:


In [3]:
# cleanup
corpus = []
for text in dataset["Review"]:
    # replace non-alphabetic characters with spaces and convert to lower-case
    review = re.sub("[^a-zA-Z]", " ", text).lower()
    # add review to corpus
    corpus.append(review)
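
To see what this cleanup does to a single review, here is a small self-contained illustration. The helper name `clean_review` is hypothetical; it simply wraps the same regular expression used above. Note that punctuation is replaced by spaces rather than deleted, so contractions get split into separate tokens.

```python
import re


def clean_review(text):
    # replace every non-alphabetic character with a space, then lower-case
    return re.sub("[^a-zA-Z]", " ", text).lower()


sample = clean_review("Wow... Loved this place.")
print(sample.split())  # ['wow', 'loved', 'this', 'place']

# apostrophes are also replaced, so "Don't" becomes the two tokens "don" and "t"
print(clean_review("Don't go!").split())  # ['don', 't', 'go']
```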


Now we can convert our "cleaned" corpus into embeddings.


#### Fixing the length of the input

The reviews in our corpus have variable length. However, we need to represent them with a fixed-length vector of features. One way to do it is to impose a limit on the number of word embeddings we want to include.

To convert words into their vector representations (embeddings), let's create an auxiliary function that takes in the number of embeddings we wish to include in the representation:


In [4]:
import numpy as np


def text_to_vector(embeddings, text, sequence_len):
    # split text into tokens
    tokens = text.split()

    # convert tokens to embedding vectors, up to sequence_len tokens
    vec = []
    n = 0
    i = 0

    # while there are tokens and did not reach desired sequence length
    while i < len(tokens) and n < sequence_len:
        try:
            vec.extend(embeddings.get_vector(tokens[i]))
            n += 1
        except KeyError:
            pass  # simply ignore out-of-vocabulary tokens
        finally:
            i += 1

    # add blanks up to sequence_len, if needed
    for j in range(sequence_len - n):
        vec.extend(
            np.zeros(
                embeddings.vector_size,
            )
        )

    return vec


The _text_to_vector_ function above takes an _embeddings_ object (a Gensim `KeyedVectors` instance in our case), the _text_ to convert, and the number of words _sequence_len_ from _text_ to consider. It returns a vector with the concatenated embeddings of the first _sequence_len_ words found in _embeddings_ (tokens for which no embedding is found are skipped). If the text has fewer than _sequence_len_ words with embeddings, zero vectors are appended as padding.
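The truncation and padding behaviour is easier to see on a toy example. The sketch below is a condensed variant of the idea, not the notebook's exact function: `ToyEmbeddings` is a hypothetical stand-in that exposes only `get_vector` and `vector_size`, and `concat_embeddings` is an illustrative name. It assumes `sequence_len` is at least 1.

```python
import numpy as np


class ToyEmbeddings:
    """Hypothetical stand-in exposing the two members the function relies on."""

    def __init__(self, table):
        self.table = {w: np.asarray(v, dtype=float) for w, v in table.items()}
        self.vector_size = len(next(iter(self.table.values())))

    def get_vector(self, token):
        return self.table[token]  # raises KeyError for unknown tokens


def concat_embeddings(embeddings, text, sequence_len):
    # keep the vectors of the first sequence_len in-vocabulary tokens
    vectors = []
    for token in text.split():
        if len(vectors) == sequence_len:
            break
        try:
            vectors.append(embeddings.get_vector(token))
        except KeyError:
            pass  # out-of-vocabulary tokens are skipped
    # pad with zero vectors up to sequence_len
    while len(vectors) < sequence_len:
        vectors.append(np.zeros(embeddings.vector_size))
    return np.concatenate(vectors)


toy = ToyEmbeddings({"good": [1.0, 2.0], "food": [3.0, 4.0]})

# "really" is out of vocabulary, so one zero vector is appended as padding
print(concat_embeddings(toy, "really good food", 3).tolist())
# [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
```

The output length is always `sequence_len * vector_size`, regardless of how many tokens the text has.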

To better decide how many word embeddings we wish to append, let's learn a bit more about the length of each review in our corpus.


In [5]:
from scipy import stats

lens = [len(c.split()) for c in corpus]
print(np.min(lens), np.max(lens), np.mean(lens), np.std(lens), stats.mode(lens))


1 32 11.04 6.312242073938545 ModeResult(mode=array([4]), count=array([80]))




So, we have reviews ranging from 1 to 32 tokens (words), with a mean length of 11.04 and a standard deviation of 6.31; the most frequent review length is 4 (80 reviews).

Let's limit reviews to, say, length 10: longer reviews will get truncated, while shorter reviews will be padded with zero embeddings for the missing tokens. (Note: because _text_to_vector_ skips out-of-vocabulary tokens, padding may also be applied to reviews of length >= 10 if they contain fewer than 10 in-vocabulary tokens.)
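One way to sanity-check a cutoff such as 10 is to look at how many reviews fit within it. The sketch below uses a small made-up list of lengths (`toy_lens`, not the real corpus) and the standard library's `Counter` to find the most frequent length.

```python
from collections import Counter

import numpy as np

# hypothetical review lengths, for illustration only (not the notebook's corpus)
toy_lens = [4, 4, 7, 9, 10, 12, 15, 21]
sequence_len = 10

# most frequent length and how often it occurs
mode_len, mode_count = Counter(toy_lens).most_common(1)[0]

# share of reviews that fit entirely within sequence_len tokens
fits = float(np.mean(np.array(toy_lens) <= sequence_len))

print(mode_len, mode_count)  # 4 2
print(fits)  # 0.625
```

A larger `sequence_len` truncates fewer reviews but adds more padding (and more features) to the short ones.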


In [6]:
# convert corpus into dataset with appended embeddings representation
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_vector(wv, c, 10))

X = np.array(embeddings_corpus)
y = dataset["Liked"]

print(X.shape, y.shape)


(1000, 1500) (1000,)


As expected, our feature vectors have 1500 dimensions: 10 times the size of each embedding vector, which is 150 in this case.

Now we can use this feature representation to train a model! Try out training a Logistic Regression or a Support Vector Machine model.


In [7]:
# your code here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = [
    LogisticRegression(),
    LinearSVC(),
]

for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(clf.__class__.__name__)
    print(f"Score: {clf.score(X_test, y_test)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"F1: {f1_score(y_test, y_pred)}")
    print()

LogisticRegression
Score: 0.765
Accuracy: 0.765
Precision: 0.7692307692307693
Recall: 0.7766990291262136
F1: 0.7729468599033817

LinearSVC
Score: 0.75
Accuracy: 0.75
Precision: 0.7345132743362832
Recall: 0.8058252427184466
F1: 0.7685185185185185
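
As a quick consistency check on the numbers above, recall that F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the Logistic Regression values printed in the output; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# precision and recall copied from the Logistic Regression output above
f1 = f1_from(0.7692307692307693, 0.7766990291262136)
print(round(f1, 4))  # 0.7729
```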



#### Aggregating word embeddings


Instead of concatenating the embeddings of a fixed number of tokens, we could use the embeddings of all tokens in a review and take their mean. This way, we still get a fixed-length representation, whose size equals the embedding vector size (150 in our case).
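A toy illustration of the idea, using made-up 3-dimensional vectors rather than real embeddings. The helper `mean_or_zeros` is a hypothetical addition: it shows one possible way (an assumption, not something this notebook requires) to handle a text with no in-vocabulary tokens, where there would be nothing to average.

```python
import numpy as np

# hypothetical 3-dimensional embeddings for a three-token text
token_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
])

# averaging over the token axis gives one vector of the embedding size
mean_vector = token_vectors.mean(axis=0)
print(mean_vector.tolist())  # [2.0, 2.0, 2.0]


def mean_or_zeros(vectors, size):
    # assumption: fall back to a zero vector when there is nothing to average
    if len(vectors) == 0:
        return np.zeros(size)
    return np.mean(vectors, axis=0)


print(mean_or_zeros([], 3).tolist())  # [0.0, 0.0, 0.0]
```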

Implement the _text_to_mean_vector_ function, which takes the embeddings dictionary and the text to convert, and returns the mean of the embeddings of its tokens.


In [8]:
# your code here
def text_to_mean_vector(embeddings, text):
    # split text into tokens
    tokens = text.split()

    # convert tokens to embedding vectors
    vec = []
    i = 0
    while i < len(tokens):  # while there are tokens
        try:
            vec.append(embeddings.get_vector(tokens[i]))
        except KeyError:
            pass  # simply ignore out-of-vocabulary tokens
        finally:
            i += 1

    return np.mean(vec, axis=0)  # return the mean of vec


Use the above function to convert the corpus into a dataset with mean embeddings representation. The shape of the feature matrix _X_ should be _(1000, 150)_.


In [9]:
# your code here
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_mean_vector(wv, c))

X = np.array(embeddings_corpus)
y = dataset["Liked"]

print(X.shape, y.shape)


(1000, 150) (1000,)


Now we can use this mean embeddings representation to train a model! Try out training a Logistic Regression or a Support Vector Machine model.


In [10]:
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(clf.__class__.__name__)
    print(f"Score: {clf.score(X_test, y_test)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"F1: {f1_score(y_test, y_pred)}")
    print()


LogisticRegression
Score: 0.83
Accuracy: 0.83
Precision: 0.8415841584158416
Recall: 0.8252427184466019
F1: 0.8333333333333333

LinearSVC
Score: 0.82
Accuracy: 0.82
Precision: 0.8252427184466019
Recall: 0.8252427184466019
F1: 0.8252427184466019



It is also possible to use other aggregation functions, besides taking the mean of the word embeddings. For instance, we could take the element-wise _max_. Try it out and check if you notice any changes in the performance of the models!
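To make the difference between the two aggregations concrete, here is a small comparison on made-up vectors (not real embeddings). The element-wise max keeps, for each dimension, the largest value seen across tokens, whereas the mean blends all tokens together.

```python
import numpy as np

# hypothetical 3-dimensional embeddings for a two-token text
token_vectors = np.array([
    [1.0, -2.0, 0.5],
    [0.0, 3.0, -1.0],
])

# element-wise max: largest value per dimension across tokens
max_vector = token_vectors.max(axis=0)
# mean: average value per dimension across tokens
mean_vector = token_vectors.mean(axis=0)

print(max_vector.tolist())   # [1.0, 3.0, 0.5]
print(mean_vector.tolist())  # [0.5, 0.5, -0.25]
```

Both results have the same size as a single embedding, so either can be fed to the same classifiers.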


In [11]:
# your code here
def text_to_max_vector(embeddings, text):
    # split text into tokens
    tokens = text.split()

    # convert tokens to embedding vectors
    vec = []
    i = 0
    while i < len(tokens):  # while there are tokens
        try:
            vec.append(embeddings.get_vector(tokens[i]))
        except KeyError:
            pass  # simply ignore out-of-vocabulary tokens
        finally:
            i += 1

    return np.max(vec, axis=0)  # return the max of vec


In [12]:
# your code here
embeddings_corpus = []
for c in corpus:
    embeddings_corpus.append(text_to_max_vector(wv, c))

X = np.array(embeddings_corpus)
y = dataset["Liked"]

print(X.shape, y.shape)


(1000, 150) (1000,)


In [13]:
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(clf.__class__.__name__)
    print(f"Score: {clf.score(X_test, y_test)}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Precision: {precision_score(y_test, y_pred)}")
    print(f"Recall: {recall_score(y_test, y_pred)}")
    print(f"F1: {f1_score(y_test, y_pred)}")
    print()


LogisticRegression
Score: 0.78
Accuracy: 0.78
Precision: 0.8041237113402062
Recall: 0.7572815533980582
F1: 0.78

LinearSVC
Score: 0.78
Accuracy: 0.78
Precision: 0.7920792079207921
Recall: 0.7766990291262136
F1: 0.7843137254901961

