This notebook explores word embeddings using the `gensim` package.

We will train embeddings from scratch as well as use pre-trained word vectors.
We will then attempt to use embeddings as features in text classification on the COVID tweet dataset. 

Before running, install gensim with:

`pip install gensim`

Some materials on word embeddings are adapted from: https://github.com/dbamman/anlp21/blob/main/4.embeddings/WordEmbeddings.ipynb

In [3]:
pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [4]:
# import gensim and related libraries
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath

### Training a word2vec model on a small Wikipedia dataset

First, let's train a new word2vec model on Wikipedia text data.

In [6]:
sentences=[]
# file from which to generate word embeddings
filename="/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/wordembeddings/wiki.10K.txt"
with open(filename, 'rb') as file:
    for line in file:
        words=line.rstrip().lower().decode('utf-8')
        # this file is already tokenized, so we can split on whitespace
        # but first let's replace any sequence of whitespace (space, tab, newline, etc.) with single space
        # raw string avoids Python's invalid-escape warning for "\s"
        words=re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))

model_wiki = Word2Vec(
        sentences,
        vector_size=100,
        window=5,
        min_count=2,
        workers=10)

my_trained_vectors = model_wiki.wv

# save vectors to file if you want to use them later
my_trained_vectors.save_word2vec_format('/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/wordembeddings/embeddings.txt', binary=False)

In [7]:
my_trained_vectors.most_similar("actor", topn=10)

[('actress', 0.9433861970901489),
 ('musician', 0.9145520925521851),
 ('composer', 0.9099785685539246),
 ('writer', 0.9010313153266907),
 ('artist', 0.8904495239257812),
 ('comedian', 0.8801247477531433),
 ('producer', 0.8684588670730591),
 ('singer', 0.8677772879600525),
 ('pianist', 0.8594045042991638),
 ('dancer', 0.853753387928009)]

### Loading pre-trained GloVe vectors

Let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained using a different method than word2vec, but they can also be loaded by `gensim`. Here we'll use 100-dimensional vectors trained on 6B tokens (from Wikipedia and news text); larger models are also available.
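The two text formats differ mainly in that a word2vec text file starts with a header line giving the vocabulary size and the vector dimensionality, while a GloVe file has no header. The sketch below illustrates that idea on two made-up lines. `glove_to_w2v_lines` and the toy data are hypothetical names used for illustration only; this is not `gensim`'s actual implementation.

```python
def glove_to_w2v_lines(glove_lines):
    """Prepend the '<vocab_size> <dimensions>' header expected by the word2vec text format."""
    # drop blank lines and trailing newlines
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # each row is: word value_1 value_2 ... value_d
    dim = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dim}"] + rows

# Two invented GloVe-style lines (not real vectors)
toy_glove = ["the 0.1 0.2 0.3\n", "of 0.4 0.5 0.6\n"]
converted = glove_to_w2v_lines(toy_glove)
for line in converted:
    print(line)
```

Recent `gensim` releases (4.x) also document a `no_header=True` argument to `KeyedVectors.load_word2vec_format` that is meant to read headerless GloVe-style files directly; check the documentation for your installed version before relying on it.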

In [9]:
# First we have to convert the Glove format into w2v format; this creates a new file
glove_file="/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/wordembeddings/glove.6B.100d.100K.txt"
glove_in_w2v_format="/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/wordembeddings/glove.6B.100d.100K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)


In [11]:
glove = KeyedVectors.load_word2vec_format("/Users/owenmonroe/Desktop/GitHub/TextMiningFall23/Lab7_Oct16_WordNet_WordEmbedding/Datasets/wordembeddings/glove.6B.100d.100K.w2v.txt", binary=False)

In [12]:
glove.most_similar("actor", topn=10)

[('actress', 0.8580666184425354),
 ('comedian', 0.795758843421936),
 ('starring', 0.7920297384262085),
 ('starred', 0.7582032680511475),
 ('actors', 0.7394535541534424),
 ('filmmaker', 0.7349801063537598),
 ('screenwriter', 0.7342271208763123),
 ('film', 0.6941469311714172),
 ('movie', 0.6924505829811096),
 ('comedy', 0.6884662508964539)]

`most_similar` computes the cosine similarity between the given word's vector and the vector of every other word in the model's vocabulary, then returns the top N words. The same function can also be used to explore analogies that have been learned in this representation, as in the cells below.
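To make the ranking concrete, here is a minimal dependency-free sketch of cosine similarity and a top-N lookup. The helper names (`cosine`, `top_n_similar`) and the 3-dimensional toy vectors are invented for illustration; `gensim` does the same kind of computation far more efficiently over the whole vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_similar(word, vectors, n=3):
    """Rank every other vocabulary word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# Toy vectors, invented for illustration only
toy_vectors = {
    "actor":   [0.9, 0.1, 0.0],
    "actress": [0.8, 0.2, 0.1],
    "film":    [0.5, 0.5, 0.0],
    "banana":  [0.0, 0.1, 0.9],
}

ranking = top_n_similar("actor", toy_vectors, n=3)
print(ranking)
```

Because cosine similarity depends only on the angle between vectors, two words can score as highly similar even when their vectors have very different lengths.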

In [14]:
# one is to two as three is to ?  (computed as two - one + three)
one="man"
two="king"
three="woman"

glove.most_similar(positive=[two, three], negative=[one], topn=5)

[('queen', 0.7698541283607483),
 ('monarch', 0.6843380331993103),
 ('throne', 0.6755736470222473),
 ('daughter', 0.6594556570053101),
 ('princess', 0.6520534157752991)]

In [None]:
one="paris"
two="france"
three="berlin"

glove.most_similar(positive=[two, three], negative=[one], topn=5)

[('germany', 0.892362117767334),
 ('austria', 0.7597678303718567),
 ('poland', 0.7425415515899658),
 ('denmark', 0.7360999584197998),
 ('german', 0.6986513137817383)]
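The analogy queries above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below shows roughly that idea on invented toy vectors. The `analogy` helper and the vectors are hypothetical, and the details of `gensim`'s own implementation may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def analogy(one, two, three, vectors, n=1):
    """Answer 'one is to two as three is to ?' via two - one + three."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = [b - a + c for a, b, c in zip(unit[one], unit[two], unit[three])]
    target = normalize(target)
    # dot product of unit vectors equals cosine similarity;
    # the query words themselves are excluded from the candidates
    candidates = [(w, sum(x * y for x, y in zip(target, vec)))
                  for w, vec in unit.items() if w not in (one, two, three)]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]

# Toy vectors, invented for illustration only
toy = {
    "man":   [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king":  [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, 0.0],
}

result = analogy("man", "king", "woman", toy)
print(result)
```

Excluding the three query words matters: without that step, the nearest neighbour of the target is often one of the inputs.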

## Comparing classification results for count-based vectors and word embedding vectors

We will now try to use embeddings as features for text classification.

In [None]:
import pandas as pd

#### LOAD DATASETS ####

train_data_file = "Datasets/Corona_NLP/Tweets_preprocessed_train_data.csv"
test_data_file = "Datasets/Corona_NLP/Tweets_preprocessed_test_data.csv"

# Import train and test dataset into data frames and print out the original lengths
train_data_df = pd.read_csv(train_data_file)
test_data_df = pd.read_csv(test_data_file)
print ("Original train set: ",len(train_data_df))
print ("Original test set: ",len(test_data_df))

### CLEAN DATASETS ###
# Remove empty rows from both sets and print out the new lengths
train_data_df = train_data_df[~train_data_df["OriginalTweet"].isnull()]
test_data_df = test_data_df[~test_data_df["OriginalTweet"].isnull()]
print ("After removing empty tweets, train set size: ",len(train_data_df))
print ("After removing empty tweets, test set size: ",len(test_data_df))

# Remove rows with null labels
train_data_df = train_data_df[~train_data_df["Sentiment"].isnull()]
test_data_df = test_data_df[~test_data_df["Sentiment"].isnull()]
print ("After removing instances with no labels, train set size: ", len(train_data_df))
print ("After removing instances with no labels, test set size: ", len(test_data_df))

# print out top 5 rows of the train set
display(train_data_df.head(5))

Original train set:  40000
Original test set:  4957
After removing empty tweets, train set size:  39998
After removing empty tweets, test set size:  4957
After removing instances with no labels, train set size:  39996
After removing instances with no labels, test set size:  4957


Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,Hashtags,CleanedTweet,Accounts,TokenizedTweet,StopwordRemovedTweet,StemmedTweet
0,3799,48751,London,16-03-2020,@menyrbie @phil_gahan @chrisitv https://t.co/i...,Neutral,,https t co ifz9fan2pa and https t co xx6ghgfz...,"['menyrbie', 'phil_gahan', 'chrisitv']","['https', 't', 'co', 'ifz9fan2pa', 'and', 'htt...","['https', 'co', 'ifz9fan2pa', 'https', 'co', '...","['http', 't', 'co', 'ifz9fan2pa', 'and', 'http..."
1,3800,48752,UK,16-03-2020,advice talk to your neighbours family to excha...,Positive,,advice talk to your neighbours family to excha...,,"['advice', 'talk', 'to', 'your', 'neighbours',...","['advice', 'talk', 'neighbours', 'family', 'ex...","['advic', 'talk', 'to', 'your', 'neighbour', '..."
2,3801,48753,Vagabonds,16-03-2020,coronavirus australia: woolworths to give elde...,Positive,,coronavirus australia woolworths to give elder...,,"['coronavirus', 'australia', 'woolworths', 'to...","['coronavirus', 'australia', 'woolworths', 'gi...","['coronaviru', 'australia', 'woolworth', 'to',..."
3,3802,48754,,16-03-2020,my food stock is not the only one which is emp...,Positive,"['covid19france', 'covid_19', 'covid19', 'coro...",my food stock is not the only one which is emp...,,"['my', 'food', 'stock', 'is', 'not', 'the', 'o...","['food', 'stock', 'one', 'empty', 'please', 'p...","['my', 'food', 'stock', 'is', 'not', 'the', 'o..."
4,3803,48755,,16-03-2020,"me, ready to go at supermarket during the #cov...",Negative,"['covid19', 'coronavirus', 'coronavirusfrance'...",me ready to go at supermarket during the outbr...,,"['me', 'ready', 'to', 'go', 'at', 'supermarket...","['ready', 'go', 'supermarket', 'outbreak', 'pa...","['me', 'readi', 'to', 'go', 'at', 'supermarket..."


In [None]:
# use cleaned tweets for model building
y_train = train_data_df["Sentiment"]
y_test = test_data_df["Sentiment"]

train_text = train_data_df["CleanedTweet"]
test_text = test_data_df["CleanedTweet"]

### Count-based feature extraction and modeling

This is something you have done many times and should be able to interpret well.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# set the n-gram range
vectorizer = CountVectorizer(ngram_range = (1,1))

# create training data representation
train_data_cv = vectorizer.fit_transform(train_text)
test_data_cv = vectorizer.transform(test_text)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


lg = LogisticRegression(random_state=0, solver='liblinear')
lg.fit(train_data_cv, y_train)
predictions = lg.predict(test_data_cv)

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Precision score: ", precision_score(y_test, predictions, average="weighted"))
print("Recall score: ", recall_score(y_test, predictions, average = "weighted"))
print("F1 score: ", f1_score(y_test, predictions, average = "weighted"))

Accuracy score:  0.8426467621545289
Precision score:  0.842738983628627
Recall score:  0.8426467621545289
F1 score:  0.8425229333822474


### Word embeddings-based feature extraction

We can also create features directly from word embeddings. We can use different word embedding vectors: 
- word-embedding model trained on Wikipedia 
- pre-trained GloVe vectors
- word-embedding model trained on the COVID19 tweets

###  Calculate the vector representation for input data given Wikipedia word embeddings

We already trained a model on Wikipedia data above. We will use that model to extract vector representations of the training and testing data. Examine carefully how these vectors form the basis of the features below.

Essentially, we look up the vector for each word in a tweet and take the mean of those vectors as the representation of the whole tweet. This is a very simple method; because averaging discards word order, it is often not the most effective.

Also note that some words from the dataset may not appear at all in the trained vectors. We refer to these as OOV (out-of-vocabulary) words.
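A minimal sketch of this averaging step, including how OOV words are skipped, might look like the following. The `mean_pool` helper and the 2-dimensional toy vectors are made up for illustration and are separate from the notebook's own transformation function.

```python
def mean_pool(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens.

    Returns (sentence_vector, oov_count). If no token is in the
    vocabulary, the sentence vector is all zeros.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    oov_count = len(tokens) - len(known)
    if not known:
        return [0.0] * dim, oov_count
    # zip(*known) groups the values of each dimension together
    pooled = [sum(column) / len(known) for column in zip(*known)]
    return pooled, oov_count

# Toy vectors, invented for illustration only
toy_vectors = {
    "food":  [1.0, 0.0],
    "stock": [0.0, 1.0],
    "empty": [1.0, 1.0],
}

sentence_vector, oov_count = mean_pool(
    "food stock is empty".split(), toy_vectors, dim=2)
print(sentence_vector, oov_count)
```

Here "is" has no vector, so it is simply left out of the average. A tweet made up entirely of OOV words ends up as an all-zero row, which carries no information for the classifier.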

In [None]:
import numpy as np

def transform_data_for_word_model(model, data_df):
    # one row per tweet, one column per embedding dimension
    X = np.zeros((len(data_df), model.wv.vector_size))
    emptycount = 0  # tweets with no in-vocabulary words
    for n, (index, row) in enumerate(data_df.iterrows()):
        tokens = row["CleanedTweet"].split()
        vecs = []
        for word in tokens:
            try:
                # throws KeyError if word not found (OOV)
                vecs.append(model.wv.get_vector(word))
            except KeyError:
                pass
        if len(vecs) > 0:
            # the mean of the word vectors represents the tweet
            X[n] = np.array(vecs).mean(axis=0)
        else:
            # no known words: the row stays all zeros
            emptycount += 1
    print("Tweets with no in-vocabulary words:", emptycount)
    return X


xtrain = transform_data_for_word_model(model_wiki,train_data_df )
xtest = transform_data_for_word_model(model_wiki,test_data_df )

### Building and evaluating the model using Wikipedia word2vec embeddings

In [None]:
lg = LogisticRegression(random_state=0, solver='liblinear')
lg.fit(xtrain, y_train)
predictions = lg.predict(xtest)

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Precision score: ", precision_score(y_test, predictions, average="weighted"))
print("Recall score: ", recall_score(y_test, predictions, average = "weighted"))
print("F1 score: ", f1_score(y_test, predictions, average = "weighted"))

Accuracy score:  0.5497276578575752
Precision score:  0.5511847651334429
Recall score:  0.5497276578575752
F1 score:  0.5366641153146762


### Building and evaluating the model using GloVe vectors

The only difference between this function and the one above is how the vectors are accessed. The model we trained ourselves is a full `Word2Vec` object whose word vectors live in its `wv` attribute (a `KeyedVectors` object), so we call `model.wv.get_vector` or `model.wv.most_similar`. The pretrained GloVe vectors were loaded with `KeyedVectors.load_word2vec_format`, which returns a `KeyedVectors` object directly, so we call `model.get_vector` and `model.most_similar`. (`wv` refers to the model's word vectors, not to word2vec specifically.)
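Since the two access patterns differ only in whether there is a `wv` attribute in between, one way to avoid duplicating the transformation function is a small accessor that works for both. The sketch below uses stand-in classes rather than real `gensim` objects; `as_keyed_vectors`, `FakeKeyedVectors` and `FakeWord2Vec` are hypothetical names for illustration.

```python
def as_keyed_vectors(model_or_vectors):
    """Return the vector lookup object for a full model or for bare vectors."""
    # a full model exposes its vectors as `.wv`; bare vectors are returned as-is
    return getattr(model_or_vectors, "wv", model_or_vectors)

class FakeKeyedVectors:
    """Stand-in for a vector table with a get_vector method."""
    def __init__(self, table):
        self.table = table

    def get_vector(self, word):
        return self.table[word]

class FakeWord2Vec:
    """Stand-in for a trained model that stores its vectors in `.wv`."""
    def __init__(self, table):
        self.wv = FakeKeyedVectors(table)

table = {"king": [0.1, 0.2]}
trained_model = FakeWord2Vec(table)
loaded_vectors = FakeKeyedVectors(table)

vec_from_model = as_keyed_vectors(trained_model).get_vector("king")
vec_from_loaded = as_keyed_vectors(loaded_vectors).get_vector("king")
print(vec_from_model, vec_from_loaded)
```

With an accessor like this, a single transformation function could accept either kind of object. The notebook keeps two separate functions so that the difference stays visible.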

In [None]:
def transform_data_for_glove(model, data_df):
    # one row per tweet, one column per embedding dimension
    X = np.zeros((len(data_df), model.vector_size))
    emptycount = 0  # tweets with no in-vocabulary words
    for n, (index, row) in enumerate(data_df.iterrows()):
        tokens = row["CleanedTweet"].split()
        vecs = []
        for word in tokens:
            try:
                # throws KeyError if word not found (OOV)
                vecs.append(model.get_vector(word))
            except KeyError:
                pass
        if len(vecs) > 0:
            # the mean of the word vectors represents the tweet
            X[n] = np.array(vecs).mean(axis=0)
        else:
            # no known words: the row stays all zeros
            emptycount += 1
    print("Tweets with no in-vocabulary words:", emptycount)
    return X

xtrain_glove = transform_data_for_glove(glove,train_data_df )
xtest_glove = transform_data_for_glove(glove,test_data_df )

In [None]:
lg = LogisticRegression(random_state=0, solver='liblinear')
lg.fit(xtrain_glove, y_train)
predictions = lg.predict(xtest_glove)

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Precision score: ", precision_score(y_test, predictions, average="weighted"))
print("Recall score: ", recall_score(y_test, predictions, average = "weighted"))
print("F1 score: ", f1_score(y_test, predictions, average = "weighted"))

Accuracy score:  0.6451482751664314
Precision score:  0.6414159733107198
Recall score:  0.6451482751664314
F1 score:  0.6359574122857006


Which performs better on the tweet dataset: the Wikipedia word2vec model or the pretrained GloVe vectors? Why?

### Using COVID19 tweets to create word embeddings

We will create a word2vec model from the COVID tweets. Then we will transform our training and testing data into vectors using this model.

In [None]:
frames = [train_data_df, test_data_df]
all_dataset = pd.concat(frames)
sentences_from_data = [x.split() for x in all_dataset["CleanedTweet"]]

model_covid = Word2Vec(
        sentences_from_data,
        vector_size=100,
        window=5,
        min_count=2,
        workers=10)

### Building and evaluating the model using COVID19 embeddings

We will use the same transformation function from above to transform the tweet data into features using COVID19 embeddings.

In [None]:
xtrain_covid = transform_data_for_word_model(model_covid,train_data_df )
xtest_covid = transform_data_for_word_model(model_covid,test_data_df )

In [None]:
lg = LogisticRegression(random_state=0, solver='liblinear')
lg.fit(xtrain_covid, y_train)
predictions = lg.predict(xtest_covid)

print("Accuracy score: ", accuracy_score(y_test, predictions))
print("Precision score: ", precision_score(y_test, predictions, average="weighted"))
print("Recall score: ", recall_score(y_test, predictions, average = "weighted"))
print("F1 score: ", f1_score(y_test, predictions, average = "weighted"))

Accuracy score:  0.5935041355658665
Precision score:  0.5913480888677216
Recall score:  0.5935041355658665
F1 score:  0.5856872979984648


How do the embedding-based classification models perform in comparison to the count-based models? How can we interpret the results?

You can also explore how the predictions from different models compare. What are some of the test examples that the count-based model classifies correctly while the embedding-based models do not (and vice versa)? Checking the model predictions is a good way to gain more insight into model behavior.
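One possible starting point for that comparison is to collect the positions where exactly one of two models is correct. The sketch below uses short invented label lists; in the notebook you would pass `y_test` and the prediction arrays from two of the classifiers instead. `disagreements` is a hypothetical helper name.

```python
def disagreements(y_true, preds_a, preds_b):
    """Return (a_only, b_only): positions where only model A, or only model B, is correct."""
    a_only, b_only = [], []
    for i, (truth, a, b) in enumerate(zip(y_true, preds_a, preds_b)):
        if a == truth and b != truth:
            a_only.append(i)
        elif b == truth and a != truth:
            b_only.append(i)
    return a_only, b_only

# Invented labels and predictions, for illustration only
y_true      = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
count_preds = ["Positive", "Negative", "Positive", "Positive", "Neutral"]
embed_preds = ["Positive", "Neutral",  "Neutral",  "Negative", "Neutral"]

count_only, embed_only = disagreements(y_true, count_preds, embed_preds)
print("Only count-based model correct at positions:", count_only)
print("Only embedding-based model correct at positions:", embed_only)
```

The returned values are positions, not DataFrame index labels, so with a pandas DataFrame you would look up the corresponding tweets with `.iloc`.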