# Intro to Word Embeddings

In this notebook we learn about word2vec as a new technique to convert words into (dense) vectors, and re-build the text classification pipeline we had previously built with sparse vectors.

You are encouraged to play around with the code and modify / re-build parts of it as you see fit: there is NO substitute for "tinkering with code" to understand how all the concepts fit together (corollary: all this code is written for pedagogical purposes, so some functions are re-used from previous lectures to provide a self-sufficient script).

In [None]:
# some global imports
import json
import glob
import os
import pandas as pd
from collections import Counter
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import numpy as np

In [None]:
%matplotlib inline

## Bonus: perceptron as the simplest NN

The code below is taken (with some minor tweaks) from the fantastic Perceptron class here (https://www.thomascountz.com/2018/04/05/19-line-line-by-line-python-perceptron): it is a simple and transparent implementation of a Perceptron.

In [None]:
class Perceptron(object):

    def __init__(self, no_of_inputs):
        # initialize the weights array: weights[0] is the bias, weights[1:] are the input weights
        self.weights = np.zeros(no_of_inputs + 1)
        
        return
           
    def predict(self, inputs):
        summation = np.dot(inputs, self.weights[1:]) + self.weights[0]
        
        return 1 if summation > 0 else 0

    def train(self, training_inputs, labels, epochs=100, learning_rate=0.01):
        for _ in range(epochs):
            for inputs, label in zip(training_inputs, labels):
                prediction = self.predict(inputs)
                # update the weights
                self.weights[1:] += learning_rate * (label - prediction) * inputs
                # update the bias
                self.weights[0] += learning_rate * (label - prediction)

To understand how the forward pass works, let's do some quick calculations:

In [None]:
inputs = [1, 1]
weights = [0, 0]
bias = 1 
inputs_dot_weights = np.dot(np.array(inputs), np.array(weights))
_sum = inputs_dot_weights + bias

print(inputs_dot_weights, _sum)

We can use the class to learn some real-world function of interest, for example the AND function:

In [None]:
# let's learn the AND operator
training_inputs = []
training_inputs.append(np.array([1, 1]))
training_inputs.append(np.array([1, 0]))
training_inputs.append(np.array([0, 1]))
training_inputs.append(np.array([0, 0]))
# 1 only when both inputs are 1
labels = np.array([1, 0, 0, 0])

In [None]:
# instantiate class and train
perceptron = Perceptron(2)
# weights before training
print(perceptron.weights)
perceptron.train(training_inputs, labels)
# weights after training
print(perceptron.weights)

In [None]:
print(perceptron.predict(np.array([1, 1])))
print(perceptron.predict(np.array([0, 1])))
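To see the update rule from `train` at work, the following minimal sketch replays it as a plain function on the AND data and stops as soon as an epoch produces no mistakes. The helper name `train_perceptron`, the early stopping, and the learning rate of 1.0 (chosen so the arithmetic stays exact) are assumptions made for illustration; they are not part of the class above.

```python
import numpy as np


def train_perceptron(inputs, labels, epochs=20, learning_rate=1.0):
    """Hypothetical helper: same update rule as Perceptron.train, plus early stopping."""
    weights = np.zeros(inputs.shape[1] + 1)  # weights[0] is the bias
    for epoch in range(epochs):
        mistakes = 0
        for x, label in zip(inputs, labels):
            prediction = 1 if np.dot(x, weights[1:]) + weights[0] > 0 else 0
            error = label - prediction  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights[1:] += learning_rate * error * x
                weights[0] += learning_rate * error
        print("epoch {}: {} mistakes, weights {}".format(epoch + 1, mistakes, weights))
        if mistakes == 0:
            break
    return weights


and_inputs = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
and_labels = np.array([1, 0, 0, 0])
learned = train_perceptron(and_inputs, and_labels)
and_predictions = [1 if np.dot(x, learned[1:]) + learned[0] > 0 else 0 for x in and_inputs]
print(and_predictions)
```

Because AND is linearly separable, the loop is guaranteed to reach an epoch with zero mistakes; the printed trace shows the weights changing only when a prediction is wrong.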

## Data loading

In [None]:
# make sure you have the datasets library installed
# see: https://github.com/huggingface/datasets

# !pip install datasets

In [None]:
import string

# some utility functions
def get_finance_sentiment_dataset(split: str='sentences_allagree'):
    # load financial dataset from HF
    from datasets import load_dataset
    # https://huggingface.co/datasets/financial_phrasebank
    # by default, load just sentences for which all annotators agree
    dataset = load_dataset("financial_phrasebank", split)
    
    return dataset['train']


def get_finance_sentences():
    dataset = get_finance_sentiment_dataset()
    cleaned_dataset = [[pre_process_sentence(_['sentence']), _['label']] for _ in dataset]
    # debug 
    print("{} cleaned sentences from finance dataset\n".format(len(cleaned_dataset)))
    
    return cleaned_dataset


def pre_process_sentence(sentence: str):
    # these choices are VERY important. Here, we take a simplified
    # view: remove punctuation and lowercase everything
    lower_sentence = sentence.lower()
    exclude = set(string.punctuation)
    return ''.join(ch for ch in lower_sentence if ch not in exclude)
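As a quick illustration of what this simplified cleaning does (and what it costs), the sketch below applies the same two steps to a made-up sentence. The function name `clean` and the example text are assumptions for illustration only.

```python
import string


def clean(sentence: str) -> str:
    # same two steps as pre_process_sentence: lowercase, then drop punctuation
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence.lower() if ch not in exclude)


example = "Operating profit rose to EUR 13.1 mn, up 12%!"
cleaned = clean(example)
print(cleaned)
```

Note that the decimal point counts as punctuation too, so `13.1` collapses into `131`: a reminder that even "simple" pre-processing choices change what the model sees.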

In [None]:
finance_dataset = get_finance_sentences()
# print out the first items in the dataset, to check the format
finance_dataset[:2]

In [None]:
# get sentences without label for vectorizer part
finance_dataset_sentences = [_[0] for _ in finance_dataset]

## From words to vectors

As you may recall, we introduced some "vectorizing" procedures for text before, e.g. TfidfVectorizer. The resulting vectors are very long and sparse - we quickly re-create some here for convenience:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = finance_dataset_sentences[:2]
tfidfvectorizer = TfidfVectorizer(analyzer='word')
tfidf_wm = tfidfvectorizer.fit_transform(docs)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_tokens = tfidfvectorizer.get_feature_names_out()
df_tfidfvect = pd.DataFrame(data=tfidf_wm.toarray(),
                            index=['Doc{}'.format(_) for _ in range(len(docs))], 
                            columns=tfidf_tokens)
print("TF-IDF Vectorizer\n")
print(df_tfidfvect)
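To make "long and sparse" concrete, here is a minimal sketch that builds raw count vectors (not TF-IDF weights, to keep the arithmetic trivial) over a toy corpus and measures the share of zero entries. The three sentences and all variable names are made up for illustration.

```python
import numpy as np

toy_docs = [
    "profit rose in the first quarter",
    "the company reported a loss",
    "sales fell sharply last year",
]
# one dimension per distinct word in the corpus
vocabulary = sorted({word for doc in toy_docs for word in doc.split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# one row per document, one column per vocabulary word
count_matrix = np.zeros((len(toy_docs), len(vocabulary)))
for row, doc in enumerate(toy_docs):
    for word in doc.split():
        count_matrix[row, word_to_index[word]] += 1

sparsity = float(np.mean(count_matrix == 0))
print("vocabulary size: {}".format(len(vocabulary)))
print("share of zero entries: {:.2f}".format(sparsity))
```

Even with three short sentences most entries are zero; with a real vocabulary of thousands of words and sentences of a dozen tokens, the share of zeros gets much closer to 1.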

Let us now use word2vec to get vectors for words first, and for documents afterwards. We will use a fantastic Python library, gensim: https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
#!pip install gensim==4.0.1

In [None]:
import gensim

In [None]:
def train_word2vec_model(
    sentences: list,
    min_count: int = 2,
    vector_size: int = 48,
    window: int = 2,
    epochs: int = 20
):
    """
    `sentences` is a list of lists, where each inner list holds the tokens of one sentence, e.g.
    
    [
        ['the', 'cat', 'is', 'on' ...],
        ['i', 'live', 'in', 'nyc', ...],
        ....
    ]
    
    """
    model =  gensim.models.Word2Vec(sentences=sentences,
                                    min_count=min_count,
                                    vector_size=vector_size,
                                    window=window,
                                    epochs=epochs)
    
    # this is how many words we will have in the space
    print("# words in the space: {}".format(len(model.wv.index_to_key)))

    # we return the space in a format that will allow us to do nice things afterwards ;-)    
    return model.wv

In [None]:
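# let's use the nltk tokenizer to break up sentences and build a word2vec model
# https://www.nltk.org/api/nltk.tokenize.html
# note: word_tokenize needs the NLTK tokenizer data; if you get a LookupError,
# run nltk.download('punkt') (recent NLTK releases ask for 'punkt_tab' instead)
from nltk.tokenize import word_tokenize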

print(finance_dataset_sentences[0], '\n\n', word_tokenize(finance_dataset_sentences[0]))

In [None]:
tokenized_sentences = [word_tokenize(_) for _ in finance_dataset_sentences]
# debug 
tokenized_sentences[:2]

In [None]:
# build a counter to get a sense of the lexicon
word_counter = Counter([item for sent in tokenized_sentences for item in sent])
word_counter.most_common(20)
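These counts also explain the `min_count` parameter of `train_word2vec_model`: words seen fewer than `min_count` times are ignored during training. Below is a minimal sketch of that filter on a toy token list. The tokens and variable names are made up, and gensim applies the threshold internally when it builds its vocabulary, so this is only an illustration of the idea.

```python
from collections import Counter

toy_sentences = [
    ['profit', 'rose', 'in', 'the', 'quarter'],
    ['the', 'company', 'reported', 'a', 'profit'],
    ['the', 'loss', 'widened'],
]
toy_counter = Counter(token for sent in toy_sentences for token in sent)

min_count = 2
# words at or above the threshold stay in the vocabulary, the rest are dropped
kept_words = sorted(w for w, c in toy_counter.items() if c >= min_count)
dropped_words = sorted(w for w, c in toy_counter.items() if c < min_count)
print("kept:", kept_words)
print("dropped:", dropped_words)
```

On a small corpus many words appear only once, so even `min_count=2` can shrink the vocabulary considerably.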

In [None]:
w2v_model = train_word2vec_model(tokenized_sentences)

Now that we have a vector space, let's find words similar to a given term...

In [None]:
for w in ['company', 'profit']:
    print('\n======>{}\n'.format(w), w2v_model.similar_by_word(w, topn=3))
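Under the hood, "similar" here means high cosine similarity between vectors. The sketch below ranks neighbours by cosine similarity over three hand-written 3-dimensional vectors. The vectors and the helper names `cosine` and `most_similar` are invented for illustration and are not gensim's implementation.

```python
import numpy as np

# made-up vectors: 'company' and 'firm' point in a similar direction, 'banana' does not
toy_vectors = {
    'company': np.array([1.0, 0.9, 0.1]),
    'firm': np.array([0.9, 1.0, 0.2]),
    'banana': np.array([0.0, 0.1, 1.0]),
}


def cosine(u, v):
    # cosine of the angle between u and v: 1 means same direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def most_similar(word, vectors, topn=2):
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = most_similar('company', toy_vectors)
print(neighbours)
```

Cosine similarity ignores vector length and only compares directions, which is why it is a common choice for comparing embeddings.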

In [None]:
# Q: what happens as the window grows bigger? What is your prediction?

# w2v_model = train_word2vec_model(tokenized_sentences, window=5)
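One way to reason about this question is to look at which (center, context) pairs a given `window` produces. The sketch below enumerates them for a made-up sentence. `context_pairs` is a hypothetical helper, and real word2vec training adds details (such as randomly shrinking the window and subsampling frequent words) that are left out here.

```python
def context_pairs(tokens, window):
    """Hypothetical helper: list every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


toy_tokens = ['profit', 'rose', 'sharply', 'last', 'year']
pairs_small = context_pairs(toy_tokens, window=1)
pairs_large = context_pairs(toy_tokens, window=2)
# number of pairs, then the context words seen by 'sharply'
print(len(pairs_small), [c for w, c in pairs_small if w == 'sharply'])
print(len(pairs_large), [c for w, c in pairs_large if w == 'sharply'])
```

A larger window pairs each word with more distant neighbours, so every word is trained against a broader context; comparing the nearest neighbours for `window=2` and `window=5` is a good way to check your prediction.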

To get a sense of what the vectors look like, we plot them in 2D using TSNE (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html):

In [None]:
def plot_scatter_by_category_with_lookup(title, 
                                         words, 
                                         word_to_target_cat,
                                         results):
    """
    Just a plotting routine
    """
    
    groups = {}
    for word, target_cat in word_to_target_cat.items():
        if word not in words:
            continue

        word_idx = words.index(word)
        x = results[word_idx][0]
        y = results[word_idx][1]
        if target_cat in groups:
            groups[target_cat]['x'].append(x)
            groups[target_cat]['y'].append(y)
        else:
            groups[target_cat] = {
                'x': [x], 'y': [y]
                }
    
    fig, ax = plt.subplots(figsize=(10, 10))
    for group, data in groups.items():
        ax.scatter(data['x'], data['y'], 
                   alpha=0.1 if group == 0 else 0.8, 
                   edgecolors='none', 
                   s=25, 
                   marker='o',
                   label=group)

    plt.title(title)
    plt.legend(loc=2)
    plt.show()
    
    return

In [None]:
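def tsne_analysis(embeddings, perplexity=25, n_iter=500):
    # note: scikit-learn 1.5 renamed TSNE's n_iter argument to max_iter;
    # switch the keyword below if your version complains about it
    tsne = TSNE(n_components=2, verbose=1, perplexity=perplexity, n_iter=n_iter)
    return tsne.fit_transform(embeddings)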

In [None]:
# map words to known categories of interest

# 0 is the generic category
words = w2v_model.index_to_key
print(len(words))
words_to_category = {w: 0 for w in words}
# manually pick some words to display
for w in ['company', 'profit', 'investment', 'loss', 'margin', 'group']:
    words_to_category[w] = 1
for w in ['with', 'of', 'from', 'by', 'as']:
    words_to_category[w] = 2

In [None]:
embeddings = [w2v_model[w] for w in words]
tsne_results = tsne_analysis(embeddings)
assert len(tsne_results) == len(words)

In [None]:
plot_scatter_by_category_with_lookup('Finance word2vec', words, words_to_category, tsne_results)

_Why is the quality not ideal?_

Our dataset is very small, and word2vec works much better when large corpora are used. However, a pretty cool thing about language is that it is everywhere: the word "company" is very important in the financial sector, but of course Wikipedia also talks a lot about companies... can we make use of all the text out there?

The answer is YES: in particular, a pattern that is common to many NLP (but also vision-related) tasks is to initialize a model with PRE-TRAINED embeddings, obtained beforehand by training on large corpora. We can either re-use them as they are or "fine-tune" them: in either case, we will, so to speak, be able to harness the power of Wikipedia even with a corpus as small as ours.

### Bonus: using pre-trained embeddings

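Here we use Gensim-data to recover dense vectors for words in our vocabulary, as pre-trained on Wikipedia. Note that the vectors we download are GloVe vectors: GloVe is a different algorithm from word2vec, but it also produces dense word vectors that can be used in the same way.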

In [None]:
import gensim.downloader as api

In [None]:
# glove-wiki-gigaword-50 (400000 records): Pre-trained vectors based on Wikipedia
pre_trained_model = api.load("glove-wiki-gigaword-50")
# test it out
for w in ['company', 'profit']:
    print('\n======>{}\n'.format(w), pre_trained_model.similar_by_word(w, topn=3))

In [None]:
words = [w for w in w2v_model.index_to_key if w in pre_trained_model]
print(len(words))
pre_trained_vectors = [pre_trained_model[w] for w in words]
pre_trained_tsne_results = tsne_analysis(pre_trained_vectors)

In [None]:
plot_scatter_by_category_with_lookup('Finance pre-trained word2vec', 
                                     words, 
                                     words_to_category, 
                                     pre_trained_tsne_results)

## Application: Text Classification Revisited

As you may recall, once text is in a vectorized form, the downstream pipeline we learned with scikit-learn can be applied to language datasets in the same way. For convenience, we first report again a standard classifier for financial news built with the TF-IDF transformation, and then use word2vec to do the same.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

finance_dataset_text = [_[0] for _ in finance_dataset]
finance_dataset_label = [_[1] for _ in finance_dataset]
all_labels = set(finance_dataset_label)
print("All labels are: {}".format(all_labels))
X_train, X_test, y_train, y_test = train_test_split(finance_dataset_text, 
                                                    finance_dataset_label, 
                                                    test_size=0.1, 
                                                    random_state=42)

print(len(X_train))
final_tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
final_tfidf_train = final_tfidfvectorizer.fit_transform(X_train)
print(final_tfidf_train.shape)
X_test_transformed = final_tfidfvectorizer.transform(X_test)

In [None]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(final_tfidf_train, y_train)
predicted = model.predict(X_test_transformed)
predicted_prob = model.predict_proba(X_test_transformed)

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

def calculate_confusion_matrix_and_report(y_predicted, y_golden, with_plot=True):
    # calculate confusion matrix: 
    cm = confusion_matrix(y_golden, y_predicted)
    # build a readable report;
    # https://scikit-learn.org/stable/modules/model_evaluation.html#classification-report
    print('\nClassification Report')
    print(classification_report(y_golden, y_predicted))
    # plot the matrix
    if with_plot:
        plot_confusion_matrix(cm)
                                          
    return
                                          
def plot_confusion_matrix(c_matrix):
    plt.imshow(c_matrix, cmap=plt.cm.Blues)
    plt.xlabel("Predicted labels")
    plt.ylabel("True labels")
    plt.xticks([], [])
    plt.yticks([], [])
    plt.title("Confusion matrix")
    plt.colorbar()
    plt.show()
    
    return

In [None]:
print("Total of # {} test cases".format(len(y_test)))
calculate_confusion_matrix_and_report(predicted, y_test)
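To connect the report with the matrix, the sketch below recomputes a confusion matrix and per-class precision and recall by hand with NumPy, following scikit-learn's convention of true labels on rows and predicted labels on columns. The toy labels are made up and are not outputs of the model above.

```python
import numpy as np

toy_true = np.array([0, 1, 2, 1, 1, 2, 0, 1])
toy_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1])
n_classes = 3

toy_cm = np.zeros((n_classes, n_classes), dtype=int)
# row = true label, column = predicted label
np.add.at(toy_cm, (toy_true, toy_pred), 1)

# precision: correct predictions / everything predicted as the class (column sums)
precision = np.diag(toy_cm) / toy_cm.sum(axis=0)
# recall: correct predictions / everything truly in the class (row sums)
recall = np.diag(toy_cm) / toy_cm.sum(axis=1)
print(toy_cm)
print("precision:", precision)
print("recall:", recall)
```

Reading the matrix this way makes it easy to spot which pairs of classes the model confuses, something a single accuracy number hides.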

Let us now transform sentences using word2vec - we go through each sentence, remove stop words, and take the average of the vectors of the remaining words that are present in the model.

In [None]:
# debug some vars to make sure all is in order
print(w2v_model.most_similar("company"))
print(X_train[0],y_train[0])
print(X_test[0], y_test[0])

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
stop_words[:10]

In [None]:
def tokenize(sentence, stop_words):
    return [w for w in word_tokenize(sentence) if w not in stop_words]


def sentence_to_embedding(sentence, model, stop_words, dims=48):
    tokenized_sentence = tokenize(sentence, stop_words)
    if not tokenized_sentence:
        print("\n!!!ATTENTION!!! Empty sentence: {}".format(sentence))
        return np.zeros(dims)
    mean_array = np.mean([model[w] for w in tokenized_sentence if w in model] or [np.zeros(dims)], axis=0)
    assert len(mean_array) == dims
    
    return np.array(mean_array)

# debug
_test = 'company profits were soaring last year'
print(tokenize(_test, stop_words))
print(sentence_to_embedding(_test, w2v_model, stop_words))

In [None]:
# Q: instead of taking the plain average, can we give more weight to the embeddings of more important words?
# e.g. can we use tf-idf as a weighting scheme to aggregate word vectors?
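One possible direction for the question above, as a minimal sketch: replace the plain mean with a weighted average. The 2-dimensional vectors, the weights, and the helper `weighted_sentence_embedding` are all made up for illustration; with scikit-learn, one option would be to read per-word weights from the `idf_` attribute of a fitted `TfidfVectorizer`.

```python
import numpy as np

# made-up 2-dimensional word vectors
toy_vectors = {
    'company': np.array([1.0, 0.0]),
    'profit': np.array([0.0, 1.0]),
}
# hypothetical importance weights, e.g. IDF scores
toy_weights = {'company': 1.0, 'profit': 3.0}


def weighted_sentence_embedding(tokens, vectors, weights, dims=2):
    known = [t for t in tokens if t in vectors]
    if not known:
        # same fallback as sentence_to_embedding: all zeros
        return np.zeros(dims)
    return np.average([vectors[t] for t in known],
                      axis=0,
                      weights=[weights.get(t, 1.0) for t in known])


toy_tokens = ['company', 'profit', 'unknownword']
plain_mean = np.mean([toy_vectors[t] for t in toy_tokens if t in toy_vectors], axis=0)
weighted_mean = weighted_sentence_embedding(toy_tokens, toy_vectors, toy_weights)
print(plain_mean, weighted_mean)
```

The weighted result is pulled towards the vector of the word with the larger weight, while out-of-vocabulary tokens are skipped exactly as in the plain mean.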

In [None]:
w2vec_X_train = np.array([sentence_to_embedding(_, w2v_model, stop_words) for _ in X_train])
w2vec_X_test = np.array([sentence_to_embedding(_, w2v_model, stop_words) for _ in X_test])
print(len(w2vec_X_train))
w2vec_X_train[0]

In [None]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(w2vec_X_train, y_train)
predicted = model.predict(w2vec_X_test)
print("Total of # {} test cases".format(len(y_test)))
calculate_confusion_matrix_and_report(predicted, y_test)

### Bonus: let's use pre-trained vectors instead

In [None]:
# get the model again
pre_trained_model = api.load("glove-wiki-gigaword-50")
# re-vectorize the Xs - make sure to specify the right size for the embeddings
pre_trained_w2vec_X_train = np.array([sentence_to_embedding(_, pre_trained_model, stop_words, dims=50) for _ in X_train])
pre_trained_w2vec_X_test = np.array([sentence_to_embedding(_, pre_trained_model, stop_words, dims=50) for _ in X_test])
print(len(pre_trained_w2vec_X_train))
pre_trained_w2vec_X_train[0]

In [None]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(pre_trained_w2vec_X_train, y_train)
predicted = model.predict(pre_trained_w2vec_X_test)
print("Total of # {} test cases".format(len(y_test)))
calculate_confusion_matrix_and_report(predicted, y_test)

## What's next?

We have discussed how to turn words into vectors using a neural network - can we do the same for an entire sentence, without resorting to the mean trick?

YES, but training models that work well on sentences requires a huge amount of computation. However, the same logic applies here: we can take a model that has been pre-trained on a very large corpus, and use it to vectorize our finance dataset.

As an example, we will use the convenient sentence transformer (https://github.com/UKPLab/sentence-transformers) to map text to a dense vector.

In [None]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('stsb-distilbert-base')

In [None]:
# run example code to check all is good with the library
sentences = [
    'This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.'
]
sentence_embeddings = sentence_model.encode(sentences)
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding[:10])
    print("")

In [None]:
def bert_sentence_encoding(sentences, model):
    # this takes a while!
    embedded_sentences = model.encode(sentences)
    assert len(embedded_sentences) == len(sentences)
    
    return embedded_sentences

In [None]:
# re-vectorize the Xs with the sentence encoder (the embedding size is fixed by the model)
bert_w2vec_X_train = np.array(bert_sentence_encoding(X_train, sentence_model))
bert_w2vec_X_test = np.array(bert_sentence_encoding(X_test, sentence_model))
print(bert_w2vec_X_train[0].shape)

In [None]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
model.fit(bert_w2vec_X_train, y_train)
predicted = model.predict(bert_w2vec_X_test)
print("Total of # {} test cases".format(len(y_test)))
calculate_confusion_matrix_and_report(predicted, y_test)