# **Introduction to Word Embeddings**

This tutorial illustrates several applications of word embeddings by estimating a Word2Vec model with an off-the-shelf Python library (```Gensim```).

Some additional resources on Word2Vec:
- [Jurafsky & Martin (2021). Book chapter.](https://web.stanford.edu/~jurafsky/slp3/6.pdf)
- [Mikolov et al. (2013). Original paper.](https://arxiv.org/pdf/1301.3781.pdf)
- [Rong (2016). Additional explanation on how to train the model.](https://arxiv.org/pdf/1411.2738.pdf)
- [Alammar (2019). Illustrated guide.](https://jalammar.github.io/illustrated-word2vec/)
- [Gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html)

## **Setup**

In [None]:
# ## install necessary packages
# !pip install flashtext                  # easy phrase replacing methods
# !pip install contractions               # expand English contractions 
# !pip install --upgrade spacy==2.2.4     # functions for lemmatizing
# !pip install gensim==4.0.0              # word2vec estimation
# !pip install adjustText                 # generate plots with lots of text labels

In [None]:
import sys
import pandas as pd
import numpy as np
import string
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
from adjustText import adjust_text
from sklearn.decomposition import PCA
from collections import Counter
from sklearn.preprocessing import StandardScaler

In [None]:
# define paths
data_path = "../data/"
pymodules_path = "../pymodules"

In [None]:
# import our own code
sys.path.append(pymodules_path)
import preprocessing_class as pc
import dictionary_methods

## **Off-the-shelf Word2Vec using Gensim**

[Gensim](https://radimrehurek.com/gensim/index.html) is a very powerful library that contains efficient (written in ```C```) implementations of several NLP models. Word2Vec is included among these. We will start by using this library to demonstrate use cases for word embeddings.

### *Load data and preprocess text*

We will now load some real data on which to estimate our word embeddings. The data consists of paragraphs from the Inflation Reports produced by the Bank of England. It starts in 1998 and ends in 2015. Reports are produced four times a year, in February, May, August and November.

As a starting point, we will need to download three data files from Google Drive:
1. [Inflation Reports data](https://drive.google.com/file/d/1o_67kmSkLjaEoYxIUkfnouC_Nm_IQJZE/view?usp=sharing) 
2. [Monetary Policy Committee minutes data](https://drive.google.com/file/d/1iCtirTJfgowx2TVJVmISNRbq690wv1vm/view?usp=sharing)
3. [Quarterly GDP data](https://drive.google.com/file/d/1m9lTsJU2--K8mpLu2STZSS691WLZnN6h/view?usp=sharing)

Once you have downloaded the three files, put them in the *"data"* folder and keep their original names.

In [None]:
data = pd.read_csv(data_path + "ir_data_final.txt", sep="\t")
data = data[['ir_date', 'paragraph']]
data.columns = ['yearmonth', 'paragraph']
print(data.shape)
data.head(10)

In [None]:
# explore one of the paragraphs
data.loc[0, "paragraph"]

In [None]:
# check how often these reports are produced
grouped = data.groupby("yearmonth", as_index=False).size()
print(grouped.head(5))
print()
print(grouped.tail(5))

In [None]:
def apply_preprocessing(data, replacing_dict, pattern, punctuation):
    """ Function to apply the steps from the preprocessing class in the correct
        order to generate a term frequency matrix and the appropriate dictionaries
    """
    
    prep = pc.RawDocs(data, stopwords="short", lower_case=True, contraction_split=True, tokenization_pattern=pattern)
    prep.phrase_replace(replace_dict=replacing_dict, case_sensitive_replacing=False)
    # lower-case text, expand contractions and initialize stopwords list
    prep.basic_cleaning()
    # split the documents into tokens
    prep.tokenize_text()
    # clean tokens
    prep.token_clean(length=2, punctuation=punctuation, numbers=True)
    # create document-term matrix
    prep.dt_matrix_create(items='tokens', min_df=10, score_type='df')
    
    # get the vocabulary and the appropriate dictionaries to map from indices to words
    word2idx = prep.vocabulary["tokens"]
    idx2word = {i:word for word,i in word2idx.items()}
    vocab = list(word2idx.keys())
    
    return prep, word2idx, idx2word, vocab
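
The `min_df=10` argument above suggests that a token is kept only if it appears in at least ten documents. The internals of `preprocessing_class` are not shown in this tutorial, so the following is only a minimal sketch of that idea in plain Python; the helper `build_vocab` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, min_df=2):
    """Keep tokens that occur in at least `min_df` documents (hypothetical helper)."""
    # document frequency: count each token at most once per document
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    kept = sorted(token for token, n_docs in df.items() if n_docs >= min_df)
    # map words to indices and back, as apply_preprocessing does with its vocabulary
    word2idx = {word: i for i, word in enumerate(kept)}
    idx2word = {i: word for word, i in word2idx.items()}
    return word2idx, idx2word

# toy corpus: 'inflation' and 'rose' appear in two documents each
docs = [["inflation", "rose", "inflation"],
        ["inflation", "fell"],
        ["output", "rose"]]
word2idx, idx2word = build_vocab(docs, min_df=2)
print(word2idx)
```

Counting documents rather than raw occurrences means a word repeated many times in a single paragraph does not survive the filter on its own.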

In [None]:
# define dictionary for pre-processing class with terms we want to preserve
replacing_dict = {'monetary policy':'monetary-policy',
                  'interest rate':'interest-rate',
                  'interest rates':'interest-rate',
                  'yield curve':'yield-curve',
                  'repo rate':'repo-rate',
                  'bond yields':'bond-yields',
                  'real estate':'real-estate',
                  'economic growth':'economic-growth'}
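
To make the purpose of this dictionary concrete, here is a minimal sketch of phrase replacement using only the standard library. It is not the implementation used by `preprocessing_class`; the helper `replace_phrases` and the example sentence are hypothetical, and the dictionary keys are assumed to be lower-case.

```python
import re

def replace_phrases(text, replace_dict):
    """Replace multi-word phrases with single hyphenated tokens (hypothetical helper)."""
    # try longer phrases first so that the longest possible match wins
    phrases = sorted(replace_dict, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
                         flags=re.IGNORECASE)
    # look up the lower-cased match to get a case-insensitive replacement
    return pattern.sub(lambda m: replace_dict[m.group(1).lower()], text)

toy_dict = {"interest rate": "interest-rate",
            "interest rates": "interest-rate",
            "monetary policy": "monetary-policy"}
replaced = replace_phrases("Monetary policy affects interest rates.", toy_dict)
print(replaced)
```

Mapping both the singular and the plural form to the same token, as the dictionary above does for `interest-rate`, means the model learns a single embedding for the concept.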

In [None]:
# define tokenization pattern and punctuation symbols
pattern = r'''
          (?x)                # set flag to allow verbose regexps (to separate logical sections of pattern and add comments)
          \w+(?:-\w+)*        # word characters with internal hyphens
          | [][.,;"'?():_`-]  # punctuation as separate tokens (hyphen last, so it is literal)
          '''
punctuation = string.punctuation.replace("-", "")
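
As a quick illustration of what a pattern of this kind does, the sketch below applies a similar verbose regular expression with `re.findall` to a made-up sentence. How `preprocessing_class` applies the pattern internally is not shown in this tutorial, so treat this only as an approximation of the tokenization step.

```python
import re

# similar pattern to the one in this tutorial, with the verbose flag first
demo_pattern = r'''(?x)
    \w+(?:-\w+)*        # words, optionally with internal hyphens
    | [][.,;"'?():_`-]  # punctuation marks as separate tokens (hyphen last, so it is literal)
'''

sample = "the monetary-policy stance (see chart) is tight."
tokens = re.findall(demo_pattern, sample)
print(tokens)
```

The hyphenated phrase survives as a single token, which is why the replacement dictionary joins phrases with hyphens and why the hyphen is removed from the `punctuation` string.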

In [None]:
# use preprocessing class
prep, word2idx, idx2word, vocab = apply_preprocessing(data.paragraph, replacing_dict, pattern, punctuation)

In [None]:
# inspect a particular tokenized document and compare to its original form
i = 10
print(data.paragraph[i])
print("\n ------------------------------- \n")
print(prep.tokens[i])

### *Model estimation*

Now that we have our text preprocessed we can use the [Gensim](https://radimrehurek.com/gensim/) library to efficiently estimate word embeddings using word2vec.

In [None]:
# train Gensim's Word2Vec model
gensim_model = Word2Vec(sentences=prep.tokens,      # corpus
                        vector_size=100,            # embedding dimension
                        window=4,                   # words before and after to take into consideration
                        sg=1,                       # use skip-gram
                        negative=5,                 # number of negative examples for each positive one
                        alpha=0.025,                # initial learning rate
                        min_alpha=0.0001,           # minimum learning rate
                        epochs=5,                   # number of passes through the data
                        min_count=1,                # words that appear fewer times than this are dropped
                        workers=1,                  # we use 1 to ensure replicability
                        seed=92                     # for replicability
                       )
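
With `sg=1` the model is trained with skip-gram: each word is used to predict the words that appear within `window` positions of it. The sketch below enumerates the (center, context) pairs that such a window produces for a made-up sentence. The helper `skipgram_pairs` is hypothetical, and Gensim's actual implementation adds refinements (such as randomly shrinking the window) that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        # clip the window at the sentence boundaries
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["inflation", "remained", "above", "target"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair is a positive training example; the `negative=5` setting then draws five random "noise" words per positive pair for the model to tell apart from the true context word.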

In [None]:
# extract the word embeddings from the model
word_vectors = gensim_model.wv
word_vectors.vectors.shape  # vocab_size x embeddings dimension

There are many ways in which we can use these estimated word embeddings. We will start by showing a simple way to visualize them in two dimensions.

### *Visualization*

In [None]:
# use a PCA decomposition to visualize the embeddings in 2D
def pca_scatterplot(model, words):
    pca = PCA(n_components=2, random_state=92)
    word_vectors = np.array([model[w] for w in words])
    low_dim_emb = pca.fit_transform(word_vectors)
    plt.figure(figsize=(21,10))
    plt.scatter(low_dim_emb[:,0], low_dim_emb[:,1], edgecolors='blue', c='blue')
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")

    # get the text of the plotted words
    texts = []
    for word, (x,y) in zip(words, low_dim_emb):
        texts.append(plt.text(x+0.01, y+0.01, word, rotation=0))
    
    # adjust the position of the labels so that they don't overlap
    adjust_text(texts)
    # show plot
    plt.show()

In [None]:
# define the tokens to use in the plot
tokens_of_interest = ['economy', 'gdp', 'production', 'output',
                      'investment', 'confidence', 'sentiment',
                      'uncertainty', 'inflation', 'cpi',
                      'loan', 'mortgage', 'credit', 'debt', 'savings', 
                      'borrowing', 'housing', 'labour', 'workforce', 
                      'unemployment', 'employment', 'jobs', 'wages',
                      'trade', 'exports', 'imports']

# expand the list of tokens with all the tokens from the replacement dictionary
tokens_of_interest = set(tokens_of_interest + list(replacing_dict.values()) )

# plot
pca_scatterplot(word_vectors, list(tokens_of_interest))

We can clearly observe how words form some thematically cohesive groups: trade (e.g. exports, imports, trade, output), the job market (e.g. workforce, jobs, employment) and housing (e.g. real-estate, housing, borrowing, mortgage).

### *Nearest neighbors analysis*

We can further explore how words cluster in the embedded space by analyzing the nearest neighbours of some selected words.

In [None]:
# find the K nearest neighbours of relevant words
K = 10
words = ["uncertainty", "risk", "stable",
         "contraction", "expansion",
         "monetary-policy", "interest-rate", "inflation"]

for word in words:
    print(f"Nearest neighbors of: {word}")
    print(word_vectors.most_similar(word, topn=K))
    print("\n")
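
Gensim's `most_similar` ranks words by the cosine similarity between their vectors. The following sketch reproduces that ranking in plain Python; the three-dimensional vectors are made up for illustration and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# made-up embeddings: two "risk" words and two "trade" words
toy_vectors = {"risk": [0.9, 0.1, 0.2],
               "uncertainty": [0.8, 0.2, 0.1],
               "exports": [0.1, 0.9, 0.3],
               "imports": [0.2, 0.8, 0.4]}
neighbours = nearest_neighbours("risk", toy_vectors, topn=2)
print(neighbours)
```

Because cosine similarity depends only on the angle between vectors, words with very different frequencies (and hence vector lengths) can still be close neighbours.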

### *Analogy tasks*

A very interesting, and surprising, use of word embeddings is to find word analogies. The famous example used by [Mikolov et al. (2013)](https://arxiv.org/pdf/1301.3781.pdf) searches for a word $X$ in the embedded space that is similar to "woman" in the same sense that "king" is similar to "man". This task can be expressed in terms of a simple vector arithmetic problem as follows:

$$
\begin{aligned}
\vec{King}^{\,} - \vec{Man}^{\,} &= \vec{X}^{\,} - \vec{Woman}^{\,} \\
\vec{King}^{\,} - \vec{Man}^{\,} + \vec{Woman}^{\,} &= \vec{X}^{\,}
\end{aligned}
$$

Mikolov et al. (2013) find that when performing this operation on their trained embeddings, they are able to recover the word "queen".

$$ \vec{King}^{\,} - \vec{Man}^{\,} + \vec{Woman}^{\,} \approx \vec{Queen}^{\,} $$

Using ```Gensim```, this operation can be performed very easily with the ```.most_similar()``` method as follows:

<center>

```python
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
```

</center>
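
To see the arithmetic behind this call, the sketch below solves the same analogy by hand on made-up three-dimensional vectors (roughly "royalty", "male", "female"). It adds and subtracts the raw vectors exactly as in the formula, whereas Gensim works with unit-normalized vectors; like Gensim, it excludes the query words from the candidates. The helper `solve_analogy` is hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding query words."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# made-up embeddings for illustration only
toy = {"king":  [0.9, 0.8, 0.1],
       "queen": [0.9, 0.1, 0.8],
       "man":   [0.1, 0.9, 0.1],
       "woman": [0.1, 0.1, 0.9],
       "apple": [0.0, 0.2, 0.2]}
answer = solve_analogy(positive=["woman", "king"], negative=["man"], vectors=toy)
print(answer)
```

Excluding the query words matters in practice: without that step, the nearest vector to the result is often one of the input words themselves.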

We will play with this idea and try to extend it to our own domain. Some of the analogies that we will try to solve are: 

$$
\begin{aligned}
\vec{Contraction}^{\,} - \vec{Expansion}^{\,} + \vec{Downward}^{\,} &= \vec{X}^{\,} \\
\vec{Inflation}^{\,} - \vec{CPI}^{\,} + \vec{GDP}^{\,} &= \vec{X}^{\,}
\end{aligned}
$$

In [None]:
# create the analogy tasks for our data
positive_words = [['contraction', 'downward'],
                  ['expansion', 'tighten'],
                  ['inflation', 'gdp'],
                  ['company', 'wages']]

negative_words = [['expansion'],
                  ['contraction'],
                  ['cpi'],
                  ['profits']]

for pw, nw in zip(positive_words, negative_words):
    print(f"Analogy task for positive words: {pw} and negative words {nw}")
    print(word_vectors.most_similar(positive=pw, negative=nw))
    print("\n")

### *Building dictionaries*

One last use of word embeddings is to expand existing dictionaries by finding the nearest neighbours to a set of "center" terms. To illustrate this, we will show how to generate dictionaries of positive and negative terms to analyze text data from the Bank of England Monetary Policy Committee minutes.

In [None]:
# create a positive dictionary by finding the nearest neighbors to a combination of relevant words
N = 40
pos_center_terms = ['expansion', 'stable']
pos_nn = [w for w, _ in word_vectors.most_similar(positive=pos_center_terms, topn=N)]
pos_word2vec = pos_center_terms + pos_nn
print(pos_word2vec)

In [None]:
# create a negative dictionary by finding the nearest neighbors to a combination of relevant words
N = 40
neg_center_terms = ['contraction', 'uncertainty']
neg_nn = [w for w, _ in word_vectors.most_similar(positive=neg_center_terms, topn=N)]
neg_word2vec = neg_center_terms + neg_nn
print(neg_word2vec)

In [None]:
# load data for dictionary method example
path_dict_example = data_path + 'mpc_minutes.txt'
data_dict, prep_dict = dictionary_methods.dict_example(path_dict_example) # dataframe, preprocessing object

In [None]:
# generate the count of positive and negative lemmas in the corpus with our new dictionaries
pos_counts_word2vec, neg_counts_word2vec = dictionary_methods.pos_neg_counts(prep_dict, pos_word2vec, neg_word2vec)
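
The code of `dictionary_methods.pos_neg_counts` is not shown in this tutorial. As a rough sketch of what a dictionary count involves, the hypothetical helper below counts, for each tokenized document, how many tokens belong to a given word list; the toy documents and word lists are made up.

```python
def count_hits(tokenized_docs, dictionary):
    """Number of tokens per document that appear in `dictionary` (hypothetical helper)."""
    terms = set(dictionary)  # set membership is O(1) per token
    return [sum(token in terms for token in doc) for doc in tokenized_docs]

docs = [["growth", "remained", "strong", "and", "stable"],
        ["output", "was", "weak", "amid", "uncertainty"]]
pos_counts = count_hits(docs, ["strong", "stable"])
neg_counts = count_hits(docs, ["weak", "uncertainty"])
print(pos_counts, neg_counts)
```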

In [None]:
# Apel and Blix-Grimaldi (2012) dictionaries
pos_words_AB = ['accelerate','accelerated','accelerates','accelerating','expand','expanded','expanding','expands',
             'fast','faster','fastest','gain','gained','gaining','gains','high','higher','highest','increase',
             'increased','increases','increasing','strong','stronger','strongest']

neg_words_AB = ['contract','contracted','contracting','contracts','decelerate','decelerated','decelerates',
             'decelerating','decrease','decreased','decreases','decreasing','lose','losing','loss','losses',
             'lost','low','lower','lowest','slow','slower','slowest','weak','weaker','weakest']

In [None]:
# generate the count of positive and negative lemmas in the corpus with Apel and Blix-Grimaldi (2012)
pos_counts_AB, neg_counts_AB = dictionary_methods.pos_neg_counts(prep_dict, pos_words_AB, neg_words_AB)

In [None]:
# add counts to the data
data_dict['pos_counts_word2vec'] = pos_counts_word2vec
data_dict['neg_counts_word2vec'] = neg_counts_word2vec

data_dict['pos_counts_AB'] = pos_counts_AB
data_dict['neg_counts_AB'] = neg_counts_AB

data_dict.head()

In [None]:
# aggregate to year-month level
data_agg = data_dict.groupby(['date']).agg({'pos_counts_word2vec': 'sum', 'neg_counts_word2vec': 'sum',
                                            'pos_counts_AB': 'sum', 'neg_counts_AB': 'sum',
                                            'year': 'mean', 'quarter':'mean'})
data_agg.head()

In [None]:
# aggregate to year-quarter level removing incomplete quarters 
data_agg['months_x_quarter'] = 1
data_agg = data_agg.groupby(['year', 'quarter']).sum()[['pos_counts_word2vec', 'neg_counts_word2vec',
                                                        'pos_counts_AB', 'neg_counts_AB',
                                                        'months_x_quarter']]

data_agg = data_agg[data_agg['months_x_quarter']==3]
del data_agg['months_x_quarter']

data_agg.head()

In [None]:
# compute sentiment at year-quarter level
data_agg['sentiment_word2vec'] = (data_agg.pos_counts_word2vec - data_agg.neg_counts_word2vec)/(data_agg.pos_counts_word2vec + data_agg.neg_counts_word2vec)
data_agg['sentiment_AB'] = (data_agg.pos_counts_AB - data_agg.neg_counts_AB)/(data_agg.pos_counts_AB + data_agg.neg_counts_AB)
data_agg.head()
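
The sentiment measure computed above is a net balance: positive minus negative counts, divided by their total, so it lies between -1 (only negative terms) and 1 (only positive terms). A minimal scalar version, with made-up counts and a hypothetical helper name, makes the bounds easy to check:

```python
import math

def net_sentiment(pos, neg):
    """(pos - neg) / (pos + neg); NaN when no dictionary terms were found."""
    total = pos + neg
    if total == 0:
        return float("nan")
    return (pos - neg) / total

print(net_sentiment(30, 10))              # more positive than negative terms
print(net_sentiment(0, 5))                # only negative terms
print(math.isnan(net_sentiment(0, 0)))    # undefined when both counts are zero
```

Returning NaN for an empty quarter mirrors what the pandas expression yields when both counts are zero.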

Next we add quarterly GDP data collected from the ONS website.

In [None]:
# prepare GDP data
ons = pd.read_csv(data_path + 'ons_quarterly_gdp.csv', names=['label', 'gdp_growth', 'quarter_long'], header=0)
ons['year'] = ons.label.apply(lambda x: x[:4]).astype(int)
ons['quarter'] = ons.label.apply(lambda x: x[6]).astype(int)
ons = ons[['year', 'quarter', 'gdp_growth']]
ons = ons.drop_duplicates().reset_index(drop=True).copy()
ons.head()

In [None]:
# merge to sentiment data
df = data_agg.merge(ons, how='left', on=['year', 'quarter']).copy()
# create year-quarter variable
df["year_quarter"] = df.apply(lambda x: f"{int(x['quarter'])}Q{int(x['year'])}", axis=1)
df["year_quarter"] = df["year_quarter"].apply(lambda x: pd.Period(value=x, freq="Q").to_timestamp())
df.head()

In [None]:
print(df[['sentiment_AB', 'sentiment_word2vec', 'gdp_growth']].corr())
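
By default, `DataFrame.corr` reports Pearson correlation coefficients. The sketch below computes the same statistic by hand for two short, made-up series, so that the numbers in the table above are easier to interpret; the helper `pearson` is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# made-up quarterly values for illustration only
toy_sentiment = [0.2, 0.1, -0.3, 0.0]
toy_growth = [0.6, 0.5, -1.2, 0.3]
r = pearson(toy_sentiment, toy_growth)
print(round(r, 3))
```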

In [None]:
scaler = StandardScaler()

fig, ax = plt.subplots(figsize=(16,8))
ax.plot(df["year_quarter"], scaler.fit_transform(df.sentiment_AB.values.reshape(-1, 1)), label="Apel and Blix-Grimaldi (2012)")
ax.plot(df["year_quarter"], scaler.fit_transform(df.sentiment_word2vec.values.reshape(-1, 1)), label="Word2Vec dictionaries")
ax.plot(df["year_quarter"], scaler.fit_transform(df.gdp_growth.values.reshape(-1, 1)), label="GDP Growth")

plt.legend()
plt.show()
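
The three series are standardized before plotting so that they share a common scale. `StandardScaler` subtracts the mean and divides by the population standard deviation; the sketch below does the same in plain Python on made-up numbers, using a hypothetical helper `standardize`.

```python
import statistics

def standardize(values):
    """Z-scores using the population standard deviation, as StandardScaler does."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0])
print(z)
```

After this transformation each series has mean zero and unit variance, so the plot compares their co-movement rather than their levels.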