# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
  

# <font color="#003660">Session 2: Unsupervised NLP</font>

# <font color="#003660">Notebook 2: Train Your Own Word Embeddings</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... are able to train your own word embeddings from data.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `SQLAlchemy`, together with `pymysql`, lets us communicate with SQL databases.
- `getpass` provides a function for entering passwords without echoing them to the screen.
- `spacy` offers industrial-strength natural language processing.
- `gensim` is a fast library for training of vector embeddings and topic models.
- `sklearn` is the de-facto standard machine learning package in Python.
- `plotly` is a library for creating interactive plots.

In [None]:
# Install packages
!pip install pymysql

In [None]:
import pandas as pd
import pickle
from sqlalchemy import create_engine, text
import getpass
import spacy
from gensim.models import word2vec
from gensim.models import KeyedVectors
import gensim.downloader as api
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# How are word embeddings learned?

Word embeddings can be learned from a given corpus by training a shallow neural network. The training objective of the network is either to predict a target word from its context words in a sentence (CBOW) or, vice versa, to predict the context words of a target word in a sentence (Skip-gram). After training, the rows of the learned weight matrix W serve as the embedding vectors, one row per vocabulary word (Mikolov et al., 2013).

<br>

<center><img width=512 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_3/images/cbow_skipgram.jpg"/>Source: Kimothi et al. (2020)</center>
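
To make the two training objectives more concrete, here is a minimal sketch of how training examples could be derived from a single sentence. The helper names (`context_window`, `skipgram_pairs`, `cbow_pairs`) and the example sentence are hypothetical and only serve as illustration; real implementations add refinements such as negative sampling.

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: one (target, context_word) example per context word."""
    return [(target, context)
            for i, target in enumerate(tokens)
            for context in context_window(tokens, i, window)]


def cbow_pairs(tokens, window):
    """CBOW: one (context_words, target) example per position."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


# Hypothetical example sentence, not taken from the dataset
sentence = ["ripe", "cherry", "and", "soft", "tannins"]

sg = skipgram_pairs(sentence, window=1)
cb = cbow_pairs(sentence, window=1)
print(sg[:3])  # Skip-gram predicts each context word from the target
print(cb[:2])  # CBOW predicts the target from all context words at once
```

With the same window, Skip-gram produces more (and smaller) training examples than CBOW, because every context word becomes its own prediction task.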

# Load documents

We will work with a dataset consisting of approx. 130,000 wine reviews written by professional sommeliers. Each review comes with a review text, a rating, and additional metadata about the wine, such as variety, location, winery, and price. You can find the original dataset here: https://www.kaggle.com/zynicide/wine-reviews

In [None]:
# Get credentials
user = input("Username: ")
passwd = getpass.getpass("Password: ")
server = input("Server: ")
db = input("Database: ")

# Create an engine instance (SQLAlchemy)
engine = create_engine("mysql+pymysql://{}:{}@{}/{}".format(user, passwd, server, db))

# Define SQL query
sql_query = "SELECT * FROM WineDataset"

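# Query dataset and load the result into a DataFrame (pandas);
# the context manager closes the connection afterwards
with engine.connect() as connection:
    result = connection.execute(text(sql_query))
    corpus = pd.DataFrame(result.fetchall(), columns=list(result.keys()))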
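
One caveat about building the connection URL by string formatting: a password containing characters such as `@` or `/` would be misread as part of the URL. A minimal sketch of one way to guard against this, using `urllib.parse.quote_plus` from the standard library. The credentials below are hypothetical placeholders, not the course database.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, used only to illustrate the escaping
user = "student"
passwd = "p@ss/word"
server = "db.example.org"
db = "wine"

# Percent-encode user name and password before inserting them into the URL
url = "mysql+pymysql://{}:{}@{}/{}".format(
    quote_plus(user), quote_plus(passwd), server, db
)
print(url)
```

The escaped string can then be passed to `create_engine` in place of the unescaped one.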

# Preprocess documents

Perform some standard natural language preprocessing steps with spaCy. As word embeddings are best trained on sentences, not documents, we first cut the reviews into sentences and then preprocess them sentence by sentence.

Warning: This takes 10+ minutes!

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser', 'tagger'])
nlp.add_pipe('sentencizer')

sentences = []
for i, entry in corpus.iterrows():
    tokens = nlp(entry['description'])
    for sentence in tokens.sents:
        tokens_to_keep = []
        for t in sentence:
            if t.is_alpha: # keep alphabetic tokens only (drops numbers and punctuation)
                tokens_to_keep.append(t.text.lower()) # append lower-cased word
        sentences.append(tokens_to_keep)
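
To see what the token filter does without waiting for the full run, here is a small sketch that applies the same rule to one sentence. It assumes a hand-tokenized, hypothetical sentence instead of a spaCy `Doc`, and uses `str.isalpha()`, which spaCy documents as equivalent to `Token.is_alpha`.

```python
# Hypothetical, hand-tokenized sentence standing in for a spaCy sentence span
tokens = ["This", "2015", "Riesling", "is", "crisp", ",", "off", "-", "dry", "."]

# Keep purely alphabetic tokens and lower-case them, as in the loop above
kept = [t.lower() for t in tokens if t.isalpha()]
print(kept)
```

The vintage `2015` and all punctuation are dropped, so numbers never enter the embedding vocabulary.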

Save sentences to disk.

In [None]:
# Set up Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
with open("/content/drive/MyDrive/amlta/sentences.pkl", "wb") as output:
    pickle.dump(sentences, output, pickle.HIGHEST_PROTOCOL)

Load sentences from disk.

In [None]:
with open("/content/drive/MyDrive/amlta/sentences.pkl", "rb") as f:  # avoid shadowing the built-in input()
    sentences = pickle.load(f)

How many sentences do we have?

In [None]:
len(sentences)

Look at the first one.

In [None]:
sentences[0]

# Learn word embeddings from data

We use Gensim's implementation of word2vec to create word embeddings. See https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors for documentation of the resulting word vectors.

Create a model with 300-dimensional vectors and a context window of 6 words. Ignore words that occur fewer than 2 times in the whole corpus (`min_count` counts occurrences, not documents). Use 6 worker threads to train the model. Since `sg` is left at its default, Gensim trains a CBOW model.

In [None]:
model = word2vec.Word2Vec(sentences, vector_size=300, window=6, min_count=2, workers=6)
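
A minimal sketch of how a frequency threshold like `min_count` prunes the vocabulary, assuming it is applied to total token counts across all sentences. The toy corpus and variable names are hypothetical; this is not Gensim's internal code.

```python
from collections import Counter

# Hypothetical toy corpus in the same list-of-token-lists format as `sentences`
toy_sentences = [
    ["ripe", "plum", "and", "plum"],
    ["cherry", "and", "spice"],
    ["bright", "citrus", "and", "cherry"],
]

min_count = 2

# Count every occurrence of every token across the whole corpus
counts = Counter(token for sentence in toy_sentences for token in sentence)

vocabulary = sorted(word for word, n in counts.items() if n >= min_count)
dropped = sorted(word for word, n in counts.items() if n < min_count)
print(vocabulary)
print(dropped)
```

Note that `plum` is kept although it appears in only one sentence: what matters under this assumption is the number of occurrences, not the number of sentences or reviews containing the word.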

Get the word vectors from the model and store them in a file for later reuse.

In [None]:
word_vectors = model.wv
word_vectors.save_word2vec_format("/content/drive/MyDrive/amlta/wine_300dim_2minwords_6context")

Load word vectors from file.

In [None]:
word_vectors = KeyedVectors.load_word2vec_format("/content/drive/MyDrive/amlta/wine_300dim_2minwords_6context")

# Explore word embeddings

Retrieve most similar words to a given word.

In [None]:
word_vectors.most_similar("red")

In [None]:
word_vectors.most_similar("white")
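
The scores returned by `most_similar` are cosine similarities between embedding vectors. A minimal sketch of that ranking in plain Python, assuming hypothetical 3-dimensional toy vectors in place of the trained 300-dimensional ones:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors):
    """Rank all other words by cosine similarity to `word`."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "red": [0.9, 0.1, 0.0],
    "cherry": [0.8, 0.3, 0.1],
    "citrus": [0.1, 0.9, 0.2],
}

ranking = toy_most_similar("red", toy_vectors)
print(ranking)
```

Because cosine similarity ignores vector length, two words count as similar when their vectors point in a similar direction, regardless of how frequent the words are.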

Which word doesn't belong to the set?

In [None]:
word_vectors.doesnt_match(["red", "raspberry", "cranberry", "peach"])

In [None]:
word_vectors.doesnt_match(["white", "cherry", "cantaloupe", "citrus"])
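
Roughly, the odd one out is the word whose vector is least similar to the average of all vectors in the set. The following sketch illustrates that idea with hypothetical 2-dimensional toy vectors; it is a simplified reading of the approach, not Gensim's actual implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_doesnt_match(words, vectors):
    """Return the word least similar to the mean of the unit vectors."""
    unit = {w: normalize(vectors[w]) for w in words}
    dims = len(next(iter(unit.values())))
    mean = normalize([sum(unit[w][d] for w in words) / len(words)
                      for d in range(dims)])
    # For unit vectors the dot product equals the cosine similarity
    score = {w: sum(a * b for a, b in zip(unit[w], mean)) for w in words}
    return min(words, key=score.get)


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "raspberry": [0.9, 0.1],
    "cranberry": [0.8, 0.2],
    "cherry": [0.85, 0.15],
    "peach": [0.1, 0.9],
}

odd_one = toy_doesnt_match(["raspberry", "cranberry", "cherry", "peach"], toy_vectors)
print(odd_one)
```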

Let's look at some analogies using "King - Man + Woman = Queen"-style vector arithmetic.

Fig - Red + White = ?

In [None]:
word_vectors.most_similar(positive=['fig', 'white'], negative=['red'])

Honey - White + Red = ?

In [None]:
word_vectors.most_similar(positive=['honey', 'red'], negative=['white'])

Riesling - White + Red = ?

In [None]:
word_vectors.most_similar(positive=['riesling', 'red'], negative=['white'])
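
The arithmetic behind these analogy queries can be sketched as follows: add the vectors of the `positive` words, subtract those of the `negative` words, and return the remaining word closest to the result. The 2-dimensional toy vectors below are hypothetical (first axis: red vs. white, second axis: grape variety or not), and the exact weighting inside Gensim may differ.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def toy_analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    dims = len(next(iter(unit.values())))
    query = [0.0] * dims
    for w in positive:
        query = [q + x for q, x in zip(query, unit[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, unit[w])]
    query = normalize(query)
    # Exclude the input words, otherwise they tend to win trivially
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates,
               key=lambda w: sum(a * b for a, b in zip(unit[w], query)))


# Hypothetical toy vectors, not taken from the trained model
toy_vectors = {
    "red": [1.0, 0.0],
    "white": [-1.0, 0.0],
    "riesling": [-1.0, 1.0],
    "chardonnay": [-0.9, 1.0],
    "merlot": [1.0, 1.0],
    "cherry": [0.9, -0.5],
}

answer = toy_analogy(positive=["riesling", "red"], negative=["white"], vectors=toy_vectors)
print(answer)
```

In this toy space, subtracting `white` and adding `red` flips the colour axis while leaving the grape-variety axis untouched, which is why a red grape variety comes out on top.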

# Visualize embeddings

Get a list of all the words in the vocabulary.

In [None]:
vocab = list(word_vectors.key_to_index)

Retrieve the associated word embedding vectors from the model.

In [None]:
X = word_vectors[vocab]

Reduce the dimensionality of the data with PCA.

In [None]:
X_pca = PCA(n_components=2).fit_transform(X)
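
Projecting 300 dimensions onto 2 inevitably discards information, so it is worth checking how much variance the two components retain. A minimal NumPy sketch of PCA via singular value decomposition, assuming random stand-in data (`X_toy`) instead of the real embedding matrix. With scikit-learn, the same quantity is available from a fitted `PCA` object as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the embedding matrix X (50 "words", 5 dimensions)
X_toy = rng.normal(size=(50, 5))

# PCA: center the data, then project onto the top right-singular vectors
X_centered = X_toy - X_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_toy_2d = X_centered @ Vt[:2].T

# Share of the total variance captured by each component
explained = S**2 / np.sum(S**2)
print(X_toy_2d.shape)
print(explained[:2].sum())
```

If the first two components explain only a small share of the variance, distances in the scatterplot should be read with caution.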

Put the data into a DataFrame, add each word's similarity to a "seed" word, keep only the words most similar to the seed, and create an interactive scatterplot.

In [None]:
pca_df = pd.DataFrame(X_pca, index=vocab, columns=['x', 'y'])
pca_df["word"] = vocab

seed = "citrus"
pca_df["sim"] = 0.0  # float column, so similarity scores can be assigned without dtype issues

for word, sim in word_vectors.most_similar(seed, topn=100):
    pca_df.loc[word, 'sim'] = sim

# Keep only the words most similar to the seed (comment out to plot the full vocabulary)
pca_df = pca_df[pca_df["sim"]>0]

fig = px.scatter(pca_df, x="x", y="y", color="sim",
                 hover_data=["word"],
                 range_x = [-11, 11], range_y = [-11, 11],
                 opacity = 0.2, color_continuous_scale='agsunset_r')
fig.show()