# <font color="#003660">Applied Machine Learning for Text Analysis (M.184.5331)</font>
  

# <font color="#003660">Session 1: Introduction to Natural Language Processing</font>

# <font color="#003660">Notebook 3: Train Your Own Word Embeddings</font>

<center><br><img width=256 src="https://raw.githubusercontent.com/olivermueller/aml4ta-2021/main/resources/dag.png"/><br></center>

<p>
<center>
<div>
    <font color="#085986"><b>By the end of this lesson, you ...</b><br><br>
        ... are able to train your own word embeddings from data.
    </font>
</div>
</center>
</p>

# Import packages

As always, we first need to load a number of required Python packages:
- `pandas` provides high-performance, easy-to-use data structures and data analysis tools.
- `spacy` offers industrial-strength natural language processing.
- `gensim` is a fast library for training of vector embeddings and topic models.
- `sklearn` is the de-facto standard machine learning package in Python.
- `plotly` is a library for creating interactive plots.

In [None]:
import pandas as pd
import pickle
import spacy
from gensim.models import word2vec
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px

# How are word embeddings learned?

Word embeddings can be learned from a given corpus by training a shallow neural network. The training objective is either to predict a target word from its surrounding context words in a sentence (CBOW) or, conversely, to predict the context words surrounding a given target word (Skip-gram). After training, the weight matrix W holds the actual embedding vectors (Mikolov et al., 2013).

<br>

<center><img width=512 src="https://git.uni-paderborn.de/data.analytics.teaching/aml4ta-2020/-/raw/master/week_3/images/cbow_skipgram.jpg"/>Source: Kimothi et al. (2020)</center>
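
To make the two training objectives concrete, the following minimal sketch builds the training examples that CBOW and Skip-gram would see for one toy sentence. The helper `training_pairs` and the fixed window are illustrative assumptions; real word2vec implementations add refinements such as randomly shrunk windows, subsampling of frequent words, and negative sampling.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in a tokenized sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            yield context, target

sentence = ["ripe", "cherry", "and", "plum", "flavors"]

# CBOW: one example per position, predicting the target from all context words.
cbow_examples = list(training_pairs(sentence, window=1))

# Skip-gram: one example per (target, context word) pair.
skipgram_examples = [(target, c) for context, target in cbow_examples for c in context]

print(cbow_examples[1])       # (['ripe', 'and'], 'cherry')
print(skipgram_examples[:3])  # [('ripe', 'cherry'), ('cherry', 'ripe'), ('cherry', 'and')]
```

Both objectives use the same windows over the text; they differ only in which side of the pair is the input and which is the prediction target.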

# Load documents

Load the wine reviews (Source: https://www.kaggle.com/datasets/zynicide/wine-reviews) from a CSV file.

In [None]:
corpus = pd.read_csv("https://raw.githubusercontent.com/olivermueller/amlta-2025/main/Session_01/winemag-data-130k-v2.csv")

# Preprocess documents

Perform some standard natural language preprocessing steps with spaCy. As word embeddings are best trained on sentences, not documents, we first cut the reviews into sentences and then preprocess them sentence by sentence.

Warning: This may take a few minutes.

In [None]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.add_pipe('sentencizer')

def spacy_sentence_tokenize(df, text_col="description", batch_size=1000, n_process=4):
    texts = (str(x) if pd.notna(x) and x else "" for x in df[text_col].values)
    sentences = []
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        for sent in doc.sents:
            toks = [t.text.lower() for t in sent if t.is_alpha]
            if toks:
                sentences.append(toks)
    return sentences

sentences = spacy_sentence_tokenize(corpus, text_col="description")

How many sentences do we have?

In [None]:
len(sentences)

Look at the first one.

In [None]:
sentences[0]
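
The result is a list of sentences, each of which is a list of lowercased alphabetic tokens. The sketch below reproduces that shape with a crude regular-expression stand-in for spaCy, which may be handy for quick experiments. The function `simple_sentence_tokenize` is a hypothetical helper and is not equivalent to spaCy's tokenizer (for example, it handles contractions and abbreviations differently).

```python
import re

def simple_sentence_tokenize(text):
    """Split text into sentences, then into lowercased alphabetic tokens."""
    result = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for raw_sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Keep runs of ASCII letters only, which drops digits and punctuation.
        tokens = [tok.lower() for tok in re.findall(r"[A-Za-z]+", raw_sentence)]
        if tokens:
            result.append(tokens)
    return result

review = "Aromas include tropical fruit, broom and dried herb. The palate offers unripened apple in 2015."
toy_sentences = simple_sentence_tokenize(review)
print(toy_sentences)
```

This list-of-token-lists format is what the word2vec training step in the next section expects as input.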

# Learn word embeddings from data

We use Gensim's implementation of word2vec to create word embeddings. See https://radimrehurek.com/gensim/models/keyedvectors.html#module-gensim.models.keyedvectors for documentation.

Create a model with 300 dimensions and a context window of 6 words. Only consider words that occur at least twice in the corpus (`min_count` refers to the total number of occurrences, not the number of documents). Use 6 worker threads for training the model.

In [None]:
model = word2vec.Word2Vec(sentences, vector_size=300, window=6, min_count=2, workers=6)
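
The effect of `min_count` can be illustrated without Gensim. The following sketch counts how often each token occurs across all sentences and keeps only tokens that reach the threshold. The helper `build_vocab` and the toy sentences are assumptions for illustration, not Gensim's actual vocabulary-building code.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=2):
    """Return the set of tokens whose total frequency is at least min_count."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

min_count_demo = [
    ["crisp", "citrus", "citrus", "notes"],
    ["soft", "tannins"],
    ["crisp", "finish"],
]

kept = build_vocab(min_count_demo, min_count=2)
print(sorted(kept))  # ['citrus', 'crisp']
```

Note that "citrus" is kept although it appears in only one sentence, because the threshold applies to the total number of occurrences.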

Get word vectors from model.

In [None]:
word_vectors = model.wv

# Explore word embeddings

Retrieve most similar words to a given word.

In [None]:
word_vectors.most_similar("red")

In [None]:
word_vectors.most_similar("white")
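
The similarity scores returned above are cosine similarities between word vectors. As a rough sketch of the idea, the code below ranks words by cosine similarity using made-up two-dimensional vectors. The vectors and the helper `most_similar_sketch` are hypothetical and only mimic the behaviour conceptually.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(word, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2D vectors, for illustration only.
color_vectors = {
    "red": [1.0, 0.1],
    "cherry": [0.9, 0.2],
    "white": [0.1, 1.0],
    "citrus": [0.2, 0.9],
}

neighbors = most_similar_sketch("red", color_vectors)
print([word for word, _ in neighbors])  # ['cherry', 'citrus']
```

Because cosine similarity depends only on the angle between vectors, two words can be close neighbours even if their vectors differ in length.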

Which word doesn't belong to the set?

In [None]:
word_vectors.doesnt_match(["red", "raspberry", "cranberry", "peach"])

In [None]:
word_vectors.doesnt_match(["white", "cherry", "cantaloupe", "citrus"])
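
One way to think about the odd-one-out task: normalize each word vector, average them, and pick the word whose vector is least similar to that average. The sketch below implements this idea on made-up vectors. It is an approximation of the concept under that assumption, not Gensim's source code, and `doesnt_match_sketch` is a hypothetical helper.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def doesnt_match_sketch(words, vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    normed = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(normed.values())))
    mean = unit([sum(vec[d] for vec in normed.values()) / len(normed)
                 for d in range(dims)])
    sims = {w: sum(a * b for a, b in zip(vec, mean)) for w, vec in normed.items()}
    return min(sims, key=sims.get)

# Made-up 2D vectors, for illustration only.
fruit_vectors = {
    "raspberry": [0.9, 0.1],
    "cranberry": [0.8, 0.2],
    "cherry": [0.85, 0.15],
    "peach": [0.1, 0.9],
}

odd_one = doesnt_match_sketch(["raspberry", "cranberry", "cherry", "peach"], fruit_vectors)
print(odd_one)  # peach
```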

Let's look at some analogies using "King – Man + Woman = Queen"-style vector arithmetic.

Fig - Red + White = ?

In [None]:
word_vectors.most_similar(positive=['fig', 'white'], negative=['red'])

Honey - White + Red = ?

In [None]:
word_vectors.most_similar(positive=['honey', 'red'], negative=['white'])

Riesling - White + Red = ?

In [None]:
word_vectors.most_similar(positive=['riesling', 'red'], negative=['white'])
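
Conceptually, an analogy query adds the (normalized) vectors of the positive words, subtracts those of the negative words, and then searches for the nearest remaining word by cosine similarity. The sketch below shows this on made-up three-dimensional vectors. The vectors and `analogy_sketch` are hypothetical, and Gensim's implementation differs in details.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def analogy_sketch(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # the input words are never returned
    scores = [(w, dot(unit(v), query)) for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: axis 0 = red vs. white, axis 1 = grape variety, axis 2 = fruit flavor.
wine_vectors = {
    "red": [1.0, 0.0, 0.0],
    "white": [-1.0, 0.0, 0.0],
    "riesling": [-1.0, 1.0, 0.0],
    "chardonnay": [-1.0, 1.0, 0.2],
    "pinot": [1.0, 1.0, 0.0],
    "cherry": [1.0, 0.0, 1.0],
}

# Riesling - White + Red = ?
answer = analogy_sketch(positive=["riesling", "red"], negative=["white"], vectors=wine_vectors)
print(answer[0][0])  # pinot
```

Excluding the query words from the candidates matters: without it, the nearest neighbour of the combined vector is often one of the input words themselves.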

# Visualize embeddings

Get a list of all the words in the vocabulary.

In [None]:
vocab = list(word_vectors.key_to_index)

Retrieve the associated word embedding vectors from the model.

In [None]:
X = word_vectors[vocab]

Reduce the dimensionality of the data with PCA.

In [None]:
X_pca = PCA(n_components=2).fit_transform(X)

Reformat the data, add each word's similarity to a "seed" word, keep only the words most similar to the seed, and create an interactive scatterplot.

In [None]:
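# Put the 2D coordinates into a DataFrame indexed by word
pca_df = pd.DataFrame(X_pca, index=vocab, columns=['x', 'y'])
pca_df["word"] = vocab

# Similarity to the seed word; initialise as float so the column can hold cosine scores
seed = "citrus"
pca_df["sim"] = 0.0

for word, sim in word_vectors.most_similar(seed, topn=100):
    pca_df.loc[word, 'sim'] = sim

# Keep only words with a positive similarity, i.e. the seed's nearest neighbours
pca_df = pca_df[pca_df["sim"] > 0]

fig = px.scatter(pca_df, x="x", y="y", color="sim",
                 hover_data=["word"],
                 opacity=0.2, color_continuous_scale='agsunset_r')
fig.show()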