# Demo: Word vector embeddings
*Made by M.H. Skjelvareid using GitHub Copilot for code snippet generation.*

This notebook demonstrates the representation of words as vectors, i.e. lists of
numbers. There is no single "correct" way to encode words as numbers, and the encoding
differs between language models. This notebook is based on an English language
model trained and distributed by spaCy,
[en_core_web_lg](https://spacy.io/models/en#en_core_web_lg). 
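
To make the idea of "different encodings" concrete, here is a minimal sketch that runs without any language model. It compares two toy encodings of the same three words: one-hot vectors and small dense vectors. The dense values and the helper `cosine_similarity` are invented for illustration and are not taken from spaCy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


words = ["apple", "banana", "car"]

# Encoding 1: one-hot vectors, one position per word in the vocabulary
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(words))]
    for i, w in enumerate(words)
}

# Encoding 2: dense vectors with made-up values (assumption for illustration),
# loosely meaning [is edible, is a machine]
dense = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "car": [0.1, 0.9],
}

for name, encoding in [("one-hot", one_hot), ("dense", dense)]:
    fruit_sim = cosine_similarity(encoding["apple"], encoding["banana"])
    mixed_sim = cosine_similarity(encoding["apple"], encoding["car"])
    print(f"{name}: apple~banana = {fruit_sim:.2f}, apple~car = {mixed_sim:.2f}")
```

With one-hot vectors every pair of different words is equally dissimilar, while the dense encoding places "apple" closer to "banana" than to "car". Learned embeddings such as those in the spaCy model are dense vectors of this second kind.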


In [1]:
# Imports
import ipywidgets as widgets
import matplotlib.pyplot as plt
import numpy as np
import spacy
from IPython.display import display
from sklearn.manifold import TSNE

In [2]:
# Download the English language model if it is not already installed
if not spacy.util.is_package("en_core_web_lg"):
    !python -m spacy download en_core_web_lg

In [3]:
# Create NLP object (language processor)
nlp = spacy.load("en_core_web_lg")

In [4]:
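# Process a string of words and stack each token's vector into a 2D array
doc = nlp("apple banana orange fruit car bus train vehicle cat dog animal")
embeddings = np.array([token.vector for token in doc])
print(embeddings.shape)  # (number of words, embedding dimension)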

(11, 300)
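
Each row of the array is one word's embedding. One way to explore such an array is to look up, for every word, the other word with the most similar vector. The sketch below does this with cosine similarity on toy 3-dimensional vectors, so it runs without the model. The helper names `normalize` and `most_similar` and the toy values are assumptions made for illustration.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def most_similar(words, vectors):
    """For each word, return the other word with the highest cosine similarity."""
    unit = [normalize(v) for v in vectors]
    result = {}
    for i, word in enumerate(words):
        # Dot product of unit vectors equals cosine similarity
        scores = [
            (sum(a * b for a, b in zip(unit[i], unit[j])), words[j])
            for j in range(len(words))
            if j != i
        ]
        result[word] = max(scores)[1]
    return result


# Toy 3-dimensional stand-ins for the 300-dimensional rows of `embeddings`
toy_words = ["apple", "banana", "car", "bus"]
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.0, 0.8, 0.4],
]
print(most_similar(toy_words, toy_vectors))
```

In the notebook, the same function could be called as `most_similar([token.text for token in doc], embeddings)`, since it only iterates over rows and numbers.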


## Word embedding visualizer
The code below creates an interactive text field where you can enter your own words and
study how closely they are related in "embedding space". Each embedding consists of
300 numbers and cannot be visualized directly in a 2D plot. The embeddings are
therefore projected to 2D using [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that the
two dimensions have no direct meaning; the point is that similar embeddings will tend
to be located close to each other in the 2D plot.
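
The following sketch illustrates why the two axes carry no meaning of their own. t-SNE's objective depends only on the distances between points, so rotating a finished layout changes every coordinate but leaves all distances, and therefore the clusters, intact. The layout below is hypothetical and merely stands in for the output of `tsne.fit_transform`; the helpers `pairwise_distances` and `rotate` are written for this example.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of 2D points."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate(points, angle_degrees):
    """Rotate 2D points counterclockwise around the origin."""
    angle = math.radians(angle_degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


# Hypothetical 2D layout with two small clusters (not real t-SNE output)
layout = [(0.0, 0.0), (1.0, 0.5), (8.0, 7.0), (9.0, 7.5)]
rotated = rotate(layout, 40)

before = pairwise_distances(layout)
after = pairwise_distances(rotated)
max_change = max(
    abs(b - a)
    for row_before, row_after in zip(before, after)
    for b, a in zip(row_before, row_after)
)
print(f"Largest change in any pairwise distance: {max_change:.1e}")
```

The coordinates of the rotated points differ from the originals, yet the largest change in distance is at the level of floating-point rounding. When reading the plot, it is the grouping of words that matters, not their position along either axis.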

In [5]:
# Create heading widget
heading = widgets.HTML(
    value=(
        "<h2>Word Embedding Visualization</h2><p>Enter at least 5 words, separated by commas. "
        'Click the "Update Visualization" button to see how these words are related to each other '
        "in the language model's vector space. Words with similar meanings will tend to "
        "cluster together.</p>"
    ),
    layout=widgets.Layout(width="80%", margin="10px 0"),
)

# Create text input widget and update button
text_input = widgets.Textarea(
    value="person, woman, nephew, car, airplane, train, dog, bird, grass, tree, algae, water, air, rock, sand",
    description="Words:",
    layout=widgets.Layout(width="80%", height="100px"),
)

update_button = widgets.Button(
    description="Update Visualization",
    layout=widgets.Layout(width="200px", margin="5px 0px 0px 90px"),
)

# Create output widget for the plot
plot_output = widgets.Output()


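def create_plot(words):
    # t-SNE needs at least two samples, and perplexity must be below the sample count
    if len(words) < 2:
        print("Please enter at least two words.")
        return

    # Calculate one embedding per entry (doc.vector averages the token vectors,
    # so multi-word entries are also supported)
    docs = [nlp(word) for word in words]
    embeddings = np.array([doc.vector for doc in docs])

    # t-SNE visualization, with perplexity capped at 5 and kept below the word count
    tsne = TSNE(n_components=2, random_state=0, perplexity=min(5, len(words) - 1))
    embeddings_2d = tsne.fit_transform(embeddings)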

    # Plot
    plt.figure(figsize=(10, 8))
    plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
    x_offset = 0.01 * (max(embeddings_2d[:, 0]) - min(embeddings_2d[:, 0]))

    for i, word in enumerate(words):
        plt.annotate(word, (embeddings_2d[i, 0] + x_offset, embeddings_2d[i, 1]))

    plt.xlabel("t-SNE Dimension 1")
    plt.ylabel("t-SNE Dimension 2")
    plt.show()


def on_button_click(b):
    # Clear only the plot output
    plot_output.clear_output(wait=True)

    # Split the input by comma, strip whitespace and drop empty entries
    words = [w.strip() for w in text_input.value.split(",") if w.strip()]

    # Create the plot in the output widget
    with plot_output:
        create_plot(words)


# Register the callback
update_button.on_click(on_button_click)

# Create the main UI layout
ui = widgets.VBox([heading, text_input, update_button])
main_layout = widgets.VBox([ui, plot_output])

# Initial display
display(main_layout)

# Create initial plot
with plot_output:
    words = [w.strip() for w in text_input.value.split(",") if w.strip()]
    create_plot(words)

VBox(children=(VBox(children=(HTML(value='<h2>Word Embedding Visualization</h2><p>Enter at least 5 words, sepa…
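
The widget asks for at least 5 comma-separated words, but the notebook does not enforce this. As a possible extension, the parsing step could be moved into a small helper that also removes duplicates and checks the word count. The function `parse_words` and its `min_words` parameter are hypothetical and not part of the notebook above.

```python
def parse_words(text, min_words=5):
    """Split comma-separated text into unique, non-empty words.

    Raises ValueError if fewer than `min_words` distinct words remain.
    """
    words = []
    for raw in text.split(","):
        word = raw.strip()
        if word and word not in words:  # skip blanks and duplicates
            words.append(word)
    if len(words) < min_words:
        raise ValueError(f"Expected at least {min_words} words, got {len(words)}.")
    return words


print(parse_words("person, woman, nephew, car, car, , train"))
```

Inside a button callback, the `ValueError` could be caught and its message printed in the output widget instead of drawing a plot.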