# Tutorial: Get and visualize word embeddings

## Step 1: Download GloVe embeddings file

Because GloVe is a static embedding model (each word maps to one fixed vector), we can download the pre-trained GloVe embeddings rather than training our own. Once we have the embeddings file, we can reuse it in any future code. Here we'll download the GloVe embeddings file from Stanford's website.

In [None]:
# Set the path to the data directory
from pathlib import Path

local_data_path = Path().resolve().parent / "local_data"
assert local_data_path.exists(), "Data path does not exist"

# Download GloVe embeddings to the data directory
import requests

glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_path = local_data_path / "glove.6B.zip"
if not glove_zip_path.exists():
    print("Downloading GloVe embeddings... This may take a while.")
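    # Stream to disk in chunks so the large archive is never held in memory all at once
    # (the 60-second timeout and 1 MB chunk size are arbitrary choices)
    with requests.get(glove_url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(glove_zip_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)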
    print("Download complete.")
else:
    print("GloVe embeddings already downloaded.")

# unzip the file
import zipfile

with zipfile.ZipFile(glove_zip_path, "r") as zip_ref:
    zip_ref.extractall(local_data_path)
print("Unzipped GloVe embeddings.")

OK. We have the embedding files. Let's take a look at what we downloaded.

In [None]:
glove_files = local_data_path.glob("*glove*")
for file in glove_files:
    print(file)

You should see a file called `glove.6B.zip`. This is a zip file containing the GloVe embeddings. You should also see a bunch of files with the format `glove.6B.___d.txt`. These are the actual embeddings files. The number in the filename indicates the dimensionality of the embeddings. For example, `glove.6B.50d.txt` contains 50-dimensional embeddings.

Let's take a look:

In [None]:
glove_size = 100

if glove_size not in [50, 100, 200, 300]:
    raise ValueError("Invalid GloVe size. Must be one of [50, 100, 200, 300].")

glove_file = local_data_path / f"glove.6B.{glove_size}d.txt"

print("Loading GloVe embeddings...\n")
embeddings = {}
with open(glove_file, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = list(map(float, values[1:]))
        embeddings[word] = vector
print("Examining GloVe embeddings...")
print(f"Number of embedded words: {len(embeddings)}")
print(f"Number of dimensions: {len(next(iter(embeddings.values())))}")
print(
    f"First five dimensions of embedding for 'business': {embeddings['business'][:5]}"
)

Now go back and change the `glove_size` to some alternative sizes based on the files you downloaded. 
* Does the number of embedded words change?
* Does the number of dimensions change?
* Do the embeddings look different?

## Step 2: Exploring the embeddings

Now let's look at some other words. Add some words to the `words` list below and see how the embeddings change.

In [None]:
words = [
    "hello",
    "world",
    "business",
    "economy",
    "finance",
    "covfefe",
    "perniciousness",
    "ice cream",
]
for word in words:
    if word in embeddings:
        print(f"Embedding for '{word}': {embeddings[word]}")
    else:
        print(f"'{word}' not found in GloVe embeddings.")

What happened with the 

* made-up word?
* rare word? 
* compound word?

What you observe is a limitation of static embedding models based on individual word tokens. If the word is not in the vocabulary that was used to train the embeddings, the model will not have an embedding for that word.

So:
* 'covfefe' first appeared in a 2017 tweet by President Trump, after these GloVe embeddings were trained, so the model has no embedding for it.
* 'perniciousness' is a rare word that did not make it into the vocabulary of the GloVe model.
* "Ice cream" is a compound word. Both individual words are in the vocabulary, but the model does not have an embedding for the compound "ice cream." Moreover, the meaning of the compound is not just the sum of its parts.
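
The three cases above can be reproduced with a toy dictionary. The sketch below is a minimal illustration, not part of any GloVe tooling: `toy_embeddings` and `coverage_report` are hypothetical names and the vectors are made up. It shows that a lookup-table model either has an entry for a term or does not, and that a phrase can be missing even when each of its parts is present.

```python
# Toy stand-in for the GloVe dictionary; the vectors are made up for illustration.
toy_embeddings = {
    "ice": [0.1, 0.9],
    "cream": [0.8, 0.2],
    "business": [0.5, 0.5],
}


def coverage_report(terms, table):
    """Split `terms` into those that have an embedding and those that do not."""
    found = [term for term in terms if term in table]
    missing = [term for term in terms if term not in table]
    return found, missing


found, missing = coverage_report(["business", "covfefe", "ice cream"], toy_embeddings)
print("found:", found)
print("missing:", missing)

# The phrase is missing even though each of its parts is in the vocabulary.
parts_known = all(part in toy_embeddings for part in "ice cream".split())
print("every part of 'ice cream' is known:", parts_known)
```

Running the same kind of check against the real `embeddings` dictionary before a larger analysis shows how much of your text the model can actually represent.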

What are the implications of this?



New words enter the language and are added to the dictionary every year, yet a static model's vocabulary is frozen at training time. What does the evolution of language mean for a model like this?

Moreover, we use compound words all the time. Even when we can break them down into their component parts, the meaning of the compound is not always the sum of those parts, as "ice cream" shows.

Now, let's consider some words that we might think to be similar to or very different from each other.

In [None]:
words = ["man", "boy", "male", "woman", "girl", "female"]
for word in words:
    if word in embeddings:
        print(f"Embedding for '{word}': {embeddings[word]}")
    else:
        print(f"'{word}' not found in GloVe embeddings.")

Are these embeddings similar to or different from each other?

It's kind of hard to tell, right? With so many dimensions, it's hard to compare the embeddings by eye. However, we can use a package called `gensim` to make comparisons easier.

In [None]:
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format(
    glove_file, binary=False, no_header=True
)

# Compare each pair of words
from itertools import combinations

for word1, word2 in combinations(words, 2):
    if word1 in glove_model and word2 in glove_model:
        similarity = glove_model.similarity(word1, word2)
        print(f"Similarity between '{word1}' and '{word2}': {similarity:.4f}")
    else:
        print(f"'{word1}' or '{word2}' not found in GloVe embeddings.")

Do the results match your expectations?

Did you expect that "male" and "female" would be more similar to each other than "male" and "man"? Why do you think this happened? (Think back to how these embeddings were trained. What was the training objective?)

Now let's look also at the words that are most similar to these words rather than comparing them to each other.

In [None]:
# Find the most similar words to each word in the words list
from pprint import pprint

for word in words:
    if word in glove_model:
        similar_words = glove_model.most_similar(word, topn=5)
        print(f"Most similar words to '{word}':")
        pprint(similar_words, indent=4)
    else:
        print(f"'{word}' not found in GloVe embeddings.")
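
To make the idea of "most similar" concrete, here is a minimal brute-force sketch of a nearest-neighbour search by cosine similarity. It assumes made-up three-dimensional vectors (`toy_vectors`) and hypothetical helper names (`unit`, `nearest`). gensim's `most_similar` is vectorized and far more efficient, so treat this only as an illustration of the ranking idea.

```python
import math

# Made-up 3-dimensional vectors standing in for real GloVe embeddings.
toy_vectors = {
    "man": [0.9, 0.1, 0.3],
    "boy": [0.8, 0.2, 0.4],
    "woman": [0.1, 0.9, 0.3],
    "girl": [0.2, 0.8, 0.4],
    "economy": [0.0, 0.1, 0.9],
}


def unit(vector):
    """Scale a vector to length 1 so that a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def nearest(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`, highest first."""
    q = unit(vectors[query])
    scores = [
        (word, sum(a * b for a, b in zip(q, unit(vec))))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = nearest("man", toy_vectors)
for word, score in neighbours:
    print(f"{word}: {score:.4f}")
```

With these toy numbers, "boy" ranks first because its vector points in nearly the same direction as the vector for "man".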

It's your turn to explore the embeddings! Find a few words that you're interested in and find how similar they are to other words. What are the most similar words to those words?

In [8]:
# Your playground here

Remember how the reading for this week talked about the ability to do vector algebra with word embeddings? Let's try that out!

What happens when we take the vector for "king" and subtract the vector for "man" and add the vector for "woman"?

In [None]:
# King-man+woman
print(f"King {glove_model['king'][:5]}...")
print(f"- man {glove_model['man'][:5]}...")
print(f"+ woman {glove_model['woman'][:5]}...")
result = glove_model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(f"\nSolution: {result}")
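
The arithmetic behind this kind of query can be sketched on a toy example. The block below is a simplified illustration under stated assumptions: the two-dimensional vectors in `toy_vectors` are made up (one axis loosely stands for "royalty", the other for "gender"), and `cosine` and `analogy` are hypothetical helpers. gensim's own implementation differs in details such as vector normalization, so its scores will not match this sketch.

```python
import math

# Made-up 2-dimensional vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.7, 0.8],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors):
    """Add the `positive` vectors, subtract the `negative` ones,
    and return the closest remaining word with its similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave the query words out of the candidates, otherwise "king" itself could win.
    excluded = set(positive) | set(negative)
    candidates = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(candidates.items(), key=lambda pair: pair[1])


best = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(best)
```

The toy vectors were chosen so that the analogy works cleanly. Real embeddings are much noisier, which is one reason the algebra does not hold for every word association.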

What if we did the same thing for "businessman"?

In [None]:
# Businessman-man+woman
print(f"Businessman {glove_model['businessman'][:5]}...")
print(f"- man {glove_model['man'][:5]}...")
print(f"+ woman {glove_model['woman'][:5]}...")
result = glove_model.most_similar(
    positive=["businessman", "woman"], negative=["man"], topn=1
)
print(f"\nSolution: {result}")

Try it for yourself... Can you find other similar associations?

Do they ALL work or does the algebra break down for some word associations?

Why do you think that is?

In [11]:
# Your playground here

## Step 3: Visualizing the embeddings

We've played around with the embeddings quite a bit, but it's also helpful to visualize them. Let's use a package called `matplotlib` to visualize the embeddings.

However, the embeddings are high-dimensional, so we need to reduce them to two dimensions before we can plot them. One way to do this is t-SNE (t-distributed Stochastic Neighbor Embedding), a dimensionality-reduction technique that tries to keep points that are close together in the original space close together in the plot.

In [None]:
# Visualize the embeddings
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np


def plot_words(wordlist: list):
    # Keep only words that have an embedding so labels and rows stay aligned
    known_words = [word for word in wordlist if word in embeddings]
    word_embeddings = np.array([embeddings[word] for word in known_words])

    # Note: t-SNE requires perplexity to be smaller than the number of points
    tsne = TSNE(n_components=2, perplexity=5, random_state=24601)
    reduced_embeddings = tsne.fit_transform(word_embeddings)
    plt.figure(figsize=(5, 5))
    for i, word in enumerate(known_words):
        plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
        plt.annotate(
            word,
            xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]),
            xytext=(5, 2),
            textcoords="offset points",
        )
    plt.show()


words = ["man", "boy", "male", "king", "woman", "girl", "female", "queen"]
plot_words(words)

Dimensionality reduction necessarily loses some information. However, even when we take high-dimensional data and reduce it to 2 dimensions, we can still see some interesting patterns. How do the words we selected cluster together?

Let's try a more diverse set of words and see if we can see other patterns. Let's plot some animals and types of art.

In [None]:
animals = [
    "cat",
    "dog",
    "fish",
    "bird",
    "elephant",
    "giraffe",
    "lion",
    "tiger",
    "bear",
    "zebra",
]
types_of_art = [
    "painting",
    "sculpture",
    "photography",
    "drawing",
    "printmaking",
    "collage",
    "mosaic",
    "graffiti",
]
plot_words(animals + types_of_art)

Do you see any interesting patterns?

Let's try one more set of words. Let's try some words related to food.

In [None]:
foods = [
    "pizza",
    "fish",
    "burger",
    "salad",
    "appetizer",
    "sushi",
    "cake",
    "lettuce",
    "chocolate",
    "candy",
]
plot_words(foods)

Again... do you see any interesting patterns?

Now let's try adding the food words to our animal and art words.

In [None]:
plot_words(foods + animals + types_of_art)

Did you notice that 'fish' shows up twice? That's because 'fish' is both an animal and a type of food. It's on both lists. What do you think about the location of 'fish' in the plot?

But wait, the meaning of 'fish' in the context of animals is different from the meaning of 'fish' in the context of food. What does this tell you about the embeddings?

This is a limitation of static embedding models: they do not take into account the context in which a word is used. For example, the word 'bank' can be a verb or a noun, and it can refer to a financial institution or the side of a river. A static model gives you one embedding for 'bank' regardless of context. That single embedding may blend information from the different meanings, as you likely see here in where 'fish' sits relative to the other food and animal words, but it cannot separate those meanings.
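
A static model is, in effect, a dictionary lookup. The sketch below uses made-up vectors (`toy_embeddings`) and a hypothetical helper `embed_tokens` to show that the rest of the sentence never enters the computation, so 'bank' receives the same vector in both sentences.

```python
# Made-up vectors; a real static model would supply these.
toy_embeddings = {
    "bank": [0.4, 0.6],
    "river": [0.9, 0.1],
    "loan": [0.1, 0.9],
    "the": [0.5, 0.5],
}


def embed_tokens(sentence, table):
    """Map each known token to its vector; neighbouring words are never consulted."""
    return {token: table[token] for token in sentence.lower().split() if token in table}


by_the_river = embed_tokens("the river bank", toy_embeddings)
at_the_lender = embed_tokens("the bank loan", toy_embeddings)

same_vector = by_the_river["bank"] == at_the_lender["bank"]
print("Same vector for 'bank' in both sentences:", same_vector)
```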

## Step 4: Document Embeddings

We've been looking at the embeddings of individual words, but sometimes you want an embedding for an entire document. One way to build one from word embeddings is to take the average of the embeddings of all the words in the document. The result is a simple document embedding.

Let's start by loading a few documents.

In [None]:
import textwrap

abstracts_txt = Path() / "data" / "abstracts.txt"
assert abstracts_txt.exists(), "Abstracts file does not exist"
with open(abstracts_txt, "r", encoding="utf-8-sig") as f:
    abstracts = f.readlines()
print(f"Number of abstracts: {len(abstracts)}")
print(f"First abstract: {textwrap.fill(abstracts[0], width=80)}")

Remember from last week, we noted that we generally can't work with raw text. We need to clean it first. Let's do that now.

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")


def preprocess(text: str) -> list:
    doc = nlp(text)
    filtered_tokens = [
        token.lemma_
        for token in doc
        if not (token.is_punct or token.is_stop or token.is_digit or token.is_space)
    ]
    return filtered_tokens


spacy_results = [preprocess(abstract) for abstract in abstracts]

pprint(spacy_results, compact=True, indent=4)

Great! Make sure you understand why we did each of those preprocessing steps before proceeding. For instance, what would have happened if we didn't remove stop words? If we didn't remove punctuation?

Now, let's get the embeddings for each of the words.

In [None]:
emb_abstracts = []
for abstract in spacy_results:
    abstract_embeddings = []
    for word in abstract:
        if word in embeddings:
            abstract_embeddings.append(embeddings[word])
    if abstract_embeddings:
        emb_abstracts.append(abstract_embeddings)

for idx, abstract in enumerate(emb_abstracts, 1):
    print(f"Number of embeddings in abstract {idx}: {len(abstract)}")
    # zip stops early, so abstracts with fewer than three embeddings do not raise an IndexError
    for ordinal, embedding in zip(("first", "second", "third"), abstract):
        print(
            f"\tFirst five dimensions of {ordinal} embedding in abstract {idx}: {embedding[:5]}"
        )

OK, so now we have all of the words replaced with their individual embeddings. Now we can take the average of the embeddings for each document to get a document embedding.

In [None]:
doc_embeddings = []
for abstract in emb_abstracts:
    if abstract:
        avg_embedding = np.mean(abstract, axis=0)
        doc_embeddings.append(avg_embedding)

print(f"Number of document embeddings: {len(doc_embeddings)}")
for idx, doc_embedding in enumerate(doc_embeddings, 1):
    print(f"Document embedding {idx}:\n{doc_embedding[:5]}...")

OK, but we have the same issue as before... the data are so high-dimensional that it's hard to determine how similar they are. Let's compute the cosine similarity between the document embeddings.

In [None]:
for doc1_idx, doc2_idx in combinations(range(len(doc_embeddings)), 2):
    doc1, doc2 = doc_embeddings[doc1_idx], doc_embeddings[doc2_idx]
    # Cosine similarity: dot product divided by the product of the vector lengths
    similarity = np.dot(doc1, doc2) / (np.linalg.norm(doc1) * np.linalg.norm(doc2))
    print(
        f"Similarity between document {doc1_idx + 1} and document {doc2_idx + 1}: {similarity:.4f}"
    )

If that similarity formula doesn't look familiar to you, revisit the chapter 6 reading for this week (formula 6.9).
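
As a worked example of the calculation in the cell above (dot product divided by the product of the vector lengths), here is a sketch on two small made-up vectors. It uses only the standard library so that each step of the arithmetic is visible.

```python
import math

# Two made-up vectors chosen so the arithmetic is easy to follow by hand.
u = [1.0, 2.0, 2.0]
v = [2.0, 1.0, 2.0]

dot = sum(a * b for a, b in zip(u, v))      # 1*2 + 2*1 + 2*2 = 8
norm_u = math.sqrt(sum(a * a for a in u))   # sqrt(1 + 4 + 4) = 3
norm_v = math.sqrt(sum(b * b for b in v))   # sqrt(4 + 1 + 4) = 3
cosine = dot / (norm_u * norm_v)            # 8 / 9

print(f"{cosine:.4f}")  # 0.8889
```

Because each vector is divided by its own length, cosine similarity depends only on direction: multiplying `u` by 10 would leave the result unchanged.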

They're all pretty similar to each other, right? Why do you think that is?

When we use t-SNE to visualize the document embeddings, we can see that there may be some differences.

In [None]:
# produce TSNE visualization of the document embeddings
tsne = TSNE(n_components=2, perplexity=5, random_state=24601, max_iter=10000)
reduced_embeddings = tsne.fit_transform(np.array(doc_embeddings))
plt.figure(figsize=(15, 15))
for i in range(len(doc_embeddings)):
    plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
    plt.annotate(
        f"Doc {i + 1}",
        xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]),
        xytext=(5, 2),
        textcoords="offset points",
    )
plt.show()

Documents 12 and 23 seem to be closely related. Let's see if they have something in common:

In [None]:
print("Abstract 12\n", "-" * 12)
print(textwrap.fill(abstracts[11]))
print("\nAbstract 23\n", "-" * 12)
print(textwrap.fill(abstracts[22]))

Um... not fantastic. Let's look at documents 48 and 50, which also seem pretty close:

In [None]:
print("Abstract 48\n", "-" * 12)
print(textwrap.fill(abstracts[47]))
print("\nAbstract 50\n", "-" * 12)
print(textwrap.fill(abstracts[49]))

Also not fantastic. The problem is that documents are much more than the average of their individual words. The meaning of a document is not just the sum of its parts. Some words carry more meaning than others (which is why we removed stop words). The order of the words also matters. For example, "the cat sat on the mat" means something different from "the mat sat on the cat."

Let's try a better approach. Let's use OpenAI's embedding model to get the documents' embeddings instead.

## Step 5: Using OpenAI's Embedding Model

In [None]:
import openai
import dotenv
import os

dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_KEY")

text = (
    "Did you go to Paula, add $10 to your OpenAI API key, and put your new key "
    "in your '.env' file? If not, this isn't going to work..."
)

response = openai.embeddings.create(model="text-embedding-3-small", input=text)
print(
    textwrap.fill(
        f'The OpenAI embedding for the text "{text}" is {response.data[0].embedding[:10]}...',
        width=100,
    )
)
print()
print(
    textwrap.fill(
        f"The OpenAI embedding for the text \"{text}\" is {len(response.data[0].embedding)} dimensions.",
        width=100,
    )
)

That is a lot of dimensions. Let's run the abstracts through the OpenAI embedding model and see what we get.

In [None]:
response = openai.embeddings.create(model="text-embedding-3-small", input=abstracts)
openai_embeddings = [doc.embedding for doc in response.data]
print(f"Number of OpenAI document embeddings: {len(openai_embeddings)}")
for idx, doc_embedding in enumerate(openai_embeddings, 1):
    print(f"Document embedding {idx}:\n{doc_embedding[:5]}...")

Notice we didn't do ANY preprocessing. We just sent the raw text to the model. This is because the OpenAI model is trained on raw text and tokenizes the input itself. In fact, if we preprocess the text, we might actually distort its meaning.

For example, punctuation didn't mean much when we were looking at individual words, but it does mean something when we're looking at entire documents. The same goes for stop words: when we're trying to capture the meaning of an entire document, we don't want to get rid of "the" or "a" because they can change the meaning of the words in context.

The OpenAI model is a contextual embeddings model. This means that it takes into account the context in which the word is used. For example, the word 'bank' would have a different embedding depending on whether it was used in the context of a financial institution or the side of a river.

Let's revisualize the embeddings using t-SNE and see if the clusters are more meaningful.

In [None]:
# Produce a t-SNE visualization of the OpenAI document embeddings
tsne = TSNE(n_components=2, perplexity=5, random_state=24601, max_iter=10000)
reduced_embeddings = tsne.fit_transform(np.array(openai_embeddings))
plt.figure(figsize=(15, 15))
for i in range(len(openai_embeddings)):
    plt.scatter(reduced_embeddings[i, 0], reduced_embeddings[i, 1])
    plt.annotate(
        f"Doc {i + 1}",
        xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]),
        xytext=(5, 2),
        textcoords="offset points",
    )
plt.show()

OK, so we can see that the documents are clustering together in a different way. Let's see if the articles closer together are more similar to each other in a more meaningful way.

2 and 39 seem to hang together. Let's look at them:

In [None]:
print("Abstract 2\n", "-" * 12)
print(textwrap.fill(abstracts[1]))
print("\nAbstract 39\n", "-" * 12)
print(textwrap.fill(abstracts[38]))

Interesting: both of those seem to be about decision-making, so it makes sense that they sit near each other.


What about 3 and 22?

In [None]:
print("Abstract 3\n", "-" * 12)
print(textwrap.fill(abstracts[2]))
print("\nAbstract 22\n", "-" * 12)
print(textwrap.fill(abstracts[21]))

These both seem to be about the CEO from the strategic management literature.

Now do some more exploring on your own. What clusters do you see in the above visualization? What do these clusters 'mean' to you when you look at the texts themselves?

In [88]:
# Your playground here

Document embeddings are a powerful way to capture the meaning of entire documents. They can be used for a variety of tasks, such as document classification, clustering, and information retrieval. But they are more than the sum/average of the individual words (even when you isolate just the 'important' words). The order of the words matters. The context in which the words are used matters.

Done...