⚠️ **Static Version Notice**

This is a static export of an interactive marimo notebook. Some features have been modified for compatibility:

- Interactive UI elements (sliders, dropdowns, text inputs) have been removed
- UI variable references have been replaced with default values
- Some cells may have been simplified or removed entirely

For the full interactive experience, please run the original marimo notebook (.py file) using:
```bash
uv run marimo edit notebook_name.py
```

---


# Module 2: Practice 3 - Word Embeddings

In [None]:
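import subprocess
# Download the large English model; this is only needed once per environment.
result = subprocess.run(['bash', '-c', 'uv run python -m spacy download en_core_web_lg'], capture_output=True, text=True)
print(result.stderr if result.returncode != 0 else "en_core_web_lg is available")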


## Setup

First, we import the *spacy* library and load the large English model.

In [None]:
import spacy

nlp = spacy.load("en_core_web_lg")


Next, let's define a function to calculate word embeddings based on an input word:

In [None]:
def calculate_embedding(input_word):
    """Return the embedding vector spaCy assigns to the input text."""
    doc = nlp(input_word)
    return doc.vector


Let's try it with the word 'apple'. For brevity, only the first 10 elements of the embedding vector are displayed:

In [None]:
calculate_embedding("apple")[:10]


In [None]:
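# "example_text" is the static export's default for the notebook's text input; replace it with a word of your choice.
word_embedding = calculate_embedding("example_text")
word_embedding[:10]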


## Similarity

Let's add a function to calculate the similarity between two words based on their embeddings:

In [None]:
def calculate_similarity(word1, word2):
    return nlp(word1).similarity(nlp(word2))
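
By default, spaCy's `similarity` is the cosine similarity of the two vectors. To make that concrete, here is a minimal sketch of the same measure in plain Python. The helper name `cosine_similarity_plain` and the toy vectors are assumptions made for illustration; they are not real embeddings and not part of the notebook.

```python
import math


def cosine_similarity_plain(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, made up for illustration.
same_direction = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_plain([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.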


Compare the embeddings of the words 'apple' and 'car':

In [None]:
calculate_similarity("apple", "car")


In [None]:
calculate_similarity("example_text", "example_text")


In [None]:
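# "example_text" stands in for the four words originally supplied through the notebook's text inputs.
la_word1_embedding = nlp("example_text").vector
la_word2_embedding = nlp("example_text").vector
la_word3_embedding = nlp("example_text").vector
# Vector arithmetic: word1 + (word2 - word3) should land near word4 if the analogy holds.
la_word = la_word1_embedding + (la_word2_embedding - la_word3_embedding)
la_word4 = nlp("example_text").vector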


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
print("Cosine similarity: ", cosine_similarity([la_word], [la_word4])[0][0])
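
Because the word inputs above are placeholders in this export, the following sketch illustrates the same arithmetic, `word1 + (word2 - word3)`, on hand-made two-dimensional vectors. The words, the vector values, and the `cosine` helper are all hypothetical and chosen only so the result can be checked by eye; real embeddings rarely behave this cleanly.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
}

# word1 + (word2 - word3), computed element by element.
combined = [k + (w - m) for k, w, m in zip(toy["king"], toy["woman"], toy["man"])]

# The toy word whose vector is most similar to the combined vector.
best = max(toy, key=lambda name: cosine(combined, toy[name]))
print(combined, best)
```

In this toy setup the combined vector coincides with the vector for "queen", so its cosine similarity to that word is the highest of the four.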


----
## Sentence Embeddings

Finally, to calculate an embedding for a sentence, we can simply average the embeddings of all the words in that sentence. We will again use `spacy`, whose `Doc.vector` defaults to the average of the token vectors, to calculate the sentence embeddings.
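
The averaging step itself can be sketched without any model. The helper `average_vectors` and the two-dimensional word vectors below are assumptions for illustration only, not values produced by spaCy.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


# Hypothetical 2-d word vectors, one per word of a three-word sentence.
word_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 0.0],
]

sentence_vector = average_vectors(word_vectors)
print(sentence_vector)
```

Each component of the sentence vector is the mean of that component across the word vectors, so word order has no effect on the result.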

In [None]:
query = "What is the capital of France?"
info_1 = "The capital of France is Paris"
info_2 = "France is a beautiful country"
info_3 = "Today is very warm in New York City"
print("Response 1 Similarity: ", nlp(query).similarity(nlp(info_1)))
print("Response 2 Similarity: ", nlp(query).similarity(nlp(info_2)))
print("Response 3 Similarity: ", nlp(query).similarity(nlp(info_3)))


Being able to quickly score how similar a query is to candidate passages is the basis of embedding-based Information Retrieval, and it is especially useful when combined with Large Language Models trained for chat and question answering.
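
As a rough sketch of how such retrieval could work, the snippet below ranks candidate passages by cosine similarity to a query vector. The function `rank_by_similarity`, the passage labels, and all vector values are hypothetical; in practice the vectors would come from an embedding model such as the one loaded above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(query_vector, candidates):
    """Return (score, label) pairs sorted from most to least similar."""
    scored = [(cosine(query_vector, vector), label) for label, vector in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical 2-d vectors standing in for a query and three passages.
query_vector = [1.0, 0.0]
candidates = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.6, 0.6],
    "doc_c": [0.0, 1.0],
}

ranking = rank_by_similarity(query_vector, candidates)
for score, label in ranking:
    print(f"{label}: {score:.3f}")
```

The top-ranked passage is the one a retrieval step would hand to a downstream model as context for answering the query.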