# Word embeddings

See the learning materials associated with this exercise <a href="https://applied-language-technology.mooc.fi/html/notebooks/part_iii/05_embeddings_continued.html" target="blank_">here</a>.

For instructions on how to use TestMyCode (TMC) to test your code and submit it to the server, see <a href="https://applied-language-technology.mooc.fi/html/tmc.html" target="blank_">here</a>.

Remember to save this Notebook before testing your code. Press <kbd>Control</kbd>+<kbd>s</kbd> or select the *File* menu and click *Save*.

**The maximum number of points for this exercise is 25.**

## 1. Load a large language model for English (3 points)

Load a large language model for English (`en_core_web_lg`) and assign the resulting *Language* object under the variable `nlp_lrg`.

In [7]:
# Import the spaCy library
import spacy

# Write your answer below this line
nlp_lrg = spacy.load('en_core_web_lg')

[1m[TMC][0m [92mThe variable "nlp_lrg" was defined successfully! 1 point.[0m

[1m[TMC][0m [92mThe variable "nlp_lrg" contains a spaCy Language object! 1 point.[0m

[1m[TMC][0m [92mThe variable "nlp_lrg" contains a Language object for English! 1 point.[0m

## 2. Feed text to the language model (3 points)

Feed the string objects stored under variables `text_1` and `text_2` to the language model under `nlp_lrg`.

Store the results under the variables `doc_1` and `doc_2`, respectively.

In [12]:
# Define two example texts
text_1 = "With around 570,000 inhabitants, the Hanseatic city is the 11th largest city of Germany as well as the second largest city in Northern Germany after Hamburg."
text_2 = "Bremen is the fourth largest city in the Low German dialect area after Hamburg, Dortmund and Essen."

# Write your answer below this line
doc_1 = nlp_lrg(text_1)
doc_2 = nlp_lrg(text_2)

[1m[TMC][0m [92mThe variables "doc_1" and "doc_2" were defined successfully! 1 point.[0m

[1m[TMC][0m [92mThe variables "doc_1" and "doc_2" contain spaCy Doc objects! 2 points.[0m

In [12]:
doc_2

Bremen is the fourth largest city in the Low German dialect area after Hamburg, Dortmund and Essen.

## 3. Compare the similarity of `doc_1` and `doc_2` (3 points)

Measure the cosine similarity between the vectors for `doc_1` and `doc_2`. Assign the result under the variable `cosine_docs`.

In [15]:
# Write your answer below this line
cosine_docs = doc_1.similarity(doc_2)
cosine_docs


[1m[TMC][0m [92mThe variable "cosine_docs" was defined successfully! 1 point.[0m

[1m[TMC][0m [92mThe variable "cosine_docs" contains the expected value! 2 points.[0m

## 4. Get noun phrases in `doc_1` and `doc_2` (3 points)

Get the noun phrases (available under the attribute `noun_chunks`) from the *Doc* objects `doc_1` and `doc_2`. 

Cast the outputs into Python lists and assign the lists to the variables `chunks_1` and `chunks_2`, respectively. 

In [16]:
# Write your answer below this line
chunks_1 = list(doc_1.noun_chunks)
chunks_2 = list(doc_2.noun_chunks)

[1m[TMC][0m [92mThe variables "chunks_1" and "chunks_2" were defined successfully! 1 point.[0m

[1m[TMC][0m [92mThe variables "chunks_1" and "chunks_2" contain list objects! 1 point.[0m

[1m[TMC][0m [92mThe variables "chunks_1" and "chunks_2" contain the expected values! 1 point.[0m

In [15]:
chunks_1

[around 570,000 inhabitants,
 the Hanseatic city,
 the 11th largest city,
 Germany,
 the second largest city,
 Northern Germany,
 Hamburg]

## 5. Compare the similarity of noun phrases (3 points)

The second item in the list `chunks_1` contains the noun phrase "the Hanseatic city". Measure the cosine similarity between this noun phrase and

 1. The first item in the list `chunks_2` (Bremen) and store the result under the variable `bremen`.
 2. The fourth item in the list `chunks_2` (Hamburg) and store the result under the variable `hamburg`.
 
Tip: Keep in mind that items in lists can be accessed using brackets. Accessing an item in a list gives access to the methods and attributes associated with the list item.

In [18]:
# Write your answer below this line
bremen = chunks_1[1].similarity(chunks_2[0])
hamburg = chunks_1[1].similarity(chunks_2[3])

[1m[TMC][0m [92mThe variables "bremen" and "hamburg" were defined successfully! 1 point.[0m

[1m[TMC][0m [92mThe variables "bremen" and "hamburg" contain the expected values! 2 points.[0m

## 6. Bremen, Hamburg and other Hanseatic cities (3 points)

In terms of cosine similarity, the vector for "the Hanseatic city" is only somewhat similar to the vectors for "Bremen" and "Hamburg", although both Bremen and Hamburg are Hanseatic cities. Why is this information not captured by the vectors?

 1. One cannot compare vectors for proper nouns and noun phrases using cosine similarity.
 2. Vector representations for units larger than a single token may lose some information, because they are constructed by averaging the vectors for each token.
 3. Word embeddings cannot capture this kind of information.
 
Assign the number corresponding to the correct answer under the variable `answer_1`.

In [1]:
# Write your answer below this line
answer_1 = 2

[1m[TMC][0m [92mThe variable "answer_1" was defined successfully! [0m

[1m[TMC][0m [92mThe variable "answer_1" contains an integer.[0m

[1m[TMC][0m [92mThe variables "bremen" and "hamburg" contain the expected values![0m

[1m[TMC][0m [92mYour answer is correct! 3 points.[0m

In [18]:
doc_2

Bremen is the fourth largest city in the Low German dialect area after Hamburg, Dortmund and Essen.

## 7. Compare cosine similarity between several nouns (7 points)

The spaCy *Doc* object under the variable `doc_2` contains *Tokens* that refer to several cities in Germany: Bremen (at index `0`), Hamburg (`13`), Dortmund (`15`) and Essen (`17`).

 1. Retrieve the 300-dimensional word vectors for each *Token* mentioned above, and add them into a Python list ordered by their index. Assign the list under the variable `vector_list`.
 2. Use the `vstack()` function from the NumPy library to combine the vectors in `vector_list` into a NumPy matrix. Assign the result under the variable `matrix`.
 3. Use the `cosine_similarity()` function to measure cosine similarity between each vector in the matrix. Assign the result under the variable `cos_matrix`.
 4. Use the `heatmap()` function from the seaborn library to draw a heatmap of pairwise cosine similarities in `cos_matrix`.

Tip: Word vectors for various spaCy objects can be accessed using the attribute `vector`.

In [22]:
# Import the necessary libraries and their components
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from seaborn import heatmap

# Write your answer below this line
# Collect the Tokens for Bremen, Hamburg, Dortmund and Essen by their index
city_tokens = [doc_2[i] for i in (0, 13, 15, 17)]
vector_list = [token.vector for token in city_tokens]

# Convert the list of vectors into a NumPy matrix
matrix = np.vstack(vector_list)

# Measure cosine similarity between each vector in the matrix
cos_matrix = cosine_similarity(matrix)

# Draw a heatmap of pairwise cosine similarities, labelled with the city names
labels = [token.text for token in city_tokens]
heatmap(cos_matrix, annot=True, cmap="YlGnBu", xticklabels=labels, yticklabels=labels)



[1m[TMC][0m [92mThe variable "cos_matrix" has the correct shape.[0m

[1m[TMC][0m [92mThe variables "vector_list", "matrix" and "cos_matrix" were defined successfully![0m

[1m[TMC][0m [92mThe variable "cos_matrix" contains the expected values! 4 points.[0m

Which pair of *Tokens* has the highest cosine similarity?

 1. Bremen (0) and Hamburg (1)
 2. Hamburg (1) and Essen (3)
 3. Dortmund (2) and Essen (3)

Assign the number corresponding to the correct answer under the variable `answer_2`.

In [1]:
# Write your answer below this line
answer_2 = 1

[1m[TMC][0m [92mThe variable "answer_2" was defined successfully! [0m

[1m[TMC][0m [92mThe variable "answer_2" contains an integer.[0m

[1m[TMC][0m [92mThe variable "cos_matrix" contains the expected values![0m

[1m[TMC][0m [92mYour answer is correct! 3 points.[0m