<a href="https://colab.research.google.com/github/josenomberto/UTEC-CDIAV3-MISTI/blob/main/day4_content_challenge_word_embeddings_exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content Challenge: Word Embeddings

Today, we'll learn about word embeddings, visualize them, train a simple embedding model, and see how embeddings help with language tasks. Let's start!


In [None]:
# Import Packages

# Pre-trained Word Embedding Models
import gensim.downloader as api

# Word Embedding Functions
from gensim.models import Word2Vec

# Dimensionality Reduction Methods
from sklearn.decomposition import PCA

# General Packages
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

## EXERCISE: Visualizing Pre-trained Word Embeddings

Here, we'll use the `gensim` library to load a small set of pre-trained embeddings (e.g., Word2Vec).

Tasks:
1. Review the available pre-trained models from gensim using `print(api.info()['models'].keys())`, and select one of them (e.g., `'fasttext-wiki-news-subwords-300'`). You may want to compare the models before choosing. *Each model has a different vocabulary, so keep that in mind.*
2. Load the selected pre-trained model.
3. Select a few words from the vocabulary of the pre-trained model (e.g. `["king", "queen", "man", "woman", "apple", "fruit"]`), extract their vectorized representation and then print them.
4. Using the dimensionality reduction technique **principal component analysis** (PCA), reduce the number of dimensions of the vectorized representations so they can be plotted (i.e., reduce the dimensions to 2 or 3).
5. Plot the dimensionality reduced vector representations.


### TASK 1: Select a pre-trained model

In [None]:
# TASK 1 EXERCISE

# List all pre-trained models available
print(api.info()['models'].keys())

dict_keys(['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis'])


### TASK 2: Load the pre-trained model

In [None]:
# TASK 2 EXERCISE

# Load your selected pre-trained embeddings
w2v_model = api.load(<REMOVE ME AND MY ARROWS>)  # This loads the selected model

### TASK 3: Extract the vectorized representations of sample words

In [None]:
# TASK 3 EXERCISE

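# Inspect the vocabulary (it can be very large, so print only its size and a sample)
words = list(w2v_model.index_to_key)
print(len(words), words[:20])

# Select a few words, extract their vectors, and print them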

''' ADD YOUR CODE HERE '''

### TASK 4: Dimensionality Reduction

In [None]:
# TASK 4 EXERCISE

# Reduce dimensions to 2D for visualization

''' ADD YOUR CODE HERE '''
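
As a rough illustration of what this reduction step does, the sketch below runs PCA by hand with NumPy: it centers the vectors and projects them onto their first two principal directions. The vectors are random stand-ins rather than real embeddings, and the names `sample_words`, `word_vectors`, and `pca_2d` are hypothetical; with a loaded model you would build the array from the model's own vectors instead.

```python
import numpy as np

# Stand-in data: 6 random 50-dimensional vectors, NOT real embeddings.
# With a loaded model, this array would hold one embedding per selected word.
sample_words = ["king", "queen", "man", "woman", "apple", "fruit"]
rng = np.random.default_rng(seed=0)
word_vectors = rng.normal(size=(len(sample_words), 50))

def pca_2d(vectors):
    """Project the rows of `vectors` onto their first two principal components."""
    # PCA works on mean-centered data
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

reduced_vectors = pca_2d(word_vectors)
print(reduced_vectors.shape)  # (6, 2)
```

`PCA(n_components=2).fit_transform(word_vectors)` from `sklearn.decomposition` should give the same coordinates, possibly with the sign of a component flipped.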

### TASK 5: Visualize the Vector Representations

In [None]:
# TASK 5 EXERCISE

# Plot the Dimensionality Reduced Vector Representations
plt.figure(figsize=(5, 4))

''' ADD YOUR CODE HERE '''

plt.title("2D Visualization of Word Embeddings")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
sns.despine()
plt.show()

## EXERCISE: Train a Simple Word2Vec Model


Tasks:
1. Build a small text corpus: write it yourself, take it from an online source, or generate it with a generative AI tool. The corpus should be a `list` of sentence `strings`.
2. Tokenize each sentence into individual words.
3. Train a word embedding using `Word2Vec` from `gensim`, or another similar package.
4. Extract the word embeddings for a few example words from your trained model.
5. Visualize the embeddings from your newly trained model.
6. Repeat tasks 1-5 while adjusting different parameters to see how they affect the trained embedding model. Examples of what you can adjust (if you are using the gensim package) include the `corpus` itself and the hyperparameters `vector_size`, `window`, and `min_count`. Remember, if you increase `vector_size` beyond what you can plot directly (2 or 3 dimensions), you will need dimensionality reduction to visualize the embedded words.


### TASK 1: Define your corpus

In [None]:
# TASK 1 EXERCISE

corpus = <REPLACE ME AND MY ARROWS>

### TASK 2: Tokenize the Corpus

In [None]:
# TASK 2 EXERCISE

# Tokenize the corpus

''' ADD YOUR CODE HERE '''
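
One possible way to tokenize is sketched below, assuming a simple lowercase-and-split-on-punctuation rule is good enough for a small corpus. The two-sentence `example_corpus` and the `tokenize` helper are hypothetical and only illustrate the expected shape of the result: a list of token lists, which is the form gensim's `Word2Vec` accepts for its `sentences` argument.

```python
import re

# Hypothetical two-sentence corpus, used only for illustration
example_corpus = [
    "The king rules the kingdom.",
    "The queen rules beside the king!",
]

def tokenize(sentence):
    """Lowercase a sentence and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

# One list of tokens per sentence
tokenized_corpus = [tokenize(sentence) for sentence in example_corpus]
print(tokenized_corpus)
```

gensim also ships `gensim.utils.simple_preprocess`, which may be a convenient alternative to a hand-written tokenizer.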

### TASK 3: Train your own Word Embedding Model

In [None]:
# TASK 3 EXERCISE

# Initialize and train the Word2Vec model
my_w2v_model = Word2Vec(
    sentences = <REPLACE ME AND MY ARROWS>,   # Input corpus
    vector_size = <REPLACE ME AND MY ARROWS>, # Dimensionality of word embeddings
    window = <REPLACE ME AND MY ARROWS>,      # Context window size
    min_count = <REPLACE ME AND MY ARROWS>,   # Ignore words with frequency lower than this
    workers = 4,                              # Use 4 CPU threads
    sg = <REPLACE ME AND MY ARROWS>           # Skip-gram (1) or CBOW (0)
)

### TASK 4: Extract example word embeddings

In [None]:
# TASK 4 EXERCISE

word_embeddings = np.array([])

''' ADD YOUR CODE HERE '''

### TASK 5: Visualize your Embeddings

In [None]:
# TASK 5 EXERCISE

# Plot the Dimensionality Reduced Vector Representations
plt.figure(figsize=(5, 4))

''' ADD YOUR CODE HERE '''

plt.title("2D Visualization of Word Embeddings")
plt.xlabel("Embedding Component 1")
plt.ylabel("Embedding Component 2")
sns.despine()
plt.show()

## EXERCISE: Determine the Similarity of Sentences

Here, you will explore the concept of capturing the similarity of words and groups of words in vector space.

Tasks:
1. Find the words in your pre-trained model `w2v_model` that are most similar to an example word (e.g., "king") from the model's vocabulary, using pre-built functions from `gensim` or a package of your choice.
2. Calculate the similarity between two words in the vocabulary of your pre-trained model `w2v_model` (e.g., "king" and "queen" or "apple" and "fruit") using pre-built functions from `gensim` or a package of your choice.
3. Develop a method to calculate the similarity between two sentences (e.g., "king and queen" with "man and woman"). Consider how you can create a summary vector of a sentence. Once you have a summary vector for each sentence, you can use the function `cosine` from `scipy.spatial.distance`. Note that `cosine` returns a distance, so the similarity is `1 - cosine(u, v)`.


### TASK 1: Find similar words to a given word

In [None]:
# TASK 1 EXERCISE
word_1 = <REPLACE ME AND MY ARROWS>
word_2 = <REPLACE ME AND MY ARROWS>

similar_words_1 = <REPLACE ME AND MY ARROWS>
similar_words_2 = <REPLACE ME AND MY ARROWS>

print("Words similar to " + word_1 + ":", similar_words_1)
print("Words similar to " + word_2 + ":", similar_words_2)

### TASK 2: Find the similarity between two words

In [None]:
# TASK 2 EXERCISE
word_to_compare_1 = <REPLACE ME AND MY ARROWS>
word_to_compare_2 = <REPLACE ME AND MY ARROWS>

word_pair_similarity = <REPLACE ME AND MY ARROWS>

print("Similarity between " + word_to_compare_1 + " and " + word_to_compare_2 + ":", word_pair_similarity)
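
To make the idea behind both tasks concrete, the sketch below computes cosine similarity by hand and uses it to rank neighbours. The 3-dimensional `toy_vectors` are invented for illustration and do not come from any trained model, and the helper names are hypothetical. With a real model, gensim's `KeyedVectors` objects expose `most_similar` and `similarity` methods that serve the same purpose.

```python
import numpy as np

# Hand-made 3-dimensional toy vectors, NOT taken from any trained model
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
    "fruit": np.array([0.2, 0.1, 0.8]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king", toy_vectors))
print(cosine_similarity(toy_vectors["apple"], toy_vectors["fruit"]))
```

In this toy setup "queen" ranks first for "king" because the two vectors point in nearly the same direction, which is the property real embeddings are trained to approximate.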

### TASK 3: Find the similarity between two sentences

In [None]:
# TASK 3 EXERCISE

def sentence_vector(sentence, model):
    '''Create a function which, given a word embedding model, and a sentence,
       produces a vector representation of the sentence.
    '''

    pass

# Define two example sentences
sentence_1 = <REPLACE ME AND MY ARROWS>
sentence_2 = <REPLACE ME AND MY ARROWS>

# Calculate cosine similarity between the sentence embeddings
from scipy.spatial.distance import cosine

vector1 = sentence_vector(sentence_1, w2v_model)
vector2 = sentence_vector(sentence_2, w2v_model)
similarity = <REPLACE ME AND MY ARROWS>

print(f"Similarity between '{sentence_1}' and '{sentence_2}':", similarity)
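
One common way to summarize a sentence, sketched here under the assumption that word order can be ignored, is to average the vectors of its in-vocabulary words. A plain dictionary of invented vectors (`toy_model`) stands in for the embedding model, and `mean_sentence_vector` is a hypothetical helper, not the only valid design. It relies only on `word in model` and `model[word]`, which gensim's `KeyedVectors` also supports.

```python
import numpy as np

# Dictionary stand-in for an embedding model; the vectors are invented
toy_model = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "man":   np.array([0.7, 0.6, 0.2]),
    "woman": np.array([0.6, 0.7, 0.2]),
    "and":   np.array([0.3, 0.3, 0.3]),
    "apple": np.array([0.1, 0.2, 0.9]),
    "fruit": np.array([0.2, 0.1, 0.8]),
}

def mean_sentence_vector(sentence, model):
    """Average the vectors of the words in `sentence` that the model knows."""
    tokens = [token for token in sentence.lower().split() if token in model]
    if not tokens:
        raise ValueError("no in-vocabulary words in sentence")
    return np.mean([model[token] for token in tokens], axis=0)

def cosine_similarity(u, v):
    """Equivalent to 1 - scipy.spatial.distance.cosine(u, v)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

royal = mean_sentence_vector("king and queen", toy_model)
people = mean_sentence_vector("man and woman", toy_model)
food = mean_sentence_vector("apple and fruit", toy_model)

print(cosine_similarity(royal, people))
print(cosine_similarity(royal, food))
```

Averaging treats every word equally, so frequent filler words such as "and" pull all sentence vectors toward each other; dropping or down-weighting such words is a variation worth trying.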