<div class="alert alert-block alert-info">
    <h1>Natural Language Processing</h1>
    04
    <h3>General Information:</h3>
    <p>Please do not add or delete any cells. Answers belong into the corresponding cells (below the question). If a function is given (either as a signature or a full function), you should not change the name, arguments or return value of the function.<br><br> If you encounter empty cells underneath the answer that can not be edited, please ignore them, they are for testing purposes.<br><br>When editing an assignment there can be the case that there are variables in the kernel. To make sure your assignment works, please restart the kernel and run all cells before submitting (e.g. via <i>Kernel -> Restart & Run All</i>).</p>
    <p>Code cells where you are supposed to give your answer often include the line  ```raise NotImplementedError```. This makes it easier to automatically grade answers. If you edit the cell please outcomment or delete this line.</p>
    <h3>Submission:</h3>
    <p>Please submit your notebook via the web interface (in the main view -> Assignments -> Submit). The assignments are due on <b>Monday at 15:00</b>.</p>
    <h3>Group Work:</h3>
    <p>You are allowed to work in groups of up to three people. Please enter the UID (your username here) of each member of the group into the next cell. We apply plagiarism checking, so do not submit solutions from other people except your team members. If an assignment has a copied solution, the task will be graded with 0 points for all people with the same solution.</p>
    <h3>Questions about the Assignment:</h3>
    <p>If you have questions about the assignment please post them in the LEA forum before the deadline. Don't wait until the last day to post questions.</p>
    
</div>

In [None]:
'''
Group Work:
Enter the username of each team member into the variables. 
If you work alone please leave the other variables empty.
'''
member1 = 'hvu2s'
member2 = 'anuhel2s'
member3 = 'ksheka2s'


# Word2Vec and FastText Embeddings

In this assignment we will work on Word2Vec embeddings and FastText embeddings.

I prepared three dictionaries for you:

- ```word2vec_yelp_vectors.pkl```: A dictionary with 300 dimensional word2vec embeddings trained on the Google News Corpus, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```fasttext_yelp_vectors.pkl```: A dictionary with 300 dimensional FastText embeddings trained on the English version of Wikipedia, contains only words that are present in our Yelp reviews (key is the word, value is the embedding)
- ```tfidf_yelp_vectors.pkl```: A dictionary with 400 dimensional TfIdf embeddings trained on the Yelp training dataset from last assignment (key is the word, value is the embedding)

In the next cell we load those into the dictionaries ```w2v_vectors```, ```ft_vectors``` and ```tfidf_vectors```.

© Tim Metzler, Hochschule Bonn-Rhein-Sieg

In [7]:
import pickle

with open('/srv/shares/NLP/embeddings/word2vec_yelp_vectors.pkl', 'rb') as f:
    w2v_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/embeddings/fasttext_yelp_vectors.pkl', 'rb') as f:
    ft_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/embeddings/tfidf_yelp_vectors.pkl', 'rb') as f:
    tfidf_vectors = pickle.loads(f.read())
    
with open('/srv/shares/NLP/datasets/yelp/reviews_train.pkl', 'rb') as f:
    train = pickle.load(f)
    
with open('/srv/shares/NLP/datasets/yelp/reviews_test.pkl', 'rb') as f:
    test = pickle.load(f)
    
reviews = train + test

## Creating a vector model with helper functions [30 points]

In the next cell we have the class ```VectorModel``` with the methods:

- ```vector_size```: Returns the vector size of the model
- ```embed```: Returns the embedding for a word. Returns None if there is no embedding present for the word
- ```cosine_similarity```: Calculates the cosine similarity between two vectors
- ```most_similar```: Given a word returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**. We do not want to return the word itself as the most similar one. So we only return the most similar words except for the first one.
- ```most_similar_vec```: Given a vector returns the ```top_n``` most similar words from the model, together with the similarity value, **sorted by similarity (descending)**. Here we want to keep the most similar one.

Your task is to complete these methods.

Example output:
```
model = VectorModel(w2v_vectors)

vector_good = model.embed('good')
vector_tomato = model.embed('tomato')

print(model.cosine_similarity(vector_good, vector_tomato)) # Prints: 0.05318105

print(model.most_similar('tomato')) 
'''
[('tomatoes', 0.8442263), 
 ('lettuce', 0.70699364),
 ('strawberry', 0.6888598), 
 ('strawberries', 0.68325955), 
 ('potato', 0.67841727)]
'''

print(model.most_similar_vec(vector_good)) 
'''
[('good', 1.0), 
 ('great', 0.72915095), 
 ('bad', 0.7190051), 
 ('decent', 0.6837349), 
 ('nice', 0.68360925)]
'''

```

In [8]:
from typing import List, Tuple, Dict
import numpy as np

   
class VectorModel:
    
    def __init__(self, vector_dict: Dict[str, np.ndarray]):
        # YOUR CODE HERE
        self.vector_dict = vector_dict 
        
    def embed(self, word: str) -> np.ndarray:
        # YOUR CODE HERE
        return self.vector_dict.get(word, None)
    
    def vector_size(self) -> int:
        # YOUR CODE HERE
        vector_size = len(next(iter(self.vector_dict.values())))
        return vector_size
    
    def cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        # YOUR CODE HERE
        if vec1 is None or vec2 is None:
            return 0
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        if norm1 == 0 or norm2 == 0:
            return 0
        return np.dot(vec1, vec2) / (norm1 * norm2)

    def most_similar(self, word: str, top_n: int=5) -> List[Tuple[str, float]]:
        # YOUR CODE HERE
        vec = self.embed(word)
        if vec is None:
            return []
        
        similarities = []
        for w, v in self.vector_dict.items():
            if w == word:
                continue
            sim = self.cosine_similarity(vec, v)
            similarities.append((w, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]
        
    def most_similar_vec(self, vec: np.ndarray, top_n: int=5) -> List[Tuple[str, float]]:
        # YOUR CODE HERE        
        similarities = []
        for w, v in self.vector_dict.items():
            sim = self.cosine_similarity(vec, v)
            similarities.append((w, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]
        

## Investigating similarity A) [10 points]

We now want to find the most similar words for a given input word for each model (Word2Vec, FastText and TfIdf).

Your input words are: ```['good', 'tomato', 'restaurant', 'beer', 'wonderful']```.

For each model and input word print the top three most similar words.

In [9]:
input_words = ['good', 'tomato', 'restaurant', 'beer', 'wonderful', 'dinner']

# Word2Vec
w2v_model = VectorModel(vector_dict = w2v_vectors)
for input_word in input_words:
    similar_words = w2v_model.most_similar(input_word, 3)
    print(f"Word2Vec: Top 3 most similar words to {input_word} are {similar_words}")

# FastText    
ft_model = VectorModel(vector_dict = ft_vectors)
for input_word in input_words:
    similar_words = ft_model.most_similar(input_word, 3)
    print(f"FastText: Top 3 most similar words to {input_word} are {similar_words}")
    
# TFIDF  
tfidf_model = VectorModel(vector_dict = tfidf_vectors)
for input_word in input_words:
    similar_words = tfidf_model.most_similar(input_word, 3)
    print(f"TFIDF: Top 3 most similar words to {input_word} are {similar_words}")

Word2Vec: Top 3 most similar words to good are [('great', 0.72915095), ('bad', 0.7190051), ('decent', 0.6837348)]
Word2Vec: Top 3 most similar words to tomato are [('tomatoes', 0.8442263), ('lettuce', 0.70699376), ('strawberry', 0.6888598)]
Word2Vec: Top 3 most similar words to restaurant are [('restaurants', 0.77228934), ('diner', 0.72802156), ('steakhouse', 0.72698534)]
Word2Vec: Top 3 most similar words to beer are [('beers', 0.8409688), ('drinks', 0.66893125), ('ale', 0.63828725)]
Word2Vec: Top 3 most similar words to wonderful are [('fantastic', 0.8047919), ('great', 0.76478696), ('fabulous', 0.7614761)]
Word2Vec: Top 3 most similar words to dinner are [('dinners', 0.7902064), ('brunch', 0.79005134), ('breakfast', 0.7007028)]
FastText: Top 3 most similar words to good are [('excellent', 0.7223856825801254), ('decent', 0.7202461451724537), ('bad', 0.6704173041669614)]
FastText: Top 3 most similar words to tomato are [('eggplant', 0.7518509618329048), ('spinach', 0.7422800959168396)

## Investigating similarity B) [10 points]

Comment on the output from the previous task. Let us look at the output for the word ```wonderful```. How do the models differ for this word? Can you reason why the TfIdf model shows so different results?

W2V model outputs: fantastic, great and fabulous. Meanwhile FT model gives out: lovely, fascinating and amazing. Given this results alone I feel like both performances are valid. On the other hand, TFIDF model gave a completely bad outputs. Maybe because vectors represented through TFIDF have no relationship to each other even for similar words. 

## Investigating similarity C) [10 points]

Instead of just finding the most similar word to a single word, we can also find the most similar word given a list of positive and negative words.

For this we just sum up the positive and negative words into a single vector by calculating a weighted mean. For this we multiply each positive word with a factor of $+1$ and each negative word with a factor of $-1$. Then we get the most similar words to that vector.

You are given the following examples:

```
inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'salad']
    }    
]
```

In [10]:
class AnalogyVectorModel(VectorModel):
    
    def __init__(self, vector_dict: Dict[str, np.ndarray]):
        # YOUR CODE HERE
        super().__init__(vector_dict) 
        
    
    def most_similar_analogy(self, analogy_words: dict, top_n: int=5) -> List[Tuple[str, float]]:
        # YOUR CODE HERE
        vec_sum = np.zeros(super().vector_size())
        
        for word in analogy_words["positive"]:
            vec = self.embed(word)
            if vec is not None:
                vec_sum += vec
        for word in analogy_words["negative"]:
            vec = self.embed(word)
            if vec is not None:
                vec_sum -= vec
        word_list = analogy_words["positive"] + analogy_words["negative"]
            
        
        similarities = []
        for w, v in self.vector_dict.items():
            if w in word_list:
                continue
            sim = super().cosine_similarity(vec, v)
            similarities.append((w, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]

inputs = [
    {
        'positive': ['good', 'wonderful'],
        'negative': ['bad']
    },
    {
        'positive': ['tomato', 'lettuce'],
        'negative': ['strawberry', 'fruit']
    },
    {
        'positive': ['ceasar', 'chicken'],
        'negative': []
    }    
]
# YOUR CODE HERE
w2v_analogy_model = AnalogyVectorModel(w2v_vectors)
for input_dict in inputs:
    similar_words = w2v_analogy_model.most_similar_analogy(input_dict, 3)
    print(f"Word2Vec: Top 3 most similar words to {input_dict} are {similar_words}")
    
ft_analogy_model = AnalogyVectorModel(ft_vectors)
for input_dict in inputs:
    similar_words = ft_analogy_model.most_similar_analogy(input_dict, 3)
    print(f"FastText: Top 3 most similar words to {input_dict} are {similar_words}")

Word2Vec: Top 3 most similar words to {'positive': ['good', 'wonderful'], 'negative': ['bad']} are [('terrible', 0.6828612), ('horrible', 0.67025983), ('lousy', 0.66476405)]
Word2Vec: Top 3 most similar words to {'positive': ['tomato', 'lettuce'], 'negative': ['strawberry', 'fruit']} are [('fruits', 0.77371895), ('berries', 0.6854092), ('mango', 0.6631807)]
Word2Vec: Top 3 most similar words to {'positive': ['ceasar', 'chicken'], 'negative': []} are [('meat', 0.67991304), ('pork', 0.6541998), ('turkey', 0.6282519)]
FastText: Top 3 most similar words to {'positive': ['good', 'wonderful'], 'negative': ['bad']} are [('nasty', 0.6049038343820987), ('lousy', 0.5805090777602171), ('awful', 0.574274257647221)]
FastText: Top 3 most similar words to {'positive': ['tomato', 'lettuce'], 'negative': ['strawberry', 'fruit']} are [('fruits', 0.8618268613306459), ('berries', 0.7313970644025728), ('edible', 0.6641486013579591)]
FastText: Top 3 most similar words to {'positive': ['ceasar', 'chicken'], 

## Investigating similarity D) [15 points]

We can use our model to find out which word does not match given a list of words.

For this we build the mean vector of all embeddings in the list.  
Then we calculate the cosine similarity between the mean and all those vectors.

The word that does not match is then the word with the lowest cosine similarity to the mean.

Example:

```
model = VectorModel(w2v_vectors)
doesnt_match(model, ['potato', 'tomato', 'beer']) # -> 'beer'
```

In [11]:
def doesnt_match(model, words):
    # YOUR CODE HERE
    vec_sum = np.zeros(model.vector_size())
    embed_list = []
    for word in words:
        word_embed = model.embed(word)
        embed_list.append((word, word_embed))
        vec_sum += word_embed
    vec_mean = vec_sum/len(words)
    
    similarity_scores = []
    for embeb in embed_list:
        similarity_score = model.cosine_similarity(vec_mean, embeb[1])
        similarity_scores.append((embeb[0], similarity_score))
    min_tuple = min(similarity_scores, key=lambda x: x[1])
    return min_tuple[0]

    
doesnt_match(VectorModel(w2v_vectors), ['vegetable', 'strawberry', 'tomato', 'lettuce'])

'vegetable'

## Document Embeddings A) [15 points]

Now we want to create document embeddings similar to the last assignment. For this you are given the function ```bagOfWords```. In the context of Word2Vec and FastText embeddings this is also called ```SOWE``` for sum of word embeddings.

Take the yelp reviews (```reviews```) and create a dictionary containing the document id as a key and the document embedding as a value.

Create the document embeddings from the Word2Vec, FastText and TfIdf embeddings.

Store these in the variables ```ft_doc_embeddings```, ```w2v_doc_embeddings``` and ```tfidf_doc_embeddings```

In [20]:
def bagOfWords(model, doc: List[str]) -> np.ndarray:
    '''
    Create a document embedding using the bag of words approach
    
    Args:
        model     -- The embedding model to use
        doc       -- A document as a list of tokens
        
    Returns:
        embedding -- The embedding for the document as a single vector 
    '''
    embeddings = [np.zeros(model.vector_size())]
    n_tokens = 0
    for token in doc:
        embedding = model.embed(token)
        if embedding is not None:
            n_tokens += 1
            embeddings.append(embedding)
    if n_tokens > 0:
        return sum(embeddings)/n_tokens
    return sum(embeddings)


ft_doc_embeddings = dict()
w2v_doc_embeddings = dict()
tfidf_doc_embeddings = dict()

# YOUR CODE HERE
for review in reviews:
    ft_doc_embeddings[review["id"]] = bagOfWords(ft_model, review["tokens"])
    w2v_doc_embeddings[review["id"]] = bagOfWords(w2v_model, review["tokens"])
    tfidf_doc_embeddings[review["id"]] = bagOfWords(tfidf_model, review["tokens"])

## Document Embeddings B) [10 points]

Create a vector model from each of the document embedding dictionaries. Call these ```model_w2v_doc```, ```model_ft_doc``` and ```model_tfidf_doc```.

Now find the most similar document (```top_n=1```) for document $438$ with each of these models. Use the method `most_similar`. For example `model.most_similar(438)`.

Print the text for each of the most similar reviews.

In [22]:
# First find the text for review 438
def find_doc(doc_id, reviews):
    for review in reviews:
        if review['id'] == doc_id:
            return review['text']
    
doc_id = 438

# Print it
print('Source document:')
print(find_doc(doc_id, reviews))

# Create the models
model_w2v_doc = None
model_ft_doc = None
model_tfidf_doc = None

model_w2v_doc = VectorModel(vector_dict = w2v_doc_embeddings)
similar_doc = model_w2v_doc.most_similar(find_doc(doc_id, reviews), 1)
similar_doc

Source document:
Absolutely ridiculously amazing! Chicken Tikka masala was perfect. Best I've ever had!


[]