<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Embeddings

---


### About:  
In this notebook, you will compare two embedding methods, Word2Vec and BERT, focusing on how each model represents text as vectors.


### Learning Objective:
- Compare two embedding methods, Word2Vec and BERT, using customer reviews.

### Installs
- [gensim](https://radimrehurek.com/gensim/intro.html#installation)
- The imports below also require `transformers` (with PyTorch) and `scikit-learn`.


### Notebook Guide

- NLP Scenario
- Static Embeddings
- Contextual Embeddings
- Conclusions and Takeaways

### Imports

In [None]:
#imports 
import numpy as np
import pandas as pd

## Gensim imports: text cleaning utilities and Word2Vec
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

## Hugging Face Transformers imports for BERT embeddings (requires PyTorch)
from transformers import AutoTokenizer, AutoModel

## Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# NLP Scenario 
You are an analyst for a marketing company that just launched a new product suite of mobile devices. You have a set of customer reviews of the new product. In this notebook, your goal is to evaluate different methods for representing text as embeddings.

#####  Product Reviews 
1. "I absolutely love the TechWave X1! It has made my daily tasks so much easier and more efficient. Highly recommend it!"
2. "I'm not very impressed with the TechWave X1. It lacks some essential features and is quite slow."
3. "The TechWave X1 is fantastic! It has exceeded my expectations and has become an essential part of my daily routine."
4. "I found the TechWave X1 to be quite average. It does the job, but there's nothing particularly special about it."
5. "The TechWave X1 is terrible. It's full of glitches and crashes frequently. I regret purchasing it."
6. "The TechWave X1 is disappointing. It doesn't live up to the hype and is missing several key functionalities."


In [None]:
# data setup 
# create a list of reviews

reviews = ["I absolutely love the TechWave X1! It has made my daily tasks so much easier and more efficient. Highly recommend it!",
"I'm not very impressed with the TechWave X1. It lacks some essential features and is quite slow.",
"The TechWave X1 is fantastic! It has exceeded my expectations and has become an essential part of my daily routine.",
"I found the TechWave X1 to be quite average. It does the job, but there's nothing particularly special about it.",
"The TechWave X1 is terrible. It's full of glitches and crashes frequently. I regret purchasing it.",
"The TechWave X1 is disappointing. It doesn't live up to the hype and is missing several key functionalities."
]

# Static Embeddings 

We will use the `Gensim` implementation of `Word2Vec` on our reviews to explore static embeddings. With static embeddings, a given word always has the same vector, regardless of the words around it.


`Gensim` is a popular open-source library for natural language processing. It excels at topic modeling and similarity methods using models such as `Latent Dirichlet Allocation (LDA)` and `Word2Vec`. We will use `Word2Vec` to find the similarity across our reviews. 

### Tokenization and pre-processing
In order to use the `Word2Vec model` we first need to convert our raw text to tokens. 

We will do this by defining a function called `preprocess_text` that uses built-in preprocessing tools from `Gensim` to:
1. make all the words lowercase 
2. remove special characters 
3. remove very common words, known as stop words 

After running this function we will create a list that contains cleaned tokens for each review. These tokens will be used in our `Word2Vec` model. 

In [None]:
def preprocess_text(text):
    """
    Clean and tokenize text using gensim's preprocessing utilities
    """
    # Convert to lowercase and complete other basic cleaning 
    tokens = simple_preprocess(text)  

    # Remove stopwords
    clean_text = remove_stopwords(' '.join(tokens))
    tokens = clean_text.split()
    
    return tokens

In [None]:
# Preprocess the reviews by calling our preprocess_text function on each review and saving to a new list

processed_reviews = [preprocess_text(review) for review in reviews]

In [None]:
# Review our tokenized reviews

for review in processed_reviews:
    print(review)

Compare the tokenized reviews with the original reviews. Note that the text has been converted to lowercase; special characters and numbers have been removed; and stopwords have also been removed.
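To make the three cleaning steps concrete, here is a minimal standard-library sketch that roughly mirrors them. It is not Gensim's implementation: `toy_preprocess` and `TOY_STOPWORDS` are hypothetical names, and the stop-word list is deliberately tiny, so the output will not match `preprocess_text` exactly for every input.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# Gensim's built-in list is much longer.
TOY_STOPWORDS = {"the", "is", "it", "and", "of", "to", "my", "has", "so", "i"}

def toy_preprocess(text, min_len=2):
    """Lowercase, keep runs of letters only, then drop short tokens and stop words."""
    # Lowercase the text; digits and punctuation act as separators
    words = re.findall(r"[a-z]+", text.lower())
    # Drop very short tokens (e.g. the "x" left over from "X1") and stop words
    return [w for w in words if len(w) >= min_len and w not in TOY_STOPWORDS]

toy_tokens = toy_preprocess("The TechWave X1 is fantastic! It has exceeded my expectations.")
print(toy_tokens)
```

Note how "X1" disappears entirely: the digit is stripped first, and the remaining single letter is then too short to keep.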

#### Implementing `Word2Vec`

Text embeddings aim to preserve the meaning and context of words by assigning each word a vector, so that words with similar meanings or contexts have similar vectors.

We will call [Gensim's Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) to demonstrate one example of a static embedding. 
- Input: 
  - Tokens for each review that have been pre-processed (cleaned)
- Output:
  - A vector for each word in the vocabulary (size = 100 here) 


In [None]:
# Initialize the word2vec model on our reviews
w2v_model = Word2Vec(sentences=processed_reviews, vector_size=100, min_count=1)

In [None]:
# Get the word vector for a word ('impressed')
w2v_model.wv['impressed']

In [None]:
# and another word ('terrible')
w2v_model.wv['terrible']

In [None]:
# find the most similar words to a word
w2v_model.wv.most_similar('impressed')

In [None]:
w2v_model.wv.most_similar('terrible')

In our run, the word 'impressed' was similar to 'essential' and 'fantastic', while the word 'terrible' was similar to 'hype' and 'disappointing'. Some words were similar to both (such as 'easier')! Your exact results may differ, because training on such a small dataset is sensitive to random initialization.

This suggests that the `Word2Vec` model has picked up some of the semantic relationships between words based on the context in which they appear in the reviews. Given our very small training set, these results are impressive. With more data and more examples of usage for the model to train on, these results would typically improve. 
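The ranking returned by `most_similar` is based on cosine similarity between word vectors. The sketch below shows that calculation with the standard library only. The three-dimensional vectors and the `toy_most_similar` helper are made up for illustration; they are not output from the model above.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented for this example (not model output)
toy_vectors = {
    "impressed": [0.9, 0.1, 0.0],
    "fantastic": [0.8, 0.2, 0.1],
    "terrible": [-0.7, 0.3, 0.2],
}

def toy_most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = toy_most_similar("impressed", toy_vectors)
print(ranking)
```

Scores close to 1 mean the vectors point in nearly the same direction, and negative scores mean they point in roughly opposite directions.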

## Try it! 
### Evaluate the impact of modifying your vector size 

`Word2Vec` has multiple parameters that you can modify, and that will change the results of your model.

Below are a few key parameters you may want to edit, and their default values. 

- vector_size: int = 100
  - [Vector size docs](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#vector-size)
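- window: int = 5 
  - maximum distance between the current word and the words used as its context; values typically range from 2 to 10 
- min_count: int = 5
  - [Min Count Docs](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#min-count)
  - words that appear fewer than `min_count` times are ignored; this notebook uses `min_count=1` because the dataset is so small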
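To build intuition for `window` and `min_count`, the following standard-library sketch shows the two ideas in isolation. It is a simplification under stated assumptions: `filter_by_min_count` and `context_pairs` are hypothetical helpers, the token lists are shortened toy examples, and the window is treated as a fixed size, whereas a real implementation may vary the effective window during training.

```python
from collections import Counter

# Shortened, hypothetical token lists standing in for processed reviews
toy_corpus = [
    ["techwave", "fantastic", "exceeded", "expectations"],
    ["techwave", "terrible", "glitches"],
]

def filter_by_min_count(corpus, min_count):
    """Drop every token that occurs fewer than min_count times in the whole corpus."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return [[t for t in sentence if counts[t] >= min_count] for sentence in corpus]

def context_pairs(sentence, window):
    """List (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

kept = filter_by_min_count(toy_corpus, min_count=2)
pairs = context_pairs(toy_corpus[0], window=1)
print(kept)
print(pairs)
```

With `min_count=2` only the token that appears in both toy reviews survives, which is why a high `min_count` leaves very little to train on when the dataset is tiny.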

#### Example with default values 
```python 
Word2Vec(sentences=processed_reviews, vector_size=100, window=5, min_count=5)
``` 

Give it a try! Modify your `Word2Vec` parameters in each exercise below and observe your output.

#### Smaller vector
- Modify the `Word2Vec` model to use a smaller vector size (e.g. 10)
- Then, retrain the model on the processed reviews and find the most similar words to the word "impressed".
- What differences do you notice in the results compared to the previous model?

In [None]:
# Initialize the word2vec model on our reviews with a smaller vector size
w2v_model = Word2Vec(sentences=processed_reviews, vector_size=10, window=5, min_count=1)

# Find the most similar words to 'impressed' 
w2v_model.wv.most_similar('impressed')

#### Larger Vector and Window
- Repeat the process and modify the `Word2Vec` model to use a larger vector size (e.g. 300) and a larger window size (e.g. 10). 
- What differences do you notice in the results compared to the previous models?

In [None]:
# Initialize the Word2Vec model on our reviews with a larger vector size and a larger window
w2v_model = Word2Vec(sentences=processed_reviews, vector_size=300, window=10, min_count=1)

# Find the most similar words to 'impressed' 
w2v_model.wv.most_similar('impressed')

Notice that the `Word2Vec` model is sensitive to the vector size and other hyperparameters. These choices can affect the quality of the word embeddings and the similarity results. Results are also impacted by your training data.

Here we only have a small set of reviews, so the word embeddings may not be as accurate as they would be with a larger dataset.

### Understanding word order 

Let's examine how word order affects meaning with these two sentences:

- "The supplier agreed to pay the manufacturer"
- "The manufacturer agreed to pay the supplier"

The meaning changes depending on which company comes first. Let's first create embeddings with a static method (`Word2Vec`), then compare the results with a contextual embedding. 

## Try it!  
Use the steps above to train `Word2Vec` on the following two sentences.

In [None]:
# Create a list of sentences
sentences = ["The supplier agreed to pay the manufacturer",
             "The manufacturer agreed to pay the supplier"]


# Preprocess the sentences using the function we created earlier, preprocess_text()
processed_sentences = [preprocess_text(sentence) for sentence in sentences]

# Initialize the Word2Vec model
w2v_model = Word2Vec(sentences=processed_sentences, vector_size=100, window=5, min_count=1)

In [None]:
# Get the word vector for a word ('supplier')
w2v_model.wv['supplier']

In [None]:
# Get the word vector for a word ('manufacturer')
w2v_model.wv['manufacturer']

In [None]:
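# Stretch: calculate the cosine similarity for 'supplier' as used in each sentence.
# Word2Vec stores a single vector per word, so both lookups return the same vector
# and the similarity is approximately 1.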

cos_sim = cosine_similarity([w2v_model.wv['supplier']], [w2v_model.wv['supplier']])
print(cos_sim)

#### What did you notice? 

Each word has exactly the same vector in both sentences, so the two sentences would have identical representations in `Word2Vec` even though they have very different meanings. In this example, the order of the words is critical to the meaning of the phrase. We aren't able to capture this order with the bag-of-words approach used in `Word2Vec` and other similar models.
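The limitation can be reproduced without any model at all. In the sketch below, a static embedding is just a lookup table, and averaging the word vectors stands in for an order-insensitive sentence summary. The two-dimensional vectors, the token lists, and the `embed` and `average` helpers are all hypothetical, chosen only to illustrate the idea.

```python
import math

# Hypothetical 2-dimensional static vectors, invented for this example
static_vectors = {
    "supplier": [0.25, 0.75],
    "agreed": [0.5, 0.125],
    "pay": [1.0, 0.5],
    "manufacturer": [0.375, 0.625],
}

def embed(tokens):
    """Static lookup: each token maps to one fixed vector, whatever its neighbours are."""
    return [static_vectors[t] for t in tokens]

def average(vectors):
    """Order-insensitive sentence summary: element-wise mean of the word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Assumed tokens for the two sentences after stop-word removal
tokens_a = ["supplier", "agreed", "pay", "manufacturer"]
tokens_b = ["manufacturer", "agreed", "pay", "supplier"]

# 'supplier' gets the same vector in both sentences
same_word_vector = embed(tokens_a)[0] == embed(tokens_b)[3]

# The averaged sentence vectors are also the same, despite the different word order
avg_a = average(embed(tokens_a))
avg_b = average(embed(tokens_b))
same_average = all(math.isclose(x, y) for x, y in zip(avg_a, avg_b))

print(same_word_vector, same_average)
```

Both checks come out the same for the two sentences, which is exactly the information loss described above.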

To address this issue, methods were developed to capture the sequential order of words.

# Contextual Embeddings 

Unlike static embeddings, such as `Word2Vec`, **contextual embeddings** create different embeddings for the same word depending on the context in which it is used. To capture the context we use information such as the words around our word of interest, its sentence position, and other factors to map its vector.

**In other words, contextual embeddings are position aware.**
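One ingredient behind position awareness in models like `BERT` is a position vector that is added to each token vector before the transformer layers. The toy sketch below shows only that single ingredient, using made-up two-dimensional vectors and a hypothetical `position_aware` helper. `BERT` goes much further: its attention layers also mix in information from the surrounding tokens, which this sketch does not attempt to model.

```python
# Hypothetical 2-dimensional token vectors, invented for this example
token_vectors = {
    "supplier": [0.25, 0.75],
    "pay": [1.0, 0.5],
    "manufacturer": [0.375, 0.625],
}

# Hypothetical position vectors, one per position in the sequence
position_vectors = [
    [0.0, 0.0],
    [0.125, -0.125],
    [0.25, -0.25],
]

def position_aware(tokens):
    """Add the vector for each position to the vector for the token at that position."""
    return [
        [t + p for t, p in zip(token_vectors[token], position_vectors[i])]
        for i, token in enumerate(tokens)
    ]

order_a = position_aware(["supplier", "pay", "manufacturer"])
order_b = position_aware(["manufacturer", "pay", "supplier"])

# 'supplier' sits at position 0 in one ordering and position 2 in the other
print(order_a[0], order_b[2])
# 'pay' sits at position 1 in both orderings
print(order_a[1], order_b[1])
```

In this toy setup the vector for "supplier" changes with its position, while "pay" keeps the same vector because it stays in the same place.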

Let's look at our example from before: 
- "The supplier agreed to pay the manufacturer"
- "The manufacturer agreed to pay the supplier"

We will use `BERT`, a famous language model that uses a transformer encoder architecture, to create contextual embeddings.

We can access this model using the [Transformers library from Hugging Face](https://huggingface.co/docs/transformers/en/index).

In [None]:
# Example sentences
sentence1 = "The supplier agreed to pay the manufacturer"
sentence2 = "The manufacturer agreed to pay the supplier"

In [None]:
# Load BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert_model = AutoModel.from_pretrained('bert-base-uncased')

The first step is to create our tokens.  
_Note: **tensor** is a general term for algebraic objects that include vectors and matrices. You can [learn more about tensors here](https://en.wikipedia.org/wiki/Tensor)._

In [None]:
# Add special tokens and convert to a tensor 
input1 = tokenizer(sentence1, return_tensors="pt", padding=True, truncation=True)

# Take a look at the tokenized sentence
tokenizer.convert_ids_to_tokens(input1['input_ids'][0])

Notice that the `BERT` tokenizer added two special tokens:
- `[CLS]` is BERT's way of marking the start of the input document (a sentence in this case)
- `[SEP]` indicates a separation (such as between sentences) and the end of a document. 
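The placement of these special tokens can be sketched in a few lines of plain Python. This is only an illustration of the layout, not the tokenizer's implementation: `add_special_tokens` is a hypothetical helper, and the input lists assume each word is already a single token.

```python
def add_special_tokens(tokens_a, tokens_b=None):
    """Wrap one segment, or a pair of segments, with [CLS] and [SEP] markers."""
    wrapped = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        # A second segment is appended after the first [SEP] and closed with its own [SEP]
        wrapped += list(tokens_b) + ["[SEP]"]
    return wrapped

single = add_special_tokens(["the", "supplier", "agreed", "to", "pay", "the", "manufacturer"])
pair = add_special_tokens(["the", "supplier", "agreed"], ["to", "pay", "the", "manufacturer"])

print(single)
print(pair)
```

A single input gets one `[CLS]` at the start and one `[SEP]` at the end, while a pair of inputs gets a `[SEP]` after each segment.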

## Try it! 
Create the tokens for sentence 2.

In [None]:
# Add special tokens and convert to a tensor 
input2 = tokenizer(sentence2, return_tensors="pt", padding=True, truncation=True)

# Take a look at the tokenized sentence
tokenizer.convert_ids_to_tokens(input2['input_ids'][0])

### Setup code for BERT embeddings 

Let's define two helper functions: `get_bert_embedding`, which returns the `BERT` embedding for every token in a sentence, and `compare_word`, which compares the embeddings of one word across two sentences. For now, focus on the outputs. 

How this code block works in detail is beyond the scope of this lesson.

In [None]:
# Helper functions to get BERT embeddings and cosine similarity scores 


def get_bert_embedding(sentence):
    """
    Get BERT embeddings for a sentence
    """
    # Tokenize the sentence and convert it to PyTorch tensors
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    
    # Get model outputs
    outputs = bert_model(**inputs)
    
    # Get embeddings from last hidden state and convert to numpy
    embeddings = outputs.last_hidden_state[0].detach().numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    return embeddings, tokens

def compare_word(word, emb1, tokens1, emb2, tokens2):
    """
    Compare BERT embeddings for a specific word in two embeddings
    """
 
    # Find the word indices
    idx1 = tokens1.index(word)
    idx2 = tokens2.index(word)
    
    # Get the word vectors
    vec1 = emb1[idx1]
    vec2 = emb2[idx2]
    
    # Calculate cosine similarity
    similarity = cosine_similarity(vec1.reshape(1, -1), vec2.reshape(1, -1))[0][0]
    
    print(f"Comparing embeddings for word '{word}':")
    print(f"Similarity score: {similarity:.4f}")
    print(f"Vector in sentence 1 (first 5 values): {vec1[:5]}")
    print(f"Vector in sentence 2 (first 5 values): {vec2[:5]}")
    
    return similarity

Now we can pass each of our sentences to the function to get the embeddings for each. 

In [None]:
# Print each sentence as a reminder of what we are working with
print(sentence1)
print(sentence2)

In [None]:
# Get embeddings for both sentences
emb1, tokens1 = get_bert_embedding(sentence1)
emb2, tokens2 = get_bert_embedding(sentence2)

In [None]:
# Find the position (index) of 'supplier' in the tokens for sentence 2
tokens2.index('supplier')

In [None]:
# Compare the cosine similarity scores for supplier in each sentence
supplier_score = compare_word('supplier', emb1, tokens1, emb2, tokens2)

In [None]:
# Compare the cosine similarity scores for manufacturer in each sentence
manufacturer_score = compare_word('manufacturer', emb1, tokens1, emb2, tokens2)

Same word but different embeddings! That's because `BERT` takes into account where each word appears in the sentence (its position in the sequence) and the words around it when building the embedding. 

## Try it! 
Check out the embeddings for another word of your choice. 

In [None]:
compare_word('pay', emb1, tokens1, emb2, tokens2)

Notice that "pay" has a very high similarity across the two sentences: it sits in the same position and plays the same role in both. 

# Conclusions and Takeaways
In this notebook, we learned how to use: 
1. Word embeddings to represent text data
2. Word2Vec model to generate static word embeddings for a list of reviews
   - Word2Vec is a static embedding that has the same numerical representation for a given word regardless of context 
   - Its performance varies based on how it is tuned and the data used for training 
3. Pre-trained BERT model to generate contextual embeddings for a pair of sentences.
   - BERT is a contextual embedding that creates different embeddings for the same word depending on how it was used in a sentence. 

### Recommended readings
- [Foundational word2vec paper](https://arxiv.org/abs/1301.3781)
- [Foundational BERT paper](https://arxiv.org/pdf/1810.04805)