<a href="https://colab.research.google.com/github/jpatra85/ColabTF_EDU/blob/master/M5_AST_02_Word_Embeddings_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Programme in AI and MLOps
## A Program by IISc and TalentSprint
### Assignment 2: Word2vec, GloVe Word Embeddings

## Learning Objectives

At the end of the experiment, you will be able to:

* understand and perform text pre-processing
* train a Word2Vec model and save it in a file
* load the saved model to get the vector representation of words
* measure and plot the similarity between the words
* use the pre-trained GloVe Embeddings to plot the similarity between the words

## Word Embedding

To work with textual data, we first need to convert it into numbers before feeding it into any machine learning model. For simplicity, words can be compared to categorical variables. We use one-hot encoding to convert categorical features into numbers: we create a dummy feature for each category and populate it with 0s and 1s.

Similarly, if we use one-hot encoding on words in textual data, we will have a dummy feature for each word, which means 10,000 features for a vocabulary of 10,000 words. This is not a feasible embedding approach: it demands large storage space for the word vectors, reduces model efficiency, and captures no relation between words.
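As a rough illustration of the point above, the following sketch one-hot encodes a toy vocabulary in plain Python. The two example reviews and the helper name `one_hot` are assumptions made for illustration; they are not part of the dataset used in this notebook.

```python
# Toy corpus (illustrative only, not taken from the IMDB data)
sentences = [["great", "movie"], ["boring", "movie"]]

# Build the vocabulary in order of first appearance
vocab = []
for sentence in sentences:
    for word in sentence:
        if word not in vocab:
            vocab.append(word)

word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


print(vocab)
for word in vocab:
    print(word, "->", one_hot(word))
```

Each vector is as long as the vocabulary and contains exactly one non-zero entry, so a 10,000-word vocabulary would need 10,000 entries per word, almost all of them zeros.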

Some of the most popular techniques to learn word embeddings include:
- Word2Vec
- GloVe

## Dataset Description

The IMDB movie review dataset can be downloaded from [here](http://ai.stanford.edu/~amaas/data/sentiment/). This dataset for binary sentiment classification contains around 50k movie reviews with the following attributes:

* **review:** text-based review of each movie
* **sentiment:** positive or negative sentiment value


### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M5_AST_02_Word_Embeddings_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/IMDB_Dataset.csv")
    ipython.magic("sx wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/glove.6B.zip")
    ipython.magic("sx unzip glove.6B.zip")
    ipython.magic("sx wget https://cdn.talentsprint.com/talentsprint1/archives/sc/aiml/experiment_related_data/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    ipython.magic("sx unrar e /content/AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.rar")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



**NOTE THAT THE ABOVE CELL MIGHT TAKE SOME TIME TO RUN, AS IT DOWNLOADS THE NECESSARY DATA FILES!**

### Importing required packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords     # to get collection of stopwords
from nltk.tokenize import word_tokenize
import string
import gensim    # provides the Word2Vec implementation
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

from tensorflow.keras.preprocessing.text import Tokenizer           # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences      # to do padding or truncating
from tensorflow.keras.models import Sequential        # the model (imported from tensorflow.keras for consistency)
import pprint       # pprint is a standard-library module for pretty-printing data structures
from sklearn.metrics.pairwise import cosine_similarity

### Load the Dataset

In [None]:
movie_reviews = pd.read_csv("IMDB_Dataset.csv")

# Check for null values
movie_reviews.isnull().values.any()

In [None]:
print(movie_reviews.shape)

In [None]:
# Print the first five rows from the data
movie_reviews.head()

In [None]:
# Unique values for sentiment
movie_reviews.sentiment.unique()

In [None]:
# Count for each sentiment
movie_reviews.sentiment.value_counts()

In [None]:
# Visualize the positive and negative sentiments
movie_reviews.sentiment.value_counts().plot.bar(ylim=0);

In [None]:
# Let us view one of the reviews
movie_reviews["review"][5]

### Data pre-processing

For the text data in the `review` column, we will perform the following pre-processing steps:
- remove HTML tags
- convert the text to lower case
- remove non-alphabetic tokens (punctuation and numbers)
- remove stop words
- drop words of two or fewer characters



In [None]:
# Data Preprocessing

def preprocess_text(sen):

    sen = re.sub('<.*?>', ' ', sen)                        # remove html tags
    tokens = word_tokenize(sen)                            # tokenize words
    tokens = [w.lower() for w in tokens]                   # convert to lower case
    table = str.maketrans('', '', string.punctuation)      # strip punctuation characters
    stripped = [w.translate(table) for w in tokens]

    words = [word for word in stripped if word.isalpha()]  # keep alphabetic tokens only
    stop_words = set(stopwords.words('english'))

    words = [w for w in words if w not in stop_words]      # remove stop words
    words = [w for w in words if len(w) > 2]               # drop words of two or fewer characters

    return words

In [None]:
# Store the preprocessed reviews (a pandas Series of token lists)
review_lines = movie_reviews['review'].apply(preprocess_text)

In [None]:
# Check for the length of the preprocessed text
len(review_lines)

In [None]:
# Print the preprocessed text for the second review (index 1)
print(review_lines[1])

In [None]:
len(review_lines[1])

In [None]:
# Now let’s convert the sentiment from string to a binary form of 1 and 0,
# where 1 is for ‘positive’ sentiment and 0 for ‘negative’.
y = movie_reviews['sentiment'].apply(lambda x: 1 if x=="positive" else 0)

y[0:5]

## Word2Vec

Word2Vec is one of the most popular techniques to learn word embeddings. It can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc. A word embedding is a learned representation for text in which words with similar meanings have similar representations.

#### Why do we need them?

Consider the following similar sentences: **Have a good day** and **Have a great day**. They have nearly the same meaning. If we construct a vocabulary (let’s call it V), it would be V = **{Have, a, good, great, day}**.

Now, let us create a one-hot encoded vector for each of these words in V. Length of our one-hot encoded vector would be equal to the size of V (=5). We would have a vector of zeros except for the element at the index representing the corresponding word in the vocabulary. That particular element would be one. The encodings below would explain this better.

Have = [1,0,0,0,0] ; a = [0,1,0,0,0] ; good = [0,0,1,0,0] ; great = [0,0,0,1,0] ; day = [0,0,0,0,1]

If we try to visualize these encodings, we can think of a 5-dimensional space in which each word occupies one dimension and has nothing to do with the rest (no projection along the other dimensions). This means ‘good’ and ‘great’ are as different as ‘day’ and ‘have’, which is not true.

Our objective is to have words with similar context occupy close spatial positions. Mathematically, the **cosine** of the angle between such vectors should be close to 1, i.e. the angle should be close to 0. The higher the cosine similarity, the closer the words are.

**Cosine Similarity**

$\text{sim}(A, B) = \cos(\theta) = \frac{\bar{A} \cdot \bar{B}}{|\bar{A}|\,|\bar{B}|}$
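The formula can be checked by hand with a few lines of plain Python. This is only a sketch: the helper name `cosine_similarity_1d` is chosen here to avoid clashing with scikit-learn's `cosine_similarity`, and the dense vectors are made-up numbers rather than learned embeddings.

```python
import math


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# One-hot vectors for 'good' and 'great' from the vocabulary V above
good_one_hot = [0, 0, 1, 0, 0]
great_one_hot = [0, 0, 0, 1, 0]

# Hypothetical 3-dimensional dense embeddings (made-up numbers, for illustration)
good_dense = [0.9, 0.1, 0.3]
great_dense = [0.8, 0.2, 0.4]

print("one-hot:", cosine_similarity_1d(good_one_hot, great_one_hot))
print("dense:  ", cosine_similarity_1d(good_dense, great_dense))
```

The one-hot vectors are orthogonal, so their similarity is 0 regardless of meaning, whereas dense vectors pointing in nearly the same direction give a value close to 1.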


<br><br>
<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/Word_Embedding.png" width="650" height="450">
</center>

The **Word2vec** model has two algorithms:

1. **Continuous bag of words (CBOW):**

    CBOW predicts the target word from the surrounding context words. **Eg: Context words:** "The cat sits on the ..",  **Target word:** "mat"

2. **Skip-gram:**

    Skip-gram predicts the surrounding context words from the target word. **Eg: Target word:** "sat",  **Context words:** "The cat ... on the mat"
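The difference between the two algorithms is easiest to see in the training examples each one is fed. The sketch below only builds those examples from a tokenized sentence; the function names and the window size of 2 are assumptions for illustration, and the actual word2vec training (a shallow network with details such as negative sampling) is not shown.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, as CBOW sees them."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs, as skip-gram sees them."""
    pairs = []
    for context, target in cbow_pairs(tokens, window):
        for context_word in context:
            pairs.append((target, context_word))
    return pairs


tokens = ["the", "cat", "sat", "on", "the", "mat"]

for context, target in cbow_pairs(tokens)[:3]:
    print("CBOW      ", context, "->", target)

for target, context_word in skipgram_pairs(tokens)[:4]:
    print("skip-gram ", target, "->", context_word)
```

CBOW gets one example per position (many context words in, one word out), while skip-gram gets one example per (target, context word) combination.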

**Note:** For more details of word2vec model refer to the following [link](https://medium.com/@zafaralibagh6/a-simple-word2vec-tutorial-61e64e38a6a1)




### Train word2vec model to obtain word embeddings

We will use Gensim to implement Word2Vec. **Gensim** is an open-source Python library for natural language processing, developed and maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies. Here, the first step is to prepare the text corpus for learning the embedding by creating word tokens, removing punctuation, removing stop words, etc. The word2vec algorithm processes documents sentence by sentence.

The dataset is already preprocessed; `review_lines` contains the text corpus.

In [None]:
EMBEDDING_DIM = 100
# Train word2vec model after preprocessing the reviews
model = gensim.models.Word2Vec(sentences = review_lines,
                               max_vocab_size=100000,
                               window=1,
                               vector_size=EMBEDDING_DIM,
                               workers=4,
                               min_count=1,
                               sg = 0)

Parameters for Word2Vec:

- **sentences:** An iterable of tokenized sentences (each a list of words); here we pass the preprocessed reviews.

- **max_vocab_size:** Limits RAM use during vocabulary building; if there are more unique words than this, the infrequent ones are pruned.

- **vector_size:** The number of dimensions of each word vector. Here it is set to `EMBEDDING_DIM`, so every word is represented by a vector of size 100.

- **min_count:** Words with a total frequency lower than `min_count` are ignored. Usually, the bigger and more extensive your text, the higher this number can be.

- **window:** Only terms that occur within a window-neighborhood of a term, in a sentence, are associated with it during training. The usual value is 4 or 5; this notebook uses 1.

- **workers:** Number of threads used in training parallelization, to speed up training.

- **sg:** {0, 1} Training algorithm: 1 for skip-gram; otherwise CBOW.
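To make the effect of `min_count` concrete, here is a small sketch that mimics only the frequency filter on a toy corpus. The corpus and the helper name `vocabulary` are made up for illustration, and Gensim's actual vocabulary building does more than this (for example, pruning under `max_vocab_size`).

```python
from collections import Counter

# Toy tokenized corpus (illustrative only)
corpus = [
    ["good", "movie", "good", "acting"],
    ["bad", "movie", "poor", "acting"],
    ["good", "story"],
]

# Count how often each word occurs across all sentences
counts = Counter(word for sentence in corpus for word in sentence)


def vocabulary(min_count):
    """Words kept when those occurring fewer than min_count times are dropped."""
    return sorted(word for word, n in counts.items() if n >= min_count)


print("min_count=1:", vocabulary(1))
print("min_count=2:", vocabulary(2))
print("min_count=3:", vocabulary(3))
```

With `min_count=1`, as used in this notebook, every word that appears at least once stays in the vocabulary, including rare words and typos.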


To know more about the parameters of `gensim.models.Word2Vec`, click [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

### Test Word2Vec Model

Let us inspect some of the word embeddings the model learnt from the movie review dataset.

The words most similar to 'good' are:





In [None]:
model.wv.most_similar('good')

The process of creating word embeddings by training a Word2Vec model has been discussed so far. This model can be saved to be used later.

In [None]:
# Save model
filename = "imdb_embedding_word2vec.txt"
model.wv.save_word2vec_format(filename, binary=False)
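The file written with `binary=False` is plain text: a header line holding the vocabulary size and the vector size, followed by one line per word with its vector components. In practice, the saved vectors can be read back with `gensim.models.KeyedVectors.load_word2vec_format(filename, binary=False)`. The sketch below instead parses the layout with plain Python, only to show what the file contains; the tiny sample file and the function name `read_word2vec_text` are made up for illustration.

```python
import os
import tempfile


def read_word2vec_text(path):
    """Parse a word2vec text file into a {word: [floats]} dictionary."""
    vectors = {}
    with open(path, "r", encoding="utf-8") as f:
        # Header line: "<vocabulary size> <vector size>"
        vocab_size, vector_size = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], [float(v) for v in parts[1:]]
            if len(values) != vector_size:
                raise ValueError("unexpected vector length for " + word)
            vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError("unexpected number of words")
    return vectors


# Write a tiny sample file in the same layout (made-up numbers)
sample = "2 3\ngood 0.1 0.2 0.3\nbad -0.1 0.0 0.5\n"
path = os.path.join(tempfile.mkdtemp(), "tiny_word2vec.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

loaded = read_word2vec_text(path)
print(loaded["good"])
```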

In the next part, pre-trained word embeddings will be used to get an intuitive plot.

### Use Pre-trained Embedding

**The Google pre-trained word2vec model**

Google has published a pre-trained word2vec model, trained on part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. For more information about the word2vec model published by Google, see the link [here](https://code.google.com/archive/p/word2vec/).

Load the pre-trained word embedding saved in file `AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin`.

In [None]:
# Load Google news 300 vectors file
model_plot = gensim.models.KeyedVectors.load_word2vec_format('AIML_DS_GOOGLENEWS-VECTORS-NEGATIVE-300_STD.bin', binary=True, limit=500000)

In [None]:
# List of these words is specifically chosen to get the intuitive plot
words = ['king', 'queen', 'river', 'water', 'ocean', 'tree', 'leaf', 'happy', 'glad', 'mother', 'daughter']

In [None]:
# Creating a PrettyPrinter() object
pp = pprint.PrettyPrinter()

# Vector representation of a specific word
print("Size of the vector is", len(model_plot["king"]))
pp.pprint(model_plot["king"])

In [None]:
# Vector representation of each word using Word2Vec
word2vec = []

for word in words:
    try:
        word2vec.append(model_plot[word])
    except KeyError:
        # Skip words that are missing from the pre-trained vocabulary
        pass
print("There are %d words and the vector size of each word is %d" %(len(word2vec),len(word2vec[0])))

### Measure the similarity between the words using cosine_similarity


In [None]:
w2v_similarity = []

for i, word_1 in enumerate(words):
    w2v_row_wise_similarity = []
    for j, word_2 in enumerate(words):
        # Get the vectors of the two words using Word2Vec
        vec_1, vec_2 = model_plot[word_1], model_plot[word_2]

        # cosine_similarity expects 2D arrays, so reshape each 1D vector to shape (1, n)
        vec_1, vec_2 = np.array(vec_1).reshape(1,-1), np.array(vec_2).reshape(1,-1)

        # Measure the cosine similarity between two vectors
        similarity = cosine_similarity(vec_1,vec_2)
        w2v_row_wise_similarity.append(np.array(similarity).item())

    # Store the cosine similarity values in a list
    w2v_similarity.append(w2v_row_wise_similarity)

pd.DataFrame(w2v_similarity, columns = words, index = words)

### Visualize similarity using heatmap

In [None]:
sns.heatmap(pd.DataFrame(w2v_similarity, columns = words, index = words))

The higher the cosine similarity, the closer the words are.

### Visualize the words in 2D-plane by reducing the dimensions using PCA

In [None]:
# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
# n_components in PCA specifies the number of dimensions to keep
pca = PCA(n_components=2)

# Fit and transform the vectors using PCA model
reduced_w2v = pca.fit_transform(word2vec)
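For intuition about what `PCA(n_components=2)` computes, the following NumPy sketch centers the vectors and projects them onto the two directions of greatest variance, obtained from a singular value decomposition. It is a simplified stand-in rather than scikit-learn's implementation: the function name `pca_2d` and the random input are assumptions, and the resulting coordinates may differ in sign from scikit-learn's output.

```python
import numpy as np


def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T


# Random stand-in for word vectors: 11 "words", 300 dimensions
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(11, 300))

reduced = pca_2d(toy_vectors)
print(reduced.shape)
```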

In [None]:
plt.figure(figsize=(8,5))
plt.scatter(reduced_w2v[:,0],reduced_w2v[:,1], s = 12, color = 'red')
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
# Use xs/ys so that the sentiment labels stored in `y` are not overwritten
xs, ys = reduced_w2v[:,0] , reduced_w2v[:,1]
for i in range(len(xs)):
    plt.annotate(words[i],xy=(xs[i], ys[i]),xytext=(xs[i]+0.05,ys[i]+0.05))

From the above plot, it can be seen that 'tree' and 'leaf' are closely related, as are 'water', 'river' and 'ocean', and so on.

## GloVe

GloVe stands for “Global Vectors” for word representation. It was developed at Stanford for generating word embeddings. GloVe captures both the global and the local statistics of a corpus in order to come up with word vectors.
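The "global statistics" GloVe relies on are word-word co-occurrence counts gathered over the whole corpus. The sketch below builds such counts for a toy corpus using a simplified, unweighted window; the function name, the window size and the sentences are assumptions for illustration, and the step where GloVe fits word vectors to these counts is not shown.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (illustrative only)
sentences = [
    ["ice", "is", "cold", "water"],
    ["steam", "is", "hot", "water"],
]

counts = cooccurrence_counts(sentences)
print(counts[("is", "water")])   # pair seen in both sentences
print(counts[("ice", "cold")])   # pair seen in one sentence
print(counts[("ice", "hot")])    # pair never seen together
```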


### Using the pre-trained GloVe model

In [None]:
GloVe_Dict = {}
# Load the 50-dimensional GloVe vectors (explicit encoding avoids platform-dependent decoding errors)
with open("glove.6B.50d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        GloVe_Dict[word] = vector

In [None]:
# Length of the word vocabulary
print(len(GloVe_Dict))

In [None]:
# Vector representation of a specific word
print("Size of the vector is", len(GloVe_Dict["king"]))
pp.pprint(GloVe_Dict["king"])

In [None]:
# Vector representation of each word using GloVe
vectors = []
for word in words:
    try:
        vector = GloVe_Dict[word]
        vectors.append(vector)
    except KeyError:
        # Skip words that are missing from the GloVe vocabulary
        pass
print("There are %d words and the vector size of each word is %d" %((len(vectors),len(vectors[0]))))

### Measure the similarity between the words using cosine_similarity


In [None]:
word_similarity = []
for i, word_1 in enumerate(words):
    row_wise_similarity = []
    for j, word_2 in enumerate(words):
        # Get the vectors of the two words using GloVe
        vec_1, vec_2 = GloVe_Dict[word_1], GloVe_Dict[word_2]

        # cosine_similarity expects 2D arrays, so reshape each 1D vector to shape (1, n)
        vec_1, vec_2 = np.array(vec_1).reshape(1,-1), np.array(vec_2).reshape(1,-1)

        # Measure the cosine similarity between the vectors.
        similarity = cosine_similarity(vec_1, vec_2)
        row_wise_similarity.append(np.array(similarity).item())

    # Store the cosine similarity values in a list
    word_similarity.append(row_wise_similarity)

# Create a DataFrame to view the similarity between words
pd.DataFrame(word_similarity, columns=words, index=words)

### Visualize similarity using heatmap

In [None]:
sns.heatmap(pd.DataFrame(word_similarity, columns=words, index=words))

GloVe derives the semantic relationships between words. The higher the cosine similarity, the closer the words are.

### Visualize the words in 2D-plane by reducing the dimensions using PCA

In [None]:
# Create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class
# n_components in PCA specifies the number of dimensions to keep
pca = PCA(n_components=2)

# Fit and transform the vectors using PCA model
reduced_vectors = pca.fit_transform(vectors)

In [None]:
plt.figure(figsize=(7,5))
plt.scatter(reduced_vectors[:,0],reduced_vectors[:,1], s = 12, color = 'red')
plt.xlim([-3.5,4.5])
plt.ylim([-3.5,3.5])
# Use xs/ys so that the sentiment labels stored in `y` are not overwritten
xs, ys = reduced_vectors[:,0] , reduced_vectors[:,1]
for i in range(len(xs)):
    plt.annotate(words[i],xy=(xs[i], ys[i]),xytext=(xs[i]+0.05,ys[i]+0.05))

### Please answer the questions below to complete the experiment:




In [None]:
#@title Which technique is used to address the issue of rare words in word2vec and GloVe embeddings? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["", "Subword tokenization", "Reducing the context window size", "Applying dimensionality reduction techniques"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}


In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]


In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")