# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Introduction to Data Science

## Lab 9 extras: Embeddings 

**Harvard University**<br/>
**Spring 2022**<br/>
**Instructors**: Mark Glickman & Pavlos Protopapas<br/>
**Authors**: Shivas Jayaram
<br/>

## Learning Objectives

By the end of this lab, you should understand how to:
* Use **pretrained embeddings** in a model
* Review **Word2Vec** pretrained embeddings
* Review **GloVe** pretrained embeddings



## **Setup Notebook**

**Imports**

In [1]:
# Imports
import time
import numpy as np
import pandas as pd
from gensim import models

# Tensorflow
import tensorflow as tf
from tensorflow.keras import backend as K

In [2]:
input_text = """Advanced Topics in Data Science (CS109b) is the second half of a one-year introduction to data science. 
Building upon the material in Introduction to Data Science, the course introduces advanced methods for data wrangling, 
data visualization, statistical modeling, and prediction. Topics include big data, multiple deep learning architectures 
such as CNNs, RNNs, transformers, language models, autoencoders, and generative models as well as basic 
Bayesian methods, and unsupervised learning.
"""

print("Input Text:",input_text)

Input Text: Advanced Topics in Data Science (CS109b) is the second half of a one-year introduction to data science. 
Building upon the material in Introduction to Data Science, the course introduces advanced methods for data wrangling, 
data visualization, statistical modeling, and prediction. Topics include big data, multiple deep learning architectures 
such as CNNs, RNNs, transformers, language models, autoencoders, and generative models as well as basic 
Bayesian methods, and unsupervised learning.



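Tokenize the text using `tf.keras.preprocessing.text.Tokenizer`. With its default settings it lowercases the text, strips punctuation, and assigns each word an integer index starting at 1, with the most frequent word first.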

In [3]:
# Initialize the Tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer()

# Fit on text to generate token index and vocabulary
tokenizer.fit_on_texts([input_text])

# Tokenize
tokens = tokenizer.texts_to_sequences([input_text])

print(tokenizer.word_counts)
word_index = tokenizer.word_index
index_word = tokenizer.index_word
print("word_index",word_index)
vocabulary_size = len(word_index.keys())
print("Vocabulary Size:",vocabulary_size)

OrderedDict([('advanced', 2), ('topics', 2), ('in', 2), ('data', 6), ('science', 3), ('cs109b', 1), ('is', 1), ('the', 3), ('second', 1), ('half', 1), ('of', 1), ('a', 1), ('one', 1), ('year', 1), ('introduction', 2), ('to', 2), ('building', 1), ('upon', 1), ('material', 1), ('course', 1), ('introduces', 1), ('methods', 2), ('for', 1), ('wrangling', 1), ('visualization', 1), ('statistical', 1), ('modeling', 1), ('and', 3), ('prediction', 1), ('include', 1), ('big', 1), ('multiple', 1), ('deep', 1), ('learning', 2), ('architectures', 1), ('such', 1), ('as', 3), ('cnns', 1), ('rnns', 1), ('transformers', 1), ('language', 1), ('models', 2), ('autoencoders', 1), ('generative', 1), ('well', 1), ('basic', 1), ('bayesian', 1), ('unsupervised', 1)])
word_index {'data': 1, 'science': 2, 'the': 3, 'and': 4, 'as': 5, 'advanced': 6, 'topics': 7, 'in': 8, 'introduction': 9, 'to': 10, 'methods': 11, 'learning': 12, 'models': 13, 'cs109b': 14, 'is': 15, 'second': 16, 'half': 17, 'of': 18, 'a': 19, 'o
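To make the index assignment concrete, here is a minimal pure-Python sketch that approximates what the Tokenizer does with its default settings: lowercase the text, replace punctuation with spaces, split on whitespace, and rank words by frequency starting at index 1. The helper names `build_word_index` and `to_sequence` are hypothetical, and the filter string is an assumption modeled on the Keras default; this is not the library's implementation.

```python
from collections import Counter

# Assumption: mirrors the default `filters` argument of the Keras Tokenizer
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({ch: " " for ch in FILTERS})


def _split(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_TABLE).split()


def build_word_index(texts):
    """Map each word to a 1-based rank; more frequent words get smaller indices."""
    counts = Counter()
    for text in texts:
        counts.update(_split(text))
    # most_common() keeps first-seen order for words with equal counts
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index):
    """Convert text to a list of indices, skipping words missing from the index."""
    return [word_index[w] for w in _split(text) if w in word_index]


sample = "Data science is fun. Data wins, science helps!"
sample_index = build_word_index([sample])
print(sample_index)
print(to_sequence("data helps (science)", sample_index))
```

Index 0 is never assigned to a word, which is why the embedding matrices later in this lab have `vocabulary_size+1` rows.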

## **Word2Vec Embedding**

### **Download pretrained Embedding**

In [6]:
start_time = time.time()
# Download the pretrained Word2Vec vectors (trained on Google News)
word2vec_path = tf.keras.utils.get_file(
    origin="https://github.com/dlops-io/datasets/releases/download/v1.0/GoogleNews-vectors-negative300.bin.gz",
    extract=False)
print("word2vec_path:",word2vec_path)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

word2vec_path: /home/u_61397728/.keras/datasets/GoogleNews-vectors-negative300.bin.gz
Download execution time (mins) 0.00025364160537719724


### **Load pretrained Embedding**

We need to load the pretrained embedding before we can use it in our model.

In [8]:
# Load word2vec
word2vec = models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

print("Number of word vectors:",word2vec.vectors.shape)

Number of word vectors: (3000000, 300)


In [9]:
# View some word embeddings
sample_embeddings_words = ["news", "data", "hurricane", "political"]
for word in sample_embeddings_words:
  print(word,":",word2vec[word][:5],", Shape:", word2vec[word].shape)

news : [-0.13867188  0.04370117 -0.13085938 -0.16796875 -0.06054688] , Shape: (300,)
data : [-0.17285156 -0.14257812  0.04370117 -0.03344727 -0.07861328] , Shape: (300,)
hurricane : [ 0.14453125 -0.11083984 -0.3671875   0.10449219  0.06396484] , Shape: (300,)
political : [-0.02868652  0.02929688 -0.0625      0.35351562 -0.11181641] , Shape: (300,)
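Individual vector components are hard to interpret on their own, so pretrained vectors are usually compared with cosine similarity. The sketch below computes it on made-up 3-dimensional vectors (the values are illustrative and are not taken from Word2Vec). With the loaded model, gensim's `word2vec.similarity(w1, w2)` computes the same quantity for two words.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only; real Word2Vec vectors have 300 dimensions
toy_vectors = {
    "news": [0.9, 0.1, 0.0],
    "political": [0.8, 0.2, 0.1],
    "hurricane": [0.0, 0.3, 0.9],
}

sim_close = cosine_similarity(toy_vectors["news"], toy_vectors["political"])
sim_far = cosine_similarity(toy_vectors["news"], toy_vectors["hurricane"])
print("news vs political:", sim_close)
print("news vs hurricane:", sim_far)
```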


In [13]:
word2vec

<gensim.models.keyedvectors.KeyedVectors at 0x7f7fc434c580>

In [14]:
# Prepare embedding matrix
embedding_dim = 300

# Keep only the embeddings of the words in our vocabulary to feed into our model in the next step.
# Row 0 stays unused because Tokenizer indices start at 1, hence vocabulary_size+1 rows.
embedding_matrix = np.zeros((vocabulary_size+1, embedding_dim))
n_oov = 0  # Number of out-of-vocabulary (OOV) words
for word, i in word_index.items():
  if word in word2vec:
    embedding_matrix[i] = word2vec[word]
  else:
    # Words without a pretrained vector keep an all-zero row
    n_oov += 1

text_coverage = (vocabulary_size-n_oov)/vocabulary_size
print("Text Coverage:",text_coverage)

print("Embedding Matrix, Shape" ,embedding_matrix.shape)

Text Coverage: 0.8333333333333334
Embedding Matrix, Shape (49, 300)


In [19]:
len(word_index)

48

In [17]:
vocabulary_size

48

In [16]:
embedding_matrix.shape

(49, 300)

So about 83% of our vocabulary (40 of 48 words) has pretrained Word2Vec embeddings. The out-of-vocabulary (OOV) words keep all-zero vectors.
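The coverage figure is the fraction of vocabulary words that have a pretrained vector. A small sketch of that calculation, which also records which words are missing (the loop above only counts them), might look like this. The helper name `coverage_report` and the toy vocabulary are hypothetical; any object that supports `in`, such as the `word2vec` KeyedVectors or a plain dict, can be passed as `pretrained`.

```python
def coverage_report(word_index, pretrained):
    """Return (coverage fraction, list of OOV words) for a vocabulary."""
    oov_words = [word for word in word_index if word not in pretrained]
    covered = len(word_index) - len(oov_words)
    return covered / len(word_index), oov_words


# Toy data for illustration; the vectors are made up
toy_word_index = {"data": 1, "science": 2, "cs109b": 3, "learning": 4}
toy_pretrained = {
    "data": [0.1, 0.2],
    "science": [0.3, 0.4],
    "learning": [0.5, 0.6],
}

coverage, oov_words = coverage_report(toy_word_index, toy_pretrained)
print("Text Coverage:", coverage)
print("OOV words:", oov_words)
```

In the run above, the reported coverage of 0.8333 corresponds to 40 of the 48 vocabulary words being found, that is, 8 OOV words.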

### **Build Model using the pretrained Embedding**

In [1]:
# Model input
model_input = tf.keras.layers.Input(shape=(1,))

# Embedding Layer, with pre-trained weights
embedding = tf.keras.layers.Embedding(input_dim=embedding_matrix.shape[0], 
                            output_dim=embedding_dim, 
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), # Load pre-trained weights
                            name="embedding")(model_input)
# Create model
model = tf.keras.Model(inputs=model_input, outputs=embedding)

print(model.summary())


In [None]:
word = "data"
print("Embedding Value for word:",word,model([word_index[word]])[:50])

In [2]:
word = "learning"
print("Embedding Value for word:",word,model([word_index[word]])[:50])


In [3]:
word = "cs109b"
print("Embedding Value for word:",word,model([word_index[word]])[:50])


## **GloVe Embedding**

### **Download pretrained Embedding**

* GloVe (Global Vectors for Word Representation) is an open-source project developed at Stanford. It provides pre-trained word vectors in several sizes, including 50, 100, 200, and 300 dimensions.

* We choose the 100d version.

[Reference](http://nlp.stanford.edu/projects/glove/)

In [None]:
start_time = time.time()
# Download the pretrained GloVe 100d vectors
glove_path = tf.keras.utils.get_file(
    origin="https://github.com/shivasj/dataset-store/releases/download/v3.0/glove.6B.100d.txt.zip",
    extract=True)
glove_path = glove_path.replace(".zip","")
print("glove_path:",glove_path)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

glove_path: /root/.keras/datasets/glove.6B.100d.txt
Download execution time (mins) 0.05553990205128988


### **Load pretrained Embedding**



We need to load the pretrained embedding before we can use it in our model.

In [None]:
# Build a dictionary with word and its vectors
embeddings_index = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Number of word vectors:",len(embeddings_index))

Number of word vectors: 400000
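Each line of the GloVe text file holds a token followed by its space-separated vector components, which is why the loop splits once on whitespace and parses the remainder as floats. The sketch below runs the same parsing on two made-up lines held in memory (the numbers are illustrative and cut down to 3 dimensions); the helper name `parse_glove_lines` is hypothetical.

```python
import io


def parse_glove_lines(lines):
    """Build a {token: vector} dict from lines shaped like 'token v1 v2 ...'."""
    vectors = {}
    for line in lines:
        # Split off the token; everything after it is the vector
        word, coefs = line.split(maxsplit=1)
        vectors[word] = [float(x) for x in coefs.split()]
    return vectors


# In-memory stand-in for the real file; punctuation marks are tokens too
fake_file = io.StringIO("the 0.1 -0.2 0.3\n, 0.4 0.5 -0.6\n")
toy_glove = parse_glove_lines(fake_file)
print(toy_glove)
```

Plain Python lists keep the sketch dependency-free; the notebook cell stores each vector as a NumPy array instead, which is what the embedding matrix expects.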


In [None]:
# View some word embeddings
sample_embeddings_words = list(embeddings_index.keys())[:10]
for word in sample_embeddings_words:
  print(word,":",embeddings_index[word][:5],", Shape:", embeddings_index[word].shape)

the : [-0.038194 -0.24487   0.72812  -0.39961   0.083172] , Shape: (100,)
, : [-0.10767  0.11053  0.59812 -0.54361  0.67396] , Shape: (100,)
. : [-0.33979  0.20941  0.46348 -0.64792 -0.38377] , Shape: (100,)
of : [-0.1529  -0.24279  0.89837  0.16996  0.53516] , Shape: (100,)
to : [-0.1897    0.050024  0.19084  -0.049184 -0.089737] , Shape: (100,)
and : [-0.071953  0.23127   0.023731 -0.50638   0.33923 ] , Shape: (100,)
in : [ 0.085703 -0.22201   0.16569   0.13373   0.38239 ] , Shape: (100,)
a : [-0.27086   0.044006 -0.02026  -0.17395   0.6444  ] , Shape: (100,)
" : [-0.30457 -0.23645  0.17576 -0.72854 -0.28343] , Shape: (100,)
's : [ 0.58854 -0.2025   0.73479 -0.68338 -0.19675] , Shape: (100,)


In [None]:
# Prepare embedding matrix
embedding_dim = 100

# Keep only the embeddings of the words in our vocabulary to feed into our model in the next step.
# Row 0 stays unused because Tokenizer indices start at 1, hence vocabulary_size+1 rows.
embedding_matrix = np.zeros((vocabulary_size+1, embedding_dim))
n_oov = 0  # Number of out-of-vocabulary (OOV) words
for word, i in word_index.items():
  embedding_vector = embeddings_index.get(word)
  if embedding_vector is not None:
    embedding_matrix[i] = embedding_vector
  else:
    # Words without a pretrained vector keep an all-zero row
    n_oov += 1

text_coverage = (vocabulary_size-n_oov)/vocabulary_size
print("Text Coverage:",text_coverage)

print("Embedding Matrix, Shape" ,embedding_matrix.shape)

Text Coverage: 0.9166666666666666
Embedding Matrix, Shape (49, 100)


So about 92% of our vocabulary (44 of 48 words) has pretrained GloVe embeddings. The out-of-vocabulary (OOV) words keep all-zero vectors.

### **Build Model using the pretrained Embedding**

In [None]:
# Model input
model_input = tf.keras.layers.Input(shape=(1,))

# Embedding Layer, with pre-trained weights
embedding = tf.keras.layers.Embedding(input_dim=embedding_matrix.shape[0], 
                            output_dim=embedding_dim, 
                            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), # Load pre-trained weights
                            name="embedding")(model_input)
# Create model
model = tf.keras.Model(inputs=model_input, outputs=embedding)

print(model.summary())

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 1)]               0         
                                                                 
 embedding (Embedding)       (None, 1, 100)            4900      
                                                                 
Total params: 4,900
Trainable params: 4,900
Non-trainable params: 0
_________________________________________________________________
None
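Conceptually, an embedding layer initialized this way is a lookup table: the output for token index `i` is row `i` of the embedding matrix, and the parameter count is rows times columns (49 × 100 = 4,900 in the summary above). The NumPy sketch below illustrates that idea on a toy matrix. It mimics the lookup only and is not how Keras implements the layer; all names and values are illustrative.

```python
import numpy as np

toy_vocab_size, toy_dim = 3, 4

# One extra row because token indices start at 1; row 0 is never used
toy_matrix = np.zeros((toy_vocab_size + 1, toy_dim))
toy_matrix[1] = [0.1, 0.2, 0.3, 0.4]   # word with a pretrained vector
toy_matrix[2] = [0.5, 0.6, 0.7, 0.8]   # word with a pretrained vector
# Index 3 plays the role of an OOV word and keeps its all-zero row

token_ids = np.array([2, 3])
looked_up = toy_matrix[token_ids]      # row lookup, one row per token id

print("Lookup shape:", looked_up.shape)
print("Parameter count:", toy_matrix.size)
```

Note that the summary reports these weights as trainable, so they would be updated during training. Passing `trainable=False` to the `Embedding` layer keeps the pretrained vectors fixed.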


In [None]:
word = "data"
print("Embedding Value for word:",word,model([word_index[word]]))

Embedding Value for word: data tf.Tensor(
[-0.47099    0.61577    0.68969   -0.18149    0.30778   -0.8415
 -0.41873   -0.20013    0.28184   -0.34005    0.77286   -0.22774
  0.059854  -0.24141    0.87783    0.72043    0.64295    0.36245
  0.41621    0.13001   -0.47074   -0.44664    0.47363    0.40755
 -1.0341    -1.1422     0.37436    0.24631   -0.67291    0.49177
  0.46506    0.13608   -0.93796    0.51887    0.51549   -0.26506
 -0.14551    0.22517    0.35244   -0.79648   -0.42247   -0.90587
 -0.83998    0.45365   -0.72494   -0.12592    0.43661   -0.53661
  0.020523  -0.74609    1.1925     0.15719    0.29318    0.92661
  0.48236   -1.829     -0.012697  -0.37029    2.3618     0.33587
 -0.1544     0.14657   -0.11307   -0.02493    0.31933    0.28815
 -0.2963    -0.33032    1.4774     0.23739   -0.25313    0.61367
  0.56811   -0.56991    0.48798    0.065367   0.28258   -0.13537
 -1.1096    -0.35971    0.85313    0.463     -1.1223     0.0071569
 -1.7636    -0.44547    1.2478    -0.37541   -0

In [None]:
word = "learning"
print("Embedding Value for word:",word,model([word_index[word]]))

Embedding Value for word: learning tf.Tensor(
[ 0.64812    0.69878   -0.39947    0.77634   -0.13132    0.2024
 -0.33399   -0.0066588  0.061684   0.1885    -0.10559   -0.31316
 -0.082495  -0.080517   0.3858    -0.10302    0.049431   0.17216
 -0.59079    0.77068   -1.2768    -0.25187    0.2195    -0.20176
 -0.30581   -0.18518    0.010889  -0.07529   -0.34732    0.61998
 -0.99703    1.0516    -0.42071   -0.39635    0.32607   -0.40061
 -0.46462    0.69904    0.29567   -0.35309   -0.59074    0.28999
 -0.25732   -0.1317    -0.69798    0.49818    0.41503    0.1487
  0.083347  -0.43543   -0.093969  -0.3543     0.014998   0.63593
  0.54564   -1.8439     0.78842   -0.19836    1.5707     0.25988
  0.20875    0.7521    -0.085488  -0.70717    0.094104   0.44485
  0.087818  -0.34779    0.57148    0.18662   -0.29435    0.42928
  0.28392   -0.61614   -0.34108    0.58192   -0.16388   -0.0081997
 -0.27162   -0.27112   -0.21471    0.37376   -0.5352    -0.060945
 -1.6317     0.85144    0.056035  -0.53861 

In [None]:
word = "cs109b"
print("Embedding Value for word:",word,model([word_index[word]]))

Embedding Value for word: cs109b tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0.], shape=(100,), dtype=float32)
