<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> AC295: Advanced Practical Data Science </h1>

## Transfer Learning for Text & Word Embeddings

**Harvard University, Fall 2020**  
**Instructors**: Pavlos Protopapas  

---

**Each assignment is graded out of 5 points.  The topic for this assignment is Transfer Learning for Text.**

**Due:** 10/27/2020 10:15 AM EDT

**Submit:** We won't be re-running your notebooks, so please ensure all output is visible in the notebook.

#### Learning Objectives

In this exercise you will cover the following topics:  
- Tokenize text using Text Vectorization
- Perform text classification & create word embeddings
- Load pre-trained word embeddings and perform text classification
- Understand Word Embeddings

---

#### Imports

In [None]:
import os
import requests
import zipfile
import tarfile
import shutil
import json
import time
import sys
import string
import re
import numpy as np
import pandas as pd
from glob import glob
from string import Template
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import layers
from tensorflow.keras import activations
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.python.ops import io_ops
from tensorflow.keras.utils import to_categorical

from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity

#### Verify Setup

In [None]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.experimental.get_visible_devices()
print("Devices:", devices)
print(tf.config.experimental.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.experimental.AUTOTUNE

tensorflow version 2.3.0
keras version 2.4.0
Eager Execution Enabled: True
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
Number of replicas: 1
Devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[LogicalDevice(name='/device:GPU:0', device_type='GPU')]
GPU Available:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
All Physical Devices [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'), PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


#### Utils

In [None]:
def download_file(packet_url, base_path="", extract=False):
  # Create the target directory (and any parents) if it does not exist yet
  if base_path != "":
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  packet_path = os.path.join(base_path, packet_file)
  # Stream the download to disk in chunks to keep memory usage low
  with requests.get(packet_url, stream=True) as r:
    r.raise_for_status()
    with open(packet_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  if extract:
    # Unpack .zip and .tar.gz archives into base_path
    if packet_file.endswith(".zip"):
      with zipfile.ZipFile(packet_path) as zfile:
        zfile.extractall(base_path)
    elif packet_file.endswith(".tar.gz"):
      with tarfile.open(packet_path) as tfile:
        tfile.extractall(base_path)

def evaluate_model(model,test_data, training_results):
    
  # Get the model train history
  model_train_history = training_results.history
  # Get the number of epochs the training was run for
  num_epochs = len(model_train_history["loss"])

  # Plot training results
  fig = plt.figure(figsize=(15,5))
  axs = fig.add_subplot(1,2,1)
  axs.set_title('Loss')
  # Plot all metrics
  for metric in ["loss","val_loss"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()
  
  axs = fig.add_subplot(1,2,2)
  axs.set_title('Accuracy')
  # Plot all metrics
  for metric in ["accuracy","val_accuracy"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()

  plt.show()
  
  # Evaluate on test data
  evaluation_results = model.evaluate(test_data)
  print("Evaluation Results:", evaluation_results)

## Dataset

#### Download

In [None]:
start_time = time.time()
download_file("https://storage.googleapis.com/dataset_store/ac295/news300.zip", base_path="datasets", extract=True)
download_file("https://github.com/shivasj/dataset-store/releases/download/v3.0/glove.6B.100d.txt.zip", base_path="embedding", extract=True)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

Download execution time (mins) 0.13868790467580158


#### Explore

In [None]:
data_dir = os.path.join("datasets","news300")
label_names = os.listdir(data_dir)

# Number of unique labels
num_classes = len(label_names) 
# Create label index for easy lookup
label2index = dict((name, index) for index, name in enumerate(label_names))
index2label = dict((index, name) for index, name in enumerate(label_names))

print("Number of classes:", num_classes)
print("Labels:", label_names)

# Generate a list of labels and path to text
data_x = []
data_y = []

for label in label_names:
  text_files = os.listdir(os.path.join(data_dir,label))
  data_x.extend([os.path.join(data_dir,label,f) for f in text_files])
  data_y.extend([label for f in text_files])

# Preview
print("data_x count:",len(data_x))
print("data_y count:",len(data_y))
print(data_x[:5])
print(data_y[:5])

# sns.countplot()
# plt.show()
np.unique(data_y, return_counts=True)

Number of classes: 3
Labels: ['politics', 'health', 'entertainment']
data_x count: 920
data_y count: 920
['datasets/news300/politics/216.txt', 'datasets/news300/politics/105.txt', 'datasets/news300/politics/83.txt', 'datasets/news300/politics/278.txt', 'datasets/news300/politics/256.txt']
['politics', 'politics', 'politics', 'politics', 'politics']


(array(['entertainment', 'health', 'politics'], dtype='<U13'),
 array([310, 310, 300]))

#### View Text

In [None]:
# Generate a random sample of indices (high is exclusive, so every file can be drawn)
data_samples = np.random.randint(0, high=len(data_x), size=10)
for i,data_idx in enumerate(data_samples):
  # Read text
  txt = io_ops.read_file(data_x[data_idx])

  print("Label:",data_y[data_idx],", Text:",txt.numpy())

Label: entertainment , Text: b'The Wanted singer Tom Parker reveals he has inoperable brain tumor London (CNN)Tom Parker, a British singer who spent five years as part of the popular boy band The Wanted, has revealed he has an inoperable brain tumor. The 32-year-old told fans he has a grade four glioblastoma tumor and is undergoing treatment. Parker, who announced earlier this year that he and his wife are expecting their second child, wrote on Instagram: "We are all absolutely devastated but we are gonna fight this all the way.  "We don\'t want your sadness, we just want love and positivity and together we will raise awareness of this terrible disease and look for all available treatment options." In an interview with Britain\'s OK! Magazine, Parker and his wife, Kelsey Hardwick, said his doctors described the tumor as a "worst-case scenario" and informed the couple that it was terminal. The average survival time for patients with a grade four glioblastoma, one of the most serious kin

#### Build Data Pipelines

##### Text Vectorization

[Reference](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization)

Generate Text Vector:
- Standardize each sample (usually lowercasing + punctuation stripping)
- Split each sample into substrings (usually words)
- Recombine substrings into tokens (usually ngrams)
- Index tokens (associate a unique int value with each token)
- Transform each sample using this index, either into a vector of ints or a dense float vector
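
The steps above can be sketched in plain Python to make the output format concrete. This is an illustration only, not how the `TextVectorization` layer is implemented. The helper names (`standardize`, `build_vocabulary`, `vectorize`) and the toy corpus are assumptions made for this example. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, as in the Keras layer's `int` output mode.

```python
import re
import string
from collections import Counter

PAD, UNK = "", "[UNK]"  # index 0 = padding, index 1 = out-of-vocabulary

def standardize(text):
    # Lowercase, then strip punctuation
    text = text.lower()
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

def build_vocabulary(corpus, max_tokens):
    # Count whitespace-separated tokens across the whole corpus
    counts = Counter(tok for doc in corpus for tok in standardize(doc).split())
    # Sort by descending frequency, breaking ties alphabetically for determinism
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    # Two slots are taken by the padding and unknown tokens
    return [PAD, UNK] + ranked[: max_tokens - 2]

def vectorize(text, vocabulary, sequence_length):
    # Map tokens to indices, then truncate or right-pad to a fixed length
    index = {tok: i for i, tok in enumerate(vocabulary)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["The cat sat.", "The dog sat!", "A cat ran."]
vocab = build_vocabulary(corpus, max_tokens=6)
vector = vectorize("The bird sat", vocab, sequence_length=5)
print(vocab)   # ['', '[UNK]', 'cat', 'sat', 'the', 'a']
print(vector)  # [4, 1, 3, 0, 0]
```

Here `bird` is not in the vocabulary, so it maps to index 1, and the two trailing zeros are padding up to the fixed sequence length.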

In [None]:
# Text Vectorization
def standardize_text(input_text):
  # Convert to lowercase
  lowercase = tf.strings.lower(input_text)
  # Replace HTML line-break tags with spaces
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  return tf.strings.regex_replace(
      stripped_html, "[%s]" % re.escape(string.punctuation), ""
  )

# Load Text
def load_text(path, label=None):
  text = io_ops.read_file(path)
  if label is None:
    return text
  else:
    return text, label

# Feature constraints
max_features = 15000
sequence_length = 1000

# Initialize Text Vectorizer
text_vectorizer = TextVectorization(
    standardize=standardize_text,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Create the vocabulary of entire dataset
text_data = tf.data.Dataset.from_tensor_slices(data_x)
text_data = text_data.map(load_text, num_parallel_calls=AUTOTUNE)

# Generate Text Vector
start_time = time.time()
text_vectorizer.adapt(text_data.batch(64))
execution_time = (time.time() - start_time)/60.0
print("Execution time (mins)",execution_time)

# Get Vocabulary
vocabulary = text_vectorizer.get_vocabulary()
vocabulary_size = len(vocabulary)
print("Vocabulary Size:",vocabulary_size)
# Generate word index
word_index = dict(zip(vocabulary, range(vocabulary_size)))

Execution time (mins) 0.01306746006011963
Vocabulary Size: 15000


In [None]:
# Check Vocabulary : 0 is reserved for padding and index 1 is reserved for "out of vocabulary" tokens
print("Vocabulary:",vocabulary[:20])
print("Vocabulary Size:",len(vocabulary))

# Test our text vectorizer
test_text = io_ops.read_file(data_x[data_samples[0]])
print(test_text)
test_text_vector = text_vectorizer([test_text.numpy()])
print("Shape:",test_text_vector.shape)
print(test_text_vector[0,:20])

Vocabulary: ['', '[UNK]', 'the', 'to', 'and', 'of', 'a', 'in', 'that', 'for', 'is', 'on', 'said', 'with', 'as', 'it', 'have', 'are', 'be', 'at']
Vocabulary Size: 15000
tf.Tensor(b'The Wanted singer Tom Parker reveals he has inoperable brain tumor London (CNN)Tom Parker, a British singer who spent five years as part of the popular boy band The Wanted, has revealed he has an inoperable brain tumor. The 32-year-old told fans he has a grade four glioblastoma tumor and is undergoing treatment. Parker, who announced earlier this year that he and his wife are expecting their second child, wrote on Instagram: "We are all absolutely devastated but we are gonna fight this all the way.  "We don\'t want your sadness, we just want love and positivity and together we will raise awareness of this terrible disease and look for all available treatment options." In an interview with Britain\'s OK! Magazine, Parker and his wife, Kelsey Hardwick, said his doctors described the tumor as a "worst-case scena

##### Split Data

In [None]:
validation_percent = 0.20

# Split data into train / validate
train_x, validate_x, train_y, validate_y = train_test_split(data_x, data_y, test_size=validation_percent)

print("train_x count:",len(train_x))
print("validate_x count:",len(validate_x))

train_x count: 736
validate_x count: 184


##### Create TF Dataset

In [None]:
batch_size = 64

# Convert all y labels to numbers
train_processed_y = [label2index[label] for label in train_y]
validate_processed_y = [label2index[label] for label in validate_y]

# Convert y to a binary class matrix (one-hot encoded)
train_processed_y = to_categorical(train_processed_y, num_classes=num_classes, dtype='float32')
validate_processed_y = to_categorical(validate_processed_y, num_classes=num_classes, dtype='float32')

# Vectorize Text
def vectorize_text(text, label=None):
  text = tf.expand_dims(text, -1)
  text = text_vectorizer(text)
  if label is None:
    return text
  else:
    return text, label

# Create TF Dataset
train_data = tf.data.Dataset.from_tensor_slices((train_x, train_processed_y))
validation_data = tf.data.Dataset.from_tensor_slices((validate_x, validate_processed_y))

#############
# Train data
#############
# Apply all data processing logic
train_data = train_data.map(load_text, num_parallel_calls=AUTOTUNE)
train_data = train_data.batch(batch_size)
train_data = train_data.map(vectorize_text, num_parallel_calls=AUTOTUNE)
train_data = train_data.cache().prefetch(buffer_size=100)

##################
# Validation data
##################
# Apply all data processing logic
validation_data = validation_data.map(load_text, num_parallel_calls=AUTOTUNE)
validation_data = validation_data.batch(batch_size)
validation_data = validation_data.map(vectorize_text, num_parallel_calls=AUTOTUNE)
validation_data = validation_data.cache().prefetch(buffer_size=100)

print("train_data",train_data)
print("validation_data",validation_data)
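
To make the label encoding concrete, here is a plain-Python sketch of the label-to-index and one-hot steps used above. The `one_hot` helper is hypothetical and stands in for `to_categorical`; the label order is taken from the output printed in the Explore section.

```python
# Label order as printed in the Explore section of this notebook
label_names = ["politics", "health", "entertainment"]
label2index = {name: index for index, name in enumerate(label_names)}

def one_hot(index, num_classes):
    # A vector of zeros with a single 1.0 at the class position
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

labels = ["health", "politics", "entertainment"]
encoded = [one_hot(label2index[label], len(label_names)) for label in labels]
print(encoded)  # [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

One-hot targets of this shape are what the `categorical_crossentropy` loss expects.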

## Questions:

**All data preparation steps are complete; you can proceed to building models in the questions below.**

Note on dataset input sizes:
```
# Feature constraints
max_features = 15000
sequence_length = 1000
```

## Question 1 : Build a text classification model using FFNN (0.5 Point)

#### a) Build a Text Classification Model

- Build the model using a few `Dense` layers
- Input size is `1000`
- Do **NOT** use the `Embedding` layer in your model
- Use `categorical_crossentropy` loss
- Ensure there is a plot of your training history

#### b) How is the performance of your model?

*Your answer here*

## Question 2 : Build a text classification model with embeddings (1.0 Point)

#### a) Preliminary Questions 

- Explain the purpose of the `Embedding` layer.
- Name two reasons why using one-hot encoded vectors instead is a poor choice.
- Explain what the inputs and outputs of the `Embedding` layer are. Also comment on the dimensions going in and coming out.


*Your answer here*

#### b) Build a Text Classification Model

- You need to include the `Embedding` layer in your model
- Following the `Embedding` layer you can use `Conv1D` or `LSTM`
- Use loss of `categorical_crossentropy`
- Ensure there is a plot of your training history



#### c) Save Embedding layer weights

- Save the weights of your embedding layer
- The weights will be used in Question 4
- Feel free to use the code below to extract layer weights

```
# Get the Embedding Layer
embedding_layer_no_tl = model.get_layer(name="embedding")
embedding_layer_no_tl_weights = embedding_layer_no_tl.get_weights()[0]
print(embedding_layer_no_tl_weights.shape)
```

---

## Question 3 : Build a text classification model with pre-trained embeddings (2.5 Points)

#### a) Preliminary Questions 


- How does using pre-trained word embeddings add knowledge to the model? (Use an example if needed)

- Provide a scenario where retraining the pre-trained embedding layer may be needed

*Your answer here*

#### b) Build a Text Classification Model using pre-trained word embeddings

- Build a Text Classification Model
- You need to include the `Embedding` layer in your model
- The `Embedding` layer should have its weights loaded from any **pre-trained word embeddings** such as Glove, Word2Vec, FastText etc.
- [Example](https://medium.com/@ppasumarthi_69210/word-embeddings-in-keras-be6bb3092831) of how to load pre-trained word embeddings for Word2Vec
- [Example](https://keras.io/examples/nlp/pretrained_word_embeddings/) of how to load pre-trained word embeddings for Glove
- [Example](https://www.kaggle.com/vsmolyakov/keras-cnn-with-fasttext-embeddings) of how to load pre-trained word embeddings for FastText
- Set `trainable=False` on the `Embedding` layer
- Following the `Embedding` layer you can use `Conv1D` or `LSTM`
- Use loss of `categorical_crossentropy`
- Ensure there is a plot of your training history
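
As a rough guide to the loading step, the sketch below parses GloVe-style text (one token per line, followed by its space-separated vector components) and aligns the vectors with a vocabulary. It uses plain Python lists and a tiny in-memory stand-in for the file; the function names and toy values are assumptions for illustration, and tokens without a pre-trained vector are left as all-zero rows, which is one common choice rather than the only one.

```python
import io

def load_embedding_index(lines):
    # Each line: a token followed by its space-separated vector components
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

def build_embedding_matrix(vocabulary, embedding_index, dim):
    # Row i holds the vector for vocabulary[i]; unknown tokens stay all-zero
    matrix = [[0.0] * dim for _ in vocabulary]
    hits = 0
    for i, token in enumerate(vocabulary):
        vector = embedding_index.get(token)
        if vector is not None and len(vector) == dim:
            matrix[i] = vector
            hits += 1
    return matrix, hits

# Toy stand-in for a GloVe text file (the file downloaded above has 100 dimensions)
fake_file = io.StringIO("the 0.1 0.2 0.3\nelection 0.4 0.5 0.6\n")
embedding_index = load_embedding_index(fake_file)

toy_vocabulary = ["", "[UNK]", "the", "election", "covid19"]
matrix, hits = build_embedding_matrix(toy_vocabulary, embedding_index, dim=3)
print(hits)       # 2
print(matrix[3])  # [0.4, 0.5, 0.6]
print(matrix[4])  # [0.0, 0.0, 0.0]
```

With the real data you would pass the notebook's `vocabulary` and an open handle to the extracted GloVe file, then hand the resulting matrix to the `Embedding` layer as its initial weights, for example through a constant initializer as the linked Keras example does.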

#### c) Save Embedding layer weights

- Save the weights of your embedding layer
- The weights will be used in Question 4
- Feel free to use the code below to extract layer weights

```
# Get the Embedding Layer
embedding_layer_tl = model.get_layer(name="embedding")

embedding_layer_tl_weights = embedding_layer_tl.get_weights()[0]
print(embedding_layer_tl_weights.shape)
```

---

## Question 4 : Analysing Word Embeddings (1.0 Point)

Feel free to use these functions for this question:

```
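def find_similar(words, word_index, vocabulary, embedding_layer_weights, topn=5):
  # Look up the vocabulary index of each query word
  subset_word_index = []
  for word in words:
    subset_word_index.append(word_index[word])
  
  # Cosine similarity of each query word against every word in the vocabulary
  cs_op = cosine_similarity(embedding_layer_weights[subset_word_index], embedding_layer_weights)
  for idx, word in enumerate(words):
    print("Similar words for:", word)
    # Rank all tokens by similarity (highest first) and skip the query word itself
    ranked = cs_op[idx].argsort()[::-1]
    neighbors = [t for t in ranked if t != subset_word_index[idx]][:topn-1]
    for t in neighbors:
      print("    ", vocabulary[t])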

def find_analogy(word_a, word_b, word_c, word_index, vocabulary, embedding_layer_weights, topn=5):
  idx_a = word_index[word_a]
  idx_b = word_index[word_b]
  idx_c = word_index[word_c]

  # Vectors
  vec_a = embedding_layer_weights[idx_a]
  vec_b = embedding_layer_weights[idx_b]
  vec_c = embedding_layer_weights[idx_c]

  op = vec_b - vec_a + vec_c
  cs_op = cosine_similarity([op], embedding_layer_weights)
  top = cs_op[0].argsort()[-topn:][::-1]

  print(word_b,"-",word_a,"+", word_c, "=")
  for i,t in enumerate(top):
    print("   ",vocabulary[t])
```
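
For intuition about what these helpers compute, here is a small stdlib-only sketch of cosine similarity and nearest-neighbor ranking. The 2-d vectors are invented by hand for illustration and do not come from any trained model. Returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical 2-d embeddings, chosen by hand
toy = {
    "virus": [1.0, 0.0],
    "pandemic": [0.9, 0.1],
    "election": [0.0, 1.0],
}

def most_similar(word, embeddings):
    # Rank every other word by cosine similarity to the query, highest first
    scores = {w: cosine(embeddings[word], v)
              for w, v in embeddings.items() if w != word}
    return sorted(scores, key=scores.get, reverse=True)

ranking = most_similar("virus", toy)
print(ranking)  # ['pandemic', 'election']
```

Cosine similarity depends only on the angle between vectors, not their length, so `pandemic` ranks first because it points in almost the same direction as `virus`.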

#### a) Finding Semantically similar words

- We want to find words that are semantically similar to the following words: ```['covid19','election','2020','pandemic','quarantine']```
- Run the function `find_similar(...)` and display the results for embedding weights from question 2 and question 3

In [None]:
#find_similar(...) for learned embeddings weights

In [None]:
#find_similar(...) for pre-trained embeddings weights

- Explain your results. Do the similar words have any real-world similarity?
- Explain the results for the word `covid19` with learned embedding weights versus pre-trained embedding weights

*Your answer here*

#### b) Finding Analogies

- Word embeddings can be used to find analogies between words. For example, “man” is to “woman” as “son” is to “daughter”.
- Let us verify the `male-female` analogy
- Run the function `find_analogy(...)` and display the results for embedding weights from question 2 and question 3
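
The vector arithmetic behind the analogy can be sketched with hand-made 2-d vectors. All values below are hypothetical and arranged so that the second coordinate encodes the male/female offset; real embeddings are far noisier.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: second coordinate acts as a "female" direction
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "son": [2.0, 0.0],
    "daughter": [2.0, 1.0],
    "doctor": [3.0, -1.0],
}

def analogy(word_a, word_b, word_c, embeddings):
    # target = b - a + c, computed component by component
    target = [vb - va + vc for va, vb, vc in
              zip(embeddings[word_a], embeddings[word_b], embeddings[word_c])]
    # Score every word except the three inputs against the target vector
    candidates = {w: cosine(target, v) for w, v in embeddings.items()
                  if w not in (word_a, word_b, word_c)}
    return max(candidates, key=candidates.get), target

best, target = analogy("man", "woman", "son", toy)
print(target)  # [2.0, 1.0]
print(best)    # daughter
```

Unlike `find_analogy` above, this sketch excludes the three input words from the candidates, a common convention because the inputs themselves often rank near the top.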

In [None]:
#find_analogy('man', 'woman', 'son',...) for learned embeddings weights

In [None]:
#find_analogy('man', 'woman', 'son'...) for pre-trained embeddings weights

- Explain your results. Do word analogies work in both cases?

*Your answer here*