<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> AC295: Advanced Practical Data Science </h1>

## Attention and Transformers

**Harvard University, Fall 2020**  
**Instructors**: Pavlos Protopapas  

---

**Each assignment is graded out of 5 points.  The topic for this assignment is Transfer Learning for Text.**

**Due:** 11/03/2020 10:15 AM EDT

**Submit:** We won't be rerunning your notebooks, so please ensure the output is visible in the notebook.

#### Learning Objectives

In this exercise you will cover the following topics:  
- Tokenizing text for BERT
- BERT for Text Classification Task

---

#### Installs

In [None]:
!pip install transformers #Installing Huggingface transformers 

#### Imports

In [None]:
import os
import requests
import zipfile
import tarfile
import shutil
import json
import time
import sys
import string
import re
import numpy as np
import pandas as pd
from glob import glob
from string import Template
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import layers
from tensorflow.keras import activations
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import metrics
from tensorflow.keras.utils import to_categorical

from sklearn.model_selection import train_test_split

from transformers import BertTokenizer, TFBertForSequenceClassification

#### Verify Setup

In [None]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.experimental.get_visible_devices()
print("Devices:", devices)
print(tf.config.experimental.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.experimental.AUTOTUNE

#### Utils

In [None]:
def download_file(packet_url, base_path="", extract=False):
  # Create the target directory (and any parents) if needed
  if base_path != "":
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  # Stream the download to disk in chunks to keep memory usage low
  with requests.get(packet_url, stream=True) as r:
    r.raise_for_status()
    with open(os.path.join(base_path, packet_file), 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  # Optionally unpack .zip or .tar.gz archives into base_path
  if extract:
    if packet_file.endswith(".zip"):
      with zipfile.ZipFile(os.path.join(base_path, packet_file)) as zfile:
        zfile.extractall(base_path)

    if packet_file.endswith(".tar.gz"):
      with tarfile.open(os.path.join(base_path, packet_file)) as tfile:
        tfile.extractall(base_path)

def evaluate_model(model,test_data, training_results):
    
  # Get the model train history
  model_train_history = training_results.history
  # Get the number of epochs the training was run for
  num_epochs = len(model_train_history["loss"])

  # Plot training results
  fig = plt.figure(figsize=(15,5))
  axs = fig.add_subplot(1,2,1)
  axs.set_title('Loss')
  # Plot all metrics
  for metric in ["loss","val_loss"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()
  
  axs = fig.add_subplot(1,2,2)
  axs.set_title('Accuracy')
  # Plot all metrics
  for metric in ["accuracy","val_accuracy"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()

  plt.show()
  
  # Evaluate on test data
  evaluation_results = model.evaluate(test_data)
  print("Evaluation Results:", evaluation_results)

## Dataset

**We will continue to use the dataset from Exercise 6.** The dataset consists of news articles from CNN in the politics, health, and entertainment categories. There are about 300 articles in each category.

#### Download

In [None]:
start_time = time.time()
download_file("https://storage.googleapis.com/dataset_store/ac295/news300.zip", base_path="datasets", extract=True)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

#### Explore

In [None]:
data_dir = os.path.join("datasets","news300")
# Sort so the label -> index mapping is the same on every run
label_names = sorted(os.listdir(data_dir))

# Number of unique labels
num_classes = len(label_names) 
# Create label index for easy lookup
label2index = dict((name, index) for index, name in enumerate(label_names))
index2label = dict((index, name) for index, name in enumerate(label_names))

print("Number of classes:", num_classes)
print("Labels:", label_names)

# Generate a list of labels and path to text
data_x = []
data_y = []

for label in label_names:
  text_files = os.listdir(os.path.join(data_dir,label))
  data_x.extend([os.path.join(data_dir,label,f) for f in text_files])
  data_y.extend([label for f in text_files])

# Load the text content
for idx, path in enumerate(data_x):
  # Load text
  with open(path) as file:
    data_x[idx] = file.read()

# Preview
print("data_x count:",len(data_x))
print("data_y count:",len(data_y))
print(data_x[:5])
print(data_y[:5])
print("Label counts:",np.unique(data_y, return_counts=True))

#### View Text

In [None]:
# Generate a random sample of indices (high is exclusive in np.random.randint)
data_samples = np.random.randint(0, high=len(data_x), size=10)
for i,data_idx in enumerate(data_samples):
  print("Label:",data_y[data_idx],", Text:",data_x[data_idx])

## Questions:

For this exercise you will use **BERT**, a transformer model that was discussed in lecture. [Hugging Face](https://huggingface.co/) is an NLP-focused company that maintains a large open-source ecosystem built around Transformers. Its `transformers` library exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2, and DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights. Getting started only requires installing the library with `pip install transformers`.

You will use BERT directly from the `transformers` package.


## Question 1 : Build a text classification model using BERT (3.0 Points)

#### a) Prepare data for BERT

BERT requires the data to be tokenized in a specific way. For this you need to use the `BertTokenizer` from the Hugging Face `transformers` package. Steps to prepare your dataset:

- Split data to train/validation
- Use `BertTokenizer` to tokenize the input text
- Load the [BertTokenizer](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) with `BertTokenizer.from_pretrained(...)`, passing `bert-base-uncased` as the pretrained model name
- When using `tokenizer.encode_plus(...)`, set `max_length=256` or some other value `<=512`. You may run into out-of-memory (OOM) errors during training if the value is high
- The output of `tokenizer.encode_plus(...)` is a dictionary with the keys `'input_ids', 'token_type_ids', 'attention_mask'`
- Remember to convert the y values with `to_categorical`
- Create TF Datasets using the tokenized results. When using `tf.data.Dataset.from_tensor_slices(...,...)` look out for the x values passed in. `BERT` requires 3 inputs, `'input_ids', 'token_type_ids', 'attention_mask'`, grouped together as the x element; the expected output below shows them as a dictionary keyed by these names.
- Remember to apply `shuffle(...)`, `batch(...)`, and `prefetch(...)` to your train and validation data
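
To make the three tokenizer outputs concrete, here is a minimal pure-Python sketch of what padding, truncation, and the attention mask amount to for a single sentence. It is an illustration only, not the library's implementation: the function name `encode_like_bert` and the input word-piece IDs are hypothetical, and the special-token IDs are the ones used by the `bert-base-uncased` vocabulary (treat them as assumptions for any other checkpoint).

```python
# Special-token IDs from the bert-base-uncased vocabulary;
# other checkpoints may use different values (assumption).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def encode_like_bert(token_ids, max_length):
    """Mimic the shape of encode_plus output for a single sentence.

    `token_ids` are word-piece IDs without special tokens (hypothetical
    values here); the real tokenizer also performs the text -> ID step.
    """
    # Truncate so that [CLS] and [SEP] still fit inside max_length
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]

    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [1] * len(input_ids)

    # Pad on the right up to max_length
    pad_count = max_length - len(input_ids)
    input_ids += [PAD_ID] * pad_count
    attention_mask += [0] * pad_count

    # A single-sentence input uses segment 0 at every position
    token_type_ids = [0] * max_length

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# A short input is padded; a long input is truncated to max_length
short_out = encode_like_bert([2054, 2017, 2342], max_length=8)
long_out = encode_like_bert(list(range(1000, 1020)), max_length=8)
print(short_out)
print(long_out)
```

Every list has exactly `max_length` entries, which is why the tokenized examples can later be stacked into fixed-shape tensors for a TF Dataset.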

In [None]:
# Examples on how to use the BertTokenizer
tokenizer=BertTokenizer.from_pretrained('bert-base-uncased',do_lower_case=True)

# Tokenizer encode_plus
text = "What you need to know about using them safely amid the pandemic"
outputs = tokenizer.encode_plus(text, 
                  add_special_tokens = True, # add [CLS], [SEP]
                  max_length = 15, # max length of the text that can go to BERT (<=512)
                  padding='max_length',
                  return_attention_mask = True, # add attention mask to not focus on pad tokens
                  truncation='longest_first',
                  return_tensors="tf"
              )
print("Tokenizer Output:",outputs)

# Tokenizer batch_encode_plus
text = ["What you need to know about using them safely amid the pandemic", 
        "A third of Medicare enrollees with coronavirus ended up in the hospital"]
outputs = tokenizer.batch_encode_plus(
        text,
        return_tensors='tf',
        add_special_tokens = True, # add [CLS], [SEP]
        return_token_type_ids=True,
        padding='max_length',
        max_length=15, # max length of the text that can go to BERT (<=512)
        return_attention_mask = True,
        truncation='longest_first'
    )
print("Tokenizer Output:",outputs)
print("Tokenizer Output Keys:", outputs.keys())

In [None]:
# Dataset Params
batch_size = 8 # You can try higher values but may run into OOM errors depending on which GPU you are using
train_shuffle_buffer_size = 800
validation_shuffle_buffer_size = 200

# Convert all y labels to numbers

# Convert y with to_categorical

# Create TF Dataset

# print("train_data",train_data)
# print("validation_data",validation_data)

Your train and validation datasets should look something like this:
```
print("train_data",train_data)
print("validation_data",validation_data)

train_data <PrefetchDataset shapes: ({input_ids: (None, 256), token_type_ids: (None, 256), attention_mask: (None, 256)}, (None, 3)), types: ({input_ids: tf.int32, token_type_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>
validation_data <PrefetchDataset shapes: ({input_ids: (None, 256), token_type_ids: (None, 256), attention_mask: (None, 256)}, (None, 3)), types: ({input_ids: tf.int32, token_type_ids: tf.int32, attention_mask: tf.int32}, tf.int32)>
```
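
The `(None, 3)` label shape in the output above reflects one-hot encoding over the three classes. As a rough illustration of what that conversion does, the pure-Python sketch below maps label strings to indices and then to one-hot rows. The folder names in `label_names`, the sample labels, and the helper `one_hot` are assumptions made for this example; in the notebook the conversion would be done with `to_categorical`.

```python
# Assumed class folder names; the real ones come from os.listdir(data_dir)
label_names = ["entertainment", "health", "politics"]
label2index = {name: index for index, name in enumerate(label_names)}


def one_hot(indices, num_classes):
    """Return one row per index with a single 1 at that index's column."""
    return [
        [1 if col == idx else 0 for col in range(num_classes)]
        for idx in indices
    ]


# Hypothetical labels for four articles
sample_labels = ["health", "politics", "entertainment", "health"]

# Step 1: label strings -> integer class indices
y_index = [label2index[label] for label in sample_labels]

# Step 2: integer indices -> one-hot rows of length num_classes
y_one_hot = one_hot(y_index, len(label_names))

print(y_index)
print(y_one_hot)
```

Each row has length 3, so a batch of labels has shape `(batch_size, 3)`, matching the second element of the dataset signature.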

#### b) BERT for Sequence Classification

- Build a model using `TFBertForSequenceClassification` from the `transformers` package from Hugging Face
- Load the pre-trained weights using `bert-base-uncased`, and make sure to pass the argument `num_labels`
- Train your model
- Ensure there is a plot of your training history
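
One detail worth keeping in mind when choosing a loss: `TFBertForSequenceClassification` outputs unnormalized logits rather than probabilities, so with one-hot labels a loss that applies softmax itself, for example `losses.CategoricalCrossentropy(from_logits=True)`, is a common pairing. The sketch below shows in plain Python what such a loss computes for one example. The logit values are made up and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy_from_logits(y_true, logits):
    """Cross-entropy between a one-hot target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(y_true, probs))


# Made-up logits for one article over three classes
logits = [2.0, 0.5, -1.0]

# The loss is small when the largest logit matches the true class ...
loss_correct = categorical_crossentropy_from_logits([1, 0, 0], logits)
# ... and large when the true class received the smallest logit
loss_wrong = categorical_crossentropy_from_logits([0, 0, 1], logits)

print("probabilities:", softmax(logits))
print("loss (true class 0):", loss_correct)
print("loss (true class 2):", loss_wrong)
```

If the loss were configured for probabilities while the model emits logits, the reported loss would not correspond to this computation, which is a frequent cause of training that appears not to converge.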

In [None]:
############################
# Training Params
############################
learning_rate = 2e-5 # Try 5e-5, 3e-5, 2e-5
epochs = 5

# Free up memory
K.clear_session()

# Build BERT model
model = ... 

# Optimizer
optimizer = optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)

# Loss

# Compile

# Train model

# Evaluate Model

#### c) Classification Results

- What was your validation accuracy?
- It should be more than 95%

---
## Question 2 : Conceptual (2.0 Points)


#### a) How does the encoder-decoder structure work for language modelling?


*Your answer here*


#### b) Explain in your own words what the attention mechanism is. Why do state-of-the-art models use this concept?

*Your answer here*

#### c) What is the biggest benefit of transformers compared to seq2seq models?

*Your answer here*

#### d) Explain in your own words what the positional encoder is and why it is needed.

*Your answer here*