# M3L2 Transformers Lab
In this lab, we will practice downloading and using models from the open-source Hugging Face repository (https://huggingface.co/). Before starting, browse the website and open the **Models** and **Datasets** tabs to familiarize yourself with the models and datasets we will be using.

In [None]:
!pip list

In [None]:
!pip install scipy==1.11.4
!pip install transformers
!pip install typing_extensions==4.10.0

In [None]:
import numpy as np
import pandas as pd

In [None]:
from transformers import pipeline, set_seed

### Section 1.1 - Sentiment Analysis 

First, let's look at using a transformer for sentiment analysis. This task takes in a sentence and classifies it as positive or negative. Some models output additional classes, such as "neutral", depending on how they were trained. The Hugging Face page for each model lists the expected output classes, along with tips on how to use the model.

The default classifier is "distilbert-base-uncased-finetuned-sst-2-english", which returns a 2 class output (positive or negative sentiment) of the sentence that you supply.  

We will start with the simplest way to use a model, with a feature called a ***pipeline***.  These are pre-trained models, so there is no training necessary.  

In [None]:
set_seed(10)
classifier = pipeline('sentiment-analysis')

In [None]:
res = classifier("I am mad.")

print(res) # tells you sentiment of the sentence
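
The pipeline returns a list with one dictionary per input sentence, each holding a `label` and a `score`. The sketch below shows one way to turn that structure into a readable string. The result is hand-written for illustration (the score is made up, not real model output), and `describe` is a hypothetical helper.

```python
# Hand-written stand-in for a pipeline result; the score is made up.
example_result = [{"label": "NEGATIVE", "score": 0.98}]

def describe(result):
    """Format a list of {'label', 'score'} dicts as readable strings."""
    return [f"{item['label']} ({item['score']:.0%})" for item in result]

summary = describe(example_result)
print(summary)
```

With a real pipeline, you could pass `res` to `describe` in the same way.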

### Section 1.2 - Load a different Sentiment Analysis Model
Next, we will see how to change the model. This model, ProsusAI/finbert, was trained on financial data and predicts 3 classes: positive, negative, and neutral. These differences from the previous model will become apparent in the results.

In [None]:
set_seed(10)
classifier2 = pipeline(task='sentiment-analysis', model='ProsusAI/finbert') 
res = classifier2("I am mad.")
print(res)

So the classifier doesn't get this right: it labels "I am mad." as *positive*. There are 3 classes, so a uniform guess would give each class about 33%. Here, the model gives the positive class only about 37%, so it is barely more confident than random guessing.
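
The comparison with random guessing is simple arithmetic, sketched below. The 0.37 is the score quoted in the text; the variable names are only illustrative. Note that the largest of three probabilities that sum to 1 can never fall below 1/3, so a top score near 1/3 means the model is close to undecided.

```python
num_classes = 3
top_score = 0.37  # score quoted in the text above

chance_level = 1 / num_classes      # uniform guess over three classes
margin = top_score - chance_level   # how far the top score sits above chance

print(f"chance level: {chance_level:.3f}")
print(f"margin above chance: {margin:.3f}")
```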

However, if we were to use a prompt that is more financial, you might get better results: https://huggingface.co/ProsusAI/finbert?text=I+am+mad.

### Section 2 - Text Generation
In this section, let's explore how to use transformers for text generation, given a specific prompt.

This is the text-generation pipeline: you supply a prompt and the model continues it. GPT2 is the default model for this task, and it is named explicitly here.


In [None]:
generator = pipeline('text-generation', model='gpt2')
set_seed(10)
generator("Hello, I like data science because ", max_length=50, num_return_sequences=2)
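
Text generation samples each next token at random, which is why the seed is set before generating: the same seed reproduces the same draws. The toy sketch below illustrates the idea with Python's standard `random` module. The vocabulary and probabilities are made up, and `sample_words` is a hypothetical helper, not how the pipeline is implemented.

```python
import random

# Made-up next-word choices and probabilities; a real model computes these from the prompt.
vocabulary = ["it", "data", "models", "is", "fun"]
weights = [0.3, 0.25, 0.2, 0.15, 0.1]

def sample_words(seed, length=5):
    """Draw `length` words using a private generator seeded with `seed`."""
    rng = random.Random(seed)
    return rng.choices(vocabulary, weights=weights, k=length)

first_run = sample_words(seed=10)
second_run = sample_words(seed=10)
other_seed = sample_words(seed=11)

print(first_run == second_run)  # same seed, same draws
print(other_seed)
```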

Let's try another model. ***Distilgpt2*** is a much smaller model. Let's see how it does with the same prompt and seed:

In [None]:
set_seed(10)

generator = pipeline('text-generation', model='distilgpt2')
# Same prompt as in the GPT2 cell so that the two outputs are comparable
generator("Hello, I like data science because ", max_length=50, num_return_sequences=2)

As you can see, the quality of the generated text can be very different from one model to another.

### Section 3 - Fine-tuning the model
In this section, we will show how to fine-tune a model on data that is relevant to your application.

We will be using a reduced BERT transformer called distilbert-base-uncased-finetuned-sst-2-english.  Documentation can be found here: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english.  This model was chosen because it is small while still being comparable in performance to the full BERT model. The small size makes it practical to train on a single laptop in a reasonable amount of time.

This model was fine-tuned on *sst2* (the Stanford Sentiment Treebank), a sentiment task in the *glue* benchmark, which is made up of short sentences and phrases labeled as positive or negative.

Here are the steps we will be taking:
- Load sentiment-analysis transformer and conduct baseline test
- Train transformer on new dataset, IMDB, which is made up of movie reviews
- Test transformer on same text as in baseline

In [None]:
# Load model and baseline performance
set_seed(10)
model_name = 'distilbert-base-uncased-finetuned-sst-2-english' 
classifier = pipeline("text-classification", model=model_name)

In [None]:
# IMDB database quote: "I can't believe that those praising this movie herein aren't thinking of some other film."
# This is reworded below so that we are not training and testing on the same words.  
classifier("Your praise would be better for another film.")

Next, let's fine-tune the classifier on the IMDB movie review dataset.

In [None]:
# Choose a size from 0-25K.  Here, I'm choosing a small number for demonstration purposes
test_size=50
train_size=50

In [None]:
!pip install pyarrow==12.0.1 datasets==2.18.0

In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb")
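# The IMDB rows are grouped by label, so shuffle before slicing to get both classes
dataset_train = dataset["train"].shuffle(seed=10)[0:train_size]  # Just take the training split for now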
print(dataset_train['text'][10])
print(dataset_train['label'][10])
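
When taking only the first rows of a split, it is worth checking whether the rows are ordered by label, because a small slice can then contain a single class. The toy sketch below uses a made-up label list (not the IMDB data) to show the effect and how shuffling with a fixed seed avoids it.

```python
import random

# Toy stand-in for a split stored with all negatives first, then all positives.
labels = [0] * 6 + [1] * 6

first_rows = labels[:4]               # slicing without shuffling
print(first_rows)                     # only one class appears

shuffled = labels[:]                  # copy so the original order is kept
random.Random(10).shuffle(shuffled)   # fixed seed for repeatability
print(shuffled[:4])
```

Printing the label counts of your own sample is a quick way to confirm that both classes are present.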

In [None]:
''' Next we need to tokenize the new IMDB dataset in the format of the transformer
'''
from transformers import AutoTokenizer

# Using DistilBERT as it is 2.5x faster to train than the base BERT model.  
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_data = tokenizer.batch_encode_plus(dataset_train["text"], return_tensors="np", 
                                             padding=True, max_length=512, truncation=True )
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

labels_train = np.array(dataset_train["label"])  # Label is already an array of 0 and 1
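
To see what `padding=True`, `max_length`, and `truncation=True` do to a batch, here is a minimal pure-Python sketch. `pad_and_truncate` is a hypothetical helper and the token ids are made up. It also ignores details of the real tokenizer, such as keeping the final special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three "reviews" of different lengths.
toy_batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 9, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row ends up the same length, which is what allows the batch to be stored as a single rectangular array.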

In [None]:
'''Train the model with the new tokenized text'''

from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5)) 
model.fit(tokenized_data, labels_train)

### Test

In [None]:
#dataset = load_dataset("imdb")
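# The IMDB rows are grouped by label, so shuffle before slicing to get both classes
dataset_test = dataset["test"].shuffle(seed=10)[0:test_size]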


In [None]:
'''Tokenize the test data'''
# Reuse the tokenizer created for the training data so both splits are encoded the same way.
# A separate variable name keeps the tokenized training data from being overwritten.
tokenized_test = tokenizer.batch_encode_plus(dataset_test["text"], return_tensors="np", 
                                             padding=True, max_length=512, truncation=True )
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_test_data = dict(tokenized_test)

labels_test = np.array(dataset_test["label"])  # Convert the list of 0/1 labels to an array

In [None]:
tokenized_test_data['input_ids'].shape

In [None]:
# Now you can do predictions like in Keras
ypred = model.predict(tokenized_test_data)

In [None]:
# Outputs are logits, so apply a softmax to convert them to class probabilities
import tensorflow as tf
ypred_predictions = tf.nn.softmax(ypred.logits)
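
For reference, here is a minimal pure-Python sketch of what the softmax does to one row of logits. The logit values are made up. It also shows that the predicted class is the same whether you take the argmax of the logits or of the probabilities, since softmax preserves their order.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0]  # made-up scores for classes 0 and 1
probs = softmax(logits)
predicted = probs.index(max(probs))

print(probs)
print(predicted)
```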

In [None]:
ypred_predictions[:5]

In [None]:
# Now use argmax to get the label depending on which class gets the maximum prediction
y_test_pred_labels = np.argmax(ypred_predictions, axis=1)
y_test_pred_labels[0:5]

In [None]:
# compare to the true data
labels_test[0:5]

In [None]:
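# No metrics were passed to compile(), so evaluate() reports the loss
model.evaluate(tokenized_test_data, labels_test)
# Overall accuracy: fraction of predicted labels that match the true labels
print("accuracy:", np.mean(y_test_pred_labels == labels_test))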

### Next steps
With only 50 training and 50 test observations and a single epoch, performance is low. If you have a GPU or a more powerful computing platform, you may want to use more observations and run multiple epochs to see if that improves performance.
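
A rough way to see how little training happened is to count the weight updates. The sketch below assumes Keras' default batch size of 32, which `model.fit` uses when no `batch_size` is given. `update_steps` is a hypothetical helper, and the larger run (25,000 reviews, 3 epochs) is only an example setting.

```python
import math

def update_steps(num_examples, epochs, batch_size=32):
    """Number of weight updates: batches per epoch times number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

small_run = update_steps(num_examples=50, epochs=1)
larger_run = update_steps(num_examples=25_000, epochs=3)
print(small_run, larger_run)
```

Under these assumptions, the run in this lab performs only a couple of updates, which helps explain why the model changes so little.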

In [None]:
'''Tokenize the baseline sentence so the fine-tuned model can be tested on the same text'''
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_data = tokenizer.encode_plus("Your praise would be better for another film.", return_tensors="np", 
                                       padding='max_length', max_length=512, truncation=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_test_data = dict(tokenized_data)

labels_test = 0  # Expected label for this sentence: 0 (negative)

In [None]:
tokenized_test_data['input_ids'].shape

In [None]:
ypred = model.predict(tokenized_test_data)
ypred_predictions = tf.nn.softmax(ypred.logits)

In [None]:
ypred_predictions
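
To read the result as a label, map the index of the largest probability to a class name. The sketch below assumes the IMDB convention of 0 = negative and 1 = positive. The probabilities are made up, and `to_label` is a hypothetical helper.

```python
# Assumed mapping, following the IMDB convention of 0 = negative, 1 = positive.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def to_label(probabilities):
    """Return the class name and probability of the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return id2label[best], probabilities[best]

example_probs = [0.91, 0.09]  # made-up probabilities for one sentence
label, score = to_label(example_probs)
print(label, score)
```

With the real output, you would pass the first row of `ypred_predictions` converted to a list, and compare the label with the baseline prediction from the start of this section.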