# M3L2 Transformers Lab
In this lab, we will practice downloading various models from the open-source Hugging Face repository (https://huggingface.co/).  Please visit the website and click on the **Models** and **Datasets** tabs to familiarize yourself with the models we will be using.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from transformers import pipeline, set_seed

### Section 1.1 - Sentiment Analysis 

First, let's look at using a transformer for sentiment analysis.  This task takes a sentence and classifies it as positive or negative.  Some models output other classes, such as "neutral", depending on how they were trained.  The Hugging Face page for each model lists the expected output classes, along with tips on how to use the model.

The default classifier is "distilbert-base-uncased-finetuned-sst-2-english", which returns a two-class output (positive or negative sentiment) for the sentence that you supply.

We will start with the simplest way to use a model, a feature called a ***pipeline***.  The models it loads are pre-trained, so no training is necessary.

In [3]:
set_seed(10)
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



In [5]:
res = classifier("I am mad.")

print(res) # tells you sentiment of the sentence

[{'label': 'NEGATIVE', 'score': 0.9995855689048767}]
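
The log above recommends naming the model and revision explicitly. A minimal sketch of doing so, and of reading the list-of-dicts result, follows. The revision string is copied from the log message and is assumed to still resolve on the Hub; `build_pinned_classifier` and `describe` are illustrative helper names, not part of the library.

```python
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"  # copied from the log message above; assumed to still resolve


def build_pinned_classifier():
    """Create the same pipeline with the model and revision named explicitly.

    Not called here because it downloads the model weights.
    """
    from transformers import pipeline  # imported lazily so the rest runs offline
    return pipeline("sentiment-analysis", model=MODEL_NAME, revision=REVISION)


def describe(result):
    """Format the top prediction from a pipeline result (a list of dicts)."""
    top = result[0]
    return f"{top['label']} ({top['score']:.2%})"


# The result printed above, copied verbatim
res = [{"label": "NEGATIVE", "score": 0.9995855689048767}]
summary = describe(res)
print(summary)
```

Calling `build_pinned_classifier()` in place of `pipeline('sentiment-analysis')` should load the same checkpoint without the "No model was supplied" message.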


### Section 1.2 - Load a Different Sentiment Analysis Model
Next, we will change the model by passing a model name to the pipeline.  The ProsusAI/finbert model was trained on financial text and predicts 3 classes: positive, negative and neutral.  These differences from the previous model will become apparent in the results.

In [6]:
set_seed(10)
classifier2 = pipeline(task='sentiment-analysis', model='ProsusAI/finbert') 
res = classifier2("I am mad.")
print(res)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at ProsusAI/finbert and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'positive', 'score': 0.37243393063545227}]


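The classifier gets this wrong: it labels "I am mad." as *positive*.  With 3 classes, a uniform guess assigns about 33% to each class, so a score of 37% is barely more confident than chance.  Note the warning in the log above: the `classifier` layer was newly initialized rather than loaded from the checkpoint, so in this run the prediction may reflect untrained weights rather than FinBERT's actual behavior.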
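
To put the 37% in context, the sketch below (NumPy only, with made-up logits rather than FinBERT's) shows what a three-class softmax produces when the logits are equal or nearly equal, as might be expected from a layer with small random weights: a top score at or just above 1/3.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


# Chance level for a 3-class problem
chance = 1 / 3

# Hypothetical logits: all equal, then slightly perturbed
uniform_probs = softmax(np.array([0.0, 0.0, 0.0]))
near_uniform_probs = softmax(np.array([0.15, -0.05, -0.10]))

print(uniform_probs)             # every class gets 1/3
print(near_uniform_probs.max())  # only slightly above 1/3
```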

However, a prompt closer to financial language might give better results.  You can try the model in the browser here: https://huggingface.co/ProsusAI/finbert?text=I+am+mad.

### Section 2 - Text Generation
In this section, let's explore how to use transformers for text generation, given a specific prompt.

The text-generation pipeline takes a prompt (the seed text) and continues it.  GPT2 is the default model for this task; here we name it explicitly.


In [8]:
generator = pipeline('text-generation', model='gpt2')
set_seed(10)
generator("Hello, I like data science because ", max_length=50, num_return_sequences=2)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I like data science because \xa0it's easy and\xa0 it's really clear \xa0that this is a good thing.\xa0 One example of \xa0data science being able to build large\xa0data,\xa0 was a report by a\xa0"},
 {'generated_text': 'Hello, I like data science because \xa0everything \xa0occurs in the context of a complex \xa0data \xa0data-coding system."\nBut some things don\'t make sense. You have to think about some of the things that'}]
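
The outputs above are sampled, which is why `set_seed` is called before generating. The toy sketch below uses NumPy only and a hypothetical five-word vocabulary with made-up probabilities, not GPT2 itself, to show that sampling with the same seed reproduces the same sequence, while a different seed generally does not.

```python
import numpy as np

# Hypothetical vocabulary and next-word probabilities (illustrative only)
vocab = np.array(["data", "science", "is", "fun", "hard"])
probs = np.array([0.30, 0.25, 0.20, 0.15, 0.10])


def sample_words(seed, n_words=8):
    """Draw n_words from the toy distribution using a seeded generator."""
    rng = np.random.default_rng(seed)
    return list(rng.choice(vocab, size=n_words, p=probs))


run_a = sample_words(seed=10)
run_b = sample_words(seed=10)
run_c = sample_words(seed=11)

print(run_a == run_b)  # same seed, same draws
print(run_a)
print(run_c)
```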

Let's try another generator.  ***Distilgpt2*** is a smaller, distilled version of GPT2.  Let's see how it does with a similar prompt and the same seed:

In [9]:
set_seed(10)

generator = pipeline('text-generation', model='distilgpt2')
generator("I love data science because,", max_length=50, num_return_sequences=2)

All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "I love data science because, in my book, they don't have to do it myself. But, to have the data available by email and, I mean, how would people want to know about, say, a new drug, or do the"},
 {'generated_text': "I love data science because, as scientists, I believe it is important that any new knowledge in the history of the human world, as I write, has not only brought to us the many facts of life. In fact, it's about them—"}]

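As you can see, the generated text can differ considerably between models.  Note that the two prompts above are not identical, so this is not a strict like-for-like comparison.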

### Section 3 - Fine-tuning the model
In this section, we will show how to fine-tune a model to fit the data that is relevant to your application.

We will be using a distilled BERT transformer called distilbert-base-uncased-finetuned-sst-2-english.  Documentation can be found here: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english.  This model was chosen because it is small in size while still being comparable in performance to the full BERT model.  The small size will make it easier to train on a single laptop in a reasonable amount of time.

This model was fine-tuned on SST-2, the sentiment task of the GLUE benchmark (listed as *glue* and *sst2* on the model page), which is made up of sentences and phrases labeled as positive or negative.

Here are the steps we will be taking:
- Load the sentiment-analysis transformer and run a baseline test
- Fine-tune the transformer on a new dataset, IMDB, which is made up of movie reviews
- Test the transformer on the same text as in the baseline

In [10]:
# Load model and baseline performance
set_seed(10)
model_name = 'distilbert-base-uncased-finetuned-sst-2-english' 
classifier = pipeline("text-classification", model=model_name)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [11]:
# IMDB database quote: "I can't believe that those praising this movie herein aren't thinking of some other film."
# This is reworded below so that we are not training and testing on the same words.  
classifier("Your praise would be better for another film.")

[{'label': 'NEGATIVE', 'score': 0.5186458826065063}]

Next, let's fine-tune the classifier on the IMDB movie review dataset.

In [14]:
# Choose a size from 0-25K.  Here, I'm choosing a small number for demonstration purposes
test_size=50
train_size=50

In [15]:
from datasets import load_dataset
dataset = load_dataset("imdb")
# Take the first train_size rows of the training split.
# Note: a leading slice may contain only one class if the split is ordered by label.
dataset_train = dataset["train"][0:train_size]
print(dataset_train['text'][10])
print(dataset_train['label'][10])

It was great to see some of my favorite stars of 30 years ago including John Ritter, Ben Gazarra and Audrey Hepburn. They looked quite wonderful. But that was it. They were not given any characters or good lines to work with. I neither understood or cared what the characters were doing.<br /><br />Some of the smaller female roles were fine, Patty Henson and Colleen Camp were quite competent and confident in their small sidekick parts. They showed some talent and it is sad they didn't go on to star in more and better films. Sadly, I didn't think Dorothy Stratten got a chance to act in this her only important film role.<br /><br />The film appears to have some fans, and I was very open-minded when I started watching it. I am a big Peter Bogdanovich fan and I enjoyed his last movie, "Cat's Meow" and all his early ones from "Targets" to "Nickleodeon". So, it really surprised me that I was barely able to keep awake watching this one.<br /><br />It is ironic that this movie is about a detect
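
Before training on a leading slice, it is worth checking its class balance. The sketch below uses synthetic labels that assume a split stored with all of one class first (an assumption about how IMDB is laid out, which can be verified by running `np.bincount` on the real labels). It shows that a leading slice of such a split contains a single class, while a shuffled sample of the same size contains both.

```python
import numpy as np

train_size = 50

# Synthetic stand-in for a label-sorted split: 12,500 zeros then 12,500 ones
labels = np.array([0] * 12_500 + [1] * 12_500)

# Leading slice, as in dataset["train"][0:train_size]
head_counts = np.bincount(labels[:train_size], minlength=2)

# Shuffled sample of the same size
rng = np.random.default_rng(10)
shuffled_idx = rng.permutation(len(labels))[:train_size]
shuffled_counts = np.bincount(labels[shuffled_idx], minlength=2)

print(head_counts)      # all examples fall in class 0
print(shuffled_counts)  # both classes are represented
```

With the `datasets` library, a comparable shuffled subset could be taken with something along the lines of `dataset["train"].shuffle(seed=10).select(range(train_size))`.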

In [12]:
# Next we need to tokenize the new IMDB dataset in the format the transformer expects
from transformers import AutoTokenizer

# Using DistilBERT as it is 2.5x faster to train than the base BERT model.
# Load the tokenizer that matches the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Pad to the longest review in the batch and truncate anything beyond 512 tokens
tokenized_data = tokenizer(dataset_train["text"], return_tensors="np",
                           padding=True, max_length=512, truncation=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

labels_train = np.array(dataset_train["label"])  # Labels are integers: 0 (negative) or 1 (positive)
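
To make the `padding` and `truncation` arguments concrete, here is a toy sketch in plain Python. It is not the Hugging Face tokenizer: the token ids are made up, `pad_and_truncate` is a hypothetical helper, and it simply cuts long sequences rather than preserving special tokens. It caps each sequence at `max_length`, pads to the longest remaining sequence, and builds the matching attention mask.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in batch]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for three "reviews" of different lengths
batch = [[101, 7, 8, 102], [101, 5, 102], [101, 1, 2, 3, 4, 5, 6, 102]]
encoded = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

Every row ends up the same length, which is what allows the batch to be returned as a single rectangular NumPy array.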

In [13]:
# Train the model with the new tokenized text

from transformers import TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam

# Load and compile our model
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
# Lower learning rates are often better for fine-tuning transformers.
# No loss is passed, so the model falls back to its built-in loss; no metrics are tracked.
model.compile(optimizer=Adam(3e-5))
# fit() runs for a single epoch by default
model.fit(tokenized_data, labels_train)

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.




<keras.callbacks.History at 0x1c78a238760>

### Test

In [14]:
# Take the first test_size rows of the test split
dataset_test = dataset["test"][0:test_size]


In [15]:
# Tokenize the test data with the same tokenizer that was used for training
tokenized_test = tokenizer(dataset_test["text"], return_tensors="np",
                           padding=True, max_length=512, truncation=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_test_data = dict(tokenized_test)

labels_test = np.array(dataset_test["label"])  # Labels are integers: 0 (negative) or 1 (positive)

In [16]:
tokenized_test_data['input_ids'].shape

(50, 512)

In [17]:
# Now you can do predictions like in Keras
ypred = model.predict(tokenized_test_data)



In [18]:
# Outputs are logits, so apply a softmax to turn them into class probabilities
import tensorflow as tf
ypred_predictions = tf.nn.softmax(ypred.logits)

In [19]:
ypred_predictions[:5]

<tf.Tensor: shape=(5, 2), dtype=float32, numpy=
array([[9.9980921e-01, 1.9079298e-04],
       [9.9336433e-01, 6.6356286e-03],
       [9.9980825e-01, 1.9171991e-04],
       [9.9949467e-01, 5.0532701e-04],
       [2.3088291e-01, 7.6911712e-01]], dtype=float32)>

In [20]:
# Now use argmax to get the label depending on which class gets the maximum prediction
y_test_pred_labels = np.argmax(ypred_predictions, axis=1)
y_test_pred_labels[0:5]

array([0, 0, 0, 0, 1], dtype=int64)

In [21]:
# Compare to the true labels
labels_test[0:5]

array([0, 0, 0, 0, 0])

In [22]:
# Evaluate on the test set. No metrics were passed to compile(), so this returns the loss, not accuracy
model.evaluate(tokenized_test_data, labels_test)



0.030380595475435257
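
Note that `model.evaluate` reports the loss when `compile` is given no metrics. Accuracy can instead be computed directly from the predicted and true labels. The sketch below uses only the five predictions and labels printed earlier, so the result describes those five rows, not the full 50-example test set.

```python
import numpy as np

# First five predicted and true labels, copied from the outputs above
y_pred_head = np.array([0, 0, 0, 0, 1])
y_true_head = np.array([0, 0, 0, 0, 0])

# Accuracy is the fraction of positions where prediction and label agree
accuracy_head = np.mean(y_pred_head == y_true_head)
print(accuracy_head)  # 4 of 5 correct -> 0.8
```

On the full test set, the same expression would be `np.mean(y_test_pred_labels == labels_test)`.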

### Next steps
With only 50 training and 50 test observations and a single epoch, the results above are not a reliable measure of performance.  If you have a GPU or a more powerful computing platform, you may want to use more observations and run multiple epochs to see if that improves performance.

In [23]:
# Tokenize the single baseline sentence that was used before fine-tuning
tokenized_sentence = tokenizer("Your praise would be better for another film.", return_tensors="np",
                               padding='max_length', max_length=512, truncation=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_test_data = dict(tokenized_sentence)

labels_test = 0  # Expected label for this sentence: 0 (negative)

In [24]:
tokenized_test_data['input_ids'].shape

(1, 512)

In [25]:
ypred = model.predict(tokenized_test_data)



In [26]:
# ypred was computed in the previous cell; convert its logits to probabilities
ypred_predictions = tf.nn.softmax(ypred.logits)



In [27]:
ypred_predictions

<tf.Tensor: shape=(1, 2), dtype=float32, numpy=array([[0.9741362 , 0.02586381]], dtype=float32)>
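
To compare this with the baseline pipeline output, the probabilities can be mapped back to label names. The sketch below assumes the mapping 0 = NEGATIVE, 1 = POSITIVE, which is consistent with the labels used in this lab; on a loaded model it can be checked via `model.config.id2label`.

```python
import numpy as np

# Assumed mapping; verify against model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities printed above for the single test sentence
probs = np.array([[0.9741362, 0.02586381]])

pred_id = int(np.argmax(probs, axis=1)[0])
fine_tuned = {"label": id2label[pred_id], "score": float(probs[0, pred_id])}

# Baseline result from the pipeline before fine-tuning, copied from earlier in the lab
baseline = {"label": "NEGATIVE", "score": 0.5186458826065063}

print(fine_tuned)
print(f"score change: {fine_tuned['score'] - baseline['score']:+.3f}")
```

Both runs label the sentence NEGATIVE and the fine-tuned model is more confident, though with only 50 training reviews this is best read as a demonstration of the workflow rather than evidence of improvement.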