<a href="https://colab.research.google.com/github/michellejm/LLMs-fall-23/blob/main/bert_sentiment_via_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

❗**WARNING**❗

There are two ways to do this lab.
1. You can do it locally in a Jupyter Notebook. If you do, it cannot be done in one sitting, because one step takes hours.
2. You can do it in a Google Colab notebook and pay for GPUs. I did it this way with a TPU runtime. I used the pay-as-you-go option and spent $10, and I have plenty left for the rest of the semester. Total runtime was about 30 minutes.

## BERT lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** October 29


### Background
The purpose of this lab is to see how to apply BERT to a simple sentiment analysis task. BERT can be used for a wide variety of tasks, including text classification (e.g., sentiment analysis), question answering, text summarization, and more.
This model is [hosted on Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert). Hugging Face is a repository of machine learning models - a bit like GitHub.

We will use TensorFlow for this lab. You may have also seen PyTorch. They are very similar libraries for deep learning. TensorFlow has been around a bit longer and is more widely used for production than PyTorch, but the differences matter most to Machine Learning Engineers looking to deploy models within an existing environment.


### Notes
The way this lab is set up, it is very hard to change the dataset. However, if you are looking for sentiment data to do a project on, here are some good options:
1. [Financial Reviews](https://huggingface.co/datasets/financial_phrasebank) (this dataset is formatted similarly to the IMDB dataset)
2. [The Amazon Reviews](https://nijianmo.github.io/amazon/index.html) (note that there are samples at the bottom)
3. [The Yelp dataset](https://www.yelp.com/dataset)

### References
This lab is heavily based on a well-known lab for getting started with BERT. It was written in 2020. The earliest mention of it I can find [is here](https://www.kaggle.com/code/satyampd/imdb-sentiment-analysis-using-bert-w-huggingface/notebook).

## Part 1

Go to the [Hugging Face BERT model documentation](https://huggingface.co/docs/transformers/v4.34.1/en/model_doc/bert#resources). Scroll down to the Resources section and open at least one other notebook. The goal of this section is to get familiar with what BERT is used for and how the tasks are set up.

1. What is the title of the notebook?
2. What NLP problem does it solve (e.g., question answering, Named Entity Recognition, etc.)?
3. Does it use any frameworks (e.g., PyTorch, TensorFlow)?
4. Does it make any citations for the theory or framework?
5. Does it require an authentication token or API?

You do not need to turn this in -- the questions are here as a checklist to understand the tasks.


## Part 2


### Installations
The cell below contains the code for installations. This works on a local installation of Jupyter Notebook on Mac, Windows, or Linux. If you are using Colab, I think there is a GPU setting you have to change; [see here](https://towardsdatascience.com/how-to-set-started-with-tensorflow-using-keras-api-and-google-colab-5421e5e4ef56).

**What am I installing?**

* [TensorFlow](https://www.tensorflow.org/) is an open-source machine learning platform that has many helpful tools. It is similar to PyTorch.
* [transformers](https://huggingface.co/docs/transformers/index) is a Hugging Face library that allows us to work with Hugging Face models. That's how we will access BERT.

**Note**

These libraries can be a little finicky. You may have to restart your kernel to get them to work.

In [1]:
!pip install tensorflow
!pip install transformers

Collecting tensorflow
  Downloading tensorflow-2.15.0-cp310-cp310-macosx_12_0_arm64.whl (2.1 kB)
Collecting tensorflow-macos==2.15.0
  Downloading tensorflow_macos-2.15.0-cp310-cp310-macosx_12_0_arm64.whl (208.8 MB)
Collecting astunparse>=1.6.0
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting absl-py>=1.0.0
  Downloading absl_py-2.0.0-py3-none-any.whl (130 kB)
Collecting libclang>=13.0.0
  Downloading libclang-16.0.6-py2.py3-none-macosx_11_0_arm64.whl (20.6 MB)
Collecting google-pasta>=0.1.1
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting tenso

In [3]:
!pip install numpy



In [4]:
import numpy as np
import pandas as pd
import sklearn

import tensorflow as tf
import transformers
#tqdm is a progress bar
from tqdm import tqdm

ModuleNotFoundError: No module named 'numpy'

If you are doing this on Colab, you can use this link (to the course GitHub) to load the dataset.

In [3]:
df=pd.read_csv("https://raw.githubusercontent.com/michellejm/LLMs-fall-23/main/wk6-neural-networks-bert/in-progress/IMDB%20Dataset.csv")
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


# Sentiment Analysis with BERT
We will do the following operations to train a sentiment analysis model:
* Load the BERT Classifier and Tokenizer
* Turn the IMDB dataset into a processed dataset.
* Configure the Loaded BERT model and Train for Fine-tuning
* Make Predictions with the Fine-tuned Model

### Load classifier

You're calling the tokenizer and classifier from Hugging Face via the transformers library. We're using the version of [BERT that is for sequence classification](https://huggingface.co/docs/transformers/model_doc/bert#resources). Check out the model card to see what other versions there are.

Remember that the tokenizer for a dataset has to match how the training data for the model was tokenized.

We're also loading InputExample and InputFeatures from the transformers library. This allows us to look at a single train/test example, and understand how the model is operating.

In `transformers`, `sklearn`, and many other machine learning libraries, models are instantiated as objects. This is what the `model = ` and `tokenizer = ` lines are both doing.

In [4]:
# Loading the BERT Classifier and Tokenizer along with Input module
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Print a summary of the model to see its layers and parameter counts.

In [5]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


The sentiment is currently formatted as a categorical variable. We need a boolean.

We will frame it in terms of 'positive'. So, if a review is 'positive', it's True (1), 'negative' is False (0). There's no neutral (that's what allows us to do such simple encoding).

For a more complex multinomial example, [see this Hugging Face tutorial](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb).

In [6]:
# changing positive and negative into numeric values

def cat2num(value):
    if value=='positive':
        return 1
    else:
        return 0

df['sentiment']  =  df['sentiment'].apply(cat2num)
train = df[:45000]
test = df[45000:]

# Data Preprocessing
To train a model with BERT, we need to do some additional preprocessing (because that's how BERT was trained to begin with).
* Add special tokens to separate sentences
* Pass sequences of constant length (introduce padding)
* Create an array of 0s (pad token) and 1s (real token) called the attention mask

**Review this output**

Notice how this sentence was tokenized.
1. What case are the tokens in? (We are using BERT uncased.)
2. What do you think the `##` means?
3. What happened to punctuation?
4. What happened to possessive 's? What does that mean in terms of tokens?

In [7]:
example='CUNY is the nation’s largest urban public university, a transformative engine of social mobility that is a critical component of the lifeblood of New York City.'
tokens=tokenizer.tokenize(example)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
print(token_ids)

['cu', '##ny', 'is', 'the', 'nation', '’', 's', 'largest', 'urban', 'public', 'university', ',', 'a', 'transform', '##ative', 'engine', 'of', 'social', 'mobility', 'that', 'is', 'a', 'critical', 'component', 'of', 'the', 'life', '##blood', 'of', 'new', 'york', 'city', '.']
[12731, 4890, 2003, 1996, 3842, 1521, 1055, 2922, 3923, 2270, 2118, 1010, 1037, 10938, 8082, 3194, 1997, 2591, 12969, 2008, 2003, 1037, 4187, 6922, 1997, 1996, 2166, 26682, 1997, 2047, 2259, 2103, 1012]


The BERT vocabulary is fixed, with a size of ~30K tokens. Words that are not part of the vocabulary are represented as subwords and characters.

The tokenizer takes the input sentence and decides, for each word, whether to keep it as a whole word, split it into subwords (the first subword is written as-is and each subsequent subword is marked with `##`, as in the example above), or, as a last resort, decompose it into individual characters.
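
To make the splitting rule concrete, here is a minimal sketch of greedy longest-match-first subword splitting. It is an illustration only, not the code the `transformers` tokenizer runs: the function name `wordpiece` and the tiny `toy_vocab` are made up for this example, and the real vocabulary has about 30K entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (toy sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing in the vocabulary fits, so the word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the CUNY sentence above.
toy_vocab = {"transform", "##ative", "life", "##blood", "cu", "##ny", "new"}

for w in ["transformative", "lifeblood", "cuny", "new", "xyz"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

With this toy vocabulary, "transformative" and "lifeblood" split the same way they do in the tokenizer output above, and a word with no matching pieces falls back to `[UNK]`.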

### Special Tokens
* [SEP] - marks the end of a sentence
* [CLS] - we must add this token to the start of each sentence, so BERT knows we're doing classification
* [PAD] - fills out shorter sequences so that every input has the same length
* [UNK] - BERT understands tokens that were in the training set. Everything else can be encoded using the [UNK] (unknown) token
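
As a rough sketch of how these tokens end up in a model input, the plain-Python function below wraps a list of token ids in `[CLS]` and `[SEP]`, pads it to a fixed length, and builds the matching attention mask. The function name `encode_ids` is hypothetical; the ids 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` entries of the `bert-base-uncased` vocabulary. In the lab itself the tokenizer does all of this for you.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def encode_ids(token_ids, max_length):
    """Wrap ids in [CLS] ... [SEP], truncate, pad, and build the attention mask."""
    body = token_ids[: max_length - 2]       # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad            # pad on the right
    attention_mask += [0] * n_pad            # 0 = padding, ignored by the model
    return input_ids, attention_mask


# First five ids from the CUNY example above: 'cu', '##ny', 'is', 'the', 'nation'
ids, mask = encode_ids([12731, 4890, 2003, 1996, 3842], max_length=10)
print(ids)   # [101, 12731, 4890, 2003, 1996, 3842, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```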

1. ***convert_data_to_examples***: This function will accept our train and test datasets and convert each row into an InputExample object.


In [8]:
def convert_data_to_examples(train, test, review, sentiment):
    train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review],
                                                          label = x[sentiment]), axis = 1)

    validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review],
                                                          label = x[sentiment]), axis = 1,)

    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train,  test, 'review',  'sentiment')


2. ***convert_examples_to_tf_dataset***: This function tokenizes the InputExample objects, builds the required input format from the tokenized objects, and finally creates an input dataset that we can feed to the model.

In [9]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in tqdm(examples):
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,    # Add 'CLS' and 'SEP'
            max_length=max_length,    # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length",       # pads on the right up to max_length (replaces the deprecated pad_to_max_length=True)
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],input_dict["token_type_ids"], input_dict['attention_mask'])
        features.append(InputFeatures( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label) )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'review'
LABEL_COLUMN = 'sentiment'

Create the Training set
1. Call the function above to format the examples as needed.
2. Shuffle the dataset, split it into batches of 32, and repeat it twice.

In [None]:
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

 19%|█▊        | 8376/45000 [01:27<05:12, 117.18it/s]
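
The arithmetic behind `.shuffle(100).batch(32).repeat(2)` can be checked by hand. This is a small sketch that assumes the 45,000-row training split used above and assumes Keras treats one full pass through the repeated dataset as one epoch, which is my understanding of how `model.fit` handles a finite dataset when `steps_per_epoch` is not set.

```python
import math

n_train = 45_000   # rows in the training split
batch_size = 32    # from .batch(32)
repeats = 2        # from .repeat(2)

batches_per_pass = math.ceil(n_train / batch_size)               # the final batch is partial
last_batch_size = n_train - (batches_per_pass - 1) * batch_size  # rows left for that final batch
batches_per_epoch = batches_per_pass * repeats                   # length of the repeated dataset

print(batches_per_pass)    # 1407
print(last_batch_size)     # 8
print(batches_per_epoch)   # 2814
```

Under that assumption, training for 2 epochs shows the model each review 4 times. Note also that `shuffle(100)` only mixes examples within a 100-item buffer, so the ordering is only lightly randomized.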

Create the Test set
1. Call the function above to format the examples as needed
2. Just create a batch (randomization not so important)

In [None]:
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)

**Warning**
If you are doing this locally on your machine, this next step is the slow one. When you get here, run the code cell, make sure it is reporting loss and accuracy, and then walk away, possibly for a full day.

If you are doing this with GPUs, it may still take an hour or two to run.

**Key activity**

Watch the accuracy and loss change for a while here. Notice:
* Are the changes in each value large or small?
* Does the accuracy ever go down?
* Does the loss ever go up?

Remember that the *loss*, not the *accuracy*, is being optimized. This means that the loss can decrease independently of the accuracy.
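
A small made-up example shows how the two numbers can move apart. The probabilities below are invented for illustration (they are not output from this lab's model), and the helper names are hypothetical. Each value is the probability a model assigns to the correct label of one review.

```python
import math


def mean_cross_entropy(p_true):
    """Average of -log(probability assigned to the correct class)."""
    return sum(-math.log(p) for p in p_true) / len(p_true)


def accuracy(p_true):
    """With two classes, a prediction is correct when p(correct class) > 0.5."""
    return sum(p > 0.5 for p in p_true) / len(p_true)


# Invented probabilities for four reviews.
model_a = [0.55, 0.55, 0.55, 0.05]  # barely right three times, badly wrong once
model_b = [0.90, 0.90, 0.45, 0.45]  # confident twice, narrowly wrong twice

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, f"accuracy={accuracy(probs):.2f}", f"loss={mean_cross_entropy(probs):.3f}")
```

Model B has the lower loss even though it gets fewer reviews right, because cross-entropy punishes a confident mistake (0.05) much more than a narrow one (0.45).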

**Optionally**

1. Remember that the learning rate is the step size at each iteration while the algorithm is minimizing the loss. Try changing the learning rate and see what happens (feel free to interrupt the cell block and start it over).
2. Change the number of epochs to 3 (only do this if you are on GPUs). Did it improve?
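
To build intuition for the learning rate before touching the real model, here is a toy sketch: plain gradient descent on a one-variable function. It is not BERT and not Adam, and the function and learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)  # step against the gradient
    return w


# The minimum is at w = 3. Compare a tiny, a moderate, and a too-large step size.
for lr in (1e-4, 1e-1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

In this toy problem the tiny step size barely moves away from the starting point, the moderate one lands very close to 3, and the large one overshoots further on every step and diverges.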




In [None]:
# If you are on a new Mac, change 'tf.keras.optimizers.Adam' to 'tf.keras.optimizers.legacy.Adam'; it will likely be faster.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-8),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

1. What is your accuracy and loss?
2. Did they improve between epochs 1 and 2? In which numerical direction does each value move when it improves?
3. What do you think would happen with another epoch?
4. What do you think would happen with a smaller learning rate?

In [None]:
pred_sentences = ['worst movie of my life, will never watch movies from this series', 'Wow, blew my mind, what a movie by Marvel, animation and story is amazing']

In [None]:
# Tokenize the sentences before sending them into the trained model
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
# tf_outputs[0] holds the logits; softmax over the last axis (axis=-1) turns them into class probabilities
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
# argmax over axis=1 picks the index of the most probable class for each sentence
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": ", labels[label[i]])
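
The softmax and argmax steps in the prediction cell can be reproduced in plain Python. This is a sketch with invented logits (`fake_logits` is not model output) and a hypothetical helper `predict_labels`. It only illustrates how one row of two scores becomes a label.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_labels(logit_rows, labels):
    """Pick the label with the highest probability for each row of logits."""
    results = []
    for row in logit_rows:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax over the classes
        results.append(labels[best])
    return results


labels = ["Negative", "Positive"]
# Invented logits: one row per sentence, one column per class.
fake_logits = [[2.0, -1.0], [-0.5, 1.5]]

print([round(p, 3) for p in softmax(fake_logits[0])])
print(predict_labels(fake_logits, labels))
```

Because softmax preserves the ordering of the scores, taking the argmax of the raw logits would give the same labels. The softmax step matters when you want to read the values as probabilities.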

# References:
1. https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671
2. https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca