❗**WARNING**❗

This lab cannot be done in one sitting. There is a step that takes hours. 

## BERT lab
LLM's and ChatGPT | Fall 2023 | McSweeney | CUNY Graduate Center

**Due:** October 29


### Background
The purpose of this lab is to see how to apply BERT to a simple sentiment analysis task. BERT can be used for a wide variety of tasks, including text classification (e.g., sentiment analysis), question answering, text summarization, and more.
This model is [hosted on Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert). Hugging Face is a repository of machine learning models - a bit like GitHub. 


#### Notes
This is a short lab using the same dataset throughout. Feel free to switch it up, but once you are comfortable with how the different algorithms approach the task of breaking up text, move on.

### Installations
The cell below contains commented code for installations. This works on a local installation of Jupyter Notebook on Mac, Windows, or Linux. If you are using Colab, I think there is a GPU setting you have to change ([see here](https://towardsdatascience.com/how-to-set-started-with-tensorflow-using-keras-api-and-google-colab-5421e5e4ef56)).

**What am I installing?**

* [TensorFlow](https://www.tensorflow.org/) is an open-source machine learning platform that has many helpful tools. It is similar to PyTorch.
* [transformers](https://huggingface.co/docs/transformers) is a Hugging Face library that allows us to work with Hugging Face models. That's how we will access BERT.

**Note**

These libraries can be a little finicky. You may have to upgrade and restart your kernel a few times to get them to work. After you install them, re-comment the install lines and restart your kernel.

In [3]:
#!pip install tensorflow
#!pip install transformers 
#!pip install --upgrade transformers
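
After restarting the kernel, it can help to confirm which versions actually got installed. The snippet below is a minimal sketch using only the Python standard library; the helper name `installed_version` is made up for this example and is not part of either library.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two libraries this lab depends on.
for pkg in ("tensorflow", "transformers"):
    print(pkg, installed_version(pkg) or "not installed")
```

If either line prints "not installed", uncomment the matching `pip` line above, run it, and restart the kernel again.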

In [4]:
import numpy as np
import pandas as pd
import sklearn

import tensorflow as tf
import transformers
from tqdm import tqdm

In [5]:
df=pd.read_csv("IMDB Dataset.csv")
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


**BERT stands for Bidirectional Encoder Representations from Transformers, and it is a state-of-the-art machine learning model used for NLP tasks. Jacob Devlin and his colleagues developed BERT at Google in 2018. They trained BERT on English Wikipedia (2,500M words) and BooksCorpus (800M words) and achieved the best accuracies on several NLP tasks in 2018. There are two pre-trained general BERT variations: the base model is a 12-layer, 768-hidden, 12-head, 110M-parameter neural network architecture, whereas the large model is a 24-layer, 1024-hidden, 16-head, 340M-parameter neural network architecture.**

# Sentiment Analysis with BERT
We will do the following operations to train a sentiment analysis model:
* Install the Transformers library
* Load the BERT classifier and tokenizer along with the input modules
* Download the IMDB reviews data and create a processed dataset (this will take several operations)
* Configure the loaded BERT model and train it for fine-tuning
* Make predictions with the fine-tuned model

In [12]:
# Loading the BERT Classifier and Tokenizer along with Input module
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

In [13]:
model.summary()

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
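
As a sanity check, the parameter counts in this summary can be reproduced by hand. The sketch below assumes the standard `bert-base-uncased` configuration: a vocabulary of 30,522 tokens, 512 positions, 2 segment types, and a feed-forward size of 3,072. Those four values are not printed in the summary; they are assumptions taken from the model's published configuration. The helper `dense` is just a name chosen for this example.

```python
hidden = 768          # hidden size
layers = 12           # number of Transformer layers
vocab_size = 30522    # assumption: bert-base-uncased vocabulary size
max_positions = 512   # assumption: maximum sequence length
type_vocab = 2        # assumption: segment (token type) embeddings
intermediate = 3072   # assumption: feed-forward inner size
num_labels = 2        # negative / positive


def dense(n_in, n_out):
    """Parameters in a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift value per hidden unit

# Token, position and segment embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by a LayerNorm.
attention = 4 * dense(hidden, hidden) + layer_norm
# Two dense layers (expand, then project back), followed by a LayerNorm.
feed_forward = dense(hidden, intermediate) + dense(intermediate, hidden) + layer_norm
# Dense layer applied to the [CLS] position.
pooler = dense(hidden, hidden)

bert_main = embeddings + layers * (attention + feed_forward) + pooler
classifier = dense(hidden, num_labels)

print("bert:", bert_main)
print("classifier:", classifier)
print("total:", bert_main + classifier)
```

Under these assumptions the three printed numbers agree with the `bert`, `classifier`, and `Total params` rows of the summary. Note that the classifier head is tiny compared with the pre-trained body; it is the only part that starts from random weights.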


In [14]:
# changing positive and negative into numeric values

def cat2num(value):
    if value=='positive': 
        return 1
    else: 
        return 0
    
df['sentiment']  =  df['sentiment'].apply(cat2num)
train = df[:45000]
test = df[45000:]
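
One thing to watch for with the cell above: if it is run a second time, `cat2num` receives values that are already 0 or 1, none of them equals `'positive'`, and every label silently becomes 0. The sketch below shows one possible stricter mapping that is safe to re-run and that raises an error on unexpected values. `LABEL_TO_ID` and `cat2num_strict` are names invented for this illustration, not part of the lab code.

```python
LABEL_TO_ID = {"negative": 0, "positive": 1}


def cat2num_strict(value):
    """Map a sentiment string to 0/1; leave already-converted values alone."""
    if value in LABEL_TO_ID.values():
        return value  # already numeric, so re-running is harmless
    return LABEL_TO_ID[value]  # raises KeyError for anything unexpected


print([cat2num_strict(v) for v in ["positive", "negative", 1, 0]])
```

It could be used as a drop-in replacement in `df['sentiment'].apply(...)`.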

# Data Preprocessing
To train a model with BERT, we need to do some additional preprocessing. Let's go through the steps one by one!
* Add special tokens to separate sentences and do classification
* Pass sequences of constant length (introduce padding)
* Create array of 0s (pad token) and 1s (real token) called attention mask

In [15]:
# But first, let's look at an example of the BERT tokenizer in action

example='In this Kaggle notebook, I will do sentiment analysis using BERT with Huggingface'
tokens=tokenizer.tokenize(example)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)

['in', 'this', 'ka', '##ggle', 'notebook', ',', 'i', 'will', 'do', 'sentiment', 'analysis', 'using', 'bert', 'with', 'hugging', '##face']
[1999, 2023, 10556, 24679, 14960, 1010, 1045, 2097, 2079, 15792, 4106, 2478, 14324, 2007, 17662, 12172]


- > The original words have been split into smaller subwords and characters. This is because BERT's vocabulary is fixed at a size of ~30K tokens. Words that are not part of the vocabulary are represented as subwords and characters.

- > The tokenizer takes the input sentence and decides whether to keep each word whole, split it into subwords (with a special representation for the first subword and subsequent subwords; see the ## symbol in the example above), or, as a last resort, decompose the word into individual characters. Because of this, we can always represent a word as, at the very least, the collection of its individual characters.

Reference: https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca
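
To make the splitting rule concrete, here is a simplified sketch of greedy longest-match-first subword splitting, which is the general idea behind WordPiece tokenization. It is not the Hugging Face implementation: the function name `wordpiece_tokenize` and the tiny `toy_vocab` are made up for this example, and the real tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30K entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into subwords, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece in the vocabulary fits
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"ka", "##ggle", "hugging", "##face", "notebook"}

for w in ["kaggle", "huggingface", "notebook", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, "kaggle" and "huggingface" come out as two pieces each, "notebook" stays whole, and "xyz" falls back to `[UNK]` because none of its characters are in the vocabulary.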

### Special Tokens
* [SEP] - marker for the end of a sentence
* [CLS] - we must add this token to the start of each sentence, so BERT knows we're doing classification
* [PAD] - special token used for padding sequences to a constant length
* [UNK] - BERT understands tokens that were in the training set. Everything else can be encoded using the [UNK] (unknown) token
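
The sketch below shows how these special tokens, the padding, and the attention mask fit together for a single short review. It works on token strings rather than token IDs to keep things readable, and the function name `add_special_tokens_and_pad` is hypothetical; in the lab itself the tokenizer does all of this for us.

```python
def add_special_tokens_and_pad(tokens, max_length, pad_token="[PAD]"):
    """Wrap tokens in [CLS]/[SEP], pad to max_length, and build the attention mask."""
    # Reserve two positions for [CLS] and [SEP], truncating if necessary.
    tokens = tokens[: max_length - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    attention_mask = [1] * len(sequence)  # 1 = real token
    n_pad = max_length - len(sequence)
    sequence += [pad_token] * n_pad
    attention_mask += [0] * n_pad         # 0 = padding, ignored by the model
    return sequence, attention_mask


seq, mask = add_special_tokens_and_pad(["a", "wonderful", "movie"], max_length=8)
print(seq)
print(mask)
```

Every sequence comes out exactly `max_length` long, which is what lets reviews of different lengths be stacked into one batch.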

We will use two helper functions to prepare the data:

1. ***convert_data_to_examples***: This will accept our train and test datasets and convert each row into an InputExample object.
2. ***convert_examples_to_tf_dataset***: This function will tokenize the InputExample objects, create the required input format from the tokenized objects, and finally create an input dataset that we can feed to the model.


In [16]:
def convert_data_to_examples(train, test, review, sentiment): 
    train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review], 
                                                          label = x[sentiment]), axis = 1)

    validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[review], 
                                                          label = x[sentiment]), axis = 1,)
  
    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train,  test, 'review',  'sentiment')
                                                                         

In [17]:
# train_InputExamples[0]

In [18]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in tqdm(examples):
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,    # Add 'CLS' and 'SEP'
            max_length=max_length,    # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length",    # pads every sequence to max_length (on the right by default); replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],input_dict["token_type_ids"], input_dict['attention_mask'])
        features.append(InputFeatures( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label) )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'review'
LABEL_COLUMN = 'sentiment'

In [19]:
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(32).repeat(2)

100%|████████████████████████████████████| 45000/45000 [01:01<00:00, 728.72it/s]


In [20]:
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)

100%|██████████████████████████████████████| 5000/5000 [00:07<00:00, 707.71it/s]


In [21]:
## Our datasets of processed input sequences are now ready to be fed to the model.

**Warning** 
This next step is painfully slow. When you get here, run the code cell, make sure it is reporting loss and accuracy, and then walk away, possibly for hours.

In [None]:
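#On a new Mac, 'tf.keras.optimizers.legacy.Adam' (used below) is faster than 'tf.keras.optimizers.Adam'; on other machines you can switch to 'tf.keras.optimizers.Adam'
#A learning rate of 3e-5 is a typical value for fine-tuning BERT
model.compile(optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 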
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
   2439/Unknown - 27258s 11s/step - loss: 0.0087 - accuracy: 0.9996
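
To get a feel for how long to walk away, here is a rough back-of-the-envelope estimate. It assumes the 11 seconds per step shown in the progress line stays constant, which will not hold on a different machine. Keras prints "Unknown" for the step total because a dataset built with `from_generator` does not report its length, so the total has to be worked out by hand.

```python
import math

n_train = 45_000        # rows in the training split
batch_size = 32         # from .batch(32)
repeats = 2             # from .repeat(2) on the training dataset
epochs = 2              # from model.fit(..., epochs=2)
seconds_per_step = 11   # read off the progress line; machine dependent

batches_per_pass = math.ceil(n_train / batch_size)  # the last batch is partial
steps_per_epoch = batches_per_pass * repeats
total_steps = steps_per_epoch * epochs

hours_per_epoch = steps_per_epoch * seconds_per_step / 3600
total_hours = total_steps * seconds_per_step / 3600

print("steps per epoch:", steps_per_epoch)
print("hours per epoch: %.1f" % hours_per_epoch)
print("hours in total:  %.1f" % total_hours)
```

Note that `.repeat(2)` combined with `epochs=2` means the model sees the training data four times in total. Dropping the `.repeat(2)` or using a single epoch would roughly halve the wait.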

In [None]:
pred_sentences = ['worst movie of my life, will never watch movies from this series', 'Wow, blew my mind, what a movie by Marvel, animation and story is amazing']

In [None]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')   # tokenize the sentences before sending them to our trained model
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)       # tf_outputs[0] holds the logits; softmax over the last axis turns them into class probabilities
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": ", labels[label[i]])
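
The last two steps of the prediction cell (softmax, then argmax) can be illustrated without TensorFlow. The sketch below uses made-up logit values purely for demonstration; they are not outputs of the trained model.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["Negative", "Positive"]
example_logits = [[2.0, -1.0], [-0.5, 1.5]]  # made-up values, one row per sentence

for row in example_logits:
    probs = softmax(row)
    predicted = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    print(labels[predicted], [round(p, 3) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same label as taking the argmax of the raw logits. The softmax step matters only if you also want to report how confident the model is.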

# References:
1. https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671
2. https://medium.com/@dhartidhami/understanding-bert-word-embeddings-7dc4d2ea54ca