<a href="https://colab.research.google.com/github/laxmiharikumar/transformers/blob/main/TF_SimpleTrainingWithTransTrainers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!nvidia-smi

In [None]:
! pip install transformers datasets --quiet --upgrade 

## Load the dataset emotion

In [None]:
from datasets import load_dataset

emotion_dataset = load_dataset("emotion")
emotion_dataset

In [None]:
emotion_dataset["train"]["text"][23]

In [None]:
## While visualizing and exploring the dataset it is better to use pandas
emotion_df = emotion_dataset["train"].to_pandas()
emotion_df

In [None]:
# Understand what the labels are
features = emotion_dataset["train"].features
features

In [None]:
features["label"].int2str(3)

In [None]:
# Create an id-to-label dictionary
id2label = {idx:features["label"].int2str(idx) for idx in range(6)}
id2label

In [None]:
# Create a label to id dictionary
label2id = {value:key for key,value in id2label.items()}
label2id

In [None]:
## Check how many you have in each label category
emotion_df["label"].value_counts(normalize=True).sort_index()

The labels are heavily imbalanced: about 30% of the examples are sadness, while surprise accounts for only about 3%. If we train the model naively on this distribution, it is likely to become very good at predicting the majority classes while struggling on the rare ones.
One option is to upsample the rare classes, i.e. duplicate their examples until the distribution is even. However, deep learning models like transformers are very good at memorizing patterns in the data, so with many duplicated examples the model may simply memorize the duplicates and then generalize poorly to new examples of that class in production.
Instead, we are going to modify the loss function during training. This introduces a bias directly at the level of the loss function that reflects how the classes are distributed, which should encourage the model to pay more attention to the rare classes.
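
To make the idea of weighting the loss concrete, here is a minimal pure-Python sketch of a class-weighted cross-entropy for a single example. The helper names and the two-class weights are hypothetical and only for illustration; Keras applies the same kind of per-example scaling when `class_weight` is passed to `model.fit`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, label, class_weights):
    # Standard cross-entropy for one example, scaled by the weight of its true class
    probs = softmax(logits)
    return -class_weights[label] * math.log(probs[label])

# Hypothetical weights: class 0 is common (low weight), class 1 is rare (high weight)
example_weights = {0: 0.7, 1: 0.97}

# The model is equally unsure in both cases, but the rare class is penalized more
uncertain_logits = [0.0, 0.0]
loss_common = weighted_cross_entropy(uncertain_logits, 0, example_weights)
loss_rare = weighted_cross_entropy(uncertain_logits, 1, example_weights)
print(loss_common, loss_rare)
```

With identical predictions, a mistake on the rare class contributes more to the total loss, so the gradient updates pay more attention to it.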

## Prepare the dataset


MiniLM is smaller than BERT but only about 1% lower in accuracy, so it is a good model to start with: we can iterate faster.

In [None]:
# Tokenize the data
from transformers import AutoTokenizer

checkpoint = "microsoft/MiniLM-L12-H384-uncased"
# return_tensors is an argument for the tokenizer call (or the data collator), not for from_pretrained
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [None]:
emotion_dataset["train"]["text"][:1]

In [None]:
# Feed in an example
tokenizer(emotion_dataset["train"]["text"][:1])

- input_ids: the token ids that we feed to the model
- token_type_ids: indicates whether each token belongs to sentence 1 or sentence 2 of a sentence pair
- attention_mask: indicates which tokens are real tokens and which are padding
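
The relationship between padding and the attention mask can be sketched in plain Python. This is only an illustration: `pad_batch` is a hypothetical helper, the token ids are made up, and a pad token id of 0 is an assumption. In the notebook itself, padding is handled by the data collator.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    # Pad every sequence to the length of the longest one in the batch
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths
toy_batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(toy_batch)
```

The shorter sequence gets two padding ids appended, and its attention mask ends in two zeros so those positions are ignored.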

In [None]:
# Apply this tokenizer to all the examples in the dataset
def tokenize_text(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)

In [None]:
emotion_dataset = emotion_dataset.map(tokenize_text, batched=True)
emotion_dataset

## Dealing with imbalanced data

The class frequencies in the data are uneven, so we are going to introduce weights (coefficients) in the loss function: the loss of each example is multiplied by a weight that depends on how frequent its class is in the data. If we require these weights to range from 0 to 1, we assign a high weight to rare classes and a low weight to common classes so that the model doesn't become too biased toward the majority classes.
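
As a small worked example of this weighting scheme, the sketch below computes `1 - frequency` for each class on a toy label list. The function name and the toy labels are hypothetical; the notebook applies the same formula to the real training labels with pandas.

```python
from collections import Counter

def frequency_complement_weights(labels):
    # Weight of a class = 1 - (share of examples with that class),
    # so rare classes end up close to 1 and common classes closer to 0
    counts = Counter(labels)
    total = len(labels)
    return {label: 1 - counts[label] / total for label in sorted(counts)}

# Toy labels: class 0 is common, class 2 is rare
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = frequency_complement_weights(toy_labels)
print(toy_weights)
```

Here class 0 makes up 60% of the labels and gets a weight of about 0.4, while class 2 makes up 10% and gets a weight of about 0.9.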

In [None]:
len(emotion_df) # This is the training dataset

In [None]:
tmp_weights = (1 - (emotion_df["label"].value_counts().sort_index() / len(emotion_df))).values
tmp_weights, type(tmp_weights)

In [None]:
# Keras expects the class weights as a dictionary mapping class index to weight

class_weights = {idx:tmp_weights[idx] for idx in range(len(tmp_weights))}
class_weights

In [None]:
## Rename the label column to labels
emotion_dataset = emotion_dataset.rename_column("label", "labels")
emotion_dataset

## Create the model

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=6,
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           from_pt=True)

In [None]:
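import numpy as np
import tensorflow as tf
from sklearn import metrics

def compute_metrics(pred):
  # KerasMetricCallback passes a (predictions, labels) tuple; pred[0] holds the logits.
  # val_labels is a global that is defined in a later cell, before training starts.
  labels = val_labels
  preds = np.argmax(tf.squeeze(pred[0]), axis=1)
  # Weighted F1 averages the per-class F1 scores by the number of examples in each class
  f1 = metrics.f1_score(labels, preds, average="weighted")
  return {"f1": f1}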

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [None]:
tf_train_dataset = emotion_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=64
)

tf_validation_dataset = emotion_dataset["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=64
)

In [None]:
emotion_dataset["validation"]["labels"][:10]

In [None]:
import numpy as np

val_labels = np.concatenate([y for x, y in tf_validation_dataset], axis=0)
val_labels[:10]

In [None]:
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.optimizers import Adam


batch_size = 64
num_epochs = 5
# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, end_learning_rate=0.0, decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)
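
To see what this schedule does, here is a pure-Python sketch of the linear decay that `PolynomialDecay` produces with its default power of 1. It assumes the training split has 16,000 examples (so 250 batches of 64 per epoch); `linear_decay` is a hypothetical helper, not part of Keras.

```python
def linear_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1250):
    # Linear interpolation from initial_lr to end_lr over decay_steps,
    # holding end_lr once decay_steps has been reached
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) + end_lr

# Assumed sizes: 16,000 training examples, batch size 64, 5 epochs
steps_per_epoch = 16000 // 64
total_steps = steps_per_epoch * 5

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay(step, decay_steps=total_steps))
```

The learning rate starts at 5e-5, is halved midway through training, and reaches 0 at the last step.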

In [None]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer=opt,
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)


In [None]:
!huggingface-cli login

In [None]:
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_dataset)
push_to_hub_callback = PushToHubCallback(
    output_dir="./minilm-finetuned-emotion", tokenizer=tokenizer, hub_model_id="laxsvips/minilm-finetuned-emotion"
)

In [None]:
history_1 = model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs,
    class_weight=class_weights,
    callbacks=[push_to_hub_callback, metric_callback]
)

In [None]:
# Plot the validation and training data separately
from matplotlib import pyplot as plt
def plot_loss_curves(history):
  """
  Plots separate curves for training and validation loss and accuracy.
  """
  loss = history.history['loss']
  val_loss = history.history['val_loss']

  accuracy = history.history['accuracy']
  val_accuracy = history.history['val_accuracy']

  epochs = range(len(history.history['loss']))

  # Plot loss
  plt.plot(epochs, loss, label='training_loss')
  plt.plot(epochs, val_loss, label='val_loss')
  plt.title('Loss')
  plt.xlabel('Epochs')
  plt.legend()

  # Plot accuracy
  plt.figure()
  plt.plot(epochs, accuracy, label='training_accuracy')
  plt.plot(epochs, val_accuracy, label='val_accuracy')
  plt.title('Accuracy')
  plt.xlabel('Epochs')
  plt.legend();

In [None]:
# Plot the loss and accuracy curves
plot_loss_curves(history_1)

In [None]:
model.evaluate(tf_validation_dataset)

In [None]:
# Despite the variable name, predict returns raw logits here, not probabilities
model_pred_probs = model.predict(tf_validation_dataset)
model_pred_probs[:10]

In [None]:
model_pred_probs[0][3]

In [None]:
a = tf.squeeze(model_pred_probs[0])
np.argmax(a[3])

In [None]:
import tensorflow as tf
import numpy as np

model_preds = np.argmax(tf.squeeze(model_pred_probs[0]), axis=1)
model_preds[:10]

In [None]:
# Materialize the validation batches to inspect them
val_batches = list(tf_validation_dataset)
val_batches[0]

In [None]:
from sklearn import metrics

def calculate_results(y_true, y_pred):
  eval_metrics = {}
  eval_metrics["accuracy"] = metrics.accuracy_score(y_true, y_pred)
  eval_metrics["precision"] = metrics.precision_score(y_true, y_pred, average='weighted') # multiclass
  eval_metrics["recall"] = metrics.recall_score(y_true, y_pred, average='weighted') # multiclass
  eval_metrics["f1_score"] = metrics.f1_score(y_true, y_pred, average='weighted') # multiclass

  return eval_metrics

In [None]:
model_results = calculate_results(val_labels, model_preds)
model_results

In [None]:
## Use your finetuned model
from transformers import pipeline

model_cpt = "laxsvips/minilm-finetuned-emotion"
# top_k=None returns the scores for all labels (return_all_scores is deprecated in recent versions)
pipe = pipeline("text-classification", model=model_cpt, top_k=None)

In [None]:
pipe("I am really excited about part 2 of the Hugging Face course")

In [None]:
predicted_scores = pipe("I am so glad you could help me")
predicted_scores