<a href="https://colab.research.google.com/github/ollyekhan/SocialMediaMiningFinalProject/blob/main/Fine_Tuning_flan_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Transformers installation
# !pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# Fine-tune a pretrained model

There are significant benefits to using a pretrained model. It reduces computation costs and your carbon footprint, and it lets you use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it further on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice:

* Fine-tune a pretrained model with 🤗 Transformers [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer).
* Fine-tune a pretrained model in TensorFlow with Keras.
* Fine-tune a pretrained model in native PyTorch.

<a id='data-processing'></a>

## Prepare a dataset

Before you can fine-tune a pretrained model, download a dataset and prepare it for training. The previous tutorial showed you how to process data for training, and now you get an opportunity to put those skills to the test!

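The original tutorial loads the [Yelp Reviews](https://huggingface.co/datasets/yelp_review_full) dataset at this point. This notebook instead loads a local CSV file, `dataset/fine_tuning_dataset_seed_1.csv`, with `question`, `context`, `answer`, and `seed` columns. The first attempt is kept below as commented-out code: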

In [2]:
# # flan-t5
# import pandas as pd
# from datasets import Dataset

# # Load your local dataset using pandas
# df = pd.read_csv("dataset/fine_tuning_dataset_seed_1.csv")

# # Convert float columns to string
# df['question'] = df['question'].astype(str)
# df['context'] = "Question: " + df['question'] + "\n" + " " + df['context'].astype(str)
# df['answer'] = df['answer'].astype(str)

# # Convert the DataFrame into a Hugging Face dataset
# dataset_dict = {
#     "context": df["context"],
#     "answer": df["answer"]
# }

# # Create a Hugging Face dataset
# custom_dataset = Dataset.from_dict(dataset_dict)

# # print(custom_dataset["context"][2])

# # Now, you can split the dataset using the seed value
# # You would use the same seed number used when saving the dataset
# seed_num = df['seed'].iloc[0]  # Extract the seed value from the DataFrame
# splits = custom_dataset.train_test_split(test_size=0.33, seed=seed_num)

In [3]:
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# def tokenize_function(examples):
#     return tokenizer(examples["context"], examples["answer"], padding="max_length", truncation=True, max_length=512)
# # tokenized_datasets = splits.map(tokenize_function, batched=True)

In [4]:
# train_dataset = splits["train"]
# validation_dataset = splits["test"]

In [5]:
# model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

In [6]:
# batch_size = 8
# epochs = 3
# learning_rate = 3e-5

In [7]:
# import tensorflow as tf
# optimizer = tf.keras.optimizers.Adam(learning_rate)
# loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

In [8]:
# for epoch in range(epochs):
#     print(f"Epoch {epoch + 1}/{epochs}")
#     for step, batch in enumerate(train_dataset):
#         with tf.GradientTape() as tape:
#             inputs = tokenizer(batch["context"], padding=True, return_tensors="tf", truncation=True)
#             outputs = model(inputs)
#             loss = outputs.loss
#         gradients = tape.gradient(loss, model.trainable_variables)
#         optimizer.apply_gradients(zip(gradients, model.trainable_variables))
#         if step % 100 == 0:
#            print(f"Step {step}/{len(train_dataset)}: Loss: {loss:.4f}")

In [9]:
# # Save the fine-tuned model
# output_dir = "fine_tuned_model"
# model.save_pretrained(output_dir)
# tokenizer.save_pretrained(output_dir)

In [10]:
# validation_loss = []
# for batch in validation_dataset:
#     inputs = tokenizer(batch["input_text"], padding=True, return_tensors="tf", truncation=True)
#     outputs = model(inputs, labels=batch["target_text"], training=False)
#     validation_loss.append(outputs.loss)
# average_validation_loss = tf.reduce_mean(validation_loss)
# print("Average Validation Loss:", average_validation_loss)

# Another attempt

In [None]:
%%bash
pip install nltk
pip install datasets
pip install transformers[torch]
pip install tokenizers
pip install evaluate
pip install rouge_score
pip install sentencepiece
pip install huggingface_hub

In [25]:
import nltk
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [26]:
# Load the tokenizer, model, and data collator
MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [27]:
# flan-t5
import pandas as pd
from datasets import Dataset

# Load your local dataset using pandas
df = pd.read_csv("dataset/fine_tuning_dataset_seed_1.csv")

# Cast the text columns to str so non-string values (e.g. NaN floats) can be concatenated
df['question'] = df['question'].astype(str)
df['context'] = df['question'] + "\n" + " " + df['context'].astype(str)
df['answer'] = df['answer'].astype(str)

# Convert the DataFrame into a Hugging Face dataset
dataset_dict = {
    "context": df["context"],
    "answer": df["answer"]
}

# Create a Hugging Face dataset
custom_dataset = Dataset.from_dict(dataset_dict)

# Split the dataset with the seed stored in the CSV so the split is reproducible.
# This should be the same seed number that was used when the dataset was saved.
seed_num = int(df['seed'].iloc[0])  # Extract the seed value from the DataFrame as a plain int
splits = custom_dataset.train_test_split(test_size=0.33, seed=seed_num)

In [28]:
prefix = "Answer the following question: "

def preprocess_function(examples):
   """Add prefix to the sentences, tokenize the text, and set the labels"""
   # The "inputs" are the prefixed question-and-context strings:
   inputs = [prefix + doc for doc in examples["context"]]
   model_inputs = tokenizer(inputs, max_length=512, truncation=True)

   # The "labels" are the tokenized outputs:
   labels = tokenizer(text_target=examples["answer"],
                      max_length=512,
                      truncation=True)

   model_inputs["labels"] = labels["input_ids"]
   return model_inputs
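
As a quick illustration of what the tokenizer actually receives, the sketch below rebuilds one input string by hand. The question and context are hypothetical placeholders rather than rows from the CSV, and `build_prompt` is a helper introduced only for this example; it mirrors the concatenation in cell 27 and the prefix added in `preprocess_function`.

```python
PREFIX = "Answer the following question: "


def build_prompt(question: str, context: str) -> str:
    """Rebuild the string that preprocess_function passes to the tokenizer."""
    # Cell 27 joins the question and the context with a newline plus a space.
    combined = question + "\n" + " " + context
    # preprocess_function then prepends the task prefix.
    return PREFIX + combined


# Hypothetical row, used only for illustration.
prompt = build_prompt(
    "Who wrote Hamlet?",
    "Hamlet is a tragedy by William Shakespeare.",
)
print(repr(prompt))
```

Printing the `repr` makes the embedded newline and the extra space visible, which helps when checking that the prompt has the intended layout before training.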

In [29]:
tokenized_dataset = splits.map(preprocess_function, batched=True)

Map:   0%|          | 0/402 [00:00<?, ? examples/s]

Map:   0%|          | 0/198 [00:00<?, ? examples/s]
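
The two progress bars above report 402 and 198 examples, which is consistent with a 600-row dataset split with `test_size=0.33`. The sketch below reproduces that arithmetic, assuming `train_test_split` follows the scikit-learn convention of rounding the test size up and giving the remainder to the training set; `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumption: the test size is rounded up, as in scikit-learn.
    n_test = math.ceil(test_size * n_samples)
    # Whatever is left goes to the training split.
    n_train = n_samples - n_test
    return n_train, n_test


# 600 is inferred from the two progress bars (402 + 198).
n_train, n_test = split_sizes(600, 0.33)
print(n_train, n_test)
```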

In [30]:
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")
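
To make the ROUGE scores reported during training easier to interpret, here is a simplified sketch of a ROUGE-1 F1 score: the unigram overlap between a prediction and a reference. It is not the `rouge_score` implementation that `evaluate.load("rouge")` wraps (it uses plain whitespace tokenization and no stemming), and `rouge1_f1` is a name chosen only for this example.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap with whitespace tokenization."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision is 3/3 and recall is 3/6 for this pair.
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```

A short prediction that is fully contained in a longer reference gets perfect precision but low recall, so the F1 score penalizes answers that are correct but incomplete.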

In [31]:
def compute_metrics(eval_preds):
   preds, labels = eval_preds

   # Replace -100 (label positions ignored by the loss) with the pad token ID,
   # then decode preds and labels back to text
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

   result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

   return result
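
The `np.where` call in `compute_metrics` exists because padded label positions are filled with `-100`, which is not a valid token ID and cannot be decoded. The sketch below shows a plain-Python equivalent of that replacement on a small hypothetical batch. The token IDs are made up, and the pad token ID of `0` is an assumption standing in for `tokenizer.pad_token_id`.

```python
LABEL_PAD_ID = -100  # value used to mark label positions the loss ignores
PAD_TOKEN_ID = 0     # assumption: stands in for tokenizer.pad_token_id

# Hypothetical batch of label IDs; the second row was padded to length 4.
labels = [
    [37, 1525, 19, 1],
    [2163, 1, -100, -100],
]

# Plain-Python equivalent of np.where(labels != -100, labels, pad_token_id).
restored = [
    [tok if tok != LABEL_PAD_ID else PAD_TOKEN_ID for tok in row]
    for row in labels
]
print(restored)
```

After the replacement, decoding with `skip_special_tokens=True` drops the pad tokens, so the padding does not leak into the decoded reference text.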

In [32]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="./results",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   logging_steps=100,
   predict_with_generate=True,
   push_to_hub=False
)

In [33]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

In [34]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | Rougel | Rougelsum |
|---|---|---|---|---|---|---|
| 1 | No log | 1.681669 | 0.265216 | 0.151558 | 0.241351 | 0.242295 |
| 2 | 1.929800 | 1.624817 | 0.269894 | 0.150563 | 0.242129 | 0.243261 |
| 3 | 1.929800 | 1.621025 | 0.271891 | 0.157727 | 0.246836 | 0.247938 |




TrainOutput(global_step=153, training_loss=1.8056269128338185, metrics={'train_runtime': 216.2365, 'train_samples_per_second': 5.577, 'train_steps_per_second': 0.708, 'total_flos': 825817367052288.0, 'train_loss': 1.8056269128338185, 'epoch': 3.0})
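
The numbers in this output can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation, which is consistent with the reported `global_step=153`. It also suggests why the first epoch shows "No log" and why epochs 2 and 3 show the same training loss: with `logging_steps=100`, the only logging step that falls inside the run is step 100.

```python
import math

# Values taken from the notebook's configuration and outputs.
train_examples = 402
batch_size = 8
num_epochs = 3
logging_steps = 100
train_runtime = 216.2365  # seconds, from TrainOutput

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which the training loss is logged during the run.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
# Epoch in which the first logged step falls.
first_logged_epoch = math.ceil(logged_at[0] / steps_per_epoch)

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(train_examples * num_epochs / train_runtime, 3)

print(steps_per_epoch, total_steps, logged_at, first_logged_epoch)
print(steps_per_second, samples_per_second)
```

Lowering `logging_steps` below the number of steps per epoch would give a separate training-loss value for every epoch in the table.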

In [35]:
trainer.save_model("fine_tuned_flan_t5_seed_1")

In [36]:
!zip -r fine_tuned_flan_t5_seed_1.zip fine_tuned_flan_t5_seed_1

  adding: fine_tuned_flan_t5_seed_1/ (stored 0%)
  adding: fine_tuned_flan_t5_seed_1/generation_config.json (deflated 29%)
  adding: fine_tuned_flan_t5_seed_1/config.json (deflated 62%)
  adding: fine_tuned_flan_t5_seed_1/spiece.model (deflated 48%)
  adding: fine_tuned_flan_t5_seed_1/special_tokens_map.json (deflated 85%)
  adding: fine_tuned_flan_t5_seed_1/model.safetensors (deflated 7%)
  adding: fine_tuned_flan_t5_seed_1/training_args.bin (deflated 51%)
  adding: fine_tuned_flan_t5_seed_1/tokenizer_config.json (deflated 94%)
  adding: fine_tuned_flan_t5_seed_1/added_tokens.json (deflated 83%)


<a id='trainer'></a>