# FLAN-T5

## What is FLAN-T5?
FLAN-T5 is an open-source, sequence-to-sequence language model released by Google researchers in late 2022. It handles a variety of natural language processing tasks and can be used in both research and commercial applications. It builds on T5, an encoder-decoder Transformer pre-trained on the Colossal Clean Crawled Corpus (C4), and is further instruction-fine-tuned on a large collection of tasks phrased as natural-language instructions.

Fine-tuning adapts FLAN-T5 to a specific task and typically improves its performance on that task. It lets users customize the model to their own needs and data. The smaller checkpoints, such as the `flan-t5-base` model used below, keep this within reach of smaller organizations and individual researchers with limited GPU resources.

## Potential Applications
Potential applications of fine-tuned FLAN-T5 include:
- **Chat and Dialogue Summarization:** FLAN-T5 can condense conversations, providing a quick recap of customer service interactions or business meetings.
- **Text Classification:** Useful for automating the categorization of text into predefined classes, such as sentiment analysis and spam detection.
- **FHIR Resource Generation:** FLAN-T5 can convert clinical text into structured Fast Healthcare Interoperability Resources (FHIR) for easy sharing and integration into healthcare systems.

Fine-tuning FLAN-T5 opens up possibilities for optimizing its performance in various real-world scenarios.
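
As a rough illustration of how such tasks are usually posed to an instruction-tuned model, each one can be phrased as a plain-text instruction followed by the input. The template wording and the `build_prompt` helper below are hypothetical; they are not taken from FLAN-T5's training data or from this notebook.

```python
# Hypothetical instruction templates; the exact wording is an assumption
# and would normally be tuned for the task and the fine-tuning data.
TEMPLATES = {
    "summarize": "Summarize the following conversation:\n{text}",
    "classify": "Classify the sentiment of this text as positive or negative:\n{text}",
    "fhir": "Convert the following clinical note into a FHIR resource in JSON:\n{text}",
}

def build_prompt(task: str, text: str) -> str:
    """Return the instruction prompt for a task, with the input text filled in."""
    if task not in TEMPLATES:
        raise ValueError(f"Unknown task: {task!r}")
    return TEMPLATES[task].format(text=text.strip())

print(build_prompt("classify", "  The delivery was late and the box was damaged. "))
```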


### Library

In [1]:
!pip install nltk
!pip install datasets
!pip install transformers[torch]
!pip install tokenizers
!pip install evaluate
!pip install rouge_score
!pip install sentencepiece
!pip install huggingface_hub


### Imports

In [2]:
import nltk
import evaluate
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

### Dataset preparation

In [3]:
import pandas as pd

# Load the source dataset; read_csv already returns a DataFrame
df = pd.read_csv('/content/drive/Shareddrives/dataset/clients/research/finalfeatures.csv')

# Function to generate question and answer
def generate_question_answer(row):
    # Formulate the question; spell out "no promotions" so the sentence reads correctly
    promo = 'no promotions' if row['onpromotion'] == 0 else 'promotions'
    question = (
        f"For store number {row['store_nbr']} in the city of {row['city']}, "
        f"with products from various categories such as {row['family']}, "
        f"during a {row['type_of_holiday'].lower()} on {row['year']}-{row['month']}-{row['day']}, "
        f"with {promo}, cluster {row['cluster']}, and WTI crude oil price at ${row['dcoilwtico']}, "
        "what were the total sales on that day?"
    )

    # Provide the answer
    answer = row['sales']

    return question, answer

# Generate question-answer pairs using list comprehension
question_answer_pairs = [generate_question_answer(row) for _, row in df.iterrows()]

# Extract questions and answers into separate lists
questions, answers = zip(*question_answer_pairs)

# Create a DataFrame from the lists
question_answer_df = pd.DataFrame({'question': questions, 'answer': answers})

# Save the dataframe to a file
question_answer_df.to_csv('datasetqa.csv', index=False)

question_answer_df.head(5)


Unnamed: 0,question,answer
0,"For store number 1.0 in the city of Quito, wit...",0.0
1,"For store number 1.0 in the city of Quito, wit...",0.0
2,"For store number 1.0 in the city of Quito, wit...",21.0
3,"For store number 1.0 in the city of Quito, wit...",0.0
4,"For store number 1.0 in the city of Quito, wit...",3.0
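
To see what a single prompt looks like, the template can be applied to one row by hand. The row values below are made up for illustration (they are not taken from `finalfeatures.csv`), and `build_question` is a hypothetical, shortened variant of `generate_question_answer` that also casts whole-number fields to `int`, so that `1.0` renders as `1`.

```python
# Hypothetical row; the values are invented for illustration only.
row = {
    "store_nbr": 1.0, "city": "Quito", "family": "BABY CARE",
    "type_of_holiday": "Holiday", "year": 2013.0, "month": 2.0, "day": 11.0,
    "onpromotion": 0.0, "cluster": 13.0, "dcoilwtico": 97.01, "sales": 0.0,
}

def build_question(row):
    """Format one row as a question, rendering whole numbers without '.0'."""
    date = f"{int(row['year'])}-{int(row['month']):02d}-{int(row['day']):02d}"
    promo = "no promotions" if row["onpromotion"] == 0 else "promotions"
    return (
        f"For store number {int(row['store_nbr'])} in the city of {row['city']}, "
        f"product family {row['family']}, during a {row['type_of_holiday'].lower()} "
        f"on {date}, with {promo}, cluster {int(row['cluster'])}, and WTI crude oil "
        f"price at ${row['dcoilwtico']}, what were the total sales on that day?"
    )

print(build_question(row))
```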


In [4]:
import pandas as pd
import json

def format_and_save_to_json(input_csv_path, output_json_path):
    # Read data from CSV
    data = pd.read_csv(input_csv_path)

    # Convert DataFrame to list of dictionaries in desired format
    formatted_data = []
    for idx, row in data.iterrows():
        formatted_data.append({
            "question": row['question'],
            "answer": row['answer'],
            "id": str(idx)  # Adding an ID based on index (you can adjust this based on your requirements)
        })

    # Save data as JSON
    with open(output_json_path, "w") as json_file:
        json.dump(formatted_data, json_file, indent=4)

# Example usage
input_csv_path = "/content/datasetqa.csv"
output_json_path = "formatted_data.json"

format_and_save_to_json(input_csv_path, output_json_path)
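
The dataset loaded later exposes the columns `id`, `input_text`, and `output_text`, while the JSON written here uses `question` and `answer`. The loading script that bridges the two is not shown, so the following is only a sketch of one way the renaming could be done. `to_seq2seq_records` is a hypothetical helper, and converting the answer to a string is an assumption, made because tokenizers expect text targets.

```python
import json

def to_seq2seq_records(records):
    """Rename question/answer fields to input_text/output_text (assumed mapping)."""
    return [
        {
            "id": str(rec["id"]),
            "input_text": rec["question"],
            # Targets are stored as text so they can be tokenized later.
            "output_text": str(rec["answer"]),
        }
        for rec in records
    ]

# Two records in the same shape that format_and_save_to_json writes.
sample = json.loads(
    '[{"question": "What were the total sales?", "answer": 21.0, "id": "2"},'
    ' {"question": "What were the total sales?", "answer": 0.0, "id": "3"}]'
)
print(to_seq2seq_records(sample)[0])
```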


### Loading the model

In [5]:
# Load the tokenizer, model, and data collator

MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)


In [8]:
from datasets import load_dataset

# Load the question-answer data through a local dataset loading script.
# trust_remote_code=True acknowledges that the script runs custom code;
# without it, `datasets` warns that the argument will become mandatory.
question_answer_da = load_dataset('/content/dataset.py', trust_remote_code=True)

In [9]:
question_answer_da = question_answer_da["train"].train_test_split(test_size=0.2)
# Check the length of the data and its structure
question_answer_da

DatasetDict({
    train: Dataset({
        features: ['id', 'input_text', 'output_text'],
        num_rows: 257637
    })
    test: Dataset({
        features: ['id', 'input_text', 'output_text'],
        num_rows: 64410
    })
})
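
The split sizes above can be checked by hand. The sketch below assumes the test share is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the counts shown. It does not call the `datasets` library.

```python
import math

train_rows, test_rows = 257637, 64410   # counts reported in the output above
total_rows = train_rows + test_rows     # rows before splitting
test_size = 0.2

expected_test = math.ceil(test_size * total_rows)  # fractional share rounded up
expected_train = total_rows - expected_test

print(total_rows, expected_test, expected_train)
```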

In [12]:
# We prefix our tasks with "answer the question"
prefix = "Please answer this question: "

# Define the preprocessing function

def preprocess_function(examples):
   """Add the prefix to each question, tokenize the text, and set the labels"""
   # The "inputs" are the prefixed questions:
   inputs = [prefix + doc for doc in examples["input_text"]]
   model_inputs = tokenizer(inputs, max_length=128, truncation=True)

   # The "labels" are the tokenized answers; str() guards against numeric targets:
   targets = [str(target) for target in examples["output_text"]]
   labels = tokenizer(text_target=targets,
                      max_length=512,
                      truncation=True)

   model_inputs["labels"] = labels["input_ids"]
   return model_inputs
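
Because the tokenizer is called without padding here, the sequences in a batch have different lengths and the data collator pads them later. `DataCollatorForSeq2Seq` pads labels with `-100` by default, a value the loss function ignores. The plain-Python sketch below imitates that step on made-up token ids; `pad_labels` is a hypothetical helper, not part of `transformers`.

```python
LABEL_PAD_ID = -100  # default label_pad_token_id of DataCollatorForSeq2Seq

def pad_labels(batch, pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Made-up token ids standing in for two tokenized answers of different length.
labels = [[204, 5, 1], [3, 1]]
print(pad_labels(labels))
```

This is also why a metric function has to swap `-100` back to a real token id before decoding.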

In [None]:
# Map the preprocessing function across our dataset
tokenized_dataset = question_answer_da.map(preprocess_function, batched=True)

nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
   preds, labels = eval_preds

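   # Replace -100 (ignored positions) with the pad token id so both arrays can be decoded
   preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)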

   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

   result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

   return result
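
ROUGE scores n-gram overlap, so for targets that are single numbers it may say little about how far a prediction is from the true value. As a possible complement, and not something the original notebook does, the decoded strings could be parsed as numbers and compared directly. `numeric_scores` below is a hypothetical helper; treating unparseable predictions as misses is a design assumption.

```python
def numeric_scores(decoded_preds, decoded_labels):
    """Return exact-match rate and mean absolute error over parseable pairs."""
    exact, abs_errors = 0, []
    for pred, label in zip(decoded_preds, decoded_labels):
        exact += pred.strip() == label.strip()
        try:
            abs_errors.append(abs(float(pred) - float(label)))
        except ValueError:
            # Non-numeric output: counted as not exact and left out of the error.
            continue
    return {
        "exact_match": exact / len(decoded_labels),
        "mae": sum(abs_errors) / len(abs_errors) if abs_errors else float("nan"),
        "parsed": len(abs_errors),
    }

print(numeric_scores(["21.0", "3.5", "n/a"], ["21.0", "3.0", "0.0"]))
```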


In [None]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="/content/drive/Shareddrives/dataset/clients/research/Modelsaved/",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)
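
These settings determine how many optimizer steps the run takes. The following is a back-of-the-envelope estimate, assuming a single device and no gradient accumulation (neither is stated in the notebook), and using the training-set size reported after the split.

```python
import math

TRAIN_ROWS = 257637   # training rows reported after the split
BATCH_SIZE = 8
NUM_EPOCHS = 3

# One optimizer step per batch; the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS

print(steps_per_epoch, total_steps)
```

Under these assumptions, a checkpoint saved at step 15000 would fall partway through the first epoch.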

In [None]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

trainer.train()

### Inference

In [None]:
last_checkpoint = "/content/drive/Shareddrives/dataset/clients/research/Modelsaved/checkpoint-15000"

finetuned_model = T5ForConditionalGeneration.from_pretrained(last_checkpoint)
tokenizer = T5Tokenizer.from_pretrained(last_checkpoint)

OSError: Incorrect path_or_model_id: '/content/drive/Shareddrives/dataset/clients/research/Modelsaved/checkpoint-15000'. Please provide either the path to a local folder or the repo_id of a model on the Hub.
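
The error above indicates that no checkpoint folder exists at that path, for example because training had not reached step 15000, or because `save_total_limit=3` had already removed older checkpoints. Rather than hard-coding a step number, the newest `checkpoint-<step>` folder can be looked up. The helper below is a standard-library sketch (`find_latest_checkpoint` is a hypothetical name) that assumes the Trainer's usual `checkpoint-<step>` folder naming.

```python
import re
import tempfile
from pathlib import Path

def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        if match and child.is_dir():
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]

# Demonstration on a temporary folder with made-up checkpoint names.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 15000):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    latest = find_latest_checkpoint(tmp)
    print(latest.name)  # checkpoint-15000
```

The returned path could then be passed to `from_pretrained` in place of the hard-coded one, after checking that it is not `None`.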

In [None]:
my_question = "On 2013-02-11, at store number 1 in Quito, Pichincha, under store type D and cluster 13, with 396 transactions recorded, and crude oil price at 97.01, what was the sales quantity of BABY CARE products (ID: 73063), considering whether they were on promotion (On Promotion: 0) in Ecuador during Carnaval (Transferred: False)?"
inputs = ["Please answer this question: " + my_question]

In [None]:
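from textwrap import fill

# Tokenize the prompt and generate an answer
inputs = tokenizer(inputs, return_tensors="pt")
outputs = finetuned_model.generate(**inputs)

# skip_special_tokens drops markers such as <pad> and </s> from the decoded text
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(fill(answer, width=80))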