# FLAN-T5

## What is FLAN-T5?
FLAN-T5 is an open-source, sequence-to-sequence language model released by Google researchers in late 2022. It is an instruction-tuned version of T5: the underlying T5 model uses the Transformer encoder-decoder architecture and was pre-trained on the Colossal Clean Crawled Corpus (C4), and FLAN-T5 was then fine-tuned on a large collection of tasks phrased as natural-language instructions. It can perform a variety of natural language processing tasks and can be used in both research and commercial applications.

Fine-tuning adapts FLAN-T5 to a specific task and dataset and typically improves its performance on that task. Because the model is released in several sizes, the smaller checkpoints can be fine-tuned by smaller organizations and individual researchers with limited GPU resources.

## Potential Applications
Potential applications of fine-tuned FLAN-T5 include:
- **Chat and Dialogue Summarization:** FLAN-T5 can condense conversations, providing a quick recap of customer service interactions or business meetings.
- **Text Classification:** Useful for automating the categorization of text into predefined classes, such as sentiment analysis and spam detection.
- **FHIR Resource Generation:** FLAN-T5 can convert clinical text into structured Fast Healthcare Interoperability Resources (FHIR) for easy sharing and integration into healthcare systems.

Fine-tuning FLAN-T5 opens up possibilities for optimizing its performance in various real-world scenarios.


### Library

In [None]:
!pip install nltk
!pip install datasets
!pip install transformers[torch]
!pip install tokenizers
!pip install evaluate
!pip install rouge_score
!pip install sentencepiece
!pip install huggingface_hub


### Imports

In [None]:
import nltk
import evaluate
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import T5Tokenizer, DataCollatorForSeq2Seq
from transformers import T5ForConditionalGeneration, Seq2SeqTrainingArguments, Seq2SeqTrainer

In [None]:
!sudo echo -ne '\n' | sudo add-apt-repository ppa:alessandro-strada/ppa >/dev/null 2>&1 # note: >/dev/null 2>&1 is used to suppress printing
!sudo apt update >/dev/null 2>&1
!sudo apt install google-drive-ocamlfuse >/dev/null 2>&1
!google-drive-ocamlfuse
!sudo apt-get install w3m >/dev/null 2>&1 # to act as web browser
!xdg-settings set default-web-browser w3m.desktop >/dev/null 2>&1 # to set default browser
%cd /content
!mkdir gdrive
%cd gdrive
!mkdir "My Drive"
!google-drive-ocamlfuse "/content/gdrive/My Drive"


### Dataset preparation

In [None]:
import pandas as pd

# Load the feature table directly into a DataFrame
df = pd.read_csv('/content/drive/Shareddrives/dataset/clients/research/finalfeatures.csv')

# Function to generate question and answer
def generate_question_answer(row):
    # Formulate the question; str() guards against missing (NaN) holiday values
    question = f"For store number {row['store_nbr']} in the city of {row['city']}, with products from various categories such as {row['family']}, during a {str(row['type_of_holiday']).lower()} on {row['year']}-{row['month']}-{row['day']}, with {'no' if row['onpromotion'] == 0 else 'promotions'}, cluster {row['cluster']}, and WTI crude oil price at ${row['dcoilwtico']}, what were the total sales on that day?"

    # Provide the answer
    answer = row['sales']

    return question, answer

# Generate question-answer pairs using list comprehension
question_answer_pairs = [generate_question_answer(row) for _, row in df.iterrows()]

# Extract questions and answers into separate lists
questions, answers = zip(*question_answer_pairs)

# Create a DataFrame from the lists
question_answer_df = pd.DataFrame({'question': questions, 'answer': answers})

# Save the dataframe to a file
question_answer_df.to_csv('datasetqa.csv', index=False)

question_answer_df.head(5)


| | question | answer |
|---|---|---|
| 0 | For store number 1.0 in the city of Quito, wit... | 0.0 |
| 1 | For store number 1.0 in the city of Quito, wit... | 0.0 |
| 2 | For store number 1.0 in the city of Quito, wit... | 21.0 |
| 3 | For store number 1.0 in the city of Quito, wit... | 0.0 |
| 4 | For store number 1.0 in the city of Quito, wit... | 3.0 |


In [None]:
import pandas as pd
import json

def format_and_save_to_json(input_csv_path, output_json_path):
    # Read data from CSV
    data = pd.read_csv(input_csv_path)

    # Convert DataFrame to list of dictionaries in desired format
    formatted_data = []
    for idx, row in data.iterrows():
        formatted_data.append({
            "question": row['question'],
            "answer": row['answer'],
            "id": str(idx)  # Adding an ID based on index (you can adjust this based on your requirements)
        })

    # Save data as JSON
    with open(output_json_path, "w") as json_file:
        json.dump(formatted_data, json_file, indent=4)

# Example usage
input_csv_path = "/content/datasetqa.csv"
output_json_path = "formatted_data.json"

format_and_save_to_json(input_csv_path, output_json_path)
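
To make the record layout concrete, here is a minimal sketch of the same CSV-to-JSON conversion using only the standard library. The function name `csv_to_json_records`, the demo file names, and the two sample rows are hypothetical and exist only for illustration. The sketch also assumes the `answer` column is always numeric.

```python
import csv
import json
import os
import tempfile


def csv_to_json_records(input_csv_path, output_json_path):
    """Convert a question/answer CSV into a list of {"question", "answer", "id"} records."""
    records = []
    with open(input_csv_path, newline="", encoding="utf-8") as csv_file:
        for idx, row in enumerate(csv.DictReader(csv_file)):
            records.append({
                "question": row["question"],
                "answer": float(row["answer"]),  # assumption: answers are numeric
                "id": str(idx),                  # row position used as the ID
            })
    with open(output_json_path, "w", encoding="utf-8") as json_file:
        json.dump(records, json_file, indent=4)
    return records


# Hypothetical two-row file, used only to illustrate the record layout
tmp_dir = tempfile.mkdtemp()
demo_csv = os.path.join(tmp_dir, "demo_qa.csv")
demo_json = os.path.join(tmp_dir, "demo_qa.json")
with open(demo_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer"])
    writer.writerow(["What were the total sales on day A?", "21.0"])
    writer.writerow(["What were the total sales on day B?", "0.0"])

demo_records = csv_to_json_records(demo_csv, demo_json)
print(demo_records[0])
```

Each record carries the question text, the numeric answer, and a string ID, which is the shape the JSON file above is meant to have.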


### Loading the model

In [None]:
# Load the tokenizer, model, and data collator

MODEL_NAME = "google/flan-t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
from datasets import load_dataset

# Load the dataset via a local loading script; the result exposes 'id', 'input_text' and 'output_text' columns
question_answer_da = load_dataset('/content/dataset.py')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [None]:
question_answer_da = question_answer_da["train"].train_test_split(test_size=0.2)
# Check the length of the data and its structure
question_answer_da

DatasetDict({
    train: Dataset({
        features: ['id', 'input_text', 'output_text'],
        num_rows: 257637
    })
    test: Dataset({
        features: ['id', 'input_text', 'output_text'],
        num_rows: 64410
    })
})
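
The row counts above can be checked with a little arithmetic. The sketch below assumes the test share is rounded up to a whole row and the remainder goes to training, which matches the printed sizes. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test size, rounding the test share up."""
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return n_train, n_test


total_rows = 257637 + 64410  # train + test rows reported above
n_train, n_test = split_sizes(total_rows, 0.2)
print(n_train, n_test)  # 257637 64410
```

Because the split is random, the rows in each part change between runs. Passing a fixed `seed` to `train_test_split` should make the split reproducible.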

In [None]:
# We prefix our tasks with "answer the question"
prefix = "Please answer this question: "

# Define the preprocessing function

def preprocess_function(examples):
   """Add prefix to the sentences, tokenize the text, and set the labels"""
   # The "inputs" are the prefixed questions, which are then tokenized:
   inputs = [prefix + doc for doc in examples["input_text"]]
   model_inputs = tokenizer(inputs, max_length=128, truncation=True)

   # The "labels" are the tokenized outputs:
   labels = tokenizer(text_target=examples["output_text"],
                      max_length=512,
                      truncation=True)

   model_inputs["labels"] = labels["input_ids"]
   return model_inputs
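
The tokenized labels in a batch have different lengths. `DataCollatorForSeq2Seq` pads them with `-100` by default so that the loss ignores the padded positions. This is why `compute_metrics` later swaps `-100` back to the pad token ID before decoding. A plain-Python sketch of both steps follows. The helper names and the toy token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that the loss ignores
PAD_TOKEN_ID = 0     # T5 tokenizers use ID 0 for <pad>


def pad_labels(label_batch, pad_value=IGNORE_INDEX):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in label_batch)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_batch]


def restore_pad_tokens(padded_batch, pad_token_id=PAD_TOKEN_ID):
    """Replace the ignore index with the pad token ID so the batch can be decoded."""
    return [
        [pad_token_id if token == IGNORE_INDEX else token for token in labels]
        for labels in padded_batch
    ]


toy_labels = [[305, 5, 1], [3, 1]]  # made-up token IDs, not real tokenizer output
padded = pad_labels(toy_labels)
restored = restore_pad_tokens(padded)
print(padded)    # [[305, 5, 1], [3, 1, -100]]
print(restored)  # [[305, 5, 1], [3, 1, 0]]
```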

In [None]:
# Map the preprocessing function across our dataset
tokenized_dataset = question_answer_da.map(preprocess_function, batched=True)

nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
   preds, labels = eval_preds

   # decode preds and labels
   labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
   decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
   decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

   # rougeLSum expects newline after each sentence
   decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
   decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

   result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

   return result

Map:   0%|          | 0/257637 [00:00<?, ? examples/s]

Map:   0%|          | 0/64410 [00:00<?, ? examples/s]
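
ROUGE measures token overlap between strings, while the targets in this notebook are sales figures. A numeric error may therefore be a useful complement. The sketch below computes a root mean squared logarithmic error (RMSLE) from decoded strings. It is an illustrative addition, not part of the original training setup. `to_float` and `rmsle` are hypothetical helpers, and treating unparseable or negative outputs as `0.0` is an assumption.

```python
import math


def to_float(text, default=0.0):
    """Parse a decoded string as a float; fall back to `default` if it is not a finite number."""
    try:
        value = float(text.strip())
    except ValueError:
        return default
    return value if math.isfinite(value) else default


def rmsle(pred_texts, label_texts):
    """Root mean squared logarithmic error over decoded predictions and labels."""
    preds = [max(to_float(p), 0.0) for p in pred_texts]    # clip negatives to 0
    labels = [max(to_float(l), 0.0) for l in label_texts]
    squared = [(math.log1p(p) - math.log1p(l)) ** 2 for p, l in zip(preds, labels)]
    return math.sqrt(sum(squared) / len(squared))


print(rmsle(["5.0", "0.0", "abc"], ["5.0", "0.0", "0.0"]))  # 0.0
```

A function like this could be called on `decoded_preds` and `decoded_labels` inside `compute_metrics` and its value added to the returned dictionary.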

In [None]:
# Global Parameters
L_RATE = 3e-4
BATCH_SIZE = 8
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 3
NUM_EPOCHS = 3

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="/content/drive/Shareddrives/dataset/clients/research/Modelsaved/",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   weight_decay=WEIGHT_DECAY,
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   predict_with_generate=True,
   push_to_hub=False
)
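
These settings imply a rough number of optimizer steps. The estimate below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept. `optimizer_steps` is a hypothetical helper for the arithmetic, not a Transformers API.

```python
import math


def optimizer_steps(n_train_rows, per_device_batch_size, num_epochs,
                    n_devices=1, grad_accum_steps=1):
    """Estimate steps per epoch and total steps for a training run."""
    effective_batch = per_device_batch_size * n_devices * grad_accum_steps
    steps_per_epoch = math.ceil(n_train_rows / effective_batch)
    return steps_per_epoch, steps_per_epoch * num_epochs


# 257637 training rows, BATCH_SIZE = 8, NUM_EPOCHS = 3
steps_per_epoch, total_steps = optimizer_steps(257637, 8, 3)
print(steps_per_epoch, total_steps)  # 32205 96615
```

With evaluation running once per epoch over 64410 test rows and `predict_with_generate=True`, each evaluation pass is itself a sizeable generation job.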

In [None]:
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset["train"],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics
)

trainer.train()



# Inference

## Online prediction

In [None]:
last_checkpoint = "Jyotiyadav/model2.0"

finetuned_model = T5ForConditionalGeneration.from_pretrained(last_checkpoint)
tokenizer = T5Tokenizer.from_pretrained(last_checkpoint)



In [None]:
import pandas as pd

# Given dataset
data = pd.read_csv('/content/gdrive/My Drive/research/submission_encoded_dataset.csv')

# Fill missing values with 0
data['transactions'] = data['transactions'].fillna(0)
data['type_of_holiday'] = data['type_of_holiday'].fillna(0)

# Create DataFrame
df = pd.DataFrame(data)

# Function to generate question
def generate_question(row):
    # Formulate the question
    question = f"For store number {row['store_nbr']} in the city of {row['city']}, with products from various categories such as {row['family']}, during a {str(row['type_of_holiday']).lower()} on {row['year']}-{row['month']}-{row['day']}, with {'no' if row['onpromotion'] == 0 else 'promotions'}, cluster {row['cluster']}, and WTI crude oil price at ${row['dcoilwtico']}, what were the total sales on that day?"

    return question

# Generate questions using list comprehension
questions = [generate_question(row) for _, row in df.iterrows()]

# Create a DataFrame for questions
questions_df = pd.DataFrame({'question': questions})

# Save the dataframe to a file
questions_df.to_csv('datasetqatest.csv', index=False)

questions_df.head(5)


| | question |
|---|---|
| 0 | For store number 1 in the city of Quito, with ... |
| 1 | For store number 1 in the city of Quito, with ... |
| 2 | For store number 1 in the city of Quito, with ... |
| 3 | For store number 1 in the city of Quito, with ... |
| 4 | For store number 1 in the city of Quito, with ... |


In [None]:
my_question = "For store number 1 in the city of Quito, with products from various categories such as AUTOMOTIVE, during a 0 on 2017-8-16, with no, cluster 13, and WTI crude oil price at $46.8, what were the total sales on that day?"
inputs = ["Please answer this question: " + my_question]

In [None]:
inputs = tokenizer(inputs, return_tensors="pt")
outputs = finetuned_model.generate(**inputs)
answer = tokenizer.decode(outputs[0])
#print(answer)
from textwrap import fill

print(fill(answer, width=80))



<pad> 5.0</s>
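
The raw decoded string still contains the special tokens `<pad>` and `</s>`. Passing `skip_special_tokens=True` to `tokenizer.decode` drops them, as the batch inference cell does. Alternatively, the string can be cleaned afterwards. The sketch below takes the second route and converts the result to a number. `parse_sales` is a hypothetical helper, and returning `None` for non-numeric output is an assumption.

```python
import re

# Special tokens that can appear in a raw T5 decode
SPECIAL_TOKEN_PATTERN = re.compile(r"<pad>|</s>|<unk>")


def parse_sales(decoded_text):
    """Strip special tokens and return the prediction as a float, or None if it is not numeric."""
    cleaned = SPECIAL_TOKEN_PATTERN.sub("", decoded_text).strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_sales("<pad> 5.0</s>"))  # 5.0
```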


### Batch Inference

In [None]:
import pandas as pd

# Assuming you have already imported the necessary libraries and defined the functions and model

# Load the DataFrame from the CSV file
df = pd.read_csv("/content/gdrive/datasetqatest.csv")

# Function to generate answer for a given question
def generate_answer_batch(questions):
    inputs = tokenizer(["Please answer this question: " + question for question in questions], return_tensors="pt", padding=True, truncation=True)
    outputs = finetuned_model.generate(**inputs)
    answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return answers

# Generate predictions for all questions in the DataFrame
batch_size = 100  # Adjust as needed
predicted_answers = []
for i in range(0, len(df), batch_size):
    questions_batch = df['question'][i:i+batch_size].tolist()
    answers_batch = generate_answer_batch(questions_batch)
    predicted_answers.extend(answers_batch)

# Add the predicted answers to the DataFrame
df['predicted_answer'] = predicted_answers

# Save the DataFrame to a CSV file
df.to_csv('/content/gdrive/My Drive/research/predicted_answer.csv', index=False)
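
The batching loop above slices the question list into chunks of `batch_size`, with a shorter final chunk when the total is not a multiple of the batch size. The same pattern as a small generator, using made-up questions (`iter_batches` is a hypothetical helper):

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_questions = [f"question {i}" for i in range(250)]  # placeholder inputs
batches = list(iter_batches(toy_questions, 100))
print([len(batch) for batch in batches])  # [100, 100, 50]

# Concatenating the batches restores the original order, which is what lets
# the predictions be assigned back to the DataFrame column row by row.
flattened = [question for batch in batches for question in batch]
```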




In [None]:
df1 = pd.read_csv('/content/test.csv')
# Select only the 'id' column from the test set
id_column = df1['id']
predictions = df['predicted_answer']
# Combine the ids and predictions into a new DataFrame (rows are assumed to be in the same order)
dataframe2 = pd.DataFrame({'id': id_column, 'predictions': predictions})

# Save the dataframe2 to a CSV file
dataframe2.to_csv('/content/gdrive/My Drive/research/submission.csv', index=False)

In [None]:
dataframe2

| | id | predictions |
|---|---|---|
| 0 | 3000888 | 5.0 |
| 1 | 3000889 | 0.0 |
| 2 | 3000890 | 1.0 |
| 3 | 3000891 | 1412.0 |
| 4 | 3000892 | 0.0 |
| ... | ... | ... |
| 28507 | 3029395 | 319.299 |
| 28508 | 3029396 | 88.7 |
| 28509 | 3029397 | 2193.923 |
| 28510 | 3029398 | 6.0 |


In [None]:
import pandas as pd

# Rename the prediction column to 'sales' for the submission file
dataframe2.rename(columns={"predictions": "sales"}, inplace=True)


In [None]:
# Save the dataframe2 to a CSV file
dataframe2.to_csv('/content/gdrive/My Drive/research/submission.csv', index=False)