## [FLAN-T5 Model](https://arxiv.org/pdf/2210.11416v5.pdf)

**Finetuning language models** on a collection of **datasets phrased as instructions** has been shown to improve
model performance and generalization to unseen tasks. In this paper we explore instruction finetuning
with a particular focus on **(1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on
chain-of-thought data.** We find that instruction finetuning with the above aspects dramatically improves
performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT),
and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation, RealToxicityPrompts).
For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin
(+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as
75.2% on five-shot MMLU. We also **publicly release Flan-T5 checkpoints**, which achieve strong few-shot
performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a
general method for improving the performance and usability of pretrained language models.

### [google/flan-t5-small](https://huggingface.co/google/flan-t5-small)
80M parameters

In [1]:
!pip install transformers -U
!pip install datasets


## Task 1: Use a pre-trained google/flan-t5-small as the model.

In [65]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name="google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

#### Model Prediction

In [66]:
import re
from transformers import GenerationConfig
generation_config = GenerationConfig(max_new_tokens=100)
#generation_config = GenerationConfig(max_new_tokens=100,  do_sample=True, temperature=0.1)

def generate_llm_prediction(prompt):
    #print(f"\n========== Input Prompt ===============\n{prompt}")
    inputs = tokenizer(prompt, return_tensors='pt')
    output = tokenizer.decode(model.generate(inputs["input_ids"],generation_config=generation_config)[0],skip_special_tokens=True)
    output = re.sub('---*','',str(output))
    #print(f"Output:\n{output}")
    return output
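
The `re.sub('---*', '', ...)` call in `generate_llm_prediction` removes any run of two or more hyphens, such as the `-----` separator placed between examples in the one-shot and few-shot prompts, in case the model echoes it back. The standalone sketch below shows that behaviour; the helper name `strip_separators` and the sample strings are illustrative assumptions, not model outputs.

```python
import re

def strip_separators(text: str) -> str:
    # '---*' matches two hyphens followed by zero or more hyphens,
    # i.e. any run of at least two consecutive hyphens.
    return re.sub('---*', '', text)

print(strip_separators("A short summary.-----"))   # the separator run is removed
print(strip_separators("An intra-office memo."))   # a single hyphen is kept
```

Single hyphens inside words survive, so ordinary hyphenated text in a summary is not affected.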

## Task 2: Verify if the summarization task works.

For this task, let's take some examples from the DialogSum dataset and observe the model's performance on them.

[DialogSum Dataset](https://huggingface.co/datasets/knkarthick/dialogsum): DialogSum is a large-scale **dialogue summarization dataset**, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding manually labeled summaries and topics.


In [4]:
import datasets
dataset_name = "knkarthick/dialogsum"
dataset = datasets.load_dataset(dataset_name)

TRAINING_DATA_COUNT = len(dataset['train'])
TEST_DATA_COUNT = len(dataset['test'])

print(f"Train dataset size: {TRAINING_DATA_COUNT}")
print(f"Test dataset size: {TEST_DATA_COUNT}")

Train dataset size: 12460
Test dataset size: 1500


In [5]:
import random

def visualize_dataset(dataset: datasets.DatasetDict, indices: list[int] | None = None) -> None:
    if not indices:
        # Generate a list of 5 random integers from the training data range
        indices = random.sample(range(TRAINING_DATA_COUNT), 5)
    for index in indices:
        print('='*100)
        print(f'>> Index: {index}')
        # Valid indices run from 0 to TRAINING_DATA_COUNT - 1
        if index < 0 or index >= TRAINING_DATA_COUNT:
            print(f"Incorrect Index: {index}")
            continue
        print(f'\n== Dialogue:\n{dataset["train"][index]["dialogue"]}')
        print(f'\n== Summary:\n{dataset["train"][index]["summary"]}')

#visualize_dataset(dataset)
visualize_dataset(dataset, indices=[0,40,60])

>> Index: 0

== Dialogue:
#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.

== Summary:
Mr. Smith's getting a check-u

#### Zero-Shot Inference

In [6]:
def create_zero_shot_prompt(dialogue: str, prompt_index: int) -> str:
    if prompt_index == 1:
        summarization_zero_shot_prompt = f"""Summarize the following Dialogue\n\nDialogue:\n{dialogue}
        """

    elif prompt_index == 2:
        summarization_zero_shot_prompt = f"""Generate a concise summary of the following Dialogue\n\nDialogue:\n{dialogue}
        """

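    elif prompt_index == 3:
        summarization_zero_shot_prompt = f"""Summarize the following Dialogue in maximum two sentences, mentioning character's information\n\nDialogue:\n{dialogue}
        """

    else:
        # Guard against an unbound variable when an unsupported index is passed
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")
    return summarization_zero_shot_prompt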
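
The three zero-shot variants differ only in their instruction line. A hedged alternative, sketched below, keeps the instructions in a lookup table and builds the prompt in one place. `ZERO_SHOT_INSTRUCTIONS` and `build_zero_shot_prompt` are hypothetical names, and this version ends the prompt with a single newline instead of the indented closing line of the triple-quoted strings, so its output is not byte-identical to `create_zero_shot_prompt`.

```python
# Hypothetical lookup table: one instruction line per prompt_index.
ZERO_SHOT_INSTRUCTIONS = {
    1: "Summarize the following Dialogue",
    2: "Generate a concise summary of the following Dialogue",
    3: ("Summarize the following Dialogue in maximum two sentences, "
        "mentioning character's information"),
}

def build_zero_shot_prompt(dialogue: str, prompt_index: int) -> str:
    # Fail loudly on an unknown index.
    if prompt_index not in ZERO_SHOT_INSTRUCTIONS:
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")
    return f"{ZERO_SHOT_INSTRUCTIONS[prompt_index]}\n\nDialogue:\n{dialogue}\n"

print(build_zero_shot_prompt("#Person1#: Hello.\n#Person2#: Hi.", prompt_index=1))
```

Adding a fourth prompt then only requires a new dictionary entry.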

In [7]:
print(create_zero_shot_prompt(dialogue=dataset["test"][0]["dialogue"], prompt_index=3))

Summarize the following Dialogue in maximum two sentences, mentioning character's information

Dialogue:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, all office communications are restricted to email correspondence and official memos. The use of Instant Message programs by employees during working hours is strictly prohibited.
#Person2#: Sir, does this apply to intra-office communications only? Or will it also restrict external communications?
#Person1#: It should apply to all communications, not only in this office between employees, but also any outside communications.
#Person2#: But sir, many employees use Instant Messaging to communicate with their clients.
#Person1#: They will just have to change their communication methods. I don't want any - on

#### One-Shot Inference

In [8]:
def create_one_shot_prompt(dialogue: str, example_index: int, prompt_index: int) -> str:

    example_dialogue = dataset["train"][example_index]["dialogue"]
    example_summary = dataset["train"][example_index]["summary"]

    if prompt_index == 1:
        summarization_one_shot_prompt = f"""Summarize the following Dialogue\n\nDialogue:\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n-----\n\nDialogue:\n{dialogue}\n\nSummary:\n
        """

    elif prompt_index == 2:
        summarization_one_shot_prompt = f"""Generate a concise summary of the following Dialogue\n\nDialogue:\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n-----\n\nDialogue:\n{dialogue}\n\nSummary:\n
        """

    elif prompt_index == 3:
        summarization_one_shot_prompt = f"""Summarize the following Dialogue in maximum two sentences, mentioning character's information\n\nDialogue:\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n-----\n\nDialogue:\n{dialogue}\n\nSummary:\n
        """

    else:
        # Guard against an unbound variable when an unsupported index is passed
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")

    return summarization_one_shot_prompt

In [9]:
print(create_one_shot_prompt(dialogue=dataset["test"][0]["dialogue"], example_index=2, prompt_index=3))

Summarize the following Dialogue in maximum two sentences, mentioning character's information

Dialogue:
#Person1#: Excuse me, did you see a set of keys?
#Person2#: What kind of keys?
#Person1#: Five keys and a small foot ornament.
#Person2#: What a shame! I didn't see them.
#Person1#: Well, can you help me look for it? That's my first time here.
#Person2#: Sure. It's my pleasure. I'd like to help you look for the missing keys.
#Person1#: It's very kind of you.
#Person2#: It's not a big deal.Hey, I found them.
#Person1#: Oh, thank God! I don't know how to thank you, guys.
#Person2#: You're welcome.

Summary:
#Person1#'s looking for a set of keys and asks for #Person2#'s help to find them.

-----

Dialogue:
#Person1#: Ms. Dawson, I need you to take a dictation for me.
#Person2#: Yes, sir...
#Person1#: This should go out as an intra-office memorandum to all employees by this afternoon. Are you ready?
#Person2#: Yes, sir. Go ahead.
#Person1#: Attention all staff... Effective immediately, 

#### Few-Shot Inference

In [10]:
def create_few_shot_prompt(dialogue: str, num_shots: int, prompt_index: int) -> str:

    indices = random.sample(range(TRAINING_DATA_COUNT), num_shots)

    if prompt_index == 1:
        summarization_few_shot_prompt = "Summarize the following Dialogue"

    elif prompt_index == 2:
        summarization_few_shot_prompt = "Generate a concise summary of the following Dialogue "

    elif prompt_index == 3:
        summarization_few_shot_prompt = "Summarize the following Dialogue in maximum two sentences, mentioning character's information"

    else:
        # Guard against an unbound variable when an unsupported index is passed
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")

    for example_index in indices:
        example_dialogue = dataset["train"][example_index]["dialogue"]
        example_summary = dataset["train"][example_index]["summary"]
        summarization_few_shot_prompt += f"\n\nDialogue:\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n-----"

    summarization_few_shot_prompt += f"\n\nDialogue:\n{dialogue}\n\nSummary:\n"

    return summarization_few_shot_prompt
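
Because `create_few_shot_prompt` draws its demonstrations with `random.sample`, every call picks different examples, so predictions can vary between runs. A minimal sketch of one way to make the draw repeatable, assuming a dedicated `random.Random` instance with a fixed seed; the helper name `sample_example_indices` and the seed value are illustrative and not part of the notebook.

```python
import random

def sample_example_indices(pool_size: int, num_shots: int, seed: int = 0) -> list[int]:
    # A private generator keeps the draw independent of the global random state.
    rng = random.Random(seed)
    return rng.sample(range(pool_size), num_shots)

first = sample_example_indices(pool_size=12460, num_shots=3, seed=42)
second = sample_example_indices(pool_size=12460, num_shots=3, seed=42)
print(first == second)  # True: the same seed yields the same examples
```

Fixing the seed trades variety for reproducibility, which is usually the better choice when prompts are being compared against each other.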

In [11]:
print(create_few_shot_prompt(dialogue=dataset["test"][0]["dialogue"], num_shots=2, prompt_index=2))

Generate a concise summary of the following Dialogue 

Dialogue:
#Person1#: This is a wonderful pie. Is it homemade?
#Person2#: It is, but I didn't make it. Jack did.
#Person1#: I didn't know your husband cooked.
#Person2#: Every week he makes something wonderful. He makes great fresh bread. Sometimes we give some to our neighbors.
#Person1#: What else does your amazing husband do?
#Person2#: He makes dinner every night.
#Person1#: Really? I don't even know how to fry an egg.
#Person2#: Jack even does the washing. I spend longer hours traveling from my home to my office and spend fewer hours at home. So he doesn't mind.
#Person1#: Yes, our company is a little far from your home. Who does the cleaning?
#Person2#: We both do. That way it only takes a small part of Saturday.

Summary:
#Person2# tells #Person1# that her husband makes something delicious every week, and makes dinner every night. #Person2# and her husband both do the cleaning.

-----

Dialogue:
#Person1#: Hi, Albert. You kno

### Human/Qualitative Evaluation

In [12]:
from tqdm import tqdm
import pandas as pd
pd.set_option('display.max_colwidth', None)

def create_summarization_model_result(num_test_examples: int, output_csv_file: str, random_sample: bool = False) -> pd.DataFrame:
    if random_sample:
        test_examples_indices = random.sample(range(TEST_DATA_COUNT), num_test_examples)
    else:
        test_examples_indices = range(num_test_examples)

    test_indices, test_dialogues, test_summaries = [],[],[]
    zero_shot_prediction_summaries_1, zero_shot_prediction_summaries_2, zero_shot_prediction_summaries_3 = [],[],[]
    one_shot_prediction_summaries_1, one_shot_prediction_summaries_2, one_shot_prediction_summaries_3 = [],[],[]
    few_shot_prediction_summaries_1, few_shot_prediction_summaries_2, few_shot_prediction_summaries_3 = [],[],[]

    for test_index in tqdm(test_examples_indices):
        test_dialogue = dataset["test"][test_index]["dialogue"]
        test_summary = dataset["test"][test_index]["summary"]
        test_indices.append(test_index)
        test_dialogues.append(test_dialogue)
        test_summaries.append(test_summary)

        for prompt_index in range(1,4):
            zero_shot_prompt = create_zero_shot_prompt(dialogue = test_dialogue, prompt_index = prompt_index)
            zero_shot_output = generate_llm_prediction(zero_shot_prompt)
            one_shot_prompt = create_one_shot_prompt(dialogue = test_dialogue, example_index=prompt_index, prompt_index = prompt_index)
            one_shot_output = generate_llm_prediction(one_shot_prompt)
            few_shot_prompt = create_few_shot_prompt(dialogue = test_dialogue, num_shots=3, prompt_index = prompt_index)
            few_shot_output = generate_llm_prediction(few_shot_prompt)

            if prompt_index == 1:
                zero_shot_prediction_summaries_1.append(zero_shot_output)
                one_shot_prediction_summaries_1.append(one_shot_output)
                few_shot_prediction_summaries_1.append(few_shot_output)
            elif prompt_index == 2:
                zero_shot_prediction_summaries_2.append(zero_shot_output)
                one_shot_prediction_summaries_2.append(one_shot_output)
                few_shot_prediction_summaries_2.append(few_shot_output)
            else:
                zero_shot_prediction_summaries_3.append(zero_shot_output)
                one_shot_prediction_summaries_3.append(one_shot_output)
                few_shot_prediction_summaries_3.append(few_shot_output)

    df = pd.DataFrame({'Index':test_indices,'Dialogue':test_dialogues,'Gold Summary':test_summaries,
                       'Zero_Shot_Pred_1':zero_shot_prediction_summaries_1,'Zero_Shot_Pred_2':zero_shot_prediction_summaries_2,'Zero_Shot_Pred_3':zero_shot_prediction_summaries_3,
                       'One_Shot_Pred_1':one_shot_prediction_summaries_1,'One_Shot_Pred_2':one_shot_prediction_summaries_2,'One_Shot_Pred_3':one_shot_prediction_summaries_3,
                       'Few_Shot_Pred_1':few_shot_prediction_summaries_1,'Few_Shot_Pred_2':few_shot_prediction_summaries_2,'Few_Shot_Pred_3':few_shot_prediction_summaries_3,
                      })
    df.to_csv(output_csv_file,index=False)
    return df

In [14]:
df = create_summarization_model_result(num_test_examples = 5, output_csv_file = 'Summarization_Evaluation_Sample_5.csv', random_sample = True)
df.head()

  0%|          | 0/5 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (1384 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:34<00:00,  6.90s/it]
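
The tokenizer warning above (1384 > 512) shows that at least one prompt exceeded the tokenizer's stated 512-token maximum, which is most likely to happen with the few-shot prompts. One hedged way to handle this is to add demonstrations only while the prompt stays within a token budget. The sketch below uses a whitespace split as a stand-in token counter, an assumption made to keep the example self-contained; in the notebook one would count with `len(tokenizer(text)["input_ids"])` instead. `count_tokens` and `fit_examples_to_budget` are hypothetical helpers.

```python
def count_tokens(text: str) -> int:
    # Stand-in counter (assumption): whitespace-separated words, not real tokens.
    return len(text.split())

def fit_examples_to_budget(instruction, examples, dialogue, max_tokens=512):
    """Add (dialogue, summary) examples while the whole prompt stays within max_tokens."""
    query = f"\n\nDialogue:\n{dialogue}\n\nSummary:\n"
    shots = ""
    for example_dialogue, example_summary in examples:
        candidate = shots + f"\n\nDialogue:\n{example_dialogue}\n\nSummary:\n{example_summary}\n\n-----"
        if count_tokens(instruction + candidate + query) > max_tokens:
            break  # the next example would overflow the budget
        shots = candidate
    return instruction + shots + query

examples = [("A: hi B: hello", "Two people greet."),
            ("A: bye B: see you", "Two people part.")]
prompt = fit_examples_to_budget("Summarize the following Dialogue", examples,
                                "A: thanks B: welcome", max_tokens=25)
print(prompt.count("-----"))  # number of examples that fit in the budget
```

The test dialogue itself is never dropped; only the number of demonstrations shrinks.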


Unnamed: 0,Index,Dialogue,Gold Summary,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,1106,"#Person1#: Good evening. How many people of your party?\n#Person2#: Three. Two adults and one kid.\n#Person1#: For buffet?\n#Person2#: Yes. How much do you charge for it?\n#Person1#: Thirty for each adult, twenty each kid.\n#Person2#: I see. Where can I get the food?\n#Person1#: Please go to the tables over there for cold dishes and vegetables. The hot dishes are on the other side.\n#Person2#: Do I need to pay extra charges for drinks like cola and juice?\n#Person1#: Not for soft drinks. But we charge ten yuan for each alcohol order.",#Person2# tells #Person1# the charge policy at #Person2#'s buffet.,The party is going to be a great time.,The price is ten yuan.,The party is going to be a great evening.,The party is going to be a party.,The party is going to be a great evening.,The party is going to be a party.,The party is going to be a great evening.,The party is going to be a great evening.,The party is going to be a party.
1,1413,"#Person1#: Good morning. How can I help you?\n#Person2#: I'd like to open a new account.\n#Person1#: Have you filled out an application form?\n#Person2#: Yes. And I've brought some documents along with me, too. Do you need to see my passport?\n#Person1#: Yes. I'll just have my assistant look over these quickly and then we'll move on to the next step. Did you want to open up a checking account and a savings account?\n#Person2#: Yes. Does the checking account come with a debit card?\n#Person1#: Yes. Actually, both accounts come with cards that you can use in ATM machines, so that you won't have to come in to the bank to make a transaction.\n#Person2#: That's very convenient.\n#Person1#: It is. Our customers really like it. Do you have any other questions about your new accounts?\n#Person2#: Yes. What's the maximum amount that you are allowed to have in an overdraft?\n#Person1#: The maximum is $ 1000.\n#Person2#: Is there a penalty for having an overdraft?\n#Person1#: Yes, but it's not much. You just have to pay 1 % interest on the account. It's much lower rate than any of our loans and it's much better than owing money to most credit cards.\n#Person2#: That's true. Is everything alright with my documents?\n#Person1#: They're all in order. If you just sign your name here, you'll receive your cards and pin numbers in the mail in about three weeks.\n#Person2#: Thank you very much.\n#Person1#: You're welcome.","#Person1# helps #Person2# to open a new account. #Person1# answers #Person2#'s questions about the debit card, the maximum amount in an overdraft, and the penalty for having an overdraft.",You're welcome.,You're welcome.,The bank has a new account.,You're welcome.,#Person1#'s looking for a new account.,#Person1#: I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. I'm sorry to hear that. 
I'm sorry to hear that. I'm sorry to hear,#Person1#: I'd like to open a new account. I've brought some documents along with me. I'll have my assistant look over these quickly and then we'll move on to the next step. Did you want to open up a checking account and a savings account?,How can I help you?,You're welcome.
2,378,"#Person1#: It's time for desserts! Are you still hungry?\n#Person2#: I've always got room for something sweet!\n#Person1#: what are you going to try first?\n#Person2#: I've never tried traditional Greek yogurt, so I want to try that first.\n#Person1#: do they serve the yogurt with anything?\n#Person2#: I believe they add locally produced honey to it.\n#Person1#: that sounds good. I'm going to start with an Italian tiramisu.\n#Person2#: do you want to try some of my yogurt. It's a favorite everyday dessert in Greece.\n#Person1#: ok. Mmm.\n#Person2#: what do you think? How does it taste?\n#Person1#: it's nice, but it's rather plain. Do you want to try my tiramisu?\n#Person2#: sure. I'll just have a bite.\n#Person1#: what do you think? Does it taste good?\n#Person2#: it's absolutely delicious! That is the best tiramisu I've ever had!\n#Person1#: I'm glad you like it. I don't care for it. Why don't you finish my tiramisu so that I can try one of those fried bananas?\n#Person2#: ok. I've had one of those before. They're really sweet and crunchy.\n#Person1#: do you know where they are from.\n#Person2#: I believe they are a local delicacy in the South.\n#Person1#: do you want me to get you one, too?\n#Person2#: yeah, why not? We've already pigged out as it is!\n#Person1#: ok, I'll be back with two fried bananas in a few minutes. Wait for me here!","#Person2# has traditional Greek yogurt, which #Person1# thinks rather plain. #Person1# has an Italian tiramisu, which #Person2# thinks delicious. #Person1# goes and gets both of them a fried banana.",Person1#: I'm going to try some of my tiramisu. I'm going to try some Italian tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu.,Person1#: I'm going to try some of my tiramisu.,Person1#: I'm going to try some of my tiramisu.,The desserts are coming up.,Person1#: I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. 
I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramis,#Person1#: I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tiramisu. I'm going to try some of my tir,The desserts are coming up.,The desserts are coming up.,The desserts are coming up.
3,822,"#Person1#: Is there a bus that'll go all the way to Sons from PHS?\n#Person2#: Where is this Sons located?\n#Person1#: The Sons on Fair Oaks and Orange Grove.\n#Person2#: You're going to need to take two buses to get to that Sons.\n#Person1#: Which buses will I have to take?\n#Person2#: First, you need to get on the 268 going west.\n#Person1#: Then what do I do?\n#Person2#: You need to get off on Fair Oaks and Washington.\n#Person1#: What's next?\n#Person2#: Get on the 261, and it'll take you the rest of the way to Sons.\n#Person1#: There's nothing else?\n#Person2#: That's all there is to it.",#Person2# tells #Person1# the bus route to get to Sons.,You're going to need to take two buses to get to Sons.,You'll need to take two buses to get to Sons.,You're going to have to take two buses to get to Sons.,You'll need to take two buses to get to Sons.,You'll need to take two buses to get to Sons.,You're going to need to take two buses to get to Sons.,You're going to need to take two buses to get to Sons.,You'll need to take two buses to get to Sons.,You're going to need to take two buses to get to Sons.
4,787,"#Person1#: Hey Sarah, are you all right? You look upset.\n#Person2#: As a matter of fact, I am a bit upset. I just came out of a meeting and it didn't go very well.\n#Person1#: What happened?\n#Person2#: No one would listen to any of my suggestions. Instead, they just kept arguing with each other.\n#Person1#: Who was chairing the meeting?\n#Person2#: Bob.\n#Person1#: Well, I can tell you from experience that Bob might come off a little strong sometimes.\n#Person2#: That's exactly what happened! He kept interrupting everyone with his own suggestions and did not want to hear what others had to say. Then he expected everyone to agree with him.\n#Person1#: What was the meeting about?\n#Person2#: We were trying to come up with ideas to streamline the office's workflow to make it more efficient.\n#Person1#: It's ironic that the meeting was anything but efficient.\n#Person2#: Exactly. I had tons of ideas that I wanted to share, but they just wouldn't let me finish. What should I have done to get my point across?\n#Person1#: You have to keep things short and sweet. When you get a chance to speak, try not to get into too many unnecessary details.\n#Person2#: Short and sweet? But what if I have to explain something complicated?\n#Person1#: You can always bring up the main points during the meeting and speak to those who are directly involved after the meeting. Not everyone needs to know all that information.\n#Person2#: That's a good idea, I think I will try that at the next meeting.","Sarah is upset because Bob kept interrupting everyone else during a meeting, making it impossible to elaborate her ideas. #Person1# gives Sarah a useful tip to get her point across at the next meeting.",The meeting was a bit abysmal.,The meeting was a bit abysmal.,The meeting was not well.,The meeting was not well. It was a very short meeting.,The meeting was a bit abysmal. It was a meeting that everyone was arguing with.,The meeting was not well.,The meeting was not well. 
It was a very difficult meeting.,The meeting was not well. It was a very short meeting.,The meeting was not well.


#### Human Analysis:

1. The current model's summaries are poor, especially in the zero-shot and one-shot settings. The improvement in the few-shot results points to the model's in-context learning capability.
2. Since human evaluation takes time, we also analyze quantitative results in the next section to get a better understanding of performance on the test set.
3. The results can be improved by:
   - Experimenting with other prompts
   - Experimenting with different generation parameters (e.g. `do_sample=True` to try other decoding strategies)
   - Fine-tuning the model on the data
   - Using a model with more parameters
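
The decoding-parameter suggestion can be illustrated with a toy example. The sketch below is not the model's decoder; it applies a softmax with temperature to three made-up logits to show how a low temperature (such as the 0.1 in the commented-out `GenerationConfig` earlier) concentrates probability on the top token, while a higher one flattens the distribution. All numbers are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    # Divide logits by the temperature, then normalise with a numerically stable softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With sampling enabled, a very low temperature behaves almost like greedy decoding, so larger values are needed to see noticeably different summaries.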

### Metric/Quantitative Evaluation

We will use the **[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)** metric for our evaluation; it is the most commonly used metric for summarization tasks.
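
To make the metric concrete: ROUGE-1 scores the unigram overlap between a prediction and a reference. The following is a simplified sketch, assuming lowercase whitespace tokenisation and no stemming, so its numbers will not exactly match the `rouge` metric from `evaluate`; `rouge1_f1` is a hypothetical helper and the two sentences are made-up inputs.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the order has been delayed", "the order was delayed for two days"))
```

Here three words overlap ("the", "order", "delayed") out of five predicted and seven reference words, giving precision 3/5 and recall 3/7.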

In [15]:
!pip install evaluate
!pip install rouge-score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)





In [16]:
import evaluate
rouge = evaluate.load('rouge')


In [17]:
def compute_rouge(predictions,gold_answers):
    results = rouge.compute(predictions=predictions,
                            references=gold_answers,
                            use_stemmer = True,
                            use_aggregator=True)
    return results

def create_rouge_eval_result(df: pd.DataFrame, output_csv_file: str, gold_column: str) -> pd.DataFrame:
    
    gold_answers = df[gold_column].tolist()
    zero_shot_prediction_answers_1, zero_shot_prediction_answers_2, zero_shot_prediction_answers_3 = df['Zero_Shot_Pred_1'].tolist(), df['Zero_Shot_Pred_2'].tolist(), df['Zero_Shot_Pred_3'].tolist()
    one_shot_prediction_answers_1, one_shot_prediction_answers_2, one_shot_prediction_answers_3 = df['One_Shot_Pred_1'].tolist(), df['One_Shot_Pred_2'].tolist(), df['One_Shot_Pred_3'].tolist()
    few_shot_prediction_answers_1, few_shot_prediction_answers_2, few_shot_prediction_answers_3 = df['Few_Shot_Pred_1'].tolist(), df['Few_Shot_Pred_2'].tolist(), df['Few_Shot_Pred_3'].tolist()
    
    # Column labels follow the key names returned by rouge.compute()
    columns = ['rouge1', 'rouge2','rougeL', 'rougeLsum']
    rows = ['Zero_Shot_Prompt_1', 'Zero_Shot_Prompt_2', 'Zero_Shot_Prompt_3', 'One_Shot_Prompt_1', 'One_Shot_Prompt_2', 'One_Shot_Prompt_3','Few_Shot_Prompt_1', 'Few_Shot_Prompt_2', 'Few_Shot_Prompt_3']
    df_res = pd.DataFrame(index=rows, columns=columns)
    
    df_res.loc['Zero_Shot_Prompt_1'] = list(compute_rouge(zero_shot_prediction_answers_1,gold_answers).values())
    df_res.loc['Zero_Shot_Prompt_2'] = list(compute_rouge(zero_shot_prediction_answers_2,gold_answers).values())
    df_res.loc['Zero_Shot_Prompt_3'] = list(compute_rouge(zero_shot_prediction_answers_3,gold_answers).values())
    
    df_res.loc['One_Shot_Prompt_1'] = list(compute_rouge(one_shot_prediction_answers_1,gold_answers).values())
    df_res.loc['One_Shot_Prompt_2'] = list(compute_rouge(one_shot_prediction_answers_2,gold_answers).values())
    df_res.loc['One_Shot_Prompt_3'] = list(compute_rouge(one_shot_prediction_answers_3,gold_answers).values())
    
    df_res.loc['Few_Shot_Prompt_1'] = list(compute_rouge(few_shot_prediction_answers_1,gold_answers).values())
    df_res.loc['Few_Shot_Prompt_2'] = list(compute_rouge(few_shot_prediction_answers_2,gold_answers).values())
    df_res.loc['Few_Shot_Prompt_3'] = list(compute_rouge(few_shot_prediction_answers_3,gold_answers).values())
    
    df_res.to_csv(output_csv_file)
    return df_res
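
The `rougeL` column is based on the longest common subsequence (LCS) between prediction and reference, which rewards words appearing in the same order rather than counting n-grams. A simplified sketch, assuming lowercase whitespace tokenisation and no stemming (so values will differ from the `rouge_score` implementation); `lcs_length` and `rouge_l_f1` are hypothetical helpers and the sentences are made-up inputs.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Classic dynamic programme: table[i][j] is the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, start=1):
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def rouge_l_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("you need to take two buses", "take two buses to get there"))
```

In this example the longest common subsequence is "take two buses", three of the six words on each side.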
	

In [18]:
df = create_summarization_model_result(num_test_examples = 100, output_csv_file = 'Summarization_Evaluation_100_samples.csv', random_sample = True)
df.head()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [12:30<00:00,  7.51s/it]


Unnamed: 0,Index,Dialogue,Gold Summary,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,855,"#Person1#: Tomorrow is Mike's birthday. I have just received the invitation to his party. Did Mike invite you, too?\n#Person2#: Yes. I received his invitation this morning. But he didn't tell me what time the party will begin.\n#Person1#: I'll ring him up and ask him about it. How will you go to his party?\n#Person2#: I'll drive to his party after work. Would you like to take my car there?\n#Person1#: I would be glad to. Thank you.",#Person1# and #Person2# are going to Mike's birthday party tomorrow.,Mike invited you to his party tomorrow.,Mike invited you to his party tomorrow.,Mike invited him to his party.,Mike invited him to his party. He didn't tell him what time the party will begin. He will drive to his party after work.,Mike invited him to his party. He didn't tell him what time the party will begin. He will drive to his party after work.,Mike invited him to his party. He didn't tell him what time the party will begin. He will drive to his party after work.,Mike invited him to his party tomorrow. He didn't tell him what time the party will begin. He will drive to his party after work.,Mike invited him to his party. He didn't tell him what time the party will begin. He will drive to his party after work.,Mike invited him to his party. He didn't tell him what time the party will begin. He will drive to his party after work.
1,995,"#Person1#: I want to make sure my son receives this letter. It has an important certificate in it.\n#Person2#: You can send it either by certified mail or registered mail. If you only want to make sure it is received, send it by certified mail. It's less expensive.\n#Person1#: OK. How about this package?\n#Person2#: What's in it?\n#Person1#: A watch.\n#Person2#: You should insure it for the value of the watch. And send it by registered mail if it's more expensive. As it's the safest way.",#Person1# will send a certificate by certified mail and a watch by registered mail.,You can send it by registered mail.,You can send it by registered mail.,You should send the watch to the recipient.,Send the package to your son.,You can send a certificate to your son.,Send the package to the person who is sending the package.,Send the package to your son.,You can send a certificate to your son.,You can send a package to your son.
2,935,"#Person1#: Thank you for organizing this great baby shower for me! I'Ve always been to baby showers but never actually had one held for me! Let's get started!\n#Person2#: Ok, let's start opening some presents!\n#Person1#: Oh look! What a great little bib for the baby! This will definitely come in handy! Oh wow, you also got me a stroller! That's so great! Thank you!\n#Person2#: This next one is from Betty.\n#Person1#: A highchair and car seat! Wow Betty, thank you so much! I really appreciate it!\n#Person2#: One more from Carla.\n#Person1#: A playpen and crib! Thanks Carla! This is just what I needed!\n#Person2#: OK, that's all of them. No more gifts. Now who wants to guess when the baby is due?\n#Person1#: Umm. I think my water just broke! Get me to a hospital!",#Person2# organized a great baby shower for #Person1#. #Person1# receives many gifts. #Person1#'s water broke.,The baby shower is going to be a great one!,The baby shower is going to be a great one!,The baby is due to be born in the hospital.,The baby shower is going to be a great one!,#Person1#: I'm going to get a baby shower. I'm going to get a baby shower.,#Person1#: I'm going to get a baby shower for my baby.,#Person1#: Thank you for organizing this great baby shower for me!,#Person1#: I'm going to get a baby shower. I'm going to get a baby shower.,#Person1#: I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get a baby shower. I'm going to get
3,1457,#Person1#: Hi. What's up? \n#Person2#: Nothing much. What's new with you? \n#Person1#: Not too much. I've been pretty busy. \n#Person2#: Me too. Seems like all I do is eat and sleep. \n#Person1#: Gotta go. Call me tonight. \n#Person2#: Okay. Check you later.,#Person1# is busy while #Person2# is flexible.,I'm sorry.,Person1#: I'm not sure what to do.,Person1#: I'm not sure what to do.,The new person is coming to her.,"Person1#: Hi, I'm sorry. I'm not sure what to do. I'm going to call you tonight.","Person1#: Hi, I'm sorry. I'm not sure what to do. I'm going to call you tonight.",,"Person1#: Hi, I'm sorry. I'm not sure what to do tonight.",The new person is coming to her.
4,446,"#Person1#: Hi, Dan, I'm calling to check on that order of 100 computers were the tenth of September. However, it has been delayed for 2 days.\n#Person2#: Yes, I know. I mean to call you and tell you that the factory is short of hands at the moment. They say they can get the order to you by the eighteenth.\n#Person1#: Oh, that's too late. If you can give me Steve's phone number, I'll call him and tell him about this. Do you have his number handy?\n#Person2#: Yes, it's 87506638.\n#Person1#: Sorry, is that double 6 or double 3?\n#Person2#: Double 6.\n#Person1#: I suppose he can't really complain. Those computers are a bargain.\n#Person2#: Exactly. A few days, it shouldn't make that much difference. Thanks for understanding, Darlene.\n#Person1#: No problem.",Darlen calls Dan to check the delayed order of computers. Dan explains to her the reason for the delay. Darlene decides to talk to Steven.,The order has been delayed for 2 days.,The order has been delayed for 2 days.,The order has been delayed for 2 days.,The order has been delayed for 2 days.,The order has been delayed for 2 days.,Dan is calling to check on the order of 100 computers. It has been delayed for 2 days. Steve will call him and tell him about it.,#Person1#: I'm calling to check on the order of 100 computers. It's delayed for 2 days. Steve will call him and tell him about it.,The order has been delayed for 2 days. Steve will call him and tell him about it.,The order has been delayed for 2 days.


In [49]:
df_res = create_rouge_eval_result(df, output_csv_file = 'Summarization_Evaluation_100_samples_Metrics.csv', gold_column = 'Gold Summary') 
df_res.head(12)

| | rouge1 | rouge2 | rougeL | rougeLSum |
|---|---|---|---|---|
| Zero_Shot_Prompt_1 | 0.183994 | 0.052109 | 0.160555 | 0.161329 |
| Zero_Shot_Prompt_2 | 0.208274 | 0.059867 | 0.181422 | 0.182395 |
| Zero_Shot_Prompt_3 | 0.223767 | 0.061977 | 0.193929 | 0.194391 |
| One_Shot_Prompt_1 | 0.224383 | 0.068509 | 0.194876 | 0.195347 |
| One_Shot_Prompt_2 | 0.219647 | 0.060113 | 0.189187 | 0.189051 |
| One_Shot_Prompt_3 | 0.22811 | 0.062997 | 0.200011 | 0.199574 |
| Few_Shot_Prompt_1 | 0.209409 | 0.05746 | 0.180622 | 0.180401 |
| Few_Shot_Prompt_2 | 0.217498 | 0.062132 | 0.190294 | 0.191047 |
| Few_Shot_Prompt_3 | 0.21155 | 0.057521 | 0.184962 | 0.184321 |


#### Task 2 Summary
1. The ROUGE scores do not differ substantially across the prompting methods: ROUGE-1 ranges from about 0.18 to 0.23.
2. The results could be improved by
   - Experimenting with other prompts
   - Experimenting with different generation parameters of the model (e.g. `do_sample=True` to switch from greedy decoding to sampling)
   - Fine-tuning the model on the task data
   - Using a model with more parameters
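
The `do_sample` suggestion above is about how the next token is chosen at each generation step. The following is a toy, library-free sketch of the difference between greedy decoding and temperature sampling over a single made-up next-token distribution. The candidate tokens, their scores, and the helper names are hypothetical and are not taken from FLAN-T5.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, logits):
    """Greedy decoding: always take the highest-scoring token (deterministic)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best]

def sample_pick(tokens, logits, temperature=1.0, rng=random):
    """Sampling: draw a token in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token candidates and scores, for illustration only.
tokens = ["shower", "party", "gift", "hospital"]
logits = [2.0, 1.5, 0.5, 0.1]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print("greedy :", greedy_pick(tokens, logits))
print("sampled:", [sample_pick(tokens, logits, temperature=1.2, rng=rng) for _ in range(5)])
```

In Hugging Face `generate`, the corresponding switches are `do_sample`, `temperature`, `top_k` and `top_p`; sampling may reduce the repetition loops visible in some of the summaries, at the cost of run-to-run variation.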

## Task 3: Verify if the Q&A task works.

We will experiment with samples from the SQuAD dataset for the Q&A task. We treat this as a **generative QA** task: the model generates the answer text instead of predicting a span.

SQuAD, short for Stanford Question Answering Dataset, is a dataset designed for training and evaluating question answering systems. It consists of real questions posed by humans on a set of Wikipedia articles, where the answer to each question is a specific span of text within the corresponding article. The dataset is widely used in natural language processing (NLP) as a benchmark for how well models understand and answer questions. The version loaded below is SQuAD 2.0 (`rajpurkar/squad_v2`), which also contains unanswerable questions; for those, the list of gold answers is empty.
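
To make the "span of text" idea concrete, here is a minimal sketch of how an `answer_start` offset maps back into the context, and how an unanswerable SQuAD 2.0 question (empty answer lists) can be detected. The two records are hypothetical and only mimic the field layout of the dataset; `answer_span` is an illustrative helper, not part of the `datasets` library.

```python
def answer_span(record):
    """Return the first gold answer sliced out of the context, or None if unanswerable."""
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if not texts:  # SQuAD v2 marks unanswerable questions with empty lists
        return None
    start = starts[0]
    return record["context"][start:start + len(texts[0])]

# Hypothetical records that follow the SQuAD v2 field layout.
context = "Normandy is a region in France."
answerable = {
    "context": context,
    "question": "In what country is Normandy located?",
    "answers": {"text": ["France"], "answer_start": [context.index("France")]},
}
unanswerable = {
    "context": context,
    "question": "Who founded Normandy?",
    "answers": {"text": [], "answer_start": []},
}

print(answer_span(answerable))    # text recovered from the character offset
print(answer_span(unanswerable))  # no gold answer available
```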

In [67]:
import datasets
dataset_name = "rajpurkar/squad_v2"
dataset = datasets.load_dataset(dataset_name)

TRAINING_DATA_COUNT = len(dataset['train'])
TEST_DATA_COUNT = len(dataset['validation'])

print(f"Train dataset size: {TRAINING_DATA_COUNT}")
print(f"Test dataset size: {TEST_DATA_COUNT}")

Train dataset size: 130319
Test dataset size: 11873


In [68]:
# Sample Data
from pprint import pprint 
sample_data = next(iter(dataset['train']))
pprint(sample_data)

{'answers': {'answer_start': [269], 'text': ['in the late 1990s']},
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born '
            'September 4, 1981) is an American singer, songwriter, record '
            'producer and actress. Born and raised in Houston, Texas, she '
            'performed in various singing and dancing competitions as a child, '
            'and rose to fame in the late 1990s as lead singer of R&B '
            "girl-group Destiny's Child. Managed by her father, Mathew "
            "Knowles, the group became one of the world's best-selling girl "
            "groups of all time. Their hiatus saw the release of Beyoncé's "
            'debut album, Dangerously in Love (2003), which established her as '
            'a solo artist worldwide, earned five Grammy Awards and featured '
            'the Billboard Hot 100 number-one singles "Crazy in Love" and '
            '"Baby Boy".',
 'id': '56be85543aeaaa14008c9063',
 'question': 'When did

In [69]:
def read_data(dataset: datasets.DatasetDict, data_type: str) -> dict:
    """Collect contexts, questions and gold answers from one split ('train' or 'validation')."""

    contexts = []
    questions = []
    answers = []

    for sample_data in dataset[data_type]:
        contexts.append(sample_data.get('context',''))
        questions.append(sample_data.get('question',''))
        if 'answers' in sample_data and 'text' in sample_data['answers']:
            # List of gold answer strings; empty for unanswerable questions
            answers.append(sample_data['answers']['text'])
        else:
            answers.append([])
    return {'contexts':contexts, 'questions':questions, 'answers':answers}

train_data = read_data(dataset,'train')
valid_data = read_data(dataset,'validation')

#### Zero-Shot Inference

In [70]:
def create_zero_shot_prompt(context: str, question: str, prompt_index: int) -> str:
    """Build a zero-shot QA prompt using one of three instruction wordings (prompt_index 1-3)."""
    if prompt_index == 1:
        qa_zero_shot_prompt = f"""Given a Context and a Question, utilize the context to answer the question. Don't use any other information\n\nContext:\n{context}\n\nQuestion:\n{question}
        """

    elif prompt_index == 2:
        qa_zero_shot_prompt = f"""Answer the given Question based on the provided Context. Don't use any other information\n\nContext:\n{context}\n\nQuestion:\n{question}
        """

    elif prompt_index == 3:
        qa_zero_shot_prompt = f"""Given a Context and a Question, extract the relevant content from the context that answers the given question. Don't generate anything on your own. Return only the extracted content\n\nContext:\n{context}\n\nQuestion:\n{question}
        """
    else:
        # Without this branch an unknown index fails later with an UnboundLocalError
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")
    return qa_zero_shot_prompt

In [71]:
print(create_zero_shot_prompt(context=dataset["validation"][0]["context"], question=dataset["validation"][0]["question"], prompt_index=1))

Given a Context and a Question, utilize the context to answer the question. Don't use any other information

Context:
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.

Question:
In what country is Normandy located?
        


#### One-Shot Inference

In [72]:
def create_one_shot_prompt(context: str, question: str, example_index: int, prompt_index: int) -> str:
    """Build a one-shot QA prompt; the worked example is taken from the training split."""

    example = dataset["train"][example_index]  # fetch the row once
    example_context = example.get('context','')
    example_question = example.get('question','')
    # Unanswerable SQuAD v2 questions have an empty answer list, so check the length before indexing
    if 'answers' in example and 'text' in example['answers'] and len(example['answers']['text'])>0:
        example_answer = example['answers']['text'][0]
    else:
        example_answer = ''

    if prompt_index == 1:
        qa_one_shot_prompt = f"""Given a Context and a Question, utilize the context to answer the question. Don't use any other information\n\nContext:\n{example_context}\n\nQuestion:\n{example_question}\n\nAnswer:\n{example_answer}\n\n-----\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n
        """
    
    elif prompt_index == 2:
        qa_one_shot_prompt = f"""Answer the given Question based on the provided Context. Don't use any other information\n\nContext:\n{example_context}\n\nQuestion:\n{example_question}\n\nAnswer:\n{example_answer}\n\n-----\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n
        """

    elif prompt_index == 3:
        qa_one_shot_prompt = f"""Given a Context and a Question, extract the relevant content from the context that answers the given question. Don't generate anything on your own. Return only the extracted content\n\nContext:\n{example_context}\n\nQuestion:\n{example_question}\n\nAnswer:\n{example_answer}\n\n-----\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n
        """
    else:
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")

    return qa_one_shot_prompt

In [73]:
print(create_one_shot_prompt(context=dataset["validation"][0]["context"], question=dataset["validation"][0]["question"], example_index=2, prompt_index=2))

Answer the given Question based on the provided Context. Don't use any other information

Context:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

Question:
When did Beyonce leave Destiny's Child and become a solo singer?

Answer:
2003

-----

Context:
The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th 

#### Few-Shot Inference

In [74]:
import random

def create_few_shot_prompt(context: str, question: str, num_shots: int, prompt_index: int) -> str:
    """Build a few-shot QA prompt from num_shots randomly drawn training examples."""

    # Examples are re-sampled on every call, so prompts differ between runs unless random.seed() is set
    indices = random.sample(range(TRAINING_DATA_COUNT), num_shots)

    if prompt_index == 1:
        qa_few_shot_prompt = "Given a Context and a Question, utilize the context to answer the question. Don't use any other information"

    elif prompt_index == 2:
        qa_few_shot_prompt = "Answer the given Question based on the provided Context. Don't use any other information"

    elif prompt_index == 3:
        qa_few_shot_prompt = "Given a Context and a Question, extract the relevant content from the context that answers the given question. Don't generate anything on your own. Return only the extracted content"
    else:
        raise ValueError(f"prompt_index must be 1, 2 or 3, got {prompt_index}")

    for example_index in indices:
        example = dataset["train"][example_index]  # fetch the row once
        example_context = example.get('context','')
        example_question = example.get('question','')
        # Unanswerable questions have no gold answer; use an empty string for them
        if 'answers' in example and 'text' in example['answers'] and len(example['answers']['text'])>0:
            example_answer = example['answers']['text'][0]
        else:
            example_answer = ''
        qa_few_shot_prompt += f"\n\nContext:\n{example_context}\n\nQuestion:\n{example_question}\n\nAnswer:\n{example_answer}\n\n-----"

    qa_few_shot_prompt += f"\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

    return qa_few_shot_prompt
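
The few-shot builder above appends every sampled example regardless of length, and the 100-sample run later in this notebook logs a warning about a 756-token sequence against a 512-token limit. One possible mitigation, sketched here under assumptions, is to add examples only while the prompt stays within a budget. `build_budgeted_prompt` and `whitespace_token_count` are hypothetical helpers; counting whitespace-separated words is only a rough stand-in for the model's tokenizer, which usually reports more tokens.

```python
def whitespace_token_count(text):
    """Rough proxy for token count; a real tokenizer will usually report more tokens."""
    return len(text.split())

def build_budgeted_prompt(instruction, examples, context, question,
                          max_tokens=512, count_tokens=whitespace_token_count):
    """Add few-shot examples only while the full prompt stays within max_tokens."""
    query = f"\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"
    prompt = instruction
    used = 0
    for ex_context, ex_question, ex_answer in examples:
        shot = (f"\n\nContext:\n{ex_context}\n\nQuestion:\n{ex_question}"
                f"\n\nAnswer:\n{ex_answer}\n\n-----")
        if count_tokens(prompt + shot + query) > max_tokens:
            break  # the next example would push the prompt over budget
        prompt += shot
        used += 1
    return prompt + query, used

# Hypothetical toy data for illustration.
examples = [
    ("Paris is the capital of France.", "What is the capital of France?", "Paris"),
    ("word " * 50, "What is repeated?", "word"),
]
prompt, used = build_budgeted_prompt(
    "Answer the given Question based on the provided Context.",
    examples, "Rollo led the Norse.", "Who was the Norse leader?", max_tokens=40)
print(f"examples kept: {used}")
print(prompt)
```

The sketch does not shorten the test context itself, so a very long context can still exceed the limit with zero examples. To count real tokens, a function based on the model tokenizer could be passed as `count_tokens`.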

In [75]:
print(create_few_shot_prompt(context=dataset["validation"][0]["context"], question=dataset["validation"][0]["question"], num_shots=2, prompt_index=2))

Answer the given Question based on the provided Context. Don't use any other information

Context:
The governing bodies in each country operate league systems in a domestic season, normally comprising several divisions, in which the teams gain points throughout the season depending on results. Teams are placed into tables, placing them in order according to points accrued. Most commonly, each team plays every other team in its league at home and away in each season, in a round-robin tournament. At the end of a season, the top team is declared the champion. The top few teams may be promoted to a higher division, and one or more of the teams finishing at the bottom are relegated to a lower division.

Question:
What could happen to the top few teams at the end of the season?

Answer:
promoted to a higher division

-----

Context:
GE's history of working with turbines in the power-generation field gave them the engineering know-how to move into the new field of aircraft turbosuperchargers.

### Human/Qualitative Evaluation

In [76]:
from tqdm import tqdm
import pandas as pd
pd.set_option('display.max_colwidth', None)

def create_qa_model_result(num_test_examples: int, output_csv_file: str, random_sample: bool = False) -> pd.DataFrame:
    if random_sample:
        test_examples_indices = random.sample(range(TEST_DATA_COUNT), num_test_examples)
    else:
        test_examples_indices = range(num_test_examples)

    test_indices, test_contexts, test_questions, test_answers = [],[],[],[]
    zero_shot_prediction_answers_1, zero_shot_prediction_answers_2, zero_shot_prediction_answers_3 = [],[],[]
    one_shot_prediction_answers_1, one_shot_prediction_answers_2, one_shot_prediction_answers_3 = [],[],[]
    few_shot_prediction_answers_1, few_shot_prediction_answers_2, few_shot_prediction_answers_3 = [],[],[]

    for test_index in tqdm(test_examples_indices):
        test_context = dataset["validation"][test_index].get('context','')
        test_question = dataset["validation"][test_index].get('question','')
        if 'answers' in dataset["validation"][test_index] and 'text' in dataset["validation"][test_index]['answers'] and len(dataset["validation"][test_index]['answers']['text'])>0:
            test_answer = dataset["validation"][test_index]['answers']['text'][0]
        else:
            test_answer = ''
             
        test_indices.append(test_index)
        test_contexts.append(test_context)
        test_questions.append(test_question)
        test_answers.append(test_answer)

        for prompt_index in range(1,4):
            zero_shot_prompt = create_zero_shot_prompt(context = test_context, question = test_question, prompt_index = prompt_index)
            zero_shot_output = generate_llm_prediction(zero_shot_prompt)
            one_shot_prompt = create_one_shot_prompt(context = test_context, question = test_question, example_index=prompt_index, prompt_index = prompt_index)
            one_shot_output = generate_llm_prediction(one_shot_prompt)
            few_shot_prompt = create_few_shot_prompt(context = test_context, question = test_question, num_shots=3, prompt_index = prompt_index)
            few_shot_output = generate_llm_prediction(few_shot_prompt)

            if prompt_index == 1:
                zero_shot_prediction_answers_1.append(zero_shot_output)
                one_shot_prediction_answers_1.append(one_shot_output)
                few_shot_prediction_answers_1.append(few_shot_output)
            elif prompt_index == 2:
                zero_shot_prediction_answers_2.append(zero_shot_output)
                one_shot_prediction_answers_2.append(one_shot_output)
                few_shot_prediction_answers_2.append(few_shot_output)
            else:
                zero_shot_prediction_answers_3.append(zero_shot_output)
                one_shot_prediction_answers_3.append(one_shot_output)
                few_shot_prediction_answers_3.append(few_shot_output)

    df = pd.DataFrame({'Index':test_indices,'Context':test_contexts,'Question':test_questions,'Gold Summary':test_answers,
                       'Zero_Shot_Pred_1':zero_shot_prediction_answers_1,'Zero_Shot_Pred_2':zero_shot_prediction_answers_2,'Zero_Shot_Pred_3':zero_shot_prediction_answers_3,
                       'One_Shot_Pred_1':one_shot_prediction_answers_1,'One_Shot_Pred_2':one_shot_prediction_answers_2,'One_Shot_Pred_3':one_shot_prediction_answers_3,
                       'Few_Shot_Pred_1':few_shot_prediction_answers_1,'Few_Shot_Pred_2':few_shot_prediction_answers_2,'Few_Shot_Pred_3':few_shot_prediction_answers_3,
                      })
    df.to_csv(output_csv_file,index=False)
    return df

In [30]:
df = create_qa_model_result(num_test_examples = 5, output_csv_file = 'QA_Evaluation_Sample_5.csv', random_sample = True)
df.head()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:20<00:00,  4.08s/it]


Unnamed: 0,Index,Context,Question,Gold Summary,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,11083,"One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745.",What year was Casimir Wola born in Warsaw?,,1745,1745,1745,1745,1745,1745,1745,1745,1745
1,1043,"On 18 November 2015, Sky announced Sky Q, a range of products and services to be available in 2016. The Sky Q range consists of three set top boxes (Sky Q, Sky Q Silver and Sky Q Mini), a broadband router (Sky Q Hub) and mobile applications. The Sky Q set top boxes introduce a new user interface, Wi-Fi hotspot functionality, Power-line and Bluetooth connectivity and a new touch-sensitive remote control. The Sky Q Mini set top boxes connect to the Sky Q Silver set top boxes with a Wi-Fi or Power-line connection rather than receive their own satellite feeds. This allows all set top boxes in a household to share recordings and other media. The Sky Q Silver set top box is capable of receiving and displaying UHD broadcasts, which Sky will introduce later in 2016.",What is the name of Sky Q's dial-up router?,,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub,Sky Q Hub
2,5885,"The modern trend in design is toward integration of previously separated specialties, especially among large firms. In the past, architects, interior designers, engineers, developers, construction managers, and general contractors were more likely to be entirely separate companies, even in the larger firms. Presently, a firm that is nominally an ""architecture"" or ""construction management"" firm may have experts from all related fields as employees, or to have an associated company that provides each necessary skill. Thus, each such firm may offer itself as ""one-stop shopping"" for a construction project, from beginning to end. This is designated as a ""design build"" contract where the contractor is given a performance specification and must undertake the project from design to construction, while adhering to the performance specifications.",The modern trend in design is toward integration of what?,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties,previously separated specialties
3,4495,"Merit Network, Inc., an independent non-profit 501(c)(3) corporation governed by Michigan's public universities, was formed in 1966 as the Michigan Educational Research Information Triad to explore computer networking between three of Michigan's public universities as a means to help the state's educational and economic development. With initial support from the State of Michigan and the National Science Foundation (NSF), the packet-switched network was first demonstrated in December 1971 when an interactive host to host connection was made between the IBM mainframe computer systems at the University of Michigan in Ann Arbor and Wayne State University in Detroit. In October 1972 connections to the CDC mainframe at Michigan State University in East Lansing completed the triad. Over the next several years in addition to host to host interactive connections the network was enhanced to support terminal to host connections, host to host batch connections (remote job submission, remote printing, batch file transfer), interactive file transfer, gateways to the Tymnet and Telenet public data networks, X.25 host attachments, gateways to X.25 data networks, Ethernet attached hosts, and eventually TCP/IP and additional public universities in Michigan join the network. All of this set the stage for Merit's role in the NSFNET project starting in the mid-1980s.",What was eventual Merits role?,,TCP/IP and additional public universities in Michigan join the network,TCP/IP and additional public universities in Michigan join the network,TCP/IP and additional public universities in Michigan,TCP/IP and additional public universities in Michigan join the network,TCP/IP and additional public universities in Michigan,TCP/IP and additional public universities in Michigan,TCP/IP and additional public universities in Michigan,TCP/IP and additional public universities in Michigan,TCP/IP
4,7184,"Socialists attribute the vast disparities in wealth to the private ownership of the means of production by a class of owners, creating a situation where a small portion of the population lives off unearned property income by virtue of ownership titles in capital equipment, financial assets and corporate stock. By contrast, the vast majority of the population is dependent on income in the form of a wage or salary. In order to rectify this situation, socialists argue that the means of production should be socially owned so that income differentials would be reflective of individual contributions to the social product.",How do socialists think the means of production shouldn't be owned?,,so that income differentials would be reflective of individual contributions to the social product,so that income differentials would be reflective of individual contributions to the social product.,"Socialists attribute the vast disparities in wealth to the private ownership of the means of production by a class of owners, creating a situation where a small portion of the population lives off unearned property income by virtue of ownership titles in capital equipment, financial assets and corporate stock",unanswerable,so that income differentials would be reflective of individual contributions to the social product,"Socialists attribute the vast disparities in wealth to the private ownership of the means of production by a class of owners, creating a situation where a small portion of the population lives off unearned property income by virtue of ownership titles in capital equipment, financial assets and corporate stock.",unanswerable,so that income differentials would be reflective of individual contributions to the social product,Socialists attribute the vast disparities in wealth to the private ownership of the means of production by a class of owners


#### Human Analysis:

1. The model's QA results are noticeably better than its summarization results. It generally returns a correct answer, although the answer is not always as specific as the gold answer. Rows with an empty gold answer are unanswerable SQuAD v2 questions; the model still produces an answer for most of them.
2. The next section looks at quantitative results on a larger sample of the test set, which gives a better picture than these five examples.
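
Because several gold answers in the table above are empty, it can help to score answerable and unanswerable rows separately. The sketch below does this for the `Gold Summary` and `One_Shot_Pred_1` values copied from the five rows above. The helper names are hypothetical, and treating the literal output `unanswerable` as an abstention is an assumption about how to read the model's output.

```python
def normalize(text):
    """Lowercase, trim, drop a trailing period and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match_rate(pairs):
    """Share of (gold, prediction) pairs that match after normalization."""
    if not pairs:
        return 0.0
    return sum(normalize(g) == normalize(p) for g, p in pairs) / len(pairs)

def abstention_rate(pairs, marker="unanswerable"):
    """Share of pairs where the prediction is the abstention marker."""
    if not pairs:
        return 0.0
    return sum(normalize(p) == marker for _, p in pairs) / len(pairs)

# (gold, One_Shot_Pred_1) pairs copied from the five sample rows above.
rows = [
    ("", "1745"),
    ("", "Sky Q Hub"),
    ("previously separated specialties", "previously separated specialties"),
    ("", "TCP/IP and additional public universities in Michigan join the network"),
    ("", "unanswerable"),
]

answerable = [(g, p) for g, p in rows if g]
unanswerable = [(g, p) for g, p in rows if not g]

print("exact match on answerable rows  :", exact_match_rate(answerable))
print("abstention on unanswerable rows :", abstention_rate(unanswerable))
```

With only five rows these rates are anecdotal; the same split applied to a larger sample would show how much of the score is driven by unanswerable questions.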

### Metric/Quantitative Evaluation

We will use the **[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)** metric for our evaluation.
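
As a reminder of what the metric measures, here is a simplified sketch of ROUGE-1 F1 as unigram overlap between a prediction and a reference. It uses lowercase whitespace tokenization and no stemming, so it is an approximation for intuition only; the `evaluate` metric uses its own tokenization and may give different numbers. `rouge1_f1` is an illustrative helper, not a library function.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0  # nothing to overlap with when either side is empty
    # Multiset intersection counts each shared unigram at most as often as it appears on both sides
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("10th", "10th century"), 3))  # partial credit for a shorter answer
print(rouge1_f1("France", "France"))                # identical strings
print(rouge1_f1("1745", ""))                        # empty reference
```

Under this simplified definition a prediction scored against an empty gold answer gets 0, so unanswerable questions would pull the average down whenever the model answers them.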

In [77]:
import evaluate
rouge = evaluate.load('rouge')
df = create_qa_model_result(num_test_examples = 100, output_csv_file = 'QA_Evaluation_100_samples.csv', random_sample = False)
df.head()


  0%|          | 0/100 [00:00<?, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (756 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [05:48<00:00,  3.49s/it]


Unnamed: 0,Index,Context,Question,Gold Summary,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,0,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",In what country is Normandy located?,France,France,France,France,France,France,France,France,France,France
1,1,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",When were the Normans in Normandy?,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries,10th and 11th centuries
2,2,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",From which countries did the Norse originate?,"Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway","Denmark, Iceland and Norway"
3,3,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",Who was the Norse leader?,Rollo,Rollo,Rollo,Rollo,Rollo,Rollo,Rollo,Rollo,Rollo,Rollo
4,4,"The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (""Norman"" comes from ""Norseman"") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",What century did the Normans first gain their separate identity?,10th century,10th century,10th century,10th century,10th century,10th century,10th century,10th,10th,10th


In [78]:
df_res = create_rouge_eval_result(df, output_csv_file = 'QA_Evaluation_100_samples_Metrics.csv', gold_column = 'Gold Summary') 
df_res.head(12)

| | rouge1 | rouge2 | rougeL | rougeLSum |
|---|---|---|---|---|
| Zero_Shot_Prompt_1 | 0.372901 | 0.166798 | 0.373461 | 0.371901 |
| Zero_Shot_Prompt_2 | 0.378381 | 0.164298 | 0.377333 | 0.376214 |
| Zero_Shot_Prompt_3 | 0.32178 | 0.14269 | 0.319766 | 0.322031 |
| One_Shot_Prompt_1 | 0.359464 | 0.15469 | 0.359 | 0.359774 |
| One_Shot_Prompt_2 | 0.362279 | 0.156524 | 0.36136 | 0.363381 |
| One_Shot_Prompt_3 | 0.356888 | 0.156524 | 0.355547 | 0.357417 |
| Few_Shot_Prompt_1 | 0.358107 | 0.134821 | 0.355845 | 0.357607 |
| Few_Shot_Prompt_2 | 0.358917 | 0.143857 | 0.357607 | 0.358917 |
| Few_Shot_Prompt_3 | 0.332827 | 0.13144 | 0.329642 | 0.330827 |


#### Task 3 Summary
1. The results are broadly similar across methods, with ROUGE-1 between about 0.32 and 0.38 (mean about 0.36). Adding in-context examples did not improve on the zero-shot prompts.
2. The results could be further improved by
   - Experimenting with other prompts
   - Experimenting with different generation parameters of the model (e.g. `do_sample=True` to switch from greedy decoding to sampling)
   - Fine-tuning the model on the task data
   - Using a model with more parameters
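
The averages behind the first point can be recomputed directly from the ROUGE-1 column of the metrics table above. This short sketch copies those nine values by hand and groups them by prompting strategy; the dictionary layout is only for illustration.

```python
from statistics import mean

# ROUGE-1 values copied from the QA metrics table (prompts 1-3 per strategy).
rouge1 = {
    "Zero_Shot": [0.372901, 0.378381, 0.32178],
    "One_Shot": [0.359464, 0.362279, 0.356888],
    "Few_Shot": [0.358107, 0.358917, 0.332827],
}

per_strategy = {name: round(mean(scores), 3) for name, scores in rouge1.items()}
overall = round(mean(s for scores in rouge1.values() for s in scores), 3)

print(per_strategy)
print("overall mean ROUGE-1:", overall)
```

The per-strategy means lie within about 0.01 of each other, which is a small gap for a 100-example sample.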

## Task 4: Verify if English to French translation task works.

We will experiment with samples from the **[DiaBLa](https://huggingface.co/datasets/rbawden/DiaBLa)** dataset for the translation task.

The dataset is an English-French dataset for the evaluation of Machine Translation (MT) for informal, written bilingual dialogue.

The dataset contains 144 spontaneous dialogues (5,700+ sentences) between native English and French speakers, mediated by one of two neural MT systems in a range of role-play settings. The dialogues are accompanied by fine-grained sentence-level judgments of MT quality, produced by the dialogue participants themselves, as well as by manually normalised versions and reference translations produced a posteriori.

In [33]:
import datasets
dataset_name = "rbawden/DiaBLa"
dataset = datasets.load_dataset(dataset_name)

TEST_DATA_COUNT = len(dataset['test'])
print(f"Test dataset size: {TEST_DATA_COUNT}")

Test dataset size: 5748


In [34]:
TRAINING_DATA_COUNT = 5000
TEST_DATA_COUNT = 748

In [35]:
# Sample Data
from pprint import pprint 
sample_data = next(iter(dataset['test']))
pprint(sample_data)

{'dialogue_history': [],
 'dialogue_meta': {'end_time': '',
                   'final_evaluation_user1': {'coherence': 'average',
                                              'grammaticality': 'good',
                                              'meaning': 'average',
                                              'style': 'average',
                                              'word_choice': 'average'},
                   'final_evaluation_user2': {'coherence': '',
                                              'grammaticality': '',
                                              'meaning': '',
                                              'style': '',
                                              'word_choice': ''},
                   'scenario': [['You are both stuck in a lift at work.',
                                 'Vous êtes tous les deux bloqué(e)s dans un '
                                 'ascenseur au travail.'],
                                ['You are an employee and you 

In [36]:
pprint(dataset['test'][4])

{'dialogue_history': [{'id': 'dialogue-2018-04-25T16-20-36.087170_french_english_1_2_0',
                       'mt': 'On semble avoir arrêté de bouger.',
                       'norm': '',
                       'orig': 'We appear to have stopped moving.',
                       'ref': "J'ai l'impression qu'on s'est arrêtés.",
                       'utterance_meta': {'eval_judgment': 'medium',
                                          'eval_problems': ['style'],
                                          'eval_verbatim': '',
                                          'lang': 'english'}},
                      {'id': 'dialogue-2018-04-25T16-20-36.087170_french_english_1_2_1',
                       'mt': 'Je ne te paye pas pour rester là.',
                       'norm': '',
                       'orig': "I don't pay you to just stand there.",
                       'ref': 'Je ne vous paye pas à rester là debout à rien '
                              'faire.',
                       'u

In [37]:
def read_data(dataset: datasets.DatasetDict) -> tuple[dict, dict]:
    """Build English-source / French-target pairs and split them into train and test."""
    source_texts = []
    target_texts = []

    for sample_data in dataset['test']:
        lang = sample_data['utterance_meta']['lang']
        if lang == 'french':
            # Utterance written in French: 'ref' is its English reference translation
            source_texts.append(sample_data.get('ref',''))
            target_texts.append(sample_data.get('orig',''))
        else:
            # Utterance written in English: 'ref' is its French reference translation
            source_texts.append(sample_data.get('orig',''))
            target_texts.append(sample_data.get('ref',''))

    # The first TRAINING_DATA_COUNT pairs supply in-context examples; the rest are evaluated
    train = {'sources': source_texts[:TRAINING_DATA_COUNT],
             'targets': target_texts[:TRAINING_DATA_COUNT]}
    test = {'sources': source_texts[TRAINING_DATA_COUNT:],
            'targets': target_texts[TRAINING_DATA_COUNT:]}
    return train, test

train_data, test_data = read_data(dataset)
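
The orientation logic in `read_data` can be checked on two small records. This is a minimal sketch: `to_english_french_pair` is a hypothetical helper that mirrors the branch above, the first record reuses field values from the dialogue history printed earlier, and the second record is made up for illustration.

```python
def to_english_french_pair(sample: dict) -> tuple[str, str]:
    """Return (english, french) for one DiaBLa-style utterance record."""
    if sample["utterance_meta"]["lang"] == "french":
        # French original: the reference translation is the English side
        return sample.get("ref", ""), sample.get("orig", "")
    # English original: the reference translation is the French side
    return sample.get("orig", ""), sample.get("ref", "")


english_utterance = {  # values copied from the printed sample
    "orig": "We appear to have stopped moving.",
    "ref": "J'ai l'impression qu'on s'est arrêtés.",
    "utterance_meta": {"lang": "english"},
}
french_utterance = {  # hypothetical record, for illustration only
    "orig": "Bonjour, comment allez-vous ?",
    "ref": "Hello, how are you?",
    "utterance_meta": {"lang": "french"},
}

for record in (english_utterance, french_utterance):
    english, french = to_english_french_pair(record)
    print(f"EN: {english} | FR: {french}")
```

Either way the first element is English and the second is French, which is what the English-to-French prompts below rely on.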

#### Zero-Shot Inference

In [38]:
def create_zero_shot_prompt(source: str, prompt_index: int) -> str:
    if prompt_index == 1:
        translation_zero_shot_prompt = f"""Translate the following dialogue from English to French\n\nEnglish:\n{source}\n\nFrench:\n
        """

    elif prompt_index == 2:
        translation_zero_shot_prompt = f"""Please provide a French translation of the following English dialogue\n\nEnglish:\n{source}\n\nFrench:\n
        """

    elif prompt_index == 3:
        translation_zero_shot_prompt = f"""Translate the following dialogue from English to French. Return only the translated French sentence\n\nEnglish:\n{source}\n\nFrench:\n
        """
    return translation_zero_shot_prompt
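
Each triple-quoted f-string above ends with a line break and the indentation of the closing quotes, so every prompt carries trailing whitespace, and an unexpected `prompt_index` leaves the variable unassigned. The sketch below is one possible table-driven alternative; `INSTRUCTIONS` and `build_zero_shot_prompt` are hypothetical names, and whether the trailing whitespace affects the model's output has not been tested here.

```python
INSTRUCTIONS = {
    1: "Translate the following dialogue from English to French",
    2: "Please provide a French translation of the following English dialogue",
    3: ("Translate the following dialogue from English to French. "
        "Return only the translated French sentence"),
}


def build_zero_shot_prompt(source: str, prompt_index: int) -> str:
    """Hypothetical variant of create_zero_shot_prompt without trailing whitespace."""
    if prompt_index not in INSTRUCTIONS:
        raise ValueError(f"prompt_index must be one of {sorted(INSTRUCTIONS)}")
    return f"{INSTRUCTIONS[prompt_index]}\n\nEnglish:\n{source}\n\nFrench:\n"


print(build_zero_shot_prompt("how long have you been feeling like this?", 1))
```

Keeping the instruction texts in one dictionary also lets the one-shot and few-shot builders share them instead of repeating each string three times.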

In [39]:
print(create_zero_shot_prompt(source=test_data['sources'][0], prompt_index=1))

Translate the following dialogue from English to French

English:
how long have you been feeling like this?

French:

        


In [40]:
def create_one_shot_prompt(source: str,  example_index: int, prompt_index: int) -> str:

    example_source = train_data['sources'][example_index]
    example_target = train_data['targets'][example_index]

    if prompt_index == 1:
        translation_one_shot_prompt = f"""Translate the following dialogue from English to French\n\nEnglish:\n{example_source}\n\nFrench:\n{example_target}\n\n-----\n\nEnglish:\n{source}\n\nFrench:\n
        """

    elif prompt_index == 2:
        translation_one_shot_prompt = f"""Please provide a French translation of the following English dialogue\n\nEnglish:\n{example_source}\n\nFrench:\n{example_target}\n\n-----\n\nEnglish:\n{source}\n\nFrench:\n
        """

    elif prompt_index == 3:
        translation_one_shot_prompt = f"""Translate the following dialogue from English to French. Return only the translated French sentence\n\nEnglish:\n{example_source}\n\nFrench:\n{example_target}\n\n-----\n\nEnglish:\n{source}\n\nFrench:\n
        """
    return translation_one_shot_prompt


In [41]:
print(create_one_shot_prompt(source=test_data['sources'][0],  example_index=3, prompt_index=1))

Translate the following dialogue from English to French

English:
You're totally right. I'll try to call reception.

French:
Vous avez tout à fait raison, je vais essayer de téléphoner à l'accueil.

-----

English:
how long have you been feeling like this?

French:

        


In [42]:
def create_few_shot_prompt(source: str, num_shots: int, prompt_index: int) -> str:

    indices = random.sample(range(TRAINING_DATA_COUNT), num_shots)

    if prompt_index == 1:
        translation_few_shot_prompt = f"""Translate the following dialogue from English to French
        """

    elif prompt_index == 2:
        translation_few_shot_prompt = f"""Please provide a French translation of the following English dialogue
        """

    elif prompt_index == 3:
        translation_few_shot_prompt = f"""Translate the following dialogue from English to French. Return only the translated French sentence
        """

    for example_index in indices:
        example_source = train_data['sources'][example_index]
        example_target = train_data['targets'][example_index]
        translation_few_shot_prompt += f"\n\nEnglish:\n{example_source}\n\nFrench:\n{example_target}\n\n-----"

    translation_few_shot_prompt += f"\n\nEnglish:\n{source}\n\nFrench:\n"
    
    return translation_few_shot_prompt
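
Because `create_few_shot_prompt` draws its examples with the global `random` state, two runs of the evaluation will generally see different in-context examples. A minimal sketch of a reproducible draw follows; `pick_example_indices` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import random


def pick_example_indices(pool_size: int, num_shots: int, seed: int) -> list[int]:
    """Draw few-shot example indices from a private, seeded generator."""
    rng = random.Random(seed)  # leaves the global random state untouched
    return rng.sample(range(pool_size), num_shots)


first = pick_example_indices(pool_size=5000, num_shots=3, seed=0)
second = pick_example_indices(pool_size=5000, num_shots=3, seed=0)
print(first == second)  # True: the same seed gives the same examples
```

One option would be to derive the seed from the test index, so that all three prompt variants for a given test sentence are compared on identical examples.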


In [43]:
print(create_few_shot_prompt(source=test_data['sources'][0], num_shots=2, prompt_index=1))

Translate the following dialogue from English to French
        

English:
So have you made a guest list for the party?

French:
Alors, est-ce que tu as préparé une liste d'invités pour la soirée ?

-----

English:
I think I need a plaster.

French:
J'ai besoin d'un pansement.

-----

English:
how long have you been feeling like this?

French:



### Human/Qualitative Evaluation

In [44]:
from tqdm import tqdm
import pandas as pd
pd.set_option('display.max_colwidth', None)

def create_translation_model_result(num_test_examples: int, output_csv_file: str, random_sample: bool = False) -> pd.DataFrame:
    if random_sample:
        test_examples_indices = random.sample(range(TEST_DATA_COUNT), num_test_examples)
    else:
        test_examples_indices = range(num_test_examples)

    test_indices, test_sources, test_targets = [],[],[]
    zero_shot_prediction_targets_1, zero_shot_prediction_targets_2, zero_shot_prediction_targets_3 = [],[],[]
    one_shot_prediction_targets_1, one_shot_prediction_targets_2, one_shot_prediction_targets_3 = [],[],[]
    few_shot_prediction_targets_1, few_shot_prediction_targets_2, few_shot_prediction_targets_3 = [],[],[]

    for test_index in tqdm(test_examples_indices):
        test_source = test_data['sources'][test_index]
        test_target = test_data['targets'][test_index]
        test_indices.append(test_index)
        test_sources.append(test_source)
        test_targets.append(test_target)

        for prompt_index in range(1,4):
            zero_shot_prompt = create_zero_shot_prompt(source = test_source, prompt_index = prompt_index)
            zero_shot_output = generate_llm_prediction(zero_shot_prompt)
            one_shot_prompt = create_one_shot_prompt(source = test_source, example_index=prompt_index, prompt_index = prompt_index)
            one_shot_output = generate_llm_prediction(one_shot_prompt)
            few_shot_prompt = create_few_shot_prompt(source = test_source, num_shots=3, prompt_index = prompt_index)
            few_shot_output = generate_llm_prediction(few_shot_prompt)

            if prompt_index == 1:
                zero_shot_prediction_targets_1.append(zero_shot_output)
                one_shot_prediction_targets_1.append(one_shot_output)
                few_shot_prediction_targets_1.append(few_shot_output)
            elif prompt_index == 2:
                zero_shot_prediction_targets_2.append(zero_shot_output)
                one_shot_prediction_targets_2.append(one_shot_output)
                few_shot_prediction_targets_2.append(few_shot_output)
            else:
                zero_shot_prediction_targets_3.append(zero_shot_output)
                one_shot_prediction_targets_3.append(one_shot_output)
                few_shot_prediction_targets_3.append(few_shot_output)

    df = pd.DataFrame({'Index':test_indices,'Source':test_sources,'Gold Targets':test_targets,
                       'Zero_Shot_Pred_1':zero_shot_prediction_targets_1,'Zero_Shot_Pred_2':zero_shot_prediction_targets_2,'Zero_Shot_Pred_3':zero_shot_prediction_targets_3,
                       'One_Shot_Pred_1':one_shot_prediction_targets_1,'One_Shot_Pred_2':one_shot_prediction_targets_2,'One_Shot_Pred_3':one_shot_prediction_targets_3,
                       'Few_Shot_Pred_1':few_shot_prediction_targets_1,'Few_Shot_Pred_2':few_shot_prediction_targets_2,'Few_Shot_Pred_3':few_shot_prediction_targets_3,
                      })
    df.to_csv(output_csv_file,index=False)
    return df

In [45]:
df = create_translation_model_result(num_test_examples = 5, output_csv_file = 'Translation_Evaluation_Sample_5.csv', random_sample = True)
df.head()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:47<00:00,  9.58s/it]


Unnamed: 0,Index,Source,Gold Targets,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,345,"Oh no, it's just us!","Oh non, on est entre nous!","Oh no, il s'est just us!","Oh no, it's just us!","Oh no, il s'est just us!","Oh no, il y a pas à rester là de rien faire.","Oh no, it's just us!","Oh, il n'est pas seulement nous!","Oh, il y a pas d'être à l'égard de vous!","Oh no, it's just us!","Oh, il n'est pas seulement nous!"
1,386,Nothing else?,Rien d'autre ?,,,,Je ne vous avez pas à rester là de rien faire.,,"Vous avez tout à fait raison, je vais essayer de téléphoner à l'accueil.",,,
2,430,"I was doing as you asked, chef, but since I was busy making the starter for table 6, I didn't see the time go by and I let the cream burn...","J'ai fait comme vous m'avez demandé, chef, mais comme j'étais aussi occupée à faire l'entrée pour la table 6, j'ai pas fait attention au temps et j'ai laissé brûler la crème...","Je suis d'adopter à l'heure, chef, mais depuis que je suis d'adopter le starter pour le table 6, je ne peux pas voir l'heure et j'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai","I was doing as you asked, chef, but since I was busy making the starter for table 6, I didn't see the time go by and I let the cream burn...","Je suis d'adopter comme vous avez demandé, chef, mais depuis que je suis d'adopter le starter pour le table 6, je ne peux pas voir l'heure et j'ai l'adopter le cream...","Je a fait l'assaint, chef, mais depuis que j'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai",,"Je a fait l'adoption à l'heure, chef, mais depuis que j'ai l'adoption à l'entrée pour le table 6, je ne peux pas voir l'heure et j'ai l'adoption à l'adoption à l'adoption à l'adoption à l'ad","Je suis d'adopter comme vous avez demandé, chef, mais depuis que je suis d'adopter le starter pour le table 6, je ne peux pas voir l'heure et j'ai l'adopter le cream...","Je suis d'adopter à l'heure, chef, mais depuis longtemps j'ai l'heure pour le table 6, je ne peux pas voir l'heure et j'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l","Je a fait l'assaint, chef, mais depuis que j'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai l'ai"
3,281,Oh dear.,Ouh là.,"Oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d",Oh dear.,"Oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d","Oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d",Oh dear.,"Oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d","Oh, dear.","Oh, dear.","Oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d'oh, d"
4,499,A star!!,Une star !!,A star!!,A star!!,A star!,A star!,A star!,A star!,A star!,A star!,A star!


#### Human Analysis:

1. The French output is difficult to judge by hand, although obvious failures such as untranslated or looping output stand out in the sample above
2. We therefore analyze quantitative results in the next section, computed on a larger sample of the test set
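
Some failure modes in the sample table can be detected without reading French: empty predictions, the English source returned unchanged, and a token repeated many times in a row. The sketch below is a simple heuristic for flagging them; `flag_prediction` and the `max_repeats` threshold are assumptions for illustration, not part of the notebook.

```python
def flag_prediction(source: str, prediction: str, max_repeats: int = 3) -> list[str]:
    """Return heuristic failure flags for one model prediction."""
    flags = []
    text = prediction.strip()
    if not text:
        flags.append("empty")
    elif text.lower() == source.strip().lower():
        flags.append("copied_source")

    # Flag a run of more than max_repeats identical consecutive tokens
    tokens = text.split()
    run = 1
    for previous, current in zip(tokens, tokens[1:]):
        run = run + 1 if current == previous else 1
        if run > max_repeats:
            flags.append("repetition")
            break
    return flags


print(flag_prediction("Nothing else?", ""))
print(flag_prediction("Oh dear.", "Oh dear."))
print(flag_prediction("Oh dear.", "Oh, d'oh, d'oh, d'oh, d'oh, d'oh"))
print(flag_prediction("A star!!", "Une star !!"))
```

Counting these flags per prompt column would give a quick sanity check alongside the overlap metrics, since a copied English sentence can still share tokens with the French reference.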

### Metric/Quantitative Evaluation

We will use the **[ROUGE](https://huggingface.co/spaces/evaluate-metric/rouge)** metric for our evaluation.
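
To make the metric concrete, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It assumes lowercasing and whitespace tokenization, which differs from the tokenizer used by the `evaluate` implementation, so its numbers will not match the tables below exactly; `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over unigrams shared by prediction and reference."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# 3 shared unigrams: precision 3/3, recall 3/4
print(round(rouge1_f1("le chat dort", "le chat dort bien"), 3))  # 0.857
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.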

In [46]:
import evaluate
rouge = evaluate.load('rouge')
df = create_translation_model_result(num_test_examples = 100, output_csv_file = 'Translation_Evaluation_100_samples.csv', random_sample = True)
df.head()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [11:56<00:00,  7.16s/it]


Unnamed: 0,Index,Source,Gold Targets,Zero_Shot_Pred_1,Zero_Shot_Pred_2,Zero_Shot_Pred_3,One_Shot_Pred_1,One_Shot_Pred_2,One_Shot_Pred_3,Few_Shot_Pred_1,Few_Shot_Pred_2,Few_Shot_Pred_3
0,516,The lift engineer or the receptionist?,L'ascensoriste ou le réceptionniste ?,Le ingénieur de lift ou le réception?,Le ingénieur de lift ou le réception?,Le ingénieur de lift ou le réception?,Le ingénieur de lift ou le réception?,,Le ingénieur de lift ou le reservisteur?,Le ingénieur de lift ou le réception?,Le ingénieur de lift ou le réception?,Le ingénieur de lift ou le réception?
1,584,I'm guessing her child is just as beautiful as she is.,Je suis sûre que son enfant est tout aussi mignon qu'elle.,Je pense que elle s'est sa mère sa couleur sa couleur.,Je pense que elle a sa mère s'est juste mais belle que elle s'est.,Je pense que elle s'est sa mère sa couleur sa couleur.,Je pense que elle s'est tellement belle que elle s'est.,Je pense que elle a sa mère s'est juste mais belle que elle s'est.,Je pense que elle s'est tellement belle que elle s'est.,Je pense que elle s'est tellement belle que elle s'est.,Je pense que elle s'est aussi belle que elle s'est.,Je pense que elle s'est tellement belle que elle s'est.
2,23,if that's what you can even call it,Si on peut appeler ça comme ça.,si ce serait l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de,if that's what you can even call it French,Si cela est ce qui vous pouvez ne peux pas s'est-elle-même.,Je ne vous avez pas à rester là de rien faire.,if that's what you can even call it,"Vous avez tout à fait raison, je vais essayer de téléphoner à l'accueil.",Si cela est l'un vrai!,"Si cela est l'unique de l'enseignement universitaire pour changer son comportement au quotidien, et pour encontrer son entourage à faire même.",Je peux aussi s'adresser à l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard de l'égard
3,686,Are there chickens there too?,Est-ce qu'il y a aussi des poulets ?,,,,,,,,Are there chickens too?,
4,664,Or perhaps not.,Ou peut-être pas,"Or, ne peut pas.","Or, ne peut pas.","Or, ne peut pas.","Or, ne peut pas pas.",,"Or, ne peut pas pas.","Or, ne peut pas.","Or, ne peut pas.","Or, ne peut pas."


In [51]:
df_res = create_rouge_eval_result(df, output_csv_file = 'Translation_Evaluation_100_samples_Metrics.csv', gold_column = 'Gold Targets') 
df_res.head(12)

Unnamed: 0,rouge1,rouge2,rougeL,rougeLSum
Zero_Shot_Prompt_1,0.25992,0.088658,0.248204,0.24477
Zero_Shot_Prompt_2,0.20455,0.066176,0.193535,0.190966
Zero_Shot_Prompt_3,0.280298,0.105562,0.267427,0.264743
One_Shot_Prompt_1,0.239909,0.082398,0.227984,0.224784
One_Shot_Prompt_2,0.12633,0.041883,0.118497,0.116902
One_Shot_Prompt_3,0.264272,0.087616,0.250971,0.248329
Few_Shot_Prompt_1,0.249927,0.081767,0.237022,0.233343
Few_Shot_Prompt_2,0.23116,0.07876,0.220261,0.217108
Few_Shot_Prompt_3,0.259871,0.091511,0.24707,0.244429


#### Task 4 Summary
1. The results are broadly similar across setups, with ROUGE-1 between 0.20 and 0.28; the exception is One_Shot_Prompt_2 at 0.13. Prompt 3 scores highest in every setup
2. The results could be improved by
   - Experimenting with other prompts
   - Experimenting with different generation parameters of the model (e.g. do_sample=True to switch decoding strategy)
   - Fine-tuning the model on task data
   - Using a model with more parameters
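
The ROUGE-1 column of the table above can be averaged per prompting setup and per prompt to make the comparison explicit. The values are copied from the table; the grouping code is only a sketch, and with 100 samples and randomly drawn few-shot examples the differences should be read with caution.

```python
from statistics import mean

# ROUGE-1 scores from the table above, ordered as prompt 1, 2, 3
rouge1 = {
    "Zero_Shot": [0.25992, 0.20455, 0.280298],
    "One_Shot": [0.239909, 0.12633, 0.264272],
    "Few_Shot": [0.249927, 0.23116, 0.259871],
}

for setup, scores in rouge1.items():
    print(f"{setup}: mean ROUGE-1 = {mean(scores):.3f}")

for position in range(3):
    per_prompt = mean(scores[position] for scores in rouge1.values())
    print(f"Prompt {position + 1}: mean ROUGE-1 = {per_prompt:.3f}")
```

On these numbers the zero-shot and few-shot means are nearly identical (about 0.25), the one-shot mean is pulled down by prompt 2, and prompt 3 has the highest mean of the three prompts.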

## Task 5: Programmatically print the names of all the model layers and their dimensions.

In [52]:
model_name="google/flan-t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
for name, param in model.named_parameters():
  print(name, param.shape)

shared.weight torch.Size([32128, 512])
encoder.block.0.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.o.weight torch.Size([512, 384])
encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight torch.Size([32, 6])
encoder.block.0.layer.0.layer_norm.weight torch.Size([512])
encoder.block.0.layer.1.DenseReluDense.wi_0.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wi_1.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wo.weight torch.Size([512, 1024])
encoder.block.0.layer.1.layer_norm.weight torch.Size([512])
encoder.block.1.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.o

## Task 6: Programmatically print the total number of parameters/weights in this model.

In [53]:
def print_numbers_of_parameters(model):
    num_params=0
    for param in model.parameters():
        num_params += param.numel()
    print(f"Total Parameters: {num_params}")
  
print_numbers_of_parameters(model)

Total Parameters: 76961152
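
The total can be reproduced by hand from the layer shapes. The sketch below assumes 8 encoder and 8 decoder blocks, decoder blocks that add a cross-attention sub-layer with the same shapes as self-attention, a relative attention bias only in the first block of each stack, and an `lm_head` of shape [32128, 512] that is separate from `shared.weight`; the decoder shapes are assumptions because the printed listing is truncated.

```python
d_model, d_attn, d_ff, vocab = 512, 384, 1024, 32128
num_blocks = 8
relative_bias = 32 * 6  # relative_attention_bias.weight

attention = 3 * d_attn * d_model + d_model * d_attn   # q, k, v and o projections
feed_forward = 2 * d_ff * d_model + d_model * d_ff    # wi_0, wi_1 and wo
layer_norm = d_model                                  # scale only, no bias

encoder_block = attention + layer_norm + feed_forward + layer_norm
decoder_block = 2 * (attention + layer_norm) + feed_forward + layer_norm

encoder = num_blocks * encoder_block + relative_bias + layer_norm  # + final_layer_norm
decoder = num_blocks * decoder_block + relative_bias + layer_norm  # + final_layer_norm
embeddings = vocab * d_model  # shared.weight
lm_head = vocab * d_model

total = embeddings + encoder + decoder + lm_head
print(total)  # 76961152
```

The result matches the count printed by the notebook. The two vocabulary-sized matrices together account for roughly 43% of all parameters.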


## Task 7: Set the tensor in final layer (decoder.final_layer_norm.weight) to all zeros.

In [54]:
import torch

model_name="google/flan-t5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
print('Original Weight:')
print(model.decoder.final_layer_norm.weight.shape)
print(model.decoder.final_layer_norm.weight[:100])
model.decoder.final_layer_norm.weight = torch.nn.parameter.Parameter(model.decoder.final_layer_norm.weight*0)
print('Weight after Setting to zeros:')
print(model.decoder.final_layer_norm.weight.shape)
print(model.decoder.final_layer_norm.weight[:100])
model.save_pretrained("updated_flant5-small")

Original Weight:
torch.Size([512])
tensor([ 0.1558,  0.1646,  0.1820,  0.2079,  0.1589,  0.1422,  0.1585,  0.1427,
         0.1365,  0.1570,  0.1667,  0.1327,  0.1798,  0.3268,  0.2090,  0.2623,
         0.1838,  0.1857,  0.1811,  0.1959,  0.1546,  0.2135,  0.1513,  0.1635,
         0.1806,  0.1441,  0.1797,  0.2065,  0.1790,  0.2043,  0.1641,  0.1499,
         0.1387,  0.2249,  0.1704,  0.6170,  0.1823,  0.1758,  0.1611,  0.2402,
         0.1628,  0.2287,  0.1613,  0.1843,  0.2164,  0.2677,  0.1847,  0.1596,
         0.2500,  0.1959,  0.1547,  0.2002,  0.1702,  0.1439,  0.1979,  0.1590,
         0.1490,  0.1504,  0.2603,  0.1593,  0.1508,  0.2010,  0.1984,  0.1558,
         0.1526,  0.1565,  0.1676,  0.7530,  0.1664,  0.1540,  0.0463,  0.1646,
        -0.0066,  0.1754,  0.1569,  0.2540,  0.1964,  0.2072,  0.2011,  0.2037,
         0.2167,  0.1654,  0.1696,  0.1270,  0.1451,  0.1714,  0.1802,  0.1709,
         0.1647,  0.2128,  0.1757,  0.1353,  0.1522,  0.1424,  0.1464,  0.2132,
     

In [55]:
model = AutoModelForSeq2SeqLM.from_pretrained("updated_flant5-small")
print(model.decoder.final_layer_norm.weight[:100])

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        -0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.], grad_fn=<SliceBackward0>)


## Task 8: Verify if the Q&A task works after resetting the weights of the above layer.

In [64]:
model = AutoModelForSeq2SeqLM.from_pretrained("updated_flant5-small")

df = create_qa_model_result(num_test_examples = 100, output_csv_file = 'QA_Updated_Evaluation_100_samples.csv', random_sample = False)
print(df.head())

df_res = create_rouge_eval_result(df, output_csv_file = 'QA_Updated_Evaluation_100_samples_Metrics.csv', gold_column = 'Gold Summary') 
print(df_res.head(12))

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [35:54<00:00, 21.54s/it]


   Index  \
0      0   
1      1   
2      2   
3      3   
4      4   

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Context  \
0  The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They wer

#### Conclusion
- The Q&A task no longer works once decoder.final_layer_norm.weight is set to zeros
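
A small numeric sketch shows why zeroing this one tensor is enough to break generation. T5 uses a scale-only RMS layer norm, and the decoder's final norm feeds a bias-free `lm_head`. With a zero scale the normalized state is all zeros, so every vocabulary entry receives the same logit. The toy sizes and values below are made up for illustration.

```python
import math


def rms_norm(hidden, weight, eps=1e-6):
    """T5-style layer norm: divide by the root mean square, then scale; no bias."""
    rms = math.sqrt(sum(h * h for h in hidden) / len(hidden) + eps)
    return [w * h / rms for w, h in zip(weight, hidden)]


def lm_head(hidden, rows):
    """Bias-free linear projection from the hidden state to vocabulary logits."""
    return [sum(r * h for r, h in zip(row, hidden)) for row in rows]


hidden = [0.5, -1.2, 2.0, 0.3]  # toy decoder state
vocab_rows = [                  # toy 3-token vocabulary
    [0.1, 0.2, -0.3, 0.4],
    [1.0, -1.0, 0.5, 0.0],
    [-0.7, 0.3, 0.9, 0.2],
]

trained = lm_head(rms_norm(hidden, [0.16, 0.16, 0.18, 0.21]), vocab_rows)
zeroed = lm_head(rms_norm(hidden, [0.0] * 4), vocab_rows)
print("trained scale:", trained)
print("zeroed scale: ", zeroed)
```

With identical logits the input no longer influences which token is chosen, so greedy decoding presumably emits the same token at every step regardless of the question.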

## Task 9: Replace the decoder.final_layer_norm.weight with a layer of smaller dimensions and adjust all the dependent layers to match the dimension

In [265]:
model = AutoModelForSeq2SeqLM.from_pretrained("updated_flant5-small")

print('== Original Weights:')
print("model.decoder.final_layer_norm.weight.shape: ",model.decoder.final_layer_norm.weight.shape)
print("model.lm_head.weight.shape: ",model.lm_head.weight.shape)

new_dimension = 256

# Shrink the final layer norm (the checkpoint loaded above already has this weight zeroed)
model.decoder.final_layer_norm.weight = torch.nn.parameter.Parameter(torch.zeros(new_dimension))
# lm_head consumes the layer-norm output, so keep only its first new_dimension input columns
original_weight = model.lm_head.weight.clone()
model.lm_head.weight = torch.nn.Parameter(original_weight[:,:new_dimension])

print('\n== Weights after reducing dimension:')
print(model.decoder.final_layer_norm.weight.shape)
print(model.lm_head.weight.shape)

== Original Weights:
model.decoder.final_layer_norm.weight.shape:  torch.Size([512])
model.lm_head.weight.shape:  torch.Size([32128, 512])

== Weights after reducing dimension:
torch.Size([256])
torch.Size([32128, 256])


In [267]:
for name, param in model.named_parameters():
  print(name, param.shape)

shared.weight torch.Size([32128, 512])
encoder.block.0.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.o.weight torch.Size([512, 384])
encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight torch.Size([32, 6])
encoder.block.0.layer.0.layer_norm.weight torch.Size([512])
encoder.block.0.layer.1.DenseReluDense.wi_0.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wi_1.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wo.weight torch.Size([512, 1024])
encoder.block.0.layer.1.layer_norm.weight torch.Size([512])
encoder.block.1.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.o

In [6]:
model = AutoModelForSeq2SeqLM.from_pretrained("updated_flant5-small")
for name, param in model.named_parameters():
  print(name, param.shape)

shared.weight torch.Size([32128, 512])
encoder.block.0.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.0.layer.0.SelfAttention.o.weight torch.Size([512, 384])
encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight torch.Size([32, 6])
encoder.block.0.layer.0.layer_norm.weight torch.Size([512])
encoder.block.0.layer.1.DenseReluDense.wi_0.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wi_1.weight torch.Size([1024, 512])
encoder.block.0.layer.1.DenseReluDense.wo.weight torch.Size([512, 1024])
encoder.block.0.layer.1.layer_norm.weight torch.Size([512])
encoder.block.1.layer.0.SelfAttention.q.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.k.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.v.weight torch.Size([384, 512])
encoder.block.1.layer.0.SelfAttention.o

In [10]:
import torch

new_dimension = 256

# Update shared.weight
model.shared.weight = torch.nn.Parameter(
    model.shared.weight[:, :new_dimension])

# Update encoder block layers
num_encoder_blocks = len(model.encoder.block)  # 8 for flan-t5-small
for i in range(num_encoder_blocks):

    # Update SelfAttention weights: q, k, v take d_model as input (columns); o produces it (rows)
    model.encoder.block[i].layer[0].SelfAttention.q.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[0].SelfAttention.q.weight[:, :new_dimension])
    model.encoder.block[i].layer[0].SelfAttention.k.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[0].SelfAttention.k.weight[:, :new_dimension])
    model.encoder.block[i].layer[0].SelfAttention.v.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[0].SelfAttention.v.weight[:, :new_dimension])
    model.encoder.block[i].layer[0].SelfAttention.o.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[0].SelfAttention.o.weight[:new_dimension, :])

    # Update layer_norm.weight
    model.encoder.block[i].layer[0].layer_norm.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[0].layer_norm.weight[:new_dimension])
    model.encoder.block[i].layer[1].layer_norm.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[1].layer_norm.weight[:new_dimension])

    # Update DenseReluDense weights (wi_0, wi_1, wo); the hidden width becomes 2 * new_dimension
    model.encoder.block[i].layer[1].DenseReluDense.wi_0.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[1].DenseReluDense.wi_0.weight[:new_dimension*2,:new_dimension])
    model.encoder.block[i].layer[1].DenseReluDense.wi_1.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[1].DenseReluDense.wi_1.weight[:new_dimension*2,:new_dimension])
    model.encoder.block[i].layer[1].DenseReluDense.wo.weight = torch.nn.Parameter(
        model.encoder.block[i].layer[1].DenseReluDense.wo.weight[:new_dimension, :new_dimension*2])

# Update encoder final_layer_norm.weight
model.encoder.final_layer_norm.weight = torch.nn.Parameter(
    model.encoder.final_layer_norm.weight[:new_dimension])


# Update decoder block layers
num_decoder_blocks = len(model.decoder.block)  # 8 for flan-t5-small
for i in range(num_decoder_blocks):
    
    # Update SelfAttention.weight
    model.decoder.block[i].layer[0].SelfAttention.q.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[0].SelfAttention.q.weight[:, :new_dimension])
    model.decoder.block[i].layer[0].SelfAttention.k.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[0].SelfAttention.k.weight[:, :new_dimension])
    model.decoder.block[i].layer[0].SelfAttention.v.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[0].SelfAttention.v.weight[:, :new_dimension])
    model.decoder.block[i].layer[0].SelfAttention.o.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[0].SelfAttention.o.weight[:new_dimension, :])

    # Update layer_norm.weight
    model.decoder.block[i].layer[0].layer_norm.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[0].layer_norm.weight[:new_dimension])
    model.decoder.block[i].layer[1].layer_norm.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[1].layer_norm.weight[:new_dimension])
    model.decoder.block[i].layer[2].layer_norm.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[2].layer_norm.weight[:new_dimension])

    # Update EncDecAttention.weight
    model.decoder.block[i].layer[1].EncDecAttention.q.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[1].EncDecAttention.q.weight[:, :new_dimension])
    model.decoder.block[i].layer[1].EncDecAttention.k.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[1].EncDecAttention.k.weight[:, :new_dimension])
    model.decoder.block[i].layer[1].EncDecAttention.v.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[1].EncDecAttention.v.weight[:, :new_dimension])
    model.decoder.block[i].layer[1].EncDecAttention.o.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[1].EncDecAttention.o.weight[:new_dimension,:])

    # Update DenseReluDense weights (wi_0, wi_1, wo); the hidden width becomes 2 * new_dimension
    model.decoder.block[i].layer[2].DenseReluDense.wi_0.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[2].DenseReluDense.wi_0.weight[:new_dimension*2,:new_dimension])
    model.decoder.block[i].layer[2].DenseReluDense.wi_1.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[2].DenseReluDense.wi_1.weight[:new_dimension*2,:new_dimension])
    model.decoder.block[i].layer[2].DenseReluDense.wo.weight = torch.nn.Parameter(
        model.decoder.block[i].layer[2].DenseReluDense.wo.weight[:new_dimension, :new_dimension*2])

# Update decoder final_layer_norm.weight
model.decoder.final_layer_norm.weight = torch.nn.Parameter(
    model.decoder.final_layer_norm.weight[:new_dimension])

# Update lm_head.weight
model.lm_head.weight = torch.nn.Parameter(
    model.lm_head.weight[:, :new_dimension])

In [11]:
for name, param in model.named_parameters():
  print(name, param.shape)

shared.weight torch.Size([32128, 256])
encoder.block.0.layer.0.SelfAttention.q.weight torch.Size([384, 256])
encoder.block.0.layer.0.SelfAttention.k.weight torch.Size([384, 256])
encoder.block.0.layer.0.SelfAttention.v.weight torch.Size([384, 256])
encoder.block.0.layer.0.SelfAttention.o.weight torch.Size([256, 384])
encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight torch.Size([32, 6])
encoder.block.0.layer.0.layer_norm.weight torch.Size([256])
encoder.block.0.layer.1.DenseReluDense.wi_0.weight torch.Size([512, 256])
encoder.block.0.layer.1.DenseReluDense.wi_1.weight torch.Size([512, 256])
encoder.block.0.layer.1.DenseReluDense.wo.weight torch.Size([256, 512])
encoder.block.0.layer.1.layer_norm.weight torch.Size([256])
encoder.block.1.layer.0.SelfAttention.q.weight torch.Size([384, 256])
encoder.block.1.layer.0.SelfAttention.k.weight torch.Size([384, 256])
encoder.block.1.layer.0.SelfAttention.v.weight torch.Size([384, 256])
encoder.block.1.layer.0.SelfAttention.o.we

## Task 10: Reload the original google/flan-t5-small model.

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name="google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

## Task 11: Train the model for a Q&A task that takes context as additional input along with the question. You can use the SQuAD dataset (https://rajpurkar.github.io/SQuAD-explorer/) or the smaller TopiOCQA dataset (https://mcgill-nlp.github.io/topiocqa/). Choose an appropriate task prefix/trigger word and justify the choice.

I fine-tune on the SQuAD v2 dataset using LoRA, a Parameter-Efficient Fine-Tuning (PEFT) technique.

In [None]:
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

In [None]:
from datasets import load_dataset
dataset = load_dataset("rajpurkar/squad_v2")

In [None]:
## Create Instruction-based dataset

def tokenize_function(example):
    # Wrap each (context, question) pair in an instruction-style prompt
    prompts = []
    for context, question in zip(example["context"], example["question"]):
        prompts.append(f"""Given a Context and a Question, utilize the context to answer the question. Don't use any other information\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n""")

    # Use the first gold answer; unanswerable SQuAD v2 questions get an empty string
    answers = []
    for answer in example["answers"]:
        ans = ''
        if 'text' in answer and len(answer['text']) > 0:
            ans = answer['text'][0]
        answers.append(ans)

    example['input_ids'] = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt").input_ids
    labels = tokenizer(answers, padding="max_length", truncation=True, return_tensors="pt").input_ids
    # Mask padding positions with -100 so the loss ignores them
    labels[labels == tokenizer.pad_token_id] = -100
    example['labels'] = labels

    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
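
To see what the model actually receives, the prompt construction above can be exercised on a single record without loading the tokenizer. This is a minimal sketch: `build_prompt` and `first_answer` are hypothetical helper names, and the sample record is made up in the SQuAD v2 layout (an `answers` dict whose `text` list is empty for unanswerable questions).

```python
# Same instruction wording as in tokenize_function, kept as a reusable template.
PROMPT_TEMPLATE = (
    "Given a Context and a Question, utilize the context to answer the question. "
    "Don't use any other information\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"
)


def build_prompt(context, question):
    """Return the instruction prompt for one (context, question) pair."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def first_answer(answer):
    """Return the first gold answer text, or '' when the question is unanswerable."""
    texts = answer.get("text", [])
    return texts[0] if texts else ""


# Made-up record for illustration only.
record = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}

print(build_prompt(record["context"], record["question"]))
print(repr(first_answer(record["answers"])))
```

Keeping the template in one place also makes it easier to reuse the exact same wording at inference time, so the fine-tuned model sees the prompt format it was trained on.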

In [None]:
#tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 5 == 0, with_indices=True)

print("Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
from peft import LoraConfig, get_peft_model , TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

peft_model = get_peft_model(model, lora_config)


In [None]:
print_trainable_parameters(peft_model)
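
As a rough cross-check on the count reported by `print_trainable_parameters`, the number of LoRA parameters can be estimated by hand: each adapted linear layer with weight shape `(d_out, d_in)` gains two low-rank matrices totalling `r * (d_in + d_out)` parameters. The sketch below assumes the usual flan-t5-small dimensions (hidden size 512, attention inner size 6 heads x 64 = 384, 8 encoder and 8 decoder blocks); these values are assumptions, not read from the model, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, r):
    """Sum r * (d_in + d_out) over adapted linear layers given as (d_out, d_in)."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)


# Assumed flan-t5-small dimensions (not taken from the loaded model).
D_MODEL = 512      # hidden size
INNER = 384        # num_heads * d_kv = 6 * 64
ENC_BLOCKS = 8
DEC_BLOCKS = 8

# Attention modules: encoder self-attention, decoder self-attention,
# and decoder cross-attention. LoRA targets "q" and "v" in each of them.
n_attention = ENC_BLOCKS + 2 * DEC_BLOCKS
shapes = [(INNER, D_MODEL)] * (2 * n_attention)

trainable = lora_param_count(shapes, r=32)
print(f"estimated LoRA parameters: {trainable}")
```

If the printed count differs from this estimate, the likely cause is that one of the assumed dimensions does not match the loaded checkpoint.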

In [None]:
from transformers import TrainingArguments, Trainer

output_dir = 'peft-squadv2-qa-training'

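peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=10,  # ignored here: a positive max_steps overrides num_train_epochs
    logging_steps=1,
    max_steps=10,  # short demonstration run; remove to train for the full 10 epochs
)
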
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
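
Because a positive `max_steps` takes precedence over `num_train_epochs`, this configuration stops after 10 optimizer steps. The sketch below gives a sense of scale under stated assumptions: a batch size of 8 (the `Trainer` default per device, before `auto_find_batch_size` adjusts it), a single device, no gradient accumulation, and roughly 130,319 SQuAD v2 training examples. `optimizer_steps` is a hypothetical helper, not part of `transformers`.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs, max_steps=-1):
    """Steps actually run: a positive max_steps wins over the epoch-based count."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    epoch_steps = steps_per_epoch * epochs
    return max_steps if max_steps > 0 else epoch_steps


NUM_EXAMPLES = 130_319  # assumed size of the SQuAD v2 training split
BATCH_SIZE = 8          # assumed effective batch size

full_run = optimizer_steps(NUM_EXAMPLES, BATCH_SIZE, epochs=10)
short_run = optimizer_steps(NUM_EXAMPLES, BATCH_SIZE, epochs=10, max_steps=10)

print(f"steps for 10 full epochs: {full_run}")
print(f"steps with max_steps=10:  {short_run}")
print(f"examples seen in the short run: {short_run * BATCH_SIZE}")
```

Under these assumptions the short run sees only a few dozen examples, which is worth keeping in mind when reading the evaluation results.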

In [None]:
peft_trainer.train()

peft_model_path="peft-squad-v2-qa-checkpoints"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

In [None]:
from peft import PeftModel, PeftConfig

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

peft_model = PeftModel.from_pretrained(peft_model_base, 
                                       'peft-squad-v2-qa-checkpoints/', 
                                       torch_dtype=torch.bfloat16,
                                       is_trainable=False)

In [None]:
print_trainable_parameters(peft_model)

In [None]:
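import re

def generate_llm_prediction(prompt):
    # Uses the global `model` (set to `peft_model` in Task 12) and
    # `generation_config`, which is assumed to be defined in an earlier cell
    inputs = tokenizer(prompt, return_tensors='pt')
    peft_model_outputs = model.generate(input_ids=inputs["input_ids"], generation_config=generation_config)
    output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)
    # Remove runs of two or more dashes from the decoded text
    output = re.sub('---*','',str(output))
    return output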

## Task 12: Evaluate the quality of the model

In [79]:
model = peft_model

df = create_qa_model_result(num_test_examples = 100, output_csv_file = 'QA_PEFT_Evaluation_100_samples.csv', random_sample = False)
print(df.head())

df_res = create_rouge_eval_result(df, output_csv_file = 'QA_PEFT_Evaluation_100_samples_Metrics.csv', gold_column = 'Gold Summary') 
print(df_res.head(12))

           Unnamed: 0    rouge1    rouge2    rougeL  rougeLSum
0  Zero_Shot_Prompt_1  0.372387  0.159821  0.375336   0.373538
1  Zero_Shot_Prompt_2  0.369750  0.156583  0.371631   0.369631
2  Zero_Shot_Prompt_3  0.345125  0.151690  0.348074   0.344720
3   One_Shot_Prompt_1  0.353196  0.149690  0.357905   0.354684
4   One_Shot_Prompt_2  0.351934  0.149690  0.356725   0.353705
5   One_Shot_Prompt_3  0.353835  0.155524  0.358135   0.354820
6   Few_Shot_Prompt_1  0.352333  0.137214  0.355655   0.352857
7   Few_Shot_Prompt_2  0.344083  0.143940  0.348901   0.345777
8   Few_Shot_Prompt_3  0.353351  0.134857  0.356560   0.353800
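
For intuition about the `rouge1` column, ROUGE-1 is a unigram-overlap F1 between the prediction and the gold answer. The sketch below is a simplified stand-in that uses lowercased whitespace tokens; it is not the implementation behind `create_rouge_eval_result`, so its scores are not expected to match the table exactly. `rouge1_f1` is a hypothetical helper name.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("Paris", "Paris France"))
print(rouge1_f1("the capital is Paris", "Paris"))
```

With short extractive answers, a prediction that adds or drops even one word moves the score noticeably, so small differences between prompts in the table should be read with that in mind.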


#### Conclusion:

1. The results are similar to, or slightly better than, those of the in-context learning approach.
2. The results can be improved by
    - fine-tuning on more data
    - experimenting with other prompts
    - using a larger parameter model