**Authors:**

- Ravi Teja Kothuru (Primary)
- Soumi Ray
- Anwesha Sarangi

**Title of the Project:** SmartChat: A Context-Aware Conversational Agent

**Description of the Project:** Develop a chatbot that can effectively adapt to context and topic shifts in a conversation, leveraging the Stanford Question Answering Dataset to provide informed and relevant responses, and thereby increasing user satisfaction and engagement.

**Objectives of the Project:** Create a user-friendly web or app interface that enables users to have natural and coherent conversations with the chatbot, with a high satisfaction rating.

**Name of the Dataset:** Stanford Question Answering Dataset

**Description of the Dataset:** The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage. There are 100,000+ question-answer pairs on 500+ articles. More information can be found at: https://rajpurkar.github.io/SQuAD-explorer/

**Dataset Source:**

Kaggle (https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset)

***Number of Variables in Dataset:*** Each JSON file has 2 top-level variables:

- data
- version

The `data` variable nests further variables such as:

- ***context:*** A lengthy paragraph that contains the information needed to answer the questions.
- ***question:*** A question based on the context.
- ***answer:*** The answer to the question, taken as a span of text from the context.
- ***ans_start:*** The character index in the context where the answer starts.
- ***ans_end:*** The character index in the context where the answer ends.
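
To make the nesting concrete, the sketch below flattens a SQuAD v1.1 style JSON object into one row per answer. The tiny `sample` record is made up for illustration (only its layout follows the published v1.1 files), `flatten_squad` is a hypothetical helper, and `ans_end` is derived here as `answer_start + len(text)` rather than read from the file.

```python
import json

# A made-up record that mimics the layout of train-v1.1.json (illustration only)
sample = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "The sky is blue because of Rayleigh scattering.",
          "qas": [
            {
              "id": "q1",
              "question": "Why is the sky blue?",
              "answers": [{"text": "Rayleigh scattering", "answer_start": 27}]
            }
          ]
        }
      ]
    }
  ]
}
""")


def flatten_squad(squad):
    """Return one flat row per (context, question, answer) triple."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    rows.append({
                        "context": paragraph["context"],
                        "question": qa["question"],
                        "answer": answer["text"],
                        "ans_start": start,
                        # Derived value: end index is exclusive
                        "ans_end": start + len(answer["text"]),
                    })
    return rows


rows = flatten_squad(sample)
row = rows[0]
# Slicing the context with the two indices recovers the answer text
print(row["context"][row["ans_start"]:row["ans_end"]])
```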

***Size of the Dataset:*** The dataset has 2 JSON files: one for training and one for validation.

- The training dataset's filename is train-v1.1.json and its size is 30.3 MB.
- The validation dataset's filename is dev-v1.1.json and its size is 4.9 MB.


# Install/Import all the necessary libraries

In [3]:
# Install necessary libraries: datasets for dataset handling, torch for PyTorch framework,
# peft for parameter-efficient fine-tuning, transformers for NLP model handling,
# evaluate for model evaluation, and safetensors for safe tensor handling.
!pip install datasets torch peft transformers evaluate safetensors numpy pandas matplotlib scikit-learn nltk rouge-score

# Import the PyTorch library for building and training neural networks.
import torch

# Import the garbage collector interface; gc.collect() is called later to free memory during model training.
import gc

# Import the necessary classes from the transformers library:
# - AutoTokenizer: Automatically selects the correct tokenizer for a given model.
# - AutoModelForCausalLM: Automatically loads the correct causal language model.
# - AutoConfig: Loads the configuration of the model, useful for understanding model architecture.
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

# Import the classes used to set up and manage the training process.
from transformers import TrainingArguments, Trainer

# Import functions from safetensors library to handle safe tensor loading and saving.
from safetensors.torch import load_model, save_model

# Import the load_dataset function from the datasets library to easily load datasets for training or evaluation.
from datasets import load_dataset

# Import the evaluate library to perform evaluations on model predictions.
import evaluate

# Import PeftModel and PeftConfig from the peft library for efficient fine-tuning of models.
from peft import PeftModel, PeftConfig

# Importing necessary classes for configuring and using Low-Rank Adaptation in model tuning.
from peft import LoraConfig, TaskType, get_peft_model

# Import the os module to interact with the operating system, such as file paths and directories.
import os

# Importing the AutoPeftModelForCausalLM class from the peft library for fine-tuning causal language models.
from peft import AutoPeftModelForCausalLM


# Decide which architecture to use

## Comparison of the architectures

Several architectures can be used to develop chatbots.
Let us understand them before deciding which one to use.

# Differences Between Seq2Seq, Transformers, GPT, and GPT-2 (Small, Medium & Large)

| Feature       | Seq2Seq                                           | Transformers                                    | GPT                                               | GPT-2 Small, Medium & Large                       |
|---------------|--------------------------------------------------|------------------------------------------------|--------------------------------------------------|--------------------------------------------------|
| **Definition**| A model that transforms an input sequence into an output sequence using an encoder and decoder. | A deep learning architecture using self-attention mechanisms to process input sequences. | A specific Transformer model designed for generating text by predicting the next word in a sequence. | Variants of the GPT model with different sizes and capacities for generating text. |
| **Usage**     | Tasks where input and output are sequences, like translation and summarization. | A wide range of NLP tasks, including translation and summarization. | Primarily used for text generation tasks like chatbots and text completion. | Used for similar text generation tasks, with larger models generally providing better performance and coherence. |
| **Information**| Consists of an encoder that processes the input and a decoder that generates the output. | Composed of an encoder and decoder stack, using self-attention to capture relationships between words. | Utilizes only the decoder part of the Transformer, focusing on unidirectional text generation. | Small has fewer parameters, while medium and large have progressively more, enhancing their ability to understand and generate text. |
| **Strengths** | Effective for varying output lengths; good at capturing context. | Can process sequences in parallel; captures long-range dependencies well. | Excellent at generating coherent and contextually relevant text; adapts to various topics. | Larger models (medium and large) can generate more sophisticated and nuanced text compared to the small model. |
| **Limitations**| Struggles with long sequences due to fixed-length context vectors; may not capture long-range dependencies well. | Requires substantial data and computational power; complexity can make fine-tuning harder. | May generate repetitive or nonsensical outputs; unidirectional nature limits contextual understanding compared to bidirectional models. | Small may struggle with complexity in tasks, while larger models require more computational resources and memory. |
| **Applications**| Machine translation, text summarization, conversational agents. | Machine translation, text generation, sentiment analysis. | Chatbots, text completion, creative writing assistance. | Similar applications as GPT, with larger models often preferred for more demanding tasks. |


## Final Decision of the Architecture to use for training

I have decided to use the ***GPT-2 Medium*** model for the following reasons:

- It is excellent at generating coherent and contextually relevant text.
- Given the constraints on time and computational resources, we need a model that is small enough to train yet large enough to generate relevant text. The GPT-2 Medium architecture is a sufficient compromise.
- It provides a detailed understanding of the question asked, based on the context.

# Load the GPT2-Medium Tokenizer

1. **Load the Tokenizer:** The code loads the tokenizer that belongs to `openai-community/gpt2-medium`. A tokenizer converts text into the token IDs that the model understands, and each tokenizer is tied to a specific model.

2. **Set Padding Token:** The code reuses the end-of-sequence token as the padding token. Padding fills shorter texts so that all sequences in a batch have the same length, and the padding token tells the model that a position holds no actual content.

3. **Assign Padding Token ID:** The code also sets the numeric ID of the padding token. The model uses this ID to recognize padding in the tokenized text.
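
The effect of padding can be sketched with plain Python lists. `pad_batch` below is a hypothetical helper, not part of the `transformers` API, and the token IDs other than `50256` (GPT-2's end-of-text ID) are arbitrary. The `side` argument is included because batched generation with decoder-only models is usually done with padding on the left, while training batches are commonly padded on the right.

```python
EOS_TOKEN_ID = 50256  # GPT-2's <|endoftext|> ID, reused here as the padding ID


def pad_batch(sequences, max_length, pad_id=EOS_TOKEN_ID, side="right"):
    """Truncate/pad each sequence to max_length and build attention masks."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq[:max_length])          # truncation
        n_pad = max_length - len(seq)         # number of padding positions
        pad, real = [pad_id] * n_pad, [1] * len(seq)
        if side == "right":
            input_ids.append(seq + pad)
            attention_mask.append(real + [0] * n_pad)
        else:  # left padding keeps the last position a real token
            input_ids.append(pad + seq)
            attention_mask.append([0] * n_pad + real)
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13], [21]], max_length=4)
print(ids)   # [[11, 12, 13, 50256], [21, 50256, 50256, 50256]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

The attention mask is what lets the model ignore padded positions even though the padding ID is the same as the end-of-text ID.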

In [2]:
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "openai-community/gpt2-medium"
)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id


# Load the `openai-community/gpt2-medium` base model

**Automatic configuration:**

1. **Objective**: `AutoConfig` manages the configuration settings of different models in a uniform, automated way, which makes loading model configurations easier.

2. **Management of Configuration**: It holds settings such as the number of layers, the hidden size, and the vocabulary size.

3. **Simplicity of Use**: `AutoConfig` lets users load all of these settings by providing the model name instead of setting each parameter manually. This reduces the chance of mistakes and makes it simple to switch between models.

4. **Compatibility**: The configuration matches the particular model architecture, so every setting is appropriate for the model in use.

**AutoModelForCausalLM:** loads language models that predict the next token in a sequence based on the preceding tokens.

1. **Goal**: This class is designed for models that produce text sequentially, using the previous words in a sequence to predict the next word.

2. **Pre-trained Models**: Users can load causal language models that have already been trained on extensive datasets. This saves time and computational resources because the model does not have to be trained from scratch.

3. **Ability to Generate**: The model can perform text completion, dialogue generation, and other tasks that require producing coherent sequences of text.

4. **Adaptable and Expandable**: This class works with many generative model architectures (such as GPT-2), which makes it easy to switch between them.

In [13]:
# load  model
config = AutoConfig.from_pretrained(
    "openai-community/gpt2-medium"
)

model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2-medium"
)


# Load SQuAD dataset from HuggingFace datasets library

- The SQuAD (Stanford Question Answering Dataset) dataset is available directly in HuggingFace's `datasets` library, so there is no need to download the dataset files locally and load them by hand.

- Since we are working with SQuAD v1.1, the `squad` dataset name is sufficient. The `load_dataset` function from the `datasets` library loads the required dataset.

- The training split is divided 80/20 into `train_ds` and `eval_ds`. For memory management, the originally loaded dataset variables are deleted after this split.

- `gc.collect()` then reclaims memory that is no longer needed by the program.

In [14]:
# load dataset
dataset = load_dataset("squad")
vali_ds = dataset['validation']

# hold out 20% of the training split for evaluation during training
split_ds = dataset['train'].train_test_split(test_size=0.2)
train_ds = split_ds['train'].shuffle(seed=42)
eval_ds = split_ds['test'].shuffle(seed=42)

# clear original dataset
del dataset
del split_ds
gc.collect()

1962

# Verify the existence of CPU/GPU and confirm what is being used

In [15]:
# Check if GPU is available. If yes, use it. Else, use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Currently using the device: {device}")

Currently using the device: cuda


# Preprocess the data as needed

Before proceeding with the data preprocessing and making it ready for training, let us first understand the meaning of LoRA and PEFT.

# Comparison of LoRA and PEFT

| Aspect               | LoRA (Low-Rank Adaptation)                                    | PEFT (Parameter-Efficient Fine-Tuning)                      |
|----------------------|--------------------------------------------------------------|-------------------------------------------------------------|
| **Definition**       | A technique that adapts pre-trained models by introducing low-rank matrices, allowing for efficient model tuning. | A broader framework for fine-tuning models with minimal parameter updates, enhancing efficiency and speed. |
| **Purpose**          | To reduce the number of parameters that need to be updated during model training while maintaining performance. | To fine-tune large models with a focus on reducing computational costs and resource usage. |
| **Examples**         | Used in language models like GPT-2 to enhance performance with fewer resources. | Applied in various architectures (e.g., BERT, T5) for tasks such as text classification, summarization, and more. |
| **Usages**           | Primarily used for fine-tuning large pre-trained models in specific tasks with limited data or computational resources. | Utilized in a wide range of applications, including natural language processing, computer vision, and more, where full fine-tuning is impractical. |
| **Methodology**      | Involves modifying only certain layers or components of a model, keeping the majority of the parameters unchanged. | Can include methods like LoRA, prompt tuning, and adapter layers to achieve efficient fine-tuning. |
| **Benefits**         | Reduces memory and computational overhead, making it feasible to use large models in resource-constrained environments. | Allows for faster training times and lower resource consumption while still achieving high performance. |
| **Limitations**      | May not achieve the same level of performance as full fine-tuning in some cases. | The effectiveness can vary depending on the specific task and model architecture used. |


## Make use of LoRA and PEFT

1. **Configuration Creation**: The code creates a `LoraConfig` object that describes how the model will be adapted.

2. **Low-Rank Adaptation (LoRA)**: The technique used is Low-Rank Adaptation, which adapts a pre-trained model by training small low-rank matrices while the original weights stay frozen.

3. **Rank Parameter**: `r=8` sets the rank of the adaptation matrices, so the model learns its update in a lower-dimensional representation.

4. **Scaling Factor**: `lora_alpha=16` sets the scaling factor. The learned update is multiplied by `lora_alpha / r`, which controls how much influence the adaptation has.

5. **Dropout Rate**: `lora_dropout=0.1` applies dropout inside the LoRA layers, a method that helps prevent overfitting by randomly zeroing some values during training.

6. **Input/Output Configuration**: `fan_in_fan_out=True` indicates that the adapted layer stores its weight as (fan_in, fan_out), which is how the `Conv1D` layers in GPT-2 store their weights.

7. **Bias Handling**: `bias="none"` means that no bias parameters are trained as part of the adaptation.

8. **Task Type Specification**: `task_type=TaskType.CAUSAL_LM` indicates that the model is being adapted for causal language modeling, that is, generating text based on previous inputs.
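
As a rough illustration of what these settings imply, LoRA keeps a weight matrix `W` frozen and learns two small matrices, so the effective weight is `W + (lora_alpha / r) * B @ A`, with `B` of shape `(d_out, r)` and `A` of shape `(r, d_in)`. The sketch below counts the parameters this adds. It assumes the adapter is attached only to GPT-2's fused attention projection (`c_attn`, 1024 inputs and 3072 outputs in each of the 24 layers of GPT-2 Medium); that target choice is an assumption about the library default, not something the configuration states. The helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA pair: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


HIDDEN_SIZE = 1024   # GPT-2 Medium hidden size
N_LAYERS = 24        # GPT-2 Medium transformer blocks
R, LORA_ALPHA = 8, 16

# Assumption: only the fused query/key/value projection is adapted in each block
per_layer = lora_param_count(HIDDEN_SIZE, 3 * HIDDEN_SIZE, R)
total_trainable = per_layer * N_LAYERS

print(per_layer)                    # 32768
print(total_trainable)              # 786432
print(lora_scaling(LORA_ALPHA, R))  # 2.0
```

Under these assumptions the total agrees with the `786,432` trainable parameters reported by `print_trainable_parameters()` in this notebook.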


In [16]:
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    fan_in_fan_out=True,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    )

In [22]:
# Apply the LoRA's config into PEFT model
lora_model = get_peft_model(model, config)

# Send the LoRA model object to the device (CPU/GPU) for further processing and training
lora_model.to(device)

# Print the number and percentage of trainable parameters
lora_model.print_trainable_parameters()

# Release cached, unused GPU memory
torch.cuda.empty_cache()
print("Deleted unuesed memory from GPU")

# Define the preprocessing function
def preprocess_function(examples):
    # Format the required input columns from the dataset into a list
    inputs = [f"Context: {c}\nQuestion: {q}\nAnswer:" for q, c in zip(examples['question'], examples['context'])]
    
    # Apply tokenization to inputs
    model_inputs = tokenizer(inputs,padding="max_length", truncation=True, max_length=256,return_tensors='pt')
    
    # Format the required target columns from the dataset into a list
    targets = [','.join(a['text']) if len(a['text']) > 0 else '' for a in examples['answers']]

    # Apply tokenization to targets
    labels = tokenizer(targets, padding="max_length", truncation=True, max_length=256, return_tensors='pt')
    
    model_inputs["labels"] = labels['input_ids']
    model_inputs["labels_mask"] = labels['attention_mask']
    
    return model_inputs

# Preprocess train information
tok_train_ds = train_ds.map(preprocess_function, batched=True)
tok_train_ds.set_format(type="torch", columns=["input_ids", "attention_mask","labels","labels_mask"])

# Preprocess evaluation information
tok_eval_ds = eval_ds.map(preprocess_function, batched=True)
tok_eval_ds.set_format(type="torch", columns=["input_ids", "attention_mask","labels","labels_mask"])

trainable params: 786,432 || all params: 355,609,600 || trainable%: 0.2212
Deleted unuesed memory from GPU
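
The cell above tokenizes prompts and answers separately. A common alternative for causal language models, shown here only as a sketch and not as the code that produced the results in this notebook, is to append the answer tokens to the prompt tokens and set the label to `-100` (the index ignored by PyTorch's cross-entropy loss) on prompt and padding positions. `build_causal_lm_example` is a hypothetical helper, and the token IDs are arbitrary stand-ins for tokenizer output.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def build_causal_lm_example(prompt_ids, answer_ids, max_length, pad_id, eos_id):
    """Concatenate prompt and answer; supervise only the answer tokens."""
    # The model sees prompt + answer + end-of-text as one sequence
    ids = (prompt_ids + answer_ids + [eos_id])[:max_length]
    # Prompt positions are masked out so the loss covers only the answer
    labels = ([IGNORE_INDEX] * len(prompt_ids) + answer_ids + [eos_id])[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        # Padding positions are masked out as well
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }


example = build_causal_lm_example(
    prompt_ids=[11, 12, 13],   # stand-in for "Context: ... Question: ... Answer:"
    answer_ids=[21, 22],       # stand-in for the answer text
    max_length=8,
    pad_id=50256,
    eos_id=50256,
)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 50256, 50256, 50256]
print(example["labels"])     # [-100, -100, -100, 21, 22, 50256, -100, -100]
```

With this layout, `input_ids` and `labels` are aligned position by position, which is what a causal language model's loss expects.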


# Train the model, save the result and set for evaluation

Let us understand the following code to train the model.

1. **Importing Libraries**:
   - `TrainingArguments` and `Trainer` from the `transformers` library were imported at the top of the notebook. They are essential for setting up and managing the training process of machine learning models.

2. **Setting Training Arguments**:
   - A `TrainingArguments` object named `training_args` is created, where various parameters related to training are specified:
   - `per_device_train_batch_size=32`: This sets the number of training samples to be processed at one time (batch size) to 32 for each device (like CPU or GPU).
   - `per_device_eval_batch_size=32`: Similarly, this sets the evaluation batch size to 32.
   - `output_dir="./results"`: This specifies the directory where the training results (like model checkpoints) will be saved, here it's a folder named "results".
   - `learning_rate=2e-4`: This sets the learning rate, which controls how much to change the model's weights during training; in this case, it is set to 0.0002.
   - `weight_decay=0.01`: This adds a penalty to the model weights to prevent overfitting, with a decay factor of 0.01.
   - `evaluation_strategy="epoch"`: This indicates that the model will be evaluated at the end of each training epoch.
   - `save_strategy="epoch"`: This means the model will be saved at the end of each epoch as well.
   - `load_best_model_at_end=True`: After training, the best model (based on evaluation metrics) will be loaded automatically.
   - `num_train_epochs=3`: This sets the total number of times the model will go through the entire training dataset, which is 3 epochs.

3. **Creating the Trainer**:
   - A `Trainer` object named `trainer` is instantiated, which will handle the training and evaluation of the model. The following parameters are passed to it:
   - `model=lora_model`: This specifies the model to be trained, which is referenced by the variable `lora_model`.
   - `args=training_args`: This passes the training arguments that were defined earlier.
   - `tokenizer=tokenizer`: This provides the tokenizer that will be used to preprocess the text data for the model.
   - `train_dataset=tok_train_ds`: This sets the training dataset, which is referenced by the variable `tok_train_ds`.
   - `eval_dataset=tok_eval_ds`: This sets the evaluation dataset, referenced by the variable `tok_eval_ds`.

In summary, this code is preparing everything necessary to train a machine learning model using specified settings and datasets, while also ensuring that evaluation and saving of the best model are handled automatically.
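
For a sense of scale, the number of optimizer steps implied by these settings can be estimated with a few lines of arithmetic. This sketch assumes a single device, no gradient accumulation, and that the 80/20 split of the 87,599 SQuAD v1.1 training examples rounds the evaluation portion up; `steps_for` is a hypothetical helper.

```python
import math

TRAIN_EXAMPLES = 87_599   # size of the SQuAD v1.1 training split
TEST_FRACTION = 0.2       # test_size passed to train_test_split
BATCH_SIZE = 32           # per_device_train_batch_size / per_device_eval_batch_size
EPOCHS = 3                # num_train_epochs

# Assumption: the held-out portion is rounded up, the rest is used for training
eval_size = math.ceil(TRAIN_EXAMPLES * TEST_FRACTION)
train_size = TRAIN_EXAMPLES - eval_size


def steps_for(n_examples, batch_size):
    """Number of batches needed to cover n_examples once."""
    return math.ceil(n_examples / batch_size)


steps_per_epoch = steps_for(train_size, BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
eval_steps = steps_for(eval_size, BATCH_SIZE)

print(train_size, eval_size)         # 70079 17520
print(steps_per_epoch, total_steps)  # 2190 6570
print(eval_steps)                    # 548
```

The estimated evaluation size of 17,520 examples is consistent with the evaluation output reported after training, where `eval_runtime` multiplied by `eval_samples_per_second` (223.1288 × 78.52) is approximately 17,520.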


## Interpretation post training

- After training, the evaluation loss is `0.1442018747329712`, down slightly from `0.14534` after the first epoch. A low loss on its own does not guarantee correct answers, so the model is validated separately on the validation dataset below.
- Due to constraints on time and computational resources, we used a less powerful GPU. As a result, we limited training to 3 epochs, and it still took 2 hours 26 minutes to complete.

In [23]:
# Prepare for training the gpt2-medium model
training_args = TrainingArguments(
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    output_dir="./results",
    learning_rate=2e-4,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    num_train_epochs=3,
    )

trainer = Trainer(model=lora_model,
                  args=training_args,
                  tokenizer=tokenizer,
                  train_dataset=tok_train_ds,
                  eval_dataset=tok_eval_ds
                )

# train the model
trainer.train()

# evaluate the trained model
evaluation_result = trainer.evaluate()
print(evaluation_result)

# save the trained LoRA adapter (after training, so the fine-tuned weights are stored)
lora_model.save_pretrained("gpt2-medium-lora")

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.1477        | 0.14534         |
| 2     | 0.1468        | 0.14445         |
| 3     | 0.146         | 0.144202        |



{'eval_loss': 0.1442018747329712, 'eval_runtime': 223.1288, 'eval_samples_per_second': 78.52, 'eval_steps_per_second': 2.456, 'epoch': 3.0}


# Validate the trained 'gpt2-medium' model using Validation dataset

## Load the validation dataset

- Here, we load the validation dataset directly from HuggingFace's `datasets` library.
- As there are more than 10,000 records, we initially select only the first 5 records of the validation dataset.

In [4]:
dataset = load_dataset("squad")
validation_ds = dataset['validation'].select(range(5))
validation_ds

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 5
})

# Function to format the input prompt and tokenize it

- `prompts` list contains formatted strings for each question and its corresponding context.
- For each pair of question (`q`) and context (`c`) in `examples`, the function creates a string that starts with `"Context: "`, followed by the context text, then `"Question: "`, followed by the question text, and finally ends with `"<|start_answer|>"`.
- After creating the list of formatted strings, the `tokenizer` function is called. This function converts the list of prompts into numerical representations that the model can understand.

**Parameters of Tokenization**:
   - `padding="max_length"`: This ensures that all the tokenized prompts are padded to the maximum length specified by `max_length`.
   - `truncation=True`: This allows any prompts that exceed the `max_length` to be shortened.
   - `max_length=256`: This sets the maximum length for each tokenized prompt to 256 tokens.
   - `return_tensors='pt'`: This specifies that the output should be returned as PyTorch tensors, which can be used for model input.

- Finally, the function returns the tokenized prompts, which are now in a format suitable for input into a model.

The following code implements this.

In [5]:
def format_prompts(examples):
    prompts = [f"Context: {c}\nQuestion: {q}\n <|start_answer|>" for q, c in zip(examples['question'], examples['context'])]
    return tokenizer(prompts, padding="max_length", truncation=True, max_length=256, return_tensors='pt')

- The function `extract_answer` takes one parameter, `generated_text`, which is expected to be a string containing the generated text.
- `start_token` is defined with the value `"<|start_answer|>"`. This token marks where the actual answer starts in the generated text.
- The function searches for `start_token` in `generated_text` using the `find` method, which returns the position of the token, or -1 if the token is absent.
- The code checks that this position is not -1 before doing anything else. Adding the token length first would hide a missing token, because the sum could never be -1.
- If the token is found, the answer starts at the token position plus the length of `start_token`. The function returns the text from that index onward, with leading and trailing whitespace removed by the `strip` method.
- If the token is not found, the original text is returned unchanged.

**Purpose**:
- The main purpose of this function is to isolate and return the answer portion of the `generated_text`, which is located after a specific start token.

In [6]:
def extract_answer(generated_text):
    start_token = "<|start_answer|>"

    # Locate the marker; str.find returns -1 when it is absent
    token_idx = generated_text.find(start_token)

    if token_idx != -1:
        # Return everything after the marker, without surrounding whitespace
        return generated_text[token_idx + len(start_token):].strip()
    else:
        # Return original text if the marker is not found
        return generated_text
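
An alternative way to split on a marker, sketched here with the standard library only and not taken from the notebook, is `str.partition`, which reports whether the separator was found without any index arithmetic. `answer_after_marker` is a hypothetical name.

```python
def answer_after_marker(generated_text, marker="<|start_answer|>"):
    """Return the text after the first marker, or the input if there is none."""
    # partition returns (head, separator, tail); separator is "" when not found
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text


print(answer_after_marker("Question: Who won?\n <|start_answer|> Denver Broncos "))
print(answer_after_marker("text without the marker"))
```

The first call prints `Denver Broncos`, and the second prints the input unchanged because the marker is absent.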

## Load the fine-tuned and trained gpt2-medium-lora model

- The validation dataset has more than 10,000 records, which is too many to print in a Jupyter notebook.
- Hence, only 5 records of the validation dataset are printed here for illustration.
- Detailed predictions will be demonstrated using the Chatbot UI.

In [8]:
# load the fine-tuned model
ft_model = AutoPeftModelForCausalLM.from_pretrained("gpt2-medium-lora")

inputs = format_prompts(validation_ds)
outputs = ft_model.generate(**inputs, max_new_tokens=32, do_sample=True, top_k=50, top_p=0.95, temperature=0.7, num_return_sequences=1)

# extract the generated answers
generated_answers = [extract_answer(tokenizer.decode(output, skip_special_tokens=True)) for output in outputs]

# display results (input_ids avoids shadowing the built-in input)
for input_ids, output, ans in zip(inputs['input_ids'], outputs, generated_answers):
    print("\n-------compare--------")
    print(f"\ninput:\n{tokenizer.decode(input_ids, skip_special_tokens=True)}")
    print(f"\nanswer:\n{ans}")

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.



-------compare--------

input:
Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Question: Which NFL team represented the AFC at Super Bowl 50?
 <|start_answer|>

answer:
Cleveland Browns
Cleveland Browns/Atlanta Falcons
The Cleveland Browns won the Super Bowl wit

## Evaluate the predictions using `evaluate`

- In this code, we assess how closely our model's responses match the reference answers in the SQuAD dataset.
- First, we load the SQuAD evaluation metric.
- Next, we create two lists: one for the predictions made by our model and another for the reference answers.
- For the predictions, we build a list of dictionaries containing the ID of each example along with the model's generated answer.
- In the same way, we build a reference list with the IDs and the correct answers from the dataset.
- Finally, we use the evaluation metric to compare the predictions with the references, compute the scores, and display them.
- This procedure tells us how accurate our model's answers are, in terms of exact match and F1.
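
The `squad` metric reports exact match and token-level F1. The simplified sketch below follows the usual SQuAD-style normalization (lowercasing, stripping punctuation and the articles a/an/the) for a single prediction against a single reference. It is an illustration, not the library's implementation; the function names are hypothetical and scores are on a 0 to 1 scale here.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Denver Broncos!", "Denver Broncos"))  # 1.0
print(f1_score("Cleveland Browns", "Denver Broncos"))        # 0.0
```

A prediction that shares no token with the reference scores zero on both measures, regardless of how fluent it is.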

In [9]:
# evaluate based on squad metric
ft_squad_metric = evaluate.load("squad")

# prepare predictions
predictions = [{'id': example['id'],
                'prediction_text': answer} for example, answer in zip(validation_ds, generated_answers)]

# prepare references
references = [{'id': example['id'],
               'answers': example['answers']} for example in validation_ds]

# compute and print result
ft_results = ft_squad_metric.compute(predictions=predictions, references=references)
print(ft_results)

{'exact_match': 0.0, 'f1': 0.0}
