The code snippets below install the Python packages, most of them pinned to specific versions, that are needed to develop a chatbot based on the FLAN T5 large language model (LLM). Here's a brief description of each snippet:

1. `%pip install --upgrade pip`: This command upgrades the `pip` package manager to its latest version. It ensures that you have the latest version of `pip` before installing other dependencies.

2. `%pip install --disable-pip-version-check \ torch==1.13.1 \ torchdata==0.5.1 \ pandas --quiet`: This command installs the packages listed below. `--quiet` suppresses most of the installation output, and `--disable-pip-version-check` stops pip from checking whether a newer pip release is available. It installs the following packages:
   - `torch==1.13.1`: PyTorch library with version 1.13.1.
   - `torchdata==0.5.1`: A library for data handling and processing with PyTorch.
   - `pandas`: A popular data manipulation library for Python.

3. `%pip install \ transformers==4.27.2 \ datasets==2.11.0 \ evaluate==0.4.0 \ rouge_score==0.1.2 \ loralib==0.1.1 \ peft==0.3.0 --quiet`: This command quietly installs pinned versions of the following packages:
   - `transformers==4.27.2`: The Transformers library, which includes various pre-trained language models, including T5.
   - `datasets==2.11.0`: A library for managing and accessing datasets efficiently.
   - `evaluate==0.4.0`: Hugging Face's library for computing evaluation metrics.
   - `rouge_score==0.1.2`: A library for calculating ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, commonly used in text summarization tasks.
   - `loralib==0.1.1`: An implementation of LoRA (Low-Rank Adaptation) layers for PyTorch, used for parameter-efficient fine-tuning.
   - `peft==0.3.0`: Hugging Face's Parameter-Efficient Fine-Tuning library, which provides methods such as LoRA for adapting large models while training only a small share of their parameters.

These code snippets are setting up the necessary Python packages and their versions to create a development environment for your chatbot project based on LLM FLAN T5. Make sure you have the required dependencies installed to work on your chatbot successfully.
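After the install cells have run, it can be useful to confirm that the pinned versions are the ones actually present. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical, and the pins are copied from the install commands above. Note that some builds report a local suffix (for example a CUDA tag on `torch`), which this exact-match check would flag as a mismatch.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
PINNED = {
    "torch": "1.13.1",
    "torchdata": "0.5.1",
    "transformers": "4.27.2",
    "datasets": "2.11.0",
    "evaluate": "0.4.0",
    "rouge_score": "0.1.2",
    "loralib": "0.1.1",
    "peft": "0.3.0",
}


def check_pins(pins):
    """Return {package: status}: 'ok', 'missing', or 'found <installed version>'."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else f"found {installed}"
    return report


for name, status in check_pins(PINNED).items():
    print(f"{name}: {status}")
```

Any entry that is not `ok` is a hint to re-run the install cell or restart the runtime before continuing.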

In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 \
    pandas --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet


Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2

The provided code snippets import necessary Python libraries and classes for developing a chatbot based on the LLM FLAN T5 model. Here's a brief description of these code snippets:

1. `import pandas as pd`: Imports the pandas library and assigns it the alias 'pd,' which is commonly used for data manipulation and analysis.

2. `from datasets import Dataset`: Imports the `Dataset` class from the `datasets` library. This class is used for managing and handling datasets efficiently during chatbot development.

3. `from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer`: Imports several classes from the `transformers` library, which is crucial for working with pre-trained language models like FLAN T5. These classes include:
   - `AutoModelForSeq2SeqLM`: This class is used for sequence-to-sequence language modeling tasks, like text generation, which is often used in chatbot development.
   - `AutoTokenizer`: This class is used for tokenizing text data and preparing it for model input.
   - `TrainingArguments`: This class is used to define training arguments and configurations for fine-tuning language models.
   - `Trainer`: This class is used for training and fine-tuning language models on custom datasets.

4. `import torch`: Imports the PyTorch library, which is commonly used for deep learning tasks, including working with transformers and neural networks.

5. `import time`: Imports the time library, which can be used for measuring execution time or adding delays in your code if needed.

These code snippets are essential for setting up the necessary libraries and tools for chatbot development using the FLAN T5 model and managing datasets effectively.

In [None]:
import pandas as pd
from datasets import Dataset  # Import the Dataset class
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer
import torch
import time


The provided code snippet is responsible for loading data from an Excel file and preparing it for use in your chatbot development. Here's a brief description:

1. `excel_path = 'NIAA RGS _ GIM changes_Edited.xlsx'`: Defines the file path for the Excel file containing your data. Make sure to replace this with the actual path to your Excel file.

2. `df = pd.read_excel(excel_path)`: Uses pandas to read the data from the Excel file specified by the `excel_path` and stores it in a DataFrame named `df`. This assumes that your Excel file contains tabular data.

3. `dataset = Dataset.from_pandas(df)`: Converts the DataFrame `df` into a dataset using the `from_pandas` method provided by the `Dataset` class from the `datasets` library. This prepares your data for use in training and fine-tuning your chatbot model, making it easier to work with structured data.

In summary, this code snippet loads data from an Excel file, stores it in a DataFrame, and then converts it into a format suitable for use in your chatbot development project. Make sure that the Excel file contains the columns that the later cells read: 'dialogue' and 'summary ' (the second name ends with a trailing space).
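Because the later cells index the dataset by exact column name, a quick header check before conversion can save a confusing `KeyError`. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the `columns` list stands in for `list(df.columns)`.

```python
def missing_columns(actual, required):
    """Return required column names absent from `actual` (exact match, whitespace included)."""
    actual_set = set(actual)
    return [name for name in required if name not in actual_set]


def whitespace_padded(actual):
    """Return column names that carry leading or trailing whitespace."""
    return [name for name in actual if name != name.strip()]


# Hypothetical header row, standing in for list(df.columns).
columns = ["dialogue", "summary "]

print(missing_columns(columns, ["dialogue", "summary "]))  # exact names used by later cells
print(missing_columns(columns, ["dialogue", "summary"]))   # the name without the trailing space
print(whitespace_padded(columns))
```

If you prefer clean headers, you could strip the whitespace at load time, but every later reference to `'summary '` would then need to change as well.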

In [None]:
# Load data from Excel
excel_path = 'NIAA RGS _ GIM changes_Edited.xlsx'
df = pd.read_excel(excel_path)

# The later cells expect columns named 'dialogue' and 'summary ' (note the trailing space)
dataset = Dataset.from_pandas(df)

The provided code snippet is responsible for initializing and loading a pre-trained FLAN T5 model along with its tokenizer. Here's a brief description:

1. `model_name = 'google/flan-t5-base'`: Defines the name of the pre-trained FLAN T5 model that you want to use. In this case, it's set to 'google/flan-t5-base,' which is the model you've chosen for your chatbot.

2. `original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)`: Loads the pre-trained FLAN T5 model using the `AutoModelForSeq2SeqLM.from_pretrained` method from the `transformers` library. It initializes the model and specifies the data type for PyTorch tensors as `torch.bfloat16`. The `original_model` variable will hold the loaded model.

3. `tokenizer = AutoTokenizer.from_pretrained(model_name)`: Initializes the tokenizer for the same pre-trained FLAN T5 model using the `AutoTokenizer.from_pretrained` method. The `tokenizer` variable will store the loaded tokenizer, which is used to tokenize input text and prepare it for model input.

In summary, this code snippet loads the pre-trained FLAN T5 model and its tokenizer, allowing you to use them for text generation and chatbot tasks. The model is loaded with reduced precision using `torch.bfloat16` for potential memory and speed optimizations, but you can adjust this based on your specific requirements.
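To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes 4 bytes per `float32` value and 2 bytes per `bfloat16` value, and uses the parameter count printed later in this notebook. It covers the weights only; activations, gradients, and optimizer state are not included.

```python
PARAM_COUNT = 247_577_856  # "all model parameters" as printed later in this notebook

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_bytes(num_params, dtype):
    """Approximate storage for the weights alone, in bytes."""
    return num_params * BYTES_PER_PARAM[dtype]


for dtype in ("float32", "bfloat16"):
    size = weight_bytes(PARAM_COUNT, dtype)
    print(f"{dtype}: {size / 1e6:.0f} MB")
```

Halving the bytes per parameter halves the weight storage, which is the saving the `torch_dtype=torch.bfloat16` argument is after.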

In [None]:
model_name = 'google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The provided code snippet performs several important tasks, including measuring the number of trainable parameters in the original model and testing the model's zero-shot inference capabilities on a specific conversation. Here's a brief description:

1. `print_number_of_trainable_model_parameters(model)`: This function calculates and returns the number of trainable and total model parameters in the given model. It iterates through the named parameters of the model, counting both trainable and non-trainable ones. The function returns a formatted string that includes the counts and the percentage of trainable parameters compared to all parameters.

2. `print(print_number_of_trainable_model_parameters(original_model))`: This line calls the `print_number_of_trainable_model_parameters` function on the `original_model` and prints the results, providing insights into the model's size and capacity.

3. `index = 28`: Specifies the index of the dataset record that you want to use for zero-shot inference. This indicates which conversation and summary to test the model on.

4. `dialogue = dataset['dialogue'][index]` and `summary = dataset['summary '][index]`: Extracts the conversation and summary from the dataset based on the specified index. The summary column name ends with a trailing space, so it must be spelled exactly as `'summary '`.

5. `prompt = f"..."`: Defines a prompt for the model, which instructs it to summarize the provided conversation. It includes the dialogue and a placeholder for the summary.

6. `inputs = tokenizer(prompt, return_tensors='pt')`: Tokenizes the prompt using the previously loaded tokenizer and returns it as a PyTorch tensor.

7. `output = tokenizer.decode(...)`: Uses the `original_model` to generate a summary of the conversation based on the tokenized input. The generated summary is then decoded from tokens into readable text.

8. `dash_line = '-'.join(...)`: Creates a dashed line to visually separate different sections of the output.

9. The code prints the following:
   - The input prompt, which contains the conversation and the placeholder for the summary.
   - The baseline human summary, which is the actual summary for comparison.
   - The model's generated summary through zero-shot inference.

This code snippet effectively tests your model's ability to summarize a conversation and provides a comparison between the model's output and the human-generated summary for evaluation.
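The parameter-counting logic can be tried without loading a model. The sketch below uses a hypothetical `FakeParam` stand-in that exposes only `numel()` and `requires_grad`, the two things the counter reads, to show how freezing part of a model changes the trainable percentage.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for a torch parameter: only what the counter needs."""
    size: int
    requires_grad: bool = True

    def numel(self):
        return self.size


def count_parameters(named_params):
    """Return (trainable, total) element counts from (name, param) pairs."""
    trainable = total = 0
    for _, param in named_params:
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    return trainable, total


params = [
    ("encoder.weight", FakeParam(900, requires_grad=False)),  # frozen
    ("adapter.weight", FakeParam(100)),                        # trainable
]
trainable, total = count_parameters(params)
print(f"{100 * trainable / total:.2f}% trainable")
```

With a fully trainable model, as in the output of this notebook, both counts are equal and the percentage is 100%.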

In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

#Test the Model with Zero Shot Inferencing

index = 28

dialogue = dataset['dialogue'][index]
summary = dataset['summary '][index]

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')


trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

Scholarships

Summary:

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
GIM offers scholarship to the students based on their merit and financial background.                                Goa Institute of Management offers scholarship worth Rs. 5 lakhs each to the meritorious students so that money doesn't become a barrier in learning. Click below to know more.                                        https://gim.ac.in/admission/fees-and-financial-aid

---------------------------------------------------------------------------------------------------
MODEL GENERATION - ZERO SHOT:
Obtain a scholarship.


The provided code snippet performs full fine-tuning of your chatbot model using tokenized datasets and a custom tokenization function. Here's a brief description:

1. `tokenize_function(example)`: This function takes a batch of examples from your dataset and tokenizes them for fine-tuning. It wraps each dialogue in a start and end prompt, then passes the prompt and the summary to the tokenizer together as a text pair, so both end up in the same `input_ids` sequence. The labels are a copy of those `input_ids`. Note the consequence: the model is trained to reproduce its own input, which already contains the summary. For sequence-to-sequence fine-tuning, the labels would normally be the token IDs of the summary alone.

2. The function handles potential None values in the 'dialogue' column, ensuring that each example has a properly formatted prompt and summary.

3. It initializes empty lists to store tokenized input and labels for each example.

4. Inside the loop, it tokenizes each example individually, using the `tokenizer` to process the start_prompt, dialogue, and end_prompt.

5. It appends the tokenized input and labels to their respective lists.

6. After processing all examples, it concatenates the per-example tensors along the first dimension and stores the results in the 'input_ids' and 'labels' fields of the batch.

7. `tokenized_datasets = dataset.map(...)`: This line maps the `tokenize_function` to your dataset using the `map` function. It applies the tokenization process to all examples in the dataset, handling batching efficiently.

8. `tokenized_datasets = tokenized_datasets.remove_columns(['dialogue', 'summary '])`: Removes the original 'dialogue' and 'summary ' columns from the tokenized dataset, as they are no longer needed.

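9. `tokenized_datasets = tokenized_datasets.filter(...)`: This line keeps only every 100th example (index % 100 == 0), which reduces the size of the dataset for demonstration purposes. Because the dataset here has only 64 rows, just the example at index 0 survives, which is why the output reports `num_rows: 1`.

10. Finally, the code prints a "Shapes of the datasets:" header followed by a summary of the tokenized dataset. The per-split shape prints are commented out because this dataset has no train, validation, or test splits.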

This code snippet prepares your dataset for fine-tuning by tokenizing the input and labels, and it selects a subset of examples to work with. You can adjust the filtering criteria as needed based on your dataset size and requirements.
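For sequence-to-sequence fine-tuning, inputs and labels are usually encoded separately: the prompt becomes `input_ids` and the summary becomes `labels`. The sketch below illustrates only that structure. `toy_encode` is a hypothetical whitespace tokenizer standing in for the real one, and the padding ID of 0 is an assumption for illustration.

```python
PAD_ID = 0  # assumed padding ID for this toy example


def toy_encode(text, vocab, max_length):
    """Toy stand-in for a tokenizer: whitespace split, truncate, then pad to max_length."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]
    ids = ids[:max_length]
    return ids + [PAD_ID] * (max_length - len(ids))


def tokenize_batch(batch, vocab, max_length=8):
    """Encode prompts as input_ids and summaries as labels, separately."""
    start_prompt = "Summarize the following conversation.\n\n"
    end_prompt = "\n\nSummary: "
    prompts = [start_prompt + (d or "") + end_prompt for d in batch["dialogue"]]
    return {
        "input_ids": [toy_encode(p, vocab, max_length) for p in prompts],
        "labels": [toy_encode(s or "", vocab, max_length) for s in batch["summary "]],
    }


batch = {"dialogue": ["Scholarships"], "summary ": ["GIM offers scholarships"]}
encoded = tokenize_batch(batch, vocab={})
print(encoded["input_ids"][0])
print(encoded["labels"][0])
```

With a real Hugging Face tokenizer, the same shape is obtained by calling the tokenizer once on the prompts and once on the summaries. Padded label positions are often replaced with -100 so that the loss ignores them.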

In [None]:
#Perform Full Fine-Tuning

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '

    # Check for None values in the 'dialogue' column
    dialogues = example["dialogue"]
    prompt = [start_prompt + (dialogue if dialogue is not None else '') + end_prompt for dialogue in dialogues]

    # Ensure the column name matches the one in your dataset
    summaries = example["summary "]

    # Initialize empty lists to store tokenized input and labels
    input_ids_list = []
    labels_list = []

    for single_prompt, single_summary in zip(prompt, summaries):
        # Tokenize each example individually
        tokenized_output = tokenizer(single_prompt, single_summary, padding="max_length", truncation=True, return_tensors="pt", max_length=512)

        # Append tokenized input and labels to the lists
        input_ids_list.append(tokenized_output['input_ids'])
        labels_list.append(tokenized_output['input_ids'])  # NOTE: copies the prompt+summary IDs; seq2seq labels are normally the tokenized summary alone

    # Convert lists to PyTorch tensors
    example['input_ids'] = torch.cat(input_ids_list, dim=0)
    example['labels'] = torch.cat(labels_list, dim=0)

    return example


# Here `dataset` is a single Dataset built from the DataFrame (no train/validation/test splits).
# With batched=True, tokenize_function receives the examples in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['dialogue', 'summary '])

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

print(f"Shapes of the datasets:")
#print(f"Training: {tokenized_datasets['train'].shape}")
#print(f"Validation: {tokenized_datasets['validation'].shape}")
#print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Map:   0%|          | 0/64 [00:00<?, ? examples/s]

Filter:   0%|          | 0/64 [00:00<?, ? examples/s]

Shapes of the datasets:
Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 1
})
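The single remaining row follows directly from the filter arithmetic: among indices 0 to 63, only 0 is divisible by 100. A small sketch (the helper name `kept_indices` is hypothetical) makes it easy to try other strides before re-running the cell.

```python
def kept_indices(num_rows, stride):
    """Indices that survive a filter of the form `index % stride == 0`."""
    return [i for i in range(num_rows) if i % stride == 0]


print(kept_indices(64, 100))  # the stride used in this notebook
print(kept_indices(64, 10))   # a smaller stride keeps more rows
```

For a dataset this small, dropping the filter entirely is probably the more sensible choice.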


The provided code snippet is responsible for fine-tuning your chatbot model with the preprocessed dataset. Here's a brief description:

1. `output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'`: Defines an output directory for saving the fine-tuned model. The directory name includes a timestamp to make it unique.

2. `training_args = TrainingArguments(...)`: Initializes training arguments for the fine-tuning process. Key parameters include:
   - `output_dir`: The directory where the fine-tuned model and logs will be saved.
   - `learning_rate`: The learning rate for training the model.
   - `num_train_epochs`: The number of training epochs (set to 1 in this case).
   - `weight_decay`: Weight decay regularization parameter.
   - `logging_steps`: How often to log training progress.
   - `max_steps`: The maximum number of training steps (set to 1, which is a very low value for demonstration purposes).

3. `trainer = Trainer(...)`: Initializes a Trainer object for fine-tuning. Key arguments include:
   - `model`: The original model to be fine-tuned.
   - `args`: The training arguments defined earlier.
   - `train_dataset` and `eval_dataset`: The entire preprocessed dataset is used for both training and evaluation.

4. `trainer.train()`: Initiates the fine-tuning process. Because `max_steps` is set to a positive value, it overrides `num_train_epochs`, so training stops after a single optimization step.

This code snippet sets up the training process for your chatbot model, fine-tuning it with the preprocessed dataset. Keep in mind that the `max_steps` parameter is set to a low value (1) for demonstration purposes; increase it, or leave it unset and rely on `num_train_epochs`, for a full training run. Checkpoints are written to `output_dir` according to the save settings. With a single step and the default save interval, no checkpoint is expected, so call `trainer.save_model()` if you want to keep the result.
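The interaction between `max_steps` and `num_train_epochs` can be sketched as simple arithmetic. The function below is an approximation for planning purposes, not the library's own code. The name `planned_steps` is hypothetical, and the batch size of 8 is an assumption that matches the `TrainingArguments` default.

```python
import math


def planned_steps(num_examples, batch_size, num_epochs, max_steps=-1, grad_accum=1):
    """Approximate number of optimizer steps a training run will take."""
    if max_steps > 0:
        # A positive max_steps takes precedence over the epoch count.
        return max_steps
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(num_epochs * steps_per_epoch)


print(planned_steps(1, 8, 1, max_steps=1))   # this notebook's configuration
print(planned_steps(64, 8, 3))               # all 64 rows, 3 epochs, no max_steps
```

The second call shows what a fuller run on the unfiltered data would look like under the same assumptions.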

In [None]:
# Fine-Tune the Model with the Preprocessed Dataset

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets,  # Use the entire preprocessed dataset
    eval_dataset=tokenized_datasets  # Use the entire preprocessed dataset for evaluation
)

# Fine-tune the model
trainer.train()



Step,Training Loss
1,39.75


TrainOutput(global_step=1, training_loss=39.75, metrics={'train_runtime': 0.5137, 'train_samples_per_second': 15.574, 'train_steps_per_second': 1.947, 'total_flos': 684757352448.0, 'train_loss': 39.75, 'epoch': 1.0})

The provided code snippet defines a function for calculating ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores for a given set of hypotheses and references. Here's a brief description:

1. `def calculate_rouge(hypotheses, references)`: Defines a Python function named `calculate_rouge` that takes two arguments: `hypotheses` (the generated text) and `references` (the ground-truth text). Despite the plural names, `RougeScorer.score` compares one string with one string, so each argument should be a single text; to evaluate several pairs, call the function once per pair.

2. `scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)`: Initializes a ROUGE scorer object using the `RougeScorer` class from the `rouge_scorer` library. The scorer is configured to calculate ROUGE scores for three types: 'rouge1,' 'rouge2,' and 'rougeL,' and it uses stemming during the evaluation (stemming helps match similar words).

3. `rouge_scores = scorer.score(...)`: Uses the scorer to compare the generated text with the reference. `RougeScorer.score` expects its arguments in the order `(target, prediction)`, that is, reference first. It returns a dictionary keyed by ROUGE type ('rouge1,' 'rouge2,' 'rougeL'), where each value holds precision, recall, and F-measure.

4. `return rouge_scores`: Returns the computed ROUGE scores as a dictionary.

This code snippet defines a convenient function for calculating ROUGE scores, which can be used to evaluate the quality of text generation or summarization tasks, such as evaluating the output of your chatbot against reference summaries or human-generated text.
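To show what a ROUGE-1 score measures, here is a minimal hand-rolled sketch based on unigram overlap. It is a simplification and not the `rouge_score` implementation: it splits on whitespace and does no stemming, so its numbers can differ from the library's. The function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Unigram-overlap precision, recall, and F-measure (no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": f1}


scores = rouge1("the cat sat on the mat", "the cat sat")
print(scores)
```

Here every predicted word appears in the reference, so precision is perfect, while recall is lower because half of the reference words are missing from the prediction.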

In [None]:
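# Define the ROUGE scorer
from rouge_score import rouge_scorer

def calculate_rouge(hypotheses, references):
    # Scores one generated text (hypotheses) against one reference text (references); both are strings.
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    # RougeScorer.score expects (target, prediction), so the reference goes first.
    rouge_scores = scorer.score(references, hypotheses)
    return rouge_scores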


The provided code snippet defines the output directory where the fine-tuned model and related files will be saved. Here's a brief description:

1. `output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'`: This line sets up the `output_dir` variable, which specifies the directory where the fine-tuned model and its associated files will be saved.

- `f'./dialogue-summary-training-{str(int(time.time()))}'`: This is a formatted string that creates a unique directory name based on the current timestamp using `time.time()`. The format of the directory name is 'dialogue-summary-training-' followed by a timestamp in seconds since the epoch. This ensures that each fine-tuning run creates a distinct output directory to avoid overwriting previous results.

In essence, this line prepares a directory path for storing the output of the fine-tuning process, making it easy to locate and manage the trained model and related artifacts.
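The naming scheme can be wrapped in a small helper, sketched below. The name `make_output_dir` and its `now` parameter are hypothetical; `now` exists only so the result can be reproduced in a test. Because the timestamp has one-second resolution, two runs started within the same second would share a directory.

```python
import time


def make_output_dir(prefix, now=None):
    """Build './<prefix>-<unix seconds>' from the current time (or a supplied timestamp)."""
    stamp = int(time.time() if now is None else now)
    return f"./{prefix}-{stamp}"


print(make_output_dir("dialogue-summary-training"))
```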

In [None]:
# Fine-Tune the Model with the Preprocessed Dataset

output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

The provided code snippet performs the fine-tuning of the model with the specified training arguments and dataset. Here's a brief description:

1. `training_args = TrainingArguments(...)`: Initializes training arguments for the fine-tuning process. These arguments are provided to customize the training configuration:
   - `output_dir`: Specifies the directory where the fine-tuned model and related files will be saved. This directory was previously defined based on the timestamp.
   - `learning_rate`: Sets the learning rate for training to `1e-5` (0.00001).
   - `num_train_epochs`: Specifies the number of training epochs to be 1.
   - `weight_decay`: Sets a weight decay regularization parameter to 0.01.
   - `logging_steps`: Determines how often the training progress is logged, with a frequency of 1 step in this case.
   - `max_steps`: Specifies the maximum number of training steps, which is set to 1 for demonstration purposes. A positive `max_steps` overrides `num_train_epochs`; in practice, you would use a larger value or leave it unset so that training runs for the requested number of epochs.

2. `trainer = Trainer(...)`: Initializes a Trainer object for fine-tuning. The key arguments include:
   - `model`: The original model (`original_model`) that you want to fine-tune.
   - `args`: The training arguments defined earlier, specifying the training configuration.
   - `train_dataset` and `eval_dataset`: These arguments specify the datasets to be used for training and evaluation. In this case, the entire preprocessed dataset (`tokenized_datasets`) is used for both training and evaluation.

3. `trainer.train()`: Initiates the fine-tuning process. Although `num_train_epochs` is 1, the positive `max_steps` value takes precedence, so training stops after one optimization step.

Please note that the `max_steps` value is set to 1 in this snippet, which means the training will only run for a single step for demonstration purposes. In a real fine-tuning scenario, you should set `max_steps` to a higher value, or leave it unset and rely on `num_train_epochs`, to ensure the model trains sufficiently. With a single step and the default save interval, no checkpoint is expected in `output_dir`; call `trainer.save_model()` to keep the result.

In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1
)

trainer = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
)

# Fine-tune the model
trainer.train()

Step,Training Loss
1,39.5


TrainOutput(global_step=1, training_loss=39.5, metrics={'train_runtime': 0.225, 'train_samples_per_second': 35.551, 'train_steps_per_second': 4.444, 'total_flos': 684757352448.0, 'train_loss': 39.5, 'epoch': 1.0})

The provided code snippet evaluates the fine-tuned model and prints the resulting metrics. Here's a brief description:

1. `eval_results = trainer.evaluate()`: Evaluates the fine-tuned model on the dataset passed as `eval_dataset` when the `trainer` object was created. Because no `compute_metrics` function was given to the `Trainer`, the result contains only the evaluation loss and runtime statistics; ROUGE scores are not computed here.

2. `print("Eval Results:")`: This line prints a header indicating that the following output will display evaluation results.

3. `print(eval_results)`: Prints the `eval_results` dictionary. In this run it contains `eval_loss`, `eval_runtime`, `eval_samples_per_second`, `eval_steps_per_second`, and `epoch`.

The output lets you check the evaluation loss on your task. ROUGE scores, which are commonly used to assess generated summaries, have to be computed separately from generated text.

In [None]:
# Evaluate the model (returns loss and runtime statistics; ROUGE is not computed here)
eval_results = trainer.evaluate()
# Inspect the eval_results dictionary
print("Eval Results:")
print(eval_results)

Eval Results:
{'eval_loss': 42.5, 'eval_runtime': 0.0564, 'eval_samples_per_second': 17.736, 'eval_steps_per_second': 17.736, 'epoch': 1.0}


The provided code snippet inspects the `eval_results` dictionary to see which keys it contains. Here's a brief description:

1. `predictions_key = "predictions"`: Defines a variable intended to hold the key for model predictions. The variable is not used afterwards and, as the output shows, `eval_results` has no such key: `trainer.evaluate()` returns metrics only.

2. `print("Keys in eval_results:")`: This line prints a header indicating that the following output will display the keys present in the `eval_results` dictionary.

3. `print(eval_results.keys())`: This line prints the keys present in the `eval_results` dictionary. This will help you identify the correct key that corresponds to the model's predictions.

4. `for key, value in eval_results.items():`: This loop iterates through all the key-value pairs in the `eval_results` dictionary.

5. `print(f"{key}: {value}")`: Within the loop, this line prints each key along with its associated value in the `eval_results` dictionary. You can inspect the output to determine which key contains the model's predictions.

By running this code, you can confirm which metrics are available. The dictionary holds no generated outputs; to obtain predictions you would use `trainer.predict()` or call `generate()` on the model directly, as in the zero-shot example.
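A quick way to check whether a metrics dictionary contains a given family of metrics is to search its keys. The sketch below uses a literal copy of the results shown in this notebook's output; the helper name `keys_containing` is hypothetical.

```python
eval_results = {  # copied from the output shown in this notebook
    "eval_loss": 42.5,
    "eval_runtime": 0.0564,
    "eval_samples_per_second": 17.736,
    "eval_steps_per_second": 17.736,
    "epoch": 1.0,
}


def keys_containing(results, needle):
    """Return the keys whose name contains `needle` (case-insensitive)."""
    return [key for key in results if needle.lower() in key.lower()]


print(keys_containing(eval_results, "rouge"))  # no ROUGE metrics in this run
print(keys_containing(eval_results, "loss"))
```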

In [None]:
# Inspect eval_results; trainer.evaluate() returns metrics only, not predictions
predictions_key = "predictions"  # Unused: eval_results has no such key (see the output below)

# Print keys in eval_results to identify the correct key
print("Keys in eval_results:")
print(eval_results.keys())

# Print values associated with each key in eval_results
for key, value in eval_results.items():
    print(f"{key}: {value}")

Keys in eval_results:
dict_keys(['eval_loss', 'eval_runtime', 'eval_samples_per_second', 'eval_steps_per_second', 'epoch'])
eval_loss: 42.5
eval_runtime: 0.0564
eval_samples_per_second: 17.736
eval_steps_per_second: 17.736
epoch: 1.0


The provided code snippet evaluates the fine-tuned model again and prints the entire `eval_results` dictionary. Here's a brief description:

1. `eval_results = trainer.evaluate()`: Evaluates the fine-tuned model on the dataset passed as `eval_dataset` when the `trainer` object was created and stores the results in the `eval_results` dictionary. As before, the results contain the evaluation loss and runtime statistics but no ROUGE scores, because the `Trainer` was created without a `compute_metrics` function.

2. `print("Eval Results:")`: This line prints a header indicating that the following output will display the evaluation results.

3. `print(eval_results)`: This line prints the entire `eval_results` dictionary. Only the timing values differ from the previous run; the evaluation loss is unchanged.

By printing the `eval_results` dictionary, you can examine the evaluation loss for your task. ROUGE scores are often useful for evaluating the quality of generated text, but they have to be computed separately from generated text.

In [None]:
# Evaluate the model (returns loss and runtime statistics; ROUGE is not computed here)
eval_results = trainer.evaluate()

# Print the entire eval_results dictionary
print("Eval Results:")
print(eval_results)

Eval Results:
{'eval_loss': 42.5, 'eval_runtime': 0.0577, 'eval_samples_per_second': 17.317, 'eval_steps_per_second': 17.317, 'epoch': 1.0}


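This cell installs the `rouge-score` library using the `pip` package manager. The same package was already installed above as `rouge_score==0.1.2`, so in the same environment pip should simply report that the requirement is satisfied. The library provides tools for calculating ROUGE scores, which are commonly used for evaluating the quality of generated text in natural language processing tasks, such as text summarization or machine translation.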

In [None]:
!pip install rouge-score

This cell imports the `rouge_scorer` module from the `rouge_score` library. This module provides tools for calculating ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, which are useful for evaluating the quality of generated text in natural language processing tasks.


In [None]:
from rouge_score import rouge_scorer

The provided code defines a Python function called `calculate_rouge` that calculates ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores for a set of hypotheses and references. Here's a brief description of the code:

1. `def calculate_rouge(hypotheses, references)`: This function takes two arguments: `hypotheses` and `references`. Typically, `hypotheses` contains generated or predicted text, while `references` contains the ground truth or reference text for comparison. Despite the plural names, `RougeScorer.score` compares one string with one string, so each argument should be a single text; to evaluate several pairs, call the function once per pair.

2. `scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)`: Inside the function, it initializes a ROUGE scorer object named `scorer` using the `RougeScorer` class from the `rouge_scorer` library. The scorer is configured to calculate three types of ROUGE scores: 'rouge1' (unigram), 'rouge2' (bigram), and 'rougeL' (longest common subsequence). Additionally, `use_stemmer=True` is set to enable stemming during the evaluation, which helps match similar words.

3. `rouge_scores = scorer.score(...)`: The function uses the `scorer` to compare the generated text with the reference. `RougeScorer.score` expects its arguments in the order `(target, prediction)`, that is, reference first. It computes ROUGE scores for the specified types ('rouge1,' 'rouge2,' 'rougeL') and returns them as a dictionary whose values hold precision, recall, and F-measure.

4. `return rouge_scores`: Finally, the function returns the computed ROUGE scores as a dictionary, where each key represents a ROUGE metric, and the corresponding value is the computed score.

You can use this `calculate_rouge` function to evaluate the quality of generated text by comparing it to reference text, which is a common practice in natural language processing tasks like text summarization, machine translation, and more.

In [None]:
# Define the ROUGE scorer
def calculate_rouge(hypotheses, references):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(hypotheses, references)
    return rouge_scores


The updated `calculate_rouge` function averages ROUGE scores over a set of hypotheses and references and reports precision, recall, and F1 for each metric:

1. `def calculate_rouge(hypotheses, references)`: The function takes two equal-length lists of strings. `hypotheses` contains the generated or predicted text, and `references` contains the ground-truth text to compare against.

2. `scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True)`: Initializes a `RougeScorer` for 'rouge1' (unigram), 'rouge2' (bigram), and 'rougeL' (longest common subsequence), with stemming enabled so that inflected forms of the same word can match.

3. `pair_scores = [scorer.score(ref, hyp) ...]`: Scores each hypothesis/reference pair. `RougeScorer.score` takes a target string and a prediction string, in that order, and returns a dictionary mapping each metric name to a `Score` tuple. It has no `score_type` parameter; all configured metrics are computed in one call.

4. The code then iterates through each ROUGE metric ('rouge1', 'rouge2', 'rougeL') and averages the `precision`, `recall`, and `fmeasure` fields over all pairs. The averages are stored under the keys 'precision', 'recall', and 'f1' in the `rouge_scores` dictionary.

5. `return rouge_scores`: Returns a dictionary of dictionaries, where each key is a ROUGE metric and each value holds the averaged precision, recall, and F1-score for that metric.

Compared with the first version, this function returns one averaged breakdown per metric instead of a list of per-pair results.

In [None]:
def calculate_rouge(hypotheses, references):
    metrics = ['rouge1', 'rouge2', 'rougeL']
    scorer = rouge_scorer.RougeScorer(metrics, use_stemmer=True)
    # score() takes (target, prediction) strings and returns {metric: Score}
    pair_scores = [scorer.score(ref, hyp) for hyp, ref in zip(hypotheses, references)]
    n = max(len(pair_scores), 1)  # avoid division by zero on empty input
    rouge_scores = {}

    for metric in metrics:
        rouge_scores[metric] = {
            'precision': sum(s[metric].precision for s in pair_scores) / n,
            'recall': sum(s[metric].recall for s in pair_scores) / n,
            'f1': sum(s[metric].fmeasure for s in pair_scores) / n,
        }

    return rouge_scores


Install the `rouge` package, a ROUGE implementation that is separate from `rouge-score`. It exposes a `Rouge` class whose `get_scores` method is used later to score generated summaries against references.

In [None]:
!pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1

The provided code snippet performs several key tasks for fine-tuning a transformer-based model (original_model) on a custom dataset for summarization:

1. **Loading Data**: It loads data from an Excel file named 'NIAA RGS _ GIM changes_Edited.xlsx' into a pandas DataFrame (`df`).

2. **Splitting Data**: The dataset is split into training (`train_df`) and testing (`test_df`) sets using the `train_test_split` function from `sklearn.model_selection`.

3. **Dataset Conversion**: The training and testing DataFrames are converted into datasets (`train_dataset` and `test_dataset`) using the `Dataset.from_pandas` function from the `datasets` library.

4. **Loading Pretrained Model and Tokenizer**: `original_model` is loaded from 'google/flan-t5-base' in `bfloat16` precision using `AutoModelForSeq2SeqLM.from_pretrained`, and its tokenizer is loaded using `AutoTokenizer.from_pretrained` from the Hugging Face Transformers library.

5. **Tokenization Function**: A tokenization function `tokenize_function` is defined to preprocess each batch. It wraps every dialogue in a start and end prompt, tokenizes the prompt and its summary together as one sequence pair padded or truncated to 512 tokens, and copies the resulting `input_ids` into `labels`.

6. **Tokenization and Preprocessing**: The `map` method is used to apply the `tokenize_function` to both the training and testing datasets (`tokenized_train_dataset` and `tokenized_test_dataset`). The columns 'dialogue' and 'summary ' are removed from the tokenized datasets as they are no longer needed.

7. **Fine-Tuning Configuration**: Training arguments (`training_args`) are defined to configure the fine-tuning process. These include settings such as output directory, learning rate, number of epochs, batch size, and evaluation strategy.

8. **Trainer Initialization**: A `Trainer` object (`trainer_original`) is created using the original_model, training arguments, and the tokenized training and testing datasets.

9. **Fine-Tuning**: The `train` method is called on the `trainer_original` object to start the fine-tuning process. The model is trained on the tokenized training dataset and evaluated on the tokenized testing dataset.

The code performs fine-tuning of the original_model for summarization using the specified dataset and training configuration. You can adapt this code for your specific use case and data by modifying the dataset loading, preprocessing, and fine-tuning settings as needed.

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
from sklearn.model_selection import train_test_split
from rouge import Rouge
import torch
import time
import pandas as pd

# Load data from Excel
excel_path = 'NIAA RGS _ GIM changes_Edited.xlsx'
df = pd.read_excel(excel_path)

# Split the dataset into train and test sets (80/20)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Load original model
model_name = 'google/flan-t5-base'
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize function
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '

    dialogues = example["dialogue"]
    prompt = [start_prompt + (dialogue if dialogue is not None else '') + end_prompt for dialogue in dialogues]

    summaries = example["summary "]

    input_ids_list = []
    labels_list = []

    for single_prompt, single_summary in zip(prompt, summaries):
        tokenized_output = tokenizer(single_prompt, single_summary, padding="max_length", truncation=True, return_tensors="pt", max_length=512)
        input_ids_list.append(tokenized_output['input_ids'])
        labels_list.append(tokenized_output['input_ids'])  # Labels are a copy of input_ids (prompt + summary pair)

    example['input_ids'] = torch.cat(input_ids_list, dim=0)
    example['labels'] = torch.cat(labels_list, dim=0)

    return example

# Tokenize and preprocess the datasets
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_train_dataset = tokenized_train_dataset.remove_columns(['dialogue', 'summary '])

tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = tokenized_test_dataset.remove_columns(['dialogue', 'summary '])

# Fine-tune the original_model
training_args = TrainingArguments(
    output_dir='./original_model_training',
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_steps=10,
    evaluation_strategy="steps",
    eval_steps=10,
)

trainer_original = Trainer(
    model=original_model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset
)

trainer_original.train()



Step,Training Loss,Validation Loss
10,41.5,41.423077
20,40.25,41.346153


TrainOutput(global_step=26, training_loss=39.50961538461539, metrics={'train_runtime': 15.6799, 'train_samples_per_second': 3.253, 'train_steps_per_second': 1.658, 'total_flos': 34922624974848.0, 'train_loss': 39.50961538461539, 'epoch': 1.0})
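
In the cell above, the prompt and summary are encoded together and `labels` is a copy of `input_ids`. A more common setup for sequence-to-sequence fine-tuning is to tokenize the prompt as the model input and the summary as the label sequence. The sketch below shows one way that could look. The helper name `tokenize_for_seq2seq`, its `tokenizer` argument, and the `max_target_len` default of 128 are assumptions made for illustration and are not part of the original notebook.

```python
def tokenize_for_seq2seq(batch, tokenizer, max_input_len=512, max_target_len=128):
    """Tokenize prompts as model inputs and summaries as labels (hypothetical helper)."""
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '

    # Build one prompt per dialogue; treat missing values as empty strings
    prompts = [start_prompt + (d or '') + end_prompt for d in batch['dialogue']]
    summaries = [s or '' for s in batch['summary ']]  # column name keeps its trailing space

    model_inputs = tokenizer(prompts, padding='max_length', truncation=True,
                             max_length=max_input_len)
    targets = tokenizer(summaries, padding='max_length', truncation=True,
                        max_length=max_target_len)

    # Replace padding in the labels with -100 so the loss ignores those positions
    pad_id = tokenizer.pad_token_id
    model_inputs['labels'] = [
        [tok if tok != pad_id else -100 for tok in ids]
        for ids in targets['input_ids']
    ]
    return model_inputs
```

It could be applied with `train_dataset.map(lambda batch: tokenize_for_seq2seq(batch, tokenizer), batched=True)`. Because padded label positions become -100, they would need to be mapped back to `tokenizer.pad_token_id` before decoding references for ROUGE.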

The provided code snippet generates summaries for the entire test dataset using the fine-tuned `original_model` and then calculates ROUGE scores for evaluation. Here's a brief description of the code:

1. **Model Evaluation Setup**:
   - `original_model.eval()`: Sets the `original_model` to evaluation mode.
   - `original_model_summaries = []`: Initializes an empty list to store the generated summaries by the `original_model`.

2. **Generating Summaries**:
   - It iterates through each example in the `tokenized_test_dataset`.
   - Converts the 'input_ids' list for each example to a PyTorch tensor and adds a batch dimension.
   - Moves the tensor to the same device as the model (`original_model.device`).
   - Generates summaries for each example using the `original_model.generate` method and appends the decoded summary to the `original_model_summaries` list.

3. **Reference Summaries**:
   - Assumes that `tokenized_test_dataset['labels']` is a list of lists of token IDs.
   - Decodes these token IDs into text and stores the results in the `decoded_refs` list. Because `labels` was copied from `input_ids` during tokenization, each decoded reference contains the prompt and dialogue as well as the summary.

4. **ROUGE Score Calculation**:
   - Initializes a Rouge object (`rouge`) for calculating ROUGE scores.
   - Uses the `rouge.get_scores` method to calculate ROUGE scores between the generated summaries (`original_model_summaries`) and the reference summaries (`decoded_refs`).
   - The `avg=True` argument computes average ROUGE scores across all examples, and `ignore_empty=True` omits empty references from the calculation.

5. **Printing Results**:
   - Prints the calculated ROUGE scores for the `original_model`.

This code evaluates the `original_model` on the test dataset and provides ROUGE scores for assessing the quality of the generated summaries in comparison to the reference summaries. It's a common practice for evaluating the performance of text summarization models.

In [None]:
# Generate summaries for the entire test dataset
original_model.eval()
original_model_summaries = []

for example in tokenized_test_dataset:
    # Convert input_ids list to a tensor
    input_ids_tensor = torch.tensor(example['input_ids']).unsqueeze(0)  # Add a batch dimension
    # Move the tensor to the same device as the model
    input_ids_tensor = input_ids_tensor.to(original_model.device)

    # Generate output using the model
    output = original_model.generate(input_ids_tensor, max_length=200)
    original_model_summaries.append(tokenizer.decode(output[0], skip_special_tokens=True))

# Assuming tokenized_test_dataset['labels'] is a list of lists of token IDs
decoded_refs = [tokenizer.decode(ref, skip_special_tokens=True) for ref in tokenized_test_dataset['labels']]

# Compute ROUGE scores for the original_model
rouge = Rouge()
original_model_results = rouge.get_scores(
    hyps=original_model_summaries,
    refs=decoded_refs,
    avg=True,
    ignore_empty=True
)

print('ORIGINAL MODEL:')
print(original_model_results)


ORIGINAL MODEL:
{'rouge-1': {'r': 0.2737338834388524, 'p': 0.4940725940725941, 'f': 0.3343147239572375}, 'rouge-2': {'r': 0.14084765137396715, 'p': 0.36153846153846153, 'f': 0.19584751710789408}, 'rouge-l': {'r': 0.2737338834388524, 'p': 0.4940725940725941, 'f': 0.3343147239572375}}
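
With `avg=True`, the reported `r`, `p`, and `f` values appear to be averaged separately over the examples, so the reported `f` is a mean of per-example F1 scores rather than the F1 of the mean precision and recall. The sketch below illustrates the difference. The two per-example pairs are hypothetical numbers chosen for illustration; only the final pair is copied from the output above.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total

# Hypothetical per-example (precision, recall) pairs, chosen only for illustration
per_example = [(0.9, 0.1), (0.2, 0.6)]

mean_p = sum(p for p, _ in per_example) / len(per_example)
mean_r = sum(r for _, r in per_example) / len(per_example)
mean_of_f1 = sum(f1(p, r) for p, r in per_example) / len(per_example)
f1_of_means = f1(mean_p, mean_r)

print(f"mean of per-example F1: {mean_of_f1:.4f}")
print(f"F1 of mean P and R:     {f1_of_means:.4f}")

# ROUGE-1 precision and recall copied from the output above
reported_p, reported_r = 0.4940725940725941, 0.2737338834388524
print(f"F1 of reported P and R: {f1(reported_p, reported_r):.4f}")
```

For the reported ROUGE-1 values the last line comes out near 0.352, compared with the reported `f` of 0.334, which is consistent with per-example averaging.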


# ROUGE-1:

1. r (Recall): 27.37% - On average, 27.37% of the words in the reference summaries also appear in the generated summaries. A higher recall means more of the reference content was captured.

2. p (Precision): 49.41% - On average, 49.41% of the words in the generated summaries also appear in the reference summaries. Higher precision means the generated summary contains fewer extraneous words.

3. f (F1-Score): 33.43% - The harmonic mean of precision and recall, giving a single measure that balances the two. The reported value is averaged over the examples, so it need not equal the harmonic mean of the averaged precision and recall shown above.

# ROUGE-2:

1. r: 14.08% - This is the recall calculated on bigram (pairs of consecutive words) matches between the generated and reference summaries.

2. p: 36.15% - This is the precision calculated on bigrams.

3. f: 19.58% - The F1-Score for bigram matches.

# ROUGE-L:

1. r: 27.37% - This is the recall calculated based on the longest common subsequence between the generated and reference summaries.

2. p: 49.41% - Precision based on the longest common subsequence.

3. f: 33.43% - The F1-Score based on the longest common subsequence.
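
To make the recall, precision, and F1 definitions concrete, here is a minimal sketch of ROUGE-N computed by hand on whitespace tokens. It is a simplified illustration, not the implementation used by the `rouge` package, which may tokenize and count matches differently. The function names and the example sentences are made up for this sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(hypothesis, reference, n=1):
    """Compute ROUGE-N recall, precision and F1 for two whitespace-tokenized strings."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((hyp & ref).values())  # matches, clipped to the smaller count
    recall = overlap / max(sum(ref.values()), 1)      # share of reference n-grams found
    precision = overlap / max(sum(hyp.values()), 1)   # share of generated n-grams that match
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}

# Made-up example pair
reference = "the cat sat on the mat"
hypothesis = "the cat lay on a mat"

print("ROUGE-1:", rouge_n(hypothesis, reference, n=1))
print("ROUGE-2:", rouge_n(hypothesis, reference, n=2))
```

In this pair four of the six unigrams match but only one of the five bigrams does, which mirrors the pattern of ROUGE-2 scoring lower than ROUGE-1.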


---


# Interpretation:
1. The ROUGE-1 scores show that the generated summaries cover roughly a quarter of the words in the references (recall of 27.37%), while about half of the generated words appear in the references (precision of 49.41%).

2. The ROUGE-2 scores are lower than the ROUGE-1 scores, which is common because matching consecutive word pairs (bigrams) exactly is harder than matching single words. This score is more indicative of how well the model reproduces the phrasing and local structure of the reference summaries.

3. The ROUGE-L scores are identical to the ROUGE-1 scores, which is unusual as they typically differ. ROUGE-L is based on the longest common subsequence, reflecting how well the model preserves the order of the content. Identical values can occur when the overlapping words appear in the same order in both texts.
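
The relationship between ROUGE-L and ROUGE-1 can be checked on a toy example. The sketch below compares the unigram overlap with the length of the longest common subsequence (LCS) for two made-up hypotheses: one that keeps the reference word order and one that does not. It is an illustration of the underlying quantities under simple whitespace tokenization, not the `rouge` package's implementation.

```python
from collections import Counter

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def unigram_overlap(a, b):
    """Number of matching unigrams, clipped to the smaller count."""
    return sum((Counter(a) & Counter(b)).values())

# Made-up example texts
reference = "the meeting moved to friday afternoon".split()
in_order = "the meeting moved friday".split()
shuffled = "friday the meeting moved".split()

for name, hyp in [("in order", in_order), ("shuffled", shuffled)]:
    print(name, "| unigram overlap:", unigram_overlap(hyp, reference),
          "| LCS length:", lcs_length(hyp, reference))
```

When the matching words keep their order, the LCS contains every matching unigram, so ROUGE-L and ROUGE-1 coincide. Once the order changes, the LCS gets shorter and ROUGE-L drops below ROUGE-1.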



---



The provided code defines a function called `generate_response` for generating a response to a given prompt using a fine-tuned language model. Here's a brief description of how the code works:

1. `model.eval()`: Sets the model to evaluation mode, which disables training-only behavior such as dropout so that inference is consistent. It does not update or freeze any weights by itself.

2. Device Handling:
   - It determines the appropriate device to use for inference, either "cuda" (GPU) if available or "cpu" (CPU) if a GPU is not available.
   - It moves the model to the chosen device using `model.to(device)`.

3. Encoding the Prompt:
   - The input prompt is tokenized and encoded using the provided tokenizer.
   - The resulting input tensors are moved to the same device as the model.

4. Generating Response:
   - The model generates a response based on the encoded input.
   - `output_sequences` contains the generated token IDs.

5. Decoding Response:
   - The generated token IDs are decoded into text using the tokenizer, skipping special tokens.

6. The generated response is returned from the function.

The code then demonstrates the use of the `generate_response` function by taking a user's prompt as input and generating a response using the `original_model` and `tokenizer`. The generated response is printed and displayed to the user.

This code allows you to interactively use the fine-tuned model to generate responses to user prompts. It can be useful for building chatbots, question-answering systems, or any application where text generation is required.

In [None]:
def generate_response(prompt, model, tokenizer):
    # Ensure the model is in evaluation mode
    model.eval()

    # Move the model to the appropriate device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Encode the prompt
    inputs = tokenizer(prompt, return_tensors='pt', max_length=512, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate a response
    output_sequences = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=200)

    # Decode the generated response
    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    return generated_text

# Assuming original_model and tokenizer are already defined and available in your notebook
# Take prompt from user
user_prompt = input("Please enter your prompt: ")

# Generate and print the response
response = generate_response(user_prompt, original_model, tokenizer)
print("Generated Response:", response)


Please enter your prompt: Goa Institute of Management
Generated Response: Goa Institute of Management


The provided code defines an evaluation function `evaluate_model` that evaluates a fine-tuned language model using a specified evaluation metric (e.g., ROUGE). Here's a brief description of how the code works:

1. `model.eval()`: Sets the model to evaluation mode to ensure consistent behavior during inference.

2. Device Handling:
   - Determines the device on which to perform inference (either "cuda" GPU or "cpu" CPU) based on the device of the model.
   
3. Metric Loading:
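   - Loads the specified evaluation metric (e.g., ROUGE) using `load_metric` from the `datasets` library. This function is deprecated in favor of `evaluate.load` from the `evaluate` library, so it emits a warning when called.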

4. Evaluation Loop:
   - Iterates through the examples in the `test_dataset`.
   - Converts input_ids and labels to PyTorch tensors and moves them to the appropriate device.
   - Generates output sequences based on the input using the model.
   - Decodes the generated output and reference labels into text.
   - Adds the decoded prediction and reference to the metric for evaluation.

5. Metric Computation:
   - Computes the evaluation metric (e.g., ROUGE) scores based on the collected predictions and references using `rouge.compute()`.

6. Returns the computed metric scores.

7. Finally, the code evaluates the `original_model` using the `evaluate_model` function with the `tokenized_test_dataset` and prints the evaluation results.

This code allows you to evaluate the performance of your fine-tuned language model using a specified metric on a test dataset. You can adapt it for various evaluation metrics and datasets as needed for your specific natural language processing tasks.

In [None]:
from datasets import load_metric

# Assuming tokenized_test_dataset is available from your training script
def evaluate_model(model, tokenizer, test_dataset, metric="rouge"):
    model.eval()
    device = model.device

    # Load the metric
    rouge = load_metric(metric)

    # Evaluate the model
    for example in test_dataset:
        input_ids = torch.tensor(example['input_ids']).unsqueeze(0).to(device)
        labels = torch.tensor(example['labels']).unsqueeze(0).to(device)

        # Generate output
        with torch.no_grad():
            output = model.generate(input_ids, max_length=200)

        # Decode and add to metric
        decoded_pred = tokenizer.decode(output[0], skip_special_tokens=True)
        decoded_label = tokenizer.decode(labels[0], skip_special_tokens=True)
        rouge.add(prediction=decoded_pred, reference=decoded_label)

    # Compute the metric
    final_scores = rouge.compute()
    return final_scores

# Evaluate the model
evaluation_results = evaluate_model(original_model, tokenizer, tokenized_test_dataset)
print(evaluation_results)



{'rouge1': AggregateScore(low=Score(precision=0.35048826173826175, recall=0.23432155563734516, fmeasure=0.2614372469635628), mid=Score(precision=0.5414835164835166, recall=0.30082991720922764, fmeasure=0.36345614007201077), high=Score(precision=0.7398751248751247, recall=0.38196455163106696, fmeasure=0.4789081455074896)), 'rouge2': AggregateScore(low=Score(precision=0.19461538461538486, recall=0.06741627848419993, fmeasure=0.10078431372549022), mid=Score(precision=0.4256410256410256, recall=0.16887086841432605, fmeasure=0.2349154391707583), high=Score(precision=0.6692307692307692, recall=0.2832616282560292, fmeasure=0.3874766535091942)), 'rougeL': AggregateScore(low=Score(precision=0.33841158841158847, recall=0.22109480431848866, fmeasure=0.25137326361486395), mid=Score(precision=0.5336788211788213, recall=0.29134070708480875, fmeasure=0.3570428416936563), high=Score(precision=0.7438789335664335, recall=0.3792223664115678, fmeasure=0.47749970416918347)), 'rougeLsum': AggregateScore(low