<a href="https://colab.research.google.com/github/imranahmed123/DataScience-AI-ML/blob/main/M5_NB_MiniProject_1_Medical_Q%26A_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* finetune a GPT-2 language model for medical question-answering

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information, such as the question type, the question *focus*, its UMLS(Unified Medical Language System) details like - Concept Unique Identifier(*CUI*) and Semantic *Type* and *Group*.

To know more about this data's collection, and construction method, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data is extracted and is in CSV format with below features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Part-A: Grading = 10 Points

## Information

Healthcare professionals often have to refer to medical literature and documents while seeking answers to medical queries. Medical databases or search engines are powerful resources of upto date medical knowledge. However, the existing documentation is large and makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and informative retrieval engines is that these systems return a list of documents rather than answers. Instead, healthcare professionals can use question answering systems to retrieve short sentences or paragraphs in response to medical queries. Such systems have the biggest advantage of generating answers and providing hints in a few seconds.

### Problem Statement

Fine-tune gpt2 model on medical-question-answering-dataset for performing response generation for medical queries.

Please refer to ***M6 Assignment-1 Fine-tune GPT2*** to get familiar with how to load pre-trained gpt2 tokenizer and model.

### Import required packages

In [1]:
!pip -q install -U accelerate
!pip -q install -U transformers
!pip -q install torch

In [2]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [3]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv
MedQuAD.csv.1
MedQuAD.csv.2


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [4]:
df = pd.read_csv("MedQuAD.csv")
df.shape

(16412, 6)

In [5]:
df.head()

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
0,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is (are) Adult Acute Lymphoblastic Leukem...,Key Points - Adult acute lymphoblastic leukemi...
1,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the symptoms of Adult Acute Lymphobla...,"Signs and symptoms of adult ALL include fever,..."
2,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,How to diagnose Adult Acute Lymphoblastic Leuk...,Tests that examine the blood and bone marrow a...
3,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is the outlook for Adult Acute Lymphoblas...,Certain factors affect prognosis (chance of re...
4,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,Who is at risk for Adult Acute Lymphoblastic L...,Previous chemotherapy and exposure to radiatio...


### Pre-processing and EDA

**Exercise 2: Perform below operations on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [None]:
# YOUR CODE HERE

In [None]:
# Drop missing values
# YOUR CODE HERE


In [6]:
# Handle missing values
# Drop rows with missing values in 'Question' and 'Answer' columns
df_cleaned = df.dropna(subset=['Question', 'Answer'])

# Check if any missing values remain
print(df_cleaned.isnull().sum())


Focus             14
CUI              565
SemanticType     597
SemanticGroup    565
Question           0
Answer             0
dtype: int64


- **Remove duplicates from data considering `Question` and `Answer` columns**

In [None]:
# Check duplicates
# YOUR CODE HERE

In [None]:
# Drop duplicates
# YOUR CODE HERE

In [None]:
# Check duplicates
# YOUR CODE HERE

In [7]:
# Check for duplicates based on 'Question' and 'Answer'
duplicates_before = df.duplicated(subset=['Question', 'Answer']).sum()
print(f"Number of duplicates before removal: {duplicates_before}")

# Drop duplicates based on 'Question' and 'Answer'
df_cleaned = df.drop_duplicates(subset=['Question', 'Answer'])

# Check for duplicates again after removal
duplicates_after = df_cleaned.duplicated(subset=['Question', 'Answer']).sum()
print(f"Number of duplicates after removal: {duplicates_after}")


Number of duplicates before removal: 48
Number of duplicates after removal: 0


**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [1 Mark]**

In [None]:
# YOUR CODE HERE

In [None]:
# Top 100 Focus categories names
# YOUR CODE HERE

In [8]:
# Group by the 'Focus' column and count the number of records for each category
focus_category_counts = df_cleaned['Focus'].value_counts()

# Display the top 100 'Focus' categories along with their counts
top_100_focus_categories = focus_category_counts.head(100)

# Print the top 100 categories
print(top_100_focus_categories)


Focus
Breast Cancer                                     53
Prostate Cancer                                   43
Stroke                                            35
Skin Cancer                                       34
Alzheimer's Disease                               30
                                                  ..
Poland syndrome                                   11
Opitz G/BBB syndrome                              11
Polycythemia Vera                                 11
Diabetic Kidney Disease                           10
What I need to know about Gestational Diabetes    10
Name: count, Length: 100, dtype: int64


### Create Training and Validation set

**Exercise 4: Create training and validation set [2 Marks]**

- Consider 4 samples per `Focus` category, for each top 100 categories, from the dataset (It will give 400 samples for training)

- Consider 1 sample per `Focus` category (different from training set), for each top 100 categories, from the dataset (It will give 100 samples for validation)

In [None]:
# YOUR CODE HERE

In [9]:
# Create a dictionary to store the samples for training and validation
training_samples = []
validation_samples = []

# Loop through the top 100 focus categories
for category in top_100_focus_categories.index:
    # Filter rows belonging to the current category
    category_rows = df_cleaned[df_cleaned['Focus'] == category]

    # If there are more than 5 rows for this category, we can select 4 for training and 1 for validation
    if len(category_rows) >= 5:
        # Randomly sample 4 rows for training
        train_sample = category_rows.sample(n=4, random_state=42)

        # Remove the sampled rows from the remaining category rows
        remaining_rows = category_rows.drop(train_sample.index)

        # Randomly select 1 row for validation from the remaining rows
        val_sample = remaining_rows.sample(n=1, random_state=42)

        # Append these samples to the respective lists
        training_samples.append(train_sample)
        validation_samples.append(val_sample)

# Concatenate the training and validation samples into separate DataFrames
training_set = pd.concat(training_samples)
validation_set = pd.concat(validation_samples)

# Verify the sizes of training and validation sets
print(f"Training set size: {training_set.shape}")
print(f"Validation set size: {validation_set.shape}")


Training set size: (400, 6)
Validation set size: (100, 6)


### Pre-process `Question` and `Answer` text

**Exercise 5: Perform below tasks: [1.5 Marks]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [None]:
# YOUR CODE HERE

In [10]:
# Function to combine 'Question' and 'Answer' columns for a given dataset
def combine_question_answer(df):
    combined_text = df.apply(lambda row: f"<question>{row['Question']}<answer>{row['Answer']}", axis=1)
    # Join all the rows using '\n' to create a single string
    return '\n'.join(combined_text)

# Combine Question and Answer for training data
train_sequence = combine_question_answer(training_set)

# Combine Question and Answer for validation data
val_sequence = combine_question_answer(validation_set)

# Save the combined sequences as text files
# Since this is for Google Colab, we will save the files locally
with open('training_data.txt', 'w') as train_file:
    train_file.write(train_sequence)

with open('validation_data.txt', 'w') as val_file:
    val_file.write(val_sequence)

# Output file paths (for Google Colab use, these can be downloaded)
print("Training data saved as: training_data.txt")
print("Validation data saved as: validation_data.txt")


Training data saved as: training_data.txt
Validation data saved as: validation_data.txt


- **Join the combined text using '\n' into a single string for training and validation separately**

In [None]:
# YOUR CODE HERE

In [11]:
# Function to combine 'Question' and 'Answer' into the desired format
def combine_question_answer(df):
    combined_text = df.apply(lambda row: f"<question>{row['Question']}<answer>{row['Answer']}", axis=1)
    # Join all combined rows into a single string, separated by '\n'
    return '\n'.join(combined_text)

# Combine Question and Answer for the training data
train_sequence = combine_question_answer(training_set)

# Combine Question and Answer for the validation data
val_sequence = combine_question_answer(validation_set)




- **Save the training and validation strings as text files**

In [None]:
# YOUR CODE HERE

In [12]:
# Save the combined sequences as text files in Google Colab
with open('training_data_sequence.txt', 'w') as train_file:
    train_file.write(train_sequence)

with open('validation_data_sequence.txt', 'w') as val_file:
    val_file.write(val_sequence)

# Output to indicate where the files are saved in Google Colab
print("Training data saved as: training_data_sequence.txt")
print("Validation data saved as: validation_data_sequence.txt")

Training data saved as: training_data_sequence.txt
Validation data saved as: validation_data_sequence.txt


In [13]:
from google.colab import files
files.download('training_data_sequence.txt')
files.download('validation_data_sequence.txt')

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

**Exercise 6: Load pre-trained GPT2Tokenizer [0.5 Mark]**

- Use checkpoint = "gpt2"

In [None]:
# YOUR CODE HERE

In [14]:
# Import the necessary module from the transformers library
from transformers import GPT2Tokenizer

# Load the pre-trained GPT2 tokenizer
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)

# Print a message to confirm the tokenizer has been loaded
print(f"GPT-2 tokenizer loaded from checkpoint: {checkpoint}")


GPT-2 tokenizer loaded from checkpoint: gpt2


**Exercise 7: Tokenize train and validation data and form TextDataset objects [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [None]:
# YOUR CODE HERE

In [15]:
# Set the pad_token to eos_token (end-of-sequence) for padding purposes
tokenizer.pad_token = tokenizer.eos_token

# Function to load and tokenize text data
def load_and_tokenize_data(file_path, tokenizer, block_size=512):
    # Load the text data from the file
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Tokenize the text data using the pre-trained tokenizer
    tokenized_data = tokenizer(text, return_tensors="pt", max_length=block_size, truncation=True, padding=True)

    return tokenized_data

# Paths to the saved training and validation data
train_file_path = "training_data_sequence.txt"
val_file_path = "validation_data_sequence.txt"

# Load and tokenize training and validation data
train_data = load_and_tokenize_data(train_file_path, tokenizer)
val_data = load_and_tokenize_data(val_file_path, tokenizer)

# Print the size of the datasets to confirm
print(f"Training data tokenized: {len(train_data['input_ids'])} samples")
print(f"Validation data tokenized: {len(val_data['input_ids'])} samples")

# Form TextDataset objects (using the tokenized inputs)
train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_file_path, block_size=512)
val_dataset = TextDataset(tokenizer=tokenizer, file_path=val_file_path, block_size=512)

# Print to confirm dataset creation
print("Training and validation datasets created successfully.")


Training data tokenized: 1 samples
Validation data tokenized: 1 samples
Training and validation datasets created successfully.


**Exercise 8: Create a DataCollator object [0.5 Mark]**

In [None]:
# YOUR CODE HERE

In [16]:
# Import necessary module from transformers
from transformers import DataCollatorForLanguageModeling

# Create a DataCollator object for language modeling with GPT-2
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,  # Use the pre-trained GPT-2 tokenizer
    mlm=False  # Set to False because GPT-2 is not trained with masked language modeling
)

# Print a message to confirm the DataCollator creation
print("DataCollatorForLanguageModeling created successfully.")


DataCollatorForLanguageModeling created successfully.


**Exercise 9: Load pre-trained GPT2LMHeadModel [0.5 Mark]**

In [None]:
# YOUR CODE HERE

In [17]:
# Import the GPT2LMHeadModel from the transformers library
from transformers import GPT2LMHeadModel

# Load the pre-trained GPT2LMHeadModel using the checkpoint "gpt2"
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Print a message to confirm the model has been loaded successfully
print("GPT2LMHeadModel loaded successfully.")


GPT2LMHeadModel loaded successfully.


**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [None]:
# Set up the training arguments

# YOUR CODE HERE

In [None]:
# Train the model
# YOUR CODE HERE

# Save the model
# YOUR CODE HERE

# Save the tokenizer
# YOUR CODE HERE

In [25]:
# crashing for all available RAM memory
# Import necessary modules
#Try 1:
from transformers import Trainer, TrainingArguments

# Set up the optimized training arguments
training_args = TrainingArguments(
    output_dir="gpt2_fine_tuned",  # Output directory
    overwrite_output_dir=True,              # Overwrite the output directory
    num_train_epochs=5,                     # Reduced number of epochs
    per_device_train_batch_size=8,          # Increased batch size (if memory allows)
    per_device_eval_batch_size=8,           # Increased eval batch size
    warmup_steps=20,                        # Reduced number of warmup steps
    weight_decay=0.01,                      # Weight decay
    logging_dir="logs",                     # Directory for storing logs
    logging_steps=100,                      # Log every 100 steps (less frequent)
    evaluation_strategy="epoch",            # Evaluate at the end of each epoch
    save_strategy="epoch",                  # Save at the end of each epoch
    fp16=True,                              # Use mixed precision for faster training
    gradient_accumulation_steps=4,          # Accumulate gradients over 4 batches
)

# Create a Trainer object
trainer = Trainer(
    model=model,                            # The GPT-2 model
    args=training_args,                     # Optimized training arguments
    train_dataset=train_dataset,            # Training dataset
    eval_dataset=val_dataset,               # Validation dataset
    data_collator=data_collator,            # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Save the model
trainer.save_model("gpt2_fine_tuned")  # Save the fine-tuned model

# Save the tokenizer
tokenizer.save_pretrained("gpt2_fine_tuned")  # Save the tokenizer

# Print a message confirming that the model and tokenizer were saved
print("Model and tokenizer saved successfully at gpt2_fine_tuned")


KeyboardInterrupt: 

In [27]:
import pandas as pd

# Load the data from your dataset (CSV, TXT, etc.)
# This example assumes you have a CSV file with 'text' column
df = pd.read_csv("MedQuAD.csv")

# Select only 100 samples for training and 20 for validation
train_df = df.sample(n=100, random_state=42)
val_df = df.sample(n=20, random_state=42)

# Save these smaller datasets into temporary files for quick execution
train_df.to_csv("train_subset.csv", index=False)
val_df.to_csv("val_subset.csv", index=False)

# Create TextDataset from the smaller files
from transformers import TextDataset, DataCollatorForLanguageModeling

# Load the reduced dataset into TextDataset
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="train_subset.csv",
    block_size=512
)

val_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="val_subset.csv",
    block_size=512
)

# You can proceed with training now using these smaller datasets


In [28]:
# Import necessary modules
# Try2:
from transformers import Trainer, TrainingArguments, GPT2LMHeadModel, GPT2Tokenizer

# Load a smaller version of the GPT-2 model (distilgpt2)
model = GPT2LMHeadModel.from_pretrained('distilgpt2')
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')

# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()

# Reduce dataset size for quicker execution
#train_dataset = train_dataset.select(range(100))  # Only 100 samples for training
#val_dataset = val_dataset.select(range(20))       # Only 20 samples for validation

# Set up the optimized training arguments
training_args = TrainingArguments(
    output_dir="gpt2_fine_tuned",  # Output directory
    overwrite_output_dir=True,              # Overwrite the output directory
    num_train_epochs=2,                     # Reduced number of epochs
    per_device_train_batch_size=2,          # Reduced batch size to 2
    per_device_eval_batch_size=2,           # Reduced eval batch size to 2
    warmup_steps=20,                        # Reduced number of warmup steps
    weight_decay=0.01,                      # Weight decay
    logging_dir="logs",                     # Directory for storing logs
    logging_steps=100,                      # Log every 100 steps
    evaluation_strategy="epoch",            # Evaluate at the end of each epoch
    save_strategy="epoch",                  # Save at the end of each epoch
    fp16=True,                              # Use mixed precision for faster training
    gradient_accumulation_steps=4,          # Accumulate gradients over 4 batches
)

# Create a Trainer object
trainer = Trainer(
    model=model,                            # The GPT-2 model
    args=training_args,                     # Optimized training arguments
    train_dataset=train_dataset,            # Training dataset
    eval_dataset=val_dataset,               # Validation dataset
    data_collator=data_collator,            # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Save the model
trainer.save_model("gpt2_fine_tuned")  # Save the fine-tuned model

# Save the tokenizer
tokenizer.save_pretrained("gpt2_fine_tuned")  # Save the tokenizer

# Print a message confirming that the model and tokenizer were saved
print("Model and tokenizer saved successfully at gpt2_fine_tuned")


Epoch,Training Loss,Validation Loss
1,No log,2.878483
2,No log,2.725487


Model and tokenizer saved successfully at gpt2_fine_tuned


**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [None]:
# YOUR CODE HERE

In [None]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE

In [None]:
# Response from model

# YOUR CODE HERE

In [None]:
# Testing with given prompt 1

# YOUR CODE HERE

In [None]:
# Testing with given prompt 2

# YOUR CODE HERE

In [29]:
# Import necessary module from transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Function to generate a response from the model
def generate_response(model, tokenizer, prompt, max_length=50, num_return_sequences=1):
    # Encode the prompt into input_ids for the model
    inputs = tokenizer.encode(prompt, return_tensors="pt")

    # Generate output using the fine-tuned model
    outputs = model.generate(
        inputs,
        max_length=max_length,          # Maximum length of the generated sequence
        num_return_sequences=num_return_sequences,  # Number of responses to generate
        no_repeat_ngram_size=2,         # Avoid repetition of words
        do_sample=True,                 # Sampling to introduce variability in response
        top_k=50,                       # Number of highest probability words considered
        top_p=0.95,                     # Cumulative probability for token selection
        temperature=0.7                 # Controls randomness: lower is more conservative
    )

    # Decode the output to get the generated text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

# Load the fine-tuned model and tokenizer
# Assuming you have already fine-tuned and saved the model
model = GPT2LMHeadModel.from_pretrained("gpt2_fine_tuned")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2_fine_tuned")

# Testing with given prompt 1
prompt_1 = "What are the symptoms of diabetes?"
response_1 = generate_response(model, tokenizer, prompt_1)
print(f"Prompt 1: {prompt_1}\nResponse 1: {response_1}\n")

# Testing with given prompt 2
prompt_2 = "How is hypertension treated?"
response_2 = generate_response(model, tokenizer, prompt_2)
print(f"Prompt 2: {prompt_2}\nResponse 2: {response_2}\n")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Prompt 1: What are the symptoms of diabetes?
Response 1: What are the symptoms of diabetes? The term is often used to describe the common symptoms that cause diabetes. The following list of symptoms may be helpful to you.

Diabetes
People who have diabetes who are diagnosed with type 1 diabetes are more

Prompt 2: How is hypertension treated?
Response 2: How is hypertension treated?

In the following papers, the authors (in this paper) examined the occurrence of hypertension in the U.S. population (U. S.M.) from the 1970s through 1990s. In their study,



**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [1 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate response using the untuned model, pass it as a parameter to `generate_response()` function

- Test both models (fine-tuned and untuned) with below user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"

In [None]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 1

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 1

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 2

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 2

# YOUR CODE HERE

In [None]:
# Testing with finetuned model: prompt 3

# YOUR CODE HERE

In [None]:
# Testing with untuned model: prompt 3

# YOUR CODE HERE

Step 1: Load the pre-trained GPT-2 model (untuned)

In [30]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load a pre-trained GPT2 model (untuned, not fine-tuned on MedQuAD)
untuned_model = GPT2LMHeadModel.from_pretrained('gpt2')
untuned_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')


Step 2: Load the fine-tuned GPT-2 model

In [31]:
# Load the fine-tuned model from the previous steps (MedQuAD fine-tuned GPT2)
finetuned_model = GPT2LMHeadModel.from_pretrained('gpt2_fine_tuned')
finetuned_tokenizer = GPT2Tokenizer.from_pretrained('gpt2_fine_tuned')


Step 3: Generate responses using both models

In [32]:
def generate_response(model, tokenizer, prompt, max_length=50):
    # Encode the prompt into token IDs
    inputs = tokenizer.encode(prompt, return_tensors="pt")

    # Generate the output using the model
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,   # Prevents repetition of words
        do_sample=True,           # Sampling to generate diverse outputs
        top_k=50,                 # Consider top 50 tokens for next word
        top_p=0.95,               # Nucleus sampling to consider tokens until probability mass is 0.95
        temperature=0.7           # Control randomness, lower means less random
    )

    # Decode the generated token IDs back into human-readable text
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


Step 4: Test both models with the three input prompts

Prompt 1: "What precautions to take for a healthy life?"

In [33]:
# Fine-tuned model response
finetuned_response_1 = generate_response(finetuned_model, finetuned_tokenizer, "What precautions to take for a healthy life?")
print("Fine-tuned model response 1:", finetuned_response_1)

# Untuned model response
untuned_response_1 = generate_response(untuned_model, untuned_tokenizer, "What precautions to take for a healthy life?")
print("Untuned model response 1:", untuned_response_1)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Fine-tuned model response 1: What precautions to take for a healthy life?

The best advice for preventing a cancer is to never take a pill or an aspirin for about a week.
You should take the pill regularly and not take it every day. You should also take
Untuned model response 1: What precautions to take for a healthy life?

Do not use or consume alcohol or tobacco.
. Do not smoke or drink.


Prompt 2: "What to do after being diagnosed with cancer?"

In [34]:
# Fine-tuned model response
finetuned_response_2 = generate_response(finetuned_model, finetuned_tokenizer, "What to do after being diagnosed with cancer?")
print("Fine-tuned model response 2:", finetuned_response_2)

# Untuned model response
untuned_response_2 = generate_response(untuned_model, untuned_tokenizer, "What to do after being diagnosed with cancer?")
print("Untuned model response 2:", untuned_response_2)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Fine-tuned model response 2: What to do after being diagnosed with cancer? How much time do you have to spend with a cancer patient? The answer is 3-6 months.

It's a simple question. There's no doctor's recommendation for what to take. It
Untuned model response 2: What to do after being diagnosed with cancer?

Take your time. Get it right when you can.
,
. See you later!
"Treatment of cancer is not based on treatment," said Dr. J.H. H


Prompt 3: "What to do when feeling sick?"

In [35]:
# Fine-tuned model response
finetuned_response_3 = generate_response(finetuned_model, finetuned_tokenizer, "What to do when feeling sick?")
print("Fine-tuned model response 3:", finetuned_response_3)

# Untuned model response
untuned_response_3 = generate_response(untuned_model, untuned_tokenizer, "What to do when feeling sick?")
print("Untuned model response 3:", untuned_response_3)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Fine-tuned model response 3: What to do when feeling sick? How do you feel?

Here's a list of all the steps that you can take to get rid of your symptoms.
1) Do your doctors have a recommendation for treatment? In order to learn about
Untuned model response 3: What to do when feeling sick?

It can take many forms. If you have severe stomach pain, you may need to go to the emergency room. However, it is usually best to avoid the doctor if you experience pain or aching throat
