# Advanced Certification Programme in AI and MLOps
## A programme by IISc and TalentSprint
### Mini-Project: Medical Q&A using GPT2 | Deployment on Hugging Face Spaces

## Learning Objectives

At the end of the experiment, you will be able to:

* perform data preprocessing, EDA and feature extraction on the Medical Q&A dataset
* load a pre-trained tokenizer
* fine-tune a GPT-2 language model for medical question-answering
* upload your fine-tuned model to the Hugging Face Model Hub
* deploy an application that uses the uploaded model on Hugging Face Spaces with Gradio

## Dataset Description

The dataset used in this project is the *Medical Question Answering Dataset* ([MedQuAD](https://github.com/abachaa/MedQuAD/tree/master)). It includes medical question-answer pairs along with additional information such as the question type, the question *focus*, and its UMLS (Unified Medical Language System) details: the Concept Unique Identifier (*CUI*), Semantic *Type*, and Semantic *Group*.

To learn more about how this data was collected and constructed, refer to this [paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4).

The data has been extracted into CSV format with the following features:

- **Focus**: the question focus
- **CUI**: concept unique identifier
- **SemanticType**
- **SemanticGroup**
- **Question**
- **Answer**

## Grading = 10 Points

## Information

Healthcare professionals often have to consult medical literature and documents when seeking answers to medical queries. Medical databases and search engines are powerful sources of up-to-date medical knowledge. However, the existing documentation is vast, which makes it difficult for professionals to retrieve answers quickly in a clinical setting. The problem with search engines and information retrieval systems is that they return a list of documents rather than answers. Question-answering systems instead let healthcare professionals retrieve short sentences or paragraphs in response to medical queries. Their main advantage is that they generate answers and provide hints within a few seconds.

### Problem Statement

Fine-tune the GPT-2 model on the Medical Question Answering Dataset (MedQuAD) to generate responses to medical queries. Then deploy the fine-tuned model on Hugging Face Spaces.

Please refer to ***M6 Assignment-1 Fine-tune GPT2*** and ***M6 AdditionalNB Fine-tune GPT2 for TextClassification*** to get familiar with how to load pre-trained gpt2 tokenizer and model.

Please refer to ***The demo session held on 14 Sep - Hugging Face Spaces Deployment*** to get familiar with how to do deployment using Hugging Face Spaces.

### Installing Dependencies

In [1]:
%%capture
!pip -q uninstall pyarrow -y
!pip -q install pyarrow==15.0.2
!pip -q install datasets
!pip -q install accelerate
!pip -q install transformers


### <font color="#990000">Restart Session/Runtime</font>

### Import required packages

In [20]:
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

import warnings
warnings.filterwarnings('ignore')

In [2]:
#@title Download the dataset
!wget -q https://cdn.iisc.talentsprint.com/AIandMLOps/MiniProjects/Datasets/MedQuAD.csv
!ls | grep ".csv"

MedQuAD.csv


**Exercise 1: Read the MedQuAD.csv dataset**

**Hint:** pd.read_csv()

In [23]:
# pd.set_option('display.max_colwidth', None)
# pd.set_option('display.max_rows', None)
# pd.reset_option('display.max_colwidth')
# pd.reset_option('display.max_rows')

In [50]:
# YOUR CODE HERE
data = pd.read_csv('MedQuAD.csv')

In [51]:
data.head(5)

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
0,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is (are) Adult Acute Lymphoblastic Leukem...,Key Points - Adult acute lymphoblastic leukemi...
1,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What are the symptoms of Adult Acute Lymphobla...,"Signs and symptoms of adult ALL include fever,..."
2,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,How to diagnose Adult Acute Lymphoblastic Leuk...,Tests that examine the blood and bone marrow a...
3,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,What is the outlook for Adult Acute Lymphoblas...,Certain factors affect prognosis (chance of re...
4,Adult Acute Lymphoblastic Leukemia,C0751606,T191,Disorders,Who is at risk for Adult Acute Lymphoblastic L...,Previous chemotherapy and exposure to radiatio...


### Pre-processing and EDA

**Exercise 2: Perform the operations below on the dataset [0.5 Mark]**

- Handle missing values
- Remove duplicates from data considering `Question` and `Answer` columns

- **Handle missing values**

In [52]:
data.isna().sum()

Unnamed: 0,0
Focus,14
CUI,565
SemanticType,597
SemanticGroup,565
Question,0
Answer,5


In [53]:
# YOUR CODE HERE
data.dropna(subset =['Answer'], inplace=True)

In [54]:
data.isna().sum()

Unnamed: 0,0
Focus,14
CUI,565
SemanticType,597
SemanticGroup,565
Question,0
Answer,0


- **Remove duplicates from data considering `Question` and `Answer` columns**

In [55]:
print(data.shape)

(16407, 6)


In [56]:
# YOUR CODE HERE
data = data.drop_duplicates(['Question','Answer'])

In [57]:
print(data.shape)

(16359, 6)


**Exercise 3: Display the category name, and the number of records belonging to top 100 categories of `Focus` column [0.5 Mark]**

In [58]:
# Total categories in Focus column
# YOUR CODE HERE
data['Focus'].nunique()

5125

In [59]:
# Displaying the distinct categories of Focus column and the number of records belonging to each category
# (Top 100 only)

# YOUR CODE HERE
data['Focus'].value_counts().head(100)

Focus,count
Breast Cancer,53
Prostate Cancer,43
Stroke,35
Skin Cancer,34
Alzheimer's Disease,30
...,...
MECP2 duplication syndrome,11
Holt-Oram syndrome,11
Ehlers-Danlos syndrome,11
Hearing Loss,10


In [60]:
# Top 100 Focus categories names

# YOUR CODE HERE
test100 = data['Focus'].value_counts().head(100).reset_index()

In [61]:
test100['Focus'].unique()

array(['Breast Cancer', 'Prostate Cancer', 'Stroke', 'Skin Cancer',
       "Alzheimer's Disease", 'Lung Cancer', 'Colorectal Cancer',
       'High Blood Cholesterol', 'Heart Attack', 'Heart Failure',
       'High Blood Pressure', "Parkinson's Disease", 'Leukemia',
       'Osteoporosis', 'Shingles', 'Age-related Macular Degeneration',
       'Diabetes', 'Hemochromatosis', 'Diabetic Retinopathy', 'Psoriasis',
       'Gum (Periodontal) Disease', 'Kidney Disease', 'COPD', 'Cataract',
       'Balance Problems', 'Dry Mouth',
       'Prescription and Illicit Drug Abuse',
       'Medicare and Continuing Care', 'Gout', 'Glaucoma',
       'Wilson Disease', 'Problems with Taste', 'Neuroblastoma',
       'Rheumatoid Arthritis', 'Short Bowel Syndrome', 'Osteoarthritis',
       'Narcolepsy', 'Endometrial Cancer', 'Pituitary Tumors', 'Dry Eye',
       'Kidney Dysplasia', 'Anxiety Disorders',
       'Peripheral Arterial Disease (P.A.D.)',
       'Urinary Tract Infections in Children', 'Surviving Cance

In [62]:
df = data[data['Focus']=='Laron syndrome']

In [63]:
df[:4]

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
3498,Laron syndrome,C0271568,T047,Disorders,What is (are) Laron syndrome ?,Laron syndrome is a condition that occurs when...
3499,Laron syndrome,C0271568,T047,Disorders,What are the symptoms of Laron syndrome ?,What are the signs and symptoms of Laron syndr...
3500,Laron syndrome,C0271568,T047,Disorders,What causes Laron syndrome ?,What causes Laron syndrome? Laron syndrome is ...
3501,Laron syndrome,C0271568,T047,Disorders,Is Laron syndrome inherited ?,Is Laron syndrome inherited? Most cases of Lar...


In [64]:
df[4:5]

Unnamed: 0,Focus,CUI,SemanticType,SemanticGroup,Question,Answer
3502,Laron syndrome,C0271568,T047,Disorders,How to diagnose Laron syndrome ?,How is Laron syndrome diagnosed? A diagnosis o...


### Create Training and Validation set

**Exercise 4: Create training and validation set [1 Mark]**

- Take 4 samples per `Focus` category for each of the top 100 categories in the dataset (this gives 400 samples for training)

- Take 1 sample per `Focus` category, different from those in the training set, for each of the top 100 categories (this gives 100 samples for validation)

In [66]:
# YOUR CODE HERE
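train = pd.DataFrame()
validation = pd.DataFrame()

# For each of the top 100 Focus categories, draw 5 distinct rows:
# the first 4 go to training and the 5th goes to validation
for i in test100['Focus']:
  df = data[data['Focus']==i]
  df_sample = df.sample(n=5, random_state=42)   # sampled without replacement, so the rows do not overlap
  df1 = df_sample[:4]
  df2 = df_sample[4:5]
  train = pd.concat([train,df1], ignore_index = True)
  validation = pd.concat([validation,df2], ignore_index = True)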

In [67]:
train.shape

(400, 6)

In [68]:
validation.shape

(100, 6)

### Pre-process `Question` and `Answer` text

**Exercise 5: Perform below tasks:  [1 Mark]**

- Combine `Question` and `Answer` for train and validation data as shown below:
    - sequence = *'\<question\>' + question-text + '\<answer\>' + answer-text + '\<end\>'*

- Join the combined text using '\n' into a single string for training and validation separately

- Save the training and validation strings as separate text files

- **Combine Question and Answer for train and val data**

In [69]:
# Combine Questions and Answers for train and val data
## sequence = '<question>' + question + '<answer>' + answer + '<end>'

# YOUR CODE HERE
train['combined'] = '<question>'+ train['Question'] + '<answer>' + train['Answer']+'<end>'
validation['combined'] = '<question>'+ validation['Question'] + '<answer>' + validation['Answer'] + '<end>'

In [71]:
train['combined'][1]

'<question>What is (are) Breast Cancer ?<answer>A mammogram can often detect breast changes in women who have no signs of breast cancer. Often, it can find a breast lump before it can be felt. If the results indicate that cancer might be present, your doctor will advise you to have a follow-up test called a biopsy.<end>'

- **Join the combined text using '\n' into a single string for training and validation separately**

In [72]:
# Train and Validation text for all Q&As

# YOUR CODE HERE
# One combined sequence per line, with a trailing newline after the last one
train_seq = '\n'.join(train['combined']) + '\n'
val_seq = '\n'.join(validation['combined']) + '\n'

- **Save the training and validation strings as text files**

In [73]:
# Save the training and validation data as text files

# YOUR CODE HERE
with open("train.txt", "w") as file:
    file.write(train_seq)

with open("val.txt", "w") as file:
    file.write(val_seq)

**Exercise 6: Load pre-trained GPT2Tokenizer**

- Use checkpoint = "gpt2"

In [74]:
# Set up the tokenizer
# YOUR CODE HERE
checkpoint = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)    # larger checkpoints to try: gpt2-medium, gpt2-large, gpt2-xl

# GPT-2 has no dedicated padding token, so reuse unk_token as the pad token
tokenizer.pad_token = tokenizer.unk_token

**Exercise 7: Tokenize train and validation data [0.5 Mark]**

- Use the loaded pre-trained tokenizer
- Use training and validation data saved in text files

In [75]:
# YOUR CODE HERE
from datasets import load_dataset

train_file_path = 'train.txt'
val_file_path = 'val.txt'

dataset = load_dataset("text", data_files={"train": train_file_path,
                                           "validation": val_file_path})

In [76]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 400
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 100
    })
})
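
The row counts above match the 400 and 100 sequences written earlier because, by default, the `text` loader creates one example per line of the file. An answer containing an embedded newline would therefore be split across several rows. The sketch below shows one way to guard against that by collapsing whitespace before writing the files; `to_single_line` and `build_sequence` are hypothetical helper names and the Q&A pairs are made up for illustration.

```python
import re

def to_single_line(text: str) -> str:
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

def build_sequence(question: str, answer: str) -> str:
    """Format one Q&A pair in the layout used for fine-tuning in this notebook."""
    return ("<question>" + to_single_line(question)
            + "<answer>" + to_single_line(answer) + "<end>")

# Toy pairs (made up for illustration); the second answer contains a newline
pairs = [
    ("What is (are) X ?", "X is a condition."),
    ("What causes X ?", "Cause one.\nCause two."),
]
corpus = "\n".join(build_sequence(q, a) for q, a in pairs) + "\n"

# One line per pair, so a line-based loader would see exactly len(pairs) examples
lines = corpus.splitlines()
print(len(lines))
```

Applying the same normalization to `train['Question']` and `train['Answer']` before building the `combined` column would keep the number of dataset rows equal to the number of Q&A pairs.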

In [77]:
block_size = 256     # max tokens in an input sample

def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', truncation=True, max_length=block_size, return_tensors='pt')

tokenized_datasets = dataset.map(tokenize_function, batched=True)

**Exercise 8: Create a DataCollator object**

In [78]:
# Create a Data collator object
# YOUR CODE HERE
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="pt")

**Exercise 9: Load pre-trained GPT2LMHeadModel**

In [79]:
# Set up the model
# YOUR CODE HERE
model = GPT2LMHeadModel.from_pretrained(checkpoint)    # larger checkpoints to try: gpt2-medium, gpt2-large, gpt2-xl

**Exercise 10: Fine-tune GPT2 Model [1 Mark]**

- Specify training arguments and create a TrainingArguments object (Use 30 epochs)

- Train a GPT-2 model using the provided training arguments

- Save the resulting trained model and tokenizer to a specified output directory

In [80]:
# Set up the training arguments
# YOUR CODE HERE
model_output_path = "/content/gpt_model"

training_args = TrainingArguments(
    output_dir = model_output_path,  # Directory to save model checkpoints and outputs
    overwrite_output_dir = True,
    per_device_train_batch_size = 4,  # Batch size per device (GPU/CPU) during training
    per_device_eval_batch_size = 4,   # Batch size for evaluation
    num_train_epochs = 30,
    save_steps = 1_000, # Save checkpoint every 1000 steps
    save_total_limit = 2, # Limit the total number of checkpoints to 2 (older ones will be deleted)
    logging_dir = './logs', # Directory for storing logs
    )

In [82]:

trainer = Trainer(
    model = model, # The model to be trained
    args = training_args,
    data_collator = data_collator,  # Function to collect and prepare batches of data
    train_dataset = tokenized_datasets["train"], # Training dataset (after tokenization)
    eval_dataset = tokenized_datasets["validation"], # Validation dataset (after tokenization)
    )

# Trainer updates the model object passed to it in place, so the trained model can be used for inference right away
trainer.train()

Step,Training Loss
500,1.9941
1000,1.2779
1500,0.8426
2000,0.5737
2500,0.4291
3000,0.365


TrainOutput(global_step=3000, training_loss=0.9137444508870443, metrics={'train_runtime': 931.0632, 'train_samples_per_second': 12.888, 'train_steps_per_second': 3.222, 'total_flos': 1567752192000000.0, 'train_loss': 0.9137444508870443, 'epoch': 30.0})
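
The `global_step=3000` and `train_samples_per_second` values reported above can be reproduced from the training settings. The check below is a rough sketch that assumes a single device and the default gradient accumulation of 1, neither of which is stated explicitly in the notebook.

```python
import math

num_train_samples = 400        # rows in the training split
per_device_batch_size = 4      # per_device_train_batch_size
num_epochs = 30                # num_train_epochs
num_devices = 1                # assumption: a single GPU (or CPU)
grad_accum_steps = 1           # assumption: default gradient_accumulation_steps

# Optimizer steps per epoch and over the whole run
effective_batch = per_device_batch_size * num_devices * grad_accum_steps
steps_per_epoch = math.ceil(num_train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

# Throughput implied by the 'train_runtime' reported by trainer.train()
train_runtime_s = 931.0632
samples_per_second = num_train_samples * num_epochs / train_runtime_s

print(steps_per_epoch, total_steps)      # 100 3000
print(round(samples_per_second, 3))      # 12.888
```

With `save_steps = 1_000`, checkpoints fall at steps 1000, 2000, and 3000, and `save_total_limit = 2` keeps only the two most recent of them.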

In [83]:
# Save the model
# YOUR CODE HERE
saved_model_path = "/content/finetuned_gpt2_model"
trainer.save_model(saved_model_path)

# Save the tokenizer
# YOUR CODE HERE
tokenizer.save_pretrained(saved_model_path)

('/content/finetuned_gpt2_model/tokenizer_config.json',
 '/content/finetuned_gpt2_model/special_tokens_map.json',
 '/content/finetuned_gpt2_model/vocab.json',
 '/content/finetuned_gpt2_model/merges.txt',
 '/content/finetuned_gpt2_model/added_tokens.json')

**Exercise 11: Test Model with user input prompts [1 Mark]**

- Create `generate_response()` function that takes a trained *model*, *tokenizer*, and a *prompt* string as input and generates a response using the GPT-2 model

- Test it with some user input prompts

In [84]:
def generate_response(model, tokenizer, prompt, max_length=200):

    # YOUR CODE HERE
    input_ids = tokenizer.encode(prompt, return_tensors="pt")      # 'pt' for returning pytorch tensor

    # Check the device of the model
    device = next(model.parameters()).device

    # Move input_ids to the same device as the model
    input_ids = input_ids.to(device)

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

In [85]:
# Load the fine-tuned model and tokenizer

# YOUR CODE HERE

my_model = GPT2LMHeadModel.from_pretrained(saved_model_path)
my_tokenizer = GPT2Tokenizer.from_pretrained(saved_model_path)

In [86]:
# Testing with a sample prompt 1

prompt = 'what is cancer?'  # YOUR CODE HERE
response = generate_response(my_model, my_tokenizer, prompt)
print("Generated response:")
response

Generated response:


"what is cancer? Cancer is a disease of the living -- it destroys and destroys cells. It is a slow, painful process that often destroys the lining of your eye or the bone. It can take up to 10 years for cancer to break out of your system. It is most common in older people. It is more common in men than in women. It is more common in older people's families. It is more common in older adults than in children. It is more common in people with a family history of the disease. It is more common in people who have never had cancer. It is more common in people who have had cancer. It appears gradually over time as the disease gets worse. The signs and symptoms of cancer vary, but can include headaches, fatigue, loss of appetite, abdominal swelling, abdominal pain, constipation, and skin discoloration. There are three types of breast-conserving ophthalmologists: Cacchiroporosis, Autoantibodies,"

In [87]:
# Testing with a sample prompt 2

prompt = 'What is to be done after getting cold?'  # YOUR CODE HERE
response = generate_response(my_model, my_tokenizer, prompt)
# YOUR CODE HERE
response

'What is to be done after getting cold? - After you have had your cold, you should drink plenty of water, cold milk, or hot milk before going to bed. You should also drink plenty of milk, hot or cold, to help prevent digestive problems. You should also drink plenty of water, cold milk, or hot milk to help prevent intestinal problems. - After you have had your cold, you should drink plenty of water, cold milk, or hot milk before going to bed. You should also drink plenty of milk, hot or cold, to help prevent digestive problems. You should also drink plenty of water, cold milk, or hot milk to help prevent intestinal problems. - After you have had your cold, you should drink plenty of water, cold milk, or hot milk before going to bed. You should also drink plenty of milk, hot or cold, to help prevent intestinal problems. - After you have had your cold, you should drink plenty of water, cold milk, or'

**Exercise 12: Compare the performance of a *GPT2 model* with the *GPT2 model fine-tuned* on MedQuAD data [0.5 Mark]**

- Load another pre-trained GPT2LMHeadModel and do not fine-tune it

- To generate a response with the untuned model, pass it as a parameter to the `generate_response()` function

- Test both models (fine-tuned and untuned) with the following user input prompts:

    - "What precautions to take for a healthy life?"
    - "What to do after being diagnosed with cancer?"
    - "What to do when feeling sick?"

In [98]:
# Load a pre-trained GPT2 model, do not finetune it with MedQuAD data

# YOUR CODE HERE
untuned_model = GPT2LMHeadModel.from_pretrained(checkpoint)    # larger checkpoints to try: gpt2-medium, gpt2-large, gpt2-xl

In [101]:
# Testing with finetuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(my_model, my_tokenizer, prompt)
response

'What precautions to take for a healthy life? Check with your doctor if you have any questions about taking certain medications. Ask your doctor about taking certain medications. Ask your doctor about talking with his or her family doctor before taking a medication. Ask your doctor about talking with his or her doctor during treatment for depression. Talk with people who have had serious depression and ask what steps you can take to reach your goals. Ask your doctor about talking with your family doctor before taking a medication. Talk with people who have had serious depression and ask what steps you can take to reach your goals. Ask your doctor about talking with your family doctor before taking a medication. Talk with people who have had serious depression and ask what steps you can take to reach your goals. Ask your doctor about talking with your family doctor before taking a medication. Talk with people who have had serious depression and ask what steps you can take to reach your 

In [102]:
# Testing with untuned model: prompt 1

prompt = "What precautions to take for a healthy life?"
response = generate_response(untuned_model, tokenizer, prompt)
response

"What precautions to take for a healthy life?\n\nThe following are some of the most common questions you'll hear from your doctor or nurse about your health.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause cancer?\n\nThe risks of taking a drug that can cause cancer are very high.\n\nWhat are the risks of taking a drug that can cause"

In [103]:
# Testing with finetuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(my_model, my_tokenizer, prompt)
response

'What to do after being diagnosed with cancer? - Talk with your doctor or other health care provider about what you should do after you notice a change in your symptoms. - Check with your doctor or other health care provider to make sure that the change is right for you. - Check with your doctor or other health care provider to make sure that the change is right for you. Make changes if you think your symptoms might be worsening. Ask your doctor or other health care provider what you can do after you notice a change in your symptoms. Make changes if you think your symptoms might be worsening. Ask your doctor or other health care provider what you can do after you notice a change in your symptoms. If you notice a change in your symptoms that are worsening, talk with your doctor or other health care provider about what steps you can take to get around the symptoms. If you notice a change in your symptoms that are bothering you, talk with your doctor or other health care provider about wh

In [95]:
# Testing with untuned model: prompt 2

prompt = "What to do after being diagnosed with cancer?"
response = generate_response(untuned_model, tokenizer, prompt)
response

"What to do after being diagnosed with cancer?\n\nThe first step is to get your doctor's approval for a treatment.\n\nIf you have a cancer diagnosis, you may need to get a second opinion.\n\nIf you have a cancer diagnosis, you may need to get a second opinion. If you have a cancer diagnosis, you may need to get a third opinion.\n\nIf you have a cancer diagnosis, you may need to get a third opinion. If you have a cancer diagnosis, you may need to get a fourth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fourth opinion. If you have a cancer diagnosis, you may need to get a fifth opinion.\n\nIf you have a cancer diagnosis, you may need to get a fifth opinion. If you have a cancer diagnosis, you may need to get a sixth opinion.\n\nIf you have a cancer diagnosis, you may need to get a sixth opinion. If you have"

In [104]:
# Testing with finetuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(my_model, my_tokenizer, prompt)
response



In [97]:
# Testing with untuned model: prompt 3

prompt = "What to do when feeling sick?"
response = generate_response(untuned_model, tokenizer, prompt)
response

"What to do when feeling sick?\n\nThe first thing you should do is to get your body to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick, you should take a few minutes to relax.\n\nIf you're feeling sick"
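
Several of the responses above loop over the same sentence. One rough way to put a number on this when comparing the two models is the fraction of distinct word n-grams in a response. The function below is a minimal sketch of that idea, not part of the original exercise; `distinct_ngram_ratio` is a hypothetical helper and the strings are toy examples rather than model outputs.

```python
def distinct_ngram_ratio(text: str, n: int = 2) -> float:
    """Fraction of word n-grams in `text` that are unique (1.0 means no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0  # text is shorter than n words
    return len(set(ngrams)) / len(ngrams)

# Toy strings (not model outputs) to show the behaviour
looping = "you should relax. " * 5
varied = "rest, drink fluids, and see a doctor if symptoms persist"

print(distinct_ngram_ratio(looping))   # low ratio: the same bigrams keep recurring
print(distinct_ngram_ratio(varied))    # every bigram occurs once
```

Passing each `response` string to this function gives a single score per model and prompt; a lower score indicates more repetition, though it says nothing about medical correctness.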

## Push your model to Hugging Face Model Hub

**Exercise 13: Follow the steps below to push your fine-tuned model to the Hugging Face Model Hub**

1. [Sign up](https://huggingface.co/join) for a Hugging Face account
2. Create an access token for your account and save it
3. Store your access token in the Hugging Face cache folder within Colab
4. Push your fine-tuned model and tokenizer to the Model Hub
5. Load the model back from the Hub and test it with user input prompts

* **Create an access token for your account**

    Once you have an account, to create an access token:
    
    - Go to your `Settings`, then click on the `Access Tokens` tab. Click on the `New token` button to create a new User Access Token.
    - Select `Write` as the token type and give your token a name
    - Click on `Create token`
    - Once the token is created, save it somewhere safe
    - When it is required later, reuse the saved token or create a new one

    To learn more about access tokens, see the [documentation](https://huggingface.co/docs/hub/security-tokens).

* **Store your access token in the Hugging Face cache folder within Colab**

    Once you have your User Access Token, run the following command to authenticate your identity to the Hub.
    - `!huggingface-cli login`
    - Paste your access token when prompted
    - Type **n** when prompted with `Add token as git credential? (Y/n)`

    For more details on login, see the [quick-start guide](https://huggingface.co/docs/huggingface_hub/quick-start#login).

In [105]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


* **Push your fine-tuned model and tokenizer to Model Hub [0.5 Mark]**

    - Use the `push_to_hub()` method of both your model and your tokenizer to push them to the Hub
    - Specify the name of the repository where the model and tokenizer will be pushed using the `repo_id` parameter
    - Push the model and tokenizer to the same repository

    - **Hint:**

        - Use `push_to_hub()` method of your model. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub).
        - Use `push_to_hub()` method of your tokenizer. For parameter details, refer [here](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.push_to_hub).
        - Access your pushed model at `https://huggingface.co/[YOUR-USER-NAME]/[YOUR-MODEL-REPO-NAME]/tree/main`

In [106]:
# Push model
# YOUR CODE HERE
my_repo = "medicalQnA-gpt2"
my_model.push_to_hub(repo_id=my_repo, commit_message="Upload fine-tuned model")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/498M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kpavan2004/medicalQnA-gpt2/commit/689e7f0c4f807a8f530464958e933b2eaecb74ec', commit_message='Upload fine-tuned model', commit_description='', oid='689e7f0c4f807a8f530464958e933b2eaecb74ec', pr_url=None, pr_revision=None, pr_num=None)

In [107]:
# Push tokenizer
# YOUR CODE HERE
my_tokenizer.push_to_hub(repo_id=my_repo, commit_message="Upload tokenizer used")

CommitInfo(commit_url='https://huggingface.co/kpavan2004/medicalQnA-gpt2/commit/a7efef8c70b13e0472280a5fbcec3927585438dd', commit_message='Upload tokenizer used', commit_description='', oid='a7efef8c70b13e0472280a5fbcec3927585438dd', pr_url=None, pr_revision=None, pr_num=None)

* **Load the model and tokenizer back from Hub and test it with user input prompts [0.5 Mark]**

    - In many cases, the architecture you want to use can be guessed from the name or the path of the pretrained model you are supplying to the `from_pretrained()` method. **AutoClasses** can be used to automatically retrieve the relevant model given the name/path to the pretrained weights/config/vocabulary.

    - Instantiating one of `AutoConfig`, `AutoModel`, and `AutoTokenizer` will directly create a class of the relevant architecture.

    - Because the GPT-2 model here has a language modeling head on top, use an auto class that also includes one. `AutoModelWithLMHead` serves this purpose, but it is deprecated in recent `transformers` releases in favour of `AutoModelForCausalLM` for causal language models such as GPT-2.

    - Specify the full path of your model repo, i.e. ***"YOUR-USER-NAME/YOUR-REPO-NAME"***, when calling the `from_pretrained()` method.

In [108]:
from transformers import AutoModelForCausalLM, AutoTokenizer

In [109]:
# Load your model from hub
username = "kpavan2004"      # change it to your HuggingFace username

my_checkpoint = username + '/' + my_repo       # eg. "yrajm1997/gita-text-generation-gpt2"
my_checkpoint

# AutoModelForCausalLM is the current replacement for the deprecated AutoModelWithLMHead
loaded_model = AutoModelForCausalLM.from_pretrained(my_checkpoint)

In [110]:
# Load your tokenizer from hub
loaded_tokenizer = AutoTokenizer.from_pretrained(my_checkpoint)

In [111]:
# Response from loaded model

prompt = "What is the outlook for Skin Cancer ?"
response = generate_response(loaded_model, loaded_tokenizer, prompt)
response

'What is the outlook for Skin Cancer?<answer>The outlook for skin cancer varies depending on the type of skin cancer. Most skin cancers are curable. Some may recur.<end>The highest rates of skin cancer are found in older adults. The remaining risk factors are high blood pressure, diabetes, and high cholesterol.<end>The risk factors for skin cancer include - cigarette, cigar, and pipe smoking. Cigarette smoking is the most common type of cancer among American men. It is more common in men younger than 40 years of age. - secondhand smoke. People who smoke secondhand smoke are at higher risk of developing skin cancer. - secondhand smoke. People who smoke secondhand smoke are at higher risk of developing thick, red, and maroon skin.<end>The risk factors for thick, red, and maroon skin are not known.<end>The risk factors for skin cancer include - secondhand smoke. People who smoke secondhand smoke are at higher risk'
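
The response above still contains the `<answer>` and `<end>` markers, most likely because they were written into the training text as ordinary strings rather than registered as special tokens, so `skip_special_tokens=True` leaves them in place. The sketch below shows one possible way to work with them: build the prompt in the same layout as the training sequences, then keep only the text between `<answer>` and the first `<end>`. Both `build_prompt` and `extract_answer` are hypothetical helpers, not part of the original notebook.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the markers used for the fine-tuning sequences."""
    return "<question>" + question.strip() + "<answer>"

def extract_answer(generated: str) -> str:
    """Return the text between the first '<answer>' and the next '<end>' marker."""
    _, found, rest = generated.partition("<answer>")
    if not found:
        rest = generated          # no marker found: fall back to the full text
    answer, _, _ = rest.partition("<end>")
    return answer.strip()

# Shortened copy of the response shown above
generated = ("What is the outlook for Skin Cancer?<answer>The outlook for skin cancer "
             "varies depending on the type of skin cancer. Most skin cancers are curable. "
             "Some may recur.<end>The highest rates of skin cancer are found in older adults.")

print(build_prompt("What is the outlook for Skin Cancer ?"))
print(extract_answer(generated))
```

In use, the string returned by `generate_response()` would be passed through `extract_answer()` before being displayed. Whether prompting with `build_prompt()` improves the answers is something to verify empirically.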

## Gradio Implementation

Gradio is an open-source Python library for quickly building easy-to-use, customizable UI components around an ML model, an API, or any arbitrary function in just a few lines of code. The GUI can be embedded directly in a Python notebook or shared with anyone through a link.

**Exercise 14: Create a Gradio app for your fine-tuned model pushed to the Hugging Face Model Hub [1 Mark]**

- Install and import the `gradio` library
- Create a function that uses your fine-tuned model for response generation
    - Use the model and tokenizer directly within the function; do not pass them as parameters
    - The function should take the input prompt text and the max response length as its input parameters
    - The function should return the generated response text
- Create input and output gradio elements
- Create a gradio interface object
- Launch the interface to generate the UI

In [112]:
!pip -q install gradio

In [118]:
import gradio as gr

In [126]:
# Function for response generation
import torch

def generate_query_response(prompt, max_length=200):
    """Generate a response to `prompt` using the model and tokenizer loaded from the Hub."""

    model = loaded_model
    tokenizer = loaded_tokenizer

    # Gradio textboxes pass strings; fall back to the default when the field is empty
    if max_length is None or not str(max_length).strip():
        max_length = 200

    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Move input_ids to the same device as the model
    device = next(model.parameters()).device
    input_ids = input_ids.to(device)

    # Attend to every prompt token and use EOS as the pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=int(max_length),
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


In [127]:
# Gradio elements

# Input from user
in_prompt = 'how to lead a healthy life?'
in_max_length = 200

# Output response
out_response = generate_query_response(in_prompt, in_max_length)
print(out_response)

how to lead a healthy life? The answers to these questions depend on which gene mutations you have, what stage of your disease you have, and what part of your body is affected. Learn more about how your body reacts to new mutations in the GHR gene.
question>What are the treatments for GM1 gangliosidosis?<answer>How might GM1 gangliosidosis be treated? Because the genetic material that causes GM1 gangliosidosis is abnormal, it is very difficult to diagnose. Fortunately, many genetic tests can diagnose this condition. These can include - a genetic test that identifies the gene responsible for the disorder - a biopsy to look for abnormalities in the blood and tissues - a biopsy to look for abnormalities in the liver and spleen - a biopsy to look for abnormalities in the brain and nervous system - a tomography (an image of the brain and spinal cord taken while a person is unconscious) to look for abnormalities in the blood


In [None]:
# Gradio interface to generate UI link
iface = gr.Interface(fn=generate_query_response,
                     inputs=[gr.Textbox(label="Enter your prompt"),
                             gr.Textbox(label="Enter max output length")],
                     outputs="textbox",
                     title="Medical QnA Bot using GPT2",
                     description="Medical QnA Bot using GPT2, trained using MedQuAD dataset",
                     allow_flagging='never')

# Launch the interface: share=True creates a public link, debug=True keeps the cell running to show errors and logs
iface.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://818e8f9dce1e1cf8d3.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


## Upload your Gradio application on Hugging Face Spaces

**Exercise 15: Upload your Gradio application on Hugging Face Spaces [2 Marks]**

1. Start a new Hugging Face Space by going to your profile and [clicking "New Space"](https://huggingface.co/new-space)

2. Provide details for your space:
    - Space name
    - License (eg. [MIT](https://opensource.org/licenses/MIT))
    - Space SDK (software development kit) (eg. `Gradio`)
    - Space hardware (CPU basic)
    - Choose whether your Space is public or private
    - Click "Create Space"

3. Use the ***Add files -> Create a new file*** option to add the files below:
    - `requirements.txt`: should list the dependencies needed to run your app, such as transformers, torch, and gradio
    - `app.py`: should contain the steps to
        - import the required packages
        - load your fine-tuned model and tokenizer from the Model Hub
        - define a function that uses your fine-tuned model for response generation
        - create input and output gradio elements
        - create a gradio interface object
        - launch the interface to generate the UI

4. Open the `App` tab of your Space to see the build progress (debug if the build fails)

5. Once the app has built successfully, test the application running on your Space with a user input prompt
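
The `app.py` described in step 3 could look roughly like the sketch below. It is an illustration under stated assumptions, not the reference solution: `MODEL_ID` is a placeholder to replace with your own repo path, and `parse_max_length`, `load_model_and_tokenizer` and `build_interface` are hypothetical helper names. The heavy imports are placed inside the functions so that the model is downloaded only when it is first needed.

```python
from functools import lru_cache

# Assumption: replace with your own "YOUR-USER-NAME/YOUR-REPO-NAME" checkpoint.
MODEL_ID = "YOUR-USER-NAME/YOUR-REPO-NAME"
DEFAULT_MAX_LENGTH = 200


def parse_max_length(value, default=DEFAULT_MAX_LENGTH):
    """Convert the textbox value to a positive int, falling back to `default`."""
    try:
        length = int(float(value))
    except (TypeError, ValueError, OverflowError):
        return default
    return length if length > 0 else default


@lru_cache(maxsize=1)
def load_model_and_tokenizer():
    """Load the fine-tuned model and tokenizer from the Model Hub once."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()
    return model, tokenizer


def generate_query_response(prompt, max_length=DEFAULT_MAX_LENGTH):
    """Generate a response for `prompt`, limited to `max_length` tokens in total."""
    model, tokenizer = load_model_and_tokenizer()
    # The tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_length=parse_max_length(max_length),
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def build_interface():
    """Create the Gradio interface object with two text inputs and one text output."""
    import gradio as gr

    return gr.Interface(
        fn=generate_query_response,
        inputs=[
            gr.Textbox(label="Enter your prompt"),
            gr.Textbox(label="Enter max output length", value=str(DEFAULT_MAX_LENGTH)),
        ],
        outputs="textbox",
        title="Medical QnA Bot using GPT2",
        description="Medical QnA Bot using GPT2, trained using MedQuAD dataset",
    )
```

To start the app on the Space, `app.py` would end with a call to `build_interface().launch()`. The matching `requirements.txt` would contain one package name per line: `transformers`, `torch`, and `gradio`.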



In [None]:
# Hugging Face Space link
# https://huggingface.co/spaces/kpavan2004/MedicalQnA