## Setting up

In [1]:
%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install hf_xet

In [2]:
import wandb

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tuning for velsera', 
    job_type="training", 
    anonymous="allow"
)

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mmohitlakshya[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [3]:
import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer, SFTConfig
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig,
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split



## Loading Dataset

- The data was originally provided as individual text files.
- Each text file has three fields: id, title, and abstract.
- Loading 1,000 separate files on Kaggle was cumbersome, so I combined them into dataframes and uploaded one CSV file per class.
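
The consolidation step itself is not shown in this notebook. A minimal sketch of how it might look, assuming (hypothetically) that each text file holds lines starting with `ID:`, `Title:` and `Abstract:`. The real file layout may differ, and `parse_record` and `records_to_csv` are illustrative names, not part of the original work.

```python
import csv
import io

# Hypothetical field prefixes; the real files may use a different layout.
FIELDS = ("id", "title", "abstract")
PREFIXES = {"id": "ID:", "title": "Title:", "abstract": "Abstract:"}


def parse_record(text):
    """Parse one text file's content into a dict with id, title and abstract."""
    record = {}
    for line in text.splitlines():
        for field, prefix in PREFIXES.items():
            if line.startswith(prefix):
                record[field] = line[len(prefix):].strip()
    return record


def records_to_csv(texts, label):
    """Combine parsed records into one CSV string with a `y` label column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[*FIELDS, "y"])
    writer.writeheader()
    for text in texts:
        writer.writerow({**parse_record(text), "y": label})
    return buffer.getvalue()


sample = "ID: 123\nTitle: An example title\nAbstract: An example abstract."
csv_text = records_to_csv([sample], label="cancer")
print(csv_text)
```

In practice the `texts` list would come from reading every file in a class folder, and the resulting CSV would be written to disk once per class.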

In [6]:
cancer_df = pd.read_csv("/kaggle/input/cancer-non-cancer/cancer.csv")
non_cancer_df = pd.read_csv("/kaggle/input/cancer-non-cancer/non_cancer.csv")
df = pd.concat([cancer_df, non_cancer_df])
df["label"] = ["yes" if y == "cancer" else "no" for y in df["y"]]
df.head()

Unnamed: 0,id,title,abstract,y,label
0,31055803,[Analysis of age-specific cytogenetic changes ...,OBJECTIVE: To characterize cytogenetic changes...,cancer,yes
1,31164412,T-Cell Deletion of MyD88 Connects IL17 and Ika...,Cancer development requires a favorable tissue...,cancer,yes
2,31094905,MYCN Amplified Relapse Following Resolution of...,Congenital neuroblastoma with placental involv...,cancer,yes
3,31498304,In Vivo Inhibition of MicroRNA to Decrease Tum...,MicroRNAs (miRNAs) are important regulators of...,cancer,yes
4,30897768,Breast Cancer and miR-SNPs: The Importance of ...,Recent studies in cancer diagnostics have iden...,cancer,yes


In [9]:
df.label.value_counts() # equal split between cancer and non-cancer files

label
yes    500
no     500
Name: count, dtype: int64

## Prepare dataset for training and evaluation

In [10]:
df = df.sample(frac=1, random_state=85).reset_index(drop=True)

# Split the DataFrame
X = df.drop(columns="label")
y = df["label"]

x_train, x_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=42)

# Define the prompt generation functions
def generate_prompt(**data_point):
    return f"""Given the title and abstract of a research paper 
Title: {data_point["title"]}
Abstract: {data_point["abstract"]}

Is the paper related with cancer? Answer with "yes" or "no" only.
Answer: {data_point["label"]}""".strip()

def generate_test_prompt(**data_point):
    return f"""Given the title and abstract of a research paper
Title: {data_point["title"]}
Abstract: {data_point["abstract"]}

Is the paper related with cancer? Answer with "yes" or "no" only.
Answer:""".strip()

x_train["text"] = [generate_prompt(title=t, abstract=a, label=y) for t, a, y in zip(x_train["title"], x_train["abstract"], y_train)]
x_val["text"] = [generate_test_prompt(title=t, abstract=a) for t, a in zip(x_val["title"], x_val["abstract"])]
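
The split above is 60/20/20 but not stratified, which is why the validation and test sets later show 105 "yes" and 95 "no" examples rather than 100 each. With scikit-learn this is usually addressed by passing `stratify=y` (and `stratify=y_temp` for the second call) to `train_test_split`. The idea can be sketched in plain Python; `stratified_split` is a hypothetical helper written for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return index lists (train, val, test) that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    splits = ([], [], [])
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        splits[0].extend(indices[:n_train])
        splits[1].extend(indices[n_train:n_train + n_val])
        splits[2].extend(indices[n_train + n_val:])
    return splits


labels = ["yes"] * 500 + ["no"] * 500  # same class balance as the dataset
train_idx, val_idx, test_idx = stratified_split(labels)
val_yes = sum(labels[i] == "yes" for i in val_idx)
print(len(train_idx), len(val_idx), len(test_idx), val_yes)
```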

In [11]:
# Convert to datasets
train_data = Dataset.from_pandas(x_train[["text"]])
eval_data = Dataset.from_pandas(x_val[["text"]])

In [12]:
train_data['text'][3]

'Given the title and abstract of a research paper \nTitle: Genetics meets pathology - an increasingly important relationship.\nAbstract: The analytical power of modern methods for DNA analysis has outstripped our capability to interpret and understand the data generated. To make good use of this genomic data in a biomedical setting (whether for research or diagnosis), it is vital that we understand the mechanisms through which mutations affect biochemical pathways and physiological systems. This lies at the centre of what genetics is all about, and it is the reason why genetics and genomics should go hand in hand whenever possible. In this Annual Review Issue of The Journal of Pathology, we have assembled a collection of 16 expert reviews covering a wide range of topics. Through these, we illustrate the power of genetic analysis to improve our understanding of normal physiology and disease pathology, and thereby to think in rational ways about clinical management. Copyright   2016 Path
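
Because the label sits on the last line of the training prompt, right-truncation would remove it first. If a tokenized prompt exceeds the `max_seq_length=512` used in the training configuration, the `Answer:` line may be cut off. One possible guard is to shorten the abstract instead. The sketch below uses whitespace-separated words as a rough stand-in for tokens; `build_prompt` and `max_abstract_words` are hypothetical names, and a real version would measure length with the model's tokenizer.

```python
def build_prompt(title, abstract, label=None, max_abstract_words=300):
    """Build the prompt, trimming the abstract so the answer line is kept.

    Word counts are only a rough proxy for tokens.
    """
    words = abstract.split()
    if len(words) > max_abstract_words:
        abstract = " ".join(words[:max_abstract_words]) + " ..."
    prompt = (
        "Given the title and abstract of a research paper\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        'Is the paper related with cancer? Answer with "yes" or "no" only.\n'
        "Answer:"
    )
    # Training prompts carry the label; test prompts stop at "Answer:".
    return prompt if label is None else f"{prompt} {label}"


long_abstract = "word " * 1000
text = build_prompt("A title", long_abstract, label="yes", max_abstract_words=50)
print(text[-40:])
```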

## Loading the model and tokenizer

- We are using one of Unsloth's pre-quantized 4-bit models (`unsloth/Llama-3.2-1B-bnb-4bit`).
- The checkpoint is small: the weights download is about 1 GB.

In [13]:
from kaggle_secrets import UserSecretsClient
from huggingface_hub.hf_api import HfFolder

user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HUGGING_FACE_TOKEN")
HfFolder.save_token(secret_value_0)

In [14]:
base_model_name = "unsloth/Llama-3.2-1B-bnb-4bit"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    # quantization_config=bnb_config, 
).to(device)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Ignore the JavaScript error in the output; it only means the notebook could not render the download progress bar


## Model evaluation before fine-tuning

In [82]:
# def predict(test, model, tokenizer):
#     y_pred = []
#     categories = ["yes", "no"]
    
#     for i in tqdm(range(len(test))):
#         prompt = test.iloc[i]["text"]
#         pipe = pipeline(task="text-generation", 
#                         model=model, 
#                         tokenizer=tokenizer, 
#                         # return_tensors=True,
#                         max_new_tokens=2,
#                         # num_return_sequences=4,
#                         temperature=0.1, )
        
#         result = pipe(prompt) #, output_scores= True) #, "output_logits": True, "return_dict_in_generate": True})
#         results.append(result)
#         print(result)
#         answer = result[0]['generated_text'].split("Answer:")[-1].strip()
        
#         # Determine the predicted category
#         for category in categories:
#             if category.lower() in answer.lower():
#                 y_pred.append(category)
#                 break
#         else:
#             y_pred.append("none")
        
#     return y_pred

In [16]:
def predict(test, model, tokenizer):
    predictions = []
    categories = ["yes", "no"]
    
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        
        # Tokenize input
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        input_ids = inputs["input_ids"]
        
        # Generate output directly with the model
        generated_outputs = model.generate(
            input_ids,
            do_sample=True,
            max_new_tokens=2,
            temperature=0.1,
            num_return_sequences=1,
            output_scores=True,
            return_dict_in_generate=True
        )
        
        gen_sequences = generated_outputs.sequences[:, input_ids.shape[-1]:]
        
        # Convert logits to probabilities
        probs = torch.stack(generated_outputs.scores, dim=1).softmax(-1)
        
        # Collect probability of the generated tokens
        gen_probs = torch.gather(probs, 2, gen_sequences[:, :, None]).squeeze(-1)
        
        # Calculate overall probability of the sequence (product of token probabilities)
        sequence_prob = gen_probs.prod(-1).item()
        
        # Decode the generated tokens
        generated_text = tokenizer.decode(gen_sequences[0], skip_special_tokens=True)
        
        # Determine the predicted category
        predicted_label = "none"
        for category in categories:
            if category.lower() in generated_text.lower():
                predicted_label = category
                break
        
        # Create prediction dictionary
        prediction = {
            "label": predicted_label,
            "probability": float(sequence_prob),
            "input_text": prompt
        }
        
        predictions.append(prediction)

    x_test = test.copy()
    x_test["predicted_label"] = [x["label"] for x in predictions]
    x_test["predicted_probability"] = [x["probability"] for x in predictions]
    
    return x_test
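
A note on `predicted_probability`: it is computed from `generated_outputs.scores`, which, as far as I understand the `transformers` generation API, are the scores after generation-time processing such as temperature scaling. With `temperature=0.1` the reported values are therefore much sharper than the model's raw distribution. A small sketch of that effect in plain Python, using made-up logits for two candidate tokens:

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for the tokens " yes" and " no" (illustrative values only).
logits = [2.0, 1.8]

raw = softmax(logits)                         # close to a coin flip
sharpened = softmax(logits, temperature=0.1)  # heavily favors the top token

# Sequence probability as in `predict`: product of per-token probabilities.
# The second value (0.9) is an arbitrary probability for a second token.
step_probs = [sharpened[0], 0.9]
sequence_prob = math.prod(step_probs)

print(f"raw={raw[0]:.3f} sharpened={sharpened[0]:.3f} sequence={sequence_prob:.3f}")
```

For a calibrated confidence score, comparing the raw probabilities of the "yes" and "no" tokens at the first generated position is likely a better choice than the product of sampled-token probabilities.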

In [17]:
y_val_pred_df = predict(x_val, model, tokenizer)

100%|██████████| 200/200 [00:47<00:00,  4.19it/s]


In [18]:
y_val_pred_df.head()

Unnamed: 0,id,title,abstract,y,text,predicted_label,predicted_probability
526,31266596,Renal cell carcinomas with a mesenchymal strom...,A subset of renal cell neoplasms contains a me...,cancer,Given the title and abstract of a research pap...,yes,0.884039
557,31212132,Identification of a novel orally bioavailable ...,Extracellular regulated kinase 5 (ERK5) signal...,cancer,Given the title and abstract of a research pap...,yes,1.0
290,31182949,Lack of Response to Vemurafenib in Melanoma Ca...,Vemurafenib has been developed to target commo...,cancer,Given the title and abstract of a research pap...,yes,0.196826
141,26432562,A Case of Recurrent Ischemic Stroke Involving ...,CASE REPORT: A 58-year-old man presenting with...,non_cancer,Given the title and abstract of a research pap...,yes,0.803174
239,30937513,[Bronchoalveolar lavage in hairy cell leukemia...,We report a 78-year-old male patient suffering...,cancer,Given the title and abstract of a research pap...,yes,0.735218


In [19]:
def evaluate(y_true, y_pred):
    labels = ["yes", "no"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        return mapping.get(x, -1)  # -1 marks anything outside the two labels, e.g. the "none" fallback from predict()
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Per-label accuracy (equivalent to per-class recall)
    unique_labels = set(y_true_mapped)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
    print('\nConfusion Matrix:')
    print(conf_matrix)

In [20]:
evaluate(y_val, y_val_pred_df["predicted_label"])

Accuracy: 0.525
Accuracy for label yes: 1.000
Accuracy for label no: 0.000

Classification Report:
              precision    recall  f1-score   support

         yes       0.53      1.00      0.69       105
          no       0.00      0.00      0.00        95

    accuracy                           0.53       200
   macro avg       0.26      0.50      0.34       200
weighted avg       0.28      0.53      0.36       200


Confusion Matrix:
[[105   0]
 [ 95   0]]




## Fine-tuning the model

In [21]:
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

In [22]:
modules = find_all_linear_names(model)
modules

['k_proj', 'o_proj', 'v_proj', 'up_proj', 'gate_proj', 'q_proj', 'down_proj']
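
`find_all_linear_names` keeps only the last component of each dotted module path, so the same projection in every layer collapses to a single LoRA target name. A sketch of that step on a few hypothetical module paths (the paths are illustrative, not read from the model, and `leaf_names` is a made-up helper):

```python
def leaf_names(module_paths, exclude=("lm_head",)):
    """Collect the last dotted component of each path, minus excluded names."""
    names = {path.split(".")[-1] for path in module_paths}
    return sorted(names - set(exclude))


# Hypothetical dotted paths of the kind `model.named_modules()` yields.
paths = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.down_proj",
    "lm_head",
]
targets = leaf_names(paths)
print(targets)
```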

## Setting up the model

In [23]:
output_dir="fine-tuned-model"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

training_arguments = SFTConfig(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=8,            # micro-batches accumulated before each optimizer update
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_32bit",
    logging_steps=1,
    learning_rate=2e-4,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,                             # -1 lets num_train_epochs set the training length
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="wandb",                        # report metrics to W&B
    eval_strategy="steps",                    # evaluate every eval_steps
    eval_steps=0.2,                           # a value below 1 is a fraction of total training steps
    max_seq_length=512,
    dataset_text_field="text",
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    },
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    processing_class=tokenizer,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
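
The configuration implies a few derived quantities that are worth checking against the training log. A sketch of the arithmetic, assuming a single GPU (the device count is not stated in the notebook) and the 600 training rows produced by the 60/20/20 split:

```python
import math

# Values taken from the configuration cell.
num_examples = 600               # training rows after the 60/20/20 split
per_device_batch_size = 1
gradient_accumulation_steps = 8
num_devices = 1                  # assumption: a single GPU
eval_ratio = 0.2                 # eval_steps given as a fraction of all steps
lora_alpha, lora_r = 16, 64

# Examples consumed per optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Optimizer updates in one epoch.
total_steps = num_examples // effective_batch_size

# Evaluation interval when eval_steps is a fraction below 1.
eval_every = math.ceil(total_steps * eval_ratio)

# Standard LoRA scaling applied to the low-rank update.
lora_scale = lora_alpha / lora_r

print(effective_batch_size, total_steps, eval_every, lora_scale)
```

These values line up with the run: training stops at step 75 and validation loss is reported every 15 steps.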


## Model Training

- Training is restricted to a single epoch (75 optimizer steps) because of time constraints.

In [24]:
# Train model
trainer.train()



Step,Training Loss,Validation Loss
15,2.0653,1.917533
30,1.8211,1.9025
45,1.7634,1.886745
60,1.6698,1.886993
75,1.871,1.88685


TrainOutput(global_step=75, training_loss=1.8784265120824177, metrics={'train_runtime': 626.7109, 'train_samples_per_second': 0.957, 'train_steps_per_second': 0.12, 'total_flos': 1296270023393280.0, 'train_loss': 1.8784265120824177})
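
The learning-rate curve logged to W&B follows from `lr_scheduler_type="cosine"` and `warmup_ratio=0.03`. A sketch of the usual linear-warmup plus cosine-decay formula; it mirrors the common definition rather than being copied from the `transformers` source, and it assumes the warmup length is the ratio times the total step count, rounded up. `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(step, total_steps=75, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step` with linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_lr(s) for s in range(76)]
print(f"peak={max(schedule):.1e} at step {schedule.index(max(schedule))}")
```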

In [25]:
wandb.finish()
model.config.use_cache = True


Run history:
eval/loss,█▅▁▁▁
eval/mean_token_accuracy,▁▆▇█▇
eval/num_tokens,▁▃▅▆█
eval/runtime,▁▇█▇█
eval/samples_per_second,█▂▁▂▁
eval/steps_per_second,█▁▁▁▁
train/epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
train/grad_norm,▄▃▂▂▄▃█▃▃▃▂▂▁▃▃▂▃▁▂▁▁▂▂▂▁▃▂▂▃▂▂▂▃▂▂▁▂▂▃▃
train/learning_rate,▁▃███████▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁

Run summary:
eval/loss,1.88685
eval/mean_token_accuracy,0.5926
eval/num_tokens,212176.0
eval/runtime,44.0854
eval/samples_per_second,4.537
eval/steps_per_second,0.567
total_flos,1296270023393280.0
train/epoch,1.0
train/global_step,75.0
train/grad_norm,0.20886


## Saving the model and tokenizer

In [26]:
# Save trained model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

('fine-tuned-model/tokenizer_config.json',
 'fine-tuned-model/special_tokens_map.json',
 'fine-tuned-model/tokenizer.json')

## Testing model after fine-tuning 

In [27]:
y_val_pred_df = predict(x_val, model, tokenizer)
evaluate(y_val, y_val_pred_df["predicted_label"])

100%|██████████| 200/200 [00:54<00:00,  3.66it/s]

Accuracy: 0.930
Accuracy for label yes: 0.886
Accuracy for label no: 0.979

Classification Report:
              precision    recall  f1-score   support

         yes       0.98      0.89      0.93       105
          no       0.89      0.98      0.93        95

    accuracy                           0.93       200
   macro avg       0.93      0.93      0.93       200
weighted avg       0.93      0.93      0.93       200


Confusion Matrix:
[[93 12]
 [ 2 93]]
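
The per-label accuracies printed by `evaluate` are the per-class recalls, and every figure in the report can be recomputed from the confusion matrix (rows are true labels, columns are predictions, in the order yes, no). A short check in plain Python:

```python
# Validation confusion matrix after fine-tuning; rows = true, columns = predicted.
matrix = [[93, 12],
          [2, 93]]
labels = ["yes", "no"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_count = sum(matrix[i])                      # support for this label
    predicted_count = sum(row[i] for row in matrix)  # times this label was predicted
    recall = matrix[i][i] / true_count
    precision = matrix[i][i] / predicted_count
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = {"precision": precision, "recall": recall, "f1": f1}

print(f"accuracy={accuracy:.3f}")
for label, m in metrics.items():
    print(label, {k: round(v, 2) for k, v in m.items()})
```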





In [28]:
x_test["text"] = [generate_test_prompt(title=t, abstract=a) for t, a in zip(x_test["title"], x_test["abstract"])]

y_preds_test = predict(x_test, model, tokenizer)
evaluate(y_test, y_preds_test["predicted_label"])

100%|██████████| 200/200 [00:52<00:00,  3.77it/s]

Accuracy: 0.950
Accuracy for label yes: 0.924
Accuracy for label no: 0.979

Classification Report:
              precision    recall  f1-score   support

         yes       0.98      0.92      0.95       105
          no       0.92      0.98      0.95        95

    accuracy                           0.95       200
   macro avg       0.95      0.95      0.95       200
weighted avg       0.95      0.95      0.95       200


Confusion Matrix:
[[97  8]
 [ 2 93]]





## Conclusion

- Accuracy improved substantially after fine-tuning. Before fine-tuning, the model answered "yes" for every validation example, which gives 52.5% accuracy; after fine-tuning, accuracy rose to 93% on the validation set and 95% on the test set.
- Accuracy could likely be improved further by using a larger model or training for more epochs. These options are worth exploring, but in the interest of time and resources they are left for later.