<a href="https://colab.research.google.com/github/joonhee0416/CPSC477-Final/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CPSC 477 Final Project Part 2: Finetuning Mistral 7B-Instruct

Followed this tutorial: https://adithyask.medium.com/a-beginners-guide-to-fine-tuning-mistral-7b-instruct-model-0f39647b20fe

Imported `dataset/train.jsonl`

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
!pip install -q torch
!pip install -q git+https://github.com/huggingface/transformers  # Hugging Face Transformers: model classes and weight downloads
!pip install -q datasets  # Hugging Face Datasets: loading and manipulating datasets
!pip install -q peft  # Parameter-efficient fine-tuning (used here for QLoRA)
!pip install -q bitsandbytes  # Model weight quantization
!pip install -q trl  # Transformer Reinforcement Learning; provides SFTTrainer for supervised fine-tuning
!pip install -q wandb -U  # Tracks training and validation loss during training


In [None]:
import json
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from huggingface_hub import notebook_login
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer

In [None]:
notebook_login()


In [None]:
train_dataset = load_dataset('csv', data_files='./train.csv' , split='train')
val_dataset = load_dataset('csv', data_files='./val.csv' , split='train')

In [None]:
print(len(train_dataset[0]["text"]))
val_dataset[0]

9487


{'filename': 'HLF_q4_2021.txt',
 'text': "<s>[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.\n\nEarnings Call Transcript:\n These reconciliations, together with additional supplemental information, are available at the Investor Relations section of our website, herbalife.com.\nAdditionally, when management makes reference to volumes during this conference call, they are referring to volume points.\n2021 was another record year for Herbalife Nutrition.\nEven during this period of continued global uncertainty due to the pandemic, our entrepreneurial direct sales channel helped consumers around the world pursue their nutrition and wellness goals by giving them access to our high-quality nutrition products.\nFor the full ye

In [None]:
new_model = "mistralai-Code-Instruct" #set the name of the new model

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 2

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = True

# Batch size per GPU for training
per_device_train_batch_size = 1

# Batch size per GPU for evaluation
per_device_eval_batch_size = 1

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 4

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 1e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule (constant a bit better than cosine)
lr_scheduler_type = "constant"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 25

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = {"": 0}
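As a quick sanity check on the hyperparameters above, the sketch below derives a few quantities they imply: the effective batch size per optimizer step, the LoRA scaling factor (`lora_alpha / r`, the standard LoRA scaling), and the approximate number of warmup steps. The 400-step total is the `max_steps` value passed to `TrainingArguments` later in this notebook; the single-GPU count is an assumption based on `device_map = {"": 0}`, and the warmup rounding is illustrative rather than a guarantee of what the scheduler does.

```python
import math

# Values copied from the configuration cell above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
lora_r = 64
lora_alpha = 16
warmup_ratio = 0.03

num_gpus = 1       # assumption: training runs on a single GPU
total_steps = 400  # max_steps passed to TrainingArguments later in the notebook

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Factor applied to the low-rank update B @ A before it is added to the frozen weight
lora_scaling = lora_alpha / lora_r

# Approximate number of steps spent ramping the learning rate up from 0
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Training examples consumed over the whole run
examples_seen = effective_batch_size * total_steps

print(f"effective batch size: {effective_batch_size}")
print(f"LoRA scaling factor:  {lora_scaling}")
print(f"warmup steps:         {warmup_steps}")
print(f"examples seen:        {examples_seen}")
```

With `r = 64` and `lora_alpha = 16`, the adapter update is scaled by 0.25, so raising `lora_alpha` is one way to give the adapter more influence without changing its rank.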

In [None]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

# Load the base model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)

base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# Load the Mistral AI tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
import torch
torch.cuda.empty_cache()
import gc
gc.collect()

87

In [None]:
eval_prompt = train_dataset[0]["text"].split("#")[0] + "# "
# import random
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

base_model.eval()
with torch.no_grad():
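    # pad_token was set to eos_token above, so eos_token_id doubles as the padding ID
    output_ids = base_model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))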

[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.

Earnings Call Transcript:
 Also on the call are Brian McDade, chief financial officer; and Adam Reuille, chief accounting officer.
As all of us know, 2020 was a difficult year for all of those affected by COVID-19, including our company.
Even with the unprecedented operating environment, we accomplished a great deal.
We earned $9.11 per diluted share and funds from operation for the full year, which includes a $0.06 per share dilution from our recent equity offering in November.
We generated over $2.3 billion in operating cash flow.
We acquired an 80% interest in the Taubman Realty Group, made strategic investments in several widely recognized retail brands at attractive 

In [None]:
print(train_dataset[0]["text"].split("#")[1])
print(len(train_dataset[0]["text"]))

 In 2020, Simon Property Group navigated economic challenges stemming from the pandemic, generating $9.11 per diluted share earnings and $2.3 billion in operating cash flow. During the fourth quarter, FFO reached $2.17 per share, impacted by COVID-19 disruptions, yielding a $0.95 per share loss. Despite these challenges, the company made strategic investments in retail brands, raising $13 billion in debt and equity markets. They completed the Taubman Realty Group acquisition, which boasts a high-performing retail portfolio. Looking ahead to 2021, the company projects FFO growth in the range of $9.50 to $9.75 per share, a 4.3% to 7% increase from 2020. The guidance assumes no further government-mandated shutdowns of domestic properties and includes an estimated $0.15 to $0.20 per share contribution from retailer investments.</s>
9487
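The two cells above rely on `"#"` separating the prompt from the reference summary in each `text` field. A minimal sketch of that split as a reusable helper follows; `split_example` is a hypothetical name, and the toy string is made up for illustration rather than taken from the dataset.

```python
from typing import Tuple


def split_example(text: str, sep: str = "#") -> Tuple[str, str]:
    """Split one example into (prompt, reference) at the first separator."""
    prompt, found, reference = text.partition(sep)
    if not found:
        raise ValueError(f"separator {sep!r} not found in example")
    # Mirror the notebook: the prompt ends with the separator plus one space
    return prompt + sep + " ", reference.strip()


# Hypothetical example text, only loosely shaped like the real data
example = "<s>[INST] Summarize the transcript below.\nRevenue rose 5%.\n# Revenue grew 5%.</s>"
prompt, reference = split_example(example)
print(repr(prompt))
print(repr(reference))
```

Using `partition` splits only at the first `"#"`, so a `"#"` that appears inside the reference summary is kept, whereas `split("#")[1]` would cut the summary off at that point.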


In [None]:
import gc
torch.cuda.empty_cache()
gc.collect()

193

In [None]:
# Set LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=400,               # Total number of training steps; overrides num_train_epochs
    logging_steps=50,            # Log the training loss every 50 steps
    logging_dir="./logs",        # Directory for storing logs
    save_strategy="steps",       # Save checkpoints on a step interval
    save_steps=50,               # Save a checkpoint every 50 steps
    evaluation_strategy="steps", # Evaluate on a step interval
    eval_steps=50,               # Evaluate on the validation set every 50 steps
    do_eval=True,                # Enable evaluation during training
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="wandb",
)

# Initialize the SFTTrainer for fine-tuning
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,  # You can specify the maximum sequence length here
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

max_steps is given, it will override any value given in num_train_epochs
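To get a feel for how many parameters this LoRA configuration trains, the sketch below counts the adapter weights for the eight `target_modules` listed above. Each adapted linear layer gains two matrices, `A` (`r × in_features`) and `B` (`out_features × r`). The layer dimensions are assumptions based on the published Mistral-7B-v0.1 architecture and are not stated anywhere in this notebook, so the result is an estimate; calling `print_trainable_parameters()` on the PEFT-wrapped model would give the authoritative figure.

```python
# Assumed Mistral-7B-v0.1 dimensions (not taken from this notebook)
hidden_size = 4096
intermediate_size = 14336
kv_dim = 1024        # 8 key/value heads x 128 head dim (grouped-query attention)
vocab_size = 32000
num_layers = 32

lora_r = 64          # from the configuration cell


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# (in_features, out_features) for each adapted projection inside a decoder layer
per_layer_modules = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
    "gate_proj": (hidden_size, intermediate_size),
    "up_proj": (hidden_size, intermediate_size),
    "down_proj": (intermediate_size, hidden_size),
}

per_layer = sum(lora_params(i, o, lora_r) for i, o in per_layer_modules.values())
lm_head = lora_params(hidden_size, vocab_size, lora_r)  # lm_head is adapted once, not per layer
total = num_layers * per_layer + lm_head

print(f"per decoder layer: {per_layer:,}")
print(f"lm_head:           {lm_head:,}")
print(f"total trainable:   {total:,}")
```

Under these assumptions the adapters hold roughly 170 million parameters, a small fraction of the 7B base model, which stays frozen in 4-bit precision.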


In [None]:
!nvidia-smi

Wed May  8 15:41:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              51W / 400W |   6765MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
import gc
import os
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()
gc.collect()

# Start the training process
trainer.train()

# Save the fine-tuned model
trainer.model.save_pretrained(new_model)



Step,Training Loss,Validation Loss
50,1.8999,1.745044
100,1.6447,1.731574
150,1.3944,1.775493
200,1.1239,1.87456
250,0.8703,2.017644
300,0.6223,2.181498
350,0.4078,2.409936
400,0.2583,2.603613
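The log above shows validation loss reaching its minimum early and rising afterwards while training loss keeps falling, which is the usual signature of overfitting. A small sketch for picking the best evaluated step from these numbers follows. The values are copied from the table above; the `checkpoint-<step>` folder name assumes the default `Trainer` naming inside `output_dir`.

```python
import csv
import io

# Loss values copied from the training log above
log = """Step,Training Loss,Validation Loss
50,1.8999,1.745044
100,1.6447,1.731574
150,1.3944,1.775493
200,1.1239,1.87456
250,0.8703,2.017644
300,0.6223,2.181498
350,0.4078,2.409936
400,0.2583,2.603613
"""

rows = [
    {
        "step": int(r["Step"]),
        "train": float(r["Training Loss"]),
        "val": float(r["Validation Loss"]),
    }
    for r in csv.DictReader(io.StringIO(log))
]

# The step with the lowest validation loss is the natural candidate to keep
best = min(rows, key=lambda r: r["val"])
final = rows[-1]

print(f"best step: {best['step']} (validation loss {best['val']})")
print(f"candidate checkpoint: ./results/checkpoint-{best['step']}")
print(f"final-step gap (val - train): {final['val'] - final['train']:.4f}")
```

Because checkpoints were saved every 50 steps, the step-100 checkpoint should be available for evaluation alongside the final adapter saved after step 400.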




In [None]:
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
# Load the fine-tuned model saved under `new_model`
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    new_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)
fine_tuned_model.eval()
with torch.no_grad():
    output_ids = fine_tuned_model.generate(**model_input, max_new_tokens=256, pad_token_id=tokenizer.eos_token_id)
generated_summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(generated_summary)



Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[INST] You are a financial advisor tasked with creating a short summary of an earnings call transcript. You only want to summarize or re-iterate points that would be relevant, critical, or informational to someone who wants to skim over the important details of a long transcript. Below is an earnings call transcript.

Earnings Call Transcript:
 Also on the call are Brian McDade, chief financial officer; and Adam Reuille, chief accounting officer.
As all of us know, 2020 was a difficult year for all of those affected by COVID-19, including our company.
Even with the unprecedented operating environment, we accomplished a great deal.
We earned $9.11 per diluted share and funds from operation for the full year, which includes a $0.06 per share dilution from our recent equity offering in November.
We generated over $2.3 billion in operating cash flow.
We acquired an 80% interest in the Taubman Realty Group, made strategic investments in several widely recognized retail brands at attractive 

In [None]:
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
torch.cuda.empty_cache()
gc.collect()

# Merge the model with LoRA weights
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")




('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.model',
 'merged_model/added_tokens.json',
 'merged_model/tokenizer.json')

In [None]:
test_dataset = load_dataset('csv', data_files='./test.csv' , split='train')

In [32]:
import os

# The output directory must exist before summaries are written to it
os.makedirs("base_1", exist_ok=True)

base_model.eval()
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
with torch.cuda.amp.autocast():
    for i in range(len(test_dataset)):
        eval_prompt = test_dataset[i]["text"].split("#")[0] + "# "

        input_ids = tokenizer(eval_prompt, return_tensors="pt", truncation=True).input_ids.cuda()
        summary = ""
        # Resample until the summary is longer than 500 characters
        while len(summary) <= 500:
            output = base_model.generate(input_ids=input_ids, pad_token_id=tokenizer.pad_token_id, max_new_tokens=300, do_sample=True, top_p=0.9, temperature=0.5)
            summary = tokenizer.batch_decode(output.detach().cpu().numpy(), skip_special_tokens=True)[0]
            # Keep only the text after the first "#", which marks the end of the prompt
            summary = summary[summary.index("#") + 1:]
            print(f"{test_dataset[i]['filename']}'s length is {len(summary)}")
        with open(f"base_1/{test_dataset[i]['filename']}", "w") as f:
            f.write(summary)

WSO_q2_2021.txt's length is 1248
OFC_q2_2021.txt's length is 1218
LIN_q3_2021.txt's length is 1451
CNC_q3_2021.txt's length is 1214
SKT_q1_2021.txt's length is 1336
CHE_q4_2020.txt's length is 853
MBI_q4_2019.txt's length is 1158
GHL_q4_2020.txt's length is 1289
KFY_q2_2022.txt's length is 1011
WST_q1_2020.txt's length is 1103
UVE_q4_2020.txt's length is 1193
NNN_q1_2021.txt's length is 1174
CHE_q3_2020.txt's length is 1127
GEF_q3_2021.txt's length is 1177
AIT_q2_2020.txt's length is 711
TXT_q1_2021.txt's length is 1158
ATR_q1_2021.txt's length is 1357
LAZ_q2_2021.txt's length is 1565
HZO_q1_2021.txt's length is 1004
PBH_q3_2021.txt's length is 1395
HII_q1_2021.txt's length is 1138
PNW_q4_2019.txt's length is 1164
GD_q1_2021.txt's length is 1102
HCC_q1_2020.txt's length is 1234
CIM_q1_2020.txt's length is 1430
SJI_q3_2021.txt's length is 1379
FR_q4_2020.txt's length is 1150
TXT_q4_2021.txt's length is 1078
WST_q3_2020.txt's length is 868
PLD_q4_2021.txt's length is 1114
FN_q2_2021.txt'
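The `while len(summary) <= 500` loop above resamples until a summary is long enough, which never terminates if the model keeps producing short outputs. A minimal sketch of the same idea with an attempt cap follows; `generate_until_long_enough` and its parameters are hypothetical names, and the stand-in generator replaces the real `generate` plus decode step so the example runs without a GPU.

```python
from typing import Callable


def generate_until_long_enough(
    generate_fn: Callable[[], str],
    min_chars: int = 500,
    max_attempts: int = 5,
) -> str:
    """Call generate_fn until a result exceeds min_chars or attempts run out.

    Returns the longest candidate seen, so a short result is still
    returned when every attempt falls below the threshold.
    """
    best = ""
    for _ in range(max_attempts):
        candidate = generate_fn()
        if len(candidate) > len(best):
            best = candidate
        if len(best) > min_chars:
            break
    return best


# Stand-in for the model call: yields progressively longer strings
fake_outputs = iter(["short", "x" * 300, "y" * 600, "z" * 900])
result = generate_until_long_enough(lambda: next(fake_outputs))
print(len(result))
```

In the notebook, `generate_fn` would wrap the existing generation and decoding lines for a single prompt, leaving the sampling settings unchanged.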

In [None]:
eval_prompt = test_dataset[0]["text"].split("#")[0] + "# "
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

input_ids = tokenizer(eval_prompt, return_tensors="pt", truncation=True).input_ids.cuda()
with torch.cuda.amp.autocast():
  outputs = merged_model.generate(input_ids=input_ids, pad_token_id=tokenizer.pad_token_id, max_new_tokens=10000, do_sample=True, top_p=0.9,temperature=0.5)
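# Decode only the newly generated tokens; slicing the decoded string by len(eval_prompt)
# is brittle because skip_special_tokens can change the decoded prompt's length
generated_ids = outputs[0][input_ids.shape[1]:]
print(f"\nGenerated summary:\n{tokenizer.decode(generated_ids, skip_special_tokens=True)}")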



Generated summary:
1 Watsco Earnings Call Summary

Watsco, a leading distributor of heating, ventilation, and air conditioning (HVAC) products in North America, reported a record-breaking second quarter with earnings per share jumping 64% to $3.71 per share on a 66% increase in net income. Sales grew 36% to a record $1.85 billion, and gross profits increased 50% with gross margins expanding 220 basis points. Operating income increased $88 million or 68% to $217 million, and operating margins expanded 220 basis points to a record 11.7%. These results were achieved despite only a modest impact from COVID-related slowdowns in the second quarter of 2020.

Watsco has recently acquired two new companies, TEC and Acme, which have performed well and are now part of the Watsco family. The company is engaged in a very fragmented $50 billion North American market and hopes to find more great companies to join them. Greater scale in this industry provides more capital for the company to fund its 

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [33]:
!zip -r base_1.zip base_1/

  adding: base_1/ (stored 0%)
  adding: base_1/FN_q2_2021.txt (deflated 49%)
  adding: base_1/GD_q1_2021.txt (deflated 55%)
  adding: base_1/LAZ_q2_2021.txt (deflated 55%)
  adding: base_1/WSO_q2_2021.txt (deflated 47%)
  adding: base_1/FNF_q3_2020.txt (deflated 50%)
  adding: base_1/CNC_q3_2021.txt (deflated 46%)
  adding: base_1/WST_q3_2020.txt (deflated 43%)
  adding: base_1/FR_q4_2020.txt (deflated 46%)
  adding: base_1/MSI_q3_2021.txt (deflated 52%)
  adding: base_1/HCC_q1_2020.txt (deflated 54%)
  adding: base_1/AIR_q1_2022.txt (deflated 49%)
  adding: base_1/NI_q3_2021.txt (deflated 48%)
  adding: base_1/ATR_q1_2021.txt (deflated 51%)
  adding: base_1/GHL_q4_2020.txt (deflated 53%)
  adding: base_1/CHE_q3_2020.txt (deflated 50%)
  adding: base_1/TDY_q4_2020.txt (deflated 53%)
  adding: base_1/LIN_q3_2021.txt (deflated 47%)
  adding: base_1/NHI_q1_2020.txt (deflated 46%)
  adding: base_1/KWR_q1_2020.txt (deflated 52%)
  adding: base_1/SJI_q3_2021.txt (deflated 52%)
  adding: base