# Finetune Phi-2

This finetuning setup was run on Azure Machine Learning on a VM with a GPU (Standard_NC6s_v3 - 1 x NVIDIA Tesla V100). Once the environment has been created on the VM from the associated requirements.txt file, it is much cleaner to open this notebook in VS Code if you are using AML; any other way of opening a notebook also works.

!NOTE: Compared to GPT models, little work has been done to make the output of Phi-2 safe, so be cautious about using this model out of the box without safety checks or logic such as Azure Content Moderator.

#### Environment Setup

Clone this repo 

    git clone https://github.com/microsoft/dstoolkit-phi2-finetune.git

Once the repo has been cloned, create a new Python environment and activate it (the activation command below is for Windows; on Linux or macOS use `source env/bin/activate`)

    python -m virtualenv env
    env\Scripts\Activate

Install Python requirements from requirements.txt

    pip install -r requirements.txt

#### Setup

In [1]:
# Import Libraries
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

#### Data Prep
Prepare the data as training and validation sets. Usually, in data science, we also hold out a test set to evaluate the model on unseen data that is representative of the real-world population. In this example, we can verify this during inference instead. The validation set helps us check that the model isn't overfitting to the training set during training. Because our use case involves training a QnA agent, it doesn't make sense to withhold questions from the training set; I therefore use the entire dataset for training, and a random subset to verify that the model is learning alongside our training metrics. Note that because this subset is also part of the training data, its loss confirms that the model is learning but cannot reveal overfitting. If your use case differs (such as finetuning on continuous text), you can change your files and split type to suit your needs. For training and validation we need two jsonl files, in which every line is a standalone JSON object.

This can be formatted as key-value pairs if the text we wish to finetune is in question-answer format, or as a note if we are trying to finetune continuous text.

See below for both examples (the jsonl file looks like this):

        {"question": "Does the Sun rise in the East or West?", "answer": "The Sun rises in the East."}
        {"question": "What is the biggest UK festival?", "answer": "The largest UK music festival is Glastonbury."}
OR

        {"note": "continuousTextExample2"}
        {"note": "continuousTextExample2"}

We can break continuous text into smaller sections so that each example is shorter and training is quicker. Whether or not this is done, there should be some form of data quality check or preprocessing step. The original paper which introduced Phi-1 (Textbooks Are All You Need) emphasized the need for high quality data over quantity, and the Phi models were originally trained on a relatively low quantity of high quality Python textbooks.
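
As a rough illustration of the chunking and quality check described above, the sketch below splits continuous text into smaller `note` records on paragraph boundaries and drops fragments that are very short or duplicated. The helper name `chunk_text`, the `max_chars` and `min_chars` thresholds and the filtering rules are all assumptions made for this example; they are not part of the original setup.

```python
import json

def chunk_text(text: str, max_chars: int = 1000, min_chars: int = 20) -> list:
    """Split text on blank lines and pack paragraphs into chunks of at most max_chars.

    Hypothetical helper: the thresholds are illustrative, not tuned values.
    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate          # still fits (or is the first paragraph of a chunk)
        else:
            chunks.append(current)       # close the current chunk and start a new one
            current = paragraph
    if current:
        chunks.append(current)

    # Basic quality checks: drop very short fragments and exact duplicates.
    seen, records = set(), []
    for chunk in chunks:
        if len(chunk) < min_chars or chunk in seen:
            continue
        seen.add(chunk)
        records.append({"note": chunk})
    return records

sample = "First paragraph about licensing.\n\nSecond paragraph about installation.\n\nok"
for record in chunk_text(sample, max_chars=40):
    print(json.dumps(record))  # one jsonl line per chunk
```

Splitting on characters keeps the sketch dependency-free; splitting on token counts from the tokenizer would match the model's limits more closely.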

For our example, we will use a QnA dataset and therefore the first format shown above. I have selected the [Microsoft 365 FAQ](https://www.microsoft.com/en-us/microsoft-365/microsoft-365-for-home-and-school-faq), which contains question-answer pairs of commonly asked questions about Microsoft 365 products. To convert this web page into the above format, I used the [Azure Language Studio's Question-Answering](https://learn.microsoft.com/en-us/azure/ai-services/language-service/question-answering/overview) service, which can parse an FAQ HTML page into question-answer pairs. These were then exported into a csv file. We can then use code or GPT (with examples) to automate the generation of the required format as a jsonl file. A code example is provided in the cell below.

In [2]:
# To convert csv/excel to jsonl - you do not need to run this cell if you have your own data formatted as above.
# !NOTE: in csv, every row was another QnA pair and each column was: Question, Answer.
import pandas as pd
import json
from numpy import random

trainSplit:float=0.7 # set our split between train and validation

qnaData:object=pd.read_excel("QnA_MSFT365.xlsx") # read in data
jsonList=list() # create empty list to store jsonl structure
for index, row in qnaData.iterrows(): # iterate over rows
    jsonList.append({"question": row['Question'], "answer": row['Answer']}) # append in required format

indexSplit:int=int(trainSplit*len(jsonList)) # get the index where the train-val split will occur
random.shuffle(jsonList) # randomise list order so we can split it randomly

# format into train and validation set
trainSet:list=jsonList # [:indexSplit] # commented as we wish to train over entire set
valSet:list=jsonList[indexSplit:]

# save train and val
with open("train.jsonl", 'w') as f:
    for item in trainSet:
        f.write(json.dumps(item) + "\n")
with open("val.jsonl", 'w') as f:
    for item in valSet:
        f.write(json.dumps(item) + "\n")
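
Before training it can be worth checking that every line of the written files parses and carries the expected keys. Empty spreadsheet cells, for instance, are read by pandas as NaN and would end up as non-string values in the jsonl. The sketch below is one way to do such a check; `validate_jsonl_lines` is a hypothetical helper, and the rule that both values must be non-empty strings is an assumption for this example.

```python
import json

def validate_jsonl_lines(lines, required_keys=("question", "answer")) -> list:
    """Return a list of (line_number, problem) tuples; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A key counts as missing if it is absent, not a string, or only whitespace.
        missing = [
            key for key in required_keys
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if missing:
            problems.append((number, f"missing or empty keys: {missing}"))
    return problems

# In-memory example; for a real file pass open("train.jsonl", encoding="utf-8") instead.
example_lines = [
    '{"question": "Does the Sun rise in the East or West?", "answer": "The Sun rises in the East."}',
    '{"question": "What is the biggest UK festival?"}',
    'not json',
]
print(validate_jsonl_lines(example_lines))
```

For continuous text, `required_keys=("note",)` would check the other format.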

In [3]:
# Load and Format Data - saved as "train.jsonl", "val.jsonl"
dataName:str="train.jsonl"
valName:str="val.jsonl"
trainDataset, evalDataset = load_dataset('json', data_files=dataName, split='train'), load_dataset('json', data_files=valName, split='train')

def formattingFunc(textExample:dict) -> str:
    """
    This function formats our text to be continuous rather than in json format. The output of this function is tokenized and then submitted to Phi-2 for finetuning.
    """
    text:str=f"Question: {textExample['question']}\nAnswer: {textExample['answer']}" # if QnA
    # text:str=f"{textExample['note']}" # if continuous text
    return text

Generating train split: 56 examples [00:00, 744.42 examples/s]
Generating train split: 17 examples [00:00, 6966.60 examples/s]


#### Load Model and Tokenizer
This is the model that will be finetuned; we will be using Phi-2. We will also pad the input data to a common length, which requires choosing an appropriate max_length for our input tokens. A larger max_length is more computationally expensive, so it may be worth shortening your training and validation examples if they are long. Each input will be padded with our end of sequence (eos) token.
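
To make the padding step concrete, the sketch below mimics on plain Python lists roughly what left padding to a fixed max_length does. It is a simplified illustration, not the tokenizer's actual implementation: `left_pad` is a hypothetical helper and the three token ids are made up. The pad id 50256 is the eos token id reported in the generation log later in this notebook.

```python
def left_pad(input_ids: list, max_length: int, pad_id: int) -> dict:
    """Left-pad (or truncate) a list of token ids to exactly max_length."""
    ids = input_ids[:max_length]                  # truncate on the right if too long
    padding = [pad_id] * (max_length - len(ids))  # pad tokens go in front of the text
    return {
        "input_ids": padding + ids,
        "attention_mask": [0] * len(padding) + [1] * len(ids),  # 0 marks padding
    }

EOS_ID = 50256  # eos token id, reused as the pad token
example = left_pad([318, 257, 1332], max_length=6, pad_id=EOS_ID)
print(example["input_ids"])
print(example["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions even though they hold a real token id.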

In [7]:
# Name of our base model on the Hugging Face Hub
baseModelName:str="microsoft/phi-2"

# Load our base model
model:object=AutoModelForCausalLM.from_pretrained(baseModelName,
                                             torch_dtype=torch.float32, # fixes issue in inference related to float16 values producing "!!!!" rather than output.
                                             device_map="auto",
                                             trust_remote_code=True,
                                             load_in_8bit=True)

# Load our tokenizer
tokenizer:object=AutoTokenizer.from_pretrained(
    baseModelName,
    padding_side="left", # add padding so that our input sequences are all the same length. Left means that pad token is repeated until we reach our input text.
    add_eos_token=True, # end of sequence token
    add_bos_token=True, # beginning of sequence token
    use_fast=False,
)
tokenizer.pad_token = tokenizer.eos_token # set our pad token to be the same as the eos token

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.36s/it]


In [8]:
def tokenizePrompt(prompt:object) -> dict:
    """
    Tokenizes prompt based on prompt and tokenizer.
    """
    tokenizedPrompt:dict=tokenizer(formattingFunc(prompt))
    return tokenizedPrompt

# Format and Tokenize datasets.
tokenizedTrain:dict=trainDataset.map(tokenizePrompt)
tokenizedVal:dict=evalDataset.map(tokenizePrompt)

# count lengths of both datasets so we can adjust max length
lengthTokens:list=[len(x['input_ids']) for x in tokenizedTrain] # count lengths of tokenizedTrain
if tokenizedVal is not None:
    lengthTokens += [len(x['input_ids']) for x in tokenizedVal] # count lengths of tokenizedVal
maxLengthTokens:int=max(lengthTokens) + 2 #  we could also visualise lengthTokens using matplotlib if we wish to see the distribution
tokenDiffOriginal:int=maxLengthTokens-min(lengthTokens) # create metric original

# this function will set all tokens to the same length using left hand padding and the eos token (setup above)
def tokenizePromptAdjustedLengths(prompt:object):
    """
    Tokenizes prompt with adjusted lengths with left handed padding. All sequences will be of the same length which will assist training.
    """
    tokenizedResponse = tokenizer(
        formattingFunc(prompt),
        truncation=True,
        max_length=maxLengthTokens,
        padding="max_length",
    )
    return tokenizedResponse

del tokenizedTrain; del tokenizedVal # clean up old variables
tokenizedTrain:dict=trainDataset.map(tokenizePromptAdjustedLengths) # apply adjusted size tokenization
tokenizedVal:dict=evalDataset.map(tokenizePromptAdjustedLengths)

# count adjusted size difference
lengthTokens:list=[len(x['input_ids']) for x in tokenizedTrain] # count lengths of tokenizedTrain
if tokenizedVal is not None:
    lengthTokens += [len(x['input_ids']) for x in tokenizedVal] # count lengths of tokenizedVal
tokenDiffAdjusted:int=max(lengthTokens)-min(lengthTokens) # create metric adjusted

print(f"| Diff Token Size |\nOriginal Lengths: {tokenDiffOriginal}\nAdjusted Lengths: {tokenDiffAdjusted}") # compare size differences using metrics from original and adjusted lengths.

Map: 100%|██████████| 56/56 [00:00<00:00, 625.80 examples/s]


| Diff Token Size |
Original Lengths: 261
Adjusted Lengths: 0





#### Get Model Information and Set up LoRA layers for finetuning.
LoRA (Low-Rank Adaptation) is a finetuning technique which freezes the pre-trained model weights and instead injects trainable low-rank decomposition matrices into each layer of the Transformer architecture (https://arxiv.org/abs/2106.09685).
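
To see why this is parameter-efficient: for a frozen linear layer with weight `W` of shape `out x in`, LoRA learns two small matrices `B` (`out x r`) and `A` (`r x in`), and the layer output becomes `W x + (alpha / r) * B A x`. The sketch below counts the parameters this adds for a single layer. The layer shape is taken from `fc1` in the architecture printed later in this notebook, and `r` and `alpha` match the configuration in the next cell; the helper names are made up for this example.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * in_features + out_features * rank

def full_param_count(in_features: int, out_features: int) -> int:
    """Weights in the frozen linear layer itself (bias ignored)."""
    return in_features * out_features

in_features, out_features = 2560, 10240  # fc1 shape from the printed architecture
rank, alpha = 64, 16                     # values used in the LoraConfig in this notebook

added = lora_param_count(in_features, out_features, rank)
frozen = full_param_count(in_features, out_features)
print(f"LoRA adds {added:,} trainable parameters next to {frozen:,} frozen weights")
print(f"ratio: {added / frozen:.2%}, update scaling alpha/r = {alpha / rank}")
```

Halving `r` halves the number of added parameters, which is a quick way to reason about the rank setting before running `print_trainable_parameters()`.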

In [9]:
loraConfig:object=LoraConfig(
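    r=64, # Rank of the low-rank matrices; controls the number of parameters trained. A higher rank means more trainable parameters and larger update matrices (and more compute cost). Play with this and see how it affects the number of trainable params.
    lora_alpha=16, # LoRA scaling factor: the learned update is scaled by alpha/r
    target_modules=[ # modules (eg attention and MLP projections) to apply LoRA matrices to. NOTE: in the architecture printed below, the attention projections appear as q_proj/k_proj/v_proj without LoRA wrappers, so "Wqkv" appears to match nothing in this model version and only fc1/fc2 receive adapters.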
        "Wqkv",
        "fc1",
        "fc2",
    ],
    bias="none", # should bias parameters also be trained: none, all, lora_only
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model:object=get_peft_model(model, loraConfig) # parameter-efficient fine tune - freeze pretrained model parameters and add small number of tunable adapters on top.
print(f"Model Architecture:\n{model}")
model.print_trainable_parameters() # print trainable parameters

Model Architecture:
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiAttention(
              (q_proj): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
              (k_proj): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
              (v_proj): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
              (dense): Linear8bitLt(in_features=2560, out_features=2560, bias=True)
              (rotary_emb): PhiRotaryEmbedding()
            )
            (mlp): PhiMLP(
              (activation_fn): NewGELUActivation()
              (fc1): lora.Linear8bitLt(
                (base_layer): Linear8bitLt(in_features=2560, out_features=10240, bias=True)
                (lora_dropout): ModuleDict(
          

#### Training

In [10]:
# Setup train run parameters
project:str="Finetune"
modelName:str=baseModelName.replace("\\", "_").replace("/", "_")
run_name:str=f"{project}-{modelName}"
output_dir:str="./" + run_name # this will be the dir to store run information and model weights

# get GPU count for CUDA.
print(f"GPU COUNT: {torch.cuda.device_count()}")
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

GPU COUNT: 1


In [11]:
stepsSaveEvalLoss:int=50
numberStepPartitions:int=20 # stepsSaveEvalLoss multiplied by numberStepPartitions gives max_steps - done so that the last step is always a multiple of stepsSaveEvalLoss and is therefore saved.
max_steps:int=stepsSaveEvalLoss*numberStepPartitions
trainer:object=Trainer(
    model=model,
    train_dataset=tokenizedTrain,
    eval_dataset=tokenizedVal,
    args=TrainingArguments(
        output_dir=output_dir, # output dir defined above
        warmup_steps=1, # number of steps for the warmup phase where the learning rate is gradually increased from a low value to the maximum value where normal schedule begins - can improve the stability and performance.
        per_device_train_batch_size=2, # specifies the batch size per device for training. It should be an integer that is greater than zero.
        gradient_accumulation_steps=1, # number of batches whose gradients are accumulated before each optimizer step. It should be an integer that is greater than zero. The effective batch size is the product of this argument and the per_device_train_batch_size
        max_steps=max_steps, # max number of training steps
        learning_rate=2.5e-5, # aim for small LR for finetuning scenarios
        optim="paged_adamw_8bit", # optimiser used to update the trainable weights (paged 8-bit AdamW)
        logging_dir=f"{output_dir}/logs", # Where logs are stored for training
        logging_steps=stepsSaveEvalLoss, # train loss cadence
        do_eval=True, # perform eval on eval set
        evaluation_strategy="steps", # eval model loss set to steps
        eval_steps=stepsSaveEvalLoss, # eval loss cadence
        save_strategy="steps", # checkpoint model progress strategy set to steps
        save_steps=stepsSaveEvalLoss, # save every x steps cadence
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False), # mlm - masked language modeling
)
model.config.use_cache = False  # silence warnings for training

# Train - The output should be a table with a row at stepsSaveEvalLoss cadence and columns as Step, Training loss and Validation Loss.
trainer.train()



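| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 2.1625        | 2.003207        |
| 100  | 1.779         | 1.718166        |
| 150  | 1.5935        | 1.527336        |
| 200  | 1.3822        | 1.365474        |
| 250  | 1.2834        | 1.209522        |
| 300  | 1.1508        | 1.050137        |
| 350  | 1.0162        | 0.932797        |
| 400  | 0.9472        | 0.808629        |
| 450  | 0.8156        | 0.709524        |
| 500  | 0.7579        | 0.61675         |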


Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-50 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-100 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-150 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-250 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./Finetune-microsoft_phi-2/checkpoint-300 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./

TrainOutput(global_step=1000, training_loss=0.8971948947906494, metrics={'train_runtime': 1119.2081, 'train_samples_per_second': 1.787, 'train_steps_per_second': 0.893, 'total_flos': 9658921328640000.0, 'train_loss': 0.8971948947906494, 'epoch': 35.71})
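
The `epoch` value in the output follows from the batch settings. The sketch below reproduces the arithmetic under simplifying assumptions: it ignores the rounding that happens when the dataset size is not a multiple of the batch size, and `training_schedule` is a hypothetical helper written for this example.

```python
def training_schedule(num_examples: int, per_device_batch: int, grad_accum: int,
                      num_devices: int, max_steps: int) -> dict:
    """Relate optimizer steps to epochs for a fixed-step training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = num_examples / effective_batch
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "epochs": max_steps / steps_per_epoch,
    }

# 56 training examples, batch size 2, no accumulation, 1 GPU, 50 * 20 = 1000 steps
schedule = training_schedule(num_examples=56, per_device_batch=2, grad_accum=1,
                             num_devices=1, max_steps=1000)
print(schedule)
```

With these settings each example is seen roughly 36 times, which is worth keeping in mind when choosing max_steps for a small dataset.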

#### Inference of trained model

Kill the GPU process to completely clear memory (find its PID with nvidia-smi, then kill it):

    nvidia-smi
    kill [PID]
OR

    Kernel > Restart Kernel

In [1]:
# Empty VRAM and clear model, trainer variables
try: 
    del model
    del tokenizer
    del trainer
    import gc
    gc.collect()
except NameError: # nothing to delete if the variables were never defined (eg after a kernel restart)
    pass

# load libraries for inference
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# memory cleared so recreate parameters
baseModelName:str="microsoft/phi-2"
project:str="Finetune"
max_steps:int=1000

modelName:str=baseModelName.replace("\\", "_").replace("/", "_")
run_name:str=f"{project}-{modelName}"
output_dir:str="./" + run_name # this will be the dir to store run information and model weights

In [4]:
# reload our base model and tokeniser
modelInference:object=AutoModelForCausalLM.from_pretrained(
    baseModelName,  # Phi2, same as before
    torch_dtype=torch.float32, # fixes issue in inference related to float16 values producing "!!!!" rather than output.
    device_map="auto",                                      
    trust_remote_code=True,
    load_in_8bit=True,
)
tokenizerInference:object=AutoTokenizer.from_pretrained(baseModelName,
                                               add_bos_token=True,
                                               trust_remote_code=True,
                                               use_fast=False)
tokenizerInference.pad_token = tokenizerInference.eos_token

# load finetuned QLoRA adapters which were saved during training
finetunedFolder:str=f"{output_dir}/checkpoint-{max_steps}" # get latest model by default (can change if you see better performance on other models)
FTmodel:object=PeftModel.from_pretrained(modelInference, finetunedFolder) # load FT model

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.32s/it]
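
The cell above loads the latest checkpoint by default. If you would rather pick a checkpoint by its logged validation loss, a minimal sketch could look like the one below. The (step, loss) pairs are copied from the training table shown earlier, which is cut off after step 500; `best_checkpoint` is a hypothetical helper. In a live session the same numbers should be available from `trainer.state.log_history`.

```python
def best_checkpoint(eval_log: list, output_dir: str) -> str:
    """Return the checkpoint folder whose logged validation loss is lowest."""
    if not eval_log:
        raise ValueError("eval_log must not be empty")
    best_step, _ = min(eval_log, key=lambda row: row[1])
    return f"{output_dir}/checkpoint-{best_step}"

# (step, validation loss) pairs copied from the training table shown earlier.
eval_log = [
    (50, 2.003207), (100, 1.718166), (150, 1.527336), (200, 1.365474),
    (250, 1.209522), (300, 1.050137), (350, 0.932797), (400, 0.808629),
    (450, 0.709524), (500, 0.61675),
]
print(best_checkpoint(eval_log, "./Finetune-microsoft_phi-2"))
```

Because the validation set in this notebook is a subset of the training data, the lowest loss will tend to belong to the latest checkpoint; this selection is more informative with a genuinely held-out set.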


We can play with the repetition penalty, which can influence the likelihood of repeated content. A higher repetition penalty makes the model less likely to generate repeated phrases or words in the text, while a lower repetition penalty allows more repetition.
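
To give an intuition for what the penalty does, the sketch below applies the commonly used rule from the CTRL paper to a toy list of scores: for every token that has already appeared, a positive score is divided by the penalty and a negative score is multiplied by it, so in both cases the token becomes less likely. To my understanding this is also how the `repetition_penalty` argument of `generate` behaves, but treat the code as an illustration on made-up numbers rather than the library's implementation.

```python
def apply_repetition_penalty(logits: list, previous_token_ids: list, penalty: float) -> list:
    """Lower the scores of tokens that already appeared in the sequence."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Negative scores are multiplied and positive ones divided, so both shrink in likelihood.
        adjusted[token_id] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = [2.0, -1.0, 0.5, 3.0]  # made-up scores for a 4-token vocabulary
already_generated = [0, 1]      # token ids seen so far
print(apply_repetition_penalty(logits, already_generated, penalty=1.0))  # no penalty
print(apply_repetition_penalty(logits, already_generated, penalty=2.0))
```

A penalty of 1.0, as used in the cell below, leaves the scores unchanged; values slightly above 1.0 are a reasonable first thing to try if the output loops.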

!NOTE: If the model produces "!!!!" instead of text output, the cause is loading the model with torch_dtype=torch.float16 rather than torch.float32. See here for more details: https://huggingface.co/microsoft/phi-2/discussions/89


In [5]:
# model hyperparameters
repetition_penalty:float=1.0
max_tokens:int=200

# test a prompt
testPrompt:str="How do I install Microsoft 365 or Office?"

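formattedPrompt:str=f"question: {testPrompt}\nanswer: " # similar to the training set formatting, see above. Note that the training text used capitalised "Question:"/"Answer:", so matching that casing exactly may be worth trying.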
tokenisedPrompt:dict=tokenizerInference(formattedPrompt, return_tensors="pt").to("cuda") # tokenise prompt
FTmodel.eval() # set in inference mode
with torch.no_grad():
    response:str=tokenizerInference.decode(FTmodel.generate(**tokenisedPrompt, max_new_tokens=max_tokens, repetition_penalty=repetition_penalty)[0], skip_special_tokens=True)
    print(response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


question: How do I install Microsoft 365 or Office?
answer:  [Install Office](https://go.microsoft.com/fwlink/p/?LinkID=403719) and [Install Microsoft 365](https://go.microsoft.com/fwlink/p/?LinkID=808164) are the best ways to install Microsoft 365 or Office. You can also download and install older versions of Office on PC or Mac for free. Learn more about installing Office apps. [Office Home & Business](https://go.microsoft.com/fwlink/p/?LinkID=808164) and [Office Home & Student](https://go.microsoft.com/fwlink/p/?LinkID=808164) are subscription plans that include the Office apps, along with additional features. Learn more about Microsoft 365 subscriptions. [Office for Mac](https://go.microsoft.com/fwlink/p/?LinkID=808164) and [Office for Windows tablets](https://
