<a href="https://colab.research.google.com/github/mille055/duke_chatbot/blob/main/notebooks/chatbot_Finetune_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

## Chad Miller
## AIPI590 Project 2

This notebook fine-tunes an LLM (Mistral 7B) for the Duke chatbot, using 4-bit quantization and LoRA adapters on question-answer pairs from the AIPI FAQ.


In [1]:
!git clone 'https://github.com/mille055/duke_chatbot.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install openpyxl
!pip install xlrd
!pip install openai
!pip install huggingface_hub

Cloning into 'duke_chatbot'...
remote: Enumerating objects: 139, done.
remote: Counting objects: 100% (78/78), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 139 (delta 49), reused 57 (delta 35), pack-reused 61
Receiving objects: 100% (139/139), 17.05 MiB | 11.55 MiB/s, done.
Resolving deltas: 100% (66/66), done.
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.0-py3-none-manylinux_2_24_x86_64.whl (102.2 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch->bitsandbytes)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch->bitsandbytes)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux

In [2]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
    LlamaTokenizer,
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os
import re
import torch
from datasets import load_dataset, Dataset
from trl import SFTTrainer
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
import numpy as np
from google.colab import userdata
import json
from sklearn.model_selection import train_test_split
from huggingface_hub import HfApi






In [3]:
prompt_instruction2 = '''
You are a trusted advisor giving information to potential applicants to the Duke AI Program, responding to questions about the Duke AI Program with informative, accurate, and helpful answers.
'''

In [4]:
prompt_instruction = "You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions"

In [21]:
### utilities

def convert_json_qa_to_df(input_filename):
    """Load a JSON file with a top-level "FAQs" list into a question/answer dataframe."""
    with open(input_filename, 'r', encoding='utf-8') as json_file:
        faq_data = json.load(json_file)

    # Build the dataframe in one step rather than appending row by row
    rows = [{'question': faq["question"], 'answer': faq["answer"]} for faq in faq_data["FAQs"]]
    return pd.DataFrame(rows, columns=['question', 'answer'])



def create_prompt_dataframe(df, prompt_instruction=prompt_instruction):
  """
  Build the prompt dataframe used for training and evaluation.

  Args:
    df: Dataframe with 'question' and 'answer' columns.
    prompt_instruction: Instruction text placed before each query.

  Returns:
    A dataframe with a 'text' column (instruction plus query) and a
    'labels' column (the reference answer).
  """
  df1 = pd.DataFrame()
  # Column-wise assignment covers every row at once, so no row loop is needed
  df1['text'] = '### ' + prompt_instruction + ' ### Query: ' + df['question']
  df1['labels'] = df['answer']
  return df1

def formatting_func(question_text, answer_text, prompt_instruction=prompt_instruction):
    text = f"### {prompt_instruction} \n### Query: {question_text} \n### Answer: {answer_text}"
    return text


def get_response(prompt, pipe):
  sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=100,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
  )
  answer = sequences[0]['generated_text']
  cleaned_answer = answer.replace(prompt, '', 1)

  #print('cleaned_answer is ', cleaned_answer)
  return cleaned_answer


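# NOTE: test_model is not called anywhere in this notebook. It depends on
# helpers (extract_and_parse_json2, response_score) that are not defined in
# the cells shown, so it will raise a NameError if run as-is.
def test_model(df, pipe, prompt_instruction=prompt_instruction):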
  overall_score = 0
  results_list = []
  for index, row in df.iterrows():
    # get a response and extract json portion from it
    prompt = prompt_instruction + row['text']
    predicted_answer = get_response(prompt, pipe)
    print('********\n')
    #print('predicted_answer is ', predicted_answer)
    extracted_answer = extract_and_parse_json2(predicted_answer)
    print('********\n')
    print('extracted_answer is ', extracted_answer, type(extracted_answer))

    # get the ground truth answer
    true_answer = row['labels']
    #print('true_answer', true_answer, type(true_answer))
    true_answer_json = json.loads(true_answer.replace("'", '"'))

    print('true answer json:', true_answer_json, type(true_answer_json))

    # #predicted_answer = json.loads(predicted_answer)
    # print('predicted_answer:', predicted_answer, type(predicted_answer))

    score, accession, predicted_order, predicted_protocol, predicted_comments = response_score(extracted_answer, true_answer_json)
    overall_score += score
    print(f"Progress: case {index+1} of {len(df)}")
    print(f"score this case: {score}")

    # Accumulate the case results
    results_list.append({
            "index": index,


            "protocol": true_answer_json['predicted_protocol'],
            "predicted_protocol": predicted_protocol,
            "order": true_answer_json['predicted_order'],
            "predicted_order": predicted_order,
            "comments": true_answer_json['predicted_comments'],
            "predicted_comments": predicted_comments,
            "score": score
        })

  results = pd.DataFrame(results_list)
  print(results)
  print(f"Average score: {overall_score/len(df)}")
  results.to_csv('/content/CT_Protocol/data/results.csv', index=False)

  return overall_score/len(df)
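
`convert_json_qa_to_df` expects a JSON file with a top-level `"FAQs"` list whose items carry `question` and `answer` keys. A minimal sketch of that shape follows, using a made-up two-item payload; the sample questions and the `faq_rows` helper are illustrative and not taken from the repository.

```python
import json

# Hypothetical sample payload; only the key names mirror what
# convert_json_qa_to_df reads ("FAQs", "question", "answer").
SAMPLE_JSON = """
{
  "FAQs": [
    {"question": "How do I register?", "answer": "Use the registration portal."},
    {"question": "Is aid available?", "answer": "Ask the program office."}
  ]
}
"""

def faq_rows(raw_json):
    """Return a list of {'question', 'answer'} dicts from an FAQ JSON string."""
    data = json.loads(raw_json)
    return [{"question": f["question"], "answer": f["answer"]} for f in data["FAQs"]]

rows = faq_rows(SAMPLE_JSON)
print(len(rows), rows[0]["question"])
```

A list like `rows` can be passed directly to `pd.DataFrame(rows)`.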



## Build Datasets

In [14]:
filename = '/content/duke_chatbot/data/extracted_data_from_faq.json'
text_df = convert_json_qa_to_df(filename)
text_df.head()


| | question | answer |
|---|---|---|
| 0 | What classes are being offered to AIPI student... | In the Fall semester of the AIPI program stude... |
| 1 | When will the list of Fall 2021 courses be ava... | The list of all Fall 2021 courses offered by t... |
| 2 | When can I register for classes? | Fall 2021 course registration for all graduate... |
| 3 | How do I register for classes? | All students register for classes through Duke... |
| 4 | What classes outside of the AIPI curriculum ca... | Approved AIPI electives are listed on the AIPI... |


In [22]:
prompt_df = text_df.copy()
prompt_df = create_prompt_dataframe(prompt_df)
#prompt_df['text'] = '### ' + prompt_instruction + ' ### Query: ' + prompt_df['question']
#prompt_df['labels'] = prompt_df['answer']
prompt_df.head()

| | text | labels |
|---|---|---|
| 0 | ### You are a trusted advisor in this content,... | In the Fall semester of the AIPI program stude... |
| 1 | ### You are a trusted advisor in this content,... | The list of all Fall 2021 courses offered by t... |
| 2 | ### You are a trusted advisor in this content,... | Fall 2021 course registration for all graduate... |
| 3 | ### You are a trusted advisor in this content,... | All students register for classes through Duke... |
| 4 | ### You are a trusted advisor in this content,... | Approved AIPI electives are listed on the AIPI... |


In [23]:
dataset = Dataset.from_pandas(prompt_df)



In [24]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=12)
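
For a float `test_size` with no `train_size`, scikit-learn rounds the test share up and gives the remainder to the training set. A small sketch of that arithmetic, assuming the dataset has 38 rows (inferred from the 30 training and 8 evaluation examples the trainer reports later, not stated directly in the notebook); `split_sizes` is a hypothetical helper:

```python
import math

def split_sizes(n_samples, test_size):
    """Mirror scikit-learn's rounding for a float test_size with no train_size."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 38 rows is an assumption inferred from the trainer's "Map" output.
n_train, n_test = split_sizes(38, 0.2)
print(n_train, n_test)
```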


In [25]:
test_data_df = pd.DataFrame(test_data)
test_data_df.iloc[0].text

'### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: What do I do if I want to change my elective track?'

## Base Model Performance

In [26]:
from google.colab import userdata
token = userdata.get('HUGGINGFACE_TOKEN')
api = HfApi(token=token)

# log into HuggingFace

!huggingface-cli login --token $token



Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [27]:
# base model from huggingFace or path to model
base_model = "mistralai/Mistral-7B-v0.1"
new_model = "auto_protocol"



In [28]:
# configure the model
tokenizer = AutoTokenizer.from_pretrained(base_model)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

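model = AutoModelForCausalLM.from_pretrained(
        base_model,
        # 4-bit loading is already requested by bnb_config; newer transformers
        # releases may reject load_in_4bit when quantization_config is also passed
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )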

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [41]:
# Iterate through the rows of test_data_df and print the model output for each prompt next to the reference answer
def evaluate_model(test_data_df, pipe):
  for index, row in test_data_df.iterrows():
    prompt = row['text']
    predicted_answer = get_response(prompt, pipe)
    print(f"Index: {index}")
    print(f"Text: {row['text']}")
    print(f"Predicted Answer: {predicted_answer}")
    print(f"True Answer: {row['labels']}")
    print("------------------------------------------------------------------------------------------------------")
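
A base causal language model has no built-in notion of where the answer ends, so the text returned by `get_response` can run on into further `### Query:` blocks until `max_new_tokens` is reached, as the test outputs later in this notebook show. One possible post-processing step, sketched here with a hypothetical helper `trim_continuation` (not part of the notebook), keeps only the text before the next `###` marker:

```python
def trim_continuation(generated, marker="###"):
    """Keep the text before the first marker and drop a leading 'Answer:' label.

    Hypothetical helper: the marker and the 'Answer:' label are assumptions
    based on the prompt format used in this notebook.
    """
    head = generated.split(marker, 1)[0].strip()
    if head.startswith("Answer:"):
        head = head[len("Answer:"):].strip()
    return head

raw = "\n\nAnswer:\n\nContact your advisor.\n\n### Query: What next?\n\nAnswer:\n"
print(trim_continuation(raw))
```

This is only a heuristic: it will not catch continuations that use a different marker, so a stop criterion at generation time may be a better fit in practice.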


## Train the Model


In [30]:
# Load base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
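model = AutoModelForCausalLM.from_pretrained(
        base_model,
        # 4-bit loading is already requested by bnb_config; newer transformers
        # releases may reject load_in_4bit when quantization_config is also passed
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)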


model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing during training
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token



# Free cached GPU memory that is no longer in use
torch.cuda.empty_cache()



In [None]:
# # count training tokens

# tokenizer_ = LlamaTokenizer.from_pretrained("cognitivecomputations/dolphin-llama2-7b")
# tokens = tokenizer_.tokenize(dataset2.to_pandas().to_string())
# len(tokens)

In [31]:
# Add LoRA adapters to the attention projections and the MLP gate projection
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
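
As a rough check on the adapter size, LoRA adds `r * (d_in + d_out)` weights to each targeted linear layer and scales the update by `lora_alpha / r`. The sketch below counts them for the five target modules, assuming the layer shapes from the public Mistral-7B-v0.1 configuration (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these shapes are not stated in the notebook.

```python
R = 64            # LoRA rank from peft_config
LORA_ALPHA = 16   # from peft_config
NUM_LAYERS = 32   # assumed from the public Mistral-7B-v0.1 config

# (d_in, d_out) per targeted module; assumed shapes, see the note above.
MODULE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}

def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(i, o, R) for i, o in MODULE_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / R
print(f"{total:,} trainable LoRA weights, scaling {scaling}")
```

Under these assumptions the adapters come to roughly 92 million trainable weights, a small fraction of the 7B base model.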

In [32]:
# Setting hyperparameters
training_arguments = TrainingArguments(
    output_dir="/content/duke_chatbot/data",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,  # has no effect with the "constant" scheduler below; "constant_with_warmup" would use it
    group_by_length=True,
    lr_scheduler_type="constant",
)


In [None]:
# train_dataset = Dataset.from_dict(train_data)
# eval_dataset = Dataset.from_dict(test_data)

In [None]:
# # create and save train_data_df and the training dataset
# def create_train_data_df(train_data, prompt_instruction = prompt_instruction2):
#   '''
#   Create the training dataset in the format required for the model
#   Input: train_data: a list of dictionaries
#   Input: prompt_instruction: a string
#   Output: train_data_df: a dataframe
#   '''
#   train_data_df = pd.DataFrame(train_data)
#   maker_df = train_data_df.copy()
#   for index, row in maker_df.iterrows():
#     maker_df.loc[index, 'text'] = f"""<s>[INST] {prompt_instruction}{row['text']} [/INST] \\n {row['labels']} </s>"""
#     maker_df.loc[index, 'labels'] = row['labels']

#   maker_df.head()
#   maker_df.drop(columns=['prompt_question_json', '__index_level_0__'], inplace=True)
#   #train_dataset = Dataset.from_pandas(maker_df)
#   train_dataset = Dataset(pa.Table.from_pandas(maker_df))

#   return train_dataset



In [34]:
train_data

{'text': ['### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How do I get my NetID and password?',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: Is financial aid available to AIPI students?',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: When will I get access to my Duke email?',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How many classes should I register for?',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: When will the lis

In [42]:
train_df = pd.DataFrame(train_data)
#train_df.head()
test_df = pd.DataFrame(test_data)
#test_df.head()

from datasets import Dataset

# Assuming train_df and test_df are pandas DataFrames with your training and testing data
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(test_df)
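
Note that the `text` column built earlier contains only the instruction and the query; the answers live in `labels`, and with `dataset_text_field="text"` only the `text` field is tokenized for training. A minimal sketch of building training strings that also contain the answer, following the layout of `formatting_func` above (the `build_training_texts` helper, the sample row, and the short instruction are hypothetical):

```python
def build_training_texts(rows, instruction):
    """Join instruction, query, and answer into one training string per row."""
    return [
        f"### {instruction} \n### Query: {row['question']} \n### Answer: {row['answer']}"
        for row in rows
    ]

# Hypothetical sample row; real rows would come from the FAQ dataframe.
sample_rows = [{"question": "How do I register?", "answer": "Use the portal."}]
texts = build_training_texts(sample_rows, "You are a helpful advisor.")
print(texts[0])
```

If training strings take this form, the inference prompt would presumably end with `### Answer:` so that the model continues from the same position it saw during training.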

In [44]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset = eval_dataset,
    peft_config=peft_config,
    max_seq_length= 4000,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Map:   0%|          | 0/8 [00:00<?, ? examples/s]



In [45]:
# Training the model
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 1 | 3.3866 |
| 2 | 3.1855 |
| 3 | 2.3984 |
| 4 | 1.9114 |
| 5 | 1.3642 |
| 6 | 1.0048 |
| 7 | 1.0407 |
| 8 | 0.8174 |


TrainOutput(global_step=8, training_loss=1.8886143863201141, metrics={'train_runtime': 8.2262, 'train_samples_per_second': 3.647, 'train_steps_per_second': 0.973, 'total_flos': 77532390014976.0, 'train_loss': 1.8886143863201141, 'epoch': 1.0})
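
The reported numbers are consistent with each other: 30 training examples at a per-device batch size of 4 with no gradient accumulation give 8 optimizer steps, and the mean of the eight logged losses matches `training_loss` up to rounding. A short check, where the count of 30 is taken from the `Map` output above and a single device is assumed:

```python
import math

train_examples = 30   # from the trainer's Map output
batch_size = 4        # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (batch_size * grad_accum))

logged_losses = [3.3866, 3.1855, 2.3984, 1.9114, 1.3642, 1.0048, 1.0407, 0.8174]
mean_loss = sum(logged_losses) / len(logged_losses)

print(steps_per_epoch, round(mean_loss, 4))
```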


In [47]:
# Save the fine-tuned LoRA adapter to a local directory (named after the Hub repo id)

trainer.model.save_pretrained('mille055/duke_chatbot')
model.config.use_cache = True


In [None]:
from huggingface_hub import notebook_login

# Login to Hugging Face within the notebook to store your credentials (if not using CLI)
notebook_login()


In [48]:
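# Read the write token from Colab secrets instead of hardcoding it in the notebook.
# 'HUGGINGFACE_WRITE_TOKEN' is an assumed secret name; use whichever secret holds your write token.
write_token = userdata.get('HUGGINGFACE_WRITE_TOKEN')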

In [60]:
trainer.model.push_to_hub("mille055/duke_chatbot", token=write_token)


CommitInfo(commit_url='https://huggingface.co/mille055/duke_chatbot/commit/249108edc41d779c5d2e2657a5a91f7898865eac', commit_message='Upload model', commit_description='', oid='249108edc41d779c5d2e2657a5a91f7898865eac', pr_url=None, pr_revision=None, pr_num=None)

In [50]:

tokenizer.push_to_hub("mille055/duke_chatbot", token=write_token)



CommitInfo(commit_url='https://huggingface.co/mille055/duke_chatbot/commit/045ab663a71db4ef6979462d3b077319cf59e8cd', commit_message='Upload tokenizer', commit_description='', oid='045ab663a71db4ef6979462d3b077319cf59e8cd', pr_url=None, pr_revision=None, pr_num=None)

## Test the Model

In [63]:

pipe = pipeline(
    "text-generation",
    model='mille055/duke_chatbot',
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)







In [64]:
evaluate_model(test_data_df = test_df, pipe=pipe)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Index: 0
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: What do I do if I want to change my elective track?
Predicted Answer: 

Answer:

You can change your elective track at any time. You can do this by logging into your student portal and clicking on the “Change Elective Track” button. You can also contact your advisor for assistance.

### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: What are the requirements for the elective track?

Answer:


True Answer: If you wish to change your elective track, there is no formal action that you need to take. However, it is a good idea to speak with the program director about your elective course plans, as they can help steer you toward courses that align with your professional aspirations.
----------------------------------



Index: 1
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: Can I change my tuition billing basis from per-semester to per-credit?
Predicted Answer: 

Answer:

Yes, you can change your tuition billing basis from per-semester to per-credit.

To do this, you will need to contact the Office of the Registrar and request a change to your billing basis.

The Office of the Registrar will then update your billing basis in the system and you will be billed accordingly.

### Query: Can I change my tuition billing basis from
True Answer: Yes, your tuition can be changed from pay-by-semester to pay-by-credit if you are switching to part-time status. (Please note that F-1 visaholders must be enrolled full-time for at least 9.0 credits per semester). If you intend to take less than the typical load (four courses for full-time), please contact Kelsey Liddle (kelsey.liddle@duke.edu), the Pr



Index: 2
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How do I get a Teaching Assistant (TA) position?
Predicted Answer: 

You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions

Query: How do I get a Teaching Assistant (TA) position?

Answer:

The first step is to find out if the department you are interested in has TA positions available. You can find this information on the department's website or by contacting the department directly.

Once you have confirmed that there
True Answer: Teaching assistantships are a common way that AIPI students can work on campus, earn money, and give of their time to the AIPI community. Most often, course instructors approach students who have done well in their course and ask them to TA in a subsequent semester. Other times, students will voice t



Index: 3
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: Is there a formal process for designating my elective track?
Predicted Answer: 

Answer:

There is no formal process for designating your elective track. You can choose to take any of the courses listed in the elective track.

### Query: Is there a formal process for designating my elective track?

Answer:

There is no formal process for designating your elective track. You can choose to take any of the courses listed in the elective track.

### Query: Is there a formal process
True Answer: No, there is not currently a formal process to designate your elective track. We do not require students to rigidly adhere to one elective track. Students may choose electives that fit their professional goals. The elective tracks are meant as guides for students to align and develop skills toward a particular area, and those stu



Index: 4
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: Where can I find information about student employment?
Predicted Answer: 

Answer:

The Office of Student Employment is the best place to start.

### Query: Where can I find information about student employment?

Answer:

The Office of Student Employment is the best place to start.

### Query: Where can I find information about student employment?

Answer:

The Office of Student Employment is the best place to start.

### Query: Where can I find information about student
True Answer: DukeList is the best place to look for these opportunities, but other opportunities may be advertised via email or word-of-mouth. Please see DukeList for more information.
------------------------------------------------------------------------------------------------------




Index: 5
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How will classes be offered in Fall 2021 (e.g., all in-person, online, or a mix of the two)?
Predicted Answer: 

Answer:

The University of Michigan is planning for a return to in-person instruction in Fall 2021. We are working with our faculty to develop a variety of instructional formats that will allow us to provide the best possible educational experience for our students while also prioritizing the health and safety of our community.

### Query: What is the University of Michigan doing to ensure the health and safety of students, faculty, and staff?

Answer
True Answer: While Duke will offer classes fully in-person for the Fall 2021 semester, we understand that the COVID-19 pandemic has created travel and visa challenges for many of our international students. AIPI classes will be offered both in-person and onl



Index: 6
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How much does it cost to audit a course?
Predicted Answer: 

Answer:

The cost to audit a course is $100 per credit hour.

## 2. You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions ### Query: How much does it cost to audit a course?

Answer:

The cost to audit a course is $100 per credit hour.

## 3. You are a trusted
True Answer: For AIPI students who pay tuition on a pay-by-semester basis (as is the case for all full-time residential AIPI students), there is no charge for auditing a course. For AIPI Online students who pay tuition on a pay-by-credit basis, there is a charge of $535 per audited course.
------------------------------------------------------------------------------------------------------
Index: 7
Text: ### You