<a href="https://colab.research.google.com/github/mille055/duke_chatbot/blob/main/notebooks/chatbot_Finetune_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png" alt="Duke AIPI logo"></a>

## Chad Miller
## AIPI590 Project 2

This notebook fine-tunes an LLM (Mistral 7B) on Duke AIPI FAQ question-answer pairs for use in the chatbot.


In [1]:
!git clone 'https://github.com/mille055/duke_chatbot.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece
!pip install openpyxl
!pip install xlrd
!pip install openai
!pip install huggingface_hub


In [2]:
import json
import os
import re

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import torch
from datasets import Dataset, load_dataset
from google.colab import userdata
from huggingface_hub import HfApi, notebook_login
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    LlamaTokenizer,
    TrainingArguments,
    logging,
    pipeline,
)
from trl import SFTTrainer




In [3]:
prompt_instruction2 = '''
You are a trusted advisor giving information to potential applicants to the Duke AI Program, responding to questions about the Duke AI Program with informative, accurate, and helpful answers.
'''

In [4]:
prompt_instruction = "You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions"

In [15]:
### utilities

def convert_json_qa_to_df(input_filename):
    with open(input_filename, 'r', encoding='utf-8') as json_file:
        faq_data = json.load(json_file)

    text_df = pd.DataFrame(columns=['question', 'answer'])

    for faq in faq_data["FAQs"]:
        new_row = {'question': faq["question"], 'answer': faq["answer"]}
        text_df.loc[len(text_df)] = new_row

    return text_df



def create_prompt_dataframe(df, prompt_instruction=prompt_instruction):
  """
  Build the prompt/label dataframe used for fine-tuning and evaluation.

  Args:
    df: A dataframe with 'question' and 'answer' columns.
    prompt_instruction: The instruction text placed at the start of every prompt.

  Returns:
    A dataframe with a 'text' column (instruction plus query, ending in an
    empty '### Answer: ' slot) and a 'labels' column (the reference answer).
  """
  # Column-wise string concatenation builds every prompt at once, so no row loop is needed.
  df1 = pd.DataFrame(index=df.index)
  df1['text'] = '### ' + prompt_instruction + ' \n### Query: ' + df['question'] + ' \n### Answer: '
  df1['labels'] = df['answer']
  return df1

def formatting_func(question_text, answer_text, prompt_instruction=prompt_instruction):
    text = f"### {prompt_instruction} \n### Query: {question_text} \n### Answer: {answer_text}"
    return text


def get_response(prompt, pipe):
  sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=200,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
  )
  answer = sequences[0]['generated_text']
  cleaned_answer = answer.replace(prompt, '', 1)

  #print('cleaned_answer is ', cleaned_answer)
  return cleaned_answer


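# NOTE: test_model appears to be carried over from another project. It depends on
# extract_and_parse_json2 and response_score, which are not defined in this
# notebook, and it is not called below; evaluate_model is used instead.
def test_model(df, pipe, prompt_instruction=prompt_instruction):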
  overall_score = 0
  results_list = []
  for index, row in df.iterrows():
    # get a response and extract json portion from it
    prompt = prompt_instruction + row['text']
    predicted_answer = get_response(prompt, pipe)
    print('********\n')
    #print('predicted_answer is ', predicted_answer)
    extracted_answer = extract_and_parse_json2(predicted_answer)
    print('********\n')
    print('extracted_answer is ', extracted_answer, type(extracted_answer))

    # get the ground truth answer
    true_answer = row['labels']
    #print('true_answer', true_answer, type(true_answer))
    true_answer_json = json.loads(true_answer.replace("'", '"'))

    print('true answer json:', true_answer_json, type(true_answer_json))

    # #predicted_answer = json.loads(predicted_answer)
    # print('predicted_answer:', predicted_answer, type(predicted_answer))

    score, accession, predicted_order, predicted_protocol, predicted_comments = response_score(extracted_answer, true_answer_json)
    overall_score += score
    print(f"Progress: case {index+1} of {len(df)}")
    print(f"score this case: {score}")

    # Accumulate the case results
    results_list.append({
            "index": index,


            "protocol": true_answer_json['predicted_protocol'],
            "predicted_protocol": predicted_protocol,
            "order": true_answer_json['predicted_order'],
            "predicted_order": predicted_order,
            "comments": true_answer_json['predicted_comments'],
            "predicted_comments": predicted_comments,
            "score": score
        })

  results = pd.DataFrame(results_list)
  print(results)
  print(f"Average score: {overall_score/len(df)}")
  results.to_csv('/content/CT_Protocol/data/results.csv', index=False)

  return overall_score/len(df)
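
The helpers above expect the FAQ file to hold a top-level `FAQs` list of question/answer objects, and they wrap each question in a `### ... ### Query: ... ### Answer:` template. Below is a minimal standard-library sketch of that flow. The sample record and the shortened instruction string are made up for illustration and are not taken from the dataset, and `build_prompt` is a hypothetical stand-in for `formatting_func`.

```python
import json

# Hypothetical FAQ payload with the same shape convert_json_qa_to_df reads.
sample_json = '{"FAQs": [{"question": "How do I register?", "answer": "Use DukeHub."}]}'

# Shortened stand-in for prompt_instruction (assumption, for readability).
instruction = "You are a trusted advisor."


def build_prompt(question, answer="", instruction=instruction):
    """Mirror formatting_func; leave answer empty to get an inference prompt."""
    return f"### {instruction} \n### Query: {question} \n### Answer: {answer}"


faqs = json.loads(sample_json)["FAQs"]
rows = [(faq["question"], faq["answer"]) for faq in faqs]

inference_prompt = build_prompt(rows[0][0])  # ends at the empty answer slot
training_text = build_prompt(*rows[0])       # prompt followed by the answer
print(repr(inference_prompt))
print(repr(training_text))
```

The only difference between the two strings is whether the answer fills the final slot, which is the distinction between `create_prompt_dataframe` (empty slot) and `formatting_func` (filled slot).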



## Build Datasets

In [16]:
filename = '/content/duke_chatbot/data/extracted_data_from_faq.json'
text_df = convert_json_qa_to_df(filename)
text_df.head()


| | question | answer |
|---|---|---|
| 0 | What classes are being offered to AIPI student... | In the Fall semester of the AIPI program stude... |
| 1 | When will the list of Fall 2021 courses be ava... | The list of all Fall 2021 courses offered by t... |
| 2 | When can I register for classes? | Fall 2021 course registration for all graduate... |
| 3 | How do I register for classes? | All students register for classes through Duke... |
| 4 | What classes outside of the AIPI curriculum ca... | Approved AIPI electives are listed on the AIPI... |


In [17]:
prompt_df = text_df.copy()
prompt_df = create_prompt_dataframe(prompt_df)
#prompt_df['text'] = '### ' + prompt_instruction + ' ### Query: ' + prompt_df['question']
#prompt_df['labels'] = prompt_df['answer']
prompt_df.head()

| | text | labels |
|---|---|---|
| 0 | ### You are a trusted advisor in this content,... | In the Fall semester of the AIPI program stude... |
| 1 | ### You are a trusted advisor in this content,... | The list of all Fall 2021 courses offered by t... |
| 2 | ### You are a trusted advisor in this content,... | Fall 2021 course registration for all graduate... |
| 3 | ### You are a trusted advisor in this content,... | All students register for classes through Duke... |
| 4 | ### You are a trusted advisor in this content,... | Approved AIPI electives are listed on the AIPI... |


In [18]:
prompt_df.iloc[0].text, prompt_df.iloc[0].labels


('### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: What classes are being offered to AIPI students in Fall 2021? \n### Answer: ',
 'In the Fall semester of the AIPI program students take a fixed schedule of courses (electives are taken in the Spring).  Students should plan to register for the following courses: - AIPI 503: Bootcamp [0 units] (On-campus, Online MEng, Online Certificate students) - AIPI 510: Sourcing Data for Analytics [3 units] (On-campus, Online MEng, Online Certificate students) - AIPI 520: Modeling Process & Algorithms [3 units] (On-campus & Online MEng students) - AIPI 530: AI in Practice [3 units] (On-campus students) - MENG 570: Business Fundamentals for Engineers [3 units] (On-campus students completing in 12 months) - AIPI 501: Industry Seminar Series [0 units] (On-campus & Online MEng students)  The full list of Pratt courses will be made available to 

In [19]:
dataset = Dataset(pa.Table.from_pandas(prompt_df))



In [20]:
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=12)


In [22]:
test_data_df = pd.DataFrame(test_data)
test_data_df.iloc[0].text, test_data_df.iloc[0].labels

('### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: What do I do if I want to change my elective track? \n### Answer: ',
 'If you wish to change your elective track, there is no formal action that you need to take. However, it is a good idea to speak with the program director about your elective course plans, as they can help steer you toward courses that align with your professional aspirations.')

In [24]:
train_data_df = pd.DataFrame(train_data)
train_data_df.iloc[0].text, train_data_df.iloc[0].labels

('### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: How do I get my NetID and password? \n### Answer: ',
 'You should receive a separate email from the Office of Information Technology (OIT) with instructions to set up your NetID and email alias. Your NetID is your electronic key to online resources, including your Duke email account, DukeHub, Sakai, MyDuke, Box cloud storage, and more. Please set up your NetID as soon as possible.')

## Base Model Performance

In [25]:

token = userdata.get('HUGGINGFACE_TOKEN')
api = HfApi(token=token)

# log into HuggingFace

!huggingface-cli login --token $token



Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [26]:
# base model from huggingFace or path to model
base_model = "mistralai/Mistral-7B-v0.1"
new_model = "auto_protocol"



In [27]:
# configure the model
tokenizer = AutoTokenizer.from_pretrained(base_model)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        # 4-bit loading is set in bnb_config, so load_in_4bit is not repeated here
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)


In [28]:
def evaluate_model(test_data_df, pipe):
  """Print the model's answer next to the reference answer for each test row."""
  for index, row in test_data_df.iterrows():
    prompt = row['text']
    predicted_answer = get_response(prompt, pipe)
    print(f"Index: {index}")
    print(f"Text: {row['text']}")
    print(f"Predicted Answer: {predicted_answer}")
    print(f"True Answer: {row['labels']}")
    print("------------------------------------------------------------------------------------------------------")
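
`evaluate_model` only prints the predicted and reference answers side by side. If a rough numeric score is wanted, one option (an assumption, not something this notebook does) is a token-overlap F1 between the two strings. `token_f1` below is a hypothetical helper that uses lowercased, whitespace-split tokens; the two sample strings are invented for illustration.

```python
from collections import Counter


def token_f1(predicted, reference):
    """Token-level F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("contact the program director", "speak with the program director")
print(round(score, 3))
```

A score like this rewards word overlap only, so it is best read as a coarse signal alongside the printed answers rather than as a measure of correctness.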


In [29]:
evaluate_model(test_data_df, pipe)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Index: 0
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: What do I do if I want to change my elective track? 
### Answer: 
Predicted Answer: 

> I am a current student in the MBA program at the University of XXXX. I am in my second year and I am currently in the Finance track. I have been thinking about changing my track to the Marketing track. I have been thinking about this for a while now and I am wondering if I can change my track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing track. I am wondering if I can change my track to the Marketing tr

Index: 1
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: Can I change my tuition billing basis from per-semester to per-credit? 
### Answer: 
Predicted Answer: 

> Yes, you can change your tuition billing basis from per-semester to per-credit. This change will take effect for the next semester. To make this change, please contact the Office of the Registrar.

### Query: Can I change my tuition billing basis from per-semester to per-credit? 
### Answer: 

> Yes, you can change your tuition billing basis from per-semester to per-credit. This change will take effect for the next semester. To make this change, please contact the Office of the Registrar.

### Query: Can I change my tuition billing basis from per-semester to per-credit? 
### Answer: 

> Yes, you can change your tuition billing basis from per-semester to per-credit. This change will take effect for
True Answer:

Index: 2
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: How do I get a Teaching Assistant (TA) position? 
### Answer: 
Predicted Answer: 

> I'm a first year student and I'm interested in becoming a TA. I'm wondering what the best way to go about getting a TA position is. I've heard that you can apply for a TA position, but I'm not sure how to do that. I've also heard that you can get a TA position by asking your professors. Is that true?

I'm a first year student and I'm interested in becoming a TA. I'm wondering what the best way to go about getting a TA position is. I've heard that you can apply for a TA position, but I'm not sure how to do that. I've also heard that you can get a TA position by asking your professors. Is that true?

Yes, you can apply for a TA position. You can also ask your professors if they have any openings.
True Answer: Teaching assistantships 

Index: 3
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: Is there a formal process for designating my elective track? 
### Answer: 
Predicted Answer: 

> I am a student in the M.S. program in Computer Science. I am interested in the elective track in Data Science. Is there a formal process for designating my elective track? 

> I have already taken the following courses:

> CSCI 5000: Introduction to Data Science

> CSCI 5000: Introduction to Machine Learning

> CSCI 5000: Introduction to Artificial Intelligence

> CSCI 5000: Introduction to Computer Vision

> CSCI 5000: Introduction to Natural Language Processing

> CSCI 5000: Introduction to Deep Learning

> CSCI 5000: Introduction to Reinforcement Learning

> CSCI 5000: Introduction to Robotics

> CSCI 5000
True Answer: No, there is not currently a formal process to designate your elective track. We do not require stu

Index: 4
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: Where can I find information about student employment? 
### Answer: 
Predicted Answer: 

> The Office of Student Employment is the primary resource for students seeking employment on campus. The Office of Student Employment is located in the Student Center, Room 101.

> The Office of Student Employment is the primary resource for students seeking employment on campus. The Office of Student Employment is located in the Student Center, Room 101.

> The Office of Student Employment is the primary resource for students seeking employment on campus. The Office of Student Employment is located in the Student Center, Room 101.

> The Office of Student Employment is the primary resource for students seeking employment on campus. The Office of Student Employment is located in the Student Center, Room 101.

> The Office of S

Index: 5
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: How will classes be offered in Fall 2021 (e.g., all in-person, online, or a mix of the two)? 
### Answer: 
Predicted Answer: 

### Query: What is the current COVID-19 situation in the area? 
### Answer: 

### Query: What is the current COVID-19 situation on campus? 
### Answer: 

### Query: What is the current COVID-19 situation in the area? 
### Answer: 

### Query: What is the current COVID-19 situation on campus? 
### Answer: 

### Query: What is the current COVID-19 situation in the area? 
### Answer: 

### Query: What is the current COVID-19 situation on campus? 
### Answer: 

### Query: What is the current COVID-19 situation in the area? 
### Answer: 

### Query: What is the current COVID-19 situation on campus? 
### Answer: 

### Query: What is the current COVID-1
True Answer: While Duke will offer classes f

Index: 6
Text: ### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions 
### Query: How much does it cost to audit a course? 
### Answer: 
Predicted Answer: 

The cost of auditing a course is $100 per credit hour. This fee is in addition to the cost of tuition and fees.

### Query: What is the difference between auditing and taking a course for credit? 
### Answer: 

When you audit a course, you do not receive credit for the course and you do not have to take exams or complete assignments. You are not required to attend class, but you are welcome to attend as many classes as you like. When you take a course for credit, you are required to attend class and complete all assignments and exams.

### Query: Can I change my mind and take a course for credit after I have audited it? 
### Answer: 

Yes, you can change your mind and take a course for credit after you have audited it. However, you will need
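
Several base-model outputs above answer the question and then keep generating further `### Query:` / `### Answer:` turns until `max_new_tokens` runs out. A minimal post-processing sketch follows, assuming the `### Query:` marker from the prompt template is a safe place to cut. `keep_first_answer` is a hypothetical helper, not part of the notebook, and the sample string is abridged from the Index 6 output.

```python
def keep_first_answer(generated, stop_marker="### Query:"):
    """Return the text before the first stop_marker, stripped of surrounding whitespace."""
    head, _, _ = generated.partition(stop_marker)
    return head.strip()


# Abridged from the Index 6 output above.
raw_output = (
    "\n\nThe cost of auditing a course is $100 per credit hour."
    "\n\n### Query: What is the difference between auditing and taking a course for credit? "
    "\n### Answer: \n\nWhen you audit a course, you do not receive credit."
)
print(keep_first_answer(raw_output))
```

This trims the run-on turns but does nothing about whether the first answer is factually right; the base model's figure here is not from the FAQ data.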

## Train the Model


In [30]:
# Load base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
        base_model,
        # 4-bit loading is set in bnb_config, so load_in_4bit is not repeated here
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
)


model.config.use_cache = False  # the KV cache is not needed for training and is incompatible with gradient checkpointing
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token



# Ensure to clear cache if anything is not used
torch.cuda.empty_cache()



In [None]:
# # count training tokens

# tokenizer_ = LlamaTokenizer.from_pretrained("cognitivecomputations/dolphin-llama2-7b")
# tokens = tokenizer_.tokenize(dataset2.to_pandas().to_string())
# len(tokens)

In [31]:
#Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
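
To get a feel for how much the `LoraConfig` above adds, the number of trainable adapter weights can be estimated by hand: each adapted linear layer gains two low-rank matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the layer sizes from the public Mistral-7B configuration (hidden size 4096, 8 key/value heads of dimension 128, intermediate size 14336, 32 layers); those numbers are not stated in this notebook.

```python
# Assumed Mistral-7B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096          # hidden_size
KV_DIM = 8 * 128       # num_key_value_heads * head_dim
INTERMEDIATE = 14336   # intermediate_size
NUM_LAYERS = 32        # num_hidden_layers
RANK = 64              # r in the LoraConfig above

# (in_features, out_features) of each targeted linear layer in one decoder block.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) to an adapted layer."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
print(f"about {total * 4 / 1e6:.0f} MB if stored as float32")
```

Under these assumptions the estimate is roughly 92 million parameters, and the float32 size is consistent with the 369M `adapter_model.safetensors` upload reported later in this notebook.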

In [32]:
# Setting hyperparameters
training_arguments = TrainingArguments(
    output_dir="/content/duke_chatbot/data",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)


In [None]:
# train_dataset = Dataset.from_dict(train_data)
# eval_dataset = Dataset.from_dict(test_data)

In [None]:
# # create and save train_data_df and the training dataset
# def create_train_data_df(train_data, prompt_instruction = prompt_instruction2):
#   '''
#   Create the training dataset in the format required for the model
#   Input: train_data: a list of dictionaries
#   Input: prompt_instruction: a string
#   Output: train_data_df: a dataframe
#   '''
#   train_data_df = pd.DataFrame(train_data)
#   maker_df = train_data_df.copy()
#   for index, row in maker_df.iterrows():
#     maker_df.loc[index, 'text'] = f"""<s>[INST] {prompt_instruction}{row['text']} [/INST] \\n {row['labels']} </s>"""
#     maker_df.loc[index, 'labels'] = row['labels']

#   maker_df.head()
#   maker_df.drop(columns=['prompt_question_json', '__index_level_0__'], inplace=True)
#   #train_dataset = Dataset.from_pandas(maker_df)
#   train_dataset = Dataset(pa.Table.from_pandas(maker_df))

#   return train_dataset



In [33]:
train_data

{'text': ['### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: How do I get my NetID and password? \n### Answer: ',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: Is financial aid available to AIPI students? \n### Answer: ',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: When will I get access to my Duke email? \n### Answer: ',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: How many classes should I register for? \n### Answer: ',
  '### You are a trusted advisor in this content, helping to explain the text to prospective or current studen

In [35]:
train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Convert the pandas DataFrames back into Hugging Face datasets
train_dataset = Dataset.from_pandas(train_df)
eval_dataset = Dataset.from_pandas(test_df)

In [36]:
train_dataset[0]

{'text': '### You are a trusted advisor in this content, helping to explain the text to prospective or current students who are seeking answers to questions \n### Query: How do I get my NetID and password? \n### Answer: ',
 'labels': 'You should receive a separate email from the Office of Information Technology (OIT) with instructions to set up your NetID and email alias. Your NetID is your electronic key to online resources, including your Duke email account, DukeHub, Sakai, MyDuke, Box cloud storage, and more. Please set up your NetID as soon as possible.',
 '__index_level_0__': 35}
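
In the record above the reference answer lives in `labels`, while the `SFTTrainer` call below reads only `dataset_text_field="text"`, whose value stops at the empty `### Answer: ` slot. Here is a minimal sketch, assuming the goal is for the model to see each answer during fine-tuning, of merging the two fields before training. `join_prompt_and_answer` is a hypothetical helper, and the example record is shortened from the one shown above.

```python
def join_prompt_and_answer(record):
    """Return a copy of record whose 'text' holds the prompt followed by the answer."""
    merged = dict(record)
    merged["text"] = record["text"] + record["labels"]
    return merged


# Shortened version of the training record shown above.
example = {
    "text": "### Query: How do I get my NetID and password? \n### Answer: ",
    "labels": "You should receive a separate email from the Office of Information Technology (OIT).",
}
merged_example = join_prompt_and_answer(example)
print(merged_example["text"])
```

With a Hugging Face `Dataset`, the same function could be applied through `train_dataset.map(join_prompt_and_answer)`.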

In [37]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    #eval_dataset = test_data,
    peft_config=peft_config,
    max_seq_length= 4000,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

Map:   0%|          | 0/30 [00:00<?, ? examples/s]



In [38]:
# Training the model
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,3.3434
2,2.9207
3,2.1588
4,1.6988
5,1.3802
6,1.0939
7,1.0165
8,0.7718


TrainOutput(global_step=8, training_loss=1.7980089113116264, metrics={'train_runtime': 8.8817, 'train_samples_per_second': 3.378, 'train_steps_per_second': 0.901, 'total_flos': 87904616103936.0, 'train_loss': 1.7980089113116264, 'epoch': 1.0})
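
The `global_step=8` in the output follows directly from the dataset size and batch settings. A short worked check, assuming a single GPU (the device count is not stated in the notebook):

```python
import math

num_examples = 30          # from the "Map: 0/30" line above
per_device_batch_size = 4  # per_device_train_batch_size
grad_accum_steps = 1       # gradient_accumulation_steps
num_epochs = 1             # num_train_epochs
num_devices = 1            # assumption: a single Colab GPU

effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(total_steps)
```

With only eight optimizer steps, each of the 30 training prompts is seen once, so raising `num_train_epochs` is the most direct way to give the model more passes over this small dataset.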

In [39]:
# Save the fine-tuned model

trainer.model.save_pretrained('mille055/duke_chatbot2')
model.config.use_cache = True


In [None]:


# Login to Hugging Face within the notebook to store your credentials (if not using CLI)
notebook_login()


In [40]:
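# Read the write-scoped token from Colab secrets instead of hard-coding it.
# 'HUGGINGFACE_WRITE_TOKEN' is a hypothetical secret name.
write_token = userdata.get('HUGGINGFACE_WRITE_TOKEN')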

In [41]:
trainer.model.push_to_hub("mille055/duke_chatbot2", token=write_token)


adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/mille055/duke_chatbot2/commit/da79b0022d958c3c73a5db1751a7fd7f6ab45ad2', commit_message='Upload model', commit_description='', oid='da79b0022d958c3c73a5db1751a7fd7f6ab45ad2', pr_url=None, pr_revision=None, pr_num=None)

In [42]:

tokenizer.push_to_hub("mille055/duke_chatbot2", token=write_token)


tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/mille055/duke_chatbot2/commit/bcccb47f13a4ad121d2ac940c9633371084b9934', commit_message='Upload tokenizer', commit_description='', oid='bcccb47f13a4ad121d2ac940c9633371084b9934', pr_url=None, pr_revision=None, pr_num=None)

## Test the Model

In [44]:

pipe = pipeline(
    "text-generation",
    model='mille055/duke_chatbot2',
    tokenizer = tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)



In [46]:
eval_prompt = " What are the steps to apply to the program? "
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")


In [47]:
model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=200, repetition_penalty=1.15)[0], skip_special_tokens=True))


A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


 What are the steps to apply to the program? й

Applicants must submit a completed application form, including all required supporting documents. Applications will be reviewed on a rolling basis until the class is filled.

What are the requirements for admission?  информация о программе

The following criteria are used in evaluating applicants:

- Academic record (GPA)
- Relevant work experience
- Professional references
- Statement of purpose
- Resume/CV
- TOEFL or IELTS score (for international students only)

How do I know if my application has been received?  информация о программе

You will receive an email confirmation that your application was successfully submitted. If you have not received this email within two business days after submitting your application, please contact us at mba@smu.edu.

When should I expect to hear about my admissions decision?  информация о программе




In [50]:
new_response = get_response(eval_prompt, pipe)
new_response



' What are the requirements?  What is the cost?  How long is the program?  What are the benefits?  What is the curriculum?  What is the faculty like?  What is the student body like?  What are the career opportunities?  What is the job placement rate?  What is the return on investment?  What is the accreditation?  What is the program’s history?  What are the continuing education options?  What are the financial aid options?  What are the graduation requirements?  Are there any guarantees?  Is there an orientation program?  What is the loan repayment program?\n\n### Enroll in the Best LPN Schools in Alabama\n\nChoosing the right Licensed Practical Nurse school is perhaps the most important step to launching a new career in the healthcare industry. There are numerous variables that you need to consider when picking a nursing school. These aspects will be prioritized differently conting'