## Setup

To complete the following guide you will need to install the following packages:
- fireworks-ai
- numpy
- pandas
- pronouncing
- requests
- sentence-transformers
- transformers

You will also need:

- Fireworks account (https://fireworks.ai/)
- Fireworks API key
- The firectl command-line interface (https://docs.fireworks.ai/tools-sdks/firectl/firectl)

In [100]:
import json

from fireworks.client import Fireworks
from openai import OpenAI
import numpy as np
import pandas as pd
import pronouncing
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Using openai embedding model. 
client = OpenAI()

# Not sure why I even have this???????????
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

embeddings_model = get_embedding

In [2]:
# Sign-in to your Fireworks account
!firectl signin

Signed in as: jayozer@gmail.com
Account ID: jayozer-ce1cd6


In [125]:
!firectl whoami

Signed in as: jayozer@gmail.com
Account ID: jayozer-ce1cd6


In [126]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get the API key from environment variable
api_key = os.getenv('FIREWORKS_API_KEY')

client = Fireworks(api_key=api_key)

# Replace the line below with your Fireworks account id
account_id = os.getenv('FIREWORKS_ACCOUNT_ID')

## Problem Definition: Pediatric dentistry chatbot FAQ question generation for Intent classification with FAQs

*Note: The pediatric dentistry topics used in this example were synthetically generated by Openai o1-preview model*

LLMs are capable of performing creative writing tasks. However, assessing the quality of such tasks, like question generation, is highly subjective.

### Task
As I am not a professional dentist, I am unable to write high-level faq questions myself. The ideal approach would be to search the web for high-quality questions that matches the topics and style I want the LLM to generate or use a team of dentists. However, this method is very time-consuming (Kind of, I have the scripts for this and should try out). A more efficient approach (most definitely) is to use the "critique and revise" method, where the LLM first generates critiques on how each Q&A can be improved. We then ask the LLM to rewrite the questions based on these critiques. Finally, we fine-tune the LLM on the revised Questions.

# Synthetic data generation
### Data

We will use the following datasets:
- `/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv`
- `/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_test_topics.csv`

Training dataset consists of 100 and test dataset has 96 unique pediatric dentistry topics. 

In [128]:
training_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv')
test_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv')

In [104]:
print(training_data.head())
#print(test_data.head())

   Unnamed: 0                      topic
0           0  Dental Caries in Children
1           1     Early Childhood Caries
2           2           Fluoride Therapy
3           3     Pediatric Oral Hygiene
4           4            Dental Sealants
   Unnamed: 0                                      topic
0           0      Oral Health in Special Needs Children
1           1         Preventing Tooth Decay in Children
2           2           Importance of First Dental Visit
3           3  Dental Treatment for Children with Autism
4           4         Managing Dental Trauma in Toddlers


### Foundation Model Baseline
Our first step is to generate a question for each of the topics in the training data using a foundation_model. We will then use the critique and revise method to improve upon these questions.

In [129]:
# Given a csv file with a list of topics, generates a question for each topic
# system_message = 'You are a professional poet. Write a unique and original contemporary question about the topic suggested by the user. Your response should contain ONLY the content of the question.'
# system_message = 'You are a professional chatbot assistant. Write natural language FAQ questions for each topic suggested by the user. These questions should be written from the perspective of a parent, a teen, or another dentist. The questions should be realistic and conversational, just like a user would ask in a chatbot. Your response should contain ONLY the content of the questions.'


system_message = '''You are an AI assistant for a pediatric dentistry chatbot. Your task is to generate multiple realistic and conversational FAQ questions for each given topic. These questions should reflect what parents, teens, or other dentists might ask about pediatric dental care.

Guidelines:
1. Create 5-7 diverse questions for each topic.
2. Ensure questions are natural and conversational, as if asked by real users in a chat.
3. Vary the complexity and specificity of questions within each topic.
4. Include questions that address common concerns, misconceptions, and practical aspects of pediatric dental care.
5. Phrase questions from different perspectives (e.g., worried parents, curious teens, inquiring dentists).
6. Avoid using technical jargon unless it's part of a question about terminology.

Your response should contain ONLY the list of questions, with each question on a new line, prefixed by a hyphen (-). Do not include any other text or explanations.'''


def generate_questions(model, df):
    responses = list()
    for i, row in enumerate(df.iterrows()):
        response = client.chat.completions.create(
            model=model,
            messages=[
              {"role": "system", "content": system_message},
              {"role": "user", "content": row[1]['topic']}
            ],
        )
        response = response.choices[0].message.content
        responses.append(response)    
    return responses

# Training dataset

In [106]:
# We first generate questions for dentistry topics in our training set
llama_70b_training_questions = generate_questions('accounts/fireworks/models/llama-v3p1-70b-instruct', training_data)

In [109]:
print(llama_70b_training_questions[1])

- What is early childhood caries, and how does it affect my toddler's teeth?
- How can I prevent my 2-year-old from getting cavities when they love sugary snacks?
- Is it true that giving my baby a bottle of milk before bedtime can cause tooth decay?
- Can I use a regular toothpaste for my 18-month-old, or is there a special one for babies?
- What are the signs of early childhood caries, and how can I spot them in my child's teeth?
- How often should I take my toddler to the dentist to prevent early childhood caries?
- If my child has early childhood caries, will they need to get their teeth pulled or filled?
- Can early childhood caries be caused by anything other than sugary foods and drinks?
- Is there a connection between breastfeeding and the risk of early childhood caries?
- How can I clean my toddler's teeth when they don't want to open their mouth for me?


In [110]:
# Check the number of questions generated
num_questions = len(llama_70b_training_questions)
print(f"Number of questions generated: {num_questions}")

# Optionally, you can also print the number of questions for each topic
for i, questions in enumerate(llama_70b_training_questions):
    topic = training_data.iloc[i]['topic']
    num_topic_questions = len(questions.split('\n'))
    print(f"Topic '{topic}': {num_topic_questions} questions")

Number of questions generated: 101
Topic 'Dental Caries in Children': 8 questions
Topic 'Early Childhood Caries': 10 questions
Topic 'Fluoride Therapy': 8 questions
Topic 'Pediatric Oral Hygiene': 8 questions
Topic 'Dental Sealants': 7 questions
Topic 'Space Maintainers': 9 questions
Topic 'Pulp Therapy in Children': 7 questions
Topic 'Stainless Steel Crowns': 7 questions
Topic 'Behavior Management in Pediatric Dentistry': 7 questions
Topic 'Sedation in Pediatric Dentistry': 8 questions
Topic 'Preventive Dentistry for Kids': 8 questions
Topic 'Oral Habits in Children': 7 questions
Topic 'Dental Trauma Management': 8 questions
Topic 'Malocclusion in Children': 8 questions
Topic 'Orthodontics for Children': 8 questions
Topic 'Cleft Lip and Palate Treatment': 8 questions
Topic 'Special Needs Dentistry': 7 questions
Topic 'Infant Oral Health Care': 8 questions
Topic 'Teething in Infants': 8 questions
Topic 'Eruption Patterns in Children': 7 questions
Topic 'Pediatric Oral Pathology': 8 que

### Critique on Training data to generate better questions and data set. 
In the critique step, we create a scoring rubric and ask an LLM to generate improvements to the previously created questions based on the rubric.

In [111]:
# We now use our scoring rubric to generate a list of critiques about each question
question_guidelines = """
1. Relevance and Audience Appropriateness:
   - Is the question directly related to pediatric dentistry?
   - Is it appropriate for the intended audience (parents, teens, or pediatric dentists)?
   - Does it address a common concern or information need for that audience?

2. Natural and Conversational Language:
   - Does the question use language that sounds natural in a chat interaction?
   - Is it free of unnecessary jargon or overly formal phrasing?
   - Would it sound authentic coming from a parent, teen, or dentist in a real-world scenario?

3. Specificity and Actionability:
   - Is the question specific enough to generate a focused, useful answer?
   - Does it avoid being overly broad or vague?
   - Can it lead to actionable advice or clear information for the user?

4. Anticipation of User Needs:
   - Does the question reflect common concerns or information gaps for the intended audience?
   - Does it address potential misconceptions or frequently asked questions in pediatric dentistry?
   - Is it likely to be valuable for multiple users with similar concerns?

5. Diversity and Uniqueness:
   - Does this question add variety to the set of questions for this topic?
   - Does it approach the topic from a different angle compared to other questions?

6. Ethical and Professional Considerations:
   - Does the question respect professional boundaries and ethical guidelines in pediatric dentistry?
   - Does it avoid promoting harmful misconceptions or inappropriate practices?
"""

question_critique_rubric = f'''You are a professional chatbot evaluator responsible for assessing the quality of AI-generated FAQ questions for a pediatric dentistry chatbot.

Assessment Guidelines:
{question_guidelines}

Given the above guidelines, provide a list of ways that the question could be improved. For each suggestion, explain why the improvement is necessary and how it would enhance the question's effectiveness for the pediatric dentistry chatbot. Focus on constructive feedback that can be used to refine and elevate the quality of the FAQ questions.

Your critique should be clear, specific, and actionable, addressing any areas where the question falls short of the ideal standards outlined in the guidelines.'''

def critique_questions(questions, evaluation_model):
    critiques = list()
    for question in questions:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": question_critique_rubric},
                {"role": "user", "content": question}
            ],
        )

        try: 
            response = response.choices[0].message.content
            critiques.append(response)
        except json.JSONDecodeError as jde:
            continue

    return critiques

In [114]:
# Critique the questions generated for pediatric dentistry topics in the training set using a larger LLM. llama_70b_training_questions judged by llama-v3p1-405b-instruct
llama_70b_training_critiques = critique_questions(llama_70b_training_questions, 'accounts/fireworks/models/llama-v3p1-405b-instruct')

In [115]:
#llama 3.1 70B - Works much better. 
print(llama_70b_training_questions[0])
print(llama_70b_training_critiques[0])

- What's the best way to prevent cavities in my toddler's baby teeth?
- How do I know if my child has a cavity, and what are the symptoms?
- Can children get cavities from breast milk or formula?
- Is it true that giving my child a bottle of juice before bedtime can cause tooth decay?
- How often should I take my child to the dentist to check for cavities?
- What's the difference between a cavity and tooth sensitivity in kids?
- Can cavities in baby teeth affect my child's permanent teeth?
- Are there any natural remedies or home treatments that can reverse or prevent tooth decay in children?
Here's a list of potential improvements for each question, along with explanations for why these changes are necessary and how they would enhance the question's effectiveness:

1. What's the best way to prevent cavities in my toddler's baby teeth?
* Improvement: Specify the age range of the toddler (e.g., "What's the best way to prevent cavities in my 1-3 year old child's baby teeth?").
* Reason: 

### Revise
In the revise step, we create a new prompt that tells the LLM to generate a revised question, given the previously generated critiques.

In [117]:
# We now give the LLM both the questions and the critiques, and tell it to improve the question based on the following critiques.
# improvement_sys_message = '''You are a professional chatbot assistant. Improve the generated question, given the following critiques.

# Your response must ONLY contain the content of the improved question. DO NOT TELL ME YOUR CHANGES, JUST GIVE ME THE REVISED QUESTION!'''

improvement_sys_message = '''You are an AI assistant for a pediatric dentistry chatbot. Your task is to improve the given question based on the provided critiques. Follow these guidelines:

1. Carefully analyze the original question and the critiques.
2. Address all points mentioned in the critiques.
3. Maintain the question's relevance to pediatric dentistry.
4. Ensure the improved question is natural, conversational, and appropriate for the intended audience (parents, teens, or dentists).
5. Make the question more specific and actionable if needed.
6. Enhance the question to better anticipate user needs and common concerns.
7. Preserve or improve the question's uniqueness within the topic.
8. Ensure the question adheres to ethical and professional standards in pediatric dentistry.

Your response must ONLY contain the improved question. Do not explain your changes or include any other text. Provide just the revised question, ensuring it's a single, coherent question that incorporates all necessary improvements.'''

def generate_improved_questions(model, questions, critiques):
    improved_questions = []
    for i, question in enumerate(questions):
        user_message = f'''
question:      
{question}

critiques:
{critiques[i]}'''
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": improvement_sys_message},
                {"role": "user", "content": user_message}
            ],
        )
        response_text = response.choices[0].message.content
        improved_questions.append(response_text)    

    return improved_questions

In [118]:
#llama_70b_training_improved_questions = generate_improved_questions('accounts/fireworks/models/llama-v3p1-70b-instruct', llama_70b_training_questions, llama_70b_training_critiques)
llama_70b_training_improved_questions = generate_improved_questions(
    'accounts/fireworks/models/llama-v3p1-70b-instruct',
    llama_70b_training_questions,
    llama_70b_training_critiques
)

In [134]:
print(llama_70b_training_improved_questions[0:3])

["Here are the improved questions:\n\n1. What's the best way to prevent cavities in my 1-3 year old child's baby teeth?\n2. What are the common symptoms of cavities in children?\n3. How can I identify a cavity in my child's teeth?\n4. Can breast milk or formula contribute to tooth decay in infants, and how can I minimize the risk?\n5. How does juice consumption affect my child's oral health, and what are some healthy alternatives for bedtime drinks?\n6. How often should I take my preschooler to the dentist for cavity checks?\n7. How can I distinguish between tooth sensitivity and a cavity in my child, and what are the implications for treatment?\n8. How do cavities in baby teeth impact my child's overall oral health, and what are the potential long-term effects on their permanent teeth?\n9. What are some evidence-based, non-invasive methods for preventing tooth decay in children, and what are the limitations of natural remedies?\n10. What are some healthy snack options for kids that ca

In [135]:
print(training_data['topic'][0:2])

0    Dental Caries in Children
1       Early Childhood Caries
Name: topic, dtype: object


# Create & Save the gold questions set with tags. - pk_improved_question_training_dataset.tsv

In [124]:

# Load the training topics
#training_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv')

import re
import pandas as pd

# Load the training topics
topics = training_data['topic'].tolist()

# Initialize empty lists to store data
questions = []
topics_list = []

# Process each set of improved questions
for idx, text in enumerate(llama_70b_training_improved_questions):
    topic = topics[idx]  # Get the corresponding topic

    # Use regular expressions to find all questions
    qs = re.findall(r'\d+\.\s+(.*?)(?=\n\d+\.|\Z)', text, re.DOTALL)

    # Append each question and its topic to the lists
    for q in qs:
        questions.append(q.strip())
        topics_list.append(topic)

# Create a DataFrame from the lists with columns 'text' and 'label'
df = pd.DataFrame({
    'text': questions,
    'label': topics_list
})

# Save the DataFrame to a TSV file
df.to_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_improved_question_training_dataset.tsv', sep='\t', index=False)

# Display the DataFrame
print(df)


                                                  text  \
0    What's the best way to prevent cavities in my ...   
1    What are the common symptoms of cavities in ch...   
2     How can I identify a cavity in my child's teeth?   
3    Can breast milk or formula contribute to tooth...   
4    How does juice consumption affect my child's o...   
..                                                 ...   
694  What are some common habits that can cause too...   
695  How do sugary drinks like sports drinks and en...   
696  What are the signs and symptoms of tooth wear ...   
697  What are the best ways to prevent tooth wear i...   
698  At what age do children typically start to exp...   

                         label  
0    Dental Caries in Children  
1    Dental Caries in Children  
2    Dental Caries in Children  
3    Dental Caries in Children  
4    Dental Caries in Children  
..                         ...  
694    Dental Wear in Children  
695    Dental Wear in Children  
696   

# works with a couple of issues where LLM produced unwanted information, I removed them manually. Next time try Llama 3.2 90B and then improve using 405B.

############## End of Training Data Generation ####################

# Create the test set using training labels

In [137]:
# Next is to create the test questions using Mixtral mix of experts model
test_questions = generate_questions('accounts/fireworks/models/mixtral-8x22b-instruct', test_data)

In [139]:
print(test_questions[1])

- What is early childhood caries and how can I prevent it in my child?
- How do I know if my toddler has early childhood caries?
- What are the common causes of early childhood caries in infants?
- How can I treat early childhood caries in my 3-year-old?
- What are the long-term effects of early childhood caries on a child's oral health?
- How can I ensure my baby's teeth remain healthy as they grow?
- As a dentist, how do I educate parents about early childhood caries prevention?


In [141]:
print(test_questions[1:3])

["- What is early childhood caries and how can I prevent it in my child?\n- How do I know if my toddler has early childhood caries?\n- What are the common causes of early childhood caries in infants?\n- How can I treat early childhood caries in my 3-year-old?\n- What are the long-term effects of early childhood caries on a child's oral health?\n- How can I ensure my baby's teeth remain healthy as they grow?\n- As a dentist, how do I educate parents about early childhood caries prevention?", "- What is fluoride therapy and why is it important for my child's dental health?\n- How does fluoride help prevent tooth decay in kids?\n- At what age should my child start fluoride treatments at the dentist?\n- How often should fluoride therapy be performed for pediatric patients?\n- Are there any risks or side effects associated with fluoride treatments for children?\n- Can't I just use fluoride toothpaste at home for my child instead of getting professional treatments?\n- Is it possible for my c

In [142]:
# Check the number of test questions generated
num_questions = len(test_questions)
print(f"Number of test questions generated: {num_questions}")

# Optionally, you can also print the number of questions for each topic
for i, questions in enumerate(test_questions):
    topic = training_data.iloc[i]['topic']
    num_topic_questions = len(questions.split('\n'))
    print(f"Topic '{topic}': {num_topic_questions} questions")

Number of test questions generated: 101
Topic 'Dental Caries in Children': 7 questions
Topic 'Early Childhood Caries': 7 questions
Topic 'Fluoride Therapy': 7 questions
Topic 'Pediatric Oral Hygiene': 7 questions
Topic 'Dental Sealants': 7 questions
Topic 'Space Maintainers': 7 questions
Topic 'Pulp Therapy in Children': 7 questions
Topic 'Stainless Steel Crowns': 7 questions
Topic 'Behavior Management in Pediatric Dentistry': 7 questions
Topic 'Sedation in Pediatric Dentistry': 7 questions
Topic 'Preventive Dentistry for Kids': 7 questions
Topic 'Oral Habits in Children': 7 questions
Topic 'Dental Trauma Management': 7 questions
Topic 'Malocclusion in Children': 7 questions
Topic 'Orthodontics for Children': 7 questions
Topic 'Cleft Lip and Palate Treatment': 7 questions
Topic 'Special Needs Dentistry': 7 questions
Topic 'Infant Oral Health Care': 7 questions
Topic 'Teething in Infants': 7 questions
Topic 'Eruption Patterns in Children': 7 questions
Topic 'Pediatric Oral Pathology': 7

In [144]:
import re
import pandas as pd

# Load the training topics
topics = test_data['topic'].tolist()

# Initialize empty lists to store data
questions = []
topics_list = []

# Process each set of improved questions
for idx, text in enumerate(test_questions):
    topic = topics[idx]  # Get the corresponding topic

    # Use regular expressions to find all questions marked by a hyphen '-'
    qs = re.findall(r'- (.*?)(?=\n-|\Z)', text, re.DOTALL)

    # Append each question and its topic to the lists
    for q in qs:
        questions.append(q.strip())
        topics_list.append(topic)

# Create a DataFrame from the lists with columns 'text' and 'label'
df = pd.DataFrame({
    'text': questions,
    'label': topics_list
})

# Save the DataFrame to a TSV file
df.to_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_test_questions_dataset.tsv', sep='\t', index=False)

# Display the DataFrame
print(df)

                                                  text  \
0    What exactly is dental caries and how does it ...   
1        How can I tell if my child has dental caries?   
2    At what age do children typically develop dent...   
3    Is dental caries preventable in children? What...   
4    If left untreated, what are the potential cons...   
..                                                 ...   
702  Could food choices contribute to my teenager's...   
703  Are there any medical conditions that might be...   
704  Can poor oral hygiene lead to chronic bad brea...   
705  Is it possible for sinus infections to cause b...   
706  How can I help my child overcome their bad bre...   

                              label  
0         Dental Caries in Children  
1         Dental Caries in Children  
2         Dental Caries in Children  
3         Dental Caries in Children  
4         Dental Caries in Children  
..                              ...  
702  Causes of Pediatric Bad Breath  

### Critique test questions!!!!! - Not doing it for first run!!!! - DID NOT RUN THIS PART
In the critique step, we create a scoring rubric and ask an LLM to generate improvements to the previously created questions based on the rubric.

In [9]:
# We now use our scoring rubric to generate a list of critiques about each question
question_guidelines = """
-Is the question relevant to the topic and the intended audience (parents, teens, or pediatric dentists)? Does the question directly relate to the pediatric dentistry topic and provide valuable information for the user?

-Is the language natural and conversational? Does the question sound like it would be asked by a parent, teen, or pediatric dentist in a real-world scenario? Is it phrased naturally, as if in a chatbot interaction?

-Does the question anticipate the concerns or needs of the intended user? Does the question reflect the typical concerns of the person asking (e.g., a parent asking about a child's oral health, a teen asking about dental procedures, or a dentist inquiring about professional guidelines)?

-Is the question specific enough to generate a useful answer? Does the question provide enough detail to allow for a meaningful and targeted answer, avoiding overly vague or broad inquiries?

"""

question_critique_rubric = f'''You are a professional chatbot evaluator responsible for assessing the quality of AI-generated FAQ questions.

Assessment Guidelines:
{question_guidelines}

Given the above guidelines, provide a list of ways that the question could be improved.'''

def critique_questions(questions, evaluation_model):
    critiques = list()
    for question in questions:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": question_critique_rubric},
                {"role": "user", "content": question}
            ],
        )

        try: 
            response = response.choices[0].message.content
            critiques.append(response)
        except json.JSONDecodeError as jde:
            continue

    return critiques

In [10]:
test_critiques = critique_questions(test_questions, 'accounts/fireworks/models/llama-v3p1-70b-instruct')

In [None]:
#llama 3.1 70B - Works much better. 
print(test_questions[0])
print(test_critiques[0])

### Revise
In the revise step, we create a new prompt that tells the LLM to generate a revised question, given the previously generated critiques.

In [15]:
# We now give the LLM both the questions and the critiques, and tell it to improve the question based on the following critiques.
improvement_sys_message = '''You are a professional chatbot assistant. Improve the generated question, given the following critiques.

Your response must ONLY contain the content of the improved question. DO NOT TELL ME YOUR CHANGES, JUST GIVE ME THE REVISED QUESTION!'''

def generate_improved_questions(model, questions, critiques):
    improved_questions = []
    for i, question in enumerate(questions):
        user_message = f'''
question:      
{question}

critiques:
{critiques[i]}'''
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": improvement_sys_message},
                {"role": "user", "content": user_message}
            ],
        )
        response_text = response.choices[0].message.content
        improved_questions.append(response_text)    

    return improved_questions

In [16]:
#llama_70b_training_improved_questions = generate_improved_questions('accounts/fireworks/models/llama-v3p1-70b-instruct', llama_70b_training_questions, llama_70b_training_critiques)
test_improved_questions = generate_improved_questions(
    'accounts/fireworks/models/llama-v3p1-70b-instruct',
    test_questions,
    test_critiques
)

In [None]:
print(test_improved_questions[0])

In [None]:
import re
import pandas as pd

# Load the training topics
#training_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv')
topics = test_data['topic'].tolist()

# 'llama_70b_training_improved_questions' is your list of improved questions
# Initialize empty lists to store data
questions = []
topics_list = []

# Process each set of improved questions
for idx, text in enumerate(test_improved_questions):
    topic = topics[idx]  # Get the corresponding topic

    # Use regular expressions to find all questions marked by an asterisk '*'
    qs = re.findall(r'\* (.*)', text)

    # Append each question and its topic to the lists
    for q in qs:
        questions.append(q.strip())
        topics_list.append(topic)

# Create a DataFrame from the lists with columns 'text' and 'label'
df = pd.DataFrame({
    'text': questions,
    'label': topics_list
})

# Save the DataFrame to a TSV file
df.to_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_improved_question_test_dataset.tsv', sep='\t', index=False)

# Display the DataFrame
print(df)

In [None]:
############## End of Test data creation #################### - DID NOT RUN THIS PART

In [None]:
#Scott - The test dataset has topics that are more suitable for what a parent, teen, or dentist would ask. I guess I want to understand at what point I would be overfitting the model. Can I not use both datasets?

# Finetuning

We now fine-tune a smaller LLM using the revised questions. This is similar to the knowledge distillation method from last week's notebook, except we are fine-tuning on the revised question and labels of the larger model, rather than the original questions that it generated.

question = question
category = Tag in Knowledge base
categories = tags


In [147]:
training_examples = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_improved_question_training_dataset.tsv', sep='\t')
test_examples = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_test_questions_dataset.tsv', sep='\t')

# In order to not leak information about the test labels into our prompts, the list of possible tags will be defined 
# based on the training labels.
tags = sorted(training_examples['label'].unique().tolist())
tags_str = '\n'.join(tags)

training_questions = training_examples['text'].tolist()
training_labels = training_examples['label'].tolist()

test_questions = test_examples['text'].tolist()
test_labels = test_examples['label'].tolist()

In [148]:
print(test_questions[0])
print(test_labels[0])

What exactly is dental caries and how does it affect children's teeth?
Dental Caries in Children


In [149]:
print(training_questions[0])
print(training_labels[0])

What's the best way to prevent cavities in my 1-3 year old child's baby teeth?
Dental Caries in Children


In [150]:
print('\n'.join(tags[:3]))

Anomalies of Tooth Formation
Antibiotic Prophylaxis in Pediatric Dentistry
Behavior Management in Pediatric Dentistry


### Finetuning Dataset Curation

We first must transform our dataset into the format expected by Fireworks, and then upload the dataset. The dataset must conform to the schema expected by the Chat Completions API.

See https://docs.fireworks.ai/fine-tuning/fine-tuning-models#conversation for more details
<!-- ticket = question
category = Tag in Knowledge base -->

In [151]:
def create_prompt(question):
    return f"""You are a helpful assistant for a Pediatric Dentistry. You have been asked to classify a pediatric dental question into one of the following tags:
<tags>
{tags_str}
</tags>

Here is the a pediatric dental question:    
<question>{question}</question>

Respond using this format:
<tag>The tag label you chose goes here</tag>"""    

# Create the training jsonl dataset

In [152]:
# Converts the training examples to the format expected by Fireworks.
def training_examples_to_json(examples):
    json_objs = list()
    for idx, example in examples.iterrows():  
        user_msg = create_prompt(example['text'])
        asst_msg = f"<tag>{example['label']}</tag>"
        msg = {"messages": [
            {"role": "user", "content": user_msg}, 
            {"role": "assistant", "content": asst_msg}
        ]}
        json_objs.append(msg)
    
    return json_objs

training_json = training_examples_to_json(training_examples)

In [153]:
# Writes the data to a file so that it can be uploaded to Fireworks
dataset_file_name = 'question-classification_training_data.jsonl'
dataset_id = 'question-classification-v2'

with open(dataset_file_name, 'w') as f:
    for obj in training_json:
        json.dump(obj, f)
        f.write('\n')

# Create the test jsonl dataset

In [154]:
# Converts the test examples to the format expected by Fireworks.
def test_examples_to_json(examples):
    json_objs = list()
    for idx, example in examples.iterrows():  
        user_msg = create_prompt(example['text'])
        asst_msg = f"<tag>{example['label']}</tag>"
        msg = {"messages": [
            {"role": "user", "content": user_msg}, 
            {"role": "assistant", "content": asst_msg}
        ]}
        json_objs.append(msg)
    
    return json_objs

test_json = test_examples_to_json(test_examples)

In [155]:
# Writes the data to a file so that it can be uploaded to Fireworks
dataset_file_name = 'question-classification_test_data.jsonl'
#dataset_id = 'question-classification-v2' #not necessary I do it above

with open(dataset_file_name, 'w') as f:
    for obj in test_json:
        json.dump(obj, f)
        f.write('\n')

In [None]:
# Upload the improved questions to fireworks as our fine-tuning dataset - this one is to create the jsonl file with the system message. 
# def format_question_for_fireworks(topic, question):
#     return {"messages": [
#         {"role": "system", "content": system_message}, 
#         {"role": "user", "content": topic}, 
#         {"role": "assistant", "content": question}
#     ]}

# topics = training_data['topic'].tolist()
# json_objs = list()
# for i, question in enumerate(llama_70b_training_improved_questions):
#     msg = {"messages": [
#         {"role": "system", "content": system_message}, 
#         {"role": "user", "content": topics[i]}, 
#         {"role": "assistant", "content": question}
#     ]}    
#     json_objs.append(msg)

# dataset_file_name = 'question_training_data.jsonl'
# dataset_id = 'improved-question-data-v1'
# with open(dataset_file_name, 'w') as f:
#     for obj in json_objs:
#         json.dump(obj, f)
#         f.write('\n')

# Clean up before I start finetuning

In [160]:
! firectl list datasets

NAME                      CREATE TIME          STATE  DISPLAY_NAME
improved-poem-data-v1     2024-09-25 14:33:24  READY  
poem-data-v1              2024-09-25 09:01:49  READY  
ticket-classification-v1  2024-09-16 18:21:15  READY  

Page 1 of 1
Total size: 3


In [161]:
! firectl list fine-tuning-jobs

NAME                                                         CREATE TIME          STATE      DISPLAY_NAME
improved-poems-v1 (b203ef3333a04c1d95cd37e7ab695b71)         2024-09-25 14:49:32  COMPLETED  improved-poems-v1
poem-generation-v1 (089bc412c11340a296b3fba6ac952868)        2024-09-25 12:55:56  COMPLETED  poem-generation-v1
pk-faq-v2 (427f1b67937e40a5bb3ef06a0c3770d5)                 2024-09-17 06:08:29  COMPLETED  pk-faq-v2
pk-faq-v1 (f0ddb99646244b33a86f9df7edc0faa5)                 2024-09-17 06:08:06  COMPLETED  pk-faq-v1
ticket-classification-v2 (7bb6601f134a41849cc6525c1483fe68)  2024-09-16 18:27:25  COMPLETED  ticket-classification-v2
ticket-classification-v1 (92415c84c0ea453cb0407dabc119cc6b)  2024-09-16 18:26:55  COMPLETED  ticket-classification-v1

Page 1 of 1
Total size: 6


In [158]:
#! firectl delete fine-tuning-job 31181af0bacb4cc2b66e3edfd34b613c


In [159]:
#! firectl delete dataset question-classification-v1


In [162]:
# Upload our dataset to fireworks
!firectl create dataset {dataset_id} {dataset_file_name}

1.94 MiB / 1.94 MiB [-------------------------------] 100.00% 5.37 MiB p/s 600ms


In [163]:
# Create a fine-tuning job
# name of the finetuning job = pk-tag-classification-v1
# dataset = question-classification-v1
# settings file = question-classification_fine_tuning_config.yaml
!firectl create fine-tuning-job --settings-file question-classification_fine_tuning_config.yaml --display-name pk-tag-classification-v2 --dataset {dataset_id} 

Name: accounts/jayozer-ce1cd6/fineTuningJobs/00aac6882c9c400a8ae53da1c4658245
Display Name: pk-tag-classification-v2
Create Time: 2024-09-27 17:00:39
State: CREATING
Dataset: accounts/jayozer-ce1cd6/datasets/question-classification-v2
Status: OK
Created By: jayozer@gmail.com
Conversation:
  Jinja Template: 
{%- set _mode = mode | default('generate', true) -%}
{%- set stop_token = '<|eot_id|>' -%}
{%- set message_roles = ['SYSTEM', 'USER', 'ASSISTANT'] -%}
{%- set ns = namespace(initial_system_message_handled=false, last_assistant_index_for_eos=-1, messages=messages) -%}
{%- for message in ns.messages -%}
    {%- if not message.get('role') -%}
        {{ raise_exception('Key [role] is missing. Original input: ' +  message|tojson) }}
    {%- endif -%}
    {%- if message['role'] | upper not in message_roles -%}
        {{ raise_exception('Invalid role ' + message['role']|tojson + '. Only ' + message_roles|tojson + ' are supported.') }}
    {%- endif -%}
    {%- if 'content' not in message

In [164]:
# NOTE THAT THIS ID WILL CHANGE WHEN YOU RUN THE FINE-TUNING JOB ON YOUR ACCOUNT!!!
# The model id is printed in the stdout of the cell above as Name: accounts/{account_id}/fineTuningJobs/{ft_model_id}
ft_model_id = '00aac6882c9c400a8ae53da1c4658245' 

In [169]:
# Wait until the State of the fine-tuning job is listed as COMPLETED (~10-20 minutes)
!firectl get fine-tuning-job {ft_model_id}

Name: accounts/jayozer-ce1cd6/fineTuningJobs/00aac6882c9c400a8ae53da1c4658245
Display Name: pk-tag-classification-v2
Create Time: 2024-09-27 17:00:39
State: COMPLETED
Dataset: accounts/jayozer-ce1cd6/datasets/question-classification-v2
Status:
  Code: OK
  Message: {'train_runtime': 425.5238, 'train_samples_per_second': 3.323, 'train_steps_per_second': 0.832, 'total_flos': 4.283877676181094e+16, 'train_loss': 0.10231935008942665, 'epoch': 2.0}
Created By: jayozer@gmail.com
Model Id: 00aac6882c9c400a8ae53da1c4658245
Conversation:
  Jinja Template: 
{%- set _mode = mode | default('generate', true) -%}
{%- set stop_token = '<|eot_id|>' -%}
{%- set message_roles = ['SYSTEM', 'USER', 'ASSISTANT'] -%}
{%- set ns = namespace(initial_system_message_handled=false, last_assistant_index_for_eos=-1, messages=messages) -%}
{%- for message in ns.messages -%}
    {%- if not message.get('role') -%}
        {{ raise_exception('Key [role] is missing. Original input: ' +  message|tojson) }}
    {%- endif

# what is evaluation split in finetuning:
The evaluation split in finetuning refers to the portion of the training data set aside for evaluating the model's performance after training. Here are the key points:
By default, fine-tuning jobs do not run post-training evaluation. You can enable model evaluation by setting the evaluation parameter to True. The evaluation_split parameter allows you to configure the amount of training data used for evaluation. The default evaluation split is 15% of the training data. You can adjust this split. For example, to use 20% of the data for evaluation:

evaluation: True
evaluation_split: 0.2

# So essentially I can put all my tags into a single triang set and then test with a portion of that dataset, i dont need a seprate test set. Above I am not using the training for evaluation at all. 

# V3 try to see difference. 
Question egenration by llama-v3p1-8b-instruct
Critique by llama3 405B
Revise questions by llama-v3p1-8b-instruct or llama-v3p2-3b-instruct
Finetune it using accounts/fireworks/models/llama-v3p2-3b-instruct

Or I might fully move to 3.2 models for generation (11B or 70B) - the idea is to finetune a smaller model for knowledge distillation. 

### Current Run Version 2: finetuning a llama-v3p1-8b-instruct my dataset is mixtral 8X22. My training questions are geenrated by llama-v3p1-8b-instruct but critique by 405B
Base Model: accounts/fireworks/models/llama-v3p1-8b-instruct
Epochs: 2
Learning Rate: 0.0002
Lora Rank: 32
Batch Size: 4
Evaluation Split: 0

### Evaluation
Finally, we evaluate the fine-tuned model on our test data. In the previous weeks notebook, the knowledge distillation method resulted in an average LLM judge score of 8.21. We expect to receive a higher score now that we are fine-tuning on the revised questions rather than the initial questions that the large model generated.

In [170]:
# Deploy the fine-tuned model
!firectl deploy {ft_model_id}

In [173]:
# Wait until the the Deploymed Model Refs lists the state of the model as "DEPLOYED" (~5-20 minutes).
!firectl get model {ft_model_id}

Name: accounts/jayozer-ce1cd6/models/00aac6882c9c400a8ae53da1c4658245
Create Time: 2024-09-27 17:12:26
State: READY
Status: OK
Kind: HF_PEFT_ADDON
Base Model Details:
  Checkpoint Format: CHECKPOINT_FORMAT_UNSPECIFIED
Peft Details:
  Base Model: accounts/fireworks/models/llama-v3p1-8b-instruct
  R: 32
  Target Modules: [gate_proj, up_proj, k_proj, q_proj, o_proj, down_proj, v_proj]
Conversation Config:
  Style: jinja
Context Length: 131072
Fine Tuning Job: accounts/jayozer-ce1cd6/fineTuningJobs/00aac6882c9c400a8ae53da1c4658245
Deployed Model Refs: 
  [{
    Name: accounts/jayozer-ce1cd6/deployedModels/00aac6882c9c400a8ae53da1c4658245-2782f5b2
    Deployment: accounts/fireworks/deployments/ee744c5f
    State: DEPLOYED
    Default: true
  }]


### Evaluate Results

We will now deploy our models and evaluate the results. We will calculate the accuracy on three different models

- The base model without any fine-tuning
- Our first fine-tuned model, with the default hyperparameters
- Our second fine-tuned model, with the more aggressive hyperparameters

See https://docs.fireworks.ai/fine-tuning/fine-tuning-models#deploying-the-model-for-inference for more details

<!-- ticket = question
tag = Tag in Knowledge base -->

In [180]:
print(questions[0:3])

["What exactly is dental caries and how does it affect children's teeth?", 'How can I tell if my child has dental caries?', 'At what age do children typically develop dental caries?']


In [183]:
# Uses an LLM to predicted class labels for a list of support questions
def classify_questions(questions, model):
    responses = list()

    for question in questions:
        user_prompt = create_prompt(question)
    
        response = client.chat.completions.create(
            model=model,
            messages=[
                { "role": "user", "content": user_prompt}
            ],
            temperature=0, 
            stop=["</tag>"],
            max_tokens=2048,
        )
        response = response.choices[0].message.content.split("<tag>")[-1].strip()
        responses.append(response)

    return responses


# Calculates the percent of predictions we classified correctly
def evaluate_accuracy(predicted, actual):
    num_correct = sum([predicted[i] == actual[i] for i in range(len(actual))])
    return round(100 * num_correct / len(actual), 2)

In [175]:
print(training_questions[0:2])
print(training_labels[0:2])

print(test_questions[0:2])
print(test_labels[0:2])

["What's the best way to prevent cavities in my 1-3 year old child's baby teeth?", 'What are the common symptoms of cavities in children?']
['Dental Caries in Children', 'Dental Caries in Children']


# check what is going on.

In [202]:
import re

def classify_questions(questions, model, num_samples=10):
    responses = []
    details = []

    for i, question in enumerate(questions[:num_samples]):
        user_prompt = create_prompt(question)
        
        print(f"\nQuestion {i+1}: {question}")
        print(f"Prompt sent to LLM:\n{user_prompt}")
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "user", "content": user_prompt}
            ],
            temperature=0, 
            stop=["</tag>"],
            max_tokens=4048,
        )
        
        full_response = response.choices[0].message.content
        # Extract the content within any angle brackets, regardless of tag name
        match = re.search(r'<[^>]*>(.*?)</[^>]*>|<[^>]*>(.*)', full_response, re.DOTALL)
        if match:
            classification = match.group(1) or match.group(2)
        else:
            # If no tag is found, use the full response
            classification = full_response
        
        # Remove any remaining tag elements and strip whitespace
        classification = re.sub(r'</?[^>]+>', '', classification).strip()
        responses.append(classification)
        
        print(f"Full LLM Response: {full_response}")
        print(f"Extracted Classification: {classification}")
        
        details.append({
            "question": question,
            "prompt": user_prompt,
            "full_response": full_response,
            "classification": classification
        })

    return responses, details

def evaluate_accuracy(predicted, actual, num_samples=10):
    correct = 0
    total = min(len(predicted), len(actual), num_samples)
    
    print("\nAccuracy Evaluation:")
    for i in range(total):
        # Remove any tag elements and strip whitespace from both predicted and actual
        clean_predicted = re.sub(r'</?[^>]+>', '', predicted[i]).strip().lower()
        clean_actual = re.sub(r'</?[^>]+>', '', actual[i]).strip().lower()
        
        # Remove any extra whitespace within the labels
        clean_predicted = ' '.join(clean_predicted.split())
        clean_actual = ' '.join(clean_actual.split())
        
        is_correct = clean_predicted == clean_actual
        correct += int(is_correct)
        print(f"Item {i+1}:")
        print(f"  Predicted: {clean_predicted}")
        print(f"  Actual: {clean_actual}")
        print(f"  Correct: {is_correct}")
    
    accuracy = round(100 * correct / total, 2)
    print(f"\nTotal Correct: {correct} out of {total}")
    print(f"Accuracy: {accuracy}%")
    
    return accuracy




# Main execution remains the same

# Main execution
#model_id = 'accounts/fireworks/models/llama-v3p1-8b-instruct'
model_id = f'accounts/{account_id}/models/{ft_model_id}'

print("Classifying Training Questions:")
training_responses, training_details = classify_questions(training_questions, model_id)
training_accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"\nTraining Set Accuracy: {training_accuracy}%")

print("\nClassifying Test Questions:")
test_responses, test_details = classify_questions(test_questions, model_id)
test_accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"\nTest Set Accuracy: {test_accuracy}%")

Classifying Training Questions:

Question 1: What's the best way to prevent cavities in my 1-3 year old child's baby teeth?
Prompt sent to LLM:
You are a helpful assistant for a Pediatric Dentistry. You have been asked to classify a pediatric dental question into one of the following tags:
<tags>
Anomalies of Tooth Formation
Antibiotic Prophylaxis in Pediatric Dentistry
Behavior Management in Pediatric Dentistry
Bruxism in Children
Cavities in Baby Teeth
Chewing Habits in Kids
Choosing a Pediatric Dentist
Cleft Lip and Palate Treatment
Community Dentistry Programs for Children
Composite Fillings for Kids
Conscious Sedation
Dental Anxiety in Children
Dental Caries in Children
Dental Checkups During Pregnancy
Dental Crowns for Children
Dental Emergencies in Children
Dental Fillings for Kids
Dental Health in Preschoolers
Dental Hygiene for Teenagers
Dental Sealants
Dental Wear in Children
Diet and Dental Health
Early Childhood Caries
Early Loss of Baby Teeth
Enamel Hypoplasia in Children


In [176]:
# Determine how the base model without any fine-tuning performs - I think the user question is missing here. Classify this like this...Check classify_questions functions above
model_id = 'accounts/fireworks/models/llama-v3p1-8b-instruct'

training_responses = classify_questions(
    questions=training_questions, 
    model=model_id
)
accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"Training Set Accuracy: {accuracy}%")

test_responses = classify_questions(
    questions=test_questions, 
    model=model_id
)

accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"Test Set Accuracy: {accuracy}%")

Training Set Accuracy: 0.29%
Test Set Accuracy: 0.0%


In [None]:
print(training_questions[0:2])
print(training_responses[0:2])
print(test_responses[0:2])

# Finetuned model accuracy: ft_model_id - I only have one experiment. 

In [177]:
# Determine how the fine-tuned model performs with the default fine-tuning params
model_id = f'accounts/{account_id}/models/{ft_model_id}'

training_responses = classify_questions(
    questions=training_questions, 
    model=model_id
)
accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"Training Set Accuracy: {accuracy}%")

test_responses = classify_questions(
    questions=test_questions, 
    model=model_id
)

accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"Test Set Accuracy: {accuracy}%")

Training Set Accuracy: 68.77%
Test Set Accuracy: 82.46%


In [187]:
print(training_questions[0:2])
print(training_responses[0:2])
print(training_labels[0:2])

["What's the best way to prevent cavities in my 1-3 year old child's baby teeth?", 'What are the common symptoms of cavities in children?']
['Prevention of Oral Disease in Infants', 'Dental Caries in Children']
['Dental Caries in Children', 'Dental Caries in Children']


# second one if I had to create a second yaml file with parameter changes. 

In [None]:
# Determine how the base model performs with the increases rank, epochs, and learning rate
model_id = f'accounts/{account_id}/models/{model_v2_id}'  # ft_model_id_v2

training_responses = classify_questions(
    questions=training_questions, 
    model=model_id
)
accuracy = evaluate_accuracy(training_responses, training_labels)
print(f"Training Set Accuracy: {accuracy}%")

test_responses = classify_questions(
    questions=test_questions, 
    model=model_id
)

accuracy = evaluate_accuracy(test_responses, test_labels)
print(f"Test Set Accuracy: {accuracy}%")

Training Set Accuracy: 60.29%
Test Set Accuracy: 54.41%


In [None]:
# Undeploy the first model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}

In [36]:
# Generate questions on the test set using our fine-tuned model
ft_questions = generate_questions(f'accounts/{account_id}/models/{ft_model_id}', test_data)

In [51]:
# Evaluate questions using the LLM as a Judge strategy
question_evaluation_rubric = f'''You are professional poet responsible for assessing the quality of AI generated questions.

Score each question on a scale of 0 to 10, where 10 represents the best possible question.

Scoring Guidelines:
{question_guidelines}

Think through your reasoning step-by-step and explain your reasoning. Steps for judging a question:
1. Read the question Multiple Times: Read it aloud and silently to capture both the meaning and the sound.
2. Take Notes: Jot down initial impressions, notable phrases, and any questions that arise.
3. Analyze the Elements: Break down the question into its components (content, structure, language, sound).
4. Reflect on Your Experience: Consider your emotional response and personal connection to the question.

The last line in your response MUST be a json object {{"score": XXX}}, where XXX is the score you are giving the response.'''

def evaluate_questions(questions, evaluation_model):
    scores = list()
    for question in questions:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": question_evaluation_rubric},
                {"role": "user", "content": question}
            ],
            temperature=0,
        )

        try: 
            response = response.choices[0].message.content
            score = int(json.loads(response.split('\n')[-1])['score'])  
            scores.append(score)
        except json.JSONDecodeError as jde:
            continue
        
    return sum(scores) / len(scores)

In [52]:
# Use the LLM to evaluate our fine-tuned model
ft_avg_score = evaluate_questions(ft_questions, 'accounts/fireworks/models/llama-v3-70b-instruct')
print(f"Avg LLM Judge Score: {round(ft_avg_score , 2)}")

Avg LLM Judge Score: 8.32


In [54]:
# Undeploy the fine-tuned model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}