## Setup

To complete this guide, install the following packages:
- fireworks-ai
- numpy
- openai
- pandas
- pronouncing
- python-dotenv
- requests
- sentence-transformers
- transformers

You will also need:

- Fireworks account (https://fireworks.ai/)
- Fireworks API key
- The firectl command-line interface (https://docs.fireworks.ai/tools-sdks/firectl/firectl)

In [2]:
import json

from fireworks.client import Fireworks
from openai import OpenAI
import numpy as np
import pandas as pd
import pronouncing
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

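# Use an OpenAI embedding model. This client gets its own name because
# `client` is rebound to the Fireworks client in a later cell.
openai_client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")
    return openai_client.embeddings.create(input=[text], model=model).data[0].embedding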

embeddings_model = get_embedding
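
The embedding helper returns a plain list of floats, so two texts can be compared with cosine similarity. The following is a minimal sketch: `cosine_similarity` is a hypothetical helper that is not part of the original notebook, and the vectors are toy values rather than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain lists of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.0
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With real embeddings, the two arguments would be the lists returned by `get_embedding` for two different texts.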

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [3]:
# Sign-in to your Fireworks account
!firectl signin

Signed in as: jayozer@gmail.com
Account ID: jayozer-ce1cd6


In [4]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get the API key from environment variable
api_key = os.getenv('FIREWORKS_API_KEY')

client = Fireworks(api_key=api_key)

# Replace the line below with your Fireworks account id
account_id = os.getenv('FIREWORKS_ACCOUNT_ID')
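
`os.getenv` returns `None` when a variable is missing, which only surfaces later as an authentication error. A small check up front can make that failure easier to read. This is a sketch under assumptions: `require_env` is a hypothetical helper, and the values shown are placeholders rather than real credentials.

```python
import os

def require_env(names, environ=None):
    """Return a dict of the requested variables, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}

# Example with an explicit mapping (placeholder values, not real credentials)
settings = require_env(
    ["FIREWORKS_API_KEY", "FIREWORKS_ACCOUNT_ID"],
    environ={
        "FIREWORKS_API_KEY": "placeholder-key",
        "FIREWORKS_ACCOUNT_ID": "placeholder-account",
    },
)
print(sorted(settings))
```

Calling `require_env([...])` without the `environ` argument reads from the real process environment, so it would run after `load_dotenv()`.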

## Problem Definition: Pediatric dentistry chatbot FAQ question generation

*Note: The pediatric dentistry topics used in this example were synthetically generated by OpenAI's o1 model.*

LLMs are capable of performing creative writing tasks. However, assessing the quality of such tasks, like question generation, is highly subjective.

### Task
In last week's notebook, we created a framework to quantitatively evaluate LLM-generated poetry. This week, you'll see how to improve on that solution and apply it to pediatric dentistry FAQs.

As I am not a professional dentist, I am unable to write high-level questions myself. The ideal approach would be to search the web for high-quality questions that match the topics and style I want the LLM to generate. However, this method is very time-consuming. A more efficient approach is the "critique and revise" method, where the LLM first generates critiques on how each set of generated questions can be improved. We then ask the LLM to rewrite the questions based on these critiques. Finally, we fine-tune a smaller LLM on the revised questions.
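
The three steps (generate, critique, revise) can be sketched for a single topic as follows. This is an illustration only: `critique_and_revise` and `fake_llm` are hypothetical names, the prompts are shortened stand-ins for the ones defined later in the notebook, and `fake_llm` replaces a real model call so the sketch runs offline.

```python
def critique_and_revise(topic, llm):
    """Run the three steps for one topic; `llm(system, user)` returns a string."""
    draft = llm("Write FAQ questions for the topic.", topic)
    critique = llm("List ways the questions could be improved.", draft)
    revised = llm(
        "Rewrite the questions using the critiques.",
        f"questions:\n{draft}\n\ncritiques:\n{critique}",
    )
    return {"topic": topic, "draft": draft, "critique": critique, "revised": revised}

# Stand-in for a real model call: echoes the first word of the system prompt
# and the first line of the user message
def fake_llm(system, user):
    return f"[{system.split()[0]}] {user.splitlines()[0]}"

result = critique_and_revise("Dental Caries in Children", fake_llm)
print(result["draft"])
print(result["critique"])
```

Only the `revised` text is used as fine-tuning data. The draft and critique are intermediate artifacts.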

### Data
The data can be found in the `pk_data` folder.

We will use the following datasets:
- `pk_data/pk_training_topics.csv`
- `pk_data/pk_test_topics.csv`

Each of those datasets consists of 100 unique pediatric dentistry topics, stored in a `topic` column.

In [5]:
training_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_training_topics.csv')
test_data = pd.read_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_data/pk_test_topics.csv')

### Foundation Model Baseline
Our first step is to generate a set of FAQ questions for each topic in the training data using a foundation model. We will then use the critique and revise method to improve these questions.

In [7]:
# Given a DataFrame with a `topic` column, generates a set of FAQ questions for each topic
system_message = 'You are a professional chatbot assistant. Write natural language FAQ questions for each topic suggested by the user. These questions should be written from the perspective of a parent, a teen, or another dentist. The questions should be realistic and conversational, just like a user would ask in a chatbot. Your response should contain ONLY the questions.'

def generate_questions(model, df):
    responses = list()
    for topic in df['topic']:
        response = client.chat.completions.create(
            model=model,
            messages=[
              {"role": "system", "content": system_message},
              {"role": "user", "content": topic}
            ],
        )
        responses.append(response.choices[0].message.content)
    return responses

In [8]:
# We first generate questions for the topics in our training set
llama_70b_training_questions = generate_questions('accounts/fireworks/models/llama-v3p1-70b-instruct', training_data)

In [12]:
#print(llama_70b_training_questions[1:5])
print(llama_70b_training_questions[0])

Here are some FAQ questions from different perspectives:

**Parent:**

* How can I prevent my child from getting cavities?
* What are the early signs of tooth decay in kids?
* Is it true that baby teeth are not that important since they'll fall out anyway?
* How often should I take my child to the dentist to prevent dental caries?
* Can I use regular toothpaste for my toddler or is there a special kind?
* Are there any specific foods or drinks that I should limit to prevent cavities in my child?

**Teen:**

* I don't like going to the dentist, can't I just take care of my teeth myself?
* How can I get rid of the white spots on my teeth that my dentist said are early signs of cavities?
* Is it true that sugary drinks like soda and sports drinks can cause cavities?
* Can I still get cavities if I wear braces?
* How long does it take to get a cavity filled and will it hurt?
* Can I prevent cavities by chewing sugar-free gum?

**Dentist:**

* What are the most effective ways to prevent den
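
The model groups its questions under bold persona headers (`**Parent:**`, `**Teen:**`, `**Dentist:**`) with one `* ` bullet per question. A minimal sketch of splitting such a response by persona, assuming the output keeps this layout; `parse_questions_by_persona` is a hypothetical helper and the sample text is shortened from the output above.

```python
import re

def parse_questions_by_persona(text):
    """Group '* question' bullets under the most recent '**Persona:**' header."""
    groups = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.fullmatch(r"\*\*(.+?):\*\*", line)
        if header:
            current = header.group(1)
            groups.setdefault(current, [])
        elif line.startswith("* ") and current is not None:
            groups[current].append(line[2:].strip())
    return groups

sample = """Here are some FAQ questions from different perspectives:

**Parent:**

* How can I prevent my child from getting cavities?

**Teen:**

* Can I still get cavities if I wear braces?
"""
print(parse_questions_by_persona(sample))
```

Bullets that appear before any header are ignored in this sketch, and a response that uses a different layout would need a different parser.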

### Critique
In the critique step, we create a scoring rubric and ask an LLM to suggest improvements to the previously generated questions based on the rubric.

In [13]:
# We now use our scoring rubric to generate a list of critiques for each set of questions
question_guidelines = """
-Is the question relevant to the topic and the intended audience (parents, teens, or pediatric dentists)? Does the question directly relate to the pediatric dentistry topic and provide valuable information for the user?

-Is the language natural and conversational? Does the question sound like it would be asked by a parent, teen, or pediatric dentist in a real-world scenario? Is it phrased naturally, as if in a chatbot interaction?

-Does the question anticipate the concerns or needs of the intended user? Does the question reflect the typical concerns of the person asking (e.g., a parent asking about a child's oral health, a teen asking about dental procedures, or a dentist inquiring about professional guidelines)?

-Is the question specific enough to generate a useful answer? Does the question provide enough detail to allow for a meaningful and targeted answer, avoiding overly vague or broad inquiries?

"""

question_critique_rubric = f'''You are a professional chatbot evaluator responsible for assessing the quality of AI-generated FAQ questions.

Assessment Guidelines:
{question_guidelines}

Given the above guidelines, provide a list of ways that the question could be improved.'''

def critique_questions(questions, evaluation_model):
    critiques = list()
    for question in questions:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": question_critique_rubric},
                {"role": "user", "content": question}
            ],
        )
        # Append every critique so that critiques[i] stays aligned with questions[i]
        critiques.append(response.choices[0].message.content)

    return critiques

In [16]:
llama_70b_training_critiques = critique_questions(llama_70b_training_questions, 'accounts/fireworks/models/llama-v3p1-70b-instruct')

In [17]:
#llama 3.1 70B - Works much better. 
print(llama_70b_training_questions[0])
print(llama_70b_training_critiques[0])

Here are some FAQ questions from different perspectives:

**Parent:**

* How can I prevent my child from getting cavities?
* What are the early signs of tooth decay in kids?
* Is it true that baby teeth are not that important since they'll fall out anyway?
* How often should I take my child to the dentist to prevent dental caries?
* Can I use regular toothpaste for my toddler or is there a special kind?
* Are there any specific foods or drinks that I should limit to prevent cavities in my child?

**Teen:**

* I don't like going to the dentist, can't I just take care of my teeth myself?
* How can I get rid of the white spots on my teeth that my dentist said are early signs of cavities?
* Is it true that sugary drinks like soda and sports drinks can cause cavities?
* Can I still get cavities if I wear braces?
* How long does it take to get a cavity filled and will it hurt?
* Can I prevent cavities by chewing sugar-free gum?

**Dentist:**

* What are the most effective ways to prevent den

### Revise
In the revise step, we create a new prompt that tells the LLM to generate a revised question, given the previously generated critiques.

In [18]:
# We now give the LLM both the questions and the critiques, and tell it to improve the questions based on the critiques.
improvement_sys_message = '''You are a professional chatbot assistant. Improve the generated question, given the following critiques.

Your response must ONLY contain the content of the improved question. DO NOT TELL ME YOUR CHANGES, JUST GIVE ME THE REVISED QUESTION!'''


def generate_improved_questions(model, questions, critiques):
    improved_questions = []
    for i, question in enumerate(questions):
        user_message = f'''
question:      
{question}

critiques:
{critiques[i]}'''
        
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": improvement_sys_message},
                {"role": "user", "content": user_message}
            ],
        )
        response_text = response.choices[0].message.content
        improved_questions.append(response_text)    

    return improved_questions
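
The revise step pairs each question set with its critique by position, so the two lists must have the same length. A minimal sketch of building the user messages with an explicit length check; `build_revision_messages` is a hypothetical helper, and the example inputs are made up for illustration.

```python
def build_revision_messages(questions, critiques):
    """Pair each question set with its critique and build the user message."""
    if len(questions) != len(critiques):
        raise ValueError(
            f"{len(questions)} question sets but {len(critiques)} critiques"
        )
    return [
        f"question:\n{question}\n\ncritiques:\n{critique}"
        for question, critique in zip(questions, critiques)
    ]

messages = build_revision_messages(
    ["* How often should my child see the dentist?"],
    ["Mention the child's age to make the question more specific."],
)
print(messages[0])
```

Failing early on a length mismatch avoids silently revising a question set against the wrong critique.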

In [19]:
llama_70b_training_improved_questions = generate_improved_questions(
    'accounts/fireworks/models/llama-v3p1-70b-instruct',
    llama_70b_training_questions,
    llama_70b_training_critiques
)

In [20]:
print(llama_70b_training_improved_questions[0])

Here are the revised questions:

**Parent:**

* What are the most effective ways to prevent cavities in children under the age of 5?
* What are the typical symptoms of tooth decay in children, and how can I identify them?
* Why are baby teeth important, and what role do they play in my child's oral health?
* What is the recommended schedule for dental check-ups for children, and how can I prevent dental caries between visits?
* What type of toothpaste is recommended for toddlers, and what ingredients should I look for to ensure their safety?
* What are the top foods and drinks that contribute to cavities in children, and how can I limit their consumption?

**Teen:**

* What are the benefits of regular dental check-ups, and how can I take care of my teeth at home to prevent cavities?
* What are the best ways to reverse early signs of cavities, and can I prevent them from becoming full-blown cavities?
* How do sugary drinks contribute to cavities, and what are some healthier alternatives

In [22]:
llama_70b_training_improved_questions[0]

"Here are the revised questions:\n\n**Parent:**\n\n* What are the most effective ways to prevent cavities in children under the age of 5?\n* What are the typical symptoms of tooth decay in children, and how can I identify them?\n* Why are baby teeth important, and what role do they play in my child's oral health?\n* What is the recommended schedule for dental check-ups for children, and how can I prevent dental caries between visits?\n* What type of toothpaste is recommended for toddlers, and what ingredients should I look for to ensure their safety?\n* What are the top foods and drinks that contribute to cavities in children, and how can I limit their consumption?\n\n**Teen:**\n\n* What are the benefits of regular dental check-ups, and how can I take care of my teeth at home to prevent cavities?\n* What are the best ways to reverse early signs of cavities, and can I prevent them from becoming full-blown cavities?\n* How do sugary drinks contribute to cavities, and what are some health

In [21]:
print(llama_70b_training_improved_questions[0:2])



In [23]:
print(training_data['topic'][0:2])

0    Dental Caries in Children
1       Early Childhood Caries
Name: topic, dtype: object


In [24]:
import re

# Topics are aligned by position with the improved question sets
topics = training_data['topic'].tolist()

# 'llama_70b_training_improved_questions' is your list of improved questions
# Initialize empty lists to store data
questions = []
topics_list = []

# Process each set of improved questions
for idx, text in enumerate(llama_70b_training_improved_questions):
    topic = topics[idx]  # Get the corresponding topic

    # Use regular expressions to find all questions marked by an asterisk '*'
    qs = re.findall(r'\* (.*)', text)

    # Append each question and its topic to the lists
    for q in qs:
        questions.append(q.strip())
        topics_list.append(topic)

# Create a DataFrame from the lists with columns 'text' and 'label'
df = pd.DataFrame({
    'text': questions,
    'label': topics_list
})

# Save the DataFrame to a TSV file
df.to_csv('/Users/acrobat/Documents/GitHub/fine-tuning-workshop/poppykids/pk_improved_question_training_dataset.tsv', sep='\t', index=False)

# Display the DataFrame
print(df)

                                                   text  \
0     What are the most effective ways to prevent ca...   
1     What are the typical symptoms of tooth decay i...   
2     Why are baby teeth important, and what role do...   
3     What is the recommended schedule for dental ch...   
4     What type of toothpaste is recommended for tod...   
...                                                 ...   
1599  What are the most common oral health issues th...   
1600  What are some systemic health conditions that ...   
1601  What are some key signs and symptoms that can ...   
1602  How do bacteria contribute to bad breath in ch...   
1603  What are some age-specific factors that can co...   

                               label  
0          Dental Caries in Children  
1          Dental Caries in Children  
2          Dental Caries in Children  
3          Dental Caries in Children  
4          Dental Caries in Children  
...                              ...  
1599  Causes of Pe

# works!!!
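
Before using the extracted rows, it may be worth checking for repeated questions and for how evenly the questions are spread across topics. The sketch below uses only the standard library; `summarize_rows` is a hypothetical helper, and the rows are illustrative rather than taken from the real dataset.

```python
from collections import Counter

def summarize_rows(rows):
    """Drop duplicate (text, label) pairs, ignoring case and surrounding
    whitespace in the text, and count the remaining questions per label."""
    seen = set()
    unique_rows = []
    for text, label in rows:
        key = (text.strip().lower(), label)
        if key not in seen:
            seen.add(key)
            unique_rows.append((text.strip(), label))
    counts = Counter(label for _, label in unique_rows)
    return unique_rows, counts

rows = [
    ("Why are baby teeth important?", "Dental Caries in Children"),
    ("why are baby teeth important?", "Dental Caries in Children"),
    ("What causes early childhood caries?", "Early Childhood Caries"),
]
unique_rows, counts = summarize_rows(rows)
print(counts)
```

With the DataFrame above, the input could be built with `zip(df['text'], df['label'])`.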

### Fine-Tuning
We now fine-tune a smaller LLM using the revised questions. This is similar to the knowledge distillation method from last week's notebook, except we are fine-tuning on the larger model's revised questions rather than on its original outputs.

JAY: My dataset will be different: it will resemble categories, with the topic as the label. Check the week 2 dataset and how that one is constructed.

In [21]:
# Upload the improved questions to Fireworks as our fine-tuning dataset
def format_example_for_fireworks(topic, improved):
    return {"messages": [
        {"role": "system", "content": system_message},
        {"role": "user", "content": topic},
        {"role": "assistant", "content": improved}
    ]}

topics = training_data['topic'].tolist()
json_objs = [
    format_example_for_fireworks(topic, improved)
    for topic, improved in zip(topics, llama_70b_training_improved_questions)
]

dataset_file_name = 'poem_training_data.jsonl'
dataset_id = 'improved-poem-data-v1'
with open(dataset_file_name, 'w') as f:
    for obj in json_objs:
        json.dump(obj, f)
        f.write('\n')
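
Each line of the JSONL file is one chat record with a system, user, and assistant message. A quick check before uploading can catch malformed lines. This is a sketch, not a Fireworks requirement: `validate_jsonl_lines` is a hypothetical helper, and the expected role order is an assumption taken from the record layout in the cell above.

```python
import json

# Assumed from the record layout used in this notebook
EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_jsonl_lines(lines):
    """Return a list of (line_number, problem) for records that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append((number, "missing 'messages' list"))
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != EXPECTED_ROLES:
            problems.append((number, f"unexpected roles: {roles}"))
        elif not all(
            isinstance(m.get("content"), str) and m["content"].strip()
            for m in messages
        ):
            problems.append((number, "empty content"))
    return problems

good = json.dumps({"messages": [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "assistant", "content": "a"},
]})
print(validate_jsonl_lines([good, "{not json", json.dumps({"messages": []})]))
```

To check the file written above, pass an open file handle, for example `validate_jsonl_lines(open(dataset_file_name))`.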

In [23]:
# Upload our dataset to fireworks
!firectl create dataset {dataset_id} {dataset_file_name}

In [25]:
# Create a fine-tuning job
!firectl create fine-tuning-job --settings-file poem_generation_fine_tuning_config.yaml --display-name improved-poems-v1 --dataset {dataset_id} 

In [26]:
# NOTE THAT THIS ID WILL CHANGE WHEN YOU RUN THE FINE-TUNING JOB ON YOUR ACCOUNT!!!
# The model id is printed in the stdout of the cell above as Name: accounts/{account_id}/fineTuningJobs/{ft_model_id}
ft_model_id = '3dd6bdfb938546d88a7db95673124266' 

In [30]:
# Wait until the State of the fine-tuning job is listed as COMPLETED (~10-20 minutes)
!firectl get fine-tuning-job {ft_model_id}

### Evaluation
Finally, we evaluate the fine-tuned model on our test data. In the previous week's notebook, the knowledge distillation method resulted in an average LLM judge score of 8.21. We expect a higher score now that we are fine-tuning on the revised outputs rather than the initial outputs that the large model generated.

In [32]:
# Deploy the fine-tuned model
!firectl deploy {ft_model_id}

In [35]:
# Wait until the Deployed Model Refs lists the state of the model as "DEPLOYED" (~5-20 minutes).
!firectl get model {ft_model_id}

In [36]:
# Generate outputs on the test set using our fine-tuned model
ft_poems = generate_questions(f'accounts/{account_id}/models/{ft_model_id}', test_data)

In [48]:
# Evaluate poems based on their average length (# of characters)
def calculate_avg_length(poems):
    return int(np.mean([len(poem) for poem in poems]))

# Evaluate poems based on the fraction of stanzas that contain a rhyme
def calculate_rhyming_fct(poem):
    # str.split always returns at least one element, so stanzas is never empty
    stanzas = poem.split('\n\n')

    num_rhyming_stanzas = 0
    for stanza in stanzas:
        lines = stanza.split('\n')
        end_words = [line.split(' ')[-1].strip('.?!"\',') for line in lines]
        # A stanza rhymes if any pair of line-ending words rhymes
        found_rhyme = any(
            end_words[j] in pronouncing.rhymes(end_words[i])
            for i in range(len(end_words))
            for j in range(i + 1, len(end_words))
        )
        if found_rhyme:
            num_rhyming_stanzas += 1

    return num_rhyming_stanzas / len(stanzas)

# Evaluate poems based on how often they have a positive sentiment
def has_positive_sentiment(poem):
    try:
        sentiment = sentiment_pipeline(poem)[0]
        return sentiment['label'] == 'POSITIVE'
    except Exception:
        # Inputs the pipeline fails on are counted as positive
        return True
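
The heuristics above were written for poems, and rhyme in particular says little about FAQ questions. As a hedged sketch of what question-oriented heuristics might look like, the helpers below measure how many bulleted lines are phrased as questions and how long they are. All three function names are hypothetical, and they assume the `* ` bullet layout seen in the generated output.

```python
def extract_bullets(text):
    """Return the text of every line that starts with '* '."""
    return [
        line.strip()[2:].strip()
        for line in text.splitlines()
        if line.strip().startswith("* ")
    ]

def question_mark_fraction(text):
    """Fraction of bulleted lines that end with a question mark (0.0 if none)."""
    bullets = extract_bullets(text)
    if not bullets:
        return 0.0
    return sum(b.endswith("?") for b in bullets) / len(bullets)

def average_word_count(text):
    """Average number of whitespace-separated words per bullet (0.0 if none)."""
    bullets = extract_bullets(text)
    if not bullets:
        return 0.0
    return sum(len(b.split()) for b in bullets) / len(bullets)

sample = "**Parent:**\n\n* How often should my child floss?\n* Tell me about fluoride."
print(question_mark_fraction(sample))
print(average_word_count(sample))
```

These numbers are rough proxies for format and specificity, not measures of how useful a question is.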

In [49]:
# Calculate heuristics of our fine-tuned poems
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(ft_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in ft_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in ft_poems]))}%")

Heuristic Evaluation
Average Length: 1255
Rhyming Pct: 73%
Positive Sentiment: 92%


In [51]:
# Evaluate poems using the LLM as a Judge strategy
poem_evaluation_rubric = f'''You are a professional poet responsible for assessing the quality of AI-generated poems.

Score each poem on a scale of 0 to 10, where 10 represents the best possible poem.

Scoring Guidelines:
{poem_guidelines}

Think through your reasoning step-by-step and explain your reasoning. Steps for judging a poem:
1. Read the Poem Multiple Times: Read it aloud and silently to capture both the meaning and the sound.
2. Take Notes: Jot down initial impressions, notable phrases, and any questions that arise.
3. Analyze the Elements: Break down the poem into its components (content, structure, language, sound).
4. Reflect on Your Experience: Consider your emotional response and personal connection to the poem.

The last line in your response MUST be a json object {{"score": XXX}}, where XXX is the score you are giving the response.'''

def evaluate_poems(poems, evaluation_model):
    scores = list()
    for poem in poems:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": poem_evaluation_rubric},
                {"role": "user", "content": poem}
            ],
            temperature=0,
        )

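        try:
            response = response.choices[0].message.content
            score = int(json.loads(response.split('\n')[-1])['score'])
            scores.append(score)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Skip responses whose last line is not a valid {"score": N} object
            continue

    # Avoid dividing by zero when no response could be parsed
    return sum(scores) / len(scores) if scores else float('nan')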
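
Parsing only the last line with `json.loads` drops any response where the judge adds trailing text after the score. A more tolerant alternative, shown here as a sketch rather than the notebook's method, searches the whole response with a regular expression and takes the last match. `extract_score` is a hypothetical helper.

```python
import re

# Matches objects such as {"score": 8} or { "score" : 7.5 }
SCORE_PATTERN = re.compile(r'\{\s*"score"\s*:\s*(-?\d+(?:\.\d+)?)\s*\}')

def extract_score(response_text):
    """Return the last {"score": N} value in the text as a float, or None."""
    matches = SCORE_PATTERN.findall(response_text)
    if not matches:
        return None
    return float(matches[-1])

print(extract_score('The questions are specific and natural.\n{"score": 8}'))
print(extract_score('{"score": 7}\nThanks for reading.'))
print(extract_score("No score was given."))
```

Returning `None` instead of raising lets the caller decide whether to skip the response or count it as a failure.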

In [52]:
# Use the LLM to evaluate our fine-tuned model
ft_avg_score = evaluate_poems(ft_poems, 'accounts/fireworks/models/llama-v3-70b-instruct')
print(f"Avg LLM Judge Score: {round(ft_avg_score , 2)}")

Avg LLM Judge Score: 8.32


In [54]:
# Undeploy the fine-tuned model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}