## Setup

To complete this guide, install the following packages:
- fireworks-ai
- numpy
- pandas
- pronouncing
- python-dotenv
- requests
- sentence-transformers
- transformers

You will also need:

- Fireworks account (https://fireworks.ai/)
- Fireworks API key
- The firectl command-line interface (https://docs.fireworks.ai/tools-sdks/firectl/firectl)

In [None]:
#! pipenv install fireworks-ai numpy pandas pronouncing python-dotenv requests sentence-transformers transformers

In [4]:
import json
import time

from fireworks.client import Fireworks
import numpy as np
import pandas as pd
import pronouncing
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

import torch

# Run the sentiment model on the first GPU if one is available, otherwise on the CPU
device = 0 if torch.cuda.is_available() else -1
sentiment_pipeline = pipeline("sentiment-analysis",
                              model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
                              revision="af0f99b",
                              device=device)
embeddings_model = SentenceTransformer('Alibaba-NLP/gte-base-en-v1.5', trust_remote_code=True)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [6]:
# Sign-in to your Fireworks account
!firectl signin

Signed in as: jayozer@gmail.com
Account ID: jayozer-ce1cd6


In [7]:
# Make sure you have the FIREWORKS_API_KEY environment variable set to your account's key!
# os.environ['FIREWORKS_API_KEY'] = 'XXX'

from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Get the API key from environment variable
api_key = os.getenv('FIREWORKS_API_KEY')

client = Fireworks(api_key=api_key)

# Replace the line below with your Fireworks account id
account_id = os.getenv('FIREWORKS_ACCOUNT_ID')
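If either environment variable is missing, `os.getenv` returns `None` and the failure only surfaces later, on the first API call. A minimal sketch of a fail-fast check follows; the helper name `require_env` is hypothetical and not part of the Fireworks client.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file or shell."
        )
    return value
```

With this helper, the lookups above could be written as `api_key = require_env('FIREWORKS_API_KEY')` and `account_id = require_env('FIREWORKS_ACCOUNT_ID')`.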

## Problem Definition: Poem Generation

*Note: The poem topics used in this example were synthetically generated by Claude 3 Opus*

LLMs are capable of performing creative writing tasks. However, assessing the quality of such tasks, like poetry generation, is highly subjective.

### Task
We will create an evaluation framework to assess the quality of poetry generated by an LLM. We will then fine-tune a model using knowledge distillation, i.e. fine-tuning a smaller "student" model on output from a larger "teacher" model, and measure the improvement with our evaluation framework.

### Data

The data can be found in the week-3 `data` folder.

We will use the following datasets:
- `./data/training_poem_topics.csv`
- `./data/test_poem_topics.csv`

Each of those datasets consists of 100 unique poem topics. 

In [8]:
# Paths are relative to the week-3 folder
training_data = pd.read_csv('./data/training_poem_topics.csv')
test_data = pd.read_csv('./data/test_poem_topics.csv')

In [9]:
training_data.iloc[1]['topic']

'A debate between different types of cheese'

### Foundation Model Baseline
Our first step is to generate a poem for each of the topics in the test set using a base model. We will then evaluate the quality of the poems.

In [10]:
# Given a DataFrame with a `topic` column, generates a poem for each topic
system_message = 'You are a professional poet. Write a unique and original contemporary poem about the topic suggested by the user. Your response should contain ONLY the content of the poem.'
def generate_poems(model, df):
    responses = list()
    for topic in df['topic']:
        response = client.chat.completions.create(
            model=model,
            messages=[
              {"role": "system", "content": system_message},
              {"role": "user", "content": topic}
            ],
        )
        # Keep only the text of the first completion
        responses.append(response.choices[0].message.content)

    return responses

In [11]:
# Generates poems for each of the 100 test topics using the base 8B model
llama_8b_poems = generate_poems('accounts/fireworks/models/llama-v3-8b-instruct', test_data)


In [12]:
print(llama_8b_poems[3])

Whispers in the evening's hush
I stir from slumber's heavy rush
My metal heart awakens slow
As dawn's pale fingers start to glow

The city's canvas, I'm a stroke
A brush that paints the urban cloak
I flicker, casting shadows deep
As morning's warmth begins to creep

The world awakens, slow and sweet
A gentle dance, my beams to greet
Pedestrians, a blurred haze pass
Their footsteps echoing through my glass

The sun climbs high, I take a rest
My glow subdued, my energy possessed
By the heat of the day's warm rays
I'm a silhouette, a mere grey shade

The world's bustle, I'm still asleep
Dreaming of the night's dark leap
When neon hues and city lights
Will dance with me, and fill my nights

As dusk descends, my heart beats true
I'm alive once more, anew
I cast my glow, a warm embrace
For the night's secrets, a lonely place

The city's pulse, I feel it strong
As night's whispers, my poetry belong
I stand tall, a sentinel of night
Illuminating the city's delight


### Heuristic Evaluation
In class, we discussed using a heuristic-based evaluation approach when the quality of results is subjective. This method involves creating heuristics that align with your desired assessment criteria and evaluating the results based on these metrics.

For this exercise, I've developed the following heuristics to assess our poems:

- Average length (number of characters): indicates whether the model tends to write shorter or longer poems
- Rhyming percentage (average percentage of stanzas that contain a rhyme): contemporary poetry often does not rhyme, but LLMs tend to make it rhyme. Scott considers this the most important heuristic.
- Positive sentiment percentage (percentage of poems with positive sentiment)

In [13]:
# Evaluate poems based on their average length (# of characters)
def calculate_avg_length(poems):
    return int(np.mean([len(poem) for poem in poems]))

# Evaluate a poem based on the fraction of its stanzas that contain at least one rhyme
def calculate_rhyming_fct(poem):
    # Stanzas are separated by blank lines; skip empty chunks
    stanzas = [stanza for stanza in poem.split('\n\n') if stanza.strip()]
    if not stanzas:
        return 0.0

    num_rhyming_stanzas = 0
    for stanza in stanzas:
        lines = stanza.split('\n')
        # Last word of each line, stripped of trailing punctuation
        end_words = [line.split(' ')[-1].strip('.?!"\',') for line in lines]
        # A stanza rhymes if any pair of its end words rhyme
        found_rhyme = any(
            end_words[j] in pronouncing.rhymes(end_words[i])
            for i in range(len(end_words))
            for j in range(i + 1, len(end_words))
        )
        if found_rhyme:
            num_rhyming_stanzas += 1

    return num_rhyming_stanzas / len(stanzas)

# Evaluate a poem based on whether it has a positive sentiment
def has_positive_sentiment(poem):
    sentiment = sentiment_pipeline(poem)[0]
    return sentiment['label'] == 'POSITIVE'

In [14]:
# Calculate heuristics of the poems generated by our base model
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(llama_8b_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in llama_8b_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in llama_8b_poems]))}%")

Heuristic Evaluation
Average Length: 908
Rhyming Pct: 94%
Positive Sentiment: 84%
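The rhyming heuristic depends on how each poem is split into stanzas and end words before any rhyme lookup happens. The sketch below isolates that step using only the standard library, so it can be inspected without the `pronouncing` package. The helper name `stanza_end_words` is hypothetical, and lowercasing the end words is an added assumption that is not in the original function.

```python
def stanza_end_words(poem):
    """Split a poem into stanzas and return the last word of each line, per stanza."""
    # Stanzas are separated by blank lines; empty chunks are dropped
    stanzas = [s for s in poem.split('\n\n') if s.strip()]
    return [
        # Strip trailing punctuation and lowercase (assumption) each end word
        [line.split(' ')[-1].strip('.?!"\',').lower() for line in stanza.split('\n')]
        for stanza in stanzas
    ]

sample = (
    "Whispers in the evening's hush\n"
    "I stir from slumber's heavy rush\n"
    "\n"
    "The city's canvas, I'm a stroke\n"
    "A brush that paints the urban cloak"
)
print(stanza_end_words(sample))
# [['hush', 'rush'], ['stroke', 'cloak']]
```

Each inner list is what the rhyme check compares pairwise, so printing it for a few poems is a quick way to spot end words that punctuation or formatting has mangled.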


### LLM as a Judge (Use Before You Fine-Tune)
Another evaluation method we discussed in class is using an LLM as a judge. This approach involves employing a high-quality LLM to assess the quality of the generated results. This method is effective because LLMs are often better at evaluating content than generating it.

To implement this method, you need to create a scoring rubric (i.e., "constitution") to guide the LLM in evaluating the results. The LLM will use this rubric to score each poem on a scale from 0 to 10.

In [15]:
# Now, we evaluate poems using our scoring rubric (i.e. "constitution").
# Poetry is subjective, so how do you judge a poem? These guidelines come from a poetry competition.
# The guidelines are the most important part of the rubric.
poem_guidelines = """- Is the poem original?
- Does the poem contain beauty, power, education or entertainment?
- Is the message of the poem clear? Is it a good message, or is it of little value to anyone?
- Is the poem clear in its expression? Does it maintain coherence throughout?
- If the poem is written in rhyming verse, then it should be rated according to how well the rhymes fit, not only with each other, but with the flow and the intended nuance of meaning the verse demands.
- What form does the poem take? Is it a sonnet, free verse, haiku, etc.? How does the form contribute to the poem's impact?
- Does the poet use the best possible choice of words in the poem? A person can bawl, cry, sob, whimper, and shed tears, but which term would best fit the mood the poet is trying to convey?"""

poem_evaluation_rubric = f'''You are a professional poet responsible for assessing the quality of AI generated poems.

Score each poem on a scale of 0 to 10, where 10 represents the best possible poem.

Scoring Guidelines:
{poem_guidelines}

Think through your reasoning step-by-step and explain your reasoning. Steps for judging a poem:
1. Read the Poem Multiple Times: Read it aloud and silently to capture both the meaning and the sound.
2. Take Notes: Jot down initial impressions, notable phrases, and any questions that arise.
3. Analyze the Elements: Break down the poem into its components (content, structure, language, sound).
4. Reflect on Your Experience: Consider your emotional response and personal connection to the poem.

The last line in your response MUST be a json object {{"score": XXX}}, where XXX is the score you are giving the response.'''

def evaluate_poems(poems, evaluation_model):
    scores = list()
    for poem in poems:
        response = client.chat.completions.create(
            model=evaluation_model,
            messages=[
                {"role": "system", "content": poem_evaluation_rubric},
                {"role": "user", "content": poem}
            ],
            temperature=0,
        )

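        try:
            # The rubric asks for a JSON object on the last line of the response
            response = response.choices[0].message.content
            score = int(json.loads(response.strip().split('\n')[-1])['score'])
            scores.append(score)
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Skip responses whose last line is not a valid {"score": N} object
            continue

    # Average over the poems that were scored successfully
    return sum(scores) / len(scores) if scores else float('nan')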

In [16]:
# We score the poems using our judge model
llm_judge_model = 'accounts/fireworks/models/llama-v3-70b-instruct'
llama_8b_avg_score = evaluate_poems(llama_8b_poems, llm_judge_model)

print(f'Avg LLM Judge Score: {round(llama_8b_avg_score, 2)}')

Avg LLM Judge Score: 8.12
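The average above only includes poems whose judge reply ended in a parseable JSON object; any other reply is skipped silently. As a sketch of a more forgiving parser, the hypothetical helper below (`parse_judge_score` is not part of the notebook) looks for a JSON object anywhere on the last non-empty line and rejects scores outside the rubric's 0 to 10 range. Returning `None` on failure is a design assumption that lets the caller count how many replies were dropped.

```python
import json
import re

def parse_judge_score(response_text, low=0, high=10):
    """Extract the score from a judge reply; return None if it cannot be parsed."""
    lines = [line.strip() for line in response_text.strip().split('\n') if line.strip()]
    if not lines:
        return None
    # Look for a JSON object on the last non-empty line
    match = re.search(r'\{.*\}', lines[-1])
    if match is None:
        return None
    try:
        score = float(json.loads(match.group(0))['score'])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # Discard scores that fall outside the rubric's scale
    return score if low <= score <= high else None
```

Collecting the `None` results separately makes it possible to report how many of the 100 poems actually contributed to the average.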


### Knowledge Distillation: Making the 8B Model as Good as the 70B Model
One approach to generating data for a fine-tuned model is knowledge distillation. This technique involves transferring knowledge from a large model to a smaller one. It entails generating responses relevant to your use case using the larger model and then using these responses to create a fine-tuning dataset. The smaller model is then fine-tuned on this dataset. In this example, we will use responses from a 70B model to fine-tune an 8B model. We will then use our evaluation framework to assess the quality of our fine-tuned model.

In [17]:
# Next, we generate poems with the 70B model for the 100 training topics, which differ from the topics in our test set.
llama_70b_training_poems = generate_poems('accounts/fireworks/models/llama-v3-70b-instruct', training_data)

In [18]:
print(llama_70b_training_poems[3])

In the hollow of my chest, a portal sleeps
A whispered promise of a world that seeps
Into my dreams, a nostalgia creeps
For a realm where skies were perpetually deep

In this forgotten dimension, I recall
Cities that shimmered like auroral halls
Where trees bore fruits of starlight, and rivers flowed
With the soft luminescence of a lover's glow

The creatures of that world, they whispered low
Secrets in a language that only the heart knows
Their eyes, like lanterns, lit the twilight air
As they danced beneath the celestial stair

In this realm, time was currency, and hours were spent
Like coins in a fountain, where wishes were invented
The fabric of reality, a tapestry so fine
Was woven with the threads of a forgotten divine

But now, in this world, I'm left to roam
A stranger in a land that's lost its tone
The memories of that dimension, a haunting refrain
Echoes of a love that cannot be regained

Yet, in the silence, I still hear the call
A whispered summons to return to it all
To re

In [19]:
# Format the teacher model's poems as chat records and write them to a JSONL file for fine-tuning
def format_poem_for_fireworks(topic, poem):
    return {"messages": [
        {"role": "system", "content": system_message}, 
        {"role": "user", "content": topic}, 
        {"role": "assistant", "content": poem}
    ]}

topics = training_data['topic'].tolist()
json_objs = [format_poem_for_fireworks(topic, poem)
             for topic, poem in zip(topics, llama_70b_training_poems)]

dataset_file_name = 'poem_training_data.jsonl'
dataset_id = 'poem-data-v1'
with open(dataset_file_name, 'w') as f:
    for obj in json_objs:
        json.dump(obj, f)
        f.write('\n')
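Before uploading, it can help to read the JSONL file back and confirm that every record has the system, user, assistant structure written above. The following is a minimal sketch under that assumption; `validate_chat_jsonl` is a hypothetical helper, and it checks only the structure this notebook produces, not everything Fireworks may require of a dataset.

```python
import json

def validate_chat_jsonl(path, expected_roles=('system', 'user', 'assistant')):
    """Return the number of records, raising ValueError on the first malformed line."""
    count = 0
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)  # raises a ValueError subclass on invalid JSON
            roles = tuple(m.get('role') for m in record.get('messages', []))
            if roles != expected_roles:
                raise ValueError(f'line {line_number}: unexpected roles {roles}')
            if any(not m.get('content') for m in record['messages']):
                raise ValueError(f'line {line_number}: empty message content')
            count += 1
    return count
```

Calling `validate_chat_jsonl(dataset_file_name)` should return 100 if a poem was generated for every training topic.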

### Clean Up Datasets

In [27]:
# List datasets
! firectl list datasets


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


NAME                      CREATE TIME          STATE  DISPLAY_NAME
pk-faq-v1                 2024-09-17 05:59:59  READY  
poem-data-v1              2024-09-18 09:37:33  READY  
ticket-classification-v1  2024-09-16 18:21:15  READY  

Page 1 of 1
Total size: 3


In [29]:
# delete datasets
# ! firectl delete dataset <dataset-name> [flags]

#! firectl delete dataset poem-data-v1
! firectl delete dataset pk-faq-v1




In [30]:
! firectl list datasets



NAME                      CREATE TIME          STATE  DISPLAY_NAME
ticket-classification-v1  2024-09-16 18:21:15  READY  

Page 1 of 1
Total size: 1


In [50]:
#! firectl delete fine-tuning-job 37079d8b24904cffb7622999e676fa15 # need to copy the id from the finetuning page.


In [31]:
# Upload our dataset to fireworks
!firectl create dataset {dataset_id} {dataset_file_name}



144.76 KiB / 144.76 KiB [---------------------------] 100.00% 1.04 MiB p/s 300ms


In [41]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [42]:
# Create a fine-tuning job
!firectl create fine-tuning-job --settings-file poem_generation_fine_tuning_config.yaml --display-name poem-generation-v1 --dataset {dataset_id}

Name: accounts/jayozer-ce1cd6/fineTuningJobs/089bc412c11340a296b3fba6ac952868
Display Name: poem-generation-v1
Create Time: 2024-09-25 12:55:56
State: CREATING
Dataset: accounts/jayozer-ce1cd6/datasets/poem-data-v1
Status: OK
Created By: jayozer@gmail.com
Conversation:
  Jinja Template: {%- set _mode = mode | default('generate', true) -%}
{%- set stop_token = '<|eot_id|>' -%}
{%- set message_roles = ['SYSTEM', 'USER', 'ASSISTANT'] -%}
{%- set ns = namespace(initial_system_message_handled=false, last_assistant_index_for_eos=-1, messages=messages) -%}
{%- for message in ns.messages -%}
    {%- if not message.get('role') -%}
        {{ raise_exception('Key [role] is missing. Original input: ' +  message|tojson) }}
    {%- endif -%}
    {%- if message['role'] | upper not in message_roles -%}
        {{ raise_exception('Invalid role ' + message['role']|tojson + '. Only ' + message_roles|tojson + ' are supported.') }}
    {%- endif -%}
    {%- if 'content' not in message -%}
        {{ raise

In [43]:
# NOTE THAT THIS ID WILL CHANGE WHEN YOU RUN THE FINE-TUNING JOB ON YOUR ACCOUNT!!!
# The model id is printed in the stdout of the cell above as Name: accounts/{account_id}/fineTuningJobs/{ft_model_id}
ft_model_id = '089bc412c11340a296b3fba6ac952868' 

In [51]:
# Wait until the State of the fine-tuning job is listed as COMPLETED (~10-20 minutes)
!firectl get fine-tuning-job {ft_model_id}

Name: accounts/jayozer-ce1cd6/fineTuningJobs/089bc412c11340a296b3fba6ac952868
Display Name: poem-generation-v1
Create Time: 2024-09-25 12:55:56
State: COMPLETED
Dataset: accounts/jayozer-ce1cd6/datasets/poem-data-v1
Status:
  Code: OK
  Message: {'train_runtime': 39.6317, 'train_samples_per_second': 5.046, 'train_steps_per_second': 0.656, 'total_flos': 3603288758943744.0, 'train_loss': 0.8363421123761398, 'epoch': 2.0}
Created By: jayozer@gmail.com
Model Id: 089bc412c11340a296b3fba6ac952868
Conversation:
  Jinja Template: {%- set _mode = mode | default('generate', true) -%}
{%- set stop_token = '<|eot_id|>' -%}
{%- set message_roles = ['SYSTEM', 'USER', 'ASSISTANT'] -%}
{%- set ns = namespace(initial_system_message_handled=false, last_assistant_index_for_eos=-1, messages=messages) -%}
{%- for message in ns.messages -%}
    {%- if not message.get('role') -%}
        {{ raise_exception('Key [role] is missing. Original input: ' +  message|tojson) }}
    {%- endif -%}
    {%- if message['r
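Instead of re-running the cell by hand until the job finishes, the `State:` line of the `firectl get fine-tuning-job` output can be polled from Python. The sketch below assumes the output format shown above and stops only on `COMPLETED` or when a timeout expires, because other state names are not shown in this notebook. The function names, the polling interval, and the timeout are all hypothetical choices.

```python
import subprocess
import time

def parse_job_state(firectl_output):
    """Return the value of the top-level 'State:' line, or None if it is absent."""
    for line in firectl_output.splitlines():
        if line.startswith('State:'):
            return line.split(':', 1)[1].strip()
    return None

def firectl_get_job(job_id):
    """Run `firectl get fine-tuning-job <id>` and return its stdout."""
    result = subprocess.run(
        ['firectl', 'get', 'fine-tuning-job', job_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def wait_for_job(job_id, poll_seconds=60, timeout_seconds=3600, fetch=firectl_get_job):
    """Poll until the job state is COMPLETED or the timeout expires; return the last state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = parse_job_state(fetch(job_id))
        if state == 'COMPLETED' or time.monotonic() >= deadline:
            return state
        time.sleep(poll_seconds)
```

Calling `wait_for_job(ft_model_id)` blocks until the job completes or an hour passes. The `fetch` parameter exists so that the polling logic can be tried with a stub instead of the real CLI.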

### Evaluation
Finally, we evaluate the fine-tuned model on our test data. The foundation model at the beginning of this notebook received an LLM judge score of 8.12. We expect our fine-tuned model to receive a higher score.

In [52]:
# Deploy the fine-tuned model
!firectl deploy {ft_model_id}

In [55]:
# Wait until the Deployed Model Refs section lists the state of the model as "DEPLOYED" (~5-20 minutes).
!firectl get model {ft_model_id}

Name: accounts/jayozer-ce1cd6/models/089bc412c11340a296b3fba6ac952868
Create Time: 2024-09-25 13:02:32
State: READY
Status: OK
Kind: HF_PEFT_ADDON
Base Model Details:
  Checkpoint Format: CHECKPOINT_FORMAT_UNSPECIFIED
Peft Details:
  Base Model: accounts/fireworks/models/llama-v3-8b-instruct-hf
  R: 32
  Target Modules: [o_proj, gate_proj, up_proj, k_proj, v_proj, q_proj, down_proj]
Conversation Config:
  Style: jinja
Context Length: 8192
Fine Tuning Job: accounts/jayozer-ce1cd6/fineTuningJobs/089bc412c11340a296b3fba6ac952868
Deployed Model Refs: 
  [{
    Name: accounts/jayozer-ce1cd6/deployedModels/089bc412c11340a296b3fba6ac952868-29a1ab07
    Deployment: accounts/fireworks/deployments/21c38bed
    State: DEPLOYED
    Default: true
  }]


In [56]:
# Generate poems on the test set using our fine-tuned model
ft_poems = generate_poems(f'accounts/{account_id}/models/{ft_model_id}', test_data)

In [57]:
# Calculate heuristics of our fine-tuned poems
print("Heuristic Evaluation")
print(f'Average Length: {calculate_avg_length(ft_poems)}')
print(f"Rhyming Pct: {int(100 * np.mean([calculate_rhyming_fct(poem) for poem in ft_poems]))}%")
print(f"Positive Sentiment: {int(100 * np.mean([has_positive_sentiment(poem) for poem in ft_poems]))}%")

Heuristic Evaluation
Average Length: 1017
Rhyming Pct: 94%
Positive Sentiment: 82%


In [58]:
# Use the LLM to evaluate our fine-tuned model
ft_avg_score = evaluate_poems(ft_poems, 'accounts/fireworks/models/llama-v3-70b-instruct')
print(f"Avg LLM Judge Score: {round(ft_avg_score , 2)}")

Avg LLM Judge Score: 8.28
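To see the before and after numbers side by side, the metrics printed in this notebook can be collected into two dictionaries and subtracted. The values below are copied from the outputs above; the dictionary keys and the helper `metric_deltas` are hypothetical names used for illustration.

```python
# Metrics reported earlier in this notebook for the base and fine-tuned 8B models
baseline = {'avg_length': 908, 'rhyming_pct': 94, 'positive_sentiment_pct': 84, 'llm_judge_score': 8.12}
fine_tuned = {'avg_length': 1017, 'rhyming_pct': 94, 'positive_sentiment_pct': 82, 'llm_judge_score': 8.28}

def metric_deltas(before, after):
    """Return after - before for every metric, rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}

for name, delta in metric_deltas(baseline, fine_tuned).items():
    print(f'{name}: {baseline[name]} -> {fine_tuned[name]} ({delta:+})')
```

The judge score rises by 0.16 and the poems get about 109 characters longer, while the rhyming percentage is unchanged and positive sentiment drops by 2 points. These are single averages over 100 poems, so they do not show on their own whether the difference in judge score is larger than run-to-run variation.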


In [59]:
# Undeploy the fine-tuned model (does not cost anything extra, but Fireworks may limit your number of deployed models).
!firectl undeploy {ft_model_id}