# Lesson 5: Reward functions with LLM as a judge

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>Access <code>requirements.txt</code> file:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by AI chat models can vary with each execution due to their dynamic, probabilistic nature. Don't be surprised if your results differ from those shown in the video.</p>

Start by loading dependencies:

In [1]:
import os
import re
from datasets import load_dataset
from dotenv import load_dotenv
from openai import OpenAI

from utils import *

load_dotenv("../.env")

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

pb_client = OpenAI(
    base_url=os.environ["PREDIBASE_MODEL_LLAMA_URL"],
    api_key=os.environ["PREDIBASE_API_KEY"],
)

## The task: creating summaries of earnings call transcripts

Start by loading the earnings call dataset from Hugging Face:

In [2]:
ds = load_dataset("mrSoul7766/ECTSum")
transcript = ds["train"][1]["text"]
print(transcript[:1983])

I'm joined by Tom Greco, our President and Chief Executive Officer; and Jeff Shepherd, our Executive Vice President and Chief Financial Officer.
We also hope that you and your families are healthy and safe.
The health and safety of our team members and customers has been a top priority over the past year.
With strength across all channels, we delivered comparable store sales growth of 24.7%, and margin expansion of 478 basis points versus the prior year.
On a two-year stack, our comp sales growth was 15.4%.
Adjusted diluted earnings per share of $3.34 represented an all-time quarterly high for AAP, and improved more than 230% compared to Q1 2020.
Free cash flow of $259 million was up significantly versus the prior year, and we returned over $203 million to our shareholders through a combination of share repurchases and our quarterly cash dividend.
In addition, we recently announced an updated capital allocation framework targeting top quartile total shareholder return, highlighted by o

Define a summarization prompt and a helper function, then create and print a summary.

(Note: `MODEL_NAME` is defined in the `utils.py` file; here you are using Llama-3.1-8B-Instruct-dequantized to generate the summaries.)

In [3]:
SUMMARIZE_PROMPT = """Generate a concise summary of the information in the following earnings call transcript.

Only respond with the summary, do not include any extraneous text.

Transcript:

{transcript}
"""

def summarize(transcript, n=1):
    prompt = SUMMARIZE_PROMPT.format(transcript=transcript)
    messages = [
        {"role": "user", "content": prompt},
    ]

    return pb_client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages,
        n=n,
        temperature=0.9,
    )

In [4]:
resp = summarize(transcript)
summary = resp.choices[0].message.content
print(summary)

AAP reported Q1 comparable store sales growth of 24.7%, driven by strength across all channels, and margin expansion of 478 basis points. Adjusted diluted earnings per share was $3.34, an all-time quarterly high. Free cash flow was $259 million, up significantly from the prior year. The company returned over $203 million to shareholders through share repurchases and its quarterly cash dividend. AAP announced an updated capital allocation framework targeting top-quartile total shareholder return. The company expects to continue to invest in its brands, customer experience, team members, and market expansion to drive top-line growth above market. Guidance was raised to 4-6% comp sales growth and adjusted OI margin range was updated to 9-9.2%.


## Use an LLM as a judge of summary quality

Define a prompt that will tell the OpenAI GPT-4o-mini model to assign a reward score to a summary:

In [5]:
JUDGE_PROMPT_V1 = """
Rate the following summary of an earnings call transcript on a 
scale from 1 to 10. 

1 means the summary is very poor, 10 means the summary is very good.

Provide reasoning followed by the final score at the end 
surrounded by <score> tags.

For example:

<score>1</score>

Transcript:

{transcript}

Summary:

{summary}
"""

def judge_reward_v1(
    transcript: str,
    summary: str, 
    model: str = "gpt-4o-mini", 
    verbose: bool = False,
) -> float:
    prompt = JUDGE_PROMPT_V1.format(
        transcript=transcript, 
        summary=summary,
    )
    messages = [
        {"role": "user", "content": prompt},
    ]

    resp = client.chat.completions.create(
        model=model,
        messages=messages,
        n=1,
        temperature=0,
        )
    completion = resp.choices[0].message.content

    if verbose:
        print(completion)

    # Extract the integer inside the <score> tags; a missing or
    # malformed score earns zero reward
    try:
        match = re.search(r"<score>(\d+)</score>", completion)
        if match is None:
            return 0.0
        score = int(match.group(1))
    except (TypeError, ValueError):
        score = 0

    return score / 10
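
The parsing step at the end of `judge_reward_v1` can be exercised without calling the API. The sketch below is a hypothetical variant, not the lesson's code: `extract_reward` is an illustrative name, and three of its behaviours are assumptions about how you might treat off-format judge output. It accepts decimal scores, clamps to the prompt's scale, and takes the last `<score>` tag in case the judge echoes the example tag from the prompt before giving its own score.

```python
import re
from typing import Optional

# Hypothetical helper; the pattern also accepts decimals such as 7.5 (an assumption).
SCORE_PATTERN = re.compile(r"<score>\s*(\d+(?:\.\d+)?)\s*</score>")


def extract_reward(completion: Optional[str], max_score: int = 10) -> float:
    """Return the last <score> value scaled to [0, 1], or 0.0 if none is found."""
    if not completion:
        return 0.0
    matches = SCORE_PATTERN.findall(completion)
    if not matches:
        return 0.0
    # The prompt asks for reasoning first and the score at the end,
    # so the last tag is treated as the final score.
    score = float(matches[-1])
    # Clamp so an out-of-range score such as 15 cannot exceed the maximum reward.
    score = min(max(score, 0.0), float(max_score))
    return score / max_score


print(extract_reward("Covers the key metrics. <score>8</score>"))  # 0.8
print(extract_reward("No tags here"))  # 0.0
```

Whether to clamp, or to treat an out-of-range score as a failed judgment worth zero reward, is a design choice the lesson does not settle.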

Now score the summary you generated above using the new reward function:

In [6]:
score = judge_reward_v1(transcript, summary, verbose=True)
print(score)

The summary provided captures the key financial metrics and strategic initiatives discussed in the earnings call transcript. It effectively highlights the significant growth in comparable store sales, earnings per share, and free cash flow, as well as the company's commitment to returning value to shareholders. Additionally, it mentions the updated guidance for future sales growth and operating income margins, which are crucial for investors.

However, the summary could be improved by including more context about the factors driving the growth, such as the impact of federal stimulus, changes in consumer behavior, and specific categories that performed well. It also lacks mention of the company's operational strategies and initiatives that were discussed in detail in the transcript, which could provide a more comprehensive view of AAP's performance and future outlook.

Overall, while the summary is concise and covers the essential financial highlights, it misses some of the nuances and 

Now generate 8 new summaries and score each one:

In [7]:
resp = summarize(transcript, n=8)
summaries = [choice.message.content for choice in resp.choices]

In [8]:
scores = [judge_reward_v1(transcript, summary) for summary in summaries]
scores

[0.8, 0.8, 0.7, 0.8, 0.8, 0.8, 0.7, 0.8]

## Taking a quiz to assign a reward score

In this section, you'll create a multiple-choice quiz that tests key facts from the earnings call transcript. You'll then ask another LLM to take the quiz using different call summaries, and use the quiz score as the reward score.

Start by creating the quiz prompt:

In [9]:
from pydantic import BaseModel
from random import shuffle


QUIZ_PROMPT = """
Generate a multiple-choice quiz based on the information 
in the following earnings call transcript.

Example:

```
1. What was the q1 adjusted earnings per share?
a) $3.34
b) $5.32
c) $2.49
d) $7.78

2. By what percent did same store sales rise in q1?
a) 29.4%
b) 32.1%
c) 24.7%
d) 21.2%

===== ANSWERS =====
1. a
2. c
```

Limit the length of the quiz to the top 10 most relevant questions for financial analysts.

Transcript:

{transcript}
"""

Next, define pydantic classes that describe the structure of an individual question and of a quiz composed of multiple questions. Then define a helper function that creates a quiz using a structured response from GPT-4o-mini:

In [10]:
class Question(BaseModel):
    text: str
    options: list[str]
    answer: int

    def shuffle_options(self) -> None:
        """Shuffle the options while preserving the correct answer"""
        # Get the correct answer text
        correct = self.options[self.answer]
        
        # Shuffle the options
        shuffled = self.options.copy()
        shuffle(shuffled)
        
        # Update the answer index to match new position
        self.options = shuffled
        self.answer = shuffled.index(correct)

    def __str__(self) -> str:
        """Pretty print a single question"""
        output = [self.text]
        for i, option in enumerate(self.options):
            output.append(f"{chr(65+i)}. {option}")
        return "\n".join(output)


class Quiz(BaseModel):
    questions: list[Question]

    def shuffle_all_questions(self) -> None:
        """Shuffle the options for all questions in the quiz"""
        for question in self.questions:
            question.shuffle_options()
    
    def __str__(self) -> str:
        """Pretty print the entire quiz"""
        output = []
        for i, question in enumerate(self.questions, 1):
            output.append(f"\nQuestion {i}:")
            output.append(str(question))
        return "\n".join(output)


def create_quiz(transcript: str):
    prompt = QUIZ_PROMPT.format(transcript=transcript)
    messages = [
        {"role": "user", "content": prompt},
    ]
    resp = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=messages,
        temperature=0.7,
        response_format=Quiz,
    )

    quiz = resp.choices[0].message.parsed
    quiz.shuffle_all_questions()

    return quiz
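
The invariant behind `shuffle_options` is that the `answer` index must still point at the same option text after the reorder. Here is a minimal sketch of that check using only the standard library. `QuestionSketch` is a hypothetical stand-in for the pydantic `Question` class, and the seeded `random.Random` is an addition made only so the check is repeatable.

```python
import random
from dataclasses import dataclass


@dataclass
class QuestionSketch:
    """Stand-in for the lesson's pydantic Question (hypothetical, stdlib only)."""
    text: str
    options: list[str]
    answer: int  # index into options of the correct choice

    def shuffle_options(self, rng: random.Random) -> None:
        # Remember the correct option's text, shuffle a copy, then re-locate it.
        correct = self.options[self.answer]
        shuffled = self.options.copy()
        rng.shuffle(shuffled)
        self.options = shuffled
        self.answer = shuffled.index(correct)


rng = random.Random(0)  # seeded only so the check is repeatable
q = QuestionSketch(
    text="What was the Q1 adjusted earnings per share?",
    options=["$3.34", "$5.32", "$2.49", "$7.78"],
    answer=0,
)
for _ in range(100):
    q.shuffle_options(rng)
    # The index moves around, but it always points at the same text.
    assert q.options[q.answer] == "$3.34"
print(q.options, q.answer)
```

Note that `list.index` returns the first match, so this approach assumes the options within a question are distinct.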

In [11]:
quiz = create_quiz(transcript)
print(quiz)


Question 1:
What was the adjusted diluted earnings per share for Q1?
A. $2.49
B. $3.34
C. $5.32
D. $1.00

Question 2:
By what percent did comparable store sales grow in Q1?
A. 21.2%
B. 32.1%
C. 29.4%
D. 24.7%

Question 3:
What was the increase in free cash flow compared to last year?
A. $71 million
B. $203 million
C. $259 million
D. $330 million

Question 4:
What was the percentage increase in adjusted operating income from last year?
A. 91 basis points
B. 387 basis points
C. 230%
D. 478 basis points

Question 5:
How many new independent locations were added to the Carquest family?
A. 29
B. 115
C. 80
D. 13,000

Question 6:
What was the adjusted SG&A expense in Q1?
A. $259 million
B. $203 million
C. $1.2 billion
D. $3.3 billion

Question 7:
What percentage of net sales did adjusted SG&A represent?
A. 23.4%
B. 9.0%
C. 44.8%
D. 35.8%

Question 8:
How much did the company return to shareholders in Q1?
A. $203 million
B. $330 million
C. $71 million
D. $1.1 million

Question 9:
What is the 

Now, define a function that asks an LLM to take a quiz, using a transcript summary as the source material:

In [12]:
letter_to_index = {"A": 0, "B": 1, "C": 2, "D": 3}
index_to_letter = ["A", "B", "C", "D"]


TAKE_QUIZ_PROMPT = """Use the provided summary of a transcript 
to answer the following quiz.

Quiz:

{quiz}

Summary:

{summary}

Respond with just a list of answers and no additional text, 
for example:

[A, D, C, B, B, C, D, A, A, B]

You must provide an answer for all 10 questions. 
If you don't know the answer, answer with "0" for that question. 
Example:

[A, D, 0, B, B, C, D, A, A, B]
"""

def take_quiz(summary, quiz):
    question_strs = []
    for question in quiz.questions:
        question_str = question.text
        for i, option in enumerate(question.options):
            letter = index_to_letter[i]
            question_str += f"\n{letter}. {option}"
        question_strs.append(question_str)
    quiz_str = "\n\n".join(question_strs)

    prompt = TAKE_QUIZ_PROMPT.format(quiz=quiz_str, summary=summary)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    resp_str = resp.choices[0].message.content
    
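    # Convert the string representation of a list into a list of strings,
    # tolerating missing spaces and stray quotes around each answer
    answers = [a.strip(" '\"") for a in resp_str.strip().strip("[]").split(",")]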

    return answers

In [13]:
answers = take_quiz(summaries[0], quiz)
answers

['B', 'D', 'A', 'D', '0', 'A', 'A', 'A', '0', 'C']

Next, score the LLM's answers to the quiz:

In [14]:
def score_quiz_answers(answers, quiz):
    assert len(answers) == len(quiz.questions)

    total = len(answers)
    correct = 0
    for answer, question in zip(answers, quiz.questions):
        expected_answer = index_to_letter[question.answer]
        if answer == expected_answer:
            correct += 1
    return correct / total

In [15]:
score_quiz_answers(answers, quiz)

0.5
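
`score_quiz_answers` asserts that the number of answers equals the number of questions, so a reply with too few or too many entries raises instead of producing a reward. One possible way to guard against that is sketched below. `normalize_answers` is a hypothetical helper, not part of the lesson, and padding missing answers with `"0"` (scored as wrong) is an assumption about the desired behaviour.

```python
import re


def normalize_answers(raw: str, n_questions: int) -> list[str]:
    """Parse a reply like "[A, D, 0, B]" into exactly n_questions answers."""
    # Keep only standalone tokens A-D (either case) or 0; this assumes the
    # reply is just the list, as the quiz-taking prompt requests.
    tokens = re.findall(r"\b[A-Da-d0]\b", raw)
    # Drop any extra answers beyond the number of questions.
    answers = [t.upper() for t in tokens][:n_questions]
    # Pad missing answers with "0", which never matches an answer letter.
    answers += ["0"] * (n_questions - len(answers))
    return answers


print(normalize_answers("[A, D, 0, B]", 4))  # ['A', 'D', '0', 'B']
print(normalize_answers("[a,b]", 4))         # ['A', 'B', '0', '0']
```

With this in place, an off-format reply lowers the reward rather than stopping the run.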

Finally, generate rewards and advantages for all 8 summaries you created earlier:

In [16]:
def print_quiz_table(all_answers, rewards):
    advantages = compute_advantages(rewards)
    length = len(all_answers)
    elems = list(zip(range(length), rewards, advantages))

    headers = ["Index", "Reward", "Advantage"]
    table = tabulate(elems, headers=headers, tablefmt="grid").split("\n")
    for row in table:
        print(row)

In [17]:
all_answers = []
quiz_rewards = []
for summary in summaries:
    answers = take_quiz(summary, quiz)
    all_answers.append(answers)
    quiz_rewards.append(score_quiz_answers(answers, quiz))

In [18]:
print_quiz_table(all_answers, quiz_rewards)

+---------+----------+-------------+
|   Index |   Reward |   Advantage |
+=========+==========+=============+
|       0 |      0.5 |     1.73205 |
+---------+----------+-------------+
|       1 |      0.5 |     1.73205 |
+---------+----------+-------------+
|       2 |      0.4 |    -0.57735 |
+---------+----------+-------------+
|       3 |      0.4 |    -0.57735 |
+---------+----------+-------------+
|       4 |      0.4 |    -0.57735 |
+---------+----------+-------------+
|       5 |      0.4 |    -0.57735 |
+---------+----------+-------------+
|       6 |      0.4 |    -0.57735 |
+---------+----------+-------------+
|       7 |      0.4 |    -0.57735 |
+---------+----------+-------------+
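
`compute_advantages` comes from `utils.py` and is not shown in this notebook. The values in the table are consistent with standardizing each reward against its group: subtract the group mean and divide by the population standard deviation. The following is a sketch under that assumption. The name `compute_advantages_sketch` and the zero-variance guard are illustrative, not the lesson's implementation.

```python
from statistics import fmean, pstdev


def compute_advantages_sketch(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within a group: (reward - mean) / population std."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # Assumption: if every reward is identical there is no signal, so return zeros
    # instead of dividing by zero.
    if std < eps:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


quiz_rewards = [0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
advantages = compute_advantages_sketch(quiz_rewards)
# Rounded to five decimals, this gives 1.73205 for the two 0.5 rewards and
# -0.57735 for the six 0.4 rewards, matching the Advantage column above.
print([round(a, 5) for a in advantages])
```

Here the mean reward is 0.425 and the population standard deviation is about 0.0433, so a summary that answers just one more question correctly ends up roughly 1.7 standard deviations above the group.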
