# Lesson 6: Reward hacking

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>Access <code>requirements.txt</code> file:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by AI chat models can vary with each execution due to their dynamic, probabilistic nature. Don't be surprised if your results differ from those shown in the video.</p>

Start by importing dependencies and setting up two clients, one for OpenAI and one for Predibase:

In [1]:
import os
from datasets import load_dataset
from dotenv import load_dotenv
from openai import OpenAI

from utils import *

load_dotenv()

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

pb_client = OpenAI(
    base_url=os.environ["PREDIBASE_MODEL_LLAMA_URL"],
    api_key=os.environ["PREDIBASE_API_KEY"],
)

## Hacking the summarization task with longer summaries

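Here, you'll see how a model can earn higher quiz rewards simply by writing longer summaries, since a longer summary retains more of the facts that the quiz asks about. Start by loading the same earnings call dataset from the previous lesson: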

In [2]:
ds = load_dataset("mrSoul7766/ECTSum")

transcript = ds["train"][1]["text"]

In [3]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1681
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 495
    })
})


In [4]:
for split in ds:
    print(f"{split}: {ds[split][0]['text']}")

train: For those of you that have not, it is available on the Investor Relations section of our website at investor.
Non-GAAP net earnings and non-GAAP EPS, which have been adjusted for certain items which may affect the comparability of our performance with other companies.
I'm very pleased with the strong start to 2021 and the positive momentum in revenue and margins we delivered in the first quarter, demonstrating the strong operating leverage in our business.
Consolidated revenues increased 11.1% year-over-year in our first full quarter as a stand-alone public company.
The revenue increase included same-store revenue growth of 14.8% and we reported adjusted EBITDA margin that improved to 15.4% of revenues.
This is the first quarter in over a decade that the Company has delivered double-digit same-store revenue growth.
Our teams in the field and our store support centers and Woodhaven are performing at a very high level and are energized and engaged.
As I visit Aaron's stores around

Generate a quiz based on the call:

In [5]:
quiz = generate_quiz(transcript)
print(quiz)


Question 1:
What was the Q1 adjusted earnings per share reported?
A. $2.49
B. $3.34
C. $1.00
D. $5.32

Question 2:
By what percentage did comparable store sales grow in Q1?
A. 24.7%
B. 29.4%
C. 32.1%
D. 21.2%

Question 3:
What was the free cash flow for the quarter?
A. $259 million
B. $203 million
C. $71 million
D. $330 million

Question 4:
How much did the company return to shareholders in Q1?
A. $71 million
B. $203 million
C. $259 million
D. $1.1 billion

Question 5:
What was the adjusted gross profit margin for Q1?
A. 42.6%
B. 41.2%
C. 45.5%
D. 44.8%

Question 6:
What was the adjusted operating income for the quarter?
A. $113 million
B. $299 million
C. $203 million
D. $330 million

Question 7:
What was the rate of adjusted SG&A expense as a percentage of net sales in Q1?
A. 35.8%
B. 32.5%
C. 36.4%
D. 38.1%

Question 8:
What was the net sales increase for Q1?
A. 23.4%
B. 21.2%
C. 20.0%
D. 24.7%

Question 9:
What is the expected range for the adjusted operating income margin for 2021

Generate 8 summaries of the call. As before, you'll use the Llama-3.1-8B-Instruct-dequantized model, which is defined in the utils.py file:

In [6]:
prompt = f"""Generate a concise bulleted summary of the 
information in the following earnings call transcript.

Only respond with the summary, do not include any extraneous text.

Transcript:

{transcript}
"""

completions = pb_client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "user", "content": prompt},
    ],
    n=8,
    temperature=0.9,
)

Use each summary to take the quiz and get a reward score:

In [7]:
responses = [choice.message.content for choice in completions.choices]
quiz_rewards = [quiz_reward(resp, quiz) for resp in responses]
quiz_rewards

AssertionError: 

The full transcript should get a perfect score on the quiz. Check that it does:

In [8]:
transcript_score = quiz_reward(transcript, quiz)
transcript_score

AssertionError: 

Check the lengths (in characters) of the 8 summaries and compare them to the length of the full transcript:

In [9]:
lengths = [len(resp) for resp in responses]
lengths

[1075, 1050, 1198, 1871, 789, 1344, 916, 1239]

In [10]:
len(transcript)

21810

## Create a penalty function to discourage longer summaries

Here, you'll create a reward function that assigns a negative score (i.e., a penalty) to any summary longer than a target length of 1,024 characters. The penalty grows linearly with the excess length and is capped at -10. During training, this penalty should discourage the model from raising its quiz score just by writing longer summaries.

In [11]:
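def length_penalty_reward(response: str) -> float:
    """Return 0 for responses within the target length, else a negative penalty."""
    length = len(response)  # length in characters, not tokens
    target_length = 1024
    if length <= target_length:
        return 0.0
    # The penalty grows linearly with the excess length and is floored at -10
    return max(
        (target_length - length) / target_length,
        -10,
    )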

In [12]:
transcript_reward = length_penalty_reward(transcript)
transcript_reward

-10
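
To see why the full transcript hits the floor, it can help to evaluate the penalty at a few lengths from this lesson. The sketch below restates the same formula with the target length and floor exposed as parameters; the names `length_penalty`, `target_length`, and `floor` are illustrative and do not appear in the lesson code.

```python
def length_penalty(length: int, target_length: int = 1024, floor: float = -10.0) -> float:
    """Zero up to the target length, then a linear penalty clipped at `floor`."""
    if length <= target_length:
        return 0.0
    return max((target_length - length) / target_length, floor)


# Lengths taken from this lesson: three summaries and the full transcript
for n in (789, 1075, 1871, 21810):
    unclipped = min(0.0, (1024 - n) / 1024)
    print(f"{n:6d}  unclipped={unclipped:9.4f}  penalty={length_penalty(n):9.4f}")
```

For the 21,810-character transcript, the unclipped value is about -20.3, so the floor of -10 applies. With a target of 1,024 characters, the floor is first reached at 11 × 1,024 = 11,264 characters; beyond that point, extra length is not penalized any further.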

Show the length penalties and resulting advantages for the 8 summaries:

In [13]:
lengths = [len(resp) for resp in responses]
length_rewards = [
    length_penalty_reward(resp) for resp in responses
]
print_length_table(lengths, length_rewards)

+---------+----------+------------+-------------+
|   Index |   Length |     Reward |   Advantage |
+=========+==========+============+=============+
|       0 |     1075 | -0.0498047 |   0.575555  |
+---------+----------+------------+-------------+
|       1 |     1050 | -0.0253906 |   0.669523  |
+---------+----------+------------+-------------+
|       2 |     1198 | -0.169922  |   0.113232  |
+---------+----------+------------+-------------+
|       3 |     1871 | -0.827148  |  -2.41639   |
+---------+----------+------------+-------------+
|       4 |      789 |  0         |   0.76725   |
+---------+----------+------------+-------------+
|       5 |     1344 | -0.3125    |  -0.435542  |
+---------+----------+------------+-------------+
|       6 |      916 |  0         |   0.76725   |
+---------+----------+------------+-------------+
|       7 |     1239 | -0.209961  |  -0.0408761 |
+---------+----------+------------+-------------+
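
The Advantage column is consistent with standardizing each reward against its group: subtract the group mean, then divide by the group standard deviation. The sketch below is an assumption about what `compute_advantages` in utils.py does, not its actual source. The name `standardize` is hypothetical, and the population standard deviation is used because it reproduces the values in the table.

```python
import statistics


def standardize(rewards: list[float]) -> list[float]:
    """Group-relative advantage: (reward - group mean) / group population std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rewards equal: no summary is better or worse than the group
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Summary lengths from this lesson, with the same penalty formula as above
lengths = [1075, 1050, 1198, 1871, 789, 1344, 916, 1239]
rewards = [min(0.0, max((1024 - n) / 1024, -10.0)) for n in lengths]
advantages = standardize(rewards)

for i, (n, r, a) in enumerate(zip(lengths, rewards, advantages)):
    print(f"{i}  {n:5d}  {r:10.6f}  {a:10.6f}")
```

Because advantages are relative to the group, the two summaries with zero penalty still receive a positive advantage, and the 1,871-character summary receives a strongly negative one even though its raw penalty is less than 1.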


Add the length penalty to the quiz reward score:

In [14]:
def total_reward(length_reward, quiz_reward):
    return length_reward + quiz_reward

In [15]:
total_rewards = [
    total_reward(length_reward, quiz_reward) 
    for length_reward, quiz_reward
    in zip(length_rewards, quiz_rewards)
]

NameError: name 'quiz_rewards' is not defined
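
The quiz scores are not available in the run shown here, so the following sketch uses made-up quiz scores to illustrate how the summed reward changes the ranking. Everything except the summary lengths is hypothetical: the `quiz_scores` values, the helper names `length_penalty` and `standardize`, and the assumption that advantages are standardized within the group.

```python
import statistics

# Summary lengths from this lesson
lengths = [1075, 1050, 1198, 1871, 789, 1344, 916, 1239]
# HYPOTHETICAL quiz scores (fraction of questions answered correctly);
# chosen so that longer summaries tend to score higher
quiz_scores = [0.5, 0.5, 0.6, 0.9, 0.4, 0.7, 0.5, 0.6]


def length_penalty(n: int, target: int = 1024, floor: float = -10.0) -> float:
    """Same formula as the lesson's length penalty, restated for this sketch."""
    return 0.0 if n <= target else max((target - n) / target, floor)


def standardize(values: list[float]) -> list[float]:
    """Assumed advantage computation: (value - mean) / population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


totals = [length_penalty(n) + q for n, q in zip(lengths, quiz_scores)]
advantages = standardize(totals)

indices = range(len(lengths))
best_by_quiz = max(indices, key=lambda i: quiz_scores[i])
best_by_total = max(indices, key=lambda i: totals[i])
worst_by_total = min(indices, key=lambda i: totals[i])
print("best by quiz score alone:", best_by_quiz)
print("best by total reward:", best_by_total)
print("worst by total reward:", worst_by_total)
```

With these made-up scores, the 1,871-character summary wins on quiz score alone but ranks last once the length penalty is added, while a shorter summary with a middling quiz score ranks first. A plain sum weights both terms equally, so the relative scale of the two rewards determines how strongly length is discouraged.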

Visualize the trade-off between length and quiz score in determining advantages:

In [None]:
from matplotlib import pyplot as plt

# Compute advantages from the combined rewards to color each point
advantages = compute_advantages(total_rewards)
min_adv = min(advantages)
max_adv = max(advantages)

# Apply the style before creating the figure so it affects the new axes
plt.style.use('dark_background')
plt.figure(figsize=(10, 6), facecolor='black')
scatter = plt.scatter(
    length_rewards,
    quiz_rewards,
    c=advantages,
    cmap='RdYlGn',
    s=100,
    edgecolor='white',
    vmin=min_adv,
    vmax=max_adv,
)
plt.colorbar(scatter, label='Advantage')
plt.xlabel('Length Reward')
plt.ylabel('Quiz Reward')
plt.title('Length Reward vs Quiz Reward (colored by advantage)')
plt.grid(True, alpha=0.2)
plt.show()