# Getting the best out of LLMs

This notebook complements our [blog series](https://opper.ai/blog/getting-the-most-out-of-llms-1) on getting the best out of LLMs. It demos Opper functions combined with few-shot retrieval.




In [2]:
# Install the required libraries: Hugging Face datasets, the Opper SDK, pydantic, and tqdm
!pip install datasets 
!pip install opperai
!pip install pydantic
!pip install tqdm

import os
from datasets import load_dataset

os.environ["OPPER_API_KEY"] = "your-api-key"
os.environ["OPPER_PROJECT"] = "gsm8k"

# Get the dataset from huggingface https://huggingface.co/datasets/gsm8k
dataset = load_dataset("gsm8k", "main")



Now, let's parse this dataset so we can separate the rationale from the final answer. Each GSM8K answer ends with a `#### <value>` line, so we split on that marker.

In [3]:
def build_datasets(count: int = 100, split: float = 0.7):
    train_count = int(count * split)
    test_count = count - train_count

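    # Training examples come from the start of the GSM8K "train" split
    train_dataset = [dataset["train"][i] for i in range(train_count)]
    # Held-out examples come from the end of the same split (negative indices)
    test_dataset = [dataset["train"][i] for i in range(-test_count, 0)]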

    def parse_data(data):
        parsed_data = []
        for item in data:
            question, full_answer = item["question"], item["answer"]
            thoughts, _, answer = full_answer.partition("\n#### ")
            parsed_data.append({"question": question, "thoughts": thoughts, "answer": answer or None})
        return parsed_data

    train_data = parse_data(train_dataset)
    test_data = parse_data(test_dataset)

    return train_data, test_data


In [4]:
# We take 1000 samples and split them into 70% training and 30% testing
training, testing = build_datasets(1000, 0.7)

In [5]:
training[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'thoughts': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.',
 'answer': '72'}
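
To make the parsing step concrete, here is a minimal sketch of how `str.partition` splits a GSM8K answer at its `#### ` marker. The helper name `split_answer` is hypothetical and exists only for this illustration.

```python
def split_answer(full_answer: str):
    """Split a GSM8K-style answer into (rationale, final answer)."""
    # partition returns (before, separator, after); if the separator is
    # missing, "after" is an empty string, which we map to None.
    thoughts, _, answer = full_answer.partition("\n#### ")
    return thoughts, answer or None


sample = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
thoughts, answer = split_answer(sample)
print(answer)
```

Note that the answer stays a string at this point; the evaluation step is what converts it to an integer.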

In [6]:
from opperai import fn

@fn(model="openai/gpt3.5-turbo")
async def predict(question: str) -> int:
    """Solves math word problems"""

That's it! Now let's run our dataset through this function and see how well it performs. First let's define our evaluation function.

In [7]:

import asyncio
from tqdm.asyncio import tqdm_asyncio


async def evaluate(test_set, function, max_concurrent_requests: int = 10):
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def evaluate_item(item):
        async with semaphore:
            try:
                answer_int = int(item["answer"].replace(",", ""))
                result = await function(item["question"])
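                # Plain ints are used as-is; anything else is expected to expose `.value`.
                # A float result therefore raises here and is counted as a failure below.
                value = result if isinstance(result, int) else result.value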
                return {
                    "expected": item["answer"],
                    "got": value,
                    "question": item["question"],
                    "pass": answer_int == value,
                    "raw": result,
                }
            except Exception as e:
                print(f"Exception: {e}")
                return {
                    "expected": item["answer"],
                    "got": None,
                    "question": item["question"],
                    "pass": False,
                    "raw": str(e),
                }

    tasks = [evaluate_item(item) for item in test_set]
    results = []
    for f in tqdm_asyncio.as_completed(tasks):
        result = await f
        results.append(result)
    return results


def calculate_pass_rate(results):
    total = len(results)
    passed = sum(1 for result in results if result["pass"])
    pass_rate = (passed / total) * 100
    print(f"Pass rate: {pass_rate:.2f}%")
    return pass_rate
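
The concurrency limit in `evaluate` relies on `asyncio.Semaphore`. The sketch below isolates that pattern so it can be tried without any API calls. `bounded_gather` and `fake_predict` are hypothetical names introduced for this example, and `fake_predict` simply sleeps instead of calling a model.

```python
import asyncio


async def bounded_gather(worker, items, max_concurrent: int = 10):
    """Run worker(item) for every item, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0  # highest number of workers observed running at once

    async def run_one(item):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await worker(item)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(run_one(item) for item in items))
    return results, peak


async def fake_predict(question: str) -> int:
    """Stand-in for a model call: waits briefly, then returns a dummy answer."""
    await asyncio.sleep(0.01)
    return len(question)


# In a notebook, use `await bounded_gather(...)` instead of asyncio.run.
results, peak = asyncio.run(
    bounded_gather(fake_predict, ["a", "bb", "ccc", "dddd"], max_concurrent=2)
)
print(results, peak)
```

Unlike `as_completed`, `asyncio.gather` returns results in input order, which can be convenient when results need to be lined up with the test set.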

In [87]:
results = await evaluate(testing, predict, max_concurrent_requests=10)

 51%|███████████████▊               | 153/300 [00:13<00:14, 10.24it/s]

Exception: 'float' object has no attribute 'value'


100%|███████████████████████████████| 300/300 [00:24<00:00, 12.04it/s]


In [88]:
calculate_pass_rate(results)


Pass rate: 26.00%


26.0
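
The `'float' object has no attribute 'value'` exception in the run above arises because `evaluate` only treats `int` results as raw numbers. A more tolerant extraction step might look like the sketch below. `extract_value` and `Wrapped` are hypothetical names, not part of the notebook or the Opper SDK, and using such a helper would change which items count as passes, so the pass rates recorded here would no longer be directly comparable.

```python
from numbers import Real


def extract_value(result):
    """Return the numeric answer from a bare number or an object with `.value`."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(result, bool):
        raise TypeError("a boolean is not a valid numeric answer")
    if isinstance(result, Real):
        return result
    return result.value


class Wrapped:
    """Hypothetical stand-in for a structured result such as MathSolution."""

    def __init__(self, value):
        self.value = value


print(extract_value(72), extract_value(72.0), extract_value(Wrapped(72)))
```

Because `72 == 72.0` is true in Python, a float that happens to be a whole number would still compare equal to the expected integer.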


## Chain Of Thought

Let's instead ask for a solution made up of step-by-step reasoning followed by a numerical answer.



In [8]:
from pydantic import BaseModel, Field


class MathSolution(BaseModel):
    """A math solution with step-by-step reasoning and final answer"""

    thoughts: str = Field(..., description="step-by-step solving of the problem")
    value: int = Field(description="The final numerical answer to the math problem")


@fn(model="openai/gpt3.5-turbo")
async def predict_cot(question: str) -> MathSolution:
    """Solves math problems"""

In [90]:
results = await evaluate(testing, predict_cot, max_concurrent_requests=100)

100%|███████████████████████████████| 300/300 [00:19<00:00, 15.46it/s]


In [91]:
calculate_pass_rate(results)

Pass rate: 74.33%


74.33333333333333

Much better! Now let's add some examples from the GSM8K dataset.

In [92]:
@fn(model="openai/gpt3.5-turbo")
async def predict_cot(question: str) -> MathSolution:
    """
    Solves math problems

    Examples:
    - Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
        Thoughts: Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
        Value: 72

    - Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
        Thoughts: Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute. Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.
        Value: 10

    - Question: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?
        Thoughts: He writes each friend 3*2=<<3*2=6>>6 pages a week So he writes 6*2=<<6*2=12>>12 pages every week That means he writes 12*52=<<12*52=624>>624 pages a year
        Value: 624
    """

In [93]:
results_manual_examples = await evaluate(testing, predict_cot, max_concurrent_requests=100)

  0%|                                 | 1/300 [00:00<02:06,  2.37it/s]

Exception: 


 84%|██████████████████████████     | 252/300 [00:16<00:02, 22.74it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


100%|███████████████████████████████| 300/300 [00:24<00:00, 12.30it/s]


In [94]:
calculate_pass_rate(results_manual_examples)

Pass rate: 73.33%


73.33333333333333
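
The `Structured generation failed` message means a response could not be matched to the `MathSolution` schema. How Opper validates responses is not shown in this notebook, so the sketch below uses only the standard library and a hypothetical `parse_solution` helper to illustrate the kind of check that a non-integer `value` would fail.

```python
import json
from dataclasses import dataclass


@dataclass
class MathSolutionSketch:
    """Standard-library stand-in for the Pydantic MathSolution model."""

    thoughts: str
    value: int


def parse_solution(raw_json: str) -> MathSolutionSketch:
    """Parse a JSON response and enforce the expected field types."""
    data = json.loads(raw_json)
    thoughts, value = data.get("thoughts"), data.get("value")
    if not isinstance(thoughts, str):
        raise ValueError("thoughts must be a string")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("value must be an integer")
    return MathSolutionSketch(thoughts=thoughts, value=value)


ok = parse_solution('{"thoughts": "48/2 = 24; 48 + 24 = 72", "value": 72}')
print(ok.value)

try:
    parse_solution('{"thoughts": "She earned ten dollars.", "value": "ten dollars"}')
except ValueError as err:
    print(f"rejected: {err}")
```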

## Synthetic examples

Let's generate examples with Claude 3 Opus and save them, so that GPT-3.5 Turbo can later use them as few-shot examples.

In [9]:
@fn(model="anthropic/claude-3-opus")
async def predict_cot_with_examples(question: str) -> MathSolution:
    """Solves math problems"""

In [68]:
await predict_cot_with_examples(training[0]["question"])

MathSolution(thoughts='To find the total number of clips Natalia sold in April and May:\n1. In April, Natalia sold clips to 48 friends.\n2. In May, she sold half as many clips as in April. \n   Half of 48 is 48 ÷ 2 = 24.\n   So in May, Natalia sold 24 clips.\n3. To get the total, add the clips sold in April and May:\n   April: 48 clips\n   May: 24 clips \n   Total: 48 + 24 = 72 clips\n\nTherefore, the total number of clips Natalia sold altogether in April and May is 72.', value=72)

In [14]:
from opperai import AsyncOpper
from opperai.types import ChatPayload, Message

opper = AsyncOpper()


for qa in training[0:20]:
    payload = ChatPayload(messages=[Message(role="user", content=qa["question"])])
    try:
        answer, api_response = await predict_cot_with_examples.call(payload)
        print(answer)
        await api_response.span.save_example()
    except Exception as e:
        print(e)
        # It is possible the model fails to generate a response with the desired format
        # when it thinks the answer is not numerical!

thoughts="Okay, let's break this down step-by-step:\n1. In April, Natalia sold clips to 48 of her friends. So in April she sold 48 clips.\n2. In May, she sold half as many clips as she did in April.\n   - To calculate half of 48: 48 ÷ 2 = 24\n   - So in May she sold 24 clips\n3. To calculate the total clips sold in April and May, we add the numbers together:\n   - April clips: 48\n   - May clips: 24 \n   - Total: 48 + 24 = 72\n\nTherefore, the total number of clips Natalia sold altogether in April and May is 72." value=72
thoughts="Okay, let's break this down step-by-step:\n1. Weng earns $12 per hour babysitting\n2. Yesterday she babysat for 50 minutes\n3. There are 60 minutes in an hour\n4. So 50 minutes is 50/60 or 5/6 of an hour\n5. If she earns $12 for a full hour, for 5/6 of an hour she earned:\n   $12 * (5/6) = $10\nTherefore, for 50 minutes of babysitting, Weng earned $10." value=10
thoughts="Let's organize the information we have:\n- The wallet costs $100\n- Betty has half of t

In [96]:
@fn(
    model="openai/gpt3.5-turbo",
    few_shot=True,
    few_shot_count=3,
)
async def predict_cot_with_examples(question: str) -> MathSolution:
    """Solves math problems"""

In [97]:
results_synthetic_examples = await evaluate(
    testing, predict_cot_with_examples, max_concurrent_requests=100
)

 54%|████████████████▋              | 161/300 [00:15<00:07, 17.59it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 84%|██████████████████████████     | 252/300 [00:20<00:02, 19.48it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 90%|███████████████████████████▉   | 270/300 [00:21<00:01, 24.14it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 91%|████████████████████████████▎  | 274/300 [00:22<00:02, 10.42it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 94%|█████████████████████████████  | 281/300 [00:23<00:03,  5.78it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 97%|██████████████████████████████▏| 292/300 [00:26<00:01,  4.07it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 99%|██████████████████████████████▌| 296/300 [00:29<00:02,  1.65it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 99%|██████████████████████████████▋| 297/300 [00:29<00:01,  1.96it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


 99%|██████████████████████████████▊| 298/300 [00:33<00:02,  1.39s/it]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


100%|██████████████████████████████▉| 299/300 [00:34<00:01,  1.37s/it]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}


100%|███████████████████████████████| 300/300 [00:36<00:00,  8.31it/s]

Exception: {"detail":"Structured generation failed, could not generate response matching the schema."}





In [98]:
# We get between 77% and 81% most of the time here.
calculate_pass_rate(results_synthetic_examples)



Pass rate: 77.67%


77.66666666666666
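
With `few_shot=True` and `few_shot_count=3`, saved examples are retrieved and added to the prompt for each question. The notebook does not describe how Opper selects them, so the sketch below is only an illustration of the general idea: it ranks saved examples by a simple word-overlap (Jaccard) similarity, which is an assumption made for this example. All function names and the abbreviated example records are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(question: str, examples: list, k: int = 3) -> list:
    """Return the k saved examples whose questions overlap most with `question`."""
    query = tokenize(question)
    ranked = sorted(
        examples,
        key=lambda ex: jaccard(query, tokenize(ex["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, examples: list, k: int = 3) -> str:
    """Format the selected examples followed by the new question."""
    lines = ["Solves math problems", "", "Examples:"]
    for ex in select_examples(question, examples, k):
        lines.append(f"- Question: {ex['question']}")
        lines.append(f"  Thoughts: {ex['thoughts']}")
        lines.append(f"  Value: {ex['value']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Abbreviated, illustrative records; real saved examples would be longer.
saved = [
    {"question": "Weng earns $12 an hour for babysitting. How much for 50 minutes?",
     "thoughts": "12/60 = 0.2 per minute; 0.2 * 50 = 10.", "value": 10},
    {"question": "Natalia sold 48 clips in April and half as many in May. How many in total?",
     "thoughts": "48/2 = 24; 48 + 24 = 72.", "value": 72},
]
new_question = "Sam earns $15 an hour for babysitting. How much for 20 minutes?"
prompt = build_prompt(new_question, saved, k=1)
print(prompt)
```

A production retriever would more likely compare embeddings than raw word overlap, but the selection-then-formatting flow is the same.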

And that's it! By combining chain-of-thought prompting with synthetic few-shot examples, we raised the pass rate on our 300-question GSM8K test sample from 26% to roughly 78%.
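
To put the reported pass rates in perspective, the sketch below converts each percentage back into a pass count out of 300 and attaches a rough binomial standard error. The labels and helper names are made up for this example, and the error estimate assumes the test questions are independent, so treat it as a sanity check rather than a rigorous analysis.

```python
import math

N = 300  # size of the test set used in this notebook

# Pass rates (in percent) as printed by calculate_pass_rate in the runs above.
pass_rates = {
    "plain int output": 26.00,
    "chain of thought": 74.33,
    "manual examples": 73.33,
    "synthetic few-shot": 77.67,
}


def implied_passes(rate_percent: float, n: int = N) -> int:
    """Number of passing items implied by a pass rate over n items."""
    return round(rate_percent / 100 * n)


def standard_error(rate_percent: float, n: int = N) -> float:
    """Binomial standard error of the pass rate, in percentage points."""
    p = rate_percent / 100
    return math.sqrt(p * (1 - p) / n) * 100


for name, rate in pass_rates.items():
    print(f"{name}: {implied_passes(rate)}/{N} passed, "
          f"+/- {standard_error(rate):.1f} points")
```

With 300 questions, the standard error near a 75% pass rate is a couple of percentage points, so small differences between the chain-of-thought variants are best confirmed over several runs.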