# Base Model Evaluation

Note: intended to be run in [Google Colab](https://colab.research.google.com/) using a T4 runtime.

Select an open-source model and evaluate its base (not fine-tuned) version against the task. The Llama 3.1 base model has only 8 billion parameters and will be quantized down to 4 bits, compared with an estimated trillions of parameters for GPT-4o (OpenAI has not published the figure).

Models evaluated:
- Meta-Llama-3.1-8B
- Qwen2.5-7B
- gemma-2-9b
- Phi-3-medium-4k-instruct

<img src="./../images/Product-Pricer-Model-Evaluation-Results.jpg" alt="Average Absolute Prediction Error Baseline Models" />

# Dependencies

In [None]:
# pip installs - ignore the error message!

!pip install -q datasets requests torch peft bitsandbytes transformers trl accelerate sentencepiece tiktoken matplotlib

In [None]:
# imports

import os
import re
import math
from tqdm import tqdm
from google.colab import userdata
from huggingface_hub import login
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, set_seed
from peft import LoraConfig, PeftModel
from datasets import load_dataset, Dataset, DatasetDict
from datetime import datetime
import matplotlib.pyplot as plt

# Setup

In [None]:
# Models (HuggingFace Hub IDs, also used to load each tokenizer)

LLAMA_3_1 = "meta-llama/Meta-Llama-3.1-8B"
QWEN_2_5 = "Qwen/Qwen2.5-7B"
GEMMA_2 = "google/gemma-2-9b"
PHI_3 = "microsoft/Phi-3-medium-4k-instruct"

# Constants

BASE_MODEL = LLAMA_3_1
HF_USER = "clanredhead"
DATASET_NAME = f"{HF_USER}/pricer-data"
MAX_SEQUENCE_LENGTH = 182
QUANT_4_BIT = False

# Used for writing to output in color

GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
RESET = "\033[0m"
COLOR_MAP = {"red":RED, "orange": YELLOW, "green": GREEN}

%matplotlib inline

## HuggingFace Token

**IMPORTANT**: the token requires read and write permissions.

In Colab, add `HF_TOKEN` under Secrets, paste the token value, and toggle on access for this notebook.

In [None]:
# Log in to HuggingFace

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

# Evaluate Tokenizers

Numbers used to test tokenizer:
> 0, 1, 10, 100, 999, 1000

Call used to get the tokens:

`tokens = tokenizer.encode(str(number), add_special_tokens=False)`

`add_special_tokens=False`: do not add special tokens, such as start-of-sentence and end-of-sentence markers, that would interfere with the comparison. Simply convert the text into the tokens that represent it.

## Function

In [None]:
def investigate_tokenizer(model_name):
  print("Investigating tokenizer for", model_name)
  tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
  for number in [0, 1, 10, 100, 999, 1000]:
    tokens = tokenizer.encode(str(number), add_special_tokens=False)
    print(f"The tokens for {number}: {tokens}")

In [None]:
# Investigate tokenizer of each model: LLAMA_3_1, QWEN_2_5, GEMMA_2, PHI_3

investigate_tokenizer(LLAMA_3_1)
investigate_tokenizer(QWEN_2_5)
investigate_tokenizer(GEMMA_2)
investigate_tokenizer(PHI_3)

## Results

Token investigation results:

`investigate_tokenizer(LLAMA_3_1)`

> The tokens for 0: [15]<br>
> The tokens for 1: [16]<br>
> The tokens for 10: [605]<br>
> The tokens for 100: [1041]<br>
> The tokens for 999: [5500]<br>
> The tokens for 1000: [1041, 15]

`investigate_tokenizer(QWEN_2_5)`

> The tokens for 0: [15]<br>
> The tokens for 1: [16]<br>
> The tokens for 10: [16, 15]<br>
> The tokens for 100: [16, 15, 15]<br>
> The tokens for 999: [24, 24, 24]<br>
> The tokens for 1000: [16, 15, 15, 15]

`investigate_tokenizer(GEMMA_2)`

> The tokens for 0: [235276]<br>
> The tokens for 1: [235274]<br>
> The tokens for 10: [235274, 235276]<br>
> The tokens for 100: [235274, 235276, 235276]<br>
> The tokens for 999: [235315, 235315, 235315]<br>
> The tokens for 1000: [235274, 235276, 235276, 235276]

`investigate_tokenizer(PHI_3)`

> The tokens for 0: [29871, 29900]<br>
> The tokens for 1: [29871, 29896]<br>
> The tokens for 10: [29871, 29896, 29900]<br>
> The tokens for 100: [29871, 29896, 29900, 29900]<br>
> The tokens for 999: [29871, 29929, 29929, 29929]<br>
> The tokens for 1000: [29871, 29896, 29900, 29900, 29900]

Note: there is a smaller variant of Phi-3 that has the same properties as Llama for representing numbers as tokens and might be worth testing.

For Llama, every number from 0 to 999 maps to a single token, whereas Qwen, Gemma and Phi-3 use one token per digit. This matters for price prediction: if the first token one of those models predicts is 9, the price could be \\$9, \\$99 or \\$999, and which one it is only becomes clear once the following tokens are generated.
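To make the comparison concrete, the short sketch below counts tokens per number using the token IDs recorded in the results above. It is only a tally of those six sample numbers, not a full check of each tokenizer; the names `RECORDED_TOKENS` and `single_token_up_to_999` are illustrative and not part of the notebook.

```python
# Token IDs copied from the investigation results above (six sample numbers per model).
RECORDED_TOKENS = {
    "Llama 3.1": {0: [15], 1: [16], 10: [605], 100: [1041], 999: [5500], 1000: [1041, 15]},
    "Qwen 2.5": {0: [15], 1: [16], 10: [16, 15], 100: [16, 15, 15],
                 999: [24, 24, 24], 1000: [16, 15, 15, 15]},
    "Gemma 2": {0: [235276], 1: [235274], 10: [235274, 235276],
                100: [235274, 235276, 235276], 999: [235315, 235315, 235315],
                1000: [235274, 235276, 235276, 235276]},
    "Phi-3": {0: [29871, 29900], 1: [29871, 29896], 10: [29871, 29896, 29900],
              100: [29871, 29896, 29900, 29900], 999: [29871, 29929, 29929, 29929],
              1000: [29871, 29896, 29900, 29900, 29900]},
}


def single_token_up_to_999(tokens_by_number):
    """True if every sampled number from 0 to 999 was encoded as exactly one token."""
    return all(len(ids) == 1 for number, ids in tokens_by_number.items() if number <= 999)


summary = {model: single_token_up_to_999(tokens) for model, tokens in RECORDED_TOKENS.items()}

for model, tokens in RECORDED_TOKENS.items():
    counts = {number: len(ids) for number, ids in tokens.items()}
    print(f"{model}: tokens per number {counts}, single token up to 999: {summary[model]}")
```

Only Llama 3.1 passes the check, so a price under 1,000 dollars is a single-token prediction for Llama but a multi-token one for the other three models.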

# Load Data

Retrieve uploaded data from HuggingFace.

`MAX_SEQUENCE_LENGTH = 182`: limits the text to 179 tokens plus 3 spare tokens, in case start-of-sentence and end-of-sentence tokens are added, so that the price is not cut off.

In [None]:
dataset = load_dataset(DATASET_NAME)
train = dataset['train']
test = dataset['test']

In [None]:
test[0]

`test[0]`
> {'text': "How much does this cost to the nearest dollar?\n\nOEM AC Compressor w/A/C Repair Kit For Ford F150 F-150 V8 & Lincoln Mark LT 2007 2008 - BuyAutoParts NEW\nAs one of the world's largest automotive parts suppliers, our parts are trusted every day by mechanics and vehicle owners worldwide. This A/C Compressor and Components Kit is manufactured and tested to the strictest OE standards for unparalleled performance. Built for trouble-free ownership and 100% visually inspected and quality tested, this A/C Compressor and Components Kit is backed by our 100% satisfaction guarantee. Guaranteed Exact Fit for easy installation 100% BRAND NEW, premium ISO/TS 16949 quality - tested to meet or exceed OEM specifications Engineered for superior durability, backed by industry-leading unlimited-mileage warranty Included in this K\n\nPrice is $",
 'price': 374.41}

# Prepare Base Llama Model for Evaluation

The Llama model has been chosen for further evaluation, so it is now loaded with 4-bit quantization and, for comparison, with 8-bit quantization.

`QUANT_4_BIT` selects 4-bit quantization when set to `True`, or 8-bit when set to `False`.

Load the Tokenizer and the Model:

- 4 bit Memory footprint: 5.6 GB
- 8 bit Memory footprint: 9.1 GB
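
These footprints can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes the published Llama 3.1 8B shapes (vocabulary 128,256, hidden size 4,096, 32 layers, MLP size 14,336, key/value projections 1,024 wide) and assumes that only the linear layers inside the transformer blocks are quantized, with the embeddings, output head and norm weights left at 16 bits. Both are assumptions rather than something this notebook states, and the estimate ignores quantization constants.

```python
# Assumed Llama 3.1 8B architecture (taken from the public model config, not from this notebook)
VOCAB, HIDDEN, LAYERS, MLP, KV_DIM = 128_256, 4_096, 32, 14_336, 1_024

# Linear weights in each transformer block: q and o projections, k and v projections,
# and the gate, up and down MLP projections
per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * MLP
quantized_params = LAYERS * per_layer

# Assumed to stay in 16-bit: input embeddings, output head and the norm weights
unquantized_params = 2 * VOCAB * HIDDEN + (2 * LAYERS + 1) * HIDDEN


def estimate_footprint_gb(bits):
    """Rough footprint in GB (1e9 bytes), ignoring quantization constants."""
    total_bytes = quantized_params * bits / 8 + unquantized_params * 2
    return total_bytes / 1e9


for bits in (4, 8):
    print(f"{bits}-bit estimate: {estimate_footprint_gb(bits):.1f} GB")
```

Under these assumptions the estimates come out at about 5.6 GB and 9.1 GB, in line with the measured footprints. The 16-bit embeddings and output head account for roughly 2.1 GB in both cases, which is why halving the bit width does not halve the footprint.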

`model_predict(prompt)` is how we call our model in inference mode.

- Take the prompt and encode it.
- `.to("cuda")` moves the encoded tokens to the GPU.
- The `attention_mask` line suppresses a warning. The mask marks which input positions are real tokens rather than padding. Because the pad token is set to the EOS token, `generate` cannot reliably infer the mask and warns unless one is passed explicitly. For a single unpadded prompt the mask is all ones, so it does not change the output.
- Call `generate` on the base model with `max_new_tokens=4` (only 1 token is really needed, but this leaves room in case it emits a dollar sign or similar) and `num_return_sequences=1` so that it returns a single answer.
- Take that one answer and use `decode` to turn it back into a string.
- Our `extract_price` function returns just the price from the answer.

In [None]:
# Pick the right quantization

if QUANT_4_BIT:
  quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
  )
else:
  # 8-bit loading takes no compute-dtype option; the bnb_4bit_* settings apply to 4-bit only
  quant_config = BitsAndBytesConfig(
    load_in_8bit=True
  )

In [None]:
# Load the Tokenizer and the Model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=quant_config,
    device_map="auto",
)
base_model.generation_config.pad_token_id = tokenizer.pad_token_id

print(f"Memory footprint: {base_model.get_memory_footprint() / 1e9:.1f} GB")

In [None]:
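def extract_price(s):
    # Return the first number found after "Price is $", or 0 if there is none
    if "Price is $" in s:
        contents = s.split("Price is $")[1]
        # Drop thousands separators and any stray dollar signs before matching
        contents = contents.replace(',','').replace('$','')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0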

In [None]:
extract_price("Price is $999 blah blah so cheap")
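
A few more inputs show how `extract_price` behaves on the kinds of strings the model may produce. The block repeats the function from the cell above so that it runs on its own; the sample strings other than the first are made up for illustration.

```python
import re


def extract_price(s):
    # Same logic as the notebook cell above, repeated so this block is self-contained
    if "Price is $" in s:
        contents = s.split("Price is $")[1]
        contents = contents.replace(',', '').replace('$', '')
        match = re.search(r"[-+]?\d*\.\d+|\d+", contents)
        return float(match.group()) if match else 0
    return 0


samples = [
    "Price is $999 blah blah so cheap",  # plain integer
    "Price is $1,234.50 plus shipping",  # thousands separator and decimals
    "Price is $$45",                     # stray extra dollar sign
    "Price is $ unknown",                # no number after the marker
    "No price mentioned",                # marker missing entirely
]

results = [extract_price(s) for s in samples]
for sample, price in zip(samples, results):
    print(f"{sample!r} -> {price}")
```

Note that a missing price is returned as 0, which the test harness then scores as a (usually large) error rather than skipping the item.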

In [None]:
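def model_predict(prompt):
    set_seed(42)
    # Encode the prompt and move the token IDs to the GPU
    inputs = tokenizer.encode(prompt, return_tensors="pt").to("cuda")
    # Single unpadded prompt, so every position is a real token
    attention_mask = torch.ones(inputs.shape, device="cuda")
    outputs = base_model.generate(inputs, max_new_tokens=4, attention_mask=attention_mask, num_return_sequences=1)
    # The decoded text contains the prompt followed by the newly generated tokens
    response = tokenizer.decode(outputs[0])
    return extract_price(response)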

In [None]:
model_predict(test[0]['text'])

# Evaluation

Use the test harness to see how the base Llama 3.1 model does against the Test dataset.

<img src="./../images/Product-Pricer-Open-Source-LLM-Llama-3-1-8B-Base-Model-4-Bit.jpg" alt="Distribution of Prices Predicted Using Llama 3.1 8B Base Model with 4 Bit Quantization" />

**Result**: Poor. Even worse than taking a random guess.

- Llama 3.1 8B Base Model 4 Bit Pricer Error=$395.72
- RMSLE=1.49
- Hits=28.0%

<img src="./../images/Product-Pricer-Open-Source-LLM-Llama-3-1-8B-Base-Model-8-Bit.jpg" alt="Distribution of Prices Predicted Using Llama 3.1 8B Base Model with 8 Bit Quantization" />

**Result**: Some improvement over 4-bit quantization and better than random guessing, but still a really bad result.

- Llama 3.1 8B Base Model 8 Bit Pricer Error=$308.13
- RMSLE=1.37
- Hits=29.2%
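
The three figures reported for each run are the mean absolute error, the root mean squared log error (RMSLE) and the share of "green" predictions. As a minimal sketch of how the `Tester` class below derives them, here is the same arithmetic on three data points. Only the first truth value, 374.41, comes from `test[0]`; the other values are made up, and the function name `score` is illustrative.

```python
import math


def score(guesses, truths):
    """Return (mean absolute error, RMSLE, hit percentage) using the Tester rules."""
    errors = [abs(g - t) for g, t in zip(guesses, truths)]
    # Squared log error, with +1 so that a value of 0 does not break the logarithm
    sles = [(math.log(t + 1) - math.log(g + 1)) ** 2 for g, t in zip(guesses, truths)]
    # A "hit" (green) is an error under $40 or under 20% of the true price
    hits = sum(1 for e, t in zip(errors, truths) if e < 40 or e / t < 0.2)
    n = len(truths)
    return sum(errors) / n, math.sqrt(sum(sles) / n), hits / n * 100


guesses = [400.0, 50.0, 100.0]   # made-up predictions
truths = [374.41, 300.0, 120.0]  # first value is the price of test[0]; the rest are made up

average_error, rmsle, hit_rate = score(guesses, truths)
print(f"Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hit_rate:.1f}%")
```

RMSLE penalizes relative rather than absolute misses, so in this toy example the guess of 50 against a true price of 300 dominates it, even though the other two guesses count as hits.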

In [None]:
class Tester:

    def __init__(self, predictor, data, title=None, size=250):
        self.predictor = predictor
        self.data = data
        self.title = title or predictor.__name__.replace("_", " ").title()
        self.size = size
        self.guesses = []
        self.truths = []
        self.errors = []
        self.sles = []
        self.colors = []

    def color_for(self, error, truth):
        if error<40 or error/truth < 0.2:
            return "green"
        elif error<80 or error/truth < 0.4:
            return "orange"
        else:
            return "red"

    def run_datapoint(self, i):
        datapoint = self.data[i]
        guess = self.predictor(datapoint["text"])
        truth = datapoint["price"]
        error = abs(guess - truth)
        log_error = math.log(truth+1) - math.log(guess+1)
        sle = log_error ** 2
        color = self.color_for(error, truth)
        title = datapoint["text"].split("\n\n")[1][:20] + "..."
        self.guesses.append(guess)
        self.truths.append(truth)
        self.errors.append(error)
        self.sles.append(sle)
        self.colors.append(color)
        print(f"{COLOR_MAP[color]}{i+1}: Guess: ${guess:,.2f} Truth: ${truth:,.2f} Error: ${error:,.2f} SLE: {sle:,.2f} Item: {title}{RESET}")

    def chart(self, title):
        max_error = max(self.errors)
        plt.figure(figsize=(12, 8))
        max_val = max(max(self.truths), max(self.guesses))
        plt.plot([0, max_val], [0, max_val], color='deepskyblue', lw=2, alpha=0.6)
        plt.scatter(self.truths, self.guesses, s=3, c=self.colors)
        plt.xlabel('Ground Truth')
        plt.ylabel('Model Estimate')
        plt.xlim(0, max_val)
        plt.ylim(0, max_val)
        plt.title(title)
        plt.show()

    def report(self):
        average_error = sum(self.errors) / self.size
        rmsle = math.sqrt(sum(self.sles) / self.size)
        hits = sum(1 for color in self.colors if color=="green")
        title = f"{self.title} Error=${average_error:,.2f} RMSLE={rmsle:,.2f} Hits={hits/self.size*100:.1f}%"
        self.chart(title)

    def run(self):
        self.error = 0
        for i in range(self.size):
            self.run_datapoint(i)
        self.report()

    @classmethod
    def test(cls, function, data):
        cls(function, data).run()

In [None]:
Tester.test(model_predict, test)