<a href="https://colab.research.google.com/github/mohmirzabr/SIMP/blob/main/Simple%20Math%20Benchmarking%20for%20LLM/misc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# 🎨 Math LLM Study Assistant Experiment Notebook
#@title 👨‍🔬 Simple Mathematical Benchmarking for Natural Language Processing Models
#@markdown This notebook is the preserved version of my recent research outputs. It is based on the [main notebook](https://github.com/mohmirzabr/SIMP/blob/main/1.%20Simple%20Mathematical%20Benchmarking%20for%20Natural%20Language%20Processing%20Models/MAIN.ipynb) from the repository, which was developed together with ChatGPT by OpenAI, and I credit it accordingly. This notebook guides you step by step through comparing at least three natural language processing models from Ollama on basic math questions from GSM8K (openai/gsm8k).
#@markdown <br> <br> This project is deliberately simple and can only benchmark simple mathematics such as GSM8K. Scoring extracts the numbers from the generated output and checks whether the expected number appears among them.
#@markdown <br> <br> In conclusion, this notebook isn't suitable for advanced benchmarking and can only process simple mathematical prompts.
#@markdown <br> <br> Visit me on [GitHub](https://github.com/mohmirzabr)

In [None]:
#@title 0. ⚠️ MAKE SURE YOU'RE USING NVIDIA T4 AS THE GPU AND PYTHON 3 FOR THE RUNTIME TYPE
#@markdown This cell checks whether your NVIDIA T4 GPU is enabled.
!nvidia-smi
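
The check above is visual. A minimal sketch of doing the same check in code, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_names`, `has_t4`, and `query_gpu_names` are hypothetical and not part of the original notebook:

```python
import subprocess

def parse_gpu_names(smi_output: str) -> list[str]:
    """Return one GPU name per non-empty line of `nvidia-smi --query-gpu=name` output."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]

def has_t4(names: list[str]) -> bool:
    """True if any reported GPU name mentions 'T4'."""
    return any("T4" in name for name in names)

def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for GPU names; returns [] if the tool is missing or fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_names(out.stdout)

names = query_gpu_names()
print("GPUs:", names, "| T4 present:", has_t4(names))
```

An empty list here usually means the runtime has no GPU attached, in which case the runtime type should be changed before continuing.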

In [None]:
#@title 1. 🧰 Install Ollama
# @markdown In this cell, we install the backend first, which is Ollama.

# 🚀 Installing Ollama
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
#@title 2. 📂 Load Math Questions
#@markdown This cell prepares the samples that each model will be tested on. It creates a file named 'math_questions.txt' with one question<TAB>answer pair per line, unless the file already exists.
templatequestions = False

#@markdown <br> <br> The variables below control how the dataset is exported from Hugging Face to 'math_questions.txt'.
#@markdown <br> <br> 1. 🧰 Which dataset do you want to use?
dataset = "openai/gsm8k" # @param {"type":"string"}
branch = "main" # @param {"type":"string"}
#@markdown <br> <br> 2. 📊 How many examples to use?
NUM_SAMPLES = 100 # @param {"type":"number"}

if not templatequestions:
  from pathlib import Path

  questions_file = Path('math_questions.txt')

  if not questions_file.exists():
      # 🔧 1. Install and import the HF datasets library
      !pip install -U datasets
      from datasets import load_dataset

      # 🔍 2. Load the GSM8K training split (`branch` is passed as the dataset config name, e.g. "main")
      gsm8k = load_dataset(dataset, branch, split="train")

      # ✂️ 3. Extract question + final answer (after "####")
      samples = []
      for ex in gsm8k.select(range(NUM_SAMPLES)):
          # collapse newlines/tabs so each question stays on one TSV line
          q_text = " ".join(ex["question"].split())
          # the answer field has steps + "#### <final>"
          raw_ans = ex["answer"]
          # split on "####" to get the last line as our truth
          truth = raw_ans.split("####")[-1].strip()
          samples.append((q_text, truth))

      # 📝 4. Save those to math_questions.txt as "question␉answer"
      with open(questions_file, 'w') as f:
          for q, a in samples:
              f.write(f"{q}\t{a}\n")

      print(f"✅ Pulled {NUM_SAMPLES} samples from {dataset} and wrote '{questions_file}'")

  # 📖 5. Now read back the file into `questions` for your experiment loop
  questions = []
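  with open(questions_file) as f:
      for line in f:
          if not line.strip():
              continue  # skip blank lines
          # split on the last tab so a stray tab in the question cannot break unpacking
          q, a = line.strip().rsplit("\t", 1)
          questions.append((q, a))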

  print(f"✅ Loaded {len(questions)} questions for testing!")
else:
  from pathlib import Path

  questions_file = Path('math_questions.txt')
  if not questions_file.exists():
      # Create a sample file with simple math problems
      sample = [
          "12 x 7\t84",
          "17 * 24\t408",
          "256 / 8\t32",
          "13 + 49\t62"
      ]
      with open(questions_file, 'w') as f:
          for line in sample:
              f.write(line + '\n')  # 📝 writing sample questions

  # Read the questions into a list
  questions = []
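  with open(questions_file) as f:
      for line in f:
          if not line.strip():
              continue  # skip blank lines
          # split on the last tab so a stray tab in the question cannot break unpacking
          q, a = line.strip().rsplit("\t", 1)
          questions.append((q, a))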

  print(f"✅ Loaded {len(questions)} questions for testing!")
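
The file format in this cell depends on two small rules: the final answer is whatever follows the last `####` in the GSM8K `answer` field, and each record must fit on one tab-separated line. A minimal sketch of both rules on a made-up record; the helper names are hypothetical, and collapsing whitespace in the question is an assumption added here so that a newline or tab cannot split a record:

```python
def final_answer(raw_answer: str) -> str:
    """Return the text after the last '####' marker, stripped."""
    return raw_answer.split("####")[-1].strip()

def to_tsv_line(question: str, answer: str) -> str:
    """Collapse all whitespace in the question so it fits on one TSV line."""
    return f"{' '.join(question.split())}\t{answer}\n"

def from_tsv_line(line: str) -> tuple[str, str]:
    """Split on the last tab; the answer is always the final field."""
    q, a = line.rstrip("\n").rsplit("\t", 1)
    return q, a

# Made-up record in the GSM8K layout (reasoning steps, then '#### <final>')
example = {
    "question": "Ann has 3 apples\nand buys 4 more. How many now?",
    "answer": "3 + 4 = 7\n#### 7",
}
line = to_tsv_line(example["question"], final_answer(example["answer"]))
print(from_tsv_line(line))
```

Splitting from the right keeps the answer intact even if a tab survives inside the question text.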

In [None]:
#@title 3. 🧰 First Model: tinyllama:1.1b
# @markdown In this cell, we'll pull the model to be tested in the next cells.
model = "tinyllama:1.1b" #@param {"type" : "string"}

# 🔥 Enabling Ollama (started in the background)
import subprocess, time
process = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to start before pulling

# ♻️ Pulling the models.
!ollama pull {model}

# 📦 Confirming the models that are ready.
!ollama list

In [None]:
#@title 4. 🏁 Set Up Results Logging
#@markdown We will record the question, model, answer, correct (True/False), time in seconds, and GPU memory delta (MiB) in a CSV file, which the next cells process and compare.

import csv

output_file = f"results_{model.replace(':','_')}.csv"
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # Header row for our CSV
    writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])
print(f"🗄️  Results will be saved in: {output_file}")

In [None]:
#@title 5. 🚀 Experiment is running!
#@markdown This is the cell where the experiment actually runs and the free GPU RAM starts to shrink. The notebook sends every question to the model and checks each answer as it arrives.

import time, subprocess, csv, re

selected_model = model  # from your dropdown cell

# Helper to read GPU memory
def get_gpu_mem():
    out = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE,
        text=True
    )
    return int(out.stdout.strip())

# 1️⃣ Measure GPU memory before loading/running the model
mem_before = get_gpu_mem()
print(f"🏷️  Running model: {selected_model}")
print(f"▶️  GPU memory BEFORE: {mem_before} MiB\n")

# 2️⃣ Prepare CSV logging (append mode; header assumed already written)
#    If you need to reinitialize, uncomment the lines below:
# import csv
# with open(output_file, 'w', newline='') as csvfile:
#     writer = csv.writer(csvfile)
#     writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])

# 3️⃣ Loop through questions
for question, truth in questions:
    # Print the question
    print(f"\n🔍 [Q] {question}")

    # Run the model and time it
    t0 = time.time()
    result = subprocess.run(
        ['ollama', 'run', selected_model, question],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    elapsed = time.time() - t0

    # Measure memory after (model already loaded)
    mem_after = get_gpu_mem()
    mem_delta = mem_after - mem_before

    # Parse the model’s output
    if result.returncode != 0:
        answer = result.stderr.strip()
    else:
        lines = [l for l in result.stdout.strip().split('\n') if l.strip()]
        answer = lines[-1] if lines else ""

    # 🔍 Diagnostic
    print(f"🗨️  Answer: {answer}")
    print(f"🔎 Looking for numeric token: '{truth}'")

    # === Numeric matching ===
    # Extract all integer/decimal tokens, allowing thousands separators ("1,080")
    found_numbers = re.findall(r'\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?', answer)
    # Strip the separators and convert them to floats
    nums = [float(tok.replace(',', '')) for tok in found_numbers]
    # Remove commas from the ground‐truth (e.g. "1,080" → "1080")
    truth_clean = truth.replace(',', '')
    truth_val   = float(truth_clean)
    # Correct if any extracted number equals the truth value
    correct = any(abs(n - truth_val) < 1e-6 for n in nums)

    # Diagnostics
    print(f"🔢  Extracted numeric tokens: {nums}")
    print(f"✔️ Correct? {correct}")

    mark = '✅' if correct else '❌'
    print(f"{mark}  ({elapsed:.2f}s, ΔRAM={mem_delta} MiB)")

    # 4️⃣ Log to CSV
    with open(output_file, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([
            question,
            selected_model,
            answer,
            correct,
            f"{elapsed:.2f}",
            mem_delta
        ])

# 5️⃣ Final memory report
print(f"\n✔️  Finished running {selected_model}. Final GPU RAM used: {mem_after} MiB")
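
The matching step in the loop above can be factored into a small function, which makes the scoring rule easy to test in isolation. A minimal sketch, with hypothetical names `extract_numbers` and `is_correct`; the pattern used here accepts thousands separators such as `1,080` in the model output:

```python
import re

# Either a comma-grouped number ("1,080", "12,345.5") or a plain one ("42", "3.5")
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def extract_numbers(text: str) -> list[float]:
    """Return every numeric token in text as a float, with separators removed."""
    return [float(tok.replace(",", "")) for tok in NUMBER_RE.findall(text)]

def is_correct(answer: str, truth: str, tol: float = 1e-6) -> bool:
    """True if any number mentioned in answer equals the ground-truth value."""
    truth_val = float(truth.replace(",", ""))
    return any(abs(n - truth_val) < tol for n in extract_numbers(answer))

print(is_correct("So the total is 1,080 dollars.", "1080"))       # True
print(is_correct("She has 12 apples left.", "21"))                # False
print(is_correct("Options are 5, 10 and 21; I pick 5.", "21"))    # True
```

The last call shows the limitation stated in the introduction: the rule only checks whether the expected number is mentioned, so an answer that lists the right value but settles on another still counts as correct.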

In [None]:
#@title 6. 📊 Summarize & Compare Results
#@markdown This cell summarizes the results for a single model. Using pandas, it loads the CSV file and computes the per-model metrics: accuracy (%), average inference time (seconds), and memory usage (MiB).

# ✅ Summary Code for Single-Model Runs (New Format)
import pandas as pd
import matplotlib.pyplot as plt

# Load the result file from current model
file_path = f'results_{selected_model.replace(":", "_")}.csv'
df = pd.read_csv(file_path)

# Convert types as needed
df['correct_num'] = df['correct'].astype(int)
df['time_s'] = df['time_s'].astype(float)
df['mem_delta_MiB'] = df['mem_delta_MiB'].astype(float)

# Calculate summary metrics
accuracy = df['correct_num'].mean() * 100
avg_time = df['time_s'].mean()
avg_mem = df['mem_delta_MiB'].mean()

# Create a summary DataFrame for display
summary_df = pd.DataFrame([{
    'Model': selected_model,
    'Accuracy (%)': round(accuracy, 1),
    'Avg Latency (s)': round(avg_time, 2),
    'GPU Memory Used (MiB)': round(avg_mem, 1)
}])

# 🧾 Show table
from IPython.display import display
print("📈 Overall Performance Summary:")
display(summary_df)
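
As a cross-check on the pandas summary, the same three metrics can be computed with the standard library alone. A minimal sketch on made-up rows in the results CSV layout; the sample data and the `summarize` helper are illustrative assumptions, not output from a real run:

```python
import csv
import io
from statistics import mean

# Made-up rows using the same header as the results file
sample_csv = """question,model,answer,correct,time_s,mem_delta_MiB
q1,demo,The answer is 4,True,1.20,900
q2,demo,It is 7,False,0.80,910
q3,demo,Total: 10,True,1.00,905
q4,demo,I think 3,True,1.40,905
"""

def summarize(rows):
    """Compute accuracy (%), mean latency (s), and mean memory delta (MiB)."""
    rows = list(rows)
    return {
        # csv stores booleans as the strings "True" / "False"
        "accuracy_pct": 100 * mean([1.0 if r["correct"] == "True" else 0.0 for r in rows]),
        "avg_latency_s": mean([float(r["time_s"]) for r in rows]),
        "avg_mem_delta_mib": mean([float(r["mem_delta_MiB"]) for r in rows]),
    }

summary = summarize(csv.DictReader(io.StringIO(sample_csv)))
print(summary)
```

Three of the four sample rows are marked correct, so the accuracy comes out at 75%.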

In [None]:
#@title 7. 🧹 Stopping Ollama Server
#@markdown To keep the memory measurements accurate, let's stop the Ollama server and start it again.
#@markdown <br> <br> This cell stops the previous Ollama server, starts a fresh one, and captures GPU memory usage before and after for benchmarking.

import subprocess, time

selected_model = model  # ← assumes you have a dropdown or text input earlier

# Kill any existing Ollama server
!pkill -f "ollama serve" || echo "No existing Ollama server."

# Wait briefly just to be safe
time.sleep(2)

# Pull the model (if not already pulled)
print(f"📥 Pulling model: {selected_model}")
!ollama pull {selected_model}

# Measure memory BEFORE loading the model
def get_gpu_mem():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, text=True
    )
    return int(result.stdout.strip())

mem_before = get_gpu_mem()
print(f"📉 GPU memory BEFORE loading model: {mem_before} MiB")

# Start Ollama server
print("🚀 Starting Ollama server...")
server_process = subprocess.Popen(["ollama", "serve"])

# Allow time for the server to start. `ollama serve` does not load a model on
# its own; a model is loaded into GPU memory on the first request for it.
print("⏳ Waiting for the Ollama server to start...")
time.sleep(8)  # increase if the server starts slowly

# Measure memory AFTER the server restart
mem_after = get_gpu_mem()
print(f"📈 GPU memory AFTER server restart: {mem_after} MiB")

# Save memory delta
mem_delta = mem_after - mem_before
print(f"🔍 GPU memory delta after restart: {mem_delta} MiB")
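
The fixed `time.sleep(8)` above either wastes time or is too short on a slow runtime. A minimal sketch of polling for readiness instead, assuming the server listens on Ollama's default address `http://127.0.0.1:11434/`; the helper names `ollama_is_up` and `wait_until` are hypothetical:

```python
import time
import urllib.error
import urllib.request

def ollama_is_up(url: str = "http://127.0.0.1:11434/", timeout: float = 1.0) -> bool:
    """True if an HTTP GET to the server root succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until(probe, max_wait_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Call probe() repeatedly until it returns True or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Usage in the notebook, after starting `ollama serve`:
# if not wait_until(ollama_is_up):
#     raise RuntimeError("Ollama server did not become ready in time")
```

Passing the probe as an argument keeps the waiting logic testable without a running server.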

In [None]:
#@title 8. 🧰 Second Model: phi3:3.8b
# @markdown In this cell, we'll pull the model to be tested in the next cells.
model = "phi3:3.8b" #@param {"type" : "string"}

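# 🔥 Enabling Ollama (if a server is already running from the previous step, this second launch exits on its own)
import subprocess, time
process = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to start before pulling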

# ♻️ Pulling the models.
!ollama pull {model}

# 📦 Confirming the models that are ready.
!ollama list

In [None]:
#@title 9. 🏁 Set Up Results Logging
#@markdown We will record the question, model, answer, correct (True/False), time in seconds, and GPU memory delta (MiB) in a CSV file, which the next cells process and compare.

import csv

output_file = f"results_{model.replace(':','_')}.csv"
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # Header row for our CSV
    writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])
print(f"🗄️  Results will be saved in: {output_file}")

In [None]:
#@title 10. 🚀 Experiment is running!
#@markdown This is the cell where the experiment actually runs and the free GPU RAM starts to shrink. The notebook sends every question to the model and checks each answer as it arrives.

import time, subprocess, csv, re

selected_model = model  # from your dropdown cell

# Helper to read GPU memory
def get_gpu_mem():
    out = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE,
        text=True
    )
    return int(out.stdout.strip())

# 1️⃣ Measure GPU memory before loading/running the model
mem_before = get_gpu_mem()
print(f"🏷️  Running model: {selected_model}")
print(f"▶️  GPU memory BEFORE: {mem_before} MiB\n")

# 2️⃣ Prepare CSV logging (append mode; header assumed already written)
#    If you need to reinitialize, uncomment the lines below:
# import csv
# with open(output_file, 'w', newline='') as csvfile:
#     writer = csv.writer(csvfile)
#     writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])

# 3️⃣ Loop through questions
for question, truth in questions:
    # Print the question
    print(f"\n🔍 [Q] {question}")

    # Run the model and time it
    t0 = time.time()
    result = subprocess.run(
        ['ollama', 'run', selected_model, question],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    elapsed = time.time() - t0

    # Measure memory after (model already loaded)
    mem_after = get_gpu_mem()
    mem_delta = mem_after - mem_before

    # Parse the model’s output
    if result.returncode != 0:
        answer = result.stderr.strip()
    else:
        lines = [l for l in result.stdout.strip().split('\n') if l.strip()]
        answer = lines[-1] if lines else ""

    # 🔍 Diagnostic
    print(f"🗨️  Answer: {answer}")
    print(f"🔎 Looking for numeric token: '{truth}'")

    # === Numeric matching ===
    # Extract all integer/decimal tokens, allowing thousands separators ("1,080")
    found_numbers = re.findall(r'\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?', answer)
    # Strip the separators and convert them to floats
    nums = [float(tok.replace(',', '')) for tok in found_numbers]
    # Remove commas from the ground‐truth (e.g. "1,080" → "1080")
    truth_clean = truth.replace(',', '')
    truth_val   = float(truth_clean)
    # Correct if any extracted number equals the truth value
    correct = any(abs(n - truth_val) < 1e-6 for n in nums)

    # Diagnostics
    print(f"🔢  Extracted numeric tokens: {nums}")
    print(f"✔️ Correct? {correct}")

    mark = '✅' if correct else '❌'
    print(f"{mark}  ({elapsed:.2f}s, ΔRAM={mem_delta} MiB)")

    # 4️⃣ Log to CSV
    with open(output_file, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([
            question,
            selected_model,
            answer,
            correct,
            f"{elapsed:.2f}",
            mem_delta
        ])

# 5️⃣ Final memory report
print(f"\n✔️  Finished running {selected_model}. Final GPU RAM used: {mem_after} MiB")
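
The wall-clock time measured around `ollama run` includes CLI startup and, on the first question, model loading. A hedged sketch of an alternative, assuming the Ollama HTTP API at `http://127.0.0.1:11434/api/generate`, whose non-streaming reply reports durations in nanoseconds; the helper names and the sample reply below are made up for illustration and this is not what the notebook itself does:

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str,
                           host: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def timing_from_response(body: dict) -> dict:
    """Convert the nanosecond duration fields of a reply to seconds."""
    ns = 1e9
    return {
        "answer": body.get("response", ""),
        "total_s": body.get("total_duration", 0) / ns,
        "load_s": body.get("load_duration", 0) / ns,
        "eval_s": body.get("eval_duration", 0) / ns,
    }

# Made-up reply body, shaped like a non-streaming /api/generate response
fake_body = {"response": "The answer is 84.", "total_duration": 2_500_000_000,
             "load_duration": 500_000_000, "eval_duration": 1_500_000_000}
print(timing_from_response(fake_body))

# Against a running server:
# with urllib.request.urlopen(build_generate_request("tinyllama:1.1b", "12 x 7")) as r:
#     print(timing_from_response(json.load(r)))
```

Separating load time from generation time would make latency comparisons between models of different sizes fairer, since only the first request pays the loading cost.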

In [None]:
#@title 11. 📊 Summarize & Compare Results
#@markdown This cell summarizes the results for a single model. Using pandas, it loads the CSV file and computes the per-model metrics: accuracy (%), average inference time (seconds), and memory usage (MiB).

# ✅ Summary Code for Single-Model Runs (New Format)
import pandas as pd
import matplotlib.pyplot as plt

# Load the result file from current model
file_path = f'results_{selected_model.replace(":", "_")}.csv'
df = pd.read_csv(file_path)

# Convert types as needed
df['correct_num'] = df['correct'].astype(int)
df['time_s'] = df['time_s'].astype(float)
df['mem_delta_MiB'] = df['mem_delta_MiB'].astype(float)

# Calculate summary metrics
accuracy = df['correct_num'].mean() * 100
avg_time = df['time_s'].mean()
avg_mem = df['mem_delta_MiB'].mean()

# Create a summary DataFrame for display
summary_df = pd.DataFrame([{
    'Model': selected_model,
    'Accuracy (%)': round(accuracy, 1),
    'Avg Latency (s)': round(avg_time, 2),
    'GPU Memory Used (MiB)': round(avg_mem, 1)
}])

# 🧾 Show table
from IPython.display import display
print("📈 Overall Performance Summary:")
display(summary_df)

In [None]:
#@title 12. 🧹 Stopping Ollama Server
#@markdown To keep the memory measurements accurate, let's stop the Ollama server and start it again.
#@markdown <br> <br> This cell stops the previous Ollama server, starts a fresh one, and captures GPU memory usage before and after for benchmarking.

import subprocess, time

selected_model = model  # ← assumes you have a dropdown or text input earlier

# Kill any existing Ollama server
!pkill -f "ollama serve" || echo "No existing Ollama server."

# Wait briefly just to be safe
time.sleep(2)

# Pull the model (if not already pulled)
print(f"📥 Pulling model: {selected_model}")
!ollama pull {selected_model}

# Measure memory BEFORE loading the model
def get_gpu_mem():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, text=True
    )
    return int(result.stdout.strip())

mem_before = get_gpu_mem()
print(f"📉 GPU memory BEFORE loading model: {mem_before} MiB")

# Start Ollama server
print("🚀 Starting Ollama server...")
server_process = subprocess.Popen(["ollama", "serve"])

# Allow time for the server to start. `ollama serve` does not load a model on
# its own; a model is loaded into GPU memory on the first request for it.
print("⏳ Waiting for the Ollama server to start...")
time.sleep(8)  # increase if the server starts slowly

# Measure memory AFTER the server restart
mem_after = get_gpu_mem()
print(f"📈 GPU memory AFTER server restart: {mem_after} MiB")

# Save memory delta
mem_delta = mem_after - mem_before
print(f"🔍 GPU memory delta after restart: {mem_delta} MiB")

In [None]:
#@title 13. 🧰 Third Model: mistral:7b
# @markdown In this cell, we'll pull the model to be tested in the next cells.
model = "mistral:7b" #@param {"type" : "string"}

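# 🔥 Enabling Ollama (if a server is already running from the previous step, this second launch exits on its own)
import subprocess, time
process = subprocess.Popen(["ollama", "serve"])
time.sleep(5)  # give the server a moment to start before pulling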

# ♻️ Pulling the models.
!ollama pull {model}

# 📦 Confirming the models that are ready.
!ollama list

In [None]:
#@title 14. 🏁 Set Up Results Logging
#@markdown We will record the question, model, answer, correct (True/False), time in seconds, and GPU memory delta (MiB) in a CSV file, which the next cells process and compare.

import csv

output_file = f"results_{model.replace(':','_')}.csv"
with open(output_file, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # Header row for our CSV
    writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])
print(f"🗄️  Results will be saved in: {output_file}")

In [None]:
#@title 15. 🚀 Experiment is running!
#@markdown This is the cell where the experiment actually runs and the free GPU RAM starts to shrink. The notebook sends every question to the model and checks each answer as it arrives.

import time, subprocess, csv, re

selected_model = model  # from your dropdown cell

# Helper to read GPU memory
def get_gpu_mem():
    out = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE,
        text=True
    )
    return int(out.stdout.strip())

# 1️⃣ Measure GPU memory before loading/running the model
mem_before = get_gpu_mem()
print(f"🏷️  Running model: {selected_model}")
print(f"▶️  GPU memory BEFORE: {mem_before} MiB\n")

# 2️⃣ Prepare CSV logging (append mode; header assumed already written)
#    If you need to reinitialize, uncomment the lines below:
# import csv
# with open(output_file, 'w', newline='') as csvfile:
#     writer = csv.writer(csvfile)
#     writer.writerow(['question', 'model', 'answer', 'correct', 'time_s', 'mem_delta_MiB'])

# 3️⃣ Loop through questions
for question, truth in questions:
    # Print the question
    print(f"\n🔍 [Q] {question}")

    # Run the model and time it
    t0 = time.time()
    result = subprocess.run(
        ['ollama', 'run', selected_model, question],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    elapsed = time.time() - t0

    # Measure memory after (model already loaded)
    mem_after = get_gpu_mem()
    mem_delta = mem_after - mem_before

    # Parse the model’s output
    if result.returncode != 0:
        answer = result.stderr.strip()
    else:
        lines = [l for l in result.stdout.strip().split('\n') if l.strip()]
        answer = lines[-1] if lines else ""

    # 🔍 Diagnostic
    print(f"🗨️  Answer: {answer}")
    print(f"🔎 Looking for numeric token: '{truth}'")

    # === Numeric matching ===
    # Extract all integer/decimal tokens, allowing thousands separators ("1,080")
    found_numbers = re.findall(r'\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?', answer)
    # Strip the separators and convert them to floats
    nums = [float(tok.replace(',', '')) for tok in found_numbers]
    # Remove commas from the ground‐truth (e.g. "1,080" → "1080")
    truth_clean = truth.replace(',', '')
    truth_val   = float(truth_clean)
    # Correct if any extracted number equals the truth value
    correct = any(abs(n - truth_val) < 1e-6 for n in nums)

    # Diagnostics
    print(f"🔢  Extracted numeric tokens: {nums}")
    print(f"✔️ Correct? {correct}")

    mark = '✅' if correct else '❌'
    print(f"{mark}  ({elapsed:.2f}s, ΔRAM={mem_delta} MiB)")

    # 4️⃣ Log to CSV
    with open(output_file, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([
            question,
            selected_model,
            answer,
            correct,
            f"{elapsed:.2f}",
            mem_delta
        ])

# 5️⃣ Final memory report
print(f"\n✔️  Finished running {selected_model}. Final GPU RAM used: {mem_after} MiB")

In [None]:
#@title 16. 📊 Summarize & Compare Results
#@markdown This cell summarizes the results for a single model. Using pandas, it loads the CSV file and computes the per-model metrics: accuracy (%), average inference time (seconds), and memory usage (MiB).

# ✅ Summary Code for Single-Model Runs (New Format)
import pandas as pd
import matplotlib.pyplot as plt

# Load the result file from current model
file_path = f'results_{selected_model.replace(":", "_")}.csv'
df = pd.read_csv(file_path)

# Convert types as needed
df['correct_num'] = df['correct'].astype(int)
df['time_s'] = df['time_s'].astype(float)
df['mem_delta_MiB'] = df['mem_delta_MiB'].astype(float)

# Calculate summary metrics
accuracy = df['correct_num'].mean() * 100
avg_time = df['time_s'].mean()
avg_mem = df['mem_delta_MiB'].mean()

# Create a summary DataFrame for display
summary_df = pd.DataFrame([{
    'Model': selected_model,
    'Accuracy (%)': round(accuracy, 1),
    'Avg Latency (s)': round(avg_time, 2),
    'GPU Memory Used (MiB)': round(avg_mem, 1)
}])

# 🧾 Show table
from IPython.display import display
print("📈 Overall Performance Summary:")
display(summary_df)

In [None]:
#@title 17. 🧹 Stopping Ollama Server
#@markdown To keep the memory measurements accurate, let's stop the Ollama server and start it again.
#@markdown <br> <br> This cell stops the previous Ollama server, starts a fresh one, and captures GPU memory usage before and after for benchmarking.

import subprocess, time

selected_model = model  # ← assumes you have a dropdown or text input earlier

# Kill any existing Ollama server
!pkill -f "ollama serve" || echo "No existing Ollama server."

# Wait briefly just to be safe
time.sleep(2)

# Pull the model (if not already pulled)
print(f"📥 Pulling model: {selected_model}")
!ollama pull {selected_model}

# Measure memory BEFORE loading the model
def get_gpu_mem():
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, text=True
    )
    return int(result.stdout.strip())

mem_before = get_gpu_mem()
print(f"📉 GPU memory BEFORE loading model: {mem_before} MiB")

# Start Ollama server
print("🚀 Starting Ollama server...")
server_process = subprocess.Popen(["ollama", "serve"])

# Allow time for the server to start. `ollama serve` does not load a model on
# its own; a model is loaded into GPU memory on the first request for it.
print("⏳ Waiting for the Ollama server to start...")
time.sleep(8)  # increase if the server starts slowly

# Measure memory AFTER the server restart
mem_after = get_gpu_mem()
print(f"📈 GPU memory AFTER server restart: {mem_after} MiB")

# Save memory delta
mem_delta = mem_after - mem_before
print(f"🔍 GPU memory delta after restart: {mem_delta} MiB")

In [None]:
#@title 18. 🔀 Merge All Results & Final Comparison
#@markdown This cell combines every CSV file whose name starts with 'results_', computes the per-model metrics, and saves the comparison table to 'results_combined_summary.csv'.
import pandas as pd
import glob
import os

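# 1️⃣ Find all result CSVs in this directory (skip the summary file left by a previous run)
pattern = "results_*.csv"
files = [fn for fn in glob.glob(pattern)
         if os.path.basename(fn) != "results_combined_summary.csv"]
if not files:
    raise FileNotFoundError(f"No files matching {pattern} found!")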

# 2️⃣ Load each file and extract the model name from its filename
all_dfs = []
for fn in files:
    # e.g. fn = "results_phi3_3.8b.csv"
    model_name = os.path.basename(fn).removeprefix("results_").removesuffix(".csv")
    df = pd.read_csv(fn)
    df['model'] = model_name
    all_dfs.append(df)

# 3️⃣ Concatenate into one DataFrame
combined = pd.concat(all_dfs, ignore_index=True)

# 4️⃣ Compute per-model metrics
combined['correct_num'] = combined['correct'].astype(int)
combined['time_s']      = combined['time_s'].astype(float)
combined['mem_delta_MiB'] = combined['mem_delta_MiB'].astype(float)

summary = combined.groupby('model').agg(
    Accuracy       = ('correct_num',     lambda x: x.mean() * 100),
    Avg_Latency_s  = ('time_s',          'mean'),
    Avg_Mem_Use_MiB= ('mem_delta_MiB',   'mean')
).round(2).reset_index()

# 5️⃣ Display the final comparison table
from IPython.display import display
print("🏁 Final Comparison of All Models:")
display(summary)

# 🎉 Optional: save this summary if you like
summary.to_csv("results_combined_summary.csv", index=False)
print("Saved combined summary to results_combined_summary.csv")

# 🎉 End of experiment. Review the printed logs and summary to draw conclusions!
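
Two details of the merge step are easy to get wrong: the summary file itself matches the `results_*.csv` pattern on a rerun, and the model label is recovered from the file name. A minimal standard-library sketch of both, on made-up file names and rows; the helper names are hypothetical:

```python
import os
from collections import defaultdict
from statistics import mean

SUMMARY_FILE = "results_combined_summary.csv"

def model_from_filename(path: str) -> str:
    """'results_phi3_3.8b.csv' -> 'phi3_3.8b'."""
    name = os.path.basename(path)
    return name[len("results_"):-len(".csv")]

def result_files(paths):
    """Keep per-model result files only, dropping the combined summary."""
    return sorted(p for p in paths if os.path.basename(p) != SUMMARY_FILE)

def per_model_accuracy(rows):
    """rows: iterable of (model, correct) pairs -> {model: accuracy in %}."""
    by_model = defaultdict(list)
    for model, correct in rows:
        by_model[model].append(1.0 if correct else 0.0)
    return {m: round(100 * mean(v), 2) for m, v in by_model.items()}

files = result_files(["results_phi3_3.8b.csv", SUMMARY_FILE, "results_mistral_7b.csv"])
print([model_from_filename(f) for f in files])

rows = [("phi3_3.8b", True), ("phi3_3.8b", False), ("mistral_7b", True)]
print(per_model_accuracy(rows))
```

Note that the recovered label uses an underscore where the Ollama tag has a colon (`phi3_3.8b` for `phi3:3.8b`), because the colon was replaced when the results file was named.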