## Run and Evaluate DeepSeek-R1 (distilled model) with Ollama and OpenAI's simple-evals
### Notebook Walkthrough
Author: Kenneth Leung
___

### (1) Installation Instructions
- Download and install Ollama from https://ollama.com/download
- To start Ollama, either open the Ollama app on your local machine, or run `ollama serve` in the terminal.
- We will be experimenting with the distilled DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Qwen-7B models. Pull each model by running the following commands separately in a terminal:
  - `ollama pull deepseek-r1:14b`
  - `ollama pull deepseek-r1:7b`
- Once the downloads finish, return to this notebook to continue with the Python code.
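
Before moving on, it can help to confirm that the Ollama server is actually reachable. The following is a minimal sketch using only the standard library; `ollama_is_running` is a hypothetical helper (not part of the Ollama package), and it assumes the server listens on Ollama's default address, `http://localhost:11434`.

```python
import urllib.error
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP status
        return False
```

Calling `ollama_is_running()` should return `True` once the app is open or `ollama serve` is running, and `False` otherwise.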

___
### (2) Initial Setup

In [None]:
import ollama
import time
from simple_evals.gpqa_eval import GPQAEval

from datetime import datetime
from pprint import pprint
from utils.utils import load_config
from utils.samplers.ollama_sampler import OllamaSampler

In [2]:
config = load_config("config/config.yaml")
config

{'MODEL_NAME': 'deepseek-r1:7b',
 'EVAL_BENCHMARK': 'gpqa',
 'GPQA_VARIANT': 'diamond',
 'MATH_VARIANT': 'math_500_test',
 'EVAL_N_REPEATS': 1,
 'EVAL_N_EXAMPLES': 20}
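
Since the evaluation further down reads several of these keys directly, a quick sanity check of the loaded dictionary can catch typos early. This is a sketch only: `validate_config` and `REQUIRED_KEYS` are hypothetical names introduced here, and the set of required keys is inferred from how `config` is used in this notebook.

```python
# Keys this notebook reads from the config, with the type each is expected to have.
# (Inferred from the cells in this notebook; adjust to your own config.)
REQUIRED_KEYS = {
    "MODEL_NAME": str,
    "EVAL_BENCHMARK": str,
    "EVAL_N_REPEATS": int,
    "EVAL_N_EXAMPLES": int,
}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(
                f"{key} should be {expected_type.__name__}, "
                f"got {type(config[key]).__name__}"
            )
    # Counts must be positive for the evaluation to run at least once
    for key in ("EVAL_N_REPEATS", "EVAL_N_EXAMPLES"):
        value = config.get(key)
        if isinstance(value, int) and value < 1:
            problems.append(f"{key} must be >= 1, got {value}")
    return problems
```

For the config printed above, `validate_config(config)` returns an empty list.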

In [3]:
# Confirm that our models have been downloaded
for m in ollama.list():
    pprint(m)

('models',
 [Model(model='deepseek-r1:7b', modified_at=datetime.datetime(2025, 3, 12, 0, 54, 6, 263959, tzinfo=TzInfo(+08:00)), digest='0a8c266910232fd3291e71e5ba1e058cc5af9d411192cf88b6d30e92b6e73163', size=4683075271, details=ModelDetails(parent_model='', format='gguf', family='qwen2', families=['qwen2'], parameter_size='7.6B', quantization_level='Q4_K_M')),
  Model(model='deepseek-r1:1.5b', modified_at=datetime.datetime(2025, 3, 12, 0, 49, 45, 270048, tzinfo=TzInfo(+08:00)), digest='a42b25d8c10a841bd24724309898ae851466696a7d7f3a0a408b895538ccbc96', size=1117322599, details=ModelDetails(parent_model='', format='gguf', family='qwen2', families=['qwen2'], parameter_size='1.8B', quantization_level='Q4_K_M'))])
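
The raw listing is hard to scan, so a small check against the model names we intend to use can be handy. The sketch below works on plain lists of names and byte counts; `missing_models` and `size_in_gb` are hypothetical helpers, and the names can be collected from the `model` field visible in the output above.

```python
def missing_models(required, installed):
    """Return the required model names that are not in the installed list, in order."""
    installed_set = set(installed)
    return [name for name in required if name not in installed_set]


def size_in_gb(size_bytes: int) -> float:
    """Convert a size in bytes (as reported in the listing) to decimal gigabytes."""
    return size_bytes / 1e9


# Names taken from the listing above
installed = ["deepseek-r1:7b", "deepseek-r1:1.5b"]
print(missing_models(["deepseek-r1:14b", "deepseek-r1:7b"], installed))  # ['deepseek-r1:14b']
print(f"{size_in_gb(4683075271):.2f} GB")  # 4.68 GB
```

In the listing shown here, `deepseek-r1:14b` is absent, so it would need to be pulled before setting it as `MODEL_NAME`.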


___
### (3) Single test run

In [5]:
prompt = """
You are an advanced AI assistant analyzing an alien civilization’s mathematical system. 
They use an unfamiliar number system, and their number patterns follow unknown rules. 
You receive the following number sequences and must determine the missing number:

Sequences:
3, 6, 11, 18, 27, ?
2, 6, 12, 20, 30, ?
5, 10, 18, 30, 47, ?
Rules:
The aliens do not use base-10 but instead follow their own logical sequence.
Each sequence follows a hidden pattern based on an unknown mathematical principle.
You must determine the next number in each sequence and briefly explain the reasoning behind it.

Ensure you initiate your response with "<think>\n" at the beginning of your output.
"""

In [6]:
start_time = time.time()
response: ollama.ChatResponse = ollama.chat(model=config["MODEL_NAME"], 
                                            messages=[
                                              {'role': 'user',
                                               'content': prompt},
                                            ])
end_time = time.time()
execution_time = end_time - start_time
minutes = int(execution_time // 60)
seconds = execution_time % 60
if minutes > 0:
    print(f"\nExecution Time: {minutes} min {seconds:.2f} sec\n")
else:
    print(f"\nExecution Time: {seconds:.2f} sec\n")

print(response['message']['content'])


Execution Time: 26.50 sec

<think>
Okay, so I'm trying to figure out these three alien sequences. They don't use base-10, which is a bit confusing, but maybe the patterns are similar to something we understand in base-10 or another numbering system.

Starting with the first sequence: 3, 6, 11, 18, 27, ?. Let's look at the differences between each number. 

From 3 to 6 is +3.
Then from 6 to 11 is +5.
Next, 11 to 18 is +7.
Then 18 to 27 is +9.

Oh, I see a pattern here: the increments are increasing by 2 each time. So it's like adding an odd number each step—3, then 5, 7, 9... That makes sense because those are consecutive odd numbers starting from 3. 

If that continues, the next increment should be +11. Adding 11 to 27 gives us 38.

Moving on to the second sequence: 2, 6, 12, 20, 30, ?. Let's check the differences here too.

From 2 to 6 is +4.
6 to 12 is +6.
12 to 20 is +8.
20 to 30 is +10.

So again, each time we're adding an even number that increases by 2: 4, 6, 8, 10... Following 
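
The printed output is cut off, but the difference-based reasoning it uses can be checked mechanically. The sketch below extrapolates one term with finite differences; it assumes each sequence is polynomial (so that repeated differencing eventually reaches zeros), which is an assumption made here and not something the prompt guarantees. `next_term` is a hypothetical helper.

```python
def next_term(seq):
    """Extrapolate one term by summing the last entry of each finite-difference row."""
    total = 0
    row = list(seq)
    while row:
        total += row[-1]
        # Next row: differences between neighbouring entries
        row = [b - a for a, b in zip(row, row[1:])]
    return total


for seq in ([3, 6, 11, 18, 27], [2, 6, 12, 20, 30], [5, 10, 18, 30, 47]):
    print(seq, "->", next_term(seq))
# [3, 6, 11, 18, 27] -> 38
# [2, 6, 12, 20, 30] -> 42
# [5, 10, 18, 30, 47] -> 70
```

The first value matches the 38 the model reached above; the other two are what the same method gives and can serve as a reference when reading the full response.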

___
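
Responses like the one in the single test run open with a `<think>` block, so when only the final answer matters it is convenient to separate the two parts. This is a minimal sketch, assuming the reasoning is wrapped in one `<think>...</think>` pair; `split_reasoning` is a hypothetical helper, not part of Ollama or simple-evals.

```python
import re

# Non-greedy match so only the first <think>...</think> pair is captured
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is '' when no complete think block exists."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

For example, `split_reasoning(response['message']['content'])` would return the reasoning and the final answer as two strings. If the closing `</think>` tag is missing, the whole text is returned as the answer.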

### (4) Initiate GPQA Evaluation

In [None]:
start_time = time.time()

# Load the Ollama wrapper that wraps ollama.chat() to format prompts correctly and retrieve responses for GPQA eval
ollama_sampler = OllamaSampler(model_name=config["MODEL_NAME"])

# Instantiate the GPQAEval class for evaluation
gpqa_eval = GPQAEval(num_examples=config["EVAL_N_EXAMPLES"], 
                     n_repeats=config["EVAL_N_REPEATS"],
                     variant=config["GPQA_VARIANT"])
results = gpqa_eval(ollama_sampler)

end_time = time.time()
elapsed_seconds = end_time - start_time
minutes, seconds = divmod(elapsed_seconds, 60)
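
For readers curious about what a sampler wrapper around `ollama.chat()` might look like, here is a rough sketch. It is not the `OllamaSampler` from `utils/samplers/ollama_sampler.py`; `SketchOllamaSampler` and its `chat_fn` parameter are hypothetical, and the sketch assumes the simple-evals convention that a sampler is callable on a list of messages, returns the reply text, and offers a `_pack_message` helper.

```python
from typing import Callable, Optional


class SketchOllamaSampler:
    """Illustrative stand-in for a simple-evals sampler backed by ollama.chat()."""

    def __init__(self, model_name: str, chat_fn: Optional[Callable] = None):
        self.model_name = model_name
        if chat_fn is None:
            # Imported lazily so the class can be exercised without Ollama installed
            import ollama
            chat_fn = ollama.chat
        self._chat = chat_fn

    def _pack_message(self, role: str, content: str) -> dict:
        """Build one chat message in the role/content format ollama.chat() accepts."""
        return {"role": str(role), "content": content}

    def __call__(self, message_list: list) -> str:
        """Send the messages to the model and return only the reply text."""
        response = self._chat(model=self.model_name, messages=message_list)
        return response["message"]["content"]
```

Accepting `chat_fn` makes it possible to swap in a fake chat function for quick tests without running a model.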

In [None]:
# The returned `results` object is an EvalResult, which includes a list of
# SingleEvalResult objects and the aggregated metrics. Print the metrics:
print("Overall Evaluation Metrics:")
print("Score: ", results.score)
print(results.metrics)
print(f"Total Execution Time: {int(minutes)} min {seconds:.2f} sec")

Overall Evaluation Metrics:
Score:  0.1
{'chars': np.float64(28495.9), 'chars:std': np.float64(13586.347985753935), 'score:std': np.float64(0.30000000000000004)}
Total Execution Time: 25 min 37.46 sec
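
With only 20 examples, the score carries a lot of sampling noise. The sketch below assumes each example is scored as 0 or 1, in which case the reported `score:std` of 0.3 is just the standard deviation of binary outcomes with mean 0.1, and the standard error of the score follows from dividing by the square root of the sample count. `binary_score_stats` is a hypothetical helper.

```python
import math


def binary_score_stats(score: float, n: int):
    """Return (std, standard error) for n binary outcomes whose mean is `score`."""
    std = math.sqrt(score * (1 - score))  # population std of 0/1 outcomes
    stderr = std / math.sqrt(n)           # standard error of the mean
    return std, stderr


std, stderr = binary_score_stats(0.1, 20)
print(f"std={std:.3f}, stderr={stderr:.3f}")  # std=0.300, stderr=0.067
```

A standard error of roughly 0.067 on a score of 0.1 suggests that a run of this size gives only a coarse estimate; increasing `EVAL_N_EXAMPLES` or `EVAL_N_REPEATS` would tighten it at the cost of a longer run.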


In [30]:
# Save convo output from LLM
timestamp = datetime.now().strftime("%Y%m%d_%H%M")
output_filename = f"data/convos_output_{timestamp}.txt"

with open(output_filename, "w", encoding="utf-8") as f:
    for i, convo in enumerate(results.convos):
        f.write(f'### Conversation {i} ###\n')
        f.write("\n".join(map(str, convo)))
        f.write("\n\n")
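
The cell above assumes that the `data/` folder already exists. As a sketch of a slightly more defensive variant, the hypothetical `save_convos` helper below creates the folder if needed and returns the path it wrote to; the file format is the same as above.

```python
from datetime import datetime
from pathlib import Path


def save_convos(convos, out_dir="data"):
    """Write each conversation to a timestamped text file and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    timestamp = datetime.now().strftime("%Y%m%d_%H%M")
    target = out_path / f"convos_output_{timestamp}.txt"
    with target.open("w", encoding="utf-8") as f:
        for i, convo in enumerate(convos):
            f.write(f"### Conversation {i} ###\n")
            f.write("\n".join(map(str, convo)))
            f.write("\n\n")
    return target
```

It could be called as `save_convos(results.convos)` in place of the cell above.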