## Run and Evaluate DeepSeek-R1 (distilled model) with Ollama and OpenAI's simple-evals
### Notebook Walkthrough
Author: Kenneth Leung
___

### (1) Installation Instructions
- Download and install Ollama from https://ollama.com/download
- To start Ollama, either open the Ollama app on your local machine, or run `ollama serve` in the terminal.
- We will experiment with two distilled models, DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Qwen-1.5B. Pull each model by running the following commands separately in a terminal:
  - `ollama pull deepseek-r1:7b`
  - `ollama pull deepseek-r1:1.5b`
- Once done, return to this notebook to continue with the Python code.

**References**
- https://github.com/ollama/ollama-python/blob/main/examples/README.md

___
### (2) Initial Setup

In [1]:
import os
import ollama
import time
from simple_evals.gpqa_eval import GPQAEval

from pprint import pprint
from utils.utils import load_config
from utils.samplers.ollama_sampler import OllamaSampler

In [2]:
config = load_config("config/config.yaml")
config

{'MODEL_NAME': 'deepseek-r1:1.5b',
 'EVAL_BENCHMARK': 'math',
 'GPQA_VARIANT': 'diamond',
 'MATH_VARIANT': 'math_500_test',
 'EVAL_N_REPEATS': 1,
 'EVAL_N_EXAMPLES': 5}
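The config carries one variant key per benchmark (`GPQA_VARIANT`, `MATH_VARIANT`), and `EVAL_BENCHMARK` presumably selects which one applies. A minimal sketch of that lookup follows. The helper `resolve_variant` is hypothetical and not part of the repo, and the benchmark name `gpqa` is an assumption (only `math` appears in the config above).

```python
def resolve_variant(config: dict) -> str:
    """Return the variant setting that matches config["EVAL_BENCHMARK"].

    Hypothetical helper: the benchmark names and the key mapping are
    assumptions inferred from the config keys printed above.
    """
    variant_keys = {"gpqa": "GPQA_VARIANT", "math": "MATH_VARIANT"}
    benchmark = str(config["EVAL_BENCHMARK"]).lower()
    if benchmark not in variant_keys:
        raise ValueError(f"Unsupported EVAL_BENCHMARK: {benchmark!r}")
    return config[variant_keys[benchmark]]


# Same values as the config printed above.
example_config = {
    "MODEL_NAME": "deepseek-r1:1.5b",
    "EVAL_BENCHMARK": "math",
    "GPQA_VARIANT": "diamond",
    "MATH_VARIANT": "math_500_test",
    "EVAL_N_REPEATS": 1,
    "EVAL_N_EXAMPLES": 5,
}
print(resolve_variant(example_config))  # math_500_test
```

Note that section (4) of this notebook reads `GPQA_VARIANT` directly, regardless of the `EVAL_BENCHMARK` value.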

In [5]:
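# Confirm that our models have been downloaded.
# Iterating over the list response yields (field name, value) pairs,
# so this prints a single ('models', [...]) tuple.
for m in ollama.list():
    pprint(m)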

('models',
 [Model(model='deepseek-r1:7b', modified_at=datetime.datetime(2025, 3, 12, 0, 54, 6, 263959, tzinfo=TzInfo(+08:00)), digest='0a8c266910232fd3291e71e5ba1e058cc5af9d411192cf88b6d30e92b6e73163', size=4683075271, details=ModelDetails(parent_model='', format='gguf', family='qwen2', families=['qwen2'], parameter_size='7.6B', quantization_level='Q4_K_M')),
  Model(model='deepseek-r1:1.5b', modified_at=datetime.datetime(2025, 3, 12, 0, 49, 45, 270048, tzinfo=TzInfo(+08:00)), digest='a42b25d8c10a841bd24724309898ae851466696a7d7f3a0a408b895538ccbc96', size=1117322599, details=ModelDetails(parent_model='', format='gguf', family='qwen2', families=['qwen2'], parameter_size='1.8B', quantization_level='Q4_K_M'))])
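The raw listing is hard to scan, so a compact one-line summary per model can help. The sketch below uses only the fields visible in the output above (`model`, `size`, `details.parameter_size`, `details.quantization_level`). The helper name `summarize_model` is hypothetical, and stand-in objects replace the live response so the snippet runs without an Ollama server.

```python
from types import SimpleNamespace


def summarize_model(entry) -> str:
    """Format one model entry as 'name | parameters | quantization | size'."""
    size_gb = entry.size / 1e9  # decimal gigabytes
    return (f"{entry.model} | {entry.details.parameter_size} | "
            f"{entry.details.quantization_level} | {size_gb:.2f} GB")


# Stand-ins mirroring the fields printed above (not a live ollama.list() call).
entries = [
    SimpleNamespace(
        model="deepseek-r1:7b", size=4683075271,
        details=SimpleNamespace(parameter_size="7.6B", quantization_level="Q4_K_M")),
    SimpleNamespace(
        model="deepseek-r1:1.5b", size=1117322599,
        details=SimpleNamespace(parameter_size="1.8B", quantization_level="Q4_K_M")),
]

for entry in entries:
    print(summarize_model(entry))
# deepseek-r1:7b | 7.6B | Q4_K_M | 4.68 GB
# deepseek-r1:1.5b | 1.8B | Q4_K_M | 1.12 GB
```

With a running server, the same function should work on the entries in `ollama.list().models`, since those expose the attributes shown in the output.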


In [12]:
# Check GPU usage
!nvidia-smi

Wed Mar 12 00:52:59 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  |   00000000:09:00.0  On |                  N/A |
|  0%   43C    P0             51W /  200W |    1247MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
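A rough way to read this output is to compare each model's file size against the free GPU memory (8192 MiB total, 1247 MiB in use). The sketch below is only a back-of-the-envelope check: the function `fits_in_vram` and the 1024 MiB headroom are assumptions, and actual memory use also depends on context length and runtime overhead.

```python
MIB = 1024 ** 2  # bytes per mebibyte


def fits_in_vram(model_bytes: int, used_mib: int, total_mib: int,
                 headroom_mib: int = 1024) -> bool:
    """Rough check: model file size plus headroom versus free GPU memory.

    headroom_mib is an arbitrary allowance for the KV cache and runtime
    buffers, not a measured value.
    """
    free_mib = total_mib - used_mib
    needed_mib = model_bytes / MIB + headroom_mib
    return needed_mib <= free_mib


# Sizes from the model listing, memory figures from the nvidia-smi output.
print(fits_in_vram(4683075271, used_mib=1247, total_mib=8192))  # True
print(fits_in_vram(1117322599, used_mib=1247, total_mib=8192))  # True
```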
                                                

___
### (3) Single Test Run

In [13]:
prompt = """
You are an advanced AI assistant analyzing an alien civilization’s mathematical system. 
They use an unfamiliar number system, and their number patterns follow unknown rules. 
You receive the following number sequences and must determine the missing number:

Sequences:
3, 6, 11, 18, 27, ?
2, 6, 12, 20, 30, ?
5, 10, 18, 30, 47, ?
Rules:
The aliens do not use base-10 but instead follow their own logical sequence.
Each sequence follows a hidden pattern based on an unknown mathematical principle.
You must determine the next number in each sequence and briefly explain the reasoning behind it.

Ensure you initiate your response with "<think>\n at the beginning of your output.
"""

In [14]:
start_time = time.time()
response: ollama.ChatResponse = ollama.chat(model=config["MODEL_NAME"], 
                                            messages=[
                                              {'role': 'user',
                                               'content': prompt},
                                            ])
end_time = time.time()
execution_time = end_time - start_time
minutes = int(execution_time // 60)
seconds = execution_time % 60
if minutes > 0:
    print(f"\nExecution Time: {minutes} min {seconds:.2f} sec\n")
else:
    print(f"\nExecution Time: {seconds:.2f} sec\n")

print(response['message']['content'])


Execution Time: 27.49 sec

<think>
Okay, so I have these three sequences to figure out. Let me take them one by one. 

First up is 3, 6, 11, 18, 27, ?. Hmm, I notice that each number seems to be increasing, but the gaps between them are getting bigger. From 3 to 6 is +3, then from 6 to 11 it's +5, then +7, +9, and so on. It looks like each time we're adding an odd number, specifically the next odd integer in sequence. So after adding 9, which was from 18 to 27, the next step would be adding 11. Let me check: 27 plus 11 is 38. That seems right.

Moving on to the second sequence: 2, 6, 12, 20, 30, ?. I see that each number increases by more than just a small amount. The differences are 4, 6, 8, 10... These increments themselves seem to be increasing by 2 each time. So after adding 10 to get from 20 to 30, the next increment should be 12. Adding that gives me 42. That makes sense because 2+4=6, 6+6=12, 12+8=20, and so on.

Now the third sequence: 5, 10, 18, 30, 47, ?. This one is trickie
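Two small post-processing steps are useful for a response like the one above: separating the `<think>` reasoning from the final answer, and checking the proposed next terms against a reference. The sketch below is not part of the original notebook. `split_reasoning` assumes the reasoning is closed by a `</think>` tag (and tolerates a truncated response with no closing tag), while `next_term` assumes each sequence continues its finite-difference pattern, which is one plausible reading of the puzzle rather than a definitive answer key.

```python
import re

# Lazily match everything after <think> up to </think> or the end of the text.
THINK_BLOCK = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> tag exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


def next_term(seq: list[int]) -> int:
    """Extrapolate one term by extending the finite-difference table.

    Differences are taken repeatedly until a row is constant; the next term
    is the sum of the last entry of every row.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    rows = [list(seq)]
    while len(set(rows[-1])) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return sum(row[-1] for row in rows)


reasoning, answer = split_reasoning("<think>\nAdd odd numbers.\n</think>\n\nNext: 38")
print(answer)  # Next: 38

for seq in ([3, 6, 11, 18, 27], [2, 6, 12, 20, 30], [5, 10, 18, 30, 47]):
    print(seq, "->", next_term(seq))
# [3, 6, 11, 18, 27] -> 38
# [2, 6, 12, 20, 30] -> 42
# [5, 10, 18, 30, 47] -> 70
```

The first two values agree with the model's reasoning shown above (38 and 42). The third value follows only from the finite-difference assumption.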

___

### (4) Initiate GPQA Evaluation

In [None]:
start_time = time.time()

# Instantiate the Ollama sampler, which wraps ollama.chat() to format prompts and return responses for the GPQA eval
ollama_sampler = OllamaSampler(model_name=config["MODEL_NAME"])

# Instantiate the GPQAEval class for evaluation
gpqa_eval = GPQAEval(n_repeats=config["EVAL_N_REPEATS"],
                     num_examples=config["EVAL_N_EXAMPLES"], 
                     variant=config["GPQA_VARIANT"])

# Run GPQA evaluation
results = gpqa_eval(ollama_sampler)

end_time = time.time()
elapsed_seconds = end_time - start_time
minutes, seconds = divmod(elapsed_seconds, 60)

# The returned `results` object is an EvalResult, which includes a list of
# SingleEvalResult objects and aggregated metrics. Print the metrics:
print("Overall Evaluation Metrics:")
print(results.metrics)
print(f"Total Execution Time: {int(minutes)} min {seconds:.2f} sec")
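The timing logic in this cell and in the single test run of section (3) is nearly identical, so it could be factored into a small helper. The sketch below is one possible refactoring, not the notebook's own code. The names `format_duration` and `timed` are hypothetical, and `time.perf_counter()` is swapped in for `time.time()` because it is better suited to measuring elapsed intervals.

```python
import time
from contextlib import contextmanager


def format_duration(elapsed_seconds: float) -> str:
    """Format seconds as 'M min S.SS sec', omitting minutes when zero."""
    minutes, seconds = divmod(elapsed_seconds, 60)
    if minutes >= 1:
        return f"{int(minutes)} min {seconds:.2f} sec"
    return f"{seconds:.2f} sec"


@contextmanager
def timed(label: str):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {format_duration(time.perf_counter() - start)}")


print(format_duration(27.49))  # 27.49 sec
print(format_duration(87.5))   # 1 min 27.50 sec

with timed("Total Execution Time"):
    total = sum(range(1_000_000))  # placeholder for the evaluation call
```

In this notebook, the placeholder line would be replaced by `results = gpqa_eval(ollama_sampler)`.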

In [None]:
results
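The `OllamaSampler` used in section (4) lives in `utils/samplers/ollama_sampler.py`, which is not shown in this notebook. For orientation, here is a minimal sketch of what such a wrapper could look like. It assumes the simple-evals convention where a sampler is a callable that takes a list of `{"role", "content"}` messages and returns a string, with a `_pack_message` helper. The class name `OllamaSamplerSketch` and the injectable `chat_fn` parameter are hypothetical and not the repo's actual implementation.

```python
from typing import Callable, Optional


class OllamaSamplerSketch:
    """Hypothetical sampler that forwards a message list to a chat function."""

    def __init__(self, model_name: str, chat_fn: Optional[Callable] = None):
        self.model_name = model_name
        self._chat_fn = chat_fn  # injectable so the sketch can run offline

    def _pack_message(self, role: str, content: str) -> dict:
        """Build a single chat message in the role/content format."""
        return {"role": role, "content": content}

    def __call__(self, message_list: list) -> str:
        chat_fn = self._chat_fn
        if chat_fn is None:
            # Imported lazily so the class loads without Ollama installed.
            import ollama
            chat_fn = ollama.chat
        response = chat_fn(model=self.model_name, messages=message_list)
        return response["message"]["content"]


def fake_chat(model: str, messages: list) -> dict:
    """Stand-in for ollama.chat that echoes the last message."""
    return {"message": {"content": f"[{model}] echo: {messages[-1]['content']}"}}


sampler = OllamaSamplerSketch("deepseek-r1:1.5b", chat_fn=fake_chat)
print(sampler([sampler._pack_message("user", "ping")]))
# [deepseek-r1:1.5b] echo: ping
```

Injecting `chat_fn` keeps the wrapper testable without a running Ollama server. Omitting it falls back to `ollama.chat`, the same call used in the single test run.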