# Reasoning traces
- This notebook shows how to read the reasoning traces produced by the models
- o3-mini reasoning traces were collected manually from the web interface and are stored in the `o3-mini-excerpts/` directory
- DeepSeek R1 returns full raw reasoning traces
- Gemini 2.5 Flash and Pro return reasoning summaries generated by the API

In [2]:
import json
import pandas as pd
from tqdm import tqdm
from rdkit import RDLogger
from utils.answer_verifier import AnswerVerifier
from utils import parser

RDLogger.DisableLog('rdApp.*')
tqdm.pandas()

# Load model responses
df = pd.read_csv('model_responses/combined_model_responses.csv')

# Parse responses
answer_parser = parser.AnswerParser("questions/chemiq.jsonl", doClean=True)
df["parsed_answer"] = df.apply(lambda row: answer_parser.parse(row["uuid"], row["raw_model_answer"]), axis=1)

# Check answers
answer_checker = AnswerVerifier("questions/chemiq.jsonl")
df[["is_correct", "opsin_smiles"]] = df.progress_apply(
    lambda row: pd.Series(answer_checker.check_answer(row["uuid"], row["parsed_answer"])),
    axis=1,
)

# --- Adding columns for data analysis ---
# Load question data
all_questions = []
with open("questions/chemiq.jsonl", 'r') as f:
    for line in f:
        all_questions.append(json.loads(line))
question_dict = {q["uuid"]: q for q in all_questions}

# Add useful question columns to dataframe
df["question_category"] = df["uuid"].apply(lambda x: question_dict[x]["question_category"])
df["sub_category"] = df["uuid"].apply(lambda x: question_dict[x]["sub_category"])
df["expected_answer"] = df["uuid"].apply(lambda x: question_dict[x]["answer"])

100%|██████████████████████████████████████████████████████████████████| 13872/13872 [00:04<00:00, 3261.51it/s]
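
The last step of the cell above attaches question metadata to each response by looking up its `uuid`. As a minimal sketch of the same lookup on toy data (the two records, their category names, and the `toy_*` variables are hypothetical stand-ins, not the contents of `questions/chemiq.jsonl`), `Series.map` can do the join in one call per column:

```python
import io
import json
import pandas as pd

# Hypothetical two-record JSONL standing in for questions/chemiq.jsonl
toy_jsonl = io.StringIO(
    '{"uuid": "a1", "question_category": "counting", "answer": "3"}\n'
    '{"uuid": "b2", "question_category": "naming", "answer": "CCO"}\n'
)

# One JSON object per line, indexed by uuid
toy_questions = pd.DataFrame([json.loads(line) for line in toy_jsonl]).set_index("uuid")

# Toy responses: several rows can point at the same question
toy_df = pd.DataFrame({"uuid": ["b2", "a1", "a1"], "raw_model_answer": ["CCO", "3", "4"]})

# Series.map with a Series argument looks each uuid up in that Series' index
toy_df["question_category"] = toy_df["uuid"].map(toy_questions["question_category"])
toy_df["expected_answer"] = toy_df["uuid"].map(toy_questions["answer"])
print(toy_df)
```

One difference from the dictionary lookup in the notebook: an unknown `uuid` yields `NaN` here instead of raising a `KeyError`, so it may be worth checking for missing values afterwards.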


In [3]:
# Count the available reasoning traces per model and thinking budget
df[~df["reasoning_trace"].isna()].groupby(["model", "thinking_budget"]).size().reset_index(name='count')

|   | model | thinking_budget | count |
|---|---|---|---|
| 0 | DeepSeek-R1-0528 | 32000 | 816 |
| 1 | gemini-flash-2-5 | 1024 | 815 |
| 2 | gemini-flash-2-5 | 24576 | 783 |
| 3 | gemini-flash-2-5 | 8192 | 798 |
| 4 | gemini-pro-2-5 | 1024 | 816 |
| 5 | gemini-pro-2-5 | 128 | 816 |
| 6 | gemini-pro-2-5 | 2048 | 816 |
| 7 | gemini-pro-2-5 | 24576 | 802 |
| 8 | gemini-pro-2-5 | 32768 | 808 |
| 9 | gemini-pro-2-5 | 4096 | 816 |
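
Counts below the per-configuration row total point to responses that came back without a trace. A rough sketch of how the missing traces could be tallied next to the available ones, using a hypothetical toy frame (`toy`, `m1`, `m2`, and the `coverage` columns are illustrative names; only the `model`, `thinking_budget`, and `reasoning_trace` column names come from the notebook):

```python
import pandas as pd

# Hypothetical responses: m1 is missing one trace, m2 has none at all
toy = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m2", "m2"],
    "thinking_budget": ["1024", "1024", "1024", "low", "low"],
    "reasoning_trace": ["step 1 ...", None, "step 2 ...", None, None],
})

# Flag rows that carry a trace, then aggregate per configuration
coverage = (
    toy.assign(has_trace=toy["reasoning_trace"].notna().astype(int))
       .groupby(["model", "thinking_budget"])["has_trace"]
       .agg(traces="sum", rows="size")
       .reset_index()
)
coverage["missing"] = coverage["rows"] - coverage["traces"]
print(coverage)
```

Unlike filtering on `~isna()` before grouping, this keeps configurations that have no traces at all in the result, which makes gaps easier to spot.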


# Select subset

In [4]:
# The only models with reasoning traces in the dataframe are DeepSeek R1 and the Gemini models
# In some cases the Gemini models failed to provide a reasoning summary

model = 'DeepSeek-R1-0528'
#model = 'gemini-flash-2-5'
#model = 'gemini-pro-2-5'

thinking_budget = "32000"

subset = df[(df["model"]==model) & (df["thinking_budget"]== thinking_budget)]
print(f"{len(subset)} rows in dataframe subset")

816 rows in dataframe subset
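
The filter above passes `thinking_budget` as the string `"32000"`, and the lexicographic ordering in the count table suggests the column is stored as text. Assuming that is the case, a small hypothetical helper (`select_subset` is not part of the notebook's utilities, and the `o3-mini` / `"high"` row is made up for illustration) can normalize both sides to strings so that an integer budget does not silently produce an empty subset:

```python
import pandas as pd

def select_subset(frame, model, thinking_budget):
    """Return rows for one model and budget, comparing budgets as strings."""
    budget_matches = frame["thinking_budget"].astype(str) == str(thinking_budget)
    return frame[(frame["model"] == model) & budget_matches]

# Hypothetical frame: a non-numeric label forces the whole column to text
toy = pd.DataFrame({
    "model": ["DeepSeek-R1-0528", "o3-mini", "gemini-pro-2-5"],
    "thinking_budget": ["32000", "high", "1024"],
})

from_string = select_subset(toy, "DeepSeek-R1-0528", "32000")
from_int = select_subset(toy, "DeepSeek-R1-0528", 32000)
print(len(from_string), len(from_int))
```

Checking `df["thinking_budget"].dtype` on the real data would confirm whether the cast is needed.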


# Read trace

In [5]:
# Row to select
idx = 1

# Select row
row = subset.iloc[idx]

# load corresponding question data
question = question_dict[row["uuid"]]

print("="*30 + " PROMPT " + "="*30)
print(question["prompt"])

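print("="*30 + " REASONING " + "="*30)
trace = row["reasoning_trace"]
if isinstance(trace, str):
    print("[...]\n\n" + trace[-1000:])  # truncating output to the last 1000 characters
    #print(trace)  # uncomment to print the full trace
else:
    # Some rows have no trace (e.g. a missing Gemini reasoning summary)
    print("(no reasoning trace for this row)")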

print("="*30 + " MODEL ANSWER " + "="*30)
print(row["raw_model_answer"])

print("="*30 + " EXPECTED ANSWER " + "="*30)
print(question["answer"])

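============================== PROMPT ==============================
How many carbon atoms are in the molecule:

n1c(sc(S(N)=N)c1)N

Give your answer as an integer. Do not write any comments.
============================== REASONING ==============================
[...]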

S3 (single), C8 (double), S5 (single). Bond order sum: 1+2+1=4, no hydrogen.

Atom C8: bonded to C4 double, to N1 single. So neighbors: C4 (double), N1 (single). Bond order sum 3, so it needs a hydrogen, as we said earlier.

In this structure, atom C8 has only two neighbors: C4 and N1, with bond orders double and single, sum 3, so yes, it must have a hydrogen atom.

The other resonance structures might involve different bond orders, but overall, atom C8 has a lower bond order and requires a hydrogen.

So consistent.

Now, all carbons are accounted for: atom 2,4,8.

So three carbon atoms.

Therefore, the answer should be 3.

But let me see if there is any mistake.

In the SMILES, the end is "N", which is atom 9, bonded to atom 2, so -NH2, no carbon.

The group on atom 4 is "S(N)=N", which is S-NH2 and S=NH, so no carbon.

So yes.

Perhaps the "S(N)=N" could b