# Gemini 2.5 Flash - Token Usage Investigation

This notebook explores:
1. What token usage metadata the Gemini API returns
2. What parameters `GenerateContentConfig` accepts
3. Whether responses contain any reasoning/thinking-related fields
4. What token counts our current setup actually produces


In [5]:
import os
from google import genai
from google.genai import types
from datasets import load_dataset
from dotenv import load_dotenv
import json

load_dotenv()

# Initialize Gemini client
genai_client = genai.Client(
    api_key=os.environ.get('GEMINI_API_KEY'),
    http_options={'timeout': 120000}  # timeout is given in milliseconds (120 s)
)


In [6]:
# Load the MMLU-Pro test split and pick one question from it
dataset = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

# Use a fixed index for reproducibility
question_idx = 7269  # Same as in your recent logs
question_data = dataset[question_idx]

print("Question:")
print(question_data['question'])
print("\nOptions:")
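# Take the first 4 options only. MMLU-Pro questions can have more than 4 options,
# so the correct answer may fall outside this slice.
for i, option in enumerate(question_data['options'][:4]):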
    print(f"{chr(65+i)}) {option}")
print(f"\nCorrect Answer: {question_data['answer']}")


Question:
A white noise process will have

(i) A zero mean

(ii) A constant variance

(iii) Autocovariances that are constant

(iv) Autocovariances that are zero except at lag zero

Options:
A) (ii), (iii), and (iv) only
B) (i), (ii), (iii), and (iv)
C) (i) and (ii) only
D) (i), (ii), and (iii) only

Correct Answer: G
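
The output above lists options A to D while the correct answer is G, so the truncated prompt cannot be answered correctly. One possible workaround, sketched below, is to sample the distractors and always keep the correct option. This is not what the notebook does: `select_options` is a hypothetical helper, and it assumes the dataset row also carries a zero-based `answer_index` field alongside `answer`. Note that subsampling changes the question's difficulty compared with the full option list.

```python
import random


def select_options(options, answer_index, k=4, seed=0):
    """Pick k options that always include the correct one.

    Returns (subset, correct_letter), where correct_letter is the
    letter of the correct option within the returned subset.
    """
    if not 0 <= answer_index < len(options):
        raise ValueError("answer_index out of range")
    k = min(k, len(options))
    rng = random.Random(seed)  # seeded for reproducibility
    distractors = [i for i in range(len(options)) if i != answer_index]
    chosen = rng.sample(distractors, k - 1) + [answer_index]
    rng.shuffle(chosen)  # avoid always placing the answer last
    subset = [options[i] for i in chosen]
    correct_letter = chr(65 + chosen.index(answer_index))
    return subset, correct_letter


# Hypothetical usage with placeholder option texts (G is index 6)
demo_options = [f"option {i}" for i in range(10)]
subset, letter = select_options(demo_options, answer_index=6)
print(subset, letter)
```

With a real row, the call would look like `select_options(question_data['options'], question_data['answer_index'])`, provided that field exists.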


In [7]:
# Create a test prompt (similar to your QA prompts)
options = question_data['options'][:4]
options_text = "\n".join([f"Option {chr(65+i)}: {option}" for i, option in enumerate(options)])

prompt = f"""Answer the following question. You must choose between the options provided.

Question: {question_data['question']}

{options_text}

Provide your answer in the following format:
Answer: [A, B, C, or D]
Confidence: [percentage between 25-100]%
Reasoning: [brief explanation]"""

print(f"Prompt created (length: {len(prompt)} chars)")


Prompt created (length: 539 chars)
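
Since the prompt asks for a fixed `Answer / Confidence / Reasoning` layout, a small parser makes it easy to check whether a response actually followed it (the Test 2 output further down, for instance, opens with free-form prose instead). The following is a minimal sketch; `parse_answer` is a hypothetical helper, not part of the notebook or of any library.

```python
import re


def parse_answer(text):
    """Extract answer letter, confidence and reasoning from a formatted reply.

    Any field that cannot be found is returned as None.
    """
    answer = re.search(r"^Answer:\s*\[?([A-D])\]?\s*$", text, re.MULTILINE)
    confidence = re.search(r"^Confidence:\s*\[?(\d{1,3})\]?\s*%", text, re.MULTILINE)
    # DOTALL lets the reasoning span several lines up to the end of the text
    reasoning = re.search(r"^Reasoning:\s*(.*)", text, re.MULTILINE | re.DOTALL)
    return {
        "answer": answer.group(1) if answer else None,
        "confidence": int(confidence.group(1)) if confidence else None,
        "reasoning": reasoning.group(1).strip() if reasoning else None,
    }


formatted = "Answer: B\nConfidence: 100%\nReasoning: A white noise process is defined by..."
print(parse_answer(formatted))

free_form = "A white noise process, denoted as $W_t$, is defined by the following properties:"
print(parse_answer(free_form))
```

A result where every field is `None` signals that the model ignored the requested format, which matters when comparing token counts across configurations.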


## Test 1: No Reasoning (thinking_budget=0)


In [13]:
print("Running with thinking_budget=0 (no reasoning)...\n")

response_no_thinking = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.0,
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    )
)

print("Response Text:")
print(response_no_thinking.text)
print("\n" + "="*70)
print("Usage Metadata:")
print(f"  prompt_token_count: {response_no_thinking.usage_metadata.prompt_token_count}")
print(f"  candidates_token_count: {response_no_thinking.usage_metadata.candidates_token_count}")
print(f"  total_token_count: {response_no_thinking.usage_metadata.total_token_count}")
print("\nFull usage_metadata object:")
print(response_no_thinking.usage_metadata)


Running with thinking_budget=0 (no reasoning)...

Response Text:
Answer: B
Confidence: 100%
Reasoning: A white noise process is defined by four key properties:
(i) A zero mean: The expected value of the process at any time point is zero.
(ii) A constant variance: The variance of the process is the same at all time points.
(iii) Autocovariances that are zero except at lag zero: This is the defining characteristic of white noise. It means there is no linear correlation between observations at different time points.
(iv) Autocovariances that are constant: This is true because the autocovariance at lag zero is a constant (the variance), and the autocovariances at all other lags are also constant (zero). Therefore, all autocovariances are constant.

All four statements are true for a white noise process.

Usage Metadata:
  prompt_token_count: 168
  candidates_token_count: 177
  total_token_count: 345

Full usage_metadata object:
cache_tokens_details=None cached_content_token_count=None cand

## Test 2: Default (no thinking_budget specified)


In [14]:
print("Running with DEFAULT settings (no thinking_budget specified)...\n")

response_default = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.0
    )
)

print("Response Text:")
print(response_default.text)
print("\n" + "="*70)
print("Usage Metadata:")
print(f"  prompt_token_count: {response_default.usage_metadata.prompt_token_count}")
print(f"  candidates_token_count: {response_default.usage_metadata.candidates_token_count}")
print(f"  total_token_count: {response_default.usage_metadata.total_token_count}")
print("\nFull usage_metadata object:")
print(response_default.usage_metadata)


Running with DEFAULT settings (no thinking_budget specified)...

Response Text:
A white noise process, denoted as $W_t$, is defined by the following properties:

(i) **A zero mean**: $E[W_t] = 0$ for all $t$. This is a fundamental property of white noise. So, (i) is true.

(ii) **A constant variance**: $Var(W_t) = E[W_t^2] - (E[W_t])^2 = E[W_t^2] = \sigma^2$ for some positive constant $\sigma^2$. This is also a fundamental property. So, (ii) is true.

(iii) **Autocovariances that are constant**: The autocovariance function is defined as $\gamma_k = Cov(W_t, W_{t-k})$. For a process to be weakly stationary (which white noise is), its autocovariance function must depend only on the lag $k$, and not on the specific time $t$. This means $Cov(W_t, W_{t-k})$ is constant with respect to $t$. In the context of time series, "constant autocovariances" often refers to this time-invariance property. If interpreted this way, then (iii) is true. (If it meant $\gamma_k$ is the same value for all $k$,

## Test 3: High Reasoning (thinking_budget=8192)


In [None]:
print("Running with thinking_budget=8192 (high reasoning)...\n")

response_high_thinking = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.0,
        thinking_config=types.ThinkingConfig(thinking_budget=8192)
    )
)

print("Response Text:")
print(response_high_thinking.text)
print("\n" + "="*70)
print("Usage Metadata:")
print(f"  prompt_token_count: {response_high_thinking.usage_metadata.prompt_token_count}")
print(f"  candidates_token_count: {response_high_thinking.usage_metadata.candidates_token_count}")
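print(f"  total_token_count: {response_high_thinking.usage_metadata.total_token_count}")
# Thinking tokens, if this SDK version reports them separately
# (getattr avoids an AttributeError when the field is absent)
print(f"  thoughts_token_count: {getattr(response_high_thinking.usage_metadata, 'thoughts_token_count', None)}")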
print("\nFull usage_metadata object:")
print(response_high_thinking.usage_metadata)


## Comparison Summary


In [None]:
import pandas as pd

comparison = pd.DataFrame([
    {
        'Configuration': 'No Reasoning (budget=0)',
        'Prompt Tokens': response_no_thinking.usage_metadata.prompt_token_count,
        'Candidates Tokens': response_no_thinking.usage_metadata.candidates_token_count,
        'Total Tokens': response_no_thinking.usage_metadata.total_token_count,
        'Response Length (chars)': len(response_no_thinking.text)
    },
    {
        'Configuration': 'Default (no budget)',
        'Prompt Tokens': response_default.usage_metadata.prompt_token_count,
        'Candidates Tokens': response_default.usage_metadata.candidates_token_count,
        'Total Tokens': response_default.usage_metadata.total_token_count,
        'Response Length (chars)': len(response_default.text)
    },
    {
        'Configuration': 'High Reasoning (budget=8192)',
        'Prompt Tokens': response_high_thinking.usage_metadata.prompt_token_count,
        'Candidates Tokens': response_high_thinking.usage_metadata.candidates_token_count,
        'Total Tokens': response_high_thinking.usage_metadata.total_token_count,
        'Response Length (chars)': len(response_high_thinking.text)
    }
])

print("\n" + "="*70)
print("TOKEN USAGE COMPARISON")
print("="*70)
print(comparison.to_string(index=False))

# Calculate deltas
print("\n" + "="*70)
print("DELTAS (compared to No Reasoning)")
print("="*70)
baseline = response_no_thinking.usage_metadata.total_token_count
print(f"Default vs No Reasoning: {response_default.usage_metadata.total_token_count - baseline:+d} tokens")
print(f"High Reasoning vs No Reasoning: {response_high_thinking.usage_metadata.total_token_count - baseline:+d} tokens")


## Investigate Response Object Structure


In [None]:
# Let's inspect the full response object to see if there are any hidden fields
print("Inspecting response object structure...\n")
print("Available attributes on response:")
print([attr for attr in dir(response_high_thinking) if not attr.startswith('_')])

print("\n" + "="*70)
print("Available attributes on usage_metadata:")
print([attr for attr in dir(response_high_thinking.usage_metadata) if not attr.startswith('_')])


In [None]:
# Check if there's any difference in candidates structure
print("Candidates structure (high reasoning):")
print(f"Number of candidates: {len(response_high_thinking.candidates)}")
if response_high_thinking.candidates:
    print("\nFirst candidate attributes:")
    print([attr for attr in dir(response_high_thinking.candidates[0]) if not attr.startswith('_')])
    print("\nFirst candidate content:")
    print(response_high_thinking.candidates[0])
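
Beyond listing attributes, it can help to look at the individual content parts of each candidate. The sketch below separates parts flagged as thoughts from ordinary answer parts. It rests on two assumptions about the installed SDK version that should be verified: that a part exposes a boolean `thought` attribute, and that thought summaries only appear when the request sets `include_thoughts=True` in `ThinkingConfig`. `split_parts` is a hypothetical helper, demonstrated here on a stand-in object rather than a real response.

```python
from types import SimpleNamespace


def split_parts(response):
    """Return (thought_texts, answer_texts) from a response-like object."""
    thoughts, answers = [], []
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            text = getattr(part, "text", None)
            if not text:
                continue  # skip non-text parts
            # `thought` is assumed to be True only for thought-summary parts
            if getattr(part, "thought", False):
                thoughts.append(text)
            else:
                answers.append(text)
    return thoughts, answers


# Stand-in object mimicking the assumed response shape (not real API output)
fake_response = SimpleNamespace(candidates=[SimpleNamespace(content=SimpleNamespace(parts=[
    SimpleNamespace(text="Considering each statement in turn...", thought=True),
    SimpleNamespace(text="Answer: C", thought=None),
]))])
print(split_parts(fake_response))
```

If the first list stays empty for a real response, that means only that no thought parts were returned, not that no thinking tokens were consumed.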


## Conclusions

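Based on the results above:

1. **Does default use reasoning tokens?**
   - Comparing Default `total_token_count` with No Reasoning `total_token_count` is only a rough signal: in Test 2 the default response ignored the requested format and produced a much longer visible answer, which raises the total on its own
   - A more direct check is the remainder `total_token_count - prompt_token_count - candidates_token_count`; in Test 1 it is 345 - 168 - 177 = 0, as expected with thinking disabled
   - If the Default remainder is ≈ 0 → Default does NOT use reasoning
   - If the Default remainder is clearly > 0 → Default DOES use reasoning

2. **Are reasoning tokens included in total_token_count?**
   - If High Reasoning `total_token_count` exceeds prompt + candidates tokens by a significant amount → YES, included
   - That remainder tells us approximately how many reasoning tokens were used, provided no other token category (cached or tool-use tokens, for example) contributes to the total

3. **Can we see reasoning token breakdown?**
   - Check if `usage_metadata` has any additional fields beyond the three printed above (the truncated dump in Test 1 already shows `cache_tokens_details` and `cached_content_token_count`); look in particular for a thinking-specific count such as `thoughts_token_count`
   - Check if candidates contain any reasoning-related fields
   - If neither is found → reasoning tokens are NOT separately reported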
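
The checks above can be condensed into one helper that works on any `usage_metadata`-like object. This is a minimal sketch: `token_breakdown` is a hypothetical name, and `thoughts_token_count` is read with `getattr` because its presence depends on the SDK version. The demo uses a stand-in object filled with the Test 1 counts, plus a second stand-in whose numbers are invented purely for illustration.

```python
from types import SimpleNamespace


def token_breakdown(usage):
    """Summarise token counts and the part not explained by prompt + candidates."""
    prompt = getattr(usage, "prompt_token_count", None) or 0
    candidates = getattr(usage, "candidates_token_count", None) or 0
    total = getattr(usage, "total_token_count", None) or 0
    return {
        "prompt": prompt,
        "candidates": candidates,
        "total": total,
        # None when the field is missing or not populated
        "reported_thoughts": getattr(usage, "thoughts_token_count", None),
        # Tokens in the total that prompt and candidates do not account for
        "unaccounted": total - prompt - candidates,
    }


# Counts taken from the Test 1 output (thinking_budget=0)
test1_usage = SimpleNamespace(
    prompt_token_count=168, candidates_token_count=177, total_token_count=345
)
print(token_breakdown(test1_usage))  # unaccounted is 0 for these counts

# Invented numbers, only to show a non-zero remainder
hypothetical_usage = SimpleNamespace(
    prompt_token_count=10, candidates_token_count=20,
    thoughts_token_count=100, total_token_count=130,
)
print(token_breakdown(hypothetical_usage))
```

Calling `token_breakdown(response_default.usage_metadata)` and the same for the high-reasoning response would answer questions 1 and 2 without relying on cross-run deltas.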
