In [1]:
# %%
from magistral_benchmark import MagistralBenchmarkConfig, MagistralReasoningBenchmark


In [2]:

# %%
config = MagistralBenchmarkConfig(
    model_name="mistralai/Magistral-Small-2506",
    tokenizer_name="mistralai/Mistral-Nemo-Instruct-2407",
    batch_size=6,               # Slightly smaller for reasoning mode
    min_vram_gb=8,              # Pre-flight minimum only; the 4-bit run below allocates about 13 GB
    test_file="./data/test.jsonl",
    max_new_tokens=500,         # Extra tokens for reasoning
    max_eval_samples=1000,      # Subset for faster testing
    # Italian system prompt: "You are a helpful and intelligent assistant."
    system_message="Sei un assistente utile e intelligente.",
    output_prefix="magistral_small_reasoning",
    # Quantization settings
    use_quantization=True,
    quantization_type="nf4",
    quantization_compute_dtype="float16",
    # Optimizations
    use_flash_attention=True,
    use_torch_compile=True,
)
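The quantization fields above are consumed inside `MagistralBenchmarkConfig`, whose internals are not shown in this notebook. The following is a minimal sketch of how such fields are commonly translated into keyword arguments for `transformers.BitsAndBytesConfig`, assuming the benchmark loads the model through Hugging Face transformers with bitsandbytes. The helper name `bnb_kwargs` is hypothetical and not part of `magistral_benchmark`.

```python
def bnb_kwargs(quantization_type: str = "nf4", compute_dtype: str = "float16") -> dict:
    """Map the config's quantization fields to BitsAndBytesConfig keyword arguments.

    Hypothetical helper: the real mapping inside magistral_benchmark is not shown.
    """
    if quantization_type not in {"nf4", "fp4"}:
        raise ValueError(f"Unsupported 4-bit quantization type: {quantization_type!r}")
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": quantization_type,
        "bnb_4bit_compute_dtype": compute_dtype,
    }


kwargs = bnb_kwargs("nf4", "float16")
print(kwargs)

# With transformers and bitsandbytes installed, the kwargs could be used like this
# (left as comments so the sketch runs without a GPU):
#   from transformers import AutoModelForCausalLM, BitsAndBytesConfig
#   quant_config = BitsAndBytesConfig(**kwargs)
#   model = AutoModelForCausalLM.from_pretrained(
#       "mistralai/Magistral-Small-2506",
#       quantization_config=quant_config,
#       device_map="auto",
#   )
```

Keeping the mapping in a plain function makes it easy to check the settings without downloading the model.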


In [3]:

# %%
config.print_config()



MAGISTRAL BENCHMARK CONFIGURATION
✓ Model: mistralai/Magistral-Small-2506
✓ Tokenizer: mistralai/Mistral-Nemo-Instruct-2407
✓ Test file: ./data/test.jsonl
✓ Max samples: 1000
✓ Batch size: 6
✓ Max new tokens: 500
✓ Min VRAM required: 8GB
✓ Quantization: nf4 (float16)
✓ Flash Attention 2: Enabled
✓ torch.compile: Enabled
✓ Output prefix: magistral_small_reasoning
✓ System message: Sei un assistente utile e intelligente.


In [4]:

# %%
benchmark = MagistralReasoningBenchmark(config)


✅ CUDA optimizations enabled (TF32, cuDNN benchmark)
✅ CUDA available: True
GPU: NVIDIA GeForce RTX 3090
VRAM: 23.6 GB
✅ GPU has sufficient VRAM for the model
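The check reported above can be approximated as follows. This is a sketch under assumptions: the benchmark's own check is not shown, and the 23.6 GB figure for a nominally 24 GB card suggests, but does not confirm, that VRAM is reported in GiB. The function names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def vram_gib(total_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return total_bytes / GIB


def has_sufficient_vram(total_bytes: int, min_vram_gb: float) -> bool:
    """Return True when the device memory meets the configured minimum."""
    return vram_gib(total_bytes) >= min_vram_gb


# On a CUDA machine the byte count would typically come from:
#   import torch
#   total_bytes = torch.cuda.get_device_properties(0).total_memory
example_total_bytes = 24 * GIB  # illustrative value, not read from a real device

print(f"VRAM: {vram_gib(example_total_bytes):.1f} GiB")
print("Sufficient:", has_sufficient_vram(example_total_bytes, min_vram_gb=8))
```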


In [5]:

# %%
results, accuracy, category_stats = benchmark.run_benchmark()



MAGISTRAL REASONING BENCHMARK
🤔 THINKING MODE ENABLED

MAGISTRAL BENCHMARK CONFIGURATION
✓ Model: mistralai/Magistral-Small-2506
✓ Tokenizer: mistralai/Mistral-Nemo-Instruct-2407
✓ Test file: ./data/test.jsonl
✓ Max samples: 1000
✓ Batch size: 6
✓ Max new tokens: 500
✓ Min VRAM required: 8GB
✓ Quantization: nf4 (float16)
✓ Flash Attention 2: Enabled
✓ torch.compile: Enabled
✓ Output prefix: magistral_small_reasoning
✓ System message: Sei un assistente utile e intelligente.
✓ Reasoning: ENABLED
✓ Aggressive prompting: YES (max 3 sentences)

📚 Loading ITALIC dataset...
Loaded 10000 questions

Original dataset category distribution:
  art_history: 980 questions
  civic_education: 973 questions
  current_events: 92 questions
  geography: 979 questions
  history: 978 questions
  lexicon: 979 questions
  literature: 984 questions
  morphology: 140 questions
  orthography: 971 questions
  synonyms_and_antonyms: 971 questions
  syntax: 973 questions
  tourism

tokenizer_config.json:   0%|          | 0.00/181k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.26M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

⚠️ Flash Attention 2 not available - using standard attention
Loading model with optimizations...


config.json:   0%|          | 0.00/619 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

model-00008-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00005-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00001-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00007-of-00010.safetensors:   0%|          | 0.00/4.89G [00:00<?, ?B/s]

model-00003-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00006-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00002-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00004-of-00010.safetensors:   0%|          | 0.00/4.89G [00:00<?, ?B/s]

model-00009-of-00010.safetensors:   0%|          | 0.00/4.78G [00:00<?, ?B/s]

model-00010-of-00010.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

Compiling model with torch.compile for optimization...


The following generation flags are not valid and may be ignored: ['early_stopping']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


✅ Model compiled successfully!
✅ Model loaded with REASONING support!
✅ Sampling params: temp=0.1, top_p=0.95, top_k=20 (reasoning mode)
GPU Memory - Allocated: 13.18 GB, Available: 9.12 GB
🚀 Optimal batch size for reasoning mode: 6

🧪 Testing inference with REASONING enabled...
Question: Chi è il regista del film "La strada"?...
Test response length: 340 chars
Expected answer: 'C'
Extracted answer: 'C'
Correct: True

STARTING REASONING EVALUATION

🤔 Evaluating mistralai/Magistral-Small-2506 WITH REASONING on 1000 questions...
⚡ Using aggressive prompting to minimize thinking time


Evaluating (with reasoning):   0%|          | 0/167 [00:00<?, ?it/s]
Evaluating (with reasoning):   1%|          | 1/167 [00:20<55:36, 20.10s/it]
Evaluating (with reasoning):   1%|          | 2/167 [00:36<49:05, 17.85s/it]
Evaluating (with reasoning):   2%|▏         | 3/167 [01:15<1:14:59, 27.44s/it]
Evaluating (with reasoning):   2%|▏         | 4/167 [01:31<1:02:37, 23.05s/it]


📊 FINAL RESULTS (WITH REASONING):
Total questions: 1000
Correct answers: 796
Accuracy: 0.7960 (79.60%)
Average reasoning length: 245 characters

📈 RESULTS BY CATEGORY (REASONING MODE):
------------------------------------------------------------
Category                      Accuracy  Correct    Total
------------------------------------------------------------
art_history                     76.19%       64       84
civic_education                 80.95%       68       84
current_events                  86.90%       73       84
geography                       85.71%       72       84
history                         90.36%       75       83
lexicon                         89.16%       74       83
literature                      78.31%       65       83
morphology                      57.83%       48       83
orthography                     74.70%       62       83
synonyms_and_antonyms           93.98%       78       83
syntax                          68.67%       57       83
to
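The load log above lists ten safetensors shards and then reports 13.18 GB allocated after 4-bit loading. A back-of-the-envelope estimate ties the two together. It is a sketch under two stated assumptions: the checkpoint stores weights in a 16-bit format (2 bytes per parameter), and every weight is quantized to 4 bits (0.5 bytes per parameter). In practice some layers usually stay in higher precision and quantization adds metadata, so the result is best read as a lower bound.

```python
# Shard sizes in GB, copied from the download progress lines above.
shard_sizes_gb = [4.78] * 7 + [4.89] * 2 + [3.90]

BYTES_PER_PARAM_16BIT = 2.0  # assumption: checkpoint stored in a 16-bit dtype
BYTES_PER_PARAM_4BIT = 0.5   # assumption: all weights quantized to 4 bits

checkpoint_gb = sum(shard_sizes_gb)
params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT
weights_4bit_gb = params_billion * BYTES_PER_PARAM_4BIT

print(f"Checkpoint size:        {checkpoint_gb:.2f} GB")
print(f"Estimated parameters:   {params_billion:.1f} B")
print(f"Estimated 4-bit weights: {weights_4bit_gb:.1f} GB (lower bound)")
```

The estimate lands a little below the 13.18 GB reported in the log, which is consistent with unquantized layers and runtime buffers taking up the remainder. It also shows why the configured 8 GB minimum would not be enough on its own.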

In [6]:

# %%
print(f"\nReasoning benchmark completed with {accuracy:.4f} accuracy")



Reasoning benchmark completed with 0.7960 accuracy
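The overall figure is a micro-average: total correct answers divided by total questions. Because the per-category sample sizes are nearly equal (83 or 84), the macro-average (the mean of per-category accuracies) should be very close to it. A minimal sketch that recomputes both from the per-category counts printed by this notebook; the dictionary name is illustrative:

```python
# (correct, total) per category, copied from the results table in this notebook.
category_results = {
    "art_history": (64, 84),
    "civic_education": (68, 84),
    "current_events": (73, 84),
    "geography": (72, 84),
    "history": (75, 83),
    "lexicon": (74, 83),
    "literature": (65, 83),
    "morphology": (48, 83),
    "orthography": (62, 83),
    "synonyms_and_antonyms": (78, 83),
    "syntax": (57, 83),
    "tourism": (60, 83),
}

total_correct = sum(correct for correct, _ in category_results.values())
total_questions = sum(total for _, total in category_results.values())

# Micro-average: every question counts equally.
micro_accuracy = total_correct / total_questions
# Macro-average: every category counts equally.
macro_accuracy = sum(c / t for c, t in category_results.values()) / len(category_results)

print(f"Micro accuracy: {micro_accuracy:.4f}")
print(f"Macro accuracy: {macro_accuracy:.4f}")
```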


In [7]:

# %%
# Check reasoning usage
valid_reasoning = [r for r in results if r['reasoning_length'] > 0]
avg_reasoning_length = (
    sum(r['reasoning_length'] for r in valid_reasoning) / len(valid_reasoning)
    if valid_reasoning else 0
)
# Guard against an empty results list to avoid ZeroDivisionError
reasoning_share = len(valid_reasoning) / len(results) * 100 if results else 0.0
print(f"Questions with reasoning: {len(valid_reasoning)}/{len(results)} ({reasoning_share:.1f}%)")
print(f"Average reasoning length: {avg_reasoning_length:.0f} characters")


Questions with reasoning: 1000/1000 (100.0%)
Average reasoning length: 245 characters
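A mean alone can hide a skewed distribution of reasoning lengths. The sketch below adds the median and the extremes using only the standard library. It assumes each result is a dict with a `reasoning_length` key, as in the cell above; the function name `summarize_reasoning` is hypothetical, and the three sample records reuse the lengths shown in the sample predictions of this notebook rather than the full result set.

```python
import statistics


def summarize_reasoning(results):
    """Summarize reasoning lengths, ignoring entries without reasoning."""
    lengths = [r["reasoning_length"] for r in results if r["reasoning_length"] > 0]
    if not lengths:
        return {"count": 0, "mean": 0.0, "median": 0.0, "min": 0, "max": 0}
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Illustrative records only; in the notebook you would pass `results` directly.
sample_results = [
    {"reasoning_length": 340},
    {"reasoning_length": 169},
    {"reasoning_length": 218},
]
summary = summarize_reasoning(sample_results)
print(summary)
```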


In [8]:

# %%
print(f"Results saved with prefix: {config.output_prefix}")


Results saved with prefix: magistral_small_reasoning


In [9]:

# %%
benchmark.analyse_results_by_category(category_stats)



📈 RESULTS BY CATEGORY (REASONING MODE):
------------------------------------------------------------
Category                      Accuracy  Correct    Total
------------------------------------------------------------
art_history                     76.19%       64       84
civic_education                 80.95%       68       84
current_events                  86.90%       73       84
geography                       85.71%       72       84
history                         90.36%       75       83
lexicon                         89.16%       74       83
literature                      78.31%       65       83
morphology                      57.83%       48       83
orthography                     74.70%       62       83
synonyms_and_antonyms           93.98%       78       83
syntax                          68.67%       57       83
tourism                         72.29%       60       83
------------------------------------------------------------
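The totals in this table (84 for the first four categories, 83 for the rest) and the 167 batches in the progress bar are consistent with splitting 1000 samples evenly across 12 categories and batching them in groups of 6. The benchmark's actual sampling code is not shown, so the following is only a sketch of one allocation scheme that reproduces those numbers; `balanced_allocation` is a hypothetical name.

```python
import math


def balanced_allocation(categories, max_samples):
    """Split max_samples as evenly as possible; earlier categories get the remainder."""
    ordered = sorted(categories)
    base, extra = divmod(max_samples, len(ordered))
    return {cat: base + (1 if i < extra else 0) for i, cat in enumerate(ordered)}


categories = [
    "art_history", "civic_education", "current_events", "geography",
    "history", "lexicon", "literature", "morphology",
    "orthography", "synonyms_and_antonyms", "syntax", "tourism",
]

allocation = balanced_allocation(categories, max_samples=1000)
num_batches = math.ceil(1000 / 6)  # batch_size=6 from the config

for category, count in allocation.items():
    print(f"{category:<24}{count:>4}")
print("Batches:", num_batches)
```

Balanced sampling keeps small categories such as `current_events` (92 questions in the full dataset) from being underrepresented in the 1000-question subset.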


In [10]:

# %%
benchmark.show_sample_predictions(results, 10)


🔍 SAMPLE PREDICTIONS (WITH REASONING):

Example 1 ✅ CORRECT:
  Category: current_events
  Question: Chi è il regista del film "La strada"?...
  Expected: C
  Predicted: C
  Reasoning length: 340 chars
  Response excerpt: 'Il film "La strada" è un capolavoro del cinema italiano, diretto da un grande regista del neorealismo. Tra le opzioni, Vittorio De Sica è noto per fil...'

Example 2 ✅ CORRECT:
  Category: lexicon
  Question: Uno tra i seguenti può essere un contrario di “imponenza”. Quale?...
  Expected: B
  Predicted: B
  Reasoning length: 169 chars
  Response excerpt: '"Imponenza" significa grandiosità o autorità. Tra le opzioni, "miseria" (B) è l'unica che si oppone a ricchezza o grandezza, quindi è il contrario più...'

Example 3 ✅ CORRECT:
  Category: art_history
  Question: IL PRIMO CONTATTO RADIO TRANSATLANTICO FU STABILITO DA GUGLIELMO MARCONI NEL:...
  Expected: A
  Predicted: A
  Reasoning length: 218 chars
  Response excerpt: 'Marconi è noto per i s