# NanoChat on a Single GPU in Google Colab

This notebook adapts the original `karpathy/nanochat` repository, which is designed for a multi-GPU setup (8xH100), to run on a single GPU, such as the T4, V100, or A100 available in Google Colab.

The key modifications are:
1.  **Removal of Distributed Training**: We replace `torchrun` with a standard `python -m` execution, which the nanochat codebase supports out-of-the-box.
2.  **Reduced Batch Size**: The `device_batch_size` is significantly lowered to prevent Out-Of-Memory (OOM) errors on GPUs with less VRAM. The code's built-in gradient accumulation will automatically compensate for this, ensuring similar training results at the cost of longer training time.
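
The gradient accumulation mentioned in item 2 can be sketched as simple arithmetic. This is a minimal illustration, not code from the repository: the function name `grad_accum_steps` is hypothetical, and the values 524,288 (total batch size in tokens) and 2048 (sequence length) are taken from the training logs shown later in this notebook.

```python
def grad_accum_steps(total_batch_tokens: int,
                     device_batch_size: int,
                     seq_len: int = 2048,
                     world_size: int = 1) -> int:
    """Number of micro-batches accumulated before one optimizer step."""
    # Tokens processed by all GPUs in a single forward/backward pass
    tokens_per_micro_batch = device_batch_size * seq_len * world_size
    if total_batch_tokens % tokens_per_micro_batch != 0:
        raise ValueError("total batch size must be a multiple of the micro-batch size")
    return total_batch_tokens // tokens_per_micro_batch


# Single GPU, a few candidate device batch sizes
for dbs in (32, 4, 1):
    print(f"device_batch_size={dbs:>2} -> {grad_accum_steps(524_288, dbs)} accumulation steps")

# The original 8-GPU setup with device_batch_size=32 needs no accumulation
print("8 GPUs:", grad_accum_steps(524_288, 32, world_size=8))
```

With one GPU and `device_batch_size=32`, eight micro-batches are accumulated per optimizer step; at `device_batch_size=4` that grows to 64, which is where the longer training time comes from.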

This notebook includes all the training and evaluation steps from the original `speedrun.sh` script for a complete pipeline.

## 1. Environment Setup

First, we'll clone the repository and set up the environment. This involves installing the `uv` package manager, Rust/Cargo for the tokenizer, and all the Python dependencies.

In [1]:
# Clone the repository
!git clone https://github.com/karpathy/nanochat.git
%cd nanochat

# Set the base directory for artifacts
import os
os.environ['NANOCHAT_BASE_DIR'] = '/content/nanochat_data'
!mkdir -p $NANOCHAT_BASE_DIR

# Install uv package manager
!curl -LsSf https://astral.sh/uv/install.sh | sh
# Add uv and cargo to the PATH
os.environ['PATH'] = f"/root/.cargo/bin:/root/.local/bin:{os.environ['PATH']}"

# Create a virtual environment and install dependencies
!uv venv
!uv sync

# Install Rust/Cargo for the tokenizer
!curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
print('✨ Environment setup complete.')

Cloning into 'nanochat'...
remote: Enumerating objects: 80, done.

## 2. Initialize Report & Download Data

We'll reset the report directory to start fresh and download the necessary datasets for training and evaluation. For the default `d20` model, this includes 240 shards of pre-training data (~24GB) and the `eval_bundle`.
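
Because the download is large, it can help to check free disk space first. The sketch below is an optional pre-flight check, not part of nanochat: the helper name `shards_fit` and the 20% safety margin are assumptions, and the ~100 MB per shard figure is simply 24 GB divided by 240 shards.

```python
import shutil


def shards_fit(free_bytes: int, num_shards: int,
               shard_mb: float = 100.0, margin: float = 1.2) -> bool:
    """Return True if num_shards (plus a safety margin) fit in free_bytes."""
    # shard_mb is an estimate derived from "240 shards ~ 24GB"
    needed_bytes = num_shards * shard_mb * 1e6 * margin
    return free_bytes >= needed_bytes


# Check the filesystem that holds the current working directory
free = shutil.disk_usage(".").free
print(f"Free space: {free / 1e9:.1f} GB")
print("Enough room for 240 shards:", shards_fit(free, 240))
```

If the check fails, passing a smaller value to `-n` in the download command reduces the footprint, at the cost of less pre-training data.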

In [2]:
# Reset the report directory
!bash -c "source .venv/bin/activate && python -m nanochat.report reset"

# Download the eval_bundle for CORE metric evaluation
!if [ ! -d "$NANOCHAT_BASE_DIR/eval_bundle" ]; then \
    curl -L -o eval_bundle.zip https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip && \
    unzip -q eval_bundle.zip && \
    rm eval_bundle.zip && \
    mv eval_bundle $NANOCHAT_BASE_DIR; \
fi

# Download the pre-training data shards
# For the d20 model, 240 shards are recommended (~24GB)
!bash -c "source .venv/bin/activate && python -m nanochat.dataset -n 240"

print('✨ Data download and report reset complete.')

Reset report and wrote header to /content/nanochat_data/report/header.md
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.8M  100 24.8M    0     0  7933k      0  0:00:03  0:00:03 --:--:-- 7935k
Downloading 240 shards using 4 workers...
Target directory: /content/nanochat_data/base_data

Downloading shard_00000.parquet...
Downloading shard_00015.parquet...
Downloading shard_00045.parquet...
Downloading shard_00030.parquet...
Successfully downloaded shard_00045.parquet
Downloading shard_00046.parquet...
Successfully downloaded shard_00015.parquet
Do

## 3. Train & Evaluate Tokenizer

With the initial data downloaded, we can build the Rust-based BPE tokenizer, train it on about 2 billion characters of text, and evaluate its performance.
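
The evaluation reports a compression ratio (bytes per token) for several text types and a relative difference against the GPT-2 and GPT-4 tokenizers. The sketch below shows that arithmetic as inferred from the report table later in this notebook; it is not taken from `scripts.tok_eval`, and the function names are hypothetical.

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Average number of bytes represented by one token (higher is better)."""
    return num_bytes / num_tokens


def relative_diff_pct(baseline_tokens: int, our_tokens: int) -> float:
    """Positive when our tokenizer needs fewer tokens than the baseline."""
    return (baseline_tokens - our_tokens) / baseline_tokens * 100


# "news" row of the GPT-2 comparison in the report
news_bytes, gpt2_tokens, our_tokens = 1819, 404, 375
print(f"GPT-2 ratio:   {compression_ratio(news_bytes, gpt2_tokens):.2f}")
print(f"Ours ratio:    {compression_ratio(news_bytes, our_tokens):.2f}")
print(f"Relative diff: {relative_diff_pct(gpt2_tokens, our_tokens):+.1f}%")
```

These formulas reproduce the reported values for the rows checked (for example 4.85 bytes per token and +7.2% on the news sample).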

In [3]:
# Build the rustbpe Tokenizer
!bash -c "source .venv/bin/activate && uv run maturin develop --release --manifest-path rustbpe/Cargo.toml"

# Train the tokenizer
!bash -c "source .venv/bin/activate && python -m scripts.tok_train --max_chars=2000000000"

# Evaluate the tokenizer
!bash -c "source .venv/bin/activate && python -m scripts.tok_eval"

print('✨ Tokenizer training and evaluation complete.')

    Updating crates.io index
  Downloaded crossbeam-epoch v0.9.18
  Downloaded wit-bindgen v0.45.1
  Downloaded version_check v0.9.5
  Downloaded unindent v0.2.4
  Downloaded itoa v1.0.15
  Downloaded quote v1.0.40

## 4. Base Model Pre-training

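This is the most time-consuming step. The default configuration trains the 561M-parameter `d20` model from scratch; the run recorded below uses a much smaller `--depth=4` model (about 36.7M parameters, per the final report) to exercise the whole pipeline quickly.

**Key changes for Colab:**
- We run `base_train.py` directly with `python -m` instead of `torchrun`.
- We set `--device_batch_size=32`, which fit on the A100-80GB used for this run. **If you encounter an OOM error, reduce this value (for example to `4`, `2`, or `1`).**
- We set `--depth=4` for a quick run that tests the pipeline. The default is `--depth=20`.
- For a more capable model that is still smaller than the default, set `--depth=12`. With that setting, the complete pipeline is estimated to take about 12-20 hours on an A100-80GB GPU.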
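
To get a feel for how `--depth` drives model size and training length, the sketch below estimates parameter counts and iteration counts. It rests on assumptions inferred from the logs and report in this notebook rather than read from the code: a model width of 64 per layer (depth 4 gives `model_dim: 256`), untied input and output embeddings, about 12·d² weights per transformer block, a 20:1 tokens-to-parameters ratio, and a total batch of 524,288 tokens. The function names are hypothetical.

```python
def estimate_model(depth: int, vocab_size: int = 65_536, dim_per_layer: int = 64):
    """Return (model_dim, approximate parameter count) for a given depth."""
    model_dim = depth * dim_per_layer
    embedding_params = 2 * vocab_size * model_dim   # token embedding + output head
    block_params = 12 * depth * model_dim ** 2      # attention (4 d^2) + MLP (8 d^2) per layer
    return model_dim, embedding_params + block_params


def training_plan(num_params: int, tokens_per_param: int = 20,
                  total_batch_tokens: int = 524_288):
    """Return (training tokens, optimizer steps) for a tokens:params target."""
    tokens = tokens_per_param * num_params
    return tokens, tokens // total_batch_tokens


for depth in (4, 12, 20):
    dim, params = estimate_model(depth)
    tokens, steps = training_plan(params)
    print(f"d{depth}: dim={dim}, params={params:,}, tokens={tokens:,}, steps={steps:,}")
```

For `--depth=4` this reproduces the figures in the final report (36,700,160 parameters, 734,003,200 training tokens, 1400 iterations), and for `--depth=20` it gives roughly 561M parameters.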

In [6]:
# Lower device_batch_size (e.g. to 4, 2, or 1) if you get OOM errors
!bash -c "source .venv/bin/activate && python -m scripts.base_train --depth=4 --device_batch_size=32"
# Replace --depth=4 with --depth=12 for a more capable model that is still smaller than the default d20


                                                   █████                 █████
                                                  ░░███                 ░░███
 ████████    ██████   ████████    ██████   ██████  ░███████    ██████   ███████
░░███░░███  ░░░░░███ ░░███░░███  ███░░███ ███░░███ ░███░░███  ░░░░░███ ░░░███░
 ░███ ░███   ███████  ░███ ░███ ░███ ░███░███ ░░░  ░███ ░███   ███████   ░███
 ░███ ░███  ███░░███  ░███ ░███ ░███ ░███░███  ███ ░███ ░███  ███░░███   ░███ ███
 ████ █████░░████████ ████ █████░░██████ ░░██████  ████ █████░░████████  ░░█████
░░░░ ░░░░░  ░░░░░░░░ ░░░░ ░░░░░  ░░░░░░   ░░░░░░  ░░░░ ░░░░░  ░░░░░░░░    ░░░░░

Overriding: depth = 4
Overriding: device_batch_size = 32
2025-10-15 05:26:59,999 - nanochat.common - INFO - Distributed world size: 1
Vocab size: 65,536
num_layers: 4
model_dim: 256
num_heads: 2
num_kv_heads: 2
Tokens / micro-batch / rank: 32 x 2048 = 65,536
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient acc

## 5. Evaluate Base Model

After pre-training, we evaluate the base model's loss and its performance on the CORE benchmark.
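
Two quantities appear in this evaluation: bits per byte (bpb) for the loss, and a centered accuracy for the CORE tasks. Below is a minimal sketch of both, assuming the usual definitions (bpb as summed cross-entropy in nats divided by ln 2 times the byte count, and centered accuracy as accuracy rescaled so that random guessing scores 0). The function names are hypothetical and the repository's own evaluation code may differ in detail.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) into bits per byte of text."""
    return total_loss_nats / (math.log(2) * total_bytes)


def centered_accuracy(accuracy: float, random_baseline: float) -> float:
    """Rescale accuracy so random guessing maps to 0 and perfect to 1."""
    return (accuracy - random_baseline) / (1.0 - random_baseline)


# A model that spends exactly one bit per byte
print(bits_per_byte(math.log(2) * 1000, 1000))

# 4-way multiple choice: 25% accuracy is chance level, below that is negative
print(centered_accuracy(0.25, 0.25))
print(centered_accuracy(0.20, 0.25))
```

Under this definition, a negative per-task score in the report means the model did worse than the random baseline on that task.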

In [7]:
# Evaluate the model on a larger chunk of train/val data and draw some samples
!bash -c "source .venv/bin/activate && python -m scripts.base_loss"

# Evaluate the model on CORE tasks
!bash -c "source .venv/bin/activate && python -m scripts.base_eval"

2025-10-15 06:40:17,956 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 06:40:17,957 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 06:40:17,957 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/base_checkpoints/d4 with step 1400
2025-10-15 06:40:18,380 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
train bpb: 1.1694
val bpb: 1.1627
<|bos|>The capital of France is the city of Paris, which is the city of Paris. The city is the
<|bos|>The chemical symbol of gold is the gold symbol of gold. The gold symbol of gold is the gold symbol of
<|bos|>If yesterday was Friday, then tomorrow will be a good day, and tomorrow will be a good day, and tomorrow will be
<|bos|>The opposite of hot is the fact that the temperature of

## 6. Mid-training

In this stage, we teach the base model about conversational structure, special tokens, and tool use.
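
The log for this stage prints a learning-rate scaling factor for the AdamW parameters, `∝1/√(256/768) = 1.732051`. The sketch below reproduces that number; the function name and the interpretation of 768 as a reference width are assumptions based only on the log line.

```python
import math


def adamw_lr_scale(model_dim: int, reference_dim: int = 768) -> float:
    """Scale factor proportional to 1/sqrt(model_dim / reference_dim)."""
    return 1.0 / math.sqrt(model_dim / reference_dim)


print(f"{adamw_lr_scale(256):.6f}")  # width of the depth-4 model in this run
print(f"{adamw_lr_scale(768):.6f}")  # a model at the reference width is unscaled
```

Narrower models than the reference width therefore get a larger AdamW learning rate, and wider ones a smaller one.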

In [9]:
# device_batch_size must be <= the batch size used in base training
!bash -c "source .venv/bin/activate && python -m scripts.mid_train --device_batch_size=32" 

Overriding: device_batch_size = 32
2025-10-15 07:04:40,883 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 07:04:40,912 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 07:04:40,913 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/base_checkpoints/d4 with step 1400
2025-10-15 07:04:41,334 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
Tokens / micro-batch / rank: 32 x 2048 = 65,536
Tokens / micro-batch: 65,536
Total batch size 524,288 => gradient accumulation steps: 8
Scaling the LR for the AdamW parameters ∝1/√(256/768) = 1.732051
README.md: 2.24kB [00:00, 5.68MB/s]
data/train-00000-of-00004.parquet:   0%|             | 0.00/230M [00:00<?, ?B/s]

## 7. Evaluate Mid-trained Model

We evaluate the model's chat abilities after the mid-training stage.
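
The evaluation prints per-task accuracies and a single aggregate, the ChatCORE metric. The sketch below shows one way that aggregate can be computed: center each accuracy against a random-guess baseline and average. The baselines (0.25 for the multiple-choice tasks, 0 for the open-ended ones), the function name, and the formula itself are assumptions; they are included because they reproduce the ChatCORE values in the report at the end of this notebook.

```python
# Assumed random-guess baselines per task
BASELINES = {
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "MMLU": 0.25,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
}


def chatcore(accuracies: dict) -> float:
    """Mean of baseline-centered accuracies across all tasks."""
    centered = [(accuracies[task] - base) / (1.0 - base)
                for task, base in BASELINES.items()]
    return sum(centered) / len(centered)


# Accuracies copied from the "Chat evaluation" sections of the report
mid = {"ARC-Easy": 0.2370, "ARC-Challenge": 0.2534, "MMLU": 0.2490,
       "GSM8K": 0.0008, "HumanEval": 0.0000}
sft = {"ARC-Easy": 0.2534, "ARC-Challenge": 0.2585, "MMLU": 0.2530,
       "GSM8K": 0.0038, "HumanEval": 0.0000}

print(f"mid ChatCORE: {chatcore(mid):.4f}")
print(f"sft ChatCORE: {chatcore(sft):.4f}")
```

A value near zero, as in this depth-4 run, means the model is performing at roughly chance level on these benchmarks.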

In [10]:
!bash -c "source .venv/bin/activate && python -m scripts.chat_eval -i mid"

2025-10-15 07:52:36,127 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 07:52:36,166 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 07:52:36,167 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/mid_checkpoints/d4 with step 771
2025-10-15 07:52:36,908 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
README.md: 9.00kB [00:00, 16.0MB/s]
ARC-Easy/train-00000-of-00001.parquet: 100%|██| 331k/331k [00:00<00:00, 517kB/s]
ARC-Easy/test-00000-of-00001.parquet:   0%|          | 0.00/346k [00:00<?, ?B/s]

## 8. Supervised Fine-tuning (SFT)

Finally, we perform supervised fine-tuning for domain adaptation, making the model a better chatbot.
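
Unlike the earlier stages, the SFT log counts batches in examples rather than tokens. The sketch below shows how the accumulation steps and iteration count relate to `device_batch_size`; the values 32 (target examples per step) and 20,843 (training rows) come from the log and report in this notebook, while the function name and the floor division for the iteration count are assumptions that happen to match the reported 651 iterations.

```python
def sft_schedule(num_rows: int,
                 target_examples_per_step: int = 32,
                 device_batch_size: int = 32,
                 world_size: int = 1):
    """Return (gradient accumulation steps, iterations for one epoch)."""
    examples_per_micro_step = device_batch_size * world_size
    if target_examples_per_step % examples_per_micro_step != 0:
        raise ValueError("target examples per step must be a multiple of "
                         "device_batch_size * world_size")
    grad_accum = target_examples_per_step // examples_per_micro_step
    iterations = num_rows // target_examples_per_step
    return grad_accum, iterations


print(sft_schedule(20_843))                       # settings used in this run
print(sft_schedule(20_843, device_batch_size=4))  # smaller GPU: more accumulation
```

Lowering `device_batch_size` changes only the number of accumulation steps; the number of optimizer steps per epoch stays the same.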

In [11]:
# The device_batch_size must be <= the batch size used in base training
!bash -c "source .venv/bin/activate && python -m scripts.chat_sft --device_batch_size=32"

Overriding: device_batch_size = 32
2025-10-15 08:55:50,837 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 08:55:50,876 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 08:55:50,877 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/mid_checkpoints/d4 with step 771
2025-10-15 08:55:51,343 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
Target examples per step: 32
Device batch size: 32
Examples per step is device_batch_size * ddp_world_size: 32
=> Setting grad accum steps: 1
Scaling the LR for the AdamW parameters ∝1/√(256/768) = 1.732051
Step 00000 | Validation loss: 2.035508
Step 00000/00651 | Training loss: 2.048453| lrm: 1.000000| num_tokens: 15,439
Step 00001/00651 | Training loss: 2.172420| lrm: 0.998464| n

## 9. Evaluate SFT Model

This is the final evaluation of our fully fine-tuned chat model.

In [12]:
!bash -c "source .venv/bin/activate && python -m scripts.chat_eval -i sft"

2025-10-15 09:18:35,186 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 09:18:35,224 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 09:18:35,225 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/chatsft_checkpoints/d4 with step 650
2025-10-15 09:18:35,657 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
Final: 602/2376 (25.34%)
ARC-Easy accuracy: 25.34%
Final: 303/1172 (25.85%)
ARC-Challenge accuracy: 25.85%
Final: 3553/14042 (25.30%)
MMLU accuracy: 25.30%
Rank 0 | 0/10 (0.00%)

## 10. Generate Final Report

Now we compile all the metrics gathered during the pipeline into a single `report.md` file.

In [13]:
!bash -c "source .venv/bin/activate && python -m nanochat.report generate"

# Display the final report
from IPython.display import display, Markdown
with open('report.md', 'r') as f:
    report_content = f.read()
display(Markdown(report_content))

Generating report to /content/nanochat_data/report/report.md
Copying report.md to current directory for convenience


# nanochat training report

Generated: 2025-10-15 04:56:28

## Environment

### Git Information
- Branch: master
- Commit: 67aaca9 (clean)
- Message: export NANOCHAT_BASE_DIR so child processes get it too

### Hardware
- Platform: Linux
- CPUs: 43 cores (43 logical)
- Memory: 1338.1 GB
- GPUs: 1x NVIDIA A100-SXM4-80GB
- GPU Memory: 79.3 GB total
- CUDA Version: 12.8
- Hourly Rate: $1.79/hour

### Software
- Python: 3.10.19
- PyTorch: 2.8.0+cu128


### Bloat
- Characters: 330,622
- Lines: 8,077
- Files: 42
- Tokens (approx): 82,655
- Dependencies (uv.lock lines): 2,004

Run started: 2025-10-15 04:56:29

---

## Tokenizer training
timestamp: 2025-10-15 05:01:03

- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 110.7337
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 32
- token_bytes_mean: 6.9151
- token_bytes_std: 2.8736


## Tokenizer evaluation
timestamp: 2025-10-15 05:01:15

### Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|--------------|-------------|------------|-----------------|
| news | 1819 | 404 | 4.50 | 375 | 4.85 | +7.2% |
| korean | 893 | 745 | 1.20 | 721 | 1.24 | +3.2% |
| code | 1259 | 576 | 2.19 | 493 | 2.55 | +14.4% |
| math | 1834 | 936 | 1.96 | 966 | 1.90 | -3.2% |
| science | 1112 | 260 | 4.28 | 225 | 4.94 | +13.5% |
| fwe-train | 4208518 | 900364 | 4.67 | 856901 | 4.91 | +4.8% |
| fwe-val | 4908443 | 1059062 | 4.63 | 1010356 | 4.86 | +4.6% |

### Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|--------------|-------------|------------|-----------------|
| news | 1819 | 387 | 4.70 | 375 | 4.85 | +3.1% |
| korean | 893 | 364 | 2.45 | 721 | 1.24 | -98.1% |
| code | 1259 | 309 | 4.07 | 493 | 2.55 | -59.5% |
| math | 1834 | 832 | 2.20 | 966 | 1.90 | -16.1% |
| science | 1112 | 249 | 4.47 | 225 | 4.94 | +9.6% |
| fwe-train | 4208518 | 874799 | 4.81 | 856901 | 4.91 | +2.0% |
| fwe-val | 4908443 | 1029691 | 4.77 | 1010356 | 4.86 | +1.9% |


## Base model training
timestamp: 2025-10-15 06:40:02

- run: dummy
- depth: 4
- max_seq_len: 2048
- num_iterations: -1
- target_flops: -1.0000
- target_param_data_ratio: 20
- device_batch_size: 32
- total_batch_size: 524,288
- embedding_lr: 0.2000
- unembedding_lr: 0.0040
- weight_decay: 0.0000
- matrix_lr: 0.0200
- grad_clip: 1.0000
- eval_every: 250
- eval_tokens: 10,485,760
- core_metric_every: 2000
- core_metric_max_per_task: 500
- sample_every: 2000
- model_tag: 
- Number of parameters: 36,700,160
- Number of FLOPs per token: 1.447035e+08
- Calculated number of iterations: 1400
- Number of training tokens: 734,003,200
- Tokens : Params ratio: 20.0000
- DDP world size: 1
- warmup_ratio: 0.0000
- warmdown_ratio: 0.2000
- final_lr_frac: 0.0000
- Minimum validation bpb: 1.1627
- Final validation bpb: 1.1627
- CORE metric estimate: 0.0448
- MFU %: 2.86%
- Total training flops: 1.062128e+17
- Total training time: 62.08m
- Peak memory usage: 11081.51MiB


## Base model loss
timestamp: 2025-10-15 06:42:38

- train bpb: 1.1694
- val bpb: 1.1627
- sample 0: <|bos|>The capital of France is the city of Paris, which is the city of Paris. The city is the
- sample 1: <|bos|>The chemical symbol of gold is the gold symbol of gold. The gold symbol of gold is the gold symbol of
- sample 2: <|bos|>If yesterday was Friday, then tomorrow will be a good day, and tomorrow will be a good day, and tomorrow will be
- sample 3: <|bos|>The opposite of hot is the fact that the temperature of the planet is very hot. The temperature of the
- sample 4: <|bos|>The planets of the solar system are: the planets of the solar system are: the planets of the solar system are:
- sample 5: <|bos|>My favorite color is to use a color palette, a color palette, a color palette, a color
- sample 6: <|bos|>If 5*x + 3 = 13, then x is 3*x + 3 = 13, then x is 3


## Base model evaluation
timestamp: 2025-10-15 07:04:07

- Model: base_model (step 1400)
- CORE metric: 0.0395
- hellaswag_zeroshot: 0.0184
- jeopardy: 0.0000
- bigbench_qa_wikidata: 0.0178
- arc_easy: 0.1347
- arc_challenge: -0.0307
- copa: 0.0200
- commonsense_qa: 0.0653
- piqa: 0.1262
- openbook_qa: -0.0107
- lambada_openai: 0.1589
- hellaswag: 0.0140
- winograd: 0.0037
- winogrande: 0.0639
- bigbench_dyck_languages: 0.0270
- agi_eval_lsat_ar: 0.0543
- bigbench_cs_algorithms: 0.3583
- bigbench_operators: 0.0286
- bigbench_repeat_copy_logic: 0.0000
- squad: 0.0013
- coqa: 0.0303
- boolq: -0.3979
- bigbench_language_identification: 0.1846


## Midtraining
timestamp: 2025-10-15 07:52:23

- run: dummy
- dtype: bfloat16
- max_seq_len: 2048
- device_batch_size: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- init_lr_frac: 1.0000
- weight_decay: 0.0000
- final_lr_frac: 0.0000
- eval_every: 150
- eval_tokens: 10,485,760
- total_batch_size: 524,288
- Number of iterations: 771
- DDP world size: 1
- Minimum validation bpb: 0.7144


## Chat evaluation mid
timestamp: 2025-10-15 08:55:36

- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.2370
- ARC-Challenge: 0.2534
- MMLU: 0.2490
- GSM8K: 0.0008
- HumanEval: 0.0000
- ChatCORE metric: -0.0027


## Chat SFT
timestamp: 2025-10-15 09:18:19

- run: dummy
- source: mid
- dtype: bfloat16
- device_batch_size: 32
- num_epochs: 1
- max_iterations: -1
- target_examples_per_step: 32
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
- weight_decay: 0.0000
- init_lr_frac: 0.0200
- eval_every: 100
- eval_steps: 100
- eval_metrics_every: 200
- Training rows: 20,843
- Number of iterations: 651
- Training loss: 2.1431
- Validation loss: 1.9899


## Chat evaluation sft
timestamp: 2025-10-15 10:01:16

- source: sft
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.2534
- ARC-Challenge: 0.2585
- MMLU: 0.2530
- GSM8K: 0.0038
- HumanEval: 0.0000
- ChatCORE metric: 0.0047


## Summary

- Characters: 330,622
- Lines: 8,077
- Files: 42
- Tokens (approx): 82,655
- Dependencies (uv.lock lines): 2,004

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|
| CORE            | 0.0395   | -        | -        | -        |
| ARC-Challenge   | -        | 0.2534   | 0.2585   | -        |
| ARC-Easy        | -        | 0.2370   | 0.2534   | -        |
| GSM8K           | -        | 0.0008   | 0.0038   | -        |
| HumanEval       | -        | 0.0000   | 0.0000   | -        |
| MMLU            | -        | 0.2490   | 0.2530   | -        |
| ChatCORE        | -        | -0.0027  | 0.0047   | -        |

Total wall clock time: 5h4m


## 11. Inference

Your custom LLM is trained! You can now chat with it directly from the command line or through a simple web UI.

In [14]:
# Chat with the model via the Command Line Interface (CLI)
!bash -c 'source .venv/bin/activate && python -m scripts.chat_cli -p "Why is the sky blue?"'

2025-10-15 10:01:29,638 - nanochat.common - INFO - Distributed world size: 1
2025-10-15 10:01:29,674 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 10:01:29,675 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/chatsft_checkpoints/d4 with step 650
2025-10-15 10:01:30,340 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}

NanoChat Interactive Mode
--------------------------------------------------
Type 'quit' or 'exit' to end the conversation
Type 'clear' to start a new conversation
--------------------------------------------------

Assistant: The sky blue is a blue, and the sky is a blue, and the sky is a blue.<|assistant_end|>


In [15]:
# (Optional) Launch the Web UI
# You will need to use a tool like ngrok to expose the port from Colab.
# Example with ngrok:
# 1. Get an authtoken from https://dashboard.ngrok.com/get-started/your-authtoken
# 2. Run the following in a new cell before starting the server (the server cell blocks until interrupted):
#    !pip install pyngrok
#    from pyngrok import ngrok
#    authtoken = "YOUR_NGROK_AUTHTOKEN"
#    ngrok.set_auth_token(authtoken)
#    tunnel = ngrok.connect(8000)
#    print(f'Click to access the Web UI: {tunnel.public_url}')
# 3. Then run the cell below.

!bash -c "source .venv/bin/activate && python -m scripts.chat_web"

2025-10-15 10:01:44,251 - nanochat.common - INFO - Distributed world size: 1
Starting NanoChat Web Server
Temperature: 0.8, Top-k: 50, Max tokens: 512
INFO:     Started server process [39139]
INFO:     Waiting for application startup.
Loading nanochat model...
2025-10-15 10:01:44,418 - nanochat.checkpoint_manager - INFO - No model tag provided, guessing model tag: d4
2025-10-15 10:01:44,418 - nanochat.checkpoint_manager - INFO - Loading model from /content/nanochat_data/chatsft_checkpoints/d4 with step 650
2025-10-15 10:01:44,832 - nanochat.checkpoint_manager - INFO - Building model with config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 4, 'n_head': 2, 'n_kv_head': 2, 'n_embd': 256}
Server ready at http://localhost:8000
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

INFO:     Shutti