# PyCodeAI - Google Colab Training (GitHub Version)

This notebook trains your PyCodeAI model using code from GitHub and saves the results to Google Drive.

## Instructions

1.  **Configure**: Set your GitHub Repository URL in the first code cell.
2.  **Mount Drive**: Run the cell to connect Google Drive (for saving the trained model).
3.  **Run All**: Run all cells to clone, install, and train.

In [3]:
# CONFIGURATION
# Replace this with your repository URL
GITHUB_REPO = 'https://github.com/mohhomadfarman/PyCodeAI.git'
BRANCH = 'main'  # or 'master'

# This is where the model will be SAVED in your Google Drive
DRIVE_SAVE_PATH = '/content/drive/MyDrive/PyCodeAI_Models'

In [4]:
# 1. Mount Google Drive
from google.colab import drive
import os

drive.mount('/content/drive')

# Create the save directory if it doesn't exist
os.makedirs(DRIVE_SAVE_PATH, exist_ok=True)
print(f"Models will be saved to: {DRIVE_SAVE_PATH}")

Mounted at /content/drive
Models will be saved to: /content/drive/MyDrive/PyCodeAI_Models


In [6]:
# 2. Clone Repository & Install Dependencies
!git clone {GITHUB_REPO} PyCodeAI_Repo
%cd PyCodeAI_Repo
!git checkout {BRANCH}
!git pull origin {BRANCH}  # Ensure we have the latest

# Install cupy for GPU
!pip install cupy-cuda12x

Cloning into 'PyCodeAI_Repo'...
remote: Enumerating objects: 3039, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 3039 (delta 7), reused 7 (delta 4), pack-reused 3009 (from 4)
Receiving objects: 100% (3039/3039), 41.54 MiB | 29.48 MiB/s, done.
Resolving deltas: 100% (1063/1063), done.
/content/PyCodeAI_Repo
Already on 'main'
Your branch is up to date with 'origin/main'.
From https://github.com/mohhomadfarman/PyCodeAI
 * branch            main       -> FETCH_HEAD
Already up to date.
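
Re-running the clone cell in the same session fails at `git clone`, because git refuses to clone into an existing, non-empty `PyCodeAI_Repo` directory. A minimal sketch of one way to make the step repeatable is shown below. The helper name `sync_commands` is hypothetical and not part of the repository; it only decides which git commands to run.

```python
import os


def sync_commands(repo_url, branch, dest):
    """Return the git commands needed to bring `dest` up to date.

    Hypothetical helper: clones when `dest` is not yet a git checkout,
    otherwise switches to `branch` and fast-forwards it.
    """
    if os.path.isdir(os.path.join(dest, ".git")):
        # Already cloned: update in place instead of cloning again.
        return [
            ["git", "-C", dest, "checkout", branch],
            ["git", "-C", dest, "pull", "--ff-only", "origin", branch],
        ]
    # First run: clone the requested branch directly.
    return [["git", "clone", "--branch", branch, repo_url, dest]]


# Usage in the notebook (not executed here because it needs network access):
#   import subprocess
#   for cmd in sync_commands(GITHUB_REPO, BRANCH, "PyCodeAI_Repo"):
#       subprocess.run(cmd, check=True)
```

The `%cd PyCodeAI_Repo` line is still needed afterwards, since the helper does not change the working directory.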


In [7]:
# 3. Check for Existing Model
import os
import shutil

# To resume from a checkpoint stored in Drive, uncomment the lines below.
# Note: step 5 writes 'best_model_latest.npz' (plus timestamped copies) to Drive,
# so look for that name unless you uploaded a 'best_model.npz' yourself.
# DRIVE_MODEL = os.path.join(DRIVE_SAVE_PATH, 'best_model_latest.npz')
# if os.path.exists(DRIVE_MODEL):
#     print("Found model in Drive, copying to local workspace...")
#     shutil.copy(DRIVE_MODEL, 'best_model.npz')

if os.path.exists('best_model.npz'):
    print("Starting training from existing 'best_model.npz'...")
else:
    print("No 'best_model.npz' found. Starting fresh training (or finding it in repo).")

Starting training from existing 'best_model.npz'...
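
The optional restore step can be sketched as a small function that tries several file names in Drive, since step 5 saves the model as `best_model_latest.npz` while the training command loads `best_model.npz`. The function name `restore_checkpoint` and its `candidates` parameter are assumptions made for illustration.

```python
import os
import shutil


def restore_checkpoint(drive_dir, local_path="best_model.npz",
                       candidates=("best_model_latest.npz", "best_model.npz")):
    """Copy the first checkpoint found in `drive_dir` to `local_path`.

    Hypothetical helper. Returns the Drive path that was copied,
    or None if none of the candidate files exist.
    """
    for name in candidates:
        source = os.path.join(drive_dir, name)
        if os.path.isfile(source):
            shutil.copy(source, local_path)
            return source
    return None


# Usage in the notebook:
#   restored = restore_checkpoint(DRIVE_SAVE_PATH)
#   print(restored or "No checkpoint found in Drive.")
```

Note that this overwrites any `best_model.npz` already present in the working directory, including one shipped with the repository.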


In [9]:
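# 4. Run Training
# - Resumes from best_model.npz (if it exists)
# - Step 5 backs up best_model.npz afterwards
#   (earlier versions of this notebook used best_model_new.npz)
# - Copies tokenizer.json to tokenizer_new.json so step 5 can back it up

# Snapshot the current tokenizer under a separate name
!cp tokenizer.json tokenizer_new.json 2>/dev/null || echo "No tokenizer.json found, will build new one."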

!python cli.py train \
    --device gpu \
    --load-model best_model.npz \
    --epochs 5 \
    --batch-size 32 \
    --log-interval 10

Backend: GPU (Unknown GPU, 15095MB VRAM)
>> Training PyCodeAI
Device: GPU

1. Loading training data...
[OK] Loaded 2062 legacy crawled files
[OK] Loaded 1531 structured crawled files
[OK] Loaded 103 articles
   Loaded 8060 samples

2. Building tokenizer...
   Loading tokenizer from tokenizer.json...
Tokenizer loaded from tokenizer.json
   Vocabulary size: 5000

3. Tokenizing data...
   Tokenized 8060 samples

4. Creating model...
   Loading weights from best_model.npz...
Model loaded from best_model.npz
   [OK] Weights loaded successfully!
   Model parameters: 2,073,344
GPTConfig(
  vocab_size=5000,
  max_seq_len=64,
  embed_dim=128,
  num_heads=4,
  num_layers=4,
  expansion_factor=4
)

5. Creating data loader...
   Batches per epoch: 2440

6. Setting up trainer...

7. Starting training...
Starting Training
Model parameters: 2,073,344
Epochs: 5
Batches per epoch: 2440
Gradient Accumulation: 1 steps
Effective Batch Size: 32
Epoch 1/5 | Step 10 | Loss: 5.4494 | LR: 3.00e-05 | Tokens/s: 
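
The logged `Model parameters: 2,073,344` can be reproduced by hand from the printed `GPTConfig`. The sketch below assumes a layer layout that is not confirmed by the log: an untied token embedding and output projection (no output bias), no learned positional parameters, biased Q/K/V/output projections, a biased two-layer MLP, and two layer norms per block plus a final one. The repository's actual layout may differ even though the total matches.

```python
def gpt_param_count(vocab_size, embed_dim, num_layers, expansion_factor):
    """Count parameters under the ASSUMED layout described above."""
    hidden = expansion_factor * embed_dim

    token_embedding = vocab_size * embed_dim
    output_head = embed_dim * vocab_size          # assumed: untied, no bias

    # Q, K, V and output projections, each with a bias vector.
    attention = 4 * (embed_dim * embed_dim + embed_dim)
    # Two linear layers (up and down projection), each with a bias vector.
    mlp = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * embed_dim
    block = attention + mlp + layer_norms

    final_norm = 2 * embed_dim
    return token_embedding + num_layers * block + final_norm + output_head


print(gpt_param_count(vocab_size=5000, embed_dim=128, num_layers=4, expansion_factor=4))
# 2073344
```

With `vocab_size=7171` the same function gives 2,629,120, which is the count logged by the chat-only run later in this notebook. Under this layout the whole difference between the two runs comes from the two vocabulary-sized matrices: 2 × 128 × (7171 − 5000) = 555,776.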

In [12]:
!python cli.py train \
    --device gpu \
    --load-model best_model.npz \
    --epochs 5 \
    --batch-size 64 \
    --grad-accum 4 \
    --vocab-size 7171 \
    --chat-only

Backend: GPU (Unknown GPU, 15095MB VRAM)
>> Training PyCodeAI
Device: GPU

1. Loading training data...
    Chat Mode: Loading dedicated conversation data
[OK] Loaded 103 articles
[OK] Loaded 3153 chat/article samples
   Loaded 3153 samples

2. Building tokenizer...
   Loading tokenizer from tokenizer.json...
Tokenizer loaded from tokenizer.json
   Vocabulary size: 7171

3. Tokenizing data...
   Tokenized 3153 samples

4. Creating model...
   Loading weights from best_model.npz...
Model loaded from best_model.npz
   [OK] Weights loaded successfully!
   Model parameters: 2,629,120
GPTConfig(
  vocab_size=7171,
  max_seq_len=64,
  embed_dim=128,
  num_heads=4,
  num_layers=4,
  expansion_factor=4
)

5. Creating data loader...
   Batches per epoch: 371

6. Setting up trainer...

7. Starting training...
Starting Training
Model parameters: 2,629,120
Epochs: 5
Batches per epoch: 371
Gradient Accumulation: 4 steps
Effective Batch Size: 256
Epoch 1/5 | Step 10 | Loss: 4.1370 | LR: 3.00e-05 | To
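
The `Effective Batch Size: 256` line follows from the flags: the batch size of 64 multiplied by 4 accumulation steps. The sketch below works through that arithmetic. The number of optimizer updates per epoch is an assumption, since it depends on whether the trainer also applies an update for a trailing, incomplete accumulation group. The helper name `accumulation_summary` is hypothetical.

```python
import math


def accumulation_summary(batch_size, grad_accum, batches_per_epoch):
    """Return (effective batch size, optimizer updates per epoch).

    The update count ASSUMES a final partial accumulation group still
    triggers an optimizer step; the real trainer may drop it instead.
    """
    effective_batch = batch_size * grad_accum
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return effective_batch, updates_per_epoch


print(accumulation_summary(batch_size=64, grad_accum=4, batches_per_epoch=371))
# (256, 93)
print(accumulation_summary(batch_size=32, grad_accum=1, batches_per_epoch=2440))
# (32, 2440)
```

Compared with the first run, each epoch therefore applies far fewer weight updates, each one averaged over more samples.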

In [17]:
# 5. Save Results to Drive
import shutil
import datetime
import os

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
new_model_name = f"best_model_{timestamp}.npz"
new_token_name = f"tokenizer_{timestamp}.json"

print(f"Backing up to Drive as {new_model_name}...")

# Copy model
if os.path.exists("best_model.npz"):
    shutil.copy("best_model.npz", os.path.join(DRIVE_SAVE_PATH, new_model_name))
    # Also refresh the 'latest' copy
    shutil.copy("best_model.npz", os.path.join(DRIVE_SAVE_PATH, "best_model_latest.npz"))
    print("Model saved.")
else:
    print("ERROR: best_model.npz not found!")

# Copy tokenizer
if os.path.exists("tokenizer_new.json"):
    shutil.copy("tokenizer_new.json", os.path.join(DRIVE_SAVE_PATH, new_token_name))
    shutil.copy("tokenizer_new.json", os.path.join(DRIVE_SAVE_PATH, "tokenizer_latest.json"))
    print("Tokenizer saved.")
else:
    print("WARNING: tokenizer_new.json not found!")

Backing up to Drive as best_model_20260210_175011.npz...
Model saved.
Tokenizer saved.
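
The save cell reports success as soon as `shutil.copy` returns. A minimal sketch of a stricter variant is shown below: it copies a file under several names and compares each copy byte for byte with the source. The helper name `copy_and_verify` is hypothetical.

```python
import filecmp
import os
import shutil


def copy_and_verify(source, dest_dir, dest_names):
    """Copy `source` into `dest_dir` under each name and verify the copies.

    Hypothetical helper. Returns the list of verified destination paths
    and raises OSError if a copy does not match the source.
    """
    os.makedirs(dest_dir, exist_ok=True)
    verified = []
    for name in dest_names:
        target = os.path.join(dest_dir, name)
        shutil.copy(source, target)
        # shallow=False compares file contents, not just size and mtime.
        if not filecmp.cmp(source, target, shallow=False):
            raise OSError(f"Copy of {source} to {target} does not match")
        verified.append(target)
    return verified


# Usage in the notebook:
#   copy_and_verify("best_model.npz", DRIVE_SAVE_PATH,
#                   [new_model_name, "best_model_latest.npz"])
```

Writes to a mounted Drive can still be pending when the cell finishes, so calling `drive.flush_and_unmount()` before the runtime shuts down is a reasonable extra precaution.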


In [16]:
print("## Git Push Process")
print("\n1.  **Stage changes**: Add files to the staging area.")
print("!git add .  # To add all changes in the current directory")
print("# Or: !git add <file_name> # To add a specific file")
print("\n2.  **Commit changes**: Record the changes to the repository with a message.")
print("!git commit -m \"Your commit message here\"")
print("\n3.  **Push to remote**: Upload the committed changes to the remote repository.")
print("!git push origin main # Replace 'main' with your branch name (e.g., master)")

## Git Push Process

1.  **Stage changes**: Add files to the staging area.
!git add .  # To add all changes in the current directory
# Or: !git add <file_name> # To add a specific file

2.  **Commit changes**: Record the changes to the repository with a message.
!git commit -m "Your commit message here"

3.  **Push to remote**: Upload the committed changes to the remote repository.
!git push origin main # Replace 'main' with your branch name (e.g., master)
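
Because `git add .` stages everything in the working tree, checkpoints and crawled data are included along with code. GitHub warns about files larger than 50 MiB and rejects files larger than 100 MiB, so it can help to list large files before staging. The sketch below does that; the helper name `files_over_limit` and the default threshold are choices made for illustration.

```python
import os


def files_over_limit(root, limit_bytes=50 * 1024 * 1024):
    """Return sorted (relative_path, size_in_bytes) pairs above the limit.

    Hypothetical helper. The .git directory is skipped because its
    contents are never staged.
    """
    oversized = []
    for folder, subdirs, files in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        subdirs[:] = [d for d in subdirs if d != ".git"]
        for name in files:
            path = os.path.join(folder, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                oversized.append((os.path.relpath(path, root), size))
    return sorted(oversized)


# Usage in the notebook, from inside the repository directory:
#   for path, size in files_over_limit("."):
#       print(f"{size / 2**20:.1f} MiB  {path}")
```

Files reported here are candidates for `.gitignore` or for staging individually with `git add <file_name>` instead of `git add .`.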
