# 🚀 Bullet OS Universal Trainer

## Complete Pipeline: Data → Tokenizer → Training → .bullet Model → Testing

**Created by:** Shrikant Bhosale | **Mentored by:** [Hintson.com](https://hintson.com)

---

### What This Notebook Does:

1. ✅ **Load Data** - Upload or create your dataset
2. ✅ **Build Tokenizer** - Train BPE tokenizer on your data
3. ✅ **Train Model** - Train Transformer from scratch
4. ✅ **Export .bullet** - Create production-ready model
5. ✅ **Test & Validate** - Generate text and verify quality

**Time:** 20-30 minutes | **Cost:** ₹0 | **GPU:** Optional (works on CPU)

---

## 📦 Step 1: Setup Environment

In [None]:
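# Quiet flags keep the output short without hiding the confirmation message below
# (%%capture would swallow every print in this cell, including the final one)
!git clone -q https://github.com/iShrikantBhosale/bullet-core.git
%cd bullet-core
!pip install -q numpy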

import sys
sys.path.append('bullet_core')

print('✅ Environment ready!')

## 📂 Step 2: Load Your Dataset

Choose how to provide your training data.

In [None]:
import json
from google.colab import files
import os

print('📂 Choose your dataset option:\n')
print('1. Upload JSONL file')
print('2. Upload plain text file')
print('3. Enter text interactively')
print('4. Use demo Marathi dataset\n')

choice = input('Enter choice (1/2/3/4): ')

dataset_path = 'training_data.jsonl'

if choice == '1':
    print('\n📤 Upload your JSONL file (format: {"text": "your text"})')
    uploaded = files.upload()
    filename = list(uploaded.keys())[0]
    os.rename(filename, dataset_path)
    
elif choice == '2':
    print('\n📤 Upload your text file (one sentence per line)')
    uploaded = files.upload()
    filename = list(uploaded.keys())[0]
    
    # Convert to JSONL
    with open(filename, 'r', encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    
    with open(dataset_path, 'w', encoding='utf-8') as f:
        for line in lines:
            f.write(json.dumps({'text': line}, ensure_ascii=False) + '\n')
    
elif choice == '3':
    print('\n✍️ Enter your training texts (press Enter on an empty line when done):\n')
    texts = []
    while True:
        text = input(f'Text {len(texts)+1}: ')
        if not text:
            break
        texts.append(text)
    
    with open(dataset_path, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write(json.dumps({'text': text}, ensure_ascii=False) + '\n')

else:
    # Demo dataset: five short Marathi sentences about AI and machine learning
    demo_texts = [
        'कृत्रिम बुद्धिमत्ता तंत्रज्ञानात क्रांती आणत आहे.',
        'मशीन लर्निंग डेटामधील पॅटर्न ओळखते.',
        'डीप लर्निंग न्यूरल नेटवर्क वापरते.',
        'नैसर्गिक भाषा प्रक्रिया मजकूर समजून घेते.',
        'संगणक दृष्टी प्रतिमा ओळखू शकते.',
    ]
    with open(dataset_path, 'w', encoding='utf-8') as f:
        for text in demo_texts:
            f.write(json.dumps({'text': text}, ensure_ascii=False) + '\n')

# Count examples (blank lines are not examples)
with open(dataset_path, 'r', encoding='utf-8') as f:
    num_examples = sum(1 for line in f if line.strip())

print(f'\n✅ Dataset ready: {num_examples} examples')
print(f'📁 Saved to: {dataset_path}')
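
The upload prompt above expects one JSON object per line with a `text` field. As a rough sketch of how an uploaded file could be checked before training, the hypothetical helper `validate_jsonl` below (not part of bullet-core) counts usable records and collects a message for each problem line.

```python
import json

def validate_jsonl(path):
    """Return (valid_count, problems) for a JSONL file of {"text": ...} records."""
    valid, problems = 0, []
    with open(path, 'r', encoding='utf-8') as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # blank lines are ignored, not reported
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f'line {lineno}: invalid JSON ({e.msg})')
                continue
            # Only a non-empty string under "text" counts as a usable example
            text = record.get('text') if isinstance(record, dict) else None
            if not isinstance(text, str) or not text.strip():
                problems.append(f'line {lineno}: missing or empty "text" field')
                continue
            valid += 1
    return valid, problems
```

Calling `validate_jsonl(dataset_path)` right after the upload would surface malformed lines early instead of during tokenizer training.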

## 🔤 Step 3: Build Tokenizer

Train a BPE tokenizer on your data.

In [None]:
from python.tokenizer import BPETokenizer
import json

# Load all text from dataset
texts = []
with open(dataset_path, 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines, which are not valid JSON
        data = json.loads(line)
        texts.append(data['text'])

# Train tokenizer
print('🔨 Training tokenizer...')
tokenizer = BPETokenizer()
tokenizer.train(texts, vocab_size=2000)

# Save tokenizer
tokenizer.save('my_tokenizer.json')

print('\n✅ Tokenizer trained!')
print(f'Vocab size: {len(tokenizer.vocab)}')
print(f'\nTest encoding:')
test_text = texts[0][:50]
tokens = tokenizer.encode(test_text)
print(f'Text: {test_text}')
print(f'Tokens: {tokens[:10]}...')
print(f'Decoded: {tokenizer.decode(tokens)}')
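
For intuition about what BPE training does, here is a toy sketch of the core loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is independent of `BPETokenizer`, whose actual implementation may differ (for example in byte handling, pre-tokenization, or tie-breaking); `bpe_merges` is a hypothetical name used only for illustration.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken by the pair itself for determinism
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the corpus, fusing each occurrence of the best pair into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(['low', 'low', 'lower', 'newest', 'newest', 'newest'], 3))
```

Each learned merge becomes a new vocabulary entry, which is why `vocab_size` bounds the number of merges a tokenizer can learn from a small dataset.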

## ⚙️ Step 4: Configure Model

Set up training parameters.

In [None]:
# Model configuration
config = f'''hidden_size: 128
num_heads: 4
num_layers: 4
vocab_size: {len(tokenizer.vocab)}
learning_rate: 0.0003
batch_size: 4
max_seq_len: 64
max_steps: 500
dataset_path: "{dataset_path}"
checkpoint_dir: "my_model_checkpoints"
'''

# Save config
with open('bullet_core/configs/my_model.yaml', 'w') as f:
    f.write(config)

print('✅ Configuration created')
print('\nModel specs:')
print('  - 128 hidden dimensions')
print('  - 4 attention heads')
print('  - 4 transformer layers')
print(f'  - {len(tokenizer.vocab)} vocab size')
print('  - ~500K parameters')
print('\nTraining: 500 steps (~10 minutes on CPU)')
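
The parameter count printed above is approximate. As a rough cross-check, the sketch below estimates the size of a GPT-2-style decoder from the same settings. The architecture details are assumptions and are not confirmed by this notebook: a 4x MLP expansion, biases on all linear layers, two LayerNorms per block, learned position embeddings, and an output projection tied to the token embedding. `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(vocab_size, d_model, n_layer, max_len,
                        mlp_ratio=4, tie_output=True):
    """Rough parameter count for a GPT-2-style decoder (assumed architecture)."""
    # Token embeddings plus learned position embeddings
    embed = vocab_size * d_model + max_len * d_model
    # Q, K, V and output projections, each d_model x d_model with a bias
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer MLP with biases
    d_ff = mlp_ratio * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with a scale and a shift vector
    norms = 2 * (2 * d_model)
    block = attn + mlp + norms
    final_norm = 2 * d_model
    # An untied output projection adds another vocab_size x d_model matrix
    head = 0 if tie_output else vocab_size * d_model
    return embed + n_layer * block + final_norm + head

print(estimate_gpt_params(vocab_size=2000, d_model=128, n_layer=4, max_len=64))
```

The result depends strongly on these assumptions and on the real vocabulary size, so it may not match the figure printed by the notebook; the model in bullet-core may use a different layout.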

## 🎯 Step 5: Train Model

Train your Transformer model!

In [None]:
import time

start_time = time.time()

# Copy tokenizer to expected location
!cp my_tokenizer.json bullet_core/marathi_tokenizer.json

# Train
!python bullet_core/train_production.py --config bullet_core/configs/my_model.yaml

training_time = time.time() - start_time

print('\n' + '='*60)
print('✅ Training Complete!')
print('='*60)
print(f'Total time: {training_time/60:.1f} minutes')
print(f'Speed: {500/(training_time/60):.1f} steps/min')

## 📦 Step 6: Export to .bullet Format

Create production-ready model file.

In [None]:
!python test_checkpoints.py

import os

# Find the exported .bullet file (sorted so the pick is deterministic;
# os.listdir returns entries in arbitrary order)
checkpoint_dir = 'my_model_checkpoints'
bullet_files = []
if os.path.isdir(checkpoint_dir):
    bullet_files = sorted(f for f in os.listdir(checkpoint_dir) if f.endswith('.bullet'))

if bullet_files:
    bullet_path = f'{checkpoint_dir}/{bullet_files[-1]}'
    size_mb = os.path.getsize(bullet_path) / (1024*1024)
    
    print('\n✅ Model exported!')
    print(f'📦 File: {bullet_path}')
    print(f'💾 Size: {size_mb:.2f} MB (BQ4 quantized)')
    print('🚀 Ready for deployment!')
    
    # Save path for next step
    model_path = bullet_path
else:
    print(f'❌ Export failed: no .bullet file found in {checkpoint_dir}')
    model_path = None
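
The BQ4 layout itself is defined by bullet-core and is not described in this notebook. As background only, the sketch below shows generic symmetric 4-bit block quantization, the general idea behind storing weights in roughly half a byte each. It is not the BQ4 format, and both function names are hypothetical.

```python
def quantize_block_4bit(values):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude in the block to +/-7; avoid dividing by zero
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the signed 4-bit range [-8, 7]
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_block_4bit(q, scale):
    """Recover approximate floats from 4-bit integers and their scale."""
    return [qi * scale for qi in q]

block = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_block_4bit(block)
print(q, dequantize_block_4bit(q, scale))
```

Because every value in a block shares one scale, the rounding error per weight is at most half a quantization step, which is the usual trade-off between file size and accuracy.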

## 🧪 Step 7: Test Your Model

Generate text and validate quality.

In [None]:
from utils.bullet_io import BulletReader
from python.transformer import GPT
from python.tensor import Tensor
import numpy as np

if model_path:
    # Load model
    print('📥 Loading model...')
    reader = BulletReader(model_path)
    reader.load()
    
    # Create model
    model = GPT(
        vocab_size=len(tokenizer.vocab),
        d_model=128,
        n_head=4,
        n_layer=4,
        max_len=64
    )
    
    # Load weights
    for i, param in enumerate(model.parameters()):
        key = f'param_{i}'
        if key in reader.tensors:
            param.data = reader.tensors[key]
    
    print('✅ Model loaded!\n')
    
    # Test generation
    test_prompts = texts[:3]  # Use first 3 training examples
    
    print('🎨 Generating text:\n')
    for i, prompt in enumerate(test_prompts, 1):
        # Take first few words as prompt
        prompt_text = ' '.join(prompt.split()[:3])
        
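        # Encode
        tokens = tokenizer.encode(prompt_text)
        generated = list(tokens)
        
        # Generate 10 tokens greedily (always pick the highest-scoring token)
        for _ in range(10):
            # Feed at most max_len (64) tokens so the context window is not exceeded
            x = Tensor(np.array([generated[-64:]], dtype=np.int32), requires_grad=False)
            logits = model(x)
            # Convert the NumPy integer to a plain int before storing it
            next_token = int(np.argmax(logits.data[0, -1, :]))
            generated.append(next_token)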
        
        result = tokenizer.decode(generated)
        
        print(f'{i}. Prompt: "{prompt_text}"')
        print(f'   Generated: "{result}"')
        print()
    
    print('✅ Testing complete!')
else:
    print('❌ No model to test')
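
The loop above decodes greedily, so the same prompt always yields the same continuation. A common alternative is temperature-scaled top-k sampling. The sketch below is a minimal pure-Python version that works on any 1-D sequence of logits, such as `logits.data[0, -1, :]`; `sample_top_k` is a hypothetical helper and not part of bullet-core.

```python
import math
import random

def sample_top_k(logits, k=5, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring logits.

    `temperature` must be positive; values below 1 sharpen the distribution,
    values above 1 flatten it.
    """
    scores = [float(x) / temperature for x in logits]
    # Indices of the k largest scores, best first
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    # Softmax weights over the kept scores (subtract the max for stability)
    m = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With k=1 this reduces to greedy decoding
print(sample_top_k([0.1, 3.0, -2.0], k=1))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible between runs.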

## 📊 Step 8: Validation Report

Check model quality metrics.

In [None]:
if model_path:
    print('📊 Model Validation Report\n')
    print('='*60)
    
    # Dataset stats
    print(f'Dataset: {num_examples} examples')
    print(f'Tokenizer: {len(tokenizer.vocab)} vocab size')
    print(f'Model: 128d, 4 heads, 4 layers (~500K params)')
    print(f'Training: 500 steps')
    print(f'File size: {size_mb:.2f} MB (BQ4)')
    
    # Quality check
    print('\n' + '='*60)
    print('Quality Checklist:')
    print('  ✅ Model trains without errors')
    print('  ✅ .bullet file created successfully')
    print('  ✅ Model loads and generates text')
    
    # Recommendations
    print('\n' + '='*60)
    print('Recommendations:')
    if num_examples < 100:
        print('  ⚠️  Add more training data (100+ examples recommended)')
    else:
        print('  ✅ Dataset size is good')
    
    print('  💡 Train longer (1000+ steps) for better quality')
    print('  💡 Increase model size for more complex tasks')
    print('  💡 Use repetition penalty during inference')
    
    print('\n' + '='*60)
    print('✅ Validation Complete!')
else:
    print('❌ No model to validate')
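
The report recommends a repetition penalty during inference. One widely used form lowers the score of every token that has already been generated before the next token is picked. The sketch below is a minimal pure-Python illustration; the function names and the default `penalty=1.2` are assumptions, and bullet-core may provide its own mechanism.

```python
def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Return a new list of logits with already-generated token ids penalized."""
    out = [float(x) for x in logits]  # copy, so the caller's logits are untouched
    for token_id in set(generated):
        if out[token_id] > 0:
            out[token_id] /= penalty   # shrink positive scores
        else:
            out[token_id] *= penalty   # push negative scores further down
    return out

def greedy_pick(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

logits = [2.0, 1.9, -1.0]
penalized = apply_repetition_penalty(logits, generated=[0, 2], penalty=1.5)
print(greedy_pick(logits), greedy_pick(penalized))
```

In the Step 7 loop, this would be applied to `logits.data[0, -1, :]` with the current `generated` list just before choosing `next_token`.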

## 💾 Step 9: Download Your Model

Get all your files for deployment.

In [None]:
from google.colab import files
import os
import zipfile

if model_path:
    # Create deployment package
    print('📦 Creating deployment package...\n')
    
    with zipfile.ZipFile('my_bullet_model.zip', 'w') as zipf:
        zipf.write(model_path, os.path.basename(model_path))
        zipf.write('my_tokenizer.json', 'tokenizer.json')
        zipf.write(dataset_path, 'training_data.jsonl')
    
    print('Files included:')
    print(f'  - {os.path.basename(model_path)} (model)')
    print(f'  - tokenizer.json')
    print(f'  - training_data.jsonl\n')
    
    # Download
    files.download('my_bullet_model.zip')
    
    print('✅ Download started!')
    print('\nYou can now:')
    print('  1. Run inference on any computer')
    print('  2. Deploy to mobile/web')
    print('  3. Share with others')
    print('  4. Continue training')
else:
    print('❌ No model to download')
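
Before sharing the package, it can help to confirm that the archive really contains the files listed above. The sketch below uses the standard-library `zipfile` module; `missing_from_zip` is a hypothetical helper name.

```python
import zipfile

def missing_from_zip(zip_path, expected_names):
    """Return the expected archive member names that are absent from the zip."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    return [name for name in expected_names if name not in names]
```

For this notebook, the expected names would be the model's base filename plus `tokenizer.json` and `training_data.jsonl`; an empty result means the package is complete.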

## 🔍 Step 10: Automated Validation Checklist

Run this to verify everything is working correctly.

In [None]:
# ====================================================
# 🔍 BULLET TRAINER AUTOMATED CHECKLIST
# ====================================================

import os
import json
import numpy as np

print('\n' + '='*60)
print('🔍 BULLET TRAINER AUTOMATED CHECKLIST')
print('='*60 + '\n')

errors = []
warnings = []

# 1 - Check dataset
if os.path.exists('training_data.jsonl'):
    print('✅ Dataset Found')
    try:
        with open('training_data.jsonl', 'r', encoding='utf-8') as f:
            lines = [line for line in f if line.strip()]
        for line in lines:
            json.loads(line)  # raises if any line is not valid JSON
        print(f'✅ Dataset Valid ({len(lines)} examples)')
        if len(lines) < 100:
            warnings.append(f'Dataset has only {len(lines)} examples. Recommend 100+ for better quality.')
    except Exception as e:
        errors.append(f'Dataset format error: {e}')
else:
    errors.append('Dataset file not found')

# 2 - Check tokenizer
if os.path.exists('my_tokenizer.json'):
    print('✅ Tokenizer Found')
    try:
        with open('my_tokenizer.json', 'r', encoding='utf-8') as f:
            tok_data = json.load(f)
        vocab_size = len(tok_data.get('vocab', {}))
        print(f'✅ Tokenizer Valid (vocab: {vocab_size})')
    except Exception as e:
        errors.append(f'Tokenizer error: {e}')
else:
    errors.append('Tokenizer file not found')

# 3 - Check config
if os.path.exists('bullet_core/configs/my_model.yaml'):
    print('✅ Config Found')
    try:
        with open('bullet_core/configs/my_model.yaml', 'r') as f:
            config_text = f.read()
        required = ['hidden_size', 'num_heads', 'num_layers', 'vocab_size', 'max_steps']
        missing = [r for r in required if r not in config_text]
        for r in missing:
            errors.append(f'Config missing: {r}')
        # Only report the config as valid when no required key is missing
        if not missing:
            print('✅ Config Valid')
    except Exception as e:
        errors.append(f'Config error: {e}')
else:
    errors.append('Config file not found')

# 4 - Check model checkpoint
if os.path.exists('my_model_checkpoints'):
    checkpoints = [f for f in os.listdir('my_model_checkpoints') if f.endswith('.pkl')]
    if checkpoints:
        print(f'✅ Training Checkpoints Found ({len(checkpoints)} files)')
    else:
        warnings.append('No .pkl checkpoints found. Did training complete?')
else:
    warnings.append('Checkpoint directory not found. Training may not have run.')

# 5 - Check .bullet file
if os.path.exists('my_model_checkpoints'):
    # Sort so the selected file is deterministic (os.listdir order is arbitrary)
    bullet_files = sorted(f for f in os.listdir('my_model_checkpoints') if f.endswith('.bullet'))
    if bullet_files:
        bullet_path = f'my_model_checkpoints/{bullet_files[-1]}'
        size_mb = os.path.getsize(bullet_path) / (1024*1024)
        print(f'✅ .bullet File Created ({size_mb:.2f} MB)')
        
        # Try loading
        try:
            from utils.bullet_io import BulletReader
            reader = BulletReader(bullet_path)
            reader.load()
            print(f'✅ .bullet File Loads Successfully ({len(reader.tensors)} tensors)')
        except Exception as e:
            errors.append(f'.bullet load error: {e}')
    else:
        errors.append('.bullet file not created. Export may have failed.')

# 6 - Test inference
try:
    if 'model' in dir() and 'tokenizer' in dir():
        test_prompt = 'test'
        tokens = tokenizer.encode(test_prompt)
        from python.tensor import Tensor
        x = Tensor(np.array([tokens[:5]], dtype=np.int32), requires_grad=False)
        logits = model(x)
        print('✅ Inference Test Passed')
    else:
        warnings.append('Model/tokenizer not loaded. Inference test skipped.')
except Exception as e:
    warnings.append(f'Inference test failed: {e}')

# 7 - Check deployment package
if os.path.exists('my_bullet_model.zip'):
    zip_size = os.path.getsize('my_bullet_model.zip') / (1024*1024)
    print(f'✅ Deployment Package Created ({zip_size:.2f} MB)')
else:
    warnings.append('Deployment ZIP not created')

# Final Report
print('\n' + '='*60)
if len(errors) == 0:
    print('🎉 ALL CHECKS PASSED - NOTEBOOK IS PRODUCTION READY!')
    print('='*60)
    print('\n✅ Your model is ready to:')
    print('  - Deploy to production')
    print('  - Share with others')
    print('  - Use for inference')
    print('  - Continue training')
else:
    print('❌ ERRORS FOUND:')
    for e in errors:
        print(f'   ❌ {e}')
    print('='*60)

if len(warnings) > 0:
    print('\n⚠️  WARNINGS:')
    for w in warnings:
        print(f'   ⚠️  {w}')

print('\n' + '='*60)
print(f'Summary: {len(errors)} errors, {len(warnings)} warnings')
print('='*60)
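
The config check above looks for each required name anywhere in the file text, so a key that appears only in a comment or inside another key's name would still pass. A stricter variant parses the flat `key: value` lines first. The sketch below handles only the simple flat layout written in Step 4 (no nesting or other YAML features), and both helper names are hypothetical.

```python
def parse_flat_config(text):
    """Parse flat 'key: value' lines into a dict of strings."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without a key/value separator
        if not line or line.startswith('#') or ':' not in line:
            continue
        key, value = line.split(':', 1)
        config[key.strip()] = value.strip().strip('"')
    return config

def missing_keys(text, required):
    """Return the required keys that are not defined as real config entries."""
    config = parse_flat_config(text)
    return [key for key in required if key not in config]

sample = 'hidden_size: 128\n# vocab_size goes here later\nmax_steps: 500\n'
print(missing_keys(sample, ['hidden_size', 'vocab_size', 'max_steps']))
```

In the sample, `vocab_size` appears only in a comment, so the parsed check reports it as missing where a plain substring test would not.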

---

## 🎉 Success!

You've completed the full pipeline:

✅ Loaded custom dataset  
✅ Trained BPE tokenizer  
✅ Trained Transformer model  
✅ Exported to .bullet format  
✅ Tested and validated  
✅ Downloaded deployment package  

### 📚 Next Steps:

- **Train Longer**: Increase `max_steps` to 2000+
- **More Data**: Add 100+ training examples
- **Bigger Model**: Increase `hidden_size` to 256
- **Deploy**: Use the [User Manual](https://github.com/iShrikantBhosale/bullet-core/blob/master/BULLET_USER_MANUAL.md)

### 🔗 Resources:

📘 [User Manual](https://github.com/iShrikantBhosale/bullet-core/blob/master/BULLET_USER_MANUAL.md)  
📖 [Education Manual](https://github.com/iShrikantBhosale/bullet-core/blob/master/BULLET_EDUCATION_MANUAL.md)  
💻 [GitHub](https://github.com/iShrikantBhosale/bullet-core)  
🌐 [Website](https://ishrikantbhosale.github.io/bullet-core/)  

---

**Created by Shrikant Bhosale** | Mentored by [Hintson.com](https://hintson.com)  
🇮🇳 Made in India | Democratizing AI  
© 2025 Bullet OS | MIT License