# Cortex Compliance AI - Fine-Tuning via HuggingFace AutoTrain API

No GPU required! HuggingFace handles the training on their servers.

## Quick Start:
1. Get a HuggingFace token with WRITE access
2. Run all cells
3. Wait for training to complete (~30-60 min)

In [None]:
# Step 1: Install dependencies (minimal - no GPU packages needed)
!pip install -q huggingface_hub datasets

In [None]:
# Step 2: Login to Hugging Face
from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Step 3: Load training data: 217 examples built from Russian business document templates
# Includes: Contracts, Corporate Docs, Financial, HR, Legal, Tax, Industry, Specialized and more

import json

# Download training data from GitHub (217 examples from 265 templates)
!wget -q https://raw.githubusercontent.com/maanisingh/cortex-compliance-ai/main/combined_training_data.jsonl -O training_data.jsonl

# Load training data
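TRAINING_DATA = []
# Read as UTF-8 explicitly: the templates contain Cyrillic text
with open('training_data.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, which would make json.loads fail
            TRAINING_DATA.append(json.loads(line))

print(f"Loaded {len(TRAINING_DATA)} training examples")
print("\nSample instructions:")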
for i, item in enumerate(TRAINING_DATA[:5]):
    print(f"  {i+1}. {item['instruction'][:60]}...")
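
Before uploading, it can help to confirm that every record carries the two fields the later cells rely on (`instruction` and `output`). The following is a minimal sketch, assuming each JSONL line is a flat object with string values; `find_invalid_records` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_invalid_records(records, required=("instruction", "output")):
    """Return (index, reason) pairs for records missing a required non-empty string field."""
    problems = []
    for i, record in enumerate(records):
        for key in required:
            value = record.get(key) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty '{key}'"))
    return problems


# Tiny in-memory sample (illustrative data, not taken from the real file)
sample = [
    {"instruction": "Draft a supply contract", "output": "CONTRACT ..."},
    {"instruction": "Draft an NDA", "output": ""},
    {"output": "ORDER ..."},
]
for index, reason in find_invalid_records(sample):
    print(f"Record {index}: {reason}")
```

In the notebook, calling `find_invalid_records(TRAINING_DATA)` right after the load step would flag such rows before they reach the Hub.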

In [None]:
# Step 4: Upload dataset to HuggingFace
from datasets import Dataset
from huggingface_hub import whoami

# Format for fine-tuning
def format_for_training(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"}

dataset = Dataset.from_list(TRAINING_DATA)
dataset = dataset.map(format_for_training)

# Upload to your HuggingFace account
hf_user = whoami()["name"]
dataset_repo = f"{hf_user}/cortex-compliance-data"

dataset.push_to_hub(dataset_repo, private=True)
print(f"✅ Dataset uploaded to: https://huggingface.co/datasets/{dataset_repo}")
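
The training text and the test prompt used later share the same `### Instruction:` / `### Response:` layout, so keeping the template in one place avoids the two drifting apart. A minimal sketch of that idea follows; `PROMPT_TEMPLATE`, `build_prompt`, and `format_example` are hypothetical names introduced here for illustration, and the record is made-up data.

```python
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(instruction: str) -> str:
    """Prompt prefix as it would be sent at inference time (no answer yet)."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def format_example(example: dict) -> dict:
    """Full training text: the prompt prefix followed by the expected output."""
    return {"text": build_prompt(example["instruction"]) + example["output"]}


# Illustrative record, not taken from the real dataset
record = {"instruction": "Draft an NDA", "output": "NON-DISCLOSURE AGREEMENT ..."}
print(format_example(record)["text"])
```

Under this arrangement `format_example` could be passed to `dataset.map` in place of an inline f-string, and `build_prompt` reused when testing the model.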

In [None]:
# Step 5: Start AutoTrain job via API
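import os
import requests
from getpass import getpass

hf_user = whoami()["name"]
# Read the token from the environment; otherwise prompt without echoing it
HF_TOKEN = os.environ.get("HF_TOKEN") or getpass("Enter your HuggingFace token: ")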

# AutoTrain API endpoint (check this URL and the payload fields against the current AutoTrain docs)
response = requests.post(
    "https://huggingface.co/api/autotrain/create_project",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={
        "username": hf_user,
        "project_name": "cortex-compliance-ai",
        "task": "llm-sft",  # Supervised fine-tuning
        "base_model": "mistralai/Mistral-7B-Instruct-v0.2",
        "hub_dataset": f"{hf_user}/cortex-compliance-data",
        "text_column": "text",
        "train_split": "train",
        "params": {
            "epochs": 3,
            "lr": 2e-4,
            "batch_size": 2,
            "use_peft": True,
            "quantization": "int4",
        }
    }
)

if response.status_code == 200:
    print("✅ Training started!")
    print(f"📊 Monitor at: https://huggingface.co/{hf_user}/cortex-compliance-ai")
else:
    print(f"❌ Error: {response.text}")
    print("\n🔄 Alternative: Go to https://huggingface.co/autotrain and create manually")

In [None]:
# Step 6: Check training status
print("⏳ Training typically takes 30-60 minutes...")
print(f"📊 Check progress at: https://huggingface.co/{hf_user}/cortex-compliance-ai")
print("\nThe model will automatically appear in your HuggingFace account when done.")
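
Rather than re-running a cell by hand, the wait could be automated with a small polling loop. This is a sketch only: `wait_until` is a hypothetical helper, and the stand-in `fake_check` replaces whatever real readiness test is used.

```python
import time


def wait_until(check, timeout_s=3600, interval_s=60, sleep=time.sleep, clock=time.monotonic):
    """Call check() every interval_s seconds until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in check that succeeds on the third call (illustrative only)
calls = {"n": 0}


def fake_check():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_check, timeout_s=5, interval_s=0))
```

In practice `check` might wrap a call such as `HfApi().model_info(repo_id)` and return `False` while the repository cannot be found. Whether the model repository only appears once training has finished is an assumption to verify before relying on it.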

In [None]:
# Step 7: Test the model (run after training completes)
# You can test via the HuggingFace Inference API

import os
import requests
from getpass import getpass
from huggingface_hub import whoami

hf_user = whoami()["name"]
HF_TOKEN = os.environ.get("HF_TOKEN") or getpass("Enter your HuggingFace token: ")

test_prompt = "### Instruction:\nGenerate a Personal Data Processing Policy for ООО Тест (INN: 1234567890)\n\n### Response:\n"

response = requests.post(
    f"https://api-inference.huggingface.co/models/{hf_user}/cortex-compliance-ai",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},
    json={"inputs": test_prompt, "parameters": {"max_new_tokens": 300}}
)

if response.status_code == 200:
    print(response.json()[0]["generated_text"])
else:
    print(f"Model not ready yet. Status: {response.status_code}")
    print("Wait for training to complete, then run this cell again.")
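
Text-generation responses may echo the prompt at the start of `generated_text`, so the answer often needs to be separated from the instruction. A minimal sketch, assuming the model keeps the `### Response:` marker from the training format; `extract_response` is a hypothetical helper and the sample string is made up.

```python
RESPONSE_MARKER = "### Response:\n"


def extract_response(generated_text: str, marker: str = RESPONSE_MARKER) -> str:
    """Return the text after the first response marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Illustrative output shaped like the training format
full = "### Instruction:\nDraft an NDA\n\n### Response:\nNON-DISCLOSURE AGREEMENT\n"
print(extract_response(full))
```

Applied to the cell above, this would be `extract_response(response.json()[0]["generated_text"])`.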

## Done!

Your fine-tuned model will be available at: `https://huggingface.co/{your-username}/cortex-compliance-ai`

**Benefits of AutoTrain API:**
- No GPU required locally
- HuggingFace handles all infrastructure
- Model automatically hosted for inference
- Use via Inference API in your Cortex GRC backend