# SMS Claim Extraction - Training on Colab

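This notebook runs all four approaches for the claim extraction research: entity-based NER, claim-based NER, and contrastive learning are trained here, while the hybrid NER + LLM approach (3a and 3b) is inference only.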

## 🚀 Setup Instructions

**Before running:**
1. Replace `YOUR_PROJECT_ID` with your GCP project ID
2. Replace `your-bucket-name` with your GCS bucket name
3. Make sure your GCS bucket exists and you have write permissions

**What this notebook does:**
- Uses K-Fold Cross-Validation (5 folds) for robust training
- Keeps test set BLIND until final evaluation
- Saves all checkpoints to Google Cloud Storage
- No more Drive storage issues!
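
The split logic inside `train_kfold.py` is not shown in this notebook, so the following is only a minimal sketch of the idea behind the first two bullets: set the test indices aside first, then build 5 train/validation folds from what is left. The function name `make_splits` and its defaults (`test_fraction=0.2`, `seed=42`) are assumptions made for illustration.

```python
import random

def make_splits(n_samples, n_folds=5, test_fraction=0.2, seed=42):
    """Hold out a blind test set, then split the rest into k folds.

    Returns (test_idx, folds), where folds is a list of (train_idx, val_idx).
    Defaults are illustrative assumptions, not values from train_kfold.py.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Reserve the test indices first; they never appear in any fold.
    n_test = int(n_samples * test_fraction)
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:]

    folds = []
    for k in range(n_folds):
        # Every n_folds-th remaining example (offset k) is validation for fold k.
        val_idx = dev_idx[k::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in dev_idx if i not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds

test_idx, folds = make_splits(n_samples=100)
for k, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {k}: train={len(train_idx)} val={len(val_idx)} blind test={len(test_idx)}")
```

Because the test indices are removed before any fold is built, model selection across folds cannot leak information from the final evaluation set.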

In [None]:
# Clone repository
!git clone https://github.com/iamdiluxedbutcooler/sms-claim-check.git
%cd sms-claim-check

In [None]:
# Install dependencies (including GCS support)
!pip install -q transformers datasets torch scikit-learn pandas numpy seaborn matplotlib openai python-dotenv evaluate accelerate sentencepiece seqeval google-cloud-storage

## Setup Google Cloud Storage

Authenticate and configure GCS bucket for saving checkpoints and results.

## Update Code (if needed)

Run this cell ONLY if you need to pull the latest code updates. It backs up the `experiments` folder first.

In [None]:
# Back up experiments before updating code.
# Drive is not mounted in this notebook, so the copy goes to the local Colab disk.
!cp -r experiments /content/backup_experiments_$(date +%Y%m%d_%H%M%S) 2>/dev/null || echo "No experiments to back up yet"

# Pull latest code
!git pull origin main

# IMPORTANT: Restart runtime after pulling to reload modules
print("\n[WARNING] After pulling, go to Runtime > Restart runtime to reload updated code!")
print("Then continue from where you left off.")

In [None]:
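# QUICK FIX: Reload modules without restarting runtime
import sys
import importlib

# Remove cached modules (the `src` package itself and all of its submodules)
modules_to_reload = [m for m in sys.modules if m == 'src' or m.startswith('src.')]
for module in modules_to_reload:
    del sys.modules[module]

# Let the import system notice files changed by `git pull`, then re-import
importlib.invalidate_caches()
import src.models
import src.data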

print("[OK] Modules reloaded! Continue training.")

In [None]:
# Authenticate with Google Cloud
from google.colab import auth
auth.authenticate_user()

# Configure GCS
import os
os.environ['GCLOUD_PROJECT'] = 'YOUR_PROJECT_ID'  # Replace with your GCP project ID

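# Create the GCS client
from google.cloud import storage
client = storage.Client(project=os.environ['GCLOUD_PROJECT'])

# Set your bucket name
GCS_BUCKET_NAME = 'your-bucket-name'  # Replace with your bucket name
bucket = client.bucket(GCS_BUCKET_NAME)

# client.bucket() makes no API call, so check that the bucket is actually reachable
if not bucket.exists():
    raise RuntimeError(f"GCS bucket not found: {GCS_BUCKET_NAME}")

print(f"✓ Authenticated and connected to GCS bucket: {GCS_BUCKET_NAME}")
print(f"✓ Project: {os.environ['GCLOUD_PROJECT']}")

# GCS has no real folders: the checkpoints/ and results/ prefixes appear on first upload
print("\nBucket ready for checkpoints and results!")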

In [None]:
# Check GPU
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")

## Optional: Download Checkpoint from GCS

If you need to resume training or load a previous checkpoint:

In [None]:
from pathlib import Path
from google.cloud import storage

def download_from_gcs(gcs_folder_path, local_folder):
    """Download every object under a GCS prefix into a local folder"""
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET_NAME)

    # A trailing slash stops the prefix from matching sibling folders
    # (e.g. 'run_latest' would otherwise also match 'run_latest_old')
    prefix = gcs_folder_path.rstrip('/') + '/'
    blobs = bucket.list_blobs(prefix=prefix)

    file_count = 0
    for blob in blobs:
        # Skip zero-byte "folder" placeholder objects
        if blob.name.endswith('/'):
            continue

        # Create local path
        relative_path = blob.name[len(prefix):]
        local_path = Path(local_folder) / relative_path

        # Create parent directories
        local_path.parent.mkdir(parents=True, exist_ok=True)

        # Download file
        blob.download_to_filename(str(local_path))
        file_count += 1

    print(f'[DOWNLOADED] {file_count} files from gs://{GCS_BUCKET_NAME}/{gcs_folder_path} -> {local_folder}')

# Example: Download latest checkpoint for approach 1
# download_from_gcs('checkpoints/approach1_entity_ner_latest', 'experiments/approach1_entity_ner')

print("✓ Download function ready. Uncomment example above to use.")

In [None]:
# Setup GCS auto-save function
import shutil
from pathlib import Path
from google.cloud import storage
from datetime import datetime
import os

def upload_folder_to_gcs(local_folder, gcs_folder_path):
    """Upload entire folder to GCS"""
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET_NAME)
    
    local_path = Path(local_folder)
    if not local_path.exists():
        print(f"[SKIP] Folder not found: {local_folder}")
        return
    
    file_count = 0
    for file_path in local_path.rglob('*'):
        if file_path.is_file():
            # Blob path is relative to the uploaded folder itself, so files land
            # directly under gcs_folder_path (as_posix keeps '/' separators)
            relative_path = file_path.relative_to(local_path).as_posix()
            blob_path = f"{gcs_folder_path}/{relative_path}"
            
            # Upload file
            blob = bucket.blob(blob_path)
            blob.upload_from_filename(str(file_path))
            file_count += 1
    
    print(f'[SAVED] {file_count} files from {local_folder} -> gs://{GCS_BUCKET_NAME}/{gcs_folder_path}')

def save_checkpoint(approach_name):
    """Save checkpoint to GCS"""
    source = f'experiments/{approach_name}'
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    gcs_path = f'checkpoints/{approach_name}_{timestamp}'
    
    upload_folder_to_gcs(source, gcs_path)
    
    # Also save as "latest"
    latest_path = f'checkpoints/{approach_name}_latest'
    upload_folder_to_gcs(source, latest_path)
    
    print('✓ Checkpoint saved to GCS')

def save_all_results():
    """Save all results to GCS"""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    upload_folder_to_gcs('experiments', f'results/experiments_{timestamp}')
    upload_folder_to_gcs('experiments', 'results/experiments_latest')
    print('[SAVED] All results -> GCS')

print('✓ GCS auto-save setup complete!')
print(f'✓ Bucket: gs://{GCS_BUCKET_NAME}')
print('✓ Checkpoints will be saved to: checkpoints/<approach>_<timestamp>/')
print('✓ Results will be saved to: results/experiments_<timestamp>/')
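
Since every checkpoint is stored under `checkpoints/<approach>_<timestamp>/` with a `%Y%m%d_%H%M%S` timestamp, the newest timestamped copy can be picked out from a list of object names. The sketch below works on plain strings; the helper name `latest_checkpoint_prefix` and the object names in the example are made up for illustration.

```python
import re
from datetime import datetime

# Layout written by save_checkpoint: checkpoints/<approach>_<YYYYmmdd_HHMMSS>/...
_STAMPED = re.compile(r"^checkpoints/(?P<approach>[^/]+)_(?P<stamp>\d{8}_\d{6})/")

def latest_checkpoint_prefix(blob_names, approach_name):
    """Return the newest timestamped checkpoint prefix for an approach, or None."""
    newest = None
    for name in blob_names:
        match = _STAMPED.match(name)
        if not match or match.group("approach") != approach_name:
            continue
        stamp = datetime.strptime(match.group("stamp"), "%Y%m%d_%H%M%S")
        if newest is None or stamp > newest[0]:
            newest = (stamp, f"checkpoints/{approach_name}_{match.group('stamp')}")
    return newest[1] if newest else None

# Hypothetical object names, for illustration only
names = [
    "checkpoints/approach1_entity_ner_20240101_120000/config.json",
    "checkpoints/approach1_entity_ner_20240315_093000/config.json",
    "checkpoints/approach1_entity_ner_latest/config.json",
    "checkpoints/approach2_claim_ner_20240401_000000/config.json",
]
print(latest_checkpoint_prefix(names, "approach1_entity_ner"))
# prints checkpoints/approach1_entity_ner_20240315_093000
```

In the notebook, the names could come from `[b.name for b in client.list_blobs(GCS_BUCKET_NAME, prefix='checkpoints/')]`, and the returned prefix can be passed to `download_from_gcs`.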

## Approach 1: Entity-based NER

In [None]:
!python train_kfold.py --config configs/entity_ner.yaml --n_folds 5

# Save checkpoint to GCS
save_checkpoint('approach1_entity_ner')

## Approach 2: Claim-based NER

In [None]:
!python train_kfold.py --config configs/claim_ner.yaml --n_folds 5

# Save checkpoint to GCS
save_checkpoint('approach2_claim_ner')

## Approach 4: Contrastive Learning

In [None]:
!python train_kfold.py --config configs/contrastive.yaml --n_folds 5

# Save checkpoint to GCS
save_checkpoint('approach4_contrastive')

## Approach 3a: Hybrid Entity + LLM (Inference Only)

In [None]:
# Set OpenAI API key (prompted at runtime so the key is not stored in the notebook)
import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('OpenAI API key: ')

!python inference.py --config configs/hybrid_llm.yaml --model experiments/approach1_entity_ner/best_model

# Save results to GCS
save_checkpoint('approach3_hybrid_llm')

## Approach 3b: Hybrid Claim + LLM (Inference Only)

In [None]:
!python inference.py --config configs/hybrid_claim_llm.yaml --model experiments/approach2_claim_ner/best_model

## Compare Results

In [None]:
!python scripts/compare_models.py

# Save final comparison to GCS
save_all_results()
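
What `scripts/compare_models.py` reports is not shown in this notebook. A common way to compare models trained with 5-fold cross-validation is the mean and standard deviation of a per-fold score; the sketch below assumes that convention. The helper name `summarize_folds` is hypothetical, and the numbers are placeholders, not results from this project.

```python
from statistics import mean, stdev

def summarize_folds(scores):
    """Return (mean, sample standard deviation) for a list of per-fold scores."""
    if not scores:
        raise ValueError("need at least one fold score")
    # stdev() needs at least two values; a single fold has no spread to report
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread

# Placeholder values only, one per fold
avg, spread = summarize_folds([0.5, 0.6, 0.7, 0.6, 0.6])
print(f"score = {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two approaches is larger than the fold-to-fold variation.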

## Download Results

In [None]:
# Final save to GCS
save_all_results()

# Optional: Also zip and download locally
!zip -r results.zip experiments/
from google.colab import files
files.download('results.zip')

print('[COMPLETE] All results saved to GCS!')
print(f'View in console: https://console.cloud.google.com/storage/browser/{GCS_BUCKET_NAME}/results/')
print(f'Checkpoints: https://console.cloud.google.com/storage/browser/{GCS_BUCKET_NAME}/checkpoints/')