# CMCL Recommender System - GPU Run (Goodreads Poetry)

This notebook trains the Cross-Modal Contrastive Learning (CMCL) model on the Goodreads Poetry dataset using a GPU, alongside the CTRLite and CDL models for comparison. It is written to run on Google Colab.

## 1. Setup Codebase
Clone the repository from GitHub and install it in editable mode.

In [None]:
!git clone https://github.com/mohdfaour03/PGMS_for_Recommender_Systems.git
%cd PGMS_for_Recommender_Systems
!pip install -e .

## 2. Setup and Dependencies
Install the Python packages the pipeline depends on.

In [None]:
!pip install torch pandas numpy scipy scikit-learn tqdm pyyaml requests

## 3. Imports and Configuration

In [None]:
import sys
from pathlib import Path
import json
import torch

# Ensure the current directory is in the path so we can import coldstart
sys.path.append(".")

try:
    from coldstart.src import pipeline
    from coldstart.src.notebook_utils import build_goodreads_interaction_frame, _read_simple_yaml
except ImportError:
    print("❌ ERROR: Could not import 'coldstart'. Please ensure the repository was cloned successfully.")
    raise

# Check for GPU
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ WARNING: GPU not detected. Go to Runtime > Change runtime type > T4 GPU to enable it.")
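
The check above only reports whether a GPU is visible. A minimal sketch of turning that check into a device string that later cells could branch on follows. `pick_device` is a hypothetical helper, not part of the `coldstart` package, and it falls back to CPU when PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # Hypothetical helper for illustration; not part of the coldstart package.
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so there is no GPU path
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The pipeline call later in this notebook takes `prefer_gpu=True` rather than a device string, so this value is only useful for logging or for guarding GPU-only cells.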

## 4. Data Loading
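Load the Goodreads Poetry dataset, capped at 300,000 interactions; the GPU makes this larger limit practical. The interaction frame is built once and cached as a CSV, so later runs reuse the existing file.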

In [None]:
DATA_DIR = Path("coldstart/data")
DATA_DIR.mkdir(parents=True, exist_ok=True)
GOODREADS_PATH = DATA_DIR / "goodreads_poetry.csv"
RUN_DIR = Path("coldstart/output/goodreads_gpu_run")
RUN_DIR.mkdir(parents=True, exist_ok=True)

if not GOODREADS_PATH.exists():
    print("Downloading/Loading Goodreads Poetry dataset...")
    # Using 300k interactions for a good balance of speed and performance
    df = build_goodreads_interaction_frame(genre="poetry", limit=300000)
    df.to_csv(GOODREADS_PATH, index=False)
    print(f"Saved Goodreads Poetry dataset to {GOODREADS_PATH}")
else:
    print("Goodreads Poetry dataset already exists.")
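
Because the cell above skips the download whenever the CSV exists, a stale or truncated file from an earlier run would be reused silently. The sketch below shows one way to inspect a cached CSV before training. `summarize_csv` is a hypothetical helper, and the column names in the demo file are made up, since the real schema produced by `build_goodreads_interaction_frame` is not shown here.

```python
import csv
import tempfile
from pathlib import Path


def summarize_csv(path: Path) -> tuple[list[str], int]:
    """Return the header columns and the number of data rows in a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows at all
        n_rows = sum(1 for _ in reader)
    return header, n_rows


# Demo on a throwaway file; these column names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text(
        "user_id,item_id,rating\nu1,b1,5\nu2,b1,3\n", encoding="utf-8"
    )
    demo_header, demo_rows = summarize_csv(demo_path)

print(demo_header, demo_rows)
```

In the notebook, calling `summarize_csv(GOODREADS_PATH)` would show whether the cached file holds roughly the expected number of interactions; deleting the file forces a rebuild on the next run.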

## 5. Pipeline Execution
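Prepare the dataset, then train and evaluate the CTRLite, CDL, and CMCL models. `prefer_gpu=True` with the `torch` backend requests the GPU, and the batch sizes in `mf_cfg` are raised to make use of it.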

In [None]:
# Load base config
config = _read_simple_yaml("coldstart/configs/base.yaml")

# Prepare Dataset
print("Preparing dataset...")
pipeline.prepare_dataset(
    GOODREADS_PATH,
    RUN_DIR,
    tfidf_params=config.get("tfidf", {}),
    cold_item_frac=0.2,
    seed=42,
    interaction_limit=300000,
)

# Train and Evaluate
print("Starting training on GPU...")
metrics = pipeline.train_and_evaluate_content_model(
    RUN_DIR,
    k_factors=16,
    k_eval=5,
    mf_reg=float(config.get("mf", {}).get("reg", 0.02)),
    mf_iters=30,  # Full iteration count
    mf_lr=float(config.get("mf", {}).get("lr", 0.02)),
    seed=42,
    ctrlite_reg=float(config.get("ctrlite", {}).get("reg", 0.01)),
    ctrlite_lr=float(config.get("ctrlite", {}).get("lr", 0.1)),
    ctrlite_iters=80,  # Full iteration count
    adaptive=False,
    model="ctrlite,cdl,cmcl",  # Run all three models for comparison
    a2f_cfg=config.get("a2f", {}),
    ctpf_cfg=config.get("ctpf", {}),
    cdl_cfg=config.get("cdl", {}),
    hft_cfg=config.get("hft", {}),
    micm_cfg=config.get("micm", {}),
    cmcl_cfg={"iters": 10},  # More iterations for CMCL
    backend="torch",
    prefer_gpu=True,  # Enable the GPU
    mf_cfg={
        "batch_size": 4096,  # Large batch size for the GPU
        "score_batch_size": 4096,
        "infer_batch_size": 4096,
        "ctrlite_batch_size": 2048,
    },
)
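
The `cold_item_frac=0.2` argument suggests that a fraction of items is held out as cold-start items. The sketch below shows one common way to build such a split: every interaction of a held-out item goes to the test set, so the model never sees that item during training. This illustrates the general technique under that assumption and does not describe what `pipeline.prepare_dataset` does internally; `cold_item_split` and the toy data are hypothetical.

```python
import random


def cold_item_split(interactions, cold_item_frac=0.2, seed=42):
    """Split (user, item) pairs so that held-out items never appear in training."""
    # Sort first so the shuffle is reproducible for a given seed.
    items = sorted({item for _, item in interactions})
    rng = random.Random(seed)
    rng.shuffle(items)
    # Hold out at least one item, even for very small catalogues.
    n_cold = max(1, round(len(items) * cold_item_frac))
    cold_items = set(items[:n_cold])
    train = [pair for pair in interactions if pair[1] not in cold_items]
    test = [pair for pair in interactions if pair[1] in cold_items]
    return train, test, cold_items


# Toy data: 4 users who each interacted with the same 10 items.
toy = [(f"u{u}", f"b{b}") for u in range(4) for b in range(10)]
train, test, cold_items = cold_item_split(toy)
print(len(cold_items), len(train), len(test))
```

Splitting by item rather than by interaction is what makes the evaluation a cold-start test: a content-based model has to score the held-out items from their text alone.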

## 6. Results

In [None]:
print("\n=== Final Results ===")
print(json.dumps(metrics, indent=2))
with open(RUN_DIR / "metrics.json", "w", encoding="utf-8") as f:
    json.dump(metrics, f, indent=2)
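
Since several models are trained for comparison, a side-by-side table can be easier to scan than raw JSON. The sketch below assumes `metrics` maps each model name to a flat dictionary of metric names and numeric values. That layout is a guess, and the names and numbers in the toy data are placeholders, so adapt the code to the structure actually printed above.

```python
def format_metrics_table(metrics: dict) -> str:
    """Render {model: {metric: value}} as a fixed-width text table."""
    # Collect metric names in first-seen order across all models.
    columns = []
    for scores in metrics.values():
        for name in scores:
            if name not in columns:
                columns.append(name)
    width = max([len("model"), *(len(m) for m in metrics)])
    lines = ["  ".join(["model".ljust(width), *(c.rjust(10) for c in columns)])]
    for model, scores in metrics.items():
        # A dash marks a metric that this model did not report.
        cells = [
            f"{scores[c]:10.4f}" if c in scores else "-".rjust(10)
            for c in columns
        ]
        lines.append("  ".join([model.ljust(width), *cells]))
    return "\n".join(lines)


# Placeholder names and values, for illustration only.
toy_metrics = {
    "model_a": {"metric_x": 0.1, "metric_y": 0.05},
    "model_b": {"metric_x": 0.2},
}
table = format_metrics_table(toy_metrics)
print(table)
```

With the real results, `print(format_metrics_table(metrics))` would work only if the assumed layout holds; nested or non-numeric values would need flattening first.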