# Churn Prediction Model Training - Jupyter Notebook

## Phase 1: Foundational Exploration & Quick Wins (JupyterHub)

This notebook adapts `xgboost_script.py` for interactive development and exploration in a JupyterHub environment running on Kubernetes.

**Goals:**
*   Rapid iteration and data exploration.
*   Initial model development.
*   Demonstrate an interactive ML environment within the Kubernetes cluster.

**Assumptions:**
*   The preprocessed `train.csv` and `test.csv` files are accessible (e.g., via a mounted PVC).
*   Hyperparameters are manually defined in this notebook for now.

### 1. Setup and Imports

In [None]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split # Though script uses pre-split data
from sklearn.metrics import accuracy_score, roc_auc_score
import json
import os
import logging

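# Configure logging (optional for notebooks, but good practice).
# basicConfig is a no-op if the root logger already has handlers, which can
# happen inside a notebook kernel; force=True (Python 3.8+) replaces them.
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', force=True)
logger = logging.getLogger(__name__)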

### 2. Define Parameters and File Paths

Update these paths based on where your data is accessible in the JupyterHub environment.

In [None]:
# --- Hyperparameters (manual definition for notebook) ---
params = {
    'max_depth': 5,        # Corresponds to SM_HP_MAX_DEPTH
    'eta': 0.2,            # Corresponds to SM_HP_ETA (learning rate)
    'min_child_weight': 1, # Corresponds to SM_HP_MIN_CHILD_WEIGHT
    'subsample': 0.8,      # Corresponds to SM_HP_SUBSAMPLE
    'objective': 'binary:logistic',
    'num_round': 100,      # Number of boosting rounds (n_estimators for scikit-learn wrapper)
    # 'eval_metric': ['auc', 'logloss'],  # XGBoost native API can take this
}

# --- File Paths (Update these!) ---
# The default below is a SageMaker-style container path. In JupyterHub, point
# BASE_PATH at your mounted PVC instead, e.g. '/home/jovyan/work'.
BASE_PATH = '/opt/ml/processing'
INPUT_DATA_PATH = os.path.join(BASE_PATH, 'input', 'data') # Path to preprocessed data dir
TRAIN_DATA_FILE = os.path.join(INPUT_DATA_PATH, 'train', 'train.csv')
VALID_DATA_FILE = os.path.join(INPUT_DATA_PATH, 'test', 'test.csv') # Using test set as validation

MODEL_OUTPUT_DIR = os.path.join(BASE_PATH, 'model_notebook') # Where to save the model from notebook
METRICS_OUTPUT_DIR = os.path.join(BASE_PATH, 'output_notebook') # Where to save metrics from notebook

# Ensure output directories exist
os.makedirs(MODEL_OUTPUT_DIR, exist_ok=True)
os.makedirs(METRICS_OUTPUT_DIR, exist_ok=True)

MODEL_FILE_PATH = os.path.join(MODEL_OUTPUT_DIR, 'xgboost-model-notebook.xgb')
METRICS_FILE_PATH = os.path.join(METRICS_OUTPUT_DIR, 'metrics-notebook.json')

logger.info(f"Train data path: {TRAIN_DATA_FILE}")
logger.info(f"Validation data path: {VALID_DATA_FILE}")
logger.info(f"Model output path: {MODEL_FILE_PATH}")
logger.info(f"Metrics output path: {METRICS_FILE_PATH}")
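
The comments in the hyperparameter block map each value to an `SM_HP_*` environment variable. If the notebook should pick up those variables when they happen to be set, a minimal sketch along these lines could work. The helper name `hp_from_env` is hypothetical, and the `SM_HP_<NAME>` naming is assumed from the comments above rather than taken from `xgboost_script.py`.

```python
import os

def hp_from_env(name, default, cast=float):
    """Return SM_HP_<NAME> cast with `cast`, or `default` if unset or empty."""
    raw = os.environ.get(f"SM_HP_{name.upper()}")
    if raw is None or raw == "":
        return default
    return cast(raw)

# Notebook defaults are used unless the matching variable is set.
params_from_env = {
    "max_depth": hp_from_env("max_depth", 5, int),
    "eta": hp_from_env("eta", 0.2),
    "min_child_weight": hp_from_env("min_child_weight", 1),
    "subsample": hp_from_env("subsample", 0.8),
}
print(params_from_env)
```

Keeping the defaults in one place means the same cell runs unchanged whether or not the variables exist.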

### 3. Load Data

In [None]:
try:
    logger.info("Loading training data...")
    train_df = pd.read_csv(TRAIN_DATA_FILE)
    logger.info(f"Training data loaded. Shape: {train_df.shape}")

    logger.info("Loading validation data...")
    valid_df = pd.read_csv(VALID_DATA_FILE)
    logger.info(f"Validation data loaded. Shape: {valid_df.shape}")
except FileNotFoundError as e:
    logger.error(f"Error loading data: {e}. Please check your file paths.")
    raise

# Assumes each CSV has a header row and that the first column is the target
# (named 'Churn' during preprocessing). Adjust if your files differ.
TARGET_COLUMN = train_df.columns[0] # Or explicitly 'Churn' if that's the name
logger.info(f"Identified target column: {TARGET_COLUMN}")

X_train = train_df.drop(columns=[TARGET_COLUMN])
y_train = train_df[TARGET_COLUMN]

X_valid = valid_df.drop(columns=[TARGET_COLUMN])
y_valid = valid_df[TARGET_COLUMN]

logger.info(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
logger.info(f"X_valid shape: {X_valid.shape}, y_valid shape: {y_valid.shape}")

X_train.head()
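
Before training, it can help to confirm that the two splits agree on columns and to look at the class balance, since churn datasets are frequently imbalanced. The following is a minimal sketch on toy frames; `summarize_split` is a hypothetical helper, and the `tenure` column is made up for illustration.

```python
import pandas as pd

def summarize_split(train_df, valid_df, target):
    """Collect basic consistency checks for a train/validation pair."""
    return {
        # Columns present in one split but not the other.
        "column_mismatch": sorted(set(train_df.columns) ^ set(valid_df.columns)),
        # Total number of missing values in each split.
        "train_missing": int(train_df.isna().sum().sum()),
        "valid_missing": int(valid_df.isna().sum().sum()),
        # Fraction of positive labels, assuming a 0/1 target.
        "train_positive_rate": float(train_df[target].mean()),
        "valid_positive_rate": float(valid_df[target].mean()),
    }

# Toy data for illustration only.
toy_train = pd.DataFrame({"Churn": [0, 1, 0, 0], "tenure": [1, 5, 3, None]})
toy_valid = pd.DataFrame({"Churn": [1, 0], "tenure": [2, 4]})
summary = summarize_split(toy_train, toy_valid, "Churn")
print(summary)
```

In the notebook itself, the call would presumably be `summarize_split(train_df, valid_df, TARGET_COLUMN)`.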

### 4. Initialize and Train XGBoost Model

In [None]:
logger.info("Initializing XGBoost model...")
model = xgb.XGBClassifier(
    objective=params['objective'],
    max_depth=params['max_depth'],
    learning_rate=params['eta'], # 'eta' is learning_rate in XGBClassifier
    min_child_weight=params['min_child_weight'],
    subsample=params['subsample'],
    n_estimators=params['num_round'], # 'num_round' is n_estimators
    # eval_metric=params.get('eval_metric', 'logloss'), # Can specify eval_metric for early stopping
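    use_label_encoder=False # Suppresses a warning on older XGBoost releases (roughly 1.3 to 1.5); the parameter was later deprecated and removed, so drop it if your version reports it as unused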
)
logger.info("XGBoost model initialized.")

logger.info("Starting model training...")
# For XGBClassifier, eval_set expects a list of tuples (X, y)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False) # verbose=True for iteration details
logger.info("Model training completed.")
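
The constructor call above renames two keys by hand (`eta` to `learning_rate`, `num_round` to `n_estimators`). When trying several parameter sets, that mapping could be factored into a small helper. This is a sketch only; `to_sklearn_kwargs` is a hypothetical name and covers just the keys used in this notebook.

```python
# Keys as named in `params` on the left, XGBClassifier argument names on the right.
PARAM_RENAMES = {"eta": "learning_rate", "num_round": "n_estimators"}

def to_sklearn_kwargs(native_params):
    """Rename notebook-style keys to their XGBClassifier equivalents."""
    return {PARAM_RENAMES.get(key, key): value for key, value in native_params.items()}

kwargs = to_sklearn_kwargs({"max_depth": 5, "eta": 0.2, "num_round": 100})
print(kwargs)  # {'max_depth': 5, 'learning_rate': 0.2, 'n_estimators': 100}
```

The result could then be passed as `xgb.XGBClassifier(**kwargs)`, which keeps the renaming in one place.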

### 5. Save the Trained Model

In [None]:
logger.info(f"Saving the trained model to {MODEL_FILE_PATH}")
model.save_model(MODEL_FILE_PATH)
logger.info("Model saved successfully.")

### 6. Evaluate the Model

In [None]:
logger.info("Evaluating the model on validation data.")
y_pred = model.predict(X_valid)
y_pred_proba = model.predict_proba(X_valid)[:, 1]

accuracy = accuracy_score(y_valid, y_pred)
auc = roc_auc_score(y_valid, y_pred_proba)

logger.info(f"Model evaluation completed. Accuracy: {accuracy:.4f}, AUC: {auc:.4f}")

# Print metrics (similar to SageMaker HPO format for consistency if desired)
print(f"validation:accuracy: {accuracy}")
print(f"validation:auc: {auc}")
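
Accuracy can look high on an imbalanced target even when few churners are caught, so precision, recall, and the confusion matrix are often worth reporting alongside AUC. The sketch below uses small hand-written label lists rather than the notebook's `y_valid` and `y_pred`, purely to show the calls.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for illustration; in the notebook these would be y_valid and y_pred.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

# For binary 0/1 labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)
print(f"tn={tn} fp={fp} fn={fn} tp={tp} precision={precision:.3f} recall={recall:.3f}")
```

Which of these matters most depends on the relative cost of missing a churner versus contacting a customer who would have stayed.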

### 7. Save Metrics to File

In [None]:
logger.info(f"Saving evaluation metrics to {METRICS_FILE_PATH}")
metrics_data = {'accuracy': accuracy, 'auc': auc}

with open(METRICS_FILE_PATH, 'w') as f:
    json.dump(metrics_data, f, indent=4)
logger.info("Metrics saved successfully.")
print(f"Metrics saved to {METRICS_FILE_PATH}")
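
When several runs are compared, it may be useful to store the hyperparameters next to the metrics so each file is self-describing. The following is a minimal sketch that writes to a temporary directory instead of `METRICS_FILE_PATH`; the record layout and the helper name `save_run_record` are assumptions, not part of the original script, and the numbers are placeholders rather than results.

```python
import json
import os
import tempfile

def save_run_record(path, metrics, hyperparams):
    """Write metrics and the hyperparameters that produced them as one JSON file."""
    record = {
        # float() guards against NumPy scalars json cannot serialize (e.g. float32).
        "metrics": {name: float(value) for name, value in metrics.items()},
        "params": hyperparams,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=4)
    return record

out_path = os.path.join(tempfile.mkdtemp(), "metrics-notebook.json")
# Placeholder values for illustration only.
save_run_record(out_path, {"accuracy": 0.9, "auc": 0.95}, {"max_depth": 5, "eta": 0.2})

with open(out_path) as f:
    loaded = json.load(f)
print(loaded["metrics"], loaded["params"])
```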

### 8. (Optional) Load Model and Test Prediction

In [None]:
logger.info(f"Loading model from {MODEL_FILE_PATH} for a test prediction.")
loaded_model = xgb.XGBClassifier()
loaded_model.load_model(MODEL_FILE_PATH)
logger.info("Model loaded successfully.")

# Make a prediction on the first few validation samples
sample_predictions = loaded_model.predict(X_valid.head())
sample_probas = loaded_model.predict_proba(X_valid.head())[:,1]

logger.info(f"Sample predictions on X_valid.head(): {sample_predictions}")
logger.info(f"Sample probabilities on X_valid.head(): {sample_probas}")

## Next Steps

1.  Ensure your data paths (`TRAIN_DATA_FILE`, `VALID_DATA_FILE`) are correct for your JupyterHub environment.
2.  Experiment with different hyperparameters.
3.  Explore the data further (visualizations, feature importance from the model).
4.  Use this notebook as a prototype for the automated pipeline steps in Phase 2.
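
For the hyperparameter experiments listed above, one simple approach is to enumerate a small grid and train once per combination. The sketch below only builds the candidate dictionaries; the grid values are arbitrary examples, and `iter_param_grid` is a hypothetical helper.

```python
from itertools import product

def iter_param_grid(base_params, grid):
    """Yield a copy of base_params for every combination of values in grid."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield {**base_params, **dict(zip(keys, values))}

base = {"objective": "binary:logistic", "subsample": 0.8}
grid = {"max_depth": [3, 5], "eta": [0.1, 0.2, 0.3]}  # example values only

candidates = list(iter_param_grid(base, grid))
print(len(candidates))  # 2 * 3 = 6 combinations
```

Each candidate could then be trained and scored with the cells above, keeping the combination with the best validation AUC.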