# 🫁 Lung Health Prediction

**Objective**:
This notebook trains the **Pulmonary Risk Model** (XGBoost). It analyzes lifestyle factors (Smoking, Anxiety, Alcohol) to predict Lung Cancer risk.

**Workflow**:
1.  **Ingestion**: Loading Survey Lung Cancer Data.
2.  **Preprocessing**: Encoding binary symptom flags.
3.  **Training**: XGBoost Classification.
4.  **Export**: Production-ready pickle export.
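
The preprocessing step above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. It assumes the raw survey codes symptom flags as 1 (no) and 2 (yes) and the label as "YES"/"NO". The column names `SMOKING`, `ANXIETY`, and `LUNG_CANCER` and the helper `encode_binary_flags` are hypothetical.

```python
import pandas as pd


def encode_binary_flags(raw: pd.DataFrame, flag_cols, label_col="LUNG_CANCER"):
    """Map 1/2 survey codes to 0/1 and a YES/NO label to a 0/1 `target` column.

    Assumption: flags are coded 1 = no, 2 = yes. Adjust the mapping if the
    raw file uses a different convention.
    """
    out = raw.copy()
    for col in flag_cols:
        # Unmapped values become NaN, so the int cast fails loudly on bad codes.
        out[col] = out[col].map({1: 0, 2: 1}).astype("int8")
    label = out[label_col].str.strip().str.upper()
    out["target"] = label.map({"NO": 0, "YES": 1}).astype("int8")
    return out.drop(columns=[label_col])


# Toy rows standing in for the raw survey (values are illustrative only).
raw = pd.DataFrame(
    {
        "SMOKING": [1, 2, 2],
        "ANXIETY": [2, 1, 2],
        "LUNG_CANCER": ["YES", "NO", "YES"],
    }
)
encoded = encode_binary_flags(raw, flag_cols=["SMOKING", "ANXIETY"])
print(encoded)
```

The resulting frame has a `target` column, which matches the column name the training cells in this notebook expect.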

In [None]:
# Core Libraries
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling and Evaluation
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

print("✅ Environment Loaded.")

In [None]:
# Load Pulmonary Data
DATA_FILE = "../data/processed/lungs.parquet"

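if not os.path.exists(DATA_FILE):
    # Stop here: later cells depend on `df` and would otherwise fail with a NameError.
    raise FileNotFoundError(f"❌ Dataset missing at {DATA_FILE}. Please run the Data Pipeline first.")

df = pd.read_parquet(DATA_FILE)
print(f"✅ Data Ingested: {df.shape[0]} rows | {df.shape[1]} columns")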

### Model Architecture: Pulmonary XGBoost
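This model aims for high recall on the positive class, because a missed cancer case (a false negative) is more costly than a false alarm. The configuration below does not optimize recall directly, so recall should be checked explicitly during evaluation.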
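
One common way to favor recall is to lower the decision threshold applied to predicted probabilities instead of using the default 0.5. The sketch below is an illustration only and is not part of this notebook's pipeline. The helper `threshold_for_recall` and the toy scores are hypothetical. In practice the probabilities could come from `model.predict_proba(X_val)[:, 1]` on a held-out validation split, where `X_val` is an assumed name.

```python
import numpy as np


def threshold_for_recall(y_true, proba, min_recall=0.95):
    """Return the highest threshold whose positive-class recall is >= min_recall."""
    if not 0 < min_recall <= 1:
        raise ValueError("min_recall must be in (0, 1]")
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    n_pos = int((y_true == 1).sum())
    if n_pos == 0:
        raise ValueError("y_true contains no positive samples")
    # Candidate thresholds: each distinct predicted probability, highest first.
    for t in np.sort(np.unique(proba))[::-1]:
        recall = ((proba >= t) & (y_true == 1)).sum() / n_pos
        if recall >= min_recall:
            return float(t)
    # Fallback: the lowest score flags every sample, giving recall 1.0.
    return float(proba.min())


# Toy validation labels and scores (illustrative only).
y_val = np.array([1, 1, 1, 0, 0, 1])
p_val = np.array([0.9, 0.8, 0.35, 0.4, 0.1, 0.6])

t = threshold_for_recall(y_val, p_val, min_recall=1.0)
preds_at_t = (p_val >= t).astype(int)
print(t, preds_at_t)
```

Lowering the threshold trades precision for recall, so the threshold is best chosen on a validation split rather than on the test set.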

In [None]:
target = 'target'
X = df.drop(columns=[target])
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
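
If the classes are imbalanced, which is common in medical survey data, a plain random split can leave the test set with a different class ratio than the training set. A stratified split is one way to guard against this. The sketch below uses synthetic labels (not this notebook's data) and the `stratify` parameter of `train_test_split`. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 20% positive (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both splits keep the 20% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```

In this notebook, the equivalent change would be adding `stratify=y` to the existing `train_test_split` call.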

In [None]:
# --- MODEL DEFINITION ---

model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    eval_metric='logloss',
    random_state=42
)

print("⏳ Training Pulmonary Model...")
model.fit(X_train, y_train)
print("✅ Training Complete.")

In [None]:
# Evaluation
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"🎯 Pulmonary Model Accuracy: {acc:.4f}")
print("\nClassification Report:\n", classification_report(y_test, preds))
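
Because the stated goal of this model is recall, accuracy alone can be misleading, and it may help to look at recall and the confusion matrix directly. The following is a minimal sketch with toy labels. The arrays `y_true_demo` and `y_pred_demo` are hypothetical values, not results from this model. In the notebook they would be replaced by `y_test` and `preds`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels standing in for y_test and preds (illustrative only).
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_pred_demo = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# Recall = TP / (TP + FN): the share of actual positives the model caught.
recall = recall_score(y_true_demo, y_pred_demo, pos_label=1)

# For binary labels [0, 1], ravel() yields counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

print(f"Recall: {recall:.2f} | missed positive cases (FN): {fn}")
```

The false-negative count is the number to watch here, since each one corresponds to a potential cancer case the model failed to flag.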