# MSK Cancer Treatment Classification Pipeline

## Complete Machine Learning Pipeline with BERT & Advanced Metrics

---

### Project Overview

This notebook implements a comprehensive machine learning pipeline for **Memorial Sloan Kettering Cancer Center (MSK)** treatment classification. The pipeline includes:

- **Text Preprocessing** with NLTK
- **Feature Engineering** using TF-IDF and BERT embeddings
- **Multiple ML Models** (Naive Bayes, Logistic Regression, Random Forest, XGBoost, LightGBM)
- **Advanced Evaluation Metrics** for healthcare applications
- **Reusable template code** intended to run in a Kaggle notebook

---

### Key Features

- Handles class imbalance with balanced class weights (converted to per-sample weights for XGBoost and LightGBM)  
- Multi-class classification with comprehensive metrics  
- BERT-based deep learning embeddings  
- Visualization of model performance  
- Modular, reusable code structure  

---

## 1. Import Libraries & Dependencies

Import the packages required for data processing, modeling, and visualization. XGBoost, LightGBM, and Transformers are optional and are skipped if they are not installed.

In [None]:
# Core data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

print("Core libraries loaded successfully!")

In [None]:
# Text processing libraries
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

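# Download NLTK data (quiet mode).
# nltk.download() returns False on failure rather than raising, so check its result.
# 'punkt_tab' is needed by word_tokenize in recent NLTK releases; 'punkt' covers older ones.
for resource in ('stopwords', 'punkt', 'punkt_tab'):
    if not nltk.download(resource, quiet=True):
        print(f"NLTK download warning: could not fetch '{resource}'")
print("NLTK resource check complete.")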

In [None]:
# Scikit-learn: Preprocessing & Feature Engineering
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import label_binarize

print("Scikit-learn preprocessing modules loaded!")

In [None]:
# Scikit-learn: Evaluation Metrics
from sklearn.metrics import (
    log_loss, 
    accuracy_score, 
    classification_report, 
    confusion_matrix,
    f1_score, 
    precision_score, 
    recall_score, 
    roc_auc_score,
    balanced_accuracy_score
)

print("Evaluation metrics loaded!")

In [None]:
# Scikit-learn: Machine Learning Models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier

print("Scikit-learn models loaded!")

In [None]:
# Optional: Advanced Gradient Boosting Libraries
try:
    import xgboost as xgb
    XGBOOST_AVAILABLE = True
    print("XGBoost loaded successfully!")
except ImportError:
    XGBOOST_AVAILABLE = False
    print("XGBoost not available. Install with: pip install xgboost")

try:
    import lightgbm as lgb
    LIGHTGBM_AVAILABLE = True
    print("LightGBM loaded successfully!")
except ImportError:
    LIGHTGBM_AVAILABLE = False
    print("LightGBM not available. Install with: pip install lightgbm")

In [None]:
# Optional: BERT & Transformers for Deep Learning
try:
    from transformers import AutoTokenizer, AutoModel
    import torch
    BERT_AVAILABLE = True
    print("Transformers & PyTorch loaded successfully!")
    print(f"   PyTorch version: {torch.__version__}")
    print(f"   CUDA available: {torch.cuda.is_available()}")
except ImportError:
    BERT_AVAILABLE = False
    print("Transformers not available. Install with: pip install transformers torch")

---

## 2. Data Loading & Exploration

Load the MSK cancer treatment dataset and perform initial exploratory data analysis.

In [None]:
# Load the dataset
# Replace DATA_PATH with the location of your data
DATA_PATH = '/kaggle/input/msk-cancer-treatment/training_variants.zip'

# Example: load from CSV (pandas infers zip compression from the file extension)
# df = pd.read_csv(DATA_PATH)

# Placeholder: nothing is loaded until the line above is uncommented
print("Loading dataset...")
print("Please replace this with your actual data loading code")

In [None]:
# Display basic information about the dataset
# print(f"\nDataset Shape: {df.shape}")
# print(f"\nColumn Names:\n{df.columns.tolist()}")
# print(f"\nFirst Few Rows:")
# display(df.head())
# print(f"\nData Types:")
# print(df.dtypes)
# print(f"\nMissing Values:")
# print(df.isnull().sum())
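
As a rough sketch of what the loading and inspection cells do once uncommented, the example below reads a tiny in-memory CSV instead of the real file. The rows are invented for illustration, and the column names `text_column` and `target_column` are the same placeholders used later in this notebook, not the actual dataset schema.

```python
import io
import pandas as pd

# Invented rows standing in for the real file; one text value is deliberately missing
csv_text = (
    "text_column,target_column\n"
    "braf v600e mutation in melanoma,1\n"
    "egfr exon deletion in lung cancer,2\n"
    ",1\n"
    "kras g12d mutation in colon cancer,3\n"
)

df = pd.read_csv(io.StringIO(csv_text))

print(f"Dataset Shape: {df.shape}")
print(f"Column Names: {df.columns.tolist()}")
print(f"Missing Values:\n{df.isnull().sum()}")
print(f"Class Counts:\n{df['target_column'].value_counts().sort_index()}")
```

With the real file, `pd.read_csv(DATA_PATH)` takes the place of the `io.StringIO` call.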

---

## 3. Text Preprocessing Pipeline

Clean and normalize text data for optimal model performance.

In [None]:
def clean_text(text):
    """
    Comprehensive text cleaning function
    
    Steps:
    1. Convert to lowercase
    2. Remove URLs and email addresses
    3. Remove punctuation
    4. Remove digits
    5. Collapse extra whitespace
    6. Tokenize, then drop stopwords and tokens of two characters or fewer
    
    Args:
        text (str): Raw input text
    
    Returns:
        str: Cleaned text
    """
    if not isinstance(text, str):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove numbers (optional - depends on use case)
    text = re.sub(r'\d+', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Tokenization and stopword removal
    try:
        tokens = word_tokenize(text)
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
        text = ' '.join(tokens)
    except LookupError:
        # NLTK data is missing; fall back to the text without stopword removal
        pass
    
    return text

# Test the function
sample_text = "This is a SAMPLE text with URLs http://example.com and numbers 123!"
cleaned = clean_text(sample_text)
print(f"Original: {sample_text}")
print(f"Cleaned:  {cleaned}")
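
The digit-removal step in `clean_text` is marked optional because it also strips digits embedded in tokens: `v600e` becomes `ve`, which the length filter then drops. If such alphanumeric tokens matter for your data, one possible alternative (an assumption, not part of the function above) is to delete only standalone numbers. The helper name `remove_standalone_numbers` is hypothetical.

```python
import re

def remove_standalone_numbers(text):
    """Remove digit runs that form a whole token, keeping mixed tokens like 'v600e'."""
    text = re.sub(r'\b\d+\b', '', text)
    # Collapse the gaps left behind by removed tokens
    return ' '.join(text.split())

sample = "braf v600e reported in 2015 across 123 patients"

# Standalone numbers are removed, the mixed token survives
print(remove_standalone_numbers(sample))

# The blanket pattern used in clean_text, for comparison
print(re.sub(r'\d+', '', sample))
```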

In [None]:
# Apply text cleaning to your dataset
# Example:
# df['cleaned_text'] = df['text_column'].apply(clean_text)
# print("Text preprocessing completed!")

---

## 4. Feature Engineering

### 4.1 TF-IDF Vectorization

In [None]:
def create_tfidf_features(texts, max_features=5000):
    """
    Create TF-IDF features from text data
    
    Args:
        texts: List of text documents
        max_features: Maximum number of features
    
    Returns:
        X: TF-IDF feature matrix
        vectorizer: Fitted TF-IDF vectorizer
    """
    vectorizer = TfidfVectorizer(
        max_features=max_features,
        ngram_range=(1, 2),  # unigrams and bigrams
        min_df=2,            # minimum document frequency
        max_df=0.95          # maximum document frequency
    )
    
    X = vectorizer.fit_transform(texts)
    
    print(f"TF-IDF Features Created")
    print(f"   Shape: {X.shape}")
    print(f"   Vocabulary Size: {len(vectorizer.vocabulary_)}")
    
    return X, vectorizer

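# Example usage:
# X_tfidf, tfidf_vectorizer = create_tfidf_features(df['cleaned_text'])
# Note: this fits the vocabulary and IDF weights on every document, including those
# that later land in the test split. For a strict evaluation, split the text first,
# fit on the training portion, and call tfidf_vectorizer.transform() on the rest.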
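
Fitting the vectorizer on all documents before splitting lets the test documents influence the vocabulary and IDF weights. A minimal sketch of the leak-free order, using a toy corpus and labels that are invented for illustration, is to split the raw text first and fit only on the training portion:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels, invented for illustration only
texts = [
    "braf mutation melanoma response",
    "egfr mutation lung response",
    "braf inhibitor melanoma resistance",
    "egfr inhibitor lung resistance",
    "braf melanoma trial outcome",
    "egfr lung trial outcome",
    "braf melanoma cohort study",
    "egfr lung cohort study",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text before any fitting happens
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF from train only
X_test = vectorizer.transform(test_texts)        # reuses the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Terms that appear only in the test documents are ignored by `transform`, which mirrors what happens with unseen text at prediction time.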

### 4.2 BERT Embeddings (Optional - Advanced)

In [None]:
def create_bert_embeddings(texts, model_name='bert-base-uncased', batch_size=16):
    """
    Create BERT embeddings for text data
    
    Args:
        texts: List of text documents
        model_name: HuggingFace model name
        batch_size: Batch size for processing
    
    Returns:
        embeddings: Numpy array of BERT embeddings
    """
    if not BERT_AVAILABLE:
        print("BERT not available. Skipping BERT embeddings.")
        return None
    
    print(f"Loading BERT model: {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()
    
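    # The tokenizer expects a list of strings, so convert a pandas Series or array first
    texts = list(texts)
    embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]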
        
        # Tokenize
        encoded = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors='pt'
        )
        
        # Move to device
        encoded = {k: v.to(device) for k, v in encoded.items()}
        
        # Get embeddings
        with torch.no_grad():
            outputs = model(**encoded)
            # Use [CLS] token embedding
            batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
        
        embeddings.append(batch_embeddings)
        
        if (i // batch_size + 1) % 10 == 0:
            print(f"   Processed {i + len(batch_texts)}/{len(texts)} texts")
    
    embeddings = np.vstack(embeddings)
    print(f"BERT Embeddings Created: {embeddings.shape}")
    
    return embeddings

# Example usage:
# if BERT_AVAILABLE:
#     X_bert = create_bert_embeddings(df['cleaned_text'][:100])  # Test on subset first
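
`create_bert_embeddings` takes the `[CLS]` vector as the document embedding. A commonly used alternative, which this notebook does not implement, is to average the token vectors while ignoring padding. The sketch below shows that arithmetic in NumPy so it runs without `transformers`; the function name `masked_mean_pool` and the toy arrays are hypothetical. In the PyTorch code above, the same computation would apply to `outputs.last_hidden_state` and `encoded['attention_mask']`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions.

    hidden_states: array of shape (batch, seq_len, hidden_dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Two toy sequences of three positions with a hidden size of two
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The padded position in the first sequence does not affect its average, which is the point of applying the mask.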

---

## 5. Target Encoding & Train-Test Split

In [None]:
# Encode target labels
# label_encoder = LabelEncoder()
# y_encoded = label_encoder.fit_transform(df['target_column'])

# print(f"\nTarget Classes: {label_encoder.classes_}")
# print(f"Class Distribution:")
# print(pd.Series(y_encoded).value_counts().sort_index())

In [None]:
# Split data into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(
#     X_tfidf, 
#     y_encoded, 
#     test_size=0.2, 
#     random_state=42,
#     stratify=y_encoded  # Maintain class distribution
# )

# print(f"\nData Split Complete")
# print(f"   Training samples: {X_train.shape[0]}")
# print(f"   Testing samples: {X_test.shape[0]}")
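
To illustrate what `stratify` does in the split above, the sketch below uses an invented label vector with an 80/15/5 class mix and checks the per-class counts on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels: 80 / 15 / 5 samples per class
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train counts per class:", train_counts)
print("Test counts per class: ", test_counts)
```

Each class keeps its 80/15/5 share in both parts, so even the rarest class is represented in the test set.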

---

## 6. Model Training Pipeline

### 6.1 Handle Class Imbalance

In [None]:
def calculate_class_weights(y):
    """
    Calculate class weights for imbalanced datasets
    
    Args:
        y: Target labels
    
    Returns:
        class_weights: Dictionary of class weights
    """
    classes = np.unique(y)
    weights = compute_class_weight(
        class_weight='balanced',
        classes=classes,
        y=y
    )
    
    class_weights = dict(zip(classes, weights))
    
    print("Class Weights:")
    for cls, weight in class_weights.items():
        print(f"   Class {cls}: {weight:.4f}")
    
    return class_weights

# Example:
# class_weights = calculate_class_weights(y_train)
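
For reference, the `'balanced'` option weights class `c` as `n_samples / (n_classes * count_c)`. The sketch below reproduces that formula by hand on an invented label vector and compares it with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 6, 3, and 1 samples for classes 0, 1, and 2
y = np.array([0] * 6 + [1] * 3 + [2] * 1)

classes, counts = np.unique(y, return_counts=True)

# n_samples / (n_classes * count_c), computed by hand
manual = len(y) / (len(classes) * counts)

sklearn_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

for cls, m, s in zip(classes, manual, sklearn_weights):
    print(f"Class {cls}: manual={m:.4f}, sklearn={s:.4f}")
```

Rarer classes receive proportionally larger weights, so each class contributes the same total weight to the loss.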

### 6.2 Model Training Functions

In [None]:
def train_naive_bayes(X_train, y_train):
    """Train Multinomial Naive Bayes classifier"""
    print("\nTraining Naive Bayes...")
    model = MultinomialNB(alpha=1.0)
    model.fit(X_train, y_train)
    print("Naive Bayes training complete!")
    return model

def train_logistic_regression(X_train, y_train, class_weights=None):
    """Train Logistic Regression classifier"""
    print("\nTraining Logistic Regression...")
    model = LogisticRegression(
        max_iter=1000,
        class_weight=class_weights,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    print("Logistic Regression training complete!")
    return model

def train_random_forest(X_train, y_train, class_weights=None):
    """Train Random Forest classifier"""
    print("\nTraining Random Forest...")
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=20,
        class_weight=class_weights,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    print("Random Forest training complete!")
    return model

def train_xgboost(X_train, y_train, class_weights=None):
    """Train XGBoost classifier"""
    if not XGBOOST_AVAILABLE:
        print("XGBoost not available")
        return None
    
    print("\nTraining XGBoost...")
    
    # Convert class weights to sample weights
    if class_weights is not None:
        sample_weights = np.array([class_weights[y] for y in y_train])
    else:
        sample_weights = None
    
    model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        tree_method='hist',
        n_jobs=-1
    )
    
    model.fit(X_train, y_train, sample_weight=sample_weights)
    print("XGBoost training complete!")
    return model

def train_lightgbm(X_train, y_train, class_weights=None):
    """Train LightGBM classifier"""
    if not LIGHTGBM_AVAILABLE:
        print("LightGBM not available")
        return None
    
    print("\nTraining LightGBM...")
    
    # Convert class weights to sample weights
    if class_weights is not None:
        sample_weights = np.array([class_weights[y] for y in y_train])
    else:
        sample_weights = None
    
    model = lgb.LGBMClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        n_jobs=-1
    )
    
    model.fit(X_train, y_train, sample_weight=sample_weights)
    print("LightGBM training complete!")
    return model
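
`train_xgboost` and `train_lightgbm` turn the class-weight dictionary into one weight per training sample. The sketch below repeats that lookup on invented labels and checks it against scikit-learn's `compute_sample_weight`, which produces the same values directly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Invented training labels
y_train = np.array([0, 0, 0, 0, 1, 1, 2])

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))

# Same lookup used in train_xgboost and train_lightgbm
sample_weights = np.array([class_weights[label] for label in y_train])

# scikit-learn helper that computes the per-sample weights directly
reference = compute_sample_weight(class_weight='balanced', y=y_train)

print(sample_weights.round(4))
```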

### 6.3 Train All Models

In [None]:
# Dictionary to store trained models
# models = {}

# Train Naive Bayes
# models['Naive Bayes'] = train_naive_bayes(X_train, y_train)

# Train Logistic Regression
# models['Logistic Regression'] = train_logistic_regression(X_train, y_train, class_weights)

# Train Random Forest
# models['Random Forest'] = train_random_forest(X_train, y_train, class_weights)

# Train XGBoost (if available)
# if XGBOOST_AVAILABLE:
#     models['XGBoost'] = train_xgboost(X_train, y_train, class_weights)

# Train LightGBM (if available)
# if LIGHTGBM_AVAILABLE:
#     models['LightGBM'] = train_lightgbm(X_train, y_train, class_weights)

# print(f"\nAll models trained! Total models: {len(models)}")

---

## 7. Model Evaluation & Metrics

### 7.1 Comprehensive Evaluation Function

In [None]:
def evaluate_model(model, X_test, y_test, model_name="Model"):
    """
    Comprehensive model evaluation with multiple metrics
    
    Args:
        model: Trained model
        X_test: Test features
        y_test: True labels
        model_name: Name for display
    
    Returns:
        results: Dictionary of evaluation metrics
        y_pred: Predicted labels
        y_pred_proba: Predicted class probabilities, or None if unavailable
    """
    print(f"\n{'='*60}")
    print(f"Evaluating: {model_name}")
    print(f"{'='*60}")
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Get probability predictions if available
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test)
        has_proba = True
    else:
        y_pred_proba = None
        has_proba = False
    
    # Calculate metrics
    results = {
        'model_name': model_name,
        'accuracy': accuracy_score(y_test, y_pred),
        'balanced_accuracy': balanced_accuracy_score(y_test, y_pred),
        'precision_macro': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'precision_weighted': precision_score(y_test, y_pred, average='weighted', zero_division=0),
        'recall_macro': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'recall_weighted': recall_score(y_test, y_pred, average='weighted', zero_division=0),
        'f1_macro': f1_score(y_test, y_pred, average='macro', zero_division=0),
        'f1_weighted': f1_score(y_test, y_pred, average='weighted', zero_division=0),
    }
    
    # Add log loss if probabilities available
    if has_proba:
        # Pass the model's classes so a class missing from y_test does not break the metric
        results['log_loss'] = log_loss(y_test, y_pred_proba, labels=model.classes_)
        
        # Calculate one-vs-rest ROC AUC for multi-class problems
        try:
            if len(model.classes_) > 2:
                results['roc_auc_ovr'] = roc_auc_score(
                    y_test, y_pred_proba,
                    average='macro',
                    multi_class='ovr',
                    labels=model.classes_
                )
        except ValueError:
            # Raised when a class has no samples in y_test; skip ROC AUC in that case
            pass
    
    # Print results
    print(f"\nPerformance Metrics:")
    print(f"   Accuracy:              {results['accuracy']:.4f}")
    print(f"   Balanced Accuracy:     {results['balanced_accuracy']:.4f}")
    print(f"   Precision (Macro):     {results['precision_macro']:.4f}")
    print(f"   Precision (Weighted):  {results['precision_weighted']:.4f}")
    print(f"   Recall (Macro):        {results['recall_macro']:.4f}")
    print(f"   Recall (Weighted):     {results['recall_weighted']:.4f}")
    print(f"   F1-Score (Macro):      {results['f1_macro']:.4f}")
    print(f"   F1-Score (Weighted):   {results['f1_weighted']:.4f}")
    
    if has_proba:
        print(f"   Log Loss:              {results['log_loss']:.4f}")
        if 'roc_auc_ovr' in results:
            print(f"   ROC AUC (OvR):         {results['roc_auc_ovr']:.4f}")
    
    # Print classification report
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, zero_division=0))
    
    return results, y_pred, y_pred_proba
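
The `log_loss` entry is the average negative log-probability assigned to the true class. A minimal sketch with invented labels and probabilities shows the calculation by hand next to the scikit-learn result:

```python
import numpy as np
from sklearn.metrics import log_loss

# Invented true labels and predicted probabilities (each row sums to 1)
y_true = np.array([0, 1, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.4, 0.4, 0.2],
])

# Pick the probability of the true class in each row, then average the negative logs
true_class_proba = proba[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(true_class_proba))

reference = log_loss(y_true, proba, labels=[0, 1, 2])

print(f"Manual log loss:  {manual:.4f}")
print(f"sklearn log loss: {reference:.4f}")
```

A confident wrong prediction is penalized far more than an uncertain one, which is why this metric complements accuracy.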

### 7.2 Evaluate All Models

In [None]:
# Store evaluation results
# all_results = []

# for model_name, model in models.items():
#     results, y_pred, y_pred_proba = evaluate_model(model, X_test, y_test, model_name)
#     all_results.append(results)

# Create results DataFrame
# results_df = pd.DataFrame(all_results)
# results_df = results_df.set_index('model_name')
# print("\nSummary of All Models:")
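# log_loss is lower-is-better, so highlight its minimum; every other metric is higher-is-better
# higher_better = [c for c in results_df.columns if c != 'log_loss']
# styled = results_df.style.highlight_max(subset=higher_better, axis=0, color='lightgreen')
# if 'log_loss' in results_df.columns:
#     styled = styled.highlight_min(subset=['log_loss'], axis=0, color='lightgreen')
# display(styled)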
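
When reading the summary table, it helps to remember how differently the metrics react to imbalance. The sketch below uses invented labels and a classifier that always predicts the majority class:

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
)

# Invented labels: 8 majority-class samples, 2 minority-class samples
y_true = [0] * 8 + [1] * 2
# A degenerate classifier that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average='macro', zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f"Accuracy:          {acc:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print(f"F1 (macro):        {f1_macro:.4f}")
print(f"F1 (weighted):     {f1_weighted:.4f}")
```

Accuracy and weighted F1 look respectable here even though the minority class is never predicted, while balanced accuracy and macro F1 expose the failure.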

---

## 8. Visualization & Analysis

### 8.1 Model Comparison Chart

In [None]:
def plot_model_comparison(results_df):
    """
    Create comprehensive model comparison visualizations
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Model Performance Comparison', fontsize=16, fontweight='bold')
    
    # 1. Accuracy Metrics
    metrics = ['accuracy', 'balanced_accuracy']
    results_df[metrics].plot(kind='bar', ax=axes[0, 0], rot=45)
    axes[0, 0].set_title('Accuracy Metrics')
    axes[0, 0].set_ylabel('Score')
    axes[0, 0].legend(['Accuracy', 'Balanced Accuracy'])
    axes[0, 0].set_ylim([0, 1])
    axes[0, 0].grid(axis='y', alpha=0.3)
    
    # 2. Precision & Recall
    metrics = ['precision_macro', 'recall_macro']
    results_df[metrics].plot(kind='bar', ax=axes[0, 1], rot=45)
    axes[0, 1].set_title('Precision & Recall (Macro)')
    axes[0, 1].set_ylabel('Score')
    axes[0, 1].legend(['Precision', 'Recall'])
    axes[0, 1].set_ylim([0, 1])
    axes[0, 1].grid(axis='y', alpha=0.3)
    
    # 3. F1 Scores
    metrics = ['f1_macro', 'f1_weighted']
    results_df[metrics].plot(kind='bar', ax=axes[1, 0], rot=45)
    axes[1, 0].set_title('F1 Scores')
    axes[1, 0].set_ylabel('Score')
    axes[1, 0].legend(['F1 Macro', 'F1 Weighted'])
    axes[1, 0].set_ylim([0, 1])
    axes[1, 0].grid(axis='y', alpha=0.3)
    
    # 4. Overall Ranking
    ranking_metric = 'f1_weighted'
    top_models = results_df[ranking_metric].sort_values(ascending=True)
    top_models.plot(kind='barh', ax=axes[1, 1], color='skyblue')
    axes[1, 1].set_title(f'Model Ranking by {ranking_metric.replace("_", " ").title()}')
    axes[1, 1].set_xlabel('Score')
    axes[1, 1].set_xlim([0, 1])
    axes[1, 1].grid(axis='x', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Example:
# plot_model_comparison(results_df)

### 8.2 Confusion Matrix

In [None]:
def plot_confusion_matrix(y_true, y_pred, model_name="Model", class_names=None):
    """
    Plot confusion matrix with percentages
    """
    # Row-normalize so each cell is the percentage of that true class;
    # normalize='true' also avoids NaN for a row with no true samples
    cm_percent = confusion_matrix(y_true, y_pred, normalize='true') * 100
    
    fig, ax = plt.subplots(figsize=(10, 8))
    
    # Create heatmap
    sns.heatmap(
        cm_percent, 
        annot=True, 
        fmt='.1f', 
        cmap='Blues',
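        # 'is not None' also works when class_names is a NumPy array such as label_encoder.classes_
        xticklabels=class_names if class_names is not None else 'auto',
        yticklabels=class_names if class_names is not None else 'auto',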
        cbar_kws={'label': 'Percentage (%)'},
        ax=ax
    )
    
    ax.set_title(f'Confusion Matrix - {model_name}', fontsize=14, fontweight='bold')
    ax.set_xlabel('Predicted Label', fontsize=12)
    ax.set_ylabel('True Label', fontsize=12)
    
    plt.tight_layout()
    plt.show()

# Example:
# best_model_name = results_df['f1_weighted'].idxmax()
# best_model = models[best_model_name]
# y_pred = best_model.predict(X_test)
# plot_confusion_matrix(y_test, y_pred, best_model_name)
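
As a small worked example of the percentages shown in the heatmap, the sketch below row-normalizes a confusion matrix built from invented labels. Each row sums to 100, and the diagonal is the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true and predicted labels for three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2, 2, 1, 2]

cm = confusion_matrix(y_true, y_pred)

# Divide each row by the number of true samples in that class
cm_percent = confusion_matrix(y_true, y_pred, normalize='true') * 100

print("Counts:")
print(cm)
print("Row percentages:")
print(cm_percent)
```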

---

## 9. Model Saving & Export

Save the best performing model for production use.

In [None]:
import joblib
from datetime import datetime

def save_model(model, vectorizer, label_encoder, model_name="best_model"):
    """
    Save model and preprocessing objects
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{model_name}_{timestamp}.pkl"
    
    # Save as a dictionary
    model_package = {
        'model': model,
        'vectorizer': vectorizer,
        'label_encoder': label_encoder,
        'timestamp': timestamp
    }
    
    joblib.dump(model_package, filename)
    print(f"Model saved to: {filename}")
    
    return filename

def load_model(filename):
    """
    Load saved model
    """
    model_package = joblib.load(filename)
    print(f"Model loaded from: {filename}")
    
    return model_package

# Example:
# best_model_name = results_df['f1_weighted'].idxmax()
# best_model = models[best_model_name]
# save_model(best_model, tfidf_vectorizer, label_encoder, best_model_name)
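
The sketch below shows the same package-and-reload round trip on a toy model, writing to a temporary directory so nothing is left on disk. The texts, labels, and file name are invented for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Invented training data
texts = ["braf melanoma", "egfr lung", "braf skin melanoma", "egfr lung tumor"]
labels = ["A", "B", "A", "B"]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Bundle the model with its preprocessing objects, as save_model does
package = {"model": model, "vectorizer": vectorizer, "label_encoder": label_encoder}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(package, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = bool((restored["model"].predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same)
```

Because joblib files are pickle-based, load them only from sources you trust, and with the same library versions used for saving where possible.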

---

## 10. Prediction Pipeline

Complete end-to-end prediction pipeline for new data.

In [None]:
def predict_treatment(text, model_package):
    """
    Predict treatment class for new text
    
    Args:
        text: Input text
        model_package: Dictionary with model, vectorizer, label_encoder
    
    Returns:
        prediction: Predicted class label
        prob_dict: Mapping of class label to probability, or None if the model
            does not expose predict_proba
    """
    # Extract components
    model = model_package['model']
    vectorizer = model_package['vectorizer']
    label_encoder = model_package['label_encoder']
    
    # Preprocess text
    cleaned_text = clean_text(text)
    
    # Vectorize
    X = vectorizer.transform([cleaned_text])
    
    # Predict
    y_pred = model.predict(X)[0]
    prediction = label_encoder.inverse_transform([y_pred])[0]
    
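    # Get probabilities if available
    if hasattr(model, 'predict_proba'):
        probabilities = model.predict_proba(X)[0]
        # Columns of predict_proba follow model.classes_ (encoded labels)
        class_labels = label_encoder.inverse_transform(model.classes_)
        prob_dict = dict(zip(class_labels, probabilities))
    else:
        prob_dict = None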
    
    return prediction, prob_dict

# Example usage:
# sample_text = "Patient presents with stage III melanoma..."
# prediction, probabilities = predict_treatment(sample_text, model_package)
# print(f"\nPredicted Treatment: {prediction}")
# if probabilities:
#     print("\nClass Probabilities:")
#     for cls, prob in sorted(probabilities.items(), key=lambda x: x[1], reverse=True):
#         print(f"   {cls}: {prob:.4f}")

---

## Appendix: Utility Functions

In [None]:
# Additional utility functions can be added here

def print_model_info(model):
    """Print model parameters and configuration"""
    print(f"\nModel Type: {type(model).__name__}")
    print(f"Parameters: {model.get_params()}")

def export_predictions_to_csv(y_true, y_pred, filename="predictions.csv"):
    """Export predictions to CSV file"""
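    # Convert to arrays so the comparison is element-wise for lists and Series alike
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    results = pd.DataFrame({
        'true_label': y_true,
        'predicted_label': y_pred,
        'correct': y_true == y_pred
    })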
    results.to_csv(filename, index=False)
    print(f"Predictions saved to: {filename}")