1. Preprocessing Routine
Objective: Implement a preprocessing pipeline to clean and standardize data, ensuring that both training and test data are processed the same way.
Implementation: Use tools like StandardScaler for normalization and SimpleImputer for handling missing values. Package these as steps in an sklearn Pipeline so that the imputation and scaling statistics are fitted on the training data and reused unchanged on the test data.
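A minimal sketch of such a preprocessing step, assuming purely numeric input. The toy arrays and the choice of median imputation are illustrative assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the real train/test frames (hypothetical values)
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[np.nan, 20.0], [5.0, 50.0]])

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with the training median
    ("scale", StandardScaler()),                   # zero mean, unit variance per column
])

# Fit on the training data only, then reuse the fitted statistics on the test data
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)
print(X_test_prep)
```

Because the test rows only pass through transform, they never influence the medians or the scaling statistics.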
2. Feature Selection and Dimensionality Reduction
Objective: Prioritize the most relevant features and potentially reduce the feature space for faster training.

Approach:

Permutation Testing: Test different feature subsets by dropping specific columns and retraining the model (strictly a drop-column ablation; a permutation test would shuffle a column's values instead of removing it). Track the change in macro F1 score for each subset to identify impactful features.
Dimensionality Reduction (PCA): Use PCA to further reduce dimensionality by retaining enough components to explain a chosen share of the variance (e.g., 95%). Fewer input columns can speed up training, but PCA discards low-variance directions rather than irrelevant features, so the effect on F1 has to be measured.
Feature Importance from PCA: Evaluate which features contribute most to each principal component, based on the component loadings (the eigenvector entries); the eigenvalues only give the variance explained by each component. Discard features with minimal contributions to maintain a balance between performance and model complexity.
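To make the PCA steps concrete, here is a small sketch on synthetic data. The matrix, the feature names f0 to f2, and the 95% threshold are assumptions for illustration. It keeps enough components to reach the threshold, then ranks features by the absolute size of their entries in each component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic matrix: columns 0 and 1 are strongly correlated, column 2 is independent noise
base = rng.normal(size=(500, 1))
X_demo = np.hstack([base, base + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
demo_names = ["f0", "f1", "f2"]

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA(n_components=0.95)  # keep enough components to explain at least 95% of the variance
X_pca = pca.fit_transform(X_scaled)

# explained_variance_ratio_ describes the components;
# the rows of components_ describe how each feature enters a component
print("explained variance ratio:", pca.explained_variance_ratio_)
for i, component in enumerate(pca.components_):
    ranked = sorted(zip(demo_names, np.abs(component)), key=lambda t: t[1], reverse=True)
    print(f"PC{i + 1}:", ranked)
```

The two correlated columns collapse into one component, which is the kind of redundancy PCA removes.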

3. Model Training with XGBoost Classifier
Objective: Train the model using the optimized feature set from the previous step.
Pipeline Setup: Create a pipeline that includes the preprocessing, feature selection, and XGBoost model.
Hyperparameter Tuning with K-Fold Grid Search:
Use GridSearchCV to tune hyperparameters such as learning_rate, max_depth, n_estimators, and subsample.
Set scoring='f1_macro' to prioritize the F1 score during optimization; the target has nine classes, and plain scoring='f1' only supports binary targets.
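The choice of F1 averaging matters on an imbalanced multiclass target. The sketch below uses made-up labels to show how macro and weighted F1 diverge when a rare class is missed, and why the binary default does not apply here.

```python
from sklearn.metrics import f1_score, make_scorer

# Imbalanced toy labels: the single sample of class 2 is misclassified
y_true = [0] * 6 + [1] * 3 + [2]
y_pred = [0] * 6 + [1] * 3 + [0]

macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(f"macro F1:    {macro_f1:.4f}")     # every class counts equally
print(f"weighted F1: {weighted_f1:.4f}")  # classes count in proportion to their support

# The default average='binary' is rejected for multiclass targets
try:
    f1_score(y_true, y_pred)
except ValueError as err:
    print("binary F1 failed:", err)

# Two equivalent ways to request macro F1 in GridSearchCV
scoring_by_name = "f1_macro"
scoring_by_callable = make_scorer(f1_score, average="macro")
```

Macro F1 drops sharply when the rare class is missed, so it is the stricter target when minority classes matter.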
4. Evaluate the Impact of Feature Selection and Dimensionality Reduction
Objective: Determine whether feature reduction has improved, degraded, or maintained model performance.
Comparison: After finding the optimal hyperparameters, compare F1 scores from models trained with full features versus reduced features (via permutation tests and PCA) to confirm the benefits of feature selection.
5. Latent Space Embedding Projection (Optional)
Objective: Further understand the data structure by projecting it into a latent space for analysis.
Techniques: Use t-SNE, UMAP, or autoencoders to create embeddings that could reveal additional patterns or relationships, potentially informing new feature engineering strategies.
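As a rough illustration of the projection step, the following sketch embeds a synthetic dataset in two dimensions with t-SNE. The data and the perplexity value are assumptions. UMAP and autoencoders need extra packages and are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# t-SNE works on distances, so features are standardized first
X_scaled = StandardScaler().fit_transform(X_demo)

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X_scaled)
print(embedding.shape)
```

Plotting the two embedding columns colored by y_demo shows whether classes form separable clusters. t-SNE has no transform method for new rows, so it suits exploration more than use as a pipeline step.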
6. Neural Network Exploration (Optional)
Objective: Explore neural networks as an alternative to XGBoost for this task.
Implementation:
Replace XGBoost with a neural network architecture, varying parameters like layer count, activation functions, learning rate, and batch size.
Use GridSearchCV for tuning and evaluation with an sklearn-compatible estimator (e.g., MLPClassifier), scoring on macro F1 so the results are comparable with the XGBoost runs.
Comparison: Evaluate neural network performance relative to XGBoost in terms of F1 score and computational efficiency.
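One way to set up this exploration is sketched below with sklearn's MLPClassifier on synthetic data. The grid values and dataset are placeholders, and a Keras or PyTorch model would need an sklearn-compatible wrapper instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_encoded / y_encoded (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

mlp_pipeline = Pipeline([
    ("scale", StandardScaler()),  # MLPs are sensitive to feature scale
    ("mlp", MLPClassifier(max_iter=200, random_state=0)),
])

mlp_grid = {
    "mlp__hidden_layer_sizes": [(16,), (32, 16)],  # layer count and width
    "mlp__activation": ["relu", "tanh"],
    "mlp__batch_size": [32, 64],
}

mlp_search = GridSearchCV(mlp_pipeline, mlp_grid, scoring="f1_macro", cv=3, n_jobs=1)
mlp_search.fit(X_demo, y_demo)
print(mlp_search.best_params_, round(mlp_search.best_score_, 4))
```

Using the same scoring string as the XGBoost search keeps the two best_score_ values directly comparable.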


In [10]:
import pandas as pd

test_clean_df = pd.read_csv('test_clean_df.csv')
train_clean_df = pd.read_csv('train_clean_df.csv')



In [19]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# train_clean_df comes from the first cell; if pandas warns about mixed column types there, pass low_memory=False to pd.read_csv

# Data prep
exclude_cols = [
    'OIICS Nature of Injury Description',
    'IME-4 Count',
    'First Hearing Date_Year',
    'First Hearing Date_Quarter',
    'First Hearing Date_Month',
    'First Hearing Date',
    'C-3 Date_Year',
    'C-3 Date',
    'C-3 Date_Quarter',
    'C-3 Date_Month'
]

# Select features and target
X = train_clean_df.drop(exclude_cols + ['Claim Injury Type'], axis=1)
y = train_clean_df['Claim Injury Type']

# Fill nulls
X = X.fillna(X.mode().iloc[0])

# Save feature names
feature_names = list(X.columns)

# Encode non-numeric features, keeping one fitted encoder per column so the
# same mapping can be reapplied to the test set
X_encoded = X.copy()
encoders = {}
non_numeric = X.select_dtypes(exclude='number').columns

for col in non_numeric:
    encoders[col] = LabelEncoder()
    X_encoded[col] = encoders[col].fit_transform(X_encoded[col].astype(str))

# Convert to numpy array after all encoding is done
X_encoded = X_encoded.values
target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(y)
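The cell above fits its encoders on the training frame only. One way to apply the same mapping to the test frame is OrdinalEncoder with an explicit code for unseen categories. This is sketched as an assumption, not as the notebook's method. The toy frames and the choice of -1 as the unknown code are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical miniature train/test frames with categorical columns
train_demo = pd.DataFrame({"Gender": ["M", "F", "M"], "Carrier Type": ["1A", "2A", "1A"]})
test_demo = pd.DataFrame({"Gender": ["F", "U"], "Carrier Type": ["2A", "3A"]})

# Categories never seen during fit are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_demo)  # learn the mapping on train only
test_codes = encoder.transform(test_demo)        # reuse it unchanged on test

print(encoder.categories_)
print(test_codes)
```

A single fitted encoder covers all columns at once and can be saved alongside the model.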



In [38]:
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, make_scorer

# Search grid sized for a T4 GPU. Device settings (tree_method, predictor, gpu_id)
# are set on the estimator itself: listing them here as well would put them in
# best_params_ and make the final XGBClassifier(...) call receive them twice.
param_grid = {
    'max_depth': [5, 7],
    'learning_rate': [0.1],
    'n_estimators': [200],           # Kept small to save memory
    'min_child_weight': [1, 3],
    'subsample': [0.8],              # Row subsampling; also lowers memory use
    'colsample_bytree': [0.8],       # Column subsampling per tree
    'max_bin': [256],                # Histogram bins; bounds GPU memory
    'grow_policy': ['lossguide']     # Leaf-wise tree growth
}

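def tune_xgboost(X_encoded, y_encoded, feature_names, sample_frac=1.0):
    """
    Two-phase XGBoost tuning for a Colab T4 GPU.

    Phase 1 grid-searches on a random fraction of the rows (sample_frac);
    Phase 2 refits the best configuration on the full dataset.
    """
    # Phase 1: grid search on a random sample. sample_frac=1.0 uses every row;
    # a small fraction saves memory, but rare classes can then be missing
    # from some folds.
    print("\n=== Phase 1: Sample Training ===")
    sample_size = int(len(X_encoded) * sample_frac)
    indices = np.random.permutation(len(X_encoded))[:sample_size]
    X_sample = X_encoded[indices]
    y_sample = y_encoded[indices]

    print(f"Training on {sample_size} samples")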

    # Train on sample
    xgb = XGBClassifier(random_state=42,
                       tree_method='gpu_hist',
                       predictor='gpu_predictor',
                       gpu_id=0)

    grid_search_sample = GridSearchCV(
        estimator=xgb,
        param_grid=param_grid,
        scoring=make_scorer(f1_score, average='macro'),
        cv=3,                    # Reduced from 5 to save memory
        n_jobs=1,               # Use single job for GPU
        verbose=2
    )

    grid_search_sample.fit(X_sample, y_sample)

    # Print sample results
    print(f"\nSample F1 Score: {grid_search_sample.best_score_:.4f}")
    print("Best Parameters:")
    for param, value in grid_search_sample.best_params_.items():
        print(f"  {param}: {value}")

    # Phase 2: Full Dataset with best params
    print("\n=== Phase 2: Full Dataset ===")
    best_params = grid_search_sample.best_params_

    final_model = XGBClassifier(
        **best_params,
        random_state=42,
        tree_method='gpu_hist',
        predictor='gpu_predictor',
        gpu_id=0
    )

    print(f"Training on full dataset ({len(X_encoded)} samples)...")
    final_model.fit(X_encoded, y_encoded)

    # Get importance
    importance = final_model.feature_importances_

    print("\nTop 15 Important Features:")
    feature_importance = list(zip(feature_names, importance))
    feature_importance.sort(key=lambda x: x[1], reverse=True)

    for name, imp in feature_importance[:15]:
        print(f"{name}: {imp:.4f}")

    return final_model, grid_search_sample.best_score_, best_params, feature_importance

# Usage:
best_model, best_score, best_params, feature_importance = tune_xgboost(X_encoded, y_encoded, feature_names)


=== Training XGBoost Model ===


ValueError: The test_size = 1 should be greater or equal to the number of classes = 9

=== Training Set Performance ===
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
              precision    recall  f1-score   support

           0       0.74      0.14      0.24     11229
           1       0.82      1.00      0.90    261970
           2       0.00      0.00      0.00     62015
           3       0.70      0.93      0.79    133656
           4       0.70      0.30      0.42     43452
           5       0.00      0.00      0.00      3790
           6       0.00      0.00      0.00        87
           7       0.00      0.00      0.00       423
           8       1.00      1.00      1.00     17501

    accuracy                           0.78    534123
   macro avg       0.44      0.37      0.37    534123
weighted avg       0.68      0.78      0.71    534123


=== Validation Set Performance ===
              precision    recall  f1-score   support

           0       0.78      0.15      0.26      1248
           1       0.82      1.00      0.90     29108
           2       0.00      0.00      0.00      6891
           3       0.70      0.93      0.80     14851
           4       0.69      0.29      0.41      4828
           5       0.00      0.00      0.00       421
           6       0.00      0.00      0.00        10
           7       0.00      0.00      0.00        47
           8       1.00      1.00      1.00      1944

    accuracy                           0.78     59348
   macro avg       0.44      0.37      0.37     59348
weighted avg       0.68      0.78      0.71     59348


Top 15 Important Features:
Average Weekly Wage: 0.2677
Accident Date_Year: 0.1172
Claim Identifier: 0.1147
Agreement Reached: 0.0756
Attorney/Representative: 0.0648
Number of Dependents: 0.0495
Accident Date: 0.0265
WCIO Nature of Injury Description: 0.0238
Accident Date_Month: 0.0185
WCIO Nature of Injury Description_Grouped: 0.0173
Industry Code: 0.0170
Carrier Type: 0.0157
Industry Code Description: 0.0153
County of Injury: 0.0153
C-2 Date_Quarter: 0.0123

In [40]:
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, f1_score, make_scorer

def train_xgb_kfold(X_encoded, y_encoded, feature_names, n_folds=5):
    """
    XGBoost training with K-Fold cross-validation
    """
    print("\n=== Preparing K-Fold Cross Validation ===")
    print(f"Total samples: {len(X_encoded)}")
    print(f"Number of folds: {n_folds}")

    # Setup KFold
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    # Optimized parameter grid for GPU
    param_grid = {
        'max_depth': [3, 5],
        'learning_rate': [0.01, 0.1],
        'n_estimators': [100, 200],
        'min_child_weight': [3, 5],
        'subsample': [0.8],
        'colsample_bytree': [0.8],
        'gamma': [0.1, 0.5]
    }

    # Base model with GPU settings
    base_model = XGBClassifier(
        tree_method='hist',
        device='cuda:0',
        random_state=42,
        grow_policy='lossguide',
        max_bin=256
    )

    # Grid search with k-fold CV
    print("\n=== Running Grid Search with K-Fold CV ===")
    grid_search = GridSearchCV(
        estimator=base_model,
        param_grid=param_grid,
        scoring=make_scorer(f1_score, average='macro'),
        cv=kfold,
        n_jobs=1,  # Single job for GPU
        verbose=1,
        return_train_score=True
    )

    print("\nFitting model...")
    grid_search.fit(X_encoded, y_encoded)

    # Print CV results
    print("\n=== Cross Validation Results ===")
    print(f"Best CV Score: {grid_search.best_score_:.4f}")
    print("\nBest Parameters:")
    for param, value in grid_search.best_params_.items():
        print(f"  {param}: {value}")

    # Get detailed CV scores
    cv_results = grid_search.cv_results_
    for i in range(len(cv_results['params'])):
        print(f"\nParameter set {i+1}:")
        print(f"Parameters: {cv_results['params'][i]}")
        print(f"Mean CV Score: {cv_results['mean_test_score'][i]:.4f}")
        print(f"Std CV Score: {cv_results['std_test_score'][i]:.4f}")

    # Train final model with best parameters
    print("\n=== Training Final Model ===")
    final_model = XGBClassifier(
        tree_method='hist',
        device='cuda:0',
        random_state=42,
        grow_policy='lossguide',
        max_bin=256,
        **grid_search.best_params_
    )

    print("\nTraining on full dataset...")
    final_model.fit(X_encoded, y_encoded)

    # Feature importance
    importance = final_model.feature_importances_
    print("\nTop 15 Important Features:")
    feature_importance = list(zip(feature_names, importance))
    feature_importance.sort(key=lambda x: x[1], reverse=True)

    for name, imp in feature_importance[:15]:
        print(f"{name}: {imp:.4f}")

    return final_model, feature_importance, grid_search.best_score_, grid_search.best_params_

# Usage:
final_model, final_importance, best_score, best_params = train_xgb_kfold(
    X_encoded,
    y_encoded,
    feature_names,
    n_folds=5
)


=== Preparing K-Fold Cross Validation ===
Total samples: 593471
Number of folds: 5

=== Running Grid Search with K-Fold CV ===

Fitting model...
Fitting 5 folds for each of 32 candidates, totalling 160 fits

=== Cross Validation Results ===
Best CV Score: 0.4645

Best Parameters:
  colsample_bytree: 0.8
  gamma: 0.5
  learning_rate: 0.1
  max_depth: 5
  min_child_weight: 5
  n_estimators: 200
  subsample: 0.8

Parameter set 1:
Parameters: {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 100, 'subsample': 0.8}
Mean CV Score: 0.3683
Std CV Score: 0.0006

Parameter set 2:
Parameters: {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 3, 'n_estimators': 200, 'subsample': 0.8}
Mean CV Score: 0.3739
Std CV Score: 0.0015

Parameter set 3:
Parameters: {'colsample_bytree': 0.8, 'gamma': 0.1, 'learning_rate': 0.01, 'max_depth': 3, 'min_child_weight': 5, 'n_estimators': 100, 'subs

In [42]:
# === Cross Validation Results ===
# Best CV Score: 0.4645

# Best Parameters:
#   colsample_bytree: 0.8
#   gamma: 0.5
#   learning_rate: 0.1
#   max_depth: 5
#   min_child_weight: 5
#   n_estimators: 200
#   subsample: 0.8


# === Training Final Model ===

# Training on full dataset...

# Top 15 Important Features:
# Average Weekly Wage: 0.2197
# Claim Identifier: 0.1565
# Agreement Reached: 0.1382
# Accident Date_Year: 0.0942
# Attorney/Representative: 0.0893
# Alternative Dispute Resolution: 0.0237
# C-2 Date_Month: 0.0234
# C-2 Date_Year: 0.0203
# C-2 Date_Quarter: 0.0181
# COVID-19 Indicator: 0.0143
# Accident Date_Month: 0.0124
# ADR_Clean: 0.0124
# Accident Date: 0.0121
# Industry Code: 0.0107
# Carrier_Category: 0.0106

SyntaxError: invalid syntax (<ipython-input-42-b75305a32c57>, line 1)


=== Feature Dropping Experiments ===

Testing top 5 features...


InvalidIndexError: (slice(None, None, None), [5, 10, 25, 30, 4])

In [None]:
# Fill NaN's before PCA

# 1. FROM XGBOOST WE USE:
#    - Feature importance scores
#    - Best performing features
#    - Model performance baseline

# 2. FOR PCA WE NEED:
#    - Original feature matrix
#    - Feature importance rankings
#    - Variance threshold (95%)

In [89]:
# X_encoded is a NumPy array after encoding, so wrap it in a DataFrame to inspect it
X_encoded_df = pd.DataFrame(X_encoded, columns=feature_names)

pd.set_option('display.max_columns', None)
X_encoded_df.head()


Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,Carrier Name,Carrier Type,Claim Identifier,County of Injury,COVID-19 Indicator,District Name,Gender,Industry Code,Industry Code Description,Medical Fee Region,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,Accident Date_Month,Accident Date_Quarter,Accident Date_Year,Assembly Date_Month,Assembly Date_Quarter,Assembly Date_Year,C-2 Date_Month,C-2 Date_Quarter,C-2 Date_Year,Carrier_Category,County_Grouped,Zip_Region,Industry_Grouped,WCIO Cause of Injury Description_Grouped,WCIO Nature of Injury Description_Grouped,WCIO Part Of Body Description_Grouped,Gender_Clean,ADR_Clean
0,4392.0,31.0,0.0,0.0,0.0,0.0,1988.0,1085.0,1197.0,0.0,5393875.0,49.0,0.0,7.0,1.0,44.0,16.0,0.0,27.0,27.0,10.0,16.0,62.0,5.0,3935.0,0.0,0.0,1.0,490.0,192.0,2019.0,0.0,0.0,2020.0,241.0,88.0,2019.0,10.0,13.0,160.0,14.0,9.0,2.0,19.0,1.0,0.0
1,4270.0,46.0,0.0,0.0,1.0,1745.93,1973.0,1086.0,2044.0,0.0,5393091.0,61.0,0.0,5.0,0.0,23.0,4.0,0.0,97.0,57.0,49.0,50.0,38.0,39.0,4606.0,1.0,0.0,4.0,486.0,191.0,2019.0,0.0,0.0,2020.0,242.0,89.0,2020.0,13.0,13.0,178.0,3.0,23.0,13.0,20.0,0.0,0.0
2,4368.0,40.0,0.0,0.0,0.0,1434.8,1979.0,1086.0,894.0,0.0,5393889.0,35.0,0.0,0.0,1.0,56.0,1.0,1.0,79.0,43.0,7.0,14.0,10.0,30.0,3075.0,0.0,0.0,6.0,490.0,192.0,2019.0,0.0,0.0,2020.0,242.0,89.0,2020.0,8.0,12.0,137.0,1.0,13.0,11.0,17.0,1.0,0.0
3,4454.0,31.0,0.0,0.0,0.0,0.0,1990.0,1575.0,1710.0,0.0,957648180.0,51.0,0.0,4.0,1.0,62.0,7.0,3.0,56.0,35.0,52.0,51.0,42.0,23.0,1909.0,0.0,0.0,6.0,493.0,193.0,2021.0,0.0,0.0,2020.0,273.0,99.0,2022.0,13.0,13.0,120.0,10.0,19.0,11.0,19.0,1.0,0.0
4,4392.0,61.0,0.0,0.0,0.0,0.0,1958.0,1085.0,1710.0,1.0,5393887.0,13.0,0.0,0.0,1.0,62.0,7.0,1.0,16.0,30.0,43.0,44.0,36.0,12.0,3088.0,0.0,0.0,1.0,490.0,192.0,2019.0,0.0,0.0,2020.0,241.0,88.0,2019.0,16.0,3.0,139.0,6.0,10.0,12.0,6.0,1.0,0.0


In [56]:
import pandas as pd
import numpy as np

def impute_mixed_data(X_encoded, feature_names):
    """
    Imputes missing values in mixed-type data.

    Uses mode for categorical variables and median for numerical variables.

    Args:
        X_encoded: The encoded data as a NumPy array.
        feature_names: List of feature names corresponding to X_encoded columns.

    Returns:
        X_imputed: The imputed data as a NumPy array.
    """

    # Convert to DataFrame for easier handling
    df = pd.DataFrame(X_encoded, columns=feature_names)

    # Separate numerical and categorical features
    numerical_features = df.select_dtypes(include=np.number).columns
    categorical_features = df.select_dtypes(exclude=np.number).columns

    # Impute numerical features with median
    for feature in numerical_features:
        df[feature] = df[feature].fillna(df[feature].median())

    # Impute categorical features with mode
    for feature in categorical_features:
        df[feature] = df[feature].fillna(df[feature].mode()[0])

    # Convert back to NumPy array
    X_imputed = df.values

    return X_imputed

# Example usage:
X_imputed = impute_mixed_data(X_encoded, feature_names)

X_imputed = pd.DataFrame(X_imputed, columns=feature_names)

X_imputed.isna().sum()

Unnamed: 0,0
Accident Date,0
Age at Injury,0
Alternative Dispute Resolution,0
Assembly Date,0
Attorney/Representative,0
Average Weekly Wage,0
Birth Year,0
C-2 Date,0
Carrier Name,0
Carrier Type,0


In [94]:
import numpy as np
from xgboost import XGBClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, StratifiedKFold
import time

def run_feature_dropping(X_imputed, y_encoded, feature_names):
    print("\n=== Feature Dropping Experiments ===")
    dropping_results = {}
    feature_counts = [5, 10, 15]

    # Convert to numpy if needed
    X = X_imputed.to_numpy() if hasattr(X_imputed, 'to_numpy') else np.array(X_imputed)

    for n_features in feature_counts:
        print(f"\nTesting top {n_features} features...")
        # Get top N features and their indices
        top_features = [f[0] for f in feature_importance[:n_features]]
        feature_indices = [feature_names.index(f) for f in top_features]

        # Extract subset using numpy indexing
        X_subset = X[:, feature_indices]

        model = XGBClassifier(
            tree_method='hist',
            device='cuda:0',
            random_state=42,
            **best_params
        )

        scores = cross_val_score(
            model, X_subset, y_encoded,
            scoring='f1_macro',
            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        )

        dropping_results[n_features] = {
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'features': top_features
        }
        print(f"F1 Score: {scores.mean():.4f} ± {scores.std():.4f}")
        print("Features used:", top_features)

    return dropping_results

def run_pca(X_imputed, y_encoded):
    print("\n=== PCA Experiments ===")
    pca_results = {}

    # Convert to numpy if needed
    X = X_imputed.to_numpy() if hasattr(X_imputed, 'to_numpy') else np.array(X_imputed)

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    for threshold in [0.85, 0.90, 0.95]:
        print(f"\nTesting {threshold*100}% variance explained...")
        pca = PCA(n_components=threshold)
        X_pca = pca.fit_transform(X_scaled)

        model = XGBClassifier(
            tree_method='hist',
            device='cuda:0',
            random_state=42,
            **best_params
        )

        scores = cross_val_score(
            model, X_pca, y_encoded,
            scoring='f1_macro',
            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        )

        pca_results[threshold] = {
            'n_components': X_pca.shape[1],
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'explained_variance_ratio': pca.explained_variance_ratio_,
            'cumulative_variance': np.cumsum(pca.explained_variance_ratio_)
        }

        print(f"Components: {X_pca.shape[1]}")
        print(f"F1 Score: {scores.mean():.4f} ± {scores.std():.4f}")
        print(f"Cumulative variance: {pca_results[threshold]['cumulative_variance']}")

    return pca_results

# Known results from XGB
best_params = {
    'colsample_bytree': 0.8,
    'gamma': 0.5,
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_child_weight': 5,
    'n_estimators': 200,
    'subsample': 0.8
}

feature_importance = [
    ('Average Weekly Wage', 0.2197),
    ('Claim Identifier', 0.1565),
    ('Agreement Reached', 0.1382),
    ('Accident Date_Year', 0.0942),
    ('Attorney/Representative', 0.0893),
    ('Alternative Dispute Resolution', 0.0237),
    ('C-2 Date_Month', 0.0234),
    ('C-2 Date_Year', 0.0203),
    ('C-2 Date_Quarter', 0.0181),
    ('COVID-19 Indicator', 0.0143),
    ('Accident Date_Month', 0.0124),
    ('ADR_Clean', 0.0124),
    ('Accident Date', 0.0121),
    ('Industry Code', 0.0107),
    ('Carrier_Category', 0.0106)
]

def execute_experiments(X_imputed, y_encoded, feature_names):
    print("Starting experiments...")
    print(f"Input shape: {X_imputed.shape}")
    print(f"Number of features: {len(feature_names)}")
    print("Feature names:", feature_names)

    original_score = 0.4645

    dropping_results = run_feature_dropping(X_imputed, y_encoded, feature_names)
    pca_results = run_pca(X_imputed, y_encoded)

    print("\n=== Results Summary ===")
    print(f"Original Score ({len(feature_names)} features): {original_score:.4f}")

    print("\nFeature Dropping Results:")
    for n_features, res in dropping_results.items():
        print(f"{n_features} features: {res['mean_score']:.4f} ± {res['std_score']:.4f}")
        print(f"Features used: {res['features']}")

    print("\nPCA Results:")
    for threshold, res in pca_results.items():
        print(f"{threshold*100}% variance: {res['mean_score']:.4f} ± {res['std_score']:.4f} "
              f"({res['n_components']} components)")

    return {
        'original_score': original_score,
        'dropping_results': dropping_results,
        'pca_results': pca_results
    }

# Run experiments
results = execute_experiments(X_imputed, y_encoded, feature_names)

Starting experiments...
Input shape: (593471, 46)
Number of features: 46
Feature names: ['Accident Date', 'Age at Injury', 'Alternative Dispute Resolution', 'Assembly Date', 'Attorney/Representative', 'Average Weekly Wage', 'Birth Year', 'C-2 Date', 'Carrier Name', 'Carrier Type', 'Claim Identifier', 'County of Injury', 'COVID-19 Indicator', 'District Name', 'Gender', 'Industry Code', 'Industry Code Description', 'Medical Fee Region', 'WCIO Cause of Injury Code', 'WCIO Cause of Injury Description', 'WCIO Nature of Injury Code', 'WCIO Nature of Injury Description', 'WCIO Part Of Body Code', 'WCIO Part Of Body Description', 'Zip Code', 'Agreement Reached', 'WCB Decision', 'Number of Dependents', 'Accident Date_Month', 'Accident Date_Quarter', 'Accident Date_Year', 'Assembly Date_Month', 'Assembly Date_Quarter', 'Assembly Date_Year', 'C-2 Date_Month', 'C-2 Date_Quarter', 'C-2 Date_Year', 'Carrier_Category', 'County_Grouped', 'Zip_Region', 'Industry_Grouped', 'WCIO Cause of Injury Descript

Feature Dropping Results:
5 features: 0.3528 ± 0.0005
Features used: ['Average Weekly Wage', 'Claim Identifier', 'Agreement Reached', 'Accident Date_Year', 'Attorney/Representative']

10 features: 0.4138 ± 0.0009
Features used: ['Average Weekly Wage', 'Claim Identifier', 'Agreement Reached', 'Accident Date_Year', 'Attorney/Representative', 'Alternative Dispute Resolution', 'C-2 Date_Month', 'C-2 Date_Year', 'C-2 Date_Quarter', 'COVID-19 Indicator']

15 features: 0.4204 ± 0.0036
Features used: ['Average Weekly Wage', 'Claim Identifier', 'Agreement Reached', 'Accident Date_Year', 'Attorney/Representative', 'Alternative Dispute Resolution', 'C-2 Date_Month', 'C-2 Date_Year', 'C-2 Date_Quarter', 'COVID-19 Indicator', 'Accident Date_Month', 'ADR_Clean', 'Accident Date', 'Industry Code', 'Carrier_Category']

PCA Results:
85.0% variance: 0.3823 ± 0.0031 (18 components)
90.0% variance: 0.3945 ± 0.0020 (21 components)
95.0% variance: 0.3970 ± 0.0026 (24 components)
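In run_pca, StandardScaler and PCA are fitted on the full matrix before cross-validation, so each validation fold influences the transformation. A possible alternative is to wrap both steps and the classifier in a Pipeline so they are refit inside every fold. The sketch below uses synthetic data, and LogisticRegression stands in for XGBClassifier only to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_imputed / y_encoded (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=42
)

pca_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),             # keep 95% of the variance
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
])

# Scaler and PCA are refit on the training part of every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pca_scores = cross_val_score(pca_pipeline, X_demo, y_demo, scoring="f1_macro", cv=cv)
print(f"F1 (macro): {pca_scores.mean():.4f} ± {pca_scores.std():.4f}")
```

Any sklearn-compatible classifier can replace the last step, so the same pipeline could be passed to GridSearchCV with the XGBoost settings used earlier.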

In [None]:
# XGBoost Parameter Tuning Issues:

# # Current limited grid

# param_grid = {
#     'subsample': [0.8],        # Single value!
#     'colsample_bytree': [0.8], # Single value!
#     'gamma': [0.1, 0.5]        # Large jump

# }

# # Improved grid

# improved_param_grid = {

#     'subsample': [0.6, 0.7, 0.8, 0.9],
#     'colsample_bytree': [0.6, 0.7, 0.8, 0.9],
#     'gamma': [0, 0.1, 0.2, 0.3],
#     'min_child_weight': [1, 3, 5, 7],
#     'max_depth': [3, 4, 5, 6]

# }

# PCA Implementation Issues:

# # Current issues:
# - Duplicate date features
# - Raw and processed versions mixed
# - Potential leakage (Claim Identifier)

# # Improvements

# def improved_feature_engineering():

#     # 1. Date feature consolidation
#     date_features = {
#         'accident_date': ['Accident Date_Year', 'Accident Date_Month', 'Accident Date_Quarter'],
#         'c2_date': ['C-2 Date_Year', 'C-2 Date_Month', 'C-2 Date_Quarter']

#     }


#     # 2. Feature interactions

#     interaction_features = [
#         ('Average Weekly Wage', 'Industry_Grouped'),
#         ('Age at Injury', 'WCIO Nature of Injury Description_Grouped')
#     ]
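The two ideas noted in the comments above are removing the identifier column and adding interaction features. They could look like the following sketch. The toy frame and the derived column name Wage_vs_Industry_Mean are hypothetical, and the group-mean difference is only one of several ways to combine Average Weekly Wage with Industry_Grouped.

```python
import pandas as pd

# Hypothetical miniature frame using column names from the notebook
df_demo = pd.DataFrame({
    "Claim Identifier": [101, 102, 103],
    "Average Weekly Wage": [500.0, 0.0, 1200.0],
    "Industry_Grouped": [3, 1, 3],
})

# 1. Drop the identifier: it labels rows rather than describing the claim
features = df_demo.drop(columns=["Claim Identifier"])

# 2. Interaction: wage relative to the mean wage of the same industry group
group_mean = features.groupby("Industry_Grouped")["Average Weekly Wage"].transform("mean")
features["Wage_vs_Industry_Mean"] = features["Average Weekly Wage"] - group_mean

print(features)
```

On real data the group means should be computed on the training split only and then mapped onto the test split.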

In [91]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
import matplotlib.pyplot as plt

def compare_pca_performance(X_imputed, optimal_params, y_encoded, variance_threshold=0.95):
    """
    Compare model performance with and without PCA
    """
    # X_imputed holds features only (the target was split off during encoding)
    X = X_imputed
    y = y_encoded

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Initialize cross-validation
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

    # Get baseline score with all features
    xgb = XGBClassifier(**optimal_params, random_state=42)
    baseline_scores = cross_val_score(xgb, X_scaled, y, cv=cv, scoring='f1_weighted')

    print("Baseline Performance (All Features):")
    print(f"F1 Score: {baseline_scores.mean():.4f} (+/- {baseline_scores.std() * 2:.4f})")
    print(f"Number of features: {X.shape[1]}")

    # Apply PCA
    pca = PCA()
    X_pca = pca.fit_transform(X_scaled)

    # Calculate cumulative explained variance
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumulative_variance >= variance_threshold) + 1

    # Reduce to selected number of components
    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_scaled)

    # Get score with reduced features
    pca_scores = cross_val_score(xgb, X_reduced, y, cv=cv, scoring='f1_weighted')

    print(f"\nPCA Performance ({variance_threshold*100}% variance explained):")
    print(f"F1 Score: {pca_scores.mean():.4f} (+/- {pca_scores.std() * 2:.4f})")
    print(f"Number of components used: {n_components}")

    # Plot explained variance
    plt.figure(figsize=(10, 6))
    plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
            np.cumsum(pca.explained_variance_ratio_), 'bo-')
    plt.axhline(y=variance_threshold, color='r', linestyle='--',
               label=f'{variance_threshold*100}% Variance Threshold')
    plt.axvline(x=n_components, color='g', linestyle='--',
               label=f'Selected Components: {n_components}')
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance Ratio')
    plt.title('PCA Explained Variance')
    plt.grid(True)
    plt.legend()
    plt.show()

    return {
        'baseline_score': baseline_scores.mean(),
        'pca_score': pca_scores.mean(),
        'n_components': n_components,
        'variance_explained': variance_threshold
    }

# Example usage with your optimal parameters
optimal_params = {
    'max_depth': 3,  # replace with your optimal parameters
    'learning_rate': 0.1,
    'n_estimators': 100
}

# y_encoded is a required argument; pass it by keyword for clarity
results = compare_pca_performance(X_imputed, optimal_params, y_encoded=y_encoded)

In [86]:
X_imputed.columns

Index(['Accident Date', 'Age at Injury', 'Alternative Dispute Resolution',
       'Assembly Date', 'Attorney/Representative', 'Average Weekly Wage',
       'Birth Year', 'C-2 Date', 'Carrier Name', 'Carrier Type',
       'Claim Identifier', 'County of Injury', 'COVID-19 Indicator',
       'District Name', 'Gender', 'Industry Code', 'Industry Code Description',
       'Medical Fee Region', 'WCIO Cause of Injury Code',
       'WCIO Cause of Injury Description', 'WCIO Nature of Injury Code',
       'WCIO Nature of Injury Description', 'WCIO Part Of Body Code',
       'WCIO Part Of Body Description', 'Zip Code', 'Agreement Reached',
       'WCB Decision', 'Number of Dependents', 'Accident Date_Month',
       'Accident Date_Quarter', 'Accident Date_Year', 'Assembly Date_Month',
       'Assembly Date_Quarter', 'Assembly Date_Year', 'C-2 Date_Month',
       'C-2 Date_Quarter', 'C-2 Date_Year', 'Carrier_Category',
       'County_Grouped', 'Zip_Region', 'Industry_Grouped',
       'WCIO Cause of

In [None]:
# Detailed Results:
# --------------------------------------------------
# Feature Scores by Set Size:
# 3 features: 0.2612
# 4 features: 0.2706
# 5 features: 0.2750
# 6 features: 0.3113
# 7 features: 0.3306
# 8 features: 0.3383
# 9 features: 0.3536
# 10 features: 0.3528
# 11 features: 0.3587

# Selected Features Summary:
# --------------------------------------------------
# Average Weekly Wage            0.2075 (30.9% of total)
# Attorney/Representative        0.0667 (9.9% of total)
# Zip Code                       0.0581 (8.7% of total)
# Age at Injury                  0.0527 (7.9% of total)
# Birth Year                     0.0511 (7.6% of total)
# WCIO Part Of Body Code         0.0484 (7.2% of total)
# WCIO Cause of Injury Code      0.0483 (7.2% of total)
# IME-4 Count                    0.0379 (5.7% of total)
# Agreement Reached              0.0362 (5.4% of total)
# Number of Dependents           0.0351 (5.2% of total)
# Industry Code                  0.0285 (4.3% of total)

# Filtered Dataset Shape: (574026, 12)

IndentationError: expected an indented block after function definition on line 170 (<ipython-input-56-e6c7425242a0>, line 171)

Test Data Information:
--------------------------------------------------

Columns and Data Types:
Accident Date                                 object
Age at Injury                                float64
Alternative Dispute Resolution                object
Assembly Date                                 object
Attorney/Representative                       object
Average Weekly Wage                          float64
Birth Year                                   float64
C-2 Date                                      object
C-3 Date                                      object
Carrier Name                                  object
Carrier Type                                  object
Claim Identifier                               int64
County of Injury                              object
COVID-19 Indicator                            object
District Name                                 object
First Hearing Date                            object
Gender                                        object


Model saved successfully
Model save verified - able to load successfully

Model Parameters:
n_estimators: 100
max_depth: 20
Number of encoders: 3
