# usage

1. fetch the feature-engineered data set
2. select the features to use
3. run validation: `validate_data(sales_df, outcome, numerical_features, categorical_features)`
4. create a preprocessor
5. prepare for machine learning (standard scaler, one-hot encoding, train/test split)
6. run the baseline models (logistic regression, random forest), then the xgboost analysis, which takes their metrics and test probabilities as inputs:
```
xgb_model, xgb_metrics, xgb_feature_importance, xgb_threshold, xgb_test_proba, xgb_deciles, model_comparison = run_xgboost_analysis(
    X_train_processed,
    X_test_processed,
    y_train,
    y_test,
    feature_names,
    lr_metrics,
    rf_metrics,
    lr_test_proba,
    rf_test_proba
)
```
7. study the performance metrics and compare them against the baseline models

In [1]:
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

In [2]:
import duckdb
import pandas as pd
from pre_processing import (
    validate_data,                       
    create_preprocessor,                 
    get_feature_names,    
    prepare_data
)
from train_logistic_regression import run_logistic_regression_analysis
from train_random_forest       import run_random_forest_analysis
from train_xgboost             import run_xgboost_analysis

In [3]:
# fetch the feature-engineered table from duckdb
with duckdb.connect(database='../data/propensity_to_buy.duckdb', read_only=True) as con:
    sales_df = con.sql("SELECT * FROM feature_engineered;").df()

# re-cast the calendar columns to category after loading;
# open question: is this re-cast needed, or does the dtype survive the duckdb round trip?
sales_df = sales_df.astype({
    'day_of_week': 'category',
    'entry_hour':  'category',
    'entry_month': 'category',
})

sales_df.info()  # info() prints directly and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   has_placed_order           50000 non-null  bool    
 1   salary                     50000 non-null  float64 
 2   n_engaged_minutes          50000 non-null  float64 
 3   n_of_cars_viewed           50000 non-null  int64   
 4   day_of_week                50000 non-null  category
 5   price_of_last_car_viewed   50000 non-null  float64 
 6   step_reached_in_website    50000 non-null  category
 7   how_did_you_hear_about_us  50000 non-null  category
 8   entry_month                50000 non-null  category
 9   entry_hour                 50000 non-null  category
 10  age_at_entry               50000 non-null  float64 
 11  area_code                  50000 non-null  object  
 12  entry_year                 50000 non-null  int64   
 13  minutes_per_car            5000

In [4]:
outcome = 'has_placed_order'
numerical_features = [
    'salary',
    'n_engaged_minutes',
    'n_of_cars_viewed', 
    'price_of_last_car_viewed',
    'age_at_entry',
    'entry_year',
    'minutes_per_car',
    'affordability_ratio'
]

categorical_features = [
    'day_of_week',
    'step_reached_in_website', 
    'how_did_you_hear_about_us', 
    # 'entry_month',
    # 'entry_hour', # try skipping this (cardinality:24)
    'area_code',
    # 'industry_categories',
]

print(f"numerical features   ({len(numerical_features)}): {numerical_features}")
print(f"categorical features ({len(categorical_features)}): {categorical_features}")

numerical features   (8): ['salary', 'n_engaged_minutes', 'n_of_cars_viewed', 'price_of_last_car_viewed', 'age_at_entry', 'entry_year', 'minutes_per_car', 'affordability_ratio']
categorical features (4): ['day_of_week', 'step_reached_in_website', 'how_did_you_hear_about_us', 'area_code']


In [5]:
validate_data(sales_df, outcome, numerical_features, categorical_features)
preprocessor = create_preprocessor(numerical_features, categorical_features)
X_train_processed, X_test_processed, y_train, y_test, fitted_preprocessor, feature_names = prepare_data(sales_df, preprocessor, outcome, numerical_features, categorical_features)

=== validation ===
dataset shape: (50000, 16)

data contains no missing values ✓

target distribution:
>   false (unconverted): 43,500
>   true  (ordered):      6,500
>   conversion rate:     13.000%

categorical variable cardinalities:
>   day_of_week: 7 unique values
>   step_reached_in_website: 4 unique values
>   how_did_you_hear_about_us: 5 unique values
>   area_code: 9 unique values
=== validation completed ===

dataset shape:       (50000, 15)
target distribution: {False: 43500, True: 6500}
conversion rate:     0.130

after split:
training set: 37500 samples (0.130 conversion rate)
test set:     12500  samples (0.130  conversion rate)

after preprocessing:
training features shape: (37500, 29)
test features shape:     (12500, 29)
num features after preprocessing: 29
  - numerical features: 8
  - categorical features (after one-hot): 21
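
The 29 columns can be checked by hand from the cardinalities printed by `validate_data`. The four categorical features have 25 levels in total, so 21 one-hot columns is consistent with one level being dropped per feature. That reading is an inference from the numbers, not something the output states.

```python
# cardinalities as reported in the validation output above
cardinalities = {
    "day_of_week": 7,
    "step_reached_in_website": 4,
    "how_did_you_hear_about_us": 5,
    "area_code": 9,
}
n_numerical = 8

levels_total = sum(cardinalities.values())                  # every level gets a column
levels_dropped = sum(k - 1 for k in cardinalities.values())  # one level dropped per feature
n_features = n_numerical + levels_dropped

print(f"one-hot columns, all levels kept:  {levels_total}")
print(f"one-hot columns, one dropped each: {levels_dropped}")
print(f"total features:                    {n_features}")
```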


In [6]:
lr_model, lr_metrics, lr_feature_importance, lr_threshold, lr_test_proba = (
    run_logistic_regression_analysis(
        X_train_processed,
        X_test_processed,
        y_train,
        y_test,
        feature_names
    )
)
rf_model, rf_metrics, rf_feature_importance, rf_threshold, rf_test_proba, rf_deciles = run_random_forest_analysis(
    X_train_processed,
    X_test_processed,
    y_train,
    y_test,
    feature_names, 
    lr_metrics,   
    lr_test_proba 
)


=== LOGISTIC REGRESSION BASELINE ===

training complete
training samples: 37500
test samples:     12500
features:         29

=== LOGISTIC-REGRESSION PERFORMANCE EVALUATION ===

TRAINING SET PERFORMANCE:
>   ROC-AUC:   0.8700
>   PR-AUC:    0.3899
>   precision: 0.3679
>   recall:    0.8345
>   f1-score:  0.5106

TEST SET PERFORMANCE:
>   ROC-AUC:   0.8666
>   PR-AUC:    0.3773
>   precision: 0.3682
>   recall:    0.8283
>   f1-score:  0.5098

CONFUSION MATRIX (test set):
[[8565 2310]
 [ 279 1346]]
true  negatives: 8565, false positives: 2310
false negatives: 279, true  positives: 1346

BUSINESS METRICS:
>   total actual conversions:    1625
>   total predicted conversions: 3656
>   conversions captured:        1346 out of 1625 (82.8%)
>   precision in predictions:    1346 out of 3656 (36.8%)

=== FEATURE IMPORTANCE ===

TOP 15 MOST IMPORTANT FEATURES:
                                feature  coefficient
                                 salary     1.814587
step_reached_in_website_quote

In [7]:
xgb_model, xgb_metrics, xgb_feature_importance, xgb_threshold, xgb_test_proba, xgb_deciles, model_comparison = run_xgboost_analysis(
    X_train_processed,
    X_test_processed,
    y_train,
    y_test,
    feature_names,
    lr_metrics,
    rf_metrics,
    lr_test_proba,
    rf_test_proba
)

=== XGBOOST MODEL ===

class imbalance ratio: 6.69
training xgboost with early stopping 
training completed at iteration 52
model training complete
best iteration: 52
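
The class imbalance ratio of 6.69 can be reproduced from counts already shown: 6,500 of 43,500 + 6,500 leads converted, and the test-set confusion matrices contain 1,625 positives and 10,875 negatives, which leaves the rest for training. Whether `run_xgboost_analysis` passes this ratio to XGBoost's `scale_pos_weight` parameter is an assumption; that is a common use for it.

```python
# totals from the validation output, test counts from the confusion matrices
total_neg, total_pos = 43_500, 6_500
test_neg, test_pos = 10_875, 1_625

train_neg = total_neg - test_neg  # negatives left for training
train_pos = total_pos - test_pos  # positives left for training

# negatives per positive in the training set
imbalance_ratio = train_neg / train_pos

print(f"training negatives: {train_neg}")
print(f"training positives: {train_pos}")
print(f"class imbalance ratio: {imbalance_ratio:.2f}")
```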

=== XGBOOST PERFORMANCE EVALUATION ===

TRAINING SET PERFORMANCE:
>   ROC-AUC:   0.9051
>   PR-AUC:    0.5261
>   precision: 0.3640
>   recall:    0.9346
>   f1-score:  0.5240

TEST SET PERFORMANCE:
>   ROC-AUC:   0.8627
>   PR-AUC:    0.3818
>   precision: 0.3420
>   recall:    0.8695
>   f1-score:  0.4909

CONFUSION MATRIX (test set):
[[8156 2719]
 [ 212 1413]]
true  negatives: 8156, false positives: 2719
false negatives: 212, true  positives: 1413

BUSINESS METRICS:
>   total actual conversions:    1625
>   total predicted conversions: 4132
>   conversions captured:        1413 out of 1625 (87.0%)
>   precision in predictions:    1413 out of 4132 (34.2%)

=== XGBOOST FEATURE IMPORTANCE ANALYSIS ===


TOP 20 FEATURES BY WEIGHT (number of times feature is used to split):
                                feature  importan
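
The precision, recall and f1 figures reported for the test set follow directly from the two confusion matrices. The helper below (`scores_from_confusion` is a name made up for this sketch) recomputes them for logistic regression and xgboost side by side.

```python
def scores_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp)  # share of predicted conversions that are real
    recall = tp / (tp + fn)     # share of real conversions that are captured
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# test-set confusion matrices copied from the outputs above
lr_scores = scores_from_confusion(tn=8565, fp=2310, fn=279, tp=1346)
xgb_scores = scores_from_confusion(tn=8156, fp=2719, fn=212, tp=1413)

for name, (precision, recall, f1) in [("logistic regression", lr_scores),
                                      ("xgboost", xgb_scores)]:
    print(f"{name:20s} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

xgboost captures 67 more conversions than logistic regression (1,413 vs 1,346) at the cost of 409 more false positives (2,719 vs 2,310), which is why its recall is higher while its precision and f1 are lower.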

# lesson
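xgboost confirms what the earlier models already showed, but it does not produce a better predictor: test ROC-AUC is 0.8627 against 0.8666 for logistic regression, and test PR-AUC is 0.3818 against 0.3773. recall is decent (0.8695), but precision is unimpressive (0.3420), so roughly two out of three predicted conversions are false positives. the gap between training and test PR-AUC (0.5261 vs 0.3818) also suggests some overfitting.

what might help:
- cross-validation, for more stable performance estimates and for hyperparameter tuning
- further data wrangling to build richer features, including interaction terms
- acquiring more information on the leads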
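
As a starting point for the first item, here is a minimal sketch of stratified k-fold cross-validation scored on PR-AUC. It runs on synthetic data with roughly the same 13% positive rate and uses a plain logistic regression as a placeholder model; in the notebook, the processed training matrix and the actual models would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in for the processed training data (about 13% positives)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.87, 0.13], random_state=0
)

# stratified folds keep the positive rate roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# placeholder model; class_weight="balanced" compensates for the imbalance
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# "average_precision" is scikit-learn's scorer for the area under the PR curve
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")

print(f"PR-AUC per fold: {scores.round(3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The spread across folds gives a sense of how much of a difference like 0.3818 vs 0.3773 could be split-to-split noise.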
