# usage: 

1. fetch the feature-engineered dataset
2. select the features to use
3. run validation: `validate_data(sales_df, outcome, numerical_features, categorical_features)`
4. create a preprocessor
5. prepare for machine learning (standard scaler, one-hot encoding, train/test split)
6. run the logistic regression baseline (its `lr_metrics` and `lr_test_proba` are passed in for comparison), then the random forest analysis
```python
rf_model, rf_metrics, rf_feature_importance, rf_threshold, rf_test_proba, rf_deciles = run_random_forest_analysis(
    X_train_processed,
    X_test_processed,
    y_train,
    y_test,
    feature_names, 
    lr_metrics,    # for comparison
    lr_test_proba  # for comparison
)
```
7. study performance metrics, compare against baseline model
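
steps 4 and 5 can be sketched with plain scikit-learn. this is only a minimal stand-in: `create_preprocessor` and `prepare_data` live in `src/pre_processing.py` and are not shown here, so the toy data, the `demo_*` names, `drop="first"` and the 75/25 stratified split are all assumptions (the split ratio at least matches the 37500/12500 sizes logged further down).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy stand-in for sales_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "salary": np.linspace(20_000, 98_000, 40),
    "channel": ["web", "radio", "friend"] * 13 + ["web"],
    "has_placed_order": [i % 4 == 0 for i in range(40)],
})
demo_numerical = ["salary"]
demo_categorical = ["channel"]

# scale numerical columns, one-hot encode categorical ones
# (drop="first" is an assumption about the real preprocessor)
demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), demo_numerical),
        ("cat", OneHotEncoder(drop="first"), demo_categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = toy_df[demo_numerical + demo_categorical]
y = toy_df["has_placed_order"]

# stratify so train and test keep the same conversion rate
X_train, X_test, y_train_demo, y_test_demo = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# fit on the training split only, then reuse the fitted transformer on test
X_train_demo = demo_preprocessor.fit_transform(X_train)
X_test_demo = demo_preprocessor.transform(X_test)
print(X_train_demo.shape, X_test_demo.shape)
```

fitting the scaler and encoder on the training split only keeps test-set statistics out of the features the model is trained on.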

In [1]:
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

In [2]:
import duckdb
import pandas as pd
from pre_processing import (
    validate_data,                       
    create_preprocessor,                 
    get_feature_names,    
    prepare_data
)
from train_logistic_regression import run_logistic_regression_analysis
from train_random_forest       import run_random_forest_analysis

In [3]:
# fetch the feature-engineered table
with duckdb.connect(database='../data/propensity_to_buy.duckdb', read_only=True) as con:
    sales_df = con.sql("SELECT * FROM feature_engineered;").df()

# cast the time-derived columns to categorical in one call
# (need to reapply categorical conversion?)
sales_df = sales_df.astype({
    'day_of_week': 'category',
    'entry_hour':  'category',
    'entry_month': 'category',
})

display(sales_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   has_placed_order           50000 non-null  bool    
 1   salary                     50000 non-null  float64 
 2   n_engaged_minutes          50000 non-null  float64 
 3   n_of_cars_viewed           50000 non-null  int64   
 4   day_of_week                50000 non-null  category
 5   price_of_last_car_viewed   50000 non-null  float64 
 6   step_reached_in_website    50000 non-null  category
 7   how_did_you_hear_about_us  50000 non-null  category
 8   entry_month                50000 non-null  category
 9   entry_hour                 50000 non-null  category
 10  age_at_entry               50000 non-null  float64 
 11  area_code                  50000 non-null  object  
 12  entry_year                 50000 non-null  int64   
 13  minutes_per_car            5000

None

In [4]:
outcome = 'has_placed_order'
numerical_features = [
    'salary',
    'n_engaged_minutes',
    'n_of_cars_viewed', 
    'price_of_last_car_viewed',
    'age_at_entry',
    'entry_year',
    'minutes_per_car',
    'affordability_ratio'
]

categorical_features = [
    'day_of_week',
    'step_reached_in_website', 
    'how_did_you_hear_about_us', 
    # 'entry_month',
    # 'entry_hour', # try skipping this (cardinality:24)
    'area_code',
    # 'industry_categories',
]

print(f"numerical features   ({len(numerical_features)}): {numerical_features}")
print(f"categorical features ({len(categorical_features)}): {categorical_features}")

numerical features   (8): ['salary', 'n_engaged_minutes', 'n_of_cars_viewed', 'price_of_last_car_viewed', 'age_at_entry', 'entry_year', 'minutes_per_car', 'affordability_ratio']
categorical features (4): ['day_of_week', 'step_reached_in_website', 'how_did_you_hear_about_us', 'area_code']


In [5]:
validate_data(sales_df, outcome, numerical_features, categorical_features)
preprocessor = create_preprocessor(numerical_features, categorical_features)
X_train_processed, X_test_processed, y_train, y_test, fitted_preprocessor, feature_names = prepare_data(
    sales_df,
    preprocessor,
    outcome,
    numerical_features,
    categorical_features,
)

=== validation ===
dataset shape: (50000, 16)

data contains no missing values ✓

target distribution:
>   false (unconverted): 43,500
>   true  (ordered):      6,500
>   conversion rate:     13.000%

categorical variable cardinalities:
>   day_of_week: 7 unique values
>   step_reached_in_website: 4 unique values
>   how_did_you_hear_about_us: 5 unique values
>   area_code: 9 unique values
=== validation completed ===

dataset shape:       (50000, 15)
target distribution: {False: 43500, True: 6500}
conversion rate:     0.130

after split:
training set: 37500 samples (0.130 conversion rate)
test set:     12500  samples (0.130  conversion rate)

after preprocessing:
training features shape: (37500, 29)
test features shape:     (12500, 29)
num features after preprocessing: 29
  - numerical features: 8
  - categorical features (after one-hot): 21
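
the 29 columns can be checked by hand from the cardinalities printed during validation. the four categorical features have 7 + 4 + 5 + 9 = 25 levels in total, so 21 one-hot columns is what you get if one level per feature is dropped. that the encoder drops a level is an inference from the numbers, not something confirmed by the preprocessing code.

```python
# cardinalities as printed by validate_data above
cardinalities = {
    "day_of_week": 7,
    "step_reached_in_website": 4,
    "how_did_you_hear_about_us": 5,
    "area_code": 9,
}
n_numerical = 8

# one column per level vs. one level dropped per feature (assumed)
one_hot_all_levels = sum(cardinalities.values())
one_hot_one_dropped = sum(k - 1 for k in cardinalities.values())
n_features_total = n_numerical + one_hot_one_dropped

print(one_hot_all_levels, one_hot_one_dropped, n_features_total)
```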


In [6]:
lr_model, lr_metrics, lr_feature_importance, lr_threshold, lr_test_proba = (
    run_logistic_regression_analysis(
        X_train_processed,
        X_test_processed,
        y_train,
        y_test,
        feature_names
    )
)

=== LOGISTIC REGRESSION BASELINE ===

training complete
training samples: 37500
test samples:     12500
features:         29

=== LOGISTIC-REGRESSION PERFORMANCE EVALUATION ===

TRAINING SET PERFORMANCE:
>   ROC-AUC:   0.8700
>   PR-AUC:    0.3899
>   precision: 0.3679
>   recall:    0.8345
>   f1-score:  0.5106

TEST SET PERFORMANCE:
>   ROC-AUC:   0.8666
>   PR-AUC:    0.3773
>   precision: 0.3682
>   recall:    0.8283
>   f1-score:  0.5098

CONFUSION MATRIX (test set):
[[8565 2310]
 [ 279 1346]]
true  negatives: 8565, false positives: 2310
false negatives: 279, true  positives: 1346

BUSINESS METRICS:
>   total actual conversions:    1625
>   total predicted conversions: 3656
>   conversions captured:        1346 out of 1625 (82.8%)
>   precision in predictions:    1346 out of 3656 (36.8%)

=== FEATURE IMPORTANCE ===

TOP 15 MOST IMPORTANT FEATURES:
                                feature  coefficient
                                 salary     1.814587
step_reached_in_website_quote
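
the test-set precision, recall and f1 reported above follow directly from the confusion matrix, which makes for a quick sanity check. the four counts are copied from the output; the variable names are only illustrative.

```python
# test-set confusion matrix of the logistic regression baseline
tn, fp, fn, tp = 8565, 2310, 279, 1346

# precision: share of predicted conversions that really converted
precision = tp / (tp + fp)
# recall: share of actual conversions that were captured
recall = tp / (tp + fn)
# f1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1-score:  {f1:.4f}")
```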

In [7]:
rf_model, rf_metrics, rf_feature_importance, rf_threshold, rf_test_proba, rf_deciles = run_random_forest_analysis(
    X_train_processed,
    X_test_processed,
    y_train,
    y_test,
    feature_names, 
    lr_metrics,    # for comparison
    lr_test_proba  # for comparison
)


=== RANDOM FOREST MODEL ===

training random forest...
random forest model ready
out-of-bag score not available

=== RANDOM-FOREST PERFORMANCE EVALUATION ===

TRAINING SET PERFORMANCE:
>   ROC-AUC:   0.8876
>   PR-AUC:    0.4869
>   precision: 0.3465
>   recall:    0.9003
>   f1-score:  0.5005

TEST SET PERFORMANCE:
>   ROC-AUC:   0.8558
>   PR-AUC:    0.3718
>   precision: 0.3353
>   recall:    0.8529
>   f1-score:  0.4814

CONFUSION MATRIX (test set):
[[8128 2747]
 [ 239 1386]]
true  negatives: 8128, false positives: 2747
false negatives: 239, true  positives: 1386

BUSINESS METRICS:
>   total actual conversions:    1625
>   total predicted conversions: 4133
>   conversions captured:        1386 out of 1625 (85.3%)
>   precision in predictions:    1386 out of 4133 (33.5%)

=== RANDOM FOREST FEATURE IMPORTANCE ===

TOP 15 IMPORTANT FEATURES:
                                feature  importance
                                 salary    0.372047
                    affordability_ratio  
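
the log line "out-of-bag score not available" suggests the forest was fit without out-of-bag scoring. how `run_random_forest_analysis` configures the model is not shown here, so the following is only a sketch on synthetic data of how scikit-learn exposes the score; the dataset, the hyperparameters and the `oob_forest` name are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic, imbalanced data (roughly 13% positives), not the sales data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.87, 0.13], random_state=0
)

# oob_score=True requires bootstrap=True (the default): each tree is
# evaluated on the rows left out of its bootstrap sample
oob_forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
oob_forest.fit(X_demo, y_demo)

# by default oob_score_ is accuracy on the out-of-bag predictions
print(f"out-of-bag accuracy: {oob_forest.oob_score_:.4f}")
```

the out-of-bag estimate gives a validation-like number without touching the test set, which can help when judging how much a forest overfits.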

# lesson 
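again, the most important features make sense. and yet,  
random forest does not perform any better than logistic regression on the test set (ROC-AUC 0.8558 vs 0.8666, PR-AUC 0.3718 vs 0.3773, f1 0.4814 vs 0.5098). there are hints of overfitting: the random forest's PR-AUC drops from 0.4869 on the training set to 0.3718 on the test set, while logistic regression stays almost flat (0.3899 to 0.3773). taking nonlinearity into account didn't help, so maybe those effects are less severe than expected.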

- limited signal?
- data not capturing the signal?
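
the overfitting hint can be put into numbers by comparing train and test metrics for both models. the values below are copied from the evaluation output above; the dictionary layout is only illustrative.

```python
# train/test metrics as printed in the evaluation output above
reported = {
    "logistic regression": {
        "train": {"roc_auc": 0.8700, "pr_auc": 0.3899},
        "test":  {"roc_auc": 0.8666, "pr_auc": 0.3773},
    },
    "random forest": {
        "train": {"roc_auc": 0.8876, "pr_auc": 0.4869},
        "test":  {"roc_auc": 0.8558, "pr_auc": 0.3718},
    },
}

# generalization gap = train metric minus test metric
gaps = {
    model: {
        metric: round(scores["train"][metric] - scores["test"][metric], 4)
        for metric in scores["train"]
    }
    for model, scores in reported.items()
}

for model, gap in gaps.items():
    print(f"{model:20s} ROC-AUC gap: {gap['roc_auc']:.4f}  PR-AUC gap: {gap['pr_auc']:.4f}")
```

the random forest's gap is roughly nine times larger than the baseline's on both metrics, while its test scores are slightly lower.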

next we try boosting. 