## Cross-Validation Plan
- Hyperparameter tuning is left for the next workbook. For now, we run quick initial Cross-Validation tests to get a sense of what to include in our starting transformer/pipeline design. Specifically, we address the following choices:
    - StandardScaler vs. RobustScaler (plus MinMaxScaler in sets 11-12)
    - Transformer for Numerical Features (e.g. Log, Logit, Quantile, Yeo-Johnson)
    - Encoder type for features with a high number of categories (Binary vs. Target vs. CatBoost)
    - SelectFromModel vs. RFE vs. SequentialFeatureSelector

In [6]:
#import custom modules
from importlib import reload
from helpers.my_imports import *
import helpers.preprocessing as pp
import helpers.plot as plot
import helpers.tools as tools
import helpers.transformers as xfrs
from helpers.reload import myreload

#make sure latest copy of library is loaded
myreload()

#Global Variable for Random State
rs=42 #random_state

#Reload dataframe
df = pd.read_csv('saved_dfs/preprocessed_negotiations_df.csv')
df.head(2)

Reloaded helpers.preprocessing, helpers.plots, and helpers.tools.


Unnamed: 0,claim_type,NSA_NNSA,split_claim,negotiation_type,in_response_to,level,facility,carrier,group_number,plan_funding,TPA,TPA_rep,billed_amount,negotiation_amount,offer,counter_offer,decision,service_days,decision_days,offer_days,counter_offer_days,YOB,neg_to_billed,offer_to_neg,offer_to_counter_offer
0,HCFA,NNSA,No,NNSA Negotiation,Insurance Initiated,Level 3,Cedar Hill,Cigna,3344605,FULLY,Zelis,Marissa Pepe,4058.0,4058.0,258.0,3449.0,Rejected,128.0,,0.0,0.0,1984,1.0,0.0636,0.0748
1,UB,NNSA,No,NNSA Negotiation,Insurance Initiated,Level 5,Cedar Hill,Blue Cross Blue Shield,174518M3BH,SELF,Zelis,Courtney Kiernan,52253.0,52253.0,12500.0,44415.0,Rejected,127.0,,2.0,2.0,2021,1.0,0.2392,0.2814


## Define X, y
Redefine X and y based on what we learned in Feature Engineering

In [7]:
#Define X and y
X,y=df.drop(columns=['decision', 'billed_amount', 'negotiation_amount', 'offer', 'counter_offer']), df.decision

#Split and stratify the data
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6398, 20), (2133, 20), (6398,), (2133,))

### Define Scoring Metrics

In [8]:
scoring_metrics = {
    'f1': make_scorer(f1_score, average='binary', pos_label='Accepted', zero_division=0.0),
    'precision': make_scorer(precision_score, average='binary', pos_label='Accepted', zero_division=0.0)
}
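
The dictionary above scores only the `Accepted` class, whereas the cross-validation calls later in this notebook pass the built-in `'f1_weighted'` and `'precision_weighted'` scorer names. As a small illustration of how the two averaging modes differ, the sketch below computes both on toy labels that are invented for this example and are not taken from the negotiations data.

```python
from sklearn.metrics import f1_score

# Toy labels, invented purely for illustration
y_true = ['Accepted', 'Rejected', 'Rejected', 'Accepted', 'Rejected', 'Rejected']
y_pred = ['Accepted', 'Rejected', 'Accepted', 'Rejected', 'Rejected', 'Rejected']

# Binary F1: only the 'Accepted' class counts
f1_binary = f1_score(y_true, y_pred, average='binary',
                     pos_label='Accepted', zero_division=0.0)

# Weighted F1: per-class F1 averaged by class support
f1_weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0.0)

print(f1_binary, f1_weighted)
```

On imbalanced data the weighted score is dominated by the majority class, so the two numbers should not be compared directly.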

## Cross-Validation  
We start with the models shown below, chosen for the following reasons:
- Logistic Regression is a solid baseline and is interpretable through its coefficients
- RidgeClassifier: a PyCaret session showed that RidgeClassifier can reach Precision values of 98.4%. While Precision is not our primary metric, the Billing Team prefers high Precision over high Recall, so a model with very high Precision is worth considering
- Random Forest and Gradient Boost stood out in the Cross-Validation runs of our Feature Engineering notebook across different numbers of selected features

In [9]:
default_models = {
    'Logistic Regression': LogisticRegression(max_iter=200, random_state=rs), 
    'Ridge Classifier': RidgeClassifier(random_state=rs),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=rs, n_jobs=-1), 
    'Gradient Boost': GradientBoostingClassifier(n_estimators=100, random_state=rs)
}

For the first 12 sets we simply try combinations of the Scalers and Transformers below without changing anything else:
#### Scalers:
- StandardScaler
- RobustScaler
- MinMaxScaler (sets 11-12 only)
#### Data Transformers:
- No Transformation
- Log Transformation
- Quantile Transformation
- Logit Transformation
- Yeo-Johnson Transformation
     
Note: Below we use the same function we used during Feature Engineering
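
`tools.my_cross_val` is a project helper whose internals are not shown in this notebook. As a rough sketch only, and assuming it wraps something like scikit-learn's `cross_validate`, a single transformer/scaler combination could be scored as follows. The synthetic data, the step names, and the choice of Logistic Regression are illustrative assumptions; the resulting column names mirror those in the result tables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in data (assumption: not the negotiations dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# One transformer/scaler combination, followed by one model
pipe = Pipeline([
    ('yeo', PowerTransformer(method='yeo-johnson')),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=200, random_state=42)),
])

# Score on both metrics and keep train scores to spot overfitting
cv_out = cross_validate(pipe, X_demo, y_demo, cv=5,
                        scoring=['f1_weighted', 'precision_weighted'],
                        return_train_score=True)

# Average over the 5 folds, one value per column
summary = pd.DataFrame(cv_out).mean()
print(summary)
```

Note that `PowerTransformer` standardizes its output by default (`standardize=True`), so part of the scaling is already done before the scaler step runs.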

In [21]:
#Standard Scaler, No Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SelectFromModel
cv_set1_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=None, 
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,             
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', StandardScaler()),
                     
                   set_name='cv_set1', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,2.831143,0.011161,0.934704,0.95757,0.934508,0.957212
Ridge Classifier,2.905987,0.009594,0.935131,0.942005,0.934353,0.941166
Random Forest,3.253882,0.024372,0.93716,1.0,0.937431,1.0
Gradient Boost,3.630965,0.010625,0.936134,0.991355,0.935942,0.991415


In [22]:
#Robust Scaler, No Transform, No Category Combiner, No oversampling, No undersampling, BinaryEncoder, SelectFromModel
cv_set2_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=None, 
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,              
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', RobustScaler()),
                     
                   set_name='cv_set2', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,6.341008,0.009268,0.939435,0.956432,0.939618,0.956041
Ridge Classifier,6.231901,0.00892,0.936682,0.942784,0.935901,0.942068
Random Forest,6.384472,0.022562,0.934146,1.0,0.934532,1.0
Gradient Boost,7.528356,0.010465,0.939669,0.992664,0.939896,0.992722
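
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` message in the output above is what the L-BFGS solver reports when it stops at `max_iter` before converging. L-BFGS is the scikit-learn default for `LogisticRegression`, so the warning most likely comes from the `max_iter=200` model in `default_models` rather than from the `saga`-based selector; that attribution is an inference, not something the output states. The sketch below, on synthetic data with one deliberately badly scaled feature, shows one way to detect the warning programmatically. The helper name `fit_and_check` is hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data with one badly scaled feature (assumption, for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)
X_demo[:, 0] *= 1e4


def fit_and_check(max_iter):
    """Fit with the default lbfgs solver; return (iterations used, hit the limit?)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        model = LogisticRegression(max_iter=max_iter, random_state=42)
        model.fit(X_demo, y_demo)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return int(model.n_iter_[0]), hit_limit


for limit in (3, 200, 2000):
    print(limit, fit_and_check(limit))
```

If the limit is still hit at a large `max_iter`, rescaling the inputs is usually a better remedy than raising the limit further, as the warning text itself suggests.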


Random Forest responded better to StandardScaler, while the other models responded better to RobustScaler
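
For context on why the two scalers can behave differently, the sketch below applies both to a toy column containing one outlier. The values are invented for illustration, and the example does not by itself explain the score differences observed above.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column with a single large outlier (invented values)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler centers on the mean and divides by the standard deviation,
# both of which are pulled by the outlier
std_scaled = StandardScaler().fit_transform(x)

# RobustScaler centers on the median and divides by the interquartile range,
# which the outlier barely affects
rob_scaled = RobustScaler().fit_transform(x)

print(std_scaled.ravel())
print(rob_scaled.ravel())
```

With StandardScaler the four ordinary values are squeezed into a narrow band, while RobustScaler keeps them spread out and leaves the outlier far from the rest.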

In [23]:
#Standard Scaler, Log Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SelectFromModel
cv_set3_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('log', xfrs.LogTransformer(add_constant=1)),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,               
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', StandardScaler()),
                     
                   set_name='cv_set3', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,2.484171,0.008755,0.933545,0.956335,0.933495,0.955982
Ridge Classifier,2.514459,0.009947,0.934989,0.941001,0.93431,0.940241
Random Forest,2.598965,0.023844,0.938481,1.0,0.938965,1.0
Gradient Boost,3.214002,0.009805,0.934177,0.9899,0.93402,0.989979


Now that we are using ratio features, it looks like StandardScaler performs slightly better for LogisticRegression and GBC and slightly worse for RidgeClassifier and RandomForest

In [24]:
#Robust Scaler, Log Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SelectFromModel
cv_set4_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('log', xfrs.LogTransformer(add_constant=1)),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,             
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', RobustScaler()),
                     
                   set_name='cv_set4', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,4.408562,0.009209,0.938207,0.953368,0.938673,0.95288
Ridge Classifier,4.416818,0.009011,0.936455,0.940936,0.935899,0.94022
Random Forest,4.438211,0.021558,0.935863,1.0,0.936549,1.0
Gradient Boost,5.042772,0.009629,0.937177,0.990861,0.937856,0.990929


In [25]:
#Standard Scaler, Quantile Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SFM
cv_set5_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('qt', QuantileTransformer(output_distribution= 'normal', random_state=rs)),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,            
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', StandardScaler()),
                     
                   set_name='cv_set5', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,2.830193,0.011408,0.935835,0.95966,0.936175,0.959274
Ridge Classifier,2.828304,0.010732,0.924904,0.934431,0.924651,0.934974
Random Forest,2.817517,0.027329,0.937625,1.0,0.937842,1.0
Gradient Boost,3.662694,0.013248,0.936506,0.991366,0.936448,0.991397


In [23]:
#Standard Scaler, Logit Transform, No Category Combiner, No oversampling, No undersampling, BinaryEncoder, SFM
myreload()
cv_set6_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('logit', xfrs.LogitTransformer()),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None, #('combiner', xfrs.RareCategoryCombiner(threshold=20)),               
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, #('over', SMOTE(random_state=rs)),
                   undersampler=None, #('under', RandomUnderSampler(random_state=rs)),
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', StandardScaler()),
                     
                   set_name='cv_set6', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

Reloaded helpers.preprocessing, helpers.plots, and helpers.tools.


model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,5.89317,0.02304,0.934403,0.954983,0.93486,0.95452
Ridge Classifier,5.782582,0.014602,0.93023,0.936727,0.929973,0.936942
Random Forest,6.021257,0.039301,0.933931,1.0,0.933902,1.0
Gradient Boost,6.303984,0.013418,0.939341,0.987275,0.938937,0.987331


In [26]:
#Robust Scaler, Quantile Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SFM
cv_set7_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('qt', QuantileTransformer(output_distribution= 'normal', random_state=rs)),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,              
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', RobustScaler()),
                     
                   set_name='cv_set7', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,3.915877,0.011621,0.933568,0.955204,0.933719,0.954715
Ridge Classifier,3.910835,0.011022,0.926158,0.933229,0.925803,0.933724
Random Forest,4.060208,0.028322,0.941664,1.0,0.941474,1.0
Gradient Boost,4.817515,0.01149,0.936794,0.990366,0.936482,0.990503


In [None]:
#Set 8 - RobustScaler with LogitTransformer caused an error so we skipped that test
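
The notebook does not record which error set 8 raised, and `xfrs.LogitTransformer` is a project class whose code is not shown. One common failure mode with logit transforms in general is that log(p / (1 - p)) is only defined for 0 < p < 1, so inputs at or outside those bounds produce infinities or NaNs. The sketch below shows a clipped variant; the helper `clipped_logit` and its `eps` parameter are hypothetical and are not the project's implementation.

```python
import numpy as np


def clipped_logit(values, eps=1e-6):
    """Hypothetical helper: clip into [eps, 1 - eps], then apply log(p / (1 - p))."""
    p = np.clip(np.asarray(values, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))


# 0.0 and 1.0 sit on the boundary; 1.7 stands in for an out-of-range input
ratios = np.array([0.0, 0.0636, 0.5, 1.0, 1.7])
out = clipped_logit(ratios)
print(out)
```

Clipping keeps the output finite, at the cost of mapping every out-of-range value to the same extreme.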

In [29]:
#Standard Scaler, Yeo-Johnson Transform, No Category Combiner, No oversampling, No undersampling, BinaryEncoder, SFM
cv_set9_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,            
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None,
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', StandardScaler()),
                     
                   set_name='cv_set9', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,2.726147,0.009225,0.938987,0.96134,0.939533,0.96102
Ridge Classifier,2.819282,0.011144,0.940418,0.947774,0.940237,0.947288
Random Forest,2.889769,0.025304,0.942025,1.0,0.942188,1.0
Gradient Boost,3.711805,0.010442,0.93827,0.992914,0.938085,0.992949


In [30]:
#RobustScaler, Yeo-Johnson Transform, No Category Combiner, No oversampling/undersampling, BinaryEncoder, SFM
cv_set10_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,               
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', RobustScaler()),
                     
                   set_name='cv_set10', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,3.402261,0.00913,0.939065,0.959219,0.939632,0.958907
Ridge Classifier,3.409258,0.009021,0.939264,0.94745,0.938939,0.946908
Random Forest,3.456939,0.02321,0.937641,1.0,0.937581,1.0
Gradient Boost,4.388066,0.010198,0.936453,0.992905,0.936068,0.992945


In [31]:
#MinMaxScaler, Yeo-Johnson Transform, No Category Combiner, No oversampling, No undersampling, BinaryEncoder, SFM
myreload()
cv_set11_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,              
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', MinMaxScaler()),
                     
                   set_name='cv_set11', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

Reloaded helpers.preprocessing, helpers.plots, and helpers.tools.


model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,1.042206,0.00896,0.941187,0.952843,0.941665,0.952361
Ridge Classifier,1.065432,0.009648,0.94182,0.946363,0.941631,0.94581
Random Forest,1.069692,0.021155,0.938699,1.0,0.938285,1.0
Gradient Boost,1.415661,0.009444,0.937579,0.988421,0.937735,0.988428


In [32]:
#MinMaxScaler, Quantile Transform, No Category Combiner, No oversampling, No undersampling, BinaryEncoder, SFM
cv_set12_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('qt', QuantileTransformer(output_distribution= 'normal', random_state=rs)),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None, #('combiner', xfrs.RareCategoryCombiner(threshold=20)),               
                   cat_encoder=('cat_encoder', BinaryEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, #('over', SMOTE(random_state=rs)),
                   undersampler=None, #('under', RandomUnderSampler(random_state=rs)),
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', MinMaxScaler()),
                     
                   set_name='cv_set12', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,1.973679,0.010406,0.926655,0.931545,0.92548,0.932056
Ridge Classifier,1.97257,0.010588,0.916775,0.92306,0.916639,0.926892
Random Forest,2.032122,0.028299,0.941719,1.0,0.941579,1.0
Gradient Boost,2.253708,0.012554,0.937448,0.988503,0.936967,0.988537


#### Results for sets 1-12 (test_f1_weighted; set 8 was skipped)
- RandomForest performed best with Standard/Yeo-Johnson (0.942025)
    - Also performed well with MinMax/Quantile (0.941719) and Robust/Quantile (0.941664)
- Logistic performed best with MinMax/Yeo (0.941187)
- Ridge performed best with MinMax/Yeo (0.941820)
    - Also performed well with Standard/Yeo (0.940418)
- GBC performed best with Robust/No Transformation (0.939669)
    - Also performed well with Standard/Yeo and Standard/Logit
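
The comparison above can be reproduced by collecting the `test_f1_weighted` column from each result table into one frame. The numbers below are copied from the outputs of sets 1-12; the scaler/transformer labels and variable names are our own shorthand.

```python
import pandas as pd

models = ['Logistic Regression', 'Ridge Classifier', 'Random Forest', 'Gradient Boost']

# test_f1_weighted per set, copied from the result tables above (set 8 was skipped)
scores = {
    'Standard/None':       [0.934704, 0.935131, 0.937160, 0.936134],  # set 1
    'Robust/None':         [0.939435, 0.936682, 0.934146, 0.939669],  # set 2
    'Standard/Log':        [0.933545, 0.934989, 0.938481, 0.934177],  # set 3
    'Robust/Log':          [0.938207, 0.936455, 0.935863, 0.937177],  # set 4
    'Standard/Quantile':   [0.935835, 0.924904, 0.937625, 0.936506],  # set 5
    'Standard/Logit':      [0.934403, 0.930230, 0.933931, 0.939341],  # set 6
    'Robust/Quantile':     [0.933568, 0.926158, 0.941664, 0.936794],  # set 7
    'Standard/Yeo-Johnson': [0.938987, 0.940418, 0.942025, 0.938270],  # set 9
    'Robust/Yeo-Johnson':  [0.939065, 0.939264, 0.937641, 0.936453],  # set 10
    'MinMax/Yeo-Johnson':  [0.941187, 0.941820, 0.938699, 0.937579],  # set 11
    'MinMax/Quantile':     [0.926655, 0.916775, 0.941719, 0.937448],  # set 12
}

table = pd.DataFrame.from_dict(scores, orient='index', columns=models)

# Best scaler/transformer combination and its score for each model
best = pd.DataFrame({'best_combo': table.idxmax(), 'test_f1_weighted': table.max()})
print(best)
```

The gaps between the top combinations are in the third decimal place, so these rankings are best read as rough guidance rather than firm conclusions.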

#### Decision based on sets 1-12
Based on the above, going forward we will:
- Use Yeo-Johnson Transformation for all our tests 
- Perform tests with mainly StandardScaler and MinMaxScaler

#### Sets 13-14: Compare Target and CatBoost Encoders against the Binary Encoder
No other changes are made relative to set 11 (MinMaxScaler, Yeo-Johnson, BinaryEncoder)

In [34]:
#MinMaxScaler, Yeo-Johnson, No Category Combiner, No oversampling/undersampling, TargetEncoder, SelectFromModel
cv_set13_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,             
                   cat_encoder=('cat_encoder', TargetEncoder(random_state=rs)),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', MinMaxScaler()),
                     
                   set_name='cv_set13', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,0.726588,0.007799,0.942902,0.947318,0.942756,0.946617
Ridge Classifier,0.705543,0.007452,0.943002,0.946315,0.94262,0.945879
Random Forest,0.763852,0.021523,0.942395,0.996812,0.94235,0.996843
Gradient Boost,1.092456,0.008198,0.940732,0.978522,0.940458,0.978767


In [35]:
#MinMaxScaler, Yeo-Johnson, No Category Combiner, No oversampling/undersampling, CatBoostEncoder, SelectFromModel
cv_set14_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,              
                   cat_encoder=('cat_encoder', CatBoostEncoder()),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None,
                   undersampler=None,
                                  
                   selector= ('selector', SelectFromModel(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', MinMaxScaler()),
                     
                   set_name='cv_set14', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,0.928334,0.007373,0.936515,0.943921,0.936099,0.943096
Ridge Classifier,0.94273,0.007078,0.936682,0.940752,0.93666,0.940094
Random Forest,0.961659,0.02094,0.937745,0.990077,0.93732,0.990101
Gradient Boost,1.300713,0.0076,0.940378,0.969154,0.940931,0.969176


#### Encoder Results:
- Ridge did better with TargetEncoder than with BinaryEncoder (0.943352) 
- Logistic did better with TargetEncoder than with BinaryEncoder (0.942902)
- Gradient Boost performed better with CatBoostEncoder than with BinaryEncoder and was essentially tied with TargetEncoder
- All models did reasonably well with TargetEncoder
- Another benefit of TargetEncoder is its smoothing parameter, which we can tune to try to improve the score, so we will start our Hyperparameter testing with it.
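
To make the smoothing idea concrete, here is a minimal, library-free sketch. It assumes the common shrinkage form, in which each category's target mean is pulled toward the global mean in proportion to how few samples the category has. The function name `smoothed_target_means` and the toy data are hypothetical, and the TargetEncoder used in the pipeline may compute its blend differently.

```python
from collections import defaultdict


def smoothed_target_means(categories, targets, smooth):
    """Blend each category's target mean with the global mean.

    encoding = (category_sum + smooth * global_mean) / (category_count + smooth)
    A larger `smooth` pulls rare categories harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    return {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }


# Toy data (hypothetical): 1 = accepted, 0 = rejected
carriers = ["A", "A", "A", "A", "B"]
accepted = [1, 1, 0, 1, 0]

no_smoothing = smoothed_target_means(carriers, accepted, smooth=0.0)
with_smoothing = smoothed_target_means(carriers, accepted, smooth=5.0)

print(no_smoothing)
print(with_smoothing)
```

Carrier "B" appears only once, so without smoothing its encoding is determined by that single row. With smoothing it moves toward the global mean, which is the behavior we would tune for high-cardinality columns such as `TPA_rep` and `group_number`.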


#### Set 15-16: Compare RFE and SequentialFeatureSelector

In [36]:
#Repeat set 13 but with RFE
cv_set15_models=tools.my_cross_val(df,'decision',['f1_weighted', 'precision_weighted'], 
                   default_models,
                   
                   num_imputer=('si',SimpleImputer(strategy='most_frequent')),  
                   num_transformer=('yeo', PowerTransformer(method='yeo-johnson')),
                   poly=('poly', PolynomialFeatures()),
                   num_cols=['neg_to_billed', 'offer_to_neg', 'offer_to_counter_offer', 
                             'counter_offer_days', 'offer_days', 'decision_days', 'service_days', 'YOB'],
                   cat_combiner= None,             
                   cat_encoder=('cat_encoder', TargetEncoder(random_state=rs)),
                   cat_cols=['carrier', 'TPA', 'TPA_rep', 'group_number'], 
                   
                   onehotencoder=('ohe', OneHotEncoder(drop='if_binary')),
                   ohe_cols=['claim_type', 'NSA_NNSA', 'split_claim', 'negotiation_type', 'in_response_to', 'facility', 'plan_funding'],
                   ord_cols=['level'], 

                   oversampler=None, 
                   undersampler=None, 
                                  
                   selector= ('selector', RFE(estimator=LogisticRegression(penalty='l1', solver='saga', max_iter=8000, random_state=rs, n_jobs=-1))),
                   scaler= ('scaler', MinMaxScaler()),
                     
                   set_name='cv_set15', cv=5, verbose=0,
                   test_size=0.25, stratify=True,rs=42)

model,fit_time,score_time,test_f1_weighted,train_f1_weighted,test_precision_weighted,train_precision_weighted
Logistic Regression,15.670338,0.007454,0.943831,0.947097,0.943677,0.946378
Ridge Classifier,15.675219,0.008009,0.942682,0.946522,0.942238,0.94608
Random Forest,15.762049,0.019106,0.942275,0.996688,0.94185,0.996725
Gradient Boost,16.329948,0.009245,0.936103,0.979622,0.936063,0.979809


We skipped set 16 (SequentialFeatureSelector) because it took very long to run, which would make it impractical for Hyperparameter Tuning.
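
A rough way to see why the selectors differ so much in run time is to count how many times each one has to fit its estimator. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes step-wise elimination for RFE, forward selection with 5-fold cross-validation for SequentialFeatureSelector, and a hypothetical 60 input features with 30 kept. The real feature count after one-hot encoding and PolynomialFeatures is not shown here.

```python
def selector_fit_counts(n_features, n_selected, cv=5, step=1):
    """Approximate number of estimator fits each selector performs."""
    # SelectFromModel: a single fit, then a threshold on the coefficients.
    select_from_model = 1

    # RFE: one fit per elimination round, plus a final fit on the kept features.
    rounds = -(-(n_features - n_selected) // step)  # ceiling division
    rfe = rounds + 1

    # Forward SequentialFeatureSelector: every round cross-validates
    # each remaining candidate feature.
    sfs = cv * sum(n_features - i for i in range(n_selected))

    return {
        "SelectFromModel": select_from_model,
        "RFE": rfe,
        "SequentialFeatureSelector": sfs,
    }


# Hypothetical sizes, chosen only for illustration
counts = selector_fit_counts(n_features=60, n_selected=30)
for name, n_fits in counts.items():
    print(f"{name}: about {n_fits} fits")
```

Under these assumptions the sequential selector needs a few thousand fits where RFE needs a few dozen and SelectFromModel needs one. Each of those costs is then multiplied by every fold and every candidate configuration during tuning.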

### Summary of Findings
- Because this section does not have the benefit of Hyperparameter Tuning, and most of the tests resulted in overfit models, we take these results with a grain of salt and leave our options open for the transformer/pipeline design during Hyperparameter Tuning.
- We plan to explore adding Oversampling/Undersampling to our Pipeline, as well as a Transformer that combines rare values in Categorical columns. Since these steps depend heavily on their parameters, we will explore them during Hyperparameter Tuning.
- We plan to start the next section (Hyperparameter Tuning) with the Yeo-Johnson Transformation and TargetEncoder for all our tests, but we will still try multiple scalers.
- RFE ran much slower than SelectFromModel (fit times of roughly 15-16 s per model in set 15, compared with about 1 s in set 13) and, with default settings, gave similar results. SequentialFeatureSelector took so long that we skipped it. Since Execution Time will become increasingly important for Hyperparameter Tuning, we will stick with SelectFromModel.
- We did not see especially high Precision results for the RidgeClassifier, so we may drop it early in the next section.
- There were some features that we considered dropping in Feature Engineering; however, we will let SelectFromModel handle that for us in the next section.
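
As a quick check on the overfitting concern, the gap between train and test scores can be computed directly from a results table. The sketch below uses the `test_f1_weighted` and `train_f1_weighted` values copied from the cv_set15 output. The 0.02 cutoff and the helper name `overfit_gaps` are arbitrary, hypothetical choices for illustration only.

```python
# (model, test_f1_weighted, train_f1_weighted) copied from the cv_set15 output
cv_set15_scores = [
    ("Logistic Regression", 0.943831, 0.947097),
    ("Ridge Classifier", 0.942682, 0.946522),
    ("Random Forest", 0.942275, 0.996688),
    ("Gradient Boost", 0.936103, 0.979622),
]


def overfit_gaps(scores):
    """Return train-minus-test score per model, rounded to 6 decimals."""
    return {model: round(train - test, 6) for model, test, train in scores}


gaps = overfit_gaps(cv_set15_scores)

# Hypothetical cutoff: flag models whose train score exceeds test by more than 0.02
flagged = [model for model, gap in gaps.items() if gap > 0.02]

for model, gap in gaps.items():
    print(f"{model}: gap = {gap:.6f}")
print("Flagged:", flagged)
```

In this set the gap is largest for the two tree ensembles, which suggests that their regularization parameters are the ones to look at first during Hyperparameter Tuning.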