This notebook develops and implements the utility functions that will allow us to perform our experiments and store the results appropriately.

The experiments will test for the following:
* The role of hyperparameters
* The interplay between hyperparameters
* The interplay between features and hyperparameters
* The presence of unnecessary features
* The absence of necessary features
* The role of the number of instances and how it affects the tests above (the simulated dataset has 100k entries; we will also test with 10k, 1k, and 100)
* The role of the cleaning process on everything mentioned above
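The sample-size experiments could be prepared along the following lines. This is a minimal sketch on a toy frame: the helper name `nested_subsamples`, the seed, and the choice of drawing each smaller sample from the previous one (so that the subsets are nested) are assumptions for illustration, not decisions already taken in this project.

```python
import numpy as np
import pandas as pd


def nested_subsamples(df, sizes=(10_000, 1_000, 100), seed=42):
    """Return {n_rows: DataFrame}; each smaller sample is drawn from the previous one."""
    samples = {len(df): df}
    current = df
    for n in sorted(sizes, reverse=True):
        if n > len(current):
            continue  # skip sizes larger than the data we have
        current = current.sample(n=n, random_state=seed)
        samples[n] = current
    return samples


# Toy frame standing in for the 100k-row simulated dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=100_000),
                    'y': rng.normal(size=100_000)})

subsets = nested_subsamples(toy)
```

Nesting the subsets means that a change in score between two sizes comes from removing rows rather than from drawing an unrelated sample; independent draws would be an equally valid design.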

All the experiments will be run on different targets, some with a linear relation to the features and some with a non-linear one. Each algorithm will be tested, and we will record the results of each of the above experiments.

Each run will thus have to record:
* Scores under various metrics on both the train and test sets (still to be decided whether we keep the simple train/test split, use a k-fold strategy, or both)
* Plot of the predicted values vs the real ones
* Plot of the estimated coefficients vs the real ones (when possible)
* Learning curves (when necessary)
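As an illustration of the first item, train and test scores under several metrics can be collected in one call with scikit-learn's `cross_validate`. The sketch below uses synthetic data and a k-fold split; the metric selection and the `record` dictionary are assumptions, and this is not the `cv_score` helper imported later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem standing in for one of the simulated targets
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Several metrics at once; scikit-learn reports errors as negated values
scoring = {'r2': 'r2',
           'mae': 'neg_mean_absolute_error',
           'mse': 'neg_mean_squared_error'}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(Ridge(), X, y, cv=cv, scoring=scoring,
                     return_train_score=True)

# Average each train/test metric over the folds; timing entries are dropped
record = {key: float(np.mean(val)) for key, val in res.items()
          if key.startswith(('train_', 'test_'))}
```

A dictionary like `record` can be appended to a list and turned into a DataFrame, which keeps one row per run.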

We will make extensive use of Pipelines, as they make it easier to iterate quickly on different configurations.
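To show why Pipelines help here: every step's parameters are addressable as `<step>__<parameter>`, so preprocessing choices and model hyperparameters can be searched together. This is a small sketch using only scikit-learn built-ins on synthetic data; the step names and the grid values are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([('imputer', SimpleImputer(strategy='median')),
                 ('scl', StandardScaler()),
                 ('ridge', Ridge())])

# One grid covers both a preprocessing choice and a model hyperparameter
param_grid = {'imputer__strategy': ['mean', 'median'],
              'ridge__alpha': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)
best = search.best_params_
```

Because the whole pipeline is refit inside each fold, the imputer and scaler never see the validation rows, which avoids leakage when comparing configurations.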

In [5]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import pickle

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge  # just a random model to test stuff

import source.transf_univ as df_p
from source.utility import cv_score, grid_search

%matplotlib inline
pd.set_option('display.max_columns', 500)  # full option name; the short form is ambiguous in recent pandas

In [4]:
df = pd.read_csv('data/simulated/clean.csv')

df.head()

Unnamed: 0,unc_normal_1,unc_normal_2,unc_skewed_pos,unc_skewed_neg,unc_binary,unc_categories_5,unc_categories_100,unc_ordinal,corr_cat_1,corr_cat_2,corr_normal_by_cat,corr_normal_by_2cats,corr_multinormal_high_a,corr_multinormal_high_b,corr_multinormal_mid_a,corr_multinormal_mid_b,corr_multinormal_low_a,corr_multinormal_low_b,tar_lin_full,tar_lin_unc,tar_lin_corr,tar_lin_3,tar_lin_3int,tar_nonlin_full,tar_nonlin_unc,tar_nonlin_corr,tar_nonlin_3,tar_nonlin_3int
0,0.333494,9.046767,2.098417,29.620455,0,b,ljr,7,1,0,0.541971,0.351782,8.430675,-3.355695,25.236371,11.646659,-5.525505,4.62174,79.339121,17.975525,-98.029729,0.62446,0.002313,2.251485,2.00047,2.667974,2.450614,1.828467
1,0.012907,10.954762,2.895381,29.842289,1,a,vah,22,1,1,0.857349,-0.629935,3.334968,-0.792505,20.909618,11.555425,-5.988788,2.05706,49.482021,15.378844,-83.619765,0.297647,-0.99386,1.535575,2.395349,3.132021,19.756608,18.465101
2,-0.38881,9.986689,3.343133,29.393639,0,d,slv,81,1,0,1.2442,2.232988,3.895617,-1.586941,21.466997,11.038788,-5.470228,5.952688,64.436321,28.1013,-83.89022,0.281222,0.584477,0.73156,3.872721,3.34933,262.9332,263.236455
3,0.474317,10.710674,3.307999,28.96866,1,d,exm,95,0,1,0.092622,3.870697,8.248896,-3.246644,21.905036,9.963264,-5.43188,3.758482,69.456178,24.31747,-88.733443,0.771001,-0.325958,2.566616,3.44771,3.330483,361.716714,360.619755
4,0.350836,9.140127,1.827119,28.887216,0,d,jgc,29,0,1,-0.230801,4.35284,4.49677,-1.173505,17.377266,6.832519,-8.288617,3.684319,19.475321,20.812954,-68.675902,0.611498,1.680765,2.750494,2.395742,2.50563,34.40831,35.477577


In [7]:
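# Numeric branch: select the numeric columns and impute missing values with the median
numeric_pipe = Pipeline([('fs', df_p.feat_sel('numeric')),
                         ('imputer', df_p.df_imputer(strategy='median'))])

# Categorical branch: select the categorical columns, impute with the most frequent value, create dummies
cat_pipe = Pipeline([('fs', df_p.feat_sel('category')),
                     ('imputer', df_p.df_imputer(strategy='most_frequent')),
                     ('dummies', df_p.dummify())])

# Combine the outputs of the two branches
processing_pipe = df_p.FeatureUnion_df(transformer_list=[('cat_pipe', cat_pipe),
                                                         ('num_pipe', numeric_pipe)])

# Full model: preprocessing, scaling, then the Ridge regressor
model = Pipeline([('processing', processing_pipe),
                  ('scl', df_p.df_scaler()),
                  ('ridge', Ridge())])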
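The `df_p` transformers are project-specific and their code is not shown here. For readers without access to them, the same layout can be approximated with scikit-learn built-ins. The sketch below swaps in `ColumnTransformer`, `SimpleImputer`, `OneHotEncoder` and `StandardScaler` for the custom classes; the toy data, column names and the name `sk_model` are assumptions, and the built-ins return arrays rather than DataFrames, so this is an analogue and not the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: two numeric features, one categorical, and a linear target
rng = np.random.default_rng(0)
toy = pd.DataFrame({'num_a': rng.normal(size=200),
                    'num_b': rng.normal(size=200),
                    'cat': rng.choice(['a', 'b', 'c'], size=200)})
target = 3 * toy['num_a'] - 2 * toy['num_b'] + rng.normal(scale=0.1, size=200)

num_branch = Pipeline([('imputer', SimpleImputer(strategy='median'))])
cat_branch = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                       ('dummies', OneHotEncoder(handle_unknown='ignore'))])

# sparse_threshold=0 forces a dense output so the default scaler can centre it
processing = ColumnTransformer([('cat_pipe', cat_branch, ['cat']),
                                ('num_pipe', num_branch, ['num_a', 'num_b'])],
                               sparse_threshold=0)

sk_model = Pipeline([('processing', processing),
                     ('scl', StandardScaler()),
                     ('ridge', Ridge())])

sk_model.fit(toy, target)
r2 = sk_model.score(toy, target)  # in-sample R^2, only a sanity check
params = sk_model.get_params()
```

Nested parameters are then reachable by name, for example `sk_model.set_params(processing__num_pipe__imputer__strategy='mean', ridge__alpha=10.0)`.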