Classification exercise for State Farm Recruitment Process. <br>
Moazam Iqbal Hakim, PhD School of Information Sciences, University of Illinois Urbana Champaign <br>
mihakim@illinois.edu <br>
LinkedIn- https://www.linkedin.com/in/moazamiqbal/

## Requirements

In [1]:
import pandas as pd
import os
import numpy as np
import ipywidgets as widgets
from ipywidgets import interact, interact_manual
from itertools import product

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE

## Data overview

Load and take a look on the data

In [2]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

train_data

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,...,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100
0,0,0.165254,18.060003,Wed,1.077380,-1.339233,-1.584341,0.0062%,0.220784,1.816481,...,-0.397427,0.909479,no,5.492487,,10.255579,7.627730,0,yes,104.251338
1,1,2.441471,18.416307,Friday,1.482586,0.920817,-0.759931,0.0064%,1.192441,3.513950,...,0.656651,9.093466,no,3.346429,4.321172,,10.505284,1,yes,101.230645
2,1,4.427278,19.188092,Thursday,0.145652,0.366093,0.709962,-8e-04%,0.952323,0.782974,...,2.059615,0.305170,no,4.456565,,8.754572,7.810979,0,yes,109.345215
3,0,3.925235,19.901257,Tuesday,1.763602,-0.251926,-0.827461,-0.0057%,-0.520756,1.825586,...,0.899392,5.971782,no,4.100022,1.151085,,9.178325,1,yes,103.021970
4,0,2.868802,22.202473,Sunday,3.405119,0.083162,1.381504,0.0109%,-0.732739,2.151990,...,3.003595,1.046096,yes,3.234033,2.074927,9.987006,11.702664,0,yes,92.925935
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,0,1.593480,19.628352,Sun,0.794697,-0.825849,0.608774,-0.0085%,2.183834,3.202119,...,-1.640259,5.051545,no,5.798509,,10.854903,9.505529,1,yes,98.855726
39996,0,1.708685,17.132638,Thursday,-2.676659,1.153851,0.465905,0.0077%,-0.048613,3.989567,...,-0.195783,2.020510,no,5.285345,-1.408117,8.867221,9.077493,0,yes,101.880335
39997,0,1.704132,17.824399,Monday,-0.581360,,0.467339,-0.0216%,0.904643,2.975563,...,-0.071581,6.250353,no,4.729509,-1.118486,12.244620,7.663763,1,yes,100.022536
39998,0,3.963408,20.285597,Tuesday,0.430116,0.050189,1.821565,1e-04%,-0.401259,-0.247649,...,-1.248535,8.928009,no,6.803781,,9.876172,8.644538,0,yes,109.460219


Check detailed info about the data

In [3]:
train_data.info(max_cols=10000)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 101 columns):
 #    Column  Non-Null Count  Dtype  
---   ------  --------------  -----  
 0    y       40000 non-null  int64  
 1    x1      40000 non-null  float64
 2    x2      40000 non-null  float64
 3    x3      40000 non-null  object 
 4    x4      40000 non-null  float64
 5    x5      37572 non-null  float64
 6    x6      40000 non-null  float64
 7    x7      40000 non-null  object 
 8    x8      40000 non-null  float64
 9    x9      40000 non-null  float64
 10   x10     40000 non-null  float64
 11   x11     34890 non-null  float64
 12   x12     40000 non-null  float64
 13   x13     40000 non-null  float64
 14   x14     30136 non-null  float64
 15   x15     40000 non-null  float64
 16   x16     28788 non-null  float64
 17   x17     40000 non-null  float64
 18   x18     40000 non-null  float64
 19   x19     40000 non-null  object 
 20   x20     40000 non-null  float64
 21   x21     40

We can see different column dtypes with some missed values. Let's clean and preprocess it

## Data preprocessing

To simplify further purposes we will write custom class for data processing and add methods step-by-step:

In [4]:
class DataPreprocess:
    
    def __init__(self):
        
        self.assets = dict() # we will put custom objects to this dict
                             # for processing test data later

Firstly separate the labels from the training data :

In [5]:
train_labels = train_data.y.copy()
train_data.drop('y', axis=1, inplace=True)

Next consider `object` columns for preprocessing only

In [6]:
train_data.select_dtypes('object')

Unnamed: 0,x3,x7,x19,x24,x31,x33,x39,x60,x65,x77,x93,x99
0,Wed,0.0062%,$-908.650758424405,female,no,Colorado,5-10 miles,August,farmers,mercedes,no,yes
1,Friday,0.0064%,$-1864.9622875143,male,no,Tennessee,5-10 miles,April,allstate,mercedes,no,yes
2,Thursday,-8e-04%,$-543.187402955527,male,no,Texas,5-10 miles,September,geico,subaru,no,yes
3,Tuesday,-0.0057%,$-182.626380634258,male,no,Minnesota,5-10 miles,September,geico,nissan,no,yes
4,Sunday,0.0109%,$967.007090837503,male,yes,New York,5-10 miles,January,geico,toyota,yes,yes
...,...,...,...,...,...,...,...,...,...,...,...,...
39995,Sun,-0.0085%,$3750.51991954505,female,no,,5-10 miles,July,farmers,,no,yes
39996,Thursday,0.0077%,$448.867118077561,male,yes,Illinois,5-10 miles,July,progressive,ford,no,yes
39997,Monday,-0.0216%,$834.95775080472,male,yes,,5-10 miles,August,geico,ford,no,yes
39998,Tuesday,1e-04%,$-48.1031003332715,male,no,Ohio,5-10 miles,December,farmers,,no,yes


In [7]:
train_data.select_dtypes('object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x3      40000 non-null  object
 1   x7      40000 non-null  object
 2   x19     40000 non-null  object
 3   x24     36144 non-null  object
 4   x31     40000 non-null  object
 5   x33     32829 non-null  object
 6   x39     40000 non-null  object
 7   x60     40000 non-null  object
 8   x65     40000 non-null  object
 9   x77     30743 non-null  object
 10  x93     40000 non-null  object
 11  x99     27164 non-null  object
dtypes: object(12)
memory usage: 3.7+ MB


Common function for `object` columns is cleaing one. We can see there are some float-like columns with continuous values (`x7`, `x19`). We need to convert it to the float representation firstly. For the rest specify cleaning function related to `str` dtype and further process it:

In [8]:
@staticmethod
def clean_str(x):
    
    x = str(x)
    x = ''.join([
        
        s for s in x 
        if s.isdigit() or s.isalpha()
        or s in ['-', '_']
        
    ])
    
    return x.lower()

# -------------------------- #
@staticmethod
def float_representation(x):
    
    x = ''.join([
        
        s.replace(',', '.') for s in x 
        if s.isdigit() or s in ['.', ',']
        
    ])
    
    return float(x)

# assign to the class

DataPreprocess.clean_str = clean_str
DataPreprocess.float_representation = float_representation

# add mentioned columns to spec attribute to simplify
# further preprocessing
DataPreprocess.to_clean_str = [
    
    x for x in train_data.select_dtypes('object').columns
    if x not in ['x7', 'x19']

]

DataPreprocess.to_float = ['x7', 'x19']

# check functions

print(train_data.x7.apply(DataPreprocess.float_representation))
print()
print(train_data.x3.apply(DataPreprocess.clean_str).value_counts())

0          0.0062
1          0.0064
2        804.0000
3          0.0057
4          0.0109
           ...   
39995      0.0085
39996      0.0077
39997      0.0216
39998    104.0000
39999      0.0034
Name: x7, Length: 40000, dtype: float64

wednesday    4930
monday       4144
friday       3975
tuesday      3915
sunday       3610
saturday     3596
tue          2948
thursday     2791
mon          2200
wed          2043
sat          1787
thur         1643
fri          1620
sun           798
Name: x3, dtype: int64


Check that attributes were set correctly 

In [9]:
print(DataPreprocess.to_clean_str)
print()
print(DataPreprocess.to_float)

['x3', 'x24', 'x31', 'x33', 'x39', 'x60', 'x65', 'x77', 'x93', 'x99']

['x7', 'x19']


We have missed values in `x24`, `x33`, `x77`, `x99` columns. Firstly check `x99`, because values in the columns are bool-like (for `x31`, `x93` the same)

In [10]:
train_data.x99.value_counts()

yes    27164
Name: x99, dtype: int64

There is a single value `yes` so convert this column to bool-like representation with assuming NaNs as `False` or `0` (because anyway we consider `int` instead of `bool` dtype). Do the same for `x31`

In [11]:
@staticmethod
def bool_representation(x):
    
    return 1 if x == 'yes' else 0

# assign to the class

DataPreprocess.bool_representation = bool_representation
DataPreprocess.to_bool = ['x31', 'x93', 'x99']

# check the function

train_data.x99.apply(DataPreprocess.bool_representation).value_counts()

1    27164
0    12836
Name: x99, dtype: int64

Need to check that all is fine with raw columns, because there may be duplications of values (as for `x3` above for example)

In [12]:
# custom function to display value_counts
# dynamically

def eda_interact(df):
    
    @interact
    def f(column=list(df.columns)):

        print(df[column].value_counts())

eda_interact(train_data.select_dtypes('object'))

interactive(children=(Dropdown(description='column', options=('x3', 'x7', 'x19', 'x24', 'x31', 'x33', 'x39', '…

We can see that for `x3` we need to get rid of duplications, for `x39` we have constant values, so let's drop this column later

In [13]:
@staticmethod
def process_x3(x):
    
    return x[:2]

# assign to the class

DataPreprocess.process_x3 = process_x3

# check the function

train_data.x3.apply(DataPreprocess.clean_str).apply(DataPreprocess.process_x3).value_counts()

we    6973
tu    6863
mo    6344
fr    5595
sa    5383
th    4434
su    4408
Name: x3, dtype: int64

In [14]:
DataPreprocess.to_drop = ['x39']

For non-float and non-bool columns we apply one-hot encoding with `handle_unknowns='ignore'`, so only non-missed values will be taken. Also we will use one-hot encoding for following `object` columns: `x3`, `x60`, `x65`.

In [15]:
def one_hot_encoding(self, data, columns: list):
    
    df = data.copy()
    
    # check assets for existing ohe instance
    encoder = self.assets.get('OneHotEncoder', False)
    
    # avoid dropped cols
    
    columns = [
        
        x for x in columns
        if x not in self.to_drop
        
    ]
    
    # if it's empty - fit new one
    if not encoder:

        encoder = OneHotEncoder(

                handle_unknown='ignore',
                sparse=False

            )

        encoder.fit(df[columns])

        self.assets.update({f'OneHotEncoder' : encoder})

    # apply ohe to the data
    ohe_cols = encoder.get_feature_names(columns)
    df[ohe_cols] = encoder.transform(df[columns])
    
    df = df.drop(columns, axis=1)
    
    return df

# assign to the class

DataPreprocess.one_hot_encoding = one_hot_encoding
DataPreprocess.to_one_hot = ['x3', 'x33', 'x60', 'x65', 'x77', 'x24']

# check on test instance

abc = DataPreprocess()
abc.one_hot_encoding(data=train_data, columns=DataPreprocess.to_one_hot)

Unnamed: 0,x1,x2,x4,x5,x6,x7,x8,x9,x10,x11,...,x77_chevrolet,x77_ford,x77_mercedes,x77_nissan,x77_subaru,x77_toyota,x77_nan,x24_female,x24_male,x24_nan
0,0.165254,18.060003,1.077380,-1.339233,-1.584341,0.0062%,0.220784,1.816481,1.171788,109.626841,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,2.441471,18.416307,1.482586,0.920817,-0.759931,0.0064%,1.192441,3.513950,1.419900,84.079367,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,4.427278,19.188092,0.145652,0.366093,0.709962,-8e-04%,0.952323,0.782974,-1.247022,95.375221,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
3,3.925235,19.901257,1.763602,-0.251926,-0.827461,-0.0057%,-0.520756,1.825586,2.223038,96.420382,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2.868802,22.202473,3.405119,0.083162,1.381504,0.0109%,-0.732739,2.151990,-0.275406,90.769952,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,1.593480,19.628352,0.794697,-0.825849,0.608774,-0.0085%,2.183834,3.202119,-0.723356,94.820410,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
39996,1.708685,17.132638,-2.676659,1.153851,0.465905,0.0077%,-0.048613,3.989567,1.468074,115.785563,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
39997,1.704132,17.824399,-0.581360,,0.467339,-0.0216%,0.904643,2.975563,0.228908,107.939412,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
39998,3.963408,20.285597,0.430116,0.050189,1.821565,1e-04%,-0.401259,-0.247649,-0.499294,93.314126,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


Check that all is fine with `abc.assets`:

In [16]:
abc.assets

{'OneHotEncoder': OneHotEncoder(handle_unknown='ignore', sparse=False)}

Let's consider non-`object` columns:

In [17]:
train_data.select_dtypes(exclude='object').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 88 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      40000 non-null  float64
 1   x2      40000 non-null  float64
 2   x4      40000 non-null  float64
 3   x5      37572 non-null  float64
 4   x6      40000 non-null  float64
 5   x8      40000 non-null  float64
 6   x9      40000 non-null  float64
 7   x10     40000 non-null  float64
 8   x11     34890 non-null  float64
 9   x12     40000 non-null  float64
 10  x13     40000 non-null  float64
 11  x14     30136 non-null  float64
 12  x15     40000 non-null  float64
 13  x16     28788 non-null  float64
 14  x17     40000 non-null  float64
 15  x18     40000 non-null  float64
 16  x20     40000 non-null  float64
 17  x21     40000 non-null  float64
 18  x22     37613 non-null  float64
 19  x23     40000 non-null  float64
 20  x25     40000 non-null  float64
 21  x26     37567 non-null  float64
 22

For imputing NaN values let's apply multivariate imputing. https://scikit-learn.org/stable/modules/impute.html (6.4.3)

In [18]:
def impute_nans(self, data, columns: list):
    
    df = data.copy()
    
    # check assets for existing imputer instance
    imputer = self.assets.get('Imputer', False)
    
    # avoid dropped cols
    
    columns = [
        
        x for x in columns
        if x not in self.to_drop
        
    ]
    
    # if it's empty - fit new one
    if not imputer:

        imputer = IterativeImputer(

                max_iter=10,
                random_state=0

            )

        imputer.fit(df[columns])

        self.assets.update({f'Imputer' : imputer})

    # apply imputer to the data
    df[columns] = imputer.transform(df[columns])
    
    return df

# assign to the class

DataPreprocess.impute_nans = impute_nans
DataPreprocess.to_impute = train_data.select_dtypes(exclude='object').columns

# check

abc = DataPreprocess()

# apply imputing, filter non-object columns and check for NaNs

(abc
 .impute_nans(data=train_data, columns=DataPreprocess.to_impute)
 .select_dtypes(exclude='object')
 .info(max_cols=10000)
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 88 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      40000 non-null  float64
 1   x2      40000 non-null  float64
 2   x4      40000 non-null  float64
 3   x5      40000 non-null  float64
 4   x6      40000 non-null  float64
 5   x8      40000 non-null  float64
 6   x9      40000 non-null  float64
 7   x10     40000 non-null  float64
 8   x11     40000 non-null  float64
 9   x12     40000 non-null  float64
 10  x13     40000 non-null  float64
 11  x14     40000 non-null  float64
 12  x15     40000 non-null  float64
 13  x16     40000 non-null  float64
 14  x17     40000 non-null  float64
 15  x18     40000 non-null  float64
 16  x20     40000 non-null  float64
 17  x21     40000 non-null  float64
 18  x22     40000 non-null  float64
 19  x23     40000 non-null  float64
 20  x25     40000 non-null  float64
 21  x26     40000 non-null  float64
 22

Last preprocessing step is normalizing numerical columns:

In [19]:
def normalize(self, data, columns: list):
    
    df = data.copy()
    
    # check assets for existing imputer instance
    scaler = self.assets.get('Scaler', False)
    
    # avoid dropped cols
    
    columns = [
        
        x for x in columns
        if x not in self.to_drop
        
    ]
    
    # if it's empty - fit new one
    if not scaler:

        scaler = StandardScaler()

        scaler.fit(df[columns])

        self.assets.update({f'Scaler' : scaler})

    # apply imputer to the data
    df[columns] = scaler.transform(df[columns])
    
    return df

# assign to the class

DataPreprocess.normalize = normalize

# check (without filling NaNs)

abc = DataPreprocess()

(abc
 .normalize(data=train_data, columns=DataPreprocess.to_impute)
)

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100
0,-1.421286,-1.212302,Wed,0.734820,-1.035976,-1.171538,0.0062%,0.149538,-0.460571,0.661698,...,-0.271699,-1.025200,no,1.033369,,-0.187785,-1.195336,-0.995311,yes,0.805457
1,-0.280019,-0.990205,Friday,1.011947,0.705290,-0.564701,0.0064%,0.820941,0.402488,0.902618,...,0.445962,1.687083,no,-0.422653,2.352797,,0.252884,1.004711,yes,0.229807
2,0.715641,-0.509124,Thursday,0.097596,0.277901,0.517269,-8e-04%,0.655022,-0.986047,-1.687003,...,1.401159,-1.225476,no,0.330534,,-1.231920,-1.103110,-0.995311,yes,1.776191
3,0.463922,-0.064582,Tuesday,1.204139,-0.198255,-0.614409,-0.0057%,-0.362856,-0.455942,1.682477,...,0.611230,0.652515,no,0.088633,0.613913,,-0.414950,1.004711,yes,0.571178
4,-0.065760,1.369849,Sunday,2.326799,0.059915,1.011583,0.0109%,-0.509334,-0.289985,-0.743549,...,2.043861,-0.979923,yes,-0.498909,1.120667,-0.374610,0.855504,-0.995311,yes,-1.352811
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39995,-0.705191,-0.234694,Sun,0.541489,-0.640437,0.442786,-0.0085%,1.505981,0.243941,-1.178515,...,-1.117872,0.347536,no,1.240994,,0.229119,-0.250274,1.004711,yes,-0.222778
39996,-0.647428,-1.790362,Thursday,-1.832629,0.884833,0.337622,0.0077%,-0.036612,0.644311,0.949396,...,-0.134411,-0.656989,no,0.892831,-0.789883,-1.153559,-0.465697,-0.995311,yes,0.353618
39997,-0.649711,-1.359163,Monday,-0.399619,,0.338677,-0.0216%,0.622076,0.128751,-0.253853,...,-0.049849,0.744837,no,0.515716,-0.631012,1.195839,-1.177201,1.004711,yes,-0.000421
39998,0.483062,0.174991,Tuesday,0.292146,0.034511,1.335506,1e-04%,-0.280286,-1.510056,-0.960948,...,-0.851170,1.632248,no,1.923034,,-0.451709,-0.683595,-0.995311,yes,1.798108


Now let's build `process` function with all cleaning and preprocessing workflow:

In [20]:
def process(self, data):
    
    # static methods
    df = data.copy()
    
    # drop useless columns
    
    df = df.drop(self.to_drop, axis=1)
    
    # cleaning all string cols defined in the clean_str
    for c in self.to_clean_str:
        if c in self.to_drop: continue
        df[c] = df[c].apply(self.clean_str)
    
    # float all cols defined in the to_float
    for c in self.to_float:
        if c in self.to_drop: continue
        df[c] = df[c].apply(self.float_representation)
    
    # bool all cols defined in the to_bool
    for c in self.to_bool:
        if c in self.to_drop: continue
        df[c] = df[c].apply(self.bool_representation)
    
    # specific weekday converting
    df['x3'] = df.apply(self.process_x3)
    
    # class methods: one-hot encoding, multivariate imputing, normalizing
    df = self.one_hot_encoding(df, columns=self.to_one_hot)
    df = self.impute_nans(df, columns=self.to_impute)
    df = self.normalize(df, columns=self.to_impute)
    
    return df

# assign to the class

DataPreprocess.process = process

Preprocess the data

In [21]:
train_data = pd.read_csv('exercise_40_train.csv')
test_data = pd.read_csv('exercise_40_test.csv')
X_test = test_data.copy()

# separate labels
train_labels = train_data.y.copy()
train_data.drop('y', axis=1, inplace=True)

X_train, X_val, y_train, y_val = train_test_split(
    
    train_data, 
    train_labels, 
    random_state=0, 
    train_size=0.8,
    stratify=train_labels # the same label distribution in train and val sets
)

prep = DataPreprocess()

Check counts

In [22]:
y_train.value_counts()

0    27358
1     4642
Name: y, dtype: int64

So, after preprocessing we need to get rid of imbalance

In [23]:
# firstly process train data to fit assets
X_train = prep.process(X_train)

# next process val and test data
X_val = prep.process(X_val)
X_test = prep.process(X_test)

To fix imbalance we will use `SMOTE` oversampling from `imblearn` package https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

In [24]:
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
X_val, y_val = oversample.fit_resample(X_val, y_val)

y_train.value_counts()

0    27358
1    27358
Name: y, dtype: int64

## Logistic Regression

Implement custom grid search cross-validation function (because native doesn't work) for hyperparameters optimization

In [25]:
def grid_search(estimator, X, y, grid: dict, cv: int):
    
    '''
    args:
        estimator: object - model for hyp-opt;
        X: object - training data,
        y: object - training labels,
        grid: dict - values for searching,
        cv: int - number of folds for cross-validation
    '''
    
    # stratified kfold to save target value distribution
    kf = StratifiedKFold(n_splits=cv)
    
    # from df to array
    X = X.to_numpy()
    y = y.to_numpy()
    
    # convert dict of lists with grid values
    # to list of dicts with all combinations
    # of grid values (grid points)
    params = list(map(
    
        lambda x: dict(zip(grid.keys(), x)),
        list(product(*list(grid.values())))
    ))
    
    scores = []
    
    # iterate over all combinations
    for j, prm in enumerate(params):
        
        print(f'Params: {prm}\n')
        scores.append([])
        
        for i, (train, test) in enumerate(kf.split(X, y)):
            
            # init estimator and fit it
            est = estimator(**prm)
            est.fit(X[train], y[train])
            
            # calculate roc score
            score = roc_auc_score(y[test], est.predict(X[test]))
            print(f'\t{i + 1} fold roc_auc: {score:.5f}')
            
            scores[j].append(score)
        
        # take mean score from folds
        scores[j] = np.mean(scores[j])
        
        print(f'\tMean score: {scores[j]}\n')
    
    # return the best params by mean score
    return params[np.argmax(scores)]

Find best params for logistic regression

In [26]:
grid = {
    
    'C': list(np.logspace(-3, 4, 12)),
    'penalty' : ['l2'],
    'max_iter' : [10000], 

}

best_params = grid_search(LogisticRegression, X_train, y_train, grid=grid, cv=3)

Params: {'C': 0.001, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.64899
	2 fold roc_auc: 0.68348
	3 fold roc_auc: 0.68132
	Mean score: 0.671265362580795

Params: {'C': 0.004328761281083057, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.65322
	2 fold roc_auc: 0.69916
	3 fold roc_auc: 0.69575
	Mean score: 0.6827062106526506

Params: {'C': 0.01873817422860384, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.65821
	2 fold roc_auc: 0.70546
	3 fold roc_auc: 0.70326
	Mean score: 0.6889749925289336

Params: {'C': 0.08111308307896872, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.65853
	2 fold roc_auc: 0.70925
	3 fold roc_auc: 0.70523
	Mean score: 0.6910037297615862

Params: {'C': 0.3511191734215131, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.65914
	2 fold roc_auc: 0.71155
	3 fold roc_auc: 0.70479
	Mean score: 0.6918261382634677

Params: {'C': 1.5199110829529332, 'penalty': 'l2', 'max_iter': 10000}

	1 fold roc_auc: 0.65870
	2 fold roc_auc: 0.

Check the model and save test results

In [33]:
print(best_params)

{'C': 28.48035868435799, 'penalty': 'l2', 'max_iter': 10000}


In [27]:
logreg = LogisticRegression(**best_params)
logreg.fit(X_train, y_train)

print(f'Train set roc_auc: {roc_auc_score(y_train, logreg.predict(X_train))}')
print(f'Val set roc_auc: {roc_auc_score(y_val, logreg.predict(X_val))}')

Train set roc_auc: 0.6961035163389135
Val set roc_auc: 0.7003216844567919


In [28]:
# logreg_test = logreg.predict(X_test)
logreg_test = logreg.predict_proba(X_test)[:, 1]
pd.DataFrame(logreg_test).to_csv('glmresults.csv', header=False, index=False)

## Random Forest Classifier

The same as for logreg

In [29]:
grid_rf = {
    
    'n_estimators': list(np.linspace(10, 100, 10).astype(int)),
    'max_depth' : list(range(2, 5)),

}

best_params_rf = grid_search(RandomForestClassifier, X_train, y_train, grid=grid_rf, cv=3)

Params: {'n_estimators': 10, 'max_depth': 2}

	1 fold roc_auc: 0.66133
	2 fold roc_auc: 0.79264
	3 fold roc_auc: 0.75529
	Mean score: 0.7364189178464859

Params: {'n_estimators': 10, 'max_depth': 3}

	1 fold roc_auc: 0.69394
	2 fold roc_auc: 0.83535
	3 fold roc_auc: 0.83759
	Mean score: 0.78896100231571

Params: {'n_estimators': 10, 'max_depth': 4}

	1 fold roc_auc: 0.72602
	2 fold roc_auc: 0.87159
	3 fold roc_auc: 0.84795
	Mean score: 0.8151871089313554

Params: {'n_estimators': 20, 'max_depth': 2}

	1 fold roc_auc: 0.66988
	2 fold roc_auc: 0.81424
	3 fold roc_auc: 0.82377
	Mean score: 0.7692971794639943

Params: {'n_estimators': 20, 'max_depth': 3}

	1 fold roc_auc: 0.72158
	2 fold roc_auc: 0.85613
	3 fold roc_auc: 0.89533
	Mean score: 0.8243439189206522

Params: {'n_estimators': 20, 'max_depth': 4}

	1 fold roc_auc: 0.72897
	2 fold roc_auc: 0.86358
	3 fold roc_auc: 0.87526
	Mean score: 0.8226063689077941

Params: {'n_estimators': 30, 'max_depth': 2}

	1 fold roc_auc: 0.70897
	2 fold

In [32]:
print(best_params_rf)

{'n_estimators': 50, 'max_depth': 4}


In [30]:
rf = RandomForestClassifier(**best_params_rf)
rf.fit(X_train, y_train)

print(f'Train set roc_auc: {roc_auc_score(y_train, rf.predict(X_train))}')
print(f'Val set roc_auc: {roc_auc_score(y_val, rf.predict(X_val))}')

Train set roc_auc: 0.8741684333650119
Val set roc_auc: 0.8757128235122094


In [31]:
# rf_test = rf.predict(X_test)
rf_test = rf.predict_proba(X_test)[:, 1]
pd.DataFrame(rf_test).to_csv('nonglmresults.csv', header=False, index=False)

## Summary


| **Model**          | **val ROC AUC** |
| :----------------- | :-------------- |
| LogisticRegression |  0.7003         |
| RandomForest       | 0.8757          |


Random forest is better than logistic regression, because LogReg is a simple linear model with not many parameters. On the other hand, random forest is complex ensembled model that can "track" and "take" specified data in better manner and embrace wider data dispersion.