# Speed Dating Data
https://www.kaggle.com/annavictoria/speed-dating-experiment

# 3 Feature Engineering/Pre-processing & Training Data Development

## 3.1  Imports

In [190]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from pandas_profiling import ProfileReport
from scipy import stats

from sb_utils import save_file

In [191]:
# not sure if I need this
import datetime
import unicodedata
import re
from sklearn.preprocessing import scale

## 3.2 Objectives

In the data wrangling notebook, we have identified our target dependent variable as desicion of parner of the specific subject, dec_o (might also consider match, desicion from both the subject and the partner) and cleaned the data accordingly. In this notebook, we will conduct further EDA, hoping to answer following questions.

1. The difference of desirable attributes in a male partner vs female partner.
2. The difference of desirable attributes among  races.
3. The difference of desirable major of male partner vs female partner
4. The difference of desirable majors among races.  

**Learning Objectives**:
1. Understand the importance of creating a model training development data set.
2. Correctly identify when to create dummy features or one-hot encoded features.
3. Understand the importance of magnitude standardization.
4. Apply the train and test split to the development dataset effectively

Since speed dating data is relatively clean we may not need to perform 2&3 pre-processing

Here is possible workflow: TBD
- Use stats.model package for logistic regression model (sloves classification problem): for my model notebook
    - import statsmodels.api as sm (This model  is kind of doing the similar thing as ANOVA)
- Apply this on the whole data set including the dec_o
- Use ‘Speed_Dating_data_cleaned.csv’ from data wrangling output
- Fill the missing data (NaN) with mean to model
- Use PCA to choose features (but will loose interpretability)
- Keep components 0-5 for ~90% var. 
- Use stepwise selection, elastic-net (or L1/L2 regularizers) 
    - Statsmodels should have the code to run this.


## 3.3 Load The Data

In [192]:
# df = pd.read_csv('../data/ski_data_cleaned.csv')
spd = pd.read_csv('spd_data_wrangling_output/Speed_Dating_data_cleaned.csv') #spd1_2 in data wrangling notebook
spd_fp = pd.read_csv('spd_data_wrangling_output/Speed_Dating_data_FemaleRatingMale_cleaned.csv') # spd1_2fp in data wrangling notebook 
spd_mp = pd.read_csv('spd_data_wrangling_output/Speed_Dating_data_MaleRatingFemale_cleaned.csv') # spd1_2mp in data wrangling notebook

In [193]:
spd.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6266 entries, 0 to 6265
Data columns (total 24 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   gender    6266 non-null   int64  
 1   match     6266 non-null   int64  
 2   age       6198 non-null   float64
 3   race      6208 non-null   float64
 4   field     6208 non-null   object 
 5   career    6182 non-null   object 
 6   from      6192 non-null   object 
 7   goal      6192 non-null   float64
 8   int_corr  6118 non-null   float64
 9   samerace  6266 non-null   int64  
 10  imprace   6192 non-null   float64
 11  imprelig  6192 non-null   float64
 12  age_o     6189 non-null   float64
 13  race_o    6198 non-null   float64
 14  dec_o     6266 non-null   int64  
 15  attr_o    6127 non-null   float64
 16  sinc_o    6064 non-null   float64
 17  intel_o   6054 non-null   float64
 18  fun_o     5999 non-null   float64
 19  amb_o     5709 non-null   float64
 20  shar_o    5399 non-null   floa

In [194]:
spd.head()

Unnamed: 0,gender,match,age,race,field,career,from,goal,int_corr,samerace,...,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
0,0,0,21.0,4.0,Law,lawyer,Chicago,2.0,0.14,0,...,0,6.0,8.0,8.0,8.0,8.0,6.0,7.0,4.0,2.0
1,0,0,21.0,4.0,Law,lawyer,Chicago,2.0,0.54,0,...,0,7.0,8.0,10.0,7.0,7.0,5.0,8.0,4.0,2.0
2,0,1,21.0,4.0,Law,lawyer,Chicago,2.0,0.16,1,...,1,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,1.0
3,0,1,21.0,4.0,Law,lawyer,Chicago,2.0,0.61,0,...,1,7.0,8.0,9.0,8.0,9.0,8.0,7.0,7.0,2.0
4,0,1,21.0,4.0,Law,lawyer,Chicago,2.0,0.21,0,...,1,8.0,7.0,9.0,6.0,9.0,7.0,8.0,6.0,2.0


## 3.4 Pre-processing data

### 3.4.1 filling NaN with mean

In [195]:
# make new df
spd_mean = spd.fillna(spd.mean())
spd_fp_mean = spd_fp.fillna(spd_fp.mean())
spd_mp_mean = spd_mp.fillna(spd_mp.mean())

In [196]:
# check for NaN
spd_mean.isna().sum()

gender       0
match        0
age          0
race         0
field       58
career      84
from        74
goal         0
int_corr     0
samerace     0
imprace      0
imprelig     0
age_o        0
race_o       0
dec_o        0
attr_o       0
sinc_o       0
intel_o      0
fun_o        0
amb_o        0
shar_o       0
like_o       0
prob_o       0
met_o        0
dtype: int64

In [197]:
spd_mean.shape

(6266, 24)

In [198]:
spd_fp_mean.isna().sum()

gender       0
match        0
age          0
race         0
field       20
career      20
from        20
goal         0
int_corr     0
samerace     0
imprace      0
imprelig     0
age_o        0
race_o       0
dec_o        0
attr_o       0
sinc_o       0
intel_o      0
fun_o        0
amb_o        0
shar_o       0
like_o       0
prob_o       0
met_o        0
dtype: int64

In [199]:
spd_fp_mean.shape

(3138, 24)

In [200]:
spd_mp_mean.isna().sum()

gender       0
match        0
age          0
race         0
field       38
career      64
from        54
goal         0
int_corr     0
samerace     0
imprace      0
imprelig     0
age_o        0
race_o       0
dec_o        0
attr_o       0
sinc_o       0
intel_o      0
fun_o        0
amb_o        0
shar_o       0
like_o       0
prob_o       0
met_o        0
dtype: int64

In [201]:
spd_mp_mean.shape

(3128, 24)

### 3.4.2 drop columns with NaN

In [202]:
# make new df
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
spd_mean1 = spd_mean.dropna(axis='columns')
spd_fp_mean1 = spd_fp_mean.dropna(axis='columns')
spd_mp_mean1 = spd_fp_mean.dropna(axis='columns')

In [203]:
# check for NaN
spd_mean1.isna().sum()

gender      0
match       0
age         0
race        0
goal        0
int_corr    0
samerace    0
imprace     0
imprelig    0
age_o       0
race_o      0
dec_o       0
attr_o      0
sinc_o      0
intel_o     0
fun_o       0
amb_o       0
shar_o      0
like_o      0
prob_o      0
met_o       0
dtype: int64

In [204]:
spd_mean1.shape

(6266, 21)

In [205]:
spd_fp_mean1.isna().sum()

gender      0
match       0
age         0
race        0
goal        0
int_corr    0
samerace    0
imprace     0
imprelig    0
age_o       0
race_o      0
dec_o       0
attr_o      0
sinc_o      0
intel_o     0
fun_o       0
amb_o       0
shar_o      0
like_o      0
prob_o      0
met_o       0
dtype: int64

In [206]:
spd_fp_mean1.shape

(3138, 21)

In [207]:
spd_mp_mean1.isna().sum()

gender      0
match       0
age         0
race        0
goal        0
int_corr    0
samerace    0
imprace     0
imprelig    0
age_o       0
race_o      0
dec_o       0
attr_o      0
sinc_o      0
intel_o     0
fun_o       0
amb_o       0
shar_o      0
like_o      0
prob_o      0
met_o       0
dtype: int64

In [208]:
spd_mp_mean1.shape

(3138, 21)

In [209]:
spd_mp_mean1.describe()

Unnamed: 0,gender,match,age,race,goal,int_corr,samerace,imprace,imprelig,age_o,...,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
count,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,...,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0,3138.0
mean,1.0,0.170491,26.434041,2.672867,2.225786,0.192213,0.404398,3.459589,3.116421,26.079288,...,0.360421,5.976672,7.210102,7.535608,6.345409,7.056627,5.50599,6.062459,5.241818,1.956957
std,0.0,0.376123,3.387402,1.199342,1.487494,0.300429,0.490853,2.655817,2.578369,3.60814,...,0.480199,1.963168,1.786063,1.553134,1.998559,1.714952,2.033149,1.898075,2.11642,0.287586
min,1.0,0.0,18.0,1.0,1.0,-0.83,0.0,1.0,1.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,1.0,0.0,24.0,2.0,1.0,-0.01,0.0,1.0,1.0,23.0,...,0.0,5.0,6.0,7.0,5.0,6.0,5.0,5.0,4.0,2.0
50%,1.0,0.0,27.0,2.0,2.0,0.192213,0.0,3.0,2.0,26.0,...,0.0,6.0,7.0,8.0,6.345409,7.0,5.50599,6.0,5.0,2.0
75%,1.0,0.0,28.0,4.0,3.0,0.42,1.0,6.0,5.0,28.0,...,1.0,7.0,8.0,9.0,8.0,8.0,7.0,7.0,7.0,2.0
max,1.0,1.0,42.0,6.0,6.0,0.9,1.0,10.0,10.0,38.0,...,1.0,10.5,10.0,10.0,10.0,10.0,10.0,10.0,10.0,8.0


### 3.4.3 Extracting more seemingly relevant features

In [210]:
# refer to 'Speed dating_2_EDA_mk'
spd_mean1_mini = spd_mean1.loc[:, 'dec_o':'prob_o']

In [211]:
# check
spd_mean1_mini.isna().sum()

dec_o      0
attr_o     0
sinc_o     0
intel_o    0
fun_o      0
amb_o      0
shar_o     0
like_o     0
prob_o     0
dtype: int64

In [212]:
spd_mean1_mini.shape

(6266, 9)

In [213]:
spd_mean1_mini.describe()

Unnamed: 0,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o
count,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0
mean,0.427705,6.233132,7.223615,7.403204,6.418736,6.826152,5.554269,6.166317,5.233812
std,0.494785,1.912573,1.689306,1.502261,1.910492,1.689681,1.987716,1.826321,2.106286
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,5.0,6.0,7.0,5.0,6.0,5.0,5.0,4.0
50%,0.0,6.0,7.0,7.403204,6.418736,7.0,5.554269,6.0,5.0
75%,1.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0
max,1.0,10.5,10.0,10.0,11.0,10.0,10.0,10.0,10.0


### 3.4.4 Scale the whole data

In [214]:
# refer to '6_GuidedCapstone/04_preprocessing_and_training_mk'
# refer to '16.3.1_Capstone_Two_Step_4__Preprocessing_Training_Data_Development.pdf'
from sklearn.preprocessing import StandardScaler
# Making a Scaler object
scaler = StandardScaler()

# Fitting data (spd_mean1, df w/o NaN) to the scaler object
spd_mean1_scaled = scaler.fit_transform(spd_mean1)
print(type(spd_mean1_scaled))
spd_mean1_scaled = pd.DataFrame(spd_mean1_scaled, columns=spd_mean1.columns)
spd_mean1_scaled.head()

<class 'numpy.ndarray'>


Unnamed: 0,gender,match,age,race,goal,int_corr,samerace,imprace,imprelig,age_o,...,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
0,-1.001597,-0.453793,-1.499401,1.078539,-0.107357,-0.17511,-0.825102,-0.629879,0.094728,0.20998,...,-0.864495,-0.121904,0.459625,0.397297,0.827739,0.694772,0.224261,0.456518,-0.585823,0.169905
1,-1.001597,-0.453793,-1.499401,1.078539,-0.107357,1.156634,-0.825102,-0.629879,0.094728,-1.218746,...,-0.864495,0.400994,0.459625,1.728729,0.304272,0.102896,-0.27887,1.004111,-0.585823,0.169905
2,-1.001597,2.20365,-1.499401,1.078539,-0.107357,-0.108523,1.211971,-0.629879,0.094728,-1.218746,...,1.156745,1.969687,1.643637,1.728729,1.874673,1.878522,2.236781,2.099296,2.263021,-3.704318
3,-1.001597,2.20365,-1.499401,1.078539,-0.107357,1.389689,-0.825102,-0.629879,0.094728,-0.933,...,1.156745,0.400994,0.459625,1.063013,0.827739,1.286647,1.230521,0.456518,0.838599,0.169905
4,-1.001597,2.20365,-1.499401,1.078539,-0.107357,0.057945,-0.825102,-0.629879,0.094728,-0.647255,...,1.156745,0.923891,-0.132381,1.063013,-0.219195,1.286647,0.727391,1.004111,0.363792,0.169905


In [215]:
spd_mean1_scaled.shape

(6266, 21)

In [216]:
spd_mean1_scaled.describe()

Unnamed: 0,gender,match,age,race,goal,int_corr,samerace,imprace,imprelig,age_o,...,dec_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
count,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,...,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0
mean,6.382808e-16,9.466308e-16,-7.147437e-16,2.325338e-16,-9.428524e-16,7.907637000000001e-17,-6.520656e-16,2.865035e-17,-1.834366e-16,6.846139e-16,...,-2.9943770000000005e-17,-6.506127e-16,-1.096438e-15,-4.078732e-16,-5.045437e-16,2.618751e-16,5.847718e-16,6.311581e-16,4.669636e-15,-3.420173e-15
std,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,...,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008
min,-1.001597,-0.4537925,-2.355032,-1.393671,-0.8108032,-3.404588,-0.8251024,-0.9788916,-0.9775884,-2.361726,...,-0.8644945,-3.259291,-4.276425,-4.928435,-3.359997,-4.040229,-2.794521,-3.376629,-2.485052,-3.704318
25%,-1.001597,-0.4537925,-0.6437707,-0.5696011,-0.8108032,-0.6745135,-0.8251024,-0.9788916,-0.9775884,-0.6472551,...,-0.8644945,-0.6448019,-0.7243876,-0.2684198,-0.7426617,-0.4889786,-0.2788696,-0.6386665,-0.5858231,0.1699051
50%,0.9984054,-0.4537925,-0.07335029,-0.5696011,-0.1073573,0.02465184,-0.8251024,-0.280867,-0.2627107,-0.07576472,...,-0.8644945,-0.1219042,-0.1323813,-5.912749e-16,-4.649321e-16,0.1028965,4.468694e-16,-0.09107402,-0.1110158,0.1699051
75%,0.9984054,-0.4537925,0.4970702,1.078539,-0.1073573,0.7571108,1.211971,0.76617,0.809606,0.4957257,...,1.156745,0.9238912,0.4596249,0.3972966,0.8277393,0.6947715,0.7273908,0.4565184,0.8385988,0.1699051
max,0.9984054,2.20365,4.490013,2.726678,2.706426,2.355203,1.211971,2.162219,2.239362,4.496158,...,1.156745,2.231136,1.643637,1.728729,2.39814,1.878522,2.236781,2.099296,2.263021,23.41524


### 3.4.5 Set up input data for logistic Regression Model (X and y)

In [217]:
# spd_mean1
X = spd_mean1.drop(columns='dec_o')
y = spd_mean1['dec_o']

In [218]:
# check
X.shape, y.shape

((6266, 20), (6266,))

In [219]:
X.describe()

Unnamed: 0,gender,match,age,race,goal,int_corr,samerace,imprace,imprelig,age_o,race_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
count,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0
mean,0.500798,0.170763,26.25718,2.691205,2.152616,0.192596,0.405043,3.804748,3.734981,26.265148,2.690384,6.233132,7.223615,7.403204,6.418736,6.826152,5.554269,6.166317,5.233812,1.956145
std,0.500039,0.376332,3.506465,1.213586,1.421687,0.300382,0.49094,2.865457,2.797904,3.499901,1.212258,1.912573,1.689306,1.502261,1.910492,1.689681,1.987716,1.826321,2.106286,0.258137
min,0.0,0.0,18.0,1.0,1.0,-0.83,0.0,1.0,1.0,18.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,24.0,2.0,1.0,-0.01,0.0,1.0,1.0,24.0,2.0,5.0,6.0,7.0,5.0,6.0,5.0,5.0,4.0,2.0
50%,1.0,0.0,26.0,2.0,2.0,0.2,0.0,3.0,3.0,26.0,2.0,6.0,7.0,7.403204,6.418736,7.0,5.554269,6.0,5.0,2.0
75%,1.0,0.0,28.0,4.0,2.0,0.42,1.0,6.0,6.0,28.0,4.0,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0,2.0
max,1.0,1.0,42.0,6.0,6.0,0.9,1.0,10.0,10.0,42.0,6.0,10.5,10.0,10.0,11.0,10.0,10.0,10.0,10.0,8.0


Need scaling to use X (whole) as input. Will try both methods: 1) extract relavant features from X (scale 10), 2) use whole feature X with scale.

In [220]:
# spd_mean1_mini
Xm = spd_mean1_mini.drop(columns='dec_o')
ym = spd_mean1_mini['dec_o']

In [221]:
# check
Xm.shape, ym.shape

((6266, 8), (6266,))

In [222]:
Xm.describe()

Unnamed: 0,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o
count,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0
mean,6.233132,7.223615,7.403204,6.418736,6.826152,5.554269,6.166317,5.233812
std,1.912573,1.689306,1.502261,1.910492,1.689681,1.987716,1.826321,2.106286
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5.0,6.0,7.0,5.0,6.0,5.0,5.0,4.0
50%,6.0,7.0,7.403204,6.418736,7.0,5.554269,6.0,5.0
75%,8.0,8.0,8.0,8.0,8.0,7.0,7.0,7.0
max,10.5,10.0,10.0,11.0,10.0,10.0,10.0,10.0


In [223]:
# spd_mean1_scaled
Xs = spd_mean1_scaled.drop(columns='dec_o')
ys = spd_mean1_scaled['dec_o']

In [224]:
# check
Xs.shape, ys.shape

((6266, 20), (6266,))

In [225]:
Xs.describe()

Unnamed: 0,gender,match,age,race,goal,int_corr,samerace,imprace,imprelig,age_o,race_o,attr_o,sinc_o,intel_o,fun_o,amb_o,shar_o,like_o,prob_o,met_o
count,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0,6266.0
mean,6.382808e-16,9.466308e-16,-7.147437e-16,2.325338e-16,-9.428524e-16,7.907637000000001e-17,-6.520656e-16,2.865035e-17,-1.834366e-16,6.846139e-16,1.418414e-15,-6.506127e-16,-1.096438e-15,-4.078732e-16,-5.045437e-16,2.618751e-16,5.847718e-16,6.311581e-16,4.669636e-15,-3.420173e-15
std,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008,1.00008
min,-1.001597,-0.4537925,-2.355032,-1.393671,-0.8108032,-3.404588,-0.8251024,-0.9788916,-0.9775884,-2.361726,-1.394521,-3.259291,-4.276425,-4.928435,-3.359997,-4.040229,-2.794521,-3.376629,-2.485052,-3.704318
25%,-1.001597,-0.4537925,-0.6437707,-0.5696011,-0.8108032,-0.6745135,-0.8251024,-0.9788916,-0.9775884,-0.6472551,-0.5695482,-0.6448019,-0.7243876,-0.2684198,-0.7426617,-0.4889786,-0.2788696,-0.6386665,-0.5858231,0.1699051
50%,0.9984054,-0.4537925,-0.07335029,-0.5696011,-0.1073573,0.02465184,-0.8251024,-0.280867,-0.2627107,-0.07576472,-0.5695482,-0.1219042,-0.1323813,-5.912749e-16,-4.649321e-16,0.1028965,4.468694e-16,-0.09107402,-0.1110158,0.1699051
75%,0.9984054,-0.4537925,0.4970702,1.078539,-0.1073573,0.7571108,1.211971,0.76617,0.809606,0.4957257,1.080398,0.9238912,0.4596249,0.3972966,0.8277393,0.6947715,0.7273908,0.4565184,0.8385988,0.1699051
max,0.9984054,2.20365,4.490013,2.726678,2.706426,2.355203,1.211971,2.162219,2.239362,4.496158,2.730344,2.231136,1.643637,1.728729,2.39814,1.878522,2.236781,2.099296,2.263021,23.41524


In [226]:
# scaled depedent variable (dec_o) for logistic regression fitting does not work!
ys

0      -0.864495
1      -0.864495
2       1.156745
3       1.156745
4       1.156745
          ...   
6261   -0.864495
6262   -0.864495
6263   -0.864495
6264   -0.864495
6265   -0.864495
Name: dec_o, Length: 6266, dtype: float64

## 3.5 Training Data Development

### 3.5.1 LogisticRegression via sklearn

#### 3.5.1.1 Use X, y as whole without scaling

In [227]:
# refer to '14.1.2_3_Supervised Learning_FineTuning'
# Import necessary modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(X_train,y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(X_train, y_train)
test_accuracy = logreg_best.score(X_test, y_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
y_pred = logreg_best.predict(X_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Tuned Logistic Regression Parameters: {'C': 100000000.0}
Best score is 0.8282119708738058
Accuracy on training data: 0.83
Accuracy on test data:     0.82
[[647  67]
 [163 377]]
              precision    recall  f1-score   support

           0       0.80      0.91      0.85       714
           1       0.85      0.70      0.77       540

    accuracy                           0.82      1254
   macro avg       0.82      0.80      0.81      1254
weighted avg       0.82      0.82      0.81      1254



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Need to scale the dataset if using X, y as a whole (from spd_mean1)

#### 3.5.1.2 Use Xm, ym: extracted relevant features

In [228]:
# logistic regresion w/o penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xm_train, ym_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xm_train, ym_train)
test_accuracy = logreg_best.score(Xm_test, ym_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ym_pred = logreg_best.predict(Xm_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred))
print(classification_report(ym_test, ym_pred))

Tuned Logistic Regression Parameters: {'C': 0.4393970560760795}
Best score is 0.7711496249773633
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



In [229]:
# logistic regresion w penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']} 
#%%%%% this time 'l1' does not work with the code, when I can check for both 'l1' and 'l2' when I can't?

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xm_train, ym_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xm_train, ym_train)
test_accuracy = logreg_best.score(Xm_test, ym_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ym_pred = logreg_best.predict(Xm_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred))
print(classification_report(ym_test, ym_pred))

Traceback (most recent call last):
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1304, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 442, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1304, in fit
    solver = _check_solver(self.solver, self.

Traceback (most recent call last):
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1304, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 442, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

Traceback (most recent call last):
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mkoba\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1304, in fit
    solver = _check_solver(self.solver, self.

Tuned Logistic Regression Parameters: {'C': 0.4393970560760795, 'penalty': 'l2'}
Best score is 0.7711496249773633
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



In [230]:
# logistic regresion w penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l2']} # 'l2' penalty works b/c default default ‘lbfgs’ solvers support only l2 penalties.

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression()

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xm_train, ym_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xm_train, ym_train)
test_accuracy = logreg_best.score(Xm_test, ym_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ym_pred = logreg_best.predict(Xm_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred))
print(classification_report(ym_test, ym_pred))

Tuned Logistic Regression Parameters: {'C': 0.4393970560760795, 'penalty': 'l2'}
Best score is 0.7711496249773633
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



'l2' (Ridge) penalty term did not improve the model performance.

In [231]:
# refer to 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'

# logistic regresion w elastic penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
l1_space = np.arange(0,1,0.1)
param_grid = {'C': c_space, 'penalty': ['elasticnet'], 'l1_ratio': l1_space} 


# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver='saga', max_iter=10000) 
#%%%%%% need to specify solver and max_inter (default setting does not work)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xm_train, ym_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xm_train, ym_train)
test_accuracy = logreg_best.score(Xm_test, ym_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ym_pred = logreg_best.predict(Xm_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred))
print(classification_report(ym_test, ym_pred))

Tuned Logistic Regression Parameters: {'C': 0.4393970560760795, 'l1_ratio': 0.4, 'penalty': 'elasticnet'}
Best score is 0.7715486275703827
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



elasticnet penalty improved the model a little (slightly higher tp and preciscion for 1)

In [232]:
# use scoring = 'roc_auc' instead of 'accuracy' 
#%%%%%% should not use 'accuracy' scoring for data with imbalanced disribution.
# for logistic regression, target variable (0,1), if one has much more than the other (more than 10x > imbalanced)

# refer to 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'

# logistic regresion w elastic penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
l1_space = np.arange(0,1,0.1)
param_grid = {'C': c_space, 'penalty': ['elasticnet'], 'l1_ratio': l1_space} 


# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver='saga', max_iter=10000) 
#%%%%%% need to specify solver and max_inter (default setting does not work)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, scoring='roc_auc', cv=5)

# Fit it to the data
logreg_cv.fit(Xm_train, ym_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xm_train, ym_train)
test_accuracy = logreg_best.score(Xm_test, ym_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ym_pred = logreg_best.predict(Xm_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred))
print(classification_report(ym_test, ym_pred))

Tuned Logistic Regression Parameters: {'C': 0.05179474679231213, 'l1_ratio': 0.4, 'penalty': 'elasticnet'}
Best score is 0.8525581234505676
Accuracy on training data: 0.77
Accuracy on test data:     0.76
[[586 128]
 [172 368]]
              precision    recall  f1-score   support

           0       0.77      0.82      0.80       714
           1       0.74      0.68      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.75      1254
weighted avg       0.76      0.76      0.76      1254



In [233]:
type(ym_pred),type(ym_test)

(numpy.ndarray, pandas.core.series.Series)

In [234]:
ym_pred

array([0, 0, 1, ..., 1, 1, 0], dtype=int64)

In [235]:
### should build a pipeline!?? fill nan with median..etc
# refer to '14.1.2_4_Supervised Learning with scikit-learn_Preprocessing and Pipeline'

#### 3.5.1.3 Use Xs, ys as whole with scaling

In [236]:
'''
# refer to 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'

# logistic regresion w elastic penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
l1_space = np.arange(0,1,0.1)
param_grid = {'C': c_space, 'penalty': ['elasticnet'], 'l1_ratio': l1_space} 


# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver='saga', max_iter=10000) 
#%%%%%% need to specify solver and max_inter (default setting does not work)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xs_train, ys_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xs_train, ys_train)
test_accuracy = logreg_best.score(Xs_test, ys_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ys_pred = logreg_best.predict(Xs_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ys_test, ys_pred))
print(classification_report(ys_test, ys_pred))
'''

'\n# refer to \'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\'\n\n# logistic regresion w elastic penalty\n# refer to \'14.1.2_3_Supervised Learning_Tuning\'\n# Import necessary modules\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\n\n# Split the data into a training and test set.\nXs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, ys, test_size=0.2, random_state=42)\n\n# Setup the hyperparameter grid\nc_space = np.logspace(-5, 8, 15)\nl1_space = np.arange(0,1,0.1)\nparam_grid = {\'C\': c_space, \'penalty\': [\'elasticnet\'], \'l1_ratio\': l1_space} \n\n\n# Instantiate a logistic regression classifier: logreg\nlogreg = LogisticRegression(solver=\'saga\', max_iter=10000) \n#%%%%%% need to specify solver and max_inter (default setting does not work)\n\n# Instantiate the GridSearchCV object: logreg_cv\nlogreg_cv = GridSear

The code above gives: **ValueError: Unknown label type: 'continuous'**

In [237]:
# use non-scaled dependent variable y!

# refer to 'https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html'
# logistic regresion w elastic penalty
# refer to '14.1.2_3_Supervised Learning_Tuning'
# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Split the data into a training and test set
# scaled depedent variable (dec_o) for logistic regression fitting does not work!
# gives error if used ys_train b/c ys is continuous and not 0, 1 to fit logreg!!
# need to scale without the dec_o.....so keeing the original y
Xs_train, Xs_test, ys_train, ys_test = train_test_split(Xs, y, test_size=0.2, random_state=42)

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
l1_space = np.arange(0,1,0.1)
param_grid = {'C': c_space, 'penalty': ['elasticnet'], 'l1_ratio': l1_space} 

# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(solver='saga', max_iter=10000) 
#%%%%%% need to specify solver and max_inter (default setting does not work)

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)

# Fit it to the data
logreg_cv.fit(Xs_train, ys_train) 

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# apply best estimators to test set
# refer to '14.2.11 Logistic Regression Advanced Case Study_mk'
logreg_best = logreg_cv.best_estimator_
training_accuracy = logreg_best.score(Xs_train, ys_train)
test_accuracy = logreg_best.score(Xs_test, ys_test)
print("Accuracy on training data: {:0.2f}".format(training_accuracy))
print("Accuracy on test data:     {:0.2f}".format(test_accuracy))

# other way to check the data: confusion
# refer to '14.1.2_3_SupervisedLearning_Tuning'
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Predict the labels of the test data: y_pred
ys_pred = logreg_best.predict(Xs_test)

# Compute and print the confusion matrix and classification report
print(confusion_matrix(ys_test, ys_pred))
print(classification_report(ys_test, ys_pred))

Tuned Logistic Regression Parameters: {'C': 0.05179474679231213, 'l1_ratio': 0.1, 'penalty': 'elasticnet'}
Best score is 0.8332033838603948
Accuracy on training data: 0.83
Accuracy on test data:     0.82
[[651  63]
 [163 377]]
              precision    recall  f1-score   support

           0       0.80      0.91      0.85       714
           1       0.86      0.70      0.77       540

    accuracy                           0.82      1254
   macro avg       0.83      0.80      0.81      1254
weighted avg       0.82      0.82      0.82      1254



Best model so far, yet run time is much longer.

In [238]:
# refer to 'https://stats.stackexchange.com/questions/59392/should-you-ever-standardise-binary-variables'
#%%%%% not sure if above scaling is good enough or need to leave out all the binary variable for scaling

### 3.5.2 LogisticRegression via statsmodels

#### 3.5.2.1 Use Xm, ym: extracted relevant features

In [239]:
# statsmodel w/o constant
# Import the statsmodels module
# refer to 'https://www.geeksforgeeks.org/logistic-regression-using-statsmodels/'
import statsmodels.api as sm

log_reg1 = sm.Logit(ym_train, Xm_train).fit()  

Optimization terminated successfully.
         Current function value: 0.558129
         Iterations 6


In [240]:
# printing the summary table 
print(log_reg1.summary()) 

                           Logit Regression Results                           
Dep. Variable:                  dec_o   No. Observations:                 5012
Model:                          Logit   Df Residuals:                     5004
Method:                           MLE   Df Model:                            7
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                  0.1822
Time:                        21:24:01   Log-Likelihood:                -2797.3
converged:                       True   LL-Null:                       -3420.4
Covariance Type:            nonrobust   LLR p-value:                7.472e-265
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
attr_o         0.2678      0.025     10.737      0.000       0.219       0.317
sinc_o        -0.2946      0.029    -10.325      0.000      -0.351      -0.239
intel_o       -0.3439      0.033    -10.393      0.0

In [241]:
# statsmodel w/ constant
# Import the statsmodels module
# refer to 'https://www.geeksforgeeks.org/logistic-regression-using-statsmodels/'
import statsmodels.api as sm

# Create constants for X, so the model knows its bounds
Xm = sm.add_constant(Xm)

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

log_reg2 = sm.Logit(ym_train, Xm_train).fit()  

Optimization terminated successfully.
         Current function value: 0.469677
         Iterations 7


In [242]:
# printing the summary table 
print(log_reg2.summary()) 

                           Logit Regression Results                           
Dep. Variable:                  dec_o   No. Observations:                 5012
Model:                          Logit   Df Residuals:                     5003
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                  0.3118
Time:                        21:24:01   Log-Likelihood:                -2354.0
converged:                       True   LL-Null:                       -3420.4
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -6.5588      0.260    -25.228      0.000      -7.068      -6.049
attr_o         0.4912      0.029     16.774      0.000       0.434       0.549
sinc_o        -0.1694      0.033     -5.212      0.0

Higher the Pseudo R-squ.score, the better the model is. Adding constant helped improved the Pseudo R-squ. a little.

In [243]:
# statsmodel w/ constant & penalty term
# Import the statsmodels module
# refer to 'https://www.geeksforgeeks.org/logistic-regression-using-statsmodels/'
import statsmodels.api as sm

# Create constants for X, so the model knows its bounds
Xm = sm.add_constant(Xm)

# Split the data into a training and test set.
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, test_size=0.2, random_state=42)

# refer to 'https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html'
log_reg3 = sm.Logit(ym_train, Xm_train).fit_regularized() # default method='l1' (l2 or elasticnet penalty not available)

Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.4696773953325422
            Iterations: 33
            Function evaluations: 35
            Gradient evaluations: 33


In [244]:
# printing the summary table 
print(log_reg3.summary()) 

                           Logit Regression Results                           
Dep. Variable:                  dec_o   No. Observations:                 5012
Model:                          Logit   Df Residuals:                     5003
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                  0.3118
Time:                        21:24:01   Log-Likelihood:                -2354.0
converged:                       True   LL-Null:                       -3420.4
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -6.5589      0.260    -25.228      0.000      -7.068      -6.049
attr_o         0.4912      0.029     16.774      0.000       0.434       0.549
sinc_o        -0.1694      0.033     -5.212      0.0

In [245]:
# Predict the labels of the test data: y_pred
ym_pred2 = log_reg2.predict(Xm_test)
ym_pred3 = log_reg3.predict(Xm_test)

In [246]:
type(ym_test), type(ym_pred2), type(ym_pred3)

(pandas.core.series.Series,
 pandas.core.series.Series,
 pandas.core.series.Series)

In [247]:
ym_pred2, ym_pred3

(5607    0.441238
 2302    0.344809
 4672    0.939217
 3831    0.146094
 6005    0.038742
           ...   
 4088    0.628473
 952     0.137084
 3574    0.866432
 5880    0.535517
 2024    0.093702
 Length: 1254, dtype: float64,
 5607    0.441232
 2302    0.344805
 4672    0.939217
 3831    0.146094
 6005    0.038744
           ...   
 4088    0.628470
 952     0.137091
 3574    0.866431
 5880    0.535515
 2024    0.093705
 Length: 1254, dtype: float64)

In [248]:
# should convert this probability to 0 (p < 0.5), or 1 (p >= 0.5) 
# in order to compute the confusion matrix and classification report (check the sklearn ym_pred output above)
ym_pred2_df = pd.DataFrame({'probability': ym_pred2})
print(ym_pred2_df)

# hope to use list comprehension
# refer to 'https://chrisalbon.com/python/data_wrangling/pandas_list_comprehension/'
ym_pred2_df['prediction'] = [0 if row < 0.5 else 1 for row in ym_pred2_df.probability] 
#%%%% ym_pred_df.prediction = [list comprehenshion] did not work!
ym_pred2_df 

      probability
5607     0.441238
2302     0.344809
4672     0.939217
3831     0.146094
6005     0.038742
...           ...
4088     0.628473
952      0.137084
3574     0.866432
5880     0.535517
2024     0.093702

[1254 rows x 1 columns]


Unnamed: 0,probability,prediction
5607,0.441238,0
2302,0.344809,0
4672,0.939217,1
3831,0.146094,0
6005,0.038742,0
...,...,...
4088,0.628473,1
952,0.137084,0
3574,0.866432,1
5880,0.535517,1


In [249]:
ym_pred3_df = pd.DataFrame({'probability': ym_pred3})
print(ym_pred3_df)

# hope to use list comprehension
# refer to 'https://chrisalbon.com/python/data_wrangling/pandas_list_comprehension/'
ym_pred3_df['prediction'] = [0 if row < 0.5 else 1 for row in ym_pred3_df.probability] 
#%%%% ym_pred_df.prediction = [list comprehenshion] did not work!
ym_pred3_df 

      probability
5607     0.441232
2302     0.344805
4672     0.939217
3831     0.146094
6005     0.038744
...           ...
4088     0.628470
952      0.137091
3574     0.866431
5880     0.535515
2024     0.093705

[1254 rows x 1 columns]


Unnamed: 0,probability,prediction
5607,0.441232,0
2302,0.344805,0
4672,0.939217,1
3831,0.146094,0
6005,0.038744,0
...,...,...
4088,0.628470,1
952,0.137091,0
3574,0.866431,1
5880,0.535515,1


In [250]:
# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred2_df.prediction))
print(classification_report(ym_test, ym_pred2_df.prediction))

[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



In [251]:
# Compute and print the confusion matrix and classification report
print(confusion_matrix(ym_test, ym_pred3_df.prediction))
print(classification_report(ym_test, ym_pred3_df.prediction))
# refer to 'https://www.youtube.com/watch?v=Kdsp6soqA7o&ab_channel=StatQuestwithJoshStarmer' to interpret confusion matrix
# refer to 'https://www.youtube.com/watch?v=2osIZ-dSPGE&ab_channel=codebasics' to interpret classififation report

[[587 127]
 [170 370]]
              precision    recall  f1-score   support

           0       0.78      0.82      0.80       714
           1       0.74      0.69      0.71       540

    accuracy                           0.76      1254
   macro avg       0.76      0.75      0.76      1254
weighted avg       0.76      0.76      0.76      1254



In [252]:
ym_pred3_df.prediction == ym_pred2_df.prediction

5607    True
2302    True
4672    True
3831    True
6005    True
        ... 
4088    True
952     True
3574    True
5880    True
2024    True
Name: prediction, Length: 1254, dtype: bool

In [253]:
ym_pred3_df.prediction.all() == ym_pred2_df.prediction.all()

True

Including the penalty term in statsmodel didn't help at all (gave exactly same predition).

In [254]:
# refer to '11.4.1_Case Study - Linear Regression/Springboard Regression Case Study - the Red Wine Dataset - Tier 3_mk.ipynb' 
# for some statsmodel codes to predit from x_test


In [255]:
# refer to '14.1.2_4_Supervised Learning with scikit-learn_Preprocessing and Pipeline' for some elesticNet code.

In [256]:
# should try knn, randomforest too!?

### reference
- sklearn codes:
    - '14.1.2_3_Supervised Learning_Tuning': logisticRegression, confusion_matrix, classification_report, 
    - '14.1.2_4_Supervised Learning with scikit-learn_Preprocessing and Pipeline': buiding pipeline, scaler, get_dummies() 
    - '6_GuidedCapstone/04_preprocessing_and_training_mk': scaler, Random Forest model
     - '14.2.11_Case Study - Logistic Regression/Logistic Regression Advanced Case Study_mk': plot logisticRegression output.
- Statsmodel codes:
    - '11.4.1_Case Study - Linear Regression/Regression Case Study - the Red Wine Dataset - Tier 3_mk': sm.OLS(y, X), plot predictions (y_test vs. y_pred)
    
    
- refer to 'https://pandas.pydata.org/pandas-docs/version/0.24.0rc1/api/generated/pandas.Series.to_numpy.html'
    - series to numpy: s.to_numpy()
   

### Questions:
- how do perform cross validation on statsmodel?
    - Answers:
    - currently not avaliable(can't perform cross validation using statsmodels.api yet) and is not compatibible with sklearn cross_val_score or GredSearchCV...etc
    - need to write custom codes if you really want to
    - people usually use sklearn for building ML model.
    - statsmodel is used for quick stats calculation from a model than model optimazation. 
- how to use ElasticNet in sklearn LogisticRegression? (currently only l1 or l2 are available!?)
    - Answers:
    - need to change the default solver (lbfgs) to saga in order to use elasticnet
        - param_grid = {'C': c_space, 'penalty': ['elasticnet'], 'l1_ratio': l1_space} 
        - logreg = LogisticRegression(solver='saga', max_iter=10000) 
        - logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
- how to compare the performance of statsmodel and sklearn lotistic model
    - Answers:
    - use sklearn confusion_matrix, classification_report on the y_pred, y_test
    - need to convert the statsmodel's y_pred in probability to binary system (0,1) in advance
    
- how to choose scoring system for GridSearhCV like we can do for cross_val_score? (what is the defaut score?, accuracy?)
    - cv_accuracy = cross_val_score(clf, Xlr, ylr, cv=5, scoring='accuracy')
    - cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
    - Answers:
    - use GridSearhCV(logreg, param_grid, scoring = 'roc_auc', cv=5), default scoring = 'accuracy'
    - should not use 'accuracy' scoring for data with imbalanced disribution.
    - for logistic regression, target variable (0,1), if there is much more (more than 10x!?) 1 than 0 (or vice verso) > imbalanced data
- for logistic regression model, do I need to leave all the binary variable (or only dependent variable) out for scaling.
    - Answers: 
    - leaving it out all the binary viarable for scaling for logistic regression model might be better!?
    - 