- Clean out the notebook
- Work on numeric columns
- **CHANGE STATUS GROUP**


## Notes of what I have done here ##
- Changed the status group to binary by folding the "functional needs repair" pumps into "non functional".
- Imputed the placeholder zeros in some numeric columns:
    - Construction_Year : Most frequent
    - Population : Median
    - Amount_tsh : Mean
- Removed some categorical variables such as **permit** and **public meeting**
- Changed the max number of iterations in the fit to get rid of the error.
- Fancy output with mean and stdev of final scores

In [257]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score


In [258]:
#Read two data sets and put them into two different DFs
df_p = pd.read_csv('pumps.csv', index_col = 0)
df_py = pd.read_csv('pumps_y.csv', index_col = 0)

In [259]:
#Check shape
df_p.shape, df_py.shape

((59400, 39), (59400, 1))

In [260]:
#Merging pumps_y as a new column on pumps
df_p['status_group'] = df_py['status_group']

In [261]:
#Do train/test split
X = df_p.loc[:, 'amount_tsh':'waterpoint_type_group']
y = df_p.loc[:, 'status_group']
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Feature engineering starts here.
# Once the model is built, repeat the same steps with Xtest and ytest in place of Xtrain and ytrain

In [262]:
#Check the sizes
Xtrain.shape, ytrain.shape

((47520, 39), (47520,))

In [263]:
#Merge the training data back together
df_p = pd.concat([Xtrain, ytrain], axis = 1)

In [264]:
df_p.head(20)

id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
454,50.0,2013-02-27,Dmdd,2092,DMDD,35.42602,-4.227446,Narmo,0,Internal,...,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional
510,0.0,2011-03-17,Cmsr,0,Gove,35.510074,-5.724555,Lukali,0,Internal,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
14146,0.0,2011-07-10,Kkkt,0,KKKT,32.499866,-9.081222,Mahakama,0,Lake Rukwa,...,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
47410,0.0,2011-04-12,,0,,34.060484,-8.830208,Shule Ya Msingi Chosi A,0,Rufiji,...,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,non functional
1288,300.0,2011-04-05,Ki,1023,Ki,37.03269,-6.040787,Kwa Mjowe,0,Wami / Ruvu,...,salty,salty,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
13095,0.0,2011-08-08,Hesawa,0,DWE,33.509112,-2.648505,Kwa Mudaba,0,Lake Victoria,...,salty,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
558,0.0,2013-03-01,World Vision,0,World vision,33.731347,-3.284633,Mwamagulya,0,Internal,...,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump,functional
35626,0.0,2011-03-21,Selous G,298,Selous G,36.864072,-7.935517,Kwamligo,0,Rufiji,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
8696,0.0,2011-08-02,Government Of Tanzania,0,Government,33.423658,-2.606991,Kwa Nuhu,0,Lake Victoria,...,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,non functional
48650,0.0,2013-01-22,Government Of Tanzania,1141,DWE,30.381136,-4.640729,Msebei,0,Lake Tanganyika,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [265]:
#Replace "functional needs repair" by "non functional"
#Rationale: we don't want a pump in bad shape to slip under the radar

df_p['status_group'] = df_p['status_group'].str.replace('functional needs repair', 'non functional')
df_p['status_group'].value_counts()

functional        25802
non functional    21718
Name: status_group, dtype: int64
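
The `str.replace` call above matches substrings, which is safe here only because no other label contains "functional needs repair". A minimal sketch of an exact-match alternative follows; the toy Series is an assumption for illustration and not the notebook's data.

```python
import pandas as pd

# Toy labels (assumed), covering the three original classes
status = pd.Series(
    ["functional", "functional needs repair", "non functional", "functional"],
    name="status_group",
)

# Series.replace with a dict swaps whole values only;
# labels that are not keys of the dict are left unchanged
binary_status = status.replace({"functional needs repair": "non functional"})
print(binary_status.value_counts())
```

With a dict, a label such as "functional" can never be touched by accident, which may be preferable if more labels are added later.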

In [266]:
#Missing values
#Note that these counts cover only the training split, so they differ from the full data set
df_p.isnull().sum() # number of missing values 

amount_tsh                   0
date_recorded                0
funder                    2876
gps_height                   0
installer                 2889
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 296
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            2689
recorded_by                  0
scheme_management         3102
scheme_name              22523
permit                    2439
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_group                0
quantity

In [267]:
#Remove scheme_name and date_recorded:
#scheme_name is mostly empty and date_recorded is not something we can correlate with the target

df_p = df_p.drop(['scheme_name', 'date_recorded'], axis=1)

In [301]:
#Divide df into numeric and categorical
#Numeric df (num_private is left out)

df_num = df_p[['amount_tsh', 'gps_height', 'longitude', 'latitude', 'region_code', 'district_code', 'population', 'construction_year']]


In [269]:
#Categorical df

df_cat = df_p[['funder', 'installer', 'wpt_name', 'basin', 'subvillage', 'region', 'lga', 'ward', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type', 'extraction_type_group', 'extraction_type_class', 'management', 'management_group', 'payment', 'payment_type', 'water_quality', 'quality_group', 'quantity', 'quantity_group', 'source', 'source_type', 'source_class', 'waterpoint_type', 'waterpoint_type_group', 'status_group']]

In [270]:
#Imputation with mean, median and most frequent
#Zeros act as placeholders for missing values in these columns

#Work on an explicit copy and assign the results back,
#so that pandas does not raise a SettingWithCopyWarning
df_num = df_num.copy()

#Construction_Year: 0 is the most frequent value, so take the second most frequent one
year_to_replace_with = df_num['construction_year'].value_counts().index[1]
df_num['construction_year'] = df_num['construction_year'].replace(0, year_to_replace_with)

#Population: median (computed with the zeros still included)
population_to_replace_with = df_num['population'].median()
df_num['population'] = df_num['population'].replace(0, population_to_replace_with)

#amount_tsh: rounded mean (computed with the zeros still included)
amount_to_replace_with = round(df_num['amount_tsh'].mean())
df_num['amount_tsh'] = df_num['amount_tsh'].replace(0, amount_to_replace_with)
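
The zero imputation in this cell can also be sketched with scikit-learn's `SimpleImputer`, treating 0 as the missing-value marker. This is only an illustration on assumed toy data: unlike the cell above, `SimpleImputer` ignores the placeholder zeros when it computes the median or the most frequent value, so the imputed numbers would differ from the ones in this notebook.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (assumed values); 0 marks a missing entry
toy = pd.DataFrame({
    "population": [0, 10, 20, 0, 30],
    "construction_year": [0, 2010, 2010, 1998, 0],
})

# The statistics are computed on the non-zero entries only
median_imputer = SimpleImputer(missing_values=0, strategy="median")
mode_imputer = SimpleImputer(missing_values=0, strategy="most_frequent")

toy["population"] = median_imputer.fit_transform(toy[["population"]]).ravel()
toy["construction_year"] = mode_imputer.fit_transform(toy[["construction_year"]]).ravel()
print(toy)
```

A fitted imputer can later be applied to the test split with `transform`, so the test data is filled with statistics learned on the training data.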


In [271]:
df_num

id,amount_tsh,gps_height,longitude,latitude,region_code,district_code,population,construction_year
454,50.0,2092,35.426020,-4.227446e+00,21,1,160,1998
510,322.0,0,35.510074,-5.724555e+00,1,6,25,2010
14146,322.0,0,32.499866,-9.081222e+00,12,6,25,2010
47410,322.0,0,34.060484,-8.830208e+00,12,7,25,2010
1288,300.0,1023,37.032690,-6.040787e+00,5,1,120,1997
13095,322.0,0,33.509112,-2.648505e+00,19,2,25,2010
558,322.0,0,33.731347,-3.284633e+00,17,2,25,2010
35626,322.0,298,36.864072,-7.935517e+00,5,3,250,2009
8696,322.0,0,33.423658,-2.606991e+00,19,2,25,2010
48650,322.0,1141,30.381136,-4.640729e+00,16,2,1520,2009


In [272]:
df_num.shape, df_cat.shape

((47520, 8), (47520, 29))

In [273]:
df_cat.nunique()

funder                    1698
installer                 1923
wpt_name                 30742
basin                        9
subvillage               17232
region                      21
lga                        125
ward                      2076
public_meeting               2
recorded_by                  1
scheme_management           12
permit                       2
extraction_type             18
extraction_type_group       13
extraction_type_class        7
management                  12
management_group             5
payment                      7
payment_type                 7
water_quality                8
quality_group                6
quantity                     5
quantity_group               5
source                      10
source_type                  7
source_class                 3
waterpoint_type              7
waterpoint_type_group        6
status_group                 2
dtype: int64

In [274]:
#Fill all NaNs with 'not available'
df_cat_fillna = df_cat.fillna('not available')

In [275]:
df_cat_fillna.head(20)

id,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,recorded_by,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
454,Dmdd,DMDD,Narmo,Internal,Bashnet Kati,Manyara,Babati,Bashinet,True,GeoData Consultants Ltd,...,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional
510,Cmsr,Gove,Lukali,Internal,Lukali,Dodoma,Bahi,Lamaiti,True,GeoData Consultants Ltd,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
14146,Kkkt,KKKT,Mahakama,Lake Rukwa,Chawalikozi,Mbeya,Mbozi,Ndalambo,True,GeoData Consultants Ltd,...,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
47410,not available,not available,Shule Ya Msingi Chosi A,Rufiji,Shuleni,Mbeya,Mbarali,Chimala,True,GeoData Consultants Ltd,...,soft,good,insufficient,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,non functional
1288,Ki,Ki,Kwa Mjowe,Wami / Ruvu,Ngholong,Morogoro,Kilosa,Chakwale,True,GeoData Consultants Ltd,...,salty,salty,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
13095,Hesawa,DWE,Kwa Mudaba,Lake Victoria,Lumeji,Mwanza,Magu,Sukuma,True,GeoData Consultants Ltd,...,salty,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
558,World Vision,World vision,Mwamagulya,Internal,Ngomeni,Shinyanga,Maswa,Busilili,True,GeoData Consultants Ltd,...,soft,good,seasonal,seasonal,shallow well,shallow well,groundwater,hand pump,hand pump,functional
35626,Selous G,Selous G,Kwamligo,Rufiji,Namisatu,Morogoro,Kilombero,Kiberege,True,GeoData Consultants Ltd,...,soft,good,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
8696,Government Of Tanzania,Government,Kwa Nuhu,Lake Victoria,Nyamiselya,Mwanza,Magu,Nyigogo,True,GeoData Consultants Ltd,...,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,non functional
48650,Government Of Tanzania,DWE,Msebei,Lake Tanganyika,Msebei,Kigoma,Kasulu,Ruhita,True,GeoData Consultants Ltd,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional


In [276]:
#Check that no NaNs remain in the categorical df
df_cat_fillna.isnull().sum()

funder                   0
installer                0
wpt_name                 0
basin                    0
subvillage               0
region                   0
lga                      0
ward                     0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
waterpoint_type          0
waterpoint_type_group    0
status_group             0
dtype: int64

In [277]:
#Dummify the target
dummy_target_var = pd.get_dummies(df_cat_fillna['status_group'])

In [278]:
#Check the dummy df. Normally one of the two redundant columns would be dropped; both are kept here on purpose
dummy_target_var.head()

id,functional,non functional
454,1,0
510,1,0
14146,0,1
47410,0,1
1288,0,1


In [279]:
#Concat original cat df and dummified target

df_cat_fillna_dummy_target = pd.concat([df_cat_fillna, dummy_target_var], axis = 1)
df_cat_fillna_dummy_target.head(3)

id,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,recorded_by,...,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group,functional,non functional
454,Dmdd,DMDD,Narmo,Internal,Bashnet Kati,Manyara,Babati,Bashinet,True,GeoData Consultants Ltd,...,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional,1,0
510,Cmsr,Gove,Lukali,Internal,Lukali,Dodoma,Bahi,Lamaiti,True,GeoData Consultants Ltd,...,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional,1,0
14146,Kkkt,KKKT,Mahakama,Lake Rukwa,Chawalikozi,Mbeya,Mbozi,Ndalambo,True,GeoData Consultants Ltd,...,enough,enough,shallow well,shallow well,groundwater,other,other,non functional,0,1


In [280]:
#Target Encoding a Subset of columns
#Each category is replaced by the share of functional / non functional pumps in that category
#Note: 'extraction_type_class' appears twice in the list; the second pass just recomputes the same columns

column_list_to_target_encode = ['basin','extraction_type_class','region','scheme_management','management_group','extraction_type_class' ,'payment_type', 'quality_group', 'quantity_group',
                               'source_class', 'waterpoint_type_group']
for column in column_list_to_target_encode:
    #Per-category means of the two dummy target columns only (the other columns are not numeric)
    target_means = df_cat_fillna_dummy_target.groupby(column)[['functional', 'non functional']].mean()
    df_cat_fillna_dummy_target[f'{column}_func'] = df_cat_fillna_dummy_target[column].map(target_means['functional'])
    df_cat_fillna_dummy_target[f'{column}_nonfunc'] = df_cat_fillna_dummy_target[column].map(target_means['non functional'])
df_cat_fillna_dummy_target.head()

id,funder,installer,wpt_name,basin,subvillage,region,lga,ward,public_meeting,recorded_by,...,payment_type_func,payment_type_nonfunc,quality_group_func,quality_group_nonfunc,quantity_group_func,quantity_group_nonfunc,source_class_func,source_class_nonfunc,waterpoint_type_group_func,waterpoint_type_group_nonfunc
454,Dmdd,DMDD,Narmo,Internal,Bashnet Kati,Manyara,Babati,Bashinet,True,GeoData Consultants Ltd,...,0.677281,0.322719,0.565821,0.434179,0.524537,0.475463,0.543124,0.456876,0.577404,0.422596
510,Cmsr,Gove,Lukali,Internal,Lukali,Dodoma,Bahi,Lamaiti,True,GeoData Consultants Ltd,...,0.449503,0.550497,0.565821,0.434179,0.652461,0.347539,0.543124,0.456876,0.615505,0.384495
14146,Kkkt,KKKT,Mahakama,Lake Rukwa,Chawalikozi,Mbeya,Mbozi,Ndalambo,True,GeoData Consultants Ltd,...,0.449503,0.550497,0.565821,0.434179,0.652461,0.347539,0.543124,0.456876,0.132013,0.867987
47410,not available,not available,Shule Ya Msingi Chosi A,Rufiji,Shuleni,Mbeya,Mbarali,Chimala,True,GeoData Consultants Ltd,...,0.661089,0.338911,0.565821,0.434179,0.524537,0.475463,0.541889,0.458111,0.577404,0.422596
1288,Ki,Ki,Kwa Mjowe,Wami / Ruvu,Ngholong,Morogoro,Kilosa,Chakwale,True,GeoData Consultants Ltd,...,0.617945,0.382055,0.458663,0.541337,0.652461,0.347539,0.543124,0.456876,0.132013,0.867987
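
Since the same feature engineering has to be repeated on the test split, the target encoding can be sketched as a "fit on train, apply anywhere" pair of helpers. The function names `fit_target_means` and `apply_target_means`, the toy frames, and the fallback to the overall training mean for unseen categories are all assumptions for illustration, not part of this notebook.

```python
import pandas as pd

def fit_target_means(train, column, target):
    """Per-category mean of a 0/1 target, learned on the training rows only."""
    return train.groupby(column)[target].mean()

def apply_target_means(frame, column, means, fallback):
    """Map categories to the learned means; unseen categories get the fallback."""
    return frame[column].map(means).fillna(fallback)

# Toy data (assumed): the last test row holds a category absent from the training rows
train = pd.DataFrame({
    "basin": ["Rufiji", "Rufiji", "Internal", "Internal", "Internal"],
    "functional": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"basin": ["Rufiji", "Internal", "unseen basin"]})

means = fit_target_means(train, "basin", "functional")
fallback = train["functional"].mean()  # overall share of functional pumps
test["basin_func"] = apply_target_means(test, "basin", means, fallback)
print(test)
```

Keeping the learned means separate from the frame they were computed on means the test labels are never used for encoding.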


In [281]:
#Check the shape
df_cat_fillna_dummy_target.shape

(47520, 51)

In [282]:
#Check the column index to make sure that everything is there
df_cat_fillna_dummy_target.columns

Index(['funder', 'installer', 'wpt_name', 'basin', 'subvillage', 'region',
       'lga', 'ward', 'public_meeting', 'recorded_by', 'scheme_management',
       'permit', 'extraction_type', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group', 'payment',
       'payment_type', 'water_quality', 'quality_group', 'quantity',
       'quantity_group', 'source', 'source_type', 'source_class',
       'waterpoint_type', 'waterpoint_type_group', 'status_group',
       'functional', 'non functional', 'basin_func', 'basin_nonfunc',
       'extraction_type_class_func', 'extraction_type_class_nonfunc',
       'region_func', 'region_nonfunc', 'scheme_management_func',
       'scheme_management_nonfunc', 'management_group_func',
       'management_group_nonfunc', 'payment_type_func', 'payment_type_nonfunc',
       'quality_group_func', 'quality_group_nonfunc', 'quantity_group_func',
       'quantity_group_nonfunc', 'source_class_func', 'source_class_nonfunc',
       

In [283]:
#merge with numerical df, just assign it to the new df
num_dumm_cat = pd.concat([df_num, df_cat_fillna_dummy_target], axis = 1)

In [284]:
num_dumm_cat.head()

id,amount_tsh,gps_height,longitude,latitude,region_code,district_code,population,construction_year,funder,installer,...,payment_type_func,payment_type_nonfunc,quality_group_func,quality_group_nonfunc,quantity_group_func,quantity_group_nonfunc,source_class_func,source_class_nonfunc,waterpoint_type_group_func,waterpoint_type_group_nonfunc
454,50.0,2092,35.42602,-4.227446,21,1,160,1998,Dmdd,DMDD,...,0.677281,0.322719,0.565821,0.434179,0.524537,0.475463,0.543124,0.456876,0.577404,0.422596
510,322.0,0,35.510074,-5.724555,1,6,25,2010,Cmsr,Gove,...,0.449503,0.550497,0.565821,0.434179,0.652461,0.347539,0.543124,0.456876,0.615505,0.384495
14146,322.0,0,32.499866,-9.081222,12,6,25,2010,Kkkt,KKKT,...,0.449503,0.550497,0.565821,0.434179,0.652461,0.347539,0.543124,0.456876,0.132013,0.867987
47410,322.0,0,34.060484,-8.830208,12,7,25,2010,not available,not available,...,0.661089,0.338911,0.565821,0.434179,0.524537,0.475463,0.541889,0.458111,0.577404,0.422596
1288,300.0,1023,37.03269,-6.040787,5,1,120,1997,Ki,Ki,...,0.617945,0.382055,0.458663,0.541337,0.652461,0.347539,0.543124,0.456876,0.132013,0.867987


In [285]:
num_dumm_cat.columns

Index(['amount_tsh', 'gps_height', 'longitude', 'latitude', 'region_code',
       'district_code', 'population', 'construction_year', 'funder',
       'installer', 'wpt_name', 'basin', 'subvillage', 'region', 'lga', 'ward',
       'public_meeting', 'recorded_by', 'scheme_management', 'permit',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'status_group', 'functional', 'non functional',
       'basin_func', 'basin_nonfunc', 'extraction_type_class_func',
       'extraction_type_class_nonfunc', 'region_func', 'region_nonfunc',
       'scheme_management_func', 'scheme_management_nonfunc',
       'management_group_func', 'management_group_nonfunc',
       'payment_type_func', 'payment_type_nonfunc', 'quality_group_func',
      

In [286]:
# Build the lists of encoded feature columns with list comprehensions
# Note: 'extraction_type_class' is listed twice, so its encoded column also appears twice below

column_list_to_target_encode = ['basin','extraction_type_class','region','scheme_management','management_group','extraction_type_class' ,'payment_type', 'quality_group', 'quantity_group',
                               'source_class', 'waterpoint_type_group']

#Suffixes of the target-encoded columns
s_func = '_func'
s_nonfunc = '_nonfunc'
func_list = [x + s_func for x in column_list_to_target_encode]
nonfunc_list = [x + s_nonfunc for x in column_list_to_target_encode]
print(func_list, nonfunc_list)

['basin_func', 'extraction_type_class_func', 'region_func', 'scheme_management_func', 'management_group_func', 'extraction_type_class_func', 'payment_type_func', 'quality_group_func', 'quantity_group_func', 'source_class_func', 'waterpoint_type_group_func'] ['basin_nonfunc', 'extraction_type_class_nonfunc', 'region_nonfunc', 'scheme_management_nonfunc', 'management_group_nonfunc', 'extraction_type_class_nonfunc', 'payment_type_nonfunc', 'quality_group_nonfunc', 'quantity_group_nonfunc', 'source_class_nonfunc', 'waterpoint_type_group_nonfunc']


In [287]:
#Define the feature matrices and the targets
X_func = num_dumm_cat.loc[:,func_list]
X_nonfunc = num_dumm_cat.loc[:,nonfunc_list]
y_func = num_dumm_cat.loc[:, 'functional']
y_nonfunc = num_dumm_cat.loc[:, 'non functional']


In [288]:
#Check shapes
X_func.shape, y_func.shape

((47520, 11), (47520,))

In [289]:
# C is the inverse regularization strength (W03); max_iter is raised so that the lbfgs fit converges
m_f = LogisticRegression(C=1e5, solver = 'lbfgs', max_iter = 1000)
m_nf = LogisticRegression(C=1e5, solver = 'lbfgs', max_iter = 1000)

m_f.fit(X_func, y_func) # find the best model parameters for functional pumps
m_nf.fit(X_nonfunc, y_nonfunc) # find the best model parameters for non functional pumps


LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=1000,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

In [294]:
print('Score for functional pump prediction is', m_f.score(X_func,y_func)) # accuracy (share of correctly classified points) on the training data
print('Score for nonfunctional pump prediction is', m_nf.score(X_nonfunc,y_nonfunc))

Score for functional pump prediction is 0.7098905723905724
Score for nonfunctional pump prediction is 0.7099116161616161
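
The two scores are almost identical, which is probably no coincidence: `functional` and `non functional` are complementary dummy columns, and every `_nonfunc` feature is one minus its `_func` counterpart (for example 0.677281 and 0.322719 in the table above). The sketch below illustrates the effect on synthetic data; the random features and labels are assumptions and not the pump data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed): features in [0, 1] and a noisy binary target
X_func = rng.random((500, 3))
logits = 4.0 * (X_func.sum(axis=1) - 1.5)
y_func = (rng.random(500) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Mirror both the features and the target, as in the notebook
X_nonfunc = 1.0 - X_func
y_nonfunc = 1 - y_func

m_f = LogisticRegression(C=1e5, solver="lbfgs", max_iter=1000).fit(X_func, y_func)
m_nf = LogisticRegression(C=1e5, solver="lbfgs", max_iter=1000).fit(X_nonfunc, y_nonfunc)

# A prediction of "functional" by m_f should match "not non functional" by m_nf
agreement = np.mean(m_f.predict(X_func) == 1 - m_nf.predict(X_nonfunc))
print(f"share of identical decisions: {agreement:.3f}")
```

If the agreement is close to 1, the second model adds no information, and a single binary model would be enough.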


In [297]:
import numpy as np

#5-fold cross-validation accuracy for both models
accuracy_func = cross_val_score(m_f, X_func, y_func, cv=5, scoring='accuracy')
accuracy_nonfunc = cross_val_score(m_nf, X_nonfunc, y_nonfunc, cv=5, scoring='accuracy')

mean_func = np.mean(accuracy_func)
std_func = np.std(accuracy_func)
mean_nf = np.mean(accuracy_nonfunc)
std_nf = np.std(accuracy_nonfunc)

print("Mean cross-validation score for functional-pumps:", mean_func)
print("St.dev of cross-validation score for functional-pumps:", std_func)
print("Mean cross-validation score for nonfunctional-pumps:", mean_nf)
print("St.dev of cross-validation score for nonfunctional-pumps:", std_nf)

Mean cross-validation score for functional-pumps: 0.709280205249682
St.dev of cross-validation score for functional-pumps: 0.003130380761529579
Mean cross-validation score for nonfunctional-pumps: 0.7092591614786383
St.dev of cross-validation score for nonfunctional-pumps: 0.0031454804753731085
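
One caveat about these cross-validation scores: the encoded features were computed from the target on the whole training set, so every validation fold has already contributed to its own features. A hedged sketch of a fold-wise variant follows, where the category means are learned on the training part of each fold only. The toy data, the single `basin` feature, and the fallback for unseen categories are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
n = 600

# Toy data (assumed): one categorical feature with different functional rates
toy = pd.DataFrame({"basin": rng.choice(["a", "b", "c"], size=n)})
rate = toy["basin"].map({"a": 0.8, "b": 0.5, "c": 0.2})
toy["functional"] = (rng.random(n) < rate).astype(int)

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in cv.split(toy[["basin"]], toy["functional"]):
    train, valid = toy.iloc[train_idx], toy.iloc[valid_idx]

    # Learn the encoding on the training part of the fold only
    means = train.groupby("basin")["functional"].mean()
    fallback = train["functional"].mean()
    X_train = train["basin"].map(means).to_frame("basin_func")
    X_valid = valid["basin"].map(means).fillna(fallback).to_frame("basin_func")

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train["functional"])
    scores.append(model.score(X_valid, valid["functional"]))

scores = np.array(scores)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only a handful of categories and tens of thousands of rows, the difference from the scores above may well be small, but the fold-wise version gives the more honest estimate.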

