<a href="https://colab.research.google.com/github/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/blob/master/Sub_4_CTolbert_DS1_Unit_2_Project_Week.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Sprint Challenge: Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

This predictive modeling challenge comes from DrivenData, an organization that helps non-profits by hosting data science competitions for social impact. The competition has open licensing: "The data is available for use outside of DrivenData." We are reusing the data on Kaggle's InClass platform so we can run a weeklong challenge just for our Lambda School DS1 cohort.

The data comes from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water. In their own words:

Taarifa is an open source platform for the crowd sourced reporting and triaging of infrastructure related issues. Think of it as a bug tracker for the real world which helps to engage citizens with their local government. We are currently working on an Innovation Project in Tanzania, with various partners.


**Review: fast.ai/2017/11/13/validation-sets**

In [0]:
# Import all packages

import numpy as np
np.set_printoptions(threshold=np.inf)
np.set_printoptions(suppress=True)

import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
import matplotlib.pyplot as plt
import seaborn as sns

# Encoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn import preprocessing

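# For scaling and imputation
# (Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it)
from sklearn.preprocessing import scale, StandardScaler
from sklearn.impute import SimpleImputer

# Models, model selection, and metrics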
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, cross_val_predict, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


# For ridge regression
from sklearn.linear_model import Ridge

In [0]:
url_testf = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/test_features.csv'
url_trainf = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/train_features.csv'
url_trainl = 'https://raw.githubusercontent.com/hughjafro/DS1-Unit-2-Sprint-5---Project-Week/master/train_labels.csv'
df_trainf = pd.read_csv(url_trainf)
df_labels = pd.read_csv(url_trainl)
df_test = pd.read_csv(url_testf)

####Check the initial datasets

In [18]:
print(df_trainf.shape)
df_trainf.head()

(59400, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [19]:
# This will be our target - status_group
print(df_labels.shape)
df_labels.head()

(59400, 2)


Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


In [20]:
print(df_test.shape)
df_test.head()

(14358, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


####ID the target

In [0]:
#Define target feature
target1 = df_labels.status_group

In [22]:
# Check target values
target1.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [23]:
# Percent functional
print('Percent functional (baseline accuracy): ', 32259 / (32259+22824+4317))

# Percent non-functional
print('\nPercent non-functional: ', 22824 / (32259+22824+4317))

# Percent need repair
print('\nPercent need repair: ', 4317 / (32259+22824+4317))

Percent functional (baseline accuracy):  0.543080808080808

Percent non-functional:  0.3842424242424242

Percent need repair:  0.07267676767676767


In [24]:
# Check to make sure this is the right method
target1.value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

####Majority Class

In [0]:
majority_class = df_labels['status_group'].mode()[0]

y_pred = np.full(shape=df_labels['status_group'].shape, fill_value=majority_class)

In [26]:
accuracy_score(df_labels['status_group'], y_pred)

0.543080808080808

###Initial Classification Report

In [27]:
print(classification_report(df_labels['status_group'], y_pred) )



                         precision    recall  f1-score   support

             functional       0.54      1.00      0.70     32259
functional needs repair       0.00      0.00      0.00      4317
         non functional       0.00      0.00      0.00     22824

              micro avg       0.54      0.54      0.54     59400
              macro avg       0.18      0.33      0.23     59400
           weighted avg       0.29      0.54      0.38     59400
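
The `macro avg` and `weighted avg` rows can be reproduced by hand from the class supports in the report. The sketch below assumes a predictor that labels every row `functional`, as in the majority-class baseline; the helper names (`f1`, `macro`, `weighted`) are illustrative and not part of the notebook.

```python
# Class supports taken from the report above
support = {
    'functional': 32259,
    'functional needs repair': 4317,
    'non functional': 22824,
}
total = sum(support.values())

# A majority-class predictor labels every row 'functional', so only that
# class has non-zero precision, recall, and F1.
precision = {name: 0.0 for name in support}
recall = {name: 0.0 for name in support}
precision['functional'] = support['functional'] / total
recall['functional'] = 1.0

def f1(p, r):
    # Harmonic mean of precision and recall; taken as 0 when both are 0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

f1_scores = {name: f1(precision[name], recall[name]) for name in support}

def macro(metric):
    # Unweighted mean over the three classes
    return sum(metric.values()) / len(metric)

def weighted(metric):
    # Mean over classes, weighted by each class's support
    return sum(metric[name] * support[name] for name in support) / total

for label, metric in [('precision', precision), ('recall', recall), ('f1', f1_scores)]:
    print(f'{label:9s} macro={macro(metric):.2f} weighted={weighted(metric):.2f}')
```

The macro average treats the rare `functional needs repair` class the same as the other two, which is why it falls so far below the weighted average for this baseline.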



####Combine test/train dataframes into one dataframe so they can be cleaned at one time

In [0]:
# # Convert target to numerical values
# df_labels = pd.get_dummies(df_labels)

In [43]:
# Remember to use concat to add to the bottom
df_all = pd.concat([df_trainf, df_test])
print(df_all.shape)
df_all.head()

(73758, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,Roman,False,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,,GeoData Consultants Ltd,Other,,True,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,Nyumba ya mungu pipe scheme,True,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,,True,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,,,True,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [34]:
# Check NaN's
df_all.isna().sum()

id                           0
amount_tsh                   0
date_recorded                0
funder                    4418
gps_height                   0
installer                 4443
longitude                    0
latitude                     0
wpt_name                     0
num_private                  0
basin                        0
subvillage                 465
region                       0
region_code                  0
district_code                0
lga                          0
ward                         0
population                   0
public_meeting            4119
recorded_by                  0
scheme_management         4816
scheme_name              35005
permit                    3719
construction_year            0
extraction_type              0
extraction_type_group        0
extraction_type_class        0
management                   0
management_group             0
payment                      0
payment_type                 0
water_quality                0
quality_

In [0]:
# Drop scheme_name: it is missing in 35,005 of the 73,758 rows
df_all=df_all.drop(columns='scheme_name')

In [37]:
df_all.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'permit', 'construction_year', 'extraction_type',
       'extraction_type_group', 'extraction_type_class', 'management',
       'management_group', 'payment', 'payment_type', 'water_quality',
       'quality_group', 'quantity', 'quantity_group', 'source', 'source_type',
       'source_class', 'waterpoint_type', 'waterpoint_type_group'],
      dtype='object')

###Fill missing values with the mode of that column

The mode is a simple fill value for categorical columns, where a mean or median is not defined.

In [0]:
cols = ['funder', 'installer', 'subvillage', 'public_meeting',
        'scheme_management', 'permit']

# mode() can return several rows when values tie; iloc[0] takes the first mode per column
df_all[cols] = df_all[cols].fillna(df_all[cols].mode().iloc[0])
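
As a small illustration of the mode fill, here is a sketch on a toy frame. The values are made up for the example and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two categorical columns (hypothetical values)
toy = pd.DataFrame({
    'funder': ['Unicef', np.nan, 'Unicef', 'Roman'],
    'installer': ['DWE', 'DWE', np.nan, 'Roman'],
    'population': [109, 280, 250, 58],
})
toy_cols = ['funder', 'installer']

# One fill value per column: the first mode of each
fill_values = toy[toy_cols].mode().iloc[0]
print(fill_values)

# fillna with a Series matches fill values to columns by name
toy[toy_cols] = toy[toy_cols].fillna(fill_values)
print(toy)
```

Restricting `mode()` to the columns being filled keeps the fill values tied to those columns only.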

In [42]:
# Check for nans
df_all.isna().sum()

id                       0
amount_tsh               0
date_recorded            0
funder                   0
gps_height               0
installer                0
longitude                0
latitude                 0
wpt_name                 0
num_private              0
basin                    0
subvillage               0
region                   0
region_code              0
district_code            0
lga                      0
ward                     0
population               0
public_meeting           0
recorded_by              0
scheme_management        0
permit                   0
construction_year        0
extraction_type          0
extraction_type_group    0
extraction_type_class    0
management               0
management_group         0
payment                  0
payment_type             0
water_quality            0
quality_group            0
quantity                 0
quantity_group           0
source                   0
source_type              0
source_class             0
w

###Dummy encode data

In [0]:
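# Encoder function
# Note: despite the name, this assigns integer codes with LabelEncoder;
# it does not create one-hot (dummy) columns.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            # Columns that fail here keep their original, unencoded values
            print('Error encoding ' + feature)
    return df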
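
To make the difference between label encoding and dummy (one-hot) encoding concrete, here is a minimal sketch on a single toy column. The values are illustrative, and `pd.get_dummies` is shown only for contrast; the notebook itself uses `LabelEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'quantity': ['enough', 'dry', 'seasonal', 'enough']})

# Label encoding: one integer per category, assigned in sorted order
codes = LabelEncoder().fit_transform(toy['quantity'])
print(codes)

# Dummy (one-hot) encoding: one indicator column per category
dummies = pd.get_dummies(toy['quantity'], prefix='quantity')
print(dummies)
```

Integer codes imply an ordering between categories, which a linear model such as logistic regression will treat as meaningful. Indicator columns avoid that, at the cost of many more columns for high-cardinality features.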

#Separate into train and test again

In [0]:
# Target is still target1
X = df_all[:-14358] # drop last 14358 rows
test = df_all[-14358:] # keep the last 14358 rows
y = target1
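
Slicing by a hard-coded row count works only while the test set has exactly 14358 rows. One alternative, sketched here on toy frames, is to tag each part when concatenating and split on that tag afterwards. The frame names and values below are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the train and test feature frames (hypothetical values)
train_toy = pd.DataFrame({'id': [1, 2, 3], 'scheme_name': ['Roman', None, None]})
test_toy = pd.DataFrame({'id': [4, 5], 'scheme_name': [None, 'P']})

# keys= adds an outer index level recording which frame each row came from
combined = pd.concat([train_toy, test_toy], keys=['train', 'test'])

# Clean once on the combined frame
combined = combined.drop(columns='scheme_name')

# Select each part by its key; .copy() gives independent frames,
# so later column assignments do not act on a slice of `combined`
X_part = combined.loc['train'].copy()
test_part = combined.loc['test'].copy()
print(X_part.shape, test_part.shape)
```

Because the split no longer depends on row counts, it stays correct if rows are dropped during cleaning.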

In [56]:
print(X.shape)
X.head()

(59400, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,47,Roman,1390,Roman,34.938093,-9.856322,37399,0,1,Mnyusi B,3,11,5,51,1426,109,2,0,VWC,Roman,0,1999,3,1,0,7,4,2,0,6,2,1,1,8,6,0,1,1
1,8776,0.0,309,Grumeti,1399,GRUMETI,34.698766,-2.147466,37195,0,4,Nyamara,9,20,2,103,1576,280,0,0,Other,,1,2010,3,1,0,11,4,0,2,6,2,2,2,5,3,1,1,1
2,34310,25.0,300,Lottery Club,686,World vision,37.460664,-3.821329,14572,0,5,Majengo,8,21,4,108,1624,250,2,0,VWC,Nyumba ya mungu pipe scheme,1,2009,3,1,0,7,4,4,5,6,2,1,1,0,1,1,2,1
3,67743,0.0,272,Unicef,263,UNICEF,38.486161,-11.155298,37285,0,7,Mahakamani,12,90,63,87,1571,58,2,0,VWC,,1,1986,14,10,5,7,4,0,2,6,2,0,0,3,0,0,2,1
4,19728,0.0,104,Action In A,0,Artisan,31.130847,-1.825359,35529,0,4,Kyanyamisa,4,18,1,26,1687,0,2,0,,,1,0,3,1,0,1,1,0,2,6,2,3,3,5,3,1,1,1


In [54]:
df_ce = dummyEncode(X)
df_ce.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Error encoding funder
Error encoding installer
Error encoding subvillage
Error encoding scheme_management
Error encoding scheme_name


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,47,Roman,1390,Roman,34.938093,-9.856322,37399,0,1,Mnyusi B,3,11,5,51,1426,109,2,0,VWC,Roman,0,1999,3,1,0,7,4,2,0,6,2,1,1,8,6,0,1,1
1,8776,0.0,309,Grumeti,1399,GRUMETI,34.698766,-2.147466,37195,0,4,Nyamara,9,20,2,103,1576,280,0,0,Other,,1,2010,3,1,0,11,4,0,2,6,2,2,2,5,3,1,1,1
2,34310,25.0,300,Lottery Club,686,World vision,37.460664,-3.821329,14572,0,5,Majengo,8,21,4,108,1624,250,2,0,VWC,Nyumba ya mungu pipe scheme,1,2009,3,1,0,7,4,4,5,6,2,1,1,0,1,1,2,1
3,67743,0.0,272,Unicef,263,UNICEF,38.486161,-11.155298,37285,0,7,Mahakamani,12,90,63,87,1571,58,2,0,VWC,,1,1986,14,10,5,7,4,0,2,6,2,0,0,3,0,0,2,1
4,19728,0.0,104,Action In A,0,Artisan,31.130847,-1.825359,35529,0,4,Kyanyamisa,4,18,1,26,1687,0,2,0,,,1,0,3,1,0,1,1,0,2,6,2,3,3,5,3,1,1,1


In [59]:
test_ce = dummyEncode(test)

Error encoding funder
Error encoding installer
Error encoding subvillage
Error encoding scheme_management
Error encoding scheme_name


In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_ce, y, test_size=0.25, random_state=42, shuffle=True)
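
Since `functional needs repair` makes up only about 7% of the labels, a stratified split may be worth considering so that both parts keep the same class balance. This is a sketch on synthetic labels, not the notebook's data; `stratify` is a standard `train_test_split` parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the class balance reported earlier
y_toy = np.array(['functional'] * 108
                 + ['non functional'] * 76
                 + ['functional needs repair'] * 16)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

# With stratify, each class keeps its share in the validation part
labels_val, counts_val = np.unique(y_val, return_counts=True)
print(dict(zip(labels_val, counts_val)))
```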

In [61]:
# Logistic Regression again
log_reg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

ValueError: ignored
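
The traceback is not shown, so the cause is an assumption: the encoder reported errors for five columns, which would leave text values in `X_train`, and scikit-learn estimators raise `ValueError` on non-numeric input. The sketch below reproduces that situation on a toy frame and shows one way to check for leftover non-numeric columns before fitting.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame where one column is still text (hypothetical values)
toy = pd.DataFrame({
    'gps_height': [1390, 1399, 686, 263],
    'funder': ['Roman', 'Grumeti', 'Unicef', 'Roman'],
})
toy_labels = ['functional', 'functional', 'non functional', 'non functional']

# List the columns that are not numeric yet
leftover = toy.select_dtypes(exclude='number').columns.tolist()
print('Non-numeric columns:', leftover)

# Fitting on the raw frame fails because 'funder' cannot be converted to float
try:
    LogisticRegression().fit(toy, toy_labels)
except ValueError as err:
    print('ValueError:', err)

# Fitting succeeds once the text column is dropped (or properly encoded)
toy_model = LogisticRegression(max_iter=1000).fit(toy.drop(columns=leftover), toy_labels)
print(toy_model.classes_)
```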

In [0]:
# y_pred.shape, y_test.shape

In [0]:
accuracy_score(y_test, y_pred)

##OLD CODE SNIPPETS

In [0]:
# Start with work on dataset that has no missing values
# Dropped (7): funder, installer, subvillage, public_meeting,
#    scheme_management, scheme_name, permit

df_clean = df.dropna(axis=1)

df_clean = df_clean.drop(columns='status_group')

print(df_clean.shape)
df_clean.head()

(59400, 33)


Unnamed: 0,id,amount_tsh,date_recorded,gps_height,longitude,latitude,wpt_name,num_private,basin,region,region_code,district_code,lga,ward,population,recorded_by,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,1390,34.938093,-9.856322,none,0,Lake Nyasa,Iringa,11,5,Ludewa,Mundindi,109,GeoData Consultants Ltd,1999,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,1399,34.698766,-2.147466,Zahanati,0,Lake Victoria,Mara,20,2,Serengeti,Natta,280,GeoData Consultants Ltd,2010,gravity,gravity,gravity,wug,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,686,37.460664,-3.821329,Kwa Mahundi,0,Pangani,Manyara,21,4,Simanjiro,Ngorika,250,GeoData Consultants Ltd,2009,gravity,gravity,gravity,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,263,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,Mtwara,90,63,Nanyumbu,Nanyumbu,58,GeoData Consultants Ltd,1986,submersible,submersible,submersible,vwc,user-group,never pay,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,0,31.130847,-1.825359,Shuleni,0,Lake Victoria,Kagera,18,1,Karagwe,Nyakasimbi,0,GeoData Consultants Ltd,0,gravity,gravity,gravity,other,other,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [0]:
df_num.dtypes

id                     int64
amount_tsh           float64
gps_height             int64
longitude            float64
latitude             float64
num_private            int64
region_code            int64
district_code          int64
population             int64
construction_year      int64
dtype: object

In [0]:
df_num.shape

(59400, 10)

In [0]:
target.shape

(59400,)

In [0]:
# Convert target to numerical values
# target = pd.get_dummies(target1)

In [0]:
target.head()

Unnamed: 0,functional,functional needs repair,non functional
0,1,0,0
1,1,0,0
2,1,0,0
3,0,0,1
4,1,0,0


###Grid Search

In [0]:
# Polynomial Regression
def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(**kwargs))

In [0]:
y = target
X = df_num

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [0]:
y_train.head()

Unnamed: 0,functional,functional needs repair,non functional
24947,0,0,1
22630,1,0,0
13789,1,0,0
15697,1,0,0
22613,0,0,1


In [0]:
param_grid = {
    'polynomialfeatures__degree': [0,1,2,3]
}

gridsearch = GridSearchCV(PolynomialRegression(), param_grid=param_grid,
                          scoring='neg_mean_absolute_error', cv=3,
                          return_train_score=True, verbose=10)
gridsearch.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] polynomialfeatures__degree=0 ....................................
[CV]  polynomialfeatures__degree=0, score=-0.368583414390804, total=   0.0s
[CV] polynomialfeatures__degree=0 ....................................
[CV]  polynomialfeatures__degree=0, score=-0.36816004262603447, total=   0.0s
[CV] polynomialfeatures__degree=0 ....................................
[CV]  polynomialfeatures__degree=0, score=-0.36816695348547995, total=   0.0s
[CV] polynomialfeatures__degree=1 ....................................
[CV]  polynomialfeatures__degree=1, score=-0.3600804549089478, total=   0.0s
[CV] polynomialfeatures__degree=1 ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.1s remaining:    0.0s


[CV]  polynomialfeatures__degree=1, score=-0.3601097131032711, total=   0.1s
[CV] polynomialfeatures__degree=1 ....................................
[CV]  polynomialfeatures__degree=1, score=-0.36020322064891047, total=   0.1s
[CV] polynomialfeatures__degree=2 ....................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    0.3s remaining:    0.0s


[CV]  polynomialfeatures__degree=2, score=-0.3384102461008074, total=   0.2s
[CV] polynomialfeatures__degree=2 ....................................
[CV]  polynomialfeatures__degree=2, score=-0.33861601937060865, total=   0.2s


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    0.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    0.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    1.0s remaining:    0.0s


[CV] polynomialfeatures__degree=2 ....................................
[CV]  polynomialfeatures__degree=2, score=-0.33848950922009857, total=   0.2s
[CV] polynomialfeatures__degree=3 ....................................
[CV]  polynomialfeatures__degree=3, score=-0.3695012900990757, total=   1.1s
[CV] polynomialfeatures__degree=3 ....................................
[CV]  polynomialfeatures__degree=3, score=-0.3526197327239726, total=   1.2s
[CV] polynomialfeatures__degree=3 ....................................
[CV]  polynomialfeatures__degree=3, score=-0.33251827087210256, total=   1.2s


[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    5.0s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('polynomialfeatures', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'polynomialfeatures__degree': [0, 1, 2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_absolute_error', verbose=10)

In [0]:
pd.DataFrame(gridsearch.cv_results_).sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_polynomialfeatures__degree,params,rank_test_score,split0_test_score,split0_train_score,split1_test_score,split1_train_score,split2_test_score,split2_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
2,0.16685,0.017993,-0.338505,-0.337444,2,{'polynomialfeatures__degree': 2},1,-0.33841,-0.338207,-0.338616,-0.336983,-0.33849,-0.337142,0.031809,0.000979,8.5e-05,0.000543
3,1.038404,0.106784,-0.351546,-0.326756,3,{'polynomialfeatures__degree': 3},2,-0.369501,-0.327632,-0.35262,-0.326178,-0.332518,-0.326459,0.029969,0.00644,0.015117,0.00063
1,0.041044,0.011676,-0.360131,-0.359992,1,{'polynomialfeatures__degree': 1},3,-0.36008,-0.360374,-0.36011,-0.359748,-0.360203,-0.359854,0.006448,0.001359,5.2e-05,0.000274
0,0.019737,0.002963,-0.368303,-0.368297,0,{'polynomialfeatures__degree': 0},4,-0.368583,-0.367736,-0.36816,-0.368588,-0.368167,-0.368566,0.013328,0.000119,0.000198,0.000396


###Random Forest

In [0]:
model = RandomForestRegressor(n_estimators=100, max_depth=20)

scores = cross_validate(model, X_train, y_train,
                        scoring='neg_mean_absolute_error',
                        cv=3, return_train_score=True,
                       return_estimator=True)

pd.DataFrame(scores)

Unnamed: 0,estimator,fit_time,score_time,test_score,train_score
0,"(DecisionTreeRegressor(criterion='mse', max_de...",15.014101,0.30095,-0.258307,-0.130173
1,"(DecisionTreeRegressor(criterion='mse', max_de...",15.156601,0.301065,-0.257697,-0.129451
2,"(DecisionTreeRegressor(criterion='mse', max_de...",15.167544,0.29376,-0.259428,-0.130638


In [0]:
scores['test_score'].mean()

-0.2584773963894908

### Train Test Split  / Logistic Regression

In [0]:
y = target1
X = df_num

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [0]:
X_train.shape, y_train.shape

((44550, 10), (44550,))

In [0]:
log_reg = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
accuracy_score(y_test, y_pred)

# log_reg.score(X,y)

print('Train score: ', log_reg.score(X_train, y_train))
print('Test score: ', log_reg.score(X_test, y_test))



Train score:  0.5509315375982042
Test score:  0.5540067340067341


In [0]:
print(classification_report(y_test, y_pred))
print('accuracy', accuracy_score(y_test, y_pred))

                         precision    recall  f1-score   support

             functional       0.56      0.94      0.70      8098
functional needs repair       0.00      0.00      0.00      1074
         non functional       0.49      0.11      0.19      5678

              micro avg       0.55      0.55      0.55     14850
              macro avg       0.35      0.35      0.30     14850
           weighted avg       0.49      0.55      0.45     14850

accuracy 0.5540067340067341




In [0]:
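# Note: despite the name, this assigns integer codes with LabelEncoder;
# it does not create one-hot (dummy) columns.
def dummyEncode(df):
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            # Columns that fail here keep their original, unencoded values
            print('Error encoding ' + feature)
    return df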

In [0]:
df_ce =dummyEncode(df_clean)
df_ce.head()

Unnamed: 0,id,amount_tsh,date_recorded,gps_height,longitude,latitude,wpt_name,num_private,basin,region,region_code,district_code,lga,ward,population,recorded_by,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,47,1390,34.938093,-9.856322,37399,0,1,3,11,5,51,1426,109,0,1999,3,1,0,7,4,2,0,6,2,1,1,8,6,0,1,1
1,8776,0.0,309,1399,34.698766,-2.147466,37195,0,4,9,20,2,103,1576,280,0,2010,3,1,0,11,4,0,2,6,2,2,2,5,3,1,1,1
2,34310,25.0,300,686,37.460664,-3.821329,14572,0,5,8,21,4,108,1624,250,0,2009,3,1,0,7,4,4,5,6,2,1,1,0,1,1,2,1
3,67743,0.0,272,263,38.486161,-11.155298,37285,0,7,12,90,63,87,1571,58,0,1986,14,10,5,7,4,0,2,6,2,0,0,3,0,0,2,1
4,19728,0.0,104,0,31.130847,-1.825359,35529,0,4,4,18,1,26,1687,0,0,0,3,1,0,1,1,0,2,6,2,3,3,5,3,1,1,1


####Evaluate the test features file

In [0]:
print(df_test.shape)
df_test.head()

(14358, 40)


Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,255,Dmdd,1996,DMDD,35.290799,-4.059696,633,0,0,Magoma,8,21,3,62,16,321,2,0,Parastatal,,2,2012,9,6,3,3,2,0,2,6,2,3,3,5,3,1,6,5
1,51630,0.0,255,Government Of Tanzania,1569,DWE,36.656709,-3.309214,1727,0,5,Kimnyak,0,2,2,0,642,300,2,0,VWC,TPRI pipe line,2,2000,3,1,0,7,4,0,2,6,2,2,2,8,6,0,1,1
2,17168,0.0,252,,1567,,34.767863,-5.004344,9483,0,0,Msatu,18,13,2,108,1659,500,2,0,VWC,P,0,2010,9,6,3,7,4,0,2,6,2,2,2,5,3,1,6,5
3,45559,0.0,242,Finn Water,267,FINN WATER,38.058046,-9.418672,5467,0,7,Kipindimbi,7,80,43,48,1178,250,0,0,VWC,,2,1987,9,6,3,7,4,6,6,6,2,0,0,7,5,0,6,5
4,49871,500.0,306,Bruder,1260,BRUDER,35.006123,-10.950412,5573,0,7,Losonga,16,10,3,60,1061,60,0,0,Water Board,BRUDER,2,2000,3,1,0,9,4,3,1,6,2,1,1,8,6,0,1,1


In [0]:
df_clean = dummyEncode(df_test.drop(columns=['funder', 'installer', 'subvillage',
                                    'scheme_management', 'scheme_name']))

In [0]:
df_clean.shape

(14358, 35)

####New Test Train Split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df_ce, y, test_size=0.25, random_state=42, shuffle=True)

In [0]:
log_reg = LogisticRegression(solver='lbfgs', multi_class='multinomial')
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)



In [0]:
accuracy_score(y_test, y_pred)

0.5501010101010101

In [0]:
init_baseline = pd.DataFrame({'id':df_test_num.id, 'status_group': test_pred}, index=None)

In [0]:
init_baseline

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional
5,52449,functional
6,24806,functional
7,28965,functional
8,36301,non functional
9,54122,functional


In [0]:
from google.colab import files

init_baseline.to_csv('init_baseline_CT.csv', index=False)
files.download('init_baseline_CT.csv')



NameError: ignored
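
For reference, a submission frame needs only the test ids and one predicted label per id. The sketch below uses made-up predictions and writes to a temporary directory instead of calling Colab's download helper; `test_ids`, `toy_pred`, and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids and predictions standing in for the test set and model output
test_ids = pd.Series([50785, 51630, 17168], name='id')
toy_pred = ['functional', 'functional', 'non functional']

submission = pd.DataFrame({'id': test_ids, 'status_group': toy_pred})

# index=False keeps the row index out of the file
path = os.path.join(tempfile.mkdtemp(), 'submission_sketch.csv')
submission.to_csv(path, index=False)

# Read the file back to confirm its layout
check = pd.read_csv(path)
print(check)
```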

In [0]:
init_baseline.shape

(14358, 2)

In [0]:
init_baseline.head()

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional


In [0]:

model = RandomForestRegressor(n_estimators=100, max_depth=20)

scores = cross_validate(model, X_train, y_train,
                        scoring='neg_mean_absolute_error',
                        cv=3, return_train_score=True,
                        return_estimator=True)




ValueError: ignored
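
The traceback is not shown, but a likely cause (an assumption) is that `y_train` holds the string labels from `target1` at this point, and a regressor cannot fit string targets. A classifier scored on accuracy handles them directly. The sketch below runs on synthetic data rather than the pump features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic three-class data standing in for the encoded pump features
X_syn, y_int = make_classification(n_samples=300, n_features=8, n_informative=5,
                                   n_classes=3, random_state=42)

# Map the integer classes to string labels like those in status_group
y_syn = pd.Series(y_int).map({0: 'functional',
                              1: 'functional needs repair',
                              2: 'non functional'})

clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42)
clf_scores = cross_validate(clf, X_syn, y_syn, scoring='accuracy',
                            cv=3, return_train_score=True)
print(pd.DataFrame(clf_scores)[['test_score', 'train_score']])
```

A large gap between `train_score` and `test_score` would point to overfitting, as the earlier regressor results also suggest.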