Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 1

## Assignment
- [X] Do train/validate/test split with the Tanzania Waterpumps data.
- [X] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what other columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What other columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Try other [scikit-learn scalers](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
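
As a rough sketch of the binning idea, the snippet below uses `pd.qcut` to split a numeric feature into quartiles and then computes the share of functional pumps per bin. The `toy` DataFrame and its values are made up for illustration; with the real data you would use your training set instead.

```python
import pandas as pd

# Hypothetical toy data standing in for the training set
toy = pd.DataFrame({
    'gps_height': [0, 120, 350, 800, 1200, 1500, 1800, 2100],
    'status_group': ['non functional', 'non functional', 'non functional',
                     'functional', 'functional', 'non functional',
                     'functional', 'functional'],
})

# Represent the target as a number, 0 or 1
toy['functional'] = (toy['status_group'] == 'functional').astype(int)

# Split the numeric feature into 4 bins with equal numbers of rows
toy['gps_height_bin'] = pd.qcut(toy['gps_height'], q=4)

# Share of functional pumps within each bin
rate_by_bin = toy.groupby('gps_height_bin', observed=True)['functional'].mean()
print(rate_by_bin)
```

The binned column can then be treated like any other categorical feature, for example as the `x` argument of a Seaborn categorical plot with `functional` as `y`.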

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this problem, you may want to use the parameter `logistic=True`.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
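
If you need the same treatment for several columns or several splits, one option is to wrap the idea in a small helper. The function below is a sketch, not part of any library: `reduce_cardinality`, its parameters, and the toy Series are all hypothetical names chosen for illustration. It learns the most frequent values from the training column only and applies them to the other splits.

```python
import pandas as pd

def reduce_cardinality(train_col, other_cols, top_n=10, other_label='OTHER'):
    """Keep the top_n most frequent training values; replace the rest.

    The frequent values are learned from train_col only, so validation
    and test columns are mapped with the training set's categories.
    Missing values are not in the top list, so they also become other_label.
    """
    top = train_col.value_counts().index[:top_n]
    reduced_train = train_col.where(train_col.isin(top), other_label)
    reduced_others = [col.where(col.isin(top), other_label) for col in other_cols]
    return reduced_train, reduced_others

# Hypothetical toy columns
train_col = pd.Series(['A', 'A', 'A', 'B', 'B', 'C'])
test_col = pd.Series(['A', 'C', 'D'])

new_train, (new_test,) = reduce_cardinality(train_col, [test_col], top_n=2)
print(new_train.tolist())
print(new_test.tolist())
```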



In [1]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
import plotly.express as px
px.scatter(train, x='longitude', y='latitude', color='status_group', opacity=0.1)

In [4]:
train[['longitude', 'latitude']].describe()

| | longitude | latitude |
|---|---|---|
| count | 59400.0 | 59400.0 |
| mean | 34.077427 | -5.706033 |
| std | 6.567432 | 2.946019 |
| min | 0.0 | -11.64944 |
| 25% | 33.090347 | -8.540621 |
| 50% | 34.908743 | -5.021597 |
| 75% | 37.178387 | -3.326156 |
| max | 40.345193 | -2e-08 |


In [0]:
import numpy as np

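def wrangle(df):
    """Clean a DataFrame the same way for train, validate, and test sets."""
    X = df.copy()

    # Some latitudes are recorded as -2e-08, which is effectively zero;
    # treat them as exact zeros so they are caught below.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)

    # Zeros in these columns stand in for missing values, so use NaN instead.
    cols_with_zeros = ['longitude', 'latitude']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)

    # quantity_group repeats the information in quantity, so drop it.
    X = X.drop(columns='quantity_group')
    return X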
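
One way to look for other duplicate columns is to compare every pair of columns for exact equality. The helper below is a sketch under that assumption; `duplicate_column_pairs` and the toy DataFrame are hypothetical and only illustrate the check. It will not catch near-duplicates whose labels differ.

```python
import pandas as pd

def duplicate_column_pairs(df):
    """Return pairs of columns whose values are identical row by row."""
    pairs = []
    cols = list(df.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            # Series.equals treats NaNs in the same position as equal
            if df[first].equals(df[second]):
                pairs.append((first, second))
    return pairs

# Hypothetical toy frame with one exact duplicate
toy = pd.DataFrame({
    'quantity': ['enough', 'dry', 'enough'],
    'quantity_group': ['enough', 'dry', 'enough'],
    'source': ['river', 'spring', 'spring'],
})

print(duplicate_column_pairs(toy))
```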
  

In [0]:
# Wrangle train and test the same way (validation is split from train later)
train = wrangle(train)
test = wrangle(test)

In [17]:
px.scatter(train, x='longitude', y='latitude', color='status_group', opacity=0.1)

In [11]:
train[['longitude', 'latitude']].describe()

| | longitude | latitude |
|---|---|---|
| count | 57588.0 | 57588.0 |
| mean | 35.149669 | -5.885572 |
| std | 2.607428 | 2.809876 |
| min | 29.607122 | -11.64944 |
| 25% | 33.2851 | -8.643841 |
| 50% | 35.005943 | -5.172704 |
| 75% | 37.233712 | -3.372824 |
| max | 40.345193 | -0.998464 |


In [46]:
test = test.drop(columns='id')
test.shape


(14358, 38)

In [0]:
from sklearn.model_selection import train_test_split

train_features = train.drop(columns=['status_group','id']).columns

X_train, X_val, y_train, y_val = train_test_split(
    train[train_features], train['status_group'],
    train_size=0.8, test_size=0.2, stratify=train['status_group'])

In [48]:
X_train.shape

(47520, 38)

In [49]:
X_train.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group
58538,0.0,2011-03-28,Government Of Tanzania,780,RWE,37.427629,-5.930824,Kwa Mwinjuma,0,Wami / Ruvu,Difungo,Morogoro,5,6,Mvomero,Kibati,1,True,GeoData Consultants Ltd,VWC,Diburuma,True,1970,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,river,river/lake,surface,communal standpipe,communal standpipe
32710,500.0,2011-02-18,Government Of Tanzania,1762,Gover,35.68553,-8.074333,none,0,Rufiji,Mwanzi,Iringa,11,7,Kilolo,Ukumbi,200,True,GeoData Consultants Ltd,VWC,Mawamb,True,2008,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe
11757,200.0,2011-04-18,Cefa,1593,CEFA,35.184056,-9.231164,Kwa Piusi Mfugale,0,Rufiji,Mtwanzi,Iringa,11,4,Njombe,Lupembe,30,True,GeoData Consultants Ltd,WUA,matembwe water supply schem,True,1996,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe
20305,0.0,2011-07-10,Government Of Tanzania,0,DWE,31.123344,-1.718533,Kwa Izingirani,0,Lake Victoria,Nyakakiri,Kagera,18,1,Karagwe,Nyaishozi,0,True,GeoData Consultants Ltd,WUA,Lukare water supp,True,0,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe
34960,0.0,2011-02-23,Government Of Tanzania,2010,Commu,34.177837,-9.345069,none,0,Lake Nyasa,Ikiligano,Iringa,11,3,Makete,Lupalilo,0,False,GeoData Consultants Ltd,VWC,Lupali,False,1985,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [50]:
type(X_train['date_recorded'])

pandas.core.series.Series

In [0]:
def inspect(date_string):
    # date_recorded looks like 'YYYY-MM-DD', so the year is the first field
    return int(date_string.split('-')[0])


X_train['time_inspect'] = X_train.date_recorded.apply(inspect)
X_val['time_inspect'] = X_val.date_recorded.apply(inspect)
test['time_inspect'] = test.date_recorded.apply(inspect)
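
An alternative to splitting the string by hand is to parse the column with `pd.to_datetime` and read the year from the `.dt` accessor, which also gives access to the month and day. This is a sketch on a made-up Series; `year_recorded` and `month_recorded` are hypothetical names, not columns in the dataset.

```python
import pandas as pd

# Hypothetical sample of date_recorded values
dates = pd.Series(['2011-03-28', '2013-02-04', '2011-07-10'],
                  name='date_recorded')

# Parse the strings into datetimes, then pull out the parts you need
parsed = pd.to_datetime(dates, format='%Y-%m-%d')
year_recorded = parsed.dt.year
month_recorded = parsed.dt.month

print(year_recorded.tolist())
print(month_recorded.tolist())
```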


In [53]:
test.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,time_inspect
0,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other,2013
1,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,2013
2,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other,2013
3,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,shallow well,shallow well,groundwater,other,other,2013
4,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,2013


In [0]:
def nans(value):
    # A construction_year of 0 is a placeholder for a missing year
    return np.nan if value == 0 else value

X_train['construction_year'] = X_train['construction_year'].apply(nans)
test['construction_year'] = test['construction_year'].apply(nans)
X_val['construction_year'] = X_val['construction_year'].apply(nans)

In [55]:
X_train['time_inspect'] = X_train.time_inspect - X_train.construction_year
X_val['time_inspect'] = X_val.time_inspect - X_val.construction_year
test['time_inspect'] = test.time_inspect - test.construction_year
X_train.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,time_inspect
58538,0.0,2011-03-28,Government Of Tanzania,780,RWE,37.427629,-5.930824,Kwa Mwinjuma,0,Wami / Ruvu,Difungo,Morogoro,5,6,Mvomero,Kibati,1,True,GeoData Consultants Ltd,VWC,Diburuma,True,1970.0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,41.0
32710,500.0,2011-02-18,Government Of Tanzania,1762,Gover,35.68553,-8.074333,none,0,Rufiji,Mwanzi,Iringa,11,7,Kilolo,Ukumbi,200,True,GeoData Consultants Ltd,VWC,Mawamb,True,2008.0,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe,3.0
11757,200.0,2011-04-18,Cefa,1593,CEFA,35.184056,-9.231164,Kwa Piusi Mfugale,0,Rufiji,Mtwanzi,Iringa,11,4,Njombe,Lupembe,30,True,GeoData Consultants Ltd,WUA,matembwe water supply schem,True,1996.0,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,river,river/lake,surface,communal standpipe,communal standpipe,15.0
20305,0.0,2011-07-10,Government Of Tanzania,0,DWE,31.123344,-1.718533,Kwa Izingirani,0,Lake Victoria,Nyakakiri,Kagera,18,1,Karagwe,Nyaishozi,0,True,GeoData Consultants Ltd,WUA,Lukare water supp,True,,gravity,gravity,gravity,vwc,user-group,pay annually,annually,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,
34960,0.0,2011-02-23,Government Of Tanzania,2010,Commu,34.177837,-9.345069,none,0,Lake Nyasa,Ikiligano,Iringa,11,3,Makete,Lupalilo,0,False,GeoData Consultants Ltd,VWC,Lupali,False,1985.0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,26.0
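
After engineering a feature like this, it can be worth checking for impossible values, such as a pump recorded before it was built. The snippet below is a sketch on made-up numbers, and treating negative values as missing is an assumption rather than a rule. The column name `years_in_service` is hypothetical and plays the role of `time_inspect` above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: year of inspection and year of construction
toy = pd.DataFrame({
    'year_recorded': [2011, 2013, 2004, 2011],
    'construction_year': [1970, 2010, 2008, 0],
})

# Zero stands in for an unknown construction year
toy['construction_year'] = toy['construction_year'].replace(0, np.nan)

# Years from construction to inspection (NaN when the year is unknown)
toy['years_in_service'] = toy['year_recorded'] - toy['construction_year']

# Count negative values, then treat them as missing (an assumption)
negative = toy['years_in_service'] < 0
print('Negative values:', negative.sum())
toy.loc[negative, 'years_in_service'] = np.nan
```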


In [56]:
X_val.head()

Unnamed: 0,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,source,source_type,source_class,waterpoint_type,waterpoint_type_group,time_inspect
42711,200.0,2011-03-25,Mwaya Mn,273,Communit,36.900383,-7.832688,Kwaligamba,0,Rufiji,Magengeni,Morogoro,5,3,Kilombero,Mang'ula,180,True,GeoData Consultants Ltd,WUA,Mang`ula,True,2010.0,gravity,gravity,gravity,vwc,user-group,pay monthly,monthly,soft,good,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,1.0
9554,0.0,2013-02-11,Tasaf,1209,TASAF,34.838803,-6.089647,Kwa Zolo Ngogomba,0,Rufiji,Chidamsulu A,Singida,13,3,Manyoni,Chikola,600,False,GeoData Consultants Ltd,,,False,2010.0,afridev,afridev,handpump,vwc,user-group,unknown,unknown,soft,good,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,3.0
34443,0.0,2011-03-06,Missi,1670,Missi,35.111248,-9.547133,none,0,Lake Nyasa,Kiondeni,Iringa,11,4,Njombe,Kifanya,25,True,GeoData Consultants Ltd,VWC,Matund,False,1998.0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,enough,spring,spring,groundwater,communal standpipe,communal standpipe,13.0
5926,50.0,2011-02-24,Private,128,Wachina,38.864401,-6.830304,Kwa Mzee Rashidi,102,Wami / Ruvu,Bomu,Pwani,6,2,Kibaha,Soga,20,True,GeoData Consultants Ltd,Parastatal,Upper ruvu,True,2010.0,gravity,gravity,gravity,unknown,unknown,pay per bucket,per bucket,soft,good,insufficient,river,river/lake,surface,communal standpipe,communal standpipe,1.0
11905,0.0,2013-01-30,Fini Water,-8,Fini water,39.884362,-10.286157,Kwa Suleiman Ahmad,0,Ruvuma / Southern Coast,Utende,Mtwara,99,1,Mtwara Rural,Ndumbwe,100,True,GeoData Consultants Ltd,,,True,1971.0,other,other,other,vwc,user-group,unknown,unknown,soft,good,enough,machine dbh,borehole,groundwater,other,other,42.0


In [59]:
features = X_train.columns
target = 'status_group'

Index(['amount_tsh', 'date_recorded', 'funder', 'gps_height', 'installer',
       'longitude', 'latitude', 'wpt_name', 'num_private', 'basin',
       'subvillage', 'region', 'region_code', 'district_code', 'lga', 'ward',
       'population', 'public_meeting', 'recorded_by', 'scheme_management',
       'scheme_name', 'permit', 'construction_year', 'extraction_type',
       'extraction_type_group', 'extraction_type_class', 'management',
       'management_group', 'payment', 'payment_type', 'water_quality',
       'quality_group', 'quantity', 'source', 'source_type', 'source_class',
       'waterpoint_type', 'waterpoint_type_group', 'time_inspect'],
      dtype='object')

In [0]:
import category_encoders as ce 
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    LogisticRegression(multi_class='auto', solver='lbfgs', n_jobs=-1)
)

In [0]:
pipeline.fit(X_train, y_train)

print('Validation Accuracy', pipeline.score(X_val, y_val))
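
The assignment also asks for a decision tree classifier and its feature importances. The sketch below shows one way that could look, using only numeric toy features so that no encoder is needed. The data, labels, `max_depth=3`, and the column name `years_in_service` are made up for illustration. With the real data, an encoder step would come first, as in the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric features and labels
X_toy = pd.DataFrame({
    'longitude': [37.4, 35.7, np.nan, 31.1, 34.2, 36.9, 34.8, 35.1],
    'gps_height': [780, 1762, 1593, 0, 2010, 273, 1209, 1670],
    'years_in_service': [41, 3, 15, np.nan, 26, 1, 3, 13],
})
y_toy = ['non functional', 'functional', 'functional', 'non functional',
         'functional', 'functional', 'non functional', 'functional']

# Impute missing values, then fit a shallow decision tree
tree_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(max_depth=3, random_state=42),
)
tree_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
tree = tree_pipeline.named_steps['decisiontreeclassifier']
importances = pd.Series(tree.feature_importances_, index=X_toy.columns)
print(importances.sort_values())
```

From there, `importances.sort_values().plot.barh()` is one way to plot the importances, and `tree_pipeline.score(X_val, y_val)` would give a validation accuracy in the same way as the logistic regression pipeline.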