Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
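
A rough sketch of the CSV step in the checklist above. It assumes the submission file has two columns, `id` and `status_group` (an assumption based on the target column used in this notebook), and the `toy_` names and values are hypothetical stand-ins for `sample_submission` and for a model's predictions on the test features.

```python
import pandas as pd

# Hypothetical stand-in for sample_submission.csv (assumed columns: id, status_group)
toy_sample_submission = pd.DataFrame({
    'id': [101, 102, 103],
    'status_group': ['functional'] * 3,
})

# Hypothetical stand-in for model.predict(...) on the encoded test features
toy_predictions = ['functional', 'non functional', 'functional needs repair']

# Copy the sample file so the ids and column order stay intact,
# then overwrite the placeholder labels with the predictions
submission = toy_sample_submission.copy()
submission['status_group'] = toy_predictions

# With no path, to_csv returns the CSV text; pass a filename to write a file instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Copying the sample file first is a simple way to keep the row order and header the competition expects.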

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_

In [2]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

In [3]:
# Build `view`, the list of columns kept as model inputs. Columns whose effect
#   has not been validated yet are dropped pending further analysis.
drop_list = ['permit', 'recorded_by', 'quantity_group', 'source',
             'waterpoint_type_group', 'lga', 'ward', 'public_meeting', 'scheme_name',
             'extraction_type_class', 'extraction_type', 'region', 'scheme_management',
             'payment', 'payment_type', 'quality_group', 'source_class', 'wpt_name', 'funder',
            ]
view = train.drop(columns=drop_list).columns.tolist()

# High-cardinality columns: those with more than 20 unique values
train[view].dtypes[train[view].nunique()>20]

id                     int64
amount_tsh           float64
date_recorded         object
gps_height             int64
installer             object
longitude            float64
latitude             float64
num_private            int64
subvillage            object
region_code            int64
population             int64
construction_year      int64
dtype: object

## Ordinal Encoding
**Before and after encoding, on a 10-row sample of the data**

In [4]:
from IPython.display import display
df = train[view].sample(10)
display(df.subvillage.unique())

array(['Lundusi', 'Kusini', 'Mtaho A', 'Kichangani', 'Kwembila',
       'Kitagura', 'Mtaa Wa Mzinga', 'Masasi', 'Makambako', 'Rukubo'],
      dtype=object)

In [5]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

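# Map each distinct string to an integer code
# (categories are sorted, then numbered from 0)
encoder = OrdinalEncoder(
    categories='auto',
    dtype=np.int32,
)

# Note: this rebinds `view`, replacing the column list built earlier
view = ['subvillage']
data_view = df[view]
encoder.fit(data_view)
encoder.transform(data_view)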

array([[4],
       [2],
       [8],
       [0],
       [3],
       [1],
       [7],
       [6],
       [5],
       [9]])
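
To make the mapping behind those codes visible, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical; only `OrdinalEncoder`, its `categories_` attribute, and `inverse_transform` come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data standing in for the 'subvillage' column
toy = pd.DataFrame({'subvillage': ['Kusini', 'Masasi', 'Kusini', 'Rukubo']})

toy_encoder = OrdinalEncoder(dtype=np.int32)
codes = toy_encoder.fit_transform(toy)

# categories_ holds the sorted unique values per column;
# each value's position in that array is its integer code
mapping = {cat: i for i, cat in enumerate(toy_encoder.categories_[0])}
print(mapping)
print(codes.ravel())

# inverse_transform recovers the original strings from the codes
restored = toy_encoder.inverse_transform(codes)
print(restored.ravel())
```

The codes carry no meaning beyond alphabetical position, which is usually acceptable for tree-based models because they split on thresholds rather than treating the numbers as magnitudes.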

## The Random Forest Classifier

In [36]:
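# Transform data
from sklearn.impute import SimpleImputer
#from sklearn.pipeline import make_pipeline
# Note: this import shadows sklearn's OrdinalEncoder used in the previous section
from category_encoders import OrdinalEncoder

view = ['subvillage', 'installer', 'region_code', 'status_group']

data_view = train[view]  # append .sample(10000) for a faster experiment
y = data_view.status_group
X = data_view.drop(columns='status_group')
# Encode the two string columns; region_code is already an integer.
# The encoder is fit on all rows here, before the train/validation split.
encoder = OrdinalEncoder(cols=['subvillage', 'installer'])
Xt = encoder.fit_transform(X)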

In [37]:
# Split into train/validation sets
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(Xt, y, test_size=0.20, random_state=42)
display(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

(47520, 3)

(11880, 3)

(47520,)

(11880,)

In [42]:
# Fit Model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100, 
    random_state=42,
    min_samples_split=11,
    n_jobs=3
)
model.fit(X_train, y_train)
print('Validation Accuracy', model.score(X_val, y_val))

Validation Accuracy 0.6543771043771044
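
One possible variation, sketched under stated assumptions: put the encoder and the forest in a single pipeline so the encoder is fit on the training split only. This sketch swaps in scikit-learn's own `OrdinalEncoder` (not the `category_encoders` one used above) and assumes scikit-learn 0.24 or later for the `handle_unknown='use_encoded_value'` option. The `toy_` data is randomly generated and hypothetical, so its accuracy says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame; column names mirror the notebook, values are made up
rng = np.random.RandomState(42)
n = 200
toy = pd.DataFrame({
    'subvillage': rng.choice(['A', 'B', 'C', 'D'], size=n),
    'installer': rng.choice(['DWE', 'Gov', 'RWE'], size=n),
})
toy_y = rng.choice(['functional', 'non functional'], size=n)

toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy, toy_y, test_size=0.20, random_state=42)

# The pipeline fits the encoder on the training rows only;
# categories first seen at validation time are encoded as -1
toy_pipeline = make_pipeline(
    OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
    RandomForestClassifier(n_estimators=10, random_state=42),
)
toy_pipeline.fit(toy_X_train, toy_y_train)
val_accuracy = toy_pipeline.score(toy_X_val, toy_y_val)
print('Validation Accuracy', val_accuracy)

# A category the encoder never saw ('Z') maps to the unknown code
unseen = pd.DataFrame({'subvillage': ['Z'], 'installer': ['DWE']})
unseen_codes = toy_pipeline.named_steps['ordinalencoder'].transform(unseen)
print(unseen_codes)
```

Keeping the encoder inside the pipeline also means the same fitted object can later be applied to the test features with a single `predict` call.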


In [44]:
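# Compute permutation importances on the validation set.
# No feature names are passed, so eli5 labels the columns x0, x1, x2
# in the order of X_val's columns: subvillage, installer, region_code.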
from eli5.sklearn import PermutationImportance
import eli5

perm = PermutationImportance(model).fit(X_val, y_val)
eli5.show_weights(perm)

| Weight | Feature |
|---|---|
| 0.1355 ± 0.0026 | x2 |
| 0.1245 ± 0.0082 | x1 |
| 0.0792 ± 0.0062 | x0 |
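
For comparison, a small sketch of the same idea using scikit-learn's `permutation_importance` (a different implementation from eli5's `PermutationImportance`, assuming scikit-learn 0.22 or later), with results indexed by column name instead of `x0`, `x1`, `x2`. The `demo_` data is hypothetical: one column fully determines the label and the other is random noise.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical data: 'signal' determines the label, 'noise' is unrelated to it
rng = np.random.RandomState(0)
n = 300
demo_X = pd.DataFrame({
    'signal': rng.randint(0, 2, size=n),
    'noise': rng.randint(0, 10, size=n),
})
demo_y = demo_X['signal'].to_numpy()

# Fit on the first 200 rows, hold out the last 100 for scoring
demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
demo_model.fit(demo_X.iloc[:200], demo_y[:200])

# Shuffle each column in turn and measure the drop in held-out accuracy
result = permutation_importance(
    demo_model, demo_X.iloc[200:], demo_y[200:], n_repeats=5, random_state=0)

# Index the mean importances by column name and sort, largest first
importances = pd.Series(result.importances_mean, index=demo_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would draw the result as a bar chart, assuming matplotlib is installed.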
