Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
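
The pipeline, validation score, and feature importance steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference solution: the data is synthetic, the feature lists are arbitrary, and scikit-learn's own `OrdinalEncoder` inside a column transformer stands in for the `category_encoders` package that the notebook installs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the waterpumps data (all values are made up).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'quantity': rng.choice(['enough', 'insufficient', 'dry'], size=n),
    'basin': rng.choice(['Lake Victoria', 'Pangani', 'Rufiji'], size=n),
    'gps_height': rng.integers(0, 2500, size=n).astype(float),
})
# Make the target depend on `quantity` so the tree has something to learn.
df['status_group'] = np.where(df['quantity'] == 'dry', 'non functional', 'functional')
# Remove some numeric values so the imputer has work to do.
df.loc[rng.choice(n, size=50, replace=False), 'gps_height'] = np.nan

categorical = ['quantity', 'basin']   # assumed feature selection
numeric = ['gps_height']
target = 'status_group'

train, val = train_test_split(df, test_size=0.2, stratify=df[target], random_state=42)
X_train, y_train = train[categorical + numeric], train[target]
X_val, y_val = val[categorical + numeric], val[target]

# Categoricals: fill missing values, then encode as integers.
# Numerics: fill missing values with the training median.
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='MISSING'),
                   OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
     categorical),
    (SimpleImputer(strategy='median'), numeric),
)
pipeline = make_pipeline(preprocess, DecisionTreeClassifier(max_depth=3, random_state=42))
pipeline.fit(X_train, y_train)

val_accuracy = pipeline.score(X_val, y_val)
print(f'Validation accuracy: {val_accuracy:.3f}')

# The column transformer outputs categorical columns first, then numeric ones.
importances = pd.Series(
    pipeline.named_steps['decisiontreeclassifier'].feature_importances_,
    index=categorical + numeric,
).sort_values()
print(importances)
```

With matplotlib installed, `importances.plot.barh()` draws the same values as a horizontal bar chart.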


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [which columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) Which columns are duplicates, or nearly duplicates? Can you extract the year from `date_recorded`? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.
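
One possible shape for the wrangle function is sketched below. The function name `wrangle`, the new column names `year_recorded` and `years_in_service`, and the choice of which columns treat zero as missing are all assumptions made for illustration. Check them against the real data before relying on them.

```python
import numpy as np
import pandas as pd

def wrangle(df):
    """Apply the same cleaning steps to a train, validate, or test frame."""
    df = df.copy()  # avoid mutating the caller's DataFrame

    # Treat zeros as missing where zero is not a plausible value.
    # The column list here is an assumption, not a verified finding.
    for col in ['construction_year', 'longitude']:
        df[col] = df[col].replace(0, np.nan)

    # Extract the inspection year from the date string.
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    df['year_recorded'] = df['date_recorded'].dt.year

    # Years from construction to inspection; NaN when construction_year is unknown.
    df['years_in_service'] = df['year_recorded'] - df['construction_year']

    return df.drop(columns='date_recorded')

# Tiny made-up frame to show the effect.
toy = pd.DataFrame({
    'date_recorded': ['2011-03-16', '2013-02-04'],
    'construction_year': [1986, 0],
    'longitude': [34.9, 0.0],
})
wrangled = wrangle(toy)
print(wrangled)
```

Calling the same function on each split (for example `train = wrangle(train)` and `test = wrangle(test)`) keeps the columns of every set aligned.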


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
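
As a rough sketch of the two bullets above, the snippet below computes the share of functional pumps per category, and per quantile bin of a numeric feature. The toy rows are invented; only the column names come from the waterpumps data.

```python
import pandas as pd

# Made-up rows for illustration.
train = pd.DataFrame({
    'quantity': ['enough', 'enough', 'dry', 'dry',
                 'insufficient', 'enough', 'dry', 'insufficient'],
    'population': [0, 25, 230, 1, 500, 80, 10, 300],
    'status_group': ['functional', 'functional', 'non functional', 'non functional',
                     'functional', 'functional', 'non functional', 'non functional'],
})
train['functional'] = (train['status_group'] == 'functional').astype(int)

# Mean of the 0/1 target per category: the estimate a categorical bar plot draws.
by_quantity = train.groupby('quantity')['functional'].mean()
print(by_quantity)

# Bin a numeric feature into 4 equal-sized groups, then summarize the same way.
train['population_bin'] = pd.qcut(train['population'], q=4)
by_population = train.groupby('population_bin', observed=True)['functional'].mean()
print(by_population)
```

Passing the same columns to Seaborn, for example `sns.barplot(x='quantity', y='functional', data=train)`, plots these means and adds confidence intervals.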

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```
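
The same idea can be wrapped in a small helper so it works for any column, such as `funder` or `installer` in the waterpumps data. The helper name `reduce_cardinality` and its signature are hypothetical. Like the code above, it learns the top values from the training set only, and it also relabels missing values as 'OTHER'.

```python
import pandas as pd

def reduce_cardinality(train, test, column, n=10):
    """Keep the n most frequent training values of `column`; relabel the rest."""
    top = train[column].value_counts().index[:n]
    train, test = train.copy(), test.copy()  # leave the inputs untouched
    for df in (train, test):
        # Values outside the top n (including NaN) become 'OTHER'.
        df.loc[~df[column].isin(top), column] = 'OTHER'
    return train, test

# Made-up example values.
train = pd.DataFrame({'funder': ['A', 'A', 'A', 'B', 'B', 'C']})
test = pd.DataFrame({'funder': ['A', 'C', 'D']})

train_small, test_small = reduce_cardinality(train, test, 'funder', n=2)
print(train_small['funder'].tolist())
print(test_small['funder'].tolist())
```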


In [1]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))
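
With the data loaded, the split and baseline steps of the assignment might look like the sketch below. It uses a toy label column whose class proportions are invented, so the numbers it prints say nothing about the real competition data. The 80/20 split size is likewise an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels standing in for the merged training frame; proportions are invented.
labels = pd.DataFrame({
    'id': range(100),
    'status_group': ['functional'] * 55
                    + ['non functional'] * 38
                    + ['functional needs repair'] * 7,
})

# Stratify so each split keeps roughly the same class proportions.
train_part, val_part = train_test_split(
    labels, test_size=0.2, stratify=labels['status_group'], random_state=42)

# Majority-class baseline: always predict the most common training label.
majority_class = train_part['status_group'].mode()[0]
baseline_accuracy = (val_part['status_group'] == majority_class).mean()

print(train_part.shape, val_part.shape)
print(majority_class, baseline_accuracy)
```

Any model worth submitting should beat this baseline accuracy on the validation set.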

In [5]:
sample_submission['status_group'].value_counts()

functional    14358
Name: status_group, dtype: int64
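
The sample submission predicts `functional` for every row. A submission file with real predictions could be assembled roughly as below. The `id` values and predictions are placeholders, and the two-column `id`, `status_group` layout is an assumption to confirm against `sample_submission.columns`.

```python
import io
import pandas as pd

# Placeholders: in the notebook these would be test['id'] and the model's predictions.
test_ids = [1, 2, 3]
y_pred = ['functional', 'non functional', 'functional']

submission = pd.DataFrame({'id': test_ids, 'status_group': y_pred})

# Write to an in-memory buffer here; pass a filename instead to create the CSV to upload.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Setting `index=False` keeps pandas from writing the row index as an extra, unnamed column.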

In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14358 entries, 0 to 14357
Data columns (total 40 columns):
id                       14358 non-null int64
amount_tsh               14358 non-null float64
date_recorded            14358 non-null object
funder                   13575 non-null object
gps_height               14358 non-null int64
installer                13570 non-null object
longitude                14358 non-null float64
latitude                 14358 non-null float64
wpt_name                 14358 non-null object
num_private              14358 non-null int64
basin                    14358 non-null object
subvillage               14264 non-null object
region                   14358 non-null object
region_code              14358 non-null int64
district_code            14358 non-null int64
lga                      14358 non-null object
ward                     14358 non-null object
population               14358 non-null int64
public_meeting           13573 non-null object
r

In [8]:
test.describe(include=object).T

|  | count | unique | top | freq |
|---|---|---|---|---|
| date_recorded | 14358 | 331 | 2011-03-16 | 137 |
| funder | 13575 | 960 | Government Of Tanzania | 2117 |
| installer | 13570 | 1075 | DWE | 4162 |
| wpt_name | 14358 | 10615 | none | 822 |
| basin | 14358 | 9 | Lake Victoria | 2535 |
| subvillage | 14264 | 8253 | Shuleni | 136 |
| region | 14358 | 21 | Shinyanga | 1258 |
| lga | 14358 | 124 | Njombe | 611 |
| ward | 14358 | 1934 | Igosi | 79 |
| public_meeting | 13573 | 2 | True | 12308 |


In [9]:
test.drop(columns=['wpt_name', 'subvillage', 'recorded_by', 'management_group', 'payment_type',
                   'quality_group', 'quantity_group', 'source_type', 'source_class',
                   'waterpoint_type_group'], inplace=True)

In [10]:
test.describe(include=object).T

|  | count | unique | top | freq |
|---|---|---|---|---|
| date_recorded | 14358 | 331 | 2011-03-16 | 137 |
| funder | 13575 | 960 | Government Of Tanzania | 2117 |
| installer | 13570 | 1075 | DWE | 4162 |
| basin | 14358 | 9 | Lake Victoria | 2535 |
| region | 14358 | 21 | Shinyanga | 1258 |
| lga | 14358 | 124 | Njombe | 611 |
| ward | 14358 | 1934 | Igosi | 79 |
| public_meeting | 13573 | 2 | True | 12308 |
| scheme_management | 13419 | 11 | VWC | 8807 |
| scheme_name | 7519 | 1772 | Borehole | 158 |


In [11]:
test.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 14358.0 | 37232.859799 | 21382.890432 | 10.0 | 18765.5 | 37442.0 | 55909.25 | 74249.0 |
| amount_tsh | 14358.0 | 324.219996 | 2533.367778 | 0.0 | 0.0 | 0.0 | 25.0 | 200000.0 |
| gps_height | 14358.0 | 653.6363 | 688.2721 | -57.0 | 0.0 | 346.0 | 1306.0 | 2777.0 |
| longitude | 14358.0 | 34.082414 | 6.564449 | 0.0 | 33.062317 | 34.898976 | 37.221606 | 40.32502 |
| latitude | 14358.0 | -5.697584 | 2.947444 | -11.564592 | -8.453125 | -5.087905 | -3.31424 | -2e-08 |
| num_private | 14358.0 | 0.408971 | 8.231859 | 0.0 | 0.0 | 0.0 | 0.0 | 669.0 |
| region_code | 14358.0 | 15.156359 | 17.387588 | 1.0 | 5.0 | 12.0 | 17.0 | 99.0 |
| district_code | 14358.0 | 5.713052 | 9.794304 | 0.0 | 2.0 | 3.0 | 5.0 | 80.0 |
| population | 14358.0 | 187.055439 | 476.065978 | 0.0 | 0.0 | 25.0 | 230.0 | 11469.0 |
| construction_year | 14358.0 | 1298.251985 | 952.551852 | 0.0 | 0.0 | 1986.0 | 2004.0 | 2013.0 |
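
Several columns in this summary have a minimum, and sometimes a 25th percentile, of exactly 0 (`construction_year` and `population`, for example). This may mean that zeros are standing in for missing values. One way to measure how common that is would be to compute the share of zeros per numeric column, sketched here on made-up rows:

```python
import pandas as pd

# Made-up rows; only the column names come from the waterpumps data.
toy = pd.DataFrame({
    'construction_year': [0, 0, 1986, 2004],
    'population': [0, 25, 230, 0],
    'gps_height': [-57, 346, 1306, 2777],
})

# Fraction of rows equal to zero in each numeric column, largest first.
zero_share = (toy.select_dtypes('number') == 0).mean().sort_values(ascending=False)
print(zero_share)
```

A high share of zeros does not prove the values are missing, but it flags columns worth inspecting before imputing.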


In [14]:
test.select_dtypes('number').nunique()

id                   14358
amount_tsh              67
gps_height            2142
longitude            13920
latitude             13920
num_private             36
region_code             26
district_code           20
population             631
construction_year       55
dtype: int64

In [None]:
test.drop(columns=['longitude', 'latitude'], inplace=True)

In [17]:
test['amount_tsh'].value_counts()

0.0         10011
500.0         756
50.0          618
1000.0        361
20.0          348
            ...  
200000.0        1
70000.0         1
100000.0        1
2550.0          1
60000.0         1
Name: amount_tsh, Length: 67, dtype: int64