Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](http://archive.is/Nu3EI), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/category_encoders/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do, for example [CatBoost](https://catboost.ai/) or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding where the order is arbitrary rather than meaningful. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/category_encoders/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](https://contrib.scikit-learn.org/category_encoders/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](https://contrib.scikit-learn.org/category_encoders/binary.html).
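
To make the difference between these encodings concrete, here is a minimal sketch on a toy column. It uses plain pandas rather than category_encoders, so the column names and integer assignments are illustrative assumptions and will not match the library's output exactly.

```python
import pandas as pd

# Toy column; the category values are made up for illustration.
df = pd.DataFrame({'quantity': ['dry', 'enough', 'seasonal', 'enough', 'insufficient']})

# Ordinal (label) encoding: each category gets an arbitrary integer, starting at 1.
codes, categories = pd.factorize(df['quantity'])
ordinal = pd.Series(codes + 1, index=df.index, name='quantity_ordinal')

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df['quantity'], prefix='quantity', dtype=int)

# Binary encoding: write the ordinal code in base 2, one column per bit.
n_bits = int(ordinal.max()).bit_length()
binary = pd.DataFrame(
    {f'quantity_bit{b}': (ordinal // 2**b) % 2 for b in reversed(range(n_bits))},
    index=df.index,
)

print(pd.concat([df, ordinal, one_hot, binary], axis=1))
```

With four categories, one-hot encoding needs four columns while binary encoding needs three. The gap widens quickly for high-cardinality features, which is one reason to try binary encoding with trees.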


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](https://contrib.scikit-learn.org/category_encoders/catboost.html)
- [Generalized Linear Mixed Model Encoder](https://contrib.scikit-learn.org/category_encoders/glmm.html)
- [James-Stein Encoder](https://contrib.scikit-learn.org/category_encoders/jamesstein.html)
- [Leave One Out](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html)
- [M-estimate](https://contrib.scikit-learn.org/category_encoders/mestimate.html)
- [Target Encoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)
- [Weight of Evidence](https://contrib.scikit-learn.org/category_encoders/woe.html)

Category Encoders' mean encoding implementations work for regression problems or binary classification problems.

For a multi-class classification problem, you will need to temporarily reformulate it as binary classification. For example:

```python
import category_encoders as ce

encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_val)  # Transform only; the target is not needed here
```

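A pipeline passes the same `y` to every step, so the encoder cannot receive the binary target while the classifier receives the original multi-class target. For this reason, mean encoding won't work well within pipelines for multi-class classification problems.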
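
The idea behind mean encoding can also be sketched by hand. The example below is only an illustration under stated assumptions: the toy data is made up, and the smoothing (blending each category's mean with the global mean, weighted by a hypothetical constant `m`) is a simple m-estimate-style formula, not the exact formula used by `category_encoders.TargetEncoder`.

```python
import pandas as pd

# Toy data; the column values are made up for illustration.
X_train = pd.DataFrame({'basin': ['A', 'A', 'A', 'B', 'B', 'C']})
y_train = pd.Series(['functional', 'functional', 'non functional',
                     'functional', 'non functional', 'functional needs repair'])
X_val = pd.DataFrame({'basin': ['A', 'C', 'D']})  # 'D' never appears in training

# Reformulate the multi-class target as binary: is the pump functional?
y_binary = (y_train == 'functional').astype(int)

# Assumed smoothing: blend each category's mean with the global mean,
# weighting the global mean as if it were `m` extra observations.
m = 2
prior = y_binary.mean()
stats = y_binary.groupby(X_train['basin']).agg(['sum', 'count'])
encoding = (stats['sum'] + m * prior) / (stats['count'] + m)

# Learn the mapping from the training data only, then apply it everywhere.
# Unseen categories fall back to the global mean.
X_train_encoded = X_train['basin'].map(encoding)
X_val_encoded = X_val['basin'].map(encoding).fillna(prior)

print(encoding)
print(X_val_encoded)
```

Note that the validation set is encoded with the mapping learned from the training set; its own labels are never used.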

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/colinmorris/embedding-layers)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation — maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [7]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [9]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import datetime

# Retrieve data
train_val = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                     pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')

target = 'status_group'

# Wrangle train, validate, and test sets in the same way
def wrangle(X):
    
    # Prevent SettingWithCopyWarning
    X = X.copy()

    if target in X.columns:
        X = X.drop(columns=[target])
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    # (infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default)
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
  
    return X

# Split train & val
# train, val = train_test_split(
#     train_val,
#     train_size = 0.80,
#     test_size = 0.20,
#     stratify = train_val[target],
#     random_state = train_val_seed,
#     )

# Actually wrangle and split the data up
X_train_val, y_train_val = wrangle(train_val), train_val[target]
X_test                   = wrangle(test)  # y_test is not provided

# Actually wrangle and split the data up
# X_train, y_train = wrangle(train), train[target]
# X_val, y_val     = wrangle(val), val[target]
# X_test           = wrangle(test)  # y_test is not provided

# print('Shapes:')
# print('Train, X_train, y_train:', train.shape, X_train.shape, y_train.shape)
# print('Val, X_val, y_val:', val.shape, X_val.shape, y_val.shape)
# print('Test, X_test:', test.shape, X_test.shape)

from category_encoders import OrdinalEncoder
from category_encoders import TargetEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Make a pipeline for encoding
encoder = Pipeline(steps=[
    ('ordinalencoder', OrdinalEncoder(return_df=True, handle_unknown='return_nan', handle_missing='return_nan')),
    # ('targetencoder', TargetEncoder(min_samples_leaf=1, smoothing=1)), 
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ])
X_train_val_encoded = encoder.fit_transform(X_train_val)
# X_val_encoded = encoder.transform(X_val)
X_test_encoded = encoder.transform(X_test)
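
The zero-handling step inside `wrangle` is easier to see in isolation. This is a minimal sketch on a toy frame with made-up values; the mean imputation at the end is a stand-in for what `SimpleImputer()` does with its default strategy.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are made up for illustration.
X = pd.DataFrame({
    'construction_year': [1998, 0, 2010],
    'population': [0, 250, 0],
})

for col in ['construction_year', 'population']:
    X[col] = X[col].replace(0, np.nan)      # zero stands in for "unknown"
    X[col + '_MISSING'] = X[col].isnull()   # remember which rows were unknown

# After mean imputation the filled-in values look like real data,
# but the _MISSING columns still record which rows were unknown.
imputed_population = X['population'].fillna(X['population'].mean())

print(X)
print(imputed_population)
```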

In [31]:
def try_model(seed=42, max_depth=None, n_estimators=20, split_samples=2, leaf_samples=1):

    # Split train & val
    X_train_encoded, X_val_encoded, y_train, y_val = train_test_split(
        X_train_val_encoded,
        y_train_val,
        train_size = 0.80,
        test_size = 0.20,
        stratify = train_val[target],
        random_state = seed,
        )

    # Make and fit the model
    model = Pipeline(steps=[
        ('randomforestclassifier', RandomForestClassifier(
            max_depth = max_depth,
            n_estimators = n_estimators,
            min_samples_split = split_samples, 
            min_samples_leaf = leaf_samples,
            random_state = seed,
            ), 
        ),
        ])
    model.fit(X_train_encoded, y_train)

    # Evaluate the model
    train_score = model.score(X_train_encoded, y_train)
    val_score = model.score(X_val_encoded, y_val)
    # print('Train Accuracy: %.3f' % train_score)
    # print('Validation Accuracy: %.3f' % val_score)

    # Combine these two into a "Cautious Accuracy"
    # Take the difference between the train and val scores and subtract that from val_score
    # Rewards a high val score, but only if the model is not overfit.
    cautious_score = 2 * val_score - train_score
    # print('Cautious Accuracy: %.3f' % cautious_score)
    return cautious_score
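
As a quick check of what the "Cautious Accuracy" rewards, here is a small sketch with two hypothetical models that share the same validation accuracy. The function name and the scores are made up for illustration.

```python
def cautious_accuracy(train_score, val_score):
    """Validation accuracy minus the train/validation gap."""
    overfit_gap = train_score - val_score
    return val_score - overfit_gap  # same as 2 * val_score - train_score

# Two hypothetical models with the same validation accuracy:
heavily_overfit = cautious_accuracy(train_score=0.95, val_score=0.80)
mildly_overfit = cautious_accuracy(train_score=0.82, val_score=0.80)

print(f'Heavily overfit: {heavily_overfit:.3f}')
print(f'Mildly overfit:  {mildly_overfit:.3f}')
```

The model with the smaller gap between train and validation accuracy gets the higher score, even though plain validation accuracy cannot tell the two apart.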

In [14]:
# Do a grid search over random forests
for depth in range(10,71,20):
  for n_estimators in range(18,23,2):
    for min_samples in range(10,31,10):
      scores = [try_model(seed=seed, max_depth=depth, n_estimators=n_estimators, split_samples=min_samples) for seed in range(5)]
      mean_score = np.array(scores).mean()
      print('Depth =',depth,'Trees =',n_estimators,'Min Samples/Split =',min_samples,'Mean Score =',mean_score)

Depth = 10 Trees = 18 Min Samples/Split = 10 Mean Score = 0.7401304713804713
Depth = 10 Trees = 18 Min Samples/Split = 20 Mean Score = 0.741797138047138
Depth = 10 Trees = 18 Min Samples/Split = 30 Mean Score = 0.7422937710437711
Depth = 10 Trees = 20 Min Samples/Split = 10 Mean Score = 0.7394949494949494
Depth = 10 Trees = 20 Min Samples/Split = 20 Mean Score = 0.7432070707070709
Depth = 10 Trees = 20 Min Samples/Split = 30 Mean Score = 0.7415656565656565
Depth = 10 Trees = 22 Min Samples/Split = 10 Mean Score = 0.7404040404040404
Depth = 10 Trees = 22 Min Samples/Split = 20 Mean Score = 0.7429671717171716
Depth = 10 Trees = 22 Min Samples/Split = 30 Mean Score = 0.7464814814814814
Depth = 30 Trees = 18 Min Samples/Split = 10 Mean Score = 0.6978072390572391
Depth = 30 Trees = 18 Min Samples/Split = 20 Mean Score = 0.731712962962963
Depth = 30 Trees = 18 Min Samples/Split = 30 Mean Score = 0.742087542087542
Depth = 30 Trees = 20 Min Samples/Split = 10 Mean Score = 0.7017760942760942
[output truncated]

In [25]:
# Do a grid search over random forests
for n_estimators in range(20,21):
  for min_samples in range(25,76,10):
    scores = [try_model(seed=seed, max_depth=None, n_estimators=n_estimators, split_samples=min_samples) for seed in range(10)]
    scores = np.array(scores)
    mean_score = scores.mean()
    std_score = scores.std()
    low_score = format(mean_score - std_score, '.3f')
    high_score = format(mean_score + std_score, '.3f')
    print('Depth = full, Trees =',n_estimators,'Min Samples/Split =',min_samples,'Scores =',low_score,'-',high_score)

Depth = full, Trees = 20 Min Samples/Split = 25 Scores = 0.731 - 0.744
Depth = full, Trees = 20 Min Samples/Split = 35 Scores = 0.742 - 0.754
Depth = full, Trees = 20 Min Samples/Split = 45 Scores = 0.746 - 0.761
Depth = full, Trees = 20 Min Samples/Split = 55 Scores = 0.749 - 0.764
Depth = full, Trees = 20 Min Samples/Split = 65 Scores = 0.750 - 0.764
Depth = full, Trees = 20 Min Samples/Split = 75 Scores = 0.753 - 0.767


In [44]:
# Do a grid search over random forests
for split_samples in range(52,57,2):
  print()
  for leaf_samples in range(10,21,5):
    scores = [try_model(seed=seed, split_samples=split_samples, leaf_samples=leaf_samples) for seed in range(20)]
    scores = np.array(scores)
    mean_score = scores.mean()
    std_score = scores.std()
    low_score = format(mean_score - std_score, '.3f')
    high_score = format(mean_score + std_score, '.3f')
    print(f'Min Samples/Split = {split_samples}, Min Samples/Leaf = {leaf_samples}, Scores = {low_score} - {high_score}')


Min Samples/Split = 52, Min Samples/Leaf = 10, Scores = 0.751 - 0.768
Min Samples/Split = 52, Min Samples/Leaf = 15, Scores = 0.753 - 0.767
Min Samples/Split = 52, Min Samples/Leaf = 20, Scores = 0.753 - 0.767

Min Samples/Split = 54, Min Samples/Leaf = 10, Scores = 0.751 - 0.766
Min Samples/Split = 54, Min Samples/Leaf = 15, Scores = 0.753 - 0.764
Min Samples/Split = 54, Min Samples/Leaf = 20, Scores = 0.753 - 0.766

Min Samples/Split = 56, Min Samples/Leaf = 10, Scores = 0.754 - 0.766
Min Samples/Split = 56, Min Samples/Leaf = 15, Scores = 0.751 - 0.766
Min Samples/Split = 56, Min Samples/Leaf = 20, Scores = 0.753 - 0.765
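
The three grid-search cells above repeat the same pattern: loop over parameter combinations, score each one across several seeds, and report mean minus/plus one standard deviation. One possible way to factor that out is sketched below. `summarize_grid` and `fake_score` are hypothetical names; `fake_score` is only a stand-in for `try_model` so the sketch runs on its own, and its numbers mean nothing.

```python
from itertools import product

import numpy as np

def summarize_grid(score_fn, param_grid, seeds):
    """Score every parameter combination across several seeds.

    Returns a list of (params, mean - std, mean + std) tuples.
    """
    names = list(param_grid)
    results = []
    for values in product(*param_grid.values()):
        params = dict(zip(names, values))
        scores = np.array([score_fn(seed=seed, **params) for seed in seeds])
        results.append((params, scores.mean() - scores.std(), scores.mean() + scores.std()))
    return results

# Hypothetical stand-in for try_model, so the sketch runs on its own.
def fake_score(seed, split_samples, leaf_samples):
    rng = np.random.default_rng(seed)
    return 0.75 + 0.001 * leaf_samples / split_samples + rng.normal(scale=0.005)

grid = {'split_samples': [52, 54, 56], 'leaf_samples': [10, 15, 20]}
results = summarize_grid(fake_score, grid, seeds=range(5))
for params, low, high in results:
    print(params, f'Scores = {low:.3f} - {high:.3f}')
```

In the notebook, passing `try_model` in place of `fake_score` would reproduce the loops above, since `try_model` accepts `seed`, `split_samples`, and `leaf_samples` as keyword arguments.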


In [None]:

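# Cross-validate on the encoded training data.
# `pipeline` was never defined in this notebook, so a plain random forest is assumed here.
# The target is categorical, so score with accuracy rather than mean absolute error.
from sklearn.model_selection import cross_val_score
k = 3
scores = cross_val_score(RandomForestClassifier(n_estimators=20, random_state=42),
                         X_train_val_encoded, y_train_val, cv=k,
                         scoring='accuracy')
print(f'Accuracy for {k} folds:', scores)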

In [49]:
# Make a model with a certain hyperparameter combo
# train_val_seed=42
max_depth=None
n_estimators=40
split_samples=52
leaf_samples=15

# Split train & val
X_train_encoded, X_val_encoded, y_train, y_val = train_test_split(
    X_train_val_encoded,
    y_train_val,
    train_size = 0.80,
    test_size = 0.20,
    stratify = train_val[target],
    # random_state = train_val_seed,
    )

# Make and fit the model
model = Pipeline(steps=[
    ('randomforestclassifier', RandomForestClassifier(
          max_depth = max_depth,
          n_estimators = n_estimators,
          min_samples_split = split_samples, 
          min_samples_leaf = leaf_samples,
        ), 
    ),
    ])
model.fit(X_train_encoded, y_train)

# Evaluate the model
train_score = model.score(X_train_encoded, y_train)
val_score = model.score(X_val_encoded, y_val)
print('Train Accuracy: %.3f' % train_score)
print('Validation Accuracy: %.3f' % val_score)

# Combine these two into a "Cautious Accuracy"
# Take the difference between the train and val scores and subtract that from val_score
# Rewards a high val score, but only if the model is not overfit.
cautious_score = 2 * val_score - train_score
print('Cautious Accuracy: %.3f' % cautious_score)

Train Accuracy: 0.811
Validation Accuracy: 0.791
Cautious Accuracy: 0.772


In [50]:
# Create the test prediction and submission file
y_pred = model.predict(X_test_encoded)
message = 'Something something'
submission = pd.DataFrame({'id': test.id, 'status_group': y_pred})
submission_filename = 'submission.csv'
submission.to_csv(submission_filename, index=False)

# On Colab, download the file; locally it is already saved to disk
if 'google.colab' in sys.modules:
    from google.colab import files
    files.download(submission_filename)
# !kaggle competitions submit -c dspt-pump-it-up-challenge -f submission.csv -m "{message}"
