
Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.
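
One way to answer "Does it beat your baseline?" is to compute a naive baseline before fitting anything. The sketch below is only an illustration and not part of the original assignment: the function names and toy values are hypothetical. It assumes majority-class accuracy as the classification baseline and mean-prediction MAE as the regression baseline.

```python
from collections import Counter
from statistics import mean


def majority_class_accuracy(y_train, y_val):
    """Accuracy from always predicting the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(label == majority for label in y_val) / len(y_val)


def mean_baseline_mae(y_train, y_val):
    """Mean absolute error from always predicting the training mean."""
    guess = mean(y_train)
    return mean(abs(y - guess) for y in y_val)


# Hypothetical toy data, only to show the calls
toy_train_labels = [0, 0, 1, 0, 1]
toy_val_labels = [0, 1, 0, 0]
baseline_accuracy = majority_class_accuracy(toy_train_labels, toy_val_labels)

toy_train_prices = [2000, 3000, 4000]
toy_val_prices = [2500, 3500]
baseline_mae = mean_baseline_mae(toy_train_prices, toy_val_prices)
```

A fitted model is only worth interpreting if its validation score is better than these numbers on the same validation set.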

Try to complete an initial model today, because we will spend the rest of the week making model interpretation visualizations.

If you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead, including any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Titanic (Classification)

In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
train, val = train_test_split(pd.read_csv(DATA_PATH+'titanic/train.csv'), random_state=42)

In [0]:
features = ['Sex', 'Pclass', 'Embarked', 'SibSp', 'Parch', 'Age']
target = 'Survived'

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

transformers = make_pipeline(ce.OneHotEncoder(use_cat_names=True), SimpleImputer())
X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

eval_set = [(X_train_transformed, y_train), 
            (X_val_transformed, y_val)]

model = XGBClassifier(n_estimators=1000, n_jobs=-1)

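# Stop once validation error has not improved for 20 rounds.
# Note: xgboost 2.0+ expects eval_metric and early_stopping_rounds
# in the XGBClassifier constructor rather than in fit().
model.fit(X_train_transformed, y_train, eval_set=eval_set, 
          eval_metric='error', early_stopping_rounds=20)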

[0]	validation_0-error:0.166168	validation_1-error:0.188341
Multiple eval metrics have been passed: 'validation_1-error' will be used for early stopping.

Will train until validation_1-error hasn't improved in 20 rounds.
[1]	validation_0-error:0.166168	validation_1-error:0.188341
[2]	validation_0-error:0.166168	validation_1-error:0.188341
[3]	validation_0-error:0.166168	validation_1-error:0.188341
[4]	validation_0-error:0.166168	validation_1-error:0.188341
[5]	validation_0-error:0.166168	validation_1-error:0.188341
[6]	validation_0-error:0.166168	validation_1-error:0.188341
[7]	validation_0-error:0.166168	validation_1-error:0.179372
[8]	validation_0-error:0.166168	validation_1-error:0.183857
[9]	validation_0-error:0.166168	validation_1-error:0.174888
[10]	validation_0-error:0.166168	validation_1-error:0.174888
[11]	validation_0-error:0.166168	validation_1-error:0.174888
[12]	validation_0-error:0.163174	validation_1-error:0.170404
[13]	validation_0-error:0.163174	validation_1-error:0.17

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
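
The log above says training continues until `validation_1-error` "hasn't improved in 20 rounds". The following is a rough sketch of that stopping rule in plain Python, not XGBoost's actual implementation: the function name, the toy scores, and the smaller patience value are all hypothetical, and XGBoost's handling of ties and bookkeeping may differ.

```python
def early_stopping_round(validation_scores, patience):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training stops at the first round that is `patience` rounds past
    the best score seen so far.
    """
    best_round = 0
    best_score = validation_scores[0]
    for current_round, score in enumerate(validation_scores):
        if score < best_score:
            best_round, best_score = current_round, score
        elif current_round - best_round >= patience:
            return best_round, current_round
    # Ran out of rounds before the patience was exhausted
    return best_round, len(validation_scores) - 1


# Hypothetical validation errors, one per boosting round
toy_errors = [0.20, 0.18, 0.19, 0.17, 0.18, 0.18, 0.18, 0.16]
best_round, stop_round = early_stopping_round(toy_errors, patience=3)
```

In this toy run the best round is 3 and training stops at round 6, so the lower error at round 7 is never reached. That is the trade-off of a small patience value: it saves time but can stop before a later improvement.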

In [0]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(model, scoring='accuracy', n_iter=3)
permuter.fit(X_val_transformed, y_val)

encoder = transformers.named_steps['onehotencoder']
feature_names = encoder.transform(X_val).columns.tolist()

eli5.show_weights(permuter, top=None, feature_names=feature_names)


| Weight | Feature |
|---|---|
| 0.1779 ± 0.0152 | Sex_male |
| 0.0852 ± 0.0127 | Pclass |
| 0.0538 ± 0.0194 | Age |
| 0.0269 ± 0.0146 | Embarked_S |
| 0.0105 ± 0.0042 | SibSp |
| 0 ± 0.0000 | Parch |
| 0 ± 0.0000 | Embarked_nan |
| 0 ± 0.0000 | Embarked_Q |
| 0 ± 0.0000 | Embarked_C |
| 0 ± 0.0000 | Sex_female |
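
To make the weights above less of a black box, here is a minimal from-scratch sketch of the permutation importance idea: shuffle one column of the validation data, re-score the already fitted model, and record how far the score drops. This is a simplified illustration, not eli5's code. The function names, toy model, and toy rows are hypothetical, and the spread reported here is a plain population standard deviation across iterations.

```python
import random
from statistics import mean, pstdev


def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importances(predict, rows, labels, n_iter=3, seed=0):
    """Return one (mean score drop, std of drops) pair per feature."""
    rng = random.Random(seed)
    base_score = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_iter):
            # Shuffle only column j; every other column stays in place
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + (value,) + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(base_score - accuracy(predict, shuffled, labels))
        importances.append((mean(drops), pstdev(drops)))
    return importances


def toy_predict(row):
    """Hypothetical 'model' that looks only at feature 0."""
    return int(row[0] == 1)


toy_rows = [(1, 5), (0, 3), (1, 8), (0, 1), (1, 2), (0, 9)]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_importances = permutation_importances(toy_predict, toy_rows, toy_labels)
```

Feature 1 gets an importance of exactly zero because the toy model never reads it, so shuffling it cannot change any prediction. The zero-weight rows in the Titanic table are consistent with the same situation: permuting those columns did not change the model's validation accuracy.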


# NYC Apartments (Regression)

In [0]:
import numpy as np
import pandas as pd

# Read New York City apartment rental listing data
df = pd.read_csv(DATA_PATH+'apartments/renthop-nyc.csv')
assert df.shape == (49352, 34)

# Remove the most extreme 1% prices,
# the most extreme .1% latitudes, &
# the most extreme .1% longitudes
df = df[(df['price'] >= np.percentile(df['price'], 0.5)) & 
        (df['price'] <= np.percentile(df['price'], 99.5)) & 
        (df['latitude'] >= np.percentile(df['latitude'], 0.05)) & 
        (df['latitude'] <= np.percentile(df['latitude'], 99.95)) &
        (df['longitude'] >= np.percentile(df['longitude'], 0.05)) & 
        (df['longitude'] <= np.percentile(df['longitude'], 99.95))]
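
The price filter above keeps rows between the 0.5th and 99.5th percentiles, which is what "most extreme 1%" means here: 0.5% is removed from each tail. A minimal pure-Python sketch of that trimming follows. It assumes the linear-interpolation percentile that `np.percentile` uses by default, and the `percentile` helper and toy values are hypothetical.

```python
def percentile(values, p):
    """Percentile with linear interpolation between neighboring ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical toy prices: 1, 2, ..., 101
toy_prices = list(range(1, 102))
low_cutoff = percentile(toy_prices, 0.5)
high_cutoff = percentile(toy_prices, 99.5)

# Keep only values inside both cutoffs, mirroring the >= and <= filter
trimmed_prices = [price for price in toy_prices
                  if low_cutoff <= price <= high_cutoff]
```

With 101 evenly spaced values the cutoffs fall at 1.5 and 100.5, so only the single smallest and single largest values are dropped.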

In [0]:
# Do train/val/test split
# Train on April 2016
# Validate on May 2016
# Test on June 2016
df['created'] = pd.to_datetime(df['created'], infer_datetime_format=True)
train = df[df.created.dt.month == 4]
val = df[df.created.dt.month == 5]
test = df[df.created.dt.month == 6]

In [0]:
# Wrangle train, val, test sets in the same way
def engineer_features(df):
    
    # Avoid SettingWithCopyWarning
    df = df.copy()
        
    # Does the apartment have a description?
    df['description'] = df['description'].str.strip().fillna('')
    df['has_description'] = df['description'] != ''

    # How long is the description?
    df['description_length'] = df['description'].str.len()

    # How many total perks does each apartment have?
    perk_cols = ['elevator', 'cats_allowed', 'hardwood_floors', 'dogs_allowed',
                 'doorman', 'dishwasher', 'no_fee', 'laundry_in_building',
                 'fitness_center', 'pre-war', 'laundry_in_unit', 'roof_deck',
                 'outdoor_space', 'dining_room', 'high_speed_internet', 'balcony',
                 'swimming_pool', 'new_construction', 'exclusive', 'terrace', 
                 'loft', 'garden_patio', 'common_outdoor_space', 
                 'wheelchair_access']
    df['perk_count'] = df[perk_cols].sum(axis=1)

    # Are cats or dogs allowed?
    df['cats_or_dogs'] = (df['cats_allowed']==1) | (df['dogs_allowed']==1)

    # Are cats and dogs allowed?
    df['cats_and_dogs'] = (df['cats_allowed']==1) & (df['dogs_allowed']==1)

    # Total number of rooms (beds + baths)
    df['rooms'] = df['bedrooms'] + df['bathrooms']
    
    # Extract number of days elapsed in year, and drop original date feature
    df['days'] = (df['created'] - pd.to_datetime('2016-01-01')).dt.days
    df = df.drop(columns='created')

    return df

train = engineer_features(train)
val = engineer_features(val)
test = engineer_features(test)

In [0]:
target = 'price'
features = train.columns.drop(target)

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

In [0]:
import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, module='xgboost')

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

transformers = make_pipeline(ce.OrdinalEncoder(), SimpleImputer())
X_train_transformed = transformers.fit_transform(X_train)
X_val_transformed = transformers.transform(X_val)

eval_set = [(X_train_transformed, y_train), 
            (X_val_transformed, y_val)]

model = XGBRegressor(
    n_estimators=1000, 
    max_depth=10, 
    objective='reg:squarederror', 
    n_jobs=-1, 
)

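# Stop once validation MAE has not improved for 20 rounds.
# Note: xgboost 2.0+ expects eval_metric and early_stopping_rounds
# in the XGBRegressor constructor rather than in fit().
model.fit(X_train_transformed, y_train, eval_set=eval_set, 
          eval_metric='mae', early_stopping_rounds=20)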

[0]	validation_0-mae:3193.69	validation_1-mae:3250.54
Multiple eval metrics have been passed: 'validation_1-mae' will be used for early stopping.

Will train until validation_1-mae hasn't improved in 20 rounds.
[1]	validation_0-mae:2877.15	validation_1-mae:2931
[2]	validation_0-mae:2592.08	validation_1-mae:2643.52
[3]	validation_0-mae:2335.64	validation_1-mae:2385.74
[4]	validation_0-mae:2104.93	validation_1-mae:2159.56
[5]	validation_0-mae:1897.3	validation_1-mae:1953.7
[6]	validation_0-mae:1710.61	validation_1-mae:1763.23
[7]	validation_0-mae:1542.69	validation_1-mae:1595.62
[8]	validation_0-mae:1391.39	validation_1-mae:1446.19
[9]	validation_0-mae:1255.33	validation_1-mae:1313.45
[10]	validation_0-mae:1133.13	validation_1-mae:1194.59
[11]	validation_0-mae:1023.82	validation_1-mae:1090.49
[12]	validation_0-mae:926.454	validation_1-mae:998.121
[13]	validation_0-mae:840.288	validation_1-mae:917.633
[14]	validation_0-mae:763.6	validation_1-mae:842.432
[15]	validation_0-mae:696.818	valid

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=10, min_child_weight=1, missing=None, n_estimators=1000,
             n_jobs=-1, nthread=None, objective='reg:squarederror',
             random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             seed=None, silent=None, subsample=1, verbosity=1)

In [0]:
import eli5
from eli5.sklearn import PermutationImportance

permuter = PermutationImportance(model, scoring='neg_mean_absolute_error', n_iter=3)
permuter.fit(X_val_transformed, y_val)

feature_names = X_val.columns.tolist()
eli5.show_weights(permuter, top=None, feature_names=feature_names)

| Weight | Feature |
|---|---|
| 340.0857 ± 3.1054 | bathrooms |
| 276.0006 ± 0.7070 | longitude |
| 237.4229 ± 3.0805 | bedrooms |
| 186.6951 ± 3.9940 | latitude |
| 153.3382 ± 1.9519 | rooms |
| 80.6217 ± 3.0040 | doorman |
| 56.4000 ± 0.9841 | interest_level |
| 32.4031 ± 0.9254 | fitness_center |
| 25.3407 ± 1.2702 | perk_count |
| 10.9165 ± 1.8493 | hardwood_floors |
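
Because the scoring here is `neg_mean_absolute_error`, each weight can be read in the target's own units: it is roughly how much the validation MAE (in dollars of rent) rises when that column is shuffled. The arithmetic can be sketched as below. The helper name and all prices are hypothetical, and the "after shuffle" predictions are made up rather than produced by a model.

```python
from statistics import mean


def neg_mae(y_true, y_pred):
    """Negated mean absolute error, so that higher is better."""
    return -mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical rents and predictions
toy_rents = [3000, 2500, 4000]
toy_pred_original = [3100, 2400, 4100]       # off by 100 each
toy_pred_after_shuffle = [3500, 2000, 4500]  # off by 500 each

base_score = neg_mae(toy_rents, toy_pred_original)
permuted_score = neg_mae(toy_rents, toy_pred_after_shuffle)

# Importance = score drop = increase in MAE
toy_weight = base_score - permuted_score
```

Here the score falls from -100 to -500, so the weight is 400: shuffling the feature made predictions worse by 400 on average. Read the same way, the `bathrooms` row suggests that permuting that column raised validation MAE by about 340.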
