### **Decision Trees**
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Decision trees are highly interpretable: once a tree is trained, you can literally draw it.
At each node, the tree splits the data on the feature whose split provides the highest information gain.
*  clean data with **outliers and missing values**
*  use scikit-learn **pipelines**
*  use scikit-learn for **decision trees**
*  get and interpret **feature importances** of a tree-based model
*  understand why decision trees are useful to model **non-linear, non-monotonic** relationships and **feature interactions**
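
To make the information gain criterion mentioned above concrete, here is a minimal sketch in plain Python. The toy labels and the helper names entropy and information_gain are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy labels: 4 functional and 4 non functional pumps
parent = ["functional"] * 4 + ["non functional"] * 4

# Split A separates the two classes perfectly
gain_a = information_gain(parent, parent[:4], parent[4:])

# Split B leaves both children exactly as mixed as the parent
mixed = ["functional", "functional", "non functional", "non functional"]
gain_b = information_gain(parent, mixed, mixed)

print("Gain of split A:", gain_a)
print("Gain of split B:", gain_b)
```

A tree builder would prefer split A, since it removes all of the parent's uncertainty while split B removes none. Note that scikit-learn's DecisionTreeClassifier defaults to criterion='gini'; passing criterion='entropy' selects splits by an entropy-based gain like the one sketched here.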

### Clean data with outliers and missing values

In [0]:
# Check Pandas Profiling version
import pandas_profiling
pandas_profiling.__version__

In [0]:
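# New code for Pandas Profiling version 2.4
from pandas_profiling import ProfileReport

# to_notebook_iframe() displays the report and returns None,
# so keep the ProfileReport object itself in `profile`
profile = ProfileReport(train, minimal=True)
profile.to_notebook_iframe()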

In [0]:
import numpy as np


def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    cols_with_zeros = ['longitude', 'latitude']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
            
    # quantity & quantity_group are duplicates, so drop one
    X = X.drop(columns='quantity_group')
    
    # return the wrangled dataframe
    return X


train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

### Pipelines
We can combine steps with pipelines: Encode, Impute, Scale, Fit, Predict!

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

*  **Convenience and encapsulation.** You only have to call fit and predict once on your data to fit a whole sequence of estimators.
*  **Joint parameter selection.** You can grid search over parameters of all estimators in the pipeline at once.
*  **Safety.** Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
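
The joint parameter selection point can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic (from make_classification, not the notebook's dataset), and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

pipe = make_pipeline(
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42),
)

# make_pipeline names each step after its lowercased class name,
# and step parameters are addressed as <step>__<parameter>
param_grid = {
    "simpleimputer__strategy": ["mean", "median"],
    "decisiontreeclassifier__max_depth": [2, 4, 8],
}

# Each cross-validation fold refits the imputer and the tree together,
# so imputation statistics never come from the held-out fold
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because the whole pipeline is passed to the search, the same mechanism also covers the safety point: the transformers are refit on the training part of every fold.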

In [0]:
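import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(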
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(max_iter=1000)
)
# Fit on train
pipeline.fit(X_train, y_train)

# Score on val
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on test
y_pred = pipeline.predict(X_test)

**Get and plot coefficients**

This is slightly harder when using pipelines.

The pipeline doesn't have a .coef_ attribute. But the model inside the pipeline does.

In [0]:
import matplotlib.pyplot as plt
import pandas as pd

model = pipeline.named_steps['logisticregression']
encoder = pipeline.named_steps['onehotencoder']
encoded_columns = encoder.transform(X_val).columns
coefficients = pd.Series(model.coef_[0], encoded_columns)
plt.figure(figsize=(10,30))
coefficients.sort_values().plot.barh(color='grey');

### Use scikit-learn for decision trees

In [0]:
# These are the only two changes from the previous cell:
# Remove StandardScaler (it's not needed or helpful for trees)
# Change the model from LogisticRegression to DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    DecisionTreeClassifier(random_state=42)
)

# Fit on train
pipeline.fit(X_train, y_train)

# Score on train, val
print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# Predict on test 
y_pred = pipeline.predict(X_test)

In [0]:
# For visualizations
import graphviz
from IPython.display import display
from sklearn.tree import export_graphviz

model = pipeline.named_steps['decisiontreeclassifier']
encoder = pipeline.named_steps['onehotencoder']
encoded_columns = encoder.transform(X_val).columns

dot_data = export_graphviz(model, 
                           out_file=None, 
                           max_depth=3, 
                           feature_names=encoded_columns,
                           class_names=model.classes_, 
                           impurity=False, 
                           filled=True, 
                           proportion=True, 
                           rounded=True)   
display(graphviz.Source(dot_data))

**Using two features**

In [0]:
train_location = X_train[['longitude', 'latitude']].copy()
val_location = X_val[['longitude', 'latitude']].copy()

In [0]:
dt = make_pipeline(
    SimpleImputer(), 
    DecisionTreeClassifier(max_depth=16, random_state=42)
)

dt.fit(train_location, y_train)
print('Decision Tree:')
print('Train Accuracy', dt.score(train_location, y_train))
print('Validation Accuracy', dt.score(val_location, y_val))
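
A tree can do well on raw coordinates like these because it does not need the target to rise or fall steadily with a feature. The sketch below illustrates the idea on synthetic data only: the single feature and the rule "class 1 when the value lies between -1 and 1" are assumptions made up for this example, not properties of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# One synthetic feature; the positive class sits in the middle of its range,
# so the relationship is non-monotonic
X = rng.uniform(-3, 3, size=(1000, 1))
y = (np.abs(X[:, 0]) < 1).astype(int)

X_train, X_val = X[:800], X[800:]
y_train, y_val = y[:800], y[800:]

# A linear model can only place a single threshold on one feature
linear = LogisticRegression().fit(X_train, y_train)

# A depth-2 tree can place two thresholds and isolate the middle band
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)

linear_acc = linear.score(X_val, y_val)
tree_acc = tree.score(X_val, y_val)

print("Logistic regression validation accuracy:", linear_acc)
print("Decision tree validation accuracy:", tree_acc)
```

The same reasoning extends to two features: successive splits on longitude and latitude carve out rectangles, which is one way a tree can capture an interaction between features.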

### Feature importances of a tree-based model
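
A minimal sketch of reading feature importances from a fitted tree. It uses synthetic data with hypothetical column names so that it runs on its own. With the pipeline from the cells above, the fitted tree would instead come from pipeline.named_steps['decisiontreeclassifier'] and the column names from encoded_columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names
X, y = make_classification(
    n_samples=500,
    n_features=5,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X, y)

# feature_importances_ holds each feature's normalized share of the
# total impurity reduction, so the values are non-negative and sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values()
print(importances)
```

Calling importances.plot.barh() draws the same kind of horizontal bar chart used for the coefficients earlier. Unlike coefficients, importances carry no sign: they indicate how much a feature was used to reduce impurity, not the direction of its effect.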

### **Random Forests**
*   use scikit-learn for **random forests**
*   do **ordinal encoding** with high-cardinality categoricals
*   understand how categorical encodings affect trees differently compared to linear models
*   understand how tree ensembles reduce overfitting compared to a single decision tree with unlimited depth

Two takeaway messages:
1.  Try tree ensembles when you do machine learning with labeled, tabular data.
2.  One-hot encoding isn’t the only way, and may not be the best way, of categorical encoding for tree ensembles.
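
The sketch below puts both messages together on synthetic data. Everything about the data is an assumption made for illustration: a hypothetical 50-level categorical column, one numeric column, and a noisy target. It also swaps in scikit-learn's own OrdinalEncoder rather than the category_encoders package used earlier, so the example depends only on scikit-learn, NumPy, and pandas.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000

# Hypothetical data: a 50-level categorical, a numeric column, a noisy target
codes = rng.integers(0, 50, size=n)
number = rng.normal(size=n)
X = pd.DataFrame({"category": [f"level_{c}" for c in codes], "number": number})
y = ((number + (codes % 2) + rng.normal(size=n)) > 0.5).astype(int)

X_train, X_val = X.iloc[:1500], X.iloc[1500:]
y_train, y_val = y[:1500], y[1500:]


def make_model(estimator):
    """Ordinal-encode the categorical column, pass the numeric one through."""
    encode = make_column_transformer(
        (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["category"]),
        remainder="passthrough",
    )
    return make_pipeline(encode, estimator)


# A single tree with unlimited depth versus an ensemble of 100 trees
tree = make_model(DecisionTreeClassifier(random_state=42)).fit(X_train, y_train)
forest = make_model(RandomForestClassifier(n_estimators=100, random_state=42)).fit(X_train, y_train)

tree_train, tree_val = tree.score(X_train, y_train), tree.score(X_val, y_val)
forest_train, forest_val = forest.score(X_train, y_train), forest.score(X_val, y_val)

print("Tree   train / validation accuracy:", tree_train, tree_val)
print("Forest train / validation accuracy:", forest_train, forest_val)
```

The unlimited-depth tree memorizes the noisy training labels, so its validation accuracy falls well below its training accuracy. Averaging many trees typically narrows that gap, though the exact numbers depend on the data. Ordinal encoding keeps the categorical in a single column instead of 50 one-hot columns, and a tree can still separate levels by splitting that column repeatedly.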

