# Python for Machine Learning

### *Session \#6*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then **ENTER** to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Decision Trees

In [124]:
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

from yellowbrick.classifier import ConfusionMatrix, ROCAUC
from yellowbrick.model_selection import ValidationCurve

from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv("breast_cancer.csv", usecols=range(1, 32))

X = df.drop(columns='diagnosis')
y = df['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

### Warm Ups

*Type the given code into the cell below*

---
**Create a decision tree:** 
```python
model = DecisionTreeClassifier(max_depth=2)
model.fit(X_train, y_train)
```

**Plot tree:** 
```python
plot_tree(model, 
          feature_names=list(X.columns), 
          class_names=list(model.classes_), 
          filled=True)
```

In [None]:
plt.figure(figsize=(14,6))

# Add your code below

**Show Gini importance of features:** 
```python
sorted(zip(model.feature_importances_, X.columns))
```
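As a rough illustration of what the sorted importances look like, here is a minimal sketch. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`, so column names and labels differ from the workshop file, and the `_demo` names are made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled copy, not breast_cancer.csv
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name, largest first
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False).head(5))
```

A tree with `max_depth=2` makes at most three splits, so at most three features can have a non-zero importance.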

### Exercises
---

**1. In the decision tree graphed above, how would the following two cases be classified?**

* concave_points_mean = 0.050
* area_se = 42
* concavity_worst = 0.5


* concave points_mean = 0.092
* area_se = 32
* concavity_worst = 0.6


*Hint: If the condition on the top line of a box is true, go left. If it is false, go right.*

In [96]:
# M (Second leaf from left)

# M (Far-right leaf)

**2. What is the Gini impurity of the first cut? The second?**

Gini reminder: Imagine choosing a datapoint at random, then assigning it a class label at random, with each label's probability equal to that class's proportion in the group. The Gini impurity is the chance that the datapoint gets the wrong label.
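The reminder above can be turned into a few lines of plain Python. This is a minimal sketch, and `gini_impurity` is a helper name made up for this example, not a scikit-learn function.

```python
from collections import Counter

def gini_impurity(labels):
    """Chance of mislabeling a random point when labels are drawn from the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [count / n for count in Counter(labels).values()]
    # 1 - sum(p_k^2) is the same as summing p_k * (1 - p_k) over the classes
    return 1.0 - sum(p * p for p in proportions)

# Hypothetical groups, not the ones in the images
print(gini_impurity(["a", "b"]))       # 0.5
print(gini_impurity(["a", "a", "a"]))  # 0.0
```

A group containing a single class always scores 0, and a 50/50 split of two classes scores 0.5, the maximum for two classes.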

<br>
<div style="display: flex;">
    <img src="../images/gini1.png"> 
    <img src="../images/gini2.png">
</div>

In [None]:
# FIRST IMAGE
# ------------------------
# Left side:  (1/3)*(2/3) + (2/3)*(1/3) = 0.444
# Right side: (1/4)*(3/4) + (3/4)*(1/4) = 0.375

# SECOND IMAGE
# ------------------------
# Left side:  (1/4)*(3/4) + (3/4)*(1/4) = 0.375
# Right side: (0)*(1) + (1)*(0) = 0.0

**3. What is the accuracy score of the decision tree above?**

**Retrain without the** `max_depth` **parameter. Compare the scores on the training and test sets. What is happening to the model?**
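One way to set up the comparison is sketched below. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`, and the `_demo`, `_tr` and `_te` names are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled copy, not breast_cancer.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=100)

scores = {}
for depth in (2, None):  # None removes the depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = {"train": tree.score(X_tr, y_tr), "test": tree.score(X_te, y_te)}
    print(f"max_depth={depth}: train={scores[depth]['train']:.3f}, "
          f"test={scores[depth]['test']:.3f}")
```

The number to watch is the gap between the train and test scores for each setting.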

**4. Use a validation curve to find the point where increasing** `max_depth` **stops improving the validation score.**

Hint: `ValidationCurve()` takes the model, then the parameter name (`param_name`), then the values you want to try (`param_range`).
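If you want to see the numbers behind such a curve, the sketch below uses scikit-learn's `validation_curve` function instead of the Yellowbrick visualizer (a swapped-in technique that returns scores rather than drawing a plot). It assumes the bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`, and the depth range is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled copy, not breast_cancer.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)
depths = np.arange(1, 11)

# One row per max_depth value, one column per cross-validation fold
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X_demo, y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
)

mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for depth, train, valid in zip(depths, mean_train, mean_valid):
    print(f"max_depth={depth}: train={train:.3f}, validation={valid:.3f}")
```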

**5. Retrain the model with** `max_depth` **set to the ideal value you found.**

**Which features were most important to the decision tree?**

## II. Random Forest

### Warm Ups

*Type the given code into the cell below*

---

**Create a random forest**: `model = RandomForestClassifier(n_estimators=100)`

**Feature importance:** `model.feature_importances_`

**Access internal decision trees:** `model.estimators_`

**Grid Search:** 
```python
params = {'n_estimators': range(100, 160, 10)}
grid = GridSearchCV(model, params).fit(X, y)
grid.best_params_
```

### Exercises
---

**1. Redo the validation curve exercise with your RandomForest classifier.** 

**Do deeper trees cause overfitting?**

**3. The** `class_weight` **parameter takes a dictionary mapping class labels to weights, and changes how important each class is to the model.**

**Try setting it to** `{'M': 10}` **to make the malignant cases 10x as important. Use a** `ConfusionMatrix` **visualizer to see how this affects the type of errors the model makes.**
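The sketch below shows the same idea with plain numbers, using `sklearn.metrics.confusion_matrix` rather than the Yellowbrick visualizer. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`. In that copy the labels are `0` (malignant) and `1` (benign) instead of `'M'` and `'B'`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): labels are 0 = malignant, 1 = benign
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=100, stratify=y_demo
)

matrices = {}
for name, weights in [("unweighted", None), ("malignant x10", {0: 10, 1: 1})]:
    forest = RandomForestClassifier(
        n_estimators=100, class_weight=weights, random_state=0
    )
    forest.fit(X_tr, y_tr)
    # Rows are true classes, columns are predicted classes, in label order [0, 1]
    matrices[name] = confusion_matrix(y_te, forest.predict(X_te), labels=[0, 1])
    print(name, matrices[name], sep="\n")
```

The top-right cell counts malignant cases predicted as benign, which is the error the extra weight is meant to discourage.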

**4. Further parameters to** `RandomForestClassifier` **are:**

    * max_depth -- maximum # of layers in each decision tree
    * max_features -- # of columns considered at each split
    
**Use grid search to optimize for these hyperparameters**

Note: The more combinations you try, the longer this will run! Start with trying only a few.
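A deliberately small grid might look like the sketch below. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`, and the candidate values are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in data (assumption): scikit-learn's bundled copy, not breast_cancer.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# 2 x 2 = 4 combinations, each cross-validated 3 times, plus one final refit
param_grid = {"max_depth": [2, 4], "max_features": ["sqrt", 0.5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Each value added to either list multiplies the number of models trained, which is why the run time grows so quickly.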

**5. A major benefit of random forest models is that they perform automatic feature selection.**

**Run the code below to add 10 columns of random data to your dataset. Re-train your model and score it.**

In [127]:
random_df = pd.DataFrame(np.random.randint(0,100,size=(len(df), 10)), 
                         columns=[f"random_{i}" for i in range(10)])

big_df = pd.concat([df, random_df], axis=1)
# Build X and y from big_df so the random columns are actually included
X = big_df.drop(columns='diagnosis')
y = big_df['diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
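To check how much the forest relies on the random columns, you can add up their feature importances. The sketch below assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `breast_cancer.csv`, and names such as `X_big` and `noise_share` are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): scikit-learn's bundled copy, not breast_cancer.csv
data = load_breast_cancer(as_frame=True)

# Ten columns of random integers, aligned to the real data's index
rng = np.random.default_rng(0)
noise = pd.DataFrame(
    rng.integers(0, 100, size=(len(data.data), 10)),
    columns=[f"random_{i}" for i in range(10)],
    index=data.data.index,
)
X_big = pd.concat([data.data, noise], axis=1)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_big, data.target)

# Share of total importance that went to the random columns
importances = pd.Series(forest.feature_importances_, index=X_big.columns)
noise_share = importances[noise.columns].sum()
print(f"Importance assigned to random columns: {noise_share:.3f}")
```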

## III. Final Practice: Start to Finish

### Exercises
---

**1. Let's bring it all together! Here you will apply all the individual skills you've learned to a raw dataset.** 

**You will need to:** 

* Explore the data
* Select an appropriate model
* Split the data into X, y and train/test
* Handle null values, if present
* Do feature engineering
* Use appropriate evaluation methods
* Optimize hyperparameters
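As a rough skeleton of how several of these steps can fit together, here is a minimal sketch on invented toy data. The column names `doors` and `mileage`, the values, and the target rule are all hypothetical and do not come from `cars.csv` or `diamonds.csv`. It uses scikit-learn's `make_pipeline`, and the exploration and hyperparameter steps are left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column names and values are invented for illustration
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "doors": rng.choice(["2", "4", "5more"], size=n),
    "mileage": rng.normal(50_000, 15_000, size=n),
})
toy.loc[rng.choice(n, size=20, replace=False), "mileage"] = np.nan  # inject nulls
toy["target"] = (toy["doors"] == "4").astype(int)

# Split into X, y and train/test
X_toy = toy.drop(columns="target")
y_toy = toy["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# One-hot encode the text column, fill nulls in the numeric column
preprocess = ColumnTransformer([
    ("categories", OneHotEncoder(handle_unknown="ignore"), ["doors"]),
    ("numbers", SimpleImputer(strategy="median"), ["mileage"]),
])
pipeline = make_pipeline(
    preprocess, RandomForestClassifier(n_estimators=50, random_state=0)
)

pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means the imputer and encoder are fit on the training rows only, so nothing from the test set leaks into training.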

In [188]:
# "target" column is what we want to predict
# 0 = Bad Car, 3 = Very Good Car

cars = pd.read_csv("cars.csv")

In [187]:
# "price" column is what we want to predict

diamonds = pd.read_csv("diamonds.csv")