# Python for Machine Learning

### *Session \#1*


### Helpful shortcuts
---

**SHIFT** + **ENTER** --> Execute cell

**UP/DOWN ARROWS** --> Move between cells (then **ENTER** to start typing)

**TAB** --> See autocomplete options

**ESC** then **b** --> Create a cell below the current one

**ESC** then **dd** --> Delete cell

**\[python expression\]?** --> Explanation of that Python expression

**ESC** then **m** then **ENTER** --> Switch to Markdown mode

## I. Preparing Data

### Warm Ups

*Type the given code into the cell below*

---

**Import pandas and read CSV**: 
```python
import pandas as pd
df = pd.read_csv("heart_attacks.csv")
```

**Isolate a column:** `y = df['current_smoker']`

**Use subset of columns**
```python
columns = ['current_smoker', 'education']
X = df[columns]
```

**Drop column:** `df.drop('heart_attack', axis=1)`

*Note: You can drop multiple columns at once by passing a list of column names. `drop` returns a new dataframe and leaves `df` unchanged.*
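As a minimal sketch of dropping several columns at once, the snippet below uses a small made-up dataframe in place of `heart_attacks.csv`. The name `toy` and all of its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the heart attack data; the values are invented.
toy = pd.DataFrame({
    "male": [1, 0, 1],
    "current_smoker": [0, 1, 1],
    "education": [2, 1, 4],
    "heart_attack": [0, 0, 1],
})

# Pass a list of names to drop more than one column in a single call.
# drop returns a new dataframe; `toy` itself keeps all four columns.
reduced = toy.drop(["education", "heart_attack"], axis=1)

print(list(reduced.columns))
```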

**Split data into train/test sets:**
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

*Note: The default split is 75% train, 25% test. You can change the proportion with the* `test_size` *parameter.*
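To see the effect of `test_size`, here is a small sketch on made-up arrays. The names `X_demo` and `y_demo` are hypothetical stand-ins for the real feature matrix and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, and a binary target.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# Hold out 20% of the rows for testing instead of the default 25%.
# random_state fixes the shuffle so the same rows are chosen on every run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

print(X_tr.shape, X_te.shape)
```

Rerunning the cell with the same `random_state` should give the same split, which makes results easier to compare between runs.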

### Exercises
---

**1. Create the feature matrix** `X` **by forming a dataframe from the columns** `male`, `current_smoker` **and** `education`

**2. Now create the feature matrix** `X` **by instead just dropping** `heart_attack` **from the original dataframe**

**3. Create the target vector** `y` **from the column** `heart_attack`

**4. Use** `train_test_split` **to divide your data into** `X_train`, `X_test`, `y_train`, `y_test`

**Add the parameter** `random_state=1` **to lock in a particular random selection of rows.**

## II. K-Nearest Neighbors

### Warm Ups

*Type the given code into the cell below*

---

**Create KNN Classifier**: 
```python
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
```

**Fit model**: `model.fit(X_train, y_train)`

**Classify using model**: `model.predict(X_test)`

**Evaluate accuracy of model**: `model.score(X_test, y_test)`
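The fit, predict and score steps can be tried end to end on a tiny invented dataset. This is only a sketch: the points in `X_toy` and the labels in `y_toy` are assumptions chosen so the two classes are clearly separated.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented toy data: class 0 sits near the origin, class 1 near (10, 10).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_toy, y_toy)

# Two unseen points, one close to each cluster, with their true labels.
X_new = np.array([[0.5, 0.5], [10.5, 10.5]])
y_new = np.array([0, 1])

predictions = knn.predict(X_new)

# score returns accuracy: the fraction of predictions that match the labels.
accuracy = knn.score(X_new, y_new)
manual_accuracy = (predictions == y_new).mean()

print(predictions, accuracy, manual_accuracy)
```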

### Exercises
---

**1. Create the **

*Note: In this exercise the `label` column is what we are trying to predict*

**2. Use a validation curve to choose an optimal** `n_neighbors`

## III. Evaluating Classifiers

### Warm Ups

*Type the given code into the cell below*

---

**Import the visualizers:**
```python
from yellowbrick.classifier import ConfusionMatrix, ClassPredictionError
from yellowbrick.model_selection import ValidationCurve
```

**Plot confusion matrix:** 
```python
matrix = ConfusionMatrix(model)
matrix.fit(X_train, y_train)
matrix.score(X_test, y_test)
matrix.finalize()
```

**Plot class prediction error:** 
```python
error = ClassPredictionError(model)
error.fit(X_train, y_train)
error.score(X_test, y_test)
error.finalize()
```

**Plot validation curve:**
```python
viz = ValidationCurve(
    model, param_name="n_neighbors",
    param_range=range(1, 11)
)
viz.fit(X, y)
viz.finalize()
```
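If you want the numbers behind a validation curve rather than a plot, scikit-learn's `validation_curve` function returns the cross-validated scores directly. The sketch below swaps in that function and uses synthetic data from `make_classification`; the dataset, the 5-fold setting and the names `X_syn`, `y_syn` and `best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the heart attack dataset.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

param_range = np.arange(1, 11)

# One row of scores per n_neighbors value, one column per cross-validation fold.
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X_syn, y_syn,
    param_name="n_neighbors", param_range=param_range, cv=5,
)

# Average over folds and pick the value with the best validation accuracy.
mean_val = val_scores.mean(axis=1)
best_k = param_range[mean_val.argmax()]

print(best_k, mean_val.round(3))
```

Picking the single highest score is a simple rule of thumb. When several values score about the same, a slightly larger `n_neighbors` is often a reasonable choice because it smooths out noise.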

### Exercises
---
**1. Plot the confusion matrix. After accuracy, another important metric is sensitivity (also called recall or true positive rate).**

**Of the patients who DID have a heart attack, what percentage were correctly identified by the model?**

**2. Plot the class prediction error. Which type of patient is more common -- with or without a heart attack?**

**3. Use the validation curve to find a reasonable number of neighbors for the KNN classifier.**
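For reference, the sensitivity mentioned in exercise 1 is the number of true positives divided by the number of actual positives. A small worked sketch with hypothetical labels (not taken from the heart attack data):

```python
import numpy as np

# Hypothetical labels: 1 = heart attack, 0 = no heart attack.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Actual positives the model caught, and actual positives it missed.
true_positives = np.sum((y_true == 1) & (y_pred == 1))
false_negatives = np.sum((y_true == 1) & (y_pred == 0))

# Sensitivity (recall): share of actual positives that were identified.
sensitivity = true_positives / (true_positives + false_negatives)

# Accuracy: share of all predictions that were correct.
accuracy = np.mean(y_true == y_pred)

print(sensitivity, accuracy)
```

In this made-up case accuracy is higher than sensitivity, because most of the correct predictions come from the more common negative class.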


## IV. Bonus Section

### Exercises
---
**1. Divide the dataset into a training set and test set, then train a KNeighborsClassifier on it.**

*Note: In this exercise the `label` column is what we are trying to predict*

**2. Use the** `jupyter_drawing_pad` **widget to draw a digit**

*Hint: Outline the number multiple times to get more data points*

```python
import jupyter_drawing_pad as jd
import numpy as np
widget = jd.CustomBox()
widget.drawing_pad
```

**3. Run the following code, which converts the drawing into a simple numerical form and plots it**

```python
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

def extract_data(widget):
    """Convert the drawn points into a 1x64 row of 8x8 pixel intensities."""
    x_data = widget.drawing_pad.data[0]
    y_data = widget.drawing_pad.data[1]

    # Lower edges of the grid cells along each axis
    x_axis = np.linspace(min(x_data), max(x_data), 8)
    y_axis = np.linspace(min(y_data), max(y_data), 8)

    x_interval = x_axis[1] - x_axis[0]
    y_interval = y_axis[1] - y_axis[0]

    data = np.array(list(zip(x_data, y_data)), dtype=[('x', '<f8'), ('y', '<f8')])
    totals = np.zeros((8, 8))

    # Count the points inside each cell, then scale to an intensity capped at 50
    for x_num, x in enumerate(x_axis):
        for y_num, y in enumerate(y_axis):
            in_cell = (
                (data['x'] > x) & (data['x'] < x + x_interval)
                & (data['y'] > y) & (data['y'] < y + y_interval)
            )
            count = len(data[in_cell])
            totals[x_num, y_num] = count * 5 if count < 10 else 50
    return np.rot90(totals).reshape(1, -1)

num = extract_data(widget)
plt.imshow(num.reshape(8, 8), cmap='Greys')
```

**4. Use** `model.predict(num)` **to see if the model got your digit right. If not, try using** `model.predict_proba(num)` **to see how close it was**
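For a KNN classifier, `predict_proba` reports the share of the nearest neighbors that belong to each class. The sketch below shows this on an invented one-feature dataset rather than on digits; the names `X_line`, `y_line` and `query` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented one-feature data: class 0 on the left, class 1 on the right.
X_line = np.array([[0], [1], [2], [10], [11], [12]])
y_line = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_line, y_line)

# The three nearest neighbors of 6.8 are 10, 11 (class 1) and 2 (class 0).
query = np.array([[6.8]])

label = knn.predict(query)
proba = knn.predict_proba(query)

# Columns of proba follow the order of knn.classes_.
print(knn.classes_, label, proba)
```

Here one of the three neighbors votes for class 0 and two vote for class 1, so the probabilities come out as one third and two thirds. A split vote like this suggests the prediction was a close call.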