# Python for Machine Learning

### *Session \#1*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then **ENTER** to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Preparing Data

### Warm Ups

*Type the given code into the cell below*

---

**Import pandas and read CSV**: 
```python
import pandas as pd
df = pd.read_csv("heart_attack.csv")
```

In [307]:
import pandas as pd
df = pd.read_csv("heart_attack.csv")

**Set target vector:** `y = df['current_smoker']`

In [3]:
y = df['current_smoker']

**Set feature matrix**
```python
columns = ['current_smoker', 'education']
X = df[columns]
```

**Drop column:** `df.drop(columns=['heart_attack'])`

*Note: You can drop multiple columns at once by passing a list of column names.* `drop` *returns a new dataframe and leaves the original unchanged.*
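A minimal sketch of dropping several columns in one call. The dataframe `toy` and its values are made up for illustration; only the column names echo the dataset.

```python
import pandas as pd

# Toy dataframe: column names mirror the dataset, values are invented
toy = pd.DataFrame({
    'age': [39, 46, 48],
    'glucose': [77.0, 76.0, 70.0],
    'tot_chol': [195.0, 250.0, 245.0],
    'heart_attack': [0, 0, 1],
})

# Passing a list drops several columns in one call
features = toy.drop(columns=['heart_attack', 'glucose'])

# drop() returns a new dataframe; `toy` still has all four columns
print(list(features.columns))
print(list(toy.columns))
```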

**Split data into train/test sets:**
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

*Note: The default split is 75% train, 25% test. You can change the proportion with the* `test_size` *parameter.*
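As a rough illustration of the split proportions, the sketch below runs `train_test_split` on made-up arrays (`X_demo` and `y_demo` are placeholders, not the heart attack data) and checks how many rows land in each set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100) % 2

# Default: 75% of the rows for training, 25% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))

# test_size=0.2 holds out 20% of the rows instead
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(X_demo, y_demo, test_size=0.2)
print(len(X_tr20), len(X_te20))
```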

In [56]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Exercises
---

**1. Create the feature matrix** `X` **by forming a dataframe from the columns** `male, current_smoker` **and**   `education` 

In [None]:
columns = ['male', 'current_smoker', 'education']

X = df[columns]

**2. Now create the feature matrix** `X` **by instead just dropping** `heart_attack` **from the original dataframe**

In [None]:
X = df.drop(columns=['heart_attack'])

**3. Create the target vector** `y` **from the column** `heart_attack`

In [None]:
y = df['heart_attack']

**4. Use** `train_test_split` **to divide your data into** `X_train`, `X_test`, `y_train`, `y_test`

**Add the parameter** `random_state=1` **to lock in a particular random selection of rows.**

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
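
To see what "locking in" the selection means, here is a small sketch on made-up arrays (again, `X_demo` and `y_demo` are placeholders). Two calls with the same `random_state` should pick exactly the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features; y_demo doubles as a row id
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state select the same test rows
_, _, _, y_test_a = train_test_split(X_demo, y_demo, random_state=1)
_, _, _, y_test_b = train_test_split(X_demo, y_demo, random_state=1)

print((y_test_a == y_test_b).all())
```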

## II. Distance and Neighbors

### Warm Ups

*Type the given code into the cell below*

---

**Import the square root function and use it:** `from math import sqrt`, then `sqrt(16)`

In [37]:
from math import sqrt
sqrt(16)

**Create column by applying function to rows:** 
```python 
df.apply(lambda row: row['age'] * 7, axis=1)
```
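
A small sketch of what `apply` with `axis=1` does, using a made-up one-column dataframe. The column name `age_times_7` is just an illustrative choice.

```python
import pandas as pd

# Toy dataframe with invented ages
toy = pd.DataFrame({'age': [39, 46, 48]})

# apply() with axis=1 calls the function once per row and returns a Series,
# which can be stored as a new column
toy['age_times_7'] = toy.apply(lambda row: row['age'] * 7, axis=1)

# For simple arithmetic, operating on the whole column gives the same values
same_result = toy['age'] * 7

print(toy['age_times_7'].tolist())
```

Row-wise `apply` is the more general tool: it is what lets a function such as a distance calculation look at several columns of the same row at once.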

**Sort dataframe by column values:** 
```python 
df.sort_values('age')
```
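
The sketch below, on a made-up dataframe, shows two details worth knowing: `sort_values` returns a new dataframe rather than reordering the original, and `ascending=False` reverses the order.

```python
import pandas as pd

# Toy dataframe with letter labels so the row order is easy to follow
toy = pd.DataFrame({'age': [48, 39, 46]}, index=['a', 'b', 'c'])

youngest_first = toy.sort_values('age')
oldest_first = toy.sort_values('age', ascending=False)

# Index labels travel with their rows; the original order is untouched
print(youngest_first.index.tolist())
print(oldest_first.index.tolist())
print(toy.index.tolist())
```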

### Exercises
---

![image](../images/coordinateplane.png)

**1. Use the square root function,** `sqrt()`**, to find the distance between the two points.**

*Hint: In two dimensions, the formula for distance is $\sqrt{(x_0-x_1)^2 + (y_0 - y_1)^2}$*

In [64]:
x_diff = 4-(-2)
y_diff = 3-1

sqrt(x_diff**2 + y_diff**2)

6.324555320336759

![image](../images/threedimensions.jpeg)

**2. Find the distance between the two points in three dimensions.**

*Hint: In three dimensions, the formula for distance is $\sqrt{(x_0-x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$*

In [65]:
x_diff = 3 - 2
y_diff = -1 - 1
z_diff = 5 - (-1)

sqrt(x_diff**2 + y_diff**2 + z_diff**2)

6.4031242374328485
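
As a cross-check, Python 3.8 and later ship `math.dist`, which computes the same Euclidean distance for points of any dimension. The coordinates below are read off the subtractions in the cell above, so treat them as an assumption about the picture.

```python
from math import dist, sqrt

# Coordinates inferred from the subtractions in the worked answer
point_a = (3, -1, 5)
point_b = (2, 1, -1)

# Formula written out by hand
manual = sqrt((3 - 2)**2 + (-1 - 1)**2 + (5 - (-1))**2)

# Same calculation using the standard library (Python 3.8+)
builtin = dist(point_a, point_b)

print(manual, builtin)
```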

**3. Complete the function below, which finds the distance between two rows of data in 2D.**

*Hint: In two dimensions, the formula for distance is $\sqrt{(x_0-x_1)^2 + (y_0 - y_1)^2}$*

In [29]:
def distance_2d(row1, row2):
    
    # define x_diff = difference between row1['x'] and row2['x']
    x_diff = row1['x'] - row2['x']
    
    # define y_diff = difference between row1['y'] and row2['y']
    y_diff = row1['y'] - row2['y']
    
    # define total = square x_diff and y_diff, add them together
    total = x_diff**2 + y_diff**2
    
    # take the square root 
    return sqrt(total)


# Use the rows below to test your function!
# Answer should be 37
test_row_1 = pd.Series({'x': 10, 'y': -30})
test_row_2 = pd.Series({'x': -2, 'y': 5})

distance_2d(test_row_1, test_row_2)

37.0

**4. Complete the function below, which finds the distance between two rows of data in ANY dimension.**

In [34]:
def distance(row1, row2):
    total = 0
    for column in row1.index:
        # Find the difference between row1, row2 along this column
        diff = row1[column] - row2[column]
        # Square the difference, add it to total
        total += diff**2
    # Return the square root of total
    return sqrt(total)

# Use the rows below to test your function!
# Answer should be 15
test_row_3 = pd.Series({'x': 10, 'y': -30, 'z':-1})
test_row_4 = pd.Series({'x': 12, 'y': -20, 'z':10})

distance(test_row_3, test_row_4)

15.0
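
The loop above can also be written without an explicit `for`, since pandas subtracts two Series label by label. This is one possible alternative, not a replacement for the exercise; the name `distance_vectorized` is made up here.

```python
import numpy as np
import pandas as pd

def distance_vectorized(row1, row2):
    # Subtracting two Series matches values by label (x with x, y with y, ...)
    diff = row1 - row2
    # Square every difference, add them up, take the square root
    return np.sqrt((diff ** 2).sum())

row_a = pd.Series({'x': 10, 'y': -30, 'z': -1})
row_b = pd.Series({'x': 12, 'y': -20, 'z': 10})

print(distance_vectorized(row_a, row_b))
```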

**5. Add a column** `distance` **to the dataframe, holding each row's distance from a new datapoint. Then sort the dataframe by that column.**

In [50]:
new_row = df.iloc[1] * 1.2

# Create a new column `distance` based on how far each row is from the new datapoint
df['distance'] = df.apply(lambda row: distance(row, new_row), axis=1)

# sort the dataframe according to distance
df.sort_values('distance')

Unnamed: 0,male,age,education,current_smoker,cigs_per_day,bp_meds,prevalent_stroke,prevalent_hyp,diabetes,tot_chol,sys_bp,dia_bp,bmi,heart_rate,glucose,heart_attack,distance
241,0,54,1.0,0,0.0,0.0,0,1,0,273.0,139.0,98.0,29.06,110.0,73.0,1,40.937135
2038,0,52,1.0,0,0.0,0.0,0,0,0,279.0,135.0,86.0,27.02,100.0,72.0,1,41.794905
2779,0,45,2.0,1,1.0,0.0,0,1,0,285.0,132.5,97.5,24.74,98.0,77.0,0,42.206465
629,1,64,1.0,0,0.0,0.0,0,0,0,271.0,134.0,79.0,24.95,106.0,90.0,0,42.939945
2043,1,46,2.0,1,20.0,0.0,0,0,0,275.0,137.0,88.0,29.28,110.0,88.0,0,43.069828
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2065,0,62,1.0,0,0.0,0.0,0,0,1,233.0,130.0,87.0,21.34,85.0,386.0,0,388.505320
3309,0,62,3.0,1,20.0,1.0,0,1,1,358.0,215.0,110.0,37.62,110.0,368.0,1,407.187322
2501,0,67,2.0,0,0.0,1.0,0,1,1,303.0,204.0,96.0,27.86,75.0,394.0,1,417.665822
2490,1,62,3.0,0,0.0,0.0,0,0,1,346.0,102.5,66.5,17.17,80.0,394.0,1,419.766351


### Extra Credit
---
**Create a function** `neighbors` **that takes a dataframe and a new row.** 

**It should return the 5 nearest rows inside the dataframe**

In [68]:
def neighbors(df, new_row):
    # Work on a copy so the caller's dataframe is not modified
    # (this also avoids pandas' SettingWithCopyWarning)
    df = df.copy()
    df['distance'] = df.apply(lambda row: distance(row, new_row), axis=1)
    # head() returns the first 5 rows by default
    return df.sort_values('distance').head()

neighbors(X_train, new_row)


Unnamed: 0,male,age,education,current_smoker,cigs_per_day,bp_meds,prevalent_stroke,prevalent_hyp,diabetes,tot_chol,sys_bp,dia_bp,bmi,heart_rate,glucose,distance
2770,0,46,3.0,1,6.0,0.0,0,1,0,315.0,165.0,85.0,32.89,110.0,91.0,36.445859
1025,0,64,2.0,1,3.0,0.0,0,0,0,315.0,135.0,80.0,25.23,103.0,89.0,36.704826
1683,0,46,1.0,0,0.0,0.0,0,0,0,295.0,145.0,90.0,25.87,90.0,79.0,37.301169
363,0,45,2.0,0,0.0,0.0,0,1,0,304.0,148.0,106.0,22.98,98.0,72.0,37.31478
1660,0,66,2.0,0,0.0,0.0,0,1,0,292.0,143.0,95.0,31.11,90.0,77.0,37.574094
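
One possible variation, sketched on made-up data: compute all distances in one step and let the caller choose how many neighbors to return. The function name `nearest_rows` and the parameter `k` are hypothetical, not part of the exercise.

```python
import numpy as np
import pandas as pd

def nearest_rows(frame, new_row, k=5):
    # Dataframe minus Series lines up the Series labels with the columns,
    # so this gives the per-column difference for every row at once
    distances = np.sqrt(((frame - new_row) ** 2).sum(axis=1))
    # assign() returns a copy with the extra column; keep the k smallest
    return frame.assign(distance=distances).nsmallest(k, 'distance')

toy = pd.DataFrame({'x': [0, 3, 10, 1], 'y': [0, 4, 10, 1]})
query = pd.Series({'x': 0, 'y': 0})

result = nearest_rows(toy, query, k=2)
print(result)
```

Because `assign` works on a copy, `toy` itself never gains a `distance` column.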


## III. scikit-learn and K-Nearest Neighbors

### Warm Ups

*Type the given code into the cell below*

---

In [160]:
# Run to reset data

X = df.drop(columns=['heart_attack'])
y = df['heart_attack']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

new_row = X_train.iloc[1] * 1.2

**Create KNN Classifier, using 5 neighbors**: 
```python
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5)
```

In [311]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(5)

**Fit model**: `model.fit(X_train, y_train)`

In [156]:
model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

**Classify using model**: `model.predict(X_test)`

**Score model accuracy**: `model.score(X_test, y_test)`

In [157]:
model.score(X_test, y_test)

0.8446389496717724
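
To connect `predict` and `score`: for a classifier, `score` reports accuracy, the fraction of test rows whose predicted label matches the true label. The sketch below checks this on synthetic data from `make_classification`, used here only as a stand-in for the heart attack dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 200 rows, 4 features, 2 classes
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

# predict() returns one label per test row
predictions = knn.predict(X_te)

# Accuracy by hand: share of predictions that equal the true labels
manual_accuracy = (predictions == y_te).mean()

print(manual_accuracy, knn.score(X_te, y_te))
```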

### Exercises
---
**1. Here is a new row. Use the techniques from the previous section to find the 5 rows that are its nearest neighbors.**

In [322]:
new_row = X_train.iloc[1] * 1.2

# Copy first to avoid pandas' SettingWithCopyWarning when adding a column
X_train = X_train.copy()
X_train['distance'] = X_train.apply(lambda row: distance(row, new_row), axis=1)

X_train.sort_values('distance').head()


Unnamed: 0,male,age,education,current_smoker,cigs_per_day,bp_meds,prevalent_stroke,prevalent_hyp,diabetes,sys_bp,dia_bp,bmi,heart_rate,distance
2158,0,68,3.0,1,20.0,0.0,0,1,0,158.0,94.0,31.64,80.0,10.612492
2320,0,68,2.0,1,20.0,0.0,0,0,0,148.0,95.0,20.98,95.0,14.112154
2866,0,59,2.0,1,20.0,0.0,0,1,0,154.5,93.0,21.82,85.0,15.515928
807,1,61,1.0,1,20.0,0.0,0,1,0,157.0,99.0,28.74,95.0,15.644672
2233,1,64,1.0,1,20.0,0.0,0,1,0,155.0,99.0,22.46,75.0,15.669995


**2. The trained scikit-learn model relies on the SAME distance calculations.**

**Call** `model.kneighbors([new_row])` **to get the distances and row numbers of the nearest neighbors.**

**Call** `X_train.iloc[]` **on the row numbers that come back. It should give you the same rows that we got using our own techniques from above** 

In [324]:
distances, row_numbers = model.kneighbors([new_row])

X_train.iloc[row_numbers[0]]

Unnamed: 0,male,age,education,current_smoker,cigs_per_day,bp_meds,prevalent_stroke,prevalent_hyp,diabetes,sys_bp,dia_bp,bmi,heart_rate,distance
2158,0,68,3.0,1,20.0,0.0,0,1,0,158.0,94.0,31.64,80.0,10.612492
2320,0,68,2.0,1,20.0,0.0,0,0,0,148.0,95.0,20.98,95.0,14.112154
2866,0,59,2.0,1,20.0,0.0,0,1,0,154.5,93.0,21.82,85.0,15.515928
807,1,61,1.0,1,20.0,0.0,0,1,0,157.0,99.0,28.74,95.0,15.644672
2233,1,64,1.0,1,20.0,0.0,0,1,0,155.0,99.0,22.46,75.0,15.669995
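
Here is a tiny, self-contained sketch of what `kneighbors` hands back, using four made-up points. It returns two arrays, distances and row positions, each with one row per query point, which is why the exercise indexes `row_numbers[0]`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four invented points with arbitrary labels
X_toy = np.array([[0, 0], [3, 4], [10, 10], [1, 1]])
y_toy = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=2).fit(X_toy, y_toy)

# kneighbors expects a 2D input: a list of query rows
dist_found, rows_found = knn.kneighbors([[0, 0]])

# Recompute the distances to those rows by hand
manual = np.sqrt(((X_toy[rows_found[0]] - np.array([0, 0])) ** 2).sum(axis=1))

print(rows_found[0], dist_found[0], manual)
```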


**3. Some of our columns have MUCH larger numbers. These columns are overweighted when we calculate which rows are close.**

**Drop the** `tot_chol` **and** `glucose` **columns from the feature matrix. Rerun** `train_test_split()` **and refit your model. What's the model's accuracy score now?** 

In [320]:
X = df.drop(columns=['heart_attack', 'tot_chol', 'glucose'])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8227571115973742
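
A quick back-of-the-envelope sketch of why large-valued columns dominate. The three patients below are invented; each is described only by `male` (0 or 1) and `glucose`.

```python
from math import sqrt

# Made-up patients: (male, glucose)
patient_a = (0, 80.0)
patient_b = (1, 80.0)   # differs from A only in sex
patient_c = (0, 85.0)   # differs from A only by a small glucose change

dist_ab = sqrt((patient_a[0] - patient_b[0])**2 + (patient_a[1] - patient_b[1])**2)
dist_ac = sqrt((patient_a[0] - patient_c[0])**2 + (patient_a[1] - patient_c[1])**2)

print(dist_ab, dist_ac)
```

A 5-point glucose difference counts five times as much as the largest possible difference in `male`, simply because of the units each column is measured in.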

**4. You can also use the** `scale()` **function to standardize the feature matrix before fitting the model.**

**Rerun** `train_test_split()` **on the scaled version of your feature matrix and refit your model. What's the model's accuracy score now?** 

In [318]:
from sklearn.preprocessing import scale

X = df.drop(columns=['heart_attack'])
X_train, X_test, y_train, y_test = train_test_split(scale(X), y, random_state=4)

model.fit(X_train, y_train)
model.score(X_test, y_test)

0.8271334792122538
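
Calling `scale(X)` before splitting uses the test rows when computing each column's mean and standard deviation. A common alternative, sketched here on synthetic data rather than the heart attack dataset, is to put `StandardScaler` and the classifier in a pipeline so the scaling is learned from the training rows only. This is one option under those assumptions, not the worksheet's prescribed answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 4 features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=4)

# fit() learns the scaling from X_tr only; score() reuses it on X_te
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

accuracy = scaled_knn.score(X_te, y_te)
print(accuracy)
```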