# Python for Machine Learning

### *Session \#2*


### Helpful shortcuts
---

**SHIFT** + **ENTER** ----> Execute Cell

**UP/DOWN ARROWS** ----> Move cursor between cells (then ENTER to start typing)

**TAB** ----> See autocomplete options

**ESC** then **b** ----> Create Cell

**ESC** then **dd** ----> Delete Cell

**\[python expression\]?** ----> Explanation of that Python expression

**ESC** then **m** then **ENTER** ----> Switch to Markdown mode

## I. Review and Yellowbrick

### Warm Ups

*Type the given code into the cell below*

---

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from yellowbrick.classifier import ConfusionMatrix, ClassPredictionError

from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

df = pd.read_excel('titanic.xlsx')
```

**Create feature matrix/target vector:**
```python
columns = ['fare', 'pclass']
X = df[columns]
y = df['alive']
```                    

**Create train/test split:**
```python
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

**Create model:**
```python
model = KNeighborsClassifier(n_neighbors=3)
```                    

**Confusion Matrix:**
```python
model_cm = ConfusionMatrix(model)
model_cm.fit(X_train, y_train)
model_cm.score(X_test, y_test)
model_cm.finalize()
```                    
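
Accuracy and sensitivity can both be read off the four cells of a confusion matrix. The sketch below is only an illustration: the labels and predictions are made up (they are not from the Titanic data), and it uses `sklearn.metrics` rather than Yellowbrick so the numbers can be checked by hand.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy: share of all predictions that are correct
accuracy = (tp + tn) / cm.sum()

# Sensitivity (also called recall): share of actual positives the model found
sensitivity = tp / (tp + fn)

# The same two numbers from scikit-learn's helper functions
assert accuracy == accuracy_score(y_true, y_pred)
assert sensitivity == recall_score(y_true, y_pred)
```

Here 6 of the 8 predictions are correct, and 3 of the 4 actual positives are found, so both values come out to 0.75.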



### Exercises
---
**1. Train the model using only** `fare, sibsp, pclass` **in the feature matrix, with** `survived` **as the target vector**

**2. Plot a confusion matrix for the model. What is the accuracy? Sensitivity?**

**3. An alternate form of the confusion matrix is the class prediction error graph.**

**Redo the steps for a confusion matrix, but wrap your model with** `ClassPredictionError()` **instead.**

**4. Use** `model.predict()` **on the two new samples below.**

**5. You can use** `model.predict_proba()` **to get the probability the model gives to each class instead.**

**Use** `.predict_proba()` **on the new samples. Which passenger is the model more certain about?**
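
To make the difference between `.predict()` and `.predict_proba()` concrete, here is a minimal sketch on a made-up dataset. The six training rows and the two new samples are assumptions for illustration only, not rows from the Titanic file. With the default uniform weights, a k-nearest-neighbors model reports the fraction of the `n_neighbors` closest training points that belong to each class.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data, not the Titanic file
X_toy = pd.DataFrame({'fare': [5, 7, 8, 60, 70, 80],
                      'pclass': [3, 3, 3, 1, 1, 1]})
y_toy = ['no', 'no', 'no', 'yes', 'yes', 'yes']

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Two hypothetical new passengers
new_samples = pd.DataFrame({'fare': [6, 40], 'pclass': [3, 2]})

# One predicted label per row
labels = model.predict(new_samples)

# One row per sample, one column per class (column order is model.classes_)
probabilities = model.predict_proba(new_samples)
```

The first sample sits inside the low-fare group, so all three neighbors agree. The second sample falls between the groups, so its neighbors are split and the probabilities are less extreme.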

## II. Missing Data

### Warm Ups

*Type the given code into the cell below*

---

**Find rows with null fare:**
```python
null_fare = df['fare'].isnull()
df[null_fare]
```


**Drop rows with nulls:** `df.dropna(subset=['fare'])`

*Hint: You can also set* `inplace=True` *to change the original dataframe*


**Find count of nulls:** `df.isnull().sum()`



**Find fraction of nulls (multiply by 100 for a percentage):** `df.isnull().mean()`



**Fill in nulls based on filter:**
```python
df.loc[null_fare, 'fare'] = df['fare'].mean()
```
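
The per-column null fractions can also be used as a filter over columns. The following is a rough sketch on a made-up frame (the column names `a`, `b`, `c` and the 0.5 cutoff are assumptions for illustration), showing one way to keep only the columns that are mostly filled in and then drop rows missing a value in one specific column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, not the Titanic data
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],
    'c': ['x', None, 'y', 'z'],
})

# Fraction of nulls in each column
null_fraction = toy.isnull().mean()

# Boolean mask over columns: keep those that are at most half null
mostly_present = toy.loc[:, null_fraction <= 0.5]

# Drop the rows that are missing a value in column 'c'
complete_c = mostly_present.dropna(subset=['c'])
```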

### Exercises
---

**1. Find all the rows where** `deck` **is null.**

**2. How many nulls are there in the** `age` **column? The** `embarked` **column?** 

**3. Find the percentage of nulls across all columns. For columns with <5% nulls, drop the rows with missing values.**

**4. Drop columns with more than 50% nulls.**

**5. Fill in the null ages with the average age of each group in the** `who` **column.**

**For example, a man with a null age gets the average age of the rows with** `man` **in the** `who` **column, and a child with a null age gets the average age of the rows with** `child` **in the** `who` **column.**
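
One possible way to do a group-wise fill is `groupby()` followed by `transform('mean')`, which returns a value for every row rather than one value per group. The sketch below uses a made-up frame (the ages are assumptions, not Titanic values); repeating the mask-and-assign pattern from the warm-up once per group would work as well.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, not the Titanic data
toy = pd.DataFrame({
    'who': ['man', 'man', 'man', 'child', 'child', 'woman', 'woman'],
    'age': [30.0, 40.0, np.nan, 4.0, np.nan, 28.0, 32.0],
})

# Mean age within each 'who' group, aligned row by row with the original frame
group_mean = toy.groupby('who')['age'].transform('mean')

# Keep existing ages; use the group mean only where age is null
toy['age'] = toy['age'].fillna(group_mean)
```

The null in the `man` group becomes 35.0 (the mean of 30 and 40), and the null in the `child` group becomes 4.0.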

## III. Scaling and One-Hot Encoding

### Warm Ups

*Type the given code into the cell below*

---

**Imports for One-Hot Encoding and Pipelines:**

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
```

**Add encoding to model using pipeline:**
```python
model = make_pipeline(OneHotEncoder(), KNeighborsClassifier(3))
```
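
To see what `OneHotEncoder()` does inside that pipeline, it can help to run it on its own. This is a minimal sketch on a made-up `who` column (the four values are assumptions, not rows from the Titanic file). It calls `.toarray()` on the default sparse output to get a regular NumPy array.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column
X_cat = pd.DataFrame({'who': ['man', 'woman', 'child', 'man']})

encoder = OneHotEncoder()

# fit_transform learns the categories, then returns one 0/1 column per category
encoded = encoder.fit_transform(X_cat).toarray()

# Categories are stored in sorted order, matching the output columns
categories = list(encoder.categories_[0])
```

Each row of `encoded` has a single 1 in the column for its category, so the model never treats the categories as ordered numbers.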

**Add StandardScaler to model using pipeline:**
```python
model = make_pipeline(StandardScaler(), KNeighborsClassifier(3))
```
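
Scaling matters for k-nearest neighbors because the model compares samples by distance, so a column with large values (such as a fare) can drown out a column with small values (such as a class number). A quick sketch of what `StandardScaler()` does to a single made-up column (the three fares are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fare column with one feature and three samples
fares = np.array([[10.0], [20.0], [30.0]])

# Subtract the column mean, then divide by the column standard deviation
scaled = StandardScaler().fit_transform(fares)

# After scaling, the column has mean 0 and standard deviation 1
scaled_mean = scaled.mean()
scaled_std = scaled.std()
```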

**Split pipeline between two types of columns:**
```python
numeric = ['fare', 'age']
categorical = ['pclass', 'embark_town', 'who']

column_transformer = make_column_transformer(
        (StandardScaler(), numeric),
        (OneHotEncoder(), categorical)
    )

model = make_pipeline(column_transformer, KNeighborsClassifier(3))
```
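
A pipeline built this way is fit and scored like any other model. The sketch below runs the same pattern end to end on a small made-up frame; the six rows and the reduced column lists are assumptions for illustration, and the score is computed on the training rows only to keep the example short.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data, not the Titanic file
toy = pd.DataFrame({
    'fare': [5.0, 7.0, 8.0, 60.0, 70.0, 80.0],
    'who': ['man', 'man', 'woman', 'woman', 'child', 'woman'],
    'alive': ['no', 'no', 'no', 'yes', 'yes', 'yes'],
})
X_toy = toy[['fare', 'who']]
y_toy = toy['alive']

column_transformer = make_column_transformer(
    (StandardScaler(), ['fare']),   # scale the numeric column
    (OneHotEncoder(), ['who']),     # encode the categorical column
)
model = make_pipeline(column_transformer, KNeighborsClassifier(3))

# Fitting the pipeline fits the preprocessing steps and the classifier together
model.fit(X_toy, y_toy)
train_accuracy = model.score(X_toy, y_toy)

# Output of the preprocessing steps alone: 1 scaled column + 3 one-hot columns
transformed = model[:-1].transform(X_toy)
```

In practice the score would be computed on a held-out test split, as in the earlier `train_test_split` warm-up.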

### Exercises
---

**1. Create a new feature matrix X by dropping the** `alive` **and** `survived` **columns of** `df` 

**Then create** `X_num` **with just the numeric inputs and** `X_cat` **with just categorical data**

Hint: You can use `df.select_dtypes()` to isolate the categorical columns (i.e., 'object') and the numeric columns (i.e., 'int64', 'float64')

**2. Create a** `OneHotEncoder(sparse=False)` **object as** `encoder` **and call** `encoder.fit_transform()` **on the categorical columns of X**

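Note: The `sparse=False` parameter makes the output a normal NumPy array. In scikit-learn 1.2 and later this parameter is named `sparse_output`, and the old `sparse` name was removed in version 1.4.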

**3. Create a model based on only the categorical features of X. Fit the model to the data, and find the model's accuracy.**

Hint: Use a pipeline and OneHotEncoder()

**4. Create a model based on only the numeric features of X, which preprocesses data using StandardScaler(). Fit the model to the data, and find the accuracy.**

**5. Use a ColumnTransformer to create a model with both** `OneHotEncoder()` **and** `StandardScaler()` **preprocessing for the appropriate columns. What's the accuracy for this combined model?**

**6. You can also combine Yellowbrick visualizers with pipelines!**

**After creating your pipeline with** `make_pipeline()` **but before fitting it to the training data, wrap your model with** `ConfusionMatrix()`
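
The hint in exercise 1 mentions `df.select_dtypes()`. As a rough illustration of how it splits a frame by column type, here is a sketch on a made-up two-row frame (the values are assumptions). It uses the generic `'number'` selector with `include` and `exclude` instead of listing each dtype by name, which is one of several ways to make the split.

```python
import pandas as pd

# Hypothetical frame with two numeric columns and one text column
toy = pd.DataFrame({
    'fare': [7.25, 71.28],
    'pclass': [3, 1],
    'who': ['man', 'woman'],
})

# Columns with a numeric dtype (integers and floats)
X_num = toy.select_dtypes(include='number')

# Everything else, which here is the text column
X_cat = toy.select_dtypes(exclude='number')
```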