In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Pipelines Assignment: Palmer Penguins continued

![](https://github.com/allisonhorst/palmerpenguins/raw/master/man/figures/lter_penguins.png) 

## Penguin classification
In this notebook, we will continue improving on the model we made to classify the penguin species. We will do this by using some of the transformers available to us within the scikit-learn API. We will cover the following aspects:

1. Loading the data
2. Preparing the data for sklearn
3. Model creation & evaluation
4. Data pre-processing
5. Pipelines

## 1. Loading our data

We load the data from the data folder.

In [None]:
penguins = pd.read_csv('data/penguins_messy.csv')
penguins.head()

## 2. Exploratory Data Analysis

As you can see, the data is slightly different from the previous file we saw! There seem to be some missing values, and we have a new column.

Take some time to examine the dataset. Below are some suggestions for what you may want to investigate. Note that this is not an exhaustive list; see if you can find something interesting!

- How much data do we have? How many features?
- What do the features represent?
- What datatypes does it contain? Are there any missing values?

- How many different values do the categorical features contain?
- Is there any redundant information? Are you sure this information is redundant, i.e. does the column hold no information that is unavailable in another column (e.g. a value that is NOT missing where the original column's value is)?

- Produce some summary statistics for the different features.
- Are any of the features correlated?
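
As a starting point, the checks suggested above could be sketched with standard pandas calls. The `toy` frame and its values are made up for illustration and only stand in for `penguins`; the real file may need different column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data (values are made up).
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, 46.5, np.nan, 49.0],
    "body_mass_g": [3750.0, 5000.0, 3600.0, 3800.0],
    "sex": ["male", None, "female", "female"],
})

n_rows, n_cols = toy.shape                           # how much data, how many columns
dtypes = toy.dtypes                                  # datatype of each column
missing_per_column = toy.isna().sum()                # number of missing values per column
sex_counts = toy["sex"].value_counts(dropna=False)   # category frequencies, missing included
summary = toy.describe()                             # summary statistics of numeric columns
correlations = toy.select_dtypes("number").corr()    # pairwise correlations of numeric columns
```

Passing `dropna=False` to `value_counts` keeps missing entries visible, which helps when judging whether a column is truly redundant.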

   

In [None]:
# Your EDA code here. 

--- 

## 3. Pre-processing the data

A decision tree, the algorithm we used, works well out of the box without much pre-processing, provided that all the data is numeric. This is not the case for all machine learning algorithms. We also omitted some parts of the data so that a simple model could be built.

In [None]:
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex', 'island']

X = penguins.loc[:, feature_columns]
y = penguins.loc[:, 'species']

print(f'The shape of feature matrix X is: {X.shape}')
print(f'The shape of target vector y is: {y.shape}')

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
from sklearn.preprocessing import OneHotEncoder

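# One-hot encode the two categorical columns of the training split.
# By default the result is returned as a sparse matrix.
transformer = OneHotEncoder()
transformer.fit_transform(X_train[['sex', 'island']])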

# ASSIGNMENT 


#### 1. Use the one-hot encoder to encode the categorical columns in your feature matrix. 
Which columns would you want to one-hot encode? Do you use `drop='first'`? Are your results what you expect? If not, why might they differ?
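
To make the effect of `drop='first'` concrete, here is a minimal sketch on a made-up frame without missing values. The `toy` data is hypothetical; the point is only the difference in the number of output columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 2 distinct sexes and 3 distinct islands, no missing values.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
})

full = OneHotEncoder().fit(toy)
reduced = OneHotEncoder(drop="first").fit(toy)

full_shape = full.transform(toy).shape         # one column per category: 2 + 3
reduced_shape = reduced.transform(toy).shape   # first category of each feature dropped: 1 + 2
categories = full.categories_                  # categories learned for each input column
```

On the real data the column count may differ from what this suggests, because missing values can end up being treated as a category of their own.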

In [None]:
# Your code here. 

Don't worry if you don't get the right results immediately; move on to question 2!

#### 2. Deal with your missing values first!
Think about it: would you replace your missing values _before_ or _after_ encoding the data? 

Impute your missing values with the `SimpleImputer` from `sklearn.impute`, then encode the data. Verify that your results are what you expect. How many columns do you now have? 
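
The impute-then-encode order can be sketched on a single made-up categorical column. The `toy` frame is hypothetical, and `strategy="most_frequent"` is just one possible choice, assumed here because it works for string data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing entry.
toy = pd.DataFrame({"sex": ["male", np.nan, "female", "male"]})

# Step 1: fill the gap with the most frequent category ("male" in this toy data).
imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(toy)

# Step 2: encode the completed column; only the two real categories remain.
encoded = OneHotEncoder().fit_transform(imputed)
```

Imputing first means the encoder never sees a missing value, so no extra column is created for it.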

In [None]:
# Your code here. 

#### 3. Use the one-hot encoder in combination with the column transformer to encode your entire feature matrix. 
How many columns do you have now? Is that the number that you expect? Don't forget to use your imputed data! 
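
A minimal sketch of combining `ColumnTransformer` with `OneHotEncoder`, again on a made-up frame. The `toy` data and the step name `"onehot"` are illustrative assumptions; `remainder="passthrough"` keeps the numeric column unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and two categorical columns, no missing values.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 40.3, 49.0],
    "sex": ["male", "female", "female", "male"],
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
})

column_transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), ["sex", "island"])],
    remainder="passthrough",  # columns not listed above are kept as they are
)
transformed = column_transformer.fit_transform(toy)
n_output_columns = transformed.shape[1]  # 2 (sex) + 3 (island) + 1 (numeric)
```

Without `remainder="passthrough"` the default is to drop every column that is not listed, which is an easy way to lose the numeric features.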

In [None]:
# Your code here. 

#### 4. Moving to pipelines
Create a pipeline that does two things:
- Impute missing values
- One-hot encode your data

Are your results equal to the results in step 3?
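
One possible shape for such a pipeline is sketched below on made-up data. The step names (`"impute"`, `"onehot"`, `"num"`, `"cat"`) and the imputation strategies are assumptions chosen for illustration, not the only valid setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with a gap in each column.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 49.0],
    "sex": ["male", "female", np.nan, "male"],
})

# Categorical columns: impute first, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Numeric columns only need imputing here.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["bill_length_mm"]),
    ("cat", categorical, ["sex"]),
])

prepared = preprocess.fit_transform(toy)
# The result may be sparse or dense depending on its density; make it dense to inspect it.
dense = prepared.toarray() if hasattr(prepared, "toarray") else np.asarray(prepared)
```

Nesting a `Pipeline` inside the `ColumnTransformer` lets each column type get its own sequence of steps while the whole thing still behaves like a single transformer.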

In [None]:
# Your code here. 

#### 5. Expand your pipeline: model! 
You've dealt with your missing values and encoded your data: you're ready for modelling!

Expand your pipeline with a model as the final step. Build one pipeline for each of these two models:
- DecisionTreeClassifier: `sklearn.tree.DecisionTreeClassifier`
- KNeighborsClassifier: `sklearn.neighbors.KNeighborsClassifier`
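
A rough sketch of ending a pipeline with a model, using made-up data. The helper `make_pipeline_for` and the step names are hypothetical, and the scores on such a tiny training set say nothing about real performance.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (values are made up, no missing entries).
toy_X = pd.DataFrame({
    "flipper_length_mm": [181.0, 186.0, 217.0, 221.0, 195.0, 198.0],
    "island": ["Torgersen", "Torgersen", "Biscoe", "Biscoe", "Dream", "Dream"],
})
toy_y = pd.Series(["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"])


def make_pipeline_for(model):
    """Hypothetical helper: wrap the same preprocessing around any classifier."""
    preprocess = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["island"])],
        remainder="passthrough",
    )
    return Pipeline([("preprocess", preprocess), ("model", model)])


tree_pipeline = make_pipeline_for(DecisionTreeClassifier(random_state=0)).fit(toy_X, toy_y)
knn_pipeline = make_pipeline_for(KNeighborsClassifier(n_neighbors=1)).fit(toy_X, toy_y)

# Predicting on the training rows only shows that the pipelines run end to end.
tree_predictions = tree_pipeline.predict(toy_X)
knn_predictions = knn_pipeline.predict(toy_X)
```

Because the preprocessing lives inside the pipeline, calling `fit` and `predict` on raw frames is enough; on the real data you would fit on `X_train` and evaluate on `X_test`.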

In [None]:
# Your decision tree code here. 

In [None]:
# Your k-nearest neighbors code here.

#### 6. Expand your pipeline: scaling! 
Does your data need to be scaled? If so, which scaler would you use? Add scaling to your pipeline with the k-Nearest Neighbors classifier. Do your results improve?

**BONUS**: Do the same to the pipeline with the decision tree classifier. Do your results improve as much? Why do you think that is? 
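
To illustrate what a scaler does inside a pipeline, here is a small sketch with made-up numeric data. `StandardScaler` is assumed here as one option; other scalers would slot into the same step.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: bill length (tens) and body mass (thousands) on very different scales.
toy_X = np.array([
    [39.1, 3750.0],
    [38.6, 3800.0],
    [46.5, 5000.0],
    [47.2, 5100.0],
])
toy_y = np.array(["Adelie", "Adelie", "Gentoo", "Gentoo"])

scaled_knn = Pipeline([
    ("scale", StandardScaler()),                     # zero mean, unit variance per column
    ("model", KNeighborsClassifier(n_neighbors=1)),
])
scaled_knn.fit(toy_X, toy_y)

# Inspect what the model actually sees after the scaling step.
scaled_features = scaled_knn.named_steps["scale"].transform(toy_X)
column_means = scaled_features.mean(axis=0)
column_stds = scaled_features.std(axis=0)
```

Distance-based models such as k-nearest neighbors compare raw feature values, so without scaling the column with the largest range tends to dominate; tree splits depend only on the ordering of values within a feature.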

In [None]:
# Your code here. 

#### 7. Tune your hyperparameters
Use GridSearchCV in combination with your pipeline to search over: 
- at least one model parameter (e.g. the value of k)
- at least one preprocessing parameter (e.g. imputer strategy, type of scaler, etc.) 
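
A possible layout for such a search is sketched below. It uses scikit-learn's bundled iris data purely as a stand-in for the penguins, and the step names and grid values are assumptions; parameters of a step are addressed as `<step name>__<parameter>`, and a whole step can be swapped by using the step name alone.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in dataset (assumption): iris ships with scikit-learn and is fully numeric.
X_demo, y_demo = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("impute", SimpleImputer()),
    ("scale", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

param_grid = {
    "impute__strategy": ["mean", "median"],       # preprocessing parameter
    "scale": [StandardScaler(), MinMaxScaler()],  # swap the whole scaling step
    "model__n_neighbors": [3, 5, 7],              # model parameter (k)
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_   # best combination found by cross-validation
best_score = search.best_score_     # mean cross-validated accuracy of that combination
```

Searching over the pipeline rather than the bare model means every candidate is cross-validated with its preprocessing refit on each training fold.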



In [None]:
# Your code here.

## BONUS

Get the best results possible! Add other preprocessing techniques if you like (e.g. polynomial features), try different models (maybe an SVM?), search over those hyperparameters... get the best result you can get! 

# Summary

Scikit-learn is an excellent, versatile tool for machine learning in Python. We've seen how to split a dataset into a train and test set with `train_test_split`, create and train a model, use the trained model to make predictions, and evaluate how good the model is with the tools from `sklearn.metrics`. We have also seen preprocessing techniques such as scaling and encoding categorical variables, and how to combine them in pipelines.