### Exercise: Try out learned skills on the penguins dataset
In this exercise we use the [Palmer penguins dataset](https://allisonhorst.github.io/palmerpenguins/).

We use this dataset in a classification setting to predict each penguin's species from anatomical measurements.

Each penguin belongs to one of three species: Adelie, Gentoo, or Chinstrap. The illustration below shows the three species:

![](https://carpentries-incubator.github.io/deep-learning-intro/fig/palmer_penguins.png)

Your goal is to predict the species of a penguin from the available features. Start simple, then expand your approach step by step to build better models.

![](https://carpentries-incubator.github.io/deep-learning-intro/fig/culmen_depth.png)

You can load the data as follows:
```python
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
```


### Read the data

In [2]:
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")

In [5]:
penguins.head()

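|   | Culmen Length (mm) | Culmen Depth (mm) | Species |
|---|--------------------|-------------------|---------|
| 0 | 39.1 | 18.7 | Adelie |
| 1 | 39.5 | 17.4 | Adelie |
| 2 | 40.3 | 18.0 | Adelie |
| 3 | 36.7 | 19.3 | Adelie |
| 4 | 39.3 | 20.6 | Adelie |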


### Separate data and target

In [7]:
data, target = penguins.drop(columns="Species"), penguins["Species"]
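
To make the split concrete, here is a small sketch that rebuilds only the five rows printed by `penguins.head()` as a stand-in DataFrame and applies the same step. The names `penguins_head`, `data_head` and `target_head` are made up for this illustration; the real dataset has more rows and all three species.

```python
import pandas as pd

# Stand-in for the real file: only the five rows printed by penguins.head().
penguins_head = pd.DataFrame(
    {
        "Culmen Length (mm)": [39.1, 39.5, 40.3, 36.7, 39.3],
        "Culmen Depth (mm)": [18.7, 17.4, 18.0, 19.3, 20.6],
        "Species": ["Adelie"] * 5,
    }
)

# Features: every column except the label. Target: the label column only.
data_head = penguins_head.drop(columns="Species")
target_head = penguins_head["Species"]

print(data_head.shape)    # (5, 2)
print(target_head.shape)  # (5,)
```

`drop` returns a new DataFrame, so `penguins_head` itself still contains the `Species` column afterwards.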

### Split into train and test sets (not required when cross-validating)

In [24]:
#from sklearn.model_selection import train_test_split

#data_train, data_test, target_train, target_test = train_test_split(
#    data, 
#    target, 
#    random_state=42) 

#data_train.shape, data_test.shape

### Make a pipeline with LogisticRegression

In [102]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

#model_lr = make_pipeline(StandardScaler(), LogisticRegression())
model_lr = make_pipeline(LogisticRegression())
model_lr
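
The commented-out line above points at a variant that scales the features first. As a hedged sketch of that variant, the block below uses scikit-learn's built-in iris data as a stand-in (an assumption, so that it runs without the penguins CSV). Placing `StandardScaler` inside the pipeline means that, during cross-validation, the scaler is fitted on each training fold only and never sees the corresponding test fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): iris instead of the penguins CSV.
X, y = load_iris(return_X_y=True)

# The scaler is a pipeline step, so each CV split fits it on its training fold only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
cv_scaled = cross_validate(scaled_lr, X, y, cv=5)

print(cv_scaled["test_score"])
```

Whether scaling helps on the penguins features is something to check by running both variants; the sketch only shows the mechanics.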

### Perform cross-validation using LogisticRegression

In [103]:
from sklearn.model_selection import cross_validate

cv_result_lr = cross_validate(model_lr, data, target, cv=5)

In [104]:
cv_result_lr

{'fit_time': array([0.01247168, 0.00575614, 0.00962901, 0.00719953, 0.00738025]),
 'score_time': array([0.00133133, 0.00113559, 0.00107813, 0.00260091, 0.00113988]),
 'test_score': array([1.        , 0.95652174, 0.95588235, 0.97058824, 0.92647059])}

In [105]:
lr_mean = cv_result_lr['test_score'].mean()
lr_std = cv_result_lr['test_score'].std()

print("LogisticRegression avg score {} with std of {}".format(round(lr_mean,3), round(lr_std,3)))

LogisticRegression avg score 0.962 with std of 0.024
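
As a quick sanity check, the summary line can be reproduced from the five `test_score` values printed above. One detail worth knowing: NumPy's `.std()` defaults to the population standard deviation (`ddof=0`). Passing `ddof=1` gives the sample standard deviation, which is noticeably larger when there are only five folds.

```python
import numpy as np

# The five fold accuracies copied from the cross_validate output above.
scores = np.array([1.0, 0.95652174, 0.95588235, 0.97058824, 0.92647059])

print(f"mean:         {scores.mean():.3f}")       # mean:         0.962
print(f"std (ddof=0): {scores.std():.3f}")        # std (ddof=0): 0.024
print(f"std (ddof=1): {scores.std(ddof=1):.3f}")  # std (ddof=1): 0.027
```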


### Repeat pipeline, but now with RandomForest

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [106]:
#model_rf = make_pipeline(StandardScaler(), RandomForestClassifier())
model_rf = make_pipeline(RandomForestClassifier())
model_rf

In [107]:
cv_result_rf = cross_validate(model_rf, data, target, cv=5)

In [108]:
cv_result_rf

{'fit_time': array([0.08455539, 0.07301044, 0.07286549, 0.07035208, 0.07085466]),
 'score_time': array([0.00320983, 0.00344014, 0.00566602, 0.00291419, 0.00305223]),
 'test_score': array([1.        , 0.98550725, 0.95588235, 0.94117647, 0.97058824])}

In [109]:
rf_mean = cv_result_rf['test_score'].mean()
rf_std = cv_result_rf['test_score'].std()

print("RandomForest avg score {} with std of {}".format(round(rf_mean,3), round(rf_std,3)))

RandomForest avg score 0.971 with std of 0.021


### Use a different split (10 folds, so each training set contains more samples)

In [71]:
cv_result_lr = cross_validate(model_lr, data, target, cv=10)
cv_result_rf = cross_validate(model_rf, data, target, cv=10)

In [73]:
lr_mean = cv_result_lr['test_score'].mean()
lr_std = cv_result_lr['test_score'].std()
rf_mean = cv_result_rf['test_score'].mean()
rf_std = cv_result_rf['test_score'].std()

print("LogisticRegression avg score {} with std of {}".format(round(lr_mean,3), round(lr_std,3)))
print("RandomForest       avg score {} with std of {}".format(round(rf_mean,3), round(rf_std,3)))

LogisticRegression avg score 0.959 with std of 0.027
RandomForest       avg score 0.962 with std of 0.026
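
To see why `cv=10` gives each model more training data, here is a small sketch on made-up labels (100 samples in two balanced classes, an assumption for illustration and not the penguins data). For a classifier, passing an integer `cv` to `cross_validate` uses stratified k-fold splitting, so the sketch calls `StratifiedKFold` directly. The helper `fold_sizes` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 100 samples, two balanced classes.
X = np.zeros((100, 2))
y = np.array([0, 1] * 50)

def fold_sizes(n_splits):
    """Return (train_size, test_size) for every split."""
    splitter = StratifiedKFold(n_splits=n_splits)
    return [(len(train), len(test)) for train, test in splitter.split(X, y)]

print(fold_sizes(5)[0])   # (80, 20)
print(fold_sizes(10)[0])  # (90, 10)
```

More folds mean larger training sets but smaller test sets, so individual fold scores tend to be noisier.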


### Change settings of RandomForestClassifier

In [99]:
model_rf2 = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=200, max_depth=3))
#model_rf2 = make_pipeline(RandomForestClassifier(n_estimators=100, max_depth=3))
model_rf2


In [100]:
cv_result_rf2 = cross_validate(model_rf2, data, target, cv=5)

In [101]:
rf_mean = cv_result_rf2['test_score'].mean()
rf_std = cv_result_rf2['test_score'].std()

print("RandomForest       avg score {} with std of {}".format(round(rf_mean,3), round(rf_std,3)))

RandomForest       avg score 0.938 with std of 0.029
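
A random forest is randomized, so rerunning the same cell can give slightly different scores even with identical settings. When comparing settings, fixing `random_state` removes that source of variation. A minimal sketch on scikit-learn's built-in iris data as a stand-in (an assumption), with an arbitrary seed and a hypothetical helper `rf_scores`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Stand-in data (assumption): iris instead of the penguins CSV.
X, y = load_iris(return_X_y=True)

def rf_scores(seed):
    """Cross-validate a forest with a fixed seed and return the fold scores."""
    model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=seed)
    return cross_validate(model, X, y, cv=5)["test_score"]

first = rf_scores(0)
second = rf_scores(0)

# Same seed and same splits: the fold scores are identical.
print(np.array_equal(first, second))  # True
```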


### Use a Support Vector Machine (seems to work well off-the-shelf)

In [116]:
from sklearn.svm import SVC

#model_svm = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model_svm = make_pipeline(SVC(gamma='auto'))
model_svm

In [117]:
cv_result_svm = cross_validate(model_svm, data, target, cv=5)

In [118]:
svm_mean = cv_result_svm['test_score'].mean()
svm_std = cv_result_svm['test_score'].std()

print("Support Vector Machine avg score {} with std of {}".format(round(svm_mean,3), round(svm_std,3)))

Support Vector Machine avg score 0.965 with std of 0.012
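
The three model families can also be compared in a single loop on identical splits. The sketch below is one possible way to do this, not the notebook's own method: it passes one shuffled `StratifiedKFold` object to every `cross_validate` call, scales the features for the two non-tree models, and uses the built-in iris data as a stand-in (all assumptions made for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): iris instead of the penguins CSV.
X, y = load_iris(return_X_y=True)

# One splitter shared by all models, so every model sees the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression()),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv)["test_score"]
    summary[name] = (scores.mean(), scores.std())
    print(f"{name:<20} avg score {scores.mean():.3f} with std of {scores.std():.3f}")
```

With scores this close and standard deviations of a few percent, differences between models are often within the fold-to-fold noise, so a shared splitter makes the comparison fairer but not conclusive.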
