# Practice Session: Palmer Penguins dataset

In this section, we will practice the steps learned so far on a new dataset: the
Palmer Penguins dataset.

<img alt="penguins" src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/lter_penguins.png" width=500 />

*Artwork by @allison_horst*

Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php)
and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of
the [Long Term Ecological Research Network](https://lternet.edu/).

Data are available under the [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)
license in accordance with the [Palmer Station LTER Data Policy](http://pal.lternet.edu/data/policies)
and the [LTER Data Access Policy for Type I data](https://lternet.edu/data-access-policy/).

Here we will use a subset of it, prepared for this exercise.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# the 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6
plt.style.use('seaborn-v0_8')

## Loading and visualizing the dataset

In [None]:
from penguins import load_penguins

First, using the `load_penguins` function, load the dataset and explore it.

As you do so, answer the following questions:

- how many features does it contain?
- how many samples?
- what is the target variable and how is it encoded?

*Note: more information about the culmen measurements is available [here](https://allisonhorst.github.io/palmerpenguins/#bill-dimensions).*

**Hint: this section is very similar to the [Iris data set exploration in section 02.1](02.1-Machine-Learning-Intro.ipynb#Loading-the-Iris-Data-with-Scikit-Learn).**
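
As a rough illustration of the kind of exploration meant here, the sketch below uses the Iris dataset that the hint refers to. It is an assumption that the object returned by `load_penguins` exposes similar attributes (`data`, `target`, `feature_names`, `target_names`); check what it actually returns before reusing these names.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: Iris is used as a stand-in; load_penguins may return a
# differently structured object.
iris = load_iris()

X, y = iris.data, iris.target

# the feature matrix has one row per sample and one column per feature
n_samples, n_features = X.shape
print("samples:", n_samples, "features:", n_features)

# names of the columns of the feature matrix
print("feature names:", iris.feature_names)

# the target is encoded as integers, each mapping to a class name
target_values = np.unique(y)
print("target values:", target_values)
print("target names:", list(iris.target_names))
```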

In [None]:
# load the dataset


In [None]:
# check the size of the feature matrix 


In [None]:
# check the name of the features


In [None]:
# check the target variable


In [None]:
# and its possible values


Next, let's have a look at the data and see how the different species are
distributed. Create a plot showing two of the dimensions of the dataset.

*Bonus: you can create a function and use it in a loop to show all pairs of dimensions.*

**Hint: this section is very similar to the [Iris data set exploration in section 02.1](02.1-Machine-Learning-Intro.ipynb#Loading-the-Iris-Data-with-Scikit-Learn).**
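
One possible way to organize the plotting code is sketched below, again on Iris as a stand-in. The helper name `plot_pair` is hypothetical, and the code assumes a dataset object with `data`, `target` and `feature_names` attributes, which may not match what `load_penguins` returns.

```python
from itertools import combinations

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: Iris stands in for the penguins data.
iris = load_iris()


def plot_pair(dataset, i, j, ax):
    """Scatter feature i against feature j, colored by class label."""
    points = ax.scatter(dataset.data[:, i], dataset.data[:, j],
                        c=dataset.target, cmap="viridis", s=20)
    ax.set_xlabel(dataset.feature_names[i])
    ax.set_ylabel(dataset.feature_names[j])
    return points


# a single pair of dimensions
fig, ax = plt.subplots()
plot_pair(iris, 0, 1, ax)

# bonus: all pairs of features (4 features give 6 pairs, hence a 2 x 3 grid)
pairs = list(combinations(range(iris.data.shape[1]), 2))
fig_all, axes = plt.subplots(2, 3, figsize=(12, 7))
for (i, j), pair_ax in zip(pairs, axes.ravel()):
    plot_pair(iris, i, j, pair_ax)
fig_all.tight_layout()
```

With a different number of features, the grid shape passed to `plt.subplots` would need to be adapted to the number of pairs.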

In [None]:
# create scatter plot to display two of the dimensions


In [None]:
# bonus: plot all pairs of features


### Question

Given the nature of the dataset and the target data, what type of machine learning
task are we trying to achieve (2 keywords)?

## Fit a baseline model: logistic regression

Now that we have loaded and explored the dataset, we can start fitting a first
model and measure its performance.

Let's start with a `LogisticRegression` model, as follows:

1. split the data into training and test datasets,
2. import the `LogisticRegression` class and create the model,
3. fit the model to the training dataset,
4. generate predictions on the test dataset,
5. compute the accuracy score of the model.

*Bonus: plot the corresponding confusion matrix.*

**Hint: this section is very similar to the [classification on digits example in section 02.2](02.2-Basic-Principles.ipynb#Classification-on-Digits).**
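
The five steps above can be sketched as follows. This is only an illustration on the Iris data, not the solution for the penguins dataset; the `test_size`, `random_state`, `stratify` and `max_iter` values are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: Iris stands in for the penguins data.
X, y = load_iris(return_X_y=True)

# 1. split into training and test sets, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. create the model (max_iter raised to help the solver converge)
model = LogisticRegression(max_iter=1000)

# 3. fit on the training data only
model.fit(X_train, y_train)

# 4. predict labels for the held-out test data
y_pred = model.predict(X_test)

# 5. fraction of correctly classified test samples
acc = accuracy_score(y_test, y_pred)
print("accuracy:", acc)

# bonus: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```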

In [None]:
# split the dataset into train / test sets


In [None]:
# import the LogisticRegression class, create a model and fit it to training data


In [None]:
# predict labels on the test data


In [None]:
# compute the accuracy score


In [None]:
# bonus: display the confusion matrix


To get a more reliable estimate of the model performance, use the `cross_val_score`
function to compute a 5-fold cross-validated estimate of the model accuracy.

**Hint: this section is very similar to the [K-fold Cross-Validation in section 05](05-Validation.ipynb#K-fold-Cross-Validation).**
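
A minimal sketch of 5-fold cross-validation with `cross_val_score`, shown on Iris as a stand-in dataset. For a classifier, the default score is the accuracy; `max_iter=1000` is an arbitrary choice for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: Iris stands in for the penguins data.
X, y = load_iris(return_X_y=True)

model = LogisticRegression(max_iter=1000)

# cv=5 fits the model 5 times, each time scoring on a different held-out fold
scores = cross_val_score(model, X, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

Reporting the mean together with the spread across folds gives an idea of how much the estimate depends on the particular split.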

In [None]:
# compute the cross-validated accuracy score


### Bonus questions

- Use the `make_pipeline` function in combination with the `StandardScaler`
  preprocessor to fit a logistic regression model with input normalization.
- Use the `cross_val_predict` function to generate a cross-validated prediction
  for each input data point, then compute the confusion matrix from these predictions.
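
Both bonus questions can be combined in a short sketch, shown here on Iris as a stand-in dataset. The variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris stands in for the penguins data.
X, y = load_iris(return_X_y=True)

# the scaler is re-fitted on each training fold, so no information
# from the held-out fold leaks into the normalization
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# each sample is predicted by a model that did not see it during training
y_cv = cross_val_predict(pipe, X, y, cv=5)

# one confusion matrix covering every sample of the dataset
cm_cv = confusion_matrix(y, y_cv)
print(cm_cv)
```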

In [None]:
# create a composite model using StandardScaler and LogisticRegression


In [None]:
# create a confusion matrix for cross-validated predictions


## Fit a random forest model

Now that we have fitted a first linear model, we can explore other models and
compare their cross-validated performance.

Fit a `RandomForestClassifier` and compute its cross-validated accuracy score.

Does it perform better than the logistic regression model?

**Hint: this model has been used in a similar way in the [Random Forest for Classifying Digits in section 03.2](03.2-Regression-Forests.ipynb#Example:-Random-Forest-for-Classifying-Digits).**
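
A possible way to set up the comparison is sketched below on Iris as a stand-in dataset. The hyperparameters (`n_estimators=100`, `random_state=0`, `max_iter=1000`) are arbitrary choices for this example, and the outcome on the penguins data may differ.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: Iris stands in for the penguins data.
X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
logreg = LogisticRegression(max_iter=1000)

# same data and same number of folds for both models
forest_scores = cross_val_score(forest, X, y, cv=5)
logreg_scores = cross_val_score(logreg, X, y, cv=5)

print("random forest:       %.3f" % forest_scores.mean())
print("logistic regression: %.3f" % logreg_scores.mean())
```

Small differences between the two mean scores should be read with caution, since they can be within the fold-to-fold variation.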

In [None]:
# import, create and fit a RandomForestClassifier model


In [None]:
# compute the cross-validated accuracy score
