# Data exploration exercise

In this exercise, you will apply what you have learnt today.
You are going to work with the well-known "penguins" dataset. The idea is to prepare the dataset so that it can be used to train a machine learning classifier that determines the species of a penguin from the given features, such as the size of the culmen (beak) and flippers, body mass, sex and occurrence.
However, not all features are equally important for creating a strong model...

Your task is to: 
* load the dataset into a dataframe<img src="./images/gentoo.jpg" width=38% align="right">
* explore the data within the dataframe
* clean the data
* visualise the data in order to find the important features
* reduce the number of features to the significant ones
* save the improved dataset as a csv file

<br/>
<p style="text-align: right;">Image source: wildrepublic.com</p>
<br/>


Before we begin, import pandas, matplotlib.pyplot and seaborn.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

First, find the penguins_size csv file and load it into a pandas dataframe.

In [None]:
df = pd.read_csv("./data/penguins_size.csv")

Have a look at your dataframe and find out how many rows and columns there are.

In [None]:
df.head()

In [None]:
df.shape

Missing data won't help in this task, so find out how much of it we have and whether we can do without these entries.

In [None]:
df.info()

In [None]:
df.isna().sum()

What percentage are we dropping in the worst case?

In [None]:
# Worst case: every missing value sits in a different row
100 * df.isna().sum().sum() / len(df)
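
The worst case assumes that every missing value sits in a different row. As a rough sketch on made-up toy data (the `toy` frame and its values are illustrative and not taken from the penguins file), the upper bound can be compared with the share of rows that actually contain a missing value:

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only, not the penguins dataset.
toy = pd.DataFrame({
    "culmen_length_mm": [39.1, np.nan, 40.3, 36.7, 41.1],
    "body_mass_g": [3750.0, np.nan, 3250.0, 3450.0, 3500.0],
    "sex": ["MALE", None, "FEMALE", None, "MALE"],
})

# Upper bound: count every missing cell as if it were in its own row.
worst_case_pct = 100 * toy.isna().sum().sum() / len(toy)

# Actual share: rows that contain at least one missing value.
actual_pct = 100 * toy.isna().any(axis=1).sum() / len(toy)

print(worst_case_pct, actual_pct)
```

If several missing values share a row, as in the second toy row, the actual loss is smaller than the worst case.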

Get rid of the instances with missing data!

In [None]:
df = df.dropna()

In [None]:
df.shape

In [None]:
df.head()

Find out how many unique categories there are amongst the non-numeric columns.

In [None]:
df['sex'].unique()

In [None]:
df['island'].unique()

In [None]:
df['species'].unique()
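
Instead of checking each column by hand, the non-numeric columns can also be picked out automatically. The following is a minimal sketch on a made-up toy frame (names and values are illustrative assumptions); `select_dtypes`, `nunique` and `value_counts` are standard pandas methods:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "sex": ["MALE", "FEMALE", ".", "MALE"],
    "body_mass_g": [3750.0, 5000.0, 3450.0, 3700.0],
})

# Keep only the columns that do not hold numbers.
categorical = toy.select_dtypes(exclude="number")

# Number of distinct categories per non-numeric column.
unique_counts = categorical.nunique()
print(unique_counts)

# How often each category occurs; odd entries such as "." stand out here.
for column in categorical.columns:
    print(toy[column].value_counts())
```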

Seems we have some not so useful data here... Can you clean things up?

In [None]:
df = df[df['sex']!='.']
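
Filtering with a boolean mask keeps only the rows for which the condition is true. A small sketch on made-up toy data (the `toy` and `cleaned` names are assumptions for illustration) shows the same idea, with an added `.copy()` so that the result is an independent frame and later in-place changes do not trigger a `SettingWithCopyWarning`:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "sex": ["MALE", ".", "FEMALE"],
})

# The mask is True wherever 'sex' holds a real category.
mask = toy["sex"] != "."

# .copy() detaches the filtered rows from the original frame.
cleaned = toy[mask].copy()

print(cleaned["sex"].unique())
```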

Looking good so far!
Time to get some descriptive statistics on the whole dataframe...

In [None]:
df.describe()
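
By default, `describe()` only summarises the numeric columns. As a hedged sketch on made-up toy data (values are illustrative), the `exclude` parameter can be used to get a summary of the non-numeric columns instead:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "sex": ["MALE", "FEMALE", "MALE"],
    "body_mass_g": [3750.0, 3800.0, 5000.0],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# Non-numeric summary: count, unique, top (most frequent), freq.
categorical_summary = toy.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```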

Let's get visual.
Create a seaborn scatterplot that plots the culmen features against each other. Use different colours for the species.

In [None]:
sns.scatterplot(x='culmen_length_mm',y='culmen_depth_mm',data=df,hue='species',palette='Dark2')
plt.show()

Now, create a pairplot for the dataframe and separate the species by colour again.

In [None]:
sns.pairplot(df,hue='species',palette='Dark2')
plt.show()

To finish things up, create a figure with subplots that contain boxplots, with x being the species and y the culmen length. Each subplot should depict one of the sexes.

In [None]:
sns.catplot(x='species',y='culmen_length_mm',data=df,kind='box',col='sex',palette='Dark2')

For an ML model to work properly, we need to one-hot encode at least the labels. Why do you think this is?
Try to find out how to do this with the help of the pandas documentation, and one-hot encode all categorical features as well as the labels.

In [None]:
pd.get_dummies(df)
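
To see what one-hot encoding does, here is a minimal sketch on a made-up toy frame (names and values are illustrative assumptions). Note that `pd.get_dummies` returns a new dataframe and leaves the original untouched, so the result has to be assigned to a variable if it is needed later:

```python
import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "body_mass_g": [3750.0, 5000.0, 3700.0],
})

# Each category becomes its own indicator column named <column>_<category>;
# numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Every row has exactly one active indicator per original categorical column, which is where the name "one-hot" comes from.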

To round things off, rename the clumsy-sounding columns.

In [None]:
df = df.rename(columns={
    "culmen_length_mm": "culmen_length",
    "culmen_depth_mm": "culmen_depth",
    "flipper_length_mm": "flipper_length",
    "body_mass_g": "body_mass",
})

Now, permanently drop one feature you think is irrelevant.

In [None]:
# Put the name of the column you want to drop between the quotes
df.drop("", axis=1, inplace=True)

To finish, save the dataframe as a csv file.

In [None]:
# Add the .csv extension; index=False leaves out the row index column
df.to_csv("penguins_clean.csv", index=False)
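
By default, `to_csv` also writes the row index as an extra unnamed column; passing `index=False` leaves it out. A quick way to check the saved file is to read it back in. The sketch below uses made-up toy data and a temporary directory (the file name `toy_penguins.csv` is a hypothetical choice for illustration):

```python
import os
import tempfile

import pandas as pd

# Made-up toy data for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "culmen_length": [39.1, 46.1],
})

# Write to a temporary directory so no stray files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_penguins.csv")
    toy.to_csv(path, index=False)  # index=False: no extra index column
    restored = pd.read_csv(path)

# The restored frame should have the same columns and number of rows.
print(restored.columns.tolist())
print(restored.shape)
```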

Good work. Well done!