The dataset *penguins* contains a study of the penguin species that inhabit different islands of Antarctica. <br>

Researchers have a few questions about penguins:
- Can we distinguish the sex of the animals from the rest of the data?
- Is the island decisive when it comes to the physical development of penguins?
- Are we able to distinguish penguins from one island or another?
- Could we identify the species with physiological data?


To answer these questions and more, it is necessary to do an **exploratory analysis** of the data provided in the dataset and process it in order to obtain answers and extract enough knowledge to make decisions.

In [None]:
# Load libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

## Load data with pandas

In [None]:
#Create a data frame from the file separated by ","
#penguins = pd.read_csv('penguins_antarctica.csv', sep=',')  
penguins = sns.load_dataset("penguins")

print("Variable type: ", type(penguins))

Our *penguins* variable is already a **pandas** `DataFrame`, so it does not need to be converted and the exploration will be much easier. (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

### Review data (name and type of the columns, first row, last rows, first 5 rows, random sample of 5 rows...)

In [None]:
# Number of original records using the function len() & .shape 
print("Number of records: ", len(penguins))
print("Dataset size: ", penguins.shape)

In [None]:
penguins.columns

In [None]:
penguins.dtypes

We observe 7 columns, which are the **variables** that describe each penguin. Each row represents one penguin and is also called a **record** or **sample**. <br>
Three of the columns hold information in **text format**; the rest are numeric with decimals. The text columns will need additional processing before they can be used in algorithms, which only work with numbers.
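
As a minimal sketch of how the text columns could be told apart from the numeric ones, the following uses `select_dtypes` on a small hand-made frame. The frame `toy` and all of its values are invented for illustration; they are not rows from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame that mimics part of the column layout (values are invented)
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, 46.5, 49.0],
    "body_mass_g": [3750.0, 5000.0, 3800.0],
    "sex": ["Male", "Female", "Female"],
})

# Columns holding numbers vs. columns holding text
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Text:", text_cols)
```

The same two calls should work on `penguins` itself, which makes it easy to decide which columns need encoding later.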


In [None]:
#iloc accesses the data frame by integer position
penguins.iloc[0]

In [None]:
penguins.iloc[-1]

In [None]:
#First 5 rows
penguins.iloc[0:5]

In [None]:
#First 5 rows
penguins.head(5)

In [None]:
#Last 5 rows
penguins.tail(5)

In [None]:
#Random sample of 5 rows
penguins.sample(5)

In [None]:
#We can access one or several columns
penguins.iloc[0:5]['island']

In [None]:
penguins.iloc[0:5].island

In [None]:
penguins.iloc[0:5][['species','sex']]

In [None]:
#loc accesses the data frame by index label
penguins.loc[0]

In [None]:
#The label -1 does not exist in the index, so this raises a KeyError
penguins.loc[-1]

In [None]:
#Check indexes
penguins.index
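
The difference between `iloc` and `loc` is easy to miss while the index is simply 0, 1, 2, and so on. The sketch below uses a hypothetical frame `toy` with invented values and index labels 10, 20, 30 to make the distinction visible.

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0..n-1 (values are invented)
toy = pd.DataFrame(
    {"flipper_length_mm": [181.0, 186.0, 195.0]},
    index=[10, 20, 30],
)

first_by_position = toy.iloc[0]["flipper_length_mm"]   # row at position 0
row_by_label = toy.loc[20]["flipper_length_mm"]        # row whose label is 20
last_by_position = toy.iloc[-1]["flipper_length_mm"]   # negative positions count from the end

# A label that is not present in the index raises KeyError
try:
    toy.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position, row_by_label, last_by_position, label_zero_exists)
```

This matters after filtering or dropping rows, because the remaining rows keep their original labels while their positions change.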

## Description of data with pandas

### Numerical attributes

In [None]:
#Pandas can compute several statistics on columns
penguins['flipper_length_mm'].mean()

In [None]:
penguins['flipper_length_mm'].std()

In [None]:
penguins['flipper_length_mm'].max()

In [None]:
penguins['flipper_length_mm'].min()

In [None]:
#Filtering by condition
penguins[penguins['sex']=='Male']['flipper_length_mm'].min()
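
Filtering and then aggregating works for one group at a time. As a sketch of an alternative, `groupby` can compute the same statistic for every group in a single call. The frame `toy` below is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical data (values are invented)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "flipper_length_mm": [190.0, 185.0, 200.0, 181.0],
})

# Filter by condition, then aggregate one group
male_min = toy[toy["sex"] == "Male"]["flipper_length_mm"].min()

# Aggregate all groups at once
per_sex = toy.groupby("sex")["flipper_length_mm"].agg(["min", "max", "mean"])
print(per_sex)
```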

In [None]:
#Some descriptive statistics of each attribute
penguins.describe()

In [None]:
#Histogram of numerical attribute flipper_length_mm
penguins['flipper_length_mm'].plot.hist(bins=10)

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", bins=10)

In [None]:
sns.histplot(data=penguins, x="flipper_length_mm", hue="species")

### Categorical attributes

In [None]:
penguins['species'].unique()

In [None]:
#Equivalent to
set(penguins['species'])

In [None]:
penguins['island'].unique()

In [None]:
penguins['sex'].unique()

The output for *sex* also shows that some records have **NaN/null** values, so we will need to **clean** the dataset to remove records with missing values.

Let's start with cleaning!

In [None]:
# We evaluate how many records have null values per variable
penguins.isna().sum()

There are few samples with null values, so a priori we can remove them without problems: dropping so few samples does not affect the representativeness of the data.
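
Whether "few" is really few can be checked by measuring the share of rows that contain at least one null. The following is a sketch on a hypothetical frame `toy` with invented values; the `subset` argument of `dropna` is shown as one option when only a specific column must be complete.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries (values are invented)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["Male", None, "Female", None],
})

# Rows that contain at least one null, and the share of the data they represent
rows_with_null = int(toy.isna().any(axis=1).sum())
fraction_lost = rows_with_null / len(toy)
print(f"Rows with at least one null: {rows_with_null} ({fraction_lost:.0%})")

# Drop only the rows where a specific column is missing
only_sex_known = toy.dropna(subset=["sex"])
```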

In [None]:
# We remove null records with the .dropna() function
penguins_clean = penguins.dropna()

# We check how many samples remain
print("Number of records without nulls: ", len(penguins_clean))

In [None]:
# We start to analyze the data. First step: Is there the same ratio of both sexes?
print(penguins_clean.sex.value_counts())

In [None]:
# Pass the data frame and the column name as keywords (required by recent seaborn versions)
sns.countplot(data=penguins_clean, x='sex')

Overall, both sexes appear in practically the same proportion, which is a good start: balanced classes avoid problems with many algorithms.
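
The balance between classes can also be expressed as proportions rather than raw counts. This sketch uses a hypothetical series `sex` with invented values; `imbalance_ratio` is an illustrative name, not a pandas feature.

```python
import pandas as pd

# Hypothetical labels (values are invented)
sex = pd.Series(["Male", "Female", "Male", "Female", "Male"], name="sex")

# Share of each class instead of absolute counts
shares = sex.value_counts(normalize=True)

# Ratio between the most and least frequent class (1.0 means perfectly balanced)
imbalance_ratio = shares.max() / shares.min()
print(shares)
print("Imbalance ratio:", round(imbalance_ratio, 2))
```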

In [None]:
# Second step: Are the variables that describe both sexes differentiable?
sns.pairplot(data = penguins_clean, hue = "sex")

A priori, the features of the penguins (described by the numerical variables) do not seem to differ much between the sexes. Looking at the distributions on the diagonal, several of them appear to be **bimodal** (two peaks), and in the scatter plots the two sexes do not look very separable. <br><br>
Still, a couple of variables seem to help distinguish between the sexes. <br>
For example, the variables *body_mass_g* (vertical axis) and *bill_depth_mm* (horizontal axis) form 2 clearly separated clusters, and within each cluster the sex of the penguin looks quite separable.
<br><br>
The next step is to use the non-numeric variables (island & species) to see whether working with subgroups lets us extract more information.

In [None]:
# How many islands are there and how are the penguins distributed?
penguins_clean.groupby(["sex", "island", "species"]).count()["body_mass_g"]

We observe how the penguins are distributed first by sex, then by island and finally by species. Looking closely, no island hosts all 3 penguin species, which will make it harder to identify the animals.


To simplify the current problem, we will address the question by segmenting it: Are we able to distinguish the sex of the *Adelie* species in individuals from the same island? Will they have different characteristics between the different islands?

The reason for using the *Adelie* species is simple: it is the only one that inhabits the 3 islands.
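
A cross-tabulation makes this kind of check explicit. The sketch below runs on a hypothetical frame `toy` whose rows are invented, arranged only to mirror the pattern described above (one species present on all islands, no island with all species).

```python
import pandas as pd

# Hypothetical rows arranged to mirror the described pattern (not real records)
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"],
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
})

# Rows: islands, columns: species, cells: number of records
counts = pd.crosstab(toy["island"], toy["species"])

# How many species live on each island, and on how many islands each species lives
species_per_island = (counts > 0).sum(axis=1)
islands_per_species = (counts > 0).sum(axis=0)

print(counts)
print(islands_per_species)
```

Applying `pd.crosstab` to the `island` and `species` columns of the cleaned data should give the real counts.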

In [None]:
# We are going to select only the Adelie species
penguins_adelie = penguins_clean[penguins_clean.species == "Adelie"]
print("Penguins of the Adelie species: ", len(penguins_adelie))
penguins_adelie.head()

In [None]:
# Let's now see how well they differ by islands. To do this, choose which island you want to check
sns.pairplot(data = penguins_adelie[penguins_adelie.island == "Torgersen"], hue = "sex")

In [None]:
sns.pairplot(data = penguins_adelie[penguins_adelie.island == "Biscoe"], hue = "sex")

In [None]:
sns.pairplot(data = penguins_adelie[penguins_adelie.island == "Dream"], hue = "sex")

In the case of the island of **Torgersen**: although it cannot be assumed that all penguins will be easily identifiable, the separation between the sexes seems more consistent, especially in the plots that involve the weight (*body_mass_g*).

In [None]:
# We are going to visualize the boxplots by island and sex to evaluate if the variables are differentiable
fig,axes=plt.subplots(2,2,figsize=(20,10))
sns.boxplot(data = penguins_adelie, x="island",y="bill_length_mm", hue = "sex", ax=axes[0, 0])
sns.boxplot(data = penguins_adelie, x="island",y="bill_depth_mm", hue = "sex", ax=axes[0, 1])
sns.boxplot(data = penguins_adelie, x="island",y="flipper_length_mm", hue = "sex", ax=axes[1, 0])
sns.boxplot(data = penguins_adelie, x="island",y="body_mass_g", hue = "sex", ax=axes[1, 1])
plt.tight_layout()

With this last visualization we can answer the initial questions. In most combinations there are very few *outliers*. The variables show reasonable differences by sex and island, but many of the ranges overlap; the hardest case to separate is *flipper_length_mm* on island *Biscoe*.
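
The points a box plot draws as outliers can also be counted numerically. This is a sketch of the usual rule, assuming the default whisker setting of 1.5 times the interquartile range (IQR); the series `values` is hypothetical and its numbers are invented.

```python
import pandas as pd

# Hypothetical measurements with one extreme value (numbers are invented)
values = pd.Series([180.0, 185.0, 187.0, 190.0, 192.0, 195.0, 230.0])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Bounds:", lower, upper)
print("Outliers:", outliers.tolist())
```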

As a final step, we are going to make a small **correlation matrix** to numerically evaluate the linearity between the variables.

In [None]:
# Encoding the variables to numeric to use the function
coded_sex = [1 if value == "Male" else 0 for value in penguins_adelie.sex]
penguins_adelie_num = penguins_adelie.copy()


penguins_adelie_num.sex = coded_sex


In [None]:
# Another method (better for several categories)
penguins_adelie_num = penguins_adelie.copy()
# map() returns an integer column directly; categories missing from the dictionary become NaN
penguins_adelie_num['sex'] = penguins_adelie['sex'].map({'Female': 0, 'Male': 1})

In [None]:
penguins_adelie_num['sex']

In [None]:
# Correlation between the dataset variables with the sex class encoded
plt.figure(figsize = (10, 6))
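# numeric_only=True skips the text columns (species, island), which would otherwise raise an error in pandas 2.x
sns.heatmap(penguins_adelie_num.corr(numeric_only=True), annot = True)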

Finally, the correlation matrix provides us with information on the linearity of the variables. Since we want to identify if a penguin is male or female, we will look at the target variable: **sex**.

In the case of *body_mass_g* we see that it has a linear correlation of almost 0.75, which is considered high, just like *bill_length_mm* and *bill_depth_mm*, which are close to 0.6. Finally, *flipper_length_mm* has little correlation.

These values can be contrasted with the box plots (*boxplots*) made previously. In the case of *flipper_length_mm*, the value ranges of male and female penguins on island *Torgersen* overlap over most of their extent.
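
The link between overlapping ranges and a low correlation with a 0/1 variable can be illustrated on a tiny example. The frame `toy` is hypothetical and its values are invented: the column `separated` has clearly different ranges for the two classes, while `overlapping` does not.

```python
import pandas as pd

# Hypothetical data (values are invented); sex is already encoded as 0/1
toy = pd.DataFrame({
    "sex": [0, 0, 0, 1, 1, 1],
    "separated": [3300.0, 3400.0, 3500.0, 4000.0, 4100.0, 4200.0],
    "overlapping": [185.0, 190.0, 195.0, 186.0, 191.0, 196.0],
})

# Pearson correlation of each variable with the encoded target
corr_with_sex = toy.corr()["sex"].drop("sex")
print(corr_with_sex.round(2))
```

The variable whose class ranges do not overlap shows a correlation close to 1, while the overlapping one stays close to 0.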

With this brief visual analysis we begin to have enough knowledge to apply clustering techniques to this specific case:
- We have visually verified the separability between the sexes of the penguins by working with the **granularity** of the data (depth).
- We have also checked the linearity between the variables and the target variable, which lets us reduce the number of dimensions to be used.

# Exercise

Try to answer one of these questions (e.g., using a correlation matrix):
* Is the island decisive when it comes to the physical development of penguins? (Does being on a particular island affect the physical properties of the penguins?)
* Are we able to distinguish penguins if they come from one island or another?
* Could we identify the species and island based on their physiological data?

Remember that all categorical attributes must be converted to numerical to consider them in the correlation matrix.
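
For attributes with more than two categories, such as the island, one possible encoding is one column per category (one-hot encoding) via `pd.get_dummies`. The sketch below is only one option among several; the frame `toy` is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical data (values are invented)
toy = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Biscoe", "Dream", "Torgersen"],
    "body_mass_g": [4800.0, 3700.0, 3600.0, 5000.0, 3800.0, 3700.0],
})

# One 0/1 column per island; the original text column is replaced
encoded = pd.get_dummies(toy, columns=["island"], dtype=int)

# Every column is now numeric, so the correlation matrix can be computed
corr = encoded.corr()
print(corr["body_mass_g"].round(2))
```

Each `island_*` column can then be read on its own in the correlation matrix, which avoids implying an artificial order between islands.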

--- 
Question 1: Is the island decisive when it comes to the physical development of penguins?

--- 
Question 2: Are we able to distinguish penguins if they come from one island or another?

--- 
Question 3: Could we identify the species and island based on their physiological data?