The *penguins* dataset comes from a study of the penguin species that inhabit different islands of Antarctica.

Researchers have a few questions about penguins:
- Can we distinguish the sex of the animals from the rest of the data?
- Is the island decisive when it comes to the physical development of penguins?
- Are we able to distinguish penguins from one island or another?
- Could we identify the species with physiological data?


To answer these questions and more, we need to carry out an **exploratory analysis** of the dataset and process it to extract enough knowledge to make decisions.

In [1]:
# Load libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

pd.options.mode.chained_assignment = None  # default='warn'

## Load data with pandas

In [2]:
# Create a DataFrame from seaborn's built-in copy of the dataset
# (alternative: read a local comma-separated file)
# penguins = pd.read_csv('penguins_antarctica.csv', sep=',')
penguins = sns.load_dataset("penguins")

print("Variable type: ", type(penguins))

Variable type:  <class 'pandas.core.frame.DataFrame'>


Only a few samples contain null values, so removing them should not cause problems: losing that few samples does not affect how representative the data is.

In [3]:
# Remove records with null values using .dropna()
penguins_clean = penguins.dropna()

# Check how many samples remain
print("Number of records without nulls: ", len(penguins_clean))

Number of records without nulls:  333
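
Before dropping rows, it can help to count the nulls in each column and see how many rows would be lost. The sketch below uses a small made-up DataFrame (`toy`, a hypothetical stand-in for `penguins`) so that it runs without downloading anything; the same calls should apply to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 3800.0],
    "sex": ["Male", None, "Female", None],
})

# Number of nulls in each column
nulls_per_column = toy.isna().sum()
print(nulls_per_column)

# Rows with at least one null, and the fraction of the data they represent
rows_with_nulls = int(toy.isna().any(axis=1).sum())
print("Rows with nulls:", rows_with_nulls, "of", len(toy))
print("Fraction removed by dropna():", rows_with_nulls / len(toy))
```

If the fraction removed is small, as the notebook assumes for the real dataset, dropping those rows is a reasonable first step.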


In [None]:
# Select only the Adelie species (.copy() so that columns can be added later without touching a view)
penguins_adelie = penguins_clean[penguins_clean.species == "Adelie"].copy()
print("Penguins of the Adelie species: ", len(penguins_adelie))
penguins_adelie.head()

# Clustering
We will check whether we can identify the sex of Adelie penguins using the island and the physiological data.

In [5]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np

In [6]:
# Encode the categorical columns as integers so that K-means can work with them
penguins_adelie_num = penguins_adelie.copy()
penguins_adelie_num['island'] = penguins_adelie_num['island'].map({'Biscoe': 0, 'Dream': 1, 'Torgersen': 2})
penguins_adelie_num['species'] = penguins_adelie_num['species'].map({'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2})
penguins_adelie_num['sex'] = penguins_adelie_num['sex'].map({'Female': 0, 'Male': 1})

In [31]:
kmeans = KMeans(2, init='k-means++', n_init=100)
original_sex = penguins_adelie_num.sex
penguins_no_sex = penguins_adelie_num.drop('sex', axis=1)
kmeans.fit(penguins_no_sex)
print(confusion_matrix(original_sex, kmeans.labels_))
# Rows are the true sex, columns are the cluster ids (cluster numbering is arbitrary and may not match the class encoding)

[[66  7]
 [13 60]]
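
A labelled table can make it easier to see which axis holds the true classes and which holds the clusters. As a minimal sketch, `pd.crosstab` builds the same counts with named rows and columns; the two short Series below are made-up values, not the notebook's data.

```python
import pandas as pd

# Hypothetical true labels (0 = Female, 1 = Male) and cluster ids from K-means
true_sex = pd.Series([0, 0, 0, 1, 1, 1], name="sex")
cluster = pd.Series([1, 1, 0, 0, 0, 0], name="cluster")

# Rows: true sex; columns: cluster id
table = pd.crosstab(true_sex, cluster)
print(table)
```

With the notebook's variables, the equivalent call would presumably be `pd.crosstab(original_sex, kmeans.labels_)`.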


In [32]:
print("Accuracy is ", np.round(100*accuracy_score(original_sex, kmeans.labels_), 2), "%")

Accuracy is  86.3 %


In [None]:
# Show the confusion matrix as a heatmap
sns.heatmap(confusion_matrix(original_sex, kmeans.labels_), annot=True, cmap="Blues", fmt="d", cbar=False, annot_kws={"size": 24});

Next, we analyze how the physical attributes differ between the two clusters.

In [10]:
# Add cluster to data frame
penguins_adelie_num['cluster'] = kmeans.labels_

In [None]:
fig,axes=plt.subplots(2,2,figsize=(20,10))
sns.boxplot(data = penguins_adelie_num, x='cluster', y="body_mass_g", ax=axes[0, 0])
sns.boxplot(data = penguins_adelie_num, x='cluster',  y="flipper_length_mm", ax=axes[0, 1])
sns.boxplot(data = penguins_adelie_num, x='cluster',  y="bill_length_mm", ax=axes[1, 0])
sns.boxplot(data = penguins_adelie_num, x='cluster',  y="bill_depth_mm", ax=axes[1, 1])
plt.tight_layout()

We now repeat the clustering using only the attribute most correlated with sex, *body_mass_g*.

In [12]:
# Keep only body mass and sex, then encode sex as an integer
penguins_adelie_filt = penguins_adelie[['body_mass_g', 'sex']]
penguins_adelie_num = penguins_adelie_filt.copy()
penguins_adelie_num['sex'] = penguins_adelie_filt['sex'].map({'Female': 0, 'Male': 1})

In [30]:
kmeans = KMeans(2, init='k-means++', n_init=100)
original_sex = penguins_adelie_num.sex
penguins_no_sex = penguins_adelie_num.drop('sex', axis=1)
kmeans.fit(penguins_no_sex)
print(confusion_matrix(original_sex, kmeans.labels_))
# Rows are the true sex, columns are the cluster ids (cluster numbering is arbitrary and may not match the class encoding)

[[66  7]
 [13 60]]


In [19]:
print("Accuracy is ", np.round(100*accuracy_score(original_sex, kmeans.labels_), 2), "%")

Accuracy is  13.7 %
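
K-means assigns cluster ids arbitrarily, so a raw `accuracy_score` can come out very low simply because cluster 0 happens to correspond to class 1. The sketch below, using only NumPy and a hypothetical helper named `aligned_accuracy`, tries both assignments of cluster ids to classes for a 2x2 confusion matrix and keeps the better one. It uses the matrix printed above and the same matrix with its columns exchanged.

```python
import numpy as np

def aligned_accuracy(cm):
    """Best agreement between classes and clusters for a 2x2 confusion matrix,
    trying both possible assignments of cluster ids to classes."""
    cm = np.asarray(cm)
    total = cm.sum()
    as_is = np.trace(cm) / total             # cluster 0 -> class 0, cluster 1 -> class 1
    swapped = np.trace(cm[:, ::-1]) / total  # cluster 0 -> class 1, cluster 1 -> class 0
    return max(as_is, swapped)

cm = np.array([[66, 7], [13, 60]])  # confusion matrix printed above
cm_swapped = cm[:, ::-1]            # same clustering with the cluster ids exchanged

# Raw accuracy depends on the numbering of the clusters
print(round(100 * np.trace(cm) / cm.sum(), 2))
print(round(100 * np.trace(cm_swapped) / cm_swapped.sum(), 2))

# Aligned accuracy is the same for both numberings
print(round(100 * aligned_accuracy(cm), 2), round(100 * aligned_accuracy(cm_swapped), 2))
```

The two raw values add up to 100%, which is why an accuracy of 13.7% and one of 86.3% can describe the same partition of the data.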


In [None]:
# Show the confusion matrix as a heatmap
sns.heatmap(confusion_matrix(original_sex, kmeans.labels_), annot=True, cmap="Blues", fmt="d", cbar=False, annot_kws={"size": 24});

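The 13.7% accuracy does not mean the clustering got worse. K-means numbers its clusters arbitrarily, and in that run the cluster ids were the opposite of the sex encoding; exchanging them gives 100 - 13.7 = 86.3% agreement, the same as with all attributes. The confusion matrix and the accuracy shown above appear to come from different executions of the cells (note the execution counters), which would explain why they do not match each other. There is therefore no important difference in the results, so we can use the *body_mass_g* attribute alone to identify the sex of Adelie penguins.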
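
One plausible reason the other attributes change so little (an assumption, not something verified in this notebook) is that K-means relies on Euclidean distances over unscaled columns. `body_mass_g` is recorded in grams, so its spread is far larger than that of the measurements in millimetres and it can dominate the distances. The sketch below uses made-up numbers to show how much of the total variance one such column carries before and after standardizing.

```python
import numpy as np

# Hypothetical measurements: columns are body_mass_g and bill_depth_mm
X = np.array([
    [3400.0, 17.5],
    [3600.0, 18.0],
    [4000.0, 19.0],
    [4300.0, 19.5],
])

# Share of the total variance carried by each column before scaling
raw_var = X.var(axis=0)
print(raw_var / raw_var.sum())

# Standardize: subtract the column mean and divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_var = X_std.var(axis=0)
print(std_var / std_var.sum())
```

scikit-learn's `StandardScaler` applies the same transformation; whether scaling would improve the separation of the sexes here would need to be checked on the real data.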

In [16]:
# Add the cluster labels to the original (non-encoded) Adelie data
penguins_adelie['cluster'] = kmeans.labels_

In [None]:
fig,axes=plt.subplots(2,2,figsize=(20,10))
sns.boxplot(data = penguins_adelie, x='cluster', y="body_mass_g", ax=axes[0, 0])
sns.boxplot(data = penguins_adelie, x='cluster',  y="flipper_length_mm", ax=axes[0, 1])
sns.boxplot(data = penguins_adelie, x='cluster',  y="bill_length_mm", ax=axes[1, 0])
sns.boxplot(data = penguins_adelie, x='cluster',  y="bill_depth_mm", ax=axes[1, 1])
plt.tight_layout()