![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data**: Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are **at least three** species that are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**.  Your task is to apply your data science skills to help them identify groups in the dataset!

In [6]:
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

In [7]:
# Loading and examining the dataset
penguins_df_ = pd.read_csv("penguins.csv")

In [8]:
penguins_df_.shape

(332, 5)

In [9]:
penguins_df_.head()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,MALE
1,39.5,17.4,186.0,3800.0,FEMALE
2,40.3,18.0,195.0,3250.0,FEMALE
3,36.7,19.3,193.0,3450.0,FEMALE
4,39.3,20.6,190.0,3650.0,MALE


In [10]:
penguins_df_.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 332 entries, 0 to 331
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   culmen_length_mm   332 non-null    float64
 1   culmen_depth_mm    332 non-null    float64
 2   flipper_length_mm  332 non-null    float64
 3   body_mass_g        332 non-null    float64
 4   sex                332 non-null    object 
dtypes: float64(4), object(1)
memory usage: 13.1+ KB


In [11]:
penguins_df_.describe()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,332.0,332.0,332.0,332.0
mean,44.021084,17.153012,200.975904,4206.475904
std,5.452462,1.960275,14.035971,806.361278
min,32.1,13.1,172.0,2700.0
25%,39.5,15.6,190.0,3550.0
50%,44.7,17.3,197.0,4025.0
75%,48.625,18.7,213.0,4781.25
max,59.6,21.5,231.0,6300.0


In [12]:
# One-Hot Encoding
penguins_df = pd.get_dummies(penguins_df_, columns=['sex'], drop_first=True)

In [13]:
# Select only numeric columns for scaling
numeric_cols = penguins_df.select_dtypes(include=['float64']).columns

# Scale Numeric Features
scaler = StandardScaler()
penguins_df[numeric_cols] = scaler.fit_transform(penguins_df[numeric_cols])
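
The preprocessing above can be checked on a small sample. The sketch below reuses the five rows shown by `head()` and assumes the same two steps (one-hot encoding with `drop_first=True`, then `StandardScaler` on the `float64` columns). The names `sample`, `encoded`, `col_means`, and `col_stds` are illustrative only. It suggests that the four measurement columns end up with mean 0 and unit (population) standard deviation, while the `sex_MALE` indicator is left unscaled.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five rows of the dataset, copied from the head() output above
sample = pd.DataFrame({
    "culmen_length_mm": [39.1, 39.5, 40.3, 36.7, 39.3],
    "culmen_depth_mm": [18.7, 17.4, 18.0, 19.3, 20.6],
    "flipper_length_mm": [181.0, 186.0, 195.0, 193.0, 190.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0, 3650.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "FEMALE", "MALE"],
})

# One-hot encode sex; drop_first=True keeps a single indicator column (sex_MALE)
encoded = pd.get_dummies(sample, columns=["sex"], drop_first=True)

# Scale only the float64 measurement columns; the indicator is not float64
numeric_cols = encoded.select_dtypes(include=["float64"]).columns
encoded[numeric_cols] = StandardScaler().fit_transform(encoded[numeric_cols])

# StandardScaler uses the population standard deviation (ddof=0)
col_means = encoded[numeric_cols].mean()
col_stds = encoded[numeric_cols].std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Because the indicator column keeps its 0/1 range while the other columns are standardized, it contributes to the K-means distances on a different scale than the measurements.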

In [14]:
kmeans = KMeans(n_clusters=3, random_state=77)
kmeans.fit(penguins_df)

silhouette = silhouette_score(penguins_df, kmeans.labels_)
calinski_harabasz = calinski_harabasz_score(penguins_df, kmeans.labels_)
davies_bouldin = davies_bouldin_score(penguins_df, kmeans.labels_)

print("Metrics for KMeans without PCA:")
print(f"Silhouette Score: {silhouette}")
print(f"Calinski-Harabasz Index: {calinski_harabasz}")
print(f"Davies-Bouldin Index: {davies_bouldin}")

Metrics for KMeans without PCA:
Silhouette Score: 0.40911008114902186
Calinski-Harabasz Index: 355.8496766414667
Davies-Bouldin Index: 1.0482457144005717
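
The value `n_clusters=3` follows from the three species named in the task, but the brief only says "at least three". One possible check, sketched below, is to scan several values of k and compare silhouette scores. This is a sketch under assumptions: the data comes from `make_blobs` as a synthetic stand-in for the scaled penguin features, and the range 2 to 6 and `n_init=10` are arbitrary choices, not settings taken from the notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled features (assumption: NOT the penguin data)
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit K-means for each candidate k and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=77).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette is better; pick the k with the highest score
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

On the real data, replacing `X` with `penguins_df` would give the equivalent scan; the result should be read together with domain knowledge rather than as a definitive answer.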


In [15]:
# Try every possible number of principal components (the encoded data has 5 columns)
for i in range(1, 6):
    pca = PCA(n_components=i)
    pca_components = pca.fit_transform(penguins_df)
    pca_df = pd.DataFrame(pca_components)

    kmeans = KMeans(n_clusters=3, random_state=77)
    kmeans.fit(pca_df)

    # Evaluate the model
    silhouette = silhouette_score(pca_df, kmeans.labels_)
    calinski_harabasz = calinski_harabasz_score(pca_df, kmeans.labels_)
    davies_bouldin = davies_bouldin_score(pca_df, kmeans.labels_)
    
    print(f"Metrics for {i} components:")
    print(f"Silhouette Score: {silhouette}")
    print(f"Calinski-Harabasz Index: {calinski_harabasz}")
    print(f"Davies-Bouldin Index: {davies_bouldin}\n")

Metrics for 1 components:
Silhouette Score: 0.6323747388153108
Calinski-Harabasz Index: 1602.8750003912678
Davies-Bouldin Index: 0.46136690639980776

Metrics for 2 components:
Silhouette Score: 0.5353707233489186
Calinski-Harabasz Index: 415.44542136139233
Davies-Bouldin Index: 0.5896062567806731

Metrics for 3 components:
Silhouette Score: 0.47764654767553144
Calinski-Harabasz Index: 309.6771286296555
Davies-Bouldin Index: 0.640493096746233

Metrics for 4 components:
Silhouette Score: 0.4215722142849081
Calinski-Harabasz Index: 375.6811568158796
Davies-Bouldin Index: 1.0087603528316331

Metrics for 5 components:
Silhouette Score: 0.4091100811490217
Calinski-Harabasz Index: 355.8496766414668
Davies-Bouldin Index: 1.048245714400572
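
Two observations may help when reading these numbers. With all 5 components the metrics match the run without PCA, which is expected because a full PCA only rotates (and centers) the data and leaves Euclidean distances unchanged. Also, each run is scored in its own reduced space, so scores across different numbers of components are not directly comparable, and the high 1-component score may partly reflect the lower dimensionality. A complementary check is the explained variance ratio. The sketch below uses randomly generated data as a stand-in (an assumption, not the penguin data), and the 95% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]
X = StandardScaler().fit_transform(X)

# Fit PCA with all components and inspect how much variance each one explains
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches 95%
n_components_95 = int(np.searchsorted(cumulative, 0.95) + 1)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: explained={r:.3f}, cumulative={c:.3f}")
print(f"Components needed for 95% variance: {n_components_95}")
```

On the notebook's data, the same attribute is available after `PCA().fit(penguins_df)`.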



In [16]:
pca = PCA(n_components=1)
pca_components = pca.fit_transform(penguins_df)
pca_df = pd.DataFrame(pca_components)

kmeans = KMeans(n_clusters=3, random_state=77)
kmeans.fit(pca_df)

# Evaluate the model
silhouette = silhouette_score(pca_df, kmeans.labels_)
calinski_harabasz = calinski_harabasz_score(pca_df, kmeans.labels_)
davies_bouldin = davies_bouldin_score(pca_df, kmeans.labels_)

print("Metrics for 1 components:")
print(f"Silhouette Score: {silhouette}")
print(f"Calinski-Harabasz Index: {calinski_harabasz}")
print(f"Davies-Bouldin Index: {davies_bouldin}\n")

Metrics for 1 components:
Silhouette Score: 0.6323747388153108
Calinski-Harabasz Index: 1602.8750003912678
Davies-Bouldin Index: 0.46136690639980776



In [21]:
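# Attach the cluster labels to the original (unscaled) data
penguins_df_['cluster'] = kmeans.labels_
# Drop the non-numeric column so that per-cluster means can be computed
penguins_df_.drop(['sex'], axis=1, inplace=True)
stat_penguins = penguins_df_.groupby('cluster').mean()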

In [22]:
stat_penguins

cluster,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,47.609322,15.007627,217.313559,5102.118644
1,38.469748,18.17479,187.882353,3511.97479
2,46.517895,18.537895,197.084211,3963.947368
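
Cluster means are easier to interpret next to cluster sizes, since a mean over a handful of rows carries less weight than one over a hundred. The sketch below shows one way to get both in a single table with named aggregation. The frame `toy` and its values are hypothetical and only illustrate the call; they are not the real measurements.

```python
import pandas as pd

# Hypothetical toy values for illustration, NOT the real penguin measurements
toy = pd.DataFrame({
    "flipper_length_mm": [217.0, 215.0, 188.0, 190.0, 186.0, 197.0],
    "body_mass_g": [5100.0, 5000.0, 3500.0, 3600.0, 3400.0, 3950.0],
    "cluster": [0, 0, 1, 1, 1, 2],
})

# One row per cluster: number of penguins plus the mean of each measurement
summary = toy.groupby("cluster").agg(
    n=("body_mass_g", "size"),
    mean_flipper_mm=("flipper_length_mm", "mean"),
    mean_mass_g=("body_mass_g", "mean"),
)
print(summary)
```

Applied to `penguins_df_`, the same pattern would extend `stat_penguins` with a count column.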
