![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data** : Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are **at least three** species that are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**.  Your task is to apply your data science skills to help them identify groups in the dataset!

In [14]:
# Import Required Packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Loading and examining the dataset
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,39.1,18.7,181.0,3750.0,MALE
1,39.5,17.4,186.0,3800.0,FEMALE
2,40.3,18.0,195.0,3250.0,FEMALE
3,36.7,19.3,193.0,3450.0,FEMALE
4,39.3,20.6,190.0,3650.0,MALE


In [15]:
#Drop sex column
penguins_df = penguins_df.drop('sex', axis= 1)

penguins_df.head()

,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
3,36.7,19.3,193.0,3450.0
4,39.3,20.6,190.0,3650.0


In [16]:
#Instantiate StandardScaler object
scaler = StandardScaler()

#Fit the scaler to the dataset
scaler.fit(penguins_df)

#Transform the dataset into scaled values (returns a NumPy array, not a DataFrame)
scaled_penguins_df = scaler.transform(penguins_df)
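To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes, using only the flipper length and body mass values from the first four rows shown earlier. The variable names `X`, `X_scaled`, and `X_manual` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Flipper length (mm) and body mass (g) from the first four rows of the head() output
X = np.array([[181.0, 3750.0],
              [186.0, 3800.0],
              [195.0, 3250.0],
              [193.0, 3450.0]])

# Fit and transform in a single call
X_scaled = StandardScaler().fit_transform(X)

# Manual equivalent: subtract each column's mean and divide by its
# population standard deviation (ddof=0, which is NumPy's default)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Both routes should agree, and each scaled column has mean 0 and std 1
print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Scaling matters here because body mass is recorded in grams and would otherwise dominate the Euclidean distances that K-means relies on.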


In [17]:
#Plot inertia to define the number of clusters
#ks = range(1,6)
#inertias = []

#for k in ks:
    #Instantiate KMeans model with k number of clusters
    #kMeans = KMeans(n_clusters = k)
    
    #Fit kMeans model
    #kMeans.fit(scaled_penguins_df)
    
    #Append inertias
    #inertias.append(kMeans.inertia_)

#Plot inertias to determine the best number of clusters
#plt.plot(ks, inertias, '-o')
#plt.xlabel('number of clusters, k')
#plt.ylabel('inertia')
#plt.xticks(ks)
#plt.show()
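The cell above is commented out, so as a rough illustration, the same elbow calculation can be run on synthetic data. This sketch assumes a stand-in dataset from `make_blobs` (three groups, four features) instead of `penguins.csv`, and the `n_init` and `random_state` values are arbitrary choices made for repeatability.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the penguin measurements (assumption: 3 groups, 4 features)
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

ks = range(1, 7)
inertias = []
for k in ks:
    # Fit one model per candidate k and record its within-cluster sum of squares
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

# Look for the k after which inertia stops dropping sharply (the "elbow")
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

The `ks` and `inertias` lists have the same shape as in the commented-out cell, so the same `plt.plot(ks, inertias, '-o')` call could be used to draw the curve.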

In [18]:
#Instantiate KMeans with 3 clusters
kMeans = KMeans(n_clusters= 3)

#Create a pipeline that chains the two steps: standardizing and clustering
pipeline = make_pipeline(scaler, kMeans)

#Fit the pipeline
pipeline.fit(penguins_df)

#Calculate the cluster labels
labels = pipeline.predict(penguins_df)

print(labels)


[1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 2 1
 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 1
 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
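The label numbers printed above are arbitrary: K-means starts from random centroids, so a rerun may assign different numbers to the same groups. A minimal sketch of one way to make runs repeatable and to count cluster sizes follows. It assumes synthetic data from `make_blobs`, and the helper `cluster` and the seed value are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 3 groups, 4 features)
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

def cluster(X, seed):
    """Scale, then cluster; a fixed seed makes the labels repeatable."""
    pipeline = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=3, n_init=10, random_state=seed),
    )
    # fit_predict fits the pipeline and returns the labels in one call
    return pipeline.fit_predict(X)

labels_a = cluster(X, seed=0)
labels_b = cluster(X, seed=0)

# Same seed, same labels
print(np.array_equal(labels_a, labels_b))

# Number of samples assigned to each cluster
print(np.bincount(labels_a))
```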


In [28]:
#Add cluster labels to the df
penguins_df['Cluster'] = labels

#Select only numeric columns for summarizing characteristics
numeric_cols = penguins_df.select_dtypes(include=['number'])

#Group by 'Cluster' and calculate the mean for each cluster
stat_penguins = numeric_cols.groupby('Cluster').mean().round(2)

#Convert body mass from grams to kilograms (the column keeps its body_mass_g name)
stat_penguins['body_mass_g'] = round(stat_penguins['body_mass_g']/1000, 2)

In [29]:
stat_penguins.head()

Cluster,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,47.57,15.0,217.24,5.09
1,38.31,18.1,188.55,3.59
2,47.66,18.75,196.92,3.9
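The per-cluster means can also be cross-checked against the fitted centroids. Once K-means has converged, each centroid is the mean of its members in scaled space, so mapping the centroids back through the scaler should give approximately the same table as the `groupby` approach. The sketch below assumes synthetic, well-separated data with hypothetical group centers; only the column names are taken from the dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cols = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]

# Synthetic, well-separated stand-in data (the group centers are hypothetical)
group_centers = [[0, 0, 0, 0], [10, 10, 10, 10], [-10, 10, -10, 10]]
X, _ = make_blobs(n_samples=150, centers=group_centers, cluster_std=1.0, random_state=0)
df = pd.DataFrame(X, columns=cols)

pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipeline.fit_predict(df)

# make_pipeline names each step after its lowercased class name
scaler = pipeline.named_steps["standardscaler"]
kmeans = pipeline.named_steps["kmeans"]

# Centroids live in scaled space; map them back to the original units
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_), columns=cols)

# Per-cluster means computed directly from the unscaled data
means = df.groupby(labels).mean()

print(centers.round(2))
print(means.round(2))
```

If the two tables disagree noticeably, the model may have stopped before the assignments settled, which is worth checking before interpreting the clusters.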
