## K-means inertia & silhouette score  

Do penguins of the same species exhibit different physical characteristics based on sex?

In [1]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


import seaborn as sns

In [5]:
df = pd.read_csv('https://gist.githubusercontent.com/slopp/ce3b90b9168f2f921784de84fa445651/raw/4ecf3041f0ed4913e7c230758733948bc561f434/penguins.csv')

In [6]:
df.head(10)

| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | 4 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | 6 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
| 6 | 7 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female | 2007 |
| 7 | 8 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male | 2007 |
| 8 | 9 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN | 2007 |
| 9 | 10 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN | 2007 |


## Data Exploration

This section explores the data, checks for missing values, encodes the `sex` column, drops the `island` column, and scales the features with `StandardScaler`.

In [9]:
df['species'].nunique()

3

In [21]:
# Run after dropna() (see the execution count), so these counts exclude rows with missing values
df['species'].value_counts()

species
Adelie       146
Gentoo       119
Chinstrap     68
Name: count, dtype: int64

In [25]:
df['sex'].value_counts()

sex
male      168
female    165
Name: count, dtype: int64

In [24]:
df['island'].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [17]:
df.isnull().sum()

rowid                 0
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

In [18]:
df = df.dropna()

In [19]:
df.isnull().sum()

rowid                0
species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
year                 0
dtype: int64
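To make the effect of `dropna()` concrete, here is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the penguin data). It counts how many rows are removed and shows that the surviving rows keep their original index labels, which is why gaps can appear in the index after this step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguin data
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["male", "female", None, "female"],
})

rows_before = len(toy)
toy_clean = toy.dropna()                 # drops every row with at least one missing value
rows_dropped = rows_before - len(toy_clean)

print(rows_dropped)                      # how many rows were removed
print(toy_clean.index.tolist())          # original index labels are kept, not renumbered
```

If a contiguous index is needed later, `toy_clean.reset_index(drop=True)` renumbers the rows from zero.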

In [20]:
df.head(10)

| | rowid | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | 2 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | 3 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 4 | 5 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |
| 5 | 6 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | male | 2007 |
| 6 | 7 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | female | 2007 |
| 7 | 8 | Adelie | Torgersen | 39.2 | 19.6 | 195.0 | 4675.0 | male | 2007 |
| 12 | 13 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | female | 2007 |
| 13 | 14 | Adelie | Torgersen | 38.6 | 21.2 | 191.0 | 3800.0 | male | 2007 |
| 14 | 15 | Adelie | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | male | 2007 |


## Encode data

In [27]:
# Work on an explicit copy so the column assignment does not raise SettingWithCopyWarning
df = df.copy()
df['sex'] = df['sex'].str.upper()


In [28]:
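# Encode only the 'sex' column. Assigning through a boolean mask on the whole
# frame (df[mask] = 1) would overwrite every column in the matching rows.
df['sex'] = df['sex'].map({'MALE': 1, 'FEMALE': 0})

Dropping column 'island'

In [33]:
# drop() returns a new DataFrame, so assign the result back
df = df.drop('island', axis=1)
df.head()

| | rowid | species | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Adelie | 39.1 | 18.7 | 181.0 | 3750.0 | 1 | 2007 |
| 1 | 2 | Adelie | 39.5 | 17.4 | 186.0 | 3800.0 | 0 | 2007 |
| 2 | 3 | Adelie | 40.3 | 18.0 | 195.0 | 3250.0 | 0 | 2007 |
| 4 | 5 | Adelie | 36.7 | 19.3 | 193.0 | 3450.0 | 0 | 2007 |
| 5 | 6 | Adelie | 39.3 | 20.6 | 190.0 | 3650.0 | 1 | 2007 |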
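When encoding a categorical column, it matters whether the assignment targets that one column or the whole frame. The sketch below contrasts the two on a small made-up frame (the `toy` DataFrame and the names `encoded` and `overwritten` are illustrative assumptions): mapping the `sex` column changes only that column, while a boolean-mask assignment on the frame replaces every column in the matching rows.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the penguin data
toy = pd.DataFrame({
    "body_mass_g": [3750.0, 3800.0, 3250.0],
    "sex": pd.Series(["MALE", "FEMALE", "FEMALE"], dtype=object),
})

# Column-level encoding: only 'sex' changes, the measurements are untouched
encoded = toy.copy()
encoded["sex"] = encoded["sex"].map({"MALE": 1, "FEMALE": 0})

# Frame-level mask assignment: all columns of the matching rows become 1
overwritten = toy.copy()
overwritten[overwritten["sex"] == "MALE"] = 1

print(encoded)
print(overwritten)
```

An equivalent column-level form is `toy.loc[toy["sex"] == "MALE", "sex"] = 1`, which names the target column explicitly.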


## Scaling features

In [34]:
x = df.drop('species', axis=1)

In [35]:
x_scaled = StandardScaler().fit_transform(x)
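K-means relies on Euclidean distances, so a feature measured in grams would otherwise dominate one measured in millimetres. As a quick sanity check of what `StandardScaler` does, the sketch below scales a few rows of bill length and body mass (the `features` array is an illustrative stand-in using values from the first rows shown earlier) and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in: bill length (mm) and body mass (g) for four penguins
features = np.array([
    [39.1, 3750.0],
    [39.5, 3800.0],
    [40.3, 3250.0],
    [36.7, 3450.0],
])

scaled = StandardScaler().fit_transform(features)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)   # population std (ddof=0), matching StandardScaler

print(np.allclose(col_means, 0.0))
print(np.allclose(col_stds, 1.0))
```

The same check can be run on `x_scaled` to verify the real feature matrix before clustering.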

## Data modeling