![Alt text](https://imgur.com/orZWHly.png=80)
source: @allison_horst https://github.com/allisonhorst/penguins

You have been asked to support a team of researchers who have been collecting data about penguins in Antarctica! The data is available in CSV format as `penguins.csv`.

**Origin of this data:** Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

**The dataset consists of 5 columns.**

Column | Description
--- | ---
culmen_length_mm | culmen length (mm)
culmen_depth_mm | culmen depth (mm)
flipper_length_mm | flipper length (mm)
body_mass_g | body mass (g)
sex | penguin sex

Unfortunately, they have not been able to record the species of penguin, but they know that there are **at least three** species that are native to the region: **Adelie**, **Chinstrap**, and **Gentoo**.  Your task is to apply your data science skills to help them identify groups in the dataset!

Use your unsupervised learning skills to find clusters in the penguins dataset!

Import, investigate and pre-process the "penguins.csv" dataset.
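
Investigating and pre-processing usually means checking column types and missing values, then turning the categorical `sex` column into numbers. The following is a minimal sketch, assuming a tiny hand-made DataFrame (`toy_df`, with made-up values) in place of `penguins.csv`, and using one-hot encoding through `pd.get_dummies` as one possible encoding choice:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for penguins.csv (values invented for illustration)
toy_df = pd.DataFrame({
    "culmen_length_mm": [39.1, 39.5, np.nan, 46.1],
    "culmen_depth_mm": [18.7, 17.4, 18.0, 13.2],
    "flipper_length_mm": [181.0, 186.0, 195.0, 211.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 4500.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})

# Investigate: column types and number of missing values per column
print(toy_df.dtypes)
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Pre-process: drop incomplete rows, then one-hot encode the categorical column
clean_df = toy_df.dropna().reset_index(drop=True)
encoded_df = pd.get_dummies(clean_df, columns=["sex"], dtype=int)
print(encoded_df.head())
```

Whether to drop or impute incomplete rows depends on how many there are; dropping is only the simplest option.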

Perform a cluster analysis based on a reasonable number of clusters and collect the average values for the clusters. The output should be a DataFrame named stat_penguins with one row per cluster that shows the mean of the original variables (or columns in "penguins.csv") by cluster. stat_penguins should not include any non-numeric columns.
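
One common way to settle on a "reasonable number of clusters" is an elbow analysis: fit K-means for several values of k and look at where the inertia stops dropping sharply. The sketch below is an illustration under assumptions, not part of the original notebook: it uses synthetic data from `make_blobs` instead of the penguin measurements, and the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 3 groups (an assumption; swap in the scaled penguin features)
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-means for a range of k and record the inertia (within-cluster sum of squares)
k_values = list(range(1, 8))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` and looking for the bend in the curve is the usual next step.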

In [6]:
# Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Load the data (assumes penguins.csv is in the working directory)
penguins_df = pd.read_csv("penguins.csv")
penguins_df.head()

index | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex
--- | --- | --- | --- | --- | ---
0 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE
1 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE
2 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE
3 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE
4 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE


In [7]:
# Create a column that holds the sex information as numeric data
# (map avoids the downcasting warning that replace raises in recent pandas versions)
penguins_df['sex_num'] = penguins_df['sex'].map({'MALE': 1, 'FEMALE': 0})
penguins_df.head()

index | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | sex_num
--- | --- | --- | --- | --- | --- | ---
0 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | 1
1 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 0
2 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 0
3 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 0
4 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE | 1


In [8]:
# Build the pipeline and train the model

# Build the feature matrix to work with (every column except the text column 'sex')
samples = penguins_df.drop('sex', axis=1).values

# Create a pipeline that scales the data and then applies K-means with 3 clusters
# (no random_state is set, so the cluster numbering can change between runs)
scaler = StandardScaler()
kmeans = KMeans(n_clusters=3)
pipeline = make_pipeline(scaler, kmeans)

# Fit the pipeline to the data (scale the features, then cluster them)
pipeline.fit(samples)

In [12]:
# Predict the clusters and add them to the DataFrame

# Predict the cluster for each sample
labels = pipeline.predict(samples)

# Add the predicted labels to the original DataFrame
penguins_df['labels'] = labels
penguins_df.head()

index | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex | sex_num | labels
--- | --- | --- | --- | --- | --- | --- | ---
0 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | 1 | 0
1 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 0 | 1
2 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 0 | 1
3 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 0 | 1
4 | 39.3 | 20.6 | 190.0 | 3650.0 | MALE | 1 | 0


In [11]:
# Compute statistics per cluster

# Group the data by cluster label and compute the mean of the selected features for each cluster
stat_penguins = penguins_df.groupby('labels')\
[['culmen_length_mm', 'culmen_depth_mm','flipper_length_mm', 'body_mass_g']].mean()

stat_penguins

labels | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g
--- | --- | --- | --- | ---
0 | 43.878302 | 19.111321 | 194.764151 | 4006.603774
1 | 40.217757 | 17.611215 | 189.046729 | 3419.158879
2 | 47.568067 | 14.996639 | 217.235294 | 5092.436975
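
The task statement asks for one row per cluster and no non-numeric columns in `stat_penguins`. A quick check along the following lines could confirm both requirements. This is a sketch under assumptions: `toy_df` and `toy_stats` are hypothetical stand-ins with made-up values, not the notebook's actual data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical labelled data standing in for penguins_df after clustering
toy_df = pd.DataFrame({
    "culmen_length_mm": [39.1, 39.5, 46.1, 47.0],
    "body_mass_g": [3750.0, 3800.0, 4500.0, 5000.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "labels": [0, 0, 1, 1],
})

# Mean of the numeric measurement columns per cluster label
numeric_cols = ["culmen_length_mm", "body_mass_g"]
toy_stats = toy_df.groupby("labels")[numeric_cols].mean()

# Requirement 1: one row per cluster
assert len(toy_stats) == toy_df["labels"].nunique()
# Requirement 2: only numeric columns
assert all(is_numeric_dtype(toy_stats[col]) for col in toy_stats.columns)
print(toy_stats)
```

The same two assertions can be run against `stat_penguins` and `penguins_df['labels']` directly.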
