# Monday 2/10 Exercise

The **goal** of this exercise is to measure how much the penguin species differ on average and to determine which species are the most similar and which are the most different.

In [32]:
import pandas as pd
from scipy.spatial.distance import pdist, squareform

In [None]:
data_url = "https://raw.githubusercontent.com/mcnakhaee/palmerpenguins/refs/heads/master/palmerpenguins/data/penguins.csv"
penguins = pd.read_csv(data_url)
speciesList = penguins["species"].unique()

### Preprocessing
What attributes do we need to compare penguin species in the dataset?

- bill_length_mm
- bill_depth_mm
- flipper_length_mm
- body_mass_g
- species (because we need that as a label)

_Note_: to pull out all the rows for a chosen set of columns in pandas, we pass a list of column names inside the indexing brackets:
`df[["attr1", "attr2", ..., "attrn"]]`
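
As a minimal sketch of the syntax above, the following uses a small made-up table (the `toy` frame and its values are illustrative assumptions, not part of the penguin dataset) to show the difference between double and single brackets.

```python
import pandas as pd

# Toy table with made-up values, only to illustrate the indexing syntax.
toy = pd.DataFrame({
    "species": ["A", "B", "C"],
    "bill_length_mm": [39.1, 46.5, 49.0],
    "island": ["X", "Y", "Z"],
})

# Double brackets: a list of column names returns a DataFrame
# containing every row but only the listed columns.
subset = toy[["species", "bill_length_mm"]]

# Single brackets: one column name returns a Series.
single = toy["species"]

print(type(subset).__name__, list(subset.columns))
print(type(single).__name__)
```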

In [None]:
attrs = ["species", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
phys_attrs = penguins[attrs]

phys_attrs

_Note_: there are some NaN values present in the output. These can be removed with `.dropna()`, though dropping NaN values isn't always the right thing to do for a dataset.
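
To illustrate why dropping is a judgment call, here is a small sketch on a made-up table (the `toy` frame is an assumption for illustration). Calling `.dropna()` with no arguments removes a row if any column is missing, while the `subset` parameter restricts the check to the columns that matter. Selecting the columns of interest first, as this exercise does, has a similar effect.

```python
import numpy as np
import pandas as pd

# Made-up rows: one is missing a measurement, one is missing an unrelated field.
toy = pd.DataFrame({
    "species": ["A", "A", "B"],
    "bill_length_mm": [39.1, np.nan, 46.5],
    "sex": ["male", "female", None],
})

# Drop every row that has a missing value in any column.
strict = toy.dropna()

# Drop only the rows that are missing the measurement we actually use.
targeted = toy.dropna(subset=["bill_length_mm"])

print(len(toy), len(strict), len(targeted))
```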

In [None]:
# The NaN values can be removed
phys_attrs = phys_attrs.dropna()

phys_attrs

### Summarize the penguins by species
Since we want to measure the difference between the penguin species as a whole, we calculate the average of each attribute per species. No for-loops are required.

Breakdown of code:
- `groupby` allows you to perform an operation on a group of rows, according to an attribute
- `mean` is just the average (can be swapped out with `min`, `max`, or `median`)

_Note_: after grouping, `species` is a **key** that identifies the individual **tuples** (rows) of the summary table.
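
The breakdown above can be sketched on a tiny made-up table (the `toy` values are illustrative assumptions, not real measurements), which makes it easy to check the per-group averages by hand.

```python
import pandas as pd

# Two made-up groups with two rows each.
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B"],
    "bill_length_mm": [38.0, 40.0, 46.0, 50.0],
    "body_mass_g": [3500.0, 3700.0, 5000.0, 5400.0],
})

# One row per group; the grouping column becomes the index (the key).
avg = toy.groupby("species").mean()

# Swapping the aggregation only changes the final method call.
med = toy.groupby("species").median()

print(avg)
print(avg.index.tolist())
```

Note that `mean` must be called with parentheses; without them the expression refers to the method itself rather than to the table of averages.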

In [None]:
avg_pen_attr = phys_attrs.groupby("species").mean()
avg_pen_attr

### Calculate our distances
We convert our tabular data to a matrix. Each row of the matrix represents one of the species as a point in a four-dimensional space, with one dimension per attribute.

In [None]:
avg_pen_mat = avg_pen_attr.values
pen_dist = pdist(avg_pen_mat, metric="euclidean")
pen_dist_mat = squareform(pen_dist)

pen_dist_mat
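
The raw distance matrix has no labels, so it can be hard to tell which entry belongs to which pair. The sketch below uses made-up two-dimensional points (the `toy_avg` frame and its group names are assumptions for illustration) to show how the condensed output of `pdist` lines up with pairs of rows and how `squareform` expands it into a labelled table.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up averages for three hypothetical groups (not real penguin data).
toy_avg = pd.DataFrame(
    {"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 1.0]},
    index=["A", "B", "C"],
)

# pdist returns the condensed form: one distance per unordered pair of rows,
# in the order (A, B), (A, C), (B, C).
condensed = pdist(toy_avg.to_numpy(), metric="euclidean")

# squareform expands it into a symmetric matrix with zeros on the diagonal.
square = pd.DataFrame(
    squareform(condensed), index=toy_avg.index, columns=toy_avg.index
)

# combinations yields pairs in the same order as the condensed distances.
pairs = list(combinations(toy_avg.index, 2))
closest = pairs[condensed.argmin()]
farthest = pairs[condensed.argmax()]

print(square)
print("closest:", closest, "farthest:", farthest)
```

The same labelling idea should carry over to the species averages by reusing the index of the grouped table for both the rows and the columns.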

Cliffhanger: this data still needs to be **normalized**!
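
Normalization matters here because `body_mass_g` is measured in thousands of grams while the other attributes are tens or hundreds of millimeters, so mass dominates a Euclidean distance on the raw values. The exercise does not say which normalization to use or at which step to apply it. The sketch below assumes z-score standardization of each column, which is one common choice, and applies it to a made-up table (the `toy` values are illustrative assumptions).

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Made-up values: B and C have nearly the same bill length,
# while A and C have nearly the same body mass.
toy = pd.DataFrame(
    {
        "bill_length_mm": [39.0, 48.0, 47.0],
        "body_mass_g": [3700.0, 5100.0, 3750.0],
    },
    index=["A", "B", "C"],
)

# Distances on raw values are driven almost entirely by body mass.
raw = pd.DataFrame(squareform(pdist(toy.to_numpy())), index=toy.index, columns=toy.index)

# Assumed normalization: z-score each column (subtract the column mean,
# divide by the column standard deviation).
z = (toy - toy.mean()) / toy.std()

# After scaling, each attribute contributes on a comparable scale.
scaled = pd.DataFrame(squareform(pdist(z.to_numpy())), index=toy.index, columns=toy.index)

print(raw.round(1))
print(scaled.round(2))
```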