# Data Cleaning
Ingest the raw `penguins_lter.csv` data file and prepare it for further analysis.

`penguins_lter.csv` includes size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program.

**Gorman KB, Williams TD, Fraser WR** (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. https://doi.org/10.1371/journal.pone.0090081

## Subset Data

In [99]:
import pandas as pd
raw_penguins_df = pd.read_csv("../data/raw/penguins_lter.csv")
variables_to_keep = ["Species", 
                     "Island", 
                     "Flipper Length (mm)",
                     "Culmen Length (mm)",
                     "Culmen Depth (mm)",
                     "Body Mass (g)", 
                     "Sex"]
# Take an explicit copy so later in-place edits do not raise SettingWithCopyWarning
penguins_df = raw_penguins_df[variables_to_keep].copy()
penguins_df.head()

Unnamed: 0,Species,Island,Flipper Length (mm),Culmen Length (mm),Culmen Depth (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,181.0,39.1,18.7,3750.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,186.0,39.5,17.4,3800.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,195.0,40.3,18.0,3250.0,FEMALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,,,,,
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,193.0,36.7,19.3,3450.0,FEMALE
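
As an alternative sketch, the same subset can be taken while reading the file by passing the column list to the `usecols` parameter of `pd.read_csv`. The in-memory CSV text and the `demo_` names below are hypothetical stand-ins, not the real data file.

```python
import io

import pandas as pd

# Hypothetical three-column stand-in for the raw CSV (not the real file).
demo_csv = (
    "Species,Island,Comments\n"
    "Adelie Penguin (Pygoscelis adeliae),Torgersen,example note\n"
)

demo_columns = ["Species", "Island"]

# usecols restricts parsing to the listed columns, so "Comments" is never loaded.
demo_df = pd.read_csv(io.StringIO(demo_csv), usecols=demo_columns)
print(demo_df.columns.tolist())
```

With `usecols`, pandas raises a `ValueError` if a requested column is absent from the file, which surfaces a misspelled column name at load time.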


## Standardize Variable Names

In [None]:
penguins_df.rename(columns = {'Species':'species', 
                              'Island':'island', 
                              'Flipper Length (mm)':'flipper_length_mm',
                              'Culmen Length (mm)':'bill_length_mm',
                              'Culmen Depth (mm)':'bill_depth_mm',
                              'Body Mass (g)':'body_mass_g',
                              'Sex':'sex'},inplace=True)
penguins_df.head()
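
When a frame comes from a column selection, renaming it in place can trigger pandas' `SettingWithCopyWarning` unless the selection was explicitly copied. A minimal sketch of one way to sidestep this, using a hypothetical toy frame (`demo_raw`) rather than the notebook's data, is to copy the selection and assign the result of `rename`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data.
demo_raw = pd.DataFrame({
    "Species": ["A", "B"],
    "Culmen Length (mm)": [39.1, 46.5],
    "Comments": ["", ""],
})

column_names = {
    "Species": "species",
    "Culmen Length (mm)": "bill_length_mm",
}

# Select the wanted columns, copy explicitly, then rename.
# errors="raise" makes a key that matches no column fail with a KeyError
# instead of being silently ignored.
demo_df = (
    demo_raw[list(column_names)]
    .copy()
    .rename(columns=column_names, errors="raise")
)
print(demo_df.columns.tolist())
```

The original frame keeps its column names, so the raw data stays available for comparison.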

## Rename Species
Shorten each species label to its common name; for example, `Adelie Penguin (Pygoscelis adeliae)` becomes `Adelie`.


In [101]:
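# unique() returns labels in order of first appearance; species_common must list names in that same order
species = penguins_df.species.unique().tolist()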
species_common = ["Adelie",
                 "Chinstrap",
                 "Gentoo"]
penguins_df.loc[:,'species'] = penguins_df.loc[:,'species'].replace(species, species_common)
penguins_df.head()

Unnamed: 0,species,island,flipper_length_mm,bill_length_mm,bill_depth_mm,body_mass_g,sex
0,Adelie,Torgersen,181.0,39.1,18.7,3750.0,MALE
1,Adelie,Torgersen,186.0,39.5,17.4,3800.0,FEMALE
2,Adelie,Torgersen,195.0,40.3,18.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,193.0,36.7,19.3,3450.0,FEMALE
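
The positional replacement depends on the order in which species first appear in the file. A sketch of an order-independent alternative, assuming every label begins with the common name followed by a space (as in the Adelie label shown earlier), keeps only the first word. The Gentoo and Chinstrap labels below are assumed to follow the same pattern and are illustrative only.

```python
import pandas as pd

# Illustrative labels; only the Adelie string is taken from the notebook output.
demo_species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Gentoo penguin (Pygoscelis papua)",
    "Chinstrap penguin (Pygoscelis antarctica)",
])

# Split on whitespace and keep the first token of each label.
demo_common = demo_species.str.split().str[0]
print(demo_common.tolist())
```

This would not suit labels whose common name spans several words, so it is worth checking the unique values before and after.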


## Save Cleaned Data

In [102]:
penguins_df.to_csv('../data/cleaned/penguins_cleaned.csv', index=False) 
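
The `index=False` argument keeps the row index out of the saved file. A small sketch of the difference, using a hypothetical toy frame and a temporary directory instead of the project's `data` folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
demo_df = pd.DataFrame({
    "species": ["Adelie", "Adelie"],
    "island": ["Torgersen", "Torgersen"],
    "body_mass_g": [3750.0, 3800.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = Path(tmp_dir) / "with_index.csv"
    without_index = Path(tmp_dir) / "without_index.csv"

    demo_df.to_csv(with_index)                  # index written as an unnamed first column
    demo_df.to_csv(without_index, index=False)  # index omitted

    reloaded_with = pd.read_csv(with_index)
    reloaded_without = pd.read_csv(without_index)

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

Reloading the file written without the index returns the same columns that were saved, whereas the other file gains an extra `Unnamed: 0` column.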