# Project Title

### Authors: Michelle Nguyen, Kelly Zhu, Mariam Virk

# Milestone 1

## Part I: Initial Exploration

In [None]:
# Import libraries
import warnings
import altair as alt
import pandas as pd
# Use Altair's default data transformer
alt.data_transformers.enable("default")
# Suppress all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

### Wrangling and tidying dataset 

In [None]:
# reading in raw datasets
penguins = pd.read_csv("data/palmerpenguins_extended.csv")
# Replace underscores with spaces in attribute names
penguins.columns = [col.replace('_', ' ') for col in penguins.columns]
penguins.head()

Unnamed: 0,species,island,bill length mm,bill depth mm,flipper length mm,body mass g,sex,diet,life stage,health metrics,year
0,Adelie,Biscoe,53.4,17.8,219.0,5687.0,female,fish,adult,overweight,2021
1,Adelie,Biscoe,49.3,18.1,245.0,6811.0,female,fish,adult,overweight,2021
2,Adelie,Biscoe,55.7,16.6,226.0,5388.0,female,fish,adult,overweight,2021
3,Adelie,Biscoe,38.0,15.6,221.0,6262.0,female,fish,adult,overweight,2021
4,Adelie,Biscoe,60.7,17.9,177.0,4811.0,female,fish,juvenile,overweight,2021


In [None]:
# Print summary information about the provided DataFrame, df
def printDatasetInformation(df):
    print("Dataset data shape:", df.shape, '\n')  # (rows, columns)

    # df.info() prints column names, non-null counts, and dtypes itself and
    # returns None, which is why "None" appears at the end of that output
    print("Dataset data information:")
    print(df.info(), '\n')

    # Value counts for each non-numeric attribute with fewer than 10 unique values
    print("Dataset categorical data description:")
    df_ = df.select_dtypes(exclude=['int', 'float'])
    for col in df_.columns:
        if df_[col].nunique() < 10:
            print(df_[col].value_counts())
            print('\n')

    # Summary statistics for the numeric attributes
    print("Dataset numerical data description:", '\n', df.describe(), '\n')

In [None]:
print("Penguins Dataset")
printDatasetInformation(penguins)

Penguins Dataset
Dataset data shape: (3430, 11) 

Dataset data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3430 entries, 0 to 3429
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            3430 non-null   object 
 1   island             3430 non-null   object 
 2   bill length mm     3430 non-null   float64
 3   bill depth mm      3430 non-null   float64
 4   flipper length mm  3430 non-null   float64
 5   body mass g        3430 non-null   float64
 6   sex                3430 non-null   object 
 7   diet               3430 non-null   object 
 8   life stage         3430 non-null   object 
 9   health metrics     3430 non-null   object 
 10  year               3430 non-null   int64  
dtypes: float64(4), int64(1), object(6)
memory usage: 294.9+ KB
None 

Dataset categorical data description:
species
Adelie       1560
Gentoo       1247
Chinstrap     623
Name: count, dtype: in

We plan to use this penguin dataset as our primary data. To gain additional attributes, we will merge it with a second, related dataset.

In [None]:
penguins_delta = pd.read_csv("data/penguins_lter.csv")
penguins_delta.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,11/11/07,39.1,18.7,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,11/11/07,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,11/16/07,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,11/16/07,,,,,,,,Adult not sampled.
4,PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,11/16/07,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,


In [None]:
print("Penguins Delta Dataset")
printDatasetInformation(penguins_delta)

Penguins Delta Dataset
Dataset data shape: (344, 17) 

Dataset data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   studyName            344 non-null    object 
 1   Sample Number        344 non-null    int64  
 2   Species              344 non-null    object 
 3   Region               344 non-null    object 
 4   Island               344 non-null    object 
 5   Stage                344 non-null    object 
 6   Individual ID        344 non-null    object 
 7   Clutch Completion    344 non-null    object 
 8   Date Egg             344 non-null    object 
 9   Culmen Length (mm)   342 non-null    float64
 10  Culmen Depth (mm)    342 non-null    float64
 11  Flipper Length (mm)  342 non-null    float64
 12  Body Mass (g)        342 non-null    float64
 13  Sex                  334 non-null    object 
 14  Delta 15 

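The penguins delta dataset contributes two quantitative attributes that the first dataset lacks: the isotope ratios 'Delta 15 N (o/oo)' and 'Delta 13 C (o/oo)'. The two datasets share no row-level identifier, but they have similar values for species, island, and sex, so we will do an inner join on the grouped key of 'Species', 'Island', and 'Sex'. Because several delta rows share each key, we sample one delta row per group before joining, so every penguin with the same species, island, and sex receives the same isotope values.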
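Since an inner join silently drops rows whose key has no match, it can help to compare the key combinations of both datasets before merging. The following is a minimal sketch on made-up toy frames (the frames `left` and `right` and their values are illustrative assumptions, not the project data); it uses the `indicator` option of `pd.merge` to label unmatched combinations.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the values are made up for illustration.
left = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Biscoe", "Torgensen", "Biscoe"],
    "sex": ["female", "male", "male"],
})
right = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Biscoe", "Torgersen", "Biscoe"],
    "sex": ["female", "male", "male"],
})

keys = ["species", "island", "sex"]

# Outer-merge the distinct key combinations; indicator=True adds a "_merge"
# column with the value 'both', 'left_only', or 'right_only'.
key_check = pd.merge(
    left[keys].drop_duplicates(),
    right[keys].drop_duplicates(),
    on=keys,
    how="outer",
    indicator=True,
)

# Key combinations that an inner join would drop.
unmatched = key_check[key_check["_merge"] != "both"]
print(unmatched)
```

Any combination listed in `unmatched` is a candidate for cleaning (for example, a spelling difference) before the join.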

In [None]:
# Select the key attributes (species, island, sex) and the isotope attributes to join with the penguins dataset
penguins_delta_subset = penguins_delta[['Species', 'Island', 'Sex', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)']].copy()

# Lowercase the penguins_delta column names to match the column names of penguins
penguins_delta_subset.columns = penguins_delta_subset.columns.str.lower()

In [None]:
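# Tidy up feature values in the penguins_delta dataset to match the feature values in the penguins dataset
penguins_delta_filtered = (
    penguins_delta_subset
    # keep only the first word of the species name, e.g. "Adelie Penguin (Pygoscelis adeliae)" -> "Adelie"
    .assign(species=lambda x: x['species'].str.split().str[0])
    .assign(sex=lambda x: x['sex'].str.lower())
    # drop rows whose sex is missing or is not female/male
    .loc[lambda x: x['sex'].isin(['female', 'male'])]
    # drop rows that lack either isotope measurement
    .dropna(subset=['delta 15 n (o/oo)', 'delta 13 c (o/oo)'])
)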
penguins_delta_filtered.head()

Unnamed: 0,species,island,sex,delta 15 n (o/oo),delta 13 c (o/oo)
1,Adelie,Torgersen,female,8.94956,-24.69454
2,Adelie,Torgersen,female,8.36821,-25.33302
4,Adelie,Torgersen,female,8.76651,-25.32426
5,Adelie,Torgersen,male,8.66496,-25.29805
6,Adelie,Torgersen,female,9.18718,-25.21799


In [None]:
printDatasetInformation(penguins_delta_filtered)

Dataset data shape: (324, 5) 

Dataset data information:
<class 'pandas.core.frame.DataFrame'>
Index: 324 entries, 1 to 343
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            324 non-null    object 
 1   island             324 non-null    object 
 2   sex                324 non-null    object 
 3   delta 15 n (o/oo)  324 non-null    float64
 4   delta 13 c (o/oo)  324 non-null    float64
dtypes: float64(2), object(3)
memory usage: 15.2+ KB
None 

Dataset categorical data description:
species
Adelie       139
Gentoo       118
Chinstrap     67
Name: count, dtype: int64


island
Biscoe       162
Dream        119
Torgersen     43
Name: count, dtype: int64


sex
female    163
male      161
Name: count, dtype: int64


Dataset numerical data description: 
        delta 15 n (o/oo)  delta 13 c (o/oo)
count         324.000000         324.000000
mean            8.739944         -25.688691
std    

In [None]:
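# Randomly select one row from a group. No random_state is set, so the
# selected row (and therefore the merged isotope values) can differ between runs.
def random_select(group):
    return group.sample(1)

# Sample one penguins_delta row per ('species', 'island', 'sex') group, then
# inner-join it onto penguins so each penguin receives its group's isotope values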
penguins_final = pd.merge(
    penguins,
    penguins_delta_filtered.groupby(['species', 'island', 'sex']).apply(random_select).reset_index(drop=True),
    on=['species', 'island', 'sex'],
    how='inner')
penguins_final.head()

Unnamed: 0,species,island,bill length mm,bill depth mm,flipper length mm,body mass g,sex,diet,life stage,health metrics,year,delta 15 n (o/oo),delta 13 c (o/oo)
0,Adelie,Biscoe,53.4,17.8,219.0,5687.0,female,fish,adult,overweight,2021,9.21292,-24.3613
1,Adelie,Biscoe,49.3,18.1,245.0,6811.0,female,fish,adult,overweight,2021,9.21292,-24.3613
2,Adelie,Biscoe,55.7,16.6,226.0,5388.0,female,fish,adult,overweight,2021,9.21292,-24.3613
3,Adelie,Biscoe,38.0,15.6,221.0,6262.0,female,fish,adult,overweight,2021,9.21292,-24.3613
4,Adelie,Biscoe,60.7,17.9,177.0,4811.0,female,fish,juvenile,overweight,2021,9.21292,-24.3613
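The merge above samples one delta row per group at random, so the isotope values can change from run to run. One possible way to make the sampling repeatable is to pass a fixed `random_state` to pandas' group-wise `sample`. The sketch below is an assumption-laden illustration on a toy frame (the frame `delta`, the helper `sample_one_per_group`, and the seed value are hypothetical), not the notebook's actual method.

```python
import pandas as pd

# Toy stand-in for penguins_delta_filtered; values are made up for illustration.
delta = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Biscoe", "Biscoe", "Dream", "Biscoe", "Biscoe"],
    "sex": ["female", "female", "male", "male", "male"],
    "delta 15 n (o/oo)": [8.9, 9.2, 8.7, 8.1, 8.3],
})

keys = ["species", "island", "sex"]

def sample_one_per_group(df, seed):
    """Return one randomly chosen row per key combination, reproducibly."""
    return df.groupby(keys).sample(n=1, random_state=seed).reset_index(drop=True)

# The same seed yields the same sampled rows on every call.
first = sample_one_per_group(delta, seed=42)
second = sample_one_per_group(delta, seed=42)
print(first)
```

The resulting frame has one row per key combination and could be passed to `pd.merge` in place of the `groupby(...).apply(random_select)` expression.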


In [None]:
printDatasetInformation(penguins_final)

Dataset data shape: (2918, 13) 

Dataset data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2918 entries, 0 to 2917
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            2918 non-null   object 
 1   island             2918 non-null   object 
 2   bill length mm     2918 non-null   float64
 3   bill depth mm      2918 non-null   float64
 4   flipper length mm  2918 non-null   float64
 5   body mass g        2918 non-null   float64
 6   sex                2918 non-null   object 
 7   diet               2918 non-null   object 
 8   life stage         2918 non-null   object 
 9   health metrics     2918 non-null   object 
 10  year               2918 non-null   int64  
 11  delta 15 n (o/oo)  2918 non-null   float64
 12  delta 13 c (o/oo)  2918 non-null   float64
dtypes: float64(6), int64(1), object(6)
memory usage: 296.5+ KB
None 

Dataset categorical data description:
speci

### Data Exploration

| Attribute         | Attribute Type | Attribute Semantics       | Cardinality/Range                |
|-------------------|----------------|---------------------------|----------------------------------|
| species           | categorical    | type of penguin           | Gentoo, Adelie, Chinstrap        |
| island            | categorical    | geographic location       | Biscoe, Dream                    |
| sex               | categorical    | sex of the penguin        | female, male                     |
| diet              | categorical    | diet composition          | krill, fish, parental, squid     |
| life stage        | ordinal        | life stage of the penguin | chick, juvenile, adult           |
| health metrics    | ordinal        | health status             | underweight, healthy, overweight |
| bill length mm    | quantitative   | physical measurement      | range: 15.6 to 88.2              |
| bill depth mm     | quantitative   | physical measurement      | range: 9.1 to 27.9               |
| flipper length mm | quantitative   | physical measurement      | range: 143 to 308                |
| body mass g       | quantitative   | physical measurement      | range: 2477 to 10549             |
| year              | temporal       | observation year          | range: 2021 to 2025              |
| delta 15 n (o/oo) | quantitative   | isotopic ratio            | range: 7.99 to 10.03             |
| delta 13 c (o/oo) | quantitative   | isotopic ratio            | range: -26.79 to -24.36          |
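The cardinality and range column of a table like this one can be derived from the data rather than typed by hand, which helps keep it in sync when the merged dataset changes. The following is a small sketch on a toy frame; the helper name `summarize_attributes` and the sample values are hypothetical.

```python
import pandas as pd

# Toy frame with one categorical and two numeric attributes (made-up values).
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie"],
    "body mass g": [3750.0, 5200.0, 4100.0],
    "year": [2021, 2023, 2022],
})

def summarize_attributes(frame):
    """List unique values for non-numeric columns and min/max for numeric ones."""
    rows = []
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            summary = f"range: {frame[col].min()} to {frame[col].max()}"
        else:
            summary = ", ".join(sorted(frame[col].dropna().unique()))
        rows.append({"attribute": col, "cardinality/range": summary})
    return pd.DataFrame(rows)

summary = summarize_attributes(df)
print(summary)
```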


### Exploratory Data Analysis

In [None]:
print("Penguins data size:", penguins_final.shape)

Penguins data size: (2918, 13)


In [None]:
print("Numeric Summaries:")
penguins_final.describe()

Numeric Summaries:


Unnamed: 0,bill length mm,bill depth mm,flipper length mm,body mass g,year,delta 15 n (o/oo),delta 13 c (o/oo)
count,2918.0,2918.0,2918.0,2918.0,2918.0,2918.0,2918.0
mean,38.866518,18.606066,207.789582,4894.254626,2023.342015,8.660394,-25.572608
std,13.419008,2.763841,29.053689,1339.196676,1.319492,0.62892,0.831705
min,15.6,9.1,143.0,2477.0,2021.0,7.993,-26.79093
25%,29.1,16.8,186.0,3876.5,2022.0,8.13746,-26.07021
50%,34.55,18.6,203.0,4677.0,2024.0,8.61651,-25.5139
75%,47.2,20.4,226.0,5693.75,2024.0,9.21292,-24.90816
max,88.2,27.9,308.0,10549.0,2025.0,10.02544,-24.3613


In [None]:
print("Categorical Summaries")
for col in penguins_final.select_dtypes(exclude=['int', 'float']).columns:
    print(penguins_final[col].value_counts()) # to print count of every category
    print('\n')

Categorical Summaries
species
Gentoo       1247
Adelie       1048
Chinstrap     623
Name: count, dtype: int64


island
Biscoe    1785
Dream     1133
Name: count, dtype: int64


sex
female    1462
male      1456
Name: count, dtype: int64


diet
krill       1226
fish         809
parental     719
squid        164
Name: count, dtype: int64


life stage
juvenile    1330
adult        869
chick        719
Name: count, dtype: int64


health metrics
healthy        1323
overweight      978
underweight     617
Name: count, dtype: int64




In [None]:
bill_length_distribution = alt.Chart(penguins_final).mark_bar().encode(
    alt.X('bill length mm:Q', bin=alt.Bin(maxbins=20), title='Bill Length (mm)'),
    alt.Y('count()', title='Frequency')
).properties(
    width=200,
    height=150,
    title = 'Distribution of Bill Length Attribute'
)

In [None]:
print("Univariate visual summaries")
for column in penguins_final.select_dtypes(include=['int', 'float']):
    chart = alt.Chart(penguins_final).mark_bar().encode(
            alt.X(column, bin=alt.Bin(maxbins=20), title=column),
            alt.Y('count()', title='Frequency')
        ).properties(
            width=200,
            height=150,
            title = 'Distribution of ' + column)
    display(chart)

Univariate visual summaries


In [None]:
print("Multivariate visual summaries")
# Numeric attributes to plot against each other in the scatterplot matrix
numeric_columns = penguins_final.select_dtypes(include=['int', 'float']).columns.to_list()
penguins_pairplot = alt.Chart(penguins_final).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=150,
    height=150
).repeat(
    row=numeric_columns,
    column=numeric_columns
)
penguins_pairplot

Multivariate visual summaries


## Part II: Project Scope

### Introduction: Penguin Profiles: A Study of Species, Health, and Habitats in the Palmer Archipelago


An exploration of the ecological and biological dynamics of penguins in the Palmer Archipelago.

Through this project, we aim to explore the health, dietary patterns, and habitat distributions of penguins in the Palmer Archipelago. By analyzing physical measurements, isotopic ratios, and other attributes, we hope to gain insight into the ecological conditions, potential challenges, and distinctive traits of the three species in the dataset.

Intended audience: ecological scientists, ornithology researchers, environmental conservationists, university environmental science professors, undergraduate and postgraduate ecology students, and citizens involved with Antarctic wildlife conservation.

The intended audience will be able to build a comprehensive understanding of the three penguin species' ecological conditions, health metrics, and dietary habits. They will be able to see how different diets relate to health metrics and how the species are distributed across the islands of the Palmer Archipelago, which provides information on habitat distribution. This knowledge can support habitat conservation planning, guide further ecological research, and enrich academic curricula. Our motivation is to derive insights that can inform conservation initiatives, help sustain the presence of penguins in the Palmer Archipelago, and promote a deeper appreciation for the delicate balance of life in Antarctica.

### Task Analysis

Penguin Weight Change over Time (or another quantitative metric selected by the user; this could be an interactive visualization):
-	Body Mass g 
-	Species
-	Sex
-	Island 
-	Year 
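As a rough sketch of the aggregation such a view might need, the snippet below computes the mean body mass per species and year on a toy frame. The values are made up, and the choice of the mean as the summary statistic is an assumption rather than a decision made in this milestone.

```python
import pandas as pd

# Toy stand-in for penguins_final; values are made up for illustration.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "year": [2021, 2021, 2022, 2021, 2022],
    "body mass g": [3700.0, 3900.0, 4000.0, 5000.0, 5200.0],
})

# Mean body mass for each (species, year) pair, in long form so it can be
# passed directly to a line chart with year on x and species as colour.
mass_by_year = (
    df.groupby(["species", "year"], as_index=False)["body mass g"]
    .mean()
    .rename(columns={"body mass g": "mean body mass g"})
)
print(mass_by_year)
```

Adding 'sex' or 'island' to the grouping keys would give the finer breakdowns listed above.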

Comparison of Sex-based Health Variation across Species:
-	Species
-	Sex
-	Health Metrics
-	Bill Length mm
-	Bill Depth mm
-	Flipper Length mm
-	Body Mass g
-	Diet
-	Life Stage
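One possible starting point for this comparison is a normalized cross-tabulation, which gives the share of each health category within every species and sex group. The sketch below uses a made-up toy frame; the column names follow the merged dataset, but the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for penguins_final; values are made up for illustration.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["female", "female", "male", "male", "female", "male"],
    "health metrics": ["healthy", "overweight", "healthy", "healthy",
                       "underweight", "healthy"],
})

# Rows: (species, sex) groups. Columns: health categories.
# normalize="index" converts counts to within-row proportions.
health_share = pd.crosstab(
    index=[df["species"], df["sex"]],
    columns=df["health metrics"],
    normalize="index",
)
print(health_share)
```

Proportions make groups of different sizes comparable, which raw counts would not.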

Life Stages of Penguins:
-	Species
-	Bill Length mm
-	Bill Depth mm
-	Flipper Length mm
-	Body Mass g 
-	Life Stage

Diet Analysis on Penguin Characteristics:
-	Diet
-	Life Stage
-	Health Metrics
-	Body Mass g 
-	Species
-	Sex 
-	Could include a pop-up that displays physical characteristics for penguins with that diet

Physical Characteristics Correlation:
-	Body Mass g 
-	Bill Length mm
-	Bill Depth mm
-	Flipper Length mm

Habitat Influence on Penguin Characteristics:
-	Island 
-	Species 
-	Diet
-	Health Metrics 
-	Bill Length mm  
-	Bill Depth mm 
-	Body Mass g
-	Delta 15 n (o/oo) 
-	Delta 13 c (o/oo) 


## Part III: Visualization Ideas

### Preliminary Sketches

## Part IV: Next Steps

### Outline