# Analysis of Penguin Dataset

## 1. Data Loading and Initial Audit
The first phase of exploratory data analysis (EDA) is understanding the "shape" and "health" of our raw data. We use Pandas to load the `penguins.csv` file and perform a high-level audit.

- **Shape:** Identifies the number of rows and columns.
- **Data Types:** Ensures numerical values aren't being treated as strings.
- **Summary Stats:** Provides the mean, standard deviation, and quartiles for features like body mass and bill length.

In [None]:
import pandas as pd
df = pd.read_csv('data/penguins.csv')

# Verify it worked
print(df.head())
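
The cell above only previews the first rows. The three audit steps listed earlier (shape, data types, summary stats) could be sketched as follows. The `audit` helper and the three-row `toy` frame are hypothetical and exist only so the snippet runs without `data/penguins.csv`; in the notebook you would call `audit(df)` instead.

```python
import pandas as pd

def audit(frame):
    """Collect shape, column dtypes, and numeric summary stats (hypothetical helper)."""
    return {
        "shape": frame.shape,         # (rows, columns)
        "dtypes": frame.dtypes,       # numeric columns should not show up as object/string
        "summary": frame.describe(),  # count, mean, std, min, quartiles, max (numeric columns only)
    }

# Made-up rows for illustration only, not taken from penguins.csv
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, 47.5, 49.0],
    "body_mass_g": [3750, 5200, 3800],
})

result = audit(toy)
print("Shape:", result["shape"])
print(result["dtypes"])
print(result["summary"])
```

If a measurement column appears with an `object` dtype here, that usually points to stray text in the CSV that needs cleaning before any statistics are computed.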

## 2. Data Cleaning
We need to handle missing values to ensure our analysis is accurate.
- **Measurements:** We will drop rows where critical physical measurements (`bill_length_mm`, `flipper_length_mm`, `body_mass_g`) are missing, as we cannot infer these biological traits safely.
- **Sex:** For the `sex` column, we will fill missing values with `'Unknown'` to preserve the sample size for species-level analysis.

In [None]:
# Create a copy to keep the original data safe
df_clean = df.copy()

# 1. Drop rows with missing measurement data
# We check a subset of columns critical for our graphs
df_clean = df_clean.dropna(subset=['bill_length_mm', 'flipper_length_mm', 'body_mass_g'])

# 2. Fill missing 'sex' data with 'Unknown'
df_clean['sex'] = df_clean['sex'].fillna('Unknown')

# 3. Verify cleaning
print("Missing values after cleaning:")
print(df_clean.isnull().sum())

print(f"\nRows remaining: {len(df_clean)} (Original: {len(df)})")
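
Before applying these rules to a new file, it can help to see how many values each column is actually missing, so the cost of dropping rows is visible up front. A minimal sketch, assuming a hypothetical `missing_report` helper and a small made-up `toy` frame (in the notebook, the raw `df` would be passed instead):

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Count missing values per column and express them as a percentage of rows."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Show the most affected columns first
    return report.sort_values("missing", ascending=False)

# Made-up rows for illustration only
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, np.nan, 3450.0],
    "sex": ["male", None, None, "female"],
})

report = missing_report(toy)
print(report)
```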

## 3. Comparative Visualizations
We will visualize the species differences using a $2 \times 2$ grid:
1.  **Bill Length (Histogram):** To see the distribution and overlaps between species.
2.  **Flipper Length (Boxplot):** To compare the median and spread of sizes.
3.  **Body Mass vs Bill Length (Scatter):** To identify correlations between weight and beak size.
4.  **Species by Island (Countplot):** To observe geographic segregation.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style
sns.set_style("whitegrid")

# Create the 2x2 subplots
fig, ax = plt.subplots(2, 2, figsize=(14, 12))
fig.suptitle('Palmer Penguins: Species & Geographic Analysis', fontsize=16)

# A) Histogram: Bill Length by Species
sns.histplot(data=df_clean, x='bill_length_mm', hue='species', kde=True, ax=ax[0,0], palette='viridis')
ax[0,0].set_title('Distribution of Bill Length')

# B) Boxplot: Flipper Length by Species
sns.boxplot(data=df_clean, x='species', y='flipper_length_mm', ax=ax[0,1], palette='Set2')
ax[0,1].set_title('Flipper Length Ranges')

# C) Scatterplot: Bill Length vs Body Mass
sns.scatterplot(data=df_clean, x='bill_length_mm', y='body_mass_g', hue='species', ax=ax[1,0], palette='viridis')
ax[1,0].set_title('Body Mass vs Bill Length')

# D) Countplot: Species by Island
sns.countplot(data=df_clean, x='island', hue='species', ax=ax[1,1], palette='pastel')
ax[1,1].set_title('Species Count by Island')

plt.tight_layout()
plt.show()

## 4. Key Insights
Based on the visualizations above, we can draw the following conclusions for stakeholders:

| Observation | Supporting Data | Implication |
| :--- | :--- | :--- |
| **The "Gentoo Giant"** | In the **Flipper Length Boxplot**, Gentoo penguins consistently show the highest median flipper length (> 210 mm) and are distinctly heavier in the **Scatterplot**. | Gentoo penguins are physically the largest of the three, suggesting they may hunt larger prey or dive deeper. |
| **Geographic Segregation** | The **Island Countplot** reveals that Gentoos are found *only* on Biscoe island, while Adelies are found on all three. | Biscoe Island is a critical habitat. Conservation efforts for Gentoos can be geographically focused there. |
| **Bill Length vs. Species** | The **Histogram** shows a bimodal overall distribution: Chinstraps and Gentoos have similar (longer) bill lengths, while Adelies have noticeably shorter bills. | Bill length separates most Adelies from the other two species, although the ranges can overlap at the edges, and it isn't enough to tell Chinstrap and Gentoo apart (you would need body mass for that). |
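
The geographic claim in the table is read off a plot, so it is worth backing it with a count table. One way to do that is a species-by-island cross-tabulation; the sketch below uses a small made-up `toy` frame that merely mirrors the pattern described above (in the notebook, `df_clean` would be used instead).

```python
import pandas as pd

# Made-up rows for illustration only; they mimic the pattern seen in the countplot
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Biscoe", "Dream", "Torgersen", "Biscoe", "Biscoe", "Dream"],
})

# Rows are species, columns are islands, cells are penguin counts
counts = pd.crosstab(toy["species"], toy["island"])
print(counts)

# Number of distinct islands on which each species appears
islands_per_species = (counts > 0).sum(axis=1)
print(islands_per_species)
```

A species with a value of 1 in `islands_per_species` is confined to a single island in the data at hand, which is the pattern the countplot suggests for Gentoos.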

## 5. Feature Correlations
Next, we examine how the physical measurements correlate with one another across the entire dataset, pooling all species together.

In [None]:
# Select only numeric columns for correlation (any integer or float width)
numeric_cols = df_clean.select_dtypes(include='number')
corr = numeric_cols.corr()

# Plot Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Penguin Physical Traits')
plt.show()
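
A correlation computed across all species at once can differ in size, and even in sign, from the correlation inside each species, because differences between groups are mixed into the pooled figure. The sketch below illustrates the effect on a deliberately constructed `toy` frame; the species labels `A` and `B` and all values are made up. In the notebook, the same comparison could be run on `df_clean`.

```python
import pandas as pd

# Made-up data: within each group the two traits rise together,
# but group B has longer and shallower bills than group A.
toy = pd.DataFrame({
    "species": ["A"] * 3 + ["B"] * 3,
    "bill_length_mm": [38.0, 39.0, 40.0, 46.0, 47.0, 48.0],
    "bill_depth_mm": [17.0, 18.0, 19.0, 13.0, 14.0, 15.0],
})

# Pearson correlation over all rows, ignoring species
pooled_r = toy["bill_length_mm"].corr(toy["bill_depth_mm"])

# Pearson correlation computed separately inside each species
within_r = pd.Series({
    name: group["bill_length_mm"].corr(group["bill_depth_mm"])
    for name, group in toy.groupby("species")
})

print(f"Pooled correlation: {pooled_r:.2f}")
print(within_r)
```

Here the pooled correlation is negative while both within-group correlations are positive, so the heatmap is best read alongside a per-species breakdown.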

## 6. Statistical Summary
To understand the central tendency and spread of our data, we calculate summary statistics.
* **Global Summary:** Provides a quick overview (count, mean, std, min/max) for all numerical columns.
* **Grouped Means:** We group the data by `species` to see the average physical traits for Adelie, Chinstrap, and Gentoo penguins separately. This helps quantify the differences we saw in the visualizations.


In [None]:
# 1. Global Summary Statistics
print("--- Global Statistical Summary ---")
display(df_clean.describe())

# 2. Means Grouped by Species
# We select only numeric columns to avoid errors with text columns
numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
species_means = df_clean.groupby('species')[numeric_cols].mean()

print("\n--- Average Traits by Species ---")
display(species_means)

## 7. Hypothesis Testing & Analysis

Based on our initial visual inspection, we can formulate the following biological hypotheses to test quantitatively:

1.  **Hypothesis 1 (Size):** *Gentoo penguins have a significantly higher body mass than Adelie or Chinstrap penguins.*
2.  **Hypothesis 2 (Differentiation):** *Bill length alone is sufficient to distinguish Chinstrap penguins from Adelie penguins.*
3.  **Hypothesis 3 (Geography):** *The flipper length of penguins varies significantly depending on which island they inhabit.*

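Below, we compare summary statistics as a first check on each hypothesis. These comparisons show the direction and size of the differences, but they are not formal significance tests.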

In [None]:
# --- Analysis of Hypothesis 1: Body Mass ---
print("--- 1. Body Mass Comparison (g) ---")
mass_stats = df_clean.groupby('species')['body_mass_g'].agg(['mean', 'std', 'max'])
display(mass_stats)
# Check if Gentoo mean is higher than others
gentoo_mean = mass_stats.loc['Gentoo', 'mean']
# Note: this is the unweighted average of the two species means, not a pooled mean over all birds
others_mean = mass_stats.loc[['Adelie', 'Chinstrap'], 'mean'].mean()
print(f"Gentoo Mass vs Others: {gentoo_mean:.1f}g vs {others_mean:.1f}g")


# --- Analysis of Hypothesis 2: Bill Length overlap ---
print("\n--- 2. Bill Length Separation (mm) ---")
bill_stats = df_clean.groupby('species')['bill_length_mm'].agg(['min', 'mean', 'max'])
display(bill_stats)

# Check overlap between Adelie and Chinstrap
adelie_max = bill_stats.loc['Adelie', 'max']
chinstrap_min = bill_stats.loc['Chinstrap', 'min']
print(f"Adelie Max Bill: {adelie_max} mm")
print(f"Chinstrap Min Bill: {chinstrap_min} mm")
print(f"Overlap Exists: {adelie_max >= chinstrap_min}")


# --- Analysis of Hypothesis 3: Flipper Length by Island ---
print("\n--- 3. Flipper Length by Island (mm) ---")
island_stats = df_clean.groupby('island')['flipper_length_mm'].mean()
display(island_stats)
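
The comparisons above use group means only. To put a number on "significantly" for Hypothesis 1, one option among several is a permutation test on the difference in mean body mass. This is a sketch of a swapped-in technique, not the method used in this notebook: the `permutation_test_mean_diff` helper is hypothetical and the two small samples are made up. With the real data, the `body_mass_g` values of two species from `df_clean` would be passed in.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_perm):
        # Shuffle the group labels by permuting the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value

# Made-up body masses in grams, for illustration only
heavier_group = [5000, 5100, 4900, 5200, 5050, 4950]
lighter_group = [3700, 3600, 3800, 3650, 3750, 3700]

observed_diff, p_value = permutation_test_mean_diff(heavier_group, lighter_group)
print(f"Observed difference in means: {observed_diff:.1f} g")
print(f"Permutation p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which suits a quick check like this; a Welch t-test or an ANOVA across all three species would be reasonable alternatives.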

### Analysis Conclusions

* **Result 1 (Confirmed):** The data supports that Gentoo penguins are the heaviest species, with an average mass (~5076 g) substantially higher than that of Adelie (~3700 g) and Chinstrap (~3733 g).
* **Result 2 (Nuanced):** While Chinstraps generally have longer bills, the two ranges overlap: the longest Adelie bills reach or exceed the shortest Chinstrap bills, so bill length *alone* is not a perfect separator without other traits.
* **Result 3 (Confirmed, with a caveat):** Penguins on Biscoe island have a much higher average flipper length. This is likely due to the dominance of the larger Gentoo species on that island, so the difference reflects species composition rather than an effect of the island itself.
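
Whether the island difference in Result 3 comes from species composition can be checked by grouping on island and species together. The sketch below uses a made-up `toy` frame built so that one species has the same mean flipper length on two islands; with the real data, `df_clean` would replace it.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "island": ["Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream", "Dream"],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Adelie", "Adelie"],
    "flipper_length_mm": [190.0, 192.0, 216.0, 218.0, 189.0, 193.0],
})

# Island-level means mix all species living on each island
by_island = toy.groupby("island")["flipper_length_mm"].mean()

# Splitting by species as well shows whether the same species differs across islands
by_island_species = (
    toy.groupby(["island", "species"])["flipper_length_mm"].mean().unstack()
)

print(by_island)
print(by_island_species)
```

In this toy example the island means differ, yet the Adelie mean is identical on both islands, which is the pattern to look for if species mix rather than geography drives the gap.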