# Chapter 1.6 and 1.7: Iterative EDA

Goal: Practice the iterative, question-driven nature of EDA and document your findings.

### Topics:
- EDA as exploration, not a checklist
- Following threads of inquiry
- Documenting discoveries and decisions
- When to dig deeper vs. move on

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the Data

We'll use the Palmer Penguins dataset, which contains measurements of penguins from three different species on islands in Antarctica. Each row is one penguin, with measurements like bill length, flipper length, and body mass.

In [None]:
# Load the penguins dataset from Seaborn
penguins = sns.load_dataset('penguins')
penguins.head()

In [None]:
# Quick overview
penguins.describe()

In [None]:
# What species do we have?
penguins['species'].unique()
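
Before computing statistics, it can help to check for missing values, because pandas reductions such as `mean()` skip `NaN` by default. The sketch below uses a small hypothetical frame (`toy`, with made-up values rather than the real measurements) so that it runs without downloading anything. In the notebook you would apply the same calls to `penguins`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (made-up values, NOT the real Palmer Penguins measurements).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.0, np.nan, 47.0, 49.0],
    "body_mass_g": [3700.0, np.nan, 5000.0, 3800.0],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# mean() skips NaN by default, so it averages over 3 values here, not 4.
mean_skipping_nan = toy["bill_length_mm"].mean()
print(mean_skipping_nan)
```

If a column has many missing values, note that in your findings: every average you report for that column is based on fewer rows than the table suggests.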

## The Iterative Mindset

EDA isn't a checklist. It's more like a conversation with your data:

1. **Ask a question** → "What's the average bill length?"
2. **Answer it** → Compute the statistic or make a plot
3. **Notice something interesting** → "Hmm, that's higher than I expected..."
4. **Ask a follow-up question** → "Does it vary by species?"
5. **Repeat**

Let's walk through an example of this process.

## Worked Example: Following a Thread

**Starting question:** What's the average bill length?

In [None]:
# Answer: What's the average bill length?
penguins['bill_length_mm'].mean()

About 44mm. But wait: we have three different species. Maybe their bill lengths differ?

**Follow-up question:** Does bill length vary by species?

In [None]:
# Answer: Bill length by species
penguins.groupby('species')['bill_length_mm'].mean()

Interesting! Chinstrap and Gentoo have longer bills than Adelie. Let's visualize this:

In [None]:
# Visualize the difference
sns.boxplot(data=penguins, x='species', y='bill_length_mm')
plt.title('Bill Length by Species')
plt.ylabel('Bill Length (mm)')
plt.show()

Clear difference! The Adelie box sits well below the other two, while the Chinstrap and Gentoo boxes overlap each other.
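
One way to check the overlap numerically is to compute quartiles per species, since the box in a boxplot spans the 25th to 75th percentile. The helper `quartiles_by_group` and the `toy` frame below are hypothetical names introduced for illustration, and the toy values are made up. In the notebook you could call the helper on `penguins` instead.

```python
import pandas as pd

def quartiles_by_group(df, group_col, value_col):
    """Return the 25th, 50th, and 75th percentiles of value_col per group."""
    q = df.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q25", "median", "q75"]
    return q

# Hypothetical toy data (made-up values, NOT the real measurements).
toy = pd.DataFrame({
    "species": ["Adelie"] * 4 + ["Gentoo"] * 4,
    "bill_length_mm": [36.0, 38.0, 40.0, 42.0, 44.0, 46.0, 48.0, 50.0],
})

quartiles = quartiles_by_group(toy, "species", "bill_length_mm")
print(quartiles)

# Two boxes overlap only if each one starts below the point where the other ends.
a, g = quartiles.loc["Adelie"], quartiles.loc["Gentoo"]
boxes_overlap = bool(a["q25"] <= g["q75"] and g["q25"] <= a["q75"])
print(boxes_overlap)
```

A numeric check like this is easier to quote in your written findings than "the boxes look separate".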

**Another follow-up:** Is this true for other measurements too?

In [None]:
# Check body mass by species
sns.boxplot(data=penguins, x='species', y='body_mass_g')
plt.title('Body Mass by Species')
plt.ylabel('Body Mass (g)')
plt.show()

Now Gentoo is the biggest! The pattern is different for body mass than for bill length.
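
To compare the two patterns side by side, a possible approach is to put the per-species means of both measurements in one table and rank them. This is a sketch on a hypothetical `toy` frame with made-up values. The same two lines should work on `penguins`, assuming the column names shown earlier.

```python
import pandas as pd

# Hypothetical toy data (made-up values, NOT the real measurements).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [38.0, 40.0, 48.0, 50.0, 46.0, 48.0],
    "body_mass_g": [3600.0, 3800.0, 3650.0, 3850.0, 4900.0, 5100.0],
})

# Mean of each measurement per species, in one table.
means = toy.groupby("species")[["bill_length_mm", "body_mass_g"]].mean()
print(means)

# Rank the species within each column (1 = largest mean).
ranks = means.rank(ascending=False)
print(ranks)
```

If the rank order differs between columns, the measurements do not simply scale together across species, which is a cue to look at them jointly.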

**Follow-up question:** Species with a higher body mass sometimes, but not always, also have longer bills. Are the two measurements related?

In [None]:
sns.scatterplot(data=penguins, x='body_mass_g', y='bill_length_mm')
plt.xlabel('Body Mass (g)')
plt.ylabel('Bill Length (mm)')
plt.title('Body Mass vs. Bill Length')
plt.show()


There is a positive relationship, but the points also form distinct clusters (e.g. one in the top-left of the plot). Could the clusters correspond to species?

In [None]:
sns.scatterplot(data=penguins, x='body_mass_g', y='bill_length_mm', hue='species')
plt.xlabel('Body Mass (g)')
plt.ylabel('Bill Length (mm)')
plt.title('Body Mass vs. Bill Length by Species')
plt.show()

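Coloring by species explains the clusters: each species forms its own group, and within each group heavier penguins tend to have longer bills.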
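
To put a number on how strong the relationship is, one option is the Pearson correlation, computed both overall and within each species. The helper `correlation_by_group` and the `toy` frame are hypothetical and the values are made up. They only show the mechanics and say nothing about the real penguins.

```python
import pandas as pd

def correlation_by_group(df, group_col, x_col, y_col):
    """Pearson correlation between x_col and y_col, overall and per group."""
    overall = df[x_col].corr(df[y_col])
    per_group = {
        name: group[x_col].corr(group[y_col])
        for name, group in df.groupby(group_col)
    }
    return overall, pd.Series(per_group, name="pearson_r")

# Hypothetical toy data (made-up values, NOT the real measurements).
toy = pd.DataFrame({
    "species": ["Adelie"] * 3 + ["Chinstrap"] * 3,
    "body_mass_g": [3400.0, 3700.0, 4000.0, 3400.0, 3700.0, 4000.0],
    "bill_length_mm": [37.0, 39.0, 41.0, 47.0, 49.0, 51.0],
})

overall_r, per_species_r = correlation_by_group(
    toy, "species", "body_mass_g", "bill_length_mm"
)
print(overall_r)
print(per_species_r)
```

In this toy frame the pooled correlation is weaker than the correlation inside each group, because the groups are offset from each other. Comparing the two numbers on `penguins` would show whether something similar happens there.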

### Documenting What We Found

As you explore, document your findings. Here's what we discovered:

**Finding 1:** Bill length varies substantially by species.
- Adelie: ~39mm (shortest)
- Chinstrap: ~49mm (longest)
- Gentoo: ~47mm

**Finding 2:** Body mass has a different pattern.
- Gentoo penguins are the heaviest (~5000g)
- Adelie and Chinstrap are similar (~3700g)

**Finding 3:** Within each species, body mass and bill length have a positive, roughly linear relationship.
- For Adelie and Gentoo penguins, the relationship looks much the same, except that Gentoo are heavier.
- For Chinstrap penguins, the relationship is a little weaker, and at the same body mass they have noticeably longer bills than the other species.

**Question for later:** Why are Gentoo penguins so much heavier than the other species without having the longest bills? Is it related to their environment or diet?
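
If you prefer to keep findings next to your code rather than only in markdown cells, a small log structure is one way to do it. The `Finding` class and `to_markdown` helper below are hypothetical names, not part of pandas or seaborn. This is a minimal sketch using only the standard library.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    question: str   # what you asked
    evidence: str   # how you checked it (statistic, plot, ...)
    result: str     # what you saw
    follow_ups: list = field(default_factory=list)  # questions it raised

def to_markdown(findings):
    """Render a list of Finding objects in the style used in this section."""
    lines = []
    for i, f in enumerate(findings, start=1):
        lines.append(f"**Finding {i}:** {f.result}")
        lines.append(f"- Question: {f.question}")
        lines.append(f"- Evidence: {f.evidence}")
        for q in f.follow_ups:
            lines.append(f"- Follow-up: {q}")
    return "\n".join(lines)

log = [
    Finding(
        question="Does bill length vary by species?",
        evidence="groupby('species')['bill_length_mm'].mean() and a boxplot",
        result="Adelie bills are shortest; Chinstrap and Gentoo are longer.",
        follow_ups=["Is this true for other measurements too?"],
    ),
]

summary = to_markdown(log)
print(summary)
```

Recording the evidence alongside each result makes it easier to retrace a thread later, or to hand the analysis to someone else.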

## Your Turn: Explore a Question

Now it's your turn to follow a thread of inquiry. Choose ONE of these starting questions:

1. **Do larger penguins have longer bills?** (Is there a relationship between body mass and bill length?)
2. **Does body mass differ by island?** (Are penguins on some islands bigger?)
3. **Is there a relationship between flipper length and bill depth?**

For your chosen question, follow these steps:

### Step 1: Answer Your Starting Question

Compute a relevant statistic or create a visualization to answer your question.

In [None]:
# Your code here - answer your starting question


### Step 2: Document What You Found

Write 1-2 sentences about what you discovered. What pattern do you see? Does anything surprise you?

**Your finding:**

(Write your finding here)

### Step 3: Ask a Follow-up Question

Based on what you found, what's the natural next question? Write it down.

**Your follow-up question:**

(Write your question here)

### Step 4: Investigate Your Follow-up Question

In [None]:
# Your code here - investigate your follow-up question


### Step 5: Document Your Final Finding

What did you learn from this exploration? Write a brief summary (2-3 sentences).

**Your summary:**

(Write your summary here)

## Wrap-up Discussion

Turn to a neighbor and share:
1. Which question did you start with?
2. What did you find?
3. What follow-up question did you ask?

Notice that everyone's path was different - and that's okay! EDA is about following your curiosity.

## Key Takeaways

- **EDA is iterative**: Each answer leads to new questions
- **Document as you go**: Write down findings and decisions while they're fresh
- **There's no "right" path**: Different analysts will explore different directions
- **Follow your curiosity**: When something surprises you, dig deeper

The best EDA happens when you stay curious and keep asking "why?" and "what if?"