# IS 6482 - Week 1 — Exploratory Data Analysis (EDA) + Data Visualization

**Author:** Varun Gupta

**Agenda:** how to *understand* data before modeling  
**Libraries:** `pandas`, `numpy`, `matplotlib`, `seaborn` (optional: `statsmodels` for QQ plots)  
**Datasets:** Telco customer churn, Gapminder

---

### Learning goals
By the end of this notebook, you should be able to:

1. Quickly inspect a dataset (`.shape`, `.info()`, `.head()`, `.describe()`) and spot obvious issues.
2. Summarize distributions (histograms, ECDFs, boxplots) and compare groups.
3. Use simple segment analysis to compute and visualize **rates** (e.g., churn rate by segment).
4. Explore relationships (correlation, pairplots) to form hypotheses for modeling.
5. See a few advanced visualization techniques (reference: [Data Visualization: A Practical Introduction](https://socviz.co/)):  
   - bubble plots  
   - small multiples (facets)  
   - log scales  
   - connected dot (dumbbell) comparisons  

## 0. Setup

### Takeaways
- EDA and visualization are how we **earn trust** in the dataset and **check our assumptions** before we model anything.
- We will prefer a small set of libraries you will use repeatedly in business work.

In [None]:
# =========================
# Imports and display setup
# =========================

# Numerical computing (arrays, math helpers)
import numpy as np

# Tabular data (DataFrames) — the main workhorse for EDA
import pandas as pd

# Matplotlib = the core plotting library in Python
import matplotlib.pyplot as plt

# Seaborn = statistical plotting library built on top of matplotlib
import seaborn as sns

# Statsmodels for QQ-plot
import statsmodels.api as sm

# Set a consistent visual style (still uses matplotlib underneath)
sns.set_theme()
'''
# Option for prettier output
# Jupyter helper: display DataFrames nicely (especially inside loops)
from IPython.display import display

# Make pandas print wide tables without truncating as aggressively
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 200)



# Slightly sharper plots in notebooks (purely cosmetic)
plt.rcParams["figure.dpi"] = 110
'''

# Part A — Telco Customer Churn Dataset

### The Business problem
A common business problem is **customer churn**: “Which customers are likely to leave, why, and how do we retain them?”

In this section we will treat **`Churn`** as the outcome/target and practice EDA techniques that will later inform our modeling.

### Dataset notes
- We will use the “Telco customer churn” dataset originally distributed as an IBM Cognos Analytics sample. It is a synthetic dataset. We will use a subset of the columns -- the full data dictionary is at https://community.ibm.com/community/user/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
- For convenience in class, we load a CSV mirror from Plotly’s public datasets repository:  
  https://github.com/plotly/datasets/blob/master/telco-customer-churn-by-IBM.csv

### Takeaways
- Start with a business question and identify the **unit of analysis** (here: one row = one customer).
- Identify the **target variable** early (here: `Churn`).


In [None]:
# ==================================
# Load the Telco Customer Churn data
# ==================================

# The dataset is a CSV file hosted online.
# (In a production setting, you'd often load from your company database.)
telco_url = "https://raw.githubusercontent.com/plotly/datasets/master/telco-customer-churn-by-IBM.csv"

# Read the CSV into a pandas DataFrame (a spreadsheet-like table)
df = pd.read_csv(telco_url)

In [None]:
# Quick sanity check: how many rows and columns?
df.shape

## 1. First look: structure, column names, and data types

Sometimes the hardest part of EDA is simply: **“What columns do I have, and how do I grab the rows I want?”**

We start with the fastest “risk report” tools:
- `.shape` tells us the size of the dataset.
- `.head()` / `.tail()` show us what a row looks like.
- `.columns` tells you what’s available (the “menu”).
- `.info()` reveals data types and missing values.
- `.iloc[...]` selects by position (row/column numbers), which is great when you are exploring.

### Takeaways
- Real datasets often have at least one column with the “wrong” type (e.g., numbers stored as text).
- Knowing column names and types early prevents many downstream mistakes.
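
As a minimal sketch of the “numbers stored as text” problem, the toy table below (hypothetical data and column names, not the Telco file) shows why a blank entry forces a column to load as text, and why a null count alone will not reveal it.

```python
import pandas as pd

# Hypothetical toy table: "total" holds numbers as text, with one blank entry
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "total": ["29.85", " ", "108.15"],
})

print(toy.dtypes)  # "total" is not a numeric dtype

# A blank string is a real value, so it is not counted as missing
n_null = int(toy["total"].isna().sum())
n_blank = int(toy["total"].str.strip().eq("").sum())
print("nulls:", n_null, "blanks:", n_blank)
```

The practical habit: when a column that should be numeric shows up as text, count blank or odd entries before trusting a “no nulls” report.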


In [None]:
# First 5 rows: what does one record (one customer) look like?
df.head()

In [None]:
# Last 5 rows: sometimes the bottom of a file reveals parsing issues too
df.tail()

In [None]:
# List the column names (helpful when you are new to the dataset)
df.columns

# list(...) prints them as a regular Python list
# list(df.columns)

In [None]:
# .info() is a fast "risk report":
# - column names
# - non-null counts (missing values show up here)
# - data types (numeric vs text)
df.info()
# .info() reports no null entries, but blank strings do not count as nulls,
# so text columns still need a closer look (see TotalCharges below)

In [None]:
# .iloc selects by integer position:
# - rows 0 to 2 (Python slices exclude the end)
# - columns 0 to 5
df.iloc[0:3, 0:6]

In [None]:
# .loc selects by label (index labels for rows, column names for columns)
# here 0:2 are index labels, and both ends are included, unlike list slicing
df.loc[0:2, "customerID":"tenure"]
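
The difference between the two selectors is easy to check on a tiny table. This is a sketch on made-up data (the frame `toy` and its columns are hypothetical), using the default 0, 1, 2, ... index that pandas creates.

```python
import pandas as pd

toy = pd.DataFrame({"a": [10, 20, 30, 40], "b": list("wxyz")})

by_position = toy.iloc[0:2]   # positions 0 and 1 (end excluded)
by_label = toy.loc[0:2]       # labels 0, 1 and 2 (end included)

print(len(by_position), len(by_label))
```

The same `0:2` returns two rows with `.iloc` and three rows with `.loc`, which is a common source of off-by-one surprises.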

### 1.2 Quick summary statistics with `.describe()`

`.describe()` gives you a fast summary that helps answer:
- “Are values in a reasonable range?”
- “Are there obvious outliers?”
- “Do we have missing values?”

### Takeaways
- `.describe()` is a fast way to sanity-check numeric columns.
- `.describe(include="object")` or `df["categorical column"].describe()` summarizes categorical columns (counts, unique values, most common category).
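
To see the two kinds of summary side by side, here is a small sketch on invented data (the frame `toy` and its values are assumptions for illustration only).

```python
import pandas as pd

toy = pd.DataFrame({
    "monthly": [20.0, 55.5, 70.0, 99.9],
    "contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

numeric_summary = toy.describe()            # numeric columns only by default
text_summary = toy["contract"].describe()   # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

For a text column, `top` is the most common category and `freq` is how often it appears, which is often the first thing you want to know about a segment variable.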


In [None]:
# Numeric summary stats (count, mean, std, min, quartiles, max)
df.describe()

In [None]:
# Display summary for selected categorical columns
df[["InternetService", "PaymentMethod"]].describe()

In [None]:
# Categorical summary stats for all object columns
# Transpose (.T) makes the table easier to read (one row per column)
df.describe(include="object").T
# Note, there are 11 rows where TotalCharges is empty

In [None]:
# A quick histogram plot of the numeric columns using DataFrame.hist() function
num_cols = df.select_dtypes(include = 'number').columns
df.hist(num_cols);

### Stop & Discuss: first look

1) **Question:** In `.info()` and `.describe()`, do you see any columns that *look like numbers but are stored as text*?  

2) **Question:** In `.info()` and `.describe()`, do you see any columns that *look like categorical data but are stored as numbers*?  

3) **Question:** If a numeric column has a much larger max than the 75th percentile, what might that indicate?  

4) **Question:** Why might `.describe()` be misleading for very skewed data?  


## 2. Minimal cleaning that EDA almost always discovers

Some common issues that should be fixed in EDA:
- Identify an ID column (here `customerID` is the correct ID column that identifies rows)
- Remove duplicate IDs (if present)
- Numeric columns that load as text (often due to blanks)
- Convert dtype of categorical attributes from `object` to `category` (efficient for storage and data processing) (*OMITTED TODAY*)
- A numeric version of binary target (useful for correlations)

### Takeaways
- Cleaning is part of EDA, but also guided by EDA: EDA tells you what to clean.
- Fixing types early helps your summaries and plots behave correctly.
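
A minimal sketch of the “numeric column loaded as text” fix, on a hypothetical series rather than the real column. The entries `" "` and `"n/a"` are invented examples of values that cannot be parsed as numbers.

```python
import pandas as pd

raw = pd.Series(["29.85", " ", "108.15", "n/a"], name="TotalCharges")

# errors="coerce" turns anything unparseable into NaN instead of raising
clean = pd.to_numeric(raw, errors="coerce")

# Fraction of rows that became missing after the conversion
missing_fraction = clean.isna().mean()
print(clean.dtype, missing_fraction)
```

Checking the missing fraction right after the conversion tells you how many rows were silently turned into `NaN`, so the cleanup does not hide a larger data problem.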

In [None]:
# ==============================
# 2. Minimal Data Cleaning
# ==============================
'''
# WE WILL SKIP THIS STEP -- keeping the Id pandas created
# ---- Step 1: Identify an ID column (if present) ----
# Many real datasets have an ID column like customerID.
# We'll look for common names.
id_candidates = ["customerID", "CustomerID", "customer_id", "id"]

id_col = None  # default: we haven't found an ID column yet

for candidate in id_candidates:
    if candidate in df.columns:
        id_col = candidate
        break  # stop at the first match

if id_col is not None:
    # duplicated(...) returns True/False for each row
    # sum() counts how many True values we have
    duplicate_id_count = df.duplicated(subset=[id_col]).sum()

    print("ID column detected:", id_col)
    print("Duplicate IDs:", duplicate_id_count)
else:
    print("No obvious ID column found (that's ok for this demo).")
'''
# We drop the customerID column because it will not be useful for further analysis.
# The if-check makes the cell safe to rerun: dropping a column that is already gone would raise an error.
if "customerID" in df.columns:
    df = df.drop('customerID', axis=1)
df.info()

In [None]:
# ---- Step 2: Fix numeric columns that loaded as text ----
# 'TotalCharges' is often stored as an object column because some rows are blank.

# pd.to_numeric converts values to floats where possible.
# errors="coerce" means: if parsing fails, replace with NaN (missing).
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Confirm that TotalCharges shows up in summary of numerical columns
df.describe().T

In [None]:
# ---- Step 3: Create a numeric churn flag (0/1) ----
# This is convenient later for correlations and simple modeling baselines.

# Standardize text: strip spaces and lowercase
# df["Churn"].astype(str) returns a Series (a single column) with every entry converted to string
# str.strip() returns a copy of the string with leading and trailing whitespaces removed
# str.lower() returns the string converted to lower case
#churn_clean = df["Churn"].astype(str).str.strip().str.lower()
churn_clean = df["Churn"].str.strip().str.lower()

# Compare to "yes" -> True/False, then convert to 1/0
# Create a new column ChurnFlag with this 0/1 data
df["ChurnFlag"] = (churn_clean == "yes").astype(int)

# Confirm that ChurnFlag shows up in summary of numerical columns
df.describe().T

In [None]:
# ---- Step 4: Quick missingness scan ----
# isna() gives a True/False DataFrame of missing values.
# mean() on True/False gives the fraction missing in each column.
# sort_values(ascending=False) puts the columns with the largest missing fraction on top
missing_fraction = df.isna().mean().sort_values(ascending=False)

missing_fraction.head(10)

## 3. Target variable: class balance and “base rate”

Before any model, we want to know:
- What fraction of customers churn?
- Is the target imbalanced?

### Takeaways
- The **base rate** is the share of customers who churn. It also answers: “If I predict the majority class every time, how often am I correct?”
- Class imbalance affects modeling and evaluation later.
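
The base rate and the “always predict the majority class” accuracy can be read off the same table. A sketch on a made-up target column (10 hypothetical customers, 3 of whom churn):

```python
import pandas as pd

churn = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

proportions = churn.value_counts(normalize=True)

churn_rate = proportions["Yes"]         # share of customers who churned
baseline_accuracy = proportions.max()   # accuracy of always guessing the majority class

print(f"churn rate = {churn_rate:.0%}, baseline accuracy = {baseline_accuracy:.0%}")
```

Any model you build later should be compared against this baseline accuracy, not against zero.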


In [None]:
# ==============================
# 3. Target variable: base rate
# ==============================

# Count how many customers churned vs not churned
# dropna=False keeps missing values (if any) visible in the table
churn_counts = df["Churn"].value_counts(dropna=False)

churn_counts

In [None]:
# Convert counts to proportions (fractions that sum to 1.0)
churn_proportions = df["Churn"].value_counts(normalize=True, dropna=False)

# Put the variable on the last line so the notebook displays it
churn_proportions

In [None]:
# Simple bar chart of churn counts (good for a quick class balance check)
plt.figure(figsize=(5, 3))  # (width, height) in inches

# value_counts() returns a pandas Series -> Series.plot() uses matplotlib behind the scenes
ax = churn_counts.plot(kind="bar")

# Add clear labels (always label axes in business work!)
ax.set_title("Churn counts")
ax.set_xlabel("Churn (target)")
ax.set_ylabel("Number of customers")

plt.tight_layout()  # reduces label cutoff
plt.show()

### Stop & Discuss: what does class balance imply?

1) **Question:** If we always predict the most common class (usually “No churn”), why can the accuracy look “good” even if the model is useless?  

2) **Question:** In a churn context, which is usually more costly: a false positive (predict churn but they stay) or a false negative (predict stay but they churn)?  


## 4. Categorical EDA: what are the big segments?

For categorical columns, we often start with:
- `.value_counts()` (counts and proportions; returns the counts in descending order)
- bar charts (sorted)

### Takeaways
- Segment size matters: a segment can have high churn rate but be tiny.
- EDA is partly about **prioritization**: focus on columns that are likely to matter for business decisions.
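
One way to keep segment size and churn rate in view together is a single grouped table. This is a sketch on invented rows (the counts and flags are not from the Telco data), assuming a 0/1 column named `ChurnFlag`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["One year"] * 3 + ["Two year"],
    "ChurnFlag": [1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
})

# One row per segment: how many customers, and what share of them churned
segments = (
    toy.groupby("Contract")["ChurnFlag"]
       .agg(customers="size", churn_rate="mean")
       .sort_values("customers", ascending=False)
)
print(segments)
```

In this toy table the “Two year” segment has a 100% churn rate, but it is a single customer, which is exactly the situation the first takeaway warns about.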


In [None]:
# ============================================
# 4. Categorical EDA: what are the big segments?
# ============================================

# In pandas, categorical columns are often stored as dtype "object" (text).
# customerID was already dropped in Section 2, so we only need to exclude
# the target columns from the "feature" list.
exclude_cols = ["Churn", "ChurnFlag"]

# Build a list of categorical feature columns (with type object)
cat_cols = []
for col in df.columns:
    if df[col].dtype == "object" and col not in exclude_cols:
        cat_cols.append(col)

cat_cols

In [None]:
# Show frequency tables for a few business-relevant categorical columns
key_cats = ["Contract", "InternetService", "PaymentMethod", "PaperlessBilling", "gender"]

for col in key_cats:
    print("\n" + "=" * 90)
    print("Column:", col)

    # Counts: how many rows in each category?
    counts = df[col].value_counts(dropna=False)

    # Proportions: what fraction of customers are in each category?
    proportions = df[col].value_counts(normalize=True, dropna=False)

    # Combine into a single easy-to-read table
    summary_table = pd.DataFrame({
        "count": counts,
        "proportion": proportions
    })

    display(summary_table.head(10))


In [None]:
# A readable bar chart for one important categorical variable (Contract)
plt.figure(figsize=(7, 3))

# Order bars from most common to least common
order = df["Contract"].value_counts().index

# countplot shows category counts; order= controls the category order
sns.countplot(data=df, x="Contract", order=order)

plt.title("Contract type distribution")
plt.xlabel("Contract")
plt.ylabel("Number of customers");

In [None]:
# Find Churn Rate within each segment of Contract
churn_by_contract = df["ChurnFlag"].groupby(df["Contract"]).mean()

# Sort for readability
churn_by_contract = churn_by_contract.sort_values(ascending=True)

display(churn_by_contract)

# Plot as a horizontal bar chart (readable for long category labels)
plt.figure(figsize=(8, 3))
churn_by_contract.mul(100).plot(kind="barh")  # multiply by 100 to show percent
plt.xlabel("Churn rate (%)")
plt.ylabel("Contract")
plt.title("Churn rate by Contract (sorted)");

### Stop & Discuss: segments vs outcomes

1) **Question:** Why do we look at segment size *and* churn rate?  

2) **Question:** If one segment has high churn, does that prove the segment *causes* churn?  


## 5. Numeric distributions (univariate): histograms, ECDFs, boxplots, normal overlay, QQ plot

We will practice multiple ways to look at *one* numeric column:
- Histogram: fast shape check (but depends on bins)
- ECDF: distribution view without bins
- Boxplot: median + spread + outliers
- Normal overlay + QQ plot: check whether “normality” is a reasonable approximation

### Takeaways
- Different plots show different things
- Many business variables are **not** normally distributed; checking is easy and worth it.


In [None]:
# Identify numeric columns (numbers we can compute means/correlations on)
num_cols = df.select_dtypes(include="number").columns.tolist()

num_cols

In [None]:
# Quick histogram grid for numeric columns
# - bins controls how many bars we use
# - figsize controls overall plot size
df[num_cols].hist(bins=30, figsize=(12, 8))
plt.suptitle("Numeric distributions (histograms)");

### Stop & Discuss: reading histograms

1) **Question:** Which numeric variables look skewed or have long tails?  
   
2) **Question:** If you see a spike at 0 (or a strange gap), what might that mean?  


### 5.1 ECDF (Empirical CDF)

ECDF answers: “What fraction of customers are below this value?”

**Why it’s nice:** it avoids arguments about histogram bin sizes.

(Seaborn ECDF docs: https://seaborn.pydata.org/generated/seaborn.ecdfplot.html)

### Takeaways
- ECDF is often easier than histograms for comparing groups.
- When curves differ a lot, the groups differ a lot.
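
An ECDF is simple enough to compute by hand, which helps when reading the plot. The sketch below uses five made-up charge values; `xs` and `ys` are the points an ECDF plot would draw.

```python
import numpy as np

values = np.array([70.0, 20.0, 55.0, 90.0, 20.0])

# Sort the data; the i-th smallest value sits at height i / n
xs = np.sort(values)
ys = np.arange(1, len(xs) + 1) / len(xs)

# Reading the curve at x = 55: share of observations at or below 55
share_at_or_below_55 = np.mean(values <= 55.0)
print(xs, ys, share_at_or_below_55)
```

No bin width is chosen anywhere, which is why two analysts will always draw the same ECDF from the same data.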


In [None]:
# ECDF for MonthlyCharges, split by churn group
# ECDF = Empirical Cumulative Distribution Function:
# it shows the fraction of observations <= a given value.

col = "MonthlyCharges"

plt.figure(figsize=(7, 4))

sns.ecdfplot(data=df, x=col, hue="Churn")
plt.grid(True)
plt.title(f"ECDF of {col}")
plt.xlabel(col)
plt.ylabel("Proportion of customers ≤ x");

### Stop & Discuss: how to read an ECDF

1) **Question:** At a value *x* on the horizontal axis, what does the ECDF’s y-value mean?  
   
2) **Question:** When comparing two ECDF lines (Churn=Yes vs No), what does it mean if one line is consistently above the other?  
   

### 5.2 Boxplots (including group comparison)

Boxplots help spot outliers and compare churn vs non-churn quickly.
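
The points a boxplot draws as “outliers” usually come from the 1.5 × IQR rule, which to my knowledge is the default whisker setting in both matplotlib and seaborn. A sketch of that rule on invented values:

```python
import numpy as np

values = np.array([18, 20, 25, 30, 35, 40, 45, 50, 55, 200], dtype=float)

# Quartiles and interquartile range (the height of the box)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything beyond them is drawn as an individual point
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

A flagged point is a prompt to investigate, not proof of an error: a very high charge may be a data-entry mistake or your most valuable customer.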


In [None]:
# Boxplots show:
# - median (line)
# - middle 50% range (box)
# - potential outliers (points beyond whiskers)
col = "MonthlyCharges"

# ---- Boxplot for the full dataset ----
plt.figure(figsize=(6, 3))
plt.boxplot(df[col].dropna())
plt.title(f"Boxplot of {col}")
plt.ylabel(col)
plt.show()

In [None]:
# ---- Boxplot split by churn group ----
plt.figure(figsize=(6, 3))
sns.boxplot(data=df, x="Churn", y=col)
plt.title(f"{col} by churn")
plt.xlabel("Churn")
plt.ylabel(col)
plt.show()

### 5.3 Histogram + Normal overlay (reference only)

We sometimes compare a distribution to a normal curve as a quick sanity check.
This is *not* a test — it’s a visual reference.

### Takeaways
- If the data is highly skewed or heavy-tailed, normal assumptions may break.
- The goal is awareness, not perfection.
- Sometimes this can help us realize that we need to **transform** the data before further analysis (example, log transform)
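
To illustrate the point about transforms, the sketch below simulates a right-skewed “spend” variable (synthetic log-normal data, not a Telco column) and compares skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic, strictly positive, right-skewed data (an assumption for illustration)
rng = np.random.default_rng(0)
spend = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

skew_raw = spend.skew()          # strongly positive: long right tail
skew_log = np.log(spend).skew()  # close to 0: roughly symmetric

print(f"skew before = {skew_raw:.2f}, after log = {skew_log:.2f}")
```

The log transform only works for positive values; for columns that contain zeros, `np.log1p` is a common alternative.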


In [None]:
# Histogram vs. Normal overlay
# Goal: visually compare a real business variable to a "perfect normal" distribution.
# This is useful because many modeling methods assume (approximately) normal errors.

col = "MonthlyCharges"

# Drop missing values
x = df[col].dropna()

# Compute mean and sample standard deviation (pandas .std() uses ddof=1 by default)
mu = x.mean()
sigma = x.std()

# Build a smooth x-axis for the normal curve
xs = np.linspace(x.min(), x.max(), 100)

# Normal distribution PDF formula (so we don't need extra libraries)
pdf = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((xs - mu) / sigma) ** 2)

plt.figure(figsize=(7, 4))

# density=True scales the histogram so the area sums to 1 (so we can overlay a PDF)
plt.hist(x, bins=30, density=True, alpha=0.6, label="Data (histogram)")

# Plot the normal curve with the same mean/std
plt.plot(xs, pdf, label="Normal curve (same mean/std)")

plt.title(f"{col}: histogram (density) + normal overlay")
plt.xlabel(col)
plt.ylabel("Density")
plt.legend()
plt.show()

print(f"Column Mean = {mu:.2f}, Column Std = {sigma:.2f}")

In [None]:
# ECDF vs. Normal Overlay
from scipy.stats import norm

col = "MonthlyCharges"

# Drop missing values
x = df[col].dropna()
# Compute mean and sample standard deviation (pandas .std() uses ddof=1 by default)
mu = x.mean()
sigma = x.std()
# Build a smooth x-axis for the normal curve
xs = np.linspace(x.min(), x.max(), 100)

plt.figure(figsize=(8, 5))
sns.ecdfplot(x=x, label="Data (ECDF)")
# Theoretical normal CDF at these x values, using the sample mean and std
ys = norm.cdf(xs, mu, sigma)
plt.plot(xs, ys, color="red", linestyle="--", label="Theoretical Normal CDF")
plt.title(f"{col}: ECDF + normal CDF overlay")
plt.legend();

### 5.4 QQ / Probability plot (optional but powerful)

A QQ/probability plot compares your sample quantiles to theoretical normal quantiles.

- If points follow the line closely → normal approximation might be reasonable.
- Curvature (especially at the ends) → non-normal tails/skew.

(SciPy probplot docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html)

### Takeaways
- QQ plots show *where* the distribution differs (often in the tails).
- Tails matter in business (risk, churn extremes, high-spend customers).
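
The “how close to the line” judgment can also be summarized with a number. As a sketch, `scipy.stats.probplot` (the function linked above) returns the correlation `r` between sample and theoretical quantiles; the two samples below are synthetic and only illustrate the contrast.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=65, scale=30, size=2000)   # bell-shaped
skewed_sample = rng.exponential(scale=30, size=2000)      # long right tail

# probplot returns (quantile pairs), (slope, intercept, r) of the fitted line
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(f"r (normal sample) = {r_normal:.3f}, r (skewed sample) = {r_skewed:.3f}")
```

An `r` near 1 means the points hug the line. It is a summary, not a replacement for looking at where the points bend away.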

In [None]:
# QQ / Probability plot (optional but powerful)
# statsmodels' sm.qqplot compares sample quantiles against a theoretical distribution.

col = "MonthlyCharges"
# Drop missing values
x = df[col].dropna()

# Create the axes first and pass them in; otherwise qqplot opens its own figure
# and a separate plt.figure(...) call would leave an empty figure behind
fig, ax = plt.subplots(figsize=(6, 6))
# The default dist is the normal distribution; line="s" draws a standardized reference line
sm.qqplot(x, line="s", ax=ax)
ax.set_title(f"QQ-plot (normal) for {col}")

plt.show()


## 6. Compare churn vs non-churn (tables + plots)

A simple, powerful pattern:
1. Group by the target (`Churn`)
2. Summarize numeric columns (`.describe()`, quantiles)
3. Visualize differences (boxplots, ECDF by group)

### Takeaways
- Tables give **effect size** (how big the difference is).
- We are not “modeling” yet: we are building intuition and hypotheses.


In [None]:
# ==============================================
# 6. Compare churn vs non-churn (tables + plots)
# ==============================================

# Pick a small set of numeric columns that are meaningful in this dataset
numeric_focus = ["tenure", "MonthlyCharges", "TotalCharges"]

# groupby("Churn") splits rows into two groups (Yes/No)
# describe() computes count/mean/std/min/quantiles/max for each numeric column
summary = df.groupby("Churn")[numeric_focus].describe().T




summary

## 7. Segment analysis: churn-rate heatmap (graph-table style)

Instead of asking “How many churned?” we ask:
> “What is the **churn rate** among customers with Contract = "One year" and InternetService = "DSL"?”

A useful pattern:
- Use `pd.crosstab(..., aggfunc="mean")` on a boolean churn indicator (crosstab = crosstabulation)
- Visualize it as a heatmap

### Takeaways
- Rates are usually more actionable than raw counts.
- Segment heatmaps are close to what many business dashboards need.
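
A small sketch of the crosstab pattern on six invented customers. It builds the rate table and, alongside it, a plain count table, since a rate computed from one or two customers deserves less trust.

```python
import pandas as pd

toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 3 + ["One year"] * 3,
    "InternetService": ["DSL", "Fiber optic", "Fiber optic", "DSL", "DSL", "Fiber optic"],
    "ChurnFlag": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 flag within each cell = churn rate of that segment
rate = pd.crosstab(toy["Contract"], toy["InternetService"],
                   values=toy["ChurnFlag"], aggfunc="mean")

# Same layout without values/aggfunc = number of customers per cell
size = pd.crosstab(toy["Contract"], toy["InternetService"])

print(rate)
print(size)
```

Reading both tables together guards against over-interpreting a bright heatmap cell that represents only a handful of customers.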


In [None]:
# =========================================================
# 7. Segment analysis: churn-rate heatmap (graph-table style)
# =========================================================
# We want churn RATE (a percentage), not just churn counts.
# A good segment plot answers: "Which groups have the highest churn rate?"

# Pick two segmentation columns if present
row_col = "Contract"
col_col = "InternetService"

# Crosstab + aggfunc="mean":
# mean(True/False) = fraction True = churn rate
churn_rate = pd.crosstab(
        df[row_col],               # rows
        df[col_col],               # columns
        values=df['ChurnFlag'],         # values to aggregate
        aggfunc="mean"             # mean -> rate
    )

display(churn_rate)

plt.figure(figsize=(7, 4))

# annot = True prints the values inside cells
# fmt = ".1%" formats as a percentage with 1 decimal place
sns.heatmap(churn_rate, annot=True, fmt=".1%")

plt.title("Churn rate by segment")
plt.ylabel(row_col)
plt.xlabel(col_col)
plt.show()

### Stop & Discuss: interpreting the churn-rate heatmap

1) **Question:** Which segment combination has the highest churn rate?  

2) **Question:** If a segment has high churn, what are two *non-causal* explanations you should consider?  

3) **Question:** What additional slice would you look at next to refine the story?  


## 8. Correlations (numeric) + heatmap

Correlation is a screening tool:
- Good for fast detection of linear relationships
- Not evidence of causality

### Takeaways
- Correlation helps you ask better questions.
- Correlation ≠ causation.
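
Because correlation only measures *linear* association, it can report “nothing here” for a relationship that is perfectly predictable. A synthetic sketch:

```python
import numpy as np

# y is completely determined by x, but the relationship is U-shaped
x = np.linspace(-1, 1, 201)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]
print(f"correlation = {r:.3f}")
```

The correlation is essentially zero even though knowing `x` tells you `y` exactly, which is one reason to plot the data and not rely on the correlation table alone.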


In [None]:
# ==========================================
# 8. Correlations (numeric subset) + heatmap
# ==========================================

# Correlation measures *linear* relationship between two numeric variables.
# We'll focus on a small numeric set plus ChurnFlag (0/1).
numeric_focus = ["tenure", "MonthlyCharges", "TotalCharges", "ChurnFlag"]

# corr() returns a correlation matrix (table)
corr = df[numeric_focus].corr(numeric_only=True)

corr

In [None]:
# Visualize the correlation matrix as a heatmap
plt.figure(figsize=(6, 4))

# center=0 makes 0 correlation a neutral color (helps interpretation)
sns.heatmap(corr, annot=True, center=0)

plt.title("Correlation heatmap (numeric subset)")
plt.show()

### Stop & Discuss: what correlation is (and isn’t)

1) **Question:** What does a correlation of +0.7 mean in plain language?  
   
2) **Question:** Why might correlation miss an important relationship?  
   
3) **Question:** Why does correlating with `ChurnFlag` (0/1) sometimes still make sense?  


## 9. Pairplot (curated variables + engineered ServiceCount)

Pairplots help us see:
- clusters
- separation between churn vs non-churn
- non-linear relationships

But pairplots do **not** scale to dozens of variables. We keep it small and readable.

### Takeaways
- Pairplots are best for a curated subset (3–6 variables).
- Light feature engineering can reveal structure that raw columns hide.
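
As one possible way to build a count-of-services feature, the sketch below compares every service column to "Yes" at once and sums across each row. The three-customer table is invented, and the assumption is that service columns hold text values such as "Yes", "No", or "No internet service".

```python
import pandas as pd

toy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "TechSupport": ["Yes", "Yes", "No internet service"],
    "StreamingTV": ["No", "Yes", "No internet service"],
})
services = ["OnlineSecurity", "TechSupport", "StreamingTV"]

# eq("Yes") gives a True/False table; summing across columns counts the True values per row
service_count = toy[services].eq("Yes").sum(axis=1)
print(service_count.tolist())
```

Anything other than an exact "Yes" counts as 0 here, so values with stray spaces or different capitalization would need to be standardized first.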


In [None]:
# ==========================================================
# 9. Pairplot (curated variables + engineered ServiceCount)
# ==========================================================

# Many telco columns are Yes/No service flags. Let's create one simple summary:
# ServiceCount = how many optional services the customer has.

services = [
    "OnlineSecurity",
    "OnlineBackup",
    "DeviceProtection",
    "TechSupport",
    "StreamingTV",
    "StreamingMovies",
    "MultipleLines",
]

df["ServiceCount"] = 0  # start at 0 for everyone

for col in services:
    # Add 1 if the service is "Yes", else add 0
    df["ServiceCount"] += df[col].astype(str).str.strip().eq("Yes").astype(int)

df["ServiceCount"].describe()

In [None]:
# Select a small set of numeric columns for the pairplot (readable, not overwhelming)
pair_cols =  ["tenure", "MonthlyCharges", "TotalCharges", "ServiceCount"]

# Pairplot can be slow for large datasets. We'll sample to keep it fast in class.
# Select all required columns in one list, including the hue variable (ChurnFlag)
pair_df = df[pair_cols + ["ChurnFlag"]].dropna()

# Sample down to at most 1000 rows for less cluttered plot (adjust as needed)
if len(pair_df) > 1000:
    pair_df = pair_df.sample(1000, random_state=42)

# pairplot shows:
# - scatterplots for each pair of variables
# - KDEs on the diagonal (diag_kind="kde")
# - hue="ChurnFlag" colors points by churn group
g = sns.pairplot(
    data=pair_df,
    vars=pair_cols,
    hue="ChurnFlag",
    diag_kind="kde",
    corner=True,                 # only plot the lower triangle (less clutter)
    plot_kws={"alpha": 0.4, "s": 18}  # alpha = transparency, s = marker size
)

g.fig.suptitle("Pairplot (sampled) — look for separation and relationships", y=1.02)
plt.show()

In [None]:
# sns.pairplot(df, vars=pair_cols, kind = 'reg');

In [None]:
# sns.pairplot(df, vars=pair_cols, hue = 'Churn', palette = 'husl', markers = ['o', 'D']);

### Stop & Discuss: what story does the pairplot tell?

1) **Question:** Do churners and non-churners look separated in any pair of variables?  

2) **Question:** If there is no clear visual separation, does that mean churn is not predictable?  


## 10. Telco recap: a reusable EDA checklist

### A practical EDA checklist
1. Define the business question + unit of analysis (row = what?)
2. Inspect structure (`.shape`, `.head()`, `.info()`)
3. Fix obvious issues (types, missingness, duplicates)
4. Understand the target (base rate / class balance)
5. Summarize categorical columns (`value_counts`)
6. Summarize numeric columns (`describe`, hist/ECDF/boxplot)
7. Compare target groups (groupby summaries + plots)
8. Segment analysis (rates, heatmaps)
9. Relationship scan (correlation, pairplot)
10. Write down 3–5 hypotheses to test with models

### Takeaways
- EDA is a repeatable workflow, not a mystery.
- The goal is not “make lots of plots.” The goal is to decide what to do next.


# Part B — Visualization Studio Add-on (Gapminder)  
*(Still only matplotlib + seaborn)*

Now we’ll switch domains and focus on visualization patterns you can reuse:
- log scales for skewed variables
- small multiples (facets)
- connected dot (dumbbell) comparisons

### Dataset notes
- Gapminder provides data as downloadable CSV indicators: https://www.gapminder.org/data/
- We load a common “five-year” Gapminder CSV from Plotly’s datasets repository:  
  https://github.com/plotly/datasets/blob/master/gapminderDataFiveYear.csv



In [None]:
# ==========================================
# Load Gapminder data (CSV) with pandas
# ==========================================

gap_url = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"

# Read into a DataFrame
gap = pd.read_csv(gap_url)

# Quick shape check
gap.shape


In [None]:
# First few rows: what columns do we have?
gap.head()

In [None]:
# Column types + missingness
gap.info()

## 11. Bubble scatter (2007 snapshot): life expectancy vs GDP per capita (log x-axis)

We’ll replicate a classic view:
- `x = gdpPercap` (log scale)
- `y = lifeExp`
- `size = pop`
- `hue = continent`

### Takeaways
- Log scales can reveal relationships hidden by skewed axes.
- Multi-encoding (x/y/color/size) is powerful, but readability matters.
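
Why a log axis helps can be shown with numbers. In the synthetic sketch below (made-up values, not Gapminder), life expectancy rises by a fixed amount each time GDP per capita is multiplied by 10. That pattern is a straight line against log GDP but a curve against raw GDP.

```python
import numpy as np

# Synthetic GDP per capita, evenly spaced on a multiplicative scale
gdp = np.geomspace(300, 50_000, 50)

# Assumed relationship for illustration: +12 years per tenfold increase in GDP
life = 20 + 12 * np.log10(gdp)

r_raw = np.corrcoef(gdp, life)[0, 1]            # linear fit on the raw axis
r_log = np.corrcoef(np.log10(gdp), life)[0, 1]  # linear fit on the log axis

print(f"r on raw GDP = {r_raw:.3f}, r on log10 GDP = {r_log:.3f}")
```

On the log axis, equal horizontal distances mean equal *ratios*, which matches how income differences are usually discussed (“twice as rich”, not “$5,000 richer”).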


In [None]:
# -----------------------------------------
# Bubble scatter (2007 snapshot)
# -----------------------------------------
# x = GDP per capita (log scale)
# y = life expectancy
# size = population
# color = continent

# Filter to one year (keeps the plot readable)
gap07 = gap.query("year == 2007").copy()

plt.figure(figsize=(10, 6))

ax = sns.scatterplot(
    data=gap07,
    x="gdpPercap",
    y="lifeExp",
    hue="continent",
    size="pop",
    sizes=(20, 1200),   # min and max marker sizes
    alpha=0.7
)

# GDP per capita is very skewed; log scale makes patterns easier to see
ax.set_xscale("log")

ax.set_title("2007: Life expectancy vs GDP per capita (log scale)")
ax.set_xlabel("GDP per capita (log scale)")
ax.set_ylabel("Life expectancy")

# Move legend outside the plot so it doesn't cover points
plt.legend(bbox_to_anchor=(1.02, 1), loc="upper left", title="Continent")
plt.tight_layout()
plt.show()


### Stop & Discuss: what story does the bubble chart tell?

1) **Question:** What is the overall relationship between GDP per capita and life expectancy?  

2) **Question:** Why do we use a log scale for GDP per capita?  


## 12. Small multiples (facets): same scatter, split by continent

Small multiples reduce clutter and make comparisons easier.

### Takeaways
- If one plot is too busy, don’t fight it — **split it**.
- Facets make patterns visible without needing a complicated legend.


In [None]:
# -----------------------------------------
# Small multiples (FacetGrid) by continent
# -----------------------------------------
# This reduces clutter and makes continent-by-continent comparisons easier.

g = sns.FacetGrid(
    data=gap07,
    col="continent",
    col_wrap=3,     # wrap into multiple rows
    height=3,
    sharex=False,   # allow each facet to choose its own x-limits
    sharey=True
)

# Draw the same plot in each facet
g.map_dataframe(sns.scatterplot, x="gdpPercap", y="lifeExp", alpha=0.7)

# Apply log scale to each facet axis
for ax in g.axes.flatten():
    ax.set_xscale("log")

g.set_axis_labels("GDP per capita (log scale)", "Life expectancy")
g.fig.suptitle("2007: LifeExp vs GDP per capita by continent", y=1.02)

plt.show()


### Stop & Discuss: why small multiples?

1) **Question:** What do you see in the faceted plots that was harder to see in the single global plot?  

2) **Question:** When is faceting a bad idea?  


## 13. Dumbbell (connected dot) plot: life expectancy change from 1952 → 2007

A dumbbell plot is great for comparing two time points across categories (here: countries).
It is also called a “connected dot plot.”  
Examples/definitions:  
- https://datavizcatalogue.com/blog/chart-snapshot-dumbbell-plot/  
- https://datavizproject.com/data-type/dumbbell-plot/

We will:
1. Choose the top 10 countries by population in 2007
2. Compare life expectancy in 1952 vs 2007

### Takeaways
- For “before vs after,” a dumbbell plot is often clearer than many time series lines.
- The connecting line makes *difference* the visual focus.
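
The data preparation behind a dumbbell plot is a long-to-wide reshape. A sketch with two hypothetical countries ("A" and "B") and invented life expectancies:

```python
import pandas as pd

long = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [1952, 2007, 1952, 2007],
    "lifeExp": [40.0, 70.0, 60.0, 75.0],
})

# One row per country, one column per year
wide = long.pivot(index="country", columns="year", values="lifeExp")

# The length of each dumbbell is the difference between the two columns
wide["improvement"] = wide[2007] - wide[1952]
wide = wide.sort_values("improvement")
print(wide)
```

Sorting by the difference before plotting is what makes the chart readable: the longest connecting lines end up next to each other.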


In [None]:
# -----------------------------------------
# Dumbbell / connected dot plot (1952 -> 2007)
# -----------------------------------------
# A dumbbell plot is great for "before vs after" comparisons.

# Pick a set of countries: top 10 by population in 2007
top_countries = gap07.nlargest(10, "pop")["country"].tolist()

# Keep only those countries and the two years we care about
subset = gap[
    gap["country"].isin(top_countries) &
    gap["year"].isin([1952, 2007])
].copy()

# Pivot so we get one row per country and two columns: 1952 and 2007
pivot = subset.pivot(index="country", columns="year", values="lifeExp")

# Add an improvement column to sort countries by change
pivot["improvement"] = pivot[2007] - pivot[1952]

# Sort so the biggest improvers appear at the top of the plot
pivot = pivot.sort_values("improvement")

pivot


In [None]:
# Plot the dumbbell chart (life expectancy in 1952 vs 2007)
y_positions = np.arange(len(pivot))  # 0..N-1

plt.figure(figsize=(9, 6))

# Horizontal lines connect 1952 to 2007 for each country
plt.hlines(
    y=y_positions,
    xmin=pivot[1952],
    xmax=pivot[2007]
)

# Points for each year
plt.plot(pivot[1952], y_positions, "o", label="1952")
plt.plot(pivot[2007], y_positions, "o", label="2007")

# Label y-axis with country names
plt.yticks(y_positions, pivot.index)

plt.title("Life expectancy change (1952 → 2007) for top-population countries")
plt.xlabel("Life expectancy (years)")
plt.legend()
plt.tight_layout()
plt.show()

### Stop & Discuss: interpreting the dumbbell plot

1) **Question:** Which country improved the most in life expectancy (1952 → 2007) *within this selected set*?  

2) **Question:** Why might we prefer a dumbbell plot over plotting a full time series for 10 countries?  


# Wrap-up and quick practice

### What you practiced today
- Inspecting structure and types (`info`, `describe`, missingness)
- Distribution views (hist, ECDF, boxplot, normal overlay, QQ plot)
- Group comparisons (`groupby().describe()`)
- Segment churn rates with heatmaps (graph-table style)
- Relationship scans (correlation, encoded correlations, pairplots)
- Visualization patterns (log scale, facets, dumbbell comparison)

### Practice exercises
1. **Telco:**  Which payment methods have the highest churn rate?
2. **Gapminder:** Which country improved its life expectancy the most?

### Takeaways
- EDA is a workflow you can reuse in every project.
- The best plots make the next decision obvious.