
# Data Wrangling & Tidy Data (Python + pandas)

**Goals for this tutorial (≈30 minutes):**
- Load and inspect a messy, presentation-style dataset
- Reshape from wide → long (tidy) with `melt`
- Handle missing values and implicit missingness
- Transform & normalize variables for analysis
- Visualize the results using matplotlib

**Dataset:** a simplified Pew Religion vs Income table (`pew_religion.csv`)


## 1) Setup & Imports

In [16]:
%pip install pandas numpy matplotlib




In [None]:
# Core libraries: pandas for tables, NumPy for math, matplotlib for plots
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)


## 2) Load & Preview the Data

In [None]:

# Adjust this path to wherever pew_religion.csv lives in your environment
path = "/mnt/data/pew_religion.csv"
df = pd.read_csv(path)
df.head()


In [None]:

df.info()



### Identify "messiness"
- Income **ranges are column headers**, i.e., **values stored as headers**.
- This format is great for *presentation* but not for *analysis*.
- We want **tidy data**: each variable is a column and each observation is a row. Here the variables are religion, income bracket, and frequency.


## 3) Reshape (Wide → Long) with `melt()`

In [None]:

tidy = df.melt(
    id_vars="religion",
    var_name="income_bracket",
    value_name="freq"
)
tidy.head(10)
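
To see what `melt` does without the real file, here is a minimal sketch on a toy table. The column labels and counts are invented for illustration and are not the Pew figures; `wide` and `long_df` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the wide table; counts are invented, not real Pew data.
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [10, 5],
    "$10-20k": [20, 15],
    "$20-30k": [30, 25],
})

# Every non-id column header becomes a value in "income_bracket".
long_df = wide.melt(id_vars="religion", var_name="income_bracket", value_name="freq")

# Each cell of the wide table becomes exactly one row of the long table.
n_value_cols = wide.shape[1] - 1
assert len(long_df) == len(wide) * n_value_cols

print(long_df)
```

The row-count check is a cheap sanity test worth running after any reshape: rows in the long table should equal rows times value columns in the wide one.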


## 4) Handle Missing Values

In [None]:

# Inspect explicit missingness
tidy.isna().sum()



This demo dataset may contain no missing values. In real data, choose a strategy deliberately:  
- Drop rows when a missing count cannot be recovered and the remaining rows still support the analysis: `tidy.dropna(subset=["freq"])`  
- Impute values (mean/median for numeric, "Unknown" for categorical) only when the assumption behind the fill is justified.
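
The strategies above can be compared side by side on a small example. This is a sketch on an invented toy table (the names `toy`, `dropped`, `filled_zero`, and `filled_median` are hypothetical), and which option is appropriate depends on why the value is missing.

```python
import numpy as np
import pandas as pd

# Toy tidy table with one missing count; values are invented for illustration.
toy = pd.DataFrame({
    "religion": ["A", "A", "B", "B"],
    "income_bracket": ["<$10k", "$10-20k", "<$10k", "$10-20k"],
    "freq": [10.0, np.nan, 5.0, 15.0],
})

# Option 1: drop the incomplete row.
dropped = toy.dropna(subset=["freq"])

# Option 2: fill with 0, which claims that nobody was counted in that cell.
filled_zero = toy.fillna({"freq": 0})

# Option 3: fill with the median of the same bracket across religions
# (an assumption that only makes sense if the groups are comparable).
bracket_median = toy.groupby("income_bracket")["freq"].transform("median")
filled_median = toy.assign(freq=toy["freq"].fillna(bracket_median))

print(len(dropped), filled_zero["freq"].tolist(), filled_median["freq"].tolist())
```

Note that the three results tell different stories about the same missing cell, so the choice should be stated alongside any plot or summary built from it.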


In [None]:

# Example (no-op if none are missing):
tidy = tidy.dropna(subset=["freq"])
tidy.head(5)
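
Implicit missingness is different: the row itself is absent, so `isna()` reports nothing. A table produced by `melt` always has every religion and bracket combination, but data that arrives already in long form may not. A minimal sketch, on an invented toy table, of making absent combinations explicit:

```python
import pandas as pd

# Toy long table where the (B, $10-20k) row is simply absent; values are invented.
toy = pd.DataFrame({
    "religion": ["A", "A", "B"],
    "income_bracket": ["<$10k", "$10-20k", "<$10k"],
    "freq": [10, 20, 5],
})

# Build the full grid of religion x bracket combinations.
full_index = pd.MultiIndex.from_product(
    [toy["religion"].unique(), toy["income_bracket"].unique()],
    names=["religion", "income_bracket"],
)

# Reindexing onto the full grid inserts the absent rows with freq = NaN.
completed = (
    toy.set_index(["religion", "income_bracket"])
       .reindex(full_index)
       .reset_index()
)

print(completed)
```

Once the gap is explicit, it can be handled with the same drop-or-impute decision as any other missing value.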


## 5) Transform & Normalize

In [None]:

# Log-transform the frequency to compress scale (example transformation)
tidy = tidy.assign(
    freq_log = np.log1p(tidy["freq"])
)

# Optional: extract a rough numeric hint from each income bracket.
# The regex pulls the first number it finds, so "$10–20k" gives 10 (a lower bound),
# but "<$10k" also gives 10 (really an upper bound) and labels without digits give NaN.
lower_bounds = tidy["income_bracket"].str.extract(r"(\d+)", expand=False).astype(float)
tidy = tidy.assign(income_lb_approx=lower_bounds)
tidy.head(10)
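
The log transform compresses scale, but it does not make religions of different sizes comparable. One common normalization is the within-religion proportion: each count divided by that religion's total. A minimal sketch on an invented toy table (the `prop` column name is an assumption):

```python
import pandas as pd

# Toy tidy table; counts are invented for illustration.
toy = pd.DataFrame({
    "religion": ["A", "A", "B", "B"],
    "income_bracket": ["<$10k", "$10-20k", "<$10k", "$10-20k"],
    "freq": [10, 30, 5, 15],
})

# transform("sum") returns the group total aligned to every original row.
totals = toy.groupby("religion")["freq"].transform("sum")
toy = toy.assign(prop=toy["freq"] / totals)

# Proportions within each religion should sum to 1.
print(toy.groupby("religion")["prop"].sum())
```

Using `transform` rather than `agg` keeps the result the same length as the input, so it can be assigned directly as a new column.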


## 6) Visualize (Single Religion Example)

In [None]:

# Choose one religion to make a simple bar chart
one = tidy.query("religion == 'Agnostic'").copy()
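# A stable sort keeps the original column order if two brackets share the same extracted number
one = one.sort_values("income_lb_approx", na_position="first", kind="mergesort")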

plt.figure(figsize=(8,4))
plt.bar(one["income_bracket"], one["freq_log"])
plt.xticks(rotation=45, ha="right")
plt.title("Log Frequency by Income Bracket — Agnostic")
plt.xlabel("Income Bracket")
plt.ylabel("log(1 + freq)")
plt.tight_layout()
plt.show()


## 7) Visualize (Multiple Religions)

In [None]:

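# Pivot for a grouped bar chart with matplotlib.
# Keep the brackets in their original column order; sort_index() would sort
# the string labels alphabetically rather than by income.
pivot = (tidy
         .pivot(index="income_bracket", columns="religion", values="freq_log")
         .reindex(tidy["income_bracket"].unique()))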

# Simple grouped bar chart
ax = pivot.plot(kind="bar", figsize=(10,5), rot=45)
ax.set_title("Log Frequency by Income Bracket and Religion")
ax.set_xlabel("Income Bracket")
ax.set_ylabel("log(1 + freq)")
plt.tight_layout()
plt.show()
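
Plain string labels sort alphabetically, which rarely matches income order. One way to get a durable natural order is an ordered `CategoricalDtype`. This is a sketch with hypothetical bracket labels and invented counts; in practice, list the real labels from your file in the order you want.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy tidy table; labels and counts are invented for illustration.
toy = pd.DataFrame({
    "religion": ["A", "A", "A", "B", "B", "B"],
    "income_bracket": ["<$10k", "$10-20k", "$100-150k"] * 2,
    "freq": [10, 20, 30, 5, 15, 25],
})

# Hypothetical labels, written out in ascending income order.
bracket_order = ["<$10k", "$10-20k", "$100-150k"]
bracket_type = CategoricalDtype(categories=bracket_order, ordered=True)

toy["income_bracket"] = toy["income_bracket"].astype(bracket_type)

# sort_index() now follows the category order instead of the alphabet.
pivot = (
    toy.pivot(index="income_bracket", columns="religion", values="freq")
       .sort_index()
)

print(pivot)
```

Because the order lives in the column's dtype, later sorts, group-bys, and `pivot.plot(kind="bar")` calls inherit it without further bookkeeping.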


## 8) Optional: Pipeline (Method Chaining)

In [None]:

clean = (
    pd.read_csv(path)
    .melt(id_vars="religion", var_name="income_bracket", value_name="freq")
    .dropna(subset=["freq"])
    .assign(
        freq_log=lambda d: np.log1p(d["freq"]),
        income_lb_approx=lambda d: d["income_bracket"].str.extract(r"(\d+)", expand=False).astype(float),
    )
)

clean.head(8)
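
A chained pipeline is easy to wrap in a function so it can be reused and tested. The sketch below uses a hypothetical function name, `tidy_pew`, and an in-memory CSV with invented counts standing in for `pew_religion.csv`; it covers only the melt, drop, and log steps.

```python
import io

import numpy as np
import pandas as pd


def tidy_pew(source):
    """Read a wide religion-by-income table and return it in tidy form."""
    return (
        pd.read_csv(source)
        .melt(id_vars="religion", var_name="income_bracket", value_name="freq")
        .dropna(subset=["freq"])
        .assign(freq_log=lambda d: np.log1p(d["freq"]))
    )


# In-memory CSV with invented counts and one empty cell.
csv_text = "religion,<$10k,$10-20k\nA,10,20\nB,,15\n"
result = tidy_pew(io.StringIO(csv_text))

print(result)
```

Because `pd.read_csv` accepts both paths and file-like objects, the same function can be called with `path` for the real file.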



## 9) Mini Exercises (if time permits)
1. **Imputation:** Replace any missing `freq` with 0 and compare plots.  
2. **Ordering:** Order brackets using `CategoricalDtype` for a natural income order.  
3. **Proportions:** Compute each religion’s *within-religion* proportions across income brackets.  
4. **Export:** Save your tidy dataset to CSV with `to_csv("tidy_output.csv", index=False)`.



## 10) Wrap-Up
- Tidy data = each variable in a column, each observation in a row, each type of observational unit in its own table
- Use `melt`, `pivot`, and `merge` to reshape and combine
- Handle missing values thoughtfully (drop vs impute)
- Transform/normalize to prepare for modeling
- Visualize with matplotlib (tidy → straightforward plots)
