<a href="https://colab.research.google.com/github/jeflel/CS49JGroup2/blob/main/EDA_skeleton.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA Notebook Skeleton
_Created: 2025-09-14_

This notebook is a starter template for your assignment:
- Keep your `.ipynb` in `notebooks/` in your repo.
- Put **instructions** and the **Data Card** in `data/README.md` (a template is provided).
- **Do not commit large data**; load it from a local upload, Drive, or a URL at runtime.


## 1) Setup
Import the standard libraries. Google Colab has all three preinstalled.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 100)


## 2) Load Data
Pick one approach below and adapt the path/URL/filename.
- **Upload a file** each session (small files only)
- **Load from a URL** (recommended if hosted)
- **Mount Drive** if you keep data there (not shown here)
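
Whichever option you pick, `pd.read_csv` accepts a path, a URL, or a file-like object. The sketch below uses an in-memory CSV as a stand-in for your real source, so you can try parsing options such as `parse_dates` before pointing at real data. The columns (`id`, `date`, `value`) and the name `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a real file or URL.
csv_text = """id,date,value
1,2025-01-01,10.5
2,2025-01-02,
3,2025-01-03,7.25
"""

# read_csv accepts any file-like object, so StringIO behaves like an opened file.
# parse_dates converts the named column to datetime while reading.
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample_df.shape)
print(sample_df.dtypes)
```

The empty field in the second row is read as `NaN`, which is what the missingness check later in the notebook will count.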

In [None]:
# OPTION A: Upload a local CSV (interactive)
# from google.colab import files
# uploaded = files.upload()  # choose a file
# import io
# df = pd.read_csv(io.BytesIO(next(iter(uploaded.values()))))  # read the first uploaded file

# OPTION B: Load from a URL (replace with your dataset URL)
# url = 'https://raw.githubusercontent.com/.../path/to/data.csv'
# df = pd.read_csv(url)

# After loading, ensure df exists:
# df.head()

## 3) Data Overview
Check the shape, the column dtypes, and the first few rows.
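
As a rough illustration of what an overview can include, here is a sketch on a small made-up frame. The name `toy` and its columns are assumptions that mirror the example columns used later in this notebook; swap in your own `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own df.
toy = pd.DataFrame({
    "score": [55.0, 72.5, np.nan, 90.0],
    "state": ["California", "New York", None, "California"],
    "age": [34, 61, 45, 29],
})

n_rows, n_cols = toy.shape
print(f"{n_rows} rows x {n_cols} columns")

toy.info()                           # dtypes, non-null counts, memory usage
print(toy.describe(include="all"))   # summary stats for numeric and non-numeric columns
```

`info()` is handy here because its non-null counts give an early hint about missing values before you compute percentages.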

In [None]:
# Make sure df is defined before running these:
# print('Shape:', df.shape)
# display(df.head())
# display(df.dtypes)


## 4) Missingness Snapshot
Find which columns contain NaNs and the approximate percentage missing in each.
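
The idea can be sketched on made-up data: `isna()` yields a True/False frame, and the column-wise mean of booleans is the fraction missing. The frame `toy` and the extra `n_missing` column are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some gaps.
toy = pd.DataFrame({
    "score": [55.0, 72.5, np.nan, 90.0],
    "state": ["California", None, None, "Texas"],
    "age": [34, 61, 45, 29],
})

# Fraction missing per column, as a percentage, largest first.
pct_missing = (toy.isna().mean() * 100).round(2).sort_values(ascending=False)

summary = pct_missing.to_frame("pct_missing")
summary["n_missing"] = toy.isna().sum()  # aligned to summary by column name
print(summary)
```

Reporting the raw count next to the percentage helps on small datasets, where a large percentage may be only a handful of rows.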

In [None]:
# Run only if df is defined
# miss = df.isna().mean().sort_values(ascending=False)
# (miss * 100).round(2).to_frame('pct_missing')

## 5) Data Dictionary (fill manually)
Create a short, human-readable description for columns you use.

| Column | Type | Units | Description |
|---|---|---|---|
| id | int | – | Unique row identifier |
| date | datetime | YYYY-MM-DD | Observation date |
| value | float | e.g., USD, kg | Measured value |

_Add/modify rows to match your dataset._

## 6) Transformations (pandas)
The following examples show each required transformation. **Edit column names** to match your dataset.
- Vectorized boolean mask → new boolean column
- `map` or `Series.apply` on a single column
- Optional `DataFrame.apply(axis=1)` for multi-column logic
- Categorical bucketing (e.g., low/med/high)
- Handle missing data (drop/fill/mark) with a brief justification in comments
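
To see how these steps fit together, here is a minimal end-to-end sketch on made-up data. The frame `toy`, its values, and the thresholds are assumptions for illustration. For the multi-column step it uses `np.where` with combined boolean masks, a vectorized alternative to `DataFrame.apply(axis=1)` rather than `apply` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the examples in this notebook.
toy = pd.DataFrame({
    "score": [55.0, 72.5, np.nan, 90.0],
    "state": [" california", "New York", None, "Texas"],
    "age": [64, 61, 45, 29],
})

# Missing data: mark first, then fill with the median (robust to outliers).
toy["score_missing"] = toy["score"].isna()
toy["score"] = toy["score"].fillna(toy["score"].median())

# Vectorized boolean mask -> new boolean column.
toy["passed"] = toy["score"] >= 70

# Series.apply to clean strings, then map to abbreviations.
# Values not in the mapping (and missing values) become NaN.
clean = toy["state"].apply(lambda s: s.strip().title() if isinstance(s, str) else s)
toy["state_abbrev"] = clean.map({"California": "CA", "New York": "NY"})

# Multi-column logic with combined masks instead of apply(axis=1).
toy["risk"] = np.where((toy["age"] >= 60) & (toy["score"] < 65), "high", "normal")

# Categorical bucketing with fixed bins.
toy["score_band"] = pd.cut(
    toy["score"], bins=[0, 60, 80, 100],
    labels=["low", "med", "high"], include_lowest=True,
)

print(toy)
```

Order matters in this sketch: the missing score is filled before the mask and the bands are computed, so the filled row gets a band and a pass/fail value instead of NaN or a silent False.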

In [None]:
# ====== EDIT BELOW TO MATCH YOUR COLUMN NAMES ======
# Example columns assumed here: 'score' (numeric), 'state' (string), 'age' (numeric)
# Uncomment the lines you need and adjust them.

# 1) Vectorized boolean mask → new column
# df['Pass'] = df['score'] >= 70  # True/False (a NaN score compares as False)

# 2) map or Series.apply on a SINGLE column
# Example: normalize state names to abbreviations via map
# state_map = {'California': 'CA', 'New York': 'NY'}
# df['state_abbrev'] = df['state'].map(state_map)  # values not in map become NaN

# Alternatively: clean strings with Series.apply (single-column)
# df['state_clean'] = df['state'].apply(lambda s: s.strip().title() if isinstance(s, str) else s)

# 3) OPTIONAL: DataFrame.apply(axis=1) for multi-column logic (justify in comments)
# def risk_row(row):
#     # Example logic needing multiple columns:
#     # Justification: threshold uses both age and score together.
#     return 'high' if (row.get('age', 0) >= 60 and row.get('score', 0) < 65) else 'normal'
# df['risk'] = df.apply(risk_row, axis=1)

# 4) Categorical bucketing (bands)
# Example: bucket 'score' into low/med/high using fixed bins (pd.qcut bins by quantiles instead)
# bins = [0, 60, 80, 100]
# labels = ['low', 'med', 'high']
# df['score_band'] = pd.cut(df['score'], bins=bins, labels=labels, include_lowest=True)

# 5) Missing-data handling (choose ONE and explain)
# Option A: Fill numeric with median (explain: robust to outliers)
# if 'age' in df.columns:
#     df['age'] = df['age'].fillna(df['age'].median())

# Option B: Drop rows missing 'score' (explain: score is critical target/feature)
# df = df.dropna(subset=['score'])

# Option C: Mark missingness with an indicator and fill
# if 'state' in df.columns:
#     df['state_missing'] = df['state'].isna()
#     df['state'] = df['state'].fillna('Unknown')

# ===================================================
# display(df.head())

## 7) Basic Exploration
A few summaries or plots (optional examples).
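
For a non-plot summary, category counts and a group-wise aggregate are often enough to spot imbalance or odd values. The sketch below runs on made-up data; `toy` and its columns are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric column.
toy = pd.DataFrame({
    "state": ["CA", "NY", "CA", "NY", "CA"],
    "score": [55.0, 72.0, 90.0, 68.0, 80.0],
})

counts = toy["state"].value_counts()  # rows per category
by_state = toy.groupby("state")["score"].agg(["mean", "min", "max"])

print(counts)
print(by_state)
```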

In [None]:
# Example histogram (choose a numeric column):
# if 'score' in df.columns:
#     plt.figure()
#     df['score'].hist()
#     plt.title('Score Distribution')
#     plt.xlabel('score')
#     plt.ylabel('count')
#     plt.show()


## 8) Observations / Notes
- Note any quirks: mixed types, inconsistent labels, outliers, etc.
- What transformation choices did you make and why?

---
### Reminder
- Keep this notebook in `notebooks/` in your repo.
- Fill out `data/README.md` using the provided template.