# Intro to Data Science Bootcamp Capstone
## Choose a project you care about, then use data to answer something real

Welcome to the capstone. This is your chance to pick a topic that genuinely interests you and build a small, end‑to‑end data analytics project. “Analytics” can mean exploring and explaining patterns, forecasting, building a model, or creating a decision-support tool. The most important requirement is that you use data to answer a clear question and communicate your process and results well.

This notebook is both an introduction and a starter template. You will replace the prompts with your own content as you go.


## What you’re building
By the end, you will deliver a short “portfolio-style” project with three parts: a well-defined question, a reproducible analysis (this notebook), and a clear explanation of what you found.

You do not need a perfect model. You do need a coherent story: what you wanted to learn, what data you used, what you tried, what worked, what didn’t, and what you recommend next.


## What counts as a capstone topic
Your topic must be data analytics related and involve a dataset with enough rows/records to support analysis. You can pick something descriptive (what is happening), diagnostic (why is it happening), predictive (what will happen), or prescriptive (what should we do).

Because we discussed a lot of machine learning, you may include ML if it helps answer your question. That said, strong projects often start with solid exploration and measurement before any modeling.


## Suggested timeline
You have one week for this project, so the focus is on making clear, intentional choices rather than building something huge. Spend the start of the week picking a topic you care about, finding a usable dataset, and clearly defining your question. Use the middle of the week to clean the data, do exploratory analysis, and establish a simple baseline approach. Use the end of the week to iterate where it makes sense, evaluate your results, and polish your narrative and visuals.


## Deliverables
You will submit (1) this notebook with outputs saved, (2) a short written summary (one to two pages) or slide deck, and (3) your data source links and citation notes.

Your notebook should run top-to-bottom without manual edits beyond setting a data path or API key (if used). If a dataset is too large to include, provide instructions for how to obtain it.
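One way to keep the data path and API key out of the notebook body is to read them from environment variables. The sketch below is only an illustration: the helper `get_setting` and the variable names `CAPSTONE_DATA_PATH` and `CAPSTONE_API_KEY` are hypothetical, so rename them to suit your project.

```python
import os
from pathlib import Path
from typing import Optional


def get_setting(name: str, default: Optional[str] = None) -> str:
    """Read a setting from an environment variable, falling back to a default."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value


# Hypothetical variable names; the default matches the placeholder path used in this template.
DATA_PATH = Path(get_setting("CAPSTONE_DATA_PATH", "data/your_dataset.csv"))

# Uncomment only if your project calls an API. Never paste the key itself into the notebook.
# API_KEY = get_setting("CAPSTONE_API_KEY")
```

With this pattern, a reviewer only needs to set one or two variables before running the notebook top-to-bottom.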


## How you’ll be evaluated
A strong project is not “the fanciest model.” It is the clearest reasoning. The rubric below shows what matters.

| Category | What “strong” looks like |
|---|---|
| Question and scope | A specific, answerable question; scope fits the timeframe |
| Data | Data source is appropriate; data issues are acknowledged; columns are explained |
| Methods | Choices match the question; baseline approach included; iterations are justified |
| Evaluation | Clear metrics or validation; limitations discussed honestly |
| Communication | Visuals and narrative tell a coherent story; reader can follow |
| Reproducibility | Notebook runs; steps are documented; random seeds set when relevant |


## Project idea generator
If you’re stuck, start with (1) a domain you care about, (2) an outcome you want to understand, and (3) a decision you want to support. Then translate that into a question the data can answer.

Examples of domains include sports, music, movies, health (non-medical advice), fitness, finance (non-investment advice), retail, transportation, climate, education, gaming, social media, or your workplace (only if you have permission and can anonymize data).

Below are idea prompts you can adapt. Each is written as a question plus a typical approach.


### Idea bank (pick one and customize it)
| Theme | Example question | Typical approach |
|---|---|---|
| Consumer behavior | What factors predict repeat purchases or churn? | Cohort analysis, survival curves, logistic regression, tree models |
| Pricing | How do price changes affect demand? | A/B-style comparisons, time series, elasticity estimates |
| Sports analytics | What predicts wins or player performance? | Feature engineering, regression/classification, calibration |
| Music/streaming | What drives playlist adds or skips? | EDA + classification, imbalance handling |
| Job market | Which skills predict higher salaries in job postings? | NLP on descriptions, regression, explainability |
| Transportation | What predicts late arrivals? | Time features, weather join, classification, error analysis |
| Energy/climate | Can we forecast consumption or emissions? | Time series features, forecasting baselines, model comparison |
| Public safety | Where and when do incidents cluster? | Mapping, clustering, hotspot analysis, seasonality |
| Health behavior | What predicts adherence to a routine? | Segmentation, classification, interpretability |
| Education | What predicts course completion? | Missingness analysis, fairness checks, classification |


## Analytics-first mindset (even if you use ML)
A simple way to keep your project grounded is to answer these questions in order. First, what does the data look like, and what are the main patterns? Next, what changes over time, across groups, or across locations? Then, which features seem associated with your outcome? Only after answering these should you decide whether ML is necessary.

If you do include machine learning, start with a baseline (for example a simple linear/logistic regression) before trying more complex models. Your report should explain why the complex model is worth it.
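The "across groups" and "over time" questions above often need nothing more than a grouped average. Here is a minimal sketch on made-up data; the column names `date`, `region`, and `late` are invented for illustration.

```python
import pandas as pd

# Toy data: one row per trip, with a 0/1 outcome column called "late"
toy = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]),
    "region": ["north", "south", "north", "south"],
    "late": [0, 1, 1, 1],
})

# Across groups: share of late trips per region
by_region = toy.groupby("region")["late"].mean()

# Over time: share of late trips per calendar month
by_month = toy.groupby(toy["date"].dt.to_period("M"))["late"].mean()

print(by_region)
print(by_month)
```

If tables like these already answer your question, a model may not add much.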


# Part 1. Project proposal (fill this in)
Write in complete sentences. Being specific here will save you time later.

## 1. Project title
Choose a short title that suggests your question.

## 2. Motivation
What do you care about here? Who would use the results, or what decision could they inform?

## 3. Research question
State one primary question and optionally one secondary question.

## 4. Data source
Where does the data come from? Include a link and a brief description of how it was collected.

## 5. Success criteria
How will you know your project worked? This can be a metric, a useful insight, or a decision-support visualization.


# Part 2. Setup
This section helps keep your work reproducible. Run these cells first.


In [None]:
# Core imports (add/remove as needed)
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# Optional: uncomment if you use these
# import seaborn as sns
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import mean_absolute_error, accuracy_score, roc_auc_score

np.random.seed(42)
pd.set_option("display.max_columns", 200)


## Load your data
Replace the example below with your dataset. Keep the raw data read step simple, then do cleaning in the next section.


In [None]:
# Example: CSV
# data_path = "data/your_dataset.csv"
# df_raw = pd.read_csv(data_path)

# Example: Parquet
# df_raw = pd.read_parquet("data/your_dataset.parquet")

# If you are using an API, you can still cache the result to a file for reproducibility.

df_raw = None  # replace this
df_raw
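
If your data comes from an API, the caching idea mentioned in the cell above could look roughly like this. It is a sketch under assumptions: `load_with_cache` and `fetch_example` are hypothetical helpers, and `fetch_example` stands in for whatever API call you actually make.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_with_cache(cache_path: str, fetch: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Return the cached CSV if it exists; otherwise call fetch() and save the result."""
    path = Path(cache_path)
    if path.exists():
        return pd.read_csv(path)
    df = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df


def fetch_example() -> pd.DataFrame:
    # Hypothetical stand-in for a real API call
    return pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 12.5, 9.0]})


# df_raw = load_with_cache("data/cache/example.csv", fetch_example)
```

Keep in mind that a CSV round trip drops some type information (dates come back as strings), so re-parse those columns after loading from the cache.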


# Part 3. Data audit
Before cleaning, get a quick, honest read on what you have: size, columns, missing values, duplicates, and obvious data quality issues. Your write-up should mention any limitations you discover here.


In [None]:
def data_audit(df: pd.DataFrame, n_unique_preview: int = 8) -> pd.DataFrame:
    """Summarize each column: dtype, missingness, cardinality, and sample values."""
    summary = []
    for col in df.columns:
        s = df[col]
        summary.append({
            "column": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "pct_missing": float(s.isna().mean()),  # fraction between 0 and 1, not 0-100
            "n_unique": int(s.nunique(dropna=True)),
            "example_values": ", ".join(map(str, s.dropna().unique()[:n_unique_preview])),
        })
    # Most-missing columns first; ties broken by fewest unique values
    out = pd.DataFrame(summary).sort_values(["pct_missing", "n_unique"], ascending=[False, True])
    return out

# Run after df_raw is loaded
# audit = data_audit(df_raw)
# audit.head(20)
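
The per-column audit does not cover dataset-level questions such as size and duplicate rows. A small companion check might look like the sketch below; `dataset_overview` is a hypothetical helper and the toy dataframe is made up.

```python
import pandas as pd


def dataset_overview(df: pd.DataFrame) -> dict:
    """Dataset-level checks: size, exact duplicate rows, and uninformative columns."""
    n_rows, n_cols = df.shape
    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        # Rows that exactly repeat an earlier row
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Columns with a single value (missing counts as a value here)
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
        "all_missing_columns": [c for c in df.columns if df[c].isna().all()],
    }


toy = pd.DataFrame({
    "city": ["A", "B", "B", "C"],
    "sales": [10, 20, 20, None],
    "country": ["US", "US", "US", "US"],
})
overview = dataset_overview(toy)
print(overview)
```

Exact duplicates are not always errors (two identical purchases can be real), so check what a row represents before dropping anything.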


# Part 4. Cleaning and feature prep
Create a clean working dataframe called `df`. Keep your cleaning decisions transparent. If you drop rows, explain why. If you impute missing values, explain how.


In [None]:
# df = df_raw.copy()

# Typical steps (use what applies)
# 1) Standardize column names
# df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# 2) Parse dates
# df["date"] = pd.to_datetime(df["date"], errors="coerce")

# 3) Handle duplicates
# df = df.drop_duplicates()

# 4) Basic missingness handling
# df = df.dropna(subset=["target_column"])  # example

# 5) Create features
# df["day_of_week"] = df["date"].dt.day_name()

df = None  # replace this
df
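
If you impute, one transparent option is to fill with the median and keep a flag column that records which values were filled. This is a sketch of that one approach, not a recommendation for every dataset; `impute_median_with_flag` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_median_with_flag(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values in a numeric column with its median and record which rows were filled."""
    out = df.copy()
    out[f"{column}_was_missing"] = out[column].isna()
    out[column] = out[column].fillna(out[column].median())
    return out


toy = pd.DataFrame({"age": [20.0, np.nan, 40.0, 30.0]})
clean = impute_median_with_flag(toy, "age")

# A one-line log like this is easy to copy into your write-up
n_filled = int(clean["age_was_missing"].sum())
print(f"Imputed {n_filled} of {len(clean)} values in 'age' with the median")
```

If you later build a model, compute the median on the training split only and reuse it for the test split, so that test data does not influence your preprocessing.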


# Part 5. Exploratory data analysis (EDA)
Your goal here is to learn what’s typical, what’s weird, and what relationships might matter. EDA is also where you catch data leakage and target definition problems early.


In [None]:
# Examples you can adapt

# 1) Basic shape
# df.shape

# 2) Quick numeric summary
# df.describe(include="number").T

# 3) Quick categorical summary
# df.describe(include="object").T

# 4) Simple plots (start with one variable at a time)
# df["some_numeric_col"].hist(bins=30)
# plt.title("Distribution of ...")
# plt.show()

# 5) Relationship to target (example)
# df.plot.scatter(x="feature", y="target")
# plt.show()
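
One rough screen for leakage is to look for features that track the target almost perfectly. The sketch below assumes a numeric (for example 0/1) target; `flag_suspicious_features`, the 0.95 threshold, and the toy columns are all illustrative choices, not rules.

```python
import pandas as pd


def flag_suspicious_features(df: pd.DataFrame, target: str, threshold: float = 0.95) -> pd.Series:
    """Return numeric features whose absolute correlation with the target meets the threshold."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=[target]).corrwith(numeric[target]).abs()
    return corr[corr >= threshold].sort_values(ascending=False)


toy = pd.DataFrame({
    "churned": [0, 1, 0, 1, 1, 0],
    "refund_issued": [0, 1, 0, 1, 1, 0],  # recorded after churn, so it mirrors the target
    "visits": [3, 5, 2, 4, 6, 1],
})
flagged = flag_suspicious_features(toy, target="churned")
print(flagged)
```

A flagged feature is not automatically leakage, and leakage does not always show up as a high correlation. Treat this as a prompt to ask when each column would actually be known.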


# Part 6. Baseline approach
Pick a baseline method that matches your question. If your project is descriptive, your baseline might be a clear set of summary stats and a simple dashboard-style figure. If you’re predicting a numeric value, a baseline can be “predict the mean” or a linear regression. If you’re predicting a class, a baseline can be “predict the majority class” or logistic regression.

Write down what your baseline is and why it is a fair starting point.
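The two naive baselines described above take only a few lines of NumPy. The sketch below uses made-up numbers; the function names are hypothetical.

```python
import numpy as np


def mean_baseline_mae(y_train: np.ndarray, y_test: np.ndarray) -> float:
    """Mean absolute error when every prediction is the training mean."""
    prediction = y_train.mean()
    return float(np.abs(y_test - prediction).mean())


def majority_baseline_accuracy(y_train: np.ndarray, y_test: np.ndarray):
    """Accuracy when every prediction is the most common training class."""
    values, counts = np.unique(y_train, return_counts=True)
    majority = values[counts.argmax()]
    return majority, float((y_test == majority).mean())


# Regression toy example: the training mean is 14.0
mae = mean_baseline_mae(np.array([10.0, 12.0, 14.0, 20.0]), np.array([11.0, 19.0]))

# Classification toy example: "no" is the majority class in training
majority, accuracy = majority_baseline_accuracy(
    np.array(["no", "no", "yes", "no"]), np.array(["no", "yes", "no"])
)
print(mae, majority, accuracy)
```

Any model you try later should beat these numbers on the same test data; if it does not, that is a finding worth reporting.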


In [None]:
# Put your baseline here.
# If you model, define:
# X, y, train/test split, a simple model, and a metric.

# Example skeleton (classification):
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import roc_auc_score
#
# y = df["target"]
# X = df.drop(columns=["target"])  # must be numeric with no missing values; encode and impute first
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
#
# model = LogisticRegression(max_iter=200)
# model.fit(X_train, y_train)
# pred_proba = model.predict_proba(X_test)[:, 1]
# print("ROC AUC:", roc_auc_score(y_test, pred_proba))


# Part 7. Iteration and model comparison (optional but encouraged)
If you include ML, compare at least two approaches and explain the tradeoffs. Use error analysis. Look at where the model fails. Consider whether features are causing leakage. Keep an eye on fairness issues if your data contains sensitive attributes or proxies.

Even if you do not do ML, you can still iterate: try alternative groupings, alternative visualizations, or alternative definitions of your outcome.
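For error analysis, a simple starting point is to compare error rates across slices of the data. The sketch below assumes you have a dataframe of test-set results with actual and predicted labels; `error_rate_by_group` and the column names are hypothetical.

```python
import pandas as pd


def error_rate_by_group(
    results: pd.DataFrame,
    group_col: str,
    actual_col: str = "actual",
    predicted_col: str = "predicted",
) -> pd.DataFrame:
    """Count rows and compute the share of wrong predictions within each group."""
    is_error = results[actual_col] != results[predicted_col]
    return (
        results.assign(is_error=is_error)
        .groupby(group_col)["is_error"]
        .agg(n="size", error_rate="mean")
        .sort_values("error_rate", ascending=False)
    )


toy = pd.DataFrame({
    "day_type": ["weekday", "weekday", "weekday", "weekday", "weekend", "weekend"],
    "actual": [1, 0, 1, 0, 1, 0],
    "predicted": [1, 0, 0, 0, 1, 1],
})
by_day_type = error_rate_by_group(toy, "day_type")
print(by_day_type)
```

Always read the error rate together with `n`: a high rate in a group with only a handful of rows may just be noise.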


# Part 8. Evaluation, limitations, and ethics
Every project should include a clear evaluation. For non-ML projects, evaluation can be robustness checks, sensitivity analysis, or triangulation across different slices of the data. For ML projects, use appropriate metrics and a holdout set or cross-validation.

Then list limitations honestly. Mention data coverage gaps, measurement issues, possible confounders, and how your approach might break in the real world.

Finally, include a short ethics note. Identify any risks of harm, privacy concerns, or misuse, and how you mitigated them.
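As one example of a robustness check for a non-ML finding, you can bootstrap a difference between two group averages to see how stable it is. This is a minimal sketch of a percentile bootstrap on made-up numbers; `bootstrap_mean_diff` is a hypothetical helper, and 2,000 resamples is an arbitrary choice.

```python
import numpy as np


def bootstrap_mean_diff(a, b, n_boot: int = 2000, seed: int = 42):
    """Percentile bootstrap interval (95%) for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resampled_a = rng.choice(a, size=len(a), replace=True)
        resampled_b = rng.choice(b, size=len(b), replace=True)
        diffs[i] = resampled_a.mean() - resampled_b.mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    return float(a.mean() - b.mean()), float(low), float(high)


group_a = [10, 11, 12, 13, 14]
group_b = [1, 2, 3, 4, 5]
point, low, high = bootstrap_mean_diff(group_a, group_b)
print(f"Difference in means: {point:.1f} (95% interval {low:.1f} to {high:.1f})")
```

If the interval comfortably excludes zero, the difference is unlikely to be a sampling fluke. It can still reflect confounding, so this check belongs alongside your limitations rather than replacing them.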


# Part 9. Final story: what you found and what you recommend
End with a short, readable conclusion. Imagine the audience is a smart manager or stakeholder who does not want to read code. Your last section should answer the question, show one or two key visuals, and propose next steps.

A helpful structure is: what you asked, what data you used, what you found, what you recommend, and what you would do next with more time.


## Submission checklist
Use this as a final pass. You can copy this into your written summary.

Your notebook runs top-to-bottom. Your question is stated clearly near the top. Your data source is cited. Your cleaning steps are explained. Your EDA includes at least two useful figures. Your results are summarized in plain language. Your limitations and ethics note are included.


# Appendix: Where to find datasets
If you do not already have data, these are reliable starting points. Choose one that fits your interest area and time constraints. When using public datasets, be sure you understand what each row represents and how the data was collected.

Common sources include Kaggle datasets, data.gov (US), World Bank open data, city open data portals, NOAA climate data, Google Dataset Search, and sports/reference-style public stats sites. If you use web-scraped data, confirm that scraping is permitted under the site's terms of service, and keep your request rate low.
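If you do scrape, one basic courtesy check is whether the site's `robots.txt` allows the pages you want. The sketch below uses Python's standard `urllib.robotparser` on an example rules file written inline, so it makes no network requests; the user agent name `capstone-bot` and the URLs are made up.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rules object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Example rules; a real site publishes its own at /robots.txt
EXAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(EXAMPLE_ROBOTS)
stats_allowed = rules.can_fetch("capstone-bot", "https://example.com/stats/2023")
private_allowed = rules.can_fetch("capstone-bot", "https://example.com/private/data")
print(stats_allowed, private_allowed)
```

Passing this check does not replace reading the terms of service, and you should still pause between requests (for example with `time.sleep`) so you do not overload the site.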
