
# BRFSS Downloader — Per‑Year Pull & Parse (1990–2023)

This notebook downloads **BRFSS annual survey ZIPs** from CDC for 1990–2023, extracts the **SAS XPT** file in each ZIP, parses it with `pyreadstat`, and saves **per‑year CSV + Parquet**.

- Output ZIPs: `data/raw/brfss_zips/`
- Per‑year tables: `data/raw/brfss_year/` -> `brfss_<year>.csv` and (if pyarrow installed) `brfss_<year>.parquet`  
- Resume‑safe: already-downloaded/parsed years are skipped
- Failures are logged to `data/raw/brfss_year/_brfss_failures.json` so you can retry specific years

> Tip: Run the **smoke test** cell (single year) before launching decade blocks.



## Prereqs

- Python 3.10+ recommended
- Install dependencies (see `requirements.txt`):
  ```bash
  pip install -r requirements.txt
  ```

- Run this notebook from the project root (so `data/` folders are created as shown).


In [1]:

# --- imports & folders ---
from pathlib import Path
import io, zipfile, json
import requests
import pandas as pd
import pyreadstat
from tqdm import tqdm

# output structure
ZIPS_DIR = Path("data/raw/brfss_zips")
YEAR_DIR = Path("data/raw/brfss_year")
ZIPS_DIR.mkdir(parents=True, exist_ok=True)
YEAR_DIR.mkdir(parents=True, exist_ok=True)

print("ZIPs dir:", ZIPS_DIR.resolve())
print("Year dir:", YEAR_DIR.resolve())


ZIPs dir: /Users/umichmads/brfss_project/data/raw/brfss_zips
Year dir: /Users/umichmads/brfss_project/data/raw/brfss_year



## Master URL Map (1990–2023)

This is the set that worked in quick tests. You can add 1989 later if CDC publishes a public ZIP.


In [2]:

# master URL dict (1990–2023)
BRFSS_URLS = {
    1990: "https://www.cdc.gov/brfss/annual_data/1990/files/CDBRFS90XPT.zip",
    1991: "https://www.cdc.gov/brfss/annual_data/1991/files/CDBRFS91XPT.zip",
    1992: "https://www.cdc.gov/brfss/annual_data/1992/files/CDBRFS92XPT.zip",
    1993: "https://www.cdc.gov/brfss/annual_data/1993/files/CDBRFS93XPT.zip",
    1994: "https://www.cdc.gov/brfss/annual_data/1994/files/CDBRFS94XPT.zip",
    1995: "https://www.cdc.gov/brfss/annual_data/1995/files/CDBRFS95XPT.zip",
    1996: "https://www.cdc.gov/brfss/annual_data/1996/files/CDBRFS96XPT.zip",
    1997: "https://www.cdc.gov/brfss/annual_data/1997/files/CDBRFS97XPT.zip",
    1998: "https://www.cdc.gov/brfss/annual_data/1998/files/CDBRFS98XPT.zip",
    1999: "https://www.cdc.gov/brfss/annual_data/1999/files/CDBRFS99XPT.zip",
    2000: "https://www.cdc.gov/brfss/annual_data/2000/files/CDBRFS00XPT.ZIP",
    2001: "https://www.cdc.gov/brfss/annual_data/2001/files/CDBRFS01XPT.zip",
    2002: "https://www.cdc.gov/brfss/annual_data/2002/files/CDBRFS02XPT.ZIP",
    2003: "https://www.cdc.gov/brfss/annual_data/2003/files/CDBRFS03XPT.ZIP",
    2004: "https://www.cdc.gov/brfss/annual_data/2004/files/CDBRFS04XPT.zip",
    2005: "https://www.cdc.gov/brfss/annual_data/2005/files/CDBRFS05XPT.zip",
    2006: "https://www.cdc.gov/brfss/annual_data/2006/files/CDBRFS06XPT.ZIP",
    2007: "https://www.cdc.gov/brfss/annual_data/2007/files/CDBRFS07XPT.ZIP",
    2008: "https://www.cdc.gov/brfss/annual_data/2008/files/CDBRFS08XPT.ZIP",
    2009: "https://www.cdc.gov/brfss/annual_data/2009/files/CDBRFS09XPT.ZIP",
    2010: "https://www.cdc.gov/brfss/annual_data/2010/files/CDBRFS10XPT.zip",
    2011: "https://www.cdc.gov/brfss/annual_data/2011/files/LLCP2011XPT.ZIP",
    2012: "https://www.cdc.gov/brfss/annual_data/2012/files/LLCP2012XPT.ZIP",
    2013: "https://www.cdc.gov/brfss/annual_data/2013/files/LLCP2013XPT.ZIP",
    2014: "https://www.cdc.gov/brfss/annual_data/2014/files/LLCP2014XPT.ZIP",
    2015: "https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip",
    2016: "https://www.cdc.gov/brfss/annual_data/2016/files/LLCP2016XPT.zip",
    2017: "https://www.cdc.gov/brfss/annual_data/2017/files/LLCP2017XPT.zip",
    2018: "https://www.cdc.gov/brfss/annual_data/2018/files/LLCP2018XPT.zip",
    2019: "https://www.cdc.gov/brfss/annual_data/2019/files/LLCP2019XPT.zip",
    2020: "https://www.cdc.gov/brfss/annual_data/2020/files/LLCP2020XPT.zip",
    2021: "https://www.cdc.gov/brfss/annual_data/2021/files/LLCP2021XPT.zip",
    2022: "https://www.cdc.gov/brfss/annual_data/2022/files/LLCP2022XPT.zip",
    2023: "https://www.cdc.gov/brfss/annual_data/2023/files/LLCP2023XPT.zip",
}
sorted(BRFSS_URLS)[:5], sorted(BRFSS_URLS)[-5:]


([1990, 1991, 1992, 1993, 1994], [2019, 2020, 2021, 2022, 2023])

## Helpers (download ZIP, extract XPT, parse, normalize, save)

In [3]:

def download_zip(year: int) -> Path | None:
    """Download the ZIP for a year to ZIPS_DIR; return Path or None on failure."""
    url = BRFSS_URLS.get(year)
    if not url:
        print(f"{year}: no URL configured")
        return None
    out_path = ZIPS_DIR / Path(url).name
    if out_path.exists():
        return out_path
    try:
        r = requests.get(url, timeout=180)
        if r.status_code == 200:
            out_path.write_bytes(r.content)
            return out_path
        print(f"{year}: HTTP {r.status_code}")
        return None
    except Exception as e:
        print(f"{year}: download error -> {e}")
        return None

def extract_xpt_bytes(zip_path: Path) -> bytes | None:
    """Return the first *.xpt file bytes from a ZIP."""
    try:
        with zipfile.ZipFile(zip_path, "r") as zf:
            names = [n for n in zf.namelist() if n.lower().endswith(".xpt")]
            if not names:
                return None
            with zf.open(names[0]) as f:
                return f.read()
    except Exception as e:
        print("zip error:", e)
        return None

def normalize(df: pd.DataFrame, year: int) -> pd.DataFrame:
    """Minimal normalization: lowercase columns + add 'year' and optional 'fips' if _STATE/_CNTY exist."""
    df = df.copy()
    df.columns = [c.strip().lower() for c in df.columns]
    df["year"] = year
    if "_state" in df.columns:
        s = pd.to_numeric(df["_state"], errors="coerce").astype("Int64").astype("string").str.zfill(2)
        df["state_fips"] = s
    if "_cnty" in df.columns:
        c = pd.to_numeric(df["_cnty"], errors="coerce").astype("Int64").astype("string").str.zfill(3)
        df["fips"] = (df.get("state_fips", "") + c).where(df.get("state_fips").notna(), None)
    return df

def parse_and_save(year: int) -> dict:
    """Parse one year and save CSV + Parquet (if available). Returns a summary dict."""
    csv_path = YEAR_DIR / f"brfss_{year}.csv"
    pq_path  = YEAR_DIR / f"brfss_{year}.parquet"

    if csv_path.exists():  # resume-safe: treat as done if CSV exists
        return {"year": year, "status": "skipped_existing", "rows": None, "cols": None}

    zf = download_zip(year)
    if not zf:
        return {"year": year, "status": "download_failed", "rows": None, "cols": None}

    xpt_bytes = extract_xpt_bytes(zf)
    if not xpt_bytes:
        return {"year": year, "status": "no_xpt_in_zip", "rows": None, "cols": None}

    try:
        df, _ = pyreadstat.read_xport(io.BytesIO(xpt_bytes), apply_value_formats=False, formats_as_category=False)
        df = normalize(df, year)
        df.to_csv(csv_path, index=False)
        try:
            df.to_parquet(pq_path, index=False)
        except Exception:
            pass
        return {"year": year, "status": "ok", "rows": int(len(df)), "cols": int(df.shape[1])}
    except Exception as e:
        return {"year": year, "status": f"parse_error: {e}", "rows": None, "cols": None}


## Smoke test — run one year first

In [4]:

test_year = 2019
print(f"Processing {test_year}…")
summary = parse_and_save(test_year)
summary


Processing 2019…


{'year': 2019, 'status': 'no_xpt_in_zip', 'rows': None, 'cols': None}


## Batch runs (decade blocks)

Uncomment and run a block when you're ready. Each block resumes and skips already‑processed years.


In [5]:

# 1990–1999
results_90s = [parse_and_save(y) for y in tqdm(range(1990, 2000))]
results_90s


100%|███████████████████████████████████████████| 10/10 [00:00<00:00, 11.71it/s]


[{'year': 1990,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1991,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1992,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1993,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1994,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1995,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'rows': None,
  'cols': None},
 {'year': 1996,
  'status': "parse_error: read_xport() got an unexpected keyword argument 'apply_value_formats'",
  'r

In [None]:

# # 2000–2009
# results_00s = [parse_and_save(y) for y in tqdm(range(2000, 2010))]
# results_00s


In [None]:

# # 2010–2019
# results_10s = [parse_and_save(y) for y in tqdm(range(2010, 2020))]
# results_10s


In [None]:

# # 2020–2023
# results_20s = [parse_and_save(y) for y in tqdm(range(2020, 2024))]
# results_20s


## Inventory what we have so far

In [None]:

from glob import glob
import os

csvs = sorted(glob(str(YEAR_DIR / "brfss_*.csv")))
print("CSV years:", [Path(p).stem.split("_")[1] for p in csvs][:10], "... total", len(csvs))
sizes = {Path(p).name: round(os.path.getsize(p)/(1024*1024),2) for p in csvs}
list(sizes.items())[:5]
