# 0. Introduction

In this notebook, we will use the 2025 Individual Market data from the [HIX Compare+ Dataset](https://hix-compare.org/individual-markets.html) and preprocess it for the purpose of our database project. The raw CSV file can be found under [data/plans_raw.csv](https://github.com/nashjafri/carefox_aca_healthcare_database/blob/main/data/plans_raw.csv) in this GitHub repository.

The raw dataset contains extensive plan-level information on ACA-compliant individual health insurance offerings for the 2025 coverage year, with over 700 columns related to benefits, deductibles, premiums, provider networks, and plan characteristics. However, the original file is highly detailed and designed for flexible research use, not immediate database integration or application development.

In this notebook, we will systematically clean, filter, and restructure the data to make it more usable for a healthcare plan comparison and analytics application. This involves reducing dimensionality by selecting only the most relevant fields, standardizing field names, categorizing plan features, and ensuring consistency in benefit and value fields. The final processed dataset will serve as the core data foundation for the CareFox platform.

## Reference

The data structure, field definitions, and dataset characteristics are based on:

> **HIX Compare Dataset (2014–2025) [[https://hix-compare.org]](https://hix-compare.org)**  
> Created by the Robert Wood Johnson Foundation (RWJF) and maintained in partnership with Ideon.  
> *HIX Compare+ Dataset Documentation, Version October 28, 2024.*  
> Questions can be directed to [HIXsupport@ideonapi.com](mailto:HIXsupport@ideonapi.com).

---

# 1. Data Overview

## 1.1 Source
This project uses data from the [HIX Compare+ Dataset](https://hix-compare.org/individual-markets.html), created by the Robert Wood Johnson Foundation and maintained in partnership with Ideon. It is a comprehensive public dataset detailing ACA-compliant health insurance plans in the United States.

For the year 2025, the dataset covers individual market plans available both on and off the state and federal exchanges (Healthcare.gov), across all 50 states and Washington, D.C.

Plans include fully insured on-marketplace, off-marketplace, and small group plans, with rich information on benefits, premiums, cost-sharing, network structures, and plan metadata.

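Supplementary references for the ZIP/county crosswalk and the state age curves loaded later in this notebook:

- ZIP code to county FIPS crosswalk (Kaggle): https://www.kaggle.com/datasets/danofer/zipcodes-county-fips-crosswalk/discussion/244926
- State-specific age curve variations (CMS): https://www.cms.gov/CCIIO/Programs-and-Initiatives/Health-Insurance-Market-Reforms/Downloads/StateSpecAgeCrv053117.pdf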

## 1.2 Dataset Structure and Contents

The original individual plan dataset is a raw CSV file ([data/plans_raw.csv](https://github.com/nashjafri/carefox_aca_healthcare_database/blob/main/data/plans_raw.csv)) containing **723 columns**, organized into three major types of fields:

### 1.2.1. Benefits Fields

These fields describe the **cost-sharing structures** (copay and coinsurance) associated with a range of healthcare services.  
Each *benefit* (e.g., primary care, emergency room, specialist visit, hospitalization) includes multiple subfields:

- **In-Network Copay** (tiered if applicable)
- **In-Network Coinsurance** (tiered if applicable)
- **Out-of-Network Copay and Coinsurance**
- **Complexity and Limitation Flags**

**Examples of Benefits:**
- Ambulance (AB)
- Emergency Room (ER)
- Inpatient Hospital Facility (IP)
- Mental Health Services (IN, OM)
- Primary Care Physician (PC)
- Specialist Visit (SP)
- Prescription Drugs (GD, PD, ND, SD)

Each benefit is typically captured across **14 associated columns**:
- e.g., `SP_CopayInnTier1`, `SP_CopayInnTier1A`, `SP_CopayOutofNetA`, `SP_CoinsInnTier1`, etc.

**Benefit fields reflect:**  
- Whether coverage is tiered (Tier 1, Tier 2)
- Whether copays/coinsurance apply before/after deductible
- Whether a benefit is limited in number of visits/services
- Complex rules like "first 5 visits free, then coinsurance applies"
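
Because every benefit shares a two-letter prefix, the columns for one benefit can be pulled out by name. The following is a minimal sketch on a made-up miniature frame: only the column names mirror the examples above, the values are invented, and `benefit_columns` is a hypothetical helper rather than part of the dataset tooling.

```python
import pandas as pd

# Hypothetical miniature frame; values are invented for illustration only.
plans = pd.DataFrame({
    "PLANID": ["A", "B"],
    "SP_CopayInnTier1": [40.0, 60.0],
    "SP_CopayInnTier1A": [1, 0],
    "SP_CoinsInnTier1": [0.0, 0.2],
    "PC_CopayInnTier1": [25.0, 30.0],
})


def benefit_columns(df: pd.DataFrame, code: str) -> list:
    """Return the column names that start with '<code>_', in frame order."""
    prefix = f"{code}_"
    return [col for col in df.columns if col.startswith(prefix)]


sp_cols = benefit_columns(plans, "SP")
print(sp_cols)

# Keep the plan identifier alongside the specialist-visit fields.
specialist = plans[["PLANID"] + sp_cols]
print(specialist)
```

The same call with a different code (for example `"ER"` or `"PC"`) would select that benefit's block of columns on the full file.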

### 1.2.2. Value Fields

These fields describe **deductibles** and **maximum out-of-pocket (MOOP)** expenses.

The data distinguishes between:
- **Medical vs. Drug** coverage
- **In-Network vs. Out-of-Network** coverage
- **Individual vs. Family** deductibles
- **Integrated vs. Separate** Medical/Drug deductibles

**Examples of Value Fields:**
- `MEHBDedInnIndividual`: Medical In-Network Individual Deductible
- `DEHBDedInnFamily`: Drug In-Network Family Deductible
- `TEHBInnFamilyMOOP`: Integrated In-Network Family Maximum Out-of-Pocket

Value fields can have multiple tiers, allowing for richer plan designs with multiple cost-sharing levels based on provider tiers.
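
The naming pattern in the examples above can be decoded mechanically. The sketch below assumes that the four-character prefixes `MEHB`, `DEHB`, and `TEHB` always mean medical, drug, and integrated coverage, and that `Inn` / `OutOfNet` mark the network side. Both rules are inferred from the listed field names rather than taken from the official documentation, and `describe_value_field` is a hypothetical helper.

```python
# Prefix meanings inferred from the example field names (an assumption).
COVERAGE_BY_PREFIX = {"MEHB": "Medical", "DEHB": "Drug", "TEHB": "Integrated"}


def describe_value_field(field: str) -> dict:
    """Split a value-field name into coverage, network, scope, and kind."""
    coverage = COVERAGE_BY_PREFIX.get(field[:4], "Unknown")

    if "OutOfNet" in field:
        network = "Out-of-Network"
    elif "Inn" in field:
        network = "In-Network"
    else:
        network = "Unknown"

    if "Family" in field:
        scope = "Family"
    elif "Individual" in field:
        scope = "Individual"
    else:
        scope = "Unknown"

    # Check MOOP before Ded so a MOOP field is never mislabelled.
    if "MOOP" in field:
        kind = "MOOP"
    elif "Ded" in field:
        kind = "Deductible"
    else:
        kind = "Unknown"

    return {"coverage": coverage, "network": network, "scope": scope, "kind": kind}


for name in ["MEHBDedInnIndividual", "DEHBDedInnFamily", "TEHBInnFamilyMOOP"]:
    print(name, describe_value_field(name))
```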

### 1.2.3. Pricing and Metadata Fields

These columns provide **premium information** and **plan metadata**, including:

- **Premiums** for different age groups:
  - `PREMI27`: premium for a 27-year-old
  - `PREMI50`: premium for a 50-year-old
  - `PREMI2C30`: premium for a 30-year-old with 2 children
- **Plan Metal Level**: Catastrophic, Bronze, Silver, Gold, Platinum
- **Plan Type**: HMO, PPO, EPO, POS
- **Network ID**: Identifying different network arrangements
- **Marketplace Participation**:
  - On-Market (sold through Healthcare.gov or state exchange)
  - Off-Market (sold privately)
- **Special Flags**:
  - `CSR`: Cost-Sharing Reduction variant
  - `CHILDONLY`: Child-only plan indicator
  - `MULTITIERED`: Flag indicating plans with tiered provider networks

---

## 1.3 Important Data Characteristics

- **Complex Benefit Structures**:  
  Some plans offer different cost-sharing based on service volume (e.g., different pricing after a certain number of visits) or special conditions (e.g., waived ER copay upon admission).

- **Geographical Organization**:  
  Plans are organized by **rating areas**, not strictly by county. Rating areas can span multiple counties or split a county across different rating areas.

- **Multiple Plan IDs**:  
  One insurance product can appear with multiple HIOS Plan IDs due to variations such as:
  - CSR status
  - Child-only plans
  - Network changes
  - Service area restrictions

- **Data Completeness Limitations**:  
  - Coverage is generally most complete for Healthcare.gov (FFM) plans.
  - Some fields, especially out-of-network cost-sharing or less common benefits (e.g., skilled nursing, habilitation services), may have missing or incomplete data.
  - Limited ability to represent volume-dependent or condition-dependent cost-sharing fully.

---

## 1.4 Key Definitions

- **HIOS Plan ID**:  
  Administrative identifier assigned by CMS. Variants ending with `-04`, `-05`, `-06` denote CSR variants.

- **Metal Level**:  
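  Indicates the plan's actuarial value tier, i.e. the approximate share of covered costs the plan pays on average: Bronze (~60%), Silver (~70%), Gold (~80%), Platinum (~90%). Catastrophic plans sit outside the metal tiers and have restricted eligibility.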

- **CSR (Cost-Sharing Reduction)**:  
  Special versions of Silver plans that offer reduced deductibles and MOOP for eligible low-income individuals.

- **Network Tiers**:  
  Some plans differentiate providers into multiple tiers with different cost-sharing levels (e.g., preferred vs. non-preferred).
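
The HIOS Plan ID definition above can be made concrete by splitting an ID into its parts. The sketch assumes the commonly described CMS layout of a 14-character base ID (5-digit issuer, 2-letter state, 3-digit product, 4-digit component) followed by an optional two-digit variant suffix. That layout is not stated in this notebook, the `-04` example ID is invented for illustration, and `parse_hios_plan_id` / `is_csr_variant` are hypothetical helpers.

```python
from typing import NamedTuple, Optional


class HiosPlanId(NamedTuple):
    issuer_id: str
    state: str
    product: str
    component: str
    variant: Optional[str]


# Suffixes listed in the definition above as CSR variants.
CSR_VARIANTS = {"04", "05", "06"}


def parse_hios_plan_id(plan_id: str) -> HiosPlanId:
    """Split 'IIIIISSPPPCCCC[-VV]' into its parts (assumed layout)."""
    base, _, variant = plan_id.partition("-")
    if len(base) != 14:
        raise ValueError(f"expected a 14-character base ID, got {base!r}")
    return HiosPlanId(base[:5], base[5:7], base[7:10], base[10:14], variant or None)


def is_csr_variant(plan_id: str) -> bool:
    """True when the ID carries one of the CSR variant suffixes."""
    return parse_hios_plan_id(plan_id).variant in CSR_VARIANTS


print(parse_hios_plan_id("73836AK0950001"))
print(is_csr_variant("73836AK0950001-04"))  # invented variant ID, for illustration
```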

---

# 2. Data Preprocessing (to be added)

> The raw dataset undergoes substantial column selection, cleaning, and transformation to create a more manageable and analysis-ready database schema.


In [17]:
import pandas as pd

In [80]:
plans_raw = pd.read_csv('data/plans_raw.csv')
plans_raw.head()

Unnamed: 0,UNIQUE,YEAR,DATECAPTURE,PLANID,ST,AREA,CARRIER,PLANNAME,METAL,PLANTYPE,...,TEHBInnFamilyMOOP,TEHBInnFamilyMOOP_TIERS,TEHBInnTier1FamilyMOOPA,TEHBInnTier2FamilyMOOPA,MEHBOutOfNetFamilyMOOP,DEHBOutOfNetFamilyMOOP,TEHBOutOfNetFamilyMOOP,MEHBOutOfNetFamilyMOOPA,DEHBOutOfNetFamilyMOOPA,TEHBOutOfNetFamilyMOOPA
0,,2025,2024-10-05,73836AK0950001,AK,AK01,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
1,,2025,2024-10-05,73836AK0950001,AK,AK02,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
2,,2025,2024-10-05,73836AK0930001,AK,AK02,Moda Health,Moda Pioneer Gold 1500,Gold,1,...,1,1.0,12000.0,,0,0,1,,,36000.0
3,,2025,2024-10-05,73836AK0950001,AK,AK03,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
4,,2025,2024-10-05,73836AK0930001,AK,AK03,Moda Health,Moda Pioneer Gold 1500,Gold,1,...,1,1.0,12000.0,,0,0,1,,,36000.0


In [32]:
# for col in plans_raw.columns:
#     print(col)

In [98]:
state_age_curve = pd.read_csv('data/state_age_curve.csv')
state_age_curve.head()

Unnamed: 0,AGE,DEFAULT,AL,DC,MA,MN,MS,OR,UT
0,<14,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793
1,15,0.833,0.635,0.654,0.751,0.89,0.635,0.635,0.793
2,16,0.859,0.635,0.654,0.751,0.89,0.635,0.635,0.793
3,17,0.885,0.635,0.654,0.751,0.89,0.635,0.635,0.793
4,18,0.913,0.635,0.654,0.751,0.89,0.635,0.635,0.793
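
One possible use of this table is to rescale a known premium to another age. The sketch below assumes that, all else equal, premiums are proportional to the age factor and that states without their own column follow `DEFAULT`. It rebuilds only the five rows shown above, the $300 premium is invented, and `rescale_premium` is a hypothetical helper.

```python
import pandas as pd

# The five rows shown in the output above (AGE kept as strings because of "<14").
age_curve = pd.DataFrame({
    "AGE": ["<14", "15", "16", "17", "18"],
    "DEFAULT": [0.765, 0.833, 0.859, 0.885, 0.913],
    "MN": [0.89, 0.89, 0.89, 0.89, 0.89],
})


def rescale_premium(premium, from_age, to_age, curve, state="DEFAULT"):
    """Scale a premium by the ratio of two age factors (assumed proportional)."""
    # Assumption: states without a dedicated column use the DEFAULT curve.
    column = state if state in curve.columns else "DEFAULT"
    factors = curve.set_index("AGE")[column]
    return premium * factors[to_age] / factors[from_age]


# Invented example: a $300 premium quoted for an 18-year-old, rescaled to age 15.
estimate = rescale_premium(300.0, "18", "15", age_curve, state="AK")
print(round(estimate, 2))
```

Tobacco surcharges and family tiering are ignored here, so the result is only a rough estimate.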


In [76]:
zip_fips = pd.read_csv('data/zip_fips_crosswalk.csv')
zip_fips.head()

Unnamed: 0,ZIP,COUNTY,STATE,FIPS
0,36003,Autauga County,AL,1001
1,36006,Autauga County,AL,1001
2,36067,Autauga County,AL,1001
3,36066,Autauga County,AL,1001
4,36703,Autauga County,AL,1001


In [74]:
county_ratings = pd.read_csv('data/county_rating_area_crosswalk.csv')
county_ratings.head()

Unnamed: 0,fips_code,county_name,rating_area_count,rating_area_id,year
0,1001,Autauga County,1,AL11,2025
1,1003,Baldwin County,1,AL13,2025
2,1005,Barbour County,1,AL13,2025
3,1007,Bibb County,1,AL03,2025
4,1009,Blount County,1,AL03,2025


In [82]:
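# Drop the 'year' column and overwrite the crosswalk file in place.
# errors='ignore' keeps this cell re-runnable once the column is already gone.
county_ratings = county_ratings.drop(columns='year', errors='ignore')
county_ratings.to_csv('data/county_rating_area_crosswalk.csv', index=False)
county_ratings.head()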

Unnamed: 0,fips_code,county_name,rating_area_count,rating_area_id
0,1001,Autauga County,1,AL11
1,1003,Baldwin County,1,AL13
2,1005,Barbour County,1,AL13
3,1007,Bibb County,1,AL03
4,1009,Blount County,1,AL03


In [94]:
# Counties split across three rating areas (one row per county/rating-area pair)
county_ratings[county_ratings['rating_area_count'] == 3]

Unnamed: 0,fips_code,county_name,rating_area_count,rating_area_id
1246,25021,Norfolk County,3,MA03
1247,25021,Norfolk County,3,MA05
1248,25021,Norfolk County,3,MA06
1249,25023,Plymouth County,3,MA03
1250,25023,Plymouth County,3,MA06
1251,25023,Plymouth County,3,MA07
1253,25027,Worcester County,3,MA01
1254,25027,Worcester County,3,MA02
1255,25027,Worcester County,3,MA03
