# 0. Introduction

In this notebook, we will use the 2025 Individual Market data from the [HIX Compare+ Dataset](https://hix-compare.org/individual-markets.html) and preprocess it for the purpose of our database project. The raw CSV file can be found under [data/HIX_ind_plans_unnormalized.csv](https://github.com/nashjafri/carefox_aca_healthcare_database/blob/main/data/HIX_ind_plans_unnormalized.csv) in this GitHub repository.

The raw dataset contains extensive plan-level information on ACA-compliant individual health insurance offerings for the 2025 coverage year, with over 700 columns related to benefits, deductibles, premiums, provider networks, and plan characteristics. However, the original file is highly detailed and designed for flexible research use, not immediate database integration or application development.

In this notebook, we will systematically clean, filter, and restructure the data to make it more usable for a healthcare plan comparison and analytics application. This involves reducing dimensionality by selecting only the most relevant fields, standardizing field names, categorizing plan features, and ensuring consistency in benefit and value fields. The final processed dataset will serve as the core data foundation for the CareFox platform.

## Reference

The data structure, field definitions, and dataset characteristics are based on:

> **HIX Compare Dataset (2014–2025) [[https://hix-compare.org]](https://hix-compare.org)**  
> Created by the Robert Wood Johnson Foundation (RWJF) and maintained in partnership with Ideon.  
> *HIX Compare+ Dataset Documentation, Version October 28, 2024.*  
> Questions can be directed to [HIXsupport@ideonapi.com](mailto:HIXsupport@ideonapi.com).

---

# 1. Data Overview

## 1.1 Source
This project uses data from the [HIX Compare+ Dataset](https://hix-compare.org/individual-markets.html) provided by Ideon, a comprehensive public dataset detailing ACA-compliant health insurance plans in the United States.

For the year 2025, the dataset covers individual market plans available both on and off the state and federal exchanges (Healthcare.gov), across all 50 states and Washington, D.C.

Plans include fully insured on-marketplace, off-marketplace, and small group plans, with rich information on benefits, premiums, cost-sharing, network structures, and plan metadata.

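Two supplementary sources are used alongside the HIX Compare+ data:

- ZIP code to county FIPS crosswalk (Kaggle): https://www.kaggle.com/datasets/danofer/zipcodes-county-fips-crosswalk/discussion/244926
- State-specific age curve variations (CMS): https://www.cms.gov/CCIIO/Programs-and-Initiatives/Health-Insurance-Market-Reforms/Downloads/StateSpecAgeCrv053117.pdf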

## 1.2 Dataset Structure and Contents

The original individual plan dataset is a raw CSV file [[data/HIX_ind_plans_unnormalized.csv](https://github.com/nashjafri/carefox_aca_healthcare_database/blob/main/data/HIX_ind_plans_unnormalized.csv)] containing **723 columns**, organized into three major types of fields:

### 1.2.1. Benefits Fields

These fields describe the **cost-sharing structures** (copay and coinsurance) associated with a range of healthcare services.  
Each *benefit* (e.g., primary care, emergency room, specialist visit, hospitalization) includes multiple subfields:

- **In-Network Copay** (tiered if applicable)
- **In-Network Coinsurance** (tiered if applicable)
- **Out-of-Network Copay and Coinsurance**
- **Complexity and Limitation Flags**

**Examples of Benefits:**
- Ambulance (AB)
- Emergency Room (ER)
- Inpatient Hospital Facility (IP)
- Mental Health Services (IN, OM)
- Primary Care Physician (PC)
- Specialist Visit (SP)
- Prescription Drugs (GD, PD, ND, SD)

Each benefit is captured across up to **21 associated columns** (the suffixes listed in `benefit_suffixes` in Section 2):
- e.g., `SP_CopayInnTier1`, `SP_CopayInnTier1A`, `SP_CopayOutofNetA`, `SP_CoinsInnTier1`, etc.

**Benefit fields reflect:**  
- Whether coverage is tiered (Tier 1, Tier 2)
- Whether copays/coinsurance apply before/after deductible
- Whether a benefit is limited in number of visits/services
- Complex rules like "first 5 visits free, then coinsurance applies"
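
The naming convention above (`<benefit code>_<suffix>`) can be illustrated with a small sketch. The helper name `split_benefit_column` is hypothetical and not part of the dataset or the notebook; it only assumes that the benefit code is everything before the first underscore.

```python
# Hypothetical helper: split a benefit column name into (benefit code, suffix).
# Assumes the benefit code is the text before the FIRST underscore, so suffixes
# that themselves contain an underscore (e.g. "CopayInn_TIERS") stay intact.
def split_benefit_column(column):
    code, _, suffix = column.partition("_")
    return code, suffix

examples = ["SP_CopayInnTier1", "SP_CopayInnTier1A", "SP_CopayOutofNetA", "SP_CoinsInnTier1"]
parsed = [split_benefit_column(name) for name in examples]

for code, suffix in parsed:
    print(code, suffix)
```

Splitting on the first underscore only is a deliberate choice here, since a plain `split("_")` would break suffixes such as `CopayInn_TIERS` into two pieces.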

### 1.2.2. Value Fields

These fields describe **deductibles** and **maximum out-of-pocket (MOOP)** expenses.

The data distinguishes between:
- **Medical vs. Drug** coverage
- **In-Network vs. Out-of-Network** coverage
- **Individual vs. Family** deductibles
- **Integrated vs. Separate** Medical/Drug deductibles

**Examples of Value Fields:**
- `MEHBDedInnIndividual`: Medical In-Network Individual Deductible
- `DEHBDedInnFamily`: Drug In-Network Family Deductible
- `TEHBInnFamilyMOOP`: Integrated In-Network Family Maximum Out-of-Pocket

Value fields can have multiple tiers, allowing for richer plan designs with multiple cost-sharing levels based on provider tiers.

### 1.2.3. Pricing and Metadata Fields

These columns provide **premium information** and **plan metadata**, including:

- **Premiums** for different age groups:
  - `PREMI27`: premium for a 27-year-old
  - `PREMI50`: premium for a 50-year-old
  - `PREMI2C30`: premium for 2 children + 30-year-old
- **Plan Metal Level**: Catastrophic, Bronze, Silver, Gold, Platinum
- **Plan Type**: HMO, PPO, EPO, POS
- **Network ID**: Identifying different network arrangements
- **Marketplace Participation**:
  - On-Market (sold through Healthcare.gov or state exchange)
  - Off-Market (sold privately)
- **Special Flags**:
  - `CSR`: Cost-Sharing Reduction variant
  - `CHILDONLY`: Child-only plan indicator
  - `MULTITIERED`: Flag indicating plans with tiered provider networks

---

## 1.3 Important Data Characteristics

- **Complex Benefit Structures**:  
  Some plans offer different cost-sharing based on service volume (e.g., different pricing after a certain number of visits) or special conditions (e.g., waived ER copay upon admission).

- **Geographical Organization**:  
  Plans are organized by **rating areas**, not strictly by county. A rating area can span multiple counties, and a single county can be split across several rating areas.

- **Multiple Plan IDs**:  
  One insurance product can appear with multiple HIOS Plan IDs due to variations such as:
  - CSR status
  - Child-only plans
  - Network changes
  - Service area restrictions

- **Data Completeness Limitations**:  
  - Coverage is generally most complete for Healthcare.gov (FFM) plans.
  - Some fields, especially out-of-network cost-sharing or less common benefits (e.g., skilled nursing, habilitation services), may have missing or incomplete data.
  - The dataset has limited ability to fully represent volume-dependent or condition-dependent cost-sharing.

---

## 1.4 Key Definitions

- **HIOS Plan ID**:  
  Administrative identifier assigned by CMS. Variants ending with `-04`, `-05`, `-06` denote CSR variants.

- **Metal Level**:  
  Indicates the plan's actuarial value (Catastrophic, Bronze, Silver, Gold, Platinum).

- **CSR (Cost-Sharing Reduction)**:  
  Special versions of Silver plans that offer reduced deductibles and MOOP for eligible low-income individuals.

- **Network Tiers**:  
  Some plans differentiate providers into multiple tiers with different cost-sharing levels (e.g., preferred vs. non-preferred).
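
As a rough illustration of the HIOS Plan ID definition above, the sketch below splits an ID into its parts and flags CSR variants. The function name `parse_plan_id` is hypothetical, and the layout (5-digit issuer ID, 2-letter state, 3-digit product, 4-digit plan component, optional 2-digit variant) is an assumption about the ID format that should be checked against the dataset documentation.

```python
import re

# Assumed layout: issuer (5 digits) + state (2 letters) + product (3 digits)
# + plan component (4 digits), optionally followed by "-NN" (variant).
HIOS_PATTERN = re.compile(r"^(\d{5})([A-Z]{2})(\d{3})(\d{4})(?:-(\d{2}))?$")

# Variant suffixes treated as CSR variants, per the definition above.
CSR_VARIANTS = {"04", "05", "06"}

def parse_plan_id(plan_id):
    """Hypothetical helper: break a HIOS Plan ID into labeled parts."""
    match = HIOS_PATTERN.match(plan_id)
    if match is None:
        raise ValueError(f"Unexpected plan ID format: {plan_id!r}")
    issuer, state, product, component, variant = match.groups()
    return {
        "issuer_id": issuer,
        "state": state,
        "base_plan_id": issuer + state + product + component,
        "variant": variant,  # None when the ID has no suffix
        "is_csr_variant": variant in CSR_VARIANTS,
    }

print(parse_plan_id("42261UT0060023-04"))
print(parse_plan_id("73836AK0950001"))
```

Both example IDs are taken from the outputs in this notebook.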

---

# 2. Data Preprocessing

> The raw dataset undergoes substantial column selection, cleaning, and transformation to create a more manageable and analysis-ready database schema.


In [17]:
import pandas as pd

In [147]:
unnormalized_plans_data = pd.read_csv('data/HIX_ind_plans_unnormalized.csv')
unnormalized_plans_data.head()

Unnamed: 0,UNIQUE,YEAR,DATECAPTURE,PLANID,ST,AREA,CARRIER,PLANNAME,METAL,PLANTYPE,...,TEHBInnFamilyMOOP,TEHBInnFamilyMOOP_TIERS,TEHBInnTier1FamilyMOOPA,TEHBInnTier2FamilyMOOPA,MEHBOutOfNetFamilyMOOP,DEHBOutOfNetFamilyMOOP,TEHBOutOfNetFamilyMOOP,MEHBOutOfNetFamilyMOOPA,DEHBOutOfNetFamilyMOOPA,TEHBOutOfNetFamilyMOOPA
0,,2025,2024-10-05,73836AK0950001,AK,AK01,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
1,,2025,2024-10-05,73836AK0950001,AK,AK02,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
2,,2025,2024-10-05,73836AK0930001,AK,AK02,Moda Health,Moda Pioneer Gold 1500,Gold,1,...,1,1.0,12000.0,,0,0,1,,,36000.0
3,,2025,2024-10-05,73836AK0950001,AK,AK03,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,...,1,1.0,16000.0,,0,0,1,,,54600.0
4,,2025,2024-10-05,73836AK0930001,AK,AK03,Moda Health,Moda Pioneer Gold 1500,Gold,1,...,1,1.0,12000.0,,0,0,1,,,36000.0


In [282]:
# for col in unnormalized_plans_data.columns:
#     print(col)

In [268]:
state_age_curve = pd.read_csv('data/normalized_tables/state_age_curve.csv')
state_age_curve.head()

Unnamed: 0,AGE,DEFAULT,AL,DC,MA,MN,MS,OR,UT
0,0,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793
1,1,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793
2,2,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793
3,3,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793
4,4,0.765,0.635,0.654,0.751,0.89,0.635,0.635,0.793


In [270]:
zip_fips = pd.read_csv('data/normalized_tables/zip_fips_crosswalk.csv')
zip_fips.head()

Unnamed: 0,ZIP,COUNTY,STATE,FIPS
0,36003,Autauga County,AL,1001
1,36006,Autauga County,AL,1001
2,36067,Autauga County,AL,1001
3,36066,Autauga County,AL,1001
4,36703,Autauga County,AL,1001


In [276]:
county_area = pd.read_csv('data/HIX_ind_county_area_crosswalk.csv')
county_area.head()

Unnamed: 0,fips_code,county_name,rating_area_count,rating_area_id,year
0,1001,Autauga County,1,AL11,2025
1,1003,Baldwin County,1,AL13,2025
2,1005,Barbour County,1,AL13,2025
3,1007,Bibb County,1,AL03,2025
4,1009,Blount County,1,AL03,2025


In [278]:
county_area = county_area.drop(columns='year')

county_area_rename_dict = {
    'fips_code': 'FIPS',
    'county_name': 'COUNTY',
    'rating_area_count': 'AREA_COUNT',
    'rating_area_id': 'AREA'
}

county_area = county_area.rename(columns=county_area_rename_dict)
county_area.to_csv('data/normalized_tables/county_area_crosswalk.csv', index=False)
county_area.head()

Unnamed: 0,FIPS,COUNTY,AREA_COUNT,AREA
0,1001,Autauga County,1,AL11
1,1003,Baldwin County,1,AL13
2,1005,Barbour County,1,AL13
3,1007,Bibb County,1,AL03
4,1009,Blount County,1,AL03


In [280]:
county_area[county_area['AREA_COUNT'] == 3]

Unnamed: 0,FIPS,COUNTY,AREA_COUNT,AREA
1246,25021,Norfolk County,3,MA03
1247,25021,Norfolk County,3,MA05
1248,25021,Norfolk County,3,MA06
1249,25023,Plymouth County,3,MA03
1250,25023,Plymouth County,3,MA06
1251,25023,Plymouth County,3,MA07
1253,25027,Worcester County,3,MA01
1254,25027,Worcester County,3,MA02
1255,25027,Worcester County,3,MA03
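
Because a county can map to more than one rating area, a ZIP code lookup may return several candidate areas. The following is a minimal sketch of how the two crosswalks could be chained, using a few rows copied from the outputs above rather than the full CSV files. The variable names `zip_to_area` and `areas_by_fips` are illustrative only.

```python
import pandas as pd

# A few rows copied from the crosswalk outputs above (not the full tables).
zip_fips_sample = pd.DataFrame({
    "ZIP": [36003, 36006],
    "COUNTY": ["Autauga County", "Autauga County"],
    "STATE": ["AL", "AL"],
    "FIPS": [1001, 1001],
})
county_area_sample = pd.DataFrame({
    "FIPS": [1001, 25021, 25021, 25021],
    "COUNTY": ["Autauga County", "Norfolk County", "Norfolk County", "Norfolk County"],
    "AREA_COUNT": [1, 3, 3, 3],
    "AREA": ["AL11", "MA03", "MA05", "MA06"],
})

# ZIP -> FIPS -> AREA: one output row per (ZIP, candidate rating area).
zip_to_area = zip_fips_sample.merge(
    county_area_sample[["FIPS", "AREA"]], on="FIPS", how="left"
)

# All candidate rating areas per county.
areas_by_fips = county_area_sample.groupby("FIPS")["AREA"].apply(list).to_dict()

print(zip_to_area)
print(areas_by_fips)
```

For counties with `AREA_COUNT` greater than 1, an application would need an extra rule (or user input) to pick among the candidates; that rule is outside the scope of this sketch.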


In [149]:
plans_columns = [
    'PLANID',       
    'AREA',         
    'ST',          
    'CARRIER',      
    'PLANNAME',     
    'METAL',        
    'PLANTYPE',     
    'CSR',          
    'PLANMARKET',   
    'CHILDONLY',    
    'NETWORKID',    
    'actively_marketed',  
    'MULTITIERED'   
]

plans = unnormalized_plans_data[plans_columns].copy()

# Convert actively_marketed (bool to int) 
plans['actively_marketed'] = plans['actively_marketed'].astype('int64')
plans.to_csv('data/normalized_tables/plans.csv', index=False)
plans.head()

Unnamed: 0,PLANID,AREA,ST,CARRIER,PLANNAME,METAL,PLANTYPE,CSR,PLANMARKET,CHILDONLY,NETWORKID,actively_marketed,MULTITIERED
0,73836AK0950001,AK01,AK,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,0,3,0,201379.0,1,0
1,73836AK0950001,AK02,AK,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,0,3,0,201379.0,1,0
2,73836AK0930001,AK02,AK,Moda Health,Moda Pioneer Gold 1500,Gold,1,0,3,0,201379.0,1,1
3,73836AK0950001,AK03,AK,Moda Health,Moda Pioneer Alaska Standard Silver,Silver,1,0,3,0,201379.0,1,0
4,73836AK0930001,AK03,AK,Moda Health,Moda Pioneer Gold 1500,Gold,1,0,3,0,201379.0,1,1


In [284]:
plans[plans['actively_marketed'] == 0]

Unnamed: 0,PLANID,AREA,ST,CARRIER,PLANNAME,METAL,PLANTYPE,CSR,PLANMARKET,CHILDONLY,NETWORKID,actively_marketed,MULTITIERED
6153,86545CT1340022,CT04,CT,Anthem,Anthem Gold PPO Pathway 2000/10%,Gold,1,0,2,0,100232.0,0,0
6154,86545CT1340022,CT08,CT,Anthem,Anthem Gold PPO Pathway 2000/10%,Gold,1,0,2,0,100232.0,0,0
6155,86545CT1340022,CT02,CT,Anthem,Anthem Gold PPO Pathway 2000/10%,Gold,1,0,2,0,100232.0,0,0
6156,86545CT1340022,CT05,CT,Anthem,Anthem Gold PPO Pathway 2000/10%,Gold,1,0,2,0,100232.0,0,0
6158,86545CT1340022,CT06,CT,Anthem,Anthem Gold PPO Pathway 2000/10%,Gold,1,0,2,0,100232.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
76699,87226TX0110007,TX19,TX,Ambetter,Elite Gold,Gold,2,0,2,0,206547.0,0,0
76700,87226TX0110007,TX22,TX,Ambetter,Elite Gold,Gold,2,0,2,0,206547.0,0,0
76701,87226TX0110007,TX17,TX,Ambetter,Elite Gold,Gold,2,0,2,0,206547.0,0,0
77674,87226TX0110007,TX24,TX,Ambetter,Elite Gold,Gold,2,0,2,0,206547.0,0,0


In [167]:
# We check whether the base age premium rates are consistent when computed using 
# PREMI27 and PREMI50 (the premium rates for ages 27 and 50 respectively)

# As an example, we only consider UT plans for now
UT_plans = unnormalized_plans_data[(unnormalized_plans_data['ST'] == 'UT') & unnormalized_plans_data['PREMI27'].notna() & unnormalized_plans_data['PREMI50'].notna()].copy()

# Calculate base rates from PREMI27 and PREMI50
UT_plans['base_from_27'] = UT_plans['PREMI27'] / 1.39
UT_plans['base_from_50'] = UT_plans['PREMI50'] / 2.127

# Calculate difference
UT_plans['abs_diff'] = (UT_plans['base_from_27'] - UT_plans['base_from_50']).abs()
UT_plans['percent_diff'] = 100 * UT_plans['abs_diff'] / UT_plans[['base_from_27', 'base_from_50']].mean(axis=1)
UT_plans[['PLANID', 'base_from_27', 'base_from_50', 'abs_diff', 'percent_diff']].head()

Unnamed: 0,PLANID,base_from_27,base_from_50,abs_diff,percent_diff
77927,42261UT0060022,396.129496,396.125999,0.003497,0.000883
77928,42261UT0060023-04,369.57554,369.572167,0.003372,0.000912
77929,42261UT0060024,267.57554,267.574048,0.001492,0.000557
77930,42261UT0060026-04,367.647482,367.64457,0.002912,0.000792
77931,42261UT0060025,386.719424,386.723084,0.00366,0.000946


In [171]:
max(UT_plans['abs_diff'])

0.008093947972838578

In [169]:
max(UT_plans['percent_diff'])

0.0020662829283792047

In [207]:
# We check whether the base age premium rates are consistent when computed using 
# PREMI27 and PREMI50 (the premium rates for ages 27 and 50 respectively)

# Now for example, we consider AK plans
AK_plans = unnormalized_plans_data[(unnormalized_plans_data['ST'] == 'AK') & unnormalized_plans_data['PREMI27'].notna() & unnormalized_plans_data['PREMI50'].notna()].copy()

# Calculate base rates from PREMI27 and PREMI50
AK_plans['base_from_27'] = AK_plans['PREMI27'] / 1.048
AK_plans['base_from_50'] = AK_plans['PREMI50'] / 1.786 

# Calculate difference
AK_plans['abs_diff'] = (AK_plans['base_from_27'] - AK_plans['base_from_50']).abs()
AK_plans['percent_diff'] = 100 * AK_plans['abs_diff'] / AK_plans[['base_from_27', 'base_from_50']].mean(axis=1)
AK_plans[['PLANID', 'base_from_27', 'base_from_50', 'abs_diff', 'percent_diff']].head()

Unnamed: 0,PLANID,base_from_27,base_from_50,abs_diff,percent_diff
0,73836AK0950001,789.122137,789.473684,0.351547,0.044539
1,73836AK0950001,830.152672,829.787234,0.365438,0.04403
2,73836AK0930001,745.229008,744.680851,0.548157,0.073583
3,73836AK0950001,808.206107,807.950728,0.255379,0.031603
4,73836AK0930001,725.19084,725.083987,0.106853,0.014736


In [175]:
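# Note: AL_plans is not defined in the cells shown here. The outputs below come from an
# earlier run (In [175], In [177]) on AL plans, so they do not describe the AK rows above.
max(AL_plans['abs_diff'])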

0.00746903396225207

In [177]:
max(AL_plans['percent_diff'])

0.002652292253040064
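
The per-state checks above hard-code the age-27 and age-50 multipliers. A minimal sketch of the same check driven by a lookup table is shown below. The `multipliers` dictionary holds only the values used in the cells above (the default curve, which is what the AK cell uses, and UT). The sample premiums are back-calculated from the first AK and UT output rows and are illustrative only. The function name `base_rate_gap` is hypothetical.

```python
import pandas as pd

# Age-27 and age-50 multipliers taken from the cells above.
multipliers = {
    "DEFAULT": {27: 1.048, 50: 1.786},
    "UT": {27: 1.39, 50: 2.127},
}

# Premiums back-calculated from the first AK and UT rows shown above.
sample = pd.DataFrame({
    "PLANID": ["73836AK0950001", "42261UT0060022"],
    "ST": ["AK", "UT"],
    "PREMI27": [827.0, 550.62],
    "PREMI50": [1410.0, 842.56],
})

def base_rate_gap(df):
    """Compare the base rate implied by PREMI27 with the one implied by PREMI50."""
    out = df.copy()
    # States without their own curve fall back to the default multipliers.
    curve = out["ST"].map(lambda st: multipliers.get(st, multipliers["DEFAULT"]))
    out["base_from_27"] = out["PREMI27"] / curve.map(lambda c: c[27])
    out["base_from_50"] = out["PREMI50"] / curve.map(lambda c: c[50])
    out["abs_diff"] = (out["base_from_27"] - out["base_from_50"]).abs()
    return out

result = base_rate_gap(sample)
print(result[["PLANID", "base_from_27", "base_from_50", "abs_diff"]])
```

Driving the check from a table makes it straightforward to extend to every state once the full `state_age_curve` table is loaded.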

In [209]:
premium_subset = unnormalized_plans_data[['PLANID', 'AREA', 'ST', 'PREMI27', 'PREMI2C30', 'PREMC2C30']].dropna(subset=['PREMI27']).copy()

# Get rate multiplier at AGE 27 from the state_age_curve table
age_27_rate = state_age_curve[state_age_curve['AGE'] == 27].iloc[0]

# State-specific rules for AL, DC, MA, MN, MS, UT, OR
state_specific_states = ['AL', 'DC', 'MA', 'MN', 'MS', 'UT', 'OR']

# Function to pick correct multiplier
def select_multiplier(state):
    if state in state_specific_states:
        return age_27_rate[state]
    else:
        return age_27_rate['DEFAULT']

# Apply and calculate Base Individual Premium
premium_subset['27multiplier'] = premium_subset['ST'].apply(select_multiplier)
premium_subset['PREMI21_BASE'] = premium_subset['PREMI27'] / premium_subset['27multiplier']

# Keep final columns
premium = premium_subset[['PLANID', 'AREA', 'ST', 'PREMI21_BASE', 'PREMI2C30', 'PREMC2C30']]
premium.to_csv('data/normalized_tables/premium.csv', index=False)

premium.head()

Unnamed: 0,PLANID,AREA,ST,PREMI21_BASE,PREMI2C30,PREMC2C30
0,73836AK0950001,AK01,AK,789.122137,2104.0,3000.0
1,73836AK0950001,AK02,AK,830.152672,2212.0,3154.0
2,73836AK0930001,AK02,AK,745.229008,1985.0,2830.0
3,73836AK0950001,AK03,AK,808.206107,2153.0,3070.0
4,73836AK0930001,AK03,AK,725.19084,1933.0,2756.0
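
Since `PREMI21_BASE` is `PREMI27` divided by the age-27 multiplier, a premium for another age can presumably be recovered by multiplying the base by that age's multiplier. The sketch below assumes this reading. The small `curve` table mirrors the layout of `state_age_curve` but contains only values that appear in this notebook (age 0 from the table output, ages 27 and 50 from the consistency checks), and `premium_for_age` is a hypothetical helper.

```python
import pandas as pd

# Small stand-in for state_age_curve, limited to values shown in this notebook.
curve = pd.DataFrame({
    "AGE": [0, 27, 50],
    "DEFAULT": [0.765, 1.048, 1.786],
    "UT": [0.793, 1.39, 2.127],
}).set_index("AGE")

def premium_for_age(base_premium, state, age, curve=curve):
    """Hypothetical helper: scale a base premium by the age multiplier."""
    # States without their own column use the DEFAULT curve.
    column = state if state in curve.columns else "DEFAULT"
    return round(float(base_premium * curve.loc[age, column]), 2)

# First AK row from the premium table above.
print(premium_for_age(789.122137, "AK", 27))
print(premium_for_age(789.122137, "AK", 50))
```

The age-50 result will differ slightly from the recorded `PREMI50`, in line with the small gaps measured in the consistency checks above.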


In [213]:
benefit_codes = [
    'AB', 'EY', 'EW', 'DT', 'DM', 'ER', 'GD', 'HA', 'HH', 'HS', 'IM', 'IB',
    'IP', 'IN', 'IH', 'IS', 'ND', 'OP', 'OM', 'OH', 'OS', 'PD', 'PN', 'PV',
    'PC', 'RH', 'SN', 'SP', 'SD', 'UC'
]

benefit_suffixes = [
    'LIMITED', 'CopayInn_TIERS', 'CopayInnTier1Complex', 'CopayInnTier1', 'CopayInnTier1A',
    'CopayInnTier2Complex', 'CopayInnTier2', 'CopayInnTier2A',
    'CoinsInn_TIERS', 'CoinsInnTier1Complex', 'CoinsInnTier1', 'CoinsInnTier1A',
    'CoinsInnTier2Complex', 'CoinsInnTier2', 'CoinsInnTier2A',
    'CopayOutofNetComplex', 'CopayOutofNet', 'CopayOutofNetA',
    'CoinsOutofNetComplex', 'CoinsOutofNet', 'CoinsOutofNetA'
]

# Build long format benefit table
long_benefits = []

for code in benefit_codes:
    # Keep only the suffixes whose column actually exists for this benefit code
    present = [suffix for suffix in benefit_suffixes if f"{code}_{suffix}" in unnormalized_plans_data.columns]
    if not present:
        continue
    cols = [f"{code}_{suffix}" for suffix in present]
    subset = unnormalized_plans_data[['PLANID', 'AREA'] + cols].copy()
    # Rename benefit columns generically, using the suffixes that were actually matched
    subset.columns = ['PLANID', 'AREA'] + present
    subset['benefit_code'] = code
    long_benefits.append(subset)

# Combine all benefits into one DataFrame
benefits = pd.concat(long_benefits, ignore_index=True)

benefits.to_csv('data/normalized_tables/benefits.csv', index=False)
benefits.head()

Unnamed: 0,PLANID,AREA,LIMITED,CopayInn_TIERS,CopayInnTier1Complex,CopayInnTier1,CopayInnTier1A,CopayInnTier2Complex,CopayInnTier2,CopayInnTier2A,...,CoinsInnTier2Complex,CoinsInnTier2,CoinsInnTier2A,CopayOutofNetComplex,CopayOutofNet,CopayOutofNetA,CoinsOutofNetComplex,CoinsOutofNet,CoinsOutofNetA,benefit_code
0,73836AK0950001,AK01,0.0,1.0,0.0,0.0,,0.0,0.0,,...,0.0,0.0,,0.0,0.0,,0.0,4.0,40.0,AB
1,73836AK0950001,AK02,0.0,1.0,0.0,0.0,,0.0,0.0,,...,0.0,0.0,,0.0,0.0,,0.0,4.0,40.0,AB
2,73836AK0930001,AK02,0.0,1.0,0.0,0.0,,0.0,0.0,,...,0.0,0.0,,0.0,0.0,,0.0,4.0,30.0,AB
3,73836AK0950001,AK03,0.0,1.0,0.0,0.0,,0.0,0.0,,...,0.0,0.0,,0.0,0.0,,0.0,4.0,40.0,AB
4,73836AK0930001,AK03,0.0,1.0,0.0,0.0,,0.0,0.0,,...,0.0,0.0,,0.0,0.0,,0.0,4.0,30.0,AB


In [221]:
benefits.shape

(2547540, 24)
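
The shape above can be cross-checked with simple arithmetic, assuming every benefit code contributes one row per plan-area row of the raw file. The row count of 84918 is the one reported by `deductible.shape` and `moop.shape` later in this notebook.

```python
# Row count of the raw plan-area table (as reported by deductible.shape / moop.shape).
n_plan_area_rows = 84918
n_benefit_codes = 30   # len(benefit_codes)
n_suffixes = 21        # len(benefit_suffixes)

expected_rows = n_plan_area_rows * n_benefit_codes
# PLANID and AREA, plus one column per suffix, plus benefit_code.
expected_cols = 2 + n_suffixes + 1

print((expected_rows, expected_cols))  # (2547540, 24)
```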

In [251]:
# Deductible table: columns containing 'Ded'
deductible = unnormalized_plans_data[['PLANID', 'AREA'] + [col for col in unnormalized_plans_data.columns if 'Ded' in col]]

# MOOP table: columns containing 'MOOP'
moop = unnormalized_plans_data[['PLANID', 'AREA'] + [col for col in unnormalized_plans_data.columns if 'MOOP' in col]]

deductible.to_csv('data/normalized_tables/deductibles.csv', index=False)
moop.to_csv('data/normalized_tables/moop.csv', index=False)

In [253]:
deductible.columns

Index(['PLANID', 'AREA', 'MEHBDedInnIndividual', 'MEHBDedInnIndividual_TIERS',
       'MEHBDedInnTier1IndividualA', 'MEHBDedInnTier2IndividualA',
       'DEHBDedInnIndividual', 'DEHBDedInnIndividual_TIERS',
       'DEHBDedInnTier1IndividualA', 'DEHBDedInnTier2IndividualA',
       'TEHBDedInnIndividual', 'TEHBDedInnIndividual_TIERS',
       'TEHBDedInnTier1IndividualA', 'TEHBDedInnTier2IndividualA',
       'MEHBDedOutOfNetIndividual', 'DEHBDedOutOfNetIndividual',
       'TEHBDedOutOfNetIndividual', 'MEHBDedOutOfNetIndividualA',
       'DEHBDedOutOfNetIndividualA', 'TEHBDedOutOfNetIndividualA',
       'MEHBDedInnFamily', 'MEHBDedInnFamily_TIERS', 'MEHBDedInnTier1FamilyA',
       'MEHBDedInnTier2FamilyA', 'DEHBDedInnFamily', 'DEHBDedInnFamily_TIERS',
       'DEHBDedInnTier1FamilyA', 'DEHBDedInnTier2FamilyA', 'TEHBDedInnFamily',
       'TEHBDedInnFamily_TIERS', 'TEHBDedInnTier1FamilyA',
       'TEHBDedInnTier2FamilyA', 'MEHBDedOutOfNetFamily',
       'DEHBDedOutOfNetFamily', 'TEHBDedOutOfN

In [256]:
deductible.shape

(84918, 38)

We’ll use this renaming pattern:

`[CoverageType]_[NetworkType]_[PersonType]_[Field]`

Where:

- CoverageType = `MED` (Medical), `DRUG` (Drug), `TOT` (Total/Integrated)
- NetworkType = `IN`, `OUT`
- PersonType = `IND` (Individual), `FAM` (Family)
- Field = `CODE` (for the base MEHB/DEHB/TEHB column), `TIERS`, `TIER1_AMOUNT`, `TIER2_AMOUNT`, or `AMOUNT` (out-of-network)
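
The mapping can also be generated from the pattern instead of being typed out by hand. The sketch below is one possible way to do that for the deductible columns; `build_deductible_rename` is a hypothetical helper, and the target names follow the ones used in the rename dictionary in this notebook (`..._TIER1_AMOUNT`, `..._AMOUNT`, and so on).

```python
# Raw prefixes/person labels and the short codes used in the new names.
COVERAGE = {"MEHB": "MED", "DEHB": "DRUG", "TEHB": "TOT"}
PERSON = {"Individual": "IND", "Family": "FAM"}

def build_deductible_rename():
    """Hypothetical helper: build the raw-name -> new-name mapping for deductibles."""
    rename = {}
    for raw_cov, cov in COVERAGE.items():
        for raw_person, person in PERSON.items():
            inn = f"{cov}_IN_{person}"
            out = f"{cov}_OUT_{person}"
            # In-network: code flag, tier count, and one amount per tier.
            rename[f"{raw_cov}DedInn{raw_person}"] = f"{inn}_CODE"
            rename[f"{raw_cov}DedInn{raw_person}_TIERS"] = f"{inn}_TIERS"
            for tier in (1, 2):
                rename[f"{raw_cov}DedInnTier{tier}{raw_person}A"] = f"{inn}_TIER{tier}_AMOUNT"
            # Out-of-network: code flag and a single amount.
            rename[f"{raw_cov}DedOutOfNet{raw_person}"] = f"{out}_CODE"
            rename[f"{raw_cov}DedOutOfNet{raw_person}A"] = f"{out}_AMOUNT"
    return rename

generated = build_deductible_rename()
print(len(generated))
print(generated["MEHBDedInnTier1IndividualA"])
```

A generated mapping is easier to keep consistent across the deductible and MOOP tables, at the cost of being slightly less explicit than a hand-written dictionary.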

In [258]:
deduct_rename_dict = {
    'MEHBDedInnIndividual': 'MED_IN_IND_CODE',
    'MEHBDedInnIndividual_TIERS': 'MED_IN_IND_TIERS',
    'MEHBDedInnTier1IndividualA': 'MED_IN_IND_TIER1_AMOUNT',
    'MEHBDedInnTier2IndividualA': 'MED_IN_IND_TIER2_AMOUNT',
    
    'DEHBDedInnIndividual': 'DRUG_IN_IND_CODE',
    'DEHBDedInnIndividual_TIERS': 'DRUG_IN_IND_TIERS',
    'DEHBDedInnTier1IndividualA': 'DRUG_IN_IND_TIER1_AMOUNT',
    'DEHBDedInnTier2IndividualA': 'DRUG_IN_IND_TIER2_AMOUNT',
    
    'TEHBDedInnIndividual': 'TOT_IN_IND_CODE',
    'TEHBDedInnIndividual_TIERS': 'TOT_IN_IND_TIERS',
    'TEHBDedInnTier1IndividualA': 'TOT_IN_IND_TIER1_AMOUNT',
    'TEHBDedInnTier2IndividualA': 'TOT_IN_IND_TIER2_AMOUNT',
    
    'MEHBDedOutOfNetIndividual': 'MED_OUT_IND_CODE',
    'DEHBDedOutOfNetIndividual': 'DRUG_OUT_IND_CODE',
    'TEHBDedOutOfNetIndividual': 'TOT_OUT_IND_CODE',
    
    'MEHBDedOutOfNetIndividualA': 'MED_OUT_IND_AMOUNT',
    'DEHBDedOutOfNetIndividualA': 'DRUG_OUT_IND_AMOUNT',
    'TEHBDedOutOfNetIndividualA': 'TOT_OUT_IND_AMOUNT',
    
    'MEHBDedInnFamily': 'MED_IN_FAM_CODE',
    'MEHBDedInnFamily_TIERS': 'MED_IN_FAM_TIERS',
    'MEHBDedInnTier1FamilyA': 'MED_IN_FAM_TIER1_AMOUNT',
    'MEHBDedInnTier2FamilyA': 'MED_IN_FAM_TIER2_AMOUNT',
    
    'DEHBDedInnFamily': 'DRUG_IN_FAM_CODE',
    'DEHBDedInnFamily_TIERS': 'DRUG_IN_FAM_TIERS',
    'DEHBDedInnTier1FamilyA': 'DRUG_IN_FAM_TIER1_AMOUNT',
    'DEHBDedInnTier2FamilyA': 'DRUG_IN_FAM_TIER2_AMOUNT',
    
    'TEHBDedInnFamily': 'TOT_IN_FAM_CODE',
    'TEHBDedInnFamily_TIERS': 'TOT_IN_FAM_TIERS',
    'TEHBDedInnTier1FamilyA': 'TOT_IN_FAM_TIER1_AMOUNT',
    'TEHBDedInnTier2FamilyA': 'TOT_IN_FAM_TIER2_AMOUNT',
    
    'MEHBDedOutOfNetFamily': 'MED_OUT_FAM_CODE',
    'DEHBDedOutOfNetFamily': 'DRUG_OUT_FAM_CODE',
    'TEHBDedOutOfNetFamily': 'TOT_OUT_FAM_CODE',
    
    'MEHBDedOutOfNetFamilyA': 'MED_OUT_FAM_AMOUNT',
    'DEHBDedOutOfNetFamilyA': 'DRUG_OUT_FAM_AMOUNT',
    'TEHBDedOutOfNetFamilyA': 'TOT_OUT_FAM_AMOUNT'
}


deductible = deductible.rename(columns=deduct_rename_dict)
deductible.to_csv('data/normalized_tables/deductibles.csv', index=False)
deductible.head()

Unnamed: 0,PLANID,AREA,MED_IN_IND_CODE,MED_IN_IND_TIERS,MED_IN_IND_TIER1_AMOUNT,MED_IN_IND_TIER2_AMOUNT,DRUG_IN_IND_CODE,DRUG_IN_IND_TIERS,DRUG_IN_IND_TIER1_AMOUNT,DRUG_IN_IND_TIER2_AMOUNT,...,TOT_IN_FAM_CODE,TOT_IN_FAM_TIERS,TOT_IN_FAM_TIER1_AMOUNT,TOT_IN_FAM_TIER2_AMOUNT,MED_OUT_FAM_CODE,DRUG_OUT_FAM_CODE,TOT_OUT_FAM_CODE,MED_OUT_FAM_AMOUNT,DRUG_OUT_FAM_AMOUNT,TOT_OUT_FAM_AMOUNT
0,73836AK0950001,AK01,0,,,,0,,,,...,1,1.0,10000.0,,0,0,1,,,35400.0
1,73836AK0950001,AK02,0,,,,0,,,,...,1,1.0,10000.0,,0,0,1,,,35400.0
2,73836AK0930001,AK02,0,,,,0,,,,...,1,2.0,3000.0,6000.0,0,0,1,,,18000.0
3,73836AK0950001,AK03,0,,,,0,,,,...,1,1.0,10000.0,,0,0,1,,,35400.0
4,73836AK0930001,AK03,0,,,,0,,,,...,1,2.0,3000.0,6000.0,0,0,1,,,18000.0


In [260]:
moop.columns

Index(['PLANID', 'AREA', 'MEHBInnIndividualMOOP',
       'MEHBInnIndividualMOOP_TIERS', 'MEHBInnTier1IndividualMOOPA',
       'MEHBInnTier2IndividualMOOPA', 'DEHBInnIndividualMOOP',
       'DEHBInnIndividualMOOP_TIERS', 'DEHBInnTier1IndividualMOOPA',
       'DEHBInnTier2IndividualMOOPA', 'TEHBInnIndividualMOOP',
       'TEHBInnIndividualMOOP_TIERS', 'TEHBInnTier1IndividualMOOPA',
       'TEHBInnTier2IndividualMOOPA', 'MEHBOutOfNetIndividualMOOP',
       'DEHBOutOfNetIndividualMOOP', 'TEHBOutOfNetIndividualMOOP',
       'MEHBOutOfNetIndividualMOOPA', 'DEHBOutOfNetIndividualMOOPA',
       'TEHBOutOfNetIndividualMOOPA', 'MEHBInnFamilyMOOP',
       'MEHBInnFamilyMOOP_TIERS', 'MEHBInnTier1FamilyMOOPA',
       'MEHBInnTier2FamilyMOOPA', 'DEHBInnFamilyMOOP',
       'DEHBInnFamilyMOOP_TIERS', 'DEHBInnTier1FamilyMOOPA',
       'DEHBInnTier2FamilyMOOPA', 'TEHBInnFamilyMOOP',
       'TEHBInnFamilyMOOP_TIERS', 'TEHBInnTier1FamilyMOOPA',
       'TEHBInnTier2FamilyMOOPA', 'MEHBOutOfNetFamilyMOOP',
 

In [262]:
moop.shape

(84918, 38)

In [264]:
moop_rename_dict = {
    'MEHBInnIndividualMOOP': 'MED_IN_IND_CODE',
    'MEHBInnIndividualMOOP_TIERS': 'MED_IN_IND_TIERS',
    'MEHBInnTier1IndividualMOOPA': 'MED_IN_IND_TIER1_AMOUNT',
    'MEHBInnTier2IndividualMOOPA': 'MED_IN_IND_TIER2_AMOUNT',

    'DEHBInnIndividualMOOP': 'DRUG_IN_IND_CODE',
    'DEHBInnIndividualMOOP_TIERS': 'DRUG_IN_IND_TIERS',
    'DEHBInnTier1IndividualMOOPA': 'DRUG_IN_IND_TIER1_AMOUNT',
    'DEHBInnTier2IndividualMOOPA': 'DRUG_IN_IND_TIER2_AMOUNT',

    'TEHBInnIndividualMOOP': 'TOT_IN_IND_CODE',
    'TEHBInnIndividualMOOP_TIERS': 'TOT_IN_IND_TIERS',
    'TEHBInnTier1IndividualMOOPA': 'TOT_IN_IND_TIER1_AMOUNT',
    'TEHBInnTier2IndividualMOOPA': 'TOT_IN_IND_TIER2_AMOUNT',

    'MEHBOutOfNetIndividualMOOP': 'MED_OUT_IND_CODE',
    'DEHBOutOfNetIndividualMOOP': 'DRUG_OUT_IND_CODE',
    'TEHBOutOfNetIndividualMOOP': 'TOT_OUT_IND_CODE',

    'MEHBOutOfNetIndividualMOOPA': 'MED_OUT_IND_AMOUNT',
    'DEHBOutOfNetIndividualMOOPA': 'DRUG_OUT_IND_AMOUNT',
    'TEHBOutOfNetIndividualMOOPA': 'TOT_OUT_IND_AMOUNT',

    'MEHBInnFamilyMOOP': 'MED_IN_FAM_CODE',
    'MEHBInnFamilyMOOP_TIERS': 'MED_IN_FAM_TIERS',
    'MEHBInnTier1FamilyMOOPA': 'MED_IN_FAM_TIER1_AMOUNT',
    'MEHBInnTier2FamilyMOOPA': 'MED_IN_FAM_TIER2_AMOUNT',

    'DEHBInnFamilyMOOP': 'DRUG_IN_FAM_CODE',
    'DEHBInnFamilyMOOP_TIERS': 'DRUG_IN_FAM_TIERS',
    'DEHBInnTier1FamilyMOOPA': 'DRUG_IN_FAM_TIER1_AMOUNT',
    'DEHBInnTier2FamilyMOOPA': 'DRUG_IN_FAM_TIER2_AMOUNT',

    'TEHBInnFamilyMOOP': 'TOT_IN_FAM_CODE',
    'TEHBInnFamilyMOOP_TIERS': 'TOT_IN_FAM_TIERS',
    'TEHBInnTier1FamilyMOOPA': 'TOT_IN_FAM_TIER1_AMOUNT',
    'TEHBInnTier2FamilyMOOPA': 'TOT_IN_FAM_TIER2_AMOUNT',

    'MEHBOutOfNetFamilyMOOP': 'MED_OUT_FAM_CODE',
    'DEHBOutOfNetFamilyMOOP': 'DRUG_OUT_FAM_CODE',
    'TEHBOutOfNetFamilyMOOP': 'TOT_OUT_FAM_CODE',

    'MEHBOutOfNetFamilyMOOPA': 'MED_OUT_FAM_AMOUNT',
    'DEHBOutOfNetFamilyMOOPA': 'DRUG_OUT_FAM_AMOUNT',
    'TEHBOutOfNetFamilyMOOPA': 'TOT_OUT_FAM_AMOUNT'
}

moop = moop.rename(columns=moop_rename_dict)
moop.to_csv('data/normalized_tables/moop.csv', index=False)
moop.head()

Unnamed: 0,PLANID,AREA,MED_IN_IND_CODE,MED_IN_IND_TIERS,MED_IN_IND_TIER1_AMOUNT,MED_IN_IND_TIER2_AMOUNT,DRUG_IN_IND_CODE,DRUG_IN_IND_TIERS,DRUG_IN_IND_TIER1_AMOUNT,DRUG_IN_IND_TIER2_AMOUNT,...,TOT_IN_FAM_CODE,TOT_IN_FAM_TIERS,TOT_IN_FAM_TIER1_AMOUNT,TOT_IN_FAM_TIER2_AMOUNT,MED_OUT_FAM_CODE,DRUG_OUT_FAM_CODE,TOT_OUT_FAM_CODE,MED_OUT_FAM_AMOUNT,DRUG_OUT_FAM_AMOUNT,TOT_OUT_FAM_AMOUNT
0,73836AK0950001,AK01,0,,,,0,,,,...,1,1.0,16000.0,,0,0,1,,,54600.0
1,73836AK0950001,AK02,0,,,,0,,,,...,1,1.0,16000.0,,0,0,1,,,54600.0
2,73836AK0930001,AK02,0,,,,0,,,,...,1,1.0,12000.0,,0,0,1,,,36000.0
3,73836AK0950001,AK03,0,,,,0,,,,...,1,1.0,16000.0,,0,0,1,,,54600.0
4,73836AK0930001,AK03,0,,,,0,,,,...,1,1.0,12000.0,,0,0,1,,,36000.0
