# Cleaning `benefits`

In this notebook, I unpickle the `benefits` data saved by the previous notebook and use pandas to clean it further. I use the following libraries:

1. `pandas` - a library for manipulating tabular data
2. `pickle` - a standard-library module for serializing Python objects so they can be reloaded later

In [1]:
import pandas as pd
import pickle

In [2]:
# Use a separate name for the file handle so it does not shadow the DataFrame.
with open('../pickles/benefits.pkl', 'rb') as f:
    benefits = pickle.load(f)
benefits.shape

(5048408, 32)

Below, I dropped every `'PlanId'` except the '-00' variants. The suffix only identifies a plan's cost-sharing reduction variant, and the variants all encode the same benefits information, so the other suffixes are redundant.

In [3]:
benefits = benefits[benefits.PlanId.str.contains('-00')]
benefits.shape

(1195765, 32)
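
Note that `str.contains('-00')` matches the substring anywhere in the ID and treats the pattern as a regular expression by default. A stricter variant, sketched here on made-up plan IDs (hypothetical values, not taken from the dataset), anchors the match to the end of the string with `str.endswith`:

```python
import pandas as pd

# Hypothetical plan IDs for illustration only; not taken from the dataset.
plans = pd.DataFrame({
    'PlanId': ['12345AK0010001-00', '12345AK0010001-01', '12345AK-000001-02'],
})

# contains() matches '-00' anywhere in the string.
loose = plans[plans['PlanId'].str.contains('-00')]

# endswith() keeps only IDs whose suffix is exactly '-00'.
strict = plans[plans['PlanId'].str.endswith('-00')]

print(len(loose), len(strict))
```

If every real ID follows the same layout, the two filters return the same rows; the stricter form only matters when '-00' can appear elsewhere in an ID.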

Below, I dropped duplicative or unneeded columns.

In [4]:
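# Reassign instead of using inplace=True: `benefits` is a filtered slice,
# and an in-place drop on it can raise SettingWithCopyWarning.
benefits = benefits.drop(columns=['EHBVarReason',
                                  'Exclusions',
                                  'Explanation',
                                  'ImportDate',
                                  'IsEHB',
                                  'IsStateMandate',
                                  'VersionNum',
                                  'StateCode2',
                                  'IssuerId',
                                  'IssuerId2',
                                  'SourceName'])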

Below, I dropped rows with a missing `'IsCovered'` value, treating them as benefits that the health insurance plan does not cover.

In [5]:
benefits = benefits.dropna(subset=['IsCovered'])
benefits.shape

(1138140, 21)
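
`dropna(subset=[...])` removes a row only when the named column is missing; missing values in other columns do not matter. A minimal sketch on toy data (the benefit names, the `Copay` column, and all values are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy rows for illustration; `Copay` is a hypothetical column.
toy = pd.DataFrame({
    'BenefitName': ['Dental Check-Up', 'Acupuncture', 'Hearing Aids'],
    'IsCovered': ['Covered', np.nan, 'Covered'],
    'Copay': [np.nan, np.nan, '$20'],
})

# Only the row with a missing IsCovered is removed; the missing Copay
# in the first row does not cause that row to be dropped.
kept = toy.dropna(subset=['IsCovered'])
print(kept['BenefitName'].tolist())
```

If the column also held an explicit negative value rather than a blank, those rows would survive `dropna` and would need a separate filter.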

I am focusing on a single plan year, 2016, so I dropped plans from every other year.

In [6]:
benefits = benefits[benefits['BusinessYear'] == 2016]
benefits.shape

(425215, 21)

Below, I removed "junk" characters from `'BenefitName'`: the stray characters `\x93` and `\x94`, and tabs.

In [7]:
# Strip all three junk characters in a single vectorized pass.
benefits['BenefitName'] = benefits['BenefitName'].str.replace(r'[\x93\x94\t]', '', regex=True)

In [8]:
benefits.shape

(425215, 21)
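
For context, `0x93` and `0x94` are the bytes that Windows-1252 uses for the left and right curly double quotes. That suggests the source file was written in Windows-1252 and read with a different encoding, though this is an assumption, since the notebook does not state the original encoding. A small sketch of the relationship, with a made-up benefit name:

```python
# In Windows-1252, bytes 0x93 and 0x94 decode to curly double quotes.
left = b'\x93'.decode('cp1252')
right = b'\x94'.decode('cp1252')

# Made-up benefit name containing the three junk characters.
raw = '\x93Dental\x94 Check-Up\t'

# str.translate with a deletion table removes every listed character.
junk = str.maketrans('', '', '\x93\x94\t')
cleaned = raw.translate(junk)
print(repr(cleaned))
```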

## Crosswalk

The `benefits` DataFrame has a column called `'BenefitName'`, which names each benefit offered by each health insurance plan. Many entries refer to the same benefit but differ because of misspellings or extraneous characters.

I created a `crosswalk` CSV in which I mapped `'BenefitName'` variants to a single name. For example, these three benefits:

- Cosmetic Orthodontia-Child
- Cosmetic Orthodontics - Child
- Cosmetic Orthodontics-Child

were collapsed into a single renamed benefit:

- Orthodontia, Cosmetic - Child

I also created a dummy column for each benefit in the crosswalk. In a later notebook, I merge these with the `rate` details, which makes it possible to analyze each plan in the rate file alongside its benefits.

In [9]:
crosswalk = pd.read_csv('../data/crosswalk2.csv')

In [10]:
benefits = pd.merge(benefits, crosswalk, on='BenefitName')

In [11]:
dummies = pd.get_dummies(benefits['Crosswalk'])

In [12]:
benefits.shape

(425215, 22)
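
`pd.merge` performs an inner join by default, so a `'BenefitName'` missing from the crosswalk would silently lose its rows. The row count is unchanged here (425,215 before and after), which is consistent with every name matching exactly one crosswalk entry. A sketch of making that check explicit, using the `how`, `indicator`, and `validate` parameters of `pd.merge` on toy frames (the rows, including `Unmapped Benefit`, are made up for illustration):

```python
import pandas as pd

# Toy frames for illustration only.
toy_benefits = pd.DataFrame({
    'BenefitName': ['Cosmetic Orthodontia-Child',
                    'Cosmetic Orthodontics - Child',
                    'Unmapped Benefit'],
})
toy_crosswalk = pd.DataFrame({
    'BenefitName': ['Cosmetic Orthodontia-Child',
                    'Cosmetic Orthodontics - Child',
                    'Cosmetic Orthodontics-Child'],
    'Crosswalk': ['Orthodontia, Cosmetic - Child'] * 3,
})

# how='left' keeps unmatched rows, indicator=True adds a `_merge` column,
# and validate='many_to_one' raises if a crosswalk key is duplicated.
checked = pd.merge(toy_benefits, toy_crosswalk, on='BenefitName',
                   how='left', indicator=True, validate='many_to_one')

unmatched = checked.loc[checked['_merge'] == 'left_only', 'BenefitName']
print(unmatched.tolist())
```

An empty `unmatched` list on the real data would confirm that the crosswalk covers every benefit name.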

In [13]:
benefits = pd.concat([benefits, dummies], axis=1)

In [14]:
benefits = benefits[benefits['delete'] != 1]

In [15]:
benefits = benefits.drop('delete', axis=1)

In [16]:
benefits.shape

(413907, 229)
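
The last few cells can be sketched end to end on toy data. The shapes above suggest that `delete` is itself one of the dummy columns, that is, a crosswalk value marking names to discard, because the merge adds only one column to `benefits`. This is an inference from the shapes, not something the notebook states, and the rows below are made up:

```python
import pandas as pd

# Toy rows for illustration; 'delete' is assumed to be a Crosswalk value.
toy = pd.DataFrame({
    'PlanId': ['A-00', 'A-00', 'B-00'],
    'Crosswalk': ['Orthodontia, Cosmetic - Child', 'delete', 'Acupuncture'],
})

# One indicator column per distinct Crosswalk value, including 'delete'.
dummies = pd.get_dummies(toy['Crosswalk'])
toy = pd.concat([toy, dummies], axis=1)

# Drop the rows flagged for deletion, then the flag column itself.
toy = toy[toy['delete'] != 1].drop('delete', axis=1)
print(toy.shape)
```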

In [17]:
# with open('../pickles/benefits_dum.pkl', 'wb') as file:
#     pickle.dump(benefits, file)