# 3. Join Rate and Benefits

In this notebook, I want to import my cleaned **rate.pkl** file and do the following:
- drop unneeded columns
- filter to the most recent year available (2016)
- merge my `benefits` dataframe into my `rate` dataframe on `'StandardComponentId'`, which is the plan identifier.

## Load Files

In [1]:
import pandas as pd
import numpy as np

import pickle
import regex as re

import matplotlib.pyplot as plt
%matplotlib inline

Load a few files:
- rate.pkl
- benefits_dum.pkl
- crosswalk2.csv

In [2]:
# Use a separate name for the file handle so it is not shadowed by the dataframe
with open('../pickles/rate.pkl', 'rb') as f:
    rate = pickle.load(f)

In [53]:
rate_cols = list(rate.columns)

In [3]:
rate.shape

(12694445, 24)

In [4]:
with open('../pickles/benefits_dum.pkl', 'rb') as benefits_dum:
    benefits = pickle.load(benefits_dum)

In [5]:
benefits.shape

(413907, 229)

In [6]:
crosswalk = pd.read_csv('../data/crosswalk2.csv')

In [7]:
crosswalk.shape

(412, 2)
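
For the two pickled dataframes, `pd.read_pickle` is a shorter alternative to opening the file and calling `pickle.load`. A minimal sketch on a made-up frame; the temporary file name `toy_rate.pkl` is hypothetical and only stands in for the real pickles:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the real rate data (values are made up).
toy = pd.DataFrame({"PlanId": ["A1", "B2"], "IndividualRate": [43.0, 50.67]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_rate.pkl")  # hypothetical file name
    toy.to_pickle(path)            # write the frame as a pickle
    loaded = pd.read_pickle(path)  # read it back in one call

print(loaded.equals(toy))
```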

## Cleaning

Drop unneeded columns:

In [8]:
rate.drop(columns=['IssuerId', 
                   'SourceName',
                   'VersionNum',
                   'ImportDate',
                   'IssuerId2',
                   'FederalTIN'], inplace=True)

Filter by year:

In [9]:
rate = rate[rate['BusinessYear'] == 2016]
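
The drop-then-filter step can be sketched on a toy frame. This is an illustration under assumptions, not the notebook's data: the values are made up, and the year is taken from `max()` rather than hard-coded. The trailing `.copy()` is optional, but it makes the filtered frame independent of the original, which avoids chained-assignment warnings if it is modified later.

```python
import pandas as pd

# Toy stand-in for `rate` (made-up values).
toy_rate = pd.DataFrame({
    "BusinessYear": [2014, 2015, 2016, 2016],
    "PlanId": ["P1", "P1", "P1", "P2"],
    "IssuerId": [1, 1, 1, 2],
    "IndividualRate": [40.0, 41.5, 43.0, 61.0],
})

# Drop a column that is not needed, returning a new frame.
trimmed = toy_rate.drop(columns=["IssuerId"])

# Keep only the most recent year present in the data.
latest_year = trimmed["BusinessYear"].max()
recent = trimmed[trimmed["BusinessYear"] == latest_year].copy()

print(latest_year, recent.shape)
```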

In [10]:
rate.head()

| | BusinessYear | StateCode | RateEffectiveDate | RateExpirationDate | PlanId | RatingAreaId | Tobacco | Age | IndividualRate | IndividualTobaccoRate | Couple | PrimarySubscriberAndOneDependent | PrimarySubscriberAndTwoDependents | PrimarySubscriberAndThreeOrMoreDependents | CoupleAndOneDependent | CoupleAndTwoDependents | CoupleAndThreeOrMoreDependents | RowNumber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8472480 | 2016 | AK | 2016-01-01 | 2016-12-31 | 21989AK0030001 | Rating Area 1 | No Preference | 0-20 | 43.0 | | | | | | | | | 14 |
| 8472481 | 2016 | AK | 2016-01-01 | 2016-03-31 | 21989AK0080001 | Rating Area 1 | No Preference | Family Option | 50.67 | | 100.33 | 116.04 | 116.04 | 116.04 | 117.86 | 117.86 | 117.86 | 14 |
| 8472482 | 2016 | AK | 2016-04-01 | 2016-06-30 | 21989AK0080001 | Rating Area 1 | No Preference | Family Option | 51.36 | | 101.7 | 117.62 | 117.62 | 117.62 | 180.28 | 180.28 | 180.28 | 14 |
| 8472483 | 2016 | AK | 2016-01-01 | 2016-03-31 | 21989AK0090001 | Rating Area 1 | No Preference | 0-20 | 61.0 | | | | | | | | | 14 |
| 8472484 | 2016 | AK | 2016-04-01 | 2016-06-30 | 21989AK0090001 | Rating Area 1 | No Preference | 0-20 | 62.0 | | | | | | | | | 14 |


1. Drop `'PlanId'` from `benefits`
2. Rename `'StandardComponentId'` to `'PlanId'`

In [11]:
benefits.drop(columns=['PlanId'], inplace=True)

In [12]:
benefits = benefits.rename(columns={'StandardComponentId': 'PlanId'}) 
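
The order of these two steps matters: renaming first would leave two columns both named `'PlanId'`. A small sketch of the same drop-then-rename pattern, assuming (as in the `PlanAttributes` file used later) that the original `'PlanId'` carries a variant suffix while `'StandardComponentId'` is the base plan ID. All values here are made up:

```python
import pandas as pd

# Toy stand-in for `benefits` (made-up IDs and benefit names).
toy_benefits = pd.DataFrame({
    "PlanId": ["P1-00", "P1-01", "P2-00"],      # variant-level IDs
    "StandardComponentId": ["P1", "P1", "P2"],  # base plan IDs
    "BenefitName": ["X", "X", "Y"],
})

# Drop the variant-level ID first, then let the base ID take its name.
renamed = (
    toy_benefits
    .drop(columns=["PlanId"])
    .rename(columns={"StandardComponentId": "PlanId"})
)

print(list(renamed.columns))
```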

In [13]:
rate.shape

(4221965, 18)

In [14]:
benefits.shape

(413907, 228)

## Filtering

Create a filter to view only the dummied columns in `benefits`:

In [15]:
cols = [x for x in crosswalk['Crosswalk'].unique()]

In [16]:
cols.remove('delete') 

In [17]:
cols.insert(0, 'BenefitName')
cols.insert(1, 'PlanId')

In [18]:
cols_dum = [x for x in cols if x not in ('BenefitName', 'PlanId')]

In [19]:
benefits[cols_dum].head()

| | Dental Care, Basic - Child | Dental Care, Major - Child | Orthodontia - Child | Dental, Accidental - Adult | Dental Care, Basic - Adult | Dental Care, Major - Adult | Dental Care, Routine - Adult | Orthodontia - Adult | Delivery and All Inpatient Services for Maternity Care | Durable Medical Equipment | ... | Endodontics - Adult | Habilitation - Acquired Brain Injury | Dental Cleanings - Adult | Surgical Extraction - Adult | Surgical Extraction - Child | Cosmetic Orthodontia | Renal Dialysis, End Stage | Post-cochlear implant aural therapy | X-Rays and Exams - Adult | Dental Care, Minor - Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
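
The column filter above can be built in one pass. A sketch on a toy crosswalk: the `'Original'` column name and all benefit labels are assumptions for illustration, since only the `'Crosswalk'` column is named in this notebook.

```python
import pandas as pd

# Toy crosswalk: raw benefit names mapped to cleaned labels,
# with 'delete' marking rows to discard (made-up values).
toy_crosswalk = pd.DataFrame({
    "Original": [  # hypothetical column name
        "Basic Dental Care - Child",
        "Routine Dental - Adult",
        "Unused benefit",
        "Dental Check-Up - Adult",
    ],
    "Crosswalk": [
        "Dental Care, Basic - Child",
        "Dental Care, Routine - Adult",
        "delete",
        "Dental Care, Routine - Adult",
    ],
})

# Unique cleaned labels, in order of first appearance, minus the sentinel.
dummy_cols = [c for c in toy_crosswalk["Crosswalk"].unique() if c != "delete"]

# Key columns first, then the dummied benefit columns.
key_cols = ["BenefitName", "PlanId"]
all_cols = key_cols + dummy_cols

print(all_cols)
```

Filtering with a comprehension instead of `list.remove('delete')` also avoids a `ValueError` if the sentinel happens to be absent.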


## Merge `benefits` and `rate` on `'PlanId'`

In [20]:
len(rate['PlanId'].unique())

8887

In [21]:
len(benefits['PlanId'].unique())

8398

In [22]:
len(set(rate['PlanId']) - set(benefits['PlanId']))

489
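
Since 8887 - 8398 = 489, these counts suggest that every plan ID in `benefits` also appears in `rate`, and that 489 rate plans have no benefits rows. A minimal sketch of checking both directions on made-up IDs:

```python
import pandas as pd

# Toy key columns (made-up plan IDs).
rate_ids = pd.Series(["P1", "P1", "P2", "P3"], name="PlanId")
benefit_ids = pd.Series(["P1", "P2", "P2"], name="PlanId")

# Plans present on one side only.
only_in_rate = set(rate_ids) - set(benefit_ids)
only_in_benefits = set(benefit_ids) - set(rate_ids)

# Number of rate rows (not plans) that would find no match in a merge.
n_unmatched_rows = int((~rate_ids.isin(benefit_ids)).sum())

print(only_in_rate, only_in_benefits, n_unmatched_rows)
```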

Group `benefits` by `'PlanId'` and sum the dummy columns so that all features of each plan sit on one row. Then merge the result into `rate` on `'PlanId'`.

In [23]:
benefits_planid = benefits[cols].drop('BenefitName', axis=1).groupby('PlanId').sum()
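
To make the aggregation concrete, here is a sketch on a toy frame with made-up benefit rows. Note that `sum()` counts how many benefit rows map to each dummy, so a value can exceed 1; `max()` is shown as an alternative if a strict 0/1 flag is wanted. Which one suits the analysis is a judgment call, not something this notebook settles.

```python
import pandas as pd

# Toy stand-in for the dummied `benefits` frame (made-up values).
toy = pd.DataFrame({
    "PlanId": ["P1", "P1", "P2", "P1"],
    "BenefitName": ["a", "b", "a", "a2"],
    "Dental": [1, 0, 1, 1],
    "Vision": [0, 1, 0, 0],
})

dummies = toy.drop(columns="BenefitName")

# One row per plan: counts of benefit rows per category.
per_plan_sum = dummies.groupby("PlanId").sum()

# One row per plan: 1 if the plan has at least one benefit in the category.
per_plan_flag = dummies.groupby("PlanId").max()

print(per_plan_sum)
print(per_plan_flag)
```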

In [24]:
ratebenefits = pd.merge(rate, benefits_planid, on='PlanId', how='outer')

In [25]:
ratebenefits.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4221965 entries, 0 to 4221964
Columns: 225 entries, BusinessYear to Dental Care, Minor - Adult
dtypes: category(6), float64(9), int64(2), object(1), uint8(207)
memory usage: 1.2+ GB
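
The merged frame has the same number of rows as `rate` (4,221,965), which is what a left merge would also give when every benefits plan appears in `rate`. A hedged sketch of how the `indicator` and `validate` options of `pd.merge` can confirm this kind of thing, on made-up data:

```python
import pandas as pd

# Toy stand-ins: many rate rows per plan, one benefits row per plan.
rate_toy = pd.DataFrame({
    "PlanId": ["P1", "P1", "P3"],
    "IndividualRate": [43.0, 50.67, 61.0],
})
benefits_toy = pd.DataFrame({"PlanId": ["P1", "P2"], "Dental": [2, 1]})

# indicator=True adds a `_merge` column saying where each row came from.
merged = pd.merge(rate_toy, benefits_toy, on="PlanId", how="outer", indicator=True)
counts = merged["_merge"].value_counts()

# validate raises MergeError if the right side has duplicate keys.
checked = pd.merge(rate_toy, benefits_toy, on="PlanId", how="left",
                   validate="many_to_one")

print(counts)
print(len(checked) == len(rate_toy))
```

A `right_only` count above zero would mean the outer merge added rows for plans that have benefits but no rates.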


In [26]:
obj_cols = ratebenefits.select_dtypes(['object']).columns
ratebenefits[obj_cols] = ratebenefits[obj_cols].astype('category')

In [27]:
ratebenefits.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4221965 entries, 0 to 4221964
Columns: 225 entries, BusinessYear to Dental Care, Minor - Adult
dtypes: category(7), float64(9), int64(2), uint8(207)
memory usage: 1.2 GB
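
The `+` in the first `info()` output means pandas did not fully count the memory held by object columns; once they are categorical, the figure is exact. A small sketch of measuring the effect with `memory_usage(deep=True)`, using made-up repeated state codes:

```python
import pandas as pd

# Toy frame with a low-cardinality object column (made-up values).
toy = pd.DataFrame({
    "StateCode": pd.Series(["AK", "AL", "AK", "AZ"] * 1000, dtype="object"),
    "IndividualRate": [43.0, 50.67, 61.0, 62.0] * 1000,
})

# deep=True measures the actual string storage, not just the pointers.
before = toy.memory_usage(deep=True).sum()

# Convert every object column to category.
obj_cols = toy.select_dtypes(["object"]).columns
toy = toy.astype({col: "category" for col in obj_cols})

after = toy.memory_usage(deep=True).sum()

print(before, after)
```

The saving depends on cardinality: a column with few distinct values shrinks a lot, while a nearly unique column may not shrink at all.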


## Load `PlanAttributes`

In [28]:
plan = pd.read_csv('../data/PlanAttributes.csv')

In [29]:
obj_cols = plan.select_dtypes(['object']).columns
plan[obj_cols] = plan[obj_cols].astype('category')

In [30]:
plan.shape

(77353, 176)

In [31]:
plan = plan[plan.PlanId.str.contains('-00')]

In [32]:
plan = plan[plan['BusinessYear'] == 2016]

In [33]:
plan.shape

(8398, 176)
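
The `'-00'` filter keeps one row per base plan, which is why the row count drops to 8398. A sketch of the same idea, assuming the variant suffix sits at the end of `'PlanId'`; under that assumption `str.endswith` is slightly stricter than `str.contains`, which would also match `'-00'` anywhere inside the ID. The IDs below are illustrative:

```python
import pandas as pd

# Toy stand-in for `plan`: several variants per base plan.
toy_plan = pd.DataFrame({
    "PlanId": [
        "21989AK0030001-00",
        "21989AK0030001-01",
        "21989AK0080001-00",
    ],
    "StandardComponentId": [
        "21989AK0030001",
        "21989AK0030001",
        "21989AK0080001",
    ],
})

# Keep only the base variant of each plan.
base = toy_plan[toy_plan["PlanId"].str.endswith("-00")]

# After filtering, the base plan ID should be unique per row.
print(len(base), base["StandardComponentId"].is_unique)
```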

In [34]:
plan.drop(columns=['PlanId'], inplace=True)

In [35]:
plan = plan.rename(columns={'StandardComponentId': 'PlanId'}) 

In [36]:
plan.shape

(8398, 175)

In [37]:
plan_cols = ['PlanId',
             'IsNoticeRequiredForPregnancy', 
             'IsReferralRequiredForSpecialist', 
             'ChildOnlyOffering', 
             'WellnessProgramOffered', 
             'DiseaseManagementProgramsOffered', 
             'OutOfCountryCoverage', 
             'NationalNetwork']

In [38]:
len(plan[plan_cols])

8398

In [39]:
len(plan['PlanId'].unique())

8398

In [40]:
plan[plan_cols].head()

| | PlanId | IsNoticeRequiredForPregnancy | IsReferralRequiredForSpecialist | ChildOnlyOffering | WellnessProgramOffered | DiseaseManagementProgramsOffered | OutOfCountryCoverage | NationalNetwork |
|---|---|---|---|---|---|---|---|---|
| 49972 | 21989AK0030001 | | | Allows Adult and Child-Only | | | No | Yes |
| 49973 | 21989AK0080001 | | | Allows Adult and Child-Only | | | No | Yes |
| 49975 | 21989AK0050001 | | | Allows Adult and Child-Only | | | No | Yes |
| 49976 | 21989AK0080002 | | | Allows Adult and Child-Only | | | No | Yes |
| 49978 | 21989AK0050002 | | | Allows Adult and Child-Only | | | No | Yes |


## Merge `ratebenefits` and `plan` on `'PlanId'`

In [45]:
df = pd.merge(ratebenefits, plan[plan_cols], on='PlanId', how='outer')

In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4221965 entries, 0 to 4221964
Columns: 232 entries, BusinessYear to NationalNetwork
dtypes: category(13), float64(9), int64(2), object(1), uint8(207)
memory usage: 1.3+ GB
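
The `object(1)` column is back after this merge. A likely explanation, though I have not verified it on this data, is that `'PlanId'` was categorical in both frames but with different category sets, in which case pandas falls back to a non-categorical key. A minimal sketch of re-casting the key after such a merge, on made-up frames:

```python
import pandas as pd

# Two toy frames whose categorical keys have different category sets.
left = pd.DataFrame({"PlanId": pd.Categorical(["P1", "P2"]), "x": [1, 2]})
right = pd.DataFrame({"PlanId": pd.Categorical(["P1", "P3"]), "y": [10, 30]})

merged = left.merge(right, on="PlanId", how="left")
print(merged["PlanId"].dtype)  # dtype of the key straight after the merge

# Re-cast the key so the merged frame stays compact.
merged["PlanId"] = merged["PlanId"].astype("category")
print(merged["PlanId"].dtype)
```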
