# 3. Join Rate and Benefits

In this notebook, I import my cleaned **rate.pkl** and **benefits_dum.pkl** files and do the following:
- drop certain unneeded columns
- filter by the most recent year available (2016)
- merge my `benefits` dataframe into my `rate` dataframe on `'StandardComponentId'` (which will be renamed `'PlanId'`), the plan identifier.

## Load Files

In [24]:
import pandas as pd
import numpy as np

import pickle
import regex as re

import matplotlib.pyplot as plt
%matplotlib inline

Load a few files:
- rate.pkl
- benefits_dum.pkl
- crosswalk2.csv

In [25]:
# Use a separate name for the file handle so it is not shadowed by the dataframe
with open('../pickles/rate.pkl', 'rb') as f:
    rate = pickle.load(f)
rate.shape

(12694445, 24)

In [27]:
with open('../pickles/benefits_dum.pkl', 'rb') as benefits_dum:
    benefits = pickle.load(benefits_dum)
benefits.shape

(413907, 229)

In [28]:
crosswalk = pd.read_csv('../data/crosswalk2.csv')
crosswalk.shape

(412, 2)

## Cleaning `rate`

Drop unneeded columns:

In [29]:
rate.drop(columns=['IssuerId', 
                   'SourceName',
                   'VersionNum',
                   'ImportDate',
                   'IssuerId2',
                   'FederalTIN'], inplace=True)

Filter by year:

In [30]:
rate = rate[rate['BusinessYear'] == 2016]

In [56]:
rate_cols = list(rate.columns)

In [31]:
rate.shape

(4221965, 18)
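
The year is hard-coded as 2016 above. As a minimal sketch of the same two steps on a toy frame (the rows and values here are made up for illustration, not taken from `rate`), the most recent year can also be read from the data:

```python
import pandas as pd

# Toy stand-in for `rate`; values are invented for illustration only.
rate_toy = pd.DataFrame({
    'BusinessYear': [2014, 2015, 2016, 2016],
    'PlanId': ['P1', 'P1', 'P1', 'P2'],
    'SourceName': ['HIOS', 'HIOS', 'HIOS', 'HIOS'],
})

# Drop a column that is not needed downstream.
rate_toy = rate_toy.drop(columns=['SourceName'])

# Derive the most recent year from the data instead of hard-coding it.
latest_year = rate_toy['BusinessYear'].max()

# .copy() avoids chained-assignment warnings if the subset is modified later.
rate_latest = rate_toy[rate_toy['BusinessYear'] == latest_year].copy()
```

Hard-coding 2016 keeps the notebook reproducible if newer data is added later; deriving the year is only useful if "most recent" is what is really wanted.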

## Cleaning `benefits`

1. Drop `'PlanId'` from `benefits`
2. Rename `'StandardComponentId'` to `'PlanId'`
3. Cast `'PlanId'` as an object (not as a category)

In [32]:
benefits.drop(columns=['PlanId'], inplace=True)
benefits = benefits.rename(columns={'StandardComponentId': 'PlanId'}) 
benefits.PlanId = benefits.PlanId.astype('object')

In [33]:
benefits.shape

(413907, 228)

## Filter to view dummy columns only in `benefits`

Create a filter to view only the dummied columns in `benefits`:

In [34]:
ben_cols = [x for x in crosswalk['Crosswalk'].unique()]

ben_cols.remove('delete') 

ben_cols.insert(0, 'BenefitName')
ben_cols.insert(1, 'PlanId')

In [35]:
ben_cols_dum = [x for x in ben_cols if x not in ('BenefitName', 'PlanId')]

In [36]:
benefits[ben_cols_dum].head()

Unnamed: 0,"Dental Care, Basic - Child","Dental Care, Major - Child",Orthodontia - Child,"Dental, Accidental - Adult","Dental Care, Basic - Adult","Dental Care, Major - Adult","Dental Care, Routine - Adult",Orthodontia - Adult,Delivery and All Inpatient Services for Maternity Care,Durable Medical Equipment,...,Endodontics - Adult,Habilitation - Acquired Brain Injury,Dental Cleanings - Adult,Surgical Extraction - Adult,Surgical Extraction - Child,Cosmetic Orthodontia,"Renal Dialysis, End Stage",Post-cochlear implant aural therapy,X-Rays and Exams - Adult,"Dental Care, Minor - Adult"
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Merge `benefits` and `rate` on `'PlanId'`

1. Take `benefits` and group by `'PlanId'` to get all features of each plan on one row; call the temporary dataframe `benefits_planid`.  
    a. **Sum** the dummy columns in `benefits` within each group, then replace any value greater than 1 with 1, so that every column remains a 0/1 indicator.
2. Then, merge `benefits_planid` into `rate` on `'PlanId'`.

In [37]:
benefits_planid = benefits[ben_cols].drop('BenefitName', axis=1).groupby('PlanId').sum()
benefits_planid[benefits_planid > 1] = 1
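
To see what the sum-then-cap step does, here is a small sketch on a toy frame (plan ids and column names are made up). It also shows that, for 0/1 dummy columns, taking the per-plan maximum appears to give the same result in one step:

```python
import pandas as pd

# Toy stand-in for `benefits[ben_cols]`; ids and columns are invented.
toy = pd.DataFrame({
    'PlanId': ['A', 'A', 'B', 'B', 'B'],
    'Dental': [1, 1, 0, 0, 0],
    'Vision': [0, 1, 1, 0, 1],
})

# Approach used above: sum the dummies per plan, then cap values at 1.
summed = toy.groupby('PlanId').sum()
summed[summed > 1] = 1

# Alternative: the maximum of a 0/1 column within a group is already 0 or 1.
maxed = toy.groupby('PlanId').max()

# Both give one row per plan with 1 wherever any benefit row had a 1.
same = bool((summed.values == maxed.values).all())
```

The equivalence only holds when every dummy column contains nothing but 0 and 1.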

In [38]:
ratebenefits = pd.merge(rate, benefits_planid, on='PlanId', how='inner')
# Cast all object (string) columns to category to reduce memory usage
obj_cols = ratebenefits.select_dtypes(['object']).columns
ratebenefits[obj_cols] = ratebenefits[obj_cols].astype('category')

In [39]:
ratebenefits.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3977375 entries, 0 to 3977374
Columns: 225 entries, BusinessYear to Dental Care, Minor - Adult
dtypes: category(7), float64(9), int64(2), uint8(207)
memory usage: 1.2 GB
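
Casting object columns to `category` matters at this size because a column with few distinct strings is stored as small integer codes plus one copy of each string. A rough sketch on made-up data (the plan ids are hypothetical) compares the two representations:

```python
import pandas as pd

# 30,000 rows but only three distinct values, loosely mimicking a repeated id column.
plan_ids = pd.Series(['PLAN_A', 'PLAN_B', 'PLAN_C'] * 10_000)

as_object = plan_ids.astype('object')
as_category = plan_ids.astype('category')

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, so this is mainly useful for low-cardinality columns.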


In [40]:
ratebenefits.shape

(3977375, 225)

The inner merge dropped 244,590 rows from `rate` (4,221,965 rows before, 3,977,375 after) because `benefits_planid` did not contain every `'PlanId'` that appears in `rate`. This is presumably related to `rate` having been filtered to 2016 earlier.
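
One way to check which `rate` rows have no match is a left merge with `indicator=True`. The sketch below uses toy frames with made-up ids and a hypothetical `Rate` column, not the real data:

```python
import pandas as pd

# Toy stand-ins; plan 'C' has rates but no benefits row.
rate_toy = pd.DataFrame({
    'PlanId': ['A', 'A', 'B', 'C'],
    'Rate': [10.0, 12.0, 20.0, 30.0],
})
benefits_toy = pd.DataFrame({
    'PlanId': ['A', 'B'],
    'Dental': [1, 0],
})

# A left merge keeps every rate row; `_merge` records where each row came from.
checked = pd.merge(rate_toy, benefits_toy, on='PlanId', how='left', indicator=True)

# Rows marked 'left_only' are the ones an inner merge would drop.
unmatched = checked[checked['_merge'] == 'left_only']
n_dropped = len(unmatched)
missing_ids = sorted(unmatched['PlanId'].unique())
```

Inspecting `missing_ids` on the real frames would show whether the unmatched plans share a pattern such as a particular issuer or state.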

## Load and clean `PlanAttributes.csv`

In [41]:
attributes = pd.read_csv('../data/PlanAttributes.csv')


`attributes` shape before cleaning:

In [42]:
attributes.shape

(77353, 176)

In [43]:
attributes = attributes[attributes.PlanId.str.contains('-00')]
attributes = attributes[attributes['BusinessYear'] == 2016]

`attributes` shape after cleaning:

In [44]:
attributes.shape

(8398, 176)

Cast the object columns in `attributes` to category, drop the `'PlanId'` column, rename `'StandardComponentId'` to `'PlanId'`, and cast the new `'PlanId'` back to an object:

In [45]:
attributes[attributes.select_dtypes(['object']).columns] = attributes.select_dtypes(['object']).apply(lambda x: x.astype('category'))
attributes.drop(columns=['PlanId'], inplace=True)
attributes = attributes.rename(columns={'StandardComponentId': 'PlanId'}) 
attributes.PlanId = attributes.PlanId.astype('object')

`attributes` shape after dropping:

In [46]:
attributes.shape

(8398, 175)

Isolate columns that are from `attributes`:

In [47]:
attributes_cols = ['PlanId',
             'IsNoticeRequiredForPregnancy', 
             'IsReferralRequiredForSpecialist', 
             'ChildOnlyOffering', 
             'WellnessProgramOffered', 
             'DiseaseManagementProgramsOffered', 
             'OutOfCountryCoverage', 
             'NationalNetwork']

In [48]:
attributes[attributes_cols].head()

Unnamed: 0,PlanId,IsNoticeRequiredForPregnancy,IsReferralRequiredForSpecialist,ChildOnlyOffering,WellnessProgramOffered,DiseaseManagementProgramsOffered,OutOfCountryCoverage,NationalNetwork
49972,21989AK0030001,,,Allows Adult and Child-Only,,,No,Yes
49973,21989AK0080001,,,Allows Adult and Child-Only,,,No,Yes
49975,21989AK0050001,,,Allows Adult and Child-Only,,,No,Yes
49976,21989AK0080002,,,Allows Adult and Child-Only,,,No,Yes
49978,21989AK0050002,,,Allows Adult and Child-Only,,,No,Yes


## Merge `ratebenefits` and `attributes[attributes_cols]` on `'PlanId'` to create `df`

In [49]:
df = pd.merge(ratebenefits, attributes[attributes_cols], on='PlanId', how='outer')

In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3977375 entries, 0 to 3977374
Columns: 232 entries, BusinessYear to NationalNetwork
dtypes: category(13), float64(9), int64(2), object(1), uint8(207)
memory usage: 1.2+ GB
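
The row count here matches `ratebenefits` (3,977,375), which suggests the outer merge added no attribute-only plans and that each `'PlanId'` appears at most once in `attributes`. A sketch on toy frames (ids and the `Rate` column are made up) shows how `how='outer'` and `how='left'` can differ, and how `validate` guards against duplicated keys on the right:

```python
import pandas as pd

left = pd.DataFrame({
    'PlanId': ['A', 'A', 'B'],
    'Rate': [10.0, 12.0, 20.0],
})
# Plan 'D' exists only in the attributes-like frame.
right = pd.DataFrame({
    'PlanId': ['A', 'B', 'D'],
    'NationalNetwork': ['Yes', 'No', 'Yes'],
})

# validate='many_to_one' raises MergeError if 'PlanId' repeats in `right`,
# which would otherwise silently multiply rows.
outer = pd.merge(left, right, on='PlanId', how='outer', validate='many_to_one')
left_join = pd.merge(left, right, on='PlanId', how='left', validate='many_to_one')

# The outer merge keeps plan 'D' as a row with a missing Rate; the left merge does not.
extra_rows = len(outer) - len(left_join)
```

If only plans that have rates are wanted, `how='left'` states that intent more directly than `how='outer'`.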


In [54]:
with open('../pickles/df.pkl', 'wb') as file:
    pickle.dump(df, file)

In [29]:
# benefits cols
# ben_cols
with open('../pickles/ben_cols.pkl', 'wb') as file:
    pickle.dump(ben_cols, file)

with open('../pickles/ben_cols_dum.pkl', 'wb') as file:
    pickle.dump(ben_cols_dum, file)

In [30]:
# plan attribute cols
# attributes_cols

In [57]:
# rate columns
# rate_cols
with open('../pickles/rate_cols.pkl', 'wb') as file:
    pickle.dump(rate_cols, file)

## Find cosine similarity using `ben_cols_dum`

In [32]:
# no_dupes = df[ben_cols_dum].drop_duplicates()

In [33]:
# no_dupes.iloc[0, :]

In [34]:
# ben_vectors = np.asarray(no_dupes)

In [35]:
# from sklearn.metrics.pairwise import cosine_similarity

In [36]:
# cos_mat = cosine_similarity(ben_vectors, ben_vectors)

In [37]:
# np.argsort(cos_mat[0])[-2:-12:-1]
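
The cells above are commented out. As a minimal sketch of the same idea on a tiny made-up 0/1 matrix, the version below computes cosine similarity with plain NumPy as a stand-in for `sklearn.metrics.pairwise.cosine_similarity`, then ranks the other rows by similarity to row 0:

```python
import numpy as np

# Made-up plans-by-benefits matrix; each row is one distinct benefit vector.
ben_vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
], dtype=float)

# Normalise each row to unit length (assumes no all-zero rows),
# so the dot product of two rows is their cosine similarity.
norms = np.linalg.norm(ben_vectors, axis=1, keepdims=True)
unit = ben_vectors / norms
cos_mat = unit @ unit.T

# Rank all rows by similarity to row 0, most similar first, and skip row 0 itself.
order = np.argsort(cos_mat[0])[::-1]
neighbours = [int(i) for i in order if i != 0]
```

Filtering out the query index explicitly avoids relying on the row itself always sorting last, which the slice `[-2:-12:-1]` assumes.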