# Pre-processing data

In this notebook I compress my CSV files by casting object columns to the `category` dtype. This reduces pandas' memory usage because each distinct value in a column is stored once, and every row holds only a small integer code that points to it. I use the following libraries:

1. `pandas` - a library for manipulating data in a tabular format
2. `pickle` - Python's built-in serialization module, used to store data for later use

The only purpose of this notebook is to compress my CSVs and save them as pickles. Loading the compressed pickles keeps my memory usage down when I perform the bulk of my work in later notebooks.
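
To make the memory saving concrete, here is a minimal sketch on made-up data. The `states` series is hypothetical and not taken from the CMS files; it only mimics a column with many rows and few distinct values.

```python
import pandas as pd

# Hypothetical column: 300,000 rows but only three distinct strings
states = pd.Series(['AK', 'AL', 'AZ'] * 100_000, dtype='object')

# The same data stored as a categorical
cat = states.astype('category')

# deep=True counts the bytes of the Python strings, not just the pointers
as_object = states.memory_usage(deep=True)
as_category = cat.memory_usage(deep=True)

# Each distinct value is stored once; rows hold small integer codes
print(cat.cat.categories.tolist())
print(cat.cat.codes.head(3).tolist())
print(f'object: {as_object} bytes, category: {as_category} bytes')
```

The saving depends on how often values repeat, so a column of mostly unique strings is likely to benefit far less.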

In [3]:
import pandas as pd
import pickle

## Compress `rate` CSV

The `rate` CSV file contains plan-level data on individual rates based on an eligible subscriber’s age, tobacco use, and geographic location, as well as family-tier rates. Additional information can be found in the [data dictionary provided by CMS here.](https://www.cms.gov/CCIIO/Resources/Data-Resources/Downloads/Rate-Data-Dictionary.pdf)

In [4]:
rate = pd.read_csv('../data/Rate.csv')
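
An alternative I did not use here, sketched under assumptions: `pd.read_csv` accepts a `dtype` mapping, so columns can be read as categories directly instead of being loaded as objects first. The inline CSV text and its values below are made up for illustration; only the column names are borrowed from the `rate` file.

```python
import io

import pandas as pd

# Made-up rows; a real call would pass the path to the CSV instead
csv_text = (
    'StateCode,Tobacco,IndividualRate\n'
    'AK,No Preference,29.0\n'
    'AK,No Preference,31.5\n'
    'AL,Tobacco User/Non-Tobacco User,40.25\n'
)

# Columns named in the mapping are parsed straight into the category dtype
rate_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'StateCode': 'category', 'Tobacco': 'category'},
)
print(rate_demo.dtypes)
```

This requires knowing in advance which columns are worth treating as categories, which is why casting after an initial `info()` inspection can be the simpler first pass.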

In [5]:
rate.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12694445 entries, 0 to 12694444
Data columns (total 24 columns):
BusinessYear                                 int64
StateCode                                    object
IssuerId                                     int64
SourceName                                   object
VersionNum                                   int64
ImportDate                                   object
IssuerId2                                    int64
FederalTIN                                   object
RateEffectiveDate                            object
RateExpirationDate                           object
PlanId                                       object
RatingAreaId                                 object
Tobacco                                      object
Age                                          object
IndividualRate                               float64
IndividualTobaccoRate                        float64
Couple                                       float64
Pr

In [8]:
# Cast every object (string) column to the category dtype
rate[rate.select_dtypes(['object']).columns] = rate.select_dtypes(['object']).astype('category')
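
The cell above casts every object column. A more selective variant is sketched below, assuming one only wants to convert columns whose values repeat often. The helper `categorize_low_cardinality` and its `max_unique_ratio` threshold are hypothetical names introduced for this example, not part of pandas.

```python
import pandas as pd


def categorize_low_cardinality(df, max_unique_ratio=0.5):
    """Return a copy with repetitive object columns cast to category.

    max_unique_ratio is an illustrative threshold, not a pandas setting.
    """
    out = df.copy()
    for col in out.select_dtypes(['object']).columns:
        # Share of distinct values relative to the number of rows
        ratio = out[col].nunique(dropna=True) / max(len(out), 1)
        if ratio <= max_unique_ratio:
            out[col] = out[col].astype('category')
    return out


# Made-up frame: StateCode repeats, PlanId is unique per row
demo = pd.DataFrame({
    'StateCode': pd.Series(['AK', 'AK', 'AL', 'AL'], dtype='object'),
    'PlanId': pd.Series(['p1', 'p2', 'p3', 'p4'], dtype='object'),
})
converted = categorize_low_cardinality(demo)
print(converted.dtypes)
```

Here `StateCode` is converted while `PlanId` stays an object column, since a categorical with as many categories as rows saves little or nothing.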

In [9]:
rate.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12694445 entries, 0 to 12694444
Data columns (total 24 columns):
BusinessYear                                 int64
StateCode                                    category
IssuerId                                     int64
SourceName                                   category
VersionNum                                   int64
ImportDate                                   category
IssuerId2                                    int64
FederalTIN                                   category
RateEffectiveDate                            category
RateExpirationDate                           category
PlanId                                       category
RatingAreaId                                 category
Tobacco                                      category
Age                                          category
IndividualRate                               float64
IndividualTobaccoRate                        float64
Couple                             

In [6]:
# with open('../pickles/rate.pkl', 'wb') as file:
#     pickle.dump(rate, file)
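
As a minimal sketch of the save-and-reload round trip, the example below pickles a small made-up frame to a temporary directory and reads it back. The `demo` frame and the temporary path are assumptions for illustration; the real cell writes to `../pickles/rate.pkl`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up frame with one categorical column
demo = pd.DataFrame({'Tobacco': pd.Series(['Yes', 'No', 'No'], dtype='category')})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.pkl')

    # Serialize the frame, dtypes included
    with open(path, 'wb') as file:
        pickle.dump(demo, file)

    # Load it back as a later notebook would
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored.dtypes)
```

Unlike a CSV, the pickle keeps the `category` dtype, so later notebooks do not need to repeat the cast.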

## Compress `benefits` CSV

The `benefits` CSV file contains plan data on essential health benefits, coverage limits, and cost sharing for each health plan. Additional information can be found in the [data dictionary provided by CMS here.](https://www.cms.gov/CCIIO/Resources/Data-Resources/Downloads/Benefits-and-Cost-Sharing-Data-Dictionary.pdf)

In [10]:
benefits = pd.read_csv('../data/BenefitsCostSharing.csv')

In [11]:
benefits.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5048408 entries, 0 to 5048407
Data columns (total 32 columns):
BenefitName            object
BusinessYear           int64
CoinsInnTier1          object
CoinsInnTier2          object
CoinsOutofNet          object
CopayInnTier1          object
CopayInnTier2          object
CopayOutofNet          object
EHBVarReason           object
Exclusions             object
Explanation            object
ImportDate             object
IsCovered              object
IsEHB                  object
IsExclFromInnMOOP      object
IsExclFromOonMOOP      object
IsStateMandate         object
IsSubjToDedTier1       object
IsSubjToDedTier2       object
IssuerId               int64
IssuerId2              int64
LimitQty               float64
LimitUnit              object
MinimumStay            float64
PlanId                 object
QuantLimitOnSvc        object
RowNumber              int64
SourceName             object
StandardComponentId    object
StateCode          

In [12]:
# Cast every object (string) column to the category dtype
benefits[benefits.select_dtypes(['object']).columns] = benefits.select_dtypes(['object']).astype('category')

In [13]:
benefits.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5048408 entries, 0 to 5048407
Data columns (total 32 columns):
BenefitName            category
BusinessYear           int64
CoinsInnTier1          category
CoinsInnTier2          category
CoinsOutofNet          category
CopayInnTier1          category
CopayInnTier2          category
CopayOutofNet          category
EHBVarReason           category
Exclusions             category
Explanation            category
ImportDate             category
IsCovered              category
IsEHB                  category
IsExclFromInnMOOP      category
IsExclFromOonMOOP      category
IsStateMandate         category
IsSubjToDedTier1       category
IsSubjToDedTier2       category
IssuerId               int64
IssuerId2              int64
LimitQty               float64
LimitUnit              category
MinimumStay            float64
PlanId                 category
QuantLimitOnSvc        category
RowNumber              int64
SourceName             category
Stand

In [12]:
# with open('../pickles/benefits.pkl', 'wb') as file:
#     pickle.dump(benefits, file)