# About the condition: Arthroplasty Knee

Also known as Knee Replacement Surgery.

The procedure involves cutting away damaged bone and cartilage from the thighbone, shinbone and kneecap and replacing them with an artificial joint (prosthesis) made of metal alloys, high-grade plastics and polymers.


# About this notebook

- **Part 1: Reading in relevant datasets** 
    - SPARCS (individual medical records)
    - PUF (hospital features)
    - Crosswalk (the key table used to join SPARCS and PUF)
    
 
- **Part 2: Merging all the datasets**
    - Perform, if required:
        - Normalization
        - One hot encoding (dummy variables assignment)


- **Part 3: Running models and evaluating how each model fares**

In [95]:
import os
import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pylab as pl
%pylab inline

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy


In [2]:
dataFol = os.getcwd() + "/dataFiles/"
nysHealth = dataFol + "nysHealth/"

In [3]:
nysHealth

'/nfshome/qg412/002_ML/nycHospitalPricing/dataFiles/nysHealth/'

# PART 1

## Sparcs (individual record)

The SPARCS extract used here is built on a base version provided by Mei.

In [4]:
sparcsKnee = pd.read_csv(nysHealth+'sparcsKnee.csv')
sparcsKnee.drop(['Unnamed: 0'], axis=1, inplace=True)
sparcsKnee.head()



Unnamed: 0,APR DRG Code,APR DRG Description,APR MDC Code,APR MDC Description,APR Medical Surgical Description,APR Risk of Mortality,APR Severity of Illness Code,APR Severity of Illness Description,Abortion Edit Indicator,Age Group,...,Payment Typology 1,Payment Typology 2,Payment Typology 3,Race,Ratio of Total Costs to Total Charges,Total Charges,Total Costs,Type of Admission,Zip Code - 3 digits,yr
0,302,Knee joint replacement,8,Diseases and Disorders of the Musculoskeletal ...,Surgical,Minor,2,Moderate,N,70 or Older,...,Blue Cross/Blue Shield,Medicare,,White,,25128.07,49633.7,Elective,147,2013
1,302,Knee joint replacement,8,Diseases and Disorders of the Musculoskeletal ...,Surgical,Minor,1,Minor,N,50 to 69,...,Blue Cross/Blue Shield,,,White,,34503.42,84472.01,Elective,147,2013
2,302,Knee joint replacement,8,Diseases and Disorders of the Musculoskeletal ...,Surgical,Minor,2,Moderate,N,50 to 69,...,Medicaid,,,White,,54106.7,94418.08,Elective,148,2013
3,302,Knee joint replacement,8,Diseases and Disorders of the Musculoskeletal ...,Surgical,Moderate,2,Moderate,N,50 to 69,...,Blue Cross/Blue Shield,Medicare,,White,,30138.52,51083.21,Elective,OOS,2013
4,302,Knee joint replacement,8,Diseases and Disorders of the Musculoskeletal ...,Surgical,Major,2,Moderate,N,70 or Older,...,Medicare,,,White,,25886.07,49957.54,Elective,147,2013
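The `Unnamed: 0` column dropped above is the row index that was written out when the CSV was saved. As a minimal sketch, assuming the file has that layout, `index_col=0` reads it back as the index, and `dtype` pins a column to a single type at load time. That helps with the mixed integer/string values that show up later in `Length of Stay`. The rows below are hypothetical stand-ins, not real SPARCS records.

```python
import io
import pandas as pd

# Tiny stand-in for sparcsKnee.csv (hypothetical rows, not real SPARCS data)
csv_text = """,Facility Id,Length of Stay,Total Costs
0,1.0,3,14631.67
1,1.0,120 +,250000.00
2,4.0,2,7842.28
"""

# index_col=0 reads the saved row index as the index instead of creating
# an 'Unnamed: 0' column; dtype keeps 'Length of Stay' as text throughout.
demo = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"Length of Stay": str},
    low_memory=False,
)
print(demo.dtypes)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the path `nysHealth+'sparcsKnee.csv'`.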


In [5]:
sparcsKnee.columns

Index(['APR DRG Code', 'APR DRG Description', 'APR MDC Code',
       'APR MDC Description', 'APR Medical Surgical Description',
       'APR Risk of Mortality', 'APR Severity of Illness Code',
       'APR Severity of Illness Description', 'Abortion Edit Indicator',
       'Age Group', 'Attending Provider License Number', 'Birth Weight',
       'CCS Diagnosis Code', 'CCS Diagnosis Description', 'CCS Procedure Code',
       'CCS Procedure Description', 'Discharge Year',
       'Emergency Department Indicator', 'Ethnicity', 'Facility Id',
       'Facility Name', 'Gender', 'Health Service Area', 'Hospital County',
       'Length of Stay', 'Operating Certificate Number',
       'Operating Provider License Number', 'Other Provider License Number',
       'Patient Disposition', 'Payment Typology 1', 'Payment Typology 2',
       'Payment Typology 3', 'Race', 'Ratio of Total Costs to Total Charges',
       'Total Charges', 'Total Costs', 'Type of Admission',
       'Zip Code - 3 digits', 'yr'],


In [6]:
sparcsKnee.columns[19]

'Facility Id'

In [7]:
# Rename by label rather than by position so the rename survives column reordering
sparcsKnee.rename(columns={"Facility Id": "fac_id"}, inplace=True)

In [8]:
sparcsKnee.dtypes

APR DRG Code                               int64
APR DRG Description                       object
APR MDC Code                               int64
APR MDC Description                       object
APR Medical Surgical Description          object
APR Risk of Mortality                     object
APR Severity of Illness Code               int64
APR Severity of Illness Description       object
Abortion Edit Indicator                   object
Age Group                                 object
Attending Provider License Number        float64
Birth Weight                               int64
CCS Diagnosis Code                         int64
CCS Diagnosis Description                 object
CCS Procedure Code                         int64
CCS Procedure Description                 object
Discharge Year                             int64
Emergency Department Indicator            object
Ethnicity                                 object
fac_id                                   float64
Facility Name       

## Hospital Features

In [9]:
# Reading in the PUF report
puf = pd.read_json("https://data.cms.gov/resource/8rp3-rzmi.json?state_code=NY")
puf.head()

Unnamed: 0,accounts_payable,accounts_receivable,allowable_dsh_percentage,buildings,cash_on_hand_and_in_banks,ccn_facility_type,city,combined_outpatient_inpatient,contract_labor,cost_of_charity_care,...,total_salaries_adjusted,total_salaries_from_worksheet,total_unreimbursed_and,type_of_control,unsecured_loans,wage_related_costs_core,wage_related_costs_for_interns,wage_related_costs_for_part,wage_related_costs_rhc_fqhc,zip_code
0,,,,,,PH,ORANGEBURG,22147704.0,,,...,93088364.0,93088364.0,-3912103.0,10,,49192404.0,,,,10962-1196
1,,,,,,PH,ROCHESTER,1939591.0,,,...,34442946.0,34442946.0,-6429628.0,10,,18333880.0,,,,14620-3965
2,,,,,,PH,UTICA,6125975.0,,,...,19373691.0,19373691.0,,10,,10361050.0,,,,13502-3803
3,,,,,,PH,DIX HILLS,,,,...,,,,10,,,,,,11746-5861
4,,,,,,STH,ROCHESTER,,,,...,,,,5,,,,,,14620-4629


In [10]:
# There are many columns; listing them all for easy reference
puf.columns.tolist()

['accounts_payable',
 'accounts_receivable',
 'allowable_dsh_percentage',
 'buildings',
 'cash_on_hand_and_in_banks',
 'ccn_facility_type',
 'city',
 'combined_outpatient_inpatient',
 'contract_labor',
 'cost_of_charity_care',
 'cost_of_uncompensated_care',
 'cost_to_charge_ratio',
 'county',
 'deferred_income',
 'depreciation_cost',
 'disproporationate_share',
 'drg_amounts_after_october',
 'drg_amounts_before_october',
 'fiscal_year_begin_date',
 'fiscal_year_end_date',
 'fixed_equipment',
 'fte_employees_on_payroll',
 'general_fund_balance',
 'gross_revenue',
 'health_information_technology',
 'hospital_name',
 'hospital_number_of_beds_for',
 'hospital_total_bed_days',
 'hospital_total_days_title_1',
 'hospital_total_days_title_2',
 'hospital_total_days_v_xviii',
 'hospital_total_discharges_1',
 'hospital_total_discharges_2',
 'hospital_total_discharges_3',
 'inpatient_revenue',
 'inpatient_total_charges',
 'inventory',
 'investments',
 'land',
 'land_improvements',
 'leasehold_impr

#### Hospital features of interest

HOSPITAL FEATURES
- Provider CCN (CMS Certification Number; same as provider id)
- Type of Control (type of hospital ownership, e.g. nonprofit or government)
- Number of Beds (beds available for lodging patients in acute or long-term stays)
- Total Days (V + XVIII + XIX + Unknown): total number of inpatient days for all classes of patients for each component
- Contract Labor (total amount paid for direct patient care services furnished under contract rather than by employees)
- Zip Code (to reflect locality)

COST FEATURES
- Cost of Charity Care (cost of free, essential medical services rendered to people who cannot pay)
- Total Bad Debt Expense (cost of hospital services that are not expected to be paid; excludes doctors' and other professional fees)
- Total Costs (total hospital cost)
- DRG Amounts Other Than Outlier Payments (DRG payments for Prospective Payment System (PPS) discharges)
- Total IME Payment (additional amount a teaching hospital receives on top of the payment for each Medicare case)
- Disproportionate Share Adjustment (percentage add-on to the DRG payment as additional compensation for treating low-income patients; spelled `disproporationate_share` in the dataset)
- Net Patient Revenue (revenue from patient services, net of allowances and discounts)
- Net Revenue from Medicaid (inclusive of DSH & IME revenue)
- Medicaid Charges (total charges for Medicaid patients)
- Net Revenue from Stand-Alone SCHIP (SCHIP = the State Children’s Health Insurance Program, for children in low-income families)
- Stand-Alone SCHIP Charges (total charges under the State Children’s Health Insurance Program)


In [11]:
# Selecting only the columns of interest as shortlisted in data inventory

hospFeat = puf[['provider_ccn',
 'provider_type',
 'type_of_control',
 'number_of_beds',
 'total_days_v_xviii_xix_unknown',
 'contract_labor',
 'zip_code',
 'cost_of_charity_care',
 'total_bad_debt_expense',
 'total_costs',
 'total_ime_payment',
 'disproporationate_share',
 'net_patient_revenue', 
 'net_revenue_from_medicaid',
 'medicaid_charges',
 'net_revenue_from_stand_alone',
 'stand_alone_schip_charges',
       ]]

hospFeat.head()

Unnamed: 0,provider_ccn,provider_type,type_of_control,number_of_beds,total_days_v_xviii_xix_unknown,contract_labor,zip_code,cost_of_charity_care,total_bad_debt_expense,total_costs,total_ime_payment,disproporationate_share,net_patient_revenue,net_revenue_from_medicaid,medicaid_charges,net_revenue_from_stand_alone,stand_alone_schip_charges
0,334015,2,10,404.0,135262.0,,10962-1196,,,165987411.0,,,,,,,
1,334020,2,10,185.0,62049.0,,14620-3965,,,60525172.0,,,,,,,
2,334021,2,10,25.0,8218.0,,13502-3803,,,29783192.0,,,,,,,
3,334064,2,10,,,,11746-5861,,,,,,,,,,
4,330403,2,5,,,,14620-4629,,,,,,,,,,


In [12]:
# Rename by label and reassign the result. Calling rename(..., inplace=True)
# on a slice of puf is what triggers pandas' SettingWithCopyWarning.
hospFeat = hospFeat.rename(columns={"provider_ccn": "provider_id"})
hospFeat.head()


Unnamed: 0,provider_id,provider_type,type_of_control,number_of_beds,total_days_v_xviii_xix_unknown,contract_labor,zip_code,cost_of_charity_care,total_bad_debt_expense,total_costs,total_ime_payment,disproporationate_share,net_patient_revenue,net_revenue_from_medicaid,medicaid_charges,net_revenue_from_stand_alone,stand_alone_schip_charges
0,334015,2,10,404.0,135262.0,,10962-1196,,,165987411.0,,,,,,,
1,334020,2,10,185.0,62049.0,,14620-3965,,,60525172.0,,,,,,,
2,334021,2,10,25.0,8218.0,,13502-3803,,,29783192.0,,,,,,,
3,334064,2,10,,,,11746-5861,,,,,,,,,,
4,330403,2,5,,,,14620-4629,,,,,,,,,,


## CrossWalk

In [13]:
xWalk = pd.read_csv(nysHealth+'nysHospitalsProviderIdFacilityIdCrossWalk.csv')
xWalk.drop(['Unnamed: 0'], axis=1, inplace=True)
xWalk.head()

Unnamed: 0,in_nyc,Provider ID,fac_id,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,...,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
0,False,330013,1.0,ALBANY MEDICAL CENTER HOSPITAL,"43 NEW SCOTLAND AVENUE, MAIL CODE 34",ALBANY,NY,12208,ALBANY,5182622400,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"43 NEW SCOTLAND AVENUE, MAIL CODE 34 ALBANY, N..."
1,False,330003,4.0,ALBANY MEMORIAL HOSPITAL,600 NORTHERN BOULEVARD,ALBANY,NY,12204,ALBANY,5184713490,...,,Below the national average,,Same as the national average,,Below the national average,,Not Available,Results are not available for this reporting p...,"600 NORTHERN BOULEVARD ALBANY, NY (42.674685, ..."
2,False,330057,5.0,ST PETER'S HOSPITAL,315 SOUTH MANNING BOULEVARD,ALBANY,NY,12208,ALBANY,5185251550,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"315 SOUTH MANNING BOULEVARD ALBANY, NY (42.660..."
3,False,331301,37.0,"CUBA MEMORIAL HOSPITAL, INC",140 WEST MAIN STREET,CUBA,NY,14727,ALLEGANY,5859612000,...,Results are not available for this reporting p...,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,Same as the national average,,Not Available,There are too few measures or measure groups r...,"140 WEST MAIN STREET CUBA, NY (42.213341, -78...."
4,False,330096,39.0,JONES MEMORIAL HOSPITAL,191 NORTH MAIN STREET,WELLSVILLE,NY,14895,ALLEGANY,5855931100,...,,Below the national average,,Same as the national average,,Same as the national average,,Not Available,Results are not available for this reporting p...,"191 NORTH MAIN STREET WELLSVILLE, NY (42.12287..."


In [14]:
xWalk['County Name'].unique()

array(['ALBANY', 'ALLEGANY', 'BROOME', 'CATTARAUGUS', 'CAYUGA',
       'CHAUTAUQUA', 'CHEMUNG', 'CHENANGO', 'CLINTON', 'COLUMBIA',
       'CORTLAND', 'DELAWARE', 'DUTCHESS', 'ERIE', 'SUFFOLK', 'ESSEX',
       'FRANKLIN', 'FULTON', 'GENESEE', 'HERKIMER', 'JEFFERSON', 'LEWIS',
       'LIVINGSTON', 'MADISON', 'MONROE', 'MONTGOMERY', 'NASSAU',
       'NIAGARA', 'ONEIDA', 'ONONDAGA', 'ONTARIO', 'ORANGE', 'ORLEANS',
       'OSWEGO', 'OTSEGO', 'PUTNAM', 'RENSSELAER', 'ROCKLAND',
       'ST. LAWRENCE', 'SARATOGA', 'SCHENECTADY', 'SCHUYLER', 'STEUBEN',
       'SULLIVAN', 'TOMPKINS', 'ULSTER', 'WARREN', 'WAYNE', 'WESTCHESTER',
       'WYOMING', 'BRONX', 'KINGS', 'NEW YORK', 'QUEENS', 'RICHMOND',
       'SCHOHARIE', 'YATES'], dtype=object)

In [15]:
len(xWalk['County Name'].unique())

57

Question: Wikipedia lists 62 counties in New York State, but only 57 appear here, so five counties have no hospital in the crosswalk.
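One way to follow up on the county count is a set difference between a reference list of counties and the counties printed above. This is only a sketch: the 62-name reference list is typed in by hand (an assumption, not part of the original data), and the 57 names are copied from the `unique()` output. In the notebook itself the second set would come from `set(xWalk['County Name'].unique())`.

```python
# All 62 New York State counties, typed in by hand (an assumption worth
# checking against an official list before relying on it).
all_counties = {c.strip() for c in """
ALBANY, ALLEGANY, BRONX, BROOME, CATTARAUGUS, CAYUGA, CHAUTAUQUA, CHEMUNG,
CHENANGO, CLINTON, COLUMBIA, CORTLAND, DELAWARE, DUTCHESS, ERIE, ESSEX,
FRANKLIN, FULTON, GENESEE, GREENE, HAMILTON, HERKIMER, JEFFERSON, KINGS,
LEWIS, LIVINGSTON, MADISON, MONROE, MONTGOMERY, NASSAU, NEW YORK, NIAGARA,
ONEIDA, ONONDAGA, ONTARIO, ORANGE, ORLEANS, OSWEGO, OTSEGO, PUTNAM, QUEENS,
RENSSELAER, RICHMOND, ROCKLAND, SARATOGA, SCHENECTADY, SCHOHARIE, SCHUYLER,
SENECA, ST. LAWRENCE, STEUBEN, SUFFOLK, SULLIVAN, TIOGA, TOMPKINS, ULSTER,
WARREN, WASHINGTON, WAYNE, WESTCHESTER, WYOMING, YATES
""".split(",")}

# The 57 county names copied from the xWalk['County Name'].unique() output
in_crosswalk = {c.strip() for c in """
ALBANY, ALLEGANY, BROOME, CATTARAUGUS, CAYUGA, CHAUTAUQUA, CHEMUNG, CHENANGO,
CLINTON, COLUMBIA, CORTLAND, DELAWARE, DUTCHESS, ERIE, SUFFOLK, ESSEX,
FRANKLIN, FULTON, GENESEE, HERKIMER, JEFFERSON, LEWIS, LIVINGSTON, MADISON,
MONROE, MONTGOMERY, NASSAU, NIAGARA, ONEIDA, ONONDAGA, ONTARIO, ORANGE,
ORLEANS, OSWEGO, OTSEGO, PUTNAM, RENSSELAER, ROCKLAND, ST. LAWRENCE, SARATOGA,
SCHENECTADY, SCHUYLER, STEUBEN, SULLIVAN, TOMPKINS, ULSTER, WARREN, WAYNE,
WESTCHESTER, WYOMING, BRONX, KINGS, NEW YORK, QUEENS, RICHMOND, SCHOHARIE,
YATES
""".split(",")}

# Counties in the reference list that never appear in the crosswalk
missing = sorted(all_counties - in_crosswalk)
print(len(all_counties), len(in_crosswalk), missing)
```

Any county reported as missing simply has no hospital row in the crosswalk file; whether that reflects reality or a gap in the file would need to be checked separately.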

In [16]:
# Rename the provider id column by label so that it matches the key name in hospFeat
xWalk.rename(columns={"Provider ID": "provider_id"}, inplace=True)
xWalk.head()

Unnamed: 0,in_nyc,provider_id,fac_id,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,...,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
0,False,330013,1.0,ALBANY MEDICAL CENTER HOSPITAL,"43 NEW SCOTLAND AVENUE, MAIL CODE 34",ALBANY,NY,12208,ALBANY,5182622400,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"43 NEW SCOTLAND AVENUE, MAIL CODE 34 ALBANY, N..."
1,False,330003,4.0,ALBANY MEMORIAL HOSPITAL,600 NORTHERN BOULEVARD,ALBANY,NY,12204,ALBANY,5184713490,...,,Below the national average,,Same as the national average,,Below the national average,,Not Available,Results are not available for this reporting p...,"600 NORTHERN BOULEVARD ALBANY, NY (42.674685, ..."
2,False,330057,5.0,ST PETER'S HOSPITAL,315 SOUTH MANNING BOULEVARD,ALBANY,NY,12208,ALBANY,5185251550,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"315 SOUTH MANNING BOULEVARD ALBANY, NY (42.660..."
3,False,331301,37.0,"CUBA MEMORIAL HOSPITAL, INC",140 WEST MAIN STREET,CUBA,NY,14727,ALLEGANY,5859612000,...,Results are not available for this reporting p...,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,Same as the national average,,Not Available,There are too few measures or measure groups r...,"140 WEST MAIN STREET CUBA, NY (42.213341, -78...."
4,False,330096,39.0,JONES MEMORIAL HOSPITAL,191 NORTH MAIN STREET,WELLSVILLE,NY,14895,ALLEGANY,5855931100,...,,Below the national average,,Same as the national average,,Same as the national average,,Not Available,Results are not available for this reporting p...,"191 NORTH MAIN STREET WELLSVILLE, NY (42.12287..."


# PART 2

### Merging crosswalk and hospital features

In [17]:
xWalk.shape

(170, 31)

In [18]:
hospFeat.shape

(212, 17)

In [19]:
# Merge the hospital features onto xWalk.
# A left join keeps every crosswalk row; an inner join would drop 1 row of data.
xHosp = pd.merge(xWalk, hospFeat, on='provider_id', how='left')
xHosp.shape

(170, 47)
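To see which crosswalk row finds no match in the hospital features, `pd.merge` can add a `_merge` column via `indicator=True`, and `validate` can assert that the key is unique on both sides. The frames below are hypothetical miniatures of `xWalk` and `hospFeat` with made-up values; the one-to-one assumption about `provider_id` is mine and would need checking on the real data.

```python
import pandas as pd

# Hypothetical miniature versions of xWalk and hospFeat (values are made up)
xwalk_demo = pd.DataFrame({
    "provider_id": [330013, 330003, 331307],
    "fac_id": [1.0, 4.0, None],
})
feat_demo = pd.DataFrame({
    "provider_id": [330013, 330003, 334015],
    "number_of_beds": [700.0, 165.0, 404.0],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only';
# validate raises pandas.errors.MergeError if provider_id is duplicated on either side.
merged = pd.merge(
    xwalk_demo, feat_demo,
    on="provider_id", how="left",
    indicator=True, validate="one_to_one",
)
print(merged["_merge"].value_counts())

# Crosswalk rows that picked up no hospital features
unmatched = merged.loc[merged["_merge"] == "left_only", "provider_id"].tolist()
print(unmatched)
```

Applied to the real frames, the `left_only` rows would identify the single hospital an inner join drops.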

In [20]:
xHosp.head()

Unnamed: 0,in_nyc,provider_id,fac_id,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,...,cost_of_charity_care,total_bad_debt_expense,total_costs,total_ime_payment,disproporationate_share,net_patient_revenue,net_revenue_from_medicaid,medicaid_charges,net_revenue_from_stand_alone,stand_alone_schip_charges
0,False,330013,1.0,ALBANY MEDICAL CENTER HOSPITAL,"43 NEW SCOTLAND AVENUE, MAIL CODE 34",ALBANY,NY,12208,ALBANY,5182622400,...,6104462.0,11555203.0,690391293.0,27148518.0,3503897.0,812073789.0,123233408.0,550930533.0,2725265.0,14236543.0
1,False,330003,4.0,ALBANY MEMORIAL HOSPITAL,600 NORTHERN BOULEVARD,ALBANY,NY,12204,ALBANY,5184713490,...,201240.0,9769210.0,83385847.0,,87836.0,90065239.0,8849186.0,47687910.0,105008.0,568278.0
2,False,330057,5.0,ST PETER'S HOSPITAL,315 SOUTH MANNING BOULEVARD,ALBANY,NY,12208,ALBANY,5185251550,...,474889.0,14089983.0,430051864.0,2689637.0,1760836.0,527179517.0,50768513.0,224980779.0,410500.0,2815623.0
3,False,331301,37.0,"CUBA MEMORIAL HOSPITAL, INC",140 WEST MAIN STREET,CUBA,NY,14727,ALLEGANY,5859612000,...,131365.0,184696.0,11236408.0,,,8437534.0,861654.0,1404062.0,,
4,False,330096,39.0,JONES MEMORIAL HOSPITAL,191 NORTH MAIN STREET,WELLSVILLE,NY,14895,ALLEGANY,5855931100,...,136314.0,663201.0,31717135.0,,70917.0,30753715.0,4908444.0,15294809.0,,


### Joining Crosswalk and Sparcs

This step is slightly more involved: the two tables are joined on `fac_id`, and some crosswalk rows have no facility id.

#### 1) Getting everything into one dataset

In [21]:
sparcsKnee.shape

(150077, 39)

In [22]:
xWalk.shape

(170, 31)

In [23]:
xWalk[xWalk['fac_id'].isnull()]

Unnamed: 0,in_nyc,provider_id,fac_id,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,...,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
166,False,331307,,CLIFTON FINE HOSPITAL,1014 OSWEGATCHIE TRAIL,STAR LAKE,NY,13690,ST. LAWRENCE,3158483351,...,Results are not available for this reporting p...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,"1014 OSWEGATCHIE TRAIL STAR LAKE, NY (44.16292..."
167,False,331304,,MARGARETVILLE MEMORIAL HOSPITAL,42084 STATE HIGHWAY 28,MARGARETVILLE,NY,12455,DELAWARE,8455862631,...,,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...,Not Available,Results are not available for this reporting p...,"42084 STATE HIGHWAY 28 MARGARETVILLE, NY (42.1..."
168,False,331306,,MOSES-LUDINGTON HOSPITAL,1019 WICKER STREET,TICONDEROGA,NY,12883,ESSEX,5185853700,...,Results are not available for this reporting p...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,"1019 WICKER STREET TICONDEROGA, NY (43.849591,..."
169,False,331314,,SOLDIERS AND SAILORS MEMORIAL HOSPITAL OF YATES,418 NORTH MAIN STREET,PENN YAN,NY,14527,YATES,3157874175,...,,Same as the national average,,Same as the national average,,Above the national average,,Not Available,Results are not available for this reporting p...,"418 NORTH MAIN STREET PENN YAN, NY (42.670468,..."


Four hospitals in the crosswalk have NaN as their facility id.

In [24]:
test = pd.merge(xWalk, sparcsKnee, on='fac_id', how='outer')
test.shape

(150146, 69)
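One thing to keep in mind with the NaN facility ids: pandas matches null keys to each other during a merge, unlike a SQL join. If any SPARCS record also had a missing `fac_id`, it would be paired with each of the NaN-keyed hospitals. The sketch below uses made-up frames to show the effect, along with one possible guard: dropping missing keys from one side before merging. Whether the real `sparcsKnee` contains such rows is not established here.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: two hospitals without a facility id, one record without one
left = pd.DataFrame({"fac_id": [1.0, np.nan, np.nan], "hospital": ["A", "B", "C"]})
right = pd.DataFrame({"fac_id": [1.0, np.nan], "cost": [100.0, 200.0]})

# NaN keys are treated as equal, so hospitals B and C both pick up the
# NaN-keyed record and its cost.
naive = pd.merge(left, right, on="fac_id", how="outer")
print(naive)

# Dropping records with a missing key first keeps them from pairing up with
# unrelated hospitals; B and C are kept, but with no cost attached.
safe = pd.merge(left, right.dropna(subset=["fac_id"]), on="fac_id", how="outer")
print(safe)
```

A quick check such as `sparcsKnee['fac_id'].isnull().sum()` would show whether this matters for the merge above.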

In [25]:
test.columns

Index(['in_nyc', 'provider_id', 'fac_id', 'Hospital Name', 'Address', 'City',
       'State', 'ZIP Code', 'County Name', 'Phone Number', 'Hospital Type',
       'Hospital Ownership', 'Emergency Services',
       'Meets criteria for meaningful use of EHRs', 'Hospital overall rating',
       'Hospital overall rating footnote', 'Mortality national comparison',
       'Mortality national comparison footnote',
       'Safety of care national comparison',
       'Safety of care national comparison footnote',
       'Readmission national comparison',
       'Readmission national comparison footnote',
       'Patient experience national comparison',
       'Patient experience national comparison footnote',
       'Effectiveness of care national comparison',
       'Effectiveness of care national comparison footnote',
       'Timeliness of care national comparison',
       'Timeliness of care national comparison footnote',
       'Efficient use of medical imaging national comparison',
       'E

#### 2) Narrow down to columns of interest

In [26]:
test['Hospital Type'].unique()

array(['Acute Care Hospitals', 'Critical Access Hospitals', 'Childrens',
       nan], dtype=object)

In [27]:
test['Hospital Ownership'].unique()

array(['Voluntary non-profit - Private', 'Proprietary',
       'Government - State', 'Voluntary non-profit - Church',
       'Voluntary non-profit - Other', 'Government - Local',
       'Government - Federal',
       'Government - Hospital District or Authority', nan], dtype=object)

In [28]:
test['Hospital overall rating'].unique()

array(['2', '3', 'Not Available', '1', '4', '5', nan], dtype=object)

In [29]:
test['Safety of care national comparison'].unique()

array(['Above the national average', 'Same as the national average',
       'Below the national average', 'Not Available', nan], dtype=object)

In [30]:
test['Patient experience national comparison'].unique()

array(['Below the national average', 'Not Available',
       'Same as the national average', 'Above the national average', nan], dtype=object)

In [31]:
test['Type of Admission'].unique()

array(['Elective', 'Emergency', 'Urgent', 'Newborn', 'Not Available', nan,
       'Trauma'], dtype=object)

In [32]:
test2 = test[['in_nyc', 'provider_id', 'fac_id', 'County Name',
       'Hospital Ownership','Hospital overall rating',
       'Safety of care national comparison',
       'Patient experience national comparison',
       'Effectiveness of care national comparison',
       'Timeliness of care national comparison','Length of Stay',
       'Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3',
       'Ratio of Total Costs to Total Charges', 'Total Charges',
       'Total Costs', 'Type of Admission']]

test2.head()

Unnamed: 0,in_nyc,provider_id,fac_id,County Name,Hospital Ownership,Hospital overall rating,Safety of care national comparison,Patient experience national comparison,Effectiveness of care national comparison,Timeliness of care national comparison,Length of Stay,Payment Typology 1,Payment Typology 2,Payment Typology 3,Ratio of Total Costs to Total Charges,Total Charges,Total Costs,Type of Admission
0,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,3,Medicare,"Managed Care, Unspecified",Self-Pay,,31847.63,14631.67,Elective
1,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,2,Blue Cross/Blue Shield,"Managed Care, Unspecified",Self-Pay,,20423.09,7842.28,Elective
2,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,2,Medicare,Blue Cross/Blue Shield,Self-Pay,,29283.52,14613.3,Elective
3,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,2,Medicare,"Managed Care, Unspecified",Self-Pay,,30368.72,14990.05,Elective
4,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,3,Miscellaneous/Other,"Managed Care, Unspecified",Self-Pay,,31850.13,14724.31,Elective


#### 3) Encoding categorical variables (integer codes rather than one-hot)

In [33]:
# Assign an integer code to each category of the categorical columns.
# Note: .cat.codes produces integer (label) codes, not one-hot dummy variables.
# code credit @ https://stackoverflow.com/questions/32011359/convert-categorical-data-in-pandas-dataframe

# Work on an explicit copy so the assignments below modify test2 itself
# rather than a slice of test (the cause of SettingWithCopyWarning).
test2 = test2.copy()

# Step 1: Convert the categorical variables to the category dtype
to_category = ['in_nyc', 'County Name', 'Hospital Ownership',
               'Safety of care national comparison',
               'Patient experience national comparison',
               'Effectiveness of care national comparison',
               'Timeliness of care national comparison',
               'Type of Admission']
for col in to_category:
    test2[col] = test2[col].astype('category')

# Step 2: Collect the names of all category-dtype columns
cat_columns = test2.select_dtypes(['category']).columns

# Step 3: Add one "<column>_int" column per categorical column.
# A single vectorized assignment is enough; missing values get the code -1.
test2[cat_columns + '_int'] = test2[cat_columns].apply(lambda x: x.cat.codes)

test2

Unnamed: 0,in_nyc,provider_id,fac_id,County Name,Hospital Ownership,Hospital overall rating,Safety of care national comparison,Patient experience national comparison,Effectiveness of care national comparison,Timeliness of care national comparison,...,Total Costs,Type of Admission,in_nyc_int,County Name_int,Hospital Ownership_int,Safety of care national comparison_int,Patient experience national comparison_int,Effectiveness of care national comparison_int,Timeliness of care national comparison_int,Type of Admission_int
0,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14631.67,Elective,0,0,7,0,1,3,1,0
1,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,7842.28,Elective,0,0,7,0,1,3,1,0
2,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14613.30,Elective,0,0,7,0,1,3,1,0
3,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14990.05,Elective,0,0,7,0,1,3,1,0
4,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14724.31,Elective,0,0,7,0,1,3,1,0
5,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,16521.22,Elective,0,0,7,0,1,3,1,0
6,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,17687.62,Elective,0,0,7,0,1,3,1,0
7,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14180.15,Elective,0,0,7,0,1,3,1,0
8,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,15187.77,Elective,0,0,7,0,1,3,1,0
9,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,16076.91,Elective,0,0,7,0,1,3,1,0
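For comparison, here is a small sketch of how integer codes differ from one-hot (dummy) encoding, using a made-up column. `cat.codes` yields a single integer column whose ordering is arbitrary (alphabetical), while `pd.get_dummies` yields one 0/1 column per category. The `admit` prefix is just an illustrative name.

```python
import pandas as pd

# Made-up sample of an admission-type column, including one missing value
demo = pd.DataFrame(
    {"Type of Admission": ["Elective", "Emergency", "Urgent", "Elective", None]}
)

# Integer codes: a single column; categories are numbered in sorted order
# and a missing value becomes -1.
codes = demo["Type of Admission"].astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one 0/1 column per category; a missing value becomes
# a row of all zeros (dummy_na defaults to False).
dummies = pd.get_dummies(demo["Type of Admission"], prefix="admit", dtype=int)
print(dummies)
```

Integer codes imply an ordering between categories, which tree-based models tolerate but linear models may misread; one-hot columns avoid that at the cost of a wider table, which matters for a column like `County Name` with 57 levels.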


In [34]:
# Retrieve the mapping between each category and its newly assigned integer code
# code credit @ https://stackoverflow.com/questions/30510562/get-mapping-of-categorical-variables-in-pandas

# One dict per categorical column, in the same order as cat_columns
mapping = []
for col in cat_columns:
    mapping.append(dict(enumerate(test2[col].cat.categories)))

In [35]:
mapping

[{0: False, 1: True},
 {0: 'ALBANY',
  1: 'ALLEGANY',
  2: 'BRONX',
  3: 'BROOME',
  4: 'CATTARAUGUS',
  5: 'CAYUGA',
  6: 'CHAUTAUQUA',
  7: 'CHEMUNG',
  8: 'CHENANGO',
  9: 'CLINTON',
  10: 'COLUMBIA',
  11: 'CORTLAND',
  12: 'DELAWARE',
  13: 'DUTCHESS',
  14: 'ERIE',
  15: 'ESSEX',
  16: 'FRANKLIN',
  17: 'FULTON',
  18: 'GENESEE',
  19: 'HERKIMER',
  20: 'JEFFERSON',
  21: 'KINGS',
  22: 'LEWIS',
  23: 'LIVINGSTON',
  24: 'MADISON',
  25: 'MONROE',
  26: 'MONTGOMERY',
  27: 'NASSAU',
  28: 'NEW YORK',
  29: 'NIAGARA',
  30: 'ONEIDA',
  31: 'ONONDAGA',
  32: 'ONTARIO',
  33: 'ORANGE',
  34: 'ORLEANS',
  35: 'OSWEGO',
  36: 'OTSEGO',
  37: 'PUTNAM',
  38: 'QUEENS',
  39: 'RENSSELAER',
  40: 'RICHMOND',
  41: 'ROCKLAND',
  42: 'SARATOGA',
  43: 'SCHENECTADY',
  44: 'SCHOHARIE',
  45: 'SCHUYLER',
  46: 'ST. LAWRENCE',
  47: 'STEUBEN',
  48: 'SUFFOLK',
  49: 'SULLIVAN',
  50: 'TOMPKINS',
  51: 'ULSTER',
  52: 'WARREN',
  53: 'WAYNE',
  54: 'WESTCHESTER',
  55: 'WYOMING',
  56: 'YATES'}

#### 4) Normalizing
- Total costs and total charges need to be divided by the length of stay of each record so that records can be compared fairly

In [36]:
test2.head()

Unnamed: 0,in_nyc,provider_id,fac_id,County Name,Hospital Ownership,Hospital overall rating,Safety of care national comparison,Patient experience national comparison,Effectiveness of care national comparison,Timeliness of care national comparison,...,Total Costs,Type of Admission,in_nyc_int,County Name_int,Hospital Ownership_int,Safety of care national comparison_int,Patient experience national comparison_int,Effectiveness of care national comparison_int,Timeliness of care national comparison_int,Type of Admission_int
0,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14631.67,Elective,0,0,7,0,1,3,1,0
1,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,7842.28,Elective,0,0,7,0,1,3,1,0
2,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14613.3,Elective,0,0,7,0,1,3,1,0
3,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14990.05,Elective,0,0,7,0,1,3,1,0
4,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,14724.31,Elective,0,0,7,0,1,3,1,0


In [37]:
test2['Length of Stay'].unique()

array([3, 2, 5, 4, 7, 8, 6, 12, 10, 1, 9, 27, 31, 17, 34, 24, 25, 11, 15,
       14, 13, 29, 19, 16, 22, 20, nan, 39, 18, 73, 21, 35, 36, 49, 37, 56,
       26, 23, 32, '2', '6', '3', '4', '5', '7', '1', '9', '10', '11',
       '14', '16', '8', '21', '15', 60, 67, 63, 69, 30, 28, 40, 44, 58, 42,
       '12', '22', '13', 83, '23', 33, 47, 45, 46, 70, 52, 62, 84, 71,
       '19', '17', 51, 55, '20', '120 +', '18', '31', '43', 59, 48, 65,
       '102', '58', '55', '30', '78', '48', '49', 41, '29', '38', '42',
       '71', '34', '27', '24', '59', 50, '35', '25'], dtype=object)

In [38]:
test2[test2['Length of Stay']=='120 +']

Unnamed: 0,in_nyc,provider_id,fac_id,County Name,Hospital Ownership,Hospital overall rating,Safety of care national comparison,Patient experience national comparison,Effectiveness of care national comparison,Timeliness of care national comparison,...,Total Costs,Type of Admission,in_nyc_int,County Name_int,Hospital Ownership_int,Safety of care national comparison_int,Patient experience national comparison_int,Effectiveness of care national comparison_int,Timeliness of care national comparison_int,Type of Admission_int
89823,True,330202.0,1301.0,KINGS,Government - Local,1,Below the national average,Below the national average,Below the national average,Below the national average,...,738050.7,Emergency,1,21,2,1,1,1,1,1


In [39]:
len(test2)

150146

In [40]:
#code credit@ https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression
test3 = test2.drop(test2[test2['Length of Stay']=='120 +'].index)
len(test3)

150145

In [41]:
test3['Length of Stay'] = test3['Length of Stay'].astype(float)
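
The cast above succeeds because the only non-numeric value, `'120 +'`, has already been dropped. A more defensive variant, sketched on toy data, uses `pd.to_numeric` with `errors='coerce'` so that any unparseable entry becomes `NaN` instead of raising. The toy frame and the decision to discard such rows are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy records mixing ints, numeric strings, a censored value and a missing value
toy = pd.DataFrame({
    "Length of Stay": [3, "2", "120 +", None],
    "Total Charges": [300.0, 100.0, 9000.0, 50.0],
})

# Unparseable strings such as '120 +' become NaN rather than raising an error
toy["Length of Stay"] = pd.to_numeric(toy["Length of Stay"], errors="coerce")

# Keep only rows with a usable length of stay, then normalize per day
toy = toy.dropna(subset=["Length of Stay"]).copy()
toy["chargesDaily"] = toy["Total Charges"] / toy["Length of Stay"]
print(toy)
```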

In [42]:
test3['chargesDaily'] = test3['Total Charges'] / test3['Length of Stay']
test3['costsDaily'] = test3['Total Costs'] / test3['Length of Stay']
test3.head()

Unnamed: 0,in_nyc,provider_id,fac_id,County Name,Hospital Ownership,Hospital overall rating,Safety of care national comparison,Patient experience national comparison,Effectiveness of care national comparison,Timeliness of care national comparison,...,in_nyc_int,County Name_int,Hospital Ownership_int,Safety of care national comparison_int,Patient experience national comparison_int,Effectiveness of care national comparison_int,Timeliness of care national comparison_int,Type of Admission_int,chargesDaily,costsDaily
0,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,0,0,7,0,1,3,1,0,10615.876667,4877.223333
1,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,0,0,7,0,1,3,1,0,10211.545,3921.14
2,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,0,0,7,0,1,3,1,0,14641.76,7306.65
3,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,0,0,7,0,1,3,1,0,15184.36,7495.025
4,False,330013.0,1.0,ALBANY,Voluntary non-profit - Private,2,Above the national average,Below the national average,Same as the national average,Below the national average,...,0,0,7,0,1,3,1,0,10616.71,4908.103333


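Dropping the columns that are no longer needed: the original categorical columns (now encoded as `_int` columns), the raw totals (replaced by per-day values), and fields not used in the models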

In [43]:
test3.columns

Index(['in_nyc', 'provider_id', 'fac_id', 'County Name', 'Hospital Ownership',
       'Hospital overall rating', 'Safety of care national comparison',
       'Patient experience national comparison',
       'Effectiveness of care national comparison',
       'Timeliness of care national comparison', 'Length of Stay',
       'Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3',
       'Ratio of Total Costs to Total Charges', 'Total Charges', 'Total Costs',
       'Type of Admission', 'in_nyc_int', 'County Name_int',
       'Hospital Ownership_int', 'Safety of care national comparison_int',
       'Patient experience national comparison_int',
       'Effectiveness of care national comparison_int',
       'Timeliness of care national comparison_int', 'Type of Admission_int',
       'chargesDaily', 'costsDaily'],
      dtype='object')

In [44]:
test4 = test3.drop(['in_nyc','fac_id', 'County Name', 'Hospital Ownership',
       'Hospital overall rating', 'Safety of care national comparison',
       'Patient experience national comparison',
       'Effectiveness of care national comparison',
       'Timeliness of care national comparison', 'Length of Stay',
       'Payment Typology 1', 'Payment Typology 2', 'Payment Typology 3',
       'Total Charges', 'Total Costs',
       'Type of Admission'],axis=1)

test4.head()

Unnamed: 0,provider_id,Ratio of Total Costs to Total Charges,in_nyc_int,County Name_int,Hospital Ownership_int,Safety of care national comparison_int,Patient experience national comparison_int,Effectiveness of care national comparison_int,Timeliness of care national comparison_int,Type of Admission_int,chargesDaily,costsDaily
0,330013.0,,0,0,7,0,1,3,1,0,10615.876667,4877.223333
1,330013.0,,0,0,7,0,1,3,1,0,10211.545,3921.14
2,330013.0,,0,0,7,0,1,3,1,0,14641.76,7306.65
3,330013.0,,0,0,7,0,1,3,1,0,15184.36,7495.025
4,330013.0,,0,0,7,0,1,3,1,0,10616.71,4908.103333


In [45]:
test4.dtypes

provider_id                                      float64
Ratio of Total Costs to Total Charges            float64
in_nyc_int                                          int8
County Name_int                                     int8
Hospital Ownership_int                              int8
Safety of care national comparison_int              int8
Patient experience national comparison_int          int8
Effectiveness of care national comparison_int       int8
Timeliness of care national comparison_int          int8
Type of Admission_int                               int8
chargesDaily                                     float64
costsDaily                                       float64
dtype: object

In [46]:
test4.shape

(150145, 12)

In [47]:
xWalk.shape

(170, 31)

#### 5) Groupby

In [60]:
# code credit@ https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns
xSparcs = test4.groupby('provider_id').agg({'provider_id':'count',
                                 'in_nyc_int':'mean',
                                 'Safety of care national comparison_int':'mean',
                                 'Patient experience national comparison_int':'mean',
                                 'Effectiveness of care national comparison_int':'mean',
                                 'Timeliness of care national comparison_int':'mean',
                                 'chargesDaily':'mean',
                                 'costsDaily':'mean'}).rename(columns={'provider_id':'count',
                                                                       'in_nyc_int':'in_nyc',
                                                                       'Safety of care national comparison_int':'safetyOfCare_mean',
                                                                       'Effectiveness of care national comparison_int':'effectiveCare_mean',
                                                                       'Timeliness of care national comparison_int':'timeliness_mean',
                                                                       'chargesDaily':'chargesDaily_mean',
                                                                       'costsDaily':'costsDaily_mean'}).reset_index()
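
The aggregation above collapses individual records into one row per hospital. As a sketch of the same idea on toy data, the named-aggregation syntax (available in pandas 0.25 and later) sets the output column names directly, so no separate `rename` is needed. The toy values and the use of `size` for the record count are assumptions for illustration.

```python
import pandas as pd

# Toy patient-level records for two hospitals
records = pd.DataFrame({
    "provider_id": [1, 1, 2],
    "chargesDaily": [100.0, 300.0, 50.0],
    "costsDaily": [40.0, 60.0, 20.0],
})

# One row per hospital: number of records plus the mean daily charge and cost
per_hospital = (
    records.groupby("provider_id")
    .agg(
        count=("chargesDaily", "size"),
        chargesDaily_mean=("chargesDaily", "mean"),
        costsDaily_mean=("costsDaily", "mean"),
    )
    .reset_index()
)
print(per_hospital)
```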

### Crosswalk x Hospital Features x Sparcs

In [61]:
xSparcs.head()

Unnamed: 0,provider_id,effectiveCare_mean,safetyOfCare_mean,timeliness_mean,Patient experience national comparison_int,chargesDaily_mean,costsDaily_mean,count,in_nyc
0,330003.0,3,3,1,1,8892.936942,4525.247373,75,0
1,330004.0,3,1,1,1,16764.629613,6562.924556,8,0
2,330005.0,3,0,1,1,14424.250314,5462.211349,2889,0
3,330006.0,3,1,1,1,7718.770202,2767.304013,284,0
4,330008.0,3,2,3,1,6884.487211,4958.64913,53,0


In [67]:
hospFeat.columns

Index(['provider_id', 'provider_type', 'type_of_control', 'number_of_beds',
       'total_days_v_xviii_xix_unknown', 'contract_labor', 'zip_code',
       'cost_of_charity_care', 'total_bad_debt_expense', 'total_costs',
       'total_ime_payment', 'disproporationate_share', 'net_patient_revenue',
       'net_revenue_from_medicaid', 'medicaid_charges',
       'net_revenue_from_stand_alone', 'stand_alone_schip_charges'],
      dtype='object')

In [70]:
hospFeat1 = hospFeat.drop(['provider_type', 'type_of_control', 'zip_code'],axis=1)
hospFeat1.head()

Unnamed: 0,provider_id,number_of_beds,total_days_v_xviii_xix_unknown,contract_labor,cost_of_charity_care,total_bad_debt_expense,total_costs,total_ime_payment,disproporationate_share,net_patient_revenue,net_revenue_from_medicaid,medicaid_charges,net_revenue_from_stand_alone,stand_alone_schip_charges
0,334015,404.0,135262.0,,,,165987411.0,,,,,,,
1,334020,185.0,62049.0,,,,60525172.0,,,,,,,
2,334021,25.0,8218.0,,,,29783192.0,,,,,,,
3,334064,,,,,,,,,,,,,
4,330403,,,,,,,,,,,,,


In [85]:
xHospFeatSparcs = pd.merge(xSparcs, hospFeat1, on='provider_id')
xHospFeatSparcs1 = xHospFeatSparcs.fillna(0)
xHospFeatSparcs1.head()

Unnamed: 0,provider_id,effectiveCare_mean,safetyOfCare_mean,timeliness_mean,Patient experience national comparison_int,chargesDaily_mean,costsDaily_mean,count,in_nyc,number_of_beds,...,cost_of_charity_care,total_bad_debt_expense,total_costs,total_ime_payment,disproporationate_share,net_patient_revenue,net_revenue_from_medicaid,medicaid_charges,net_revenue_from_stand_alone,stand_alone_schip_charges
0,330003.0,3,3,1,1,8892.936942,4525.247373,75,0,73.0,...,201240.0,9769210.0,83385847.0,0.0,87836.0,90065240.0,8849186.0,47687910.0,105008.0,568278.0
1,330004.0,3,1,1,1,16764.629613,6562.924556,8,0,149.0,...,2502719.0,7689449.0,90764599.0,1913880.0,439562.0,98463770.0,14458648.0,63772960.0,0.0,0.0
2,330005.0,3,0,1,1,14424.250314,5462.211349,2889,0,940.0,...,2805639.0,17524495.0,940242825.0,38915503.0,4073125.0,1102709000.0,181572223.0,593660924.0,4402353.0,15926351.0
3,330006.0,3,1,1,1,7718.770202,2767.304013,284,0,131.0,...,7810988.0,8332970.0,203614146.0,1833341.0,1538202.0,204365200.0,111403442.0,141595363.0,0.0,690131.0
4,330008.0,3,2,3,1,6884.487211,4958.64913,53,0,50.0,...,687093.0,1426681.0,44829421.0,0.0,40462.0,38935430.0,2582993.0,5593693.0,0.0,0.0
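
`pd.merge` defaults to an inner join, so hospitals without a match in `hospFeat1` are dropped silently, and `fillna(0)` then records every missing feature as a literal zero. A minimal sketch of how both effects could be inspected before filling, using toy frames (all names and values here are hypothetical):

```python
import pandas as pd

sparcs_toy = pd.DataFrame({"provider_id": [1.0, 2.0, 3.0],
                           "costsDaily_mean": [10.0, 20.0, 30.0]})
feat_toy = pd.DataFrame({"provider_id": [1, 2, 4],
                         "number_of_beds": [100.0, None, 50.0]})

# how='left' keeps every hospital from the first frame;
# indicator=True adds a `_merge` column saying where each row came from
checked = pd.merge(sparcs_toy, feat_toy, on="provider_id",
                   how="left", indicator=True)

# Hospitals that an inner join would have dropped
unmatched = checked.loc[checked["_merge"] == "left_only", "provider_id"].tolist()

# Missing values per column that fillna(0) would turn into zeros
missing_per_column = checked.drop(columns="_merge").isna().sum()
print(unmatched)
print(missing_per_column)
```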


# Part 3: Running Models

**Decision Tree/ Random Forest**

In [78]:
## Statistical Modelling
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.datasets.longley import load
import sklearn.preprocessing as preprocessing
from sklearn.ensemble  import RandomForestRegressor as rfr
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix

In [86]:
xHospFeatSparcs1.columns

Index(['provider_id', 'effectiveCare_mean', 'safetyOfCare_mean',
       'timeliness_mean', 'Patient experience national comparison_int',
       'chargesDaily_mean', 'costsDaily_mean', 'count', 'in_nyc',
       'number_of_beds', 'total_days_v_xviii_xix_unknown', 'contract_labor',
       'cost_of_charity_care', 'total_bad_debt_expense', 'total_costs',
       'total_ime_payment', 'disproporationate_share', 'net_patient_revenue',
       'net_revenue_from_medicaid', 'medicaid_charges',
       'net_revenue_from_stand_alone', 'stand_alone_schip_charges'],
      dtype='object')

In [111]:
# choosing the dependent and independent variables
x = xHospFeatSparcs1[['effectiveCare_mean', 'safetyOfCare_mean',
       'timeliness_mean', 'Patient experience national comparison_int',
       'in_nyc',
       'number_of_beds', 'total_days_v_xviii_xix_unknown', 'contract_labor',
       'cost_of_charity_care', 'total_bad_debt_expense', 'total_costs',
       'total_ime_payment', 'disproporationate_share', 'net_patient_revenue',
       'net_revenue_from_medicaid', 'medicaid_charges',
       'net_revenue_from_stand_alone', 'stand_alone_schip_charges']]
y1 = xHospFeatSparcs1[['chargesDaily_mean']]
y2 = xHospFeatSparcs1[['costsDaily_mean']]

In [112]:
## Standardize the data by each feature, so that each feature has 
## mean zero and standard deviation = 1
x = ((x - x.mean()) / x.std())
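
Standardizing before the split, as above, computes the mean and standard deviation over rows that later land in the test set. A sketch of the alternative, which estimates both statistics from the training rows only, is given below; the toy data, the manual split and the zero-variance guard are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
features = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(9, 2)),
                        columns=["a", "b"])

# Simple 2/3 - 1/3 split by shuffled row position
order = rng.permutation(len(features))
train, test = features.iloc[order[:6]], features.iloc[order[6:]]

# Statistics come from the training rows only
mu, sigma = train.mean(), train.std()
sigma = sigma.replace(0, 1.0)   # guard against constant columns

train_std = (train - mu) / sigma
test_std = (test - mu) / sigma   # test rows reuse the training statistics
print(train_std.mean().round(6).tolist(), train_std.std().round(6).tolist())
```

Tree-based models such as random forests are largely insensitive to this kind of rescaling, so the step matters mainly if other model families are added later.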

In [113]:
## Train-test split the data to have 1/3 test size and 2/3 train size
X1_train, X1_test, y1_train, y1_test = train_test_split(x, y1, test_size=0.33, 
                                                    random_state=42)

In [117]:
## Train-test split the data to have 1/3 test size and 2/3 train size
X2_train, X2_test, y2_train, y2_test = train_test_split(x, y2, test_size=0.33, 
                                                    random_state=42)

In [114]:
# Fit a random forest regressor on mean daily charges
rf1 = rfr(max_depth=None, n_estimators=10)
# ravel() passes y as a 1-D array, which avoids sklearn's DataConversionWarning
rf1.fit(X1_train, y1_train.values.ravel())


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [118]:
# Fit a random forest regressor on mean daily costs
rf2 = rfr(max_depth=None, n_estimators=10)
# ravel() passes y as a 1-D array, which avoids sklearn's DataConversionWarning
rf2.fit(X2_train, y2_train.values.ravel())


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [120]:
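y1_pred = rf1.predict(X1_test)
# score() returns R^2. Score rf1, the model fitted on X1_train; the output recorded
# below came from scoring a different estimator object, `rf`, so rerun to refresh it.
print('Random Forest diagnostic score for training data:',rf1.score(X1_train, y1_train))
print('Random Forest diagnostic score for testing data:',rf1.score(X1_test, y1_test))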

Random Forest diagnostic score for training data: 0.878977006097
Random Forest diagnostic score for testing data: 0.203692780999


In [121]:
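y2_pred = rf2.predict(X2_test)
# score() returns R^2. Score rf2, the model fitted on X2_train; the output recorded
# below came from scoring a different estimator object, `rf`, so rerun to refresh it.
print('Random Forest diagnostic score for training data:',rf2.score(X2_train, y2_train))
print('Random Forest diagnostic score for testing data:',rf2.score(X2_test, y2_test))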

Random Forest diagnostic score for training data: -4.26911734422
Random Forest diagnostic score for testing data: -12.464294349
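
For a scikit-learn regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A negative value, as in the output above, means the predictions fit worse than a constant prediction of the mean of `y`. The sketch below computes R² by hand on toy numbers to show the three regimes; the helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0]
print(r_squared(y, [1.0, 2.0, 3.0]))   # perfect predictions -> 1.0
print(r_squared(y, [2.0, 2.0, 2.0]))   # always predicting the mean -> 0.0
print(r_squared(y, [3.0, 3.0, 3.0]))   # worse than the mean -> -1.5
```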


In [116]:
labels = x.columns
importances = rf1.feature_importances_
indices = np.argsort(importances)[::-1]
print('Top five most important features:',np.array(labels)[indices][:5])

Top five most important features: ['total_days_v_xviii_xix_unknown' 'cost_of_charity_care'
 'net_revenue_from_medicaid' 'medicaid_charges' 'disproporationate_share']


In [122]:
labels = x.columns
importances = rf2.feature_importances_
indices = np.argsort(importances)[::-1]
print('Top five most important features:',np.array(labels)[indices][:5])

Top five most important features: ['total_days_v_xviii_xix_unknown' 'cost_of_charity_care'
 'effectiveCare_mean' 'total_bad_debt_expense' 'number_of_beds']
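
The ranking above keeps only the feature names. A small sketch of how names and importance values could be kept together in a labelled pandas Series follows; the three feature names are a subset of the real columns, but the importance values are made up for illustration.

```python
import numpy as np
import pandas as pd

labels = ["number_of_beds", "total_costs", "in_nyc"]   # illustrative subset of x.columns
importances = np.array([0.3, 0.5, 0.2])                # made-up values, not model output

# Pair each feature with its importance and sort from most to least important
ranked = pd.Series(importances, index=labels).sort_values(ascending=False)
print(ranked.head(2))
```

In the notebook, `rf1.feature_importances_` and `x.columns` would take the place of the toy arrays. With only 10 trees and no fixed `random_state`, the ranking may change from run to run.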
