##### 1_downloadHospitalMasterCharges

This file reads in the Google Sheet found [here](https://docs.google.com/spreadsheets/d/1F8yPe-2uMcAOOzRmYnFenC77GiXnJ0afnRXtBUYe5TQ/edit?usp=sharing).
In this Google Sheet, we've compiled data from our research on hospitals in NY state and NYC. There are two main sources of data (NYS Health and CMS). The two sources do not match up cleanly with each other due to differences in provider and facility ids. We had to do some manual mapping where we found that multiple facility ids could map to the same provider id.
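
The many-to-one relationship described above can be checked with a short pandas sketch. The ids below are made up for illustration; only the column names `Provider ID` and `fac_id` come from the sheet.

```python
import pandas as pd

# Hypothetical crosswalk rows: two facility ids share one provider id.
crosswalk = pd.DataFrame({
    "Provider ID": ["330001", "330001", "330002"],
    "fac_id": ["10", "11", "20"],
})

# Count the distinct facility ids attached to each provider id.
fac_per_provider = crosswalk.groupby("Provider ID")["fac_id"].nunique()

# Provider ids that cover more than one facility and so needed manual mapping.
shared = fac_per_provider[fac_per_provider > 1]
print(shared)
```

Running a check like this on the real crosswalk would list the provider ids that deserve a second look.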

In addition, we had to browse each hospital's website manually to locate the machine-readable mastercharge file. The location is not uniform and varies greatly from hospital to hospital. Further, some hospitals did not publish this data at all; we had to call them and leave our contact information.

Lastly, different hospitals reported their mastercharges differently. Some used MS-DRG coding, some used APR-DRG coding, and some used no coding at all.

This file exports:

- nycHospitalsProviderIdFacilityIdCrossWalk.csv
    - A crosswalk of the provider and facility ids for the NYC hospitals listed in the CMS data.

- nysHospitalsProviderIdFacilityIdCrossWalk.csv
    - A crosswalk of the provider and facility ids for NY state hospitals; it also includes some hospital rating data.

The main source data is retrieved from: https://data.medicare.gov/Hospital-Compare/Hospital-General-Information/xubh-q36u

In [1]:
import os
import sys
import json
import numpy as np
import pandas as pd
import re
import csv

from googleapiclient.discovery import build # library needed for the Google Sheets API
from httplib2 import Http
from oauth2client import file, client, tools

import urllib.request # imported explicitly: urlretrieve lives in the urllib.request submodule
import requests

print(sys.version)

3.7.2 (default, Dec 29 2018, 06:19:36) 
[GCC 7.3.0]


In [2]:
# reference using google API and Medium posts on how to access the Google API
# https://towardsdatascience.com/how-to-access-google-sheet-data-using-the-python-api-and-convert-to-pandas-dataframe-5ec020564f0e
# https://developers.google.com/sheets/api/guides/concepts#spreadsheet_id

# the following function was modified from Medium link sourced above and the quickstart.py file from google
def get_google_sheet(spreadsheet_id, range_name):
    """ Retrieve sheet data using OAuth credentials and Google Python API. """
    scopes = 'https://www.googleapis.com/auth/spreadsheets.readonly'
    # Setup the Sheets API
    store = file.Storage(os.getenv("HOME")+'/keys/'+'token.json')
    creds = store.get()
    if not creds or creds.invalid:
        flow = client.flow_from_clientsecrets(os.getenv("HOME")+'/keys/'+'credentials.json', scopes)
        creds = tools.run_flow(flow, store)
    service = build('sheets', 'v4', http=creds.authorize(Http()))

    # Call the Sheets API
    gsheet = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=range_name).execute()
    return gsheet

In [3]:
# this is a google sheet that was downloaded from https://data.medicare.gov/Hospital-Compare/Hospital-General-Information/xubh-q36u
# we had to manually look up the link to each mastercharge file on each hospitals' website
# google sheet link https://docs.google.com/spreadsheets/d/1F8yPe-2uMcAOOzRmYnFenC77GiXnJ0afnRXtBUYe5TQ/edit?usp=sharing

SHEET_ID = '1F8yPe-2uMcAOOzRmYnFenC77GiXnJ0afnRXtBUYe5TQ' 
RANGE = 'nycHospitalsProviderIds' # name of the sheet (tab) to read; the whole tab is returned

gsheet = get_google_sheet(SHEET_ID, RANGE)



In [4]:
gsheet.keys() # check the keys of the gsheet

dict_keys(['range', 'majorDimension', 'values'])

In [5]:
## turn the list of hospitals from gsheet into pd df
nycHospitals = pd.DataFrame.from_records(gsheet['values'][1:], columns=gsheet['values'][0]) 
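
As far as I know, the Sheets API omits trailing empty cells from each row of `values`, so a row can be shorter than the header. The sketch below pads such rows before building the dataframe. `values_to_frame` is a hypothetical helper, not part of this notebook, and the sample rows are invented.

```python
import pandas as pd

def values_to_frame(values):
    """Build a DataFrame from a Sheets `values` list whose first row is the header."""
    header, rows = values[0], values[1:]
    width = len(header)
    # Pad short rows with empty strings and cut over-long ones,
    # so that every row has exactly one cell per header column.
    padded = [list(row[:width]) + [""] * (width - len(row)) for row in rows]
    return pd.DataFrame(padded, columns=header)

# Made-up response fragment: the second data row is missing its last cell.
sample = [
    ["Provider ID", "fac_id", "Hospital Name"],
    ["330001", "10", "EXAMPLE HOSPITAL A"],
    ["330002", "20"],
]
frame = values_to_frame(sample)
print(frame)
```

With the real response this would be called as `values_to_frame(gsheet['values'])`.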

In [6]:
nycHospitals.head(3) # to see the dataframe

Unnamed: 0,in_nyc,Provider ID,fac_id,Hospital Name,Main webpage link,DRG webpage link,All hospital charges link,Remarks on chargemaster link,manualDownload,Address,...,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
0,True,330204,1438,BELLEVUE HOSPITAL CENTER,https://www.bellevuehospital.com/patient-pricing,,https://www.bellevuehospital.com/sites/default...,"csv file, no DRG",no,462 FIRST AVENUE,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"462 FIRST AVENUE NEW YORK, NY (40.740079, -73...."
1,True,330009,1164,BRONX-LEBANON HOSPITAL CENTER,https://www.bronxcare.org/about-us/paying-for-...,https://www.bronxcare.org/fileadmin/SiteFiles/...,https://www.bronxcare.org/fileadmin/SiteFiles/...,"2017 data, Excel file; for DRG, split into \nA...",no,1276 FULTON AVENUE,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"1276 FULTON AVENUE BRONX, NY (40.83175, -73.90..."
2,True,330233,1286,BROOKDALE HOSPITAL MEDICAL CENTER,http://www.brookdalehospital.org/charge-master...,http://www.brookdalehospital.org/assets/charge...,http://www.brookdalehospital.org/assets/brookd...,Excel file,no,1 BROOKDALE PLAZA,...,,Below the national average,,Below the national average,,Below the national average,,Same as the national average,,"1 BROOKDALE PLAZA BROOKLYN, NY (40.6545, -73.9..."


In [7]:
RANGE2 = 'MSDRG Types'
gsheet2 = get_google_sheet(SHEET_ID, RANGE2)
drgTypes = pd.DataFrame.from_records(gsheet2['values'][1:], columns=gsheet2['values'][0])
drgTypes.head()

Unnamed: 0,Provider ID,Hospital Name,C-Section?,DRG Type
0,330009,BRONX-LEBANON HOSPITAL CENTER,Yes,APR-DRG
1,330196,CONEY ISLAND HOSPITAL,Yes,APR-DRG
2,330128,ELMHURST HOSPITAL CENTER,Yes,APR-DRG
3,330240,HARLEM HOSPITAL CENTER,Yes,APR-DRG
4,330127,JACOBI MEDICAL CENTER,Yes,APR-DRG


In [8]:
pd.merge(drgTypes, nycHospitals[['Provider ID', 'fac_id']], on='Provider ID').to_csv('dataFiles/nycHospitalsProviderIdFacilityIdCrossWalk.csv')
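
The default inner merge silently drops any provider id that appears in only one of the two frames. A minimal sketch of how unmatched rows could be surfaced before exporting, using invented rows rather than the real sheet data:

```python
import pandas as pd

# Made-up stand-ins for drgTypes and nycHospitals.
drg = pd.DataFrame({
    "Provider ID": ["330001", "330002", "330003"],
    "DRG Type": ["APR-DRG", "MS-DRG", "APR-DRG"],
})
hospitals = pd.DataFrame({
    "Provider ID": ["330001", "330002"],
    "fac_id": ["10", "20"],
})

# A left merge with indicator=True adds a `_merge` column that records
# whether each row found a partner in the right-hand frame.
checked = pd.merge(drg, hospitals, on="Provider ID", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Provider ID"].tolist()
print(unmatched)
```

Passing `index=False` to `to_csv` would also keep the row index out of the exported file, if that column is not wanted.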

In [9]:
nycHospitals['url'] = np.where(nycHospitals['DRG webpage link']=='NA',
                               nycHospitals['All hospital charges link'], 
                               nycHospitals['DRG webpage link'])
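
The comparison above only catches the literal string `'NA'`. If the sheet can also hold blank or missing cells (an assumption, since the raw cell values are not shown here), a broader fallback could look like this sketch with invented links:

```python
import numpy as np
import pandas as pd

# Made-up rows covering a real link, 'NA', an empty string and a missing value.
links = pd.DataFrame({
    "DRG webpage link": ["https://example.org/drg.xlsx", "NA", "", None],
    "All hospital charges link": ["https://example.org/all.csv"] * 4,
})

drg = links["DRG webpage link"]
# Treat missing values, blank strings and the literal 'NA' as "no DRG link".
missing = drg.isna() | drg.fillna("").str.strip().isin(["", "NA"])
links["url"] = np.where(missing, links["All hospital charges link"], drg)
print(links["url"].tolist())
```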

In [10]:
nycHospitals.columns

Index(['in_nyc', 'Provider ID', 'fac_id', 'Hospital Name', 'Main webpage link',
       'DRG webpage link', 'All hospital charges link',
       'Remarks on chargemaster link', 'manualDownload', 'Address', 'City',
       'State', 'ZIP Code', 'County Name', 'Phone Number', 'Hospital Type',
       'Hospital Ownership', 'Emergency Services',
       'Meets criteria for meaningful use of EHRs', 'Hospital overall rating',
       'Hospital overall rating footnote', 'Mortality national comparison',
       'Mortality national comparison footnote',
       'Safety of care national comparison',
       'Safety of care national comparison footnote',
       'Readmission national comparison',
       'Readmission national comparison footnote',
       'Patient experience national comparison',
       'Patient experience national comparison footnote',
       'Effectiveness of care national comparison',
       'Effectiveness of care national comparison footnote',
       'Timeliness of care national compariso

In [11]:
nycHospitals.drop(['in_nyc', 'Main webpage link',
       'DRG webpage link', 'All hospital charges link',
       'Remarks on chargemaster link', 'Address', 'City', 'State', 'ZIP Code',
       'County Name', 'Phone Number', 'Hospital Type', 'Hospital Ownership',
       'Emergency Services', 'Meets criteria for meaningful use of EHRs',
       'Hospital overall rating', 'Hospital overall rating footnote',
       'Mortality national comparison',
       'Mortality national comparison footnote',
       'Safety of care national comparison',
       'Safety of care national comparison footnote',
       'Readmission national comparison',
       'Readmission national comparison footnote',
       'Patient experience national comparison',
       'Patient experience national comparison footnote',
       'Effectiveness of care national comparison',
       'Effectiveness of care national comparison footnote',
       'Timeliness of care national comparison',
       'Timeliness of care national comparison footnote',
       'Efficient use of medical imaging national comparison',
       'Efficient use of medical imaging national comparison footnote',
       'Location'], axis = 1, inplace = True)

In [12]:
nycHospitals['format'] = nycHospitals['url'].str.split('.').str[-1:]
nycHospitals.head()

Unnamed: 0,Provider ID,fac_id,Hospital Name,manualDownload,url,format
0,330204,1438,BELLEVUE HOSPITAL CENTER,no,https://www.bellevuehospital.com/sites/default...,[csv]
1,330009,1164,BRONX-LEBANON HOSPITAL CENTER,no,https://www.bronxcare.org/fileadmin/SiteFiles/...,[xlsx]
2,330233,1286,BROOKDALE HOSPITAL MEDICAL CENTER,no,http://www.brookdalehospital.org/assets/charge...,[xlsx]
3,330056,1288,BROOKLYN HOSPITAL CENTER AT DOWNTOWN CAMPUS,no,https://www.tbh.org/sites/default/files/Brookl...,[csv]
4,330196,1294,CONEY ISLAND HOSPITAL,no,https://hhinternet.blob.core.windows.net/uploa...,[xlsx]
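
Splitting the URL on `.` works when the link ends in a plain file name, but it can misfire on links with a query string or without a file extension. A hedged alternative parses the URL path first; `url_extension` is a hypothetical helper and the example URLs are invented.

```python
import os
from urllib.parse import unquote, urlparse

def url_extension(url):
    """Return the lower-cased file extension of a URL path, or '' if there is none."""
    path = unquote(urlparse(url).path)  # drops the query string and decodes %20 etc.
    return os.path.splitext(path)[1].lstrip(".").lower()

print(url_extension("https://example.org/files/Charges%20List.XLSX?download=1"))
print(url_extension("https://example.org/patient-pricing"))
```

This returns a plain string rather than a one-element list, so code that indexes the result with `[0]` would need adjusting before swapping it in.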


In [13]:
nycHospitals.columns = ['providerId', 'fac_id', 'hospitalName', 'manualDownload','url', 'format']

In [14]:
# replace punctuation with spaces, title-case the name, and split it into a list of words
nycHospitals['tempName'] = nycHospitals['hospitalName'].str.translate({ord(c): " " for c in "!@#$%^&*()[]{};:,./<>?\\|`~-=_+"}).str.title().str.split()
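
The word list built here is later joined into a camelCase file name. The same idea can be sketched as a single function; `camel_case` is a hypothetical helper and is not an exact replacement, since it also drops apostrophes, which the translation table above leaves in place.

```python
import re

def camel_case(name):
    """Turn a hospital name into a camelCase token, ignoring punctuation."""
    words = re.findall(r"[A-Za-z0-9]+", name)  # keep only runs of letters and digits
    if not words:
        return ""
    titled = [word.capitalize() for word in words]
    return titled[0].lower() + "".join(titled[1:])

print(camel_case("BRONX-LEBANON HOSPITAL CENTER"))
```

Appending the provider id and extension then gives a file name such as `camel_case("BELLEVUE HOSPITAL CENTER") + "330204" + ".csv"`.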

In [15]:
l = len(nycHospitals)

In [17]:
destFol = os.path.join(os.getcwd(), 'dataFiles', 'rawHospitalChargeData')
os.makedirs(destFol, exist_ok=True) # make sure the destination folder exists
filenames = []

for i in range(l):
    url = nycHospitals['url'][i]
    # camelCase the hospital name, e.g. ['Bellevue', 'Hospital', 'Center'] -> 'bellevueHospitalCenter'
    name = nycHospitals['tempName'][i][0].lower() + ''.join(nycHospitals['tempName'][i][1:])
    pid = nycHospitals['providerId'][i]
    ext = nycHospitals['format'][i][0]

    if ext in ('csv', 'xlsx', 'xls'):
        filename = name + pid + '.' + ext
        try:
            # download straight into the destination folder
            urllib.request.urlretrieve(url, os.path.join(destFol, filename))
        except Exception:
            print('An error occurred in :  Download Manually ', filename, '\n' + url)
    else:
        # unrecognised extension: the file has to be fetched by hand and is assumed to be saved as .xlsx
        filename = name + pid + '.xlsx'
        print(i, name + pid, "_____Requires Manual Download")
    # record the name even when the download failed, so the list stays aligned with the rows
    filenames.append(filename)

An error occurred in :  Download Manually  lenoxHillHospital330119.xlsx 
https://www.northwell.edu/sites/northwell.edu/files/inline-files/Northwell%20Health%20-%20CMS%20Mandate%20UPLOAD%20Files%2012192018_2.xlsx
An error occurred in :  Download Manually  longIslandJewishMedicalCenter330195.xlsx 
https://www.northwell.edu/sites/northwell.edu/files/inline-files/Northwell%20Health%20-%20CMS%20Mandate%20UPLOAD%20Files%2012192018_2.xlsx
An error occurred in :  Download Manually  mountSinaiBethIsrael330169.xlsx 
https://www.mountsinai.org/files/MSHealth/Assets/HS/About/Chargemaster_MSH.xlsx
An error occurred in :  Download Manually  mountSinaiHospital330024.xlsx 
https://www.mountsinai.org/files/MSHealth/Assets/HS/About/Average%20Charges%20by%20DRG_MSH.xlsx
An error occurred in :  Download Manually  mountSinaiWest330046.xlsx 
https://www.mountsinai.org/files/MSHealth/Assets/HS/About/Average%20Charge%20by%20DRG_MSL.xlsx
An error occurred in :  Download Manually  nYEyeAndEarInfirmary330100.xls
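
The log above does not say why these downloads failed. One possibility (an assumption, not something the output confirms) is that some servers reject the default `urllib` user agent. The sketch below sends a browser-like header and collects failures with their error messages. The header value and all function names are hypothetical.

```python
import shutil
import urllib.request

# Assumed header value; any browser-like string could be tried instead.
USER_AGENT = "Mozilla/5.0 (compatible; chargemaster-downloader)"

def build_request(url):
    """Wrap a URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def download(url, dest_path, timeout=30):
    """Stream the response body to dest_path."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)

def download_all(rows, fetch=download):
    """Try every (url, dest_path) pair and return the failures with their errors."""
    failures = []
    for url, dest_path in rows:
        try:
            fetch(url, dest_path)
        except Exception as err:  # keep going, but remember what went wrong
            failures.append((url, dest_path, repr(err)))
    return failures
```

Keeping the error text next to each failed URL makes it easier to tell a blocked request from a dead link when deciding what to download by hand.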

In [18]:
gsheet2 = get_google_sheet(SHEET_ID, 'nystatehospitalsProviderIds')

In [19]:
## turn the list of hospitals from gsheet into pd df
nysHospitals = pd.DataFrame.from_records(gsheet2['values'][1:], columns=gsheet2['values'][0]) 

In [20]:
## write out a csv copy of the file to archive
nysHospitals.to_csv('dataFiles/cms/nysHospitalsProviderIdFacilityIdCrossWalk.csv')
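
When the archived CSV is read back, pandas will by default parse id columns as numbers, which would drop any leading zeros. A small round-trip sketch, using an in-memory buffer and invented ids, shows one way to keep the ids as text:

```python
import io
import pandas as pd

# Made-up ids; the leading zeros in fac_id are hypothetical.
frame = pd.DataFrame({"Provider ID": ["330001"], "fac_id": ["0010"]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)  # index=False avoids writing an extra unnamed index column
buffer.seek(0)

# dtype=str keeps every column as text instead of inferring integers.
restored = pd.read_csv(buffer, dtype=str)
print(restored)
```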

In [21]:
nysHospitals.head()

Unnamed: 0,in_nyc,Provider ID,fac_id,Hospital Name,Address,City,State,ZIP Code,County Name,Phone Number,...,Readmission national comparison footnote,Patient experience national comparison,Patient experience national comparison footnote,Effectiveness of care national comparison,Effectiveness of care national comparison footnote,Timeliness of care national comparison,Timeliness of care national comparison footnote,Efficient use of medical imaging national comparison,Efficient use of medical imaging national comparison footnote,Location
0,False,330013,1,ALBANY MEDICAL CENTER HOSPITAL,"43 NEW SCOTLAND AVENUE, MAIL CODE 34",ALBANY,NY,12208,ALBANY,5182622400,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"43 NEW SCOTLAND AVENUE, MAIL CODE 34 ALBANY, N..."
1,False,330003,4,ALBANY MEMORIAL HOSPITAL,600 NORTHERN BOULEVARD,ALBANY,NY,12204,ALBANY,5184713490,...,,Below the national average,,Same as the national average,,Below the national average,,Not Available,Results are not available for this reporting p...,"600 NORTHERN BOULEVARD ALBANY, NY (42.674685, ..."
2,False,330057,5,ST PETER'S HOSPITAL,315 SOUTH MANNING BOULEVARD,ALBANY,NY,12208,ALBANY,5185251550,...,,Below the national average,,Same as the national average,,Below the national average,,Same as the national average,,"315 SOUTH MANNING BOULEVARD ALBANY, NY (42.660..."
3,False,331301,37,"CUBA MEMORIAL HOSPITAL, INC",140 WEST MAIN STREET,CUBA,NY,14727,ALLEGANY,5859612000,...,Results are not available for this reporting p...,Not Available,There are too few measures or measure groups r...,Not Available,Results are not available for this reporting p...,Same as the national average,,Not Available,There are too few measures or measure groups r...,"140 WEST MAIN STREET CUBA, NY (42.213341, -78...."
4,False,330096,39,JONES MEMORIAL HOSPITAL,191 NORTH MAIN STREET,WELLSVILLE,NY,14895,ALLEGANY,5855931100,...,,Below the national average,,Same as the national average,,Same as the national average,,Not Available,Results are not available for this reporting p...,"191 NORTH MAIN STREET WELLSVILLE, NY (42.12287..."
