## **Open Payments 2020 General Payments Data: Hospital Information for the Hospital Table**

In [1]:
# packages to import
import numpy as np
import pandas as pd

The full Open Payments general payments CSV file has 91 columns and millions of rows:
* 2020 (3.7GB) https://download.cms.gov/openpayments/PGYR20_P012023/OP_DTL_GNRL_PGYR2020_P01202023.csv



In [2]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [3]:
# Team3Folder sits at the top level of my Google Drive. If yours is elsewhere, modify the path below to include the folder hierarchy.
%cd /content/gdrive/MyDrive/Team3Folder/Data/Raw

/content/gdrive/MyDrive/Team3Folder/Data/Raw


In [4]:
# Read in columns needed for our data frame
df=pd.read_csv("OP_DTL_GNRL_PGYR2020_P01202023.csv", usecols=['Covered_Recipient_Type',
                  'Teaching_Hospital_CCN',
                  'Teaching_Hospital_ID',
                  'Teaching_Hospital_Name',
                  'Recipient_City',
                  'Recipient_State',
                  'Recipient_Zip_Code',
                  'Recipient_Country'])
df.shape


(5823162, 8)
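
With default type inference, the CCN and ID columns come back as float64 (visible in the `info()` output further down), and a float cannot hold a leading zero. A minimal sketch of reading identifier-like columns as text instead, assuming a tiny in-memory sample stands in for the real file; `sample_csv`, `wanted`, `sample_df`, and the sample rows are hypothetical:

```python
import io
import pandas as pd

# Hypothetical three-line sample standing in for the 3.7GB file.
sample_csv = io.StringIO(
    "Covered_Recipient_Type,Teaching_Hospital_CCN,Recipient_Zip_Code,Extra\n"
    "Covered Recipient Teaching Hospital,010033,02301,x\n"
    "Covered Recipient Physician,,57117,y\n"
)

wanted = ["Covered_Recipient_Type", "Teaching_Hospital_CCN", "Recipient_Zip_Code"]

# dtype=str keeps identifier-like columns as text, so leading zeros survive;
# empty fields are still read as missing values.
sample_df = pd.read_csv(sample_csv, usecols=wanted, dtype=str)
print(sample_df)
```

Whether this matters depends on how the identifiers are used downstream; it is most relevant when they are joined against other tables that store them as text.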

In [5]:
# Keep only the hospital rows by filtering out physician recipients
df_hospitals=df[df["Covered_Recipient_Type"]!="Covered Recipient Physician"]
df_hospitals.shape

(32406, 8)
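
A frame produced by boolean filtering can later trigger pandas' SettingWithCopyWarning when it is modified in place. One common way to make the intent explicit is to take a copy right after filtering. A minimal sketch on a hypothetical two-row frame (`toy` and `hospitals_only` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical data: one physician row and one hospital row.
toy = pd.DataFrame({
    "Covered_Recipient_Type": [
        "Covered Recipient Physician",
        "Covered Recipient Teaching Hospital",
    ],
    "Teaching_Hospital_Name": [None, "Example Hospital"],
})

# .copy() makes the filtered frame independent of the original.
hospitals_only = toy[toy["Covered_Recipient_Type"] != "Covered Recipient Physician"].copy()

# Edits to the copy do not touch the original frame.
hospitals_only["Teaching_Hospital_Name"] = hospitals_only["Teaching_Hospital_Name"].str.upper()
print(hospitals_only)
```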

In [6]:
# Deduplicate on the identifying columns Teaching_Hospital_ID, Teaching_Hospital_CCN and Teaching_Hospital_Name.
# Note: keep=False drops every row whose key occurs more than once, so only hospitals that appear exactly once remain;
# keep='first' would instead retain one row per hospital.
df_unique_hospitals = df_hospitals.drop_duplicates(subset=['Teaching_Hospital_CCN', 'Teaching_Hospital_ID', 'Teaching_Hospital_Name'], keep=False)
print(df_unique_hospitals.shape)
df_unique_hospitals.head(10)

(447, 8)


Unnamed: 0,Covered_Recipient_Type,Teaching_Hospital_CCN,Teaching_Hospital_ID,Teaching_Hospital_Name,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country
1757,Covered Recipient Teaching Hospital,430016.0,9287.0,Avera McKennan,Sioux Falls,SD,57117,United States
1758,Covered Recipient Teaching Hospital,360055.0,9736.0,Trumbull Memorial Hospital,Warran,OH,44483,United States
1760,Covered Recipient Teaching Hospital,240063.0,9226.0,Healtheast St Josephs Hospital,Rochester,MN,55102,United States
1761,Covered Recipient Teaching Hospital,493301.0,9371.0,Childrens Hospital of the Kings Da,Norfolk,VA,23507,United States
2517,Covered Recipient Teaching Hospital,110006.0,9426.0,ST MARYS HEALTH CARE SYSTEM INC,ATHENS,GA,30606,United States
2522,Covered Recipient Teaching Hospital,110087.0,9442.0,GWINNETT HOSPITAL SYSTEM INC,LAWRENCEVILLE,GA,30046,United States
2525,Covered Recipient Teaching Hospital,360020.0,9032.0,SUMMA HEALTH SYSTEM,AKRON,OH,44308,United States
7706,Covered Recipient Teaching Hospital,454076.0,9713.0,Harris Co Psychiatric Center,Houston,TX,77225,United States
8899,Covered Recipient Teaching Hospital,140010.0,9083.0,NorthShore University HealthSystem,Evanston,IL,60201-1613,United States
10234,Covered Recipient Teaching Hospital,453302.0,9672.0,Childrens Medical Center of Dallas,Dallas,TX,75235,United States
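
The `keep` argument of `drop_duplicates` changes which rows survive. A small sketch of the difference, using a hypothetical three-row frame in which one hospital is listed twice (the duplication is invented for illustration):

```python
import pandas as pd

# Hypothetical frame: the first hospital appears twice, the second once.
toy = pd.DataFrame({
    "Teaching_Hospital_CCN": [430016, 430016, 360055],
    "Teaching_Hospital_ID": [9287, 9287, 9736],
    "Teaching_Hospital_Name": [
        "Avera McKennan",
        "Avera McKennan",
        "Trumbull Memorial Hospital",
    ],
})
keys = ["Teaching_Hospital_CCN", "Teaching_Hospital_ID", "Teaching_Hospital_Name"]

# keep=False removes every member of a duplicated group.
only_singletons = toy.drop_duplicates(subset=keys, keep=False)

# keep="first" retains one row per distinct key.
one_per_hospital = toy.drop_duplicates(subset=keys, keep="first")

print(len(only_singletons), len(one_per_hospital))  # prints: 1 2
```

If the goal is a lookup table with one row for every hospital that received a payment, `keep="first"` is the variant that matches that goal.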


In [7]:
df_unique_hospitals.columns

Index(['Covered_Recipient_Type', 'Teaching_Hospital_CCN',
       'Teaching_Hospital_ID', 'Teaching_Hospital_Name', 'Recipient_City',
       'Recipient_State', 'Recipient_Zip_Code', 'Recipient_Country'],
      dtype='object')

In [8]:
# Rename the columns to simpler names
column_names = {
    "Teaching_Hospital_ID": "ID",
    "Teaching_Hospital_CCN": "CCN",
    "Teaching_Hospital_Name": "Hospital_Name",
    "Recipient_City": "City",
    "Recipient_State": "State",
    "Recipient_Zip_Code": "ZipCode",
    "Recipient_Country": "Country",
    "Covered_Recipient_Type": "Type"}
# Assign the result instead of using inplace=True, which avoids the SettingWithCopyWarning
# (the mapping is also no longer named dict, which would shadow the Python built-in)
df_unique_hospitals = df_unique_hospitals.rename(columns=column_names)


In [9]:
# Now that we have reduced the size, let's look at the columns
df_unique_hospitals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 447 entries, 1757 to 5822337
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Type           447 non-null    object 
 1   CCN            447 non-null    float64
 2   ID             447 non-null    float64
 3   Hospital_Name  447 non-null    object 
 4   City           447 non-null    object 
 5   State          447 non-null    object 
 6   ZipCode        447 non-null    object 
 7   Country        447 non-null    object 
dtypes: float64(2), object(6)
memory usage: 31.4+ KB


In [10]:
# Let's find out how many nulls are in each column
df_unique_hospitals.isna().sum()

Type             0
CCN              0
ID               0
Hospital_Name    0
City             0
State            0
ZipCode          0
Country          0
dtype: int64

In [11]:
# Drop all rows where 'CCN' is null (none are present here, so this is a no-op)
df_unique_hospitals = df_unique_hospitals.dropna(subset=['CCN'])
df_unique_hospitals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 447 entries, 1757 to 5822337
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Type           447 non-null    object 
 1   CCN            447 non-null    float64
 2   ID             447 non-null    float64
 3   Hospital_Name  447 non-null    object 
 4   City           447 non-null    object 
 5   State          447 non-null    object 
 6   ZipCode        447 non-null    object 
 7   Country        447 non-null    object 
dtypes: float64(2), object(6)
memory usage: 31.4+ KB


In [12]:
# Let's find out how many nulls are left in each column
df_unique_hospitals.isna().sum()

Type             0
CCN              0
ID               0
Hospital_Name    0
City             0
State            0
ZipCode          0
Country          0
dtype: int64

In [13]:
# ID and CCN are the only numeric columns; let's take a look
df_unique_hospitals[['ID', 'CCN']].head()

             ID       CCN
1757     9287.0  430016.0
1758     9736.0  360055.0
1760     9226.0  240063.0
1761     9371.0  493301.0
2517     9426.0  110006.0

In [14]:
# They appear to be integer values stored as floats. Let's convert these columns to ints.
df_unique_hospitals[['ID', 'CCN']]=df_unique_hospitals[['ID', 'CCN']].astype(int)
df_unique_hospitals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 447 entries, 1757 to 5822337
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Type           447 non-null    object
 1   CCN            447 non-null    int64 
 2   ID             447 non-null    int64 
 3   Hospital_Name  447 non-null    object
 4   City           447 non-null    object
 5   State          447 non-null    object
 6   ZipCode        447 non-null    object
 7   Country        447 non-null    object
dtypes: int64(2), object(6)
memory usage: 31.4+ KB
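
`astype(int)` silently truncates any fractional part and raises an error if a missing value is present, so it can help to confirm the assumption before casting. A minimal sketch, assuming a hypothetical two-row frame `ids` with the same column names:

```python
import pandas as pd

# Hypothetical float-typed identifiers, as produced by default CSV parsing.
ids = pd.DataFrame({"ID": [9287.0, 9736.0], "CCN": [430016.0, 360055.0]})

# True only if every value equals its own rounded value (NaN compares unequal).
is_whole = bool((ids == ids.round()).all().all())

if is_whole:
    # Safe to cast: no fractional parts and no missing values.
    ids = ids.astype("int64")

print(ids.dtypes)
```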


In [15]:
df_unique_hospitals[['ID', 'CCN']].head(10)

Unnamed: 0,ID,CCN
1757,9287,430016
1758,9736,360055
1760,9226,240063
1761,9371,493301
2517,9426,110006
2522,9442,110087
2525,9032,360020
7706,9713,454076
8899,9083,140010
10234,9672,453302


In [16]:
# Now set the index of the df_unique_hospitals data frame to the CCN column
df_unique_hospitals.set_index('CCN', inplace=True)
df_unique_hospitals.head()

CCN,Type,ID,Hospital_Name,City,State,ZipCode,Country
430016,Covered Recipient Teaching Hospital,9287,Avera McKennan,Sioux Falls,SD,57117,United States
360055,Covered Recipient Teaching Hospital,9736,Trumbull Memorial Hospital,Warran,OH,44483,United States
240063,Covered Recipient Teaching Hospital,9226,Healtheast St Josephs Hospital,Rochester,MN,55102,United States
493301,Covered Recipient Teaching Hospital,9371,Childrens Hospital of the Kings Da,Norfolk,VA,23507,United States
110006,Covered Recipient Teaching Hospital,9426,ST MARYS HEALTH CARE SYSTEM INC,ATHENS,GA,30606,United States


In [17]:
# The Hospital_Name and City columns have mixed case. To keep things simple, and based on the dashboard, let's make them consistent.
# Note: str.capitalize() uppercases only the first character of each value and lowercases the rest.
df_unique_hospitals['Hospital_Name']=df_unique_hospitals['Hospital_Name'].str.capitalize()
df_unique_hospitals['City']=df_unique_hospitals['City'].str.capitalize()
df_unique_hospitals

CCN,Type,ID,Hospital_Name,City,State,ZipCode,Country
430016,Covered Recipient Teaching Hospital,9287,Avera mckennan,Sioux falls,SD,57117,United States
360055,Covered Recipient Teaching Hospital,9736,Trumbull memorial hospital,Warran,OH,44483,United States
240063,Covered Recipient Teaching Hospital,9226,Healtheast st josephs hospital,Rochester,MN,55102,United States
493301,Covered Recipient Teaching Hospital,9371,Childrens hospital of the kings da,Norfolk,VA,23507,United States
110006,Covered Recipient Teaching Hospital,9426,St marys health care system inc,Athens,GA,30606,United States
...,...,...,...,...,...,...,...
220111,Covered Recipient Teaching Hospital,8963,Good samaritan medical center,Brockton,MA,02301,United States
330181,Covered Recipient Teaching Hospital,8696,Glen cove hospital,Glen cove,NY,11542,United States
233027,Covered Recipient Teaching Hospital,8965,Rehabilitation institute of michigan,Detroit,MI,48201,United States
143027,Covered Recipient Teaching Hospital,9114,Marianjoy rehab hospital & clinic,Wheaton,IL,60187,United States
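
The output above shows values such as "Sioux falls", because `str.capitalize()` treats each value as a single string. If per-word capitalization is wanted, `str.title()` is one alternative. A short comparison on two hypothetical values taken from the style of the data:

```python
import pandas as pd

names = pd.Series(["ST MARYS HEALTH CARE SYSTEM INC", "Sioux Falls"])

# capitalize(): first character upper, everything else lower.
capitalized = names.str.capitalize().tolist()

# title(): first character of every word upper, the rest lower.
titled = names.str.title().tolist()

print(capitalized)  # ['St marys health care system inc', 'Sioux falls']
print(titled)       # ['St Marys Health Care System Inc', 'Sioux Falls']
```

Note that `str.title()` also capitalizes the letter after an apostrophe (for example "Mary'S"), so names containing apostrophes may need separate handling.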


In [21]:
df_unique_hospitals.to_csv('/content/gdrive/My Drive/Team3Folder/Data/Processed/OP_2020_hospitals.csv', encoding='utf-8')
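
When the processed file is read back, pandas infers column types again, so a ZIP code such as 02301 would be parsed as the integer 2301 unless the column is declared as text. A minimal sketch, assuming an in-memory stand-in for the saved file (`saved` and `reloaded` are hypothetical names, and only three of the columns are reproduced):

```python
import io
import pandas as pd

# Hypothetical stand-in for the saved CSV, with CCN as the first column.
saved = io.StringIO(
    "CCN,Hospital_Name,ZipCode\n"
    "220111,Good samaritan medical center,02301\n"
)

# index_col restores CCN as the index; dtype keeps ZipCode as text.
reloaded = pd.read_csv(saved, index_col="CCN", dtype={"ZipCode": str})
print(reloaded)
```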