# hospital

The `hospital` table is a simple table that summarizes high-level hospital information. Unlike other tables, it does not contain any patient identifiers; it can only be joined to the `patient` table using `hospitalid`.

**Note: many hospitals described in the hospital table have 0 patient admissions in the patient table**.
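As a rough illustration of the join on `hospitalid`, the sketch below builds two tiny in-memory tables and counts admissions per hospital. The rows are made up, SQLite stands in for PostgreSQL, and counting with `LEFT JOIN` plus `COUNT` is just one possible approach, chosen here because it reports 0 for hospitals with no matching patient rows.

```python
import sqlite3

# Hypothetical toy data; SQLite is used only so the sketch runs without a database server
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE hospital (hospitalid INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE patient (patientunitstayid INTEGER PRIMARY KEY, hospitalid INTEGER);
INSERT INTO hospital VALUES (1, 'Midwest'), (2, 'South'), (3, 'West');
INSERT INTO patient VALUES (10, 1), (11, 1), (12, 2);
""")

# COUNT(column) skips NULLs, so a hospital with no patient rows gets n = 0
admissions = conn.execute("""
SELECT h.hospitalid, COUNT(p.patientunitstayid) AS n
FROM hospital h
LEFT JOIN patient p ON h.hospitalid = p.hospitalid
GROUP BY h.hospitalid
ORDER BY h.hospitalid
""").fetchall()

print(admissions)  # [(1, 2), (2, 1), (3, 0)]
```

Hospital 3 has no rows in the toy `patient` table, so it appears with a count of 0 instead of being dropped.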

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
import pdvega

# for configuring connection 
from configobj import ConfigObj
import os

%matplotlib inline

In [2]:
# Create a database connection using settings from config file
config='../db/config.ini'

# connection info
conn_info = dict()
if os.path.isfile(config):
    config = ConfigObj(config)
    conn_info["sqluser"] = config['username']
    conn_info["sqlpass"] = config['password']
    conn_info["sqlhost"] = config['host']
    conn_info["sqlport"] = config['port']
    conn_info["dbname"] = config['dbname']
    conn_info["schema_name"] = config['schema_name']
else:
    conn_info["sqluser"] = 'postgres'
    conn_info["sqlpass"] = ''
    conn_info["sqlhost"] = 'localhost'
    conn_info["sqlport"] = 5432
    conn_info["dbname"] = 'eicu'
    conn_info["schema_name"] = 'public,eicu_crd'
    
# Connect to the eICU database
print('Database: {}'.format(conn_info['dbname']))
print('Username: {}'.format(conn_info["sqluser"]))
if conn_info["sqlpass"] == '':
    # try connecting without password, i.e. peer or OS authentication
    try:
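        # Compare the port as a string: it is an int by default but a str when read from the config file
        if conn_info["sqlhost"] == 'localhost' and str(conn_info["sqlport"]) == '5432':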
            con = psycopg2.connect(dbname=conn_info["dbname"],
                                   user=conn_info["sqluser"])            
        else:
            con = psycopg2.connect(dbname=conn_info["dbname"],
                                   host=conn_info["sqlhost"],
                                   port=conn_info["sqlport"],
                                   user=conn_info["sqluser"])
    except psycopg2.OperationalError:
        conn_info["sqlpass"] = getpass.getpass('Password: ')

        con = psycopg2.connect(dbname=conn_info["dbname"],
                               host=conn_info["sqlhost"],
                               port=conn_info["sqlport"],
                               user=conn_info["sqluser"],
                               password=conn_info["sqlpass"])
query_schema = 'set search_path to ' + conn_info['schema_name'] + ';'

Database: eicu
Username: postgres


In [3]:
from sqlalchemy import create_engine
# Use a SQLAlchemy engine for pandas; this replaces the psycopg2 connection created above
con = create_engine('postgresql://eicu@localhost:5432/eicu')

In [34]:
pd.options.display.max_rows = None  # no limit on the number of rows displayed in a notebook cell
pd.options.display.max_columns = None  # no limit on the number of columns displayed in a notebook cell

## Compare data completeness for hospitals with and without admissions

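`COALESCE`
- Replaces null values with another value.
- Returns the first non-null value.
- `COALESCE(expression_1, expression_2, ..., expression_n)` is a function that evaluates its arguments in order,
<br>stops at the first non-null value, and returns it. If every expression is null, the result is null. `COALESCE` is useful because most expressions that involve a null value themselves evaluate to null.
<br> `select coalesce(success_cnt, 1) from tableA`
<br> returns 1 when `success_cnt` is null, and the actual value of `success_cnt` otherwise.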
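The behaviour described above can be tried out with a small self-contained sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL (an assumption made only so the example runs without a server), and the contents of `tableA` are made up.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate COALESCE
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tableA (id INTEGER, success_cnt INTEGER)')
conn.executemany('INSERT INTO tableA VALUES (?, ?)', [(1, 5), (2, None), (3, 0)])

# NULL is replaced by 1; non-null values (including 0) are returned unchanged
rows = conn.execute(
    'SELECT id, COALESCE(success_cnt, 1) FROM tableA ORDER BY id'
).fetchall()
print(rows)  # [(1, 5), (2, 1), (3, 0)]

# If every argument is NULL, the result is NULL (None in Python)
all_null = conn.execute('SELECT COALESCE(NULL, NULL)').fetchone()[0]
print(all_null)  # None
```

Note that a stored 0 is not null, so `COALESCE` leaves it as 0.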



In [4]:
query = query_schema + """
with tt as
(
select hospitalid, count(*) as n
from patient
group by hospitalid
)
select h.*, coalesce(tt.n, 0) as n
from hospital h
left join tt
on h.hospitalid = tt.hospitalid
"""

# query = query_schema + """
# with tt as
# (
# select hospitalid, count(*) as n
# from patient
# group by hospitalid
# )
# select h.*, 
#     case 
#         when tt.n is not null then tt.n 
#         else 0 
#     end as n
# from hospital h
# left join tt
# on h.hospitalid = tt.hospitalid
# """
df = pd.read_sql_query(query, con)
df.head(n=5)

|  | hospitalid | numbedscategory | teachingstatus | region | n |
|---|---|---|---|---|---|
| 0 | 56 | <100 | False | Midwest | 325 |
| 1 | 58 | 100 - 249 | False | Midwest | 321 |
| 2 | 59 | <100 | False | Midwest | 854 |
| 3 | 60 | <100 | False | Midwest | 458 |
| 4 | 61 | <100 | False | Midwest | 233 |


In [5]:
print('{} hospitals have 0 admissions.'.format(df.loc[df['n']==0,'hospitalid'].nunique()))


0 hospitals have 0 admissions.


## Data completeness among hospitals with admissions

First, replace missing values with the placeholder 'Missing' so that the groupby reports hospitals with missing data as their own category. We also define a convenience function that reports the absolute count and the percentage of the total.

In [19]:
def count_with_percent(x, N):
    """Format the non-null count of x and its share of N, e.g. ' 70 (33.65%)'."""
    return '{:3d} ({:5.2f}%)'.format(x.count(), x.count()*100.0/N)
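To see what the format string produces, here is a minimal sketch that applies the same template to plain integers. `format_count_percent` is a hypothetical stand-in for the function above, which takes a pandas object instead; the counts used are taken from the region output further down.

```python
def format_count_percent(count, total):
    """Hypothetical helper: same format string as count_with_percent, applied to plain integers."""
    # {:3d} right-aligns the count in 3 characters; {:5.2f} gives 2 decimals in a width of 5
    return '{:3d} ({:5.2f}%)'.format(count, count * 100.0 / total)

print(format_count_percent(70, 208))  # ' 70 (33.65%)'
print(format_count_percent(13, 208))  # ' 13 ( 6.25%)'
```

The fixed field widths are what keep the counts and percentages aligned in the printed groupby output.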

In [20]:
for c in ['region','numbedscategory','teachingstatus']:
    # Replace missing values with an explicit category so that groupby keeps them
    df[c] = df[c].fillna('Missing')
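The reason for filling in a placeholder is that pandas `groupby` drops rows whose key is NaN by default. The sketch below shows this on a small made-up DataFrame (the `toy` data is hypothetical and unrelated to the real table).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five hospitals have no region recorded
toy = pd.DataFrame({
    'hospitalid': [1, 2, 3, 4, 5],
    'region': ['Midwest', 'South', np.nan, 'Midwest', np.nan],
})

# By default, rows with a NaN group key are silently excluded
counts_default = toy.groupby('region')['hospitalid'].count()

# After filling the placeholder, the missing rows form their own group
toy['region'] = toy['region'].fillna('Missing')
counts_filled = toy.groupby('region')['hospitalid'].count()

print(counts_default.to_dict())  # {'Midwest': 2, 'South': 1}
print(counts_filled.to_dict())   # {'Midwest': 2, 'Missing': 2, 'South': 1}
```

An alternative with the same effect is passing `dropna=False` to `groupby`, which keeps NaN as a group key without modifying the column.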

In [28]:
idx = df['n']!=0
N = np.sum(idx)

for c in ['region','numbedscategory','teachingstatus']:
    grp = df.loc[idx, :].groupby(c)['hospitalid']
    
    print('')
    print(grp.apply(count_with_percent, N))


region
Midwest       70 (33.65%)
Northeast     13 ( 6.25%)
South         56 (26.92%)
West          43 (20.67%)
Name: hospitalid, dtype: object

numbedscategory
100 - 249     62 (29.81%)
250 - 499     35 (16.83%)
<100          46 (22.12%)
>= 500        23 (11.06%)
Name: hospitalid, dtype: object

teachingstatus
False    189 (90.87%)
True      19 ( 9.13%)
Name: hospitalid, dtype: object


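In the output above, the listed categories cover 182 of 208 hospitals for `region` and 166 of 208 for `numbedscategory`, so roughly 12.5% and 20.2% of hospitals with admissions are missing these fields, while `teachingstatus` is complete.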