# MySQL Relational Database


<img src="datamodel.png" alt="Data Model" style="width: 300px; float: right; margin-left: 20px; border: 1px solid">

This sample database has the hospitals, systems, services, and visits information we talked about in the slides. The model (table names, columns, and relationships) is shown to the right.

In this model, there are four tables:
* System Affiliations
* Hospitals
* ED Visits
* Hospital Services

In these examples, we're going to ask a few questions and answer each one with both SQL and pandas.


## Setup

Create a connection to the MySQL database

In [1]:
import os
import pymysql
from sqlalchemy import create_engine
import pandas as pd

In [5]:
host = 'slucor2021b.cgdcoitnku0k.us-east-1.rds.amazonaws.com'
port = '3306'
user = 'slucor2020'
password = 'SLUcor2020'
database = 'hds5210'

In [6]:
conn = create_engine('mysql+pymysql://' + 
                     user + ':' + 
                     password + '@' + 
                     host + ':' + port + '/' + 
                     database, echo=False)
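
The connection string above is assembled by hand from values typed into the notebook. As a minimal sketch of an alternative, the helper below reads the same five settings from environment variables and URL-escapes the user name and password, so special characters such as `@` or `/` do not break the URL. The function name `build_mysql_url` and the `DB_*` variable names are hypothetical; they are not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_mysql_url(env=os.environ):
    """Build a mysql+pymysql URL from environment-style settings.

    The DB_* keys are illustrative names chosen for this sketch.
    """
    user = quote_plus(env["DB_USER"])
    password = quote_plus(env["DB_PASSWORD"])  # escapes '@', '/', ':' and similar
    host = env["DB_HOST"]
    port = env.get("DB_PORT", "3306")  # fall back to the usual MySQL port
    database = env["DB_NAME"]
    return f"mysql+pymysql://{user}:{password}@{host}:{port}/{database}"


# Example with a plain dict standing in for os.environ
example_url = build_mysql_url({
    "DB_USER": "student",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "db.example.com",
    "DB_NAME": "hds5210",
})
print(example_url)
```

The resulting string can be passed to `create_engine` in place of the concatenated one.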

## Give me a list of hospitals

In [7]:
pd.read_sql_query("SELECT * FROM hospitals", conn)

Unnamed: 0,hospital_name,city,system_name,beds
0,BJH,St. Louis,BJC,1243
1,Mercy STL,Ladue,Mercy,1120
2,MoBap,Ladue,BJC,443


## All the hospitals that are not affiliated with the Catholic church

Using SQL, we join the hospitals table to the system_affiliations table on system_name, then keep only the rows whose affiliation is not Catholic.

In [8]:
pd.read_sql_query("""
  SELECT h.*, s.affiliation
  FROM 
    hospitals h JOIN
    system_affiliations s ON h.system_name = s.system_name
  WHERE
    s.affiliation != 'Catholic'
""", conn)

Unnamed: 0,hospital_name,city,system_name,beds,affiliation
0,BJH,St. Louis,BJC,1243,Non-Religious
1,MoBap,Ladue,BJC,443,Non-Religious
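
To experiment with this join without access to the MySQL server, the sketch below rebuilds the two tables in an in-memory SQLite database. The rows are copied from the outputs in this notebook; the column types and the use of SQLite are assumptions made for illustration only.

```python
import sqlite3

# In-memory stand-in for the MySQL database (an assumption for this sketch)
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE system_affiliations (system_name TEXT, affiliation TEXT);
CREATE TABLE hospitals (hospital_name TEXT, city TEXT, system_name TEXT, beds INTEGER);
INSERT INTO system_affiliations VALUES
  ('BJC', 'Non-Religious'),
  ('Mercy', 'Catholic');
INSERT INTO hospitals VALUES
  ('BJH', 'St. Louis', 'BJC', 1243),
  ('Mercy STL', 'Ladue', 'Mercy', 1120),
  ('MoBap', 'Ladue', 'BJC', 443);
""")

# Same join and filter as the MySQL query, with an ORDER BY for a stable result
rows = db.execute("""
  SELECT h.hospital_name, s.affiliation
  FROM
    hospitals h JOIN
    system_affiliations s ON h.system_name = s.system_name
  WHERE
    s.affiliation != 'Catholic'
  ORDER BY
    h.hospital_name
""").fetchall()
print(rows)
```

Because this is an inner join, a hospital whose system_name has no row in system_affiliations would not appear in the result at all.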


## Read all the DB tables into a dictionary of DataFrames

1. Get a list of all of the tables
2. Read each table into a DataFrame
3. Store each DataFrame in a dictionary, keyed by table name

In [9]:
df = pd.read_sql_query("SHOW TABLES", conn)

In [10]:
df

Unnamed: 0,Tables_in_hds5210
0,Medicaid_EP_Hospital_Type
1,NPI
2,Provider_Name
3,Test
4,corona_counts
5,county_population
6,ed_visits
7,hospital_services
8,hospitals
9,mo_locations


In [11]:
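tables = {}

# SHOW TABLES returns a single column named after the database
for name in pd.read_sql_query("SHOW TABLES", conn)['Tables_in_hds5210']:
    print(name)
    # Backticks quote the identifier in case a table name is a reserved word
    tables[name] = pd.read_sql("SELECT * FROM `" + name + "`", conn)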

Medicaid_EP_Hospital_Type
NPI
Provider_Name
Test
corona_counts
county_population
ed_visits
hospital_services
hospitals
mo_locations
population
ppp_data
system_affiliations
ttt_data
x


In [12]:
tables

{'Medicaid_EP_Hospital_Type':                                     Provider_Name         NPI    CCN  \
 0                            Sutter Bay Hospitals  1659439834  50008   
 1    PRIME HEALTHCARE SERVICES - GARDEN GROVE LLC  1659538858  50230   
 2                          ST MARY MEDICAL CENTER  1669456299  50300   
 3                       MADERA COMMUNITY HOSPITAL  1669673646  50568   
 4                    Temecula Valley Hospital Inc  1679816201  50775   
 ..                                            ...         ...    ...   
 323                   KAISER FOUNDATION HOSPITALS  1326119967  50071   
 324      COUNTY OF LOS ANGELES AUDITOR CONTROLLER  1336154020  50717   
 325                    REEDLEY COMMUNITY HOSPITAL  1336167550  50192   
 326  PROVIDENCE HEALTH SYSTEM-SOUTHERN CALIFORNIA  1336173269  50235   
 327                   KAISER FOUNDATION HOSPITALS  1336294040  50411   
 
     Medicaid_EP_Hospital_Type           Street_Address           City  \
 0        Acute Car

## All the hospitals that are not affiliated with the Catholic church

This time, using pandas: merge the two DataFrames, then filter the result.

In [13]:
hospitals = tables['hospitals'].merge(tables['system_affiliations'], on='system_name')

In [14]:
hospitals

Unnamed: 0,hospital_name,city,system_name,beds,affiliation
0,BJH,St. Louis,BJC,1243,Non-Religious
1,MoBap,Ladue,BJC,443,Non-Religious
2,Mercy STL,Ladue,Mercy,1120,Catholic


In [15]:
hospitals[hospitals['affiliation'] != 'Catholic']

Unnamed: 0,hospital_name,city,system_name,beds,affiliation
0,BJH,St. Louis,BJC,1243,Non-Religious
1,MoBap,Ladue,BJC,443,Non-Religious
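
When `merge` is called without `on`, pandas joins on every column name the two DataFrames share, which can change silently if a table gains a column. The sketch below spells out the key and asks pandas to check the relationship. The two DataFrames are rebuilt from the outputs shown in this notebook so the example runs on its own; the `validate` check is an addition for illustration, not something the original cells use.

```python
import pandas as pd

# Small copies of the two tables, reconstructed from the outputs above
hospitals_df = pd.DataFrame(
    [
        ["BJH", "St. Louis", "BJC", 1243],
        ["Mercy STL", "Ladue", "Mercy", 1120],
        ["MoBap", "Ladue", "BJC", 443],
    ],
    columns=["hospital_name", "city", "system_name", "beds"],
)
affiliations_df = pd.DataFrame(
    [["BJC", "Non-Religious"], ["Mercy", "Catholic"]],
    columns=["system_name", "affiliation"],
)

# Explicit key, explicit join type, and a check that each system appears
# only once on the right-hand side (many hospitals to one system)
merged = hospitals_df.merge(
    affiliations_df, on="system_name", how="inner", validate="many_to_one"
)

# Boolean mask plays the role of the SQL WHERE clause
non_catholic = merged[merged["affiliation"] != "Catholic"]
print(non_catholic)
```

If system_affiliations ever contained the same system twice, `validate="many_to_one"` would raise an error instead of quietly duplicating hospital rows.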


## Hospitals with more than 43,000 ED visits each year

First in SQL

In [16]:
pd.read_sql_query("""
  SELECT h.*, v.*
  FROM 
    hospitals h JOIN
    ed_visits v ON h.hospital_name = v.hospital_name
  WHERE
    v.ed_visits > 43000
""", conn)

Unnamed: 0,hospital_name,city,system_name,beds,hospital_name.1,year,ed_visits
0,BJH,St. Louis,BJC,1243,BJH,2016,72348
1,BJH,St. Louis,BJC,1243,BJH,2017,81221
2,Mercy STL,Ladue,Mercy,1120,Mercy STL,2016,51932
3,Mercy STL,Ladue,Mercy,1120,Mercy STL,2017,52221
4,MoBap,Ladue,BJC,443,MoBap,2017,43921


Now, let's use SQL to pivot that so that we can see 2016 and 2017 in two separate columns...

# ...

Come on.  Let's do it?

# ...

Oh.  MySQL has no PIVOT operator.  :(

## Let's try with Pandas

In [17]:
visits = tables['hospitals'].merge(tables['ed_visits'], on='hospital_name')

In [18]:
visits

Unnamed: 0,hospital_name,city,system_name,beds,year,ed_visits
0,BJH,St. Louis,BJC,1243,2016,72348
1,BJH,St. Louis,BJC,1243,2017,81221
2,Mercy STL,Ladue,Mercy,1120,2016,51932
3,Mercy STL,Ladue,Mercy,1120,2017,52221
4,MoBap,Ladue,BJC,443,2016,42983
5,MoBap,Ladue,BJC,443,2017,43921


In [19]:
visits = visits[visits['ed_visits'] > 43000]

In [20]:
visits

Unnamed: 0,hospital_name,city,system_name,beds,year,ed_visits
0,BJH,St. Louis,BJC,1243,2016,72348
1,BJH,St. Louis,BJC,1243,2017,81221
2,Mercy STL,Ladue,Mercy,1120,2016,51932
3,Mercy STL,Ladue,Mercy,1120,2017,52221
5,MoBap,Ladue,BJC,443,2017,43921


In [22]:
pd.pivot_table(visits, 
               index=['hospital_name','city','system_name','beds'], 
               columns='year', 
               values='ed_visits').reset_index()

year,hospital_name,city,system_name,beds,2016,2017
0,BJH,St. Louis,BJC,1243,72348.0,81221.0
1,Mercy STL,Ladue,Mercy,1120,51932.0,52221.0
2,MoBap,Ladue,BJC,443,,43921.0
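
The decimals in the pivoted output have two causes: `pivot_table` aggregates with the mean by default, and the missing MoBap 2016 cell becomes NaN, which forces a float column. As a sketch of one alternative, `DataFrame.pivot` reshapes without aggregating, and the nullable `Int64` dtype keeps whole numbers alongside a missing value. The data below is the filtered visits table from above, trimmed to three columns for brevity.

```python
import pandas as pd

# Filtered visits (ed_visits > 43000), copied from the output above
visits_df = pd.DataFrame(
    [
        ["BJH", 2016, 72348],
        ["BJH", 2017, 81221],
        ["Mercy STL", 2016, 51932],
        ["Mercy STL", 2017, 52221],
        ["MoBap", 2017, 43921],
    ],
    columns=["hospital_name", "year", "ed_visits"],
)

wide = (
    visits_df
    .pivot(index="hospital_name", columns="year", values="ed_visits")  # no aggregation
    .astype("Int64")  # nullable integer dtype: missing cells become <NA>
    .reset_index()
)
wide.columns.name = None  # drop the leftover "year" label on the column axis
print(wide)
```

Unlike `pivot_table`, `pivot` raises an error if a hospital has two rows for the same year, which makes accidental duplicates visible instead of averaging them.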


## OK.  We can do it in SQL...

But can you figure out why this is a bad idea?

In [23]:
pd.read_sql_query("""
  SELECT 
    h.*, v.year, 
    CASE WHEN v.year = 2016 THEN v.ed_visits ELSE NULL END as visits_2016,
    CASE WHEN v.year = 2017 THEN v.ed_visits ELSE NULL END as visits_2017
  FROM 
    hospitals h JOIN
    ed_visits v ON h.hospital_name = v.hospital_name
  WHERE
    v.ed_visits > 43000
""", conn)

Unnamed: 0,hospital_name,city,system_name,beds,year,visits_2016,visits_2017
0,BJH,St. Louis,BJC,1243,2016,72348.0,
1,BJH,St. Louis,BJC,1243,2017,,81221.0
2,Mercy STL,Ladue,Mercy,1120,2016,51932.0,
3,Mercy STL,Ladue,Mercy,1120,2017,,52221.0
4,MoBap,Ladue,BJC,443,2017,,43921.0


In [24]:
pd.read_sql_query("""
  SELECT 
    h.*, 
    MIN(CASE WHEN v.year = 2016 THEN v.ed_visits ELSE NULL END) as visits_2016,
    MIN(CASE WHEN v.year = 2017 THEN v.ed_visits ELSE NULL END) as visits_2017
  FROM 
    hospitals h JOIN
    ed_visits v ON h.hospital_name = v.hospital_name
  WHERE
    v.ed_visits > 43000
  GROUP BY 
    h.hospital_name, h.city, h.system_name, h.beds
""", conn)

Unnamed: 0,hospital_name,city,system_name,beds,visits_2016,visits_2017
0,BJH,St. Louis,BJC,1243,72348.0,81221
1,Mercy STL,Ladue,Mercy,1120,51932.0,52221
2,MoBap,Ladue,BJC,443,,43921
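
One likely answer to the question above (an interpretation, not something the original states): each year needs its own hand-written `CASE` column, so the query has to be edited whenever a new year of data arrives. The sketch below makes that visible by generating the columns from the years found in the data. It runs against an in-memory SQLite copy of the tables, with rows taken from the outputs in this notebook; `build_pivot_sql` is a hypothetical helper name.

```python
import sqlite3

# In-memory stand-in for the MySQL tables (an assumption for this sketch)
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE hospitals (hospital_name TEXT, city TEXT, system_name TEXT, beds INTEGER);
CREATE TABLE ed_visits (hospital_name TEXT, year INTEGER, ed_visits INTEGER);
INSERT INTO hospitals VALUES
  ('BJH', 'St. Louis', 'BJC', 1243),
  ('Mercy STL', 'Ladue', 'Mercy', 1120),
  ('MoBap', 'Ladue', 'BJC', 443);
INSERT INTO ed_visits VALUES
  ('BJH', 2016, 72348), ('BJH', 2017, 81221),
  ('Mercy STL', 2016, 51932), ('Mercy STL', 2017, 52221),
  ('MoBap', 2016, 42983), ('MoBap', 2017, 43921);
""")


def build_pivot_sql(years):
    """Return a conditional-aggregation query with one column per year."""
    year_columns = ",\n    ".join(
        # int() ensures only numbers are interpolated into the SQL text
        f"MIN(CASE WHEN v.year = {int(y)} THEN v.ed_visits END) AS visits_{int(y)}"
        for y in years
    )
    return f"""
  SELECT
    h.hospital_name,
    {year_columns}
  FROM
    hospitals h JOIN
    ed_visits v ON h.hospital_name = v.hospital_name
  WHERE
    v.ed_visits > 43000
  GROUP BY
    h.hospital_name
  ORDER BY
    h.hospital_name
"""


# The list of years has to be fetched first, then baked into the query text
years = [row[0] for row in db.execute("SELECT DISTINCT year FROM ed_visits ORDER BY year")]
pivot_rows = db.execute(build_pivot_sql(years)).fetchall()
print(pivot_rows)
```

The query needs two round trips and string-built SQL to do what `pivot_table` does in one call, which is why the reshaping step is usually easier in pandas.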


## Now, let's create some new data from scratch

1. Create a dataframe
2. Use a name that includes your username so that it's unique
3. Write the df to a new table
4. Query your data back out

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html

In [25]:
columns = ['hospital_name','year','ip_visits']
data = [
    ['BJH',2016,3124],
    ['Mercy STL',2016,4321],
    ['MoBap',2016,2783]
]
df = pd.DataFrame(data, columns=columns)

In [26]:
df

Unnamed: 0,hospital_name,year,ip_visits
0,BJH,2016,3124
1,Mercy STL,2016,4321
2,MoBap,2016,2783


In [27]:
import getpass
# Take the part after the last hyphen; works even if the login has no hyphen
myname = getpass.getuser().split('-')[-1]
myname

'paulboal'
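
Since the username ends up inside a table name, it helps to make sure it contains only characters that are safe in an unquoted SQL identifier. The helper below is a minimal sketch under that assumption; `user_table_name` is a hypothetical function, and the `jupyter-` style login in the example is a guess at the `prefix-name` form the cell above relies on.

```python
import getpass
import re


def user_table_name(suffix, username=None):
    """Build a per-user table name such as 'paulboal_ipv'.

    username defaults to the current login; pass it explicitly for testing.
    """
    raw = username if username is not None else getpass.getuser()
    short = raw.split("-")[-1]        # drop any 'prefix-' part of the login
    safe = re.sub(r"\W", "_", short)  # replace anything that is not a letter, digit, or underscore
    return f"{safe}_{suffix}"


print(user_table_name("ipv", "jupyter-paulboal"))
```

The returned name can then be used wherever the notebook builds `myname + '_ipv'`.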

In [28]:
df.to_sql(myname + '_ipv', conn, index=False)

In [29]:
ip_visits = pd.read_sql_query('SELECT * FROM ' + myname + '_ipv', conn)

In [30]:
ip_visits

Unnamed: 0,hospital_name,year,ip_visits
0,BJH,2016,3124
1,Mercy STL,2016,4321
2,MoBap,2016,2783


In [31]:
pd.read_sql_query('SHOW TABLES', conn)

Unnamed: 0,Tables_in_hds5210
0,Medicaid_EP_Hospital_Type
1,NPI
2,Provider_Name
3,Test
4,corona_counts
5,county_population
6,ed_visits
7,hospital_services
8,hospitals
9,mo_locations


## And we need to clean up after ourselves by dropping our table

In [32]:
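from sqlalchemy import text
# Newer SQLAlchemy versions require raw SQL strings to be wrapped in text()
with conn.begin() as c:
    c.execute(text('DROP TABLE IF EXISTS ' + myname + '_ipv'))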