# Tutorial 6: Advanced Usage - Working with SQL

<div class="alert alert-block alert-info"> <b>Before we get started: </b> 
    <ul style="list-style-type: none;margin: 0;padding: 0;">
        <li>✍️ To run this notebook, you need to have Ponder installed and set up on your machine. If you have not done so already, please refer to our <a href="https://docs.ponder.io/getting_started/quickstart.html">Quickstart guide</a> to get started.</li>
        <li>📖 Otherwise, if you're just interested in browsing through the tutorial, keep reading below!</li>
    </ul>
</div>

In this tutorial, we will showcase a few tips and tricks that help you move more easily between Ponder and SQL. We will be using the [MIMIC-III demo dataset](https://physionet.org/content/mimiciii-demo/1.4/) as an example dataset. The MIMIC-III Clinical Database contains deidentified health-related data of patients who stayed in an intensive care unit (ICU) at the Beth Israel Deaconess Medical Center in Boston. The demo dataset contains records for 100 patients; this tutorial uses three of its tables: `PATIENTS`, `ICUSTAYS`, and `ADMISSIONS`.

### Data Definition (DDL) with SQLAlchemy

In SQL, DDL statements involve modifications to the database schema, e.g., `CREATE`, `ALTER`, `DROP`. Oftentimes, you may want to run a DDL statement alongside your analysis, either via an external query editor or through SQLAlchemy. 



In [1]:
# Install the SQLAlchemy dialect for BigQuery if you don't have it already
! pip install --upgrade sqlalchemy-bigquery --quiet

In [None]:
import os; os.chdir("..");

In [3]:
from sqlalchemy import create_engine
engine = create_engine('bigquery://', credentials_path='credential.json')
connection = engine.connect()

Then we can run the SQL statement directly to create a new dataset named `MIMIC3`:

In [None]:
from sqlalchemy import text  # SQLAlchemy 2.x requires raw SQL strings to be wrapped in text()
connection.execute(text("CREATE SCHEMA MIMIC3;"))

Now if we print out the list of datasets in our project, we can see the new dataset `MIMIC3` added:

In [5]:
ret = connection.execute(text("SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA;"))

In [6]:
import pandas
pandas.DataFrame(ret.all())

Unnamed: 0,schema_name
0,MIMIC3
1,PONDER
2,TEST
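
Re-running a `CREATE` cell after the object already exists typically fails. One common way to make a DDL cell safe to re-run is the `IF NOT EXISTS` clause, which BigQuery's `CREATE SCHEMA` also accepts. The sketch below only illustrates the idea: it uses Python's built-in `sqlite3` as a local stand-in (an assumption, so that it runs without credentials), and because SQLite has no `CREATE SCHEMA`, it creates a hypothetical `demo` table instead.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for the warehouse.
con = sqlite3.connect(":memory:")

# Hypothetical table; IF NOT EXISTS turns a repeated run into a no-op.
ddl = "CREATE TABLE IF NOT EXISTS demo (id INTEGER)"
con.execute(ddl)
con.execute(ddl)  # second run does not raise

# List user tables, analogous to querying INFORMATION_SCHEMA.SCHEMATA above.
tables = [
    row[0]
    for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(tables)
```

Without `IF NOT EXISTS`, the second `execute` call in this sketch would raise an error because the table already exists.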


### Existing SQL DML with Ponder

We will use a few example tables for the remainder of this tutorial. Run the following Python script to populate your database with the three required tables: `PATIENTS`, `ADMISSIONS`, and `ICUSTAYS`.

In [None]:
!python populate_mimic3.py > /dev/null 2>&1

Oftentimes, you already have a SQL script that joins and denormalizes tables or performs some pre-aggregation or ETL before your analysis, and you want to reuse that SQL while doing the rest of the workflow in Ponder. In this example, we show how you can pass such a query to `pd.read_sql` and operate on the resulting table.

In [7]:
import ponder; ponder.init()
import modin.pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery import dbapi
from google.oauth2 import service_account
import json
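# Load the service account key and build BigQuery-scoped credentials
with open("credential.json") as f:
    credentials = service_account.Credentials.from_service_account_info(
        json.load(f), scopes=["https://www.googleapis.com/auth/bigquery"]
    )
# Open a DB-API connection for Ponder to use
bigquery_con = dbapi.Connection(bigquery.Client(credentials=credentials))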
ponder.configure(bigquery_dataset='MIMIC3', default_connection=bigquery_con)



For example, we may want to use [this existing SQL query](https://mimic.mit.edu/docs/iii/tutorials/intro-to-mimic-iii/#5-patient-age-and-mortality) from the MIT MIMIC-III tutorial to jumpstart our analysis. (We recommend that you omit the `;` at the end of the SQL statement to prevent potential errors.)
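
If your saved SQL scripts end with a semicolon, a small helper can strip it before the query is handed to `pd.read_sql`. This is a minimal sketch; `strip_trailing_semicolon` is a hypothetical helper name, not part of Ponder or pandas.

```python
def strip_trailing_semicolon(query: str) -> str:
    """Remove trailing whitespace and semicolons from a SQL string."""
    # rstrip with a character set drops any mix of ';' and whitespace at the end.
    return query.rstrip("; \t\r\n")


query = strip_trailing_semicolon("SELECT subject_id FROM PATIENTS;\n")
print(query)  # SELECT subject_id FROM PATIENTS
```

The helper only touches the end of the string, so semicolons inside string literals earlier in the query are left alone.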

In [8]:
df = pd.read_sql('''SELECT p.subject_id, p.dob, a.hadm_id,
                    a.admittime, p.expire_flag
                    FROM `innate-empire-381416.MIMIC3.ADMISSIONS` a
                    INNER JOIN `innate-empire-381416.MIMIC3.PATIENTS` p
                    ON p.subject_id = a.subject_id''', con = bigquery_con)

In [9]:
df

Unnamed: 0,subject_id,dob,hadm_id,admittime,expire_flag
0,10040,2061-10-23 00:00:00,157839,2147-02-23 11:43:00,1
1,10013,2038-09-03 00:00:00,165520,2125-10-04 23:36:00,1
2,10042,2076-05-06 00:00:00,148562,2147-02-06 12:38:00,1
3,10064,2058-04-23 00:00:00,111761,2127-03-19 14:39:00,1
4,10067,2101-06-10 00:00:00,160442,2130-10-06 01:34:00,1
...,...,...,...,...,...
124,10056,2046-02-27 00:00:00,100375,2129-05-02 00:12:00,1
125,10088,2029-07-09 00:00:00,149044,2107-05-12 18:00:00,1
126,10088,2029-07-09 00:00:00,169938,2107-01-04 11:59:00,1
127,10088,2029-07-09 00:00:00,168233,2107-01-29 04:00:00,1


Then we can continue using Ponder by writing pandas as always.

In [31]:
df["age"] = df["admittime"].dt.year - df["dob"].dt.year

In [32]:
df["age"]

0      86
1      87
2      71
3      69
4      29
       ..
124    83
125    78
126    78
127    78
128    81
Name: age, Length: 129, dtype: Int64
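
Note that subtracting calendar years overstates the age by one whenever the admission falls before the patient's birthday in that year. The following sketch shows the adjustment with the standard `datetime` module, using the dates from the first row of the query result above. `age_in_years` is a hypothetical helper; the same comparison could likely be expressed on dataframe columns with `.dt.month` and `.dt.day`.

```python
from datetime import datetime


def age_in_years(dob: datetime, at: datetime) -> int:
    """Whole years elapsed between dob and at."""
    # Has the birthday already occurred in the year of `at`?
    had_birthday = (at.month, at.day) >= (dob.month, dob.day)
    return at.year - dob.year - (0 if had_birthday else 1)


dob = datetime(2061, 10, 23)
admittime = datetime(2147, 2, 23, 11, 43)

print(admittime.year - dob.year)     # 86, the year-difference approximation
print(age_in_years(dob, admittime))  # 85, since the October birthday has not yet occurred
```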

### Working with multiple tables

With Ponder, you can work with multiple tables at the same time by creating a separate dataframe for each one with `read_sql` or `read_csv`.

In [35]:
patients = pd.read_sql("PATIENTS", con=bigquery_con)
admissions = pd.read_sql("ADMISSIONS", con=bigquery_con)

Now we can work with these two dataframes in pandas. Here, we perform the same join as the SQL query above:
```sql
SELECT p.subject_id, p.dob, a.hadm_id,
       a.admittime, p.expire_flag
FROM `innate-empire-381416.MIMIC3.ADMISSIONS` a
INNER JOIN `innate-empire-381416.MIMIC3.PATIENTS` p
ON p.subject_id = a.subject_id
```

In [36]:
patients.merge(admissions, on="subject_id")[["subject_id", "dob", "hadm_id", "admittime", "expire_flag"]]

Unnamed: 0,subject_id,dob,hadm_id,admittime,expire_flag
0,43870,2097-05-16 00:00:00,142633,2186-02-09 21:32:00,1
1,10112,2069-05-05 00:00:00,188574,2148-01-13 22:32:00,1
2,10127,2181-04-19 00:00:00,182839,2198-06-28 05:34:00,1
3,40612,2073-08-13 00:00:00,104697,2159-11-17 03:28:00,1
4,40687,2073-06-05 00:00:00,129273,2155-03-08 02:35:00,1
...,...,...,...,...,...
124,44222,2107-06-27 00:00:00,192189,2180-07-19 06:55:00,1
125,40177,2082-06-27 00:00:00,198480,2169-05-06 23:16:00,1
126,44083,2057-11-15 00:00:00,125157,2112-05-04 08:00:00,1
127,44083,2057-11-15 00:00:00,131048,2112-05-22 15:37:00,1
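
To check that `merge` with its default settings behaves like an `INNER JOIN`, you can compare both on a small toy dataset. The sketch below uses plain pandas and an in-memory SQLite database rather than Ponder and BigQuery (an assumption, so that it runs without credentials), and its rows are made up rather than taken from MIMIC-III.

```python
import sqlite3

import pandas as pd  # plain pandas here; the tutorial itself uses modin.pandas

# Made-up rows: subject 2 has no admission, subject 4 has no patient record.
patients = pd.DataFrame({"subject_id": [1, 2, 3], "expire_flag": [1, 0, 1]})
admissions = pd.DataFrame({"subject_id": [1, 1, 3, 4], "hadm_id": [10, 11, 12, 13]})

# Load both tables into an in-memory SQLite database.
con = sqlite3.connect(":memory:")
patients.to_sql("PATIENTS", con, index=False)
admissions.to_sql("ADMISSIONS", con, index=False)

sql_result = pd.read_sql(
    """SELECT p.subject_id, a.hadm_id, p.expire_flag
       FROM ADMISSIONS a
       INNER JOIN PATIENTS p ON p.subject_id = a.subject_id""",
    con,
)

# merge defaults to how="inner", so unmatched subjects are dropped.
pandas_result = patients.merge(admissions, on="subject_id")[
    ["subject_id", "hadm_id", "expire_flag"]
]


def as_rows(frame):
    """Return the frame's rows as a sorted list of tuples, ignoring row order."""
    return sorted(map(tuple, frame.to_numpy().tolist()))


print(as_rows(sql_result) == as_rows(pandas_result))
```

Rows are compared as sorted tuples because neither SQL nor `merge` guarantees the same row order, which is also why the two outputs above list patients in a different order.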
