## Using Ponder for Healthcare Data Analysis

With the growth of electronic health records, healthcare providers store, process, and analyze information about patients, their medical history, treatment, and outcomes. Effective analysis of electronic health records has been shown to improve the patient care experience, support clinical decision-making, and advance the frontiers of medical research.

In this post, we will walk through a real-world analysis scenario of how you can use Ponder to analyze electronic health records directly in your data warehouse. You can download the notebook associated with this post [here](https://github.com/ponder-org/ponder-blog/blob/main/MIMIC-III%20Health%20Record%20Analysis.ipynb).


### The MIMIC-III Clinical Dataset

In this blog post, we will be looking at the [MIMIC-III demo dataset](https://physionet.org/content/mimiciii-demo/1.4/). The MIMIC-III Clinical Database contains deidentified health-related data of patients who stayed in an intensive care unit (ICU) at the Beth Israel Deaconess Medical Center in Boston. The demo dataset contains records for 100 patients. We use three of its tables: `PATIENTS`, `ICUSTAYS`, and `ADMISSIONS`.

Citation: 
```
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific data, 3, 160035.
```

### What is Ponder? 

Ponder lets you run pandas directly in your data warehouse. Data teams can interact with their data through the familiar pandas-native experience, while enjoying the scalability and security benefits that come with a modern cloud data warehouse. You can learn more about Ponder in our [recent blog post](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/) and [sign up here](https://ponder.io/product/) to try out Ponder today.

In [None]:
import json, os
os.chdir("..")
with open(os.path.expanduser("credential.json")) as f:  # close the file after reading
    creds = json.load(f)

Ponder uses your data warehouse as an engine, so we first need to establish a connection with BigQuery in order to start querying the data. The code below shows how you can set up the database connection.

In [None]:
import ponder.bigquery
bigquery_con = ponder.bigquery.connect(creds, schema = "TEST")
ponder.bigquery.init(bigquery_con,enable_ssl=True)

In [None]:
import modin.pandas as pd

Once we have connected to the warehouse, we can use the `pd.read_sql` command to connect to our `ICUSTAYS` and `PATIENTS` tables. Note that this creates dataframes that point to those tables without the data being loaded into memory!

In [None]:
icu = pd.read_sql("ICUSTAYS",con=bigquery_con)
patients = pd.read_sql("PATIENTS",con=bigquery_con)

Now that we have connected to the two tables in our warehouse, we can operate on them just as we would with any pandas dataframe. Here, we join the two tables on the patient identifier `subject_id`.

In [None]:
df = patients.merge(icu,on="subject_id")
# Temporary workaround for a Ponder bug: restore the join key name if it comes back as subject_id_x
df = df.rename(columns={"subject_id_x":"subject_id"})

We drop the `row_id` columns, which are just record IDs from the database.

In [None]:
df = df.drop(list(df.filter(regex="row_id")),axis=1)
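As a small illustration of what the merge and drop steps do, here is a sketch on two toy frames. It uses plain pandas rather than a Ponder connection, and the values are made up to stand in for `PATIENTS` and `ICUSTAYS`. Both tables carry a `row_id` column, so pandas suffixes them as `row_id_x` and `row_id_y` after the join, which is why a regex filter is a convenient way to drop both at once. In plain pandas the join key itself keeps its name.

```python
import pandas as pd

# Toy stand-ins for the PATIENTS and ICUSTAYS tables (values are made up).
patients = pd.DataFrame({
    "row_id": [1, 2],
    "subject_id": [10006, 10011],
    "gender": ["F", "F"],
})
icu = pd.DataFrame({
    "row_id": [101, 102, 103],
    "subject_id": [10006, 10011, 10011],
    "icustay_id": [206504, 232110, 244460],
})

# Inner join on the patient identifier: one output row per ICU stay.
merged = patients.merge(icu, on="subject_id")

# Overlapping non-key columns are suffixed (row_id_x, row_id_y).
row_id_cols = list(merged.filter(regex="row_id"))
merged = merged.drop(columns=row_id_cols)
print(merged.columns.tolist())  # ['subject_id', 'gender', 'icustay_id']
```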

### Exploratory Data Analysis: Dataset Overview

Exploratory data analysis is an important first step in any data science project. It helps uncover trends, patterns, and insights that inform subsequent analyses.

To start off, let's look at the descriptive statistics to get an overview of the dataset.

In [None]:
df.describe()

Next, we print out a summary of our dataframe including the data types and non-null values of each column.

In [None]:
df.info()

We want to understand the correlation between the patient's age and their average length of stay in the ICU. To do this, we must first convert the timestamps into datetime objects.

In [None]:
df["intime"] = pd.to_datetime(df["intime"])
df["outtime"] = pd.to_datetime(df["outtime"])
df["dob"] = pd.to_datetime(df["dob"])

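Then, to compute the patient's length of stay in hours, we subtract the time the patient enters the ICU (`intime`) from the time they exit (`outtime`) and divide by one hour. We also approximate the patient's age at ICU admission as the difference between the admission year and the birth year.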

In [None]:
df["length_of_stay"] = (df["outtime"]-df["intime"])/pd.Timedelta('1 hour')

In [None]:
df["age"] = df["intime"].dt.year-df["dob"].dt.year

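Finally, we filter out outlier entries where the age is listed as 100 or above. In MIMIC-III, patients older than 89 have their dates of birth shifted to obscure their age, so they appear with ages of around 300.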

In [None]:
df = df[df["age"]<100]
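Subtracting calendar years overstates the age by one for patients whose birthday falls later in the year than their ICU admission. The sketch below, on made-up timestamps in plain pandas, shows one possible correction. The column names mirror the ones above, while `had_birthday` and `age_exact` are names introduced here for illustration.

```python
import pandas as pd

# Made-up timestamps; the column names mirror the MIMIC-III ones used above.
df = pd.DataFrame({
    "dob": pd.to_datetime(["1950-06-15", "1950-06-15"]),
    "intime": pd.to_datetime(["2010-03-01 08:00", "2010-09-01 08:00"]),
    "outtime": pd.to_datetime(["2010-03-03 20:00", "2010-09-01 14:30"]),
})

# Length of stay in hours: exit time minus entry time.
df["length_of_stay"] = (df["outtime"] - df["intime"]) / pd.Timedelta("1 hour")

# Year difference alone, as in the post.
df["age"] = df["intime"].dt.year - df["dob"].dt.year

# True when the birthday has already occurred in the admission year.
had_birthday = (
    (df["intime"].dt.month > df["dob"].dt.month)
    | ((df["intime"].dt.month == df["dob"].dt.month)
       & (df["intime"].dt.day >= df["dob"].dt.day))
)
# Subtract one year when the birthday is still ahead.
df["age_exact"] = df["age"] - (~had_birthday).astype(int)

# Same outlier filter as above.
df = df[df["age"] < 100]
print(df[["length_of_stay", "age", "age_exact"]])
```

For a scatterplot over a 50-90 age range the one-year difference is barely visible, so the simpler year subtraction is a reasonable choice for exploration.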

To look at the correlation between these variables, we plot them on a scatterplot.

In [None]:
df.plot("age","length_of_stay",kind="scatter")

We see here that there is a concentrated cluster of ICU patients between the ages of 50 and 90, and that most patients don't stay for more than 200 hours, although there are some extreme outliers in this distribution.

---

**How would you do this in SQL?**

With Ponder, you can work directly with pandas while we run it on your data warehouse for you.
There's no need to write a single line of SQL. Our example draws from [this tutorial](https://mimic.mit.edu/docs/iv/tutorials/bigquery/#tldr), written by the MIT researchers who developed MIMIC. For comparison, here is the equivalent SQL query from that tutorial:

```sql
WITH re AS (
SELECT
  DATETIME_DIFF(icu.outtime, icu.intime, HOUR) AS icu_length_of_stay,
  DATE_DIFF(DATE(icu.intime), DATE(pat.dob), YEAR) AS age
FROM `physionet-data.mimiciii_demo.icustays` AS icu
INNER JOIN `physionet-data.mimiciii_demo.patients` AS pat
  ON icu.subject_id = pat.subject_id)
SELECT
  age,
  AVG(icu_length_of_stay) AS stay
FROM re
WHERE age < 100
GROUP BY age
ORDER BY age
```

Note that the pandas query is just as easy (if not easier) to write than the SQL query. Moreover, visualization is much more integrated and seamless with Ponder than with SQL.

The [MIMIC tutorial](https://mimic.mit.edu/docs/iv/tutorials/bigquery/#tldr) shows that to plot visualizations in BigQuery, you would need to export the data from the Query Editor as a CSV and then plot it in a separate tool (the author used Google Sheets). With Ponder, visualization is just a single line of code via pandas's `df.plot`, fully integrated with the rest of your data analysis workflow.
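The SQL query also averages the length of stay per age, a step the pandas walkthrough skips because the scatterplot shows every stay. For a like-for-like comparison, that aggregation could be sketched in pandas as follows. The example uses toy values and plain pandas, and `stay_by_age` is a name introduced here.

```python
import pandas as pd

# Toy values standing in for the joined ICU and patient data.
df = pd.DataFrame({
    "age": [45, 45, 67, 67, 67, 300],
    "length_of_stay": [20.0, 40.0, 10.0, 30.0, 50.0, 80.0],
})

# WHERE age < 100 ... GROUP BY age ... AVG(...) AS stay ... ORDER BY age
stay_by_age = (
    df[df["age"] < 100]
    .groupby("age")["length_of_stay"]
    .mean()
    .sort_index()
    .rename("stay")
)
print(stay_by_age)
```

The result is an ordinary pandas Series indexed by age, so it can be plotted or joined back onto other data like any other pandas object.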

---

Outside of the ICU stay information, we also want to look at the hospital admissions record to understand what happened *before* the patients were admitted to the ICU. 

In [None]:
admissions = pd.read_sql("ADMISSIONS",con=bigquery_con)

Here, we incorporate the admissions table to look at how long the patient stayed at the hospital before they were admitted to the ICU. 

In [None]:
df = df.merge(admissions,on=["hadm_id","subject_id"])

In [None]:
df["pre_icu_length_of_stay"]= (df["intime"]-df["admittime"])/pd.Timedelta('1 day')

By plotting the distribution of pre-ICU length of stay, we learn that most patients were admitted to the ICU within a day of being admitted to the hospital. This reflects the fact that the ICU typically caters to patients with severe or life-threatening conditions requiring immediate attention.

In [None]:
df["pre_icu_length_of_stay"].hist()

In [None]:
within_day = len(df[df["pre_icu_length_of_stay"] < 1])  # stays that began within 1 day of admission
print(f"Percentage of ICU admissions within 1 day: {within_day / len(df) * 100:.2f}%")

You can find the SQL that performs a similar query on BigQuery in [this tutorial](https://mimic.mit.edu/docs/iii/tutorials/intro-to-mimic-iii-bq/#solution-to-step-4).

### Working with Text: Parsing through clinical diagnosis

Electronic health records can include both structured data (such as clinical measurements of temperature, blood pressure, etc.) and unstructured data (such as imaging, physician notes, etc.). For example, take a look at the [`diagnosis` column](https://mimic.mit.edu/docs/iii/tables/admissions/#diagnosis), which contains the free-text diagnosis assigned by the clinician:

In [None]:
df.diagnosis

Thankfully, it is easy to work with text data in pandas, since pandas comes with a [convenient set of functions](https://pandas.pydata.org/docs/user_guide/text.html) for operating on string and object type columns.

Here, we normalize the text by replacing special characters and custom separators with spaces. Then we combine the diagnoses of all patients into a single string to determine the five most common words used in the diagnoses.

In [None]:
df.diagnosis = df.diagnosis.str.replace(";"," ").str.strip()
all_diagnosis_str = df.diagnosis.str.cat(sep=" ")

subs = {"\\":" ", "-":"", "/":"", "?":""}
for old, new in subs.items():
    all_diagnosis_str = all_diagnosis_str.replace(old, new)

# split() with no argument collapses repeated whitespace, so no empty tokens are counted
all_diagnosis = all_diagnosis_str.split()

In [None]:
import collections
c = collections.Counter(all_diagnosis)
c.most_common(5)

In [None]:
top_5_keyword = sorted(c, key=c.get, reverse=True)[:5]
print(f"Top 5 most common diagnostic terms are: {top_5_keyword}")
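An alternative to chained replacements is to tokenize with a regular expression, which also avoids empty tokens when separators appear back to back. The sketch below uses only the standard library on a few made-up strings written in the style of the `diagnosis` column. The sample strings and the `tokenize` helper are illustrative and not taken from the dataset. Unlike the code above, this version splits on hyphens and slashes rather than removing them.

```python
import collections
import re

# Made-up strings in the style of the free-text diagnosis column.
diagnoses = [
    "SEPSIS;PNEUMONIA",
    "CONGESTIVE HEART FAILURE",
    "SEPSIS",
    "PNEUMONIA\\INTUBATION",
    "HEART FAILURE",
]

def tokenize(text):
    """Split on any run of characters that is not a letter or digit."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]

# Count every token across all diagnosis strings.
counts = collections.Counter(tok for d in diagnoses for tok in tokenize(d))
print(counts.most_common(3))
```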

Based on these top five words, we create a binary feature that indicates the presence of each diagnostic term. 

In [None]:
for keyword in top_5_keyword:
    df[keyword]=df['diagnosis'].str.contains(keyword)
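To make the feature step concrete, here is a sketch on a toy frame in plain pandas with made-up rows. It passes `regex=False` so that keywords are matched literally, and `na=False` so that a missing diagnosis yields `False` rather than a missing value. Both are standard `Series.str.contains` parameters, but whether Ponder's BigQuery backend supports them is an assumption that has not been checked here.

```python
import pandas as pd

# Made-up rows; None stands in for a missing diagnosis.
df = pd.DataFrame({
    "diagnosis": ["SEPSIS PNEUMONIA", "HEART FAILURE", None, "SEPSIS"],
})
keywords = ["SEPSIS", "FAILURE"]  # hypothetical top keywords

for keyword in keywords:
    # Literal substring match; missing values count as "not present".
    df[keyword] = df["diagnosis"].str.contains(keyword, regex=False, na=False)

# Convert the boolean indicators to 0/1 features.
X = df[keywords].astype(int)
print(X)
```

Keep in mind that `str.contains` matches substrings, so a short keyword also flags longer words that contain it.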

### Machine Learning: Mortality prediction of ICU Patients

[Survival analysis](https://www.nature.com/articles/s41746-022-00679-6) and [mortality prediction](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4321691/) are common use cases for researchers and clinicians analyzing electronic health data, especially as it pertains to ICU stays.

Now with all the diagnostic features we created, we build a basic machine learning model to predict the likelihood of survival of patients. 

The `hospital_expire_flag` is a binary attribute that captures whether a patient died in the hospital. By printing out the [value counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) of this attribute, we see that about a third of ICU patients die in the hospital. 

In [None]:
df["hospital_expire_flag"].value_counts()

We build a binary classification model, where `X` holds our features (i.e., the presence of each diagnostic term) and `y` is the target (i.e., whether the patient dies in the hospital).

In [None]:
X = df[top_5_keyword].astype(int)
y = df['hospital_expire_flag']

In [None]:
X

In [None]:
y

We split the data into training and test sets. We will hold out 10% of the data for testing the model and use the remaining dataset for training.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                   test_size= 0.1,random_state=0)
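With a small, imbalanced dataset, a random 10% hold-out can end up with very few positive cases. One option, not used in the walkthrough above, is scikit-learn's `stratify` parameter, which keeps the class proportions the same in both splits. The sketch below uses made-up labels with roughly one third positives.

```python
from sklearn.model_selection import train_test_split

# Toy labels: 30 positives out of 90 rows (made up, roughly one third).
y = [1] * 30 + [0] * 60
X = [[i] for i in range(len(y))]

# stratify=y preserves the 1:2 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
print(len(y_test), sum(y_test))
```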

We fit a basic [Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes) and evaluate the model predictions.

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train,y_train)

In [None]:
predictions = model.predict(X_test)

Here we plot the confusion matrix to show the number of true positives, true negatives, false positives and false negatives.

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
disp.plot()

In [None]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, predictions)
print(f'Accuracy of the binary classifier = {score:0.2f}')
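When one class is much rarer than the other, accuracy alone can look reasonable even for a model that rarely predicts the rare class. Precision and recall, derived from the same confusion-matrix counts, give a fuller picture. The sketch below computes them in plain Python on made-up labels. The `y_true` and `y_pred` lists are illustrative and are not the model's actual output.

```python
# Made-up labels: 1 = died in hospital, 0 = survived.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.60 precision=0.50 recall=0.25
```

In this toy case the classifier is right 60% of the time yet catches only one in four deaths, which is the number a clinician would likely care about most.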

Of course, we are working with a very small sample here, so this is mostly intended to be an illustrative example.

**Why use Ponder for healthcare data analysis?**

Now that you've gotten a taste of the types of analysis you can do with Ponder, let's take a step back and look at why Ponder is an excellent fit for clinicians and healthcare providers looking to analyze healthcare data in their data warehouse.

- **It's secure.** Healthcare data analysis often involves working with sensitive Protected Health Information (PHI) stored securely by the healthcare provider or hospital. Typically, working with pandas requires pulling the data out of the data warehouse onto your local machine, which increases the risk of data leakage or unauthorized disclosure. With Ponder, all pandas operations are pushed down to the warehouse, and the computation happens entirely there. This means that IT teams can enforce the same access and security controls that they apply to the warehouse.
- **It's scalable.** The volume of healthcare data has grown exponentially over the past few years and continues to grow. This presents significant opportunities and challenges for health data management. When working with pandas, [scalability bottlenecks related to memory and performance](https://pandas.pydata.org/docs/user_guide/scale.html) often make analyzing large datasets impossible without resorting to big data frameworks such as Spark. With Ponder, you are no longer limited to the confines of in-memory analytics with pandas. Since the computation happens entirely in the warehouse, Ponder inherits the parallelism and scalability of your warehouse. In fact, we have shown that Ponder works on workloads involving [more than a terabyte of data](https://ponder.io/run-pandas-on-1tb-directly-in-your-data-warehouse/). This is useful not just for analyzing electronic health records across a large population of patients, but also for large-scale genomic analyses, which are often terabyte-scale or larger.
- **It's plain vanilla pandas.** Python is the de facto tool for data science, used by [one in every two software developers](https://ponder.io/pandas-is-now-as-popular-as-python-was-in-2016/) and [more than 90%](https://storage.googleapis.com/kaggle-media/surveys/Kaggle%20State%20of%20Machine%20Learning%20and%20Data%20Science%20Report%202022.pdf) of all data scientists. Among healthcare data practitioners and life scientists, it is also rapidly growing in popularity thanks to the increasing number of [health informatics programs](https://www.coursera.org/specializations/health-informatics) offered by [universities](https://healthinformatics.ucsf.edu/) and [online courses](https://www.coursera.org/specializations/genomic-data-science). Ponder gives you the exact same look and feel as pandas, but running directly in your warehouse, without requiring you to write a single line of SQL or Spark. Our mission at Ponder is to empower a wider range of domain specialists and practitioners to work more easily with the production-scale data in their warehouse, while sticking with their familiar API.

### Summary

In this post, we saw how Ponder lets you analyze electronic health records with ease by allowing practitioners to: 
- browse high-level summary and overview of the dataset,
- discover patterns and insights based on visualizations and basic statistics,
- perform datetime operations to compute a patient's length of stay,
- develop features based on clinician free-text diagnosis,
- build a classification model to predict ICU patient mortality.

Ponder lets you seamlessly move between feature engineering, visualization, and machine learning — all within the Python data ecosystem, while operating directly on the data in your data warehouse. 


Looking to try Ponder on your next healthcare data analysis project? Sign up [here](http://ponder.io/product) to get started with Ponder!