# FHIR for Research Workshop

## Exercise 0 - Getting Started

### Motivation / Purpose


#### For this exercise we'll introduce you to the basic framework for working with FHIR data in Python. 

To do this, we will walk you through the following steps:
<ol>
    <li>Connect to the FHIR server</li>
    <li>Format and structure queries to submit</li>
    <li>Pull relevant data from the FHIR server and save it locally as a JSON document</li>
    <li>Parse the JSON into a readable dataframe</li>
</ol>


## Initial Setup

## Step 1 Connect to Client

First let's sync to our source server for data extraction, and pull a sample patient file.

We use the `requests` library to submit a request structured as a URL. The response body is already in JSON format, so we parse it into a Python dictionary.

Generally speaking, a RESTful GET query is appended to the server's base URL and takes the form: url/Resource/Specification
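As a rough illustration of that URL pattern, the sketch below assembles a read URL (a resource type plus an id) and a search URL (a resource type plus query parameters). `build_fhir_url` is a hypothetical helper written for this exercise, not part of `requests` or any FHIR library, and whether this particular server honours the `gender` and `_count` search parameters is an assumption.

```python
from urllib.parse import urlencode

# Base URL taken from the queries used in this exercise.
BASE_URL = "https://api.logicahealth.org/researchonfhir/open"


def build_fhir_url(base, resource_type, resource_id=None, params=None):
    """Assemble a FHIR REST URL (hypothetical helper for illustration)."""
    url = f"{base.rstrip('/')}/{resource_type}"
    if resource_id is not None:
        # Read interaction: base/Resource/id
        url += f"/{resource_id}"
    if params:
        # Search interaction: base/Resource?name=value&...
        url += "?" + urlencode(params)
    return url


read_url = build_fhir_url(BASE_URL, "Patient", "smart-1032702")
search_url = build_fhir_url(BASE_URL, "Patient",
                            params={"gender": "female", "_count": 10})
print(read_url)
print(search_url)
```

Either string can then be passed to `requests.get` in the same way as the hard-coded URLs in this notebook.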

Let's import the required libraries and submit a sample query to return the basic information from a single patient: 'smart-1032702':

In [22]:
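import requests
import json

# Read a single Patient resource by its id.
# Note: verify=False switches off TLS certificate verification; drop it if the
# server's certificate validates in your environment.
r = requests.get("https://api.logicahealth.org/researchonfhir/open/Patient/smart-1032702", headers={'Accept':'application/fhir+json'}, verify=False)
bundle = r.json()  # parse the JSON response body into a Python dict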



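Let's output the result to confirm we successfully accessed the server and queried data. Because we asked for a single resource by id, the server returns the Patient resource itself rather than a Bundle, even though our variable is named `bundle`.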

In [23]:
bundle

{'resourceType': 'Patient',
 'id': 'smart-1032702',
 'meta': {'versionId': '1',
  'lastUpdated': '2020-07-15T02:51:25.000+00:00',
  'source': '#KQSArAdbxORTtqVw'},
 'text': {'status': 'generated',
  'div': '<div xmlns="http://www.w3.org/1999/xhtml">Amy Shaw</div>'},
 'identifier': [{'use': 'official',
   'type': {'coding': [{'system': 'http://terminology.hl7.org/CodeSystem/v2-0203',
      'code': 'MR',
      'display': 'Medical Record Number'}],
    'text': 'Medical Record Number'},
   'system': 'http://hospital.smarthealthit.org',
   'value': 'smart-1032702'}],
 'active': True,
 'name': [{'use': 'official', 'family': 'Shaw', 'given': ['Amy', 'V']}],
 'telecom': [{'system': 'phone', 'value': '800-782-6765', 'use': 'mobile'},
  {'system': 'email', 'value': 'amy.shaw@example.com'}],
 'gender': 'female',
 'birthDate': '2007-03-20',
 'address': [{'use': 'home',
   'line': ['49 Meadow St'],
   'city': 'Mounds',
   'state': 'OK',
   'postalCode': '74047',
   'country': 'USA'}],
 'generalPrac
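Beyond eyeballing the output, the same confirmation can be made in code. The sketch below checks the `resourceType` of a parsed response and raises if the server sent back an `OperationOutcome` (the resource FHIR servers use to report problems) or some other unexpected type. `check_resource` is a hypothetical helper, and the sample payloads are hand-written stand-ins rather than real server responses.

```python
def check_resource(payload, expected_type):
    """Raise if a parsed FHIR response is not the resource type we asked for."""
    actual = payload.get("resourceType")
    if actual == "OperationOutcome":
        # Collect whatever diagnostics the server supplied.
        details = "; ".join(
            issue.get("diagnostics", issue.get("code", "unknown issue"))
            for issue in payload.get("issue", [])
        )
        raise RuntimeError(f"Server reported a problem: {details}")
    if actual != expected_type:
        raise ValueError(f"Expected {expected_type}, got {actual}")
    return payload


# Hand-written stand-ins for a successful and a failed response.
ok_payload = {"resourceType": "Patient", "id": "smart-1032702"}
error_payload = {
    "resourceType": "OperationOutcome",
    "issue": [{"severity": "error", "code": "not-found",
               "diagnostics": "Resource Patient/unknown is not known"}],
}

checked = check_resource(ok_payload, "Patient")
try:
    check_resource(error_payload, "Patient")
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
print(checked["id"], "|", error_message)
```

In the notebook this would be called as `check_resource(bundle, 'Patient')` right after `r.json()`.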

## Step 2 Query Data

We are now positioned to submit specific queries to our source server and retrieve data. We then save the results locally in JSON files for subsequent parsing.

#### For this exercise we'll look at an entire patient cohort in our dataset, and determine some basic demographic details. 

To do this, we will need to do the following:
<ol>
    <li>Pull relevant data on patients from the FHIR server</li>
    <li>Convert our JSON resource into a pandas dataframe</li>
    <li>Trim our dataset down to only the key relevant features</li>
    <li>Conduct basic EDA on the results</li>
</ol>

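To pull all patients, we simply need to query the 'Patient' resource type without specifying an id. The server responds with a Bundle of type `searchset` that holds one page of matching resources: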

In [13]:
r = requests.get(f"https://api.logicahealth.org/researchonfhir/open/Patient", headers={'Accept':'application/fhir+json'}, verify=False)
bundle = r.json()



In [14]:
bundle

{'resourceType': 'Bundle',
 'id': 'dac490c6-2d57-4c56-8cb7-cb9f355fe116',
 'meta': {'lastUpdated': '2022-01-08T18:56:59.611+00:00'},
 'type': 'searchset',
 'total': 70,
 'link': [{'relation': 'self',
   'url': 'https://api.logicahealth.org/researchonfhir/open/Patient'},
  {'relation': 'next',
   'url': 'https://api.logicahealth.org/researchonfhir/open?_getpages=dac490c6-2d57-4c56-8cb7-cb9f355fe116&_getpagesoffset=50&_count=50&_pretty=true&_bundletype=searchset'}],
 'entry': [{'fullUrl': 'https://api.logicahealth.org/researchonfhir/open/Patient/BILIBABY',
   'resource': {'resourceType': 'Patient',
    'id': 'BILIBABY',
    'meta': {'versionId': '1',
     'lastUpdated': '2020-07-15T02:51:23.000+00:00',
     'source': '#mNKBng6Y74bFyYWP'},
    'text': {'status': 'generated',
     'div': '<div xmlns="http://www.w3.org/1999/xhtml">Bili Baby</div>'},
    'extension': [{'url': 'http://hl7.org/fhir/StructureDefinition/patient-birthTime',
      'valueDateTime': '2016-01-04T00:00:00-06:00'}],
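Note that the Bundle reports a `total` of 70 patients but carries a `next` link with `_count=50`, so a single request returns only the first page. A minimal sketch of following those links is shown below. `next_page_url` and `collect_entries` are hypothetical helpers, and the pages are small hand-made dictionaries so the example runs without network access.

```python
def next_page_url(bundle):
    """Return the URL of the 'next' link in a Bundle, or None on the last page."""
    for link in bundle.get("link", []):
        if link.get("relation") == "next":
            return link.get("url")
    return None


def collect_entries(first_bundle, fetch):
    """Gather the entries from every page.

    `fetch` is any callable that takes a URL and returns the parsed Bundle.
    """
    entries = []
    bundle = first_bundle
    while bundle is not None:
        entries.extend(bundle.get("entry", []))
        url = next_page_url(bundle)
        bundle = fetch(url) if url else None
    return entries


# Stand-in pages so the sketch runs offline.
page_two = {"resourceType": "Bundle", "entry": [{"resource": {"id": "c"}}]}
page_one = {
    "resourceType": "Bundle",
    "link": [{"relation": "self", "url": "page-1"},
             {"relation": "next", "url": "page-2"}],
    "entry": [{"resource": {"id": "a"}}, {"resource": {"id": "b"}}],
}
fake_server = {"page-2": page_two}
all_entries = collect_entries(page_one, fake_server.get)
entry_ids = [e["resource"]["id"] for e in all_entries]
print(entry_ids)
```

Against the real server, `fetch` could be something like `lambda url: requests.get(url, headers={'Accept': 'application/fhir+json'}).json()`.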
  

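We can use the `open` function to create a new file and write the raw content of the response to it. The `fhir-data` directory must already exist, and the value returned by `write` is the number of bytes written.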

In [15]:
open('fhir-data/data.json', 'wb').write(r.content)

78900
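If the target folder might not exist yet, a slightly more defensive variant creates it first and closes the file explicitly. The sketch below is one way to do that. `save_bundle` is a hypothetical helper, and the demonstration writes to a temporary directory rather than to `fhir-data`.

```python
import json
import os
import tempfile


def save_bundle(bundle, path):
    """Write a parsed Bundle to `path` as JSON, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    # The context manager closes the file even if the write fails.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(bundle, f)
    return os.path.getsize(path)


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "fhir-data", "data.json")
    size = save_bundle({"resourceType": "Bundle", "entry": []}, target)
    with open(target, encoding="utf-8") as f:
        round_trip = json.load(f)
print(size, round_trip["resourceType"])
```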

## Step 3 Mount Data onto Pandas Dataframe

Now that we've extracted the information we need, we will convert the FHIR-formatted data into a pandas dataframe for subsequent analysis.

The following set of functions parses the JSON into a pandas dataframe.

The first code block defines the `Fhiry` class, which parses a JSON file.

The second creates a wrapper function `process()` that takes a directory of JSON files and parses every file in it. No work is needed on your part: simply call `process()` on the directory you saved your JSON file to.

In [10]:
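import json
import os

import pandas as pd

try:
    # pandas >= 1.0 exposes json_normalize at the top level
    from pandas import json_normalize
except ImportError:
    # older pandas releases
    from pandas.io.json import json_normalize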


class Fhiry(object):
    def __init__(self):
        self._df = None
        self._filename = ""
        self._folder = ""

    @property
    def df(self):
        return self._df

    @property
    def filename(self):
        return self._filename

    @property
    def folder(self):
        return self._folder

    @filename.setter
    def filename(self, filename):
        self._filename = filename
        self._df = self.read_bundle_from_file(filename)

    @folder.setter
    def folder(self, folder):
        self._folder = folder

    def read_bundle_from_file(self, filename):
        with open(filename, 'r') as f:
            json_in = f.read()
            json_in = json.loads(json_in)
            return json_normalize(json_in['entry'])

    def delete_unwanted_cols(self):
        del self._df['resource.text.div']

    def process_df(self):
        """Read a single JSON resource or a directory full of JSON resources
        ONLY COMMON FIELDS IN ALL resources will be mapped
        """
        if self._folder:
            df = pd.DataFrame(columns=[])
            for file in os.listdir(self._folder):
                if file.endswith(".json"):
                    self._df = self.read_bundle_from_file(
                        os.path.join(self._folder, file))
                    self.delete_unwanted_cols()
                    self.convert_object_to_list()
                    self.add_patient_id()
                    if df.empty:
                        df = self._df
                    else:
                        df = pd.concat([df, self._df])
            self._df = df
        elif self._filename:
            self._df = self.read_bundle_from_file(self._filename)
            self.delete_unwanted_cols()
            self.convert_object_to_list()
            self.add_patient_id()

    def process_file(self, filename):
        self._df = self.read_bundle_from_file(filename)
        self.delete_unwanted_cols()
        self.convert_object_to_list()
        self.add_patient_id()
        return self._df

    def convert_object_to_list(self):
        """Convert object to a list of codes
        """
        for col in self._df.columns:
            if 'coding' in col:
                codes = self._df.apply(
                    lambda x: self.process_list(x[col]), axis=1)
                # axis is passed by keyword; recent pandas rejects it positionally
                self._df = pd.concat(
                    [self._df, codes.to_frame(name=col+'codes')], axis=1)
                del self._df[col]
            if 'display' in col:
                codes = self._df.apply(
                    lambda x: self.process_list(x[col]), axis=1)
                self._df = pd.concat(
                    [self._df, codes.to_frame(name=col+'display')], axis=1)
                del self._df[col]

    def add_patient_id(self):
        """Create a patientId column with the resource.id of the first Patient resource
        """
        self._df['patientId'] = self._df[(
            self._df['resource.resourceType'] == "Patient")].iloc[0]['resource.id']

    def get_info(self):
        if self._df is None:
            return "Dataframe is empty"
        return self._df.info()

    def process_list(self, myList):
        """Extracts the codes from a list of objects
        Args:
            myList (list): A list of objects
        Returns:
            list: A list of codes
        """
        myCodes = []
        if isinstance(myList, list):
            for entry in myList:
                if 'code' in entry:
                    myCodes.append(entry['code'])
                else:
                    myCodes.append(entry['display'])
        return myCodes

In [11]:
# parallel file
import multiprocessing as mp



def process_files(file):
    f = Fhiry()
    return f.process_file(file)


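def process_ndjson(file):
    # NOTE: Fhirndjson is not defined in this notebook, so this helper and
    # ndjson() raise a NameError unless that class is provided elsewhere.
    f = Fhirndjson()
    return f.process_file(file)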

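def process(folder):
    # Only hand JSON files to the workers
    files = [os.path.join(folder, row) for row in os.listdir(folder)
             if row.endswith('.json')]
    try:
        # Parse the files in parallel, one worker per CPU core
        with mp.Pool(mp.cpu_count()) as pool:
            list_of_dataframes = pool.map(process_files, files)
        return pd.concat(list_of_dataframes)
    except Exception:
        # Fall back to sequential processing if the pool cannot be used
        f = Fhiry()
        f.folder = folder
        f.process_df()
        return f.df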


def ndjson(folder):
    try:
        pool = mp.Pool(mp.cpu_count())
        list_of_dataframes = pool.map(
            process_ndjson, [folder + '/' + row for row in os.listdir(folder)])
        pool.close()
        return pd.concat(list_of_dataframes)
    except Exception:
        f = Fhirndjson()
        f.folder = folder
        f.process_df()
        return f.df
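The core of `read_bundle_from_file` is `json_normalize`, which turns each nested entry into one row whose column names are the nested keys joined by dots. To give a feel for that idea, here is a small plain-Python sketch. It is an illustration of the flattening concept only, not pandas' actual implementation, and `flatten` is a hypothetical helper.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one dict with dot-separated keys.

    Lists are kept as they are, which is why columns such as
    'resource.name' still hold lists after flattening.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Abbreviated version of the first entry in the search Bundle.
entry = {
    "fullUrl": "https://api.logicahealth.org/researchonfhir/open/Patient/BILIBABY",
    "resource": {
        "resourceType": "Patient",
        "id": "BILIBABY",
        "meta": {"versionId": "1"},
        "name": [{"family": "Bili", "given": ["Baby"]}],
    },
    "search": {"mode": "match"},
}
row = flatten(entry)
print(sorted(row))
```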

We can now create our dataframe by calling the `process` function to parse all the JSON files within a given directory:

In [24]:
df = process('fhir-data')


Let's output some information about our new dataframe...

In [17]:
df.columns

Index(['fullUrl', 'resource.resourceType', 'resource.id',
       'resource.meta.versionId', 'resource.meta.lastUpdated',
       'resource.meta.source', 'resource.text.status', 'resource.extension',
       'resource.active', 'resource.name', 'resource.gender',
       'resource.birthDate', 'search.mode', 'resource.identifier',
       'resource.telecom', 'resource.address', 'resource.generalPractitioner',
       'patientId'],
      dtype='object')

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 18 columns):
fullUrl                         50 non-null object
resource.resourceType           50 non-null object
resource.id                     50 non-null object
resource.meta.versionId         50 non-null object
resource.meta.lastUpdated       50 non-null object
resource.meta.source            50 non-null object
resource.text.status            50 non-null object
resource.extension              1 non-null object
resource.active                 50 non-null bool
resource.name                   50 non-null object
resource.gender                 50 non-null object
resource.birthDate              50 non-null object
search.mode                     50 non-null object
resource.identifier             49 non-null object
resource.telecom                49 non-null object
resource.address                47 non-null object
resource.generalPractitioner    49 non-null object
patientId                       5

In [18]:
df.head(5)

| | fullUrl | resource.resourceType | resource.id | resource.meta.versionId | resource.meta.lastUpdated | resource.meta.source | resource.text.status | resource.extension | resource.active | resource.name | resource.gender | resource.birthDate | search.mode | resource.identifier | resource.telecom | resource.address | resource.generalPractitioner | patientId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://api.logicahealth.org/researchonfhir/op... | Patient | BILIBABY | 1 | 2020-07-15T02:51:23.000+00:00 | #mNKBng6Y74bFyYWP | generated | [{'url': 'http://hl7.org/fhir/StructureDefinit... | True | [{'family': 'Bili', 'given': ['Baby']}] | male | 2016-01-04 | match | | | | | BILIBABY |
| 1 | https://api.logicahealth.org/researchonfhir/op... | Patient | smart-1032702 | 1 | 2020-07-15T02:51:25.000+00:00 | #KQSArAdbxORTtqVw | generated | | True | [{'use': 'official', 'family': 'Shaw', 'given'... | female | 2007-03-20 | match | [{'use': 'official', 'type': {'coding': [{'sys... | [{'system': 'phone', 'value': '800-782-6765', ... | [{'use': 'home', 'line': ['49 Meadow St'], 'ci... | [{'reference': 'Practitioner/smart-Practitione... | BILIBABY |
| 2 | https://api.logicahealth.org/researchonfhir/op... | Patient | smart-1081332 | 1 | 2020-07-15T02:51:26.000+00:00 | #WnCTEkK79sEBIQNe | generated | | True | [{'use': 'official', 'family': 'Ross', 'given'... | male | 2003-10-02 | match | [{'use': 'official', 'type': {'coding': [{'sys... | [{'system': 'phone', 'value': '800-960-9294', ... | [{'use': 'home', 'line': ['19 Church St'], 'ci... | [{'reference': 'Practitioner/smart-Practitione... | BILIBABY |
| 3 | https://api.logicahealth.org/researchonfhir/op... | Patient | smart-1098667 | 1 | 2020-07-15T02:51:28.000+00:00 | #t8B1SBPceqIytzZz | generated | | True | [{'use': 'official', 'family': 'Hill', 'given'... | male | 1953-10-27 | match | [{'use': 'official', 'type': {'coding': [{'sys... | [{'system': 'email', 'value': 'robert.hill@exa... | [{'use': 'home', 'line': ['42 Park St'], 'city... | [{'reference': 'Practitioner/smart-Practitione... | BILIBABY |
| 4 | https://api.logicahealth.org/researchonfhir/op... | Patient | smart-1134281 | 1 | 2020-07-15T02:51:29.000+00:00 | #SpkMq39qajrl0UKR | generated | | True | [{'use': 'official', 'family': 'Taylor', 'give... | male | 2004-10-15 | match | [{'use': 'official', 'type': {'coding': [{'sys... | [{'system': 'phone', 'value': '800-539-3986', ... | [{'use': 'home', 'line': ['24 Pine Rd'], 'city... | [{'reference': 'Practitioner/smart-Practitione... | BILIBABY |


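As we can see, there is useful information here, but many columns are not needed for a demographic summary. Note also that `patientId` is the same for every row, because `add_patient_id` fills it with the id of the first Patient in each file. Let's trim the dataframe down to the essential fields.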

In [30]:
dfconcise = df[['resource.id',
       'resource.name', 'resource.gender',
       'resource.birthDate']]

In [31]:
dfconcise.head()

| | resource.id | resource.name | resource.gender | resource.birthDate |
|---|---|---|---|---|
| 0 | BILIBABY | [{'family': 'Bili', 'given': ['Baby']}] | male | 2016-01-04 |
| 1 | smart-1032702 | [{'use': 'official', 'family': 'Shaw', 'given'... | female | 2007-03-20 |
| 2 | smart-1081332 | [{'use': 'official', 'family': 'Ross', 'given'... | male | 2003-10-02 |
| 3 | smart-1098667 | [{'use': 'official', 'family': 'Hill', 'given'... | male | 1953-10-27 |
| 4 | smart-1134281 | [{'use': 'official', 'family': 'Taylor', 'give... | male | 2004-10-15 |
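The `resource.name` column still holds a list of name dictionaries, which is awkward to read. One possible way to turn it into plain text is sketched below. `format_name` is a hypothetical helper, and preferring the entry marked `official` is a design choice made for this illustration rather than anything required by FHIR.

```python
def format_name(names):
    """Turn a FHIR name list into a single 'Given Family' string."""
    if not isinstance(names, list) or not names:
        # Missing values (e.g. NaN in a dataframe) become an empty string.
        return ""
    # Prefer the entry marked 'official'; otherwise fall back to the first one.
    chosen = next((n for n in names if n.get("use") == "official"), names[0])
    parts = list(chosen.get("given", []))
    if chosen.get("family"):
        parts.append(chosen["family"])
    return " ".join(parts)


name_a = format_name([{"use": "official", "family": "Shaw", "given": ["Amy", "V"]}])
name_b = format_name([{"family": "Bili", "given": ["Baby"]}])
print(name_a)
print(name_b)
```

On the dataframe this could be applied with `dfconcise['resource.name'].apply(format_name)`.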


## Step 4 Exploratory Data Analysis 

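Let's look at the number of patients in our dataset. Note that this counts only the first page of results that we saved: the Bundle reported a `total` of 70 patients, but the page we retrieved holds 50 entries.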

In [34]:
dfconcise['resource.id'].count()

50

Let's look at our gender breakdown:

In [36]:
dfconcise['resource.gender'].value_counts()

female    25
male      25
Name: resource.gender, dtype: int64
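Age is another basic demographic detail that can be derived from `resource.birthDate`. The sketch below computes whole-year ages and groups them into coarse bands. `age_on` and `age_band` are hypothetical helpers, the band cut-offs are chosen only for illustration, and the code assumes full `YYYY-MM-DD` dates (FHIR also permits partial dates, which would need extra handling).

```python
from collections import Counter
from datetime import date


def age_on(birth_date, as_of):
    """Whole years between an ISO 'YYYY-MM-DD' birth date and `as_of`."""
    born = date.fromisoformat(birth_date)
    # Subtract one if the birthday has not yet occurred in the as_of year.
    before_birthday = (as_of.month, as_of.day) < (born.month, born.day)
    return as_of.year - born.year - before_birthday


def age_band(age):
    """Bucket an age into coarse bands (cut-offs are illustrative)."""
    if age < 18:
        return "0-17"
    if age < 65:
        return "18-64"
    return "65+"


# Birth dates from the first five rows shown earlier; the reference date
# matches the 'lastUpdated' date of the search Bundle.
birth_dates = ["2016-01-04", "2007-03-20", "2003-10-02", "1953-10-27", "2004-10-15"]
reference = date(2022, 1, 8)
ages = [age_on(b, reference) for b in birth_dates]
bands = Counter(age_band(a) for a in ages)
print(ages)
print(dict(bands))
```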