## 1. This patient may have sepsis
<p>Sepsis is a deadly syndrome where a patient has a severe infection that causes organ failure. The sooner septic patients are treated, the more likely they are to survive, but sepsis can be difficult to recognize. It may be possible to use hospital data to develop machine learning models that could flag patients who are likely to be septic. Before predictive algorithms can be developed, however, we need a reliable way to pick out the patients who had sepsis. One component to identify is a severe infection.</p>
<p>In this notebook, we will use hospital electronic health record (EHR) data covering a two-week period to find out which patients were suspected to have a severe infection. In other words, we will look into the hospital's records to see what happened during a patient's hospital stay, and try to figure out whether s/he had a severe infection. </p>
<p>We will do this by checking whether the doctor ordered a blood test to look for bacteria (a blood culture) and also gave the patient a series of antibiotics. We will use data documenting antibiotics administered and blood cultures drawn.</p>

In [157]:
# Load packages
library(data.table)

# The data.table package is pre-loaded
# Read in the data
antibioticDT <- fread('datasets/antibioticDT.csv')

# Look at the first 30 rows
head(antibioticDT, 30)

patient_id,day_given,antibiotic_type,route
1,2,ciprofloxacin,IV
1,4,ciprofloxacin,IV
1,6,ciprofloxacin,IV
1,7,doxycycline,IV
1,9,doxycycline,IV
1,15,penicillin,IV
1,16,doxycycline,IV
1,18,ciprofloxacin,IV
8,1,doxycycline,PO
8,2,penicillin,IV


## 2. Which antibiotics are "new?"
<p>These data represent all drugs administered in a hospital over a two-week period. Each row represents one time the patient was given a drug. The variables include the patient id, the day the drug was administered, and the type of drug. For example, patient "0010" received doxycycline by mouth on the first day of her stay.</p>
<p>We are identifying patients with infection using the following very specific criteria. The basic idea is that a patient starts antibiotics within a couple of days of a blood culture, and is then given antibiotics for at least 4 days.</p>
<p><strong>Criteria for Suspected Infection</strong>*</p>
<ul>
<li>Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed.</li>
<li>The sequence must start with a “new antibiotic,” defined as an antibiotic type that hasn't been given in the past 2 days.</li>
<li>The sequence must start within 2 days of a blood culture.  </li>
<li>There must be at least one <strong>IV</strong> antibiotic within the +/-2 day window period. (An IV drug is one that is given intravenously.)</li>
</ul>
<p>Let's start with the second item, by finding which rows represent 'new' antibiotics. We will be checking whether each particular antibiotic type was given in the past 2 days. Let's visualize this task by looking at the data sorted by id, then antibiotic type, and finally, day.</p>

In [159]:
# Sort the data by id, antibiotic type, day. 
setorder(antibioticDT, patient_id, antibiotic_type, day_given)

# Print and examine the first 40 rows.
head(antibioticDT, 40)

# Use `shift` to calculate the last day the particular drug was administered.
antibioticDT[ , last_administration_day := shift(day_given, 1), 
  by = .(patient_id, antibiotic_type)]

# Calculate the number of days it's been since the last administration.
antibioticDT[ , days_since_last_admin := day_given - last_administration_day]

# Make a new variable called `antibiotic_new` with an initial value of 1. 
antibioticDT[ , antibiotic_new := 1]

# Reset this variable to 0 when it's only been 1 or 2 days since the last administration.
antibioticDT[ , antibiotic_new := ifelse(days_since_last_admin %in% 0:2, 0, 1)]

patient_id,day_given,antibiotic_type,route
1,2,ciprofloxacin,IV
1,4,ciprofloxacin,IV
1,6,ciprofloxacin,IV
1,18,ciprofloxacin,IV
1,7,doxycycline,IV
1,9,doxycycline,IV
1,16,doxycycline,IV
1,15,penicillin,IV
8,1,doxycycline,PO
8,3,doxycycline,IV


## 3. Looking at the blood culture data
<p>Now let's look at blood culture data from the same two-week period in this hospital. These data are in blood_cultureDT.csv. Let's start by reading it into the workspace and having a look at a few rows. </p>
<p>Each row represents one blood culture and gives the patient's id and the day it occurred. For example, patient "0006" had a culture on the first day of his hospitalization and again on the ninth. Notice that some patients from the antibiotic data are not in this data and vice versa. (Some patients are in neither because they received neither antibiotics nor a blood culture.)</p>

In [161]:
# Read in `blood_cultureDT.csv`.
blood_cultureDT <- fread('datasets/blood_cultureDT.csv')

# Print the first 30 rows
head(blood_cultureDT, 30)

patient_id,blood_culture_day
1,3
1,13
8,2
8,13
23,3
39,10
45,4
45,9
45,11
51,3


## 4. Combine the antibiotic data and the blood culture data
<p>To find which antibiotics were given close to a blood culture, we'll need to combine the drug administration data with the blood culture data. Let's keep only patients that are still candidates for infection, so only those in both data sets.</p>
<p>A tricky part is that some patients will have had blood cultures on several different days. For each of those days, we are going to see if there's a sequence of antibiotic days close to it. To accomplish this, in the merge we will match each blood culture to each antibiotic day.</p>
<p>After sorting the data following the merge, you should be able to see that each patient's antibiotic sequence is repeated for each blood culture day. This will allow us to look at each blood culture day and check whether it is associated with a qualifying sequence of antibiotics.</p>

In [163]:
# Make a combined dataset by merging antibioticDT with blood_cultureDT.
combinedDT <- merge(blood_cultureDT, antibioticDT, by='patient_id',
                   all=FALSE)

# Sort by patient_id, blood_culture_day, day_given, and antibiotic_type.
setorder(combinedDT, patient_id, blood_culture_day, day_given, antibiotic_type)

# Print and examine the first 40 rows.
head(combinedDT, 40)

patient_id,blood_culture_day,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new
1,3,2,ciprofloxacin,IV,,,1
1,3,4,ciprofloxacin,IV,2.0,2.0,0
1,3,6,ciprofloxacin,IV,4.0,2.0,0
1,3,7,doxycycline,IV,,,1
1,3,9,doxycycline,IV,7.0,2.0,0
1,3,15,penicillin,IV,,,1
1,3,16,doxycycline,IV,9.0,7.0,1
1,3,18,ciprofloxacin,IV,6.0,12.0,1
1,13,2,ciprofloxacin,IV,,,1
1,13,4,ciprofloxacin,IV,2.0,2.0,0


## 5. Determine whether each row is in-window
<p>Now that we have the drug and blood culture data combined, we can test each drug administration against each blood culture to see if it's "in window."</p>

In [165]:
# Make a new variable called `drug_in_bcx_window`, which is 1 if the drug was given in window and zero otherwise.
combinedDT[, drug_in_bcx_window := ifelse(abs(day_given-blood_culture_day)<=2, 1, 0)]

## 6. Check the IV requirement
<p>Now let's look at the fourth item in the criteria. </p>
<p><strong>Criteria for Suspected Infection</strong>*</p>
<ul>
<li>Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed.</li>
<li>The sequence must start with a “new antibiotic” (not given in the prior 2 days).</li>
<li>The sequence must start within +/-2 days of a blood culture.  </li>
<li>There must be at least one <strong>IV</strong> antibiotic within the +/-2 day window period. (An IV drug is one that is given intravenously, not by mouth.)</li>
</ul>

In [167]:
# Make a new indicator of whether a given blood culture day had at least one IV drug given in window.
combinedDT[ , 
          any_iv_in_bcx_window := as.numeric(any((drug_in_bcx_window) & 
                                             (route=='IV'))),
          by = .(patient_id, blood_culture_day)]

# Exclude rows in which the blood_culture_day does not have any IV drugs in window. 
combinedDT = combinedDT[any_iv_in_bcx_window>0,]

In [168]:
head(combinedDT,40)

patient_id,blood_culture_day,day_given,antibiotic_type,route,last_administration_day,days_since_last_admin,antibiotic_new,drug_in_bcx_window,any_iv_in_bcx_window
1,3,2,ciprofloxacin,IV,,,1,1,1
1,3,4,ciprofloxacin,IV,2.0,2.0,0,1,1
1,3,6,ciprofloxacin,IV,4.0,2.0,0,0,1
1,3,7,doxycycline,IV,,,1,0,1
1,3,9,doxycycline,IV,7.0,2.0,0,0,1
1,3,15,penicillin,IV,,,1,0,1
1,3,16,doxycycline,IV,9.0,7.0,1,0,1
1,3,18,ciprofloxacin,IV,6.0,12.0,1,0,1
1,13,2,ciprofloxacin,IV,,,1,0,1
1,13,4,ciprofloxacin,IV,2.0,2.0,0,0,1


## 7. Find the first day of possible sequences
<p>We're getting close! Let's review the criteria:</p>
<p><strong>Criteria for Suspected Infection</strong>*</p>
<ul>
<li>Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed.</li>
<li>The sequence must start with a “new antibiotic” (not given in the prior 2 days).</li>
<li>The sequence must start within +/-2 days of a blood culture.  </li>
<li>There must be at least one IV antibiotic within the +/-2 day window period.</li>
</ul>
<p>Let's assess the first criterion, starting by finding the first day of possible 4-day qualifying sequences.    </p>

In [170]:
# Create a new variable called day_of_first_new_abx_in_window.
combinedDT[ , 
    day_of_first_new_abx_in_window := 
        day_given[drug_in_bcx_window][1],
    by = .(patient_id, blood_culture_day)]

# Remove rows where the day is before this first qualifying day.
#combinedDT <- combinedDT[blood_culture_day >= day_of_first_new_abx_in_window]
combinedDT <- combinedDT[day_given >= day_of_first_new_abx_in_window]

## 8. Simplify the data
<p>The first criterion was: Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed.</p>
<p>We've pinned down the first day for possible sequences, so now we can check for sequences of four days. So now we don't need the drug type, we just need the days of administration.</p>

In [172]:
# Create a new data.table containing only patient_id, blood_culture_day, and day_given. 
simplified_data <- combinedDT[,c('patient_id', 
                                'blood_culture_day',
                                'day_given'),with=FALSE]
print(dim(simplified_data))
# Remove duplicate rows.
simplified_data = unique(simplified_data)
print(dim(simplified_data))

[1] 5853    3
[1] 3956    3


## 9. Extract first four rows for each blood culture
<p>To check for sequences of 4 days, let's pull out the first four days (rows) for each patient-blood culture combination. Some patients will have less than four antibiotic days. Let's remove them first.</p>

In [174]:
# Make a new variable showing the number of antibiotic days each patient-blood culture day combination had.
simplified_data[ , num_antibiotic_days := .N, by = .(patient_id, blood_culture_day)]

# Remove blood culture days with less than four antibiotic days (rows). 
simplified_data = simplified_data[num_antibiotic_days>=4]
print(dim(simplified_data))
# Select the first four days for each blood culture.
first_four_days <- simplified_data[ , .SD[1:4], by = .(patient_id, blood_culture_day)]
#first_four_days <- simplified_data[ .SD[1:4], , by = .(patient_id, blood_culture_day)]
dim(first_four_days)

[1] 3704    4


In [175]:
head(first_four_days, 40)

patient_id,blood_culture_day,day_given,num_antibiotic_days
1,3,2,8
1,3,4,8
1,3,6,8
1,3,7,8
1,13,2,8
1,13,4,8
1,13,6,8
1,13,7,8
8,2,1,6
8,2,2,6


## 10. Consecutive sequence
<p>Now we need to check whether each 4-day sequence qualifies by having no gaps of more than one day.</p>
<!--"Patient receives antibiotics for a sequence of 4 days, with gaps of 1 day allowed."-->

In [177]:
# Make a new variable indicating whether the antibiotic sequence has no skips of more than one day.
first_four_days[ , 
                four_in_seq := max(diff(day_given))<=2,
                by = .(patient_id, blood_culture_day)]
four_in_seq = first_four_days$four_in_seq

In [178]:
head(first_four_days, 50)

patient_id,blood_culture_day,day_given,num_antibiotic_days,four_in_seq
1,3,2,8,True
1,3,4,8,True
1,3,6,8,True
1,3,7,8,True
1,13,2,8,True
1,13,4,8,True
1,13,6,8,True
1,13,7,8,True
8,2,1,6,False
8,2,2,6,False


## 11. Select the patients who meet criteria
<p>A patient meets the criteria if any of his/her blood cultures were accompanied by a qualifying sequence of antibiotics. Now that we've determined whether each blood culture qualifies, let's select the patients who meet the criteria.</p>

In [180]:
# Select the rows which have `four_in_seq` equal to `1`.
suspected_infection <- first_four_days[four_in_seq==1,]

# Retain only the `patient_id` column.
suspected_infection <- suspected_infection[,'patient_id']

# Get rid of duplicates.
suspected_infection <- unique(suspected_infection)

# Make an infection indicator
suspected_infection[ , infection := 1]

## 12. Find the prevalence of sepsis
<p>In this notebook, we've used two EHR data sets and used this information to flag patients who were suspected to have a severe infection. We've also gotten a data.table workout!</p>
<p>Let's see what proportion of patients had serious infection in these data. </p>
<p>So far we've been looking at records of all antibiotic administrations and blood cultures occurring over a two week period at a particular hospital. However, not all patients who were hospitalized over this period are represented in combinedDT, since not all of them had antibiotics or blood cultures.</p>

In [182]:
# Read in all_patients.csv
all_patientsDT <- fread('datasets/all_patients.csv')

# Merge this with the infection flag data.
all_patientsDT <- merge(all_patientsDT, suspected_infection, by='patient_id', all=TRUE)

# Set any missing values of the infection flag to 0.
all_patientsDT[, infection := ifelse(is.na(infection), 0, infection)]

# Calculate the percentage of patients who met the criteria for presumed infection.
ans=mean(all_patientsDT[,infection])*100