# Midterm take-home

<p>Instructions
<ol>
<li>Download the file <i>appointments.csv</i> into the same folder as the current Jupyter notebook
<li>Run the code below and then answer the questions
</ol></p>

<b>Penalties:</b> You will incur penalties if:
<ul>
<li>Your answer is different from the correct one</li>
<li>Your code is unnecessarily slow</li>
<li>Your code is longer than specified</li>
<li>In an attempt to limit the lines of code, you make your code too hard to read or too slow, for example by copy-pasting pieces of code in the same line instead of declaring a variable in one line and using the variable in another.</li>
</ul>

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('appointments.csv', index_col=0)

## Data Description

In [2]:
len(df)

41631

This is an appointment data set for an undisclosed outpatient clinic. One row corresponds to an appointment. Each appointment is characterized by the following attributes:
<ul>
<li><b>AppointmentID</b>: The unique identifier of the appointment.
<li><b>MRN</b>: The unique identifier of the patient (MRN = Medical Record Number).
<li><b>Appt Date</b>: The date when the appointment took place.
<li><b>Appt Time</b>: The time (expressed in minutes after midnight) when the appointment took place.
<li><b>Appointment Status</b>: The outcome of the appointment. 
    <ul>
    <li><i>Arrived</i>: the appointment took place regularly.
    <li><i>Cancelled</i>: the appointment was cancelled by the patient before taking place. 
    <li><i>Bumped</i>: the appointment was cancelled by the provider.
    <li><i>No Show</i>: the patient did not show up for the appointment.
    <li><i>Pending</i>: the appointment had not yet taken place when the data was pulled.
    </ul>
<li><b>Time When Appt Arrived</b>: The time (expressed in minutes after midnight) when the patient checked in on the appointment day.
<li><b>Date When Appt Scheduled</b>: The day when the appointment was scheduled.
<li><b>CAN or BMP Date</b>: If the appointment was cancelled or bumped, the date when this event happened.
<li><b>Provider ID</b>: The id of the provider scheduled to see the patient.
<li><b>Gender</b>: The patient's gender.
<li><b>Patient Age at Appt Date</b>: The age of the patient at Appt Date.
<li><b>Marital Status</b>: The patient's marital status.
<li><b>Employment Status</b>: The patient's employment status.
</ul>

## Question 1 (2 pts, $\le$ 3 lines of code)

Find whether marital status and gender affect the probability of no-show. First, you will need to make a copy of df without the appointments whose outcome is different from no-show or arrived. Then, for each existing combination of marital status and gender, find the number of appointments and their probability of no-show. <b>Use at most 3 lines of code</b>.

In [3]:
dfcopy = df[(df['Appointment Status'] == 'No Show') | (df['Appointment Status'] == 'Arrived')].copy()

In [4]:
dfcopy['No Show'] = dfcopy['Appointment Status'] == 'No Show'

In [36]:
dfcopy.groupby(['Marital Status', 'Gender']).agg({'No Show': [lambda se: se.sum(), 'mean']}).rename(columns={'<lambda>': 'Number of No Show', 'mean': 'Probability of No Show'})

| Marital Status | Gender | Number of No Show | Probability of No Show |
|---|---|---|---|
| DIVORCED | F | 379 | 0.157392 |
| DIVORCED | M | 91 | 0.172023 |
| LIFE PARTNER | F | 2 | 0.095238 |
| MARRIED | F | 868 | 0.100719 |
| MARRIED | M | 463 | 0.127478 |
| SEPARATED | F | 187 | 0.267143 |
| SEPARATED | M | 42 | 0.25 |
| SINGLE | F | 1171 | 0.199116 |
| SINGLE | M | 514 | 0.219377 |
| UNKNOWN | F | 9 | 0.409091 |
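
The question asks for the number of appointments in each group as well as the no-show probability. One way to report both is pandas named aggregation. The block below is a minimal sketch on a hypothetical toy frame (`toy`, `kept`, and `summary` are illustrative names, not part of the notebook), assuming the same column names as `appointments.csv`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df from appointments.csv.
toy = pd.DataFrame({
    'Appointment Status': ['Arrived', 'No Show', 'Cancelled',
                           'Arrived', 'No Show', 'Arrived'],
    'Marital Status': ['SINGLE', 'SINGLE', 'SINGLE',
                       'MARRIED', 'MARRIED', 'MARRIED'],
    'Gender': ['F', 'F', 'F', 'M', 'M', 'M'],
})

# Keep only appointments that ended in an arrival or a no-show.
kept = toy[toy['Appointment Status'].isin(['Arrived', 'No Show'])].copy()
kept['No Show'] = kept['Appointment Status'] == 'No Show'

# Named aggregation: one row per (marital status, gender) combination.
summary = kept.groupby(['Marital Status', 'Gender']).agg(
    appointments=('No Show', 'size'),         # number of appointments
    no_shows=('No Show', 'sum'),              # number of no-shows
    no_show_probability=('No Show', 'mean'),  # share of no-shows
)
print(summary)
```

Because `No Show` is boolean, `'sum'` counts the no-shows and `'mean'` gives their share, so no lambda or column rename is needed.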


## Question 2 (2 pts, $\le$ 3 lines of code)

<p>Using the data frame constructed in Q1, find whether the lead time to the appointment affects the probability of no-show. The lead time to the appointment is the number of days elapsed from the moment when the appointment was requested to the appointment date. </p>
<p>Find the no-show probability and number of appointments for each of the following lead time intervals (pay attention to the interval boundaries):
<ul>
<li>$\le$ 10 days
<li>between 11 and 20 days
<li>between 21 and 30 days
<li>over 30 days
</ul>
<p> <b>Use at most 3 lines of code</b>. Hint: You may find the function pd.cut helpful.</p>

In [6]:
dfcopy['leadTime'] = (pd.to_datetime(dfcopy['Appt Date']) - pd.to_datetime(dfcopy['Date When Appt Scheduled'])).dt.days

In [7]:
dfcopy['LeadTimeInterval'] = pd.cut(dfcopy.leadTime, [0, 11, 21, 31, dfcopy.leadTime.max() + 1], right=False, labels=['≤ 10 days','between 11 and 20 days','between 21 and 30 days','over 30 days'])

In [8]:
dfcopy.groupby('LeadTimeInterval').agg({'No Show': [lambda se: se.sum(), 'mean']}).rename(columns={'<lambda>': 'Number of No Show', 'mean': 'Probability of No Show'})

| LeadTimeInterval | Number of No Show | Probability of No Show |
|---|---|---|
| ≤ 10 days | 587 | 0.106341 |
| between 11 and 20 days | 553 | 0.1513 |
| between 21 and 30 days | 515 | 0.163856 |
| over 30 days | 2391 | 0.169864 |
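
The interval boundaries are easy to get wrong, so it can help to check `pd.cut` on values that sit exactly on them. The sketch below uses hypothetical lead times and, as an assumption that differs from the cell above, `np.inf` as the last bin edge instead of `leadTime.max() + 1`.

```python
import numpy as np
import pandas as pd

# Hypothetical lead times (in days) chosen to sit on the interval boundaries.
lead = pd.Series([0, 10, 11, 20, 21, 30, 31, 400])

labels = ['≤ 10 days', 'between 11 and 20 days',
          'between 21 and 30 days', 'over 30 days']

# right=False makes each bin closed on the left and open on the right:
# [0, 11), [11, 21), [21, 31), [31, inf).
interval = pd.cut(lead, [0, 11, 21, 31, np.inf], right=False, labels=labels)
print(pd.DataFrame({'leadTime': lead, 'LeadTimeInterval': interval}))
```

With these bins a lead time of 10 days falls in the first interval and 11 days in the second. Negative lead times, if any exist in the data, would fall outside every bin and become `NaN`.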


## Question 3 (6 pts, $\le$ 8 lines of code not including comments, $\le$ 20 words of explanation)

<p>Using the data frame constructed in Q1, find whether the no-show probability of a given appointment is affected by the patient's no-show behavior prior to that appointment. Do not consider first-time appointments. </p>

<p>Present your results with one table and then discuss it in at most 20 words. Make sure that your table is easy to understand; for example, try to use descriptive column headers.</p>

<p> This problem is left vague on purpose. In particular, how to analyze past no-show behavior is up to you. No help will be given to answer this question, aside from clarifications on the wording and on the data. </p>

##### Group data by patient and aggregate the Appointment Status values.  
For instance, if a patient has 3 appointments, the Appointment Status list will look like  
[No Show, No Show, Arrived]
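
For such a list to describe behavior prior to an appointment, the statuses need to be in chronological order within each patient. The block below is a minimal sketch of one way to build the lists, on a hypothetical toy frame (`toy` and `history` are illustrative names) and assuming that `Appt Date` holds ISO-formatted dates.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names.
toy = pd.DataFrame({
    'MRN': [7, 7, 7, 9],
    'Appt Date': ['2013-03-01', '2013-01-15', '2013-02-10', '2013-05-05'],
    'Appointment Status': ['Arrived', 'No Show', 'No Show', 'Arrived'],
})

# Sort chronologically first so that each list reads oldest to newest.
# ISO-formatted date strings (YYYY-MM-DD) sort correctly as text.
history = (toy.sort_values(['MRN', 'Appt Date'])
              .groupby('MRN')['Appointment Status']
              .agg(list))
print(history)
```

Passing `list` to `agg` collects each patient's statuses directly, without wrapping and unwrapping an intermediate list.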

In [40]:
dfcopy.head(20)

|  | AppointmentID | MRN | Appt Date | Appt Time | Appointment Status | Time When Appt Arrived | Date When Appt Scheduled | CAN or BMP Date | Provider ID | Gender | Patient Age at Appt Date | Marital Status | Employment Status | No Show | leadTime | LeadTimeInterval |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 7264 | 2013-11-14 | 780 | Arrived | 759.0 | 2013-05-20 |  | 15 | F | 33 | MARRIED | NOT EMPLOYED | False | 178 | over 30 days |
| 2 | 3 | 9903 | 2013-10-16 | 720 | Arrived | 708.0 | 2013-06-12 |  | 15 | F | 40 | MARRIED | EMPLOYED FULL TIME | False | 126 | over 30 days |
| 3 | 4 | 9588 | 2012-12-13 | 720 | Arrived | 712.0 | 2012-11-19 |  | 15 | F | 25 | SINGLE | EMPLOYED FULL TIME | False | 24 | between 21 and 30 days |
| 5 | 6 | 4706 | 2013-07-04 | 720 | Arrived | 713.0 | 2013-06-24 |  | 15 | F | 26 | SINGLE | EMPLOYED FULL TIME | False | 10 | ≤ 10 days |
| 7 | 8 | 5786 | 2013-05-09 | 630 | Arrived | 621.0 | 2013-05-07 |  | 15 | F | 50 | MARRIED | EMPLOYED PART TIME | False | 2 | ≤ 10 days |
| 9 | 10 | 2429 | 2012-07-26 | 690 | Arrived | 708.0 | 2011-07-28 |  | 15 | F | 37 | SINGLE | EMPLOYED FULL TIME | False | 364 | over 30 days |
| 10 | 11 | 9592 | 2012-12-13 | 600 | Arrived | 601.0 | 2012-07-05 |  | 15 | F | 24 | SINGLE | EMPLOYED PART TIME | False | 161 | over 30 days |
| 12 | 13 | 5802 | 2012-12-25 | 600 | Arrived | 594.0 | 2012-12-03 |  | 15 | F | 22 | SINGLE | NOT EMPLOYED | False | 22 | between 21 and 30 days |
| 15 | 16 | 1177 | 2012-05-10 | 660 | Arrived | 661.0 | 2011-10-27 |  | 15 | M | 74 | MARRIED | RETIRED | False | 196 | over 30 days |
| 19 | 20 | 411 | 2011-12-22 | 690 | Arrived | 689.0 | 2011-08-22 |  | 15 | F | 26 | SINGLE | EMPLOYED PART TIME | False | 122 | over 30 days |


In [41]:
# Line 1
df2 = dfcopy.groupby('MRN').agg({'Appointment Status' : lambda se: [se.values]})['Appointment Status']

In [33]:
# Line 1
df2 = dfcopy.groupby('MRN').agg({'Appointment Status' : lambda se: [se.values]})['Appointment Status'].apply(lambda listOflist: listOflist[0]).to_frame().reset_index()

In [34]:
# This line is for data display
df2[:1] 

|  | MRN | Appointment Status |
|---|---|---|
| 0 | 0 | [Arrived] |


Remove patients who arrived at all of their appointments: to measure the impact of a prior no-show on the following appointment, a patient must have at least one no-show.

In [11]:
# Line 2
df2['NoShow'] = df2['Appointment Status'].apply(lambda row: 'No Show' in row)

In [12]:
# Line 3
df2 = df2[df2.NoShow]

Remove patients who had only one appointment, since there is no prior behavior to compare against.

In [23]:
# Line 4
df2['OneRecord'] = df2['Appointment Status'].apply(lambda row: len(row) == 1)

In [24]:
# Line 5
df2 = df2[~df2.OneRecord]

In [25]:
# This line is for data display
df2[:1] 

|  | MRN | Appointment Status | OneRecord |
|---|---|---|---|
| 2 | 4 | [Arrived, Arrived] | False |


Extract the 'Appointment Status' column and convert each record to a list of pairs of consecutive appointment outcomes.  
If a patient has the appointment status list [No Show, No Show, Arrived], then the visiting pattern is [[No Show, No Show], [No Show, Arrived]].

In [26]:
# Line 6
se = df2['Appointment Status'].apply(lambda row: [[row[i], row[i+1]] for i in range(len(row) - 1)])

In [27]:
# Certain patient's appointment pattern. This line is for display only
se[:1].values[0] 

[['Arrived', 'Arrived']]

Take every patient's visiting patterns and place them as rows in a new DataFrame.  
This DataFrame will look like:

| PreviousTime | ThisTime |
|---|---|
| No Show | No Show |
| No Show | Arrived |

In [28]:
# Line 7
df3 = pd.DataFrame(sum(se.values.tolist(), []), columns = ['PreviousTime','ThisTime'])

In [29]:
# This line is for display only
df3[:1]

|  | PreviousTime | ThisTime |
|---|---|---|
| 0 | Arrived | Arrived |
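
Flattening a list of lists with `sum(lists, [])` creates a new, longer list at every step, which can get slow when there are many patients. A possible alternative, sketched here on hypothetical data (`pairs_per_patient` and `flat` are illustrative names), is `itertools.chain.from_iterable` from the standard library.

```python
from itertools import chain

# Hypothetical per-patient lists of [previous, current] status pairs.
pairs_per_patient = [
    [['No Show', 'No Show'], ['No Show', 'Arrived']],
    [['Arrived', 'Arrived']],
]

# chain.from_iterable walks the nested lists once, whereas
# sum(lists, []) copies the accumulated list at every step.
flat = list(chain.from_iterable(pairs_per_patient))
print(flat)
```

The resulting flat list of pairs can be passed to `pd.DataFrame(..., columns=[...])` in the same way as the output of `sum`.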


We group by the patient's behavior at the previous appointment and then compute the no-show probability given that behavior.  
The 'No Show Probability Of This Time' is much higher when the previous appointment was a no-show (about 0.29) than when the patient arrived (about 0.13).

In [31]:
# Line 8
df3.groupby('PreviousTime')['ThisTime'].agg(lambda se: (se == 'No Show').mean()).to_frame('No Show Probability Of This Time')

| PreviousTime | No Show Probability Of This Time |
|---|---|
| Arrived | 0.12589 |
| No Show | 0.292938 |
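
A table with the same layout can also be built without intermediate lists, by sorting the appointments by date and shifting the no-show flag within each patient. The block below is a sketch of that alternative on hypothetical toy data, not the method used in the cells above. The names `toy`, `prev`, `Previous Status`, `repeat`, and `table` are illustrative, and the ordering assumes that `Appt Date` holds ISO-formatted dates.

```python
import pandas as pd

# Hypothetical toy data using the notebook's column names.
toy = pd.DataFrame({
    'MRN': [1, 1, 1, 2, 2, 3],
    'Appt Date': ['2013-01-01', '2013-02-01', '2013-03-01',
                  '2013-01-05', '2013-02-05', '2013-04-01'],
    'No Show': [True, True, False, False, False, True],
})

# Order each patient's appointments from oldest to newest.
toy = toy.sort_values(['MRN', 'Appt Date'])

# Outcome of the same patient's previous appointment; NaN for a first visit.
prev = toy.groupby('MRN')['No Show'].shift()
toy['Previous Status'] = prev.map({True: 'No Show', False: 'Arrived'})

# Dropping NaN removes first-time appointments.
repeat = toy.dropna(subset=['Previous Status'])
table = (repeat.groupby('Previous Status')['No Show']
               .agg(['size', 'mean'])
               .rename(columns={'size': 'Appointments',
                                'mean': 'No Show Probability'}))
print(table)
```

Reporting the number of appointments next to each probability shows how much data supports each row.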


#### Discussion of findings in the table:
#### Population-wise, a no-show at the previous appointment more than doubles the no-show probability of the next one (0.29 vs. 0.13).