# Data Story

1. Questions
    - Comparisons
    - Counts
    - Trends
    - Plots (Bar/Histogram/Scatter/Time-Series)
    - Cross-tabs
2. Insights
    - Correlations
    - Hypotheses
3. Narrative
    - Present in story form; the narrative needs to flow
    - What trends/relationships would make it more complete?

**Notes:**

- Date/time columns need to be converted to categorical features before being used in ML models
- Only choose a few columns to explore in detail; can always go back and explore further after starting ML
- *Histogram of the difference between appointment date and scheduled date*
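
One way to categorize a date/time column is to pull out calendar parts and bucket them. A minimal sketch on a made-up sample, assuming the column is already `datetime64`; the new column names and the bin edges are illustrative choices, not something fixed by the dataset:

```python
import pandas as pd

# Hypothetical sample; in this notebook the real column would be df_clean['Scheduled_Date']
sample = pd.DataFrame({
    'Scheduled_Date': pd.to_datetime([
        '2016-04-29 18:38:08',
        '2016-05-02 08:05:12',
        '2016-05-07 13:20:00',
    ])
})

# Calendar parts become categorical features
sample['scheduled_weekday'] = sample['Scheduled_Date'].dt.day_name()
sample['scheduled_hour'] = sample['Scheduled_Date'].dt.hour

# Bucket the hour into coarse parts of the day (bin edges are an arbitrary choice)
sample['scheduled_part_of_day'] = pd.cut(
    sample['scheduled_hour'],
    bins=[0, 12, 17, 24],
    labels=['morning', 'afternoon', 'evening'],
    right=False,  # intervals are [0, 12), [12, 17), [17, 24)
)

print(sample)
```

The resulting columns can then be one-hot encoded or treated as ordered categories, depending on the model.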

## Import clean DataFrame from 2.0-jkg-data-wrangling

In [1]:
# Package imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# import pickle file
df_clean = pd.read_pickle('../data/interim/clean_df.pickle')

## Visualization TODOs

- [x] Create DataFrame of Patients w/ duplicates removed (this would also help identify outliers, patients w/ tons of appointments)
- [ ] Histogram of DateDiff (overall and missed appointments)
- [ ] Crosstab of SMS_sent and noshow
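
The crosstab item could be sketched roughly as follows. The toy data is invented, and the 0/1 encoding of `SMS_sent` is an assumption about how the cleaned column looks:

```python
import pandas as pd

# Toy data standing in for df_clean (values are made up)
toy = pd.DataFrame({
    'SMS_sent': [0, 0, 0, 1, 1, 1, 1, 0],
    'No_show':  ['No', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No'],
})

# Raw counts of each SMS_sent / No_show combination
counts = pd.crosstab(toy['SMS_sent'], toy['No_show'])

# Row-normalized version: share of shows vs. no-shows within each SMS_sent group
rates = pd.crosstab(toy['SMS_sent'], toy['No_show'], normalize='index')

print(counts)
print(rates)
```

Row-normalizing makes the two groups comparable even when far more appointments fall in one of them.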

**Make Patients DataFrame**

In [3]:
df_patients = df_clean.copy()

# Drop unnecessary columns from patients dataframe
# Note: removing 'Age' here because it changes for some patients over time
df_patients.drop(columns=['Scheduled_Date', 'Appointment_Date', 'SMS_sent', 'Age'], inplace=True)

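# Convert No_show column to 1/0 for easier calculation
# (assign the result back instead of calling replace(..., inplace=True) on an attribute,
# which may not modify df_patients in newer pandas versions)
df_patients['No_show'] = df_patients['No_show'].map({'Yes': 1, 'No': 0})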

# Calculate total number of appointments and number of no_show appointments
patient_appointments = df_patients.groupby('Patient_ID')['Appointment_ID'].count()
patient_noshows = df_patients.groupby('Patient_ID')['No_show'].sum()

# Remove appointment columns (no longer needed after sum/count in previous step)
df_patients.drop(columns=['Appointment_ID', 'No_show'], inplace=True)

# Make DataFrame 'per-patient' by removing duplicates and setting Patient_ID as index
df_patients.drop_duplicates(inplace=True)
df_patients.set_index('Patient_ID', inplace=True)

# Add calculated series to dataframe as columns
patient_appointments.name = 'total_appointments'
df_patients = df_patients.join(patient_appointments)

patient_noshows.name = 'noshow_appointments'
df_patients = df_patients.join(patient_noshows)

# sort completed dataframe
df_patients.sort_index(inplace=True)
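
The per-patient step relies on every remaining column being constant for a given patient; if one of them varies, `drop_duplicates` keeps more than one row for that `Patient_ID`. A small sketch of the same aggregation plus a consistency check, on invented data (the column values and the `conflicting` name are hypothetical):

```python
import pandas as pd

# Toy appointment-level data; patient 3 appears with two different neighborhoods
toy = pd.DataFrame({
    'Patient_ID':     [1, 1, 2, 3, 3],
    'Appointment_ID': [10, 11, 12, 13, 14],
    'Neighborhood':   ['A', 'A', 'B', 'C', 'D'],
    'No_show':        [1, 0, 0, 1, 1],
})

# Count and sum in one pass using named aggregation
per_patient = toy.groupby('Patient_ID').agg(
    total_appointments=('Appointment_ID', 'count'),
    noshow_appointments=('No_show', 'sum'),
)

# Patient-level attributes after dropping the appointment columns
attributes = toy.drop(columns=['Appointment_ID', 'No_show']).drop_duplicates()

# Patients that still have more than one row have inconsistent attributes
conflicting = attributes[attributes.duplicated('Patient_ID', keep=False)]

print(per_patient)
print(conflicting)
```

If `conflicting` is non-empty on the real data, the varying column may need the same treatment as `Age` before the per-patient index is unique.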

**Make Appointments DataFrame**

In [4]:
df_appointments = df_clean.copy()

df_appointments.set_index('Appointment_ID', inplace=True)

# Drop extra columns
df_appointments.drop(columns=['Age', 'Neighborhood', 'Welfare', 'Hypertension', 'Diabetes', 'Alcoholism', 'Disability'], inplace=True)

# Calculate date difference (Appointment - Scheduled)
# Scheduled_Date includes a time of day but Appointment_Date does not, so compare calendar dates only
# `.dt.date` is applied to both columns; applying it to only one side causes a type mismatch
df_appointments['date_diff'] = df_appointments['Appointment_Date'].dt.date - df_appointments['Scheduled_Date'].dt.date
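
For the DateDiff histogram, whole-day integers are easier to bin than timedeltas. A rough sketch on made-up rows, assuming both columns are timezone-naive `datetime64`; it uses `.dt.normalize()` as an alternative to `.dt.date`, and the `days` name and bin range are illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows standing in for df_appointments (values are made up)
toy = pd.DataFrame({
    'Scheduled_Date': pd.to_datetime(
        ['2016-04-29 18:38:08', '2016-04-27 08:36:51', '2016-04-29 16:07:23']),
    'Appointment_Date': pd.to_datetime(
        ['2016-04-29', '2016-05-03', '2016-05-10']),
    'No_show': ['No', 'Yes', 'No'],
})

# Drop the time of day on both sides, subtract, and keep whole days as integers
days = (toy['Appointment_Date'].dt.normalize()
        - toy['Scheduled_Date'].dt.normalize()).dt.days

# Overlay the overall distribution and the missed appointments
fig, ax = plt.subplots()
ax.hist(days, bins=range(0, 15), alpha=0.5, label='All appointments')
ax.hist(days[toy['No_show'] == 'Yes'], bins=range(0, 15), alpha=0.5, label='No-shows')
ax.set_xlabel('Days between scheduling and appointment')
ax.set_ylabel('Appointments')
ax.legend()
```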


In [7]:
df_appointments.No_show.value_counts()

No     88208
Yes    22319
Name: No_show, dtype: int64
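
These counts work out to a no-show rate of roughly 20%. A quick check that rebuilds the counts by hand; on the real DataFrame, `value_counts(normalize=True)` should give the same shares directly:

```python
import pandas as pd

# Counts copied from the output above
counts = pd.Series({'No': 88208, 'Yes': 22319}, name='No_show')

# Share of each outcome among all appointments
shares = counts / counts.sum()
noshow_rate = shares['Yes']

print(shares.round(4))
```

The classes are imbalanced at about 4:1, which is worth keeping in mind when choosing evaluation metrics for the model.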

In [6]:
# Save DataFrames for use in modeling
df_patients.to_pickle('../data/interim/patients_df.pickle')
df_appointments.to_pickle('../data/interim/appointments_df.pickle')