# Data Story

1. Questions
    - Comparisons
    - Counts
    - Trends
    - Plots (Bar/Histogram/Scatter/Time-Series)
    - Cross-tabs
2. Insights
    - Correlations
    - Hypotheses
3. Narrative
    - Present in a story form, needs to flow
    - What trends/relationships would make it more complete?

**Notes:**

- Date/Time needs to be categorized before being used in ML models
- Only choose a few columns to explore in detail, can always go back and explore further after starting ML
- *Histogram of diff in appointment-date relative to scheduled-date*

## Create clean DataFrame based on 2.0-jkg-data-wrangling

In [2]:
# Package imports
import pandas as pd

# import raw data into DataFrame
df_import = pd.read_csv('../data/raw/KaggleV2-May-2016.csv')

# Rename columns
df_import.columns = ['Patient_ID',
                  'Appointment_ID',
                  'Gender',
                  'Scheduled_Date',
                  'Appointment_Date',
                  'Age',
                  'Neighborhood',
                  'Welfare',
                  'Hypertension',
                  'Diabetes',
                  'Alcoholism',
                  'Disability',
                  'SMS_sent',
                  'No_show']

# Convert Scheduled_Date to DateTime
df_import['Scheduled_Date'] = pd.to_datetime(df_import['Scheduled_Date'])

# Convert Appointment_Date to DateTime
df_import['Appointment_Date'] = pd.to_datetime(df_import['Appointment_Date'])

# Set index to Appointment_ID (unique)
df_import.set_index('Appointment_ID')

# Copy df_import to df_clean
df_clean = df_import.copy()

## Visualization TODOs

- Create DataFrame of Patients w/ duplicates removed
    - Gender
    - Age
    - Count of Total Appointments
    - Count of No-show Appointments
- Histogram of DateDiff (overall and missed appointments)
- Crosstab of SMS_sent and noshow

In [8]:
# Calculate Date Difference (Appointment - Scheduled)

(df_clean['Appointment_Date'] - df_clean['Scheduled_Date'])

0        -1 days +05:21:52
1        -1 days +07:51:33
2        -1 days +07:40:56
3        -1 days +06:30:29
4        -1 days +07:52:37
5          1 days 15:23:09
6          1 days 08:54:48
7          1 days 08:20:02
8        -1 days +15:57:44
9          1 days 11:11:35
10         1 days 09:01:49
11         2 days 15:15:48
12         0 days 12:26:09
13         0 days 09:07:53
14         0 days 13:53:36
15         2 days 15:12:33
16         0 days 15:08:13
17         0 days 14:31:03
18         2 days 13:05:42
19       -1 days +13:16:46
20         1 days 16:08:46
21         1 days 13:09:15
22         3 days 10:30:44
23         0 days 13:32:55
24       -1 days +09:40:41
25         2 days 08:55:43
26       -1 days +09:40:18
27         1 days 13:08:15
28       -1 days +08:11:58
29       -1 days +08:43:31
                ...       
110497   -1 days +14:13:27
110498   -1 days +13:38:46
110499   -1 days +14:17:04
110500   -1 days +14:24:47
110501   -1 days +13:40:48
110502   -1 days +13:09:18
1

In [3]:
# New Patients DataFrame df_patients
df_patients = df_clean.copy()

In [4]:
df_patients.groupby(['Patient_ID', 'Gender', 'Age', 'No_show'])

<pandas.core.groupby.DataFrameGroupBy object at 0x00000156A6435D30>