# Health data wrangling and input modelling exercises.

**In this lab you will:**

* Gain practical knowledge in pre-processing and analysing real-world stochastic health system data
* Learn how to fit distributions to data
* Learn how to select a suitable distribution for your data

> **STUDENT BEWARE**: This lab can be frustrating and will test your `pandas` skills! It is designed to show you the sort of data wrangling, analysis and modelling decisions/assumptions you may need to make in a real simulation study. But do persevere with it (answers are available as well!). The experience should demonstrate that fitting distributions to real data is difficult and not quite as straightforward as textbooks make out! By the end of the lab both your `pandas` skills and simulation input modelling skills will have improved.  >_<

> **P.S.** If you find yourself working on a simulation project in your job, it is worth remembering that simulation studies are very time consuming (i.e. problem structuring, data collection, data wrangling, input modelling, model coding and output analysis) and you have to be fairly pragmatic in your input modelling in order to get it done on time!

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Autofit imports

In [None]:
from input_modelling.fitting import auto_fit

# Exercise 1: Time dependent arrival process at an A&E

In this notebook, you will work with a hospital's Accident and Emergency (A&E) data taken from [NHS England's Synthetic A&E dataset](https://data.england.nhs.uk/dataset/a-e-synthetic-data).

Each row in the dataset represents an individual patient and information about their attendance at the A&E. 

You have the following fields:

* **Age band**: '1-17'; '18-24'; '25-44'; '45-64'; '65-84'; '84+'
* **Sex**: 1.0 or 2.0 
* **AE_Arrive_Date**: The date that the patient arrived at the A&E e.g. 2015-07-02
* **AE_Arrive_HourOfDay**: Six 4-hour time bands.  '01-04'; '05-08'; '09-12'; '13-16'; '17-20'; '21-24'
* **AE_Time_Mins**: Length of stay in A&E in minutes, rounded to the nearest 10.

The aim of exercise 1 is to investigate the **time-dependent arrival rate** to the A&E. By the end of the exercise you need to have produced a table of inter-arrival times broken down by the six **AE_Arrive_HourOfDay** bands.

## Exercise 1.a: Read in the raw data

**Task:**
* The data for this exercise is located at the URL below.
* Use `pandas` to load it into a DataFrame.

**Questions**:
* Inspect the dataframe.  Check its
    * dimensions 
    * variable datatypes
    * if there are any missing data within the fields.

In [None]:
url = 'https://raw.githubusercontent.com/health-data-science-OR/hpdm097-datasets/master/ed_input_modelling.csv'
#your code here...
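
The inspection steps listed above can be sketched on a tiny made-up table before you try them on the real file. This is a minimal illustration, not the model answer: the three rows are invented, and the column headers are assumed to match the field names listed earlier (the real CSV header may be spelled differently).

```python
import io

import pandas as pd

# Hypothetical miniature of the A&E data (invented rows, assumed column names).
csv_text = """Age band,Sex,AE_Arrive_Date,AE_Arrive_HourOfDay,AE_Time_Mins
18-24,1.0,2015-07-02,09-12,120
65-84,2.0,2015-07-02,17-20,
1-17,,2015-07-03,01-04,60
"""

# pd.read_csv takes a URL string in the same way as this file-like object.
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape      # dimensions of the DataFrame
dtypes = df.dtypes             # datatype of each column
missing = df.isna().sum()      # number of missing values per column

print(f'{n_rows} rows x {n_cols} columns')
print(dtypes)
print(missing)
```

`df.info()` reports much of the same information in a single call.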

## Exercise 1.b: Preprocess the data

**Task:**
* The field 'AE_Arrive_Date' should be a date.
* Convert this field to a datetime.
* Drop the age band 1-17 (children) as they use the paediatric A&E.

**Hints:**
* There is a built in `pandas` function to help.

**Questions**:
* What are the maximum and minimum dates in the dataset?
* How many days' worth of data are there in the dataset?

In [None]:
#your code here...
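
As a rough guide to the preprocessing steps, the sketch below converts a text column to datetimes with `pd.to_datetime`, filters out one age band, and counts the days spanned. The four rows are invented and the column name `'Age band'` is an assumption taken from the field list above.

```python
import pandas as pd

# Invented rows for illustration only (assumed column names).
df = pd.DataFrame({
    'Age band': ['18-24', '1-17', '65-84', '25-44'],
    'AE_Arrive_Date': ['2015-07-02', '2015-07-02', '2015-07-03', '2015-07-05'],
})

# Convert the text dates to pandas datetimes.
df['AE_Arrive_Date'] = pd.to_datetime(df['AE_Arrive_Date'])

# Keep everyone except the 1-17 age band.
adults = df[df['Age band'] != '1-17']

first_day = adults['AE_Arrive_Date'].min()
last_day = adults['AE_Arrive_Date'].max()

# Add 1 so that both the first and the last day are counted.
n_days = (last_day - first_day).days + 1

print(first_day.date(), last_day.date(), n_days)
```

Counting days from the date range, rather than from the number of unique dates, also includes any days with no recorded attendances.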

## Exercise 1.c: Analyse the data for patterns

Before you produce the arrival rate you should explore the dataset for a potential trend and systemic breaks (where there is a big change in demand).

**Task**:
* Wrangle the dataset into a time series that reports the number of attendances per day.
* Plot the time series.

**Questions**
* Is there a trend in the data?
* Do your findings alter your plans for what data to include when modelling arrival rates to the A&E?
* If you exclude any data how much remains?

**Hints:**
* It is possible to do this with one line of `pandas`.  If you are unsure investigate pandas options for grouping data.

In [None]:
#your code here ...
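
One way to build a daily attendance series is to group by the arrival date and count rows. The sketch below uses six invented arrivals; filling empty days with zero via `asfreq` is an optional extra step and an assumption about how you want gaps treated.

```python
import pandas as pd

# Invented arrivals: note that nobody arrives on 2015-07-04.
dates = pd.to_datetime([
    '2015-07-02', '2015-07-02', '2015-07-03',
    '2015-07-05', '2015-07-05', '2015-07-05',
])
df = pd.DataFrame({'AE_Arrive_Date': dates})

# One row per patient, so the group size is the number of attendances that day.
daily = df.groupby('AE_Arrive_Date').size()

# Optional: insert any missing calendar days with a count of zero.
daily = daily.asfreq('D', fill_value=0)

print(daily)
```

Calling `daily.plot()` on the result draws the time series with matplotlib.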

## Exercise 1.d: General arrival rates by arrival time band.

**Task:**
* Ignoring any day of week or monthly effect, calculate the mean number of arrivals per time band.
* Calculate the mean inter-arrival time (IAT) for each of the 4-hour time bands.
* Plot the mean arrival rate by time of day.

**Hints:**
* This problem can again be solved by using `pandas` grouping methods.
* The first thing you need to do is calculate the **total** number of arrivals by time band.
* To calculate the average you also need to know how many **days** there are in your dataset.
* Remember that each time band represents a 4-hour period (240 minutes).  IAT = Total time in the band (in mins) / Mean No. Arrivals in the band per day.


In [None]:
#your code here ...
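
The arithmetic in the hints can be checked on a small invented example. Here two days of data contain 8 arrivals in band '09-12' and 2 in band '01-04', giving means of 4 and 1 arrivals per day and IATs of 240 / 4 = 60 and 240 / 1 = 240 minutes. The data and the constant name `BAND_MINS` are illustrative assumptions.

```python
import pandas as pd

BAND_MINS = 4 * 60  # each time band covers 4 hours (hypothetical constant name)

# Invented data: the same pattern of five arrivals on each of two days.
one_day_bands = ['01-04', '09-12', '09-12', '09-12', '09-12']
df = pd.DataFrame({
    'AE_Arrive_Date': pd.to_datetime(['2015-07-02'] * 5 + ['2015-07-03'] * 5),
    'AE_Arrive_HourOfDay': one_day_bands * 2,
})

# Number of days covered by the data (inclusive of both end points).
n_days = (df['AE_Arrive_Date'].max() - df['AE_Arrive_Date'].min()).days + 1

# Total arrivals per band, then the mean per day.
totals = df.groupby('AE_Arrive_HourOfDay').size()
mean_arrivals = totals / n_days

# Mean inter-arrival time in minutes for each band.
iat_mins = BAND_MINS / mean_arrivals

print(pd.DataFrame({'mean_arrivals': mean_arrivals, 'iat_mins': iat_mins}))
```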

# Exercise 2: Choosing a distribution to model length of stay

The next step in your A&E simulation input modelling is to select a distribution to represent the total time a patient spends in the department.  It is possible that the distribution, or at least its parameters, will vary between subgroups in the population.  For example, the over 65s may have a different distribution from the under 65s.

## Exercise 2.a

**Task**
* Plot a histogram of the time spent in A&E, ignoring any subgroups of the population.
* Use `auto_fit` to help you select a distribution.  Set the parameter `pp=True`

**Questions**
* Has `auto_fit` been useful?  If so which elements of it?
* What do you conclude from the p-p plots?

**Hints**
* For this exercise it is okay to use the full date range of the ED data, but make sure you exclude the under 18s.

In [None]:
#your code here...
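
If you want to see roughly what a fitting tool does under the hood, the sketch below fits a few candidate distributions by hand with `scipy.stats` and compares them using the Kolmogorov-Smirnov statistic. It is not the `auto_fit` function from the lab, whose internals are not described here. The data is a synthetic gamma sample standing in for `AE_Time_Mins`, and the choice of candidates is an assumption.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for AE_Time_Mins (assumption: NOT the real data).
rng = np.random.default_rng(42)
los = rng.gamma(shape=2.0, scale=60.0, size=500)

# Candidate distributions to compare (an illustrative choice).
candidates = {
    'expon': stats.expon,
    'gamma': stats.gamma,
    'lognorm': stats.lognorm,
}

ks_stats = {}
for name, dist in candidates.items():
    # Fix the location at zero because a length of stay cannot be negative.
    params = dist.fit(los, floc=0)
    # Smaller KS statistic = fitted CDF is closer to the empirical CDF.
    ks_stats[name] = stats.kstest(los, dist.cdf, args=params).statistic

# P-P plot coordinates for one candidate:
# empirical cumulative probability against fitted cumulative probability.
gamma_params = stats.gamma.fit(los, floc=0)
sorted_los = np.sort(los)
empirical_p = (np.arange(1, len(los) + 1) - 0.5) / len(los)
fitted_p = stats.gamma.cdf(sorted_los, *gamma_params)

for name, stat in sorted(ks_stats.items(), key=lambda kv: kv[1]):
    print(f'{name}: KS statistic = {stat:.3f}')
```

Plotting `fitted_p` against `empirical_p` gives a p-p plot; points close to the diagonal suggest a reasonable fit. Bear in mind that the real lengths of stay are rounded to the nearest 10, so they will be lumpier than this smooth synthetic sample.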

## Exercise 2.b: LoS by age banding

**Task:**
* Investigate if different subgroups have different length of stay distributions.
* Your analysis should check if there is a difference between the over and under 65s.

**Questions**:
* What distributions would you select for these subgroups?

In [None]:
#your code here ...
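
A first look at subgroup differences can be had by flagging the older age bands and summarising length of stay per group. The sketch below uses synthetic lengths of stay, and the age band labels and the `'Age band'` column name are assumptions copied from the field list above; the helper column `over_65` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (assumed column names and labels).
rng = np.random.default_rng(1)
bands = ['18-24', '25-44', '45-64', '65-84', '84+'] * 40
df = pd.DataFrame({
    'Age band': bands,
    # Round to the nearest 10 to mimic the rounding in the real field.
    'AE_Time_Mins': rng.gamma(shape=2.0, scale=60.0, size=len(bands)).round(-1),
})

# Hypothetical helper column: True for the two oldest age bands.
df['over_65'] = df['Age band'].isin(['65-84', '84+'])

# Count, mean, spread and quartiles of length of stay for each subgroup.
summary = df.groupby('over_65')['AE_Time_Mins'].describe()

print(summary)
```

Summary statistics only hint at a difference; overlaying a histogram for each subgroup and fitting each one separately gives a fuller picture.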

# End.