# Track&Know Pilot 2 Synthetic Data Generator

## Acknowledgement

!["Funded by EU logo"](https://github.com/ibadkureshi/tnk-locationallocation/raw/master/docs/images/EU-H2020.jpg "Funded by EU H2020") This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 780754.

## About

This notebook generates a synthetic dataset of patient appointments and referrals to a fictional service in the North West of England. The code can be adjusted to cover any area of mainland Great Britain. Northern Ireland or the islands can be integrated too; however, the structure of the postcode, GP and OSA public data is different there, and the data input handlers will need to be adjusted.

The behaviour of the patients (visiting their nearby GP, then attending a specialist clinic), the appointments (clinic appointments between 7 days and 6 weeks after the referral, i.e. the GP appointment), and the facilities (one major facility taking most of the load, alongside minor facilities) is meant to mirror the real data used under Pilot 2 of the Track & Know Project.

Real postcodes, from Royal Mail, are used to generate the appointment population; real facilities are used, based on the British Lung Foundation's study of Obstructive Sleep Apnea; and real GPs are used, based on public data from the NHS.

**All data is randomly generated and changes with every run of the generation code. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.**

This data differs from the original data in that postcodes from the catchment are randomly selected. There were patterns in the occurrences of OSA and the referral schedules within the real data that are not reproduced in this synthetic dataset, because reapplying this code (with the pattern reproduction) to the Track&Know pilot catchment area *could* lead to an exposure of actual patients.

Please read the 'Variable Operationalisation' document for more information about the output of this code.

Please refer to the Track&Know website for more information about the project and Pilot 2. https://trackandknowproject.eu

In [1]:
import pandas as pd
import random
from datetime import datetime as dt
from datetime import timedelta
from geopy.distance import geodesic
from dateutil.rrule import DAILY, rrule, MO, TU, WE, TH, FR
from IPython.display import Image

## Data Input

### Area of Interest

Create a geographic bounding box to map out the area of interest in which the points need to reside.

In [2]:
LowerLeft = [53.594777, -3.732939]
UpperRight = [54.976496, -2.439935]

This section is only required if you want to visualise the area.

In [3]:
#import folium
#
#m = folium.Map(location=[(LowerLeft[0]+UpperRight[0])/2,(LowerLeft[1]+UpperRight[1])/2], tiles='cartodbpositron', zoom_start=8)
#folium.Rectangle(
#    bounds=[LowerLeft, UpperRight],
#    popup='Area of Interest',
#    color='crimson',
#    fill=False,
#).add_to(m)

#m

### Number of Patients

In [4]:
numPatients = 10000

### Facilities

This is a synthetic dataset of patients for an imaginary Obstructive Sleep Apnea service. Using the British Lung Foundation's data and propensity maps, we identify facilities that exist in our area of interest. ***This is a manual process and would need to be adjusted if the area of interest changes***. The map can be found here:
https://www.blf.org.uk/sites/default/files/BLF_OSA_Map_A4_UK_Overall_Weighted_Clinics_0.pdf

To replicate the behaviour observed in Track&Know's Pilot 2, we identify a primary centre and a set of outreach clinics. As it happens, the BLF data shows that this area of interest has 1 major centre and 4 minor centres. The first row of the facilities list is the major centre.

![Facilities in area of interest](imgs/facilities.png)

In [5]:
facilities = [
    ['Blackpool Victoria Hospital','FY3 8NR','53.820777','-3.013842'],
    ['Furness General Hospital','LA14 4LF','54.136862','-3.208707'],
    ['Westmorland General Hospital','LA9 7RG','54.307665','-2.732336'],
    ['Royal Lancaster Infirmary','LA1 4RP','54.042097','-2.798714'],
    ['Royal Blackburn Hospital','BB2 3HH','53.734792','-2.460098']
]

### Postcodes and Residential Areas

Open the UK postcode to latitude/longitude lookup. Both the full list and the list of outcodes are required. These can be found here:
https://www.freemaptools.com/download-uk-postcode-lat-lng.htm
Retain the master list of postcodes, but also create a new dataframe containing only those postcodes inside the area of interest.

In [6]:
postcodes = pd.read_csv('input/ukpostcodes.csv')

In [7]:
i = postcodes[(postcodes.latitude >= LowerLeft[0]) & (postcodes.latitude <= UpperRight[0])]
postcodePool = i[(i.longitude >= LowerLeft[1]) & (i.longitude <= UpperRight[1])]
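
The same two-step filter can be wrapped in a small reusable helper. This is a minimal sketch, assuming the dataframe has `latitude` and `longitude` columns as in the postcode file; the function name `within_bounding_box` and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def within_bounding_box(df, lower_left, upper_right):
    """Return the rows of df whose latitude/longitude fall inside the box (bounds inclusive)."""
    lat_ok = df["latitude"].between(lower_left[0], upper_right[0])
    lon_ok = df["longitude"].between(lower_left[1], upper_right[1])
    return df[lat_ok & lon_ok]

# Toy data for illustration only; these are not real postcodes.
toy = pd.DataFrame({
    "postcode": ["AA1 1AA", "BB2 2BB", "CC3 3CC"],
    "latitude": [54.0, 51.5, 54.5],
    "longitude": [-3.0, -0.1, -1.0],
})

inside = within_bounding_box(toy, [53.594777, -3.732939], [54.976496, -2.439935])
print(inside["postcode"].tolist())  # ['AA1 1AA']
```

Only the first toy row survives: the second fails the latitude test and the third fails the longitude test.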

### Short listing general practitioners (GPs)

Open the list of GPs available from the NHS. This data is not in the bundle but can be found here:
https://digital.nhs.uk/services/organisation-data-service/data-downloads/gp-and-gp-practice-related-data
As there are no headers in the file, we add names to those columns that are important.

In [8]:
gps = pd.read_csv('input/epraccur.csv', names = ["ident","name","a","b","location","street","town","district","c","postcode","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t"])

Create a function to merge the GP dataframe with the postcode dataframe; this gives lat/lon pairs for the GP addresses. Then create a shortlist of GPs that are in our area of interest.

In [9]:
def selective_merge(df1, df2, index, columns):
    # Merge the two dataframes (original dataset + postcodes) and keep only the requested columns
    df = df1.merge(df2, on=index, how='inner')
    df = pd.concat([df[column] for column in columns], axis=1)
    return df

In [10]:
x=selective_merge(postcodes, gps, 'postcode', 
                             ['ident','name','town','postcode','latitude','longitude'])
i = x[((x.latitude >= LowerLeft[0]) & (x.latitude <= UpperRight[0]))]
gpPool = i[(i.longitude >= LowerLeft[1]) & (i.longitude <= UpperRight[1])]
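
Because the merge is an inner join on the postcode, any GP whose postcode is missing from the postcode table is silently dropped. The sketch below illustrates this with toy dataframes; the practice names, identifiers and postcodes are made up for illustration, and the column selection is written as `df[columns]`, which is assumed to be equivalent to the `pd.concat` form used above.

```python
import pandas as pd

def selective_merge(df1, df2, index, columns):
    # Inner join on the shared column, then keep only the requested columns.
    df = df1.merge(df2, on=index, how="inner")
    return df[columns]

# Toy inputs for illustration only.
toy_postcodes = pd.DataFrame({
    "postcode": ["AA1 1AA", "BB2 2BB"],
    "latitude": [54.0, 54.2],
    "longitude": [-3.0, -2.9],
})
toy_gps = pd.DataFrame({
    "ident": ["X0001", "X0002"],
    "name": ["Practice A", "Practice B"],
    "postcode": ["AA1 1AA", "ZZ9 9ZZ"],  # the second postcode has no lat/lon entry
})

merged = selective_merge(toy_postcodes, toy_gps, "postcode",
                         ["ident", "name", "postcode", "latitude", "longitude"])
print(merged["name"].tolist())  # ['Practice A']
```

Comparing `len(gps)` with the length of the merged result is a quick way to see how many practices were lost to unmatched postcodes.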

This section is only required if you want to visualise the GPs. Please note the folium cell above needs to be uncommented too.

In [11]:
#for i in range(0,len(gpPool)):
#    folium.Marker([gpPool.iloc[i]['latitude'],gpPool.iloc[i]['longitude']]).add_to(m)
#m

## Synthetic Generation of Data

### Appointment Dates

Use the rrule function to generate the sequence of working days (Monday to Friday) for the period of interest. The start and end are given as epoch times; the values used here correspond to 1 January 2019 and 31 December 2019 (09:00 UTC, converted to local time by fromtimestamp). The code does not account for public holidays, but this could be added.

In [12]:
# The until date bounds the sequence; combining count with until is deprecated in dateutil
AppointDates = rrule(DAILY, dtstart=dt.fromtimestamp(1546333200), until=dt.fromtimestamp(1577782800), byweekday=(MO,TU,WE,TH,FR))
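
One possible way to account for public holidays is to filter the generated working days against a set of holiday dates. This is a sketch only: it uses explicit 2019 dates rather than epoch times for readability, and `example_holidays` is a hypothetical, deliberately incomplete list that would need to be replaced with the real holiday calendar for the area.

```python
from datetime import date, datetime
from dateutil.rrule import DAILY, rrule, MO, TU, WE, TH, FR

# Hypothetical, incomplete holiday list for illustration only.
example_holidays = {date(2019, 1, 1), date(2019, 4, 19)}

workdays = rrule(DAILY, dtstart=datetime(2019, 1, 1), until=datetime(2019, 12, 31),
                 byweekday=(MO, TU, WE, TH, FR))

# Keep only the working days that are not in the holiday set.
working_dates = [d for d in workdays if d.date() not in example_holidays]

print(len(list(workdays)))  # 261 weekdays in 2019
print(len(working_dates))   # 259 after removing the two example holidays
```

The filtered list can be indexed in the same way as the rrule object, so it could stand in for the date table when allocating appointments.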

### Facilities

As the number of facilities is far smaller than the number of patients, and we need to send more appointments to the major centre, we create a list (one entry per patient) in which each entry is drawn with 50-50 odds from either the major centre alone or all facilities (minor and major). In the all-facilities draw, each facility has a 1 in X chance of being selected, where X is the number of facilities defined above. In this example with 5 facilities, the expected split is 60-10-10-10-10.

In [13]:
possibleFacilities = []

In [14]:
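for i in range(0,numPatients):
    if random.randint(0,1):
        # "All facilities" branch: every facility, including the major centre, is equally likely
        possibleFacilities.append(facilities[random.randint(0,len(facilities)-1)])
    else:
        # "Major only" branch: always the first row
        possibleFacilities.append(facilities[0])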
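
The 60-10-10-10-10 split follows from the two-stage draw: the major centre gets 1/2 from the major-only branch plus 1/2 × 1/5 from the all-facilities branch, and each minor centre gets 1/2 × 1/5. The sketch below checks that arithmetic exactly and shows an equivalent one-step weighted draw; the function name `facility_probabilities` is hypothetical and not part of the original notebook.

```python
import random
from fractions import Fraction

def facility_probabilities(num_facilities):
    """Exact selection probability per facility; index 0 is the major centre."""
    half = Fraction(1, 2)
    shared = half / num_facilities  # each facility's share of the "all facilities" branch
    return [half + shared] + [shared] * (num_facilities - 1)

probs = facility_probabilities(5)
print([str(p) for p in probs])  # ['3/5', '1/10', '1/10', '1/10', '1/10']

# Equivalent one-step draw of facility indices using a weighted choice.
rng = random.Random(0)
picks = rng.choices(range(5), weights=[float(p) for p in probs], k=10)
```

With a different number of facilities the split changes accordingly, e.g. 3 facilities would give 2/3, 1/6, 1/6.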

### Patient Appointments

Next we go one by one, creating patients by randomly selecting a postcode from our pool of relevant postcodes. Selecting a postcode does not remove it from the pool, i.e. multiple patients can share the same postcode.
We then identify the nearest GP since, under catchment rules, your GP is usually the closest one.
We allocate the GP appointment (referral) date by working out the average number of appointments per day and moving to the next working day each time that many patients have been allocated, i.e. if 40 is the average number of appointments per day, then the first 40 appointments are allocated to day 1 and the next 40 to day 2.
Finally, we calculate the date of the actual OSA clinic appointment by adding a random value between 7 and 42 days (6 weeks is the SLA).
The patient record is then added to the dataset, together with the corresponding row identifying the facility in which the appointment is held.
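
The date allocation can be worked through in isolation. This is a minimal sketch, assuming 10000 patients and 261 working days in the period; the helper names `workday_index` and `clinic_date` are hypothetical and only illustrate the arithmetic described above.

```python
import random
from datetime import date, timedelta

num_patients = 10000
num_workdays = 261  # assumed number of working days in the period

# Integer average of patients per day, plus a margin of 2 so the days never run out.
max_patients = int(num_patients / num_workdays) + 2

def workday_index(patient_number):
    """Index of the working day on which this patient's GP appointment falls."""
    return patient_number // max_patients

def clinic_date(gp_date, rng=random):
    """Clinic date: GP date plus a random delay of 7 to 42 days inclusive."""
    return gp_date + timedelta(days=rng.randint(7, 42))

print(max_patients)         # 40
print(workday_index(39))    # 0  (last patient on day 1)
print(workday_index(40))    # 1  (first patient on day 2)
print(workday_index(9999))  # 249 (well inside the 261 available days)
```

Because of the margin, the final working days of the period receive no referrals; only the first 250 days are used under these assumptions.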

In [15]:
dataArray = []

In [None]:
for i in range(0,numPatients):
    # Generate a random number to pick a postcode for this patient (i.e. postcode lottery)
    index = random.randint(0,len(postcodePool)-1)
    lat = postcodePool.iloc[index]['latitude']
    lon = postcodePool.iloc[index]['longitude']

    # Identify the closest GP - find the shortest distance between the patient's residence and the GPs in the pool
    closestGP = []
    for j in range(0,len(gpPool)):
        glat = gpPool.iloc[j]['latitude']
        glon = gpPool.iloc[j]['longitude']
        distance = geodesic([lat,lon], [glat,glon]).m
        
        currentGP = [distance, gpPool.iloc[j]['name'], gpPool.iloc[j]['postcode'], glat, glon]
        if closestGP == []:
            closestGP.append(currentGP)
        else:
            if closestGP[0][0] > distance:
                closestGP.pop(0)
                closestGP.append(currentGP)

    # Identify the appointment date - this is done by averaging the number of referrals per day.
    # We find the average by dividing the number of synthetic patients by the number of working
    # days. The original dataset shows some fluctuation in new referrals, but because most
    # patients are on follow-up, assuming an average is OK. Then calculate, based on the referral
    # date, the likely clinic date - between 7 days and 6 weeks later.
    maxPatients = int(numPatients/len(list(AppointDates)))+2 # to ensure there are no round down errors
    gpAppoint = AppointDates[int(i/maxPatients)].strftime('%d/%m/%Y')
    clinicAppoint = (AppointDates[int(i/maxPatients)] + timedelta(days=random.randint(7,42))).strftime('%d/%m/%Y')
    
    # Merge all values together and append to record.
    payload = [i,postcodePool.iloc[index]['postcode'],lat,lon, 
               gpAppoint,closestGP[0][1],closestGP[0][2],closestGP[0][3],closestGP[0][4], 
               clinicAppoint,possibleFacilities[i][0],possibleFacilities[i][1],possibleFacilities[i][2],possibleFacilities[i][3]]
    dataArray.append(payload)
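
The nearest-GP search in the loop above is a minimum over distances, which can be expressed compactly with `min` and a key function. The sketch below is not the notebook's method: it swaps geopy's geodesic distance for a haversine (spherical) approximation so that it runs with the standard library only, and the practice names and coordinates are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km (an approximation)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def nearest(lat, lon, candidates):
    """Return the (name, lat, lon) candidate closest to the given point.

    On ties, min keeps the first candidate, matching the strict '>' test in the loop above.
    """
    return min(candidates, key=lambda c: haversine_m(lat, lon, c[1], c[2]))

# Hypothetical practices for illustration only.
toy_gps = [
    ("Practice A", 54.042097, -2.798714),
    ("Practice B", 53.820777, -3.013842),
]
print(nearest(54.05, -2.80, toy_gps)[0])  # Practice A
```

For large pools, the per-patient scan over every GP dominates the run time; a spatial index would be one option to explore, but that is beyond what the notebook does.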

## Data Output

### Master Record

Write out the master appointment log.

In [231]:
patientAppointmentLog = pd.DataFrame(dataArray,columns=[
    'patientno','postcode','latitude','longitude',
    'gpdate','gpname','gppostcode','gplatitude','gplongitude',
    'clinicdate','clinicname','clinicpostcode','cliniclatitude','cliniclongitude'
])

In [232]:
patientAppointmentLog.to_csv('output/appointmentlog.csv')

### Demand Points

This file contains only the lat/lon and id columns, to simplify location-allocation work.

In [240]:
# Use a context manager so the file is closed even if a write fails
with open('output/demandfile.csv','w') as w:
    for record in dataArray:
        # Columns: latitude, longitude, patient number
        data = [str(record[2]),str(record[3]),str(record[0])]
        print(','.join(data),file=w)
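
An alternative is to write the demand file with the standard library's `csv` module, which handles quoting if a field ever contains a comma. This is a sketch that writes to an in-memory buffer rather than to `output/demandfile.csv`; the two rows are hypothetical and follow the same column order as `dataArray` (patient number, postcode, latitude, longitude, ...).

```python
import csv
import io

# Hypothetical records for illustration only.
rows = [
    [0, "AA1 1AA", 54.0, -3.0],
    [1, "BB2 2BB", 54.2, -2.9],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # Demand file column order: latitude, longitude, patient number
    writer.writerow([row[2], row[3], row[0]])

print(buffer.getvalue())
```

To write to disk instead, the buffer can be replaced with a file opened with `newline=''`, as the `csv` documentation recommends.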