<p style="margin-left:30%; margin-right:1%;">
<img align="middle" src="./images/flow.png" style="width: 49%; height: 49%;" />
</p>


<br>

This notebook focuses on the structuring, integration, and feature engineering aspects of the project.  It

* Processes, structures, and integrates geographic, demographic, and patient flow data.
* Optionally reads England's SARS-CoV-2/coronavirus disease 2019 (COVID-19) measures via the coronavirus.data.gov.uk API.
* Creates features.
* Creates a raw design matrix per National Health Service (NHS) Trust.

<br>

**A note about reading England's SARS-CoV-2/coronavirus disease 2019 (COVID-19) measures via the coronavirus.data.gov.uk API**

The coronavirus.data.gov.uk section is commented out because it reads a large volume of data from coronavirus.data.gov.uk, via its API.  Instead, the *infections* repository includes a downloaded data set

> https://github.com/premodelling/infections/tree/master/warehouse/virus

To read the data afresh, uncomment the code blocks within the coronavirus.data.gov.uk section.  Beware: the API might deny access after an unknown number of reads.
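
Because access may be denied after an unknown number of reads, one option is to retry a refused read after a pause that grows with each attempt. The sketch below is illustrative only: `read_with_backoff` and `flaky_read` are hypothetical names that belong neither to the *infections* repository nor to the coronavirus.data.gov.uk API, and the attempt count and delays are arbitrary assumptions.

```python
import time


def read_with_backoff(fetch, attempts=4, initial_delay=1.0, factor=2.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """
    Calls fetch() until it succeeds or the attempts run out.

    fetch: a hypothetical zero-argument callable that performs one API read and
           raises one of the retry_on exceptions when the read is refused.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts; let the caller see the error
            sleep(delay)
            delay *= factor  # wait longer before the next attempt


# Usage with a stand-in reader that is refused twice, then succeeds.  The pauses
# are recorded instead of slept so that the example runs instantly.
state = {'calls': 0}


def flaky_read():
    state['calls'] += 1
    if state['calls'] < 3:
        raise ConnectionError('access denied')
    return {'status': 200}


pauses = []
response = read_with_backoff(flaky_read, sleep=pauses.append)
```

In real use the default `sleep=time.sleep` would apply, and `fetch` would wrap a single area's read.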

<br>

# Preliminaries

In [1]:
!rm -f *.pdf

<br>

### Paths

In [2]:
import os
import pathlib
import sys

In [3]:
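if 'google.colab' not in str(get_ipython()):

    # The project root is the deepest directory named 'infections' in the working directory's path
    parts = pathlib.Path(os.getcwd()).parts
    limit = max(index for index, value in enumerate(parts) if value == 'infections')
    parent = os.path.join(*parts[:(limit + 1)])

    # Append the project's src directory to the module search path
    sys.path.append(os.path.join(parent, 'src'))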


In [4]:
parent

'J:\\library\\premodelling\\projects\\infections'
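
The cell above finds the project root by taking the last path component named `infections`. A minimal sketch of the same idea with `pathlib` alone follows; `project_root` is a hypothetical helper, not a function of this project, and the example path is made up.

```python
import pathlib


def project_root(start, name='infections'):
    """
    Returns the deepest ancestor of `start` (or `start` itself) whose final
    component is `name`; raises ValueError if there is no such directory.
    """
    path = pathlib.PurePath(start)
    for candidate in (path, *path.parents):
        if candidate.name == name:
            return candidate
    raise ValueError(f'No directory named {name!r} in {str(start)!r}')


# A made-up working directory, two levels below the project root
root = project_root('/home/user/infections/notebooks/drafts')
```

Unlike the `max(...)` expression in the cell, this version raises a descriptive error when the working directory is outside the project.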

<br>
<br>

### Libraries

In [5]:
%matplotlib inline

import datetime

import logging
import collections

import numpy as np
import pandas as pd

import time

<br>
<br>

### Custom

In [6]:
import src.preprocessing.interface

import src.virus.measures
import src.virus.agegroupcases
import src.virus.agegroupvaccinations

import config

<br>

Setting-up

In [7]:
configurations = config.Config()

<br>

The coronavirus.data.gov.uk API (application programming interface) data fields to extract per LTLA (lower tier local authority) geographic area, and per NHS Trust, of England:

In [8]:
fields_ltla = configurations.fields_ltla
fields_trusts = configurations.fields_trust

<br>

England's unique set of LTLA & NHS Trust codes.

In [9]:
districts = configurations.districts()
codes_ltla = districts.ltla.unique()

In [10]:
trusts = configurations.trusts()
codes_trusts = trusts.trust_code.unique()

<br>
<br>

### Logging

In [11]:
logging.basicConfig(level=logging.INFO,
                    format='\n\n%(message)s\n%(asctime)s.%(msecs)03d',
                    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)
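
The format string above prints the message first and the timestamp, with milliseconds, on the line below it; this is the layout of the log output throughout this notebook. A small sketch that applies the same format to a single hand-made record follows; `formatter`, `record`, and the message values are illustrative only.

```python
import logging

# The same format and date format as the basicConfig(...) call above
formatter = logging.Formatter(fmt='\n\n%(message)s\n%(asctime)s.%(msecs)03d',
                              datefmt='%Y-%m-%d %H:%M:%S')

# A hand-made record; in practice a call such as logger.info(...) creates the record
record = logging.LogRecord(name='example', level=logging.INFO, pathname='', lineno=0,
                           msg='%d LTLA areas queried.', args=(3,), exc_info=None)

# Two blank lines, the message, then the timestamp on its own line
text = formatter.format(record)
print(text)
```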

<br>
<br>

# Part I

## Integration, Feature Engineering


### The Supplementary Data Files

In [12]:
times = src.preprocessing.interface.Interface().exc()



preprocessing ...
2022-03-16 17:06:22.762



districts: 
 ['2020: succeeded', '2019: succeeded', '2018: succeeded', '2017: succeeded', '2016: succeeded', '2015: succeeded']

patients: 
 ['2019: succeeded', '2018: succeeded', '2017: succeeded', '2016: succeeded', '2015: succeeded', '2014: succeeded', '2013: succeeded', '2012: succeeded', '2011: succeeded']

populations MSOA: 
 ['2012: succeeded', '2013: succeeded', '2014: succeeded', '2015: succeeded', '2016: succeeded', '2017: succeeded', '2018: succeeded', '2019: succeeded', '2020: succeeded']

MSOA Populations Disaggregated by Sex & Age Group: 
 ['2012: succeeded', '2013: succeeded', '2014: succeeded', '2015: succeeded', '2016: succeeded', '2017: succeeded', '2018: succeeded', '2019: succeeded', '2020: succeeded']

LTLA Populations: 
 ['2012: succeeded', '2013: succeeded', '2014: succeeded', '2015: succeeded', '2016: succeeded', '2017: succeeded', '2018: succeeded', '2019: succeeded', '2020: succeeded']

LTLA Populations Disaggregated by Sex & Age Group: 
 ['2011: succeeded', 

<br>

Delete compute DAG diagrams

In [13]:
!rm -f *.pdf

<br>

Times

In [14]:
pd.DataFrame.from_records(data=times['programs'])

| | desc | program | seconds |
|---|---|---|---|
| 0 | districts | preprocessing.districts | 2.953169 |
| 1 | patients | preprocessing.patients | 310.482759 |
| 2 | MSOA populations | preprocessing.populationsmsoa | 239.203682 |
| 3 | MSOA populations: age group & sex brackets | preprocessing.agegroupsexmsoa | 3.782216 |
| 4 | LTLA populations | preprocessing.populationsltla | 3.187182 |
| 5 | LTLA populations: age group & sex brackets | preprocessing.agegroupsexltla | 2.185125 |
| 6 | 2011 demographic data | preprocessing.exceptions | 4.823276 |
| 7 | special MSOA demographics for vac | preprocessing.vaccinationgroupsmsoa | 9.573547 |
| 8 | special LTLA demographics for vac | preprocessing.vaccinationgroupsltla | 2.119121 |
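
The timings show where the preprocessing time goes. As a sketch, the records can be summarised with plain Python. The values below are copied from the output above; that `times['programs']` is a list of records of this shape is an assumption, suggested by the `from_records` call.

```python
# Records copied from the timings output above
programs = [
    {'desc': 'districts', 'seconds': 2.953169},
    {'desc': 'patients', 'seconds': 310.482759},
    {'desc': 'MSOA populations', 'seconds': 239.203682},
    {'desc': 'MSOA populations: age group & sex brackets', 'seconds': 3.782216},
    {'desc': 'LTLA populations', 'seconds': 3.187182},
    {'desc': 'LTLA populations: age group & sex brackets', 'seconds': 2.185125},
    {'desc': '2011 demographic data', 'seconds': 4.823276},
    {'desc': 'special MSOA demographics for vac', 'seconds': 9.573547},
    {'desc': 'special LTLA demographics for vac', 'seconds': 2.119121},
]

# Total run time, and the share taken by the two slowest programs
total = sum(record['seconds'] for record in programs)
slowest = sorted(programs, key=lambda record: record['seconds'], reverse=True)[:2]
share = sum(record['seconds'] for record in slowest) / total

print(f'total: {total:.1f} seconds')
print(f'{[record["desc"] for record in slowest]} share: {share:.1%}')
```

The patients and MSOA populations programs dominate, so any effort to shorten this step would start with them.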


<br>
<br>

### coronavirus.data.gov.uk

England's SARS-CoV-2 infections and coronavirus disease 2019 (COVID-19) measures are available via the United Kingdom's coronavirus.data.gov.uk API.  Four data sets are of interest; they are read via the four steps that follow.  Alternatively, instead of the four steps, you may run

> %%bash
>
> `python src/virus/interface.py`

<br>

**Lower Tier Local Authority Level Measures**

> ```python
measures = src.virus.measures.Measures(fields=fields_ltla, path=os.path.join('ltla', 'measures')) \
    .exc(area_codes=codes_ltla, area_type='ltla')
logger.info('%d LTLA areas queried.', len(measures))
time.sleep(60)
```

<br>

**Trust level measures**

> ```python
measures = src.virus.measures.Measures(fields=fields_trusts, path=os.path.join('trusts', 'measures')) \
    .exc(area_codes=codes_trusts, area_type='nhsTrust')
logger.info('%d NHS Trusts queried.', len(measures))
time.sleep(60)
```

<br>

**LTLA Level measures: Cases disaggregated by Age Group**

> ```python
measures = src.virus.agegroupcases.AgeGroupCases().exc(area_codes=codes_ltla, area_type='ltla')
logger.info('%d LTLA areas queried.', len(measures))
time.sleep(60)
```

<br>

**LTLA Level measures: Vaccinations disaggregated by Age Group** 

A few areas return no data, even though the response status of their requests is 200; these areas are excluded.

> ```python
area_codes = list(set(codes_ltla) - {'E06000053', 'E09000001', 'E06000060'})
measures = src.virus.agegroupvaccinations.AgeGroupVaccinations().exc(area_codes=area_codes, area_type='ltla')
logger.info('%d LTLA areas queried.', len(measures))
```

<br>
<br>

### Weights

Determining the multi-granularity patient flow weights, from LTLA $\longrightarrow$ NHS Trust, via the MSOA $\longrightarrow$ NHS Trust patient numbers.

In [15]:
%%bash

python src/catchments/interface.py



2011
2022-03-16 17:16:03.347


2011: weights calculated for approx. 139 trusts
2022-03-16 17:16:18.710


2012
2022-03-16 17:16:18.711


2012: weights calculated for approx. 140 trusts
2022-03-16 17:16:34.802


2013
2022-03-16 17:16:34.802


2013: weights calculated for approx. 140 trusts
2022-03-16 17:16:50.659


2014
2022-03-16 17:16:50.660


2014: weights calculated for approx. 140 trusts
2022-03-16 17:17:06.765


2015
2022-03-16 17:17:06.765


2015: weights calculated for approx. 140 trusts
2022-03-16 17:17:23.361


2016
2022-03-16 17:17:23.361


2016: weights calculated for approx. 140 trusts
2022-03-16 17:17:39.943


2017
2022-03-16 17:17:39.943


2017: weights calculated for approx. 140 trusts
2022-03-16 17:17:56.586


2018
2022-03-16 17:17:56.587


2018: weights calculated for approx. 140 trusts
2022-03-16 17:18:14.064


2019
2022-03-16 17:18:14.064


2019: weights calculated for approx. 140 trusts
2022-03-16 17:18:31.446


For approx. 140 trusts, a file has been created per t
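
To illustrate the idea of deriving LTLA $\longrightarrow$ NHS Trust weights from MSOA $\longrightarrow$ NHS Trust patient numbers, here is a minimal sketch. It rests on an assumption that is not confirmed by this notebook: that a weight is the fraction of an LTLA's patients, summed over the MSOAs within the LTLA, that flows to a trust. The actual calculation is in `src/catchments/interface.py` and may differ. All codes and counts below are made up, and `ltla_to_trust_weights` is a hypothetical name.

```python
import collections

# Made-up MSOA -> NHS Trust patient counts; each MSOA lies within a single LTLA
flows = [
    {'msoa': 'M1', 'ltla': 'L1', 'trust': 'T1', 'patients': 30},
    {'msoa': 'M1', 'ltla': 'L1', 'trust': 'T2', 'patients': 10},
    {'msoa': 'M2', 'ltla': 'L1', 'trust': 'T1', 'patients': 50},
    {'msoa': 'M2', 'ltla': 'L1', 'trust': 'T2', 'patients': 10},
    {'msoa': 'M3', 'ltla': 'L2', 'trust': 'T2', 'patients': 40},
]


def ltla_to_trust_weights(flows):
    """
    Assumed definition: weight(ltla, trust) is the number of patients from the
    LTLA's MSOAs that went to the trust, divided by the LTLA's total patients.
    """
    pair_totals = collections.defaultdict(int)
    ltla_totals = collections.defaultdict(int)
    for row in flows:
        pair_totals[(row['ltla'], row['trust'])] += row['patients']
        ltla_totals[row['ltla']] += row['patients']
    return {(ltla, trust): count / ltla_totals[ltla]
            for (ltla, trust), count in pair_totals.items()}


weights = ltla_to_trust_weights(flows)
```

Under this assumed definition, the weights of each LTLA sum to 1 across trusts.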

<br>

Delete compute DAG diagrams

In [16]:
!rm -f *.pdf

<br>
<br>

### Vaccination Specific Weights

Determining the vaccination-specific multi-granularity patient flow weights.  These differ from the weights above because the vaccination age groups/brackets differ from the standard 5-year groups/brackets.

In [17]:
%%bash

python src/vaccinations/interface.py



vaccinations
2022-03-16 17:21:35.728


2012
2022-03-16 17:21:35.728


2012: weights calculated for approx. 140 trusts
2022-03-16 17:21:52.503


2013
2022-03-16 17:21:52.503


2013: weights calculated for approx. 140 trusts
2022-03-16 17:22:08.744


2014
2022-03-16 17:22:08.745


2014: weights calculated for approx. 140 trusts
2022-03-16 17:22:25.334


2015
2022-03-16 17:22:25.335


2015: weights calculated for approx. 140 trusts
2022-03-16 17:22:41.689


2016
2022-03-16 17:22:41.689


2016: weights calculated for approx. 140 trusts
2022-03-16 17:22:58.482


2017
2022-03-16 17:22:58.483


2017: weights calculated for approx. 140 trusts
2022-03-16 17:23:14.801


2018
2022-03-16 17:23:14.802


2018: weights calculated for approx. 140 trusts
2022-03-16 17:23:32.268


2019
2022-03-16 17:23:32.268


2019: weights calculated for approx. 140 trusts
2022-03-16 17:23:50.591


For approx. 140 trusts, a file has been created per trust - it contains the weights data of a trust, for all years.
202

<br>
<br>

### Design Matrix & Outcome Variables


Estimating the coronavirus measures of each NHS Trust by transforming LTLA measures into weighted NHS Trust components, via the calculated multi-granularity patient flow weights.  Subsequently, per NHS Trust, a tensor is constructed; it consists of the raw matrix of independent variable vectors and the outcome vector.

In [18]:
%%bash

python src/design/interface.py



(['RTD succeeded', 'RVV succeeded', 'RWA succeeded', 'RJN succeeded', 'RKB succeeded', 'RWG succeeded', 'RQW succeeded', 'RWY succeeded', 'R1K succeeded', 'RAX succeeded', 'RRV succeeded', 'RBN succeeded', 'RCB succeeded', 'RHQ succeeded', 'RXL succeeded', 'RM1 succeeded', 'RP6 succeeded', 'RWF succeeded', 'RTP succeeded', 'RXQ succeeded', 'RTX succeeded', 'RCF succeeded', 'RK9 succeeded', 'RVR succeeded', 'RYR succeeded', 'RWD succeeded', 'RHU succeeded', 'RRJ succeeded', 'REM succeeded', 'RNQ succeeded', 'RHM succeeded', 'RK5 succeeded', 'RGR succeeded', 'RGT succeeded', 'RA4 succeeded', 'RBS succeeded', 'RCX succeeded', 'RTK succeeded', 'RCU succeeded', 'R1F succeeded', 'RN3 succeeded', 'RD1 succeeded', 'RBQ succeeded', 'RP5 succeeded', 'RJL succeeded', 'RLT succeeded', 'RWE succeeded', 'RJ2 succeeded', 'RXN succeeded', 'RJ1 succeeded', 'RM3 succeeded', 'R0B succeeded', 'RL4 succeeded', 'RVW succeeded', 'RGP succeeded', 'RAL succeeded', 'RLQ succeeded', 'RTF succeeded', 'RWW succe
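
The transformation of LTLA measures into NHS Trust estimates can be sketched as a weighted sum. This is an assumed form, not necessarily what `src/design/interface.py` does: a trust's estimate of a measure is taken to be the sum, over LTLAs, of weight times the LTLA's measure. The weights, codes, and case counts are made up, and `trust_estimates` is a hypothetical name.

```python
import collections

# Made-up LTLA -> NHS Trust weights, and a made-up daily measure per LTLA
weights = {('L1', 'T1'): 0.8, ('L1', 'T2'): 0.2, ('L2', 'T2'): 1.0}
daily_cases = {'L1': 100, 'L2': 50}


def trust_estimates(weights, measures):
    """
    Assumed form: estimate(trust) = sum over LTLAs of weight(ltla, trust) * measure(ltla).
    """
    estimates = collections.defaultdict(float)
    for (ltla, trust), weight in weights.items():
        estimates[trust] += weight * measures[ltla]
    return dict(estimates)


estimates = trust_estimates(weights, daily_cases)
```

Repeating this for each measure and date would give one row of estimated values per date, per trust.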

<br>
<br>

## Delete DAG Diagrams

In [19]:
%%bash

rm -rf *.pdf