This notebook focuses on the modelling aspect of the project.  It depends predominantly on the modules of [src/modelling](./src/modelling), and on the raw design matrix data in [warehouse/design/raw](./warehouse/design/raw). &nbsp;&nbsp; The developed models depend on TensorFlow, and their programs

* [Convolutional Neural Network](./src/modelling/EstimatesCNN.py)
* [Long short-term memory (LSTM)](./src/modelling/EstimatesLSTM.py)
* [Gated Recurrent Unit (GRU)](./src/modelling/EstimatesGRU.py)

are GPU (graphics processing unit) compatible. &nbsp;&nbsp; To run the programs on a GPU-enabled machine, comment out the line

> os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

within each of the three model files. &nbsp;&nbsp; This line hides all CUDA devices, and therefore forces a CPU (central processing unit) based run; without it, the programs use an available GPU by default.
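
The toggle described above can be sketched as follows. The variable has to be set before TensorFlow initialises its devices, which in practice means before the first `import tensorflow`. The helper name `set_cpu_only` is hypothetical; it is not part of the project's model files.

```python
import os


def set_cpu_only(cpu_only: bool) -> None:
    """Hypothetical helper: hide (True) or expose (False) the CUDA devices."""
    if cpu_only:
        # "-1" matches no device, hence TensorFlow falls back to the CPU
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    else:
        # Removing the variable leaves every installed GPU visible
        os.environ.pop("CUDA_VISIBLE_DEVICES", None)


# Call this before the first `import tensorflow`
set_cpu_only(cpu_only=True)
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that removing the variable also discards any device selection made outside the program, so this sketch suits a dedicated machine rather than a shared one.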

<br>

The [project report](https://github.com/premodelling/trusts/blob/master/trusts.pdf) includes a summary of the error metrics of the developed models.  Additionally, an [online graphical summary](https://public.tableau.com/app/profile/greyhypotheses/viz/SCC460Project/SCC460) of the error metrics, from a few runs of the developed models, is available.  In general, the error metrics of model runs are stored in [warehouse/modelling/evaluations/](./warehouse/modelling/evaluations).

<br>


## Preliminaries

### Paths

In [1]:
import os
import pathlib
import sys

In [2]:
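if 'google.colab' not in str(get_ipython()):

    # The project root is the deepest directory named 'infections' in the working directory's path
    parts = pathlib.Path(os.getcwd()).parts
    limit = max(index for index, value in enumerate(parts) if value == 'infections')
    parent = os.path.join(*parts[:(limit + 1)])

    # Add the project's src directory to the module search path
    sys.path.append(os.path.join(parent, 'src'))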


In [3]:
parent

'J:\\library\\premodelling\\projects\\infections'
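
The root search above can also be written as a small function that fails with a clear message when the working directory is not inside the project. This is a sketch only: the function name `project_root` and the example path are invented for illustration.

```python
import pathlib


def project_root(path: pathlib.PurePath, name: str = 'infections') -> pathlib.PurePath:
    """Return the deepest ancestor of `path` (or `path` itself) named `name`."""
    indices = [index for index, value in enumerate(path.parts) if value == name]
    if not indices:
        raise ValueError(f"No directory named '{name}' in {path}")
    # Keep every part up to, and including, the last match
    return type(path)(*path.parts[:indices[-1] + 1])


# An illustrative path; the directory layout is invented for this example
example = pathlib.PurePosixPath('/home/user/infections/notebooks')
print(project_root(example))
```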

<br>
<br>

### Libraries

In [4]:
%matplotlib inline

import datetime

import logging
import collections

import numpy as np
import pandas as pd


<br>

### Custom

In [5]:
import src.modelling.DataStreams
import src.modelling.DataReconstructions
import src.modelling.Differences
import src.modelling.DataNormalisation
import src.modelling.Estimates

<br>
<br>

### Logging

In [6]:
logging.basicConfig(level=logging.INFO,
                    format='\n\n%(message)s\n%(asctime)s.%(msecs)03d',
                    datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger(__name__)

<br>
<br>

## Part II

### Setting-Up

A named tuple class for the fractions of the data assigned to training, validating, and testing:

In [7]:
Fraction = collections.namedtuple(
    typename='Fraction',
    field_names=['training', 'validating', 'testing'])
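
How `DataStreams` applies these fractions is not shown in this notebook. As an illustration only, and assuming a simple split that preserves the order of the records, the fractions could be applied as follows; the function `split` is hypothetical.

```python
import collections
import math

Fraction = collections.namedtuple(
    typename='Fraction',
    field_names=['training', 'validating', 'testing'])


def split(records: list, fraction: Fraction) -> tuple:
    """Split `records`, in their existing order, by the given fractions."""
    if not math.isclose(sum(fraction), 1.0):
        raise ValueError('The fractions must sum to 1')
    first = round(len(records) * fraction.training)
    second = round(len(records) * (fraction.training + fraction.validating))
    return records[:first], records[first:second], records[second:]


# One hundred consecutive days, split 75% / 15% / 10%
days = list(range(100))
segments = split(days, Fraction._make((0.75, 0.15, 0.10)))
print([len(segment) for segment in segments])
```

Keeping the order intact matters for time series: the validating and testing segments then lie strictly after the training segment.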

<br>

**Modelling Arguments**

> * Predict `output_steps` days into the future, based on `input_width` days of history

Herein

* $input\_width \in widths$  $\qquad$  [$widths$ is a range of input window values (days)]
* $output\_steps = 15$ days
  
And

* $label\_width = output\_steps$


In [8]:
Arguments = collections.namedtuple(
    typename='Arguments',
    field_names=['input_width', 'label_width', 'shift', 'training_', 'validating_', 'testing_', 'label_columns'])

<br>

Vary the range of input widths as required; `range(18, 39)` covers input windows of 18 to 38 days.

In [9]:
widths = range(18, 39)
output_steps = 15
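
To make the window arithmetic concrete, the sketch below computes which time steps of a window serve as inputs and which as labels. It assumes the windowing convention of TensorFlow's time series forecasting tutorial, which the argument names `input_width`, `label_width`, and `shift` resemble; whether the modules in src/modelling follow that convention exactly is an assumption. The function `window_indices` is hypothetical.

```python
def window_indices(input_width: int, label_width: int, shift: int) -> tuple:
    """Return the window's total size, its input indices, and its label indices."""
    total = input_width + shift
    inputs = list(range(0, input_width))
    # The labels occupy the last `label_width` steps of the window
    labels = list(range(total - label_width, total))
    return total, inputs, labels


# The narrowest window of this notebook: 18 days of history, 15 days ahead
total, inputs, labels = window_indices(input_width=18, label_width=15, shift=15)
print(total, inputs[0], inputs[-1], labels[0], labels[-1])
```

Under this convention, and because `label_width` equals `shift` here, the labels begin immediately after the last input day and the two index sets never overlap.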

<br>
<br>

### Training, Validating, Testing Data

First, the data sets for training, validating, and testing; the splitting fractions are 0.75, 0.15, and 0.10, respectively.

In [10]:
training, validating, testing = src.modelling.DataStreams.DataStreams(
    root=parent, fraction=Fraction._make((0.75, 0.15, 0.10))).exc()

In [11]:
logger.info(training.columns)



Index(['group', 'date', 'covidOccupiedBeds', 'covidOccupiedMVBeds',
       'estimatedNewAdmissions', 'EDC0-4', 'EDC5-9', 'EDC10-14', 'EDC15-19',
       'EDC20-24', 'EDC25-29', 'EDC30-34', 'EDC35-39', 'EDC40-44', 'EDC45-49',
       'EDC50-54', 'EDC55-59', 'EDC60-64', 'EDC65-69', 'EDC70-74', 'EDC75-79',
       'EDC80-84', 'EDC85-89', 'EDC90+', 'newDeaths28DaysByDeathDate',
       'EDV12-15', 'EDV16-17', 'EDV18-24', 'EDV25-29', 'EDV30-34', 'EDV35-39',
       'EDV40-44', 'EDV45-49', 'EDV50-54', 'EDV55-59', 'EDV60-64', 'EDV65-69',
       'EDV70-74', 'EDV75-79', 'EDV80-84', 'EDV85-89', 'EDV90+'],
      dtype='object')
2022-03-30 15:48:53.101


<br>
<br>

### Reconstruction

Reconstructions: Each data set is a concatenation of records from various NHS Trusts. However, because the aim is a single forecasting model for all trusts, the data should be reconstructed ...

In [12]:
reconstructions = src.modelling.DataReconstructions.DataReconstructions()
training = reconstructions.exc(blob=training)
validating = reconstructions.exc(blob=validating)
testing = reconstructions.exc(blob=testing)

<br>
<br>

### Differences

The differences between values, rather than the actual values, are used for modelling.

In [13]:
differences = src.modelling.Differences.Differences()
training = differences.exc(blob=training)
validating = differences.exc(blob=validating)
testing = differences.exc(blob=testing)
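
The internals of `Differences` are not shown here. As a minimal illustration, and assuming first-order differences between consecutive observations of a single series, differencing and its inverse can be sketched as below; both function names are hypothetical.

```python
from itertools import accumulate


def difference(values: list) -> list:
    """First-order differences: each value minus its predecessor."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def undifference(start, deltas: list) -> list:
    """Rebuild the original series from its first value and its differences."""
    return list(accumulate(deltas, initial=start))


series = [10, 12, 15, 11]
deltas = difference(series)
print(deltas)
print(undifference(series[0], deltas))
```

The inverse is what turns a forecast of differences back into a forecast of actual values; it needs the last observed actual value as its starting point.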

<br>
<br>

### Normalisation

In [14]:
normalisation = src.modelling.DataNormalisation.DataNormalisation(reference=training)
training_ = normalisation.normalise(blob=training)
validating_ = normalisation.normalise(blob=validating)
testing_ = normalisation.normalise(blob=testing)

training_.drop(columns='point', inplace=True)
validating_.drop(columns='point', inplace=True)
testing_.drop(columns='point', inplace=True)
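
The cell above passes the training data as the `reference` and then normalises all three data sets with the same object, so the validating and testing data do not influence the scaling. The scaling method itself is not shown; the sketch below assumes standardisation by the reference mean and standard deviation, and the class `Normaliser` is hypothetical.

```python
import statistics


class Normaliser:
    """Hypothetical sketch: standardise values with statistics of a reference set."""

    def __init__(self, reference: list):
        self.mean = statistics.fmean(reference)
        self.deviation = statistics.pstdev(reference)
        if self.deviation == 0:
            raise ValueError('The reference values must not be constant')

    def normalise(self, values: list) -> list:
        return [(value - self.mean) / self.deviation for value in values]


# The statistics come from the training values only
normaliser = Normaliser(reference=[2, 4, 4, 4, 5, 5, 7, 9])
print(normaliser.normalise([5, 7, 9]))
```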

<br>
<br>

### Modelling

The resulting error metrics are stored in [warehouse/modelling/evaluations/](./warehouse/modelling/evaluations).


In [None]:
arguments = Arguments(input_width=None, label_width=output_steps, shift=output_steps,
                      training_=training_, validating_=validating_, testing_=testing_,
                      label_columns=['estimatedNewAdmissions'])

validations, tests = src.modelling.Estimates.Estimates(
    n_features=training_.shape[1], 
    output_steps=output_steps).exc(widths=widths, arguments=arguments)

logger.info(validations)
logger.info(tests)
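
How `Estimates.exc` consumes `widths` is not shown in this notebook. One plausible reading, offered only as an assumption, is that the `input_width=None` placeholder is filled once per candidate width. With a named tuple this can be done via `_replace`, which returns a new tuple and leaves the template unchanged:

```python
import collections

Arguments = collections.namedtuple(
    typename='Arguments',
    field_names=['input_width', 'label_width', 'shift', 'training_', 'validating_', 'testing_', 'label_columns'])

# Placeholders (None) stand in for the three data frames of the notebook
template = Arguments(input_width=None, label_width=15, shift=15,
                     training_=None, validating_=None, testing_=None,
                     label_columns=['estimatedNewAdmissions'])

# One set of arguments per candidate input width
settings = [template._replace(input_width=width) for width in range(18, 39)]
print(len(settings), settings[0].input_width, settings[-1].input_width)
```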

<br>
<br>

### Delete DAG Diagrams

In [16]:
%%bash

rm -rf *.pdf