PRM GP2GP Data Sandbox

This repository contains focused explorations of data associated with GP2GP utilisation.

Each exploration is contained in a Jupyter notebook under the notebooks directory.

Notebooks typically read their data from S3. Some data sources, such as transfers data or ODS metadata, are ready to use as-is and can be read directly from the relevant S3 bucket. Datasets that require manual preparation (e.g. raw SPINE logs) should be stored in the notebooks data bucket, in a directory named after the notebook the data supports. Name this directory in the following format:

[bucket-name]/[story-number]-[description]

For example:

prm-gp2gp-notebook-data-prod/PRMT-1234-attachments-metadata-aug2021/
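
The naming format above can be sketched as a small helper. This is an illustration only: the function name notebook_data_prefix is hypothetical, and lowercasing and hyphenating the description is an assumption made here, not a rule stated by this repository.

```python
import re


def notebook_data_prefix(bucket: str, story_number: str, description: str) -> str:
    """Build '[bucket-name]/[story-number]-[description]/' for a notebook's data.

    Hypothetical helper: the slug normalisation below is an assumption.
    """
    # Lowercase the description and collapse every run of characters that
    # is not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    if not slug:
        raise ValueError("description must contain at least one letter or digit")
    return f"{bucket}/{story_number}-{slug}/"


print(
    notebook_data_prefix(
        "prm-gp2gp-notebook-data-prod",
        "PRMT-1234",
        "attachments metadata aug2021",
    )
)
```

With the inputs shown, the helper produces the same directory as the example above.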

The data directory contains:

Datasets used in early notebooks (typically exported via queries from NMS)
Small data lookups (e.g. GP2GP error codes)
Helper functions to load data that requires pre-processing to work in Pandas (e.g. ODS metadata)

Local Setup

If you need to run the notebooks locally, follow these steps:

Create a virtual environment

From the base directory of the project, create a Python 3 virtual environment and activate it:

python3 -m venv venv
source venv/bin/activate

The shell prompt should now show that the virtual environment has been activated.
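
If the prompt is ambiguous, activation can also be checked from Python itself. The sketch below is not part of this repository, and the function name in_virtualenv is hypothetical. It relies on the standard behaviour of environments created with python3 -m venv, where sys.prefix points at the environment and sys.base_prefix at the interpreter it was created from.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix is the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

Run this with the python executable on the current PATH; it reports False when the interpreter is the system Python rather than the one in venv.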

Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Configure notebook-friendly git diffs

Configure nbdime for viewing diffs of Jupyter notebooks:

nbdime config-git --enable

If git diff fails with an error like "fatal: external diff died, stopping at <notebookpath>.ipynb", check that the virtual environment is activated. Git calls nbdime as its diff driver for notebooks, so nbdime must be available on the PATH; activating the environment should fix the issue.

To deactivate the virtual environment:

deactivate

Starting the Jupyter server

To start a local Jupyter server and access the notebooks:

Activate the virtual environment:

source venv/bin/activate

Start the Jupyter server:

jupyter notebook
