govuk-one-login/data-analytics-platform

DI Data Analytics Platform

Data and Analytics platform which will enable the implementation of the OneLogin reporting strategy.

Prerequisites

Install development tools

The project uses the current (as of 03/05/2023) LTS of Node, version 18. The GDS recommendation is to use nvm to manage Node versions - installation instructions can be found here.

Core
  • AWS SAM CLI - for running SAM commands
  • Node - for lambda development and running npm commands
  • Docker - for running sam local
  • Checkov - for validating IaC code. Install on GDS Macs in the terminal by running pip3 install checkov
Optional
  • AWS CLI - for interacting with AWS on the command line
  • GitHub CLI - for interacting with GitHub on the command line. Can do some things not possible via the GUI, such as running workflows that have not been merged to main

Set up commit signing

Commits will be rejected by GitHub if they are not signed using an SSH or GPG key. SSH keys do not support expiration or revocation so GPG is preferred. Follow the instructions here to generate a key and set it up with GitHub. You may need to install gpg first - on a GDS Mac open the terminal and run brew install gpg.

Set up husky hooks

Husky is used to run githooks, specifically pre-commit and pre-push. To install the hooks, run npm run husky:install. After this, the hooks defined under the .husky directory will automatically run when you commit or push.* The lint-staged library is used to only run certain tasks when certain files are modified.

Config can be found in the lint-staged block in package.json. Note that lint-staged works by passing the list of matched staged files to the command defined, which is why the commands in package.json are e.g. prettier --write, with no file, directory or glob arguments (usually if you wanted to run prettier you would need such an argument, e.g. prettier --write . or prettier --check src). More information can be found here.
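For illustration, a lint-staged block of roughly this shape (the globs and commands here are hypothetical, not necessarily the repository's exact config) maps file patterns to the commands that receive the staged file list:

```json
{
  "lint-staged": {
    "*.ts": ["prettier --write", "eslint --fix"],
    "*.{yml,yaml,json}": ["prettier --write"]
  }
}
```

Because lint-staged appends the matched filenames itself, each command is listed without any path or glob argument.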

* Git LFS hooks also live in this directory - see section below

Set up Git LFS

If you intend to make changes to any of the large binary files in this repository (currently just *.tar.gz and *.jar) then you will need to install Git LFS. This is necessary as GitHub blocks files larger than 100 MiB.

If you do not install Git LFS you will only get the pointer files and not the actual data. This is not a problem unless you want to edit these files. See this section of the GitHub docs for more information

Git LFS also uses hooks, specifically post-checkout, post-commit, post-merge and pre-push. In the case of the latter, husky also uses this hook which is why the file at .husky/_/pre-push contains both husky and Git LFS code. Note that the Git LFS hooks are in the husky directory because husky was installed in the repository before Git LFS and so that directory structure was already in place. Manually editing the hooks was necessary due to the clash on pre-push, and this comment was the general direction taken.
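As a rough sketch (not the file's exact contents), the merged .husky/_/pre-push hook runs both tools in sequence, along these lines:

```sh
#!/bin/sh
# Illustrative sketch only - the real hook is generated by husky and
# then hand-edited to also run the Git LFS pre-push hook.

# Git LFS portion: refuse to push if git-lfs is missing
command -v git-lfs >/dev/null 2>&1 || {
  echo >&2 "git-lfs was not found on your path; large files will not be pushed correctly."
  exit 2
}
git lfs pre-push "$@"

# husky portion: run the project's pre-push tasks (task name hypothetical)
npm run test
```

If either portion exits non-zero, the push is aborted, which is the behaviour both tools rely on.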

Repository structure

Lambdas

The lambdas and supporting code are written in TypeScript and built with esbuild.

Individual lambda handlers (and unit tests) can be found in subdirectories of the src/handlers directory. Common and utility code can be found in the src/shared directory.

In addition, files to support running lambdas with sam local invoke are in the sam-local-examples directory.

IaC

IaC code is written in AWS SAM (a superset of CloudFormation templates) and deployed as SAM applications.

IaC code can be found in the iac directory. There are currently two applications, each with its own subdirectory (main and quicksight-access). In each there is a base file, base.yml, which contains everything except the Resources section. In the resources/ subdirectory, there are YAML files containing all the stack resources, grouped by functional area.

A package.json script, iac:build, concatenates all these files for a particular application into a single top-level template.yaml file that is expected by SAM and Secure Pipelines. The script requires an argument for which application you wish to build, e.g. npm run iac:build -- main. To build all applications at once (useful for linting and scanning), an additional npm script, iac:buildall, exists which puts the template files it builds into the (git ignored) iac-dist directory.
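Conceptually, iac:build is just a concatenation step. The stand-in sketch below (all paths, file contents and resource names are invented for illustration, not the real templates) shows the idea of stitching a base file and per-functional-area resource files into one template:

```shell
# Illustration only: build a single template.yaml from a base file plus
# per-functional-area resource fragments (all names here are made up;
# for simplicity the base here also carries the Resources: header).
mkdir -p /tmp/iac-demo/resources
printf 'AWSTemplateFormatVersion: "2010-09-09"\nResources:\n' > /tmp/iac-demo/base.yml
printf '  DemoBucket:\n    Type: AWS::S3::Bucket\n' > /tmp/iac-demo/resources/storage.yml
printf '  DemoQueue:\n    Type: AWS::SQS::Queue\n' > /tmp/iac-demo/resources/queues.yml
# Concatenate the base and every resource file into the final template
cat /tmp/iac-demo/base.yml /tmp/iac-demo/resources/*.yml > /tmp/iac-demo/template.yaml
```

The real script additionally takes the application name (main or quicksight-access) to select which subdirectory of iac to build.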

The AWS SAM config is at samconfig.toml.

Workflows

Workflows that enable GitHub Actions can be found in the .github/workflows directory. Below is a list of workflows. The ✳️ symbol at the start of a workflow name indicates that it can be run manually.

| Name | File | Purpose |
| ---- | ---- | ------- |
| Deploy to an AWS environment | deploy-to-aws.yml | Deploys to a deployable AWS environment (dev, build, test) |
| ✳️ Deploy to the test environment | deploy-to-test.yml | Deploys IaC and lambda code to the test AWS |
| ✳️ Deploy to the dev environment | deploy-to-dev.yml | Deploys IaC and lambda code to the dev AWS |
| Deploy to the build environment | deploy-to-build.yml | Deploys IaC and lambda code to the build AWS |
| ✳️ Test and validate iac and lambdas | test-and-validate.yml | Runs linting, formatting and testing of lambda code, and linting and scanning of IaC code |
| ✳️ Upload Athena files to S3 | upload-athena-files.yml | Uploads athena scripts for a particular environment (under athena-scripts) to S3 |
| ✳️ Pull request deploy and test | pull-request-deploy-and-test.yml | Deploys a pull request branch to the feature environment and runs integration tests when a pull request is opened, reopened or updated |
| ✳️ Pull request tear down | pull-request-tear-down.yml | Tears down the feature environment when a pull request is merged or otherwise closed |
| Upload testing image to ECR | upload-testing-image.yml | Builds a testing dockerfile in tests/scripts/ and uploads the image to ECR |
| ✳️ Upload testing images to ECR | upload-testing-images.yml | Builds one or more testing dockerfiles in tests/scripts/ and uploads the images to ECR. Which dockerfiles to build can be specified via inputs |
| SonarCloud Code Analysis | code-quality-sonarcloud.yml | Runs a SonarCloud analysis on the repository |
| ✳️ Run flyway command on redshift | run-flyway-command.yml | Runs a specified flyway command on the redshift database in a specified environment. For more on how to use this workflow see the README here |
| ✳️ Add Quicksight user | add-quicksight-user.yml | Provides an interface to add a user to Cognito and Quicksight by invoking the quicksight-add-users lambda |
| ✳️ Add Quicksight users from spreadsheet | add-quicksight-users.yml | Reads the DAP account management spreadsheet and attempts to add users to Cognito and Quicksight |
| ✳️ Deploy to the production preview environment | deploy-to-production-preview.yml | Deploys to the production-preview environment |
| SAM deploy | sam-deploy.yml | Performs a SAM deploy to an environment without secure pipelines (feature, production-preview) |
| ✳️ Upload Flyway files to S3 | upload-flyway-files.yml | Uploads flyway files for a particular environment (under redshift-scripts/flyway) to S3 |
| ✳️ Export analysis from Quicksight | quicksight-export.yml | Exports a Quicksight analysis to S3 using the asset bundle APIs |
| ✳️ Import analysis to Quicksight | quicksight-import.yml | Imports a Quicksight analysis from S3 using the asset bundle APIs |

Testing

Unit tests

Unit testing is done with Jest and the lambdas should all have associated unit tests (*.spec.ts).

  • npm run test - run all tests under src/
  • jest consumer - run a specific test or tests
    • anything after jest is used as a regex match - so in this example consumer causes jest to match all tests under the src/handlers/txma-event-consumer/ directory (and any other directory that might have consumer in its name)

Integration tests

TODO

Test reports

After running unit or integration tests, a test report called index.html will be available in the test-report directory. This behaviour is provided by jest-stare and configured in jest.config.js.
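The wiring for this typically goes through Jest's reporters option. A jest.config.js fragment along these lines would produce that behaviour (the options shown are illustrative; check the repository's actual jest.config.js):

```javascript
// Sketch of a jest.config.js using jest-stare as an additional reporter.
// resultDir tells jest-stare where to write index.html.
module.exports = {
  reporters: [
    'default',
    ['jest-stare', { resultDir: 'test-report' }],
  ],
};
```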

Linting, formatting and validation

Lambdas

Linting and formatting are handled by ESLint and Prettier (with an EditorConfig file) respectively. typescript-eslint is used to allow these tools to work with TypeScript.

  • npm run lint:check - run linting and formatting checks and print warnings
  • npm run lint:fix - run linting and formatting checks and (attempt to) automatically fix issues

IaC

AWS SAM can perform validation and linting of CloudFormation files, and checkov can find misconfigurations. Prettier is used to ensure consistent formatting of the YAML of the SAM templates.

  • npm run iac:lint - run validation and linting checks and print warnings
  • npm run iac:scan - run checkov scan and print warnings
  • npm run iac:format:check - run formatting checks and print warnings
  • npm run iac:format:fix - run formatting checks and automatically fix issues

Scripts

Prettier is used to ensure consistent formatting of the script files in the scripts/ directory. The ability to format shell scripts comes from the prettier-plugin-sh library.

  • npm run scripts:format:check - run formatting checks and print warnings
  • npm run scripts:format:fix - run formatting checks and automatically fix issues

Building and running

Lambdas

  • npm run build - build (transpile, bundle, etc.) lambdas into the dist directory

Lambdas can be run locally with sam local invoke. A few prerequisites:

  • Docker is running
  • The lambda you wish to run has been built into a .js file (npm run build)
  • The lambda you wish to run is defined in CloudFormation and has been built into the top-level template.yaml file (npm run iac:build)
    • You can use the CloudFormation resource name (e.g. AthenaGetConfigLambda or EventConsumerLambda) to refer to the lambda in the invoke command
  • SAM application has been built (sam build)
    • Order matters here - this command copies the lambda JS into .aws-sam/, so make sure npm run build has been run beforehand
  • You have defined a JSON file (ideally here) containing the event you wish to be the input event of the lambda (unless you don't need an input event)
  • You have added any environment variables you need the lambda to take to env.json

An example invocation might be

npm run build
npm run iac:build
sam build

# invoke with no input event or environment vars
sam local invoke EventConsumerLambda

# invoke specifying both an input event and environment variables
sam local invoke EventConsumerLambda --env-vars sam-local-examples/env.json --event sam-local-examples/txma-event-consumer/valid.json
A note on args
  • The --env-vars arg takes the path to a JSON file with any environment vars you want the lambda to have access to (via node process.env). Find these (and define more) in per-function objects within the main object in sam-local-examples/env.json
  • The --event arg takes the path to a JSON file with the input event you want the lambda to have. Find these (and define more) in per-function subdirectories under sam-local-examples/
  • A different template file path can be specified with the --template-file flag
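For illustration, env.json groups variables per function, keyed by the CloudFormation resource name (the variable names and values below are made up):

```json
{
  "EventConsumerLambda": {
    "SOME_SETTING": "some-value"
  },
  "AthenaGetConfigLambda": {
    "ANOTHER_SETTING": "another-value"
  }
}
```

Inside the lambda these values are then readable via process.env as usual.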

SAM local can also be used to generate events. An example invocation might be sam local generate-event sqs receive-message or sam local generate-event s3 put. You can run sam local generate with no args for a list of supported services.

IaC

AWS SAM can build the YAML template. Artifacts will be placed into .aws-sam/. If you wish the lambda code to be included, it must first have been built into a .js file (npm run build). An example invocation might be

sam build

which will build template.yaml and use the lambda code in dist/. A different template file path can be specified with the --template-file flag and a different lambda code directory by changing the CodeUri global property in template.yaml.

Deploying and environments

Deployment is done via Secure Pipelines*. Deployments go through the Secure Pipelines SAM deployment stack, and after the SAM deployment tests are run in Secure Pipelines testing containers.

The deployment of the platform is currently split into two applications, main and quicksight-access (each having its own subdirectory in iac). This split was needed to overcome a hard character limit we had hit on the programmatic permissions boundary (used by lambdas), caused by the number of AllowedServices in the SAM deployment stack. The solution was to create a second SAM deployment stack to hold some of the AllowedServices and split off part of the IaC into its own application deployed by that stack (the Cognito and Quicksight functionality, as it was the source of the most recently requested permissions that had put our permissions boundary over the limit).

From a Secure Pipelines point of view, environments can be split into two types: 'higher' and 'lower' environments. The lower environments are test, dev and build**. The higher environments are staging, integration and production. More information can be found using the Secure Pipelines link above, but the key differences are that the lower environments are the only ones that can be deployed to directly from GitHub, while deployment to the higher environments relies on 'promotion' from a lower environment, specifically the build environment. In addition, the higher environment lambdas are triggered by real TxMA event queues***, whereas lower environments use placeholder queues that we create and must put our own test events onto.

* With the exception of the feature environment - see section below

** Strictly speaking, test and dev do not form part of the Secure Pipelines build system which takes an application that is deployed to build all the way to production via the other higher environments. Our test and dev environments are disconnected sandboxes; however they still use Secure Pipelines to deploy directly from GitHub

*** An important exception is that dev is connected to the real TxMA staging queue. This is intended to be temporary since at time of writing we do not have the higher environments set up. Once our own staging account is ready, it will receive the real TxMA staging queue and dev will get a placeholder queue

Lower Environments

Test

Our test environment is a standalone environment and can therefore be used as a sandbox. A dedicated GitHub Action Deploy to the test environment exists to enable this. It can be manually invoked on a chosen branch by finding it in the GitHub Actions tab and using the Run workflow button.

Dev

Our dev environment is also a standalone environment and can therefore be used as a sandbox. A dedicated GitHub Action Deploy to the dev environment exists to enable this, allowing manual deploys like the one for test.

Additionally, the action will automatically run after a merge into the main branch after a Pull Request is approved.

Build

The build environment is the entry point to the Secure Pipelines world. It is sometimes referred to as the 'Initial Account' in Secure Pipelines, as it is the first account on the journey to Production, and it has unique needs (compared with the higher environments) such as the ability to deploy to it from GitHub.

A GitHub Action Deploy to the build environment exists to enable this. The action cannot be invoked manually like the one for dev, only by merging into the main branch after a Pull Request is approved.

Higher Environments

Higher environment config

Because they use real TxMA event queues (from external AWS accounts and not in our IaC code), deployment to higher environments* relies on the following AWS Systems Manager parameters being available in the target account:

| Name | Description |
| ---- | ----------- |
| TxMAEventQueueARN | ARN of the TxMA event queue which triggers the txma-event-consumer lambda |
| TxMAKMSKeyARN | ARN of the TxMA KMS key needed for the txma-event-consumer lambda |

* These parameters are also required in the dev account for the reasons mentioned above (dev currently having the real TxMA staging queue). They are additionally required in the production preview account as it also has a real TxMA queue.

You can see these values being referenced in the template files in the following way:

'{{resolve:ssm:TxMAEventQueueARN}}'
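For example, a queue-triggered function in the template might wire the parameter in roughly like this (the resource and event names here are illustrative, not the repository's exact template):

```yaml
EventConsumerLambda:
  Type: AWS::Serverless::Function
  Properties:
    Events:
      TxMAEvent:
        Type: SQS
        Properties:
          # Resolved from the SSM parameter at deploy time
          Queue: '{{resolve:ssm:TxMAEventQueueARN}}'
```

Using a dynamic reference means the external queue ARN never has to be hardcoded into the template.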

See the AWS Systems Manager documentation for how to create the parameters.

Parameter values can be found on this page - recall that our dev environment currently takes the values assigned to staging on that page.

Staging

The staging environment is the first higher environment and so cannot be directly deployed to. When a deployment pipeline is successful in the build environment, the artifact will be put in a promotion bucket in the build account, which is polled by staging. When staging picks up a new build it is deployed to that environment.

Integration and Production

The integration and production environments are the second (and final) level of higher environment. They behave like the staging environment in the sense that they cannot be deployed to directly but instead poll for promoted artifacts from a lower environment. The difference between them and staging is that the promotion bucket integration and production poll is the one in the staging account.

Other Environments

The following accounts are not in secure pipelines.

Feature

The feature environment is a standalone environment for the purpose of testing GitHub pull requests. It has a GitHub Action Pull request deploy and test which deploys there and then runs integration tests. This deployment is not done via Secure Pipelines but with manual sam deploy commands. Likewise, the tests are not run with the Secure Pipelines testing container approach, but are instead invoked manually with npm run. This action can be manually invoked, but will also automatically run when a pull request is opened, reopened or updated. Unlike other environments, feature has a second GitHub Action Pull request tear down which completely deletes the stacks. Like the first action it can be manually invoked, but will also automatically run when a pull request is merged or otherwise closed. Automatic running has been disabled until DAC-1862 is done.

To perform the deployments and tear downs we use a special role in the feature environment called dap-feature-tear-down-role. It is not in the IaC because it causes one of the following issues:

  • Without a DeletionPolicy, it gets deleted while the stack is being deleted and so the deletion fails part way through as there are then no longer the permissions to do the deletion
  • With an appropriate DeletionPolicy this doesn't happen, but instead the next stack creation fails because the resource already exists
Production Preview

The production preview environment is another standalone environment that exists outside of Secure Pipelines (like the feature environment it is deployed with a manual sam deploy). It has a GitHub Action Deploy to the production preview environment but no corresponding tear down one.

The deployments use a special role in the production preview environment, dap-production-preview-deploy-role, much like the role in feature.

Production preview has a real TxMA queue in addition to its placeholder queue and so requires the SSM parameters mentioned above in the Higher environment config section.

Config for cross account data sync

Because production preview and staging are used for cross account data sync, they have a single SSM parameter holding the name of the cross account data sync role. They use this to allow access to their SQS queues and usage of their KMS keys to enable the cross account data sync process.

| Name | Description |
| ---- | ----------- |
| CrossAccountDataSyncRoleARN | ARN of the role allowing cross account data sync |

Additional Documents

For a guide to how and why certain development decisions, coding practices, etc. were made, please refer to the Development Decisions document.

For a list of TODOs for the project, please see the TODOs document.