
openFDA AWS data pipeline

Enabling advanced analytics of openFDA data in AWS

This data pipeline leverages the power of AWS to:

  • Automate the extraction of bulk data from the openFDA website (sketched below)
  • Load the raw data into an S3 data lake
  • Transform the data, with custom options for field filtering and NLP model enrichment, into a curated (clean) S3 bucket
  • Load the data into Elasticsearch for advanced full-text search and visualizations
  • Enable other analytics on the data using AWS Glue, EMR, or other AWS analytics services
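As a rough illustration of the extract step, the sketch below pulls the openFDA bulk download index and copies one file into a raw S3 bucket. This is a minimal sketch under assumptions, not the repository's actual code: the bucket name and dataset choice are placeholders, and the partition layout follows openFDA's published download.json format.

# Minimal sketch of the extract step (assumed names; not the pipeline's actual code).
import json
import urllib.request

import boto3

RAW_BUCKET = "my-raw-data-bucket"  # assumption: your raw data S3 bucket

# openFDA publishes an index of every bulk download file at download.json.
with urllib.request.urlopen("https://api.fda.gov/download.json") as resp:
    index = json.load(resp)

# Take the first drug adverse event partition as an example.
partition_url = index["results"]["drug"]["event"]["partitions"][0]["file"]

s3 = boto3.client("s3")
with urllib.request.urlopen(partition_url) as data:
    # Stream the zipped JSON file into the raw bucket under its original file name.
    s3.upload_fileobj(data, RAW_BUCKET, partition_url.rsplit("/", 1)[-1])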

Architecture diagram

An architecture overview is shown in the architecture diagram image included in the repository.

Getting started

The plan is to automate the entire build using a CloudFormation or serverless template; for now, the setup is somewhat manual. See Building for instructions.

Initial Configuration

This project uses Python 3.8. You can test the functions locally using the AWS CLI (https://docs.aws.amazon.com/cli/).

Developing

Clone the repository:

git clone https://github.com/prescode/open-fda-data-pipeline.git

Navigate to the function you want to change:

cd transform

Create a new virtual Python environment:

python3 -m venv .venv

Activate your virtual environment:

source .venv/bin/activate

Use pip to install the function's current dependencies into the virtual environment:

pip install -r ./requirements.txt

Start the Python shell:

python

Then test your code changes by entering function definitions, variable assignments, and other code into the Python shell. Test events that simulate S3 put events can be created from the included JSON files:

import json
from transform import lambda_handler  # assumes the function file in this folder is transform.py
test_event = json.load(open('test_event.json'))
lambda_handler(test_event, '')
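For reference, an S3 put test event has roughly the following shape, trimmed to the fields most handlers read (the bucket and key values here are placeholders):

# Trimmed S3 put event structure (placeholder bucket and key values).
test_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-raw-data-bucket"},
                "object": {"key": "drug-event-0001-of-0035.json.zip"}
            }
        }
    ]
}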

Add new dependencies to the requirements file (after installing them into your virtual environment using pip install):

pip freeze > requirements.txt

After making (and testing) your code changes, exit the Python shell and deactivate the virtual environment:

deactivate

Remove your virtual environment (a build script will be used to create a new one for deployment):

rm -r .venv

Building

Each function is built separately. A setup.sh file is included in each folder.

./setup.sh

The shell script creates a virtual Python environment, installs all necessary dependencies, then packages the dependencies together with the function's Python file into a package.zip that is ready to deploy to AWS Lambda. The virtual environment and setup directory are cleaned up once the process is complete.
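For readers who prefer to see those steps spelled out, here is a rough Python equivalent of what the script does; the folder and file names passed in at the bottom are assumptions:

# Rough Python equivalent of setup.sh's packaging steps (paths and names are assumptions).
import shutil
import subprocess
import sys
import zipfile
from pathlib import Path

def build_package(func_dir: str, func_file: str) -> None:
    """Install dependencies and zip them with the function file for AWS Lambda."""
    build_dir = Path(func_dir) / "build"
    subprocess.run(
        [sys.executable, "-m", "pip", "install",
         "-r", str(Path(func_dir) / "requirements.txt"), "--target", str(build_dir)],
        check=True,
    )
    shutil.copy(Path(func_dir) / func_file, build_dir / func_file)
    with zipfile.ZipFile(Path(func_dir) / "package.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in build_dir.rglob("*"):
            zf.write(path, path.relative_to(build_dir))
    shutil.rmtree(build_dir)  # clean up, mirroring the script's cleanup step

build_package("transform", "transform.py")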

Deploying / Publishing

See AWS Setup for more details.

  1. Create S3 buckets for:
    1. urls
    2. raw data
    3. clean data
  2. Create an Elasticsearch domain
  3. For each function:
    1. Create the Lambda function via the CLI or console (a boto3 sketch follows this list)
    2. Create a Lambda IAM role with permission to write to the S3 bucket (and to Elasticsearch for the "load" function)
    3. Upload package.zip from the build folder
    4. Update the handler name to match the name of the Python file (e.g. load.lambda_handler)
    5. Set the environment variables referenced in the Python file
    6. Update basic settings: memory and timeout
    7. Add a trigger as an S3 put from the configured S3 bucket (except for url-search)
  4. Create an empty test event in the url-search Lambda and trigger it manually using the "Test" button
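As an illustration of creating a function from its built package with boto3, a sketch along these lines could work; the function name, role ARN, timeout, memory size, and environment variable values are placeholders, not values from this repository:

# Minimal sketch: create one pipeline Lambda from its package.zip (placeholder values).
import boto3

lambda_client = boto3.client("lambda")

with open("load/package.zip", "rb") as f:
    package = f.read()

lambda_client.create_function(
    FunctionName="open-fda-load",     # placeholder name
    Runtime="python3.8",              # matches the project's Python version
    Role="arn:aws:iam::123456789012:role/open-fda-lambda-role",  # placeholder role ARN
    Handler="load.lambda_handler",    # handler matches the Python file name
    Code={"ZipFile": package},
    Timeout=300,                      # placeholder basic settings
    MemorySize=512,
    Environment={"Variables": {"ES_ENDPOINT": "https://example.es.amazonaws.com"}},  # placeholder
)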

Features

  • Clean and transform FDA dataset specifically for your use case
  • Use the power of AWS Analytics to gain insights on the FDA datasets
  • Join FDA data with other datasets for advanced analytics (see the Athena example below)
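For example, once the curated bucket is cataloged (for instance with an AWS Glue crawler), the data can be queried with Athena. The database, table, and result-bucket names in this sketch are placeholders, not names from this repository:

# Hypothetical Athena query against the curated data (database/table names assumed).
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="""
        SELECT receivedate, COUNT(*) AS reports
        FROM openfda_curated.drug_events  -- hypothetical Glue catalog table
        GROUP BY receivedate
        ORDER BY reports DESC
        LIMIT 10
    """,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)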

Configuration

Filtered data fields can be configured in the list of filtered fields variable inside the "transform" function, as sketched below (the Elasticsearch index mapping may also need to be updated to match).
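For illustration, the filtering logic might look something like this; the variable name FILTERED_FIELDS and the field names are assumptions, so check the actual transform function for the real names:

# Hypothetical field filter (variable and field names are assumptions).
FILTERED_FIELDS = ["safetyreportid", "receivedate", "patient"]

def filter_record(record: dict) -> dict:
    """Keep only the configured fields from a raw openFDA record."""
    return {field: record[field] for field in FILTERED_FIELDS if field in record}

clean = filter_record({"safetyreportid": "123", "receivedate": "20200101", "extra": 1})
# -> {'safetyreportid': '123', 'receivedate': '20200101'}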

Contributing

If you'd like to contribute, please fork the repository and use a feature branch. Pull requests are much appreciated.

Licensing

© Dean Johnson - The code in this project is licensed under the MIT license.
