spark4achilles

Objective

Runs variations of the OHDSI ACHILLES analytics using Scala + Spark
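For a sense of what such an analysis looks like, the sketch below is a hypothetical, minimal example (not code from this repo) of one ACHILLES-style aggregate: counting persons by gender over the CDM person table. The object name, file path, and CSV options are assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal ACHILLES-style analysis: number of persons by gender.
object GenderCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("achilles-sketch"))
    val sqlContext = new SQLContext(sc)

    // Spark 1.6 reads CSV through the spark-csv package
    // (com.databricks:spark-csv_2.10:1.4.0 on the classpath).
    val person = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("person.csv")

    // Count persons per gender concept, as the ACHILLES
    // "persons by gender" analysis does over the CDM person table.
    person.groupBy("gender_concept_id").count().show()

    sc.stop()
  }
}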

Data

Data was generated using the ETL-CMS repo and then loaded into PostgreSQL using the CommonDataModel repo. High-level instructions are below.

Obtain Raw Data

To get the data:

  1. Clone the ETL-CMS tool from: https://github.com/OHDSI/ETL-CMS/tree/unm-improvements
  2. Change to the scripts directory
  3. Run the get_synpuf_files.py script to download the raw SynPUF data
  4. Change to the python_etl directory
  5. Follow the readme in that directory to set up your system
  6. Finally, run the CMS_SynPUF_ETL_CDM_v5.py script to convert the data to the OMOP CDM v5 format

This will produce valid .csv files ready for import and usage.
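Before moving on, the generated files can be sanity-checked from a spark-shell session started with --packages com.databricks:spark-csv_2.10:1.4.0 (Spark 1.6 needs this package to read CSV). The file name below is a placeholder; adjust the header option to match how the ETL writes its files.

val person = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("person.csv")

println(person.count()) // quick row-count check
person.printSchema()    // verify the expected CDM v5 columns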

Importing to PostgreSQL

To load the converted data into PostgreSQL:

  1. Clone the CommonDataModel repo from: https://github.com/OHDSI/CommonDataModel/tree/master/PostgreSQL
  2. Log in to PostgreSQL and create an empty schema in the database
  3. Run the OMOP CDM ddl - PostgreSQL.sql script to create the CDM tables and fields in the schema
  4. Load data into the schema by modifying the OMOP CDM vocabulary load - PostgreSQL.sql script in the VocabImport folder to point at the correct tables
  5. Add constraints, including primary and foreign keys, by running OMOP CDM constraints - PostgreSQL.sql
  6. Add a minimum set of indexes by running OMOP CDM indexes required - PostgreSQL.sql

The database import may fail due to incorrect data types. For my project I changed the database schema to allow varchar; however, I would highly suggest cleaning the data itself to remove the non-numeric characters, as that is the simpler solution.
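Once the import succeeds, Spark can read the CDM tables straight out of PostgreSQL over JDBC. A minimal sketch, assuming a database named cdm, a schema named cdm5, and the PostgreSQL JDBC driver on the Spark classpath (all connection details below are placeholders):

// Read the CDM person table over JDBC (Spark 1.6 DataFrame API).
val person = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/cdm")
  .option("dbtable", "cdm5.person")
  .option("user", "postgres")
  .option("password", "postgres")
  .load()

person.printSchema() // confirm the varchar workaround did not mangle columns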

Environment

The author used two separate environments:

  • Local Baremetal

    • An Intel Core i5-3570K processor with 16 GB of memory and a 512 GB SSD, running Ubuntu 15.10
    • Ran the ACHILLES benchmark and acted as a "single-node" Spark cluster
  • Amazon Web Services (AWS) Elastic MapReduce (EMR)

    • Four memory-optimized compute nodes (r3.xlarge)
    • A fifth node acted as the master node
    • Each node provides 4 vCPUs, 30 GB of memory, and an 80 GB SSD
    • EMR version 4.5.0

Both environments used Apache Spark 1.6.1 and Apache Hadoop 2.7.2. The Scala-based Spark application uses OpenJDK 7, SBT 0.13.8, and Scala 2.10.6.
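For reference, a build.sbt matching those versions would look roughly like the sketch below; the name and version come from the JAR produced in the How to Run section, but the dependency list is illustrative (see the build.sbt in this repo for the authoritative one):

name := "cse8803_project"

version := "1.3"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
)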

How to Run

First, the data needs to be generated and loaded; see the Data section above.

Next, build the assembly JAR; sbt will fetch everything else you need:

sbt assembly

Then to run with Spark:

spark-submit --class edu.gatech.cse8803.main.Main cse8803_project-assembly-1.3.jar
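(sbt assembly writes the JAR under target/scala-2.10/ by default.) On EMR, you would typically submit the same JAR through YARN; the executor sizing below is illustrative, not taken from the project:

spark-submit --master yarn --num-executors 4 --executor-memory 20g --class edu.gatech.cse8803.main.Main cse8803_project-assembly-1.3.jar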

Supplemental Material

The accompanying course paper is included in this repository as CSE8803_BDAH_2016.pdf.

Contributors

  • Joshua Powers
    • CSE8803 Big Data Analytics for Health Care (Spring 2016)
    • Georgia Institute of Technology

Reviewers

A huge thank you to the following for their feedback, evaluation, and support:

  • Dr. Jimeng Sun
  • The SunLab
  • Dr. Walter & Marjie Powers
  • Olga Martyusheva
  • Alex Balderson

License

Apache 2.0 © 2016 Joshua Powers