Development in progress
Example use cases of Apache Spark applications, ready to run on AWS EMR, that use data stored in DynamoDB tables.
Requirements:
- An AWS account;
- awscli >= 2 with AWS account credentials configured;
- JDK 8 installed.
Follow the instructions in the spark-dynamodb-infrastructure project to set up the required AWS infrastructure.
The PopulateCovid19Citations application stores the WHO database of studies with COVID-19 citations in a DynamoDB table. Instructions:
- Upload the 'in/WHOCovid19CitationsDatabase.csv' file to the 'spark-dynamodb-example' S3 bucket;
- Generate the application jar file:
./gradlew clean fatJarPopulateCitations
- The application file will be generated at build/libs/PopulateCovid19Citations-1.0.jar.
The Covid19CitationsWordCount application counts how many times each word appears in the COVID-19 citation titles and prints the result to the console. Instructions:
- Generate the application jar file:
./gradlew clean fatJarCitationsWordCount
- The application file will be generated at build/libs/Covid19CitationsWordCount-1.0.jar.
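To give an idea of the kind of result to expect, the same sort of count can be sketched locally with standard Unix tools. This is not the application's Spark code: the sample titles are made up, `word_count` is a hypothetical helper name, and the tokenization (lowercasing, splitting on anything that is not a letter or digit) is an assumption that may differ from what the application does.

```shell
# Made-up sample titles; the real application works on the citation titles
# it loads, not on this text.
titles='COVID-19 vaccine trial
Vaccine safety in a COVID-19 trial'

# Hypothetical helper: reads text on stdin and prints "<count> <word>" lines,
# most frequent words first.
word_count() {
  tr '[:upper:]' '[:lower:]' |      # lowercase everything
    tr -cs '[:alnum:]' '\n' |       # one word per line, split on non-alphanumerics
    sed '/^$/d' |                   # drop empty lines
    sort |                          # group identical words together
    uniq -c |                       # count each group
    sort -k1,1nr -k2                # highest count first, then alphabetical
}

printf '%s\n' "$titles" | word_count
```

With this tokenization, "COVID-19" is split into the two words "covid" and "19", so both are counted separately.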
To run an application on the EMR cluster:
- Upload the generated application file to the 'spark-dynamodb-example' bucket;
- Connect to the EMR cluster master node with SSH (click the SSH link in the cluster summary panel and follow the instructions);
- Download the application jar file to the master node:
aws s3 cp s3://spark-dynamodb-example/<app_name>.jar .
- Execute the application:
spark-submit <app_name>.jar
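
The two upload steps (the CSV data file and the application jar) are described without a command. One way to perform them from the local machine is `aws s3 cp`, sketched below as a small helper. The function name `upload_to_bucket` is hypothetical, and the object key used for the CSV file is an assumption; adjust it if the application expects the file elsewhere in the bucket. For the jar, the default key (the file name without its directory) matches the download command above, which reads the jar from the bucket root.

```shell
# Bucket name taken from the instructions above.
BUCKET="spark-dynamodb-example"

# upload_to_bucket FILE [KEY]
# Hypothetical helper: copies a local FILE to s3://$BUCKET/KEY.
# KEY defaults to the file name without its directory.
upload_to_bucket() {
  file="$1"
  key="${2:-$(basename "$file")}"
  aws s3 cp "$file" "s3://${BUCKET}/${key}"
}

# Usage (run from the project root, with AWS credentials configured):
#   upload_to_bucket build/libs/PopulateCovid19Citations-1.0.jar
#   upload_to_bucket build/libs/Covid19CitationsWordCount-1.0.jar
#   # The key below is an assumption about where the application reads the CSV:
#   upload_to_bucket in/WHOCovid19CitationsDatabase.csv in/WHOCovid19CitationsDatabase.csv
```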