Development in progress
Implementation of a WordCount Apache Spark application using input data stored in a DynamoDB table.
This project connects to DynamoDB using the Audience Project data source. For examples using the AWS emr-dynamodb-connector, see spark-dynamodb-example.
- An AWS account;
- awscli >= 2 with AWS account credentials configured;
- An installed JDK 8.
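The requirements above can be checked from a terminal before building. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` is hypothetical, and it assumes that `aws --version` reports a string starting with `aws-cli/<version>` and that `javac -version` reports `javac 1.8.x` for JDK 8.

```shell
# Sketch: verify awscli >= 2, configured credentials, and JDK 8.
# check_prerequisites is a hypothetical helper, not part of this project.
check_prerequisites() {
  local aws_version javac_version

  # `aws --version` prints something like "aws-cli/2.x.y Python/3.x ...".
  aws_version="$(aws --version 2>&1)" || { echo "awscli not found" >&2; return 1; }
  case "$aws_version" in
    aws-cli/0.*|aws-cli/1.*) echo "awscli >= 2 required, found: $aws_version" >&2; return 1 ;;
    aws-cli/*) ;;
    *) echo "unexpected awscli version output: $aws_version" >&2; return 1 ;;
  esac

  # Fails when no valid credentials are configured for the CLI.
  aws sts get-caller-identity >/dev/null 2>&1 \
    || { echo "AWS credentials are not configured" >&2; return 1; }

  # JDK 8 reports itself as "javac 1.8.0_<update>" (on stderr).
  javac_version="$(javac -version 2>&1)" || { echo "JDK not found" >&2; return 1; }
  case "$javac_version" in
    "javac 1.8."*) ;;
    *) echo "JDK 8 expected, found: $javac_version" >&2; return 1 ;;
  esac

  echo "All prerequisites satisfied"
}
```

Calling `check_prerequisites` prints a confirmation when all three checks pass, and otherwise reports the first failing requirement on stderr and returns a non-zero status.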
Set up the infrastructure by following the instructions in the spark-dynamodb-infrastructure project.
The DynamoDB input data used in this project was generated using the instructions contained in spark-dynamodb-example.
The Covid19CitationsWordCount application counts how many times each word appears in the COVID-19 citation titles and stores the result in a DynamoDB table. Instructions:
- Generate the application jar file:
./gradlew clean fatJar
  - The application file will be generated in build/libs/Covid19CitationsWordCount-1.0-SNAPSHOT.jar.
- Upload the generated application file to the 'spark-dynamodb-example' bucket;
- Connect to the EMR cluster master node with SSH (click the SSH link in the cluster summary panel and follow the instructions);
- Download the application jar file to the master node:
aws s3 cp s3://spark-dynamodb-example/Covid19CitationsWordCount-1.0-SNAPSHOT.jar .
- Execute the application:
spark-submit --packages com.audienceproject:spark-dynamodb_2.11:1.0.2 Covid19CitationsWordCount-1.0-SNAPSHOT.jar
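The upload step above does not name a command. One way to do it, mirroring the `aws s3 cp` download used on the master node, is sketched below. It assumes the command is run from the project root on the machine that built the jar, and that the configured AWS credentials can write to the bucket. The function name `upload_jar` is hypothetical.

```shell
# Sketch: upload the fat jar produced by `./gradlew clean fatJar` to the
# S3 bucket named in the instructions. Run from the project root.
# upload_jar is a hypothetical helper, not part of this project.
JAR_NAME="Covid19CitationsWordCount-1.0-SNAPSHOT.jar"
BUCKET="spark-dynamodb-example"

upload_jar() {
  local jar_path="build/libs/${JAR_NAME}"

  # Stop early if the build step has not produced the jar yet.
  if [ ! -f "$jar_path" ]; then
    echo "Jar not found at $jar_path; run ./gradlew clean fatJar first." >&2
    return 1
  fi

  # Copy the jar to the bucket under the same file name, so the download
  # command on the master node finds it.
  aws s3 cp "$jar_path" "s3://${BUCKET}/${JAR_NAME}"
}
```

Running `upload_jar` after the build step leaves the jar at the S3 path that the download step expects.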