SPARK-DYNAMODB-AUDIENCE

Development in progress

Description

Implementation of a WordCount Apache Spark application using input data stored in a DynamoDB table.

This project connects to DynamoDB using the Audience Project data source. For examples using the AWS emr-dynamodb-connector, see spark-dynamodb-example.

Install requirements

An AWS account;
AWS CLI (awscli) version 2 or later, configured with your AWS account credentials;
JDK 8.

Providing the infrastructure

Follow the instructions in the spark-dynamodb-infrastructure project.

Generating the Input Data

The DynamoDB input data used in this project was generated by following the instructions in spark-dynamodb-example.

Spark App #1: Counting the words in the COVID-19 citations titles

The Covid19CitationsWordCount application counts how many times each word appears in the titles of the COVID-19 citations and stores the result in a DynamoDB table. Instructions:

Generate the application jar file:

./gradlew clean fatJar

The application jar file will be generated at build/libs/Covid19CitationsWordCount-1.0-SNAPSHOT.jar.
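
To make the counting step concrete before running the job, here is a minimal sketch in plain Java (no Spark, no DynamoDB) of the word-count logic described for Covid19CitationsWordCount. It is not the project's code: the class name TitleWordCount, the method countWords, the tokenization rule (lowercase, split on anything that is not a letter or digit), and the sample titles are all assumptions made for illustration. The real application presumably does the equivalent over a Spark dataset read from DynamoDB.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: TitleWordCount is a hypothetical class, not part of this project.
public class TitleWordCount {

    // Lowercases each title, splits it on runs of non-alphanumeric characters,
    // and counts how often each resulting word occurs across all titles.
    public static Map<String, Long> countWords(List<String> titles) {
        return titles.stream()
                .flatMap(title -> Arrays.stream(
                        title.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")))
                .filter(word -> !word.isEmpty()) // split can yield a leading empty token
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample titles made up for this example; the real input lives in DynamoDB.
        List<String> titles = Arrays.asList(
                "COVID-19 vaccine trials",
                "Vaccine safety and COVID-19");
        countWords(titles).forEach(
                (word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Under this tokenization rule, "COVID-19" is split into the two words "covid" and "19". A different rule, such as splitting on whitespace only, would give different counts.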

Running in the AWS EMR cluster

Upload the generated application jar file to the 'spark-dynamodb-example' S3 bucket;
Connect to the EMR cluster master node with SSH (click the SSH link in the cluster summary panel and follow the instructions);
Download the application jar file to the master node:

aws s3 cp s3://spark-dynamodb-example/Covid19CitationsWordCount-1.0-SNAPSHOT.jar .

Execute the application:

spark-submit --packages com.audienceproject:spark-dynamodb_2.11:1.0.2 Covid19CitationsWordCount-1.0-SNAPSHOT.jar
