SPARK-DYNAMODB-EXAMPLE

Development in progress

Description

Example use cases of Apache Spark applications, ready to run on AWS EMR, that work with data stored in DynamoDB tables.

Install requirements

An AWS account;
AWS CLI (awscli) version 2 or later, with AWS account credentials configured;
An installed JDK 8.

Provisioning the infrastructure

Follow the instructions in the spark-dynamodb-infrastructure project.

Generating the Application Files

Spark App #1: Populating the Covid19Citation table

The PopulateCovid19Citations application stores the WHO database of COVID-19 study citations in a DynamoDB table. Instructions:

Upload the 'in/WHOCovid19CitationsDatabase.csv' file to the 'spark-dynamodb-example' S3 bucket;
Generate the application jar file:

./gradlew clean fatJarPopulateCitations

The application file will be generated at build/libs/PopulateCovid19Citations-1.0.jar.
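To illustrate the kind of transformation this application performs, here is a minimal sketch in plain Java (no Spark or AWS SDK) of turning one CSV row into an attribute-name-to-value map, which is roughly the shape of a DynamoDB item. It is not the project's actual code: the class `CsvRowSketch`, its methods, and the column names `Title` and `Authors` are hypothetical, and the real columns of WHOCovid19CitationsDatabase.csv may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository.
public class CsvRowSketch {

    // Splits one CSV line into fields, honoring double-quoted fields
    // and "" as an escaped quote inside a quoted field.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"' && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // escaped quote
                    i++;
                } else if (c == '"') {
                    inQuotes = false;    // closing quote
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;         // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    // Pairs each header name with the value in the same position of the row.
    static Map<String, String> toItem(List<String> header, List<String> row) {
        Map<String, String> item = new LinkedHashMap<>();
        for (int i = 0; i < header.size() && i < row.size(); i++) {
            item.put(header.get(i), row.get(i));
        }
        return item;
    }

    public static void main(String[] args) {
        // Assumed column names and sample data, for illustration only.
        List<String> header = parseLine("Title,Authors");
        List<String> row = parseLine("\"Lungs, heart and COVID-19\",\"A. Author\"");
        System.out.println(toItem(header, row));
    }
}
```

In a Spark job, a mapping like this would typically be applied to every line of the input file before the resulting items are written to the table.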

Spark App #2: Counting the words in the COVID-19 citations titles

The Covid19CitationsWordCount application counts how many times each word appears in the COVID-19 citation titles and prints the result to the console. Instructions:

Generate the application jar file:

./gradlew clean fatJarCitationsWordCount

The application file will be generated at build/libs/Covid19CitationsWordCount-1.0.jar.
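The counting logic itself can be sketched in plain Java, without Spark, as below. This is only an illustration of the word-count idea under some assumptions (case-insensitive matching, splitting on non-word characters), not the application's actual implementation; the class `WordCountSketch` and the sample titles are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the repository.
public class WordCountSketch {

    // Counts how often each word occurs across all titles.
    // Words are lower-cased and split on runs of non-word characters,
    // so "COVID-19" is counted as the two tokens "covid" and "19".
    static Map<String, Long> countWords(List<String> titles) {
        return titles.stream()
                .flatMap(title -> Arrays.stream(title.toLowerCase(Locale.ROOT).split("\\W+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample titles made up for illustration.
        List<String> titles = List.of(
                "COVID-19 and the lungs",
                "The lungs in COVID-19");
        countWords(titles).forEach((word, count) -> System.out.println(word + ": " + count));
    }
}
```

A Spark version would usually express the same steps as a flat-map over the titles followed by a per-word aggregation, distributed across the cluster.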

Running in the AWS EMR cluster

Upload the generated application file to the 'spark-dynamodb-example' bucket;
Connect to the EMR cluster master node with SSH (click the SSH link in the cluster summary panel and follow the instructions);
Download the application jar file to the master node:

aws s3 cp s3://spark-dynamodb-example/<app_name>.jar .

Execute the application:

spark-submit <app_name>.jar

leohoc/spark-dynamodb-example
