spark-streaming-parquet

Scala code to read Parquet files as streams in Spark Streaming using Avro.

Build

The build.sbt file sets the build configuration and lists the code's dependencies on other components. To compile the code:

sbt clean compile

To create the jar file that is submitted to Spark:

sbt package

This will create target/parquetstreaming-1.0.jar.

Submit to Spark

The Bash script spark-parquet.sh can be used to submit the code to Spark:

spark-parquet.sh target/parquetstreaming-1.0.jar <sampling-period (seconds)> <parquet-dir>

The script collects the jar files that the Scala code depends on and submits them to Spark.

Output

Every sampling-period seconds, parquet-dir is checked for a new Parquet file (the file extension should be .parquet). The stream is converted to a DataFrame, and the schema of the DataFrame is printed along with its top 20 rows. The total number of rows in the Parquet file is printed next, followed for each row by the data alone and then by the complete (key, data) record.
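
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in src: the object name ParquetStreamSketch, the helper isParquet, and the argument order are hypothetical, and the DataFrame step assumes Spark 2.2 or later with spark-streaming, spark-sql, and parquet-avro on the classpath. It also omits the per-row (key, data) printing.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroReadSupport
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParquetStreamSketch {
  // Hypothetical helper: only files whose name ends in .parquet are picked up.
  def isParquet(path: Path): Boolean = path.getName.endsWith(".parquet")

  def main(args: Array[String]): Unit = {
    // Assumed argument order: <sampling-period (seconds)> <parquet-dir>
    val Array(period, dir) = args

    val conf = new SparkConf().setAppName("ParquetStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(period.toLong))

    // Ask ParquetInputFormat to materialise each row as an Avro GenericRecord.
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS,
      classOf[AvroReadSupport[GenericRecord]].getName)

    // The directory is scanned once per batch interval for new matching files.
    val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
      dir, isParquet _, newFilesOnly = true, conf = hadoopConf)

    // GenericRecord is not java.io.Serializable, so render each record
    // as a JSON string on the executors before doing anything else.
    stream.map { case (_, record) => record.toString }.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._

        val df = spark.read.json(rdd.toDS()) // schema is inferred from the JSON strings
        df.printSchema()
        df.show() // show() prints the first 20 rows by default
        println(s"Number of rows: ${df.count()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Going through JSON strings is only one way to reach a DataFrame; it keeps the sketch short at the cost of re-inferring the schema for every batch.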

