Spark_Data_processing_pipeline

The pipeline processes data on an AWS EMR cluster with Spark.

Amazon EMR is a cloud-based big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. (Source: aws.amazon.com)

In this pipeline, I created a Spark-based transient AWS EMR cluster. A transient EMR cluster terminates automatically once it finishes its job. The cluster performs the job in 3 steps, which are defined in the custom JAR script in the repository (custom JAR.rtf). The 3 steps are listed below.

1. The EMR cluster pulls the sparkProcessingScript.py script. All the processing actions are defined in this Python script.
2. The cluster reads the data from the S3 bucket (users_app_big_dataset.csv) and performs all the processing tasks on it.
3. After processing, the cluster saves the processed data to the S3 bucket as a Parquet file.
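
The steps above could be expressed as a cluster configuration along the following lines. This is only a sketch, not the contents of the custom JAR script in the repository: the bucket name, instance types, release label, and step names are assumptions made for illustration. It builds the keyword arguments that boto3's `run_job_flow` call accepts and sets `KeepJobFlowAliveWhenNoSteps` to `False`, which is what makes the cluster transient. Reading the CSV, processing it, and writing Parquet are assumed to all happen inside sparkProcessingScript.py, so they appear here as a single `spark-submit` step.

```python
def build_transient_cluster_config(bucket, script_key="sparkProcessingScript.py"):
    """Build run_job_flow keyword arguments for a transient Spark cluster.

    `bucket` is a hypothetical S3 bucket that holds the processing script.
    """
    script_s3 = f"s3://{bucket}/{script_key}"
    local_script = f"/home/hadoop/{script_key}"

    steps = [
        {
            # Step 1: pull the processing script from S3 onto the master node.
            "Name": "Copy processing script",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["aws", "s3", "cp", script_s3, local_script],
            },
        },
        {
            # Steps 2 and 3: the script itself reads the CSV, processes it,
            # and writes the Parquet output back to S3.
            "Name": "Run Spark processing",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "client", local_script],
            },
        },
    ]

    return {
        "Name": "spark-processing-transient",  # assumed cluster name
        "ReleaseLabel": "emr-6.10.0",  # assumed EMR release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False means the cluster shuts down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With AWS credentials configured, the result could be passed on as `boto3.client("emr").run_job_flow(**build_transient_cluster_config("my-bucket"))`, where `my-bucket` is a placeholder.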

If you want to experiment locally without an EMR cluster, check the sparkProcessingScript.ipynb notebook. You need Spark installed on your machine.
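
For orientation, a local run might look roughly like the sketch below. It is not the notebook's actual code: the cleaning step (dropping duplicate and fully empty rows) is a placeholder for whatever processing the real script does, the CSV is assumed to have a header row, and the default output directory name is hypothetical. PySpark is imported inside the function so the module can still be loaded on a machine where Spark is not yet installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output locations for a local run."""
    parser = argparse.ArgumentParser(description="Local Spark processing sketch")
    parser.add_argument("--input", default="users_app_big_dataset.csv")
    parser.add_argument("--output", default="processed_output")  # hypothetical name
    return parser.parse_args(argv)


def process(input_path, output_path):
    """Read a CSV file, apply placeholder cleaning, and write Parquet."""
    # Imported here so that importing this module does not require PySpark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # run on the local machine using all cores
        .appName("local-processing-sketch")
        .getOrCreate()
    )
    try:
        # Assumption: the CSV has a header row; column types are inferred.
        df = spark.read.csv(input_path, header=True, inferSchema=True)

        # Placeholder processing, standing in for the real transformations.
        cleaned = df.dropDuplicates().na.drop(how="all")

        # Parquet output is written as a directory of part files.
        cleaned.write.mode("overwrite").parquet(output_path)
    finally:
        spark.stop()
```

Calling `process("users_app_big_dataset.csv", "processed_output")` from a Python session would then write the Parquet files next to the input, provided the CSV is in the working directory.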

In the next part of this pipeline, I will create the dimensional model in the Redshift data warehouse. I will also implement the Redshift Spectrum pipeline, which will insert and update the data in the dimension and fact tables.



ktnsh24/Spark_Data_processing_pipeline
