Creation of a Batch-Input Data Pipeline using AWS Services

Objective

The goal of this project is to create an end-to-end data pipeline that fetches unstructured JSON data from Twitter at specific time intervals and transforms it into a readable, relational format for analytics and data visualization.

Solution and Architecture

The solution has two parts.

Ingestion and Storage

For data ingestion, a Kinesis Data Firehose delivery stream is created with the Boto3 API. The Python application calls the Twitter API to fetch tweets on a particular topic. The tweets are written to the delivery stream, which buffers them and delivers them to S3 at the configured buffer interval.
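The ingestion step can be sketched as follows. This is a minimal sketch under stated assumptions, not the repository's actual script: the helper names (`stream_config`, `to_record`, `send_tweets`), the stream name, and the ARNs are hypothetical, and the tweets are assumed to be already fetched as Python dictionaries. The `firehose` argument is expected to be a `boto3.client("firehose")`.

```python
import json


def stream_config(stream_name, bucket_arn, role_arn, interval_seconds=300, size_mb=5):
    """Build keyword arguments for firehose.create_delivery_stream.

    Firehose flushes to S3 when either buffering limit is reached first.
    """
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # producers write to the stream directly
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "BufferingHints": {
                "IntervalInSeconds": interval_seconds,
                "SizeInMBs": size_mb,
            },
        },
    }


def to_record(tweet):
    """Serialize one tweet as a newline-terminated JSON record."""
    return {"Data": (json.dumps(tweet) + "\n").encode("utf-8")}


def send_tweets(firehose, stream_name, tweets, batch_size=500):
    """Write tweets in batches and return how many records were accepted.

    put_record_batch accepts at most 500 records per call.
    """
    accepted = 0
    for start in range(0, len(tweets), batch_size):
        batch = [to_record(t) for t in tweets[start:start + batch_size]]
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        accepted += len(batch) - response.get("FailedPutCount", 0)
    return accepted


def main():
    import boto3  # imported lazily; needs AWS credentials and an existing IAM role

    firehose = boto3.client("firehose")
    # Placeholder names and ARNs (assumptions): replace with real resources.
    firehose.create_delivery_stream(
        **stream_config(
            "twitter-stream",
            "arn:aws:s3:::example-tweet-bucket",
            "arn:aws:iam::123456789012:role/example-firehose-role",
        )
    )
    # The stream must become ACTIVE before it accepts records (not shown here).
    sample = [{"id": 1, "text": "hello"}]  # stand-in for tweets from the Twitter API
    print(send_tweets(firehose, "twitter-stream", sample))
```

Each record ends with a newline so that the objects Firehose writes to S3 hold one JSON document per line rather than a run of concatenated documents.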

Processing

For processing we use AWS Glue. A crawler scans the S3 bucket to infer the schema of the unstructured data; a Glue job then maps the JSON nodes to relational target columns and saves the result in S3 in Parquet or CSV format.
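The crawler and job steps could be driven from Boto3 roughly as shown here. This is a sketch under assumptions: the helper names (`crawler_request`, `run_crawler`, `start_job`) are hypothetical, the IAM role and Data Catalog database are assumed to exist, and the Glue job is assumed to have been created already. The `glue` argument is expected to be a `boto3.client("glue")`.

```python
import time


def crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for glue.create_crawler with one S3 target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(glue, request, poll_seconds=30, sleep=time.sleep):
    """Create the crawler, start it, and block until it is READY again.

    Returns the number of status polls that saw the crawler still busy.
    """
    name = request["Name"]
    glue.create_crawler(**request)
    glue.start_crawler(Name=name)
    polls = 0
    # States are READY, RUNNING, and STOPPING; READY means the crawl has ended.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        polls += 1
        sleep(poll_seconds)
    return polls


def start_job(glue, job_name):
    """Start an existing Glue job and return its run id."""
    return glue.start_job_run(JobName=job_name)["JobRunId"]
```

Passing `sleep` as a parameter keeps the polling loop testable without real waiting; in normal use the default `time.sleep` applies.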

Alternative Processing

An alternative way of processing is to run a Spark application on an EMR cluster. The application can be submitted from any local system using the AWS CLI by specifying steps, with the cluster set to terminate once the job is completed. The same can be automated by running the command-line script from AWS Data Pipeline.
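The shape of such a request, a cluster with one Spark step that shuts itself down afterwards, can be sketched with Boto3 instead of the CLI. The function names, step name, release label, and instance settings below are assumptions chosen for illustration; the returned dictionary is meant to be passed as `boto3.client("emr").run_job_flow(**request)`.

```python
def spark_step(name, script_uri):
    """Describe one EMR step that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def transient_cluster_request(name, script_uri, log_uri,
                              release_label="emr-6.15.0",
                              instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for emr.run_job_flow."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # False makes the cluster terminate once all steps are done.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [spark_step("process-tweets", script_uri)],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # default EMR service role
    }
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what makes the cluster transient: it terminates when the last step finishes, so no separate shutdown call is needed.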

Process and End Result

The script Ingestion.py contains the code for creating the Kinesis Firehose delivery stream, fetching Twitter data, and writing it to the stream. After this has run for a certain period of time, the Kinesis Firehose is terminated. A Glue crawler is then set up; it crawls the S3 bucket and determines the schema on its own. Once the crawl is complete, we create a job, map input to output, and run it. Glue transforms the JSON files into CSV and stores them in S3.
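To illustrate what the input-to-output mapping does, here is a plain-Python sketch that flattens nested tweet JSON into CSV rows. It is not Glue code and not the job the project runs; it only mirrors the idea of pairing a source JSON path with a target column. The field names in `MAPPING` and the function names are assumptions.

```python
import csv
import io
import json

# (source path in the JSON document, target column name); assumed fields.
MAPPING = [
    ("id", "tweet_id"),
    ("created_at", "created_at"),
    ("user.screen_name", "user_name"),
    ("text", "text"),
]


def lookup(document, dotted_path):
    """Follow a dotted path through nested dictionaries; None if a key is missing."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def to_row(document, mapping=MAPPING):
    """Project one JSON document onto the flat target columns."""
    return {target: lookup(document, source) for source, target in mapping}


def json_lines_to_csv(lines, mapping=MAPPING):
    """Convert newline-delimited JSON into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=[target for _, target in mapping], lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines between records
            writer.writerow(to_row(json.loads(line), mapping))
    return out.getvalue()
```

A nested node such as `user.screen_name` becomes an ordinary column (`user_name` here), which is the step that turns the document-shaped input into a relational table.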

Mapping:-

CSV Output:-

For a detailed walkthrough of the process, see these YouTube videos:

https://www.youtube.com/watch?v=JDy3QWVz8Ws&t=49s

https://www.youtube.com/watch?v=jkO9wdpHt4w&t=534s

Alternative Process:

An EMR cluster can be spun up with the script EMRJobCreation.py (which uses Boto3) or with cli_bash.sh.
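A Boto3 launch typically amounts to one `run_job_flow` call followed by polling the cluster state. The sketch below is an assumption about how such a script might look, not the contents of the repository's script: `launch_and_wait` is a hypothetical helper, `emr` is expected to be a `boto3.client("emr")`, and `request` is a dictionary of `run_job_flow` keyword arguments supplied by the caller.

```python
import time

# Cluster states after which no further work will happen.
FINISHED = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def launch_and_wait(emr, request, poll_seconds=60, sleep=time.sleep):
    """Start a cluster with emr.run_job_flow and poll until it has terminated.

    Returns the cluster id and the final state.
    """
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    while True:
        status = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]
        if status["State"] in FINISHED:
            return cluster_id, status["State"]
        sleep(poll_seconds)
```

Checking for `TERMINATED_WITH_ERRORS` as well as `TERMINATED` lets the caller distinguish a failed step from a clean run instead of waiting indefinitely.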

Repository files:

.gitignore
BuildFirehoseStream.py
EMRJobCreation.py
LICENSE
README.md
TwitterGrabData.py
cli_bash.sh
emrjobrunscript.sh
maximizeconfig.sh
sparkprocess.py
twitterfile.json

nilabja9/Data-Pipeline-AWS
