
Logs Data Pipeline

This project implements a data pipeline. The goal of the pipeline is to derive insights from the logs of deployed microservices. Since we didn't actually implement or deploy any microservices, we simulate these logs using the following Kaggle dataset.

Architecture

(Architecture diagram)

The process is simple: each log line is sent to a Kafka topic. A Spark Streaming process takes each log line from Kafka and cleans it - taking the IP address and extracting the country from it, trimming the edges of strings, dropping useless information, etc. The cleaned data is then stored in MongoDB. A Spark Batch process then reads all of the data from MongoDB and generates stats - the possibilities are endless. In our case, we extracted two stats: one related to the APIs' response times, and the other related to the distribution of the countries the requests come from. These stats are stored back in MongoDB, to be consumed by a Flask API when the Angular dashboard fetches the data.
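To make the cleaning step concrete, here is a minimal sketch of one way the IP-to-country extraction could work. The `geoip2` library and the `GeoLite2-Country.mmdb` database file are assumptions for illustration, not necessarily what the project's preprocessing actually uses.

```python
# Minimal sketch of the IP-to-country enrichment described above.
# The geoip2 library and the local GeoLite2 database file are assumptions;
# the actual pipeline may resolve countries differently.
import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")  # hypothetical path

def ip_to_country(ip: str) -> str:
    """Return the country name for an IP, or 'Unknown' if not found."""
    try:
        return reader.country(ip).country.name or "Unknown"
    except geoip2.errors.AddressNotFoundError:
        return "Unknown"

print(ip_to_country("8.8.8.8"))  # e.g. "United States"
```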

How to run the project

First, you will need to run the containers, so start by running the following:

docker-compose up

While waiting forever for the containers to come up - yes, it sadly does take time - let's install the libraries you'll need to run the Python scripts.

python3 -m pip install -r requirements.txt

I use Linux, but if you're a Windows user, you should probably use py -m pip install -r requirements.txt or simply pip install -r requirements.txt.

Moving forward, once your containers are ready, the next thing you'll need to do is download the Kaggle dataset and put it in the root of the project.

Now we have our containers running, our libraries installed, and our dataset - the next thing you'll need to do is run the producer, which sends the logs to Kafka line by line.

python3 producer.py
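For reference, here is a minimal sketch of what a producer like this typically does, assuming the kafka-python client, a broker at localhost:9092, a topic named logs, and a dataset file named access.log; check producer.py for the actual names.

```python
# Rough sketch of the producer's job: stream the dataset to Kafka line by line.
# The broker address, topic name, and file name are assumptions; see producer.py.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

with open("access.log", "rb") as logs:  # hypothetical dataset file name
    for line in logs:
        producer.send("logs", value=line.strip())  # one log line per message

producer.flush()  # make sure everything is delivered before exiting
producer.close()
```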

Now, while that process is running, run the Spark Streaming process.

python3 preprocessing.py

And now Spark is processing the logs line by line and saving the processed data in Mongo.
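For the curious, here is a minimal sketch of such a streaming job with PySpark Structured Streaming. The topic name, Mongo URI, parsing logic, and connector format are assumptions; preprocessing.py holds the real implementation.

```python
# Sketch of the streaming step: read raw lines from Kafka, clean them,
# and append them to MongoDB. Requires the Kafka and MongoDB Spark
# connector packages on the classpath; names below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = SparkSession.builder.appName("logs-streaming").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "logs")  # hypothetical topic name
       .load())

cleaned = raw.select(trim(col("value").cast("string")).alias("line"))
# ... real cleaning: parse fields, resolve the country from the IP, drop noise

def write_batch(df, epoch_id):
    # Runs once per micro-batch; format/URI depend on the connector version.
    (df.write.format("mongo")
       .mode("append")
       .option("uri", "mongodb://localhost:27017/logs.cleaned")
       .save())

query = cleaned.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```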

To run the Spark Batch process:

python3 processing.py
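A minimal sketch of what this batch step amounts to is shown below: load the cleaned logs from MongoDB and derive the two stats mentioned earlier. The column, database, and collection names are assumptions; see processing.py for the real ones.

```python
# Sketch of the batch step: read cleaned logs from Mongo, compute the two
# stats (response times per API, requests per country), write them back.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = (SparkSession.builder.appName("logs-batch")
         .config("spark.mongodb.input.uri", "mongodb://localhost:27017/logs.cleaned")
         .config("spark.mongodb.output.uri", "mongodb://localhost:27017/logs.stats")
         .getOrCreate())

logs = spark.read.format("mongo").load()

# Average response time per API endpoint (hypothetical column names).
response_times = logs.groupBy("api").agg(avg("response_time").alias("avg_response_time"))

# How many requests each country sent.
countries = logs.groupBy("country").agg(count("*").alias("requests"))

response_times.write.format("mongo").mode("overwrite").option("collection", "response_times").save()
countries.write.format("mongo").mode("overwrite").option("collection", "countries").save()
```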

Now that we have our stats in Mongo, let's run the Flask app! Run the following:

cd api

export FLASK_APP=api

flask run
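Under the hood, the API essentially reads the precomputed stats from Mongo and serves them as JSON. Here is a minimal sketch, assuming pymongo and hypothetical endpoint paths and collection names; the real routes live in the api module.

```python
# Sketch of the API layer: read the precomputed stats from MongoDB and
# serve them as JSON. Endpoint paths and collection names are assumptions.
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
stats = MongoClient("mongodb://localhost:27017")["logs"]

@app.route("/response-times")
def response_times():
    # Drop Mongo's internal _id so the documents are JSON-serializable.
    return jsonify(list(stats["response_times"].find({}, {"_id": 0})))

@app.route("/countries")
def countries():
    return jsonify(list(stats["countries"].find({}, {"_id": 0})))
```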

With that, your Flask app is running! The next thing is to run the Angular dashboard. Open another terminal and run:

cd dashboard

npm install

npm start

You should be met with the following interface:

(Dashboard screenshot)

Now I know that it's not the best UI in the world, but it's honest work...

In order to purge all of the data in MongoDB, just run:

python3 purge.py
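A purge script like this usually just drops the collections the pipeline wrote to. A minimal sketch, assuming pymongo and a hypothetical database named logs; see purge.py for what actually gets dropped.

```python
# Sketch of a purge: drop every collection in the pipeline's database.
# The database name is an assumption; see purge.py.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["logs"]
for name in db.list_collection_names():
    db.drop_collection(name)
    print(f"Dropped {name}")
```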

And that's all folks, I hope you enjoyed the project!