Introduction

We want to detect anomalies in CPU usage. An anomaly occurs when the mean CPU usage in a time window is greater than a reference value specified in the .env file. If the difference is more than one standard deviation, we count the window as a WARN; if it is more than twice the standard deviation, we count it as an ERROR.
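The rule above can be sketched in plain Python. This is only an illustration, not the project's code: the function name classify_window and the "OK" label are hypothetical, and it assumes that the standard deviation in the rule is NORMAL_STDDEV from the .env file and that only means above the reference value count as anomalies.

```python
def classify_window(window_mean: float, normal_mean: float, normal_stddev: float) -> str:
    """Label one window's mean CPU usage as 'ERROR', 'WARN' or 'OK'.

    Assumption: the comparison is one-sided (only means above normal_mean
    can be anomalies) and uses the reference standard deviation.
    """
    difference = window_mean - normal_mean
    # Check the stricter threshold first so ERROR takes precedence over WARN.
    if difference > 2 * normal_stddev:
        return "ERROR"
    if difference > normal_stddev:
        return "WARN"
    return "OK"  # hypothetical label for windows that are not anomalous
```

For example, with a reference mean of 40 and a standard deviation of 4, a window mean of 45 would be a WARN and a window mean of 50 would be an ERROR.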

We read data from a Kafka topic, calculate the mean and standard deviation of every window with Spark, and then write the Spark streaming results to the console and to another Kafka topic.

Windows are time slots of 500 milliseconds, and Spark's output mode is "update".
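To make the windowing concrete, here is a small pure-Python sketch of how readings could be grouped into 500 millisecond slots and summarized. It is not the Spark pipeline itself: the function name window_stats and the (timestamp_ms, cpu_usage) input shape are hypothetical, the windows are assumed to be non-overlapping, and the sample standard deviation is used.

```python
from collections import defaultdict
from statistics import mean, stdev

WINDOW_MS = 500  # window length taken from the description above


def window_stats(events):
    """Group (timestamp_ms, cpu_usage) pairs into 500 ms windows.

    Returns {window_start_ms: (mean, stddev)}. The stddev is None for a
    window with a single reading, because the sample standard deviation
    needs at least two values.
    """
    buckets = defaultdict(list)
    for timestamp_ms, cpu_usage in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp_ms // WINDOW_MS) * WINDOW_MS
        buckets[window_start].append(cpu_usage)

    stats = {}
    for window_start, values in sorted(buckets.items()):
        spread = stdev(values) if len(values) > 1 else None
        stats[window_start] = (mean(values), spread)
    return stats
```

In a streaming job the same window can receive late readings, which is why an output mode that re-emits updated window results is a natural fit; this batch sketch simply sees all readings at once.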

The two variables NORMAL_MEAN and NORMAL_STDDEV in the .env file are the mean and the standard deviation of all the data. You can change them.
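As a rough sketch of how these two values might be read in Python, assuming the variables from the .env file have already been exported into the process environment (how the project actually loads the file is not shown here, and the function name load_thresholds is hypothetical):

```python
import os


def load_thresholds(env=os.environ):
    """Read NORMAL_MEAN and NORMAL_STDDEV from a mapping of environment variables.

    Assumption: both variables are set and contain plain numbers.
    A missing variable raises KeyError; a non-numeric value raises ValueError.
    """
    normal_mean = float(env["NORMAL_MEAN"])
    normal_stddev = float(env["NORMAL_STDDEV"])
    return normal_mean, normal_stddev
```

Passing the mapping as a parameter keeps the function easy to try out with a plain dictionary instead of the real environment.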

JAR files are cached in the storage directory when you start spark_application with Docker.

Only errors and warnings are printed to the console, but all statuses are written to Kafka.

Install

sudo apt update -y
sudo apt install -y git python3 python3-pip python3-venv

git clone https://github.com/hoseinlook/cpu-anomaly-detection-with-spark.git
cd cpu-anomaly-detection-with-spark

cp -n .env.example .env
nano .env

Run

To run this project, first provide the infrastructure (Kafka and ZooKeeper) with Docker.

Start Kafka:

docker-compose up

Kafka bootstrap server (EXTERNAL listener): localhost:9093
ZooKeeper server: localhost:2181

Note:

Watch the Spark output in the docker-compose logs:

docker-compose logs -f spark_application

You can also see the output in the Kafka topic (anomaly) on the bootstrap server.

Start the pipeline without Docker:

Now start the PySpark pipeline. If you want to run Spark on your local machine without Docker, set KAFKA_SERVERS=localhost:9093 (the EXTERNAL Kafka listener) in the .env file.

python3.8 -m venv venv
source venv/bin/activate
pip install -U pip
pip install -r requirements.txt

source venv/bin/activate
python -m pipeline

Spark web UI: localhost:4040

Produce example data to Kafka:

bash send-data-to-kafka.sh

Note:

You can inspect the checkpoints, the Kafka data, and the ZooKeeper data in the storage directory.
