pipelining-data


About this project

This project is a test task for building a data pipeline, implemented as a miniature test bench.

The project consists of four components:

  • A utility written in Python that generates test data and sends it to Kafka
  • An Apache Kafka-based data bus
  • ClickHouse for data aggregation and storage
  • Grafana for monitoring and alerting on abnormal values

Algorithm of operation

  1. The Python service generates data according to the test task and sends it to Apache Kafka at the interval CREATE_REPORT_INTERVAL.
  2. Apache Kafka acts as an intermediate data bus and buffers the messages in a queue in front of ClickHouse.
  3. ClickHouse uses the Kafka engine to read data from Kafka and save it to a table with the MergeTree engine.
  4. Grafana fetches BID and ASK data from ClickHouse and computes (ask_01 + bid_01) / 2; an alert is raised when this value exceeds the threshold of 9.9 (illustrated below).
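
To make the steps concrete, a single report might look roughly like the dictionary below. Only the ask_01/bid_01 field names and the 9.9 threshold come from this README; the exact message schema is an assumption for illustration.

# Hypothetical shape of one generated report (field names beyond ask_01/bid_01 are assumed)
report = {
    "bid_01": 4.9, "bid_02": 5.1,   # ... up to BID_SIZE bid values
    "ask_01": 5.0, "ask_02": 5.2,   # ... up to ASK_SIZE ask values
}

# Step 4: the value that grafana plots and alerts on
midpoint = (report["ask_01"] + report["bid_01"]) / 2
if midpoint > 9.9:
    print("abnormal value, grafana raises an alert")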

Repository map

.
├── CHANGELOG.md
├── Dockerfile              ### Manifest for creating a docker image with a python service
├── LICENSE                 ### Information about the project license
├── README.md               ### The file you're reading now
├── doc                     ### Content for documentation
├── configs                 ### A set of configurations for clickhouse and grafana
│   ├── clickhouse           ## A set of configurations for clickhouse
│   │   ├── grafana.xml       # Configuration of the ClickHouse user used by Grafana
│   │   └── init-db.sh        # Script for preparing tables and mapping schemas for storing test data
│   └── grafana              ## A set of configurations for grafana provisioning
│       ├── alerts.json       # Json model of the dashboard
│       ├── dashboard.yaml    # Parameters for provisioning dashboards
│       └── datasource.yaml   # Parameters for provisioning datasources
├── docker-compose.yml      ### Manifest for launching containers with all the services necessary for the project
├── requirements.txt        ### Dependencies for python service
└── src                     ### The source code of the service in python
    ├── data_generator.py    ## Module for generating test data
    ├── kafka_sender.py      ## Module for sending generated data to kafka
    └── main.py              ## The main module for initializing the dependent classes and reading environment variables

Requirements

Environment variables

| Variable | Service | Description | Default |
| --- | --- | --- | --- |
| ALLOW_ANONYMOUS_LOGIN | Zookeeper | Allows anonymous authorization in Zookeeper | yes |
| ALLOW_PLAINTEXT_LISTENER | Kafka | Allows use of the PLAINTEXT listener | yes |
| KAFKA_CFG_ZOOKEEPER_CONNECT | Kafka | Link to the Zookeeper server | zookeeper:2181 |
| KAFKA_CFG_LISTENERS | Kafka | Listener configuration that allows remote connections | PLAINTEXT://:9092 |
| KAFKA_CFG_ADVERTISED_LISTENERS | Kafka | Kafka address advertised to clients | PLAINTEXT://kafka:9092 |
| KAFKA_CLUSTERS_0_NAME | kafka-ui | Name of the Kafka cluster to display in the web UI | kafka |
| KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS | kafka-ui | Link to the Kafka cluster | kafka:9092 |
| GF_INSTALL_PLUGINS | grafana | List of plugins to install when the Grafana container starts | grafana-clickhouse-datasource |
| GF_AUTH_ANONYMOUS_ENABLED | grafana | Allow access to Grafana without authorization | yes |
| BID_SIZE | pygenerator | Number of bid elements in the generated data | 50 |
| ASK_SIZE | pygenerator | Number of ask elements in the generated data | 50 |
| RANDOM_START | pygenerator | Lower bound of the range used to randomly generate bid and ask values | 1 |
| RANDOM_DECIMAL | pygenerator | Number of decimal places for the randomly generated bid and ask float values | 1 |
| CREATE_REPORT_INTERVAL | pygenerator | Pause between report generation iterations, in seconds | 0.1 |
| KAFKA_BOOTSTRAP_SERVERS | pygenerator | Kafka cluster to which the generated data is sent | kafka:9092 |
| KAFKA_TOPIC | pygenerator | Name of the Kafka topic for the generated data | metrics_raw |
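
As a reference for the pygenerator variables, here is a minimal sketch of how the service could read them with the defaults from the table (the actual parsing in main.py may differ):

import os

BID_SIZE = int(os.getenv("BID_SIZE", "50"))
ASK_SIZE = int(os.getenv("ASK_SIZE", "50"))
RANDOM_START = int(os.getenv("RANDOM_START", "1"))
RANDOM_DECIMAL = int(os.getenv("RANDOM_DECIMAL", "1"))
CREATE_REPORT_INTERVAL = float(os.getenv("CREATE_REPORT_INTERVAL", "0.1"))  # seconds
KAFKA_BOOTSTRAP_SERVERS = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092")
KAFKA_TOPIC = os.getenv("KAFKA_TOPIC", "metrics_raw")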

How to run with docker-compose

  1. Build and launch all the necessary services:
docker-compose up -d
  2. View the generator logs:
docker logs -f pygenerator

Description of components and important parts

Python service

The service requires the additional external Python modules listed in requirements.txt.

The source code of the service consists of three separate modules (a combined sketch follows the list):

  • data_generator.py This module is responsible for generating the report data according to the test task
  • kafka_sender.py This module is responsible for initializing the connection to Kafka and sending the report data to the topic
  • main.py The main module of the service and its entrypoint. It initializes the classes from the modules above and reads the environment variables.
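
A minimal combined sketch of how these modules could fit together, assuming the kafka-python package provides the producer and that generated values start at RANDOM_START (the real modules may differ in details such as the upper bound and field names):

import json
import random
import time
from kafka import KafkaProducer  # assumed dependency from requirements.txt

def generate_report(bid_size=50, ask_size=50, start=1, decimals=1):
    # Build one report with random bid/ask values; the upper bound of 10 is an assumption
    report = {}
    for i in range(1, bid_size + 1):
        report[f"bid_{i:02d}"] = round(random.uniform(start, 10), decimals)
    for i in range(1, ask_size + 1):
        report[f"ask_{i:02d}"] = round(random.uniform(start, 10), decimals)
    return report

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    producer.send("metrics_raw", generate_report())
    time.sleep(0.1)  # CREATE_REPORT_INTERVAL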

Apache Kafka

Receives data from the Python service and queues it in front of the database.

Nothing interesting.

You can view incoming messages via kafka-ui here: http://0.0.0.0:8080/ui/clusters/kafka/topics/metrics_raw
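
If you prefer the command line, a small consumer script (again assuming kafka-python) can dump the raw messages; run it inside the compose network, or adjust the bootstrap server if Kafka is exposed to the host:

from kafka import KafkaConsumer  # assumed dependency

consumer = KafkaConsumer(
    "metrics_raw",
    bootstrap_servers="kafka:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop after 5 seconds without new messages
)
for message in consumer:
    print(message.value)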

Clickhouse

Reading data from Kafka and saving it to the database is implemented according to the official documentation.

The init-db.sh shell script is used to configure ClickHouse (a quick way to verify the result is sketched after the list):

  • Creating the pipeliningData_queue table on the Kafka engine to read data from the Kafka topic
  • Creating the pipeliningData table on the MergeTree engine for permanent storage of the consumed data
  • Creating the MATERIALIZED VIEW pipeliningData_mv, which moves data from pipeliningData_queue into pipeliningData
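
A quick way to check that consumed messages actually land in the MergeTree table is a small query from Python, assuming the clickhouse-driver package and that the native port 9000 is reachable from where you run it (table and column names are taken from the description above and from the Grafana formula):

from clickhouse_driver import Client  # assumed client library

client = Client(host="localhost", port=9000)

# How many rows have been stored so far
print(client.execute("SELECT count() FROM pipeliningData"))

# The same midpoint that grafana plots: (ask_01 + bid_01) / 2
print(client.execute("SELECT (ask_01 + bid_01) / 2 FROM pipeliningData LIMIT 5"))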

The grafana.xml configuration file creates a separate user that Grafana will use:

  • Creating a user
  • Creating a role for the created user

Grafana

Grafana is used in this project to build monitoring on top of the data stored in ClickHouse.

A ready-to-use dashboard is configured via provisioning (see configs/grafana).

The Grafana URL available after launching all services with docker-compose: http://0.0.0.0:3000. Note: charting may take several minutes!

How to build a docker image with a python service

export APP_VERSION=v1.0.0
export APP_NAME="pipelining-data"
export GITHUB_USERNAME=<your-github-username>
docker build -t ghcr.io/${GITHUB_USERNAME}/${APP_NAME}:${APP_VERSION} .
docker push ghcr.io/${GITHUB_USERNAME}/${APP_NAME}:${APP_VERSION}
