Fetch Data Ops / Data Engineer Take Home Exercise - Real Time Data Processing and Analytics using Kafka, Docker, Python and OpenSearch
Clone the repository
$ git clone https://github.com/jyok117/real-time-data-processing-analytics
$ cd real-time-data-processing-analytics

Start the services
The services include Kafka, Zookeeper, OpenSearch, OpenSearch Dashboards, my-python-producer and my-consumer
$ docker-compose up --build

The my-python-producer service generates random user login data and sends it to the Kafka topic user-login.
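The actual producer logic lives in the repository; as a rough sketch, it could look like the following (field names are taken from the sample message further below, everything else — value ranges, client wiring — is assumed):

```python
import json
import random
import time
import uuid

def generate_login_event() -> dict:
    """Build one random user-login event matching the sample schema."""
    return {
        "user_id": str(uuid.uuid4()),
        "app_version": "2.3.0",
        "ip": ".".join(str(random.randint(1, 254)) for _ in range(4)),
        "locale": random.choice(["RI", "NY", "CA", "TX"]),
        "device_id": str(uuid.uuid4()),
        "timestamp": int(time.time()),
        "device_type": random.choice(["iOS", "Android"]),
    }

# In the real producer, each event would be serialized and sent to the
# user-login topic, e.g. with kafka-python (assumed client library):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("user-login", json.dumps(generate_login_event()).encode())
```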
Check that Kafka is receiving data from my-python-producer.
$ KAFKA_CONTAINER=$(docker ps --filter "name=kafka" --format "{{.ID}}")
$ docker exec -it $KAFKA_CONTAINER kafka-console-consumer --bootstrap-server localhost:9092 --topic user-login --group my-app

If no group is associated with the consumer yet, running the above command creates the my-app group.
The consumer-1 service consumes messages from the Kafka topic user-login, processes them, and produces the processed messages to the Kafka topic processed-user-login. The consumer-2 service then consumes messages from processed-user-login and indexes them into the OpenSearch index processed-user-login-index.
The credentials for the OpenSearch and OpenSearch Dashboards are admin:admin.
The OpenSearch Dashboards can be accessed at http://localhost:5601. You can import the dashboard from the dashboard.ndjson file.
OpenSearch Dashboard -> Management -> Stack Management -> Saved Objects -> Import -> Select dashboard.ndjson file -> Import
This will create the dashboard with visualizations and the required index patterns.
OpenSearch Dashboard -> Dashboard -> Real-Time Data Processing and Analytics
Set the time range to Last 15 minutes and refresh time to Every 1 second to see the real-time data.
flowchart TD
A[my-python-producer] -->|streams login data,
produces it| B((kafka topic
user-login))
B --> C[kafka consumer-1]
C -->|consumes data from the topic,
processes it and produces it| D((kafka topic
processed-user-login))
D --> E[kafka consumer-2]
E -->|consumes data from the topic,
indexes it| F[[opensearch index
processed-user-login-index]]
F --> G[[opensearch dashboard
visualizations]]
Deploying this application in production involves several steps to ensure it runs reliably, securely, and efficiently.
- Use a container orchestration platform (Kubernetes) and Infrastructure as Code tools (Terraform) to define and provision the infrastructure. Use managed Kubernetes services like AWS EKS or Azure AKS to simplify management and scaling.
- Set up persistent volumes for services that require data persistence, such as OpenSearch.
- Use Kubernetes services to expose the application to external traffic. Implement Ingress controllers for routing external traffic to the appropriate services.
- Use monitoring tools like Prometheus and Grafana to monitor the health and performance of the application. Set up centralized logging using tools like Fluentd.
- Use Kubernetes Secrets or services like HashiCorp Vault for managing sensitive information.
To make this application production-ready, consider adding the following components:
- Integrate security scanning for container images (e.g., Trivy) within the CI/CD pipeline.
- Implement backup solutions for OpenSearch and other stateful services.
- Configure Kubernetes Horizontal Pod Autoscalers to scale the number of pods based on CPU/memory usage. Use cluster autoscalers to add or remove nodes based on overall demand.
- Manage application configuration using Kubernetes ConfigMaps and Secrets.
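In Kubernetes, ConfigMap and Secret values are typically injected as environment variables, so the services here could read their connection settings the same way. A minimal sketch (the variable names and defaults are assumptions, not what the repo uses):

```python
import os

def load_config() -> dict:
    """Read service settings from the environment, with local defaults.

    In Kubernetes, these variables would be populated from a ConfigMap
    (non-sensitive values) and a Secret (credentials).
    """
    return {
        "bootstrap_servers": os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        "opensearch_host": os.environ.get("OPENSEARCH_HOST", "localhost"),
        "opensearch_port": int(os.environ.get("OPENSEARCH_PORT", "9200")),
    }
```

Keeping configuration out of the image this way lets the same container run unchanged in local Docker Compose and in a cluster.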
Scaling the application to handle a growing dataset involves several strategies:
- Use partitioned Kafka topics to distribute the load across multiple brokers. Deploy multiple instances of consumer services to parallelize data processing.
- Use sharding in OpenSearch to distribute data across multiple nodes.
- Optimize OpenSearch indices by adjusting shard count, replication factor, and using index templates.
- Add more instances of your services to handle increased load (horizontal scaling), and increase the resources (CPU, memory) allocated to each service (vertical scaling).
- Use load balancers to distribute traffic across multiple instances of your services.
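Partitioning spreads load because Kafka hashes each message key to pick a partition: keying messages by user_id keeps every user's events in order on one partition while distributing users across partitions (and therefore across consumer instances in a group). A simplified illustration of the idea — real Kafka clients use murmur2 hashing, the toy hash below is only for demonstration:

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    """Simplified key-based partitioner (Kafka itself uses murmur2)."""
    return sum(key) % num_partitions  # stable toy hash for illustration

# Distribute ten users across three partitions.
events = [{"user_id": f"user-{i}"} for i in range(10)]
by_partition: dict[int, list[str]] = {}
for e in events:
    p = pick_partition(e["user_id"].encode(), num_partitions=3)
    by_partition.setdefault(p, []).append(e["user_id"])

# The same user_id always maps to the same partition, preserving
# per-user ordering while the overall load is spread across brokers.
```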
The producer is a Python script that generates random user login data and sends it to the Kafka topic user-login.
A sample message from my-python-producer:
{
"user_id": "a6397988-1649-4859-ae5f-a1407b638ca6",
"app_version": "2.3.0",
"ip": "108.228.145.239",
"locale": "RI",
"device_id": "6221eb3d-89b5-4585-8c98-545b78e24c53",
"timestamp": 1720576825,
"device_type": "iOS"
}

The consumer-1 service is a Python script that reads messages from the Kafka topic user-login, processes them, and sends the processed messages to the Kafka topic processed-user-login.
The consumer-2 service is a Python script that reads messages from the Kafka topic processed-user-login and indexes them into the OpenSearch index processed-user-login-index.
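The indexing step presumably uses an OpenSearch client such as opensearch-py. A sketch of shaping consumed Kafka payloads into bulk index actions — the index name comes from above, the helper function is an assumed illustration, not the repo's code:

```python
import json

INDEX_NAME = "processed-user-login-index"

def to_bulk_actions(raws: list[bytes]) -> list[dict]:
    """Turn raw Kafka payloads into bulk actions for OpenSearch.

    With opensearch-py, the resulting list could be passed to
    opensearchpy.helpers.bulk(client, actions).
    """
    return [
        {"_index": INDEX_NAME, "_source": json.loads(raw)}
        for raw in raws
    ]
```

Batching messages into bulk requests like this is generally cheaper than one client.index() call per message.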
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $ cd src/
(.venv) $ pip install -r requirements.txt
(.venv) $ python consumer-1.py &
(.venv) $ python consumer-2.py &

# Start the services
$ docker-compose up
# Get the details of the running containers
$ docker ps -a
# Consuming messages from the Kafka Topic
$ docker exec -it <kafka> kafka-console-consumer --bootstrap-server localhost:9092 --topic user-login --group my-app
# Get the group id of the Kafka Consumer
$ docker exec -it <kafka> kafka-consumer-groups --bootstrap-server localhost:9092 --list
# List the topics in the Kafka Cluster
$ docker exec -it <kafka> kafka-topics --bootstrap-server localhost:9092 --list

