Fetch Data Ops / Data Engineer Take Home Exercise - Real Time Data Processing and Analytics using Kafka, Docker, Python and OpenSearch
Clone the repository
$ git clone https://github.com/jyok117/real-time-data-processing-analytics
$ cd real-time-data-processing-analytics

Start the services
The services include Kafka, Zookeeper, OpenSearch, OpenSearch Dashboards, my-python-producer and my-consumer
$ docker-compose up --build

The my-python-producer service generates random user login data and sends it to the Kafka topic user-login.
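The actual producer logic lives in the repository; as a rough sketch, it could look like the following (field names are taken from the sample message further below, everything else — value ranges, client wiring — is assumed):

```python
import json
import random
import time
import uuid

def generate_login_event() -> dict:
    """Build one random user-login event matching the sample schema."""
    return {
        "user_id": str(uuid.uuid4()),
        "app_version": "2.3.0",
        "ip": ".".join(str(random.randint(1, 254)) for _ in range(4)),
        "locale": random.choice(["RI", "NY", "CA", "TX"]),
        "device_id": str(uuid.uuid4()),
        "timestamp": int(time.time()),
        "device_type": random.choice(["iOS", "Android"]),
    }

# In the real producer, each event would be serialized and sent to the
# user-login topic, e.g. with kafka-python (assumed client library):
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("user-login", json.dumps(generate_login_event()).encode())
```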
Check that Kafka is receiving data from my-python-producer.
$ KAFKA_CONTAINER=$(docker ps --filter "name=kafka" --format "{{.ID}}")
$ docker exec -it $KAFKA_CONTAINER kafka-console-consumer --bootstrap-server localhost:9092 --topic user-login --group my-app

If no group is associated with the consumer yet, running the above command creates the my-app group.
The consumer-1 service consumes messages from the Kafka topic user-login, processes them, and produces the processed messages to the Kafka topic processed-user-login. The consumer-2 service then consumes messages from processed-user-login and indexes them into the OpenSearch index processed-user-login-index.
The credentials for the OpenSearch and OpenSearch Dashboards are admin:admin.
The OpenSearch Dashboards can be accessed at http://localhost:5601. You can import the dashboard from the dashboard.ndjson file.
OpenSearch Dashboard -> Management -> Stack Management -> Saved Objects -> Import -> Select dashboard.ndjson file -> Import
This will create the dashboard with visualizations and the required index patterns.
OpenSearch Dashboard -> Dashboard -> Real-Time Data Processing and Analytics
Set the time range to Last 15 minutes and refresh time to Every 1 second to see the real-time data.
flowchart TD
A[my-python-producer] -->|streams login data,
produces it| B((kafka topic
user-login))
B --> C[kafka consumer-1]
C -->|consumes data from the topic,
processes it and produces it| D((kafka topic
processed-user-login))
D --> E[kafka consumer-2]
E -->|consumes data from the topic,
indexes it| F[[opensearch index
processed-user-login-index]]
F --> G[[opensearch dashboard
visualizations]]
Deploying this application in production involves several steps to ensure it runs reliably, securely, and efficiently.
- Use a container orchestration platform (Kubernetes) and Infrastructure as Code tools (Terraform) to define and provision the infrastructure. Use managed Kubernetes services like AWS EKS or Azure AKS to simplify management and scaling.
- Set up persistent volumes for services that require data persistence, such as OpenSearch.
- Use Kubernetes services to expose the application to external traffic. Implement Ingress controllers for routing external traffic to the appropriate services.
- Use monitoring tools like Prometheus and Grafana to monitor the health and performance of the application. Set up centralized logging using tools like Fluentd.
- Use Kubernetes Secrets or services like HashiCorp Vault for managing sensitive information.
To make this application production-ready, consider adding the following components:
- Integrate security scanning for container images (e.g., Trivy) within the CI/CD pipeline.
- Implement backup solutions for OpenSearch and other stateful services.
- Configure Kubernetes Horizontal Pod Autoscalers to scale the number of pods based on CPU/memory usage. Use cluster autoscalers to add or remove nodes based on overall demand.
- Manage application configuration using Kubernetes ConfigMaps and Secrets.
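In Kubernetes, ConfigMap and Secret values are typically injected as environment variables, so the services here could read their connection settings the same way. A minimal sketch (the variable names and defaults are assumptions, not what the repo uses):

```python
import os

def load_config() -> dict:
    """Read service settings from the environment, with local defaults.

    In Kubernetes, these variables would be populated from a ConfigMap
    (non-sensitive values) and a Secret (credentials).
    """
    return {
        "bootstrap_servers": os.environ.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        "opensearch_host": os.environ.get("OPENSEARCH_HOST", "localhost"),
        "opensearch_port": int(os.environ.get("OPENSEARCH_PORT", "9200")),
    }
```

Keeping configuration out of the image this way lets the same container run unchanged in local Docker Compose and in a cluster.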
Scaling the application to handle a growing dataset involves several strategies:
- Use partitioned Kafka topics to distribute the load across multiple brokers. Deploy multiple instances of consumer services to parallelize data processing.
- Use sharding in OpenSearch to distribute data across multiple nodes.
- Optimize OpenSearch indices by adjusting shard count, replication factor, and using index templates.
- Add more instances of your services to handle increased load (horizontal scaling), and increase the resources (CPU, memory) allocated to each service (vertical scaling).
- Use load balancers to distribute traffic across multiple instances of your services.
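Partitioning spreads load because Kafka hashes each message key to pick a partition: keying messages by user_id keeps every user's events in order on one partition while distributing users across partitions (and therefore across consumer instances in a group). A simplified illustration of the idea — real Kafka clients use murmur2 hashing, the toy hash below is only for demonstration:

```python
def pick_partition(key: bytes, num_partitions: int) -> int:
    """Simplified key-based partitioner (Kafka itself uses murmur2)."""
    return sum(key) % num_partitions  # stable toy hash for illustration

# Distribute ten users across three partitions.
events = [{"user_id": f"user-{i}"} for i in range(10)]
by_partition: dict[int, list[str]] = {}
for e in events:
    p = pick_partition(e["user_id"].encode(), num_partitions=3)
    by_partition.setdefault(p, []).append(e["user_id"])

# The same user_id always maps to the same partition, preserving
# per-user ordering while the overall load is spread across brokers.
```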
The producer is a Python script that generates random user login data and sends it to the Kafka topic user-login.
A sample message from my-python-producer:
{
"user_id": "a6397988-1649-4859-ae5f-a1407b638ca6",
"app_version": "2.3.0",
"ip": "108.228.145.239",
"locale": "RI",
"device_id": "6221eb3d-89b5-4585-8c98-545b78e24c53",
"timestamp": 1720576825,
"device_type": "iOS"
}

The consumer-1 service is a Python script that reads messages from the Kafka topic user-login, processes them, and sends the processed messages to the Kafka topic processed-user-login.
The consumer-2 service is a Python script that reads messages from the Kafka topic processed-user-login and indexes them into the OpenSearch index processed-user-login-index.
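The indexing step presumably uses an OpenSearch client such as opensearch-py. A sketch of shaping consumed Kafka payloads into bulk index actions — the index name comes from above, the helper function is an assumed illustration, not the repo's code:

```python
import json

INDEX_NAME = "processed-user-login-index"

def to_bulk_actions(raws: list[bytes]) -> list[dict]:
    """Turn raw Kafka payloads into bulk actions for OpenSearch.

    With opensearch-py, the resulting list could be passed to
    opensearchpy.helpers.bulk(client, actions).
    """
    return [
        {"_index": INDEX_NAME, "_source": json.loads(raw)}
        for raw in raws
    ]
```

Batching messages into bulk requests like this is generally cheaper than one client.index() call per message.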
$ python -m venv .venv
$ source .venv/bin/activate
(.venv) $ cd src/
(.venv) $ pip install -r requirements.txt
(.venv) $ python consumer-1.py &
(.venv) $ python consumer-2.py &

# Start the services
$ docker-compose up
# Get the details of the running containers
$ docker ps -a
# Consuming messages from the Kafka Topic
$ docker exec -it <kafka> kafka-console-consumer --bootstrap-server localhost:9092 --topic user-login --group my-app
# Get the group id of the Kafka Consumer
$ docker exec -it <kafka> kafka-consumer-groups --bootstrap-server localhost:9092 --list
# List the topics in the Kafka Cluster
$ docker exec -it <kafka> kafka-topics --bootstrap-server localhost:9092 --list

