# GameStatsPx
## A Real-Time Videogame Statistics Processing Platform

---

### Distributed Architecture with Machine Learning

<br>

*A scalable system for collecting, processing, and predicting game statistics using modern data engineering tools*

## Table of Contents

1. **System Overview**
2. **Architecture Components**
3. **Data Flow Pipeline**
4. **Machine Learning Layer**
5. **Storage & Visualization**
6. **Deployment & Networking**
7. **Key Features**
8. **Technical Specifications**

## 1. System Overview

### Purpose
GameStatsPx is a **distributed real-time data processing platform** designed to:

- Collect game statistics
- Process data streams in real-time
- Apply machine learning models for predictions
- Store and visualize analytics results

### Technology Stack
- **Containerization**: Docker & Docker Compose
- **Message Streaming**: Apache Kafka (KRaft mode)
- **Log Processing**: Fluent Bit
- **Machine Learning**: Apache Spark
- **Storage**: Elasticsearch
- **Visualization**: Kibana

## 2. Architecture Components

### Core Services

| Component | Container Name | Purpose | Port |
|-----------|----------------|---------|------|
| **Crawler** | gameStatsPx-crawler | Data collection | - |
| **Fluent Bit** | fluentbit | Log aggregation | 9090 |
| **Kafka** | kafka | Message broker | 9092 |
| **Kafka UI** | kafkaWebUI | Monitoring interface | 8585 |
| **Spark** | gameStatsPx-spark | ML processing | - |
| **Elasticsearch** | elasticsearch | Data storage | 9200 |
| **Kibana** | kibana | Data visualization | 5601 |

## 3. Data Flow Pipeline

### End-to-End Data Journey

![architecture](img/architecture.png)

### Pipeline Stages

#### Stage 1: Data Collection
- **Games Crawler** scrapes statistics from game sources
- Generates logs of collected data
- Sends to Fluent Bit for processing
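
One way the crawler's log output could look is a single JSON object per line, which log processors generally parse without extra rules. The sketch below is illustrative only: `format_log_line` and the field names (`title`, `platform`, `score`, `collected_at`) are assumptions, not the project's actual schema.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def format_log_line(record: Dict) -> str:
    """Serialize one scraped record as a single-line JSON log entry."""
    entry = {
        # Collection timestamp in ISO 8601 (UTC), added by the crawler.
        "collected_at": datetime.now(timezone.utc).isoformat(),
        **record,
    }
    # json.dumps escapes newlines, so each entry stays on one line.
    return json.dumps(entry, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical record; the real crawler's fields may differ.
    print(format_log_line({"title": "Example Game", "platform": "PC", "score": 87}))
```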

#### Stage 2: Log Processing
- **Fluent Bit** receives raw logs
- Filters and transforms data
- Forwards to Kafka topics

#### Stage 3: Message Streaming
- **Kafka** acts as a distributed message queue
- Topic: `gameStatsPx` (partitions: 1, replication factor: 1)
- Delivers messages in order within the topic's single partition

#### Stage 4: ML Processing
- **Apache Spark** consumes from Kafka
- Applies regression models
- Generates predictions and analytics

#### Stage 5: Storage & Indexing
- **Elasticsearch** receives processed results
- Indexes data for fast querying
- Supports full-text search and aggregations
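
Processed results are typically written to Elasticsearch in batches through the `_bulk` endpoint, which expects newline-delimited JSON: an action line followed by the document itself. The sketch below only builds that payload. The index name `gamestatspx-predictions` and the document fields are assumptions (Elasticsearch index names must be lowercase).

```python
import json
from typing import Dict, List


def build_bulk_body(index: str, docs: List[Dict]) -> str:
    """Build an NDJSON payload for Elasticsearch's _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index the next document into `index`.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical prediction documents.
    docs = [{"title": "Example Game", "predicted_score": 81.5}]
    print(build_bulk_body("gamestatspx-predictions", docs), end="")
```

The payload would be sent as a `POST` to `/_bulk` with the `Content-Type: application/x-ndjson` header.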

#### Stage 6: Visualization
- **Kibana** queries Elasticsearch
- Creates dashboards and visualizations
- Provides real-time monitoring

## 4. Machine Learning Layer

### Apache Spark ML Pipeline

#### Regression Models
The system implements multiple regression techniques:

1. **Linear Regression**
   - For linear relationships in game statistics
   - Fast training and prediction
   - Interpretable coefficients

2. **Non-Linear Regression**
   - Captures complex patterns
   - Better accuracy for non-linear relationships
   - Examples: decision trees, random forests (logistic regression applies when the target is a class label rather than a number)
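
To illustrate why linear regression coefficients are easy to interpret, here is a minimal pure-Python least-squares fit for a single feature. It is a teaching sketch, not the project's Spark code, and the example data (critic score predicting user score) is invented.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and co-variation of x with y, around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented data: critic scores (x) and user scores (y).
    slope, intercept = fit_line([60, 70, 80, 90], [55, 68, 77, 88])
    # The slope reads directly as "user-score points gained per critic-score point".
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```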



#### NLP and Sentiment Analysis
Upcoming videogames will be enriched with community scores derived from sentiment analysis.
Text scraped from public subreddits such as `r/Games` and from the Metacritic website provides information about community sentiment, which is then used as an input to prediction.
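
The idea of turning comments into a community score can be sketched with a toy word-list approach. The source does not say which sentiment method the project uses, so everything here (the word lists, `sentiment_score`, `community_score`) is a hypothetical stand-in for a real NLP model.

```python
import re
from typing import List

# Tiny illustrative lexicons; a real system would use a trained model.
POSITIVE = {"great", "fun", "amazing", "love", "polished"}
NEGATIVE = {"buggy", "boring", "broken", "hate", "disappointing"}


def sentiment_score(text: str) -> float:
    """Score one comment in [-1, 1] from counts of positive/negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)


def community_score(comments: List[str]) -> float:
    """Average the per-comment scores; 0.0 when there are no comments."""
    if not comments:
        return 0.0
    return sum(sentiment_score(c) for c in comments) / len(comments)
```

The resulting score could then be attached to a game record as one more numeric feature for the regression models.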

### Spark Configuration

#### Container Details
- **Name**: `gameStatsPx-spark`
- **Base Image**: Custom Spark build
- **Dependencies**: Elasticsearch connector

#### Processing Flow
1. Read streaming data from Kafka
2. Apply feature engineering
3. Train/update ML models
4. Generate predictions
5. Write results to Elasticsearch
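
The flow above can be sketched for a single message in plain Python. This stands in for the Spark job rather than reproducing it: the model coefficients, the field names (`critic_score`, `hype`), and the helper functions are all assumptions, and the training step is omitted.

```python
import json
from typing import Dict

# Hypothetical coefficients standing in for a trained regression model.
MODEL = {"intercept": 10.0, "critic_score": 0.8, "hype": 2.0}


def parse_message(raw: bytes) -> Dict:
    """Step 1: decode one Kafka message value (assumed to be UTF-8 JSON)."""
    return json.loads(raw.decode("utf-8"))


def build_features(record: Dict) -> Dict[str, float]:
    """Step 2: select numeric features, defaulting missing hype to 0."""
    return {
        "critic_score": float(record["critic_score"]),
        "hype": float(record.get("hype", 0.0)),
    }


def predict(features: Dict[str, float]) -> float:
    """Step 4: apply the linear model to the feature values."""
    return MODEL["intercept"] + sum(MODEL[name] * value for name, value in features.items())


def process(raw: bytes) -> Dict:
    """Steps 1, 2 and 4, returning a document ready to index (step 5)."""
    record = parse_message(raw)
    return {**record, "predicted_score": predict(build_features(record))}
```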

## 5. Storage & Visualization

### Elasticsearch Configuration

#### Core Settings
- **Version**: 9.0.4
- **Mode**: Single-node cluster
- **Memory**: 512MB heap (Xms/Xmx)
- **Port**: 9200

#### Security
- X-Pack security enabled
- Default user: `elastic` (password: `changeme`)
- Kibana user: `kibana_system_user` (password: `kibanapass123`)
- Role-based access control
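
With security enabled, clients must authenticate on every request. A minimal sketch using HTTP Basic authentication and only the standard library follows. `es_request` is a hypothetical helper, and the base URL is assumed to match the one listed under Access URLs.

```python
import base64
import urllib.request


def es_request(path: str, user: str, password: str,
               base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build a request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url + path)
    request.add_header("Authorization", f"Basic {token}")
    return request


if __name__ == "__main__":
    # Credentials as documented above; sending requires a running cluster.
    req = es_request("/_cluster/health", "elastic", "changeme")
    with urllib.request.urlopen(req, timeout=5) as response:
        print(response.read().decode("utf-8"))
```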

### Kibana Dashboard

#### Features
- **Version**: 9.0.4 (matching Elasticsearch)
- **Port**: 5601
- **License**: Basic (self-generated)

### Python Initialization Script

#### Purpose
- Seeds initial data into Elasticsearch
- Creates index templates and mappings
- Sets up baseline datasets

#### Communication
- **Direct HTTP**: Connects to Elasticsearch on port 9200
- **Independent**: Runs separately from Docker containers
- **One-time**: Executes during system setup
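
As a rough idea of what creating an index with a mapping involves, the sketch below builds a `PUT` request against Elasticsearch. It is not the project's actual script: the index name `gamestatspx-games` and every field in `MAPPING` are assumptions, and authentication is left out for brevity.

```python
import json
import urllib.request
from typing import Dict

# Hypothetical index name and fields; the real script may differ.
INDEX = "gamestatspx-games"
MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},         # full-text searchable
            "platform": {"type": "keyword"},   # exact-match filtering and aggregations
            "release_date": {"type": "date"},
            "score": {"type": "float"},
        }
    }
}


def create_index_request(index: str, mapping: Dict,
                         base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the PUT request that creates `index` with the given mapping."""
    return urllib.request.Request(
        f"{base_url}/{index}",
        data=json.dumps(mapping).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

Because X-Pack security is enabled, a real request would also need credentials, for example an `Authorization` header.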

![screen3](./img/screen3.png)

## 6. Deployment & Networking

### Docker Compose Architecture

#### Network Configuration
- **Network Name**: `kafka-network`
- **Driver**: Bridge
- **Isolation**: All services communicate within isolated network
- **DNS**: Automatic service discovery by container name

#### Service Dependencies
```
crawler ──> depends_on ──> fluentbit
fluentbit ──> depends_on ──> kafka
spark ──> depends_on ──> elasticsearch
kibana ──> depends_on ──> elasticsearch
kafka-ui ──> depends_on ──> kafka
```
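
The dependency graph above implies a valid startup order, which can be derived with a depth-first topological sort. The sketch below copies the service names from the diagram; `startup_order` is a hypothetical helper. Note that `depends_on` in Docker Compose controls start order only; by default it does not wait for a service to be ready.

```python
from typing import Dict, List

# Service -> services it depends on, as listed in the diagram above.
DEPENDS_ON: Dict[str, List[str]] = {
    "crawler": ["fluentbit"],
    "fluentbit": ["kafka"],
    "spark": ["elasticsearch"],
    "kibana": ["elasticsearch"],
    "kafka-ui": ["kafka"],
}


def startup_order(depends_on: Dict[str, List[str]]) -> List[str]:
    """Return services ordered so that every dependency comes first."""
    order: List[str] = []
    done = set()
    visiting = set()

    def visit(service: str) -> None:
        if service in done:
            return
        if service in visiting:
            raise ValueError(f"dependency cycle involving {service!r}")
        visiting.add(service)
        for dependency in depends_on.get(service, []):
            visit(dependency)
        visiting.discard(service)
        done.add(service)
        order.append(service)  # appended only after all its dependencies

    for service in depends_on:
        visit(service)
    return order


if __name__ == "__main__":
    print(startup_order(DEPENDS_ON))
```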

### Kafka Configuration (KRaft Mode)

#### Why KRaft?
- **No Zookeeper**: Simplified architecture
- **Better Performance**: Lower latency for metadata operations and controller failover
- **Easier Management**: Fewer moving parts

#### Listener Configuration
| Listener | Port | Purpose |
|----------|------|----------|
| **PLAINTEXT** | 39092 | Internal container communication |
| **HOST** | 9092 | External host access |
| **CONTROLLER** | 29093 | KRaft controller protocol |
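
Which listener a client should use depends on where it runs. The sketch below picks a bootstrap address from the table above. It assumes the broker advertises `kafka:39092` inside the Docker network and `localhost:9092` to the host; `bootstrap_server` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Listener name -> (host, port), based on the listener table above.
# The hostnames are assumptions about the advertised listeners.
LISTENERS: Dict[str, Tuple[str, int]] = {
    "PLAINTEXT": ("kafka", 39092),   # container-to-container traffic
    "HOST": ("localhost", 9092),     # clients running on the Docker host
}


def bootstrap_server(inside_network: bool) -> str:
    """Return the host:port a Kafka client should bootstrap from."""
    host, port = LISTENERS["PLAINTEXT" if inside_network else "HOST"]
    return f"{host}:{port}"


if __name__ == "__main__":
    print("Spark container:", bootstrap_server(inside_network=True))
    print("Host tooling:   ", bootstrap_server(inside_network=False))
```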

#### KRaft Settings
- **Mode**: Enabled (`KAFKA_KRAFT_MODE: true`)
- **Roles**: Combined broker + controller
- **Node ID**: 1
- **Cluster ID**: `MkU3OEVBNTcwNTJENDM2Qk`

### Port Mapping Summary

#### Exposed Ports
| Service | Internal Port | External Port | Access |
|---------|---------------|---------------|--------|
| **Fluent Bit** | 9090 | 9090 | Log forwarding |
| **Kafka** | 39092 | 9092 | Message broker |
| **Kafka UI** | 8080 | 8585 | Web interface |
| **Elasticsearch** | 9200 | 9200 | REST API |
| **Kibana** | 5601 | 5601 | Dashboard |

#### Access URLs
- Kafka UI: `http://localhost:8585`
- Elasticsearch: `http://localhost:9200`
- Kibana: `http://localhost:5601`
- Fluent Bit: `http://localhost:9090`

## 7. Key Features

### Scalability
- **Horizontal scaling**: Add more Kafka partitions
- **Spark cluster**: Can be expanded to distributed mode
- **Elasticsearch**: Supports cluster expansion
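
Adding partitions spreads keyed messages across them while keeping messages with the same key together, so ordering holds per key rather than across the whole topic. The sketch below shows the idea with a simplified hash. Kafka's default partitioner uses murmur2, so `zlib.crc32` here is only a stand-in, and the game-title keys are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition (simplified: crc32, not murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # With one partition (the current setup) every key lands on partition 0.
    for title in ["Example Game", "Another Game", "Third Game"]:
        print(title, "->", partition_for(title, 1), "of 1,",
              partition_for(title, 4), "of 4")
```

Changing the partition count changes the key-to-partition mapping for new messages.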

### Security
- **Authentication**: Elasticsearch user management
- **Network isolation**: Docker bridge network
- **X-Pack security**: Role-based access control

### Real-Time Processing
- **Stream processing**: Kafka + Spark Streaming
- **Low latency**: Optimized pipeline
- **Live dashboards**: Kibana real-time updates

### Machine Learning
- **Multiple models**: Linear & non-linear regression
- **Real-time predictions**: Streaming ML inference

### Observability
- **Kafka UI**: Monitor topics and consumer groups
- **Kibana**: Visualize data patterns
- **Fluent Bit metrics**: Log processing stats

### Data Persistence
- **Volume mapping**: Elasticsearch data persistence
- **Fluent Bit config**: External configuration

## 8. Technical Specifications

### Prerequisites

- Docker Engine 20.10+
- Docker Compose 1.29+
- Python 3.8+ (for init script)
- Git (for source control)

### Version Matrix

| Component | Version | Image |
|-----------|---------|-------|
| **Fluent Bit** | 1.8 | fluent/fluent-bit:1.8 |
| **Kafka** | Latest | confluentinc/cp-kafka:latest |
| **Kafka UI** | Latest | provectuslabs/kafka-ui:latest |
| **Elasticsearch** | 9.0.4 | docker.elastic.co/elasticsearch/elasticsearch:9.0.4-amd64 |
| **Kibana** | 9.0.4 | docker.elastic.co/kibana/kibana:9.0.4-amd64 |
| **Spark** | Custom | Custom build |
| **Crawler** | Custom | Custom build |

### Resource Allocation

#### Memory Limits
- **Elasticsearch**: 512MB heap
- **Kafka**: Default JVM settings
- **Spark**: Configurable (default cluster mode)

#### Storage Volumes
```
./esdata ──> Elasticsearch data
./fluent-bit/fluent-bit.conf ──> Fluent Bit config
```

#### Network Bandwidth
- Kafka internal: High throughput for streaming
- Elasticsearch: Bulk indexing operations
- Crawler: Depends on data source API limits

## Getting Started

### Quick Start Guide

#### 1. Clone Repository
```bash
git clone <repository-url>
cd gameStatsPx
```

#### 2. Start Services
```bash
docker-compose up -d
```

#### 3. Initialize Data
```bash
python dataset_gen.py
```

#### 4. Verify Services
- Kafka UI: http://localhost:8585
- Kibana: http://localhost:5601
- Elasticsearch: http://localhost:9200
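
The same check can be scripted. The sketch below probes each URL and reports whether anything answered. `http_status` and `check_services` are hypothetical helpers. Any HTTP response counts as "up", since Elasticsearch may answer `401` to unauthenticated requests when security is enabled.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

SERVICES = {
    "Kafka UI": "http://localhost:8585",
    "Kibana": "http://localhost:5601",
    "Elasticsearch": "http://localhost:9200",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, just not with a 2xx
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def check_services(services: Dict[str, str],
                   probe: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True if its URL produced any HTTP response."""
    return {name: probe(url) != 0 for name, url in services.items()}


if __name__ == "__main__":
    for name, up in check_services(SERVICES).items():
        print(f"{name}: {'up' if up else 'down'}")
```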

### Monitoring & Troubleshooting

#### Check Service Status
```bash
docker-compose ps
```

#### View Logs
```bash
# All services
docker-compose logs -f

# Specific service
docker-compose logs -f spark
```

#### Restart Services
```bash
docker-compose restart <service-name>
```

## Architecture Highlights

### Design Principles

**Microservices**: Each component is independently deployable

**Event-Driven**: Kafka enables asynchronous communication

**Scalable**: Horizontal scaling at multiple layers

**Observable**: Built-in monitoring and logging

**Resilient**: Fault-tolerant message queue

**Modern Stack**: Industry-standard tools and practices

### Key Advantages

1. **Decoupled Architecture**: Services can be updated independently
2. **Real-Time Analytics**: Immediate insights from streaming data
3. **ML Integration**: Seamless prediction pipeline
4. **Easy Deployment**: Single docker-compose command
5. **Production-Ready**: Security, monitoring, and persistence

## Future Enhancements

### Potential Improvements

#### Technical Upgrades
- Multi-node Elasticsearch cluster
- Distributed Spark cluster
- Redis caching layer
- Kubernetes deployment

#### Feature Additions
- Advanced ML models (deep learning)
- Real-time alerting system
- API gateway (Kong/Nginx)
- Grafana for metrics

## Conclusion

### GameStatsPx: A Modern Data Platform

#### What We Built
- **Distributed system** with 7+ microservices
- **Real-time pipeline** from collection to visualization
- **Machine learning** integration for predictions
- **Production-grade** architecture with monitoring

#### Technologies Mastered
- Docker & Docker Compose
- Apache Kafka (KRaft)
- Apache Spark ML
- Elasticsearch & Kibana
- Fluent Bit logging

## Thank You!
*Questions?*


### Meme
*Me at the end of this project.*

<img src="img/meme.png" width="500">