# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Procesamiento de Datos Masivos** </center>
---
### <center> **Spring 2025** </center>
---
## <center> **Final Project (Website Activity) Documentation** </center>


**Team**:
- Luis Raul Acosta Mendoza 
- Samantha Abigail Quintero 
- Arturo Benajamin Vergara Romo
    
**Professor**: Dr. Pablo Camarillo Ramirez

## Introduction and Problem Definition 

This project aims to analyze real-time data from an e-commerce web application. Using Apache Kafka, the user session data will be consumed in real time, then processed and stored in Parquet files by Apache Spark. The resulting dataset will be used to train a Machine Learning model that can predict user behavior within the application.

The data captured from the application includes real-time information about each user session: the pages visited, the products viewed, the clicks performed, and other interactions with page elements such as scrolling, zooming, and hovering. The goal of this project is to leverage this information to predict whether a user is likely to make a purchase in future sessions.

## System Architecture
The architecture of this project consists of three real-time producers, each writing data to a different Kafka topic. Each producer generates a distinct type of data: the first focuses solely on the page and product the user is viewing, the second captures all click events within the session, and the third generates data related to other user interactions.
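As an illustration of what one producer emits, the following minimal sketch builds a page-view event and serializes it to the JSON bytes that would be sent as a Kafka message value. It uses only the standard library and does not connect to a broker; the topic names in `TOPICS`, the helper names, and the sample values are assumptions made for this example, not the project's actual identifiers.

```python
import json
from datetime import datetime, timezone

# Hypothetical topic names: one topic per producer, as described above.
TOPICS = {
    "page_view": "page_views",
    "click": "click_events",
    "interaction": "user_interactions",
}


def build_page_view(user_id, session_id, page_url, referrer_url, category, price):
    """Build one page-view event as a plain dict."""
    return {
        "user_id": user_id,
        "session_id": session_id,
        "page_url": page_url,
        "referrer_url": referrer_url,
        "category": category,
        "price": float(price),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def serialize(event):
    """Encode an event as UTF-8 JSON bytes, a common form for a Kafka message value."""
    return json.dumps(event).encode("utf-8")


# Made-up sample values for illustration only.
event = build_page_view("u-1", "s-1", "/products/42", "/home", "shoes", 59.9)
payload = serialize(event)
print(TOPICS["page_view"], payload)
```

Keeping serialization in one small function means all three producers can share it and differ only in the event builder and the target topic.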

For data consumption, three Kafka consumers are used, each reading from its corresponding topic. The data is processed in a PySpark session with Structured Streaming. First, the JSON payload is parsed and transformed into columns; then another streaming pipeline writes the data into Parquet files, effectively creating a small data lake.

To enable machine learning, a separate Spark session reads the Parquet files. The data is grouped by session, computing aggregations such as the total number of clicks and other user interactions. These aggregated features are used to train a model. A DecisionTreeClassifier from PySpark’s MLlib was chosen, using the btn_buy variable as the label, which indicates whether the user clicked the “buy” button during that session.
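The session-level aggregation can be sketched in plain Python. This is a simplified stand-in for the Spark group-by, not the project's code: the feature names `total_clicks` and `total_interactions`, the sample events, and the assumption that the buy button appears as `element_id == "btn_buy"` are all illustrative.

```python
from collections import defaultdict

# Made-up sample events; field names follow the schemas in this document.
click_events = [
    {"session_id": "s1", "element_id": "btn_add_cart"},
    {"session_id": "s1", "element_id": "btn_buy"},
    {"session_id": "s2", "element_id": "nav_home"},
]
interactions = [
    {"session_id": "s1", "interaction_type": "scroll"},
    {"session_id": "s2", "interaction_type": "hover"},
    {"session_id": "s2", "interaction_type": "zoom"},
]


def aggregate_sessions(click_events, interactions):
    """Return one feature row per session_id, including the btn_buy label."""
    features = defaultdict(
        lambda: {"total_clicks": 0, "total_interactions": 0, "btn_buy": 0}
    )
    for event in click_events:
        row = features[event["session_id"]]
        row["total_clicks"] += 1
        # Assumption: a click on the buy button is recorded with this element id.
        if event["element_id"] == "btn_buy":
            row["btn_buy"] = 1
    for event in interactions:
        features[event["session_id"]]["total_interactions"] += 1
    return dict(features)


session_features = aggregate_sessions(click_events, interactions)
print(session_features)
```

Each resulting row holds the numeric features for one session plus the binary `btn_buy` label that the classifier is trained to predict.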

Finally, the prediction results were visualized in a Power BI dashboard to make the insights more accessible and easier to interpret.

# <center> <img src="img/Arqui.png" alt="System architecture diagram" width="480" height="380"> </center>






## 5V Justification

This system aligns with the five fundamental characteristics of big data: **Volume, Velocity, Variety, Veracity, and Value**.

---

### **Volume**  
The system is built to handle a high volume of user interaction data generated in real time by the e-commerce application. Each Kafka producer streams multiple JSON records per second, including page views, click events, and interaction data.

The following table summarizes estimated data growth:

| Time Period | Processed Data |
|-------------|----------------|
| 1 Second    | 2 KB           |
| 1 Minute    | 124 KB         |
| 1 Hour      | 7 MB           |
| 1 Day       | 179 MB         |
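The table values can be reproduced with simple arithmetic. The sketch below assumes that the per-minute figure (124 KB) is the measured baseline, that the other rows are extrapolated from it, and that units are decimal (1 MB = 1000 KB); these are assumptions about how the table was built, not something the measurements confirm.

```python
# Assumption: 124 KB per minute is the measured baseline (decimal units).
KB_PER_MINUTE = 124


def estimate_volume(kb_per_minute):
    """Extrapolate a per-minute rate to per-second, per-hour, and per-day volumes."""
    return {
        "second_kb": kb_per_minute / 60,
        "hour_mb": kb_per_minute * 60 / 1000,
        "day_mb": kb_per_minute * 60 * 24 / 1000,
    }


estimate = estimate_volume(KB_PER_MINUTE)
for period, value in estimate.items():
    print(f"{period}: {value:.2f}")
```

Rounded to whole units, the results match the 2 KB, 7 MB, and 179 MB rows of the table.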

---

### **Velocity**  
Data is ingested and processed in real time using **Apache Kafka** and **Spark Structured Streaming**. Kafka ensures fast, durable message delivery, while Spark enables on-the-fly processing and storage in Parquet format. This architecture supports continuous, low-latency data streaming.

Using PySpark’s `StreamingQueryListener`, we observed an average processing rate of **11.8 rows/second**, which indicates that the pipeline processes events as they arrive.
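An average rate like the one reported can be derived from the progress reports that Spark emits for each micro-batch, which include the fields `numInputRows` and `processedRowsPerSecond`. The following sketch averages that rate over plain dictionaries standing in for progress reports; the helper name, the choice to skip empty batches, and the sample numbers are assumptions for illustration and are not the measured values.

```python
def average_processing_rate(progress_events):
    """Average processedRowsPerSecond over micro-batches that contained rows."""
    rates = [
        p["processedRowsPerSecond"]
        for p in progress_events
        if p.get("numInputRows", 0) > 0  # skip empty batches (assumed policy)
    ]
    return sum(rates) / len(rates) if rates else 0.0


# Made-up progress reports, shaped like the dictionaries Spark reports per batch.
sample_progress = [
    {"numInputRows": 20, "processedRowsPerSecond": 10.0},
    {"numInputRows": 0, "processedRowsPerSecond": 0.0},
    {"numInputRows": 28, "processedRowsPerSecond": 14.0},
]

average_rate = average_processing_rate(sample_progress)
print(f"average rate: {average_rate:.1f} rows/second")
```

Excluding empty batches keeps idle periods from pulling the average down; including them would instead measure overall throughput, so the choice should match what the figure is meant to show.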

---

### **Variety**  
The system handles multiple data types, each with its own schema, allowing for diverse analytical use cases:

1. **Page Views**
2. **User Interactions**
3. **Click Events**

#### **Page Views Schema**
- `user_id` (string)  
- `session_id` (string)  
- `page_url` (string)  
- `referrer_url` (string)  
- `category` (string)  
- `price` (float)  
- `timestamp` (string)

#### **User Interactions Schema**
- `user_id` (string)  
- `session_id` (string)  
- `interaction_type` (string)  
- `page_url` (string)  
- `category` (string)  
- `price` (float)  
- `details` (string)  
- `timestamp` (string)

#### **Click Events Schema**
- `user_id` (string)  
- `session_id` (string)  
- `element_id` (string)  
- `page_url` (string)  
- `category` (string)  
- `price` (float)  
- `timestamp` (string)  
- `x_coord` (float)  
- `y_coord` (float)
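To make the schemas concrete, the following sketch parses one click-event JSON message into a typed record. It is a plain-Python stand-in for the typed parsing that Spark performs, with the `ClickEvent` class, the `parse_click_event` helper, and the sample message all introduced here as assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEvent:
    """Typed record mirroring the Click Events schema listed above."""
    user_id: str
    session_id: str
    element_id: str
    page_url: str
    category: str
    price: float
    timestamp: str
    x_coord: float
    y_coord: float


def parse_click_event(raw):
    """Parse a JSON string into a ClickEvent, casting numeric fields to float."""
    data = json.loads(raw)
    return ClickEvent(
        user_id=str(data["user_id"]),
        session_id=str(data["session_id"]),
        element_id=str(data["element_id"]),
        page_url=str(data["page_url"]),
        category=str(data["category"]),
        price=float(data["price"]),
        timestamp=str(data["timestamp"]),
        x_coord=float(data["x_coord"]),
        y_coord=float(data["y_coord"]),
    )


# Made-up sample message for illustration only.
raw_message = json.dumps({
    "user_id": "u-1", "session_id": "s-1", "element_id": "btn_buy",
    "page_url": "/products/42", "category": "shoes", "price": 19.99,
    "timestamp": "2025-05-01T12:00:00Z", "x_coord": 120, "y_coord": 340,
})
click = parse_click_event(raw_message)
print(click)
```

A missing field raises `KeyError` here, which makes schema violations visible at parse time instead of letting them reach storage.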

---

### **Veracity**  
To ensure data reliability:

- **Kafka** provides durable message storage and preserves message order within each partition.
- **Spark** enforces schema validation during JSON parsing.
- Malformed or incomplete records are filtered out during transformation.
- Writing to **Parquet**, a strongly typed format, enforces structural integrity for downstream processing.
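The filtering of malformed or incomplete records can be sketched as follows. This plain-Python example checks each message against the page-view fields and drops anything that is not valid JSON or is missing a correctly typed field; the function names and sample messages are assumptions, and the project's Spark transformation may apply different rules.

```python
import json

# Expected fields and Python types for a page-view record (from the schema above).
PAGE_VIEW_FIELDS = {
    "user_id": str,
    "session_id": str,
    "page_url": str,
    "referrer_url": str,
    "category": str,
    "price": (int, float),
    "timestamp": str,
}


def parse_or_none(raw, schema=PAGE_VIEW_FIELDS):
    """Return the parsed record, or None if it is malformed or incomplete."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    for name, expected_type in schema.items():
        # A missing field yields None, which fails the type check.
        if not isinstance(record.get(name), expected_type):
            return None
    return record


def keep_valid(messages):
    """Keep only the messages that parse and match the schema."""
    return [r for r in map(parse_or_none, messages) if r is not None]


# Made-up messages: one valid, one broken JSON, one missing most fields.
raw_messages = [
    '{"user_id": "u-1", "session_id": "s-1", "page_url": "/p/1", '
    '"referrer_url": "/home", "category": "shoes", "price": 59.9, '
    '"timestamp": "2025-05-01T12:00:00Z"}',
    '{not valid json',
    '{"user_id": "u-2", "session_id": "s-2"}',
]
valid_records = keep_valid(raw_messages)
print(len(valid_records), "of", len(raw_messages), "records kept")
```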

---

### **Value**  
The processed data is used to **train machine learning models** that predict user purchase behavior. Results are visualized using **Power BI**, enabling:

- Data-driven marketing strategies  
- Personalized user experience improvements  
- Business decisions based on predictive insights  

This demonstrates how raw, high-volume data can be transformed into **tangible business value**.

---


## Implementation Details

### Technologies Used
- **Apache Kafka**: Kafka was used as the real-time messaging system to decouple the data producers from consumers. It provides high throughput, scalability, and durability, making it ideal for real-time event streaming.

- **PySpark (Apache Spark with Python API)**: PySpark was used for both real-time and batch data processing. Spark Structured Streaming handled streaming data from Kafka, while batch jobs were used for feature engineering and model training.

- **Parquet**: Parquet, a columnar storage format, was chosen for its performance benefits in analytical workloads. It supports efficient compression and encoding, which reduces storage size and speeds up query performance.

- **Power BI**: Power BI was used for creating interactive dashboards to visualize the predictions and insights generated by the machine learning model.

### Design Choices

- Three separate Kafka producers and topics were used to isolate the types of data: page views, click events, and other user interactions. This separation improves modularity and scalability and makes issues easier to trace.

- Storing data in Parquet format within a data lake enabled historical data analysis and reuse for training ML models without re-ingesting data.

- Data was aggregated at the session level to capture meaningful behavioral patterns. This allowed the machine learning model to learn from combined features such as total clicks, number of interactions, and product views per session.

- A DecisionTreeClassifier was chosen for its interpretability and fast training time, which make it suitable for iterative development and evaluation.


## Results and Evaluation

After running the machine learning model, the following results were obtained:
                                                                                
- Accuracy: 0.9130
- Precision: 0.9226
- Recall: 0.9130
- F1 Score: 0.9080
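Accuracy and recall are practically identical in the results above. That is what support-weighted averaging produces: weighting each class's recall by its share of the samples always yields the overall accuracy. The sketch below computes weighted metrics in plain Python on made-up labels to show this; that the reported figures are weighted averages (such as the `weightedPrecision`, `weightedRecall`, and `f1` metrics of PySpark's `MulticlassClassificationEvaluator`) is an assumption, not something the results confirm.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Compute accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total

    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / total  # class share of the true labels
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = session with a purchase, 0 = session without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

On this toy data the weighted recall equals the accuracy while the weighted precision is higher, the same pattern as in the reported results.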

### Power BI

# <center> <img src="img/PowerBi_report.png" alt="report"> </center>

## Conclusion

This project successfully implemented a real-time data processing pipeline for an e-commerce web application using **Apache Kafka**, **PySpark**, and **Parquet**. User session data—including page views, click events, and interactions—was collected and processed to build a rich dataset for training a **DecisionTreeClassifier** that predicts the likelihood of a user making a purchase.

The pipeline combines **real-time streaming** with **batch processing**, enabling both instant feedback and long-term analytics. Aggregating data at the **session level** gave the model richer behavioral context than individual events would provide.

Insights were visualized using **Power BI**, making the results accessible and actionable for end users and stakeholders.

### Key Learnings:
- The integration of **streaming technologies** with machine learning enables real-time, scalable analytics.
- **Session-based feature engineering** enhances model accuracy by capturing user behavior over time.
- **Interpretable models** like decision trees are effective for early-stage predictive tasks, providing transparency and ease of communication with non-technical teams.

This project demonstrates the practical value of unifying data engineering and data science workflows to drive informed business decisions in real time.
