# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Procesamiento de Datos Masivos** </center>
---
### <center> **Primavera 2025** </center>
---
## <center> **Final Project** </center>


**Team**:
- Luis Raul Acosta Mendoza 
- Samantha Abigail Quintero 
- Arturo Benajamin Vergara Romo
    
**Professor**: Dr. Pablo Camarillo Ramirez

## Introduction and Problem Definition 

This project aims to analyze real-time data from an e-commerce web application. Using Apache Kafka, user session data will be consumed in real time, then processed and stored in Parquet files by Apache Spark. The resulting dataset will be used to train a Machine Learning model that can predict user behavior within the application.

The data captured from the application includes real-time information about each user session, such as the pages visited, the products viewed, the clicks performed, and other interactions with page elements (scroll, zoom, and hover). The goal of this project is to leverage this information to predict whether a user is likely to make a purchase in future sessions.

## System Architecture
The architecture of this project consists of three real-time producers, each writing data to a different Kafka topic. Each producer generates a distinct type of data: the first focuses solely on the page and product the user is viewing, the second captures all click events within the session, and the third generates data related to other user interactions.
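As a rough illustration of what one of these producers might emit, the sketch below builds a synthetic click event and serializes it to JSON. The field names follow the click-event schema described in this report; the sample values, the `element_id` choices, and the `click_events` topic name are assumptions made for the example. The transport is passed in as a `send` callable so the sketch does not depend on a specific Kafka client.

```python
import json
import random
import time
from typing import Callable


def build_click_event(user_id: str, session_id: str) -> dict:
    """Build one synthetic click event (values are illustrative, not real data)."""
    return {
        "user_id": user_id,
        "session_id": session_id,
        # Hypothetical element ids; the report only mentions a "buy" button.
        "element_id": random.choice(["btn_buy", "btn_add_cart", "nav_home"]),
        "page_url": "/products/123",
        "category": "electronics",
        "price": 199.99,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime()),
        "x_coord": round(random.uniform(0, 1920), 1),
        "y_coord": round(random.uniform(0, 1080), 1),
    }


def publish(send: Callable[[str, bytes], None], topic: str, event: dict) -> bytes:
    """Serialize the event as UTF-8 JSON and hand it to the transport."""
    payload = json.dumps(event).encode("utf-8")
    send(topic, payload)
    return payload
```

With the `kafka-python` client, for example, `send` could be `lambda topic, value: producer.send(topic, value=value)` for a `KafkaProducer` instance named `producer`; the other two producers would differ only in the record they build and the topic they target.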

For data consumption, three Kafka consumers are used, each reading from its respective topic. The data is processed using a PySpark session with structured streaming. Initially, the JSON data is parsed and transformed into columns, and then another streaming pipeline writes the data into Parquet files, effectively creating a small data lake.
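The consumption step could look roughly like the sketch below, shown for the page-view topic only. The broker address, paths, and function names are assumptions, and the project's actual code may be organized differently. The schema is written as a DDL string, which `from_json` accepts in place of a `StructType`.

```python
# Spark is imported inside the function so the schema constant can be
# inspected even where PySpark is not installed.
PAGE_VIEW_DDL = (
    "user_id STRING, session_id STRING, page_url STRING, referrer_url STRING, "
    "category STRING, price FLOAT, timestamp STRING"
)


def read_topic(spark, topic, schema_ddl, bootstrap_servers="localhost:9092"):
    """Return a streaming DataFrame with one column per JSON field."""
    from pyspark.sql import functions as F

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)  # assumed address
        .option("subscribe", topic)
        .load()
    )
    # Kafka delivers the payload as bytes in the `value` column.
    parsed = raw.select(
        F.from_json(F.col("value").cast("string"), schema_ddl).alias("data")
    )
    return parsed.select("data.*")


def write_parquet(stream_df, path, checkpoint_path):
    """Append every micro-batch to a Parquet directory and return the query."""
    return (
        stream_df.writeStream.format("parquet")
        .option("path", path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

Running this requires the Spark Kafka connector package (`spark-sql-kafka-0-10`) on the session's classpath. A typical call would be `write_parquet(read_topic(spark, "page_views", PAGE_VIEW_DDL), "lake/page_views", "checkpoints/page_views")`, where the topic name and paths are placeholders.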

To enable machine learning, a separate Spark session reads the Parquet files. The data is grouped by session, computing aggregations such as the total number of clicks and other user interactions. These aggregated features are used to train a model. A DecisionTreeClassifier from PySpark’s MLlib was chosen, using the btn_buy variable as the label, which indicates whether the user clicked the “buy” button during that session.
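A minimal sketch of this batch step is shown below. It assumes that `btn_buy` is derived from a click on an element whose `element_id` is `"btn_buy"`, and it uses only two aggregated features; both are assumptions for illustration, since the report does not list the exact feature set. The 80/20 split and the seed are likewise arbitrary.

```python
FEATURE_COLUMNS = ["total_clicks", "total_interactions"]
LABEL_COLUMN = "btn_buy"


def session_features(clicks_df, interactions_df):
    """Aggregate the Parquet data to one row per session."""
    from pyspark.sql import functions as F

    clicks = clicks_df.groupBy("session_id").agg(
        F.count("*").alias("total_clicks"),
        # Assumed label rule: 1.0 if any click in the session hit "btn_buy".
        F.max((F.col("element_id") == "btn_buy").cast("double")).alias(LABEL_COLUMN),
    )
    interactions = interactions_df.groupBy("session_id").agg(
        F.count("*").alias("total_interactions")
    )
    # Sessions without other interactions get a count of 0 instead of null.
    return clicks.join(interactions, "session_id", "left").fillna(
        0, subset=["total_interactions"]
    )


def train_tree(features_df, seed=42):
    """Fit a decision tree on 80% of the sessions; return model and test predictions."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.feature import VectorAssembler

    train, test = features_df.randomSplit([0.8, 0.2], seed=seed)
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    tree = DecisionTreeClassifier(featuresCol="features", labelCol=LABEL_COLUMN)
    model = Pipeline(stages=[assembler, tree]).fit(train)
    return model, model.transform(test)
```

In this simplified version the buy click itself is counted in `total_clicks`, which leaks part of the label into the features; a production pipeline would likely exclude it.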

Finally, the prediction results were visualized in a Power BI dashboard to make the insights more accessible and easier to interpret.






## 5V Justification

- Volume:
The system is designed to handle a high volume of user interaction data generated by the e-commerce application in real time. Each Kafka producer sends multiple JSON records per second, including page views, click events, and interaction data. The following table shows the computed data growth:

| Time Period | Processed Data |
|-------------|---------------|
| 1 Second    | 2KB           |
| 1 Minute    | 124KB         |
| 1 Hour      | 7MB           |
| 1 Day       | 179MB         |
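The figures in the table above can be reproduced, up to rounding, by scaling a single rate. The sketch below assumes that the per-minute value (124 KB) is the measured one and that decimal units are used (1 MB = 1000 KB); neither assumption is stated in the report.

```python
KB_PER_MINUTE = 124  # per-minute figure taken from the table


def projected_volume_kb(kb_per_minute: float) -> dict:
    """Scale a per-minute data rate to the other time periods, in KB."""
    return {
        "second": kb_per_minute / 60,
        "minute": kb_per_minute,
        "hour": kb_per_minute * 60,
        "day": kb_per_minute * 60 * 24,
    }


volume = projected_volume_kb(KB_PER_MINUTE)
for period, kb in volume.items():
    print(f"1 {period}: {kb:,.2f} KB ({kb / 1000:.2f} MB)")
```

Under these assumptions the projection gives roughly 2.07 KB per second, 7.44 MB per hour, and 178.56 MB per day, which matches the rounded values in the table.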

- Velocity:
The architecture processes data in real time using Apache Kafka and Spark Structured Streaming. Data is ingested with low latency from the Kafka topics and processed on the fly to be stored in Parquet files. The system supports continuous streaming and writing with near real-time feedback, making it suitable for scenarios that require instant insights. Using PySpark's `StreamingQueryListener`, we measured the throughput of our application at 11.8 rows per second.
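One way to arrive at a rows-per-second figure is to average the rate that Spark reports for each micro-batch. The helper below is a sketch that works on progress reports represented as dictionaries, which is the shape returned by `StreamingQuery.recentProgress` in PySpark. It is not the listener-based code used in the project, and the choice of `processedRowsPerSecond` over `inputRowsPerSecond` is an assumption.

```python
from statistics import mean
from typing import Iterable, Mapping


def average_rate(
    progress_reports: Iterable[Mapping],
    key: str = "processedRowsPerSecond",
) -> float:
    """Average a rate metric over the micro-batches that processed rows."""
    rates = [
        report[key]
        for report in progress_reports
        # Skip empty batches and reports where Spark omitted the metric.
        if report.get(key) is not None and report.get("numInputRows", 0) > 0
    ]
    return mean(rates) if rates else 0.0
```

With a running query this could be called as `average_rate(query.recentProgress)`. A `StreamingQueryListener` sees the same metrics on `event.progress` inside `onQueryProgress`, so the same averaging can be applied to values collected there.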

- Variety:
Three different types of data are collected (page views, user interactions, and click events), each with its own schema:

| Data type | Schema |
|-----------|--------|
| Page views | user_id (string), session_id (string), page_url (string), referrer_url (string), category (string), price (float), timestamp (string) |
| User interactions | user_id (string), session_id (string), interaction_type (string), page_url (string), category (string), price (float), details (string), timestamp (string) |
| Click events | user_id (string), session_id (string), element_id (string), page_url (string), category (string), price (float), timestamp (string), x_coord (float), y_coord (float) |

The use of multiple data sources and structures reflects the system's ability to handle diverse data types in a unified processing pipeline.
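The three schemas can be written down compactly as DDL strings, as in the sketch below. The dictionary keys are hypothetical labels, not the project's topic names. The small helper shows which columns all three streams share, which is what makes a unified, session-keyed pipeline possible.

```python
SCHEMAS = {
    "page_views": (
        "user_id STRING, session_id STRING, page_url STRING, referrer_url STRING, "
        "category STRING, price FLOAT, timestamp STRING"
    ),
    "user_interactions": (
        "user_id STRING, session_id STRING, interaction_type STRING, page_url STRING, "
        "category STRING, price FLOAT, details STRING, timestamp STRING"
    ),
    "click_events": (
        "user_id STRING, session_id STRING, element_id STRING, page_url STRING, "
        "category STRING, price FLOAT, timestamp STRING, x_coord FLOAT, y_coord FLOAT"
    ),
}


def field_names(ddl: str) -> list:
    """Extract the column names from a simple 'name TYPE, name TYPE' string."""
    return [part.split()[0] for part in ddl.split(",")]


def shared_fields(schemas: dict) -> list:
    """Return the columns present in every schema, in first-schema order."""
    first, *rest = [field_names(ddl) for ddl in schemas.values()]
    return [name for name in first if all(name in other for other in rest)]


print(shared_fields(SCHEMAS))
```

Spark accepts DDL strings like these where a schema is expected, for example as the second argument of `from_json`.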

- Veracity:
To ensure data reliability, the system uses Kafka’s message durability and Spark’s schema validation during JSON parsing. Invalid or malformed data is filtered out during the transformation stage, and structured formats like Parquet further enforce data integrity for downstream analysis and modeling.
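To make the filtering idea concrete, the sketch below applies the same kind of check to a single page-view record in plain Python. In the project this is done by Spark during JSON parsing; the rule used here (every field must be present and of the expected type) is an assumption chosen for illustration and is stricter than Spark's default permissive parsing.

```python
import json
from typing import Optional

# Expected Python types for the page-view fields listed in this report.
PAGE_VIEW_FIELDS = {
    "user_id": str,
    "session_id": str,
    "page_url": str,
    "referrer_url": str,
    "category": str,
    "price": (int, float),
    "timestamp": str,
}


def parse_or_none(raw: str, fields: dict = PAGE_VIEW_FIELDS) -> Optional[dict]:
    """Return the parsed record, or None if it is malformed or incomplete."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    for name, expected in fields.items():
        value = record.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
    return {name: record[name] for name in fields}
```

Records for which `parse_or_none` returns `None` would be dropped before the data is written to Parquet.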

- Value:
The collected and processed data is used to train a machine learning model that predicts user purchase behavior. These predictions are visualized in Power BI, providing actionable insights that can drive marketing strategies, user experience improvements, and business decisions. This demonstrates the value extracted from the raw data through analytics.




## Implementation Details

### Technologies Used
- Apache Kafka
Kafka was used as the real-time messaging system to decouple the data producers from consumers. It provides high throughput, scalability, and durability, making it ideal for real-time event streaming.

- PySpark (Apache Spark with Python API)
PySpark was used for both real-time and batch data processing. Spark Structured Streaming handled streaming data from Kafka, while batch jobs were used for feature engineering and model training.

- Parquet
Parquet, a columnar storage format, was chosen for its performance benefits in analytical workloads. It supports efficient compression and encoding, which reduces storage size and speeds up query performance.

- Power BI
Power BI was used for creating interactive dashboards to visualize the predictions and insights generated by the machine learning model.

### Design Choices

- Three separate Kafka producers and topics were used to isolate the types of data: page views, click events, and other user interactions. This separation improves modularity and scalability and makes it easier to trace issues.


- Storing data in Parquet format within a data lake enabled historical data analysis and reuse for training ML models without re-ingesting data.

- Data was aggregated at the session level to capture meaningful behavioral patterns. This allowed the machine learning model to learn from combined features such as total clicks, number of interactions, and product views per session.


- A DecisionTreeClassifier was chosen for its interpretability and fast training time, which is suitable for iterative development and evaluation.


## Results and Evaluation

After running the machine learning model, the following results were obtained:
                                                                                
- Accuracy: 0.7901234567901234
                                                                                
- Precision: 0.855921855921856
                                                                                
- Recall: 0.7901234567901234

- F1 Score: 0.7960105968521116
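Accuracy and recall are identical in the results above. That is the expected outcome when precision, recall, and F1 are averaged across classes weighted by class frequency, as the `weightedPrecision`, `weightedRecall`, and `f1` metrics of PySpark's `MulticlassClassificationEvaluator` do; whether the project used exactly those metrics is an assumption. The sketch below computes the weighted metrics from a confusion matrix whose counts are made up for illustration and are not the project's results.

```python
def weighted_metrics(confusion):
    """Compute accuracy and class-weighted precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])  # samples whose true class is c
        predicted = sum(confusion[r][c] for r in range(n_classes))
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: rows are true classes (no purchase, purchase).
example = weighted_metrics([[50, 10], [5, 35]])
print(example)
```

Because each class's recall is weighted by its share of the samples, the weighted recall always reduces to the fraction of correct predictions, which is the accuracy.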

### Power BI

## Conclusion

This project successfully implemented a real-time data processing pipeline for an e-commerce web application using Apache Kafka, PySpark, and Parquet. User session data (page views, click events, and other interactions) was collected and processed to create a rich dataset for training a DecisionTreeClassifier aimed at predicting the likelihood of a user making a purchase. The pipeline combined real-time streaming and batch processing, enabling both immediate insights and long-term analysis. Aggregating data at the session level improved model performance by providing contextual behavioral features. The results were visualized using Power BI, offering accessible and actionable insights.

Key learnings include the effectiveness of integrating streaming technologies with machine learning, the benefits of session-based feature engineering, and the practicality of using interpretable models like decision trees to support early-stage predictive analytics.