# <center> **ITESO** </center>
# <center> **Final Project Procesamiento de Datos Masivos** </center>
---
## <center> **Streamer Applications** </center>
## <center> **Real-Time Stock Price Analysis** </center>
---
## <center> **Par de Foraneos** </center>
---
#### <center> **Spring 2025** </center>
---
#### <center> **05/14/2025** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez <br>
**Students**: Eddie, Konrad, Diego, Aaron

# Introduction and Problem Definition

### **PENDING**

# System Architecture

### Workflow
1. Make sure Docker is set up and running, and enter your Spark and Kafka server IDs in every notebook that needs them.
2. Run `producer_application_stocks.ipynb` to start the data stream.
3. While the stream is running, start `streamer_application_stocks.ipynb` to capture and persist the incoming data.
4. Once the data lake holds enough data, stop the producer and streamer notebooks.
5. Finally, run `postprocess_application_stocks.ipynb` to perform the union-based post-processing on each output, grouped by the stock's company.

### Architecture

# <center> <img src="./images/BigData_Architecture.jpg" alt="Project Architecture"> </center>

# 5 V's Justification

## Volume
This section estimates how the system handles large data volumes, based on the size of each produced record.

The volume shown below is per topic, i.e. per stock.

- Record Size: Each stock data record contains timestamp, symbol, price, volume, and market indicators (~101 bytes).
- Ingestion Rate: Records are published 10x faster than real-world trading for stress testing (simulating high-frequency trading scenarios). For a real application, scale the volume down by a factor of 10 and up by the total number of stocks.
- Scalability: The per-topic design allows linear scaling; adding more stocks simply requires additional topics/partitions.

| Time Period                   | Data Processed (per topic) |
|-------------------------------|----------------------------|
| 1 Second                      | 0.202 KB                   |
| 1 Minute (60 Seconds)         | 12.168 KB                  |
| 1 Hour (3,600 Seconds)        | 0.730 MB                   |
| 1 Day (86,400 Seconds)        | 17.112 MB                  |
| 1 Year (31.5 Million Seconds) | 6.10 GB                    |
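The arithmetic behind such a table can be sketched as follows. This is a minimal illustration, assuming the ~101-byte record size stated above and the 0.5 s publish interval described under Design Choices; the function names are hypothetical. The results can differ slightly from the table, depending on the exact record size and on whether 1 KB is counted as 1000 or 1024 bytes.

```python
# Assumptions: ~101 bytes per record and one record published
# every 0.5 s per topic (both taken from this report).
RECORD_SIZE_BYTES = 101
PUBLISH_INTERVAL_S = 0.5
SPEEDUP = 10  # the producer publishes 10x faster than real time


def bytes_per_topic(seconds: float) -> float:
    """Bytes published to a single topic over the given duration."""
    return seconds / PUBLISH_INTERVAL_S * RECORD_SIZE_BYTES


def real_world_bytes(seconds: float, n_stocks: int) -> float:
    """Scale the stress-test volume down to real time and up to n_stocks topics."""
    return bytes_per_topic(seconds) / SPEEDUP * n_stocks


periods = [("1 second", 1), ("1 minute", 60), ("1 hour", 3600), ("1 day", 86400)]
for label, secs in periods:
    print(f"{label:>9}: {bytes_per_topic(secs) / 1024:,.3f} KiB per topic")
```

`real_world_bytes` applies the scaling rule from the Ingestion Rate bullet: divide by 10, then multiply by the number of stocks.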



## Velocity
This section covers the system’s ability to process streaming data in real time. Throughput is taken from the `processedRowsPerSecond` field of the query progress events, which are collected with a streaming query listener.
- Processing Rate: Sustains 28.3 rows/second per stock topic.
- Parallel Consumption: Each topic is consumed by its own Spark Structured Streaming query (1:1 topic-to-query mapping), so the topics are processed in parallel.
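One way to summarize the listener output is to parse the progress lines and average them. The sketch below is not part of the project code: it assumes the line format printed by the notebook's `ProgressListener` ("Query made progress: ... rows processed per second"), and `parse_rates` is a hypothetical helper. The two sample lines are copied from the notebook output further down.

```python
import re
from statistics import mean
from typing import List

# Matches the progress lines printed by the listener in this notebook.
PROGRESS_RE = re.compile(r"Query made progress: ([0-9.]+) rows processed per second")


def parse_rates(log_text: str) -> List[float]:
    """Extract every processedRowsPerSecond value from the captured output."""
    return [float(value) for value in PROGRESS_RE.findall(log_text)]


sample_log = """
Query made progress: 26.8857356235997 rows processed per second
Query made progress: 20.411055988660525 rows processed per second
"""

rates = parse_rates(sample_log)
print(f"{len(rates)} samples, mean {mean(rates):.1f} rows/s")
```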

## Variety
The input schema of the stream consists of:
- stock_text **(string)** - contains timestamp, stock-id, price
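To illustrate how such a string maps onto typed fields, here is a plain-Python sketch. It assumes a comma-separated layout (matching the `split(",")` in the streamer code) and an ISO-like timestamp; the `StockTick` class, the `parse_stock_text` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StockTick:
    timestamp: datetime
    company: str
    close: float


def parse_stock_text(stock_text: str) -> StockTick:
    """Split 'timestamp,stock-id,price' and cast each field to its type."""
    ts, company, price = stock_text.split(",")
    return StockTick(datetime.fromisoformat(ts), company, float(price))


# Hypothetical sample message; the real timestamp format may differ.
tick = parse_stock_text("2025-05-14 16:44:44,AAPL,212.5")
print(tick)
```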

The output schema of the stream consists of:
- timestamp **(timestamp)**
- company **(string)**
- open **(float)**
- high **(float)**
- low **(float)**
- close **(float)**

The post-processing schema consists of:
- timestamp **(timestamp)**
- company **(string)**
- open **(float)**
- high **(float)**
- low **(float)**
- close **(float)**
- williams_r **(float)**
- ultimate_osc **(float)** 
- rsi **(float)**
- ema **(float)**
- close_lag1 **(float)**
- close_lag2 **(float)**
- close_lag3 **(float)**
- close_lag4 **(float)**
- close_lag5 **(float)**
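The `close_lag1` to `close_lag5` columns hold the closing prices of the previous one to five bars. A minimal sketch of that idea in plain Python follows; `add_lags` is a hypothetical helper, not the project's post-processing code, which runs in Spark.

```python
from typing import Dict, List, Optional


def add_lags(closes: List[float], n_lags: int = 5) -> List[Dict[str, Optional[float]]]:
    """Build one row per bar with close_lag1..close_lagN (None where no history exists)."""
    rows = []
    for i, close in enumerate(closes):
        row: Dict[str, Optional[float]] = {"close": close}
        for k in range(1, n_lags + 1):
            # The lag-k value is the close k bars earlier, if that bar exists.
            row[f"close_lag{k}"] = closes[i - k] if i - k >= 0 else None
        rows.append(row)
    return rows


rows = add_lags([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
print(rows[5])
```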

## Value
The output dataset from the stream contains OHLC (open, high, low, close) stock prices on a 5-minute interval. They provide a snapshot of a stock's price movement over a specific time period and thus allow an analysis of price behaviour, volatility, and key price levels. They can be used for technical statistics as well as trend and pattern analysis, making them essential for traders and analysts.
In our use case they are used to calculate technical indicators and price lags. All of these are later used to train an ML model to predict trading behaviour.
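As an illustration of what a 5-minute OHLC bar is, the following sketch aggregates raw ticks in plain Python. It is not the project's `resample_and_aggregate` function (whose internals are not shown in this notebook); the `ohlc` helper and the sample ticks are hypothetical, and the flooring logic assumes a window length that divides 60.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def ohlc(ticks: Iterable[Tuple[datetime, float]], window_minutes: int = 5) -> Dict[datetime, List[float]]:
    """Group time-sorted (timestamp, price) ticks into [open, high, low, close] bars."""
    bars: Dict[datetime, List[float]] = {}
    for ts, price in ticks:
        # Floor the timestamp to the start of its window.
        start = ts.replace(minute=ts.minute - ts.minute % window_minutes,
                           second=0, microsecond=0)
        if start not in bars:
            bars[start] = [price, price, price, price]
        else:
            bar = bars[start]
            bar[1] = max(bar[1], price)  # high
            bar[2] = min(bar[2], price)  # low
            bar[3] = price               # close is the latest price seen
    return bars


ticks = [
    (datetime(2025, 5, 14, 10, 0, 0), 100.0),
    (datetime(2025, 5, 14, 10, 2, 30), 105.0),
    (datetime(2025, 5, 14, 10, 4, 55), 98.0),
    (datetime(2025, 5, 14, 10, 5, 0), 101.0),
]
bars = ohlc(ticks)
for start, bar in bars.items():
    print(start, bar)
```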

# Implementation Details

## Technological Stack
- `PySpark`: Python API for Apache Spark that allows us to work with large volumes of data.
- `Kafka`: Distributed streaming platform for real-time data applications.
- `Pandas`: Python library for creating dataframes, data analysis, and data manipulation.
- `Numpy`: Python library for scientific computing and data analysis.
- `YFinance`: Python library that downloads real historical prices of selected companies from Yahoo Finance.
- `Technical Analysis (ta)`: Python library to calculate specific financial metrics, such as RSI and EMA.

## Design Choices
- To get a realistic basis for our producer, we download real stock prices from the internet. The last downloaded price serves as the initial price for the producer, which then generates new prices using the Geometric Brownian Motion (GBM) model with a semi-random risk-free interest rate and volatility. This gives us a reasonably realistic producer scenario.
- Our producer generates stock prices on a 5 s interval but publishes them every 0.5 s. This is for convenience when showcasing the application: it produces a sufficient amount of data within a short run.
- The OHLC price statistics on a 5-minute interval are calculated directly within the Kafka stream, since they are simple in-place resampling operations. The technical indicators, however, had to be moved to post-processing, because they are calculated over a window that is usually 14 timesteps long. We would therefore need at least 14 five-minute OHLC bars per batch to get a single row of indicator values, while losing the first 13 bars. Even with very large batches, we would lose the first part of the streaming data in every batch, which would leave the data lake incomplete.
- We moved the class containing the actual producer and the function that calculates the technical indicators into a file called `stock_utils.py`.
- We also created a dedicated Jupyter Notebook that acts as the data producer.
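The GBM price generation mentioned in the first design choice can be sketched as below. This is an illustrative stand-in for the producer class in `stock_utils.py`, not its actual code: the drift `mu`, volatility `sigma`, the time step expressed in trading years, and the function names are all assumptions.

```python
import math
import random
from typing import List

# Assumption: one 5 s step as a fraction of a trading year
# (252 trading days of 6.5 hours each).
DT_YEARS = 5 / (252 * 6.5 * 3600)


def gbm_step(price: float, mu: float, sigma: float, dt: float, rng: random.Random) -> float:
    """Advance the price by one step of the exact GBM solution."""
    z = rng.gauss(0.0, 1.0)
    return price * math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * math.sqrt(dt) * z)


def simulate(initial_price: float, n_steps: int, mu: float = 0.05,
             sigma: float = 0.2, dt: float = DT_YEARS, seed: int = 42) -> List[float]:
    """Return the initial price followed by n_steps simulated prices."""
    rng = random.Random(seed)
    prices = [initial_price]
    for _ in range(n_steps):
        prices.append(gbm_step(prices[-1], mu, sigma, dt, rng))
    return prices


prices = simulate(initial_price=100.0, n_steps=10)
print([round(p, 4) for p in prices])
```

Because each step multiplies the previous price by an exponential, simulated prices stay strictly positive, which is one reason GBM is a common choice for synthetic price series.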

## Optimizations
- Kafka messages are serialized as plain 'utf-8' strings to avoid the additional overhead that conversion to formats such as JSON would cause.
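The size difference between the two encodings can be checked with a small comparison. The record below is a hypothetical example with made-up field names and values; it only illustrates that a JSON payload repeats the keys in every message, while a comma-separated string does not.

```python
import json

# Hypothetical record; field names and values are for illustration only.
record = {"timestamp": "2025-05-14 16:44:44", "company": "AAPL", "close": 212.5}

# Plain comma-separated string, encoded as UTF-8.
csv_bytes = ",".join(str(value) for value in record.values()).encode("utf-8")

# The same record serialized as JSON, encoded as UTF-8.
json_bytes = json.dumps(record).encode("utf-8")

print(f"CSV:  {len(csv_bytes)} bytes")
print(f"JSON: {len(json_bytes)} bytes")
```

On the consumer side, the plain string only needs a `split(",")` followed by type casts, which is what the streamer notebook does.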

# Analysis and Evaluation

### **PENDING**  
Analysis of the ML model performance
(precision, accuracy, recall, etc.) and functionality of the application.
In this section you need to paste your PowerBI Dashboard.

# Conclusion

### Challenges
- The biggest challenge was the calculation of the technical indicators. At first we tried to include them directly in the streaming process, but this produced empty output. Since Spark does not show print output from inside the streaming job, it was hard to figure out that the batches stayed empty (NaN values) because of the 14-timestep warm-up window of the indicators. After this debugging process, and having concluded that we need a complete data lake, we moved the indicators to post-processing in Spark, where the calculations are applied to the whole data lake.
- Furthermore, while coding and debugging with the producer running, Kafka's streaming cache grew heavily, which at one point caused the Docker application to crash due to insufficient disk space. Consequently, we had to reinstall Docker and adjust some settings to avoid congestion from streaming data in the future.
- We applied all the concepts learned throughout the semester in one final application, and we think that is the most important part.
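The warm-up problem from the first challenge can be reproduced without Spark. The sketch below computes Williams %R over a 14-bar window in plain Python; it is a simplified stand-in for the `ta` library, and `williams_r` as well as the sample data are hypothetical. Any batch with fewer than 14 bars yields no indicator values at all.

```python
from typing import List, Optional


def williams_r(highs: List[float], lows: List[float], closes: List[float],
               window: int = 14) -> List[Optional[float]]:
    """Williams %R per bar; None while fewer than `window` bars are available."""
    out: List[Optional[float]] = []
    for i in range(len(closes)):
        if i + 1 < window:
            out.append(None)  # warm-up: not enough history yet
            continue
        highest = max(highs[i + 1 - window:i + 1])
        lowest = min(lows[i + 1 - window:i + 1])
        if highest == lowest:
            out.append(None)  # undefined for a flat window
        else:
            out.append(-100.0 * (highest - closes[i]) / (highest - lowest))
    return out


# Hypothetical data: 20 steadily rising bars.
closes = [100.0 + i for i in range(20)]
highs = [c + 1.0 for c in closes]
lows = [c - 1.0 for c in closes]

full = williams_r(highs, lows, closes)
small_batch = williams_r(highs[:10], lows[:10], closes[:10])

print("None values in full series:", full.count(None))
print("Small batch all None:", all(v is None for v in small_batch))
```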

## Import necessary Modules

In [1]:
import findspark
findspark.init()
import ta
from par_de_foraneos.stock_utils import resample_and_aggregate
from par_de_foraneos.stock_utils import SparkUtils as SpU
from par_de_foraneos.stock_utils import ProgressListener

## Spark Session creation


In [2]:
from pyspark.sql import SparkSession

SPARK_SERVER = {'Konrad': '2453c3db49e4',
                'Aaron' : 'a5ab6bdab4b3',
                'Diego': '368ad5a83fd7'}
KAFKA_SERVER = {'Konrad': '4c63f45c41b4:9093',
                'Aaron' : '69b1b3611d90:9093',
                'Diego' : 'a27c998f34f5:9093'}
current_user = 'Konrad'

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka") \
    .master("spark://{}:7077".format(SPARK_SERVER[current_user])) \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()

sc = spark.sparkContext

spark.conf.set("spark.sql.shuffle.partitions", "5")


## Kafka Stream creation

In [3]:
streamer_lines = []

# one Kafka source per stock topic (stock_topic0 .. stock_topic3)
for i in range(4):
    streamer_lines.append(
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_SERVER[current_user])
        .option("subscribe", f"stock_topic{i}")
        .option("failOnDataLoss", "false")
        .load()
    )

### Setup Schema for Output DF

In [4]:

result_schema = SpU.generate_schema([("timestamp", "timestamp"),
                                     ("company", "string"),
                                     ("open", "float"),
                                     ("high", "float"),
                                     ("low", "float"),
                                     ("close", "float")])

## Defining Output Transformations on DF

In [5]:
from pyspark.sql.functions import col, split
from pyspark.sql.types import DoubleType, TimestampType

streamer_df = []

for i in range(4):

        # split the CSV input and transform it into a Spark DataFrame
        df = streamer_lines[i].withColumn("value_str", col("value").cast("string"))
        df = df.withColumn("split", split(col("value_str"), ","))
        df = df.withColumn("timestamp", col("split").getItem(0).cast(TimestampType())) \
                .withColumn("company", col("split").getItem(1)) \
                .withColumn("close", col("split").getItem(2).cast(DoubleType())) \
                .select("timestamp", "company","close")

        #setup resampling window for UDF
        custom_resampler = resample_and_aggregate(new_window=5)

        # create a watermark of 10 minutes (sufficient for our 5-minute resampling window)
        # and apply the custom resampling function
        resampled_df = df.withWatermark("timestamp", "10 minutes") \
                .groupBy("company").applyInPandas(custom_resampler, schema=result_schema)

        streamer_df.append(resampled_df)



### QUERY LISTENER

In [6]:
spark.streams.addListener(ProgressListener())

## Sink configuration

In [None]:
query = []

# write the stream to parquet files and store different writestreams in a list
for i in range(4):
    query.append(
        streamer_df[i] \
        .writeStream \
        .outputMode("append") \
        .trigger(processingTime='120 seconds') \
        .format("parquet") \
        .option("path", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/output{i}/")
        .option("checkpointLocation", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/checkpoints/stock_topic{i}") \
        .start()
    )

25/05/14 16:44:44 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

Query started: 7e8e3b5d-3d00-489e-bc81-a3a079681298
Query started: 2bf848d3-15f0-45da-966d-ac6b840170d5
Query started: cad52912-181d-4478-a5d0-8b398d29199c
Query started: db275114-d087-4020-85f1-95acb87d5bf4

25/05/14 16:44:46 WARN OffsetSeqMetadata: Updating the value of conf 'spark.sql.shuffle.partitions' in current session from '5' to '200'.
25/05/14 16:44:47 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.

Query made progress: 127.8904217192793 rows processed per second
Query made progress: 104.27165826733862 rows processed per second
Query made progress: 95.43094878686351 rows processed per second
Query made progress: 88.81478204648313 rows processed per second

Query made progress: 26.8857356235997 rows processed per second
Query made progress: 20.411055988660525 rows processed per second
Query made progress: 15.94507806444469 rows processed per second
Query made progress: 13.298854820834872 rows processed per second

Query made progress: 34.2563516985441 rows processed per second
Query made progress: 27.238678924072182 rows processed per second
Query made progress: 22.628700735432773 rows processed per second
Query made progress: 19.510608893585886 rows processed per second

Query made progress: 39.66804979253112 rows processed per second
Query made progress: 30.59395801331285 rows processed per second
Query made progress: 22.41605702494841 rows processed per second
Query made progress: 19.275748044197115 rows processed per second

Query made progress: 42.09787756533942 rows processed per second
Query made progress: 32.68864069735767 rows processed per second
Query made progress: 23.469587326422843 rows processed per second
Query made progress: 20.09545340366742 rows processed per second

Query made progress: 36.57790021426385 rows processed per second
Query made progress: 30.21109847048414 rows processed per second
Query made progress: 23.95448647569618 rows processed per second
Query made progress: 20.764552562988705 rows processed per second

Query made progress: 39.900249376558605 rows processed per second
Query made progress: 26.081286676809388 rows processed per second
Query made progress: 21.73716148899556 rows processed per second
Query made progress: 18.37330873308733 rows processed per second

Query made progress: 43.51784413692644 rows processed per second
Query made progress: 33.47280334728033 rows processed per second
Query made progress: 23.981537226570342 rows processed per second
Query made progress: 20.282261472154143 rows processed per second

Query made progress: 42.16654904728299 rows processed per second
Query made progress: 26.99040090344438 rows processed per second
Query made progress: 23.476474616061818 rows processed per second
Query made progress: 20.302850858641403 rows processed per second

## STOP STREAMERS

In [8]:
for i in range(4):
    query[i].stop()

Query terminated: 7e8e3b5d-3d00-489e-bc81-a3a079681298
Query terminated: 2bf848d3-15f0-45da-966d-ac6b840170d5
Query terminated: cad52912-181d-4478-a5d0-8b398d29199c
Query terminated: db275114-d087-4020-85f1-95acb87d5bf4


### Close Spark Session

In [9]:
sc.stop()