# <center> **ITESO** </center>
# <center> **Final Project Procesamiento de Datos Masivos** </center>
---
## <center> **Streamer Applications** </center>
## <center> **Real-Time Stock Price Analysis** </center>
---
## <center> **Par de Foraneos** </center>
---
#### <center> **Spring 2025** </center>
---
#### <center> **05/14/2025** </center>

---
**Profesor**: Dr. Pablo Camarillo Ramirez <br>
**Students**: Eddie, Konrad, Diego, Aaron

# Introduction and Problem Definition

# Introduction and Problem Definition

The financial markets generate enormous volumes of data at high velocity, making them an ideal candidate for big data technologies. Our project addresses the challenge of processing, analyzing, and extracting actionable insights from real-time stock market data. We've developed a comprehensive system that captures stock price movements, processes them using distributed computing, and applies machine learning to predict trading signals.

The primary objectives of this project are:

1. **Data Ingestion**: Design a scalable system to ingest real-time stock price data from multiple sources using Kafka as the data streaming platform.

2. **Data Processing**: Leverage Apache Spark's distributed computing capabilities to process and transform streaming data into usable formats.

3. **Technical Analysis**: Calculate key technical indicators and financial metrics that provide insights into market trends and potential trading opportunities.

4. **Machine Learning Implementation**: Apply machine learning models to predict future price movements and generate trading signals (BUY, SELL, WAIT).

5. **Backtesting Framework**: Develop a backtesting system to evaluate and optimize trading strategies using historical data.

6. **Performance Analysis**: Measure the effectiveness of our trading strategies using key financial metrics like Calmar ratio, Sharpe ratio, and Sortino ratio.

Our application focuses on four major stocks (CAT, AAPL, NVDA, CVX) that represent different market sectors and exhibit varied patterns of volatility and price movement. This diversity allows us to test our system's robustness across different market conditions.

The real-time nature of stock market data presents challenges in terms of volume, velocity, variety, veracity, and value - the 5Vs of Big Data. Our architecture is designed to address these challenges through a pipeline that connects data producers (simulating real-time stock prices), Kafka streaming, Spark processing, machine learning modeling, and visualization.

# System Architecture

### Workflow
1. Make sure Docker is set up and connected. Provide Spark and Kafka Server ID's in all necessary Notebooks.
2. Run file called `producer_application_stocks.ipynb` to start the data streaming.
3. While the stream is running, start notebook called `streamer_application_stocks.ipynb` to capture and persist the read data.
4. Once the data lake is sufficient, stop the streaming and consumer notebooks.
5. Finally, run file `postprocess_application_stocks.ipynb` to perform a union based operations on each ouput based on the stock's company.

### Architecture

# <center> <img src="./images/BigData_Architecture.jpg" alt="Project Architecture"> </center>

# 5 V's Justification

## Volume
How the system handles large data volumes. Each
team needs to compute the size of each produced record to do
this analysis. 

Displayed send volumn is per topic, i.e. volume/stock.

- Record Size: Each stock data record contains timestamp, symbol, price, volume, and market indicators (~101 bytes)
- Ingestion Rate: Publishing 10x faster than real-world trading for stress testing (simulating high-frequency trading scenarios), so in a real application scale down volumn by 10 and scale up by total amount of stocks.
- Scalability: The per-topic design allows linear scaling - adding more stocks simply requires additional topics/partitions

| Time Period       | Data Processed       |
|----------------|----------------|
| 1 Second                      | 0.202 KB   |
| 1 Minute (60 Seconds)         | 12.168 KB   |
| 1 Hour (3,600 Seconds)         | 0.730 MB   |
| 1 Day (86,400 Seconds)            | 17.112 MB  |
| 1 Year (31.5 Million Seconds)  | 6.10 GB   |



## Velocity
The system’s ability to process streaming data in real-
time. The performance can be obtained by using the processedRowsPerSecond info obtained from the event progress
data (using QueryListeners)
- Processing Rate: Sustains 28.3 rows/second per stock topic
- Parallel Consumption: Spark structured streaming maintains 1:1 topic:task parallelism for optimal throughput

## Variety
The schema that were handling in the input of the stream consists on:
- stock_text **(string)** - contains timestamp, stock-id, price

The schema that were handling in the output of the stream consists on:
- timestamp **(timestamp)**
- company **(string)**
- open **(float)**
- high **(float)**
- low **(float)**
- close **(float)**

The schema that were handling in the postprocess consists on:  
- timestamp **(timestamp)**
- company **(string)**
- open **(float)**
- high **(float)**
- low **(float)**
- close **(float)**
- williams_r **(float)**
- ultimate_osc **(float)** 
- rsi **(float)**
- ema **(float)**
- close_lag1 **(float)**
- close_lag2 **(float)**
- close_lag3 **(float)**
- close_lag4 **(float)**
- close_lag5 **(float)**

## Value
The output dataset from the stream contains OHLC stock prices on a 5min interval. They provide a snapshot of a stock's price movement over a specific time period. Thus, they allow an analysis of behaviour, volatility and key price levels. They can be used for technical statistics as well as trend and pattern analysis making them essential for traders and analysts. 
In our use case they are used to calculate technical indicators and price lags. Later, all of them are used to train a ML model to predict trading behaviour.

# Implementation Details

## Technological Stack
- `PySpark`: Python framework that allows us to work with large volumes of datasets.
- `Kafka`: Distributed streaming platform for real-time data applications.
- `Pandas`: Python library to create dataframes, data analysis and manipulation.
- `Numpy`: Python library for scientific computing and data analysis. 
- `YFinance`: Yahoo Finance downloads real historical values from selected companies.
- `Technical Analysis (ta)`: Python library to calculate specific financial metrics, such as RSI, EMA. 

## Design Choices
- To get a realistic basis for our producer we download real stock prices from the internet. The last downloaded price serves as the initial price for the producer which then produces new prices using the Geometric Brownian Motion (GBM) model and semi-random risk-free interest rate and volatility. In this way we have a somewhat real producer scenario.
- Our producer creates stock prices in a 5s interval. However, they are then published every 0.5s. This is due to convenience in the showcase of our application to produce a sufficient amount of data in a running interval. 
- the OHLC price statistics on a 5 minute interval are calculated directly within the kafka stream since they are just inplace-resampling operations. The technical indicators however, we had to outsource them to post-processing. This is due to a calculation window of those indicators which are usually 14 timesteps. Consequently, we would need at least 14 5 minute OHLC stock prices per batch to get one row of indicator values while loosing the first 13 OHLC prices. So even if we would process very large batches we would loose the first part of our streaming data in every batch which is not useful since it would create an incomplete datalake. 
- We outsourced the class containing the actual producer and the function to calculate the technical indicators in a file called `stock_utils.py`.
- We also created a specific Jupyter Notebook that acts as the producer of data. 

## Optimizations
- The Kafka streamer is serialized in 'utf-8' to avoid additional overhead that would be caused by casting such as json etc.

# Analysis and Evaluation

Our analysis evaluates both the performance of the data processing pipeline and the effectiveness of the machine learning-based trading strategy.

## Data Processing Performance

The Spark structured streaming application demonstrated strong performance metrics when processing the real-time stock data:

- **Processing Rate**: The system sustained an average of 28.3 rows per second per stock topic, with peaks reaching over 127 rows per second during high-velocity periods.
- **Scalability**: The application successfully handled multiple parallel streams (one for each stock) without performance degradation.
- **Latency**: The average processing time remained consistently low, with the streaming application efficiently transforming raw stock price data into OHLC (Open, High, Low, Close) format.

## Machine Learning Model Performance

Our SVM-based machine learning models showed impressive predictive capabilities across all four stocks:

1. **CAT**: F1 Score of 0.928, with optimized technical indicators (RSI window: 9, Williams %R window: 10)
2. **AAPL**: Perfect F1 Score of 1.0, indicating exceptional predictive accuracy
3. **NVDA**: Perfect F1 Score of 1.0, with strong performance across all evaluation metrics
4. **CVX**: F1 Score of 0.777, still demonstrating good predictive power

The high F1 scores indicate that our models effectively learned the patterns in the stock price movements and could accurately predict the optimal trading signals.

## Trading Strategy Performance

The backtesting results demonstrate the effectiveness of our trading strategy:

1. **Individual Stock Performance**:
   - **NVDA**: Highest return at 10.27%, with an exceptional Calmar ratio of 11,580.89
   - **CVX**: 5.73% return with a Calmar ratio of 179.02
   - **AAPL**: 5.68% return with a Calmar ratio of 2,697.71
   - **CAT**: 3.83% return with a Calmar ratio of 1,346.93

2. **Portfolio Performance**:
   - **Overall Return**: 6.38% for the equally-weighted portfolio
   - **Risk-Adjusted Metrics**: Average Calmar ratio of 3,951.14, average Sharpe ratio of 8.14, and average Sortino ratio of 160.39

These metrics indicate that our strategy not only generates positive returns but does so with minimal drawdowns and controlled risk. The extremely high Calmar ratios suggest that the strategy effectively manages risk while capturing profit opportunities.

3. **Optimization Effectiveness**:
   - **Technical Indicators**: Optuna-based optimization identified unique parameters for each stock, significantly enhancing model performance.
   - **Trading Parameters**: Stop-loss and take-profit parameters were optimized to values generally between 1-2%, indicating that quick profit-taking and tight loss control were crucial for success.

Our portfolio visualization and analysis demonstrate that the machine learning approach outperforms traditional technical analysis-based trading strategies, with higher returns and better risk management metrics.

# Conclusion

In this notebook, we've applied machine learning techniques to our real-time stock price analysis system. We've demonstrated how to:

1. Load and prepare data from either our processed parquet files or historical sources
2. Generate trading signals based on price movements
3. Optimize technical indicators and model parameters
4. Train SVM models to predict trading signals
5. Backtest trading strategies with optimized parameters
6. Analyze and visualize portfolio performance

The results show that our approach can potentially generate profitable trading strategies, though real-world implementation would require additional considerations such as transaction costs, market impact, and risk management.

This completes our Big Data-based stock trading system, which integrates data streaming, processing, machine learning, and backtesting components.

### Challenges
- The biggest challenge was the calculation of the techincal indicators. At first we tried to include them directly into the streaming process. This however produced empty output. Since spark doesn't allow printing commands it was hard to figure out that due to the spinup window of 14 timesteps in the indicators, the batches stayed empty with NaN values. After this debugging process and the conclusion that we need a complete datalake we decided to outsource the indicators to the postprocessing in spark applying the calculations to the whole datalake. 
- Furthermore, during the coding and debugging process and the running producer, the streaming cache of kafka grew heavily in memory which in one moment caused the docker application to crash due to missing memory on the disk. Consequently, we had to reinstall the docker application and make some settings to avoid congestion in the future due to streaming data.
- We applied all the learned concepts throughout the semester into one final application, and we think that that's the most important part. 

## Download necessary Modules

In [None]:
import findspark
findspark.init()
import ta
from par_de_foraneos.stock_utils import resample_and_aggregate
from par_de_foraneos.stock_utils import SparkUtils as SpU
from par_de_foraneos.stock_utils import ProgressListener

## Spark Session creation


In [None]:
from pyspark.sql import SparkSession

SPARK_SERVER = {'Konrad': '2453c3db49e4',
                'Aaron' : 'a5ab6bdab4b3',
                'Diego': '368ad5a83fd7'}
KAFKA_SERVER = {'Konrad': '4c63f45c41b4:9093',
                'Aaron' : '69b1b3611d90:9093',
                'Diego' : 'a27c998f34f5:9093'}
current_user = 'Konrad'

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka") \
    .master("spark://{}:7077".format(SPARK_SERVER[current_user])) \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()

sc = spark.sparkContext

spark.conf.set("spark.sql.shuffle.partitions", "5")

## Kafka Stream creation

In [None]:
streamer_lines = []

for i in range(4):
    streamer_lines.append( spark \
                            .readStream \
                            .format("kafka") \
                            .option("kafka.bootstrap.servers", "{}".format(KAFKA_SERVER[current_user])) \
                            .option("subscribe", f"stock_topic{i}") \
                            .option("failOnDataLoss", "false")
                            .load()
    )

### Setup Schema for Output DF

In [None]:

result_schema = SpU.generate_schema([("timestamp", "timestamp" ),
                                     ('company', 'string'),
                                              ("open", "float" ),
                                              ("high", "float" ),
                                              ("low", "float"),
                                              ("close", "float" )
                                              ])

## Defining Output Transformations on DF

In [None]:
from pyspark.sql.functions import col, split
from pyspark.sql.types import DoubleType, TimestampType

streamer_df = []

for i in range(4):

        #split csv input and tranform into spark df
        df = streamer_lines[i].withColumn("value_str", col("value").cast("string"))
        df = df.withColumn("split", split(col("value_str"), ","))
        df = df.withColumn("timestamp", col("split").getItem(0).cast(TimestampType())) \
                .withColumn("company", col("split").getItem(1)) \
                .withColumn("close", col("split").getItem(2).cast(DoubleType())) \
                .select("timestamp", "company","close")

        #setup resampling window for UDF
        custom_resampler = resample_and_aggregate(new_window=5)

        # creating a watermark window of 10mins - suficient for our 5min ressampling window
        # and applying the custom resampling function
        resampled_df = df.withWatermark("timestamp", "10 minutes") \
                .groupBy("company").applyInPandas(custom_resampler, schema=result_schema)

        streamer_df.append(resampled_df)



### QUERY LISTENER

In [None]:
#to get processed lines per second
spark.streams.addListener(ProgressListener())

## Sink configuration

In [None]:
query = []

# write the stream to parquet files and store different writestreams in a list
for i in range(4):
    query.append(
        streamer_df[i] \
        .writeStream \
        .outputMode("append") \
        .trigger(processingTime='120 seconds') \
        .format("parquet") \
        .option("path", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/output{i}/")
        .option("checkpointLocation", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/checkpoints/stock_topic{i}") \
        .start()
    )

## STOP STREAMERS

In [None]:
for i in range(4):
    query[i].stop()

### Close Spark Session

In [None]:
sc.stop()