# <center> **ITESO** </center>
# <center> **Final Project Procesamiento de Datos Masivos** </center>
---
## <center> **Streamer Applications** </center>
## <center> **Real-Time Stock Price Analysis** </center>
---
## <center> **Par de Foraneos** </center>
---
#### <center> **Spring 2025** </center>
---
#### <center> **05/14/2025** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez  
**Students**: Eddie, Konrad, Diego, Aaron

# Introduction and Problem Definition

### **PENDING**

# System Architecture

### Workflow
1. Make sure Docker is set up and connected.
2. Run `producer_application_stocks.ipynb` to start streaming the data.
3. While the stream is running, start `streamer_application_stocks.ipynb` to capture and persist the incoming data.
4. Once the data lake holds enough data, stop the producer and consumer notebooks.
5. Finally, run `postprocess_application_stocks.ipynb` to union the outputs by each stock's company.
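
The last step can be pictured with a minimal, plain-Python sketch. It is not the contents of the post-processing notebook: the function name `union_by_company`, the tickers, and the prices are all made up for illustration, and rows are represented as dictionaries rather than Spark rows.

```python
from collections import defaultdict
from datetime import datetime


def union_by_company(outputs):
    """Merge several lists of row dicts into one time-ordered list per company."""
    merged = defaultdict(list)
    for rows in outputs:
        for row in rows:
            merged[row["company"]].append(row)
    # Sort each company's rows chronologically so later steps see ordered data.
    return {
        company: sorted(rows, key=lambda r: r["timestamp"])
        for company, rows in merged.items()
    }


# Hypothetical rows standing in for the contents of two output folders.
output0 = [
    {"timestamp": datetime(2025, 5, 13, 9, 35), "company": "AAPL", "close": 211.0},
]
output1 = [
    {"timestamp": datetime(2025, 5, 13, 9, 30), "company": "AAPL", "close": 210.5},
    {"timestamp": datetime(2025, 5, 13, 9, 30), "company": "MSFT", "close": 449.0},
]

by_company = union_by_company([output0, output1])
```

Grouping before sorting keeps each company's series independent, which matters when the rows for one company are spread across several output folders.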

### Architecture

# <center> <img src="./images/BigData_Architecture.jpg" alt="Project Architecture"> </center>

# 5 V's Justification

## Volume
### **PENDING**
How the system handles large data volumes. Each
team needs to compute the size of each produced record to do
this analysis.
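
A rough way to size one record is sketched below. It assumes each Kafka message is a CSV string of `timestamp,company,close`, which is how the streamer notebook splits the value. The sample record, the rate of 4 records per second, and both function names are hypothetical, and the result covers the payload only (Kafka adds its own per-record overhead).

```python
def record_size_bytes(record: str, encoding: str = "utf-8") -> int:
    """Return the payload size of one record once it is encoded for Kafka."""
    return len(record.encode(encoding))


def estimated_volume_bytes(record_bytes: int, records_per_second: float, seconds: float) -> float:
    """Project the total payload volume for a given rate and duration."""
    return record_bytes * records_per_second * seconds


# Hypothetical record in the assumed "timestamp,company,close" layout.
sample_record = "2025-05-13 09:30:00,AAPL,211.26"
size = record_size_bytes(sample_record)

# Assumed rate: 4 records per second, sustained for one day.
daily = estimated_volume_bytes(size, records_per_second=4, seconds=24 * 60 * 60)
```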

## Velocity
### **PENDING**
The system’s ability to process streaming data in real time.
The performance can be obtained from the `processedRowsPerSecond` value reported in the query progress events
(for example, with a `StreamingQueryListener`).
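
One possible way to summarize that metric is sketched below in plain Python. It works on dictionaries shaped like the entries of `query[i].recentProgress`, which PySpark exposes as a list of dicts. The helper name `summarize_throughput` and the sample numbers are hypothetical and are not measurements from this project.

```python
from statistics import mean


def summarize_throughput(progress_events):
    """Summarize processedRowsPerSecond over the batches that received data."""
    rates = [
        event["processedRowsPerSecond"]
        for event in progress_events
        if event.get("numInputRows", 0) > 0 and "processedRowsPerSecond" in event
    ]
    if not rates:
        return None  # no batch has processed any rows yet
    return {
        "batches": len(rates),
        "mean": mean(rates),
        "min": min(rates),
        "max": max(rates),
    }


# Hypothetical progress events; real ones could come from query[i].recentProgress.
sample_events = [
    {"batchId": 0, "numInputRows": 0, "processedRowsPerSecond": 0.0},
    {"batchId": 1, "numInputRows": 240, "processedRowsPerSecond": 120.0},
    {"batchId": 2, "numInputRows": 300, "processedRowsPerSecond": 180.0},
]
summary = summarize_throughput(sample_events)
```

Empty batches are skipped so that idle triggers do not pull the average down.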

## Variety
The schema we are handling consists of:  
- timestamp
- company 
- open 
- high 
- low 
- close
- williams_r 
- ultimate_osc 
- rsi 
- ema 
- close_lag1 
- close_lag2 
- close_lag3 
- close_lag4 
- close_lag5 
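
As an illustration, the columns above can be written down as a name-to-type mapping and used to check a record. This is a sketch only: it assumes every numeric column is a float (the streamer's output schema declares `open`, `high`, `low`, and `close` as floats), and `EXPECTED_COLUMNS` and `schema_problems` are hypothetical names.

```python
from datetime import datetime

# Assumed Python types for the columns listed above.
EXPECTED_COLUMNS = {
    "timestamp": datetime,
    "company": str,
    **{
        name: float
        for name in ("open", "high", "low", "close",
                     "williams_r", "ultimate_osc", "rsi", "ema")
    },
    **{f"close_lag{k}": float for k in range(1, 6)},
}


def schema_problems(record):
    """Return a list of human-readable schema violations for one record."""
    problems = []
    for name, expected in EXPECTED_COLUMNS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif record[name] is not None and not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems
```

`None` values are accepted because lagged columns have no history for the first rows of a series.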

## Veracity
The data is historical and is extracted directly from Yahoo Finance, which ensures that the data our application receives is real market data.

## Value
The stored data includes the metrics listed above, calculated with the Technical Analysis (`ta`) library. These indicators allow us to analyze and deliver more specific insights.
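
To make two of these metrics concrete, here is a minimal plain-Python sketch of an exponential moving average and of lagged close values. It is not the `ta` library's implementation, whose warm-up handling may differ. It uses the common smoothing factor $\alpha = 2 / (\text{window} + 1)$ and seeds the average with the first value. The function names and sample prices are hypothetical.

```python
def ema(values, window):
    """Exponential moving average with alpha = 2 / (window + 1)."""
    if not values:
        return []
    alpha = 2 / (window + 1)
    result = [values[0]]  # seed with the first observation
    for value in values[1:]:
        result.append(alpha * value + (1 - alpha) * result[-1])
    return result


def lag(values, k):
    """Shift a series by k steps; the first k positions have no history."""
    return [values[i - k] if i >= k else None for i in range(len(values))]


# Made-up closing prices for illustration.
closes = [10.0, 11.0, 12.0]
ema_3 = ema(closes, window=3)
close_lag1 = lag(closes, 1)
```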

# Implementation Details

## Technological Stack
- `PySpark`: Python API for Apache Spark, which lets us process large volumes of data.
- `Kafka`: Distributed streaming platform for real-time data applications.
- `Pandas`: Python library for creating, analyzing, and manipulating dataframes.
- `Numpy`: Python library for scientific computing and data analysis.
- `YFinance`: Python library that downloads real historical values for selected companies from Yahoo Finance.
- `Technical Analysis (ta)`: Python library to calculate specific financial metrics, such as RSI and EMA.

## Design Choices
- We created a module called `stock_utils.py` that calculates the metrics mentioned above from the data obtained.
- We also created a dedicated Jupyter Notebook that acts as the data producer.
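
The streamer notebook calls `resample_and_aggregate(new_window=5)` from `stock_utils.py` to obtain a resampling function. The source of that helper is not reproduced here, so the following is only a plain-Python sketch of the general idea: a factory that returns a function turning (timestamp, close) ticks into open/high/low/close rows per fixed window. The name `make_resampler` and the sample ticks are hypothetical.

```python
from datetime import datetime, timedelta


def make_resampler(window_minutes):
    """Return a function that aggregates ticks into OHLC rows per window."""
    window = timedelta(minutes=window_minutes)

    def resample(ticks):
        buckets = {}
        for ts, close in sorted(ticks, key=lambda tick: tick[0]):
            # Floor the timestamp to the start of its window.
            start = ts - ((ts - datetime.min) % window)
            buckets.setdefault(start, []).append(close)
        return [
            {
                "timestamp": start,
                "open": closes[0],
                "high": max(closes),
                "low": min(closes),
                "close": closes[-1],
            }
            for start, closes in sorted(buckets.items())
        ]

    return resample


# Made-up ticks for one company.
ticks = [
    (datetime(2025, 5, 13, 9, 30), 10.0),
    (datetime(2025, 5, 13, 9, 31), 12.0),
    (datetime(2025, 5, 13, 9, 34), 11.0),
    (datetime(2025, 5, 13, 9, 35), 13.0),
]
bars = make_resampler(5)(ticks)
```

Returning a closure keeps the window size configurable while giving the caller a single-argument function, which is the shape a per-group function needs.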

# Analysis and Evaluation

### **PENDING**  
Analysis of the ML model performance
(precision, accuracy, recall, etc.) and functionality of the application.
In this section you need to paste your PowerBI Dashboard.

# Conclusion
This project was a challenge, especially reading all the collected data correctly. Using real, historical data adds value to the project, but it was also what kept us stuck for some time.  
We applied the concepts learned throughout the semester in one final application, and we think that is the most important part.

## Import Necessary Modules

In [40]:
import findspark
findspark.init()
import ta
from foraneos.stock_utils import resample_and_aggregate
from foraneos.stock_utils import SparkUtils as SpU

## Spark Session creation


In [41]:
from pyspark.sql import SparkSession

SPARK_SERVER = {'Konrad': '2453c3db49e4',
                'Aaron' : 'a5ab6bdab4b3',
                'Diego': '368ad5a83fd7'}
KAFKA_SERVER = {'Konrad': '4c63f45c41b4:9093',
                'Aaron' : '69b1b3611d90:9093',
                'Diego' : 'a27c998f34f5:9093'}
current_user = 'Diego'

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka") \
    .master("spark://{}:7077".format(SPARK_SERVER[current_user])) \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
    
sc = spark.sparkContext

spark.conf.set("spark.sql.shuffle.partitions", "5")

## Kafka Stream creation

In [42]:
streamer_lines = []

# Create one streaming reader per Kafka topic (stock_topic0 .. stock_topic3)
for i in range(4):
    streamer_lines.append(
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_SERVER[current_user])
        .option("subscribe", f"stock_topic{i}")
        .option("failOnDataLoss", "false")
        .load()
    )

### Setup Schema for Output DF

In [43]:

result_schema = SpU.generate_schema([
    ("timestamp", "timestamp"),
    ("company", "string"),
    ("open", "float"),
    ("high", "float"),
    ("low", "float"),
    ("close", "float"),
])

## Defining Sink Transformation on DF

In [44]:
from pyspark.sql.functions import col, split
from pyspark.sql.types import DoubleType, TimestampType

streamer_df = []

for i in range(4):
    # Split the CSV payload and turn it into typed Spark columns
    df = streamer_lines[i].withColumn("value_str", col("value").cast("string"))
    df = df.withColumn("split", split(col("value_str"), ","))
    df = df.withColumn("timestamp", col("split").getItem(0).cast(TimestampType())) \
        .withColumn("company", col("split").getItem(1)) \
        .withColumn("close", col("split").getItem(2).cast(DoubleType())) \
        .select("timestamp", "company", "close")

    # Set up the resampling window for the UDF
    custom_resampler = resample_and_aggregate(new_window=5)

    # Create a 10-minute watermark, sufficient for our 5-minute resampling window,
    # and apply the custom resampling function
    resampled_df = df.withWatermark("timestamp", "10 minutes") \
        .groupBy("company").applyInPandas(custom_resampler, schema=result_schema)

    streamer_df.append(resampled_df)
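
The split-and-cast step can be mirrored in plain Python to show what happens to a single message. This is a sketch under assumptions: the timestamp format `%Y-%m-%d %H:%M:%S` and the sample payload are hypothetical, and `parse_record` is not part of the project. Unlike Spark, which turns a failed cast into a null in that column, this sketch rejects the whole record.

```python
from datetime import datetime


def parse_record(value: bytes):
    """Decode one Kafka value and split it into (timestamp, company, close)."""
    fields = value.decode("utf-8").split(",")
    if len(fields) < 3:
        return None  # not enough CSV fields
    try:
        timestamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S")
        close = float(fields[2])
    except ValueError:
        return None  # malformed timestamp or price
    return timestamp, fields[1], close


# Hypothetical payload in the assumed layout.
parsed = parse_record(b"2025-05-13 09:30:00,AAPL,211.26")
```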



## Sink configuration

In [45]:
query = []

# Write each stream to Parquet files and keep the running queries in a list
for i in range(4):
    query.append(
        streamer_df[i]
        .writeStream
        .outputMode("append")
        .trigger(processingTime="120 seconds")
        .format("parquet")
        .option("path", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/output{i}/")
        .option("checkpointLocation", f"/home/jovyan/notebooks/data/final_project_ParDeForaneos/checkpoints/stock_topic{i}")
        .start()
    )

25/05/14 00:43:37 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


## STOP STREAMERS

In [46]:
for i in range(4):
    query[i].stop()

In [47]:
sc.stop()