---
Author: Mustapha Bouhsen <br>
[LinkedIn](https://www.linkedin.com/in/mustapha-bouhsen/)<br>
[Git](https://github.com/mus514)<br>
Date: February 14, 2024<br>
---

In [0]:
%run Repos/bouhsen.m@gmail.com/ML_Pipeline_Hub/library/garch_model

In [0]:
%run Repos/bouhsen.m@gmail.com/ML_Pipeline_Hub/library/daily_utilities

## Creating a table containing the log return of each stock

The log return is given by:

$$
r_t = \log\left(\frac{P_t}{P_{t-1}}\right)
$$

where $P_t$ is the stock price at time $t$.
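
As a quick illustration of the formula, here is a minimal plain-Python sketch. The price values and the `log_returns` helper are hypothetical and are not part of the pipeline; the notebook itself computes the returns with Spark.

```python
import math

# Hypothetical closing prices, oldest first (illustrative values only)
prices = [100.0, 102.0, 101.0, 105.0]

def log_returns(prices):
    """Return r_t = log(P_t / P_{t-1}) for every t after the first price."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

returns = log_returns(prices)

# Log returns are additive: their sum equals the log return over the whole period
total = math.log(prices[-1] / prices[0])
assert math.isclose(sum(returns), total)
```

The first price has no predecessor, so a series of $n$ prices yields $n - 1$ returns. This is why the Spark version produces one null row that has to be dropped.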

In [0]:
#-----------------------------------------
# Set the raw and prod folder paths and the list of stocks
#-----------------------------------------
raw_folder_path = "/mnt/raw/"
prod_folder_path = "/mnt/prod/"

stocks = ["aapl", "amzn", "googl", "msft"]

In [0]:
#-----------------------------------------
# Load the stocks_prices table and calculate the log return for each stock
#-----------------------------------------
# Window and F may already be provided by the %run libraries above;
# importing them here keeps the cell self-contained
from pyspark.sql import Window
from pyspark.sql import functions as F

df = spark.sql("SELECT * FROM stocks_prices")

# Window over the whole table ordered by date (no partitioning, so Spark
# moves all rows to a single partition; acceptable for a small daily table)
window_spec = Window.orderBy("date")

# Replace each price column with its log return, rounded to 6 decimals
for stock in stocks:
    log_price = F.log(F.col(stock))
    log_return = log_price - F.lag(log_price).over(window_spec)
    df = df.withColumn(stock, F.round(log_return, 6))

# Drop rows containing nulls (the first date has no previous price)
df = df.na.drop()

# Drop the existing table, if any, before recreating it
if spark.catalog.tableExists("stocks_returns"):
    spark.sql("DROP TABLE stocks_returns")
    print("Dropped table: stocks_returns")

# Create the table
df.write.format("parquet").saveAsTable("stocks_returns")

In [0]:
%sql
-- Display the stocks_returns table
SELECT *
FROM stocks_returns
ORDER BY date DESC
LIMIT 10

date,aapl,amzn,msft,googl
2024-02-15,-0.001576,-0.006926,-0.007181,-0.021961
2024-02-14,-0.004821,0.013781,0.009619,0.005497
2024-02-13,-0.011338,-0.021703,-0.021764,-0.016333
2024-02-12,-0.009043,-0.012169,-0.012659,-0.009915
2024-02-09,0.004086,0.026782,0.015432,0.020957
2024-02-08,-0.005771,-0.004055,0.000145,0.002539
2024-02-07,0.000581,0.008125,0.02089,0.009943
2024-02-06,0.008595,-0.006835,-0.000394,0.002919
2024-02-05,0.009799,-0.008769,-0.013638,0.009089
2024-02-02,-0.00542,0.075726,0.018259,0.008605


In [0]:
#-----------------------------------------
# Write the returns to the prod folder
#-----------------------------------------
# Temp folder for the intermediate CSV files
temp_folder = prod_folder_path + "temp/"

# Write the dataframe as CSV
df.write.mode("overwrite").option("header", "True").csv(temp_folder)

# Get the paths of all files ending with .csv
files_paths = get_files_paths_from_folders(temp_folder, ".csv")

# Ingest the CSV files and write them as parquet to the final destination
ingest_and_transform_to_parquet(files_paths, prod_folder_path, "returns")

# Delete the temp folder
delete_contents_recursively(temp_folder)