# IBM Advanced Data Science Capstone: Forecasting Stock Prices

## Extract-Transform-Load (ETL) 

### 1. Data Cleansing
 - Keep both data files, the main stocks dataset and the reference sectors dataset, in the same directory so that they can be read with relative paths.
 - Drop any index or unnamed columns from the CSV files, or do not read them into the Spark DataFrames. A row index makes otherwise identical rows look distinct, which hides duplicates and adds noise to the dataset.
 - Handle NULL values in the stock prices and volumes. The univariate analysis shows many outliers in the stocks dataset, so the median is the preferred imputation value because it is less sensitive to outliers than the mean. Note that the cleansing cell in this section currently drops rows with NULLs (`na.drop()`) rather than imputing them.
 - Cast every feature in the datasets to the correct data type.
 - Rename the columns to remove whitespace, so that SQL queries can be run on the Spark DataFrames without quoting column names.

In [92]:
# locate the local Spark installation and make pyspark importable
import findspark
findspark.init()

# keep wide DataFrame output on one line instead of wrapping it
from IPython.display import HTML, display
display(HTML("<style>pre { white-space: pre !important; }</style>"))

# import all the pyspark dependencies 
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType, DateType
from pyspark.sql.functions import *
import pyspark.sql.functions as F

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor

# create (or reuse) a local SparkContext that uses all available cores
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

# create (or reuse) the SparkSession used for all DataFrame operations
spark = SparkSession \
    .builder \
    .getOrCreate()

# import basic data analysis libraries  
import numpy as np
import pandas as pd
import scipy.stats as stats

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

In [93]:
data_stocks = spark.read.csv('kaggle_stock_data.csv', header=True).drop('_c0')
data_stocks.show(5)

+----------+--------------------+----------+-----------+----------+-------+
|Instrument|                Date|Price High|Price Close|Price Open| Volume|
+----------+--------------------+----------+-----------+----------+-------+
|   CBKG.DE|2019-01-02T00:00:00Z|     5.804|      5.765|     5.782|7221471|
|   CBKG.DE|2019-01-03T00:00:00Z|      5.95|      5.802|     5.748|8064658|
|   CBKG.DE|2019-01-04T00:00:00Z|     6.168|      6.143|      5.89|8772521|
|   CBKG.DE|2019-01-07T00:00:00Z|     6.249|      6.182|     6.242|6781840|
|   CBKG.DE|2019-01-08T00:00:00Z|      6.39|       6.33|     6.172|8472530|
+----------+--------------------+----------+-----------+----------+-------+
only showing top 5 rows



In [94]:
data_sectors = spark.read.csv('kaggle_stock_sector_information.csv', header=True).drop('_c0')
data_sectors.show(5)

+----------+--------------------+-------------------------+-------------------------+------------------------+--------------------+--------------------+-------------------------+-------------------------+------------------------+------------------+------------------+
|Instrument| Company Common Name|TRBC Economic Sector Name|TRBC Business Sector Name|TRBC Industry Group Name|  TRBC Industry Name|  TRBC Activity Name|TRBC Economic Sector Code|TRBC Business Sector Code|TRBC Industry Group Code|TRBC Industry Code|TRBC Activity Code|
+----------+--------------------+-------------------------+-------------------------+------------------------+--------------------+--------------------+-------------------------+-------------------------+------------------------+------------------+------------------+
|   CBKG.DE|      Commerzbank AG|               Financials|     Banking & Investm...|        Banking Services|               Banks|         Banks (NEC)|                       55|                  

In [95]:
# Define the custom function to replace spaces with underscores in column names
def clean_feature_names(df):
    new_columns = [col(col_name).alias(col_name.replace(" ", "_")) for col_name in df.columns]
    return df.select(*new_columns)

data_stocks_cleaned = clean_feature_names(data_stocks)
data_stocks_cleaned.show(5)

+----------+--------------------+----------+-----------+----------+-------+
|Instrument|                Date|Price_High|Price_Close|Price_Open| Volume|
+----------+--------------------+----------+-----------+----------+-------+
|   CBKG.DE|2019-01-02T00:00:00Z|     5.804|      5.765|     5.782|7221471|
|   CBKG.DE|2019-01-03T00:00:00Z|      5.95|      5.802|     5.748|8064658|
|   CBKG.DE|2019-01-04T00:00:00Z|     6.168|      6.143|      5.89|8772521|
|   CBKG.DE|2019-01-07T00:00:00Z|     6.249|      6.182|     6.242|6781840|
|   CBKG.DE|2019-01-08T00:00:00Z|      6.39|       6.33|     6.172|8472530|
+----------+--------------------+----------+-----------+----------+-------+
only showing top 5 rows



In [96]:
data_sectors_cleaned = clean_feature_names(data_sectors)
data_sectors_cleaned.show(5)

+----------+--------------------+-------------------------+-------------------------+------------------------+--------------------+--------------------+-------------------------+-------------------------+------------------------+------------------+------------------+
|Instrument| Company_Common_Name|TRBC_Economic_Sector_Name|TRBC_Business_Sector_Name|TRBC_Industry_Group_Name|  TRBC_Industry_Name|  TRBC_Activity_Name|TRBC_Economic_Sector_Code|TRBC_Business_Sector_Code|TRBC_Industry_Group_Code|TRBC_Industry_Code|TRBC_Activity_Code|
+----------+--------------------+-------------------------+-------------------------+------------------------+--------------------+--------------------+-------------------------+-------------------------+------------------------+------------------+------------------+
|   CBKG.DE|      Commerzbank AG|               Financials|     Banking & Investm...|        Banking Services|               Banks|         Banks (NEC)|                       55|                  

In [97]:
data_stocks_cleaned = data_stocks_cleaned.select(
    col("Instrument").cast(StringType()).alias("Instrument"),
    col("Date").cast(DateType()).alias("Date"),
    col("Price_High").cast(DoubleType()).alias("Price_High"),
    col("Price_Close").cast(DoubleType()).alias("Price_Close"),
    col("Price_Open").cast(DoubleType()).alias("Price_Open"),
    col("Volume").cast(IntegerType()).alias("Volume")
    
)
data_stocks_cleaned.show()

+----------+----------+----------+-----------+----------+--------+
|Instrument|      Date|Price_High|Price_Close|Price_Open|  Volume|
+----------+----------+----------+-----------+----------+--------+
|   CBKG.DE|2019-01-02|     5.804|      5.765|     5.782| 7221471|
|   CBKG.DE|2019-01-03|      5.95|      5.802|     5.748| 8064658|
|   CBKG.DE|2019-01-04|     6.168|      6.143|      5.89| 8772521|
|   CBKG.DE|2019-01-07|     6.249|      6.182|     6.242| 6781840|
|   CBKG.DE|2019-01-08|      6.39|       6.33|     6.172| 8472530|
|   CBKG.DE|2019-01-09|     6.432|       6.22|     6.401| 7686557|
|   CBKG.DE|2019-01-10|     6.302|       6.28|     6.136| 5269389|
|   CBKG.DE|2019-01-11|     6.423|       6.35|     6.287| 8684431|
|   CBKG.DE|2019-01-14|     6.319|       6.27|     6.319| 4784613|
|   CBKG.DE|2019-01-15|      6.42|       6.19|      6.32| 7935736|
|   CBKG.DE|2019-01-16|     6.692|      6.649|     6.261|13097725|
|   CBKG.DE|2019-01-17|     6.563|      6.424|     6.537| 8926

In [98]:
data_stocks_cleaned.dtypes

[('Instrument', 'string'),
 ('Date', 'date'),
 ('Price_High', 'double'),
 ('Price_Close', 'double'),
 ('Price_Open', 'double'),
 ('Volume', 'int')]

In [99]:
# count the distinct rows that occur more than once across all columns
data_stocks_cleaned.groupBy(data_stocks_cleaned.columns).count().filter("count > 1").count()

15257

In [100]:
data_stocks_cleaned = data_stocks_cleaned.dropDuplicates()
data_sectors_cleaned = data_sectors_cleaned.dropDuplicates()

In [101]:
data_stocks_cleaned.groupBy(data_stocks_cleaned.columns).count().filter("count > 1").count()

0
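
The duplicate check above counts distinct rows that occur more than once, which is not the same as the number of rows that `dropDuplicates()` removes. A minimal sketch on toy rows makes the difference concrete. The rows and the names `duplicated_groups` and `surplus_rows` are illustrative assumptions, not values from the dataset.

```python
from collections import Counter

# Toy rows (illustrative values only): one row appears three times,
# another appears twice, and one is unique.
rows = [
    ("AAA", "2019-01-02", 5.0),
    ("AAA", "2019-01-02", 5.0),
    ("AAA", "2019-01-02", 5.0),
    ("AAA", "2019-01-03", 6.0),
    ("BBB", "2019-01-02", 7.0),
    ("BBB", "2019-01-02", 7.0),
]

counts = Counter(rows)

# What groupBy(all columns).count().filter("count > 1").count() measures:
duplicated_groups = len([n for n in counts.values() if n > 1])

# How many rows a de-duplication step actually removes:
surplus_rows = len(rows) - len(counts)

# One copy of every distinct row, in first-seen order.
deduplicated = list(counts)

print(duplicated_groups, surplus_rows, len(deduplicated))
```

Under this reading, the 15257 reported above is the number of distinct rows that had at least one extra copy; the number of rows removed may be larger.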

In [102]:
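# Daily_Return is computed as open minus close, so a positive value means the price fell during the session
data_stocks_cleaned = data_stocks_cleaned.withColumn('Daily_Return', round(col('Price_Open') - col('Price_Close'),2))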
data_stocks_cleaned.show(5)

+----------+----------+----------+-----------+----------+--------+------------+
|Instrument|      Date|Price_High|Price_Close|Price_Open|  Volume|Daily_Return|
+----------+----------+----------+-----------+----------+--------+------------+
|   CBKG.DE|2020-03-30|     3.601|      3.338|     3.585|18925755|        0.25|
|   CBKG.DE|2022-02-21|     9.513|      9.153|       9.4|10191050|        0.25|
|   CBKG.DE|2022-11-03|     8.162|       8.08|     8.104| 4897012|        0.02|
|  DTEGn.DE|2020-10-20|    13.935|      13.68|      13.9|11136709|        0.22|
|  DTEGn.DE|2021-05-21|    17.224|      17.17|    17.004|10933087|       -0.17|
+----------+----------+----------+-----------+----------+--------+------------+
only showing top 5 rows
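
The cell above defines `Daily_Return` as `Price_Open - Price_Close`, so the first row (open 3.585, close 3.338) shows 0.25 on a day when the price fell. A more common convention, stated here as an assumption rather than as the author's intent, is close minus open, optionally scaled by the opening price. The sketch below uses hypothetical helper names and only the figures from that first row.

```python
def intraday_change(price_open, price_close):
    """Close minus open: positive when the price rose during the session."""
    return price_close - price_open


def intraday_return_pct(price_open, price_close):
    """Intraday change relative to the opening price, in percent."""
    return (price_close - price_open) / price_open * 100.0


# First row of the output above: open 3.585, close 3.338.
change = intraday_change(3.585, 3.338)
pct = intraday_return_pct(3.585, 3.338)

print(f"absolute change: {change:.3f}")
print(f"percent change : {pct:.2f}")
```

A relative return has the advantage of being comparable across instruments with very different price levels, which an absolute difference is not.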



In [103]:
data_stocks_cleaned = data_stocks_cleaned.withColumn("Month", month(col("Date")))
data_stocks_cleaned = data_stocks_cleaned.withColumn("Year", year(col("Date")))
data_stocks_cleaned = data_stocks_cleaned.withColumn("Quarter", quarter(col("Date")))
data_stocks_cleaned.show(5)

+----------+----------+----------+-----------+----------+--------+------------+-----+----+-------+
|Instrument|      Date|Price_High|Price_Close|Price_Open|  Volume|Daily_Return|Month|Year|Quarter|
+----------+----------+----------+-----------+----------+--------+------------+-----+----+-------+
|   CBKG.DE|2020-03-30|     3.601|      3.338|     3.585|18925755|        0.25|    3|2020|      1|
|   CBKG.DE|2022-02-21|     9.513|      9.153|       9.4|10191050|        0.25|    2|2022|      1|
|   CBKG.DE|2022-11-03|     8.162|       8.08|     8.104| 4897012|        0.02|   11|2022|      4|
|  DTEGn.DE|2020-10-20|    13.935|      13.68|      13.9|11136709|        0.22|   10|2020|      4|
|  DTEGn.DE|2021-05-21|    17.224|      17.17|    17.004|10933087|       -0.17|    5|2021|      2|
+----------+----------+----------+-----------+----------+--------+------------+-----+----+-------+
only showing top 5 rows
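
The calendar features above can be reproduced without Spark, which is handy for spot-checking rows. This is a minimal sketch; `calendar_parts` is a hypothetical helper, and the quarter formula assumes calendar quarters (January to March is quarter 1).

```python
from datetime import date


def calendar_parts(d):
    """Return the Year, Month and calendar Quarter of a date."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> 1, 4-6 -> 2, ...
    }


# Dates taken from the rows shown in the output above.
for d in (date(2020, 3, 30), date(2022, 11, 3), date(2021, 5, 21)):
    print(d.isoformat(), calendar_parts(d))
```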



In [104]:
arr_temp = ['Instrument','Price_High','Price_Open','Price_Close','Daily_Return','Volume','Year','Month','Quarter']
data_stocks_cleaned.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in arr_temp]).show()

+----------+----------+----------+-----------+------------+------+----+-----+-------+
|Instrument|Price_High|Price_Open|Price_Close|Daily_Return|Volume|Year|Month|Quarter|
+----------+----------+----------+-----------+------------+------+----+-----+-------+
|         0|       415|       772|        380|         775| 57783|  72|   72|     72|
+----------+----------+----------+-----------+------------+------+----+-----+-------+



In [105]:
# drop every row that has a NULL in any column
data_stocks_cleaned = data_stocks_cleaned.na.drop()
data_stocks_cleaned.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in arr_temp]).show()

+----------+----------+----------+-----------+------------+------+----+-----+-------+
|Instrument|Price_High|Price_Open|Price_Close|Daily_Return|Volume|Year|Month|Quarter|
+----------+----------+----------+-----------+------------+------+----+-----+-------+
|         0|         0|         0|          0|           0|     0|   0|    0|      0|
+----------+----------+----------+-----------+------------+------+----+-----+-------+
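
The cell above removes every row with a NULL, whereas the cleansing notes call for median imputation. The sketch below shows the idea on a toy list and why the median is the safer fill value when outliers are present. The function name `impute_with_median` and the sample values are illustrative assumptions.

```python
from statistics import mean, median


def impute_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to compute a median from
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Toy volume series with one extreme outlier and two missing values.
volumes = [7_000, 8_000, None, 9_000, 1_000_000, None]
observed = [v for v in volumes if v is not None]

print("mean of observed  :", mean(observed))    # pulled up by the outlier
print("median of observed:", median(observed))  # stays near typical values
print("imputed series    :", impute_with_median(volumes))
```

On a Spark DataFrame, `pyspark.ml.feature.Imputer` with `strategy="median"` offers a comparable per-column imputation based on an approximate median. Whether imputing or dropping is the better choice for the 57783 missing `Volume` values is a modelling decision that this sketch does not settle.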



In [106]:
data_stocks_agg = data_stocks_cleaned.groupBy("Instrument","Year","Quarter","Month").agg(
    round(avg(col("Price_High")), 2).alias("Avg_Price_High"),
    round(avg(col("Price_Open")), 2).alias("Avg_Price_Open"),
    round(avg(col("Price_Close")), 2).alias("Avg_Price_Close"),
    round(avg(col("Daily_Return")), 2).alias("Avg_Daily_Return"),
    round(sum(col("Volume")), 2).alias("Total_Volume")
).sort(["Instrument","Year","Quarter","Month"],ascending=[True, False, True, True])
data_stocks_agg.show(10)

+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+
|Instrument|Year|Quarter|Month|Avg_Price_High|Avg_Price_Open|Avg_Price_Close|Avg_Daily_Return|Total_Volume|
+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+
|   123F.DE|2023|      1|    1|          5.81|          5.52|           5.66|           -0.14|       20871|
|   123F.DE|2023|      1|    2|          6.28|          6.25|           6.13|            0.12|       15052|
|   123F.DE|2023|      1|    3|          5.13|           5.0|           4.94|            0.06|      126029|
|   123F.DE|2023|      2|    4|          5.28|          5.19|           5.24|           -0.05|       34716|
|   123F.DE|2023|      2|    5|          7.38|          7.15|           7.27|           -0.11|       25381|
|   123F.DE|2023|      2|    6|          7.37|          7.29|           7.23|            0.05|       47670|
|   123F.DE|2023|      3|   

### 2. Feature Engineering

In [108]:
# Step 1: StringIndexer for encoding 'Instrument'
indexer = StringIndexer(inputCol="Instrument", outputCol="Instrument_Encoded")
data_stocks_agg_idx = indexer.fit(data_stocks_agg).transform(data_stocks_agg)
data_stocks_agg_idx.show(5) 

+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+------------------+
|Instrument|Year|Quarter|Month|Avg_Price_High|Avg_Price_Open|Avg_Price_Close|Avg_Daily_Return|Total_Volume|Instrument_Encoded|
+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+------------------+
|   123F.DE|2023|      1|    1|          5.81|          5.52|           5.66|           -0.14|       20871|             382.0|
|   123F.DE|2023|      1|    2|          6.28|          6.25|           6.13|            0.12|       15052|             382.0|
|   123F.DE|2023|      1|    3|          5.13|           5.0|           4.94|            0.06|      126029|             382.0|
|   123F.DE|2023|      2|    4|          5.28|          5.19|           5.24|           -0.05|       34716|             382.0|
|   123F.DE|2023|      2|    5|          7.38|          7.15|           7.27|           -0.11|       25381|    

In [110]:
# Step 2: Assemble features into a vector
feature_cols = ["Avg_Price_High", "Avg_Price_Open", "Avg_Price_Close", "Avg_Daily_Return", "Total_Volume"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data_stocks_agg_assembled = assembler.transform(data_stocks_agg_idx)
data_stocks_agg_assembled.show()

+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+------------------+--------------------+
|Instrument|Year|Quarter|Month|Avg_Price_High|Avg_Price_Open|Avg_Price_Close|Avg_Daily_Return|Total_Volume|Instrument_Encoded|            features|
+----------+----+-------+-----+--------------+--------------+---------------+----------------+------------+------------------+--------------------+
|   123F.DE|2023|      1|    1|          5.81|          5.52|           5.66|           -0.14|       20871|             382.0|[5.81,5.52,5.66,-...|
|   123F.DE|2023|      1|    2|          6.28|          6.25|           6.13|            0.12|       15052|             382.0|[6.28,6.25,6.13,0...|
|   123F.DE|2023|      1|    3|          5.13|           5.0|           4.94|            0.06|      126029|             382.0|[5.13,5.0,4.94,0....|
|   123F.DE|2023|      2|    4|          5.28|          5.19|           5.24|           -0.05|       34716|     

In [111]:
# Step 3: Split the data into training and testing sets
(train_data, test_data) = data_stocks_agg_assembled.randomSplit([0.7, 0.3], seed=42)
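
`randomSplit` assigns rows to the two sets at random, so months from late in the series can end up in the training set while earlier months are held out. For a forecasting task, a chronological split is a common alternative. The sketch below is an assumption about how such a split could look, not the notebook's method; `chronological_split` and the toy rows are hypothetical.

```python
def chronological_split(rows, train_percent=70):
    """Split rows into (train, test) so that the earliest periods go to train."""
    ordered = sorted(rows, key=lambda r: (r["Year"], r["Month"]))
    cut = len(ordered) * train_percent // 100  # integer arithmetic, no float rounding
    return ordered[:cut], ordered[cut:]


# Toy monthly rows: January to May for two consecutive years (10 rows).
rows = [{"Year": y, "Month": m} for y in (2022, 2023) for m in range(1, 6)]

train, test = chronological_split(rows)
print("train ends at :", train[-1])
print("test starts at:", test[0])
```

With per-instrument data, the same cut would normally be applied to the period rather than to the row position, so that all instruments share one cutoff date.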

In [112]:
# Step 4: Train a Random Forest model
rf1 = RandomForestRegressor(featuresCol="features", labelCol="Avg_Price_Close", numTrees=100)
model_1 = rf1.fit(train_data)

In [117]:
# Step 5: Extract and display feature importances
feature_importances = model_1.featureImportances
print("Feature Importances For Closing Price: ")
# use a name other than `col` so the pyspark `col` function is not overwritten
for i, feature_name in enumerate(feature_cols):
    print(f"{feature_name}: {feature_importances[i]}")

Feature Importances For Closing Price: 
Avg_Price_High: 0.2307402176220589
Avg_Price_Open: 0.33246446801466234
Avg_Price_Close: 0.362523214435785
Avg_Daily_Return: 0.04032599536148937
Total_Volume: 0.03394610456600444
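
In the model above, the label `Avg_Price_Close` is also one of the columns in `feature_cols`, so the forest can read the target directly from its inputs and the importances are hard to interpret. A minimal sketch of one way to avoid this is to build the feature list per label. `features_for` is a hypothetical helper, and the column list simply repeats `feature_cols` from the notebook.

```python
ALL_COLS = [
    "Avg_Price_High",
    "Avg_Price_Open",
    "Avg_Price_Close",
    "Avg_Daily_Return",
    "Total_Volume",
]


def features_for(label, all_cols=ALL_COLS):
    """Return the candidate feature columns with the label column removed."""
    if label not in all_cols:
        raise ValueError(f"unknown label column: {label}")
    return [c for c in all_cols if c != label]


for label in ("Avg_Price_Close", "Total_Volume"):
    print(label, "<-", features_for(label))
```

The returned list could be passed as `inputCols` to a separate `VectorAssembler` for each target, so that every model is trained on a feature vector that excludes its own label.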


#### OBSERVATIONS 

In [118]:
rf2 = RandomForestRegressor(featuresCol="features", labelCol="Avg_Price_Open", numTrees=100)
model_2 = rf2.fit(train_data)

print("Feature Importances For Opening Price: \n")
# use a name other than `col` so the pyspark `col` function is not overwritten
for i, feature_name in enumerate(feature_cols):
    print(f"{feature_name}: {model_2.featureImportances[i]}")

Feature Importances For Opening Price: 
Avg_Price_High: 0.256838631234274
Avg_Price_Open: 0.34067750058294666
Avg_Price_Close: 0.32826684171434134
Avg_Daily_Return: 0.04301247867510776
Total_Volume: 0.031204547793330147


In [121]:
rf3 = RandomForestRegressor(featuresCol="features", labelCol="Avg_Price_High", numTrees=100)
model_3 = rf3.fit(train_data)

print("Feature Importances For Highest Price: \n")
# use a name other than `col` so the pyspark `col` function is not overwritten
for i, feature_name in enumerate(feature_cols):
    print(f"{feature_name}: {model_3.featureImportances[i]}")

Feature Importances For Highest Price: 

Avg_Price_High: 0.26224224673748286
Avg_Price_Open: 0.33570976827568877
Avg_Price_Close: 0.3285761220687858
Avg_Daily_Return: 0.04276698485837756
Total_Volume: 0.0307048780596651


In [122]:
rf4 = RandomForestRegressor(featuresCol="features", labelCol="Avg_Daily_Return", numTrees=100)
model_4 = rf4.fit(train_data)

print("Feature Importances For Daily Return: \n")
# use a name other than `col` so the pyspark `col` function is not overwritten
for i, feature_name in enumerate(feature_cols):
    print(f"{feature_name}: {model_4.featureImportances[i]}")

Feature Importances For Daily Return: 

Avg_Price_High: 0.05661544607438059
Avg_Price_Open: 0.06782776249494417
Avg_Price_Close: 0.06525228063587078
Avg_Daily_Return: 0.7098925044464978
Total_Volume: 0.1004120063483067


In [123]:
rf5 = RandomForestRegressor(featuresCol="features", labelCol="Total_Volume", numTrees=100)
model_5 = rf5.fit(train_data)

print("Feature Importances For Volume: \n")
# use a name other than `col` so the pyspark `col` function is not overwritten
for i, feature_name in enumerate(feature_cols):
    print(f"{feature_name}: {model_5.featureImportances[i]}")

Feature Importances For Volume: 

Avg_Price_High: 0.06127463731916735
Avg_Price_Open: 0.047322153491228175
Avg_Price_Close: 0.04814586525695915
Avg_Daily_Return: 0.017007722818292752
Total_Volume: 0.8262496211143525
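
The importances are printed in the order of `feature_cols`, which makes the dominant feature easy to miss. The sketch below sorts them from most to least important, using the values printed above for the volume model. The dictionary is typed in by hand from that output, and the variable names are illustrative.

```python
# Importances copied from the "Feature Importances For Volume" output above.
importances = {
    "Avg_Price_High": 0.06127463731916735,
    "Avg_Price_Open": 0.047322153491228175,
    "Avg_Price_Close": 0.04814586525695915,
    "Avg_Daily_Return": 0.017007722818292752,
    "Total_Volume": 0.8262496211143525,
}

# Sort (name, score) pairs by score, highest first.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:<18}{score:.4f}")
```

The ranking shows `Total_Volume` first, which is expected here because it is both the label and an input of that model.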
