# Stock-Market-Index-Price-Prediction

### Final Project for Introduction to Deep Learning Course at University of South Florida
### Team Members: Jun Kim, Gerardo Wibmer Gonzalez, Paul-Ann Francis, Tahsun Rahman Khan

## Overview
This project aims to predict the closing prices of stock market indices using deep learning models built with Python and TensorFlow. We have developed a two-model approach to achieve this goal. The first model (Model 1) predicts the closing price using four input features (Open, High, Low, and Volume) for a specific date. The second model (Model 2) predicts the future values of each feature based on their respective historical data. By combining the predictions from these two models, we can estimate the closing price for any given day.
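
The way the two models fit together can be sketched in a few lines. This is a minimal illustration rather than the notebook's code: `predict_close`, `last_value`, and `midpoint_of_high_low` are hypothetical stand-ins (plain functions instead of trained networks), used only to show the data flow from the per-feature forecasters into the closing-price model.

```python
FEATURES = ("Open", "High", "Low", "Volume")

def predict_close(history, feature_models, close_model):
    """Chain the two stages: forecast each feature, then map features to Close."""
    # Stage 1 (the role of Model 2): one forecaster per feature, each fed
    # only that feature's own history.
    next_features = [feature_models[name](history[name]) for name in FEATURES]
    # Stage 2 (the role of Model 1): map the four forecast features to a close.
    return close_model(next_features)

# Hypothetical stand-ins for the trained networks (assumptions for illustration).
def last_value(window):
    return float(window[-1])

def midpoint_of_high_low(features):
    _open, high, low, _volume = features
    return (high + low) / 2

history = {
    "Open": [100.0, 101.0, 102.0],
    "High": [103.0, 104.0, 105.0],
    "Low": [99.0, 100.0, 101.0],
    "Volume": [1.0e6, 1.1e6, 1.2e6],
}
estimate = predict_close(
    history, {name: last_value for name in FEATURES}, midpoint_of_high_low
)
print(estimate)  # 103.0
```

In the actual project, each stand-in would be replaced by the corresponding trained Keras model.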

Although our project focuses on predicting the S&P 500 index prices, the code can be adapted to another index or stock by changing the ticker symbol (`^GSPC`) passed to `yf.download`.

## Data
The data used in this project consists of daily stock market index prices, including Open, High, Low, Close, and Volume. We have used the historical price data of the S&P 500 index, obtained from Yahoo Finance, to train, validate, and test our models.

## Approach
1. Preprocessing: The raw data is preprocessed to create a windowed dataset, which includes the necessary features and target variables.
2. Model Training: We train separate instances of Model 2 for each feature (Open, High, Low, and Volume) using their historical data. This results in four different models that predict future values for each feature. Then, we train Model 1 using the combined features from Model 2's predictions.
3. Hyperparameter Tuning: We used grid search to find the best hyperparameters for our models. The optimal hyperparameters and their corresponding performance metrics are stored in CSV files under the hyperparameters/ directory.
4. Prediction: We input the predicted values of Open, High, Low, and Volume from Model 2 into Model 1 to predict the closing price for any given day.
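
The windowing idea in step 1 can be illustrated with a simplified sketch. It is not the notebook's helper (which works on NumPy arrays and also tracks dates); `make_windows` is a hypothetical name, and plain lists are used so the input/target pairing is easy to read.

```python
def make_windows(values, window_size):
    """Pair each run of `window_size` consecutive values with the value that follows it."""
    windows, targets = [], []
    for i in range(len(values) - window_size):
        windows.append(values[i:i + window_size])  # model input
        targets.append(values[i + window_size])    # next-day value to predict
    return windows, targets

prices = [10, 11, 12, 13, 14, 15]
windows, targets = make_windows(prices, window_size=3)
for window, target in zip(windows, targets):
    print(window, "->", target)
# [10, 11, 12] -> 13
# [11, 12, 13] -> 14
# [12, 13, 14] -> 15
```

A series of length `n` yields `n - window_size` pairs, so a larger window trades training examples for more context per example.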

## Repository Structure
- models/: Contains the saved deep learning models (Model 1 and Model 2 instances) for each feature.
- hyperparameters/: Contains CSV files with the optimal hyperparameters and their corresponding performance metrics, obtained through grid search.
- notebooks/: Contains Jupyter notebooks for data preprocessing, model training, and evaluation.
- README.md: Provides an overview of the project, including the approach, data, and repository structure.

## Dependencies
- Python 3.8 or higher
- datetime (Python standard library)
- matplotlib
- NumPy
- pandas
- scikit-learn
- TensorFlow
- Keras (bundled with TensorFlow as `tensorflow.keras`)
- yfinance

## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.

In [None]:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LeakyReLU, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
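# Note: tensorflow.keras.wrappers.scikit_learn was removed in TensorFlow 2.12;
# on newer versions, the separate scikeras package provides a KerasRegressor replacement.
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor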
import yfinance as yf

# Preprocessing

In [None]:
TRAIN_START_DATE = '1960-01-01'
TRAIN_END_DATE = '2015-12-31'
PREDICT_START_DATE = '2016-01-01'
PREDICT_END_DATE = '2019-12-31'
WINDOW_SIZE = 7

In [None]:
# Download S&P 500 data from Yahoo Finance
df = yf.download('^GSPC', start=TRAIN_START_DATE, end=PREDICT_END_DATE)

df = df.reset_index()

def str_to_datetime(s):
    """Parse a 'YYYY-MM-DD' string into a datetime.datetime."""
    year, month, day = (int(part) for part in s.split('-'))
    return datetime.datetime(year=year, month=month, day=day)

df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')
df['Date'] = df['Date'].apply(str_to_datetime)

df.index = df.pop('Date')

# Drop columns that are not needed
# (errors='ignore' because some yfinance versions do not return 'Adj Close')
df = df.drop(columns=['Adj Close'], errors='ignore')

In [None]:
# Split data into train and test sets
# (copy so that pop() below does not modify a slice of df)
X_train = df[:TRAIN_END_DATE].copy()
X_test = df[PREDICT_START_DATE:PREDICT_END_DATE].copy()

y_train = X_train.pop('Close')
y_test = X_test.pop('Close')

In [None]:
def df_to_windowed_df(feature, df, window_size, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE):
    """Converts a dataframe into a windowed dataframe and date list"""
    df = df[start_date:end_date]
    date_list = df.index.to_list()
    feature_values = df[feature].to_numpy()
    windowed_df = []
    for i in range(len(feature_values) - window_size):
        windowed_df.append(feature_values[i:i+window_size])
    return np.array(windowed_df), date_list[window_size:]

# Create windowed dataframes and date_list
open_windowed_df, date_list = df_to_windowed_df('Open', df, window_size=WINDOW_SIZE, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
high_windowed_df, date_list = df_to_windowed_df('High', df, window_size=WINDOW_SIZE, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
low_windowed_df, date_list = df_to_windowed_df('Low', df, window_size=WINDOW_SIZE, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
volume_windowed_df, date_list = df_to_windowed_df('Volume', df, window_size=WINDOW_SIZE, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)

In [None]:
to_combine_open_windowed_df, _ = df_to_windowed_df('Open', df, window_size=1, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
to_combine_high_windowed_df, _ = df_to_windowed_df('High', df, window_size=1, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
to_combine_low_windowed_df, _ = df_to_windowed_df('Low', df, window_size=1, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)
to_combine_volume_windowed_df, _ = df_to_windowed_df('Volume', df, window_size=1, start_date=PREDICT_START_DATE, end_date=PREDICT_END_DATE)

# Stack the windowed dataframes along the third axis
stacked_windowed_df = np.stack([open_windowed_df, high_windowed_df, low_windowed_df, volume_windowed_df], axis=-1)

# Ensure the stacked array has shape (num_windows, WINDOW_SIZE, 4)
combined_windowed_df = stacked_windowed_df.reshape((-1, WINDOW_SIZE, 4))

In [None]:
def windowed_df_to_date_X_y(windowed_df, date_list):
    """Converts a windowed dataframe and date list into a date, X, and y dataframe"""
    date_df = []
    X_df = []
    y_df = []
    for i in range(len(windowed_df) - 1):  # Stop one window early: the target comes from window i + 1
        date_df.append(date_list[i])       # Date of the target value (the day right after window i)
        X_df.append(windowed_df[i])
        if windowed_df.ndim == 3:
            y_df.append(windowed_df[i + 1][-1][-1])  # Shift the target y value by 1 (for 3D input)
        else:
            y_df.append(windowed_df[i + 1][-1])  # Shift the target y value by 1 (for 2D input)
    return np.array(date_df), np.array(X_df), np.array(y_df)

# Build the dates, inputs (X), and next-day targets (y) for each feature
date_df, X_open, y_open = windowed_df_to_date_X_y(open_windowed_df, date_list)
_, X_high, y_high = windowed_df_to_date_X_y(high_windowed_df, date_list)
_, X_low, y_low = windowed_df_to_date_X_y(low_windowed_df, date_list)
_, X_volume, y_volume = windowed_df_to_date_X_y(volume_windowed_df, date_list)
_, X_train_combined, _ = windowed_df_to_date_X_y(combined_windowed_df, date_list)
y_train_combined = y_train[len(y_train) - len(X_train_combined):].to_numpy()

# Model 2

In [None]:
class CreateModel:
    def __init__(self, feature, dates, X_train, y_train, window_size, params=None):
        self.feature = feature
        self.dates = dates
        self.X = X_train    
        self.y = y_train
        self.window_size = window_size
        self.params = params

        if self.params is None:
            self.best_model = self.train_model(True)
        else:
            self.best_model = self.train_model(False)

    def create_model(self, lstm_units=64, dense_units=32, learning_rate=0.001, lstm_activation='tanh', dense_activation='relu'):
        model = Sequential([
            Input((self.window_size, 1)),
            LSTM(lstm_units, activation=lstm_activation),
        ])

        if dense_activation == 'leaky_relu':
            model.add(Dense(dense_units))
            model.add(LeakyReLU(alpha=0.3))
            model.add(Dense(dense_units))
            model.add(LeakyReLU(alpha=0.3))
        else:
            model.add(Dense(dense_units, activation=dense_activation))
            model.add(Dense(dense_units, activation=dense_activation))

        # Single-unit output layer, added for every activation choice
        model.add(Dense(1))

        model.compile(loss='mse',
                    optimizer=Adam(learning_rate=learning_rate),
                    metrics=['mean_absolute_error'])

        return model

    def train_model(self, grid_search):
        # Check if GPU is available
        if tf.config.list_physical_devices('GPU'):
            print("Using GPU")
            # Set the device to GPU:0
            with tf.device('/GPU:0'):
                best_model = self.train_model_helper(grid_search)
        else:
            print("Using CPU")
            # If GPU is not available, perform grid search on CPU
            best_model = self.train_model_helper(grid_search)

        return best_model

    def train_model_helper(self, grid_search):
        # Wrap the create_model function with KerasRegressor
        model = KerasRegressor(build_fn=self.create_model, verbose=1)

        # Define the grid search parameters
        param_grid = {
            'lstm_units': [8, 16],
            'dense_units': [4, 8],
            'learning_rate': [0.0001, 0.0005],
            'epochs': [30, 50],
            'batch_size': [8, 16],
            'lstm_activation': ['tanh', 'relu'],
            'dense_activation': ['relu', 'elu']
        }

        # Create the GridSearchCV object
        grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=6, cv=3, verbose=1)

        # Split the data into training and validation
        q_80 = int(len(self.dates) * .8)

        self.dates_train, X_train, y_train = self.dates[:q_80], self.X[:q_80], self.y[:q_80]
        self.dates_val, X_val, y_val = self.dates[q_80:], self.X[q_80:], self.y[q_80:]

        # Flatten the training data for GridSearchCV
        X_train_flat = np.reshape(X_train, (X_train.shape[0], -1))
        y_train_flat = y_train.flatten()
        
        if grid_search:
            # Perform grid search
            grid_result = grid.fit(X_train_flat, y_train_flat)

            # Print the best hyperparameters
            print("Best score: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

            # Create a model with the best hyperparameters
            best_params = grid_result.best_params_

            # Save the best hyperparameters
            best_params_df = pd.DataFrame(best_params, index=[0])
            best_params_df.to_csv(f'../hyperparameters/{self.feature}_best_params.csv', index=False)

            best_epochs = best_params.pop('epochs')
            best_batch_size = best_params.pop('batch_size')
            best_model = self.create_model(**best_params)
        else:
            print("Using best hyperparameters from previous run")
            best_epochs = self.params.pop('epochs')
            best_batch_size = self.params.pop('batch_size')
            best_model = self.create_model(**self.params)

        # Train the best model on the training data using the selected batch_size and epochs
        best_model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=best_epochs, batch_size=best_batch_size)

        return best_model

    def predict(self):
        train_predictions = self.best_model.predict(self.X).flatten()
        return train_predictions

    def plot(self):
        plt.figure(figsize=(20, 10))
        plt.plot(self.dates, self.y, label='Actual')
        plt.plot(self.dates, self.predict(), label='Predicted')
        plt.legend()

## Only run when you want to run the grid search
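
Before starting the search, it may help to estimate how many training runs it involves. The sketch below copies the `param_grid` values and the `cv=3` setting from `train_model_helper` and counts the combinations. It only counts fits and does not train anything.

```python
from itertools import product

# Same grid as in CreateModel.train_model_helper
param_grid = {
    'lstm_units': [8, 16],
    'dense_units': [4, 8],
    'learning_rate': [0.0001, 0.0005],
    'epochs': [30, 50],
    'batch_size': [8, 16],
    'lstm_activation': ['tanh', 'relu'],
    'dense_activation': ['relu', 'elu'],
}
CV_FOLDS = 3  # matches cv=3 passed to GridSearchCV

combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * CV_FOLDS
print(len(combinations), total_fits)  # 128 384
```

That is 384 cross-validation fits per feature, each running for 30 or 50 epochs, and the search is repeated for Open, High, Low, and Volume.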

In [None]:
# Omitting params makes CreateModel run the grid search.
# print("Creating Open Model")
# open_model = CreateModel('Open', date_df, X_open, y_open, WINDOW_SIZE)
# print("Creating High Model")
# high_model = CreateModel('High', date_df, X_high, y_high, WINDOW_SIZE)
# print("Creating Low Model")
# low_model = CreateModel('Low', date_df, X_low, y_low, WINDOW_SIZE)
# print("Creating Volume Model")
# volume_model = CreateModel('Volume', date_df, X_volume, y_volume, WINDOW_SIZE)
# print("Done")

## Only run if you want to train and save the models

In [None]:
open_params = pd.read_csv('../hyperparameters/Open_best_params.csv').to_dict(orient='records')[0]
open_model = CreateModel('Open', date_df, X_open, y_open, WINDOW_SIZE, params=open_params)

In [None]:
open_model.plot()

In [None]:
open_model.best_model.save('../models/open_model.h5')

In [None]:
high_params = pd.read_csv('../hyperparameters/High_best_params.csv').to_dict(orient='records')[0]
high_model = CreateModel('High', date_df, X_high, y_high, WINDOW_SIZE, params=high_params)

In [None]:
high_model.plot()

In [None]:
high_model.best_model.save('../models/high_model.h5')

In [None]:
low_params = pd.read_csv('../hyperparameters/Low_best_params.csv').to_dict(orient='records')[0]
low_model = CreateModel('Low', date_df, X_low, y_low, WINDOW_SIZE, params=low_params)

In [None]:
low_model.plot()

In [None]:
low_model.best_model.save('../models/low_model.h5')

In [None]:
volume_params = pd.read_csv('../hyperparameters/Volume_best_params.csv').to_dict(orient='records')[0]
volume_model = CreateModel('Volume', date_df, X_volume, y_volume, WINDOW_SIZE, params=volume_params)

In [None]:
volume_model.plot()

In [None]:
volume_model.best_model.save('../models/volume_model.h5')

## Restore models
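Run this cell even if you have just trained the models: it rebinds `open_model`, `high_model`, `low_model`, and `volume_model` to the saved Keras models, which the plotting cells below call `predict` on directly.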

In [None]:
open_model = tf.keras.models.load_model('../models/open_model.h5')
high_model = tf.keras.models.load_model('../models/high_model.h5')
low_model = tf.keras.models.load_model('../models/low_model.h5')
volume_model = tf.keras.models.load_model('../models/volume_model.h5')

## Plot Restored Model

In [None]:
# Plot the predicted and actual values
plt.figure(figsize=(20, 10))
plt.plot(date_df, y_open, label='Actual Open')
plt.plot(date_df, open_model.predict(X_open), label='Predicted Open')
plt.legend()

In [None]:
# Plot the predicted and actual values
plt.figure(figsize=(20, 10))
plt.plot(date_df, y_high, label='Actual High')
plt.plot(date_df, high_model.predict(X_high), label='Predicted High')
plt.legend()

In [None]:
# Plot the predicted and actual values
plt.figure(figsize=(20, 10))
plt.plot(date_df, y_low, label='Actual Low')
plt.plot(date_df, low_model.predict(X_low), label='Predicted Low')
plt.legend()

In [None]:
# Plot the predicted and actual values
plt.figure(figsize=(20, 10))
plt.plot(date_df, y_volume, label='Actual Volume')
plt.plot(date_df, volume_model.predict(X_volume), label='Predicted Volume')
plt.legend()
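
Alongside the plots, a numeric summary of the error can make the four models easier to compare. The following is a minimal sketch in plain Python: `mae` and `rmse` are hypothetical helper names, and the sample numbers are made up purely to show the arithmetic.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up values for illustration only
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 105.0]

print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # square root of 1.5
```

With the notebook's arrays, a call such as `mae(y_open, open_model.predict(X_open).flatten())` should give the corresponding figure for the Open model, assuming the restored model and `X_open` are already in scope.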