<a href="https://colab.research.google.com/github/jeet1912/ms/blob/main/ds677/assignments/week2/DS677_Week2HW2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Week 2 Homework 2: S&P 500 Index Prediction**

**Objective:** Predict the next value of the S&P 500 index based on historical data points.

The S&P 500, also known as the Standard & Poor's 500, is a stock market index representing the performance of 500 major companies listed on U.S. exchanges. We aim to train a linear regression model on the training dataset using gradient descent. We will use the test set as a validation set to determine the optimal number of features (the `Sequence Length`) and the corresponding model parameters.





----

`Sequence Length` refers to the number of historical data points used to predict the next value in a time series. In other words, given the previous N data points, you predict the next value. N can be 1, ..., n, and N is the sequence length.
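
To make the windowing concrete, here is a minimal sketch that turns a toy series into (features, target) pairs for N = 3. The helper name `make_windows` and the toy values are illustrative assumptions, not part of the assignment code.

```python
import numpy as np

def make_windows(series, n):
    """Split `series` into rows of n past values (features) and the value that follows (target)."""
    series = np.asarray(series, dtype=float)
    # Each window holds n inputs plus the next value to predict.
    windows = np.array([series[i:i + n + 1] for i in range(len(series) - n)])
    return windows[:, :-1], windows[:, -1]

# Toy series of six points; with N = 3 this gives three training samples.
features, target = make_windows([10, 11, 12, 13, 14, 15], n=3)
print(features)  # rows of three consecutive values each
print(target)    # the value that follows each row
```

Each row of `features` holds N consecutive values, and the matching entry of `target` is the value that immediately follows them.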



**Your tasks for this homework:**
1.   Implement the gradient descent method yourself
2.   Choose the optimal value for `Sequence Length` and familiarize yourself with model tuning

# Download Data
Download the data from Google Drive.

You should have
- `SPY_dataset.csv`: the dataset, which includes the date and corresponding SPY close prices.

after running the following block.

In [None]:
!pip install --upgrade gdown

# Main link
!gdown --id '1UH1H8dmYuOcfPRPVDYLI1AIIcwPoqVqW' --output SPY_dataset.csv

Downloading...
From: https://drive.google.com/uc?id=1UH1H8dmYuOcfPRPVDYLI1AIIcwPoqVqW
To: /content/SPY_dataset.csv
100% 25.4k/25.4k [00:00<00:00, 72.1MB/s]


In [None]:
!ls

sample_data  SPY_dataset.csv


# Some Utilities

Plotly is a graphing library for Python; we will use it later to display the final result.

In [None]:
!pip install plotly --upgrade



Import the necessary libraries.

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import plotly.graph_objects as go
import tensorflow as tf

np.random.seed(777)

In [None]:
def setup_tpu():
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        tpu = None

    if tpu:
        tf.config.experimental_connect_to_cluster(tpu)
        tf.tpu.experimental.initialize_tpu_system(tpu)
        strategy = tf.distribute.TPUStrategy(tpu)
    else:
        print("TPU not found, using CPU/GPU")
        strategy = tf.distribute.get_strategy()

    return strategy

strategy = setup_tpu()



Running on TPU  


# Dataset

We load the dataset and divide it into a training set and a test set.


*   The training set covers the period from 2017 to 2020
*   The testing set covers the period from January 2021 to August 2021



In [None]:
# Load dataframe
df = pd.read_csv('SPY_dataset.csv',
                 parse_dates=['Date'])

mask = (df['Date'] >= '2017-01-01') & (df['Date'] <= '2020-12-31')
df_train = df.loc[mask]
df_test = df.loc[~mask]

print("The columns of training dataset:", df_train.columns, "The size of training dataset:", df_train.shape)
print("The columns of testing dataset:", df_test.columns, "The size of testing dataset:", df_test.shape)


# Convert the price data into a list for later analysis
data_train = df_train['Close'].to_list()
data_test = df_test['Close'].to_list()

The columns of training dataset: Index(['Date', 'Close'], dtype='object') The size of training dataset: (1007, 2)
The columns of testing dataset: Index(['Date', 'Close'], dtype='object') The size of testing dataset: (166, 2)


# Model

Linear regression with the scikit-learn library

In this section, we will use an existing model from the [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) library to make predictions.

-----

The parameter:

```
max_sequence_length=150
```
is used to explore the impact of different history lengths: every length up to 150 days is tested to assess various cyclical influences in stock market analysis. Remember, our goal is to find the best `sequence_length`. Once every candidate has been evaluated, we can choose the value with the lowest test loss.

You can freely adjust `max_sequence_length`, but keep in mind that larger values increase the running time.
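
Once a loss has been recorded for every candidate, choosing the best `sequence_length` comes down to taking the candidate with the smallest validation loss. A minimal sketch, assuming a hypothetical helper `pick_best_length` and made-up loss values (these are not results from this notebook):

```python
import numpy as np

def pick_best_length(sequence_lengths, validation_losses):
    """Return the sequence length with the smallest validation loss, plus that loss."""
    best = int(np.argmin(validation_losses))
    return sequence_lengths[best], validation_losses[best]

# Hypothetical losses for sequence lengths 1..5 (illustrative numbers only).
lengths = [1, 2, 3, 4, 5]
losses = [0.9, 0.4, 0.25, 0.3, 0.5]
best_length, best_loss = pick_best_length(lengths, losses)
print(best_length, best_loss)
```

Note that `np.argmin` returns the first minimum, so ties are resolved in favor of the shorter sequence length.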



In [None]:
def preprocess_data(price_data, sequence_length=6):
    """ Prepare data for model training and evaluation by creating sequences of given length.

    Args:
        price_data (list): List of stock prices.
        sequence_length (int): Length of each window, i.e. the historical points plus the one target point that follows them.

    Returns:
        tuple: Features and target arrays prepared for regression.
    """
    num_data_points = len(price_data)
    samples = []
    for i in range(num_data_points - sequence_length + 1):
        sample = price_data[i: i + sequence_length]
        samples.append(sample)
    samples = np.array(samples)
    features = samples[:, :-1]
    target = samples[:, -1]
    return features, target

def train_and_evaluate(data_train, data_test, max_sequence_length=150):
    """ Train and evaluate linear regression model using training and testing datasets.

    Args:
        data_train (list): Training dataset prices.
        data_test (list): Testing dataset prices.
        max_sequence_length (int): Maximum length of data sequences used for training.

    Returns:
        dict: Contains various metrics and model parameters for each sequence length.
    """

    results = {
        'sequence_length': [],
        'train_loss': [],
        'test_loss': [],
        'model_intercepts': [],
        'model_coefficients': []
    }

    # Evaluate model performance across different sequence lengths
    for length in range(1, max_sequence_length + 1):
        features_train, target_train = preprocess_data(data_train, length + 1)
        model = LinearRegression().fit(features_train, target_train)
        predictions_train = model.predict(features_train)
        train_loss = mean_squared_error(target_train, predictions_train)

        features_test, target_test = preprocess_data(data_test, length + 1)
        predictions_test = model.predict(features_test)
        test_loss = mean_squared_error(target_test, predictions_test)

        # Collect results
        results['sequence_length'].append(length)
        results['train_loss'].append(train_loss)
        results['test_loss'].append(test_loss)
        results['model_intercepts'].append(model.intercept_)
        results['model_coefficients'].append(model.coef_)

    return results

# Example of function usage
results = train_and_evaluate(data_train, data_test, max_sequence_length=150)



In this section, we will visually assess our prediction results through plotting.

In [None]:
# Initialize a Plotly figure
fig = go.Figure()

# Add training data trace
fig.add_trace(
    go.Scatter(
        x=results['sequence_length'],
        y=results['train_loss'],
        mode='lines+markers',
        line=dict(color='red'),
        name='Training Loss'
    )
)

# Add testing data trace
fig.add_trace(
    go.Scatter(
        x=results['sequence_length'],
        y=results['test_loss'],
        mode='lines+markers',
        line=dict(color='blue'),
        name='Testing Loss'
    )
)

# Update plot layout for better readability and aesthetics
fig.update_layout(
    title="Performance of Linear Regression Model",
    xaxis_title="Number of Historical Points (N)",
    yaxis_title="Mean Squared Error (MSE)",
    legend_title="Dataset Type",
    font=dict(
        family="Arial, monospace",
        size=18,
        color="RebeccaPurple"
    )
)

# Display the figure
fig.show()

# Gradient Descent Implementation

In this section, we will build a Linear Regression model from scratch.

Complete the function for gradient descent according to the hints provided in the comments.

Please write your own implementation wherever **`# YOUR CODE`** appears.
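
For reference, with a feature matrix `X` of shape `(m, d)`, coefficients `B`, and targets `y`, the MSE is `mean((X @ B - y) ** 2)` and its gradient with respect to `B` is `(2 / m) * X.T @ (X @ B - y)`, a vector with one entry per coefficient. The sketch below checks that formula against a finite-difference approximation on random data. The function names and the random data are illustrative assumptions, not part of the assignment template.

```python
import numpy as np

def mse(X, y, B):
    """Mean squared error of the linear model X @ B."""
    residual = X @ B - y
    return np.mean(residual ** 2)

def mse_gradient(X, y, B):
    """Analytic gradient of the MSE: (2 / m) * X^T (X B - y), one entry per coefficient."""
    m = X.shape[0]
    return (2.0 / m) * X.T @ (X @ B - y)

def numerical_gradient(X, y, B, eps=1e-6):
    """Central finite-difference approximation, used only to check the formula."""
    grad = np.zeros_like(B)
    for j in range(B.size):
        step = np.zeros_like(B)
        step[j] = eps
        grad[j] = (mse(X, y, B + step) - mse(X, y, B - step)) / (2 * eps)
    return grad

# Random problem with 50 samples and 4 coefficients (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
B = rng.normal(size=4)

analytic = mse_gradient(X, y, B)
numeric = numerical_gradient(X, y, B)
print(analytic.shape)  # one partial derivative per coefficient
print(np.allclose(analytic, numeric, atol=1e-5))
```

A quick sanity check for any hand-written gradient is that its shape matches the shape of `B`. A scalar gradient would push every coefficient by the same amount at every step.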

In [None]:
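def normalize_data(data):
    """Standardize prices to zero mean and unit variance to avoid overflow during gradient descent."""
    data = np.asarray(data, dtype=np.float64)
    return (data - np.mean(data)) / np.std(data)

# Note: each split is standardized with its own mean and standard deviation.
data_train_normalized = normalize_data(data_train)
data_test_normalized = normalize_data(data_test)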

def predict(X, B):
    """
    Predict outcomes using a linear model defined by coefficients B.

    Args:
        X (numpy.ndarray): Feature matrix for prediction.
        B (numpy.ndarray): Coefficients of the model.

    Returns:
        numpy.ndarray: Predicted values.
    """
    return np.array(X.dot(B), dtype=np.float64)


def MSE(y_pred, y):
    """
    Calculate the mean squared error between predicted and actual values.

    Args:
        y_pred (numpy.ndarray): Predicted values.
        y (numpy.ndarray): Actual target values.

    Returns:
        float: Computed mean squared error.
    """
    m = len(y_pred)
    return np.sum((y_pred - y) ** 2) / m

def gradient_descent(X, y, learning_rate=1e-7):
    """
    Perform gradient descent to find the regression coefficients that minimize the MSE.

    Args:
        X (numpy.ndarray): Feature matrix.
        y (numpy.ndarray): Target vector.
        learning_rate (float): Step size for each iteration.

    Returns:
        tuple: Optimal coefficients and the loss at convergence.
    """

    # Initialize coefficients as zeros with the same number of elements as there are features in X
    B = np.zeros(X.shape[1])

    # Set the initial previous mean squared error to infinity
    mse_prev = float('inf')

    # Start an infinite loop that will break once convergence criteria are met
    while True:
        # Calculate predicted values using the current coefficient values
        # YOUR CODE
        y_pred = predict(X,B)

        # Compute the difference between predicted and actual values
        # YOUR CODE
        error = y_pred - y


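        # Calculate the gradient of the cost function
        # YOUR CODE
        # X.T.dot(error) yields one partial derivative per coefficient;
        # summing over every element would collapse the gradient to a single scalar.
        grad = 2 / X.shape[0] * X.T.dot(error)
        grad = np.array(grad, dtype=np.float64)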

        # Update the coefficients by taking a step proportional to the gradient
        # YOUR CODE
        B_next = B - learning_rate*grad
        B_next = np.array(B_next, dtype=np.float64)

        # Compute the mean squared error with the updated coefficients
        # YOUR CODE
        y_p = predict(X,B_next)
        mse = MSE(y_p, y)

        # Accept the step so the returned coefficients match the returned MSE
        B = B_next

        # Check if the change in MSE is below the threshold (learning rate), indicating convergence
        # YOUR CODE
        mse_diff = abs(mse_prev - mse)
        if mse_diff < learning_rate:
            break
        # Update the previous MSE for the next iteration
        # YOUR CODE
        mse_prev = mse




    # Return the optimized coefficients and the final mean squared error
    return B, mse



def train_and_evaluate(data_train, data_test, max_sequence_length=150, learning_rate=1e-7):
    results = {
        'sequence_length': [],
        'train_loss': [],
        'test_loss': [],
        'model_intercepts': [],
        'model_coefficients': []
    }

    for length in range(1, max_sequence_length + 1):
        print(f'Processing Sequence Length={length}', end=" ")

        # Training
        x_train, y_train = preprocess_data(data_train, length + 1)
        x0_train = np.ones((x_train.shape[0], 1))
        X_train = np.concatenate((x0_train, x_train), axis=1)

        # Round-trip through TensorFlow tensors (this casts the data to float32)
        X_train_tf = tf.convert_to_tensor(X_train, dtype=tf.float32)
        y_train_tf = tf.convert_to_tensor(y_train, dtype=tf.float32)

        with strategy.scope():
            # gradient_descent is pure NumPy, so it runs on the host CPU;
            # the strategy scope does not move this computation to the TPU
            B, train_loss = gradient_descent(X_train_tf.numpy(), y_train_tf.numpy(), learning_rate)

        # Store the training results
        results['sequence_length'].append(length)
        results['train_loss'].append(train_loss)
        results['model_intercepts'].append(B[0])
        results['model_coefficients'].append(B[1:])

        # Testing
        x_test, y_test = preprocess_data(data_test, length + 1)
        x0_test = np.ones((x_test.shape[0], 1))
        X_test = np.concatenate((x0_test, x_test), axis=1)

        # Convert to TensorFlow tensors for prediction
        X_test_tf = tf.convert_to_tensor(X_test, dtype=tf.float32)
        y_test_tf = tf.convert_to_tensor(y_test, dtype=tf.float32)

        y_pred_test = predict(X_test_tf.numpy(), B)
        test_loss = MSE(y_pred_test, y_test_tf.numpy())

        results['test_loss'].append(test_loss)

        print(f" Done. Train Loss: {train_loss:.6f}, Test Loss: {test_loss:.6f}")

    return results


# Lengths from 1 to 150
learning_rate = 1e-5
max_sequence_length = 150
results = train_and_evaluate(data_train_normalized, data_test_normalized, max_sequence_length, learning_rate)


Processing Sequence Length=1  Done. Train Loss: 0.628802, Test Loss: 0.616278
Processing Sequence Length=2  Done. Train Loss: 0.256551, Test Loss: 0.245080
Processing Sequence Length=3  Done. Train Loss: 0.134670, Test Loss: 0.126162
Processing Sequence Length=4  Done. Train Loss: 0.086626, Test Loss: 0.080786
Processing Sequence Length=5  Done. Train Loss: 0.064342, Test Loss: 0.060889
Processing Sequence Length=6  Done. Train Loss: 0.053400, Test Loss: 0.051467
Processing Sequence Length=7  Done. Train Loss: 0.047556, Test Loss: 0.047199
Processing Sequence Length=8  Done. Train Loss: 0.045170, Test Loss: 0.045757
Processing Sequence Length=9  Done. Train Loss: 0.044354, Test Loss: 0.045730
Processing Sequence Length=10  Done. Train Loss: 0.045011, Test Loss: 0.046677
Processing Sequence Length=11  Done. Train Loss: 0.046508, Test Loss: 0.048803
Processing Sequence Length=12  Done. Train Loss: 0.048583, Test Loss: 0.051330
Processing Sequence Length=13  Done. Train Loss: 0.051166, Te


Similarly, after completing the calculations, we will plot the results.

In [None]:
# Initialize a Plotly figure
fig = go.Figure()

# Add training data trace
fig.add_trace(
    go.Scatter(
        x=results['sequence_length'],
        y=results['train_loss'],
        mode='lines+markers',
        line=dict(color='red'),
        name='Training Loss'
    )
)

# Add testing data trace
fig.add_trace(
    go.Scatter(
        x=results['sequence_length'],
        y=results['test_loss'],
        mode='lines+markers',
        line=dict(color='blue'),
        name='Testing Loss'
    )
)

# Update plot layout for better readability and aesthetics
fig.update_layout(
    title="Performance of Linear Regression Model",
    xaxis_title="Number of Historical Points (N)",
    yaxis_title="Mean Squared Error (MSE)",
    legend_title="Dataset Type",
    font=dict(
        family="Arial, monospace",
        size=18,
        color="RebeccaPurple"
    )
)

# Display the figure
fig.show()

Finally, we will select the best `Sequence Length` and print out the parameters of the model.

In [None]:
# Display the equation of the best model
losses_test = results['test_loss']
minpos = losses_test.index(min(losses_test))
print(f'The best sequence length is {minpos + 1} with minimal test loss.')

best_model = results['model_coefficients'][minpos]
intercept = results['model_intercepts'][minpos]


print('The best model is')
print('y_pred = ', end='')

# Print the intercept
print(f'{round(intercept, 3)}', end=" + ")

# Print the coefficients with their respective terms
for i, coef in enumerate(best_model):
    if i != len(best_model) - 1:
        print(f'({round(coef, 3)}*x{i+1})', end=" + ")
        if (i+1) % 5 == 0:
            print('\n\t', end='')
    else:
        print(f'({round(coef, 3)}*x{i+1})')

The best sequence length is 9 with minimal test loss.
The best model is
y_pred = 0.104 + (0.104*x1) + (0.104*x2) + (0.104*x3) + (0.104*x4) + (0.104*x5) + 
	(0.104*x6) + (0.104*x7) + (0.104*x8) + (0.104*x9)
