<a href="https://colab.research.google.com/github/nnm2602/StockPredictor/blob/main/StockPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import everything we need for the project.

In [17]:
import yfinance as yf
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
import os

Load the S&P 500 data from the local sp500.csv file if it exists; otherwise download it from the Yahoo Finance API.

In [18]:
sp500 = pd.read_csv("sp500.csv", index_col=0) if os.path.exists("sp500.csv") else yf.Ticker("^GSPC").history(period="max")
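
The cell above reads sp500.csv when it is present but does not write the file itself. As a minimal sketch of a load-or-download step that also caches the result, consider the helper below. The names `load_or_download`, `download`, and `fake_download` are hypothetical, and the stand-in downloader with made-up prices replaces the real `yf.Ticker("^GSPC").history(period="max")` call so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_or_download(path, download):
    """Return the cached CSV at `path`, or call `download()` and cache its result."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col=0)
    data = download()
    data.to_csv(path)  # cache so later runs skip the download
    return data


def fake_download():
    # Stand-in for yf.Ticker("^GSPC").history(period="max"); values are made up.
    idx = pd.Index(["2024-01-02", "2024-01-03"], name="Date")
    return pd.DataFrame({"Close": [100.0, 101.5]}, index=idx)


cache_path = os.path.join(tempfile.mkdtemp(), "sp500_demo.csv")
first = load_or_download(cache_path, fake_download)   # downloads and writes the cache
second = load_or_download(cache_path, fake_download)  # reads the cache back
```

Passing the downloader in as an argument keeps the caching logic separate from the network call, which also makes it easy to try out without hitting the API.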

We'll process the data in the following way:

+ Convert the index to datetime format.
+ Drop the unneeded "Dividends" and "Stock Splits" columns (this step is currently commented out in the processing cell, so both columns remain in the outputs that follow).
+ Create a "Tomorrow" column holding the next trading day's closing price.
+ Create a "Target" column, where 1 means tomorrow's close is higher than today's and 0 means it is lower or unchanged.
+ Keep only data from January 1, 1990, onwards.

In [19]:
sp500.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits
1927-12-30 00:00:00-05:00,17.66,17.66,17.66,17.66,0,0.0,0.0
1928-01-03 00:00:00-05:00,17.76,17.76,17.76,17.76,0,0.0,0.0
1928-01-04 00:00:00-05:00,17.719999,17.719999,17.719999,17.719999,0,0.0,0.0
1928-01-05 00:00:00-05:00,17.549999,17.549999,17.549999,17.549999,0,0.0,0.0
1928-01-06 00:00:00-05:00,17.66,17.66,17.66,17.66,0,0.0,0.0


In [20]:
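# Parse the index as dates so we can slice by date below.
sp500.index = pd.to_datetime(sp500.index)
# sp500.drop(["Dividends", "Stock Splits"], axis=1, inplace=True)
# Next trading day's close, aligned to the current row.
sp500["Tomorrow"] = sp500["Close"].shift(-1)
# 1 if tomorrow closes higher, else 0 (the last row has no Tomorrow, so it gets 0).
sp500["Target"] = (sp500["Tomorrow"] > sp500["Close"]).astype(int)
sp500 = sp500.loc["1990-01-01":].copy()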
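
To make the "Tomorrow" and "Target" construction concrete, here is a small sketch on four made-up closing prices (the `toy` frame and its values are illustrative only, not S&P 500 data). It shows how a flat day and the final row are both labelled 0.

```python
import pandas as pd

# Made-up closing prices for four consecutive trading days.
toy = pd.DataFrame(
    {"Close": [100.0, 102.0, 101.0, 101.0]},
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
)

# shift(-1) pulls the next row's Close up onto the current row.
toy["Tomorrow"] = toy["Close"].shift(-1)
# 1 only when tomorrow's close is strictly higher; flat days and the
# final row (whose Tomorrow is NaN) both become 0.
toy["Target"] = (toy["Tomorrow"] > toy["Close"]).astype(int)
print(toy)
```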

We will use five predictors, in the following order: "Close", "Volume", "Open", "High", "Low".
These hold the basic price and volume information for each trading day, which is our starting point for predicting the direction of the next day's price.

In [21]:
predictors = ["Close", "Volume", "Open", "High", "Low"]

Since stock data is noisy and complex, we'll use a Random Forest Classifier for our model.  
We'll use 100 estimators and a min_samples_split of 100, train on everything except the last 100 rows, and hold those last 100 rows out as the test set.

In [22]:
model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=1)
train, test = sp500.iloc[:-100], sp500.iloc[-100:]
model.fit(train[predictors], train["Target"])


Next we use the trained model to make predictions on the test data and calculate the precision score.

This score is the baseline against which we will compare later versions of the model.

In [23]:
preds = model.predict(test[predictors])
precision = precision_score(test["Target"], preds)
precision

0.5531914893617021
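
Precision is the share of predicted up days that really were up days: true positives divided by true positives plus false positives. The sketch below works this out by hand on made-up labels (not the notebook's data); `precision_score` should return the same value for the same inputs.

```python
# Made-up labels: 1 = price went up, 0 = it did not.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 1, 0, 1, 0, 1]

# True positives: predicted up and it went up.
tp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
# False positives: predicted up but it did not go up.
fp = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)

precision = tp / (tp + fp)
print(precision)
```

Days where the model predicted 0 do not enter the calculation at all, so a high precision says nothing about how many up days were missed.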

Great! Now that we have our baseline in place, we will try to improve the model by adding more predictors.

First we will write our backtest function, which evaluates the model's performance over different time intervals.

+ The function steps through the dataset in fixed intervals. By default it trains on the first 2500 rows, tests on the next 250, then grows the training set by 250 rows and repeats.
+ For each iteration, the predict function is used to obtain predictions from the given model and predictors.
+ Predictions from each iteration are collected in a list and concatenated into one data frame at the end.

In [24]:
def backtest(data, model, predictors, start=2500, step=250):
    all_predictions = []

    for i in range(start, data.shape[0], step):
        train = data.iloc[0:i].copy()
        test = data.iloc[i:(i+step)].copy()
        predictions = predict(train, test, predictors, model)
        all_predictions.append(predictions)

    return pd.concat(all_predictions)
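
To see how many rows each pass of the loop trains and tests on, the sketch below replays the same `range(start, n_rows, step)` arithmetic without any model. `backtest_windows` is a hypothetical helper written for this illustration, and 3100 rows is a made-up dataset length.

```python
def backtest_windows(n_rows, start=2500, step=250):
    """Return (train_size, test_size) for each pass, mirroring the loop in backtest."""
    windows = []
    for i in range(start, n_rows, step):
        test_end = min(i + step, n_rows)  # iloc slicing clips at the end of the data
        windows.append((i, test_end - i))
    return windows


for train_size, test_size in backtest_windows(3100):
    print(f"train on first {train_size} rows, test on next {test_size}")
```

The training window always starts at row 0 and only grows, so every prediction is made by a model that has seen only earlier data. The last test window can be shorter than `step`.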

For each horizon we will add two new columns to the sp500 data frame: ratio_column and trend_column.

+ ratio_column: the closing price divided by its rolling mean over the horizon, showing whether today's close is above or below its recent average.
+ trend_column: the number of days within the horizon, up to and including today, that closed higher than the day before. It is the rolling sum of the previous rows' "Target" values, where "0" is down and "1" is up.

These extra columns give the model more useful information for predicting the direction of the price.

The 5 horizons we chose to use are: 2, 5, 60, 250, 1000. (days)

In [25]:
horizons = [2, 5, 60, 250, 1000]
new_predictors = []


for horizon in horizons:
    rolling_averages = sp500["Close"].rolling(horizon).mean()

    ratio_column = f"Close_Ratio_{horizon}"
    sp500[ratio_column] = sp500["Close"] / rolling_averages

    trend_column = f"Trend_{horizon}"
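    # Sum only the Target column; shift(1) keeps today's Target (which depends on tomorrow's close) out of the feature.
    sp500[trend_column] = sp500["Target"].shift(1).rolling(horizon).sum()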

    new_predictors += [ratio_column, trend_column]

sp500.dropna(subset=sp500.columns[sp500.columns != "Tomorrow"], inplace=True)
sp500.head()

Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Tomorrow,Target,Close_Ratio_2,Trend_2,Close_Ratio_5,Trend_5,Close_Ratio_60,Trend_60,Close_Ratio_250,Trend_250,Close_Ratio_1000,Trend_1000
1993-12-14 00:00:00-05:00,465.730011,466.119995,462.459991,463.059998,275050000,0.0,0.0,461.839996,0,0.997157,1.0,0.996617,1.0,1.000283,32.0,1.028047,127.0,1.176082,512.0
1993-12-15 00:00:00-05:00,463.059998,463.690002,461.839996,461.839996,331770000,0.0,0.0,463.339996,1,0.998681,0.0,0.995899,1.0,0.997329,32.0,1.025151,126.0,1.172676,512.0
1993-12-16 00:00:00-05:00,461.859985,463.980011,461.859985,463.339996,284620000,0.0,0.0,466.380005,1,1.001621,1.0,0.999495,2.0,1.000311,32.0,1.028274,127.0,1.176163,513.0
1993-12-17 00:00:00-05:00,463.339996,466.380005,463.339996,466.380005,363750000,0.0,0.0,465.850006,0,1.00327,2.0,1.004991,3.0,1.006561,32.0,1.034781,128.0,1.183537,514.0
1993-12-20 00:00:00-05:00,466.380005,466.899994,465.529999,465.850006,255900000,0.0,0.0,465.299988,0,0.999431,1.0,1.003784,2.0,1.00512,32.0,1.033359,128.0,1.181856,513.0
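
The ratio and trend features can be checked by hand on a tiny example. The sketch below uses five made-up closing prices and a single horizon of 2 (the `toy` frame is illustrative only).

```python
import pandas as pd

# Five made-up closing prices.
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 13.0, 14.0]})
toy["Tomorrow"] = toy["Close"].shift(-1)
toy["Target"] = (toy["Tomorrow"] > toy["Close"]).astype(int)

horizon = 2
# Today's close relative to the mean of the last `horizon` closes (today included).
rolling_mean = toy["Close"].rolling(horizon).mean()
toy["Close_Ratio_2"] = toy["Close"] / rolling_mean
# Sum of the previous `horizon` Target values; shift(1) excludes today's
# Target, which depends on tomorrow's close.
toy["Trend_2"] = toy["Target"].shift(1).rolling(horizon).sum()
print(toy)
```

The first rows of each feature are NaN because the rolling window is not yet full. With a horizon of 1000 this removes roughly the first 1000 rows, which is why the processed data above starts in December 1993 rather than January 1990.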


We update our random forest model for the backtest. This time it will use 200 estimators and a min_samples_split of 50.

In [26]:
model = RandomForestClassifier(n_estimators=200, min_samples_split=50, random_state=1)

Now that almost everything is in place, the last thing to do is create the predict function that backtest calls. The function does the following:
+ Trains the model on the given predictors to predict the "Target" value.
+ Uses the model to estimate the probability of an up day for each row of the test data. A probability of 0.6 or higher counts as 1 and anything lower counts as 0, so the model only predicts an increase when it is fairly confident.
+ Builds a pandas DataFrame that places the "Target" values and the predictions side by side.

In [27]:
def predict(train, test, predictors, model):
    model.fit(train[predictors], train["Target"])
    preds = (model.predict_proba(test[predictors])[:, 1] >= 0.6).astype(int)
    preds = pd.Series(preds, index=test.index, name="Predictions")
    combined = pd.concat([test["Target"], preds], axis=1)
    return combined
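
To illustrate what the 0.6 cutoff changes, the sketch below applies two thresholds to made-up probabilities. The array stands in for `model.predict_proba(test[predictors])[:, 1]`, and the 0.5 cutoff is only an approximation of what a plain `model.predict` call would do for two classes.

```python
import numpy as np

# Made-up class-1 ("price goes up") probabilities.
up_probability = np.array([0.45, 0.55, 0.60, 0.72])

default_preds = (up_probability >= 0.5).astype(int)  # roughly the default behaviour
strict_preds = (up_probability >= 0.6).astype(int)   # the cutoff used in predict

print(default_preds)
print(strict_preds)
```

Raising the threshold produces fewer up predictions. The intent is to trade the number of predicted up days for a better hit rate on the days that are flagged.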

Commence the backtest!!

In [28]:
predictions = backtest(sp500, model, new_predictors)
predictions

Date,Target,Predictions
2003-11-14 00:00:00-05:00,0,0
2003-11-17 00:00:00-05:00,0,1
2003-11-18 00:00:00-05:00,1,1
2003-11-19 00:00:00-05:00,0,0
2003-11-20 00:00:00-05:00,1,1
...,...,...
2024-01-08 00:00:00-05:00,0,0
2024-01-09 00:00:00-05:00,1,0
2024-01-10 00:00:00-05:00,0,0
2024-01-11 00:00:00-05:00,1,0


Let's see our precision.

In [29]:
precision_backtest = precision_score(predictions["Target"], predictions["Predictions"])
precision_backtest

0.5728038507821901

Let's see the target distribution in backtest predictions.

In [30]:
predictions["Target"].value_counts() / predictions.shape[0]

1    0.544147
0    0.455853
Name: Target, dtype: float64

The split is roughly 54/46, not far from 50/50. If you're looking for more reason to believe the stock market is close to random, here it is. :|

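Overall, our model did OK. With the new predictors, when it predicts an up day it is right about 57% of the time, beating the baseline precision of 55%. Keep in mind that precision only covers the days the model flags as up, and that the market itself rose on about 54% of the backtest days.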
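
One way to put the backtest precision in context is to compare it with a rule that predicts "up" every single day. That rule is right exactly as often as the market rises, so its precision equals the share of up days. The sketch below does the arithmetic with the two figures printed above. The variable names are illustrative, and this is a rough comparison rather than a significance test.

```python
backtest_precision = 0.5728038507821901  # from the precision cell above
share_of_up_days = 0.544147              # from the value_counts output above

# Precision of an always-up rule equals the share of up days,
# so the difference is the model's edge over that rule.
edge = backtest_precision - share_of_up_days
print(round(edge, 4))
```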