# Summary

This project forecasts future closing prices of GOOG stock with a polynomial (degree-2) linear regression model. The model is trained on historical GOOG price data, with AAPL price data as supporting features. The goal is to gain practical experience with polynomial regression, forecasting, and constructing prediction intervals on a time series dataset. Using past market trends, the model forecasts GOOG closing prices 21 days ahead.


First, I import the necessary packages.

In [14]:
# !pip install ta
# !pip install plotly

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy.stats import t
import yfinance as yf
import datetime as dt
import ta
import plotly.express as px

Next, I set the date, stocks, and model parameters.

In [15]:
# Setting dates
start_date = dt.date(2016, 1, 1)
end_date = dt.date(2022, 4, 9) # Starting Prediction Date
days_ahead = 21 # Predicted Days
start_split = 14 # Number of most recent rows excluded from training

# Selecting Stock and Parameters used for model
stock_name = "GOOG"
alt_stocks = ['AAPL']
dat_columns = ['Day', 'Month', 'Year', 'DOW', 'Bus_Day', 'Future_Day', 'Close', 'Price_Diff', 'SMA2','SMA3', 'CCI', 'Volume', 'RSI', 'Upper_Band', 'Lower_Band', 'MACD', 'Signal']
alt_columns = ['Date', 'Volume', 'CCI', 'Close', 'Price_Diff']
model_runs = 5
ma1 = 20
ma2 = 50
ma3 = 200
macd1 = 12
macd2 = 26
sig1 = 9

The following helper functions compute the RSI and the Bollinger Bands.

In [16]:
def calculate_rsi(stock_data, period=14):
    diff = stock_data["Close"].diff()
    pos_diff = diff.where(diff > 0, 0)
    neg_diff = diff.where(diff < 0, 0)
    avg_gain = pos_diff.ewm(span=period, min_periods=period).mean()
    avg_loss = neg_diff.ewm(span=period, min_periods=period).mean().abs()
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi

def calculate_bands(stock_data):
    middle_band = stock_data.Close.rolling(window=20).mean()
    std = stock_data.Close.rolling(window=20).std()
    upper_band = middle_band + (std * 2)
    lower_band = middle_band - (std * 2)
    return upper_band, lower_band
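
To make the band calculation concrete, here is a minimal dependency-free sketch of the same idea as `calculate_bands`: a rolling mean plus or minus two rolling standard deviations. The helper name `rolling_bands` and the toy prices are illustrative assumptions, not part of the notebook. `statistics.stdev` uses the sample (n-1) formula, which matches the default of pandas `rolling(...).std()`.

```python
import statistics

def rolling_bands(prices, window=20, width=2.0):
    """Return (upper, lower) lists; entries are None until a full window exists."""
    upper, lower = [], []
    for i in range(len(prices)):
        if i + 1 < window:
            # Not enough history yet (pandas would give NaN here).
            upper.append(None)
            lower.append(None)
            continue
        chunk = prices[i + 1 - window : i + 1]
        mid = statistics.fmean(chunk)   # middle band: rolling mean
        sd = statistics.stdev(chunk)    # rolling sample standard deviation
        upper.append(mid + width * sd)
        lower.append(mid - width * sd)
    return upper, lower

toy_prices = [10.0, 11.0, 12.0, 11.0, 10.0, 11.0]  # hypothetical prices
upper, lower = rolling_bands(toy_prices, window=3)
for price, u, l in zip(toy_prices, upper, lower):
    print(price, u, l)
```

With a window of 3, the first two rows have no bands, which is why the notebook later drops rows containing missing values.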

Calculating the necessary parameters used for the model.

In [17]:
stock = yf.Ticker(stock_name)
df = stock.history(start = start_date, end = end_date).reset_index()
df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
df.columns = ['Datetime', 'Open', 'High', 'Low', 'Close', 'Volume']
df['Date'] = list(map(lambda x: x.date(), df.Datetime))
df['Day'] = list(map(lambda x: x.day, df.Datetime))
df['Month'] = list(map(lambda x: x.month, df.Datetime))
df['Year'] = list(map(lambda x: x.year, df.Datetime))
df['Price_Diff'] = (df.Close - df.Close.shift(1)) / df.Close.shift(1)
df['SMA1'] = df['Close'].rolling(ma1).mean()
df['SMA2'] = df['Close'].rolling(ma2).mean()
df['SMA3'] = df['Close'].rolling(ma3).mean()
df['CCI'] = ta.trend.cci(df['High'], df['Low'], df['Close'], window=20, constant=0.015)
df['ATR'] = ta.volatility.average_true_range(df['High'], df['Low'], df['Close'], window=20)
df['Bus_Day'] = list(map(lambda x: x.days, (df.Date - df.Date.shift(1))))
df['DOW'] = list(map(lambda x: x.weekday(), df.Date))
future_day = list(map(lambda x: x.days, (df.Date.shift(-1) - df.Date )))
future_day[-1] = (end_date - df.Date[len(df)-1]).days
df['Future_Day'] = future_day
df['RSI'] = calculate_rsi(df)

upper_band, lower_band = calculate_bands(df)
df['Upper_Band'] = upper_band
df['Lower_Band'] = lower_band

ema_1 = df.Close.ewm(span=macd1).mean()
ema_2 = df.Close.ewm(span=macd2).mean()
df['MACD'] = ema_1 - ema_2
df['Signal'] = df.MACD.ewm(span=sig1).mean()
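
The MACD lines above rely on pandas `ewm(span=...).mean()`, which by default uses the "adjusted" weighting (a weighted average with weights `(1 - alpha)^i`, where `alpha = 2 / (span + 1)`). The following pure-Python sketch reproduces that calculation under those assumptions; the helper names `ewm_mean` and `macd` and the toy series are hypothetical.

```python
def ewm_mean(values, span):
    """Adjusted exponentially weighted mean (the pandas ewm default)."""
    alpha = 2.0 / (span + 1.0)
    out = []
    num = 0.0  # running weighted sum of observations
    den = 0.0  # running sum of weights
    for x in values:
        num = x + (1.0 - alpha) * num
        den = 1.0 + (1.0 - alpha) * den
        out.append(num / den)
    return out

def macd(values, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA) and its signal line (EMA of MACD)."""
    fast_ema = ewm_mean(values, fast)
    slow_ema = ewm_mean(values, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ewm_mean(macd_line, signal)
    return macd_line, signal_line

rising = [100.0 + i for i in range(40)]  # hypothetical steadily rising prices
macd_line, signal_line = macd(rising)
print(macd_line[-1], signal_line[-1])
```

On a rising series the fast average sits above the slow one, so the MACD line is positive; on a flat series it stays at zero.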

Calculating the same parameters for the alternative stocks (AAPL) and merging them into the main dataframe.

In [18]:
for s_name in alt_stocks:
    cur_stock_name = s_name
    cur_stock = yf.Ticker(cur_stock_name)
    cur_df = cur_stock.history(start = start_date, end = end_date).reset_index()
    cur_df.Date = list(map(lambda x: x.date(), cur_df.Date))
    cur_df['CCI'] = ta.trend.cci(cur_df['High'], cur_df['Low'], cur_df['Close'], window=20, constant=0.015)
    # Absolute day-over-day change (the GOOG Price_Diff above is a fractional change)
    cur_df['Price_Diff'] = cur_df.Close - cur_df.Close.shift(1)
    cur_df['SMA2'] = cur_df['Close'].rolling(ma2).mean()
    cur_df['SMA3'] = cur_df['Close'].rolling(ma3).mean()
    cur_df['RSI'] = calculate_rsi(cur_df)
    cur_upper, cur_lower = calculate_bands(cur_df)
    cur_df['Upper_Band'] = cur_upper
    cur_df['Lower_Band'] = cur_lower
    cur_ema_1 = cur_df.Close.ewm(span=macd1).mean()
    cur_ema_2 = cur_df.Close.ewm(span=macd2).mean()
    cur_df['MACD'] = cur_ema_1 - cur_ema_2  # use this stock's EMAs, not GOOG's
    cur_df['Signal'] = cur_df.MACD.ewm(span=sig1).mean()

    cur_df = cur_df[alt_columns]
    cur_rename_col = list(map(lambda x: x + "_"+cur_stock_name, cur_df.columns[1:]))
    cur_df.columns = ['Date'] + cur_rename_col
    df = pd.merge(df, cur_df, on='Date')
    dat_columns += cur_rename_col

df.dropna(inplace=True)
df = df.reset_index(drop = True)
df.head(5)

| | Datetime | Open | High | Low | Close | Volume | Date | Day | Month | Year | ... | Future_Day | RSI | Upper_Band | Lower_Band | MACD | Signal | Volume_AAPL | CCI_AAPL | Close_AAPL | Price_Diff_AAPL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-10-17 | 38.990002 | 39.2925 | 38.875 | 38.998001 | 21860000 | 2016-10-17 | 17 | 10 | 2016 | ... | 1 | 51.829101 | 39.442489 | 38.46776 | 0.124545 | 0.135737 | 94499600 | 126.735276 | 27.320112 | -0.018589 |
| 1 | 2016-10-18 | 39.392502 | 40.080502 | 39.278252 | 39.763 | 41138000 | 2016-10-18 | 18 | 10 | 2016 | ... | 1 | 70.35295 | 39.588172 | 38.441327 | 0.178658 | 0.144321 | 98214000 | 118.497584 | 27.30151 | -0.018602 |
| 2 | 2016-10-19 | 39.911999 | 40.23 | 39.901501 | 40.075001 | 35336000 | 2016-10-19 | 19 | 10 | 2016 | ... | 1 | 74.895856 | 39.812739 | 38.343161 | 0.243907 | 0.164238 | 80138400 | 54.742987 | 27.220169 | -0.081341 |
| 3 | 2016-10-20 | 40.165001 | 40.198502 | 39.801498 | 39.848499 | 35150000 | 2016-10-20 | 20 | 10 | 2016 | ... | 1 | 66.376087 | 39.905842 | 38.298857 | 0.27418 | 0.186227 | 96503200 | 71.898394 | 27.206226 | -0.013943 |
| 4 | 2016-10-21 | 39.75 | 39.974998 | 39.700001 | 39.968498 | 25324000 | 2016-10-21 | 21 | 10 | 2016 | ... | 3 | 68.562218 | 40.020684 | 38.246365 | 0.304347 | 0.209851 | 92770800 | 54.31743 | 27.099314 | -0.106913 |


Selecting last date for training data.

In [19]:
init_ind = len(df) - start_split
init_date = df.Date[init_ind]
print("End Of Training Data: " + str(init_date))

End Of Training Data: 2022-03-22


Setting up X and y values for Model

In [20]:
X = df[dat_columns][:init_ind-1-days_ahead].values
y = df["Close"][1:init_ind].values
new_y = []
for i in range(len(y)-days_ahead):
    new_y.append(y[i:i+days_ahead])
y = np.array(new_y)
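
The slicing above pairs each feature row with the next `days_ahead` closing prices, so the model learns a whole forecast path from a single day's features. A minimal pure-Python sketch of the same pairing on toy data follows; the helper `make_windows` and the numbers are illustrative assumptions, and the row count mirrors the notebook's `init_ind - 1 - days_ahead`.

```python
def make_windows(features, closes, days_ahead):
    """Pair features[i] with closes[i+1 : i+1+days_ahead], as in the cell above."""
    n_rows = len(closes) - 1 - days_ahead  # same row count as the notebook slicing
    X = [features[i] for i in range(n_rows)]
    y = [closes[i + 1 : i + 1 + days_ahead] for i in range(n_rows)]
    return X, y

closes = [10, 11, 12, 13, 14, 15]        # hypothetical closing prices
features = [[c] for c in closes]         # one toy feature per day
X_toy, y_toy = make_windows(features, closes, days_ahead=2)
for row, target in zip(X_toy, y_toy):
    print(row, "->", target)
```

Each target row starts one day after its feature row, so no target value overlaps with the day the features were observed.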

Fitting the model on a reshuffled copy of the training set `model_runs` times. Each pass overwrites the previous fit, so only the last model is kept.

In [21]:
for i in range(model_runs):
    rand_ind = np.random.choice(np.arange(0, len(X)), len(X), replace = False)
    X_train = X[rand_ind]
    y_train = y[rand_ind]

    poly_features = PolynomialFeatures(degree=2)
    X_train_poly = poly_features.fit_transform(X_train)

    # MODEL
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
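
One point worth checking about the shuffling step: an ordinary least squares fit does not depend on the order of the training rows, so reshuffling should leave the coefficients unchanged up to floating-point rounding. The sketch below illustrates this with NumPy's `np.linalg.lstsq` on synthetic data. It is a stand-in for the scikit-learn fit, not the notebook's model, and every name and number in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix with an intercept column and three features.
X_demo = rng.normal(size=(50, 3))
X_demo = np.column_stack([np.ones(len(X_demo)), X_demo])
true_coef = np.array([1.0, 2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=len(X_demo))

# Fit once on the original order and once on a random permutation of the rows.
coef_original, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
perm = rng.permutation(len(X_demo))
coef_shuffled, *_ = np.linalg.lstsq(X_demo[perm], y_demo[perm], rcond=None)

print(np.allclose(coef_original, coef_shuffled))
```

Shuffling matters for procedures that process data sequentially or in batches, or before a random train/validation split; for a single closed-form least squares fit it has no effect.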

Making prediction based on latest data.

In [22]:
latest_inputs = poly_features.transform(df.iloc[len(df)-1][dat_columns].values.reshape(1, -1))
latest_pred = model.predict(latest_inputs)[0]
print(latest_pred)

[130.51145419 125.40657807 120.74462711 116.05826337 112.35083993
 112.15409228 109.46482368 111.3328011  111.55833218 112.39130332
 109.22089086 104.88025062 103.435766   103.6983789  104.11148841
 101.17434013 101.07988714  99.68243202 100.47286067 100.82545402
 101.51764834]


Preparing data for analysis and visualization.

In [23]:
all_dates = df.Date.values
all_close = df.Close.values
latest_date = all_dates[-1]

for i in range(days_ahead):
    all_dates = np.append(all_dates, (latest_date + dt.timedelta(days = i+1)))
    all_close = np.append(all_close, latest_pred[i])

pred_indicator = np.append(np.repeat("Actual", len(df)), np.repeat("Predicted", days_ahead))
final_graph_df = pd.DataFrame({'Date':all_dates, 'Close_Price':all_close, 'Label':pred_indicator})
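
The loop above steps one calendar day at a time, so some provisional forecast dates fall on weekends (the next cell replaces them with the actual trading dates). If trading dates were needed before the actual data exists, one option is to skip weekends, as in this minimal sketch. The helper `next_weekdays` is hypothetical, it does not handle market holidays, and the start date 2022-04-08 is an assumption inferred from `end_date`.

```python
import datetime as dt

def next_weekdays(start, n):
    """Return the n weekdays (Mon-Fri) after `start`; holidays are not handled."""
    days = []
    current = start
    while len(days) < n:
        current += dt.timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
    return days

forecast_dates = next_weekdays(dt.date(2022, 4, 8), 5)
print(forecast_dates)
```

The fifth date returned here is 2022-04-15, which is absent from the actual-price table further down (presumably a market holiday), so a real trading calendar would still be needed for exact alignment.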

Comparing actual vs predicted data and checking the absolute difference.

In [24]:
print("Actual - Expected:")
print("Latest Price: " + str(df.Close.values[-1]))

df_max = stock.history(start = end_date).reset_index()

df_max['Date'] = list(map(lambda x: x.date(), df_max.Date))
df_max = df_max[['Date','Close']]
last_actual_vals = df_max[:days_ahead].Close.values

all_final_diff = []
all_dates = all_dates[:len(all_dates)-days_ahead]
for i in range(days_ahead):
    all_final_diff.append(abs(last_actual_vals[i] - latest_pred[i]))
    all_dates = np.append(all_dates, df_max.Date.values[i])

act_exp_df = pd.DataFrame({'Date':df_max[:days_ahead].Date, 'Actual':last_actual_vals, 'Expected':latest_pred, 'Diff':all_final_diff })
print(act_exp_df)
print("Total Difference: " + str(sum(all_final_diff)))

final_graph_df = pd.DataFrame({'Date':all_dates, 'Close_Price':all_close, 'Label':pred_indicator})
final_graph_df = pd.concat([final_graph_df, pd.DataFrame({'Date':act_exp_df.Date, 'Close_Price':act_exp_df.Actual, 'Label':np.repeat("Actual", days_ahead)})])

final_graph_df['Year'] = list(map(lambda x: x.year, final_graph_df.Date))
final_graph_df['DOY_Label'] = list(map(lambda x: int(x.strftime('%j')), final_graph_df.Date))

final_pred_df = final_graph_df[final_graph_df.Label == "Predicted"].reset_index(drop = True)
final_graph_df = final_graph_df[final_graph_df.Label == "Actual"].reset_index(drop = True)

Actual - Expected:
Latest Price: 134.010498046875
          Date      Actual    Expected       Diff
0   2022-04-11  129.796494  130.511454   0.714961
1   2022-04-12  128.374496  125.406578   2.967918
2   2022-04-13  130.285995  120.744627   9.541368
3   2022-04-14  127.252998  116.058263  11.194735
4   2022-04-18  127.960999  112.350840  15.610159
5   2022-04-19  130.531006  112.154092  18.376914
6   2022-04-20  128.245499  109.464824  18.780675
7   2022-04-21  124.937500  111.332801  13.604699
8   2022-04-22  119.613998  111.558332   8.055666
9   2022-04-25  123.250000  112.391303  10.858697
10  2022-04-26  119.505997  109.220891  10.285106
11  2022-04-27  115.020500  104.880251  10.140250
12  2022-04-28  119.411499  103.435766  15.975733
13  2022-04-29  114.966499  103.698379  11.268120
14  2022-05-02  117.156998  104.111488  13.045509
15  2022-05-03  118.129501  101.174340  16.955161
16  2022-05-04  122.574997  101.079887  21.495110
17  2022-05-05  116.746498   99.682432  17.064066


The first two forecasts fall within $3 of the actual closing prices (errors of $0.71 and $2.97). From the third day onward the error grows to roughly $8 to $21 per share, and the model underestimates the actual price on every one of those days.
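
To put numbers on how the error grows with the horizon, here is a minimal sketch that computes the absolute error per day, the mean absolute error (MAE), and the root mean squared error (RMSE). It uses only the first five rows of the table above, copied by hand, so the summary values describe that five-day slice rather than the full forecast.

```python
import math

# First five rows of the Actual vs Expected table printed above.
actual = [129.796494, 128.374496, 130.285995, 127.252998, 127.960999]
expected = [130.511454, 125.406578, 120.744627, 116.058263, 112.350840]

abs_err = [abs(a - e) for a, e in zip(actual, expected)]
mae = sum(abs_err) / len(abs_err)
rmse = math.sqrt(sum(err ** 2 for err in abs_err) / len(abs_err))

for horizon, err in enumerate(abs_err, start=1):
    print(f"day {horizon}: |error| = {err:.2f}")
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

RMSE weights large misses more heavily than MAE, which is why it is the larger of the two when errors grow along the horizon.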

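Creating 68% and 90% prediction intervals for visualization. The half-width of each interval is the z-score times the RMSE between the forecast and the realized prices, so the bands have the same width at every forecast day.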

In [25]:
# PI = ŷ ± z*σ(ε_t)

residual_std = np.sqrt(mean_squared_error(last_actual_vals, latest_pred))

z_score1 = 1.00  # For 68% confidence interval
z_score2 = 1.645  # For 90% confidence interval

all_lower1 = []
all_upper1 = []
all_lower2 = []
all_upper2 = []
for i in range(days_ahead):
    all_lower1.append(latest_pred[i] - z_score1 * residual_std)
    all_upper1.append(latest_pred[i] + z_score1 * residual_std)
    
    all_lower2.append(latest_pred[i] - z_score2 * residual_std)
    all_upper2.append(latest_pred[i] + z_score2 * residual_std)

ci_df = pd.DataFrame({'Date': final_pred_df.Date, 'DOY_Label':final_pred_df.DOY_Label, 'Lower1':all_lower1, 'Upper1':all_upper1, 'Lower2':all_lower2, 'Upper2':all_upper2})
ci_df.head()

| | Date | DOY_Label | Lower1 | Upper1 | Lower2 | Upper2 |
|---|---|---|---|---|---|---|
| 0 | 2022-04-11 | 101 | 116.916943 | 144.105966 | 108.148483 | 152.874426 |
| 1 | 2022-04-12 | 102 | 111.812066 | 139.00109 | 103.043607 | 147.76955 |
| 2 | 2022-04-13 | 103 | 107.150116 | 134.339139 | 98.381656 | 143.107599 |
| 3 | 2022-04-14 | 104 | 102.463752 | 129.652775 | 93.695292 | 138.421235 |
| 4 | 2022-04-18 | 108 | 98.756328 | 125.945352 | 89.987868 | 134.713811 |
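
The z-scores 1.00 and 1.645 used above are hard-coded. As a minimal sketch, they can also be derived from the desired coverage with the standard library's `statistics.NormalDist`, assuming normally distributed forecast errors. The helpers `z_for_coverage` and `interval` are hypothetical names; `sigma` is read off the table above as the first prediction minus its `Lower1` value.

```python
from statistics import NormalDist

def z_for_coverage(coverage):
    """Two-sided standard-normal multiplier for a central interval."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)

def interval(point, sigma, coverage):
    """Symmetric interval point +/- z * sigma for the given coverage."""
    z = z_for_coverage(coverage)
    return point - z * sigma, point + z * sigma

sigma = 130.511454 - 116.916943  # residual std implied by the table above
low, high = interval(130.511454, sigma, 0.90)

print(round(z_for_coverage(0.90), 3))
print(low, high)
```

A coverage of 0.6827 gives a multiplier of about 1.00, which is the "68%" band in the notebook.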


Graphing GOOG closing prices with forecasted predictions and prediction interval.

In [26]:
fig = px.line(final_graph_df, x='DOY_Label', y=['Close_Price'], color = 'Year', hover_data = {"Date": "|%B %d, %Y", "DOY_Label":False}).update_traces(connectgaps = True)
fig.add_scatter(name = "Predicted", x = final_pred_df.DOY_Label, y = final_pred_df.Close_Price)

fig.update_traces(mode='markers+lines')
fig.update_xaxes(
    title_text = "Month",
    tickvals = [1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335],
    ticktext = ['Jan', 'Feb', 'March', 'April', 'May', 'June', 'July', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
)


pred_color1 = 'rgba(68, 68, 68, 0.2)'
pred_color2 = 'rgba(68, 68, 68, 0.1)'

fig.add_scatter(name = 'Upper Bound: 68%', x = ci_df.DOY_Label, y=ci_df.Upper1, mode = 'lines', showlegend = True, marker = dict(color=pred_color1, line=dict(width=0)))
fig.add_scatter(name = 'Lower Bound: 68%', x = ci_df.DOY_Label, y=ci_df.Lower1, mode = 'lines', showlegend = True, fillcolor=pred_color1, fill='tonexty', marker = dict(color=pred_color1, line=dict(width=0)))

fig.add_scatter(name = 'Upper Bound: 90%', x = ci_df.DOY_Label, y=ci_df.Upper2, mode = 'lines', showlegend = True, marker = dict(color=pred_color2, line=dict(width=0)))
fig.add_scatter(name = 'Lower Bound: 90%', x = ci_df.DOY_Label, y=ci_df.Lower2, mode = 'lines', showlegend = True, fillcolor=pred_color2, fill='tonexty', marker = dict(color=pred_color2, line=dict(width=0)))


fig.update_layout(
    yaxis_title='Close Prices ($)',
    title=stock_name + ' Stock Prices Until ' + str(latest_date),
    hovermode="x",
    plot_bgcolor='rgba(0,0,0,0)',
    paper_bgcolor='rgba(0,0,0,0)'
)

fig.update_xaxes(linecolor = 'black')
fig.update_yaxes(rangemode="tozero", linecolor = 'black')
fig.show()


# Conclusion

This project explores the use of a polynomial linear regression model to forecast future points in a time series. The model did not predict the exact future prices, but it did capture the overall downward shape of the price path. The project provided practical experience with polynomial linear regression, forecasting, prediction intervals, and data visualization. Overall, it contributed to a deeper understanding of the capabilities and limitations of polynomial linear regression on time series data.