# Week 1: Using historical data

This notebook bases its predictions primarily on historical stock data. I do not expect it to perform especially well, since many other factors influence stock prices.

I will pull price data for several stocks from Yahoo Finance. For this project, I will focus mainly on tech stocks (Apple, Facebook (now Meta), Microsoft, Tesla, and Google).

I will use a multivariate (multi-output) regression model, which predicts tomorrow's prices for multiple tickers simultaneously, using today's prices of all tickers as features.
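To make the idea of predicting several tickers at once concrete, here is a minimal sketch on synthetic data. It is not the notebook's model: it fits an ordinary least-squares regression with NumPy, and the three tickers, the weights `true_W`, and the offsets `true_b` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for "today's prices": 200 days x 3 hypothetical tickers
X = rng.normal(loc=100.0, scale=10.0, size=(200, 3))

# Hypothetical ground truth: each of tomorrow's prices is a linear mix of today's prices
true_W = np.array([[0.9, 0.1, 0.0],
                   [0.0, 0.8, 0.1],
                   [0.1, 0.1, 1.0]])
true_b = np.array([1.0, -2.0, 0.5])
Y = X @ true_W + true_b + rng.normal(scale=0.01, size=(200, 3))

# Append a column of ones so the intercept is learned together with the weights
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# A single least-squares solve fits every target column at once
coef, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_hat, b_hat = coef[:-1], coef[-1]

# Predict all three "tomorrow" prices from one row of today's prices
today = np.array([[101.0, 99.0, 100.0]])
prediction = today @ W_hat + b_hat
print(prediction)
```

Each column of `W_hat` holds the weights for one target ticker, so every prediction can draw on today's prices of all tickers.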

In [17]:
import pandas as pd
import yfinance as yf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Importing data from Yahoo Finance

The `yfinance` package lets us download Yahoo Finance data as a pandas DataFrame.

In [11]:
start_date = '2010-01-01'
end_date = '2024-05-01'
ticker_list = ['AAPL', 'META', 'MSFT', 'TSLA', 'GOOG']
data = yf.download(ticker_list, start=start_date, end=end_date, ignore_tz=True)[['Close']]

[*********************100%%**********************]  5 of 5 completed


In [12]:
data

| Date | (Close, AAPL) | (Close, GOOG) | (Close, META) | (Close, MSFT) | (Close, TSLA) |
|---|---|---|---|---|---|
| 2010-01-04 | 7.643214 | 15.610239 | NaN | 30.950001 | NaN |
| 2010-01-05 | 7.656429 | 15.541497 | NaN | 30.959999 | NaN |
| 2010-01-06 | 7.534643 | 15.149715 | NaN | 30.770000 | NaN |
| 2010-01-07 | 7.520714 | 14.797037 | NaN | 30.450001 | NaN |
| 2010-01-08 | 7.570714 | 14.994298 | NaN | 30.660000 | NaN |
| ... | ... | ... | ... | ... | ... |
| 2024-04-24 | 169.020004 | 161.100006 | 493.500000 | 409.059998 | 162.130005 |
| 2024-04-25 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 |
| 2024-04-26 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 |
| 2024-04-29 | 173.500000 | 167.899994 | 432.619995 | 402.250000 | 194.050003 |


Next, we use `dropna()` to remove every row that contains a NaN value, so the data starts on the first date for which all five tickers have a closing price.

In [13]:
data.dropna(inplace=True)
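As a small illustration of what `dropna()` does here, the sketch below uses a toy frame with two hypothetical tickers (`OLD` and `NEW`, with made-up prices). Any row with at least one NaN is removed, which is why the notebook's cleaned data only begins on 2012-05-18.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2012-05-16", "2012-05-17", "2012-05-18", "2012-05-21"])
toy = pd.DataFrame(
    {
        "OLD": [10.0, 10.5, 10.2, 10.4],        # hypothetical ticker with prices on every date
        "NEW": [np.nan, np.nan, 38.0, 34.0],    # hypothetical ticker with no prices on the first two dates
    },
    index=dates,
)

# Rows containing any NaN are dropped; only dates where both tickers have prices remain
cleaned = toy.dropna()
print(cleaned)
```

Using `toy.dropna()` returns a new frame and leaves `toy` untouched, whereas the notebook's `inplace=True` modifies `data` directly.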

In [14]:
data

| Date | (Close, AAPL) | (Close, GOOG) | (Close, META) | (Close, MSFT) | (Close, TSLA) |
|---|---|---|---|---|---|
| 2012-05-18 | 18.942142 | 14.953949 | 38.230000 | 29.270000 | 1.837333 |
| 2012-05-21 | 20.045713 | 15.295419 | 34.029999 | 29.750000 | 1.918000 |
| 2012-05-22 | 19.891787 | 14.963912 | 31.000000 | 29.760000 | 2.053333 |
| 2012-05-23 | 20.377144 | 15.179603 | 32.000000 | 29.110001 | 2.068000 |
| 2012-05-24 | 20.190001 | 15.035145 | 33.029999 | 29.070000 | 2.018667 |
| ... | ... | ... | ... | ... | ... |
| 2024-04-24 | 169.020004 | 161.100006 | 493.500000 | 409.059998 | 162.130005 |
| 2024-04-25 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 |
| 2024-04-26 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 |
| 2024-04-29 | 173.500000 | 167.899994 | 432.619995 | 402.250000 | 194.050003 |


## Finding tomorrow's prices

Next, we add columns holding the next trading day's closing prices by shifting each price column up one row with `shift(-1)`.

In [19]:
tomorrow_prices = data.shift(-1)
tomorrow_prices.columns = [f"Tomorrow_{col}" for col in tomorrow_prices.columns]
data_with_tomorrow = pd.concat([data, tomorrow_prices], axis=1)
data_with_tomorrow

| Date | (Close, AAPL) | (Close, GOOG) | (Close, META) | (Close, MSFT) | (Close, TSLA) | Tomorrow_('Close', 'AAPL') | Tomorrow_('Close', 'GOOG') | Tomorrow_('Close', 'META') | Tomorrow_('Close', 'MSFT') | Tomorrow_('Close', 'TSLA') |
|---|---|---|---|---|---|---|---|---|---|---|
| 2012-05-18 | 18.942142 | 14.953949 | 38.230000 | 29.270000 | 1.837333 | 20.045713 | 15.295419 | 34.029999 | 29.750000 | 1.918000 |
| 2012-05-21 | 20.045713 | 15.295419 | 34.029999 | 29.750000 | 1.918000 | 19.891787 | 14.963912 | 31.000000 | 29.760000 | 2.053333 |
| 2012-05-22 | 19.891787 | 14.963912 | 31.000000 | 29.760000 | 2.053333 | 20.377144 | 15.179603 | 32.000000 | 29.110001 | 2.068000 |
| 2012-05-23 | 20.377144 | 15.179603 | 32.000000 | 29.110001 | 2.068000 | 20.190001 | 15.035145 | 33.029999 | 29.070000 | 2.018667 |
| 2012-05-24 | 20.190001 | 15.035145 | 33.029999 | 29.070000 | 2.018667 | 20.081785 | 14.733027 | 31.910000 | 29.059999 | 1.987333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-04-24 | 169.020004 | 161.100006 | 493.500000 | 409.059998 | 162.130005 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 |
| 2024-04-25 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 |
| 2024-04-26 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 | 173.500000 | 167.899994 | 432.619995 | 402.250000 | 194.050003 |
| 2024-04-29 | 173.500000 | 167.899994 | 432.619995 | 402.250000 | 194.050003 | 170.330002 | 164.639999 | 430.170013 | 389.329987 | 183.279999 |


In [20]:
data_with_tomorrow.dropna(inplace=True)

In [21]:
data_with_tomorrow

| Date | (Close, AAPL) | (Close, GOOG) | (Close, META) | (Close, MSFT) | (Close, TSLA) | Tomorrow_('Close', 'AAPL') | Tomorrow_('Close', 'GOOG') | Tomorrow_('Close', 'META') | Tomorrow_('Close', 'MSFT') | Tomorrow_('Close', 'TSLA') |
|---|---|---|---|---|---|---|---|---|---|---|
| 2012-05-18 | 18.942142 | 14.953949 | 38.230000 | 29.270000 | 1.837333 | 20.045713 | 15.295419 | 34.029999 | 29.750000 | 1.918000 |
| 2012-05-21 | 20.045713 | 15.295419 | 34.029999 | 29.750000 | 1.918000 | 19.891787 | 14.963912 | 31.000000 | 29.760000 | 2.053333 |
| 2012-05-22 | 19.891787 | 14.963912 | 31.000000 | 29.760000 | 2.053333 | 20.377144 | 15.179603 | 32.000000 | 29.110001 | 2.068000 |
| 2012-05-23 | 20.377144 | 15.179603 | 32.000000 | 29.110001 | 2.068000 | 20.190001 | 15.035145 | 33.029999 | 29.070000 | 2.018667 |
| 2012-05-24 | 20.190001 | 15.035145 | 33.029999 | 29.070000 | 2.018667 | 20.081785 | 14.733027 | 31.910000 | 29.059999 | 1.987333 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-04-23 | 166.899994 | 159.919998 | 496.100006 | 407.570007 | 144.679993 | 169.020004 | 161.100006 | 493.500000 | 409.059998 | 162.130005 |
| 2024-04-24 | 169.020004 | 161.100006 | 493.500000 | 409.059998 | 162.130005 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 |
| 2024-04-25 | 169.889999 | 157.949997 | 441.380005 | 399.040009 | 170.179993 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 |
| 2024-04-26 | 169.300003 | 173.690002 | 443.290009 | 406.320007 | 168.289993 | 173.500000 | 167.899994 | 432.619995 | 402.250000 | 194.050003 |
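The shift-and-concatenate step can be seen in miniature below. This is a sketch on a three-day toy frame with two hypothetical tickers (`AAA`, `BBB`) and made-up prices. It uses `add_prefix` instead of the list comprehension, which gives the same kind of labels when the column names are plain strings.

```python
import pandas as pd

prices = pd.DataFrame(
    {"AAA": [10.0, 11.0, 12.0], "BBB": [20.0, 19.0, 21.0]},  # made-up prices
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# shift(-1) moves every value up one row, so each date sees the next date's price
tomorrow = prices.shift(-1).add_prefix("Tomorrow_")
combined = pd.concat([prices, tomorrow], axis=1)
print(combined)

# The last date has no "tomorrow" yet, so its shifted values are NaN and the row is dropped
supervised = combined.dropna()
print(supervised)
```

After `dropna()`, every remaining row pairs one day's prices with the following day's prices, which is the form a supervised model needs.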


## Setting up testing and training data

Next, we set up our training and testing data. Because we want to predict several tickers at once, we need a multi-output regression model. To build its target, we extract tomorrow's prices for each ticker from its column and concatenate them into a single DataFrame, `y`.

In [22]:
features = [('Close', 'AAPL'), ('Close', 'GOOG'), ('Close', 'META'), ('Close', 'MSFT'), ('Close', 'TSLA')]
X = data_with_tomorrow[features]  

In [31]:
# List to store one Series of tomorrow's prices per ticker
tomorrow_dfs = []

# Loop through each ticker and extract tomorrow's prices
for ticker in ['AAPL', 'GOOG', 'META', 'MSFT', 'TSLA']:
    # Column name produced by prefixing the ('Close', ticker) tuple with "Tomorrow_"
    tomorrow_column_name = f"Tomorrow_('Close', '{ticker}')"
    tomorrow_df = data_with_tomorrow[tomorrow_column_name]
    # Rename the Series to the ticker name
    tomorrow_df = tomorrow_df.rename(ticker)
    # Append the Series to the list
    tomorrow_dfs.append(tomorrow_df)

# Concatenate all Series column-wise into a single DataFrame
y = pd.concat(tomorrow_dfs, axis=1)

# Now, y contains tomorrow's prices for all tickers, one column per ticker
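The string column names such as `Tomorrow_('Close', 'AAPL')` arise because the downloaded frame has two-level column labels. As a possible alternative, the sketch below selects the `Close` level first so that the columns are plain ticker names. It uses a toy frame with made-up prices, and `X_toy` and `y_toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same two-level column layout as the downloaded data
columns = pd.MultiIndex.from_product(
    [["Close"], ["AAPL", "GOOG"]], names=["Price", "Ticker"]
)
frame = pd.DataFrame([[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]], columns=columns)

# Selecting the top-level key drops that level, leaving plain ticker names
close = frame["Close"]

# Features are today's prices; targets are the next row's prices.
# The final row is dropped from both because it has no "tomorrow".
X_toy = close.iloc[:-1]
y_toy = close.shift(-1).iloc[:-1]
print(y_toy)
```

With flat ticker names, the target frame can be built with a single `shift(-1)` and no per-ticker loop.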

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
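By default, `train_test_split` shuffles the rows, so test days end up interleaved with training days. For time-ordered prices, a chronological split is a common alternative. The sketch below shows one on a toy frame with made-up values, assuming the rows are already sorted by date. Passing `shuffle=False` to `train_test_split` should give the same kind of split.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=10, freq="D")
frame = pd.DataFrame({"today": range(10), "tomorrow": range(1, 11)}, index=idx)

# Keep the first 60% of dates for training and the most recent 40% for testing
split = int(len(frame) * 0.6)
train, test = frame.iloc[:split], frame.iloc[split:]

print(train.index.max(), test.index.min())
```

With this split, every test date falls after every training date, so the evaluation resembles predicting days the model has not seen.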

## Starting the learning model

Now, we use scikit-learn to train and test a model. The cell below fits a `RandomForestRegressor` (the `LinearRegression` alternative is left commented out), makes predictions on the test set, evaluates them, and predicts tomorrow's prices for one example row. Note that the output array is in the order `['AAPL', 'GOOG', 'META', 'MSFT', 'TSLA']`.

In [41]:
# from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Initialize the multi-output model (random forest; linear regression is the commented-out alternative)
# model = LinearRegression()
model = RandomForestRegressor(n_estimators=100)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Example prediction for a new data point
new_data_point = X_test.iloc[0].values.reshape(1, -1)  # Example: Using the first row of the test set as a new data point
predicted_tomorrow_prices = model.predict(new_data_point)
print("Predicted tomorrow's prices:", predicted_tomorrow_prices)

Mean Squared Error: 21.258734107730756
Predicted tomorrow's prices: [[19.26742859 29.94808971 67.63670124 38.0364994  14.59541989]]
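The single MSE value above is averaged over all five tickers, which can hide large differences between them because the tickers trade at very different price levels. The sketch below computes a per-ticker error with NumPy on made-up numbers for two hypothetical tickers. In scikit-learn, `mean_squared_error(y_test, y_pred, multioutput="raw_values")` returns the per-column values directly.

```python
import numpy as np

# Made-up true and predicted prices: rows are days, columns are two hypothetical tickers
y_true = np.array([[10.0, 100.0], [12.0, 110.0], [11.0, 105.0]])
y_hat = np.array([[10.5, 98.0], [11.5, 113.0], [11.0, 104.0]])

# Mean squared error per column (one value per ticker)
per_ticker_mse = ((y_true - y_hat) ** 2).mean(axis=0)

# Averaging the per-ticker values gives the single combined figure
overall_mse = per_ticker_mse.mean()

# RMSE is in price units, which is easier to compare against the price level
per_ticker_rmse = np.sqrt(per_ticker_mse)

print(per_ticker_mse, overall_mse, per_ticker_rmse)
```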


## Buy or sell?

Now that we have a prediction of tomorrow's prices, we can build a simple trading strategy on top of it:
- We will buy the stock if the predicted price is more than 2% higher than the current price.
- We will sell if the predicted price is more than 2% lower than the current price.
- We will hold if neither condition is met.

As before, the arrays are in the order `['AAPL', 'GOOG', 'META', 'MSFT', 'TSLA']`.

In [36]:
# Example trading strategy based on predicted prices
def trading_strategy(predicted_prices, current_prices):
    decision = []
    for pred_price, curr_price in zip(predicted_prices, current_prices):
        if pred_price > curr_price * 1.02:  # Buy if predicted price is 2% higher than current price
            decision.append("Buy")
        elif pred_price < curr_price * 0.98:  # Sell if predicted price is 2% lower than current price
            decision.append("Sell")
        else:
            decision.append("Hold")  # Hold if neither buy nor sell condition is met
    return decision

# Example prediction for a new data point
new_data_point = X_test.iloc[0].values.reshape(1, -1)  # Example: Using the first row of the test set as a new data point
predicted_tomorrow_prices = model.predict(new_data_point).flatten()  # Flatten the array

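# Current prices for the same tickers (hard-coded example values).
# For a meaningful comparison, these should be the prices on the same date as new_data_point.
current_prices = [173.03, 168.46, 441.68, 397.84, 180.01]  # Example: Replace with actual current prices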

# Use the trading strategy to make decisions
decisions = trading_strategy(predicted_tomorrow_prices, current_prices)
print("Trading Decisions:", decisions)


Trading Decisions: ['Sell', 'Sell', 'Sell', 'Sell', 'Sell']
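In the cell above, the prediction comes from the first row of the test set (prices of roughly 19, 30, 68, 38, and 15), while `current_prices` holds much higher hard-coded values, so every ticker comes out as "Sell". The sketch below applies the same rule to predictions and current prices taken from the same day. The three tickers and their prices are made up, and the `threshold` parameter is a hypothetical addition.

```python
def trading_strategy(predicted_prices, current_prices, threshold=0.02):
    """Return "Buy", "Sell", or "Hold" for each ticker (threshold is a fraction, 0.02 = 2%)."""
    decisions = []
    for pred_price, curr_price in zip(predicted_prices, current_prices):
        if pred_price > curr_price * (1 + threshold):
            decisions.append("Buy")    # predicted rise larger than the threshold
        elif pred_price < curr_price * (1 - threshold):
            decisions.append("Sell")   # predicted fall larger than the threshold
        else:
            decisions.append("Hold")   # predicted move within the threshold
    return decisions


# Made-up example: today's prices and the predictions made from those same prices
todays_prices = [100.0, 50.0, 200.0]
predicted_prices = [103.0, 50.5, 190.0]

decisions = trading_strategy(predicted_prices, todays_prices)
print("Trading Decisions:", decisions)
```

In the notebook, the same-day prices for the example row are available as `X_test.iloc[0].values`.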
