<a href="https://colab.research.google.com/github/kenanmorani/yahoo-finance-stock-prediction/blob/main/Stock_Market_Prediction_with_Yahoo_Finance_Data_Using_Machine_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing and importing the libraries required for yfinance data

In [1]:
# Install yfinance if not already installed
!pip install yfinance --quiet

In [2]:
# Import necessary libraries
import yfinance as yf
import pandas as pd

# Fetching and previewing the data

In [3]:
# Fetch historical data for SPY (S&P 500 ETF)
def fetch_stock_data(ticker="SPY", period="3y", interval="1d"):
    print(f"Fetching data for {ticker}...")
    stock_data = yf.download(ticker, period=period, interval=interval)
    print(f"Data fetched: {len(stock_data)} rows.")
    return stock_data

# Save the data to a CSV file for later use
def save_to_csv(data, filename="stock_data.csv"):
    data.to_csv(filename)
    print(f"Data saved to {filename}")

# Fetch and save data
spy_data = fetch_stock_data()
save_to_csv(spy_data)

# Display a preview
spy_data.head()


Fetching data for SPY...


[*********************100%***********************]  1 of 1 completed


Data fetched: 753 rows.
Data saved to stock_data.csv


| Date | Close (SPY) | High (SPY) | Low (SPY) | Open (SPY) | Volume (SPY) |
|---|---|---|---|---|---|
| 2022-01-25 | 415.799713 | 420.824107 | 408.794265 | 414.450301 | 167997300 |
| 2022-01-26 | 414.756622 | 424.95854 | 410.430839 | 421.781201 | 186391100 |
| 2022-01-27 | 412.708557 | 422.613798 | 410.995499 | 419.426909 | 149878300 |
| 2022-01-28 | 422.958344 | 423.006183 | 409.435539 | 414.086679 | 164457400 |
| 2022-01-31 | 430.576263 | 430.930359 | 420.910279 | 422.278822 | 152251400 |


# Checking and preprocessing of the data

<small>
In this section, we prepare the raw stock market data for modeling.  
The following steps are performed:    <br>
1. Handle Missing Values: Any rows with missing values are dropped to keep the data consistent (the check below finds none).    <br>
2. Select Relevant Features: We keep the standard price and volume columns (Open, High, Low, Close, Volume) and add the Daily Return, the percentage change in Close from the previous day, to capture day-to-day price changes.    <br>
3. Normalize the Data: Min-Max Scaling maps every feature to the range 0 to 1, which helps model training converge faster. The scaler is fitted on the full dataset here, before any train/test split.    <br>

The resulting dataset is clean, consistent, and ready for feature engineering and model development.<br>
</small>
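
As a rough illustration of steps 2 and 3, the sketch below applies the same two operations to a made-up four-day price series. The numbers are hypothetical, not SPY data, and the scaling is written out by hand as (x - min) / (max - min), which is what `MinMaxScaler` computes with its default range.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only
toy = pd.DataFrame({"Close": [100.0, 110.0, 105.0, 120.0]})

# Daily return: (today - yesterday) / yesterday; the first row has no predecessor
toy["Daily Return"] = toy["Close"].pct_change()
toy = toy.dropna()

# Min-max scaling, column by column: (x - min) / (max - min)
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_scaled)
```

After the first row is dropped, the remaining closes are 110, 105 and 120, so the smallest maps to 0, the largest to 1, and 110 lands a third of the way between them.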


In [4]:
# Import necessary libraries
from sklearn.preprocessing import MinMaxScaler

In [5]:
# Step 1: Handle Missing Values
print("Checking for missing values...")
print(spy_data.isnull().sum())
spy_data = spy_data.dropna()  # Remove rows with missing values

Checking for missing values...
Price   Ticker
Close   SPY       0
High    SPY       0
Low     SPY       0
Open    SPY       0
Volume  SPY       0
dtype: int64


In [6]:
# Step 2: Feature Selection
# Select required columns; .copy() makes an independent DataFrame so the
# column assignment below does not raise a SettingWithCopyWarning
selected_features = ['Open', 'High', 'Low', 'Close', 'Volume']
processed_data = spy_data[selected_features].copy()

# Optional: Add daily returns as a feature
processed_data['Daily Return'] = processed_data['Close'].pct_change()
processed_data = processed_data.dropna()  # Remove the first row, which has no previous close


In [7]:
# Step 3: Normalize the Data
scaler = MinMaxScaler()
normalized_data = pd.DataFrame(scaler.fit_transform(processed_data),
                               columns=processed_data.columns,
                               index=processed_data.index)

# Save the preprocessed data to a CSV file for later use
normalized_data.to_csv("preprocessed_stock_data.csv")
print("Preprocessed data saved to preprocessed_stock_data.csv")

# Display a preview of the normalized data
normalized_data.head()


Preprocessed data saved to preprocessed_stock_data.csv


| Date | Open (SPY) | High (SPY) | Low (SPY) | Close (SPY) | Volume (SPY) | Daily Return |
|---|---|---|---|---|---|---|
| 2022-01-26 | 0.308048 | 0.291934 | 0.272146 | 0.262956 | 0.852693 | 0.416246 |
| 2022-01-27 | 0.299385 | 0.283 | 0.274239 | 0.255215 | 0.657475 | 0.391567 |
| 2022-01-28 | 0.279732 | 0.284495 | 0.268457 | 0.293957 | 0.735423 | 0.694029 |
| 2022-01-31 | 0.30988 | 0.31469 | 0.310988 | 0.322752 | 0.670163 | 0.624702 |
| 2022-02-01 | 0.343126 | 0.326906 | 0.336281 | 0.333749 | 0.5146 | 0.510373 |


# Adding features to the dataset

In this section, we will enhance our dataset by incorporating several technical indicators commonly used in stock market prediction. These indicators will serve as features for the predictive model, improving its ability to learn from historical market data.

1. **Relative Strength Index (RSI):**  
   RSI is a momentum oscillator that measures the speed and change of price movements. It is calculated using the average gains and losses over a defined period (commonly 14 days). The formula for RSI is:

   $$
   RSI = 100 - \frac{100}{1 + RS}
   $$

   where \( RS \) (Relative Strength) is the ratio of the average gain to the average loss:

   $$
   RS = \frac{\text{Average Gain}}{\text{Average Loss}}
   $$

2. **Moving Average Convergence Divergence (MACD):**  
   MACD is a trend-following momentum indicator that calculates the difference between two exponential moving averages (EMAs) of the closing price. The formula for MACD is:

   $$
   MACD = EMA_{\text{short}} - EMA_{\text{long}}
   $$

   where:
   - \( EMA_{\text{short}} \) is the Exponential Moving Average of the closing price over a short period (typically 12 days).
   - \( EMA_{\text{long}} \) is the Exponential Moving Average of the closing price over a longer period (typically 26 days).

   Additionally, a **Signal Line** is calculated as the 9-day EMA of the MACD:

   $$
   \text{Signal Line} = EMA_{9}(MACD)
   $$

3. **Bollinger Bands:**  
   Bollinger Bands consist of a simple moving average (SMA) and two bands that are placed two standard deviations above and below the SMA. The formulas are:

   $$
   \text{Middle Band} = SMA_{n}(P)
   $$

   where \(P\) is the closing price, and \(n\) is the window size (typically 20 days). The Upper and Lower Bands are calculated as:

   $$
   \text{Upper Band} = SMA + (2 \times \text{Standard Deviation})
   $$

   $$
   \text{Lower Band} = SMA - (2 \times \text{Standard Deviation})
   $$

In the following code, we will calculate and add these indicators to our dataset, preparing it for the model development phase. Later, we will discuss the rationale behind these choices and explore their impact on the model's performance.
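
Before applying the formulas to real data, it can help to check them by hand. The sketch below works through the RSI and Bollinger Band formulas on five made-up prices with shortened windows; the prices and window sizes are illustrative assumptions, and MACD is left out because its exponential averages are harder to verify by eye.

```python
from statistics import mean, stdev

# Hypothetical closing prices, chosen so the arithmetic is easy to follow
prices = [10.0, 11.0, 10.5, 11.5, 12.0]

# Day-to-day changes: +1.0, -0.5, +1.0, +0.5
deltas = [b - a for a, b in zip(prices, prices[1:])]
gains = [max(d, 0.0) for d in deltas]    # positive moves only
losses = [max(-d, 0.0) for d in deltas]  # negative moves, as positive numbers

# RSI over a window of 4 changes (the notebook uses 14)
rs = mean(gains) / mean(losses)          # 0.625 / 0.125 = 5.0
rsi = 100 - 100 / (1 + rs)               # 100 - 100/6

# Bollinger Bands over the same 5 prices (the notebook uses 20)
sma = mean(prices)                       # 11.0
std = stdev(prices)                      # sample standard deviation (ddof=1), the pandas default
upper_band, lower_band = sma + 2 * std, sma - 2 * std

print(round(rsi, 2), sma, round(upper_band, 3), round(lower_band, 3))
```

Note that `rolling(...).std()` in pandas uses the sample standard deviation by default, so the bands it produces are slightly wider than those computed with the population standard deviation.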

In [8]:
# Import necessary libraries for feature engineering
import numpy as np

In [9]:
# Step 1: Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    delta = data['Close'].diff(1)  # Calculate daily price change
    gain = delta.where(delta > 0, 0)  # Keep only positive gains
    loss = -delta.where(delta < 0, 0)  # Keep only negative losses (as positive values)

    # Calculate rolling averages of gains and losses
    avg_gain = gain.rolling(window=window, min_periods=1).mean()
    avg_loss = loss.rolling(window=window, min_periods=1).mean()

    # Compute Relative Strength (RS) and Relative Strength Index (RSI)
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))

    return rsi  # Return as a Pandas Series

# Apply the function to the dataset
normalized_data['RSI'] = calculate_rsi(normalized_data)

# Step 2: Moving Average Convergence Divergence (MACD)
def calculate_macd(data, short_window=12, long_window=26, signal_window=9):
    short_ema = data['Close'].ewm(span=short_window, adjust=False).mean()
    long_ema = data['Close'].ewm(span=long_window, adjust=False).mean()
    macd = short_ema - long_ema
    signal_line = macd.ewm(span=signal_window, adjust=False).mean()
    return macd, signal_line

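# MACD is computed from processed_data (unscaled prices), so these two columns are in price units, not the 0-1 range
normalized_data['MACD'], normalized_data['Signal_Line'] = calculate_macd(processed_data)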

# Step 3: Bollinger Bands
def calculate_bollinger_bands(data, window=20):
    sma = data['Close'].rolling(window=window).mean()  # Simple Moving Average
    std = data['Close'].rolling(window=window).std()   # Standard Deviation
    upper_band = sma + (2 * std)
    lower_band = sma - (2 * std)
    return upper_band, lower_band

# The bands are also computed from processed_data (unscaled prices), so they are in price units
normalized_data['Upper_Band'], normalized_data['Lower_Band'] = calculate_bollinger_bands(processed_data)

In [10]:
# Step 4: Drop rows with NaN values (caused by rolling calculations)
normalized_data = normalized_data.dropna()

In [11]:
# Save the feature-engineered data to a CSV file
normalized_data.to_csv("feature_engineered_data.csv")
print("Feature-engineered data saved to feature_engineered_data.csv")

# Display a preview of the dataset with new features
normalized_data.head()


Feature-engineered data saved to feature_engineered_data.csv


| Date | Open (SPY) | High (SPY) | Low (SPY) | Close (SPY) | Volume (SPY) | Daily Return | RSI | MACD | Signal_Line | Upper_Band | Lower_Band |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022-02-23 | 0.279662 | 0.252623 | 0.245506 | 0.221609 | 0.564978 | 0.261527 | 26.50025 | -1.525829 | 1.349818 | 442.397323 | 405.732874 |
| 2022-02-24 | 0.203448 | 0.236212 | 0.207514 | 0.244579 | 1.0 | 0.594612 | 37.098107 | -2.139338 | 0.651987 | 442.791799 | 404.852219 |
| 2022-02-25 | 0.26892 | 0.269324 | 0.268598 | 0.278764 | 0.507377 | 0.665876 | 43.004963 | -1.874173 | 0.146755 | 442.530876 | 405.736165 |
| 2022-02-28 | 0.277443 | 0.270637 | 0.278673 | 0.274712 | 0.634681 | 0.415739 | 43.181486 | -1.730569 | -0.22871 | 442.4845 | 405.273398 |
| 2022-03-01 | 0.288044 | 0.266882 | 0.265938 | 0.250657 | 0.592822 | 0.287012 | 37.049009 | -2.106016 | -0.604171 | 442.033918 | 403.816634 |


# Predictive Modelling


In this step, we will choose an appropriate predictive model to forecast the next day's closing price of the stock based on the features we've engineered so far. Given that we have historical time-series data, a recurrent neural network (RNN), specifically an LSTM (Long Short-Term Memory) model, would be a suitable choice for capturing the sequential dependencies in the data.

However, a simpler approach like Linear Regression or Random Forest could also be considered, especially if you're focusing on quick experimentation. But for this task, we'll proceed with an LSTM model for forecasting.

In [12]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.model_selection import train_test_split


A) Prepare Data for the Model <br>
Before training the model, we need to:  <br>

Scale the features to improve the performance of the neural network (using MinMaxScaler from sklearn).  <br>
Build fixed-length input windows and their targets.  <br>
Split the data into training and testing sets.  <br><br>

Scaling: We use MinMaxScaler to scale the features to a range of 0 to 1. LSTM models generally train more reliably on scaled data. <br>
Prepare Data for LSTM: We take the past 60 days of data as features (X) and use the next day's closing price as the target (y). <br>
Splitting: The last 20% of the windows are held out for testing, without shuffling, so the test period follows the training period in time. <br>
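
The windowing step is easier to see on a small array. The following is a minimal sketch with a 3-day window over 8 rows of arbitrary values; `toy_data`, `toy_window` and `target_col` are hypothetical names used only for this illustration.

```python
import numpy as np

# Toy stand-in for the scaled feature matrix: 8 days x 4 columns (values are arbitrary)
toy_data = np.arange(32, dtype=float).reshape(8, 4)
toy_window = 3   # the notebook uses 60
target_col = 1   # hypothetical position of the target column

X_toy, y_toy = [], []
for i in range(toy_window, len(toy_data)):
    X_toy.append(toy_data[i - toy_window:i])  # rows i-3 .. i-1 (the past)
    y_toy.append(toy_data[i, target_col])     # row i (the "next day")

X_toy, y_toy = np.array(X_toy), np.array(y_toy)
print(X_toy.shape, y_toy.shape)  # (5, 3, 4) (5,)
```

Each sample is a (window, features) block, and the number of samples is the number of rows minus the window length, since the first window needs that many days of history.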

In [20]:
# 1. Scale the features using MinMaxScaler
# These two lines were commented out in the original cell, but the loop below
# needs scaled_data; restoring them is an assumption about how it was produced
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(normalized_data)

# Length of each input window in days (the text above specifies 60)
window_length = 60

# Step 1: Create the features (X) and target (y)
# We'll predict the next day's closing price
X = []
y = []

for i in range(window_length, len(scaled_data)):
    # Previous 60 days; ':-1' drops the last column ('Lower_Band'), so 'Close' stays among the inputs
    X.append(scaled_data[i - window_length:i, :-1])
    y.append(scaled_data[i, 3])  # 'Close' price (index 3)

X = np.array(X)
y = np.array(y)

# Step 2: Scale the features and target separately
# Note: both scalers are fitted on all windows before the split, so test-period ranges influence the scaling
# Scale the features (X)
feature_scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = feature_scaler.fit_transform(X.reshape(-1, X.shape[2])).reshape(X.shape)  # Reshape for feature scaling

# Scale the target (y)
target_scaler = MinMaxScaler(feature_range=(0, 1))
y_scaled = target_scaler.fit_transform(y.reshape(-1, 1))

# Step 3: Split data into train and test sets (shuffle=False keeps chronological order)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, shuffle=False)

B) Model Selection  <br>
We use an LSTM model, which is well suited to time-series data. LSTMs can retain information over long sequences and are often used for tasks like stock price prediction. <br>

Model Structure <br>
For simplicity, we create a basic LSTM model with: <br>

Two stacked LSTM layers (to capture sequential relationships), <br>
Dropout after each LSTM layer for regularization to prevent overfitting, <br>
A Dense layer for output prediction (next day's closing price). <br> <br>

Model Architecture: <br>
The first LSTM layer has 50 units, and return_sequences=True makes it output the full sequence so that another LSTM layer can be stacked on top. <br>
The second LSTM layer also has 50 units but returns only its final output (return_sequences=False), which feeds the output layer. <br>
A Dropout layer with rate 0.2 follows each LSTM layer to reduce overfitting. <br>
The output layer is a Dense layer with a single unit to predict the next closing price.
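
To get a feel for the size of this network, the trainable parameters can be counted by hand. The sketch below uses the standard LSTM count of 4 × units × (units + input_dim + 1) per layer. The value of 10 input features is an assumption, based on the eleven columns of the feature-engineered table with one dropped when the windows are built; the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * units * (units + input_dim + 1)

def dense_params(units, input_dim):
    # One weight per input plus a bias, per output unit
    return units * (input_dim + 1)

n_features = 10  # assumed: 11 columns minus the one dropped when building X

layer_params = {
    "lstm_1": lstm_params(50, n_features),  # 4 * 50 * 61
    "lstm_2": lstm_params(50, 50),          # 4 * 50 * 101
    "dense": dense_params(1, 50),           # 50 weights + 1 bias
}
total_params = sum(layer_params.values())
print(layer_params, total_params)
```

Dropout layers add no trainable parameters, so they do not appear in the count. Calling `model.summary()` on the built model is the direct way to confirm the figures.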

In [21]:
# Step 4: Build the LSTM model
model = Sequential()

# LSTM layers
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dropout(0.2)) # Dropout for regularization
model.add(LSTM(units=50, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(units=1))



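C) Compiling and training the model <br> <br>
Compilation and Training: We use the Adam optimizer and MSE loss for regression. The model is trained for 50 epochs with a batch size of 32, and the test split is passed as validation data so that the validation loss is reported after each epoch.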
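
Two quantities show up in the training output: the MSE loss and the number of steps per epoch. The sketch below computes both by hand. The targets and predictions are made up, and the training-set size of 538 is an assumption chosen to be consistent with the 17 steps per epoch shown in the log at batch size 32.

```python
import math

def mse(y_true, y_pred):
    # Mean of squared differences, the quantity reported as loss and val_loss
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical scaled targets and predictions
toy_loss = mse([0.50, 0.60, 0.70], [0.50, 0.50, 0.90])  # (0 + 0.01 + 0.04) / 3

# Steps per epoch = ceil(training samples / batch size)
n_train_assumed = 538  # assumption, see the note above
steps_per_epoch = math.ceil(n_train_assumed / 32)
print(round(toy_loss, 4), steps_per_epoch)
```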

In [22]:
# 5. Compile the model with a suitable optimizer (Adam)
model.compile(optimizer='adam', loss='mean_squared_error')

# 6. Train the model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 139ms/step - loss: 0.0311 - val_loss: 0.0022
Epoch 2/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 87ms/step - loss: 0.0059 - val_loss: 0.0038
Epoch 3/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 97ms/step - loss: 0.0038 - val_loss: 0.0029
Epoch 4/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 93ms/step - loss: 0.0030 - val_loss: 0.0019
Epoch 5/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 59ms/step - loss: 0.0029 - val_loss: 0.0041
Epoch 6/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 59ms/step - loss: 0.0032 - val_loss: 0.0029
Epoch 7/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 59ms/step - loss: 0.0024 - val_loss: 0.0015
Epoch 8/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 60ms/step - loss: 0.0027 - val_loss: 0.0023
Epoch 9/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7fbc29341350>

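Making Predictions <br><br>
Prediction and Evaluation: After training, we use the model to predict the test-set closing prices and evaluate them using RMSE (Root Mean Squared Error). Because the Close column was already min-max scaled during preprocessing, inverting target_scaler returns values on that 0 to 1 scale, so the RMSE below is in normalized units rather than dollars.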
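
RMSE is expressed in the units of the values being compared. Since min-max scaling is linear, an RMSE measured on scaled values can be converted to price units by multiplying by the original price range. The sketch below illustrates this with made-up numbers; `close_min` and `close_max` are placeholders, not values taken from the SPY data.

```python
import numpy as np

# Hypothetical scaled targets and predictions (0-1 range)
y_true_scaled = np.array([0.20, 0.40, 0.60, 0.80])
y_pred_scaled = np.array([0.25, 0.35, 0.65, 0.75])

# Root of the mean squared difference
rmse_scaled = np.sqrt(np.mean((y_true_scaled - y_pred_scaled) ** 2))

# Scaled error times the original range (max - min) gives the error in price units
close_min, close_max = 350.0, 600.0  # placeholder bounds, not taken from the data
rmse_price = rmse_scaled * (close_max - close_min)
print(round(rmse_scaled, 4), round(rmse_price, 2))
```

Here every prediction is off by 0.05 in scaled units, so the scaled RMSE is 0.05, which corresponds to 12.5 price units over the assumed range of 250.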

In [26]:
# 7. Predict the results
predictions = model.predict(X_test)

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 25ms/step


In [27]:
# Step 7: Inverse transform the predicted values
predictions_reshaped = predictions.reshape(-1, 1)
predicted_prices = target_scaler.inverse_transform(predictions_reshaped)  # Reverse the scaling for the target

In [28]:
from sklearn.metrics import mean_squared_error
import math
# Step 8: Evaluate the model
rmse = math.sqrt(mean_squared_error(target_scaler.inverse_transform(y_test.reshape(-1, 1)), predicted_prices))
print(f"Root Mean Squared Error (RMSE): {rmse}")


Root Mean Squared Error (RMSE): 0.03383713506197041


# Comparing our model to other predictive models

Random Forest Model of similar complexity
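
The scikit-learn models in this comparison expect 2-D input of shape (samples, features), whereas the LSTM windows are 3-D. The sketch below shows the flattening used to bridge the two, on a small placeholder array; `X_demo` and `y_demo` are hypothetical names for this illustration.

```python
import numpy as np

# Placeholder batch: 5 samples, each a 3-day window of 4 features
X_demo = np.arange(60, dtype=float).reshape(5, 3, 4)

# Each window is flattened into one row of 3 * 4 = 12 values
X_flat = X_demo.reshape(X_demo.shape[0], -1)

# A column-vector target (n, 1) can be flattened to (n,) with ravel()
y_demo = np.zeros((5, 1))
print(X_flat.shape, y_demo.ravel().shape)  # (5, 12) (5,)
```

After flattening, the model sees each day-feature pair as an independent input column, so the ordering of days within a window is no longer explicit.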

In [29]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Step 4.1: Train Random Forest (RF)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
# Flatten each window to one row for RF; ravel() passes y as the 1-D array scikit-learn expects
rf_model.fit(X_train.reshape(X_train.shape[0], -1), y_train.ravel())


In [30]:
# Make predictions
rf_predictions = rf_model.predict(X_test.reshape(X_test.shape[0], -1))  # Flatten X_test for RF

# Inverse transform the predictions
rf_predicted_prices = target_scaler.inverse_transform(rf_predictions.reshape(-1, 1))

# Evaluate the model
rf_rmse = math.sqrt(mean_squared_error(target_scaler.inverse_transform(y_test.reshape(-1, 1)), rf_predicted_prices))
print(f"Random Forest RMSE: {rf_rmse}")

Random Forest RMSE: 0.14452318712802725


Support Vector Machine with similar complexity

In [31]:
# Step 4.2: Train Support Vector Machine (SVM)
svm_model = SVR(kernel='rbf', C=100, epsilon=0.1)
# Flatten each window to one row for SVM; ravel() passes y as the 1-D array scikit-learn expects
svm_model.fit(X_train.reshape(X_train.shape[0], -1), y_train.ravel())


In [32]:
# Make predictions
svm_predictions = svm_model.predict(X_test.reshape(X_test.shape[0], -1))  # Flatten X_test for SVM

# Inverse transform the predictions
svm_predicted_prices = target_scaler.inverse_transform(svm_predictions.reshape(-1, 1))

# Evaluate the model
svm_rmse = math.sqrt(mean_squared_error(target_scaler.inverse_transform(y_test.reshape(-1, 1)), svm_predicted_prices))
print(f"SVM RMSE: {svm_rmse}")

SVM RMSE: 0.32994284019818876


Artificial Neural Networks with similar complexity

In [33]:
# Step 4.3: Train Artificial Neural Network (ANN)
ann_model = MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=1000, random_state=42)
# Flatten each window to one row for ANN; ravel() passes y as the 1-D array scikit-learn expects
ann_model.fit(X_train.reshape(X_train.shape[0], -1), y_train.ravel())


In [34]:
# Make predictions
ann_predictions = ann_model.predict(X_test.reshape(X_test.shape[0], -1))  # Flatten X_test for ANN

# Inverse transform the predictions
ann_predicted_prices = target_scaler.inverse_transform(ann_predictions.reshape(-1, 1))

# Evaluate the model
ann_rmse = math.sqrt(mean_squared_error(target_scaler.inverse_transform(y_test.reshape(-1, 1)), ann_predicted_prices))
print(f"ANN RMSE: {ann_rmse}")

ANN RMSE: 0.045545319473081006


Conclusion:  
All RMSE values below are in min-max scaled units of the closing price and come from a single training run on one chronological 80/20 split, so small differences between models should be read with caution.

LSTM: With an RMSE of 0.0338, the LSTM has the lowest error of the four models. Its recurrent architecture processes each 60-day window in order, which appears well-suited to predicting stock prices in this context.

ANN: The ANN (RMSE = 0.0455) is a close second. It fits the data well, but it receives each window as a flat vector and has no built-in notion of time order, which may be why it is slightly less accurate than the LSTM.

Random Forest (RF): The RF model (RMSE = 0.1445) has roughly four times the LSTM's error. Like the ANN, it sees flattened windows, and a random forest cannot predict values outside the range of its training targets, which may hurt on a trending price series.

SVM: The SVM (RMSE = 0.3299) has the highest error of the four models. Like the ANN and RF, it has no built-in handling of time order. In addition, epsilon=0.1 ignores errors of up to a tenth of the scaled price range, which may contribute to the poor fit.

## Checking the required libraries

In [36]:
# Fetch the versions of installed libraries
# importlib.metadata is used in place of pkg_resources, which is deprecated
from importlib.metadata import version, PackageNotFoundError

# Libraries you want to include
required_libraries = ['pandas', 'numpy', 'matplotlib', 'seaborn', 'scikit-learn', 'tensorflow', 'yfinance']

# Fetch the installed version of each library, skipping any that are not installed
installed_versions = {}
for library in required_libraries:
    try:
        installed_versions[library] = version(library)
    except PackageNotFoundError:
        pass

# Write the specific libraries and versions to the requirements.txt file
with open("requirements.txt", "w") as f:
    for library, installed_version in installed_versions.items():
        f.write(f"{library}=={installed_version}\n")

print("Filtered requirements.txt file with selected libraries has been created.")



Filtered requirements.txt file with selected libraries has been created.


In [38]:
import sys
# Display Python version
sys.version

'3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]'