## Part 3. Machine Learning Price Prediction

>[!warning] Training can be time-consuming. Expect at least 2-3 minutes per stock.

## Part 3.1. Data preparation

### Part 3.1.1. Import libraries

Import the libraries used in this part.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

from read_write_csv import read_csv, save_csv

import numpy as np
import pandas as pd


### Part 3.1.2. Import the data for the analysis

Load the dataframe with the historical data.

In [2]:
df = read_csv('save216.csv')

### Part 3.1.3. Convert values so they have the applicable formatting to feed the model

The model expects its inputs on a common scale, so we normalize every column of the dataframe to the range [0, 1].

In [3]:
# Keep a separate scaler per ticker so predictions can be inverse-transformed later
scalers = {}
scaled_data = {}

# Normalize each ticker's column to [0, 1], then store both the scaled values
# (scaled_data) and the fitted scaler (scalers)
for ticker in df.columns:
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_data[ticker] = scaler.fit_transform(df[[ticker]])
    scalers[ticker] = scaler
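
As a minimal sketch of what the scaling step does to a single column, the snippet below applies the min-max formula by hand with NumPy and then reverses it. With `feature_range=(0, 1)`, `MinMaxScaler` computes the same `(x - min) / (max - min)` per column. The price values are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical closing prices for one ticker (illustrative values only)
prices = np.array([50.0, 55.0, 60.0, 52.5, 57.5])

# Min-max scaling to [0, 1]: (x - min) / (max - min)
p_min, p_max = prices.min(), prices.max()
scaled = (prices - p_min) / (p_max - p_min)

# Inverse transform: maps scaled values (e.g. predictions) back to price units
restored = scaled * (p_max - p_min) + p_min

print(scaled)
print(restored)
```

Keeping one scaler per ticker matters because each column has its own minimum and maximum; reusing another ticker's scaler would map predictions back to the wrong price range.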

### Part 3.1.4. Tickers pre-processing

In [4]:
def create_dataset(dataset, lookback):
    """
    Create sequences of `lookback` days as input (X)
    and the next day's price as output (y).
    """
    X, y = [], []
    for i in range(lookback, len(dataset)):
        X.append(dataset[i-lookback:i, 0])  # Last `lookback` prices
        y.append(dataset[i, 0])            # Target is the next price
    return np.array(X), np.array(y)

def add_technical_features(series):
    """
    Add technical features like RSI to a series of price data.
    Parameters:
        series (pd.Series): A pandas Series of price data for a ticker.
    Returns:
        pd.DataFrame: A DataFrame with original prices and added features.
    """
    if len(series) < 14:  # Ensure enough data for rolling calculations
        raise ValueError("Insufficient data for technical feature calculation.")

    # Calculate daily price changes
    delta = series.diff()

    # Separate gains and losses
    gain = delta.where(delta > 0, 0).rolling(window=14, min_periods=1).mean()
    loss = -delta.where(delta < 0, 0).rolling(window=14, min_periods=1).mean()

    # Calculate the Relative Strength Index (RSI)
    rs = gain / (loss + 1e-10)  # Add small constant to avoid division by zero
    rsi = 100 - (100 / (1 + rs))

    # Prepare the DataFrame with features
    features_df = series.to_frame(name="price")  # Convert Series to DataFrame
    features_df["RSI"] = rsi

    return features_df

# Iterate over columns to process each ticker
ticker_models = {}
counter = 0

for ticker in df.columns:
    try:
        counter += 1
        print(f"[{counter}] Processing ticker: {ticker}")

        # Extract and process price data for the ticker
        price_series = df[ticker]
        ticker_features = add_technical_features(price_series)

        # Dynamically adjust lookback period based on dataset size
        num_rows = len(ticker_features)
        lookback = max(10, min(60, num_rows // 100))  # Example scaling logic
        print(f"Ticker: {ticker}, Lookback period: {lookback}")

        # Create input (X) and output (y) sequences
        ticker_scaled_data = scaled_data[ticker]  # Ensure scaled data is used
        X, y = create_dataset(ticker_scaled_data, lookback)

        # Reshape X for LSTM: [samples, time_steps, features]
        X = X.reshape(X.shape[0], X.shape[1], 1)

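        # Split into training and testing sets.
        # Note: train_test_split shuffles by default, so the test windows are
        # interleaved in time with the training windows.
        from sklearn.model_selection import train_test_split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )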

        # Save training and testing data in dictionaries if needed later
        ticker_models[ticker] = {
            "X_train": X_train,
            "X_test": X_test,
            "y_train": y_train,
            "y_test": y_test,
            "scaler": scalers[ticker],  # Save the scaler for this ticker
        }

    except ValueError as e:
        print(f"Skipping ticker {ticker} due to error: {e}")
        continue


[1] Processing ticker: KO
Ticker: KO, Lookback period: 25
[2] Processing ticker: MCD
Ticker: MCD, Lookback period: 25
[3] Processing ticker: MOH
Ticker: MOH, Lookback period: 25
[4] Processing ticker: NEE
Ticker: NEE, Lookback period: 25
[5] Processing ticker: PH
Ticker: PH, Lookback period: 25
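
To make the windowing concrete, here is a small sketch on a toy series. The helper `make_windows` is a hypothetical one-dimensional variant of `create_dataset`, and the chronological split shown at the end is an assumption offered as an alternative, not what the cell above does. The notebook shuffles with `train_test_split`; because consecutive windows overlap, a shuffled split lets test windows share prices with training windows, whereas a chronological split (or `train_test_split(..., shuffle=False)`) keeps the test period strictly after the training period.

```python
import numpy as np

def make_windows(values, lookback):
    """Return (X, y): each X row holds `lookback` consecutive values,
    and y holds the value that immediately follows that row."""
    X, y = [], []
    for i in range(lookback, len(values)):
        X.append(values[i - lookback:i])
        y.append(values[i])
    return np.array(X), np.array(y)

# Toy stand-in for one scaled price column
series = np.arange(10, dtype=float)
X, y = make_windows(series, lookback=3)

# Chronological 80/20 split: every test window comes after every training window
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X[0], y[0])
print(X_train.shape, X_test.shape)
```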


## <mark>Part 3.2. Feeding models</mark>

### <mark>Part 3.2.1. Training the models for each ticker</mark>

In [5]:
# Train models for each ticker
for ticker, model_data in ticker_models.items():
    print(f"Training LSTM model for ticker: {ticker}")

    # Get training data
    X_train, y_train = model_data["X_train"], model_data["y_train"]

    # Initialize the model
    model = Sequential()

    # Bidirectional LSTM layer
    model.add(Bidirectional(LSTM(units=50, return_sequences=True), input_shape=(X_train.shape[1], X_train.shape[2])))
    model.add(LSTM(units=50, return_sequences=False))
    model.add(Dense(units=25))
    model.add(Dense(units=1))

    # Compile the model
    model.compile(optimizer='adam', loss='mean_squared_error')

    # Implement early stopping to prevent overfitting
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

    # Learning rate reduction after plateau
    lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_lr=1e-6)

    # Train the model
    model.fit(X_train, y_train, batch_size=32, epochs=50, validation_split=0.2, verbose=1, callbacks=[early_stopping, lr_scheduler])

    # Save the trained model
    ticker_models[ticker]["model"] = model


Training LSTM model for ticker: KO

Epoch 1/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 7s 32ms/step - loss: 0.0237 - val_loss: 7.2178e-04 - learning_rate: 0.0010
Epoch 2/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - loss: 6.5182e-04 - val_loss: 5.1141e-04 - learning_rate: 0.0010
Epoch 3/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 4.7591e-04 - val_loss: 4.2368e-04 - learning_rate: 0.0010
Epoch 4/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 4.9147e-04 - val_loss: 4.1243e-04 - learning_rate: 0.0010
Epoch 5/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 4.4745e-04 - val_loss: 3.6338e-04 - learning_rate: 0.0010
Epoch 6/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 4.6282e-04 - val_loss: 3.5389e-04 - learning_rate: 0.0010
Epoch 7/50
50/50 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - loss: 4.098

### Part 3.2.2. Evaluation of the models based on RMSE

A low RMSE means the model's predictions are close to the actual prices, while a high RMSE means larger errors. Because predictions are converted back to the original price scale before scoring, each RMSE is expressed in the ticker's own price units. Here we compute the RMSE on the training and test sets for every ticker.
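
The metric itself can be sketched in a few lines of NumPy. The helper `rmse` and the price values below are hypothetical and only illustrate the arithmetic. The second call shows a naive baseline that predicts each day's price as the previous day's price.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical prices and model predictions (illustrative values only)
actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([101.0, 101.0, 103.0, 103.0])

model_rmse = rmse(actual, predicted)

# Naive baseline: yesterday's price is the forecast for today
naive_rmse = rmse(actual[1:], actual[:-1])

print(model_rmse, naive_rmse)
```

A model is only useful for forecasting if its RMSE is lower than that of such a baseline on the same data.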

In [6]:
# Initialize dictionaries to store RMSE for training and test datasets
train_rmse_results = {}
test_rmse_results = {}

# Evaluate models for each ticker
for ticker, model_data in ticker_models.items():
    print(f"Evaluating performance for ticker: {ticker}")

    # Get training and test data
    X_train, y_train = model_data["X_train"], model_data["y_train"]
    X_test, y_test = model_data["X_test"], model_data["y_test"]
    model = model_data["model"]
    scaler = model_data["scaler"]

    # Make predictions on the training set
    train_predictions = model.predict(X_train)
    train_predictions = scaler.inverse_transform(train_predictions.reshape(-1, 1))
    y_train = scaler.inverse_transform(y_train.reshape(-1, 1))
    train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
    train_rmse_results[ticker] = train_rmse

    # Make predictions on the test set
    test_predictions = model.predict(X_test)
    test_predictions = scaler.inverse_transform(test_predictions.reshape(-1, 1))
    y_test = scaler.inverse_transform(y_test.reshape(-1, 1))
    test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
    test_rmse_results[ticker] = test_rmse

    print(f"{ticker} Train RMSE: {train_rmse}, Test RMSE: {test_rmse}")

# Calculate dynamic thresholds based on test RMSE
mean_test_rmse = np.mean(list(test_rmse_results.values()))
std_test_rmse = np.std(list(test_rmse_results.values()))
dynamic_threshold = mean_test_rmse + std_test_rmse
print(f"Dynamic Threshold for Poor Models (Test RMSE): {dynamic_threshold}")

# Evaluate model performance based on test RMSE
for ticker, test_rmse in test_rmse_results.items():
    train_rmse = train_rmse_results[ticker]
    print(f"Evaluating {ticker}...")

    # Compare test RMSE with dynamic threshold
    if test_rmse > dynamic_threshold:
        print(f"{ticker}: Poor test performance (Test RMSE: {test_rmse} > Threshold: {dynamic_threshold}).")
    else:
        print(f"{ticker}: Test performance is satisfactory (Test RMSE: {test_rmse} ≤ Threshold: {dynamic_threshold}).")

    # Compare training and test RMSE to detect overfitting
    if train_rmse < test_rmse * 0.8:
        print(f"{ticker}: Potential overfitting detected (Train RMSE: {train_rmse} << Test RMSE: {test_rmse}).")
    elif test_rmse < train_rmse * 0.8:
        print(f"{ticker}: Test RMSE is well below train RMSE, which is unusual and worth checking (Test RMSE: {test_rmse} << Train RMSE: {train_rmse}).")
    else:
        print(f"{ticker}: Balanced model performance between train and test sets.")

# Determine the best-performing model based on test RMSE
best_ticker = min(test_rmse_results, key=test_rmse_results.get)
print(f"Best-performing ticker: {best_ticker} with Test RMSE: {test_rmse_results[best_ticker]}")

# Calculate naive baseline RMSE for each ticker
naive_rmse = {
    ticker: np.sqrt(mean_squared_error(df[ticker][1:], np.roll(df[ticker], 1)[1:]))
    for ticker in df.columns
}

# Compare naive baseline with model RMSE
for ticker, naive in naive_rmse.items():
    print(f"Ticker {ticker} Naive Baseline RMSE: {naive}")
    if test_rmse_results[ticker] < naive:
        print(f"{ticker}: Model outperforms naive baseline.")
    else:
        print(f"{ticker}: Model underperforms naive baseline.")

Evaluating performance for ticker: KO
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
KO Train RMSE: 0.747361909050187, Test RMSE: 0.7295664837262821
Evaluating performance for ticker: MCD
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 12ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step
MCD Train RMSE: 4.213083909450583, Test RMSE: 4.049792127940189
Evaluating performance for ticker: MOH
63/63 ━━━━━━━━━━━━━━━━━━━━ 1s 14ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
MOH Train RMSE: 7.636658834129813, Test RMSE: 7.910283573963914
Evaluating performance for ticker: NEE
63/63 ━━━━━━━━━━━━━━━━━━━━ 2s 12ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step
NEE Train RMSE: 1.3511589477073769, Test RMSE: 1.1472556443595467
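
The "dynamic threshold" rule used in the evaluation cell (mean plus one standard deviation of the test RMSE values) can be sketched on its own. The ticker names and RMSE values below are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Hypothetical test RMSE values (not the notebook's results)
test_rmse = {"AAA": 1.0, "BBB": 2.0, "CCC": 9.0}

values = np.array(list(test_rmse.values()))

# np.std defaults to the population standard deviation (ddof=0)
threshold = values.mean() + values.std()

# Tickers whose test RMSE exceeds the threshold are flagged as poor
flagged = [ticker for ticker, value in test_rmse.items() if value > threshold]

print(round(threshold, 3), flagged)
```

Since each RMSE is in the ticker's own price units, this rule tends to flag higher-priced tickers first. That should be kept in mind when reading the flags.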

## <mark>Part 3.3. Forecasting and saving results</mark>

### <mark>Part 3.3.1. Forecasting </mark>

Here we set the number of days to forecast. Since there are roughly 252 trading days in a year, multiplying 252 by the desired number of years gives the forecast horizon in days.

In [7]:

# Forecast future prices for each ticker
future_prices_dict = {}
future_days = 252*3  # Number of days to predict

for ticker, model_data in ticker_models.items():
    print(f"Forecasting future prices for ticker: {ticker}")

    # Extract model and scaler for the ticker
    model = model_data["model"]
    scaler = model_data["scaler"]

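    # Use the last `lookback` days of known data as the starting point.
    # Read the window size and feature count from this ticker's own training data
    # rather than from variables left over from the preprocessing loop.
    lookback = model_data["X_train"].shape[1]
    n_features = model_data["X_train"].shape[2]
    input_data = scaled_data[ticker][-lookback:].reshape(1, lookback, n_features)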
    future_prices = []

    for _ in range(future_days):
        future_price = model.predict(input_data)
        future_prices.append(future_price[0, 0])
        
        future_price_reshaped = future_price.reshape(1, 1, 1)
        input_data = np.append(input_data[:, 1:, :], future_price_reshaped, axis=1)

    # Reverse normalization to get the actual price scale
    future_prices_original_scale = scaler.inverse_transform(np.array(future_prices).reshape(-1, 1))
    future_prices_dict[ticker] = future_prices_original_scale.flatten()


Forecasting future prices for ticker: KO
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 52ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 45ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 66ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 55ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 50ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 47ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 48ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 63ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
[1m1/1[0m [32m━━━━━━

### Part 3.3.2. Saving predicted results to dataframe

In [8]:
# Create a pandas DataFrame from the future prices dictionary
df_forecast = pd.DataFrame(future_prices_dict)

# Generate a future date range starting from the last date in the original data.
# Note: this steps through calendar days, so weekends and holidays are included.
last_date = pd.to_datetime(df.index[-1])  # Assumes the data has a datetime index
future_dates = [last_date + pd.Timedelta(days=i + 1) for i in range(future_days)]

# Assign the future dates as the DataFrame index
df_forecast['Date'] = future_dates
df_forecast = df_forecast.set_index('Date')

# Add RMSE information to each column name
updated_columns = {
    ticker: f"{ticker}, RMSE: {round(test_rmse_results[ticker], 2)}"
    for ticker in df_forecast.columns
}
df_forecast = df_forecast.rename(columns=updated_columns)

# Ensure the columns and structure match the original data
df_forecast.index.name = "Date"
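
Because `future_days` counts trading days while the date index above advances one calendar day at a time, one possible alternative is a business-day index. The sketch below uses `pd.bdate_range`; the `last_date` value and the short horizon are hypothetical. Business days skip weekends only, so exchange holidays would still appear.

```python
import pandas as pd

# Hypothetical last date of the historical data (a Friday)
last_date = pd.Timestamp("2024-11-29")
future_days = 5  # short horizon, for illustration only

# Business-day range starting on the first weekday after `last_date`
future_dates = pd.bdate_range(
    start=last_date + pd.offsets.BDay(1),
    periods=future_days,
)

print(future_dates)
```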


### Part 3.3.3. Saving results to csv

In [9]:
save_csv(df_forecast, 'forecasted_data.csv')

'c:\\Users\\nikit\\Desktop\\Personal\\pythonLanguage\\portfolio_optimization_ml\\src\\data\\forecasted_data.csv'

## Part 3.4. Analysis

We could repeat every analysis step from the second part here, but there is little reason to. Models such as LSTMs are trained to minimize the mean squared error (MSE) on the training and validation data, which tends to produce conservative predictions. As a result, the forecasted series shows almost no volatility and its returns are close to zero, so return-based metrics computed from it would not tell us anything relevant. Instead, we keep the forecasted prices and use them to make additional assumptions about our portfolio in the last part.
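
The flattening can be illustrated with a toy version of the recursive forecasting loop, in which each prediction is fed back in as the newest input. The function `stand_in_predict` is a hypothetical placeholder that returns the window mean. It is not an LSTM and only mimics a smoothing predictor, so this is a sketch of the feedback mechanics under that assumption rather than a reproduction of the trained models.

```python
import numpy as np

def stand_in_predict(window):
    """Hypothetical stand-in for model.predict: mean of the window, shape (1, 1)."""
    return window.mean(axis=1)

lookback = 5
history = np.array([0.2, 0.5, 0.3, 0.6, 0.4])   # toy scaled prices
window = history.reshape(1, lookback, 1)          # [samples, time_steps, features]

forecast = []
for _ in range(200):
    next_value = stand_in_predict(window)
    forecast.append(float(next_value[0, 0]))
    # Drop the oldest step and append the prediction as the newest one
    window = np.append(window[:, 1:, :], next_value.reshape(1, 1, 1), axis=1)

forecast = np.array(forecast)
early_range = forecast[:10].max() - forecast[:10].min()
late_range = forecast[-10:].max() - forecast[-10:].min()
print(early_range, late_range)
```

After enough steps the window contains only the predictor's own outputs, so the spread of the late forecasts shrinks towards zero.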

### Part 3.4.1. Average prices

With this many tickers, plotting every series on a single line chart would not give a clear overall picture. Instead, we create a column that holds the average forecasted price across all tickers for each date.

In [10]:
avg_price_forecast = pd.DataFrame()
avg_price_forecast.index = df_forecast.index
avg_price_forecast['Average Price'] = df_forecast.mean(axis=1)
avg_price_forecast.tail()

Date,Average Price
2026-12-21,306.730377
2026-12-22,306.730286
2026-12-23,306.730133
2026-12-24,306.730042
2026-12-25,306.729919


We also export this dataframe so that the graph can be plotted later on.

In [11]:
save_csv(avg_price_forecast, 'avg_price_forecast.csv')

'c:\\Users\\nikit\\Desktop\\Personal\\pythonLanguage\\portfolio_optimization_ml\\src\\data\\avg_price_forecast.csv'