# Random Forest

Random Forest is a versatile, widely used machine learning algorithm for classification and regression. It works well on datasets with non-linear patterns, feature interactions, and a mix of categorical and numerical features. Unlike methods such as linear regression, which assume a linear relationship between features and target, Random Forest can model interactions between variables without specifying them in advance. This makes it suitable for applications such as predicting customer behavior, diagnosing diseases, or detecting fraud.

The method integrates several key components: Decision Trees, Bagging (Bootstrap Aggregating), and Ensemble Learning. Below is a brief overview of each component and their roles in the prediction process:

1. ___Decision Trees___
Random Forest is built upon decision trees, which are flowchart-like structures that split data into subsets based on feature values.
Each tree learns a series of decision rules to make predictions, but on its own, a single tree can be prone to overfitting.
<br><br>

2. ___Bagging (Bootstrap Aggregating)___
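Random Forest employs bagging, where multiple subsets of the training data are created by sampling with replacement (bootstrapping).
Each decision tree is trained on a different subset, which reduces variance and improves model robustness. Random Forest can additionally restrict each split to a random subset of features, which further decorrelates the trees.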
<br><br>

3. ___Ensemble Learning___
The power of Random Forest lies in combining the predictions of multiple decision trees.

    - For classification: Each tree votes for a class, and the majority vote is selected.
    - For regression: The average of all tree predictions is taken.
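
The two ideas above, bootstrap sampling and averaging, can be checked on a small synthetic problem. This is a minimal sketch with made-up data (the variable names and the data-generating function are illustrative assumptions): it shows that a bootstrap sample contains roughly 63% of the distinct rows, and that the prediction of a `RandomForestRegressor` matches the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Bagging: a bootstrap sample draws n rows with replacement,
# so only about 63% of the distinct rows appear in each sample
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)
unique_fraction = np.unique(bootstrap_idx).size / n
print(f"Distinct rows in one bootstrap sample: {unique_fraction:.1%}")

# Synthetic, non-linear regression problem (illustrative data only)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Ensemble: the forest prediction is the mean of the individual tree predictions
tree_preds = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X[:5])
print(np.allclose(manual_average, forest_pred))
```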

### 01. Import libraries

In [2]:
#pip install torch==2.2.1 torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118

In [5]:
# Check whether a GPU is available for PyTorch
# (scikit-learn's RandomForestRegressor itself runs on the CPU)
import torch

if torch.cuda.is_available():
    # Query the device only after confirming that CUDA is available
    print(torch.cuda.device_count())
    print(torch.cuda.get_device_name(0))

    device = torch.device('cuda')  # Use the first available GPU
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")

    # Create new tensors on the GPU by default
    # (torch.set_default_tensor_type is deprecated in recent PyTorch versions)
    torch.set_default_device(device)
else:
    print("No GPU available, using CPU.")
    device = torch.device('cpu')

1
NVIDIA GeForce RTX 3050 Laptop GPU
Using GPU: NVIDIA GeForce RTX 3050 Laptop GPU


In [6]:
# Data manipulation and analysis
import numpy as np
import pandas as pd
import itertools
from math import sqrt

# Graph
import plotly.graph_objs as go
from matplotlib import pyplot as plt
import seaborn as sns

# Random Forest
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Data source
import yfinance as yf

# Create features (datetime)
from fastai.tabular.core import add_datepart

In [7]:
# Disable Warnings
import warnings
warnings.filterwarnings('ignore')

### 02. Import data

__yfinance__ is a Python library that simplifies access to financial data from Yahoo Finance, making it easy to download historical and current market data.

In [8]:
# Download 10 years of daily data for the EUR/USD exchange rate
ticker = 'EURUSD=X'
df = yf.download(tickers=ticker, period='10y', interval='1d')

# Move the index to a column
df = df.reset_index()

# Drop the MultiIndex level (keep the first level only)
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.droplevel(1)

# Replace blank space in the name of the columns
df.columns = df.columns.str.lower().str.replace(' ', '_')

df.head()

[*********************100%***********************]  1 of 1 completed


Price,date,close,high,low,open,volume
0,2015-01-05,1.194643,1.19759,1.188909,1.1955,0
1,2015-01-06,1.193902,1.197,1.188693,1.19383,0
2,2015-01-07,1.187536,1.19,1.180401,1.187479,0
3,2015-01-08,1.1836,1.184806,1.175601,1.183894,0
4,2015-01-09,1.179607,1.18383,1.176831,1.179426,0
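
The column clean-up in the cell above can be tried offline. The sketch below builds a small frame by hand with two-level columns, which is the layout the `isinstance(df.columns, pd.MultiIndex)` check anticipates. The level names `Price` and `Ticker` and the two sample rows are assumptions made for illustration, not output captured from `yf.download`.

```python
import pandas as pd

# Hand-built frame with two-level columns (field, ticker); layout is an assumption
columns = pd.MultiIndex.from_product(
    [['Close', 'High', 'Low', 'Open', 'Volume'], ['EURUSD=X']],
    names=['Price', 'Ticker'],
)
raw = pd.DataFrame(
    [[1.194643, 1.19759, 1.188909, 1.1955, 0],
     [1.193902, 1.197, 1.188693, 1.19383, 0]],
    index=pd.DatetimeIndex(['2015-01-05', '2015-01-06'], name='Date'),
    columns=columns,
)

# Same steps as in the notebook cell
flat = raw.reset_index()
if isinstance(flat.columns, pd.MultiIndex):
    flat.columns = flat.columns.droplevel(1)  # drop the ticker level
flat.columns = flat.columns.str.lower().str.replace(' ', '_')

print(flat.columns.tolist())
```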


In [9]:
# Declare figure
fig = go.Figure()

# Candlestick (use the 'date' column on the x-axis; the index is a plain integer range after reset_index)
fig.add_trace(go.Candlestick(x=df['date'],
                open=df['open'],
                high=df['high'],
                low=df['low'],
                close=df['close'], name='market data'))

# Add titles
fig.update_layout(
    title='EUR/USD exchange rate',
    yaxis_title='Price (USD per EUR)')

# X-Axes
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=15, label="15d", step="day", stepmode="backward"),
            dict(count=45, label="45d", step="day", stepmode="backward"),
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=3, label="3m", step="month", stepmode="backward"),
            dict(step="all")
        ])
    )
)

# Show
fig.show()

### 03. Feature engineering

Feature engineering involves transforming, creating, and selecting the most relevant features from the original data to enhance the accuracy and efficiency of predictive models.

In this case, the target is the 'close' column, so feature engineering is limited to a few derived price features and calendar features.

In [10]:
# Calculate the daily change in the closing price
# (the first row has no previous close, so its change is set to 0)
df['close_change'] = df['close'].diff().fillna(0)

# Calculate the intraday return (from open to close)
df['return'] = (df['close'] - df['open']) / df['open']

df.head()

Price,date,close,high,low,open,volume,close_change,return
0,2015-01-05,1.194643,1.19759,1.188909,1.1955,0,0.0,-0.000717
1,2015-01-06,1.193902,1.197,1.188693,1.19383,0,-0.000742,6e-05
2,2015-01-07,1.187536,1.19,1.180401,1.187479,0,-0.006366,4.7e-05
3,2015-01-08,1.1836,1.184806,1.175601,1.183894,0,-0.003936,-0.000249
4,2015-01-09,1.179607,1.18383,1.176831,1.179426,0,-0.003993,0.000153


In [11]:
# Ensure the 'date' column is the index
df = df.set_index('date')

# Make the DataFrame have daily frequency (including weekends)
df = df.asfreq('D')

# Forward fill missing data (weekends will be filled with the previous available data)
df = df.ffill(downcast='infer')

# Sort by date to ensure it's in chronological order
df = df.sort_index()

# Reset the index if you want to move 'date' back to a column
df = df.reset_index()

df.head()

Price,date,close,high,low,open,volume,close_change,return
0,2015-01-05,1.194643,1.19759,1.188909,1.1955,0,0.0,-0.000717
1,2015-01-06,1.193902,1.197,1.188693,1.19383,0,-0.000742,6e-05
2,2015-01-07,1.187536,1.19,1.180401,1.187479,0,-0.006366,4.7e-05
3,2015-01-08,1.1836,1.184806,1.175601,1.183894,0,-0.003936,-0.000249
4,2015-01-09,1.179607,1.18383,1.176831,1.179426,0,-0.003993,0.000153
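
The effect of `asfreq('D')` followed by a forward fill is easier to see on two rows. The sketch below uses a Friday and the following Monday with made-up prices (the values are assumptions for illustration): the weekend rows are created as missing and then filled with Friday's value.

```python
import pandas as pd

# A Friday and the following Monday, with illustrative prices
prices = pd.DataFrame(
    {'close': [1.10, 1.20]},
    index=pd.to_datetime(['2015-01-09', '2015-01-12']),
)

daily = prices.asfreq('D')  # inserts Saturday and Sunday as NaN rows
filled = daily.ffill()      # weekend rows take Friday's value

print(filled)
```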


In [12]:
# Keep a copy of the date column to use as the index later
df['date_temp'] = df['date']

# Add date parts
df = add_datepart(df, 'date')
df.drop('Elapsed', axis=1, inplace=True)
df['mon_fri'] = ((df['Dayofweek'] == 0) | (df['Dayofweek'] == 4)).astype(int)

# Replace blank space in the name of the columns
df.columns = df.columns.str.lower().str.replace(' ','_')

# Convert all boolean columns to 1 and 0
boolean_columns = ['is_month_end', 'is_month_start', 'is_quarter_end', 'is_quarter_start', 'is_year_end', 'is_year_start'] 
df[boolean_columns] = df[boolean_columns].astype(int)

# Set the saved date column as the index
df.set_index('date_temp', inplace=True)

df.head()

date_temp,close,high,low,open,volume,close_change,return,year,month,week,day,dayofweek,dayofyear,is_month_end,is_month_start,is_quarter_end,is_quarter_start,is_year_end,is_year_start,mon_fri
2015-01-05,1.194643,1.19759,1.188909,1.1955,0,0.0,-0.000717,2015,1,2,5,0,5,0,0,0,0,0,0,1
2015-01-06,1.193902,1.197,1.188693,1.19383,0,-0.000742,6e-05,2015,1,2,6,1,6,0,0,0,0,0,0,0
2015-01-07,1.187536,1.19,1.180401,1.187479,0,-0.006366,4.7e-05,2015,1,2,7,2,7,0,0,0,0,0,0,0
2015-01-08,1.1836,1.184806,1.175601,1.183894,0,-0.003936,-0.000249,2015,1,2,8,3,8,0,0,0,0,0,0,0
2015-01-09,1.179607,1.18383,1.176831,1.179426,0,-0.003993,0.000153,2015,1,2,9,4,9,0,0,0,0,0,0,1
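
For readers without fastai installed, most of these calendar columns can be approximated with the pandas `.dt` accessor. This is a sketch of equivalent features, not fastai's implementation of `add_datepart`; the column names are chosen to mirror the table above.

```python
import pandas as pd

dates = pd.DataFrame({'date': pd.to_datetime(['2015-01-05', '2015-01-06', '2015-01-09'])})
d = dates['date'].dt

parts = pd.DataFrame({
    'year': d.year,
    'month': d.month,
    'week': d.isocalendar().week.astype(int),  # ISO week number
    'day': d.day,
    'dayofweek': d.dayofweek,                  # Monday = 0
    'dayofyear': d.dayofyear,
    'is_month_end': d.is_month_end.astype(int),
    'is_month_start': d.is_month_start.astype(int),
})

# Flag Mondays and Fridays, as in the notebook cell
parts['mon_fri'] = parts['dayofweek'].isin([0, 4]).astype(int)
parts.index = dates['date']

print(parts)
```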


### 04. Pre-processing

Unlike most machine learning problems, time series require special care when splitting the data: __it is crucial to ensure that partitions respect the temporal order of the data__, so the test set comes after the training set and the rows are never shuffled.
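
A minimal sketch of two order-preserving splits in scikit-learn, on a toy array of 20 time-ordered observations (the data is illustrative): a single chronological hold-out with `shuffle=False`, and expanding-window folds with `TimeSeriesSplit`. In both cases every training index precedes every test index.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations
y = np.arange(20)

# Single chronological hold-out: the last 20% of the rows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(f"last train step: {y_train[-1]}, first test step: {y_test[0]}")

# Expanding-window cross-validation: each test fold comes after its training fold
tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, test_idx in folds:
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```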

In [20]:
# Create lagged columns to prepare supervised data
def create_lagged_features(df, lag=10):
    # Collect all shifted columns first and concatenate once,
    # which avoids fragmenting the DataFrame when there are many lags
    lagged_columns = {
        f'{column}_lag_{i}': df[column].shift(i)
        for column in df.columns
        for i in range(1, lag + 1)
    }
    return pd.concat(lagged_columns, axis=1)

In [29]:
# Create a copy of the dataframe
data = df.copy()

# Select the columns to be used for lags (you can add more variables)
columns_to_lag = [col for col in data.columns if col != 'close']

# Create lagged features
lag = 180  # Generate lags from 1 to 180
lagged_features = create_lagged_features(data[columns_to_lag], lag=lag)
data_with_lags = pd.concat([data[['close']], lagged_features], axis=1).dropna()
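
What the lagging step produces is easier to see on a toy frame. In the sketch below, `make_lags` is a hypothetical helper that mirrors `create_lagged_features`, and the five rows of data are made up. With two lags, the first two rows lack a complete history and are removed by `dropna()`.

```python
import pandas as pd

def make_lags(frame, lag):
    # Hypothetical helper mirroring create_lagged_features
    cols = {
        f'{c}_lag_{i}': frame[c].shift(i)
        for c in frame.columns
        for i in range(1, lag + 1)
    }
    return pd.concat(cols, axis=1)

toy = pd.DataFrame({
    'close': [10.0, 11.0, 12.0, 13.0, 14.0],
    'open': [9.0, 10.0, 11.0, 12.0, 13.0],
})

# Lag every column except the target, then attach the target
lags = make_lags(toy[['open']], lag=2)
supervised = pd.concat([toy[['close']], lags], axis=1).dropna()

print(supervised)
```

Each remaining row pairs the current `close` with values of `open` from the previous one and two rows, so no feature uses information from the same row as the target.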

In [30]:
# Separate features (X) and target (y)
X = data_with_lags.iloc[:, 1:]
y = data_with_lags['close']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

### 05. Modeling

In this step, we apply the Random Forest algorithm to predict the target variable. A Random Forest Regressor is an ensemble learning method that builds multiple decision trees and combines their predictions to produce a final output. The model is trained using the input features (X_train) and the corresponding target values (y_train). The process includes the following steps:

- __Model Initialization:__ A RandomForestRegressor is initialized with 100 decision trees (n_estimators=100) and a fixed random seed (random_state=42) to ensure reproducibility.
- __Training:__ The model is trained using the training dataset (X_train, y_train), where the algorithm learns to predict the target based on the features provided.

In [31]:
# Train the model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
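
After fitting, a forest exposes `feature_importances_`, which can help show which inputs drive the predictions. The sketch below uses synthetic data in which only the first feature matters (an assumption built into the example), so that feature should receive nearly all of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=400)  # only feature 0 matters

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = model.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```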

In [32]:
# Make predictions on the test set
y_pred_test = model.predict(X_test)

# Evaluate the model performance using multiple metrics

# 1. RMSE (Root Mean Squared Error)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
print(f"RMSE on the test set: {rmse}")

# 2. MAE (Mean Absolute Error)
mae = mean_absolute_error(y_test, y_pred_test)
print(f"MAE on the test set: {mae}")

# 3. R² (Coefficient of Determination)
r2 = r2_score(y_test, y_pred_test)
print(f"R² on the test set: {r2}")

RMSE on the test set: 0.0029244830966281294
MAE on the test set: 0.0021975593138084154
R² on the test set: 0.9724765322882558
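
One way to put error metrics like these in context is a persistence baseline, which predicts that each value equals the previous one. The sketch below computes the same three metrics for that baseline on a synthetic random walk; the series, its starting level, and its step size are assumptions for illustration, not the EUR/USD data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic random walk standing in for a slowly moving price series
rng = np.random.default_rng(0)
series = 1.1 + np.cumsum(rng.normal(scale=0.005, size=1000))

# Persistence baseline on the last 200 points: predict y[t] with y[t-1]
actual = series[-200:]
naive_pred = series[-201:-1]

rmse_naive = np.sqrt(mean_squared_error(actual, naive_pred))
mae_naive = mean_absolute_error(actual, naive_pred)
r2_naive = r2_score(actual, naive_pred)

print(f"Baseline RMSE: {rmse_naive}")
print(f"Baseline MAE: {mae_naive}")
print(f"Baseline R²: {r2_naive}")
```

Applying the same comparison to the real test set would show how much the model improves on simply repeating the previous value.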


In [40]:
# Define prediction horizons
prediction_horizons = [3, 7, 10]

# Last row of features from the test set
last_known_data = X_test.iloc[-1].values.reshape(1, -1)

# Dictionary to store predictions for each horizon
future_predictions_dict = {horizon: [] for horizon in prediction_horizons}

# Predictions for multiple timeframes
for horizon in prediction_horizons:
    temp_last_known_data = last_known_data.copy()
    for i in range(horizon):
        try:
            next_pred = model.predict(temp_last_known_data)[0]
        except Exception as e:
            print(f"Error during prediction: {e}")
            break
        
        # Store the prediction
        future_predictions_dict[horizon].append(next_pred)
        
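        # Shift the feature vector one position to the left and write the new
        # prediction into the last slot. Note: the vector holds lag_1..lag_180
        # of every feature column side by side, so this shift also moves values
        # across column boundaries, and the last slot is the oldest lag of the
        # last feature rather than a lag of 'close'. The forecasts below are
        # therefore only a rough illustration.
        temp_last_known_data = np.roll(temp_last_known_data, -1, axis=1)
        temp_last_known_data[0, -1] = next_pred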

# Display the results
for horizon, predictions in future_predictions_dict.items():
    print(f"Predictions for {horizon} days into the future: {predictions}")

Predictions for 3 days into the future: [1.0256990134716033, 1.0416873443126677, 1.04152228474617]
Predictions for 7 days into the future: [1.0256990134716033, 1.0416873443126677, 1.04152228474617, 1.044283241033554, 1.044833381175995, 1.0433479356765747, 1.0431839942932128]
Predictions for 10 days into the future: [1.0256990134716033, 1.0416873443126677, 1.04152228474617, 1.044283241033554, 1.044833381175995, 1.0433479356765747, 1.0431839942932128, 1.0423766934871674, 1.0400027108192444, 1.0425498747825623]
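
Recursive multi-step forecasting is simplest to follow when the features are only lags of the target itself, because each new prediction then has an unambiguous place in the next feature row. The sketch below shows that simplified setting on a synthetic noisy sine wave. The series, the `n_lags` value, and the `recursive_forecast` helper are assumptions for illustration and differ from the multi-feature setup used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic series (a noisy sine wave), not the EUR/USD data
rng = np.random.default_rng(0)
series = np.sin(np.arange(300) / 10) + rng.normal(scale=0.05, size=300)

n_lags = 5
# Each row holds [y[t-1], y[t-2], ..., y[t-n_lags]]; the target is y[t]
X = np.column_stack(
    [series[n_lags - k: len(series) - k] for k in range(1, n_lags + 1)]
)
y = series[n_lags:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def recursive_forecast(model, history, n_lags, horizon):
    # Hypothetical helper: feed each prediction back in as the newest lag
    window = list(history[-n_lags:][::-1])  # most recent value first
    preds = []
    for _ in range(horizon):
        next_pred = model.predict(np.array(window).reshape(1, -1))[0]
        preds.append(next_pred)
        # The new prediction becomes lag 1; the oldest lag drops out
        window = [next_pred] + window[:-1]
    return preds

forecast = recursive_forecast(model, series, n_lags, horizon=10)
short_forecast = recursive_forecast(model, series, n_lags, horizon=3)
print(forecast)
```

Because every step starts from the same history, a shorter horizon reproduces the first values of a longer one, which matches the pattern in the printed predictions above. Errors can accumulate as predictions are fed back in, so longer horizons are usually less reliable.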


In [51]:
# Create a range of dates for the upcoming days
future_dates = pd.date_range(start=data.index[-1], periods=max(prediction_horizons) + 1, freq='D')[1:]

# Historical data traces
trace1 = go.Scatter(
    x=data.index,
    y=data['close'],
    mode='lines',
    name='Historical Data'
)

# Predictions on the test set
trace2 = go.Scatter(
    x=y_test.index,
    y=y_pred_test,
    mode='lines',
    line=dict(dash='dash'),
    name='Predictions on Test'
)

# Future predictions for different horizons
future_traces = []

for idx, horizon in enumerate(prediction_horizons):
    future_traces.append(
        go.Scatter(
            x=future_dates[:horizon],
            y=future_predictions_dict[horizon],
            mode='lines',
            name=f'Predictions for {horizon} Days',
        )
    )

# Predictions on the training set
y_pred_train = model.predict(X_train)

trace3 = go.Scatter(
    x=y_train.index,
    y=y_pred_train,
    mode='lines',
    line=dict(dash='dot'),
    name='Predictions on Train'
)

# Combine all traces
data_x = [trace1, trace2, trace3] + future_traces

# Layout
layout = go.Layout(
    title='Random Forest: Time Series Prediction',
    xaxis=dict(title='Date'),
    yaxis=dict(title='Closing Price'),
    hovermode='x unified'
)

# Create the figure
fig = go.Figure(data=data_x, layout=layout)

# X-axis configuration
fig.update_xaxes(
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=15, label="15d", step="day", stepmode="backward"),
            dict(count=45, label="45d", step="day", stepmode="backward"),
            dict(count=1, label="1m", step="month", stepmode="backward"),
            dict(count=3, label="3m", step="month", stepmode="backward"),
            dict(step="all")
        ])
    )
)

# Show the figure
fig.show()

### 06. Model evaluation and interpretation

__Interpretation__

- RMSE (Root Mean Squared Error):
The RMSE of 0.0029 is the square root of the mean squared difference between the predicted and actual values in the test set, expressed in the same units as the target (USD per EUR). Lower values mean better performance. Because the errors are squared before averaging, large errors weigh more heavily.

- MAE (Mean Absolute Error):
The MAE of 0.0022 is the average absolute difference between the predicted and actual values. As with RMSE, lower values indicate better accuracy. Because MAE does not square the errors, it is less sensitive to outliers than RMSE.

- R² (Coefficient of Determination):
The R² of 0.9725 means that the model explains 97.25% of the variance of the target in the test set. A value close to 1 indicates that the predictions track the actual values closely.
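
The three metrics can be written out directly from their definitions. The sketch below uses a handful of made-up values (assumptions for illustration) and checks that the manual formulas agree with the scikit-learn functions used earlier.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative values, not taken from the model above
y_true = np.array([1.10, 1.12, 1.08, 1.15, 1.11])
y_pred = np.array([1.11, 1.10, 1.09, 1.13, 1.12])

errors = y_true - y_pred

# RMSE: square root of the mean squared error
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors
mae_manual = np.mean(np.abs(errors))
# R²: 1 minus (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```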


__Conclusion__

The model performs well on the test set according to the metrics above:
- Low error values (RMSE and MAE) indicate that the one-step-ahead predictions deviate little from the true values.
- The high R² value indicates that the model captures most of the variance of the target in the test period.

Overall, the model fits this dataset well for next-day prediction. These metrics do not measure the multi-day forecasts, so those should be evaluated separately before being relied on.

### End