# Machine Learning Model Implementation and Evaluation
## 1. Introduction

In this notebook, we implemented and evaluated two machine learning models, Linear Regression and Random Forest Regressor, to predict the next day's closing price of a selected stock. The primary goal was to compare the two models on a held-out test split using four regression metrics (MSE, MAE, R-squared, and Explained Variance Score) and then with 10-fold cross-validation.

## 2. Data Loading and Preprocessing

In [97]:
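import pandas as pd
import os

# List the processed market-data files and load the first one.
# Note: os.listdir returns names in arbitrary order, so "first" is not guaranteed to be stable.
data_dir = "../data/processed/market data/"
file_names = os.listdir(data_dir)
data = pd.read_csv(os.path.join(data_dir, file_names[0]))
data.head()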


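| Unnamed: 0.1 | Unnamed: 0 | Date | Open | High | Low | Close | Adj Close | Volume | MA_30 | MA_100 | Daily Returns | Log Returns | Cumulative Log Returns | Cumulative Returns | Cumulative Percentage Return | Rolling Mean | Rolling Std | Volatility 30 | Volatility 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2000-01-03 | 0.936384 | 1.004464 | 0.907924 | 0.999442 | 0.844981 | 535796800 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2000-01-04 | 0.966518 | 0.987723 | 0.90346 | 0.915179 | 0.773741 | 512377600 | NaN | NaN | -0.08431 | -0.088077 | -0.088077 | -0.08431 | -8.431001 | NaN | NaN | NaN | NaN |
| 2 | 2 | 2000-01-05 | 0.926339 | 0.987165 | 0.919643 | 0.928571 | 0.785063 | 778321600 | NaN | NaN | 0.014633 | 0.014527 | -0.07355 | -0.070911 | -7.091056 | NaN | NaN | NaN | NaN |
| 3 | 3 | 2000-01-06 | 0.947545 | 0.955357 | 0.848214 | 0.848214 | 0.717125 | 767972800 | NaN | NaN | -0.086538 | -0.090514 | -0.164064 | -0.151312 | -15.131245 | NaN | NaN | NaN | NaN |
| 4 | 4 | 2000-01-07 | 0.861607 | 0.901786 | 0.852679 | 0.888393 | 0.751094 | 460734400 | NaN | NaN | 0.047369 | 0.046281 | -0.117783 | -0.111111 | -11.1111 | NaN | NaN | NaN | NaN |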


In [98]:
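# Drop leftover index columns from earlier CSV round-trips
# (errors='ignore' skips any name that is not present)
data = data.drop(columns=['Unnamed: 0.1', 'Unnamed: 0'], errors='ignore')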


## 3. Feature Selection and Target Variable Preparation

In [99]:
# Prepare features and target variable
data.dropna(inplace=True)
X = data[['Open', 'High', 'Low', 'Volume', 'MA_30', 'MA_100', 'Volatility 30', 'Volatility 100', 'Rolling Std']]
y = data['Close'].shift(-1).dropna()
X = X.iloc[:-1]  # Align X with y


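We selected features from the dataset that are likely to influence the stock's closing price: the day's prices and volume, the moving averages, and the volatility measures. Rows with missing values are dropped first, which removes the early rows where the rolling indicators are not yet defined. The target variable y is the closing price shifted back by one row, so each row's features are paired with the next day's close. The final row has no next-day close, so it is removed from both X and y, which keeps the two the same length and on the same dates.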
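The shift-and-trim step can be checked on a tiny example. This is a minimal sketch using made-up prices (the values and the names `toy`, `X_toy`, and `y_toy` are illustrative and not part of the notebook's data):

```python
import pandas as pd

# Toy closing prices (illustrative values, not from the dataset)
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})

# Target: the next row's close; the last row has no "next day" and becomes NaN
y_toy = toy["Close"].shift(-1).dropna()

# Features: drop the last row so X and y cover the same days
X_toy = toy[["Close"]].iloc[:-1]

# Same length and same row labels, so row i of X pairs with the close on day i + 1
assert len(X_toy) == len(y_toy)
assert list(X_toy.index) == list(y_toy.index)

print(pd.concat([X_toy, y_toy.rename("Next Close")], axis=1))
```

Each printed row shows a day's close next to the following day's close, which is the pairing the models are trained on.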

## 4. Model Training

In [102]:
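from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Train Linear Regression and Random Forest models
model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Random 80/20 split (train_test_split shuffles rows by default, so this is not a chronological split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Predict on the held-out 20%
y_pred = model.predict(X_test)
y_pred_rf = rf_model.predict(X_test)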



## 5. Model Evaluation

In [106]:
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             explained_variance_score, r2_score)
# Evaluate models
mse = mean_squared_error(y_test, y_pred)
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2 = r2_score(y_test, y_pred)
r2_rf = r2_score(y_test, y_pred_rf)
mae = mean_absolute_error(y_test, y_pred)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
evs = explained_variance_score(y_test, y_pred)
evs_rf = explained_variance_score(y_test, y_pred_rf)

print("Linear Regression Results:")
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
print(f"Mean Absolute Error: {mae}")
print(f"Explained Variance Score: {evs}\n")

print("Random Forest Results:")
print(f"Mean Squared Error: {mse_rf}")
print(f"R-squared: {r2_rf}")
print(f"Mean Absolute Error: {mae_rf}")
print(f"Explained Variance Score: {evs_rf}")


Linear Regression Results:
Mean Squared Error: 1.6803818038862606
R-squared: 0.999461407610897
Mean Absolute Error: 0.5595357140026218
Explained Variance Score: 0.9994618876988046

Random Forest Results:
Mean Squared Error: 1.8963888094472183
R-squared: 0.9993921735065291
Mean Absolute Error: 0.6074940994073352
Explained Variance Score: 0.9993924256488199


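We evaluated the models using Mean Squared Error (MSE), R-squared, Mean Absolute Error (MAE), and Explained Variance Score. MSE and MAE measure error magnitude: MSE is in squared price units and penalizes large errors more heavily, while MAE is in price units. R-squared and Explained Variance Score measure the proportion of the target's variance that the model accounts for. On this random 80/20 split, Linear Regression is slightly ahead of Random Forest on all four metrics.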
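To make the four metrics concrete, here is a minimal sketch that computes them directly with NumPy, following their standard definitions. The arrays `y_true` and `y_hat` are made-up illustrative values, not results from this notebook:

```python
import numpy as np

# Illustrative values only (not taken from the notebook's data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

err = y_true - y_hat

# Error-magnitude metrics
mse_demo = np.mean(err ** 2)        # squared units, dominated by large errors
mae_demo = np.mean(np.abs(err))     # same units as the target

# Variance-explained metrics
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot                    # R-squared
evs_demo = 1.0 - np.var(err) / np.var(y_true)      # ignores a constant bias in the errors

print(f"MSE={mse_demo}, MAE={mae_demo}, R2={r2_demo}, EVS={evs_demo}")
```

R-squared and Explained Variance Score coincide when the errors average to zero; a gap between them points to a systematic bias in the predictions.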

## 6. Cross-Validation

In [107]:
from sklearn.model_selection import cross_val_score
cv_scores = -cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
cv_scores_rf = -cross_val_score(rf_model, X, y, cv=10, scoring='neg_mean_squared_error')

print("Cross-Validated MSE Scores Linear Regression:", cv_scores)
print("Cross-Validated MSE Scores Random Forest:", cv_scores_rf)

Cross-Validated MSE Scores Linear Regression: [3.92362516e-02 7.94197211e-04 5.38546126e-03 2.20652034e-02
 6.32716973e-02 1.15627701e-01 1.90378676e-01 6.20210583e-01
 5.61834984e+00 9.37999850e+00]
Cross-Validated MSE Scores Random Forest: [2.24063258e-02 2.05822152e-02 9.02179240e-02 2.10732683e-01
 2.49950029e+00 1.05475811e+00 1.26737952e+00 2.65679644e+01
 1.45778039e+02 1.95644960e+02]


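To check the robustness of our models, we performed 10-fold cross-validation. This estimates how each model generalizes to data it was not trained on and so helps reveal overfitting that a single split can hide. With an integer cv, scikit-learn uses consecutive, unshuffled folds for regressors, so each fold here is a contiguous block of rows. Fold MSE rises sharply in the later folds for both models, far more for Random Forest (up to about 195.6) than for Linear Regression (up to about 9.38).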
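For time-ordered data, one commonly used alternative is a forward-chaining split in which every test block comes after its training data. The following is a minimal sketch of that idea using scikit-learn's `TimeSeriesSplit`; it was not part of the original analysis, and the random-walk series `prices` and the names `X_demo`, `y_demo`, and `ts_mse` are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)

# Synthetic random-walk "prices" standing in for the real data (assumption)
prices = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=500))
X_demo = prices[:-1].reshape(-1, 1)   # today's price as the only feature
y_demo = prices[1:]                   # next day's price as the target

# Each split trains on an initial segment and tests on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
ts_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=tscv, scoring="neg_mean_squared_error")

# Confirm that no test row precedes any training row
for train_idx, test_idx in tscv.split(X_demo):
    assert train_idx.max() < test_idx.min()

print("Forward-chaining MSE per fold:", ts_mse)
```

Passing the splitter object as `cv` is the only change from the `cross_val_score` call used above, so the same pattern would apply to `model` and `rf_model` with the notebook's `X` and `y`.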

## 7. Conclusion
### Linear Regression:

- Performance: On the random 80/20 test split, the Linear Regression model achieved an R-squared of about 0.9995, with an MSE of about 1.68 and an MAE of about 0.56.
- Cross-Validation: Fold MSE ranged from about 0.0008 to 9.38 (mean about 1.61) and grew in the later folds, indicating that performance is sensitive to which block of data is held out.

### Random Forest:

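- Performance: The Random Forest model also performed well on the test split, though slightly less accurately than Linear Regression (MSE of about 1.90 versus 1.68, MAE of about 0.61 versus 0.56).
- Cross-Validation: Fold MSE varied far more than for Linear Regression, from about 0.02 to 195.6 (mean about 37.3). A likely cause is that tree ensembles cannot predict values outside the range of targets seen during training, which hurts when a held-out block contains price levels that are absent from the other folds.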
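The spread of the fold scores can be summarized directly from the cross-validation output in Section 6. A minimal sketch, with the fold values copied from that output (the names `cv_lr` and `cv_rf` are introduced here for illustration):

```python
import numpy as np

# Fold MSE values copied from the cross-validation output in Section 6
cv_lr = np.array([3.92362516e-02, 7.94197211e-04, 5.38546126e-03, 2.20652034e-02,
                  6.32716973e-02, 1.15627701e-01, 1.90378676e-01, 6.20210583e-01,
                  5.61834984e+00, 9.37999850e+00])
cv_rf = np.array([2.24063258e-02, 2.05822152e-02, 9.02179240e-02, 2.10732683e-01,
                  2.49950029e+00, 1.05475811e+00, 1.26737952e+00, 2.65679644e+01,
                  1.45778039e+02, 1.95644960e+02])

# Mean, spread, and range of the per-fold MSE for each model
for name, scores in [("Linear Regression", cv_lr), ("Random Forest", cv_rf)]:
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}, "
          f"min={scores.min():.4f}, max={scores.max():.4f}")
```

Reporting the mean and standard deviation alongside the raw fold scores makes the difference in stability between the two models easier to compare at a glance.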

### Final Thoughts:

On the random test split, both models demonstrated strong predictive performance, with Linear Regression slightly ahead of Random Forest. Under 10-fold cross-validation the gap was much larger, again in favor of Linear Regression. However, it is essential to consider the use case and data characteristics when choosing between these models. In scenarios with more complex, non-linear relationships, Random Forest may outperform Linear Regression.