# Ethereum Price Prediction with ARIMA, LSTM and Random Forest

This notebook implements your self-directed project:

- **Objective:** Build a scalable predictive modeling framework to forecast Ethereum closing prices.
- **Data:** Daily ETH-USD prices (2500+ records).
- **Models:** ARIMA, LSTM, Decision Tree and Random Forest.
- **Metrics:** RMSE and MAPE to compare short-term and long-term forecasting performance.
- **Outcome:** A reusable 30-day forecasting pipeline and visualizations for ETH trading insights.


## 1. Setup

Uncomment and run the cell below if you are on a fresh environment (e.g. Google Colab). Leave it commented out on your local machine if you already have the libraries installed.

In [None]:
# !pip install yfinance statsmodels scikit-learn tensorflow matplotlib pandas numpy

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import MinMaxScaler

import yfinance as yf
from statsmodels.tsa.arima.model import ARIMA

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

plt.rcParams['figure.figsize'] = (10, 4)

### Helper functions

We define evaluation metrics, a function to create supervised sequences from time series data, and some plotting utilities.

In [None]:
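def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (in %).
    Adds a small epsilon to the absolute true value to avoid division by zero.
    """
    # Flatten both inputs so (n, 1) and (n,) arrays cannot broadcast into an (n, n) matrix
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    eps = 1e-8
    return np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + eps)) * 100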


def evaluate_regression(y_true, y_pred, name="model"):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    mape_val = mape(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"{name} -> RMSE: {rmse:.4f}, MAE: {mae:.4f}, MAPE: {mape_val:.2f}%, R2: {r2:.4f}")
    return {"model": name, "rmse": rmse, "mae": mae, "mape": mape_val, "r2": r2}


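def create_sequences(series, window_size=60):
    """Create (X, y) pairs using a rolling window over the series.
    series is expected to be a 2D array of shape (n_samples, n_features).
    Returns X of shape (n_windows, window_size, n_features) and
    y of shape (n_windows, n_features), where y is the row right after each window.
    """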
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)


def plot_predictions(dates, y_true, y_pred, title):
    plt.figure()
    plt.plot(dates, y_true, label="True")
    plt.plot(dates, y_pred, label="Predicted")
    plt.title(title)
    plt.xlabel("Date")
    plt.ylabel("ETH Close Price (USD)")
    plt.legend()
    plt.tight_layout()
    plt.show()
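
As a quick sanity check of the helpers above, the sketch below runs the same rolling-window logic on a toy series and computes RMSE and MAPE on two hand-checkable values. The function name `make_windows` and all the numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def make_windows(series, window_size):
    """Same rolling-window logic as create_sequences, repeated so this block runs on its own."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i : i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)

# Toy series 0..9 with shape (10, 1), standing in for the scaled closing prices
toy = np.arange(10, dtype=float).reshape(-1, 1)
X_toy, y_toy = make_windows(toy, window_size=3)

print(X_toy.shape, y_toy.shape)    # (7, 3, 1) (7, 1)
print(X_toy[0].ravel(), y_toy[0])  # [0. 1. 2.] [3.]

# Errors of 10 on a true value of 100 and of 20 on a true value of 200
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])
rmse_toy = np.sqrt(np.mean((y_true - y_pred) ** 2))          # sqrt(250)
mape_toy = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # 10% on both points
print(round(rmse_toy, 3), round(mape_toy, 3))
```

A series of length 10 with a window of 3 yields 7 windows, which is the `len(series) - window_size` count used in the loop.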

## 2. Data Collection & Preprocessing

- We use **yfinance** to download daily ETH-USD closing prices.
- Time range is chosen to have 2500+ observations.
- We keep the `Close` column and handle missing values if any.

In [None]:
symbol = "ETH-USD"
start_date = "2015-01-01"  # early enough to get 2500+ daily records

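eth_df = yf.download(symbol, start=start_date)
eth_df = eth_df.sort_index()

# Recent yfinance versions may return MultiIndex columns (field, ticker);
# flatten them so that eth_df["Close"] is a plain Series.
if isinstance(eth_df.columns, pd.MultiIndex):
    eth_df.columns = eth_df.columns.get_level_values(0)

# Keep only the closing price
eth_df = eth_df[["Close"]].dropna()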

print("Number of daily records:", len(eth_df))
eth_df.head()

In [None]:
eth_df.tail()

In [None]:
plt.figure()
plt.plot(eth_df.index, eth_df["Close"].values)
plt.title("ETH-USD Closing Price")
plt.xlabel("Date")
plt.ylabel("Price (USD)")
plt.tight_layout()
plt.show()

### Basic Feature Engineering

For tree-based models we can include simple technical features:
- 1-day return
- 7-day and 30-day moving averages
- 7-day rolling volatility

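All models trained below use only the closing price series (scaled where needed). `features_df` is not fed into any model in this notebook; it is kept as a starting point for extending the tree-based models.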

In [None]:
features_df = eth_df.copy()
features_df["return_1d"] = features_df["Close"].pct_change()
features_df["ma_7"] = features_df["Close"].rolling(window=7).mean()
features_df["ma_30"] = features_df["Close"].rolling(window=30).mean()
features_df["vol_7"] = features_df["return_1d"].rolling(window=7).std()

features_df = features_df.dropna()
features_df.head()
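
If you later feed such features into a tree model, each row needs a target that lies strictly in the future of its features. The sketch below shows one way to do that with `shift(-1)` on a toy series. The prices, the shorter 3-day window, and the column name `target_next_close` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy prices standing in for eth_df["Close"]
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0, 107.0, 110.0]})

toy["return_1d"] = toy["Close"].pct_change()
toy["ma_3"] = toy["Close"].rolling(window=3).mean()  # 3-day window only because the toy series is short

# Next-day target: row t holds the close of day t + 1
toy["target_next_close"] = toy["Close"].shift(-1)

# Drop rows without a full feature window or without a next-day target
toy = toy.dropna()

X_feat = toy[["return_1d", "ma_3"]].to_numpy()
y_feat = toy["target_next_close"].to_numpy()
print(X_feat.shape, y_feat)
```

The features of row t use prices up to and including day t, while the target is the close of day t + 1, so no future information enters the inputs.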

## 3. Train / Test Split

We use the first 80% of the time series for training and the remaining 20% for testing.

- **ARIMA** operates on the univariate closing price.
- **LSTM / Decision Tree / Random Forest** use a sliding window over the scaled closing price.
- The same test set is used for all models to make metrics comparable.

In [None]:
# For ARIMA (univariate series)
train_size_arima = int(len(eth_df) * 0.8)
train_arima = eth_df["Close"].iloc[:train_size_arima]
test_arima = eth_df["Close"].iloc[train_size_arima:]

print("ARIMA train size:", len(train_arima), "test size:", len(test_arima))

In [None]:
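# For LSTM / Tree models, use scaled close price and create sequences
values = eth_df["Close"].values.reshape(-1, 1)

# Fit the scaler on the training portion only, so test-period prices do not leak into the scaling
scaler = MinMaxScaler()
scaler.fit(values[:train_size_arima])
values_scaled = scaler.transform(values)

window_size = 60
X_seq, y_seq = create_sequences(values_scaled, window_size=window_size)

# Sequence i predicts the price at position i + window_size, so this split makes
# the first test target fall on the same date as the first ARIMA test observation
train_size_seq = train_size_arima - window_size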

X_train_seq = X_seq[:train_size_seq]
y_train_seq = y_seq[:train_size_seq]
X_test_seq = X_seq[train_size_seq:]
y_test_seq = y_seq[train_size_seq:]

dates_all = eth_df.index[window_size:]
train_dates_seq = dates_all[:train_size_seq]
test_dates_seq = dates_all[train_size_seq:]

print("Sequence train shape:", X_train_seq.shape, "test shape:", X_test_seq.shape)
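
To check whether a split of the raw series and a split of the windowed sequences start their test sets on the same day, it helps to work through the index arithmetic. The sketch below does this for a hypothetical series length. The variable names are illustrative and the numbers are not taken from the downloaded data.

```python
n_obs = 2500       # hypothetical number of daily records
window_size = 60

# Splitting the raw series at 80%: position of the first test observation
raw_test_start = int(n_obs * 0.8)

# Splitting the windowed sequences at 80%: sequence i predicts position i + window_size
n_sequences = n_obs - window_size
seq_split = int(n_sequences * 0.8)
seq_test_start = seq_split + window_size

print(raw_test_start, seq_test_start)

# Deriving the sequence split from the raw split puts both test sets on the same first day
aligned_split = raw_test_start - window_size
aligned_test_start = aligned_split + window_size
print(aligned_test_start == raw_test_start)
```

Taking 80% of the sequences independently shifts the test start by 20% of the window length, which is why the two percentages do not line up on their own.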

## 4. ARIMA Model (Classical Time Series)

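We fit an ARIMA model on the training closing prices and produce a single multi-step forecast covering the whole test horizon. The order (p, d, q) is chosen manually here; in a more advanced setting you can use auto-ARIMA to tune it. Because the later models predict one day ahead from observed prices, ARIMA faces a harder task here, so keep that in mind when comparing its errors with theirs.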
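
For reference, the order (5, 1, 0) used in the next cell corresponds to an autoregression of order 5 on the first-differenced price, with no moving-average terms. Written out, and assuming the statsmodels default of no constant term when d > 0:

$$
\Delta y_t = y_t - y_{t-1}, \qquad
\Delta y_t = \sum_{i=1}^{5} \phi_i \, \Delta y_{t-i} + \varepsilon_t
$$

Here $y_t$ is the closing price on day $t$, the $\phi_i$ are the fitted AR coefficients, and $\varepsilon_t$ is white noise. A forecast of the price is obtained by summing the forecast differences onto the last observed price.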

In [None]:
p, d, q = 5, 1, 0  # basic order; can be tuned further

arima_model = ARIMA(train_arima, order=(p, d, q))
arima_result = arima_model.fit()
print(arima_result.summary())

In [None]:
arima_forecast = arima_result.forecast(steps=len(test_arima))
arima_forecast.index = test_arima.index

metrics_arima = evaluate_regression(test_arima.values, arima_forecast.values, name="ARIMA")
plot_predictions(test_arima.index, test_arima.values, arima_forecast.values, "ARIMA - Test Forecast")

## 5. Decision Tree & Random Forest (Tree-based Models)

Here we treat the problem as supervised regression with lag features:

- Inputs: last 60 days of (scaled) closing prices.
- Target: next day closing price.

We train a **Decision Tree Regressor** as a simple baseline and a **Random Forest Regressor** as an ensemble improvement.
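
Tree models in scikit-learn expect a 2D feature matrix, so each window of shape `(window_size, 1)` has to become one row of lag features. The sketch below illustrates that flattening on a small dummy array; the array contents are assumptions made only to show where each day ends up.

```python
import numpy as np

n_samples, window, n_features = 4, 60, 1
X_windows = np.zeros((n_samples, window, n_features))

# Mark the most recent day of every window so we can see where it lands after flattening
X_windows[:, -1, 0] = 1.0

X_flat = X_windows.reshape(n_samples, -1)
print(X_flat.shape)  # (4, 60)

# Column 0 is the oldest day in the window, column 59 the most recent one
print(X_flat[0, 0], X_flat[0, -1])  # 0.0 1.0
```

After flattening, the tree sees 60 independent columns and has no built-in notion that neighbouring columns are consecutive days.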

In [None]:
# Flatten time dimension into feature vectors
X_train_flat = X_train_seq.reshape(X_train_seq.shape[0], -1)
X_test_flat = X_test_seq.reshape(X_test_seq.shape[0], -1)

# Decision Tree
dt = DecisionTreeRegressor(max_depth=10, random_state=42)
dt.fit(X_train_flat, y_train_seq.ravel())

dt_pred_scaled = dt.predict(X_test_flat).reshape(-1, 1)
dt_pred = scaler.inverse_transform(dt_pred_scaled).ravel()
y_test_dt = scaler.inverse_transform(y_test_seq.reshape(-1, 1)).ravel()

metrics_dt = evaluate_regression(y_test_dt, dt_pred, name="Decision Tree")
plot_predictions(test_dates_seq, y_test_dt, dt_pred, "Decision Tree - Test Predictions")

In [None]:
# Random Forest
rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train_flat, y_train_seq.ravel())

rf_pred_scaled = rf.predict(X_test_flat).reshape(-1, 1)
rf_pred = scaler.inverse_transform(rf_pred_scaled).ravel()
y_test_rf = scaler.inverse_transform(y_test_seq.reshape(-1, 1)).ravel()

metrics_rf = evaluate_regression(y_test_rf, rf_pred, name="Random Forest")
plot_predictions(test_dates_seq, y_test_rf, rf_pred, "Random Forest - Test Predictions")

## 6. LSTM Model (Deep Learning for Time Series)

An LSTM network can capture temporal dependencies and nonlinear patterns in the ETH price series. We use:

- One LSTM layer with 50 units
- Dense output layer
- Early stopping on validation loss to avoid overfitting

In [None]:
X_train_lstm = X_train_seq
X_test_lstm = X_test_seq

model_lstm = Sequential([
    LSTM(50, input_shape=(window_size, 1)),
    Dense(1)
])

model_lstm.compile(optimizer="adam", loss="mse")

early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

history = model_lstm.fit(
    X_train_lstm,
    y_train_seq,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=1
)
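
Keras takes the `validation_split` fraction from the end of the training arrays, before any shuffling, so with chronologically ordered sequences the validation set is the most recent part of the training period. The sketch below mimics that slicing with plain index ranges; the sample count is hypothetical and the code only illustrates the behaviour rather than reproducing Keras internals.

```python
n_train = 1000           # hypothetical number of training sequences
validation_split = 0.2

# The last 20% of the samples are held out for validation
split_at = int(n_train * (1.0 - validation_split))
fit_idx = range(0, split_at)        # used for gradient updates
val_idx = range(split_at, n_train)  # used only to compute val_loss

print(len(fit_idx), len(val_idx))
```

Early stopping therefore monitors the loss on the latest slice of the training data, which is still earlier in time than the test set.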

In [None]:
lstm_pred_scaled = model_lstm.predict(X_test_lstm)
y_test_lstm_scaled = y_test_seq.reshape(-1, 1)

lstm_pred = scaler.inverse_transform(lstm_pred_scaled).ravel()
y_test_lstm = scaler.inverse_transform(y_test_lstm_scaled).ravel()

metrics_lstm = evaluate_regression(y_test_lstm, lstm_pred, name="LSTM")
plot_predictions(test_dates_seq, y_test_lstm, lstm_pred, "LSTM - Test Predictions")

### LSTM vs Decision Tree: Relative Improvement

To match your resume bullet, we compute the relative improvement of the LSTM over the Decision Tree in terms of MAPE. The exact percentage depends on the run and the data window: a positive value means the LSTM had the lower error, and a negative value means the Decision Tree did.

In [None]:
improvement_mape = (metrics_dt["mape"] - metrics_lstm["mape"]) / metrics_dt["mape"] * 100
print(f"LSTM MAPE improvement over Decision Tree: {improvement_mape:.2f}%")
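
To make the formula concrete, here is the same calculation on made-up MAPE values. These numbers are purely hypothetical and are not results from this notebook.

```python
def relative_improvement(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return (baseline_error - model_error) / baseline_error * 100

# Hypothetical case 1: the baseline has MAPE 8.0, the model has MAPE 6.0
better = relative_improvement(8.0, 6.0)
print(f"{better:.2f}%")  # 25.00%

# Hypothetical case 2: the model is worse than the baseline, so the result is negative
worse = relative_improvement(6.0, 8.0)
print(f"{worse:.2f}%")   # -33.33%
```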

## 7. 30-Day Forecasting Pipeline (Long-Term Horizon)

We now build a simple **30-day ahead** forecasting pipeline using the trained LSTM model:

- Use the last 60 days of scaled ETH prices as the initial window.
- Iteratively forecast the next day, append the prediction, and slide the window.
- Invert the scaling to get price forecasts in USD.
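
The sliding-window loop described above can be sketched independently of the LSTM. The example below uses a stand-in predictor (the mean of the last three values) so that each step can be checked by hand. The names `recursive_forecast` and `mean_of_last_3` are illustrative assumptions and not part of the notebook's pipeline.

```python
import numpy as np

def recursive_forecast(predict_next, last_window, n_steps):
    """Roll a one-step predictor forward n_steps, feeding each prediction back in."""
    window = np.asarray(last_window, dtype=float).copy()
    preds = []
    for _ in range(n_steps):
        yhat = predict_next(window)
        preds.append(yhat)
        # Drop the oldest value and append the new prediction
        window = np.append(window[1:], yhat)
    return np.array(preds)

def mean_of_last_3(window):
    """Stand-in for a trained model: tomorrow is the mean of the last 3 values."""
    return float(np.mean(window[-3:]))

toy_preds = recursive_forecast(mean_of_last_3, [1.0, 2.0, 3.0], n_steps=3)
print(toy_preds)
```

From the second step onward the window contains predictions rather than observed values, so any error made early is carried into later steps. This is why long-horizon recursive forecasts tend to be less reliable than one-step-ahead predictions.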


In [None]:
def multi_step_forecast_lstm(model, last_window_scaled, scaler, n_steps=30):
    """Iteratively forecast n_steps into the future using an LSTM model.
    last_window_scaled: array of shape (window_size, 1) in scaled space.
    Returns: (n_steps,) array in original price space.
    """
    window = last_window_scaled.copy()
    preds_scaled = []
    for _ in range(n_steps):
        x_input = window.reshape(1, window.shape[0], window.shape[1])
        yhat_scaled = model.predict(x_input, verbose=0)
        preds_scaled.append(yhat_scaled[0, 0])
        window = np.vstack([window[1:], yhat_scaled])
    preds_scaled = np.array(preds_scaled).reshape(-1, 1)
    preds = scaler.inverse_transform(preds_scaled).ravel()
    return preds

# Use last window_size days from the full series for forecasting
last_window_scaled = values_scaled[-window_size:]
n_future_days = 30
future_preds = multi_step_forecast_lstm(model_lstm, last_window_scaled, scaler, n_steps=n_future_days)

last_date = eth_df.index[-1]
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=n_future_days, freq="D")

plt.figure()
plt.plot(eth_df.index[-120:], eth_df["Close"].values[-120:], label="History (last 120 days)")
plt.plot(future_dates, future_preds, label="LSTM 30-day Forecast")
plt.title("ETH-USD 30-Day Ahead Forecast (LSTM)")
plt.xlabel("Date")
plt.ylabel("Price (USD)")
plt.legend()
plt.tight_layout()
plt.show()

## 8. Model Comparison Summary

Finally, we gather all the metrics into a single table to compare ARIMA, Decision Tree, Random Forest and LSTM on the same test period.

In [None]:
summary_df = pd.DataFrame([
    metrics_arima,
    metrics_dt,
    metrics_rf,
    metrics_lstm
])
summary_df.set_index("model", inplace=True)
summary_df
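
When reading the table, it can help to have a naive reference point. One common choice is a persistence forecast that predicts tomorrow's close as today's close. The sketch below computes RMSE and MAPE for that baseline on hypothetical prices. It is an optional addition, and the numbers are not results from this notebook.

```python
import numpy as np

# Hypothetical closing prices; in the notebook this would be the test slice of eth_df["Close"]
prices = np.array([100.0, 110.0, 99.0, 108.0, 120.0])

y_true_naive = prices[1:]
y_pred_naive = prices[:-1]  # predict each day's close as the previous day's close

rmse_naive = np.sqrt(np.mean((y_true_naive - y_pred_naive) ** 2))
mape_naive = np.mean(np.abs((y_true_naive - y_pred_naive) / y_true_naive)) * 100
print(f"Persistence -> RMSE: {rmse_naive:.4f}, MAPE: {mape_naive:.2f}%")
```

A one-step-ahead model that does not beat this baseline on the same test period is adding little beyond copying the last observed price.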

### How this notebook maps to your resume bullets

- **Collected and preprocessed Ethereum historical data with 2,500+ daily records**  
  → Data download, cleaning and feature engineering sections.
- **Trained & optimized ARIMA, LSTM and Random Forest predictive models**  
  → ARIMA, LSTM, Decision Tree and Random Forest sections with train/test splits and early stopping.
- **Evaluated models using RMSE and MAPE**  
  → `evaluate_regression` function and `summary_df` table.
- **Captured temporal dependencies with LSTM, reducing prediction error over Tree models**  
  → LSTM vs Decision Tree comparison and improvement calculation.
- **Developed a reliable 30-day forecasting pipeline**  
  → `multi_step_forecast_lstm` function and 30-day forecast plot.

You can now upload this notebook to your GitHub repository named something like `ethereum-price-prediction` and reference it directly from your resume.