Dataset source: https://www.kaggle.com/datasets/priyamchoksi/bitcoin-historical-prices-and-activity-2010-2024/data

This project aims to develop a robust model for predicting Bitcoin prices using historical data from July 2010 to April 2024. The dataset includes daily Bitcoin prices (open, high, low, and close) along with trading volume and market capitalization. Three machine learning models, Linear Regression, Random Forest, and Gradient Boosting, were employed to forecast future prices over varying time horizons: weeks, months, and years. Feature engineering added lagged closing prices, rolling averages, and exponential moving averages to the models' inputs. Model performance was evaluated with Mean Squared Error (MSE) and R-squared on a chronological 80/20 train/test split to identify the most accurate forecasting method.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Load the data from the uploaded CSV file
bitcoin_data = pd.read_csv("/kaggle/input/bitcoin-historical-prices-and-activity-2010-2024/bitcoin_2010-07-27_2024-04-25.csv")

# Display the first few rows of the dataframe and the data types of the columns
bitcoin_data.head(), bitcoin_data.dtypes


(        Start         End      Open      High       Low     Close  \
 0  2024-04-24  2024-04-25  66553.54  67070.40  63742.73  64291.07   
 1  2024-04-23  2024-04-24  66761.02  67174.02  65884.15  66386.61   
 2  2024-04-22  2024-04-23  64952.02  67180.03  64598.00  66818.89   
 3  2024-04-21  2024-04-22  64875.75  65638.74  64302.47  64896.87   
 4  2024-04-20  2024-04-21  63824.06  65351.33  63321.19  64857.99   
 
          Volume    Market Cap  
 0  1.384736e+11  1.294489e+12  
 1  1.408621e+11  1.308576e+12  
 2  1.400725e+11  1.299703e+12  
 3  1.326574e+11  1.278730e+12  
 4  1.537410e+11  1.263725e+12  ,
 Start          object
 End            object
 Open          float64
 High          float64
 Low           float64
 Close         float64
 Volume        float64
 Market Cap    float64
 dtype: object)

In [3]:
# Convert 'Start' and 'End' columns to datetime format
bitcoin_data['Start'] = pd.to_datetime(bitcoin_data['Start'])
bitcoin_data['End'] = pd.to_datetime(bitcoin_data['End'])

# Sort the data by the 'Start' date to ensure time series consistency
bitcoin_data.sort_values('Start', inplace=True)

# Checking the data after conversion and sorting
bitcoin_data.head(), bitcoin_data.tail()


(          Start        End    Open    High     Low   Close  Volume  Market Cap
 5020 2010-07-27 2010-07-28  0.0600  0.0600  0.0600  0.0600     0.0         0.0
 5019 2010-07-28 2010-07-29  0.0589  0.0589  0.0589  0.0589     0.0         0.0
 5018 2010-07-29 2010-07-30  0.0699  0.0699  0.0699  0.0699     0.0         0.0
 5017 2010-07-30 2010-07-31  0.0627  0.0627  0.0627  0.0627     0.0         0.0
 5016 2010-07-31 2010-08-01  0.0679  0.0679  0.0679  0.0679     0.0         0.0,
        Start        End      Open      High       Low     Close        Volume  \
 4 2024-04-20 2024-04-21  63824.06  65351.33  63321.19  64857.99  1.537410e+11   
 3 2024-04-21 2024-04-22  64875.75  65638.74  64302.47  64896.87  1.326574e+11   
 2 2024-04-22 2024-04-23  64952.02  67180.03  64598.00  66818.89  1.400725e+11   
 1 2024-04-23 2024-04-24  66761.02  67174.02  65884.15  66386.61  1.408621e+11   
 0 2024-04-24 2024-04-25  66553.54  67070.40  63742.73  64291.07  1.384736e+11   
 
      Market Cap  
 4  1.

In [4]:
# Creating lagged features for historical prices
for lag in [1, 7, 30]:
    bitcoin_data[f'Close_lag_{lag}'] = bitcoin_data['Close'].shift(lag)

# Drop rows with missing values resulting from lagging
bitcoin_data.dropna(inplace=True)

# Prepare features and target variable
features = bitcoin_data[[f'Close_lag_{lag}' for lag in [1, 7, 30]]]
target = bitcoin_data['Close']

# Splitting data into training and testing sets - let's use the last 20% of data as test set
split_index = int(len(bitcoin_data) * 0.8)
X_train, X_test = features[:split_index], features[split_index:]
y_train, y_test = target[:split_index], target[split_index:]

# Review the first few rows of the features in the training data
X_train.head(), y_train.head()


(      Close_lag_1  Close_lag_7  Close_lag_30
 4990       0.0648       0.0667        0.0600
 4989       0.0640       0.0655        0.0589
 4988       0.0650       0.0664        0.0699
 4987       0.0641       0.0660        0.0627
 4986       0.0640       0.0649        0.0679,
 4990    0.0640
 4989    0.0650
 4988    0.0641
 4987    0.0640
 4986    0.0650
 Name: Close, dtype: float64)
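
The single chronological 80/20 split above yields one error estimate. A walk-forward (expanding window) scheme yields several, each trained only on rows that precede its test block. The following is a minimal sketch under stated assumptions: `expanding_window_splits` and its parameters are hypothetical names introduced here, not part of the original notebook, and the row count of 100 is a toy value. scikit-learn's `TimeSeriesSplit` serves a similar purpose.

```python
import numpy as np

def expanding_window_splits(n_samples, n_folds=4, min_train_frac=0.5):
    """Yield (train_idx, test_idx) pairs; every test block follows its training rows in time.

    Hypothetical helper for illustration only.
    """
    first_test_start = int(n_samples * min_train_frac)
    # Boundaries of the consecutive test blocks covering the remaining rows
    edges = np.linspace(first_test_start, n_samples, n_folds + 1).astype(int)
    for start, stop in zip(edges[:-1], edges[1:]):
        # Train on everything before the block, test on the block itself
        yield np.arange(0, start), np.arange(start, stop)

splits = list(expanding_window_splits(n_samples=100, n_folds=4))
for train_idx, test_idx in splits:
    print(f"train rows 0..{train_idx[-1]}, test rows {test_idx[0]}..{test_idx[-1]}")
```

With the notebook's data, the index arrays could be passed to `features.iloc[...]` and `target.iloc[...]`, and the per-fold metrics averaged.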

In [5]:
# Initialize the models
lr_model = LinearRegression()
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
gbm_model = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Function to train and evaluate a model
def train_evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    return mse, r2

# Train and evaluate each model
results = {}
models = {'Linear Regression': lr_model, 'Random Forest': rf_model, 'GBM': gbm_model}
for name, model in models.items():
    mse, r2 = train_evaluate(model, X_train, y_train, X_test, y_test)
    results[name] = {'MSE': mse, 'R^2': r2}

results


{'Linear Regression': {'MSE': 1424935.6814849433, 'R^2': 0.9928613639375479},
 'Random Forest': {'MSE': 9422638.29055948, 'R^2': 0.9527945110937688},
 'GBM': {'MSE': 10172110.482893495, 'R^2': 0.949039808836313}}

In [6]:
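# Create lagged features (the same columns as in cell 4; recomputing them after
# the earlier dropna leaves NaN in the first 30 remaining rows)
for lag in [1, 7, 30]:
    bitcoin_data[f'Close_lag_{lag}'] = bitcoin_data['Close'].shift(lag)

# Create rolling means and exponential moving averages of the closing price.
# Note: each window ends on the current row, so these features include the
# same day's Close that is used as the target in the next cell.
for window in [7, 14, 30]:
    bitcoin_data[f'Rolling_Mean_{window}'] = bitcoin_data['Close'].rolling(window).mean()
    bitcoin_data[f'EMA_{window}'] = bitcoin_data['Close'].ewm(span=window, adjust=False).mean()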
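
Both `rolling(window).mean()` and `ewm(...).mean()` end their window on the current row, so the value for a given day includes that day's `Close`, which is also the target. One common way to restrict such features to past information is to shift the series by one row before averaging. The sketch below illustrates this on a toy frame; the values, the window of 3, and the `_past` column suffix are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd

# Toy closing prices (illustrative values, not taken from the dataset)
toy = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Shift by one row so every window ends on the previous day's close
past_close = toy['Close'].shift(1)

for window in [3]:
    toy[f'Rolling_Mean_{window}_past'] = past_close.rolling(window).mean()
    toy[f'EMA_{window}_past'] = past_close.ewm(span=window, adjust=False).mean()

print(toy)
```

Each feature on row t is then a function of closes up to row t-1 only, so a model fitted on it cannot read the target back out of its own inputs.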


In [7]:
# Drop rows with missing values
bitcoin_data.dropna(inplace=True)

# Define features and target
features = bitcoin_data[[col for col in bitcoin_data.columns if 'lag' in col or 'Rolling_Mean' in col or 'EMA' in col]]
target = bitcoin_data['Close']



In [8]:
# Split data into training and test sets
split_index = int(len(bitcoin_data) * 0.8)
X_train, X_test = features.iloc[:split_index], features.iloc[split_index:]
y_train, y_test = target.iloc[:split_index], target.iloc[split_index:]

In [9]:
# Initialize and train models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

In [10]:
# Train and evaluate models
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    results[name] = {'MSE': mse, 'R^2': r2}

# Print results
for model_name, metrics in results.items():
    print(f"{model_name} - MSE: {metrics['MSE']}, R^2: {metrics['R^2']}")

Linear Regression - MSE: 199217.66577476592, R^2: 0.9990072626169023
Random Forest - MSE: 10416997.93489433, R^2: 0.9480902297022572
Gradient Boosting - MSE: 10565049.433656957, R^2: 0.9473524625124164
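
Daily closing prices are strongly autocorrelated, so an R-squared close to 1 is easier to reach than it may appear. A naive persistence baseline, which predicts each day's close as the previous day's close, gives a reference point for the scores above. The sketch below uses a synthetic random walk rather than the Bitcoin data, and the `mse` and `r2` helpers are written out by hand for illustration; with the notebook's variables, `y_test` and `X_test['Close_lag_1']` would presumably play the roles of `y_true` and `y_naive`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic random-walk "prices" (illustrative only, not Bitcoin data)
rng = np.random.default_rng(0)
prices = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=1000))

y_true = prices[1:]    # value to predict
y_naive = prices[:-1]  # previous value used as the prediction

baseline_mse = mse(y_true, y_naive)
baseline_r2 = r2(y_true, y_naive)
print(f"Persistence baseline - MSE: {baseline_mse:.3f}, R^2: {baseline_r2:.3f}")
```

A model is only informative to the extent that it beats this baseline on the same test rows.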


In [11]:
# Example predictions
# forecast_index selects how many of the most recent rows to predict
forecast_index = -30  # Example: last 30 days in the dataset (all within the test set)
forecast_values = models['Linear Regression'].predict(features.iloc[forecast_index:])
print("Forecast Values:", forecast_values)

Forecast Values: [68770.13700368 70385.39020429 71909.90167035 70187.92939185
 69510.64439676 70551.52334343 69541.19593579 66605.50026123
 67084.32774    68059.31011471 66924.48129682 68390.1037432
 69639.89601373 72272.61141183 69441.99197373 70738.75995728
 69958.87263806 67316.36022593 64894.74355176 65386.03325494
 63086.08487439 63095.28639137 61149.86335783 63898.91033616
 63566.73822207 64230.91839503 64966.62134747 66910.82322144
 66570.92271902 64647.46127633]
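
The cell above predicts the close for rows that already exist in the dataset. To forecast a given number of days ahead, one common setup (often called direct forecasting) pairs each day's features with the close observed `horizon` rows later and trains one model per horizon. The sketch below shows only the target construction; `make_horizon_target` and the `Target_h{horizon}` column name are hypothetical, and the frame is toy data.

```python
import pandas as pd

def make_horizon_target(df, horizon, price_col='Close'):
    """Return a copy of df with a column holding the price `horizon` rows ahead.

    Hypothetical helper for illustration; rows without a future value are dropped.
    """
    out = df.copy()
    target_col = f'Target_h{horizon}'
    # shift(-horizon) moves future prices up so they align with today's features
    out[target_col] = out[price_col].shift(-horizon)
    return out.dropna(subset=[target_col])

# Toy closing prices in chronological order (illustrative values)
toy = pd.DataFrame({'Close': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]})
weekly = make_horizon_target(toy, horizon=7)
print(weekly)
```

This assumes the frame is sorted by date, as the notebook does with `sort_values('Start')`. Horizons of 7, 30, or 365 rows would correspond to the weekly, monthly, and yearly forecasts named in the introduction.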


On the test set, Linear Regression clearly outperformed the Random Forest and Gradient Boosting models. It reached an R-squared of 0.999 and an MSE of approximately 199,217.67, against R-squared values of about 0.948 and 0.947 and MSEs above 10.4 million for the two ensembles. Its predicted values track the observed closing prices closely. Note that the rolling-mean and EMA features are computed over windows that include the same day's close, so these scores measure how well the model reconstructs that day's price from its features rather than how well it forecasts an unseen day.
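
For reference, the two metrics quoted here are conventionally defined as follows, where y_i is the observed close, \hat{y}_i the prediction, \bar{y} the mean of the observed test values, and n the number of test rows:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```

Because MSE is expressed in squared price units, its square root is easier to interpret: \sqrt{199{,}217.67} \approx 446 for Linear Regression, compared with \sqrt{10{,}416{,}997.93} \approx 3{,}228 for Random Forest.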

The Linear Regression predictions for the last 30 days of the dataset fluctuate noticeably, peaking at about 72,272.61, falling to a low of about 61,150, and ending near 64,647. This may reflect ordinary market volatility or specific events affecting the cryptocurrency market. Because the features are built from recent prices, the predictions follow such movements closely.

In this project, Linear Regression gave the most accurate predictions among the models tested, with the highest R-squared value and the lowest MSE. Adding rolling and exponential moving averages as features reduced the Linear Regression MSE from about 1.42 million to about 199,000, while the Random Forest and Gradient Boosting errors rose slightly (from about 9.4 million and 10.2 million to about 10.4 million and 10.6 million). Tree ensembles cannot predict values outside the target range seen during training, which may penalize them on a chronological split of a trending series; their potential could be explored further with parameter tuning and additional feature engineering. This study underscores the importance of feature engineering in time series forecasting and highlights the potential of simple linear models for understanding and predicting market behavior. The findings could assist investors and analysts in making informed decisions in the cryptocurrency market.