# Model Selection

Performance summary of all models:

|           | Train Set | Train Set | Train Set | Test Set | Test Set | Test Set |
|-----------|-------|----------|---------|-------|----------|---------|
| **Model** | **R2** | **RMSE** | **MAPE%** | **R2** | **RMSE** | **MAPE%** |
| Linear Regression | 0.9965 | 799.3932 | 4.3926 | 0.9898 | 1339.5982 | 2.9076 |
| SVM | 0.9968 | 985.0937 | 3.3692 | 0.9611 | 831.7924 | 2.6539 |
| KNN | --- | --- | --- | 0.9113 | 1256.7745 | 4.0181 |
| Random Forest | --- | 33.5004 | 0.0802 | 0.9897 | 430.0638 | 1.3385 |
| AdaBoost | 0.9924 | 1508.1469 | 77.2415 | 0.7346 | 2173.5470 | 75.3318 |
| XGBoost | 0.9970 | 925.0439 | 2.9174 | 0.9501 | 1015.8001 | 3.5157 |
| LightGBM | --- | 549.2033 | 1.2529 | 0.9640 | 805.4951 | 2.4725 |
| ARIMA | 0.9970 | 733.2738 | 2.8100 | -1.5391 | 21103.8951 | 79.7169 |
| GAN | 0.9955 | 706.8428 | 3.29 | 0.9651 | 731.9221 | 2.30 |
| GRU | 0.9971 | 666.0494 | 2.98 | 0.9738 | 651.7368 | 2.04 |
| LSTM | 0.9969 | 666.1824 | 3.00 | 0.9736 | 669.6527 | 2.07 |


Sorted by test-set **R2** (descending):

|           | Train Set | Train Set | Train Set | Test Set | Test Set | Test Set |
|-----------|-------|----------|---------|-------|----------|---------|
| **Model** | **R2** | **RMSE** | **MAPE%** | **R2** | **RMSE** | **MAPE%** |
| Linear Regression | 0.9965 | 799.3932 | 4.3926 | 0.9898 | 1339.5982 | 2.9076 |
| Random Forest | --- | 33.5004 | 0.0802 | 0.9897 | 430.0638 | 1.3385 |
| GRU | 0.9971 | 666.0494 | 2.98 | 0.9738 | 651.7368 | 2.04 |
| LSTM | 0.9969 | 666.1824 | 3.00 | 0.9736 | 669.6527 | 2.07 |
| GAN | 0.9955 | 706.8428 | 3.29 | 0.9651 | 731.9221 | 2.30 |
| LightGBM | --- | 549.2033 | 1.2529 | 0.9640 | 805.4951 | 2.4725 |
| SVM | 0.9968 | 985.0937 | 3.3692 | 0.9611 | 831.7924 | 2.6539 |
| XGBoost | 0.9970 | 925.0439 | 2.9174 | 0.9501 | 1015.8001 | 3.5157 |
| KNN | --- | --- | --- | 0.9113 | 1256.7745 | 4.0181 |
| AdaBoost | 0.9924 | 1508.1469 | 77.2415 | 0.7346 | 2173.5470 | 75.3318 |
| ARIMA | 0.9970 | 733.2738 | 2.8100 | -1.5391 | 21103.8951 | 79.7169 |
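
The ranking depends on which metric is used for sorting. As a minimal sketch, the test-set R2 and RMSE values from the table can be loaded into a DataFrame and sorted both ways. The DataFrame and its column names (`r2_test`, `rmse_test`) are illustrative and not taken from the project code.

```python
import pandas as pd

# Test-set R2 and RMSE copied from the summary table.
# Column names are illustrative, not taken from the project code.
results = pd.DataFrame(
    [
        ("Linear Regression", 0.9898, 1339.5982),
        ("SVM", 0.9611, 831.7924),
        ("KNN", 0.9113, 1256.7745),
        ("Random Forest", 0.9897, 430.0638),
        ("AdaBoost", 0.7346, 2173.5470),
        ("XGBoost", 0.9501, 1015.8001),
        ("LightGBM", 0.9640, 805.4951),
        ("ARIMA", -1.5391, 21103.8951),
        ("GAN", 0.9651, 731.9221),
        ("GRU", 0.9738, 651.7368),
        ("LSTM", 0.9736, 669.6527),
    ],
    columns=["model", "r2_test", "rmse_test"],
)

# Higher R2 is better, lower RMSE is better.
by_r2 = results.sort_values("r2_test", ascending=False)["model"].tolist()
by_rmse = results.sort_values("rmse_test", ascending=True)["model"].tolist()

print("Top 3 by R2:  ", by_r2[:3])
print("Top 3 by RMSE:", by_rmse[:3])
print("Linear Regression rank by RMSE:", by_rmse.index("Linear Regression") + 1)
```

Sorted by RMSE, Random Forest moves to first place and Linear Regression drops to ninth, so the R2 ordering alone should not be read as the full picture.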



Analysis of the models' performance on *Bitcoin price prediction*:

**Overall**

Several models predict Bitcoin prices well. Ranked by test-set R2, the top performers are Linear Regression, Random Forest, GRU, LSTM, GAN, and LightGBM, all with a test R2 above 0.96. Most of them also have low test RMSE and MAPE%. The exception is Linear Regression, whose test RMSE (1339.5982) is the third highest in the table despite its top R2.

KNN, AdaBoost, and ARIMA are less effective. They have the lowest test-set R2 values (0.9113, 0.7346, and -1.5391), and AdaBoost and ARIMA in particular have very high test MAPE% (75.3318 and 79.7169).
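
For reference, the three metrics can be written out directly. The sketch below is an illustration with made-up numbers, not the project's evaluation code. It also shows why R2 can be negative, as it is for ARIMA on the test set: R2 falls below zero whenever a model's squared error is larger than that of always predicting the mean of the true values.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, expressed in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Made-up values for illustration only.
y_true = [100.0, 110.0, 120.0, 130.0]
close_forecast = [101.0, 109.0, 121.0, 129.0]  # tracks the series
flat_forecast = [60.0, 60.0, 60.0, 60.0]       # stuck at an outdated level

for name, pred in [("close", close_forecast), ("flat", flat_forecast)]:
    print(name, r2(y_true, pred), rmse(y_true, pred), mape_percent(y_true, pred))
```

The flat forecast has a negative R2 because its errors are much larger than the spread of the true values around their own mean.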

**Analyzing the different types of models used**

*Experiments show that:*

1. The statistical model ARIMA performs worst. A likely reason is that it uses less information than the other approaches: a plain ARIMA model forecasts from the history of the price series alone, without the additional market features.

2. Classical machine learning models such as Linear Regression perform well. We hypothesize that this is due to good data normalization and to linear relationships between the various economic and market factors. Linear Regression outperforms KNN and SVM.

3. Ensemble models, including bagging (Random Forest) and boosting (AdaBoost, CatBoost, XGBoost, LightGBM), performed well but, with the exception of Random Forest, fell short of expectations. A possible explanation is that a Random Forest typically uses a large number of decision trees, each grown relatively deep, which lets it learn complex rules and exploit the relationships between features. LightGBM, XGBoost, AdaBoost, and CatBoost can also perform well, but the number and depth of their trees must be tuned for the best results.

4. The deep learning models (GAN, GRU, LSTM) perform well, and there is still room for improvement. Due to time and resource limitations, we were unable to tune their parameters or build more complex neural networks to maximize performance.
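
Point 3 mentions tuning the number and depth of trees. One common way to do this for time-ordered data is a grid search whose cross-validation folds respect time order. The sketch below is an assumption about how such tuning could look, not the procedure used in this project: the data is synthetic, the grid is deliberately tiny, and Random Forest stands in for any tree ensemble.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data (hypothetical, not the Bitcoin dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Candidate values for the number and depth of trees.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),          # each fold validates on later rows only
    scoring="neg_root_mean_squared_error",   # higher (closer to 0) is better
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```

`TimeSeriesSplit` is used instead of the default shuffled K-fold so that no fold is trained on data that comes after its validation rows.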

**In summary**

In summary, Linear Regression and Random Forest come out ahead of the other models in terms of training time and overall performance. Deep learning models also perform well and have potential for further improvement, but they require longer training times and resource-intensive parameter tuning.

# Baseline test: all models with default parameters

## 1 - Import Libraries, Load and Split Data

In [1]:
# Import libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

# Single (non-ensemble) models
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Ensemble models
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from catboost import CatBoostRegressor
import xgboost as xgb
import lightgbm as lgb

from sklearn.metrics import r2_score, mean_absolute_percentage_error, mean_squared_error


In [2]:
# Load data
data = pd.read_csv("https://raw.githubusercontent.com/lavibula/ML20222.PredictionBitcoin/main/data/data.csv")
# data = pd.read_csv(r"C:\Users\Administrator\OneDrive - Hanoi University of Science and Technology\ITE10 - Data Science and AI - HUST\20222\ML\Source_Codes\ML20222.PredictionBitcoin\data\data.csv")
# print(data.info())
# Sort chronologically; drop=True discards the old index column.
data = data.sort_values('Date', ascending=True).reset_index(drop=True)

# Use BTC_close_tomorrow as y (target column) for today's X, instead of today's BTC_close
data["BTC_close_tomorrow"] = data["BTC_close"].shift(-1)
# Drop the last row: it has no "tomorrow" value (NaN after the shift).
data = data.iloc[:-1]  # equivalent: data = data.drop(data.index[-1])
print(data.info())
data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2660 entries, 0 to 2659
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    2660 non-null   object 
 1   BTC_close               2660 non-null   float64
 2   BTC_open                2660 non-null   float64
 3   BTC_high                2660 non-null   float64
 4   BTC_low                 2660 non-null   float64
 5   difficulty              2660 non-null   int64  
 6   addresses_active_count  2660 non-null   int64  
 7   sum_lock_weight         2660 non-null   int64  
 8   mean_lock_size_ytes     2660 non-null   float64
 9   total_fees_usd          2660 non-null   float64
 10  mean_hash_rate          2660 non-null   float64
 11  xfer_cnt                2660 non-null   int64  
 12  mean_tx_size_usd        2660 non-null   float64
 13  ETH                     2660 non-null   float64
 14  LTC                     2660 non-null   

| | Date | BTC_close | BTC_open | BTC_high | BTC_low | difficulty | addresses_active_count | sum_lock_weight | mean_lock_size_ytes | total_fees_usd | ... | LTC | XRP | DOGE | COPPER | GOLD | SILVER | SPX | JP225 | DJI | BTC_close_tomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-03-10 | 415.8 | 412.8 | 417.5 | 410.3 | 158427203767 | 445273 | 426654988 | 6.881532e+05 | 1.731272e+04 | ... | 0.00 | 0.00000 | 0.000000 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 419.1 |
| 1 | 2016-03-11 | 419.1 | 415.8 | 422.4 | 415.1 | 158427203767 | 434658 | 398582424 | 6.227850e+05 | 1.710193e+04 | ... | 0.00 | 0.00000 | 0.000000 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 410.4 |
| 2 | 2016-03-12 | 410.4 | 419.1 | 420.7 | 407.0 | 158427203767 | 374730 | 331208848 | 5.750154e+05 | 1.398444e+04 | ... | 0.00 | 0.00000 | 0.000000 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 412.4 |
| 3 | 2016-03-13 | 412.4 | 410.4 | 415.9 | 409.6 | 158427203767 | 421585 | 334817852 | 6.293569e+05 | 1.460678e+04 | ... | 0.00 | 0.00000 | 0.000000 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 414.3 |
| 4 | 2016-03-14 | 414.3 | 412.4 | 416.1 | 411.2 | 158427203767 | 451902 | 437739524 | 7.199663e+05 | 1.689298e+04 | ... | 0.00 | 0.00000 | 0.000000 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 415.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2655 | 2023-06-17 | 26515.0 | 26341.3 | 26767.3 | 26183.5 | 52350439455487 | 863600 | 559141195 | 1.724510e+06 | 1.041701e+06 | ... | 76.87 | 0.47940 | 0.062193 | 0.0000 | 0.00 | 0.000 | 0.00 | 0.0 | 0.0 | 26339.7 |
| 2656 | 2023-06-18 | 26339.7 | 26515.0 | 26679.3 | 26290.6 | 52350439455487 | 883864 | 603064705 | 1.985675e+06 | 7.946708e+05 | ... | 77.20 | 0.48699 | 0.062107 | 3.8738 | 1969.45 | 24.198 | 0.00 | 0.0 | 0.0 | 26845.9 |
| 2657 | 2023-06-19 | 26845.9 | 26339.7 | 27029.7 | 26295.1 | 52350439455487 | 920552 | 567091224 | 1.758290e+06 | 8.514319e+05 | ... | 77.51 | 0.49341 | 0.062429 | 3.8643 | 1964.05 | 24.062 | 0.00 | 0.0 | 0.0 | 28307.7 |
| 2658 | 2023-06-20 | 28307.7 | 26845.9 | 28393.0 | 26665.5 | 52350439455487 | 951926 | 543133484 | 1.677592e+06 | 1.052364e+06 | ... | 80.31 | 0.49270 | 0.063108 | 3.8835 | 1947.70 | 23.234 | 4388.71 | 33155.0 | 34356.0 | 29996.9 |
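
To make the target construction concrete, here is a small sketch that applies the same `shift(-1)` step to the first four closing prices shown in the table. The `toy` DataFrame is only an illustration.

```python
import pandas as pd

# First four closing prices from the table (illustrative subset).
toy = pd.DataFrame({
    "Date": ["2016-03-10", "2016-03-11", "2016-03-12", "2016-03-13"],
    "BTC_close": [415.8, 419.1, 410.4, 412.4],
})

# shift(-1) moves each value up one row, so every day is paired with the next day's close.
toy["BTC_close_tomorrow"] = toy["BTC_close"].shift(-1)

# Only the last row has no "tomorrow" and ends up as NaN.
n_missing = int(toy["BTC_close_tomorrow"].isna().sum())

# Drop that last row, as in the cell above.
toy = toy.iloc[:-1]
print(n_missing)
print(toy)
```

Each remaining row pairs the features of one day with the closing price of the following day, which is what the models are trained to predict.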


In [3]:
# Chronological train/test split: the last 15% of the rows form the test set
X = data.drop(['BTC_close_tomorrow', 'Date'], axis=1)
y = data['BTC_close_tomorrow']
# shuffle=False keeps the time order (random_state has no effect in that case)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42, shuffle=False)
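
Because `shuffle=False` is set, the split is chronological: the training set contains the earliest rows and the test set the most recent ones. A minimal check of that behaviour on a hypothetical 20-day frame (the `toy` data and its column names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 20 consecutive days.
toy = pd.DataFrame({
    "Date": pd.date_range("2023-01-01", periods=20, freq="D"),
    "feature": np.arange(20, dtype=float),
})
toy["target"] = toy["feature"] * 2.0

X = toy.drop(columns=["target", "Date"])
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, shuffle=False)

# With shuffle=False every training day precedes every test day.
last_train_day = toy.loc[X_train.index, "Date"].max()
first_test_day = toy.loc[X_test.index, "Date"].min()
print(last_train_day, first_test_day)
```

Keeping the test period strictly after the training period avoids evaluating a model on days that precede the data it was trained on.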

## 2 - Use default parameters for all models

In [4]:

# All models with their default parameters
models = [
    LinearRegression(),
    SVR(),
    KNeighborsRegressor(),
    RandomForestRegressor(),
    AdaBoostRegressor(),
    CatBoostRegressor(),
    xgb.XGBRegressor(),
    lgb.LGBMRegressor(),
]

entries = []
for model in models:
    model_name = model.__class__.__name__
    model.fit(X_train, y_train)
    y_pred_test = model.predict(X_test)

    # Test-set metrics, rounded to 4 decimals
    r2_score_value = round(r2_score(y_test, y_pred_test), 4)
    # np.sqrt(MSE) replaces squared=False, which is deprecated in recent scikit-learn releases
    rmse = round(np.sqrt(mean_squared_error(y_test, y_pred_test)), 4)
    mape = round(mean_absolute_percentage_error(y_test, y_pred_test) * 100, 4)

    entries.append([model_name, rmse, mape, r2_score_value])

model_df = pd.DataFrame(entries, columns=['model_name', 'RMSE', 'MAPE%', 'r2_score'])
model_df.sort_values(by=['r2_score'], ascending=False)


Learning rate set to 0.046576
0:	learn: 16584.6138023	total: 230ms	remaining: 3m 49s
1:	learn: 15855.6774068	total: 250ms	remaining: 2m 4s
2:	learn: 15184.1602526	total: 268ms	remaining: 1m 28s
3:	learn: 14542.6441893	total: 286ms	remaining: 1m 11s
4:	learn: 13904.2828935	total: 303ms	remaining: 1m
5:	learn: 13299.1064746	total: 320ms	remaining: 53s
6:	learn: 12740.5523013	total: 339ms	remaining: 48.1s
7:	learn: 12185.2616844	total: 361ms	remaining: 44.8s
8:	learn: 11679.5678600	total: 383ms	remaining: 42.2s
9:	learn: 11190.3651229	total: 406ms	remaining: 40.2s
10:	learn: 10720.8286867	total: 426ms	remaining: 38.3s
11:	learn: 10286.7839545	total: 456ms	remaining: 37.6s
12:	learn: 9845.6744234	total: 488ms	remaining: 37.1s
13:	learn: 9424.6377699	total: 544ms	remaining: 38.3s
14:	learn: 9030.5269569	total: 589ms	remaining: 38.6s
15:	learn: 8647.5526851	total: 626ms	remaining: 38.5s
16:	learn: 8296.5098534	total: 649ms	remaining: 37.5s
17:	learn: 7947.9237663	total: 676ms	remaining: 36.9

| | model_name | RMSE | MAPE% | r2_score |
|---|---|---|---|---|
| 0 | LinearRegression | 945.401 | 3.0251 | 0.9498 |
| 3 | RandomForestRegressor | 1330.2098 | 4.8448 | 0.9006 |
| 6 | XGBRegressor | 1547.402 | 5.5764 | 0.8655 |
| 7 | LGBMRegressor | 1584.6834 | 5.4142 | 0.8589 |
| 4 | AdaBoostRegressor | 2079.5999 | 6.9202 | 0.7571 |
| 5 | CatBoostRegressor | 2871.5251 | 11.8601 | 0.5368 |
| 2 | KNeighborsRegressor | 10909.4877 | 45.3154 | -5.6856 |
| 1 | SVR | 14898.5495 | 61.427 | -11.4686 |
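
One possible contributor to the negative R2 of `SVR` and `KNeighborsRegressor` in this run is feature scale. Both models are distance-based, and the raw features span many orders of magnitude (for example, `difficulty` is around 1e13 while prices are between 1e2 and 1e5). This is an assumption, not something the run above tests. The sketch below uses synthetic data, not the Bitcoin dataset, to show how a `StandardScaler` pipeline changes the result for KNN when an uninformative feature has a very large scale.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): one small-scale feature drives y,
# one large-scale feature is pure noise.
rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
large_noise = rng.normal(scale=1e6, size=n)
X = np.column_stack([informative, large_noise])
y = 10.0 * informative + rng.normal(scale=0.1, size=n)

# Ordered 85/15 split, mirroring the notebook's test_size=0.15 with shuffle=False.
split = int(n * 0.85)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# Same model, with and without standardizing the features first.
unscaled = KNeighborsRegressor().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsRegressor()).fit(X_train, y_train)

r2_unscaled = r2_score(y_test, unscaled.predict(X_test))
r2_scaled = r2_score(y_test, scaled.predict(X_test))
print("unscaled R2:", r2_unscaled)
print("scaled R2:  ", r2_scaled)
```

Without scaling, the nearest neighbours are chosen almost entirely by the large noisy feature. Whether scaling alone would fix the scores on the real data is untested here.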


In [5]:
# Random Forest parameters
rf_params = {
    'bootstrap': True,
    'max_depth': None,
    'max_features': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'n_estimators': 300,
}

# LightGBM parameters
lgb_params = {
    'subsample': 1.0,
    'random_state': 42,
    'objective': 'regression',
    'num_leaves': 35,
    'min_child_samples': 5,
    'metric': 'rmse',
    'max_depth': 8,
    'max_bin': 500,
    'learning_rate': 0.48,
    'lambda_l2': 0.2,
    'lambda_l1': 0.1,
    'boosting_type': 'dart',
}
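
Parameter dictionaries like these are typically unpacked into the estimator constructor. The notebook does not show that step, so the following is a sketch of one likely usage on synthetic data, with `random_state=0` added only to make the example reproducible. `lgb.LGBMRegressor(**lgb_params)` should be usable the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same dictionary as in the cell above.
rf_params = {
    'bootstrap': True,
    'max_depth': None,
    'max_features': None,
    'min_samples_leaf': 1,
    'min_samples_split': 2,
    'n_estimators': 300,
}

# Synthetic stand-in data (hypothetical, not the Bitcoin dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# ** unpacks the dictionary into keyword arguments.
model = RandomForestRegressor(**rf_params, random_state=0)
model.fit(X, y)

print(len(model.estimators_))  # number of fitted trees
```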

In [6]:
df2 = 

# Sort by 'r2_score', the metric column defined in model_df (it has no 'score' column)
total = pd.concat([model_df, df2], ignore_index=True)
total.sort_values(by=['r2_score'], ascending=False)

SyntaxError: invalid syntax (1738838458.py, line 1)