## **Modules**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas import DataFrame

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

## **Data**

**Import the normalized data**

In [2]:
file_path = "https://raw.githubusercontent.com/lavibula/ML20222.PredictionBitcoin/main/data/saved_data.csv"
df = pd.read_csv(file_path)
df

Unnamed: 0,Date,BTC_close,BTC_open,BTC_high,BTC_low,BTC_volume,Active_Addr_Cnt,Difficulty,Mean_Block_Size(in_bytes),Sum_Block_Weight,...,ETH,LTC,DOGE,XRP,GOLD,SILVER,COPPER,S&P500,DJI,JP225
0,2023-04-16,30310.3,30299.2,30545.3,30134.6,34.48,840992.0,4.788780e+13,1.866594e+06,495223185.0,...,2119.29,100.03,0.090465,0.52089,2015.6,25.438,9023.50,4137.64,33885.31,28493.47
1,2023-04-15,30299.6,30472.6,30586.5,30208.8,31.71,1045660.0,4.788780e+13,1.839875e+06,631025193.0,...,2090.59,96.66,0.088890,0.51930,2002.2,25.460,9023.50,4137.64,33885.31,28493.47
2,2023-04-14,30472.5,30387.4,30964.9,30026.0,98.38,1016042.0,4.788780e+13,1.759535e+06,559166432.0,...,2099.98,96.34,0.088707,0.52269,2002.2,25.460,9023.50,4137.64,33885.31,28493.47
3,2023-04-13,30387.4,29892.4,30524.1,29864.5,65.87,1009669.0,4.788780e+13,1.812113e+06,567094231.0,...,2012.11,94.19,0.087344,0.51244,2041.3,25.925,9058.50,4146.22,34030.34,28156.97
4,2023-04-12,29886.4,30209.8,30473.0,29679.5,78.69,1056542.0,4.788780e+13,1.933496e+06,635037442.0,...,1916.58,92.02,0.083398,0.50473,2010.9,25.458,8916.50,4091.95,33647.22,28082.70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4649,2010-07-24,0.1,0.1,0.1,0.1,0.50,959.0,1.820000e+02,1.519540e+03,1148772.0,...,0.00,0.00,0.000000,0.00000,1187.8,18.101,7018.25,1102.66,10424.62,9430.96
4650,2010-07-23,0.1,0.1,0.1,0.1,2.40,655.0,1.820000e+02,5.309330e+02,412004.0,...,0.00,0.00,0.000000,0.00000,1187.8,18.101,7018.25,1102.66,10424.62,9430.96
4651,2010-07-22,0.1,0.1,0.1,0.1,2.16,594.0,1.820000e+02,5.724432e+02,403000.0,...,0.00,0.00,0.000000,0.00000,1195.6,18.120,7002.75,1093.67,10322.30,9220.88
4652,2010-07-21,0.1,0.1,0.1,0.1,0.58,784.0,1.820000e+02,6.038213e+02,499964.0,...,0.00,0.00,0.000000,0.00000,1191.8,17.803,0.00,1069.59,10120.53,9278.83


**Define the periods of time that the features have in the data**

For example, the features include SMA2, the simple moving average over 2 days, so 2 must appear in the periods list.

In [3]:
periods = [2,4,8,12,24,48,96,192]
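As a minimal sketch of how a periods list can drive feature generation, the snippet below builds one simple-moving-average column per period. The helper `add_sma_features` and the toy prices are hypothetical; only the `close_SMA<period>` naming follows the feature names that appear later in this notebook.

```python
import pandas as pd

periods = [2, 4, 8, 12, 24, 48, 96, 192]

def add_sma_features(close, periods):
    """Return one simple-moving-average column per period (hypothetical helper)."""
    features = {}
    for p in periods:
        # rolling(p).mean() is NaN for the first p - 1 rows, where the window is incomplete
        features[f"close_SMA{p}"] = close.rolling(window=p).mean()
    return pd.DataFrame(features)

# Toy prices in chronological order (oldest first), not real data
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
sma = add_sma_features(close, periods[:2])
```

Each period leaves `period - 1` leading rows without a value, which is one reason a NaN check is useful before training.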

**Label value counts**

In [4]:
df["BTC_close"].value_counts()

BTC_close
0.1       98
0.3       45
0.2       37
0.9       33
4.9       28
          ..
8497.3     1
8476.3     1
8661.2     1
8783.1     1
8264.4     1
Name: count, Length: 3702, dtype: int64

### **Data Preprocessing**

**Drop useless column**

**Drop NaN values, if any exist**

In [6]:
# Total number of NaN values across the whole DataFrame
df.isna().sum().sum()

0

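Alternatively, forward-fill missing values before dropping. Forward filling copies only earlier rows into later ones, which can avoid data leakage when the rows are in chronological order.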

In [7]:
forward_fill = False

if forward_fill:
    df.ffill(inplace=True)
else:
    df.dropna(inplace=True)
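The direction of the fill matters here, because the frame displayed above is ordered newest first. The sketch below uses a made-up toy frame (the values and the three dates are assumptions for illustration) to contrast forward filling after sorting oldest-first with forward filling in the displayed order.

```python
import numpy as np
import pandas as pd

# Toy frame sorted newest-first, like the displayed data (values are made up)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-03", "2023-01-02", "2023-01-01"]),
    "GOLD": [30.0, np.nan, 10.0],
})

# Sort oldest-first so ffill copies the previous day's value forward
filled = toy.sort_values("Date").ffill()

# Filling in the newest-first order copies the next day's value instead
leaky = toy.ffill()
```

In `filled`, the gap on 2023-01-02 takes the value from 2023-01-01, whereas in `leaky` it takes the value from 2023-01-03, which is information from the future.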

**Create train set, validation set and test set**

In [9]:
X_df = df.drop(columns=["BTC_close"])
y_df = df["BTC_close"]

In [10]:
X_df = X_df.drop(columns=["Date"])

In [11]:
X_train_df, X_test_df, y_train_df, y_test_df = train_test_split(X_df, y_df, test_size=0.3, shuffle=False)
X_val_df, X_test_df, y_val_df, y_test_df = train_test_split(X_test_df, y_test_df, test_size=0.5, shuffle=False)
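With `shuffle=False`, the split follows row order, so the first 70% of rows become the training set. Because the displayed frame runs from newest to oldest, that would place the most recent dates in training and the oldest in testing. The sketch below, on made-up toy data, shows one way to get a chronological split by sorting on `Date` first; whether the original notebook intended this is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, newest first like the displayed frame (values are made up)
dates = pd.date_range("2023-01-01", periods=20, freq="D")[::-1]
toy = pd.DataFrame({"Date": dates, "feature": np.arange(20.0), "label": np.arange(20) % 2})

# Sort oldest-first so that shuffle=False yields a chronological split
toy = toy.sort_values("Date").reset_index(drop=True)
X, y = toy.drop(columns=["label"]), toy["label"]

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, shuffle=False)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)
```

After sorting, every training date precedes every validation date, and every validation date precedes every test date.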

## **Model and training**

### Random Forest

**Training**

In [12]:
model_rf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, 
                                  min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', bootstrap=True, 
                                  oob_score=False, n_jobs=-1, random_state=1234, verbose=1)

In [13]:
model_rf.fit(X_train_df, y_train_df)

ValueError: Unknown label type: 'continuous'
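The error arises because `RandomForestClassifier` expects discrete class labels, while `BTC_close` is a continuous price. The evaluation cells that follow use `predict_proba(...)[:, 1]`, which implies a binary target. The notebook does not show how that target was built, so the sketch below is only one possible definition, assumed for illustration: 1 if the next day's close is higher than the current day's, else 0.

```python
import pandas as pd

# Toy closes in chronological order (oldest first); values are made up
toy = pd.DataFrame({"BTC_close": [10.0, 12.0, 11.0, 11.5]})

# Hypothetical label: 1 if the next day's close is higher than today's, else 0
next_close = toy["BTC_close"].shift(-1)
toy["label"] = (next_close > toy["BTC_close"]).astype(int)

# The last row has no next day, so its label is undefined and the row is dropped
toy = toy.iloc[:-1]
```

A label built this way would replace `BTC_close` as `y_df`, and the rows must be in chronological order for `shift(-1)` to point at the following day.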

**Evaluation**

In [12]:
print("AUC score in the train set:", roc_auc_score(y_train_df, model_rf.predict_proba(X_train_df)[:, 1]))

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s


AUC score in the train set: 1.0


[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.2s finished


In [13]:
print("AUC score in the validation set:", roc_auc_score(y_val_df, model_rf.predict_proba(X_val_df)[:, 1]))

AUC score in the validation set: 0.5383297974769913


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished


In [14]:
print("AUC score in the test set:", roc_auc_score(y_test_df, model_rf.predict_proba(X_test_df)[:, 1]))

AUC score in the test set: 0.5220341793638587


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished


In [15]:
print("Accuracy score in the train set:", accuracy_score(y_train_df, (model_rf.predict_proba(X_train_df)[:, 1] >= 0.5) * 1))

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.1s


Accuracy score in the train set: 1.0


[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.2s finished


In [16]:
print("Accuracy score in the validation set:", accuracy_score(y_val_df, (model_rf.predict_proba(X_val_df)[:, 1] >= 0.5) * 1))

Accuracy score in the validation set: 0.5280041616091555


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished


In [17]:
print("Accuracy score in the test set:", accuracy_score(y_test_df, (model_rf.predict_proba(X_test_df)[:, 1] >= 0.5) * 1))

Accuracy score in the test set: 0.5090152565880721


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished


**Threshold tuning**

In [18]:
pred_val = model_rf.predict_proba(X_val_df)[:, 1]
tmp_pred_val = pred_val.copy()

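# Candidate thresholds from 0.4000 to 0.5999 in steps of 0.0001
thresholds = [i / 10000 for i in range(4000, 6000)]
accuracies = []
for t in thresholds:
    pred_val = (tmp_pred_val >= t) * 1
    accuracies.append(accuracy_score(y_val_df, pred_val))

# Keep the threshold that achieves the best validation accuracy, not the accuracy value itself
threshold = thresholds[int(np.argmax(accuracies))]
print(max(accuracies))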

[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.1s finished


0.5280041616091555


In [19]:
pred_test = model_rf.predict_proba(X_test_df)[:,1]
pred_test = (pred_test >= threshold) * 1
print("Final testing accuracy:")
accuracy_score(y_test_df, pred_test)

Final testing accuracy:


[Parallel(n_jobs=16)]: Using backend ThreadingBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.0s
[Parallel(n_jobs=16)]: Done 100 out of 100 | elapsed:    0.0s finished


0.5176837725381415

**Feature importance**

In [20]:
importance_df = pd.DataFrame()
importance_df['feature'] = model_rf.feature_names_in_
importance_df['importance'] = model_rf.feature_importances_
importance_df.sort_values('importance', ascending=False).head(20)

Unnamed: 0,feature,importance
32,BOP,0.002056
71,close_EMA2,0.001961
144,close_ALMA4,0.00176
68,close_WMA2,0.001751
66,close_SMA2,0.001679
42,ForceIndex2,0.001456
33,UO,0.001452
72,close_DEMA2,0.001449
1438,log_volume_resid_divide_prev,0.001427
67,close_SMMA2,0.001422


**Note:**

The results vary between training runs. In practice, we train the model several times and choose the one that gives the best accuracy on the validation set.
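A minimal sketch of that selection loop is shown below. It uses synthetic data from `make_classification` as a stand-in, and the number of runs (5) and `n_estimators=50` are arbitrary assumptions. Note that with a fixed `random_state`, as in the model defined earlier, repeated fits give the same forest, so the seed is varied here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's own features and labels are not used here
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, shuffle=False)

best_model, best_acc = None, -1.0
for seed in range(5):  # hypothetical choice of 5 runs
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    # Keep the run with the highest validation accuracy so far
    if acc > best_acc:
        best_model, best_acc = model, acc
```

Because the winning run is chosen on the validation set, its validation accuracy is optimistic; the held-out test set remains the fairer estimate.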