<a id="100"></a>
**HOME**

**Main Idea:**

Binary classification in trading predicts whether the market will **move up** or **move down** within a specific timeframe (here, one 4-hour candle), using only OHLCV candle data (open, high, low, close, volume). Reducing the forecast to a two-class problem lets a standard machine learning classifier produce a simple up/down signal that can support trading decisions in volatile markets.


**References:**

* [Evaluating Machine Learning Classification for Financial Trading: An Empirical Approach](https://jfin-swufe.springeropen.com/articles/10.1186/s40854-020-00217-x)
* [Trading via Selective Classification](https://arxiv.org/pdf/2110.14914v1)
* [Forecasting and trading cryptocurrencies with machine learning under changing market conditions](https://jfin-swufe.springeropen.com/articles/10.1186/s40854-020-00217-x)

**Content:**

* [**Import Dataset**](#1)
* [**Data Preparation**](#2)
* [**Modeling and Evaluation**](#3)
* [**Modeling All Data**](#4)
* [**Today's Prediction**](#5)

> **Prev Green Candle: Close2Close**

____

<a id="1"></a>

**Import Dataset**

In [1]:
symbol='BTCUSDT'

In [2]:
from binance.client import Client
import os
import pandas as pd
import time

# Initialize the Binance client.
# Credentials are read from environment variables rather than hardcoded in the notebook.
# The variable names BINANCE_API_KEY / BINANCE_API_SECRET are this notebook's own choice.
api_key = os.environ.get("BINANCE_API_KEY")
api_secret = os.environ.get("BINANCE_API_SECRET")
client = Client(api_key, api_secret)

def fetch_ohlcv_batch(client, symbol, interval, start_time, limit=1000):
    """
    Fetch a batch of OHLCV data from Binance.
    """
    try:
        candles = client.get_klines(
            symbol=symbol,
            interval=interval,
            startTime=start_time,
            limit=limit
        )
        # Transform data into desired format
        ohlcv = [
            [int(c[0]), float(c[1]), float(c[2]), float(c[3]), float(c[4]), float(c[5])]
            for c in candles
        ]
        return ohlcv
    except Exception as e:
        print(f"Error fetching data: {e}")
        return None

def fetch_historical_ohlcv(client, symbol, interval, start_time, limit=1000):
    """
    Fetch historical OHLCV data in batches from Binance.
    """
    all_data = []
    while True:
        data = fetch_ohlcv_batch(client, symbol, interval, start_time, limit)
        if data:
            # Append data to all_data
            all_data.extend(data)
            # Update `start_time` to the timestamp of the last fetched data point + 1 millisecond
            start_time = data[-1][0] + 1
            print(f"Fetched {len(data)} data points. Total so far: {len(all_data)}")
        else:
            print("No more data to fetch or an error occurred.")
            break

        # If the batch size is less than the limit, it means we reached the end of available data
        if len(data) < limit:
            print("Reached the end of available data.")
            break

        # To avoid rate limit issues, wait for a short while
        time.sleep(1)

    # Convert data to DataFrame
    df = pd.DataFrame(all_data, columns=['timestamp', 'open', 'high', 'low', 'close', 'volume'])
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
    return df

# Usage example
if __name__ == "__main__":
    # Define parameters
    # `symbol` is set in the first cell ('BTCUSDT', written without '/')
    interval = Client.KLINE_INTERVAL_4HOUR  # Timeframe: 4-hour candles (other options: '1m', '5m', '1h', '1d', etc.)
    start_time = int(pd.Timestamp("2007-01-01").timestamp() * 1000)  # Start date in milliseconds; earlier than the first available candle, so the fetch begins at the earliest data
    limit = 1000              # Max data points per batch

    # Fetch historical data
    df = fetch_historical_ohlcv(client, symbol, interval, start_time, limit)
    print(f"Total fetched data points: {len(df)}")
    print(df.head())

Fetched 1000 data points. Total so far: 1000
Fetched 1000 data points. Total so far: 2000
Fetched 1000 data points. Total so far: 3000
Fetched 1000 data points. Total so far: 4000
Fetched 1000 data points. Total so far: 5000
Fetched 1000 data points. Total so far: 6000
Fetched 1000 data points. Total so far: 7000
Fetched 1000 data points. Total so far: 8000
Fetched 1000 data points. Total so far: 9000
Fetched 1000 data points. Total so far: 10000
Fetched 1000 data points. Total so far: 11000
Fetched 1000 data points. Total so far: 12000
Fetched 1000 data points. Total so far: 13000
Fetched 1000 data points. Total so far: 14000
Fetched 1000 data points. Total so far: 15000
Fetched 1000 data points. Total so far: 16000
Fetched 270 data points. Total so far: 16270
Reached the end of available data.
Total fetched data points: 16270
            timestamp     open     high      low    close      volume
0 2017-08-17 04:00:00  4261.48  4349.99  4261.32  4349.99   82.088865
1 2017-08-17 08:00:0
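The batching loop above can be exercised without network access. The following is a minimal sketch under stated assumptions: `paginate`, `fake_fetch`, and `FAKE_CANDLES` are hypothetical names introduced only for this illustration, and the fake fetcher merely imitates an endpoint that returns candles at or after a start time.

```python
def paginate(fetch_batch, start_time, limit):
    """Collect rows batch by batch until a short or empty batch is returned."""
    rows = []
    while True:
        batch = fetch_batch(start_time, limit)
        if not batch:
            break
        rows.extend(batch)
        # Resume one millisecond after the last open time already collected.
        start_time = batch[-1][0] + 1
        if len(batch) < limit:
            break
    return rows


# Hypothetical stand-in for the exchange: 25 candles, one every 4 hours.
STEP_MS = 4 * 60 * 60 * 1000
FAKE_CANDLES = [[i * STEP_MS, 100.0 + i] for i in range(25)]


def fake_fetch(start_time, limit):
    """Return up to `limit` fake candles whose open time is >= start_time."""
    return [c for c in FAKE_CANDLES if c[0] >= start_time][:limit]


rows = paginate(fake_fetch, start_time=0, limit=10)
print(len(rows))
```

Advancing `start_time` by one millisecond past the last open time is what keeps consecutive batches from overlapping, so no candle is collected twice.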

In [3]:
df.head()

Unnamed: 0,timestamp,open,high,low,close,volume
0,2017-08-17 04:00:00,4261.48,4349.99,4261.32,4349.99,82.088865
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.3,63.619882
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913


In [4]:
df.tail()

Unnamed: 0,timestamp,open,high,low,close,volume
16265,2025-01-20 16:00:00,104957.99,107050.0,100333.0,103691.63,25962.9026
16266,2025-01-20 20:00:00,103691.64,104331.0,101701.01,102260.01,5700.34025
16267,2025-01-21 00:00:00,102260.0,103260.1,100119.04,102936.01,10133.69299
16268,2025-01-21 04:00:00,102936.01,102974.14,101111.68,101989.05,5400.15858
16269,2025-01-21 08:00:00,101989.06,102894.29,101740.82,102475.94,1966.36797


<a id="id"></a>
[**Back to HOME**](#100)

<a id="2"></a>

**Data Preparation**

In [5]:
# Drop the last row: the most recent candle may still be open (incomplete) at fetch time
df = df.iloc[:-1]

In [6]:
df.tail()

Unnamed: 0,timestamp,open,high,low,close,volume
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.0,104957.99,13711.24313
16265,2025-01-20 16:00:00,104957.99,107050.0,100333.0,103691.63,25962.9026
16266,2025-01-20 20:00:00,103691.64,104331.0,101701.01,102260.01,5700.34025
16267,2025-01-21 00:00:00,102260.0,103260.1,100119.04,102936.01,10133.69299
16268,2025-01-21 04:00:00,102936.01,102974.14,101111.68,101989.05,5400.15858


In [7]:
df.columns

Index(['timestamp', 'open', 'high', 'low', 'close', 'volume'], dtype='object')

In [8]:
df_close2close=df.copy()

In [9]:
df_close2close['prev_close'] = df['close'].shift(1)

In [10]:
df_close2close

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
0,2017-08-17 04:00:00,4261.48,4349.99,4261.32,4349.99,82.088865,
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882,4349.99
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001,4427.30
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716,4352.34
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913,4325.23
...,...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130,108239.19
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600,104957.99
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250,103691.63
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990,102260.01


In [11]:
# Drop rows with any NaN values
df_close2close.dropna(inplace=True)

In [12]:
df_close2close

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882,4349.99
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001,4427.30
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716,4352.34
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913,4325.23
5,2017-08-18 00:00:00,4285.08,4340.62,4134.61,4292.39,276.193043,4285.08
...,...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130,108239.19
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600,104957.99
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250,103691.63
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990,102260.01


In [13]:
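# Create the 'down_close2close' label: 1 if this candle's close is lower than the previous candle's close, else 0
# Note: 'close' and 'prev_close' both stay in the feature set, so this label is a direct function of the features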
df_close2close['down_close2close'] = (df_close2close['close'] < df_close2close['prev_close']).astype(int)
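Because the label is derived from `close` and `prev_close`, and both columns remain among the features, the classifier is asked to reproduce a comparison it can already see. One possible alternative, sketched here as an assumption rather than as this notebook's method, is a forward-looking label that marks whether the next candle closes lower. The column names `next_close` and `down_next` and the toy prices are hypothetical.

```python
import pandas as pd

# Toy closes standing in for df['close']; the values are illustrative only.
toy = pd.DataFrame({"close": [100.0, 102.0, 101.0, 103.0, 99.0]})

# Feature that is known once the current candle has closed.
toy["prev_close"] = toy["close"].shift(1)

# Hypothetical forward-looking label: 1 if the NEXT candle closes lower.
toy["next_close"] = toy["close"].shift(-1)
labeled = toy.dropna().copy()
labeled["down_next"] = (labeled["next_close"] < labeled["close"]).astype(int)

# next_close is excluded from the features: it is only known in the future.
X_toy = labeled[["close", "prev_close"]]
y_toy = labeled["down_next"]
print(y_toy.tolist())
```

With a label like this, the first and last rows are dropped because they lack a previous or a next close, and the most recent closed candle becomes the row to predict on.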

In [14]:
df_close2close.columns

Index(['timestamp', 'open', 'high', 'low', 'close', 'volume', 'prev_close',
       'down_close2close'],
      dtype='object')

In [15]:
df_close2close.tail()

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close,down_close2close
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.0,104957.99,13711.24313,108239.19,1
16265,2025-01-20 16:00:00,104957.99,107050.0,100333.0,103691.63,25962.9026,104957.99,1
16266,2025-01-20 20:00:00,103691.64,104331.0,101701.01,102260.01,5700.34025,103691.63,1
16267,2025-01-21 00:00:00,102260.0,103260.1,100119.04,102936.01,10133.69299,102260.01,0
16268,2025-01-21 04:00:00,102936.01,102974.14,101111.68,101989.05,5400.15858,102936.01,1


In [16]:
# Drop the timestamp column, which is not used as a feature
df_close2close_select = df_close2close.drop(['timestamp'], axis=1)

In [17]:
df_close2close_select.tail()

Unnamed: 0,open,high,low,close,volume,prev_close,down_close2close
16264,108239.19,108700.01,104480.0,104957.99,13711.24313,108239.19,1
16265,104957.99,107050.0,100333.0,103691.63,25962.9026,104957.99,1
16266,103691.64,104331.0,101701.01,102260.01,5700.34025,103691.63,1
16267,102260.0,103260.1,100119.04,102936.01,10133.69299,102260.01,0
16268,102936.01,102974.14,101111.68,101989.05,5400.15858,102936.01,1


In [18]:
# Compute the share of each class (1 and 0) as a percentage
value_counts = df_close2close_select['down_close2close'].value_counts(normalize=True) * 100

# Display the percentages
print(f"Percentage of 1: {value_counts.get(1, 0):.2f}%")
print(f"Percentage of 0: {value_counts.get(0, 0):.2f}%")

Percentage of 1: 48.38%
Percentage of 0: 51.62%
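These class shares give a reference point for every accuracy figure in the notebook. As a small worked example (the variable names are introduced here for illustration), a model that always predicts the more frequent class would reach the following accuracy:

```python
# Class shares printed above, in percent (1 = down candle, 0 = otherwise).
share_down = 48.38
share_not_down = 51.62

# A model that always predicts the more frequent class scores this accuracy.
baseline_accuracy = max(share_down, share_not_down) / 100
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any reported accuracy is best read relative to this baseline rather than relative to 50%.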


In [19]:
# Separate features and target
X = df_close2close_select.drop('down_close2close', axis=1)  # All remaining columns are features
y = df_close2close_select['down_close2close']

In [20]:
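# Split the data into training (70%), validation (15%), and test (15%) sets
# Note: train_test_split shuffles rows by default, so the three sets interleave in time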
from sklearn.model_selection import train_test_split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
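For time-ordered candles, a shuffled split places future rows in the training set and past rows in the test set. A chronological split is one common alternative. The sketch below assumes rows are already sorted by time; `chronological_split` and the toy data are hypothetical and not part of this notebook's pipeline.

```python
import pandas as pd


def chronological_split(X, y, train_pct=70, val_pct=15):
    """Split row-ordered data into train/validation/test without shuffling."""
    n = len(X)
    # Integer arithmetic keeps the cut points free of float rounding.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return (
        (X.iloc[:train_end], y.iloc[:train_end]),
        (X.iloc[train_end:val_end], y.iloc[train_end:val_end]),
        (X.iloc[val_end:], y.iloc[val_end:]),
    )


# Toy, time-ordered data standing in for X and y.
X_demo = pd.DataFrame({"close": range(20)})
y_demo = pd.Series([i % 2 for i in range(20)])

(train_X, train_y), (val_X, val_y), (test_X, test_y) = chronological_split(X_demo, y_demo)
print(len(train_X), len(val_X), len(test_X))
```

Every validation row then comes after every training row, and every test row after every validation row, which mirrors how the model would be used on new candles.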

In [21]:
# # Handle class imbalance using SMOTE
# from imblearn.over_sampling import SMOTE
# smote = SMOTE(random_state=42)
# X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
X_train_res=X_train
y_train_res=y_train

<a id="id"></a>
[**Back to HOME**](#100)

<a id="3"></a>

**Modeling and Evaluation**

In [22]:
# Parameter grid for GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0]
}

In [23]:
# Import the XGBoost classifier
from xgboost import XGBClassifier
# Note: recent XGBoost releases ignore `use_label_encoder`, hence the "not used" warnings during training
model_xgb = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')

In [24]:
# GridSearchCV for best parameters
from sklearn.model_selection import GridSearchCV
grid_cv = GridSearchCV(estimator=model_xgb, param_grid=param_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)
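With `cv=5`, the grid search uses ordinary 5-fold cross-validation. If the training rows were kept in time order, scikit-learn's `TimeSeriesSplit` could serve as the splitter instead, so that each fold validates only on rows that come after its training rows. The toy array below is an assumption for illustration; it is not this notebook's data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered toy samples standing in for the training rows.
X_ordered = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X_ordered))

for train_idx, test_idx in folds:
    # In each fold, every training index precedes every test index.
    print(train_idx.tolist(), test_idx.tolist())
```

A splitter object such as `tscv` can be passed to `GridSearchCV` through its `cv` argument in place of an integer. This only helps when the rows have not been shuffled beforehand.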

In [25]:
# Train the model
grid_cv.fit(X_train_res, y_train_res)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


Parameters: { "use_label_encoder" } are not used.


In [26]:
# Best parameters
print("Best Parameters:", grid_cv.best_params_)
print("Best Cross-Validation Accuracy:", grid_cv.best_score_)

Best Parameters: {'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 200, 'subsample': 1.0}
Best Cross-Validation Accuracy: 0.7594626264168578


In [27]:
# Evaluate the model on the validation set
best_model = grid_cv.best_estimator_
y_val_pred = best_model.predict(X_val)
y_val_pred_proba = best_model.predict_proba(X_val)[:, 1] 

In [28]:
# Evaluate the same best model on the test set
y_test_pred = best_model.predict(X_test)
y_test_pred_proba = best_model.predict_proba(X_test)[:, 1] 

In [29]:
# Metrics evaluation on the validation set
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

accuracy_val = accuracy_score(y_val, y_val_pred)
precision_val = precision_score(y_val, y_val_pred)
recall_val = recall_score(y_val, y_val_pred)
f1_val = f1_score(y_val, y_val_pred)
# F-beta score with beta = 2, which emphasizes recall over precision
f2_val = (1 + 2**2) * (precision_val * recall_val) / ((2**2 * precision_val) + recall_val)
roc_auc_val = roc_auc_score(y_val, y_val_pred_proba)

In [30]:
# Metrics evaluation on the test set (metric functions were imported in the previous cell)

accuracy_test = accuracy_score(y_test, y_test_pred)
precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)
f2_test = (1 + 2**2) * (precision_test * recall_test) / ((2**2 * precision_test) + recall_test)
roc_auc_test = roc_auc_score(y_test, y_test_pred_proba)

In [31]:
print("\nValidation Evaluation Metrics:")
print(f"Accuracy: {accuracy_val:.4f}")
print(f"Precision: {precision_val:.4f}")
print(f"Recall: {recall_val:.4f}")
print(f"F1 Score: {f1_val:.4f}")
print(f"F2 Score: {f2_val:.4f}")
print(f"ROC AUC: {roc_auc_val:.4f}")


print("\nTest Evaluation Metrics:")
print(f"Accuracy: {accuracy_test:.4f}")
print(f"Precision: {precision_test:.4f}")
print(f"Recall: {recall_test:.4f}")
print(f"F1 Score: {f1_test:.4f}")
print(f"F2 Score: {f2_test:.4f}")
print(f"ROC AUC: {roc_auc_test:.4f}")


Validation Evaluation Metrics:
Accuracy: 0.7885
Precision: 0.7818
Recall: 0.7627
F1 Score: 0.7721
F2 Score: 0.7664
ROC AUC: 0.8840

Test Evaluation Metrics:
Accuracy: 0.7481
Precision: 0.7447
Recall: 0.7340
F1 Score: 0.7393
F2 Score: 0.7361
ROC AUC: 0.8518
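The F2 scores above are computed by hand from precision and recall. The same formula can be written as a small helper for any beta. `f_beta` is a hypothetical name used only in this sketch, and the inputs are the rounded validation precision and recall printed above.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score from precision and recall; beta > 1 emphasizes recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as zero.
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


# Validation precision and recall as printed above (rounded to 4 decimals).
precision_demo, recall_demo = 0.7818, 0.7627

print(f"F1: {f_beta(precision_demo, recall_demo, beta=1.0):.4f}")
print(f"F2: {f_beta(precision_demo, recall_demo, beta=2.0):.4f}")
```

scikit-learn also provides `fbeta_score`, which computes the same quantity directly from true and predicted labels.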


<a id="id"></a>
[**Back to HOME**](#100)

<a id="4"></a>

**Modeling All Data**

In [32]:
symbol = 'BTCUSDT'

In [33]:
# Reuse `client`, `fetch_historical_ohlcv`, and the imports defined in the Import Dataset section
interval = Client.KLINE_INTERVAL_4HOUR  # Timeframe ('1m', '5m', '1h', '1d', etc.)
start_time = int(pd.Timestamp("2010-07-17").timestamp() * 1000)  # Start date in milliseconds
limit = 1000              # Max data points per batch

# Fetch historical data
df_all = fetch_historical_ohlcv(client, symbol, interval, start_time, limit)
print(f"Total fetched data points: {len(df_all)}")
print(df_all.head())

Fetched 1000 data points. Total so far: 1000
Fetched 1000 data points. Total so far: 2000
Fetched 1000 data points. Total so far: 3000
Fetched 1000 data points. Total so far: 4000
Fetched 1000 data points. Total so far: 5000
Fetched 1000 data points. Total so far: 6000
Fetched 1000 data points. Total so far: 7000
Fetched 1000 data points. Total so far: 8000
Fetched 1000 data points. Total so far: 9000
Fetched 1000 data points. Total so far: 10000
Fetched 1000 data points. Total so far: 11000
Fetched 1000 data points. Total so far: 12000
Fetched 1000 data points. Total so far: 13000
Fetched 1000 data points. Total so far: 14000
Fetched 1000 data points. Total so far: 15000
Fetched 1000 data points. Total so far: 16000
Fetched 270 data points. Total so far: 16270
Reached the end of available data.
Total fetched data points: 16270
            timestamp     open     high      low    close      volume
0 2017-08-17 04:00:00  4261.48  4349.99  4261.32  4349.99   82.088865
1 2017-08-17 08:00:0

In [34]:
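# Drop the last row: the most recent candle may still be open (incomplete) at fetch time
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
df_all = df_all.iloc[:-1].copy()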

In [35]:
df_all

Unnamed: 0,timestamp,open,high,low,close,volume
0,2017-08-17 04:00:00,4261.48,4349.99,4261.32,4349.99,82.088865
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913
...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990


In [36]:
# Previous candle's close (close shifted down by one row)
df_all['prev_close'] = df_all['close'].shift(1)

In [37]:
df_all

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
0,2017-08-17 04:00:00,4261.48,4349.99,4261.32,4349.99,82.088865,
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882,4349.99
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001,4427.30
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716,4352.34
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913,4325.23
...,...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130,108239.19
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600,104957.99
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250,103691.63
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990,102260.01


In [38]:
# Drop rows with any NaN values
df_all.dropna(inplace=True)

In [39]:
df_all

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882,4349.99
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001,4427.30
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716,4352.34
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913,4325.23
5,2017-08-18 00:00:00,4285.08,4340.62,4134.61,4292.39,276.193043,4285.08
...,...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130,108239.19
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600,104957.99
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250,103691.63
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990,102260.01


In [40]:
# Create the 'down_close2close' label: 1 if this candle's close is lower than the previous candle's close, else 0
df_all['down_close2close'] = (df_all['close'] < df_all['prev_close']).astype(int) 

In [41]:
# Drop the timestamp column, which is not used as a feature
df_all_select = df_all.drop(['timestamp'], axis=1)

In [42]:
# Separate features and target
X_all = df_all_select.drop('down_close2close', axis=1)  # All remaining columns are features
y_all = df_all_select['down_close2close']

In [43]:
# # Handle class imbalance using SMOTE
# from imblearn.over_sampling import SMOTE
# smote = SMOTE(random_state=42)
# X_train_res_all, y_train_res_all = smote.fit_resample(X_all, y_all)

X_train_res_all=X_all
y_train_res_all= y_all

In [44]:
# Parameter grid for GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1.0]
}

In [45]:
# Import the XGBoost classifier
from xgboost import XGBClassifier
model_xgb_all = XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')

In [46]:
# GridSearchCV for best parameters
from sklearn.model_selection import GridSearchCV
grid_cv_all = GridSearchCV(estimator=model_xgb_all, param_grid=param_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)

In [47]:
# Train the model
grid_cv_all.fit(X_train_res_all, y_train_res_all)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


Parameters: { "use_label_encoder" } are not used.


In [48]:
best_model_all = grid_cv_all.best_estimator_

<a id="id"></a>
[**Back to HOME**](#100)

<a id="5"></a>

**Today's Prediction**

In [49]:
symbol='BTCUSDT'

In [50]:
# Reuse `client`, `fetch_historical_ohlcv`, and the imports defined in the Import Dataset section
interval = Client.KLINE_INTERVAL_4HOUR  # Timeframe ('1m', '5m', '1h', '1d', etc.)
start_time = int(pd.Timestamp("2010-07-17").timestamp() * 1000)  # Start date in milliseconds
limit = 1000              # Max data points per batch

# Fetch historical data
df_today = fetch_historical_ohlcv(client, symbol, interval, start_time, limit)
print(f"Total fetched data points: {len(df_today)}")
print(df_today.head())

Fetched 1000 data points. Total so far: 1000
Fetched 1000 data points. Total so far: 2000
Fetched 1000 data points. Total so far: 3000
Fetched 1000 data points. Total so far: 4000
Fetched 1000 data points. Total so far: 5000
Fetched 1000 data points. Total so far: 6000
Fetched 1000 data points. Total so far: 7000
Fetched 1000 data points. Total so far: 8000
Fetched 1000 data points. Total so far: 9000
Fetched 1000 data points. Total so far: 10000
Fetched 1000 data points. Total so far: 11000
Fetched 1000 data points. Total so far: 12000
Fetched 1000 data points. Total so far: 13000
Fetched 1000 data points. Total so far: 14000
Fetched 1000 data points. Total so far: 15000
Fetched 1000 data points. Total so far: 16000
Fetched 270 data points. Total so far: 16270
Reached the end of available data.
Total fetched data points: 16270
            timestamp     open     high      low    close      volume
0 2017-08-17 04:00:00  4261.48  4349.99  4261.32  4349.99   82.088865
1 2017-08-17 08:00:0

In [51]:
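# Drop the last row: the most recent candle may still be open (incomplete) at fetch time
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below
df_today = df_today.iloc[:-1].copy()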

In [52]:
df_today['prev_close'] = df_today['close'].shift(1)

In [53]:
df_today

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
0,2017-08-17 04:00:00,4261.48,4349.99,4261.32,4349.99,82.088865,
1,2017-08-17 08:00:00,4333.32,4485.39,4333.32,4427.30,63.619882,4349.99
2,2017-08-17 12:00:00,4436.06,4485.39,4333.42,4352.34,174.562001,4427.30
3,2017-08-17 16:00:00,4352.33,4354.84,4200.74,4325.23,225.109716,4352.34
4,2017-08-17 20:00:00,4307.56,4369.69,4258.56,4285.08,249.769913,4325.23
...,...,...,...,...,...,...,...
16264,2025-01-20 12:00:00,108239.19,108700.01,104480.00,104957.99,13711.243130,108239.19
16265,2025-01-20 16:00:00,104957.99,107050.00,100333.00,103691.63,25962.902600,104957.99
16266,2025-01-20 20:00:00,103691.64,104331.00,101701.01,102260.01,5700.340250,103691.63
16267,2025-01-21 00:00:00,102260.00,103260.10,100119.04,102936.01,10133.692990,102260.01


In [54]:
df_today_test= df_today.tail(1)

In [55]:
df_today_test

Unnamed: 0,timestamp,open,high,low,close,volume,prev_close
16268,2025-01-21 04:00:00,102936.01,102974.14,101111.68,101989.05,5400.15858,102936.01


In [56]:
# Delete column
df_today_test_ready = df_today_test.drop(columns=['timestamp'])

In [57]:
df_today_test_ready

Unnamed: 0,open,high,low,close,volume,prev_close
16268,102936.01,102974.14,101111.68,101989.05,5400.15858,102936.01


In [58]:
# Predict with the model fitted on the training split only (best_model)
y_today_pred = best_model.predict(df_today_test_ready)
y_today_pred_proba = best_model.predict_proba(df_today_test_ready)[:, 1]

In [59]:
y_today_pred

array([1])

In [60]:
y_today_pred_proba

array([0.8319061], dtype=float32)

In [61]:
# Predict with the model refit on all data (best_model_all)
y_today_pred_all = best_model_all.predict(df_today_test_ready)
y_today_pred_proba_all = best_model_all.predict_proba(df_today_test_ready)[:, 1]

In [62]:
y_today_pred_all

array([1])

In [63]:
y_today_pred_proba_all

array([0.7391174], dtype=float32)
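The predicted probabilities can also drive a decision rule that abstains when the model is not confident, loosely in the spirit of the selective-classification reference listed at the top. The sketch below is an assumption, not a method taken from that paper: `selective_decision` is a hypothetical helper and the 0.60 threshold is an arbitrary illustrative value.

```python
def selective_decision(prob_down, threshold=0.60):
    """Map a predicted probability of a down candle to an action.

    Returns "down", "up", or "abstain" when neither class is confident enough.
    """
    if not 0.5 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0.5, 1.0]")
    if prob_down >= threshold:
        return "down"
    if (1.0 - prob_down) >= threshold:
        return "up"
    return "abstain"


# The two probabilities printed above, plus two made-up values for contrast.
for p in (0.8319061, 0.7391174, 0.55, 0.30):
    print(p, selective_decision(p))
```

Raising the threshold produces fewer signals but restricts them to the cases where the model is most confident. The right value would need to be chosen on validation data.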

<a id="id"></a>
[**Back to HOME**](#100)