In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Calculate Target Variable 

In [2]:
btc_price_data_1_year = pd.read_csv("data/bitcoin_historical_data_1_year.csv")
btc_price_data_1_year

Unnamed: 0,timestamp,open,high,low,close,volume,date,time
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953000,2023-11-01,00:00:00
1,2023-11-01 00:01:00,34642.54,34687.53,34673.30,34642.82,16.178075,2023-11-01,00:01:00
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.753120,2023-11-01,00:02:00
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.308610,2023-11-01,00:03:00
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00
...,...,...,...,...,...,...,...,...
528627,2024-10-31 23:56:00,70238.76,70248.97,70248.97,70238.76,1.189134,2024-10-31,23:56:00
528628,2024-10-31 23:57:00,70218.00,70250.00,70238.77,70233.24,4.767082,2024-10-31,23:57:00
528629,2024-10-31 23:58:00,70193.97,70242.25,70232.55,70207.78,9.589688,2024-10-31,23:58:00
528630,2024-10-31 23:59:00,70175.16,70207.79,70207.79,70197.83,7.112237,2024-10-31,23:59:00


## 1. Calculate Target Variable - Function

In [3]:
def create_target_variable(data, threshold=0.01):
    """
    Computes the 'target' variable from the input 'data' and 'threshold'.
    Returns a copy of 'data' with 'future_return' and 'target' columns added;
    the input DataFrame is not modified.

    Parameters
    ----------
    data: a DataFrame with the time series data. There must be a column named 'close'!
          This column is used to calculate the 'target' variable.

    threshold: threshold for the price change to classify as 'buy' or 'sell', expressed in percent
               because 'future_return' is multiplied by 100. For instance, 0.01 makes any increase
               above 0.01% a 'buy' signal; for a 1% increase, use 1.0. Adjust this threshold as per your strategy.

    Returns
    -------
    A new DataFrame without the rows that contain NaN values.
    """
    # Work on a deep copy so the caller's DataFrame stays unchanged
    data_copy = data.copy(deep=True)

    # Compute the percentage change between the current close price and the close price in the next period.
    # The result is in percent (0.5 means 0.5%).
    data_copy['future_return'] = ((data_copy['close'].shift(-1) - data_copy['close']) / data_copy['close']) * 100

    # Define the target as 1.0 (buy) if the future return is above the threshold, and 0.0 (sell) if it is below or equal to the threshold.
    data_copy['target'] = (data_copy['future_return'] > threshold).astype(float)

    # The last row has a NaN 'future_return' due to the shift operation (it was labelled 0.0 above, since NaN > threshold is False).
    # dropna() removes it, along with any other row that has a NaN in any column.
    data_copy = data_copy.dropna()

    # Check the balance of 1s and 0s in the target variable to see how many 'buy' and 'sell' signals there are.
    print(data_copy['target'].value_counts())

    return data_copy

In [4]:
# Define the threshold for the price change to classify as 'buy' or 'sell'. 'future_return' is in percent,
# so 0.01 marks any increase above 0.01% as a 'buy' signal (a 1% increase would be threshold = 1.0).
threshold = 0.01

# Compute the 'target' variable
btc_price_data_1_year_target = create_target_variable(btc_price_data_1_year, threshold)

target
0.0    317152
1.0    211479
Name: count, dtype: int64


In [5]:
btc_price_data_1_year.head()

Unnamed: 0,timestamp,open,high,low,close,volume,date,time
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953,2023-11-01,00:00:00
1,2023-11-01 00:01:00,34642.54,34687.53,34673.3,34642.82,16.178075,2023-11-01,00:01:00
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.75312,2023-11-01,00:02:00
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.30861,2023-11-01,00:03:00
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00


In [6]:
btc_price_data_1_year_target.head()

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953,2023-11-01,00:00:00,-0.072286,0.0
1,2023-11-01 00:01:00,34642.54,34687.53,34673.3,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1.0
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.75312,2023-11-01,00:02:00,-0.078542,0.0
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.30861,2023-11-01,00:03:00,-0.020416,0.0
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404,0.0


## 2. Calculate Target Variable - Explanation

To create a binary target variable for price prediction:

1. **Choose the Prediction Horizon**:  
   Decide the period over which you want to predict price changes, for example the next hour or the next day. For simplicity, this notebook uses the next row as the future period; the rows shown above are one minute apart, so that means the next minute.

2. **Calculate Future Returns**:  
   Compute the percentage change between the current close price and the close price in the next period. This will help define whether there’s a significant increase or decrease. Multiplying by 100 expresses the result in percent, matching the notebook cells.

   ```python
   df['future_return'] = (df['close'].shift(-1) - df['close']) / df['close'] * 100
   ```

3. **Define the Threshold for Significant Price Change**:  
   Set a threshold for the price change to classify as “buy” or “sell.” The threshold uses the same unit as `future_return`, so with returns in percent, 0.01 means an increase of more than 0.01% is a “buy” signal; a 1% increase would need a threshold of 1.0. Adjust this threshold as per your strategy.

   ```python
   threshold = 0.01
   ```

4. **Create the Target Variable**:  
   Define the target as 1 (buy) if the future return is above the threshold, and 0 (sell) if it is below or equal to the threshold.

   ```python
   df['target'] = (df['future_return'] > threshold).astype(int)
   ```

5. **Remove Any NaN Values**:  
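   The last row in your dataset will have a NaN value for `future_return` due to the shift operation, and the comparison in step 4 labels it 0 because `NaN > threshold` is `False`. Drop this row to clean up the dataset. Note that `dropna()` with no arguments drops every row that has a NaN in any column.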

   ```python
   df = df.dropna()
   ```

6. **Verify the Target Distribution**:  
   Finally, it’s helpful to check the balance of 1s and 0s in your target variable to understand how many “buy” and “sell” signals you have.

   ```python
   print(df['target'].value_counts())
   ```

After this, you’ll have a `target` column in your dataset, which you can use as the target variable for your XGBoost model.
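The six steps can be tried end to end on a small series before running them on the full dataset. The sketch below is illustrative only: the `toy` DataFrame and its prices are invented, and returns are multiplied by 100 as in `create_target_variable`, so the threshold of 0.01 means 0.01%.

```python
import pandas as pd

# Invented prices for illustration only; not taken from the BTC dataset.
toy = pd.DataFrame({"close": [100.0, 100.5, 100.5, 99.0, 99.005]})

# Step 2: return from each row to the next row, in percent.
toy["future_return"] = (toy["close"].shift(-1) - toy["close"]) / toy["close"] * 100

# Step 3: threshold in the same unit as future_return (percent).
threshold = 0.01

# Step 4: 1 (buy) if the return exceeds the threshold, otherwise 0 (sell).
toy["target"] = (toy["future_return"] > threshold).astype(int)

# Step 5: drop the last row, whose future_return is NaN.
toy = toy.dropna(subset=["future_return"])

# Step 6: inspect the class balance.
print(toy["target"].value_counts())
```

Only the first row is labelled 1 here. The move from 99.0 to 99.005 is positive but about 0.005%, which stays below the 0.01% threshold and is therefore labelled 0.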

### 1. Calculate Future Returns:  
   **Choose the Prediction Horizon**: Decide the period over which you want to predict price changes, for example the next hour or the next day. Here the next row in the dataset is used as the future period for simplicity.

   Compute the **percentage change** between the current close price and the close price in the next period. The result is multiplied by 100, so `future_return` is expressed in percent. This will help define whether there’s a significant increase or decrease.

In [7]:
btc_price_data_1_year['future_return'] = ((btc_price_data_1_year['close'].shift(-1) - btc_price_data_1_year['close']) / btc_price_data_1_year['close']) * 100

In [8]:
btc_price_data_1_year

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953000,2023-11-01,00:00:00,-0.072286
1,2023-11-01 00:01:00,34642.54,34687.53,34673.30,34642.82,16.178075,2023-11-01,00:01:00,0.039662
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.753120,2023-11-01,00:02:00,-0.078542
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.308610,2023-11-01,00:03:00,-0.020416
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404
...,...,...,...,...,...,...,...,...,...
528627,2024-10-31 23:56:00,70238.76,70248.97,70248.97,70238.76,1.189134,2024-10-31,23:56:00,-0.007859
528628,2024-10-31 23:57:00,70218.00,70250.00,70238.77,70233.24,4.767082,2024-10-31,23:57:00,-0.036251
528629,2024-10-31 23:58:00,70193.97,70242.25,70232.55,70207.78,9.589688,2024-10-31,23:58:00,-0.014172
528630,2024-10-31 23:59:00,70175.16,70207.79,70207.79,70197.83,7.112237,2024-10-31,23:59:00,0.017223
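
The cell above fixes the prediction horizon at one row. A longer horizon can be sketched by shifting further ahead. The helper `future_return_pct` and its `horizon` parameter are hypothetical names introduced here, not part of the notebook, and the sketch assumes the rows are evenly spaced in time, which it does not check.

```python
import pandas as pd

def future_return_pct(close: pd.Series, horizon: int = 1) -> pd.Series:
    """Percent change from each row's close to the close `horizon` rows later."""
    if horizon < 1:
        raise ValueError("horizon must be a positive number of rows")
    future_close = close.shift(-horizon)
    return (future_close - close) / close * 100

# Invented prices for illustration only.
close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])

one_step = future_return_pct(close, horizon=1)    # same formula as the cell above
three_step = future_return_pct(close, horizon=3)  # looks three rows ahead
print(three_step)
```

With a horizon of `h` rows, the last `h` rows have no future price and come out as NaN, so a longer horizon removes more rows in the later `dropna()` step.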


### 2. Define the Threshold for Significant Price Change:  
   Set a threshold for the price change to classify as “buy” or “sell.” Because `future_return` is expressed in percent, a threshold of 0.01 marks any increase above 0.01% as a “buy” signal; a 1% increase would need a threshold of 1.0. Adjust this threshold as per your strategy.


In [9]:
# future_return is in percent, so 0.01 means an increase above 0.01% is a "buy" signal
threshold = 0.01
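
To make the unit of the threshold concrete, the arithmetic for the first two rows can be redone by hand. The close prices are copied from the table above, and the results should agree with the `future_return` values shown for rows 0 and 1 up to rounding. The helper `fraction_to_percent` is a hypothetical convenience, not part of the notebook.

```python
# First three close prices, copied from the table above.
close_0, close_1, close_2 = 34667.88, 34642.82, 34656.56

# Same formula as the notebook cell: next-row return in percent.
ret_0 = (close_1 - close_0) / close_0 * 100
ret_1 = (close_2 - close_1) / close_1 * 100

threshold = 0.01  # percent, i.e. 0.01%

print(round(ret_0, 6), ret_0 > threshold)  # negative return: not a buy
print(round(ret_1, 6), ret_1 > threshold)  # small positive return above 0.01%: a buy

def fraction_to_percent(fraction: float) -> float:
    """Convert a fractional threshold (0.01 for 1%) to the percent unit used here."""
    return fraction * 100

one_percent_threshold = fraction_to_percent(0.01)
print(ret_1 > one_percent_threshold)  # the same move is far below a 1% threshold
```

The second return is roughly 0.04%, which clears a 0.01% threshold but would not clear a 1% threshold.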

### 3. Create the Target Variable:  
   Define the target as 1 (buy) if the future return is above the threshold, and 0 (sell) if it is below or equal to the threshold.


In [10]:
btc_price_data_1_year['target'] = (btc_price_data_1_year['future_return'] > threshold).astype(int)
btc_price_data_1_year

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953000,2023-11-01,00:00:00,-0.072286,0
1,2023-11-01 00:01:00,34642.54,34687.53,34673.30,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.753120,2023-11-01,00:02:00,-0.078542,0
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.308610,2023-11-01,00:03:00,-0.020416,0
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404,0
...,...,...,...,...,...,...,...,...,...,...
528627,2024-10-31 23:56:00,70238.76,70248.97,70248.97,70238.76,1.189134,2024-10-31,23:56:00,-0.007859,0
528628,2024-10-31 23:57:00,70218.00,70250.00,70238.77,70233.24,4.767082,2024-10-31,23:57:00,-0.036251,0
528629,2024-10-31 23:58:00,70193.97,70242.25,70232.55,70207.78,9.589688,2024-10-31,23:58:00,-0.014172,0
528630,2024-10-31 23:59:00,70175.16,70207.79,70207.79,70197.83,7.112237,2024-10-31,23:59:00,0.017223,1


In [11]:
btc_price_data_1_year[btc_price_data_1_year.target == 1].head(10)

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
1,2023-11-01 00:01:00,34642.54,34687.53,34673.3,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1
5,2023-11-01 00:05:00,34601.12,34629.21,34620.51,34606.55,4.76513,2023-11-01,00:05:00,0.046725,1
6,2023-11-01 00:06:00,34601.14,34630.82,34606.55,34622.72,3.357925,2023-11-01,00:06:00,0.124398,1
11,2023-11-01 00:11:00,34590.21,34614.89,34605.0,34602.26,3.475336,2023-11-01,00:11:00,0.060227,1
15,2023-11-01 00:15:00,34598.81,34627.47,34608.5,34609.07,5.081769,2023-11-01,00:15:00,0.055679,1
18,2023-11-01 00:18:00,34601.29,34626.58,34620.51,34618.08,8.066458,2023-11-01,00:18:00,0.039228,1
19,2023-11-01 00:19:00,34612.24,34634.46,34618.78,34631.66,2.974894,2023-11-01,00:19:00,0.038549,1
23,2023-11-01 00:23:00,34543.03,34568.26,34567.04,34550.22,2.9565,2023-11-01,00:23:00,0.045933,1
26,2023-11-01 00:26:00,34528.91,34557.07,34553.5,34533.88,5.26977,2023-11-01,00:26:00,0.042654,1
27,2023-11-01 00:27:00,34517.59,34549.46,34536.93,34548.61,6.354277,2023-11-01,00:27:00,0.015109,1


In [12]:
print("Count buy signals:", len(btc_price_data_1_year[btc_price_data_1_year.target == 1]))
print("Count sell signals:", len(btc_price_data_1_year[btc_price_data_1_year.target == 0]))

Count buy signals: 211479
Count sell signals: 317153


### 4. Remove Any NaN Values:  
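The last row in your dataset will have a NaN value for `future_return` due to the shift operation. Because `NaN > threshold` evaluates to `False`, that row was labelled 0 in the previous step, which is consistent with the sell count falling from 317153 to 317152 once it is dropped. Drop this row to clean up the dataset.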

In [13]:
btc_price_data_1_year = btc_price_data_1_year.dropna()
btc_price_data_1_year

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953000,2023-11-01,00:00:00,-0.072286,0
1,2023-11-01 00:01:00,34642.54,34687.53,34673.30,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.753120,2023-11-01,00:02:00,-0.078542,0
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.308610,2023-11-01,00:03:00,-0.020416,0
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404,0
...,...,...,...,...,...,...,...,...,...,...
528626,2024-10-31 23:55:00,70248.97,70264.97,70248.98,70248.97,1.604753,2024-10-31,23:55:00,-0.014534,0
528627,2024-10-31 23:56:00,70238.76,70248.97,70248.97,70238.76,1.189134,2024-10-31,23:56:00,-0.007859,0
528628,2024-10-31 23:57:00,70218.00,70250.00,70238.77,70233.24,4.767082,2024-10-31,23:57:00,-0.036251,0
528629,2024-10-31 23:58:00,70193.97,70242.25,70232.55,70207.78,9.589688,2024-10-31,23:58:00,-0.014172,0
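
Two details of this step are easy to miss: the NaN row receives a label before it is dropped, and `dropna()` without arguments looks at every column. The following sketch shows both on invented data; the `note` column is a hypothetical extra column with a missing value, added only to make the difference visible.

```python
import pandas as pd

# Invented data for illustration only.
df = pd.DataFrame({
    "close": [100.0, 101.0, 100.0],
    "note": ["a", None, "c"],  # hypothetical column with one missing value
})

df["future_return"] = (df["close"].shift(-1) - df["close"]) / df["close"] * 100
df["target"] = (df["future_return"] > 0.01).astype(int)

# The last row has no future price; NaN > 0.01 is False, so it is labelled 0.
print(df.tail(1))

# dropna() removes rows with a NaN in any column ...
all_columns = df.dropna()
# ... while subset= restricts the check to the column created by the shift.
only_future_return = df.dropna(subset=["future_return"])

print(len(all_columns), len(only_future_return))
```

Here `dropna()` keeps one row and `dropna(subset=["future_return"])` keeps two. Whether the plain call is acceptable for the BTC data depends on whether its other columns contain missing values, which the notebook does not check.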


### 5. Verify the Target Distribution:  
Check the balance of 1s and 0s in our target variable to understand how many “buy” and “sell” signals we have.

In [14]:
btc_price_data_1_year['target'].value_counts()

target
0    317152
1    211479
Name: count, dtype: int64
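
The raw counts are easier to interpret as shares. As a rough sketch, the snippet below recomputes the balance from the two counts printed above, along with the ratio of negative to positive rows. XGBoost documents a `scale_pos_weight` parameter for which this ratio is a typical value; whether reweighting is appropriate for this target is an open question that the notebook does not address.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 317152, 1: 211479}, name="count")

# Share of each class among all labelled rows.
shares = counts / counts.sum()

# Ratio of sell (0) rows to buy (1) rows.
neg_to_pos = counts.loc[0] / counts.loc[1]

print(shares.round(4))
print(round(neg_to_pos, 4))
```

The split is roughly 60% sell to 40% buy, so the classes are uneven but not extremely imbalanced. On the DataFrame itself, `btc_price_data_1_year['target'].value_counts(normalize=True)` gives the same shares directly.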

In [15]:
btc_price_data_1_year.head()

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953,2023-11-01,00:00:00,-0.072286,0
1,2023-11-01 00:01:00,34642.54,34687.53,34673.3,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.75312,2023-11-01,00:02:00,-0.078542,0
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.30861,2023-11-01,00:03:00,-0.020416,0
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404,0


In [16]:
btc_price_data_1_year_target.head()

Unnamed: 0,timestamp,open,high,low,close,volume,date,time,future_return,target
0,2023-11-01 00:00:00,34618.86,34676.51,34656.38,34667.88,48.953,2023-11-01,00:00:00,-0.072286,0.0
1,2023-11-01 00:01:00,34642.54,34687.53,34673.3,34642.82,16.178075,2023-11-01,00:01:00,0.039662,1.0
2,2023-11-01 00:02:00,34637.97,34656.82,34642.53,34656.56,8.75312,2023-11-01,00:02:00,-0.078542,0.0
3,2023-11-01 00:03:00,34617.22,34656.56,34656.56,34629.34,11.30861,2023-11-01,00:03:00,-0.020416,0.0
4,2023-11-01 00:04:00,34597.99,34630.42,34629.41,34622.27,8.583808,2023-11-01,00:04:00,-0.045404,0.0
