# BTC Data Exploration

## Overview
This notebook explores the collected BTC data across several timeframes (4h, 1d, 1w, 1M) to understand:
- Data quality and completeness
- Price patterns and trends
- Volume analysis
- Data distribution across timeframes

## Data Sources
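- **4h timeframe**: Highest-resolution data in this set, used for detailed analysis
- **1d timeframe**: Daily patterns and trends
- **1w timeframe**: Weekly macro trends
- **1M timeframe**: Monthly bars, loaded alongside the others for long-horizon context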

## Objectives
1. Load and examine collected data
2. Perform basic statistical analysis
3. Visualize price and volume patterns
4. Identify data quality issues
5. Prepare data for feature engineering


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


## 1. Load Data


In [3]:
# Define the data directory up front so this cell runs on its own
data_dir = Path('../data_collection/data')

if data_dir.exists():
    files = sorted(data_dir.glob('*.parquet'))
    print(f"Parquet files found: {len(files)}")
    for file in files:
        print(f"  - {file.name}")
else:
    print("Data directory does not exist!")

In [6]:
# Load data from parquet files
data_dir = Path('../data_collection/data')

# Load different timeframes
data_4h = pd.read_parquet(data_dir / 'btc_4h_20251022.parquet')  # Use actual filename
data_1d = pd.read_parquet(data_dir / 'btc_1d_20251022.parquet')
data_1w = pd.read_parquet(data_dir / 'btc_1w_20251022.parquet')
data_1m = pd.read_parquet(data_dir / 'btc_1M_20251022.parquet')

print(f"4h data shape: {data_4h.shape}")
print(f"1d data shape: {data_1d.shape}")
print(f"1w data shape: {data_1w.shape}")
print(f"1m data shape: {data_1m.shape}")

4h data shape: (12367, 5)
1d data shape: (2115, 5)
1w data shape: (350, 5)
1m data shape: (91, 5)
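
Row counts alone do not show whether bars are missing inside the date range. The check below is a minimal sketch of one way to test completeness, assuming each frame has a `DatetimeIndex` on a fixed-width grid (4h, 1d, 1w). The helper name `find_gaps` and the synthetic frame are illustrative and not part of the original notebook. Monthly bars are not fixed-width, so they would need a calendar-based frequency instead of a `Timedelta`.

```python
import pandas as pd

def find_gaps(index: pd.DatetimeIndex, step: pd.Timedelta) -> pd.DatetimeIndex:
    """Return timestamps that a regular grid with spacing `step` would contain
    between the first and last entry of `index`, but that `index` lacks."""
    expected = pd.date_range(index.min(), index.max(), freq=step)
    return expected.difference(index)

# Synthetic 4-hour grid with one bar removed (hypothetical data for illustration)
idx = pd.date_range("2020-03-01", periods=6, freq=pd.Timedelta(hours=4))
sample = pd.DataFrame({"close": range(6)}, index=idx).drop(idx[3])

missing = find_gaps(sample.index, pd.Timedelta(hours=4))
duplicates = int(sample.index.duplicated().sum())

print(f"Missing bars: {list(missing)}")
print(f"Duplicate timestamps: {duplicates}")
```

On the real data the same call would be `find_gaps(data_4h.index, pd.Timedelta(hours=4))`, with `pd.Timedelta(days=1)` and `pd.Timedelta(weeks=1)` for the daily and weekly frames.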


In [29]:
print(data_4h.head(5))
print(data_4h.tail(5))
print(data_4h.shape)
print(data_4h["high"].max())
print(data_4h["low"].min())

                        open     high      low    close        volume
timestamp                                                            
2020-03-01 00:00:00  8523.61  8675.00  8511.11  8620.36   6058.887982
2020-03-01 04:00:00  8620.17  8634.73  8520.58  8541.16   5437.032897
2020-03-01 08:00:00  8541.15  8650.00  8518.00  8648.37   5567.314081
2020-03-01 12:00:00  8647.42  8750.00  8535.24  8557.80   9813.477149
2020-03-01 16:00:00  8557.80  8582.73  8411.00  8542.20  11521.279904
                          open       high        low      close      volume
timestamp                                                                  
2025-10-20 08:00:00  111169.91  111679.25  110608.27  111016.75  3452.86704
2025-10-20 12:00:00  111016.75  111705.56  110588.23  111144.55  3006.20504
2025-10-20 16:00:00  111144.55  111303.05  109855.83  110803.22  3716.65669
2025-10-20 20:00:00  110803.22  111272.00  110418.25  110532.09  1366.18103
2025-10-21 00:00:00  110532.09  110532.09  109303.86  

In [30]:
print(data_1d.head(5))
print(data_1d.tail(5))
print(data_1d.shape)
print(data_1d["high"].max())
print(data_1d["low"].min())

               open     high      low    close        volume
timestamp                                                   
2020-03-01  8523.61  8750.00  8411.00  8531.88  43892.201779
2020-03-02  8530.30  8965.75  8498.00  8915.24  60401.317730
2020-03-03  8911.18  8919.65  8651.00  8760.07  55154.997282
2020-03-04  8760.07  8848.29  8660.00  8750.87  38696.482578
2020-03-05  8750.99  9159.42  8746.54  9054.68  58201.866355
                 open       high        low      close       volume
timestamp                                                          
2025-10-17  108194.27  109240.00  103528.23  106431.68  37920.66838
2025-10-18  106431.68  107499.00  106322.20  107185.01  11123.18766
2025-10-19  107185.00  109450.07  106103.36  108642.78  15480.66423
2025-10-20  108642.77  111705.56  107402.52  110532.09  19193.44160
2025-10-21  110532.09  110532.09  109303.86  109573.21   1729.33018
(2061, 5)
126199.63
3782.13


In [31]:
print(data_1w.head(5))
print(data_1w.tail(5))
print(data_1w.shape)
print(data_1w["high"].max())
print(data_1w["low"].min())

               open     high      low    close        volume
timestamp                                                   
2020-03-02  8530.30  9188.00  8000.00  8033.31  3.791971e+05
2020-03-09  8034.76  8179.31  3782.13  5361.30  1.224228e+06
2020-03-16  5360.33  6900.00  4442.12  5816.19  1.180843e+06
2020-03-23  5816.05  6957.96  5688.00  5881.42  7.703808e+05
2020-03-30  5880.50  7198.00  5857.76  6772.78  6.647843e+05
                 open       high        low      close         volume
timestamp                                                            
2025-09-22  115232.29  115379.25  108620.07  112163.95   93970.577040
2025-09-29  112163.96  125708.42  111560.65  123482.31  124480.098903
2025-10-06  123482.32  126199.63  102000.00  114958.80  211576.359223
2025-10-13  114958.81  115963.81  103528.23  108642.78  171795.750970
2025-10-20  108642.77  111705.56  107402.52  109573.21   20922.776900
(295, 5)
126199.63
3782.13
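
The daily and weekly frames report the same extremes (high 126199.63, low 3782.13), which is what one would expect if both are aggregations of the same underlying bars. A more systematic cross-check is to resample the finer timeframe and compare it with the stored coarser one. The sketch below assumes the column names shown in the outputs (`open`, `high`, `low`, `close`, `volume`); the helpers `resample_ohlcv` and `max_abs_diff` and the synthetic bars are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Standard OHLCV aggregation: first open, highest high, lowest low, last close, summed volume
OHLCV_AGG = {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}

def resample_ohlcv(df: pd.DataFrame, rule: str) -> pd.DataFrame:
    """Aggregate OHLCV bars to a coarser timeframe, dropping empty periods."""
    return df.resample(rule).agg(OHLCV_AGG).dropna(subset=["open"])

def max_abs_diff(a: pd.DataFrame, b: pd.DataFrame) -> pd.Series:
    """Largest absolute difference per column over the timestamps both frames share."""
    shared = a.index.intersection(b.index)
    return (a.loc[shared] - b.loc[shared]).abs().max()

# Synthetic 4-hour bars covering two full days (hypothetical data for illustration)
idx = pd.date_range("2020-03-01", periods=12, freq=pd.Timedelta(hours=4))
close = np.linspace(100.0, 111.0, 12)
bars_4h = pd.DataFrame(
    {"open": close - 0.5, "high": close + 1.0, "low": close - 1.0,
     "close": close, "volume": 10.0},
    index=idx,
)

daily = resample_ohlcv(bars_4h, "1D")

# Stand-in for a stored daily frame, with one deliberately perturbed value
reference = daily.copy()
reference.loc[reference.index[0], "high"] += 2.0

diff = max_abs_diff(daily, reference)
print(daily)
print(diff)
```

With the notebook's frames this would be `max_abs_diff(resample_ohlcv(data_4h, "1D"), data_1d)`. Small volume differences or a mismatch on the last, still-forming bar would not necessarily indicate bad data.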


In [7]:
print(data_1m.head(5))
print(data_1m.tail(5))
print(data_1m.shape)
print(data_1m["high"].max())
print(data_1m["low"].min())

               open      high      low    close        volume
timestamp                                                    
2018-04-01  6922.00   9759.82  6430.00  9246.01  1.110964e+06
2018-05-01  9246.01  10020.00  7032.95  7485.01  9.144764e+05
2018-06-01  7485.01   7786.69  5750.00  6390.07  9.422498e+05
2018-07-01  6391.08   8491.77  6070.00  7730.93  1.102510e+06
2018-08-01  7735.67   7750.00  5880.00  7011.21  1.408160e+06
                 open       high        low      close         volume
timestamp                                                            
2025-06-01  104591.88  110530.17   98200.00  107146.50  427546.463360
2025-07-01  107146.51  123218.00  105100.19  115764.08  484315.651017
2025-08-01  115764.07  124474.00  107350.10  108246.35  471366.942936
2025-09-01  108246.36  117900.00  107255.00  114048.93  374551.994070
2025-10-01  114048.94  126199.63  102000.00  108258.66  536487.098566
(91, 5)
126199.63
3156.26


In [1]:
import pandas as pd
import numpy as np

def test_target_logic_simple():
    """Simple test to verify target logic using daily data"""
    
    print("🧪 Testing Target Logic with Simple Version")
    print("=" * 50)
    
    # Load daily data
    data = pd.read_parquet('../data_collection/data/btc_1d_20251022.parquet')
    print(f"📊 Data loaded: {len(data)} records")
    print(f"📅 Period: {data.index[0]} to {data.index[-1]}")
    
    # Create target variable with simple logic
    y_labels = []
    
    print("\n🔄 Calculating labels...")
    
    for i in range(len(data)):
        if i % 500 == 0:
            print(f"   Processing {i}/{len(data)} records...")
        
        current_close = data.iloc[i]['close']
        
        # Look ahead 30 bars (30 calendar days for daily data)
        if i + 30 < len(data):
            future_data = data.iloc[i+1:i+31]
            
            # Largest upside move in the window, relative to the current close
            future_highs = future_data['high']
            max_future_high = future_highs.max()
            price_increase = (max_future_high - current_close) / current_close
            
            # Largest downside move in the window, relative to the current close
            future_lows = future_data['low']
            min_future_low = future_lows.min()
            price_drop = (min_future_low - current_close) / current_close
            
            # Determine which threshold was reached first
            if price_increase >= 0.05 and price_drop <= -0.15:
                # Both thresholds are hit inside the window, so scan bar by bar.
                # If a single bar hits both, +5% is checked first and the label is REST.
                for j in range(len(future_data)):
                    future_high = future_data.iloc[j]['high']
                    future_low = future_data.iloc[j]['low']
                    
                    if (future_high - current_close) / current_close >= 0.05:
                        # +5% reached first
                        y_labels.append(0)  # REST
                        break
                    elif (future_low - current_close) / current_close <= -0.15:
                        # -15% reached first
                        y_labels.append(1)  # SELL
                        break
            elif price_increase >= 0.05:
                # Only +5% reached
                y_labels.append(0)  # REST
            elif price_drop <= -0.15:
                # Only -15% reached
                y_labels.append(1)  # SELL
            else:
                # Neither threshold reached
                y_labels.append(0)  # REST
        else:
            # Fewer than 30 future bars: the outcome is unknown, labeled REST by default
            y_labels.append(0)  # REST
    
    # Convert to numpy array
    y_labels = np.array(y_labels)
    
    # Calculate statistics
    total_labels = len(y_labels)
    sell_count = np.sum(y_labels)
    rest_count = total_labels - sell_count
    
    print(f"\n📊 Results:")
    print(f"   Total labels: {total_labels}")
    print(f"   SELL labels: {sell_count} ({sell_count/total_labels*100:.1f}%)")
    print(f"   REST labels: {rest_count} ({rest_count/total_labels*100:.1f}%)")
    print(f"   Ratio: {sell_count/rest_count:.1f}:1 (SELL:REST)")
    
    return y_labels

# Run the test
test_labels = test_target_logic_simple()

🧪 Testing Target Logic with Simple Version
📊 Data loaded: 2146 records
📅 Period: 2019-12-08 00:00:00 to 2025-10-22 00:00:00

🔄 Calculating labels...
   Processing 0/2146 records...
   Processing 500/2146 records...
   Processing 1000/2146 records...
   Processing 1500/2146 records...
   Processing 2000/2146 records...

📊 Results:
   Total labels: 2146
   SELL labels: 351 (16.4%)
   REST labels: 1795 (83.6%)
   Ratio: 0.2:1 (SELL:REST)
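
The loop above labels the final 30 rows as REST because their look-ahead window is incomplete, so "unknown" and "no sell signal" end up in the same class. One possible variant, sketched here under assumptions, keeps the same first-touch rule (+5% on the high versus -15% on the low, with same-bar ties resolved to REST) but marks incomplete windows as `NaN` so they can be dropped before training. The function name `label_first_touch`, its parameters, and the synthetic bars are hypothetical and are not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

REST, SELL = 0, 1

def label_first_touch(df: pd.DataFrame, horizon: int = 30,
                      up: float = 0.05, down: float = -0.15) -> pd.Series:
    """SELL if the low reaches `down` strictly before the high reaches `up`
    within the next `horizon` bars, REST otherwise, NaN if the window is incomplete."""
    close = df["close"].to_numpy(dtype=float)
    high = df["high"].to_numpy(dtype=float)
    low = df["low"].to_numpy(dtype=float)
    n = len(df)
    labels = np.full(n, np.nan)

    for i in range(n - horizon):
        fut_high = high[i + 1 : i + 1 + horizon]
        fut_low = low[i + 1 : i + 1 + horizon]
        up_hits = np.flatnonzero(fut_high >= close[i] * (1 + up))
        down_hits = np.flatnonzero(fut_low <= close[i] * (1 + down))
        # Use `horizon` as a sentinel meaning "never reached inside the window"
        first_up = up_hits[0] if up_hits.size else horizon
        first_down = down_hits[0] if down_hits.size else horizon
        # A tie (both hit in the same bar) resolves to REST
        labels[i] = SELL if first_down < first_up else REST

    return pd.Series(labels, index=df.index, name="target")

# Synthetic daily bars with a 3-bar horizon (hypothetical data for illustration)
bars = pd.DataFrame(
    {"close": [100.0] * 6,
     "high": [100.0, 101.0, 106.0, 101.0, 101.0, 101.0],
     "low": [100.0, 99.0, 99.0, 80.0, 99.0, 99.0]},
    index=pd.date_range("2020-03-01", periods=6, freq=pd.Timedelta(days=1)),
)

labels = label_first_touch(bars, horizon=3)
print(labels)
print(f"SELL share among labeled rows: {labels.dropna().mean():.2f}")
```

Daily bars cannot reveal which threshold was touched first when both occur inside one bar, so the tie rule is a modeling choice. Replaying such bars against the 4h data would be one way to resolve it.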
