# BTC Data Exploration

## Overview
This notebook explores the collected BTC data from different timeframes (4h, 1d, 1w) to understand:
- Data quality and completeness
- Price patterns and trends
- Volume analysis
- Data distribution across timeframes

## Data Sources
- **4h timeframe**: Intraday bars, the finest granularity collected, for detailed analysis
- **1d timeframe**: Daily bars for day-to-day patterns and trends
- **1w timeframe**: Weekly bars for macro trends

## Objectives
1. Load and examine collected data
2. Perform basic statistical analysis
3. Visualize price and volume patterns
4. Identify data quality issues
5. Prepare data for feature engineering
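
Objective 4 can be made concrete with a small report over each OHLCV frame. The following is a minimal sketch, not part of the original notebook: `quality_report` is a hypothetical helper, and it assumes the frames are indexed by timestamp with the columns `open`, `high`, `low`, `close`, `volume` seen in the outputs further down. The sample rows are two 4h bars copied from the `data_4h.tail(5)` output.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize basic quality checks for an OHLCV frame (hypothetical helper)."""
    # A bar is internally consistent when high is the largest and low the smallest value.
    ohlc_ok = (
        (df["high"] >= df[["open", "close", "low"]].max(axis=1))
        & (df["low"] <= df[["open", "close", "high"]].min(axis=1))
    )
    return {
        "rows": len(df),
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "index_sorted": bool(df.index.is_monotonic_increasing),
        "ohlc_violations": int((~ohlc_ok).sum()),
        "nonpositive_volume": int((df["volume"] <= 0).sum()),
    }


# Two 4h bars copied from the output in this notebook, used as a smoke test.
sample = pd.DataFrame(
    {
        "open": [108486.71, 108908.43],
        "high": [109450.07, 109370.99],
        "low": [108240.0, 108471.1],
        "close": [108908.43, 108642.78],
        "volume": [2430.56297, 2048.76152],
    },
    index=pd.to_datetime(["2025-10-19 16:00:00", "2025-10-19 20:00:00"]),
)
print(quality_report(sample))
```

Running the same function on `data_4h`, `data_1d`, and `data_1w` would give one comparable summary per timeframe.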


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


## 1. Load Data


In [3]:
data_dir = Path('../data_collection/data')
print(f"Data directory: {data_dir.absolute()}")
print(f"Directory exists: {data_dir.exists()}")

Data directory: /Users/al02260279/Documents/private/btc_prediction/notebooks/../data_collection/data
Directory exists: True


In [4]:
if data_dir.exists():
    files = list(data_dir.glob('*.parquet'))
    print(f"Parquet files found: {files}")
    for file in files:
        print(f"  - {file.name}")
else:
    print("Data directory does not exist!")

Parquet files found: [PosixPath('../data_collection/data/btc_1w_20251020.parquet'), PosixPath('../data_collection/data/btc_4h_20251020.parquet'), PosixPath('../data_collection/data/btc_1d_20251020.parquet')]
  - btc_1w_20251020.parquet
  - btc_4h_20251020.parquet
  - btc_1d_20251020.parquet


In [6]:
# Load data from parquet files
data_dir = Path('../data_collection/data')

# Load each timeframe; the filename suffix is the collection date (YYYYMMDD)
data_4h = pd.read_parquet(data_dir / 'btc_4h_20251020.parquet')
data_1d = pd.read_parquet(data_dir / 'btc_1d_20251020.parquet')
data_1w = pd.read_parquet(data_dir / 'btc_1w_20251020.parquet')

print(f"4h data shape: {data_4h.shape}")
print(f"1d data shape: {data_1d.shape}")
print(f"1w data shape: {data_1w.shape}")


4h data shape: (11925, 5)
1d data shape: (1988, 5)
1w data shape: (284, 5)
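
The row counts alone give a rough completeness signal: 1987 full days × 6 bars plus the 3 bars of the collection day equals 11925, which matches the 4h shape if both series start on the same day and contain no gaps (neither is verified here). A direct check compares the index against a regular grid. This is a sketch under the assumption that bars are stamped at a fixed step from the first timestamp; `missing_bars` is a hypothetical helper and the index below is synthetic.

```python
import pandas as pd


def missing_bars(index: pd.DatetimeIndex, step: pd.Timedelta) -> pd.DatetimeIndex:
    """Return the timestamps a regular grid would contain but `index` lacks."""
    expected = pd.date_range(index.min(), index.max(), freq=step)
    return expected.difference(index)


# Synthetic 4h index with the 08:00 bar deliberately left out.
idx = pd.to_datetime(["2025-10-19 00:00", "2025-10-19 04:00", "2025-10-19 12:00"])
gaps = missing_bars(idx, pd.Timedelta(hours=4))
print(list(gaps))
```

For the real frames the calls would be `missing_bars(data_4h.index, pd.Timedelta(hours=4))`, and likewise with `pd.Timedelta(days=1)` and `pd.Timedelta(weeks=1)`.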


In [13]:
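# Drop the 4h bars stamped 2025-10-20 (the collection date) so the series ends on 2025-10-19
incomplete_4h = ["2025-10-20 00:00:00", "2025-10-20 04:00:00", "2025-10-20 08:00:00"]
data_4h = data_4h[~data_4h.index.isin(incomplete_4h)]
data_4h.tail(5)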


timestamp,open,high,low,close,volume
2025-10-19 04:00:00,107275.79,107290.0,106558.61,106786.0,1247.19589
2025-10-19 08:00:00,106786.0,108260.5,106103.36,107783.47,5059.55867
2025-10-19 12:00:00,107783.47,108621.68,107355.02,108486.7,3636.47308
2025-10-19 16:00:00,108486.71,109450.07,108240.0,108908.43,2430.56297
2025-10-19 20:00:00,108908.43,109370.99,108471.1,108642.78,2048.76152
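
Listing each timestamp by hand works for a one-off, but a cutoff comparison covers every timeframe with one expression. The sketch below assumes a timezone-naive `DatetimeIndex` whose labels are bar open times, as the outputs suggest; `drop_from` is a hypothetical helper and the frame is synthetic.

```python
import pandas as pd


def drop_from(df: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Keep only rows whose timestamp is strictly before `cutoff`."""
    return df.loc[df.index < pd.Timestamp(cutoff)]


# Synthetic frame: one bar before the cutoff date and two bars on it.
demo = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(
        ["2025-10-19 20:00:00", "2025-10-20 00:00:00", "2025-10-20 04:00:00"]
    ),
)
trimmed = drop_from(demo, "2025-10-20")
print(trimmed)
```

Because the comparison is strict, a daily or weekly bar labeled exactly `2025-10-20` is dropped as well.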


In [12]:
# Drop the daily bar for 2025-10-20 (the collection date)
data_1d = data_1d[~data_1d.index.isin(["2025-10-20"])]
data_1d.tail(5)

timestamp,open,high,low,close,volume
2025-10-15,113028.13,113612.35,110164.0,110763.28,22986.48811
2025-10-16,110763.28,111982.45,107427.0,108194.28,29857.17252
2025-10-17,108194.27,109240.0,103528.23,106431.68,37920.66838
2025-10-18,106431.68,107499.0,106322.2,107185.01,11123.18766
2025-10-19,107185.0,109450.07,106103.36,108642.78,15480.66423


In [11]:
# Drop the weekly bar that opens on 2025-10-20 (the collection date)
data_1w = data_1w[~data_1w.index.isin(["2025-10-20"])]
data_1w.tail(5)

timestamp,open,high,low,close,volume
2025-09-15,115268.01,117900.0,114384.0,115232.29,70729.4403
2025-09-22,115232.29,115379.25,108620.07,112163.95,93970.57704
2025-09-29,112163.96,125708.42,111560.65,123482.31,124480.098903
2025-10-06,123482.32,126199.63,102000.0,114958.8,211576.359223
2025-10-13,114958.81,115963.81,103528.23,108642.78,171795.75097
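
The three outputs overlap, which allows a cross-timeframe consistency check: aggregating finer bars should reproduce the coarser ones. The sketch below is an illustration rather than the notebook's method; `resample_ohlcv` is a hypothetical helper. It uses the five 4h bars shown for 2025-10-19. The 00:00 bar is not in that `tail(5)` output, so only high, low, and close can be compared against the daily row.

```python
import pandas as pd

# Standard OHLCV aggregation: first open, max high, min low, last close, summed volume.
OHLCV_AGG = {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}


def resample_ohlcv(df: pd.DataFrame, rule: str) -> pd.DataFrame:
    """Aggregate OHLCV bars to a coarser timeframe (hypothetical helper)."""
    return df.resample(rule).agg(OHLCV_AGG)


# The five 4h bars for 2025-10-19 copied from the data_4h.tail(5) output.
bars_4h = pd.DataFrame(
    {
        "open": [107275.79, 106786.0, 107783.47, 108486.71, 108908.43],
        "high": [107290.0, 108260.5, 108621.68, 109450.07, 109370.99],
        "low": [106558.61, 106103.36, 107355.02, 108240.0, 108471.1],
        "close": [106786.0, 107783.47, 108486.7, 108908.43, 108642.78],
        "volume": [1247.19589, 5059.55867, 3636.47308, 2430.56297, 2048.76152],
    },
    index=pd.date_range("2025-10-19 04:00:00", periods=5, freq=pd.Timedelta(hours=4)),
)
daily = resample_ohlcv(bars_4h, "1D")
row = daily.loc[pd.Timestamp("2025-10-19")]
print(row[["high", "low", "close"]])
```

The aggregated high (109450.07), low (106103.36), and close (108642.78) agree with the 2025-10-19 row of `data_1d`.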
