# BTC Data Exploration

## Overview
This notebook explores the collected BTC data from different timeframes (4h, 1d, 1w) to understand:
- Data quality and completeness
- Price patterns and trends
- Volume analysis
- Data distribution across timeframes

## Data Sources
- **4h timeframe**: 4-hour candles, the finest granularity collected, for detailed analysis
- **1d timeframe**: Daily patterns and trends
- **1w timeframe**: Weekly macro trends

## Objectives
1. Load and examine collected data
2. Perform basic statistical analysis
3. Visualize price and volume patterns
4. Identify data quality issues
5. Prepare data for feature engineering


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


## 1. Load Data


In [3]:
data_dir = Path('../data_collection/data')
print(f"Data directory: {data_dir.absolute()}")
print(f"Directory exists: {data_dir.exists()}")

Data directory: /Users/al02260279/Documents/private/btc_prediction/notebooks/../data_collection/data
Directory exists: True


In [4]:
if data_dir.exists():
    files = list(data_dir.glob('*.parquet'))
    print(f"Parquet files found: {files}")
    for file in files:
        print(f"  - {file.name}")
else:
    print("Data directory does not exist!")

Parquet files found: [PosixPath('../data_collection/data/btc_1w_20251020.parquet'), PosixPath('../data_collection/data/btc_4h_20251020.parquet'), PosixPath('../data_collection/data/btc_1d_20251020.parquet')]
  - btc_1w_20251020.parquet
  - btc_4h_20251020.parquet
  - btc_1d_20251020.parquet


In [14]:
# Load data from parquet files
data_dir = Path('../data_collection/data')

# Load different timeframes
data_4h = pd.read_parquet(data_dir / 'btc_4h_20251020.parquet')  # Use actual filename
data_1d = pd.read_parquet(data_dir / 'btc_1d_20251020.parquet')
data_1w = pd.read_parquet(data_dir / 'btc_1w_20251020.parquet')

print(f"4h data shape: {data_4h.shape}")
print(f"1d data shape: {data_1d.shape}")
print(f"1w data shape: {data_1w.shape}")


4h data shape: (11925, 5)
1d data shape: (1988, 5)
1w data shape: (284, 5)
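The load cell above hardcodes the dated filenames. As a minimal sketch, assuming the files keep the `btc_<timeframe>_<YYYYMMDD>.parquet` naming seen in the listing, the newest file for a timeframe could be picked automatically. `latest_parquet` is a hypothetical helper, not part of the collection code.

```python
import re
from pathlib import Path


def latest_parquet(data_dir, timeframe):
    """Return the newest btc_<timeframe>_<YYYYMMDD>.parquet in data_dir, or None.

    Hypothetical helper: assumes the naming scheme seen in the file listing.
    """
    pattern = re.compile(rf"btc_{re.escape(timeframe)}_(\d{{8}})\.parquet$")
    dated = []
    for path in Path(data_dir).glob(f"btc_{timeframe}_*.parquet"):
        match = pattern.match(path.name)
        if match:
            dated.append((match.group(1), path))
    if not dated:
        return None
    # YYYYMMDD strings sort chronologically, so the largest date is the newest file
    return max(dated, key=lambda item: item[0])[1]
```

With this in place, a load would read `pd.read_parquet(latest_parquet(data_dir, "4h"))`, after checking that the helper did not return `None`.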


In [None]:
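# Drop the three candles dated 2025-10-20 (the collection date in the filename),
# so the 4h series ends on 2025-10-19
data_4h = data_4h[~data_4h.index.isin(["2025-10-20 00:00:00", "2025-10-20 04:00:00", "2025-10-20 08:00:00"])]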

                          open       high        low      close      volume
timestamp                                                                  
2025-10-19 04:00:00  107275.79  107290.00  106558.61  106786.00  1247.19589
2025-10-19 08:00:00  106786.00  108260.50  106103.36  107783.47  5059.55867
2025-10-19 12:00:00  107783.47  108621.68  107355.02  108486.70  3636.47308
2025-10-19 16:00:00  108486.71  109450.07  108240.00  108908.43  2430.56297
2025-10-19 20:00:00  108908.43  109370.99  108471.10  108642.78  2048.76152
(11922, 5)
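The trimmed 4h frame has 11,922 rows. A gap-free 4-hour grid from 2020-05-12 00:00 through 2025-10-19 20:00 would contain 1,987 days × 6 = 11,922 candles, so the count is consistent with complete data. Equal counts alone cannot rule out a gap offset by a duplicate, though. The sketch below checks the index directly; `missing_timestamps` is a hypothetical helper, and the toy index is made up for illustration.

```python
import pandas as pd


def missing_timestamps(index, step):
    """Return timestamps expected on a regular grid of width `step` but absent from `index`."""
    expected = pd.date_range(index.min(), index.max(), freq=step)
    return expected.difference(index)


# Toy index with the 08:00 candle deliberately left out
toy_index = pd.DatetimeIndex(
    ["2025-10-19 00:00", "2025-10-19 04:00", "2025-10-19 12:00"]
)
gaps = missing_timestamps(toy_index, pd.Timedelta(hours=4))
duplicates = int(toy_index.duplicated().sum())
print(f"missing: {list(gaps)}, duplicates: {duplicates}")
```

On the real frames the call would be `missing_timestamps(data_4h.index, pd.Timedelta(hours=4))`, with `pd.Timedelta(days=1)` and `pd.Timedelta(weeks=1)` for the other two timeframes.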


In [23]:
print(data_4h.head(5))
print(data_4h.tail(5))
print(data_4h.shape)
print(data_4h["high"].max())
print(data_4h["low"].min())

                        open     high      low    close        volume
timestamp                                                            
2020-05-12 00:00:00  8562.04  8742.43  8528.78  8716.07  11224.925222
2020-05-12 04:00:00  8716.75  8785.00  8614.98  8656.05  10948.791761
2020-05-12 08:00:00  8655.76  8828.72  8632.93  8800.92  14846.694767
2020-05-12 12:00:00  8800.91  8944.72  8659.00  8867.72  22551.312510
2020-05-12 16:00:00  8867.72  8978.26  8775.00  8792.19  17005.945766
                          open       high        low      close      volume
timestamp                                                                  
2025-10-19 04:00:00  107275.79  107290.00  106558.61  106786.00  1247.19589
2025-10-19 08:00:00  106786.00  108260.50  106103.36  107783.47  5059.55867
2025-10-19 12:00:00  107783.47  108621.68  107355.02  108486.70  3636.47308
2025-10-19 16:00:00  108486.71  109450.07  108240.00  108908.43  2430.56297
2025-10-19 20:00:00  108908.43  109370.99  108471.10  108642.78  2048.76152

In [19]:
# Drop the 2025-10-20 daily candle (the collection date), keeping data through 2025-10-19
data_1d = data_1d[~data_1d.index.isin(["2025-10-20"])]

In [24]:
print(data_1d.head(5))
print(data_1d.tail(5))
print(data_1d.shape)
print(data_1d["high"].max())
print(data_1d["low"].min())

               open     high      low    close         volume
timestamp                                                    
2020-05-12  8562.04  8978.26  8528.78  8810.79   86522.780066
2020-05-13  8810.99  9398.00  8792.99  9309.37   92466.274018
2020-05-14  9309.35  9939.00  9256.76  9791.98  129565.377470
2020-05-15  9791.97  9845.62  9150.00  9316.42  115890.761516
2020-05-16  9315.96  9588.00  9220.00  9381.27   59587.627862
                 open       high        low      close       volume
timestamp                                                          
2025-10-15  113028.13  113612.35  110164.00  110763.28  22986.48811
2025-10-16  110763.28  111982.45  107427.00  108194.28  29857.17252
2025-10-17  108194.27  109240.00  103528.23  106431.68  37920.66838
2025-10-18  106431.68  107499.00  106322.20  107185.01  11123.18766
2025-10-19  107185.00  109450.07  106103.36  108642.78  15480.66423
(1987, 5)
126199.63
8528.78
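The daily row for 2025-10-19 can be cross-checked against the 4h candles: its high (109450.07), low (106103.36) and close (108642.78) all appear in the 4h tail printed earlier. A minimal sketch of that check resamples 4h candles to daily bars. It uses only the five 4h rows shown in the output, so the resampled open and volume will not match the daily row, because the 00:00 candle is not displayed. `resample_ohlcv` is a hypothetical helper.

```python
import pandas as pd

# Standard OHLCV aggregation: first open, highest high, lowest low, last close, summed volume
OHLCV_AGG = {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}


def resample_ohlcv(df, rule):
    """Aggregate finer OHLCV candles into coarser ones, dropping empty bins."""
    out = df.resample(rule).agg(OHLCV_AGG)
    return out.dropna(subset=["open"])


# The five 4h candles for 2025-10-19 printed in this notebook (00:00 is not shown)
rows_4h = pd.DataFrame(
    {
        "open": [107275.79, 106786.00, 107783.47, 108486.71, 108908.43],
        "high": [107290.00, 108260.50, 108621.68, 109450.07, 109370.99],
        "low": [106558.61, 106103.36, 107355.02, 108240.00, 108471.10],
        "close": [106786.00, 107783.47, 108486.70, 108908.43, 108642.78],
        "volume": [1247.19589, 5059.55867, 3636.47308, 2430.56297, 2048.76152],
    },
    index=pd.DatetimeIndex(
        [
            "2025-10-19 04:00",
            "2025-10-19 08:00",
            "2025-10-19 12:00",
            "2025-10-19 16:00",
            "2025-10-19 20:00",
        ],
        name="timestamp",
    ),
)

daily = resample_ohlcv(rows_4h, "1D")
print(daily[["high", "low", "close"]])
```

Run on the full frames, `resample_ohlcv(data_4h, "1D")` could be compared column by column with `data_1d` to flag days where the two timeframes disagree.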


In [21]:
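# Drop the weekly candle starting 2025-10-20 (the collection date); the last week kept starts 2025-10-13
data_1w = data_1w[~data_1w.index.isin(["2025-10-20"])]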

In [25]:
print(data_1w.head(5))
print(data_1w.tail(5))
print(data_1w.shape)
print(data_1w["high"].max())
print(data_1w["low"].min())

               open      high      low    close         volume
timestamp                                                     
2020-05-18  9681.11   9950.00  8700.00  8720.34  517248.177536
2020-05-25  8718.14   9740.00  8642.72  9448.27  425528.246167
2020-06-01  9448.27  10380.00  9266.00  9746.99  427822.495347
2020-06-08  9746.99   9992.72  9113.00  9342.10  336172.771517
2020-06-15  9342.10   9589.00  8910.45  9294.69  323565.711842
                 open       high        low      close         volume
timestamp                                                            
2025-09-15  115268.01  117900.00  114384.00  115232.29   70729.440300
2025-09-22  115232.29  115379.25  108620.07  112163.95   93970.577040
2025-09-29  112163.96  125708.42  111560.65  123482.31  124480.098903
2025-10-06  123482.32  126199.63  102000.00  114958.80  211576.359223
2025-10-13  114958.81  115963.81  103528.23  108642.78  171795.750970
(283, 5)
126199.63
8642.72
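The differing minimum lows (8528.78 on 1d, 8642.72 on 1w) follow from coverage rather than bad data: the weekly series starts on 2020-05-18, after the 2020-05-12 candle that holds the daily minimum. Beyond such comparisons, each frame can be screened for basic integrity problems. The sketch below assumes the `open`, `high`, `low`, `close`, `volume` columns used in this notebook; `ohlcv_issues` is a hypothetical helper and the toy frame contains made-up values with deliberate errors.

```python
import pandas as pd


def ohlcv_issues(df):
    """Count basic integrity problems in an OHLCV frame indexed by timestamp."""
    body_top = df[["open", "close"]].max(axis=1)
    body_bottom = df[["open", "close"]].min(axis=1)
    return {
        "nan_cells": int(df.isna().sum().sum()),
        "duplicate_timestamps": int(df.index.duplicated().sum()),
        "unsorted_index": not df.index.is_monotonic_increasing,
        "high_below_body": int((df["high"] < body_top).sum()),
        "low_above_body": int((df["low"] > body_bottom).sum()),
        "negative_volume": int((df["volume"] < 0).sum()),
    }


# Made-up rows: row 2 has high < open, row 3 repeats a timestamp and has negative volume
toy = pd.DataFrame(
    {
        "open": [100.0, 105.0, 103.0],
        "high": [106.0, 104.0, 108.0],
        "low": [99.0, 101.0, 102.0],
        "close": [105.0, 103.0, 107.0],
        "volume": [10.0, 12.0, -1.0],
    },
    index=pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-02"]),
)
issues = ohlcv_issues(toy)
print(issues)
```

A clean frame returns zeros and `False` throughout, so looping the helper over `data_4h`, `data_1d` and `data_1w` gives a quick per-timeframe summary.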
