# Data Ingestion with RustyBT

This notebook demonstrates how to fetch and prepare data from multiple sources for backtesting.

**Data Sources Covered:**
- yfinance (Yahoo Finance) - Free stocks, ETFs, forex
- CCXT - Cryptocurrency exchanges (100+ exchanges)
- CSV files - Custom data
- Alpaca - Real-time and historical market data (listed for reference; not demonstrated in this notebook)

**What you'll learn:**
- Fetching data from different providers
- Data validation and quality checks
- Creating custom data bundles
- Caching for performance

**Estimated runtime:** 5-10 minutes (depending on data downloads)

In [None]:
# Setup
from rustybt.analytics import setup_notebook, create_progress_iterator
setup_notebook()

import pandas as pd
import polars as pl
from datetime import datetime, timedelta

from rustybt.data.adapters import YFinanceAdapter, CCXTAdapter, CSVAdapter

## 1. Yahoo Finance Data (Stocks & ETFs)

Yahoo Finance provides free historical data for stocks, ETFs, indices, and forex.

In [None]:
# Initialize yfinance adapter
yf_adapter = YFinanceAdapter()

# Fetch data for multiple stocks
symbols = ['AAPL', 'GOOGL', 'MSFT', 'TSLA']
start_date = pd.Timestamp('2023-01-01')
end_date = pd.Timestamp('2023-12-31')

print(f"Fetching data for {len(symbols)} symbols...")

# Fetch with progress bar
all_data = []
for symbol in create_progress_iterator(symbols, desc="Downloading"):
    data = yf_adapter.fetch(
        symbols=[symbol],
        start_date=start_date,
        end_date=end_date,
        resolution='1d'
    )
    all_data.append(data)
    print(f"  {symbol}: {len(data)} bars")

# Combine all data
combined = pl.concat(all_data)
print(f"\nTotal: {len(combined)} bars across {len(symbols)} symbols")
print("\nData schema:")
print(combined.schema)

In [None]:
# Validate data quality
try:
    yf_adapter.validate(combined)
    print("✅ Data validation passed!")
    print("   - All OHLCV relationships valid")
    print("   - No NULL values")
    print("   - Timestamps properly sorted")
except Exception as e:
    print(f"❌ Validation failed: {e}")

## 2. Cryptocurrency Data (CCXT)

CCXT provides unified access to 100+ cryptocurrency exchanges.

In [None]:
# Initialize CCXT adapter for Binance
binance = CCXTAdapter(exchange_id='binance')

# Fetch BTC and ETH data
crypto_symbols = ['BTC/USDT', 'ETH/USDT']

print("Fetching crypto data from Binance...")

crypto_data = []
for symbol in crypto_symbols:
    data = binance.fetch(
        symbols=[symbol],
        start_date=pd.Timestamp('2024-01-01'),
        end_date=pd.Timestamp('2024-01-31'),
        resolution='1h'  # Hourly data
    )
    crypto_data.append(data)
    print(f"  {symbol}: {len(data)} bars")

crypto_combined = pl.concat(crypto_data)
print(f"\nTotal crypto bars: {len(crypto_combined)}")
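
Crypto markets trade around the clock, so the bar counts printed above can be sanity-checked against simple arithmetic. The sketch below uses only the standard library. `expected_bars` is a hypothetical helper, and whether the adapter treats `end_date` as inclusive is an assumption you would need to confirm, so treat the result as approximate.

```python
from datetime import datetime, timedelta


def expected_bars(start: datetime, end: datetime, step: timedelta) -> int:
    """Number of bars of width `step` that fit in the half-open range [start, end)."""
    return (end - start) // step


# Same window and resolution as the Binance request above
n = expected_bars(datetime(2024, 1, 1), datetime(2024, 1, 31), timedelta(hours=1))
print(f"Expected about {n} hourly bars per symbol")  # 30 days * 24 hours = 720
```

A count well below this suggests exchange downtime, a pagination limit, or a symbol that was not trading for the whole window.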

## 3. CSV Data Import

Import custom data from CSV files.

In [None]:
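# Example CSV structure (synthetic data).
# Note: pd.np was removed in pandas 2.0, so NumPy is imported directly.
import numpy as np

rng = np.random.default_rng(42)  # seeded so the example is reproducible
close = 100 + rng.standard_normal(100).cumsum()
open_ = close + rng.standard_normal(100)
csv_example = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=100, freq='D'),
    'symbol': 'CUSTOM',
    'open': open_,
    # Derive high/low from open/close so each bar has low <= open, close <= high
    'high': np.maximum(open_, close) + rng.random(100),
    'low': np.minimum(open_, close) - rng.random(100),
    'close': close,
    'volume': rng.integers(1_000_000, 10_000_000, 100)
})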

# Save example CSV
csv_example.to_csv('example_data.csv', index=False)

# Load using CSV adapter
csv_adapter = CSVAdapter()
csv_data = csv_adapter.load('example_data.csv')

print(f"Loaded {len(csv_data)} bars from CSV")
print("\nFirst 5 rows:")
print(csv_data.head())

## 4. Data Quality Checks

Always validate data before using it in backtests.

In [None]:
def check_data_quality(df, name="Data"):
    """Comprehensive data quality check."""
    print(f"\n{name} Quality Report:")
    print("=" * 50)
    
    # Check for nulls (null_count() returns a one-row DataFrame of per-column counts)
    null_counts = df.null_count()
    if sum(null_counts.row(0)) > 0:
        print("⚠️  NULL values found:")
        print(null_counts)
    else:
        print("✅ No NULL values")
    
    # Check OHLC relationships
    invalid = df.filter(
        (pl.col('high') < pl.col('low')) |
        (pl.col('high') < pl.col('open')) |
        (pl.col('high') < pl.col('close')) |
        (pl.col('low') > pl.col('open')) |
        (pl.col('low') > pl.col('close'))
    )
    
    if len(invalid) > 0:
        print(f"❌ Invalid OHLC relationships: {len(invalid)} bars")
    else:
        print("✅ OHLC relationships valid")
    
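    # Check for duplicate bars. Key on (symbol, timestamp) when a symbol column
    # exists, because different symbols legitimately share timestamps.
    key = ['symbol', 'timestamp'] if 'symbol' in df.columns else ['timestamp']
    duplicates = df.filter(df.select(key).is_duplicated())
    if len(duplicates) > 0:
        print(f"⚠️  Duplicate bars: {len(duplicates)}")
    else:
        print("✅ No duplicate bars")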
    
    # Date range
    print(f"\n📅 Date Range:")
    print(f"   Start: {df['timestamp'].min()}")
    print(f"   End: {df['timestamp'].max()}")
    print(f"   Total bars: {len(df)}")

# Check quality
check_data_quality(combined, "Stock Data")
check_data_quality(crypto_combined, "Crypto Data")

## 5. Save Data for Backtesting

Save data in efficient formats for fast backtesting.

In [None]:
# Save to Parquet (recommended - fast and efficient)
combined.write_parquet('stocks_2023.parquet')
crypto_combined.write_parquet('crypto_2024_01.parquet')

print("✅ Data saved to Parquet files")
print("\nFiles created:")
print("  - stocks_2023.parquet")
print("  - crypto_2024_01.parquet")

# Can also save to CSV for compatibility
# combined.write_csv('stocks_2023.csv')

## 6. Data Caching

RustyBT supports caching to avoid re-downloading data.

In [None]:
from rustybt.data.catalog import DataCatalog

# Initialize catalog with caching
catalog = DataCatalog(cache_dir='./data_cache')

# Register data source
catalog.register(
    name='stocks_2023',
    adapter=yf_adapter,
    symbols=['AAPL', 'GOOGL', 'MSFT'],
    start_date=pd.Timestamp('2023-01-01'),
    end_date=pd.Timestamp('2023-12-31')
)

# First call downloads data
data1 = catalog.load('stocks_2023')
print(f"First load: {len(data1)} bars (downloaded)")

# Second call uses cache
data2 = catalog.load('stocks_2023')
print(f"Second load: {len(data2)} bars (from cache - instant!)")

## Next Steps

Now that you have data:

1. **03_strategy_development.ipynb** - Build trading strategies with this data
2. **10_full_workflow.ipynb** - See complete workflow from data to results

## Key Takeaways

- ✅ Multiple data sources supported (stocks, crypto, custom)
- ✅ Built-in data validation catches errors early
- ✅ Efficient Parquet storage for fast backtests
- ✅ Caching prevents redundant downloads
- ✅ Progress bars for long downloads