# 📊 01 - Data Collection

This notebook is the first step in our crypto regime classification and trading strategy project. Here, we focus on gathering the data needed to perform PCA, clustering, and downstream trading analysis.

---

### 📥 1. Load and Inspect Raw Price Data

We begin by collecting daily closing prices for a basket of cryptocurrencies from the CoinGecko API and saving them to `crypto_prices.csv`. These prices are the foundation for the return and feature calculations in later notebooks.

- Source: `crypto_prices.csv`
- Content: Historical daily closing prices
- Format: Coins in columns, dates in rows

```python
import pandas as pd

# Load price data (the CSV is produced by the collection cells below)
price_df = pd.read_csv("crypto_prices.csv", parse_dates=["timestamp"])
price_df.set_index("timestamp", inplace=True)
price_df.head()
```

In [35]:
import os

import pandas as pd
import requests
from dotenv import load_dotenv

# Load API key from .env
load_dotenv()
api_key = os.getenv("COINGECKO_DEMO_KEY")

def fetch_daily_prices_demo(coin_id, api_key, vs_currency="usd", days="365"):
    """Return a DataFrame of daily prices for one coin, indexed by timestamp."""
    url = f"https://api.coingecko.com/api/v3/coins/{coin_id}/market_chart"

    headers = {
        "x-cg-demo-api-key": api_key  # DEMO key header
    }

    params = {
        "vs_currency": vs_currency,
        "days": days,
        "interval": "daily"
    }

    # A timeout keeps a stalled request from hanging the notebook
    response = requests.get(url, headers=headers, params=params, timeout=30)
    if response.status_code != 200:
        raise RuntimeError(f"Error {response.status_code}: {response.text}")

    data = response.json()
    prices = data.get("prices", [])  # list of [timestamp_ms, price] pairs

    df = pd.DataFrame(prices, columns=["timestamp", coin_id])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    df.set_index("timestamp", inplace=True)
    return df
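
The save cell that follows expects a list named `all_data`, but the cell that builds it is not shown in this notebook. A minimal sketch of such a collection loop is given here, assuming one request per coin with a 6-second pause between requests. The helper `collect_prices`, the `COIN_IDS` list, and the last four coin IDs are assumptions for illustration, not the notebook's original code.

```python
import time

# CoinGecko coin IDs. The first eleven appear in this notebook's output;
# the last four are assumed IDs for MATIC, SHIB, LTC, and TRX.
COIN_IDS = [
    "bitcoin", "ethereum", "tether", "ripple", "binancecoin", "usd-coin",
    "solana", "cardano", "dogecoin", "avalanche-2", "polkadot",
    "matic-network", "shiba-inu", "litecoin", "tron",
]

def collect_prices(coin_ids, fetch, delay_seconds=6.0, sleep=time.sleep):
    """Call fetch(coin_id) for each coin, pausing between requests.

    Coins whose request fails are skipped so one error does not abort the run.
    """
    frames = []
    for i, coin_id in enumerate(coin_ids):
        try:
            frames.append(fetch(coin_id))
            print(f"Fetched {coin_id}")
        except Exception as exc:
            print(f"Skipping {coin_id}: {exc}")
        # Pause between requests, but not after the last one
        if i < len(coin_ids) - 1:
            sleep(delay_seconds)
    return frames
```

With the function defined earlier, the list could then be built as `all_data = collect_prices(COIN_IDS, lambda c: fetch_daily_prices_demo(c, api_key))`. Passing the fetcher and the sleep function as arguments keeps the loop testable without network access.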

In [None]:
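# Combine and save if successful.
# `all_data` is the list of per-coin DataFrames returned by
# fetch_daily_prices_demo (one DataFrame per coin).
if all_data:
    price_df = pd.concat(all_data, axis=1)  # one column per coin, aligned on timestamp
    price_df.to_csv("../data/crypto_prices.csv")
    print("📁 Saved: ../data/crypto_prices.csv")
else:
    print("❌ No data collected.")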

## 🎯 Cryptocurrency Selection Strategy

**Why these 15 coins?**
- **Top Market Cap**: Bitcoin, Ethereum, BNB, XRP, ADA, SOL (market leaders)
- **Stablecoins**: USDT, USDC (USD-pegged; deviations from the peg can signal market stress)
- **Smart-Contract Platforms**: DOT, AVAX, MATIC (Layer 1 and scaling networks, different sector exposure)
- **Meme/Retail**: DOGE, SHIB (sentiment indicators)
- **Established Alts**: LTC, TRX (proven track records)

This selection spans market-cap tiers, use cases, and investor sentiment, so the regime analysis does not depend on a single corner of the market.

**Data Specifications:**
- **Source**: CoinGecko API (`/coins/{coin_id}/market_chart` endpoint, demo API key)
- **Frequency**: Daily prices (`interval="daily"`)
- **Period**: Most recent 365 days (`days="365"`), i.e. one year of daily observations per coin
- **Rate Limiting**: 6-second delay between requests to stay within API limits

In [39]:
# Load and display the collected data
# (adjust the path if the file was saved to ../data/crypto_prices.csv)
price_df = pd.read_csv("crypto_prices.csv", index_col="timestamp", parse_dates=True)
print(price_df.head())

                 bitcoin     ethereum    tether    ripple  binancecoin  \
timestamp                                                                
2024-06-05  70600.011167  3814.932030  1.000157  0.525878   686.510668   
2024-06-06  71184.599431  3871.082091  1.000756  0.526266   699.924112   
2024-06-07  70759.588193  3812.701857  0.999556  0.521673   710.043483   
2024-06-08  69325.362388  3679.376652  0.999594  0.498861   683.338328   
2024-06-09  69315.104123  3683.025380  0.999938  0.493173   682.782750   

            usd-coin      solana   cardano  dogecoin  avalanche-2  polkadot  \
timestamp                                                                     
2024-06-05  1.000233  171.728129  0.461416  0.161543    36.091757  7.191048   
2024-06-06  1.000322  173.769571  0.461678  0.163508    36.535981  7.254035   
2024-06-07  0.999933  170.372720  0.458149  0.160168    35.926170  7.141668   
2024-06-08  0.999985  162.453205  0.449497  0.148307    33.509910  6.658967   
2024-06

In [None]:
# Data Quality Validation
print("📊 Data Quality Report")
print("=" * 50)
print(f"Date Range: {price_df.index.min().date()} to {price_df.index.max().date()}")
print(f"Total Days: {len(price_df)} observations")
print(f"Cryptocurrencies: {len(price_df.columns)} assets")
print()

# Check for missing data
missing_data = price_df.isnull().sum()
print("Missing Data per Asset:")
for coin, missing in missing_data.items():
    if missing > 0:
        print(f"  ❌ {coin}: {missing} missing values ({missing/len(price_df)*100:.1f}%)")
    else:
        print(f"  ✅ {coin}: Complete data")

print()
print("Price Range Validation:")
for coin in price_df.columns:
    min_price = price_df[coin].min()
    max_price = price_df[coin].max()
    print(f"  {coin}: ${min_price:.6f} - ${max_price:.2f}")

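# Only report success when no asset has missing values
if missing_data.sum() == 0:
    print("\n✅ Data collection complete and validated!")
else:
    print("\n⚠️ Data collected, but some assets have missing values.")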

## 🔄 Analysis Pipeline Next Steps

With clean price data collected, the analysis proceeds as follows:

**1. Feature Engineering** → `feature_engineering.ipynb`
- Calculate technical indicators (RSI, MACD, Bollinger Bands)
- Compute momentum and volatility measures
- Generate risk-adjusted return metrics

**2. PCA Analysis** → `pca_analysis.ipynb`  
- Reduce 105+ features to 5 principal components
- Identify key drivers of market behavior
- Explain variance and interpret components

**3. Clustering** → `clustering_analysis.ipynb`
- Apply K-means to PCA components
- Identify distinct market regimes
- Label each day with regime classification

**4. Regime Analysis** → Ongoing research
- Analyze regime characteristics and transitions
- Develop regime-aware trading strategies
- Backtest performance across different market conditions
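
To make the flow of steps 1 to 3 concrete, here is a minimal sketch of the pipeline on the price table alone. It is an illustration under stated assumptions, not the contents of the downstream notebooks: daily log returns stand in for the full engineered feature set, the function name `label_regimes` is hypothetical, and the choice of 3 regimes is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def label_regimes(price_df, n_components=5, n_regimes=3, random_state=0):
    """Return one regime label per day from prices (coins in columns)."""
    # Placeholder features: daily log returns per coin (assumption)
    returns = np.log(price_df / price_df.shift(1)).dropna()

    # Standardize so no single coin dominates the principal components
    scaled = StandardScaler().fit_transform(returns)

    # Cannot extract more components than there are features
    n_components = min(n_components, scaled.shape[1])
    components = PCA(n_components=n_components).fit_transform(scaled)

    # K-means on the PCA scores; each cluster is treated as a regime
    model = KMeans(n_clusters=n_regimes, n_init=10, random_state=random_state)
    labels = model.fit_predict(components)
    return pd.Series(labels, index=returns.index, name="regime")
```

Calling `label_regimes(price_df)` would yield a Series with one integer label per day, starting on the second date because the first day has no return.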

---
**Output**: `crypto_prices.csv` → Ready for feature engineering pipeline