# YFinanceDataHandler User Guide

This notebook provides a comprehensive guide to using the `YFinanceDataHandler` class from the `algoshort` package.

## Table of Contents
1. [Setup and Installation](#1-setup-and-installation)
2. [Basic Usage](#2-basic-usage)
3. [Downloading Data](#3-downloading-data)
4. [Accessing and Manipulating Data](#4-accessing-and-manipulating-data)
5. [Massive Download Example](#5-massive-download-example)
6. [Saving and Loading Data](#6-saving-and-loading-data)
7. [Continuing with Algoshort Workflow](#7-continuing-with-algoshort-workflow)
8. [Cache Management](#8-cache-management)
9. [Best Practices](#9-best-practices)

## 1. Setup and Installation

First, ensure you have the required packages installed.

In [None]:
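# Install dependencies if needed (uncomment to run)
# pyarrow is used for parquet files, openpyxl for the Excel export in section 6,
# and matplotlib for the plot in section 7
# !pip install yfinance pandas numpy pyarrow openpyxl matplotlib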

In [None]:
# Import required libraries
import sys
from pathlib import Path

# Add parent directory to path if running from notebooks folder
sys.path.insert(0, str(Path.cwd().parent))

import pandas as pd
import numpy as np
import logging
from datetime import datetime

# Import YFinanceDataHandler
from algoshort.yfinance_handler import YFinanceDataHandler

print("Imports successful!")

## 2. Basic Usage

### 2.1 Creating a Handler Instance

The handler can be initialized with various options:

In [None]:
# Basic initialization (no caching)
handler_basic = YFinanceDataHandler()
print(f"Basic handler: {handler_basic}")

In [None]:
# Advanced initialization with caching and custom settings
handler = YFinanceDataHandler(
    cache_dir="../data/cache",      # Directory for caching downloaded data
    enable_logging=True,             # Enable logging output
    chunk_size=50,                   # Symbols per download batch
    log_level=logging.INFO           # Logging verbosity
)
print(f"Advanced handler: {handler}")
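The `chunk_size` option controls how many symbols go into each download batch. The handler's internal batching is not shown in this guide, so the following is only a minimal sketch of the idea; `chunk_symbols` is a hypothetical helper, not part of `algoshort`.

```python
from typing import Iterator, List


def chunk_symbols(symbols: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `chunk_size` symbols.

    Hypothetical helper for illustration; not the handler's own code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(symbols), chunk_size):
        yield symbols[start:start + chunk_size]


# Five symbols with chunk_size=2 give three batches; the last one is shorter
batches = list(chunk_symbols(["AAPL", "MSFT", "GOOGL", "AMZN", "META"], chunk_size=2))
print(batches)
```

Smaller batches mean more requests, but a failed batch affects fewer symbols.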

### 2.2 Understanding Period and Interval Options

The handler supports both yfinance native formats and user-friendly aliases:

In [None]:
# Available period options
print("Period Options:")
print("-" * 50)
for alias, yf_period in handler.period_map.items():
    print(f"  '{alias}' -> '{yf_period}'")

In [None]:
# Available interval options
print("\nInterval Options:")
print("-" * 50)
for alias, yf_interval in handler.interval_map.items():
    print(f"  '{alias}' -> '{yf_interval}'")
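Alias resolution of this kind usually comes down to a dictionary lookup that falls back to the input when no alias matches. The sketch below assumes that behaviour; `EXAMPLE_INTERVAL_MAP` and `resolve_alias` are hypothetical names, and the real mappings are whatever `handler.period_map` and `handler.interval_map` print above.

```python
# Hypothetical alias table for illustration only; the values are native
# yfinance interval strings, but the keys are assumed, not taken from the handler.
EXAMPLE_INTERVAL_MAP = {"daily": "1d", "weekly": "1wk", "monthly": "1mo"}


def resolve_alias(value: str, alias_map: dict) -> str:
    """Translate a user-friendly alias; pass unknown (native) values through."""
    key = value.strip().lower()
    return alias_map.get(key, key)


print(resolve_alias("Daily", EXAMPLE_INTERVAL_MAP))  # alias is translated
print(resolve_alias("1d", EXAMPLE_INTERVAL_MAP))     # native value is kept
```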

## 3. Downloading Data

### 3.1 Single Symbol Download

In [None]:
# Download data for a single symbol
data = handler.download_data(
    symbols='AAPL',
    period='1y',        # 1 year of data
    interval='1d',      # Daily intervals
    use_cache=True      # Use cached data if available
)

print(f"Downloaded data for: {list(data.keys())}")
print(f"\nAAPL data shape: {data['AAPL'].shape}")
data['AAPL'].head()

### 3.2 Multiple Symbols Download

In [None]:
# Download data for multiple symbols
tech_stocks = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META']

data = handler.download_data(
    symbols=tech_stocks,
    period='2y',
    interval='daily',   # Using user-friendly alias
    use_cache=True
)

print(f"Downloaded {len(data)} symbols")
for symbol, df in data.items():
    print(f"  {symbol}: {len(df)} rows, columns: {list(df.columns)}")

### 3.3 Download with Date Range

In [None]:
# Download with specific date range
data = handler.download_data(
    symbols='SPY',
    start='2022-01-01',
    end='2023-12-31',
    interval='1d'
)

spy_data = data['SPY']
print(f"SPY data from {spy_data.index.min()} to {spy_data.index.max()}")
print(f"Total trading days: {len(spy_data)}")

## 4. Accessing and Manipulating Data

### 4.1 Get Data for a Single Symbol

In [None]:
# Get all data for a symbol
aapl_data = handler.get_data('AAPL')
print(f"AAPL data shape: {aapl_data.shape}")
aapl_data.tail()

In [None]:
# Get specific columns only
aapl_prices = handler.get_data('AAPL', columns=['open', 'close', 'volume'])
print(f"Columns: {list(aapl_prices.columns)}")
aapl_prices.head()

### 4.2 Get OHLC Data (Analysis-Ready Format)

In [None]:
# Get OHLC data formatted for analysis
ohlc = handler.get_ohlc_data('AAPL')
print(f"OHLC columns: {list(ohlc.columns)}")
print(f"Index name: {ohlc.index.name}")
ohlc.head()

### 4.3 Combined Data (Long Format)

In [None]:
# Get combined data for multiple symbols (long/row-bound format)
combined = handler.get_combined_data(
    symbols=['AAPL', 'MSFT', 'GOOGL'],
    columns=['close', 'volume']
)

print(f"Combined data shape: {combined.shape}")
print(f"Columns: {list(combined.columns)}")
print(f"\nUnique symbols: {combined['symbol'].unique()}")
combined.head(10)

### 4.4 Multiple Symbols Data (Wide Format)

In [None]:
# Get data in wide format (each symbol as a column)
wide_data = handler.get_multiple_symbols_data(
    symbols=['AAPL', 'MSFT', 'GOOGL'],
    column='close'
)

print(f"Wide format shape: {wide_data.shape}")
print(f"Columns: {list(wide_data.columns)}")
wide_data.head()
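The long and wide layouts hold the same information and can be converted into each other with plain pandas. The example below uses a small made-up table (the prices are illustrative, not real quotes) and assumes the long format has `date`, `symbol`, and `close` columns.

```python
import pandas as pd

# Made-up long-format data: one row per (date, symbol) pair
long_df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"]),
    "symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "close": [185.0, 370.0, 184.0, 371.0],
})

# Long -> wide: one row per date, one column per symbol
wide_df = long_df.pivot(index="date", columns="symbol", values="close")

# Wide -> long: melt the symbol columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars="date", var_name="symbol", value_name="close"
)

print(wide_df)
print(back_to_long)
```

The wide layout is convenient for correlation matrices and cross-sectional comparisons; the long layout is easier to filter and group by symbol.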

### 4.5 Get Company Information

In [None]:
# Get company info
info = handler.get_info('AAPL')

# Display key information
key_fields = ['longName', 'sector', 'industry', 'marketCap', 'trailingPE', 'dividendYield']
for field in key_fields:
    if field in info:
        print(f"{field}: {info[field]}")

## 5. Massive Download Example

This section demonstrates how to download data for a large number of symbols efficiently.

In [None]:
# Define a large list of symbols (a sample of S&P 500 constituents plus broad-market and sector ETFs)
# In production, you might load this from a file

SP500_SAMPLE = [
    # Technology
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'NVDA', 'TSLA', 'AMD', 'INTC', 'CRM',
    'ADBE', 'ORCL', 'CSCO', 'AVGO', 'TXN', 'QCOM', 'IBM', 'NOW', 'AMAT', 'MU',
    
    # Finance
    'JPM', 'BAC', 'WFC', 'GS', 'MS', 'C', 'AXP', 'BLK', 'SCHW', 'USB',
    
    # Healthcare
    'UNH', 'JNJ', 'PFE', 'ABBV', 'MRK', 'LLY', 'TMO', 'ABT', 'DHR', 'BMY',
    
    # Consumer
    'WMT', 'PG', 'KO', 'PEP', 'COST', 'HD', 'MCD', 'NKE', 'SBUX', 'TGT',
    
    # Industrial
    'CAT', 'BA', 'HON', 'UPS', 'GE', 'MMM', 'LMT', 'RTX', 'DE', 'UNP',
    
    # ETFs (broad-market index trackers and sector funds)
    'SPY', 'QQQ', 'IWM', 'DIA', 'VTI', 'VOO', 'XLF', 'XLK', 'XLE', 'XLV'
]

print(f"Total symbols to download: {len(SP500_SAMPLE)}")

In [None]:
# Create a handler optimized for massive downloads
massive_handler = YFinanceDataHandler(
    cache_dir="../data/massive_cache",  # Cache directory
    enable_logging=True,
    chunk_size=25,                       # Smaller chunks for stability
    log_level=logging.INFO
)

print(f"Handler ready: {massive_handler}")

In [None]:
%%time
# Massive download with progress tracking
print(f"Starting download of {len(SP500_SAMPLE)} symbols...")
print("="*60)

data = massive_handler.download_data(
    symbols=SP500_SAMPLE,
    period='5y',           # 5 years of historical data
    interval='1d',         # Daily data
    use_cache=True,        # Use cache to avoid re-downloading
    threads=True           # Enable multi-threading
)

print("="*60)
print(f"\nDownload complete!")
print(f"Successfully downloaded: {len(data)} symbols")
print(f"Failed symbols: {len(SP500_SAMPLE) - len(data)}")
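The count above says how many symbols failed but not which ones. Assuming, as in the cells above, that `download_data` returns a dictionary keyed by symbol, the missing ones can be found by comparing the request list with the keys. `find_failed_symbols` is a hypothetical helper, shown here with stand-in data.

```python
from typing import Dict, List


def find_failed_symbols(requested: List[str], data: Dict[str, object]) -> List[str]:
    """Return requested symbols that are absent from the result, in request order."""
    return [symbol for symbol in requested if symbol not in data]


# Stand-in for a download result: two of the three requested symbols came back
requested = ["AAPL", "MSFT", "NOTREAL"]
downloaded = {"AAPL": object(), "MSFT": object()}

failed = find_failed_symbols(requested, downloaded)
print(f"Failed symbols: {failed}")
```

In the notebook this would be called as `find_failed_symbols(SP500_SAMPLE, data)`, and the result could be retried in a second download.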

In [None]:
# View summary of downloaded data
summary = massive_handler.list_available_data()

print(f"\nData Summary ({len(summary)} symbols):")
print("-" * 70)

# Show sample
for symbol, info in list(summary.items())[:10]:
    print(f"{symbol:6s} | Rows: {info['rows']:5d} | Range: {info['date_range']} | Missing: {info['missing_values']}")

if len(summary) > 10:
    print(f"... and {len(summary) - 10} more symbols")

## 6. Saving and Loading Data

### 6.1 Save to Different Formats

In [None]:
# Create output directory
from pathlib import Path
output_dir = Path("../data/saved")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {output_dir.absolute()}")

In [None]:
# Strategy 1: Save as separate files (one file per symbol)
massive_handler.save_data(
    filepath=str(output_dir / "stocks.parquet"),
    symbols=['AAPL', 'MSFT', 'GOOGL'],
    format='parquet',
    multi_symbol_strategy='separate_files'
)

print("Saved as separate parquet files")

In [None]:
# Strategy 2: Save as single combined file
massive_handler.save_data(
    filepath=str(output_dir / "tech_combined.csv"),
    symbols=['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META'],
    format='csv',
    multi_symbol_strategy='single_file',
    combine_column='close'
)

print("Saved as combined CSV")

In [None]:
# Strategy 3: Save as Excel with multiple sheets
massive_handler.save_data(
    filepath=str(output_dir / "portfolio.xlsx"),
    symbols=['AAPL', 'MSFT', 'JPM', 'UNH', 'WMT'],
    format='excel',
    multi_symbol_strategy='excel_sheets'
)

print("Saved as Excel with sheets")

In [None]:
# Save ALL downloaded data as parquet (best for large datasets)
massive_handler.save_data(
    filepath=str(output_dir / "all_stocks.parquet"),
    format='parquet',
    multi_symbol_strategy='separate_files'
)

print(f"Saved all {len(massive_handler)} symbols")

In [None]:
# List saved files
print("\nSaved files:")
for f in sorted(output_dir.glob("*")):
    size_kb = f.stat().st_size / 1024
    print(f"  {f.name:40s} ({size_kb:.1f} KB)")

### 6.2 Loading Data Back

To continue working with saved data, you can load it directly with pandas.

In [None]:
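# Load parquet files
# The 'separate_files' strategy writes one file per symbol; check the file
# listing above if the names on your system differ from the pattern used here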
loaded_aapl = pd.read_parquet(output_dir / "all_stocks_AAPL.parquet")
print(f"Loaded AAPL: {loaded_aapl.shape}")
loaded_aapl.head()

In [None]:
# Load multiple symbols and combine
symbols_to_load = ['AAPL', 'MSFT', 'GOOGL']
loaded_data = {}

for symbol in symbols_to_load:
    file_path = output_dir / f"all_stocks_{symbol}.parquet"
    if file_path.exists():
        loaded_data[symbol] = pd.read_parquet(file_path)
        print(f"Loaded {symbol}: {len(loaded_data[symbol])} rows")

print(f"\nTotal symbols loaded: {len(loaded_data)}")
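A dictionary of per-symbol DataFrames like `loaded_data` can also be stacked into one long-format table with `pd.concat`. The example uses small made-up frames and assumes each frame has a date index named `date`.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-02", "2024-01-03"]).rename("date")

# Made-up per-symbol frames standing in for the loaded parquet data
frames = {
    "AAPL": pd.DataFrame({"close": [185.0, 184.0]}, index=dates),
    "MSFT": pd.DataFrame({"close": [370.0, 371.0]}, index=dates),
}

# The dict keys become an outer index level, which reset_index turns into a column
combined_long = pd.concat(frames, names=["symbol", "date"]).reset_index()
print(combined_long)
```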

In [None]:
# Create a new handler with the loaded data
# This is useful when you want to continue the algoshort workflow

# First, create a handler
resumed_handler = YFinanceDataHandler(
    cache_dir="../data/massive_cache",
    enable_logging=True
)

# Manually inject the loaded data
for symbol, df in loaded_data.items():
    resumed_handler.data[symbol] = df
    if symbol not in resumed_handler.symbols:
        resumed_handler.symbols.append(symbol)

print(f"Resumed handler: {resumed_handler}")
print(f"Available symbols: {resumed_handler.symbols}")

### 6.3 Using Cache for Session Continuity

The best way to continue work across sessions is to use the built-in cache.

In [None]:
# List what's in the cache
cached_symbols = massive_handler.list_cached_symbols()
print(f"Cached symbols: {len(cached_symbols)}")
print(f"Sample: {cached_symbols[:10]}")

In [None]:
# View detailed cache info
cache_info = massive_handler.list_cached_data()
print(f"\nCache files ({len(cache_info)}):")
print("-" * 80)

for filename, info in list(cache_info.items())[:5]:
    print(f"{info['symbol']:6s} | Period: {info['period']:6s} | Interval: {info['interval']:4s} | Size: {info['size_kb']:7.1f} KB | Modified: {info['last_modified'].strftime('%Y-%m-%d %H:%M')}")

if len(cache_info) > 5:
    print(f"... and {len(cache_info) - 5} more files")

In [None]:
# In a NEW session, create a handler that points at the same cache directory
new_session_handler = YFinanceDataHandler(
    cache_dir="../data/massive_cache",
    enable_logging=True
)

# Download will use cache instead of re-downloading
data = new_session_handler.download_data(
    symbols=['AAPL', 'MSFT', 'GOOGL'],
    period='5y',
    interval='1d',
    use_cache=True  # This will load from cache if available
)

print(f"Data loaded (from cache): {list(data.keys())}")

## 7. Continuing with Algoshort Workflow

Now that data is loaded, you can use it with other algoshort modules.

In [None]:
# Import other algoshort modules
from algoshort.ohlcprocessor import OHLCProcessor
from algoshort.signals import regime_sma, regime_breakout
from algoshort.stop_loss import StopLossCalculator
from algoshort.returns import ReturnsCalculator

print("Algoshort modules imported successfully!")

### 7.1 Calculate Relative Prices

In [None]:
# Get OHLC data for stock and benchmark
stock_ohlc = new_session_handler.get_ohlc_data('AAPL')

# Download benchmark if not available
new_session_handler.download_data('SPY', period='5y', interval='1d', use_cache=True)
benchmark_ohlc = new_session_handler.get_ohlc_data('SPY')

print(f"Stock data: {stock_ohlc.shape}")
print(f"Benchmark data: {benchmark_ohlc.shape}")

In [None]:
# Calculate relative prices
processor = OHLCProcessor()

# Reset index to get date column
stock_df = stock_ohlc.reset_index()
benchmark_df = benchmark_ohlc.reset_index()

relative_data = processor.calculate_relative_prices(
    stock_data=stock_df,
    benchmark_data=benchmark_df,
    rebase=True
)

print(f"Relative data columns: {list(relative_data.columns)}")
relative_data[['date', 'close', 'rclose']].tail()
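Conceptually, a relative price is the stock price divided by the benchmark, with the benchmark optionally rebased to 1.0 at the start so that the relative series begins at the stock's own price. The sketch below shows that idea under those assumptions; `relative_close` is a hypothetical function, and `OHLCProcessor.calculate_relative_prices` may differ in detail.

```python
import pandas as pd


def relative_close(stock_close: pd.Series, benchmark_close: pd.Series,
                   rebase: bool = True) -> pd.Series:
    """Divide the stock by the benchmark (assumed definition, for illustration)."""
    # Align the benchmark to the stock's dates, carrying the last value forward
    benchmark = benchmark_close.reindex(stock_close.index).ffill()
    if rebase:
        # Rebase so the benchmark starts at 1.0
        benchmark = benchmark / benchmark.iloc[0]
    return stock_close / benchmark


dates = pd.date_range("2024-01-01", periods=3, freq="D")
stock = pd.Series([100.0, 110.0, 121.0], index=dates)
bench = pd.Series([400.0, 420.0, 440.0], index=dates)

rel = relative_close(stock, bench)
print(rel)
```

Here the stock gains 21% while the benchmark gains 10%, so the relative series rises from 100 to about 110.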

### 7.2 Generate Trading Signals

In [None]:
# Generate SMA regime signal
signal_data = regime_sma(
    df=relative_data,
    close_col='rclose',
    fast_period=50,
    slow_period=200
)

print(f"Signal columns added: {[c for c in signal_data.columns if 'sma' in c.lower()]}")
signal_data[['date', 'rclose', 'sma_50_200']].tail(10)
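For intuition, a moving-average regime can be sketched as the sign of the gap between a fast and a slow simple moving average. This assumes a +1 / -1 encoding (the plot later in this section treats `1` as the long signal); `sma_regime` is a hypothetical function, not the source of `regime_sma`.

```python
import numpy as np
import pandas as pd


def sma_regime(close: pd.Series, fast: int, slow: int) -> pd.Series:
    """+1 when the fast SMA is above the slow SMA, -1 when below, NaN during warm-up."""
    fast_sma = close.rolling(fast).mean()
    slow_sma = close.rolling(slow).mean()
    return np.sign(fast_sma - slow_sma)


rising = pd.Series([float(x) for x in range(1, 11)])
falling = rising.iloc[::-1].reset_index(drop=True)

up_regime = sma_regime(rising, fast=2, slow=4)
down_regime = sma_regime(falling, fast=2, slow=4)
print(up_regime.tolist())
print(down_regime.tolist())
```

The first `slow - 1` values are NaN because the slow average needs a full window before it is defined.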

In [None]:
# Generate breakout signal
signal_data = regime_breakout(
    df=signal_data,
    high_col='rhigh',
    low_col='rlow',
    window=20
)

print("Breakout signal added")
signal_data[['date', 'rclose', 'sma_50_200', 'bo_20']].tail(10)
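A breakout regime can be sketched in a similar way: go long when the high sets a new rolling maximum, go short when the low sets a new rolling minimum, and otherwise keep the previous state. This is an illustrative sketch under those assumptions; `breakout_regime` is a hypothetical function and `regime_breakout` may be implemented differently.

```python
import numpy as np
import pandas as pd


def breakout_regime(high: pd.Series, low: pd.Series, window: int) -> pd.Series:
    """+1 after a new rolling high, -1 after a new rolling low, held until the next breakout."""
    new_high = high == high.rolling(window).max()
    new_low = low == low.rolling(window).min()
    raw = pd.Series(
        np.where(new_high, 1.0, np.where(new_low, -1.0, np.nan)),
        index=high.index,
    )
    # Carry the last breakout direction forward between events
    return raw.ffill()


high = pd.Series([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0])
low = high - 0.5

bo = breakout_regime(high, low, window=3)
print(bo.tolist())
```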

### 7.3 Calculate Stop Losses

In [None]:
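# Prepare data for stop loss calculator
# Drop the absolute OHLC columns first so that renaming the relative columns
# does not create duplicate 'open'/'high'/'low'/'close' column names
stop_data = signal_data.drop(
    columns=['open', 'high', 'low', 'close'], errors='ignore'
).rename(columns={
    'ropen': 'open',
    'rhigh': 'high',
    'rlow': 'low',
    'rclose': 'close'
})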

# Create stop loss calculator
stop_calc = StopLossCalculator(stop_data)

# Calculate ATR-based stop loss for the SMA signal
result = stop_calc.atr_stop_loss(
    signal='sma_50_200',
    window=14,
    multiplier=2.0
)

print("Stop loss column: sma_50_200_stop_loss")
result[['close', 'sma_50_200', 'sma_50_200_stop_loss']].tail(10)
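An ATR-based stop places the stop a multiple of the average true range away from the close, below it for long positions and above it for short positions. The sketch below uses a simple rolling mean of the true range (Wilder's smoothing is a common alternative); `average_true_range` and `atr_stop` are hypothetical functions, and `StopLossCalculator.atr_stop_loss` may compute things differently.

```python
import pandas as pd


def average_true_range(high: pd.Series, low: pd.Series, close: pd.Series,
                       window: int) -> pd.Series:
    """Simple rolling mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    return true_range.rolling(window).mean()


def atr_stop(close: pd.Series, atr: pd.Series, signal: pd.Series,
             multiplier: float) -> pd.Series:
    """Stop below the close for longs (+1), above it for shorts (-1), NaN when flat."""
    stop = close - signal * multiplier * atr
    return stop.where(signal != 0)


close = pd.Series([100.0] * 5)
high, low = close + 1.0, close - 1.0
signal = pd.Series([1, 1, 1, -1, 0])

atr_values = average_true_range(high, low, close, window=3)
stops = atr_stop(close, atr_values, signal, multiplier=2.0)
print(stops.tolist())
```

With a constant true range of 2 and a multiplier of 2, the long stop sits 4 below the close and the short stop 4 above it.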

### 7.4 Calculate Returns

In [None]:
# Create returns calculator
# Need both absolute and relative columns
returns_data = result.copy()

# Add relative columns back (prefixed with 'r')
returns_data['ropen'] = signal_data['ropen']
returns_data['rhigh'] = signal_data['rhigh']
returns_data['rlow'] = signal_data['rlow']
returns_data['rclose'] = signal_data['rclose']

returns_calc = ReturnsCalculator(returns_data)

# Calculate returns for the SMA signal
final_result = returns_calc.get_returns(
    df=returns_data,
    signal='sma_50_200',
    relative=False  # Use the un-prefixed columns (these hold relative prices after the rename in 7.3)
)

print("Returns columns added:")
returns_cols = [c for c in final_result.columns if 'sma_50_200' in c and c != 'sma_50_200']
print(returns_cols)

In [None]:
# View final results
display_cols = ['close', 'sma_50_200', 'sma_50_200_stop_loss', 
                'sma_50_200_returns', 'sma_50_200_cumul']
final_result[display_cols].tail(10)
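As a rough mental model, strategy returns are the asset's period returns multiplied by the position held going into each period, and the cumulative figure compounds them. The sketch assumes simple (not log) returns and a one-bar lag on the signal; `strategy_returns` is a hypothetical function and `ReturnsCalculator.get_returns` may use different conventions.

```python
import pandas as pd


def strategy_returns(close: pd.Series, signal: pd.Series) -> pd.DataFrame:
    """Per-period and cumulative returns for a lagged position signal."""
    asset_returns = close.pct_change(fill_method=None)
    # Yesterday's signal earns today's return, which avoids look-ahead bias
    returns = signal.shift(1) * asset_returns
    cumul = (1 + returns.fillna(0)).cumprod() - 1
    return pd.DataFrame({"returns": returns, "cumul": cumul})


close = pd.Series([100.0, 110.0, 99.0])
signal = pd.Series([1, 1, 1])

perf = strategy_returns(close, signal)
print(perf)
```

A long position through +10% and then -10% ends slightly below the starting value, which is why the cumulative return is about -1% rather than zero.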

In [None]:
# Plot cumulative returns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(12, 8), sharex=True)

# Price and signal
axes[0].plot(final_result['date'], final_result['close'], label='Price', alpha=0.7)
axes[0].fill_between(
    final_result['date'],
    final_result['close'].min(),
    final_result['close'].max(),
    where=final_result['sma_50_200'] == 1,
    alpha=0.3,
    color='green',
    label='Long Signal'
)
axes[0].set_ylabel('Price')
axes[0].legend()
axes[0].set_title('AAPL vs SPY - SMA 50/200 Strategy')

# Cumulative returns
axes[1].plot(final_result['date'], final_result['sma_50_200_cumul'] * 100, 
             label='Strategy Returns', color='blue')
axes[1].axhline(y=0, color='black', linestyle='--', alpha=0.3)
axes[1].set_ylabel('Cumulative Returns (%)')
axes[1].set_xlabel('Date')
axes[1].legend()

plt.tight_layout()
plt.show()

## 8. Cache Management

In [None]:
# View cache status
print(f"Cache directory: {massive_handler.cache_dir}")
print(f"Cached symbols: {len(massive_handler.list_cached_symbols())}")

In [None]:
# Clear cache for specific symbols
removed = massive_handler.clear_cache(symbols=['AAPL', 'MSFT'])
print(f"Removed {removed} cache files")

In [None]:
# Clear entire cache (uncomment to run)
# removed = massive_handler.clear_cache()
# print(f"Removed {removed} cache files")

## 9. Best Practices

### 9.1 Recommended Workflow for Large Projects

In [None]:
# Best practice workflow

def create_analysis_pipeline(symbols, cache_dir="../data/cache"):
    """
    Create a complete analysis pipeline with proper error handling.
    
    Args:
        symbols: List of stock symbols
        cache_dir: Cache directory path
        
    Returns:
        Dictionary with handler and processed data
    """
    
    # Step 1: Initialize handler with caching
    handler = YFinanceDataHandler(
        cache_dir=cache_dir,
        enable_logging=True,
        chunk_size=25  # Conservative chunk size
    )
    
    # Step 2: Download data (will use cache if available)
    try:
        data = handler.download_data(
            symbols=symbols,
            period='5y',
            interval='1d',
            use_cache=True
        )
        print(f"Successfully loaded {len(data)} symbols")
    except Exception as e:
        print(f"Error downloading data: {e}")
        return None
    
    # Step 3: Validate data quality
    summary = handler.list_available_data()
    for symbol, info in summary.items():
        if info['missing_values'] > 0:
            print(f"Warning: {symbol} has {info['missing_values']} missing values")
    
    return {
        'handler': handler,
        'data': data,
        'summary': summary
    }

# Example usage (a separate name avoids overwriting `result` from section 7.3)
pipeline = create_analysis_pipeline(['AAPL', 'MSFT', 'GOOGL'])
if pipeline:
    print(f"\nPipeline ready with {len(pipeline['data'])} symbols")

### 9.2 Tips for Production Use

1. **Always use caching** - Reduces API calls and speeds up development
2. **Use appropriate chunk sizes** - Smaller chunks (25-50) for stability, larger (100+) for speed
3. **Handle errors gracefully** - Some symbols may fail, continue with successful ones
4. **Save intermediate results** - Use parquet format for best performance
5. **Monitor cache age** - Data older than 24h is automatically refreshed
6. **Use logging** - Enable logging to track download progress and issues
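The 24-hour refresh rule in tip 5 can be pictured as a check on each cache file's modification time. This is only a sketch of that idea using the standard library; `is_cache_fresh` and the file name are hypothetical, and the handler's own freshness check is not shown in this guide.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # 24 hours, as stated in tip 5


def is_cache_fresh(path: Path, max_age_seconds: float = MAX_AGE_SECONDS,
                   now: Optional[float] = None) -> bool:
    """Return True if the file exists and was modified within the allowed age."""
    if not path.exists():
        return False
    now = time.time() if now is None else now
    return (now - path.stat().st_mtime) <= max_age_seconds


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "AAPL_5y_1d.parquet"  # hypothetical file name
    cache_file.write_bytes(b"")
    fresh_now = is_cache_fresh(cache_file)

    # Backdate the file by two days to simulate a stale cache entry
    two_days_ago = time.time() - 2 * 24 * 60 * 60
    os.utime(cache_file, (two_days_ago, two_days_ago))
    fresh_after_aging = is_cache_fresh(cache_file)

    missing = is_cache_fresh(Path(tmp) / "missing.parquet")

print(fresh_now, fresh_after_aging, missing)
```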

---

## Summary

This guide covered:

1. **Setup** - Creating handlers with various configurations
2. **Downloading** - Single, multiple, and massive symbol downloads
3. **Data Access** - Multiple formats (long, wide, OHLC)
4. **Persistence** - Saving and loading data across sessions
5. **Workflow Integration** - Using data with other algoshort modules
6. **Cache Management** - Efficient data reuse

For more information, refer to the module docstrings and the `results_analysis_yfinance_handler.md` document.