# Research Pipeline with Custom Fundamental Data

This notebook demonstrates how to load custom fundamental data from CSV files into a SQLite database and use it in Zipline Pipeline for quantitative research and stock screening.

## What You'll Learn

1. **Setup**: Creating a custom data database for fundamentals
2. **Loading Data**: Importing CSV data with symbol-to-sid mapping
3. **Pipeline Basics**: Creating DataSets from your custom data
4. **Factor Analysis**: Building factors from fundamental metrics
5. **Screening**: Filtering stocks based on fundamental criteria
6. **Ranking**: Scoring and ranking stocks for investment decisions
7. **Integration**: Combining fundamentals with price data
8. **Visualization**: Analyzing results with charts

## Use Cases

- **Value Investing**: Screen for low P/E, high ROE stocks
- **Quality Analysis**: Identify companies with strong fundamentals
- **Sector Rotation**: Compare metrics across sectors
- **Factor Research**: Test custom factors based on fundamentals
- **Portfolio Construction**: Build portfolios using fundamental signals

## Data Requirements

You'll need two CSV files:

1. **Fundamentals CSV** (`sample_fundamentals.csv`): Your fundamental data
   - Required: `Ticker`, `Date` columns
   - Data columns: Any metrics you want (Revenue, EPS, ROE, etc.)

2. **Securities CSV** (`sample_securities.csv`): Symbol-to-Sid mapping
   - Required: `Symbol`, `Sid` columns
   - Optional: `Name`, `Exchange`, `Sector` for reference
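
Before loading anything, it can help to confirm that both files carry the required columns and that every ticker has a Sid. The following is a minimal sketch using only the standard library; the inline rows are made-up placeholders, and the helper names `read_rows` and `find_unmapped_tickers` are hypothetical (they are not part of `zipline.data.custom`).

```python
import csv
import io

# Made-up rows for illustration only; replace with the contents of your files.
FUNDAMENTALS_TEXT = """Ticker,Date,Revenue,EPS,ROE
AAPL,2023-09-30,1000,1.5,25.0
MSFT,2023-09-30,900,2.0,30.0
ZZZZ,2023-09-30,10,0.1,2.0
"""

SECURITIES_TEXT = """Symbol,Sid,Name
AAPL,1,Apple Inc.
MSFT,2,Microsoft Corp.
"""


def read_rows(text, required):
    """Parse CSV text and raise ValueError if a required column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [col for col in required if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


def find_unmapped_tickers(fundamentals_text, securities_text):
    """Return the sorted tickers that have no row in the securities mapping."""
    fundamentals = read_rows(fundamentals_text, ["Ticker", "Date"])
    securities = read_rows(securities_text, ["Symbol", "Sid"])
    mapped = {row["Symbol"] for row in securities}
    return sorted({row["Ticker"] for row in fundamentals} - mapped)


unmapped = find_unmapped_tickers(FUNDAMENTALS_TEXT, SECURITIES_TEXT)
print(unmapped)
```

Any ticker reported here would need a row in the securities file before its fundamentals can be mapped to a Sid.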

---

## Part 1: Setup and Database Creation

First, we'll import the necessary modules and create a database for our fundamental data.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Import Zipline custom data module
from zipline.data.custom import (
    create_custom_db,
    load_csv_to_db,
    describe_custom_db,
    list_custom_dbs,
    get_prices,
    make_custom_dataset_class,
    CustomSQLiteLoader,
)

# Import Zipline Pipeline
from zipline.pipeline import Pipeline, CustomFactor
from zipline.pipeline.data import EquityPricing
from zipline.pipeline.factors import SimpleMovingAverage, Returns
from zipline.pipeline.filters import StaticAssets
from zipline.pipeline.engine import SimplePipelineEngine
from zipline.data.bundles import load as load_bundle
from zipline.utils.calendar_utils import get_calendar

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✓ Imports successful!")

### Define the Fundamental Data Schema

Specify the columns in your fundamental data and their types:
- `int`: Integer values (e.g., shares outstanding)
- `float`: Decimal values (e.g., ratios, percentages)
- `text`: String values (e.g., sector names)
- `date`: Date values
- `datetime`: Timestamp values
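
To make the type names concrete, here is a rough sketch of how such a schema could translate into a SQLite table. The type mapping, the `(Sid, Date)` primary key, and the helper `build_create_table` are assumptions for illustration; `create_custom_db` may lay out its tables differently.

```python
import sqlite3

# Assumed mapping from the notebook's type names to SQLite column types.
SQLITE_TYPES = {
    "int": "INTEGER",
    "float": "REAL",
    "text": "TEXT",
    "date": "TEXT",      # stored as ISO-8601 strings, e.g. '2023-09-30'
    "datetime": "TEXT",
}


def build_create_table(table, columns):
    """Build a CREATE TABLE statement keyed on (Sid, Date)."""
    parts = ['"Sid" INTEGER NOT NULL', '"Date" TEXT NOT NULL']
    parts += [f'"{name}" {SQLITE_TYPES[kind]}' for name, kind in columns.items()]
    parts.append('PRIMARY KEY ("Sid", "Date")')
    return f'CREATE TABLE "{table}" ({", ".join(parts)})'


conn = sqlite3.connect(":memory:")
conn.execute(
    build_create_table(
        "fundamentals",
        {"Revenue": "int", "ROE": "float", "Sector": "text"},
    )
)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
created = [row[1] for row in conn.execute("PRAGMA table_info(fundamentals)")]
print(created)
```

An unknown type name raises `KeyError` in this sketch, which surfaces schema typos early.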

In [None]:
# Define the schema for our fundamental data
# This should match the columns in your CSV file (excluding Ticker and Date)

fundamental_columns = {
    # Income Statement
    'Revenue': 'int',
    'NetIncome': 'int',
    
    # Balance Sheet
    'TotalAssets': 'int',
    'TotalEquity': 'int',
    'SharesOutstanding': 'int',
    
    # Per-Share Metrics
    'EPS': 'float',
    'BookValuePerShare': 'float',
    
    # Financial Ratios
    'ROE': 'float',              # Return on Equity
    'DebtToEquity': 'float',     # Debt/Equity ratio
    'CurrentRatio': 'float',     # Current Assets/Current Liabilities
    'PERatio': 'float',          # Price-to-Earnings ratio
    
    # Metadata
    'Sector': 'text',
}

print(f"Schema defined with {len(fundamental_columns)} columns:")
for col, dtype in fundamental_columns.items():
    print(f"  - {col}: {dtype}")

### Create the Database

Create a SQLite database to store our fundamental data. This is a one-time operation.

In [None]:
# Database configuration
DB_CODE = 'fundamentals'      # Database identifier
BAR_SIZE = '1 quarter'        # Data frequency (quarterly fundamentals)

# Create the database
try:
    db_path = create_custom_db(
        db_code=DB_CODE,
        bar_size=BAR_SIZE,
        columns=fundamental_columns,
    )
    print(f"✓ Database created: {db_path}")
except FileExistsError:
    print(f"ℹ Database '{DB_CODE}' already exists, will use existing database")
    # Get the database path
    from zipline.data.custom.config import get_custom_data_dir, get_db_filename
    db_path = get_custom_data_dir() / get_db_filename(DB_CODE)
    print(f"  Location: {db_path}")

---

## Part 2: Load Data from CSV

Now we'll load our fundamental data from CSV files into the database.

### Preview the Data Files

Let's first look at what our CSV files contain.

In [None]:
# File paths (adjust these to your actual file locations)
FUNDAMENTALS_CSV = 'sample_fundamentals.csv'
SECURITIES_CSV = 'sample_securities.csv'

# Preview fundamentals data
fundamentals_df = pd.read_csv(FUNDAMENTALS_CSV)
print("Fundamentals Data Preview:")
print(f"  Shape: {fundamentals_df.shape} (rows, columns)")
print(f"  Date range: {fundamentals_df['Date'].min()} to {fundamentals_df['Date'].max()}")
print(f"  Unique tickers: {fundamentals_df['Ticker'].nunique()}")
print(f"  Tickers: {', '.join(sorted(fundamentals_df['Ticker'].unique()))}")
print("\nFirst few rows:")
display(fundamentals_df.head(10))

In [None]:
# Preview securities mapping
securities_df = pd.read_csv(SECURITIES_CSV)
print("Securities Mapping Preview:")
print(f"  {len(securities_df)} securities mapped")
print("\nAll securities:")
display(securities_df)

### Load CSV into Database

This loads the CSV data into the database, mapping ticker symbols to Zipline's internal asset IDs (Sids).
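
The loading cell passes an `on_duplicate` option. One plausible way to picture its three values (`'replace'`, `'ignore'`, `'fail'`) is through SQLite's conflict clauses. The sketch below assumes a `(Sid, Date)` primary key and a hypothetical `insert_row` helper; it illustrates the semantics and is not the library's actual implementation.

```python
import sqlite3

# Assumed translation of on_duplicate values into SQLite insert statements.
INSERT_VERBS = {
    "replace": "INSERT OR REPLACE",
    "ignore": "INSERT OR IGNORE",
    "fail": "INSERT",
}


def insert_row(conn, row, on_duplicate="replace"):
    """Insert one (Sid, Date, ROE) row using the chosen conflict policy."""
    verb = INSERT_VERBS[on_duplicate]
    conn.execute(
        f"{verb} INTO fundamentals (Sid, Date, ROE) VALUES (?, ?, ?)", row
    )


def stored_roe(conn):
    return conn.execute("SELECT ROE FROM fundamentals").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fundamentals "
    "(Sid INTEGER, Date TEXT, ROE REAL, PRIMARY KEY (Sid, Date))"
)

insert_row(conn, (1, "2023-09-30", 20.0))
insert_row(conn, (1, "2023-09-30", 25.0), on_duplicate="replace")  # overwrites
after_replace = stored_roe(conn)

insert_row(conn, (1, "2023-09-30", 99.0), on_duplicate="ignore")   # keeps old row
after_ignore = stored_roe(conn)

try:
    insert_row(conn, (1, "2023-09-30", 99.0), on_duplicate="fail")
    failed = False
except sqlite3.IntegrityError:
    failed = True  # the duplicate key is reported instead of silently handled

print(after_replace, after_ignore, failed)
```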

In [None]:
# Load the data
result = load_csv_to_db(
    csv_path=FUNDAMENTALS_CSV,
    db_code=DB_CODE,
    sid_map=securities_df,       # DataFrame with Symbol and Sid columns
    id_col='Ticker',              # Column name for ticker in fundamentals CSV
    date_col='Date',              # Column name for dates
    on_duplicate='replace',       # Replace existing records on conflict
    fail_on_unmapped=False,       # Skip tickers not in securities_df
)

# Report results
print("\n" + "="*60)
print("DATA LOADING RESULTS")
print("="*60)
print(f"✓ Rows inserted/updated: {result['rows_inserted']:,}")
print(f"  Rows skipped: {result['rows_skipped']:,}")

if result['unmapped_ids']:
    print(f"\n⚠ Warning: {len(result['unmapped_ids'])} ticker(s) not mapped:")
    for ticker in result['unmapped_ids']:
        print(f"  - {ticker}")
    print(f"\n  Add these to {SECURITIES_CSV} if needed.")

if result['errors']:
    print(f"\n⚠ Errors encountered: {len(result['errors'])}")
    for error in result['errors'][:5]:
        print(f"  - {error}")

print("\n✓ Data loading complete!")
print("="*60)

### Verify the Database

Let's check what's in the database now.

In [None]:
# Get database information
db_info = describe_custom_db(DB_CODE)

print("\nDatabase Information:")
print("="*60)
print(f"Database: {db_info['db_code']}")
print(f"Location: {db_info['db_path']}")
print(f"Frequency: {db_info['bar_size']}")
print(f"\nData Statistics:")
print(f"  Total rows: {db_info['row_count']:,}")
print(f"  Unique assets (Sids): {db_info['num_sids']}")
if db_info['date_range']:
    print(f"  Date range: {db_info['date_range'][0]} to {db_info['date_range'][1]}")

print(f"\nColumns ({len(db_info['columns'])}):")
for col, dtype in db_info['columns'].items():
    print(f"  - {col}: {dtype}")

if db_info['sids']:
    print(f"\nAsset IDs (Sids):")
    print(f"  {', '.join(map(str, sorted([int(s) for s in db_info['sids']])))}")

print("="*60)

### Query the Data Directly

Before using Pipeline, let's query the data directly to see what we have.

In [None]:
# Query all data
all_data = get_prices(
    db_code=DB_CODE,
    fields=['Revenue', 'NetIncome', 'EPS', 'ROE', 'PERatio', 'Sector']
)

print(f"Retrieved {len(all_data):,} records")
print("\nSample data:")
display(all_data.head(20))

In [None]:
# Summary statistics
print("Summary Statistics:")
print("="*60)
display(all_data[['Revenue', 'NetIncome', 'EPS', 'ROE', 'PERatio']].describe())

---

## Part 3: Create Pipeline DataSet

Now we'll create a Zipline Pipeline DataSet from our custom data. This allows us to use the data in Pipeline computations.

In [None]:
# Create a DataSet class from our database
Fundamentals = make_custom_dataset_class(
    db_code=DB_CODE,
    columns=fundamental_columns,
    base_name='Fundamentals',  # This will create 'FundamentalsDataSet'
)

print("✓ DataSet class created: FundamentalsDataSet")
print("\nAvailable columns (as Pipeline factors):")
for col in fundamental_columns.keys():
    print(f"  - Fundamentals.{col}")
    
print("\nYou can now use these in Pipeline like:")
print("  Fundamentals.Revenue.latest")
print("  Fundamentals.ROE.latest")
print("  Fundamentals.PERatio.latest")

---

## Part 4: Simple Pipeline Examples

Let's create some simple pipelines to screen and rank stocks.

### Example 1: Basic Screening

Screen for stocks with:
- High ROE (> 10%)
- Low P/E ratio (< 30)
- Low debt (Debt/Equity < 1.0)
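
The three criteria combine into a single boolean mask. The pandas sketch below uses made-up numbers to show the idea, and also how a missing value behaves: a comparison against `NaN` evaluates to `False`, so a stock with no ROE drops out of the screen instead of raising an error.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
stocks = pd.DataFrame(
    {
        "ROE": [25.0, 8.0, 15.0, np.nan],
        "PERatio": [20.0, 12.0, 45.0, 10.0],
        "DebtToEquity": [0.5, 0.2, 0.3, 0.1],
    },
    index=["A", "B", "C", "D"],
)

passes = (
    (stocks["ROE"] > 10.0)
    & (stocks["PERatio"] < 30.0)
    & (stocks["DebtToEquity"] < 1.0)
)

# B fails on ROE, C fails on P/E, and D has no ROE, so `NaN > 10.0` is False.
screened = stocks[passes]
print(list(screened.index))
```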

In [None]:
# Create a simple screening pipeline
def make_screening_pipeline():
    """
    Create a pipeline that screens for quality stocks.
    
    Criteria:
    - ROE > 10% (profitable and efficient)
    - P/E < 30 (reasonably valued)
    - Debt/Equity < 1.0 (not over-leveraged)
    """
    # Get the latest fundamental values
    roe = Fundamentals.ROE.latest
    pe_ratio = Fundamentals.PERatio.latest
    debt_to_equity = Fundamentals.DebtToEquity.latest
    eps = Fundamentals.EPS.latest
    revenue = Fundamentals.Revenue.latest
    sector = Fundamentals.Sector.latest
    
    # Define screening filters
    high_roe = (roe > 10.0)
    reasonable_pe = (pe_ratio < 30.0)
    low_debt = (debt_to_equity < 1.0)
    
    # Combine filters
    quality_screen = high_roe & reasonable_pe & low_debt
    
    # Create pipeline
    return Pipeline(
        columns={
            'ROE': roe,
            'PE_Ratio': pe_ratio,
            'Debt_to_Equity': debt_to_equity,
            'EPS': eps,
            'Revenue': revenue,
            'Sector': sector,
        },
        screen=quality_screen,  # Only return stocks passing the screen
    )

screening_pipeline = make_screening_pipeline()
print("✓ Screening pipeline created")
print("  Filters: ROE > 10%, P/E < 30, Debt/Equity < 1.0")

### Example 2: Ranking Pipeline

Rank stocks by a composite quality score.
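
The composite score rests on min-max scaling, which divides by `max - min` and therefore breaks down when every stock has the same value or a column is entirely missing. A guarded version might look like the following sketch; the helper name `min_max_score` and the neutral score of 0.5 for a column with no spread are assumptions, not part of the notebook's factor.

```python
import numpy as np


def min_max_score(values, higher_is_better=True):
    """Scale values to [0, 1], ignoring NaNs; NaN inputs stay NaN."""
    values = np.asarray(values, dtype=float)
    if np.all(np.isnan(values)):
        return np.full_like(values, np.nan)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:
        # No spread: give every non-missing stock a neutral score.
        scaled = np.where(np.isnan(values), np.nan, 0.5)
    else:
        scaled = (values - lo) / (hi - lo)
    return scaled if higher_is_better else 1.0 - scaled


# Made-up values for illustration only.
roe = [10.0, 20.0, 30.0]
pe = [15.0, 30.0, 45.0]
debt = [0.2, 0.4, 1.0]

quality = (
    min_max_score(roe)
    + min_max_score(pe, higher_is_better=False)
    + min_max_score(debt, higher_is_better=False)
) / 3.0
print(quality.round(3))
```

Note that this is min-max scaling, not a percentile rank, so a single outlier can compress the scores of all other stocks.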

In [None]:
# Create a custom factor for quality score
class QualityScore(CustomFactor):
    """
    Composite quality score based on:
    - ROE (higher is better)
    - P/E ratio (lower is better)
    - Debt/Equity (lower is better)
    
    Returns a normalized score where higher = better quality.
    """
    inputs = [
        Fundamentals.ROE,
        Fundamentals.PERatio,
        Fundamentals.DebtToEquity,
    ]
    window_length = 1  # Only need latest value
    
    def compute(self, today, assets, out, roe, pe, debt):
        # Get latest values (window_length=1, so just index 0)
        roe_latest = roe[0]
        pe_latest = pe[0]
        debt_latest = debt[0]
        
        # Normalize each metric to a 0-1 scale with min-max scaling
        # (not a percentile rank; a column with no spread divides by zero)
        # ROE: higher is better
        roe_score = (roe_latest - np.nanmin(roe_latest)) / (np.nanmax(roe_latest) - np.nanmin(roe_latest))
        
        # P/E: lower is better, so invert
        pe_score = 1 - ((pe_latest - np.nanmin(pe_latest)) / (np.nanmax(pe_latest) - np.nanmin(pe_latest)))
        
        # Debt: lower is better, so invert
        debt_score = 1 - ((debt_latest - np.nanmin(debt_latest)) / (np.nanmax(debt_latest) - np.nanmin(debt_latest)))
        
        # Composite score (equal weights)
        out[:] = (roe_score + pe_score + debt_score) / 3.0


def make_ranking_pipeline():
    """
    Create a pipeline that ranks stocks by quality score.
    """
    # Calculate quality score
    quality = QualityScore()
    
    # Get fundamentals
    roe = Fundamentals.ROE.latest
    pe_ratio = Fundamentals.PERatio.latest
    debt_to_equity = Fundamentals.DebtToEquity.latest
    eps = Fundamentals.EPS.latest
    sector = Fundamentals.Sector.latest
    
    # Rank by quality score
    quality_rank = quality.rank(ascending=False)  # 1 = best
    
    return Pipeline(
        columns={
            'Quality_Score': quality,
            'Quality_Rank': quality_rank,
            'ROE': roe,
            'PE_Ratio': pe_ratio,
            'Debt_to_Equity': debt_to_equity,
            'EPS': eps,
            'Sector': sector,
        },
    )

ranking_pipeline = make_ranking_pipeline()
print("✓ Ranking pipeline created")
print("  Ranks stocks by composite quality score (ROE, P/E, Debt)")

### Example 3: Sector Analysis Pipeline

Compare metrics across sectors.

In [None]:
def make_sector_analysis_pipeline():
    """
    Create a pipeline for sector-based analysis.
    """
    # Get all fundamental metrics
    revenue = Fundamentals.Revenue.latest
    net_income = Fundamentals.NetIncome.latest
    roe = Fundamentals.ROE.latest
    pe_ratio = Fundamentals.PERatio.latest
    debt_to_equity = Fundamentals.DebtToEquity.latest
    current_ratio = Fundamentals.CurrentRatio.latest
    sector = Fundamentals.Sector.latest
    
    # Calculate profit margin
    profit_margin = (net_income / revenue) * 100.0
    
    return Pipeline(
        columns={
            'Sector': sector,
            'Revenue': revenue,
            'Net_Income': net_income,
            'Profit_Margin_%': profit_margin,
            'ROE': roe,
            'PE_Ratio': pe_ratio,
            'Debt_to_Equity': debt_to_equity,
            'Current_Ratio': current_ratio,
        },
    )

sector_pipeline = make_sector_analysis_pipeline()
print("✓ Sector analysis pipeline created")

---

## Part 5: Run Pipelines (Without Bundle)

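For testing, we can reproduce the pipeline logic using just our custom data, without a full Zipline bundle. The cells below query the custom database directly and apply the same screens and scores in pandas; running the `Pipeline` objects from Part 4 through `SimplePipelineEngine` requires a bundle (see Part 8).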
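
The cells in this part rely on `get_latest_values` with an `as_of_date`. Conceptually, an as-of lookup keeps, for each asset, the most recent row dated on or before the requested date. The pandas sketch below illustrates that idea with made-up rows; `latest_as_of` is a hypothetical helper and the library function may behave differently in its details.

```python
import pandas as pd

# Made-up rows for illustration only.
records = pd.DataFrame(
    {
        "Sid": [1, 1, 1, 2, 2],
        "Date": pd.to_datetime(
            ["2023-06-30", "2023-09-30", "2023-12-31", "2023-06-30", "2023-09-30"]
        ),
        "ROE": [20.0, 22.0, 24.0, 9.0, 11.0],
    }
)


def latest_as_of(df, as_of_date):
    """Return, per Sid, the most recent row dated on or before as_of_date."""
    known = df[df["Date"] <= pd.Timestamp(as_of_date)]
    return (
        known.sort_values("Date")
        .groupby("Sid")
        .tail(1)
        .sort_values("Sid")
        .reset_index(drop=True)
    )


snapshot = latest_as_of(records, "2023-10-15")
print(snapshot)
```

Rows dated after the as-of date (here the 2023-12-31 quarter for Sid 1) are excluded, which is what keeps the snapshot free of future information.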

In [None]:
# Create a simple test runner for our pipeline
# This doesn't require a full bundle - just uses our custom data

# Get test date (use a date from our data)
test_date = pd.Timestamp('2023-12-31')

# Get the assets we have data for
test_sids = [int(s) for s in db_info['sids']]

print(f"Test Configuration:")
print(f"  Date: {test_date.date()}")
print(f"  Assets: {len(test_sids)} stocks")
print(f"  Sids: {test_sids}")

### Run Screening Pipeline

Find stocks that pass our quality screens.

In [None]:
# For this example, we'll manually query and filter
# In a real backtest, this would run automatically via SimplePipelineEngine

# Query the data for our test date
from zipline.data.custom import get_latest_values

screening_data = get_latest_values(
    db_code=DB_CODE,
    as_of_date=test_date.strftime('%Y-%m-%d'),
    sids=test_sids,
)

# Apply our screening criteria
screening_data['Passes_Screen'] = (
    (screening_data['ROE'] > 10.0) &
    (screening_data['PERatio'] < 30.0) &
    (screening_data['DebtToEquity'] < 1.0)
)

# Get stocks that pass
screened_stocks = screening_data[screening_data['Passes_Screen']].copy()

print(f"\nScreening Results (as of {test_date.date()}):")
print("="*80)
print(f"Total stocks analyzed: {len(screening_data)}")
print(f"Stocks passing screen: {len(screened_stocks)}")
if len(screening_data) > 0:  # avoid dividing by zero when the query returns no rows
    print(f"Pass rate: {len(screened_stocks)/len(screening_data)*100:.1f}%")

if len(screened_stocks) > 0:
    print(f"\nQuality Stocks (ROE > 10%, P/E < 30, Debt/Equity < 1.0):")
    display(screened_stocks[['Sid', 'ROE', 'PERatio', 'DebtToEquity', 'EPS', 'Sector']].sort_values('ROE', ascending=False))
else:
    print("\nNo stocks passed the screen criteria.")
    print("Try adjusting the thresholds or check your data.")

### Calculate Quality Rankings

Rank all stocks by our composite quality score.

In [None]:
# Calculate quality score manually
ranking_data = get_latest_values(
    db_code=DB_CODE,
    as_of_date=test_date.strftime('%Y-%m-%d'),
    sids=test_sids,
).copy()

# Normalize ROE (higher is better)
roe_min, roe_max = ranking_data['ROE'].min(), ranking_data['ROE'].max()
ranking_data['ROE_Score'] = (ranking_data['ROE'] - roe_min) / (roe_max - roe_min)

# Normalize P/E (lower is better, so invert)
pe_min, pe_max = ranking_data['PERatio'].min(), ranking_data['PERatio'].max()
ranking_data['PE_Score'] = 1 - ((ranking_data['PERatio'] - pe_min) / (pe_max - pe_min))

# Normalize Debt (lower is better, so invert)
debt_min, debt_max = ranking_data['DebtToEquity'].min(), ranking_data['DebtToEquity'].max()
ranking_data['Debt_Score'] = 1 - ((ranking_data['DebtToEquity'] - debt_min) / (debt_max - debt_min))

# Composite quality score
ranking_data['Quality_Score'] = (
    ranking_data['ROE_Score'] + 
    ranking_data['PE_Score'] + 
    ranking_data['Debt_Score']
) / 3.0

# Rank by quality
ranking_data = ranking_data.sort_values('Quality_Score', ascending=False)
ranking_data['Quality_Rank'] = range(1, len(ranking_data) + 1)

print(f"\nQuality Rankings (as of {test_date.date()}):")
print("="*80)
display(ranking_data[['Quality_Rank', 'Sid', 'Quality_Score', 'ROE', 'PERatio', 'DebtToEquity', 'Sector']])

### Sector Analysis

Compare metrics across sectors.
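
One subtlety when aggregating by sector: the mean of per-stock profit margins is not the same as the sector's aggregate margin (total net income over total revenue), because the former weights every company equally regardless of size. The sketch below uses made-up numbers to show the difference.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame(
    {
        "Sector": ["Tech", "Tech", "Energy"],
        "Revenue": [1000.0, 100.0, 500.0],
        "NetIncome": [100.0, 30.0, 50.0],
    }
)
df["Margin_%"] = df["NetIncome"] / df["Revenue"] * 100.0

by_sector = df.groupby("Sector").agg(
    Revenue=("Revenue", "sum"),
    NetIncome=("NetIncome", "sum"),
    Mean_Margin=("Margin_%", "mean"),  # equal-weighted across companies
)

# Revenue-weighted margin computed from the sector totals.
by_sector["Aggregate_Margin"] = (
    by_sector["NetIncome"] / by_sector["Revenue"] * 100.0
)
print(by_sector.round(2))
```

Which figure is appropriate depends on the question: the equal-weighted mean describes a typical company, while the aggregate describes the sector as a whole.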

In [None]:
# Get sector data
sector_data = get_latest_values(
    db_code=DB_CODE,
    as_of_date=test_date.strftime('%Y-%m-%d'),
    sids=test_sids,
).copy()

# Calculate profit margin
sector_data['Profit_Margin_%'] = (sector_data['NetIncome'] / sector_data['Revenue']) * 100

# Group by sector
sector_summary = sector_data.groupby('Sector').agg({
    'Revenue': 'sum',
    'NetIncome': 'sum',
    'ROE': 'mean',
    'PERatio': 'mean',
    'DebtToEquity': 'mean',
    'Profit_Margin_%': 'mean',
    'Sid': 'count',
}).rename(columns={'Sid': 'Num_Stocks'})

sector_summary = sector_summary.round(2)

print(f"\nSector Analysis (as of {test_date.date()}):")
print("="*80)
display(sector_summary)

---

## Part 6: Visualizations

Create charts to visualize the fundamental data.

In [None]:
# Scatter plot: ROE vs P/E Ratio
fig, ax = plt.subplots(figsize=(12, 8))

# Map Sids to symbols for labels
sid_to_symbol = dict(zip(securities_df['Sid'], securities_df['Symbol']))
ranking_data['Symbol'] = ranking_data['Sid'].map(sid_to_symbol)

# Color by sector
sectors = ranking_data['Sector'].unique()
colors = plt.cm.Set3(np.linspace(0, 1, len(sectors)))
sector_colors = dict(zip(sectors, colors))

for sector in sectors:
    sector_data_plot = ranking_data[ranking_data['Sector'] == sector]
    ax.scatter(
        sector_data_plot['ROE'],
        sector_data_plot['PERatio'],
        s=200,
        c=[sector_colors[sector]],
        label=sector,
        alpha=0.7,
        edgecolors='black',
        linewidth=1.5,
    )
    
    # Add stock labels
    for idx, row in sector_data_plot.iterrows():
        ax.annotate(
            row['Symbol'],
            (row['ROE'], row['PERatio']),
            xytext=(5, 5),
            textcoords='offset points',
            fontsize=9,
            fontweight='bold',
        )

ax.set_xlabel('Return on Equity (ROE %)', fontsize=12, fontweight='bold')
ax.set_ylabel('P/E Ratio', fontsize=12, fontweight='bold')
ax.set_title(f'ROE vs P/E Ratio by Sector ({test_date.date()})', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

# Add threshold lines before building the legend so their labels appear in it
ax.axhline(y=30, color='red', linestyle='--', alpha=0.5, label='P/E = 30 (threshold)')
ax.axvline(x=10, color='green', linestyle='--', alpha=0.5, label='ROE = 10% (threshold)')
ax.legend(title='Sector', loc='best', framealpha=0.9)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Bottom-right quadrant (high ROE, low P/E): Best value + quality")
print("  - Top-right quadrant (high ROE, high P/E): Quality but expensive")
print("  - Bottom-left quadrant (low ROE, low P/E): Cheap but poor quality")
print("  - Top-left quadrant (low ROE, high P/E): Expensive and poor quality")

In [None]:
# Bar chart: Quality scores by stock
fig, ax = plt.subplots(figsize=(14, 6))

# Sort by quality score
plot_data = ranking_data.sort_values('Quality_Score', ascending=True)

# Create bars colored by sector
bars = ax.barh(
    plot_data['Symbol'],
    plot_data['Quality_Score'],
    color=[sector_colors[s] for s in plot_data['Sector']],
    edgecolor='black',
    linewidth=1.5,
)

ax.set_xlabel('Quality Score', fontsize=12, fontweight='bold')
ax.set_ylabel('Stock', fontsize=12, fontweight='bold')
ax.set_title(f'Composite Quality Score by Stock ({test_date.date()})', fontsize=14, fontweight='bold')
ax.set_xlim(0, 1.0)
ax.grid(True, axis='x', alpha=0.3)

# Add value labels
for i, (idx, row) in enumerate(plot_data.iterrows()):
    ax.text(
        row['Quality_Score'] + 0.02,
        i,
        f"{row['Quality_Score']:.3f}",
        va='center',
        fontsize=9,
    )

plt.tight_layout()
plt.show()

print("\nQuality Score Composition:")
print("  - 1/3 ROE score (normalized)")
print("  - 1/3 P/E score (inverted & normalized)")
print("  - 1/3 Debt/Equity score (inverted & normalized)")
print("  Higher scores = better quality")

In [None]:
# Heatmap: Fundamental metrics
fig, ax = plt.subplots(figsize=(12, 8))

# Prepare data for heatmap (normalize for visualization)
heatmap_data = ranking_data.set_index('Symbol')[['ROE', 'PERatio', 'DebtToEquity', 'CurrentRatio']].copy()

# Normalize each column to 0-1 for better visualization
for col in heatmap_data.columns:
    col_min = heatmap_data[col].min()
    col_max = heatmap_data[col].max()
    heatmap_data[col] = (heatmap_data[col] - col_min) / (col_max - col_min)

# Invert P/E and Debt (lower is better)
heatmap_data['PERatio'] = 1 - heatmap_data['PERatio']
heatmap_data['DebtToEquity'] = 1 - heatmap_data['DebtToEquity']

# Rename for clarity
heatmap_data.columns = ['ROE\n(higher better)', 'P/E\n(lower better)', 'Debt/Equity\n(lower better)', 'Current Ratio\n(higher better)']

# Create heatmap
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0.5,
    linewidths=1,
    linecolor='black',
    cbar_kws={'label': 'Normalized Score\n(0=worst, 1=best)'},
    ax=ax,
)

ax.set_title(f'Fundamental Metrics Heatmap ({test_date.date()})', fontsize=14, fontweight='bold', pad=20)
ax.set_xlabel('Metric', fontsize=12, fontweight='bold')
ax.set_ylabel('Stock', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nHeatmap Interpretation:")
print("  - Green = Good (high normalized score)")
print("  - Yellow = Average (medium normalized score)")
print("  - Red = Poor (low normalized score)")
print("  - Look for rows with mostly green (best overall quality)")

---

## Part 7: Time Series Analysis

Analyze how fundamentals change over time.
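
With quarterly data, growth rates are a common way to summarize change over time. Comparing each quarter with the same quarter a year earlier (four periods back) avoids seasonal effects that quarter-over-quarter changes pick up. A minimal pandas sketch with made-up revenue figures, assuming one row per quarter with no gaps:

```python
import pandas as pd

# Made-up quarterly revenue for one stock, illustration only.
history = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            [
                "2022-03-31", "2022-06-30", "2022-09-30",
                "2022-12-31", "2023-03-31", "2023-06-30",
            ]
        ),
        "Revenue": [100.0, 110.0, 105.0, 130.0, 120.0, 121.0],
    }
).sort_values("Date")

# Change versus the previous quarter and versus the same quarter last year.
history["Revenue_QoQ_%"] = history["Revenue"].pct_change(periods=1) * 100.0
history["Revenue_YoY_%"] = history["Revenue"].pct_change(periods=4) * 100.0
print(history.round(1))
```

The first four year-over-year values are `NaN` because no prior-year quarter exists for them. If quarters are missing from the data, `periods=4` no longer lines up with the prior year and the rows would need to be aligned by date first.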

In [None]:
# Get time series for a specific stock
example_stock = 'AAPL'
example_sid = securities_df[securities_df['Symbol'] == example_stock]['Sid'].values[0]

# Query all quarters for this stock
stock_history = get_prices(
    db_code=DB_CODE,
    sids=[example_sid],
)

stock_history['Date'] = pd.to_datetime(stock_history['Date'])
stock_history = stock_history.sort_values('Date')

print(f"\n{example_stock} Historical Fundamentals:")
print("="*80)
display(stock_history[['Date', 'Revenue', 'NetIncome', 'EPS', 'ROE', 'PERatio']])

In [None]:
# Plot time series
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle(f'{example_stock} - Fundamental Trends Over Time', fontsize=16, fontweight='bold')

# Revenue
axes[0, 0].plot(stock_history['Date'], stock_history['Revenue'] / 1e9, marker='o', linewidth=2, markersize=8)
axes[0, 0].set_title('Quarterly Revenue', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Revenue (Billions $)', fontsize=10)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].tick_params(axis='x', rotation=45)

# EPS
axes[0, 1].plot(stock_history['Date'], stock_history['EPS'], marker='o', linewidth=2, markersize=8, color='green')
axes[0, 1].set_title('Earnings Per Share (EPS)', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('EPS ($)', fontsize=10)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].tick_params(axis='x', rotation=45)

# ROE
axes[1, 0].plot(stock_history['Date'], stock_history['ROE'], marker='o', linewidth=2, markersize=8, color='orange')
axes[1, 0].set_title('Return on Equity (ROE)', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('ROE (%)', fontsize=10)
axes[1, 0].axhline(y=10, color='red', linestyle='--', alpha=0.5, label='Target: 10%')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].tick_params(axis='x', rotation=45)

# P/E Ratio
axes[1, 1].plot(stock_history['Date'], stock_history['PERatio'], marker='o', linewidth=2, markersize=8, color='purple')
axes[1, 1].set_title('P/E Ratio', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('P/E Ratio', fontsize=10)
axes[1, 1].axhline(y=30, color='red', linestyle='--', alpha=0.5, label='Target: < 30')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print(f"\n{example_stock} Trend Analysis:")
print("  - Look for consistent growth in Revenue and EPS")
print("  - Stable/improving ROE indicates efficient operations")
print("  - P/E ratio shows market valuation relative to earnings")

---

## Part 8: Integration with Backtesting

Example of how to use fundamental data in a Zipline backtest.

In [None]:
print("""
INTEGRATION WITH ZIPLINE BACKTESTS
===================================

To use this custom fundamental data in a backtest:

1. CREATE A PIPELINE LOADER:
   
   from zipline.data.custom import CustomSQLiteLoader
   
   fundamentals_loader = CustomSQLiteLoader('fundamentals')

2. REGISTER WITH PIPELINE ENGINE:
   
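   from zipline.pipeline.engine import SimplePipelineEngine
   from zipline.pipeline.loaders import USEquityPricingLoader
   from zipline.data.bundles import load
   
   bundle_data = load('sharadar')  # Or your bundle
   
   # Pricing columns need a PipelineLoader, not the raw bar reader
   pricing_loader = USEquityPricingLoader.without_fx(
       bundle_data.equity_daily_bar_reader,
       bundle_data.adjustment_reader,
   )
   
   def get_loader(column):
       if column.dataset == Fundamentals:
           return fundamentals_loader
       return pricing_loader
   
   engine = SimplePipelineEngine(
       get_loader=get_loader,
       asset_finder=bundle_data.asset_finder,
       default_domain=...,
   )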

3. USE IN YOUR ALGORITHM:
   
   def initialize(context):
       # Define your pipeline with fundamental factors
       pipe = Pipeline(
           columns={
               'roe': Fundamentals.ROE.latest,
               'pe': Fundamentals.PERatio.latest,
               'close': EquityPricing.close.latest,
           },
           screen=(Fundamentals.ROE.latest > 10) & (Fundamentals.PERatio.latest < 30)
       )
       
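       attach_pipeline(pipe, 'quality_stocks')
       
       # Schedule the rebalance function defined below (monthly here, as an example)
       schedule_function(rebalance, date_rules.month_start())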
   
   def before_trading_start(context, data):
       # Get today's pipeline output
       context.output = pipeline_output('quality_stocks')
       
       # Select top stocks based on fundamentals
       context.longs = context.output.nlargest(10, 'roe').index
   
   def rebalance(context, data):
       # Equal-weight portfolio of top quality stocks
       for asset in context.longs:
           if data.can_trade(asset):
               order_target_percent(asset, 1.0 / len(context.longs))

4. RUN YOUR BACKTEST:
   
   from zipline import run_algorithm
   
   results = run_algorithm(
       start=pd.Timestamp('2023-01-01'),
       end=pd.Timestamp('2023-12-31'),
       initialize=initialize,
       before_trading_start=before_trading_start,
       capital_base=100000,
       bundle='sharadar',
   )

See the Zipline documentation for complete backtest examples:
https://zipline.ml4trading.io/
""")

---

## Part 9: Troubleshooting & Tips

Common issues and solutions.
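
One of the backtesting tips in the cell below concerns reporting lag: figures for a quarter are not public on the quarter-end date. A simple way to avoid look-ahead bias is to treat each record as available only some days after its period end. The sketch below assumes a flat 45-day lag and uses made-up EPS values; actual filing delays vary by company, and the helper names are hypothetical.

```python
from datetime import date, timedelta

REPORTING_LAG_DAYS = 45  # assumption; real filing delays vary by company


def available_on(period_end, lag_days=REPORTING_LAG_DAYS):
    """Earliest date a backtest should treat the quarter's figures as known."""
    return period_end + timedelta(days=lag_days)


def latest_known(quarters, as_of):
    """Return the value of the newest quarter that was public by as_of.

    quarters is a list of (period_end, value) tuples; returns None if
    nothing had been published yet.
    """
    known = [item for item in quarters if available_on(item[0]) <= as_of]
    if not known:
        return None
    return max(known, key=lambda item: item[0])[1]


# Made-up EPS values for illustration only.
eps_by_quarter = [
    (date(2023, 6, 30), 1.20),
    (date(2023, 9, 30), 1.35),
]

# The Q3 figures (period end 2023-09-30) become usable on 2023-11-14.
print(latest_known(eps_by_quarter, date(2023, 10, 15)))
print(latest_known(eps_by_quarter, date(2023, 11, 20)))
```

An alternative is to shift the `Date` column forward by the lag before loading the CSV, so that every downstream query sees availability dates instead of period-end dates.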

In [None]:
print("""
TROUBLESHOOTING GUIDE
=====================

PROBLEM: "Database not found"
SOLUTION: Run the database creation cell (Part 1) first
   
PROBLEM: "No data returned from query"
SOLUTION: 
   - Check that data was loaded successfully (Part 2)
   - Verify your date range matches the data
   - Check that Sids exist in the database

PROBLEM: "Unmapped identifiers" warning
SOLUTION:
   - Add missing tickers to your securities CSV (sample_securities.csv)
   - Or set fail_on_unmapped=False to skip them

PROBLEM: "Column not found" error
SOLUTION:
   - Verify column names match between CSV and schema
   - Check case sensitivity (Revenue vs revenue)
   - Run describe_custom_db() to see available columns

PROBLEM: Pipeline gives errors
SOLUTION:
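   - Check timezone handling for dates: older Zipline releases expect tz-aware UTC timestamps (pd.Timestamp('2023-01-01', tz='UTC')), while zipline-reloaded 3.x expects tz-naive ones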
   - Check that assets exist in both bundle AND custom data
   - Verify CustomSQLiteLoader is registered correctly

TIPS FOR BEST RESULTS:
======================

1. DATA QUALITY:
   - Clean your CSV data before loading
   - Handle missing values appropriately
   - Ensure dates are in consistent format

2. PERFORMANCE:
   - Use appropriate data types (int for large numbers, not float)
   - Index frequently-queried columns
   - Use date range filters in queries

3. DATA UPDATES:
   - Use on_duplicate='replace' to update existing records
   - Use on_duplicate='ignore' to skip duplicates
   - Use on_duplicate='fail' to catch data issues

4. FACTOR DESIGN:
   - Normalize factors to similar scales for combining
   - Handle missing data with .fillna() or filters
   - Test factors individually before combining

5. BACKTESTING:
   - Ensure point-in-time correctness (no look-ahead bias)
   - Match fundamental frequency (quarterly) with rebalancing
   - Consider reporting lag (fundamentals released ~45 days after quarter-end)

NEXT STEPS:
===========

1. Load your own fundamental data CSV
2. Create custom factors based on your research
3. Backtest strategies using fundamental signals
4. Combine with price/volume factors for multi-factor models
5. Analyze results and iterate

For more examples and documentation:
  - Zipline Custom Data: src/zipline/data/custom/README.md
  - Zipline Pipeline: https://zipline.ml4trading.io/pipeline.html
  - Example notebooks: examples/custom_data/
""")

---

## Summary

**What We Covered:**

1. ✅ Created a custom database for fundamental data
2. ✅ Loaded CSV data with symbol-to-sid mapping
3. ✅ Created Pipeline DataSets from custom data
4. ✅ Built screening pipelines (quality filters)
5. ✅ Created ranking pipelines (composite scores)
6. ✅ Performed sector analysis
7. ✅ Visualized fundamental metrics
8. ✅ Analyzed trends over time
9. ✅ Learned backtest integration

**Key Takeaways:**

- Custom data enables fundamental analysis in Zipline
- Pipeline makes it easy to screen and rank stocks
- Combine multiple factors for robust signals
- Visualizations help understand the data
- Integration with backtesting enables strategy development

**Next Steps:**

1. Try with your own fundamental data
2. Experiment with different factor combinations
3. Build and test investment strategies
4. Combine with technical indicators
5. Run full backtests and analyze performance

Happy researching! 📊🚀