# Congressional Trading Feature Engineering
## Market Variables & Informed Trading Indicators

**Author:** Big Data ML Project  
**Date:** January 2026  
**Objective:** Enrich congressional trading data with comprehensive market variables for anomaly detection

---

### Methodology Overview

This notebook constructs a wide dataset suitable for machine learning-based detection of abnormal trading patterns. The approach follows academic literature on informed trading detection (Bogousslavsky, Fos & Muravyev, 2021) and extends it with event-proximity features.

**Key Design Decisions:**

1. **Sample Restrictions** (following ITI paper):
   - Exclude stocks with price < $5 (avoid penny stocks)
   - Exclude stocks with market cap < $100M (avoid microcaps)
   - Drop non-equity securities (bonds, bills) when identifiable

2. **Variable Groups** (roughly 40-45 features as implemented below):
   - **Returns & Volatility**: Daily, intraday, rolling volatility measures
   - **Volume & Liquidity**: Turnover, abnormal volume, Amihud illiquidity
   - **Momentum**: Short (5d), medium (20d, 60d), long (252d)
   - **Factor Exposures**: CAPM beta, Fama-French loadings
   - **Event Proximity**: Days to/from earnings, M&A announcements
   - **Post-Trade Validation**: CAR (30d, 60d, 90d) with multiple benchmarks

3. **Time Windows**:
   - Variables *at trade date* (t): prices, volumes, returns
   - Variables *pre-trade*: momentum, volatility (lookback: 5-252 days)
   - Variables *post-trade*: cumulative abnormal returns (forward: 30-90 days)

4. **Data Quality**:
   - Ticker changes → exclude (no historical mapping)
   - Missing data → report, flag, but preserve row
   - Outliers → winsorize at 0.5% / 99.5% following ITI

---

## 1. Setup & Dependencies

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta
from tqdm import tqdm
import warnings
from pathlib import Path
import json

# For factor data (Fama-French)
import pandas_datareader.data as web

# Statistical helpers (earnings dates themselves come from yfinance, fetched in Section 3)
from scipy import stats

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

print("Dependencies loaded successfully.")

## 2. Load Congressional Trading Data

In [None]:
# Load the base dataset
df_trades = pd.read_csv('data/congress-trading-all.csv')

print(f"Original dataset shape: {df_trades.shape}")
print(f"\nColumns: {df_trades.columns.tolist()}")
print(f"\nFirst few rows:")
df_trades.head()

In [None]:
# Inspect data types and identify date column
print("Data types:")
print(df_trades.dtypes)
print(f"\nUnique tickers: {df_trades['ticker'].nunique() if 'ticker' in df_trades.columns else 'N/A'}")
print(f"Date range: {df_trades['transaction_date'].min()} to {df_trades['transaction_date'].max()}" 
      if 'transaction_date' in df_trades.columns else "Check date column name")

### 2.1 Data Preparation

**Steps:**
- Parse transaction dates
- Clean ticker symbols (remove spaces, convert to uppercase)
- Identify and flag non-equity securities
- Create unique trade identifier

In [None]:
# This cell will be customized based on actual column names
# For now, assuming standard columns exist

# Parse date (adjust column name as needed)
date_col = 'transaction_date'  # UPDATE if different
ticker_col = 'ticker'  # UPDATE if different

df = df_trades.copy()
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
df = df.dropna(subset=[date_col])  # Drop rows with invalid dates

# Clean tickers
df[ticker_col] = df[ticker_col].str.strip().str.upper()

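# Flag potential non-equities with a coarse heuristic: long digit runs, '--', or
# bond-like keywords in the symbol. Substring matches can misclassify legitimate
# equity tickers, so treat this as a flag to review rather than a hard filter.
df['likely_equity'] = ~df[ticker_col].str.contains(r'\d{3,}|--|BOND|BILL|NOTE', case=False, na=False)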

# Create unique trade ID
df['trade_id'] = range(len(df))

print(f"After data prep: {df.shape}")
print(f"Likely equities: {df['likely_equity'].sum()} ({df['likely_equity'].mean()*100:.1f}%)")

## 3. Market Data Fetching

**Approach:**
- Download daily OHLCV data for all unique tickers
- Download S&P 500 (^GSPC) for market benchmark
- Download Russell 3000 (^RUA) as alternative benchmark
- Fetch Fama-French factors from Ken French's data library
- Download earnings calendar data where available

**Error Handling:**
- Failed ticker downloads → logged but not fatal
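- Missing dates → left as gaps (no forward fill is applied in the cells below)
- Delisted stocks → usually surface as failed or too-short downloads and are recorded in `failed_tickers`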
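
Depending on the yfinance version, downloaded frames can differ in column layout (flat vs. MultiIndex) and in whether the index carries a timezone, and either difference breaks date comparisons against tz-naive trade dates. The sketch below shows one way to normalize a frame before use. The helper name `normalize_price_frame` and the synthetic input are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def normalize_price_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with flat column names and a tz-naive, sorted index."""
    out = frame.copy()
    # Flatten (field, ticker) MultiIndex columns down to the field name
    if isinstance(out.columns, pd.MultiIndex):
        out.columns = out.columns.get_level_values(0)
    # Drop timezone info so dates compare cleanly with tz-naive trade dates
    if getattr(out.index, "tz", None) is not None:
        out.index = out.index.tz_localize(None)
    return out.sort_index()

# Synthetic stand-in for a downloaded frame (no network access needed)
idx = pd.date_range("2024-01-02", periods=3, freq="B", tz="America/New_York")
cols = pd.MultiIndex.from_product([["Close", "Volume"], ["XYZ"]])
raw = pd.DataFrame([[10.0, 100], [10.5, 120], [10.2, 90]], index=idx, columns=cols)

clean = normalize_price_frame(raw)
upto = clean.loc[:pd.Timestamp("2024-01-03")]  # tz-naive slicing now works
print(upto)
```

Applying the same normalization to both the benchmark and the per-ticker frames keeps later `index <= trade_date` comparisons consistent.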

In [None]:
# Get unique tickers and date range
tickers = df[df['likely_equity']][ticker_col].unique()
start_date = df[date_col].min() - timedelta(days=400)  # ~1 trading year plus buffer: 252-day features need 253 rows
end_date = df[date_col].max() + timedelta(days=120)    # Extra months for post-trade CAR

print(f"Fetching data for {len(tickers)} tickers")
print(f"Date range: {start_date.date()} to {end_date.date()}")

# Initialize storage
price_data = {}
failed_tickers = []
earnings_data = {}  # Will store earnings dates per ticker

In [None]:
# Download market benchmarks first
print("Downloading market benchmarks...")

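# auto_adjust=False keeps the 'Adj Close' column that later cells rely on
sp500 = yf.download('^GSPC', start=start_date, end=end_date, progress=False, auto_adjust=False)
russell3000 = yf.download('^RUA', start=start_date, end=end_date, progress=False, auto_adjust=False)

# Some yfinance versions return MultiIndex columns even for a single ticker; flatten them
for bench in (sp500, russell3000):
    if isinstance(bench.columns, pd.MultiIndex):
        bench.columns = bench.columns.get_level_values(0)

# Calculate market returns
sp500['Return'] = sp500['Adj Close'].pct_change()
russell3000['Return'] = russell3000['Adj Close'].pct_change()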

print(f"SP500 data: {len(sp500)} days")
print(f"Russell 3000 data: {len(russell3000)} days")

In [None]:
# Download Fama-French factors
print("\nDownloading Fama-French factors...")

try:
    # FF 3-factor daily
    ff3 = web.DataReader('F-F_Research_Data_Factors_daily', 'famafrench', 
                          start=start_date, end=end_date)[0]
    # Convert from percentage to decimal
    ff3 = ff3 / 100
    
    # Momentum factor (daily)
    mom = web.DataReader('F-F_Momentum_Factor_daily', 'famafrench',
                          start=start_date, end=end_date)[0]
    mom = mom / 100
    
    # Merge factors
    ff_factors = ff3.join(mom, how='outer')
    ff_factors.columns = ['Mkt-RF', 'SMB', 'HML', 'RF', 'Mom']
    
    print(f"Fama-French factors loaded: {len(ff_factors)} days")
    print(f"Factors: {ff_factors.columns.tolist()}")
    
except Exception as e:
    print(f"Warning: Could not load FF factors: {e}")
    print("Will proceed without FF-adjusted measures.")
    ff_factors = None

In [None]:
# Download individual stock data
print(f"\nDownloading {len(tickers)} tickers...")
print("This may take several minutes.\n")

for ticker in tqdm(tickers[:10]):  # REMOVE [:10] FOR FULL RUN - limiting for testing
    try:
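        # Download OHLCV data
        stock = yf.Ticker(ticker)
        hist = stock.history(start=start_date, end=end_date)
        
        # history() returns an exchange-localized (tz-aware) index; drop the timezone so
        # dates compare cleanly with the tz-naive trade dates and benchmark index
        if getattr(hist.index, 'tz', None) is not None:
            hist.index = hist.index.tz_localize(None)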
        
        if len(hist) < 50:  # Require at least 50 days of data
            failed_tickers.append((ticker, "Insufficient data"))
            continue
            
        # Calculate returns
        hist['Return'] = hist['Close'].pct_change()
        hist['Log_Return'] = np.log(hist['Close'] / hist['Close'].shift(1))
        
        # Store
        price_data[ticker] = hist
        
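        # Try to get earnings dates (stored tz-naive so they compare with trade dates)
        try:
            earnings = stock.get_earnings_dates(limit=200)  # Get historical earnings
            if earnings is not None and len(earnings) > 0:
                earn_idx = earnings.index
                if getattr(earn_idx, 'tz', None) is not None:
                    earn_idx = earn_idx.tz_localize(None)
                earnings_data[ticker] = earn_idx.tolist()
        except Exception:
            pass  # Earnings data not available, not critical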
            
    except Exception as e:
        failed_tickers.append((ticker, str(e)))
        continue

print(f"\nSuccessfully downloaded: {len(price_data)} tickers")
print(f"Failed: {len(failed_tickers)} tickers")
print(f"Earnings data available for: {len(earnings_data)} tickers")

In [None]:
# Save failed tickers for inspection
if failed_tickers:
    failed_df = pd.DataFrame(failed_tickers, columns=['ticker', 'reason'])
    Path('data/outputs').mkdir(parents=True, exist_ok=True)
    failed_df.to_csv('data/outputs/failed_tickers.csv', index=False)
    print(f"Failed tickers saved to data/outputs/failed_tickers.csv")

## 4. Feature Engineering

Now we construct features for each trade. This is the core of the analysis.

**Methodology:**
- All features are calculated as of the trade date (no forward-looking bias except CAR)
- Rolling windows use expanding or fixed lookback (never forward)
- Features standardized where appropriate (following ITI paper)
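
The no-look-ahead rule comes down to slicing each price history at the trade date before computing anything. The sketch below illustrates this on synthetic prices, including a trade dated on a weekend, which falls back to the last available close. The helper name `history_as_of` and the toy data are assumptions for illustration.

```python
import pandas as pd

def history_as_of(price_df: pd.DataFrame, trade_date: pd.Timestamp) -> pd.DataFrame:
    """Rows dated on or before trade_date (nothing from the future)."""
    return price_df.loc[price_df.index <= trade_date]

# Six business days starting Friday 2024-03-01
dates = pd.bdate_range("2024-03-01", periods=6)
prices = pd.DataFrame({"Close": [100.0, 101.0, 99.0, 102.0, 103.0, 104.0]}, index=dates)

# A trade dated Saturday 2024-03-02 only sees Friday's row
weekend_hist = history_as_of(prices, pd.Timestamp("2024-03-02"))

# A trade dated Wednesday 2024-03-06 sees four rows, ending on that day
midweek_hist = history_as_of(prices, pd.Timestamp("2024-03-06"))
three_day_momentum = midweek_hist["Close"].iloc[-1] / midweek_hist["Close"].iloc[-4] - 1

print(len(weekend_hist), len(midweek_hist), round(three_day_momentum, 4))
```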

### 4.1 Price & Return Features

In [None]:
def calculate_return_features(ticker, trade_date, price_df):
    """
    Calculate return-based features at trade date.
    
    Returns:
        dict: Feature values
    """
    features = {}
    
    # Get data up to and including trade date
    hist = price_df[price_df.index <= trade_date].copy()
    
    if len(hist) < 5:
        return features  # Not enough data
    
    # Daily return at trade date
    features['return_t'] = hist['Return'].iloc[-1] if len(hist) >= 1 else np.nan
    
    # Overnight return (previous close to open) and intraday return (open to close)
    if len(hist) >= 2:
        features['return_overnight'] = (hist['Open'].iloc[-1] / hist['Close'].iloc[-2]) - 1
        features['return_intraday'] = (hist['Close'].iloc[-1] / hist['Open'].iloc[-1]) - 1
    
    # Momentum at various horizons
    # Short-term reversal (5 days)
    if len(hist) >= 6:
        features['momentum_5d'] = (hist['Close'].iloc[-1] / hist['Close'].iloc[-6]) - 1
        
    # Medium-term momentum (20 and 60 days)
    if len(hist) >= 21:
        features['momentum_20d'] = (hist['Close'].iloc[-1] / hist['Close'].iloc[-21]) - 1
        
    if len(hist) >= 61:
        features['momentum_60d'] = (hist['Close'].iloc[-1] / hist['Close'].iloc[-61]) - 1
    
    # Long-term momentum (252 days ~ 1 year)
    if len(hist) >= 253:
        features['momentum_252d'] = (hist['Close'].iloc[-1] / hist['Close'].iloc[-253]) - 1
    
    # Absolute returns (for volatility proxies)
    features['abs_return_t'] = abs(features.get('return_t', np.nan))
    
    return features

### 4.2 Volatility Features

**Realized Volatility:**
- Standard deviation of returns over rolling window
- Calculated at 30d, 60d, 252d horizons
- Annualized using √252 factor

**High-Low Range:**
- Proxy for intraday volatility (Parkinson estimator)
- Less noisy than close-to-close for daily vol
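
As a quick sanity check on the Parkinson estimator, the sketch below evaluates it on toy data where every day has the same high-low range, so the result can be compared against the closed form ln(H/L) / sqrt(4 ln 2), annualized with √252. The function names and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def realized_vol(returns: pd.Series, window: int) -> float:
    """Annualized sample standard deviation of the last `window` returns."""
    return returns.iloc[-window:].std() * np.sqrt(252)

def parkinson_vol(high: pd.Series, low: pd.Series, window: int) -> float:
    """Annualized Parkinson estimator over the last `window` days."""
    hl = np.log(high.iloc[-window:] / low.iloc[-window:])
    daily_var = (hl ** 2).sum() / (4 * window * np.log(2))
    return np.sqrt(daily_var) * np.sqrt(252)

# Toy data: every day has the same 2% high-low range
n = 30
low = pd.Series(np.full(n, 100.0))
high = low * 1.02
pv = parkinson_vol(high, low, window=n)

# With a constant range the estimator collapses to a closed form
expected = np.log(1.02) / np.sqrt(4 * np.log(2)) * np.sqrt(252)

# Constant returns have (numerically) zero close-to-close volatility
flat_vol = realized_vol(pd.Series([0.01] * n), window=n)
print(round(pv, 4), round(expected, 4), flat_vol)
```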

In [None]:
def calculate_volatility_features(ticker, trade_date, price_df):
    """
    Calculate volatility measures using data up to trade date.
    """
    features = {}
    hist = price_df[price_df.index <= trade_date].copy()
    
    if len(hist) < 5:
        return features
    
    # Realized volatility (annualized)
    # 30-day
    if len(hist) >= 30:
        features['realized_vol_30d'] = hist['Return'].iloc[-30:].std() * np.sqrt(252)
    
    # 60-day
    if len(hist) >= 60:
        features['realized_vol_60d'] = hist['Return'].iloc[-60:].std() * np.sqrt(252)
    
    # 252-day (annual)
    if len(hist) >= 252:
        features['realized_vol_252d'] = hist['Return'].iloc[-252:].std() * np.sqrt(252)
    
    # High-Low volatility (Parkinson estimator)
    # More efficient than close-to-close
    if len(hist) >= 30:
        hl = np.log(hist['High'].iloc[-30:] / hist['Low'].iloc[-30:])
        features['parkinson_vol_30d'] = np.sqrt(1/(4*30*np.log(2)) * (hl**2).sum()) * np.sqrt(252)
    
    # Volatility of volatility (VoV) - measures uncertainty
    if len(hist) >= 60:
        rolling_vol = hist['Return'].rolling(20).std().iloc[-60:]
        features['vol_of_vol_60d'] = rolling_vol.std() * np.sqrt(252)
    
    return features

### 4.3 Volume & Liquidity Features

**Turnover:**
- True turnover is Volume / Shares Outstanding; shares outstanding are not part of the OHLCV data, so raw volume stands in
- Abnormal volume vs. trailing 30-day mean

**Amihud Illiquidity:**
- Avg(|Return| / Dollar Volume)
- Standard measure in market microstructure

**Bid-Ask Spread Proxy:**
- Roll (1984) estimator from return covariances
- High-Low range as alternative
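
To make the two liquidity measures concrete, the sketch below computes Amihud illiquidity on a two-day toy example and checks the Roll estimator on simulated bid-ask bounce around a constant mid price. The function names, the simulated data, and the true spread of 0.10 are assumptions for illustration. Here Roll is applied to price changes, so the estimate is in price units; applied to returns, as in the notebook, it is a proportional spread.

```python
import numpy as np
import pandas as pd

def amihud_illiquidity(returns: pd.Series, dollar_volume: pd.Series) -> float:
    """Mean of |return| / dollar volume, skipping zero-volume days."""
    dv = dollar_volume.replace(0, np.nan)
    return (returns.abs() / dv).mean()

def roll_spread(price_changes: pd.Series) -> float:
    """Roll (1984): 2 * sqrt(-cov) of successive changes; 0 if cov is not negative."""
    cov = price_changes.cov(price_changes.shift(1))
    return 2 * np.sqrt(-cov) if cov < 0 else 0.0

# Amihud: a 1% move on $1M of volume and a 2% move on $4M of volume
r = pd.Series([0.01, -0.02])
dv = pd.Series([1e6, 4e6])
illiq = amihud_illiquidity(r, dv)  # mean of 1e-8 and 5e-9

# Roll: trades hit the bid or the ask at random around a mid price of 50
rng = np.random.default_rng(0)
side = rng.choice([-1.0, 1.0], size=100_000)   # random sell/buy indicator
trade_prices = pd.Series(50.0 + 0.05 * side)   # mid plus or minus half the spread
spread_est = roll_spread(trade_prices.diff().dropna())
print(illiq, round(spread_est, 3))
```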

In [None]:
def calculate_liquidity_features(ticker, trade_date, price_df):
    """
    Calculate liquidity and volume features.
    """
    features = {}
    hist = price_df[price_df.index <= trade_date].copy()
    
    if len(hist) < 5:
        return features
    
    # Volume at trade date
    features['volume_t'] = hist['Volume'].iloc[-1]
    
    # Dollar volume (Volume * Close)
    hist['Dollar_Volume'] = hist['Volume'] * hist['Close']
    features['dollar_volume_t'] = hist['Dollar_Volume'].iloc[-1]
    
    # Abnormal volume (vs mean of the 30 trading days before the trade date)
    if len(hist) >= 31:  # 30 prior days plus the trade date itself
        mean_vol_30d = hist['Volume'].iloc[-31:-1].mean()  # Exclude current day
        features['volume_ratio_30d'] = hist['Volume'].iloc[-1] / mean_vol_30d if mean_vol_30d > 0 else np.nan
        features['abnormal_volume_30d'] = hist['Volume'].iloc[-1] - mean_vol_30d
    
    # Amihud illiquidity (2002)
    # Avg(|Return| / Dollar Volume) over past 20 days
    if len(hist) >= 21:
        dv = hist['Dollar_Volume'].iloc[-21:].replace(0, np.nan)
        amihud = (hist['Return'].iloc[-21:].abs() / dv).mean()
        features['amihud_illiq_20d'] = amihud * 1e6  # Scale by million for readability
    
    # Roll (1984) spread estimator, applied to returns (so the spread is proportional)
    # Spread = 2 * sqrt(-Cov(r_t, r_{t-1})) if negative covariance
    if len(hist) >= 30:
        returns = hist['Return'].iloc[-30:].dropna()
        if len(returns) >= 3:
            cov = returns.cov(returns.shift(1))  # First-order autocovariance
            if np.isnan(cov):
                features['roll_spread_30d'] = np.nan  # Not enough data to estimate
            elif cov < 0:
                features['roll_spread_30d'] = 2 * np.sqrt(-cov)
            else:
                features['roll_spread_30d'] = 0  # No bid-ask bounce detected
    
    # High-Low spread proxy (average over 20 days)
    if len(hist) >= 20:
        hl_spread = ((hist['High'] - hist['Low']) / hist['Close']).iloc[-20:].mean()
        features['hl_spread_20d'] = hl_spread
    
    # Number of zero-volume days (illiquidity indicator)
    if len(hist) >= 30:
        features['zero_volume_days_30d'] = (hist['Volume'].iloc[-30:] == 0).sum()
    
    return features

### 4.4 Factor Exposures (CAPM & Fama-French)

**CAPM Beta:**
- Rolling regression: r_stock = α + β * r_market + ε
- Estimated over 252 trading days (1 year)
- Benchmark: S&P 500

**Fama-French Loadings:**
- r_stock - r_f = α + β_mkt * (r_mkt - r_f) + β_smb * SMB + β_hml * HML + ε
- Estimated over 252 days
- Used later for FF-adjusted CARs
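
The regression above can be checked on synthetic data where the true loadings are known. The sketch below builds noise-free excess returns from three made-up factor series and recovers alpha and the betas with `np.linalg.lstsq`. All data here are simulated and the helper name `ols_loadings` is an assumption for illustration.

```python
import numpy as np

def ols_loadings(y: np.ndarray, factors: np.ndarray) -> np.ndarray:
    """OLS of y on [1, factors]; returns [alpha, beta_1, ..., beta_k]."""
    X = np.column_stack([np.ones(len(factors)), factors])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

rng = np.random.default_rng(42)
n_days = 252
# Synthetic daily factor returns: columns stand in for Mkt-RF, SMB, HML
factors = rng.normal(0.0, 0.01, size=(n_days, 3))
true_alpha = 0.0002
true_betas = np.array([1.2, 0.5, -0.3])
# Noise-free excess returns, so OLS should recover the inputs
excess_ret = true_alpha + factors @ true_betas

coeffs = ols_loadings(excess_ret, factors)
annualized_alpha = coeffs[0] * 252

# R-squared from residuals, as in the market-model diagnostics
fitted = np.column_stack([np.ones(n_days), factors]) @ coeffs
r2 = 1 - ((excess_ret - fitted) ** 2).sum() / ((excess_ret - excess_ret.mean()) ** 2).sum()
print(np.round(coeffs, 4), round(annualized_alpha, 4), round(r2, 4))
```

With real returns the residual term is large, so a minimum-observations rule (the notebook uses 30) matters for stability.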

In [None]:
def calculate_factor_exposures(ticker, trade_date, price_df, market_df, ff_df=None):
    """
    Calculate CAPM beta and Fama-French factor loadings.
    
    Args:
        ticker: Stock ticker
        trade_date: Date of trade
        price_df: Stock price history
        market_df: Market index (SP500) history
        ff_df: Fama-French factors (optional)
    """
    features = {}
    
    # Get historical data
    stock_hist = price_df[price_df.index <= trade_date].copy()
    market_hist = market_df[market_df.index <= trade_date].copy()
    
    if len(stock_hist) < 60:
        return features  # Need minimum data for regression
    
    # CAPM Beta (252-day rolling)
    lookback = min(252, len(stock_hist))
    stock_ret = stock_hist['Return'].iloc[-lookback:]
    
    # Align dates (reindex tolerates stock dates that are missing from the market index)
    merged = pd.DataFrame({
        'stock': stock_ret,
        'market': market_hist['Return'].reindex(stock_ret.index)
    }).dropna()
    
    if len(merged) >= 30:  # Minimum for stable regression
        # Simple beta: Cov(r_stock, r_market) / Var(r_market)
        features['beta_252d'] = merged['stock'].cov(merged['market']) / merged['market'].var()
        
        # R-squared of market model
        features['r2_market_252d'] = merged['stock'].corr(merged['market']) ** 2
    
    # Fama-French 3-factor loadings
    if ff_df is not None:
        # Align factors to the stock's return dates; unmatched dates become NaN and are dropped
        ff_hist = ff_df[ff_df.index <= trade_date].reindex(stock_ret.index)
        
        # Merge stock returns with FF factors
        ff_merged = pd.DataFrame({
            'stock_excess': stock_ret - ff_hist['RF'],
            'mkt_rf': ff_hist['Mkt-RF'],
            'smb': ff_hist['SMB'],
            'hml': ff_hist['HML']
        }).dropna()
        
        if len(ff_merged) >= 30:
            # FF3 regression
            X = ff_merged[['mkt_rf', 'smb', 'hml']].values
            y = ff_merged['stock_excess'].values
            
            # Add constant for alpha
            X = np.column_stack([np.ones(len(X)), X])
            
            try:
                # OLS regression
                coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
                
                features['alpha_ff3_252d'] = coeffs[0] * 252  # Annualized alpha
                features['beta_mkt_ff3_252d'] = coeffs[1]
                features['beta_smb_ff3_252d'] = coeffs[2]
                features['beta_hml_ff3_252d'] = coeffs[3]
                
                # R-squared
                y_pred = X @ coeffs
                ss_res = ((y - y_pred) ** 2).sum()
                ss_tot = ((y - y.mean()) ** 2).sum()
                features['r2_ff3_252d'] = 1 - (ss_res / ss_tot)
                
            except Exception:
                pass  # Regression failed, skip
    
    return features

### 4.5 Event Proximity Features

**Earnings Announcements:**
- Days until next earnings (forward-looking, but public info)
- Days since last earnings
- Dummy for "earnings window" (±5 days)

**M&A / News Events:**
- Will add if data becomes available
- For now, flagged as potential extension
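
A small worked example of the earnings-distance logic, assuming made-up announcement dates. It uses `np.fmin`, which ignores a single NaN, so a trade with no future (or no past) earnings date on record can still be flagged from the side that is available. The helper name `earnings_distance` is an assumption for illustration, and distances are in calendar days.

```python
import numpy as np
import pandas as pd

def earnings_distance(trade_date: pd.Timestamp, earnings_dates) -> dict:
    """Calendar-day distance to the next and previous earnings dates."""
    dates = pd.DatetimeIndex(pd.to_datetime(earnings_dates))
    future = dates[dates > trade_date]
    past = dates[dates <= trade_date]
    to_next = (future.min() - trade_date).days if len(future) else np.nan
    since_last = (trade_date - past.max()).days if len(past) else np.nan
    # fmin ignores a single NaN, so one missing side does not hide the other
    nearest = np.fmin(to_next, since_last)
    return {
        "days_to_earnings": to_next,
        "days_since_earnings": since_last,
        "within_5d_earnings": int(nearest <= 5),
    }

# Hypothetical announcement dates
calendar = ["2024-01-25", "2024-04-25"]
before = earnings_distance(pd.Timestamp("2024-04-22"), calendar)  # 3 days ahead
after = earnings_distance(pd.Timestamp("2024-04-28"), calendar)   # no future date on record
far = earnings_distance(pd.Timestamp("2024-07-01"), calendar)     # well outside the window
print(before, after, far, sep="\n")
```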

In [None]:
def calculate_event_proximity(ticker, trade_date, earnings_dates):
    """
    Calculate proximity to known events (earnings, etc.).
    
    Args:
        ticker: Stock ticker
        trade_date: Trade date (pandas Timestamp)
        earnings_dates: List of earnings announcement dates for this ticker
    """
    features = {}
    
    if not earnings_dates or len(earnings_dates) == 0:
        return features
    
    # Convert to pandas datetime if needed
    earnings_dates = pd.to_datetime(earnings_dates)
    
    # Days to next earnings (future)
    future_earnings = earnings_dates[earnings_dates > trade_date]
    if len(future_earnings) > 0:
        next_earnings = future_earnings.min()
        features['days_to_earnings'] = (next_earnings - trade_date).days
    else:
        features['days_to_earnings'] = np.nan
    
    # Days since last earnings (past)
    past_earnings = earnings_dates[earnings_dates <= trade_date]
    if len(past_earnings) > 0:
        last_earnings = past_earnings.max()
        features['days_since_earnings'] = (trade_date - last_earnings).days
    else:
        features['days_since_earnings'] = np.nan
    
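    # Dummies for earnings windows (within ±5 / ±10 calendar days).
    # Drop NaN distances first: Python's min() is order-dependent when NaN is present.
    distances = [
        abs(d) for d in (features['days_to_earnings'], features['days_since_earnings'])
        if not pd.isna(d)
    ]
    min_dist = min(distances) if distances else np.inf
    features['within_5d_earnings'] = 1 if min_dist <= 5 else 0
    features['within_10d_earnings'] = 1 if min_dist <= 10 else 0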
    
    return features

### 4.6 Post-Trade Validation: Cumulative Abnormal Returns (CAR)

**Purpose:**
- Measure if trade predicted future returns (informed trading signal)
- Calculate at 30, 60, 90 day horizons

**Methodology:**

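1. **Raw CAR (market-adjusted):**
   - CAR = (Buy-and-hold stock return) - (Buy-and-hold market return)
   - Benchmark: S&P 500
   - Strictly speaking this is a buy-and-hold abnormal return (BHAR); the notebook keeps the CAR label throughout

2. **Risk-adjusted CAR (CAPM):**
   - Expected return = RF + β * (R_market - RF)
   - CAR = Actual return - Expected return
   - The implementation simplifies the expected return to β * R_market (risk-free rate ignored)
   
3. **FF3-adjusted CAR:**
   - Expected return = RF + β_mkt*(R_mkt-RF) + β_smb*SMB + β_hml*HML
   - CAR = Actual return - Expected return
   - Factor returns over the window are approximated by an arithmetic sum of daily values (no compounding)
   - Most robust to risk factors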

**Note:** This is the ONLY forward-looking feature. It's for validation, not prediction.
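
The first two adjustments can be traced by hand on toy numbers. In the sketch below the stock gains 10% and the benchmark 4% over the window, and the beta of 1.5 is an assumed pre-trade estimate. Prices, dates, and the helper name `buy_and_hold_return` are all made up for illustration, and the risk-free rate is ignored in the CAPM line.

```python
import pandas as pd

def buy_and_hold_return(close: pd.Series, start: pd.Timestamp, end: pd.Timestamp) -> float:
    """Return from the last close on/before `start` to the last close on/before `end`."""
    base = close.loc[close.index <= start].iloc[-1]
    final = close.loc[close.index <= end].iloc[-1]
    return final / base - 1

dates = pd.bdate_range("2024-01-02", periods=30)

# Toy prices: flat until the final day of the sample
stock = pd.Series(100.0, index=dates)
stock.iloc[-1] = 110.0
market = pd.Series(5000.0, index=dates)
market.iloc[-1] = 5200.0

trade_date = dates[0]
end_date = trade_date + pd.Timedelta(days=60)  # 60 calendar days forward

r_stock = buy_and_hold_return(stock, trade_date, end_date)
r_mkt = buy_and_hold_return(market, trade_date, end_date)

beta = 1.5  # assumed pre-trade CAPM beta, for illustration only
car_raw = r_stock - r_mkt            # market-adjusted
car_capm = r_stock - beta * r_mkt    # CAPM-adjusted, risk-free rate ignored
print(round(car_raw, 4), round(car_capm, 4))
```

A high-beta stock is expected to outrun the market in an up move, so the CAPM-adjusted figure comes out below the raw market-adjusted one here.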

In [None]:
def calculate_CAR(ticker, trade_date, horizon_days, price_df, market_df, beta=None, ff_betas=None, ff_df=None):
    """
    Calculate Cumulative Abnormal Returns post-trade.
    
    Args:
        ticker: Stock ticker
        trade_date: Trade date
        horizon_days: Days forward to measure (30, 60, 90)
        price_df: Stock price history
        market_df: Market benchmark
        beta: CAPM beta (if available)
        ff_betas: Dict with FF3 betas {mkt, smb, hml}
        ff_df: Fama-French factor returns
    """
    features = {}
    
    # Get future data (trade date + horizon)
    end_date = trade_date + timedelta(days=horizon_days)
    
    stock_future = price_df[(price_df.index > trade_date) & (price_df.index <= end_date)]
    market_future = market_df[(market_df.index > trade_date) & (market_df.index <= end_date)]
    
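    # horizon_days is in calendar days (roughly 0.69 trading days each), so this
    # threshold requires about 70% of the expected trading days to be present
    if len(stock_future) < horizon_days * 0.5:
        return features
    
    # Buy-and-hold return, based on the last close on or before the trade date
    # (trades dated on weekends/holidays have no exact match in the index)
    stock_base = price_df.loc[price_df.index <= trade_date, 'Close']
    if stock_base.empty:
        return features
    stock_return = (stock_future['Close'].iloc[-1] / stock_base.iloc[-1]) - 1
    
    # Market return (same period); NaN rather than 0 when the benchmark is missing,
    # so a data gap is not mistaken for a flat market
    market_base = market_df.loc[market_df.index <= trade_date, 'Adj Close']
    if len(market_future) > 0 and not market_base.empty:
        market_return = (market_future['Adj Close'].iloc[-1] / market_base.iloc[-1]) - 1
    else:
        market_return = np.nan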
    
    # 1. Raw CAR (market-adjusted)
    features[f'car_raw_{horizon_days}d'] = stock_return - market_return
    
    # 2. CAPM-adjusted CAR
    if beta is not None and not np.isnan(beta):
        expected_return = beta * market_return  # Simplified: ignoring risk-free rate
        features[f'car_capm_{horizon_days}d'] = stock_return - expected_return
    
    # 3. FF3-adjusted CAR
    if ff_betas is not None and ff_df is not None:
        ff_future = ff_df[(ff_df.index > trade_date) & (ff_df.index <= end_date)]
        
        if len(ff_future) > 0:
            # Mean daily factor return scaled by the number of days, i.e. an
            # arithmetic (non-compounded) total over the period
            factor_returns = ff_future[['Mkt-RF', 'SMB', 'HML', 'RF']].mean() * len(ff_future)
            
            expected_return_ff3 = (
                factor_returns['RF'] +
                ff_betas.get('mkt', 1) * factor_returns['Mkt-RF'] +
                ff_betas.get('smb', 0) * factor_returns['SMB'] +
                ff_betas.get('hml', 0) * factor_returns['HML']
            )
            
            features[f'car_ff3_{horizon_days}d'] = stock_return - expected_return_ff3
    
    return features

### 4.7 Stock Characteristics (Fundamentals)

**Available from yfinance:**
- Market capitalization
- Price (for penny stock filter)
- Book value (if available)
- Basic ratios (P/E, P/B)

**Note:** Fundamental data quality from yfinance is limited. Missing data is common.

In [None]:
def get_stock_fundamentals(ticker, trade_date):
    """
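    Get fundamental characteristics for a ticker.
    
    Note: `Ticker.info` returns a current snapshot, not values as of `trade_date`,
    so these fields introduce look-ahead bias for historical trades. yfinance
    fundamentals are also often delayed or missing.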
    """
    features = {}
    
    try:
        stock = yf.Ticker(ticker)
        info = stock.info
        
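        # Market cap (in millions); guard against a key that is present but None
        market_cap = info.get('marketCap')
        features['market_cap'] = market_cap / 1e6 if market_cap is not None else np.nan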
        
        # Price (for filtering)
        features['price'] = info.get('regularMarketPrice', np.nan)
        
        # Book value per share
        features['book_value'] = info.get('bookValue', np.nan)
        
        # Price-to-book
        features['price_to_book'] = info.get('priceToBook', np.nan)
        
        # Enterprise value / EBITDA
        features['ev_to_ebitda'] = info.get('enterpriseToEbitda', np.nan)
        
    except Exception:
        pass  # Failed to get info, return whatever was collected so far
    
    return features

## 5. Main Feature Construction Loop

Now we apply all feature functions to each trade in the dataset.

In [None]:
# Prepare output dataframe
df_features = df.copy()

# Initialize feature columns (will be filled)
feature_dict = {}  # Will store all features for each trade

print(f"Processing {len(df_features)} trades...")
print(f"Tickers with price data: {len(price_data)}")

# Tip: for a quick test run, loop over df_features.head(100).iterrows() instead
for idx, row in tqdm(df_features.iterrows(), total=len(df_features)):
    
    ticker = row[ticker_col]
    trade_date = row[date_col]
    
    # Skip if no price data
    if ticker not in price_data:
        feature_dict[idx] = {}  # Empty features
        continue
    
    price_df = price_data[ticker]
    
    # Initialize feature dict for this trade
    features = {}
    
    # 1. Returns
    features.update(calculate_return_features(ticker, trade_date, price_df))
    
    # 2. Volatility
    features.update(calculate_volatility_features(ticker, trade_date, price_df))
    
    # 3. Liquidity
    features.update(calculate_liquidity_features(ticker, trade_date, price_df))
    
    # 4. Factor exposures
    factor_feats = calculate_factor_exposures(
        ticker, trade_date, price_df, sp500,
        ff_df=ff_factors  # None when the Fama-French download failed
    )
    features.update(factor_feats)
    
    # 5. Event proximity
    if ticker in earnings_data:
        features.update(calculate_event_proximity(ticker, trade_date, earnings_data[ticker]))
    
    # 6. Post-trade CAR (30d, 60d, 90d)
    # Extract beta and FF betas if available
    beta = factor_feats.get('beta_252d', None)
    ff_betas = {
        'mkt': factor_feats.get('beta_mkt_ff3_252d', None),
        'smb': factor_feats.get('beta_smb_ff3_252d', None),
        'hml': factor_feats.get('beta_hml_ff3_252d', None)
    } if 'beta_mkt_ff3_252d' in factor_feats else None
    
    for horizon in [30, 60, 90]:
        car_feats = calculate_CAR(
            ticker, trade_date, horizon, price_df, sp500,
            beta=beta, ff_betas=ff_betas,
            ff_df=ff_factors  # None when the Fama-French download failed
        )
        features.update(car_feats)
    
    # 7. Fundamentals (expensive, only if needed)
    # Uncomment if you want fundamentals:
    # features.update(get_stock_fundamentals(ticker, trade_date))
    
    # Store
    feature_dict[idx] = features

print("Feature construction complete!")

## 6. Merge Features with Original Data

In [None]:
# Convert feature dict to dataframe
df_features_wide = pd.DataFrame.from_dict(feature_dict, orient='index')

# Merge with original trades
df_final = df_features.join(df_features_wide)

print(f"Final dataset shape: {df_final.shape}")
print(f"Number of features added: {df_features_wide.shape[1]}")
print(f"\nFeature names: {df_features_wide.columns.tolist()}")

## 7. Data Quality Filtering

Apply filters from ITI paper:
- Price >= $5
- Market cap >= $100M
- Sufficient data availability
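
One way to apply the price and market-cap screens is to add flag columns rather than drop rows, in line with the "flag, but preserve row" policy. The sketch below assumes hypothetical `price` and `market_cap` columns holding trade-date values (market cap in millions USD); the tickers and numbers are made up.

```python
import numpy as np
import pandas as pd

MIN_PRICE = 5.0          # USD
MIN_MARKET_CAP = 100.0   # millions USD, matching the assumed `market_cap` units

def flag_sample_filters(frame: pd.DataFrame) -> pd.DataFrame:
    """Add boolean screen columns without dropping any rows."""
    out = frame.copy()
    out["passes_price_filter"] = out["price"] >= MIN_PRICE
    out["passes_cap_filter"] = out["market_cap"] >= MIN_MARKET_CAP
    out["in_sample"] = out["passes_price_filter"] & out["passes_cap_filter"]
    return out

toy = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],   # made-up symbols
    "price": [42.0, 3.5, 12.0, np.nan],
    "market_cap": [2500.0, 800.0, 60.0, 150.0],
})
flagged = flag_sample_filters(toy)
print(flagged[["ticker", "in_sample"]])
```

Comparisons against NaN evaluate to False, so rows with missing fundamentals fail the screen by default; whether that is the desired treatment is a design choice.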

In [None]:
# The price and market-cap screens need the optional fundamentals (Section 4.7),
# which are disabled by default. For now, only data availability is flagged.

print("Before filtering:")
print(f"Total rows: {len(df_final)}")

# Flag trades with insufficient data
df_final['has_price_data'] = df_final[ticker_col].isin(price_data.keys())
df_final['has_min_features'] = df_final['return_t'].notna()  # Has basic return data

print(f"\nHas price data: {df_final['has_price_data'].sum()} ({df_final['has_price_data'].mean()*100:.1f}%)")
print(f"Has minimum features: {df_final['has_min_features'].sum()} ({df_final['has_min_features'].mean()*100:.1f}%)")

# Optionally filter (or just flag)
# df_filtered = df_final[df_final['has_price_data'] & df_final['has_min_features']].copy()
# For now, keep all rows but with flags

## 8. Winsorization

Following ITI paper: winsorize extreme values at 0.5% and 99.5%
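
For intuition, winsorizing at 0.5% / 99.5% can be sketched as clipping to empirical quantiles. This quantile-clip variant is an illustrative alternative, not the `mstats.winsorize` call used in the next cell, and the two can differ slightly at the cut-offs because `mstats.winsorize` works on observation counts. The helper name and toy series are assumptions.

```python
import numpy as np
import pandas as pd

def winsorize_series(s: pd.Series, lower: float = 0.005, upper: float = 0.995) -> pd.Series:
    """Clip values to the [lower, upper] empirical quantiles; NaNs pass through."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

values = pd.Series(np.arange(1000, dtype=float))
values.iloc[0] = -1e6      # extreme low outlier
values.iloc[-1] = 1e6      # extreme high outlier
values.iloc[500] = np.nan  # missing value should stay missing

w = winsorize_series(values)
print(w.min(), w.max(), int(w.isna().sum()))
```

Because `quantile` skips NaN and `clip` leaves NaN untouched, missing values neither distort the cut-offs nor get filled in.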

In [None]:
from scipy.stats import mstats

# List of features to winsorize (exclude categorical/binary)
features_to_winsorize = [
    col for col in df_features_wide.columns 
    if col not in ['within_5d_earnings', 'within_10d_earnings']
]

print(f"Winsorizing {len(features_to_winsorize)} features...")

for col in features_to_winsorize:
    if df_final[col].notna().sum() > 10:  # Only if enough data
        df_final[col] = mstats.winsorize(
            df_final[col].values, 
            limits=[0.005, 0.005],  # 0.5% on each tail
            nan_policy='omit'
        )

print("Winsorization complete.")

## 9. Export Results

In [None]:
# Create output directory
Path('data/outputs').mkdir(parents=True, exist_ok=True)

# Save enriched dataset
output_file = 'data/outputs/congress_trading_features.csv'
df_final.to_csv(output_file, index=False)

print(f"Dataset saved to: {output_file}")
print(f"Shape: {df_final.shape}")
print(f"Features: {df_final.shape[1] - df_trades.shape[1]} new columns added")

## 10. Create Variable Dictionary

In [None]:
# Create comprehensive variable dictionary
variable_dict = []

# Define all variables with descriptions
var_definitions = {
    # Returns
    'return_t': 'Daily return on trade date',
    'return_overnight': 'Overnight return (close to open)',
    'return_intraday': 'Intraday return (open to close)',
    'momentum_5d': '5-day momentum (short-term reversal)',
    'momentum_20d': '20-day momentum (1 month)',
    'momentum_60d': '60-day momentum (3 months)',
    'momentum_252d': '252-day momentum (1 year)',
    'abs_return_t': 'Absolute daily return',
    
    # Volatility
    'realized_vol_30d': 'Realized volatility (30-day, annualized)',
    'realized_vol_60d': 'Realized volatility (60-day, annualized)',
    'realized_vol_252d': 'Realized volatility (252-day, annualized)',
    'parkinson_vol_30d': 'Parkinson high-low volatility estimator (30-day)',
    'vol_of_vol_60d': 'Volatility of volatility (60-day)',
    
    # Volume & Liquidity
    'volume_t': 'Trading volume on trade date',
    'dollar_volume_t': 'Dollar trading volume (Volume * Price)',
    'volume_ratio_30d': 'Volume / 30-day average volume',
    'abnormal_volume_30d': 'Volume - 30-day average volume',
    'amihud_illiq_20d': 'Amihud (2002) illiquidity measure (20-day)',
    'roll_spread_30d': 'Roll (1984) bid-ask spread estimator (30-day)',
    'hl_spread_20d': 'High-Low spread proxy (20-day average)',
    'zero_volume_days_30d': 'Number of zero-volume days in past 30 days',
    
    # Factor Exposures
    'beta_252d': 'CAPM beta (252-day rolling, vs S&P 500)',
    'r2_market_252d': 'R-squared of market model (252-day)',
    'alpha_ff3_252d': 'Fama-French 3-factor alpha (252-day, annualized)',
    'beta_mkt_ff3_252d': 'FF3 market beta (252-day)',
    'beta_smb_ff3_252d': 'FF3 size (SMB) beta (252-day)',
    'beta_hml_ff3_252d': 'FF3 value (HML) beta (252-day)',
    'r2_ff3_252d': 'R-squared of FF3 model (252-day)',
    
    # Event Proximity
    'days_to_earnings': 'Days until next earnings announcement',
    'days_since_earnings': 'Days since last earnings announcement',
    'within_5d_earnings': 'Dummy: 1 if within ±5 days of earnings',
    'within_10d_earnings': 'Dummy: 1 if within ±10 days of earnings',
    
    # Post-Trade CAR (Validation)
    'car_raw_30d': 'Market-adjusted CAR, 30 days post-trade',
    'car_raw_60d': 'Market-adjusted CAR, 60 days post-trade',
    'car_raw_90d': 'Market-adjusted CAR, 90 days post-trade',
    'car_capm_30d': 'CAPM-adjusted CAR, 30 days post-trade',
    'car_capm_60d': 'CAPM-adjusted CAR, 60 days post-trade',
    'car_capm_90d': 'CAPM-adjusted CAR, 90 days post-trade',
    'car_ff3_30d': 'Fama-French 3-factor adjusted CAR, 30 days post-trade',
    'car_ff3_60d': 'Fama-French 3-factor adjusted CAR, 60 days post-trade',
    'car_ff3_90d': 'Fama-French 3-factor adjusted CAR, 90 days post-trade',
    
    # Fundamentals (if included)
    'market_cap': 'Market capitalization (millions USD)',
    'price': 'Stock price',
    'book_value': 'Book value per share',
    'price_to_book': 'Price-to-book ratio',
    'ev_to_ebitda': 'Enterprise value / EBITDA',
    
    # Flags
    'likely_equity': 'Flag: 1 if security appears to be equity (not bond/bill)',
    'has_price_data': 'Flag: 1 if price data was available',
    'has_min_features': 'Flag: 1 if minimum features could be calculated'
}

# Build dictionary dataframe
for col in df_final.columns:
    if col in var_definitions:
        variable_dict.append({
            'variable_name': col,
            'description': var_definitions[col],
            'source': 'yfinance + Fama-French' if 'ff3' in col else 'yfinance',
            'type': 'feature'
        })
    elif col in df_trades.columns:
        variable_dict.append({
            'variable_name': col,
            'description': 'Original variable from congressional trading data',
            'source': 'congress-trading-all.csv',
            'type': 'original'
        })

var_dict_df = pd.DataFrame(variable_dict)

# Make sure the output directory exists before writing
Path('data/outputs').mkdir(parents=True, exist_ok=True)
var_dict_df.to_csv('data/outputs/variable_dictionary.csv', index=False)

print("Variable dictionary saved: data/outputs/variable_dictionary.csv")
print(f"Total variables documented: {len(var_dict_df)}")
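The loop above silently skips any column that is neither in `var_definitions` nor in the original trade data, so such columns never reach the dictionary. A minimal sketch of a check that lists them is shown here; the helper name `find_undocumented_columns` and the toy column names are hypothetical and not part of the notebook.

```python
import pandas as pd


def find_undocumented_columns(df, definitions, original_columns):
    """Return columns of df that have no definition and are not original inputs."""
    documented = set(definitions) | set(original_columns)
    return [col for col in df.columns if col not in documented]


# Toy example with made-up column names (not the real dataset)
toy = pd.DataFrame({
    'ticker': ['AAPL'],
    'return_t': [0.01],
    'mystery_feature': [1.0],
})
toy_definitions = {'return_t': 'Return on trade date'}
toy_original = ['ticker']

missing_docs = find_undocumented_columns(toy, toy_definitions, toy_original)
print(missing_docs)
```

In the notebook this would presumably be called as `find_undocumented_columns(df_final, var_definitions, df_trades.columns)`; a non-empty result points to features that still need an entry in `var_definitions`.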

## 11. Data Quality Report

In [None]:
# Completeness report
print("=" * 60)
print("DATA QUALITY REPORT")
print("=" * 60)

print("\n1. SAMPLE SIZE")
print(f"   Total trades: {len(df_final):,}")
print(f"   Trades with price data: {df_final['has_price_data'].sum():,} ({df_final['has_price_data'].mean()*100:.1f}%)")
print(f"   Trades with features: {df_final['has_min_features'].sum():,} ({df_final['has_min_features'].mean()*100:.1f}%)")

print("\n2. FEATURE COMPLETENESS")
feature_cols = [col for col in df_features_wide.columns if col in df_final.columns]
completeness = df_final[feature_cols].notna().mean().sort_values(ascending=False)

print(f"   Features with >90% coverage: {(completeness > 0.9).sum()}")
print(f"   Features with >50% coverage: {(completeness > 0.5).sum()}")
print(f"   Features with <10% coverage: {(completeness < 0.1).sum()}")

print("\n3. TOP 10 MOST COMPLETE FEATURES:")
for feat, pct in completeness.head(10).items():
    print(f"   {feat:30s} {pct*100:5.1f}%")

print("\n4. TOP 10 LEAST COMPLETE FEATURES:")
for feat, pct in completeness.tail(10).items():
    print(f"   {feat:30s} {pct*100:5.1f}%")

print("\n5. TICKERS")
print(f"   Unique tickers in data: {df_final[ticker_col].nunique():,}")
print(f"   Tickers with price data: {len(price_data):,}")
print(f"   Failed to download: {len(failed_tickers):,}")

print("\n6. EVENTS")
print(f"   Tickers with earnings data: {len(earnings_data):,}")
if 'days_to_earnings' in df_final.columns:
    print(f"   Trades with earnings proximity: {df_final['days_to_earnings'].notna().sum():,}")

print("\n" + "=" * 60)
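The completeness figures in this report are computed over all trades, so rows without any price data pull every feature's coverage down by the same amount. A hedged sketch of a complementary view compares overall coverage with coverage among trades where `has_price_data` is 1. The helper `completeness_by_flag` and the toy values are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
import pandas as pd


def completeness_by_flag(df, feature_cols, flag_col='has_price_data'):
    """Share of non-missing values per feature, overall and where flag_col == 1."""
    overall = df[feature_cols].notna().mean()
    flagged = df.loc[df[flag_col] == 1, feature_cols].notna().mean()
    return pd.DataFrame({'overall': overall, 'given_flag': flagged})


# Toy data: the last trade has no price data at all
toy = pd.DataFrame({
    'has_price_data': [1, 1, 1, 0],
    'return_t': [0.01, -0.02, 0.00, np.nan],
    'momentum_252d': [0.10, np.nan, 0.05, np.nan],
})

report = completeness_by_flag(toy, ['return_t', 'momentum_252d'])
print(report)
```

A feature whose `given_flag` coverage is still low is missing for reasons other than a failed download, for example a lookback window longer than the available price history.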

## 12. Summary Statistics

In [None]:
# Summary stats for key features
key_features = [
    'return_t', 'momentum_60d', 'realized_vol_60d', 
    'volume_ratio_30d', 'amihud_illiq_20d', 'beta_252d',
    'car_raw_30d', 'car_capm_60d'
]

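# Keep only features present in df_final to avoid a KeyError
key_features = [f for f in key_features if f in df_final.columns]

summary = df_final[key_features].describe()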
print("\nSUMMARY STATISTICS (KEY FEATURES):")
print(summary.round(4))

# Save summary
summary.to_csv('data/outputs/summary_statistics.csv')
print("\nSummary statistics saved to: data/outputs/summary_statistics.csv")

## 13. Correlation Heatmap (Optional)

Visualize pairwise Pearson correlations between key features (computed on pairwise-complete observations).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix for key features
corr_features = [
    'return_t', 'momentum_20d', 'momentum_60d', 'momentum_252d',
    'realized_vol_30d', 'realized_vol_60d',
    'volume_ratio_30d', 'amihud_illiq_20d',
    'beta_252d', 'car_raw_30d', 'car_raw_60d'
]

# Filter to features that exist
corr_features = [f for f in corr_features if f in df_final.columns]

corr_matrix = df_final[corr_features].corr()

# Plot
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.savefig('data/outputs/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("Correlation heatmap saved to: data/outputs/correlation_heatmap.png")
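The heatmap is useful for eyeballing structure, but for the "remove highly correlated features" step the same matrix can also be scanned programmatically. The sketch below extracts feature pairs whose absolute correlation exceeds a threshold. The helper name `high_corr_pairs`, the 0.8 threshold, and the toy columns are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr, threshold=0.8):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) fail the comparison and are filtered out here
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=np.abs, ascending=False)


# Toy data: 'b' is nearly collinear with 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.5],
    'c': [1.0, -1.0, 1.0, -1.0, 1.0],
})

pairs = high_corr_pairs(toy.corr(), threshold=0.8)
print(pairs)
```

Applied to `corr_matrix`, the result gives a shortlist of candidates for dropping or combining; which member of each pair to keep remains a judgment call.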

---

## ✅ NOTEBOOK COMPLETE

### Outputs Generated:

1. **`data/outputs/congress_trading_features.csv`**  
   Main dataset with ~60-80 market features added

2. **`data/outputs/variable_dictionary.csv`**  
   Documentation of all variables

3. **`data/outputs/failed_tickers.csv`**  
   Tickers that could not be downloaded

4. **`data/outputs/summary_statistics.csv`**  
   Summary stats for key features

5. **`data/outputs/correlation_heatmap.png`**  
   Correlation matrix visualization

---

### Next Steps:

1. **Dimensionality Reduction:**
   - PCA, Factor Analysis, or feature selection
   - Remove highly correlated features

2. **Missing Data Imputation:**
   - Decide on strategy: drop, forward-fill, or model-based

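3. **Anomaly Detection Models:**
   - Isolation Forest (Stage 1)
   - DBSCAN clustering
   - Supervised models (XGBoost, Random Forest), once labels are available
   - Keep the post-trade `car_*` columns out of the model inputs; they are reserved for validation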

4. **Validation:**
   - CAR analysis by anomaly score deciles
   - Portfolio sorts
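
The decile validation step can be sketched as follows, assuming a model has already produced a score column. The column name `anomaly_score`, the helper `car_by_score_bucket`, and the synthetic data are hypothetical; the toy CAR is constructed to rise with the score purely so the output is easy to read, and it says nothing about the real data.

```python
import numpy as np
import pandas as pd


def car_by_score_bucket(df, score_col, car_col, n_buckets=10):
    """Mean, median and count of car_col within quantile buckets of score_col."""
    valid = df[[score_col, car_col]].dropna()
    # Rank first so tied scores still yield unique quantile edges
    ranks = valid[score_col].rank(method='first')
    buckets = (pd.qcut(ranks, q=n_buckets, labels=False) + 1).rename('bucket')
    return valid.groupby(buckets)[car_col].agg(['mean', 'median', 'count'])


# Synthetic example: 100 trades, CAR built as a linear function of the score
scores = np.linspace(0.0, 1.0, 100)
toy = pd.DataFrame({
    'anomaly_score': scores,
    'car_raw_30d': 0.1 * scores,
})

table = car_by_score_bucket(toy, 'anomaly_score', 'car_raw_30d')
print(table)
```

If the anomaly score carries information, average CAR should differ systematically across buckets; a flat profile would suggest the score does not separate informed from uninformed trades.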

---

**All design decisions documented inline. Ready for ML pipeline.**