# Project Titan - Notebook 2: Time-Varying Factor Exposures

### **Objective**
This notebook constructs the **time-varying Factor Exposure Matrix ($X_t$)**, a critical input for a dynamic multifactor risk model. This "Project Titan" version implements a professional-grade process for building robust, point-in-time factor exposures. For each month in our sample period, we will calculate a full cross-section of exposures based on data that would have been known at that time.

The final output is a large panel dataset where each row represents a specific stock at a specific point in time, and each column represents that stock's exposure to a fundamental factor.

---

### **Methodology: Point-in-Time & Composite Factor Construction**

The methodology focuses on creating a robust and realistic `X` matrix by incorporating two key professional techniques: point-in-time data handling and composite factor construction.

*   **1. Industry Factor Classification:** I classify each stock into one of the 12 Fama-French industry groups based on its historical SIC code. This creates 12 orthogonal "dummy variable" factors that will form the basis for capturing market-wide and sector-specific risk.

*   **2. Composite Style Factor Construction:** We build our style factors using a **multi-descriptor composite approach**, as recommended by Grinold & Kahn for model robustness. This involves:
    *   **Descriptor Calculation:** We first calculate the raw, underlying data ("descriptors") for each factor. This includes:
        *   **Value:** Book-to-Market (B/M) and Earnings-to-Price (E/P).
        *   **Momentum:** 12-month and 6-month historical returns (skipping the most recent month).
        *   **Size:** The natural logarithm of market capitalization.
        *   **Financial Constraints:** The Whited-Wu (WW) Index, which itself is a composite of several accounting ratios.
    *   **Point-in-Time Lagging:** To avoid lookahead bias, accounting-based descriptors (like Book Equity and Earnings) are appropriately lagged to simulate real-world reporting delays.

*   **3. Cross-Sectional Standardization:** This is the core of the process. **For each month in our sample**, I perform a **capitalization-weighted standardization** on each of the raw style factor descriptors individually. This converts each descriptor into a comparable Z-score relative to the market *at that specific point in time*.

*   **4. Final Factor Assembly:** The final factor exposures are created:
    *   For composite factors (Value, Momentum), we take the **average of their respective standardized descriptors.**
    *   The final composite factors are then re-standardized to ensure they have a clean, cap-weighted mean of zero and standard deviation of one. This creates pure, "extra-market" style factors.

---

### **Key Concepts & Theoretical Justification**

#### **1. Composite Factors for Robustness**

A key principle from "Active Portfolio Management" is that relying on a single descriptor for a factor (e.g., only using Book-to-Market for "Value") makes a model fragile. Any single accounting ratio can be noisy, subject to measurement error, or misleading for certain industries (e.g., B/M for tech firms). By creating a **composite factor** from several related but distinct descriptors, we **diversify away the idiosyncratic noise** of each individual measure. The resulting factor is a more robust and stable representation of the underlying economic concept.

#### **2. Time-Varying Exposures**

Companies evolve. A firm can grow from a "small-cap" to a "large-cap." It can transition from a "growth" stock to a "value" stock. By recalculating the standardized exposures for every period, our risk model can adapt to this evolution, providing a more accurate, forward-looking assessment of risk.

#### **3. Point-in-Time Data & Lookahead Bias**

Lookahead bias is one of the most critical errors in quantitative research. It occurs when a model is built using information that would not have been available at the time of the decision. By carefully lagging accounting data to account for reporting delays, I ensure the factor exposures are "point-in-time" correct and our subsequent backtests are valid and realistic.

---
**Output:** This notebook generates and saves the `factor_exposures_titan.parquet` file. This panel dataset, indexed by `(date, permno)`, is the primary $X$ input for the Fama-MacBeth risk model estimation in Notebook 3.



### 1. Imports and Load Data

In [44]:
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
from pathlib import Path 

print("Libraries imported successfully.")

# --- Load the master panel data from Notebook 1 ---
onedrive_root = str(Path(os.environ['OneDrive']))
DATA_DIR = os.path.join(onedrive_root, "0. DATASETS", "outputs")

PANEL_DATA_FILE = os.path.join(DATA_DIR, 'panel_data.parquet')

df = pd.read_parquet(PANEL_DATA_FILE)

# making sure permno and industry codes are stored as int
df['permno'] = df['permno'].astype('int')
# nullable int:
df['sic'] = df['sic'].astype('Int64')

# Setting a multi-index for efficiency
df.reset_index(inplace=True)  # move index back to columns
df.set_index(['permno', 'date'], inplace=True)
df.sort_index(inplace=True)

print("Monthly panel data loaded successfully.")
print(f"Data shape: {df.shape}")
print(f"Date range: {df.index.get_level_values('date').min()} to {df.index.get_level_values('date').max()}")


Libraries imported successfully.
Monthly panel data loaded successfully.
Data shape: (1660775, 32)
Date range: 1995-01-31 00:00:00 to 2023-12-31 00:00:00


### 2. Industry Factor Exposures

In [45]:
# --- Create Industry Factor Exposures ---

# Helper function to map from SIC codes to FF12 industries.
def sic_to_ff12(sic):
    """
    Converts a SIC code to one of the 12 Fama-French industry classifications.
    Based on the definitions from Ken French's website.
    """
    if pd.isnull(sic):
        return np.nan
    
    sic = int(sic)
    
    if 100 <= sic <= 999: return 'Consumer' # Non-Durables
    if 1000 <= sic <= 1499: return 'Other' # Mining, Quarrying
    if 1500 <= sic <= 1799: return 'Other' # Construction
    if 2000 <= sic <= 2999: return 'Consumer' # Food, Tobacco, Textiles, Apparel, Paper
    if 3000 <= sic <= 3999: return 'Durables' # Cars, TVs, Furniture, Industrial Equip
    if 4000 <= sic <= 4999: return 'Telecom' # Telephone, TV, Radio, Utilities
    if 5000 <= sic <= 5199: return 'Shops' # Wholesale
    if 5200 <= sic <= 5999: return 'Shops' # Retail
    if 6000 <= sic <= 6999: return 'Finance' # Finance, Insurance, Real Estate
    if 7000 <= sic <= 8999: return 'Services' # Hotels, Business Svcs, Healthcare
    if 9000 <= sic <= 9999: return 'Other' # Public Admin
    
    # Refined categories for more detail
    if sic in [2830, 2831, 2833, 2834, 2835, 2836]: return 'Healthcare'
    if sic in [3570, 3571, 3572, 3575, 3577, 3578]: return 'Technology'
    if sic in [3660, 3661, 3663, 3665, 3669, 3670, 3671, 3672, 3674]: return 'Technology'
    if sic in [4810, 4812, 4813, 4822, 4832, 4833, 4841, 4881, 4891, 4892, 4899]: return 'Telecom'
    if sic in [4900, 4911, 4920, 4922, 4923, 4924, 4925, 4931, 4932, 4939, 4941]: return 'Utilities'
    if sic in [7370, 7371, 7372, 7373, 7374, 7375, 7376, 7377, 7378, 7379]: return 'Technology'
    if sic in [1310, 1311, 1321, 1381, 1382, 1389]: return 'Energy'
    if sic in [2911, 2912, 2992, 2999]: return 'Energy'

    return 'Other' # Default for anything missed

# Apply the function to the 'sic' column. Note: CRSP hsiccd is better if available.
df['industry'] = df['sic'].apply(sic_to_ff12)

# Create the dummy variables
industry_dummies = pd.get_dummies(df['industry'], prefix='Ind')

# We'll join this back to our main DataFrame later.
print("Industry factor exposures created.")


Industry factor exposures created.


In [46]:
df.head(100)

Unnamed: 0_level_0,Unnamed: 1_level_0,share_code,exchange_code,sic,prc,ret_daily,shrout,vwretd,sprtrn,gvkey,mkt_cap,...,lctq,ltq,oiadpq,pstkq,saleq,oancfy,dvpspq,prccq,ret_monthly,industry
permno,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
10001,1995-01-31,11.0,3.0,4925,7.750000,0.026915,2224.0,0.003962,0.004077,012994,17236.000000,...,8.520,23.217,1.486,0.0,10.537,,0.00,8.000,-3.124999e-02,Telecom
10001,1995-02-28,11.0,3.0,4925,7.546875,-0.026210,2224.0,0.008116,0.007400,012994,16784.250000,...,8.520,23.217,1.486,0.0,10.537,,0.00,8.000,-2.620967e-02,Telecom
10001,1995-03-31,11.0,3.0,4925,7.500000,-0.032258,2244.0,-0.002444,-0.003007,012994,16830.000000,...,6.108,20.823,1.829,0.0,11.266,,0.19,7.500,5.970750e-03,Telecom
10001,1995-04-30,11.0,3.0,4925,7.500000,-0.006211,2244.0,0.001800,0.002259,012994,16830.000000,...,6.108,20.823,1.829,0.0,11.266,,0.19,7.500,8.138990e-09,Telecom
10001,1995-05-31,11.0,3.0,4925,7.875000,0.000000,2244.0,0.014017,0.018755,012994,17671.500000,...,6.108,20.823,1.829,0.0,11.266,,0.19,7.500,5.000000e-02,Telecom
10001,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10001,2002-12-31,11.0,3.0,4925,7.351000,-0.007962,2591.0,0.001781,0.000489,012994,19046.440565,...,24.727,47.158,0.471,0.0,22.485,-2.995,0.00,7.351,-1.074939e-01,Telecom
10001,2003-01-31,11.0,3.0,4925,8.440000,-0.034325,2591.0,0.012660,0.013130,012994,21868.038913,...,24.727,47.158,0.471,0.0,22.485,-2.995,0.00,7.351,1.481431e-01,Telecom
10001,2003-02-28,11.0,3.0,4925,8.740000,0.027027,2591.0,0.004282,0.004622,012994,22645.339407,...,24.727,47.158,0.471,0.0,22.485,-2.995,0.00,7.351,3.554505e-02,Telecom
10001,2003-03-31,11.0,3.0,4925,7.650000,0.013245,2593.0,-0.015745,-0.017742,012994,19836.450247,...,22.844,46.072,3.345,0.0,29.617,4.439,0.27,7.650,-1.101438e-01,Telecom


### 3. Create Style Factor Descriptors 

In [47]:
# --- REVISED AND FINAL Cell 3: Create Style Factor Descriptors ---

print("Calculating raw style factor descriptors...")

# --- Step 1: Ensure DataFrame is "flat" for calculations ---
if isinstance(df.index, pd.MultiIndex):
    df.reset_index(inplace=True)

# --- Size Descriptor ---
df['size_desc'] = np.log(df['mkt_cap'])

# --- Value Descriptors (Composite) ---
print("  Calculating Value descriptors...")
# Descriptor 1: Book-to-Market (B/M)
df['book_equity_lagged'] = df.sort_values('date').groupby('permno')['ceqq'].shift(6)
df['bm_desc'] = df['book_equity_lagged'] / df['mkt_cap']

# Descriptor 2: Earnings-to-Price (E/P)
quarterly_fundamentals = df[['permno', 'datadate', 'ibq']].copy().drop_duplicates()
quarterly_fundamentals.dropna(subset=['datadate'], inplace=True)
quarterly_fundamentals.sort_values(by=['permno', 'datadate'], inplace=True)
quarterly_fundamentals['ltm_earnings'] = quarterly_fundamentals.groupby('permno')['ibq'].rolling(window=4, min_periods=4).sum().values
quarterly_fundamentals['announcement_date'] = quarterly_fundamentals['datadate'] + pd.DateOffset(months=3)
quarterly_fundamentals.dropna(subset=['permno', 'announcement_date', 'ltm_earnings'], inplace=True)

df = pd.merge_asof(
    left=df.sort_values('date'),
    right=quarterly_fundamentals[['permno', 'announcement_date', 'ltm_earnings']].sort_values('announcement_date'),
    left_on='date',
    right_on='announcement_date',
    by='permno'
)
df['ep_desc'] = df['ltm_earnings'] / df['mkt_cap']


# --- Momentum Descriptors (Composite) ---
print("  Calculating Momentum descriptors...")
df['mom12_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(11).apply(lambda r: (1+r).prod()-1))
df['mom6_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(5).apply(lambda r: (1+r).prod()-1))


# --- Momentum Descriptors (Composite) ---
print("  Calculating Momentum descriptors...")
df['mom12_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(11).apply(lambda r: (1+r).prod()-1))
df['mom6_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(5).apply(lambda r: (1+r).prod()-1))

# --- Financial Constraints (Whited-Wu) ---
print("  Calculating Financial Constraint (WW) descriptor...")
df['cf_at'] = (df['ibq'] + df['dpq']) / df['atq']
df['div_pos'] = ((df['dvpspq'] * df['cshoq']) > 0).astype(int)
df['tLtd_at'] = df['dlttq'] / df['atq']
df['sg'] = df.sort_values('date').groupby('permno')['saleq'].pct_change(fill_method=None)

df['isg_industry'] = df.groupby(['industry', 'date'])['sg'].transform('mean')
df['isg'] = df.groupby(['permno'])['isg_industry'].shift(1)
df.drop('isg_industry', axis = 1, inplace = True)

df['ww_desc'] = -0.091*df['cf_at'] - 0.062*df['div_pos'] + 0.021*df['tLtd_at'] - 0.044*np.log(df['atq'].replace(0, np.nan)) + 0.102*df['isg'] - 0.035*df['sg']

# --- Final Cleanup ---
descriptor_cols = ['size_desc', 'bm_desc', 'ep_desc', 'mom12_1_desc', 'mom6_1_desc', 'ww_desc']
for col in descriptor_cols:
    if col in df.columns:
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)

# --- Step 2: Set index back for the next steps ---
df.set_index(['permno', 'date'], inplace=True)
df.sort_index(inplace=True)

print("\nRaw style factor descriptors calculated successfully.")

Calculating raw style factor descriptors...
  Calculating Value descriptors...
  Calculating Momentum descriptors...
  Calculating Momentum descriptors...
  Calculating Financial Constraint (WW) descriptor...

Raw style factor descriptors calculated successfully.


In [48]:
df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,share_code,exchange_code,sic,prc,ret_daily,shrout,vwretd,sprtrn,gvkey,mkt_cap,...,ltm_earnings,ep_desc,mom12_1_desc,mom6_1_desc,cf_at,div_pos,tLtd_at,sg,isg,ww_desc
permno,date,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
10001,1995-01-31,11.0,3.0,4925,7.75,0.026915,2224.0,0.003962,0.004077,12994,17236.0,...,,,,,0.036094,0,0.316366,,,
10001,1995-02-28,11.0,3.0,4925,7.546875,-0.02621,2224.0,0.008116,0.0074,12994,16784.25,...,,,,,0.036094,0,0.316366,0.0,,
10001,1995-03-31,11.0,3.0,4925,7.5,-0.032258,2244.0,-0.002444,-0.003007,12994,16830.0,...,,,,,0.046296,1,0.330715,0.069185,0.058488,-0.207453
10001,1995-04-30,11.0,3.0,4925,7.5,-0.006211,2244.0,0.0018,0.002259,12994,16830.0,...,,,,,0.046296,1,0.330715,0.0,0.188749,-0.191745
10001,1995-05-31,11.0,3.0,4925,7.875,0.0,2244.0,0.014017,0.018755,12994,17671.5,...,,,,,0.046296,1,0.330715,0.0,0.0,-0.210997


date
1995-01-31    4.703737e+09
1995-02-28    4.880051e+09
1995-03-31    5.021973e+09
1995-04-30    5.135041e+09
1995-05-31    5.301979e+09
                  ...     
2023-08-31    4.305005e+10
2023-09-30    4.092016e+10
2023-10-31    3.973609e+10
2023-11-30    4.326090e+10
2023-12-31    4.542899e+10
Name: mkt_cap, Length: 348, dtype: float64

In [None]:
def standardize_cap_weighted(df, var_name):
    df['total_mkt_cap'] = df.groupby('date')['mkt_cap'].transform('sum')
    
    df['cap_weight'] = df['mkt_cap'] / df['total_mkt_cap']
    df['weighted_mkt_cap'] = df['cap_weight'] * df['mkt_cap']
    df[f'{var_name}_weighted_mean'] = df.groupby('date')['weighted_mkt_cap'].transform('sum')
    df['weighted_squared_deviation'] = df[var_name] - df['']

    df[f"{var_name}_std"] = df.groupby('date')[var_name].std()
    df[f'z_{var_name}'] = (df['var_name'] - df['']
