# Project Titan - Notebook 2: Time-Varying Factor Exposures

### **Objective**
This notebook constructs the **time-varying Factor Exposure Matrix ($X_t$)**, a critical input for a dynamic multifactor risk model. The "Project Titan" version builds robust, point-in-time factor exposures: for each month in the sample period, we calculate a full cross-section of exposures using only data that would have been known at that time.

The final output is a large panel dataset where each row represents a specific stock at a specific point in time, and each column represents that stock's exposure to a fundamental factor.

---

### **Methodology: Point-in-Time & Composite Factor Construction**

The methodology focuses on creating a robust and realistic `X` matrix by incorporating two key professional techniques: point-in-time data handling and composite factor construction.

*   **1. Industry Factor Classification:** We classify each stock into one of the 12 Fama-French industry groups based on its historical SIC code. This creates 12 mutually exclusive dummy-variable factors that form the basis for capturing market-wide and sector-specific risk.

*   **2. Composite Style Factor Construction:** We build our style factors using a **multi-descriptor composite approach**, as recommended by Grinold & Kahn for model robustness. This involves:
    *   **Descriptor Calculation:** We first calculate the raw, underlying data ("descriptors") for each factor. This includes:
        *   **Value:** Book-to-Market (B/M) and Earnings-to-Price (E/P).
        *   **Momentum:** 12-month and 6-month historical returns (skipping the most recent month).
        *   **Size:** The natural logarithm of market capitalization.
        *   **Financial Constraints:** The Whited-Wu (WW) Index, which itself is a composite of several accounting ratios.
    *   **Point-in-Time Lagging:** To avoid lookahead bias, accounting-based descriptors (like Book Equity and Earnings) are appropriately lagged to simulate real-world reporting delays.

*   **3. Cross-Sectional Standardization:** This is the core of the process. **For each month in our sample**, we perform a **capitalization-weighted standardization** on each raw style factor descriptor individually. This converts each descriptor into a Z-score that is comparable across descriptors and measured relative to the market *at that specific point in time*.

*   **4. Final Factor Assembly:** The final factor exposures are assembled as follows:
    *   For composite factors (Value, Momentum), we take the **average of their respective standardized descriptors.**
    *   The final composite factors are then re-standardized to ensure they have a clean, cap-weighted mean of zero and standard deviation of one. This creates pure, "extra-market" style factors.
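
The standardization and assembly steps above can be sketched as follows. This is a minimal illustration on toy data, not the notebook's actual implementation. The helper `cap_weighted_zscore`, the toy descriptor columns `bm` and `ep`, and the convention of pairing a cap-weighted mean with an equal-weighted standard deviation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def cap_weighted_zscore(df, col, cap_col="mkt_cap", date_level="date"):
    """Z-score `col` within each date: subtract the cap-weighted mean,
    divide by the equal-weighted std (an assumed convention)."""
    x = df[col]
    # Ignore the market cap of stocks whose descriptor is missing
    cap = df[cap_col].where(x.notna())
    w = cap / cap.groupby(level=date_level).transform("sum")
    mu = (w * x).groupby(level=date_level).transform("sum")
    sigma = x.groupby(level=date_level).transform("std")
    return (x - mu) / sigma

# Toy panel: 2 month-ends x 4 stocks (hypothetical values)
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2020-01-31", "2020-02-29"]), [1, 2, 3, 4]],
    names=["date", "permno"],
)
panel = pd.DataFrame(
    {
        "bm": rng.normal(size=8),
        "ep": rng.normal(size=8),
        "mkt_cap": rng.uniform(1e2, 1e4, size=8),
    },
    index=idx,
)

# Step 3: standardize each descriptor separately
z_bm = cap_weighted_zscore(panel, "bm")
z_ep = cap_weighted_zscore(panel, "ep")

# Step 4: average the standardized descriptors, then re-standardize
panel["value_raw"] = pd.concat([z_bm, z_ep], axis=1).mean(axis=1)
panel["value"] = cap_weighted_zscore(panel, "value_raw")
```

Averaging with `mean(axis=1)` uses whichever descriptors are available for a stock; requiring all descriptors to be present would be an equally defensible choice.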

---

### **Key Concepts & Theoretical Justification**

#### **1. Composite Factors for Robustness**

A key principle from "Active Portfolio Management" is that relying on a single descriptor for a factor (e.g., only using Book-to-Market for "Value") makes a model fragile. Any single accounting ratio can be noisy, subject to measurement error, or misleading for certain industries (e.g., B/M for tech firms). By creating a **composite factor** from several related but distinct descriptors, we **diversify away the idiosyncratic noise** of each individual measure. The resulting factor is a more robust and stable representation of the underlying economic concept.
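
The diversification argument can be made concrete under a simplifying assumption that is not part of the source: each standardized descriptor $d_k$ equals the true factor signal $f$ plus noise $\varepsilon_k$ with equal variance $\sigma_\varepsilon^2$ and common pairwise correlation $\rho$.

```latex
d_k = f + \varepsilon_k, \qquad k = 1, \dots, K

\bar{d} = \frac{1}{K} \sum_{k=1}^{K} d_k = f + \frac{1}{K} \sum_{k=1}^{K} \varepsilon_k

\operatorname{Var}\left(\bar{d} - f\right) = \frac{\sigma_\varepsilon^2}{K} \left[ 1 + (K - 1)\rho \right]
```

With uncorrelated noise ($\rho = 0$), a two-descriptor composite such as Value (B/M and E/P) halves the noise variance. The benefit shrinks as the descriptors' errors become more correlated and vanishes at $\rho = 1$.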

#### **2. Time-Varying Exposures**

Companies evolve. A firm can grow from a "small-cap" to a "large-cap." It can transition from a "growth" stock to a "value" stock. By recalculating the standardized exposures for every period, our risk model can adapt to this evolution, providing a more accurate, forward-looking assessment of risk.

#### **3. Point-in-Time Data & Lookahead Bias**

Lookahead bias is one of the most serious errors in quantitative research. It occurs when a model is built using information that would not have been available at the time of the decision. By lagging accounting data to account for reporting delays, we ensure the factor exposures are "point-in-time" correct, so that subsequent backtests are valid and realistic.
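
One common way to implement this kind of lag is an as-of merge on an "availability date". The sketch below is illustrative only: the 4-month lag, the toy values, and the `book_equity` and `available_date` column names are assumptions, since the lag used in this project is not stated here.

```python
import pandas as pd

REPORTING_LAG_MONTHS = 4  # assumed lag, chosen for illustration only

# Toy quarterly accounting records for one firm (hypothetical values)
fundamentals = pd.DataFrame({
    "permno": [10001, 10001],
    "datadate": pd.to_datetime(["2019-09-30", "2019-12-31"]),
    "book_equity": [50.0, 55.0],
})
# A record only becomes usable once the reporting lag has passed
fundamentals["available_date"] = (
    fundamentals["datadate"] + pd.DateOffset(months=REPORTING_LAG_MONTHS)
)

# Month-end observation dates for the same firm
monthly = pd.DataFrame({
    "permno": [10001] * 4,
    "date": pd.to_datetime(
        ["2019-12-31", "2020-01-31", "2020-03-31", "2020-04-30"]
    ),
})

# For each month-end, take the latest record with available_date <= date
pit = pd.merge_asof(
    monthly.sort_values("date"),
    fundamentals.sort_values("available_date"),
    left_on="date",
    right_on="available_date",
    by="permno",
    direction="backward",
)
print(pit[["date", "datadate", "book_equity"]])
```

In this toy case the December 2019 month-end has no usable record yet, and the fiscal Q4 2019 figure first appears at the April 2020 month-end rather than in December 2019.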

---
**Output:** This notebook generates and saves the `factor_exposures_titan.parquet` file. This panel dataset, indexed by `(date, permno)`, is the primary $X$ input for the Fama-MacBeth risk model estimation in Notebook 3.



### 1. Imports and Load Data

In [3]:
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
from pathlib import Path 

print("Libraries imported successfully.")

# --- Load the master panel data from Notebook 1 ---
onedrive_root = str(Path(os.environ['OneDrive']))
DATA_DIR = os.path.join(onedrive_root, "0. DATASETS", "outputs")

PANEL_DATA_FILE = os.path.join(DATA_DIR, 'panel_data.parquet')

df = pd.read_parquet(PANEL_DATA_FILE)

# making sure permno and industry codes are stored as int
df['permno'] = df['permno'].astype('int')
# nullable int:
df['hsiccd'] = df['hsiccd'].astype('Int64')

# Setting a multi-index for efficiency
df.reset_index(inplace=True)  # move index back to columns
df.set_index(['permno', 'date'], inplace=True)
df.sort_index(inplace=True)

print("Monthly panel data loaded successfully.")
print(f"Data shape: {df.shape}")
print(f"Date range: {df.index.get_level_values('date').min()} to {df.index.get_level_values('date').max()}")


Libraries imported successfully.


ArrowKeyError: No type extension with name arrow.py_extension_type found

In [11]:
df.head()

| permno | date | hsiccd | prc | vol | ret_daily | shrout | gvkey | mkt_cap | datadate | fyearq | fqtr | ... | lctq | ltq | oiadpq | pstkq | saleq | oancfy | dvpspq | prccq | sic | ret_monthly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10001 | 1995-01-31 | 4925 | 7.75 | 0.0 | 0.026915 | 2224.0 | 12994 | 17236.0 | 1994-12-31 | 1995.0 | 2.0 | ... | 8.52 | 23.217 | 1.486 | 0.0 | 10.537 |  | 0.0 | 8.0 | 4924 | -0.03125 |
| 10001 | 1995-02-28 | 4925 | 7.546875 | 400.0 | -0.02621 | 2224.0 | 12994 | 16784.25 | 1994-12-31 | 1995.0 | 2.0 | ... | 8.52 | 23.217 | 1.486 | 0.0 | 10.537 |  | 0.0 | 8.0 | 4924 | -0.02621 |
| 10001 | 1995-03-31 | 4925 | 7.5 | 200.0 | -0.032258 | 2244.0 | 12994 | 16830.0 | 1995-03-31 | 1995.0 | 3.0 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 |  | 0.19 | 7.5 | 4924 | 0.00597 |
| 10001 | 1995-04-30 | 4925 | 7.5 | 600.0 | -0.006211 | 2244.0 | 12994 | 16830.0 | 1995-03-31 | 1995.0 | 3.0 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 |  | 0.19 | 7.5 | 4924 | 0.0 |
| 10001 | 1995-05-31 | 4925 | 7.875 | 0.0 | 0.0 | 2244.0 | 12994 | 17671.5 | 1995-03-31 | 1995.0 | 3.0 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 |  | 0.19 | 7.5 | 4924 | 0.05 |


In [17]:
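# A plain .astype('int') raises "TypeError: int() argument must be a string,
# a bytes-like object or a real number, not 'NoneType'" because 'sic' contains None.
# Coerce to numeric and use the nullable integer dtype instead (as for hsiccd above).
df['sic'] = pd.to_numeric(df['sic'], errors='coerce').astype('Int64')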

### 2. Industry Factor Exposures

In [13]:
# --- Create Industry Factor Exposures ---

# Helper function to map SIC codes to FF12-style industry groups.
def sic_to_ff12(sic):
    """
    Maps a SIC code to a coarse industry label, loosely based on the
    12 Fama-French industry definitions from Ken French's website.
    Returns NaN for missing codes.
    """
    if pd.isnull(sic):
        return np.nan

    sic = int(sic)

    # Refined categories are checked first; the broad ranges below would
    # otherwise match these codes before the specific lists were reached.
    if sic in [2830, 2831, 2833, 2834, 2835, 2836]: return 'Healthcare'
    if sic in [3570, 3571, 3572, 3575, 3577, 3578]: return 'Technology'
    if sic in [3660, 3661, 3663, 3665, 3669, 3670, 3671, 3672, 3674]: return 'Technology'
    if sic in [4810, 4812, 4813, 4822, 4832, 4833, 4841, 4881, 4891, 4892, 4899]: return 'Telecom'
    if sic in [4900, 4911, 4920, 4922, 4923, 4924, 4925, 4931, 4932, 4939, 4941]: return 'Utilities'
    if sic in [7370, 7371, 7372, 7373, 7374, 7375, 7376, 7377, 7378, 7379]: return 'Technology'
    if sic in [1310, 1311, 1321, 1381, 1382, 1389]: return 'Energy'
    if sic in [2911, 2912, 2992, 2999]: return 'Energy'

    # Broad ranges for everything not matched above
    if 100 <= sic <= 999: return 'Consumer' # Non-Durables
    if 1000 <= sic <= 1499: return 'Other' # Mining, Quarrying
    if 1500 <= sic <= 1799: return 'Other' # Construction
    if 2000 <= sic <= 2999: return 'Consumer' # Food, Tobacco, Textiles, Apparel, Paper
    if 3000 <= sic <= 3999: return 'Durables' # Cars, TVs, Furniture, Industrial Equip
    if 4000 <= sic <= 4999: return 'Telecom' # Telephone, TV, Radio, Utilities
    if 5000 <= sic <= 5199: return 'Shops' # Wholesale
    if 5200 <= sic <= 5999: return 'Shops' # Retail
    if 6000 <= sic <= 6999: return 'Finance' # Finance, Insurance, Real Estate
    if 7000 <= sic <= 8999: return 'Services' # Hotels, Business Svcs, Healthcare
    if 9000 <= sic <= 9999: return 'Other' # Public Admin

    return 'Other' # Default for anything missed

# Raw 'sic' values can be None or '' and int('') raises
# "ValueError: invalid literal for int() with base 10: ''", so coerce to numeric first.
sic_numeric = pd.to_numeric(df['sic'], errors='coerce')

# Apply the function to the 'sic' column. Note: CRSP hsiccd is better if available.
df['industry'] = sic_numeric.apply(sic_to_ff12)

# Create the dummy variables
industry_dummies = pd.get_dummies(df['industry'], prefix='Ind')

# We'll join this back to our main DataFrame later.
print("Industry factor exposures created.")
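
As a small illustration of the dummy-variable step and the later join, the toy example below builds one-hot industry columns and joins them back on a shared `(permno, date)` index. The data are hypothetical, the `Ind` prefix follows the cell above, and `dtype=int` is a choice made here so the columns hold 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with the same index layout as the panel (hypothetical rows)
idx = pd.MultiIndex.from_tuples(
    [(10001, "1995-01-31"), (10002, "1995-01-31"), (10003, "1995-01-31")],
    names=["permno", "date"],
)
toy = pd.DataFrame({"industry": ["Utilities", "Finance", None]}, index=idx)

# One column per observed industry label; missing labels get all zeros
dummies = pd.get_dummies(toy["industry"], prefix="Ind", dtype=int)

# Join on the shared index so each row keeps its own exposures
toy = toy.join(dummies)

# Each classified stock should load on exactly one industry
row_sums = dummies.sum(axis=1)
print(toy)
```

A stock with a missing industry label ends up with a row of zeros, so it is worth checking `row_sums` before using the dummies in a cross-sectional regression.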