# Notebook 2: Time-Varying Factor Exposures

### **Objective**
This notebook constructs the **time-varying Factor Exposure Matrix ($X_t$)**, a critical input for a dynamic multifactor risk model. This "Project Titan" version implements a professional-grade process for building robust, point-in-time factor exposures. For each month in our sample period, we will calculate a full cross-section of exposures based on data that would have been known at that time.

The final output is a large panel dataset where each row represents a specific stock at a specific point in time, and each column represents that stock's exposure to a fundamental factor.

---

### **Methodology: Point-in-Time & Composite Factor Construction**

The methodology focuses on creating a robust and realistic `X` matrix by incorporating two key professional techniques: point-in-time data handling and composite factor construction.

*   **1. Industry Factor Classification:** We classify each stock into one of 12 industry groups patterned on the Fama-French 12-industry classification, based on its SIC code. This creates 12 mutually exclusive "dummy variable" factors that will form the basis for capturing market-wide and sector-specific risk.

*   **2. Composite Style Factor Construction:** We build our style factors using a **multi-descriptor composite approach**, as recommended by Grinold & Kahn for model robustness. This involves:
    *   **Descriptor Calculation:** We first calculate the raw, underlying data ("descriptors") for each factor. This includes:
        *   **Value:** Book-to-Market (B/M) and Earnings-to-Price (E/P).
        *   **Momentum:** Cumulative returns over the 11 and 5 months ending one month before the exposure date (the "12-1" and "6-1" windows), so the most recent month is skipped.
        *   **Size:** The natural logarithm of market capitalization.
        *   **Financial Constraints:** The Whited-Wu (WW) Index, which itself is a composite of several accounting ratios.
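    *   **Point-in-Time Lagging:** To avoid lookahead bias, accounting-based descriptors (like Book Equity and Earnings) are aligned to the earnings announcement date (`rdq`) where it is available. When it is missing, we fall back to a fixed lag after the fiscal period end: six months for book equity and three months for earnings.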

*   **3. Cross-Sectional Standardization:** This is the core of the process. **For each month in our sample**, we perform a **capitalization-weighted standardization** on each of the raw style factor descriptors individually. This converts each descriptor into a comparable Z-score relative to the market *at that specific point in time*.

*   **4. Final Factor Assembly:** The final factor exposures are created:
    *   For composite factors (Value, Momentum), we take the **average of their respective standardized descriptors.**
    *   The final composite factors are then re-standardized to ensure they have a clean, cap-weighted mean of zero and standard deviation of one. This creates pure, "extra-market" style factors.
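
Steps 3 and 4 can be written compactly as follows. The notation is introduced here for illustration and is not taken from the source: $d^{(k)}_{i,t}$ is descriptor $k$ of stock $i$ in month $t$, and $w_{i,t}$ is its capitalization weight, normalized so that $\sum_i w_{i,t} = 1$ over the stocks with a valid descriptor.

$$
\mu^{(k)}_t = \sum_i w_{i,t}\, d^{(k)}_{i,t}, \qquad
\sigma^{(k)}_t = \sqrt{\sum_i w_{i,t}\,\bigl(d^{(k)}_{i,t} - \mu^{(k)}_t\bigr)^2}, \qquad
z^{(k)}_{i,t} = \frac{d^{(k)}_{i,t} - \mu^{(k)}_t}{\sigma^{(k)}_t}
$$

For a composite factor built from $K$ descriptors, the raw composite is $c_{i,t} = \frac{1}{K}\sum_{k=1}^{K} z^{(k)}_{i,t}$, and the final exposure applies the same standardization to $c_{i,t}$ in place of $d^{(k)}_{i,t}$.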

---

### **Key Concepts & Theoretical Justification**

#### **1. Composite Factors for Robustness**

A key principle from "Active Portfolio Management" is that relying on a single descriptor for a factor (e.g., only using Book-to-Market for "Value") makes a model fragile. Any single accounting ratio can be noisy, subject to measurement error, or misleading for certain industries (e.g., B/M for tech firms). By creating a **composite factor** from several related but distinct descriptors, we **diversify away the idiosyncratic noise** of each individual measure. The resulting factor is a more robust and stable representation of the underlying economic concept.
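
The noise-diversification argument can be illustrated with a small simulation. This is a sketch under strong assumptions that are ours, not the source's: a single latent "value" signal, and two descriptors that each equal the signal plus independent unit-variance noise. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical latent signal and two noisy descriptors of it.
signal = rng.standard_normal(n)
desc_a = signal + rng.standard_normal(n)   # stands in for a B/M-like measure
desc_b = signal + rng.standard_normal(n)   # stands in for an E/P-like measure

def zscore(x):
    """Equal-weighted z-score (a simplification of the cap-weighted version)."""
    return (x - x.mean()) / x.std()

# Composite: average of the standardized descriptors.
composite = (zscore(desc_a) + zscore(desc_b)) / 2

corr_single = np.corrcoef(desc_a, signal)[0, 1]
corr_composite = np.corrcoef(composite, signal)[0, 1]
print(f"single descriptor: {corr_single:.2f}, composite: {corr_composite:.2f}")
```

Under these assumptions the correlation with the latent signal is about 0.71 in expectation for a single descriptor and about 0.82 for the composite. Real descriptors have correlated errors, so the actual gain is likely smaller.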

#### **2. Time-Varying Exposures**

Companies evolve. A firm can grow from a "small-cap" to a "large-cap." It can transition from a "growth" stock to a "value" stock. By recalculating the standardized exposures for every period, our risk model can adapt to this evolution, providing a more accurate, forward-looking assessment of risk.
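
As a toy illustration of why exposures must be recomputed each period, the sketch below standardizes size within each month for three hypothetical stocks. It uses an equal-weighted z-score for brevity, whereas the notebook itself uses capitalization weights.

```python
import pandas as pd

# Toy panel: three hypothetical stocks over two months; stock "A" grows sharply.
panel = pd.DataFrame({
    "date":   ["2020-01-31"] * 3 + ["2020-02-29"] * 3,
    "permno": ["A", "B", "C"] * 2,
    "log_cap": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0],
})

# Standardize within each month, so the exposure is relative to that month's cross-section.
panel["size_z"] = panel.groupby("date")["log_cap"].transform(
    lambda x: (x - x.mean()) / x.std(ddof=0)
)

size_a = panel.loc[panel["permno"] == "A"].set_index("date")["size_z"]
print(size_a)
```

Stock "A" moves from the smallest to the largest name in the cross-section, so its size exposure flips from negative to positive. A static exposure matrix would miss this change.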

#### **3. Point-in-Time Data & Lookahead Bias**

Lookahead bias is one of the most critical errors in quantitative research. It occurs when a model is built using information that would not have been available at the time of the decision. By carefully lagging accounting data to account for reporting delays, we ensure the factor exposures are "point-in-time" correct and our subsequent backtests are valid and realistic.
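
A minimal sketch of point-in-time alignment with `pd.merge_asof`, using made-up data. The column names `announce` and `book_equity` and all values are hypothetical. Each monthly row picks up only the most recent figure announced on or before that month-end.

```python
import pandas as pd

# Hypothetical quarterly fundamentals: fiscal period end and announcement date.
fund = pd.DataFrame({
    "permno": [1, 1],
    "datadate": pd.to_datetime(["2020-03-31", "2020-06-30"]),
    "announce": pd.to_datetime(["2020-05-10", "2020-08-09"]),
    "book_equity": [100.0, 120.0],
})

# Monthly observation dates for the same stock.
months = pd.DataFrame({
    "permno": [1, 1, 1, 1],
    "date": pd.to_datetime(["2020-04-30", "2020-05-31", "2020-07-31", "2020-08-31"]),
})

# Backward as-of merge keyed on the announcement date, not the fiscal period end.
pit = pd.merge_asof(
    months.sort_values("date"),
    fund[["permno", "announce", "book_equity"]].sort_values("announce"),
    left_on="date",
    right_on="announce",
    by="permno",
)
print(pit[["date", "book_equity"]])
```

April has no value because nothing had been announced yet. July still shows the first-quarter figure, because the second-quarter figure was not announced until August. Merging on `datadate` instead would leak each figure into months before it was public.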

---
**Output:** This notebook generates and saves the `factor_exposures.parquet` file. This panel dataset, indexed by `(permno, date)`, is the primary $X$ input for the Fama-MacBeth risk model estimation in Notebook 3.



### 1. Imports and Load Data

In [10]:
import pandas as pd
import numpy as np
import os
import statsmodels.api as sm
from pathlib import Path 

print("Libraries imported successfully.")

# --- Load the master panel data from Notebook 1 ---
onedrive_root = str(Path(os.environ['OneDrive']))
DATA_DIR = os.path.join(onedrive_root, "0. DATASETS", "outputs")

PANEL_DATA_FILE = os.path.join(DATA_DIR, 'panel_data.parquet')

df = pd.read_parquet(PANEL_DATA_FILE)

# making sure permno and industry codes are stored as int
df['permno'] = df['permno'].astype('int')
# nullable int:
df['sic'] = df['sic'].astype('Int64')

# Setting a multi-index for efficiency
df.reset_index(inplace=True)  # move index back to columns
df.set_index(['permno', 'date'], inplace=True)
df.sort_index(inplace=True)

print("Monthly panel data loaded successfully.")
print(f"Data shape: {df.shape}")
print(f"Date range: {df.index.get_level_values('date').min()} to {df.index.get_level_values('date').max()}")


Libraries imported successfully.
Monthly panel data loaded successfully.
Data shape: (1660775, 33)
Date range: 1995-01-31 00:00:00 to 2023-12-31 00:00:00


### 2. Industry Factor Exposures

In [12]:
# --- Create Industry Factor Exposures ---

# Helper function to map from SIC codes to FF12 industries.
def sic_to_ff12(sic):
    """
    Converts a SIC code to one of 12 industry groups patterned on the
    Fama-French 12-industry classification. The ranges below are a simplified
    approximation of the definitions on Ken French's website.
    """
    if pd.isnull(sic):
        return np.nan
    
    sic = int(sic)
    
    # --- 1. Check for SPECIFIC, granular industries FIRST ---
    
    # Healthcare, Pharma, Biotech
    if 2830 <= sic <= 2836 or 3840 <= sic <= 3851 or 8000 <= sic <= 8099:
        return 'Healthcare'
    # Technology (Computers, Software, Electronics)
    if 3570 <= sic <= 3579 or 3660 <= sic <= 3679 or 7370 <= sic <= 7379:
        return 'Technology'
    # Energy
    if 1300 <= sic <= 1399 or 2900 <= sic <= 2999: return 'Energy'
    # Utilities
    if 4900 <= sic <= 4949: return 'Utilities'
    # Telecom
    if 4800 <= sic <= 4899: return 'Telecom'
    # Finance
    if 6000 <= sic <= 6999: return 'Finance'
        
    # --- 2. Then fall back to broad SIC ranges ---
    if 100 <= sic <= 999: return 'Consumer'     # Agriculture, forestry, fishing (grouped with non-durables)
    if 1000 <= sic <= 1499: return 'Other'      # Mining
    if 1500 <= sic <= 1999: return 'Other'      # Construction
    if 2000 <= sic <= 2799: return 'Consumer'   # More non-durables
    if 2800 <= sic <= 2829: return 'Chemicals'  # Chemicals is often its own FF group
    if 2840 <= sic <= 2899: return 'Consumer'   # More non-durables
    if 3000 <= sic <= 3999: return 'Durables'   # Durables (cars, furniture, industrial equip)
    if 4000 <= sic <= 4799: return 'Other'      # Transportation
    if 5000 <= sic <= 5999: return 'Shops'      # Wholesale, Retail
    if 7000 <= sic <= 7999: return 'Services'   # Business and Personal Services
    if 8100 <= sic <= 8999: return 'Services'   # (Excluding Healthcare which was caught above)
    if 9100 <= sic <= 9999: return 'Other'      # Public Admin, etc.

    return 'Other' # Final catch-all for any SIC codes not covered

# Apply the function to the 'sic' column. Note: CRSP hsiccd is better if available.
df['industry'] = df['sic'].apply(sic_to_ff12)

# Create the dummy variables
industry_dummies = pd.get_dummies(df['industry'], prefix='Ind')

# We'll join this back to our main DataFrame later.
print("Industry factor exposures created.")


Industry factor exposures created.
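
The structure of the industry dummies can be checked on a toy example. This is an illustrative sketch with made-up labels. Each stock belongs to exactly one group, so every row sums to one and distinct industry columns have zero inner product.

```python
import pandas as pd

# Hypothetical industry labels for four stocks.
industry = pd.Series(["Finance", "Energy", "Finance", "Shops"], name="industry")
dummies = pd.get_dummies(industry, prefix="Ind", dtype=float)

# Each classified stock has exactly one non-zero dummy.
row_sums = dummies.sum(axis=1)

# Gram matrix: counts on the diagonal, zeros elsewhere.
gram = dummies.T @ dummies
print(dummies.columns.tolist())
print(gram)
```

Because the dummies sum to one in every row, the full set is perfectly collinear with a regression intercept. How the downstream regression handles this is outside the scope of this sketch.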


In [13]:
df['industry'].value_counts()

industry
Finance       305951
Technology    232476
Durables      211521
Other         187659
Healthcare    162360
Shops         153877
Services      145449
Consumer      109546
Energy         55672
Telecom        45440
Utilities      38598
Chemicals      11181
Name: count, dtype: int64

In [14]:
df.head(5)

| permno | date | share_code | exchange_code | sic | prc | ret_daily | shrout | vwretd | sprtrn | gvkey | mkt_cap | ... | lctq | ltq | oiadpq | pstkq | saleq | oancfy | dvpspq | prccq | ret_monthly | industry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10001 | 1995-01-31 | 11 | 3 | 4925 | 7.75 | 0.026915 | 2224.0 | 0.003962 | 0.004077 | 12994 | 17236.0 | ... | 8.52 | 23.217 | 1.486 | 0.0 | 10.537 | NaN | 0.0 | 8.0 | -0.03124999 | Utilities |
| 10001 | 1995-02-28 | 11 | 3 | 4925 | 7.546875 | -0.02621 | 2224.0 | 0.008116 | 0.0074 | 12994 | 16784.25 | ... | 8.52 | 23.217 | 1.486 | 0.0 | 10.537 | NaN | 0.0 | 8.0 | -0.02620967 | Utilities |
| 10001 | 1995-03-31 | 11 | 3 | 4925 | 7.5 | -0.032258 | 2244.0 | -0.002444 | -0.003007 | 12994 | 16830.0 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 | NaN | 0.19 | 7.5 | 0.00597075 | Utilities |
| 10001 | 1995-04-30 | 11 | 3 | 4925 | 7.5 | -0.006211 | 2244.0 | 0.0018 | 0.002259 | 12994 | 16830.0 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 | NaN | 0.19 | 7.5 | 8.13899e-09 | Utilities |
| 10001 | 1995-05-31 | 11 | 3 | 4925 | 7.875 | 0.0 | 2244.0 | 0.014017 | 0.018755 | 12994 | 17671.5 | ... | 6.108 | 20.823 | 1.829 | 0.0 | 11.266 | NaN | 0.19 | 7.5 | 0.05 | Utilities |


### 3. Create Style Factor Descriptors 

#### 3.1 Size and Value Descriptors

In [16]:
# --- Creating Style Factor Descriptors ---

print("Calculating raw style factor descriptors...")

# --- Step 1: Ensure DataFrame is "flat" for calculations ---
if isinstance(df.index, pd.MultiIndex):
    df.reset_index(inplace=True)

# --- Size Descriptor ---
df['size_desc'] = np.log(df['mkt_cap'])


# --- Value Descriptors (Composite) ---
print("  Calculating Value descriptors...")
# Descriptor 1: Book-to-Market (B/M)
# We need a lagged version of book equity (ceqq) 
# because an attribute must be known when running the FM regression
# First, let's go back to quarterly data for book equity
fundamentals_for_bm = df[['permno', 'datadate', 'rdq', 'ceqq']].copy().drop_duplicates()
fundamentals_for_bm.dropna(subset=['datadate'], inplace=True)

# Use 'rdq' as the true announcement date, with a fallback on a simple offset
fundamentals_for_bm['announcement_date'] = fundamentals_for_bm['rdq'].fillna(
    fundamentals_for_bm['datadate'] + pd.DateOffset(months=6) # Use a 6-month lag for B/M
)

# Rename the 'ceqq' column to give it a descriptive, unique name BEFORE the merge.
fundamentals_for_bm.rename(columns={'ceqq': 'book_equity_lagged'}, inplace=True)

# Merge this lagged book equity back into the main panel
df = pd.merge_asof(
    left=df.sort_values('date'),
    right=fundamentals_for_bm[['permno', 'announcement_date', 'book_equity_lagged']].sort_values('announcement_date'),
    left_on='date',
    right_on='announcement_date',
    by='permno'
)
# Defining the Book-to-Market descriptor as the ratio of lagged book equity to market cap.
df['bm_desc'] = df['book_equity_lagged'] / df['mkt_cap']

# Descriptor 2: Earnings-to-Price (E/P)
quarterly_fundamentals = df[['permno', 'datadate', 'rdq', 'ibq']].copy().drop_duplicates()
quarterly_fundamentals.dropna(subset=['datadate'], inplace=True)
quarterly_fundamentals.sort_values(by=['permno', 'datadate'], inplace=True)
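# Defining last-12-months (LTM) earnings as the rolling sum of the latest four quarters.
# Assigning via .values relies on the frame already being sorted by ['permno', 'datadate'],
# which matches the group-wise row order of the groupby-rolling result.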
quarterly_fundamentals['ltm_earnings'] = quarterly_fundamentals.groupby('permno')['ibq'].rolling(window=4, min_periods=4).sum().values
# Use the actual report date 'rdq' as the announcement date.
# We must handle cases where 'rdq' might be missing. If it is, offset the datadate forward by 3 months.
quarterly_fundamentals['announcement_date'] = quarterly_fundamentals['rdq'].fillna(
    quarterly_fundamentals['datadate'] + pd.DateOffset(months=3)
    )
#drop any rows with missing permno, announcement date or ltm earnings
quarterly_fundamentals.dropna(subset=['permno', 'announcement_date', 'ltm_earnings'], inplace=True)

df = pd.merge_asof(
    left=df.sort_values('date'),
    right=quarterly_fundamentals[['permno', 'announcement_date', 'ltm_earnings']].sort_values('announcement_date'),
    left_on='date',
    right_on='announcement_date',
    by='permno'
)
df['ep_desc'] = df['ltm_earnings'] / df['mkt_cap']


Calculating raw style factor descriptors...
  Calculating Value descriptors...
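
The trailing-four-quarter earnings sum used for E/P can be sketched on toy data as follows. The values are made up. This version realigns the result by index with `reset_index(level=0, drop=True)` instead of assigning `.values`, a variant that does not depend on the row order of the frame.

```python
import pandas as pd

# Hypothetical quarterly earnings for two stocks.
q = pd.DataFrame({
    "permno": [1] * 5 + [2] * 4,
    "datadate": pd.to_datetime(
        ["2019-03-31", "2019-06-30", "2019-09-30", "2019-12-31", "2020-03-31"]
        + ["2019-03-31", "2019-06-30", "2019-09-30", "2019-12-31"]
    ),
    "ibq": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 10.0, 10.0, 10.0],
}).sort_values(["permno", "datadate"])

# Rolling four-quarter sum within each stock; the first three quarters stay NaN.
q["ltm_earnings"] = (
    q.groupby("permno")["ibq"]
     .rolling(window=4, min_periods=4)
     .sum()
     .reset_index(level=0, drop=True)   # drop the group level to realign on the row index
)
print(q)
```

With `min_periods=4`, a stock needs a full year of quarterly history before it receives an E/P descriptor, and the window never spans two different stocks.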


#### 3.2 Momentum and Financial Constraints

In [18]:
# --- Momentum Descriptors (Composite) ---
# We use two descriptors for momentum: the return from month t-12 to t-1 and the return from month t-6 to t-1
print("  Calculating Momentum descriptors...")
df['mom12_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(11).apply(lambda r: (1+r).prod()-1))
df['mom6_1_desc'] = df.sort_values('date').groupby('permno')['ret_monthly'].transform(lambda x: x.shift(1).rolling(5).apply(lambda r: (1+r).prod()-1))


# --- Financial Constraints (Whited-Wu) ---
# As a placeholder, we calculate the WW index for financial constraints.
# This index can be complemented or replaced by a measure from textual analysis.
print("  Calculating Financial Constraint (WW) descriptor...")
# cash flow over asset
df['cf_at'] = (df['ibq'] + df['dpq']) / df['atq']
# Dividend payment indicator
df['div_pos'] = ((df['dvpspq'] * df['cshoq']) > 0).astype(int)
# Leverage
df['tLtd_at'] = df['dlttq'] / df['atq']
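# Sales growth: period-over-period change in saleq within each stock.
# Note: saleq is quarterly data repeated across months in this panel, so the
# change is zero in months where no new quarterly value arrives.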
df['sg'] = df.sort_values('date').groupby('permno')['saleq'].pct_change(fill_method=None)
# Lagged industry sales growth
df['isg_industry'] = df.groupby(['industry', 'date'])['sg'].transform('mean')
df['isg'] = df.groupby(['permno'])['isg_industry'].shift(1)
# Calculating WW
df['ww_desc'] = -0.091*df['cf_at'] - 0.062*df['div_pos'] + 0.021*df['tLtd_at'] - 0.044*np.log(df['atq'].replace(0, np.nan)) + 0.102*df['isg'] - 0.035*df['sg']

df.drop(['cf_at', 'div_pos', 'tLtd_at', 'sg', 'isg_industry', 'isg'], axis = 1 , inplace = True)

# --- Final Cleanup ---
descriptor_cols = ['size_desc', 'bm_desc', 'ep_desc', 'mom12_1_desc', 'mom6_1_desc', 'ww_desc']
for col in descriptor_cols:
    if col in df.columns:
        df[col] = df[col].replace([np.inf, -np.inf], np.nan)

# --- Step 2: Set index back for the next steps ---
df.set_index(['permno', 'date'], inplace=True)
df.sort_index(inplace=True)

print("\nRaw style factor descriptors calculated successfully.")

  Calculating Momentum descriptors...
  Calculating Financial Constraint (WW) descriptor...

Raw style factor descriptors calculated successfully.
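
The momentum windows compound returns with `rolling(...).apply(...)`, which can be slow on a large panel. The sketch below uses simulated returns to show a possible faster alternative: summing log returns and converting back gives the same number, assuming every return is greater than -1. It is offered as an option, not as what the notebook does.

```python
import numpy as np
import pandas as pd

# Simulated monthly returns for one hypothetical stock.
rng = np.random.default_rng(1)
ret = pd.Series(rng.normal(0.01, 0.05, size=24))

# Same construction as the notebook: drop the current month, compound 11 returns.
mom_apply = ret.shift(1).rolling(11).apply(lambda r: (1 + r).prod() - 1)

# Log-return version: sum log(1 + r) over the window, then convert back.
mom_log = np.expm1(np.log1p(ret).shift(1).rolling(11).sum())

# The two should agree up to floating-point error, with NaN in the same places.
print(np.allclose(mom_apply, mom_log, equal_nan=True))
```

The first 11 observations are NaN in both versions, since a full window of lagged returns is required.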


### 4. Standardization and Composite Factor Assembly

#### 4.1 Creating the standardization function

In [19]:
def standardize_cap_weighted(series, weights):
    """
    Performs capitalization-weighted standardization on a single Series.
    Handles NaN values
    """
    # --- Guard Clause: Check for all-NaN input ---
    # If the series has no valid data points, we can't standardize. Return NaNs.
    if series.isnull().all():
        return pd.Series(np.nan, index=series.index)

    # Ensure indices match and align the data
    series, weights = series.align(weights, join='left')
    
    # Identify the valid (non-NaN) data points
    is_valid = series.notna()
    
    # Calculate the sum of weights for valid data points
    valid_weights_sum = weights[is_valid].sum()
    
    # --- Guard Clause: Check for zero valid weights ---
    if valid_weights_sum == 0:
        return pd.Series(np.nan, index=series.index)
        
    # Calculate weighted mean using only valid data
    mean = np.sum(series[is_valid] * weights[is_valid]) / valid_weights_sum

    # Calculate weighted standard deviation
    de_meaned = series - mean
    weighted_var = np.sum((de_meaned[is_valid]**2) * weights[is_valid]) / valid_weights_sum
    std_dev = np.sqrt(weighted_var)

    # --- Guard Clause: Check for zero standard deviation ---
    # If all valid values are the same, std_dev will be 0. Return 0 for the valid
    # entries and keep NaN where the input was missing.
    if std_dev == 0:
        return pd.Series(np.where(is_valid, 0.0, np.nan), index=series.index)

    # Calculate and return the Z-scores
    return de_meaned / std_dev

#### 4.2 Applying the standardization function

In [20]:

# Calculate cap weights for each month
df['cap_weight'] = df.groupby('date')['mkt_cap'].transform(lambda x: x / x.sum())

# List of our raw descriptor columns
descriptor_cols = ['size_desc', 'bm_desc', 'ep_desc', 'mom12_1_desc', 'mom6_1_desc', 'ww_desc']

# --- Standardize ALL descriptors month-by-month using transform ---
print("Standardizing all raw descriptors...")
for col in descriptor_cols:
    new_col_name = f"z_{col}"
    # The transform will apply our function to each 'date' group
    df[new_col_name] = df.groupby('date')[col].transform(
        lambda x: standardize_cap_weighted(x, df.loc[x.index, 'cap_weight'])
    )

print("Standardization complete.")

Standardizing all raw descriptors...
Standardization complete.
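
A quick sanity check for this step is to confirm that a standardized descriptor has a cap-weighted mean of zero and a cap-weighted standard deviation of one. The sketch below does this on made-up data. The function `cap_weighted_z` is a compact, hypothetical stand-in written so the example is self-contained. It is not the notebook's `standardize_cap_weighted`.

```python
import numpy as np
import pandas as pd

def cap_weighted_z(x, w):
    """Compact, hypothetical stand-in for cap-weighted standardization."""
    ok = x.notna() & w.notna()
    w_ok = w[ok] / w[ok].sum()            # renormalize weights over valid rows
    mean = (x[ok] * w_ok).sum()
    std = np.sqrt((((x[ok] - mean) ** 2) * w_ok).sum())
    return (x - mean) / std               # NaN inputs stay NaN

# Made-up descriptor values and market caps for one month.
x = pd.Series([0.2, 1.5, np.nan, -0.7, 3.0])
w = pd.Series([50.0, 10.0, 5.0, 30.0, 5.0])
z = cap_weighted_z(x, w)

# Recompute the weighted moments of the z-scores.
ok = z.notna()
w_ok = w[ok] / w[ok].sum()
w_mean = (z[ok] * w_ok).sum()
w_std = np.sqrt((z[ok] ** 2 * w_ok).sum())
print(w_mean, w_std)
```

The same check could be run per date on the `z_` columns of the real panel. Note that the equal-weighted mean of a cap-weighted z-score is generally not zero.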


#### 4.3 Build and Re-Standardize Composite Factors

In [21]:

# Create the composite factors by averaging the standardized descriptors
df['Value_composite'] = df[['z_bm_desc', 'z_ep_desc']].mean(axis=1)
df['Momentum_composite'] = df[['z_mom12_1_desc', 'z_mom6_1_desc']].mean(axis=1)

print("\nComposite factors created. Now re-standardizing...")

# Re-standardize the final composites
# This ensures each composite has a cap-weighted mean of 0 and standard deviation of 1 in every month
df['Value'] = df.groupby('date')['Value_composite'].transform(
    lambda x: standardize_cap_weighted(x, df.loc[x.index, 'cap_weight'])
)
df['Momentum'] = df.groupby('date')['Momentum_composite'].transform(
    lambda x: standardize_cap_weighted(x, df.loc[x.index, 'cap_weight'])
)

# Rename the single-descriptor factors for consistency
df.rename(columns={
    'z_size_desc': 'Size',
    'z_ww_desc': 'FinConstraint'
}, inplace=True)

print("Final factors assembled and cleaned.")


Composite factors created. Now re-standardizing...
Final factors assembled and cleaned.
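
One detail of the averaging step is how missing descriptors are treated. `DataFrame.mean(axis=1)` skips NaN by default, so a stock with only one valid descriptor still receives a composite equal to that descriptor. The toy example below, with made-up values, contrasts this with a stricter rule that requires both descriptors.

```python
import numpy as np
import pandas as pd

# Made-up standardized descriptors for four stocks.
z = pd.DataFrame({
    "z_bm_desc": [1.0, np.nan, -0.5, np.nan],
    "z_ep_desc": [0.0, 2.0, -1.5, np.nan],
})

# Default behaviour (skipna=True): use whatever descriptors are available.
composite = z.mean(axis=1)

# Stricter alternative: NaN unless both descriptors are present.
strict = z.mean(axis=1, skipna=False)

print(pd.DataFrame({"composite": composite, "strict": strict}))
```

The default keeps more stocks in the sample, at the cost of mixing one-descriptor and two-descriptor composites. Which rule is preferable depends on how much coverage matters relative to comparability.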


In [None]:
#TEMP_FILE = os.path.join(onedrive_root, "0. DATASETS", "temps", "df_temp.parquet")
#df.to_parquet(TEMP_FILE)

### 5. Assemble and Save the Final X Matrix 


In [22]:
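# Rebuild the industry dummies so they share df's current (permno, date) index;
# the earlier copy was created before the merges re-ordered the rows.
industry_dummies = pd.get_dummies(df['industry'], prefix='Ind')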

In [23]:

# Our final style factors are the re-standardized composites and the single descriptors
style_factors = ['Size', 'Value', 'Momentum', 'FinConstraint']

# Combine our final style factors with the industry dummies
# Both frames share the same (permno, date) index, so the join aligns them row by row
df_for_x = df[style_factors].copy()
X = df_for_x.join(industry_dummies)

# Drop any rows with missing factor exposures, as we can't use them in the regression
X.dropna(inplace=True)

# Define the output file path
X_FILE = os.path.join(DATA_DIR, 'factor_exposures.parquet')

# Save the final, time-varying X matrix to a Parquet file
X.to_parquet(X_FILE)

print(f"\nFinal time-varying factor exposure matrix (X) saved to {X_FILE}")
print(f"Shape of final X matrix: {X.shape}")
print("Notebook 2 (Project Titan) is complete.")


Final time-varying factor exposure matrix (X) saved to D:\OneDrive\0. DATASETS\outputs\factor_exposures.parquet
Shape of final X matrix: (1030177, 16)
Notebook 2 (Project Titan) is complete.
