# Notebook 2: Factor Exposure Creation

### **Objective**
The goal of this notebook is to construct the **Factor Exposure Matrix ($X$)**, which is a cornerstone of the multifactor risk model. This matrix quantifies the "DNA" of each stock in our universe, describing its sensitivity to a set of predefined, economically intuitive factors. In the Grinold-Kahn framework, these exposures are the "attributes" that explain the systematic co-movement of stocks.

---

### **Methodology & Pipeline**

We will construct three classic style factors (or "risk indices") based on the cross-sectional characteristics of the firms at a specific point in time.

*   **1. Data Acquisition:** I gather the raw data ("descriptors") needed for each factor. For this project, that is a combination of market data (historical prices, shares outstanding) and fundamental data (book value), pulled from Yahoo Finance through the `yfinance` library.

*   **2. Descriptor Calculation:** I calculate the raw descriptor for each factor:
    *   **Size:** The natural logarithm of a firm's market capitalization.
    *   **Value:** The Book-to-Market ratio (Book Value / Market Cap).
    *   **Momentum:** The stock's total return over the past 12 months, skipping the most recent month.

*   **3. Capitalization-Weighted Standardization:** This is the most critical step. Raw descriptors are not comparable (e.g., market cap is in dollars, B/M is a ratio). I convert each descriptor into a standardized **Z-score**. Crucially, this standardization is performed on a **capitalization-weighted** basis. This ensures that the final factor exposures for the market-cap-weighted benchmark portfolio are, by construction, equal to zero. This process creates pure, "extra-market" style factors.

*   **4. Final Matrix Assembly:** The resulting standardized scores for each factor are assembled into the final $N \times K$ Factor Exposure Matrix, $X$, where $N$ is the number of stocks and $K$ is the number of factors.
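The descriptor step (step 2) can be sketched on made-up numbers. The tickers `AAA` and `BBB`, their prices, share counts, and book values below are all hypothetical and exist only to show the arithmetic; the real notebook uses data from Notebook 1 and `yfinance`.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end prices for two made-up tickers over 13 months.
prices = pd.DataFrame({
    "AAA": np.linspace(100.0, 160.0, 13),
    "BBB": np.linspace(50.0, 44.0, 13),
})
shares = pd.Series({"AAA": 1_000.0, "BBB": 4_000.0})           # shares outstanding
book_equity = pd.Series({"AAA": 40_000.0, "BBB": 120_000.0})   # total book value

# Market cap uses the latest price.
market_cap = prices.iloc[-1] * shares

descriptors = pd.DataFrame({
    # Size: natural log of market capitalization.
    "log_market_cap": np.log(market_cap),
    # Value: book value divided by market cap.
    "book_to_market": book_equity / market_cap,
    # Momentum: price one month ago relative to price twelve months ago,
    # so the most recent month is skipped.
    "momentum_12m_1m": prices.iloc[-2] / prices.iloc[-13] - 1.0,
})
print(descriptors)
```

The momentum line is the same quantity as the last row of `prices.pct_change(periods=11).shift(1)`, written out explicitly for a single date.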

---

### **Key Concepts & Theoretical Justification**

#### **1. Factor Exposures (Loadings)**

A stock's return can be decomposed into a systematic part and a specific part. The systematic portion is driven by common factors that affect many stocks simultaneously. The factor exposure, $X_{nk}$, is the sensitivity of stock $n$ to factor $k$.
$$ r_n = \sum_{k=1}^{K} (X_{nk} \cdot b_k) + u_n $$
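Here $b_k$ is the return to factor $k$ over the period and $u_n$ is the specific (idiosyncratic) return of stock $n$. The $X$ matrix contains all these $X_{nk}$ values. It is a known input at the start of an investment period and forms the basis for forecasting risk.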
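As a minimal numeric illustration of the decomposition, the sketch below applies it to three stocks and two factors. Every number in it (exposures, factor returns, specific returns) is assumed for the example and is not estimated from data.

```python
import numpy as np

# Hypothetical exposures: 3 stocks (rows) by 2 factors (columns).
X = np.array([
    [ 1.0, -0.5],
    [ 0.0,  1.0],
    [-1.0,  0.5],
])
b = np.array([0.02, -0.01])          # assumed factor returns for one period
u = np.array([0.003, -0.001, 0.0])   # assumed specific returns

# r_n = sum_k X_nk * b_k + u_n, computed for all stocks at once.
systematic = X @ b
r = systematic + u
print(r)
```

The first stock, for example, earns `1.0 * 0.02 + (-0.5) * (-0.01) = 0.025` from the factors plus `0.003` of specific return.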

#### **2. Standardization**

Standardization is the process of converting a variable to have a mean of zero and a standard deviation of one. This is essential for making different factors (like Size and Value) comparable and for interpreting the exposures in a consistent way. Because the mean and standard deviation used here are capitalization-weighted, an exposure of `+1.5` on the Value factor means the stock's book-to-market ratio is 1.5 cap-weighted standard deviations above the cap-weighted average, i.e. the stock is "cheaper" than the benchmark as a whole.

The formula for cap-weighted standardization of a descriptor $d$ is:
$$ X_n = \frac{d_n - \bar{d}_{cw}}{\sigma_{cw}(d)} $$
Where the mean $\bar{d}_{cw}$ and standard deviation $\sigma_{cw}(d)$ are calculated using market capitalization as weights.
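Spelled out, one reading consistent with the `standardize_cap_weighted` function used later in this notebook is the following, where $w_n$ is stock $n$'s share of total market capitalization (the symbol $\text{cap}_n$ is introduced here for illustration):
$$ w_n = \frac{\text{cap}_n}{\sum_{m=1}^{N} \text{cap}_m}, \qquad \bar{d}_{cw} = \sum_{n=1}^{N} w_n d_n, \qquad \sigma_{cw}(d) = \sqrt{\sum_{n=1}^{N} w_n \left(d_n - \bar{d}_{cw}\right)^2} $$
As a small worked case with assumed numbers, take two stocks with weights $0.75$ and $0.25$ and descriptors $1$ and $5$. Then $\bar{d}_{cw} = 0.75 \cdot 1 + 0.25 \cdot 5 = 2$, $\sigma_{cw}(d) = \sqrt{0.75 \cdot 1 + 0.25 \cdot 9} = \sqrt{3} \approx 1.732$, and the exposures are $-1/\sqrt{3} \approx -0.577$ and $3/\sqrt{3} \approx 1.732$.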

#### **3. Benchmark Neutrality of Style Factors**

A critical feature of this framework is that style factors are constructed to be **benchmark-neutral**. By using capitalization-weighted standardization, I mathematically enforce that the capitalization-weighted average exposure of our universe to each style factor is zero, where $w_n$ is stock $n$'s capitalization weight in the universe:
$$ \sum_{n=1}^{N} (w_n \cdot X_{nk}) = 0 $$
This ensures a clean separation of risk. The market's overall movement will be captured by other factors (like industries), while these style factors will capture purely cross-sectional, "extra-market" sources of risk and return.
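
The zero follows in one line from the standardization formula, assuming the weights $w_n$ sum to one and the cap-weighted mean is $\bar{d}_{cw} = \sum_n w_n d_n$:
$$ \sum_{n=1}^{N} w_n X_{nk} = \frac{\sum_{n=1}^{N} w_n d_n - \bar{d}_{cw} \sum_{n=1}^{N} w_n}{\sigma_{cw}(d)} = \frac{\bar{d}_{cw} - \bar{d}_{cw}}{\sigma_{cw}(d)} = 0 $$
The same argument does not go through with an equal-weighted mean, which is why the choice of weights in the standardization matters.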

---
**Output:** This notebook generates and saves the `factor_exposures.csv` file. This matrix, $X$, is a primary input for both the risk model estimation (Notebook 3) and our final portfolio construction (Notebook 5).



In [1]:
import pandas as pd
import numpy as np
import yfinance as yf
import os

print("Libraries imported successfully.")

# --- Load the processed data from Notebook 1 ---
DATA_DIR = 'data'
PRICES_FILE = os.path.join(DATA_DIR, 'monthly_prices.csv')
RETURNS_FILE = os.path.join(DATA_DIR, 'monthly_excess_returns.csv')

# Load the data, ensuring the 'Date' column is parsed correctly as dates
monthly_prices = pd.read_csv(PRICES_FILE, index_col='Date', parse_dates=True)
monthly_excess_returns = pd.read_csv(RETURNS_FILE, index_col='Date', parse_dates=True)

# Get our list of tickers from the data
tickers = monthly_prices.columns.tolist()

print("Data from Notebook 1 loaded successfully.")


Libraries imported successfully.
Data from Notebook 1 loaded successfully.


In [None]:
# --- Download Necessary Data for Factor Calculation ---

# To calculate our factors, I need Market Cap (for Size) and Book Value (for Value).
# For this illustrative project, I can get both from the 'info' attribute
# that yfinance exposes for each ticker.

# Create an empty dictionary to store the data
ticker_info = {}

print("Fetching financial data for each ticker from yfinance...")
for ticker in tickers:
    # yf.Ticker() creates a Ticker object that I can get info from
    stock_info = yf.Ticker(ticker).info
    ticker_info[ticker] = stock_info
    print(f"  ...fetched data for {ticker}")

print("Financial data fetched successfully.")


Fetching financial data for each ticker from yfinance...
  ...fetched data for AAPL
  ...fetched data for AMZN
  ...fetched data for GOOGL
  ...fetched data for JNJ
  ...fetched data for JPM
  ...fetched data for MSFT
  ...fetched data for PG
  ...fetched data for TSLA
  ...fetched data for UNH
  ...fetched data for XOM
Financial data fetched successfully.


In [None]:
# --- Create a DataFrame of Raw Factor Descriptors ---

# I'll extract the specific pieces of information I need.
# Note: yfinance keys can sometimes change. These are the common ones as of late 2023.
descriptors = pd.DataFrame(index=tickers)

def _info_field(key):
    """Pull one yfinance 'info' field for every ticker; a missing or None value becomes NaN."""
    return pd.Series([ticker_info[t].get(key) for t in tickers], index=tickers, dtype='float64')

descriptors['market_cap'] = _info_field('marketCap')
# 'bookValue' is treated as a per-share figure, so scale it by shares outstanding.
descriptors['book_value'] = _info_field('bookValue') * _info_field('sharesOutstanding')
descriptors['book_to_market'] = descriptors['book_value'] / descriptors['market_cap']

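# For Momentum, I use the "12-1" convention: the return from 12 months ago to 1 month ago.
# That is an 11-month price change, lagged by one month to skip the most recent month.
# Note: this is a total return only if the prices from Notebook 1 are dividend-adjusted.
momentum_period = monthly_prices.pct_change(periods=11).shift(1)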

# I'll just grab the most recent momentum value for this static example.
# A full-blown model would calculate this for every month in our history.
descriptors['momentum_12m_1m'] = momentum_period.iloc[-1]

print("Raw Descriptor DataFrame:")
descriptors

Raw Descriptor DataFrame:


| Ticker | market_cap | book_value | book_to_market | momentum_12m_1m |
|---|---|---|---|---|
| AAPL | 4056108761088 | 73748780000.0 | 0.018182 | 0.470113 |
| AMZN | 2667836342272 | 369742500000.0 | 0.138593 | 0.739167 |
| GOOGL | 3499663032320 | 186368000000.0 | 0.053253 | 0.502097 |
| JNJ | 463162900480 | 79379050000.0 | 0.171385 | -0.097916 |
| JPM | 862276550656 | 340165700000.0 | 0.394497 | 0.198602 |
| MSFT | 3774681710592 | 362997300000.0 | 0.096166 | 0.593986 |
| PG | 345357516800 | 52499390000.0 | 0.152015 | 0.038647 |
| TSLA | 1449957392384 | 80012560000.0 | 0.055183 | 0.949018 |
| UNH | 294171082752 | 95769790000.0 | 0.325558 | 0.055088 |
| XOM | 507262926848 | 260551600000.0 | 0.513642 | -0.036738 |


In [None]:
# --- Define the Standardization Function ---

def standardize_cap_weighted(series, weights):
    """
    Performs capitalization-weighted standardization (creates Z-scores).
    
    Args:
        series (pd.Series): A series of raw factor values (e.g., book-to-market ratios).
        weights (pd.Series): A series of market capitalization weights for the same stocks.
        
    Returns:
        pd.Series: The cap-weighted standardized Z-scores.
    """
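    # Ensure indices match
    series = series.reindex(weights.index)

    # Give zero weight to stocks with a missing descriptor, then renormalize
    # so the weights sum to one (this also allows raw market caps as input)
    weights = weights.where(series.notna(), 0.0)
    weights = weights / weights.sum()

    # Calculate the cap-weighted mean
    mean = (series * weights).sum()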
    
    # De-mean the series
    de_meaned_series = series - mean
    
    # Calculate the cap-weighted standard deviation
    squared_devs = (de_meaned_series**2) * weights
    std_dev = np.sqrt(squared_devs.sum())
    
    # Create the Z-scores
    z_scores = de_meaned_series / std_dev
    
    return z_scores

print("Standardization function defined.")


Standardization function defined.


In [None]:
# --- Build the Final Factor Exposure Matrix (X) ---

# Calculate the market cap weights
total_market_cap = descriptors['market_cap'].sum()
cap_weights = descriptors['market_cap'] / total_market_cap

# Create our final exposure matrix
X = pd.DataFrame(index=tickers)

# Standardize each of our factors using the function
# For Size, I standardize the log of market cap
X['Size'] = standardize_cap_weighted(np.log(descriptors['market_cap']), cap_weights)
X['Value'] = standardize_cap_weighted(descriptors['book_to_market'], cap_weights)
X['Momentum'] = standardize_cap_weighted(descriptors['momentum_12m_1m'], cap_weights)

print("Final Factor Exposure Matrix (X) for the most recent date:")
X

Final Factor Exposure Matrix (X) for the most recent date:


| Ticker | Size | Value | Momentum |
|---|---|---|---|
| AAPL | 0.673549 | -0.791782 | -0.229023 |
| AMZN | 0.072566 | 0.292714 | 0.913754 |
| GOOGL | 0.461882 | -0.475909 | -0.093175 |
| JNJ | -2.439123 | 0.588061 | -2.641663 |
| JPM | -1.5476 | 2.597559 | -1.382237 |
| MSFT | 0.570399 | -0.089405 | 0.297117 |
| PG | -2.86014 | 0.413602 | -2.061628 |
| TSLA | -0.802083 | -0.45853 | 1.805074 |
| UNH | -3.090257 | 1.976648 | -1.991795 |
| XOM | -2.308657 | 3.670656 | -2.381819 |


In [None]:
# --- Sanity Check: Verify Benchmark Neutrality ---
# The cap-weighted average exposure of the market to our style factors should be zero.

# Multiply each row of X by that stock's weight (aligned on ticker), then sum per factor
market_exposures = X.mul(cap_weights, axis=0).sum()

print("Market's Exposure to each factor:")
print(market_exposures)


Market's Exposure to each factor:
Size        1.096345e-15
Value       1.387779e-17
Momentum    6.938894e-17
dtype: float64
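
The values above are zero up to floating-point error. A programmatic version of this check could look like the sketch below, which also confirms that the cap-weighted variance of each factor is one. The helper name `check_benchmark_neutrality`, its tolerance, and the two-stock toy data are assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def check_benchmark_neutrality(exposures, weights, tol=1e-10):
    """Return the cap-weighted mean and variance of each factor column.

    Raises ValueError if any mean is not ~0 or any variance is not ~1.
    """
    w = weights / weights.sum()                      # normalize to sum to one
    mean = exposures.mul(w, axis=0).sum()            # weighted mean per factor
    var = (exposures.sub(mean, axis=1) ** 2).mul(w, axis=0).sum()
    if not np.allclose(mean, 0.0, atol=tol):
        raise ValueError(f"cap-weighted means are not zero:\n{mean}")
    if not np.allclose(var, 1.0, atol=tol):
        raise ValueError(f"cap-weighted variances are not one:\n{var}")
    return mean, var

# Toy data: two hypothetical stocks with market caps 300 and 100.
caps = pd.Series({"AAA": 300.0, "BBB": 100.0})
raw = pd.Series({"AAA": 1.0, "BBB": 5.0})
w = caps / caps.sum()
cw_mean = (raw * w).sum()
cw_std = np.sqrt(((raw - cw_mean) ** 2 * w).sum())
toy_X = pd.DataFrame({"Factor": (raw - cw_mean) / cw_std})

mean, var = check_benchmark_neutrality(toy_X, caps)
print(mean, var, sep="\n")
```

In the notebook itself, the equivalent call would pass `X` and `descriptors['market_cap']`.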


In [None]:
# --- Save the Factor Exposure Matrix ---
# In the actual model, I would have an X matrix for every month.
# For this toy project, I'll save this single, most recent X matrix.
X_FILE = os.path.join(DATA_DIR, 'factor_exposures.csv')
X.to_csv(X_FILE)

print(f"\nFactor exposure matrix saved to {X_FILE}")
print("Notebook 2 is complete.")



Factor exposure matrix saved to data\factor_exposures.csv
Notebook 2 is complete.
