# Notebook 4: Alpha Signal Generation & Neutralization

### **Objective**
The objective of this notebook is to construct clean, actionable, and time-varying **alpha signals** that will drive our active portfolio decisions. In the Grinold-Kahn framework, a high-quality alpha signal is the mathematical representation of a manager's proprietary, skill-based market view.

This notebook demonstrates a crucial step in a professional quantitative process: the **purification** of raw investment ideas into portfolio-ready alpha vectors. We will generate two distinct alpha signals:
1.  A classic **Momentum** factor.
2.  A **Financial Constraints (FC)** factor, based on the Whited-Wu index. This serves as a quantitative baseline for future research using more advanced LLM-based measures.

Crucially, both raw signals will be rigorously neutralized to remove any unintended systematic (benchmark) bias.

---

### **Methodology: The Signal Purification Pipeline**

The process applies a three-step pipeline to each of our alpha ideas, followed by an optional fourth step that combines the results:

*   **1. Generate Raw Alpha Signals:** We start by calculating the raw, un-neutralized score for each of our investment ideas for each stock at each point in time.
    *   **Momentum:** Calculated as the stock's historical return over the past 12 months, skipping the most recent month.
    *   **Financial Constraints:** Calculated using the formula for the **Whited-Wu Index**, which combines several accounting ratios into a single score representing a firm's access to external capital.

*   **2. Diagnose Systematic Bias:** A raw factor signal is rarely "pure" and often contains an implicit bet on the overall market. We diagnose this bias for each signal by calculating its **benchmark-weighted average** for each month.
    $$ \bar{\alpha}_{\text{raw}, B, t} = \sum_{n=1}^{N_t} h_{B,n,t} \cdot \alpha_{\text{raw},n,t} $$
    A non-zero value indicates that the raw signal has a systematic tilt that must be removed.

*   **3. Perform Benchmark Neutralization:** To create a pure, "skill-based" signal suitable for a stock-selection strategy, we remove this systematic bias using a **beta-adjusted neutralization**. This method cleans each stock's raw alpha by subtracting the portion of its alpha expected to come from the system-wide bias, proportional to the stock's own beta.
    $$ \alpha_{\text{final},n,t} = \alpha_{\text{raw},n,t} - \beta_{n,t} \cdot \bar{\alpha}_{\text{raw}, B, t} $$

*   **4. Combine Signals (Optional):** As a final step, we can create a composite alpha signal by taking a simple average of our two final, neutralized alpha vectors. This is a basic form of signal combination to create a more diversified alpha source.
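
Steps 2 and 3 can be sketched on a single month's cross-section. This is a minimal illustration, not the notebook's implementation: the three stocks, their values, and the names `month`, `h_b`, and `alpha_bar` are hypothetical, and the betas are chosen so that the cap-weighted beta is exactly one.

```python
import pandas as pd

# Hypothetical single-month cross-section (illustrative values only).
month = pd.DataFrame(
    {
        "mkt_cap": [500.0, 300.0, 200.0],
        "beta": [1.0, 0.8, 1.3],          # cap-weighted beta = 0.5*1.0 + 0.3*0.8 + 0.2*1.3 = 1
        "alpha_raw": [0.02, -0.01, 0.03],
    },
    index=["A", "B", "C"],
)

# Benchmark holdings h_B: market-cap weights that sum to one.
h_b = month["mkt_cap"] / month["mkt_cap"].sum()

# Step 2: benchmark-weighted average of the raw signal (the systematic tilt).
alpha_bar = float((h_b * month["alpha_raw"]).sum())

# Step 3: beta-adjusted neutralization, alpha_final = alpha_raw - beta * alpha_bar.
month["alpha_final"] = month["alpha_raw"] - month["beta"] * alpha_bar

# Benchmark-weighted average of the neutralized signal.
residual_bias = float((h_b * month["alpha_final"]).sum())
print(f"raw bias: {alpha_bar:.4f}, bias after neutralization: {residual_bias:.2e}")
```

The residual bias vanishes here only because the cap-weighted beta equals one; with estimated betas that condition holds only approximately.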

---

### **Key Concepts & Theoretical Justification**

#### **1. Alpha ($\alpha$) as a Forecast**

In the active management context, the alpha vector is the manager's **forecast of expected residual return**. It represents the performance that is expected to be uncorrelated with the benchmark's returns. It is the mathematical embodiment of unique, skill-based insights.

#### **2. The Importance of Benchmark Neutrality**

The constraint that the benchmark-weighted average of the alphas is zero, $h_B^T \alpha = 0$, is a crucial disciplining step. It ensures the alpha signal is, on average, **orthogonal to the benchmark**, cleanly separating stock-picking skill from any implicit market timing bet. An optimizer given a benchmark-neutral alpha signal and no other instruction will naturally build a portfolio with a beta of one relative to the benchmark, so the active positions carry no net beta bet.
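
To see when the beta-adjusted formula from the methodology satisfies this constraint, apply the benchmark weights to the neutralized signal. The symbol $\beta_{B,t} = \sum_n h_{B,n,t}\,\beta_{n,t}$ (the benchmark-weighted beta) is introduced here for illustration only:

$$ \sum_{n=1}^{N_t} h_{B,n,t} \cdot \alpha_{\text{final},n,t} = \sum_{n=1}^{N_t} h_{B,n,t} \left( \alpha_{\text{raw},n,t} - \beta_{n,t} \cdot \bar{\alpha}_{\text{raw}, B, t} \right) = \bar{\alpha}_{\text{raw}, B, t} \left( 1 - \beta_{B,t} \right) $$

The right-hand side is zero when $\beta_{B,t} = 1$, which holds by construction if betas are measured against the benchmark itself. With estimated betas and a restricted universe, $\beta_{B,t}$ can drift away from one, in which case dividing the adjustment by $\beta_{B,t}$ restores exact neutrality.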

#### **3. Beta-Adjusted Neutralization**

The beta-adjusted method is theoretically superior to a simple flat subtraction. By adjusting each stock's alpha based on its beta ($\beta_n$), we remove the systematic bias in a "risk-aware" manner. This process helps ensure the final alpha signal has a lower correlation with the benchmark return itself, making it a higher-quality input for portfolio optimization.

---
**Output:** This notebook produces the final, clean `alpha_signals.parquet` file. This panel dataset, containing the purified alpha forecasts for each stock in each month, is the primary "signal" that will be combined with our risk model ($V_t$) to construct the optimal active portfolio in Notebook 5.



### 1. Imports and Load Data

In [15]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm 
import os 
print('Libraries imported successfully')

Libraries imported successfully


In [16]:
onedrive_root = os.environ['OneDrive']
DATA_DIR = os.path.join(onedrive_root, '0. DATASETS', 'outputs')

# --- Define paths and load the required datasets ---
PANEL_DATA_PATH = os.path.join(DATA_DIR, 'panel_data.parquet')
FACTOR_EXPOSURES_PATH = os.path.join(DATA_DIR, 'factor_exposures.parquet')

panel_data = pd.read_parquet(PANEL_DATA_PATH)
X_factors = pd.read_parquet(FACTOR_EXPOSURES_PATH)


# make sure the dataframes have the same indices
if not panel_data.index.names == ['permno', 'date']:
    panel_data.reset_index(inplace = True)
    panel_data.set_index(['permno', 'date'], inplace = True)
    
if not X_factors.index.names == ['permno', 'date']:
    X_factors.reset_index(inplace = True)
    X_factors.set_index(['permno', 'date'], inplace = True)

# Create smaller DataFrames with only the columns we need for this notebook.
market_data = panel_data[['ret_monthly', 'mkt_cap', 'vwretd']]
alpha_signals = X_factors[['Momentum', 'FinConstraint']]

# creating one df of our vars
df = market_data.join(alpha_signals, how = 'inner')

df.dropna(inplace = True)

print("Data loading and merging complete.")
print(f"Final DataFrame shape: {df.shape}")
print("Sample of the final prepared data:")
print(df.head())

Data loading and merging complete.
Final DataFrame shape: (1199061, 5)
Sample of the final prepared data:
                   ret_monthly   mkt_cap    vwretd  Momentum  FinConstraint
permno date                                                                
10001  1990-06-30     0.013750  10052.25  0.002203 -0.441271       2.858116
       1990-07-31     0.025641  10310.00  0.001627 -0.609343       3.222712
       1990-08-31    -0.050000   9794.50  0.009347 -0.333894       2.736315
       1990-09-30     0.040789  10179.00  0.014276 -0.018822       2.972653
       1990-10-31    -0.012821  10048.50  0.000644  0.328334       3.033946
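
The `Momentum` column above is loaded precomputed from `factor_exposures.parquet`. For reference, the definition given in the methodology (past 12 months, skipping the most recent month) could be sketched as follows. The function name `momentum_12_1`, its parameters, and the toy return series are hypothetical, and the exact window convention used upstream is an assumption.

```python
import numpy as np
import pandas as pd


def momentum_12_1(monthly_returns: pd.Series, lookback: int = 12, skip: int = 1) -> pd.Series:
    """Compound return over the `lookback - skip` months that end `skip` months ago."""
    window = lookback - skip
    log_ret = np.log1p(monthly_returns)
    # shift(skip) drops the most recent month(s) before the rolling sum is taken
    cum_log = log_ret.shift(skip).rolling(window, min_periods=window).sum()
    return np.expm1(cum_log)


# Toy series: 13 months of +1%, then a +50% jump in the latest month.
idx = pd.period_range("2020-01", periods=14, freq="M")
rets = pd.Series([0.01] * 13 + [0.50], index=idx)

mom = momentum_12_1(rets)
print(mom.tail(3))
```

The final month's 50% jump does not enter the latest momentum value, which is the point of the skip. On a panel indexed by `(permno, date)`, the same function could be applied per stock, for example with `groupby('permno')['ret_monthly'].transform(momentum_12_1)`.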


In [17]:

duplicates = df[df.index.duplicated(keep=False)]
if len(duplicates) > 0:
    print("Warning: data has duplicate entries")
    print(duplicates.head())
else:
    print("Great! No duplicates found in the data")

Great! No duplicates found in the data


### 2. Calculate Time-Varying Betas

In [18]:
# --- Calculate Time-Varying Betas ---
# The goal is to calculate a time series of benchmark betas for each stock.
# For each stock, we calculate beta on a 60-month (5-year) rolling basis.

from statsmodels.regression.rolling import RollingOLS

print("Calculating time-varying historical betas for each stock...")

def calculate_rolling_beta(group, window_size=60, min_obs=36):
    """
    Calculates the rolling beta of a stock's returns ('ret_monthly') against the
    benchmark's returns ('vwretd'); no risk-free adjustment is applied here.
    The benchmark is the CRSP value-weighted portfolio of stocks.
    Returns a Series aligned with the group's index. If the group has fewer than
    'min_obs' rows, returns a Series of NaNs so pandas alignment works correctly.
    """
    # Skip groups with fewer than 'min_obs' observations: they cannot produce an estimate.
    # Groups shorter than the window may still fail inside RollingOLS, which expects the
    # underlying arrays to be at least as long as the window; the except clause handles that.
    if len(group) < min_obs:
        return pd.Series(np.nan, index=group.index)

    y = group['ret_monthly']
    X = sm.add_constant(group['vwretd'])

    try:
        rols = RollingOLS(y, X, window=window_size, min_nobs=min_obs)
        results = rols.fit()
        return results.params['vwretd']
    except IndexError:
        # If statsmodels raises an indexing error
        # return NaNs aligned to the group's index.
        return pd.Series(np.nan, index=group.index)

# Apply the function to each 'permno' group.
rolling_betas = df.groupby('permno', group_keys=False).apply(calculate_rolling_beta)

# Assign the resulting Series back to our main DataFrame.
df['beta'] = rolling_betas

print("Rolling beta calculation complete.")

# --- Verification Step ---
print("\nVerifying the beta calculation:")
# describe() skips NaNs, so this summarizes only the betas that were actually calculated
print(df['beta'].describe())

# Check how many valid betas we have
print(f"\nTotal observations: {len(df)}")
print(f"Number of valid beta calculations: {df['beta'].notna().sum()}")

Calculating time-varying historical betas for each stock...
Rolling beta calculation complete.

Verifying the beta calculation:
count    701280.000000
mean          0.155622
std           2.409209
min         -95.971368
25%          -0.926747
50%           0.272374
75%           1.381265
max          42.308698
Name: beta, dtype: float64

Total observations: 1199061
Number of valid beta calculations: 701280
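
As a sanity check on the rolling regression, the slope of a univariate OLS with an intercept equals the rolling covariance divided by the rolling variance of the benchmark return. The sketch below uses synthetic, noise-free data with a known beta of 1.2; the series names and parameters are hypothetical, and it requires a full 60-month window, which is a simplification rather than a reproduction of how `RollingOLS` treats `min_nobs`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_months, window = 80, 60

# Synthetic benchmark returns and a stock that is an exact linear function of them.
mkt = pd.Series(rng.normal(0.01, 0.04, n_months), name="vwretd")
stock = 0.002 + 1.2 * mkt  # noise-free, so every window's slope is 1.2

# OLS slope with an intercept = cov(stock, mkt) / var(mkt), computed per rolling window.
rolling_cov = stock.rolling(window, min_periods=window).cov(mkt)
rolling_var = mkt.rolling(window, min_periods=window).var()
beta = rolling_cov / rolling_var

print(beta.dropna().round(6).unique())
```

Running the same check on a handful of real `permno` groups is one way to investigate implausible estimates such as the extreme minimum and maximum in the summary above.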


### 3. Winsorize the Beta Estimates

In [19]:
# --- Winsorize the Beta Estimates ---
# Defining the quantile thresholds 
lower_quantile = 0.01
upper_quantile = 0.99

# Calculate the 1% and 99% quantiles within each date group
# and clip the beta series to those bounds
df['beta_winsorized'] = df.groupby('date')['beta'].transform(
    lambda x : x.clip(
        lower = x.quantile(lower_quantile),
        upper = x.quantile(upper_quantile)
    )
)

print("Beta winsorization complete.")

# --- Verification Step ---
print("\nComparing Raw vs. Winsorized Beta Distributions:")

beta_comparison = pd.concat([df['beta'].describe(), df['beta_winsorized'].describe()], axis = 1 )
beta_comparison.columns = ['Raw Beta', 'Winsorized Beta' ]
print(beta_comparison)



Beta winsorization complete.

Comparing Raw vs. Winsorized Beta Distributions:
            Raw Beta  Winsorized Beta
count  701280.000000    701280.000000
mean        0.155622         0.163484
std         2.409209         2.206242
min       -95.971368       -12.844529
25%        -0.926747        -0.926747
50%         0.272374         0.272374
75%         1.381265         1.381265
max        42.308698        12.951891
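
The per-date clipping can be illustrated on a toy panel. This is a minimal sketch with made-up values; the frame `toy` and the column `beta_w` are hypothetical names. It shows that the clipping bounds are computed separately for each date, so a month with a wider beta distribution gets wider bounds.

```python
import numpy as np
import pandas as pd

# Two hypothetical months with 101 "betas" each; the second month is ten times more dispersed.
toy = pd.DataFrame(
    {
        "date": ["2020-01-31"] * 101 + ["2020-02-29"] * 101,
        "beta": np.concatenate([np.arange(101.0), np.arange(101.0) * 10]),
    }
)

# Same pattern as the cell above: clip each date's cross-section at its own 1%/99% quantiles.
toy["beta_w"] = toy.groupby("date")["beta"].transform(
    lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99))
)

summary = toy.groupby("date")["beta_w"].agg(["min", "max"])
print(summary)
```

Only the tails move: values between the two quantiles are unchanged, which matches the identical quartiles in the raw versus winsorized comparison above.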


### 4. Alpha Neutralization
This step takes our raw alpha signals (Momentum and FinConstraint) and removes their systematic, benchmark-related bias. The output is two new columns, `alpha_Momentum` and `alpha_FinConstraint`, which are our benchmark-neutral alpha vectors.

In [20]:
# --- Alpha Neutralization ---

def neutralize_alpha(group, signal_col_name):
    """
    Performs a beta-adjusted benchmark neutralization on a given signal column.
    Takes a DataFrame for a single period ('group') and the name of the signal column.
    Returns a Series of neutralized alphas.
    """
    # Define the columns we absolutely need for this calculation
    required_cols = ['mkt_cap', 'beta_winsorized', signal_col_name]
    
    # Drop any stocks with missing data for this month
    clean_group = group.dropna(subset=required_cols)
    
    # If no data is left, return an empty series
    if clean_group.empty:
        return pd.Series(dtype='float64')

    # Calculate cap weights for the valid universe this month
    cap_weights = clean_group['mkt_cap'] / clean_group['mkt_cap'].sum()

    # Calculate the raw signal bias and the beta of our tradable benchmark
    benchmark_raw_alpha = (clean_group[signal_col_name] * cap_weights).sum()
    benchmark_beta = (clean_group['beta_winsorized'] * cap_weights).sum()

    # If benchmark_beta is zero or very small, we can't neutralize. Return raw de-meaned.
    if np.isclose(benchmark_beta, 0):
        # A simple de-meaning is a safe fallback
        return clean_group[signal_col_name] - benchmark_raw_alpha

    # Calculate the beta-adjusted bias term
    adjustment_factor = benchmark_raw_alpha / benchmark_beta

    # Calculate the final, neutralized alpha
    neutralized_signal = clean_group[signal_col_name] - clean_group['beta_winsorized'] * adjustment_factor
    
    return neutralized_signal

# --- Apply the function for each signal ---

print("Neutralizing Momentum signal...")
# We tell apply to use our function, and then pass 'Momentum' as the 'signal_col_name' argument
alpha_momentum = df.groupby('date', group_keys= False).apply(neutralize_alpha, signal_col_name='Momentum')

print("Neutralizing Financial Constraint signal...")
alpha_fin_constraint = df.groupby('date', group_keys = False).apply(neutralize_alpha, signal_col_name='FinConstraint')


# --- Assign the new alpha signals back to the main DataFrame ---
# Because we used group_keys=False, the result of .apply() keeps the original (permno, date) MultiIndex,
# so we can assign it directly to our DataFrame
df['alpha_Momentum'] = alpha_momentum
df['alpha_FinConstraint'] = alpha_fin_constraint


# --- Final Verification ---
print("\nVerifying neutralization...")
# We must use the original weights from the df for a true verification
df['cap_weight'] = df.groupby('date')['mkt_cap'].transform(lambda x: x / x.sum())

mom_alpha_benchmark_avg = (df['alpha_Momentum'] * df['cap_weight']).groupby('date').sum().mean()
fc_alpha_benchmark_avg = (df['alpha_FinConstraint'] * df['cap_weight']).groupby('date').sum().mean()

if np.allclose(mom_alpha_benchmark_avg, 0):
    print(f"Average benchmark-weighted Momentum Alpha: {mom_alpha_benchmark_avg:.10f} which is close to 0")
else:
    print(f"Average benchmark-weighted Momentum Alpha: {mom_alpha_benchmark_avg:.10f} which is different from zero")

if np.allclose(fc_alpha_benchmark_avg, 0):
    print(f"Average benchmark-weighted FinConstraint Alpha: {fc_alpha_benchmark_avg:.10f} which is close to 0")
else:
    print(f"Average benchmark-weighted FinConstraint Alpha: {fc_alpha_benchmark_avg:.10f} which is different from zero")


Neutralizing Momentum signal...
Neutralizing Financial Constraint signal...

Verifying neutralization...
Average benchmark-weighted Momentum Alpha: -0.0000000000 which is close to 0
Average benchmark-weighted FinConstraint Alpha: -0.0000000000 which is close to 0
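
The `neutralize_alpha` function divides the bias by the cap-weighted beta before applying it, which differs slightly from the formula in the methodology. A small sketch with hypothetical numbers suggests why: when the cap-weighted beta of the universe is not exactly one, the unscaled adjustment leaves a residual bias, while the scaled version removes it. The names `month`, `unscaled`, and `scaled` are illustrative only.

```python
import pandas as pd

# Hypothetical month whose cap-weighted beta is 0.91 rather than 1.
month = pd.DataFrame(
    {
        "mkt_cap": [500.0, 300.0, 200.0],
        "beta": [0.9, 0.6, 1.4],
        "signal": [0.02, -0.01, 0.03],
    }
)

w = month["mkt_cap"] / month["mkt_cap"].sum()
bias = float((w * month["signal"]).sum())      # benchmark-weighted raw signal
bench_beta = float((w * month["beta"]).sum())  # benchmark-weighted beta

# Formula as written in the methodology: subtract beta * bias.
unscaled = month["signal"] - month["beta"] * bias
# Variant used in neutralize_alpha: subtract beta * (bias / benchmark beta).
scaled = month["signal"] - month["beta"] * (bias / bench_beta)

unscaled_bias = float((w * unscaled).sum())
scaled_bias = float((w * scaled).sum())
print(f"benchmark beta: {bench_beta:.2f}")
print(f"residual bias, unscaled: {unscaled_bias:.6f}; scaled: {scaled_bias:.2e}")
```

The unscaled residual equals the raw bias times one minus the benchmark beta, so the two variants coincide whenever the cap-weighted beta is one.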


In [None]:
# --- FINAL CELL: Finalize and Save All Alpha Signals ---

# 1. Create the simple, equal-weighted composite signal
df['alpha_composite'] = (df['alpha_Momentum'] + df['alpha_FinConstraint']) / 2

# 2. Re-standardize ALL THREE alpha signals for consistency
# This ensures each signal we test has a comparable cross-sectional distribution (mean=0, std=1).
print("Re-standardizing all final alpha signals...")

df['alpha_Momentum_final'] = df.groupby('date')['alpha_Momentum'].transform(
    lambda x: (x - x.mean()) / x.std()
)
df['alpha_FinConstraint_final'] = df.groupby('date')['alpha_FinConstraint'].transform(
    lambda x: (x - x.mean()) / x.std()
)
df['alpha_Composite_final'] = df.groupby('date')['alpha_composite'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# 3. Create a final DataFrame with clean alpha signals
alpha_signals_final = df[['alpha_Momentum_final', 'alpha_FinConstraint_final', 'alpha_Composite_final']].copy()
# Fill remaining NaNs with 0: stocks without a valid beta (no neutralized alpha), plus any std=0 cases
alpha_signals_final.fillna(0, inplace=True)

# --- Verification --- 
print("\nVerifying benchmark neutrality of the final composite signal...")
df['cap_weight'] = df.groupby('date')['mkt_cap'].transform(lambda x: x / x.sum())
final_composite_benchmark_alpha = (alpha_signals_final['alpha_Composite_final'] * df['cap_weight']).groupby('date').sum().mean()
print(f"Average benchmark-weighted Final Composite Alpha: {final_composite_benchmark_alpha:.10f}")

# 4. Save the Final Alpha Signals DataFrame
ALPHA_FILE = os.path.join(DATA_DIR, 'alpha_signals.parquet')
alpha_signals_final.to_parquet(ALPHA_FILE)

print(f"\nFinal DataFrame of all alpha signals saved to {ALPHA_FILE}")
print("--- Notebook 4 (Project Titan) is Complete ---")


permno  date      
10001   1990-06-30         NaN
        1990-07-31         NaN
        1990-08-31         NaN
        1990-09-30         NaN
        1990-10-31         NaN
                        ...   
93436   2023-08-31    0.085492
        2023-09-30    0.121931
        2023-10-31    0.736613
        2023-11-30    0.035304
        2023-12-31    0.264956
Name: alpha_composite, Length: 1199061, dtype: float64
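
The verification in the final cell is worth reading with one caveat in mind. The re-standardization subtracts the equal-weighted cross-sectional mean, while benchmark neutrality is defined with cap weights, so a signal that was exactly benchmark-neutral need not stay so after z-scoring. A minimal sketch with hypothetical values (the names `month`, `alpha`, and `z` are illustrative):

```python
import pandas as pd

# Hypothetical month: the signal is benchmark-neutral under cap weights.
month = pd.DataFrame(
    {
        "mkt_cap": [500.0, 300.0, 200.0],
        "alpha": [0.01, 0.01, -0.04],
    }
)
w = month["mkt_cap"] / month["mkt_cap"].sum()

# Same z-score as the final cell: equal-weighted mean and sample standard deviation.
z = (month["alpha"] - month["alpha"].mean()) / month["alpha"].std()

cw_before = float((w * month["alpha"]).sum())  # cap-weighted mean before z-scoring
ew_after = float(z.mean())                     # equal-weighted mean after z-scoring
cw_after = float((w * z).sum())                # cap-weighted mean after z-scoring

print(f"cap-weighted mean before: {cw_before:.2e}")
print(f"equal-weighted mean after: {ew_after:.2e}")
print(f"cap-weighted mean after:  {cw_after:.4f}")
```

If strict benchmark neutrality is needed downstream, one option would be to rescale by the cross-sectional standard deviation only, without de-meaning; that is a design choice this notebook does not make.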