# GKY (2020) Replication: Part 2 - Data Merging & Final Matrix Construction

##### This notebook takes the three prepared data files from `01_data_downloader.ipynb` and completes the data assembly process.

##### 1. Step 4: Merging Data
##### - We load `crsp_prepared.parquet`, `characteristics_prepared.parquet`, and `macro_predictors_lagged.parquet`.
##### - We perform sequential `inner` joins so that every row in the final dataset has a return, a matching characteristics record, and macro data for that month. The joins only match keys; they do not remove missing values inside a record.

##### 2. Step 5: Constructing the Final 920-Predictor Matrix
##### - Interaction Terms: We multiply each of the 94 stock characteristics by each of the 8 macro predictors and by a constant (`macro_intercept`), giving `94 * 9 = 846` interaction features. Following GKY's methodology, the standalone characteristic and macro columns are then dropped. Each characteristic is still present as its product with the constant, while the macro variables enter only through the interactions.
##### - Industry Dummies: We create 74 one-hot encoded dummy variables from the `sic2` industry codes.
##### - Final Predictor Count: `846 (Interactions) + 74 (Dummies) = 920 Predictors`.

##### The final output is a single Parquet file (`gky_final_data.parquet`) containing the response variable (`ret_excess`), the identifying columns (`permno`, `month`), the `mktcap_lag` column, and the complete 920-predictor matrix.
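
##### The predictor count described above can be checked on a toy example. The sketch below uses made-up sizes (3 characteristics, 2 macro predictors) and random data, all of which are assumptions for illustration rather than part of the actual pipeline. It shows the `C * (M + 1)` column count and that the product with the constant column reproduces each characteristic.

```python
import numpy as np

# Toy sizes standing in for the real 94 characteristics and 8 macro predictors.
n_obs, n_chars, n_macro = 5, 3, 2

rng = np.random.default_rng(0)
chars = rng.normal(size=(n_obs, n_chars))   # stock-level characteristics (random toy data)
macro = rng.normal(size=(n_obs, n_macro))   # macro predictors (random toy data)

# Append the constant so that characteristic * 1.0 keeps the characteristic itself.
macro_with_const = np.column_stack([macro, np.ones(n_obs)])

# (N, C, 1) * (N, 1, M + 1) -> (N, C, M + 1), flattened to (N, C * (M + 1)).
interactions = (chars[:, :, None] * macro_with_const[:, None, :]).reshape(n_obs, -1)
assert interactions.shape[1] == n_chars * (n_macro + 1)

# Columns are grouped by characteristic; the last column of each group is the
# product with the constant, which equals the characteristic.
intercept_cols = interactions[:, n_macro::n_macro + 1]
assert np.allclose(intercept_cols, chars)

# The same arithmetic at full scale: interactions, then interactions plus dummies.
print(94 * (8 + 1), 94 * (8 + 1) + 74)  # prints: 846 920
```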


In [1]:
import pandas as pd
import numpy as np

import os
import gc


In [2]:
# --- Configuration ---

# Define the input directory where the prepared files are stored.
# This path is relative to this notebook's location in `notebooks/`.
INPUT_DIR = '../data'

# Define the final output file path.
OUTPUT_PATH = os.path.join(INPUT_DIR, 'gky_final_data.parquet')

# --- Verify Input Files Exist ---
required_files = [
    'characteristics_prepared.parquet',
    'crsp_prepared.parquet',
    'macro_predictors_lagged.parquet'
]

for f in required_files:
    path = os.path.join(INPUT_DIR, f)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"Required input file not found: {path}. "
            "Please run the '01_data_downloader.ipynb' notebook first."
        )
print("All required input files found.")

# --- Load Prepared Data ---

print("Loading prepared data from parquet files...")
characteristics_df = pd.read_parquet(os.path.join(INPUT_DIR, 'characteristics_prepared.parquet'))
crsp_df = pd.read_parquet(os.path.join(INPUT_DIR, 'crsp_prepared.parquet'))
macro_df = pd.read_parquet(os.path.join(INPUT_DIR, 'macro_predictors_lagged.parquet'))

print("Data successfully loaded.")
print(f"Characteristics shape: {characteristics_df.shape}")
print(f"CRSP shape: {crsp_df.shape}")
print(f"Macro shape: {macro_df.shape}")


All required input files found.
Loading prepared data from parquet files...
Data successfully loaded.
Characteristics shape: (3816941, 97)
CRSP shape: (3065543, 4)
Macro shape: (1836, 9)
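
##### The joins in the next step rely on `(permno, month)` identifying one row per stock-month and `month` identifying one row in the macro table. A minimal sketch of such a check on toy frames follows; the helper `count_duplicate_keys` and the column `macro_example` are hypothetical names introduced here, not part of the notebook.

```python
import pandas as pd


def count_duplicate_keys(df: pd.DataFrame, keys: list) -> int:
    """Return the number of rows whose key combination already appeared earlier."""
    return int(df.duplicated(subset=keys).sum())


# Toy stand-in for the CRSP file: one row per (permno, month).
toy_crsp = pd.DataFrame({
    "permno": [10000, 10000, 10001],
    "month": pd.to_datetime(["1986-03-31", "1986-04-30", "1986-03-31"]),
    "ret_excess": [0.01, -0.02, 0.03],
})

# Toy stand-in for the macro file with a deliberately duplicated month.
toy_macro = pd.DataFrame({
    "month": pd.to_datetime(["1986-03-31", "1986-03-31"]),
    "macro_example": [0.1, 0.1],
})

crsp_dupes = count_duplicate_keys(toy_crsp, ["permno", "month"])
macro_dupes = count_duplicate_keys(toy_macro, ["month"])
print(crsp_dupes, macro_dupes)  # prints: 0 1
```

##### A non-zero count on the real data would mean the merge could silently multiply rows, so it is worth resolving before joining.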


In [3]:
# --- Step 4: Merge All Data Sources ---
print("\n--- Starting Step 4: Merging DataFrames ---")

# Check date conventions before fixing
print(f"CRSP/Characteristics 'month' head:\n{crsp_df['month'].head(2)}")
print(f"Macro 'month' head (before fix):\n{macro_df['month'].head(2)}")
print("-" * 30)

# Standardize macro_df date to month-end IN-PLACE
print("Applying date standardization fix to macro data...")
macro_df['month'] = macro_df['month'] + pd.offsets.MonthEnd(0)
print(f"Macro 'month' head (after fix):\n{macro_df['month'].head(2)}")
print("-" * 30)

# METHODOLOGY: Use 'inner' joins to ensure data integrity across all sources.
# We only want observations that have returns, characteristics, and macro data.

# Merge returns (CRSP) with stock-level characteristics
print("Merging CRSP data with characteristics...")
merged_df = pd.merge(
    crsp_df, 
    characteristics_df, 
    on=['permno', 'month'], 
    how='inner'
)
print(f"Shape after merging with characteristics: {merged_df.shape}")

# Merge the result with macroeconomic predictors
print("Merging with lagged macroeconomic predictors...")
merged_df = pd.merge(
    merged_df, 
    macro_df, 
    on='month', 
    how='inner'
)
print(f"Final merged shape: {merged_df.shape}")

# Memory Management: Clean up original dataframes
del characteristics_df, crsp_df, macro_df
gc.collect()



--- Starting Step 4: Merging DataFrames ---
CRSP/Characteristics 'month' head:
0   1986-03-31
1   1986-04-30
Name: month, dtype: datetime64[ns]
Macro 'month' head (before fix):
0   1871-02-28
1   1871-03-28
Name: month, dtype: datetime64[ns]
------------------------------
Applying date standardization fix to macro data...
Macro 'month' head (after fix):
0   1871-02-28
1   1871-03-31
Name: month, dtype: datetime64[ns]
------------------------------
Merging CRSP data with characteristics...
Shape after merging with characteristics: (2851604, 99)
Merging with lagged macroeconomic predictors...
Final merged shape: (2851604, 107)


0
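
##### The date fix and the effect of the `inner` join can be illustrated on toy data. The sketch below is an illustration under assumed inputs (the frames and the column `macro_example` are hypothetical). It shows that `MonthEnd(0)` rolls a date forward to its month end while leaving month-end dates unchanged, and it uses a `left` join with `indicator=True` to count the rows an `inner` join would drop.

```python
import pandas as pd

# MonthEnd(0) rolls forward to the month end; dates already at month end stay put.
dates = pd.Series(pd.to_datetime(["1871-02-28", "1871-03-28", "1986-03-15"]))
month_end = dates + pd.offsets.MonthEnd(0)

returns = pd.DataFrame({
    "permno": [1, 1, 2],
    "month": pd.to_datetime(["1986-03-31", "1986-04-30", "1986-03-31"]),
    "ret_excess": [0.01, 0.02, 0.03],
})
macro = pd.DataFrame({
    "month": pd.to_datetime(["1986-03-28"]),  # not a month-end stamp
    "macro_example": [0.5],
})

# Without alignment no keys match, so an inner join returns zero rows.
unaligned = returns.merge(macro, on="month", how="inner")

macro["month"] = macro["month"] + pd.offsets.MonthEnd(0)

# A left join with indicator=True marks rows that an inner join would drop.
# validate="many_to_one" raises MergeError if the macro table repeats a month.
diagnostic = returns.merge(
    macro, on="month", how="left", indicator=True, validate="many_to_one"
)
dropped_by_inner = int((diagnostic["_merge"] == "left_only").sum())

print(len(unaligned), dropped_by_inner)  # prints: 0 1
```

##### In the run above, the merge with characteristics reduces the 3,065,543 CRSP rows to 2,851,604, and the same diagnostic would show which stock-months account for the difference.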

In [4]:
# --- Step 5: Construct the Final 920-Predictor Matrix ---

print("\n--- Starting Step 5: Constructing the Predictor Matrix ---")

# --- Part A: Create Industry Dummies ---
print("Creating 74 industry dummies from 'sic2'...")

# To ensure exactly 74 dummies are created, we first identify all possible SIC codes from the original, un-merged characteristics data.
# Only the 'sic2' column is read, which avoids reloading the full file.
print("Temporarily reloading characteristics data to get all SIC codes...")
original_chars_df = pd.read_parquet(
    os.path.join(INPUT_DIR, 'characteristics_prepared.parquet'),
    columns=['sic2']
)
# dropna() guards against NaN, which cannot be sorted reliably or used as a category.
all_sic_codes = sorted(original_chars_df['sic2'].dropna().unique())
del original_chars_df # Clean up memory
gc.collect()

print(f"Identified {len(all_sic_codes)} unique SIC codes from the source file.")

# Convert the 'sic2' column in our final merged data to a 'categorical' type.
# By explicitly providing all possible categories, we force get_dummies to create a column for each, even if a category is missing after the merge.
merged_df['sic2'] = pd.Categorical(merged_df['sic2'], categories=all_sic_codes)

# Now, create the dummies. This will generate 74 columns.
sic2_dummies = pd.get_dummies(merged_df['sic2'], prefix='sic2', dtype=float)

print(f"Shape of industry dummies: {sic2_dummies.shape}")
if sic2_dummies.shape[1] == 74:
    print("SUCCESS: Correctly created 74 industry dummy columns.")
else:
    print(f"WARNING: Expected 74 SIC dummies, but got {sic2_dummies.shape[1]}.")

# --- Part B: Create Interaction Terms ---
print("Creating 846 interaction features (this is computationally intensive)...")

# METHODOLOGY: Interact 94 characteristics with 8 macro predictors + 1 intercept.
# Add the intercept column to the macro predictors before creating interactions.
merged_df['macro_intercept'] = 1.0

# Identify the column groups
char_cols = [col for col in merged_df.columns if 'characteristic_' in col]
macro_cols = [col for col in merged_df.columns if 'macro_' in col]

print(f"Found {len(char_cols)} characteristic columns.")
print(f"Found {len(macro_cols)} macro columns (including intercept).")

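# Use vectorized NumPy broadcasting instead of a Python loop over column pairs.
# Note: this materializes an (N, 846) float64 array, i.e. 8 * N * 846 bytes (about 18 GiB for 2,851,604 rows).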
char_matrix = merged_df[char_cols].values
macro_matrix = merged_df[macro_cols].values

# Broadcasting creates all pairwise products efficiently:
# (N, 94, 1) * (N, 1, 9) -> (N, 94, 9)
interaction_matrix = char_matrix[:, :, np.newaxis] * macro_matrix[:, np.newaxis, :]
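# Reshape to the final (N, 846) matrix. The default row-major order makes the macro index vary fastest,
# which matches the order of the column names built below.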
interaction_matrix_reshaped = interaction_matrix.reshape(interaction_matrix.shape[0], -1)

# Create meaningful column names for the new features
interaction_col_names = [
    f"{c_col}_x_{m_col}" for c_col in char_cols for m_col in macro_cols
]

# Convert the NumPy array back to a DataFrame
interaction_df = pd.DataFrame(
    interaction_matrix_reshaped,
    columns=interaction_col_names,
    index=merged_df.index
)
print(f"Shape of interaction features: {interaction_df.shape}")

# --- Part C: Assemble the Final DataFrame ---
print("Assembling the final analysis-ready DataFrame...")

# Identify ID and response columns to keep
id_response_cols = ['permno', 'month', 'ret_excess', 'mktcap_lag']
final_ids_df = merged_df[id_response_cols]

# Concatenate the final pieces: IDs, Dummies, and Interactions
final_df = pd.concat([final_ids_df, sic2_dummies, interaction_df], axis=1)

# Memory Management
del merged_df, sic2_dummies, interaction_df, final_ids_df
del char_matrix, macro_matrix, interaction_matrix, interaction_matrix_reshaped
gc.collect()

print("--- Final Matrix Construction Complete ---")



--- Starting Step 5: Constructing the Predictor Matrix ---
Creating 74 industry dummies from 'sic2'...
Temporarily reloading characteristics data to get all SIC codes...
Identified 74 unique SIC codes from the source file.
Shape of industry dummies: (2851604, 74)
SUCCESS: Correctly created 74 industry dummy columns.
Creating 846 interaction features (this is computationally intensive)...
Found 94 characteristic columns.
Found 9 macro columns (including intercept).
Shape of interaction features: (2851604, 846)
Assembling the final analysis-ready DataFrame...
--- Final Matrix Construction Complete ---
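
##### The reason for fixing the category list before calling `get_dummies` can be seen on a small example. In the sketch below, the four codes in `all_codes` and the observed values are hypothetical. With a plain column only the observed codes get a dummy, whereas a categorical column with an explicit category list yields one dummy per listed code, including codes that no longer appear.

```python
import pandas as pd

all_codes = [1.0, 2.0, 7.0, 8.0]        # hypothetical full set of sic2 codes
observed = pd.Series([1.0, 7.0, 7.0])   # codes 2.0 and 8.0 are absent here

# Plain column: only observed codes produce a dummy column.
plain = pd.get_dummies(observed, prefix="sic2", dtype=float)

# Categorical column with explicit categories: one dummy per listed code.
as_categorical = pd.Series(pd.Categorical(observed, categories=all_codes))
fixed = pd.get_dummies(as_categorical, prefix="sic2", dtype=float)

print(list(plain.columns))  # prints: ['sic2_1.0', 'sic2_7.0']
print(list(fixed.columns))  # prints: ['sic2_1.0', 'sic2_2.0', 'sic2_7.0', 'sic2_8.0']
```

##### The all-zero columns keep the predictor matrix at a fixed width, which matters when the same column layout must hold across different subsamples.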


In [5]:
# --- Final Verification and Saving ---

print("\n--- Verifying the Final DataFrame ---")

# Check the final shape
print(f"Final DataFrame shape: {final_df.shape}")

# Verify the number of predictors
predictor_cols = [col for col in final_df.columns if col not in id_response_cols]
num_predictors = len(predictor_cols)
print(f"Total number of predictor columns: {num_predictors}")

if num_predictors == 920:
    print("SUCCESS: The number of predictors (920) matches the GKY paper.")
else:
    print(f"WARNING: The number of predictors ({num_predictors}) does NOT match the GKY paper (920). Please review the steps.")

# Display info and head
print("\nFinal DataFrame Info:")
final_df.info(verbose=False, memory_usage='deep')

print("\nHead of the final DataFrame:")
print(final_df.head())

print(f"\nSaving final analysis-ready data to '{OUTPUT_PATH}'...")
final_df.to_parquet(OUTPUT_PATH, index=False)
print("Save complete. The data is now ready for modeling.")



--- Verifying the Final DataFrame ---
Final DataFrame shape: (2851604, 924)
Total number of predictor columns: 920
SUCCESS: The number of predictors (920) matches the GKY paper.

Final DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2851604 entries, 0 to 2851603
Columns: 924 entries, permno to characteristic_zerotrade_x_macro_intercept
dtypes: datetime64[ns](1), float32(1), float64(921), int32(1)
memory usage: 19.6 GB

Head of the final DataFrame:
   permno      month  ret_excess   mktcap_lag  sic2_1.0  sic2_2.0  sic2_7.0  \
0   10000 1986-03-31    0.359385  11960.00000       0.0       0.0       0.0   
1   10000 1986-04-30   -0.103792  16330.00000       0.0       0.0       0.0   
2   10000 1986-05-31   -0.227556  15172.00000       0.0       0.0       0.0   
3   10000 1986-06-30   -0.010225  11793.87834       0.0       0.0       0.0   
4   10000 1986-07-31   -0.086008  11734.59375       0.0       0.0       0.0   

   sic2_8.0  sic2_9.0  sic2_10.0  ...  \
0       0.0  