# Project Titan - Notebook 1: Data Engineering & CRSP-Compustat Merge

### **Objective**
This notebook constructs the foundational dataset for our quantitative analysis. It takes the pre-filtered, large-scale datasets from CRSP and Compustat (prepared offline in Stata for efficiency) and performs the crucial **CRSP-Compustat Merge (CCM)**. The goal is to produce a single, clean, "point-in-time" monthly panel dataset that correctly aligns market data (like returns) with the appropriate lagged fundamental data from financial statements.

---

### **Methodology: The Point-in-Time Merge**

The core challenge in building a research-quality dataset is the correct temporal alignment of data from different sources. Market data (CRSP) is available daily, while fundamental data (Compustat) is reported quarterly with a significant lag. A naive merge would introduce severe **lookahead bias**.

The methodology applied here follows the standard academic and professional approach:

*   **1. Link File Preparation:** We first load the clean CCM linking table, which provides the historical mapping between CRSP's `PERMNO` and Compustat's `GVKEY`.

*   **2. Event-Driven Merge (`merge_asof`):** We use the `pandas` function `pd.merge_asof()`. By default it performs a backward "as-of" merge: each left-hand row is matched to the most recent right-hand row whose key is at or before the left-hand key, which is the behaviour this problem requires. The process is a two-step merge:
    *   First, we merge the CRSP market data with the CCM linking table to assign the correct `GVKEY` to each `PERMNO` for each point in time.
    *   Second, we merge this combined market data with the Compustat fundamental data, keyed on `datadate`. The `merge_asof` ensures that each market observation is paired with the **most recent financial statement dated on or before it**. This simulates the information delay only to the extent that `datadate` reflects when the statement became public rather than the fiscal period end.

*   **3. Feature Engineering & Finalization:** Once the panel is constructed, we calculate key derived variables (e.g., market capitalization, excess returns) and resample the data to a monthly frequency.
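The backward as-of behaviour described in the steps above can be illustrated on a toy example. This is a minimal sketch, not the notebook's actual data: the dates, values, and the `avail_date` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: three market dates and two fundamental records.
market = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-31", "2020-05-15", "2020-08-20"]),
    "ret": [0.01, -0.02, 0.03],
})
fundamentals = pd.DataFrame({
    "avail_date": pd.to_datetime(["2020-05-01", "2020-08-01"]),  # assumed column name
    "atq": [100.0, 110.0],
})

# Both frames must be sorted on their merge keys.
merged = pd.merge_asof(
    market.sort_values("date"),
    fundamentals.sort_values("avail_date"),
    left_on="date",
    right_on="avail_date",
    direction="backward",  # the default: last right row with key <= left key
)
print(merged)
```

The first market date precedes every fundamental record, so it receives missing values instead of borrowing data from the future. The later two dates pick up the 100.0 and 110.0 records respectively.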

---
**Output:** This notebook's final output is a single, analysis-ready `panel_data.parquet` file. Parquet format is highly efficient for storing large, structured datasets, preserving data types and offering significant speed advantages over CSV.


In [1]:
import pandas as pd
import numpy as np
import os
from pathlib import Path

onedrive_root = str(Path(os.environ['OneDrive']))
INPUT_DATA_DIR = os.path.join(onedrive_root, "0. DATASETS", "temps")

# --- Define File Paths ---
CRSP_FILE = os.path.join(INPUT_DATA_DIR, 'crsp_clean_daily.dta')
COMP_FILE = os.path.join(INPUT_DATA_DIR, 'compustat_clean_quarterly.dta')
CCM_FILE = os.path.join(INPUT_DATA_DIR, 'ccm_linking_table_clean.dta')
OUTPUT_FILE = os.path.join('data', 'panel_data.parquet')

print("Setup complete.")


Setup complete.


### 2. Load the Three Cleaned Stata Files

In [2]:
# Load crsp data 
crsp = pd.read_stata(CRSP_FILE)
crsp.rename(columns= {'ret' : 'ret_daily'}, inplace = True) 
print("\n Converting date columns in crsp to datetime objects...")
crsp['date'] = pd.to_datetime(crsp['date'],  format = '%d%b%Y')

# Load Compustat Data
comp = pd.read_stata(COMP_FILE)

# Load CCM Linking Table
ccm = pd.read_stata(CCM_FILE)

print("CRSP, Compustat, and CCM data loaded successfully.")
print(f"CRSP shape: {crsp.shape}")
print(f"Compustat shape: {comp.shape}")



 Converting date columns in crsp to datetime objects...
CRSP, Compustat, and CCM data loaded successfully.
CRSP shape: (35166396, 10)
Compustat shape: (605175, 23)


### 3. The CRSP-CCM Merge

In [None]:
# --- Step 1 of the Merge: Link CRSP and CCM ---

# merge_asof requires both frames to be sorted by the merge key (the date
# columns); permno is only a secondary sort key.
crsp.sort_values(by=['date', 'permno'], inplace=True)
ccm.sort_values(by=['link_start_date', 'permno'], inplace=True)

# Reset the index so the row labels follow the new sort order. This is tidy-up
# only: merge_asof checks the ordering of the key columns, not the index.
crsp.reset_index(drop=True, inplace=True)
ccm.reset_index(drop=True, inplace=True)

# Backward merge_asof: for each CRSP row, look within the same permno for the
# CCM row with the most recent link_start_date <= the CRSP date and attach it.

crsp_ccm = pd.merge_asof(left=crsp,
                         right=ccm,
                         left_on='date',
                         right_on='link_start_date',
                         by='permno'
                         )

# The merge attaches the link with the latest link_start_date on or before each
# CRSP date, but that link may already have ended before the CRSP date.
# => keep only rows where the CRSP date is on or before the link's end date.
# Rows with no matching link (missing link_end_date) are dropped as well.
crsp_ccm = crsp_ccm[crsp_ccm['date'] <= crsp_ccm['link_end_date']]

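# Market capitalization = share price x number of shares outstanding.
# CRSP records prc as a negative number when it is a bid/ask average rather than
# a closing trade price, so take the absolute value before multiplying.
crsp_ccm['mkt_cap'] = crsp_ccm['prc'].abs() * crsp_ccm['shrout']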
print("CRSP and CCM merged successfully.")

CRSP and CCM merged successfully.
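The link-window logic in this step can be checked on a small hypothetical link table. The sketch below uses made-up identifiers and dates; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical link table: permno 1 maps to gvkey "A" through 2018, then to "B" from 2020.
links = pd.DataFrame({
    "permno": [1, 1],
    "gvkey": ["A", "B"],
    "link_start_date": pd.to_datetime(["2015-01-01", "2020-01-01"]),
    "link_end_date": pd.to_datetime(["2018-12-31", "2022-12-31"]),
})
prices = pd.DataFrame({
    "permno": [1, 1, 1],
    "date": pd.to_datetime(["2016-06-30", "2019-06-28", "2021-06-30"]),
})

linked = pd.merge_asof(
    prices.sort_values("date"),
    links.sort_values("link_start_date"),
    left_on="date",
    right_on="link_start_date",
    by="permno",
)

# The 2019 row picks up link "A" (latest start <= date), but that link ended in 2018,
# so the end-date filter removes it.
valid = linked[linked["date"] <= linked["link_end_date"]]
print(valid[["date", "gvkey"]])
```

Only the 2016 and 2021 rows survive. One caveat, stated as an assumption because the Stata preprocessing is not shown here: if still-active links had a missing `link_end_date`, the comparison would evaluate to `False` and silently drop them, so the end date would need to be filled first.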


### 4. Merge with Compustat

In [5]:
# sort by date and the linking key (gvkey)
crsp_ccm.sort_values(by = ['date', 'gvkey'], inplace = True)
comp.sort_values(by = ['datadate', 'gvkey'], inplace = True)

panel_data = pd.merge_asof(left = crsp_ccm, 
                           right = comp, 
                           left_on = 'date', 
                           right_on = 'datadate', 
                           by = 'gvkey')

# drop the helper columns
panel_data.drop(columns=['link_start_date', 'link_end_date'], inplace=True)

print("Full daily panel data constructed.")
print(f"Daily panel shape: {panel_data.shape}")

Full daily panel data constructed.
Daily panel shape: (34640540, 34)
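The merge in this step keys on `datadate`. Whether that column already reflects a publication date depends on the offline Stata preparation, which is not shown in this notebook. If `datadate` is the fiscal period end, one common way to simulate the reporting delay is to shift it by a fixed lag before the as-of merge. The sketch below is illustrative only: the three-month lag, the `avail_date` column, and the toy values are all assumptions.

```python
import pandas as pd

# Hypothetical quarterly record for a fiscal quarter ending 2020-03-31.
comp_toy = pd.DataFrame({
    "gvkey": ["A"],
    "datadate": pd.to_datetime(["2020-03-31"]),
    "atq": [100.0],
})
# Assumed lag (illustrative): treat the statement as public 3 months after period end.
comp_toy["avail_date"] = comp_toy["datadate"] + pd.DateOffset(months=3)

market_toy = pd.DataFrame({
    "gvkey": ["A", "A"],
    "date": pd.to_datetime(["2020-04-30", "2020-07-31"]),
})

lagged = pd.merge_asof(
    market_toy.sort_values("date"),
    comp_toy.sort_values("avail_date"),
    left_on="date",
    right_on="avail_date",  # merge on availability, not on fiscal period end
    by="gvkey",
)
print(lagged[["date", "atq"]])
```

With the lag applied, the April observation sees no fundamentals yet, while the July observation sees the first-quarter figures. Merging directly on `datadate` would have attached them to the April row as well.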


### 5. Resample to Monthly and Calculate Monthly Returns


In [7]:
# We will group by each firm (permno) and then by month
# and take the last observation of the month.
# This gives us month-end values for market cap, fundamentals, etc.

monthly_panel = panel_data.groupby('permno').resample('ME', on = 'date').last()

In [9]:
# Drop the 'permno' level from the index; permno remains available as a regular column
monthly_panel.reset_index(level = 0 , drop = True, inplace = True)
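One property of `last()` is worth keeping in mind when taking month-end values: it returns the last non-missing entry of each column separately. The toy sketch below uses made-up prices and groups on `dt.to_period("M")` rather than `resample('ME')` purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data: the price is missing on the last trading day of January.
toy = pd.DataFrame({
    "permno": [1, 1, 1],
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-28"]),
    "prc": [10.0, np.nan, 12.0],
})

# Group by firm and calendar month, then take the last non-missing value per column.
month = toy["date"].dt.to_period("M").rename("month")
month_end = toy.groupby(["permno", month]).last()
print(month_end)
```

For January the `date` column reports 2020-01-31, but `prc` is 10.0 from the previous day, because the missing value was skipped. Values within one monthly row can therefore come from different days.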

In [10]:
# Calculate monthly returns by compounding the daily returns within each month:
# 1 + r_month = (1 + r_1)(1 + r_2)...(1 + r_n), where n is the number of trading days in the month
monthly_returns = panel_data.groupby(['permno', pd.Grouper(key='date', freq='ME')])['ret_daily'].apply(lambda x: (1 + x).prod() - 1)
# Convert the result to a DataFrame and name the column ret_monthly
monthly_returns = monthly_returns.to_frame(name='ret_monthly')

In [11]:
# Merge the monthly returns back into our main panel
# join takes column from the left df and pairs with index of the right df 
monthly_panel = monthly_panel.join(monthly_returns, on=['permno', 'date'])
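The compounding of daily returns into monthly returns can be verified by hand on a small example. The returns below are made up, and the grouping uses `dt.to_period("M")` as a compact stand-in for the `pd.Grouper` call.

```python
import pandas as pd

# Hypothetical daily returns for one firm across two months.
daily = pd.DataFrame({
    "permno": [1, 1, 1, 1],
    "date": pd.to_datetime(["2020-01-30", "2020-01-31", "2020-02-03", "2020-02-04"]),
    "ret_daily": [0.10, -0.10, 0.02, 0.03],
})

# Compound gross returns within each (firm, month) group, then subtract 1.
month = daily["date"].dt.to_period("M").rename("month")
monthly = (1 + daily["ret_daily"]).groupby([daily["permno"], month]).prod() - 1
print(monthly)
```

January compounds to 1.10 x 0.90 - 1, about -1%, not the 0% that summing the daily returns would suggest. Note that `prod` skips missing values, so a month in which every daily return is missing compounds to 0.0; passing `min_count=1` to `prod` yields a missing value instead.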

In [13]:
monthly_panel.columns

Index(['permno', 'share_code', 'exchange_code', 'sic_x', 'prc', 'ret_daily',
       'shrout', 'vwretd', 'sprtrn', 'gvkey', 'mkt_cap', 'datadate', 'fyearq',
       'fqtr', 'tic', 'actq', 'atq', 'ceqq', 'cheq', 'cshoq', 'dlcq', 'dlttq',
       'dpq', 'ibq', 'lctq', 'ltq', 'oiadpq', 'pstkq', 'saleq', 'oancfy',
       'dvpspq', 'prccq', 'sic_y', 'ret_monthly'],
      dtype='object')

In [12]:
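# Forward-fill any fundamentals that might be missing for a month.
# Build the list of Compustat columns to forward-fill. Columns that also exist in
# CRSP (e.g. 'sic') were renamed with _x/_y suffixes by the merge, so selecting
# them by their original name raises KeyError: "Columns not found: 'sic'".
# Keep only the names that are actually present in the panel.
ffill_cols = [c for c in comp.columns.drop(['datadate', 'gvkey']) if c in monthly_panel.columns]
monthly_panel[ffill_cols] = monthly_panel.groupby('permno')[ffill_cols].ffill()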
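The `'sic'` problem in the preceding cell is consistent with how pandas merges handle overlapping column names: any non-key column present in both frames is renamed with the default `_x` and `_y` suffixes. A toy sketch with made-up values:

```python
import pandas as pd

# Both hypothetical frames carry a 'sic' column in addition to the merge key.
left = pd.DataFrame({"key": [1], "sic": [1000]})
right = pd.DataFrame({"key": [1], "sic": [2000], "atq": [50.0]})

merged = left.merge(right, on="key")
print(list(merged.columns))  # ['key', 'sic_x', 'sic_y', 'atq']

# Keep only the right-hand columns that kept their original name after the merge.
safe_cols = [c for c in right.columns.drop("key") if c in merged.columns]
print(safe_cols)  # ['atq']
```

Selecting `merged[["sic"]]` here would fail in the same way, since only `sic_x` and `sic_y` exist. Passing explicit `suffixes=` to the merge, or renaming the column in one frame beforehand, are alternative ways to keep the names predictable.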

In [18]:
# any entry with missing market cap or monthly returns will be dropped
monthly_panel.dropna(subset=['ret_monthly', 'mkt_cap'], inplace=True)

In [19]:
monthly_panel.columns

Index(['permno', 'hsiccd', 'prc', 'vol', 'ret_daily', 'shrout', 'gvkey',
       'mkt_cap', 'datadate', 'fyearq', 'fqtr', 'tic', 'actq', 'atq', 'ceqq',
       'cheq', 'cshoq', 'dlcq', 'dlttq', 'dpq', 'ibq', 'lctq', 'ltq', 'oiadpq',
       'pstkq', 'saleq', 'oancfy', 'dvpspq', 'prccq', 'sic', 'ret_monthly'],
      dtype='object')

### 6. Save the Final Panel Dataset


In [20]:
# --- Save the Final Dataset in Parquet Format ---

monthly_panel.to_parquet(OUTPUT_FILE)

print(f"Final merged panel data saved to {OUTPUT_FILE}")
print("Notebook 1 (Project Titan) is complete.")

Final merged panel data saved to data\panel_data.parquet
Notebook 1 (Project Titan) is complete.
