# Table 1 Replication - Leverage Analysis (1965-2003)

This notebook replicates **Table 1** from the paper, showing descriptive statistics for **All Firms** and **Survivors** using Compustat data.

---


In [9]:
# STEP 0 — Environment Setup
# Install required packages: wrds, pandas, numpy, scipy, jupyter, ipykernel
%pip install -r requirements.txt


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import required libraries
import wrds
import pandas as pd
import numpy as np
from scipy import stats

# Display settings for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 3)
pd.set_option('display.width', 120)


## STEP 1 — Connect to WRDS

Establish connection to WRDS database. You'll be prompted for username and password.


In [3]:
# Connect to WRDS (enter credentials when prompted)
db = wrds.Connection()


WRDS recommends setting up a .pgpass file.
Created .pgpass file successfully.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


In [4]:
# Sanity check: verify 'comp' library is available
libraries = db.list_libraries()
print(f"Available libraries: {len(libraries)}")
print(f"'comp' available: {'comp' in libraries}")


Available libraries: 221
'comp' available: True


## STEP 2 — Pull Compustat Data

Pull annual Compustat data matching Section I of the paper:
- Nonfinancial firms (the query in this step applies no SIC filter, so financial firms remain in the download)
- 1965–2003
- Consolidated, domestic, INDL format
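
A minimal sketch of how financial firms could be screened out once the data are loaded. It assumes that "financial" means SIC codes 6000-6999, a common convention that this notebook does not confirm as the paper's exact rule. The helper names `is_financial` and `drop_financials` are hypothetical, and rows with a missing SIC code are kept because they cannot be classified.

```python
import math

# Assumption: SIC 6000-6999 marks financial firms (a common convention,
# not confirmed here as the paper's exact definition).
FINANCIAL_SIC = range(6000, 7000)


def is_financial(sic):
    """Return True if the SIC code lies in the assumed financial range."""
    if sic is None or (isinstance(sic, float) and math.isnan(sic)):
        return False  # missing SIC: cannot classify, so the row is kept
    return int(sic) in FINANCIAL_SIC


def drop_financials(rows):
    """Keep only rows whose 'sich' value is outside the financial range."""
    return [row for row in rows if not is_financial(row.get("sich"))]


# Toy rows (made-up firms, not Compustat data)
sample = [
    {"gvkey": "A", "sich": 3571},   # manufacturing: kept
    {"gvkey": "B", "sich": 6021},   # bank: dropped
    {"gvkey": "C", "sich": None},   # missing SIC: kept
]
kept = drop_financials(sample)
print([row["gvkey"] for row in kept])
```

On the notebook's DataFrame the same rule would amount to a boolean mask such as `df[~df['sich'].between(6000, 6999)]`, which also keeps rows with a missing `sich`.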


In [5]:
# STEP 2 — Pull Compustat Data
# Pull annual Compustat data for Section I's sample: 1965-2003, consolidated, domestic, INDL format
# (no SIC filter is applied in this query)
import os

RAW_PATH = 'data/01_raw_data.csv'

sql = """
SELECT
    gvkey,
    datadate,
    fyear,
    sich,      -- SIC code (historical SIC code column in funda table)
    at,        -- total assets
    dlc,       -- short-term debt
    dltt,      -- long-term debt
    sale,
    oibdp,
    ppent,
    prcc_f,
    csho,
    pstkl,
    txditc,
    intan,
    dvc
FROM comp.funda
WHERE indfmt = 'INDL'
  AND datafmt = 'STD'
  AND popsrc = 'D'
  AND consol = 'C'
  AND fyear BETWEEN 1965 AND 2003
"""

# Load the cached extract if it exists; download from WRDS only when it does not
if os.path.exists(RAW_PATH):
    print(f"Loading existing raw data from {RAW_PATH}")
    # Read gvkey as text so leading zeros survive the CSV round trip
    df = pd.read_csv(RAW_PATH, parse_dates=['datadate'], dtype={'gvkey': str})
    print(f"Loaded {len(df):,} observations")
else:
    print("No existing data found, downloading from WRDS...")
    print("Downloading data from WRDS... (this may take a few minutes)")
    df = db.raw_sql(sql, date_cols=['datadate'])
    print(f"Downloaded {len(df):,} observations")

print(f"Unique firms (gvkeys): {df['gvkey'].nunique():,}")


No existing data found, downloading from WRDS...
Downloading data from WRDS... (this may take a few minutes)
Downloaded 323,162 observations
Unique firms (gvkeys): 27,986


In [6]:
# Save raw downloaded data
import os
os.makedirs('data', exist_ok=True)
df.to_csv('data/01_raw_data.csv', index=False)
print(f"✓ Saved raw data: {len(df):,} observations to data/01_raw_data.csv")


✓ Saved raw data: 323,162 observations to data/01_raw_data.csv


In [7]:
# Preview the raw data
df.head()


   gvkey    datadate  fyear  sich     at    dlc     dltt   sale  oibdp  ppent  prcc_f   csho  pstkl  txditc  intan     dvc
0   1000  1965-12-31   1965   NaN   2.31    0.3    1.154  1.688  -0.16  1.397     NaN  0.206    0.0     0.0    0.0     0.0
1   1002  1965-12-31   1965   NaN    NaN    NaN      0.8   11.7    NaN    NaN     NaN  0.803    0.0     0.0    0.0     0.0
2   1004  1966-05-31   1965   NaN  2.519  0.347    0.153  3.821  0.706   0.41     NaN   0.42    0.0   0.015    0.0     0.0
3   1010  1966-04-30   1965   NaN  328.7    0.0   93.526  323.2  60.78  188.9   47.75  5.921    0.0  23.176    0.0  10.698
4   1040  1965-12-31   1965   NaN  451.9    6.5    186.0  385.8  62.38  124.8  19.625  17.15  5.745  19.957  8.838   15.44


## STEP 3 — Basic Cleaning

Apply the paper's cleaning rules:
1. Require non-missing, positive assets
2. Fill missing debt components with zero
3. Calculate total debt


In [8]:
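# Checkpoint save. As executed here (In [8], before the cleaning cell In [9]), the file holds
# the pre-cleaning rows; rerun this cell after the cleaning cell to save cleaned data.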
df.to_csv('data/02_cleaned_data.csv', index=False)
print(f"✓ Saved cleaned data: {len(df):,} observations to data/02_cleaned_data.csv")


✓ Saved cleaned data: 323,162 observations to data/02_cleaned_data.csv


In [9]:
print(f"Before cleaning: {len(df):,} observations")

# Require non-missing assets
df = df[df['at'].notna() & (df['at'] > 0)]
print(f"After asset filter: {len(df):,} observations")

# Replace missing debt components with 0
df['dlc'] = df['dlc'].fillna(0)
df['dltt'] = df['dltt'].fillna(0)

# Total debt
df['debt'] = df['dlc'] + df['dltt']

print(f"Debt computed for all {len(df):,} observations")


Before cleaning: 323,162 observations
After asset filter: 290,492 observations
Debt computed for all 290,492 observations


## STEP 4 — Construct Leverage Measures

Calculate book and market leverage as defined in the Appendix:
- **Book leverage** = Total Debt / Total Assets
- **Market leverage** = Total Debt / (Total Debt + Market Equity)
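
As a worked check of these two definitions, the sketch below recomputes both ratios in plain Python for the gvkey 1010 row shown in the preview of the raw data (fiscal year 1965). The function name `leverage_ratios` is hypothetical, and a missing debt component (`None`) is treated as zero, mirroring the cleaning step.

```python
def leverage_ratios(at, dlc, dltt, prcc_f, csho):
    """Return (book_lev, market_lev) from raw Compustat items.

    A missing debt component (None) is treated as zero, as in the cleaning step.
    """
    debt = (dlc or 0.0) + (dltt or 0.0)
    market_equity = prcc_f * csho          # share price times shares outstanding
    book_lev = debt / at
    market_lev = debt / (debt + market_equity)
    return book_lev, market_lev


# Values for gvkey 1010, fiscal year 1965, taken from the raw-data preview
book_lev, market_lev = leverage_ratios(
    at=328.7, dlc=0.0, dltt=93.526, prcc_f=47.75, csho=5.921
)
print(f"book leverage:   {book_lev:.3f}")
print(f"market leverage: {market_lev:.3f}")
```

Both ratios fall inside [0, 1], so this firm-year would pass the leverage filter applied in this step.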


In [10]:
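# Checkpoint save. As executed here (In [10], before the leverage cell In [11]), the file holds
# the data without leverage columns; rerun this cell after the leverage cell to include them.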
df.to_csv('data/03_leverage_data.csv', index=False)
print(f"✓ Saved leverage data: {len(df):,} observations to data/03_leverage_data.csv")


✓ Saved leverage data: 290,492 observations to data/03_leverage_data.csv


In [11]:
# Book leverage
df['book_lev'] = df['debt'] / df['at']

# Market equity
df['me'] = df['prcc_f'] * df['csho']

# Market leverage
df['market_lev'] = df['debt'] / (df['debt'] + df['me'])

print(f"Before leverage filter: {len(df):,} observations")

# Keep leverage in [0,1]; between() is False for NaN, so rows with missing market leverage are dropped too
df = df[
    (df['book_lev'].between(0, 1)) &
    (df['market_lev'].between(0, 1))
]

print(f"After leverage filter [0,1]: {len(df):,} observations")


Before leverage filter: 290,492 observations
After leverage filter [0,1]: 225,177 observations


## STEP 5 — Construct Table 1 Variables

Create all variables that appear in Table 1:
- Log sales (firm size)
- Market-to-book ratio
- Profitability
- Tangibility
- Intangibles
- Dividend payer dummy
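
A small worked example of these variables in plain Python, using the gvkey 1040 row from the raw-data preview (fiscal year 1965). The formulas mirror the ones in this step's construction cell; the function name `table1_ratios` is hypothetical, and missing preferred stock, deferred taxes, and dividends are treated as zero.

```python
import math


def table1_ratios(row):
    """Compute the firm-level Table 1 variables from one row of raw items."""
    debt = (row.get("dlc") or 0.0) + (row.get("dltt") or 0.0)
    me = row["prcc_f"] * row["csho"]                      # market equity
    mtb = (me + debt + (row.get("pstkl") or 0.0)
           - (row.get("txditc") or 0.0)) / row["at"]
    return {
        "log_sales": math.log(max(row["sale"], 1.0)),     # sales clipped at 1
        "mtb": mtb,
        "profitability": row["oibdp"] / row["at"],
        "tangibility": row["ppent"] / row["at"],
        "intangibles": row["intan"] / row["at"],
        "div_payer": int((row.get("dvc") or 0.0) > 0),
    }


# gvkey 1040, fiscal year 1965, values from the raw-data preview
row_1040 = {
    "at": 451.9, "dlc": 6.5, "dltt": 186.0, "sale": 385.8, "oibdp": 62.38,
    "ppent": 124.8, "prcc_f": 19.625, "csho": 17.15, "pstkl": 5.745,
    "txditc": 19.957, "intan": 8.838, "dvc": 15.44,
}
ratios = table1_ratios(row_1040)
for name, value in ratios.items():
    print(f"{name:>14}: {value:.3f}")
```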


In [12]:
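# Checkpoint save. As executed here (In [12], before the variable cell In [13]), the file holds
# the data without the Table 1 variables; rerun this cell after the variable cell to include them.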
df.to_csv('data/04_variables_data.csv', index=False)
print(f"✓ Saved variables data: {len(df):,} observations to data/04_variables_data.csv")


✓ Saved variables data: 225,177 observations to data/04_variables_data.csv


In [13]:
# Log sales (proxy for firm size); sales below 1, including zero and negative values,
# are clipped to 1, so log_sales is floored at 0
df['log_sales'] = np.log(df['sale'].clip(lower=1))

# Market-to-book ratio
df['mtb'] = (
    df['me']
    + df['debt']
    + df['pstkl'].fillna(0)
    - df['txditc'].fillna(0)
) / df['at']

# Profitability (EBITDA / Assets)
df['profitability'] = df['oibdp'] / df['at']

# Tangibility (PPE / Assets)
df['tangibility'] = df['ppent'] / df['at']

# Intangibles (Intangible Assets / Assets)
df['intangibles'] = df['intan'] / df['at']

# Dividend payer (binary: 1 if pays dividend, 0 otherwise)
df['div_payer'] = (df['dvc'].fillna(0) > 0).astype(int)

print("All Table 1 variables constructed")


All Table 1 variables constructed


## STEP 6 — Cash-Flow Volatility

Calculate the rolling 3-year standard deviation of operating income (oibdp, in levels rather than scaled by assets). A value is produced only once a firm has three observations in the window.
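
To make the rolling calculation concrete, here is a plain-Python sketch for a single firm. It uses the sample (n - 1) standard deviation, which is also the default of pandas `.std()`. The function name `rolling_std` and the income figures are made up for illustration. Note that an integer window counts rows, so if a firm has gaps in its fiscal years the window can span more than three calendar years.

```python
from statistics import stdev


def rolling_std(values, window=3):
    """Sample standard deviation over each trailing `window` of observations.

    Positions with fewer than `window` observations so far get None,
    which plays the role of min_periods=window.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(stdev(values[i + 1 - window:i + 1]))
    return out


# Toy operating income series for one firm (made-up numbers)
oibdp = [10.0, 12.0, 14.0, 20.0]
cf_vol = rolling_std(oibdp)
print(cf_vol)
```

The first two entries are missing because fewer than three observations are available, which is why the notebook reports volatility for fewer observations than the full sample.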


In [14]:
# Sort by firm and year
df = df.sort_values(['gvkey', 'fyear'])

# Rolling 3-year standard deviation of operating income
df['cf_vol'] = (
    df.groupby('gvkey')['oibdp']
      .rolling(window=3, min_periods=3)
      .std()
      .reset_index(level=0, drop=True)
)

print(f"Cash-flow volatility computed for {df['cf_vol'].notna().sum():,} observations")


Cash-flow volatility computed for 175,465 observations


In [15]:
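# Checkpoint save. As executed here (In [15], before the trimming cell In [19]), the file holds
# the untrimmed rows; rerun this cell after the trimming step to save trimmed data.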
df.to_csv('data/05_trimmed_data.csv', index=False)
print(f"✓ Saved trimmed data: {len(df):,} observations to data/05_trimmed_data.csv")


✓ Saved trimmed data: 225,177 observations to data/05_trimmed_data.csv


## STEP 7 — Industry Median Book Leverage

Calculate industry median leverage using 2-digit SIC codes.

**Note:** This uses a simple SIC-2 approach, which may differ from the paper's Fama-French 38 industry classification.
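
The grouping logic can be sketched in plain Python as follows. The function name `industry_year_median` and the rows are made up for illustration. The raw-data preview shows `sich` missing for the 1965 rows; this sketch simply skips rows without a SIC code, since they cannot be assigned to an industry.

```python
from collections import defaultdict
from statistics import median


def industry_year_median(rows):
    """Map (sic2, fyear) -> median book leverage, skipping rows with no SIC code."""
    groups = defaultdict(list)
    for row in rows:
        if row["sich"] is None:
            continue  # no SIC code: the row cannot be assigned to an industry
        groups[(row["sich"] // 100, row["fyear"])].append(row["book_lev"])
    return {key: median(vals) for key, vals in groups.items()}


# Toy rows (made-up values)
rows = [
    {"sich": 3571, "fyear": 2000, "book_lev": 0.10},
    {"sich": 3577, "fyear": 2000, "book_lev": 0.30},
    {"sich": 3590, "fyear": 2000, "book_lev": 0.20},
    {"sich": 2834, "fyear": 2000, "book_lev": 0.05},
    {"sich": None, "fyear": 2000, "book_lev": 0.50},
]
medians = industry_year_median(rows)
print(medians)
```

Each firm-year would then be assigned the median of its own (SIC-2, year) cell, which is what the `transform('median')` call does on the DataFrame.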


In [18]:
# STEP 7 — Industry Median Book Leverage
# Create 2-digit SIC code (using sich column from Compustat)
df['sic2'] = df['sich'] // 100

# Industry median leverage (by SIC-2 and year)
df['ind_med_lev'] = (
    df.groupby(['sic2', 'fyear'])['book_lev']
      .transform('median')
)

print("Industry median leverage computed")
print(f"Number of unique industries (SIC-2): {df['sic2'].nunique()}")


Industry median leverage computed
Number of unique industries (SIC-2): 73


## STEP 8 — Trim Ratios at 1% / 99%

Trim all ratio variables at the 1st and 99th percentiles (Section I of paper), dropping observations outside those bounds.
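
Trimming and winsorizing are easy to confuse, and the choice changes the observation count: trimming drops observations outside the percentile bounds, whereas winsorizing caps them at the bounds and keeps every row. The code in this step drops rows, so it trims. The plain-Python sketch below contrasts the two on made-up data; `quantile`, `trim`, and `winsorize` are hypothetical helpers, and the quantile uses linear interpolation between order statistics.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def trim(values, lower=0.01, upper=0.99):
    """Drop observations outside the [lower, upper] quantile bounds."""
    s = sorted(values)
    low, high = quantile(s, lower), quantile(s, upper)
    return [v for v in values if low <= v <= high]


def winsorize(values, lower=0.01, upper=0.99):
    """Cap observations at the quantile bounds instead of dropping them."""
    s = sorted(values)
    low, high = quantile(s, lower), quantile(s, upper)
    return [min(max(v, low), high) for v in values]


# Toy data: 99 ordinary values plus one extreme value at each end
data = [-50.0] + [float(i) for i in range(1, 100)] + [500.0]
trimmed = trim(data)
capped = winsorize(data)
print(len(data), len(trimmed), len(capped))
```

Because the notebook trims one variable after another, each pass removes rows from the sample that the next variable's percentiles are computed on, so the total number of dropped rows accumulates across variables.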


In [19]:
# STEP 8 — Trim Ratios at 1% / 99%
# Trim each ratio variable at its 1st and 99th percentiles, dropping rows outside the bounds (Section I of paper)
ratio_vars = [
    'book_lev', 'market_lev', 'log_sales', 'mtb', 'profitability',
    'tangibility', 'cf_vol', 'ind_med_lev', 'intangibles'
]

print(f"Before trimming: {len(df):,} observations")

for v in ratio_vars:
    # Only compute quantiles on non-NaN values
    non_null = df[v].dropna()
    if len(non_null) > 0:
        low, high = non_null.quantile([0.01, 0.99])
        before = len(df)
        # Filter: keep NaN values OR values between quantiles
        # (This preserves rows with missing data for some variables)
        df = df[(df[v].isna()) | (df[v].between(low, high, inclusive='both'))]
        after = len(df)
        print(f"  {v}: removed {before - after:,} obs (kept between {low:.3f} and {high:.3f})")
    else:
        print(f"  {v}: all NaN, skipping")

print(f"\nAfter trimming: {len(df):,} observations")


Before trimming: 225,177 observations
  book_lev: removed 2,252 obs (kept between 0.000 and 0.848)
  market_lev: removed 2,230 obs (kept between 0.000 and 0.924)
  log_sales: removed 2,194 obs (kept between 0.000 and 9.722)
  mtb: removed 4,370 obs (kept between 0.096 and 14.249)
  profitability: removed 4,204 obs (kept between -1.077 and 0.405)
  tangibility: removed 2,070 obs (kept between 0.000 and 0.932)
  cf_vol: removed 3,264 obs (kept between 0.065 and 324.961)
  ind_med_lev: removed 1,787 obs (kept between 0.028 and 0.568)
  intangibles: removed 1,810 obs (kept between 0.000 and 0.589)

After trimming: 200,996 observations


## STEP 9 — Define Survivors

**Survivors** are defined as firms with ≥20 years of book leverage data in the sample.
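
The survivor rule can be sketched in plain Python as a count-and-threshold step. The function name `find_survivors` and the toy panel are made up for illustration. In the notebook the count is taken after trimming, so a firm whose trimmed rows push it below 20 observations is not classified as a survivor.

```python
from collections import Counter


def find_survivors(firm_years, min_years=20):
    """Return the set of firms with at least `min_years` leverage observations.

    `firm_years` is an iterable of (gvkey, book_lev) pairs; pairs whose
    leverage is None are not counted.
    """
    counts = Counter(gvkey for gvkey, lev in firm_years if lev is not None)
    return {gvkey for gvkey, n in counts.items() if n >= min_years}


# Toy panel (made-up firms): "A" has 25 observations, "B" has 19, "C" has 20
panel = (
    [("A", 0.2)] * 25
    + [("B", 0.3)] * 19
    + [("C", 0.1)] * 20
)
survivors = find_survivors(panel)
print(sorted(survivors))
```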


In [22]:
# STEP 9 — Define Survivors
# Count years of book leverage data per firm
lev_counts = df.groupby('gvkey')['book_lev'].count()

# Survivors: firms with ≥20 years
survivors = lev_counts[lev_counts >= 20].index

total_firms = df['gvkey'].nunique()
print(f"Total unique firms: {total_firms:,}")
print(f"Survivors (≥20 years): {len(survivors):,}")
if total_firms > 0:
    print(f"Survivor rate: {100 * len(survivors) / total_firms:.1f}%")
else:
    print("Survivor rate: N/A (no firms in dataset)")

# Create two datasets
df_all = df.copy()
df_surv = df[df['gvkey'].isin(survivors)]

print(f"\nAll Firms dataset: {len(df_all):,} observations")
print(f"Survivors dataset: {len(df_surv):,} observations")


Total unique firms: 21,965
Survivors (≥20 years): 2,555
Survivor rate: 11.6%

All Firms dataset: 200,996 observations
Survivors dataset: 71,135 observations


## STEP 10 — Replicate Table 1 Statistics

Generate descriptive statistics (Mean, Median, SD) for both **All Firms** and **Survivors**.


In [23]:
vars_table1 = [
    'book_lev', 'market_lev', 'log_sales', 'mtb', 'profitability',
    'tangibility', 'cf_vol', 'ind_med_lev', 'div_payer', 'intangibles'
]

def summary_table(data):
    """Generate summary statistics table"""
    return pd.DataFrame({
        'Mean': data[vars_table1].mean(),
        'Median': data[vars_table1].median(),
        'SD': data[vars_table1].std()
    })

table_all = summary_table(df_all)
table_surv = summary_table(df_surv)


In [24]:
print("="*80)
print("TABLE 1 REPLICATION: ALL FIRMS")
print("="*80)
print(table_all)
print(f"\nNumber of observations: {len(df_all):,}")
print(f"Number of unique firms: {df_all['gvkey'].nunique():,}")


TABLE 1 REPLICATION: ALL FIRMS
                Mean  Median      SD
book_lev       0.239   0.215   0.195
market_lev     0.291   0.238   0.254
log_sales      4.342   4.337   2.116
mtb            1.381   0.925   1.487
profitability  0.073   0.107   0.179
tangibility    0.307    0.25    0.25
cf_vol         15.77   3.342  36.233
ind_med_lev    0.202   0.192   0.114
div_payer       0.46     0.0   0.498
intangibles    0.051   0.001     0.1

Number of observations: 200,996
Number of unique firms: 21,965


In [25]:
print("="*80)
print("TABLE 1 REPLICATION: SURVIVORS")
print("="*80)
print(table_surv)
print(f"\nNumber of observations: {len(df_surv):,}")
print(f"Number of unique firms: {df_surv['gvkey'].nunique():,}")


TABLE 1 REPLICATION: SURVIVORS
                 Mean  Median      SD
book_lev        0.255   0.243   0.175
market_lev      0.321    0.29   0.241
log_sales       5.285   5.306    1.91
mtb             1.105   0.851   0.954
profitability   0.126   0.129     0.1
tangibility     0.355   0.304   0.242
cf_vol         20.677   5.092  41.364
ind_med_lev     0.223   0.226   0.106
div_payer       0.683     1.0   0.465
intangibles     0.042   0.003   0.081

Number of observations: 71,135
Number of unique firms: 2,555


## STEP 11 — Comparison: All Firms vs Survivors

Compare the two groups side-by-side to highlight differences.


In [26]:
# Create side-by-side comparison
comparison = pd.DataFrame({
    'All_Mean': table_all['Mean'],
    'All_Median': table_all['Median'],
    'Surv_Mean': table_surv['Mean'],
    'Surv_Median': table_surv['Median'],
    'Diff_Mean': table_surv['Mean'] - table_all['Mean']
})

print("="*80)
print("COMPARISON: ALL FIRMS vs SURVIVORS")
print("="*80)
print(comparison)
print("\nKey Observations:")
print("- Survivors are LARGER (higher log_sales)")
print("- Survivors are MORE PROFITABLE (higher profitability)")
print("- Survivors are MORE TANGIBLE (higher tangibility)")
print("- Survivors have LOWER GROWTH (lower mtb)")
print("- Survivors are MORE LEVERED (higher book & market leverage)")


COMPARISON: ALL FIRMS vs SURVIVORS
               All_Mean  All_Median  Surv_Mean  Surv_Median  Diff_Mean
book_lev          0.239       0.215      0.255        0.243      0.016
market_lev        0.291       0.238      0.321         0.29       0.03
log_sales         4.342       4.337      5.285        5.306      0.943
mtb               1.381       0.925      1.105        0.851     -0.276
profitability     0.073       0.107      0.126        0.129      0.053
tangibility       0.307        0.25      0.355        0.304      0.048
cf_vol            15.77       3.342     20.677        5.092      4.907
ind_med_lev       0.202       0.192      0.223        0.226      0.022
div_payer          0.46         0.0      0.683          1.0      0.223
intangibles       0.051       0.001      0.042        0.003     -0.009

Key Observations:
- Survivors are LARGER (higher log_sales)
- Survivors are MORE PROFITABLE (higher profitability)
- Survivors are MORE TANGIBLE (higher tangibility)
- Survivors have LOWER GROWTH (lower mtb)
- Survivors are MORE LEVERED (higher book & market leverage)

## Export Results

Save the results to CSV files for further analysis or reporting.


In [27]:
# Export summary tables
table_all.to_csv('table1_all_firms.csv')
table_surv.to_csv('table1_survivors.csv')
comparison.to_csv('table1_comparison.csv')

# Export full datasets (optional)
# df_all.to_csv('data_all_firms.csv', index=False)
# df_surv.to_csv('data_survivors.csv', index=False)

print("Results exported to CSV files!")


Results exported to CSV files!


In [None]:
# Close the database connection
db.close()
print("WRDS connection closed.")
