In [1]:
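# import helper functions (compustat_wrds, compustat_local)
from utils import *

# explicit imports for modules used directly in this notebook
import json
import pandas as pd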

# filter the warnings for clarity
import warnings
warnings.filterwarnings("ignore")

In this notebook, we show how to reconstruct the ECL benchmark dataset. This dataset combines three existing data sources: the Edgar corpus, Compustat and the LoPucki bankruptcy research database. Because Compustat requires paid access, we cannot share the complete benchmark dataset. Instead, we provide the code needed to reconstruct it.

In the ECL CSV file, each row corresponds to a 10K filing. Each filing can be matched (1) with a document from the Edgar corpus through the ```filename``` variable and (2) with an entry from Compustat through the ```gvkey``` and ```datadate``` variables. See the repository ```README.md``` for access to the data sources.

In [2]:
# specify path
path_ECL = '../bankruptcy research data/ECL.csv' # change path to correct location

# read data 
dataset = pd.read_csv(path_ECL, index_col=0)
dataset.sample(5)

Unnamed: 0,cik,company,period_of_report,gvkey,datadate,filename,can_label,qualified,label,bankruptcy_prediction_split,bankruptcy_date_1,bankruptcy_date_2,bankruptcy_date_3,filing_date
193129,930095.0,EDEN BIOSCIENCE CORP,2008-12-31,140024.0,31/12/2008,/2008/930095_10K_2008_0001145443-09-000625.json,True,No,False,out-of-scope,,,,2009-03-27
115214,876437.0,MGIC INVESTMENT CORP,2007-12-31,24379.0,31/12/2007,/2007/876437_10K_2007_0000897069-08-000525.json,True,Yes,False,train,,,,2008-02-29
53488,62709.0,"MARSH & MCLENNAN COMPANIES, INC.",2021-12-31,7065.0,31/12/2021,/2021/62709_10K_2021_0000062709-22-000009.json,True,out-of-period,False,out-of-scope,,,,2022-02-16
209392,1321741.0,GLADSTONE INVESTMENT CORPORATION\DE,2009-03-31,164155.0,31/03/2009,/2009/1321741_10K_2009_0001104659-09-036093.json,True,Yes,False,train,,,,2009-06-02
138159,912967.0,FIRST BANCSHARES INC /MO/,2003-06-30,29466.0,30/06/2003,/2003/912967_10KSB_2003_0000912967-03-000005.json,True,Yes,False,train,,,,2003-09-30


#### Match with Compustat through WRDS API

When accessing Compustat through the WRDS API, use the ```compustat_wrds()``` function to match the Compustat records with the ECL CSV file. This function:
- reads the Compustat file from the API (we use the ```comp_na_annual_all``` library and the ```funda``` table)
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
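The steps above could be sketched roughly as follows. This is not the implementation of ```compustat_wrds()``` in ```utils```; it is a minimal illustration of how the query might be built. The helper name ```build_funda_query``` is hypothetical, and the screening values (```INDL```, ```STD```, ```D```, ```C```) are assumptions based on commonly used Compustat defaults, so check them against your own setup.

```python
def build_funda_query(variables, library="comp_na_annual_all", table="funda"):
    """Build a SQL query for the requested Compustat variables (hypothetical helper)."""
    # 'ch, dt, act' -> ['ch', 'dt', 'act']
    columns = [v.strip() for v in variables.split(",") if v.strip()]
    # Always select the two matching keys alongside the requested variables.
    select = ", ".join(["gvkey", "datadate"] + columns)
    # Assumed screening values; they are not taken from utils.
    screens = {"indfmt": "INDL", "datafmt": "STD", "popsrc": "D", "consol": "C"}
    where = " AND ".join(f"{key} = '{value}'" for key, value in screens.items())
    return f"SELECT {select} FROM {library}.{table} WHERE {where}"


query = build_funda_query("ch, dt, act")
print(query)

# With an open connection, the query could then be run with
# db.raw_sql(query, date_cols=["datadate"]) and the result merged
# with the ECL data on gvkey and datadate.
```

Filtering on the server side keeps the download small, since only the requested columns and screened rows are transferred.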

In [None]:
# load data through API
username = '' # fill in your WRDS username
db = wrds.Connection(wrds_username=username)

# select desired variables (Compustat mnemonics: ch = cash, dt = total debt, act = total current assets)
variables = 'ch, dt, act'

# match datasets
dataset = compustat_wrds(variables, dataset, db)

In [4]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
52545,768251.0,ALTERA CORP,2006-12-29,738.412,3.927,1734.552
106982,1000683.0,BLONDER TONGUE LABORATORIES INC,2019-12-31,0.572,3.399,12.174
998,745543.0,"ALL STATE PROPERTIES HOLDINGS, INC.",2015-06-30,0.0,0.0,0.0
135142,1085634.0,VOYAGER NET INC,1999-12-31,18.063,23.892,24.517
65157,1436229.0,BTCS INC.,2021-12-31,1.401,0.0,5.467


#### Match with local Compustat file

When working with a local copy of the Compustat data, use the ```compustat_local()``` function to match the Compustat records with the ECL CSV file. This function:
- reads the local Compustat file
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
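To make the filter-and-match logic concrete, here is a small pandas sketch on made-up data. It is not the code behind ```compustat_local()```; the toy values, the screening values and the assumed date format of the Compustat extract are all illustrative assumptions. The one detail taken from the ECL file itself is that ```datadate``` is stored as day/month/year and ```gvkey``` as a float.

```python
import pandas as pd

# Toy stand-ins for the ECL file and a local Compustat extract (made-up values).
ecl = pd.DataFrame({
    "gvkey": [1001.0, 1002.0],
    "datadate": ["31/12/2008", "30/06/2003"],  # day/month/year, as in ECL.csv
})
compustat = pd.DataFrame({
    "gvkey": [1001, 1001, 1002],
    "datadate": ["2008-12-31", "2008-12-31", "2003-06-30"],  # assumed format
    "indfmt": ["INDL", "FS", "INDL"],
    "datafmt": ["STD", "STD", "STD"],
    "popsrc": ["D", "D", "D"],
    "consol": ["C", "C", "C"],
    "act": [10.0, 99.0, 2.5],
})

# Filter on the screening variables so that one record per firm-year remains
# (the values below are assumptions, not taken from utils).
screens = {"indfmt": "INDL", "datafmt": "STD", "popsrc": "D", "consol": "C"}
mask = pd.Series(True, index=compustat.index)
for column, value in screens.items():
    mask &= compustat[column] == value
compustat = compustat[mask].drop(columns=list(screens))

# Bring both keys to the same dtype before merging.
ecl["datadate"] = pd.to_datetime(ecl["datadate"], format="%d/%m/%Y")
compustat["datadate"] = pd.to_datetime(compustat["datadate"], format="%Y-%m-%d")
ecl["gvkey"] = ecl["gvkey"].astype("Int64")
compustat["gvkey"] = compustat["gvkey"].astype("Int64")

# A left join keeps every 10K filing, whether or not it has a Compustat match.
matched = ecl.merge(compustat, on=["gvkey", "datadate"], how="left")
print(matched)
```

Without the screening step, the firm-year with two Compustat records would produce a duplicated filing row after the merge.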

In [3]:
# load data and match datasets
path = '../bankruptcy research data/Compustat/data.csv' # change path to correct location
dataset = compustat_local(path, dataset, update=False)

In [5]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
27730,73986.0,OHIO POWER CO,2012-12-31,3.64,,1103.705
83749,819050.0,VICAL INC,2016-12-31,5.069,0.0,47.867
42986,895421.0,MORGAN STANLEY,2013-12-31,59883.0,153575.0,
55408,814184.0,TCF FINANCIAL CORP,2014-12-31,144.892,,
73502,844161.0,CHEROKEE INC,1996-06-01,1.207,,2.477


#### Match with Edgar corpus

The ECL data can be matched with a document in the Edgar corpus through the ```filename``` variable.

In [6]:
# get the filename of the last listed filing for an example company (CIK 1318605, Tesla)
corpus_path = '../bankruptcy research data/original_corpus'
file = dataset.loc[dataset['cik'] == 1318605, 'filename'].iloc[-1]

# read
with open(corpus_path + file) as fp:
    text = json.load(fp)

In [7]:
# inspect
print(text['item_7'][802:1700] + '...')

Our mission is to accelerate the world’s transition to sustainable energy. We design, develop, manufacture, lease and sell high-performance fully electric vehicles, solar energy generation systems and energy storage products. We also offer maintenance, installation, operation, financial and other services related to our products. Additionally, we are increasingly focused on products and services based on artificial intelligence, robotics and automation.
In 2022, we produced 1,369,611 consumer vehicles and delivered 1,313,851 consumer vehicles, despite ongoing supply chain and logistics challenges and factory shutdowns. We are currently focused on increasing vehicle production, capacity and delivery capabilities, improving and developing battery technologies, improving our FSD capabilities, increasing the affordability and efficiency of our vehicles, bringing new products to market and ...
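The lookup above can be wrapped in a small helper. The sketch below is an illustration only: ```load_filing``` is a hypothetical name, the demo filing is made up, and it assumes that each corpus file is a JSON object whose values are strings. Note that ECL filenames start with ```/```, which is why plain string concatenation works in the cell above, while a path join needs the leading slash removed first.

```python
import json
import tempfile
from pathlib import Path


def load_filing(corpus_path, filename):
    """Read one filing from the corpus (hypothetical helper)."""
    # ECL filenames start with '/', so strip it before joining; otherwise
    # pathlib treats the filename as an absolute path and drops corpus_path.
    path = Path(corpus_path) / filename.lstrip("/")
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# Demo on a made-up filing written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_name = "/2022/0000000_10K_2022_demo.json"
    demo_path = Path(tmp) / demo_name.lstrip("/")
    demo_path.parent.mkdir(parents=True)
    demo_content = {"item_1": "Business ...", "item_7": "MD&A ..."}
    demo_path.write_text(json.dumps(demo_content), encoding="utf-8")
    filing = load_filing(tmp, demo_name)

# List the available sections and their lengths in characters.
section_lengths = {key: len(value) for key, value in filing.items()}
print(section_lengths)
```

Checking the section lengths first is a quick way to spot filings where an item such as ```item_7``` is empty before slicing into the text.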
