In [1]:
# import library
from library import *

# filter the warnings for clarity
import warnings
warnings.filterwarnings("ignore")

In this notebook, we show how to reconstruct the ECL benchmark dataset. The dataset combines three existing data sources: the Edgar corpus, Compustat, and the LoPucki bankruptcy research database. Because Compustat requires paid access, we cannot share the complete benchmark dataset. Instead, we provide the code needed to reconstruct it.

In the ECL csv-file (see the repository readme), each row corresponds to a 10-K filing. Each filing can be matched (1) with a document from the Edgar corpus through the ```filename``` variable and (2) with an entry from Compustat through the ```gvkey``` and ```datadate``` variables. See the repository readme for access to the Edgar corpus and Compustat.

In [3]:
# specify path
path_ECL = '../bankruptcy research data/ECL.csv' # change path to correct location

# read data 
dataset = pd.read_csv(path_ECL, index_col=0)
dataset.sample(5)

Unnamed: 0,cik,company,period_of_report,gvkey,datadate,filename,can_label,qualified,label,bankruptcy_prediction_split,bankruptcy_date_1,bankruptcy_date_2,bankruptcy_date_3,filing_date
219964,1447051.0,TERRITORIAL BANCORP INC.,2019-12-31,183247.0,31/12/2019,/2019/1447051_10K_2019_0001558370-20-002628.json,True,Yes,False,test,,,,2020-03-13
49126,50601.0,INNOVEX INC,2004-09-30,5973.0,30/09/2004,/2004/50601_10K_2004_0001206774-04-001710.json,True,No,False,out-of-scope,,,,2004-12-07
190534,1080034.0,VALUECLICK INC/CA,2007-12-31,133547.0,31/12/2007,/2007/1080034_10K_2007_0001047469-08-002074.json,True,Yes,False,train,,,,2008-02-29
50778,54473.0,KANSAS CITY LIFE INSURANCE CO,2011-12-31,6333.0,31/12/2011,/2011/54473_10K_2011_0001193125-12-089295.json,True,Yes,False,train,,,,2012-02-29
104975,810270.0,EXUS NETWORKS INC,2002-12-31,20216.0,31/12/2002,/2002/810270_10KSB_2002_0001172665-03-000124.json,True,No,False,out-of-scope,,,,2003-05-01


#### Match with Compustat through WRDS API

When working with the WRDS API for the Compustat data, the compustat_wrds() function can be used to match these records with the ECL csv-file. This function:
- reads the Compustat file from the API (we use the "comp_na_annual_all" library and the "funda" table)
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
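
The filter-and-match steps listed above can be sketched with plain pandas on toy data. This is only an illustration, not the actual implementation of compustat_wrds(): the helper name `match_compustat_sketch`, the screening values in `SCREEN` (commonly used Compustat defaults) and the toy rows are all assumptions. The day/month/year format for the ECL ```datadate``` follows the sample output shown earlier.

```python
import pandas as pd

# Assumed screening values (commonly used Compustat defaults, not confirmed by this notebook)
SCREEN = {"datafmt": "STD", "indfmt": "INDL", "consol": "C", "popsrc": "D"}


def match_compustat_sketch(ecl, funda, variables):
    """Hypothetical helper: screen Compustat rows, then left-join them onto ECL."""
    # keep only the Compustat rows that pass every screening variable
    mask = pd.Series(True, index=funda.index)
    for column, value in SCREEN.items():
        mask &= funda[column] == value
    funda = funda.loc[mask, ["gvkey", "datadate"] + variables].copy()

    # ECL stores datadate as day/month/year strings (e.g. 31/12/2019)
    ecl = ecl.copy()
    ecl["datadate"] = pd.to_datetime(ecl["datadate"], format="%d/%m/%Y")
    funda["datadate"] = pd.to_datetime(funda["datadate"])

    # a left join keeps every ECL filing, whether or not Compustat has a match
    return ecl.merge(funda, on=["gvkey", "datadate"], how="left")


# toy data (made-up values, for illustration only)
ecl = pd.DataFrame({"gvkey": [1.0, 2.0], "datadate": ["30/09/2004", "31/12/2011"]})
funda = pd.DataFrame({
    "gvkey": [1.0, 1.0],
    "datadate": ["2004-09-30", "2004-09-30"],
    "datafmt": ["STD", "SUMM_STD"],
    "indfmt": ["INDL", "INDL"],
    "consol": ["C", "C"],
    "popsrc": ["D", "D"],
    "ch": [12.5, 99.0],
})
matched = match_compustat_sketch(ecl, funda, ["ch"])
```

In this toy example the second Compustat row is dropped by the screen, so the first filing receives a single match and the second filing keeps a missing value for ```ch```.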

In [None]:
# load data through API
username = ''
db = wrds.Connection(wrds_username=username)

# select desired variables
variables = 'ch, dt, act'

# match datasets
dataset = compustat_wrds(variables, dataset, db)

In [5]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
106764,1002517.0,SCANSOFT INC,2002-12-31,18.853,0.0,40.429
82836,1672013.0,ACUSHNET HOLDINGS CORP.,2019-12-31,34.184,393.682,742.818
168089,1403528.0,"OAKTREE CAPITAL GROUP, LLC",2013-12-31,,579.464,
147766,1174922.0,WYNN RESORTS LTD,2014-12-31,2182.164,7345.262,2782.331
62673,1339256.0,"HOT MAMAS FOODS, INC.",2013-12-31,0.15,2.331,3.827


#### Match with local Compustat file

When working with a local copy of the Compustat data, the compustat_local() function can be used to match these records with the ECL csv-file. This function:
- reads the local Compustat file
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
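
A local Compustat export can be large, so one possible way to read and screen it is in chunks. The sketch below is an assumption about how this could be done, not a description of compustat_local(): the helper name `read_screened_sketch`, the screening values in `SCREEN`, the chunk size and the toy CSV content are all hypothetical.

```python
import io

import pandas as pd

# Assumed screening values (commonly used Compustat defaults, not confirmed by this notebook)
SCREEN = {"datafmt": "STD", "indfmt": "INDL", "consol": "C", "popsrc": "D"}


def read_screened_sketch(source, variables, chunksize=100_000):
    """Hypothetical helper: read a Compustat CSV in chunks and keep screened rows."""
    keys = ["gvkey", "datadate"]
    columns = keys + list(SCREEN) + variables
    kept = []
    # usecols limits memory use by loading only the columns that are needed
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        mask = pd.Series(True, index=chunk.index)
        for column, value in SCREEN.items():
            mask &= chunk[column] == value
        kept.append(chunk.loc[mask, keys + variables])
    return pd.concat(kept, ignore_index=True)


# toy file content (made-up values, for illustration only)
toy_csv = io.StringIO(
    "gvkey,datadate,datafmt,indfmt,consol,popsrc,ch,act\n"
    "1,2004-09-30,STD,INDL,C,D,12.5,40.0\n"
    "1,2004-09-30,SUMM_STD,INDL,C,D,99.0,80.0\n"
    "2,2011-12-31,STD,FS,C,D,7.0,9.0\n"
)
screened = read_screened_sketch(toy_csv, ["ch"], chunksize=2)
```

The screened frame can then be joined to the ECL data on ```gvkey``` and ```datadate```, as described in the list above.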

In [None]:
# load data and match datasets
path = '../bankruptcy research data/Compustat/data.csv' # change path to correct location
dataset = compustat_local(path, dataset, update=False)

In [4]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
100975,1731289.0,NIKOLA CORP,2021-12-31,497.241,27.925,524.729
29804,278130.0,PIER 1 IMPORTS INC/DE,2006-02-25,246.115,184.0,774.923
126625,1053369.0,ELITE PHARMACEUTICALS INC /DE/,2003-03-31,3.264,2.934,3.5
142106,1120295.0,IXIA,2007-12-31,188.892,0.0,248.025
166253,1564822.0,PINNACLE FOODS INC.,2015-12-27,180.549,2287.779,857.634


#### Match with Edgar corpus

The ECL data can be matched with a document in the Edgar corpus through the ```filename``` variable.
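
One detail worth noting is that the ```filename``` values start with a slash (for example `/2019/...`). Plain string concatenation with the corpus path works, but `os.path.join` or `pathlib` would treat such a value as an absolute path and discard the corpus path. A minimal sketch of a loader that handles this, assuming a hypothetical helper name `load_filing_sketch` and a toy corpus in a temporary directory:

```python
import json
import tempfile
from pathlib import Path


def load_filing_sketch(corpus_path, filename):
    """Hypothetical helper: resolve an ECL filename inside the corpus and load it."""
    # strip the leading "/" so the filename is treated as relative to corpus_path
    path = Path(corpus_path) / filename.lstrip("/")
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


# toy corpus with a single made-up document (for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "2004" / "toy_10K_2004.json"
    target.parent.mkdir(parents=True)
    target.write_text(json.dumps({"item_7": "toy text"}), encoding="utf-8")
    filing = load_filing_sketch(tmp, "/2004/toy_10K_2004.json")
```

The returned dictionary is keyed by section, so `filing['item_7']` holds the text of that item in the toy document.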

In [8]:
# get filename for example company
corpus_path = '../bankruptcy research data/original_corpus'
file = dataset.loc[dataset['cik'] == 1318605, 'filename'].iloc[-1]

# read
with open(corpus_path + file) as fp:
    text = json.load(fp)

In [9]:
# inspect
print(text['item_7'][802:1700] + '...')

Our mission is to accelerate the world’s transition to sustainable energy. We design, develop, manufacture, lease and sell high-performance fully electric vehicles, solar energy generation systems and energy storage products. We also offer maintenance, installation, operation, financial and other services related to our products. Additionally, we are increasingly focused on products and services based on artificial intelligence, robotics and automation.
In 2022, we produced 1,369,611 consumer vehicles and delivered 1,313,851 consumer vehicles, despite ongoing supply chain and logistics challenges and factory shutdowns. We are currently focused on increasing vehicle production, capacity and delivery capabilities, improving and developing battery technologies, improving our FSD capabilities, increasing the affordability and efficiency of our vehicles, bringing new products to market and ...
