In [1]:
# import
from utils import *

# filter the warnings for clarity
import warnings
warnings.filterwarnings("ignore")

In this notebook, we show how to reconstruct the ECL benchmark dataset. This dataset combines three existing data sources: the Edgar corpus, Compustat and the LoPucki Bankruptcy Research Database. Because Compustat requires paid access, we cannot share the complete benchmark dataset. Instead, we provide the code needed to reconstruct it.

In the ECL CSV file, each row corresponds to a 10-K filing. Each filing can be matched (1) with a document from the Edgar corpus through the ```filename``` variable and (2) with an entry from Compustat through the ```gvkey``` and ```datadate``` variables. See the repository ```README.md``` for access to the data sources.

In [2]:
# specify path
path_ECL = '../bankruptcy research data/ECL.csv' # change path to correct location

# read data 
dataset = pd.read_csv(path_ECL, index_col=0)
dataset.sample(5)

Unnamed: 0,cik,company,period_of_report,gvkey,datadate,filename,can_label,qualified,label,bankruptcy_prediction_split,bankruptcy_date_1,bankruptcy_date_2,bankruptcy_date_3,filing_date
79825,806388.0,NICHOLS RESEARCH CORP /AL/,1997-08-31,13096.0,31/08/1997,/1997/806388_10K_1997_0000950144-97-012891.json,True,Yes,False,train,,,,1997-11-28
170650,1020910.0,TRANSCEND THERAPEUTICS INC,1998-12-31,65033.0,31/12/1998,/1998/1020910_10K_1998_0000927016-99-001259.json,True,No,False,out-of-scope,,,,1999-03-31
205402,1123312.0,YASHENG GROUP,2009-12-31,160362.0,31/12/2009,/2009/1123312_10K_2009_0001199835-10-000484.json,True,Yes,False,train,,,,2010-08-06
342381,865436.0,WHOLE FOODS MARKET INC,2000-09-24,24893.0,30/09/2000,/2000/865436_10K_2000_0000927356-00-002262.json,True,Yes,False,train,,,,2000-12-22
148093,934739.0,WELLS FINANCIAL CORP,2001-12-31,31716.0,31/12/2001,/2001/934739_10KSB_2001_0000946275-02-000206.json,True,Yes,False,train,,,,2002-03-28


#### Match with Compustat through WRDS API

When accessing Compustat through the WRDS API, use the ```compustat_wrds()``` function to match the Compustat records with the ECL CSV file. This function:
- reads the Compustat file from the API (we use the ```comp_na_annual_all``` library and the ```funda``` table)
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
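
The read and filter steps listed above could be expressed as a single SQL query. The sketch below is not the implementation of ```compustat_wrds()```, which lives in ```utils.py``` and is not shown here. The helper name ```build_funda_query``` is hypothetical, and the screening values (```STD```, ```INDL```, ```C```, ```D```) are assumed to be the commonly used Compustat defaults rather than values confirmed by this notebook.

```python
def build_funda_query(variables: str) -> str:
    """Build a SQL query for the Compustat annual fundamentals table.

    `variables` is a comma-separated string of Compustat items, e.g. 'ch, dt, act'.
    """
    # Normalise the requested column names
    columns = [v.strip() for v in variables.split(",") if v.strip()]
    # Always select the two matching keys alongside the requested items
    select = ", ".join(["gvkey", "datadate"] + columns)
    # Screening values are ASSUMED (commonly used defaults), not taken from utils.py
    return (
        f"SELECT {select} "
        "FROM comp_na_annual_all.funda "
        "WHERE datafmt = 'STD' AND indfmt = 'INDL' "
        "AND consol = 'C' AND popsrc = 'D'"
    )

query = build_funda_query("ch, dt, act")
print(query)

# With an open connection, the query could then be run as:
# compustat = db.raw_sql(query, date_cols=["datadate"])
```

Filtering inside the query keeps the download small, since only the screened rows and the requested columns leave the server.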

In [None]:
# connect to the WRDS API
username = ''  # fill in your WRDS username
db = wrds.Connection(wrds_username=username)

# select desired variables
variables = 'ch, dt, act'

# match datasets
dataset = compustat_wrds(variables, dataset, db)

In [4]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
151340,899460.0,MANNKIND CORP,2014-12-31,120.841,148.876,201.153
99758,54507.0,WESTAR ENERGY INC /KS,2017-12-31,3.432,,727.05
165235,927472.0,PARABEL INC.,2011-12-31,8.842,0.0,9.812
22415,60751.0,LUBRIZOL CORP,2007-12-31,502.3,1428.8,1847.3
69555,1002902.0,UNITED SHIPPING & TECHNOLOGY INC,2000-07-01,3.993,46.662,76.136


#### Match with local Compustat file

When working with a local copy of the Compustat data, use the ```compustat_local()``` function to match the Compustat records with the ECL CSV file. This function:
- reads the local Compustat file
- filters the Compustat file on screening variables (```datafmt```, ```indfmt```, ```consol``` and ```popsrc```)
- matches the datasets on the ```gvkey``` and ```datadate``` variables
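
To make the filter and match steps concrete, here is a minimal pandas sketch on toy data. It is an illustration under stated assumptions, not the code of ```compustat_local()```: the toy values are made up, the screening values are assumed to be the commonly used Compustat defaults, and the date formats are assumed from the sample rows shown earlier (day/month/year in the ECL file).

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration)
ecl = pd.DataFrame({
    "company": ["A CORP", "B INC"],
    "gvkey": [1001.0, 1002.0],
    "datadate": ["31/12/2001", "30/09/2000"],  # day/month/year, as in ECL.csv
})
compustat = pd.DataFrame({
    "gvkey": [1001, 1001, 1002],
    "datadate": ["2001-12-31", "2001-12-31", "2000-09-30"],
    "datafmt": ["STD", "SUMM_STD", "STD"],
    "indfmt": ["INDL", "INDL", "INDL"],
    "consol": ["C", "C", "C"],
    "popsrc": ["D", "D", "D"],
    "act": [524.0, 999.0, 76.1],
})

# 1. Keep one record per firm-year using the screening variables (ASSUMED values)
screen = (
    (compustat["datafmt"] == "STD")
    & (compustat["indfmt"] == "INDL")
    & (compustat["consol"] == "C")
    & (compustat["popsrc"] == "D")
)
compustat = compustat.loc[screen, ["gvkey", "datadate", "act"]].copy()

# 2. Bring both keys to a common type before merging
ecl["datadate"] = pd.to_datetime(ecl["datadate"], format="%d/%m/%Y")
compustat["datadate"] = pd.to_datetime(compustat["datadate"], format="%Y-%m-%d")
ecl["gvkey"] = ecl["gvkey"].astype("Int64")
compustat["gvkey"] = compustat["gvkey"].astype("Int64")

# 3. Left join so that every ECL row is kept, matched or not
merged = ecl.merge(compustat, on=["gvkey", "datadate"], how="left", validate="m:1")
print(merged)
```

The ```validate="m:1"``` argument makes the merge fail loudly if the screening step leaves more than one Compustat record per ```gvkey``` and ```datadate``` pair, which would otherwise silently duplicate ECL rows.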

In [3]:
# load data and match datasets
path = '../bankruptcy research data/Compustat/data.csv' # change path to correct location
dataset = compustat_local(path, dataset, update=False)

In [5]:
# inspect
dataset.sample(5)[['cik', 'company', 'period_of_report', 'ch', 'dt', 'act']]

Unnamed: 0,cik,company,period_of_report,ch,dt,act
89090,910073.0,NEW YORK COMMUNITY BANCORP INC,2013-12-31,644.55,,
126645,1088213.0,EMERGENCY FILTRATION PRODUCTS INC/ NV,1999-12-31,0.0,0.0,0.102
4524,1040441.0,BEVERLY ENTERPRISES INC,2001-12-31,89.343,741.673,524.048
156785,1537667.0,SELECT INCOME REIT,2014-12-31,13.504,445.816,
117600,1026650.0,ERESEARCHTECHNOLOGY INC /DE/,2007-12-31,38.082,1.145,78.328


#### Match with Edgar corpus

The ECL data can be matched with a document in the Edgar corpus through the ```filename``` variable.
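
One detail worth noting: the ```filename``` values in the sample rows start with a slash (for example ```/1997/...```). Plain string concatenation with the corpus path therefore works, but ```os.path.join``` would discard the corpus path when given a component with a leading slash. The sketch below shows one way to handle this; ```load_filing``` is a hypothetical helper, not part of ```utils.py```, and the demo corpus is a temporary, made-up directory.

```python
import json
import os
import tempfile

def load_filing(corpus_path: str, filename: str) -> dict:
    """Load one filing from the corpus, given an ECL `filename` value (hypothetical helper)."""
    # Strip the leading slash: os.path.join would otherwise drop corpus_path
    full_path = os.path.join(corpus_path, filename.lstrip("/"))
    with open(full_path, encoding="utf-8") as fp:
        return json.load(fp)

# Self-contained demo with a temporary, made-up corpus
with tempfile.TemporaryDirectory() as corpus_path:
    os.makedirs(os.path.join(corpus_path, "2000"))
    with open(os.path.join(corpus_path, "2000", "example.json"), "w", encoding="utf-8") as fp:
        json.dump({"item_7": "Management's discussion ..."}, fp)

    filing = load_filing(corpus_path, "/2000/example.json")
    print(filing["item_7"])
```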

In [6]:
# get filename for example company
corpus_path = '../bankruptcy research data/original_corpus'
file = dataset.loc[dataset['cik'] == 1318605, 'filename'].iloc[-1]

# read
with open(corpus_path + file) as fp:
    text = json.load(fp)

In [7]:
# inspect
print(text['item_7'][802:1700] + '...')

Our mission is to accelerate the world’s transition to sustainable energy. We design, develop, manufacture, lease and sell high-performance fully electric vehicles, solar energy generation systems and energy storage products. We also offer maintenance, installation, operation, financial and other services related to our products. Additionally, we are increasingly focused on products and services based on artificial intelligence, robotics and automation.
In 2022, we produced 1,369,611 consumer vehicles and delivered 1,313,851 consumer vehicles, despite ongoing supply chain and logistics challenges and factory shutdowns. We are currently focused on increasing vehicle production, capacity and delivery capabilities, improving and developing battery technologies, improving our FSD capabilities, increasing the affordability and efficiency of our vehicles, bringing new products to market and ...
