# ETL | Public Companies Bankruptcy Cases Opened and Monitored 


##  Data Set Source 2009 - 2011

https://catalog.data.gov/dataset/public-company-bankruptcy-cases-opened-and-monitored

Data updated June 18, 2019



# Dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

# Extract

## Load & Read CSV

In [2]:
# Paths to the CSV files, one per fiscal year

data09 = "../bankruptcy_data/public_company_bankruptcy_cases_2009.csv"
data10 = "../bankruptcy_data/public_company_bankruptcy_cases_2010.csv"
data11 = "../bankruptcy_data/public_company_bankruptcy_cases_2011.csv"

In [3]:
# Read all data files and store into Pandas DataFrame

data09_df = pd.read_csv(data09)
data10_df = pd.read_csv(data10)
data11_df = pd.read_csv(data11)


# Transform 

## Add, drop, rearrange, rename columns

In [4]:
# Add fiscal year to each data frame

data09_df['FISCAL_YEAR'] = 2009
data10_df['FISCAL_YEAR'] = 2010
data11_df['FISCAL_YEAR'] = 2011
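
The path, read, and fiscal-year steps above repeat once per file. A minimal sketch of the same idea as a single helper follows. `load_year` is a hypothetical name that does not appear in the notebook, and the path pattern in the usage comment is copied from the cells above.

```python
import pandas as pd


def load_year(path, year):
    """Read one yearly CSV and tag every row with its fiscal year."""
    df = pd.read_csv(path)
    df["FISCAL_YEAR"] = year
    return df


# Usage, assuming the same file layout as the cells above:
# frames = {
#     year: load_year(
#         f"../bankruptcy_data/public_company_bankruptcy_cases_{year}.csv", year
#     )
#     for year in (2009, 2010, 2011)
# }
```

Keeping the frames in a dictionary keyed by year would let later steps loop over them instead of repeating each line three times.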

In [5]:
# Drop rows that contain NaN values

data09_df = data09_df.dropna()
data10_df = data10_df.dropna()
data11_df = data11_df.dropna()


In [6]:
# Rearrange and rename the columns of each data frame

data09_df = data09_df[['FISCAL_YEAR','STATE', 'COMPANY NAME', 'ASSETS (MILLIONS)','LIABILITIES (MILLIONS)', 'DISTRICT']] 
data09_df = data09_df.rename(columns= {'COMPANY NAME': 'COMPANY', 'ASSETS (MILLIONS)': 'ASSETS_MILLIONS', 'LIABILITIES (MILLIONS)':'LIABILITIES_MILLIONS', 'DISTRICT': 'COURT_DISTRICT'})

data10_df = data10_df[['FISCAL_YEAR','STATE', 'COMPANY NAME', 'ASSETS (MILLIONS)','LIABILITIES (MILLIONS)', 'DISTRICT']] 
data10_df = data10_df.rename(columns= {'COMPANY NAME': 'COMPANY', 'ASSETS (MILLIONS)': 'ASSETS_MILLIONS', 'LIABILITIES (MILLIONS)':'LIABILITIES_MILLIONS', 'DISTRICT': 'COURT_DISTRICT'})

data11_df = data11_df[['FISCAL_YEAR','STATE', 'COMPANY NAME', 'ASSETS (MILLIONS)','LIABILITIES (MILLIONS)', 'DISTRICT']] 
data11_df = data11_df.rename(columns= {'COMPANY NAME': 'COMPANY', 'ASSETS (MILLIONS)': 'ASSETS_MILLIONS', 'LIABILITIES (MILLIONS)':'LIABILITIES_MILLIONS', 'DISTRICT': 'COURT_DISTRICT'})
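
The select-and-rename pair above is identical for all three years, so it can be written once. The sketch below assumes the raw column names shown in the cell above; `tidy_columns` and the one-row frame are made up for illustration.

```python
import pandas as pd

# Column order and new names, copied from the cell above.
COLUMN_ORDER = ["FISCAL_YEAR", "STATE", "COMPANY NAME",
                "ASSETS (MILLIONS)", "LIABILITIES (MILLIONS)", "DISTRICT"]
RENAMES = {
    "COMPANY NAME": "COMPANY",
    "ASSETS (MILLIONS)": "ASSETS_MILLIONS",
    "LIABILITIES (MILLIONS)": "LIABILITIES_MILLIONS",
    "DISTRICT": "COURT_DISTRICT",
}


def tidy_columns(df):
    """Select the six columns in a fixed order, then rename them."""
    return df[COLUMN_ORDER].rename(columns=RENAMES)


# Made-up one-row frame with the raw column names in a different order.
raw = pd.DataFrame({
    "COMPANY NAME": ["Example Co."],
    "DISTRICT": ["D"],
    "STATE": ["DE"],
    "ASSETS (MILLIONS)": [1.0],
    "LIABILITIES (MILLIONS)": [2.0],
    "FISCAL_YEAR": [2009],
})
tidy = tidy_columns(raw)
print(list(tidy.columns))
```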

## Data info, types

In [7]:
data09_df.info()
data10_df.info()
data11_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 128
Data columns (total 6 columns):
FISCAL_YEAR             127 non-null int64
STATE                   127 non-null object
COMPANY                 127 non-null object
ASSETS_MILLIONS         127 non-null float64
LIABILITIES_MILLIONS    127 non-null float64
COURT_DISTRICT          127 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 6.9+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 63 entries, 1 to 63
Data columns (total 6 columns):
FISCAL_YEAR             63 non-null int64
STATE                   63 non-null object
COMPANY                 63 non-null object
ASSETS_MILLIONS         63 non-null object
LIABILITIES_MILLIONS    63 non-null object
COURT_DISTRICT          63 non-null object
dtypes: int64(1), object(5)
memory usage: 3.4+ KB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 6 columns):
FISCAL_YEAR             62 non-null int64
STATE           

In [8]:
data10_df.dtypes


FISCAL_YEAR              int64
STATE                   object
COMPANY                 object
ASSETS_MILLIONS         object
LIABILITIES_MILLIONS    object
COURT_DISTRICT          object
dtype: object

## Identify columns with bad data

In [9]:
data10_df[data10_df['COMPANY'] == 'Spongetech Delivery Systems, Inc.']

# Other companies to look up the same way:
# Kore Holdings, Inc.
# Bridgetech Holdings International, Inc.
# Java Detour, Inc.
# Law Enforcement Associates Corporation
# Sand Spring Capital


| | FISCAL_YEAR | STATE | COMPANY | ASSETS_MILLIONS | LIABILITIES_MILLIONS | COURT_DISTRICT |
|---|---|---|---|---|---|---|
| 51 | 2010 | NY | Spongetech Delivery Systems, Inc. | 0.5 | ---- | SD |


In [10]:
data11_df[data11_df['COMPANY'] == 'Bridgetech Holdings International, Inc.']

| | FISCAL_YEAR | STATE | COMPANY | ASSETS_MILLIONS | LIABILITIES_MILLIONS | COURT_DISTRICT |
|---|---|---|---|---|---|---|
| 10 | 2011 | CA | Bridgetech Holdings International, Inc. | -- | 7 | SD |
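
The lookups above find bad rows one company name at a time. As a hedged alternative, the rows can be located programmatically by trying to parse the amount columns and flagging whatever fails. `find_non_numeric` is a hypothetical helper, and the sample frame is illustrative: its first row mirrors the Spongetech output above, while the comma-formatted strings are an assumption about how the raw file stores large amounts.

```python
import pandas as pd


def find_non_numeric(df, columns):
    """Return the rows where any of `columns` cannot be parsed as a number."""
    bad = pd.Series(False, index=df.index)
    for col in columns:
        # Strip thousands separators first so "1,017.04" still counts as numeric.
        cleaned = df[col].astype(str).str.replace(",", "", regex=False)
        bad |= pd.to_numeric(cleaned, errors="coerce").isna()
    return df[bad]


# Illustrative data only; see the assumptions in the paragraph above.
sample = pd.DataFrame({
    "COMPANY": ["Spongetech Delivery Systems, Inc.", "Blockbuster", "Advanta Group"],
    "ASSETS_MILLIONS": ["0.5", "1,017.04", "363"],
    "LIABILITIES_MILLIONS": ["----", "1,464.94", "331"],
})
bad_rows = find_non_numeric(sample, ["ASSETS_MILLIONS", "LIABILITIES_MILLIONS"])
print(bad_rows["COMPANY"].tolist())
```

Only the row with the `----` placeholder is flagged, because `pd.to_numeric(..., errors="coerce")` turns unparseable strings into NaN.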


## Drop rows with bad data

In [11]:
# Drop rows with bad data

data10_df = data10_df.set_index('COMPANY')
data10_df = data10_df.drop([
    'Spongetech Delivery Systems, Inc.',
    'Kore Holdings, Inc.',
    'U.S. Dry Cleaning Services Corporation',
], axis = 0)

data10_df = data10_df.reset_index()

In [12]:
# Drop rows with bad data

data11_df = data11_df.set_index('COMPANY')
data11_df = data11_df.drop([
    'Bridgetech Holdings International, Inc.',
    'Java Detour, Inc.',
    'Law Enforcement Associates Corporation',
    'Sand Spring Capital',
], axis = 0)

data11_df = data11_df.reset_index()
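
Dropping through `set_index` and `reset_index` moves `COMPANY` to the first column. A small sketch of an alternative that filters with a boolean mask and leaves the column order alone is shown below. `drop_companies` is a hypothetical helper and the two-row frame is made up from company names that appear elsewhere in this notebook.

```python
import pandas as pd


def drop_companies(df, names):
    """Remove rows whose COMPANY is in `names`, keeping the column order."""
    keep = ~df["COMPANY"].isin(names)
    return df[keep].reset_index(drop=True)


sample = pd.DataFrame({
    "FISCAL_YEAR": [2011, 2011],
    "COMPANY": ["Bridgetech Holdings International, Inc.", "AMBAC Financial Group"],
})
cleaned = drop_companies(
    sample,
    ["Bridgetech Holdings International, Inc.", "Java Detour, Inc."],
)
print(cleaned)
```

One behavioral difference to keep in mind: `DataFrame.drop` raises a `KeyError` when a label is missing, whereas this mask silently ignores names that are not present.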

## Check deletion 

In [13]:
data10_df[data10_df['COMPANY'] == 'Spongetech Delivery Systems, Inc.']

| | COMPANY | FISCAL_YEAR | STATE | ASSETS_MILLIONS | LIABILITIES_MILLIONS | COURT_DISTRICT |
|---|---|---|---|---|---|---|


In [14]:
data11_df[data11_df['COMPANY'] == 'Bridgetech Holdings International, Inc.']

| | COMPANY | FISCAL_YEAR | STATE | ASSETS_MILLIONS | LIABILITIES_MILLIONS | COURT_DISTRICT |
|---|---|---|---|---|---|---|


## String replace, set column type

In [15]:
# Remove thousands separators, then convert both amount columns to float
data10_df['ASSETS_MILLIONS'] = data10_df['ASSETS_MILLIONS'].str.replace(",", "", regex=False).astype(float)
data10_df['LIABILITIES_MILLIONS'] = data10_df['LIABILITIES_MILLIONS'].str.replace(",", "", regex=False).astype(float)


In [16]:
# Preview the converted data
data10_df.head(10)

| | COMPANY | FISCAL_YEAR | STATE | ASSETS_MILLIONS | LIABILITIES_MILLIONS | COURT_DISTRICT |
|---|---|---|---|---|---|---|
| 0 | Advanta Group | 2010 | DE | 363.0 | 331.0 | D |
| 1 | Amcore Financial, Inc. | 2010 | IL | 7.0 | 75.0 | ND |
| 2 | American Mortgage Acceptance Company | 2010 | NY | 6.37 | 119.968 | SD |
| 3 | Barzel Industries, Inc. | 2010 | DE | 366.0 | 385.0 | D |
| 4 | Baseline Oil & Gas Corp. | 2010 | TX | 80.0 | 139.0 | SD |
| 5 | Blockbuster | 2010 | NY | 1017.04 | 1464.94 | SD |
| 6 | BSML, Inc. | 2010 | FL | 6.94 | 9.97 | SD |
| 7 | California Coastal Communities, Inc. | 2010 | CA | 291.0 | 231.0 | CD |
| 8 | Canopy Financial, Inc. | 2010 | IL | 18.99 | 25.84 | ND |
| 9 | Capital Growth Systems Inc. | 2010 | DE | 26.97 | 17.146 | D |


In [17]:
# Remove thousands separators, then convert both amount columns to float
data11_df['ASSETS_MILLIONS'] = data11_df['ASSETS_MILLIONS'].str.replace(",", "", regex=False).astype(float)
data11_df['LIABILITIES_MILLIONS'] = data11_df['LIABILITIES_MILLIONS'].str.replace(",", "", regex=False).astype(float)
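
The same clean-up runs on four columns across two frames, so a small helper keeps it in one place. This is a sketch under the assumption that the only non-numeric character left in these columns is the thousands separator; `to_float_millions` is a hypothetical name.

```python
import pandas as pd


def to_float_millions(series):
    """Strip thousands separators from a string column and convert to float."""
    # regex=False treats "," as a literal character rather than a pattern.
    return series.str.replace(",", "", regex=False).astype(float)


sample = pd.Series(["1,017.04", "6.37", "363"])
print(to_float_millions(sample).tolist())
```

If a placeholder such as `----` is still present, `astype(float)` raises a `ValueError`, which is why the bad rows are dropped before this step.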

## Append data 

In [18]:
# Combine the data for all three years.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
# sort takes the boolean False; the string "false" is truthy and would not disable sorting.
appen2 = pd.concat([data09_df, data10_df, data11_df], ignore_index = True, sort = False)

# Restore the column order (set_index/reset_index moved COMPANY to the front in the 2010 and 2011 frames)

appen2 = appen2[['FISCAL_YEAR','STATE', 'COMPANY', 'ASSETS_MILLIONS','LIABILITIES_MILLIONS', 'COURT_DISTRICT']]

appen2.dtypes


FISCAL_YEAR               int64
STATE                    object
COMPANY                  object
ASSETS_MILLIONS         float64
LIABILITIES_MILLIONS    float64
COURT_DISTRICT           object
dtype: object
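
A quick sanity check after combining is to confirm that no rows were lost and that every frame contributed the same set of columns. The sketch below is one way to do that; `check_combined` is a hypothetical helper and the two small frames are made up.

```python
import pandas as pd


def check_combined(frames, combined):
    """Raise AssertionError if rows were lost or columns differ after combining."""
    expected_rows = sum(len(f) for f in frames)
    assert len(combined) == expected_rows, (len(combined), expected_rows)
    for f in frames:
        assert set(f.columns) == set(combined.columns), f.columns.tolist()
    return expected_rows


# Made-up example; the real inputs would be data09_df, data10_df, data11_df.
a = pd.DataFrame({"FISCAL_YEAR": [2009], "COMPANY": ["Example A"]})
b = pd.DataFrame({"COMPANY": ["Example B"], "FISCAL_YEAR": [2010]})
both = pd.concat([a, b], ignore_index=True, sort=False)
print(check_combined([a, b], both))
```

The two sample frames list their columns in different orders on purpose: combining aligns columns by name, so the order mismatch between the 2009 frame and the later ones does not mix up values.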

## Save to CSV and JSON

In [19]:
data09_clean = data09_df
data10_clean = data10_df
data11_clean = data11_df

combined_data = appen2

data09_clean.to_csv("../bankruptcy_data/data09_clean.csv", index=False, encoding='utf8')
data10_clean.to_csv("../bankruptcy_data/data10_clean.csv", index=False, encoding='utf8')
data11_clean.to_csv("../bankruptcy_data/data11_clean.csv", index=False, encoding='utf8')
combined_data.to_csv("../bankruptcy_data/combined_data.csv", index=False, encoding='utf8')

data09_clean.to_json("../bankruptcy_data/data09_clean.json", orient='columns')
data10_clean.to_json("../bankruptcy_data/data10_clean.json", orient='columns')
data11_clean.to_json("../bankruptcy_data/data11_clean.json", orient='columns')
combined_data.to_json("../bankruptcy_data/combined_data.json", orient='columns')
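
To confirm that a saved file can be read back, a round trip through a temporary directory is a cheap test. The sketch below uses the same `to_csv` arguments as the cell above; `roundtrip_csv` is a hypothetical helper and the one-row frame reuses values from the 2009 data shown later in this notebook.

```python
import os
import tempfile

import pandas as pd


def roundtrip_csv(df, path):
    """Write `df` to CSV without the index and read it straight back."""
    df.to_csv(path, index=False, encoding="utf8")
    return pd.read_csv(path)


sample = pd.DataFrame({
    "FISCAL_YEAR": [2009],
    "COMPANY": ["A21, Inc."],
    "ASSETS_MILLIONS": [25.2],
})
with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_csv(sample, os.path.join(tmp, "sample.csv"))

print(restored)
```

The company name contains a comma, so this also exercises the quoting that `to_csv` applies to such fields.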

# Load MongoDB

## Dependencies

In [20]:
import pymongo

In [21]:
# https://realpython.com/introduction-to-mongodb-and-python/
# Establish a connection
conn = 'mongodb://localhost:27017'  # Host and port of the MongoDB server
client = pymongo.MongoClient(conn)

# Create or access a database. If it does not exist, MongoDB creates it on the first write

db = client.bankruptcyDB

# Declare the collections

collection09 = db.data2009
collection10 = db.data2010
collection11 = db.data2011
collectionCombined = db.combinedData

In [22]:
# Show the database object currently in use
db

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'bankruptcyDB')

In [23]:
# Convert each DataFrame to a list of dictionaries, one per row

data09_clean_dict = data09_clean.to_dict(orient= 'records')
data10_clean_dict = data10_clean.to_dict(orient= 'records')
data11_clean_dict = data11_clean.to_dict(orient= 'records')

# Only the last expression in a cell is displayed, so show the combined records
combined_data_dict = combined_data.to_dict(orient= 'records')
combined_data_dict


[{'FISCAL_YEAR': 2009,
  'STATE': 'FL',
  'COMPANY': 'A21, Inc.',
  'ASSETS_MILLIONS': 25.2,
  'LIABILITIES_MILLIONS': 30.3,
  'COURT_DISTRICT': 'MD'},
 {'FISCAL_YEAR': 2009,
  'STATE': 'DE',
  'COMPANY': 'Abitibibowater Inc.',
  'ASSETS_MILLIONS': 9937.0,
  'LIABILITIES_MILLIONS': 2213.0,
  'COURT_DISTRICT': 'D'},
 {'FISCAL_YEAR': 2009,
  'STATE': 'FL',
  'COMPANY': 'Accentia Biopharmaceuticals, Inc.',
  'ASSETS_MILLIONS': 134.9,
  'LIABILITIES_MILLIONS': 77.6,
  'COURT_DISTRICT': 'MD'},
 {'FISCAL_YEAR': 2009,
  'STATE': 'DE',
  'COMPANY': 'Accredited Home Lenders Holding Co.',
  'ASSETS_MILLIONS': 799.5,
  'LIABILITIES_MILLIONS': 490.7,
  'COURT_DISTRICT': 'D'},
 {'FISCAL_YEAR': 2009,
  'STATE': 'NY',
  'COMPANY': 'Apex Silver Mines LTD',
  'ASSETS_MILLIONS': 721.3,
  'LIABILITIES_MILLIONS': 930.9,
  'COURT_DISTRICT': 'SD'},
 {'FISCAL_YEAR': 2009,
  'STATE': 'DE',
  'COMPANY': 'Applied Solar, Inc.',
  'ASSETS_MILLIONS': 17.6,
  'LIABILITIES_MILLIONS': 29.1,
  'COURT_DISTRICT': 'D'},


In [24]:
# Insert the documents into the database
collection09.insert_many(data09_clean_dict)
collection10.insert_many(data10_clean_dict)
collection11.insert_many(data11_clean_dict)
collectionCombined.insert_many(combined_data_dict)

<pymongo.results.InsertManyResult at 0x2316d9f5e08>
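
Running the insert cell more than once adds the same documents again. A hedged sketch of a rerun-safe load is shown below: it empties the collection first and inserts copies of the records. `reload_collection` is a hypothetical helper; it relies only on the `delete_many` and `insert_many` methods of a pymongo collection.

```python
def reload_collection(collection, records):
    """Empty `collection`, then insert a fresh copy of `records`.

    `collection` is expected to be a pymongo Collection, for example
    db.data2009 from the cells above.
    """
    collection.delete_many({})  # remove documents left by earlier runs
    if not records:
        return 0  # insert_many rejects an empty list
    # insert_many adds an "_id" key to each dict it receives, so insert
    # shallow copies and leave the caller's records untouched.
    result = collection.insert_many([dict(r) for r in records])
    return len(result.inserted_ids)


# Usage, assuming a running MongoDB server and the names defined above:
# reload_collection(collection09, data09_clean_dict)
```

Deleting everything first is a blunt choice that suits a small, fully regenerated data set like this one; it would not be appropriate for a collection that other jobs also write to.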

In [25]:
resultscombined = db.combinedData.find()
for x in resultscombined:
    print(x)

{'_id': ObjectId('5d73d6e472cc8b364336b5af'), 'FISCAL_YEAR': 2009, 'STATE': 'FL', 'COMPANY': 'A21, Inc.', 'ASSETS_MILLIONS': 25.2, 'LIABILITIES_MILLIONS': 30.3, 'COURT_DISTRICT': 'MD'}
{'_id': ObjectId('5d73d6e472cc8b364336b5b0'), 'FISCAL_YEAR': 2009, 'STATE': 'DE', 'COMPANY': 'Abitibibowater Inc.', 'ASSETS_MILLIONS': 9937.0, 'LIABILITIES_MILLIONS': 2213.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d6e472cc8b364336b5b1'), 'FISCAL_YEAR': 2009, 'STATE': 'FL', 'COMPANY': 'Accentia Biopharmaceuticals, Inc.', 'ASSETS_MILLIONS': 134.9, 'LIABILITIES_MILLIONS': 77.6, 'COURT_DISTRICT': 'MD'}
{'_id': ObjectId('5d73d6e472cc8b364336b5b2'), 'FISCAL_YEAR': 2009, 'STATE': 'DE', 'COMPANY': 'Accredited Home Lenders Holding Co.', 'ASSETS_MILLIONS': 799.5, 'LIABILITIES_MILLIONS': 490.7, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d6e472cc8b364336b5b3'), 'FISCAL_YEAR': 2009, 'STATE': 'NY', 'COMPANY': 'Apex Silver Mines LTD', 'ASSETS_MILLIONS': 721.3, 'LIABILITIES_MILLIONS': 930.9, 'COURT_DISTRICT': 'S

In [26]:
# Verify results

results09 = db.data2009.find()
for x in results09:
    print(x)

{'_id': ObjectId('5d73d4930899a127820fb2cb'), 'FISCAL_YEAR': 2009, 'STATE': 'FL', 'COMPANY': 'A21, Inc.', 'ASSETS_MILLIONS': 25.2, 'LIABILITIES_MILLIONS': 30.3, 'COURT_DISTRICT': 'MD'}
{'_id': ObjectId('5d73d4930899a127820fb2cc'), 'FISCAL_YEAR': 2009, 'STATE': 'DE', 'COMPANY': 'Abitibibowater Inc.', 'ASSETS_MILLIONS': 9937.0, 'LIABILITIES_MILLIONS': 2213.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb2cd'), 'FISCAL_YEAR': 2009, 'STATE': 'FL', 'COMPANY': 'Accentia Biopharmaceuticals, Inc.', 'ASSETS_MILLIONS': 134.9, 'LIABILITIES_MILLIONS': 77.6, 'COURT_DISTRICT': 'MD'}
{'_id': ObjectId('5d73d4930899a127820fb2ce'), 'FISCAL_YEAR': 2009, 'STATE': 'DE', 'COMPANY': 'Accredited Home Lenders Holding Co.', 'ASSETS_MILLIONS': 799.5, 'LIABILITIES_MILLIONS': 490.7, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb2cf'), 'FISCAL_YEAR': 2009, 'STATE': 'NY', 'COMPANY': 'Apex Silver Mines LTD', 'ASSETS_MILLIONS': 721.3, 'LIABILITIES_MILLIONS': 930.9, 'COURT_DISTRICT': 'S

In [27]:
# Verify results

results10 = db.data2010.find()
for x in results10:
    print(x)

{'_id': ObjectId('5d73d4930899a127820fb34a'), 'COMPANY': 'Advanta Group', 'FISCAL_YEAR': 2010, 'STATE': 'DE', 'ASSETS_MILLIONS': 363.0, 'LIABILITIES_MILLIONS': 331.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb34b'), 'COMPANY': 'Amcore Financial, Inc.', 'FISCAL_YEAR': 2010, 'STATE': 'IL', 'ASSETS_MILLIONS': 7.0, 'LIABILITIES_MILLIONS': 75.0, 'COURT_DISTRICT': 'ND'}
{'_id': ObjectId('5d73d4930899a127820fb34c'), 'COMPANY': 'American Mortgage Acceptance Company', 'FISCAL_YEAR': 2010, 'STATE': 'NY', 'ASSETS_MILLIONS': 6.37, 'LIABILITIES_MILLIONS': 119.968, 'COURT_DISTRICT': 'SD'}
{'_id': ObjectId('5d73d4930899a127820fb34d'), 'COMPANY': 'Barzel Industries, Inc.', 'FISCAL_YEAR': 2010, 'STATE': 'DE', 'ASSETS_MILLIONS': 366.0, 'LIABILITIES_MILLIONS': 385.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb34e'), 'COMPANY': 'Baseline Oil & Gas Corp.', 'FISCAL_YEAR': 2010, 'STATE': 'TX', 'ASSETS_MILLIONS': 80.0, 'LIABILITIES_MILLIONS': 139.0, 'COURT_DISTRICT': 'SD

In [28]:
# Verify results

results11 = db.data2011.find()
for x in results11:
    print(x)

{'_id': ObjectId('5d73d4930899a127820fb386'), 'COMPANY': 'Ad Systems Communications, Inc.', 'FISCAL_YEAR': 2011, 'STATE': 'NV', 'ASSETS_MILLIONS': 405.0, 'LIABILITIES_MILLIONS': 4.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb387'), 'COMPANY': 'Alphatrade.com', 'FISCAL_YEAR': 2011, 'STATE': 'NV', 'ASSETS_MILLIONS': 686.0, 'LIABILITIES_MILLIONS': 4.0, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb388'), 'COMPANY': 'AMBAC Financial Group', 'FISCAL_YEAR': 2011, 'STATE': 'NY', 'ASSETS_MILLIONS': 395.0, 'LIABILITIES_MILLIONS': 1683.0, 'COURT_DISTRICT': 'SD'}
{'_id': ObjectId('5d73d4930899a127820fb389'), 'COMPANY': 'Ambassdors International , Inc.', 'FISCAL_YEAR': 2011, 'STATE': 'DE', 'ASSETS_MILLIONS': 86.44, 'LIABILITIES_MILLIONS': 87.32, 'COURT_DISTRICT': 'D'}
{'_id': ObjectId('5d73d4930899a127820fb38a'), 'COMPANY': 'American Pacific Financial Corporation', 'FISCAL_YEAR': 2011, 'STATE': 'NV', 'ASSETS_MILLIONS': 19.18, 'LIABILITIES_MILLIONS': 161.08, 'COU
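
Printing every document works for a visual check, but comparing counts or reading the collection back into a DataFrame is easier to automate. The sketch below assumes a pymongo collection; `collection_to_frame` and `same_row_count` are hypothetical helpers built on `find` and `count_documents`.

```python
import pandas as pd


def collection_to_frame(collection):
    """Read every document in `collection` into a DataFrame, without `_id`."""
    # The second argument to find() is a projection; {"_id": 0} omits that field.
    return pd.DataFrame(list(collection.find({}, {"_id": 0})))


def same_row_count(collection, df):
    """True when the collection holds exactly as many documents as `df` has rows."""
    return collection.count_documents({}) == len(df)


# Usage, assuming a running MongoDB server and the names defined above:
# same_row_count(db.data2011, data11_clean)
# collection_to_frame(db.combinedData).head()
```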

# Questions?