In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [53]:
data = pd.read_csv('../data/data.csv')

In [54]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                               Non-Null Count  Dtype  
---  ------                                                               --------------  -----  
 0   Bankrupt?                                                            6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest             6819 non-null   float64
 2    ROA(A) before interest and % after tax                              6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax                   6819 non-null   float64
 4    operating gross margin                                              6819 non-null   float64
 5    realized sales gross margin                                         6819 non-null   float64
 6    operating profit rate                                               6819 non-null   float64
 7    tax P

In [55]:
data.head()

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,operating gross margin,realized sales gross margin,operating profit rate,tax Pre-net interest rate,after-tax net interest rate,non-industry income and expenditure/revenue,...,net income to total assets,total assets to GNP price,No-credit interval,Gross profit to Sales,Net income to stockholder's Equity,liability to equity,Degree of financial leverage (DFL),Interest coverage ratio( Interest expense to EBIT ),one if net income was negative for the last two year zero otherwise,equity to liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


Since data.info() reports no missing values, we next check that every entry has been encoded as a valid value.
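The non-null counts from data.info() can also be confirmed directly. The following is a minimal sketch, not part of the original notebook: missing_summary is a hypothetical helper name, and the toy frame stands in for the real data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values, used only to illustrate the check.
toy = pd.DataFrame({"a": [0.1, 0.2, None], "b": [1, 0, 1]})
print(missing_summary(toy))
```

An empty result from missing_summary(data) would agree with the info() output above.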

In [56]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Bankrupt?,6819.0,0.032263,0.176710,0.0,0.000000,0.000000,0.000000,1.0
ROA(C) before interest and depreciation before interest,6819.0,0.505180,0.060686,0.0,0.476527,0.502706,0.535563,1.0
ROA(A) before interest and % after tax,6819.0,0.558625,0.065620,0.0,0.535543,0.559802,0.589157,1.0
ROA(B) before interest and depreciation after tax,6819.0,0.553589,0.061595,0.0,0.527277,0.552278,0.584105,1.0
operating gross margin,6819.0,0.607948,0.016934,0.0,0.600445,0.605997,0.613914,1.0
...,...,...,...,...,...,...,...,...
liability to equity,6819.0,0.280365,0.014463,0.0,0.276944,0.278778,0.281449,1.0
Degree of financial leverage (DFL),6819.0,0.027541,0.015668,0.0,0.026791,0.026808,0.026913,1.0
Interest coverage ratio( Interest expense to EBIT ),6819.0,0.565358,0.013214,0.0,0.565158,0.565252,0.565725,1.0
one if net income was negative for the last two year zero otherwise,6819.0,1.000000,0.000000,1.0,1.000000,1.000000,1.000000,1.0


The columns displayed above all fall between 0 and 1, so they show no obvious encoding errors such as 9999 or negative numbers. Every column is also a float or an integer, so we do not need to encode categorical features. One column has both a min and a max of 1, so we explore it below.
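Because the describe() table is truncated with "...", the range can only be read off for the rows that are displayed. One way to check every column is sketched here, assuming all features are numeric; columns_outside_range is a hypothetical helper and the toy values are made up.

```python
import pandas as pd


def columns_outside_range(df: pd.DataFrame, low: float = 0.0, high: float = 1.0) -> pd.DataFrame:
    """Return min and max for numeric columns that have any value outside [low, high]."""
    numeric = df.select_dtypes(include="number")
    bounds = numeric.agg(["min", "max"]).T  # one row per column, with 'min' and 'max' columns
    outside = (bounds["min"] < low) | (bounds["max"] > high)
    return bounds[outside]


# Toy frame with made-up values: one column stays in [0, 1], one does not.
toy = pd.DataFrame({
    "ratio_ok": [0.0, 0.5, 1.0],
    "ratio_huge": [0.2, 0.4, 9.54e9],
})
print(columns_outside_range(toy))
```

Calling columns_outside_range(data) would list any columns that need a closer look before modelling.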

In [57]:
data['one if net income was negative for the last two year zero otherwise'].value_counts()

1    6819
Name: one if net income was negative for the last two year zero otherwise, dtype: int64

In [58]:
# Print every column whose values are all 1 (min and max both equal 1)
for column in data.columns:
    if data[column].min() == 1 and data[column].max() == 1:
        print(column)

one if net income was negative for the last two year zero otherwise


This is the only column made up entirely of 1s. Because it takes the same value for every company, it provides no real information, so we drop it before doing exploratory data analysis.
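A more general way to find uninformative columns is to count distinct values, which catches a constant column whatever its value is. This is a sketch under that assumption rather than the notebook's own method; constant_columns is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """List the columns that contain a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame with made-up values: 'flag' is constant, 'ratio' is not.
toy = pd.DataFrame({"flag": [1, 1, 1], "ratio": [0.1, 0.2, 0.3]})
print(constant_columns(toy))

# Dropping the constant columns in one step.
toy_reduced = toy.drop(columns=constant_columns(toy))
```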

In [59]:
data = data.drop('one if net income was negative for the last two year zero otherwise', axis=1)

In [60]:
data.shape

(6819, 95)

In [61]:
data.aggregate(['min','max']).T.sort_values(by='min')

Unnamed: 0,min,max
Bankrupt?,0.0,1.000000e+00
Retained Earnings/Total assets,0.0,1.000000e+00
long-term liability to current assets,0.0,9.540000e+09
current liability/equity,0.0,1.000000e+00
working capital/equity,0.0,1.000000e+00
...,...,...
regular net profit growth rate,0.0,1.000000e+00
after-tax net profit growth rate,0.0,1.000000e+00
operating profit growth rate,0.0,1.000000e+00
quick ratio,0.0,9.230000e+09


As the counts below show, this dataset is heavily imbalanced: only 220 of the 6819 companies (about 3.2%) are labelled bankrupt. We will need to take care that the minority class is predicted correctly.

In [62]:
data['Bankrupt?'].value_counts()

0    6599
1     220
Name: Bankrupt?, dtype: int64
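To put the imbalance in numbers, the class shares can be worked out from the counts above. This is a small sketch that copies the two counts by hand instead of reading the file; on the real frame, data['Bankrupt?'].value_counts(normalize=True) gives the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 6599, 1: 220}, name="Bankrupt?")

# Share of each class in the full dataset.
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a model that always predicts the majority class (not bankrupt).
majority_baseline = shares.max()
```

A model that always predicts "not bankrupt" would already be right for roughly 96.8% of companies, so accuracy alone says little about how well bankruptcies are detected.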

We export the cleaned dataset so that it can be used for Exploratory Data Analysis.

In [64]:
data.to_csv('../data/data_cleaned.csv', index=False)
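One detail to keep in mind when exporting: by default to_csv also writes the row index, and reading the file back then produces an extra "Unnamed: 0" column. The sketch below shows the difference on a made-up toy frame, using in-memory buffers in place of real files.

```python
import io

import pandas as pd

# Toy frame with made-up values standing in for the cleaned data.
toy = pd.DataFrame({"Bankrupt?": [1, 0], "quick ratio": [0.1, 0.2]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the index is written, the next notebook can either drop the extra column or pass index_col=0 to read_csv.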