### Predict bankruptcy

### Data Set Information:

**The dataset is about bankruptcy prediction of Polish companies**

The data was collected from the Emerging Markets Information Service (EMIS), a database containing information on emerging markets around the world. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013.

Based on the collected data, five classification cases were distinguished, depending on the forecasting period:

- 1stYear: the data contains financial rates from the 1st year of the forecasting period and the corresponding class label that indicates bankruptcy status after 5 years. The data contains 7027 instances (financial statements), of which 271 represent bankrupt companies and 6756 represent firms that did not go bankrupt in the forecasting period.

- 2ndYear: the data contains financial rates from the 2nd year of the forecasting period and the corresponding class label that indicates bankruptcy status after 4 years. The data contains 10173 instances (financial statements), of which 400 represent bankrupt companies and 9773 represent firms that did not go bankrupt in the forecasting period.

- 3rdYear: the data contains financial rates from the 3rd year of the forecasting period and the corresponding class label that indicates bankruptcy status after 3 years. The data contains 10503 instances (financial statements), of which 495 represent bankrupt companies and 10008 represent firms that did not go bankrupt in the forecasting period.

- 4thYear: the data contains financial rates from the 4th year of the forecasting period and the corresponding class label that indicates bankruptcy status after 2 years. The data contains 9792 instances (financial statements), of which 515 represent bankrupt companies and 9277 represent firms that did not go bankrupt in the forecasting period.

- 5thYear: the data contains financial rates from the 5th year of the forecasting period and the corresponding class label that indicates bankruptcy status after 1 year. The data contains 5910 instances (financial statements), of which 410 represent bankrupt companies and 5500 represent firms that did not go bankrupt in the forecasting period.
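As a quick sanity check on the counts listed above, the per-file bankruptcy share and the overall totals can be computed directly. This is only a sketch of the arithmetic; the dictionary name `counts` is introduced here for illustration.

```python
# Instance counts quoted in the list above: (total, bankrupt) per forecasting year.
counts = {
    "1stYear": (7027, 271),
    "2ndYear": (10173, 400),
    "3rdYear": (10503, 495),
    "4thYear": (9792, 515),
    "5thYear": (5910, 410),
}

# Totals across the five files.
total_instances = sum(total for total, _ in counts.values())
total_bankrupt = sum(bankrupt for _, bankrupt in counts.values())

for name, (total, bankrupt) in counts.items():
    # Share of bankrupt companies in each file, as a percentage.
    print(f"{name}: {100 * bankrupt / total:.2f}% bankrupt")

print(total_instances, total_bankrupt)  # 43405 2091
```

The totals agree with the combined frame built later in the notebook (43405 rows, 2091 of them bankrupt).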


### Features description

- X1    net profit / total assets
- X2    total liabilities / total assets 
- X3    working capital / total assets 
- X4    current assets / short-term liabilities 
- X5    [(cash + short-term securities + receivables - short-term liabilities) / (operating expenses - depreciation)] * 365 
- X6    retained earnings / total assets 
- X7    EBIT / total assets 
- X8    book value of equity / total liabilities 
- X9    sales / total assets 
- X10   equity / total assets 
- X11   (gross profit + extraordinary items + financial expenses) / total assets 
- X12   gross profit / short-term liabilities 
- X13   (gross profit + depreciation) / sales 
- X14   (gross profit + interest) / total assets 
- X15   (total liabilities * 365) / (gross profit + depreciation) 
- X16   (gross profit + depreciation) / total liabilities 
- X17   total assets / total liabilities 
- X18   gross profit / total assets 
- X19   gross profit / sales 
- X20   (inventory * 365) / sales 
- X21   sales (n) / sales (n-1) 
- X22   profit on operating activities / total assets 
- X23   net profit / sales 
- X24   gross profit (in 3 years) / total assets 
- X25   (equity - share capital) / total assets 
- X26   (net profit + depreciation) / total liabilities 
- X27   profit on operating activities / financial expenses 
- X28   working capital / fixed assets 
- X29   logarithm of total assets 
- X30   (total liabilities - cash) / sales 
- X31   (gross profit + interest) / sales 
- X32   (current liabilities * 365) / cost of products sold 
- X33   operating expenses / short-term liabilities 
- X34   operating expenses / total liabilities 
- X35   profit on sales / total assets 
- X36   total sales / total assets 
- X37   (current assets - inventories) / long-term liabilities 
- X38   constant capital / total assets 
- X39   profit on sales / sales 
- X40   (current assets - inventory - receivables) / short-term liabilities 
- X41   total liabilities / ((profit on operating activities + depreciation) * (12/365)) 
- X42   profit on operating activities / sales 
- X43   rotation receivables + inventory turnover in days 
- X44   (receivables * 365) / sales 
- X45   net profit / inventory 
- X46   (current assets - inventory) / short-term liabilities 
- X47   (inventory * 365) / cost of products sold 
- X48   EBITDA (profit on operating activities - depreciation) / total assets 
- X49   EBITDA (profit on operating activities - depreciation) / sales 
- X50   current assets / total liabilities 
- X51   short-term liabilities / total assets 
- X52   (short-term liabilities * 365) / cost of products sold 
- X53   equity / fixed assets 
- X54   constant capital / fixed assets 
- X55   working capital 
- X56   (sales - cost of products sold) / sales 
- X57   (current assets - inventory - short-term liabilities) / (sales - gross profit - depreciation) 
- X58   total costs / total sales 
- X59   long-term liabilities / equity 
- X60   sales / inventory 
- X61   sales / receivables 
- X62   (short-term liabilities * 365) / sales 
- X63   sales / short-term liabilities 
- X64   sales / fixed assets
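To make the definitions above concrete, here is a minimal sketch that computes a handful of the features from one financial statement. All figures are hypothetical and not taken from the dataset, and `safe_div` and `compute_ratios` are illustrative helper names. Returning NaN for a zero denominator is an assumption about how undefined ratios could be represented; the source does not say why values are missing.

```python
import math

# Hypothetical balance-sheet figures (illustrative only, not from the dataset).
statement = {
    "net_profit": 120.0,
    "total_assets": 2000.0,
    "total_liabilities": 800.0,
    "long_term_liabilities": 0.0,
    "equity": 1200.0,
    "sales": 3000.0,
    "inventory": 250.0,
    "current_assets": 900.0,
}

def safe_div(numerator, denominator):
    """Divide, returning NaN when the ratio is undefined (zero denominator)."""
    return numerator / denominator if denominator != 0 else float("nan")

def compute_ratios(s):
    """Return a few of the features listed above for one statement."""
    return {
        "X1": safe_div(s["net_profit"], s["total_assets"]),
        "X2": safe_div(s["total_liabilities"], s["total_assets"]),
        "X10": safe_div(s["equity"], s["total_assets"]),
        "X17": safe_div(s["total_assets"], s["total_liabilities"]),
        "X20": safe_div(s["inventory"] * 365, s["sales"]),
        "X37": safe_div(s["current_assets"] - s["inventory"],
                        s["long_term_liabilities"]),
    }

ratios = compute_ratios(statement)
for name, value in ratios.items():
    print(name, value)
```

In this example X37 is NaN because the hypothetical company has no long-term liabilities.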


In [1]:
# import the necessary packages for this exercise
from scipy.io.arff import loadarff
import pandas as pd
import numpy as np

In [2]:
# load the five ARFF files (1year.arff ... 5year.arff)
data_objects = []
for i in range(1, 6):
    file_name = str(i) + 'year.arff'
    data_objects.append(loadarff('data/bankruptcy/' + file_name))

In [3]:
# build one DataFrame per file (loadarff returns a (data, meta) tuple) and stack them
df_list = [pd.DataFrame.from_records(data=x[0]) for x in data_objects]
companies = pd.concat(df_list, axis=0)
# rename the 64 attribute columns to x1..x64 and the class column to 'bankrupt'
column_names = ['x' + str(i) for i in range(1, 65)] + ['bankrupt']
companies.columns = column_names
# convert the class label to an integer (0 = not bankrupt, 1 = bankrupt)
companies['bankrupt'] = companies['bankrupt'].astype('int')
companies.shape


(43405, 65)
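Concatenating the five frames discards which file each row came from, and with `axis=0` alone the row labels repeat across files. If the forecasting year is needed later, one option is to tag each frame before stacking. The sketch below uses small toy frames in place of the real ones; the column name `year` and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the per-year frames (the real ones come from loadarff).
toy_frames = [
    pd.DataFrame({"x1": [0.1, 0.2], "bankrupt": [0, 1]}),
    pd.DataFrame({"x1": [0.3], "bankrupt": [0]}),
]

# Record the source file (1, 2, ...) in a new column before concatenating,
# and reset the index so row labels are unique in the combined frame.
tagged = [frame.assign(year=n) for n, frame in enumerate(toy_frames, start=1)]
combined = pd.concat(tagged, axis=0, ignore_index=True)

print(combined)
```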

In [4]:
missing_values_count = companies.isnull().sum()
missing_values_count


x1            8
x2            8
x3            8
x4          134
x5           89
           ... 
x61         102
x62         127
x63         134
x64         812
bankrupt      0
Length: 65, dtype: int64

In [5]:
# drop the columns with more than 2000 missing values
col_to_drop = ['x21', 'x27', 'x37', 'x45', 'x60']
companies_new = companies.drop(columns=col_to_drop)
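The columns to drop are listed by hand in the cell above. A minimal sketch of deriving the same kind of list from a missing-value threshold is shown here on a toy frame; `columns_over_threshold` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

def columns_over_threshold(df, max_missing):
    """Return the names of columns with more than max_missing NaN values."""
    missing = df.isnull().sum()
    return missing[missing > max_missing].index.tolist()

# Toy frame: 'a' has two missing values, 'b' has one, 'c' has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

to_drop = columns_over_threshold(toy, max_missing=1)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

On the notebook's data the equivalent call would be `columns_over_threshold(companies, 2000)`.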

In [6]:
# fill the remaining missing values with each column's median
companies_new = companies_new.fillna(companies_new.median())
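To illustrate what the median fill does, here is a small sketch on a toy frame. It restricts the imputation to the feature columns so the label is left untouched; that restriction is a design choice made here, not something the notebook does, and the names `toy`, `feature_cols`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing feature value and one extreme value.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 100.0],
    "bankrupt": [0, 0, 1, 0],
})

# Impute only the feature columns, leaving the label as it is.
feature_cols = [c for c in toy.columns if c != "bankrupt"]
medians = toy[feature_cols].median()  # NaN values are skipped by default

filled = toy.copy()
filled[feature_cols] = toy[feature_cols].fillna(medians)

print(filled)
```

The missing entry is filled with 3.0, the median of 1, 3 and 100. The mean of the same values would be about 34.7, which shows why the median is the more robust choice when a column contains extreme values.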

In [7]:
companies_new.isna().sum()

x1          0
x2          0
x3          0
x4          0
x5          0
x6          0
x7          0
x8          0
x9          0
x10         0
x11         0
x12         0
x13         0
x14         0
x15         0
x16         0
x17         0
x18         0
x19         0
x20         0
x22         0
x23         0
x24         0
x25         0
x26         0
x28         0
x29         0
x30         0
x31         0
x32         0
x33         0
x34         0
x35         0
x36         0
x38         0
x39         0
x40         0
x41         0
x42         0
x43         0
x44         0
x46         0
x47         0
x48         0
x49         0
x50         0
x51         0
x52         0
x53         0
x54         0
x55         0
x56         0
x57         0
x58         0
x59         0
x61         0
x62         0
x63         0
x64         0
bankrupt    0
dtype: int64

In [8]:
companies['bankrupt'].value_counts()

0    41314
1     2091
Name: bankrupt, dtype: int64

In [9]:
# multiply by 100 with normalize=True to get the percentage of each class, i.e. the share of companies that went bankrupt
100*companies['bankrupt'].value_counts(normalize = True)

0    95.182583
1     4.817417
Name: bankrupt, dtype: float64
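With roughly 95% of the rows in class 0, accuracy alone says little about a model. The sketch below works out the majority-class baseline from the counts printed above; the variable names are illustrative.

```python
# Class counts as printed by value_counts() above.
class_counts = {0: 41314, 1: 2091}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

# The same trivial model never flags a bankruptcy, so its recall on class 1 is zero.
baseline_recall_bankrupt = 0.0

# Number of non-bankrupt rows per bankrupt row.
imbalance_ratio = class_counts[0] / class_counts[1]

print(f"baseline accuracy: {100 * baseline_accuracy:.2f}%")
print(f"imbalance ratio: {imbalance_ratio:.1f} to 1")
```

Any classifier trained on this data has to beat that accuracy baseline to be useful, and metrics such as recall or precision on the bankrupt class are likely to be more informative.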