<a href="https://colab.research.google.com/github/gittymarina/merogit/blob/master/US_BANKRUPTCY_PREDICTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTING OUR LIBRARIES
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import recall_score,precision_score,f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


In [None]:
#LOADING OUR DATASET
df=pd.read_csv("american_bankruptcy.csv")
df

CHECKING THE NUMBER OF ROWS IN THE DATASET

In [None]:
len(df)

RENAMING OUR COLUMNS IN ORDER TO UNDERSTAND OUR VARIABLES MORE EASILY

In [None]:
columns_names=['company_name','status_label','year','current_assets','cost of goods sold','depreciation and amortization','EBITDA','inventory','net income',
               'total receivables','market value','net sales','total assets','total long term debt','EBIT','gross profit','total current liabilities',
               'retained earnings','total revenue','total liabilities','total operating expenses']


In [None]:
df.columns=columns_names
df

INFORMATION ABOUT OUR DATASET

In [None]:
df.info()

DESCRIPTION OF OUR DATASET

In [None]:
df.describe()

FINDING OUT THE NULL VALUES IN THE DATASET

In [None]:
df.isnull().sum()

DROPPING THE DUPLICATES IN THE DATASET

In [None]:
# drop_duplicates returns a new DataFrame, so assign it back to keep the result
df=df.drop_duplicates()
df
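
Note that `drop_duplicates` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up frame (the `toy` data below is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row (not from the dataset)
toy = pd.DataFrame({"company_name": ["C_1", "C_1", "C_2"],
                    "year": [1999, 1999, 2000]})

toy.drop_duplicates()          # result is discarded, toy is unchanged
rows_before = len(toy)

toy = toy.drop_duplicates()    # assign back to keep the de-duplicated frame
rows_after = len(toy)

print(rows_before, rows_after)
```

Passing `inplace=True` is an alternative, but assigning the result back is usually easier to follow in a notebook.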

UNDERSTANDING THE DATASET USING BARPLOTS

In [None]:

# These columns are numeric; each bar shows the mean value of the column per status label
numeric_cols=['current_assets','cost of goods sold','depreciation and amortization','EBITDA','inventory','net income',
              'total receivables','market value','net sales','total assets']

plt.figure(figsize=(20,40))
for i,col in enumerate(numeric_cols,1):
  # One subplot row per column, so no empty panel is left at the bottom
  plt.subplot(len(numeric_cols),1,i)
  sns.barplot(x=col,y='status_label',data=df)
  plt.xlabel(col)
  plt.ylabel('status_label')
plt.show()


INFERENCE: HIGHER TOTAL ASSETS AND HIGHER DEPRECIATION AND AMORTIZATION ARE ASSOCIATED WITH BANKRUPTCY

NEGATIVE NET INCOME:
   A company can have positive sales and still report negative net income when its expenses and other costs exceed the amount taken in as revenue.

In [None]:

# Remaining numeric columns; each bar shows the mean value of the column per status label
numeric_cols=['total long term debt','EBIT','gross profit','total current liabilities',
              'retained earnings','total revenue','total liabilities','total operating expenses']

plt.figure(figsize=(20,40))
for i,col in enumerate(numeric_cols,1):
  # One subplot row per column, so no empty panel is left at the bottom
  plt.subplot(len(numeric_cols),1,i)
  sns.barplot(x=col,y='status_label',data=df)
  plt.xlabel(col)
  plt.ylabel('status_label')
plt.show()

INFERENCE:

Higher total liabilities, total operating expenses and total revenue
are associated with bankruptcy.

NEGATIVE RETAINED EARNINGS:
Retained earnings turn negative because of a cumulative net loss (i.e. the excess of net losses previously allocated to members over the net income previously allocated to them).
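
To illustrate how cumulative losses push retained earnings below zero, here is a minimal sketch with hypothetical yearly net income figures. It assumes no dividends or other distributions, so retained earnings is simply the running total of net income. Both the figures and that assumption are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical yearly net income (not taken from the dataset)
net_income = pd.Series([50.0, -80.0, -30.0, 20.0], index=[2001, 2002, 2003, 2004])

# Assumption: no dividends, so retained earnings is the running sum of net income
retained_earnings = net_income.cumsum()

# Years in which the accumulated losses exceed the accumulated profits
negative_years = retained_earnings[retained_earnings < 0].index.tolist()
print(retained_earnings.tolist())
print(negative_years)
```

In this sketch one profitable year at the end is not enough to bring the running total back above zero.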




FINDING CORRELATION BETWEEN VARIABLES THROUGH HEATMAP

In [None]:
plt.figure(figsize=(15,10))
# numeric_only=True skips string columns such as company_name and status_label
sns.heatmap(df.corr(numeric_only=True),annot=True,cmap="viridis")

FINDING CORRELATED VARIABLES AND THEIR CORRELATION VALUES

In [None]:
# numeric_only=True skips string columns such as company_name and status_label
df.corr(numeric_only=True)
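
Reading strongly correlated pairs off a large correlation table by eye is error-prone. The sketch below lists them programmatically. The helper name `high_corr_pairs`, the 0.95 threshold and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Long format; masked entries are NaN and never pass the threshold test below
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Hypothetical toy data: two identical columns and one unrelated column
toy = pd.DataFrame({
    "net sales": [10.0, 20.0, 30.0, 40.0],
    "total revenue": [10.0, 20.0, 30.0, 40.0],
    "inventory": [5.0, 1.0, 4.0, 2.0],
})
result = high_corr_pairs(toy)
print(result)
```

On the real data the same call would be `high_corr_pairs(df)`, and the threshold can be lowered to surface weaker relationships.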

WE ARE DROPPING THE VARIABLES WHOSE VALUES ARE STRONGLY CORRELATED WITH OTHER VARIABLES

In [None]:
df=df.drop("gross profit",axis=1)

"Cost of goods sold" and "Net sales" are strongly correlated, but they are not the same factor. However, "Gross profit" equals net sales minus cost of goods sold, so this feature can be dropped to decrease multicollinearity.

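The identity gross profit = net sales - cost of goods sold can be checked before a column is dropped. A minimal sketch follows, in which the helper name `share_matching`, the tolerance and the toy figures are all hypothetical. On the real data the check would have to run before "gross profit" is removed from `df`.

```python
import numpy as np
import pandas as pd

def share_matching(frame, tol=1e-2):
    """Fraction of rows where gross profit equals net sales minus cost of goods sold."""
    expected = frame["net sales"] - frame["cost of goods sold"]
    return float(np.isclose(frame["gross profit"], expected, atol=tol).mean())

# Hypothetical figures chosen only to illustrate the check
toy = pd.DataFrame({
    "net sales": [100.0, 250.0, 80.0],
    "cost of goods sold": [60.0, 200.0, 90.0],
    "gross profit": [40.0, 50.0, -10.0],
})
share = share_matching(toy)
print(share)
```

A share close to 1.0 suggests the column is a linear combination of the other two and adds no independent information.
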
In [None]:
df=df.drop("total revenue",axis=1)

"Net sales" and "Total revenue" are identical and therefore perfectly correlated. Dropping one to decrease multicollinearity.

In [None]:
df=df.drop("total operating expenses",axis=1)


There is also a near-perfect correlation between "Cost of goods sold" and "Total operating expenses". It seems that "Cost of goods sold" is a subset of the expenses covered in "Total operating expenses", so "Total operating expenses" is dropped to reduce multicollinearity.

In [None]:
df=df.drop("EBIT",axis=1)

The variables X4 "EBITDA" and X12 "EBIT" are very strongly correlated with their difference being captured in X3 "Depreciation and amortization", i.e. X3 = X4 - X12. Thus, X12 can be dropped to decrease multicollinearity.

LABEL ENCODING OUR STRING VARIABLES (COMPANY NAME AND STATUS LABEL ARE STORED AS STRINGS; STATUS LABEL IS ENCODED HERE)

In [None]:
label=preprocessing.LabelEncoder()
df['status_label']=label.fit_transform(df['status_label'])


AS OUR STATUS LABEL IS OF OBJECT TYPE, WE LABEL ENCODE IT IN ORDER TO CONVERT IT TO INTEGERS: ZEROS '0' AND ONES '1'
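
To see which class ends up as 0 and which as 1, the fitted encoder can be inspected. The sketch below uses hypothetical status values (the real labels in the dataset may differ). `LabelEncoder` sorts the classes, so the codes follow alphabetical order.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical status values; the real labels in the dataset may differ
status = ["alive", "failed", "alive", "alive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(status)

# classes_ is sorted, so the position of each class is its integer code
mapping = {cls: code for code, cls in enumerate(encoder.classes_)}
decoded = encoder.inverse_transform(encoded).tolist()

print(mapping)
print(encoded.tolist())
```

Knowing the mapping matters later, because precision, recall and F1 are reported for the class encoded as 1 by default.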

FITTING A LOGISTIC REGRESSION MODEL

In [None]:
# Separate the features from the target
x=df.drop('status_label',axis=1)
y=df['status_label']
# One-hot encode the remaining string column (company_name); this adds one column per company
x=pd.get_dummies(x,drop_first=True)
# Hold out 20% of the rows for testing
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.8,random_state=50)
# Fit the scaler on the training split only, then apply the same scaling to the test split
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
model=LogisticRegression()
model.fit(x_train,y_train)
predictions=model.predict(x_test)
# Evaluate on the held-out test split
accuracy=accuracy_score(y_test,predictions)
conf_matrix=confusion_matrix(y_test,predictions)
class_report=classification_report(y_test,predictions)
print('accuracy:',accuracy)
print("confusion matrix:\n",conf_matrix)
print('classification report:\n',class_report)

FITTING A RANDOM FOREST MODEL

In [None]:
ran=RandomForestClassifier()
ran.fit(x_train,y_train)
y_pred=ran.predict(x_test)
accuracy1=accuracy_score(y_test,y_pred)
classif_report=classification_report(y_test,y_pred)
conf_mat=confusion_matrix(y_test,y_pred)
print("accuracy score:",accuracy1)
print('classification report:\n',classif_report)
print('confusion matrix:\n',conf_mat)

FITTING A DECISION TREE MODEL

In [None]:
clf=DecisionTreeClassifier()
clf.fit(x_train,y_train)
y_pred1=clf.predict(x_test)
# Use new variable names so the imported classification_report function is not overwritten
dt_accuracy=accuracy_score(y_test,y_pred1)
dt_report=classification_report(y_test,y_pred1)
confusion_mat=confusion_matrix(y_test,y_pred1)
print("accuracy score:",dt_accuracy)
print('classification report:\n',dt_report)
print('confusion matrix:\n',confusion_mat)

WE ARE TABULATING THE PERFORMANCE METRICS OF THE VARIOUS ALGORITHMS IN A TABLE.

In [None]:
name=['LOGISTIC REGRESSION','DECISION TREE','RANDOM FOREST']
pred=[predictions,y_pred1,y_pred]  # same order as the names above
Accuracy=[]
precision=[]
f1=[]
recall=[]

for j in pred:
  # scikit-learn metrics expect (y_true, y_pred) in that order
  Accuracy.append(accuracy_score(y_test,j))
  precision.append(precision_score(y_test,j))
  f1.append(f1_score(y_test,j))
  recall.append(recall_score(y_test,j))
eval_scores=pd.DataFrame(
    {'MODELS':name,
     'ACCURACY':Accuracy,
     'PRECISION':precision,
     'F1':f1,
     'RECALL':recall}
)
eval_scores
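
The argument order passed to the metric functions matters. scikit-learn expects the true labels first and the predictions second. Accuracy and F1 are unaffected by swapping them, but precision and recall trade places. A minimal sketch with hypothetical labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 2 actual positives, 3 predicted positives, 1 of them correct
y_true = [1, 0, 0, 1, 0, 0]
y_hat = [1, 1, 0, 0, 0, 1]

# Correct order: (y_true, y_pred)
p_ok = precision_score(y_true, y_hat)   # 1 correct out of 3 predicted positives
r_ok = recall_score(y_true, y_hat)      # 1 found out of 2 actual positives

# Swapped order: precision and recall exchange their values
p_swapped = precision_score(y_hat, y_true)
r_swapped = recall_score(y_hat, y_true)

# F1 is symmetric in the two arguments, so it does not reveal the mistake
f1_ok = f1_score(y_true, y_hat)
f1_swapped = f1_score(y_hat, y_true)

print(p_ok, r_ok, p_swapped, r_swapped, f1_ok, f1_swapped)
```

Because accuracy and F1 look the same either way, a swapped call is easy to miss unless precision and recall are checked against the confusion matrix.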

**BAR PLOT**

In [None]:

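# Plot each model's accuracy from the table above (counting rows per model would always give 1)
eval_scores.set_index('MODELS')['ACCURACY'].plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.xlabel('ACCURACY')
plt.gca().spines[['top', 'right',]].set_visible(False)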