We work with a dataset for predicting whether a company is in financial distress.
The dataset covers a sample of companies and can
be downloaded at https://www.kaggle.com/shebrahimi/financial-distress
The columns of this dataset are as follows:
• First column: Company identifies the sampled company.
• Second column: Time indicates the time period the data belongs to. The time series
length varies between 1 and 14 for each company.
• Third column: the target variable, denoted "Financial Distress". If it is greater
than -0.50, the company should be considered healthy (0). Otherwise, it is
regarded as financially distressed (1).
• Fourth column to the last column: the features, denoted x1 to x83, are financial
and non-financial characteristics of the sampled companies. These features belong to
the previous time period and should be used to predict whether the company will be
financially distressed or not (classification). Feature x80 is a categorical variable.
For example, company 1 is financially distressed at time 4, but company 2 is still healthy
at time 14.
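
Since x80 is categorical, its numeric codes carry no order. One possible way to handle it is one-hot encoding; the sketch below is only an illustration on a made-up toy frame (the values and the name `toy` are assumptions, not taken from the dataset file). Tree-based models can often cope with the raw codes as well, so this step is optional.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "x1": [1.28, 1.27, 1.06],
    "x80": [22, 22, 29],
})

# Expand the categorical codes of x80 into one indicator column per category.
encoded = pd.get_dummies(toy, columns=["x80"], prefix="x80")
print(encoded.columns.tolist())
```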

In [50]:
import numpy as np
import pandas as pd

FinancialDistress = pd.read_csv('FinancialDistress.csv')
FinancialDistress.head()

,Company,Time,Financial Distress,x1,x2,x3,x4,x5,x6,x7,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,0.010636,1.281,0.022934,0.87454,1.2164,0.06094,0.18827,0.5251,...,85.437,27.07,26.102,16.0,16.0,0.2,22,0.06039,30,49
1,1,2,-0.45597,1.27,0.006454,0.82067,1.0049,-0.01408,0.18104,0.62288,...,107.09,31.31,30.194,17.0,16.0,0.4,22,0.010636,31,50
2,1,3,-0.32539,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,...,120.87,36.07,35.273,17.0,15.0,-0.2,22,-0.45597,32,51
3,1,4,-0.56657,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,...,54.806,39.8,38.377,17.167,16.0,5.6,22,-0.32539,33,52
4,2,1,1.3573,1.0623,0.10702,0.8146,0.83593,0.19996,0.0478,0.742,...,85.437,27.07,26.102,16.0,16.0,0.2,29,1.251,7,27


Isolate the label column and transform it into 0/1 labels.

In [51]:
FinancialDistress

,Company,Time,Financial Distress,x1,x2,x3,x4,x5,x6,x7,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,0.010636,1.2810,0.022934,0.87454,1.21640,0.060940,0.188270,0.52510,...,85.437,27.07,26.102,16.000,16.0,0.2,22,0.060390,30,49
1,1,2,-0.455970,1.2700,0.006454,0.82067,1.00490,-0.014080,0.181040,0.62288,...,107.090,31.31,30.194,17.000,16.0,0.4,22,0.010636,31,50
2,1,3,-0.325390,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,...,120.870,36.07,35.273,17.000,15.0,-0.2,22,-0.455970,32,51
3,1,4,-0.566570,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,...,54.806,39.80,38.377,17.167,16.0,5.6,22,-0.325390,33,52
4,2,1,1.357300,1.0623,0.107020,0.81460,0.83593,0.199960,0.047800,0.74200,...,85.437,27.07,26.102,16.000,16.0,0.2,29,1.251000,7,27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3667,422,10,0.438020,2.2605,0.202890,0.16037,0.18588,0.175970,0.198400,2.22360,...,100.000,100.00,100.000,17.125,14.5,-7.0,37,0.436380,4,41
3668,422,11,0.482410,1.9615,0.216440,0.20095,0.21642,0.203590,0.189870,1.93820,...,91.500,130.50,132.400,20.000,14.5,-16.0,37,0.438020,5,42
3669,422,12,0.500770,1.7099,0.207970,0.26136,0.21399,0.193670,0.183890,1.68980,...,87.100,175.90,178.100,20.000,14.5,-20.2,37,0.482410,6,43
3670,422,13,0.611030,1.5590,0.185450,0.30728,0.19307,0.172140,0.170680,1.53890,...,92.900,203.20,204.500,22.000,22.0,6.4,37,0.500770,7,44


In [52]:
# Alternative (not used): select the label column by position.
# y = FinancialDistress.iloc[:, [2]]  # still a DataFrame; flatten it with .values.ravel() before passing it to an estimator
# x = FinancialDistress.drop(y.columns, axis=1)

In [53]:
# Isolate the label column and map it to 0/1 labels (0 if the value is > -0.5, 1 otherwise).
# Note: this overwrites the column in place, so the cell should be run only once
# (on a second run, the labels 0 and 1 are both > -0.5 and would all become 0).
FinancialDistress['Financial Distress'] = FinancialDistress['Financial Distress'].apply(lambda x: 0 if x > -0.5 else 1)
y = FinancialDistress['Financial Distress']
x = FinancialDistress.drop('Financial Distress', axis=1)
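
The same thresholding can be written without `apply`. The sketch below is a vectorized variant on the first five "Financial Distress" values shown in the table above; the names `raw` and `labels` are illustrative. Keeping the labels in a separate variable leaves the original column untouched, so the step can be repeated safely.

```python
import numpy as np
import pandas as pd

# First five raw target values from the head of the table above.
raw = pd.Series([0.010636, -0.45597, -0.32539, -0.56657, 1.3573],
                name="Financial Distress")

# 0 (healthy) if the value is > -0.5, 1 (distressed) otherwise.
labels = pd.Series(np.where(raw > -0.5, 0, 1), index=raw.index, name=raw.name)
print(labels.tolist())
```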

In [54]:
FinancialDistress

,Company,Time,Financial Distress,x1,x2,x3,x4,x5,x6,x7,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,0,1.2810,0.022934,0.87454,1.21640,0.060940,0.188270,0.52510,...,85.437,27.07,26.102,16.000,16.0,0.2,22,0.060390,30,49
1,1,2,0,1.2700,0.006454,0.82067,1.00490,-0.014080,0.181040,0.62288,...,107.090,31.31,30.194,17.000,16.0,0.4,22,0.010636,31,50
2,1,3,0,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,...,120.870,36.07,35.273,17.000,15.0,-0.2,22,-0.455970,32,51
3,1,4,1,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,...,54.806,39.80,38.377,17.167,16.0,5.6,22,-0.325390,33,52
4,2,1,0,1.0623,0.107020,0.81460,0.83593,0.199960,0.047800,0.74200,...,85.437,27.07,26.102,16.000,16.0,0.2,29,1.251000,7,27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3667,422,10,0,2.2605,0.202890,0.16037,0.18588,0.175970,0.198400,2.22360,...,100.000,100.00,100.000,17.125,14.5,-7.0,37,0.436380,4,41
3668,422,11,0,1.9615,0.216440,0.20095,0.21642,0.203590,0.189870,1.93820,...,91.500,130.50,132.400,20.000,14.5,-16.0,37,0.438020,5,42
3669,422,12,0,1.7099,0.207970,0.26136,0.21399,0.193670,0.183890,1.68980,...,87.100,175.90,178.100,20.000,14.5,-20.2,37,0.482410,6,43
3670,422,13,0,1.5590,0.185450,0.30728,0.19307,0.172140,0.170680,1.53890,...,92.900,203.20,204.500,22.000,22.0,6.4,37,0.500770,7,44


In [55]:
x

,Company,Time,x1,x2,x3,x4,x5,x6,x7,x8,...,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83
0,1,1,1.2810,0.022934,0.87454,1.21640,0.060940,0.188270,0.52510,0.018854,...,85.437,27.07,26.102,16.000,16.0,0.2,22,0.060390,30,49
1,1,2,1.2700,0.006454,0.82067,1.00490,-0.014080,0.181040,0.62288,0.006423,...,107.090,31.31,30.194,17.000,16.0,0.4,22,0.010636,31,50
2,1,3,1.0529,-0.059379,0.92242,0.72926,0.020476,0.044865,0.43292,-0.081423,...,120.870,36.07,35.273,17.000,15.0,-0.2,22,-0.455970,32,51
3,1,4,1.1131,-0.015229,0.85888,0.80974,0.076037,0.091033,0.67546,-0.018807,...,54.806,39.80,38.377,17.167,16.0,5.6,22,-0.325390,33,52
4,2,1,1.0623,0.107020,0.81460,0.83593,0.199960,0.047800,0.74200,0.128030,...,85.437,27.07,26.102,16.000,16.0,0.2,29,1.251000,7,27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3667,422,10,2.2605,0.202890,0.16037,0.18588,0.175970,0.198400,2.22360,1.091500,...,100.000,100.00,100.000,17.125,14.5,-7.0,37,0.436380,4,41
3668,422,11,1.9615,0.216440,0.20095,0.21642,0.203590,0.189870,1.93820,1.000100,...,91.500,130.50,132.400,20.000,14.5,-16.0,37,0.438020,5,42
3669,422,12,1.7099,0.207970,0.26136,0.21399,0.193670,0.183890,1.68980,0.971860,...,87.100,175.90,178.100,20.000,14.5,-20.2,37,0.482410,6,43
3670,422,13,1.5590,0.185450,0.30728,0.19307,0.172140,0.170680,1.53890,0.960570,...,92.900,203.20,204.500,22.000,22.0,6.4,37,0.500770,7,44


In [56]:
y

0       0
1       0
2       0
3       1
4       0
       ..
3667    0
3668    0
3669    0
3670    0
3671    0
Name: Financial Distress, Length: 3672, dtype: int64

Now analyse the data using decision trees, random forests and gradient boosted trees (GBT).

1. Decision Tree

In [57]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(x,y)
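
A plain `train_test_split` shuffles rows, so periods of the same company can end up in both the training and the test set. If that is a concern, one option is to split by company. The sketch below uses `GroupShuffleSplit` on a made-up toy frame; the toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy panel: 4 companies observed over a few periods (illustrative values).
toy = pd.DataFrame({
    "Company": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "Time":    [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
    "x1":      np.linspace(0.0, 0.9, 10),
})
toy_y = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 0, 1])

# Hold out whole companies; a fixed random_state makes the split reproducible.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(toy, toy_y, groups=toy["Company"]))

train_companies = set(toy["Company"].iloc[train_idx])
test_companies = set(toy["Company"].iloc[test_idx])
print(train_companies, test_companies)
```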

In [58]:
from sklearn import tree

# Fit a decision tree classifier with default hyperparameters.
reg = tree.DecisionTreeClassifier()
reg = reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

In [59]:
reg.score(x_test,y_test)

0.9455337690631809
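
Accuracy alone can look good when one class is rare, which is worth checking here with `y.value_counts()`. The sketch below uses made-up labels (an assumption, not the notebook's data) to show how a classifier that always predicts "healthy" scores on accuracy, balanced accuracy and recall.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, recall_score)

# Made-up labels: 18 healthy companies (0) and 2 distressed ones (1).
y_true = np.array([0] * 18 + [1] * 2)
# A trivial classifier that always predicts "healthy".
y_hat = np.zeros(20, dtype=int)

acc = accuracy_score(y_true, y_hat)           # share of correct predictions
bal = balanced_accuracy_score(y_true, y_hat)  # mean of the per-class recalls
rec = recall_score(y_true, y_hat)             # recall on the distressed class
cm = confusion_matrix(y_true, y_hat)          # rows: true class, columns: predicted class
print(acc, bal, rec)
print(cm)
```

Here the accuracy is high although no distressed company is detected, which is why the confusion matrix or the recall on class 1 is a useful complement to `score`.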

In [60]:
reg.feature_importances_

array([0.04667502, 0.        , 0.        , 0.        , 0.        ,
       0.02279987, 0.01454428, 0.        , 0.00954468, 0.04128679,
       0.00533811, 0.00763575, 0.00678733, 0.00459833, 0.        ,
       0.        , 0.        , 0.00983129, 0.        , 0.        ,
       0.01628959, 0.00495475, 0.00763575, 0.        , 0.01458847,
       0.00043781, 0.01509682, 0.00044772, 0.0101466 , 0.00911354,
       0.        , 0.        , 0.01759125, 0.00971822, 0.0081448 ,
       0.        , 0.        , 0.19556054, 0.        , 0.        ,
       0.00848416, 0.0114415 , 0.0175445 , 0.01527149, 0.        ,
       0.03340262, 0.00939194, 0.        , 0.01665981, 0.02978554,
       0.07305475, 0.        , 0.        , 0.        , 0.01937335,
       0.00989819, 0.        , 0.02548271, 0.02130137, 0.03202557,
       0.00763575, 0.        , 0.        , 0.        , 0.0050905 ,
       0.        , 0.        , 0.01739582, 0.00242109, 0.        ,
       0.        , 0.        , 0.04300462, 0.        , 0.02458

2. Random Forest

In [61]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200)
rf.fit(x_train, y_train)
Z = rf.predict(x_test)
accuracy=rf.score(x_test,y_test)
accuracy

0.9694989106753813

In [62]:
rf.feature_importances_

array([0.01107004, 0.00456077, 0.0081433 , 0.02683157, 0.01835244,
       0.01147368, 0.02225603, 0.00608036, 0.00643628, 0.02458009,
       0.03046335, 0.02104512, 0.00711611, 0.02435163, 0.01575493,
       0.01887964, 0.00564665, 0.01537598, 0.00635169, 0.00906716,
       0.00822356, 0.00932057, 0.00967948, 0.00830929, 0.0136489 ,
       0.01546453, 0.03837793, 0.0087442 , 0.01424763, 0.01176656,
       0.01003657, 0.00991494, 0.00832108, 0.00789479, 0.00948709,
       0.00774332, 0.00709072, 0.01912945, 0.01244094, 0.00886487,
       0.01107912, 0.00771655, 0.01054397, 0.01712984, 0.01011056,
       0.02484456, 0.01127162, 0.02432148, 0.02264401, 0.02812912,
       0.0189008 , 0.01577115, 0.00833253, 0.01624167, 0.0127441 ,
       0.00932215, 0.0087651 , 0.00857204, 0.01449564, 0.00731454,
       0.00806643, 0.0069039 , 0.00749547, 0.00607807, 0.00774321,
       0.00679137, 0.00761319, 0.00616277, 0.00623887, 0.00403838,
       0.0038689 , 0.00388376, 0.00555053, 0.0022553 , 0.00504

In [63]:
sorted_idx = rf.feature_importances_.argsort()
np.flip(sorted_idx)

array([26, 82, 10, 49,  3, 45,  9, 13, 47, 48,  6, 11, 37, 50, 15,  4, 43,
       53, 51, 14, 25, 17, 58, 28, 24, 54, 38, 29,  5, 46, 40,  0, 84, 42,
       44, 30, 31, 22, 34, 55, 21, 19, 83, 39, 56, 27, 57, 52, 32, 23, 20,
        2, 60, 33, 35, 64, 41, 66, 62, 59, 12, 36, 81, 61, 65, 75,  8, 18,
       68, 67,  7, 63, 16, 72, 76, 74,  1, 77, 69, 71, 70, 79, 80, 78, 73],
      dtype=int64)
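
The indices above are column positions in `x`. Since `x` starts with Company and Time, position 0 is Company, position 1 is Time and position i (for i ≥ 2) is feature x(i-1). Pairing the importances with the column names makes the ranking easier to read; the sketch below does this on made-up importances (the names and values are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Made-up importances for four columns (illustrative values only).
feature_names = ["Company", "Time", "x1", "x2"]
importances = np.array([0.10, 0.05, 0.60, 0.25])

# Label each importance with its column name and sort in decreasing order.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Equivalent to the argsort approach, translated to names.
order = np.argsort(importances)[::-1]
names_by_rank = [feature_names[i] for i in order]
print(ranked)
```

With the fitted forest, the same idea would read `pd.Series(rf.feature_importances_, index=x.columns)`.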

In [64]:
from sklearn.inspection import permutation_importance
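# Permutation importance: drop in score when a single column is shuffled.
# Here it is computed on the training data.
results = permutation_importance(rf, x_train, y_train)
# Mean importance of each feature over the repeated shuffles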
importance = results.importances_mean
importance

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.00021786, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.00036311, 0.        , 0.        , 0.00036311, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00029049, 0.        , 0.        ,
       0.00021786, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.00036311, 0.00058097, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.00036311, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [65]:
np.flip(importance.argsort())

array([82, 48, 84, 47, 28, 25, 61, 37, 13, 40,  8,  3, 27,  4, 29, 30, 31,
       32,  2,  5, 33, 34, 35, 36,  1, 38, 26, 24,  9, 16, 10, 11, 12,  7,
       14, 15, 17, 23, 18, 19, 39,  6, 21, 22, 20, 42, 41, 73, 66, 67, 68,
       69, 70, 71, 72, 74, 64, 75, 76, 77, 78, 79, 80, 81, 65, 63, 83, 52,
       43, 44, 45, 46, 49, 50, 51, 53, 62, 54, 55, 56, 57, 58, 59, 60,  0],
      dtype=int64)
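
Permutation importance can also be computed on held-out data, which measures how much each feature matters for generalization rather than for fitting the training set. The sketch below is a minimal illustration on synthetic data from `make_classification`; the data, the sizes and the variable names are assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 6 features, of which 2 are informative.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=2,
                                   n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0, stratify=y_toy)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column 10 times on the test set and record the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

# Feature indices from most to least important.
order = np.argsort(result.importances_mean)[::-1]
print(order)
print(result.importances_mean[order])
```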

3. GBT

In [66]:
from sklearn.ensemble import GradientBoostingClassifier
regressor = GradientBoostingClassifier(
    max_depth=2,
    n_estimators=3,
    learning_rate=1.0
)
regressor.fit(x_train, y_train)

GradientBoostingClassifier(learning_rate=1.0, max_depth=2, n_estimators=3)

In [67]:
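from sklearn.metrics import mean_squared_error

# For 0/1 labels, the mean squared error of the predicted classes equals the misclassification rate.
errors = [mean_squared_error(y_test, y_pred) for y_pred in regressor.staged_predict(x_test)]
# np.argmin returns a 0-based stage index; the matching number of estimators is this index + 1.
best_n_estimators = np.argmin(errors)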

In [68]:
best_n_estimators

1
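
`staged_predict` yields one prediction per boosting stage, starting with the model that has a single tree, so the position returned by `np.argmin` is one less than the number of trees. The sketch below shows the selection with the `+ 1` correction on synthetic data; the data, the hyperparameters and the names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, random_state=0)

gbt = GradientBoostingClassifier(max_depth=2, n_estimators=30,
                                 learning_rate=0.5, random_state=0)
gbt.fit(X_tr, y_tr)

# Validation error after 1, 2, ..., 30 trees.
val_errors = [1.0 - accuracy_score(y_val, stage_pred)
              for stage_pred in gbt.staged_predict(X_val)]

# Index 0 corresponds to 1 tree, hence the + 1.
best_n = int(np.argmin(val_errors)) + 1

# Refit with the selected number of trees.
best_gbt = GradientBoostingClassifier(max_depth=2, n_estimators=best_n,
                                      learning_rate=0.5, random_state=0)
best_gbt.fit(X_tr, y_tr)
print(best_n, best_gbt.score(X_val, y_val))
```

Selecting the number of trees on the test set, as in this sketch, gives an optimistic score; a separate validation split would be the cleaner choice.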

In [69]:
best_regressor = GradientBoostingClassifier(
    max_depth=2,
    n_estimators=best_n_estimators,
    learning_rate=1.0
)
best_regressor.fit(x_train, y_train)

GradientBoostingClassifier(learning_rate=1.0, max_depth=2, n_estimators=1)

In [70]:
best_n_estimators

1

In [71]:
# Test accuracy of the 3-stage model `regressor` (not of `best_regressor`).
accuracy = regressor.score(x_test, y_test)
accuracy

0.9466230936819172

Exercise 2
In this exercise, we consider the results of a survey given to visitors of hostels listed on
Booking.com and TripAdvisor.com. Our features are the average ratings for different
categories:
• "f1": "Staff"
• "f2": "Hostel booking"
• "f3": "Check-in and check-out"
• "f4": "Room condition"
• "f5": "Shared kitchen condition"
• "f6": "Shared space condition"
• "f7": "Extra services"
• "f8": "General conditions & conveniences"
• "f9": "Value for money"
• "f10": "Customer Co-creation"
Our target variable is the hostel's overall rating on the website. The dataset is hostel factors.csv
and can be downloaded from the website.
To avoid trouble with some Python versions, one may have to import the following
modules:

In [72]:
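import sklearn
import six
import sys
# Some older packages still import sklearn.externals.six, which recent
# scikit-learn versions no longer provide; alias it to the standalone six module.
sys.modules['sklearn.externals.six'] = six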