#Topic: Financial Data Analysis Using Expert Bayesian Framework For Bankruptcy Prediction

I. Overview of the paper

- The paper studies the use of an Expert Bayesian Framework to classify companies as likely or unlikely to go bankrupt.

- The goal of the study is a qualitative analysis of the factors that affect a company's profitability and bankruptcy risk, so that investors and banks can assess the risk of lending to or investing in a company.

II. The authors' method and results

- The authors use a Bayesian GLM regression model to build a bankruptcy classifier from the companies' financial data.

- Based on input from financial risk management experts, two models are proposed:
   -  Model #1: 5 independent variables related to total liabilities
   -  Model #2: 12 independent variables related to short-term liabilities

- Both models are fitted with stan_glm() from the rstanarm package.

- Bayesian K-fold cross-validation shows that Model #2 performs better.

- ROPE (Region of Practical Equivalence) is then used to check whether each independent variable has a practically meaningful effect. The conclusion is that falling sales and unbalanced short-term liabilities are the main factors driving bankruptcy in the near future.
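
As a rough illustration of the ROPE check in the last step, the sketch below computes the share of posterior draws of one coefficient that falls inside a ROPE. The draws are simulated with `rnorm()` rather than taken from a fitted `stan_glm()` model, and both the helper name `rope_share` and the range of ±0.18 on the log-odds scale are assumptions made for this example, not values from the paper.

```r
# Hypothetical posterior draws for one standardized coefficient
# (simulated here; in practice they would come from the fitted Bayesian model)
set.seed(42)
draws <- rnorm(4000, mean = 0.45, sd = 0.10)

# Share of posterior draws that falls inside the interval [low, high]
rope_share <- function(draws, low, high) {
  mean(draws >= low & draws <= high)
}

# Assumed ROPE on the log-odds scale
rope <- c(-0.18, 0.18)
share <- rope_share(draws, rope[1], rope[2])
print(share)
```

A share close to 0 suggests the coefficient is practically different from zero, while a share close to 1 suggests it is practically equivalent to zero.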

III. GLM logistic regression for Model #2

- Steps:
    - Step 1: Preprocess the data (train and test)
    - Step 2: Fit the model with glm() on the training data, evaluate it on the training data, and interpret the results
    - Step 3: Test the model on the test data and interpret the results
    - Step 4: Compare the group's results with those reported in the paper

Based on these results, can the conclusions in Part I be confirmed?

IV. Other methods:

Because Model #2 gives better results, the other methods also use the independent variables of Model #2 (12 variables).

The other methods are:
- Altman's Z-score
- SVM-linear kernel
- SVM-RBF kernel
- XGBOOST
- ANN
- GLM logistic regression

Compare the results of these methods with each other (Table 9 in the paper).

V. Comparing the methods and conclusions

Overall, the Bayesian GLM regression model achieves the best results in classifying companies as likely or unlikely to go bankrupt.

The other methods (SVM-linear kernel, SVM-RBF kernel, XGBOOST, ANN, GLM) also perform well, but not as well as the Bayesian GLM regression model.
Altman's Z-score gives the worst results.

These findings can be applied in finance and banking to predict company bankruptcy and to support better business decisions.

#Install and load the required packages

In [None]:
install.packages("report")
install.packages("dplyr")
install.packages("MLmetrics")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library("report")
library(dplyr)
library("MLmetrics")

#Pre-process - factor and scale data


In [None]:
# Read the data
# Attr1~64 are doubles; class = 1 means bankrupt
bankruptcy_train <- select(read.csv('bankruptcy_train_am.csv'), -X)
bankruptcy_test <- select(read.csv('bankruptcy_test_am.csv'), -X)

# Drop every row that contains an NA (the result has to be assigned back)
bankruptcy_train <- na.omit(bankruptcy_train)
bankruptcy_test <- na.omit(bankruptcy_test)

# Drop every row that contains a 0 in any of the 64 attributes
for (i in 1:64) {
      bankruptcy_train <- bankruptcy_train[which(bankruptcy_train[, i] != 0), ]
}
for (i in 1:64) {
      bankruptcy_test <- bankruptcy_test[which(bankruptcy_test[, i] != 0), ]
}


# Check the dimensions and a sample of each data frame
dim(bankruptcy_train)
head(bankruptcy_train)

dim(bankruptcy_test)
head(bankruptcy_test)

Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,0.098582283,-0.079038656,0.848377949,0.50537664,0.014338073,0.054687707,0.0264825178,1.38045494,3.288826092,0.109299238,⋯,-0.011984429,1.322212e-02,3.945575e-03,-0.041827867,0.017421086,0.182685686,-0.0530908221,5.46976463,0.007555338,0
2,-0.414167153,0.253487039,-3.534237058,-0.32865417,0.005949270,-0.136593487,-0.0613816113,-0.80632555,-0.424415478,-0.403824191,⋯,-0.127202698,2.893643e-02,7.652019e-02,-0.048028660,-0.039024219,-0.278500335,0.1480293434,-0.60949945,-0.030085647,0
3,-0.025802987,-0.008365557,0.568960250,0.13753474,0.009177676,-0.021734134,0.0079650971,-0.14378859,-0.398179581,0.019645353,⋯,0.001628959,2.682907e-03,-1.130722e-02,-0.041906300,-0.037542774,-0.217113728,-0.0306874514,-0.06677460,-0.041202499,0
4,-0.049020128,-0.053784389,1.117131142,0.11579873,0.009065360,-0.016810767,0.0007486314,0.43440523,-0.335682504,0.074469692,⋯,-0.026436663,-1.969096e-03,9.511210e-03,-0.044225148,-0.037737220,-0.241222764,-0.0315211014,-0.04535840,-0.022653701,0
5,-0.031931862,-0.005659990,-0.478874087,-0.21119299,0.008497973,-0.011609587,0.0057414075,-0.16388057,-0.635674608,0.016213148,⋯,0.074682987,1.799416e-03,-1.565010e-02,-0.033703239,-0.029732659,-0.111478178,0.0043239590,-0.42810459,-0.045761133,0
6,0.008277747,-0.013628693,-0.707783692,-0.24917182,0.008111240,-0.013602200,0.0110080407,-0.10172100,-0.190745082,0.026322018,⋯,-0.007804477,7.819498e-03,-8.977661e-03,-0.040985204,-0.036557441,-0.014870379,-0.0156533956,-0.29955160,-0.041135461,0
7,-0.019118479,0.042952335,-0.786834820,-0.24177354,0.008427405,-0.013602200,0.0079233812,-0.40946185,0.605452666,-0.045455066,⋯,-0.012501374,9.610173e-03,4.962779e-03,-0.038878203,-0.027663908,-0.188615181,-0.0158386512,-0.29772135,-0.021748961,0
8,0.018093188,0.033339480,-0.561916951,-0.22297019,0.008408997,-0.013602200,0.0126900029,-0.37345326,0.402724225,-0.033260471,⋯,0.002828555,1.733653e-02,-3.201206e-03,-0.037056647,-0.033726404,-0.187721864,-0.0150711639,-0.30508700,-0.025508536,0
9,-0.036278877,0.051427464,-0.563954292,-0.22283487,0.007867272,-0.017702894,0.0046249276,-0.43781038,-0.016375629,-0.056206374,⋯,-0.036547835,5.910367e-03,4.846829e-03,-0.032232718,-0.039536652,-0.135236893,0.0045923906,-0.42925408,-0.030425196,0
10,0.147248116,-0.026811971,-0.466242190,-0.20810597,0.008667560,-0.013602200,0.0438197260,0.02507038,1.509721152,0.043045950,⋯,0.018308334,2.672425e-02,-1.884928e-02,-0.036553005,0.054726737,0.213794119,-0.0452273664,0.80060933,-0.024939531,0


Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,0.015347705,-0.0593111444,0.62843685,0.043866892,0.009211853,0.028531613,0.020769756,0.55558498,-0.27191310,0.079133971,⋯,0.032768496,0.0056991182,-0.0298435223,-0.043979316,-0.0339510243,-0.207063042,-0.029473838,-0.095500651,-0.0353903642,0
2,-0.147325562,-0.0371694350,-0.09567928,-0.151334373,0.007454725,-0.081077768,-0.015655956,0.09131112,-0.42750431,0.039912963,⋯,-0.131174922,-0.0159498827,0.0791554363,-0.044363004,-0.0411831170,-0.078555087,0.020350455,-0.482342590,-0.0442072157,0
3,-0.017964385,0.0028359511,-0.37601435,-0.186623683,0.008451847,0.031901832,0.008775274,-0.22110739,-0.34235070,0.005435438,⋯,0.105469702,-0.0042256027,-0.0733408583,-0.023425828,-0.0349004185,-0.160523005,-0.022594808,-0.218417800,-0.0428980683,0
4,-0.172374152,0.1190643164,1.60383779,0.050040935,0.012485751,-0.173804801,-0.019948252,-0.59131232,-0.43776720,-0.142022644,⋯,0.440603270,0.3894464402,-0.3006938913,-0.948478705,-0.0405556783,-0.312621522,0.012766320,-0.459843864,0.4560009969,0
5,-0.040311446,0.0218327284,-0.48195877,-0.208888299,0.008515154,0.006507988,0.004154272,-0.35653204,-0.28379322,-0.033407147,⋯,0.022479565,0.0026215071,-0.0230076996,-0.001251845,-0.0305809743,-0.176781365,-0.021062102,-0.239142073,-0.0454180952,0
6,-0.002770682,0.0206903781,0.23778721,-0.130190389,0.008393270,-0.013602200,0.011191900,-0.31843577,0.58874386,-0.017214180,⋯,-0.002780129,0.0098589326,0.0008149053,-0.033174342,-0.0383778368,-0.041358089,-0.026292358,-0.159526045,-0.0140047167,0
7,0.472738910,-0.1020535465,2.59330517,2.427111103,0.012297736,0.182516412,0.090597549,4.83266520,0.27081661,0.136422457,⋯,0.316156122,0.0468757529,-0.2182265684,-0.044437136,-0.0353861969,-0.261141970,-0.047290130,1.169673258,-0.0022485513,0
8,0.008886352,-0.0172638649,1.01748108,-0.009225653,0.008554729,-0.013602200,0.011112331,-0.07009183,1.13116699,0.030933494,⋯,0.323410782,0.0075796228,0.0014631753,-0.042774686,-0.0384256060,0.207488355,-0.036505232,0.121696874,0.0316671954,0
9,0.070563927,-0.0631844983,2.92260523,2.024529638,0.009783519,0.093621462,0.027343874,0.69383507,-0.23749909,0.089187104,⋯,0.055622657,0.0115775460,-0.0513154897,-0.044285192,-0.0380509397,-0.153155771,-0.045354021,0.819023467,0.3355343661,0
10,-0.098080427,-0.0369058157,-0.57782579,-0.236417767,0.007874124,-0.013602200,-0.008336745,0.15021359,-0.56619509,0.055850714,⋯,-0.052972861,-0.0083480045,0.0412606288,-0.036271239,-0.0393630205,14.961853746,-0.017755479,-0.278124241,-0.0457372067,0


Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,0.098582283,-0.079038656,0.8483779,0.5053766,0.014338073,0.05468771,0.0264825178,1.3804549,3.2888261,0.10929924,⋯,-0.011984429,0.013222124,0.003945575,-0.04182787,0.01742109,0.18268569,-0.053090822,5.4697646,0.007555338,0
2,-0.414167153,0.253487039,-3.5342371,-0.3286542,0.00594927,-0.13659349,-0.0613816113,-0.8063255,-0.4244155,-0.40382419,⋯,-0.127202698,0.02893643,0.076520192,-0.04802866,-0.03902422,-0.27850034,0.148029343,-0.6094995,-0.030085647,0
3,-0.025802987,-0.008365557,0.5689602,0.1375347,0.009177676,-0.02173413,0.0079650971,-0.1437886,-0.3981796,0.01964535,⋯,0.001628959,0.002682907,-0.011307217,-0.0419063,-0.03754277,-0.21711373,-0.030687451,-0.0667746,-0.041202499,0
4,-0.049020128,-0.053784389,1.1171311,0.1157987,0.00906536,-0.01681077,0.0007486314,0.4344052,-0.3356825,0.07446969,⋯,-0.026436663,-0.001969096,0.00951121,-0.04422515,-0.03773722,-0.24122276,-0.031521101,-0.0453584,-0.022653701,0
5,-0.031931862,-0.00565999,-0.4788741,-0.211193,0.008497973,-0.01160959,0.0057414075,-0.1638806,-0.6356746,0.01621315,⋯,0.074682987,0.001799416,-0.015650099,-0.03370324,-0.02973266,-0.11147818,0.004323959,-0.4281046,-0.045761133,0
6,0.008277747,-0.013628693,-0.7077837,-0.2491718,0.00811124,-0.0136022,0.0110080407,-0.101721,-0.1907451,0.02632202,⋯,-0.007804477,0.007819498,-0.008977661,-0.0409852,-0.03655744,-0.01487038,-0.015653396,-0.2995516,-0.041135461,0


Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,0.015347705,-0.059311144,0.62843685,0.04386689,0.009211853,0.028531613,0.020769756,0.55558498,-0.2719131,0.079133971,⋯,0.032768496,0.005699118,-0.0298435223,-0.043979316,-0.03395102,-0.20706304,-0.02947384,-0.09550065,-0.03539036,0
2,-0.147325562,-0.037169435,-0.09567928,-0.15133437,0.007454725,-0.081077768,-0.015655956,0.09131112,-0.4275043,0.039912963,⋯,-0.131174922,-0.015949883,0.0791554363,-0.044363004,-0.04118312,-0.07855509,0.02035046,-0.48234259,-0.04420722,0
3,-0.017964385,0.002835951,-0.37601435,-0.18662368,0.008451847,0.031901832,0.008775274,-0.22110739,-0.3423507,0.005435438,⋯,0.105469702,-0.004225603,-0.0733408583,-0.023425828,-0.03490042,-0.160523,-0.02259481,-0.2184178,-0.04289807,0
4,-0.172374152,0.119064316,1.60383779,0.05004094,0.012485751,-0.173804801,-0.019948252,-0.59131232,-0.4377672,-0.142022644,⋯,0.44060327,0.38944644,-0.3006938913,-0.948478705,-0.04055568,-0.31262152,0.01276632,-0.45984386,0.456001,0
5,-0.040311446,0.021832728,-0.48195877,-0.2088883,0.008515154,0.006507988,0.004154272,-0.35653204,-0.2837932,-0.033407147,⋯,0.022479565,0.002621507,-0.0230076996,-0.001251845,-0.03058097,-0.17678136,-0.0210621,-0.23914207,-0.0454181,0
6,-0.002770682,0.020690378,0.23778721,-0.13019039,0.00839327,-0.0136022,0.0111919,-0.31843577,0.5887439,-0.01721418,⋯,-0.002780129,0.009858933,0.0008149053,-0.033174342,-0.03837784,-0.04135809,-0.02629236,-0.15952604,-0.01400472,0


In [None]:
# Standardize Attr1 to Attr64 (mean 0, standard deviation 1)
for (i in 1:64) {
      bankruptcy_train[i] <- scale(bankruptcy_train[i], center = TRUE, scale = TRUE)
}
for (i in 1:64) {
      bankruptcy_test[i] <- scale(bankruptcy_test[i], center = TRUE, scale = TRUE)
}

# Convert the "class" column to a factor with two levels, 0 and 1
bankruptcy_train$class <- factor(bankruptcy_train$class, levels = c(0,1))
bankruptcy_test$class <- factor(bankruptcy_test$class, levels = c(0,1))
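
The loops above standardize the test set with its own means and standard deviations. A common alternative, sketched here on toy data, is to reuse the training set's centering and scaling values so that both sets end up on the same scale. The data frames `toy_train` and `toy_test` are hypothetical; applying this to the notebook's data would change the results shown in the following cells.

```r
# Toy data standing in for the Attr columns (hypothetical values)
toy_train <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
toy_test  <- data.frame(a = c(2, 5), b = c(15, 50))

# Standardize the training set and keep the parameters that scale() stores
train_scaled <- scale(toy_train, center = TRUE, scale = TRUE)
train_center <- attr(train_scaled, "scaled:center")
train_sd     <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(toy_test, center = train_center, scale = train_sd)

print(train_scaled)
print(test_scaled)
```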

In [None]:
# Check the class levels, the dimensions and a sample of each data frame again
unique(bankruptcy_train$class)
dim(bankruptcy_train)
head(bankruptcy_train)

unique(bankruptcy_test$class)
dim(bankruptcy_test)
head(bankruptcy_test)

Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",⋯,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",<fct>
1,0.72080134,-1.3651455,0.8963051,0.8075718,0.01704424,0.153642007,0.59824773,1.35819858,4.7011425,1.456725,⋯,-0.0034573223,0.021453269,-0.002032426,-0.03979651,0.07275514,0.20481145,-0.0485526227,6.06335245,0.10868821,0
2,-2.6243319,3.7100133,-3.6762076,-0.4927762,0.01074452,-0.279918266,-2.53125612,-0.78009739,-0.5987456,-4.7679737,⋯,-0.0906992064,0.035608455,0.052879238,-0.04482447,-0.07761755,-0.31930231,0.1234484451,-0.6768251,-0.12075922,0
3,-0.09067747,-0.2865004,0.6047803,0.2340648,0.01316895,-0.019576665,-0.06129718,-0.13224996,-0.5612992,0.369134,⋯,0.0068505716,0.011959717,-0.013573047,-0.0398601,-0.07367092,-0.24953964,-0.0293929145,-0.07509735,-0.18852401,0
4,-0.2421441,-0.9797033,1.1767033,0.2001759,0.0130846,-0.008417301,-0.31832991,0.43312435,-0.4720975,1.0342079,⋯,-0.0144003795,0.007769269,0.002178665,-0.04174036,-0.07418893,-0.2769383,-0.0301058648,-0.05135287,-0.07545645,0
5,-0.13066172,-0.2452068,-0.4884565,-0.309641,0.01265851,0.003371757,-0.14049953,-0.15189647,-0.9002745,0.327498,⋯,0.0621662005,0.011163882,-0.016858974,-0.03320859,-0.05286444,-0.12949037,0.0005493838,-0.4757097,-0.21631199,0
6,0.13166231,-0.3668288,-0.7272847,-0.3688543,0.01236809,-0.001144725,0.04708498,-0.09111508,-0.2652292,0.4501286,⋯,-0.0002923126,0.016586673,-0.011810451,-0.03911323,-0.07104595,-0.01970067,-0.016535558,-0.33318094,-0.18811537,0


Unnamed: 0_level_0,Attr1,Attr2,Attr3,Attr4,Attr5,Attr6,Attr7,Attr8,Attr9,Attr10,⋯,Attr56,Attr57,Attr58,Attr59,Attr60,Attr61,Attr62,Attr63,Attr64,class
Unnamed: 0_level_1,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",⋯,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>",<fct>
1,0.09807784,-0.97173739,0.6463261,0.1120418,0.02075017,0.6247776,0.26811421,0.6881323,-0.3985619,0.76271236,⋯,0.08371726,-0.026223711,-0.10480369,-0.073864169,-0.08353794,-0.18564348,-0.033167746,-0.1111191,-0.03198952,0
2,-0.63573441,-0.65860612,-0.1551898,-0.2858278,-0.06220968,-1.5566432,-0.68116663,0.1268638,-0.6380421,0.41847284,⋯,-0.52057404,-0.129888732,0.47632389,-0.07521015,-0.10630203,-0.06345483,0.014698372,-0.7578489,-0.03617596,0
3,-0.05219159,-0.09284414,-0.4654895,-0.3577563,-0.01513224,0.6918509,-0.04447081,-0.2508242,-0.5069769,0.11586648,⋯,0.35169207,-0.073747678,-0.33670959,-0.001762414,-0.0865263,-0.14139202,-0.026559073,-0.3166143,-0.03555434,0
4,-0.74872754,1.55087427,1.7259864,0.1246261,0.17532177,-3.4020754,-0.79302703,-0.6983712,-0.6538384,-1.17836083,⋯,1.58698594,1.811328721,-1.54884151,-3.24685332,-0.10432707,-0.28601118,0.007412305,-0.7202351,0.20133398,0
5,-0.15299825,0.17581104,-0.5827582,-0.4031372,-0.01214332,0.1864685,-0.16489753,-0.4145414,-0.4168474,-0.22505166,⋯,0.04579252,-0.040960679,-0.06835852,0.076024048,-0.0729302,-0.15685089,-0.025086604,-0.3512616,-0.03675091,0
6,0.01634656,0.15965575,0.2139205,-0.242731,-0.01789785,-0.2137599,0.01850823,-0.3684862,0.926129,-0.08292734,⋯,-0.04731408,-0.006304674,0.05865161,-0.035960259,-0.09747199,-0.02808698,-0.030111303,-0.218158,-0.02183514,0


#GLM logistic regression

In [None]:
logitModel  <- glm(class ~ ., data = bankruptcy_train, family = binomial(link="logit"))
# summary (logitModel)

“glm.fit: algorithm did not converge”
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
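
The second warning above can be reproduced on a small toy data set in which a single predictor separates the two classes perfectly. This is only a sketch of the mechanism; the data frame `toy` is made up, and the notebook's data may trigger the warning for related reasons such as quasi-complete separation or extreme predictor values.

```r
# Toy data: every x below 0 has y = 0 and every x above 0 has y = 1
toy <- data.frame(x = c(-3, -2, -1, 1, 2, 3),
                  y = c(0, 0, 0, 1, 1, 1))

# Collect the warnings raised by glm() instead of printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, data = toy, family = binomial(link = "logit")),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(msgs)
# The slope estimate grows very large because no finite value maximizes the likelihood
print(coef(fit))
```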


In [None]:
# Check the linear correlation between the variables
# cor(select (bankruptcy_train, -class))

"glm.fit: algorithm did not converge": the GLM fit did not reach a stable optimum within the allowed number of iterations of the fitting algorithm (iteratively reweighted least squares).

"glm.fit: fitted probabilities numerically 0 or 1 occurred": for some observations the fitted probability is numerically indistinguishable from 0 or 1. This typically indicates complete or quasi-complete separation, which is plausible here because the model has 64 predictors while the training set contains only 113 bankrupt companies; extreme predictor values can also trigger it.

#Check the results on the training data

In [None]:
# Predicted probability that each observation has class = 1
predicted_train <- predict(logitModel, bankruptcy_train, type="response")


“prediction from a rank-deficient fit may be misleading”


The warning "prediction from a rank-deficient fit may be misleading" appears.

=> The model is rank-deficient: some columns of the design matrix are (almost) exact linear combinations of others, that is, some independent variables are perfectly or very strongly correlated. Having more independent variables than observations causes the same problem, but that is not the case here (64 variables, 5600 observations). The maximum likelihood estimates of the affected coefficients are then not uniquely determined.

=> R sets the coefficients it cannot estimate to NA, which effectively drops those variables from the model. Predictions from such a fit may be unreliable.

Possible remedies:
- Use principal component analysis (PCA) to reduce the dimension of the predictor space
- Use a different linear model that copes better with strongly correlated predictors.
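
The PCA remedy can be sketched on simulated data as follows. The data frame `sim` and its columns are invented for this example; `x3` is an exact linear combination of `x1` and `x2`, which makes the direct fit rank-deficient. The variance cut-off of `1e-8` is an arbitrary choice for the sketch.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2                      # exact linear combination of x1 and x2
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1))
sim <- data.frame(x1, x2, x3, y)

# Direct fit: the coefficient of the aliased column is reported as NA
full_fit <- glm(y ~ x1 + x2 + x3, data = sim, family = binomial(link = "logit"))
print(coef(full_fit))

# PCA on the predictors, keeping only components with non-negligible variance
pca  <- prcomp(sim[, c("x1", "x2", "x3")], center = TRUE, scale. = TRUE)
keep <- which(pca$sdev^2 > 1e-8)
pcs  <- as.data.frame(pca$x[, keep, drop = FALSE])
pcs$y <- sim$y

# Fit on the retained components: every coefficient can be estimated
pca_fit <- glm(y ~ ., data = pcs, family = binomial(link = "logit"))
print(coef(pca_fit))
```

The price of this approach is interpretability: the coefficients now refer to principal components rather than to the original financial ratios.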

In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_train_scaled <- as.integer(predicted_train >= 0.5)
predicted_train_scaled

In [None]:
table(bankruptcy_train$class, predicted_train_scaled)

   predicted_train_scaled
       0    1
  0 5400   87
  1   99   14

Metrics for evaluating the model's predictions on the given data set:

- Accuracy: the share of correct predictions among all predictions. High accuracy means the model classifies more data points correctly.

- Precision: the share of true positives (class = 1) among all positive predictions. High precision means the model rarely labels a data point as positive by mistake.

- Recall: the share of true positives (class = 1) among all data points that are actually positive. High recall means the model misses few positive data points.

- F1-Score: the harmonic mean of Precision and Recall. It summarizes performance on the positive class (class = 1) only and ignores true negatives, so a high F1-Score means the model detects positives both precisely and completely.
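
To make the four definitions concrete, the following sketch recomputes them by hand from the training confusion matrix shown above (rows are actual classes, columns are predicted classes), with class = 1 treated as the positive class. The variable names are chosen for this example only.

```r
# Counts from the training confusion matrix of the full model
tn <- 5400; fp <- 87   # actual class 0
fn <- 99;   tp <- 14   # actual class 1

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 4)
```

Accuracy comes out near 0.97 while precision, recall and F1 all stay below 0.14, which shows how misleading accuracy alone can be when only about 2% of the companies are bankrupt.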


In [None]:
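# positive = "1" makes bankruptcy the positive class; by default MLmetrics uses the first level ("0")
Accuracy(predicted_train_scaled, bankruptcy_train$class)
Precision(bankruptcy_train$class, predicted_train_scaled, positive = "1")
Recall(bankruptcy_train$class, predicted_train_scaled, positive = "1")
F1_Score(bankruptcy_train$class, predicted_train_scaled, positive = "1")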

#Check the results on the test data

In [None]:
# Predicted probability that each observation in the test data has class = 1
predicted <- predict(logitModel, bankruptcy_test, type="response")

“prediction from a rank-deficient fit may be misleading”


In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_scaled <- as.integer(predicted >= 0.5)
predicted_scaled

In [None]:
table(predicted_scaled)
table(bankruptcy_test$class )
table(bankruptcy_test$class, predicted_scaled)

predicted_scaled
   0    1 
1014 1138 


   0    1 
2105   47 

   predicted_scaled
       0    1
  0  975 1130
  1   39    8

In [None]:
# Bankruptcy (class = 1) is the positive class
Accuracy(predicted_scaled, bankruptcy_test$class)
Precision(bankruptcy_test$class, predicted_scaled, positive = "1")
Recall(bankruptcy_test$class, predicted_scaled, positive = "1")
F1_Score(bankruptcy_test$class, predicted_scaled, positive = "1")

Accuracy on the test data is low, so the model should be improved by reselecting the input variables.
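
Because only 47 of the 2152 test companies are bankrupt, accuracy is easier to judge against a baseline that always predicts class = 0. A small sketch using the counts from the test confusion matrix above; the variable names are chosen for this example.

```r
# Counts from the test confusion matrix of the full model
tn <- 975; fp <- 1130  # actual class 0
fn <- 39;  tp <- 8     # actual class 1
n  <- tn + fp + fn + tp

model_accuracy <- (tp + tn) / n
# A classifier that always predicts class = 0 is right for every non-bankrupt company
baseline_accuracy <- (tn + fp) / n

round(c(model = model_accuracy, baseline = baseline_accuracy), 4)
```

The full model scores well below this trivial baseline on the test data, which may point to overfitting with all 64 predictors.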

# GLM logistic regression with Expert No.1
Uses the ratios related to total liabilities

Attr5, Attr24, Attr25, Attr26, Attr34



In [None]:
logitModel_Expert1  <- glm(class ~ Attr5 + Attr24 + Attr25 + Attr26 + Attr34
                                 , data = bankruptcy_train, family = binomial(link="logit"))

“glm.fit: fitted probabilities numerically 0 or 1 occurred”


In [None]:
# Predicted probability that each observation has class = 1
predicted_train <- predict(logitModel_Expert1, bankruptcy_train, type="response")


In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_train_scaled <- as.integer(predicted_train >= 0.5)
predicted_train_scaled

In [None]:
table(bankruptcy_train$class, predicted_train_scaled)

   predicted_train_scaled
       0    1
  0 5484    3
  1  113    0

In [None]:
# Bankruptcy (class = 1) is the positive class
Accuracy(predicted_train_scaled, bankruptcy_train$class)
Precision(bankruptcy_train$class, predicted_train_scaled, positive = "1")
Recall(bankruptcy_train$class, predicted_train_scaled, positive = "1")
F1_Score(bankruptcy_train$class, predicted_train_scaled, positive = "1")

###Check GLM Expert No.1 on the test data

In [None]:
# Predicted probability that each observation in the test data has class = 1
predicted <- predict(logitModel_Expert1, bankruptcy_test, type="response")

In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_scaled <- as.integer(predicted >= 0.5)
predicted_scaled

In [None]:
table(predicted_scaled)
table(bankruptcy_test$class)
table(bankruptcy_test$class, predicted_scaled)

predicted_scaled
   0    1 
2149    3 


   0    1 
2105   47 

   predicted_scaled
       0    1
  0 2103    2
  1   46    1

In [None]:
# Bankruptcy (class = 1) is the positive class
Accuracy(predicted_scaled, bankruptcy_test$class)
Precision(bankruptcy_test$class, predicted_scaled, positive = "1")
Recall(bankruptcy_test$class, predicted_scaled, positive = "1")
F1_Score(bankruptcy_test$class, predicted_scaled, positive = "1")

###Hypothesis testing

In [None]:
summary (logitModel_Expert1)


Call:
glm(formula = class ~ Attr5 + Attr24 + Attr25 + Attr26 + Attr34, 
    family = binomial(link = "logit"), data = bankruptcy_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2803  -0.2187  -0.1826  -0.1464   4.3653  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.20455    0.11910 -35.302  < 2e-16 ***
Attr5        0.03638    0.29028   0.125 0.900261    
Attr24       0.14244    0.14945   0.953 0.340530    
Attr25      -0.24127    0.06333  -3.810 0.000139 ***
Attr26      -1.00263    0.17303  -5.795 6.85e-09 ***
Attr34       0.40993    0.06635   6.178 6.50e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1105.8  on 5599  degrees of freedom
Residual deviance: 1029.0  on 5594  degrees of freedom
AIC: 1041

Number of Fisher Scoring iterations: 7


Attr25, Attr26 and Attr34 are highly significant (p < 0.001):

25 (equity - share capital) / total assets

26 (net profit + depreciation) / total liabilities

34 operating expenses / total liabilities
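
One way to read these coefficients is as odds ratios. The sketch below takes the estimates and standard errors of the three significant ratios from the summary output above and converts them to odds ratios with 95% Wald intervals. Because the predictors were standardized, one unit corresponds to one standard deviation. The object names are chosen for this example.

```r
# Estimates and standard errors copied from the summary output above
est <- c(Attr25 = -0.24127, Attr26 = -1.00263, Attr34 = 0.40993)
se  <- c(Attr25 =  0.06333, Attr26 =  0.17303, Attr34 = 0.06635)

# Odds ratio per one standard deviation increase, with a 95% Wald interval
z <- qnorm(0.975)
odds <- data.frame(
  odds_ratio = exp(est),
  lower      = exp(est - z * se),
  upper      = exp(est + z * se)
)
round(odds, 3)
```

An odds ratio below 1 (Attr25, Attr26) means that higher values of the ratio go with lower odds of bankruptcy; an odds ratio above 1 (Attr34) means the opposite.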


# GLM logistic regression with Expert No.2
Uses the ratios related to short-term liabilities

Attr8, Attr10, Attr12, Attr20, Attr33, Attr40, Attr42, Attr46, Attr49, Attr59, Attr63, Attr64
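
Since the two expert models differ only in their predictor lists, the model formulas can also be built from character vectors instead of being typed out by hand. A minimal sketch using base R's `reformulate()`; the object names `expert1_vars`, `expert2_vars`, `expert1_formula` and `expert2_formula` are chosen for this example.

```r
# Predictor names for the two expert models, as listed in this notebook
expert1_vars <- c("Attr5", "Attr24", "Attr25", "Attr26", "Attr34")
expert2_vars <- c("Attr8", "Attr10", "Attr12", "Attr20", "Attr33", "Attr40",
                  "Attr42", "Attr46", "Attr49", "Attr59", "Attr63", "Attr64")

# reformulate() builds a formula of the form class ~ Attr8 + Attr10 + ...
expert1_formula <- reformulate(expert1_vars, response = "class")
expert2_formula <- reformulate(expert2_vars, response = "class")

print(expert2_formula)
# The formula can then be passed to glm(), for example:
# glm(expert2_formula, data = bankruptcy_train, family = binomial(link = "logit"))
```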


In [None]:
logitModel_Expert2  <- glm(class ~ Attr8 + Attr10 + Attr12 + Attr20 + Attr33 + Attr40
                                 + Attr42 + Attr46 + Attr49 + Attr59 + Attr63 + Attr64
                                 , data = bankruptcy_train, family = binomial(link="logit"))

“glm.fit: fitted probabilities numerically 0 or 1 occurred”


In [None]:
# Predicted probability that each observation has class = 1
predicted_train <- predict(logitModel_Expert2, bankruptcy_train, type="response")


In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_train_scaled <- as.integer(predicted_train >= 0.5)
predicted_train_scaled

In [None]:
table(bankruptcy_train$class, predicted_train_scaled)

   predicted_train_scaled
       0    1
  0 5487    0
  1  112    1

In [None]:
# Bankruptcy (class = 1) is the positive class
Accuracy(predicted_train_scaled, bankruptcy_train$class)
Precision(bankruptcy_train$class, predicted_train_scaled, positive = "1")
Recall(bankruptcy_train$class, predicted_train_scaled, positive = "1")
F1_Score(bankruptcy_train$class, predicted_train_scaled, positive = "1")

###Check GLM Expert No.2 on the test data

In [None]:
# Predicted probability that each observation in the test data has class = 1
predicted <- predict(logitModel_Expert2, bankruptcy_test, type="response")

In [None]:
# Predict class = 1 if the predicted probability is at least 0.5, otherwise class = 0
predicted_scaled <- as.integer(predicted >= 0.5)
predicted_scaled

In [None]:
table(predicted_scaled)
table(bankruptcy_test$class)
table(bankruptcy_test$class, predicted_scaled)

predicted_scaled
   0    1 
2070   82 


   0    1 
2105   47 

   predicted_scaled
       0    1
  0 2039   66
  1   31   16

In [None]:
# Bankruptcy (class = 1) is the positive class
Accuracy(predicted_scaled, bankruptcy_test$class)
Precision(bankruptcy_test$class, predicted_scaled, positive = "1")
Recall(bankruptcy_test$class, predicted_scaled, positive = "1")
F1_Score(bankruptcy_test$class, predicted_scaled, positive = "1")

###Hypothesis testing

In [None]:
summary (logitModel_Expert2)


Call:
glm(formula = class ~ Attr8 + Attr10 + Attr12 + Attr20 + Attr33 + 
    Attr40 + Attr42 + Attr46 + Attr49 + Attr59 + Attr63 + Attr64, 
    family = binomial(link = "logit"), data = bankruptcy_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8642  -0.2367  -0.1675  -0.0912   3.9034  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.13256    0.23025 -22.292  < 2e-16 ***
Attr8         0.01649    0.07925   0.208 0.835184    
Attr10        0.02294    0.09719   0.236 0.813439    
Attr12        0.22030    0.18994   1.160 0.246109    
Attr20       -0.09344    0.23358  -0.400 0.689121    
Attr33        9.92255    2.04060   4.863 1.16e-06 ***
Attr40        2.87819    0.53284   5.402 6.60e-08 ***
Attr42      -27.37770    8.12182  -3.371 0.000749 ***
Attr46       -4.37675    0.78197  -5.597 2.18e-08 ***
Attr49       27.42524    8.13388   3.372 0.000747 ***
Attr59       -0.03237    0.21838  -0.148 0.882171    
Attr63      -11.09

Are Attr33, 40, 42, 46, 49 and 63 highly significant?

33 operating expenses / short-term liabilities

40 (current assets - inventory - receivables) / short-term liabilities

42 profit on operating activities / sales

46 (current assets - inventory) / short-term liabilities

49 EBITDA (profit on operating activities - depreciation) / sales

63 sales / short-term liabilities

=> These ratios relate to current assets, short-term liabilities and sales. Is that in line with the paper's conclusion?