# Mini Project: Tree-Based Algorithms

## The "German Credit" Dataset

### Dataset Details

This dataset has two classes (these would be considered labels in Machine Learning terms) to describe the worthiness of a personal loan: "Good" or "Bad". There are predictors related to attributes, such as: checking account status, duration, credit history, purpose of the loan, amount of the loan, savings accounts or bonds, employment duration, installment rate in percentage of disposable income, personal information, other debtors/guarantors, residence duration, property, age, other installment plans, housing, number of existing credits, job information, number of people being liable to provide maintenance for, telephone, and foreign worker status.

Many of these predictors are discrete and have been expanded into several 0/1 indicator variables (i.e., they have been one-hot encoded).
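To make the 0/1 expansion concrete, here is a minimal sketch using pandas. The toy frame and its column names (`Housing`, `Amount`) are made up for illustration and are not claimed to match the column names in the actual file.

```python
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
raw = pd.DataFrame({
    "Housing": ["Own", "Rent", "ForFree", "Own"],
    "Amount": [1169, 5951, 2096, 7882],
})

# get_dummies replaces the categorical column with one 0/1 column per level.
encoded = pd.get_dummies(raw, columns=["Housing"], dtype=int)
print(encoded)
```

Each row has exactly one of the `Housing_*` indicators set to 1, which is how a single discrete predictor turns into several columns.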

This dataset has been kindly provided by Professor Dr. Hans Hofmann of the University of Hamburg, and can also be found on the UCI Machine Learning Repository.

## Decision Trees

As we have learned in the previous lectures, Decision Trees as a family of algorithms (irrespective of the particular implementation) are powerful algorithms that can produce models with higher predictive accuracy than linear models, such as Linear or Logistic Regression. This is primarily because DTs can model nonlinear relationships and have a number of tuning parameters that allow the practitioner to achieve the best possible model. An added bonus is the ability to visualize the trained Decision Tree model, which gives some insight into how the model produced its predictions. One caveat to keep in mind is that, depending on the size of the dataset (both the number of records and the number of features), the visualization might be very large and complex, which makes it harder to interpret.
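As a rough illustration of the nonlinearity point, the sketch below compares Logistic Regression and a Decision Tree on a synthetic XOR-style problem. The data is generated here purely as an assumption for the demo and has nothing to do with the "German Credit" dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label is 1 when both coordinates share the same sign,
# a pattern that no single straight line can separate.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-1, 1, size=(2000, 2))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

logit_acc = LogisticRegression().fit(Xtr, ytr).score(Xte, yte)
tree_acc = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
print(f"logistic regression: {logit_acc:.2f}, decision tree: {tree_acc:.2f}")
```

On real data the gap is usually far smaller, and a tree is not guaranteed to beat a linear model.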

To give you a very good example of how Decision Trees can be visualized and interpreted, we strongly recommend that, before continuing with the problems in this Mini Project, you take the time to read this fantastic, detailed and informative blog post: http://explained.ai/decision-tree-viz/index.html

## Building Your First Decision Tree Model

So, now it's time to jump straight into the heart of the matter. Your first task is to build a Decision Tree model using the aforementioned "German Credit" dataset, which contains 1,000 records and 62 columns (one of them holds the labels, and the other 61 are the potential features for the model).

For this task, you will be using the scikit-learn library, which comes pre-installed with the Anaconda Python distribution. If you're not using Anaconda, you can easily install it using pip.

Before embarking on creating your first model, we would strongly encourage you to read the short tutorial for Decision Trees in scikit-learn (http://scikit-learn.org/stable/modules/tree.html), and then dive a bit deeper into the documentation of the algorithm itself (http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 

Also, since you want to be able to present the results of your model, we suggest you take a look at the tutorial for accuracy metrics for classification models (http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) as well as the more detailed documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

Finally, an *amazing* resource that explains the various classification model accuracy metrics, as well as the relationships between them, can be found on Wikipedia: https://en.wikipedia.org/wiki/Confusion_matrix
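As a small worked example of how those metrics relate to the confusion matrix, here is a sketch that derives precision, recall and F1 by hand and cross-checks them against scikit-learn. The labels are invented for illustration, and treating "Bad" as the positive class is an assumption made only for this demo.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels, invented for illustration only.
y_true = ["Good", "Good", "Good", "Good", "Bad", "Bad", "Bad", "Good"]
y_hat = ["Good", "Good", "Bad", "Good", "Bad", "Good", "Bad", "Good"]

# Rows are true classes, columns are predicted classes, ordered as in `labels`.
cm = confusion_matrix(y_true, y_hat, labels=["Bad", "Good"])
tp, fn = cm[0, 0], cm[0, 1]  # true "Bad" predicted as "Bad" / as "Good"
fp, tn = cm[1, 0], cm[1, 1]  # true "Good" predicted as "Bad" / as "Good"

precision = tp / (tp + fp)  # of everything flagged "Bad", how much really was
recall = tp / (tp + fn)     # of everything really "Bad", how much was flagged
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(precision, precision_score(y_true, y_hat, pos_label="Bad"))
print(recall, recall_score(y_true, y_hat, pos_label="Bad"))
```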

(Note: as you've already learned in the Logistic Regression mini project, a standard practice in Machine Learning for achieving the best possible result when training a model is to use hyperparameter tuning, through Grid Search and k-fold Cross Validation. We strongly encourage you to use it here as well, not just because it's standard practice, but also because it's not going to be too computationally intensive, given the size of the dataset that you're working with. Our suggestion is that you split the data into 70% training and 30% testing. Then, do the hyperparameter tuning and Cross Validation on the training set, and afterwards do a final test on the testing set.)

### Now we pass the torch onto you! You can start building your first Decision Tree model! :)

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [6]:
# Your code here!
from sklearn.metrics import accuracy_score
gc = pd.read_csv("GermanCredit.csv.zip")

X = gc.drop('Class', axis = 1).values
y = gc['Class'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 120)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
dt_acc = accuracy_score(y_test, y_pred)
dt_acc

              precision    recall  f1-score   support

         Bad       0.58      0.47      0.52       100
        Good       0.76      0.83      0.79       200

    accuracy                           0.71       300
   macro avg       0.67      0.65      0.66       300
weighted avg       0.70      0.71      0.70       300



0.71

In [7]:
params = {'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': range(1, 62, 5), 
          'min_samples_leaf': range(1, 102, 5)
         }

import time

start = time.time_ns()
grid = GridSearchCV(DecisionTreeClassifier(), param_grid = params, cv = 10, n_jobs = -1) 
grid.fit(X_train, y_train)
end = time.time_ns()
t = (end - start) // 1e9 
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
print("Took", t, "seconds.")

Best: 0.728571 using {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 81}
Took 9.0 seconds.


In [8]:
best_clf = DecisionTreeClassifier(criterion = 'gini', max_depth = 6, min_samples_leaf = 81)
best_clf.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         Bad       0.56      0.22      0.32       100
        Good       0.70      0.92      0.79       200

    accuracy                           0.68       300
   macro avg       0.63      0.57      0.56       300
weighted avg       0.66      0.68      0.63       300
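As a side note on the tuning workflow, `GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), so the tuned model can be taken from `best_estimator_` instead of being re-created by hand. A minimal sketch, using synthetic stand-in data and a deliberately small, hypothetical parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the notebook itself uses GermanCredit.csv.zip).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 10, 30]},
    cv=5,
)
search.fit(Xtr, ytr)

# With refit=True (the default), best_estimator_ is already trained on all of Xtr.
tuned = search.best_estimator_
print(search.best_params_)
print(classification_report(yte, tuned.predict(Xte)))
```

Keep in mind that `best_score_` is a cross-validated estimate on the training data, so the score on the held-out test set can come out lower, as it did above.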



### After you've built the best model you can, now it's time to visualize it!

Remember that amazing blog post from a few paragraphs ago, which demonstrated how to visualize and interpret the results of a Decision Tree model? We've seen that this can work very well, but let's see how it does on the "German Credit" dataset, which is a bit larger than the one used by the blog authors.

First, we're going to need to install their package. If you're using Anaconda, this can be done easily by running:

In [9]:
#! pip3 install dtreeviz

If for any reason this way of installing doesn't work for you straight out of the box, please refer to the more detailed documentation here: https://github.com/parrt/dtreeviz

Now you're ready to visualize your Decision Tree model! Please feel free to use the blog post for guidance and inspiration!
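If installing dtreeviz turns out to be a problem, scikit-learn's own `export_text` gives a plain-text view of a fitted tree. This is a different and much simpler technique than the dtreeviz plots in the blog post. The sketch below uses synthetic data and hypothetical feature names; with the notebook's objects you would pass the fitted classifier and the real column names instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data and made-up feature names (assumptions for the demo).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree keeps the printed rules short enough to read.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

rules = export_text(small_tree, feature_names=names)
print(rules)
```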

In [10]:
# Your code here! :)
'''
from dtreeviz.trees import *

feature_names = gc.drop('Class', axis = 1).columns
viz = dtreeviz.model(clf, X_train, y_train, target_name = 'Class', feature_names = feature_names, class_names = [1, 0])

#viz.view()

IGNORED!
'''

## Random Forests

As discussed in the lecture videos, Decision Tree algorithms also have certain undesirable properties. Mainly, they have low bias, which is good, but tend to have high variance, which is *not* so good (more about this problem here: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
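One rough way to see the variance problem is to refit a model on bootstrap resamples of the same training data and measure how much its predictions for fixed evaluation points move around. The sketch below does this for a single fully grown tree and for a forest. The synthetic data and the helper `prediction_spread` are assumptions introduced for this demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (an assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_fit, y_fit, X_eval = X_demo[:400], y_demo[:400], X_demo[400:]


def prediction_spread(make_model, n_rounds=20):
    """Average per-point variance of predicted P(class 1) across bootstrap refits."""
    rng = np.random.default_rng(0)
    probs = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_fit), size=len(X_fit))  # bootstrap resample
        model = make_model().fit(X_fit[idx], y_fit[idx])
        probs.append(model.predict_proba(X_eval)[:, 1])
    return np.var(np.array(probs), axis=0).mean()


tree_spread = prediction_spread(lambda: DecisionTreeClassifier(random_state=0))
forest_spread = prediction_spread(
    lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
print(f"single tree: {tree_spread:.4f}, forest: {forest_spread:.4f}")
```

A smaller spread for the forest is what averaging many decorrelated trees is meant to achieve.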

Noticing these problems, the late Professor Leo Breiman, in 2001, developed the Random Forests algorithm, which mitigates these problems, while at the same time providing even higher predictive accuracy than the majority of Decision Tree algorithm implementations. While the curriculum contains two excellent lectures on Random Forests, if you're interested, you can dive into the original paper here: https://link.springer.com/content/pdf/10.1023%2FA%3A1010933404324.pdf.

In the next part of this assignment, you are going to use the same "German Credit" dataset to train, tune, and measure the performance of a Random Forests model. You will also see certain functionalities that this model, even though it's a bit of a "black box", provides for some degree of interpretability.

First, let's build a Random Forests model, using the same best practices that you've used for your Decision Trees model. You can reuse the things you've already imported there, so no need to do any re-imports, new train/test splits, or loading up the data again.

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [12]:
# Your code here! :)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(classification_report(y_test, y_pred))
rf_acc = accuracy_score(y_test, y_pred)
rf_acc

              precision    recall  f1-score   support

         Bad       0.67      0.35      0.46       100
        Good       0.74      0.92      0.82       200

    accuracy                           0.73       300
   macro avg       0.71      0.63      0.64       300
weighted avg       0.72      0.73      0.70       300



0.7266666666666667

In [13]:
params = {'n_estimators': range(1, 200, 15),
          'criterion': ['gini', 'entropy', 'log_loss'],
          'min_samples_leaf': range(1, 102, 5)
         }

start = time.time_ns()
grid_rf = GridSearchCV(RandomForestClassifier(), param_grid = params, cv = 10, n_jobs = -1) 
grid_rf.fit(X_train, y_train)
end = time.time_ns()
t = (end - start) // 1e9 
print("Best: %f using %s" % (grid_rf.best_score_, grid_rf.best_params_))
print("Took", t, "seconds.")

Best: 0.764286 using {'criterion': 'log_loss', 'min_samples_leaf': 1, 'n_estimators': 136}
Took 330.0 seconds.


In [17]:
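# The search reported 'log_loss'; scikit-learn treats 'entropy' and 'log_loss'
# as the same criterion (Shannon information gain), so 'entropy' is equivalent here.
best_rf = RandomForestClassifier(criterion = 'entropy', n_estimators = 136, min_samples_leaf = 1)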
best_rf.fit(X_train, y_train)
y_pred = best_rf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         Bad       0.78      0.36      0.49       100
        Good       0.75      0.95      0.84       200

    accuracy                           0.75       300
   macro avg       0.77      0.66      0.67       300
weighted avg       0.76      0.75      0.72       300



As mentioned, there are certain ways to "peek" into a model created by the Random Forests algorithm. The first, and most popular one, is the Feature Importance calculation functionality. This allows the ML practitioner to see an ordering of the importance of the features that have contributed the most to the predictive accuracy of the model. 

You can see how to use this in the scikit-learn documentation (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). Now, if you tried this, you would just get an ordered table of not directly interpretable numeric values. Thus, it's much more useful to show the feature importance in a visual way. You can see an example of how that's done here: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#sphx-glr-auto-examples-ensemble-plot-forest-importances-py

Now you try! Let's visualize the importance of features from your Random Forests model!

In [39]:
# Your code here
print("Yeah not happening")

Yeah not happening
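For reference, a feature-importance plot along the lines of the linked scikit-learn example could look roughly like the sketch below. It uses synthetic data and hypothetical feature names so that it runs on its own; in this notebook you would presumably swap in `best_rf` and the column names from `gc.drop('Class', axis = 1).columns`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and made-up feature names (assumptions for the demo).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# feature_importances_ holds the impurity-based importances, which sum to 1.
importances = pd.Series(forest.feature_importances_, index=names).sort_values()

ax = importances.plot.barh(figsize=(6, 4))
ax.set_xlabel("Mean decrease in impurity")
ax.set_title("Random Forest feature importances")
plt.tight_layout()
```

With 61 one-hot-encoded columns, plotting only the top 15 or so (for example `importances.tail(15)`) tends to be easier to read.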


A final method for gaining some insight into the inner working of your Random Forests models is a so-called Partial Dependence Plot. The Partial Dependence Plot (PDP or PD plot) shows the marginal effect of a feature on the predicted outcome of a previously fit model. The prediction function is fixed at a few values of the chosen features and averaged over the other features. A partial dependence plot can show if the relationship between the target and a feature is linear, monotonic or more complex. 

When this project was written (scikit-learn 0.20.0), PDPs were implemented in scikit-learn only for certain algorithms and not for Random Forests; later releases added a model-agnostic implementation in the `sklearn.inspection` module. There is also an add-on package called **PDPbox** (https://pdpbox.readthedocs.io/en/latest/) which provides this functionality for Random Forests. The package is easy to install through pip.

In [40]:
#! pip3 install pdpbox

Collecting pdpbox
  Downloading PDPbox-0.3.0-py3-none-any.whl (35.8 MB)
Collecting xgboost>=1.7.1
  Downloading xgboost-1.7.6-py3-none-macosx_10_15_x86_64.macosx_11_0_x86_64.macosx_12_0_x86_64.whl (1.8 MB)
Collecting pqdm>=0.2.0
  Downloading pqdm-0.2.0-py2.py3-none-any.whl (6.8 kB)
Collecting sphinx-rtd-theme>=1.1.1
  Downloading sphinx_rtd_theme-1.2.2-py2.py3-none-any.whl (2.8 MB)
Collecting bounded-pool-executor
  Downloading bounded_pool_executor-0.0.3-py3-none-any.whl (3.4 kB)
Collecting sphinxcontrib-jquery<5,>=4
  Downloading sphinxcontrib_jquery-4.1-py2.py3-none-any.whl (1

While we encourage you to read the documentation for the package (and reading package documentation in general is a good habit to develop), the authors of the package have also written an excellent blog post on how to use it, showing examples on different algorithms from scikit-learn (the Random Forests example is towards the end of the blog post): https://briangriner.github.io/Partial_Dependence_Plots_presentation-BrianGriner-PrincetonPublicLibrary-4.14.18-updated-4.22.18.html

So, armed with this new knowledge, feel free to pick a few features, and make a couple of Partial Dependence Plots of your own!

In [64]:
# Your code here!
import pdpbox
print("The GitHub page no longer exists...")

The GitHub page no longer exists...
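Since the PDPbox walkthrough is no longer reachable, one alternative is scikit-learn's own `sklearn.inspection.partial_dependence`, which works with any fitted estimator. This swaps in a different library from the one the assignment suggests. The sketch below runs on synthetic data with a freshly fitted forest, both of which are assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import partial_dependence

# Synthetic stand-in data (an assumption; swap in X_train / y_train in the notebook).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# For each of 20 grid values of feature 0, force that value on every row
# and average the predicted probability of the positive class.
pd_result = partial_dependence(
    forest, X_demo, features=[0], grid_resolution=20, kind="average"
)
averaged = pd_result["average"][0]
print(averaged)
```

Recent scikit-learn versions also provide `PartialDependenceDisplay.from_estimator` for drawing the curve directly.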


## (Optional) Advanced Boosting-Based Algorithms

As explained in the video lectures, the next generation of algorithms after Random Forests (that use Bagging, a.k.a. Bootstrap Aggregation) were developed using Boosting, and the first of these was Gradient Boosted Machines, which are implemented in scikit-learn (http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting).
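As a starting point before moving to the specialised libraries, here is a minimal sketch of scikit-learn's `GradientBoostingClassifier`. The synthetic data and the hyperparameter values are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Trees are added one after another, each fitted to the errors of the ensemble so far.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(Xtr, ytr)
gbm_acc = accuracy_score(yte, gbm.predict(Xte))

# staged_predict yields the ensemble's predictions after each boosting round.
staged_acc = [accuracy_score(yte, p) for p in gbm.staged_predict(Xte)]
print(f"after 1 round: {staged_acc[0]:.3f}, after 100 rounds: {staged_acc[-1]:.3f}")
```

Unlike bagging, where trees are trained independently on resampled data, each boosting round here depends on the previous ones.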

Still, in recent years, a number of variations on GBMs have been developed by different research and industry groups, all of them bringing improvements in speed, accuracy, and functionality to the original Gradient Boosting algorithms.

In no order of preference, these are:
1. **XGBoost**: https://xgboost.readthedocs.io/en/latest/
2. **CatBoost**: https://tech.yandex.com/catboost/
3. **LightGBM**: https://lightgbm.readthedocs.io/en/latest/

If you're using the Anaconda distribution, these are all very easy to install:

In [None]:
#! conda install -c anaconda py-xgboost

In [None]:
#! conda install -c conda-forge catboost

In [None]:
#! conda install -c conda-forge lightgbm

Your task in this optional section of the mini project is to read the documentation of these three libraries, and apply all of them to the "German Credit" dataset, just like you did in the case of Decision Trees and Random Forests.

The final deliverable of this section should be a table (can be a pandas DataFrame) which shows the accuracy of all five algorithms taught in this mini project in one place.

Happy modeling! :)

In [66]:
from sklearn.preprocessing import OrdinalEncoder

X = gc.drop('Class', axis = 1)
y = gc[['Class']]

y_encoded = OrdinalEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size = .3, random_state = 341)

In [98]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, y_train, enable_categorical = True)
dtest = xgb.DMatrix(X_test, y_test, enable_categorical = True)

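# Note: num_parallel_tree = 100 with learning_rate = 1 and a single boosting round
# makes XGBoost train a random forest rather than a boosted ensemble.
params = {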
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'tree_method': 'hist'
    }

bst = xgb.train(params, dtrain, num_boost_round = 1)
# Convert predicted probabilities to 0/1 labels with a 0.5 threshold.
y_pred = (bst.predict(dtest) >= .5).astype(float)

print(classification_report(y_test, y_pred))
xgb_acc = accuracy_score(y_test, y_pred)
xgb_acc

              precision    recall  f1-score   support

         0.0       0.61      0.33      0.43        91
         1.0       0.76      0.91      0.83       209

    accuracy                           0.73       300
   macro avg       0.68      0.62      0.63       300
weighted avg       0.71      0.73      0.71       300





0.7333333333333333

In [99]:
from catboost import CatBoostClassifier

cb = CatBoostClassifier(None)
cb.fit(X_train, y_train)
y_pred = cb.predict(X_test)

Learning rate set to 0.008847

In [100]:
print(classification_report(y_test, y_pred))
cb_acc = accuracy_score(y_test, y_pred)
cb_acc

              precision    recall  f1-score   support

         0.0       0.68      0.52      0.59        91
         1.0       0.81      0.89      0.85       209

    accuracy                           0.78       300
   macro avg       0.75      0.71      0.72       300
weighted avg       0.77      0.78      0.77       300



0.78

In [101]:
import lightgbm as lgb

train_data = lgb.Dataset(X_train, label = y_train)
test_data = lgb.Dataset(X_test, label = y_test)

param = {'num_leaves': 35, 'objective': 'binary', 'metric': 'auc'}
light = lgb.train(param, train_data, 10, valid_sets = [test_data])
# Convert predicted probabilities to 0/1 labels with a 0.5 threshold.
y_pred = (light.predict(X_test) >= .5).astype(float)

[LightGBM] [Info] Number of positive: 491, number of negative: 209
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 416
[LightGBM] [Info] Number of data points in the train set: 700, number of used features: 54
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.701429 -> initscore=0.854110
[LightGBM] [Info] Start training from score 0.854110
[1]	valid_0's auc: 0.722383
[2]	valid_0's auc: 0.735475
[3]	valid_0's auc: 0.750013
[4]	valid_0's auc: 0.755218
[5]	valid_0's auc: 0.754614
[6]	valid_0's auc: 0.761107
[7]	valid_0's auc: 0.772622
[8]	valid_0's auc: 0.769835
[9]	valid_0's auc: 0.768574
[10]	valid_0's auc: 0.768836




In [102]:
print(classification_report(y_test, y_pred))
light_acc = accuracy_score(y_test, y_pred)
light_acc

              precision    recall  f1-score   support

         0.0       0.68      0.23      0.34        91
         1.0       0.74      0.95      0.83       209

    accuracy                           0.73       300
   macro avg       0.71      0.59      0.59       300
weighted avg       0.72      0.73      0.68       300



0.7333333333333333

In [None]:
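# Note: dt_acc and rf_acc were measured on the first split (random_state = 120),
# while the boosting models use the second split (random_state = 341).
results = pd.DataFrame([
    {'Model': 'Decision Tree', 'Accuracy': dt_acc},
    {'Model': 'Random Forest', 'Accuracy': rf_acc},
    {'Model': 'XGBoost', 'Accuracy': xgb_acc},
    {'Model': 'CatBoost', 'Accuracy': cb_acc},
    {'Model': 'LightGBM', 'Accuracy': light_acc}
])
results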