# Financial data analytics

In this section you will be challenged with a few analytical questions about the data. Some questions may have no single correct answer, so feel free to provide the solutions that make the most sense to you. We know that you might not have time to answer every question: try to finalize at least 3 of them and focus on the quality of your answers.

The objective is to implement the solutions using Python and we would be most happy if you adopt PySpark for at least some of the exercises. However, you can pick the programming language of your choice.

<img src="img/EntityRelation.png">

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import when, lit
from pyspark.sql.functions import col, udf

from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

from datetime import datetime

import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import tree
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

import statsmodels.api as sm

import os

In [4]:
ss = SparkSession.builder.appName("Scigility Test Challenge").getOrCreate()

## Preprocessing

<font color="green">In this part I load the data, cast the columns to the correct types and remove unusable records.</font>

In [5]:
df_trans = ss.read.csv("dataset/trans.csv",sep=';', inferSchema=False, header=True)

In [6]:
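@udf
def convertDate(x):
    # Return the date string unchanged when it parses as YYYY-MM-DD, otherwise the marker "error".
    try:
        datetime.strptime(x, '%Y-%m-%d')
    except (TypeError, ValueError):
        return "error"
    return x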

In [7]:
df_trans_cast = df_trans.withColumn("date_2", convertDate("date"))

In [8]:
df_trans_cast.filter(col("date_2")=="error").count()

1056320

<font color="green">I chose to remove the records whose date is in the wrong format (the rejected records are easy to retrieve again if needed).

For the date fields, I will keep working with strings only, since they display better.</font>
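
The validity check behind this filter can be illustrated outside Spark. The sketch below is plain Python on made-up sample values; the helper name `is_valid_date` is hypothetical and not part of the notebook.

```python
from datetime import datetime

def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True when value parses with the given format (hypothetical helper)."""
    try:
        datetime.strptime(value, fmt)
    except (TypeError, ValueError):
        # TypeError covers None, ValueError covers malformed strings.
        return False
    return True

# Made-up sample values, not taken from the dataset.
raw_dates = ["1997-06-01", "1997-13-01", "", None, "1997-11-30"]
clean_dates = [d for d in raw_dates if is_valid_date(d)]
rejected = [d for d in raw_dates if not is_valid_date(d)]
print(clean_dates)
print(rejected)
```

Keeping the rejected values in their own list is what makes it easy to inspect them later.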

In [None]:
df_trans_clean = df_trans_cast.filter(col("date_2")!="error")

In [None]:
df_trans_cast = df_trans_clean \
    .withColumn("trans_id", df_trans_clean["trans_id"].cast("int")) \
    .withColumn("account_id", df_trans_clean["account_id"].cast("int")) \
    .withColumn("amount", df_trans_clean["amount"].cast("int")) \
    .withColumn("balance", df_trans_clean["balance"].cast("int")) \
    .withColumn("account", df_trans_clean["account"].cast("int")) \
    .drop("date_2")

In [None]:
df_trans_cast.show(5)

In [None]:
df_loan = ss.read.csv("dataset/loan.csv",sep=';', inferSchema=False, header=True)

In [None]:
df_loan_cast = df_loan \
    .withColumn("loan_id", df_loan["loan_id"].cast("int")) \
    .withColumn("account_id", df_loan["account_id"].cast("int")) \
    .withColumn("amount", df_loan["amount"].cast("int")) \
    .withColumn("duration", df_loan["duration"].cast("int")) \
    .withColumn("payments", df_loan["payments"].cast("float"))

In [None]:
df_loan_cast.show(5)

In [None]:
df_order = ss.read.csv("data/order.csv", sep=';', inferSchema=False, header=True)

In [None]:
df_order_cast = df_order \
    .withColumn("order_id", df_order["order_id"].cast("int")) \
    .withColumn("account_id", df_order["account_id"].cast("int")) \
    .withColumn("account_to", df_order["account_to"].cast("int")) \
    .withColumn("amount", df_order["amount"].cast("float"))

In [None]:
df_order_cast.show(5)

In [None]:
df_district = ss.read.csv("data/district.csv", sep=';', inferSchema=False, header=True)

In [None]:
df_district_cast = df_district \
    .withColumn("district_id", df_district["district_id"].cast("int")) \
    .withColumn("A4", df_district["A4"].cast("int")) \
    .withColumn("A5", df_district["A5"].cast("int")) \
    .withColumn("A6", df_district["A6"].cast("int")) \
    .withColumn("A7", df_district["A7"].cast("int")) \
    .withColumn("A8", df_district["A8"].cast("int")) \
    .withColumn("A9", df_district["A9"].cast("int")) \
    .withColumn("A11", df_district["A11"].cast("int")) \
    .withColumn("A14", df_district["A14"].cast("int")) \
    .withColumn("A15", df_district["A15"].cast("int")) \
    .withColumn("A16", df_district["A16"].cast("int")) \
    .withColumn("A10", df_district["A10"].cast("float")) \
    .withColumn("A12", df_district["A12"].cast("float")) \
    .withColumn("A13", df_district["A13"].cast("float"))

In [None]:
df_district_cast.show(5)

In [None]:
df_account = ss.read.csv("data/account.csv", sep=';', inferSchema=False, header=True)

In [None]:
df_account_cast = df_account \
    .withColumn("account_id", df_account["account_id"].cast("int")) \
    .withColumn("district_id", df_account["district_id"].cast("int"))

In [None]:
df_account_cast.show(5)

In [None]:
df_disp = ss.read.csv("data/disp.csv", sep=";", inferSchema=False, header=True)

In [None]:
df_disp_cast = df_disp \
    .withColumn("disp_id", df_disp["disp_id"].cast("int")) \
    .withColumn("account_id", df_disp["account_id"].cast("int")) \
    .withColumn("client_id", df_disp["client_id"].cast("int"))

In [None]:
df_disp_cast.show(5)

In [None]:
df_client = ss.read.csv("data/client.csv", sep=";", inferSchema=False, header=True)

In [None]:
df_client_cast = df_client \
    .withColumn("client_id", df_client["client_id"].cast("int")) \
    .withColumn("district_id", df_client["district_id"].cast("int"))

In [None]:
df_client_cast.show(5)

In [None]:
df_card = ss.read.csv("data/card.csv", sep=";", inferSchema=False, header=True)

In [None]:
df_card_cast = df_card \
    .withColumn("card_id", df_card["card_id"].cast("int")) \
    .withColumn("disp_id", df_card["disp_id"].cast("int"))

In [None]:
df_card_cast.show(5)

## Analytics

#### 1.

Look at some basic statistics of the data (mean, variance, etc.) of the “trans” table to understand it better. Create plots or visualizations of your choice (feel free to use the library you prefer. Hint: in case of Python, Matplotlib is probably best suited - in case of Scala you might want to use the Vegas library). Print and explain an aspect of your choice (that you think is interesting) in the notebook.

#### Amount basic statistics

In [None]:
df_trans_cast.describe(["amount"]).show()

<font color="green">On average, transactions have an amount of 6K.</font>

#### Total number of transactions over time

In [None]:
df_transaction_date = df_trans_cast \
    .groupBy('date') \
    .agg({'trans_id': 'count'}) \
    .select("date",col("count(trans_id)").alias("count")) \
    .orderBy(df_trans_cast.date.asc())

In [None]:
df_transaction_date_pd = df_transaction_date.toPandas()

In [None]:
plt.figure(figsize=(20, 10))
ax = plt.subplot()
myLocator = mticker.MaxNLocator(10)
ax.xaxis.set_major_locator(myLocator)
plt.bar(df_transaction_date_pd['date'],df_transaction_date_pd['count'])
plt.title("Total number of transactions per day")
plt.axvline("1997-06-01", c='red')
plt.axvline("1997-11-01", c='red')

<font color="green">We note:

- The number of transactions per day increases over time

- There seems to be a significant peak in the number of transactions twice a year</font>

#### Total amount of transactions over time

In [None]:
df_amount_date = df_trans_cast \
    .groupBy('date') \
    .agg({'amount': 'sum'}) \
    .select("date",col("sum(amount)").alias("sum")) \
    .orderBy(df_trans_cast.date.asc())

In [None]:
df_amount_date_pd = df_amount_date.toPandas()

In [None]:
plt.figure(figsize=(20, 10))
ax = plt.subplot()
ticks_y = mticker.FuncFormatter(lambda x, pos: int(x/1000000))
ax.yaxis.set_major_formatter(ticks_y)
myLocator = mticker.MaxNLocator(10)
ax.xaxis.set_major_locator(myLocator)
plt.bar(df_amount_date_pd['date'],df_amount_date_pd['sum'])
plt.title("Total amount of transactions over time (in millions)")

<font color="green">We note spikes occurring regularly over time. Given the similar pattern seen previously, these spikes are probably driven mostly by the number of transactions.</font>
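
One way to check this hypothesis is to divide the daily total by the daily count. The sketch below uses made-up daily aggregates (not values from the dataset) and assumes the two aggregations above have been combined into `(date, count, total)` tuples.

```python
# Made-up daily aggregates: (date, number of transactions, total amount).
daily = [
    ("1997-05-30", 100, 600_000),
    ("1997-05-31", 400, 2_400_000),  # spike in both count and total
    ("1997-06-01", 120, 720_000),
]

# If the mean amount per transaction stays flat while the total spikes,
# the spike comes from the number of transactions rather than their size.
mean_per_transaction = {date: total / count for date, count, total in daily}
for date, mean_amount in mean_per_transaction.items():
    print(date, mean_amount)
```

In this toy example the mean is identical on all three days, so the spike in the total is explained entirely by the count.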

#### 4.

Visualize the average loan amount per district.

In [None]:
df_loan_for_ave = df_loan_cast.select("account_id","amount")

In [None]:
df_account_for_ave = df_account_cast.select("account_id","district_id")

In [None]:
df_loan_district = df_loan_for_ave \
    .join(df_account_for_ave, df_loan_for_ave.account_id == df_account_for_ave.account_id, 'left') \
    .drop(df_loan_for_ave.account_id) \
    .drop(df_account_for_ave.account_id)

In [None]:
df_loan_district_gp = df_loan_district \
    .groupBy('district_id') \
    .agg({'amount': 'avg'}) \
    .select("district_id",col("avg(amount)").alias("avg"))

In [None]:
df_loan_district_gp.sort(col("avg").desc()).show(5)

In [None]:
df_loan_district_pd = df_loan_district_gp.toPandas()

In [None]:
plt.figure(figsize=(20,10))
plt.bar(df_loan_district_pd['district_id'],df_loan_district_pd["avg"])
plt.title("Average loan amount per district")

In [None]:
df_loan_district_gp.describe("avg").show(5)

<font color="green">Average loan amounts range from 74K to 294K across districts.

District 46 has the highest average.</font>
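
The join-then-average logic used for this figure can be written out in plain Python. This is a minimal sketch on made-up rows; the account and district identifiers below are illustrative, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows: (account_id, loan amount) and account_id -> district_id.
loans = [(1, 100_000), (2, 200_000), (3, 60_000)]
district_of_account = {1: 1, 2: 1, 3: 2}

# Attach each loan to the district of its account, then average per district.
amounts_by_district = defaultdict(list)
for account_id, amount in loans:
    amounts_by_district[district_of_account[account_id]].append(amount)

avg_by_district = {d: mean(a) for d, a in amounts_by_district.items()}
print(avg_by_district)
```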

## Credit risk prediction

#### 5.

Build an ML model that classifies if a certain loan will be paid or not. You can use any classification model you think is suitable (hint: if you’re using Spark, there are some available out of the box with MLlib - with Python and Scikit-Learn too). Note that the goal is not to get the accuracy as high as possible – it is completely OK to choose a simple model and not spend days on parameter tuning. Think about questions like:

- Which model do you choose and why?

- How do you do the training and how do you measure the model accuracy?

- Which are the variables contributing the most to the prediction?

- How accurate does your model get? If you think a higher accuracy is possible, what would be the next steps you take?


### Basic analysis

<font color="green">In this section I conduct a simple analysis to better understand the data.</font>

#### Status

From data provider https://sorry.vse.cz/~berka/challenge/pkdd1999/berka.htm :

'A' stands for contract finished, no problems,

'B' stands for contract finished, loan not payed,

'C' stands for running contract, OK so far,

'D' stands for running contract, client in debt

In [None]:
df_loan_cast.count()

In [None]:
df_loan_cast.filter(col("status")=="D").count()

In [None]:
df_loan_for_join = df_loan_cast.select("loan_id","account_id","amount","status")

In [None]:
status_list = ['A', 'B', 'C', 'D']
count_status = []
for status in status_list:
    count_status.append(df_loan_cast.filter(col("status")==status).count())

In [None]:
plt.figure(figsize=(10,10))
plt.pie(count_status, labels=status_list, autopct='%0.0f%%')
plt.legend()
plt.title('status of loans')

<font color="green">Most clients have a running contract that is OK so far.

In this sample, 7 + 5 = 12% of the loans have payment problems (statuses B and D).</font>

### Feature engineering

<font color="green">

In order to predict whether a loan will be paid or not in a relevant manner, I choose to focus on the following features:
    
- **date** when the loan was granted

- **amount** of money

- **duration** of the loan

- **type** of card
    
**Status** of paying off the loan will be used as the value to predict.
    
In this section I focus on the operations needed to prepare the data for machine learning algorithms, namely:
    
- join
    
- conversion 
    
- standardization

</font>

#### Join

In [None]:
df_disp_for_join = df_disp_cast.select("disp_id", "account_id")

In [None]:
df_loan_disp = df_loan_cast \
    .join(df_disp_for_join, df_loan_cast.account_id == df_disp_for_join.account_id, 'left') \
    .drop(df_disp_for_join.account_id)

<font color="green">Note: since an account can have several clients (and thus several disp_id), the number of records increases after the join. I think it makes sense to consider different records for different clients.</font>
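
The row multiplication can be reproduced on a tiny example. The sketch below is plain Python with made-up identifiers; `join_loans_with_disps` is a hypothetical helper that mimics a left join on `account_id`.

```python
# Made-up rows: one loan per account, possibly several dispositions per account.
loans = [{"loan_id": 1, "account_id": 10}, {"loan_id": 2, "account_id": 20}]
disps = [
    {"disp_id": 100, "account_id": 10},  # first client on account 10
    {"disp_id": 101, "account_id": 10},  # second client on the same account
    {"disp_id": 200, "account_id": 20},
]

def join_loans_with_disps(loans, disps):
    """Naive left join on account_id: one output row per matching disposition."""
    joined = []
    for loan in loans:
        matches = [d for d in disps if d["account_id"] == loan["account_id"]]
        if not matches:
            # Left join: keep the loan even when no disposition matches.
            joined.append({**loan, "disp_id": None})
        for disp in matches:
            joined.append({**loan, "disp_id": disp["disp_id"]})
    return joined

rows = join_loans_with_disps(loans, disps)
print(len(loans), "loans ->", len(rows), "rows")
```

Loan 1 appears twice in the result, once per disposition, which is the effect described in the note above.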

In [None]:
df_card_for_join = df_card_cast.select("disp_id","type")

In [None]:
df_loan_disp_type = df_loan_disp \
    .join(df_card_for_join, df_loan_disp.disp_id == df_card_for_join.disp_id, 'left') \
    .drop(df_card_for_join.disp_id)

#### Conversion

In [None]:
@udf
def typeToInt(x):
    if(x=="junior"):
        n = 0
    elif(x=="classic"):
        n = 1
    elif(x=="gold"):
        n = 2
    else:
        n = -1
    return n

In [None]:
@udf
def statusToInt(x):
    if(x=="A"):
        n = 1
    elif(x=="B"):
        n = 2
    elif(x=="C"):
        n = 3
    elif(x=="D"):
        n = 4
    else:
        n = -1
    return n

In [None]:
from pyspark.sql.functions import unix_timestamp, from_unixtime

# Convert to number
df_type_int = df_loan_disp_type \
    .withColumn("type_2", typeToInt("type")) \
    .withColumn("status_2", statusToInt("status")) \
    .withColumn("date_2", unix_timestamp('date', 'yyyy-MM-dd'))

# Change column format
df_type_cast = df_type_int \
    .withColumn("status_int", df_type_int["status_2"].cast("int")) \
    .withColumn("type_int", df_type_int["type_2"].cast("int")) \
    .withColumn("date_int", df_type_int["date_2"].cast("int"))

In [None]:
df_type_na = df_type_cast \
    .withColumn('type_na', when(df_type_cast.type.isNull(),lit('undefined')).otherwise(df_type_cast.type)) \
    .drop("type","type_2", "status_2", "date_2")

#### Standardization

In [None]:
df_type_scaled = df_type_na

unlist = udf(lambda x: round(float(list(x)[0]),3), DoubleType()) # convert column type from vector to double type

for colName in ["amount", "type_int", "date_int", "duration"]:
    assembler = VectorAssembler(inputCols=[colName],outputCol=colName+"_vect") # convert to vector type

    scaler = MinMaxScaler(inputCol=colName+"_vect", outputCol=colName+"_scaled")

    pipeline = Pipeline(stages=[assembler, scaler])
    
    df_type_scaled = pipeline \
        .fit(df_type_scaled) \
        .transform(df_type_scaled) \
        .withColumn(colName+"_scaled", unlist(colName+"_scaled")) \
        .drop(colName+"_vect")
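
The rescaling performed in the loop above is a min-max scaling to the range [0, 1]. A minimal plain-Python sketch of the formula, on made-up durations, is shown here; the helper `min_max_scale` is hypothetical, and returning the midpoint for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula is undefined, so this sketch returns the midpoint.
        return [0.5 for _ in values]
    # Round to 3 decimals, as the notebook does after scaling.
    return [round((v - lo) / (hi - lo), 3) for v in values]

durations = [12, 24, 36, 48, 60]  # made-up loan durations in months
print(min_max_scale(durations))
```

After scaling, all four features share the same range, so none of them dominates a distance-based model such as KNN purely because of its units.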

In [None]:
df_type_scaled_pd = df_type_scaled.toPandas()

In [None]:
df_type_scaled_pd.head(10)

### Data visualization

#### Groups by type

In [None]:
def show_samples(samples, labels, dico_labels, features=[0,1], feature_names=None, display_labels=True):
    '''Display the samples in 2D'''
    fig = plt.figure(figsize=(20,10))
    if display_labels:
        nb_labels = np.max(labels)
        for j in range(nb_labels + 1):
            nb_samples = np.sum(labels == j)
            if nb_samples:
                index = np.where(labels == j)[0]
                plt.scatter(samples[index,features[0]],samples[index,features[1]],s=60, label=dico_labels[j])
    else:
        plt.scatter(samples[:,features[0]],samples[:,features[1]],color='gray')
    if feature_names is not None:
        plt.xlabel(feature_names[0])
        plt.ylabel(feature_names[1])
    plt.legend(loc=(0.9,0.3))
    plt.axis('auto')
    plt.show()

In [None]:
feature_names = ['amount','type']
d = {1:"A", 2:"B", 3:"C", 4:"D"}
show_samples(samples=np.array(df_type_scaled_pd[["amount_scaled","type_na"]]), \
             labels=np.array(df_type_scaled_pd["status_int"]), dico_labels=d, \
             display_labels=True, feature_names=feature_names)

<font color = "green">

We note:

- Most of the clients who cannot pay back their loans do not have a card

- A large proportion of the contracts that finished without any issue are for lower amounts

- No client with a junior card is in debt
    
</font>

<font color = "green">This confirms that the card type is a relevant feature.</font>

#### Dimension reduction for 2D display

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]

In [None]:
pca = PCA(n_components = 2, whiten = True)
X_reduced = pca.fit_transform(X)

In [None]:
pca.explained_variance_ratio_

<font color = "green">After reduction to two dimensions, we still capture 72% of the variance.</font>

In [None]:
plt.figure(figsize=(15,7))
plt.plot(X_reduced[:,0],X_reduced[:,1], '+')

<font color = "green">We can see that the data largely fall into distinct clusters.</font>

### Multiclass approach

#### Linear Regression

<font color = "green">I chose to start with one of the simplest algorithms; it is also the model I know best, so I can extract more information from it.</font>

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_int"]

In [None]:
X2 = sm.add_constant(X)

In [None]:
lr_sm = sm.OLS(Y,X2)
results = lr_sm.fit()
print(results.summary())

<font color = "green">

- The relationship is globally highly significant, since the p-value associated with the F-statistic (Fisher test) is very low

- All variables are significant except the amount

- **Date** and **Duration** are the most important factors

- Type is actually not that important (contrary to what we expected in the previous part)
    
</font>

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_lr = LinearRegression()
clf_lr = clf_lr.fit(X_train, Y_train)

In [None]:
clf_lr.score(X_test, Y_test)

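<font color = "green">The first results seem satisfactory, since we want a score of at least 0.5. Note that for `LinearRegression`, `score` returns the coefficient of determination R² rather than a classification accuracy.</font>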
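
For scikit-learn regressors such as `LinearRegression`, `score` returns the coefficient of determination R², which is not the same quantity as the share of correctly predicted classes. The sketch below computes both by hand on made-up values; the helpers `r_squared` and `rounded_accuracy` are hypothetical and only illustrate the difference.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rounded_accuracy(y_true, y_pred):
    """Share of predictions that hit the right class once rounded to an integer."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if round(p) == t)
    return hits / len(y_true)

# Made-up status codes (1 to 4) and continuous predictions.
y_true = [1, 2, 3, 4]
y_pred = [1.4, 2.4, 3.4, 4.4]
print(r_squared(y_true, y_pred))
print(rounded_accuracy(y_true, y_pred))
```

Here every rounded prediction is correct while R² stays below 1, so the two numbers should not be compared against the same threshold.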

<font color = "green">Let's run a cross-validation to check whether the results are stable across different subsets of the data.</font>

In [None]:
cross_val_score(clf_lr, X, Y, cv=5)

<font color = "green">Results seem volatile but still satisfactory.</font>
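
Five-fold cross-validation splits the rows into five parts and uses each part once as the test set. The sketch below shows only the index bookkeeping, assuming contiguous folds without shuffling; `k_fold_indices` is a hypothetical helper, not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

With contiguous folds, the class mix can differ from one fold to the next if the rows are ordered, which is one possible source of volatile scores.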

#### Decision tree

<font color = "green">The decision tree is a simple algorithm that can capture nonlinear relations.</font>

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_int"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_tr = tree.DecisionTreeClassifier(random_state = 0)
clf_tr = clf_tr.fit(X_train,Y_train)

In [None]:
print(clf_tr.score(X_test, Y_test))

<font color = "green">Result is quite high.</font>

In [None]:
clf_tr.get_depth()

In [None]:
clf_tr.feature_importances_

<font color = "green">**Date** and **Duration** are the most important features, which is in line with the linear regression.</font>

In [None]:
tree.export_graphviz(clf_tr, out_file="TreeScigility.dot", filled=True)

(Picture displayed using http://viz-js.com/)

<img src="img/Tree.png">

<font color = "green">The tree is too deep to be easily interpretable.</font>

#### AdaBoost (Boosting)

<font color="green">Boosting methods may be well suited to such a small dataset, as computation time will not be an issue.</font>

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_int"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_adb = AdaBoostClassifier()
clf_adb.fit(X_train, Y_train)
clf_adb.score(X_test, Y_test)

#### KNN

<font color="green">KNN is well suited to our case since we were able to identify clusters in the data.</font>
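
The idea behind KNN is a majority vote among the closest training points, which is why visible clusters help. Below is a minimal plain-Python sketch on made-up scaled features; `knn_predict` is a hypothetical helper and does not reproduce `KNeighborsClassifier` in detail.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k nearest training points (Euclidean)."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up scaled features (amount, duration) forming two clusters.
train_points = [(0.1, 0.1), (0.2, 0.1), (0.15, 0.2), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
train_labels = ["A", "A", "A", "D", "D", "D"]

print(knn_predict(train_points, train_labels, query=(0.2, 0.2)))
print(knn_predict(train_points, train_labels, query=(0.8, 0.8)))
```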

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_int"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_kn = KNeighborsClassifier(n_neighbors=7)
clf_kn.fit(X_train, Y_train)

In [None]:
clf_kn.score(X_test, Y_test)

### Binary prediction

In [None]:
df_type_scaled_pd.head()

In [None]:
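# 0 = running contract with the client in debt (status "D"); 1 = every other status ("A", "B", "C")
df_type_scaled_pd['status_bin'] = np.where(df_type_scaled_pd['status'] == "D", 0, 1)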

#### Logistic regression

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_bin"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_lg = LogisticRegression()
clf_lg = clf_lg.fit(X_train,Y_train)

In [None]:
clf_lg.score(X_test, Y_test)

<font color = "green">Results are even better than in multiclass prediction.</font>

In [None]:
prob_lg = clf_lg.predict_proba(X_test)
pred_lg = prob_lg[:,1]
fpr_lg, tpr_lg, _ = metrics.roc_curve(Y_test, pred_lg)
roc_auc_lg = metrics.auc(fpr_lg, tpr_lg)

#### Decision tree

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_bin"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_tr = tree.DecisionTreeClassifier()
clf_tr = clf_tr.fit(X_train,Y_train)
clf_tr.score(X_test, Y_test)

In [None]:
prob_tr = clf_tr.predict_proba(X_test)
pred_tr = prob_tr[:,1]
fpr_tr, tpr_tr, _ = metrics.roc_curve(Y_test, pred_tr)
roc_auc_tr = metrics.auc(fpr_tr, tpr_tr)

#### AdaBoost

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_bin"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
param_test = {
    'n_estimators': [1, 10, 20, 50, 100],
    'learning_rate': [0.1, 0.2, 0.3, 0.5, 0.8],
}

In [None]:
clf_grid = GridSearchCV(estimator = AdaBoostClassifier(), param_grid = param_test, scoring='roc_auc', cv=5)

In [None]:
clf_grid.fit(X_train, Y_train)

In [None]:
print(clf_grid.best_score_)
print(clf_grid.best_params_)
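
To get a sense of what the grid search evaluates, the candidate combinations can be enumerated by hand. This sketch only counts them; it trains no model, and the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "n_estimators": [1, 10, 20, 50, 100],
    "learning_rate": [0.1, 0.2, 0.3, 0.5, 0.8],
}

# Every combination of the two lists is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5  # same value as the cv argument used above
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

Each candidate is fitted once per fold, and by default `GridSearchCV` refits the best candidate once more on the whole training set.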

In [None]:
clf_ab = AdaBoostClassifier(n_estimators=100, learning_rate=0.3)
clf_ab = clf_ab.fit(X_train, Y_train)

In [None]:
prob_ab = clf_ab.predict_proba(X_test)
pred_ab = prob_ab[:,1]
fpr_ab, tpr_ab, _ = metrics.roc_curve(Y_test, pred_ab)
roc_auc_ab = metrics.auc(fpr_ab, tpr_ab)

#### KNN

In [None]:
X = df_type_scaled_pd[["amount_scaled","date_int_scaled","duration_scaled","type_int_scaled"]]
Y = df_type_scaled_pd["status_bin"]

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [None]:
clf_kn = KNeighborsClassifier(n_neighbors=7)
clf_kn.fit(X_train, Y_train)

In [None]:
clf_kn.score(X_test, Y_test)

<font color = "green">The score is still very high.</font>

In [None]:
prob_kn = clf_kn.predict_proba(X_test)
pred_kn = prob_kn[:,1]
fpr_kn, tpr_kn, _ = metrics.roc_curve(Y_test, pred_kn)
roc_auc_kn = metrics.auc(fpr_kn, tpr_kn)

## Model selection

In [None]:
metric_models = {'log_reg':(fpr_lg, tpr_lg, roc_auc_lg),'tree':(fpr_tr, tpr_tr, roc_auc_tr), \
                 'adaboost':(fpr_ab, tpr_ab, roc_auc_ab), 'knn':(fpr_kn, tpr_kn, roc_auc_kn)}

plt.figure(figsize=(8,8))
for model in metric_models:
    fpr, tpr, roc_auc = metric_models[model]
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, label = model + ' AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
plt.show()
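
The AUC shown in the legend can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; `roc_auc` is a hypothetical helper written for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up binary labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.7, 0.2]
print(roc_auc(y_true, scores))
```

A value of 0.5 corresponds to the dashed diagonal in the plot, and 1.0 to a perfect ranking.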

### Conclusion

<font color="green">

These predictions can lead to 2 types of errors:

- False positive: the model wrongly predicts that the client will pay back their loan

- False negative: the model wrongly predicts that the client won't pay back their loan

A bank would probably want to make sure that the loans it grants are indeed paid back. It will therefore favor an algorithm that minimizes the first type of error.

Thus, AdaBoost seems to be the best algorithm.

**Limits**: explainability, computational time with more data.
</font>
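
The two error types can be counted directly from predictions. The sketch below uses made-up labels with 1 meaning "loan paid" (the positive class) and 0 meaning "loan not paid"; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes with 1 = loan paid (positive) and 0 = loan not paid (negative)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts

# Made-up labels and predictions.
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
# Share of unpaid loans that the model wrongly predicted as paid.
false_positive_rate = counts["fp"] / (counts["fp"] + counts["tn"])
print(counts, false_positive_rate)
```

The false positive rate is the horizontal axis of the ROC curve, so a model that minimizes the first error type sits as far left as possible for a given true positive rate.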