# Project

Data source: https://www.kaggle.com/datasets/taweilo/loan-approval-classification-data

The data contains information from loan applications.

Business goal: build a model that predicts the decision before a loan application is submitted, so the bank does not waste time and processes only suitable customers.

Assumption: y denotes approval and disbursement of the loan.

1. Perform a preliminary analysis of the dataset.
2. Select variables for modelling.
3. Apply the necessary transformations.
4. Optimize the model.
5. Build a simulation that optimizes the cut-off point, given that:
    - A false positive is a loss for the bank in the form of employee time; we estimate this loss at 50.
    - A false negative is a loss for the bank of (loan_int_rate / 100) * loan_amnt, which approximates the profit the bank would have made if the application had been submitted and the customer had taken out the loan.
    - A true positive is a profit of (loan_int_rate / 100) * loan_amnt.
    - A true negative is a saving of 50 units.
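
The payoff rules above can be sketched on toy numbers. The function name `total_margin` and the four sample applicants are illustrative assumptions and do not come from the dataset; only the value of 50 and the interest approximation are taken from the list.

```python
# Minimal sketch of the payoff rules; all sample values are made up for illustration.
FP_COST = -50   # false positive: employee time wasted
TN_GAIN = 50    # true negative: employee time saved


def total_margin(y_true, proba, int_rate, amount, threshold):
    """Sum the payoff over all applicants for a given cut-off."""
    margin = 0.0
    for y, p, rate, amt in zip(y_true, proba, int_rate, amount):
        predicted = int(p >= threshold)
        interest = rate / 100 * amt  # approximate profit on a disbursed loan
        if predicted == 1 and y == 1:
            margin += interest       # true positive
        elif predicted == 0 and y == 1:
            margin -= interest       # false negative
        elif predicted == 1 and y == 0:
            margin += FP_COST        # false positive
        else:
            margin += TN_GAIN        # true negative
    return margin


# Hypothetical applicants: one TP, one FN, one FP and one TN at a cut-off of 0.5.
y_true = [1, 1, 0, 0]
proba = [0.9, 0.2, 0.7, 0.1]
int_rate = [10.0, 5.0, 12.0, 8.0]
amount = [3000, 2000, 1500, 500]

margin_at_half = total_margin(y_true, proba, int_rate, amount, threshold=0.5)
print(margin_at_half)  # 300 - 100 - 50 + 50
```

Because a missed good customer costs the full interest while a wasted application costs only 50, lowering the cut-off tends to pay off in this toy case, which is the trade-off the simulation in step 5 explores.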




In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('data/loan_data.csv')

In [None]:
df.head()

1. Perform a preliminary analysis of the dataset.

In [None]:
df.select_dtypes(exclude='object').corr(method='spearman')

In [None]:
df['loan_status'].value_counts()

In [9]:
import seaborn as sns

In [8]:
cols_to_plot = ['loan_percent_income', 'loan_int_rate','person_income','loan_amnt']

In [None]:
for i in cols_to_plot:
    sns.kdeplot(data=df,x=i,hue = df['loan_status'], common_norm=False)
    plt.title(f'{i}')
    plt.show()

In [None]:
cols = df.select_dtypes(include = 'object').columns
for i in cols:
    print(df[i].value_counts())
    print(df[[i, 'loan_status']].groupby(i).mean())
    print('\n')

In [12]:
df_to_model = df[df['previous_loan_defaults_on_file']=='No'].reset_index(drop=True)

In [None]:
cols = df_to_model.select_dtypes(include = 'object').columns
for i in cols:
    print(df_to_model[i].value_counts())
    print(df_to_model[[i, 'loan_status']].groupby(i).mean())
    print('\n')

2. Select variables for modelling.

In [17]:
cols_to_encode = ['person_education','loan_intent','person_home_ownership']

In [None]:
cols_to_plot

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [18]:
ohe = OneHotEncoder(sparse_output=False).fit(df_to_model[cols_to_encode])
res = ohe.transform(df_to_model[cols_to_encode])

In [None]:
res

In [20]:
df_to_model = df_to_model.join(pd.DataFrame(data=res,columns = ohe.get_feature_names_out()))

In [None]:
df_to_model.head()

In [None]:
corr = abs(df_to_model.select_dtypes(exclude='object').corr())['loan_status']
corr

In [None]:
x_names = list(corr[(corr>0.05) & (corr <1)].index)
x_names

4. Optimize the model.


In [28]:
from sklearn.ensemble import GradientBoostingClassifier
from bayes_opt import BayesianOptimization
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score, roc_curve

In [29]:
train_x, test_x, train_y, test_y = train_test_split(df_to_model[x_names], df_to_model['loan_status'], test_size=0.2, random_state=1)

In [34]:
def opt_fun(learning_rate, min_samples_leaf, n_estimators):
    # The optimizer proposes floats, so integer hyperparameters are rounded first.
    min_samples_leaf = int(round(min_samples_leaf))
    n_estimators = int(round(n_estimators))
    # No .fit() here: cross_val_score clones and fits the estimator on each fold.
    model = GradientBoostingClassifier(learning_rate=learning_rate,
                                       min_samples_leaf=min_samples_leaf,
                                       n_estimators=n_estimators)
    score = cross_val_score(model, train_x, train_y, cv=3, scoring='roc_auc').mean()
    return score

In [31]:
params = {"learning_rate": [0.01,0.8],
          "min_samples_leaf": [5,50],
          "n_estimators": [20,200]}

In [35]:
optimization = BayesianOptimization(f = opt_fun,
                                    pbounds = params,
                                    )

In [None]:
optimization.maximize(n_iter=10, init_points=5)

In [None]:
best_params = optimization.max['params']
best_params

In [38]:
best_params['min_samples_leaf'] = int(round(best_params['min_samples_leaf']))
best_params['n_estimators'] = int(round(best_params['n_estimators']))

In [39]:
model = GradientBoostingClassifier(**best_params).fit(train_x,train_y)

In [40]:
train_pred = model.predict_proba(train_x)[:,1]
test_pred = model.predict_proba(test_x)[:,1]

In [41]:
auc_train  = round(roc_auc_score(train_y,train_pred),3)
auc_test = round(roc_auc_score(test_y,test_pred),3)

In [None]:
auc_test

In [None]:
auc_train
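
As a reading aid for the two AUC values above: ROC AUC equals the share of (positive, negative) pairs in which the positive example receives the higher score. The sketch below uses made-up labels and scores and a hypothetical helper `pairwise_auc`; the quadratic loop is for illustration only, and `roc_auc_score` remains the practical choice.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, only to show the mechanics.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly
```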

In [44]:
fpr_train, tpr_train, thresholds_train = roc_curve(train_y,train_pred)
fpr_test, tpr_test, thresholds_test = roc_curve(test_y,test_pred)

In [46]:
import numpy as np

In [None]:
plt.plot(fpr_train,tpr_train,label = 'train')
plt.plot(fpr_test, tpr_test, label = 'test')
plt.plot(np.arange(0,1,0.01), np.arange(0,1,0.01),'--')
plt.legend()
plt.annotate(f'AUC train: {auc_train}',xy=[0.2,0.8])
plt.annotate(f'AUC test: {auc_test}', xy=[0.2,0.75])
plt.show()

5. Build a simulation that optimizes the cut-off point.

- A false positive is a loss for the bank in the form of employee time; we estimate this loss at 50.
- A false negative is a loss for the bank of (loan_int_rate / 100) * loan_amnt, which approximates the profit the bank would have made if the application had been submitted and the customer had taken out the loan.
- A true positive is a profit of (loan_int_rate / 100) * loan_amnt.
- A true negative is a saving of 50 units.


In [51]:
fp = -50 
tn = 50

In [49]:
test  = test_x.copy()
test['pred'] = test_pred
test['class'] = test_y

In [52]:
total_margin_list = []

for i in range(0,100):
    threshold = i*1.0 / 100
    test['pred_class'] = (test['pred']>= threshold).astype(int)
    tp_revenue = np.sum((test['pred_class']==1).astype(int) * (test['class']==1).astype(int) * (test['loan_int_rate']/100)*test['loan_amnt'])
    fn_lost = -np.sum((test['pred_class']==0).astype(int) * (test['class']==1).astype(int) * (test['loan_int_rate']/100)*test['loan_amnt'])
    fp_lost = test[(test['pred_class']==1) & (test['class']==0)].shape[0] * fp
    tn_revenue = test[(test['pred_class']==0) & (test['class']==0)].shape[0] * tn 
    total_margin = tp_revenue + fn_lost + fp_lost + tn_revenue
    total_margin_list.append(total_margin)

In [None]:
total_margin_list

In [None]:
plt.plot(range(0,100),total_margin_list)
plt.show()

In [None]:
cut_off = total_margin_list.index(max(total_margin_list))/100
cut_off

# Project part II
To be completed as the last part of the block

1. Save the model to a file, then load it back.
2. Save the data frame to a local database.
3. Fetch the frame for a few records and make predictions. Write functions for fetching the relevant data and for prediction.
4. Save the model to MLflow.
5. Train any other model and save it to MLflow.
6. Compare the results of the models.

1. Save the model to a file, then load it back.

In [56]:
import os 
import joblib

In [None]:
os.path.exists('models')

In [59]:
if not os.path.exists('models'):
    os.mkdir('models')
    print('done')

In [None]:
joblib.dump(model,'models/gb_credit_approve.joblib')

In [61]:
loaded_model = joblib.load('models/gb_credit_approve.joblib')

2. Save the data frame to a local database.

In [63]:
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv


In [None]:
load_dotenv()

In [None]:
db = os.getenv('DB')
db

In [67]:
engine = create_engine(db)

In [None]:
df.to_sql('model_data_credit',con=engine, if_exists='append',method= 'multi')

3. Fetch the frame for a few records and make predictions. Write functions for fetching the relevant data and for prediction.

In [None]:
df.head()

In [70]:
cond = 'person_age >=60'

In [71]:
import sqlalchemy

In [None]:
type(engine)

In [73]:
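def data_load(engine: sqlalchemy.engine.base.Engine, conditions: str) -> pd.DataFrame:
    """
    Load rows from model_data_credit that satisfy the given SQL conditions.

    Note: `conditions` is interpolated directly into the query, so pass only trusted input.
    """
    try:
        to_pred = pd.read_sql(f"""select * 
                              from model_data_credit where {conditions} """, con=engine)
    except Exception as exc:
        # Re-raise so the caller does not continue with an undefined result.
        raise RuntimeError("Failed to load data for the given conditions") from exc
    if to_pred.shape[0] == 0:
        raise ValueError("No data for the given constraints")
    return to_pred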
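
Because `data_load` pastes `conditions` straight into the SQL text, a safer variant binds the values as query parameters. The sketch below illustrates the idea with the standard-library `sqlite3` module and an in-memory toy table; the helper `load_applicants_min_age`, the two-column schema, and the sample rows are assumptions made for the example, not the notebook's database. With pandas, the `params` argument of `pd.read_sql` serves the same purpose.

```python
import sqlite3


def load_applicants_min_age(connection, min_age):
    """Fetch rows with person_age >= min_age using a bound parameter."""
    query = (
        "SELECT person_age, loan_amnt FROM model_data_credit "
        "WHERE person_age >= ? ORDER BY person_age"
    )
    rows = connection.execute(query, (min_age,)).fetchall()
    if not rows:
        raise ValueError("No data for the given constraints")
    return rows


# In-memory toy table standing in for the real database (values are made up).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE model_data_credit (person_age INTEGER, loan_amnt REAL)")
connection.executemany(
    "INSERT INTO model_data_credit VALUES (?, ?)",
    [(25, 5000.0), (61, 12000.0), (70, 3000.0)],
)

rows = load_applicants_min_age(connection, 60)
print(rows)
```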

In [74]:
to_pred = data_load(engine=engine, conditions=cond)

In [None]:
to_pred

In [76]:
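def model_prediction(df: pd.DataFrame, 
                     model_path: str,
                     encoding_path: str):
    """
    Load the model and the encoder from disk and predict classes for df.
    """
    try:
        model = joblib.load(model_path)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Model not found: {model_path}") from exc
    try:
        encoding = joblib.load(encoding_path)
    except FileNotFoundError as exc:
        raise FileNotFoundError(f"Encoder not found: {encoding_path}") from exc
    # Reuse df's index so the join aligns rows even when the index is not 0..n-1.
    encoded = pd.DataFrame(data=encoding.transform(df[encoding.feature_names_in_]),
                           columns=encoding.get_feature_names_out(),
                           index=df.index)
    df = df.join(encoded)
    preds = model.predict(df[model.feature_names_in_])
    return preds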

In [None]:
ohe

In [None]:
joblib.dump(ohe,'models/ohe.joblib')

In [79]:
preds = model_prediction(to_pred, model_path='models/gb_credit_approve.joblib', encoding_path='models/ohe.joblib')

In [None]:
preds
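
Note that `model.predict` assigns the more probable class, which for a binary classifier corresponds to a 0.5 threshold, so the cut-off found in step 5 is not used by `model_prediction`. A minimal sketch of applying a custom cut-off to `predict_proba` output follows; `predict_with_cut_off` and `StubModel` are hypothetical names, and the stub only mimics the `predict_proba` interface so the example runs without a trained model.

```python
def predict_with_cut_off(model, features, cut_off):
    """Return 1 where the positive-class probability reaches cut_off, else 0."""
    probabilities = model.predict_proba(features)
    return [int(row[1] >= cut_off) for row in probabilities]


class StubModel:
    """Hypothetical stand-in that mimics the predict_proba interface of a classifier."""

    def predict_proba(self, features):
        # Treat each feature row's first value as the positive-class probability.
        return [[1 - row[0], row[0]] for row in features]


toy_features = [[0.03], [0.30], [0.70]]
default_like = predict_with_cut_off(StubModel(), toy_features, cut_off=0.5)
lower_cut_off = predict_with_cut_off(StubModel(), toy_features, cut_off=0.25)
print(default_like, lower_cut_off)
```

The same helper should work with the loaded scikit-learn model, since it only relies on `predict_proba` returning one row of class probabilities per record.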

4. Save the model to MLflow.

In [81]:
import mlflow

In [None]:
# command line: mlflow ui

In [82]:
mlflow.set_tracking_uri(uri = 'http://127.0.0.1:5000')

In [None]:
mlflow.set_experiment('credit acceptance')

In [None]:
model.get_params()

In [None]:
with mlflow.start_run():
    for key, value in  model.get_params().items():
        mlflow.log_param(key, value)
    mlflow.log_metric('auc_train', auc_train)
    mlflow.log_metric('auc_test',auc_test)
    signature = mlflow.models.infer_signature(model_input = train_x,
                                              model_output = ((train_pred >=0.04).astype(int)))
    model_info = mlflow.sklearn.log_model(
        sk_model = model,
        artifact_path = 'credits approval',
        signature = signature,
        input_example = train_x.head(),  # a few rows are enough as an input example
        registered_model_name  = 'credits approval new'
    )

5. Train any other model and save it to MLflow.

In [89]:
from sklearn.ensemble import HistGradientBoostingClassifier

In [90]:
model_2= HistGradientBoostingClassifier().fit(train_x, train_y)

In [93]:
train_pred = model_2.predict_proba(train_x)[:,1]
test_pred = model_2.predict_proba(test_x)[:,1]

In [94]:
auc_train = roc_auc_score(train_y, train_pred)
auc_test = roc_auc_score(test_y,test_pred)

6. Compare the results of the models.

In [None]:
with mlflow.start_run():
    for key, value in  model_2.get_params().items():
        mlflow.log_param(key, value)
    mlflow.log_metric('auc_train', auc_train)
    mlflow.log_metric('auc_test',auc_test)
    signature = mlflow.models.infer_signature(model_input = train_x,
                                              model_output = model_2.predict(train_x))
    model_info = mlflow.sklearn.log_model(
        sk_model = model_2,
        artifact_path = 'credits approval',
        signature = signature,
        input_example = train_x.head(),  # a few rows are enough as an input example
        registered_model_name  = 'credits approval hist GB'
    )