# Implement a scoring model

- **Project 7 of the OpenClassrooms "Data Scientist" path**
- **Mark Creasey**

## Step 2: Modelling


## 1.1 Understanding the problem

### 1.1.1 Problem statement

The financial company **"Prêt à dépenser"** offers consumer credit to
people with little or no credit history.

The company wants to deploy **a "credit scoring" tool** that computes the probability that a customer
repays their loan, then classifies the application as granted or refused. It therefore wants to
develop **a classification algorithm** drawing on varied data sources (behavioural data,
data from other financial institutions, etc.).
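The two stages described above (score, then decide) can be sketched as follows. This is only an illustration: the function name `credit_decision` and the 0.5 threshold are assumptions, and the score is taken to be the predicted probability of default (TARGET = 1 in this notebook).

```python
def credit_decision(default_probability: float, threshold: float = 0.5) -> str:
    """Return 'refused' when the predicted default probability reaches the threshold."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must lie in [0, 1]")
    return "refused" if default_probability >= threshold else "granted"


# Three hypothetical applicants with different predicted default probabilities
for p in (0.08, 0.35, 0.72):
    print(p, credit_decision(p, threshold=0.5))
```

In practice the threshold would be chosen from the business costs of each type of error rather than fixed at 0.5.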

### 1.1.2 The data

[The data](https://www.kaggle.com/c/home-credit-default-risk/data) needed to build the
dashboard are available on Kaggle. For convenience, they can also be downloaded from
[this address](https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/Parcours_data_scientist/Projet+-+Impl%C3%A9menter+un+mod%C3%A8le+de+scoring/Projet+Mise+en+prod+-+home-credit-default-risk.zip).

### 1.1.3 Mission

- Select a Kaggle kernel to ease the preparation of the data needed to build the scoring model.
- Analyse this kernel and adapt it to the needs of the mission.

Focus on:

1. Building a **scoring model** that automatically predicts the probability that a customer
   defaults.
   - development
   - optimisation
   - understanding (interpretability)
2. Building an **interactive dashboard** that transparently shows credit-granting decisions,
   aimed at customer relationship managers, so that they can interpret the
   predictions made by the model and improve their knowledge of the
   customer.


## 1.2 Environment definition

- `local`: local development (with a 50 MB data sample)
- `colab`: Google Colab
- `kaggle`: Kaggle Kernel
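Rather than setting the environment by hand, it could be guessed at runtime. The sketch below is an assumption, not part of the original notebook: it relies on the common heuristics that Colab preloads the `google.colab` module and that Kaggle kernels define the `KAGGLE_KERNEL_RUN_TYPE` environment variable. The function name `detect_env` is hypothetical.

```python
import os
import sys


def detect_env(modules=None, environ=None) -> str:
    """Guess the runtime: 'colab', 'kaggle' or 'local' (heuristic, see assumptions above)."""
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    if 'google.colab' in modules:
        return 'colab'
    if 'KAGGLE_KERNEL_RUN_TYPE' in environ:
        return 'kaggle'
    return 'local'


print(detect_env())
```

The optional `modules` and `environ` arguments exist only to make the function easy to test without actually running on each platform.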


In [1]:
ENV = 'local'

if ENV == 'local':
    # local development
    DATA_FOLDER = '../data/raw'
    OUT_FOLDER = '../data/out'
    IMAGE_FOLDER = '../images/modelisation'

if ENV == 'colab':
    # Colaboratory - uncomment the next 2 lines to mount your Google Drive
    # from google.colab import drive
    # drive.mount('/content/drive')
    DATA_FOLDER = '/content/drive/MyDrive/data/OC7'
    OUT_FOLDER = '/content/drive/MyDrive/data/OC7'
    IMAGE_FOLDER = '/content/drive/MyDrive/images/OC7/modelisation'


## 1.3 Data files

1. The data in CSV format (>700 MB compressed) can be downloaded from:

- https://www.kaggle.com/c/home-credit-default-risk/data
- For convenience, they can also be downloaded from [this address.](https://s3-eu-west-1.amazonaws.com/static.oc-static.com/prod/courses/files/Parcours_data_scientist/Projet+-+Impl%C3%A9menter+un+mod%C3%A8le+de+scoring/Projet+Mise+en+prod+-+home-credit-default-risk.zip)

2.  Place the compressed file (**.zip**) in the **DATA_FOLDER** defined above


### Data file names (identical for the cleaning and exploratory analysis notebooks)

- The large zip file of data must be placed in `DATA_FOLDER` beforehand
- All other data files are downloaded or created during cleaning, then saved in `OUT_FOLDER`
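Once the archive is in `DATA_FOLDER`, it can be unpacked with the standard library. This is a minimal sketch; the helper name `extract_zip` is hypothetical, and the notebook's own helpers in `outils_io` may handle this step differently.

```python
import zipfile
from pathlib import Path


def extract_zip(zip_path, out_folder) -> list:
    """Extract every member of zip_path into out_folder and return the member names."""
    out_folder = Path(out_folder)
    out_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_folder)
        return archive.namelist()


# Typical call, using the notebook's constants:
# extract_zip(Path(DATA_FOLDER) / ZIPPED_DATA_FILENAME, DATA_FOLDER)
```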


In [2]:
# Data (DATA_FOLDER)
ZIPPED_DATA_FILENAME = 'Projet+Mise+en+prod+-+home-credit-default-risk.zip'
RAW_DATA_FILENAME = 'HomeCredit_columns_description.csv'
SAMPLE_DATA_FILENAME = 'HomeCredit_columns_description.csv'


# Cleaned data (OUT_FOLDER)
CLEAN_DATA_FILENAME = 'cleaned_data_scoring.csv'
CLEAN_DATA_SAMPLE = 'cleaned_data_sample.csv'  # sample of SAMPLE_SIZE records
CLEAN_DATA_FEATURES = 'cleaned_data_features.csv'  # top 100 features
SAMPLE_SIZE = 10000


## 1.4 Requirements: libraries used in this notebook

This notebook has been tested in local development, on Google Colab and on Kaggle.

```txt
# copy into a requirements.txt file, then
# !pip install -r requirements.txt
```


In [3]:
# Uncomment the next line to install missing libraries without changing versions already installed
# !pip install numpy pandas matplotlib seaborn scipy scikit-learn missingno requests


In [4]:
# import local functions
import outils_io
outils_io.install_libraries({'numpy', 'pandas', 'matplotlib',
                             'seaborn', 'sklearn'})


required modules: ['seaborn', 'numpy', 'pandas', 'matplotlib', 'sklearn']
missing modules: []


## 1.5 Import dependencies


### 1.5.1 Import the libraries used by this notebook


In [5]:
# suppress FutureWarnings from pandas 1.3.0
from contextlib import contextmanager
import time
import gc
import os
import warnings
import platform
warnings.simplefilter(action='ignore', category=FutureWarning)


In [6]:
import sklearn
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# feature preprocessing
from sklearn import impute
from sklearn import preprocessing

# feature and parameter selection
from sklearn.model_selection import train_test_split, GridSearchCV

# Sampling (SMOTE : Synthetic Minority Oversampling TEchnique)
import imblearn
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek


### 1.5.2 Versions of the libraries used


In [7]:
print(f'python version = {platform.python_version()}')
print('versions des bibliothèques utilisées:')
print('; '.join(f'{m.__name__}=={m.__version__}' for m in globals(
).values() if getattr(m, '__version__', None)))


python version = 3.7.0
versions des bibliothèques utilisées:
platform==1.0.8; sklearn==1.0.2; seaborn==0.11.2; pandas==1.1.5; numpy==1.21.5; imblearn==0.9.0


### 1.5.3 Default display configuration


In [8]:
from sklearn import set_config
pd.set_option('display.max_columns', 200)  # display all columns
pd.set_option('display.max_rows', 20)  # display at most 20 rows
pd.set_option('display.max_colwidth', 800)

%matplotlib inline
sns.set_theme(style="white", context="notebook")
sns.set_color_codes("pastel")
sns.set_palette("tab10")

set_config(display='diagram')
# displays HTML representation in a jupyter context


### Personal library

The helper modules use a non-standard name prefix (`outils_`).


In [9]:
import outils_io
import outils_preprocess
import outils_stats
import outils_timed
import outils_vis

# frequently used functions
from outils_vis import to_png
from outils_timed import timer


### Personal configuration


In [10]:

# Store the global parameters in the helper modules
outils_vis.set_option('IMAGE_FOLDER', IMAGE_FOLDER)
outils_vis.set_option('SAVE_IMAGES', True)

if ENV != 'kaggle':
    outils_io.os_make_dir(DATA_FOLDER)
    outils_io.os_make_dir(OUT_FOLDER)

outils_io.os_make_dir(IMAGE_FOLDER)


# Import the cleaned data


In [11]:
# Set SAMPLE=True for Rapid Development
SAMPLE = True

full_data_path = f'{OUT_FOLDER}/{CLEAN_DATA_FILENAME}'  # > 700 variables
best_data_path = f'{OUT_FOLDER}/{CLEAN_DATA_FEATURES}'  # the top 100 features
# sample for rapid development
sample_data_path = f'{OUT_FOLDER}/{CLEAN_DATA_SAMPLE}'

cleaned_data_path = sample_data_path if SAMPLE else best_data_path
with timer('Load cleaned data'):
    df_data = pd.read_csv(cleaned_data_path)


Load cleaned data - done in 0s


In [12]:
with timer('Reduce memory'):
    df_data = outils_preprocess.reduce_memory(df_data)


Initial df memory usage is 6.49 MB for 85 columns
Final memory usage is: 1.88 MB - decreased by 71.0%
Reduce memory - done in 0s
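The source of `outils_preprocess.reduce_memory` is not shown in this notebook. One common way to obtain this kind of saving is to downcast numeric columns to the smallest dtype that holds their values. The sketch below illustrates that idea only; `downcast_numeric` is a hypothetical name, and the float downcast to `float32` trades precision for memory.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose numeric columns use the smallest dtype pandas will downcast to."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    return out


# Toy frame: a 0/1 flag and a float column
demo = pd.DataFrame({'flag': np.arange(1000) % 2,
                     'amount': np.linspace(0.0, 1.0, 1000)})
small = downcast_numeric(demo)
print(demo.memory_usage(deep=True).sum(), '->', small.memory_usage(deep=True).sum())
```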


## Create X (fields), y (target)


In [13]:
def create_X_y(df: pd.DataFrame):
    target = df['TARGET'].copy()
    fields = df.drop(columns=['TARGET', 'SK_ID_CURR'])
    return fields, target


X, y = create_X_y(df_data)


In [14]:
TARGET_CLASSES = ['0=repaid', '1=not repaid']
le = preprocessing.LabelEncoder()
target_classes = le.fit_transform(TARGET_CLASSES)
print(target_classes)


[0 1]


### Split train / test


In [15]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2)


## Data preprocessing

A preprocessor is created so that the preprocessing parameters can be tuned.
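Wrapping the preprocessing in a pipeline means its parameters can be tuned in the same grid as the model's, using scikit-learn's `step__parameter` naming. The sketch below uses hypothetical step names (`imputer`, `scaler`, `clf`) and a hypothetical grid; it only checks that every grid key is a valid pipeline parameter.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('clf', RidgeClassifier()),
])

# Hypothetical grid: preprocessing and model parameters are tuned together
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'scaler__with_std': [True, False],
    'clf__alpha': [0.1, 1.0, 10.0],
}

# Every key of the grid must be a valid parameter of the pipeline
valid_params = set(pipe.get_params())
print(all(key in valid_params for key in param_grid))
```

Such a grid could be passed as `param_grid` to `GridSearchCV` together with the pipeline.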


In [16]:
from sklearn.compose import make_column_selector

cat_selector = make_column_selector(dtype_include=object)
num_selector = make_column_selector(dtype_include=np.number)

category_features = cat_selector(x_train)
numerical_features = num_selector(x_train)
target_features = y_train.name

print(f'numerical_features : {len(numerical_features)}')
print(f'category_features : {len(category_features)}')
print(f'target_features : {target_features}')


numerical_features : 83
category_features : 0
target_features : TARGET


In [17]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer


def create_preprocessor(num_cols=num_selector, cat_cols=cat_selector):
    """Build a ColumnTransformer: impute and scale numeric columns, impute and one-hot encode categorical columns"""
    num_pipe = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
    cat_pipe = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # step named 'ohe' because get_features_out looks the encoder up under that name
        ('ohe', OneHotEncoder(dtype=int, sparse=True, handle_unknown='ignore'))
    ])
    preprocessor = ColumnTransformer(transformers=[
        ('num', num_pipe, num_cols),
        ('cat', cat_pipe, cat_cols)
    ],
        remainder='passthrough')
    return preprocessor


preprocessor = create_preprocessor()
preprocessor


In [18]:
numeric_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy="median")),
            ('simple_scale', StandardScaler())
        ]), num_selector),
    ]))
])
numeric_pipeline


In [19]:
numeric_pipeline.fit(x_train)
print(numeric_pipeline.feature_names_in_[:5])

# Raises an error if we try to retrieve the column names
# numeric_pipeline.get_feature_names_out(numeric_pipeline.feature_names_in_)


['EXT_SOURCE_3' 'EXT_SOURCE_2' 'EXT_SOURCE_1' 'PREV_DAYS_DECISION_MIN'
 'PREV_AMT_ANNUITY_MEAN']


### Get feature names for the preprocessors

To interpret the models, we need to know which variables exist after preprocessing.

Unfortunately, many `sklearn` transformers lose their feature names:

- With the library versions listed above, SimpleImputer, FunctionTransformer and StandardScaler do not implement `get_feature_names_out`
- This causes problems for model interpretability

The order of the output features depends on the column selectors and on the order of the transformers in the ColumnTransformer.

- With 'named transformers' and 'named steps' in a fixed order, the feature names can be recovered in the right order


In [20]:
def get_features_out(pipe, xtrain_):
    """
    Get column names after preprocessing

    assumes (requires preprocessors with following structure):
    - all named transformers, if present, are in same order as listed below
    - if preprocessor has other transformers, add to list below
    - only final transformer (named 'cat') adds columns, via an encoder step named 'ohe'
    - if imputer is used, it does not add indicator columns
    """

    def get_features_in_(trans: ColumnTransformer, name=None):
        try:
            features = list(trans.named_transformers_[name].feature_names_in_)
        except (KeyError, AttributeError):
            # named transformer is absent from the pipeline (or was not fitted): return empty list
            features = []
        return features

    pipe.fit(xtrain_)
    if hasattr(pipe, 'named_steps'):
        trans: ColumnTransformer = pipe.named_steps['preprocessor']
    else:
        trans = pipe
    # SimpleImputer, FunctionTransformer, StandardScaler do not implement get_feature_names_out

    # Get feature names of numeric columns
    num_features = get_features_in_(trans, 'num')
    scale_features = get_features_in_(trans, 'simple_scale')
    log_features = get_features_in_(trans, 'log_scale')
    ordinal_features = get_features_in_(trans, 'ordinal')
    try:
        cat_encoder = trans.named_transformers_['cat']
        ohe = cat_encoder.named_steps['ohe']
        category_features = list(
            ohe.get_feature_names_out(cat_encoder.feature_names_in_))
    except (KeyError, AttributeError):
        # no fitted 'cat' transformer with an 'ohe' step: no encoded columns
        category_features = []
    features_out = (num_features+scale_features + log_features
                    + ordinal_features + category_features)
    return features_out


Test get_features_out


In [21]:
x_transformed = numeric_pipeline.fit_transform(x_train)
x_transformed_columns = get_features_out(numeric_pipeline, x_train)
print(x_transformed.shape)
print(len(x_transformed_columns))
print(x_transformed_columns[:3])


(8000, 83)
83
['EXT_SOURCE_3', 'EXT_SOURCE_2', 'EXT_SOURCE_1']


#### Preprocessor with feature names


In [22]:
def preprocess(pipe, x_train_, x_test_):
    """Preprocess x_train and x_test separately (fit on train, transform only on test).
    Return the transformed data as pandas DataFrames with feature names"""
    features_out = get_features_out(pipe, x_train_)
    x_train_out = pd.DataFrame(pipe.fit_transform(
        x_train_), columns=features_out, index=x_train_.index)
    x_test_out = pd.DataFrame(pipe.transform(
        x_test_), columns=features_out, index=x_test_.index)
    return x_train_out, x_test_out


### Impute NaN before oversampling

Oversampling requires data without NaN values.

They are filled with SimpleImputer (median), followed by rescaling.
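What median imputation followed by standardisation does can be shown with plain NumPy on a toy array (the values are invented for illustration). In the notebook itself the medians, means and standard deviations are learned on `x_train` and reused on `x_test`.

```python
import numpy as np

X_toy = np.array([[1.0, 200.0],
                  [np.nan, 400.0],
                  [3.0, np.nan],
                  [5.0, 600.0]])

# Median of each column, ignoring missing values
medians = np.nanmedian(X_toy, axis=0)
X_imputed = np.where(np.isnan(X_toy), medians, X_toy)

# Standardise: zero mean and unit variance per column
X_scaled = (X_imputed - X_imputed.mean(axis=0)) / X_imputed.std(axis=0)

print(medians)
print(np.isnan(X_scaled).any())
```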


In [23]:
with timer('preprocess data'):
    x_train_prep, x_test_prep = preprocess(preprocessor, x_train, x_test)


preprocess data - done in 0s


## Sampling options: rebalancing the target classes

The full cleaned dataset has about 800 columns, most of which will have negligible importance for the model.

Feature selection can be based on the data, but since most records have target = 0, this puts more weight on the 'loan repaid' class.

The reference below recommends rebalancing the classes via oversampling BEFORE feature selection.

#### References

- [SMOTE for high-dimensional class-imbalanced data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3648438/)


### Oversample the data without missing values (NaN)


Several sampling strategies are available to balance the class weights:

**Weights**

- give more weight to minority-class observations

**Random undersampling**

- randomly remove majority-class observations

**Random oversampling**

- randomly add copies of minority-class observations to the data

**Synthetic Minority Oversampling Technique (SMOTE)**

- add synthetic minority observations that are similar to, but distinct from, the existing minority observations

**SMOTE Tomek**

- oversample, then undersample the borderline cases
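The strategies listed above can be illustrated with NumPy alone on a toy target. This is a simplified sketch with invented data, not the `imblearn` implementation: in particular, real SMOTE interpolates towards one of the k nearest minority neighbours, whereas the last step here simply interpolates between two arbitrary minority points.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)          # toy target: 10 % minority class
idx_major = np.flatnonzero(y == 0)
idx_minor = np.flatnonzero(y == 1)

# Weights: the 'balanced' heuristic n_samples / (n_classes * class_count)
counts = np.bincount(y)
class_weight = len(y) / (len(counts) * counts)

# Random undersampling: keep as many majority rows as there are minority rows
under_idx = np.concatenate([
    rng.choice(idx_major, size=len(idx_minor), replace=False), idx_minor])

# Random oversampling: draw minority rows with replacement up to the majority size
over_idx = np.concatenate([
    idx_major, rng.choice(idx_minor, size=len(idx_major), replace=True)])

# SMOTE-like step: a synthetic point on the segment between two minority points
X_minor = rng.normal(size=(len(idx_minor), 2))
a, b = X_minor[0], X_minor[1]
synthetic = a + rng.random() * (b - a)

print(class_weight, np.bincount(y[under_idx]), np.bincount(y[over_idx]))
```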


# Create the sampler instances


In [24]:
# Random undersampling
undersampler = RandomUnderSampler(sampling_strategy='majority')

# Random oversampling
oversampler = RandomOverSampler(sampling_strategy='minority')

# ADASYN (adaptive synthetic oversampling, a SMOTE variant) is a separate imblearn class:
# 'ADASYN' is not a valid value for SMOTE's sampling_strategy
from imblearn.over_sampling import ADASYN
smote_adasyn = ADASYN(sampling_strategy='minority')

# SMOTE oversampling, then undersampling by removing Tomek links
smote_tomek = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))


In [25]:
from imblearn import over_sampling

oversampler = over_sampling.SMOTE()
with timer(title='oversample'):
    x_train_smote, y_train_smote = oversampler.fit_resample(
        x_train_prep, y_train)


print(X.shape)
print(x_train.shape)
print(x_train_smote.shape)


oversample - done in 0s
(10000, 83)
(8000, 83)
(14676, 83)


In [26]:

undersampler = RandomUnderSampler(sampling_strategy='majority')
oversampler = RandomOverSampler(sampling_strategy='minority')

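# ADASYN is a separate imblearn class, not a value of SMOTE's sampling_strategy
from imblearn.over_sampling import ADASYN
smote_adasyn = ADASYN(sampling_strategy='minority')


# SMOTE oversampling, then undersampling by removing Tomek links
smote_tomek = SMOTETomek(tomek=TomekLinks(sampling_strategy='majority'))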


## Building a model pipeline with SMOTE

References

- <https://towardsdatascience.com/the-right-way-of-using-smote-with-cross-validation-92a8d09d00c7>
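The key point of using a sampler inside a pipeline is that resampling is applied to the training folds only, so each validation fold keeps its original class distribution. The manual loop below sketches that behaviour on synthetic data; random oversampling stands in for SMOTE so that the example depends only on NumPy and scikit-learn, and all data and parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   weights=[0.9, 0.1], random_state=11)
rng = np.random.default_rng(11)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = []
for train_idx, valid_idx in cv.split(X_toy, y_toy):
    # Resample the training fold only (random oversampling stands in for SMOTE)
    minority = train_idx[y_toy[train_idx] == 1]
    majority = train_idx[y_toy[train_idx] == 0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    fold_idx = np.concatenate([train_idx, extra])

    clf = RidgeClassifier().fit(X_toy[fold_idx], y_toy[fold_idx])
    # The validation fold keeps its original, imbalanced distribution
    scores.append(roc_auc_score(y_toy[valid_idx],
                                clf.decision_function(X_toy[valid_idx])))

print(round(float(np.mean(scores)), 3))
```

An `imblearn` pipeline gives the same guarantee automatically, because its samplers are applied during `fit` and skipped at prediction time.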


In [27]:
from imblearn import pipeline as imbpipeline
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif, chi2, mutual_info_classif
from sklearn.model_selection import StratifiedKFold
# from sklearn.svm import LinearSVC
# test the model pipeline with a fast classifier
from sklearn.linear_model import RidgeClassifier


def create_model(
    sampler=SMOTE(random_state=11),
    preprocessor=numeric_pipeline,
    feature_selector=SelectKBest(score_func=f_classif, k=50),
    classifier=RidgeClassifier()
):
    model = imbpipeline.Pipeline(steps=[
        ('smote', sampler),
        ('preprocess', preprocessor),
        ('feat_select', feature_selector),
        ('clf', classifier)
    ])
    # ('feat_select', SelectFromModel(LGBMClassifier, max_features=100)),
    return model


test_model = create_model(preprocessor=preprocessor)
test_model


In [28]:


stratified_kfold = StratifiedKFold(n_splits=5,
                                   shuffle=True,
                                   random_state=11)


param_grid = {'clf__alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
grid_search = GridSearchCV(estimator=test_model,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           verbose=2,
                           return_train_score=True,
                           n_jobs=-1)


grid_search.fit(x_train_prep, y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(x_test_prep, y_test)
# return {'cv_score':cv_score, 'test_score':test_score}


Fitting 5 folds for each of 7 candidates, totalling 35 fits


# Evaluation metrics

For binary classification, the metrics used to measure the error between y_pred and y_test are:

**ROC AUC (Area Under the Curve)**
- can be compared across models

**log loss**
- logistic loss or cross-entropy loss, computed from predicted probabilities; it also extends to targets with more than 2 classes

**precision**
What proportion of the predicted positives are truly positive?
- precision = TP / (TP + FP)

**recall**
What proportion of the true positives are found by the prediction?
- recall = TP / (TP + FN)

**F1-score**
Harmonic mean of precision and recall
- F1 = 2 * (precision * recall) / (precision + recall)
- F1 = (2 * TP) / (2 * TP + FP + FN)

**Fbeta-score**
Generalisation of the F1-score that puts more weight on precision (e.g. beta=0.5) or on recall (e.g. beta=2)
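The formulas above can be checked by hand on a small example. The sketch below computes the confusion counts and the F-beta family in plain Python, using F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall). The toy label vectors and the helper names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    return tn, fp, fn, tp


def fbeta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy case: five bad payers, only one of them detected
y_true = [1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)
print(round(fbeta(precision, recall), 3),
      round(fbeta(precision, recall, beta=2), 3),
      round(fbeta(precision, recall, beta=0.5), 3))
```

With perfect precision but low recall, F2 falls well below F1 while F0.5 rises above it, which is the weighting behaviour described above.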



### Custom Credit Scorer

With TARGET = 1 meaning "not repaid", the cost to the bank of a false negative (granting a loan to a bad payer) is higher than the cost of a false positive (refusing a loan to a good customer).

False negatives must therefore be penalised more heavily than false positives.
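Under the cost ratio assumed in this notebook (a bad loan granted costs three times the profit of a good loan, and a refused good customer costs one lost loan), the score implemented in the next cell can be written as:

```latex
\[
\text{score} \;=\; \frac{TN - FP - 3\,FN}{TN + FP + FN + TP}
\]
```

As a worked example, one observation in each cell of the confusion matrix gives (1 - 1 - 3) / 4 = -0.75, and a perfect prediction on two good and two bad customers gives (2 - 0 - 0) / 4 = 0.5. The best achievable score therefore equals the share of good customers in the sample.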



In [29]:
from sklearn import metrics  # needed here: this cell runs before the next cell's import


def custom_credit_score(y_test, y_pred):
    """
    Penalise loans that are not repaid (a larger loss than turning away a good customer).
    Score = (TN - FP - 3*FN) / n_samples
    """
    valeur_par_pret = 1  # assume loans granted and loans lost have the same value
    grand_pert_mauvais_pret = 3  # assume a bad loan costs 3 times the profit of a good loan
    # labels=[0, 1] guarantees a 2x2 matrix even if one class is absent
    cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1])
    TN = cm[0, 0]  # y_true=0 (good), y_pred=0 (good): loans the bank wants to grant
    FN = cm[1, 0]  # y_true=1 (bad), y_pred=0 (good): large loss for the bank
    FP = cm[0, 1]  # y_true=0 (good), y_pred=1 (bad): small loss (a good customer is refused)
    TP = cm[1, 1]  # y_true=1 (bad), y_pred=1 (bad): loans the bank wants to refuse
    # bank_profit = TN * valeur_par_pret
    # bank_loss = FN * grand_pert_mauvais_pret + FP * valeur_par_pret
    # bank_net_profit = bank_profit - bank_loss
    #                 = (TN - FP) * valeur_par_pret - FN * grand_pert_mauvais_pret
    # normalised by the sample size (cm.sum())
    profit_score = ((TN - FP) * valeur_par_pret
                    - FN * grand_pert_mauvais_pret) / cm.sum()
    return profit_score


In [30]:
from sklearn import metrics

# constants kept for reference elsewhere
ROC_AUC = 'roc_auc'
CV_SCORING = ''


def performance_metrics(y_true_, y_pred_):
    """Compute several performance and loss metrics"""
    loss = dict(
        roc_auc=metrics.roc_auc_score(y_true_, y_pred_),
        logloss=metrics.log_loss(y_true_, y_pred_),
        precision=metrics.precision_score(y_true_,y_pred_),
        recall=metrics.recall_score(y_true_,y_pred_),
        f1 = metrics.f1_score(y_true_,y_pred_),
        f2 = metrics.fbeta_score(y_true_,y_pred_,beta=2),
        f05 = metrics.fbeta_score(y_true_,y_pred_,beta=0.5),
        custom_score=custom_credit_score(y_true_,y_pred_),
    )
    for metric in loss.keys():
        loss[metric] = round(loss[metric], 3)
    return loss

y_true1=[0,0,1,1]
y_pred1=[0,1,0,1]
print(metrics.confusion_matrix(y_true1,y_pred1))
print(performance_metrics(y_true1,y_pred1))




[[1 1]
 [1 1]]
{'roc_auc': 0.5, 'logloss': 17.27, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'f2': 0.5, 'f05': 0.5, 'custom_score': -0.75}



### Accumulating model performance

In [31]:
df_resultats = pd.DataFrame()


def add_model_score(df: pd.DataFrame = None, model_name: str = 'none', ARI: float = 0, k: int = 0, **kwargs):
    """Append the results of a model to df (defaults to the global df_resultats)"""
    global df_resultats
    if df is None:
        df = df_resultats
    resultats = dict(model=model_name)
    resultats = dict(**resultats, **kwargs)
    # DataFrame.append is deprecated since pandas 1.4 and removed in 2.0 (use pd.concat there)
    df = df.append(resultats, ignore_index=True)
    return df


# test
add_model_score(pd.DataFrame(), optimizer='adam', k=7)


Unnamed: 0,model,optimizer
0,none,adam


## Testing the performance metrics

In [32]:
y_true1=[0,0,1,1]
y_pred1=[0,0,1,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(pd.DataFrame(),'perfect',**perf)
y_true1=[0,0,1,1]
y_pred1=[0,1,0,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(df_test,'1-each (TN,FN,FP,TP)',**perf)
y_true1=[1,1,0,1]
y_pred1=[0,0,0,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(df_test,'a few bad payers',**perf)
y_true1=[1,1,1,1,0,1]
y_pred1=[0,0,0,0,0,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(df_test,'many bad payers',**perf)
y_true1=[0,0,0,1]
y_pred1=[0,1,1,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(df_test,'a few wrong refusals',**perf)
y_true1=[0,0,0,0,0,1]
y_pred1=[0,1,1,1,1,1]
perf=performance_metrics(y_true1,y_pred1)
df_test=add_model_score(df_test,'many wrong refusals',**perf)
df_test.sort_values(by='custom_score')

Unnamed: 0,custom_score,f05,f1,f2,logloss,model,precision,recall,roc_auc
3,-1.833,0.556,0.333,0.238,23.026,many bad payers,1.0,0.2,0.6
2,-1.25,0.714,0.5,0.385,17.269,a few bad payers,1.0,0.333,0.667
1,-0.75,0.5,0.5,0.5,17.27,"1-each (TN,FN,FP,TP)",0.5,0.5,0.5
5,-0.5,0.238,0.333,0.556,23.026,many wrong refusals,0.2,1.0,0.6
4,-0.25,0.385,0.5,0.714,17.27,a few wrong refusals,0.333,1.0,0.667
0,0.5,1.0,1.0,1.0,0.0,perfect,1.0,1.0,1.0


On these test cases, the F2-score ranks the scenarios in the same order as the custom scorer, so it could be used in its place.

We optimise roc_auc while also tracking custom_score and f2_score.
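The agreement between the two metrics can be verified directly from the comparison table above. The sketch below copies the `custom_score` and `f2` columns and checks that sorting by either one gives the same order. It shows agreement on these six toy scenarios only, not in general.

```python
# (custom_score, f2) pairs copied from the comparison table above
rows = {
    'many bad payers': (-1.833, 0.238),
    'a few bad payers': (-1.25, 0.385),
    '1-each (TN,FN,FP,TP)': (-0.75, 0.5),
    'many wrong refusals': (-0.5, 0.556),
    'a few wrong refusals': (-0.25, 0.714),
    'perfect': (0.5, 1.0),
}

by_custom = sorted(rows, key=lambda name: rows[name][0])
by_f2 = sorted(rows, key=lambda name: rows[name][1])
print(by_custom == by_f2)
```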

In [33]:
custom_scorer=metrics.make_scorer(custom_credit_score,greater_is_better=True)
f2_scorer=metrics.make_scorer(metrics.fbeta_score,greater_is_better=True, beta=2)

# GridSearchCV expects a dictionary when several scorers are provided
CV_SCORERS= {
    'roc_auc':'roc_auc',
    'custom_scorer': custom_scorer,
    'f2_scorer':f2_scorer
}


In [35]:
# Test GridSearch
grid_search = GridSearchCV(estimator=test_model,
                           param_grid=param_grid,
                           scoring=CV_SCORERS,
                           refit='roc_auc',
                           cv=stratified_kfold,
                           verbose=2,
                           return_train_score=True,
                           n_jobs=-1)


grid_search.fit(x_train_prep, y_train)
cv_score = grid_search.best_score_
test_score = grid_search.score(x_test_prep, y_test)
df_cv=pd.DataFrame(grid_search.cv_results_)
df_cv.head()

Fitting 5 folds for each of 7 candidates, totalling 35 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__alpha,params,split0_test_roc_auc,split1_test_roc_auc,split2_test_roc_auc,split3_test_roc_auc,split4_test_roc_auc,mean_test_roc_auc,std_test_roc_auc,rank_test_roc_auc,split0_train_roc_auc,split1_train_roc_auc,split2_train_roc_auc,split3_train_roc_auc,split4_train_roc_auc,mean_train_roc_auc,std_train_roc_auc,split0_test_custom_scorer,split1_test_custom_scorer,split2_test_custom_scorer,split3_test_custom_scorer,split4_test_custom_scorer,mean_test_custom_scorer,std_test_custom_scorer,rank_test_custom_scorer,split0_train_custom_scorer,split1_train_custom_scorer,split2_train_custom_scorer,split3_train_custom_scorer,split4_train_custom_scorer,mean_train_custom_scorer,std_train_custom_scorer,split0_test_f2_scorer,split1_test_f2_scorer,split2_test_f2_scorer,split3_test_f2_scorer,split4_test_f2_scorer,mean_test_f2_scorer,std_test_f2_scorer,rank_test_f2_scorer,split0_train_f2_scorer,split1_train_f2_scorer,split2_train_f2_scorer,split3_train_f2_scorer,split4_train_f2_scorer,mean_train_f2_scorer,std_train_f2_scorer
0,0.586068,0.065218,0.050962,0.009222,0.001,{'clf__alpha': 0.001},0.705082,0.730214,0.73056,0.757922,0.711523,0.72706,0.018434,7,0.762292,0.754394,0.751582,0.753257,0.761558,0.756617,0.004432,0.24875,0.28125,0.304375,0.261875,0.2575,0.27075,0.019896,3,0.305781,0.299531,0.280625,0.2875,0.309531,0.296594,0.010937,0.375587,0.400381,0.411423,0.426997,0.384977,0.399873,0.018336,2,0.435145,0.43308,0.42603,0.430285,0.434731,0.431854,0.003376
1,0.57681,0.093513,0.042363,0.002741,0.01,{'clf__alpha': 0.01},0.705082,0.730214,0.73056,0.757928,0.711523,0.727061,0.018436,5,0.762292,0.754394,0.751582,0.753257,0.761559,0.756617,0.004432,0.24875,0.28125,0.304375,0.261875,0.2575,0.27075,0.019896,3,0.305781,0.299531,0.280625,0.2875,0.309531,0.296594,0.010937,0.375587,0.400381,0.411423,0.426997,0.384977,0.399873,0.018336,2,0.435145,0.43308,0.42603,0.430285,0.434731,0.431854,0.003376
2,0.474728,0.014682,0.042973,0.003599,0.1,{'clf__alpha': 0.1},0.705082,0.730214,0.73056,0.757928,0.711523,0.727061,0.018436,5,0.762293,0.754395,0.751581,0.753258,0.761557,0.756617,0.004432,0.24875,0.28125,0.304375,0.261875,0.2575,0.27075,0.019896,3,0.305469,0.299531,0.280625,0.2875,0.309531,0.296531,0.010885,0.375587,0.400381,0.411423,0.426997,0.384977,0.399873,0.018336,2,0.435042,0.43308,0.42603,0.430285,0.434731,0.431833,0.003356
3,0.480906,0.016891,0.040064,0.001939,1.0,{'clf__alpha': 1},0.705123,0.730235,0.73056,0.757948,0.711513,0.727076,0.018435,4,0.762288,0.754396,0.751579,0.753253,0.761555,0.756614,0.004431,0.24875,0.28125,0.304375,0.261875,0.2575,0.27075,0.019896,3,0.305156,0.299531,0.280625,0.2875,0.309531,0.296469,0.010835,0.375587,0.400381,0.411423,0.426997,0.384977,0.399873,0.018336,2,0.434938,0.43308,0.42603,0.430285,0.434731,0.431813,0.003337
4,0.480822,0.017333,0.040093,0.001485,10.0,{'clf__alpha': 10},0.705191,0.730312,0.730565,0.758015,0.711533,0.727123,0.018441,3,0.762283,0.754405,0.751566,0.753262,0.761559,0.756615,0.004432,0.24875,0.28125,0.305625,0.261875,0.253125,0.270125,0.020966,7,0.305156,0.299531,0.280938,0.288125,0.309219,0.296594,0.010566,0.375587,0.400381,0.411822,0.426997,0.379925,0.398942,0.019299,7,0.434938,0.43308,0.42613,0.430487,0.434626,0.431852,0.003266
