# Bank Customer Churn - Inference

This notebook presents the final step of the pipeline: inference.

Simple as it seems, inference must be designed to be robust against problems such as several models in the same folder or new data whose schema differs from the one the model was trained on. It must also assert that the output is in the expected format, for example to be served by an API.

In this case, due to the simplicity of the project, there is no mystery about what has to be done, but it is useful to think about it as part of a fully served pipeline.



In [2]:
import os
import pickle
import re

import pandas as pd
import numpy as np

from xgboost import XGBClassifier

In [3]:
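def obtain_score_from_name(file_name: str) -> float | None:
    # The score is expected right before the extension, e.g. '<name>_0.87.pkl'
    regex_pattern = r'_(?P<score>\d+\.\d+)\.pkl$'
    # re.search scans the whole name; re.match would only match at its start
    match = re.search(regex_pattern, file_name)
    if match:
        return float(match.group('score'))
    return None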
    
def obtain_model_with_higher_score(file_paths: list[str]) -> str:
    def sorting_function(x: str) -> float:
        score = obtain_score_from_name(x)
        # Files without a score in their name rank last
        if score is None:
            return float('-inf')
        return score

    # Pick the file with the highest score
    return max(file_paths, key=sorting_function)
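
The "different models in the same folder" concern can be illustrated with a small, self-contained sketch. It assumes that file names follow a `<name>_<score>.pkl` convention (inferred from the regular expression above) and that a higher score is better. `parse_score`, `pick_best` and the example file names are hypothetical, and raising an error when no file qualifies is just one possible design choice.

```python
from __future__ import annotations

import re

# Assumed naming convention: "<anything>_<score>.pkl", e.g. "xgb_0.87.pkl".
SCORE_PATTERN = re.compile(r"_(?P<score>\d+\.\d+)\.pkl$")


def parse_score(file_name: str) -> float | None:
    """Return the score embedded in the file name, or None if absent."""
    match = SCORE_PATTERN.search(file_name)
    return float(match.group("score")) if match else None


def pick_best(file_names: list[str]) -> str:
    """Return the file with the highest score; fail loudly if none qualifies."""
    scored = [(parse_score(name), name) for name in file_names]
    scored = [(score, name) for score, name in scored if score is not None]
    if not scored:
        raise FileNotFoundError("no file matches the '<name>_<score>.pkl' pattern")
    return max(scored)[1]


# Hypothetical folder listing, including files that are not models.
example_files = ["xgb_0.81.pkl", "xgb_0.87.pkl", ".gitkeep", "notes.txt"]
best = pick_best(example_files)
print(best)  # prints "xgb_0.87.pkl"
```

Failing on an empty or unscored folder avoids silently unpickling an unrelated file, which is what a plain sort over `os.listdir` could end up doing.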

In [4]:
model_folder_path = '../models'
files = os.listdir(model_folder_path)
best_model_file = obtain_model_with_higher_score(files)

with open(os.path.join(model_folder_path, best_model_file), 'rb') as f:
    xgb_model = pickle.load(f)

In [15]:
df_test = pd.read_csv('../data/raw/Abandono_teste.csv', sep=';')
df_inference = pd.read_csv('../data/processed/inference_data.csv')
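
Before predicting, the schema concern raised in the introduction could be handled with a check like the one below. This is a minimal sketch that works on plain lists of column names: with a DataFrame, `received` would be `list(df_inference.columns)`. Where `expected` comes from (for instance, a list saved at training time) is an assumption, since the notebook does not show it, and `check_schema` is a hypothetical helper.

```python
from __future__ import annotations


def check_schema(expected: list[str], received: list[str]) -> None:
    """Raise ValueError if the columns differ in name or order."""
    missing = [c for c in expected if c not in received]
    unexpected = [c for c in received if c not in expected]
    if missing or unexpected:
        raise ValueError(f"schema mismatch: missing={missing}, unexpected={unexpected}")
    if expected != received:
        raise ValueError("columns match but are in a different order")


# Hypothetical subset of the training columns, for illustration only.
training_columns = ["CreditScore", "Gender", "Age"]

# Identical columns in the same order pass silently.
check_schema(training_columns, ["CreditScore", "Gender", "Age"])

# A missing column is reported instead of reaching the model.
try:
    check_schema(training_columns, ["CreditScore", "Gender"])
except ValueError as error:
    schema_error = str(error)
    print(schema_error)
```

Checking the order as well as the names is deliberately strict: some models rely on column position rather than column name.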


In [10]:
df_inference

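| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | France | Germany | Spain | Products_Tenure_relation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 565 | 0 | 31 | 1 | 0.00 | 1 | 0 | 1 | 20443.08 | 1 | 0 | 0 | 1 |
| 1 | 569 | 0 | 34 | 4 | 0.00 | 1 | 0 | 1 | 4045.90 | 1 | 0 | 0 | 4 |
| 2 | 669 | 1 | 20 | 7 | 0.00 | 2 | 1 | 0 | 128838.67 | 1 | 0 | 0 | 14 |
| 3 | 694 | 0 | 39 | 4 | 173255.48 | 1 | 1 | 1 | 81293.10 | 1 | 0 | 0 | 4 |
| 4 | 504 | 0 | 28 | 10 | 109291.36 | 1 | 1 | 1 | 187593.15 | 0 | 0 | 1 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 531 | 1 | 34 | 10 | 118306.79 | 1 | 1 | 0 | 26493.05 | 1 | 0 | 0 | 10 |
| 996 | 575 | 0 | 49 | 2 | 136822.70 | 1 | 1 | 0 | 2487.74 | 0 | 1 | 0 | 2 |
| 997 | 520 | 1 | 74 | 4 | 0.00 | 1 | 0 | 0 | 26742.92 | 1 | 0 | 0 | 4 |
| 998 | 675 | 0 | 23 | 8 | 0.00 | 2 | 0 | 0 | 162342.21 | 0 | 0 | 1 | 16 |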


In [17]:
model_preds = xgb_model.predict(df_inference)
df_test['Exited'] = model_preds

predictions = df_test[['RowNumber', 'Exited']]
predictions

| | RowNumber | Exited |
|---|---|---|
| 0 | 10001 | 0 |
| 1 | 10002 | 0 |
| 2 | 10003 | 0 |
| 3 | 10004 | 0 |
| 4 | 10005 | 0 |
| ... | ... | ... |
| 995 | 10996 | 0 |
| 996 | 10997 | 1 |
| 997 | 10998 | 1 |
| 998 | 10999 | 0 |
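
To make the "expected format" requirement concrete, the predictions could be validated and serialized before being handed to an API. The sketch below is only an illustration: the JSON list-of-records layout and the `build_payload` helper are assumptions, not something the project defines. With the objects above, the inputs would be `predictions['RowNumber'].tolist()` and `predictions['Exited'].tolist()`.

```python
from __future__ import annotations

import json


def build_payload(row_numbers: list[int], labels: list[int]) -> str:
    """Validate predictions and serialize them as a JSON list of records."""
    if len(row_numbers) != len(labels):
        raise ValueError("one prediction is required per row")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("predictions must be binary (0 or 1)")
    records = [
        # int() also converts NumPy integers, which json cannot serialize.
        {"RowNumber": int(row), "Exited": int(label)}
        for row, label in zip(row_numbers, labels)
    ]
    return json.dumps(records)


# First rows of the output above, used as sample input.
payload = build_payload([10001, 10002, 10003], [0, 0, 0])
print(payload)
```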


With this, we conclude the reasoning and design of the bank customer churn project. What remains is building the pipeline code, which should be finished by the time you read this notebook, adapting the code presented here to good software engineering practices so that it stays clean, easy to maintain, and reproducible.