Data: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

Example to follow:
https://www.kaggle.com/code/bandiatindra/telecom-churn-prediction

***Note: The data analysis is in the previous notebook; this one focuses only on streamlining the code with pipelines.***

- *In this notebook I built a pipeline that compares more than one estimator. Both estimators were chosen based on the cross-validation results from the first notebook, linked below. This notebook was created only to make the process more automated, which includes comparing two or more models within a pipeline.*

Link to the notebook with the analysis and without pipelines: https://colab.research.google.com/drive/1wzU2AFfwnxCYVqCNi_2KBfZEmu4uvSQj

The pipeline with a single estimator (more straightforward) is at this link: https://colab.research.google.com/drive/1tIV-2pSEpRGbbHR2noDaOAuIhEnTIXzM#scrollTo=G_LK3CKfmJMB

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from statistics import median, mean
from sklearn import metrics

# Import Randomized Search
from sklearn.model_selection import RandomizedSearchCV

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn import svm
from sklearn.ensemble import HistGradientBoostingClassifier

In [2]:
df = pd.read_csv("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [3]:
df.shape

(7043, 21)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [5]:
#for i in df.columns:
#  print(df[i].value_counts())

In [6]:
df2 = df.copy()

In [7]:
# Find the object-type columns that hold exactly two unique categorical values, "No" and "Yes",
# map them to 0 and 1, and cast the result to int64.
# The mapping is vectorized, which avoids chained assignment such as df2[i][j] = n.
categorical_column = {'No': 0, 'Yes': 1}
for i in df2.columns:
  if (df2[i].dtypes == 'object') and (df2[i].nunique() == 2) and ('Yes' in df2[i].values):
    df2[i] = df2[i].map(categorical_column).astype(dtype='int64')


The columns listed below are assigned to a variable, nominal_features, so that the categorical treatment is applied exclusively to them

In [8]:
# Object-type variables with more than 2 unique values
nominal_features = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
                    'StreamingMovies', 'Contract', 'PaymentMethod']
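
The list above is written by hand. As a rough sketch, the same rule (non-numeric columns with more than 2 unique values) could be applied programmatically. The helper name `find_nominal_features`, its `exclude` parameter, and the toy DataFrame are assumptions made for illustration; on the real data, columns such as customerID and TotalCharges would also match the rule and would need to be passed in `exclude`.

```python
import pandas as pd


def find_nominal_features(frame: pd.DataFrame, exclude=()) -> list:
    """Return non-numeric columns with more than two distinct values.

    `exclude` is a hypothetical parameter for identifier-like columns
    that satisfy the rule but should not be one-hot encoded.
    """
    candidates = frame.select_dtypes(exclude="number").columns
    return [
        col for col in candidates
        if col not in exclude and frame[col].nunique() > 2
    ]


# Toy data that mimics a few columns of the Telco dataset.
toy = pd.DataFrame({
    "customerID": ["a-1", "b-2", "c-3", "d-4"],
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "Partner": ["Yes", "No", "Yes", "No"],
    "tenure": [1, 34, 2, 45],
})

nominal = find_nominal_features(toy, exclude=("customerID",))
print(nominal)
```

Partner is skipped because it has only two values, and tenure because it is numeric.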

In [9]:
df2.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,1,0,1,0,No phone service,DSL,No,...,No,No,No,No,Month-to-month,1,Electronic check,29.85,29.85,0
1,5575-GNVDE,Male,0,0,0,34,1,No,DSL,Yes,...,Yes,No,No,No,One year,0,Mailed check,56.95,1889.5,0
2,3668-QPYBK,Male,0,0,0,2,1,No,DSL,Yes,...,No,No,No,No,Month-to-month,1,Mailed check,53.85,108.15,1
3,7795-CFOCW,Male,0,0,0,45,0,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,0,Bank transfer (automatic),42.3,1840.75,0
4,9237-HQITU,Female,0,0,0,2,1,No,Fiber optic,No,...,No,No,No,No,Month-to-month,1,Electronic check,70.7,151.65,1


In [10]:
df2.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [11]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   int64  
 4   Dependents        7043 non-null   int64  
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   int64  
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   int64  


In [12]:
# In principle, the TotalCharges column should be numeric, but it is currently of type 'Object' because the feature contains blank values, as shown in the result below.
df2.loc[df2['TotalCharges'] == ' ']

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,1,1,0,0,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,1,Bank transfer (automatic),52.55,,0
753,3115-CZMZD,Male,0,0,1,0,1,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,20.25,,0
936,5709-LVOEQ,Female,0,1,1,0,1,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,0,Mailed check,80.85,,0
1082,4367-NUYAO,Male,0,1,1,0,1,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,25.75,,0
1340,1371-DWPAZ,Female,0,1,1,0,0,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,0,Credit card (automatic),56.05,,0
3331,7644-OMVMY,Male,0,1,1,0,1,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,19.85,,0
3826,3213-VVOLG,Male,0,1,1,0,1,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,25.35,,0
4380,2520-SGTTA,Female,0,1,1,0,1,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,0,Mailed check,20.0,,0
5218,2923-ARZLG,Male,0,1,1,0,1,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,1,Mailed check,19.7,,0
6670,4075-WKNIU,Female,0,1,1,0,1,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,0,Mailed check,73.35,,0


- Handle the missing values in this column using the training data only, separately from the test data
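
To illustrate the point, here is a minimal sketch of train-only imputation on toy data. The toy values and variable names are assumptions; the notebook itself handles this step inside the pipeline with SimpleImputer. The median is learned from the training split and then reused on the test split, so no information from the test data leaks into the preprocessing.

```python
import pandas as pd

# Toy splits: blank strings stand in for the missing TotalCharges values.
train = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", "1889.5"]})
test = pd.DataFrame({"TotalCharges": [" ", "151.65"]})

# errors="coerce" turns the blank strings into NaN.
train_num = pd.to_numeric(train["TotalCharges"], errors="coerce")
test_num = pd.to_numeric(test["TotalCharges"], errors="coerce")

# The statistic comes from the training data only ...
train_median = train_num.median()

# ... and is applied to both splits.
train_filled = train_num.fillna(train_median)
test_filled = test_num.fillna(train_median)

print(train_median)
print(test_filled.tolist())
```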

In [13]:
X = df2.drop(['Churn', 'gender', 'customerID'], axis=1)
y = df2[['Churn']]

In [14]:
# Fit the pipeline after this line, using the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Creating a few named functions to replace the following lambda functions that used to be inside the pipeline. This was the workaround found after the model could not be saved as a .pkl file because of the lambda functions in the pipeline:

    ('replace', FunctionTransformer(lambda x: x.replace(' ', np.nan))),
    ('convert_type', FunctionTransformer(lambda x: x.astype(float)))
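
The pickling problem can be reproduced without scikit-learn. The sketch below uses only the standard library: pickle stores functions by reference to their module-level name, and a lambda has no such name to look up. The helper `can_pickle` and the list-based `convert_type_to_float` are simplified stand-ins written for this illustration.

```python
import pickle


def convert_type_to_float(values):
    """Named, module-level function: pickle can store a reference to it."""
    return [float(v) for v in values]


def can_pickle(obj) -> bool:
    """Return True if pickle.dumps succeeds for obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# A lambda with the same body cannot be pickled by the standard pickle module.
lambda_ok = can_pickle(lambda values: [float(v) for v in values])
named_ok = can_pickle(convert_type_to_float)

print("lambda picklable:", lambda_ok)
print("named function picklable:", named_ok)
```

The same reasoning applies to a FunctionTransformer, which keeps a reference to the function it wraps.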

In [15]:
def replace_nan(x):
  return x.replace(' ', np.nan)

In [16]:
def convert_type_to_float(x):
  return x.astype(float)

In [17]:
nominal_transformer = Pipeline(steps = [
    ('ohe', OneHotEncoder(handle_unknown ='ignore'))
])

numerical_transformer = Pipeline(steps = [
    ('replace', FunctionTransformer(replace_nan)),
    ('imputer', SimpleImputer(strategy='median')),
    ('convert_type', FunctionTransformer(convert_type_to_float))

])

preprocessor = ColumnTransformer(
    transformers = [
        ('nominal', nominal_transformer, nominal_features),
        ('numerical', numerical_transformer,['TotalCharges'])
    ])

In [18]:
models = {
    'hgb': {'model': HistGradientBoostingClassifier(),
            'param_grid': {"model__verbose": [0, 1],
                           "model__learning_rate": [0.01, 0.1, 0.3],
                           "model__l2_regularization": [0.0, 1.0, 2.0],
                           "model__max_iter": [50,100,150,200, 250],
                           "model__max_depth": [None, 5, 10, 15, 20, 30],
                           "model__min_samples_leaf": [1, 2, 4,7],
                           "model__class_weight": [None, "balanced"]
                           }
            },
    'log_reg': {'model': LogisticRegression(),
                'param_grid': {"model__C": [0.01, 0.1, 1, 10],
                               "model__solver": ['liblinear', 'lbfgs', 'newton-cholesky'],
                               "model__max_iter": [50,100,150,200],
                               "model__penalty": ["l1", "l2"],
                               "model__class_weight": [None, "balanced"]
                               }
                }
    }

models_results = {}

for model_name, config_model in models.items():
  pipeline = Pipeline(steps=[
      ('preprocessor', preprocessor),
      ('scaler', MinMaxScaler()),
      ('model', config_model['model'])
  ])

  rand_search = RandomizedSearchCV(pipeline, config_model['param_grid'], n_iter=32, scoring="accuracy", verbose=True,
                                   cv=5, n_jobs=-1, random_state=2)
  rand_search.fit(X_train, y_train)

  models_results[model_name] = {
        'model': rand_search.best_estimator_,
        'score': rand_search.best_score_,
        'best_parameters': rand_search.best_params_
    }


Fitting 5 folds for each of 32 candidates, totalling 160 fits


  y = column_or_1d(y, warn=True)


Fitting 5 folds for each of 32 candidates, totalling 160 fits


35 fits failed out of a total of 160.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_
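
The failed fits reported above appear to come from solver and penalty combinations that LogisticRegression rejects: the traceback ends in `_check_solver`, and the grid pairs `penalty="l1"` with `lbfgs` and `newton-cholesky`, which support only the l2 penalty among the options listed. One possible way to avoid sampling invalid pairs is to split the grid into a list of dictionaries, one per compatible group. This is a sketch that reuses the original values; the names `shared`, `log_reg_param_grid`, and `count_candidates` are illustrative.

```python
# Parameters that are valid for every solver in the original grid.
shared = {
    "model__C": [0.01, 0.1, 1, 10],
    "model__max_iter": [50, 100, 150, 200],
    "model__class_weight": [None, "balanced"],
}

# Each dictionary only contains solver/penalty pairs that are compatible.
log_reg_param_grid = [
    {**shared, "model__solver": ["liblinear"], "model__penalty": ["l1", "l2"]},
    {**shared, "model__solver": ["lbfgs", "newton-cholesky"], "model__penalty": ["l2"]},
]


def count_candidates(grids) -> int:
    """Count the distinct parameter combinations across a list of grids."""
    total = 0
    for grid in grids:
        size = 1
        for values in grid.values():
            size *= len(values)
        total += size
    return total


n_candidates = count_candidates(log_reg_param_grid)
print(n_candidates)
```

RandomizedSearchCV accepts such a list as its parameter distributions; it first samples one of the dictionaries uniformly and then samples the parameters from it.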

In [19]:
for name, result in models_results.items():
    print(f'Model: {name}')
    print(f'Best Parameters: {result["best_parameters"]}')
    print(f'Score: {result["score"]}')
    print('\n')

Model: hgb
Best Parameters: {'model__verbose': 0, 'model__min_samples_leaf': 4, 'model__max_iter': 200, 'model__max_depth': 15, 'model__learning_rate': 0.01, 'model__l2_regularization': 2.0, 'model__class_weight': None}
Score: 0.8004056795131846


Model: log_reg
Best Parameters: {'model__solver': 'liblinear', 'model__penalty': 'l1', 'model__max_iter': 200, 'model__class_weight': None, 'model__C': 0.1}
Score: 0.7947261663286003




Model

In [22]:
pipe = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('scaler', MinMaxScaler()),
    ('model', HistGradientBoostingClassifier(verbose = 0, min_samples_leaf = 4, max_iter = 200, max_depth = 15,
                                             learning_rate = 0.01, l2_regularization = 2.0, class_weight = None))
])

In [23]:
pipe.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


In [24]:
prediction = pipe.predict(X_test)

In [25]:
from sklearn import metrics
# Print the prediction accuracy
print(metrics.accuracy_score(y_test, prediction))

0.8021769995267393


Saving the model to a .pkl file

In [26]:
import pickle

melhor_modelo = pipe

# The saved model is the pipeline fitted above: HistGradientBoostingClassifier(learning_rate=0.01, max_iter=200, max_depth=15, min_samples_leaf=4, l2_regularization=2.0)
with open('best_estimator.pkl', 'wb') as arquivo:
    pickle.dump(melhor_modelo, arquivo)

Saving the model with joblib

In [27]:
import joblib

pipe_model = pipe

joblib.dump(pipe_model, 'pipe_model.joblib')

['pipe_model.joblib']
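
A saved pipeline can be checked by loading it back and comparing predictions. The sketch below does this round trip with a small toy pipeline in a temporary directory; the toy data, the file name `toy_pipe.joblib`, and the variable names are assumptions, not part of the notebook. For the real pipeline, the functions `replace_nan` and `convert_type_to_float` must also be defined or importable in the session that loads the file, because they are stored by reference.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data: one feature, two classes.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

toy_pipe = Pipeline(steps=[
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression()),
]).fit(X_toy, y_toy)

# Save and reload inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipe.joblib")
    joblib.dump(toy_pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same_predictions = np.array_equal(toy_pipe.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```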