<a href="https://colab.research.google.com/github/itaborai83/ecd221-SI-trabalho/blob/main/Trabalho_ECD221_ENGENHARIA_DE_SOFTWARE_PARA_CI%C3%8ANCIA_DE_DADOS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ECD221 - ENGENHARIA DE SOFTWARE PARA CIÊNCIA DE DADOS

**Team members**

- Rafaela Schifino
- Renan Giordano Sfirri
- Agnello Hupp
- Daniel Lemos Itaboraí


# User Story
**AS A** relationship manager **I WANT TO** predict the cancellation of a service based on a set of characteristics **SO THAT** I can prevent cancellations (churn rate) and improve the company's profitability.

**Acceptance Criteria:**
GIVEN that a set of information and characteristics about a customer exists
WHEN I run the Machine Learning model
THEN I obtain an indication of whether the customer will cancel a service.

**Details:**
##### Objective: Predict whether a given customer will cancel a service, considering their history and a set of characteristics.
##### Organizational objective: Reduce the cancellation rate, improving the company's profitability.
##### Model objective: Predict, with at least 75% accuracy on samples not used in training, whether a given customer will cancel a service.
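
The 75% accuracy threshold stated in the model objective can be expressed as a small check. This is a minimal sketch, assuming labels and predictions are plain 0/1 lists; the names `accuracy` and `meets_acceptance_criterion` and the sample data are hypothetical and not part of the `telchurn` package.

```python
from typing import Sequence

MIN_ACCURACY = 0.75  # threshold taken from the model objective

def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def meets_acceptance_criterion(y_true, y_pred, threshold=MIN_ACCURACY) -> bool:
    """True when accuracy on held-out samples reaches the threshold."""
    return accuracy(y_true, y_pred) >= threshold

# hypothetical held-out labels and predictions (1 = churn, 0 = no churn)
y_test = [1, 0, 0, 1, 0, 0, 1, 0]
y_hat  = [1, 0, 0, 0, 0, 0, 1, 1]
print(accuracy(y_test, y_hat))
print(meets_acceptance_criterion(y_test, y_hat))
```

Accuracy alone can look good on imbalanced churn data, which may be why the tuner in this notebook defaults to the `f1` metric.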


	


## Loading the Files via \%\%writefile

The code is also available at https://github.com/itaborai83/ecd221-SI-trabalho

In [None]:
!mkdir -p ./telchurn
!mkdir -p ./DATA
!mkdir -p ./MODELS

In [None]:
%%writefile ./telchurn/__init__.py
# empty

Writing ./telchurn/__init__.py


In [None]:
%%writefile ./telchurn/util.py
import io
import os
import sys
import warnings
import logging
import datetime as dt

LOGGER_FORMAT = '%(levelname)s - %(filename)s:%(funcName)s:%(lineno)s - %(message)s'
stdout_handler = logging.StreamHandler(stream=sys.stdout)
logging.basicConfig(level=logging.INFO, format=LOGGER_FORMAT, handlers=[stdout_handler])

def get_logger(name):
    return logging.getLogger(name)

def report_df(logger, df):
    buffer = io.StringIO()
    df.info(verbose=True, buf=buffer)
    buffer.seek(0)
    logger.info(buffer.read())

def silence_warnings():
    # to silence warnings of subprocesses
    if not sys.warnoptions:
        warnings.simplefilter("ignore")
        os.environ["PYTHONWARNINGS"] = "ignore::UserWarning,ignore::FutureWarning"

Writing ./telchurn/util.py
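
A short usage sketch of the log format used in `util.py`, assuming an in-memory handler so the formatted line can be inspected. `make_buffer_logger` is a hypothetical helper, not part of `telchurn.util`.

```python
import io
import logging

LOGGER_FORMAT = '%(levelname)s - %(filename)s:%(funcName)s:%(lineno)s - %(message)s'

def make_buffer_logger(name):
    """Create a logger that writes formatted records into a StringIO buffer."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(stream=buffer)
    handler.setFormatter(logging.Formatter(LOGGER_FORMAT))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the record out of the root logger's handlers
    logger.addHandler(handler)
    return logger, buffer

logger, buffer = make_buffer_logger("telchurn_format_demo")
logger.info("hello from the sketch")
print(buffer.getvalue())
```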


In [None]:
%%writefile ./telchurn/data_loader.py
 # -*- coding: utf-8 -*-
import io
import sys
import abc
import pandas as pd
import telchurn.util as util

LOGGER = util.get_logger('data_loader')

class DataLoader(abc.ABC):
    
    DELIMITER = ','
    
    @abc.abstractmethod
    def load(self, file_name_or_url: str) -> pd.DataFrame:
        raise NotImplementedError
        
    @abc.abstractmethod
    def load_cleansed(self, file_name_or_url: str) -> pd.DataFrame:
        raise NotImplementedError
        
        
class DataLoaderImpl(DataLoader):
    FIELD_SEPARATOR             = ","
    IMPORT_COLUMN_NAMES         = [
        "customer_id"
    ,   "gender"
    ,   "senior_citizen"
    ,   "partner"
    ,   "dependents"
    ,   "tenure"
    ,   "phone_service"
    ,   "multiple_lines"
    ,   "internet_service"
    ,   "online_security"
    ,   "online_backup"
    ,   "device_protection"
    ,   "tech_support"
    ,   "streaming_tv"
    ,   "streaming_movies"
    ,   "contract"
    ,   "paperless_billing"
    ,   "payment_method"
    ,   "monthly_charges"
    ,   "total_charges"
    ,   "churn"
    ]
    SKIP_ROWS = 1
    
    def load(self, file_name_or_url: str) -> pd.DataFrame:
        LOGGER.info(f'loading dataframe from {file_name_or_url}')
        churn_df = pd.read_csv(
            file_name_or_url
        ,   names     = self.IMPORT_COLUMN_NAMES
        ,   skiprows  = self.SKIP_ROWS
        ,   delimiter = self.DELIMITER
        )
        util.report_df(LOGGER, churn_df)
        return churn_df
        
    def load_cleansed(self, file_name_or_url: str) -> pd.DataFrame:
        LOGGER.info(f'loading cleansed dataframe from {file_name_or_url}')
        churn_df = pd.read_csv(
            file_name_or_url
        ,   delimiter = self.DELIMITER
        )
        util.report_df(LOGGER, churn_df)
        return churn_df

Writing ./telchurn/data_loader.py


In [None]:
%%writefile ./telchurn/data_splitter.py
 # -*- coding: utf-8 -*-
import abc
import pandas as pd
from typing import Tuple, List
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler, ADASYN, SMOTE
import telchurn.util as util

LOGGER = util.get_logger('data_splitter')

class DataSplitter(abc.ABC):
    
    DEFAULT_TEST_PCT_SIZE   = 0.3 # 30% of the data set
    DEFAULT_RANDOM_STATE    = 42
    
    def __init__(self, seed: int=None, test_split_pct: float=None):
        self.seed = seed if seed is not None else self.DEFAULT_RANDOM_STATE
        self.test_split_pct = test_split_pct if test_split_pct is not None else self.DEFAULT_TEST_PCT_SIZE
        
    @abc.abstractmethod
    def split(self, df: pd.DataFrame, target: str) -> Tuple[Tuple[pd.DataFrame, pd.DataFrame], Tuple[pd.DataFrame, pd.DataFrame]]:
        raise NotImplementedError
            
class DataSplitterImpl(DataSplitter):
    
    def __init__(self, seed: int=None, test_split_pct: float=None):
        super().__init__(seed, test_split_pct)
        
    def _oversample(self, X_train, y_train):
        return X_train.copy(), y_train.copy()

    def split(self, df: pd.DataFrame, target: str) -> Tuple[Tuple[pd.DataFrame, pd.DataFrame], Tuple[pd.DataFrame, pd.DataFrame]]:
        LOGGER.info('splitting data set into train and test sets')
        all_but_target = df.columns.difference([target])
        X_df = df[all_but_target]
        y = df[target]
        X_train, X_test, y_train, y_test = train_test_split(
            X_df.values
        ,   y
        ,   test_size     = self.test_split_pct
        ,   shuffle       = True
        ,   random_state  = self.seed
        ,   stratify      = y # stratified split
        )
        X_train_resampled, y_train_resampled = self._oversample(X_train, y_train)
        X_train_df = pd.DataFrame(X_train_resampled, columns=X_df.columns)
        y_train_df = pd.DataFrame(y_train_resampled, columns=[target])
        X_test_df = pd.DataFrame(X_test, columns=X_df.columns)
        y_test_df = pd.DataFrame(y_test, columns=[target])
        return (X_train_df, y_train_df), (X_test_df, y_test_df)

# the oversampling data splitters below were not used because they did not
# yield a significant performance gain and increased training time

class DataSplitterOverSamplerImpl(DataSplitterImpl):

    def __init__(self, seed: int=None, test_split_pct: float=None):
        super().__init__(seed, test_split_pct)

    def _oversample(self, X_train, y_train):
        ros = RandomOverSampler(random_state=self.seed)
        return ros.fit_resample(X_train, y_train)
        
class DataSplitterSmoteImpl(DataSplitterImpl):
    
    def __init__(self, seed: int=None, test_split_pct: float=None):
        super().__init__(seed, test_split_pct)

    def _oversample(self, X_train, y_train):
        smote = SMOTE(random_state=self.seed)
        return smote.fit_resample(X_train, y_train)
        
class DataSplitterAdasynImpl(DataSplitterImpl):
    
    def __init__(self, seed: int=None, test_split_pct: float=None):
        super().__init__(seed, test_split_pct)

    def _oversample(self, X_train, y_train):
        adasyn = ADASYN(random_state=self.seed)
        return adasyn.fit_resample(X_train, y_train)

Writing ./telchurn/data_splitter.py
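
The oversampling splitters resample only the training portion returned by `train_test_split`. As a rough illustration of what random oversampling does, here is a pure-Python sketch. It is not the imbalanced-learn implementation, and `random_oversample` and the sample data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=42):
    """Duplicate randomly chosen minority rows until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target_count = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, value in enumerate(y) if value == label]
        for _ in range(target_count - count):
            i = rng.choice(indices)  # sample with replacement from this class
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# hypothetical training data: four non-churners and one churner
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X_train, y_train)
print(Counter(y_res))
```

The test set is left untouched, so evaluation still reflects the original class distribution.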


In [None]:
%%writefile ./telchurn/feature_processor.py
 # -*- coding: utf-8 -*-
import io
import abc
import numpy as np
import pandas as pd
import telchurn.util as util

LOGGER = util.get_logger('feature_processor')

class FeatureProcessor(abc.ABC):
    
    @abc.abstractmethod
    def handle_categorical_features(self, churn_df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    @abc.abstractmethod
    def engineer_features(self, churn_df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError
        
class FeatureProcessorImpl(FeatureProcessor):
    
    BOOLEAN_FEATURES = [
        "senior_citizen"
    ,   "partner"
    ,   "dependents"
    ,   "phone_service"
    ,   "paperless_billing"
    ]    
    
    CATEGORICAL_FEATURES = [
        "multiple_lines"
    ,   "internet_service"
    ,   "online_security"
    ,   "online_backup"
    ,   "device_protection"
    ,   "tech_support"
    ,   "streaming_tv"
    ,   "streaming_movies"
    ,   "contract"
    ,   "payment_method"
    ]
    
    NUMERICAL_FEATURES = [
        "tenure"
    ,   "monthly_charges"
    ,   "total_charges"
    ]
    
    TARGET_VARIABLE = "churn"
    
    BOOLEAN_MAP = {"No": 0, "Yes": 1}
    
    ENABLE_FEATURE_ENGINEERING = False

    # noise to be added to avoid overfitting and to make the new variables look numerical
    NOISE_STD = 0.0 # set to zero for lack of time to check whether it helped with overfitting
    
    def __init__(self, seed):
        self.seed = seed
        
    def handle_categorical_features(self, churn_df: pd.DataFrame) -> pd.DataFrame:
        LOGGER.info('handling categorical features')
        # copy the data frame so the original data is not modified
        churn_df = churn_df.copy()
        
        # converting boolean variables to numeric (dummy encoding is not needed)
        for feature in self.BOOLEAN_FEATURES:
          churn_df[feature] = churn_df[feature].map(self.BOOLEAN_MAP)

        # performing dummy encoding using pandas
        dummy_df = pd.get_dummies(
            data        = churn_df[self.CATEGORICAL_FEATURES]
        ,   prefix      = self.CATEGORICAL_FEATURES
        ,   prefix_sep  = "="
        )

        # concatenating the boolean, encoded categorical, numerical and target variables into a new dataset
        churn_df = pd.concat([
            churn_df[ self.BOOLEAN_FEATURES ]
        ,   dummy_df
        ,   churn_df[ self.NUMERICAL_FEATURES ]
        ,   churn_df[ self.TARGET_VARIABLE ]
        ], axis=1)

        # removing variables equivalent to internet_service=No = 1
        del churn_df[ "device_protection=No internet service" ]
        del churn_df[ "streaming_tv=No internet service"      ]
        del churn_df[ "tech_support=No internet service"      ]
        del churn_df[ "online_backup=No internet service"     ]
        del churn_df[ "streaming_movies=No internet service"  ]
        del churn_df[ "online_security=No internet service"   ]

        # removing variables equivalent to phone_service=0
        del churn_df[ "multiple_lines=No phone service" ]

        # removing encoded variables made redundant by the deletions above
        del churn_df[ "multiple_lines=No"    ]
        del churn_df[ "online_security=No"   ]
        del churn_df[ "online_backup=No"     ]
        del churn_df[ "device_protection=No" ]
        del churn_df[ "tech_support=No"      ]
        del churn_df[ "streaming_tv=No"      ]
        del churn_df[ "streaming_movies=No"  ]
        del churn_df[ "internet_service=No"  ]        
        
        new_column_names = {
            'multiple_lines=Yes'                       : 'multiple_lines'
        ,   'internet_service=DSL'                     : 'dsl'
        ,   'internet_service=Fiber optic'             : 'fiber_optic'
        ,   'online_security=Yes'                      : 'online_security'
        ,   'online_backup=Yes'                        : 'online_backup'
        ,   'device_protection=Yes'                    : 'device_protection'
        ,   'tech_support=Yes'                         : 'tech_support'
        ,   'streaming_tv=Yes'                         : 'streaming_tv'
        ,   'streaming_movies=Yes'                     : 'streaming_movies'
        ,   'contract=Month-to-month'                  : 'monthly_contract'
        ,   'contract=One year'                        : 'one_year_contract'
        ,   'contract=Two year'                        : 'two_year_contract'
        ,   'payment_method=Bank transfer (automatic)' : 'bank_transfer'
        ,   'payment_method=Credit card (automatic)'   : 'credit_card'
        ,   'payment_method=Electronic check'          : 'electronic_check'
        ,   'payment_method=Mailed check'              : 'mailed_check'
        }
        churn_df.rename(columns=new_column_names, inplace=True)                
        util.report_df(LOGGER, churn_df)
        return churn_df
    
    def engineer_features(self, churn_df: pd.DataFrame) -> pd.DataFrame:
        LOGGER.info('engineering new features')
        np.random.seed(self.seed)
        churn_df = churn_df.copy()
        
        # the first quartile of the tenure variable, per the earlier univariate analysis.
        # the probability of churn is inversely proportional to the tenure variable
        tenure_1st_quartile = churn_df['tenure'].quantile(0.25)
        
        # the third quartile of the monthly_charges variable, per the earlier univariate analysis.
        # the probability of churn is proportional to the monthly_charges variable
        charges_3rd_quartile = churn_df['monthly_charges'].quantile(0.75)

        # number of rows needed to create the noise
        rows, cols = churn_df.shape
        
        LOGGER.info('creating client factor')
        # client factor
        # Cross-tabulation analysis showed that having a partner and dependents tends to retain the customer.
        # In contrast, senior citizens proportionally tend to cancel their services
        # more often.
        # The expectation is that the higher the client_factor, the more likely the customer is to cancel the contract
        noise_term = np.random.normal(loc=0.0, scale=self.NOISE_STD, size=rows)
        churn_df["client_factor"] = ((
            np.exp(churn_df["senior_citizen"]) # senior_citizen=1 increases churn
        +   np.exp(np.abs(1-churn_df["partner"]))
        +   np.exp(np.abs(1-churn_df["dependents"]))
        +   np.exp((churn_df["tenure"] < tenure_1st_quartile).astype(float))
        +   np.exp((churn_df["monthly_charges"] > charges_3rd_quartile).astype(float))
        ) / 3.0 + noise_term) * (1.0 if self.ENABLE_FEATURE_ENGINEERING else 0.0)
        
        LOGGER.info('creating internet factor')
        # internet factor
        # Cross-tabulation analysis showed that subscribing to the
        # technical support and online security services tends to indicate that a user
        # is loyal. Subscribing to fiber optic internet, by raising the monthly
        # charge, contributes to customer churn. On the other hand,
        # customers with DSL internet tend to stay because of the comparatively
        # lower amount charged.
        # The expectation is that the higher the internet_factor, the more likely the customer is to cancel the contract
        noise_term = np.random.normal(loc=0.0, scale=self.NOISE_STD, size=rows)
        churn_df["internet_factor"] = ((
            np.exp(np.abs(1-churn_df["tech_support"]))
        +   np.exp(np.abs(1-churn_df["online_security"]))
        +   np.exp(churn_df["fiber_optic"]) # fiber_optic=1 increases churn
        -   np.exp(churn_df["dsl"]) # dsl=1 decreases churn
        +   np.exp((churn_df["tenure"] < tenure_1st_quartile).astype(float))
        +   np.exp((churn_df["monthly_charges"] > charges_3rd_quartile).astype(float))
        ) / 3.0 + noise_term) * (1.0 if self.ENABLE_FEATURE_ENGINEERING else 0.0)
        
        LOGGER.info('creating financial factor')
        # financial factor
        # Cross-tabulation analysis showed that paperless billing,
        # month-to-month contracts and payment by electronic check are factors that
        # contribute to customer churn
        # The expectation is that the higher the financial_factor, the more likely the customer is to cancel the contract
        noise_term = np.random.normal(loc=0.0, scale=self.NOISE_STD, size=rows)
        churn_df["financial_factor"] = ((
            np.exp(churn_df["monthly_contract"]) 
        +   np.exp(churn_df["electronic_check"])
        +   np.exp(churn_df["paperless_billing"])
        +   np.exp((churn_df["tenure"] < tenure_1st_quartile).astype(float))
        +   np.exp((churn_df["monthly_charges"] > charges_3rd_quartile).astype(float))
        ) / 3.0 + noise_term) * (1.0 if self.ENABLE_FEATURE_ENGINEERING else 0.0)
        
        LOGGER.info('combining factors into one')
        # finally, we create a factor combining all those used previously
        noise_term = np.random.normal(loc=0.0, scale=self.NOISE_STD, size=rows)
        churn_df["multi_factor"] = ((
            np.exp(churn_df["senior_citizen"]) # senior_citizen=1 increases churn
        +   np.exp(np.abs(1-churn_df["partner"]))
        +   np.exp(np.abs(1-churn_df["dependents"]))
        +   np.exp(np.abs(1-churn_df["tech_support"])) 
        +   np.exp(np.abs(1-churn_df["online_security"])) 
        -   np.exp(churn_df["dsl"]) 
        +   np.exp(churn_df["fiber_optic"])
        +   np.exp(churn_df["monthly_contract"]) 
        +   np.exp(churn_df["electronic_check"])
        +   np.exp(churn_df["paperless_billing"])
        +   np.exp((churn_df["tenure"] < tenure_1st_quartile).astype(float))
        +   np.exp((churn_df["monthly_charges"] > charges_3rd_quartile).astype(float))
        ) / 9.0 + noise_term) * (1.0 if self.ENABLE_FEATURE_ENGINEERING else 0.0)

        # moves the target variable to the end
        churn_df["churn"] = churn_df.pop("churn")
        util.report_df(LOGGER, churn_df)
        return churn_df

Writing ./telchurn/feature_processor.py
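
As a worked example of the `client_factor` formula in `engineer_features`, the sketch below evaluates it for single customers instead of whole columns, with the noise term fixed at 0. The function name and its scalar arguments are hypothetical; `short_tenure` stands for `tenure` below its first quartile and `high_charges` for `monthly_charges` above its third quartile.

```python
import math

def client_factor(senior_citizen, partner, dependents, short_tenure, high_charges, enabled=True):
    """Scalar version of the client_factor formula, without the noise term."""
    total = (
        math.exp(senior_citizen)
        + math.exp(abs(1 - partner))
        + math.exp(abs(1 - dependents))
        + math.exp(float(short_tenure))
        + math.exp(float(high_charges))
    )
    # the source divides the five terms by 3.0 and multiplies by the enable flag
    return (total / 3.0) * (1.0 if enabled else 0.0)

lowest   = client_factor(0, 1, 1, False, False)  # every term is exp(0) = 1
highest  = client_factor(1, 0, 0, True, True)    # every term is exp(1) = e
disabled = client_factor(1, 0, 0, True, True, enabled=False)
print(lowest, highest, disabled)
```

With `ENABLE_FEATURE_ENGINEERING = False` and `NOISE_STD = 0.0`, as set in the class, all four factor columns evaluate to 0 for every row.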


In [None]:
%%writefile ./telchurn/feature_ranker.py
# -*- coding: utf-8 -*-
import abc
import argparse
from typing import Tuple, List
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import telchurn.util as util


from telchurn.data_loader import DataLoader

LOGGER = util.get_logger('feature_ranker')

class FeatureRanker(abc.ABC):
    
    @abc.abstractmethod
    def rank_features(self, df: pd.DataFrame, target_variable: str) -> List[Tuple[str, float]]:
        raise NotImplementedError
    
class FeatureRankerImpl(FeatureRanker):
    
    N_ESTIMATORS = 150
    
    def __init__(self, random_seed=None):
        self.random_seed = random_seed

    def rank_features(self, df: pd.DataFrame, target_variable: str) -> List[Tuple[str, float]]:
        LOGGER.info('ranking features')
        all_but_target = df.columns.difference([target_variable])
        X_df = df[all_but_target]
        y = df[target_variable]

        classifier = RandomForestClassifier(
            n_estimators  = self.N_ESTIMATORS
        ,   bootstrap     = True
        ,   class_weight  = "balanced_subsample"
        ,   random_state  = self.random_seed
        )
        LOGGER.info('fitting random forest classifier')
        classifier.fit(X_df, y)
        
        importances_df = pd.DataFrame({
            "feature"       : X_df.columns
        ,   "importance"    : classifier.feature_importances_
        })
        importances_df.sort_values("importance", ascending=False, inplace=True)
        result = list([
            (row.feature, row.importance) for row in importances_df.itertuples()
        ])
        LOGGER.info('Ranked Features')
        for feature, importance in result:
            LOGGER.info(f'\t-> {feature}: {importance}')
        return result


Writing ./telchurn/feature_ranker.py


In [None]:
%%writefile telchurn/feature_selector.py
  # -*- coding: utf-8 -*-
import abc
import argparse
import pandas as pd
import telchurn.util as util
from telchurn.feature_ranker import FeatureRanker
from telchurn.data_loader import DataLoader

LOGGER = util.get_logger('feature_selector')

class FeatureSelector(abc.ABC):
       
    @abc.abstractmethod
    def select_features(self, top_k: int, target_variable: str, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError
        
class FeatureSelectorImpl(FeatureSelector):
        
    def __init__(self, feature_ranker: FeatureRanker):
        self.feature_ranker = feature_ranker
        
    def select_features(self, top_k: int, target_variable: str, df: pd.DataFrame) -> pd.DataFrame:
        LOGGER.info(f'selecting top {top_k} features')
        df = df.copy()
        rankings = self.feature_ranker.rank_features(df, target_variable)
        feature_names = []
        for i in range(top_k):
            feature_name, importance = rankings[i]
            feature_names.append(feature_name)
        if target_variable not in feature_names:
            feature_names.append(target_variable)
        df = df[ feature_names ].copy()
        return df
        

Writing telchurn/feature_selector.py
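
The loop in `select_features` indexes `rankings[i]` for each `i` in `range(top_k)`, so it would raise `IndexError` if `top_k` exceeded the number of ranked features. A minimal sketch of the same selection using slicing, which tolerates that case, could look like this; `top_k_feature_names` and the importance values are hypothetical.

```python
from typing import List, Tuple

def top_k_feature_names(rankings: List[Tuple[str, float]], top_k: int, target: str) -> List[str]:
    """Return the names of the top_k ranked features, followed by the target column."""
    names = [name for name, _importance in rankings[:top_k]]  # slicing never overruns the list
    if target not in names:
        names.append(target)
    return names

# hypothetical ranking, already sorted by descending importance
rankings = [
    ("tenure", 0.21),
    ("total_charges", 0.19),
    ("monthly_charges", 0.17),
    ("fiber_optic", 0.05),
]
print(top_k_feature_names(rankings, 2, "churn"))
print(top_k_feature_names(rankings, 10, "churn"))
```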


In [None]:
%%writefile ./telchurn/data_cleaner.py
 # -*- coding: utf-8 -*-
import io
import sys
import abc
import pandas as pd
import telchurn.util as util
from telchurn.data_loader import DataLoader
from telchurn.feature_processor import FeatureProcessor
from telchurn.feature_selector import FeatureSelector


LOGGER = util.get_logger('data_cleaner')

class DataCleaner(abc.ABC):
    
    DEFAULT_TOP_K_FEATURES  = 16
    DEFAULT_SEED = 42
    
    @abc.abstractmethod
    def clean(self, input_file_name_or_url: str, output_file_name_or_url: str, top_k_features: int=None, fields:str=None) -> None:
        raise NotImplementedError
        
class DataCleanerImpl(DataCleaner):
    
    TARGET_VARIABLE = "churn"
    
    def __init__(self, data_loader: DataLoader, feature_processor: FeatureProcessor, feature_selector: FeatureSelector):
        self.data_loader = data_loader
        self.feature_processor = feature_processor
        self.feature_selector = feature_selector
        
    def clean(self, input_file_name_or_url: str, output_file_name_or_url: str, top_k_features: int=None, fields:str=None) -> None:
        if top_k_features is None:
            top_k_features = self.DEFAULT_TOP_K_FEATURES        
        LOGGER.info(f'starting data cleaner')
        churn_df = self.data_loader.load(input_file_name_or_url)
        
        # converts the target variable into a numeric variable
        churn_df["churn"] = churn_df["churn"].map({"No": 0, "Yes": 1})

        # dropping the customer_id variable
        del churn_df["customer_id"]
        
        # the total_charges column has empty records with the value ' '
        def convert_total_charges(value):
            return 0.0 if value == ' ' else value
        churn_df["total_charges"] = churn_df["total_charges"].map(convert_total_charges).astype(float)
        
        # unlike the other columns, senior_citizen holds 1 or 0 instead of "Yes" or "No"
        churn_df["senior_citizen"] = churn_df["senior_citizen"].map({1: "Yes", 0: "No"})
        
        # removing the gender feature, which proved irrelevant during the exploratory analysis
        del churn_df["gender"]
        
        churn_df = self.feature_processor.handle_categorical_features(churn_df)
        churn_df = self.feature_processor.engineer_features(churn_df)
        if fields is None:
            churn_df = self.feature_selector.select_features(top_k_features, self.TARGET_VARIABLE, churn_df)
        else:
            column_names = fields.split(',')
            churn_df = churn_df[column_names]
        self.save_cleansed(output_file_name_or_url, churn_df)
    
    def save_cleansed(self, file_name_or_url: str, churn_df: pd.DataFrame) -> None:
        LOGGER.info(f'saving cleansed dataframe to {file_name_or_url}')
        churn_df.to_csv(file_name_or_url, sep=DataLoader.DELIMITER, index=False)


Writing ./telchurn/data_cleaner.py


In [None]:
%%writefile ./telchurn/pipeline_factory.py
 # -*- coding: utf-8 -*-
import io
import sys
import abc
import pandas as pd
import telchurn.util as util

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

class PipelineFactory(abc.ABC):

    @abc.abstractmethod
    def build_pipeline_for(self, churn_df: pd.DataFrame) -> Pipeline:
        raise NotImplementedError

class PipelineFactoryImpl(PipelineFactory):
    
    NUMERICAL_FEATURES = [
        'tenure'
    ,   'monthly_charges'
    ,   'total_charges'
    ,   'client_factor'
    ,   'internet_factor'
    ,   'financial_factor'
    ,   'multi_factor'    
    ]
    
    def select_numerical_features(self, churn_df):
        # selects the numerical variables that exist in the data frame
        return list([nf for nf in self.NUMERICAL_FEATURES if nf in churn_df.columns])
        
    def build_pipeline_for(self, churn_df: pd.DataFrame) -> Pipeline:
        numerical_features = self.select_numerical_features(churn_df)
        # Pipeline configuration

        # The numeric transformers are used to process all non-categorical variables.
        numeric_transformer = Pipeline([
          ("scaler", StandardScaler())    
        ])

        column_transformer = ColumnTransformer(
          transformers = [
            ("num", numeric_transformer, numerical_features)
          ],
          # it is important to use passthrough when not all attributes are processed
          remainder="passthrough" 
        )

        # This pipeline will be fitted several times during the hyperparameter optimization process.
        pipeline = Pipeline([
            # the first stage is the pre-processing of the numerical variables
            ("feature_scaling", column_transformer),
            # dimensionality reduction
            ("reduce_dim", PCA()),
            # The classification algorithm and its parameters will be configured via the hyperparameter search
            ("classifier", SVC())
        ])
        return pipeline


Writing ./telchurn/pipeline_factory.py


In [None]:
%%writefile ./telchurn/hyper_param_tunner.py
 # -*- coding: utf-8 -*-
import io
import os
import sys
import abc
import pandas as pd
import warnings
import telchurn.util as util
from typing import List, Dict, Tuple
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline

LOGGER = util.get_logger('hp_tunner')

class HyperParamTunner(abc.ABC):
    
    DEFAULT_K_FOLDS = 10
    DEFAULT_METRIC  = "f1"
    
    def __init__(self, k_folds, random_seed=None):
        self.kfold = StratifiedKFold(
            n_splits      = k_folds
        ,   shuffle       = True
        ,   random_state  = random_seed
        )     
        
    @abc.abstractmethod
    def tune(self, pipeline: Pipeline, param_grid: List[Dict], grid_name: str, num_iterations: int, scoring_metric: str, X_train_df: pd.DataFrame, y_train_df: pd.DataFrame) -> Tuple[RandomizedSearchCV, pd.DataFrame]:
        raise NotImplementedError
        
class HyperParamTunnerImpl(HyperParamTunner):
    
    def __init__(self, k_folds, random_seed=None):
        self.kfold = StratifiedKFold(
            n_splits      = k_folds
        ,   shuffle       = True
        ,   random_state  = random_seed
        )     
                
    def tune(self, pipeline: Pipeline, param_grid: List[Dict], grid_name: str, num_iterations: int, scoring_metric: str, X_train_df: pd.DataFrame, y_train_df: pd.DataFrame) -> Tuple[RandomizedSearchCV, pd.DataFrame]:
        LOGGER.info(f'tuning pipeline {grid_name} using scoring metric {scoring_metric}')
        grid = RandomizedSearchCV(
            estimator           = pipeline
        ,   param_distributions = param_grid
        ,   scoring             = scoring_metric
        ,   cv                  = self.kfold
        ,   n_iter              = num_iterations
        ,   return_train_score  = False
        ,   n_jobs              = -1
        )
        warnings.filterwarnings("ignore")
        util.silence_warnings()
        grid.fit(X_train_df, y_train_df)
        LOGGER.info(f'Best {scoring_metric}: {grid.best_score_}')
        LOGGER.info(f'Best estimator: -> \n{grid.best_estimator_}')
        results_df = pd.DataFrame(grid.cv_results_)
        return grid, results_df


Writing ./telchurn/hyper_param_tunner.py
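
To illustrate the kind of search space `tune` passes to `RandomizedSearchCV`, the sketch below draws random parameter combinations in pure Python. The keys follow scikit-learn's `<step>__<parameter>` convention with the step names from `pipeline_factory.py`; the values, the `sample_candidates` helper and the sampling itself are assumptions made for illustration. Unlike scikit-learn, this sketch may draw the same combination twice.

```python
import random

# hypothetical search space for the pipeline steps "reduce_dim" and "classifier"
param_distributions = {
    "reduce_dim__n_components": [4, 8, 12],
    "classifier__C": [0.1, 1.0, 10.0],
    "classifier__kernel": ["linear", "rbf"],
}

def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter random parameter combinations, one value per key."""
    rng = random.Random(seed)
    return [
        {key: rng.choice(values) for key, values in space.items()}
        for _ in range(n_iter)
    ]

candidates = sample_candidates(param_distributions, n_iter=5)
for candidate in candidates:
    print(candidate)
```

Each candidate would then be scored with cross-validation, and the best-scoring one kept.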


In [None]:
%%writefile ./telchurn/model_repository.py
# -*- coding: utf-8 -*-
import abc
import os.path
import glob
import pickle
from typing import Tuple, List, Dict
import pandas as pd
import telchurn.util as util
# workaround for a library bug that imports a dependency indirectly
# https://stackoverflow.com/questions/61867945/python-import-error-cannot-import-name-six-from-sklearn-externals
import sys
import six
sys.modules['sklearn.externals.six'] = six
from mlxtend.classifier import EnsembleVoteClassifier

LOGGER = util.get_logger('model_repository')

class ModelRepository(abc.ABC):
    
    @abc.abstractmethod
    def save_grid(self, grid: Dict, file_name: str) -> None:
        raise NotImplementedError

    @abc.abstractmethod
    def load_grid(self, file_name: str) -> Dict:
        raise NotImplementedError

    @abc.abstractmethod
    def save_final_model(self, model: EnsembleVoteClassifier, file_name: str) -> None:
        raise NotImplementedError

    @abc.abstractmethod
    def load_final_model(self, file_name: str) -> EnsembleVoteClassifier:
        raise NotImplementedError
        
class ModelRepositoryImpl(ModelRepository):
    
    GRID_PREFIX = "grid_"
    GRID_GLOB = "grid_*.pkl"
    
    def __init__(self, repo_dir: str):
        self.repo_dir = repo_dir
    
    def list_grids(self):
        LOGGER.info(f'listing saved grids on {self.repo_dir}')
        path = os.path.join(self.repo_dir, self.GRID_GLOB)
        def get_grid_name(path):
            # f = lambda path: os.path.split(path)[-1].replace(self.GRID_PREFIX, "") # what a nasty hack!
            path_parts = os.path.split(path)
            filename = path_parts[-1]
            filename = filename.replace(self.GRID_PREFIX, "")
            return filename
        return list([ get_grid_name(path) for path in glob.glob(path) ])
        
    def save_grid(self, grid: Dict, file_name: str) -> None:
        path = os.path.join(self.repo_dir, self.GRID_PREFIX + file_name)
        LOGGER.info(f'saving grid search results to {path}')
        data = {
            "best_score_"     : grid.best_score_
        ,   "best_params_"    : grid.best_params_
        ,   "best_estimator_" : grid.best_estimator_
        ,   "cv_results_"     : grid.cv_results_
        ,   "grid"            : grid
        }  
        with open(path, "wb") as fh:
            pickle.dump(data, fh)
        
    def load_grid(self, file_name: str) -> Dict:
        path = os.path.join(self.repo_dir, self.GRID_PREFIX + file_name)
        LOGGER.info(f'loading grid search results from {path}')
        with open(path, "rb") as fh:
            data                  = pickle.load(fh)
        grid                  = data["grid"]
        grid.best_score_      = data["best_score_"]
        grid.best_params_     = data["best_params_"]
        grid.best_estimator_  = data["best_estimator_"]
        grid.cv_results_      = data["cv_results_"]
        return grid
    
    def save_final_model(self, model: EnsembleVoteClassifier, file_name: str) -> None:
        path = os.path.join(self.repo_dir, file_name)
        LOGGER.info(f'saving final model to {path}')
        with open(path, "wb") as fh:
            pickle.dump(model, fh)

    def load_final_model(self, file_name: str) -> EnsembleVoteClassifier:
        path = os.path.join(self.repo_dir, file_name)
        LOGGER.info(f'loading final model from {path}')
        with open(path, "rb") as fh:
            return pickle.load(fh)

Writing ./telchurn/model_repository.py
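
The repository above combines prefix-based file naming with `pickle`. The round trip can be sketched as follows, assuming a hypothetical prefix `grid_` and a temporary directory; the real `GRID_PREFIX` and `GRID_GLOB` values are defined earlier in the class and are not reproduced here, and the scores are made-up placeholders.

```python
import glob
import os
import pickle
import tempfile

GRID_PREFIX = "grid_"          # hypothetical value, for illustration only
GRID_GLOB = GRID_PREFIX + "*"  # hypothetical value, for illustration only

def save_grid(repo_dir, data, file_name):
    # Store the object under <repo_dir>/<prefix><file_name>
    path = os.path.join(repo_dir, GRID_PREFIX + file_name)
    with open(path, "wb") as fh:
        pickle.dump(data, fh)

def list_grids(repo_dir):
    # Return the saved names with the prefix stripped, in sorted order
    pattern = os.path.join(repo_dir, GRID_GLOB)
    return sorted(
        os.path.basename(p).replace(GRID_PREFIX, "", 1)
        for p in glob.glob(pattern)
    )

def load_grid(repo_dir, file_name):
    path = os.path.join(repo_dir, GRID_PREFIX + file_name)
    with open(path, "rb") as fh:
        return pickle.load(fh)

repo_dir = tempfile.mkdtemp()
save_grid(repo_dir, {"best_score_": 0.76}, "KNN.pkl")  # placeholder score
save_grid(repo_dir, {"best_score_": 0.74}, "NB.pkl")   # placeholder score
names = list_grids(repo_dir)
restored = load_grid(repo_dir, "KNN.pkl")
print(names, restored)
```

Since `pickle.load` can execute arbitrary code, files in the models directory should only come from a trusted training run.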


In [None]:
%%writefile ./telchurn/param_grids.py
# -*- coding: utf-8 -*-
import abc
import pandas as pd
import telchurn.util as util
from typing import List, Dict

from sklearn.preprocessing import StandardScaler, MinMaxScaler, PolynomialFeatures
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, plot_precision_recall_curve, plot_roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, auc
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer # column transformer, used to preprocess the variables
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

LOGGER = util.get_logger('param_grids')

QUICK_RUN   = False
USE_LOGREG  = True
USE_KNN     = True
USE_NB      = True
USE_DT      = True
USE_SVM     = False
USE_ADA     = True
USE_GB      = True
USE_RF      = True



class ParamGrids(abc.ABC):
    
    @abc.abstractmethod
    def get_parameter_grids(self) -> List[Dict]:
        raise NotImplementedError
    
class ParamGridsImpl(ParamGrids):
    
    def get_parameter_grids(self) -> List:
        return []                   \
        +   (self.get_logreg_grid() if USE_LOGREG else []) \
        +   (self.get_knn_grid()    if USE_KNN else []) \
        +   (self.get_nb_grid()     if USE_NB  else []) \
        +   (self.get_dt_grid()     if USE_DT  else []) \
        +   (self.get_svm_grid()    if USE_SVM else []) \
        +   (self.get_ada_grid()    if USE_ADA else []) \
        +   (self.get_gb_grid()     if USE_GB  else []) \
        +   (self.get_rf_grid()     if USE_RF  else [])
        
    def get_logreg_grid(self):
        LOGGER.info('creating parameter grids for logistic Regression - LOGREG')
        param_grid = [{
            # Logistic Regression
            "feature_scaling__num__scaler"  : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                    : [LogisticRegression()],
            "classifier__n_jobs"            : [-1], # all cpus available
            "classifier__penalty"           : ["elasticnet"],
            "classifier__class_weight"      : ["balanced"],
            "classifier__solver"            : ["saga"],
            "classifier__l1_ratio"          : [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
        },
        {
            "feature_scaling__num__scaler"  : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                    : [LogisticRegression()],
            "classifier__n_jobs"            : [-1], # all cpus available
            "classifier__penalty"           : ["none"],
            "classifier__class_weight"      : ["balanced"],
        }]
        return [{
            "name"          : "LOGREG"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
    
    def get_knn_grid(self):
        LOGGER.info('creating parameter grids for K Nearest Neighbors - KNN')
        param_grid = [{
            # KNeighborsClassifier
            "feature_scaling__num__scaler"  : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                    : [KNeighborsClassifier()],
            "classifier__n_jobs"            : [-1], # all cpus available
            "classifier__algorithm"         : ["kd_tree"],
            "classifier__metric"            : ["minkowski"],
            "classifier__p"                 : [0.5, 1.0, 1.5, 2.0],
            "classifier__n_neighbors"       : [5, 7, 10, 13, 15, 17, 20],
            "classifier__weights"           : ["uniform", "distance"]
            }]
        return [{
            "name"          : "KNN"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
    
    def get_nb_grid(self):
        LOGGER.info('creating parameter grids for Naive Bayes - NB')
        param_grid = [{
            # GaussianNB
            "feature_scaling__num__scaler"  : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                    : [GaussianNB()],
            "classifier__var_smoothing"     : [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        }]    
        return [{
            "name"          : "NB"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
   
    def get_dt_grid(self):
        LOGGER.info('creating parameter grids for Decision Tree - DT')
        param_grid = [{
            # DecisionTreeClassifier
            "feature_scaling__num__scaler"  : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                    : [DecisionTreeClassifier()],
            "classifier__class_weight"      : [None, "balanced"],
            "classifier__criterion"         : ["gini", "entropy"],
            "classifier__splitter"          : ["best", "random"],
            "classifier__max_features"      : [None, "auto", "sqrt", "log2"],
            "classifier__max_depth"         : [10, 25, 50]
        }]
        return [{
            "name"          : "DT"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]

    def get_svm_grid(self):
        LOGGER.info('creating parameter grids for Support Vector Machine classifier - SVM')
        param_grid = [{
            # SVC
            "feature_scaling__num__scaler"  : [MinMaxScaler(), StandardScaler()], # SVC needs scaled features to perform well
            #"reduce_dim"                    : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "reduce_dim"                    : [PCA(n_components=3), PCA(n_components=5), PCA(n_components=7)],
            "classifier"                    : [SVC(probability=True)],
            "classifier__kernel"            : ["linear","rbf"],
            "classifier__gamma"             : ["scale", "auto"],
            "classifier__class_weight"      : [None, "balanced"],
            "classifier__tol"               : [1e-2],
            "classifier__C"                 : [1, 0.5, 0.1],
            #"classifier__max_iter"          : [1000],

        }]
        return [{
            "name"          : "SVM"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
    
    def get_ada_grid(self):
        LOGGER.info('creating parameter grids for AdaBoost classifier - ADA')
        param_grid = [{
        # AdaBoostClassifier
            "feature_scaling__num__scaler"    : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                      : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                      : [AdaBoostClassifier()],
            "classifier__n_estimators"        : [100, 250, 500],
            "classifier__learning_rate"       : [0.001, 0.01, 0.1, 1.0]
        }]
        return [{
            "name"          : "ADA"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]

    def get_gb_grid(self):
        LOGGER.info('creating parameter grids for Gradient Boosting classifier - GB')
        param_grid = [{
            # GradientBoostingClassifier        
            "feature_scaling__num__scaler"    : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                      : ["passthrough", PCA(n_components=3), PCA(n_components=5)],
            "classifier"                      : [GradientBoostingClassifier()],
            "classifier__loss"                : ["log_loss", "deviance", "exponential"],
            "classifier__n_estimators"        : [50, 75, 100, 150],
            "classifier__learning_rate"       : [0.1, 0.3, 0.5, 0.7, 1.0],
            "classifier__max_depth"           : [3, 5, 10],
            "classifier__max_features"        : [None, "sqrt", "log2"]
        }]
        return [{
            "name"          : "GB"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
        
    def get_rf_grid(self):
        LOGGER.info('creating parameter grids for Random Forest classifier - RF')
        param_grid = [{
            # RandomForestClassifier
            "feature_scaling__num__scaler"    : ["passthrough", MinMaxScaler(), StandardScaler()],
            "reduce_dim"                      : ["passthrough"], #PCA(n_components=5), PCA(n_components=10)],
            "classifier"                      : [RandomForestClassifier()],
            "classifier__n_estimators"        : [50, 100, 150, 200],
            "classifier__criterion"           : ["gini", "entropy"],
            "classifier__bootstrap"           : [True, False],
            "classifier__n_jobs"              : [-1],
            "classifier__class_weight"        : ["balanced", "balanced_subsample"],
            "classifier__max_depth"           : [5, 10, 25]
        }]
        return [{
            "name"          : "RF"
        ,   "iterations"    : (200 if not QUICK_RUN else 20)
        ,   "param_grid"    : param_grid
        }]
        

Writing ./telchurn/param_grids.py
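
Each grid above is paired with an `iterations` budget, so it helps to know how many distinct candidates a grid contains. The sketch below counts them for the KNN and NB grids by multiplying the number of values per hyperparameter. It assumes, based on the names used in `trainer.py`, that `iterations` ends up as the number of sampled candidates in a randomized search. The dictionaries are hand-copied counts, not part of the project code.

```python
from math import prod

# Number of values listed for each hyperparameter in the KNN grid above
knn_grid_sizes = {
    "feature_scaling__num__scaler": 3,
    "reduce_dim": 3,
    "classifier": 1,
    "classifier__n_jobs": 1,
    "classifier__algorithm": 1,
    "classifier__metric": 1,
    "classifier__p": 4,
    "classifier__n_neighbors": 7,
    "classifier__weights": 2,
}

# Number of values listed for each hyperparameter in the NB grid above
nb_grid_sizes = {
    "feature_scaling__num__scaler": 3,
    "reduce_dim": 3,
    "classifier": 1,
    "classifier__var_smoothing": 9,
}

def grid_size(sizes):
    # A grid of independent lists has as many candidates as the product of list lengths
    return prod(sizes.values())

iterations = 200  # budget used when QUICK_RUN is False
knn_total = grid_size(knn_grid_sizes)
nb_total = grid_size(nb_grid_sizes)
knn_coverage = min(1.0, iterations / knn_total)
nb_coverage = min(1.0, iterations / nb_total)
print(knn_total, round(knn_coverage, 3), nb_total, nb_coverage)
```

Under that assumption, a budget of 200 samples covers about 40% of the 504 KNN candidates, while the NB grid has only 81 candidates, so the same budget is more than enough to try all of them.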


In [None]:
%%writefile ./telchurn/trainer.py
# -*- coding: utf-8 -*-
import abc
import pandas as pd
from typing import Tuple, List
from sklearn.model_selection import train_test_split

import telchurn.util as util
from telchurn.data_loader import DataLoader
from telchurn.pipeline_factory import PipelineFactory
from telchurn.hyper_param_tunner import HyperParamTunner
from telchurn.param_grids import ParamGridsImpl
from telchurn.model_repository import ModelRepository
from telchurn.data_splitter import DataSplitter

LOGGER = util.get_logger('trainer')

class Trainer(abc.ABC):
    
    #DEFAULT_TEST_PCT_SIZE   = 0.3 # 30% of the data set
    #DEFAULT_RANDOM_STATE    = 42
                
    @abc.abstractmethod
    def train(self, input_file: str, splitter: DataSplitter) -> None:
        raise NotImplementedError
        
class TrainerImpl(Trainer):
    
    #SCORING_METHOD = "recall"
    SCORING_METHOD = "balanced_accuracy"
    
    def __init__(self, data_loader: DataLoader, pipeline_factory: PipelineFactory, hp_tunner: HyperParamTunner, repo: ModelRepository):
        self.data_loader = data_loader
        self.pipeline_factory = pipeline_factory
        self.hp_tunner = hp_tunner
        self.repo = repo
    
    def get_param_grids(self):
        return ParamGridsImpl().get_parameter_grids()

    def train(self, input_file: str, splitter: DataSplitter) -> None:
        LOGGER.info('starting telco churn model training')
        churn_df = self.data_loader.load_cleansed(input_file)
        target = churn_df.columns[-1]
        util.report_df(LOGGER, churn_df)
        pipeline = self.pipeline_factory.build_pipeline_for(churn_df)
        (X_train_df, y_train_df), (X_test_df, y_test_df) = splitter.split(churn_df, target)
        param_grids = self.get_param_grids()
        for param_grid in param_grids:
            name        = param_grid["name"]
            iterations  = param_grid["iterations"]
            grid        = param_grid["param_grid"]
            rand_search_cv, train_df = self.hp_tunner.tune(
                pipeline        = pipeline
            ,   param_grid      = grid
            ,   grid_name       = name
            ,   num_iterations  = iterations
            ,   scoring_metric  = self.SCORING_METHOD
            ,   X_train_df      = X_train_df
            ,   y_train_df      = y_train_df
            )
            grid_name = name + ".pkl"
            self.repo.save_grid(rand_search_cv, grid_name)

Writing ./telchurn/trainer.py


In [None]:
%%writefile ./telchurn/model_evaluator.py
# -*- coding: utf-8 -*-
import abc
import os.path
import pickle
import math
from itertools import combinations
from typing import Tuple, List, Dict
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score, confusion_matrix
from mlxtend.classifier import EnsembleVoteClassifier

import telchurn.util as util

LOGGER = util.get_logger('model_evaluator')

class ModelEvaluator(abc.ABC):
    
    @abc.abstractmethod
    def report_results(self, estimator, X, y):
        raise NotImplementedError
    
class ModelEvaluatorImpl(ModelEvaluator):
    
    def report_results(self, estimator: EnsembleVoteClassifier, X_df: pd.DataFrame, y_df: pd.DataFrame) -> None:
        y_hat           = estimator.predict(X_df)
        # Confusion matrix whose i-th row and j-th column entry indicates the number of 
        # samples with true label being i-th class and predicted label being j-th class.
        accuracy        = accuracy_score(y_df, y_hat)
        precision       = precision_score(y_df, y_hat)
        recall          = recall_score(y_df, y_hat)
        balanced_acc    = balanced_accuracy_score(y_df, y_hat)
        f1              = f1_score(y_df, y_hat)
        conf_matrix     = confusion_matrix(y_df, y_hat)
        true_negative   = conf_matrix[0][0]
        false_positive  = conf_matrix[0][1]
        false_negative  = conf_matrix[1][0]
        true_positive   = conf_matrix[1][1]
        
        LOGGER.info(f"accuracy score        : {accuracy}")
        LOGGER.info(f"precision score       : {precision}")
        LOGGER.info(f"recall score          : {recall}")
        LOGGER.info(f"balanced acc. score   : {balanced_acc}")
        LOGGER.info(f"f1 score              : {f1}")
        LOGGER.info(f"confusion matrix") 
        LOGGER.info(f"\tTrue  Negative : {true_negative}") 
        LOGGER.info(f"\tFalse Positive : {false_positive}") 
        LOGGER.info(f"\tFalse Negative : {false_negative}") 
        LOGGER.info(f"\tTrue  Positive : {true_positive}") 

Writing ./telchurn/model_evaluator.py
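
The scores logged by `report_results` can all be derived from the four confusion-matrix cells. The sketch below recomputes them by hand for a made-up, imbalanced example, which shows why the project scores on balanced accuracy rather than plain accuracy. The function name and the counts are illustrative and not part of the project. There is no guard against division by zero.

```python
def metrics_from_counts(tn, fp, fn, tp):
    # Standard binary-classification metrics, positive class = 1 (churn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)         # true positive rate
    specificity = tn / (tn + fp)    # true negative rate
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }

# Made-up counts: 100 non-churners and 50 churners
m = metrics_from_counts(tn=90, fp=10, fn=30, tp=20)
for name, value in m.items():
    print(f"{name:18s}: {value:.3f}")
```

Here accuracy is about 0.733 even though only 40% of the churners are found. Balanced accuracy (0.65) averages the per-class recalls and therefore penalizes the missed churners more visibly.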


In [None]:
%%writefile ./telchurn/ensembler.py
# -*- coding: utf-8 -*-
import abc
import os.path
import pickle
import math
from itertools import combinations
from typing import Tuple, List, Dict
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, balanced_accuracy_score, f1_score, confusion_matrix
from mlxtend.classifier import EnsembleVoteClassifier
from telchurn.data_splitter import DataSplitter
from telchurn.model_evaluator import ModelEvaluator
import telchurn.util as util

LOGGER = util.get_logger('ensembler')

class Ensembler(abc.ABC):
        
    @abc.abstractmethod
    def ensemble_models(self, grids: List[RandomizedSearchCV], churn_df: pd.DataFrame) -> EnsembleVoteClassifier:
        raise NotImplementedError
            
class EnsemblerImpl(Ensembler):
    
    # soft voting is the mode in which the estimator that is most "certain" about the classification wins
    MIN_ESTIMATORS  = 1

    def __init__(self, splitter: DataSplitter, evaluator: ModelEvaluator):
        self.splitter = splitter
        self.evaluator = evaluator
        self.top10_scores = [(0.0, 0, "")] * 10
    
    def update_scores(self, score, num_estimators, voting_type):
        self.top10_scores.append((score, num_estimators, voting_type))
        self.top10_scores.sort(reverse=True)
        self.top10_scores.pop(10)
        LOGGER.info(f"current top scores")
        for i, (score, num_estimators, voting_type) in enumerate(self.top10_scores):
            if num_estimators == 0:
                continue
            LOGGER.info(f"\t{i+1} - {score} with {num_estimators} estimators and {voting_type} voting")
    
    def compute_estimator_weights(self, grids: List[RandomizedSearchCV]) -> List[Tuple[RandomizedSearchCV, float]]:
        LOGGER.info('computing estimator weights')
        result = list([ 
            #(grid.best_estimator_, grid.best_score_) 
            (grid.best_estimator_, math.exp(1.0+grid.best_score_))
            for grid in grids 
        ])
        result.sort(key=lambda x: x[1])
        return result
    
    def compute_score(self, estimator, X_test_df, y_test):
        # balanced accuracy of the candidate ensemble on the test split
        y_test_hat = estimator.predict(X_test_df)
        test_score = balanced_accuracy_score(y_test, y_test_hat)
        return test_score
            
    def ensemble_models(self, grids: List[RandomizedSearchCV], churn_df: pd.DataFrame) -> EnsembleVoteClassifier:
        target = churn_df.columns[-1]
        (X_train_df, y_train_df), (X_test_df, y_test_df) = self.splitter.split(churn_df, target) 
        estimators_and_weights = self.compute_estimator_weights(grids)
        total_estimators = len(estimators_and_weights)
        assert self.MIN_ESTIMATORS <= total_estimators
        num_combined_estimators = range(self.MIN_ESTIMATORS, total_estimators + 1) # the end of a range is exclusive
        best_score          = -99999
        best_estimator      = None
        best_voting_type    = None
        util.silence_warnings()
        for voting_type in ['soft', 'hard']:
            for num_estimators in num_combined_estimators:
                for comb_estimators_weights in combinations(estimators_and_weights, num_estimators):
                    # https://stackoverflow.com/questions/13635032/what-is-the-inverse-function-of-zip-in-python
                    comb_estimators, weights = zip(*comb_estimators_weights)
                    # in older versions of the library, the fit_base_estimators parameter was called refit
                    classifier = EnsembleVoteClassifier(
                        clfs                 = comb_estimators
                    ,   weights              = weights
                    ,   voting               = voting_type
                    #,   fit_base_estimators = False
                    ,   refit                = False
                    )
                    classifier.fit(None, y_train_df) # no data is needed because the base estimators are not refit
                    score = self.compute_score(classifier, X_test_df, y_test_df)
                    if score > best_score:
                        best_score          = score
                        best_estimator      = classifier
                        best_voting_type    = voting_type
                        self.update_scores(best_score,  num_estimators, best_voting_type)
                    
        LOGGER.info(f"best combination of estimators: ")
        for clf in best_estimator.clfs:
            LOGGER.info(f"\t{clf}")
        LOGGER.info(f"estimators weights: {best_estimator.weights}")
        LOGGER.info("Train Results")
        self.evaluator.report_results(best_estimator, X_train_df, y_train_df)
        LOGGER.info("Test Results")
        self.evaluator.report_results(best_estimator, X_test_df, y_test_df)
        
        LOGGER.info('refitting base classifiers on whole data set')
        X = pd.concat([X_train_df, X_test_df])
        y = pd.concat([y_train_df, y_test_df])
        for clf in best_estimator.clfs:
            clf.fit(X, y)
        best_estimator.fit(None, y) # no data is needed because the base estimators are not refit
        LOGGER.warning("Whole data set results (has data leakage)")
        self.evaluator.report_results(best_estimator, X, y)
        return best_estimator

Writing ./telchurn/ensembler.py
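
The ensembler weights each tuned estimator by `exp(1 + best_score_)` and then tries every subset under both voting modes. The sketch below illustrates those ideas with hand-rolled functions and made-up probabilities. It does not call mlxtend, and `soft_vote` / `hard_vote` are hypothetical helpers that only mimic weighted soft and hard voting. The count of candidate ensembles assumes all seven grids enabled by default in `param_grids.py` were saved.

```python
import math

def weight(cv_score):
    # Same transform as compute_estimator_weights: exp(1 + score)
    return math.exp(1.0 + cv_score)

def soft_vote(probas, weights):
    # Weighted average of class probabilities, then pick the most probable class
    n_classes = len(probas[0])
    total = sum(weights)
    avg = [
        sum(w * p[c] for p, w in zip(probas, weights)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: avg[c])

def hard_vote(labels, weights, n_classes=2):
    # Each classifier casts its weight for the label it predicts
    tally = [0.0] * n_classes
    for label, w in zip(labels, weights):
        tally[label] += w
    return max(range(n_classes), key=lambda c: tally[c])

# Made-up [P(no churn), P(churn)] outputs of three classifiers for one customer
probas = [[0.10, 0.90], [0.55, 0.45], [0.60, 0.40]]
weights = [1.0, 1.0, 1.0]
labels = [max(range(2), key=lambda c: p[c]) for p in probas]

soft = soft_vote(probas, weights)  # one very confident classifier tips the average
hard = hard_vote(labels, weights)  # two of three classifiers predict "no churn"

# The ratio of two weights depends only on the score difference
ratio = weight(0.8) / weight(0.7)

# Non-empty subsets of 7 estimators, evaluated once per voting type
n_estimators = 7
candidates = 2 * sum(math.comb(n_estimators, k) for k in range(1, n_estimators + 1))
print(soft, hard, round(ratio, 3), candidates)
```

In this example the two modes disagree: soft voting predicts churn, hard voting does not. That is one reason to evaluate both. The exhaustive search grows as 2^n, so it stays cheap for seven estimators but would not for many more.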


In [None]:
%%writefile ./splitter.py
# -*- coding: utf-8 -*-
import abc
import argparse
import pandas as pd
from telchurn.data_loader import DataLoader, DataLoaderImpl
from telchurn.data_splitter import DataSplitter, DataSplitterImpl
import telchurn.util as util

LOGGER = util.get_logger('splitter')

def main(seed: int, testsplit: float, input_file: str, output_file1: str, output_file2: str) -> None:
    data_loader = DataLoaderImpl()
    data_splitter = DataSplitterImpl(seed, testsplit)
    LOGGER.info('starting data splitter')
    df = data_loader.load_cleansed(input_file) # HACK: load_cleansed does not alter the input file
    LOGGER.info(f"input file: {input_file}")
    util.report_df(LOGGER, df)
    target = df.columns[-1]
    LOGGER.info(f'using column "{target}" as target variable')
    (X_train_df, y_train_df), (X_test_df, y_test_df) = data_splitter.split(df, target)
    
    train_df = X_train_df.copy()
    train_df[target] = y_train_df[target].values
    train_df = train_df[ df.columns ] # reorder columns - don't know why this is needed
    
    test_df = X_test_df.copy()
    test_df[target] = y_test_df[target].values
    test_df = test_df[ df.columns ] # reorder columns - don't know why this is needed
    
    LOGGER.info(f"output file 1: {output_file1}")
    util.report_df(LOGGER, train_df)
    LOGGER.info(f"output file 2: {output_file2}")
    util.report_df(LOGGER, test_df)
    
    train_df.to_csv(output_file1, sep=DataLoader.DELIMITER, index=False)
    test_df.to_csv(output_file2, sep=DataLoader.DELIMITER, index=False)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int,   help='random seed', default=DataSplitter.DEFAULT_RANDOM_STATE)
    parser.add_argument('--testsplit', type=float, help='test split percentage', default=DataSplitter.DEFAULT_TEST_PCT_SIZE)
    parser.add_argument('input_file', type=str,   help='input file name')
    parser.add_argument('output_file1', type=str,   help='output file 1')
    parser.add_argument('output_file2', type=str,   help='output file 2')
    args = parser.parse_args()
    main(
        args.seed
    ,   args.testsplit
    ,   args.input_file
    ,   args.output_file1
    ,   args.output_file2
    )

Writing ./splitter.py


In [None]:
%%writefile dataclean.py
# -*- coding: utf-8 -*-
import abc
import argparse
from typing import List
from telchurn.data_loader import DataLoader, DataLoaderImpl
from telchurn.feature_processor import FeatureProcessor, FeatureProcessorImpl
from telchurn.feature_ranker import FeatureRanker, FeatureRankerImpl
from telchurn.feature_selector import FeatureSelector, FeatureSelectorImpl
from telchurn.data_cleaner import DataCleaner, DataCleanerImpl
import telchurn.util as util

LOGGER = util.get_logger('dataclean')
        
def main(input_file: str, output_file: str, topk: int, seed: int, fields: List[str]):
    data_loader = DataLoaderImpl()
    feature_processor = FeatureProcessorImpl(seed)
    feature_ranker = FeatureRankerImpl(seed)
    feature_selector = FeatureSelectorImpl(feature_ranker)
    data_cleaner = DataCleanerImpl(data_loader, feature_processor, feature_selector)
    data_cleaner.clean(input_file, output_file, topk, fields)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, help='random seed', default=DataCleaner.DEFAULT_SEED)
    parser.add_argument('--topk', type=int, help='number of features to keep', default=DataCleaner.DEFAULT_TOP_K_FEATURES)
    parser.add_argument('--fields', type=str, help='comma separated list of columns to be kept', default=None)
    parser.add_argument('input_file', type=str, help='input file name or url')
    parser.add_argument('output_file', type=str, help='output file name')
    args = parser.parse_args()
    main(args.input_file, args.output_file, args.topk, args.seed, args.fields)

Writing dataclean.py


In [None]:
%%writefile fieldnames.py
# -*- coding: utf-8 -*-
import abc
import argparse
import pandas as pd
from telchurn.data_loader import DataLoader, DataLoaderImpl
import telchurn.util as util

LOGGER = util.get_logger('fieldnames')

def main(input_file: str, output_file: str) -> None:
    data_loader = DataLoaderImpl()
    LOGGER.info('starting field name extraction')
    df = data_loader.load_cleansed(input_file) # HACK: load_cleansed does not alter the input file
    if output_file == "-":
        print(*df.columns, sep=DataLoader.DELIMITER)
    else:
        with open(output_file, "w") as fh:
            print(*df.columns, sep=DataLoader.DELIMITER, file=fh, end='')
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=str,   help='input file name')
    parser.add_argument('output_file', type=str,   help='output file name')
    args = parser.parse_args()
    main(args.input_file, args.output_file)

Writing fieldnames.py


In [None]:
%%writefile ./train.py
# -*- coding: utf-8 -*-
import abc
import argparse
from telchurn.data_loader import DataLoader, DataLoaderImpl
from telchurn.data_splitter import DataSplitter, DataSplitterImpl
from telchurn.feature_processor import FeatureProcessor, FeatureProcessorImpl
from telchurn.feature_ranker import FeatureRanker, FeatureRankerImpl
from telchurn.feature_selector import FeatureSelector, FeatureSelectorImpl
from telchurn.pipeline_factory import PipelineFactory, PipelineFactoryImpl
from telchurn.hyper_param_tunner import HyperParamTunner, HyperParamTunnerImpl
from telchurn.model_repository import ModelRepository, ModelRepositoryImpl
import telchurn.param_grids as param_grids
from telchurn.trainer import Trainer, TrainerImpl
import telchurn.util as util

LOGGER = util.get_logger('train')
        
def main(input_file: str, seed: int, testsplit: float, kfolds: int, model_dir: str, quick: bool):
    if quick:
        LOGGER.warning('activating quick run mode')
        param_grids.QUICK_RUN = True
    data_loader = DataLoaderImpl()
    pipeline_factory = PipelineFactoryImpl()
    hp_tunner = HyperParamTunnerImpl(kfolds, seed)
    repo = ModelRepositoryImpl(model_dir)
    splitter = DataSplitterImpl(seed, testsplit)
    trainer = TrainerImpl(data_loader, pipeline_factory, hp_tunner, repo)
    trainer.train(input_file, splitter)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed',       type=int,   help='random seed',             default=DataSplitter.DEFAULT_RANDOM_STATE)
    parser.add_argument('--testsplit',  type=float, help='test split percentage',   default=DataSplitter.DEFAULT_TEST_PCT_SIZE)
    parser.add_argument('--kfolds',     type=int,   help='number of k folds',       default=HyperParamTunner.DEFAULT_K_FOLDS)
    parser.add_argument('--quick',      action="store_true", help='quick run', default=False)
    parser.add_argument('input_file',   type=str,   help='input file name')
    parser.add_argument('model_dir',    type=str,   help='models output directory')
    args = parser.parse_args()
    main(args.input_file, args.seed, args.testsplit, args.kfolds, args.model_dir, args.quick)

Writing ./train.py


In [None]:
%%writefile ./ensembler.py
# -*- coding: utf-8 -*-
import abc
import argparse
from telchurn.data_loader import DataLoader, DataLoaderImpl
from telchurn.model_repository import ModelRepository, ModelRepositoryImpl
from telchurn.ensembler import Ensembler, EnsemblerImpl
from telchurn.data_splitter import DataSplitter
from telchurn.hyper_param_tunner import HyperParamTunner
from telchurn.data_splitter import DataSplitterImpl
from telchurn.model_evaluator import ModelEvaluatorImpl
import telchurn.util as util

LOGGER = util.get_logger('ensembler')

class App:

    def __init__(self, data_loader: DataLoader, repo: ModelRepository, ensembler: Ensembler):
        self.data_loader = data_loader
        self.repo = repo
        self.ensembler = ensembler
    
    def read_grids(self):
        LOGGER.info('reading saved grids')
        grids = []
        for grid_name in self.repo.list_grids():
            grid = self.repo.load_grid(grid_name)
            grids.append(grid)
        return grids
        
    def run(self, input_file_or_url: str, model_name: str) -> None:
        LOGGER.info('starting ensembler')
        grids = self.read_grids()
        churn_df = self.data_loader.load_cleansed(input_file_or_url)
        util.report_df(LOGGER, churn_df)
        voting_classifier = self.ensembler.ensemble_models(grids, churn_df)
        self.repo.save_final_model(voting_classifier, model_name)
        
        
def main(input_file: str, seed: int, testsplit: float, kfolds: int, model_dir: str, model_name: str):
    data_loader = DataLoaderImpl()
    repo = ModelRepositoryImpl(model_dir)
    evaluator = ModelEvaluatorImpl()
    splitter = DataSplitterImpl(seed, testsplit)
    ensembler = EnsemblerImpl(splitter, evaluator)
    app = App(data_loader, repo, ensembler)
    app.run(input_file, model_name)
    
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed',       type=int,   help='random seed',             default=DataSplitter.DEFAULT_RANDOM_STATE)
    parser.add_argument('--testsplit',  type=float, help='test split percentage',   default=DataSplitter.DEFAULT_TEST_PCT_SIZE)
    parser.add_argument('--kfolds',     type=int,   help='number of k folds',       default=HyperParamTunner.DEFAULT_K_FOLDS)
    parser.add_argument('input_file',   type=str,   help='input file name')
    parser.add_argument('model_dir',    type=str,   help='models output directory')
    parser.add_argument('model_name',   type=str,   help='final model name')
    args = parser.parse_args()
    main(args.input_file, args.seed, args.testsplit, args.kfolds, args.model_dir, args.model_name)

Writing ./ensembler.py


In [None]:
%%writefile ./classify.py
# -*- coding: utf-8 -*-
import argparse
import os.path
from telchurn.model_repository import ModelRepositoryImpl
from telchurn.data_loader import DataLoaderImpl
from telchurn.model_evaluator import ModelEvaluatorImpl
import telchurn.util as util

LOGGER = util.get_logger('classify')
        
def main(model_path: str, input_file: str) -> None:
    LOGGER.info('starting classifier')
    model_dir = os.path.dirname(model_path)
    model_file = os.path.basename(model_path)
    repo = ModelRepositoryImpl(model_dir)
    loader = DataLoaderImpl()
    evaluator = ModelEvaluatorImpl()
    estimator = repo.load_final_model(model_file)
    churn_df = loader.load_cleansed(input_file)    
    target = churn_df.columns[-1]
    all_but_target = churn_df.columns.difference([target])
    X_df = churn_df[all_but_target]
    y_df = churn_df[target]
    evaluator.report_results(estimator, X_df, y_df)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('model_file',   type=str,   help='saved model file')
    parser.add_argument('input_file',   type=str,   help='input file name')
    args = parser.parse_args()
    main(args.model_file, args.input_file)

Writing ./classify.py


## Installing the Dependencies

In [None]:
!pip install pandas imblearn mlxtend shutup six sklearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting shutup
  Downloading shutup-0.2.0-py3-none-any.whl (1.5 kB)
Collecting sklearn
  Downloading sklearn-0.0.post1.tar.gz (3.6 kB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0.post1-py3-none-any.whl size=2344 sha256=9cb9ee25ade62a06fa7ad44eb2e9da57aeebd30eb21d30b05ddf4668733e27a1
  Stored in directory: /root/.cache/pip/wheels/42/56/cc/4a8bf86613aafd5b7f1b310477667c1fca5c51c3ae4124a003
Successfully built sklearn
Installing collected packages: sklearn, shutup
Successfully installed shutup-0.2.0 sklearn-0.0.post1


## Downloading the Data File

In [None]:
!wget -O telco-churn.csv https://raw.githubusercontent.com/itaborai83/ecd221-SI-trabalho/main/DATA/telco-churn.csv

--2022-11-15 16:48:04--  https://raw.githubusercontent.com/itaborai83/ecd221-SI-trabalho/main/DATA/telco-churn.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 970457 (948K) [text/plain]
Saving to: ‘telco-churn.csv’


2022-11-15 16:48:04 (25.0 MB/s) - ‘telco-churn.csv’ saved [970457/970457]



## Setting the Environment Variables Used as Parameters

In [None]:

%env SEED=42
%env HOLDOUT_SPLIT_PCT=0.1
%env TEST_SPLIT_PCT=0.3
%env FEATURE_COUNT=14
%env KFOLDS=10
%env MODELS_DIR=./MODELS
%env INPUT_FILE=./telco-churn.csv
%env TRAIN_TEST_FILE=./DATA/telco-churn-train-test.csv
%env HOLD_OUT_FILE=./DATA/telco-churn-holdout.csv
%env TRAIN_TEST_FILE_CLEAN=./DATA/telco-churn-train-test-clean.csv
%env HOLD_OUT_FILE_CLEAN=./DATA/telco-churn-holdout-clean.csv
%env FIELDNAMES=./DATA/fieldnames.txt
%env PYTHONWARNINGS=ignore::UserWarning,ignore::FutureWarning

env: SEED=42
env: HOLDOUT_SPLIT_PCT=0.1
env: TEST_SPLIT_PCT=0.3
env: FEATURE_COUNT=14
env: KFOLDS=10
env: MODELS_DIR=./MODELS
env: INPUT_FILE=./telco-churn.csv
env: TRAIN_TEST_FILE=./DATA/telco-churn-train-test.csv
env: HOLD_OUT_FILE=./DATA/telco-churn-holdout.csv
env: TRAIN_TEST_FILE_CLEAN=./DATA/telco-churn-train-test-clean.csv
env: HOLD_OUT_FILE_CLEAN=./DATA/telco-churn-holdout-clean.csv
env: FIELDNAMES=./DATA/fieldnames.txt
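
The cells that follow consume these values through shell expansion (`$SEED`, `$KFOLDS`, and so on). If the same settings are ever needed from Python, they arrive as strings and have to be converted. A minimal sketch, in which `read_params` is a hypothetical helper and not part of the project:

```python
import os

def read_params(env=os.environ):
    """Convert the string-valued variables into typed parameters."""
    return {
        "seed": int(env["SEED"]),
        "holdout_split_pct": float(env["HOLDOUT_SPLIT_PCT"]),
        "test_split_pct": float(env["TEST_SPLIT_PCT"]),
        "feature_count": int(env["FEATURE_COUNT"]),
        "kfolds": int(env["KFOLDS"]),
        "models_dir": env["MODELS_DIR"],
    }

# Values copied from the %env cell; a plain dict stands in for os.environ here.
example_env = {
    "SEED": "42",
    "HOLDOUT_SPLIT_PCT": "0.1",
    "TEST_SPLIT_PCT": "0.3",
    "FEATURE_COUNT": "14",
    "KFOLDS": "10",
    "MODELS_DIR": "./MODELS",
}
params = read_params(example_env)
print(params)
```

Inside the notebook, calling `read_params()` with no argument would read the live environment instead.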


## Generating the Train/Test File and the Hold-Out File

In [None]:
! python splitter.py --seed $SEED --testsplit $HOLDOUT_SPLIT_PCT $INPUT_FILE $TRAIN_TEST_FILE $HOLD_OUT_FILE

INFO - splitter.py:main:14 - starting data splitter
INFO - data_loader.py:load_cleansed:62 - loading cleansed dataframe from ./telco-churn.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingT
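
The row counts reported by the later cleaning steps (6338 train/test rows and 705 hold-out rows out of 7043) are what a 10% hold-out gives when its size is rounded up, which is also how scikit-learn's `train_test_split` treats a fractional `test_size`. The project's `DataSplitter` is not reproduced in this section, so the standard-library sketch below only illustrates the arithmetic; `holdout_split` is a hypothetical helper and its shuffling will not match the real split:

```python
import math
import random

def holdout_split(n_rows, holdout_pct, seed):
    """Shuffle row indices and set aside a hold-out share, rounding its size up."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = math.ceil(holdout_pct * n_rows)
    # First n_holdout shuffled rows go to the hold-out set, the rest to train/test.
    return indices[n_holdout:], indices[:n_holdout]

train_test_idx, holdout_idx = holdout_split(n_rows=7043, holdout_pct=0.1, seed=42)
print(len(train_test_idx), len(holdout_idx))  # 6338 705
```

Keeping the hold-out rows in a separate file means they are never seen during training or tuning.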

## Cleaning, Feature Engineering, and Deterministic Feature Selection of the Train/Test File

In [None]:
!python dataclean.py --seed $SEED --topk $FEATURE_COUNT $TRAIN_TEST_FILE $TRAIN_TEST_FILE_CLEAN

INFO - data_cleaner.py:clean:35 - starting data cleaner
INFO - data_loader.py:load:51 - loading dataframe from ./DATA/telco-churn-train-test.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 6338 entries, 0 to 6337
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        6338 non-null   object 
 1   gender             6338 non-null   object 
 2   senior_citizen     6338 non-null   int64  
 3   partner            6338 non-null   object 
 4   dependents         6338 non-null   object 
 5   tenure             6338 non-null   int64  
 6   phone_service      6338 non-null   object 
 7   multiple_lines     6338 non-null   object 
 8   internet_service   6338 non-null   object 
 9   online_security    6338 non-null   object 
 10  online_backup      6338 non-null   object 
 11  device_protection  6338 non-null   object 
 12  tech_support       6338 non-null   object

## Retrieving the Selected Features for Use When Processing the Hold-Out File

In [None]:
!python fieldnames.py $TRAIN_TEST_FILE_CLEAN $FIELDNAMES

INFO - fieldnames.py:main:12 - starting data splitter
INFO - data_loader.py:load_cleansed:62 - loading cleansed dataframe from ./DATA/telco-churn-train-test-clean.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 6338 entries, 0 to 6337
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   total_charges      6338 non-null   float64
 1   monthly_charges    6338 non-null   float64
 2   tenure             6338 non-null   int64  
 3   monthly_contract   6338 non-null   int64  
 4   two_year_contract  6338 non-null   int64  
 5   fiber_optic        6338 non-null   int64  
 6   electronic_check   6338 non-null   int64  
 7   paperless_billing  6338 non-null   int64  
 8   partner            6338 non-null   int64  
 9   online_security    6338 non-null   int64  
 10  tech_support       6338 non-null   int64  
 11  online_backup      6338 non-null   int64  
 12  dependents         

## Cleaning, Feature Engineering, and Deterministic Feature Selection of the Hold-Out File

In [None]:
! python dataclean.py  --seed $SEED --topk $FEATURE_COUNT --fields=$(cat $FIELDNAMES) $HOLD_OUT_FILE $HOLD_OUT_FILE_CLEAN

INFO - data_cleaner.py:clean:35 - starting data cleaner
INFO - data_loader.py:load:51 - loading dataframe from ./DATA/telco-churn-holdout.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customer_id        705 non-null    object 
 1   gender             705 non-null    object 
 2   senior_citizen     705 non-null    int64  
 3   partner            705 non-null    object 
 4   dependents         705 non-null    object 
 5   tenure             705 non-null    int64  
 6   phone_service      705 non-null    object 
 7   multiple_lines     705 non-null    object 
 8   internet_service   705 non-null    object 
 9   online_security    705 non-null    object 
 10  online_backup      705 non-null    object 
 11  device_protection  705 non-null    object 
 12  tech_support       705 non-null    object 
 13
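
The hold-out file reuses the columns selected on the train/test file by passing them through `--fields=$(cat $FIELDNAMES)`. Because the command substitution is unquoted, the file must contain a single token with no whitespace. The sketch below assumes a comma-separated format, which is a guess: the actual format written by `fieldnames.py` is not shown here, and `write_fieldnames` and `parse_fields` are hypothetical helpers.

```python
import argparse
import os
import tempfile

def write_fieldnames(path, names):
    """Store the selected column names as one comma-separated token."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(",".join(names))

def parse_fields(value):
    """Turn the comma-separated token back into a list of column names."""
    return [name for name in value.split(",") if name]

names = ["total_charges", "monthly_charges", "tenure"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fieldnames.txt")
    write_fieldnames(path, names)
    with open(path, encoding="utf-8") as fh:
        file_content = fh.read().strip()  # what $(cat ...) would substitute

parser = argparse.ArgumentParser()
parser.add_argument("--fields", type=parse_fields, default=None)
args = parser.parse_args([f"--fields={file_content}"])
print(args.fields)
```

Fixing the column list this way keeps the hold-out file from influencing feature selection.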

## Training and Hyperparameter Tuning

In [None]:
! python train.py --seed $SEED --testsplit $TEST_SPLIT_PCT --kfolds $KFOLDS $TRAIN_TEST_FILE_CLEAN $MODELS_DIR
# ! python train.py --quick --seed $SEED --testsplit $TEST_SPLIT_PCT --kfolds $KFOLDS $TRAIN_TEST_FILE_CLEAN $MODELS_DIR

INFO - trainer.py:train:41 - starting telco churn model training
INFO - data_loader.py:load_cleansed:62 - loading cleansed dataframe from ./DATA/telco-churn-train-test-clean.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 6338 entries, 0 to 6337
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   total_charges      6338 non-null   float64
 1   monthly_charges    6338 non-null   float64
 2   tenure             6338 non-null   int64  
 3   monthly_contract   6338 non-null   int64  
 4   two_year_contract  6338 non-null   int64  
 5   fiber_optic        6338 non-null   int64  
 6   electronic_check   6338 non-null   int64  
 7   paperless_billing  6338 non-null   int64  
 8   partner            6338 non-null   int64  
 9   online_security    6338 non-null   int64  
 10  tech_support       6338 non-null   int64  
 11  online_backup      6338 non-null   int64  
 12  dependen
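
The `--testsplit` and `--kfolds` arguments suggest a train/test split followed by a k-fold cross-validated search per model. The sketch below shows that pattern with scikit-learn's `GridSearchCV` on synthetic data. It is an illustration under assumptions, not the project's `HyperParamTunner`: the parameter grid, the use of stratification, and the accuracy scorer are all chosen here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED, TEST_SPLIT_PCT, KFOLDS = 42, 0.3, 10  # same values as the %env cell

# Synthetic stand-in for the 14 selected features.
X, y = make_classification(n_samples=500, n_features=14, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SPLIT_PCT, stratify=y, random_state=SEED
)

# Hypothetical grid; the grids actually searched by the project are not shown here.
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [1, 10]}
cv = StratifiedKFold(n_splits=KFOLDS, shuffle=True, random_state=SEED)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)  # accuracy of the refitted best model
print(grid.best_params_)
print(round(test_accuracy, 3))
```

The test split is only used after the search, so the reported accuracy is not biased by the tuning itself.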

## Building an Ensemble from the Best Trained Models

In [None]:
! python ensembler.py --seed $SEED --testsplit $TEST_SPLIT_PCT --kfolds $KFOLDS $TRAIN_TEST_FILE_CLEAN $MODELS_DIR final_model.pkl

INFO - ensembler.py:run:31 - starting ensembler
INFO - ensembler.py:read_grids:23 - reading saved grids
INFO - model_repository.py:list_grids:45 - listing saved grids on ./MODELS
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_DT.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_LOGREG.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_GB.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_NB.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_ADA.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_RF.pkl
INFO - model_repository.py:load_grid:70 - loading grid search results from ./MODELS/grid_KNN.pkl
INFO - data_loader.py:load_cleansed:62 - loading cleansed dataframe from ./DATA/telco-churn-train-test-clean.csv
INFO - util.py:report_df:21 - 
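
The log shows one saved grid per model family (DT, LOGREG, GB, NB, ADA, RF, KNN) being loaded before the ensemble is built. How `ensembler.py` combines them is not visible in this section; the sketch below assumes soft voting with scikit-learn's `VotingClassifier` over three of those families, on synthetic data and with untuned estimators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

SEED = 42

# Synthetic stand-in for the cleansed train/test file.
X, y = make_classification(n_samples=500, n_features=14, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

# In the real pipeline each entry would be the best estimator of a saved grid.
ensemble = VotingClassifier(
    estimators=[
        ("LOGREG", LogisticRegression(max_iter=1000)),
        ("NB", GaussianNB()),
        ("RF", RandomForestClassifier(n_estimators=50, random_state=SEED)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)

accuracy = ensemble.score(X_test, y_test)
print(round(accuracy, 3))
```

Soft voting requires every member to implement `predict_proba`; hard voting would be the alternative if a member does not.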

## Classifying the Hold-Out File

In [None]:
! python classify.py ./MODELS/final_model.pkl $HOLD_OUT_FILE_CLEAN

INFO - classify.py:main:13 - starting classifier
INFO - model_repository.py:load_final_model:88 - loading final model from ./MODELS/final_model.pkl
INFO - data_loader.py:load_cleansed:62 - loading cleansed dataframe from ./DATA/telco-churn-holdout-clean.csv
INFO - util.py:report_df:21 - <class 'pandas.core.frame.DataFrame'>
RangeIndex: 705 entries, 0 to 704
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   total_charges      705 non-null    float64
 1   monthly_charges    705 non-null    float64
 2   tenure             705 non-null    int64  
 3   monthly_contract   705 non-null    int64  
 4   two_year_contract  705 non-null    int64  
 5   fiber_optic        705 non-null    int64  
 6   electronic_check   705 non-null    int64  
 7   paperless_billing  705 non-null    int64  
 8   partner            705 non-null    int64  
 9   online_security    705 non-null    int64  
 10  tech_support       705 non