<a href="https://colab.research.google.com/github/leandroaguazaco/data_science_portfolio/blob/main/Projects/04-Churn_Telco_Analysis/04_Churn_Telco_Analysis_02_Premodeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center"> 4 - CHURN TELCO ANALYSIS </h1>

<div align="center">

  <img alt="Static Badge" src="https://img.shields.io/badge/active_project-true-blue">

</div>  

<h2 align="center"> 4.2 - Premodeling </h2>

<div align="center">

  <img alt="Static Badge" src="https://img.shields.io/badge/active_section-true-blue">

  <img alt="Static Badge" src="https://img.shields.io/badge/section_status-in progress-green">

</div>  

<object
data="https://img.shields.io/badge/contact-Felipe_Leandro_Aguazaco-blue?style=flat&link=https%3A%2F%2Fwww.linkedin.com%2Fin%2Ffelipe-leandro-aguazaco%2F">
</object>

## a. Project summary

The aim of this project is to analyze and predict customer churn in the telco industry. The data describes client behavior, including incoming calls, outgoing calls, and internet service consumption. A variable called 'Churn' indicates whether a customer churned within two weeks after canceling services. The data summarizes eight weeks of activity for each telco line or client.

<h3 align="center"> <font color='orange'>NOTE: The project is distributed across multiple sections, separated into notebook files, in the following way:</font> </h3>



4.1 - Preprocessing data: load, join, and clean data, plus exploratory data analysis (EDA).

> <font color='gray'> 4.2 - Premodeling: predict customer churn based on the PyCaret library. </font> ✍ ▶ <font color='orange'> Current section </font>

4.3 - Modeling: predict customer churn based on sklearn pipelines.

4.4 - Analyzing and explaining predictions.

4.5 - Detecting vulnerabilities in the final machine learning model.

4.6 - Model deployment with Streamlit.

## b. Install libraries

In [None]:
!pip install -U --pre pycaret
!pip install rpy2==3.5.1 # Use R
!pip install colorama

## c. Import libraries

In [1]:
%%capture
# c.1 Python Utilities
import pandas as pd
import numpy as np
import rpy2
import shutil
from google.colab import drive
import os
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from colorama import Fore, Style

In [None]:
%%capture
# c.2 PyCaret
from pycaret.classification import *
#from pycaret.regression import *

In [4]:
# c.3 Setups
%matplotlib inline
plt.style.use("ggplot")
warnings.simplefilter("ignore")

## d. Custom functions

### d.1 - Type conversions

In [11]:
# d.1 dtype conversion and memory reduction function.
def dtype_conversion(df: pd.DataFrame = None, verbose: bool = True)-> pd.DataFrame:
    """
    Summary:
      Convert dtypes to reduce memory usage; takes a DataFrame as argument and returns a DataFrame.
      For more details, visit: https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd.
      The modifications include type casting for numerical and object variables.
      Note: columns are converted in place, so the input DataFrame is modified.
    Parameters:
      df (pandas.DataFrame): DataFrame containing information.
      verbose (bool, default = True): If True, print the memory usage before and after conversion.
    Returns:
      pandas.DataFrame: original DataFrame with dtype conversions.
      Prints high-cardinality variable warnings, memory savings, and a count of feature types.
    """
    # 0- Original dtypes
    # print(Fore.GREEN + "Input dtypes" + Style.RESET_ALL)
    # print(df.dtypes)
    # print("\n")
    print(Fore.RED + "High Cardinality, categorical features with levels > 15" + Style.RESET_ALL)

    # 1- Original memory_usage in MB
    start_mem = df.memory_usage().sum() / 1024 ** 2

    # 2- Numerical Types
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int": # First 3 characters
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                    df[col] = df[col].astype(np.float32)
                #elif (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                #    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    # 3- Categorical Types
    high_card_vars = 0
    for col in df.select_dtypes(exclude = ["int8", "int16", "int32", "int64", "float16", "float32", "float64", "datetime64[ns]"]):
        categories = list(df[col].unique())
        cat_len = len(categories)
        if cat_len >= 2 and cat_len < 15:
            df[col] = df[col].astype("category")
        else:
            high_card_vars += 1
            # Print high-cardinality variables, number of levels, and a sample of the first 50 categories
            print(f"Look at: {Fore.RED + col + Style.RESET_ALL}, {cat_len} levels = {categories[:50]}")
    if high_card_vars == 0:
        print(Fore.GREEN + "None" + Style.RESET_ALL)

    # 4- Final memory_usage in MB
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print("\n")
        print(f"{Fore.RED}Initial memory usage: {start_mem:.2f} MB{Style.RESET_ALL}")
        print(f"{Fore.BLUE}Memory usage decreased to {end_mem:.2f} MB ({ 100 * (start_mem - end_mem) / start_mem:.1f}% reduction){Style.RESET_ALL}")
        #print("\n")
        #print(Fore.GREEN + "Output dtypes" + Style.RESET_ALL)
        #print(df.dtypes)
        print("\n")

    # 5. Feature types
    print(Fore.GREEN + "Variable types" + Style.RESET_ALL)
    numerical_vars = len(df.select_dtypes(include = ["number"]).columns)
    categorical_vars = len(df.select_dtypes(include = ["category", "object"]).columns)
    datetime_vars = len(df.select_dtypes(include = ["datetime64[ns]"]).columns)
    print(f"Numerical Features: {numerical_vars}")
    print(f"Categorical Features: {categorical_vars}")
    print(f"Datetime Features: {datetime_vars}")

    return df
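
For comparison, the numeric downcasting in step 2 can also be sketched with pandas' built-in `pd.to_numeric(..., downcast=...)`. This is a minimal sketch under stated assumptions, not a replacement for the function above: `downcast_numeric`, the `demo` frame, and its column names are hypothetical, and unlike `dtype_conversion` it works on a copy and leaves non-numeric columns untouched.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns cast to the smallest fitting dtype."""
    out = df.copy()
    # Integer columns: pandas picks the smallest signed integer dtype that holds all values
    for col in out.select_dtypes(include=np.integer).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: pandas goes no lower than float32
    for col in out.select_dtypes(include=np.floating).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Hypothetical toy data, only to illustrate the effect
demo = pd.DataFrame({
    "calls": np.array([0, 12, 127], dtype="int64"),
    "mb_used": np.array([0.5, 1024.25, 3.0], dtype="float64"),
})
small = downcast_numeric(demo)
print(small.dtypes)
print(demo.memory_usage().sum(), "->", small.memory_usage().sum(), "bytes")
```

Casting floats to 32 bits trades precision for memory, so it is worth checking that the affected columns tolerate that loss before applying it to the full dataset.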

## 1 - Data

### 1.1 - Import from Google Drive

In [8]:
# Mount Google Drive
drive.mount('/content/drive')


# Specify the source path in Google Drive
drive_filepath = '/content/drive/MyDrive/DataScience_Portfolio/04-Churn_Telco_Analysis/'

# Specify the destination path in Colab
colab_filepath = '/content/'

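# Copy the file from Google Drive to Colab
try:
  shutil.copy(src = os.path.join(drive_filepath, 'churn_data.txt'), dst = os.path.join(colab_filepath, 'churn_data.txt'))
except OSError as error:
  # Report the failure instead of silently ignoring it
  print(f"Could not copy churn_data.txt: {error}")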

Mounted at /content/drive


### 1.2 - Load data

In [16]:
churn_df = pd.read_csv(filepath_or_buffer = "churn_data.txt",
                       sep = "|",
                       index_col = "SUBSCRIBER_ID",
                       parse_dates = True,
                       decimal = ".",
                       encoding = "utf-8") \
             .assign(region = lambda x: x.loc[:, "region"].astype("category")) \
             .pipe(dtype_conversion)

# Backup copy
churn_df_copy = churn_df.copy(deep = True)

High Cardinality, categorical features with levels > 15
Look at: region, 34 levels = ['Bogota D.C.', 'Santander', 'Antioquia', nan, 'Cundinamarca', 'Quindio', 'Valle Del Cauca', 'Arauca', 'Bolivar', 'Atlantico', 'Tolima', 'Huila', 'Meta', 'Putumayo', 'Boyaca', 'Caldas', 'Cordoba', 'Nariño', 'Magdalena', 'Risaralda', 'Cauca', 'La Guajira', 'Norte De Santander', 'Caqueta', 'Cesar', 'Guaviare', 'Sucre', 'Amazonas', 'Guainia', 'Vichada', 'Choco', 'Casanare', 'Providencia Islas', 'Vaupes']


Initial memory usage: 196.19 MB
Memory usage decreased to 92.12 MB (53.0% reduction)


Variable types
Numerical Features: 38
Categorical Features: 5
Datetime Features: 0


## 2 - Predicting Customer Churn

PyCaret
>[Home - PyCaret](https://pycaret.org/)

>[Read the docs - PyCaret](https://pycaret.readthedocs.io/en/stable/index.html)

### 2.1 - Initialize the experiment class and run setup

In [None]:
%%time

# 2.1 Init the experimentation class
churn_binaryclass = ClassificationExperiment()

# 2.2 Init setup

churn_binaryclass.setup(data = churn_df,
                        target = 'churn', # Target column in data
                        index = True, # Hold original index
                        session_id = 123,
                        preprocess = True,
                        verbose = True, # When set to False, Information grid is not printed.
                        train_size = 0.75, # Proportion of the dataset to be used for training and validation
                        test_data = None,

                        # Dtypes categorical features
                        ordinal_features = None,
                        categorical_features = ["canal", "region", 'bandas', 'tipo_gross_adds'],
                        max_encoding_ohe = 30, # Categorical columns with max_encoding_ohe or less unique values are encoded using OneHotEncoding
                        rare_to_value = 0.05, # Minimum fraction of category occurrences in a categorical column
                        rare_value = "Otro", # Value with which to replace rare categories

                        # Dtypes numerical features
                        numeric_features = churn_df.select_dtypes(include = 'number').columns.drop('churn', errors = 'ignore').tolist(), # List of column names, target excluded

                        # Dtypes datetime features
                        date_features = None,
                        create_date_columns = ["day", "month", "year"], # Columns to create from the date features

                        # Normalize data: depending on similarity of numerical variables scales
                        normalize = True,
                        normalize_method = "robust", # zscore, minmax, maxabs, robust

                        # Transform data: depending on skewness, kurtosis, outlier presence
                        transformation = True,
                        transformation_method = "yeo-johnson",

                        # Handle missing values
                        imputation_type = "iterative",
                        iterative_imputation_iters = 5,
                        numeric_iterative_imputer = "rf", #"lightgbm", # Default method
                        categorical_iterative_imputer = "lightgbm", # Default method

                        # Outliers
                        remove_outliers = False,
                        outliers_method = "ee", # "iforest": Uses sklearn’s IsolationForest; "ee": Uses sklearn’s EllipticEnvelope; "lof": Uses sklearn’s LocalOutlierFactor
                        outliers_threshold = 0.05, # Percentage of outliers to be removed from the dataset

                        # (didn't work)
                        # Imbalance data: depending on levels distributions in target variable
                        fix_imbalance = True, # Dataset has unequal distribution of target class it can be balanced using this parameter
                        fix_imbalance_method = 'SMOTE', # Synthetic Minority Over-sampling Technique, choose from the name of an imblearn estimator

                        # (didn't work)
                        # Feature selection
                        feature_selection = True,
                        feature_selection_method = "univariate", # "univariate": uses sklearn's SelectKBest; "classic": uses sklearn's SelectFromModel; "sequential": uses sklearn's SequentialFeatureSelector
                        feature_selection_estimator = "rf", # Classifier used to determine the feature importance
                        n_features_to_select = 0.2, # The maximum number of features to select with feature_selection
                        low_variance_threshold = 0.1, # Remove features with a training-set variance lower than the provided threshold
                        pca = False,

                        # Multicollinearity
                        remove_multicollinearity = True, # Features with the inter-correlations higher than the defined threshold are removed
                        multicollinearity_threshold = 0.85, # Minimum absolute Pearson correlation to identify correlated features
                        #remove_perfect_collinearity = True,

                        # Cross validation strategy
                        data_split_shuffle = True,
                        data_split_stratify = True, # Controls stratification during train_test_split
                        fold_strategy = "stratifiedkfold",
                        fold = 5, # Number of folds to be used in cross validation, use in tuning hyperparameters

                        # Experiment Logging - Mlflow
                        experiment_name = "Predicting Customer Churn - v0.1",
                        log_experiment = True,
                        log_plots = True
)
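
To make two of the options above more concrete, here is a minimal pandas sketch of the ideas behind `rare_to_value` / `rare_value` (grouping infrequent categories) and `multicollinearity_threshold` (dropping highly correlated features). It is an illustration under stated assumptions, not PyCaret's internal implementation: the helper names `group_rare_categories` and `drop_correlated`, the toy data, and the rule for choosing which column of a correlated pair to drop are all hypothetical.

```python
import numpy as np
import pandas as pd


def group_rare_categories(s: pd.Series, min_fraction: float = 0.05,
                          rare_value: str = "Otro") -> pd.Series:
    """Replace categories seen in fewer than min_fraction of the rows with rare_value."""
    counts = s.value_counts()
    rare = counts[counts < min_fraction * len(s)].index
    return s.astype("object").where(~s.isin(rare), rare_value)


def drop_correlated(df: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Drop the later column of each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Hypothetical toy data: two frequent regions and two rare ones (100 rows)
region = pd.Series(["Bogota D.C."] * 60 + ["Antioquia"] * 37 + ["Vaupes"] * 2 + ["Guainia"])
grouped = group_rare_categories(region, min_fraction=0.05, rare_value="Otro")
print(grouped.value_counts())

# Hypothetical numeric features: y is a linear function of x, z is only weakly related
x = np.arange(10, dtype=float)
features = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]})
reduced = drop_correlated(features, threshold=0.85)
print(list(reduced.columns))
```

Grouping rare levels before one-hot encoding keeps a column such as `region` from producing many near-empty dummy variables, which is presumably why it is paired with `max_encoding_ohe` in the setup.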