<a href="https://colab.research.google.com/github/leandroaguazaco/data_science_portfolio/blob/main/Projects/03-NPS_Telco_Analysis/P03_NPS_Telco_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center"> 3 - NPS TELCO ANALYSIS </h1>

<div align="center">

  <img alt="Static Badge" src="https://img.shields.io/badge/active_project-true-blue">

  <img alt="Static Badge" src="https://img.shields.io/badge/status-in progress-green">

</div>  

<object
data="https://img.shields.io/badge/contact-Felipe_Leandro_Aguazaco-blue?style=flat&link=https%3A%2F%2Fwww.linkedin.com%2Fin%2Ffelipe-leandro-aguazaco%2F">
</object>

## a. Summary

Another NPS (Net Promoter Score) analysis, this time for the telco industry. The dataset is high-dimensional and mixes categorical and numerical variables. The aim is to predict whether a client will be a promoter or a detractor.

* Automated EDA
* Binary classification problem
* No missing values
* Mix of categorical and numerical features
* High-dimensional data
* Main libraries: PyCaret, Scikit-Learn, pyjanitor
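
The notebook does not show how the promoter/detractor target is built, so the following is only a sketch of the standard NPS convention (scores 9-10 are promoters, 7-8 passives, 0-6 detractors). The `nps_segment` helper, the `nps_score` name, and the choice to drop passives for the binary target are assumptions for illustration, not the author's method.

```python
import pandas as pd

def nps_segment(score: int) -> str:
    """Hypothetical helper: map a 0-10 survey score to its standard NPS segment."""
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"

# Toy scores, not taken from the dataset
scores = pd.Series([10, 9, 8, 6, 0, 7], name="nps_score")
segments = scores.map(nps_segment)

# Assumption: a binary target keeps only promoters and detractors
binary_target = segments[segments != "passive"]

# NPS = % promoters - % detractors
nps = 100 * ((segments == "promoter").mean() - (segments == "detractor").mean())
print(segments.tolist())
print(f"NPS: {nps:.1f}")
```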


## b. Install libraries

In [None]:
%%capture
!pip install polars
!pip install xlsx2csv
!pip install pyjanitor
!pip install colorama
!pip install adjustText
!pip install -U ydata-profiling

## c. Import libraries

In [9]:
# c.1 Python utilities
import pandas as pd
import polars as pl
import numpy as np
import warnings
from janitor import clean_names, remove_empty

# c.2 Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text
from colorama import Fore, Style

# c.3 Automated EDA
from ydata_profiling import ProfileReport

# c.4 Setup
%matplotlib inline
plt.style.use("ggplot")
warnings.simplefilter("ignore")

## d. Custom functions

In [52]:
# d.1 Dtype conversion and memory reduction function.
def dtype_conversion(df: pd.DataFrame = None, verbose: bool = True) -> pd.DataFrame:
    """
    Summary:
      Convert dtypes to reduce memory usage; takes a DataFrame as argument, returns a DataFrame.
      For more details, visit: https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd.
      The modifications include type casting for numerical and object variables.
    Parameters:
      df (pandas.DataFrame): DataFrame containing information.
      verbose (bool, default = True): If True, display the memory savings and the final dtypes.
    Returns:
      pandas.DataFrame: the input DataFrame (modified in place) with converted dtypes.
      Also prints the original dtypes and a warning for high-cardinality variables.
    """
    # 0- Original dtypes
    print(Fore.GREEN + "Input dtypes" + Style.RESET_ALL)
    print(df.dtypes)
    print("\n")
    print(Fore.RED + "High cardinality, categorical features with levels >= 15" + Style.RESET_ALL)

    # 1- Original memory_usage in MB
    start_mem = df.memory_usage().sum() / 1024 ** 2

    # 2- Numerical Types
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int": # First 3 characters
                # Inclusive bounds, so values equal to a type's limits still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max):
                    df[col] = df[col].astype(np.float16)
                elif (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    # 3- Categorical Types
    high_card_vars = 0
    for col in df.select_dtypes(exclude = ["int8", "int16", "int32", "int64", "float16", "float32", "float64", "datetime64[ns]"]):
        categories = list(df[col].unique())
        cat_len = len(categories)
        if cat_len >= 2 and cat_len < 15:
            df[col] = df[col].astype("category")
        else:
            high_card_vars += 1
            # Print high-cardinality variables, their number of levels, and a sample of the first 50 categories
            print(f"Look at: {Fore.RED + col + Style.RESET_ALL}, {cat_len} levels = {categories[:50]}")
    if high_card_vars == 0:
        print(Fore.GREEN + "None" + Style.RESET_ALL)

    # 4- Final memory_usage in MB
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print("\n")
        print(f"{Fore.BLUE}Initial memory usage: {start_mem:.2f} MB{Style.RESET_ALL}")
        print(f"{Fore.BLUE}Memory usage decreased to {end_mem:.2f} MB ({ 100 * (start_mem - end_mem) / start_mem:.1f}% reduction){Style.RESET_ALL}")
        print("\n")
        print(Fore.GREEN + "Output dtypes" + Style.RESET_ALL)
        print(df.dtypes)
        print("\n")

    return df
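
The downcasting idea behind `dtype_conversion` can also be sketched with pandas built-ins. This is a minimal variant under stated assumptions, not a replacement for the function above: `downcast_demo`, its `max_levels` parameter, and the toy columns are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def downcast_demo(df: pd.DataFrame, max_levels: int = 15) -> pd.DataFrame:
    """Hypothetical helper: shrink numeric dtypes and encode low-cardinality text columns."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Downcasts to float32 at most
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col]):
            # Text columns with few distinct values become categoricals
            if out[col].nunique() < max_levels:
                out[col] = out[col].astype("category")
    return out

# Toy data, not taken from the NPS dataset
toy = pd.DataFrame({
    "calls": np.arange(1000, dtype="int64") % 100,
    "duration": np.linspace(0.0, 1.0, 1000),
    "region": ["north", "south"] * 500,
})

small = downcast_demo(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} -> {after} bytes")
```

Unlike `dtype_conversion`, `pd.to_numeric` never goes below `float32`, which avoids the precision loss that `float16` can introduce.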

## 1 - Load data

In [53]:
telco_nps_df = pl.read_excel(source = "BD_NPS_PERSONAS_sample.xlsx",
                             sheet_id = 1,
                             xlsx2csv_options = {"skip_empty_lines": True},
                             read_csv_options ={"has_header": True,
                                                "try_parse_dates": True, # Important
                                                "encoding": "utf8"}) \
                 .to_pandas() \
                 .pipe(clean_names) \
                 .pipe(remove_empty) \
                 .set_index("id") \
                 .drop_duplicates() \
                 .pipe(dtype_conversion) # Custom function

Input dtypes
antigued                object
grupo_edad              object
gener                   object
estado_civil            object
region                  object
calidad_produc           int64
calif_voz                int64
senal_voz                int64
estabil_llamada          int64
uso_serv_cliente        object
data_usur               object
technology              object
band_7                  object
max_network_voice       object
calls_drop_s             int64
calls_failure_s          int64
calls_out_month_p        int64
calls_in_month_p         int64
calls_out_tot_p          int64
calls_in_tot_p           int64
no_answer_calls_p        int64
month_voice_out_c        int64
duration_all_in_a        int64
duration_all_out_a       int64
duration_all_inout_c     int64
setup_time_avg_a         int64
setup_failure_perc_a     int64
dropped_calls_perc_a     int64
mo_mos_avg_a             int64
recharges_month_a        int64
nr_recharges_month_a     int64
tipo_m1          