<a href="https://colab.research.google.com/github/leandroaguazaco/data_science_portfolio/blob/main/Projects/04-Churn_Telco_Analysis/04_Churn_Telco_Analysis_01_Preprocessing_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center"> 4 - CHURN TELCO ANALYSIS </h1>
<h2 align="center"> 4.1 - Preprocessing </h2>

<div align="center">

  <img alt="Static Badge" src="https://img.shields.io/badge/active_project-true-blue">

  <img alt="Static Badge" src="https://img.shields.io/badge/status-in progress-green">

</div>  

<object
data="https://img.shields.io/badge/contact-Felipe_Leandro_Aguazaco-blue?style=flat&link=https%3A%2F%2Fwww.linkedin.com%2Fin%2Ffelipe-leandro-aguazaco%2F">
</object>

## a. Project summary

This project analyzes and predicts customer churn in the telecommunications (telco) industry. The data describes client behavior, including incoming-call, outgoing-call, and internet service consumption. A variable named 'Churn' indicates whether a customer churned within two weeks after canceling services. The data summarizes eight weeks of activity for each telco line or client.

<h3 align="center"> <font color='orange'>NOTE: The project is distributed across multiple sections, separated into notebook files, in the following way:</font> </h3>



> <font color='gray'> 4.1 - Preprocessing: load, join, and clean data, plus exploratory data analysis (EDA).</font> ✍ ▶ Current section

4.2 - Pre-modeling: predict customer churn with the PyCaret library.

4.3 - Modeling: predict customer churn based on sklearn pipelines.

4.4 - Analyzing and explaining predictions.

4.5 - Detecting vulnerabilities in the final machine learning model.

4.6 - Model deployment with Streamlit.

## b. Install libraries

Additional libraries such as pandas, numpy, matplotlib, seaborn, and others are already installed in the Colab environment.

In [1]:
%%capture
!pip install pandas
!pip install polars
!pip install xlsx2csv
!pip install pyjanitor # Clean DataFrame
!pip install colorama
!pip install adjustText
!pip install rpy2==3.5.1 # Use R

## c. Import libraries

In [6]:
%%capture
# c.1 Python utilities
import pandas as pd
import polars as pl
import numpy as np
import glob
import math
from scipy.stats import spearmanr
import scipy.stats as stats
import warnings
from janitor import clean_names, remove_empty
import rpy2
import shutil
from google.colab import drive

# c.2 Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text
from colorama import Fore, Style

In [None]:
# c.3 Setups
%matplotlib inline
plt.style.use("ggplot")
warnings.simplefilter("ignore")

## d. Custom functions

### d.1 Load csv files

In [233]:
def custom_readcsv(filepath: str = None) -> pd.DataFrame:
  """
  Summary:
    Read a csv file, set SUBSCRIBER_ID as the index column, and drop the first remaining (unnecessary) column.
  Parameters:
    filepath (str, default = None): path to the file of interest.
  Returns:
    pandas DataFrame.
  """
  df = pd.read_csv(filepath_or_buffer = filepath,
                   sep = "|",
                   index_col = "SUBSCRIBER_ID",
                   parse_dates = True,
                   decimal = ",",
                   encoding = "utf-8") \
         .pipe(lambda x: x.drop([x.columns[0]], axis = 1))

  return df
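
One way to check the behavior of `custom_readcsv` is to run the same `read_csv` options against a small in-memory file. The sketch below is illustrative only: the column names `ROW` and `DATA_MB` and all values are hypothetical, and it assumes the real files carry a leading row-counter column before `SUBSCRIBER_ID`, which is what the `drop` step appears to target.

```python
import io

import pandas as pd

# Hypothetical pipe-separated sample with comma decimals (not the real data).
raw = io.StringIO(
    "ROW|SUBSCRIBER_ID|DATA_MB\n"
    "0|1001|12,5\n"
    "1|1002|3,25\n"
)

# Same separator, index column, and decimal settings as custom_readcsv.
df = pd.read_csv(raw, sep="|", index_col="SUBSCRIBER_ID", decimal=",")

# Drop the first remaining column (the assumed row counter).
df = df.drop(columns=df.columns[0])

print(df)
```

With `decimal=","`, a value such as `12,5` is parsed as the float 12.5 rather than kept as a string.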

### d.2 Type conversions

In [234]:
# d.2 Dtype conversion and memory reduction function.
def dtype_conversion(df: pd.DataFrame = None, verbose: bool = True)-> pd.DataFrame:
    """
    Summary:
      Convert column dtypes to reduce memory usage; takes a DataFrame and returns it with converted dtypes.
      For more details, visit: https://towardsdatascience.com/how-to-work-with-million-row-datasets-like-a-pro-76fb5c381cdd.
      The modifications include type casting for numerical and object variables.
    Parameters:
      df (pandas.DataFrame): DataFrame containing information.
      verbose (bool, default = True): If True, display the memory usage before and after conversion.
    Returns:
      pandas.DataFrame: original DataFrame with dtype conversions applied.
      Also prints high-cardinality warnings, the memory saved, and a count of features by type.
    """
    # 0- Original dtypes
    # print(Fore.GREEN + "Input dtypes" + Style.RESET_ALL)
    # print(df.dtypes)
    # print("\n")
    print(Fore.RED + "High Cardinality, categorical features with levels >= 15" + Style.RESET_ALL)

    # 1- Original memory_usage in MB
    start_mem = df.memory_usage().sum() / 1024 ** 2

    # 2- Numerical Types
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int": # First 3 characters
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max):
                    df[col] = df[col].astype(np.float16)
                elif (c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    # 3- Categorical Types
    high_card_vars = 0
    for col in df.select_dtypes(exclude = ["int8", "int16", "int32", "int64", "float16", "float32", "float64", "datetime64[ns]"]):
        categories = list(df[col].unique())
        cat_len = len(categories)
        if cat_len >= 2 and cat_len < 15:
           df[col] = df[col].astype("category")
        else:
          high_card_vars += 1  # Fixed: "=+ 1" reassigned 1 instead of incrementing
          # Print high cardinality variables, number of levels and a sample of the first 50 categories
          print(f"Look at: {Fore.RED + col + Style.RESET_ALL}, {cat_len} levels = {categories[:50]}")
    if high_card_vars == 0:
      print(Fore.GREEN + "None" + Style.RESET_ALL)
    else:
      pass

    # 4- Final memory_usage in MB
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print("\n")
        print(f"{Fore.RED}Initial memory usage: {start_mem:.2f} MB{Style.RESET_ALL}")
        print(f"{Fore.BLUE}Memory usage decreased to {end_mem:.2f} MB ({ 100 * (start_mem - end_mem) / start_mem:.1f}% reduction){Style.RESET_ALL}")
        #print("\n")
        #print(Fore.GREEN + "Output dtypes" + Style.RESET_ALL)
        #print(df.dtypes)
        print("\n")

    # 5. Feature types
    print(Fore.GREEN + "Variable types" + Style.RESET_ALL)
    numerical_vars = len(df.select_dtypes(include = ["number"]).columns)
    categorical_vars = len(df.select_dtypes(include = ["category", "object"]).columns)
    datetime_vars = len(df.select_dtypes(include = ["datetime64[ns]"]).columns)
    print(f"Numerical Features: {numerical_vars}")
    print(f"Categorical Features: {categorical_vars}")
    print(f"Datetime Features: {datetime_vars}")

    return df
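
The core idea behind `dtype_conversion` can be seen on a toy DataFrame. The sketch below is not the function above: it swaps the manual range checks for pandas' built-in `pd.to_numeric(..., downcast="integer")`, and the column names `calls` and `plan` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one int64 column and one low-cardinality text column.
df = pd.DataFrame({
    "calls": np.array([3, 17, 42, 8], dtype="int64"),
    "plan": ["prepaid", "postpaid", "prepaid", "postpaid"],
})

small = df.assign(
    # Downcast to the smallest integer type that holds every value.
    calls=pd.to_numeric(df["calls"], downcast="integer"),
    # Store repeated labels as integer codes plus a lookup of categories.
    plan=df["plan"].astype("category"),
)

print(df.dtypes, small.dtypes, sep="\n\n")
print(df["calls"].memory_usage(index=False), "bytes ->",
      small["calls"].memory_usage(index=False), "bytes")
```

Downcasting is safe for integers because no value changes; casting floats to narrower types, as the function above does with `float16`, can lose precision and is worth checking on real data.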

## 1 - Load data

Four files (.csv):

* CONSUMO_DATOS.csv
* CONSUMO_VOZ_IN.csv
* CONSUMO_VOZ_OUT.csv
* INFORMACION_GENERAL.csv

In [235]:
# 1. List of .csv files
files = glob.glob('/content/*.csv')

In [None]:
# 2. List of DataFrames: four files
df_list = [custom_readcsv(file) for file in files]

# Report the shape of each DataFrame next to its file name
for df, file in zip(df_list, files):
  print(f"Rows = {df.shape[0]}, Columns = {df.shape[1]}, File = {file[file.rfind('/') + 1:]}")

In [None]:
# 3. Concatenate DataFrames
churn_data = pd.concat(objs = df_list, axis = 1, join = "outer", ignore_index = False) \
               .pipe(clean_names)
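
Because every file is indexed by `SUBSCRIBER_ID`, `pd.concat` with `axis = 1` and `join = "outer"` aligns rows on that index and keeps subscribers that appear in only some files. A minimal sketch with two hypothetical frames (the column names `VOZ_MIN` and `DATOS_MB` and all values are made up for illustration):

```python
import pandas as pd

# Two hypothetical frames sharing the SUBSCRIBER_ID index, with partial overlap.
voice = pd.DataFrame(
    {"VOZ_MIN": [10, 20]},
    index=pd.Index([1001, 1002], name="SUBSCRIBER_ID"),
)
data = pd.DataFrame(
    {"DATOS_MB": [5.0, 7.5]},
    index=pd.Index([1002, 1003], name="SUBSCRIBER_ID"),
)

# Outer join on the index: the union of subscribers, NaN where a file has no row.
merged = pd.concat([voice, data], axis=1, join="outer")

print(merged)
```

Subscribers missing from a file get `NaN` in that file's columns, so missing values in `churn_data` may reflect absent records rather than zero consumption.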