# Customer Churn Prediction

This is the first notebook for now; we will split it into smaller sub-notebooks later. We are working with the Telco customer churn dataset from https://www.kaggle.com/blastchar/telco-customer-churn.

## Import libraries and data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# import statsmodels.api as sm
%matplotlib inline

import tensorflow as tf
from tensorflow import keras
import sklearn

In [3]:
CHURN_PATH = "../data/"

data_path = CHURN_PATH + "customerChurnTelco.csv"

df = pd.read_csv(data_path)

**Preview of the data**

In [4]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## Variables description

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


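The **TotalCharges** column is stored as `object` even though it holds numeric amounts, so we convert it to `float64`. With `errors='coerce'`, values that cannot be parsed as numbers become `NaN`.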

In [6]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
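Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth checking which raw values were affected. The following is a minimal sketch on a toy Series; the values (including the blank string) are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw TotalCharges column (illustrative values only)
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# Same call as in the notebook: unparseable strings become NaN
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but could not be parsed
failed = raw[converted.isna() & raw.notna()]
n_failed = len(failed)

print(converted.dtype)   # dtype of the converted column
print(failed.tolist())   # raw strings that were coerced to NaN
```

Applied to the real data, the same comparison would have to run on a copy of the column taken before the conversion cell, since that cell overwrites `df['TotalCharges']`.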

In [7]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0,7032.0
mean,0.162147,32.371149,64.761692,2283.300441
std,0.368612,24.559481,30.090047,2266.771362
min,0.0,0.0,18.25,18.8
25%,0.0,9.0,35.5,401.45
50%,0.0,29.0,70.35,1397.475
75%,0.0,55.0,89.85,3794.7375
max,1.0,72.0,118.75,8684.8


**We can see 21 variables in total and 7043 observations. Every variable is fully populated except TotalCharges, which has 7032 non-null values after the conversion.**

- Description of the variables:

    1. **customerID** is a unique identifying number assigned to each customer
    
    2. **gender** indicates the gender of the customer (Male, Female)
    
    3. **SeniorCitizen** indicates whether the customer is a senior citizen or not (1, 0)
    
    4. **Partner** indicates whether the customer has a partner or not (Yes, No)
    
    5. **Dependents** indicates whether the customer has dependents or not (Yes, No)

    6. **tenure** is the number of months the customer stayed with the company
    
    7. **PhoneService** indicates whether the customer has a phone service or not (Yes, No)
    
    8. **MultipleLines** indicates whether the customer has multiple lines or not (Yes, No, No phone service)
    
    9. **InternetService** indicates the customer's type of internet service (DSL, Fiber optic, No)
    
    10. **OnlineSecurity** indicates whether the customer has online security or not (Yes, No, No internet service)
    
    11. **OnlineBackup** indicates whether the customer has online backup or not (Yes, No, No internet service)
    
    12. **DeviceProtection** indicates whether the customer has device protection or not (Yes, No, No internet service)
    
    13. **TechSupport** indicates whether the customer has tech support or not (Yes, No, No internet service)
    
    14. **StreamingTV** indicates whether the customer has streaming TV or not (Yes, No, No internet service)
    
    15. **StreamingMovies** indicates whether the customer has streaming movies or not (Yes, No, No internet service)
    
    16. **Contract** indicates the contract term of the customer (Month-to-month, One year, Two year)
    
    17. **PaperlessBilling** indicates whether the customer has paperless billing or not (Yes, No)
    
    18. **PaymentMethod** indicates the customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
    
    19. **MonthlyCharges** is the amount charged to the customer each month
    
    20. **TotalCharges** is the total amount charged to the customer
    
    21. **Churn** indicates whether the customer churned or not (Yes or No)
    
**Churn is the target variable; the remaining variables are candidate predictors.**

### Categorical and Numerical Variables

In [8]:
display(df.dtypes)

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

**Categorical variables:** gender, SeniorCitizen, Partner, Dependents, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, Churn.

**Numerical variables:** tenure, MonthlyCharges, TotalCharges. Note that **customerID** is neither: it is a string identifier (dtype `object`) and carries no numerical meaning.

**Let us check whether any of the features contain blank, null or empty values**
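The grouping above can also be derived programmatically instead of being typed by hand. The sketch below runs on a small toy frame with a few of the dataset's column names; the rule "a numeric column with at most two distinct values is a binary flag" is our own assumption, used here so that `SeniorCitizen` (stored as `int64`) ends up with the categorical variables, and `customerID` is set aside as an identifier.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns (illustrative values)
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d"],
    "SeniorCitizen": [0, 1, 0, 0],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})

# Identifier columns are excluded from both groups
id_cols = ["customerID"]

# Columns with a numeric dtype
numeric = toy.select_dtypes(include="number").columns

# Assumption: numeric columns with at most two distinct values are 0/1 flags
binary_flags = [c for c in numeric if toy[c].nunique() <= 2]

numerical_cols = [c for c in numeric if c not in binary_flags]
categorical_cols = [
    c for c in toy.columns
    if c not in id_cols and c not in numerical_cols
]

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

The two-value threshold is only a heuristic; on the full data it should be checked against the variable descriptions rather than trusted blindly.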

In [9]:
display(df.isnull().sum())

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

We observe that only **TotalCharges** contains null values (11 of them).
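Before imputing, it helps to look at the rows where **TotalCharges** is missing. The summary statistics show a minimum `tenure` of 0, so one hypothesis worth testing is that the missing totals belong to customers who have not been billed yet. The sketch below shows the check on toy data; the values, and the choice of filling with `0.0` if the hypothesis holds, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns (illustrative values only)
toy = pd.DataFrame({
    "tenure": [0, 34, 2, 0],
    "MonthlyCharges": [52.55, 56.95, 53.85, 20.25],
    "TotalCharges": [np.nan, 1889.5, 108.15, np.nan],
})

# Rows where TotalCharges is missing
missing = toy[toy["TotalCharges"].isna()]

# Hypothesis check: do all missing rows have zero tenure?
all_new_customers = bool((missing["tenure"] == 0).all())

# One possible imputation if the hypothesis holds: nothing charged so far
filled = toy["TotalCharges"].fillna(0.0)

print(missing)
print(all_new_customers)
```

If the check fails on the real data, a median imputation or dropping the 11 rows would be alternatives to consider.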

Lazar's suggested preprocessing steps:

1. Univariate analysis of the features
2. Feature engineering: modify existing features, or create new ones from them, when they are hard to analyse in their raw form (as identified in the univariate analysis)
3. Outlier detection and imputation of missing values
4. Bivariate and multivariate analysis
5. Data transformation: encode categorical variables as numerical ones, possibly bin, normalize or standardize the continuous variables, and drop redundant or useless features
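The imputation and transformation steps in the list above could be wired together with scikit-learn as sketched below. This is one possible arrangement under our own assumptions, not a decided design: median imputation, standardization, and one-hot encoding are placeholder choices, and the toy frame only contains a handful of the dataset's columns with illustrative values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a few of the dataset's columns (illustrative values)
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
    "TotalCharges": [29.85, 1889.5, np.nan, 1840.75],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
    "Churn": ["No", "No", "Yes", "No"],
})

numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
categorical_cols = ["Contract"]

# Numerical columns: fill missing values, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply different transformations per column group;
# sparse_threshold=0 forces a dense array as output
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numerical_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

X = preprocess.fit_transform(toy.drop(columns="Churn"))
y = (toy["Churn"] == "Yes").astype(int)  # encode the target as 0/1

print(X.shape)
```

Keeping the steps inside a single `ColumnTransformer` means the same fitted transformations can later be applied to a held-out test set, which avoids leaking test statistics into the imputer and scaler.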