<a href="https://colab.research.google.com/github/leah-n/Customer-Churn-Prediction-Model-for-a-Telco-Company/blob/Exploratory_Data_Analysis/Customer_Churn_Model_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

### Project Objective
The first objective of this project is to analyse the data and describe its main characteristics. The second objective is to use this dataset to build a model that predicts which customers are at risk of churn. The exploratory phase therefore aims to determine which factors are most relevant to customer churn.


### Dataset Description 
The Telco Dataset contains information about customers of a Telco company and their subscriptions. This telecom company appears to have provided home, phone and Internet services to 7,043 customers in the third quarter. The dataset shows which customers have left, stayed or signed up for their service.

The characteristics of the customers given include:  

1. customer_ID: Customer ID  
2. gender: Whether the customer is male or female  
3. senior_citizen: Whether the customer is a senior citizen or not (1, 0)  
4. partner: Whether the customer has a partner or not (Yes, No)     
5. dependents: Whether the customer has dependents or not (Yes, No)  
6. tenure: Number of months the customer has stayed with the company  
7. phone_Service: Whether the customer has a phone service or not (Yes, No)  
8. multiple_lines: Whether the customer has multiple lines or not (Yes, No, No phone service)  
9. internet: Customer’s internet service provider (DSL, Fiber optic, No)    
10. security_online: Whether the customer has online security or not (Yes, No, No internet service)  
11. backup_online: Whether the customer has online backup or not (Yes, No, No internet service)  
12. device_protection: Whether the customer has device protection or not (Yes, No, No internet service)  
13. tech_support: Whether the customer has tech support or not (Yes, No, No internet service)  
14. streaming_tv: Whether the customer has TV streaming or not (Yes, No, No internet service)  
15. streaming_movies: Whether the customer has movie streaming or not (Yes, No, No internet service)  
16. contract_type: The contract term of the customer (Month-to-month, One year, Two year)
17. paperless_billing: Whether the customer has paperless billing or not (Yes, No)  
18. payment_mode: The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))  
19. charges_per_month: The amount charged to the customer monthly  
20. charges_total: The total amount charged to the customer  
21. churn: Whether the customer churned or not (Yes or No)  


### 1. Exploratory Data Analysis


### 1.1 Setting up our notebook


In [10]:
# Load required libraries
import numpy as np
import pandas as pd
import scipy as sc
import seaborn as sns
import plotly.graph_objs as go
from matplotlib import pyplot as plt
from scipy.stats import kurtosis, norm, skew
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Render matplotlib figures inline in the notebook
%matplotlib inline

In [15]:
#loading dataset 
df = pd.read_csv("Telco-Customer-Churn.csv")
df.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
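
The raw CSV headers shown above (customerID, SeniorCitizen, ...) differ from the names used in the dataset description (customer_ID, senior_citizen, ...). If the description's names are the ones intended for the rest of the analysis, a rename map along the following lines could bridge the two. This is only a sketch: `RENAME_MAP` and `rename_telco_columns` are hypothetical names, the pairing is inferred from the description list, and gender and tenure are left out because they already match.

```python
import pandas as pd

# Assumed mapping from the raw CSV headers to the names used in the
# dataset description; adjust if your naming differs.
RENAME_MAP = {
    "customerID": "customer_ID",
    "SeniorCitizen": "senior_citizen",
    "Partner": "partner",
    "Dependents": "dependents",
    "PhoneService": "phone_Service",
    "MultipleLines": "multiple_lines",
    "InternetService": "internet",
    "OnlineSecurity": "security_online",
    "OnlineBackup": "backup_online",
    "DeviceProtection": "device_protection",
    "TechSupport": "tech_support",
    "StreamingTV": "streaming_tv",
    "StreamingMovies": "streaming_movies",
    "Contract": "contract_type",
    "PaperlessBilling": "paperless_billing",
    "PaymentMethod": "payment_mode",
    "MonthlyCharges": "charges_per_month",
    "TotalCharges": "charges_total",
    "Churn": "churn",
}


def rename_telco_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the description's column names."""
    # Keys that are absent from `frame` are ignored by default.
    return frame.rename(columns=RENAME_MAP)


# Toy two-column frame, for illustration only (not the real data).
toy = pd.DataFrame({"customerID": ["A-1"], "MonthlyCharges": [29.85]})
renamed = rename_telco_columns(toy)
print(list(renamed.columns))
```

On the real data the call would be `df = rename_telco_columns(df)`, placed right after the CSV is loaded.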



### 1.2 Descriptive Analysis


In [12]:
df.info()  # snapshot of the data: the DataFrame's shape, size and the data types present

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [13]:
df.nunique() #How many unique values are there for each attribute

customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64

In [14]:
df.isnull().sum() #checking for null values

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
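
A zero count from `isnull()` only rules out NaN/None values. Text columns read from a CSV can still hold empty or whitespace-only strings, which `isnull()` does not flag. The sketch below shows one way to check for them; `count_blank_strings` is a hypothetical helper and the toy frame uses made-up column names and values purely for illustration.

```python
import pandas as pd


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only strings in each non-numeric column."""
    text_cols = [
        col for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    ]
    return frame[text_cols].apply(
        lambda col: col.astype(str).str.strip().eq("").sum()
    )


# Toy data, for illustration only (not the Telco dataset).
toy = pd.DataFrame({
    "months": [0, 34, 2],
    "amount": [" ", "1889.5", "108.15"],   # one whitespace-only entry
    "plan": ["basic", "basic", "plus"],
})

null_total = toy.isnull().sum().sum()      # isnull() sees nothing missing
blank_counts = count_blank_strings(toy)    # but one blank string exists
print(null_total)
print(blank_counts)
```

Running `count_blank_strings(df)` on the real data would show whether any column hides blanks of this kind.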

In [8]:
df.describe() #summary statistics for numerical features

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
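
By default `describe()` summarises only numeric columns. TotalCharges is missing from the table above, which suggests pandas read it as an object column rather than a number. A minimal sketch of one way to handle that, assuming the column holds numbers stored as text, is to convert it with `pd.to_numeric`. The toy values below are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data, for illustration only: TotalCharges stored as text, one blank.
toy = pd.DataFrame({
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", " ", "108.15"],
})

# Non-numeric columns are left out of describe() by default.
print(toy.describe().columns.tolist())

# errors="coerce" turns entries that cannot be parsed into NaN
# instead of raising an exception.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# After conversion the column is summarised, and the entries that
# failed to parse show up as missing values.
print(toy.describe().columns.tolist())
print(toy["TotalCharges"].isnull().sum())
```

Any NaN values created by the conversion would then need to be handled explicitly (for example dropped or imputed) before modelling.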


### 1.3 Summary of observations from the descriptive analysis
1. Our target variable is churn. The remaining 20 columns are candidate explanatory variables, although customer_ID is only an identifier.
2. There are 7,043 entries in our dataset.
3. isnull() reports no missing values in our dataset.
4. Among the independent variables, most are stored as objects; describe() treats only the following as numeric:  
charges_per_month (float)  
tenure (int)  
senior_citizen (int)  
Note that charges_total is absent from this list even though it holds amounts, which suggests it was read as an object.
5. Among the independent variables, all are categorical apart from the following:  
charges_total  
charges_per_month  
tenure  
6. The mixture of data types, and of categorical and continuous variables, means that we may be unable to apply the same methods to all variables in our univariate and bivariate analysis.
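
The categorical versus continuous split in the observations above can also be derived from the unique-value counts rather than by eye. The sketch below is one possible approach, not the method used in this notebook: `split_feature_types` and its `max_levels` threshold are assumptions. In the `nunique()` output the largest categorical column has 4 levels and the smallest continuous one (tenure) has 73, so a threshold of around 10 would separate them.

```python
import pandas as pd


def split_feature_types(frame, target, id_col, max_levels=10):
    """Split feature columns into categorical and continuous lists.

    A column counts as categorical if it has at most `max_levels`
    distinct values. The threshold is an assumption, not a rule.
    """
    features = frame.drop(columns=[target, id_col])
    levels = features.nunique()
    categorical = levels[levels <= max_levels].index.tolist()
    continuous = levels[levels > max_levels].index.tolist()
    return categorical, continuous


# Toy data, for illustration only; with five rows a lower threshold is used.
toy = pd.DataFrame({
    "customerID": ["a", "b", "c", "d", "e"],
    "SeniorCitizen": [0, 1, 0, 0, 1],
    "Contract": ["Month-to-month", "One year", "Two year",
                 "One year", "Month-to-month"],
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

categorical, continuous = split_feature_types(
    toy, target="Churn", id_col="customerID", max_levels=3
)
print(categorical)
print(continuous)
```

The two lists can then drive separate univariate treatments, for example count plots for the categorical columns and histograms for the continuous ones.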