## Project Title:
Churn Prediction Project 

## Project Description: 
This project builds a churn prediction model for a telecom company. Imagine that we work at a telecom company that offers phone and internet services, and we have a problem: some of our customers are churning. They stop using our services and move to a different provider. We would like to prevent that, so we develop a system that identifies these customers and offers them an incentive to stay: we target them with promotional messages and give them a discount. We also want to understand why the model thinks a customer will churn, and for that we need to be able to interpret the model’s predictions.
 
We have collected a dataset that records some information about our customers: what type of services they used, how much they paid, and how long they stayed with us. We also know who canceled their contracts and stopped using our services (churned). We use this churn label as the target variable in the machine learning model and predict it from all the other available information.

The project plan is as follows: 
- First, we download the dataset and do some initial preparation: rename columns and change values inside columns to be consistent throughout the entire dataset.
- Then we split the data into train, validation, and test so we can validate our models.
- As part of the initial data analysis, we look at feature importance to identify which features are important in our data.
- We transform categorical variables into numeric variables so we can use them in the model.
- Finally, we train a logistic regression model.
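The plan above can be sketched end to end on a small synthetic table. This is only an illustration under stated assumptions: the data is randomly generated, the 60/20/20 split ratio is an assumption, and `DictVectorizer` is one possible way to one-hot encode the categorical columns (the plan itself does not name an encoder).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the telco data (hypothetical values, illustration only).
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "contract": rng.choice(["month-to-month", "one_year", "two_year"], size=n),
    "tenure": rng.integers(0, 73, size=n),
    "monthlycharges": rng.uniform(20, 120, size=n).round(2),
})
# Synthetic target: short-tenure, month-to-month customers churn more often.
risky = (toy["contract"] == "month-to-month") & (toy["tenure"] < 24)
churn_probability = np.where(risky, 0.6, 0.1)
toy["churn"] = (rng.random(n) < churn_probability).astype(int)

# Assumed 60/20/20 split: hold out 20% for test, then 25% of the rest for validation.
df_full_train, df_test = train_test_split(toy, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

features = ["contract", "tenure", "monthlycharges"]

# One-hot encode the categorical column; numeric columns pass through unchanged.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train[features].to_dict(orient="records"))
X_val = dv.transform(df_val[features].to_dict(orient="records"))

model = LogisticRegression(max_iter=1000)
model.fit(X_train, df_train["churn"])

# Predicted churn probability for each validation customer.
val_pred = model.predict_proba(X_val)[:, 1]
val_accuracy = ((val_pred >= 0.5) == df_val["churn"].to_numpy()).mean()
print(f"validation accuracy: {val_accuracy:.3f}")
```

The encoder is fitted on the training split only and then reused on the validation split, so the validation data does not influence the feature set.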

## Dataset Description
- URL: https://www.kaggle.com/blastchar/telco-customer-churn

- Column description
    - CustomerID: the ID of the customer
    - Gender: male/female
    - SeniorCitizen: whether the customer is a senior citizen (0/1)
    - Partner: whether they live with a partner (yes/no)
    - Dependents: whether they have dependents (yes/no)
    - Tenure: number of months since the start of the contract
    - PhoneService: whether they have phone service (yes/no)
    - MultipleLines: whether they have multiple phone lines (yes/no/no phone service)
    - InternetService: the type of internet service (DSL/fiber optic/no)
    - OnlineSecurity: if online security is enabled (yes/no/no internet)
    - OnlineBackup: if online backup service is enabled (yes/no/no internet)
    - DeviceProtection: if the device protection service is enabled (yes/no/no internet)
    - TechSupport: if the customer has tech support (yes/no/no internet)
    - StreamingTV: if the TV streaming service is enabled (yes/no/no internet)
    - StreamingMovies: if the movie streaming service is enabled (yes/no/no internet)
    - Contract: the type of contract (month-to-month/one year/two year)
    - PaperlessBilling: if the billing is paperless (yes/no)
    - PaymentMethod: payment method (electronic check, mailed check, bank transfer, credit card)
    - MonthlyCharges: the amount charged monthly (numeric)
    - TotalCharges: the total amount charged (numeric)
    - Churn: if the client has canceled the contract (yes/no)

## Environment Configuration
- Installing virtual Env
    - pip install pipenv 

- Installing Packages
    - pipenv install jupyter notebook pandas pyarrow numpy matplotlib seaborn scikit-learn

- Starting Virtual Env
    - pipenv shell 

- Starting Notebook
    - jupyter-notebook 

- Stopping Notebook
    - Ctrl+C

- Deactivating Virtual Env
    - exit

## Importing Libraries

In [1]:
## libraries for loading and preprocessing
import pandas as pd
import numpy as np


## libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns


## library for building a validation framework


## library for feature engineering 


## library for ml algorithms


## library for ml metrics 



## Loading And Data Overview

In [3]:
## load dataset
df = pd.read_csv('Dataset/WA_Fn-UseC_-Telco-Customer-Churn.csv')


## create a copy of the original dataframe
df1 = df.copy()


In [4]:
## view the first five rows 
df1.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [5]:
## last five rows 
df1.tail()



Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.8,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.2,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.6,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.4,306.6,Yes
7042,3186-AJIEK,Male,0,No,No,66,Yes,No,Fiber optic,Yes,...,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),105.65,6844.5,No


In [6]:
## check for the total rows and columns 

print(f' The total number of rows is {df1.shape[0]} and the total number of columns is {df1.shape[1]}')


 The total number of rows is 7043 and the total number of columns is 21


In [7]:
## check for the brief column summary 
df1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [8]:
## check for missing values
for each_col in df1.columns:
    print(f' The total number of missing values in the {each_col} column is {df1[each_col].isnull().sum()}')


 The total number of missing values in the customerID column is 0
 The total number of missing values in the gender column is 0
 The total number of missing values in the SeniorCitizen column is 0
 The total number of missing values in the Partner column is 0
 The total number of missing values in the Dependents column is 0
 The total number of missing values in the tenure column is 0
 The total number of missing values in the PhoneService column is 0
 The total number of missing values in the MultipleLines column is 0
 The total number of missing values in the InternetService column is 0
 The total number of missing values in the OnlineSecurity column is 0
 The total number of missing values in the OnlineBackup column is 0
 The total number of missing values in the DeviceProtection column is 0
 The total number of missing values in the TechSupport column is 0
 The total number of missing values in the StreamingTV column is 0
 The total number of missing values in the StreamingMovies c
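A zero count from `isnull()` does not rule out missing data: a text column can store a missing entry as a blank string, which is what the preprocessing section later finds in `TotalCharges`. The sketch below uses a hypothetical three-row frame (not the real dataset) to show the vectorised form of the loop and a separate check for blank strings.

```python
import pandas as pd

# Hypothetical miniature frame: one TotalCharges entry is a blank string, not NaN.
sample = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Vectorised equivalent of the per-column loop: NaN count for every column.
nan_counts = sample.isnull().sum()

# Whitespace-only strings are not NaN, so they need their own check.
blank_count = sample["TotalCharges"].str.strip().eq("").sum()

print(nan_counts)
print(f"blank strings in TotalCharges: {blank_count}")
```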

In [9]:
## check for duplicated rows
duplicates = df1[df1.duplicated()]
print(duplicates)

## show the value counts of each column
for each_col in df1.columns:
    print(df1[each_col].value_counts())



Empty DataFrame
Columns: [customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn]
Index: []

[0 rows x 21 columns]
customerID
7590-VHVEG    1
3791-LGQCY    1
6008-NAIXK    1
5956-YHHRX    1
5365-LLFYV    1
             ..
9796-MVYXX    1
2637-FKFSY    1
1552-AAGRX    1
4304-TSPVK    1
3186-AJIEK    1
Name: count, Length: 7043, dtype: int64
gender
Male      3555
Female    3488
Name: count, dtype: int64
SeniorCitizen
0    5901
1    1142
Name: count, dtype: int64
Partner
No     3641
Yes    3402
Name: count, dtype: int64
Dependents
No     4933
Yes    2110
Name: count, dtype: int64
tenure
1     613
72    362
2     238
3     200
4     176
     ... 
28     57
39     56
44     51
36     50
0      11
Name: count, Length: 73, dtype: int64
PhoneService
Yes    6361
No      682

In [10]:
## check for uniqueness in each column
for each_col in df1.columns:
    print(f'The uniqueness in {each_col} is {df1[each_col].unique()}')

The uniqueness in customerID is ['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD'
 '3186-AJIEK']
The uniqueness in gender is ['Female' 'Male']
The uniqueness in SeniorCitizen is [0 1]
The uniqueness in Partner is ['Yes' 'No']
The uniqueness in Dependents is ['No' 'Yes']
The uniqueness in tenure is [ 1 34  2 45  8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
  5 46 11 70 63 43 15 60 18 66  9  3 31 50 64 56  7 42 35 48 29 65 38 68
 32 55 37 36 41  6  4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26  0
 39]
The uniqueness in PhoneService is ['No' 'Yes']
The uniqueness in MultipleLines is ['No phone service' 'No' 'Yes']
The uniqueness in InternetService is ['DSL' 'Fiber optic' 'No']
The uniqueness in OnlineSecurity is ['No' 'Yes' 'No internet service']
The uniqueness in OnlineBackup is ['Yes' 'No' 'No internet service']
The uniqueness in DeviceProtection is ['No' 'Yes' 'No internet service']
The uniqueness in TechSupport is ['No' 'Yes' 'No internet service'

In [11]:
## count the unique values in each column
for each_col in df1.columns:
    print(f'The number of uniqueness in {each_col} is {df1[each_col].nunique()}')


The number of uniqueness in customerID is 7043
The number of uniqueness in gender is 2
The number of uniqueness in SeniorCitizen is 2
The number of uniqueness in Partner is 2
The number of uniqueness in Dependents is 2
The number of uniqueness in tenure is 73
The number of uniqueness in PhoneService is 2
The number of uniqueness in MultipleLines is 3
The number of uniqueness in InternetService is 3
The number of uniqueness in OnlineSecurity is 3
The number of uniqueness in OnlineBackup is 3
The number of uniqueness in DeviceProtection is 3
The number of uniqueness in TechSupport is 3
The number of uniqueness in StreamingTV is 3
The number of uniqueness in StreamingMovies is 3
The number of uniqueness in Contract is 3
The number of uniqueness in PaperlessBilling is 2
The number of uniqueness in PaymentMethod is 4
The number of uniqueness in MonthlyCharges is 1585
The number of uniqueness in TotalCharges is 6531
The number of uniqueness in Churn is 2


## Data Preprocessing 
- Normalize the column names
- Replace empty strings with NaN and fill the missing values
- Delete the customer ID column
- Change the data types of the columns
- Apply any other processing that is necessary
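The steps listed above can be sketched on a tiny frame. The three rows are hypothetical, and `pd.to_numeric(..., errors="coerce")` is used here as a compact alternative that turns blank strings into NaN and converts the column to float in a single call; it is an illustration, not necessarily the approach this notebook follows.

```python
import pandas as pd

# Hypothetical three-row frame mimicking the raw columns used in this section.
raw = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"],
    "TotalCharges": ["100.0", " ", "300.0"],
    "Churn": ["No", "Yes", "No"],
})

clean = raw.copy()
clean.columns = clean.columns.str.lower()  # normalize the column names

# errors="coerce" turns unparseable entries (the blank string) into NaN.
clean["totalcharges"] = pd.to_numeric(clean["totalcharges"], errors="coerce")
clean["totalcharges"] = clean["totalcharges"].fillna(clean["totalcharges"].mean())

clean = clean.drop(columns=["customerid"])  # drop the identifier column
clean["churn"] = (clean["churn"] == "Yes").astype(int)  # Yes -> 1, No -> 0

print(clean)
```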

In [12]:
## convert the column names to lower case
df1.columns = df1.columns.str.lower()


In [13]:
## preview the columns
df1.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')


In [14]:
## the totalcharges column stores missing values as blank strings;
## replace them with NaN so they can later be filled with the mean
df1['totalcharges'] = df1['totalcharges'].replace(' ', np.nan)
df1['totalcharges'].isnull().sum()  ## count the values that became NaN

np.int64(11)

In [None]:
# changing the dtype to be able to apply mean()
df1['totalcharges'] = df1['totalcharges'].astype('float')


In [None]:
# now replacing NaN with mean
mean_totalcharge = df1['totalcharges'].mean()
mean_totalcharge

In [None]:
## fill the NaN values with the mean computed in the previous cell
df1['totalcharges'] = df1['totalcharges'].fillna(mean_totalcharge)
df1['totalcharges']



In [None]:
## delete the customer id column 
df1.drop(columns=['customerid'], inplace=True)


In [None]:
df1.columns

In [None]:
## display the first five rows using the transpose
df1.head().T

In [None]:
## change the dtype of the 'object' columns to category
for columntype_object in df1.select_dtypes(include='object').columns:
    df1[columntype_object] = df1[columntype_object].astype('category')


In [None]:
df1.dtypes

In [None]:
## convert the target column to integers: Yes -> 1, No -> 0
## (run this cell only once; a second run would map every value to 0)
df1['churn'] = (df1['churn'] == 'Yes').astype(int)

In [None]:
## lets preview the churn column 
df1['churn']

## Exploratory Data Analysis
- Target Variable Analysis 
- Outlier analysis 
- any other analysis which is important to this work.
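As a rough sketch of the first two items, the snippet below computes the class balance of a 0/1 target and flags outliers with the 1.5 × IQR rule. The values are hypothetical, and the IQR rule is one common choice assumed here for illustration; the notebook does not specify an outlier method.

```python
import pandas as pd

# Hypothetical values for illustration; the real analysis would use df1.
demo = pd.DataFrame({
    "churn": [0, 0, 0, 1, 0, 1, 0, 0],
    "monthlycharges": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 500.0],
})

# Target analysis: share of each class; the mean of a 0/1 column is the churn rate.
class_share = demo["churn"].value_counts(normalize=True)
churn_rate = demo["churn"].mean()

# Outlier analysis with the 1.5 * IQR rule.
q1, q3 = demo["monthlycharges"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = demo[(demo["monthlycharges"] < lower) | (demo["monthlycharges"] > upper)]

print(class_share)
print(f"churn rate: {churn_rate:.2f}")
print(outliers)
```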

In [None]:
## display a countplot of the churn column
plt.figure(figsize=(8, 4))
sns.countplot(x='churn', data=df1)
plt.title('Distribution of churn')
plt.xlabel('churn(Yes=1 and No=0)')
plt.ylabel('Counts')
plt.show()


In [None]:
df1
