<a href="https://colab.research.google.com/github/mazed9/predicting_customer_churn/blob/main/Predicting_customer_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this project we will build a model to predict customer churn.

Customer churn, also known as customer attrition or customer turnover, refers to the rate at which customers stop doing business with a company or stop using its products or services. It is a key metric for businesses to track, as it can have a significant impact on their revenue and profitability.

### Our first step will be importing required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn


### Now we'll read the data.

In [None]:
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Customer churn/churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


Let's see all the column names.

In [None]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

Next, let's check the data types of the columns.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


### Dropping useless columns.

Now we will identify columns that won't be useful in predicting churn and drop them.
Here, "customerID" is only a unique identifier, so it carries no predictive information.

In [None]:
df.drop(["customerID"],axis=1,inplace=True)
df

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,No,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,Yes,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,No,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


### Checking for missing values

In [None]:
df.isnull().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

No missing values are reported.

### Handling categorical values

Let's first see the data type and the number of unique values of each column in the dataframe.

In [None]:
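# Collect each column's name, number of unique values, and dtype.
# The variable is not called `dict`, because that would shadow the built-in.
summary= {"column_names":[],"nunique":[],"dtypes":[]}
for col in df:
  summary["column_names"].append(col)
  summary["nunique"].append(df[col].nunique())
  summary["dtypes"].append(df[col].dtypes)

catvalue=pd.DataFrame(summary)
catvalue

Unnamed: 0,column_names,nunique,dtypes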
0,gender,2,object
1,SeniorCitizen,2,int64
2,Partner,2,object
3,Dependents,2,object
4,tenure,73,int64
5,PhoneService,2,object
6,MultipleLines,3,object
7,InternetService,3,object
8,OnlineSecurity,3,object
9,OnlineBackup,3,object
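The same kind of summary can also be built without an explicit loop. The sketch below uses a small made-up frame (the column names mirror the dataset, the rows are invented) and relies on `nunique()` and `dtypes` both returning a Series indexed by column name.

```python
import pandas as pd

# Small stand-in frame; the column names mirror the dataset, the rows are made up.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# nunique() and dtypes are both indexed by column name,
# so they line up row by row in a single summary table.
summary = pd.DataFrame({"nunique": toy.nunique(), "dtype": toy.dtypes})
print(summary)
```

Here the column names become the index of the summary table instead of a separate "column_names" column.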


In [None]:
df["PaymentMethod"].value_counts()

Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64

We can see that almost all the columns hold categorical values, and that they're nominal. So we'll apply one-hot encoding. But before doing that, let's check for missing values once more and convert "TotalCharges" to a numeric type.

In [None]:
df.isna().sum()

gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [None]:
# Convert "TotalCharges" to numbers; errors='coerce' turns unparseable entries into NaN.
df["TotalCharges"]=pd.to_numeric(df["TotalCharges"],errors='coerce')
df["TotalCharges"].dtypes

dtype('float64')

Let's see if there are any NaN values now.

In [None]:
df.isnull().sum().sum()

11

In [None]:
df['TotalCharges'].isnull().sum()

11

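Pay attention, something interesting just happened. There were no NaN values earlier, but converting "TotalCharges" to a numeric type generated 11 of them. This is the effect of errors='coerce': any entry that cannot be parsed as a number (most likely blank strings here, which isnull() does not count as missing) is replaced with NaN. Let's delete the rows containing NaN values.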
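To illustrate how `errors='coerce'` can produce NaN values that `isnull()` did not report beforehand, here is a minimal sketch on made-up values. The blank string is an assumption about what the non-numeric entries look like; it is not taken from the dataset.

```python
import pandas as pd

# Made-up values; the blank string stands in for a non-numeric entry.
raw = pd.Series(["29.85", " ", "1889.5"])

# A blank string is an ordinary string, so it is not counted as missing.
missing_before = raw.isnull().sum()

# errors="coerce" replaces anything that cannot be parsed with NaN.
converted = pd.to_numeric(raw, errors="coerce")
missing_after = converted.isnull().sum()

# The original entries that failed to parse can be inspected like this.
offending = raw[converted.isnull()]
print(missing_before, missing_after, offending.tolist())
```

Inspecting `offending` on the real column before dropping rows would show exactly which entries were coerced.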

In [None]:
df=df.dropna()

In [None]:
df.isnull().sum().sum()

0

In [None]:
df['TotalCharges'].isnull().sum()

0

### Apply one-hot encoding to all the independent variables

In [None]:
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()

In [None]:
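for column in df.columns:
  # Skip numeric columns and the target; only categorical columns are encoded.
  if pd.api.types.is_numeric_dtype(df[column]) or column=="Churn":
    continue
  encoded=encoder.fit_transform(df[[column]])
  encoded_df=pd.DataFrame(encoded.toarray(),columns=encoder.get_feature_names_out())
  # Note: encoded_df has a fresh 0..n-1 index while df kept its labels after dropna(),
  # so concat aligns rows by label and introduces NaN where the labels don't match.
  df=pd.concat([df,encoded_df],axis=1)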



In [None]:
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,...,Contract_Two year,Contract_nan,PaperlessBilling_No,PaperlessBilling_Yes,PaperlessBilling_nan,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,PaymentMethod_nan
0,Female,0.0,Yes,No,1.0,No,No phone service,DSL,No,Yes,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
1,Male,0.0,No,No,34.0,Yes,No,DSL,Yes,No,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,Male,0.0,No,No,2.0,Yes,No,DSL,Yes,Yes,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,Male,0.0,No,No,45.0,No,No phone service,DSL,Yes,No,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,Female,0.0,No,No,2.0,Yes,No,Fiber optic,No,No,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
df.isnull().sum().sum()

242

### Apply label encoding to the dependent variable

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
df["Churn"]= LabelEncoder().fit_transform(df["Churn"])
df["Churn"]

0       0
1       0
2       1
3       0
4       1
       ..
3826    2
4380    2
5218    2
6670    2
6754    2
Name: Churn, Length: 7043, dtype: int64
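The output above contains a label 2 in addition to 0 and 1, which suggests that missing values in "Churn" were encoded as a class of their own. As a hedged alternative, an explicit mapping keeps missing entries as NaN so they can be handled deliberately. The labels below are made up for illustration.

```python
import pandas as pd

# Made-up labels with one missing value.
churn = pd.Series(["No", "Yes", None, "No"])

# map() leaves anything outside the dictionary (including missing values)
# as NaN instead of assigning it a class of its own.
encoded = churn.map({"No": 0, "Yes": 1})
print(encoded.isnull().sum())
```

The result is a float column until the NaN rows are dropped, after which it can be cast to an integer type.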

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 6754
Data columns (total 75 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   gender                                   7032 non-null   object 
 1   SeniorCitizen                            7032 non-null   float64
 2   Partner                                  7032 non-null   object 
 3   Dependents                               7032 non-null   object 
 4   tenure                                   7032 non-null   float64
 5   PhoneService                             7032 non-null   object 
 6   MultipleLines                            7032 non-null   object 
 7   InternetService                          7032 non-null   object 
 8   OnlineSecurity                           7032 non-null   object 
 9   OnlineBackup                             7032 non-null   object 
 10  DeviceProtection                         7032 no

Now it's time to drop the columns we encoded.

In [None]:
for column in df.columns:
  # Drop every non-numeric column except the target.
  if not pd.api.types.is_numeric_dtype(df[column]) and column !="Churn":
    df.drop(column,axis=1,inplace=True)




In [None]:
df.shape

(7043, 60)

In [None]:
df.isnull().sum().sum()

66

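Again, some NaN values were generated. This happens because each encoded DataFrame gets a fresh 0-to-n-1 index, while df kept its original row labels after dropna(); pd.concat aligns rows by label, so labels present on only one side are filled with NaN. Let's drop them.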

In [None]:
df=df.dropna()

In [None]:
df.isnull().sum().sum()

0

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7021 entries, 0 to 7031
Data columns (total 60 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   SeniorCitizen                            7021 non-null   float64
 1   tenure                                   7021 non-null   float64
 2   MonthlyCharges                           7021 non-null   float64
 3   TotalCharges                             7021 non-null   float64
 4   Churn                                    7021 non-null   int64  
 5   gender_Female                            7021 non-null   float64
 6   gender_Male                              7021 non-null   float64
 7   Partner_No                               7021 non-null   float64
 8   Partner_Yes                              7021 non-null   float64
 9   Partner_nan                              7021 non-null   float64
 10  Dependents_No                            7021 no

### Let's separate independent and dependent variables.

In [None]:
X = df.drop('Churn', axis=1)
Y = df['Churn']

### Let's scale the data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
X=StandardScaler().fit_transform(X)

### Split the data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:

xtrain, xtest, ytrain, ytest = train_test_split(X,Y, test_size=0.2, random_state=42)
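In the cells above, the scaler is fit on all rows before the split, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data, and all variable names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels, only to make the sketch runnable.
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse those statistics
```

With this ordering the evaluation on the test split reflects data the preprocessing has never seen.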

### Let's create and train the model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [None]:
model=LogisticRegression()
model.fit(xtrain, ytrain)

### Now it's time to predict

In [None]:
predictions=model.predict(xtest)

In [None]:
predictions

array([1, 0, 0, ..., 0, 0, 0])

### Evaluation

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(ytest, predictions))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87      1061
           1       0.61      0.48      0.54       344

    accuracy                           0.80      1405
   macro avg       0.73      0.69      0.70      1405
weighted avg       0.79      0.80      0.79      1405
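In the report above, class 1 (churned customers) has a recall of 0.48, meaning roughly half of the actual churners are missed. To show where precision and recall come from, here is a small sketch that derives them from a confusion matrix. The labels and predictions are made up and are not the notebook's results.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions (1 = churned), not the notebook's results.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted churners who actually churned
recall = tp / (tp + fn)     # share of actual churners the model caught

# The manual values agree with scikit-learn's own helpers.
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```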

