# Churn Analysis
### Description
Customer churn refers to the rate at which customers stop doing business with a company over a given period. This data typically includes information such as customer demographics, purchase history, service usage, and reasons for leaving.

### Objective of Analysis
The primary objective of analyzing churn data is to understand why customers are leaving and to identify patterns or factors that contribute to churn. This analysis helps in developing strategies to retain customers and reduce churn rates.


# Data Description
|           Column               |                           Description                          |
|--------------------------------|------------------------------------------------------------------|
| CustomerID                     | Unique customer ID                                               |
| Churn                          | Churn Flag                                                       |
| Tenure                         | Tenure of customer in the organization                           |
| PreferredLoginDevice           | Preferred login device of customer                               |
| CityTier                       | City tier                                                        |
| WarehouseToHome                | Distance between warehouse and home of customer                 |
| PreferredPaymentMode           | Preferred payment method of customer                             |
| Gender                         | Gender of customer                                               |
| HourSpendOnApp                 | Number of hours spent on mobile application or website          |
| NumberOfDeviceRegistered       | Total number of devices registered to the customer              |
| PreferedOrderCat               | Preferred order category of customer in last month              |
| SatisfactionScore              | Satisfaction score of customer on service                       |
| MaritalStatus                  | Marital status of customer                                      |
| NumberOfAddress                | Total number of addresses added for the customer                |
| Complain                       | Any complaint raised in last month                              |
| OrderAmountHikeFromlastYear    | Percentage increase in order amount from last year              |
| CouponUsed                     | Total number of coupons used in last month                      |
| OrderCount                     | Total number of orders placed in last month                     |
| DaySinceLastOrder              | Days since last order by customer                               |
| CashbackAmount                 | Average cashback received in last month                         |

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [2]:
data = pd.read_csv('churn.csv')

# Data Understanding

In [3]:
data.shape

(5630, 20)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   int64  
 2   Tenure                       5366 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   int64  
 5   WarehouseToHome              5379 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5375 non-null   float64
 9   NumberOfDeviceRegistered     5630 non-null   int64  
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   int64  
 12  MaritalStatus                5630 non-null   object 
 13  NumberOfAddress   

In [5]:
# convert the column names to lowercase
data.columns=data.columns.str.lower()

In [6]:
# presenting the columns after the edit
data.columns

Index(['customerid', 'churn', 'tenure', 'preferredlogindevice', 'citytier',
       'warehousetohome', 'preferredpaymentmode', 'gender', 'hourspendonapp',
       'numberofdeviceregistered', 'preferedordercat', 'satisfactionscore',
       'maritalstatus', 'numberofaddress', 'complain',
       'orderamounthikefromlastyear', 'couponused', 'ordercount',
       'daysincelastorder', 'cashbackamount'],
      dtype='object')

In [7]:
for i in data.columns:
    if data[i].dtypes == 'object':
        print(i)
        print()
        print('the values are:') 
        print(data[i].value_counts())
        print()
        print()

preferredlogindevice

the values are:
preferredlogindevice
Mobile Phone    2765
Computer        1634
Phone           1231
Name: count, dtype: int64


preferredpaymentmode

the values are:
preferredpaymentmode
Debit Card          2314
Credit Card         1501
E wallet             614
UPI                  414
COD                  365
CC                   273
Cash on Delivery     149
Name: count, dtype: int64


gender

the values are:
gender
Male      3384
Female    2246
Name: count, dtype: int64


preferedordercat

the values are:
preferedordercat
Laptop & Accessory    2050
Mobile Phone          1271
Fashion                826
Mobile                 809
Grocery                410
Others                 264
Name: count, dtype: int64


maritalstatus

the values are:
maritalstatus
Married     2986
Single      1796
Divorced     848
Name: count, dtype: int64
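The counts above suggest that a few labels may be synonyms (`Phone` / `Mobile Phone`, `CC` / `Credit Card`, `COD` / `Cash on Delivery`, `Mobile` / `Mobile Phone`). The source does not confirm this, so the mapping below is an assumption. A minimal sketch on a toy frame (`toy` and `label_map` are hypothetical names) of how such labels could be merged before further analysis:

```python
import pandas as pd

# Toy stand-in for the notebook's `data`; only the label values come from the output above.
toy = pd.DataFrame({
    "preferredlogindevice": ["Mobile Phone", "Phone", "Computer"],
    "preferredpaymentmode": ["CC", "Credit Card", "COD"],
    "preferedordercat": ["Mobile", "Mobile Phone", "Fashion"],
})

# ASSUMPTION: these short labels are synonyms of the longer ones.
label_map = {
    "preferredlogindevice": {"Phone": "Mobile Phone"},
    "preferredpaymentmode": {"CC": "Credit Card", "COD": "Cash on Delivery"},
    "preferedordercat": {"Mobile": "Mobile Phone"},
}

# Replace each assumed synonym with its canonical label, column by column.
for column, mapping in label_map.items():
    toy[column] = toy[column].replace(mapping)

print(toy["preferredpaymentmode"].value_counts())
```

If the labels really are distinct categories in the business context, this step should be skipped.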




In [8]:
# convert data types
data['customerid'] = data['customerid'].astype('object')
data['churn'] = data['churn'].astype('object')
data['citytier'] = data['citytier'].astype('object')
data['complain'] = data['complain'].astype('object')

In [9]:
data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tenure,5366.0,10.189899,8.557241,0.0,2.0,9.0,16.0,61.0
warehousetohome,5379.0,15.639896,8.531475,5.0,9.0,14.0,20.0,127.0
hourspendonapp,5375.0,2.931535,0.721926,0.0,2.0,3.0,3.0,5.0
numberofdeviceregistered,5630.0,3.688988,1.023999,1.0,3.0,4.0,4.0,6.0
satisfactionscore,5630.0,3.066785,1.380194,1.0,2.0,3.0,4.0,5.0
numberofaddress,5630.0,4.214032,2.583586,1.0,2.0,3.0,6.0,22.0
orderamounthikefromlastyear,5365.0,15.707922,3.675485,11.0,13.0,15.0,18.0,26.0
couponused,5374.0,1.751023,1.894621,0.0,1.0,1.0,2.0,16.0
ordercount,5372.0,3.008004,2.93968,1.0,1.0,2.0,3.0,16.0
daysincelastorder,5323.0,4.543491,3.654433,0.0,2.0,3.0,7.0,46.0


# Data Cleaning

In [10]:
# showing nulls in the data
data.isna().sum()

customerid                       0
churn                            0
tenure                         264
preferredlogindevice             0
citytier                         0
warehousetohome                251
preferredpaymentmode             0
gender                           0
hourspendonapp                 255
numberofdeviceregistered         0
preferedordercat                 0
satisfactionscore                0
maritalstatus                    0
numberofaddress                  0
complain                         0
orderamounthikefromlastyear    265
couponused                     256
ordercount                     258
daysincelastorder              307
cashbackamount                   0
dtype: int64

In [11]:

px.imshow(data.corr(numeric_only=True), width=800, height=500, color_continuous_scale=px.colors.sequential.Inferno, template='plotly_dark')

We fill the nulls in some columns with the column median, because these columns are not strongly correlated with any other column that could serve as a basis for imputation.

In [12]:
med_cols = ['tenure','warehousetohome','hourspendonapp','orderamounthikefromlastyear','daysincelastorder']

for col in med_cols:
    data.fillna({col:data[col].median()},inplace = True)

For the order count and coupon used columns, we fill each one *using the other as a grouping factor*, since the two are strongly correlated with each other.

So instead of filling the coupon used column with its overall median (1.0), we fill each order count class with the median coupon count of that class.

Likewise, instead of filling the order count column with its overall median (2.0), we fill each coupon used class with the median order count of that class.

In [13]:
factor = data.groupby('ordercount')['couponused'].median()
data.couponused = data.couponused.fillna(data.ordercount.map(factor))

In [14]:
factor = data.groupby('couponused')['ordercount'].median()
data.ordercount = data.ordercount.fillna(data.couponused.map(factor))
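To make the group-wise fill concrete, here is a small sketch on made-up values (the `toy` frame and the final fallback step are assumptions, not part of the notebook). It also shows one edge case: a row whose grouping key is itself missing cannot be filled from a group median, so it stays null unless a fallback is applied.

```python
import numpy as np
import pandas as pd

# Made-up data: the last row is missing both the key and the value.
toy = pd.DataFrame({
    "ordercount": [1, 1, 1, 2, 2, 2, np.nan],
    "couponused": [0, 1, np.nan, 2, 4, np.nan, np.nan],
})

# Median of couponused within each ordercount class (NaN keys are dropped by groupby).
group_median = toy.groupby("ordercount")["couponused"].median()

# Fill each missing couponused with the median of its ordercount class.
toy["couponused"] = toy["couponused"].fillna(toy["ordercount"].map(group_median))

# Rows with a missing key map to NaN, so they are still unfilled at this point.
still_missing = int(toy["couponused"].isna().sum())

# ASSUMED fallback: use the overall median for whatever is left.
toy["couponused"] = toy["couponused"].fillna(toy["couponused"].median())

print(still_missing)
print(toy)
```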

***checking for duplicate values***

In [15]:
data.duplicated().sum()

0

# Outlier Treatment

We will now treat outliers. For each numeric column we define a lower bound at 1.5 times the interquartile range (IQR) below the first quartile and an upper bound at 1.5 times the IQR above the third quartile, and we cap any values outside these bounds at the bounds.

In [16]:
numeric_data = data.select_dtypes(include=np.number)
cat_data = data.select_dtypes(exclude=np.number)
fig = go.Figure()
for col in numeric_data.columns:
    fig.add_trace(go.Box(y=numeric_data[col],name=col))
fig.update_layout(template='plotly_dark')
fig.show()

In [17]:

def remove_outlier(col):
    '''
    Return the lower and upper IQR bounds (rounded to whole numbers)
    for a given numerical column. Despite the name, nothing is removed:
    the loop below caps values outside these bounds at the bounds.
    '''
    Q1,Q3=np.percentile(col,[25,75])
    IQR=Q3-Q1
    lr= Q1-(1.5 * IQR)
    ur= Q3+(1.5 * IQR)
    lr =round(lr,0)
    ur =round(ur,0)
    return lr, ur


for column in data.columns:
    if data[column].dtype != 'object': 
        lr,ur=remove_outlier(data[column])
        data[column]=np.where(data[column]>ur,ur,data[column])
        data[column]=np.where(data[column]<lr,lr,data[column])

numeric_data = data.select_dtypes(include=np.number)
fig = go.Figure()
for col in numeric_data.columns:
    fig.add_trace(go.Box(y=numeric_data[col],name=col))
fig.update_layout(template='plotly_dark')
fig.show()
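The capping step above can also be written with `Series.clip`. The sketch below is an illustrative alternative on made-up values, not the notebook's own code: `iqr_bounds` is a hypothetical helper, and unlike the cell above it does not round the bounds.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values with one obvious outlier.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="value")

lower, upper = iqr_bounds(toy)

# clip() caps values below `lower` and above `upper` in one call.
capped = toy.clip(lower=lower, upper=upper)

print(lower, upper)
print(capped.tolist())
```

Here the quartiles are 3 and 7, so the bounds are -3 and 13 and only the value 100 is capped.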

In [18]:
data.shape

(5630, 20)

In [19]:
data.to_csv('cleaned_data.csv',index = False)

# Exploratory Data Analysis

### 1. Univariate Analysis

In [20]:
cat_data.describe().transpose()

Unnamed: 0,count,unique,top,freq
customerid,5630,5630,50001,1
churn,5630,2,0,4682
preferredlogindevice,5630,3,Mobile Phone,2765
citytier,5630,3,1,3666
preferredpaymentmode,5630,7,Debit Card,2314
gender,5630,2,Male,3384
preferedordercat,5630,6,Laptop & Accessory,2050
maritalstatus,5630,3,Married,2986
complain,5630,2,0,4026


In [21]:
numeric_data.describe().transpose().style.background_gradient(cmap='OrRd')

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
tenure,5630.0,10.118117,8.291334,0.0,3.0,9.0,15.0,33.0
warehousetohome,5630.0,15.534636,8.088873,5.0,9.0,14.0,20.0,36.0
hourspendonapp,5630.0,2.934103,0.704344,0.0,2.0,3.0,3.0,4.0
numberofdeviceregistered,5630.0,3.730728,0.929547,2.0,3.0,4.0,4.0,6.0
satisfactionscore,5630.0,3.066785,1.380194,1.0,2.0,3.0,4.0,5.0
numberofaddress,5630.0,4.207993,2.555111,1.0,2.0,3.0,6.0,12.0
orderamounthikefromlastyear,5630.0,15.6746,3.591058,11.0,13.0,15.0,18.0,26.0
couponused,5630.0,1.549734,1.220682,0.0,1.0,1.0,2.0,4.0
ordercount,5630.0,2.563588,1.751315,1.0,1.0,2.0,3.0,6.0
daysincelastorder,5630.0,4.423623,3.423417,0.0,2.0,3.0,7.0,14.0


#### numeric values distribution

In [22]:

fig = make_subplots(cols=3, rows=4, subplot_titles=numeric_data.columns)

for i, col in enumerate(numeric_data.columns):
    # Bar trace
    bar_trace = go.Bar(
        x=numeric_data[col].value_counts().sort_index().index,
        y=numeric_data[col].value_counts().sort_index().values,
        name=col
    )
    
    # Line trace
    line_trace = go.Scatter(
        x=numeric_data[col].value_counts().sort_index().index,
        y=numeric_data[col].value_counts().sort_index().values,
        mode='lines',
        name=f'{col} Line'
    )
    
    # Add traces to the figure
    fig.add_trace(bar_trace, row=i//3+1, col=i%3+1)
    fig.add_trace(line_trace, row=i//3+1, col=i%3+1)

fig.update_layout(height=800, width=1400, template='plotly_dark',showlegend=False)
fig.show()


#### Categorical values distribution

In [23]:
cat_data.drop('customerid', axis=1, inplace=True)

In [24]:
# cat_data is the DataFrame holding the categorical columns
fig = make_subplots(cols=4, rows=2, subplot_titles=cat_data.columns)


for i, col in enumerate(cat_data.columns):
    # Bar trace
    bar_trace = go.Bar(
        x=cat_data[col].value_counts().sort_index().index,
        y=cat_data[col].value_counts().sort_index().values,
        name=col
    )
    
    
    # Add traces to the figure
    fig.add_trace(bar_trace, row=i//4+1, col=i%4+1)

fig.update_layout(height=800, width=1400, template='plotly_dark')
fig.show()

# Bivariate Analysis EDA

In [25]:
import pandas as pd

d = {}

# Running loop for calculating and storing the values in the relevant dataframes
for i in data.columns:
    churn_sum = data.groupby(i).churn.sum().rename('Customers_churned')
    total_customers = data[i].value_counts().rename('Total_Customers')
    perc_of_total_cust = round(data.groupby(i).churn.sum() * 100 / data[i].value_counts(), 2).rename('perc_of_total_cust')
    
    d[i] = pd.concat([churn_sum, total_customers, perc_of_total_cust], axis=1)
    d[i].reset_index(level=0, inplace=True)
    d[i] = d[i].rename(columns={'index': i})

def analysis_chart_plotly(variable):
    fig = go.Figure()

    # Adding the lines for the left y-axis
    fig.add_trace(go.Scatter(x=d[variable][variable], y=d[variable]['Customers_churned'],
                             mode='lines', name='Customers churned', line=dict(color='lightskyblue')))
    fig.add_trace(go.Scatter(x=d[variable][variable], y=d[variable]['Total_Customers'],
                             mode='lines', name='Total Customers', line=dict(color='dodgerblue')))

    # Adding the line for the right y-axis
    fig.add_trace(go.Scatter(x=d[variable][variable], y=d[variable]['perc_of_total_cust'],
                             mode='lines', name='Churn as Percent of total', line=dict(color='yellowgreen'), yaxis='y2'))
    
    # Adding the average customer churn line
    # overall churn rate: 948 churned out of 5630 customers = 16.84% (20.25% is churned / retained)
    y = [16.84] * len(d[variable][variable])
    fig.add_trace(go.Scatter(x=d[variable][variable], y=y,
                             mode='lines', name='Average customer Churn', line=dict(color='orangered', dash='dash'), yaxis='y2'))

    # Updating the layout
    fig.update_layout(
        title='Customers Churn analysed by ' + variable,
        xaxis=dict(title=variable, tickangle=45),
        yaxis=dict(title='No. of customers'),
        yaxis2=dict(title='Percentage of customers churned', overlaying='y', side='right'),
        legend=dict(x=1.1, y=1),
        template='plotly_white'
    )

    fig.show()
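The per-column summary built in the loop above, and the reference line for the average churn, can both be derived from the data instead of being typed in. A minimal sketch on made-up values (the `toy` frame is an assumption; the notebook stores `churn` as `object`, so a cast such as `.astype(int)` would be needed there first):

```python
import pandas as pd

# Made-up data: one categorical column plus a 0/1 churn flag.
toy = pd.DataFrame({
    "complain": [0, 0, 0, 1, 1, 0, 1, 0],
    "churn":    [0, 0, 1, 1, 1, 0, 0, 0],
})

# Overall churn rate in percent, usable as the dashed reference line.
overall_rate = round(toy["churn"].mean() * 100, 2)

# Churned customers, total customers and churn percentage per category.
summary = (
    toy.groupby("complain")["churn"]
       .agg(Customers_churned="sum", Total_Customers="count")
       .reset_index()
)
summary["perc_of_total_cust"] = (
    summary["Customers_churned"] * 100 / summary["Total_Customers"]
).round(2)

print(overall_rate)
print(summary)
```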


# we will use it in deployment

In [26]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

d = {}

# Running loop for calculating and storing the values in the relevant dataframes
for i in data.columns:
    churn_sum = data.groupby(i).churn.sum().rename('Customers_churned')
    total_customers = data[i].value_counts().rename('Total_Customers')
    perc_of_total_cust = round(data.groupby(i).churn.sum() * 100 / data[i].value_counts(), 2).rename('perc_of_total_cust')
    
    d[i] = pd.concat([churn_sum, total_customers, perc_of_total_cust], axis=1)
    d[i].reset_index(level=0, inplace=True)
    d[i] = d[i].rename(columns={'index': i})

def analysis_chart_plotly(variable):
    if variable not in d:
        raise ValueError(f"Variable '{variable}' not found in the data dictionary.")
    
    data = d[variable]
    
    if data.empty:
        raise ValueError(f"No data available for variable '{variable}'.")

    traces = []

    # Adding the lines for the left y-axis
    traces.append(go.Scatter(x=data[variable], y=data['Customers_churned'],
                             mode='lines', name='Customers churned', line=dict(color='lightskyblue')))
    traces.append(go.Scatter(x=data[variable], y=data['Total_Customers'],
                             mode='lines', name='Total Customers', line=dict(color='dodgerblue')))

    # Adding the line for the right y-axis
    traces.append(go.Scatter(x=data[variable], y=data['perc_of_total_cust'],
                             mode='lines', name='Churn as Percent of total', line=dict(color='yellowgreen'), yaxis='y2'))
    
    # Adding the average customer churn line
    # overall churn rate: 948 churned out of 5630 customers = 16.84% (20.25% is churned / retained)
    y = [16.84] * len(data[variable])
    traces.append(go.Scatter(x=data[variable], y=y,
                             mode='lines', name='Average customer Churn', line=dict(color='orangered', dash='dash'), yaxis='y2'))

    return traces

# Create subplots
fig = make_subplots(rows=4, cols=2, subplot_titles=cat_data.columns)

for i, col in enumerate(cat_data.columns):
    traces = analysis_chart_plotly(col)
    for trace in traces:
        fig.add_trace(trace, row=i//2+1, col=i%2+1)

fig.update_layout(height=900, width=1400, template='plotly_dark',showlegend=False)
fig.show()


In [27]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

# Assuming data is your DataFrame and d is the dictionary to store the dataframes
d = {}

# Running loop for calculating and storing the values in the relevant dataframes
for i in data.columns:
    churn_sum = data.groupby(i).churn.sum().rename('Customers_churned')
    total_customers = data[i].value_counts().rename('Total_Customers')
    perc_of_total_cust = round(data.groupby(i).churn.sum() * 100 / data[i].value_counts(), 2).rename('perc_of_total_cust')
    
    d[i] = pd.concat([churn_sum, total_customers, perc_of_total_cust], axis=1)
    d[i].reset_index(level=0, inplace=True)
    d[i] = d[i].rename(columns={'index': i})

def analysis_chart_plotly(variable):
    if variable not in d:
        raise ValueError(f"Variable '{variable}' not found in the data dictionary.")
    
    data = d[variable]
    
    if data.empty:
        raise ValueError(f"No data available for variable '{variable}'.")

    traces = []

    # Adding the lines for the left y-axis
    traces.append(go.Scatter(x=data[variable], y=data['Customers_churned'],
                             mode='lines', name='Customers churned', line=dict(color='lightskyblue')))
    traces.append(go.Scatter(x=data[variable], y=data['Total_Customers'],
                             mode='lines', name='Total Customers', line=dict(color='dodgerblue')))

    # Adding the line for the right y-axis
    traces.append(go.Scatter(x=data[variable], y=data['perc_of_total_cust'],
                             mode='lines', name='Churn as Percent of total', line=dict(color='yellowgreen'), yaxis='y2'))
    
    # Adding the average customer churn line
    # overall churn rate: 948 churned out of 5630 customers = 16.84% (20.25% is churned / retained)
    y = [16.84] * len(data[variable])
    traces.append(go.Scatter(x=data[variable], y=y,
                             mode='lines', name='Average customer Churn', line=dict(color='orangered', dash='dash'), yaxis='y2'))

    return traces

# Create subplots
fig = make_subplots(rows=4, cols=3, subplot_titles=numeric_data.columns)

for i, col in enumerate(numeric_data.columns):
    traces = analysis_chart_plotly(col)
    for trace in traces:
        fig.add_trace(trace, row=i//3+1, col=i%3+1)

fig.update_layout(height=800, width=1400, template='plotly_dark',showlegend=False)
fig.show()


# Analyzing the Churned Customers only

To analyze the customers who left the organization more closely, we create a subset of the data containing only them and look for insights into why they left.

In [28]:
churned_customers = data[data['churn'] == 1]
churned_customers.head()

Unnamed: 0,customerid,churn,tenure,preferredlogindevice,citytier,warehousetohome,preferredpaymentmode,gender,hourspendonapp,numberofdeviceregistered,preferedordercat,satisfactionscore,maritalstatus,numberofaddress,complain,orderamounthikefromlastyear,couponused,ordercount,daysincelastorder,cashbackamount
0,50001,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3.0,Laptop & Accessory,2.0,Single,9.0,1,11.0,1.0,1.0,5.0,160.0
1,50002,1,9.0,Phone,1,8.0,UPI,Male,3.0,4.0,Mobile,3.0,Single,7.0,1,15.0,0.0,1.0,0.0,121.0
2,50003,1,9.0,Phone,1,30.0,Debit Card,Male,2.0,4.0,Mobile,3.0,Single,6.0,1,14.0,0.0,1.0,3.0,120.0
3,50004,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4.0,Laptop & Accessory,5.0,Single,8.0,0,23.0,0.0,1.0,3.0,134.0
4,50005,1,0.0,Phone,1,12.0,CC,Male,3.0,3.0,Mobile,5.0,Single,3.0,0,11.0,1.0,1.0,3.0,130.0


In [29]:
churned_customers.shape

(948, 20)

In [30]:
numeric_churned_customers = churned_customers.select_dtypes(include=np.number)
cat_churned_customers = churned_customers.select_dtypes(exclude=np.number)

In [31]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd

# churned_customers is the DataFrame of churned customers; d maps each column name to its summary DataFrame
d = {}

# Loop over the columns, computing and storing the churned-customer count per value
for i in churned_customers.columns:
    churn_sum = churned_customers.groupby(i).churn.sum().rename('Customers_churned')
    
    d[i] = pd.concat([churn_sum], axis=1)
    d[i].reset_index(level=0, inplace=True)
    d[i] = d[i].rename(columns={'index': i})

def analysis_chart_plotly(variable):
    """Return a line trace of churned-customer counts per value of `variable`."""
    churned_customers = d[variable]
    traces = [
        go.Scatter(
            x=churned_customers[variable],
            y=churned_customers['Customers_churned'],
            mode='lines',
            name='Customers churned',
            line=dict(color='lightskyblue')
        )
    ]
    
    return traces

# Create subplots
fig = make_subplots(rows=4, cols=3, subplot_titles=numeric_churned_customers.columns)

for i, col in enumerate(numeric_churned_customers.columns):
    traces = analysis_chart_plotly(col)
    for trace in traces:
        fig.add_trace(trace, row=i//3+1, col=i%3+1)

fig.update_layout(height=800, width=1400, template='plotly_dark',showlegend=False)
fig.show()


In [32]:
cat_churned_customers.columns

Index(['customerid', 'churn', 'preferredlogindevice', 'citytier',
       'preferredpaymentmode', 'gender', 'preferedordercat', 'maritalstatus',
       'complain'],
      dtype='object')

In [33]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
# cat_churned_customers.drop('customerid', axis=1, inplace=True)
# cat_churned_customers holds the categorical columns of the churned customers; d maps each column name to its summary DataFrame
d = {}

# Loop over the columns, computing and storing the churned-customer count per value
for i in cat_churned_customers.columns:
    churn_sum = cat_churned_customers.groupby(i).churn.sum().rename('Customers_churned')
    
    d[i] = pd.concat([churn_sum], axis=1)
    d[i].reset_index(level=0, inplace=True)
    d[i] = d[i].rename(columns={'index': i})

def analysis_chart_plotly(variable):
    """Return a bar trace of churned-customer counts per value of `variable`."""
    cat_churned_customers = d[variable]
    
    traces = [
        go.Bar(
            x = cat_churned_customers[variable]
            , y = cat_churned_customers['Customers_churned']
            , name = 'Customers churned'
            , orientation = 'v'
            , marker_color = 'lightskyblue'
        )
    ]
    
    return traces

# Create subplots
fig = make_subplots(rows=4, cols=3, subplot_titles=cat_churned_customers.columns)

for i, col in enumerate(cat_churned_customers.columns):
    traces = analysis_chart_plotly(col)
    for trace in traces:
        fig.add_trace(trace, row=i//3+1, col=i%3+1)

fig.update_layout(height=800, width=1400, template='plotly_dark',showlegend=False)
fig.show()


# Customer Churn Prediction

## Preprocessing

In [34]:
data.drop('customerid', axis=1, inplace=True)
data.head()

Unnamed: 0,churn,tenure,preferredlogindevice,citytier,warehousetohome,preferredpaymentmode,gender,hourspendonapp,numberofdeviceregistered,preferedordercat,satisfactionscore,maritalstatus,numberofaddress,complain,orderamounthikefromlastyear,couponused,ordercount,daysincelastorder,cashbackamount
0,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3.0,Laptop & Accessory,2.0,Single,9.0,1,11.0,1.0,1.0,5.0,160.0
1,1,9.0,Phone,1,8.0,UPI,Male,3.0,4.0,Mobile,3.0,Single,7.0,1,15.0,0.0,1.0,0.0,121.0
2,1,9.0,Phone,1,30.0,Debit Card,Male,2.0,4.0,Mobile,3.0,Single,6.0,1,14.0,0.0,1.0,3.0,120.0
3,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4.0,Laptop & Accessory,5.0,Single,8.0,0,23.0,0.0,1.0,3.0,134.0
4,1,0.0,Phone,1,12.0,CC,Male,3.0,3.0,Mobile,5.0,Single,3.0,0,11.0,1.0,1.0,3.0,130.0


In [35]:
df_encoded=data.copy()
df_encoded.head()

Unnamed: 0,churn,tenure,preferredlogindevice,citytier,warehousetohome,preferredpaymentmode,gender,hourspendonapp,numberofdeviceregistered,preferedordercat,satisfactionscore,maritalstatus,numberofaddress,complain,orderamounthikefromlastyear,couponused,ordercount,daysincelastorder,cashbackamount
0,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3.0,Laptop & Accessory,2.0,Single,9.0,1,11.0,1.0,1.0,5.0,160.0
1,1,9.0,Phone,1,8.0,UPI,Male,3.0,4.0,Mobile,3.0,Single,7.0,1,15.0,0.0,1.0,0.0,121.0
2,1,9.0,Phone,1,30.0,Debit Card,Male,2.0,4.0,Mobile,3.0,Single,6.0,1,14.0,0.0,1.0,3.0,120.0
3,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4.0,Laptop & Accessory,5.0,Single,8.0,0,23.0,0.0,1.0,3.0,134.0
4,1,0.0,Phone,1,12.0,CC,Male,3.0,3.0,Mobile,5.0,Single,3.0,0,11.0,1.0,1.0,3.0,130.0


In [36]:
df_encoded = pd.get_dummies(df_encoded,drop_first=True)

In [37]:
df_encoded.head()

Unnamed: 0,tenure,warehousetohome,hourspendonapp,numberofdeviceregistered,satisfactionscore,numberofaddress,orderamounthikefromlastyear,couponused,ordercount,daysincelastorder,...,preferredpaymentmode_UPI,gender_Male,preferedordercat_Grocery,preferedordercat_Laptop & Accessory,preferedordercat_Mobile,preferedordercat_Mobile Phone,preferedordercat_Others,maritalstatus_Married,maritalstatus_Single,complain_1
0,4.0,6.0,3.0,3.0,2.0,9.0,11.0,1.0,1.0,5.0,...,False,False,False,True,False,False,False,False,True,True
1,9.0,8.0,3.0,4.0,3.0,7.0,15.0,0.0,1.0,0.0,...,True,True,False,False,True,False,False,False,True,True
2,9.0,30.0,2.0,4.0,3.0,6.0,14.0,0.0,1.0,3.0,...,False,True,False,False,True,False,False,False,True,True
3,0.0,15.0,2.0,4.0,5.0,8.0,23.0,0.0,1.0,3.0,...,False,True,False,True,False,False,False,False,True,False
4,0.0,12.0,3.0,3.0,5.0,3.0,11.0,1.0,1.0,3.0,...,False,True,False,False,True,False,False,False,True,False
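Because `churn` was converted to `object` earlier, `pd.get_dummies` treats it like any other categorical column and, with `drop_first=True`, emits a single `churn_1` indicator. A small sketch on made-up rows (the `toy` frame is an assumption) showing where that column name comes from:

```python
import pandas as pd

# Made-up rows: churn stored as object, one string column, one numeric column.
toy = pd.DataFrame({
    "churn": pd.Series([1, 0, 1], dtype="object"),
    "gender": ["Female", "Male", "Male"],
    "tenure": [4.0, 9.0, 0.0],
})

# Numeric columns pass through; each categorical column becomes indicator
# columns, and drop_first=True drops the first level of each one.
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The dropped levels (`churn_0`, `gender_Female`) act as the reference categories.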


In [38]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features = df_encoded[numeric_data.columns]
features = scaler.fit_transform(features)

In [39]:
scaled_df_encoded = df_encoded.copy()

In [40]:
scaled_df_encoded[numeric_data.columns] = features
scaled_df_encoded.head()

Unnamed: 0,tenure,warehousetohome,hourspendonapp,numberofdeviceregistered,satisfactionscore,numberofaddress,orderamounthikefromlastyear,couponused,ordercount,daysincelastorder,...,preferredpaymentmode_UPI,gender_Male,preferedordercat_Grocery,preferedordercat_Laptop & Accessory,preferedordercat_Mobile,preferedordercat_Mobile Phone,preferedordercat_Others,maritalstatus_Married,maritalstatus_Single,complain_1
0,-0.737959,-1.178839,0.093566,-0.786182,-0.772992,1.875626,-1.301849,-0.45039,-0.892887,0.168378,...,False,False,False,True,False,False,False,False,True,True
1,-0.134866,-0.931564,0.093566,0.289706,-0.048392,1.092812,-0.187872,-1.269676,-0.892887,-1.292281,...,True,True,False,False,True,False,False,False,True,True
2,-0.134866,1.788463,-1.32632,0.289706,-0.048392,0.701405,-0.466367,-1.269676,-0.892887,-0.415886,...,False,True,False,False,True,False,False,False,True,True
3,-1.220433,-0.066101,-1.32632,0.289706,1.400807,1.484219,2.040082,-1.269676,-0.892887,-0.415886,...,False,True,False,True,False,False,False,False,True,False
4,-1.220433,-0.437014,0.093566,-0.786182,1.400807,-0.472817,-1.301849,-0.45039,-0.892887,-0.415886,...,False,True,False,False,True,False,False,False,True,False


In [41]:
X=scaled_df_encoded.drop(['churn_1'],axis=1)
y=scaled_df_encoded['churn_1']

In [42]:
print('Before OverSampling, the shape of X: {}'.format(X.shape)) 
print('Before OverSampling, the shape of y: {} \n'.format(y.shape)) 
  
print("Before OverSampling, counts of label '1': {}".format(sum(y == 1))) 
print("Before OverSampling, counts of label '0': {}".format(sum(y == 0)))

Before OverSampling, the shape of X: (5630, 30)
Before OverSampling, the shape of y: (5630,) 

Before OverSampling, counts of label '1': 948
Before OverSampling, counts of label '0': 4682


In [43]:
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state=33)
X_res, y_res = sm.fit_resample(X, y.to_numpy())

In [44]:
print('After OverSampling, the shape of X: {}'.format(X_res.shape)) 
print('After OverSampling, the shape of y: {} \n'.format(y_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_res == 0)))

After OverSampling, the shape of X: (9364, 30)
After OverSampling, the shape of y: (9364,) 

After OverSampling, counts of label '1': 4682
After OverSampling, counts of label '0': 4682


In [45]:
X_res = pd.DataFrame(X_res)
y_res = pd.DataFrame(y_res)
y_res.columns = ['churn_1']

data_res = pd.concat([X_res, y_res], axis=1)

In [46]:
from sklearn.model_selection import train_test_split
X = data_res.drop(['churn_1'],axis=1)
y = data_res['churn_1']
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.75, random_state=42)
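In the cells above, scaling and oversampling are applied to the full dataset before the train/test split, so synthetic samples derived from test rows can end up in the training set and inflate the test scores. A common alternative is to split first and then fit the scaler and the resampler on the training part only. The sketch below illustrates that ordering on synthetic data; it deliberately swaps SMOTE for plain random oversampling via `sklearn.utils.resample` to stay dependency-light, and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the churn data (roughly 17% positives).
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=5, weights=[0.83, 0.17], random_state=0
)

# 1. Split first, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, stratify=y_toy, random_state=42
)

# 2. Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Oversample the minority class in the training part only
#    (random oversampling here; SMOTE would be fitted at this same point).
majority = X_tr_s[y_tr == 0]
minority = X_tr_s[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=33)

X_bal = np.vstack([majority, minority_up])
y_bal = np.concatenate([
    np.zeros(len(majority), dtype=int),
    np.ones(len(minority_up), dtype=int),
])

print(X_bal.shape, np.bincount(y_bal))
print(X_te_s.shape, np.bincount(y_te))  # the test part keeps its original imbalance
```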

## Logistic Regression

In [47]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_train_pred = logreg.predict(X_train)
y_test_pred = logreg.predict(X_test)

print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg.score(X_train, y_train)))


Accuracy of logistic regression classifier on train set: 0.86


In [48]:
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix

cm_lr = confusion_matrix(y_train, y_train_pred)
px.imshow(cm_lr, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')


In [49]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

       False       0.86      0.86      0.86      3512
        True       0.86      0.86      0.86      3511

    accuracy                           0.86      7023
   macro avg       0.86      0.86      0.86      7023
weighted avg       0.86      0.86      0.86      7023



In [50]:
cm_lr = confusion_matrix(y_test, y_test_pred)
px.imshow(cm_lr, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')


In [51]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

       False       0.84      0.84      0.84      1170
        True       0.84      0.83      0.84      1171

    accuracy                           0.84      2341
   macro avg       0.84      0.84      0.84      2341
weighted avg       0.84      0.84      0.84      2341



In [52]:

# predict probabilities
probs = logreg.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]

# calculate AUC
auc = roc_auc_score(y_train, probs)
print(f'AUC: {auc:.3f}')

# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)

# Create the ROC curve plot using Plotly
fig = make_subplots()

# Add a diagonal line for reference
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        line=dict(dash='dash'),
        name='No Skill'
    )
)

# Add the ROC curve
fig.add_trace(
    go.Scatter(
        x=train_fpr,
        y=train_tpr,
        mode='lines',
        name=f'ROC curve (AUC = {auc:.3f})',
        line=dict(color='blue')
    )
)

# Update layout for better visualization
fig.update_layout(
    title='Receiver Operating Characteristic (ROC) Curve',
    xaxis_title='False Positive Rate (FPR)',
    yaxis_title='True Positive Rate (TPR)',
    xaxis=dict(range=[0, 1], constrain='domain'),
    yaxis=dict(range=[0, 1], constrain='domain'),
    template='plotly_dark',
    showlegend=True
)

# Show the plot
fig.show()


AUC: 0.927
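
The AUC printed above can be read as the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen non-churner. The following is a minimal pure-Python sketch of that interpretation. `pairwise_auc` and the toy labels and scores are illustrative and not part of the notebook; `roc_auc_score` should return the same value on the same inputs, although it computes it from the ROC curve rather than by enumerating pairs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y]
    neg = [s for y, s in zip(y_true, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two non-churners followed by two churners
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUC: {pairwise_auc(toy_labels, toy_scores):.3f}")  # AUC: 0.750
```

The double loop is quadratic in the number of samples, so it is meant for building intuition on small inputs rather than for replacing `roc_auc_score` on the full training set.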


## Polynomial Logistic Regression

In [53]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)

X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

logreg = LogisticRegression()
logreg.fit(X_train_poly, y_train)

y_train_pred = logreg.predict(X_train_poly)
y_test_pred = logreg.predict(X_test_poly)

print('Accuracy of logistic regression classifier on train set: {:.2f}'.format(logreg.score(X_train_poly, y_train)))

Accuracy of logistic regression classifier on train set: 1.00
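
A training accuracy of 1.00 is easier to interpret alongside the number of columns the degree-3 expansion creates. For `n` input features, `PolynomialFeatures(degree=d)` with its default bias column produces C(n + d, d) output columns. The sketch below only does that arithmetic. `n_poly_features` is an illustrative helper, and the input widths passed to it are placeholders, because the width of `X_train` is not shown in this section.

```python
from math import comb


def n_poly_features(n_features, degree):
    """Number of columns produced by a full polynomial expansion.

    Counts all monomials of total degree <= `degree` in `n_features`
    variables, including the constant (bias) term.
    """
    return comb(n_features + degree, degree)


# Placeholder input widths; in the notebook, use X_train.shape[1] instead
for n in (10, 20, 30):
    print(n, "->", n_poly_features(n, 3))
```

If the expanded matrix has hundreds or thousands of columns against the 7,023 training rows, a near-perfect training fit is plausible, so the cross-validation and test scores are likely the more informative numbers.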


In [54]:
from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix

cm_lr = confusion_matrix(y_train, y_train_pred)
px.imshow(cm_lr, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')


In [55]:
cm_lr = confusion_matrix(y_test, y_test_pred)
px.imshow(cm_lr, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')


In [56]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=logreg, X=X_train_poly, y=y_train, cv=10, n_jobs=-1)
print(scores)

[0.98008535 0.9772404  0.98008535 0.97150997 0.96581197 0.98433048
 0.98148148 0.98575499 0.97863248 0.98290598]
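
The ten fold scores are easier to compare across models as a mean and a spread. This sketch uses only the standard library and the values printed above; in the notebook itself, `scores.mean()` and `scores.std()` on the returned NumPy array would be the natural equivalent.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross_val_score output above
fold_scores = [
    0.98008535, 0.9772404, 0.98008535, 0.97150997, 0.96581197,
    0.98433048, 0.98148148, 0.98575499, 0.97863248, 0.98290598,
]

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population std, matching NumPy's default ddof=0
print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
```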


In [57]:

# predict probabilities
probs = logreg.predict_proba(X_train_poly)
# keep probabilities for the positive outcome only
probs = probs[:, 1]

# calculate AUC
auc = roc_auc_score(y_train, probs)
print(f'AUC: {auc:.3f}')

# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)

# Create the ROC curve plot using Plotly
fig = make_subplots()

# Add a diagonal line for reference
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        line=dict(dash='dash'),
        name='No Skill'
    )
)

# Add the ROC curve
fig.add_trace(
    go.Scatter(
        x=train_fpr,
        y=train_tpr,
        mode='lines',
        name=f'ROC curve (AUC = {auc:.3f})',
        line=dict(color='blue')
    )
)

# Update layout for better visualization
fig.update_layout(
    title='Receiver Operating Characteristic (ROC) Curve',
    xaxis_title='False Positive Rate (FPR)',
    yaxis_title='True Positive Rate (TPR)',
    xaxis=dict(range=[0, 1], constrain='domain'),
    yaxis=dict(range=[0, 1], constrain='domain'),
    template='plotly_dark',
    showlegend=True
)

# Show the plot
fig.show()


AUC: 1.000


In [58]:
test_accuracy = logreg.score(X_test_poly, y_test)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.9854762921828278


## Decision Tree

In [59]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)

In [60]:
y_train_predict = dt.predict(X_train)
model_score = dt.score(X_train, y_train)
print(model_score)
print(metrics.classification_report(y_train, y_train_predict))

dt_cm = confusion_matrix(y_train, y_train_predict)
px.imshow(dt_cm, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')

1.0
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      3512
        True       1.00      1.00      1.00      3511

    accuracy                           1.00      7023
   macro avg       1.00      1.00      1.00      7023
weighted avg       1.00      1.00      1.00      7023



In [61]:
y_test_predict = dt.predict(X_test)
model_score = dt.score(X_test, y_test)
print(model_score)
print(metrics.classification_report(y_test, y_test_predict))

dt_cm = confusion_matrix(y_test, y_test_predict)
px.imshow(dt_cm, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')

0.9585647159333618
              precision    recall  f1-score   support

       False       0.96      0.95      0.96      1170
        True       0.95      0.96      0.96      1171

    accuracy                           0.96      2341
   macro avg       0.96      0.96      0.96      2341
weighted avg       0.96      0.96      0.96      2341



In [62]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=dt, X=X_train, y=y_train, cv=10, n_jobs=-1)
print(scores)

[0.96728307 0.9544808  0.96728307 0.95014245 0.92592593 0.94586895
 0.96011396 0.95441595 0.94444444 0.95156695]


In [63]:

# predict probabilities
probs = dt.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs = probs[:, 1]

# calculate AUC
auc = roc_auc_score(y_train, probs)
print(f'AUC: {auc:.3f}')

# calculate ROC curve
train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)

# Create the ROC curve plot using Plotly
fig = make_subplots()

# Add a diagonal line for reference
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode='lines',
        line=dict(dash='dash'),
        name='No Skill'
    )
)

# Add the ROC curve
fig.add_trace(
    go.Scatter(
        x=train_fpr,
        y=train_tpr,
        mode='lines',
        name=f'ROC curve (AUC = {auc:.3f})',
        line=dict(color='blue')
    )
)

# Update layout for better visualization
fig.update_layout(
    title='Receiver Operating Characteristic (ROC) Curve',
    xaxis_title='False Positive Rate (FPR)',
    yaxis_title='True Positive Rate (TPR)',
    xaxis=dict(range=[0, 1], constrain='domain'),
    yaxis=dict(range=[0, 1], constrain='domain'),
    template='plotly_dark',
    showlegend=True
)

# Show the plot
fig.show()


AUC: 1.000


In [64]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)


best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Test accuracy: ", test_accuracy)


Best parameters found:  {'criterion': 'entropy', 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best accuracy:  0.9545783779947481
Test accuracy:  0.9636907304570697
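
An exhaustive grid search fits one model per parameter combination per fold, so the cost grows multiplicatively with every option added to the grid. The sketch below counts the fits implied by the decision-tree grid defined above with `cv=5`. It does no training, and the final refit on the full training set (scikit-learn's default `refit=True`) is noted separately.

```python
from itertools import product

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 5

# Every combination that an exhaustive search over this grid would evaluate
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_fits} fits (+1 refit)")
```

The same arithmetic applies to any other grid; when the count becomes too large, `RandomizedSearchCV` is one commonly used alternative that samples a fixed number of combinations.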


## Random Forests

In [65]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier()

RFC.fit(X_train,y_train)

In [66]:
y_train_predict = RFC.predict(X_train)
model_score = RFC.score(X_train, y_train)
print(model_score)
print(metrics.classification_report(y_train, y_train_predict))

RFC_cm = confusion_matrix(y_train, y_train_predict)
px.imshow(RFC_cm, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')

1.0
              precision    recall  f1-score   support

       False       1.00      1.00      1.00      3512
        True       1.00      1.00      1.00      3511

    accuracy                           1.00      7023
   macro avg       1.00      1.00      1.00      7023
weighted avg       1.00      1.00      1.00      7023



In [67]:
y_test_predict = RFC.predict(X_test)
model_score = RFC.score(X_test, y_test)
print(model_score)
print(metrics.classification_report(y_test, y_test_predict))

RFC_cm = confusion_matrix(y_test, y_test_predict)
px.imshow(RFC_cm, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')

0.9833404527979496
              precision    recall  f1-score   support

       False       0.99      0.98      0.98      1170
        True       0.98      0.99      0.98      1171

    accuracy                           0.98      2341
   macro avg       0.98      0.98      0.98      2341
weighted avg       0.98      0.98      0.98      2341



In [68]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator=RFC, X=X_train, y=y_train, cv=10, n_jobs=-1)
print(scores)

[0.99146515 0.98577525 0.9886202  0.97435897 0.97150997 0.98148148
 0.98717949 0.99002849 0.98575499 0.98005698]


In [69]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
grid_search = GridSearchCV(estimator=RFC, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)
print("Best accuracy: ", grid_search.best_score_)


best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Test accuracy: ", test_accuracy)



Best parameters found:  {'bootstrap': False, 'max_depth': 30, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best accuracy:  0.984480335797062
Test accuracy:  0.9867577958137548


In [70]:
y_pred = best_model.predict(X_test)

In [71]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9868


## Full Pipeline for Running the Random Forest Classifier

In [72]:
clean_data = pd.read_csv('cleaned_data.csv')
clean_data.drop('customerid', axis=1, inplace=True)
X_clean = clean_data.drop(['churn'],axis=1)
y_clean = clean_data['churn']

X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, train_size=0.75, random_state=42)

In [73]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4222 entries, 1425 to 860
Data columns (total 18 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   tenure                       4222 non-null   float64
 1   preferredlogindevice         4222 non-null   object 
 2   citytier                     4222 non-null   int64  
 3   warehousetohome              4222 non-null   float64
 4   preferredpaymentmode         4222 non-null   object 
 5   gender                       4222 non-null   object 
 6   hourspendonapp               4222 non-null   float64
 7   numberofdeviceregistered     4222 non-null   float64
 8   preferedordercat             4222 non-null   object 
 9   satisfactionscore            4222 non-null   float64
 10  maritalstatus                4222 non-null   object 
 11  numberofaddress              4222 non-null   float64
 12  complain                     4222 non-null   int64  
 13  orderamounthikefromla

In [74]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Standardize numeric columns and one-hot encode categorical columns
scaler = StandardScaler()

encoder = OneHotEncoder()
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', scaler, numeric_features),
        ('cat', encoder, categorical_features)
    ]
)

rf_classifier = RandomForestClassifier(
    bootstrap=False,
    max_depth=None,
    min_samples_leaf=1,
    min_samples_split=2,
    n_estimators=100
)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', rf_classifier)
])

pipeline.fit(X_train, y_train)

In [77]:
import pickle

# `pipeline` is the fitted preprocessing + classifier pipeline defined above
with open('churn_model_pipeline.pkl', 'wb') as file:
    pickle.dump(pipeline, file)

In [76]:
from sklearn.metrics import accuracy_score
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.9773


In [78]:
y_test_predict = pipeline.predict(X_test)
model_score = pipeline.score(X_test, y_test)
print(model_score)
print(metrics.classification_report(y_test, y_test_predict))

pipeline_cm = confusion_matrix(y_test, y_test_predict)
px.imshow(pipeline_cm, width=500, height=500,text_auto=True,color_continuous_scale=px.colors.sequential.Blues, template='plotly_dark')

0.9772727272727273
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      1172
           1       0.96      0.90      0.93       236

    accuracy                           0.98      1408
   macro avg       0.97      0.95      0.96      1408
weighted avg       0.98      0.98      0.98      1408
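
Because only 236 of the 1,408 test customers churned, overall accuracy is dominated by the majority class, and the per-class rows are the more useful part of the report. The sketch below shows how precision, recall and F1 for the churn class follow from confusion-matrix counts. The four counts are illustrative values chosen to be consistent with the rounded report above; they are an assumption, not the notebook's actual matrix, which is only shown as a plot.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Illustrative counts (assumed, not read from the notebook's plot)
tp, fn = 212, 24      # churners caught / missed (236 in total)
fp, tn = 8, 1164      # non-churners flagged / correctly passed (1172 in total)

precision, recall, f1, accuracy = binary_metrics(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

With counts like these, roughly one churner in ten is missed even though accuracy is close to 0.98, which is why recall on class `1` is usually the number to watch for a retention use case.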



## END of Notebook