## Predictive Insights for Reducing Customer Churn

### Overview

This is a strategic initiative undertaken by SyriaTel, a telecommunications company, to tackle customer attrition. The main objective is to identify the factors that contribute to customer churn and to predict which subscribers are likely to discontinue their services. SyriaTel has customer data covering demographics, subscription details, usage patterns, and support interactions, and wants to use it to build a predictive model. The ultimate goal is to let SyriaTel take proactive measures to retain at-risk customers and optimize its service offerings, thereby reducing revenue loss and improving overall customer satisfaction.

## Business Problem

### Stakeholder - SyriaTel Company

In a competitive telecom market, SyriaTel faces a critical business problem: escalating customer churn, driven by undisclosed customer dissatisfaction and suboptimal service experiences. With an ever-expanding array of services and plans, the company struggles to understand and predict why certain subscribers decide to leave. This attrition reduces SyriaTel's revenue and also damages its reputation. The task is to proactively identify customers at risk of churning from the available demographic, usage, and interaction data. SyriaTel aims to use these predictions to improve customer retention and secure a competitive edge in the telecommunications industry.

## Data Understanding

In [1]:
# import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler, FunctionTransformer, MinMaxScaler

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
# load the data
df = pd.read_csv("Data/Telecom.csv")
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
# summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [4]:
# summary statistics of the dataframe
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
account length,3333.0,101.064806,39.822106,1.0,74.0,101.0,127.0,243.0
area code,3333.0,437.182418,42.37129,408.0,408.0,415.0,510.0,510.0
number vmail messages,3333.0,8.09901,13.688365,0.0,0.0,0.0,20.0,51.0
total day minutes,3333.0,179.775098,54.467389,0.0,143.7,179.4,216.4,350.8
total day calls,3333.0,100.435644,20.069084,0.0,87.0,101.0,114.0,165.0
total day charge,3333.0,30.562307,9.259435,0.0,24.43,30.5,36.79,59.64
total eve minutes,3333.0,200.980348,50.713844,0.0,166.6,201.4,235.3,363.7
total eve calls,3333.0,100.114311,19.922625,0.0,87.0,100.0,114.0,170.0
total eve charge,3333.0,17.08354,4.310668,0.0,14.16,17.12,20.0,30.91
total night minutes,3333.0,200.872037,50.573847,23.2,167.0,201.2,235.3,395.0


## Defining X and y

In [5]:
y = df["churn"]
X = df.drop("churn", axis=1)

## Train-Test Split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
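
The split above is purely random. Churn targets are often imbalanced, so one option worth considering is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below uses a synthetic target with an illustrative 15% positive rate (an assumption, not a figure taken from this dataset); `X_demo`, `y_demo` and the other `*_demo`-style names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn target: 150 positives out of 1000 (illustrative only)
rng = np.random.default_rng(0)
y_demo = pd.Series(np.array([True] * 150 + [False] * 850))
X_demo = pd.DataFrame({"feature": rng.normal(size=len(y_demo))})

# stratify=y_demo asks for the same class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each partition
print(y_tr.mean(), y_te.mean())
```

On the real data the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.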

## Dealing with Categorical Features

In [7]:
X_train_copy = X_train.copy()

In [8]:
categorical_features = ["international plan", "voice mail plan"]
X_train_categorical = X_train_copy[categorical_features].copy()
X_train_categorical

Unnamed: 0,international plan,voice mail plan
817,no,no
1373,no,no
679,yes,no
56,no,no
1993,no,no
...,...,...
1095,no,no
1130,no,no
1294,no,no
860,no,no


In [9]:
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

ohe.fit(X_train_categorical)

X_train_ohe = pd.DataFrame(
    ohe.transform(X_train_categorical),
    index=X_train_categorical.index,
    columns=np.hstack(ohe.categories_)
)
X_train_ohe

Unnamed: 0,no,yes,no.1,yes.1
817,1.0,0.0,1.0,0.0
1373,1.0,0.0,1.0,0.0
679,0.0,1.0,1.0,0.0
56,1.0,0.0,1.0,0.0
1993,1.0,0.0,1.0,0.0
...,...,...,...,...
1095,1.0,0.0,1.0,0.0
1130,1.0,0.0,1.0,0.0
1294,1.0,0.0,1.0,0.0
860,1.0,0.0,1.0,0.0


## Dealing with Numerical Features

In [10]:
numerical_features = ["number vmail messages", "total day minutes", "total day calls",
                      "total day charge", "customer service calls"]
X_train_numerical = X_train_copy[numerical_features].copy()
X_train_numerical

Unnamed: 0,number vmail messages,total day minutes,total day calls,total day charge,customer service calls
817,0,95.5,92,16.24,2
1373,0,112.0,105,19.04,4
679,0,222.4,78,37.81,1
56,0,126.9,98,21.57,1
1993,0,216.3,96,36.77,0
...,...,...,...,...,...
1095,0,274.4,120,46.65,1
1130,0,35.1,62,5.97,1
1294,0,87.6,76,14.89,1
860,0,179.2,111,30.46,2


In [11]:
scaler = MinMaxScaler()

scaler.fit(X_train_numerical)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train_numerical),
    # index is important to ensure we can concatenate with other columns
    index=X_train_numerical.index,
    columns=X_train_numerical.columns
)
X_train_scaled

Unnamed: 0,number vmail messages,total day minutes,total day calls,total day charge,customer service calls
817,0.000000,0.266801,0.459259,0.266892,0.222222
1373,0.000000,0.314187,0.555556,0.314189,0.444444
679,0.000000,0.631246,0.355556,0.631250,0.111111
56,0.000000,0.356979,0.503704,0.356926,0.111111
1993,0.000000,0.613728,0.488889,0.613682,0.000000
...,...,...,...,...,...
1095,0.000000,0.780586,0.666667,0.780574,0.111111
1130,0.000000,0.093337,0.237037,0.093412,0.111111
1294,0.000000,0.244113,0.340741,0.244088,0.111111
860,0.000000,0.507180,0.600000,0.507095,0.222222


In [12]:
X_train_full = pd.concat([X_train_scaled, X_train_ohe], axis=1)
X_train_full

Unnamed: 0,number vmail messages,total day minutes,total day calls,total day charge,customer service calls,no,yes,no.1,yes.1
817,0.000000,0.266801,0.459259,0.266892,0.222222,1.0,0.0,1.0,0.0
1373,0.000000,0.314187,0.555556,0.314189,0.444444,1.0,0.0,1.0,0.0
679,0.000000,0.631246,0.355556,0.631250,0.111111,0.0,1.0,1.0,0.0
56,0.000000,0.356979,0.503704,0.356926,0.111111,1.0,0.0,1.0,0.0
1993,0.000000,0.613728,0.488889,0.613682,0.000000,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...
1095,0.000000,0.780586,0.666667,0.780574,0.111111,1.0,0.0,1.0,0.0
1130,0.000000,0.093337,0.237037,0.093412,0.111111,1.0,0.0,1.0,0.0
1294,0.000000,0.244113,0.340741,0.244088,0.111111,1.0,0.0,1.0,0.0
860,0.000000,0.507180,0.600000,0.507095,0.222222,1.0,0.0,1.0,0.0


## Fitting a Model

In [13]:
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model_log = logreg.fit(X_train_full, y_train)
model_log
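
After fitting, it can be useful to look at the learned coefficients alongside their feature names. The sketch below shows one way to do this on synthetic data, reusing the same `LogisticRegression` settings as above (a very large `C` makes the regularization negligible). The data and the names `X_demo`, `y_demo` and `coefs` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the target depends on "signal" but not on "noise"
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = (X_demo["signal"] + 0.5 * rng.normal(size=200)) > 0

logreg_demo = LogisticRegression(fit_intercept=False, C=1e12, solver="liblinear")
logreg_demo.fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary target
coefs = pd.Series(logreg_demo.coef_[0], index=X_demo.columns)
print(coefs.sort_values(key=abs, ascending=False))
```

On the notebook's data the same pattern would be `pd.Series(model_log.coef_[0], index=X_train_full.columns)`, which is easiest to read when the column names are unique.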

## Performance on Training Data