# Churn Prediction for Customers of a Telecommunication Company (MTN)

### Welcome, grab a seat. Today we'll be predicting how likely customers of a telecommunication company (MTN) are to churn.

> By churn, I mean that they stop using the company's services

### We'll also try to identify a common factor (reason) for their churning based on historical churn data, so MTN knows what to improve

##### As always, I'll be documenting this project down to the last detail, so let's go!!!

First things first: we'll import all the libraries needed for this project

- I decided to use a Random Forest model for the analysis because it seems like a good choice for this project. Let's see

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

Next, we'll load our training data into the notebook and take a look at it to understand the kind of data we're dealing with

In [2]:
train_data = pd.read_csv('churntrain.csv')
train_data

Unnamed: 0,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churn
0,OH,107,area_code_415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.70,1,no
1,NJ,137,area_code_415,no,no,0,243.4,114,41.38,121.2,110,10.30,162.6,104,7.32,12.2,5,3.29,0,no
2,OH,84,area_code_408,yes,no,0,299.4,71,50.90,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,no
3,OK,75,area_code_415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,no
4,MA,121,area_code_510,no,yes,24,218.2,88,37.09,348.5,108,29.62,212.6,118,9.57,7.5,7,2.03,3,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4245,MT,83,area_code_415,no,no,0,188.3,70,32.01,243.8,88,20.72,213.7,79,9.62,10.3,6,2.78,0,no
4246,WV,73,area_code_408,no,no,0,177.9,89,30.24,131.2,82,11.15,186.2,89,8.38,11.5,6,3.11,3,no
4247,NC,75,area_code_408,no,no,0,170.7,101,29.02,193.1,126,16.41,129.1,104,5.81,6.9,7,1.86,1,no
4248,HI,50,area_code_408,no,yes,40,235.7,127,40.07,223.0,126,18.96,297.5,116,13.39,9.9,5,2.67,2,no


I want a list of all the columns and their data types so I can easily copy and paste the columns I need and also understand the data better

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4250 entries, 0 to 4249
Data columns (total 20 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   state                          4250 non-null   object 
 1   account_length                 4250 non-null   int64  
 2   area_code                      4250 non-null   object 
 3   international_plan             4250 non-null   object 
 4   voice_mail_plan                4250 non-null   object 
 5   number_vmail_messages          4250 non-null   int64  
 6   total_day_minutes              4250 non-null   float64
 7   total_day_calls                4250 non-null   int64  
 8   total_day_charge               4250 non-null   float64
 9   total_eve_minutes              4250 non-null   float64
 10  total_eve_calls                4250 non-null   int64  
 11  total_eve_charge               4250 non-null   float64
 12  total_night_minutes            4250 non-null   f

In [4]:
train_data.columns

Index(['state', 'account_length', 'area_code', 'international_plan',
       'voice_mail_plan', 'number_vmail_messages', 'total_day_minutes',
       'total_day_calls', 'total_day_charge', 'total_eve_minutes',
       'total_eve_calls', 'total_eve_charge', 'total_night_minutes',
       'total_night_calls', 'total_night_charge', 'total_intl_minutes',
       'total_intl_calls', 'total_intl_charge',
       'number_customer_service_calls', 'churn'],
      dtype='object')

### We can create new columns to find the total calls, total minutes, and total charge irrespective of the time of the day

In [5]:
train_data['total_calls'] = train_data['total_day_calls'] + train_data['total_eve_calls'] + train_data['total_night_calls']
train_data['total_minutes'] = train_data['total_day_minutes'] + train_data['total_eve_minutes'] + train_data['total_night_minutes']
train_data['total_charges'] = train_data['total_day_charge'] + train_data['total_eve_charge'] + train_data['total_night_charge']

We'll select our 'feature' data (X) and 'target' data (y) that'll be used to train our model

- Note that at first, I picked out the features based on a heatmap (not shown in this notebook)
- By the end of my analysis, I'll have gone through several rounds of feature selection to filter out columns that reduce the quality of my model

Let's go!!!

### WAIT, HOLD ON
Our model only works with numerical data (booleans are fine, since they are treated as 0 and 1)

And my intuition tells me that whether a customer has an international plan or a voicemail plan may affect his/her likelihood of churning

Both columns hold 'yes'/'no' strings, so we'll have to
- convert the data in the respective columns to boolean data using the pandas function 'map()'

- make sure the resulting dtype is boolean using the astype() function (note that astype(bool) turns any NaN left over from an unmapped value into True instead of raising an error, so this only works cleanly when every value is 'yes' or 'no')

In [6]:
train_data['international_plan'] = train_data['international_plan'].map({'yes':True, 'no':False}).astype(bool)
train_data['voice_mail_plan'] = train_data['voice_mail_plan'].map({'yes':True, 'no':False}).astype(bool)
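One caveat worth checking before moving on, sketched here on made-up values (the `'maybe'` entry is hypothetical and does not come from the dataset): `map()` turns anything missing from the dictionary into NaN, and `astype(bool)` then converts that NaN to `True` without complaining. Counting the NaN values before casting is a simple way to catch unexpected entries.

```python
import pandas as pd

# Hypothetical sample: 'maybe' stands in for any unexpected value
plan = pd.Series(['yes', 'no', 'maybe'])

# Values not covered by the dictionary become NaN
mapped = plan.map({'yes': True, 'no': False})
unmapped = int(mapped.isna().sum())

# Casting straight to bool hides the problem: NaN is truthy
silent = mapped.astype(bool)

print(unmapped)         # number of entries the dictionary did not cover
print(silent.tolist())  # the unmapped entry has quietly become True
```

If `unmapped` is 0 for both plan columns, the cast is safe.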

Let's confirm our data is as we want it

In [7]:
train_data['international_plan'].head()

0    False
1    False
2     True
3     True
4    False
Name: international_plan, dtype: bool

In [8]:
train_data['voice_mail_plan'].head()

0     True
1    False
2    False
3    False
4     True
Name: voice_mail_plan, dtype: bool

### Oh, great! Everything is shipshape and perfect
Okay, now we're ready to select our features and target

In [9]:
features = ['international_plan', 'voice_mail_plan', 'account_length', 'total_intl_minutes', 'total_calls', 'total_minutes', 'total_charges', 'total_intl_calls', 'total_intl_charge', 'total_day_minutes','total_day_calls', 'total_eve_minutes', 'total_eve_calls', 'total_night_minutes', 'total_night_calls']
X = train_data[features]
y = train_data.churn

### UGH, WAIT A MINUTE... AGAIN

Our churn column also holds 'yes'/'no' strings.

Well, easy, we'll just convert it to boolean as we did for the other categorical columns


In [10]:
train_data['churn'] = train_data['churn'].map({'yes':True, 'no':False}).astype(bool)
y = train_data.churn
y.head()

0    False
1    False
2    False
3    False
4    False
Name: churn, dtype: bool

### YAY! We're good to go!
Now we'll define our model as a random forest model

In [11]:
churn_model = RandomForestClassifier()
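
A side note on reproducibility: a random forest draws random bootstrap samples and random feature subsets, so with no `random_state` the scores can shift a little from run to run. The sketch below uses synthetic data, and the seed value 42 is an arbitrary assumption. It only shows that two forests built with the same seed behave identically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the churn dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests with the same seed are built identically
model_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)
model_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

same_predictions = np.array_equal(
    model_a.predict_proba(X_demo), model_b.predict_proba(X_demo)
)
print(same_predictions)
```

In this notebook that would mean writing something like `RandomForestClassifier(random_state=42)`, though the scores reported later were produced without a fixed seed.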

Next, we'll fit (train) our model using the training data

In [12]:
churn_model.fit(X, y)

### Awesome! It's time to find out if our model really works

Let's get testing!

In [13]:
test_data = pd.read_csv('churntest.csv')
test_data.head()

Unnamed: 0,id,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
0,1,KS,128,area_code_415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,2,AL,118,area_code_510,yes,no,0,223.4,98,37.98,220.6,101,18.75,203.9,118,9.18,6.3,6,1.7,0
2,3,IA,62,area_code_415,no,no,0,120.7,70,20.52,307.2,76,26.11,203.0,99,9.14,13.1,6,3.54,4
3,4,VT,93,area_code_510,no,no,0,190.7,114,32.42,218.2,111,18.55,129.6,121,5.83,8.1,3,2.19,3
4,5,NE,174,area_code_415,no,no,0,124.3,76,21.13,277.1,112,23.55,250.7,115,11.28,15.5,5,4.19,3


### As seen above, our test data doesn't contain the churn column

> Instead, the churn column has been saved in another file called 'churn.csv' (we'll use it later)

### For now, let's preprocess our testing data as we did for the training data


In [14]:
test_data['international_plan'] = test_data['international_plan'].map({'yes':True, 'no':False}).astype(bool)
test_data['voice_mail_plan'] = test_data['voice_mail_plan'].map({'yes':True, 'no':False}).astype(bool)

Let's check if that worked

In [15]:
test_data['international_plan'].value_counts()

False    673
True      77
Name: international_plan, dtype: int64

### Great! It worked. Let's move on

First, we'll create the total calls, total minutes, and total charges columns for our testing data

Then, we'll select our testing features (the same columns as training features)

In [16]:
test_data['total_calls'] = test_data['total_day_calls'] + test_data['total_eve_calls'] + test_data['total_night_calls']
test_data['total_minutes'] = test_data['total_day_minutes'] + test_data['total_eve_minutes'] + test_data['total_night_minutes']
test_data['total_charges'] = test_data['total_day_charge'] + test_data['total_eve_charge'] + test_data['total_night_charge']
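
Since the same three totals are built for both the training and the test data, the step could be wrapped in a small helper so the two versions can't drift apart. This is only a sketch: the name `add_totals` is my own and is not part of the original notebook. The one-row example uses the values of the first test row.

```python
import pandas as pd

def add_totals(df):
    """Return a copy of df with total_calls, total_minutes and total_charges added."""
    out = df.copy()
    periods = ['day', 'eve', 'night']
    out['total_calls'] = sum(out[f'total_{p}_calls'] for p in periods)
    out['total_minutes'] = sum(out[f'total_{p}_minutes'] for p in periods)
    out['total_charges'] = sum(out[f'total_{p}_charge'] for p in periods)
    return out

# One-row example built from the first row of the test data
sample = pd.DataFrame({
    'total_day_calls': [110], 'total_eve_calls': [99], 'total_night_calls': [91],
    'total_day_minutes': [265.1], 'total_eve_minutes': [197.4], 'total_night_minutes': [244.7],
    'total_day_charge': [45.07], 'total_eve_charge': [16.78], 'total_night_charge': [11.01],
})
totals = add_totals(sample)
print(totals[['total_calls', 'total_minutes', 'total_charges']])
```

With it, both datasets would go through the same call: `train_data = add_totals(train_data)` and `test_data = add_totals(test_data)`.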

In [17]:
features = ['international_plan', 'voice_mail_plan', 'account_length', 'total_intl_minutes', 'total_calls', 'total_minutes', 'total_charges', 'total_intl_calls', 'total_intl_charge', 'total_day_minutes','total_day_calls', 'total_eve_minutes', 'total_eve_calls', 'total_night_minutes', 'total_night_calls']
test_X = test_data[features]
test_X.head()

Unnamed: 0,international_plan,voice_mail_plan,account_length,total_intl_minutes,total_calls,total_minutes,total_charges,total_intl_calls,total_intl_charge,total_day_minutes,total_day_calls,total_eve_minutes,total_eve_calls,total_night_minutes,total_night_calls
0,False,True,128,10.0,300,707.2,72.86,3,2.7,265.1,110,197.4,99,244.7,91
1,True,False,118,6.3,317,647.9,65.91,6,1.7,223.4,98,220.6,101,203.9,118
2,False,False,62,13.1,245,630.9,55.77,6,3.54,120.7,70,307.2,76,203.0,99
3,False,False,93,8.1,346,538.5,56.8,3,2.19,190.7,114,218.2,111,129.6,121
4,False,False,174,15.5,303,652.1,55.96,5,4.19,124.3,76,277.1,112,250.7,115


### Ooouuu, everything looks READY TO GO
What are we waiting for? Let's get predicting!

In [18]:
pred = churn_model.predict(test_X)
pred

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False,  True, False,  True, False, False, False,  True, False,
       False, False, False, False, False, False, False,  True, False,
        True, False,  True, False, False, False, False, False,  True,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

### YEAH, that's what I'm talking about!

Now, let's see how good this model is, shall we?

### Model Evaluation Time!

Hey, do you remember the churn.csv file I mentioned earlier?

Yeah, we're using that now!

churn.csv contains the actual churn column for the test data.

So we're going to compare it with our predicted churn values to see how accurate our model was

In [19]:
test_y = pd.read_csv('churn.csv', index_col = 'id')
test_y.head()

id,churn
1,yes
2,no
3,no
4,yes
5,yes


#### Our data isn't in the form we need yet.
- We only need the churn column, so we set the id column as the index, as done above

Secondly,
- We need the churn column to be in boolean form


So just as we did for the training data, we'll convert the churn labels for the test data to boolean using the map function

In [20]:
test_y['churn'] = test_y['churn'].map({'yes':True, 'no':False}).astype(bool)
test_y = test_y.churn
test_y.head()

id
1     True
2    False
3    False
4     True
5     True
Name: churn, dtype: bool
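
One thing to keep in mind: the scikit-learn metric functions compare the true labels and the predictions position by position and ignore the pandas index. The comparison is therefore only valid if the rows of churn.csv are in the same order as the rows of the test data. A quick check could look like the sketch below. It uses tiny made-up frames, and in the notebook the real `test_data` and `test_y` would take their place.

```python
import pandas as pd

# Made-up stand-ins for test_data and the labels read from churn.csv
test_demo = pd.DataFrame({'id': [1, 2, 3]})
labels_demo = pd.Series(
    [True, False, False],
    index=pd.Index([3, 1, 2], name='id'),
    name='churn',
)

# Are the labels already in the same order as the test rows?
in_order = list(labels_demo.index) == list(test_demo['id'])

# If not, reorder the labels so they follow the ids of the test rows
aligned = labels_demo.loc[test_demo['id']]

print(in_order)
print(aligned.tolist())
```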

### Looks good!

Now, let's cut right into it (evaluate our model)

### We'll be using a few evaluation functions to determine if our model is a good one or not

I'll briefly explain what each of the evaluation metrics we'll be using does before we get to the code:

- Accuracy: tells us how often our model is correct
- Precision: tells us how many of the positive (churn) predictions are actually correct
- Recall: tells us how many of the actual positive (churn) cases were correctly predicted
- F1 Score: the harmonic mean of precision and recall, which balances the two
- Confusion Matrix: a breakdown of the model's predictions into four categories, laid out as


[[true negative, false positive],

[false negative, true positive]]

In [21]:
eval_metrics = {
    'Accuracy': accuracy_score(test_y, pred),
    'Precision': precision_score(test_y, pred),
    'Recall': recall_score(test_y, pred),
    'F1-score': f1_score(test_y, pred),
    'Confusion Matrix': confusion_matrix(test_y, pred)
}

Let's run this dictionary through a for loop to see our evaluation figures

In [22]:
for metric_name, metric_value in eval_metrics.items():
    print(metric_name, ":", metric_value)

Accuracy : 0.5933333333333334
Precision : 0.36764705882352944
Recall : 0.08710801393728224
F1-score : 0.14084507042253522
Confusion Matrix : [[420  43]
 [262  25]]
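
To make the definitions concrete, the four scores can be recomputed by hand from the confusion matrix printed above. This is just arithmetic on the reported counts, not a new result.

```python
# Counts taken from the confusion matrix printed above
tn, fp = 420, 43
fn, tp = 262, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # share of predicted churners who really churned
recall = tp / (tp + fn)                     # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

Read this way, the low recall means the model caught only 25 of the 287 customers who actually churned.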


# FEATURE ENGINEERING DOCUMENTATION

### Model 1
- All usable columns
> Result:
    - Accuracy : 0.5813333333333334
    - Precision : 0.35789473684210527
    - Recall : 0.11846689895470383
    - F1-score : 0.17801047120418848
    - Confusion Matrix :
    
        [[402  61]
    
        [253  34]]

### Model 2
- All usable columns except ['account_length']
> Result: Slightly better
    - Accuracy : 0.584
    - Precision : 0.3655913978494624
    - Recall : 0.11846689895470383
    - F1-score : 0.17894736842105263
    - Confusion Matrix :
            [[404  59]
            [253  34]]

### Model 3
- All usable columns except ['total_day_charge', 'total_eve_charge', 'total_night_charge','total_intl_charge','account_length']
> Result: Slightly better
    - Accuracy : 0.5866666666666667
    - Precision : 0.3707865168539326
    - Recall : 0.11498257839721254
    - F1-score : 0.17553191489361702
    - Confusion Matrix :
            [[407  56]
            [254  33]]

### Model 4
- Columns: ['international_plan', 'voice_mail_plan', 'total_intl_minutes', 'total_calls', 'total_minutes', 'total_charges', 'total_intl_calls', 'total_intl_charge' ]
> Result: Slightly better accuracy (but lower precision, recall, and F1)
    - Accuracy : 0.588
    - Precision : 0.34285714285714286
    - Recall : 0.08362369337979095
    - F1-score : 0.13445378151260506
    - Confusion Matrix :
            [[417  46]
            [263  24]]

### Model 5
- Columns: ['international_plan', 'voice_mail_plan', 'total_intl_minutes', 'total_calls', 'total_minutes', 'total_charges', 'total_intl_calls', 'total_intl_charge', 'total_day_minutes','total_day_calls', 'total_eve_minutes', 'total_eve_calls', 'total_night_minutes', 'total_night_calls']
> Result: Slightly better
    - Accuracy : 0.592
    - Precision : 0.36231884057971014
    - Recall: 0.08710801393728224
    - F1-score : 0.1404494382022472
    - Confusion Matrix :
    
            [[419  44]
            [262  25]]

### Model 6
- Columns: ['international_plan', 'voice_mail_plan', 'account_length', 'total_intl_minutes', 'total_calls', 'total_minutes', 'total_charges', 'total_intl_calls', 'total_intl_charge', 'total_day_minutes','total_day_calls', 'total_eve_minutes', 'total_eve_calls', 'total_night_minutes', 'total_night_calls']
> Result: Tiny bit better
    - Accuracy : 0.5933333333333334
    - Precision : 0.36764705882352944
    - Recall : 0.08710801393728224
    - F1-score : 0.14084507042253522
    - Confusion Matrix :
    
            [[420  43]
            [262  25]]
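
The six runs above differ by roughly one percentage point in accuracy, which is small enough that some of it may come from run-to-run randomness rather than from the columns themselves. One way to compare feature sets more fairly is to fix the seed and score each set with cross-validation on the training data. The sketch below uses synthetic data, and the column names, feature sets, and seed are all placeholders. It is not how the figures above were produced.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

# Hypothetical candidate feature sets to compare
feature_sets = {
    'all': list(X_demo.columns),
    'first_three': ['f0', 'f1', 'f2'],
}

cv_f1 = {}
for name, cols in feature_sets.items():
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    # 5-fold cross-validated F1 score, averaged over the folds
    cv_f1[name] = cross_val_score(model, X_demo[cols], y_demo, cv=5, scoring='f1').mean()

print(cv_f1)
```

Scoring on F1 rather than accuracy is a deliberate choice here, since accuracy alone can look acceptable while most churners are missed.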




        




# We'll go with model 6! Thanks for sticking around!

# VISUALS USING PLOTLY

## Identifying some key factors responsible for churning using visuals

In [23]:
import plotly.express as px

# Creating a 3D scatter plot using plotly express
fig = px.scatter_3d(train_data, x='total_calls', y='total_minutes', z='total_charges', color='churn')

fig.update_layout(
    scene=dict(
        xaxis=dict(title='Total Calls'),
        yaxis=dict(title='Total Minutes'),
        zaxis=dict(title='Total Charges'),
    ),
    width=800,
    height=600,
    autosize=True,
    title='Churning rate based on total values'
)

fig.show()



- The 3D scatter plot showed some clustering of non-churned customers, although churned customers were still present.
- There were more non-churned customers than churned ones, and higher levels of calls, minutes, and charges were associated with a higher likelihood of churn.
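
Alongside the plot, a fitted random forest exposes a `feature_importances_` attribute that ranks the columns by how much they contributed to the trees' splits. The sketch below runs on synthetic data with placeholder column names, so the ranking it prints says nothing about the real churn data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
cols = [f'feature_{i}' for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=cols)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Importances are non-negative and sum to 1; higher means more influence on the splits
importances = pd.Series(model.feature_importances_, index=cols).sort_values(ascending=False)
print(importances)
```

In this notebook the equivalent would be `pd.Series(churn_model.feature_importances_, index=features)`. A high importance shows that the model relied on a column, not that the column causes churn.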

# THE END