<a href="https://colab.research.google.com/github/mjdabendoh/maxime/blob/main/TelecomChurn_CaseStudy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About Dataset
Problem Statement
In the telecom industry, customers are able to choose from multiple service providers and actively switch from one operator to another. In this highly competitive market, the telecommunications industry experiences an average of 15-25% annual churn rate. Given the fact that it costs 5-10 times more to acquire a new customer than to retain an existing one, customer retention has now become even more important than customer acquisition.

For many incumbent operators, retaining highly profitable customers is the number one business goal. To reduce customer churn, telecom companies need to predict which customers are at high risk of churn. In this project, you will analyze customer-level data of a leading telecom firm and build predictive models to identify customers at high risk of churn.

In this competition, your goal is to build a machine learning model that is able to predict churning customers based on the features provided for their usage.

Dataset Description

Files

train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format
metaData.csv - supplemental information about the data

# Explore the Dataset First

In [None]:
import os
import pandas as pd
import numpy as np


# Using matplotlib to visualise the model
import matplotlib.pyplot as plt

# Using seaborn for data visualisation
import seaborn as sns

from datetime import datetime


from sklearn import set_config
from sklearn.base import BaseEstimator
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.impute import SimpleImputer


from sklearn.linear_model import Ridge
from joblib import dump, load

%matplotlib inline

print("All modules are imported  !!!")

All modules are imported  !!!


# Load Data

As we can see, the zip file contains four datasets:
    _Dictionary file
    _Sample file (a sample with the target variable "churn_probability")
    _Test file  (without the target variable; we keep this file for prediction)
    _Train file (we use this file for our purpose)
To keep our data consistent, every modification applied to the train data must also be applied to the test data

In [None]:
# explore data_dictionary

df_dic = pd.read_csv("/Users/abendohmaximejeandidier/IntelligenceArtificielle/Hands_On/100-hands-on Project/Telecom Churn Case Study/archive (8)/data_dictionary.csv")
df_dic.head()

Unnamed: 0,Acronyms,Description
0,CIRCLE_ID,Telecom circle area to which the customer belo...
1,LOC,Local calls within same telecom circle
2,STD,STD calls outside the calling circle
3,IC,Incoming calls
4,OG,Outgoing calls


In [None]:
# Explore Sample_csv

df_sample =pd.read_csv("/Users/abendohmaximejeandidier/IntelligenceArtificielle/Hands_On/100-hands-on Project/Telecom Churn Case Study/archive (8)/sample (2).csv")
df_sample.head()

Unnamed: 0,id,churn_probability
0,69999,0
1,70000,0
2,70001,0
3,70002,0
4,70003,0


In [None]:
# Explore Test_csv

df_test =pd.read_csv("/Users/abendohmaximejeandidier/IntelligenceArtificielle/Hands_On/100-hands-on Project/Telecom Churn Case Study/archive (8)/test.csv")
df_test.head()

Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,arpu_6,arpu_7,...,sachet_3g_6,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g
0,69999,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,91.882,65.33,...,0,0,0,,,,1692,0.0,0.0,0.0
1,70000,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,414.168,515.568,...,0,0,0,,,,2533,0.0,0.0,0.0
2,70001,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,329.844,434.884,...,0,0,0,,,,277,525.61,758.41,241.84
3,70002,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,43.55,171.39,...,0,0,0,,,,1244,0.0,0.0,0.0
4,70003,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,306.854,406.289,...,0,0,0,,,,462,0.0,0.0,0.0


In [None]:
df_train =pd.read_csv("/Users/abendohmaximejeandidier/IntelligenceArtificielle/Hands_On/100-hands-on Project/Telecom Churn Case Study/archive (8)/train.csv")
df_train.head()

Unnamed: 0,id,circle_id,loc_og_t2o_mou,std_og_t2o_mou,loc_ic_t2o_mou,last_date_of_month_6,last_date_of_month_7,last_date_of_month_8,arpu_6,arpu_7,...,sachet_3g_7,sachet_3g_8,fb_user_6,fb_user_7,fb_user_8,aon,aug_vbc_3g,jul_vbc_3g,jun_vbc_3g,churn_probability
0,0,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,31.277,87.009,...,0,0,,,,1958,0.0,0.0,0.0,0
1,1,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,0.0,122.787,...,0,0,,1.0,,710,0.0,0.0,0.0,0
2,2,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,60.806,103.176,...,0,0,,,,882,0.0,0.0,0.0,0
3,3,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,156.362,205.26,...,0,0,,,,982,0.0,0.0,0.0,0
4,4,109,0.0,0.0,0.0,6/30/2014,7/31/2014,8/31/2014,240.708,128.191,...,1,0,1.0,1.0,1.0,647,0.0,0.0,0.0,0
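The four `read_csv` calls above repeat one long, machine-specific folder path. A minimal sketch of reading the same files from a single base folder follows; the helper name `load_dataset` and the `data_dir` argument are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_dataset(data_dir, filenames):
    """Read each CSV named in `filenames` from `data_dir` into a dict of DataFrames."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / name) for name in filenames}
```

With the folder stored once in a variable, a call such as `load_dataset(folder, ["data_dictionary.csv", "sample (2).csv", "test.csv", "train.csv"])` would return the four frames keyed by file name, which makes the notebook easier to run on another machine or on Colab.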


# Loading and explore the Train data

In [None]:
df_train.info()    # We have 172 columns, which is a lot just to predict churn_probability;
                   # we will reduce this number using the dictionary labels

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69999 entries, 0 to 69998
Columns: 172 entries, id to churn_probability
dtypes: float64(135), int64(28), object(9)
memory usage: 91.9+ MB


# Using the dictionary to get understandable labels
We will use the dictionary to replace the column labels with the dictionary's labels and drop the other columns
(those not included in the dictionary).

In [None]:
# Dictionary labels :
"""
CIRCLE_ID	Telecom circle area to which the customer belo...
1	LOC	Local calls within same telecom circle
2	STD	STD calls outside the calling circle
3	IC	Incoming calls
4	OG	Outgoing calls
'churn_probability'
"""
# Using the dictionary labels, we keep what are (I think) the most important fields in a new DataFrame
# The purpose here is to keep only the necessary fields

# Select specific labels and filter using regex patterns
selected_labels1 = df_train.filter(items=['id', 'churn_probability'])        # keep these fields, as seen in the sample file
selected_labels2 = df_train.filter(regex='^loc_.*|^std_.*|^ic_.*|^og_.*')    # fields selected from the dictionary acronyms

selected_labels = pd.concat([selected_labels1,selected_labels2], axis=1)

print(selected_labels, 5)

# To check the selected labels (note that the number of columns is reduced to 68 fields), run the two lines below:
# column_count = selected_labels.columns.size
# print(column_count)

In [None]:
# For test file
# Select specific labels and filter using regex patterns
selectedtest_labels1 = df_test.filter(items=['id'])        # keep this field, as seen in the sample file
selectedtest_labels2 = df_test.filter(regex='^loc_.*|^std_.*|^ic_.*|^og_.*')    # fields selected from the dictionary acronyms

selectedtest_labels = pd.concat([selectedtest_labels1,selectedtest_labels2], axis=1)

print(selected_labels, 5)

          id  churn_probability  loc_og_t2o_mou  std_og_t2o_mou  \
0          0                  0             0.0             0.0   
1          1                  0             0.0             0.0   
2          2                  0             0.0             0.0   
3          3                  0             0.0             0.0   
4          4                  0             0.0             0.0   
...      ...                ...             ...             ...   
69994  69994                  0             0.0             0.0   
69995  69995                  0             0.0             0.0   
69996  69996                  0             0.0             0.0   
69997  69997                  0             0.0             0.0   
69998  69998                  0             0.0             0.0   

       loc_ic_t2o_mou  loc_og_t2t_mou_6  loc_og_t2t_mou_7  loc_og_t2t_mou_8  \
0                 0.0              2.23              0.00              0.28   
1                 0.0              0.
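The train and test cells above apply the same selection twice by hand. One way to keep both files in step is a single helper applied to each frame. The sketch below runs on toy frames standing in for `df_train` and `df_test`; the names `USAGE_PATTERN` and `select_usage_columns` are hypothetical, and the pattern is meant to match the same prefixes as the regex used above.

```python
import pandas as pd

# Same prefixes as the regex used above: loc_, std_, ic_, og_
USAGE_PATTERN = r'^(loc|std|ic|og)_'


def select_usage_columns(df, keep=('id',)):
    """Return the `keep` columns present in df plus every column matching USAGE_PATTERN."""
    kept = df.filter(items=list(keep))
    usage = df.filter(regex=USAGE_PATTERN)
    return pd.concat([kept, usage], axis=1)


# Toy frames standing in for df_train / df_test
toy_train = pd.DataFrame({
    'id': [0, 1],
    'arpu_6': [31.277, 0.0],
    'loc_og_t2t_mou_6': [2.23, 0.0],
    'ic_others_6': [0.0, 0.1],
    'churn_probability': [0, 0],
})
toy_test = toy_train.drop(columns='churn_probability')

train_sel = select_usage_columns(toy_train, keep=('id', 'churn_probability'))
test_sel = select_usage_columns(toy_test)
print(train_sel.columns.tolist())
print(test_sel.columns.tolist())
```

Because both frames go through the same function, a later change to the pattern cannot be applied to one file and forgotten on the other.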

# Explore & Understand the Data

In [None]:
# Adjust the display options to show all columns
pd.set_option('display.max_columns', None)

# selected_labels.describe()  #  Can't see the whole set this way
selected_labels.describe().transpose() # we can see that a lot of the data is null, let's go further


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,69999.0,34999.000000,20207.115084,0.0,17499.50,34999.00,52498.500,69998.00
churn_probability,69999.0,0.101887,0.302502,0.0,0.00,0.00,0.000,1.00
loc_og_t2o_mou,69297.0,0.000000,0.000000,0.0,0.00,0.00,0.000,0.00
std_og_t2o_mou,69297.0,0.000000,0.000000,0.0,0.00,0.00,0.000,0.00
loc_ic_t2o_mou,69297.0,0.000000,0.000000,0.0,0.00,0.00,0.000,0.00
...,...,...,...,...,...,...,...,...
std_ic_mou_7,67312.0,33.760809,114.142230,0.0,0.00,5.98,28.160,6745.76
std_ic_mou_8,66296.0,33.077030,108.469864,0.0,0.03,5.83,27.615,5658.74
ic_others_6,67231.0,0.854063,12.149144,0.0,0.00,0.00,0.000,1362.94
ic_others_7,67312.0,1.019680,13.225373,0.0,0.00,0.00,0.000,1495.94


# Missing Values

In [None]:
#df_test.isnull().sum()
selectedtest_labels.isnull().sum()

id                     0
loc_og_t2o_mou       316
std_og_t2o_mou       316
loc_ic_t2o_mou       316
loc_og_t2t_mou_6    1169
                    ... 
std_ic_mou_7        1172
std_ic_mou_8        1675
ic_others_6         1169
ic_others_7         1172
ic_others_8         1675
Length: 67, dtype: int64

In [None]:
selected_labels.isnull().sum()

id                      0
churn_probability       0
loc_og_t2o_mou        702
std_og_t2o_mou        702
loc_ic_t2o_mou        702
                     ... 
std_ic_mou_7         2687
std_ic_mou_8         3703
ic_others_6          2768
ic_others_7          2687
ic_others_8          3703
Length: 68, dtype: int64
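The raw counts above are hard to compare between the two files, because the train file has 69999 rows and the test file has 30000. A small sketch that reports the share of missing values per column instead; the helper name `missing_share` and the toy frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


# Toy frame standing in for selected_labels
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'ic_others_8': [0.0, np.nan, 0.36, np.nan],
    'ic_others_6': [0.0, 0.2, np.nan, 0.0],
})
share = missing_share(toy)
print(share)
```

Calling the same helper on `selected_labels` and `selectedtest_labels` would put both files on a common 0 to 1 scale.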

In [None]:
# Drop fields where every value is equal to 0
selected_labels_filtered = selected_labels.loc[:, ~(selected_labels.eq(0).all())]

# Print the filtered DataFrame
print(selected_labels_filtered)

          id  churn_probability  loc_og_t2o_mou  std_og_t2o_mou  \
0          0                  0             0.0             0.0   
1          1                  0             0.0             0.0   
2          2                  0             0.0             0.0   
3          3                  0             0.0             0.0   
4          4                  0             0.0             0.0   
...      ...                ...             ...             ...   
69994  69994                  0             0.0             0.0   
69995  69995                  0             0.0             0.0   
69996  69996                  0             0.0             0.0   
69997  69997                  0             0.0             0.0   
69998  69998                  0             0.0             0.0   

       loc_ic_t2o_mou  loc_og_t2t_mou_6  loc_og_t2t_mou_7  loc_og_t2t_mou_8  \
0                 0.0              2.23              0.00              0.28   
1                 0.0              0.
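The output above still contains `loc_og_t2o_mou` and the other all-zero fields. A likely reason is that these columns also hold missing values, and a missing value does not compare equal to 0, so `eq(0).all()` is False for them. The sketch below shows the effect on a toy frame and one possible alternative based on `nunique`; the helper name `constant_columns` is hypothetical.

```python
import numpy as np
import pandas as pd


def constant_columns(df):
    """Names of columns with at most one distinct non-missing value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


toy = pd.DataFrame({
    'id': [0, 1, 2],
    'loc_og_t2o_mou': [0.0, np.nan, 0.0],    # zeros plus one missing value
    'loc_og_t2t_mou_6': [2.23, 0.0, 0.53],
})

all_zero_mask = toy.eq(0).all()
print(all_zero_mask.tolist())      # no column is flagged, because NaN == 0 is False
print(constant_columns(toy))       # the constant column is found despite the NaN
```

The returned list could then be passed to `drop(columns=...)` on both the train and the test frame.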

In [None]:
# For test file
# Drop fields where every value is equal to 0
selectedtest_labels_filtered = selectedtest_labels.loc[:, ~(selectedtest_labels.eq(0).all())]

# Print the filtered DataFrame
print(selectedtest_labels_filtered)

In [None]:
# Specify the all-zero fields to exclude (see the describe() output)
fields_to_exclude = ['loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou','std_ic_t2o_mou_6','std_ic_t2o_mou_7',
                     'std_ic_t2o_mou_8']

# Select fields except the ones to exclude
selected_labels_filtered = selected_labels.drop(fields_to_exclude, axis=1)

# Print the filtered DataFrame
print(selected_labels_filtered)

          id  churn_probability  loc_og_t2t_mou_6  loc_og_t2t_mou_7  \
0          0                  0              2.23              0.00   
1          1                  0              0.00              0.00   
2          2                  0              0.53             12.98   
3          3                  0              6.99              3.94   
4          4                  0             10.16              4.83   
...      ...                ...               ...               ...   
69994  69994                  0              0.00              2.44   
69995  69995                  0              7.18             30.11   
69996  69996                  0             77.13             44.28   
69997  69997                  0             10.88              7.64   
69998  69998                  0              0.00              0.00   

       loc_og_t2t_mou_8  loc_og_t2m_mou_6  loc_og_t2m_mou_7  loc_og_t2m_mou_8  \
0                  0.28              5.29             16.04       

In [None]:
# For test file
# Specify the all-zero fields to exclude (see the describe() output)
fields_to_exclude = ['loc_og_t2o_mou', 'std_og_t2o_mou', 'loc_ic_t2o_mou','std_ic_t2o_mou_6','std_ic_t2o_mou_7',
                     'std_ic_t2o_mou_8']

# Select fields except the ones to exclude
selectedtest_labels_filtered = selectedtest_labels.drop(fields_to_exclude, axis=1)

# Print the filtered DataFrame
print(selectedtest_labels_filtered)

          id  loc_og_t2t_mou_6  loc_og_t2t_mou_7  loc_og_t2t_mou_8  \
0      69999             24.88             20.23             21.06   
1      70000             75.51             41.21             19.84   
2      70001              0.00              0.00              0.00   
3      70002              5.31              0.00              0.00   
4      70003              0.45              0.78             14.56   
...      ...               ...               ...               ...   
29995  99994            214.99            233.96            277.24   
29996  99995              5.08             17.33             13.16   
29997  99996             11.08              6.66              7.58   
29998  99997              1.06              4.81              0.00   
29999  99998              0.00              5.09             17.41   

       loc_og_t2m_mou_6  loc_og_t2m_mou_7  loc_og_t2m_mou_8  loc_og_t2f_mou_6  \
0                 18.13             10.89              8.36              0.00 

In [None]:
# Control
selected_labels_filtered.describe().transpose() # it looks pretty good, but we need to go deeper

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,69999.0,34999.000000,20207.115084,0.0,17499.50,34999.00,52498.500,69998.00
churn_probability,69999.0,0.101887,0.302502,0.0,0.00,0.00,0.000,1.00
loc_og_t2t_mou_6,67231.0,46.904854,150.971758,0.0,1.66,11.91,40.740,6431.33
loc_og_t2t_mou_7,67312.0,46.166503,154.739002,0.0,1.65,11.58,39.760,7400.66
loc_og_t2t_mou_8,66296.0,45.686109,153.716880,0.0,1.61,11.74,39.895,10752.56
...,...,...,...,...,...,...,...,...
std_ic_mou_7,67312.0,33.760809,114.142230,0.0,0.00,5.98,28.160,6745.76
std_ic_mou_8,66296.0,33.077030,108.469864,0.0,0.03,5.83,27.615,5658.74
ic_others_6,67231.0,0.854063,12.149144,0.0,0.00,0.00,0.000,1362.94
ic_others_7,67312.0,1.019680,13.225373,0.0,0.00,0.00,0.000,1495.94


# Summary statistics

In [None]:
# Check that the statistics are consistent
pd.set_option('display.max_columns', None)
selected_labels_filtered.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,69999.0,34999.000000,20207.115084,0.0,17499.50,34999.00,52498.500,69998.00
churn_probability,69999.0,0.101887,0.302502,0.0,0.00,0.00,0.000,1.00
localoutgoing_t2t_mou_6,67231.0,46.904854,150.971758,0.0,1.66,11.91,40.740,6431.33
localoutgoing_t2t_mou_7,67312.0,46.166503,154.739002,0.0,1.65,11.58,39.760,7400.66
localoutgoing_t2t_mou_8,66296.0,45.686109,153.716880,0.0,1.61,11.74,39.895,10752.56
...,...,...,...,...,...,...,...,...
standardincoming_mou_7,67312.0,33.760809,114.142230,0.0,0.00,5.98,28.160,6745.76
standardincoming_mou_8,66296.0,33.077030,108.469864,0.0,0.03,5.83,27.615,5658.74
incoming_others_6,67231.0,0.854063,12.149144,0.0,0.00,0.00,0.000,1362.94
incoming_others_7,67312.0,1.019680,13.225373,0.0,0.00,0.00,0.000,1495.94


In [None]:
selected_labels_filtered.isnull().sum()

id                      0
churn_probability       0
loc_og_t2t_mou_6     2768
loc_og_t2t_mou_7     2687
loc_og_t2t_mou_8     3703
                     ... 
std_ic_mou_7         2687
std_ic_mou_8         3703
ic_others_6          2768
ic_others_7          2687
ic_others_8          3703
Length: 62, dtype: int64

In [None]:
# selected_labels_filtered['churn_probability'].value_counts()    # Check the counts (0/1) of the target value
# df_train['churn_probability'].value_counts(normalize = True)    # Check the proportions of the target value
# selected_labels_filtered.columns.tolist()                       # Useful for relabelling in a more understandable way
selected_labels_filtered.info()                                   # Check the dtype of our data


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69999 entries, 0 to 69998
Data columns (total 62 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          69999 non-null  int64  
 1   churn_probability           69999 non-null  int64  
 2   localoutgoing_t2t_mou_6     67231 non-null  float64
 3   localoutgoing_t2t_mou_7     67312 non-null  float64
 4   localoutgoing_t2t_mou_8     66296 non-null  float64
 5   localoutgoing_t2m_mou_6     67231 non-null  float64
 6   localoutgoing_t2m_mou_7     67312 non-null  float64
 7   localoutgoing_t2m_mou_8     66296 non-null  float64
 8   localoutgoing_t2f_mou_6     67231 non-null  float64
 9   localoutgoing_t2f_mou_7     67312 non-null  float64
 10  localoutgoing_t2f_mou_8     66296 non-null  float64
 11  localoutgoing_t2c_mou_6     67231 non-null  float64
 12  localoutgoing_t2c_mou_7     67312 non-null  float64
 13  localoutgoing_t2c_mou_8     662

# Rename the columns with more readable labels

In [None]:
# Get the list of columns in the DataFrame
columns = selected_labels_filtered.columns.tolist()

# Loop through each column and check if it starts with the specified prefixes
for column in columns:
    if column.startswith(("loc_og_", "std_og_", "og_", "loc_ic_", "std_ic_", "ic_")):
        # Generate the new label by replacing the prefixes with the corresponding replacements
        new_label = (
            column.replace("loc_og_", "localoutgoing_")
            .replace("std_og_", "standardoutgoing_")
            .replace("og_", "outgoing_")
            .replace("loc_ic_", "localincoming_")
            .replace("std_ic_", "standardincoming_")
            .replace("ic_", "incoming_")
        )

        # Rename the column using the new label
        selected_labels_filtered = selected_labels_filtered.rename(columns={column: new_label})

# Print the updated DataFrame
print(selected_labels_filtered)

          id  churn_probability  localoutgoing_t2t_mou_6  \
0          0                  0                     2.23   
1          1                  0                     0.00   
2          2                  0                     0.53   
3          3                  0                     6.99   
4          4                  0                    10.16   
...      ...                ...                      ...   
69994  69994                  0                     0.00   
69995  69995                  0                     7.18   
69996  69996                  0                    77.13   
69997  69997                  0                    10.88   
69998  69998                  0                     0.00   

       localoutgoing_t2t_mou_7  localoutgoing_t2t_mou_8  \
0                         0.00                     0.28   
1                         0.00                     0.00   
2                        12.98                     0.00   
3                         3.94             

In [None]:
# For test file
# Get the list of columns in the DataFrame
columns = selectedtest_labels_filtered.columns.tolist()

# Loop through each column and check if it starts with the specified prefixes
for column in columns:
    if column.startswith(("loc_og_", "std_og_", "og_", "loc_ic_", "std_ic_", "ic_")):
        # Generate the new label by replacing the prefixes with the corresponding replacements
        new_label = (
            column.replace("loc_og_", "localoutgoing_")
            .replace("std_og_", "standardoutgoing_")
            .replace("og_", "outgoing_")
            .replace("loc_ic_", "localincoming_")
            .replace("std_ic_", "standardincoming_")
            .replace("ic_", "incoming_")
        )

        # Rename the column using the new label
        selectedtest_labels_filtered = selectedtest_labels_filtered.rename(columns={column: new_label})

# Print the updated DataFrame
print(selectedtest_labels_filtered)

          id  localoutgoing_t2t_mou_6  localoutgoing_t2t_mou_7  \
0      69999                    24.88                    20.23   
1      70000                    75.51                    41.21   
2      70001                     0.00                     0.00   
3      70002                     5.31                     0.00   
4      70003                     0.45                     0.78   
...      ...                      ...                      ...   
29995  99994                   214.99                   233.96   
29996  99995                     5.08                    17.33   
29997  99996                    11.08                     6.66   
29998  99997                     1.06                     4.81   
29999  99998                     0.00                     5.09   

       localoutgoing_t2t_mou_8  localoutgoing_t2m_mou_6  \
0                        21.06                    18.13   
1                        19.84                   473.61   
2                         0.00
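The chained `replace` calls in the two cells above depend on the order of the calls and match the text anywhere in the column name, not only at the start. A possible alternative, sketched here on a toy frame, is a prefix table applied through `rename`; the names `PREFIX_LABELS` and `readable_label` are assumptions, and the labels are the ones used in the cells above.

```python
import pandas as pd

# Longer prefixes come first so that "loc_og_" is tried before the bare "og_".
PREFIX_LABELS = [
    ('loc_og_', 'localoutgoing_'),
    ('std_og_', 'standardoutgoing_'),
    ('loc_ic_', 'localincoming_'),
    ('std_ic_', 'standardincoming_'),
    ('og_', 'outgoing_'),
    ('ic_', 'incoming_'),
]


def readable_label(column):
    """Replace a known prefix at the start of the column name; leave other names unchanged."""
    for prefix, label in PREFIX_LABELS:
        if column.startswith(prefix):
            return label + column[len(prefix):]
    return column


# Toy frame with a few column names taken from the dataset
toy = pd.DataFrame(columns=['id', 'loc_og_t2t_mou_6', 'og_others_6', 'std_ic_mou_7', 'ic_others_8'])
renamed = toy.rename(columns=readable_label)
print(renamed.columns.tolist())
```

The same function can be passed to `rename` for both the train and the test frame, so the two files cannot end up with different labels.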

# Let's summarize our data

In [None]:
# `shape` is a pandas attribute (not a function) that gives the number of rows and columns.
# It is a tuple: .shape[0] is the number of rows and .shape[1] is the number of columns.

print("Rows :", selected_labels_filtered.shape[0])
print(f"Columns : {selected_labels_filtered.shape[1]}")
print(f"\nFeatures :\n {selected_labels_filtered.columns.to_list()}")
print("\nUniques Value:\n", selected_labels_filtered.nunique())
print('\nMissing Values:\n', selected_labels_filtered.isnull().sum().values.sum())

# We still have some missing values but, as I understand the dataset, customers make outgoing calls in certain areas
# and not in others, and vice versa, which is why the missing values are, at this stage, relevant information.

Rows : 69999
Columns : 62

Features :
 ['id', 'churn_probability', 'localoutgoing_t2t_mou_6', 'localoutgoing_t2t_mou_7', 'localoutgoing_t2t_mou_8', 'localoutgoing_t2m_mou_6', 'localoutgoing_t2m_mou_7', 'localoutgoing_t2m_mou_8', 'localoutgoing_t2f_mou_6', 'localoutgoing_t2f_mou_7', 'localoutgoing_t2f_mou_8', 'localoutgoing_t2c_mou_6', 'localoutgoing_t2c_mou_7', 'localoutgoing_t2c_mou_8', 'localoutgoing_mou_6', 'localoutgoing_mou_7', 'localoutgoing_mou_8', 'standardoutgoing_t2t_mou_6', 'standardoutgoing_t2t_mou_7', 'standardoutgoing_t2t_mou_8', 'standardoutgoing_t2m_mou_6', 'standardoutgoing_t2m_mou_7', 'standardoutgoing_t2m_mou_8', 'standardoutgoing_t2f_mou_6', 'standardoutgoing_t2f_mou_7', 'standardoutgoing_t2f_mou_8', 'standardoutgoing_t2c_mou_6', 'standardoutgoing_t2c_mou_7', 'standardoutgoing_t2c_mou_8', 'standardoutgoing_mou_6', 'standardoutgoing_mou_7', 'standardoutgoing_mou_8', 'outgoing_others_6', 'outgoing_others_7', 'outgoing_others_8', 'localincoming_t2t_mou_6', 'localincomi

In [None]:
# For test file
# Check the modifications and some statistics:
print("Rows :", selectedtest_labels_filtered.shape[0])
print(f"Columns : {selectedtest_labels_filtered.shape[1]}")
print(f"\nFeatures :\n {selectedtest_labels_filtered.columns.to_list()}")
print("\nUniques Value:\n", selectedtest_labels_filtered.nunique())
print('\nMissing Values:\n', selectedtest_labels_filtered.isnull().sum().values.sum())


Rows : 30000
Columns : 61

Features :
 ['id', 'localoutgoing_t2t_mou_6', 'localoutgoing_t2t_mou_7', 'localoutgoing_t2t_mou_8', 'localoutgoing_t2m_mou_6', 'localoutgoing_t2m_mou_7', 'localoutgoing_t2m_mou_8', 'localoutgoing_t2f_mou_6', 'localoutgoing_t2f_mou_7', 'localoutgoing_t2f_mou_8', 'localoutgoing_t2c_mou_6', 'localoutgoing_t2c_mou_7', 'localoutgoing_t2c_mou_8', 'localoutgoing_mou_6', 'localoutgoing_mou_7', 'localoutgoing_mou_8', 'standardoutgoing_t2t_mou_6', 'standardoutgoing_t2t_mou_7', 'standardoutgoing_t2t_mou_8', 'standardoutgoing_t2m_mou_6', 'standardoutgoing_t2m_mou_7', 'standardoutgoing_t2m_mou_8', 'standardoutgoing_t2f_mou_6', 'standardoutgoing_t2f_mou_7', 'standardoutgoing_t2f_mou_8', 'standardoutgoing_t2c_mou_6', 'standardoutgoing_t2c_mou_7', 'standardoutgoing_t2c_mou_8', 'standardoutgoing_mou_6', 'standardoutgoing_mou_7', 'standardoutgoing_mou_8', 'outgoing_others_6', 'outgoing_others_7', 'outgoing_others_8', 'localincoming_t2t_mou_6', 'localincoming_t2t_mou_7', 'local

In [None]:
# Loop through each field and print the value counts
pd.set_option('display.max_columns', None)
for field in selected_labels_filtered:
    print(f"Value counts for {field}:")
    print(selected_labels_filtered[field].value_counts())
    print()

Value counts for id:
0        1
46664    1
46670    1
46669    1
46668    1
        ..
23338    1
23339    1
23340    1
23341    1
69998    1
Name: id, Length: 69999, dtype: int64

Value counts for churn_probability:
0    62867
1     7132
Name: churn_probability, dtype: int64

Value counts for localoutgoing_t2t_mou_6:
0.00      11140
0.33        100
0.31         78
0.43         77
0.48         77
          ...  
121.68        1
161.08        1
298.43        1
280.74        1
140.89        1
Name: localoutgoing_t2t_mou_6, Length: 11491, dtype: int64

Value counts for localoutgoing_t2t_mou_7:
0.00       11016
0.48          81
1.01          79
0.43          78
0.41          76
           ...  
264.14         1
263.23         1
121.91         1
1331.94        1
365.33         1
Name: localoutgoing_t2t_mou_7, Length: 11359, dtype: int64

Value counts for localoutgoing_t2t_mou_8:
0.00      10937
0.38         97
0.48         89
0.33         85
0.43         84
          ...  
130.06        1
6

# At this stage,
> I assume that our data looks correct for the train csv file
> The test data must be processed the same way as the train data; we will do it later (as we use that dataset for prediction)
> So we can now go further

In [None]:
# Let's look at the data again, with the last 10 rows:
selected_labels_filtered.tail(10)

Unnamed: 0,id,churn_probability,localoutgoing_t2t_mou_6,localoutgoing_t2t_mou_7,localoutgoing_t2t_mou_8,localoutgoing_t2m_mou_6,localoutgoing_t2m_mou_7,localoutgoing_t2m_mou_8,localoutgoing_t2f_mou_6,localoutgoing_t2f_mou_7,localoutgoing_t2f_mou_8,localoutgoing_t2c_mou_6,localoutgoing_t2c_mou_7,localoutgoing_t2c_mou_8,localoutgoing_mou_6,localoutgoing_mou_7,localoutgoing_mou_8,standardoutgoing_t2t_mou_6,standardoutgoing_t2t_mou_7,standardoutgoing_t2t_mou_8,standardoutgoing_t2m_mou_6,standardoutgoing_t2m_mou_7,standardoutgoing_t2m_mou_8,standardoutgoing_t2f_mou_6,standardoutgoing_t2f_mou_7,standardoutgoing_t2f_mou_8,standardoutgoing_t2c_mou_6,standardoutgoing_t2c_mou_7,standardoutgoing_t2c_mou_8,standardoutgoing_mou_6,standardoutgoing_mou_7,standardoutgoing_mou_8,outgoing_others_6,outgoing_others_7,outgoing_others_8,localincoming_t2t_mou_6,localincoming_t2t_mou_7,localincoming_t2t_mou_8,localincoming_t2m_mou_6,localincoming_t2m_mou_7,localincoming_t2m_mou_8,localincoming_t2f_mou_6,localincoming_t2f_mou_7,localincoming_t2f_mou_8,localincoming_mou_6,localincoming_mou_7,localincoming_mou_8,standardincoming_t2t_mou_6,standardincoming_t2t_mou_7,standardincoming_t2t_mou_8,standardincoming_t2m_mou_6,standardincoming_t2m_mou_7,standardincoming_t2m_mou_8,standardincoming_t2f_mou_6,standardincoming_t2f_mou_7,standardincoming_t2f_mou_8,standardincoming_mou_6,standardincoming_mou_7,standardincoming_mou_8,incoming_others_6,incoming_others_7,incoming_others_8
69989,69989,0,0.0,0.0,5.83,39.28,32.03,30.76,0.0,0.0,0.15,1.41,0.13,0.01,39.28,32.03,36.74,1.05,0.0,0.0,4.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.09,0.0,0.0,0.5,0.0,0.0,2.88,0.0,10.38,89.46,34.63,52.23,3.13,0.0,1.36,95.48,34.63,63.98,1.91,0.0,0.0,5.13,0.0,0.0,0.0,0.0,0.0,7.04,0.0,0.0,0.0,0.0,0.0
69990,69990,0,3.01,8.86,16.41,9.53,24.58,35.14,0.0,0.0,0.0,0.0,0.0,12.68,12.54,33.44,51.56,3.66,7.91,0.0,224.54,784.44,727.39,0.0,0.0,0.0,0.0,0.0,0.0,228.21,792.36,727.39,0.0,0.0,0.0,5.69,9.69,16.18,11.39,42.36,62.03,0.0,0.0,11.94,17.09,52.06,90.16,0.0,0.45,1.0,9.21,38.98,53.86,0.0,0.0,0.0,9.21,39.43,54.86,0.2,0.0,0.0
69991,69991,0,0.0,1.65,2.75,0.0,70.14,27.24,0.0,0.38,0.0,0.0,0.0,1.93,0.0,72.18,29.99,0.0,0.0,0.0,0.0,36.81,7.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,36.81,7.96,0.0,0.0,0.0,0.0,0.0,0.0,1.33,14.83,18.46,0.0,2.31,0.0,1.33,17.14,18.46,0.0,0.0,0.38,1.21,0.0,0.0,0.0,0.0,0.0,1.21,0.0,0.38,0.0,0.0,0.0
69992,69992,0,29.59,7.94,17.91,30.43,13.04,27.29,0.0,0.0,0.0,1.46,1.45,0.31,60.03,20.99,45.21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,42.14,40.53,69.71,51.31,57.19,52.88,3.78,0.25,5.79,97.24,97.98,128.39,0.0,0.1,0.0,0.53,4.13,2.31,0.0,0.86,0.0,0.53,5.09,2.31,0.38,0.0,0.0
69993,69993,0,26.48,53.69,0.0,13.16,47.68,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39.64,101.38,0.0,18.59,40.61,0.0,98.34,188.93,0.0,0.0,0.0,0.0,0.0,0.0,0.0,116.94,229.54,0.0,0.0,0.0,0.0,2.43,27.51,0.0,16.99,26.88,2.38,0.81,0.0,0.0,20.24,54.39,2.38,0.08,5.53,0.0,17.04,102.73,5.16,0.0,0.0,0.0,17.13,108.26,5.16,0.0,0.0,0.0
69994,69994,0,0.0,2.44,7.19,0.0,60.64,89.66,0.0,0.0,0.0,0.0,2.43,0.86,0.0,63.09,96.86,0.0,4.91,3.73,0.0,414.61,290.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,419.53,293.88,0.0,0.0,0.0,0.0,26.59,33.84,0.0,172.33,223.91,0.0,1.06,0.0,0.0,199.99,257.76,0.0,0.0,0.0,0.0,21.99,11.79,0.0,0.0,0.0,0.0,21.99,11.79,0.0,0.0,0.0
69995,69995,0,7.18,30.11,9.06,37.53,73.84,47.34,2.01,0.0,0.0,0.0,4.01,0.0,46.73,103.96,56.41,109.36,166.34,223.56,9.98,18.41,0.53,0.0,0.0,0.0,0.0,0.0,0.0,119.34,184.76,224.09,0.0,0.0,0.0,30.48,28.48,23.09,21.78,35.18,28.79,2.38,0.21,0.0,54.64,63.88,51.89,16.63,39.23,66.28,8.96,9.31,17.24,0.0,0.0,0.0,25.59,48.54,83.53,0.0,0.0,0.08
69996,69996,0,77.13,44.28,78.44,143.19,82.58,138.26,142.58,141.26,125.58,0.0,4.1,0.0,362.91,268.13,342.29,0.0,24.16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24.16,0.0,0.0,0.0,0.0,46.41,30.29,86.53,143.94,147.01,177.73,339.11,236.16,147.74,529.48,413.48,412.01,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,2.48,2.5,0.0,2.48,5.14,3.09,0.0
69997,69997,0,10.88,7.64,6.71,4.44,6.66,8.84,7.99,1.45,2.86,0.0,0.0,0.0,23.33,15.76,18.43,2.15,0.0,0.0,14.3,8.56,0.85,0.0,0.0,0.0,0.0,0.0,0.0,16.45,8.56,0.85,0.0,0.0,0.0,11.36,3.64,1.04,0.66,1.68,3.94,0.34,4.28,2.81,12.38,9.61,7.81,3.7,4.61,1.3,2.74,2.01,7.36,0.0,0.0,1.28,6.44,6.63,9.94,0.0,0.0,0.0
69998,69998,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.21,4.31,0.96,2.68,38.71,31.69,0.43,5.78,0.0,5.33,48.81,32.66,0.0,0.0,0.0,0.0,16.28,0.0,0.0,0.0,0.0,0.0,16.28,0.0,2.8,0.0,0.36


### TRAINING AND TESTING ###
As noted earlier, the test file is reserved for prediction, which means
the training file is used for both training and testing.

In [None]:
# Define the number of nodes in the hidden layer
selected_labels_filtered_x.shape[1]         # Convention applied for the hidden layer: number of features / 2

61
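The rule of thumb in the comment above can be written as a small helper. This is only a sketch: `hidden_units` is a hypothetical name that does not appear elsewhere in the notebook, and the pipeline built later uses `(64, 32)` rather than the value this rule gives.

```python
def hidden_units(n_features: int) -> int:
    """Rule of thumb: hidden-layer width = number of features / 2 (integer division)."""
    return max(1, n_features // 2)


n_hidden = hidden_units(61)  # 61 is the feature count reported above
print(n_hidden)
```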

In [None]:
selected_labels_filtered_x = selected_labels_filtered.loc[:, selected_labels_filtered.columns!='churn_probability']


target_y = selected_labels_filtered['churn_probability']

datatest_x = selectedtest_labels_filtered



In [None]:
# Replace NaN values with the column mean
from sklearn.impute import SimpleImputer

selected_labels_filtered_x = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(selected_labels_filtered_x)
datatest_x = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(datatest_x)
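A common variant is to learn the column means on the training features only and reuse them for the test file, so that both are filled with the same values. The sketch below illustrates this on toy arrays; `toy_train` and `toy_test` are hypothetical stand-ins, not the telecom data, and the approach assumes both matrices have the same columns in the same order.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy stand-ins for the training and test feature matrices (hypothetical values)
toy_train = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
toy_test = np.array([[np.nan, 5.0], [2.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_train_filled = imputer.fit_transform(toy_train)  # learns one mean per column
toy_test_filled = imputer.transform(toy_test)        # reuses the training means

print(imputer.statistics_)   # column means learned from toy_train
print(toy_test_filled)
```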

In [None]:
# Split the training file into training and testing subsets
from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(selected_labels_filtered_x, target_y, test_size=0.2)
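If churners turn out to be a minority class, a stratified split keeps their share equal in both subsets, and a fixed seed makes the split repeatable. The following sketch uses toy labels (hypothetical, 10% positives); `stratify` and `random_state` are standard `train_test_split` parameters, but they are not used in the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 90 non-churners, 10 churners (hypothetical data)
toy_x = np.arange(200).reshape(100, 2)
toy_y = np.array([0] * 90 + [1] * 10)

tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# Share of churners in each subset
print(tr_y.mean(), te_y.mean())
```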

# Building the pipeline:

As this is a classification problem, we will use a multi-layer perceptron (MLP) neural network, `MLPClassifier`, from Scikit-Learn.

# The features span very different ranges, which creates computation problems when training the network
> to fix it we will use "StandardScaler" from sklearn (the missing values were already imputed above)
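To make the effect of `StandardScaler` concrete, here is a minimal sketch on a toy matrix (hypothetical values): each column is shifted to mean 0 and rescaled to unit standard deviation, regardless of its original range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # per-column mean after scaling
print(scaled.std(axis=0))   # per-column standard deviation after scaling
```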

In [None]:
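# Create a pipeline with preprocessing and MLP classifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipeline = make_pipeline(
    StandardScaler(),  # Preprocessing step: standardize each feature
    MLPClassifier(hidden_layer_sizes=(64, 32), activation='logistic', solver='adam')  # MLP classifier
)

# Fit the pipeline on the training split
pipeline.fit(train_x, train_y)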



In [None]:
pipeline.score(train_x, train_y)

0.9383560420721798

In [None]:
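# Re-create the same pipeline and fit it on the held-out split.
# Note: this refits the model on test_x, so the scores that follow are not
# out-of-sample estimates for the model trained on train_x.
pipeline = make_pipeline(
    StandardScaler(),  # Preprocessing step
    MLPClassifier(hidden_layer_sizes=(64, 32), activation='logistic', solver='adam')  # MLP classifier
)


pipeline.fit(test_x, test_y)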



In [None]:
pipeline.score(test_x, test_y)

0.946

In [None]:
print("Let's summarize:")
print(f"Accuracy on the training data: {round(pipeline.score(train_x, train_y)*100,2)}%")
print(f"Accuracy on the test data: {round(pipeline.score(test_x, test_y)*100,2)}%")

Let's summarize:
Accuracy on the training data: 91.98%
Accuracy on the test data: 94.6%
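For an estimate of how the model behaves on unseen rows, the usual pattern is to fit once on the training split and score that same fitted model on the held-out split. The sketch below shows the pattern on synthetic data from `make_classification`; the data, the `demo_*` and `d_*` names, `max_iter=500` and the fixed seeds are assumptions for illustration, so the accuracies it prints say nothing about the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (not the telecom data)
demo_x, demo_y = make_classification(n_samples=400, n_features=10, random_state=0)
d_train_x, d_test_x, d_train_y, d_test_y = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=0
)

demo_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation='logistic',
                  solver='adam', max_iter=500, random_state=0),
)
demo_pipeline.fit(d_train_x, d_train_y)  # fit once, on the training split only

train_acc = demo_pipeline.score(d_train_x, d_train_y)
test_acc = demo_pipeline.score(d_test_x, d_test_y)  # same fitted model, unseen rows
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```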


# Having followed the steps above to clean the data in the training file,
# we apply the same approach to the test file and make some predictions

In [None]:
# Predict on the test data
# print(f"The input data is:{datatest_x[0:12]}")
pipeline.predict(datatest_x[0:12])

array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0])
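`predict` returns hard 0/1 labels. Since the target column is called `churn_probability`, a per-customer score may also be useful; scikit-learn classifiers such as `MLPClassifier` expose `predict_proba` for that. A minimal sketch on synthetic data (the `proba_*` names, the data and the smaller network are assumptions, not part of the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the telecom features)
proba_x, proba_y = make_classification(n_samples=300, n_features=8, random_state=1)

proba_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=1),
)
proba_model.fit(proba_x, proba_y)

labels = proba_model.predict(proba_x[:5])                     # hard 0/1 labels
churn_scores = proba_model.predict_proba(proba_x[:5])[:, 1]   # estimated P(class 1)
print(labels)
print(churn_scores.round(3))
```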

In [None]:
### PRODUCTION ###

# Use the model for production:

In [None]:
# Save and load the model
# Create a pipeline with preprocessing and MLP classifier

import time

import joblib
import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn import set_config
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

datatest_x = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(datatest_x)
target_y = selected_labels_filtered['churn_probability']

model_pl = make_pipeline(
    StandardScaler(),  # Preprocessing step
    MLPClassifier(hidden_layer_sizes=(64, 32), activation='logistic', solver='adam'))

model_pl.fit(train_x, train_y)

set_config(print_changed_only=False)

# The file name says "ridge", but it stores the fitted MLP pipeline
dump(model_pl, 'ridge_model.joblib')
saved_model = joblib.load('ridge_model.joblib')
saved_model.predict(datatest_x[40:50])



array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
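A quick sanity check after saving is to confirm that the reloaded pipeline predicts exactly what the in-memory one does. The sketch below does this on synthetic data and writes to a temporary directory; the `rt_*` names, the data and the file name `demo_pipeline.joblib` are hypothetical.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the telecom features)
rt_x, rt_y = make_classification(n_samples=200, n_features=6, random_state=2)
rt_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), max_iter=300, random_state=2),
)
rt_model.fit(rt_x, rt_y)

# Save to a temporary location, then load it back
rt_path = os.path.join(tempfile.mkdtemp(), 'demo_pipeline.joblib')
dump(rt_model, rt_path)
rt_loaded = load(rt_path)

# The round trip should not change any prediction
same = np.array_equal(rt_model.predict(rt_x), rt_loaded.predict(rt_x))
print(same)
```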

In [None]:
start_time = time.time()
# Save the fitted pipeline (not a fresh, unfitted Ridge()) so the reloaded model can predict
dump(model_pl, r'\Users\abendohmaximejeandidier\IntelligenceArtificielle\Hands_On\100-hands-on Project\Ridge_model.joblib')
print("Saved model in {} seconds".format(time.time() - start_time))


Saved model in 0.0021729469299316406 seconds


In [None]:
model = joblib.load(r'\Users\abendohmaximejeandidier\IntelligenceArtificielle\Hands_On\100-hands-on Project\Ridge_model.joblib')
print("model exists.")
Y_pred = model

model exists.


In [None]:
# To be reviewed: predict on a slice of the imputed test features
model.predict(datatest_x[50:70])

# Using Flask

In [None]:
# Accept POST requests only. The server speaks plain HTTP,
# so keep it for internal use within the company.

from io import StringIO

import pandas as pd
from flask import Flask, request, jsonify
from joblib import load

model_app = Flask(__name__)

# Load the pipeline saved earlier
ml_model = load('ridge_model.joblib')

@model_app.route('/predict', methods=['POST'])
def predict_churnprobability():
    # The request body is the JSON-encoded table of input features
    input_features = request.get_data(as_text=True)
    input_df = pd.read_json(StringIO(input_features))
    result = ml_model.predict(input_df.to_numpy())
    return jsonify(result.tolist())

if __name__ == '__main__':
    model_app.run(debug=False, port=49152)


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:49152
Press CTRL+C to quit


In [None]:
!curl -X POST http://127.0.0.1:49152/predict -d @sample_test_json_features.json --header "Content-Type: application/json"
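The endpoint parses the request body with `pd.read_json`, so the payload file has to be JSON that pandas can turn back into a table. One way to produce such a file is `DataFrame.to_json`, sketched below. The feature values are hypothetical, the real payload would need as many columns as the model was trained on, and the file is written to a temporary directory here rather than next to the notebook.

```python
import os
import tempfile
from io import StringIO

import numpy as np
import pandas as pd

# Hypothetical feature rows; the real payload would hold the imputed test features
rows = pd.DataFrame(np.array([[0.0, 1.65, 2.75], [29.59, 7.94, 17.91]]))

payload_path = os.path.join(tempfile.mkdtemp(), 'sample_test_json_features.json')
rows.to_json(payload_path)  # default orient for a DataFrame is 'columns'

with open(payload_path) as fh:
    body = fh.read()

# Mirror what the endpoint does with the request body
decoded = pd.read_json(StringIO(body))
print(decoded.shape)
```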