## Business Understanding 

Companies need to understand their customers' behaviours, tastes and preferences to keep a competitive edge. Churn analysis estimates how likely customers are to stop using a company's products or services. Proactive, data-driven businesses can use analytics to analyse customer behaviour, plan retention activities, and prioritise their resources effectively. The goal of this project is to develop a machine learning model for a telecommunications company that predicts the likelihood of a customer churning (i.e. becoming inactive and making no transactions for 90 days). The model will help the telecom company serve its customers better by identifying which of them are at risk of leaving.


The dataset contains the following columns:

1. `user_id`: user identifier.
2. `REGION`: location (geographic region) of each client. Categorical.
3. `TENURE`: duration on the network. Categorical tenure band (e.g., `K > 24 month`).
4. `MONTANT`: top-up amount. Monetary value.
5. `FREQUENCE_RECH`: number of times the customer refilled. Numeric frequency.
6. `REVENUE`: monthly income of each client. Monetary value.
7. `ARPU_SEGMENT`: income over 90 days / 3. Monetary value.
8. `FREQUENCE`: number of times the client has made an income. Numeric frequency.
9. `DATA_VOLUME`: number of connections. Numeric data volume.
10. `ON_NET`: inter-Expresso calls. Numeric usage on the operator's own network.
11. `ORANGE`: calls to Orange. Numeric usage on the Orange network.
12. `TIGO`: calls to Tigo. Numeric usage on the Tigo network.
13. `ZONE1`: calls to zone 1. Numeric usage in a specific zone.
14. `ZONE2`: calls to zone 2. Numeric usage in a specific zone.
15. `MRG`: a client who is going. Categorical flag (merged or not merged), stored as text such as `NO`.
16. `REGULARITY`: number of times the client is active over 90 days. Numeric regularity value.
17. `TOP_PACK`: the most active packs. Categorical package name.
18. `FREQ_TOP_PACK`: number of times the client has activated the top pack packages. Numeric frequency.
19. `CHURN`: variable to predict (target). Binary (churned or not churned).
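
The description of `ARPU_SEGMENT` (income over 90 days / 3) can be sanity-checked with a little arithmetic. The sketch below uses only the `REVENUE` and `ARPU_SEGMENT` values visible in the non-null rows of the `df_train.head()` output; the `sample` frame and the `REVENUE_DIV_3` column are illustrative names, and whether the relationship holds across the full dataset is an assumption that would need checking.

```python
import pandas as pd

# REVENUE and ARPU_SEGMENT values copied from the non-null rows of df_train.head()
sample = pd.DataFrame({
    "REVENUE": [21602.0, 7896.0, 12351.0],
    "ARPU_SEGMENT": [7201.0, 2632.0, 4117.0],
})

# Recompute ARPU_SEGMENT as REVENUE / 3, rounded to the nearest integer
sample["REVENUE_DIV_3"] = (sample["REVENUE"] / 3).round()
matches = bool((sample["REVENUE_DIV_3"] == sample["ARPU_SEGMENT"]).all())

print(sample)
print("ARPU_SEGMENT == round(REVENUE / 3):", matches)
```

If the same check passes on the full data, the two columns carry nearly the same information, which may be worth keeping in mind when choosing features.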


#### Goal 
#### Hypothesis
#### Analytic questions

## Data understanding 

In [1]:
# Import the necessary libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split 
from imblearn.over_sampling import RandomOverSampler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import joblib


In [2]:
#load the train dataset
df_train = pd.read_csv('data/Train (2).csv')

In [3]:
#load the test dataset
df_test =  pd.read_csv('data/Test.csv')

In [4]:
#check the head of the train dataset 
df_train.head(5)

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK,CHURN
0,7ee9e11e342e27c70455960acc80d3f91c1286d1,DAKAR,K > 24 month,20000.0,47.0,21602.0,7201.0,52.0,8835.0,3391.0,396.0,185.0,,,NO,62,On net 200F=Unlimited _call24H,30.0,0
1,50443f42bdc92b10388fc56e520e4421a5fa655c,,K > 24 month,,,,,,,,,,,,NO,3,,,0
2,da90b5c1a9b204c186079f89969aa01cb03c91b2,,K > 24 month,,,,,,,,,,,,NO,1,,,0
3,364ec1b424cdc64c25441a444a16930289a0051e,SAINT-LOUIS,K > 24 month,7900.0,19.0,7896.0,2632.0,25.0,9385.0,27.0,46.0,20.0,,2.0,NO,61,"Data:490F=1GB,7d",7.0,0
4,d5a5247005bc6d41d3d99f4ef312ebb5f640f2cb,DAKAR,K > 24 month,12350.0,21.0,12351.0,4117.0,29.0,9360.0,66.0,102.0,34.0,,,NO,56,All-net 500F=2000F;5d,11.0,0


In [5]:
#check the head of the test dataset 
df_test.head(5)

Unnamed: 0,user_id,REGION,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,ZONE1,ZONE2,MRG,REGULARITY,TOP_PACK,FREQ_TOP_PACK
0,51fe4c3347db1f8571d18ac03f716c41acee30a4,MATAM,I 18-21 month,2500.0,5.0,2500.0,833.0,5.0,0.0,64.0,70.0,,,,NO,35,All-net 500F=2000F;5d,5.0
1,5ad5d67c175bce107cc97b98c4e37dcc38aa7f3e,,K > 24 month,,,,,,,,,,,,NO,2,,
2,5a4db591c953a8d8f373877fad37aaf4268899a1,,K > 24 month,,,,,,0.0,,,,,,NO,22,,
3,8bf9b4d8880aeba1c9a0da48be78f12e629be37c,,K > 24 month,,,,,,,,,,,,NO,6,,
4,c7cdf2af01e9fa95bf498b68c122aa4b9a8d10df,SAINT-LOUIS,K > 24 month,5100.0,7.0,5637.0,1879.0,15.0,7783.0,30.0,24.0,0.0,0.0,,NO,60,"Data:1000F=2GB,30d",4.0


In [6]:
#check the info of the train dataset
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1077024 entries, 0 to 1077023
Data columns (total 19 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   user_id         1077024 non-null  object 
 1   REGION          652687 non-null   object 
 2   TENURE          1077024 non-null  object 
 3   MONTANT         699139 non-null   float64
 4   FREQUENCE_RECH  699139 non-null   float64
 5   REVENUE         714669 non-null   float64
 6   ARPU_SEGMENT    714669 non-null   float64
 7   FREQUENCE       714669 non-null   float64
 8   DATA_VOLUME     547261 non-null   float64
 9   ON_NET          683850 non-null   float64
 10  ORANGE          629880 non-null   float64
 11  TIGO            432250 non-null   float64
 12  ZONE1           84898 non-null    float64
 13  ZONE2           68794 non-null    float64
 14  MRG             1077024 non-null  object 
 15  REGULARITY      1077024 non-null  int64  
 16  TOP_PACK        626129 non-null   ob

In [7]:
#check the info of the test dataset
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 190063 entries, 0 to 190062
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   user_id         190063 non-null  object 
 1   REGION          115330 non-null  object 
 2   TENURE          190063 non-null  object 
 3   MONTANT         123695 non-null  float64
 4   FREQUENCE_RECH  123695 non-null  float64
 5   REVENUE         126422 non-null  float64
 6   ARPU_SEGMENT    126422 non-null  float64
 7   FREQUENCE       126422 non-null  float64
 8   DATA_VOLUME     96716 non-null   float64
 9   ON_NET          120771 non-null  float64
 10  ORANGE          111417 non-null  float64
 11  TIGO            76555 non-null   float64
 12  ZONE1           14850 non-null   float64
 13  ZONE2           12011 non-null   float64
 14  MRG             190063 non-null  object 
 15  REGULARITY      190063 non-null  int64  
 16  TOP_PACK        110773 non-null  object 
 17  FREQ_TOP_P

In [8]:
# checking for null values on the train dataset
df_train.isnull().sum()

user_id                 0
REGION             424337
TENURE                  0
MONTANT            377885
FREQUENCE_RECH     377885
REVENUE            362355
ARPU_SEGMENT       362355
FREQUENCE          362355
DATA_VOLUME        529763
ON_NET             393174
ORANGE             447144
TIGO               644774
ZONE1              992126
ZONE2             1008230
MRG                     0
REGULARITY              0
TOP_PACK           450895
FREQ_TOP_PACK      450895
CHURN                   0
dtype: int64

In [9]:
#checking for the null value on the test dataset
df_test.isnull().sum()

user_id                0
REGION             74733
TENURE                 0
MONTANT            66368
FREQUENCE_RECH     66368
REVENUE            63641
ARPU_SEGMENT       63641
FREQUENCE          63641
DATA_VOLUME        93347
ON_NET             69292
ORANGE             78646
TIGO              113508
ZONE1             175213
ZONE2             178052
MRG                    0
REGULARITY             0
TOP_PACK           79290
FREQ_TOP_PACK      79290
dtype: int64

In [10]:
#checking the shape of the train dataset 
df_train.shape

(1077024, 19)

In [11]:
#checking the shape of the test data
df_test.shape

(190063, 18)

In [12]:
# Check for duplicate rows in the train dataset
df_train.duplicated().sum()


0

In [13]:
# Check for duplicate rows in the test dataset
df_test.duplicated().sum()

0

In [14]:
# Check the summary statistics of the train dataset
df_train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MONTANT,699139.0,5529.210895,7104.737952,20.0,1000.0,3000.0,7350.0,470000.0
FREQUENCE_RECH,699139.0,11.523756,13.261938,1.0,2.0,7.0,16.0,131.0
REVENUE,714669.0,5506.050798,7175.62501,1.0,1000.0,3000.0,7360.0,532177.0
ARPU_SEGMENT,714669.0,1835.355961,2391.870902,0.0,333.0,1000.0,2453.0,177392.0
FREQUENCE,714669.0,13.974439,14.687059,1.0,3.0,9.0,20.0,91.0
DATA_VOLUME,547261.0,3368.801722,12898.928039,0.0,0.0,258.0,2905.0,1702309.0
ON_NET,683850.0,277.065798,874.315378,0.0,5.0,27.0,156.0,50809.0
ORANGE,629880.0,95.160804,203.020261,0.0,7.0,29.0,99.0,12040.0
TIGO,432250.0,23.105018,64.035464,0.0,2.0,6.0,20.0,4174.0
ZONE1,84898.0,8.167483,39.245883,0.0,0.0,1.0,3.0,2507.0


In [15]:
#  What is the percentage of missing values in train columns
percent_missing_values =( (df_train.isna().sum()/len(df_train)) * 100)

percent_missing_values.round(2)

user_id            0.00
REGION            39.40
TENURE             0.00
MONTANT           35.09
FREQUENCE_RECH    35.09
REVENUE           33.64
ARPU_SEGMENT      33.64
FREQUENCE         33.64
DATA_VOLUME       49.19
ON_NET            36.51
ORANGE            41.52
TIGO              59.87
ZONE1             92.12
ZONE2             93.61
MRG                0.00
REGULARITY         0.00
TOP_PACK          41.86
FREQ_TOP_PACK     41.86
CHURN              0.00
dtype: float64

In [16]:
# What is the percentage of missing values in the test columns?
percent_missing_values =( (df_test.isna().sum()/len(df_test)) * 100)

percent_missing_values.round(2)

user_id            0.00
REGION            39.32
TENURE             0.00
MONTANT           34.92
FREQUENCE_RECH    34.92
REVENUE           33.48
ARPU_SEGMENT      33.48
FREQUENCE         33.48
DATA_VOLUME       49.11
ON_NET            36.46
ORANGE            41.38
TIGO              59.72
ZONE1             92.19
ZONE2             93.68
MRG                0.00
REGULARITY         0.00
TOP_PACK          41.72
FREQ_TOP_PACK     41.72
dtype: float64
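
The count and percentage checks above can be folded into one reusable helper so that both datasets are summarised the same way. This is a minimal sketch: `missing_summary` and the `toy` frame are hypothetical names introduced here for illustration, and the toy values only mimic the shape of the real data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isna().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Show the columns with the most missing data first
    return summary.sort_values("percent", ascending=False)


# Toy frame standing in for df_train or df_test
toy = pd.DataFrame({
    "REGION": ["DAKAR", np.nan, np.nan, "SAINT-LOUIS"],
    "MONTANT": [20000.0, np.nan, np.nan, 7900.0],
    "REGULARITY": [62, 3, 1, 61],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df_train)` and `missing_summary(df_test)` would then give the two tables above in a single, sorted view.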

#### Cleaning the dataset by filling in the missing values

In [17]:
# Fill missing values in each numeric column of the train dataset with the column mean.
# Assigning the result back avoids chained inplace fills, which newer pandas versions warn about.
numeric_columns = df_train.select_dtypes(include='number').columns
for column in numeric_columns:
    df_train[column] = df_train[column].fillna(df_train[column].mean())


In [18]:
# Fill missing REGION values with the most frequent region (mode imputation)
df_train['REGION'] = df_train['REGION'].fillna(df_train['REGION'].mode()[0])


In [19]:
# Fill missing TOP_PACK values with the constant 'Unknown'
df_train['TOP_PACK'] = df_train['TOP_PACK'].fillna('Unknown')


In [20]:
df_train.isnull().sum()

user_id           0
REGION            0
TENURE            0
MONTANT           0
FREQUENCE_RECH    0
REVENUE           0
ARPU_SEGMENT      0
FREQUENCE         0
DATA_VOLUME       0
ON_NET            0
ORANGE            0
TIGO              0
ZONE1             0
ZONE2             0
MRG               0
REGULARITY        0
TOP_PACK          0
FREQ_TOP_PACK     0
CHURN             0
dtype: int64

In [21]:
# Fill missing values in each numeric column of the test dataset with the column mean
numeric_columns = df_test.select_dtypes(include='number').columns
for column in numeric_columns:
    df_test[column] = df_test[column].fillna(df_test[column].mean())


In [22]:
# Fill missing REGION values with the most frequent region (mode imputation)
df_test['REGION'] = df_test['REGION'].fillna(df_test['REGION'].mode()[0])

In [23]:
# Fill missing TOP_PACK values with the constant 'Unknown'
df_test['TOP_PACK'] = df_test['TOP_PACK'].fillna('Unknown')

In [24]:
df_test.isnull().sum()

user_id           0
REGION            0
TENURE            0
MONTANT           0
FREQUENCE_RECH    0
REVENUE           0
ARPU_SEGMENT      0
FREQUENCE         0
DATA_VOLUME       0
ON_NET            0
ORANGE            0
TIGO              0
ZONE1             0
ZONE2             0
MRG               0
REGULARITY        0
TOP_PACK          0
FREQ_TOP_PACK     0
dtype: int64
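
The cells above impute each dataset with its own means and mode. A common alternative, sketched below under stated assumptions, is to compute the fill values on the training data only and reuse them on the test data, so that both sets are treated identically and nothing is learned from the test set. The `train` and `test` toy frames and the `fill_values` dictionary are hypothetical names used only for illustration; this is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for df_train and df_test
train = pd.DataFrame({
    "MONTANT": [1000.0, np.nan, 3000.0],
    "REGION": ["DAKAR", "DAKAR", np.nan],
    "TOP_PACK": ["All-net 500F=2000F;5d", np.nan, np.nan],
})
test = pd.DataFrame({
    "MONTANT": [np.nan, 500.0],
    "REGION": [np.nan, "MATAM"],
    "TOP_PACK": [np.nan, "All-net 500F=2000F;5d"],
})

# Learn every fill value from the training data only
fill_values = {
    "MONTANT": train["MONTANT"].mean(),     # mean imputation
    "REGION": train["REGION"].mode()[0],    # mode imputation
    "TOP_PACK": "Unknown",                  # constant imputation
}

# Apply the same fill values to both datasets
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

The same idea extends to the evaluation split: statistics computed before `train_test_split` also include the held-out rows, so computing them after the split is the stricter option.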

#### EDA

#### Univariate analysis

#### Bivariate analysis

#### Multivariate analysis

## Data Preparation 

#### Data preprocessing

In [25]:
# Drop REGION, MRG and TOP_PACK (judged to contribute little to predicting the target),
# ZONE1 and ZONE2 (more than 90% of their values are missing), and user_id (an identifier, not a feature).
# Missing values in the numeric columns were already replaced with column means in the cleaning step.

df_train.drop(columns=['REGION', 'MRG', 'TOP_PACK', 'ZONE1', 'ZONE2', 'user_id'], inplace=True)

In [26]:
df_train.columns

Index(['TENURE', 'MONTANT', 'FREQUENCE_RECH', 'REVENUE', 'ARPU_SEGMENT',
       'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'TIGO', 'REGULARITY',
       'FREQ_TOP_PACK', 'CHURN'],
      dtype='object')

In [27]:
df_test.drop(columns=['REGION', 'MRG', 'TOP_PACK', 'ZONE1', 'ZONE2','user_id'], inplace=True)

In [28]:
df_test.columns

Index(['TENURE', 'MONTANT', 'FREQUENCE_RECH', 'REVENUE', 'ARPU_SEGMENT',
       'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'TIGO', 'REGULARITY',
       'FREQ_TOP_PACK'],
      dtype='object')

In [29]:
# Separate the features (X) from the target variable (y)
X = df_train.drop(['CHURN'], axis=1)
y = df_train['CHURN']

X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2, random_state=42)
(X_train.shape, y_train.shape),(X_eval.shape, y_eval.shape),(df_test.shape)

(((861619, 12), (861619,)), ((215405, 12), (215405,)), (190063, 12))
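
The split above is a plain random split. With an imbalanced target, a stratified split is a common variation that keeps the class ratio the same in both parts. The sketch below uses synthetic data (`X_toy`, `y_toy` are hypothetical, with a churn rate of 19% chosen to be close to the real one) and the `stratify` parameter of `train_test_split`; it is an illustration, not what this notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 1000 rows, 19% positives
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 190 + [0] * 810)

# stratify=y_toy keeps the share of positives equal in both splits
X_tr, X_ev, y_tr, y_ev = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train churn rate:", y_tr.mean())
print("eval churn rate: ", y_ev.mean())
```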

In [30]:
X

Unnamed: 0,TENURE,MONTANT,FREQUENCE_RECH,REVENUE,ARPU_SEGMENT,FREQUENCE,DATA_VOLUME,ON_NET,ORANGE,TIGO,REGULARITY,FREQ_TOP_PACK
0,K > 24 month,20000.000000,47.000000,21602.000000,7201.000000,52.000000,8835.000000,3391.000000,396.000000,185.000000,62,30.000000
1,K > 24 month,5529.210895,11.523756,5506.050798,1835.355961,13.974439,3368.801722,277.065798,95.160804,23.105018,3,9.262446
2,K > 24 month,5529.210895,11.523756,5506.050798,1835.355961,13.974439,3368.801722,277.065798,95.160804,23.105018,1,9.262446
3,K > 24 month,7900.000000,19.000000,7896.000000,2632.000000,25.000000,9385.000000,27.000000,46.000000,20.000000,61,7.000000
4,K > 24 month,12350.000000,21.000000,12351.000000,4117.000000,29.000000,9360.000000,66.000000,102.000000,34.000000,56,11.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
1077019,K > 24 month,5529.210895,11.523756,5506.050798,1835.355961,13.974439,3368.801722,277.065798,95.160804,23.105018,16,9.262446
1077020,K > 24 month,2500.000000,5.000000,2500.000000,833.000000,5.000000,0.000000,15.000000,77.000000,23.105018,34,2.000000
1077021,K > 24 month,5529.210895,11.523756,5506.050798,1835.355961,13.974439,3368.801722,277.065798,95.160804,23.105018,3,9.262446
1077022,K > 24 month,600.000000,1.000000,600.000000,200.000000,1.000000,591.000000,11.000000,37.000000,5.000000,16,1.000000


In [31]:
y

0          0
1          0
2          0
3          0
4          0
          ..
1077019    0
1077020    0
1077021    1
1077022    0
1077023    0
Name: CHURN, Length: 1077024, dtype: int64

#### Balancing the data

In [32]:
# Check whether the target classes are balanced
class_counts = df_train['CHURN'].value_counts()
print(class_counts)



CHURN
0    875031
1    201993
Name: count, dtype: int64


### Since the classes are imbalanced, we oversample the minority class in the training split

In [33]:
ros = RandomOverSampler(random_state=0)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
print("Class Distribution Before Oversampling:")
print(y_train.value_counts())

print("Class Distribution After Oversampling:")
print(y_train_resampled.value_counts())

Class Distribution Before Oversampling:
CHURN
0    700072
1    161547
Name: count, dtype: int64
Class Distribution After Oversampling:
CHURN
0    700072
1    700072
Name: count, dtype: int64
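
The size of the imbalance can be worked out directly from the counts printed above. The sketch below computes how many minority rows random oversampling has to duplicate and the ratio of negatives to positives. That ratio is the kind of value often passed to XGBoost's `scale_pos_weight` parameter as an alternative to resampling; using it here is a suggestion, not something this notebook does.

```python
# Class counts copied from the "before oversampling" output of the training split
n_negative = 700072  # CHURN == 0
n_positive = 161547  # CHURN == 1

# Rows that oversampling must add so that both classes have 700072 rows
rows_added = n_negative - n_positive

# Number of non-churners for every churner in the training split
imbalance_ratio = n_negative / n_positive

print(f"minority rows duplicated by oversampling: {rows_added}")
print(f"negatives per positive: {imbalance_ratio:.2f}")
```

Because the duplicated rows are exact copies, oversampling is applied to the training split only, as done above, so that the evaluation split keeps the original class distribution.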


## Modeling 

In [34]:
# The data has both numerical and categorical columns
# Categorical columns
categorical_cols = ['TENURE']

numerical_cols = ['MONTANT', 'FREQUENCE_RECH', 'REVENUE', 'ARPU_SEGMENT',
       'FREQUENCE', 'DATA_VOLUME', 'ON_NET', 'ORANGE', 'TIGO', 'REGULARITY', 'FREQ_TOP_PACK']

In [35]:
# Define numerical and categorical transformers
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [36]:
# Combine transformers using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Create the full pipelines: shared preprocessing followed by a classifier
Xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('xgb_classifier', XGBClassifier())  # gradient-boosted trees model
])
decision_tree_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('decision_tree', DecisionTreeClassifier())  # single decision tree model
])


#### Fitting both pipelines on the oversampled training data

In [37]:
# Fit the XGBoost pipeline (preprocessing + classifier) on the oversampled training data

Xgb_pipeline.fit(X_train_resampled, y_train_resampled)


In [38]:
decision_tree_pipeline.fit(X_train_resampled, y_train_resampled)


In [39]:
# Make predictions
xgb_pred = Xgb_pipeline.predict(X_eval)
# Print classification reports with model names
print("\nClassification Report for XGBClassifier:")
print(classification_report(y_eval, xgb_pred))




Classification Report for XGBClassifier:
              precision    recall  f1-score   support

           0       0.96      0.79      0.87    174959
           1       0.49      0.86      0.62     40446

    accuracy                           0.80    215405
   macro avg       0.72      0.83      0.75    215405
weighted avg       0.87      0.80      0.82    215405



In [40]:
#make prediction
decision_tree_pred = decision_tree_pipeline.predict(X_eval)
#print classification report with the model names
print("\nClassification Report for DecisionTreeClassifier:")
print(classification_report(y_eval, decision_tree_pred))


Classification Report for DecisionTreeClassifier:
              precision    recall  f1-score   support

           0       0.95      0.80      0.87    174959
           1       0.49      0.80      0.61     40446

    accuracy                           0.80    215405
   macro avg       0.72      0.80      0.74    215405
weighted avg       0.86      0.80      0.82    215405
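
The F1 scores in the two reports follow from the precision and recall columns, since F1 is the harmonic mean of the two. The short sketch below recomputes the churn-class (class 1) F1 for both models from the rounded values printed above; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class-1 (churn) precision and recall copied from the two classification reports
xgb_f1 = f1_from(0.49, 0.86)
tree_f1 = f1_from(0.49, 0.80)

print(f"XGBClassifier churn F1: {xgb_f1:.2f}")
print(f"DecisionTreeClassifier churn F1: {tree_f1:.2f}")
```

Both models have the same churn precision (0.49), so the small F1 gap comes entirely from XGBoost's higher recall (0.86 versus 0.80).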



### Saving the models and pipelines

In [41]:

# Save the fitted XGBoost pipeline (preprocessing + model)
joblib.dump(Xgb_pipeline, 'xgb.joblib')

['xgb.joblib']

In [42]:
# Save the fitted decision tree pipeline (preprocessing + model)
joblib.dump( decision_tree_pipeline, 'dt.joblib')

['dt.joblib']
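
When saving and reloading with `joblib`, keeping the file path in one variable avoids the dump and load locations drifting apart. The sketch below shows a save and reload round trip on a tiny model; the temporary directory, the `model` subfolder, and the toy data are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny toy model standing in for a fitted pipeline
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
toy_model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Define the path once and reuse it for both dump and load
    model_path = Path(tmp_dir) / "model" / "dt.joblib"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(toy_model, model_path)
    restored = joblib.load(model_path)

# The restored model is already fitted and predicts exactly like the original
same_predictions = bool(
    np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
)
print(same_predictions)
```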

## Evaluation 

In [43]:
#  load the model 
loaded_model = joblib.load('xgb.joblib')  # same path that joblib.dump wrote to above

In [44]:
# The loaded pipeline was already fitted before it was saved, so it can predict without refitting
model = loaded_model

# Make predictions on the test data
test_predictions = model.predict(df_test)

test_predictions

array([0, 1, 0, ..., 1, 0, 1])
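
The prediction array above has no identifiers attached, because `user_id` was dropped from `df_test` during preprocessing. One way to keep predictions traceable, sketched below on hypothetical toy data, is to set the identifier aside before dropping it and then pair it with the predicted label and the churn probability from `predict_proba`. The column names `CHURN_PRED` and `CHURN_PROBA`, the toy labels, and the single-feature model are all illustrative assumptions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: low regularity is labelled as churn
toy_train = pd.DataFrame({
    "REGULARITY": [62, 3, 1, 61, 56, 2],
    "CHURN": [0, 1, 1, 0, 0, 1],
})
toy_test = pd.DataFrame({
    "user_id": ["a1", "b2", "c3"],
    "REGULARITY": [60, 2, 35],
})

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_train[["REGULARITY"]], toy_train["CHURN"])

# Keep user_id out of the features but carry it into the results
features = toy_test.drop(columns=["user_id"])
results = pd.DataFrame({
    "user_id": toy_test["user_id"],
    "CHURN_PRED": toy_model.predict(features),
    "CHURN_PROBA": toy_model.predict_proba(features)[:, 1],  # probability of class 1
})

print(results)
```

Ranking customers by the predicted probability, instead of using only the 0/1 label, lets retention efforts start with the customers most at risk.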