# Parameter tuning on the Telecom users dataset to predict user churn

## *Using all features*

In this notebook we tune the parameters of a GBM classifier on the Telecom users dataset (obtained from Kaggle: https://www.kaggle.com/radmirzosimov/telecom-users-dataset) to select the best parameters for predicting user churn, using all features of the dataset.

- OneHot encoding + Standardization
-------------------------------------------------------------------------------------------------------------------------------------------------------------

## 0. Import libraries

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd 
import seaborn as sns

import matplotlib.pyplot as plt

from collections import Counter

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

from sklearn.ensemble import GradientBoostingClassifier

print("Libraries imported!!")

Libraries imported!!


----------------------------------------------------------------------------------------
## 1. Load and read the dataset

Here we read the dataset, apply some type conversions, drop the columns we do not use, and display the resulting rows and column names.

In [2]:
df = pd.read_csv('telecom_users_train.csv')
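# Conversions: parse TotalCharges as numeric (invalid entries become NaN)
# and treat SeniorCitizen as a categorical attribute
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['SeniorCitizen'] = df['SeniorCitizen'].astype('object')

# Drop the index column, the customer identifier and TotalCharges
df = df.drop(['Unnamed: 0','customerID','TotalCharges'], axis=1)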
df

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,Churn
0,Female,1,Yes,No,28,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Mailed check,25.70,No
1,Female,0,No,No,6,No,No phone service,DSL,No,No,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,47.95,Yes
2,Male,0,No,No,55,Yes,Yes,Fiber optic,No,No,No,No,Yes,Yes,Month-to-month,Yes,Electronic check,96.80,Yes
3,Female,0,Yes,Yes,54,Yes,Yes,DSL,No,Yes,Yes,No,No,No,Two year,Yes,Bank transfer (automatic),59.80,No
4,Female,0,No,No,29,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Credit card (automatic),19.35,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4783,Male,0,Yes,No,72,Yes,Yes,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.40,No
4784,Male,0,Yes,Yes,66,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,Yes,Electronic check,20.35,No
4785,Male,0,No,No,5,Yes,No,DSL,No,No,Yes,Yes,Yes,No,Month-to-month,Yes,Credit card (automatic),63.95,No
4786,Female,0,Yes,Yes,43,Yes,No,DSL,Yes,No,Yes,No,Yes,Yes,Two year,No,Bank transfer (automatic),75.20,No
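
The `errors='coerce'` conversion used above can be illustrated on a toy column. This is a minimal sketch: the values below are hypothetical and are not taken from the dataset. Note that in this notebook `TotalCharges` is dropped right after the conversion, so the resulting NaN values never reach the model.

```python
import pandas as pd

# Hypothetical toy column: numeric strings plus one blank entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"], dtype=object)

# errors='coerce' replaces anything that cannot be parsed with NaN
converted = pd.to_numeric(raw, errors="coerce")

# Count how many entries could not be parsed
n_missing = int(converted.isna().sum())
print(converted)
print("Missing after conversion:", n_missing)
```

If a column like this were kept as a feature, the NaN entries would have to be imputed or removed before training.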


----------------------------------------------------------------------------------------
## 2. Preprocess the data

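One-hot encode the categorical attributes and standardize the data. In the cells shown here only the encoding is applied; `tenure` and `MonthlyCharges` keep their original scale.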

In [3]:
#label encode the categorical attributes (each category becomes an integer code)
data_train = df.copy()
cat_features = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport','StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling','PaymentMethod']
num_features = ['tenure','MonthlyCharges']

for f in cat_features :
    data_train[f] = LabelEncoder().fit_transform(data_train[f])
data_train

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,Churn
0,0,1,1,0,28,1,2,2,1,1,1,1,1,1,0,0,3,25.70,No
1,0,0,0,0,6,0,1,0,0,0,2,0,2,2,0,1,2,47.95,Yes
2,1,0,0,0,55,1,2,1,0,0,0,0,2,2,0,1,2,96.80,Yes
3,0,0,1,1,54,1,2,0,0,2,2,0,0,0,2,1,0,59.80,No
4,0,0,0,0,29,1,0,2,1,1,1,1,1,1,0,0,1,19.35,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4783,1,0,1,0,72,1,2,2,1,1,1,1,1,1,2,0,3,25.40,No
4784,1,0,1,1,66,1,0,2,1,1,1,1,1,1,1,1,2,20.35,No
4785,1,0,0,0,5,1,0,0,0,0,2,2,2,0,0,1,1,63.95,No
4786,0,0,1,1,43,1,0,0,2,0,2,0,2,2,2,0,0,75.20,No


In [4]:
#one-hot encode categorical attributes and join with numerical attributes
encoded = OneHotEncoder().fit_transform(data_train[cat_features]).toarray()
X_train = np.column_stack((encoded, data_train[num_features].values))
X_train.shape

(4788, 45)
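
As an alternative sketch (not the method used in this notebook), one-hot encoding and standardization can be combined in a single `ColumnTransformer`. `OneHotEncoder` accepts string categories directly, so the intermediate `LabelEncoder` pass is not needed in this variant. The toy frame and the names `toy`, `toy_cat`, `toy_num` and `preprocess` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame reusing four of the notebook's column names
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "tenure": [28, 6, 55, 54],
    "MonthlyCharges": [25.70, 47.95, 96.80, 59.80],
})
toy_cat = ["Contract", "Partner"]
toy_num = ["tenure", "MonthlyCharges"]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_cat),  # one column per category
        ("num", StandardScaler(), toy_num),                        # zero mean, unit variance
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 3 + 2 one-hot columns followed by 2 scaled columns
```

Tree-based models such as gradient boosting are largely insensitive to feature scaling, so the scaling step mainly matters if the same features are reused with other model families.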

In [5]:
#take the target attribute
target = 'Churn'
y_train = data_train[target].values
y_train.shape

(4788,)
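
Because the grid search below is scored on accuracy, it can help to know the accuracy of always predicting the majority class. A minimal sketch with hypothetical labels (`y_toy` is not the real target, and the resulting number says nothing about the actual dataset):

```python
from collections import Counter

import numpy as np

# Hypothetical labels standing in for y_train
y_toy = np.array(["No", "No", "Yes", "No", "Yes", "No"])

# Count each class and take the most frequent one
counts = Counter(y_toy)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class
baseline_accuracy = majority_count / len(y_toy)
print(majority_label, round(baseline_accuracy, 3))
```

Running the same lines on `y_train` gives the reference value that a tuned model should beat.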

----------------------------------------------------------------------------------------------------------------------------------------------------

## 3. Perform a GridSearch to find the best parameters of the model

In [6]:
#create a dictionary with the parameters
parameters = dict()
#'deviance' is the logistic loss; scikit-learn 1.1 and later name it 'log_loss'
parameters['loss'] = ('deviance','exponential')
parameters['learning_rate'] = [0.1, 0.01, 0.001, 0.0001]
parameters['n_estimators'] = [100, 200, 300, 400, 500]

#initialize the grid search
model = GradientBoostingClassifier(verbose=True, random_state=64)
grid_search = GridSearchCV(model, parameters, cv=10, scoring='accuracy')

#train the grid search
grid_search.fit(X_train, y_train)

#show the results
print('The best model has been ', grid_search.best_estimator_, ' with an accuracy of', grid_search.best_score_)

      Iter       Train Loss   Remaining Time 
         1           1.1107            0.79s
         2           1.0694            0.73s
         3           1.0353            0.74s
         4           1.0069            0.79s
         5           0.9835            0.78s
         6           0.9633            0.75s
         7           0.9459            0.74s
         8           0.9304            0.72s
         9           0.9170            0.71s
        10           0.9050            0.70s
        20           0.8369            0.59s
        30           0.8094            0.50s
        40           0.7927            0.43s
        50           0.7815            0.35s
        60           0.7734            0.28s
        70           0.7658            0.21s
        80           0.7601            0.14s
        90           0.7542            0.07s
       100           0.7486            0.00s
      Iter       Train Loss   Remaining Time 
         1           1.1105            0.79s
        

In [7]:
#show the results
print('The best model has been ', grid_search.best_estimator_, ' with an accuracy of', grid_search.best_score_)

The best model has been  GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.01, loss='exponential', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=500,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=64, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=True,
                           warm_start=False)  with an accuracy of 0.8032559988120299


The best cross-validated accuracy (about 0.803) was obtained with the following parameter configuration:

- loss = 'exponential'
- learning_rate = 0.01
- n_estimators = 500
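
The selected values can also be read from `best_params_`, and `cv_results_` shows how close the other candidates were. The sketch below uses synthetic data from `make_classification` and a deliberately small grid so that it runs quickly; the data, the grid and the names `search`, `small_grid` and `results` are assumptions for illustration, not the notebook's actual setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=64)

# Small grid: 2 x 2 = 4 candidate configurations
small_grid = {"learning_rate": [0.1, 0.01], "n_estimators": [20, 40]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=64),
    small_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Only the tuned parameters, without the full estimator repr
print(search.best_params_)

# One row per candidate, ordered from best to worst mean CV accuracy
results = pd.DataFrame(search.cv_results_)
ranking = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
]
print(ranking)
```

Comparing `mean_test_score` with `std_test_score` indicates whether the winning configuration is clearly better than the runners-up or within the fold-to-fold noise.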