# Building a model

In this notebook I test several machine learning models to find the one that best predicts customer churn.

### Load libraries

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
import pandas as pd
import numpy as np

### Load data

In [3]:
dataset = pd.read_csv('../data/transformed.csv', index_col='customerID')

Description of the features:
* Gender: The customer’s gender: Male, Female
* Senior Citizen: Indicates if the customer is 65 or older: Yes, No
* Partner: Indicates if the customer has a partner: Yes, No
* Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.
* Tenure: How long they’ve been a customer (in months)
* Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No
* Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No
* Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.
* Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No
* Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No
* Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No
* Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No
* Streaming TV: Indicates if the customer uses their Internet service to stream television programming from a third-party provider: Yes, No. The company does not charge an additional fee for this service
* Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third-party provider: Yes, No. The company does not charge an additional fee for this service
* Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year
* Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No
* Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check
* Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company
* Total Charges: Indicates the customer’s total charges
* Churn: Indicates if the customer has churned: Yes, No

In [4]:
dataset.head()

customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,Female,No,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
5575-GNVDE,Male,No,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
3668-QPYBK,Male,No,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
7795-CFOCW,Male,No,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
9237-HQITU,Female,No,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [5]:
dataset.shape

(7043, 20)

In [6]:
data = dataset.sample(frac=0.95, random_state=786)
data_validation = dataset.drop(data.index)

data.reset_index(inplace=True, drop=True)
data_validation.reset_index(inplace=True, drop=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Final Validation ' + str(data_validation.shape))

Data for Modeling: (6691, 20)
Unseen Data For Final Validation (352, 20)
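The `sample` / `drop` pattern above only yields a clean hold-out set if the index is unique, because `drop` removes every row whose label matches. The following is a minimal sketch on a made-up 20-row frame (the `toy` data and its `id-XX` labels are hypothetical, not taken from the real dataset) that checks the two parts are disjoint and together cover all rows:

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame(
    {"tenure": range(20), "Churn": ["Yes", "No"] * 10},
    index=pd.Index([f"id-{i:02d}" for i in range(20)], name="customerID"),
)

# Same pattern as in the notebook: sample 95% of rows, keep the rest as hold-out.
train = toy.sample(frac=0.95, random_state=786)
holdout = toy.drop(train.index)

# The two parts should not share any row and should add up to the original.
overlap = train.index.intersection(holdout.index)
print(len(train), len(holdout), len(overlap))

# Because the sample is not stratified, it is worth comparing class shares.
print(train["Churn"].value_counts(normalize=True))
```

On the real data the same checks can be run before the `reset_index` calls, since resetting discards the `customerID` labels.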


### Building a model

To iterate quickly through several models, I will use the [PyCaret](https://pycaret.org) library. It is an open-source, low-code machine learning library in Python that automates machine learning workflows.

The target, `Churn`, takes only two values (Yes or No), so this is a binary classification problem.

In [7]:
from pycaret.classification import *

Initialize the training environment and create the transformation pipeline.

In [14]:
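setup(
    data,
    target='Churn',
    silent=True,                      # skip the interactive data-type confirmation
    transformation=True,              # apply a power transform to numeric features
    bin_numeric_features=['tenure'],  # discretize tenure into bins
    normalize=True,                   # rescale numeric features
    normalize_method='minmax',        # ... to the [0, 1] range
);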

,Description,Value
0,session_id,8817
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"No: 0, Yes: 1"
4,Original Data,"(6691, 20)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,16
8,Ordinal Features,False
9,High Cardinality Features,False
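
With `normalize_method='minmax'`, each numeric feature is rescaled to the range [0, 1] using x' = (x - min) / (max - min). Below is a minimal sketch of that formula in plain Python, applied to the five `MonthlyCharges` values shown in the `dataset.head()` output. The `min_max_scale` helper is hypothetical and not part of PyCaret, which fits its scaler on the training data rather than on five rows.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# MonthlyCharges of the first five customers from dataset.head().
monthly_charges = [29.85, 56.95, 53.85, 42.3, 70.7]
scaled = min_max_scale(monthly_charges)

# The smallest value maps to 0, the largest to 1, the rest fall in between.
print([round(s, 3) for s in scaled])
```

Scaling matters most for distance-based models such as K Neighbors, where an unscaled `TotalCharges` would otherwise dominate features with a smaller range.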


Compare various models.

In [15]:
compare_models(include=['lr', 'knn', 'dt', 'svm', 'ridge'], fold=5);

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.8065,0.8492,0.5153,0.6803,0.5862,0.463,0.4708,0.026
ridge,Ridge Classifier,0.8023,0.0,0.4678,0.6885,0.557,0.4356,0.4492,0.008
svm,SVM - Linear Kernel,0.7803,0.0,0.4026,0.7223,0.4584,0.3468,0.392,0.014
knn,K Neighbors Classifier,0.7751,0.7939,0.5458,0.5828,0.5633,0.4122,0.4129,0.056
dt,Decision Tree Classifier,0.7406,0.6738,0.5313,0.5116,0.5209,0.3432,0.3436,0.01


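Logistic regression comes out on top in accuracy (0.8065), AUC (0.8492) and F1 (0.5862), but the gaps are small: accuracy ranges from 0.74 to 0.81 across the five models. Recall stays between 0.40 and 0.55, so every model misses roughly half of the churners. The AUC of 0.0 for the ridge classifier and the linear SVM is not a real score; it most likely appears because these estimators do not produce probability estimates.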
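
To make the Accuracy, Recall, Prec. and F1 columns easier to read, here is a short sketch of how these metrics follow from confusion-matrix counts. The counts are invented for illustration and do not come from any of the models above; `classification_metrics` is a hypothetical helper, not a PyCaret function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 100 churners (positives) among 400 customers.
# The model finds 50 of them (tp), misses 50 (fn) and raises 25 false alarms (fp).
metrics = classification_metrics(tp=50, fp=25, fn=50, tn=275)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this invented case the accuracy is above 0.8 even though half of the churners are missed, which is why recall and F1 are worth watching alongside accuracy when the positive class is the minority.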