# <font color='red'>-----------------------------------------------------------------------------------------------------</font>
# <font color='green'>Telco dataset - customer churn prediction</font>
## <font color='green'>Dataset is related to Telco customers to predict churn</font>
# <font color='red'>-----------------------------------------------------------------------------------------------------</font>

## <ins><div class="alert alert-block alert-info">*Objective*</div></ins>

**We predict customer churn so that retention efforts can focus on high-value customers before we lose them. The purpose of churn prediction is to improve the retention rate.**<br>
We will use a different Telco dataset, whose target variable is 'Churn', to build our model and make churn predictions.

The required libraries are imported in the first cell below, after which we move directly to loading the dataset and EDA.

In [None]:
import pandas as pd
import numpy as np
from datetime import timedelta,datetime,date
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from numpy import percentile

### <ins><div class="alert alert-block alert-warning">*Step 1: Loading the dataset, reading the file and doing required modifications*</div></ins>

In [None]:
tel_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
tel_data.info()

In [None]:
tel_data.isnull().sum()

Fortunately, there are no missing values. However, most of the columns are categorical (18 out of 21), so one-hot encoding will generate many additional columns.
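As a rough check on how wide the encoded frame becomes, the sketch below counts the indicator columns that one-hot encoding with `drop_first=True` would create. The small `toy` frame and the helper name `count_dummy_columns` are illustrative assumptions, not part of this notebook.

```python
import pandas as pd

def count_dummy_columns(df, columns, drop_first=True):
    """Return how many indicator columns one-hot encoding would create."""
    total = 0
    for col in columns:
        n_levels = df[col].nunique()
        # drop_first=True removes one level per column to avoid redundancy
        total += n_levels - 1 if drop_first else n_levels
    return total

# hypothetical toy data with the same kind of columns as the Telco file
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes", "Yes"],
})
cat_cols = ["Contract", "PaperlessBilling"]

n_expected = count_dummy_columns(toy, cat_cols)
n_actual = pd.get_dummies(toy, columns=cat_cols, drop_first=True).shape[1]
print(n_expected, n_actual)
```

On the real data the same helper could be called with the list of categorical column names to see the column count before encoding.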

In [None]:
# the target variable is stored as text ("No"/"Yes"), so we map it to 0/1
num = {"No":0,"Yes":1}
tel_data = tel_data.replace({"Churn":num})

# TotalCharges is also stored as object; it is converted to numeric further below,
# once its blank entries are handled
# tel_data['TotalCharges'] = pd.to_numeric(tel_data['TotalCharges'])
tel_data.head(2)

### <ins><div class="alert alert-block alert-warning">*Step 2: Visualising the relationship between variables and churn rate and performing EDA*</div></ins>

### Visualising the relationship between categorical variables and churn rate

In [None]:
# plotting bar plot for gender and churn
plt.figure(figsize=(6,5))
bar_plt_g=tel_data.groupby(['gender'])['Churn'].mean().reset_index()
print(bar_plt_g,'\n')
x=bar_plt_g['gender']
y=bar_plt_g['Churn']
sns.barplot(x=x, y=y)
plt.show()

Churn rate for females is slightly more than for males

In [None]:
# plotting bar plot for SeniorCitizen and churn
plt.figure(figsize=(6,5))
bar_plt_sc=tel_data.groupby(['SeniorCitizen'])['Churn'].mean().reset_index()
print(bar_plt_sc,'\n')
x=bar_plt_sc['SeniorCitizen']
y=bar_plt_sc['Churn']
sns.barplot(x=x, y=y)
plt.show()

The churn rate among senior citizens is noticeably higher than among non-senior customers, so senior citizens are more likely to churn.

In [None]:
# plotting bar plot for PhoneService and churn
plt.figure(figsize=(6,5))
bar_plt_phn=tel_data.groupby(['PhoneService'])['Churn'].mean().reset_index()
print(bar_plt_phn,'\n')
x=bar_plt_phn['PhoneService']
y=bar_plt_phn['Churn']
sns.barplot(x=x, y=y)
plt.show()

Customers who have phone service are slightly more likely to churn than those who do not, although the difference is small.

In [None]:
# plotting bar plot for MultipleLines and churn
plt.figure(figsize=(6,5))
bar_plt_ml=tel_data.groupby(['MultipleLines'])['Churn'].mean().reset_index()
print(bar_plt_ml,'\n')
x=bar_plt_ml['MultipleLines']
y=bar_plt_ml['Churn']
sns.barplot(x=x, y=y)
plt.show()

Customers with multiple lines are more likely to churn compared to the ones with no multiple lines

In [None]:
# plotting bar plot for InternetService and churn
plt.figure(figsize=(6,5))
bar_plt_int=tel_data.groupby(['InternetService'])['Churn'].mean().reset_index()
print(bar_plt_int,'\n')
x=bar_plt_int['InternetService']
y=bar_plt_int['Churn']
sns.barplot(x=x, y=y)
plt.show()

Customers with fibre optic internet are significantly more likely to churn than customers with a DSL connection or with no internet service.

In [None]:
# plotting bar plot for OnlineSecurity and churn
plt.figure(figsize=(6,5))
bar_plt_osec=tel_data.groupby(['OnlineSecurity'])['Churn'].mean().reset_index()
print(bar_plt_osec,'\n')
x=bar_plt_osec['OnlineSecurity']
y=bar_plt_osec['Churn']
sns.barplot(x=x, y=y)
plt.show()

As expected, customers with no online security are significantly more likely to churn than those who have it.

In [None]:
# plotting bar plot for OnlineBackup and churn
plt.figure(figsize=(6,5))
bar_plt_obck=tel_data.groupby(['OnlineBackup'])['Churn'].mean().reset_index()
print(bar_plt_obck,'\n')
x=bar_plt_obck['OnlineBackup']
y=bar_plt_obck['Churn']
sns.barplot(x=x, y=y)
plt.show()

Same situation as expected with online backup as in online security

In [None]:
# plotting bar plot for DeviceProtection and churn
plt.figure(figsize=(6,5))
bar_plt_devp=tel_data.groupby(['DeviceProtection'])['Churn'].mean().reset_index()
print(bar_plt_devp,'\n')
x=bar_plt_devp['DeviceProtection']
y=bar_plt_devp['Churn']
sns.barplot(x=x, y=y)
plt.show()

Device protection follows the same pattern as online backup and online security.

In [None]:
# plotting bar plot for TechSupport and churn
plt.figure(figsize=(6,5))
bar_plt_tech=tel_data.groupby(['TechSupport'])['Churn'].mean().reset_index()
print(bar_plt_tech,'\n')
x=bar_plt_tech['TechSupport']
y=bar_plt_tech['Churn']
sns.barplot(x=x, y=y)
plt.show()

Tech support follows the same trend as the online services: churn is higher among customers who have no tech support.

In [None]:
# plotting bar plot for StreamingTV & StreamingMovies  and churn
plt.figure(figsize=(6,5))
bar_plt_strt=tel_data.groupby(['StreamingTV'])['Churn'].mean().reset_index()
print(bar_plt_strt,'\n')
x=bar_plt_strt['StreamingTV']
y=bar_plt_strt['Churn']
sns.barplot(x=x, y=y)
plt.show()

plt.figure(figsize=(6,5))
bar_plt_strm=tel_data.groupby(['StreamingMovies'])['Churn'].mean().reset_index()
print(bar_plt_strm,'\n')
x=bar_plt_strm['StreamingMovies']
y=bar_plt_strm['Churn']
sns.barplot(x=x, y=y)
plt.show()

Customers without streaming services are slightly more likely to churn than customers with them, although the difference between the two groups is small.

In [None]:
# lets plot together for Contract, PaperlessBilling, PaymentMethod

f,axes = plt.subplots(3,1,figsize=(10,10))
plt.subplots_adjust(hspace=0.5)

bar_plt_cont=tel_data.groupby(['Contract'])['Churn'].mean().reset_index()
bar_plt_pprbll=tel_data.groupby(['PaperlessBilling'])['Churn'].mean().reset_index()
bar_plt_pymnt=tel_data.groupby(['PaymentMethod'])['Churn'].mean().reset_index()

print(bar_plt_cont,'\n')
print(bar_plt_pprbll,'\n')
print(bar_plt_pymnt,'\n')

sns.barplot(y="Churn", x= "Contract", data=bar_plt_cont,  orient='v' , ax=axes[0])
sns.barplot(y="Churn", x= "PaperlessBilling", data=bar_plt_pprbll, orient='v', ax=axes[1])
sns.barplot(y="Churn", x= "PaymentMethod", data=bar_plt_pymnt, orient='v', ax=axes[2])

plt.show()

- Contract : Customers on one-year or two-year contracts are much less likely to churn than customers who renew month to month.<br>
- Paperless billing : Customers who subscribed to paperless billing are more likely to churn.<br>
- Payment method : Customers who pay by electronic check are more likely to churn than customers using other payment methods.
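The group-by-and-average pattern used in the cells above can be wrapped in a small helper so that each new column needs only one call. This is a minimal sketch on a hypothetical `toy` frame; the function name `churn_rate_by` is an assumption introduced here for illustration.

```python
import pandas as pd

def churn_rate_by(df, column, target="Churn"):
    """Mean of the 0/1 target per category, highest churn rate first."""
    return (df.groupby(column)[target]
              .mean()
              .sort_values(ascending=False)
              .reset_index())

# hypothetical toy data; the real notebook would pass tel_data instead
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year"],
    "PaperlessBilling": ["Yes", "Yes", "No", "No"],
    "Churn": [1, 0, 0, 0],
})

rates = {col: churn_rate_by(toy, col) for col in ["Contract", "PaperlessBilling"]}
print(rates["Contract"])
```

Each returned frame can be passed to `sns.barplot(x=column, y="Churn", data=...)` in the same way as the plots above.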

### Visualising the relationship between tenure, monthly charges, total charges and churn rate

In [None]:
# plotting scatterplot for tenure, MonthlyCharges, TotalCharges and churn

f,axes = plt.subplots(3,1,figsize=(10,10))
plt.subplots_adjust(hspace=0.5)

scat_plt_ten=tel_data.groupby(['tenure'])['Churn'].mean().reset_index()
scat_plt_mchrg=tel_data.groupby(['MonthlyCharges'])['Churn'].mean().reset_index()
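# TotalCharges is still stored as text at this point, so coerce it to numbers for plotting
scat_plt_totchrg=(tel_data.assign(TotalCharges=pd.to_numeric(tel_data['TotalCharges'], errors='coerce'))
                  .groupby(['TotalCharges'])['Churn'].mean().reset_index())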

sns.scatterplot(y="Churn", x= "tenure", data=scat_plt_ten, ax=axes[0])
sns.scatterplot(y="Churn", x= "MonthlyCharges", data=scat_plt_mchrg, ax=axes[1])
sns.scatterplot(y="Churn", x= "TotalCharges", data=scat_plt_totchrg, ax=axes[2])

plt.show()

Churn rate falls roughly linearly as tenure increases: the longer a customer has stayed, the less likely they are to churn. Monthly charges and total charges show no clear trend against churn rate in these plots.
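Grouping by every distinct charge value averages over very few customers per point, which can hide a trend in noise. One possible alternative, sketched here on hypothetical toy values, is to bin the continuous column with `pd.cut` before computing the churn rate. The bin edges are arbitrary assumptions.

```python
import pandas as pd

# hypothetical toy data standing in for tel_data
toy = pd.DataFrame({
    "MonthlyCharges": [19.9, 25.0, 49.5, 55.0, 74.4, 80.1, 99.0, 105.5],
    "Churn":          [0,    0,    0,    1,    1,    0,    1,    1],
})

# assumed bin edges; adjust to the real range of the column
bins = [0, 30, 60, 90, 120]
toy["charge_band"] = pd.cut(toy["MonthlyCharges"], bins=bins)

# churn rate and number of customers in each band
band_rates = toy.groupby("charge_band", observed=True)["Churn"].agg(["mean", "size"])
print(band_rates)
```

Reporting the group size next to the mean makes it easier to judge how much weight each band's churn rate deserves.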

The TotalCharges column is stored as object, so we need to convert it to float. It also contains blank entries with no alphanumeric content, which isnull() does not detect because they are strings rather than missing values. We have to handle those values.
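The effect can be reproduced on a tiny series. This sketch assumes the blank entries are whitespace-only strings; the values are made up for illustration.

```python
import pandas as pd

# two numeric-looking strings and two blank (whitespace-only) entries
raw = pd.Series(["29.85", " ", "1889.5", " "], name="TotalCharges")

n_null_before = raw.isnull().sum()        # blanks are strings, so none are flagged
blank_mask = raw.str.strip().eq("")       # explicit check for blank strings

# errors="coerce" turns anything unparseable into NaN
converted = pd.to_numeric(raw, errors="coerce")
n_null_after = converted.isnull().sum()

print(n_null_before, blank_mask.sum(), n_null_after)
```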

In [None]:
# using pd.to_numeric to convert the TotalCharges column to numeric will help us see the null values
tel_data.TotalCharges = pd.to_numeric(tel_data.TotalCharges, errors="coerce")
tel_data.isnull().sum()

As suspected, 11 rows of the TotalCharges column now show null values. We can either impute them or delete the rows. Since 11 out of 7043 rows is less than 1%, deleting them loses very little data.
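For comparison, imputation could look like the sketch below. It assumes, without verification against this dataset, that total charges are roughly tenure multiplied by monthly charges. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical rows; NaN marks entries that were blank before conversion
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 0],
    "MonthlyCharges": [29.85, 52.55, 56.95, 20.25],
    "TotalCharges": [29.85, np.nan, 1889.5, np.nan],
})

imputed = toy.copy()
# assumption: TotalCharges is approximately tenure * MonthlyCharges
estimate = imputed["tenure"] * imputed["MonthlyCharges"]
imputed["TotalCharges"] = imputed["TotalCharges"].fillna(estimate)

print(imputed)
```

Only the missing entries are filled, so existing values are left untouched.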

In [None]:
# deleting the rows with null values

tel_data = tel_data.dropna(axis=0)

In [None]:
tel_data.info()

The rows with null values have been deleted and TotalCharges column is also converted to numeric

In [None]:
tel_data.head(2)

In [None]:
# encoding all categorical variables using one hot encoding

tel_data = pd.get_dummies(tel_data,drop_first=True,columns=['gender','Partner','Dependents',
                                            'PhoneService','MultipleLines','InternetService',
                                           'OnlineSecurity','OnlineBackup','DeviceProtection',
                                           'TechSupport','StreamingTV','StreamingMovies',
                                           'Contract','PaperlessBilling','PaymentMethod'])

### <ins><div class="alert alert-block alert-warning">*Step 3: Performing steps for model building*</div></ins>

In [None]:
# import sklearn libraries
from sklearn.model_selection import KFold,train_test_split,cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

In [None]:
# performing feature selection using chi2 test
from sklearn.feature_selection import chi2

# separating features and target (dropping 'customerID' since it carries no predictive information)
X = tel_data.drop(['customerID','Churn'],axis=1)
y = tel_data['Churn']

chi_scores = chi2(X,y)
print('chi_values:',chi_scores[0],'\n')
print('p_values:',chi_scores[1])

In [None]:
p_values = pd.Series(chi_scores[1],index = X.columns)
p_values.sort_values(ascending = False , inplace = True)

In [None]:
plt.figure(figsize=(12,8))
p_values.plot.bar()
plt.show()

The plot above shows that the chi-square p-values are highest for 'PhoneService_Yes', 'gender_Male' and 'MultipleLines_No phone service'. A high p-value means the test finds no evidence that the feature is associated with churn, so we can discard these features. Let's drop these variables (together with 'MultipleLines_Yes', which the next cell also removes) and build our model on the remaining ones.
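The selection logic can be illustrated on synthetic data where the answer is known in advance. In this sketch the feature names, the 10% flip rate and the 0.05 cut-off are all assumptions chosen for illustration; note that scikit-learn's `chi2` expects non-negative feature values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# 'informative' copies the target with 10% of values flipped; 'noise' is unrelated
flip = rng.random(n) < 0.1
informative = np.where(flip, 1 - y, y)
noise = rng.integers(0, 2, size=n)
X = pd.DataFrame({"informative": informative, "noise": noise})

chi_stats, p_vals = chi2(X, y)
p_values = pd.Series(p_vals, index=X.columns).sort_values(ascending=False)

# assumed cut-off: keep features whose p-value is below 0.05
keep = p_values[p_values < 0.05].index.tolist()
print(p_values)
print(keep)
```

Selecting by an explicit threshold avoids reading the columns to drop off the bar chart by eye.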

In [None]:
tel_data.drop(['PhoneService_Yes','gender_Male','MultipleLines_No phone service','MultipleLines_Yes'],axis=1,inplace=True)
tel_data.head(2)

In [None]:
# rebuilding features and target after dropping the columns above ('customerID' carries no predictive information)
X = tel_data.drop(['customerID','Churn'],axis=1)
y = tel_data['Churn']

In [None]:
# splitting into train and test data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)
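When one class is rarer than the other, a plain random split can leave the two sets with slightly different class ratios. A possible refinement, sketched here on synthetic data with an assumed 27% positive rate, is to pass `stratify=y` so that both sets keep the same ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))           # synthetic features
y = np.array([1] * 270 + [0] * 730)      # assumed 27% positive class

# stratify=y keeps the class ratio the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(y.mean(), y_tr.mean(), y_te.mean())
```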

**We will use an XGBoost classifier for prediction.**

In [None]:
# importing libraries
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score
import xgboost as xgb

In [None]:
model_xgb_1 = xgb.XGBClassifier()

In [None]:
# hyperparameter tuning for best params for xgboost
params = {"n_estimators":range(50, 400, 50),    
          "max_depth": [3,4,5,6,8,10,12,14],
          "learning_rate":[0.05,0.1,0.15,0.2,0.25,0.3],
          "gamma":[0.1,0.2,0.3,0.4],
          "min_child_weight":[1,3,5,7], 
          "colsample_bytree": [0.3,0.4,0.5,0.7],
          "random_state" : [1, 42, 58, 69, 72]}

rand_srch = RandomizedSearchCV(model_xgb_1,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)
rand_srch.fit(X_train,y_train)

In [None]:
rand_srch.best_estimator_

In [None]:
rand_srch.best_params_
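With the default `refit=True`, `RandomizedSearchCV` refits the best configuration on the full training data, so `best_estimator_` can be used directly without copying parameters by hand. The sketch below shows this on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that it runs without extra dependencies. The search space is an arbitrary assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),   # stand-in model, not XGBoost
    param_distributions={
        "n_estimators": [20, 40],
        "max_depth": [2, 3],
        "learning_rate": [0.05, 0.1, 0.2],
    },
    n_iter=3, scoring="roc_auc", cv=3, random_state=0,
)
search.fit(X, y)

# already refit on all of X, y with the best parameters found
best_model = search.best_estimator_
tuned = {k: best_model.get_params()[k] for k in search.best_params_}
print(tuned)
```

Fixing `random_state` on the search itself makes the sampled candidates reproducible between runs.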

In [None]:
# fitting xgboost model with the best params
# only the seven hyperparameters covered by the random search are set explicitly;
# everything else is left to xgboost, which avoids arguments that vary between xgboost versions
model_xgb = xgb.XGBClassifier(colsample_bytree=0.7, gamma=0.3, learning_rate=0.3,
              max_depth=4, min_child_weight=5, n_estimators=50, random_state=58)

In [None]:
trn_xgbmod = model_xgb.fit(X_train,y_train)

# checking accuracy of training data
print('Accuracy of XGB classifier on training set: {:.2f}'
       .format(trn_xgbmod.score(X_train, y_train)))

In [None]:
y_pred = trn_xgbmod.predict(X_test)
print('Accuracy of XGB classifier on test set: {:.2f}'
       .format(trn_xgbmod.score(X_test[X_train.columns], y_test)))

In [None]:
print(classification_report(y_test, y_pred))
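Accuracy alone can look good when one class dominates, so it may help to also inspect the confusion matrix and the ROC AUC, which was the metric used to score the search. A minimal sketch on four hand-made predictions, with 0.5 as an assumed decision threshold:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# hypothetical true labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC is computed from the probabilities, not from hard labels
auc = roc_auc_score(y_true, y_score)

# hard labels at an assumed threshold of 0.5
y_hat = (y_score >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_hat)   # rows: actual, columns: predicted

print(auc)
print(cm)
```

In this notebook the equivalent inputs would be `y_test` and `trn_xgbmod.predict_proba(X_test)[:, 1]`.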

In [None]:
# scoring the churn probability of every customer
# (this includes rows the model was trained on, so their scores are optimistic)
tel_data['churn_proba'] = trn_xgbmod.predict_proba(tel_data[X_train.columns])[:,1]
tel_data[['customerID','churn_proba']].head(10)

**We now know which customers are likely to churn and can build our retention strategy around them.**
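One way to turn the scores into an action list is to filter by a probability cut-off and sort. The customer IDs, the scores and the 0.5 threshold below are hypothetical.

```python
import pandas as pd

# hypothetical scored customers, in the same shape as tel_data[['customerID','churn_proba']]
scored = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4", "E5"],
    "churn_proba": [0.12, 0.81, 0.47, 0.93, 0.05],
})

threshold = 0.5   # assumed cut-off; a business decision in practice

# customers at or above the threshold, highest risk first
at_risk = (scored[scored["churn_proba"] >= threshold]
           .sort_values("churn_proba", ascending=False)
           .reset_index(drop=True))

print(at_risk)
```

The threshold would normally be chosen by weighing the cost of a retention offer against the value of the customers it is expected to keep.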