## Telco Customer Churn Prediction
### Part 2: Vectorization & Model Building

Let's vectorize all the features before building the model.<br>

<b>Categorical features:</b><br>
1. Area code
2. International Plan
3. Voice Mail Plan

<b>Numerical features:</b><br>
1. number_vmail_messages<br>
2. total_day_minutes<br>
3. total_day_calls<br>
4. total_day_charge<br>
5. total_eve_minutes<br>
6. total_eve_calls<br>
7. total_eve_charge<br>
8. total_night_minutes<br>
9. total_night_calls<br>
10. total_night_charge<br>
11. total_intl_minutes<br>
12. total_intl_calls<br>
13. total_intl_charge<br>
14. number_customer_service_calls<br>

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack
from datetime import datetime

### 4. Vectorization

#### 4.1 Split data

<h4>Let's split the data into train and test sets. We will then build models, evaluate them, and tune hyperparameters.</h4>

In [74]:
train = pd.read_csv("processed_train.csv")

In [75]:
print(train.columns.values)

['area_code' 'international_plan' 'voice_mail_plan'
 'number_vmail_messages' 'total_day_minutes' 'total_day_calls'
 'total_day_charge' 'total_eve_minutes' 'total_eve_calls'
 'total_eve_charge' 'total_night_minutes' 'total_night_calls'
 'total_night_charge' 'total_intl_minutes' 'total_intl_calls'
 'total_intl_charge' 'number_customer_service_calls' 'churn']


In [76]:
print(train.shape)

(4250, 18)


In [77]:
X = train.drop(['churn'], axis=1)
y = train['churn'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

print("X_train shape", X_train.shape)
print("X_test shape", X_test.shape)
print("y_train shape", y_train.shape)
print("y_test shape", y_test.shape)


X_train shape (2975, 17)
X_test shape (1275, 17)
y_train shape (2975,)
y_test shape (1275,)


#### 4.2 Vectorize data
We will also keep every fitted vectorizer in a dictionary and later store them in a pickle file. They are needed at deployment time, when a new query point must be transformed in the same way as the training data.

In [87]:
feature_names = []

In [80]:
categorical_vectorizers = dict()

##### Categorical features

In [81]:
# area_code
areacode_vectorizer = CountVectorizer()
areacode_vectorizer.fit(X_train['area_code'].values)
X_train_areacode_vectorized = areacode_vectorizer.transform(X_train['area_code'].values)
X_test_areacode_vectorized = areacode_vectorizer.transform(X_test['area_code'].values)
print("\nAfter one hot encoding 'area_code'")
print("X_train shape: ", X_train_areacode_vectorized.shape)
print("X_test shape: ", X_test_areacode_vectorized.shape)
categorical_vectorizers['areacode_vectorizer'] = areacode_vectorizer


After one hot encoding 'area_code'
X_train shape:  (2975, 3)
X_test shape:  (1275, 3)
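
For a single-token column such as `area_code`, `CountVectorizer` acts much like a one-hot encoder: it learns a vocabulary from the training values and emits one column per token. The snippet below is a minimal pure-Python sketch of that idea, not the library's implementation. The helper names and the sample values (`area_code_415` and so on) are assumptions made for illustration. It also shows that a value never seen during fitting becomes an all-zero row, which matters for query points at deployment time.

```python
def fit_vocabulary(values):
    """Map each distinct (lower-cased) value to a column index, in sorted order."""
    distinct = sorted({v.lower() for v in values})
    return {token: idx for idx, token in enumerate(distinct)}


def one_hot(values, vocabulary):
    """Encode each value as a 0/1 row; values missing from the vocabulary give all zeros."""
    rows = []
    for v in values:
        row = [0] * len(vocabulary)
        idx = vocabulary.get(v.lower())
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical sample values, assumed to resemble the 'area_code' column
train_values = ["area_code_415", "area_code_408", "area_code_510", "area_code_415"]
test_values = ["area_code_510", "area_code_999"]  # the second value is unseen

vocab = fit_vocabulary(train_values)       # fit on train only
train_encoded = one_hot(train_values, vocab)
test_encoded = one_hot(test_values, vocab)

print(vocab)
print(test_encoded)
```

The vocabulary is learned from the training split only, mirroring the `fit` on `X_train` followed by `transform` on both splits.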


In [82]:
print('Categorical vectorizers: {}'.format(categorical_vectorizers))

Categorical vectorizers: {'areacode_vectorizer': CountVectorizer()}


In [None]:
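feature_names.extend(areacode_vectorizer.get_feature_names_out())
feature_names.extend(['international_plan', 'voice_mail_plan'])  # stacked right after area_code in hstack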

##### Numerical features

In [90]:
numeric_vectorizers = {}
def vectorizeNumericalFeature(X_train, X_test, feature):
    # Fit a separate scaler per feature. A single shared scaler would be refit on every
    # call, so every dictionary entry would end up holding the last feature's min/max.
    scaler = MinMaxScaler()
    scaler.fit(X_train[feature].values.reshape(-1,1))
    X_train_vectorized_feature = scaler.transform(X_train[feature].values.reshape(-1,1))
    X_test_vectorized_feature = scaler.transform(X_test[feature].values.reshape(-1,1))
    print("\nAfter normalizing ",feature)
    print("X_train shape: ", X_train_vectorized_feature.shape)
    print("X_test shape: ", X_test_vectorized_feature.shape)
    my_vect = feature+"_vectorizer"
    numeric_vectorizers[my_vect] = scaler
    feature_names.append(feature)
    return X_train_vectorized_feature, X_test_vectorized_feature
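
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the training data of that one feature. The sketch below works through the arithmetic in plain Python. The helper names and the sample numbers are assumptions for illustration, and it is not how `MinMaxScaler` is implemented internally. It shows two things: each feature needs its own fitted minimum and maximum, and a test value outside the training range lands outside [0, 1].

```python
def fit_min_max(train_values):
    """Learn the minimum and maximum from the training values of one feature."""
    return min(train_values), max(train_values)


def transform_min_max(values, lo, hi):
    """Scale values with a previously fitted (lo, hi) pair."""
    span = hi - lo
    if span == 0:
        # Constant feature: map everything to 0.0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical 'total_day_minutes' values
train_minutes = [100.0, 150.0, 200.0, 300.0]
test_minutes = [120.0, 350.0]  # 350 exceeds the training maximum

lo, hi = fit_min_max(train_minutes)
train_scaled = transform_min_max(train_minutes, lo, hi)
test_scaled = transform_min_max(test_minutes, lo, hi)

print(train_scaled)
print(test_scaled)
```

Because the test value 350 is larger than anything in the training data, its scaled value exceeds 1. The same can happen to a query point in production.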

In [91]:
# number_vmail_messages
X_train_numvmailmsg_norm, X_test_numvmailmsg_norm = vectorizeNumericalFeature(X_train, X_test, 'number_vmail_messages')

# total_day_minutes
X_train_totdaymins_norm, X_test_totdaymins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_minutes')

# total_day_calls
X_train_totdaycalls_norm, X_test_totdaycalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_calls')

# total_day_charge
X_train_totdaycharge_norm, X_test_totdaycharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_charge')

# total_eve_minutes
X_train_totevemins_norm, X_test_totevemins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_minutes')

# total_eve_calls
X_train_totevecalls_norm, X_test_totevecalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_calls')

# total_eve_charge
X_train_totevecharge_norm, X_test_totevecharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_charge')

# total_night_minutes
X_train_totnightmins_norm, X_test_totnightmins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_minutes')

# total_night_calls
X_train_totnightcalls_norm, X_test_totnightcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_calls')

# total_night_charge
X_train_totnightcharge_norm, X_test_totnightcharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_charge')

# total_intl_minutes
X_train_totintlmins_norm, X_test_totintlmins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_minutes')

# total_intl_calls
X_train_totintlcalls_norm, X_test_totintlcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_calls')

# total_intl_charge
X_train_totintlcharge_norm, X_test_totintlcharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_charge')

# number_customer_service_calls
X_train_custservcalls_norm, X_test_custservcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'number_customer_service_calls')


After normalizing  number_vmail_messages
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_intl_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing 

In [23]:
print('Numerical vectorizers: {}'.format(numeric_vectorizers))

Numerical vectorizers: {'number_vmail_messages_vectorizer': MinMaxScaler(), 'total_day_minutes_vectorizer': MinMaxScaler(), 'total_day_calls_vectorizer': MinMaxScaler(), 'total_day_charge_vectorizer': MinMaxScaler(), 'total_eve_minutes_vectorizer': MinMaxScaler(), 'total_eve_calls_vectorizer': MinMaxScaler(), 'total_eve_charge_vectorizer': MinMaxScaler(), 'total_night_minutes_vectorizer': MinMaxScaler(), 'total_night_calls_vectorizer': MinMaxScaler(), 'total_night_charge_vectorizer': MinMaxScaler(), 'total_intl_minutes_vectorizer': MinMaxScaler(), 'total_intl_calls_vectorizer': MinMaxScaler(), 'total_intl_charge_vectorizer': MinMaxScaler(), 'number_customer_service_calls_vectorizer': MinMaxScaler()}


#### Reshape the features which were already in 0/1 format

In [40]:
X_train_intplan_ohe = X_train['international_plan'].values.reshape(-1,1)
X_test_intplan_ohe = X_test['international_plan'].values.reshape(-1,1)
X_train_vmailplan_ohe = X_train['voice_mail_plan'].values.reshape(-1,1)
X_test_vmailplan_ohe = X_test['voice_mail_plan'].values.reshape(-1,1)
print('Done!')

Done!


<h4>Let's stack the vectorized features using hstack and create two sets (train and test).</h4>

In [93]:
X_train_stacked = hstack((X_train_areacode_vectorized, X_train_intplan_ohe, X_train_vmailplan_ohe, X_train_numvmailmsg_norm, X_train_totdaymins_norm, X_train_totdaycalls_norm, X_train_totdaycharge_norm, X_train_totevemins_norm, X_train_totevecalls_norm, X_train_totevecharge_norm, X_train_totnightmins_norm, X_train_totnightcalls_norm, X_train_totnightcharge_norm, X_train_totintlmins_norm, X_train_totintlcalls_norm, X_train_totintlcharge_norm, X_train_custservcalls_norm)).tocsr()
X_test_stacked = hstack((X_test_areacode_vectorized, X_test_intplan_ohe, X_test_vmailplan_ohe, X_test_numvmailmsg_norm, X_test_totdaymins_norm, X_test_totdaycalls_norm, X_test_totdaycharge_norm, X_test_totevemins_norm, X_test_totevecalls_norm, X_test_totevecharge_norm, X_test_totnightmins_norm, X_test_totnightcalls_norm, X_test_totnightcharge_norm, X_test_totintlmins_norm, X_test_totintlcalls_norm, X_test_totintlcharge_norm, X_test_custservcalls_norm)).tocsr()

print("Stacked data set: ")
print("X_train shape: ", X_train_stacked.shape)
print("X_test shape: ", X_test_stacked.shape)

Stacked data set: 
X_train shape:  (2975, 19)
X_test shape:  (1275, 19)


Import some common ML functions

In [43]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

### 5. Modelling

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

In [94]:
results = []
for classifier in [LogisticRegression(), KNeighborsClassifier(), RandomForestClassifier(), DecisionTreeClassifier(), GradientBoostingClassifier(), SVC()]:
    start = datetime.now()
    clf = classifier
    clf_str = str(clf).split(' ')[0].split('.')[-1]
    print('{} started...'.format(clf_str))
    clf.fit(X_train_stacked, y_train)
    y_pred = clf.predict(X_test_stacked)
    print('{} completed... Time taken: {}'.format(clf_str, datetime.now()-start))
    
    temp = list()
    temp.append(clf_str)
    temp.append('Default')
    temp.append(accuracy_score(y_test, y_pred))
    temp.append(confusion_matrix(y_test, y_pred))
    temp.append(datetime.now()-start)
    results.append(temp)
    

LogisticRegression() started...
LogisticRegression() completed... Time taken: 0:00:00.027923
KNeighborsClassifier() started...
KNeighborsClassifier() completed... Time taken: 0:00:00.210118
RandomForestClassifier() started...
RandomForestClassifier() completed... Time taken: 0:00:01.151270
DecisionTreeClassifier() started...
DecisionTreeClassifier() completed... Time taken: 0:00:00.052666
GradientBoostingClassifier() started...
GradientBoostingClassifier() completed... Time taken: 0:00:00.815453
SVC() started...
SVC() completed... Time taken: 0:00:00.178173


In [95]:
pd.DataFrame(results, columns=['Algorithm', 'Hyperparameters', 'Accuracy', 'Confusion Matrix', 'Time taken']).sort_values('Accuracy')   

| | Algorithm | Hyperparameters | Accuracy | Confusion Matrix | Time taken |
|---|---|---|---|---|---|
| 0 | LogisticRegression() | Default | 0.862745 | [[1084, 12], [163, 16]] | 0 days 00:00:00.028921 |
| 5 | SVC() | Default | 0.866667 | [[1096, 0], [170, 9]] | 0 days 00:00:00.179173 |
| 3 | DecisionTreeClassifier() | Default | 0.876863 | [[1001, 95], [62, 117]] | 0 days 00:00:00.053667 |
| 1 | KNeighborsClassifier() | Default | 0.883922 | [[1086, 10], [138, 41]] | 0 days 00:00:00.211135 |
| 4 | GradientBoostingClassifier() | Default | 0.931765 | [[1081, 15], [72, 107]] | 0 days 00:00:00.816420 |
| 2 | RandomForestClassifier() | Default | 0.937255 | [[1087, 9], [71, 108]] | 0 days 00:00:01.152272 |
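
Accuracy alone can hide how well each model finds churners, because the classes are imbalanced. The sketch below recomputes accuracy, precision, and recall for the churn class from the confusion matrices in the table above. It assumes scikit-learn's layout (rows are true labels, columns are predictions) and assumes the second row and column correspond to the churn class. The function name `churn_metrics` is introduced here for illustration.

```python
# Confusion matrices copied from the results table: [[TN, FP], [FN, TP]]
confusion_matrices = {
    "LogisticRegression": [[1084, 12], [163, 16]],
    "SVC": [[1096, 0], [170, 9]],
    "DecisionTreeClassifier": [[1001, 95], [62, 117]],
    "KNeighborsClassifier": [[1086, 10], [138, 41]],
    "GradientBoostingClassifier": [[1081, 15], [72, 107]],
    "RandomForestClassifier": [[1087, 9], [71, 108]],
}


def churn_metrics(cm):
    """Return (accuracy, precision, recall) for the positive (assumed churn) class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


metrics = {name: churn_metrics(cm) for name, cm in confusion_matrices.items()}

# Accuracy of a trivial model that always predicts the majority class
(tn, fp), (fn, tp) = confusion_matrices["SVC"]
majority_baseline = (tn + fp) / (tn + fp + fn + tp)

for name, (acc, prec, rec) in metrics.items():
    print(f"{name:28s} accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
print(f"Always predicting the majority class would score {majority_baseline:.4f}")
```

Under these assumptions, SVC and LogisticRegression barely beat the majority-class baseline and identify very few of the 179 churners in the test split, while the tree ensembles recover roughly 60% of them.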


#### GradientBoosting reached 93.18% accuracy with default settings, just behind RandomForest (93.73%). Let's tune it to see if we get better performance.

In [56]:
from sklearn.model_selection import GridSearchCV

In [96]:
start = datetime.now()
print('GridSearchCV started...')
parameters = {'max_depth':(1, 3, 10, 30), 'min_samples_split':(5, 10, 100, 500), 'n_estimators':(50,100,200)}
gbdt = GradientBoostingClassifier()
clf = GridSearchCV(gbdt, parameters, return_train_score=True, scoring='accuracy', cv=5)
clf.fit(X_train_stacked, y_train)
print('GridSearchCV completed... Time taken: {}'.format(datetime.now() - start))

GridSearchCV started...
GridSearchCV completed... Time taken: 0:12:43.422739
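
The long runtime follows from the size of the grid. As a rough sketch, the snippet below enumerates the same parameter grid with `itertools.product` and counts how many model fits a 5-fold search implies. The variable names `combinations` and `total_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid and fold count as passed to GridSearchCV above
parameters = {
    "max_depth": (1, 3, 10, 30),
    "min_samples_split": (5, 10, 100, 500),
    "n_estimators": (50, 100, 200),
}
cv_folds = 5

# Every combination of one value per hyperparameter
combinations = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]
total_fits = len(combinations) * cv_folds

print("Parameter combinations:", len(combinations))
print("Cross-validation fits:", total_fits)
```

That gives 48 combinations and 240 cross-validation fits, plus one final refit on the full training set with the default `refit=True`. Spread over the measured 12 minutes 43 seconds, this is about 3 seconds per fit on average.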


In [97]:
cv_result = pd.DataFrame.from_dict(clf.cv_results_)

In [98]:
cv_result.sort_values(by='mean_test_score', ascending=False)[:3]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 43 | 5.775754 | 0.653045 | 0.011317 | 0.000578 | 30 | 100 | 100 | {'max_depth': 30, 'min_samples_split': 100, 'n... | 0.909244 | 0.922689 | ... | 0.924034 | 0.008329 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 42 | 2.212015 | 0.355322 | 0.004808 | 0.000764 | 30 | 100 | 50 | {'max_depth': 30, 'min_samples_split': 100, 'n... | 0.912605 | 0.922689 | ... | 0.923361 | 0.006252 | 2 | 1.0 | 0.99958 | 0.99958 | 0.99958 | 1.0 | 0.999748 | 0.000206 |
| 47 | 5.694178 | 0.283163 | 0.015105 | 0.001222 | 30 | 500 | 200 | {'max_depth': 30, 'min_samples_split': 500, 'n... | 0.917647 | 0.921008 | ... | 0.922689 | 0.004986 | 3 | 0.99958 | 0.99916 | 0.99958 | 0.998739 | 0.998739 | 0.99916 | 0.000376 |


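We will take the best parameters found through cross-validation and train our model. Note that a mean train score of 1.0 against a mean test score of about 0.92 suggests that these deep trees overfit the training folds.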

In [101]:
start = datetime.now()
print("Training Gradient Boosting with parameters: 'max_depth': 30, 'min_samples_split': 100, 'n_estimators':100\n")
clf = GradientBoostingClassifier(max_depth=30, min_samples_split=100, n_estimators=100)
clf.fit(X_train_stacked, y_train)
print("Training completed... Time taken: {}".format(datetime.now()-start))
y_pred = clf.predict(X_test_stacked)
print('Accuracy: {}'.format(accuracy_score(y_test, y_pred)))
print('Confusion matrix: \n{}'.format(confusion_matrix(y_test, y_pred)))

Training Gradient Boosting with parameters: 'max_depth': 30, 'min_samples_split': 100, 'n_estimators':100

Training completed... Time taken: 0:00:05.427328
Accuracy: 0.9317647058823529
Confusion matrix: 
[[1078   18]
 [  69  110]]


Tuning did not change the test accuracy (about 0.9318 for both the default and the tuned model). Let's check feature importance.

In [117]:
feature_importance = pd.DataFrame(data=zip(feature_names, list(clf.feature_importances_)), columns=['feature', 'importance']) 

In [122]:
feature_importance.sort_values(by='importance', ascending=False, inplace=True)
feature_importance

| | feature | importance |
|---|---|---|
| 6 | total_day_minutes | 0.204241 |
| 8 | total_day_charge | 0.149033 |
| 9 | total_eve_minutes | 0.114907 |
| 18 | number_customer_service_calls | 0.113268 |
| 11 | total_eve_charge | 0.093504 |
| 5 | number_vmail_messages | 0.079299 |
| 12 | total_night_minutes | 0.039573 |
| 13 | total_night_calls | 0.032813 |
| 17 | total_intl_charge | 0.03166 |
| 7 | total_day_calls | 0.030921 |
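
The ranking is only meaningful if `feature_names` lists the columns in exactly the order they were stacked. `zip` silently stops at the shorter input, so a missing name would shift labels without raising an error. The sketch below adds an explicit length check before pairing. The helper `pair_names_with_importances` and the sample values are hypothetical and not part of the notebook.

```python
def pair_names_with_importances(names, importances):
    """Pair names with importances, refusing to proceed if the lengths differ."""
    if len(names) != len(importances):
        raise ValueError(
            f"{len(names)} feature names but {len(importances)} importances"
        )
    # Sort from most to least important
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical names and importances, for illustration only
names = ["area_a", "area_b", "plan_flag", "minutes"]
importances = [0.05, 0.05, 0.30, 0.60]

ranked = pair_names_with_importances(names, importances)
print(ranked)

# A missing name is reported instead of silently shifting the labels
try:
    pair_names_with_importances(names[:3], importances)
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print("Mismatch:", err)
```

For the model above, the check would compare `len(feature_names)` with `len(clf.feature_importances_)`, both of which should be 19.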


Now, our plan is to deploy this model to production and expose it through an API. We will provide a web interface where a user can enter a customer's telco usage, and our model will predict whether that customer is likely to churn.
<br>
The process will be as follows:
1. The user enters all the required features through the web interface
2. We take the input as a query point
3. The query point goes through the preprocessing steps
4. After preprocessing, we vectorize the query point
5. After vectorization, we provide it as input to our GBDT model
6. Based on the result, we update the web interface

<b>Preprocessing steps to do:</b>
1. For 'international_plan': if 'yes' then 1 else 0
2. For 'voice_mail_plan': if 'yes' then 1 else 0
3. Drop 'state' & 'account_length'
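
The three preprocessing steps above can be sketched as a small function that operates on one query point held in a plain dictionary. The function name `preprocess_query`, the dictionary representation, and the sample values are assumptions for illustration. The example includes only a few of the input fields to stay short.

```python
def preprocess_query(raw):
    """Apply the preprocessing steps to one raw query point (a dict)."""
    # Step 3: drop the columns that the model was not trained on
    cleaned = {k: v for k, v in raw.items() if k not in ("state", "account_length")}
    # Steps 1 and 2: encode the yes/no plan flags as 1/0
    for plan in ("international_plan", "voice_mail_plan"):
        cleaned[plan] = 1 if str(cleaned[plan]).strip().lower() == "yes" else 0
    return cleaned


# Hypothetical raw input as it might arrive from the web interface
raw_query = {
    "state": "OH",
    "account_length": 107,
    "area_code": "area_code_415",
    "international_plan": "no",
    "voice_mail_plan": "yes",
    "number_vmail_messages": 26,
}

query = preprocess_query(raw_query)
print(query)
```

The function returns a new dictionary and leaves the raw input untouched, which makes it easier to log the original request alongside the prediction.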

#### Vectorization
We will run the query points through the vectorizers below. We now dump all of these vectorizers, along with the model, to a pickle file.

In [125]:
categorical_vectorizers

{'areacode_vectorizer': CountVectorizer()}

In [126]:
numeric_vectorizers

{'number_vmail_messages_vectorizer': MinMaxScaler(),
 'total_day_minutes_vectorizer': MinMaxScaler(),
 'total_day_calls_vectorizer': MinMaxScaler(),
 'total_day_charge_vectorizer': MinMaxScaler(),
 'total_eve_minutes_vectorizer': MinMaxScaler(),
 'total_eve_calls_vectorizer': MinMaxScaler(),
 'total_eve_charge_vectorizer': MinMaxScaler(),
 'total_night_minutes_vectorizer': MinMaxScaler(),
 'total_night_calls_vectorizer': MinMaxScaler(),
 'total_night_charge_vectorizer': MinMaxScaler(),
 'total_intl_minutes_vectorizer': MinMaxScaler(),
 'total_intl_calls_vectorizer': MinMaxScaler(),
 'total_intl_charge_vectorizer': MinMaxScaler(),
 'number_customer_service_calls_vectorizer': MinMaxScaler()}

In [134]:
import pickle

# Objects are written one after another, so they must be loaded back in the same order.
with open('gbdt.pkl', 'wb') as file:
    for vec in categorical_vectorizers:
        pickle.dump(categorical_vectorizers[vec], file)
        print('Dumped: {}'.format(vec))

    for vec in numeric_vectorizers:
        pickle.dump(numeric_vectorizers[vec], file)
        print('Dumped: {}'.format(vec))

    pickle.dump(clf, file)
    print('Dumped: {}'.format(clf))

Dumped: areacode_vectorizer
Dumped: number_vmail_messages_vectorizer
Dumped: total_day_minutes_vectorizer
Dumped: total_day_calls_vectorizer
Dumped: total_day_charge_vectorizer
Dumped: total_eve_minutes_vectorizer
Dumped: total_eve_calls_vectorizer
Dumped: total_eve_charge_vectorizer
Dumped: total_night_minutes_vectorizer
Dumped: total_night_calls_vectorizer
Dumped: total_night_charge_vectorizer
Dumped: total_intl_minutes_vectorizer
Dumped: total_intl_calls_vectorizer
Dumped: total_intl_charge_vectorizer
Dumped: number_customer_service_calls_vectorizer
Dumped: GradientBoostingClassifier(max_depth=30, min_samples_split=100)
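
Because the file holds 16 separate pickles written back to back (1 categorical vectorizer, 14 numeric scalers, and the model), the serving code has to call `pickle.load` repeatedly and rely on the write order. The sketch below shows that read loop using an in-memory buffer and placeholder strings in place of the real objects. The helper `load_all` is a hypothetical name introduced here.

```python
import io
import pickle


def load_all(file_obj):
    """Read pickled objects one by one until the end of the stream."""
    objects = []
    while True:
        try:
            objects.append(pickle.load(file_obj))
        except EOFError:
            # pickle.load raises EOFError once no more data is left
            return objects


# Placeholder strings stand in for the fitted vectorizers and the model
buffer = io.BytesIO()
for obj in ["areacode_vectorizer", "number_vmail_messages_vectorizer", "gbdt_model"]:
    pickle.dump(obj, buffer)

buffer.seek(0)  # rewind before reading, as if the file had been reopened
restored = load_all(buffer)
print(restored)
```

With the real file, the equivalent would be `with open('gbdt.pkl', 'rb') as f: objects = load_all(f)`. An alternative that avoids depending on order is to dump a single dictionary containing every vectorizer and the model under named keys.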
