Let's vectorize all the features to build the model.<br>

<b>Categorical features:</b><br>
1. State
2. Area code
3. International Plan
4. Voice Mail Plan

<b>Numerical features:</b><br>
1. number_vmail_messages<br>
2. total_day_minutes<br>
3. total_day_calls<br>
4. total_day_charge<br>
5. total_eve_minutes<br>
6. total_eve_calls<br>
7. total_eve_charge<br>
8. total_night_minutes<br>
9. total_night_calls<br>
10. total_night_charge<br>
11. total_intl_minutes<br>
12. total_intl_calls<br>
13. total_intl_charge<br>
14. number_customer_service_calls<br>
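
As a quick sanity check, the two lists above can be written as Python lists and compared against the column names that `train.columns.values` prints further down. This is only a sketch: the list names `categorical_features`, `numerical_features`, and `all_columns` are introduced here for illustration and are not used elsewhere in the notebook.

```python
# Feature groups copied from the lists above (variable names are illustrative).
categorical_features = ["state", "area_code", "international_plan", "voice_mail_plan"]
numerical_features = [
    "number_vmail_messages",
    "total_day_minutes", "total_day_calls", "total_day_charge",
    "total_eve_minutes", "total_eve_calls", "total_eve_charge",
    "total_night_minutes", "total_night_calls", "total_night_charge",
    "total_intl_minutes", "total_intl_calls", "total_intl_charge",
    "number_customer_service_calls",
]

# Column names as printed by train.columns.values later in the notebook.
all_columns = [
    "state", "account_length", "area_code", "international_plan",
    "voice_mail_plan", "number_vmail_messages", "total_day_minutes",
    "total_day_calls", "total_day_charge", "total_eve_minutes",
    "total_eve_calls", "total_eve_charge", "total_night_minutes",
    "total_night_calls", "total_night_charge", "total_intl_minutes",
    "total_intl_calls", "total_intl_charge", "number_customer_service_calls",
    "churn",
]

# Columns that belong to neither feature group.
listed = set(categorical_features) | set(numerical_features)
unused = [c for c in all_columns if c not in listed]
print(unused)
```

The two leftover columns are `churn`, which is the target, and `account_length`, which appears in neither list and is not vectorized in the cells that follow.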

In [32]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack

<h4>Let's split the training data into train and test subsets. Later we build the model, tune its hyperparameters, and evaluate it.</h4>

In [3]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [6]:
print(train.columns.values)

['state' 'account_length' 'area_code' 'international_plan'
 'voice_mail_plan' 'number_vmail_messages' 'total_day_minutes'
 'total_day_calls' 'total_day_charge' 'total_eve_minutes'
 'total_eve_calls' 'total_eve_charge' 'total_night_minutes'
 'total_night_calls' 'total_night_charge' 'total_intl_minutes'
 'total_intl_calls' 'total_intl_charge' 'number_customer_service_calls'
 'churn']


In [5]:
print(train.shape)

(4250, 20)


In [7]:
X = train.drop(['churn'], axis=1)
y = train['churn'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)

print("X_train shape", X_train.shape)
print("X_test shape", X_test.shape)
print("y_train shape", y_train.shape)
print("y_test shape", y_test.shape)


X_train shape (2975, 19)
X_test shape (1275, 19)
y_train shape (2975,)
y_test shape (1275,)
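
The `stratify=y` argument asks `train_test_split` to keep the class proportions of `y` roughly equal in both subsets, which matters when the target is imbalanced. Here is a minimal sketch on made-up labels. The 860/140 class counts and the `random_state` value are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 860 "no" and 140 "yes" (not the real data).
y_demo = np.array(["no"] * 860 + ["yes"] * 140)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# random_state is an added assumption; it only makes the split repeatable.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Share of the positive class in each subset.
train_rate = np.mean(y_tr == "yes")
test_rate = np.mean(y_te == "yes")
print(f"positive rate in train: {train_rate:.3f}")
print(f"positive rate in test:  {test_rate:.3f}")
```

Without `stratify`, the two rates can drift apart by chance, which makes scores on the held-out subset harder to compare with training scores.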


<h4>Let's vectorize both datasets (train and test), fitting only on the train data.</h4>

In [24]:
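vectorizer = CountVectorizer()
def vectorizeCategoricalFeature(X_train, X_test, feature):
    # CountVectorizer treats each value as a short document; with one token
    # per value, the token counts act as a one-hot encoding.
    # Fit on the train split only, so the vocabulary comes from train data.
    vectorizer.fit(X_train[feature].values)
    # Reuse the fitted vocabulary for both splits so the columns line up.
    X_train_vectorized_feature = vectorizer.transform(X_train[feature].values)
    X_test_vectorized_feature = vectorizer.transform(X_test[feature].values)
    print("\nAfter one hot encoding ",feature)
    print("X_train shape: ", X_train_vectorized_feature.shape)
    print("X_test shape: ", X_test_vectorized_feature.shape)
    return X_train_vectorized_feature, X_test_vectorized_feature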

In [27]:
# state
X_train_state_ohe, X_test_state_ohe = vectorizeCategoricalFeature(X_train, X_test, 'state')

# area_code
X_train_areacode_ohe, X_test_areacode_ohe = vectorizeCategoricalFeature(X_train, X_test, 'area_code')

# international_plan
X_train_intplan_ohe, X_test_intplan_ohe = vectorizeCategoricalFeature(X_train, X_test, 'international_plan')

# voice_mail_plan
X_train_vmailplan_ohe, X_test_vmailplan_ohe = vectorizeCategoricalFeature(X_train, X_test, 'voice_mail_plan')



After one hot encoding  state
X_train shape:  (2975, 51)
X_test shape:  (1275, 51)

After one hot encoding  area_code
X_train shape:  (2975, 3)
X_test shape:  (1275, 3)

After one hot encoding  international_plan
X_train shape:  (2975, 2)
X_test shape:  (1275, 2)

After one hot encoding  voice_mail_plan
X_train shape:  (2975, 2)
X_test shape:  (1275, 2)
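
The cells above use `CountVectorizer` as a one-hot encoder. To illustrate why that works for single-token values, the sketch below compares it with scikit-learn's `OneHotEncoder` on a few made-up state codes. `OneHotEncoder` is a swapped-in alternative for comparison, not what this notebook uses, and the sample values are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample values shaped like a categorical column.
train_vals = np.array(["OH", "NJ", "OH", "KS"])
test_vals = np.array(["NJ", "TX"])  # "TX" never appears in train_vals

# CountVectorizer: one column per distinct token seen during fit.
cv = CountVectorizer()
cv.fit(train_vals)
cv_test = cv.transform(test_vals).toarray()

# OneHotEncoder: handle_unknown="ignore" encodes unseen categories as all zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_vals.reshape(-1, 1))
ohe_test = ohe.transform(test_vals.reshape(-1, 1)).toarray()

print(sorted(cv.vocabulary_))  # CountVectorizer lowercases tokens by default
print(cv_test)
print(ohe_test)
```

Both encoders give an all-zero row for a value that was not seen during fit. One caveat with `CountVectorizer` is that its default token pattern only keeps tokens of two or more characters, so it would silently drop single-character category values.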


In [30]:
scaler = MinMaxScaler()
def vectorizeNumericalFeature(X_train, X_test, feature):
    # reshape(-1, 1) turns the 1-D column into the 2-D shape the scaler expects.
    # Fit on the train split only, so min and max come from train data.
    scaler.fit(X_train[feature].values.reshape(-1,1))
    # Apply the same train-derived scaling to both splits.
    X_train_vectorized_feature = scaler.transform(X_train[feature].values.reshape(-1,1))
    X_test_vectorized_feature = scaler.transform(X_test[feature].values.reshape(-1,1))
    print("\nAfter normalizing ",feature)
    print("X_train shape: ", X_train_vectorized_feature.shape)
    print("X_test shape: ", X_test_vectorized_feature.shape)
    return X_train_vectorized_feature, X_test_vectorized_feature

In [31]:
# number_vmail_messages
X_train_numvmailmsg_norm, X_test_numvmailmsg_norm = vectorizeNumericalFeature(X_train, X_test, 'number_vmail_messages')

# total_day_minutes
X_train_totdaymins_norm, X_test_totdaymins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_minutes')

# total_day_calls
X_train_totdaycalls_norm, X_test_totdaycalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_calls')

# total_day_charge
X_train_totdaycharge_norm, X_test_totdaycharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_day_charge')

# total_eve_minutes
X_train_totevemins_norm, X_test_totevemins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_minutes')

# total_eve_calls
X_train_totevecalls_norm, X_test_totevecalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_calls')

# total_eve_charge
X_train_totevecharge_norm, X_test_totevecharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_eve_charge')

# total_night_minutes
X_train_totnightmins_norm, X_test_totnightmins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_minutes')

# total_night_calls
X_train_totnightcalls_norm, X_test_totnightcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_calls')

# total_night_charge
X_train_totnightcharge_norm, X_test_totnightcharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_night_charge')

# total_intl_minutes
X_train_totintlmins_norm, X_test_totintlmins_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_minutes')

# total_intl_calls
X_train_totintlcalls_norm, X_test_totintlcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_calls')

# total_intl_charge
X_train_totintlcharge_norm, X_test_totintlcharge_norm = vectorizeNumericalFeature(X_train, X_test, 'total_intl_charge')

# number_customer_service_calls
X_train_custservcalls_norm, X_test_custservcalls_norm = vectorizeNumericalFeature(X_train, X_test, 'number_customer_service_calls')


After normalizing  number_vmail_messages
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_day_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_eve_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_calls
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_night_charge
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing  total_intl_minutes
X_train shape:  (2975, 1)
X_test shape:  (1275, 1)

After normalizing 
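
With its default settings, `MinMaxScaler` maps each value x to (x - min) / (max - min), where min and max are learned during `fit`. Because the scaler is fitted on the train split only, a test value outside the training range lands outside [0, 1]. A small sketch with made-up numbers (the values and the name `scaler_demo` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column values; min=0 and max=200 are learned from train only.
train_col = np.array([0.0, 50.0, 100.0, 200.0]).reshape(-1, 1)
test_col = np.array([100.0, 250.0]).reshape(-1, 1)

scaler_demo = MinMaxScaler()
scaler_demo.fit(train_col)

train_scaled = scaler_demo.transform(train_col)
test_scaled = scaler_demo.transform(test_col)

print(train_scaled.ravel())
print(test_scaled.ravel())  # 250 is above the train max, so it scales past 1
```

Values slightly outside [0, 1] in the test split are expected with this approach and are usually harmless, but they are worth keeping in mind for models that assume bounded inputs.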

<h4>Let's stack the vectorized features using hstack to create two sets (train and test).</h4>

In [34]:
X_train_stacked = hstack((
    X_train_state_ohe, X_train_areacode_ohe, X_train_intplan_ohe, X_train_vmailplan_ohe,
    X_train_numvmailmsg_norm,
    X_train_totdaymins_norm, X_train_totdaycalls_norm, X_train_totdaycharge_norm,
    X_train_totevemins_norm, X_train_totevecalls_norm, X_train_totevecharge_norm,
    X_train_totnightmins_norm, X_train_totnightcalls_norm, X_train_totnightcharge_norm,
    X_train_totintlmins_norm, X_train_totintlcalls_norm, X_train_totintlcharge_norm,
    X_train_custservcalls_norm,
)).tocsr()
X_test_stacked = hstack((
    X_test_state_ohe, X_test_areacode_ohe, X_test_intplan_ohe, X_test_vmailplan_ohe,
    X_test_numvmailmsg_norm,
    X_test_totdaymins_norm, X_test_totdaycalls_norm, X_test_totdaycharge_norm,
    X_test_totevemins_norm, X_test_totevecalls_norm, X_test_totevecharge_norm,
    X_test_totnightmins_norm, X_test_totnightcalls_norm, X_test_totnightcharge_norm,
    X_test_totintlmins_norm, X_test_totintlcalls_norm, X_test_totintlcharge_norm,
    X_test_custservcalls_norm,
)).tocsr()

print("Stacked data set: ")
print("X_train shape: ", X_train_stacked.shape)
print("X_test shape: ", X_test_stacked.shape)

Stacked data set: 
X_train shape:  (2975, 72)
X_test shape:  (1275, 72)
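
The 72 columns can be accounted for from the shapes printed earlier: 51 + 3 + 2 + 2 one-hot columns plus 14 single-column numerical features. The sketch below checks that arithmetic and shows `scipy.sparse.hstack` joining a sparse block with a dense one, using toy blocks that are hypothetical and far smaller than the real matrices.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical toy blocks: a sparse one-hot block and a dense scaled column.
ohe_block = csr_matrix(np.array([[1, 0, 0], [0, 1, 0]]))
num_block = np.array([[0.25], [0.75]])

# hstack joins blocks column-wise; every block must have the same row count.
stacked = hstack((ohe_block, num_block)).tocsr()
print(stacked.shape)

# Column accounting for the shapes printed in the cells above.
categorical_widths = {
    "state": 51,
    "area_code": 3,
    "international_plan": 2,
    "voice_mail_plan": 2,
}
n_numerical = 14  # each scaled numerical feature contributes one column
expected_columns = sum(categorical_widths.values()) + n_numerical
print(expected_columns)
```

Converting the result with `.tocsr()` gives a row-oriented sparse format that supports efficient row slicing.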
