# Predicting Customer Churn in Python: Data Preprocessing

## The Dataset

In this project, we will learn how to build a churn model from beginning to end. The data comes from a cellular usage dataset consisting of records of actual cell phone customers, with features that include:
* voice mail
* international calling
* cost for the service
* customer usage
* customer churn


In [1]:
import pandas as pd

In [2]:
telco_df=pd.read_csv('./Churn.csv')
telco_df

Unnamed: 0,Account_Length,Vmail_Message,Day_Mins,Eve_Mins,Night_Mins,Intl_Mins,CustServ_Calls,Churn,Intl_Plan,Vmail_Plan,...,Day_Charge,Eve_Calls,Eve_Charge,Night_Calls,Night_Charge,Intl_Calls,Intl_Charge,State,Area_Code,Phone
0,128,25,265.1,197.4,244.7,10.0,1,no,no,yes,...,45.07,99,16.78,91,11.01,3,2.70,KS,415,382-4657
1,107,26,161.6,195.5,254.4,13.7,1,no,no,yes,...,27.47,103,16.62,103,11.45,3,3.70,OH,415,371-7191
2,137,0,243.4,121.2,162.6,12.2,0,no,no,no,...,41.38,110,10.30,104,7.32,5,3.29,NJ,415,358-1921
3,84,0,299.4,61.9,196.9,6.6,2,no,yes,no,...,50.90,88,5.26,89,8.86,7,1.78,OH,408,375-9999
4,75,0,166.7,148.3,186.9,10.1,3,no,yes,no,...,28.34,122,12.61,121,8.41,3,2.73,OK,415,330-6626
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3328,192,36,156.2,215.5,279.1,9.9,2,no,no,yes,...,26.55,126,18.32,83,12.56,6,2.67,AZ,415,414-4276
3329,68,0,231.1,153.4,191.3,9.6,3,no,no,no,...,39.29,55,13.04,123,8.61,4,2.59,WV,415,370-3271
3330,28,0,180.8,288.8,191.9,14.1,2,no,no,no,...,30.74,58,24.55,91,8.64,6,3.81,RI,510,328-8230
3331,184,0,213.8,159.6,139.2,5.0,2,no,yes,no,...,36.35,84,13.57,137,6.26,10,1.35,CT,510,364-6381


## Identifying features to convert

It is preferable to have features like 'Churn' encoded as 0 and 1 instead of no and yes, so that we can feed them into machine learning algorithms that only accept numeric values.

Besides 'Churn', other yes/no features stored as type object can be converted into 0s and 1s. Below, we inspect the data types of telco_df and identify the columns that are of type object.

In [3]:
print(telco_df.dtypes)

Account_Length      int64
Vmail_Message       int64
Day_Mins          float64
Eve_Mins          float64
Night_Mins        float64
Intl_Mins         float64
CustServ_Calls      int64
Churn              object
Intl_Plan          object
Vmail_Plan         object
Day_Calls           int64
Day_Charge        float64
Eve_Calls           int64
Eve_Charge        float64
Night_Calls         int64
Night_Charge      float64
Intl_Calls          int64
Intl_Charge       float64
State              object
Area_Code           int64
Phone              object
dtype: object
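
The object-typed columns can also be listed programmatically instead of being read off the printout. The following is a minimal sketch on a small stand-in frame (`toy_df` and its values are illustrative, not taken from Churn.csv), using `select_dtypes`:

```python
import pandas as pd

# Toy stand-in for telco_df; values are illustrative only.
# dtype=object is set explicitly so the example does not depend on
# how a given pandas version infers string columns.
toy_df = pd.DataFrame({
    "Account_Length": [128, 107],
    "Day_Mins": [265.1, 161.6],
    "Churn": pd.Series(["no", "no"], dtype=object),
    "State": pd.Series(["KS", "OH"], dtype=object),
})

# Keep only the columns whose dtype is object and collect their names
object_cols = toy_df.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Applied to telco_df, the same call should return the five object columns shown in the listing above.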


## Encoding binary features

Recasting data types is an important part of data preprocessing. Here we assign the value 1 to 'yes' and 0 to 'no' in the 'Intl_Plan', 'Vmail_Plan', and 'Churn' features.

In [4]:
from sklearn.preprocessing import LabelEncoder


In [5]:
le=LabelEncoder()
telco_df['Intl_Plan']=le.fit_transform(telco_df['Intl_Plan'])
telco_df['Intl_Plan'].head()

0    0
1    0
2    0
3    1
4    1
Name: Intl_Plan, dtype: int64

In [6]:
telco_df['Vmail_Plan']=le.fit_transform(telco_df['Vmail_Plan'])
telco_df['Vmail_Plan'].head()

0    1
1    1
2    0
3    0
4    0
Name: Vmail_Plan, dtype: int64

In [7]:
telco_df['Churn']=le.fit_transform(telco_df['Churn'])
telco_df['Churn'].head()

0    0
1    0
2    0
3    0
4    0
Name: Churn, dtype: int64
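
LabelEncoder assigns integers in sorted class order, which is why 'no' becomes 0 and 'yes' becomes 1 here. If you would rather state that mapping explicitly, one possible alternative is `Series.map` with a dictionary. The sketch below uses a small illustrative frame (`toy_df` and `yes_no_map` are hypothetical names, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the three yes/no columns; values are illustrative only
toy_df = pd.DataFrame({
    "Intl_Plan": ["no", "yes", "no"],
    "Vmail_Plan": ["yes", "no", "no"],
    "Churn": ["no", "no", "yes"],
})

# Explicit mapping, so the encoding does not depend on class ordering
yes_no_map = {"no": 0, "yes": 1}

for col in ["Intl_Plan", "Vmail_Plan", "Churn"]:
    # Any value missing from the mapping would become NaN
    toy_df[col] = toy_df[col].map(yes_no_map)

print(toy_df)
```

An explicit dictionary makes unexpected values easy to spot, since they surface as NaN rather than receiving an integer code of their own.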

## One hot encoding

In [8]:
# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco_df['State'])

# Print the head of telco_state
telco_state.head()

Unnamed: 0,AK,AL,AR,AZ,CA,CO,CT,DC,DE,FL,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
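
With around fifty state columns, the 1s are hidden in the truncated display above. A tiny example makes the behaviour of `pd.get_dummies` easier to see. This is a sketch on made-up data (`toy_states` is a hypothetical series); `dtype=int` is passed as an assumption about the desired output, because recent pandas versions return boolean columns by default:

```python
import pandas as pd

# Toy stand-in for the 'State' column; values are illustrative only
toy_states = pd.Series(["KS", "OH", "NJ", "OH"], name="State")

# One column per distinct state, with a single 1 in each row
toy_dummies = pd.get_dummies(toy_states, dtype=int)
print(toy_dummies)

# Each row has exactly one active column
print(toy_dummies.sum(axis=1).tolist())
```

Each row contains exactly one 1, in the column matching that row's state.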


In [9]:
telco_df.drop(columns=['State'],inplace=True)
telco_df=pd.concat([telco_df,telco_state],axis=1)
telco_df.head()

Unnamed: 0,Account_Length,Vmail_Message,Day_Mins,Eve_Mins,Night_Mins,Intl_Mins,CustServ_Calls,Churn,Intl_Plan,Vmail_Plan,...,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY
0,128,25,265.1,197.4,244.7,10.0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,107,26,161.6,195.5,254.4,13.7,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,137,0,243.4,121.2,162.6,12.2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,84,0,299.4,61.9,196.9,6.6,2,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,75,0,166.7,148.3,186.9,10.1,3,0,1,0,...,0,0,0,0,0,0,0,0,0,0


## Feature Scaling

Let's investigate the different scales of the 'Intl_Calls' and 'Night_Mins' features.

In [10]:
telco_df['Intl_Calls'].describe()

count    3333.000000
mean        4.479448
std         2.461214
min         0.000000
25%         3.000000
50%         4.000000
75%         6.000000
max        20.000000
Name: Intl_Calls, dtype: float64

In [11]:
telco_df['Night_Mins'].describe()

count    3333.000000
mean      200.872037
std        50.573847
min        23.200000
25%       167.000000
50%       201.200000
75%       235.300000
max       395.000000
Name: Night_Mins, dtype: float64

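'Night_Mins' takes values in the hundreds while 'Intl_Calls' stays below about 20, so here we will re-scale them using StandardScaler, which centers each feature at mean 0 and scales it to unit variance.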

In [12]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Scale 'Intl_Calls' and 'Night_Mins' using StandardScaler
telco_scaled = StandardScaler().fit_transform(telco_df[["Intl_Calls", "Night_Mins"]])

# Add column names back for readability
telco_scaled_df = pd.DataFrame(telco_scaled, columns=["Intl_Calls", "Night_Mins"])

# Print summary statistics
print(telco_scaled_df.describe())

         Intl_Calls    Night_Mins
count  3.333000e+03  3.333000e+03
mean  -1.264615e-16  6.602046e-17
std    1.000150e+00  1.000150e+00
min   -1.820289e+00 -3.513648e+00
25%   -6.011951e-01 -6.698545e-01
50%   -1.948306e-01  6.485803e-03
75%    6.178983e-01  6.808485e-01
max    6.307001e+00  3.839081e+00
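
The std row reads 1.000150 rather than exactly 1. This is expected: StandardScaler divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The two differ by a factor of sqrt(n / (n - 1)), which for 3,333 rows is about 1.00015. The sketch below reproduces the effect by hand on a few made-up values (`values` is illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative values only
values = pd.Series([3.0, 3.0, 5.0, 7.0, 3.0], name="Intl_Calls")

# Z-score as StandardScaler computes it: subtract the mean,
# divide by the population standard deviation (ddof=0)
z = (values - values.mean()) / values.std(ddof=0)

n = len(z)
print(z.std(ddof=0))         # population std of the scaled values
print(z.std(ddof=1))         # sample std, which is what describe() reports
print(np.sqrt(n / (n - 1)))  # ratio between the two estimates
```

The second and third printed numbers agree, which accounts for the small excess over 1 in the summary above.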


In [13]:
#telco_df[["Intl_Calls", "Night_Mins"]]=telco_scaled_df
#telco_df.head()

## Dropping unnecessary features

In [14]:
# Drop the unnecessary features by column name
telco_df = telco_df.drop(columns=['Area_Code', 'Phone'])

# Verify dropped features
print(telco_df.columns)

Index(['Account_Length', 'Vmail_Message', 'Day_Mins', 'Eve_Mins', 'Night_Mins',
       'Intl_Mins', 'CustServ_Calls', 'Churn', 'Intl_Plan', 'Vmail_Plan',
       'Day_Calls', 'Day_Charge', 'Eve_Calls', 'Eve_Charge', 'Night_Calls',
       'Night_Charge', 'Intl_Calls', 'Intl_Charge', 'AK', 'AL', 'AR', 'AZ',
       'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IA', 'ID', 'IL', 'IN',
       'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC',
       'ND', 'NE', 'NH', 'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI',
       'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'],
      dtype='object')


## Engineering a new column

Leveraging domain knowledge to engineer new features is an essential part of modeling. 

This quote from Andrew Ng summarizes the importance of feature engineering:

_Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering._

Here, we will create a new feature that contains the average length, in minutes, of the night calls made by each customer.

In [15]:
# Create the new feature
telco_df['Avg_Night_Calls'] = telco_df['Night_Mins']/telco_df['Night_Calls']

# Print the first five rows of 'Avg_Night_Calls'
print(telco_df['Avg_Night_Calls'].head())

0    2.689011
1    2.469903
2    1.563462
3    2.212360
4    1.544628
Name: Avg_Night_Calls, dtype: float64
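
One caveat with a ratio feature like this: a customer with zero night calls would produce inf or NaN from the division. Whether such rows exist in this dataset is not shown above, so the following is only a defensive sketch on made-up rows (`toy_df` is hypothetical), replacing zero call counts with NaN before dividing:

```python
import numpy as np
import pandas as pd

# Toy rows; the last one has no night calls at all
toy_df = pd.DataFrame({
    "Night_Mins": [244.7, 254.4, 0.0],
    "Night_Calls": [91, 103, 0],
})

# Treat zero calls as missing so the ratio is NaN rather than inf
calls = toy_df["Night_Calls"].replace(0, np.nan)
toy_df["Avg_Night_Calls"] = toy_df["Night_Mins"] / calls

print(toy_df)
```

How the resulting NaN values should be handled (filled with 0, dropped, or left as is) depends on the model used downstream.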


In [16]:
telco_df.to_csv('./telco_preprocessed.csv',index=False)