# Data Processing

Real-world data is rarely clean. To make it usable by machine learning models, we need to apply a range of techniques to clean, process, and transform it.

This notebook includes code for cleaning, processing, and transforming data using pandas and scikit-learn:
- Removing missing values
- Removing unwanted features
- Label binarization
- Converting categorical variables to numeric
- Min-max scaling (MinMaxScaler)
---

### Import pandas and scikit-learn packages

In [1]:
import numpy as np    # linear algebra
import pandas as pd   # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from sklearn.preprocessing import MinMaxScaler

In [2]:
print(os.listdir("../data"))

['.DS_Store', 'Telco-Customer-Churn.csv', '.ipynb_checkpoints']


### Read data files

In [3]:
telecom_cust = pd.read_csv('../data/Telco-Customer-Churn.csv')

In [4]:
telecom_cust.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Numerical data type

In [5]:
# Converting Total Charges to a numerical data type.
telecom_cust.TotalCharges = pd.to_numeric(telecom_cust.TotalCharges, errors='coerce')
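
To see what `errors='coerce'` does, here is a minimal sketch on a made-up Series (the values are illustrative, not read from the dataset). Strings that cannot be parsed as numbers, such as a blank, become `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample values; the blank string stands in for an unparseable entry
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# Unparseable strings are replaced by NaN rather than raising a ValueError
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

This is why a missing-value check after the conversion is useful: any entry that was silently coerced shows up as `NaN`.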

### Missing values

Checking for missing values:

In [6]:
telecom_cust.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

We can see that there are 11 missing values for Total Charges. Since only 11 values are missing, and all of them are in a single feature (Total Charges), we can drop these 11 observations.

In [7]:
# Removing missing values 
telecom_cust.dropna(inplace = True)
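
Called without arguments, `dropna()` removes every row that has a missing value in any column. If the check should be limited to specific columns, the `subset` parameter does that. A minimal sketch on a made-up frame (the column names mirror the dataset, the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing TotalCharges value
toy = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": [29.85, np.nan, 1889.5],
})

# Drop only the rows where TotalCharges is missing
cleaned = toy.dropna(subset=["TotalCharges"])

print(len(toy), "->", len(cleaned))  # 3 -> 2
```

Note that the surviving rows keep their original index labels, so the index has gaps after the drop.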

### Remove unwanted features

Customer IDs are of no use in predicting churn: an ID has no relation to whether a customer leaves or stays with the company or service. So we will drop this feature.

In [8]:
# Remove the customerID column from the data set (by name, so it does not depend on column order)
telecom_cust = telecom_cust.drop(columns=['customerID'])

### Binary labels

Machine learning models expect data and labels in numerical format. We will convert the target variable values (Yes/No) to binary numerical values (1/0).

In [9]:
# Convert the target variable to a binary numeric variable.
# Assigning the result back avoids in-place replacement on a column selection,
# which newer pandas versions may not apply to the original DataFrame.
telecom_cust['Churn'] = telecom_cust['Churn'].map({'Yes': 1, 'No': 0})
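
The sketch below shows the same Yes/No to 1/0 conversion on a made-up Series using `Series.map` with an explicit dictionary. Any label missing from the dictionary becomes `NaN`, so it is worth checking the result afterwards.

```python
import pandas as pd

# Hypothetical labels, standing in for the Churn column
labels = pd.Series(["Yes", "No", "No", "Yes"])

# Map each label to its numeric code; unmapped labels would become NaN
binary = labels.map({"Yes": 1, "No": 0})

# Sanity check: every label was covered by the mapping
assert binary.notna().all()

print(binary.tolist())  # [1, 0, 0, 1]
```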

### Convert categorical variables to numeric 

Most machine learning models (and almost all deep learning models) expect numerical input (vectors). We will use the pandas **get_dummies** function to convert categorical variables into numerical variables.

In [10]:
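# Convert all the categorical variables into dummy (indicator) variables.
# dtype=int keeps the indicators as 0/1; newer pandas versions default to True/False.
df_dummies = pd.get_dummies(telecom_cust, dtype=int)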

In [11]:
df_dummies.head()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn,gender_Female,gender_Male,Partner_No,Partner_Yes,Dependents_No,...,StreamingMovies_Yes,Contract_Month-to-month,Contract_One year,Contract_Two year,PaperlessBilling_No,PaperlessBilling_Yes,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0,1,29.85,29.85,0,1,0,0,1,1,...,0,1,0,0,0,1,0,0,1,0
1,0,34,56.95,1889.5,0,0,1,1,0,1,...,0,0,1,0,1,0,0,0,0,1
2,0,2,53.85,108.15,1,0,1,1,0,1,...,0,1,0,0,0,1,0,0,0,1
3,0,45,42.3,1840.75,0,0,1,1,0,1,...,0,0,1,0,1,0,1,0,0,0
4,0,2,70.7,151.65,1,1,0,1,0,1,...,0,1,0,0,0,1,0,0,1,0
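
To make the effect of `get_dummies` concrete, here is a minimal sketch on a made-up two-column frame (the values are illustrative). Numeric columns pass through unchanged, while each distinct value of a string column becomes its own indicator column. `dtype=int` is passed here so the indicators are 0/1 integers, since recent pandas versions default to booleans.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column
toy = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
})

# One indicator column is created per distinct Contract value
dummies = pd.get_dummies(toy, dtype=int)

print(dummies.columns.tolist())
# ['tenure', 'Contract_Month-to-month', 'Contract_One year']
```

Because the indicator columns for one variable always sum to 1, one of them is redundant. If that matters for the model being used, `get_dummies` also accepts `drop_first=True` to omit the first level of each variable.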


In [12]:
# saving processed dataframe in a csv file
df_dummies.to_csv('../data/processed_data.csv',index=False)

### Scaled Data

Scaling all the variables to a range of 0 to 1:

Many machine learning models often train better when all features share a common range, typically 0 to 1 (logistic regression is one example). We will use MinMaxScaler from scikit-learn, which transforms each feature by scaling it to a given range.
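
For the default range of 0 to 1, min-max scaling maps each value x in a column to (x - min) / (max - min), where min and max are taken per column. The sketch below, on made-up numbers, computes this by hand and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two feature columns on very different scales
X = np.array([[1.0, 29.85],
              [34.0, 56.95],
              [2.0, 53.85]])

# Hand-written min-max scaling, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation using scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(np.allclose(manual, scaled))  # True
```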

In [13]:
not_Scaled_data = df_dummies

In [14]:
# store the column names so they can be restored after scaling
features = not_Scaled_data.columns.values

In [15]:
# initialize MinMaxScaler
scaler = MinMaxScaler(feature_range = (0,1))
# fit the scaler to the data
scaler.fit(not_Scaled_data)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [16]:
# transform and store the scaled data in DataFrame format
scaled_data = pd.DataFrame(scaler.transform(not_Scaled_data))
scaled_data.columns = features

In [17]:
# saving processed dataframe in a csv file
scaled_data.to_csv('../data/scaled_data.csv',index=False)