# 3. Preprocessing | Predicting Telco customer churn
## by Leo Evancie, Springboard Data Science Career Track

_This is part of a capstone project to predict customer churn with supervised machine learning. More information can be found in the [repository](https://github.com/levancie/customer-churn)._

First, I will import libraries and load the cleaned data, applying two adjustments carried over from EDA (noted in the comments below).

In [1]:
import pandas as pd
import numpy as np

# Load cleaned data, dropping the two less informative features as identified in EDA
customers = pd.read_csv('../data/cleaned.csv', index_col=0).drop(columns=['MultipleLines','PhoneService'])
# Replicate churn re-encoding step from EDA
customers['Churn'] = np.where(customers['Churn'] == 'Yes', 1, 0)
customers.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,No,Yes,No,1,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,0
1,5575-GNVDE,Male,No,No,No,34,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,0
2,3668-QPYBK,Male,No,No,No,2,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,1
3,7795-CFOCW,Male,No,No,No,45,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,0
4,9237-HQITU,Female,No,No,No,2,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,1
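
One detail of the re-encoding step is worth illustrating: assigning a pandas Series to a column aligns on index labels, not on position. The sketch below uses a hypothetical toy frame (not the project's data) with a non-default index to show how a freshly built Series can misalign, while an operation on the column itself keeps the original index. Here the loaded frame happens to be indexed 0 to 7042, so either form gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a non-default index (illustration only)
toy = pd.DataFrame({'Churn': ['Yes', 'No', 'Yes']}, index=[10, 11, 12])

# A new Series built from raw values gets a fresh 0..n-1 index, so
# assignment aligns on labels that do not exist in `toy` and yields NaN
misaligned = toy.assign(
    Churn=pd.Series(np.where(toy['Churn'].values == 'Yes', 1, 0))
)

# A boolean comparison on the column itself keeps the original index
aligned = toy.assign(Churn=(toy['Churn'] == 'Yes').astype(int))

print(misaligned)
print(aligned)
```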


In [2]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   object 
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   InternetService   7043 non-null   object 
 7   OnlineSecurity    7043 non-null   object 
 8   OnlineBackup      7043 non-null   object 
 9   DeviceProtection  7043 non-null   object 
 10  TechSupport       7043 non-null   object 
 11  StreamingTV       7043 non-null   object 
 12  StreamingMovies   7043 non-null   object 
 13  Contract          7043 non-null   object 
 14  PaperlessBilling  7043 non-null   object 
 15  PaymentMethod     7043 non-null   object 
 16  MonthlyCharges    7043 non-null   float64


We'll need dummy (1/0) columns for our categorical features, since most ML algorithms require numeric input and can't accept strings.

In [3]:
# Dummies for categorical variables
dummy_cols = ['gender','SeniorCitizen','Partner','Dependents','InternetService','OnlineSecurity',
              'OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies',
              'Contract','PaperlessBilling','PaymentMethod']
dummies = pd.get_dummies(customers[dummy_cols])
customers = pd.concat([customers,dummies], axis=1).drop(dummy_cols, axis=1)
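
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a hypothetical two-column toy frame (the values are borrowed from the table above, but the frame itself is made up). It also shows the `drop_first=True` option, which this notebook does not use: it drops one level per feature, which may be preferable for linear models where a full set of dummies is redundant.

```python
import pandas as pd

# Hypothetical toy frame (illustration only, not the project's data)
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'No'],
    'PaymentMethod': ['Electronic check', 'Mailed check',
                      'Bank transfer (automatic)'],
})

# One indicator column per category level, named <feature>_<level>
full = pd.get_dummies(toy)

# Optional variant: drop the first level of each feature
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```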

Our numeric columns need to be brought onto a consistent scale so that no feature's influence is exaggerated simply because of its units. But before I do that, I will split the data into training and testing portions. This is done to avoid 'leaking' information from the testing data into the training process: the scaler is fit on the training data only, and the testing data is then rescaled using the training data's statistics. This approximates the use of an ML model in production, since the testing data is the stand-in for 'new' data, which you couldn't possibly incorporate when training your model and fitting your scaler.

In [4]:
# Perform train/test split before scaling numeric columns
from sklearn.model_selection import train_test_split
y = customers.Churn
X = customers.drop(['Churn','customerID'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)

In [5]:
from sklearn.preprocessing import StandardScaler

cols_to_scale = ['tenure','MonthlyCharges','TotalCharges']

for col in cols_to_scale:
    # Instantiate and fit-transform training data
    ss = StandardScaler()
    scaled_column = ss.fit_transform(np.array(X_train[col]).reshape(-1,1))
    X_train = X_train.drop(col, axis=1)
    X_train[col] = scaled_column
    
    # Transform testing data with same scaler
    scaled_column = ss.transform(np.array(X_test[col]).reshape(-1,1))
    X_test = X_test.drop(col, axis=1)
    X_test[col] = scaled_column
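
As a sanity check on the idea of fitting the scaler on training data only, the sketch below uses randomly generated toy data (the column names mirror this project, but the values are made up). It fits a single `StandardScaler` on all toy training columns at once, which is an alternative to the per-column loop above, and then reuses the training statistics for the test split. The training columns come out with mean 0 and unit variance; the test columns will generally be close to that but not exact, which is expected.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated toy data (illustration only)
rng = np.random.default_rng(17)
toy = pd.DataFrame({
    'tenure': rng.integers(0, 73, size=200),
    'MonthlyCharges': rng.uniform(18, 120, size=200),
})
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=17)

# Fit on the training split only, then apply the same statistics to both
scaler = StandardScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train),
                            columns=toy.columns, index=toy_train.index)
test_scaled = pd.DataFrame(scaler.transform(toy_test),
                           columns=toy.columns, index=toy_test.index)

print(train_scaled.mean().round(3))  # zero by construction
print(test_scaled.mean().round(3))   # near zero, but not exactly
```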

With all features now in a state that's suitable for modeling, we can export the data and continue to the final notebook.

In [6]:
X_train.to_csv('../data/X_train.csv')
y_train.to_csv('../data/y_train.csv')
X_test.to_csv('../data/X_test.csv')
y_test.to_csv('../data/y_test.csv')
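
Because `to_csv` writes the index as the first column by default, the next notebook will likely want to read these files back with `index_col=0`. The round trip below is a minimal sketch on hypothetical toy objects written to a temporary directory; the file names `X_toy.csv` and `y_toy.csv` are placeholders, not files from this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy objects with a shuffled index, as after a train/test split
X_toy = pd.DataFrame({'tenure': [0.5, -1.2]}, index=[42, 7])
y_toy = pd.Series([1, 0], index=[42, 7], name='Churn')

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    X_toy.to_csv(tmp / 'X_toy.csv')
    y_toy.to_csv(tmp / 'y_toy.csv')

    # index_col=0 restores the original row labels instead of
    # creating an extra unnamed column
    X_back = pd.read_csv(tmp / 'X_toy.csv', index_col=0)
    # The target comes back as a one-column frame; select the column
    # to recover a Series
    y_back = pd.read_csv(tmp / 'y_toy.csv', index_col=0)['Churn']

print(X_back)
print(y_back)
```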