First, we import the libraries we need: TensorFlow, pandas, and NumPy.

In [20]:
import tensorflow as tf
import pandas as pd
import numpy as np

Next, we load our data into a pandas data frame and check how many examples we have.

In [57]:
data = pd.read_csv('./data/data.csv')
print(data.shape)
m = data.shape[0]

(7043, 21)


Now let's look at what our features are so that we can preprocess them.

In [73]:
print(data.head())
columns = data.columns

   gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  Female              0     Yes         No       1           No   
1    Male              0      No         No      34          Yes   
2    Male              0      No         No       2          Yes   
3    Male              0      No         No      45           No   
4  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity OnlineBackup  \
0  No phone service             DSL             No          Yes   
1                No             DSL            Yes           No   
2                No             DSL            Yes          Yes   
3  No phone service             DSL            Yes           No   
4                No     Fiber optic             No           No   

  DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0               No          No          No              No  Month-to-month   
1              Yes          No  

Let's see which columns could possibly be categorical.

In [75]:
print(columns)
print(data['gender'])

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
0       Female
1         Male
2         Male
3         Male
4       Female
5       Female
6         Male
7       Female
8       Female
9         Male
10        Male
11        Male
12        Male
13        Male
14        Male
15      Female
16      Female
17        Male
18      Female
19      Female
20        Male
21        Male
22        Male
23      Female
24        Male
25      Female
26        Male
27        Male
28        Male
29      Female
         ...  
7013    Female
7014      Male
7015      Male
7016    Female
7017    Female
7018      Male
7019    Female
7020      Male
7021      Male
7022      Male
7023    Female
702
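Printing a single column shows its values, but it does not scale to 20 columns. As a rough sketch, the check can be done for every column at once by listing the non-numeric columns and counting distinct values per column. The toy frame below is a stand-in for data, with a few values copied from the output above; the names toy, non_numeric_cols and unique_counts are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (hypothetical, for illustration only).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "SeniorCitizen": [0, 0, 1],
    "tenure": [1, 34, 2],
    "InternetService": ["DSL", "DSL", "Fiber optic"],
})

# Columns that are not numeric (here: strings) are the usual categorical candidates.
non_numeric_cols = [c for c in toy.columns
                    if not pd.api.types.is_numeric_dtype(toy[c])]

# A small number of distinct values also hints at a categorical column,
# even when it is stored as an integer (for example a 0/1 flag).
unique_counts = toy.nunique()

print(non_numeric_cols)
print(unique_counts)
```

Integer columns with only two distinct values, such as SeniorCitizen here, are already in a usable 0/1 form and usually need no further encoding.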

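Let's look at the columns that hold categorical data so that we can convert them into one-hot encodings. First, though, we drop the customerID column, since it is an identifier rather than a feature.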

In [72]:
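# Guard the drop so the cell can be re-run: once customerID is gone,
# indexing or dropping it again raises KeyError.
if 'customerID' in data.columns:
    print(data['customerID'])
    data = data.drop(labels = 'customerID', axis = 1)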
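The one-hot conversion mentioned above is not shown in this notebook. One common way to do it, offered here only as a sketch, is pandas' get_dummies. The toy frame and the choice of categorical_cols are assumptions for illustration; on the real data the list would have to be filled in after inspecting the columns.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (hypothetical, for illustration only).
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "InternetService": ["DSL", "DSL", "Fiber optic"],
    "tenure": [1, 34, 2],
})

# Assumed list of categorical columns; numeric columns are left untouched.
categorical_cols = ["gender", "InternetService"]

# Each categorical column is replaced by one 0/1 column per distinct value,
# named "<column>_<value>".
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The encoded columns are appended after the untouched ones, so a label column such as Churn would no longer be guaranteed to sit last if it were encoded this way; it may be simpler to separate the label before encoding the features.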

Now we create our train, dev, and test sets. We shuffle the data, then decide how many examples go into each set. With only about 7,000 examples, we use a 70/15/15 split, which gives the dev and test sets fewer examples than the training set.

In [26]:
# shuffle the data
permutation = np.random.permutation(m)
data = data.iloc[permutation]

# split the data into training, dev and test sets
training = data.iloc[: int(0.7 * m)]
dev = data.iloc[int(0.7 * m): int(0.85 * m)]
test = data.iloc[int(0.85 * m):]
print(training.shape)
print(dev.shape)
print(test.shape)

(4930, 21)
(1056, 21)
(1057, 21)
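Because the shuffle above is unseeded, every run produces a different split. A minimal sketch of a reproducible variant follows; the function name shuffle_split, its parameters, and the seed value are assumptions for illustration, and the toy frame only mimics the row count of the real data.

```python
import numpy as np
import pandas as pd

def shuffle_split(df, train_frac=0.7, dev_end_frac=0.85, seed=0):
    """Shuffle the rows with a fixed seed, then cut into train/dev/test."""
    m = df.shape[0]
    rng = np.random.default_rng(seed)      # seeded generator: same order every run
    shuffled = df.iloc[rng.permutation(m)]
    train_end = int(train_frac * m)
    dev_end = int(dev_end_frac * m)
    return (shuffled.iloc[:train_end],
            shuffled.iloc[train_end:dev_end],
            shuffled.iloc[dev_end:])

# Toy frame with the same number of rows as the real data set.
toy = pd.DataFrame({"x": np.arange(7043)})
train_part, dev_part, test_part = shuffle_split(toy)

print(train_part.shape)
print(dev_part.shape)
print(test_part.shape)
```

With 7,043 rows the three parts have 4,930, 1,056 and 1,057 rows, matching the sizes printed above.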


Now we need to split our data into X and Y. Let's create a function for that and use it on training, dev and test.

In [47]:
def split_xy(data):
    '''Split a data frame into X (all columns but the last) and Y (the last column).'''
    n = data.shape[1]
    X = data.iloc[:, :n - 1]
    Y = data.iloc[:, n - 1: n]  # slicing keeps Y as a one-column data frame
    return X, Y

In [46]:
X_train, Y_train = split_xy(training)
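The cell above only splits the training set. A sketch of the same call applied to the dev and test sets is below, using small toy frames so that it runs on its own. The toy values are made up, and split_xy is restated here with negative indexing, which is equivalent to the definition above.

```python
import pandas as pd

def split_xy(data):
    '''Split a data frame into X (all columns but the last) and Y (the last column).'''
    X = data.iloc[:, :-1]
    Y = data.iloc[:, -1:]  # slicing keeps Y as a one-column data frame
    return X, Y

# Toy stand-ins for the notebook's `dev` and `test` frames (illustrative values).
dev = pd.DataFrame({"tenure": [1, 34], "SeniorCitizen": [0, 0], "Churn": ["No", "No"]})
test = pd.DataFrame({"tenure": [2, 45, 2], "SeniorCitizen": [0, 0, 1], "Churn": ["Yes", "No", "Yes"]})

X_dev, Y_dev = split_xy(dev)
X_test, Y_test = split_xy(test)

print(X_dev.shape, Y_dev.shape)
print(X_test.shape, Y_test.shape)
```

This relies on the label column (Churn in this data set) being the last column of the frame.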
