# Trying to determine Customer Satisfaction based on numeric variables

Modern day businesses now rely on data to help their business. They need data help them market, data to help them run their business more efficiently, and data to help improve their products. The last topic is where I am going to focus. How to improve a product or service using data? One way is to gather customer feedback to determine if they are satisfied with your product/service and which factors best determine whether a customer is satisfied. That way you can determine which areas of the product and service to improve upon. That is the heart of the data set I am working with. Given certain factors, can I determine if a customer is satisfied with a service? A competition on Kaggle provides a dataset to try and determine if a customer is satisfied with a product. The data is presented in a csv format where there are 371 numerical variables representing different customer factors (these actual field variable names are hidden due to legal reasons). The target is simply category: 0 for customer sastification and 1 for a customer being unsatisfied. One of the difficulties with this data is that the data provided is skewed. About 96% of the samples are satisfied customers and the rest are unsatisfied instead of being a 50/50 split.

Let me start by visualizing the initial data to see whats going on.

In [16]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
import mlutils as ml
import neuralnetworks as nn
import qdalda as ql
import itertools
import imp
imp.reload(nn)
%matplotlib inline

def trainNN(X, T, parameters):
    classes = np.unique(T)
    print classes
    if parameters == 0 or parameters == [0] or parameters is None or parameters == [None]:
        nnet = nn.NeuralNetworkClassifier(X.shape[1], None, classes.shape[0])
        nnet.train(X, T)
    elif type(parameters) is int:
        nnet = nn.NeuralNetworkClassifier(X.shape[1], parameters, classes.shape[0])
        nnet.train(X, T)
    elif type(parameters) is list and len(parameters) == 1:
        nnet = nn.NeuralNetworkClassifier(X.shape[1], parameters[0], classes.shape[0])
        nnet.train(X, T)
    else:
        nnet = nn.NeuralNetworkClassifier(X.shape[1], parameters[0], classes.shape[0])
        nnet.train(X, T, nIterations = parameters[1])
    return nnet
def evaluateNN(model, X, T):
    results = model.use(X)
    return np.sum(results.ravel()==T.ravel()) / float(len(T)) * 100

data = pandas.read_csv("train.csv")
data.head()


Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,1,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.17,0
1,3,2,34,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.03,0
2,4,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.77,0
3,8,2,37,0.0,195.0,195.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.97,0
4,10,2,39,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0


In [17]:
targets = pandas.DataFrame(data.TARGET.value_counts())
targets['Percentage'] = 100*targets['TARGET']/data.shape[0]
targets

Unnamed: 0,TARGET,Percentage
0,73012,96.043147
1,3008,3.956853


It's clear that the unsatisfied customer are the outliers in this case but which factors are the best predictor of these outliers is the question. There are many ways to do classification, one of the first is removing categorical variables.

I want to make sure the data is cleaned before I proceed with analyzing it. To do this I will find out which variables are duplicates and which rows are duplicates.


In [18]:
def removeDuplicates(data):
    data.drop_duplicates()
    remove = []
    columns = data.columns
    for i in range(len(columns) - 1):
        values = data[columns[i]].values
        for j in range(i+1,len(columns)):
            if np.array_equal(values, data[columns[j]].values):
                remove.append(columns[j])
    data.drop(remove, axis=1, inplace=True)

I also need to remove columns with the same constant values

In [19]:
def removeConstantColumns(data):
    remove = []
    for col in data.columns:
        if data[col].std() == 0:
            remove.append(col)
    data.drop(remove, axis=1, inplace=True)

In [20]:
removeDuplicates(data)
removeConstantColumns(data)
data.describe()

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
count,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,...,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0,76020.0
mean,75964.050723,-1523.199277,33.212865,86.208265,72.363067,119.529632,3.55913,6.472698,0.412946,0.567352,...,7.935824,1.365146,12.21558,8.784074,31.505324,1.858575,76.026165,56.614351,117235.8,0.039569
std,43781.947379,39033.462364,12.956486,1614.757313,339.315831,546.266294,93.155749,153.737066,30.604864,36.513513,...,455.887218,113.959637,783.207399,538.439211,2013.125393,147.786584,4040.337842,2852.579397,182664.6,0.194945
min,1.0,-999999.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5163.75,0.0
25%,38104.75,2.0,23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67870.61,0.0
50%,76043.0,2.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,106409.2,0.0
75%,113748.75,2.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,118756.3,0.0
max,151838.0,238.0,105.0,210000.0,12888.03,21024.81,8237.82,11073.57,6600.0,6600.0,...,50003.88,20385.72,138831.63,91778.73,438329.22,24650.01,681462.9,397884.3,22034740.0,1.0


According to this link () column var3 is supposed to be the nationality of the customer. -999999 represents the nationality or country of origin not being known. Thus it is a categorical variable. I will take -9999999 as being 0.

In [21]:
data.var3.replace(-999999, 0)
data.drop(["ID"], axis =1)

Unnamed: 0,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,imp_op_var40_ult1,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
0,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,39205.170000,0
1,2,34,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49278.030000,0
2,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67333.770000,0
3,2,37,0.0,195.00,195.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,64007.970000,0
4,2,39,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0
5,2,23,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,87975.750000,0
6,2,27,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,94956.660000,0
7,2,26,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,251638.950000,0
8,2,45,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,101962.020000,0
9,2,25,0.0,0.00,0.00,0.00,0.00,0.0,0.0,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,356463.060000,0


## Add in data exploration of known (or suspected) column values

OK now I will go ahead and give the linear logistic regression model a try to see how I do.

In [23]:
test_data = pandas.read_csv("test.csv")
removeDuplicates(test_data)
removeConstantColumns(test_data)
test_data.var3.replace(-999999, 0)
test_data.drop(["ID"], axis = 1)
test_data = test_data.as_matrix()
train_data = data.as_matrix()
Targets = train_data[:,-1]
Targets = Targets.reshape((-1,1))
#Targets.reshape(76020, 1)
Tclasses = np.array([0,1])
train_data.drop(["TARGET"])
print train_data.shape
print test_data.shape

(76020, 308)
[0 1]


In [28]:
lda = ql.LDA()
lda.train(train_data, Targets)
pclass, probabilities, discriminants = lda.use(train_data)
classes,counts = np.unique(probabilities,return_counts=True)
print('classes',classes)
print('counts',counts)
ml.confusionMatrix(Targets,pclass,Tclasses)

('classes', array([  3.55199210e-129,   1.75175014e-127,   2.53693449e-126, ...,
         1.98554544e-112,   1.83106780e-111,   2.00084311e-111]))
('counts', array([1, 1, 1, ..., 1, 1, 1]))
   
    0
    1

    
------------
 0 |
 69.2
 30.8
   (73012 / 73012)
 1 |
 27.3
 72.7
   (3008 / 3008)


array([[  6.92351942e-01,   3.07648058e-01,   7.30120000e+04,
          7.30120000e+04],
       [  2.72938830e-01,   7.27061170e-01,   3.00800000e+03,
          3.00800000e+03]])