# Machine Learning Engineer Nanodegree

# Model Evaluation & Validation

# Project: Classifying customers to help grow new accounts

CFS has secured several new major accounts -- companies which previously purchased only from their competitors, including those old-time financial service bastions in Manhattan. The bad news is that these new customers are only purchasing one or two of their products, instead of the wide array of products they sell to their more established customers. In fact, revenue from their newly acquired customers is only about one-tenth that of their older wholesale customers. To grow new accounts, they need to know which products are most appropriate to sell to which new customers.

# Getting Started

To begin working with the customer data, we first import the libraries we need and load the data into pandas DataFrames. Run the code cell below to load the training and testing data and display the first few entries (customers) of the training data using the .head() function.


In [37]:
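import pandas as pd
import numpy as np
import xlrd

# Workbook holding the similarity matrices for the categorical attributes
workbook = xlrd.open_workbook('similaritymatrix.xls')

# The data files have no header row, so the column names are supplied here
names = ["Type","LifeStyle","Vacation","eCredit","salary","property","label"]
train = pd.read_csv("training.txt", names=names)
test = pd.read_csv("testing.txt", names=names)
print(train.head())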

      Type     LifeStyle  Vacation  eCredit  salary  property label
0  student  spend>saving         6       40   13.62    3.2804    C1
1  student  spend>saving        11       21   15.32    2.0232    C1
2  student  spend>saving         7       64   16.55    3.1202    C1
3  student  spend>saving         3       47   15.71    3.4022    C1
4  student  spend>saving        15       10   16.96    2.2825    C1


Because the goal is to predict the 'label' attribute, it is treated as the target and only the remaining attributes are used to compare customers. The numeric values in the table above are not normalised: each attribute has its own range, so an attribute with a large range would dominate a distance calculation. To put them on a common scale, each numeric attribute is rescaled to lie between 0 and 1. The function below performs this min-max normalisation.

In [30]:
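def normalFn(arr):
    """ Returns the values of arr rescaled to [0, 1] (min-max normalisation). """
    minn = np.min(arr)
    maxx = np.max(arr)
    finalArr =[]
    for i in arr:
        # (value - min) / (max - min); assumes max and min differ
        value = float(i-minn)/float(maxx-minn)
        finalArr.append(value)
    return finalArr

# Normalise the four numeric attributes of the training data
nvac = normalFn(train['Vacation'])
nec = normalFn(train['eCredit'])
nsal = normalFn(train['salary'])
npro = normalFn(train['property'])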

The normalised values returned by the function now replace the original values in the dataset.

In [31]:
train['Vacation'] = nvac
train['eCredit'] = nec
train['salary'] = nsal
train['property'] = npro
print(train.head())

      Type     LifeStyle  Vacation   eCredit    salary  property label
0  student  spend>saving  0.079365  0.107558  0.219960  0.183167    C1
1  student  spend>saving  0.158730  0.052326  0.293102  0.112797    C1
2  student  spend>saving  0.095238  0.177326  0.346023  0.174200    C1
3  student  spend>saving  0.031746  0.127907  0.309882  0.189984    C1
4  student  spend>saving  0.222222  0.020349  0.363663  0.127311    C1
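
The loop in normalFn can also be written as a single vectorised pandas operation. The following is a minimal sketch on a small made-up DataFrame; the toy values and the helper name min_max_normalize are illustrative assumptions, not part of the project data. It applies the same formula, (x - min) / (max - min), to several columns at once.

```python
import pandas as pd

def min_max_normalize(df, columns):
    """Return a copy of df with the given columns rescaled to [0, 1].

    Hypothetical helper using the same formula as normalFn:
    (x - min) / (max - min). Assumes each listed column has at
    least two distinct values, so max - min is never zero.
    """
    out = df.copy()
    col_min = out[columns].min()
    col_max = out[columns].max()
    out[columns] = (out[columns] - col_min) / (col_max - col_min)
    return out

# Toy data for illustration only (not taken from training.txt)
toy = pd.DataFrame({"Vacation": [2, 6, 10], "salary": [10.0, 15.0, 30.0]})
scaled = min_max_normalize(toy, ["Vacation", "salary"])
print(scaled)
```

Because the helper works on a copy, the original DataFrame is left unchanged, which makes it easy to compare raw and rescaled values side by side.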


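A similarity matrix is provided for each of the string attributes ('Type' and 'LifeStyle'). The function below maps each string value to a row and column position and reads the similarity of two values from the matching worksheet, so that the string attributes can take part in the numeric distance calculation as well.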

In [32]:
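def sim(x,y):
    """ Returns the similarity of two 'Type' or two 'LifeStyle' values. """
    # Row/column position of each value in its similarity matrix
    dict1={'student':1,'engineer':2,'librarian':3,'professor':4,'doctor':5}
    dict2={'spend<<saving':1,'spend<saving':2,'spend>saving':3,'spend>>saving':4}
    if x in dict1:
        # 'Type' values are looked up in the first worksheet
        worksheet=workbook.sheet_by_index(0)
        return worksheet.cell(dict1[x],dict1[y]).value
    else:
        # 'LifeStyle' values are looked up in the third worksheet
        worksheet=workbook.sheet_by_index(2)
        return worksheet.cell(dict2[x],dict2[y]).value
print(train.head())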

      Type     LifeStyle  Vacation   eCredit    salary  property label
0  student  spend>saving  0.079365  0.107558  0.219960  0.183167    C1
1  student  spend>saving  0.158730  0.052326  0.293102  0.112797    C1
2  student  spend>saving  0.095238  0.177326  0.346023  0.174200    C1
3  student  spend>saving  0.031746  0.127907  0.309882  0.189984    C1
4  student  spend>saving  0.222222  0.020349  0.363663  0.127311    C1
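
The lookup performed by sim can be illustrated without the spreadsheet. The sketch below stores a small symmetric similarity table in a pandas DataFrame and converts a similarity into the dissimilarity (1 - similarity) used later in the distance calculation. The similarity values are invented for illustration; the real ones come from similaritymatrix.xls, and the names lifestyle_sim and dissimilarity are hypothetical.

```python
import pandas as pd

levels = ["spend<<saving", "spend<saving", "spend>saving", "spend>>saving"]

# Made-up symmetric similarity values in [0, 1], with 1.0 on the diagonal
lifestyle_sim = pd.DataFrame(
    [[1.0, 0.5, 0.25, 0.0],
     [0.5, 1.0, 0.5, 0.25],
     [0.25, 0.5, 1.0, 0.5],
     [0.0, 0.25, 0.5, 1.0]],
    index=levels, columns=levels)

def dissimilarity(table, x, y):
    """Return 1 - similarity(x, y), looked up by label in the table."""
    return 1.0 - table.loc[x, y]

same = dissimilarity(lifestyle_sim, "spend>saving", "spend>saving")
far = dissimilarity(lifestyle_sim, "spend<<saving", "spend>saving")
print(same)
print(far)
```

Indexing the table by label rather than by position removes the need for separate dictionaries that translate strings into row and column numbers.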


To make predictions, the 'label' attribute is dropped from the testing data and kept aside for scoring. The value of k given to us is 3. For each customer in the testing data, the similarity to every customer in the training data is computed as the inverse of the distance between them and stored, together with the training customer's label, in a list called distances. The three training customers with the highest similarity are then selected, their similarities are summed per class, and the class with the largest total becomes the predicted label.
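
Written out as a formula, the computation in the cell below appears to be the following. Here a is a training customer, b is a testing customer, and F is the set of normalised numeric attributes (Vacation, eCredit, salary, property); the symbols d, s and F are notation introduced here for illustration and do not appear in the code.

```latex
d(a, b) = \sqrt{\bigl(1 - \mathrm{sim}_{\mathrm{Type}}(a, b)\bigr)
        + \bigl(1 - \mathrm{sim}_{\mathrm{LifeStyle}}(a, b)\bigr)
        + \sum_{f \in F} (a_f - b_f)^2},
\qquad
s(a, b) = \frac{1}{d(a, b)}
```

Note that s(a, b) is undefined when d(a, b) = 0, which would happen if a testing customer matched a training customer on every attribute.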

In [35]:
# Separate the attributes of the testing data from the true labels
features = test.drop('label', axis = 1)
label = test['label']

# Normalise the numeric attributes of the testing data (using its own range)
for col in ['Vacation', 'eCredit', 'salary', 'property']:
    features[col] = normalFn(test[col])

predicted=[]
for i in features.index:
    features_vector=features.loc[i]
    distances=[]
    for j in train.index:
        train_vector=train.loc[j]

        # String attributes: dissimilarity = 1 - similarity from the matrix
        type_value=1-sim(train_vector['Type'],features_vector['Type'])
        LifeStyle_value=1-sim(train_vector['LifeStyle'],features_vector['LifeStyle'])

        # Numeric attributes: squared difference of the normalised values
        Vacation_value=np.power(train_vector['Vacation']-features_vector['Vacation'],2)
        eCredit_value=np.power(train_vector['eCredit']-features_vector['eCredit'],2)
        salary_value=np.power(train_vector['salary']-features_vector['salary'],2)
        property_value=np.power(train_vector['property']-features_vector['property'],2)

        # Similarity is the inverse of the distance
        similarity=1/np.sqrt(type_value+LifeStyle_value+Vacation_value+eCredit_value+salary_value+property_value)
        distances.append((similarity,train_vector['label']))

    # Keep the k = 3 most similar training customers
    Top3=sorted(distances,key=lambda x: x[0])[-3:]

    # Sum the similarities of the three neighbours per class
    C1=0
    C2=0
    C3=0
    C4=0
    C5=0
    predicted_label="None"
    for dist,clas in Top3:
        if(clas=='C1'):
            C1=C1+dist
        elif(clas=='C2'):
            C2=C2+dist
        elif(clas=='C3'):
            C3=C3+dist
        elif(clas=='C4'):
            C4=C4+dist
        else:
            C5=C5+dist

    # The class with the strictly largest total wins; a tie leaves "None"
    if(C1>C2 and C1>C3 and C1>C4 and C1>C5):
        predicted_label="C1"
    elif(C2>C1 and C2>C3 and C2>C4 and C2>C5):
        predicted_label="C2"
    elif(C3>C1 and C3>C2 and C3>C4 and C3>C5):
        predicted_label="C3"
    elif(C4>C1 and C4>C2 and C4>C3 and C4>C5):
        predicted_label="C4"
    elif(C5>C1 and C5>C2 and C5>C3 and C5>C4):
        predicted_label="C5"
    predicted.append(predicted_label)
print(predicted)


['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3', 'C3', 'C5', 'C4', 'C1', 'C1', 'C4', 'C4', 'C5', 'C5', 'C5']
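
The chain of comparisons that picks the winning class could be sketched more compactly with a dictionary of per-class totals. This is a minimal sketch under stated assumptions, not the notebook's own code: the function name weighted_vote and the toy neighbours are hypothetical. It keeps the "None" fallback for ties, but unlike the cell above it does not fold unknown labels into C5, and it assumes the list of neighbours is not empty.

```python
from collections import defaultdict

def weighted_vote(neighbours):
    """Pick the class whose neighbours have the largest total similarity.

    neighbours is a non-empty list of (similarity, label) pairs, like Top3.
    Returns "None" when the best total is shared by several classes.
    """
    totals = defaultdict(float)
    for similarity, label in neighbours:
        totals[label] += similarity
    best = max(totals.values())
    winners = [lab for lab, total in totals.items() if total == best]
    return winners[0] if len(winners) == 1 else "None"

# Toy neighbours for illustration only: C2 totals 2.5, C1 totals 2.0
print(weighted_vote([(2.0, "C1"), (1.5, "C2"), (1.0, "C2")]))
```

Summing similarities rather than counting votes means that one very close neighbour can outweigh two more distant ones from another class.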


The predicted classes for all 21 customers in the testing data are stored in the predicted list. They are now compared with the true labels.

In [36]:
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """
    
    # Ensure that the number of predictions matches number of outcomes
    if len(truth) == len(pred): 
        
        # Calculate and return the accuracy as a percent
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)
    
    else:
        return "Number of predictions does not match number of outcomes!"
print(accuracy_score(label,predicted))

Predictions have an accuracy of 28.57%.


The classifier assigns the correct label to 28.57% of the testing data, that is, 6 of the 21 customers.
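
A single accuracy figure does not show which classes are being confused. One way to look further, sketched here on made-up labels rather than the project's test set, is a cross-tabulation of true against predicted classes with pd.crosstab. In the notebook, the same call could presumably be applied to label and pd.Series(predicted, index=label.index), which is an assumption and not something the original cells do.

```python
import pandas as pd

# Toy labels for illustration only (not the project's test set)
truth = pd.Series(["C1", "C1", "C2", "C2", "C3"], name="actual")
pred = pd.Series(["C1", "C2", "C2", "C2", "C1"], name="predicted")

# Rows are the true classes, columns are the predicted classes
confusion = pd.crosstab(truth, pred)
print(confusion)

# Fraction of matching labels, computed the same way as in accuracy_score
accuracy = (truth == pred).mean()
print("Accuracy: {:.2f}%".format(accuracy * 100))
```

Off-diagonal counts in the table point to the pairs of classes that the classifier mixes up.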