# Machine Learning Engineer Nanodegree

## Model Evaluation & Validation

## Project: Classifying customers to help grow new accounts

   CFS has secured several new major accounts -- companies which previously purchased only from their competitors, including those old-time financial service bastions in Manhattan. The bad news is that these new customers are only purchasing one or two of their products, instead of the wide array of products they sell to their more established customers. In fact, revenue from their newly acquired customers is only about one-tenth that of their older wholesale customers.
   To grow new accounts, they need to know which products are most appropriate to sell to which new customers. 

## Getting Started

To begin working with the customers data, we'll first need to import the functionality we need, and load our data into a pandas DataFrame.
Run the code cell below to load our data and display the first few entries (customers) for examination using the .head() function.

In [1]:
import pandas as pd
import numpy as np
from sklearn.cross_validation import ShuffleSplit

names=['Type','LifeStyle','Vacation','eCredit','salary','property','label']
train_data=pd.read_csv("training.txt",header=None,names=names)
test_data=pd.read_csv("testing.txt",header=None,names=names)
print train_data.head()

      Type     LifeStyle  Vacation  eCredit  salary  property label
0  student  spend>saving         6       40   13.62    3.2804    C1
1  student  spend>saving        11       21   15.32    2.0232    C1
2  student  spend>saving         7       64   16.55    3.1202    C1
3  student  spend>saving         3       47   15.71    3.4022    C1
4  student  spend>saving        15       10   16.96    2.2825    C1




From a sample of the customers data, we can see the various features present for each customer:
    
    -> Type:{student,engineer,librarian,professor,doctor} - The type of customer.
    
    -> LifeStyle:{spend << saving,spend < saving,spend > saving,spend >> saving} - The comparision of money spent and money saved by the customer.
    
    -> Vacation: A real number depicting the number of vacations a customer has taken.
    
    -> eCredit: A real number depicting the eCredit of a customer.
    
    -> salary: A real number depicting the salary of a customer.
    
    -> property: A real number depicting the property worth of a customer.
    
    -> label:{C1,C2,C3,C4,C5} - The class that defines the customer.
Since we're interested in the outcome of label for each customer, we can remove the label feature from this dataset and store it as its own separate variable labels. We will use these labels later as our prediction targets.

Normalize all the numeric attributes(Vacation,eCredit,salary,property) to fall between 0 and 1 using the Min Max normalization.
The function below takes an array as argument and returns an array with normalized values.

In [2]:
min_Vacation=np.min(train_data['Vacation'])
max_Vacation=np.max(train_data['Vacation'])
min_eCredit=np.min(train_data['eCredit'])
max_eCredit=np.max(train_data['eCredit'])
min_salary=np.min(train_data['salary'])
max_salary=np.max(train_data['salary'])
min_property=np.min(train_data['property'])
max_property=np.max(train_data['property'])
def normalized(arr):
    array=[]
    max_value=np.max(arr)
    min_value=np.min(arr)
    for x in arr:
        x_normalized=(x-min_value)/float((max_value-min_value))
        array.append(x_normalized)
    return  array

In [3]:
normalized_Vacation=normalized(train_data['Vacation'])
normalized_eCredit=normalized(train_data['eCredit'])
normalized_salary=normalized(train_data['salary'])
normalized_property=normalized(train_data['property'])
print normalized_Vacation[0:5]

[0.079365079365079361, 0.15873015873015872, 0.095238095238095233, 0.031746031746031744, 0.22222222222222221]


Now we need to replace the training set data with the new normalized data. So we normalize each column seperately and replace the training data set values with the new values.

In [4]:
train_data['Vacation']=normalized_Vacation
train_data['eCredit']=normalized_eCredit
train_data['salary']=normalized_salary
train_data['property']=normalized_property
print train_data.head()

      Type     LifeStyle  Vacation   eCredit    salary  property label
0  student  spend>saving  0.079365  0.107558  0.219960  0.183167    C1
1  student  spend>saving  0.158730  0.052326  0.293102  0.112797    C1
2  student  spend>saving  0.095238  0.177326  0.346023  0.174200    C1
3  student  spend>saving  0.031746  0.127907  0.309882  0.189984    C1
4  student  spend>saving  0.222222  0.020349  0.363663  0.127311    C1


Finding distances between floating point numbers is possible but it is not possible for strings unless the strings are converted to numbers. We have been provided with a similarity matrix using which we will find out similarity between two strings. The following function sim takes in two strings and returns the similarity value between them.

In [5]:
import xlrd
workbook = xlrd.open_workbook('similaritymatrix.xls')
def sim(x,y):
    dict1={'student':1,'engineer':2,'librarian':3,'professor':4,'doctor':5}
    dict2={'spend<<saving':1,'spend<saving':2,'spend>saving':3,'spend>>saving':4}
    if(x in dict1.keys()):
        worksheet=workbook.sheet_by_index(0)
        return worksheet.cell(dict1[x],dict1[y]).value
    else:
        worksheet=workbook.sheet_by_index(2)
        return worksheet.cell(dict2[x],dict2[y]).value
print train_data.head()

      Type     LifeStyle  Vacation   eCredit    salary  property label
0  student  spend>saving  0.079365  0.107558  0.219960  0.183167    C1
1  student  spend>saving  0.158730  0.052326  0.293102  0.112797    C1
2  student  spend>saving  0.095238  0.177326  0.346023  0.174200    C1
3  student  spend>saving  0.031746  0.127907  0.309882  0.189984    C1
4  student  spend>saving  0.222222  0.020349  0.363663  0.127311    C1


Each row from the testing data set is compared against all rows from the training data and distances are calculated and put in the distances array. As k=3 we take the top3 values that have the greatest score compared to the test data row. In the end collective scores of each class is calculated and whichever turns out to be the maximum is the predicted class. This process is carried out for all the rows in testing data set and all the predicted labels are put into the predicted array.

In [6]:
labels=test_data['label']
features = test_data.drop('label', axis = 1)
predicted=[]
for i in features.index:
    features_vector=features.ix[i]
    distances=[]
    for j in train_data.index:
        train_data_vector=train_data.ix[j]

        type_value=1-sim(train_data_vector['Type'],features_vector['Type'])

        LifeStyle_value =1-sim(train_data_vector['LifeStyle'],features_vector['LifeStyle'])

        Vacation_normalized=(features_vector['Vacation']-min_Vacation)/float((max_Vacation-min_Vacation))
        Vacation_value=np.power(train_data_vector['Vacation']-Vacation_normalized,2)

        eCredit_normalized=(features_vector['eCredit']-min_eCredit)/float((max_eCredit-min_eCredit))
        eCredit_value=np.power(train_data_vector['eCredit']-eCredit_normalized,2)

        salary_normalized=(features_vector['salary']-min_salary)/float((max_salary-min_salary))
        salary_value=np.power(train_data_vector['salary']-salary_normalized,2)

        property_normalized=(features_vector['property']-min_property)/float((max_property-min_property))
        property_value=np.power(train_data_vector['property']-property_normalized,2)

        similarity=1/np.sqrt(type_value+LifeStyle_value+Vacation_value+eCredit_value+salary_value+property_value)
        distances.append((similarity,train_data_vector['label']))
    Top3=sorted(distances,key=lambda x: x[0])[-3:]
    C1=0
    C2=0
    C3=0
    C4=0
    C5=0
    predicted_label="None"
    for dist,clas in Top3:
        if(clas=='C1'):
            C1=C1+dist
        elif(clas=='C2'):
            C2=C2+dist
        elif(clas=='C3'):
            C3=C3+dist
        elif(clas=='C4'):
            C4=C4+dist
        else:
            C5=C5+dist
    if(C1>C2 and C1>C3 and C1>C4 and C1>C5):
        predicted_label="C1"
    elif(C2>C1 and C2>C3 and C2>C4 and C2>C5):
        predicted_label="C2"
    elif(C3>C1 and C3>C2 and C3>C4 and C3>C5):
        predicted_label="C3"
    elif(C4>C1 and C4>C2 and C4>C3 and C4>C5):
        predicted_label="C4"
    elif(C5>C1 and C5>C2 and C5>C3 and C5>C4):
        predicted_label="C5"
    predicted.append(predicted_label)
print predicted
        

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  """
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  


['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2', 'C3', 'C3', 'C3', 'C3', 'C5', 'C4', 'C1', 'C4', 'C4', 'C4', 'C5', 'C5', 'C5']


In [7]:
def accuracy_score(truth, pred):
    """ Returns accuracy score for input truth and predictions. """
    
    # Ensure that the number of predictions matches number of outcomes
    if len(truth) == len(pred): 
        
        # Calculate and return the accuracy as a percent
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)
    
    else:
        return "Number of predictions does not match number of outcomes!"
print accuracy_score(labels,predicted)

Predictions have an accuracy of 23.81%.


The predictions are 23.81% accurate. 

In [8]:
from sklearn.neighbors import KNeighborsClassifier
def myfunc(x,y):
    sc=0
    for f,b in zip(x,y):
        sc=sc+(f-b)**2
    return 1/np.sqrt(sc)



y_train=train_data['label']
x_train=train_data.drop('label',axis=1)
neigh = KNeighborsClassifier(n_neighbors = 5, weights='uniform', algorithm='auto', metric=myfunc)
x_train['Type']=x_train['Type'].map({'student':1,'engineer':2,'librarian':3,'professor':4,'doctor':5})
x_train['LifeStyle']=x_train['LifeStyle'].map({'spend<<saving':1,'spend<saving':2,'spend>saving':3,'spend>>saving':4})
ind=0
Y_train=[]
for x in y_train:
    if(x=="C1"):
        Y_train.append(1)
    elif(x=="C2"):
        Y_train.append(2)
    elif(x=="C3"):
        Y_train.append(3)
    elif(x=="C4"):
        Y_train.append(4)
    elif(x=="C5"):
        Y_train.append(5)
    ind=ind+1
print x_train.head()
print Y_train
neigh.fit(x_train, Y_train)

   Type  LifeStyle  Vacation   eCredit    salary  property
0     1          3  0.079365  0.107558  0.219960  0.183167
1     1          3  0.158730  0.052326  0.293102  0.112797
2     1          3  0.095238  0.177326  0.346023  0.174200
3     1          3  0.031746  0.127907  0.309882  0.189984
4     1          3  0.222222  0.020349  0.363663  0.127311
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]


KNeighborsClassifier(algorithm='auto', leaf_size=30,
           metric=<function myfunc at 0x7facd03e46e0>, metric_params=None,
           n_jobs=1, n_neighbors=5, p=2, weights='uniform')

In [9]:
x_test=features
for i in x_test.index:
    test_vector=x_test.ix[i]
    test_vector['Vacation']=(test_vector['Vacation']-min_Vacation)/float((max_Vacation-min_Vacation))

    test_vector['eCredit']=(test_vector['eCredit']-min_eCredit)/float((max_eCredit-min_eCredit))

    test_vector['salary']=(features_vector['salary']-min_salary)/float((max_salary-min_salary))

    test_vector['property']=(features_vector['property']-min_property)/float((max_property-min_property))
    x_test.ix[i]=test_vector
print x_test.head()
x_test['Type']=x_test['Type'].map({'student':1,'engineer':2,'librarian':3,'professor':4,'doctor':5})
x_test['LifeStyle']=x_test['LifeStyle'].map({'spend<<saving':1,'spend<saving':2,'spend>saving':3,'spend>>saving':4})
y_pred = neigh.predict(x_test)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.


        Type      LifeStyle  Vacation   eCredit    salary  property
0    student   spend<saving  0.174603  0.046512  0.571043  0.115417
1    student  spend>>saving  0.444444  0.020349  0.571043  0.115417
2    student  spend<<saving  0.428571  0.165698  0.571043  0.115417
3   engineer   spend>saving  0.222222  0.110465  0.571043  0.115417
4  librarian   spend<saving  0.015873  0.017442  0.571043  0.115417


In [10]:
print y_pred
index=0
lab=[]
for x in labels:
    if(x=="C1"):
        lab.append(1)
    elif(x=="C2"):
        lab.append(2)
    elif(x=="C3"):
        lab.append(3)
    elif(x=="C4"):
        lab.append(4)
    elif(x=="C5"):
        lab.append(5)
    index=index+1
print lab
accuracy_score(lab,y_pred)

[3 3 3 3 1 1 1 1 3 3 1 1 1 1 3 3 1 1 1 3 3]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


'Predictions have an accuracy of 52.38%.'

The predefined knn by scikit gives 52.38% accuracy when feeded with "non-string" inputs and using inverse euclidean formula to calculate distances.