### Prediction of the metal ion content from multi-parameter data
The Water_data.csv dataset (found in the dataset folder) is used. It is a multi-parameter dataset of 201 samples obtained from 67 mixtures of cadmium, lead, and tap water. Three features (attributes) were measured for each sample (Mod1, Mod2, Mod3). K-nearest neighbor (KNN) regression is used to predict the total metal concentration (c_total), the cadmium concentration (Cd), and the lead concentration (Pb) of each sample, with the number of neighbors k = 1, 2, 3, 4 and 5.

In [150]:
#import necessary libraries
import pandas as pd
import numpy as np
import operator


In [151]:
# load dataset and normalize data with z-score
from scipy.stats import zscore
df = pd.read_csv('Water_data.csv')

# apply z-score to every column
df[df.columns] = df[df.columns].apply(zscore) 
df.tail()
    

Unnamed: 0,c_total,Cd,Pb,Mod1,Mod2,Mod3
196,2.918376,3.941121,0.654578,-0.84399,-0.534095,-1.420835
197,2.918376,3.941121,0.654578,-0.845365,-0.555321,-1.422472
198,2.918376,5.036636,-0.440936,-0.841559,-0.289483,0.567578
199,2.918376,5.036636,-0.440936,-0.841333,-0.29466,0.841408
200,2.918376,5.036636,-0.440936,-0.834399,-0.31416,0.777797
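
As a rough illustration of what the z-score step does to each column, the sketch below standardizes a small made-up array by hand. The function name `zscore_columns` and the toy values are assumptions for illustration only; the population standard deviation (`ddof=0`) is used because that is the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_columns(values):
    """Standardize each column to zero mean and unit population standard deviation."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # ddof=0, matching the default of scipy.stats.zscore
    return (values - mean) / std

# toy data (hypothetical, not taken from Water_data.csv)
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])
standardized = zscore_columns(toy)
print(standardized)
```

After this step every column has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to the Euclidean distance used by KNN.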


In [152]:
#convert pandas dataframe into numpy array
data = df.values
data

array([[-0.56000976, -0.44093638, -0.44093638, -0.97172094, -0.66981829,
        -0.3350763 ],
       [-0.56000976, -0.44093638, -0.44093638, -0.96309341, -0.66999086,
        -0.16082967],
       [-0.56000976, -0.44093638, -0.44093638, -0.96282669, -0.66973201,
         0.09487507],
       ...,
       [ 2.9183762 ,  5.03663569, -0.44093638, -0.8415591 , -0.28948263,
         0.56757761],
       [ 2.9183762 ,  5.03663569, -0.44093638, -0.84133341, -0.29465961,
         0.84140779],
       [ 2.9183762 ,  5.03663569, -0.44093638, -0.83439856, -0.31415958,
         0.77779679]])

==== Knn Regression  implementation =====

In [153]:
# calculates the Euclidean distance between two data instances
def eculidan_dist(data1, data2):

    distance = 0
    length = len(data2)

    # loop over the input features only (columns 3 onward);
    # columns 0-2 hold the targets c_total, Cd and Pb
    for x in range(3, length):
        distance += np.power((data1[x] - data2[x]), 2)
    return np.sqrt(distance)
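
The loop above can also be written without an explicit loop. The following is a minimal sketch, assuming the same column layout (three targets first, then the features); the name `euclidean_dist_vectorized` and the parameter `first_feature` are hypothetical and not part of the notebook.

```python
import numpy as np

def euclidean_dist_vectorized(row_a, row_b, first_feature=3):
    """Euclidean distance over the feature columns only (targets are skipped)."""
    a = np.asarray(row_a, dtype=float)[first_feature:]
    b = np.asarray(row_b, dtype=float)[first_feature:]
    return float(np.linalg.norm(a - b))

# toy rows: three target values followed by three features
row_a = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0]
row_b = [9.0, 9.0, 9.0, 4.0, 6.0, 3.0]
d = euclidean_dist_vectorized(row_a, row_b)  # feature differences are (3, 4, 0)
print(d)
```

The target columns differ strongly between the two toy rows, yet they do not affect the distance, which is the behaviour the loop version relies on as well.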
    

In [154]:
# finds the k nearest neighbours of a test instance

def get_neighbours(trainingSet, testInstance, k):
    distance = []

    # compute the distance from the test instance to every training instance
    for x in range(len(trainingSet)):
        dist = eculidan_dist(trainingSet[x], testInstance)
        distance.append((trainingSet[x], dist))

    # sort by distance in ascending order
    distance.sort(key=operator.itemgetter(1))

    # keep only the rows (not the distances) of the k closest instances
    neighbours = []
    for x in range(k):
        neighbours.append(distance[x][0])
    return neighbours
    
    

In [155]:
# averages the target variables of the k neighbours
def get_predic_val(neighbors):
    total_neig = len(neighbors)
    predict_array = np.array([0.0, 0.0, 0.0])  # running sums for c_total, Cd and Pb

    # iterate over the k neighbours
    for x in range(total_neig):
        predict_array += neighbors[x][0:3]  # add c_total, Cd and Pb

    return predict_array / total_neig  # take the average
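
The neighbour search and the averaging step can be combined into one vectorized function. This is only a sketch under the same column layout (targets in the first three columns, features after them); `knn_predict` and `n_targets` are hypothetical names, and the toy rows are made up.

```python
import numpy as np

def knn_predict(train, test_row, k, n_targets=3):
    """Average the targets of the k training rows closest to test_row."""
    train = np.asarray(train, dtype=float)
    test_row = np.asarray(test_row, dtype=float)
    # distances are computed on the feature columns only
    dists = np.linalg.norm(train[:, n_targets:] - test_row[n_targets:], axis=1)
    # a stable sort keeps the original row order for tied distances
    nearest = np.argsort(dists, kind="stable")[:k]
    return train[nearest, :n_targets].mean(axis=0)

# toy training rows: [c_total, Cd, Pb, Mod1, Mod2, Mod3]
train = [[1.0, 10.0, 100.0, 0.0, 0.0, 0.0],
         [3.0, 30.0, 300.0, 1.0, 0.0, 0.0],
         [5.0, 50.0, 500.0, 10.0, 10.0, 10.0]]
test_row = [0.0, 0.0, 0.0, 0.4, 0.0, 0.0]  # the target entries are ignored
pred = knn_predict(train, test_row, k=2)
print(pred)
```

With k = 2 the first two toy rows are the nearest, so the prediction is the mean of their targets.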
    
    

 ==== cross validation =====

In [156]:
''' This function divides the data into test and training sets based on the given 'leave' number. For example, if 'leave'
    is 1, the test set holds one data instance and the rest form the training set; this repeats until every instance
    has been in the test set once. Thus, n samples give n test sets and n training sets.
'''

# cross validation with the leave-n-out method
def cross_validation(dataset, leav_out, k):

    '''dataset = data used for CV,
       leav_out = no. of instances in each test set,
       k = no. of neighbours for KNN'''

    predictions = []  # stores the predictions

    i = 0
    while i < len(dataset):  # stop once every sample has been tested

        # separate test and training data
        test = dataset[i: i + leav_out]  # test set
        train = np.concatenate([dataset[0:i], dataset[i + leav_out:]])  # training set

        # predict each instance in the test set
        for x in range(len(test)):
            neighbor = get_neighbours(train, test[x], k)
            predic_val = get_predic_val(neighbor)  # predicted values
            predictions.append(predic_val)

        i = i + leav_out  # move to the next block

    return np.array(predictions)
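
To make the splitting scheme concrete, the sketch below lists which row indices end up in the training and test sets for each fold. The helper `leave_out_splits` is a hypothetical name used only for illustration; it mirrors the consecutive-block slicing of the function above.

```python
def leave_out_splits(n_samples, leave_out):
    """Return (train_indices, test_indices) for consecutive blocks of size leave_out."""
    splits = []
    for start in range(0, n_samples, leave_out):
        stop = min(start + leave_out, n_samples)
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        splits.append((train_idx, test_idx))
    return splits

# six samples, three left out per fold
splits = leave_out_splits(6, 3)
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

For the 201 samples used here, leaving out one sample gives 201 folds and leaving out three gives 67 folds. The latter matches the 67 mixtures only if the replicas of a mixture sit in consecutive rows, which is assumed in this notebook.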
        

    
   

In [157]:
# predictions from the from-scratch KNN (leave-one-out, k = 3)
prediction_from_scratch = cross_validation(data,1,3)
prediction_from_scratch

array([[-0.5405308 , -0.43582398, -0.41537437],
       [-0.53357402, -0.43889142, -0.40135179],
       [-0.54887892, -0.43450936, -0.42983516],
       [-0.54609621, -0.42925089, -0.43071157],
       [-0.54748757, -0.44093638, -0.42121712],
       [-0.54609621, -0.43655432, -0.42340815],
       [-0.55351677, -0.43786894, -0.43377901],
       [-0.55212541, -0.43553184, -0.43392508],
       [-0.54400918, -0.43392508, -0.42275084],
       [-0.53682052, -0.44093638, -0.40441923],
       [-0.55351677, -0.43889142, -0.43275653],
       [-0.53913944, -0.43838018, -0.41062714],
       [-0.53913944, -0.42231263, -0.42669469],
       [-0.53218267, -0.43838018, -0.399672  ],
       [-0.53913944, -0.42085194, -0.42815537],
       [-0.53913944, -0.42596435, -0.42304297],
       [-0.53102321, -0.438015  , -0.39821131],
       [-0.54261783, -0.425234  , -0.42925089],
       [-0.49739881, -0.44093638, -0.34234008],
       [-0.52429833, -0.43071157, -0.39492477],
       [-0.5405308 , -0.43071157, -0.420

===== Evaluation of Model =========

In [158]:
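# implementation of the c-index (concordance index)
def c_index(true_labels, predictions):
    n = 0
    h_sum = 0

    for i in range(len(true_labels)):
        t = true_labels[i]
        p = predictions[i]

        for j in range(i + 1, len(true_labels)):
            nt = true_labels[j]
            n_p = predictions[j]  # named n_p so that numpy's alias np is not shadowed

            # only pairs with different true values are comparable
            if t != nt:
                n += 1

                if (p < n_p and t < nt) or (p > n_p and t > nt):
                    h_sum += 1  # pair ranked in the correct order

                elif p == n_p:
                    h_sum += 0.5  # tied predictions count as half

    return h_sum / n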
    
    
    


In [159]:
# testing c-index implementation
true_labels = [-1, 1, 1, -1, 1]
predictions = [0.60, 0.80, 0.75, 0.75, 0.70]
c_index(true_labels, predictions)

0.75
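
The value 0.75 can be traced pair by pair. The sketch below uses a hypothetical helper, `c_index_pairs`, that returns the score of every comparable pair for the same test data, so the arithmetic behind the result is visible.

```python
from itertools import combinations

def c_index_pairs(true_labels, predictions):
    """Score each pair with different true labels: 1 concordant, 0.5 tied, 0 discordant."""
    scores = []
    for i, j in combinations(range(len(true_labels)), 2):
        dt = true_labels[i] - true_labels[j]
        if dt == 0:
            continue  # pairs with equal true labels are not comparable
        dp = predictions[i] - predictions[j]
        if dp == 0:
            scores.append((i, j, 0.5))
        elif (dp > 0) == (dt > 0):
            scores.append((i, j, 1.0))
        else:
            scores.append((i, j, 0.0))
    return scores

pairs = c_index_pairs([-1, 1, 1, -1, 1], [0.60, 0.80, 0.75, 0.75, 0.70])
value = sum(score for _, _, score in pairs) / len(pairs)
for i, j, score in pairs:
    print("pair", (i, j), "score", score)
print("c-index:", value)
```

Six pairs are comparable: four are concordant, the pair (2, 3) has tied predictions and counts 0.5, and the pair (3, 4) is discordant, giving (4 + 0.5) / 6 = 0.75.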

                  ==== Compare KNN from scratch with scikit-learn =====
We now check whether the KNN written from scratch behaves the same as scikit-learn's KNN regression. To do so, the c-index is computed for the predictions of both implementations, using leave-one-out cross-validation in each case. This comparison only serves as a test of the from-scratch implementation.

In [160]:
'''==== find the c-index using knn from scratch ===='''

# find the c-index of each target: 0 = c_total, 1 = Cd, 2 = Pb
for x in range(len(prediction_from_scratch.T)):  # transpose the result array and loop through its rows
    true_labels = (df.values).T[x]  # true values
    predictions = prediction_from_scratch.T[x]  # predicted values
    c_val = c_index(true_labels, predictions)
    print('C-index  of',x,  'is {0:.2f}'.format(c_val))

C-index  of 0 is 0.90
C-index  of 1 is 0.88
C-index  of 2 is 0.85


 === finding c-index from sklearn knn ====

In [161]:
# import necessary libraries
from  sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor

# define the function
def cross_validation_sklearn(dataset, leav_out, k):
    sklearn_prediction = []  # stores the predictions

    i = 0
    while i < len(dataset):  # loop until every sample has been tested

        # separate test and training data
        test = dataset[i: i + leav_out]  # test set
        y_test, x_test = test[:, 0:3], test[:, 3:6]  # targets (y_test) and input features (x_test)

        train = np.concatenate([dataset[0:i], dataset[i + leav_out:]])  # training set
        y_train, x_train = train[:, 0:3], train[:, 3:6]  # targets (y_train) and input features (x_train)

        # initialize KNN regression
        knn = KNeighborsRegressor(n_neighbors=k, metric='euclidean')
        multi_regr = MultiOutputRegressor(knn)  # one regressor per target (c_total, Cd, Pb)
        multi_regr.fit(x_train, y_train)  # train on the training set
        respon_val = multi_regr.predict(x_test)  # predict the test set
        sklearn_prediction.append(respon_val)

        i = i + leav_out  # move to the next block
    result = [instance for item in sklearn_prediction for instance in item]  # flatten the per-fold arrays into a list of rows
    return np.array(result)
        






In [162]:
# prediction from sklearn knn regression
prediction_from_sklearn = cross_validation_sklearn( data, 1, 3)
prediction_from_sklearn.T

array([[-0.5405308 , -0.53357402, -0.54887892, -0.54609621, -0.54748757,
        -0.54609621, -0.55351677, -0.55212541, -0.54400918, -0.53682052,
        -0.55351677, -0.53913944, -0.53913944, -0.53218267, -0.53913944,
        -0.53913944, -0.53102321, -0.54261783, -0.49739881, -0.52429833,
        -0.5405308 , -0.54887892, -0.54887892, -0.54748757,  2.9183762 ,
        -0.54748757, -0.54261783, -0.53241456, -0.54261783, -0.54261783,
        -0.55351677, -0.54887892, -0.54748757, -0.53913944, -0.54748757,
        -0.5405308 , -0.54748757, -0.5405308 , -0.53566105, -0.54887892,
        -0.53705241, -0.54261783, -0.54400918, -0.54400918, -0.54609621,
        -0.54261783, -0.53913944, -0.54400918, -0.54400918, -0.54540054,
        -0.53913944, -0.5159502 , -0.52406643, -0.52870428, -0.51131235,
        -0.53218267, -0.53566105, -0.52893618, -0.52893618, -0.5405308 ,
        -0.49044204,  1.77050884, -0.53241456, -0.52429833,  1.77050884,
         1.77050884, -0.52870428, -0.5405308 , -0.5

In [163]:
''' === c-index from sklearn prediction ==== '''

# find the c-index of each target: 0 = c_total, 1 = Cd, 2 = Pb
for x in range(len(prediction_from_sklearn.T)):  # transpose the result array and loop through its rows
    true_label = (df.values).T[x]  # true values
    prediction = prediction_from_sklearn.T[x]  # predicted values
    c_val = c_index(true_label, prediction)
    print('C-index  of',x,  'is {0:.2f}'.format(c_val))


C-index  of 0 is 0.90
C-index  of 1 is 0.88
C-index  of 2 is 0.85


Note: The c-indices from both implementations (KNN from scratch and sklearn) are the same, which indicates that our KNN implementation works correctly.
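
Because the c-index depends only on the ranking of the predictions, equal c-indices are a necessary but not a sufficient sign that two implementations agree. A stricter check compares the predicted values element by element. The sketch below shows the idea on made-up arrays; `predictions_match` is a hypothetical helper, and in the notebook it could be called with the two prediction arrays computed above.

```python
import numpy as np

def predictions_match(pred_a, pred_b, atol=1e-8):
    """True if both prediction arrays have the same shape and nearly equal values."""
    pred_a = np.asarray(pred_a, dtype=float)
    pred_b = np.asarray(pred_b, dtype=float)
    return pred_a.shape == pred_b.shape and bool(np.allclose(pred_a, pred_b, atol=atol))

# toy arrays (hypothetical values)
a = np.array([[0.1, 0.2],
              [0.3, 0.4]])
b = a * 2.0  # same ranking as a, hence the same c-index, but different values

print(predictions_match(a, a.copy()))  # identical values
print(predictions_match(a, b))         # same ranking only
```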

Now find the c-index for two CV schemes, 1) leave-one-out and 2) leave-three-out, with k for KNN ranging from 1 to 5.

In [164]:
'''======== c-index at leave-one-out ======'''

# cross validation (leave-one-out)
data = np.array(df)

# k from 1 to 5
for k in range(1, 6):
    leave_one_out_result = cross_validation(data, 1, k)

    print('\nWhen k is ', k)
    # find the c-index of each target: 0 = c_total, 1 = Cd, 2 = Pb
    for x in range(len(leave_one_out_result.T)):  # transpose the result array and loop through its rows
        true_labels = (df.values).T[x]  # true values
        predictions = leave_one_out_result.T[x]  # predicted values
        c_val = c_index(true_labels, predictions)
        print('C-index  of',x,  'is {0:.2f}'.format(c_val))

   
   


When k is  1
C-index  of 0 is 0.90
C-index  of 1 is 0.90
C-index  of 2 is 0.87

When k is  2
C-index  of 0 is 0.91
C-index  of 1 is 0.90
C-index  of 2 is 0.87

When k is  3
C-index  of 0 is 0.90
C-index  of 1 is 0.88
C-index  of 2 is 0.85

When k is  4
C-index  of 0 is 0.89
C-index  of 1 is 0.85
C-index  of 2 is 0.85

When k is  5
C-index  of 0 is 0.88
C-index  of 1 is 0.83
C-index  of 2 is 0.83


=======  C-index at leave-3-out =======

In [165]:
# cross validation (leave-3-out)
data = np.array(df)

# k from 1 to 5
for k in range(1,6):
    leave_3_out_result = cross_validation(data, 3, k)

    print('\nWhen k is ', k)
    # find the c-index of each target: 0 = c_total, 1 = Cd, 2 = Pb
    for x in range(len(leave_3_out_result.T)):
        true_labels = (df.values).T[x]  # true values
        predictions = leave_3_out_result.T[x]  # predicted values
        c_val = c_index(true_labels, predictions)
        print('C-index of ', x,'is {0:.2f}'.format(c_val))



When k is  1
C-index of  0 is 0.82
C-index of  1 is 0.74
C-index of  2 is 0.74

When k is  2
C-index of  0 is 0.82
C-index of  1 is 0.75
C-index of  2 is 0.75

When k is  3
C-index of  0 is 0.82
C-index of  1 is 0.74
C-index of  2 is 0.75

When k is  4
C-index of  0 is 0.82
C-index of  1 is 0.72
C-index of  2 is 0.76

When k is  5
C-index of  0 is 0.82
C-index of  1 is 0.72
C-index of  2 is 0.76


#### Report:
Comparing the results of the two cross-validations shows that leave-one-out CV yields a higher c-index than leave-three-out CV for every target and every k. This is because replicas of the test sample remain in the training set during leave-one-out CV. Information therefore leaks into the training set, and the regressor predicts very well. Even though the c-index from leave-three-out CV is lower, it is the more realistic estimate: in this dataset, leave-three-out CV preserves the independent and identically distributed (IID) assumption by holding out all three replicas of a mixture as the test set and training on the rest.
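
The leakage effect described in the report can be reproduced on a small synthetic example. The sketch below is an illustration only: the data are made up (four hypothetical mixtures with three near-identical replicas each and one feature), the helper `one_nn_cv_error` is a hypothetical name, and a 1-nearest-neighbour predictor with mean absolute error stands in for the KNN regression and c-index used above.

```python
import numpy as np

def one_nn_cv_error(features, targets, leave_out):
    """Mean absolute error of 1-NN under leave-n-out CV with consecutive blocks."""
    n = len(features)
    errors = []
    for start in range(0, n, leave_out):
        test_idx = np.arange(start, min(start + leave_out, n))
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False  # hold out the current block
        train_x, train_y = features[train_mask], targets[train_mask]
        for i in test_idx:
            nearest = np.argmin(np.abs(train_x - features[i]))
            errors.append(abs(train_y[nearest] - targets[i]))
    return float(np.mean(errors))

# synthetic data: 4 mixtures x 3 replicas, replicas stored in consecutive rows
bases = np.array([0.0, 1.0, 2.0, 3.0])
offsets = np.array([-0.01, 0.0, 0.01])
features = (bases[:, None] + offsets[None, :]).ravel()
targets = np.repeat(np.array([0.0, 10.0, 0.0, 10.0]), 3)

loo_error = one_nn_cv_error(features, targets, leave_out=1)
l3o_error = one_nn_cv_error(features, targets, leave_out=3)
print("leave-one-out error:", loo_error)
print("leave-three-out error:", l3o_error)
```

In leave-one-out the nearest neighbour is always a replica of the test sample, so the error is zero. Once all three replicas are held out together, the nearest neighbour comes from a different mixture and the error reflects how well the model generalizes to unseen mixtures.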

