## Pain assessment from biosignal data 

Knn classifer from scratch is used to classify the multi-class label pain (1-3). The leave-subject-out crossvalidaton method is appplied to evaludate the model with the use of c-index as accuracy metric.

In [48]:
#import necessary libraries
import numpy as np
import pandas as pd
import operator 
from numba import autojit #for fast compling


 === load dataset ====

In [37]:
# define funciton to load dataset
def load_data(data):
    df = pd.read_csv(data)
    return df


In [38]:
#return dataframe
df = load_data('paindata.csv')
print('Dimension of data:', df.shape)
df.head()

Dimension of data: (10300, 8)


Unnamed: 0,subject,test,label,label_time,feat1,feat2,feat3,feat4
0,1,1,1,1,1.2,5.962879,5.939946,423.315903
1,1,1,1,2,1.2,5.951589,5.950951,425.57834
2,1,1,1,3,1.162144,5.98767,5.950802,428.006194
3,1,1,1,4,1.157024,5.963077,6.012273,430.369485
4,1,1,1,5,1.170144,5.970446,5.974484,432.507971


### Normalize the features  with z-score
Since each subject has different physical measure ranges with other, the data normalization/standarization should be done based on the subject.

In [39]:
featcols = ['feat1', 'feat2', 'feat3', 'feat4']
zscore = lambda x: (x - x.mean()) / x.std()
newdf = df.copy() # make duplicate (keep the original dataframe as it is)

#apply z-score sparately on each subject 
# 'x' refers to subject(=1-31) 
for x in range(1,32):
    newdf.loc[newdf['subject'] == x , featcols] = newdf.loc[newdf['subject'] == x , featcols].transform(zscore)


print('Data-frame with normalized features:')
newdf.tail(10)


Data-frame with normalized features:


Unnamed: 0,subject,test,label,label_time,feat1,feat2,feat3,feat4
10290,31,4,3,8,-0.904752,-0.608699,-0.546377,0.642281
10291,31,4,3,9,-0.873096,-0.096238,-1.17252,0.665783
10292,31,4,3,10,-0.923797,-0.46053,-1.027002,0.692833
10293,31,4,3,11,-0.923797,0.552092,-0.05723,0.721897
10294,31,4,3,12,-0.918135,-0.423589,-0.257973,0.751439
10295,31,4,3,13,-0.901921,-0.513861,-1.605357,0.779928
10296,31,4,3,14,-0.91273,-0.412859,-0.91124,0.805828
10297,31,4,3,15,-1.054411,-1.304609,-0.706943,0.827784
10298,31,4,3,16,-1.150023,-0.361098,0.457559,0.847852
10299,31,4,3,17,-1.170226,-0.637737,0.103565,0.868996


In [40]:
# remove test and label from dataframe 
data = newdf.drop(columns=['test', 'label_time'])
print('Data fram after some columns dropped:')
print(data.head())

data_matrix = data.values #convet to numpy array
data_matrix


Data fram after some columns dropped:
   subject  label     feat1     feat2     feat3     feat4
0        1      1  1.524262  0.051625 -0.549833 -1.246646
1        1      1  1.524262 -0.443984  0.002227 -1.229926
2        1      1  0.984724  1.139869 -0.005266 -1.211983
3        1      1  0.911751  0.060341  3.078269 -1.194518
4        1      1  1.098743  0.383784  1.182690 -1.178714


array([[  1.00000000e+00,   1.00000000e+00,   1.52426212e+00,
          5.16253124e-02,  -5.49833102e-01,  -1.24664567e+00],
       [  1.00000000e+00,   1.00000000e+00,   1.52426212e+00,
         -4.43984423e-01,   2.22669759e-03,  -1.22992566e+00],
       [  1.00000000e+00,   1.00000000e+00,   9.84723653e-01,
          1.13986901e+00,  -5.26649673e-03,  -1.21198318e+00],
       ..., 
       [  3.10000000e+01,   3.00000000e+00,  -1.05441094e+00,
         -1.30460878e+00,  -7.06942842e-01,   8.27783740e-01],
       [  3.10000000e+01,   3.00000000e+00,  -1.15002289e+00,
         -3.61097886e-01,   4.57559378e-01,   8.47852130e-01],
       [  3.10000000e+01,   3.00000000e+00,  -1.17022623e+00,
         -6.37736899e-01,   1.03565268e-01,   8.68996175e-01]])

#### ===   Knn classifer Implementation from scratch  ===
The three functions are used for knn classifer: eculidan_dist(), get_neighbours() and get_response(). The eculidan_dist() computes the ecuclidan distance between two data instances, get_neighbours() function sorts the distance and finds the k-neighbours (in our case 37 neighbours) and  get_response() function finds label/class 
for the testinstace belongs to.

In [41]:
#calculates eculidan distance between two data instances
@autojit
def eculidan_dist(data1, data2):
    
    distance = 0
    length = len(data2)
        
    # loop through all  input variables (except target variables)
    for x in range(3, length):
        distance += np.power((data1[x]- data2[x]),2)
    return np.sqrt(distance)

In [42]:
# finds k nearest  neighbors of  test instance 
@autojit
def get_neighbours(trainingSet, testInstance, k):
        ''' trainingSet =  train data
            testInstance = single test data
            k = k parameter for KNN
        '''
        
        distance = []
        
        # find distance  between  a test instance to all training instances
        for x in range(len(trainingSet)):
            dist = eculidan_dist(trainingSet[x], testInstance) # call eculidan_dist function
            distance.append((trainingSet[x], dist)) 
        
        # sort the distance in ascending order
        distance.sort(key=operator.itemgetter(1))
        
    
        #store the  k neighbours
        neighbours = []
        for x in  range(k):
            # takes only neighbours from distance and append 
            neighbours.append(distance[x][0]) 
        return neighbours

In [43]:
# finds class that test instance belongs to. This is done by finding the  the most frequent class in the k neigbours
@autojit
def get_response(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][1] # take second attribute(target variable) 
        if response in classVotes:
            classVotes[response] +=1
            
        else:
            classVotes[response] = 1
    
    # sorts classes with higest values
    sortedVotes = sorted(classVotes.items(),key=operator.itemgetter(1), reverse=True) #sorting based on max count
    return  sortedVotes[0][0] # take first neighbour

=== C-index  implementation ====

In [44]:
#implementation  of C_index
@autojit
def c_index (true_labels, predictions):
    n = 0
    h_sum =0
    
    for i in range(len(true_labels)):
        t = true_labels[i]
        p = predictions[i]
        
        
        for j in range(i+1, len(true_labels)):
            nt = true_labels[j]
            np = predictions[j]
            
            if( t!=nt):
                n += 1
                
                if(p < np and t <  nt) or (p > np and t > nt):
                    h_sum += 1
                
                elif(p == np):
                    h_sum += 0.5
                
    c_index = h_sum/n
    return c_index
    
    

In [45]:
# testing c-index implementation
true_labels = [-1, 1, 1, -1, 1]
predictions = [0.60, 0.80, 0.75, 0.75, 0.70]
c_index(true_labels, predictions)

0.75

#### ====  Cross validation (leave-subject-out)   ======

Leave-subject-out CV is applied to maintain the independency of data and to avoid from being bias result. There are 31 subjects in the dataset, so data is divided in 31 folds/subjects (one subject as test and remaining data as train set) and calculate the performance (using c-index) in each subjects and  get the average c-index from all these folds(subjects). The k-parameter for KNN is used as 37 which is mentioned in lecture slide.

In [46]:
#cross validation (leave_subject_out)
@autojit

def cross_validation(data):
    '''  
    This function separetes the data into testset (each subject) and remainig into trainset. Prints the subject with
    higest and lowest c-index, and the average c-index of all subject. And, returns all c-index with subject in 
    pandas dataframe form.
    '''
    
   
    predic_subject = {} # store c-index for each subject
    
    print('Program running ....... \n \n')
    
    
    '''==split data into test set(subject wise) and train set(remaining data)=='''
    #loop 31 times since 31 subjects in dataset
    for subject in range(1,32):
        
        #check incase subject no. put larger than in the dataset(1-31)
        if subject >=32:
            print('Please, put subject no. <= 31')
            break
            
        test = [] # store test instance
        train = [] # store train instance
        
        true_values = [] # store true labels of testdata
        predicted_values = [] # store predicted labels testdata
        
        #iterates over total data samples(seperates test and train set)
        for x in range(len(data)):
            
            #check subject with identity(1-31)
            if data[x][0] == subject:
                test.append(data[x]) # append for test
            else:
                train.append(data[x]) # append for train
         
        
        '''=== knn classifer start ==='''
        test = np.array(test) #convert into numpy before applying
        train =np.array(train)#convert into numpy before applying
        
        #each instance in test dataset (since get_neighbours funciton take test instance, rather whole testset)
        for instance  in test:
            k_neighbours = get_neighbours(train, instance, 37) # call get_neighbours (gives k(37)-neighbours)
            predict_class = get_response(k_neighbours) # call get_response(gives preidcted class)
            true_values.append(predict_class)
            predicted_values.append(instance[1])
           
           
        
        
        # calculates c_index for each testset(subject)
        c_val = c_index(true_values, predicted_values) # call c-index function
        #print(c_val)
        predic_subject[subject] = c_val # assign c-index to each subject
    

    
    
    ''' == print average c_index,  lowest and higest c_index with subject_id  =='''
   
    average_cindex = sum(predic_subject.values())/len(predic_subject) # average c-index from all testset(subjects)
    print('Average c-index is:', round(average_cindex, 2)) #average c-index
    
    sorted_cindex = sorted(predic_subject.items(),key=operator.itemgetter(1)) # sort with dict values(ascending order)
    #print(sorted_cindex)
    print('Subject id {0} has higest c_index: {1:.2f}'.format(sorted_cindex[-1][0], sorted_cindex[-1][1])) #higest c-index
    print('Subject id {0} has lowest c_index: {1:.2f} \n'.format(sorted_cindex[0][0],sorted_cindex[0][1])) #lowest c-index
    
    
    # each subject c-index in dataframe
    c_index_of_subject = pd.DataFrame(list(predic_subject.items()), columns=['Subject_Id', 'C-index'])
    
    print('Subject with c-index in dataframe:') 
    return c_index_of_subject
            
          
    
    
    

In [47]:
#run cross validaton function
cross_validation(data_matrix)


Program running ....... 
 

Average c-index is: 0.67
Subject id 28 has higest c_index: 0.84
Subject id 3 has lowest c_index: 0.48 

Subject with c-index in dataframe:


Unnamed: 0,Subject_Id,C-index
0,1,0.823539
1,2,0.670163
2,3,0.482408
3,4,0.68277
4,5,0.639182
5,6,0.51887
6,7,0.661254
7,8,0.760342
8,9,0.559236
9,10,0.630636


### Observation:
After finding the results from Cross-validation, it was found that subject id 28 has the higest c-index of 0.84
while the lowest one is 0.48 from subject id of 3. The  average c_index obtained from this subject-one-out CV is 0.67 which pretry low. From the outputted dataframe, it was observerd less no. of subjects(=4) which able to make c-index more than .80. This shows that model can't predict  very well most of the subjects.
    