# TKO2096 
## EXERCISE III | Pain assessment from biosignal data
YUE MA 520790

## Libraries list

In [1]:
import numpy as np
import pandas as pd
import pdb
from scipy.stats import zscore
import operator
import random

Main libraries used:
- `sklearn.neighbors.KNeighborsClassifier`
- `numpy`
- `pandas`
- `scipy`

## Functions & Classes
- `normalization_subjects()`: normalizes the data at subject level (using `zscore()` from `scipy`)
- `C_index()`: calculates the c-index of two given sequences
- `Lso_CV` class: implements the critical parts of leave-subject-out CV
    - `split_data()` method: splits the data set at subject level
    - `KNN_predict()` method: uses `sklearn`'s KNN classifier to calculate predictions and performance for each subject

## Load_data

In [2]:
pain_data=pd.read_csv("/Users/mayue/Desktop/TKO-2096/Exe/paindata.csv")

## Normalization 

In [3]:
def normalization_subjects(dataset,column_name):
    """column_name is the column that holds the subject identifier"""
    names=dataset[column_name].unique()
    normalized_data=dataset.copy()  # work on a copy so the raw data stay untouched
    feature_columns=dataset.columns[4:]  # the first four columns are not features
    for i in names:
        if_subject=(dataset[column_name]==i)
        # z-score the features within this subject; the targets are not normalized
        normalized_data.loc[if_subject,feature_columns]=zscore(dataset.loc[if_subject,feature_columns])
    return normalized_data

In [4]:
normalized_pain_data=normalization_subjects(pain_data,'subject')

I used the `zscore()` function to normalize the features within each subject.

It is necessary to make sure that only the feature values have been changed; the targets and the other variables must stay as they were.
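The per-subject normalization and the check that only the feature values change can be illustrated on toy data. This is a minimal pure-Python sketch rather than the notebook's code: `zscore_by_subject` and the toy values are hypothetical, and it uses the population standard deviation, which corresponds to the default `ddof=0` of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def zscore_by_subject(subjects, values):
    """Z-score `values` separately within each subject (population std, ddof=0).

    Assumes every subject has at least two distinct values; a constant
    feature would give a zero standard deviation and a division by zero.
    """
    normalized = [0.0] * len(values)
    for s in set(subjects):
        idx = [i for i, name in enumerate(subjects) if name == s]
        group = [values[i] for i in idx]
        mu, sigma = mean(group), pstdev(group)
        for i in idx:
            normalized[i] = (values[i] - mu) / sigma
    return normalized

# Toy data: two subjects whose raw feature values sit on very different scales.
subjects = [1, 1, 1, 2, 2, 2]
feature = [1.0, 2.0, 3.0, 100.0, 200.0, 300.0]
labels = [0, 1, 2, 0, 1, 2]

normalized = zscore_by_subject(subjects, feature)
labels_after = list(labels)  # targets are passed through untouched

print(normalized)
print(labels_after == labels)
```

After normalization both toy subjects end up on the same scale, with mean 0 and standard deviation 1 within each subject, while the labels are unchanged.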

In [5]:
"""split the data into features and targets"""
X=normalized_pain_data.iloc[:,4:]
y=normalized_pain_data.iloc[:,2]
additional_variables=normalized_pain_data.iloc[:,[0,1,3]]


## Define the performance function

In [6]:
def C_index(pred,true_labels):
    """pred and true_labels are sequences of the same length"""
    n=0
    h_sum=0.0
    for i in range(0,len(true_labels)):
        t=true_labels[i]
        p=pred[i]
        for j in range(i+1,len(true_labels)):
            nt=true_labels[j]
            next_p=pred[j]  # not named np, which would shadow numpy
            if t!=nt:  # only pairs with different true labels are counted
                n=n+1
                if (p<next_p and t<nt) or (p>next_p and t>nt):
                    h_sum+=1.0  # concordant pair
                elif p==next_p:
                    h_sum+=0.5  # tied predictions count as half
    return float(h_sum/n)

This part is taken directly from Exercise II, where it worked well.
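As a sanity check of the definition, the c-index can be evaluated on three small cases. The sketch below is a compact restatement under a hypothetical name, `c_index`, not the notebook's function: a perfectly ordered prediction should give 1.0, a fully reversed one 0.0, and a constant prediction 0.5.

```python
def c_index(pred, true_labels):
    """Fraction of pairs with different true labels that the predictions order correctly."""
    pairs = 0
    score = 0.0
    for i in range(len(true_labels)):
        for j in range(i + 1, len(true_labels)):
            if true_labels[i] == true_labels[j]:
                continue  # pairs with equal true labels carry no ordering information
            pairs += 1
            true_diff = true_labels[i] - true_labels[j]
            pred_diff = pred[i] - pred[j]
            if pred_diff == 0:
                score += 0.5  # tied predictions count as half
            elif (pred_diff > 0) == (true_diff > 0):
                score += 1.0  # concordant pair
    return score / pairs

true_labels = [0, 1, 2, 3]
perfect = c_index([0, 1, 2, 3], true_labels)    # every pair ordered correctly
reversed_ = c_index([3, 2, 1, 0], true_labels)  # every pair ordered wrongly
constant = c_index([1, 1, 1, 1], true_labels)   # all predictions tied

print(perfect, reversed_, constant)
```

The constant case shows why 0.5 is the reference level for an uninformative predictor.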

In [7]:
def classfication_accuracy(pred,true_labels):
    """pred and true_labels are sequences of the same length"""
    n_correct=0.0
    m=float(len(true_labels))  # number of instances
    for i in range(0,len(true_labels)):
        n_correct+=int(pred[i]==true_labels[i])
    return float(n_correct/m)

For convenience, I also wrote a function that calculates the traditional accuracy between predictions and true labels, intended for use in the CV.

## Implement leave-subject-out cross-validation with KNN (k=37)
Because the data set is fairly large for this case, I chose the off-the-shelf sklearn KNN for this part.

In [8]:
from sklearn.neighbors import KNeighborsClassifier

In [14]:
class Lso_CV():

    def load_data(self,attributes,targets,variables):
        """load the features, targets and additional variables before validating"""
        self.attributes=attributes
        self.targets=targets
        self.variables=variables

    def split_data(self,subject_name):
        """subject_name is the name of the subject column;
        stores one boolean mask per subject"""
        names=self.variables[subject_name].unique()
        self.n_subjects=len(names)  # how many subjects there are
        subject_index={}

        for i in names:
            subject_index[i]=self.variables[subject_name]==i  # boolean mask of this subject's rows

        self.subjects_index=subject_index
        self.subjects_names=names

    def KNN_predict(self,parameter_k):
        """validate on each subject in turn using sklearn"""
        all_results=[]
        c_indexes={}
        for i in self.subjects_names:
            subject_validate_index=self.subjects_index[i]  # boolean mask of the held-out subject
            # training set: every instance except those of the held-out subject
            # (~ inverts a boolean mask; unary minus is not supported on boolean Series)
            X_train=self.attributes[~subject_validate_index]
            y_train=self.targets[~subject_validate_index]
            # validation set: the instances of the held-out subject
            X_validate=self.attributes[subject_validate_index]
            y_validate=self.targets[subject_validate_index]
            knn=KNeighborsClassifier(n_neighbors=parameter_k)
            knn.fit(X_train,y_train)
            pred=knn.predict(X_validate)
            # save all the predicted results
            all_results+=list(pred)
            # calculate the c-index for this subject and save it in a dict
            c_index=C_index(pred,y_validate.values)
            c_indexes[i]=c_index

        self.all_results=all_results
        self.c_indexes=c_indexes  # performance for each subject

I think it is better to package the splitting of the data into folds as a separate method. When running KNN, I can then directly look up the indexes of the fold I need.

The `all_results` attribute records all the predictions over the data set.
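The fold logic itself can be sketched without pandas or sklearn. The following is a minimal illustration, assuming one subject label per instance: `leave_subject_out` and the toy labels are hypothetical and only show how each subject in turn becomes the validation fold.

```python
def leave_subject_out(subjects):
    """Yield (held_out_subject, train_indices, validate_indices) for each subject."""
    seen = []
    for s in subjects:
        if s not in seen:
            seen.append(s)  # keep subjects in order of first appearance
    for held_out in seen:
        validate = [i for i, s in enumerate(subjects) if s == held_out]
        train = [i for i, s in enumerate(subjects) if s != held_out]
        yield held_out, train, validate

# Toy subject labels, one per instance.
subjects = ["a", "a", "b", "b", "b", "c"]
folds = list(leave_subject_out(subjects))

for held_out, train, validate in folds:
    print(held_out, "train:", train, "validate:", validate)
```

Every instance lands in exactly one validation fold, and no subject ever appears on both sides of a split. scikit-learn offers the same kind of split as `sklearn.model_selection.LeaveOneGroupOut`.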

## Results and discussion

In [15]:
a=Lso_CV()
a.load_data(X,y,additional_variables)
a.split_data("subject")

In [11]:
a.KNN_predict(37)

In [13]:
a.c_indexes

{1: 0.7641179502282432,
 2: 0.5724070992789795,
 3: 0.5190438268892794,
 4: 0.6868397887323944,
 5: 0.6005850585058505,
 6: 0.4787750529067596,
 7: 0.6641137295081967,
 8: 0.6700568707842878,
 9: 0.5348868534482759,
 10: 0.5807945900253593,
 11: 0.6584419373662711,
 12: 0.6382644057392772,
 13: 0.6561140485911251,
 14: 0.5606199514346598,
 15: 0.7877790081007052,
 16: 0.6848321803740712,
 17: 0.6970981507823613,
 18: 0.7106073943661971,
 19: 0.6498157002649771,
 20: 0.5121860207258305,
 21: 0.6614544600571451,
 22: 0.5180185836263185,
 23: 0.6636151452282157,
 24: 0.5044012282497441,
 25: 0.6062697009302683,
 26: 0.5686545898636272,
 27: 0.686991419872682,
 28: 0.8156826256132634,
 29: 0.6325650394219939,
 30: 0.646957671957672,
 31: 0.5571999031945789}

Performance for each subject:
```
 1: 0.9023891915770873,
 2: 0.6682750970604548,
 3: 0.5801296133567663,
 4: 0.7555017605633803,
 5: 0.7563456345634564,
 6: 0.5020540271380555,
 7: 0.6878073770491804,
 8: 0.7649649517259621,
 9: 0.6223060344827587,
 10: 0.572916314454776,
 11: 0.7903131686442326,
 12: 0.7448783856364612,
 13: 0.7362793655043169,
 14: 0.6505277764012092,
 15: 0.8331487848942246,
 16: 0.776658980271586,
 17: 0.7843243243243243,
 18: 0.7711267605633803,
 19: 0.6704661525024795,
 20: 0.5432854279500192,
 21: 0.7319334090049187,
 22: 0.549761426418885,
 23: 0.7818724066390041,
 24: 0.5139201637666325,
 25: 0.5915372491735219,
 26: 0.6320781599837166,
 27: 0.8282313866592859,
 28: 0.8993402131619015,
 29: 0.6551056289662371,
 30: 0.7121031746031746,
 31: 0.597241045498548
 ```

In [14]:
results={}
c_values=list(a.c_indexes.values())  # numpy cannot reduce a dict view directly in Python 3
results['mean']=np.mean(c_values)
results['max']=np.max(c_values)
results['min']=np.min(c_values)
results['range of c-index']=np.max(c_values)-np.min(c_values)
display(results)

{'max': 0.8156826256132634,
 'mean': 0.6286835479376972,
 'min': 0.4787750529067596,
 'range of c-index': 0.33690757270650384}



The `.c_indexes` attribute records the performance for each fold.

|mean c-index|maximum c-index|minimum c-index|range|
|:---|---|---|---:|
|0.6286835479376972|0.8156826256132634|0.4787750529067596|0.33690757270650384|


The performance is not very good: the average c-index is only about 0.63, not far above 0.5, which means the model does not capture much useful information (although it can still predict something). The maximum c-index is 0.82, which is at least a good score for that subject. The range of the c-index is about 0.34, which suggests that the performance of the model differs considerably between subjects.
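The spread across subjects can be made more concrete by listing the subjects whose c-index stays close to 0.5. The sketch below reuses the per-subject values from the output above, rounded to four decimals; the 0.55 cut-off is a hypothetical, arbitrary threshold chosen only for illustration.

```python
from statistics import mean

# Per-subject c-indexes copied from the output above, rounded to four decimals.
c_indexes = {
    1: 0.7641, 2: 0.5724, 3: 0.5190, 4: 0.6868, 5: 0.6006, 6: 0.4788,
    7: 0.6641, 8: 0.6701, 9: 0.5349, 10: 0.5808, 11: 0.6584, 12: 0.6383,
    13: 0.6561, 14: 0.5606, 15: 0.7878, 16: 0.6848, 17: 0.6971, 18: 0.7106,
    19: 0.6498, 20: 0.5122, 21: 0.6615, 22: 0.5180, 23: 0.6636, 24: 0.5044,
    25: 0.6063, 26: 0.5687, 27: 0.6870, 28: 0.8157, 29: 0.6326, 30: 0.6470,
    31: 0.5572,
}

values = list(c_indexes.values())
summary = {
    "mean": mean(values),
    "max": max(values),
    "min": min(values),
    "range": max(values) - min(values),
}

# Hypothetical cut-off for "close to chance"; 0.55 is an arbitrary choice.
near_chance = sorted(s for s, c in c_indexes.items() if c < 0.55)

print(summary)
print(near_chance)
```

With this cut-off, the subjects flagged as close to chance are 3, 6, 9, 20, 22 and 24.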
