# TKO2096 
## EXERCISE II Prediction of the metal ion content from multi-parameter data
YUE MA 520790

## Libraries

In [2]:
import numpy as np
import pandas as pd
import pdb
import operator
import random
from sklearn.neighbors import KNeighborsRegressor
from scipy.stats import zscore

## Functions and Classes list

Classes:
- `KNN_Regression`: 
    - `.fit()` load train data into the model
    - `.predict()` predict for a single label
    - `KNN_Regression_Multi` child class. predict for multi labels
        - `.predict_ion()` based on the `.select_label()` and `.predict_multi()` methods, the main idea of it is that it add `pb` and `Cd` together to predict `c_total`.
- `LXO_CV`
    - `.load_data()`:load train data.
    - `.predict_own_KNN()`:predict with my own KNN algorithms (k should be set) and save the results into the object
    - `.predict_sklearn_KNN()`:predict with sklearn (k should be set) and save the results into the object
    - `.evaluate()`: evaluate the performance of results with given scorer function

Function:
- `C_index()`: calculate c-index for two sequences of data

the part of trying different k values was not been created as function or class.

## Load and preprocess data

In [3]:
water_data=pd.read_csv("/Users/mayue/Desktop/TKO-2096/Exe/Water_data.csv")

In [4]:
"""split the data into attributes and targets"""
y=water_data.iloc[:,0:3]
X=water_data.iloc[:,3:6]

In [5]:
"""Normalization"""
y=zscore(y)
X=zscore(X)

## Define the own KNN regression 

To make the KNN predictor works on predicting multi values, I have created a subclass of `KNN_Regression_Multi`. It can produce multi predicted values at a meanwhile.

Another tricky problem I have met is that the case is special which include a label `c_total` as the sum of other labels. So I have to build a new special function to get the better predictions

In [6]:
class KNN_Regression():
    """my own class to implement KNN regress from exercise I"""
    
    def __init__(self,num_neighbors=3):
        self.num_neighbors=num_neighbors
        
    def fit(self,attributes,targets):
        """load the data into the model"""
        """attributes and targets are arrays"""
        self.attributes=attributes
        self.all_targets=targets #store the copy of all different kinds of labels
        self.targets=targets#labels we want to predict for now
        
        
    def predict(self,inX):
        """predict an input instance's label """
        """inX is an 1-d array which represents an instance"""
        size=self.attributes.shape[0]
        
        diffMat=np.tile(inX, (size,1))-self.attributes
        sqDiffMat=diffMat**2
        sqDistances=sqDiffMat.sum(axis=1)
        distances=sqDistances**0.5#calculate the euclidean disantce 
        
        sortedDistanceIndices=distances.argsort()
        valueCount=[]
        
        for i in range(self.num_neighbors):
            voteIvalue=self.targets[sortedDistanceIndices[i]]
            valueCount.append(voteIvalue)
        predictedValue=np.mean(valueCount) #calculate the average values of neighbors
        
        return predictedValue

In [7]:
class KNN_Regression_Multi(KNN_Regression):
    """the child class of KNN_Regression, which is used to produce 3 predicted values meanwhile"""

    def select_label(self,indice):
        """fit the model using selected labels"""
        self.targets=self.all_targets[:,indice]
        
    def predict_multi(self,inX,label_index):
        """predict for multi label values,return a list of predicted values"""
        """label_indice: a list of index of labels to be predicted"""
        n=self.all_targets.shape[1]
        pred=[]
        for i in label_index:
            self.select_label(i)#change the label to be predicted
            pred.append(self.predict(inX))
        return pred
    
    def predict_ion(self,inX):
        """this function is to solve the specific requirement of the case, where one lable c_total is the sum of other 2 labels"""
        pred=[]
        temp=self.predict_multi(inX,[1,2])
        pred.append(temp[0])
        pred.append(temp[1])
#         pdb.set_trace()
        pred.append(sum(pred)) #c-total is the sum of cd and pb
        pred=[pred[2],pred[0],pred[1]] #arrange the order of the values
        return pred

## Define the calculation of c-index

This function has refered to the pseudocode in the slide. It calculates the c-index for two sequences of data.

In [8]:
def C_index(pred,true_labels):
    """pred and true_labels are sequences which have same length"""
    n=0
    h_sum=0.0
    for i in range(0,len(true_labels)):
        t=true_labels[i]
        p=pred[i]
        for j in range(i+1,len(true_labels)):
            nt=true_labels[j]
            np=pred[j]
            if t!=nt:
                n=n+1
#                 pdb.set_trace()
                if ((p<np)&(t<nt))|((p>np)&(t>nt)):
                    h_sum+=1.0
                else:
                    if p==np:
                        h_sum+=0.5
    return float(h_sum/n)

## Define the own LXO cross validation class

To make the modifying easier, I build the CV process as a class. 

-  firstly the number of leave-out should be decided. If the parameter=1, then it is a LOO CV. If it is 3, then there goes the L3O CV.
- load data into CV object
- then use 2 different methods(own KNN and `sklearn`) to predict and save the results into the class's attribute. 
- Finally `.evaluate()` method help to calculate the performance of the model

In [9]:
class LXO_CV():
    
    def __init__(self,leave_out_num=1):
        self.size_validate=leave_out_num #number of leave-out instances in each validation
    
    def load_data(self,attributes,targets):
        """load the train data before validating"""
        self.attributes=attributes
        self.targets=targets
        
    def predict_own_KNN(self,predictor_class,num_neighbor):
        """predictor_class is the predictor CLASS used to validate"""
        """this method is used for validating using own KNN"""
        m=self.attributes.shape[0]
        all_results={}
        for i in range(0,m,self.size_validate):
            #split train set and validate set
            validate_index=range(i,i+self.size_validate)

            X_validate=self.attributes[validate_index]
            y_validate=self.targets[validate_index]

            X_train=np.delete(self.attributes,validate_index,0)
            y_train=np.delete(self.targets,validate_index,0)#remove the validating instance

            predictor=predictor_class(num_neighbor)
            predictor.fit(X_train,y_train)
            
            for j in range(0,self.size_validate):#predict for each single validate instance, because the KNN regressor here can't accept multi instances
                pred=predictor.predict_ion(X_validate[j])
                all_results[i+j]=pred #key of the result dict is the number of instance
        self.all_results=pd.DataFrame.from_dict(all_results,orient='index',columns=['c_total','cd','pb'])
        
    def predict_sklearn_KNN(self,num_neighbor):
        """directly use sklearn.neighbors.KNeighborsRegressor to predict for 3 targets values"""
        m=self.attributes.shape[0]
        all_results={}
        for i in range(0,m,self.size_validate):
            #split train set and validate set
            validate_index=range(i,i+self.size_validate)

            X_validate=self.attributes[validate_index]
            y_validate=self.targets[validate_index]

            X_train=np.delete(self.attributes,validate_index,0)
            y_train=np.delete(self.targets,validate_index,0)#remove the validating instance

            predictor=KNeighborsRegressor(num_neighbor)
            predictor.fit(X_train,y_train)
            pred=predictor.predict(X_validate)
            all_results[i]=pred #key of the result dict is the number of instance

        all_results=np.array(all_results.values())
        all_results_=all_results.reshape((all_results.shape[0]*all_results.shape[1]),all_results.shape[2])#transfer the dimension of the dict, which will make printout easier
        self.all_results=pd.DataFrame(all_results_,columns=['c_total','cd','pb'])
        
            
    def evaluate(self,scorer):
        """socrer is the function used to evaluating the performance"""
        scores={}
        for i in range(0,self.all_results.shape[1]):
            #Traversing targets
            score=scorer(self.all_results.iloc[:,i],self.targets[:,i])
            target_name=self.all_results.columns[i]
            scores[target_name]=score
        self.all_scores=scores
            

## Results 

### Predict  with my own KNN

Launch a leave-one-out cross validation using my own KNN, and print out the results of different k values:

In [22]:
"""Calculate c-index for different k values using LOO CV"""
loo=LXO_CV(1)
loo.load_data(X,y)

namelist=[]
for i in range(1,6): #the range of k
    name='loo_score_'+str(i)
    namelist.append(name)
    loo.predict_own_KNN(KNN_Regression_Multi,i)
    loo.evaluate(C_index)
    scores=loo.all_scores
    locals()[name]=scores



In [23]:
pd.DataFrame([loo_score_1,loo_score_2,loo_score_3,loo_score_4,loo_score_5],index=['k=1','k=2','k=3','k=4','k=5'])

Unnamed: 0,c_total,cd,pb
k=1,0.900076,0.900581,0.866368
k=2,0.906158,0.902385,0.872384
k=3,0.904258,0.877982,0.849629
k=4,0.891794,0.851904,0.845548
k=5,0.879277,0.825277,0.830482


Launch a leave-3-out cross validation using my own KNN, and print out the results of different k values:

In [24]:
"""Calculate c-index for different k values using L3O CV"""
l3o=LXO_CV(3)
l3o.load_data(X,y)

namelist=[]
for i in range(1,6): #the range of k
    name='l3o_score_'+str(i)
    namelist.append(name)
    l3o.predict_own_KNN(KNN_Regression_Multi,i)
    l3o.evaluate(C_index)
    scores=l3o.all_scores
    locals()[name]=scores



In [25]:
pd.DataFrame([l3o_score_1,l3o_score_2,l3o_score_3,l3o_score_4,l3o_score_5],index=['k=1','k=2','k=3','k=4','k=5'])

Unnamed: 0,c_total,cd,pb
k=1,0.816037,0.741028,0.736268
k=2,0.817231,0.745789,0.751909
k=3,0.819757,0.735405,0.752642
k=4,0.822553,0.722353,0.759338
k=5,0.815114,0.720731,0.755702


### Predict with `sklearn.neighbors.KNeighborsRegressor`

Launch a leave-one-out cross validation using sklean, and print out the results of different k values:

In [26]:
loo=LXO_CV(1)
loo.load_data(X,y)

namelist=[]
for i in range(1,6): #the range of k
    name='loo_score_'+str(i)
    namelist.append(name)
    loo.predict_sklearn_KNN(i)
    loo.evaluate(C_index)
    scores=loo.all_scores
    locals()[name]=scores
    
pd.DataFrame([loo_score_1,loo_score_2,loo_score_3,loo_score_4,loo_score_5],index=['k=1','k=2','k=3','k=4','k=5'])

Unnamed: 0,c_total,cd,pb
k=1,0.898664,0.900581,0.866368
k=2,0.905615,0.902385,0.872384
k=3,0.903796,0.877982,0.849629
k=4,0.891659,0.851904,0.845548
k=5,0.880553,0.825277,0.830482


Launch a leave-3-out cross validation using sklearn, and print out the results of different k values:

In [27]:
"""Calculate c-index for different k values using L3O CV"""
l3o=LXO_CV(3)
l3o.load_data(X,y)

namelist=[]
for i in range(1,6): #the range of k
    name='l3o_score_'+str(i)
    namelist.append(name)
    l3o.predict_sklearn_KNN(i)
    l3o.evaluate(C_index)
    scores=l3o.all_scores
    locals()[name]=scores

pd.DataFrame([l3o_score_1,l3o_score_2,l3o_score_3,l3o_score_4,l3o_score_5],index=['k=1','k=2','k=3','k=4','k=5'])

Unnamed: 0,c_total,cd,pb
k=1,0.560579,0.591573,0.521683
k=2,0.568861,0.585112,0.533689
k=3,0.568806,0.575303,0.530106
k=4,0.56066,0.565338,0.533924
k=5,0.546595,0.560865,0.52783


##  which evaluation approach generalize better? Why?  
- L3O will generalize better than LOO. Because L3O leave all 3 copies of a sample out as validate set, so that the prediction will not be influenced by the copy of the sample
-  Basically sklearn get worse performance during L3O CV. This may because it predicts for `c_total` independently rather than gets the sum of `cd` and `pb`. And when L3O it can't depend on the copies to predict, so the performance is decreased.