# Exercise 2 |  Application of Data Analysis

#### Prediction of the metal ion content from multi-parameter data <br>
- Use K-Nearest Neighbor Regression with euclidean distance to predict total metal concentration (c_total), concentration of Cadmium (Cd) and concentration of Lead (Pb), for each sample using number of neighbors k = 3.<br> <br>

    - You may use Nearest Neighbor Regression from https://scikit-learn.org/stable/modules/neighbors.html
    - The data should be standarized using z-score.
    - Implement your own Leave-One-Out cross-validation and calculate the C-index for each output (c_total, Cd, Pb). 
    - Implement your own Leave-Replicas-Out cross-validation and calculate the C-index for each output (c_total, Cd, Pb).


## Import libraries

In [1]:
#In this cell import all libraries you need. For example: 
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_predict, LeaveOneOut

## Read and visualize the dataset

In [2]:
#In this cell read the file water_data.csv
#Print the dataset dimesions (i.e. number of rows and columns)
df = pd.read_csv("Water_data.csv")
df.shape

(225, 6)

#### To show understanding of the data, answer the following questions:
- How many different mixtures of Cadmium (Cd) and Lead (Pb) were measured? <br>
- How many total concentrations (c_total) were measured? <br>
- For each c_total, how many times the measurement was repeated? To answer this question <br>
  create a table or make a bar plot of c_total / number of repetitions 

In [3]:
#In this cell write the code to answer the previous questions and print the answers.
# Calculating the amount of different mixtures of Cd and Pb 
df_mixtures = df.groupby(['Cd','Pb']).size().reset_index().shape
print ('Amount Different unique mixtures:' , df_mixtures[0])

# total concentrations
df_c_total = df['c_total'].value_counts()
print ('Amount of unique total concentrations:', df_c_total.shape[0])

print (df_c_total)

Amount Different unique mixtures: 67
Amount of unique total concentrations: 12
200     24
100     24
70      24
50      24
500     18
1000    18
2000    18
5000    18
35      18
20      18
14      18
0        3
Name: c_total, dtype: int64


## Standardization of the dataset

In [15]:
#Standardize the dataset features by removing the mean and scaling to unit variance. 
#In other words, use z-score to scale the dataset features (Mod1, Mod2, Mod3)
df_f = df.drop(df.columns[[0,1,2]], axis=1)
df_z = df_f.apply(zscore)
df_t = df.drop(df.columns[[3,4,5]], axis=1)
df = pd.concat([df_t, df_z], axis=1)
print(df)
#Print the 5 first samples (i.e. rows) of the scaled dataset
print (df_z[0:5])

     c_total      Cd      Pb      Mod1      Mod2      Mod3
0       2000   800.0  1200.0  0.166505 -0.508756 -1.499041
1         35    14.0    21.0 -0.892616 -0.701641  0.685861
2         35    14.0    21.0 -0.852896 -0.701806  1.065760
3         35    35.0     0.0 -0.040629 -0.643767  0.584863
4        100    20.0    80.0 -0.520568 -0.276268 -0.058789
5       1000  1000.0     0.0  0.464578  0.213261  0.549090
6         14     5.6     8.4 -0.994542 -0.715696 -0.030300
7         50    40.0    10.0  0.657312 -0.143324  0.694356
8        500   100.0   400.0  1.122030 -0.446665 -1.424027
9        100   100.0     0.0  1.651645  2.122186  0.792002
10        14    11.2     2.8 -0.935218 -0.713381  0.255056
11      1000   400.0   600.0  0.640710 -0.452701 -1.393331
12        50     0.0    50.0 -0.933817 -0.698003 -0.089966
13        14    11.2     2.8 -0.929043 -0.712637  0.937766
14      1000   600.0   400.0  0.597287 -0.550673 -1.407286
15        70     0.0    70.0 -0.916894 -0.695688  0.3104

## C-index code 

In [5]:
def cindex(true_labels, predictions):
    """Returns C-index between true labels and predicted labels"""
    n = 0
    h_sum = 0

    for i in range (1, len(true_labels)):
        t = true_labels[i]
        p = predictions[i]
        j = i + 1
        for j in range(j, len(true_labels)):
            nt = true_labels[j]
            np = predictions[j]
            if t != nt:
                n = n + 1
                if (p < np and t < nt) or (p > np and t > nt):
                    h_sum = h_sum + 1
                elif p == np:
                    h_sum = h_sum + 0.5
    cindx = h_sum/n    
    return cindx

In [6]:
#test cindex function with following values
true_labels = [-1, 1, 1, -1, 1]
predictions = [0.60, 0.80, 0.75, 0.75, 0.70]
cindx = cindex(true_labels, predictions)
print(cindx)

0.5


## Functions

In [7]:
#Include here all the functions that you need to run in the data analysis part.

# Input values
x = df.drop(columns=['c_total','Cd','Pb'])

#Target values
c_total = df['c_total'].values
Cd = df['Cd'].values
Pb = df['Pb'].values
#c_total

## Results for Leave-One-Out cross-validation

In [8]:
#In this cell run your code for leave-One-Out cross-validation and print the corresponding results.


# Leave-one-out Cross validation c_total
print('Leave-one-out Cross validation c_total')
neigh = KNeighborsRegressor(n_neighbors=3)

cv_predictions = cross_val_predict(neigh, x, c_total, cv=LeaveOneOut())
print('C-index =', cindex(c_total,cv_predictions))

print('Leave-one-out Cross validation Cd')
cv_predictions = cross_val_predict(neigh, x, Cd, cv=LeaveOneOut())
print('C-index =', cindex(Cd,cv_predictions))

print('Leave-one-out Cross validation Pb')
cv_predictions = cross_val_predict(neigh, x, Pb, cv=LeaveOneOut())
print('C-index =', cindex(Pb,cv_predictions))

Leave-one-out Cross validation c_total
C-index = 0.9135875520490905
Leave-one-out Cross validation Cd
C-index = 0.8991276129467296
Leave-one-out Cross validation Pb
C-index = 0.8734246575342466


In [9]:
df_a = df.values
pred_c_total= []
pred_Cd= []
pred_Pb= []
true_labels = df_a[:,[0,1,2]]
true_c_total = []


for x in range(len(df_a)):
        #Select row for testing
        testset = df_a[x, [3,4,5]]
        
        #Delete row that is used for testing to create training set ( Leave-one-out cross-validation)
        trainset = np.delete(df_a, (x), axis=0)
        
        knn = KNeighborsRegressor(n_neighbors=3)
        knn.fit(trainset[:,[3,4,5]], trainset[:,[0,1,2]])
        #Prediction on testset row
        prediction = knn.predict([testset])
        
        # Add predicitons to separate list by output
        pred_c_total.append(prediction[0][0])
        pred_Cd.append(prediction[0][1])
        pred_Pb.append(prediction[0][2])
        

#Calculate each C-index
C_total_cindex = cindex(true_labels[:,0], pred_c_total)
Cd_cindex = cindex(true_labels[:,1], pred_Cd)
Pb_cindex = cindex(true_labels[:,2], pred_Pb)

print('C_total C-index:', C_total_cindex)
print('Cd C-index:', Cd_cindex)
print('Pb C-index:', Pb_cindex)

C_total C-index: 0.9135875520490905
Cd C-index: 0.8991276129467296
Pb C-index: 0.8734246575342466


## Results for Leave-Replicas-Out cross-validation

In [10]:
df_replicas =df.drop_duplicates(['Cd','Pb'],keep= 'first')
mixture_cd = df_replicas['Cd'].values
mixture_pb = df_replicas['Pb'].values
c_total = df['c_total'].values


In [11]:
#In this cell run your script for leave-Replicas-Out cross-validation and print the corresponding results.


true_labels = []
true_c_total = []
c_total_cindex_array = []
true_labels_c = []
Cd_cindex = []
Pb_cindex = []

for m in range(len(df_replicas)):
    pred_c_total= []
    pred_Cd= []
    pred_Pb= []
    
    # get replicas row indices
    test_index = np.where((df_a[:,1] == mixture_cd[m]) & (df_a[:,2] == mixture_pb[m]))

    # create test set using the replicas indices
    testset = df_a[test_index]
     #get c_total true value for each mixture
    true_index = test_index[0][0]
    true_labels_c.append(c_total[true_index])
    
    # create training set by deleting the test test from the data (Leave-Replicas-Out cross-validation)
    trainset = np.delete(df_a, (test_index), axis=0)
    trainset_f = trainset[:,[3,4,5]]
    trainset_t = trainset[:,[0,1,2]]
    
    # KNN prediction for each testset row
    for x in range(len(testset)):
        
        testset_row = testset[x, [3,4,5]]
        
        knn = KNeighborsRegressor(n_neighbors=3)
        knn.fit(trainset_f, trainset_t)
        
        prediction = knn.predict([testset_row])
        
        pred_c_total.append(prediction[0][0])
        pred_Cd.append(prediction[0][1])
        pred_Pb.append(prediction[0][2])
        
 
    # Compute averages of replica set
    C_total_average = np.average(pred_c_total)
    Cd_average = np.average(pred_Cd)
    Pb_average = np.average(pred_Pb)
    
    # Add averages to list
    c_total_cindex_array.append(C_total_average)
    Cd_cindex.append(Cd_average)
    Pb_cindex.append(Pb_average)

#Calculate each C-index 
C_total_cindex = cindex(true_labels_c, c_total_cindex_array)
Cd_cindex = cindex(mixture_cd, Cd_cindex)
Pb_cindex = cindex(mixture_pb, Pb_cindex)

print('C_total C-index:', C_total_cindex)
print('Cd C-index:', Cd_cindex)
print('Pb C-index:', Pb_cindex)

C_total C-index: 0.8251889168765743
Cd C-index: 0.7472073822243808
Pb C-index: 0.7665208940719145


## Interpretation of results
#### Answer the following questions based on the results obtained
- Which cross-validation approach had more optimistic results?
- Which cross-validation generalize better on unseen data? Why?

#In this cell write your answers to the questions about Interpretation of Results.

Leave-One-Out cross-validation gives more optimistic results.
Leave-Replicas-Out cross-validation method it better for generalizing on unseen data, because there is no data leakage from the replicas.