# Exercise 2 | TKO_2096 Application of Data Analysis 2021

#### Prediction of the metal ion content from multi-parameter data <br>
- Use K-Nearest Neighbor Regression with Euclidean distance to predict the total metal concentration (c_total), the concentration of Cadmium (Cd) and the concentration of Lead (Pb) for each sample, using k = 3 neighbors.<br> <br>

    - You may use Nearest Neighbor Regression from https://scikit-learn.org/stable/modules/neighbors.html
    - The data should be standardized using z-score.
    - Implement your own Leave-One-Out cross-validation and calculate the C-index for each output (c_total, Cd, Pb). 
    - Implement your own Leave-Replicas-Out cross-validation and calculate the C-index for each output (c_total, Cd, Pb).
    - Return your solution as a Jupyter Notebook file (include your full name in the file name).
    - Submit your solution to Moodle by **Wednesday 10 February** at the latest.
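
As a minimal illustration of the regressor named in the task (not part of the required solution), the sketch below fits `KNeighborsRegressor` with k = 3 on a hypothetical one-feature toy set. With the default uniform weights and the default Minkowski metric with p = 2 (Euclidean distance), the prediction is the mean target of the three closest training samples.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: one feature, four training samples
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

# Defaults: uniform weights, Minkowski metric with p=2 (Euclidean distance)
knr = KNeighborsRegressor(n_neighbors=3)
knr.fit(X_train, y_train)

# The three nearest neighbours of 1.2 are the samples at 1.0, 2.0 and 0.0,
# so the prediction is the mean of their targets
prediction = knr.predict(np.array([[1.2]]))[0]
print(prediction)  # 1.0
```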

## Import libraries

In [1]:
#In this cell import all libraries you need. For example: 
import numpy as np
import pandas as pd

from scipy.stats import zscore

from sklearn.neighbors import KNeighborsRegressor

## Read and visualize the dataset

In [2]:
#In this cell read the file Water_data.csv
#Print the dataset dimensions (i.e. number of rows and columns)
#Print the first 5 rows of the dataset

data = pd.read_csv('water_data.csv')
print("Dimensions:",data.shape)
print(data.head(10))

Dimensions: (225, 6)
   c_total      Cd      Pb    Mod1   Mod2    Mod3
0     2000   800.0  1200.0  126430   2604    6996
1       35    14.0    21.0   20597    271  138677
2       35    14.0    21.0   24566    269  161573
3       35    35.0     0.0  105732    971  132590
4      100    20.0    80.0   57774   5416   93798
5     1000  1000.0     0.0  156215  11337  130434
6       14     5.6     8.4   10412    101   95515
7       50    40.0    10.0  175474   7024  139189
8      500   100.0   400.0  221911   3355   11517
9      100   100.0     0.0  274833  34426  145074


#### To show understanding of the data, answer the following questions:
- How many different mixtures of Cadmium (Cd) and Lead (Pb) were measured? <br>
- How many total concentrations (c_total) were measured? <br>
- How many mixtures have fewer than 4 replicas? <br>
- How many mixtures have 4 or more replicas? Print out c_total, Cd and Pb for those concentrations.<br>

In [3]:
#In this cell write the code to answer the previous questions and print the answers. 

# Different mixtures of cadmium and lead
mix = data[["Cd","Pb"]] # Remove unneeded columns
mix = mix.drop_duplicates() # Remove duplicate rows
print("Different mixtures:",len(mix),"\n")

# Total concentrations
tot = len(data["c_total"].unique()) # Number of unique values for 'c_total'
print("Total concentrations:",tot,"\n")


# First count unique rows based on values of 'Cd' and 'Pb'
replica_counts = data.groupby(["Cd","Pb"]).size().reset_index(name="Count") 

# Mixtures with less than 4 replicas
lessthan4 = (replica_counts["Count"] < 4).sum()
print("Mixtures with less than 4 replicas:",lessthan4,"\n")

# Mixtures with 4 or more replicas, print concs
fourormore = (replica_counts["Count"] >= 4).sum()
print("Mixtures with 4 or more replicas:",fourormore)
print("Concentrations of mixtures with 4 or more replicas:")

# Print the wanted concentrations
print(replica_counts[replica_counts["Count"]>=4][["Cd","Pb"]])

Different mixtures: 67 

Total concentrations: 12 

Mixtures with less than 4 replicas: 43 

Mixtures with 4 or more replicas: 24
Concentrations of mixtures with 4 or more replicas:
       Cd     Pb
4     0.0   50.0
5     0.0   70.0
6     0.0  100.0
7     0.0  200.0
18   10.0   40.0
23   14.0   56.0
26   20.0   30.0
27   20.0   80.0
30   28.0   42.0
31   30.0   20.0
33   40.0   10.0
34   40.0   60.0
35   40.0  160.0
36   42.0   28.0
37   50.0    0.0
38   56.0   14.0
39   60.0   40.0
40   70.0    0.0
41   80.0   20.0
42   80.0  120.0
43  100.0    0.0
45  120.0   80.0
46  160.0   40.0
47  200.0    0.0
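
The question above also asks for c_total of the mixtures with 4 or more replicas. A minimal sketch of one way to include it, shown on a hypothetical toy frame rather than on the real data set, is to group on all three concentration columns:

```python
import pandas as pd

# Hypothetical toy data: one mixture with 4 replicas, one with 2
toy = pd.DataFrame({
    "c_total": [50, 50, 50, 50, 100, 100],
    "Cd": [0.0, 0.0, 0.0, 0.0, 20.0, 20.0],
    "Pb": [50.0, 50.0, 50.0, 50.0, 80.0, 80.0],
})

# Count the replicas of every (c_total, Cd, Pb) combination
counts = toy.groupby(["c_total", "Cd", "Pb"]).size().reset_index(name="Count")

# Keep only the mixtures that were measured at least 4 times
four_or_more = counts[counts["Count"] >= 4]
print(four_or_more[["c_total", "Cd", "Pb"]])
```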


## Standardization of the dataset

In [4]:
#Standardize the dataset features by removing the mean and scaling to unit variance. 
#In other words, use z-score to scale the dataset features (Mod1, Mod2, Mod3) 
#Print the 5 first samples (i.e. rows) of the scaled dataset

# Select subset to standardize
to_scale = data[["Mod1","Mod2","Mod3"]] 

# Use scipy zscore on the subset
std_data = to_scale.apply(zscore) 

# Recombine the target columns with the standardized features
data = pd.concat([data[["c_total","Cd","Pb"]],std_data],axis=1)
print(data[220:]) # rows 220-224, i.e. the last 5 samples

     c_total      Cd      Pb      Mod1      Mod2      Mod3
220     2000     0.0  2000.0 -0.645171 -0.495941 -1.530484
221     5000  4000.0  1000.0 -0.874613 -0.677499 -1.491442
222       50    30.0    20.0 -0.603170 -0.537114  1.873760
223       50     0.0    50.0 -0.926602 -0.699822  0.351225
224     2000   800.0  1200.0  0.174902 -0.521240 -1.492006
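
As a sanity check of what `zscore` computes, the sketch below compares it with the manual formula (x - mean) / std on a hypothetical toy vector. By default both `scipy.stats.zscore` and `np.std` use the population standard deviation (`ddof=0`).

```python
import numpy as np
from scipy.stats import zscore

# Hypothetical toy feature with mean 5 and population standard deviation 2
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual z-score: subtract the mean, divide by the standard deviation (ddof=0)
manual = (x - x.mean()) / x.std()

# scipy's zscore uses ddof=0 by default, so it should give the same values
scaled = zscore(x)

print(scaled)
print(np.allclose(manual, scaled))  # True
```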


## C-index code 

In [5]:
def cindex(true_labels, pred_labels):
    """Returns general C-index between true labels and predicted labels"""  
    
    N = 0 # concordance score: 1 per correctly ordered pair, 0.5 per tied prediction
    T = 0 # number of pairs whose true labels differ
    
    ## Pair each true label with its prediction
    data = list(zip(true_labels, pred_labels))


    for i in range(len(data)):
        for j in range(i+1, len(data)):
            
            # If y_i < y_j and ^y_i < ^y_j
            if (data[i][0] < data[j][0]) and (data[i][1] < data[j][1]):
                N = N + 1
                
            # Or if y_i > y_j and ^y_i > ^y_j
            elif (data[i][0] > data[j][0]) and (data[i][1] > data[j][1]):
                N = N + 1
                
            # Case for ^y_i == ^y_j
            elif (data[i][1] == data[j][1]) and (data[i][0] != data[j][0]):
                N = N + 0.5
                
            # Count every pair whose true labels differ
            if data[i][0] != data[j][0]:
                T = T + 1
                    
    print("N =",N,"\nPairs =", T)
    
    cindx = N/T
    return cindx

In [6]:
#test cindex function with following values

## values given in this exercise
true_labels = [-1, 1, 1, -1, 1]
predictions = [0.60, 0.80, 0.75, 0.75, 0.70]

cindx = cindex(true_labels, predictions)
print(cindx)

N = 4.5 
Pairs = 6
0.75
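
The same pairwise rule can be cross-checked with a more compact formulation. This is only a sketch: `cindex_compact` is a hypothetical helper name, and the explicit error for the case with no comparable pairs is an added assumption, not something the exercise requires.

```python
from itertools import combinations

def cindex_compact(true_labels, pred_labels):
    """C-index over all pairs whose true labels differ (hypothetical helper)."""
    score, pairs = 0.0, 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(true_labels, pred_labels), 2):
        if t_i == t_j:
            continue  # pairs with equal true labels are not comparable
        pairs += 1
        if p_i == p_j:
            score += 0.5  # tied predictions earn half credit
        elif (t_i < t_j) == (p_i < p_j):
            score += 1.0  # both orderings agree: concordant pair
    if pairs == 0:
        raise ValueError("no pairs with different true labels")
    return score / pairs

# Same test values as in the cell above
result = cindex_compact([-1, 1, 1, -1, 1], [0.60, 0.80, 0.75, 0.75, 0.70])
print(result)  # 0.75
```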


## Functions

In [7]:
#Include here all the functions that you need to run in the data analysis part.
def myOwnLOO(X):
    """Yield (train_index, test_index) pairs, leaving one sample out at a time."""
    
    indices = np.arange(len(X)) # one split per sample
    
    for test_index in indices:
        
        train_index = np.delete(indices, test_index) # all but 'test_index'

        yield train_index, test_index
        

def LOOCV(X, y, loo):
    """Return the held-out prediction for every split produced by 'loo'."""
    
    pred = [] # target predictions, in the original row order

    for train_index, test_index in loo:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]

        knr = KNeighborsRegressor(n_neighbors=3)
        knr.fit(X_train, y_train)
        
        # predict() expects a 2D array, so reshape the single sample to (1, n_features)
        pred.append(knr.predict(X_test.reshape(1, -1))[0])
        
    return pred
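
A hand-written splitter like the one above can be sanity-checked against scikit-learn's `LeaveOneOut`. The sketch below does this on a hypothetical 4-sample array; `loo_splits` is an illustrative stand-in for the generator above, and the library class is used only for comparison, not as a replacement for the own implementation the task asks for.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loo_splits(n_samples):
    """Hypothetical stand-in for a hand-written leave-one-out generator."""
    indices = np.arange(n_samples)
    for i in indices:
        yield np.delete(indices, i), i  # (train indices, single test index)

# Hypothetical toy feature matrix: 4 samples, 3 features
X_toy = np.arange(12, dtype=float).reshape(4, 3)

own = list(loo_splits(len(X_toy)))
ref = list(LeaveOneOut().split(X_toy))

# LeaveOneOut returns the test index as a length-1 array
matches = all(
    np.array_equal(train, ref_train) and test == ref_test[0]
    for (train, test), (ref_train, ref_test) in zip(own, ref)
)
print(len(own), matches)  # 4 True
```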

In [8]:
def myOwnLRO(rep_groups):
    """Yield (train_groups, test_index) pairs, leaving one replica group out at a time."""
    
    for i, test_index in enumerate(rep_groups):
        
        # all groups but the i-th one (still a list of lists)
        train_index = rep_groups[:i] + rep_groups[i+1:]
        
        yield train_index, test_index
        

def LROCV(X, y, lro):
    """Return held-out predictions and matching true values, one array per replica group."""
    
    pred_y_test = [] # target predictions
    
    y_true = [] # true y values
    
    for train_index, test_index in lro:
        
        # Flatten the train_index list
        train_index = [val for sublist in train_index for val in sublist]
        train_index.sort()
        
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        
        knr = KNeighborsRegressor(n_neighbors=3)
        knr.fit(X_train, y_train)
        
        pred_y_test.append(knr.predict(X_test))
        
        # Store the true values group by group so they stay aligned with the predictions
        y_true.append(y[test_index])
        
    
    return pred_y_test,y_true
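
Leave-replicas-out is the same idea as scikit-learn's `LeaveOneGroupOut` with one group label per mixture. The sketch below compares a hand-written splitter with it on hypothetical toy groups; `lro_splits` and the group labels are illustrative assumptions, and the library class serves only as a cross-check.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def lro_splits(rep_groups):
    """Hypothetical stand-in for a hand-written leave-replicas-out generator."""
    for i, test_index in enumerate(rep_groups):
        # Flatten every group except the i-th one into the training indices
        train_index = [idx for j, group in enumerate(rep_groups) if j != i for idx in group]
        yield train_index, test_index

# Hypothetical toy setup: 6 samples in 3 replica groups
rep_groups = [[0, 1], [2], [3, 4, 5]]
group_ids = np.array([0, 0, 1, 2, 2, 2])  # same grouping expressed as labels
X_toy = np.zeros((6, 3))

own = list(lro_splits(rep_groups))
ref = list(LeaveOneGroupOut().split(X_toy, groups=group_ids))

matches = all(
    sorted(train) == ref_train.tolist() and list(test) == ref_test.tolist()
    for (train, test), (ref_train, ref_test) in zip(own, ref)
)
print(len(own), matches)  # 3 True
```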


## Results for Leave-One-Out cross-validation

In [9]:
#In this cell run your code for leave-One-Out cross-validation and print the corresponding results.

X = data[["Mod1","Mod2","Mod3"]].to_numpy()

# Labels of y in data
labels = ["c_total","Cd","Pb"]

for label in labels:
    loo = myOwnLOO(X) # LOO generator
    
    y = data[label].to_numpy()
    
    predictions = LOOCV(X,y,loo)
    
    print(label)
    c = cindex(y, predictions)
    print("C-index =",c,"\n")



c_total
N = 21046.5 
Pairs = 23022
C-index = 0.9141907740422205 

Cd
N = 21542.5 
Pairs = 23947
C-index = 0.8995907629348144 

Pb
N = 20940.5 
Pairs = 23947
C-index = 0.8744519146448407 



## Results for Leave-Replicas-Out cross-validation

In [10]:
#In this cell run your script for leave-Replicas-Out cross-validation and print the corresponding results.


## Find indexes of all the replicates
uniqs = data.drop_duplicates(subset=["c_total","Cd","Pb"])

# For each unique set of 'c_total', 'Cd' and 'Pb' find indexes of their replicates
replica_groups = []
for index, row in uniqs.iterrows():
    
    Xy = data.loc[(data["c_total"] == row["c_total"]) &
                 (data["Cd"] == row["Cd"]) &
                 (data["Pb"] == row["Pb"])]
    
    # Append lists of replicate indexes
    replica_groups.append(Xy.index.values.tolist())

    
X = data[["Mod1","Mod2","Mod3"]].to_numpy()

# Labels of y in data
labels = ["c_total", "Cd", "Pb"]


for label in labels:
    loo = myOwnLRO(replica_groups) # Leave replicas out generator

    y = data[label].to_numpy()

    pred_y_test, y_true = LROCV(X, y, loo)
    
    #flatten lists
    pred_y_test = [val for sublist in pred_y_test for val in sublist]
    y_true = [val for sublist in y_true for val in sublist]

    print(label)
    
    c = cindex(y_true, pred_y_test) # both lists follow the replica-group order, not the original row order

    print("C-index =",c,"\n")

c_total
N = 18847.5 
Pairs = 23022
C-index = 0.8186734427938493 

Cd
N = 18234.5 
Pairs = 23947
C-index = 0.7614523739925669 

Pb
N = 18414.0 
Pairs = 23947
C-index = 0.7689480937069362 



## Interpretation of results
#### Answer the following questions based on the results obtained
- Which cross-validation approach gave more optimistic results?
- Which cross-validation approach generalizes better to unseen data? Why?

In [11]:
#In this cell write your answers to the questions about Interpretation of Results.

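# Leave-one-out gave the more optimistic results: its C-index is higher for every output
# (c_total 0.914 vs 0.819, Cd 0.900 vs 0.761, Pb 0.874 vs 0.769).
# These scores are inflated because replicas of the test sample stay in the training set,
# so its nearest neighbours are likely to be near-copies of the test sample itself.
# Leave-replicas-out gives the more realistic estimate for unseen data: all replicas of the
# tested mixture are held out together, so each prediction is made for a mixture the model has not seen.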
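
The effect can be reproduced on a small synthetic example. The sketch below is a toy under stated assumptions, not a rerun of the analysis above: it uses four hypothetical mixtures with two near-identical replicas each, one feature, and k = 1 instead of k = 3 so that the replica is always the nearest neighbour. `cindex_toy` and `cv_predict` are illustrative helper names.

```python
import numpy as np
from itertools import combinations
from sklearn.neighbors import KNeighborsRegressor

def cindex_toy(y_true, y_pred):
    """C-index over pairs with different true labels (hypothetical helper)."""
    score, pairs = 0.0, 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue
        pairs += 1
        if p_i == p_j:
            score += 0.5
        elif (t_i < t_j) == (p_i < p_j):
            score += 1.0
    return score / pairs

def cv_predict(X, y, test_sets):
    """Predict every sample while its whole test set is held out of training."""
    pred = np.empty(len(y))
    for test_idx in test_sets:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        model = KNeighborsRegressor(n_neighbors=1).fit(X[train_mask], y[train_mask])
        pred[test_idx] = model.predict(X[test_idx])
    return pred

# Four hypothetical mixtures, two near-identical replicas each
X = np.array([[-0.01], [0.01], [0.99], [1.01], [1.99], [2.01], [2.99], [3.01]])
y = np.array([0.0, 0.0, 3.0, 3.0, 1.0, 1.0, 2.0, 2.0])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

loo_sets = [[i] for i in range(len(y))]                                  # one sample at a time
lro_sets = [list(np.flatnonzero(groups == g)) for g in np.unique(groups)]  # one mixture at a time

c_loo = cindex_toy(y, cv_predict(X, y, loo_sets))
c_lro = cindex_toy(y, cv_predict(X, y, lro_sets))
print("LOO C-index:", c_loo)
print("LRO C-index:", c_lro)
```

In this toy, leave-one-out always finds the remaining replica as the nearest neighbour and therefore scores perfectly, while leave-replicas-out has to predict from a different mixture and scores lower.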

