# Cross-Validation

## Evaluation with Cross-Validation
To evaluate different methods and parameters with as little risk of overfitting as possible, one always faces the same question: which data do I train my methods on, and which do I test on? The result of the evaluation clearly depends strongly on the concrete choice of the test and training sets.

An evaluation method that is well established in the literature is cross-validation (link: [Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29)). The basic idea of k-fold cross-validation (link: [k-fold cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#k-fold_cross-validation)) is as follows: the full set $T$ of class-annotated records is randomly split into $k$ equally sized subsets (folds) $T_1 \dots T_k$. Then $k$ test iterations $i = 1, \dots, k$ are carried out. In iteration $i$, the subset $T_i$ serves as the test set and the remaining data $T \setminus T_i$ as the training set, so every fold is used for testing exactly once. The overall result of the cross-validation is the mean of the accuracies of the individual iterations.
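
The splitting scheme described above can be sketched with the standard library alone. This is a minimal illustration, not the implementation expected by the exercise: the names `k_fold_indices`, `cross_validate` and the `evaluate` callback are assumptions made for this example, and the folds differ in size by at most one element when the number of samples is not divisible by $k$.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th index starting at j,
    # so the fold sizes differ by at most one.
    return [indices[j::k] for j in range(k)]

def cross_validate(n_samples, k, evaluate, seed=0):
    """Call evaluate(train_idx, test_idx) once per fold and return the mean score."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training set: all folds except fold i
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# Toy usage: 10 samples split into 3 folds
folds = k_fold_indices(10, 3)
print([len(fold) for fold in folds])  # [4, 3, 3]
```

In a real evaluation, `evaluate` would train the classifier on `train_idx` and return its accuracy on `test_idx`.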

Other methods include holdout (link: [Holdout](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#Holdout_method)), nested cross-validation (link: [Nested cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)\#Nested_cross-validation)), etc.

<figure>
<img src="./Figures/k-fold-cross-validation.png" alt="drawing" style="width:600px;">
    <figcaption>k-fold cross-validation, source: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/K-fold_cross_validation_EN.svg/500px-K-fold_cross_validation_EN.svg.png
        </figcaption>
</figure>

Extend your implementation of the KNN algorithm from the previous part with the <b>k-fold cross-validation</b> method. Choose a suitable value for the number of folds, or experiment with different values.

In [1]:
import numpy as np
import csv as csv
import matplotlib.pyplot as plt
import pandas as pd
import itertools
%matplotlib inline

In [2]:
DATA_FILE = './Data/original_titanic.csv'

In [3]:
df = pd.read_csv(DATA_FILE, header=0)

def prepareData(df):
    # Fill missing ages with the mean age of the passengers that share
    # the same Sex, Survived and Pclass values.
    # Note: grouping by Survived means the class label is used for imputation.
    group_mean = df.groupby(["Sex", "Survived", "Pclass"])["Age"].transform("mean")
    df["Age"] = df["Age"].fillna(group_mean)
    return df

df = prepareData(df)



In [4]:
def normalize_column(feature):
    # z-score standardization: subtract the mean, divide by the standard deviation
    return (feature.astype(float) - feature.mean()) / feature.std()


def normalize(df):
    # assign returns a new DataFrame, so its result has to be returned
    return df.assign(Age=normalize_column(df.Age),
                     Fare=normalize_column(df.Fare))

df_norm = normalize(df)
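
The effect of the normalization can be checked on a few toy values. The following is a small sketch using only the standard library; `z_scores` and the sample ages are assumptions made for this example. `statistics.stdev` computes the sample standard deviation (divisor $n-1$), which is also the default of `Series.std()` in pandas.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divisor n - 1)
    return [(v - mean) / std for v in values]

# Hypothetical ages, for illustration only
ages = [22.0, 38.0, 26.0, 35.0]
scaled = z_scores(ages)

# After scaling, mean and standard deviation are 0 and 1 up to rounding
print(round(statistics.mean(scaled), 9), round(statistics.stdev(scaled), 9))
```

Scaling matters for KNN because an unscaled feature with a large value range, such as the fare, would otherwise dominate the distance.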



In [5]:
def extractFeatureVector(row):
    # Feature vector of a passenger: fare and age
    return np.array([row.Fare, row.Age])

In [6]:
class KNN(object):
    
    def __init__(self, k, data, parts):
        print("k:",k,"parts:",parts)
        self.k = k
        self.data = data
        self.parts = parts
        self.count = 0
        self.acc = []
        
        for i in range(self.parts):
            self.fit()
            self.acc.append(self.getAcc())
            print(((100/self.parts)*(i+1)), "% :",self.acc[-1])
            
        print("result:",np.mean(self.acc))
        
    def distance(self, vector1, vector2):
        # Manhattan distance: sum of the absolute coordinate differences
        return np.sum(np.absolute(np.array(vector1) - np.array(vector2)))
    
    
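    def fit(self):
        self.trainData = []
        self.trainLabel = []
        
        # Select the next fold as test set and the remaining folds as training set
        self.cross_validation()
        
        # Store the feature vectors and labels of the training set
        for index, row in self.train.iterrows():
            self.trainData.append(extractFeatureVector(row))
            self.trainLabel.append(row.Survived)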

    def predict(self, x):
        x = np.array(x)
        dis_to_x = []
        for i in range(len(self.trainData)):
            dis_to_x.append({"d" : self.distance(x,self.trainData[i]), "s":self.trainLabel[i] })
        sorted_dis = sorted(dis_to_x, key = lambda i: i['d'])
        k_sorted =  sorted_dis[:self.k]
        
        k_results = []

        for i in k_sorted:
            k_results.append(i["s"])
        return max(k_results,key=k_results.count)
    
    def getAcc(self):
        tp = 0
        tn = 0
        fp = 0
        fn = 0

        for index, row in self.test.iterrows():
            predicted = self.predict(extractFeatureVector(row))
            actual = row["Survived"]

            if predicted == 1:
                if actual == 1:
                    tp += 1
                else:  
                    fp += 1
            else:
                if actual == 1:
                    fn += 1
                else: 
                    tn += 1


        return (tp + tn) / (tp + tn + fp +fn)
    
    def cross_validation(self):
        # Split the data into self.parts folds; fold number self.count is the test set
        k_split = np.array_split(self.data, self.parts)
        
        self.test = k_split[self.count]
        
        # All remaining folds together form the training set
        del k_split[self.count] 
        self.train = pd.concat(k_split)
        
        # Advance to the next fold and wrap around after the last one
        self.count = (self.count + 1) % self.parts
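
The prediction step of the class above (Manhattan distance, the $k$ nearest training points, majority vote) can also be written as a compact standalone function. This is a sketch for illustration only: `manhattan`, `knn_predict` and the toy data are assumptions made for this example and are not part of the exercise template.

```python
import heapq
from collections import Counter

def manhattan(a, b):
    """Sum of the absolute coordinate differences of two vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_vectors, train_labels, x, k):
    """Return the most frequent label among the k training points nearest to x."""
    nearest = heapq.nsmallest(
        k,
        zip(train_vectors, train_labels),
        key=lambda pair: manhattan(pair[0], x),
    )
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two points per class
train_vectors = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = [0, 0, 1, 1]
print(knn_predict(train_vectors, train_labels, (0.2, 0.1), k=3))  # 0
```

`heapq.nsmallest` avoids sorting all distances when only the $k$ smallest are needed.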
           
       


In [7]:
knn = KNN(3, df_norm.sample(frac=1) , 3)
knn = KNN(10, df_norm.sample(frac=1) , 3)
knn = KNN(20, df_norm.sample(frac=1) , 3)
knn = KNN(3, df_norm.sample(frac=1) , 10)
knn = KNN(10, df_norm.sample(frac=1) , 10)
knn = KNN(20, df_norm.sample(frac=1) , 10)
knn = KNN(3, df_norm.sample(frac=1) , 20)
knn = KNN(10, df_norm.sample(frac=1) , 20)
knn = KNN(20, df_norm.sample(frac=1) , 20)

k: 3 parts: 3
33.333333333333336 % : 0.6956521739130435
66.66666666666667 % : 0.6330275229357798
100.0 % : 0.6628440366972477
result: 0.663841244515357
k: 10 parts: 3
33.333333333333336 % : 0.6544622425629291
66.66666666666667 % : 0.7224770642201835
100.0 % : 0.7178899082568807
result: 0.6982764050133311
k: 20 parts: 3
33.333333333333336 % : 0.6842105263157895
66.66666666666667 % : 0.731651376146789
100.0 % : 0.7178899082568807
result: 0.7112506035731531
k: 3 parts: 10
10.0 % : 0.7175572519083969
20.0 % : 0.6412213740458015
30.0 % : 0.6335877862595419
40.0 % : 0.6412213740458015
50.0 % : 0.6183206106870229
60.0 % : 0.6335877862595419
70.0 % : 0.6793893129770993
80.0 % : 0.6717557251908397
90.0 % : 0.6793893129770993
100.0 % : 0.7076923076923077
result: 0.6623722842043451
k: 10 parts: 10
10.0 % : 0.6870229007633588
20.0 % : 0.7633587786259542
30.0 % : 0.6717557251908397
40.0 % : 0.6946564885496184
50.0 % : 0.6946564885496184
60.0 % : 0.7022900763358778
70.0 % : 0.7175572519083969
80.0 %