# PVA-03
In this exercise, several classifiers are trained on the data set from the previous exercise and compared.
## Merging the data set
First, the data sets from the previous exercise are loaded and merged into a single `csv` file. Some lines end with a comma, which has to be removed. After manual inspection, some records had to be removed because they contained data that could not be interpreted (empty images, or images with too little information to be recognizable even by a human).


In [131]:
import os
import csv

directory = 'Datensatz/'
csv_data = 'combined_data.csv'
# list the input files; skip subdirectories such as notebook checkpoints
csv_files = [f for f in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, f))]

combined_data = []

# read every row of every input file
for file in csv_files:
    filepath = os.path.join(directory, file)
    with open(filepath, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            combined_data.append(row)

# write all rows into one combined file
with open(csv_data, 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(combined_data)

# remove trailing commas in the lines ending with a comma
with open(csv_data, 'r') as infile, open('combined_data_clean.csv', 'w') as outfile:
    data = infile.read()
    data = data.replace(',\n', '\n')
    outfile.write(data)
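
The merge and the trailing-comma cleanup can also be done in a single pass, because `csv.reader` parses a line that ends with a comma into a row whose last field is an empty string. The following is only a sketch under that assumption: `merge_rows` is a hypothetical helper, and the two in-memory streams with made-up file names stand in for the real files.

```python
import csv
import io


def merge_rows(sources):
    """Merge CSV rows from several text streams and drop the empty
    trailing field that a line ending in a comma produces."""
    merged = []
    for stream in sources:
        for row in csv.reader(stream):
            if not row:
                continue  # skip blank lines
            if row[-1] == '':
                row = row[:-1]  # "a,0,1," parses as ['a', '0', '1', '']
            merged.append(row)
    return merged


# Two in-memory stand-ins for input files (made-up content)
file_a = io.StringIO("#-1-alice.png,0,1,0,\n#-2-alice.png,1,1,0\n")
file_b = io.StringIO("x-1-bob.png,0,0,1,\n")

rows = merge_rows([file_a, file_b])
for row in rows:
    print(row)
```

With real files, each stream would be an open file handle, and the merged rows could be written out once with `csv.writer`, which avoids the intermediate `combined_data.csv`.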
    

    

Next, the CSV file is loaded into a pandas DataFrame, which makes working with the data easier:

In [132]:
import pandas as pd

clean_data = 'combined_data_clean.csv'

dataframe = pd.read_csv(clean_data, header=None, delimiter=',')
print(dataframe.head())

                            0    1    2    3    4    5    6    7    8    9    \
0  #-41-cornelia.isenschmid.png    0    0    0    0    0    0    0    0    0   
1  #-42-cornelia.isenschmid.png    0    0    0    0    0    0    0    0    0   
2  #-43-cornelia.isenschmid.png    0    0    0    0    0    0    0    0    0   
3  #-44-cornelia.isenschmid.png    0    0    0    0    0    0    0    0    0   
4  #-45-cornelia.isenschmid.png    0    0    0    0    0    0    0    1    0   

   ...  91   92   93   94   95   96   97   98   99   100  
0  ...    0    0    0    0    0    0    0    0    0    0  
1  ...    0    1    0    1    0    0    0    0    0    0  
2  ...    0    1    0    1    0    0    0    0    0    0  
3  ...    0    0    0    0    0    0    0    0    0    0  
4  ...    0    0    0    0    0    0    0    0    0    0  

[5 rows x 101 columns]


## Final cleanup and label extraction
Now the first column of the data is cleaned up: only its first character is kept, which is the label of the sample. The labels are stored in a separate array (`data_labels`), and the first column is then removed from the DataFrame.

In [133]:
# remove rows with missing values (only 0 in a row);
# column 0 holds the file name (a string, never equal to 0),
# so only the pixel columns are checked
dataframe = dataframe[(dataframe.iloc[:, 1:] != 0).any(axis=1)].copy()

# in the first column: keep only the first character, which is the label
dataframe[0] = dataframe[0].str[0]
data_labels = dataframe[0].values
# remove the first column
dataframe = dataframe.drop(0, axis=1)
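
To make the cleaning logic explicit, here is the same idea written without pandas, as a minimal sketch. `split_label` is a hypothetical helper and the three rows are made up. The point to note is that the all-zero check looks at the pixel values only: the file name is a string and would never compare equal to 0.

```python
def split_label(row):
    """Return (label, pixels) for a row like ['#-41-name.png', '0', '1', ...]."""
    label = row[0][0]                    # first character of the file name
    pixels = [int(v) for v in row[1:]]   # remaining fields are 0/1 pixel values
    return label, pixels


# Made-up rows in the same layout as the cleaned CSV file
rows = [
    ['#-41-example.png', '0', '1', '0'],
    ['o-07-example.png', '0', '0', '0'],   # no pixel set: dropped below
    ['x-12-example.png', '1', '0', '1'],
]

labels, features = [], []
for row in rows:
    label, pixels = split_label(row)
    if any(pixels):                      # keep rows with at least one set pixel
        labels.append(label)
        features.append(pixels)

print(labels)
print(features)
```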
    



## Splitting into training and test data
The data is then split into training and test data. The training data is used to fit the classifier, and the test data is used to evaluate it. The split is random: 80% of the data is used for training and 20% for evaluation. The random seed is fixed (`random_state=99`) so that the split is reproducible.

In [134]:
# separate the data into training and test sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(dataframe, data_labels, test_size=0.2, random_state=99)
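
As a rough illustration of what a random 80/20 split does, the sketch below shuffles the sample indices with a fixed seed and cuts off the test portion. It is not scikit-learn's implementation, and the same seed will not reproduce the same partition as `train_test_split`. `simple_split` is a hypothetical helper, and the 600 placeholder samples only mirror the size of the data set (480 training plus 120 test rows).

```python
import random


def simple_split(samples, labels, test_size=0.2, seed=99):
    """Shuffle indices with a fixed seed and split them into train and test."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_tr = [samples[i] for i in train_idx]
    x_te = [samples[i] for i in test_idx]
    y_tr = [labels[i] for i in train_idx]
    y_te = [labels[i] for i in test_idx]
    return x_tr, x_te, y_tr, y_te


# Placeholder data: 600 samples with one feature each and cycling labels
samples = [[i] for i in range(600)]
labels = ['#+-ox'[i % 5] for i in range(600)]

x_tr, x_te, y_tr, y_te = simple_split(samples, labels)
print(len(x_tr), len(x_te))  # 480 120
```

Because the seed is fixed, calling `simple_split` again with the same inputs yields the same partition, which is the property `random_state` provides in the cell above.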

In [135]:
# Describe the training set
x_train.describe()



| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | ... | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 | 480.0 |
| mean | 0.008333 | 0.014583 | 0.010417 | 0.025 | 0.033333 | 0.027083 | 0.01875 | 0.029167 | 0.0125 | 0.0125 | ... | 0.010417 | 0.025 | 0.022917 | 0.05625 | 0.0375 | 0.027083 | 0.039583 | 0.016667 | 0.010417 | 0.245833 |
| std | 0.091001 | 0.120003 | 0.101635 | 0.156288 | 0.179693 | 0.162496 | 0.135782 | 0.168449 | 0.111218 | 0.111218 | ... | 0.101635 | 0.156288 | 0.149794 | 0.230644 | 0.190182 | 0.162496 | 0.195182 | 0.128153 | 0.101635 | 0.431029 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [136]:
# Describe the test set
x_test.describe()


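| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | ... | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 | 120.0 |
| mean | 0.025 | 0.041667 | 0.033333 | 0.025 | 0.033333 | 0.025 | 0.0 | 0.025 | 0.0 | 0.016667 | ... | 0.008333 | 0.016667 | 0.025 | 0.016667 | 0.058333 | 0.025 | 0.025 | 0.025 | 0.016667 | 0.266667 |
| std | 0.15678 | 0.200664 | 0.180258 | 0.15678 | 0.180258 | 0.15678 | 0.0 | 0.15678 | 0.0 | 0.128556 | ... | 0.091287 | 0.128556 | 0.15678 | 0.128556 | 0.235355 | 0.15678 | 0.15678 | 0.15678 | 0.128556 | 0.444071 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |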


## Training and comparing classifiers
The following classifiers are used:
- SVM
- Decision Tree

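### Training the SVM model
The labels are symbols (`#`, `+`, `-`, `o`, `x`). scikit-learn classifiers accept string labels directly, so they are passed to `SVC` as they are, without a numeric encoding.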

In [138]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# support vector classifier with an RBF kernel (default hyperparameters)
svm_model = SVC(kernel='rbf')
svm_model.fit(x_train, y_train)

# predict the labels of the test set
y_pred = svm_model.predict(x_test)

# per-class precision, recall and F1 score
print(classification_report(y_test, y_pred))

# confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred)
print(cm)

              precision    recall  f1-score   support

           #       0.95      0.95      0.95        20
           +       0.95      0.91      0.93        22
           -       0.87      1.00      0.93        20
           o       1.00      0.96      0.98        28
           x       1.00      0.97      0.98        30

    accuracy                           0.96       120
   macro avg       0.95      0.96      0.96       120
weighted avg       0.96      0.96      0.96       120

[[19  0  1  0  0]
 [ 1 20  1  0  0]
 [ 0  0 20  0  0]
 [ 0  1  0 27  0]
 [ 0  0  1  0 29]]
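
The numbers in the classification report follow directly from the confusion matrix, where rows are the true labels and columns the predicted labels, both in sorted label order. As a worked check, the sketch below recomputes precision, recall and accuracy from the matrix printed above, using plain Python only. The variable names are chosen for illustration.

```python
labels = ['#', '+', '-', 'o', 'x']
cm = [
    [19, 0, 1, 0, 0],
    [1, 20, 1, 0, 0],
    [0, 0, 20, 0, 0],
    [0, 1, 0, 27, 0],
    [0, 0, 1, 0, 29],
]

precision, recall, support = {}, {}, {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actual_label = sum(cm[i])                       # row sum
    precision[label] = cm[i][i] / predicted_as_label
    recall[label] = cm[i][i] / actual_label
    support[label] = actual_label

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for label in labels:
    print(f"{label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} support={support[label]}")
print(f"accuracy={accuracy:.2f}")
```

For example, 23 test samples were predicted as `-`, of which 20 are correct, which gives the precision of 0.87 in the report. Overall, 115 of the 120 test samples lie on the diagonal, which gives the accuracy of 0.96.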


### Training the decision tree model
