# Classification: Classical Methods
## Support Vector Machine (linear kernel)
In this notebook we attempt the classification into "Failure" / "No Failure" using the classical machine learning method SVM.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn import metrics
import Preprocessing as pp
from sklearn.metrics import classification_report
pd.options.mode.chained_assignment = None

path_data = '/Users/marvinwoller/Desktop/SmartDataAnalytics/Blatt2/data/'

rootdir_train = path_data + 'train/'
rootdir_test = path_data + 'test/'

train_labels_path = path_data + 'train_label.csv'
test_labels_path = path_data + 'test_label.csv'

feature_path = path_data + 'features/'
feature_path_test = path_data + 'features_test/'

resampled_path = path_data + 'resampled/'
resampled_path_test = path_data + 'resampled_test/'

train_labels = pd.read_csv(train_labels_path, index_col=0) #Don't use index numbers per row but CSV file name as index

In [12]:
def svm_classification(X_train,y_train,X_test,y_test,name):
    # Split train data to get a second test set without concept drift
    X_train, X_test_trainset, y_train, y_test_trainset = train_test_split(X_train, y_train, test_size=0.2, random_state=123)
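    # Create the classifier (bagging ensemble of 25 linear SVMs).
    # The base estimator is passed positionally because the keyword was renamed
    # from `base_estimator` to `estimator` in scikit-learn 1.2.
    clf = BaggingClassifier(SVC(kernel='linear'), n_estimators=25, random_state=0, n_jobs=-1)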
    # Fit the classifier
    clf.fit(X_train, y_train)
    # Perform prediction on 20% train set data (no drift)
    y_pred_trainset = clf.predict(X_test_trainset)
    # Perform prediction on test data (with drift)
    y_pred = clf.predict(X_test)
    print("##################### " + name + " #####################")
    print("---------------- TRAIN ----------------")
    print("TRAIN Accuracy (" + name + "):",metrics.accuracy_score(y_test_trainset, y_pred_trainset))
    print(classification_report(y_test_trainset, y_pred_trainset))
    print("---------------- TEST ----------------")
    print("TEST Accuracy (" + name + "):",metrics.accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))

## Input data: Features

In [13]:
# Use extracted Features for classification
features = ['mean', 'median', 'min', 'max', 'std', 'var']
features2 = ['std', 'var']

In [14]:
# Preprocessing: Remove strong drift + scaling
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000)
    y_test, X_test = pp.preprocess_test(df_test)
    svm_classification(X_train,y_train,X_test,y_test,feature)

##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.6805
              precision    recall  f1-score   support

         0.0       0.66      0.77      0.71      1016
         1.0       0.71      0.59      0.65       984

    accuracy                           0.68      2000
   macro avg       0.69      0.68      0.68      2000
weighted avg       0.68      0.68      0.68      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.5152223260782481
              precision    recall  f1-score   support

         0.0       0.51      0.91      0.66      7582
         1.0       0.55      0.11      0.18      7396

    accuracy                           0.52     14978
   macro avg       0.53      0.51      0.42     14978
weighted avg       0.53      0.52      0.42     14978

##################### median #####################
---------------- TRAIN ----------------
TRAIN Accuracy (median): 0.684
              precision  



##################### std #####################
---------------- TRAIN ----------------
TRAIN Accuracy (std): 0.5645
              precision    recall  f1-score   support

           0       0.57      0.66      0.61      1028
           1       0.56      0.47      0.51       972

    accuracy                           0.56      2000
   macro avg       0.56      0.56      0.56      2000
weighted avg       0.56      0.56      0.56      2000

---------------- TEST ----------------
TEST Accuracy (std): 0.519962611830685
              precision    recall  f1-score   support

         0.0       0.52      0.71      0.60      7582
         1.0       0.52      0.33      0.40      7396

    accuracy                           0.52     14978
   macro avg       0.52      0.52      0.50     14978
weighted avg       0.52      0.52      0.50     14978





##################### var #####################
---------------- TRAIN ----------------
TRAIN Accuracy (var): 0.53
              precision    recall  f1-score   support

           0       0.52      0.88      0.66      1025
           1       0.56      0.17      0.26       975

    accuracy                           0.53      2000
   macro avg       0.54      0.52      0.46      2000
weighted avg       0.54      0.53      0.46      2000

---------------- TEST ----------------
TEST Accuracy (var): 0.5210308452396849
              precision    recall  f1-score   support

         0.0       0.52      0.91      0.66      7582
         1.0       0.57      0.12      0.20      7396

    accuracy                           0.52     14978
   macro avg       0.54      0.52      0.43     14978
weighted avg       0.54      0.52      0.43     14978



In [15]:
# Try with different preprocessing ("good_sensors" + remove strong drift + scaling)
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000, get_good=True)
    y_test, X_test = pp.preprocess_test(df_test, get_good=True)
    svm_classification(X_train,y_train,X_test,y_test,feature)

##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.647
              precision    recall  f1-score   support

         0.0       0.62      0.83      0.71      1031
         1.0       0.71      0.45      0.55       969

    accuracy                           0.65      2000
   macro avg       0.67      0.64      0.63      2000
weighted avg       0.66      0.65      0.63      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.5116170383228735
              precision    recall  f1-score   support

         0.0       0.51      1.00      0.67      7582
         1.0       0.99      0.01      0.02      7396

    accuracy                           0.51     14978
   macro avg       0.75      0.51      0.35     14978
weighted avg       0.75      0.51      0.35     14978





##################### median #####################
---------------- TRAIN ----------------
TRAIN Accuracy (median): 0.6435
              precision    recall  f1-score   support

         0.0       0.61      0.83      0.71      1030
         1.0       0.71      0.44      0.55       970

    accuracy                           0.64      2000
   macro avg       0.66      0.64      0.63      2000
weighted avg       0.66      0.64      0.63      2000

---------------- TEST ----------------
TEST Accuracy (median): 0.5117505674989985
              precision    recall  f1-score   support

         0.0       0.51      1.00      0.67      7582
         1.0       0.92      0.01      0.02      7396

    accuracy                           0.51     14978
   macro avg       0.71      0.51      0.35     14978
weighted avg       0.71      0.51      0.35     14978

##################### min #####################
---------------- TRAIN ----------------
TRAIN Accuracy (min): 0.6255
              precision 

In [16]:
# Try with different preprocessing (remove correlation + remove strong drift)
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000,rem_corr=True)
    y_test, X_test = pp.preprocess_test(df_test, rem_corr=True)
    svm_classification(X_train,y_train,X_test,y_test,feature)

KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.6575
              precision    recall  f1-score   support

         0.0       0.64      0.72      0.68      1000
         1.0       0.68      0.59      0.63      1000

    accuracy                           0.66      2000
   macro avg       0.66      0.66      0.66      2000
weighted avg       0.66      0.66      0.66      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.5182267325410602
              precision    recall  f1-score   support

         0.0       0.52      0.55      0.54      7582
         1.0       0.51      0.48      0.50      7396

    accuracy                           0.52     14978
   macro avg       0.52      0.52      0.52   



##################### max #####################
---------------- TRAIN ----------------
TRAIN Accuracy (max): 0.669
              precision    recall  f1-score   support

           0       0.65      0.75      0.70      1023
           1       0.69      0.58      0.63       977

    accuracy                           0.67      2000
   macro avg       0.67      0.67      0.67      2000
weighted avg       0.67      0.67      0.67      2000

---------------- TEST ----------------
TEST Accuracy (max): 0.5055414608091868
              precision    recall  f1-score   support

         0.0       0.51      0.90      0.65      7582
         1.0       0.50      0.10      0.17      7396

    accuracy                           0.51     14978
   macro avg       0.50      0.50      0.41     14978
weighted avg       0.50      0.51      0.41     14978

KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyE

As expected, the performance of an SVM with a linear kernel is not particularly good. For the accuracy metric we obtain a maximum of 0.68 on held-out data from the train set and 0.52 on data from the test set. This shows that the SVM does not cope well with the drift in the test data.
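The gap between the two accuracy values can be reproduced on synthetic data. The following is a minimal sketch, not the notebook's actual data or preprocessing: the Gaussian features, the class offset of 1.5 and the drift offset of 3.0 are all assumptions chosen for illustration. It trains the same kind of bagged linear SVM, then evaluates it once on a held-out split without drift and once on a set whose features are shifted.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

# Synthetic training data (assumption): class 1 is shifted by +1.5 in every feature
y = rng.integers(0, 2, size=n_samples)
X = rng.normal(size=(n_samples, n_features)) + 1.5 * y[:, None]

# Hold out 20% of the training data as a drift-free evaluation set
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Synthetic "test" data (assumption): same class structure, but every feature
# has drifted by -3.0 relative to the training distribution
y_drift = rng.integers(0, 2, size=n_samples)
X_drift = rng.normal(size=(n_samples, n_features)) + 1.5 * y_drift[:, None] - 3.0

# Bagged linear SVM; the base estimator is passed positionally
clf = BaggingClassifier(SVC(kernel="linear"), n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

acc_holdout = accuracy_score(y_holdout, clf.predict(X_holdout))
acc_drift = accuracy_score(y_drift, clf.predict(X_drift))
print("hold-out accuracy (no drift):", acc_holdout)
print("accuracy on drifted data:", acc_drift)
```

In this toy setting the decision boundary learned on the training distribution places almost all drifted samples on one side, so the classifier predicts mostly class 0. That resembles the pattern in the reports above, where recall for class 1 collapses on the test set.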

## Input data: Shortened time series
We now look at how the SVM behaves on the resampled time series data.
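File names such as `resampled_12H_mean` suggest that the raw series were aggregated into fixed time bins. How the resampled files were actually produced is not shown in this notebook, so the following is only a sketch of one way to do it with pandas. The hourly series and its `DatetimeIndex` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sensor series covering 48 hours (assumption, not the real data)
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
series = pd.Series(np.arange(48, dtype=float), index=index)

# Aggregate into 12-hour bins, once with the mean and once with the median
bin_width = pd.Timedelta(hours=12)
resampled_mean = series.resample(bin_width).mean()
resampled_median = series.resample(bin_width).median()

print(resampled_mean)
print(resampled_median)
```

Each 12-hour bin replaces twelve hourly values with a single number, which shortens the series by a factor of twelve. The 6H and 3H variants would follow the same pattern with a smaller bin width.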

In [17]:
# Use preprocessed time series for classification
features2 = ['resampled_12H_mean', 'resampled_12H_median', 'resampled_6H_mean', 'resampled_6H_median', 'resampled_3H_mean', 'resampled_3H_median']
features = ['resampled_12H_mean']
for feature in features:
    df = pd.read_csv(resampled_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(resampled_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000)
    y_test, X_test = pp.preprocess_test(df_test)
    svm_classification(X_train,y_train,X_test,y_test,feature)

##################### resampled_12H_mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (resampled_12H_mean): 0.6625
              precision    recall  f1-score   support

           0       0.63      0.75      0.69       976
           1       0.71      0.58      0.64      1024

    accuracy                           0.66      2000
   macro avg       0.67      0.66      0.66      2000
weighted avg       0.67      0.66      0.66      2000

---------------- TEST ----------------
TEST Accuracy (resampled_12H_mean): 0.5089040549624583
              precision    recall  f1-score   support

         0.0       0.51      0.94      0.66     44420
         1.0       0.52      0.07      0.12     43349

    accuracy                           0.51     87769
   macro avg       0.52      0.50      0.39     87769
weighted avg       0.52      0.51      0.39     87769



Using the time series data does not improve the result.

## Summary
On data without drift, the SVM reaches an acceptable accuracy of 0.684 for the feature median. From here on we use this value as the baseline for analyses on the train set.

On the test data the SVM performed very poorly. We therefore conclude that this method is not suitable for data affected by data drift or concept drift.
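To put the test accuracy into perspective, it helps to compare it with a classifier that always predicts the larger class. The short calculation below uses the class supports from the TEST classification reports on the feature files (7582 and 7396) and the highest TEST accuracy visible in the output above. The variable names are introduced here for illustration only.

```python
# Class supports copied from the TEST classification reports (feature files)
support_class_0 = 7582
support_class_1 = 7396

# Accuracy of a trivial classifier that always predicts the majority class
majority_baseline = support_class_0 / (support_class_0 + support_class_1)

# Highest TEST accuracy visible in the printed output (feature "var", first run)
best_test_accuracy = 0.5210308452396849

gain_over_baseline = best_test_accuracy - majority_baseline
print(f"majority-class baseline: {majority_baseline:.4f}")
print(f"gain over baseline:      {gain_over_baseline:.4f}")
```

The majority-class baseline is about 0.506, so even the best visible test result improves on a constant prediction by only about 1.5 percentage points.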
