# Classification: Classical Methods
## Logistic Regression
In this notebook we attempt to classify samples as "Failure" / "No Failure" using logistic regression.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import Preprocessing as pp
from sklearn.metrics import classification_report
pd.options.mode.chained_assignment = None

path_data = '/Users/marvinwoller/Desktop/SmartDataAnalytics/Blatt2/data/'

rootdir_train = path_data + 'train/'
rootdir_test = path_data + 'test/'

train_labels_path = path_data + 'train_label.csv'
test_labels_path = path_data + 'test_label.csv'

feature_path = path_data + 'features/'
feature_path_test = path_data + 'features_test/'

resampled_path = path_data + 'resampled/'
resampled_path_test = path_data + 'resampled_test/'

train_labels = pd.read_csv(train_labels_path, index_col=0)  # Use the CSV file name (first column) as the index instead of row numbers

In [2]:
# Classification: Logistic model

In [3]:
def logreg_classification(X_train,y_train,X_test,y_test,name):
    # Split train data to get a second test set without concept drift
    X_train, X_test_trainset, y_train, y_test_trainset = train_test_split(X_train, y_train, test_size=0.2, random_state=123)
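    # Create the classifier: a bagged ensemble of 25 logistic regression models.
    # The estimator is passed positionally because its keyword was renamed from
    # `base_estimator` to `estimator` in scikit-learn 1.2.
    clf = BaggingClassifier(LogisticRegression(), n_estimators=25, random_state=0, n_jobs=-1)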
    # Fit the classifier
    clf.fit(X_train, y_train)
    # Perform prediction on 20% train set data (no drift)
    y_pred_trainset = clf.predict(X_test_trainset)
    # Perform prediction on test data (with drift)
    y_pred = clf.predict(X_test)
    print("##################### " + name + " #####################")
    print("---------------- TRAIN ----------------")
    print("TRAIN Accuracy (" + name + "):",metrics.accuracy_score(y_test_trainset, y_pred_trainset))
    print(classification_report(y_test_trainset, y_pred_trainset))
    print("---------------- TEST ----------------")
    print("TEST Accuracy (" + name + "):",metrics.accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
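
The gap between the two evaluations in `logreg_classification` (held-out training data without drift versus test data with drift) can be reproduced on synthetic data. The following is a minimal sketch, not the notebook's data or preprocessing: `make_data`, the two Gaussian classes, and the shift of -4 are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_data(n, shift):
    """Hypothetical generator: two Gaussian classes in 2-D.

    `shift` is added to every feature to mimic concept drift.
    """
    y = rng.integers(0, 2, size=n)
    # Class 0 is centred at (0, 0), class 1 at (2, 2), before the shift.
    X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2)) + shift
    return X, y


X_train, y_train = make_data(2000, shift=0.0)   # training distribution
X_drift, y_drift = make_data(2000, shift=-4.0)  # drifted "test" distribution

# Same idea as in logreg_classification: keep 20% of the training data
# as a second test set that shares the training distribution.
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_train, y_train, test_size=0.2, random_state=123
)

clf = BaggingClassifier(LogisticRegression(), n_estimators=25, random_state=0)
clf.fit(X_fit, y_fit)

acc_holdout = accuracy_score(y_holdout, clf.predict(X_holdout))
acc_drift = accuracy_score(y_drift, clf.predict(X_drift))

print("held-out accuracy (no drift):", acc_holdout)
print("accuracy on drifted data:    ", acc_drift)
```

Under these assumptions the held-out split scores well, while accuracy on the shifted set drops toward chance because nearly every shifted sample falls on one side of the learned decision boundary.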

## Input data: Features

In [4]:
# Use extracted Features for classification
features = ['mean', 'median', 'min', 'max', 'std', 'var']
features2 = ['std', 'var']

In [5]:
# Preprocessing: Remove strong drift + scaling
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000)
    y_test, X_test = pp.preprocess_test(df_test)
    logreg_classification(X_train,y_train,X_test,y_test,feature)

##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.6745
              precision    recall  f1-score   support

         0.0       0.67      0.73      0.70      1048
         1.0       0.67      0.61      0.64       952

    accuracy                           0.67      2000
   macro avg       0.67      0.67      0.67      2000
weighted avg       0.67      0.67      0.67      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.5102149819735612
              precision    recall  f1-score   support

         0.0       0.51      1.00      0.67      7582
         1.0       0.84      0.01      0.02      7396

    accuracy                           0.51     14978
   macro avg       0.67      0.50      0.35     14978
weighted avg       0.67      0.51      0.35     14978

##################### median #####################
---------------- TRAIN ----------------
TRAIN Accuracy (median): 0.6815
              precision 

In [6]:
# Try with different preprocessing ("good_sensors" + remove strong drift + scaling)
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000, get_good=True)
    y_test, X_test = pp.preprocess_test(df_test, get_good=True)
    logreg_classification(X_train,y_train,X_test,y_test,feature)

##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.636
              precision    recall  f1-score   support

         0.0       0.64      0.70      0.67      1048
         1.0       0.63      0.57      0.60       952

    accuracy                           0.64      2000
   macro avg       0.64      0.63      0.63      2000
weighted avg       0.64      0.64      0.63      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.512217919615436
              precision    recall  f1-score   support

         0.0       0.51      1.00      0.67      7582
         1.0       1.00      0.01      0.02      7396

    accuracy                           0.51     14978
   macro avg       0.75      0.51      0.35     14978
weighted avg       0.75      0.51      0.35     14978

##################### median #####################
---------------- TRAIN ----------------
TRAIN Accuracy (median): 0.634
              precision    

In [7]:
# Try with different preprocessing (remove correlation + remove strong drift)
for feature in features:
    df = pd.read_csv(feature_path + feature + '.csv', index_col=0)
    df_test = pd.read_csv(feature_path_test + feature + '.csv', index_col=0)
    y_train, X_train = pp.preprocess(df, random_n=10000,rem_corr=True)
    y_test, X_test = pp.preprocess_test(df_test, rem_corr=True)
    logreg_classification(X_train,y_train,X_test,y_test,feature)

KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
KeyError skipping...
##################### mean #####################
---------------- TRAIN ----------------
TRAIN Accuracy (mean): 0.666
              precision    recall  f1-score   support

         0.0       0.68      0.70      0.69      1048
         1.0       0.65      0.63      0.64       952

    accuracy                           0.67      2000
   macro avg       0.67      0.66      0.66      2000
weighted avg       0.67      0.67      0.67      2000

---------------- TEST ----------------
TEST Accuracy (mean): 0.5150887969021232
              precision    recall  f1-score   support

         0.0       0.55      0.23      0.32      7582
         1.0       0.51      0.81      0.62      7396

    accuracy                           0.52     14978
   macro avg       0.53      0.52      0.47    

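## Result
Logistic regression does not achieve better scores than the SVM. For the `mean` feature, accuracy on the held-out 20% of the training data is about 0.64 to 0.67, but it falls to about 0.51 on the test set for all three preprocessing variants shown above.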
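
The test accuracies in the reports above can be put in context with a majority-class baseline, that is, the accuracy of always predicting the more frequent class. The sketch below uses only the support counts and the `mean` test accuracies printed in this notebook; the dictionary keys naming the preprocessing variants are labels introduced here for illustration.

```python
# Support counts copied from the TEST classification reports above.
support = {0.0: 7582, 1.0: 7396}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

# TEST accuracies for the `mean` feature, copied from the outputs above.
reported = {
    "drift removal + scaling": 0.5102149819735612,
    "good sensors": 0.512217919615436,
    "correlation removal": 0.5150887969021232,
}

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.4f}")
for variant, acc in reported.items():
    # Gain over always predicting the majority class.
    print(f"{variant}: {acc:.4f} (gain over baseline: {acc - baseline_accuracy:+.4f})")
```

The baseline comes out at roughly 0.506, so each reported test accuracy exceeds it by less than one percentage point.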