Support Vector Machines (SVMs) are a machine learning method for supervised learning.

In [1]:
import pandas as pd

dataset = pd.read_csv("lung_cancer_dataset.csv")

# Convert text to categorical data
dataset['gender'] = dataset['gender'].astype('category')
dataset['radon_exposure'] = dataset['radon_exposure'].astype('category')
dataset['alcohol_consumption'] = dataset['alcohol_consumption'].fillna('None').astype('category')

dataset['asbestos_exposure'] = dataset['asbestos_exposure'].map({'Yes': True, 'No': False})
dataset['secondhand_smoke_exposure'] = dataset['secondhand_smoke_exposure'].map({'Yes': True, 'No': False})
dataset['copd_diagnosis'] = dataset['copd_diagnosis'].map({'Yes': True, 'No': False})
dataset['family_history'] = dataset['family_history'].map({'Yes': True, 'No': False})
dataset['lung_cancer'] = dataset['lung_cancer'].map({'Yes': True, 'No': False})

# no duplicate rows
dataset.duplicated().sum()

# show data
dataset.head()

Unnamed: 0,patient_id,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history,lung_cancer
0,100000,69,Male,66.025244,High,False,False,True,Moderate,False,False
1,100001,32,Female,12.7808,High,False,True,True,Moderate,True,True
2,100002,89,Female,0.408278,Medium,True,True,True,,False,True
3,100003,78,Female,44.065232,Low,False,True,False,Moderate,False,True
4,100004,38,Female,44.43244,Medium,True,False,True,,True,True


In [2]:
from sklearn.model_selection import train_test_split

# get the data and target from the data frame 
data = dataset.loc[:, 'age':'family_history']
target = dataset['lung_cancer']

train_data, test_data, train_label, test_label = train_test_split(data, target, test_size=0.3, random_state=0)

In [8]:
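# Convert categorical variables to numeric values (one-hot encoding)
dataset_encoded = pd.get_dummies(dataset, drop_first=True)
# Note: get_dummies appends the dummy columns after 'lung_cancer', so this
# label slice does not include them in the features
data = dataset_encoded.loc[:, 'age':'family_history']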
target = dataset_encoded['lung_cancer']
train_data, test_data, train_label, test_label = train_test_split(data, target, test_size=0.3, random_state=0)
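
One detail of `pd.get_dummies` is worth checking at this point: the generated dummy columns are appended after the remaining columns, so a label slice such as `'age':'family_history'` may leave them out. The following is a minimal sketch on a tiny hand-made frame (the values and the names `toy` and `feature_cols` are illustrative assumptions, not part of the dataset) that selects the features by dropping the identifier and target columns instead.

```python
import pandas as pd

# Tiny hypothetical frame that mimics the column layout of the dataset
toy = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "age": [69, 32, 89, 78],
    "gender": pd.Categorical(["Male", "Female", "Female", "Male"]),
    "family_history": [False, True, False, False],
    "lung_cancer": [False, True, True, True],
})

encoded = pd.get_dummies(toy, drop_first=True)

# The dummy column is appended at the end, after 'lung_cancer'
print(list(encoded.columns))

# A label slice stops at 'family_history' and misses 'gender_Male'
sliced = encoded.loc[:, "age":"family_history"]

# Dropping identifier and target keeps every remaining feature
feature_cols = encoded.columns.drop(["patient_id", "lung_cancer"])
features = encoded[feature_cols]
print(list(features.columns))
```

Applied to the real data, the same idea would presumably also drop the `Unnamed: 0` column, since it only repeats the row index.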

In general:
- k is the number of folds in a cross-validation (how many times the data is split into training and test sets)
    - -> affects the robustness of the performance estimate, not the model itself
- C controls the regularization strength of the SVM model (the strength is inversely proportional to C)
    - -> small C value (e.g. 0.0001) = strong regularization // the model tolerates more errors during training
    - -> large C value (e.g. 1) = weaker regularization // the model fits the training data more closely
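
The effect of C described above can be made visible on a small example. This is a minimal sketch on synthetic data from `make_classification` (an assumption, it is not the lung cancer dataset); it fits a linear SVM with the two C values from the list and compares the norm of the learned weight vector, which tends to shrink under stronger regularization.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in data (assumption: not the lung cancer dataset)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

weight_norms = {}
for C in (0.0001, 1):
    clf = svm.SVC(kernel="linear", C=C).fit(X, y)
    # A smaller weight norm corresponds to a wider margin
    weight_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||w||={weight_norms[C]:.4f}, "
          f"support vectors={clf.n_support_.sum()}, "
          f"train accuracy={clf.score(X, y):.3f}")
```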

Evaluation using a single train/test split

In [9]:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=0.001).fit(train_data, train_label)
clf.score(test_data, test_label)

0.6879333333333333

In [10]:
clf = svm.SVC(kernel='linear', C=0.01).fit(train_data, train_label)
clf.score(test_data, test_label)

0.6879333333333333

In [11]:
clf = svm.SVC(kernel='linear', C=0.1).fit(train_data, train_label)
clf.score(test_data, test_label)

0.6879333333333333

In [12]:
clf = svm.SVC(kernel='linear', C=1).fit(train_data, train_label)
clf.score(test_data, test_label)

0.6879333333333333
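
A score that does not change across four values of C can be a hint that the classifier predicts the same class for every sample. One way to check this is to compare it against a trivial baseline. The sketch below uses `DummyClassifier` on hypothetical imbalanced labels (the data and the 69% class share are assumptions chosen for illustration); on the real data, `train_data` and `train_label` would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 69% positive, features carry no signal
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([True] * 690 + [False] * 310)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
majority_share = y_test.mean()
print(f"baseline accuracy: {baseline_acc:.3f}, "
      f"share of majority class in test set: {majority_share:.3f}")
```

If the SVM accuracy equals the baseline accuracy, the model has probably not learned anything beyond the class distribution.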

Evaluation using k-fold cross-validation

In [13]:
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1, random_state=42)

k = 5
scores = cross_val_score(clf, data, target, cv=k)
scores

array([0.6872, 0.6873, 0.6873, 0.6873, 0.6873])

In [14]:
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.69 accuracy with a standard deviation of 0.00
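
SVMs are sensitive to the scale of the input features, so it can be worth standardizing them inside the cross-validation loop. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the real features; the pipeline and the shuffled `StratifiedKFold` are suggestions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the lung cancer dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Scaling is part of the pipeline, so it is re-fitted on each training fold
model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=cv)
print("%0.2f accuracy with a standard deviation of %0.2f"
      % (cv_scores.mean(), cv_scores.std()))
```

Putting the scaler into the pipeline avoids leaking statistics from the test fold into training.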


In [15]:
scores = cross_val_score(clf, data, target, cv=5, scoring='f1_macro')
scores

Evaluation using cross-validation (multiple metrics)

In [16]:
from sklearn.model_selection import cross_validate

scoring = ['precision_macro', 'recall_macro']
clf = svm.SVC(kernel='linear', C=1, random_state=0)
scores = cross_validate(clf, data, target, scoring=scoring)
scores

  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


{'fit_time': array([490.29230762, 482.51228833, 542.29608941, 503.99202061,
        488.35938072]),
 'score_time': array([2.12367105, 2.08393192, 2.06741571, 2.12638593, 2.05960989]),
 'test_precision_macro': array([0.3436 , 0.34365, 0.34365, 0.34365, 0.34365]),
 'test_recall_macro': array([0.5, 0.5, 0.5, 0.5, 0.5])}
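
These values are consistent with a classifier that predicts one class for every sample: macro recall is then (1 + 0) / 2 = 0.5 and macro precision is (0.6872 + 0) / 2 = 0.3436. Whether that is what happened here cannot be confirmed without the predictions themselves. The sketch below reproduces the arithmetic on hypothetical labels (the sample count and the choice of 1 as the majority class are assumptions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6872 of 10000 samples belong to the majority class (1)
y_true = np.array([1] * 6872 + [0] * 3128)
# A degenerate classifier that predicts the majority class for every sample
y_pred = np.ones_like(y_true)

# zero_division=0 scores the never-predicted class as 0 without a warning
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro")

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print(f"precision_macro={precision:.4f}, recall_macro={recall:.1f}")
```

On the real data, `cross_val_predict` together with `confusion_matrix` would show directly whether one class is never predicted.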