# Zadanie domowe 4

Zadanie dotyczy algorytmu Support Vector Machine, który był omówiony na zajęciach. Zainteresowanych zachęcamy do zapoznania się z http://pyml.sourceforge.net/doc/howto.pdf.

Wykorzystaj dwa zbiory danych:

  - apartments z R-owego pakietu DALEX,
  - dowolny, wybrany przez siebie zbiór danych (najlepiej z co najmniej 8 zmiennymi numerycznymi).

1. Dopasuj SVM do obu zbiorów danych.
2. Sprawdź, czy zalinkowany artykuł słusznie zwraca uwagę na skalowanie danych (pamiętaj, że większość implementacji domyślnie skaluje).
3. Spróbuj zoptymalizować metodą random search najważniejsze hiperparametry tj. :
* cost,
* gamma,
* degree, 
najprościej optymalizować hiperparametry w SVM z jądrem gaussowskim, ale można też poszukać najlepszego jądra.

Pakiet Dalex jest również dostepny w Pythonie, więc skorzystam z niego.

In [None]:
pip install dalex

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR, SVC
from sklearn.datasets import load_diabetes
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error as mse
from dalex.datasets import load_apartments, load_apartments_test

In [3]:
data_apartments = load_apartments()
data_apartments_test = load_apartments_test()
apartments_X_train = data_apartments.drop(["m2_price"], axis=1)
apartments_y_train = data_apartments["m2_price"]
apartments_X_test = data_apartments_test.drop(["m2_price"], axis=1)
apartments_y_test = data_apartments_test["m2_price"]
enc = OneHotEncoder(sparse=False)
apartments_X_train['district'] = enc.fit_transform(apartments_X_train)
apartments_X_test['district'] = enc.transform(apartments_X_test)

In [4]:
data_diabetes = load_diabetes()
diabetes_X = data_diabetes['data']
diabetes_y = data_diabetes['target']
diabetes_feature_names = data_diabetes['feature_names']
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(diabetes_X, diabetes_y, test_size=0.33)
diabetes_feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [5]:
scaler1 = StandardScaler()
diabetes_X_train_scaled = scaler1.fit_transform(diabetes_X_train)
diabetes_X_test_scaled = scaler1.fit_transform(diabetes_X_test)

scaler2 = StandardScaler()
apartments_X_train_scaled = scaler2.fit_transform(apartments_X_train)
apartments_X_test_scaled = scaler2.fit_transform(apartments_X_test)

In [6]:
svm_diabetes = SVR()
svm_diabetes.fit(diabetes_X_train, diabetes_y_train)
y_hat = svm_diabetes.predict(diabetes_X_test)

svm_diabetes3 = SVR()
svm_diabetes3.fit(diabetes_X_train_scaled, diabetes_y_train)
y_hat_scaled = svm_diabetes3.predict(diabetes_X_test_scaled)

print(f"RMSE: {mse(y_hat, diabetes_y_test, squared=False)}")
print(f"RMSE scaled: {mse(y_hat_scaled, diabetes_y_test, squared=False)}")
print(f"Test_std: {np.std(diabetes_y_test)}")

RMSE: 71.93419377353958
RMSE scaled: 71.81253404625602
Test_std: 78.15581964810877


In [7]:
svm_apartments = SVR()
svm_apartments.fit(apartments_X_train, apartments_y_train)
y_hat = svm_apartments.predict(apartments_X_test)

svm_apartments3 = SVR()
svm_apartments3.fit(apartments_X_train_scaled, apartments_y_train)
y_hat_scaled = svm_apartments3.predict(apartments_X_test_scaled)

print(f"RMSE: {mse(y_hat, apartments_y_test, squared=False)}")
print(f"RMSE scaled: {mse(y_hat_scaled, apartments_y_test, squared=False)}")
print(f"Test_std: {np.std(apartments_y_test)}")

RMSE: 908.9981030762165
RMSE scaled: 880.3441230797098
Test_std: 900.4468304993062


W zbiorze dotyczącym cukrzyc dane zostały wcześniej już przeskalowane (wdg. dokumentacji), nie widać więc poprawy wyniku po kolejnej operacji skalowania. Natomiast dane apartments nie były skalowane i widoczne jest polepszenie wyniku gdy korzystamy z przeskalowanych danych. Zgadza się z uwagami wspominanego artykułu.

In [8]:
C_vals = [0.1*i for i in range(1,21)] + [1*i for i in range(1,21)] + [10*i for i in range(1,21)]
#Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. 
#The penalty is a squared l2 penalty. default=1.0

#degrees = [i for i in range(1,15)]
degrees = [i for i in range(1,5)]
#Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels. degree int, default=3

gammas = ['scale', 'auto'] + [0.1*i for i in range(1,51)]
#Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. gamma{‘scale’, ‘auto’} or float, default=’scale’

#kernel{‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’

In [9]:
svm_diabetes_optim = SVR(kernel='poly')
params = dict(C=C_vals, degree=degrees, gamma=gammas)
randomsrc = RandomizedSearchCV(svm_diabetes_optim, params, n_iter=100, n_jobs=-1)
searched = randomsrc.fit(diabetes_X, diabetes_y)
print(searched.best_params_)
print(searched.best_score_)

{'gamma': 3.7, 'degree': 1, 'C': 100}
0.47714795404624566


In [10]:
svm_diabetes_optim = SVR(kernel='rbf')
params = dict(C=C_vals, gamma=gammas)
randomsrc = RandomizedSearchCV(svm_diabetes_optim, params, n_iter=100, n_jobs=-1)
searched = randomsrc.fit(diabetes_X, diabetes_y)
print(searched.best_params_)
print(searched.best_score_)

{'gamma': 4.2, 'C': 80}
0.4920086988491194


In [11]:
svm_diabetes_optim = SVR(kernel='sigmoid')
params = dict(C=C_vals, gamma=gammas)
randomsrc = RandomizedSearchCV(svm_diabetes_optim, params, n_iter=100, n_jobs=-1)
searched = randomsrc.fit(diabetes_X, diabetes_y)
print(searched.best_params_)
print(searched.best_score_)

{'gamma': 4.2, 'C': 100}
0.4773035712343643


In [12]:
svm_apartments_optim = SVR()
params = dict(C=C_vals, gamma=gammas)
randomsrc = RandomizedSearchCV(svm_apartments_optim, params, n_iter=100, n_jobs=-1)
searched2 = randomsrc.fit(apartments_X_train, apartments_y_train)
print(searched2.best_score_)
print(searched2.best_params_)

0.023858307448529636
{'gamma': 'auto', 'C': 170}


Inne kernele dla apartments nigdy się nie kończą, przeprowadzam więc search na domyślnym czyli rbf. Dodatkowo w tym wypadku niestety parametr "degree" nie ma znaczenia, ponieważ jest używany tylko w kernel='poly', więc nie ma sensu go optymalizować.