# Assignment 2: Support Vector Machines - SVR

Hi all, 

Please use Google to find Python code for SVM, and then use it to produce prediction results (regression and classification).

With warm regards,

Stanley

## Support Vector Regression (SVR)

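SVR is a support vector machine for regression tasks: it fits a function that predicts a continuous variable. Training seeks a function that is as flat as possible while keeping the difference between predicted and actual values within a tolerance epsilon. Only deviations larger than epsilon are penalized, which is known as the epsilon-insensitive loss.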
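For reference, the description above corresponds to the standard soft-margin SVR problem, written here in the form commonly given in textbooks and the scikit-learn documentation. The symbols $w$, $b$, $\phi$, $\xi_i$ and $\xi_i^*$ are not defined in this notebook and are introduced only for illustration; $C$ and $\varepsilon$ correspond to the `C` and `epsilon` hyperparameters of `SVR`.

$$
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*} \quad & \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \\
\text{subject to} \quad & y_i - w^\top \phi(x_i) - b \le \varepsilon + \xi_i, \\
& w^\top \phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0,\; \xi_i^* \ge 0, \quad i = 1, \dots, n.
\end{aligned}
$$

The slack variables $\xi_i$ and $\xi_i^*$ measure how far a sample lies above or below the epsilon-tube, so a sample inside the tube contributes nothing to the sum.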

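### Key characteristics:
- **Regression task**: SVR is suited to predicting continuous variables.
- **epsilon-insensitive loss**: A certain amount of error (epsilon) between predicted and actual values is tolerated; errors within this range do not contribute to the model's loss function.
- **Support vectors**: The data points that determine the shape of the regression function.
- **Kernel functions**: As with SVM classification, different kernel functions can be used to handle linear and nonlinear regression problems.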
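The epsilon-insensitive loss described in the list above can be sketched in a few lines of NumPy. The helper name `epsilon_insensitive_loss` and the sample values are hypothetical and chosen only for illustration; this is not how scikit-learn computes the loss internally.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - f(x)| - epsilon) (illustrative helper)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Errors smaller than epsilon fall inside the tube and cost nothing
    return np.maximum(0.0, residual - epsilon)

# Hypothetical targets and predictions
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)
```

The first sample is off by only 0.05, which is inside the tube, so its loss is zero; the other two are charged only for the part of the error that exceeds epsilon.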

## Dataset: California_Housing_Dataset

The dataset contains 1990 housing data for districts across California, with 20,640 samples and 8 features.

In [None]:
from sklearn.datasets import fetch_california_housing

# Note: this download gets blocked on my company network
california = fetch_california_housing(as_frame=True)

In [None]:
#description of the dataset
print(california.DESCR)

## References

[scikit-learn-mooc](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html)

[kaggle-California_Housing_Dataset](https://www.kaggle.com/code/olanrewajurasheed/california-housing-dataset)

# Code

## Importing Libraries

In [None]:
from sklearn.svm import SVR

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# Render plots inline in the notebook instead of opening a separate window.
%matplotlib inline

## Viewing the Dataset

In [None]:
california_df = pd.DataFrame(california.data,
                             columns=california.feature_names)
california_df['MedHouseValue'] = pd.Series(california.target)
california_df.head()

In [None]:
california_df.describe()

## Splitting the Data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, random_state=11, test_size=0.2)

## Scaling

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
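One caveat with scaling before cross-validation: when the scaler is fitted on the whole training set, each validation fold has already influenced the scaling statistics. A possible alternative, sketched below, is to put the scaler and the model in a scikit-learn `Pipeline` so the scaler is refitted inside every fold. The synthetic data from `make_regression` and the names `X_demo`, `y_demo`, `pipe` and `search` are assumptions used to keep the sketch fast and offline; they are not part of the assignment.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: used instead of the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=11)

# The scaler is fitted only on the training part of each CV fold
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf')),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'svr__C': [1, 10], 'svr__epsilon': [0.1, 0.5]}
search = GridSearchCV(pipe, param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(X_demo, y_demo)

print(search.best_params_)
```

With this layout, `search.best_estimator_` can be applied directly to unscaled test data, because the fitted scaler travels with the model.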

## Building the SVR Model

In [None]:
from sklearn.model_selection import GridSearchCV

model_names = ['SVR Linear', 'SVR RBF']

def models(X_train, y_train):

    # Define parameter grid for SVR with linear kernel
    param_grid_lin = {
        'C': [0.1, 1, 10, 100],
        'epsilon': [0.01, 0.1, 0.5, 1]
    }

    # Define parameter grid for SVR with RBF kernel
    param_grid_rbf = {
        'C': [0.1, 1, 10, 100],
        'epsilon': [0.01, 0.1, 0.5, 1],
        'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1]
    }

    # Grid search for SVR with linear kernel
    print("Starting grid search for SVR with linear kernel...")
    svr_lin = SVR(kernel='linear')
    grid_search_lin = GridSearchCV(svr_lin, param_grid_lin, cv=5, scoring='neg_mean_squared_error', verbose=3, n_jobs=-1)
    grid_search_lin.fit(X_train, y_train)
    best_svr_lin = grid_search_lin.best_estimator_
    print("Grid search for SVR with linear kernel complete.\n")

    # Grid search for SVR with RBF kernel
    print("Starting grid search for SVR with RBF kernel...")
    svr_rbf = SVR(kernel='rbf')
    grid_search_rbf = GridSearchCV(svr_rbf, param_grid_rbf, cv=5, scoring='neg_mean_squared_error', verbose=3, n_jobs=-1)
    grid_search_rbf.fit(X_train, y_train)
    best_svr_rbf = grid_search_rbf.best_estimator_
    print("Grid search for SVR with RBF kernel complete.\n")

    # SVR.score returns the coefficient of determination (R^2), not accuracy
    print('Best SVR Linear Training R^2:', best_svr_lin.score(X_train, y_train))
    print('Best SVR RBF Training R^2:', best_svr_rbf.score(X_train, y_train))

    return best_svr_lin, best_svr_rbf
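Beyond `best_estimator_`, a fitted `GridSearchCV` object also exposes the winning hyperparameters and the per-combination scores, which can help when deciding whether the grid needs widening. The sketch below uses a small synthetic dataset and a reduced grid as assumptions so that it finishes quickly; the names `demo_search`, `best_cv_mse` and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=150, n_features=8, noise=5.0, random_state=0)

demo_search = GridSearchCV(
    SVR(kernel='linear'),
    {'C': [0.1, 1], 'epsilon': [0.1, 0.5]},
    cv=3,
    scoring='neg_mean_squared_error',
)
demo_search.fit(X_demo, y_demo)

# The scorer is negated MSE, so flip the sign to recover the MSE itself
best_cv_mse = -demo_search.best_score_

# One row per parameter combination, best-ranked first
results = (
    pd.DataFrame(demo_search.cv_results_)
    [['params', 'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)

print(demo_search.best_params_)
print(best_cv_mse)
print(results)
```

If the best value sits at the edge of a grid (for example the largest `C`), that is usually a hint to extend the grid in that direction.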

## Training the Model

In [None]:
best_models = models(X_train, y_train)

## Evaluating

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_model(best_models, model_names, X_test, y_test):
    for name, model in zip(model_names, best_models):
        print(f'Model: {name}')

        # Predict
        y_pred = model.predict(X_test)

        # Calculate metrics
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)

        print(f'Mean Squared Error: {mse}')
        print(f'R^2 Score: {r2}')

        # Plot predictions vs actual values
        plt.figure(figsize=(10, 5))
        plt.scatter(y_test, y_pred, edgecolors=(0, 0, 0))
        plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=3)
        plt.xlabel('Actual')
        plt.ylabel('Predicted')
        plt.title(f'Actual vs Predicted for {name}')
        plt.show()

        # Plot residuals
        residuals = y_test - y_pred
        plt.figure(figsize=(10, 5))
        plt.scatter(y_pred, residuals, edgecolors=(0, 0, 0))
        plt.hlines(y=0, xmin=y_pred.min(), xmax=y_pred.max(), colors='r', linestyles='--')
        plt.xlabel('Predicted')
        plt.ylabel('Residuals')
        plt.title(f'Residuals vs Predicted for {name}')
        plt.show()


In [None]:
evaluate_model(best_models, model_names, X_test, y_test)
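MSE is expressed in squared target units, which can be hard to interpret. Since the target of this dataset is the median house value in units of $100,000, it may help to also report RMSE and MAE, which are in the same units as the target. The sketch below uses made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical) rather than real model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values in units of $100,000 (not real model output)
y_true_demo = np.array([1.5, 2.0, 3.0, 4.5])
y_pred_demo = np.array([1.0, 2.5, 3.0, 4.0])

mse = mean_squared_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mse)  # back in the units of the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)

print(f'MSE:  {mse:.4f}')
print(f'RMSE: {rmse:.4f} (about ${rmse * 100_000:,.0f})')
print(f'MAE:  {mae:.4f} (about ${mae * 100_000:,.0f})')
```

The same three lines can be dropped into the metrics step of `evaluate_model` if dollar-scale errors are wanted in the report.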