# Assignment 2: Support Vector Machines - SVC

Hi all, 

Please use Google to find SVM Python code, then use it to produce prediction results (regression and classification).

With warm regards,

Stanley

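## Support Vector Classification (SVC)

SVC is the support vector machine variant used for classification tasks. Its goal is to find an optimal hyperplane that separates data points of different classes. SVC maximizes the margin, that is, the distance between the decision boundary and the nearest data points on either side, which improves the model's ability to generalize.

### Key Features:
- **Classification**: SVC applies to binary and multiclass classification problems.
- **Hyperplane**: It finds an optimal hyperplane in a high-dimensional space that separates the classes.
- **Support vectors**: The data points that determine the position of the optimal hyperplane.
- **Kernel functions**: Different kernels (such as linear, polynomial, and RBF) can be used to handle both linearly and non-linearly separable data.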
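
As a rough illustration of the kernel point above, the sketch below fits `SVC` with three kernels on a synthetic two-moons dataset. The data comes from `make_moons` and is an assumption for demonstration only; it is not the Titanic data used later in this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, non-linearly separable toy data (assumption: not the Titanic dataset)
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    # n_support_ holds the number of support vectors found for each class
    print(kernel, round(scores[kernel], 3), clf.n_support_)
```

On curved class boundaries like these, a non-linear kernel would typically be expected to fit better than the linear one, but the exact scores depend on the data and the random seed.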

## Stanley Recommends

Use the Titanic dataset, and preprocess it first: one-hot encoding, filling in missing values, and so on.

[Titanic Dataset](https://www.kaggle.com/c/titanic/data)
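
A minimal sketch of the two recommended preprocessing steps, using pandas on a tiny hand-made frame. The values in `demo` are made up for illustration, and median/mode imputation is one common choice rather than the required one.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with Titanic-style columns (values are made up)
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Fill numeric gaps with the median and categorical gaps with the most frequent value
demo["Age"] = demo["Age"].fillna(demo["Age"].median())
demo["Embarked"] = demo["Embarked"].fillna(demo["Embarked"].mode()[0])

# One-hot encode the categorical columns into 0/1 indicator columns
encoded = pd.get_dummies(demo, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories such as ports of embarkation, which matters for distance-based models like SVMs.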

### References:
[Titanic - Machine Learning from Disaster: Titanic Survival Prediction, Data Analysis Part (in Chinese)](https://hackmd.io/@Go3PyC86QhypSl7kh5nA2Q/Hk4nXFYkK)

# Code

## Import Libraries

In [None]:
from sklearn.svm import SVC

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# Render figures inline in the notebook instead of opening a separate window.
%matplotlib inline

## Import Dataset

### Column Descriptions

| Variable  | Definition                           | Key                               |
|-----------|--------------------------------------|-----------------------------------|
| survival  | Survival                             | 0 = No, 1 = Yes                   |
| pclass    | Ticket class                         | 1 = 1st, 2 = 2nd, 3 = 3rd         |
| sex       | Sex                                  |                                   |
| Age       | Age in years                         |                                   |
| sibsp     | # of siblings / spouses aboard the Titanic |                           |
| parch     | # of parents / children aboard the Titanic |                           |
| ticket    | Ticket number                        |                                   |
| fare      | Passenger fare                       |                                   |
| cabin     | Cabin number                         |                                   |
| embarked  | Port of Embarkation                  | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
train_df = pd.read_csv('/Users/hank/CodeSpace/1131-ML/Assignment_2/dataset/Taitanic/train.csv')
test_df = pd.read_csv('/Users/hank/CodeSpace/1131-ML/Assignment_2/dataset/Taitanic/test.csv')  # no Survived column

train_df.head()

## Data Preprocessing

### Missing Values

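1. `Cabin` has too many missing values to help training, so it is dropped.
2. `PassengerId`, `Name`, and `Ticket` (ticket number) carry no analytical value here, so they are dropped.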

In [None]:
# Inspect missing values per column
print(train_df.isnull().sum())

# Total number of rows in train_df
print("Total rows: " + str(train_df.shape[0]))

In [None]:
# Drop unused columns
train_df = train_df.drop(['Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1)

# Drop rows with missing Embarked or Age values
train_df = train_df.dropna(subset=['Embarked', 'Age'])

# Inspect the result
train_df.head()

### Encoding Categorical Variables

`Sex` and `Embarked` are non-numeric columns and need to be encoded.

`LabelEncoder` is used here.

In [None]:
# Print the unique values in the columns
print(train_df['Sex'].unique())
print(train_df['Embarked'].unique())

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

# Encode the Sex column (selected by name rather than by position)
train_df['Sex'] = labelencoder.fit_transform(train_df['Sex'])

# Encode the Embarked column
train_df['Embarked'] = labelencoder.fit_transform(train_df['Embarked'])

# Print the unique values in the columns
print(train_df['Sex'].unique())
print(train_df['Embarked'].unique())

In [None]:
train_df.head()
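
To make the encoding step easier to inspect, here is a small sketch that keeps one `LabelEncoder` per column so each mapping can be looked up and reused later. The input lists and the names `sex_encoder` and `embarked_encoder` are hypothetical and only mimic the `Sex` and `Embarked` columns.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category values in the style of the Sex and Embarked columns
train_sex = ["male", "female", "female", "male"]
train_embarked = ["S", "C", "Q", "S"]

# One encoder per column, so each fitted mapping stays available
sex_encoder = LabelEncoder().fit(train_sex)
embarked_encoder = LabelEncoder().fit(train_embarked)

# classes_ lists the labels in sorted order; a label's index is its integer code
print(list(sex_encoder.classes_))
print(list(embarked_encoder.classes_))

# Reuse the already fitted encoder on new data instead of refitting it
new_codes = embarked_encoder.transform(["Q", "S", "C"])
print(new_codes)
```

Reusing a fitted encoder guarantees that new data receives the same integer codes as the training data, even if a category happens to be absent from the new data.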

## Splitting the Dataset

In [None]:
# Split the data into independent 'X' and dependent 'y' variables
X = train_df.iloc[:, 1:8].values
y = train_df.iloc[:, 0].values

# Split the dataset into 80% training and 20% testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Feature Scaling

In [None]:
# Scale the data: fit the scaler on the training split only, then apply it to both splits
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
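
As an alternative sketch, scaling and the classifier can be bundled into a scikit-learn pipeline, which makes it harder to train on unscaled data by accident. The data below comes from `make_classification` and is purely synthetic; this is an illustration, not the approach this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with 7 features (assumption: stands in for the Titanic features)
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the scaling statistics from X_train only, then trains the SVC on
# the scaled data; score() applies the same scaling to X_test before predicting
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)

scaler = pipe.named_steps["standardscaler"]
print(test_accuracy, scaler.mean_.shape)
```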

## Building the SVC Model

In [None]:
# import pickle

def models(X_train, y_train):
    # Use SVC (linear kernel)
    svc_lin = SVC(kernel='linear', random_state = 0)
    svc_lin.fit(X_train,y_train)

    # Use SVC (RBF kernel)
    svc_rbf = SVC(kernel='rbf', random_state = 0)
    svc_rbf.fit(X_train,y_train)

    print('SVC Linear Training Accuracy:', svc_lin.score(X_train, y_train))
    print('SVC RBF Training Accuracy:', svc_rbf.score(X_train, y_train))

    # Save the model
    # pickle.dump(svc_lin, open('/Users/hank/CodeSpace/1131-ML/Assignment_2/svc_lin_model.pkl', 'wb'))
    # pickle.dump(svc_rbf, open('/Users/hank/CodeSpace/1131-ML/Assignment_2/svc_rbf_model.pkl', 'wb'))

    return svc_lin,svc_rbf


## Training

In [None]:
model = models(X_train, y_train)

## Validation

In [None]:
# Show the confusion matrix and accuracy for all of the models of the test data
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model_names = ['SVC Linear', 'SVC RBF']

for i in range(len(model)):
    # Build the confusion matrix
    y_pred = model[i].predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    
    # Compute accuracy from the confusion matrix entries
    TN, FP, FN, TP = cm.ravel()
    test_score = (TP + TN) / (TN + TP + FP + FN)
    
    # Plot the confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'{model_names[i]} Confusion Matrix\nAccuracy = {test_score:.2f}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()

    # Show the classification report
    print(f'{model_names[i]} Classification Report')
    print(classification_report(y_test, y_pred))
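
The accuracy formula used in the loop can be checked on a small example. The labels below are made up; the sketch only shows that the manual calculation from the confusion matrix agrees with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for a binary problem
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the entries in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
manual_accuracy = (tp + tn) / (tn + fp + fn + tp)
library_accuracy = accuracy_score(y_true, y_hat)

print(tn, fp, fn, tp)        # 3 1 1 3
print(manual_accuracy)       # 0.75
print(library_accuracy)      # 0.75
```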

## Testing

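Preprocess the test set and predict with the trained models. The same columns are dropped and encoded as in training. Unlike the training set, where rows with missing values were dropped, missing `Age` and `Fare` values are filled with the column mean here so that every test passenger gets a prediction.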

In [None]:
# Process the test set along the same lines as the training set

# Inspect missing values per column
print(test_df.isnull().sum())

# Drop unused columns
test_df = test_df.drop(['Cabin', 'Ticket', 'Name', 'PassengerId'], axis=1)

# Fill missing values with the column mean
test_df['Age'] = test_df['Age'].fillna(test_df['Age'].mean())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())

# Confirm that no missing values remain
print('-'*30)
print(test_df.isnull().sum())
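
One possible variation, sketched below on made-up numbers, is to compute the fill values on the training data and reuse them for the test data, so both sets are imputed consistently. The frames `train` and `test` and the choice of the median are assumptions for illustration, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical mini train/test frames (values are made up)
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                      "Fare": [7.0, 8.0, 9.0, 100.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0],
                     "Fare": [np.nan, 12.0]})

# Learn one fill value per column from the training data only
fill_values = train[["Age", "Fare"]].median()

# fillna with a Series fills each column using the value under the same label
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```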

In [None]:
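# Encode the categorical columns.
# Note: the encoder is refitted here, which yields the same integer codes as in
# training only if each test column contains the same set of categories.
test_df['Sex'] = labelencoder.fit_transform(test_df['Sex'])
test_df['Embarked'] = labelencoder.fit_transform(test_df['Embarked'])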

In [None]:
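# Split the data into independent 'X' and dependent 'y' variables.
# Note: gender_submission.csv is Kaggle's sample submission (it predicts survival
# from sex alone), not ground-truth labels, so the scores below measure agreement
# with that baseline rather than true accuracy.
test_y_df = pd.read_csv('/Users/hank/CodeSpace/1131-ML/Assignment_2/dataset/Taitanic/gender_submission.csv')

# Use a plain array, matching how the scaler was fitted
X_test = test_df.values
y_test = test_y_df['Survived']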

In [None]:
# Scale the data 
X_test = sc.transform(X_test)

In [None]:
# Inspect the results
for i in range(len(model)):
    # Build the confusion matrix
    y_pred = model[i].predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    
    # Compute accuracy
    test_score = accuracy_score(y_test, y_pred)
    
    # Plot the confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
    plt.title(f'{model_names[i]} Confusion Matrix\nAccuracy = {test_score:.2f}')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
    
    # Show the classification report
    print(f'{model_names[i]} Classification Report')
    print(classification_report(y_test, y_pred))