# Support Vector Machine Classification

Support-vector machines (SVMs) are supervised learning models used for classification and regression, known for their kernel trick, which lets them handle nonlinear input spaces. This template builds, trains, and tunes an SVM for a **classification** problem. If you would like to learn more about SVMs, take a look at DataCamp's [Linear Classifiers in Python](https://app.datacamp.com/learn/courses/linear-classifiers-in-python) course.
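
To make the idea of a nonlinear input space concrete, here is a minimal sketch that is not part of the template itself. It assumes a synthetic dataset from scikit-learn's `make_circles` (two concentric rings) and compares a linear kernel with an RBF kernel; the variable names are illustrative only.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): two concentric rings,
# which no straight line can separate
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit the same model with two different kernels and score each on held-out data
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data like this, the RBF kernel should separate the rings far better than the linear kernel, because it implicitly maps the points into a space where the classes become separable.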

To swap your own dataset into this template, it must meet the following requirements:
- There must be at least one feature column and a column with a categorical target variable you would like to predict.
- The features have been cleaned and preprocessed, including categorical encoding.
- There are no NaN/NA values. You can use [this template to impute missing values](https://app.datacamp.com/workspace/templates/recipe-python-impute-missing-data) if needed.
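
The requirements above can be checked programmatically before running the rest of the template. The sketch below is one possible approach, not part of the original template: `check_ready` is a hypothetical helper, and the toy DataFrame stands in for your own data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data standing in for your own DataFrame
toy = pd.DataFrame(
    {
        "target": [0, 1, 0, 1],
        "lead_time": [10, 200, 35, 7],
        "customer_type": ["Group", "Transient", "Group", "Contract"],
    }
)


def check_ready(df, target):
    """Return a list of problems that would stop the template from running."""
    problems = []
    features = df.drop(columns=[target])
    if features.shape[1] == 0:
        problems.append("no feature columns")
    if df.isnull().any().any():
        problems.append("missing values present")
    # Features that are not numeric still need categorical encoding
    non_numeric = [c for c in features.columns if not is_numeric_dtype(features[c])]
    if non_numeric:
        problems.append(f"non-numeric features: {non_numeric}")
    return problems


# The raw toy data still contains an unencoded categorical column
print(check_ready(toy, "target"))

# One-hot encoding the categorical column resolves the problem
encoded = pd.get_dummies(toy, columns=["customer_type"])
print(check_ready(encoded, "target"))
```

An empty list means the three requirements are met. The one-hot encoding step mirrors the `customer_type_*` columns in the placeholder dataset.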

The placeholder dataset in this template consists of hotel booking data with details such as length of stay. Each row represents a booking and whether the booking was canceled (the target variable). You can find more information on this dataset's source and dictionary [here](https://app.datacamp.com/workspace/datasets/dataset-python-hotel-booking-demand).

### 1. Loading packages and data

In [1]:
# Load packages
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

# Load the data (replace the path with your own CSV file path)
df = pd.read_csv("data/hotel_bookings_clean.csv")
df.head()

Unnamed: 0,is_canceled,lead_time,arrival_date_week_number,stays_in_weekend_nights,stays_in_week_nights,adults,is_repeated_guest,previous_cancellations,total_of_special_requests,avg_daily_rate,booked_by_company,booked_by_agent,customer_type_Contract,customer_type_Group,customer_type_Transient,customer_type_Transient-Party
0,0,68,14,2,3,2,0,0,1,130.9,0,1,0,0,0,1
1,0,152,14,1,4,1,0,0,0,42.0,1,0,0,0,0,1
2,0,11,49,0,3,1,0,0,0,36.0,1,0,0,0,0,1
3,1,6,27,0,1,2,0,0,0,139.0,0,1,0,0,1,0
4,1,335,38,0,1,2,0,1,0,85.0,0,1,0,0,1,0


In [2]:
# Check if there are any null values
print(df.isnull().sum())

is_canceled                      0
lead_time                        0
arrival_date_week_number         0
stays_in_weekend_nights          0
stays_in_week_nights             0
adults                           0
is_repeated_guest                0
previous_cancellations           0
total_of_special_requests        0
avg_daily_rate                   0
booked_by_company                0
booked_by_agent                  0
customer_type_Contract           0
customer_type_Group              0
customer_type_Transient          0
customer_type_Transient-Party    0
dtype: int64


In [4]:
# Check columns to make sure you have feature(s) and a target variable
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 16 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   is_canceled                    1500 non-null   int64  
 1   lead_time                      1500 non-null   int64  
 2   arrival_date_week_number       1500 non-null   int64  
 3   stays_in_weekend_nights        1500 non-null   int64  
 4   stays_in_week_nights           1500 non-null   int64  
 5   adults                         1500 non-null   int64  
 6   is_repeated_guest              1500 non-null   int64  
 7   previous_cancellations         1500 non-null   int64  
 8   total_of_special_requests      1500 non-null   int64  
 9   avg_daily_rate                 1500 non-null   float64
 10  booked_by_company              1500 non-null   int64  
 11  booked_by_agent                1500 non-null   int64  
 12  customer_type_Contract         1500 non-null   i

### 2. Splitting the data
To split the data, we'll use the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. 

In [5]:
# Split the data into two DataFrames: X (features) and y (target variable)
X = df.iloc[:, 1:]  # Specify at least one column as feature(s)
y = df["is_canceled"]  # Specify one column as the target variable

# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1050, 15), (450, 15), (1050,), (450,))
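
If your target classes are imbalanced, a purely random split can leave the train and test sets with different class proportions. As a hedged sketch (the toy arrays below are assumptions, not the hotel data), the `stratify` argument of `train_test_split()` keeps the class shares roughly equal in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80% class 0, 20% class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=123, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```

In the template, this would correspond to passing `stratify=y` in the call above, should you decide stratification suits your data.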

### 3. Building a support vector machine classifier

The following code builds a scikit-learn support vector machine classifier (`svm.SVC`) using its most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's [Linear Classifiers in Python](https://app.datacamp.com/learn/courses/linear-classifiers-in-python) course or the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [6]:
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "kernel": "linear",  # Kernel type: 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed'
    "C": 1,  # Regularization parameter (squared l2 penalty); regularization strength is inversely proportional to C
    "gamma": 0.01,  # Kernel coefficient (a float, 'scale', or 'auto') for 'rbf', 'poly' and 'sigmoid'; ignored by 'linear'
    "degree": 3,  # Degree of the 'poly' kernel function; ignored by all other kernels
    "random_state": 123,
}

# Create an svm.SVC with the parameters above
clf = svm.SVC(**params)

# Train the SVM classifier on the train set
clf = clf.fit(X_train, y_train)

# Predict the outcomes on the test set
y_pred = clf.predict(X_test)
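
SVMs are sensitive to the scale of the input features, so standardizing them is a common preprocessing step. The template does not do this itself; the following is a minimal sketch, assuming synthetic data from `make_classification` with one feature deliberately inflated, that wraps `StandardScaler` and `SVC` in a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration), with one feature
# put on a much larger scale than the others
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 1000
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Without scaling: the large feature dominates the RBF kernel distances
unscaled_clf = SVC(kernel="rbf", C=1, gamma="scale").fit(X_tr, y_tr)
unscaled_acc = unscaled_clf.score(X_te, y_te)

# With scaling: the scaler is fit on the training data only, inside the pipeline
scaled_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1, gamma="scale"))
scaled_clf.fit(X_tr, y_tr)
scaled_acc = scaled_clf.score(X_te, y_te)

print("Accuracy without scaling:", unscaled_acc)
print("Accuracy with scaling:", scaled_acc)
```

Putting the scaler inside the pipeline means it is only ever fit on training data, which avoids leaking test-set statistics into the model.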

To evaluate this classifier, we will use accuracy and implement it with sklearn's [metrics.accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function. Note that accuracy may not be the best evaluation metric for your problem, especially if your dataset has class imbalance.

In [7]:
# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.7488888888888889
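
To illustrate why accuracy can mislead under class imbalance, here is a small sketch using made-up labels (not the hotel data). A model that always predicts the majority class scores high on accuracy but no better than chance on balanced accuracy:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
)

# Hypothetical labels: 90 negatives, 10 positives
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate "model" that always predicts the majority class
y_hat = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_hat)
bal_acc = balanced_accuracy_score(y_true, y_hat)

print("Accuracy:", acc)
print("Balanced accuracy:", bal_acc)
print(confusion_matrix(y_true, y_hat))
# zero_division=0 avoids warnings for the class that is never predicted
print(classification_report(y_true, y_hat, zero_division=0))
```

The same functions accept `y_test` and `y_pred` from the cells above if you want a fuller picture than a single accuracy number.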


### 4. Hyperparameter tuning with random search

Hyperparameter tuning is considered best practice for improving the performance of a machine learning model. In this section, we'll use random search, where a fixed number of hyperparameter settings are sampled from specified probability distributions or from lists of candidate values. To learn more about other hyperparameter tuning options, such as grid search, check out DataCamp's [Hyperparameter Tuning in Python](https://app.datacamp.com/learn/courses/hyperparameter-tuning-in-python) course.
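
As a sketch of sampling from probability distributions rather than fixed lists, `RandomizedSearchCV` also accepts SciPy distribution objects. The example below assumes synthetic data and uses `scipy.stats.loguniform` for `C` and `gamma`; the ranges are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Lists are sampled uniformly; distribution objects are sampled via their rvs() method
param_distributions = {
    "kernel": ["linear", "rbf"],
    "C": loguniform(1e-1, 1e1),
    "gamma": loguniform(1e-5, 1e-1),
}

search = RandomizedSearchCV(
    estimator=SVC(random_state=123),
    param_distributions=param_distributions,
    n_iter=5,  # Number of parameter settings to sample
    cv=3,  # Number of folds
    random_state=123,
)
search.fit(X_demo, y_demo)

print("Best parameters found:", search.best_params_)
print("Best mean cross-validated accuracy:", search.best_score_)
```

A log-uniform distribution spreads samples evenly across orders of magnitude, which tends to suit parameters such as `C` and `gamma` that are usually explored on a logarithmic scale.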

Note: SVMs can take noticeably longer to train on larger datasets compared to other models. If that's the case, you can adjust the parameter space and reduce the number of folds and candidates in `RandomizedSearchCV()`. Otherwise, you may want to consider another classification model, such as decision trees.

In [8]:
# Define the parameter space: lists of candidate values to sample from
rs_param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
    "gamma": [0.00001, 0.0001, 0.001, 0.01, 0.1],
}

# Create a svm.SVC object
clf = svm.SVC(random_state=123)

# Instantiate RandomizedSearchCV() with clf and the parameter grid
clf_rs = RandomizedSearchCV(
    estimator=clf,
    param_distributions=rs_param_grid,
    cv=3,  # Number of folds
    n_iter=5,  # Number of parameter candidate settings to sample
    verbose=2,  # The higher this is, the more messages are printed
    random_state=123,
)

# Train the model on the training set
clf_rs.fit(X_train, y_train)

# Print the best parameters and the best mean cross-validated accuracy
print("Best parameters found: ", clf_rs.best_params_)
print("Best accuracy found: ", clf_rs.best_score_)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time=   1.1s
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time=   1.1s
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time=   1.9s
[CV] END .....................C=1, gamma=0.1, kernel=sigmoid; total time=   0.0s
[CV] END .....................C=1, gamma=0.1, kernel=sigmoid; total time=   0.0s
[CV] END .....................C=1, gamma=0.1, kernel=sigmoid; total time=   0.0s
[CV] END ..................C=0.1, gamma=0.01, kernel=sigmoid; total time=   0.0s
[CV] END ..................C=0.1, gamma=0.01, kernel=sigmoid; total time=   0.0s
[CV] END ..................C=0.1, gamma=0.01, kernel=sigmoid; total time=   0.0s
[CV] END ...................C=1, gamma=0.001, kernel=sigmoid; total time=   0.0s
[CV] END ...................C=1, gamma=0.001, kernel=sigmoid; total time=   0.0s
[CV] END ...................C=1, gamma=0.001, ker