## Support vector machine

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load the breast cancer dataset
breast_cancer = load_breast_cancer()

# Create a pandas DataFrame
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
df['target'] = breast_cancer.target

# Display the first 5 rows
display(df.head())

# Print the shape of the DataFrame
print(f"Shape of the DataFrame: {df.shape}")

# Print the column names
print(f"Column names: {df.columns.tolist()}")

# Display data types
display(df.info())

# Check for missing values
print("\nMissing values per column:")
display(df.isnull().sum())

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Shape of the DataFrame: (569, 31)
Column names: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension', 'target']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null   

None


Missing values per column:


mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

**Scaling**


In [2]:
from sklearn.preprocessing import StandardScaler

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the features
X_scaled = scaler.fit_transform(X)

# Create a new DataFrame with scaled features
df_scaled = pd.DataFrame(X_scaled, columns=X.columns)
df_scaled['target'] = y

# Display the first 5 rows of the scaled DataFrame
display(df_scaled.head())

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | 2.217515 | 2.255747 | ... | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 | 0 |
| 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | 0.001392 | -0.868652 | ... | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.24389 | 0.28119 | 0 |
| 2 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.94221 | 1.052926 | 1.363478 | 2.037231 | 0.939685 | -0.398008 | ... | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955 | 1.152255 | 0.201391 | 0 |
| 3 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | 2.867383 | 4.910919 | ... | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.93501 | 0 |
| 4 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.53934 | 1.371011 | 1.428493 | -0.00956 | -0.56245 | ... | -1.46677 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.3971 | 0 |


## SVM introduction


## Support Vector Machines (SVM)

Support Vector Machines (SVM) are powerful supervised learning models used for both classification and regression tasks. The core idea behind SVM is to find the optimal hyperplane that separates data points into different classes (for classification) or to find a function that approximates the target variable within a certain margin of error (for regression).

### SVM for Classification

In SVM classification, the goal is to find a hyperplane in an N-dimensional space (where N is the number of features) that distinctly classifies the data points. The best hyperplane is the one that has the largest margin between the two classes.

*   **Hyperplane:** A decision boundary that separates data points of different classes. In a 2D space, it's a line; in 3D, it's a plane; and in higher dimensions, it's a hyperplane.
*   **Support Vectors:** These are the data points that lie closest to the hyperplane. They are the most difficult to classify and play a crucial role in defining the hyperplane and the margin.
*   **Margin:** The region between the two hyperplanes that are parallel to the separating hyperplane and pass through the support vectors of each class.
*   **Objective:** The objective of SVM classification is to maximize this margin. A larger margin generally leads to better generalization performance on unseen data.

For cases where the data is not linearly separable, SVM uses the concept of a "soft margin" which allows for some misclassifications by introducing slack variables.
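
As a minimal sketch of the margin idea, the snippet below fits a linear SVC on four hand-picked 2D points (toy data invented for illustration, not taken from the breast cancer dataset). A very large `C` is assumed here to approximate a hard margin, and the margin width is computed as $2 / \lVert \mathbf{w} \rVert$, the standard result when the support vectors satisfy $|\mathbf{w} \cdot \mathbf{x} + b| = 1$.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C leaves almost no room for slack, approximating a hard margin
toy_svc = SVC(kernel='linear', C=1e6)
toy_svc.fit(X_toy, y_toy)

w = toy_svc.coef_[0]        # normal vector of the separating hyperplane
b = toy_svc.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", toy_svc.support_vectors_)
```

Lowering `C` in this sketch relaxes the constraint and lets the optimizer trade a wider margin for points inside it, which is the soft-margin behaviour described above.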

### Support Vector Regression (SVR)

Support Vector Regression (SVR) applies the principles of SVM to regression problems. Instead of finding a hyperplane that separates data, SVR aims to find a function that best fits the data points while allowing for a certain degree of error, specified by a parameter epsilon ($\epsilon$).

*   **Epsilon-Insensitive Tube:** SVR constructs a tube around the estimated function. The goal is to have as many data points as possible fall within this tube. The width of the tube is determined by $\epsilon$. Errors within this tube are not penalized.
*   **Slack Variables:** For data points that fall outside the epsilon-insensitive tube, slack variables are introduced to penalize the deviations. The objective is to minimize these deviations while also keeping the model complexity low.
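*   **Objective:** SVR seeks a function that is as flat as possible (small weights) while keeping deviations from the training targets within $\epsilon$ wherever it can. Points that fall outside the tube are tolerated but penalized through the slack variables, with the trade-off controlled by the regularization parameter `C`.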
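
The tube can be made concrete with the epsilon-insensitive loss. The helper below is a small illustrative function (its name and the sample values are assumptions, not part of scikit-learn): residuals smaller than `epsilon` cost nothing, and larger ones are penalized linearly by the amount they exceed it.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical targets and predictions for illustration
y_true_demo = np.array([1.0, 2.0, 3.0])
y_pred_demo = np.array([1.05, 2.5, 2.0])

losses = epsilon_insensitive_loss(y_true_demo, y_pred_demo, epsilon=0.1)
print(losses)  # the first prediction lies inside the tube, so its loss is zero
```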

### The Role of Kernels

Both SVM classification and SVR can handle non-linearly separable data by using kernel functions. Kernels implicitly map the data into a higher-dimensional space where it may become linearly separable. Common kernel functions include:

*   **Linear Kernel:** Suitable for linearly separable data.
*   **Polynomial Kernel:** Maps data into a higher dimension using polynomial combinations of the original features.
*   **Radial Basis Function (RBF) Kernel:** A popular choice that can handle complex non-linear relationships.
*   **Sigmoid Kernel:** Based on the hyperbolic tangent function.

## Classification model training

Import the necessary class for SVM classification and instantiate a model with a linear kernel, then train it using the scaled data.



In [4]:
from sklearn.svm import SVC

# Instantiate an SVC object with a linear kernel
svc_linear = SVC(kernel='linear')

# Train the SVC model
svc_linear.fit(X_scaled, y)

| Parameter | Value |
| --- | --- |
| C | 1.0 |
| kernel | 'linear' |
| degree | 3 |
| gamma | 'scale' |
| coef0 | 0.0 |
| shrinking | True |
| probability | False |
| tol | 0.001 |
| cache_size | 200 |
| class_weight |  |


## Kernel Functions in Support Vector Machines for Classification

Kernel functions are a crucial component of Support Vector Machines (SVMs), particularly when dealing with non-linearly separable data. They allow SVMs to implicitly map the data into a higher-dimensional feature space without explicitly calculating the coordinates in that space. This "kernel trick" makes it computationally feasible to find a linear decision boundary in the higher dimension, which corresponds to a non-linear decision boundary in the original feature space.

Here are some common types of kernel functions used in SVM classification:

### 1. Linear Kernel

The linear kernel is the simplest type of kernel. It is defined as the dot product of the input vectors:

$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$

This kernel is suitable for linearly separable data. Using a linear kernel is equivalent to training a standard linear SVM.

### 2. Polynomial Kernel

The polynomial kernel maps the input data into a higher-dimensional space using polynomial combinations of the original features. It is defined as:

$K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$

Where:
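*   $\gamma$ is a scaling factor. In scikit-learn, `gamma='auto'` uses $1/\text{n\_features}$, and the default `gamma='scale'` uses $1/(\text{n\_features} \cdot \mathrm{Var}(X))$.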
*   $r$ is a constant term (also known as the `coef0`).
*   $d$ is the degree of the polynomial.

The polynomial kernel can capture non-linear relationships in the data, and its complexity is controlled by the degree $d$.

### 3. Radial Basis Function (RBF) Kernel

The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is one of the most widely used kernels. It measures the similarity between two points based on the squared Euclidean distance between them: the kernel value is 1 for identical points and decays toward 0 as the points move apart. The RBF kernel is defined as:

$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$

Where:
*   $\gamma$ is a parameter that controls the width of the Gaussian function. A smaller $\gamma$ means a wider kernel, influencing more distant points. A larger $\gamma$ means a narrower kernel, influencing only nearby points.

The RBF kernel can handle complex non-linear relationships and is a good default choice when the nature of the data is unknown.

### 4. Sigmoid Kernel

The sigmoid kernel is based on the hyperbolic tangent function and is defined as:

$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)$

Where:
*   $\gamma$ is a scaling factor.
*   $r$ is a constant term.

The sigmoid kernel is sometimes used but is less common than the RBF kernel and can sometimes behave like a linear kernel.

Choosing the appropriate kernel is crucial for the performance of an SVM model and often requires experimentation and cross-validation.
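
To connect the four formulas to code, the sketch below evaluates each kernel directly with NumPy on a small synthetic matrix and compares the result with the corresponding helper in `sklearn.metrics.pairwise`. The data and the values of `gamma`, `r`, and `d` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel
)

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # 4 synthetic samples with 3 features
gamma, r, d = 0.5, 1.0, 3     # illustrative hyperparameter values

dot = A @ A.T                                                     # pairwise dot products
sq_dist = np.sum((A[:, None, :] - A[None, :, :]) ** 2, axis=-1)   # pairwise squared distances

K_linear = dot
K_poly = (gamma * dot + r) ** d
K_rbf = np.exp(-gamma * sq_dist)
K_sigmoid = np.tanh(gamma * dot + r)

# Each hand-written kernel matrix should agree with scikit-learn's helper
print(np.allclose(K_linear, linear_kernel(A)))
print(np.allclose(K_poly, polynomial_kernel(A, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(A, gamma=gamma)))
print(np.allclose(K_sigmoid, sigmoid_kernel(A, gamma=gamma, coef0=r)))
```

Each matrix has shape `(4, 4)`: entry `(i, j)` is the similarity between samples `i` and `j`, which is all the SVM optimizer needs from the data.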


Import the SVC class and instantiate SVC objects for each kernel type using default hyperparameters, then train each model.



In [5]:
from sklearn.svm import SVC

# Instantiate SVC objects with different kernels
svc_linear = SVC(kernel='linear')
svc_poly = SVC(kernel='poly')
svc_rbf = SVC(kernel='rbf')
svc_sigmoid = SVC(kernel='sigmoid')

# Train each SVC model
svc_linear.fit(X_scaled, y)
svc_poly.fit(X_scaled, y)
svc_rbf.fit(X_scaled, y)
svc_sigmoid.fit(X_scaled, y)

print("SVC models with different kernels trained successfully.")

SVC models with different kernels trained successfully.


## Hyperparameter tuning

## The Importance of Hyperparameter Tuning and GridSearchCV

Hyperparameter tuning is a critical step in the machine learning workflow, especially for models like Support Vector Machines (SVMs). Hyperparameters are external configuration settings that are not learned from the data during the training process but significantly influence the model's performance and complexity. Examples of SVM hyperparameters include the regularization parameter `C`, the kernel type, and kernel-specific parameters like `gamma` for the RBF kernel or `degree` and `coef0` for the polynomial kernel.

The performance of an SVM model is highly sensitive to the choice of these hyperparameters. Poorly chosen hyperparameters can lead to:

*   **Underfitting:** If the model is too simple (e.g., low `C` value for a complex decision boundary), it may fail to capture the underlying patterns in the data, resulting in low accuracy on both training and unseen data.
*   **Overfitting:** If the model is too complex (e.g., high `C` value or inappropriate kernel parameters), it may fit the training data too closely, including the noise, leading to excellent performance on the training data but poor generalization on new, unseen data.

Therefore, finding the optimal combination of hyperparameters is essential to build a model that generalizes well to new data and achieves the best possible performance.

**GridSearchCV** is a widely used technique for systematically searching for the best hyperparameters. It works by:

1.  Defining a grid of hyperparameter values to explore.
2.  Training and evaluating the model for every possible combination of hyperparameters in the grid.
3.  Using cross-validation to assess the performance of each combination. Cross-validation helps to obtain a more reliable estimate of the model's performance on unseen data and mitigate the risk of overfitting to the training set.
4.  Selecting the hyperparameter combination that yields the best cross-validation score.

While computationally intensive, especially for large grids and datasets, GridSearchCV is a robust method for finding a good set of hyperparameters and improving the model's performance.
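
The four steps above can be sketched by hand. The loop below is a simplified illustration rather than how `GridSearchCV` is implemented internally: it enumerates a deliberately small grid with `ParameterGrid`, scores each combination with 5-fold cross-validation, keeps the best one, and then runs `GridSearchCV` on the same grid for comparison. The grid values are assumptions chosen to keep the run short, and the data is scaled once up front to mirror the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)  # mirrors the notebook's pre-scaling

# Step 1: define the grid (small and illustrative)
demo_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}

# Steps 2-4: evaluate every combination with cross-validation, keep the best
best_score, best_params = -np.inf, None
for params in ParameterGrid(demo_grid):
    scores = cross_val_score(SVC(kernel='rbf', **params), X_demo, y_demo, cv=5)
    if scores.mean() > best_score:
        best_score, best_params = scores.mean(), params

# The same search expressed with GridSearchCV
search = GridSearchCV(SVC(kernel='rbf'), demo_grid, cv=5).fit(X_demo, y_demo)

print("Manual loop :", best_params, best_score)
print("GridSearchCV:", search.best_params_, search.best_score_)
```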

Define a parameter grid for each kernel, then instantiate and fit GridSearchCV for each one and print the best parameters and scores.

In [10]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Instantiate and fit GridSearchCV for the linear kernel
param_grid_linear = {
    'C': [0.1, 1, 10, 100]
}
grid_search_linear = GridSearchCV(SVC(kernel='linear'), param_grid_linear, cv=5)
grid_search_linear.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the polynomial kernel
param_grid_poly = {
    'C': [0.1, 1, 10, 100],
    'degree': [2, 3, 4],
    'gamma': ['scale', 'auto', 0.1, 1]
}
grid_search_poly = GridSearchCV(SVC(kernel='poly'), param_grid_poly, cv=5)
grid_search_poly.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the RBF kernel
param_grid_rbf = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.1, 1]
}
grid_search_rbf = GridSearchCV(SVC(kernel='rbf'), param_grid_rbf, cv=5)
grid_search_rbf.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the sigmoid kernel
param_grid_sigmoid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'coef0': [0, 1]}
grid_search_sigmoid = GridSearchCV(SVC(kernel='sigmoid'), param_grid_sigmoid, cv=5)
grid_search_sigmoid.fit(X_scaled, y)

# Print the best parameters and scores for each kernel
print("Best parameters for Linear kernel:", grid_search_linear.best_params_)
print("Best cross-validation score for Linear kernel:", grid_search_linear.best_score_)

print("\nBest parameters for Polynomial kernel:", grid_search_poly.best_params_)
print("Best cross-validation score for Polynomial kernel:", grid_search_poly.best_score_)

print("\nBest parameters for RBF kernel:", grid_search_rbf.best_params_)
print("Best cross-validation score for RBF kernel:", grid_search_rbf.best_score_)

print("\nBest parameters for Sigmoid kernel:", grid_search_sigmoid.best_params_)
print("Best cross-validation score for Sigmoid kernel:", grid_search_sigmoid.best_score_)

Best parameters for Linear kernel: {'C': 0.1}
Best cross-validation score for Linear kernel: 0.9754075454122031

Best parameters for Polynomial kernel: {'C': 1, 'degree': 3, 'gamma': 0.1}
Best cross-validation score for Polynomial kernel: 0.9613724576929048

Best parameters for RBF kernel: {'C': 10, 'gamma': 'scale'}
Best cross-validation score for RBF kernel: 0.9771774569166279

Best parameters for Sigmoid kernel: {'C': 10, 'coef0': 0, 'gamma': 0.01}
Best cross-validation score for Sigmoid kernel: 0.9701288619779538


## Validation and error metrics

## Model Validation and Error Metrics for Classification

Evaluating the performance of a classification model is essential to understand how well it generalizes to unseen data and to compare different models. Simply training a model on the entire dataset and evaluating it on the same data can lead to an overly optimistic assessment of performance, as the model might have simply memorized the training examples (overfitting). Therefore, using appropriate validation techniques and error metrics is crucial.

### Validation Techniques

**Cross-Validation** is a widely used technique to assess a model's performance and robustness. It involves splitting the dataset into multiple subsets or "folds." The model is trained on a subset of the folds and evaluated on the remaining fold (the validation set). This process is repeated multiple times, with each fold serving as the validation set exactly once. Common cross-validation strategies include:

*   **k-Fold Cross-Validation:** The dataset is divided into $k$ equal-sized folds. The model is trained on $k-1$ folds and validated on the remaining fold. This is repeated $k$ times, and the results are averaged. This provides a more reliable estimate of the model's performance than a single train-test split.

Cross-validation helps to:
*   Obtain a more reliable estimate of the model's performance on unseen data.
*   Reduce the variance of the performance estimate compared to a single train-test split.
*   Make better use of the available data for both training and validation.
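
One practical detail: this notebook standardizes the full dataset before cross-validating, so the scaler's mean and variance are computed with the validation folds included. A common way to avoid that, sketched below under the assumption that a `Pipeline` is acceptable for this workflow, is to put the scaler inside the estimator so it is re-fit on the training folds of each split. The value `C=0.1` and the shuffled `StratifiedKFold` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so each split fits it on training folds only
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', C=0.1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipe, X_raw, y_raw, cv=cv, scoring='accuracy')

print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

On a dataset of this size the difference from pre-scaling is usually small, but the pipeline form keeps the validation folds strictly unseen during preprocessing.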

### Error Metrics for Classification

For classification tasks, several error metrics are used to evaluate a model's performance, providing different perspectives on its strengths and weaknesses. These metrics are typically calculated from the **confusion matrix**, which summarizes the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

*   **Accuracy:** The most intuitive metric, accuracy measures the proportion of correctly classified instances out of the total number of instances.
    $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
    Accuracy can be misleading in cases of imbalanced datasets, where one class is significantly more frequent than others.

*   **Precision:** Precision measures the proportion of correctly predicted positive instances out of the total predicted positive instances. It answers the question: "Of all the instances predicted as positive, how many were actually positive?"
    $\text{Precision} = \frac{TP}{TP + FP}$
    Precision is important when the cost of a false positive is high.

*   **Recall (Sensitivity or True Positive Rate):** Recall measures the proportion of correctly predicted positive instances out of the total actual positive instances. It answers the question: "Of all the actual positive instances, how many were correctly predicted as positive?"
    $\text{Recall} = \frac{TP}{TP + FN}$
    Recall is important when the cost of a false negative is high.

*   **F1-Score:** The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both precision and recall. It is a good choice when you need to consider both false positives and false negatives.
    $\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Choosing the appropriate error metric depends on the specific problem and the relative costs of different types of errors.
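
As a worked example of these formulas, the sketch below computes each metric from the confusion-matrix counts of a small set of made-up labels and checks them against scikit-learn's metric functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical labels and predictions, invented for illustration
y_true_ex = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_ex = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true_ex, y_pred_ex).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# Cross-check against scikit-learn
print(np.isclose(accuracy, accuracy_score(y_true_ex, y_pred_ex)))
print(np.isclose(precision, precision_score(y_true_ex, y_pred_ex)))
print(np.isclose(recall, recall_score(y_true_ex, y_pred_ex)))
print(np.isclose(f1, f1_score(y_true_ex, y_pred_ex)))
```

By default these scikit-learn metrics treat label `1` as the positive class. In the breast cancer dataset, `1` means benign and `0` means malignant, so scoring strings such as `'recall'` report recall for the benign class; recall for malignant cases needs `pos_label=0`.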

In [11]:
from sklearn.model_selection import cross_validate

# Get the best estimators from GridSearchCV
best_svc_linear = grid_search_linear.best_estimator_
best_svc_poly = grid_search_poly.best_estimator_
best_svc_rbf = grid_search_rbf.best_estimator_
best_svc_sigmoid = grid_search_sigmoid.best_estimator_

# Define the scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Perform cross-validation for each best estimator
cv_results_linear = cross_validate(best_svc_linear, X_scaled, y, cv=5, scoring=scoring)
cv_results_poly = cross_validate(best_svc_poly, X_scaled, y, cv=5, scoring=scoring)
cv_results_rbf = cross_validate(best_svc_rbf, X_scaled, y, cv=5, scoring=scoring)
cv_results_sigmoid = cross_validate(best_svc_sigmoid, X_scaled, y, cv=5, scoring=scoring)

# Calculate and print the average scores
print("Cross-validation results for Linear kernel:")
print(f"  Average Accuracy: {cv_results_linear['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_linear['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_linear['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_linear['test_f1'].mean():.4f}")

print("\nCross-validation results for Polynomial kernel:")
print(f"  Average Accuracy: {cv_results_poly['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_poly['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_poly['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_poly['test_f1'].mean():.4f}")

print("\nCross-validation results for RBF kernel:")
print(f"  Average Accuracy: {cv_results_rbf['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_rbf['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_rbf['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_rbf['test_f1'].mean():.4f}")

print("\nCross-validation results for Sigmoid kernel:")
print(f"  Average Accuracy: {cv_results_sigmoid['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_sigmoid['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_sigmoid['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_sigmoid['test_f1'].mean():.4f}")

Cross-validation results for Linear kernel:
  Average Accuracy: 0.9754
  Average Precision: 0.9700
  Average Recall: 0.9916
  Average F1-score: 0.9806

Cross-validation results for Polynomial kernel:
  Average Accuracy: 0.9614
  Average Precision: 0.9496
  Average Recall: 0.9915
  Average F1-score: 0.9699

Cross-validation results for RBF kernel:
  Average Accuracy: 0.9772
  Average Precision: 0.9781
  Average Recall: 0.9860
  Average F1-score: 0.9819

Cross-validation results for Sigmoid kernel:
  Average Accuracy: 0.9701
  Average Precision: 0.9647
  Average Recall: 0.9887
  Average F1-score: 0.9765


## Feature importance


## Feature Importance in SVM Classification (Linear Kernel)

For Support Vector Machines (SVMs) with a **linear kernel**, assessing feature importance is relatively straightforward. The trained linear SVM model finds a hyperplane that separates the classes, and this hyperplane is defined by a set of coefficients, one for each feature. These coefficients directly indicate the weight or importance of each feature in determining the decision boundary.

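*   **Magnitude of Coefficients:** The absolute magnitude of a feature's coefficient reflects its importance. A larger absolute value means that the feature has a stronger influence on the decision boundary. Features with coefficients close to zero have less impact. This comparison is meaningful here because the features were standardized before training; with unscaled features, coefficient magnitudes would also reflect each feature's units.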
*   **Sign of Coefficients:** The sign of a coefficient indicates the direction of the relationship between the feature and the target variable. For binary classification, a positive coefficient means that increasing the feature's value makes it more likely for the instance to belong to one class, while a negative coefficient makes it more likely to belong to the other class.

Therefore, by examining the coefficients of a linear SVM model, we can gain insights into which features are most important for the classification task and how they influence the model's predictions.
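
The sign information can be inspected alongside the magnitude. The sketch below refits a linear SVC on standardized data (with `C=0.1` as an illustrative value) and shows that the decision function is the linear score $\mathbf{w} \cdot \mathbf{x} + b$, so a positive coefficient pushes a sample toward `classes_[1]`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X_std = StandardScaler().fit_transform(data.data)
clf = SVC(kernel='linear', C=0.1).fit(X_std, data.target)

# Signed weights: positive values push the prediction toward clf.classes_[1]
signed_coef = pd.Series(clf.coef_[0], index=data.feature_names)
order = signed_coef.abs().sort_values(ascending=False).index
print(signed_coef.reindex(order).head(5))
print("Class favoured by positive coefficients:", data.target_names[clf.classes_[1]])

# The decision function is the linear score w.x + b
manual_scores = X_std @ clf.coef_[0] + clf.intercept_[0]
print(np.allclose(manual_scores, clf.decision_function(X_std)))
```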

For non-linear kernels (like RBF or polynomial), coefficient-based feature importance is not directly applicable: the separating hyperplane lives in the implicit feature space, so there is no single weight per original feature, and scikit-learn exposes `coef_` only for the linear kernel. Feature importance for non-linear SVMs is generally estimated with model-agnostic techniques such as permutation importance or by measuring the impact of removing features. For a linear kernel, however, the coefficients provide a clear and interpretable measure of feature importance.
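
For a non-linear kernel, permutation importance is one option. The sketch below is an illustration, not part of the original analysis: it fits an RBF SVC inside a pipeline, then uses `sklearn.inspection.permutation_importance` on a held-out split to measure how much accuracy drops when each feature is shuffled. The split size, `C=10`, and `n_repeats=10` are assumed values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data_pi = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data_pi.data, data_pi.target,
    test_size=0.25, stratify=data_pi.target, random_state=0,
)

rbf_pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=10, gamma='scale'))
rbf_pipe.fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and record the score drop
result = permutation_importance(rbf_pipe, X_te, y_te, n_repeats=10, random_state=0)

top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data_pi.feature_names[i]:<25} "
          f"{result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Because many of these features are strongly correlated, shuffling one of them may change the score very little even when the underlying measurement matters, so the ranking should be read with that caveat.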

In [12]:
import pandas as pd

# Access the coefficients of the trained linear SVM model
feature_importance_linear = pd.Series(best_svc_linear.coef_[0], index=X.columns)

# Take the absolute value to represent magnitude of importance
feature_importance_linear = feature_importance_linear.abs()

# Sort the feature importance values in descending order
sorted_feature_importance_linear = feature_importance_linear.sort_values(ascending=False)

# Display the sorted feature importance
print("Sorted Feature Importance for Linear SVM:")
display(sorted_feature_importance_linear)

Sorted Feature Importance for Linear SVM:


worst texture              0.457053
worst symmetry             0.416313
radius error               0.365099
mean concavity             0.354242
worst perimeter            0.353716
worst radius               0.351924
perimeter error            0.346983
mean concave points        0.344767
worst area                 0.339199
worst concavity            0.335998
worst smoothness           0.319742
worst concave points       0.309327
area error                 0.308942
mean texture               0.298839
compactness error          0.265730
fractal dimension error    0.241069
mean area                  0.240785
mean radius                0.224539
mean fractal dimension     0.220163
mean perimeter             0.219614
smoothness error           0.165472
mean compactness           0.162494
worst fractal dimension    0.158951
texture error              0.123705
symmetry error             0.102128
concave points error       0.089969
mean symmetry              0.046207
concavity error            0

## Regression

In [13]:
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the diabetes dataset
diabetes = load_diabetes()

# Create a pandas DataFrame
df_reg = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
df_reg['target'] = diabetes.target

# Display the first 5 rows
display(df_reg.head())

# Print the shape of the DataFrame
print(f"Shape of the DataFrame: {df_reg.shape}")

# Print the column names
print(f"Column names: {df_reg.columns.tolist()}")

# Display data types
display(df_reg.info())

# Check for missing values
print("\nMissing values per column:")
display(df_reg.isnull().sum())

|  | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


Shape of the DataFrame: (442, 11)
Column names: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'target']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    float64
 1   sex     442 non-null    float64
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   s1      442 non-null    float64
 5   s2      442 non-null    float64
 6   s3      442 non-null    float64
 7   s4      442 non-null    float64
 8   s5      442 non-null    float64
 9   s6      442 non-null    float64
 10  target  442 non-null    float64
dtypes: float64(11)
memory usage: 38.1 KB


None


Missing values per column:


age       0
sex       0
bmi       0
bp        0
s1        0
s2        0
s3        0
s4        0
s5        0
s6        0
target    0
dtype: int64

## Diabetes Dataset Explanation

The Diabetes dataset is a standard dataset used for regression tasks. It consists of 442 patients and 10 baseline variables, as well as a target variable representing a quantitative measure of disease progression one year after baseline.

The 10 features are:
*   `age`: age in years
*   `sex`:
*   `bmi`: body mass index
*   `bp`: average blood pressure
*   `s1`: tc, total serum cholesterol
*   `s2`: ldl, low-density lipoproteins
*   `s3`: hdl, high-density lipoproteins
*   `s4`: tch, total cholesterol / HDL
*   `s5`: ltg, possibly log of serum triglycerides level
*   `s6`: glu, blood sugar level

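As loaded by scikit-learn, these features have been mean-centered and scaled by the standard deviation times the square root of the number of samples, so each column has a mean of zero and a sum of squares of one. Their standard deviation is therefore about $1/\sqrt{442} \approx 0.048$, not one, which is why the raw values above are so small. Applying `StandardScaler` afterwards rescales them to unit variance, which matters for models such as SVM that are sensitive to the scale of the input features.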

The **target** variable is a quantitative measure of disease progression one year after baseline, and it is what we will try to predict using Support Vector Regression (SVR).
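
A quick numerical check of how the features are scaled when loaded with the default settings of `load_diabetes` (a sketch for verification; the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes

X_dia = load_diabetes().data
n_samples = X_dia.shape[0]

col_means = X_dia.mean(axis=0)         # expected to be approximately 0
col_sum_sq = (X_dia ** 2).sum(axis=0)  # expected to be approximately 1
col_std = X_dia.std(axis=0)            # expected to be approximately 1 / sqrt(n_samples)

print("Column means:", np.round(col_means, 6))
print("Column sums of squares:", np.round(col_sum_sq, 6))
print("Column standard deviations:", np.round(col_std, 6))
print("1 / sqrt(n_samples):", 1 / np.sqrt(n_samples))
```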

In [14]:
# Separate the features (X) and the target variable (y)
X_reg = df_reg.drop('target', axis=1)
y_reg = df_reg['target']

# Initialize a StandardScaler
scaler_reg = StandardScaler()

# Fit and transform the features
X_reg_scaled = scaler_reg.fit_transform(X_reg)

# Create a new DataFrame with the scaled features
df_reg_scaled = pd.DataFrame(X_reg_scaled, columns=X_reg.columns)
df_reg_scaled['target'] = y_reg

# Display the first 5 rows of the scaled DataFrame
display(df_reg_scaled.head())

|  | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.8005 | 1.065488 | 1.297088 | 0.459841 | -0.929746 | -0.732065 | -0.912451 | -0.054499 | 0.418531 | -0.370989 | 151.0 |
| 1 | -0.039567 | -0.938537 | -1.08218 | -0.553505 | -0.177624 | -0.402886 | 1.564414 | -0.830301 | -1.436589 | -1.938479 | 75.0 |
| 2 | 1.793307 | 1.065488 | 0.934533 | -0.119214 | -0.958674 | -0.718897 | -0.680245 | -0.054499 | 0.060156 | -0.545154 | 141.0 |
| 3 | -1.872441 | -0.938537 | -0.243771 | -0.77065 | 0.256292 | 0.525397 | -0.757647 | 0.721302 | 0.476983 | -0.196823 | 206.0 |
| 4 | 0.113172 | -0.938537 | -0.764944 | 0.459841 | 0.082726 | 0.32789 | 0.171178 | -0.054499 | -0.672502 | -0.980568 | 135.0 |


## Regression model training

In [15]:
from sklearn.svm import SVR

# Instantiate an SVR object with a linear kernel
svr_linear = SVR(kernel='linear')

# Train the SVR model
svr_linear.fit(X_reg_scaled, y_reg)

print("Initial SVR model with linear kernel trained successfully.")

Initial SVR model with linear kernel trained successfully.


## Kernel

## Kernel Functions in Support Vector Regression (SVR)

Similar to Support Vector Machines (SVM) for classification, **Kernel functions** play a vital role in Support Vector Regression (SVR). They enable SVR to model non-linear relationships between features and the target variable without explicitly transforming the data into a higher-dimensional space. This is achieved through the "kernel trick," which computes the dot product of the data points in a high-dimensional feature space implicitly.

In SVR, the goal is to find a function that approximates the target values while ignoring errors smaller than a tolerance $\epsilon$ and penalizing only the deviations that exceed it. Kernels allow SVR to find complex, non-linear functions that fit the data well, even when the relationship is not linear in the original feature space.
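The role of $\epsilon$ can be illustrated with the $\epsilon$-insensitive loss, $\max(0, |y - \hat{y}| - \epsilon)$. The following is a minimal sketch with made-up numbers; the helper name `epsilon_insensitive_loss` and the toy arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss max(0, |y - y_hat| - epsilon)."""
    residual = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return np.maximum(0.0, residual - epsilon)

# Hypothetical targets and predictions
y_toy_true = np.array([1.0, 2.0, 3.0])
y_toy_pred = np.array([1.05, 2.5, 2.0])

losses = epsilon_insensitive_loss(y_toy_true, y_toy_pred, epsilon=0.1)
print(losses)
```

The first sample has a residual of 0.05, which falls inside the tube, so it contributes no loss; the other two are penalized only for the part of the residual that exceeds $\epsilon$.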

Here are the common kernel types used in SVR, with characteristics relevant to regression:

### 1. Linear Kernel

The linear kernel calculates a linear relationship between the input features and the target variable. It is suitable when the relationship between the features and the target can be approximated by a straight line or a hyperplane in the original feature space.

$K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$

### 2. Polynomial Kernel

The polynomial kernel allows SVR to fit non-linear relationships by considering polynomial combinations of the features. The degree of the polynomial ($d$) determines the complexity of the non-linear mapping.

$K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$

This kernel can capture curved or more complex patterns in the data.

### 3. Radial Basis Function (RBF) Kernel

The RBF kernel is a powerful and versatile kernel that can model highly non-linear relationships. It measures the similarity between data points based on their distance. The `gamma` parameter controls how far the influence of an individual training sample reaches: a smaller `gamma` means each sample influences a wider region, giving a smoother regression function, while a larger `gamma` confines each sample's influence to its immediate neighborhood, giving a more flexible function that is more prone to overfitting.

$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$

The RBF kernel is often a good default choice for SVR when the nature of the non-linearity is unknown.

### 4. Sigmoid Kernel

The sigmoid kernel, based on the hyperbolic tangent function, can also be used in SVR to model non-linear relationships. However, it's less commonly used than the RBF kernel in practice and its behavior can sometimes be similar to a linear kernel, especially for certain parameter values.

$K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)$

Choosing the appropriate kernel and tuning its hyperparameters are essential steps in building an effective SVR model.
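To make the four formulas concrete, the sketch below evaluates each kernel with plain NumPy on a small random matrix and compares the result with the corresponding helper in `sklearn.metrics.pairwise`. The toy data and the values chosen for $\gamma$, $r$ (`coef0`) and $d$ (`degree`) are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))   # hypothetical toy data: 5 samples, 3 features
gamma, r, d = 0.5, 1.0, 3          # illustrative gamma, coef0 and degree

gram = X_demo @ X_demo.T           # pairwise dot products x_i . x_j
sq_norms = np.sum(X_demo ** 2, axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * gram  # ||x_i - x_j||^2

K_linear = gram
K_poly = (gamma * gram + r) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(gamma * gram + r)

# Each comparison should print True if the formulas match scikit-learn's helpers
print(np.allclose(K_linear, linear_kernel(X_demo)))
print(np.allclose(K_poly, polynomial_kernel(X_demo, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(X_demo, gamma=gamma)))
print(np.allclose(K_sigmoid, sigmoid_kernel(X_demo, gamma=gamma, coef0=r)))
```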

In [16]:
from sklearn.svm import SVR

# Instantiate SVR objects with different kernels using default hyperparameters
svr_linear = SVR(kernel='linear')
svr_poly = SVR(kernel='poly')
svr_rbf = SVR(kernel='rbf')
svr_sigmoid = SVR(kernel='sigmoid')

# Train each SVR model using the scaled regression data
svr_linear.fit(X_reg_scaled, y_reg)
svr_poly.fit(X_reg_scaled, y_reg)
svr_rbf.fit(X_reg_scaled, y_reg)
svr_sigmoid.fit(X_reg_scaled, y_reg)

print("SVR models with different kernels trained successfully.")

SVR models with different kernels trained successfully.


## Hyperparameter tuning

## Hyperparameter Tuning for Support Vector Regression (SVR)

Hyperparameter tuning is as crucial for Support Vector Regression (SVR) as it is for SVM classification. SVR models have several hyperparameters that significantly influence their performance, including the model's complexity, its ability to fit the training data, and its generalization to unseen data.

Key hyperparameters for SVR include:

*   **C (Regularization Parameter):** This parameter controls the trade-off between fitting the training data and keeping the regression function flat. A smaller `C` penalizes deviations outside the tube less, which yields a flatter, smoother function but may lead to underfitting. A larger `C` allows the model to fit the training data more precisely (potentially including noise) and can lead to overfitting.
*   **epsilon ($\epsilon$):** This parameter defines the epsilon-insensitive tube. Errors within this tube are not penalized. A larger $\epsilon$ leads to a simpler model with fewer support vectors but might ignore important variations in the data. A smaller $\epsilon$ forces the model to fit the training data more closely, potentially leading to a more complex model and sensitivity to noise.
*   **Kernel-Specific Parameters:**
    *   **gamma:** For RBF, polynomial, and sigmoid kernels, `gamma` influences the shape of the regression function. For the RBF kernel in particular, it determines how far the influence of a single training example reaches.
    *   **degree:** For the polynomial kernel, `degree` is the degree of the polynomial function. Higher degrees allow for more complex non-linear relationships but increase the risk of overfitting.
    *   **coef0:** For polynomial and sigmoid kernels, `coef0` is a constant term that affects the shape of the kernel function.

Tuning these hyperparameters is essential to find the optimal balance between model complexity and fitting the data, ultimately leading to better generalization performance on new, unseen data. Techniques like GridSearchCV systematically explore different hyperparameter combinations to identify the best-performing set based on a chosen evaluation metric (e.g., Mean Squared Error for regression).
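The effect of $\epsilon$ on model complexity can be checked empirically by counting support vectors. The sketch below uses synthetic data (a noisy sine curve, chosen here purely for illustration) rather than the diabetes dataset, and the variable names are assumptions.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic 1-D regression problem: noisy sine curve (illustrative only)
rng = np.random.default_rng(0)
X_toy = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=100)

n_support = {}
for eps in [0.01, 0.1, 0.5]:
    model = SVR(kernel='rbf', C=1.0, epsilon=eps).fit(X_toy, y_toy)
    # support_ holds the indices of the training points used as support vectors
    n_support[eps] = len(model.support_)
    print(f"epsilon={eps}: {n_support[eps]} support vectors out of {len(X_toy)}")
```

With a narrow tube nearly every noisy point falls outside it and becomes a support vector; widening the tube should leave far fewer.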

### Parameter Grids for SVR Kernels

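To perform hyperparameter tuning using GridSearchCV, we need to define a grid of candidate values for the relevant hyperparameters of each kernel. The grids below tune `C` and the kernel-specific parameters; `epsilon` stays at the scikit-learn default of 0.1.

*   **Linear Kernel:** The primary hyperparameter for the linear kernel is `C`.
*   **Polynomial Kernel:** Relevant hyperparameters include `C`, `degree`, `coef0`, and `gamma` (the grid below leaves `gamma` at its default).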
*   **RBF Kernel:** Relevant hyperparameters include `C` and `gamma`.
*   **Sigmoid Kernel:** Relevant hyperparameters include `C`, `gamma`, and `coef0`.

In [17]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Define parameter grids for each SVR kernel
param_grid_linear_reg = {'C': [0.01, 0.1, 1, 10, 100]}
param_grid_poly_reg = {'C': [0.1, 1, 10], 'degree': [2, 3, 4], 'coef0': [0, 1, 10]}
param_grid_rbf_reg = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
param_grid_sigmoid_reg = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'coef0': [0, 1]}

# Instantiate and fit GridSearchCV for each kernel
grid_search_linear_reg = GridSearchCV(SVR(kernel='linear'), param_grid_linear_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_linear_reg.fit(X_reg_scaled, y_reg)

grid_search_poly_reg = GridSearchCV(SVR(kernel='poly'), param_grid_poly_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_poly_reg.fit(X_reg_scaled, y_reg)

grid_search_rbf_reg = GridSearchCV(SVR(kernel='rbf'), param_grid_rbf_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_rbf_reg.fit(X_reg_scaled, y_reg)

grid_search_sigmoid_reg = GridSearchCV(SVR(kernel='sigmoid'), param_grid_sigmoid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_sigmoid_reg.fit(X_reg_scaled, y_reg)

# Print the best hyperparameters and best cross-validation score for each kernel
print("Best parameters for Linear SVR:", grid_search_linear_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Linear SVR:", grid_search_linear_reg.best_score_)

print("\nBest parameters for Polynomial SVR:", grid_search_poly_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Polynomial SVR:", grid_search_poly_reg.best_score_)

print("\nBest parameters for RBF SVR:", grid_search_rbf_reg.best_params_)
print("Best cross-validation score (Negative MSE) for RBF SVR:", grid_search_rbf_reg.best_score_)

print("\nBest parameters for Sigmoid SVR:", grid_search_sigmoid_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Sigmoid SVR:", grid_search_sigmoid_reg.best_score_)

Best parameters for Linear SVR: {'C': 1}
Best cross-validation score (Negative MSE) for Linear SVR: -3026.59246296212

Best parameters for Polynomial SVR: {'C': 0.1, 'coef0': 10, 'degree': 3}
Best cross-validation score (Negative MSE) for Polynomial SVR: -2990.0804032835194

Best parameters for RBF SVR: {'C': 100, 'gamma': 0.01}
Best cross-validation score (Negative MSE) for RBF SVR: -2937.417624601308

Best parameters for Sigmoid SVR: {'C': 10, 'coef0': 0, 'gamma': 0.1}
Best cross-validation score (Negative MSE) for Sigmoid SVR: -3074.81708677377
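
One caveat about the search above: the features were scaled once on the full dataset, so each validation fold contributed to the scaler's mean and variance. A common way to avoid this is to put the scaler and the model in a `Pipeline`, so the scaler is refit on the training folds only. The following is a sketch under the assumption that `df_reg` comes from scikit-learn's `load_diabetes`; the reduced grid and the names `pipe`, `param_grid_pipe` and `search` are illustrative. On this dataset the difference in scores is likely to be small.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_raw, y_raw = load_diabetes(return_X_y=True)

# make_pipeline names each step after its lowercased class name ('svr' here)
pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
param_grid_pipe = {'svr__C': [1, 10, 100], 'svr__gamma': [0.001, 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid_pipe, cv=5, scoring='neg_mean_squared_error')
search.fit(X_raw, y_raw)

print("Best parameters:", search.best_params_)
print("Best cross-validation score (Negative MSE):", search.best_score_)
```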


## Validation and error metrics

## Model Validation and Error Metrics for Regression

In regression tasks, the goal is to predict a continuous target variable. Evaluating the performance of a regression model requires different metrics compared to classification. Similar to classification, proper validation techniques are essential to ensure that the model's performance estimate is reliable and that it generalizes well to unseen data.

### Validation Techniques

As discussed for classification, **cross-validation** is a fundamental technique for assessing the performance and robustness of a regression model. By splitting the data into multiple folds and training and validating on different combinations, we obtain a more reliable estimate of how the model will perform on new data and reduce the risk of reporting an overly optimistic score from a single favorable split. In k-fold cross-validation, the dataset is divided into $k$ folds; the model is trained on $k-1$ folds and evaluated on the remaining fold, and this is repeated $k$ times so that each fold serves once as the validation set.
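The k-fold procedure can be written out by hand to show what the scikit-learn helpers do internally. This sketch assumes the diabetes data as shipped by `load_diabetes` and an untuned linear SVR; it compares a manual loop with `cross_val_score` on the same unshuffled folds.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

X_cv, y_cv = load_diabetes(return_X_y=True)
base_model = SVR(kernel='linear', C=1.0)
kfold = KFold(n_splits=5)  # consecutive folds, no shuffling

fold_mse = []
for train_idx, test_idx in kfold.split(X_cv):
    # Train a fresh copy on k-1 folds, evaluate on the held-out fold
    fold_model = clone(base_model).fit(X_cv[train_idx], y_cv[train_idx])
    y_fold_pred = fold_model.predict(X_cv[test_idx])
    fold_mse.append(mean_squared_error(y_cv[test_idx], y_fold_pred))
fold_mse = np.array(fold_mse)

# Same folds through the built-in helper (scores are negated MSE)
builtin_mse = -cross_val_score(base_model, X_cv, y_cv, cv=kfold,
                               scoring='neg_mean_squared_error')

print("Manual mean MSE:  ", fold_mse.mean())
print("Built-in mean MSE:", builtin_mse.mean())
```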

### Error Metrics for Regression

Several metrics are used to quantify the difference between the predicted values ($\hat{y}$) and the actual target values ($y$).

*   **Mean Squared Error (MSE):** MSE is one of the most common regression error metrics. It measures the average of the squared differences between the predicted and actual values.
    $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
    MSE penalizes larger errors more heavily due to the squaring. A lower MSE indicates a better model fit. The unit of MSE is the square of the unit of the target variable, which can make it less intuitive to interpret directly.

*   **R-squared ($R^2$) Score:** The R-squared score, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It provides an indication of how well the model fits the observed data.
    $R^2 = 1 - \frac{\text{Sum of Squares of Residuals}}{\text{Total Sum of Squares}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
    Where $\bar{y}$ is the mean of the actual target values.
    The R-squared score is at most 1 and typically falls between 0 and 1, but it can be negative when the model fits worse than a constant prediction. An R-squared of 1 means the model perfectly predicts the target variable. An R-squared of 0 means the model does not explain any of the variance in the target variable (it performs no better than simply predicting the mean). A higher R-squared generally indicates a better model fit.

When evaluating regression models, it's often beneficial to consider both MSE (or its square root, RMSE, which is in the same unit as the target) and R-squared to get a comprehensive understanding of the model's performance.
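As a small worked example, the formulas above can be evaluated directly with NumPy and compared with `sklearn.metrics`. The four target values and predictions are invented for illustration, and `y_bad` is a deliberately poor prediction included to show a negative $R^2$.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual values and predictions
y_obs = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_obs - y_hat) ** 2)
rmse = np.sqrt(mse)                              # same unit as the target
ss_res = np.sum((y_obs - y_hat) ** 2)            # sum of squares of residuals
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}, R-squared = {r2:.4f}")
print(np.isclose(mse, mean_squared_error(y_obs, y_hat)),
      np.isclose(r2, r2_score(y_obs, y_hat)))

# Predictions that are worse than always predicting the mean give R-squared < 0
y_bad = np.array([7.0, 7.0, -0.5, -0.5])
r2_bad = r2_score(y_obs, y_bad)
print(f"R-squared of the poor predictions = {r2_bad:.4f}")
```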

In [18]:
from sklearn.model_selection import cross_validate

# Retrieve the best estimators from the completed GridSearchCV objects
best_svr_linear = grid_search_linear_reg.best_estimator_
best_svr_poly = grid_search_poly_reg.best_estimator_
best_svr_rbf = grid_search_rbf_reg.best_estimator_
best_svr_sigmoid = grid_search_sigmoid_reg.best_estimator_

# Define the scoring metrics for regression
scoring_reg = ['neg_mean_squared_error', 'r2']

# Perform 5-fold cross-validation for each best SVR estimator
cv_results_linear_reg = cross_validate(best_svr_linear, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_poly_reg = cross_validate(best_svr_poly, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_rbf_reg = cross_validate(best_svr_rbf, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_sigmoid_reg = cross_validate(best_svr_sigmoid, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)

# Calculate the average MSE and R-squared for each kernel
average_mse_linear_reg = -cv_results_linear_reg['test_neg_mean_squared_error'].mean()
average_r2_linear_reg = cv_results_linear_reg['test_r2'].mean()

average_mse_poly_reg = -cv_results_poly_reg['test_neg_mean_squared_error'].mean()
average_r2_poly_reg = cv_results_poly_reg['test_r2'].mean()

average_mse_rbf_reg = -cv_results_rbf_reg['test_neg_mean_squared_error'].mean()
average_r2_rbf_reg = cv_results_rbf_reg['test_r2'].mean()

average_mse_sigmoid_reg = -cv_results_sigmoid_reg['test_neg_mean_squared_error'].mean()
average_r2_sigmoid_reg = cv_results_sigmoid_reg['test_r2'].mean()

# Print the average MSE and R-squared for each SVR kernel
print("Average Cross-validation Results for Tuned SVR Models:")
print(f"Linear SVR: MSE = {average_mse_linear_reg:.4f}, R-squared = {average_r2_linear_reg:.4f}")
print(f"Polynomial SVR: MSE = {average_mse_poly_reg:.4f}, R-squared = {average_r2_poly_reg:.4f}")
print(f"RBF SVR: MSE = {average_mse_rbf_reg:.4f}, R-squared = {average_r2_rbf_reg:.4f}")
print(f"Sigmoid SVR: MSE = {average_mse_sigmoid_reg:.4f}, R-squared = {average_r2_sigmoid_reg:.4f}")

Average Cross-validation Results for Tuned SVR Models:
Linear SVR: MSE = 3026.5925, R-squared = 0.4772
Polynomial SVR: MSE = 2990.0804, R-squared = 0.4831
RBF SVR: MSE = 2937.4176, R-squared = 0.4923
Sigmoid SVR: MSE = 3074.8171, R-squared = 0.4697


## Feature importance

## Feature Importance in Support Vector Regression (SVR)

Assessing feature importance in Support Vector Regression (SVR) provides insights into which input features have the most significant impact on the predicted continuous target variable. The approach to determining feature importance depends heavily on the chosen kernel.

For SVR models using a **linear kernel**, feature importance can be directly interpreted from the model's **coefficients**. Similar to linear regression or linear SVM classification, a linear SVR model learns a linear function of the input features to predict the target.

The learned linear function can be represented as:

$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$

Where:
*   $\mathbf{x}$ is the input feature vector.
*   $\mathbf{w}$ is the vector of coefficients (weights).
*   $b$ is the intercept.

The elements of the coefficient vector $\mathbf{w}$ correspond to the weights assigned to each feature.

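*   **Magnitude of Coefficients:** The absolute value of a feature's coefficient ($|w_i|$) indicates the strength of its influence on the predicted target value. A larger absolute coefficient means that a unit change in that feature has a larger effect on the prediction. Because the features were standardized before training, a unit change corresponds to one standard deviation of each feature, which is what makes the coefficients comparable across features. Features with coefficients close to zero have minimal impact.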
*   **Sign of Coefficients:** The sign of a coefficient ($w_i$) indicates the direction of the relationship between the feature and the target. A positive coefficient means that increasing the feature's value leads to an increase in the predicted target value (assuming other features are held constant), while a negative coefficient means that increasing the feature's value leads to a decrease in the predicted target value.

Therefore, for linear SVR, the absolute values of the coefficients provide a straightforward measure of feature importance, allowing us to rank features based on their influence on the regression output.
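For a linear kernel, the weight vector $\mathbf{w}$ can be recovered from the support vectors and their dual coefficients, and $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ reproduces the model's predictions. The sketch below checks both on the diabetes data; it assumes `load_diabetes` as the data source and an untuned `C=1.0`, and the names `lin`, `w_from_dual` and `manual_pred` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_d, y_d = load_diabetes(return_X_y=True)
X_d = StandardScaler().fit_transform(X_d)

lin = SVR(kernel='linear', C=1.0).fit(X_d, y_d)

# w as a weighted sum of support vectors; shape (1, n_features)
w_from_dual = lin.dual_coef_ @ lin.support_vectors_
print(np.allclose(w_from_dual, lin.coef_))

# f(x) = w . x + b evaluated by hand
manual_pred = X_d @ lin.coef_.ravel() + lin.intercept_[0]
print(np.allclose(manual_pred, lin.predict(X_d)))
```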

For SVR models with non-linear kernels (such as polynomial or RBF), the concept of a simple linear coefficient-based feature importance is not directly applicable because the model operates in a transformed, higher-dimensional space. Determining feature importance in these cases typically requires more complex methods, such as permutation importance, or analyzing the sensitivity of the model's output to changes in individual features.
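Permutation importance can be sketched with `sklearn.inspection.permutation_importance`, which shuffles one feature at a time and records how much the score drops. The example assumes the diabetes data from `load_diabetes`, a held-out split chosen for illustration, and RBF hyperparameters (`C=100`, `gamma=0.01`) borrowed from the grid search results; it is not a definitive ranking.

```python
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

diabetes = load_diabetes()
X_tr, X_te, y_tr, y_te = train_test_split(
    diabetes.data, diabetes.target, test_size=0.25, random_state=0
)

rbf_model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma=0.01))
rbf_model.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out data and measure the drop in R-squared
perm_result = permutation_importance(
    rbf_model, X_te, y_te, n_repeats=10, random_state=0, scoring='r2'
)

for idx in perm_result.importances_mean.argsort()[::-1]:
    print(f"{diabetes.feature_names[idx]:>4}: "
          f"{perm_result.importances_mean[idx]:.4f} "
          f"+/- {perm_result.importances_std[idx]:.4f}")
```

Values near zero (or negative) suggest the model's held-out score barely depends on that feature.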

In [19]:
import pandas as pd

# Access the coefficients of the trained best_svr_linear model
# The coefficients are stored in the 'coef_' attribute. Since it's a single output, we take the first row [0].
feature_importance_linear_reg = pd.Series(best_svr_linear.coef_[0], index=X_reg.columns)

# Calculate the absolute value to represent the magnitude of feature importance
feature_importance_linear_reg = feature_importance_linear_reg.abs()

# Sort the feature importance values in descending order
sorted_feature_importance_linear_reg = feature_importance_linear_reg.sort_values(ascending=False)

# Print and display the sorted feature importance
print("Sorted Feature Importance for Linear SVR:")
display(sorted_feature_importance_linear_reg)

Sorted Feature Importance for Linear SVR:


s5     21.541068
bmi    20.861853
bp     15.922391
sex    11.955362
s3      8.627998
s4      6.509319
s2      4.805351
s6      4.200074
s1      3.949230
age     0.104373
dtype: float64


## Support Vector Machines (SVM) for Classification

In SVM classification, the goal is to find the optimal hyperplane that separates data points belonging to different classes with the largest possible margin.

*   **Hyperplane:** A decision boundary in an N-dimensional space that separates data points.
*   **Support Vectors:** Data points closest to the hyperplane that influence its position and the margin.
*   **Margin:** The region between parallel hyperplanes that pass through the support vectors. Maximizing the margin is the core objective.

For non-linearly separable data, SVM utilizes the "kernel trick" to implicitly map data into a higher-dimensional space where a linear separation might be possible.
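For a linear SVM in canonical form, the margin width equals $2 / \|\mathbf{w}\|$ (a standard result that is not derived here). The sketch below fits a near hard-margin classifier on four hand-picked points whose classes are separated by a gap of 2 along the first axis; the data, the large `C`, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated along the first feature: x = 0 versus x = 2
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_sep = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel='linear', C=1e6).fit(X_sep, y_sep)

w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("Weights:", w)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

The computed width should be close to 2, the distance between the two groups of points.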

### Classification Kernel Functions

Kernel functions allow SVMs to model non-linear decision boundaries. Common kernels include:

*   **Linear Kernel:** $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$. Suitable for linearly separable data.
*   **Polynomial Kernel:** $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$. Maps data using polynomial combinations, controlled by degree $d$.
*   **Radial Basis Function (RBF) Kernel:** $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$. A versatile kernel for complex non-linear relationships, controlled by $\gamma$.
*   **Sigmoid Kernel:** $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)$. Based on the hyperbolic tangent function.

Choosing the right kernel and tuning its parameters is crucial for classification performance.

In [52]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define parameter grids for each classification kernel
param_grid_linear = {'C': [0.01, 0.1, 1, 10, 100]}
param_grid_poly = {'C': [0.1, 1, 10], 'degree': [2, 3, 4], 'coef0': [0, 1, 10]}
param_grid_rbf = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
param_grid_sigmoid = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'coef0': [0, 1]}

# Instantiate and fit GridSearchCV for the linear kernel
grid_search_linear = GridSearchCV(SVC(kernel='linear'), param_grid_linear, cv=5)
grid_search_linear.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the polynomial kernel
grid_search_poly = GridSearchCV(SVC(kernel='poly'), param_grid_poly, cv=5)
grid_search_poly.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the RBF kernel
grid_search_rbf = GridSearchCV(SVC(kernel='rbf'), param_grid_rbf, cv=5)
grid_search_rbf.fit(X_scaled, y)

# Instantiate and fit GridSearchCV for the sigmoid kernel
grid_search_sigmoid = GridSearchCV(SVC(kernel='sigmoid'), param_grid_sigmoid, cv=5)
grid_search_sigmoid.fit(X_scaled, y)

# Print the best parameters and scores for each kernel
print("Best parameters for Linear kernel:", grid_search_linear.best_params_)
print("Best cross-validation score for Linear kernel:", grid_search_linear.best_score_)

print("\nBest parameters for Polynomial kernel:", grid_search_poly.best_params_)
print("Best cross-validation score for Polynomial kernel:", grid_search_poly.best_score_)

print("\nBest parameters for RBF kernel:", grid_search_rbf.best_params_)
print("Best cross-validation score for RBF kernel:", grid_search_rbf.best_score_)

print("\nBest parameters for Sigmoid kernel:", grid_search_sigmoid.best_params_)
print("Best cross-validation score for Sigmoid kernel:", grid_search_sigmoid.best_score_)

Best parameters for Linear kernel: {'C': 0.1}
Best cross-validation score for Linear kernel: 0.9754075454122031

Best parameters for Polynomial kernel: {'C': 1, 'coef0': 1, 'degree': 3}
Best cross-validation score for Polynomial kernel: 0.9807017543859649

Best parameters for RBF kernel: {'C': 10, 'gamma': 0.01}
Best cross-validation score for RBF kernel: 0.9789318428815401

Best parameters for Sigmoid kernel: {'C': 10, 'coef0': 0, 'gamma': 0.01}
Best cross-validation score for Sigmoid kernel: 0.9701288619779538



## Classification Model Validation and Error Metrics

To reliably evaluate the performance of our classification models and understand how well they generalize to unseen data, we use **Cross-Validation**. Specifically, we employ k-Fold Cross-Validation, where the dataset is split into $k$ folds, and the model is trained on $k-1$ folds and validated on the remaining fold, rotating through all folds. This provides a more robust performance estimate than a single train-test split.

We will use the following error metrics, calculated from the confusion matrix, to evaluate the tuned classification models:

*   **Accuracy:** The proportion of correctly classified instances.
    $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
*   **Precision:** The proportion of correctly predicted positive instances out of all instances predicted as positive.
    $\text{Precision} = \frac{TP}{TP + FP}$
*   **Recall (Sensitivity):** The proportion of correctly predicted positive instances out of all actual positive instances.
    $\text{Recall} = \frac{TP}{TP + FN}$
*   **F1-Score:** The harmonic mean of Precision and Recall, balancing both metrics.
    $\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
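
The four formulas can be checked on a small hand-made example by reading TP, TN, FP and FN off the confusion matrix and comparing with the scikit-learn scorers. The label vectors below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical true labels and predictions (1 = positive class)
y_actual = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_predicted = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.4f}, Precision={precision:.4f}, "
      f"Recall={recall:.4f}, F1={f1:.4f}")
print(np.isclose(accuracy, accuracy_score(y_actual, y_predicted)),
      np.isclose(precision, precision_score(y_actual, y_predicted)),
      np.isclose(recall, recall_score(y_actual, y_predicted)),
      np.isclose(f1, f1_score(y_actual, y_predicted)))
```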

In [20]:
from sklearn.model_selection import cross_validate

# Get the best estimators from GridSearchCV
best_svc_linear = grid_search_linear.best_estimator_
best_svc_poly = grid_search_poly.best_estimator_
best_svc_rbf = grid_search_rbf.best_estimator_
best_svc_sigmoid = grid_search_sigmoid.best_estimator_

# Define the scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Perform cross-validation for each best estimator
cv_results_linear = cross_validate(best_svc_linear, X_scaled, y, cv=5, scoring=scoring)
cv_results_poly = cross_validate(best_svc_poly, X_scaled, y, cv=5, scoring=scoring)
cv_results_rbf = cross_validate(best_svc_rbf, X_scaled, y, cv=5, scoring=scoring)
cv_results_sigmoid = cross_validate(best_svc_sigmoid, X_scaled, y, cv=5, scoring=scoring)

# Calculate and print the average scores
print("Cross-validation results for Linear kernel:")
print(f"  Average Accuracy: {cv_results_linear['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_linear['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_linear['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_linear['test_f1'].mean():.4f}")

print("\nCross-validation results for Polynomial kernel:")
print(f"  Average Accuracy: {cv_results_poly['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_poly['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_poly['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_poly['test_f1'].mean():.4f}")

print("\nCross-validation results for RBF kernel:")
print(f"  Average Accuracy: {cv_results_rbf['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_rbf['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_rbf['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_rbf['test_f1'].mean():.4f}")

print("\nCross-validation results for Sigmoid kernel:")
print(f"  Average Accuracy: {cv_results_sigmoid['test_accuracy'].mean():.4f}")
print(f"  Average Precision: {cv_results_sigmoid['test_precision'].mean():.4f}")
print(f"  Average Recall: {cv_results_sigmoid['test_recall'].mean():.4f}")
print(f"  Average F1-score: {cv_results_sigmoid['test_f1'].mean():.4f}")

Cross-validation results for Linear kernel:
  Average Accuracy: 0.9754
  Average Precision: 0.9700
  Average Recall: 0.9916
  Average F1-score: 0.9806

Cross-validation results for Polynomial kernel:
  Average Accuracy: 0.9614
  Average Precision: 0.9496
  Average Recall: 0.9915
  Average F1-score: 0.9699

Cross-validation results for RBF kernel:
  Average Accuracy: 0.9772
  Average Precision: 0.9781
  Average Recall: 0.9860
  Average F1-score: 0.9819

Cross-validation results for Sigmoid kernel:
  Average Accuracy: 0.9701
  Average Precision: 0.9647
  Average Recall: 0.9887
  Average F1-score: 0.9765



## Feature Importance in Linear SVM Classification

For Support Vector Machines with a **linear kernel**, we can assess feature importance by examining the **coefficients** assigned to each feature by the trained model. These coefficients represent the weight or influence of each feature on the decision boundary that separates the classes.

*   The **absolute magnitude** of a coefficient indicates the strength of a feature's importance. Larger absolute values correspond to features that have a greater impact on the classification decision.
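*   The **sign** of a coefficient indicates the direction of the relationship between the feature and the target class: a positive coefficient pushes the decision function toward the class labeled 1 (benign in scikit-learn's breast cancer dataset), and a negative coefficient pushes it toward the class labeled 0 (malignant).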

By analyzing these coefficients, we can identify which features are most influential in the linear SVM classification model. This interpretation is straightforward for linear kernels but not directly applicable to non-linear kernels (like RBF or polynomial) where feature interactions are more complex due to the mapping into a higher-dimensional space.

In [56]:
import pandas as pd

# Access the coefficients of the trained linear SVM model
feature_importance_linear = pd.Series(best_svc_linear.coef_[0], index=X.columns)

# Take the absolute value to represent magnitude of importance
feature_importance_linear = feature_importance_linear.abs()

# Sort the feature importance values in descending order
sorted_feature_importance_linear = feature_importance_linear.sort_values(ascending=False)

# Display the sorted feature importance
print("Sorted Feature Importance for Linear SVM:")
display(sorted_feature_importance_linear)

Sorted Feature Importance for Linear SVM:


Unnamed: 0,0
worst texture,0.457053
worst symmetry,0.416313
radius error,0.365099
mean concavity,0.354242
worst perimeter,0.353716
worst radius,0.351924
perimeter error,0.346983
mean concave points,0.344767
worst area,0.339199
worst concavity,0.335998


## Support Vector Regression (SVR) Example: Diabetes Dataset

Now, we will explore Support Vector Machines applied to a regression problem using the **Diabetes Dataset**. This dataset is a standard benchmark for regression and contains physiological measurements and a quantitative measure of disease progression for 442 diabetes patients.

The dataset includes 10 baseline features. In scikit-learn's copy of the dataset, each feature is already mean-centered and scaled so that its column has a unit sum of squares. The target variable is a continuous measure of disease progression one year after baseline. This task will demonstrate how SVR can be used to predict a continuous outcome based on these features.

## Support Vector Regression (SVR)

Support Vector Regression (SVR) is the adaptation of SVM principles to regression problems. Instead of finding a hyperplane to separate classes, SVR aims to find a function that approximates the target variable while allowing for a certain tolerance for errors, defined by $\epsilon$.

*   **Epsilon-Insensitive Tube:** SVR constructs a tube around the predicted function. Errors within this tube (with width $2\epsilon$) are not penalized. The goal is to minimize the errors outside this tube and find a function that is as flat as possible.
*   **Support Vectors:** Similar to classification, support vectors are the data points that lie on or outside the epsilon-insensitive tube. They are the critical points that define the regression function.
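
The relationship between the tube and the support vectors can be verified numerically: points that are not support vectors should have residuals of at most about $\epsilon$, while support vectors should sit on or outside the tube. The sketch below uses synthetic linear data invented for illustration; small deviations from $\epsilon$ are expected because the solver stops at a numerical tolerance.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: a noisy straight line (illustrative only)
rng = np.random.default_rng(1)
X_t = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y_t = 0.5 * X_t.ravel() + rng.normal(scale=0.3, size=80)

eps = 0.4
svr = SVR(kernel='linear', C=1.0, epsilon=eps).fit(X_t, y_t)

residuals = np.abs(y_t - svr.predict(X_t))

# Boolean mask marking which training points are support vectors
is_sv = np.zeros(len(X_t), dtype=bool)
is_sv[svr.support_] = True

print("Support vectors:", is_sv.sum(), "of", len(X_t))
print("Largest residual among non-support vectors:", residuals[~is_sv].max())
print("Smallest residual among support vectors:   ", residuals[is_sv].min())
```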

SVR also utilizes kernel functions to handle non-linear relationships between features and the continuous target variable.

### Regression Kernel Functions

The same kernel functions used in SVM classification can be applied to SVR to model non-linear relationships in the data:

*   **Linear Kernel:** Suitable for linear relationships between features and the target.
    $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j$
*   **Polynomial Kernel:** Used for modeling polynomial relationships.
    $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)^d$
*   **Radial Basis Function (RBF) Kernel:** A flexible kernel for complex non-linear relationships, controlled by $\gamma$.
    $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma ||\mathbf{x}_i - \mathbf{x}_j||^2)$
*   **Sigmoid Kernel:** Can model non-linear relationships, though less common than RBF.
    $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i \cdot \mathbf{x}_j + r)$

Selecting the appropriate kernel and tuning its parameters (`C`, $\epsilon$, and kernel-specific parameters like `gamma`, `degree`, `coef0`) are crucial for optimizing SVR performance.

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Define parameter grids for each SVR kernel (same grids as in the earlier tuning cell)
param_grid_linear_reg = {'C': [0.01, 0.1, 1, 10, 100]}
param_grid_poly_reg = {'C': [0.1, 1, 10], 'degree': [2, 3, 4], 'coef0': [0, 1, 10]}
param_grid_rbf_reg = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
param_grid_sigmoid_reg = {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01, 0.1], 'coef0': [0, 1]}

# Instantiate and fit GridSearchCV for each kernel
grid_search_linear_reg = GridSearchCV(SVR(kernel='linear'), param_grid_linear_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_linear_reg.fit(X_reg_scaled, y_reg)

grid_search_poly_reg = GridSearchCV(SVR(kernel='poly'), param_grid_poly_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_poly_reg.fit(X_reg_scaled, y_reg)

grid_search_rbf_reg = GridSearchCV(SVR(kernel='rbf'), param_grid_rbf_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_rbf_reg.fit(X_reg_scaled, y_reg)

grid_search_sigmoid_reg = GridSearchCV(SVR(kernel='sigmoid'), param_grid_sigmoid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_sigmoid_reg.fit(X_reg_scaled, y_reg)

# Print the best hyperparameters and best cross-validation score for each kernel
print("Best parameters for Linear SVR:", grid_search_linear_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Linear SVR:", grid_search_linear_reg.best_score_)

print("\nBest parameters for Polynomial SVR:", grid_search_poly_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Polynomial SVR:", grid_search_poly_reg.best_score_)

print("\nBest parameters for RBF SVR:", grid_search_rbf_reg.best_params_)
print("Best cross-validation score (Negative MSE) for RBF SVR:", grid_search_rbf_reg.best_score_)

print("\nBest parameters for Sigmoid SVR:", grid_search_sigmoid_reg.best_params_)
print("Best cross-validation score (Negative MSE) for Sigmoid SVR:", grid_search_sigmoid_reg.best_score_)

Best parameters for Linear SVR: {'C': 1}
Best cross-validation score (Negative MSE) for Linear SVR: -3026.59246296212

Best parameters for Polynomial SVR: {'C': 0.1, 'coef0': 10, 'degree': 3}
Best cross-validation score (Negative MSE) for Polynomial SVR: -2990.0804032835194

Best parameters for RBF SVR: {'C': 100, 'gamma': 0.01}
Best cross-validation score (Negative MSE) for RBF SVR: -2937.417624601308

Best parameters for Sigmoid SVR: {'C': 10, 'coef0': 0, 'gamma': 0.1}
Best cross-validation score (Negative MSE) for Sigmoid SVR: -3074.81708677377
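
The searches above are fitted on `X_reg_scaled`. If that array was produced by fitting a scaler on the full dataset (an assumption, since the scaling step is not shown in this cell), each validation fold has already influenced the scaling statistics. A minimal sketch of one way to avoid this is to put the scaler inside a `Pipeline`, so it is refitted on the training part of every fold. The names `X_demo`, `y_demo`, `pipe_rbf`, `param_grid_pipe`, and `grid_pipe` are illustrative, and the grid is reduced to keep the run short.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Illustrative reload of the diabetes data (the notebook's own X_reg / y_reg would work as well)
X_demo, y_demo = load_diabetes(return_X_y=True)

# make_pipeline names each step after its lowercased class name: 'standardscaler' and 'svr'
pipe_rbf = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid_pipe = {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1]}

grid_pipe = GridSearchCV(pipe_rbf, param_grid_pipe, cv=5, scoring="neg_mean_squared_error")
grid_pipe.fit(X_demo, y_demo)  # unscaled input: scaling now happens inside each fold

print("Best parameters:", grid_pipe.best_params_)
print("Best cross-validated MSE:", -grid_pipe.best_score_)
```

With standardization the difference is often small, but the pipeline version keeps the cross-validation estimate free of information from the held-out folds.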





## SVR Model Validation and Error Metrics

To evaluate the performance of our SVR models and understand how well they predict the continuous target variable on unseen data, we use **Cross-Validation** and specific regression error metrics. k-Fold Cross-Validation gives a more reliable estimate than a single train/test split, because the model is trained and evaluated on k different partitions of the data and every sample is used for evaluation exactly once.

We will use the following error metrics to evaluate the tuned SVR models:

*   **Mean Squared Error (MSE):** Measures the average of the squared differences between the predicted and actual values. Lower MSE indicates better performance.
    $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
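*   **R-squared ($R^2$) Score:** Represents the proportion of the variance in the target variable that is predictable from the features. A score closer to 1 indicates a better fit; a score of 0 matches always predicting the mean $\bar{y}$, and the score can be negative for a model that does worse than that.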
    $R^2 = 1 - \frac{\text{Sum of Squares of Residuals}}{\text{Total Sum of Squares}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

Reported together, MSE gives the error in squared target units and $R^2$ gives a scale-free measure of fit. Averaging both over the cross-validation folds reduces the dependence of the assessment on any single split.
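
As a small worked example of the two formulas, the sketch below computes MSE and $R^2$ by hand on four made-up values and checks them against scikit-learn's `mean_squared_error` and `r2_score`. The arrays `y_true` and `y_pred` are invented for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.0])

residuals = y_true - y_pred                      # [0.5, -0.5, 1.0, 0.0]

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)             # (0.25 + 0.25 + 1.0 + 0.0) / 4 = 0.375

# R^2: 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)                  # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean is 6.0, so 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_res / ss_tot                  # 1 - 1.5 / 20 = 0.925

print(f"MSE = {mse_manual:.3f}, R-squared = {r2_manual:.3f}")
print("Matches scikit-learn:",
      np.isclose(mse_manual, mean_squared_error(y_true, y_pred)),
      np.isclose(r2_manual, r2_score(y_true, y_pred)))
```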

In [22]:
from sklearn.model_selection import cross_validate

# Retrieve the best estimators from the completed GridSearchCV objects
best_svr_linear = grid_search_linear_reg.best_estimator_
best_svr_poly = grid_search_poly_reg.best_estimator_
best_svr_rbf = grid_search_rbf_reg.best_estimator_
best_svr_sigmoid = grid_search_sigmoid_reg.best_estimator_

# Define the scoring metrics for regression
scoring_reg = ['neg_mean_squared_error', 'r2']

# Perform 5-fold cross-validation for each best SVR estimator
cv_results_linear_reg = cross_validate(best_svr_linear, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_poly_reg = cross_validate(best_svr_poly, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_rbf_reg = cross_validate(best_svr_rbf, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)
cv_results_sigmoid_reg = cross_validate(best_svr_sigmoid, X_reg_scaled, y_reg, cv=5, scoring=scoring_reg)

# Calculate the average MSE and R-squared for each kernel
average_mse_linear_reg = -cv_results_linear_reg['test_neg_mean_squared_error'].mean()
average_r2_linear_reg = cv_results_linear_reg['test_r2'].mean()

average_mse_poly_reg = -cv_results_poly_reg['test_neg_mean_squared_error'].mean()
average_r2_poly_reg = cv_results_poly_reg['test_r2'].mean()

average_mse_rbf_reg = -cv_results_rbf_reg['test_neg_mean_squared_error'].mean()
average_r2_rbf_reg = cv_results_rbf_reg['test_r2'].mean()

average_mse_sigmoid_reg = -cv_results_sigmoid_reg['test_neg_mean_squared_error'].mean()
average_r2_sigmoid_reg = cv_results_sigmoid_reg['test_r2'].mean()

# Print the average MSE and R-squared for each SVR kernel
print("Average Cross-validation Results for Tuned SVR Models:")
print(f"Linear SVR: MSE = {average_mse_linear_reg:.4f}, R-squared = {average_r2_linear_reg:.4f}")
print(f"Polynomial SVR: MSE = {average_mse_poly_reg:.4f}, R-squared = {average_r2_poly_reg:.4f}")
print(f"RBF SVR: MSE = {average_mse_rbf_reg:.4f}, R-squared = {average_r2_rbf_reg:.4f}")
print(f"Sigmoid SVR: MSE = {average_mse_sigmoid_reg:.4f}, R-squared = {average_r2_sigmoid_reg:.4f}")

Average Cross-validation Results for Tuned SVR Models:
Linear SVR: MSE = 3026.5925, R-squared = 0.4772
Polynomial SVR: MSE = 2990.0804, R-squared = 0.4831
RBF SVR: MSE = 2937.4176, R-squared = 0.4923
Sigmoid SVR: MSE = 3074.8171, R-squared = 0.4697
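
The average MSE values above are identical to the `best_score_` values reported by the grid searches. This is expected: with `cv=5` and a regressor, both `GridSearchCV` and `cross_validate` split the data into the same five unshuffled folds, so these are the very scores that were used to pick the hyperparameters and are likely to be somewhat optimistic. One common remedy is nested cross-validation, sketched below under a few assumptions: the data is reloaded as `X_demo`/`y_demo`, only the RBF kernel is tuned, and the grid and fold counts are reduced to keep the run short.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_demo, y_demo = load_diabetes(return_X_y=True)

# Inner loop selects hyperparameters; outer loop scores the whole tuning procedure
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

inner_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    {"svr__C": [1, 10, 100], "svr__gamma": [0.01, 0.1]},
    cv=inner_cv,
    scoring="neg_mean_squared_error",
)

# Each outer fold refits the grid search on its training part only,
# so the outer test fold never takes part in hyperparameter selection
nested_scores = cross_val_score(
    inner_search, X_demo, y_demo, cv=outer_cv, scoring="neg_mean_squared_error"
)
nested_mse = -nested_scores

print("Nested CV MSE per outer fold:", nested_mse.round(1))
print("Mean nested CV MSE:", nested_mse.mean().round(1))
```

The nested estimate describes the tuning procedure as a whole rather than one fixed set of hyperparameters, since each outer fold may select different values.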


## Feature Importance in Linear SVR

For Support Vector Regression models employing a **linear kernel**, the concept of feature importance can be derived directly from the model's **coefficients**. A linear SVR model fits a linear function to predict the target, and the coefficients represent the weights assigned to each input feature.

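*   The **absolute value of a coefficient** indicates the magnitude of a feature's influence on the predicted target value. A larger absolute coefficient means the feature has a stronger impact on the prediction. Because the features were standardized, each coefficient is expressed per standard deviation of its feature, which is what makes the magnitudes comparable across features.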
*   The **sign of a coefficient** indicates the direction of the relationship: a positive sign suggests that increasing the feature's value increases the predicted target, while a negative sign suggests the opposite.

By examining the absolute values of these coefficients, we can rank features by their importance in the linear SVR model. This interpretation is specific to the linear kernel; for non-linear kernels, assessing feature importance is generally more complex.

In [23]:
import pandas as pd

# Access the coefficients of the trained best_svr_linear model
# The coefficients are stored in the 'coef_' attribute. Since it's a single output, we take the first row [0].
feature_importance_linear_reg = pd.Series(best_svr_linear.coef_[0], index=X_reg.columns)

# Calculate the absolute value to represent the magnitude of feature importance
feature_importance_linear_reg = feature_importance_linear_reg.abs()

# Sort the feature importance values in descending order
sorted_feature_importance_linear_reg = feature_importance_linear_reg.sort_values(ascending=False)

# Print and display the sorted feature importance
print("Sorted Feature Importance for Linear SVR:")
display(sorted_feature_importance_linear_reg)

Sorted Feature Importance for Linear SVR:


s5     21.541068
bmi    20.861853
bp     15.922391
sex    11.955362
s3      8.627998
s4      6.509319
s2      4.805351
s6      4.200074
s1      3.949230
age     0.104373
dtype: float64
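
Coefficients are not available for the non-linear kernels, since `coef_` only exists for `kernel='linear'`. One model-agnostic alternative is permutation importance, which measures how much a score degrades when a single feature's values are shuffled. The sketch below is illustrative rather than part of the original analysis: it assumes a single train/test split, reuses the RBF parameters reported by the grid search above (`C=100`, `gamma=0.01`), and introduces the names `X_demo`, `y_demo`, `model`, and `perm_importance`.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

data = load_diabetes(as_frame=True)
X_demo, y_demo = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# RBF SVR with the hyperparameters selected by the grid search in this notebook
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100, gamma=0.01))
model.fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the mean drop in score
result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_mean_squared_error", n_repeats=10, random_state=0,
)

perm_importance = pd.Series(result.importances_mean, index=X_demo.columns)
perm_importance = perm_importance.sort_values(ascending=False)
print(perm_importance)
```

Values near zero (or slightly negative) suggest that shuffling the feature barely changes the held-out error; correlated features can share importance, so the ranking should be read with some care.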

## Summary:

### Data Analysis Key Findings

*   The breast cancer dataset has 569 instances and 31 columns (30 numerical features plus the target), with no missing values. The features were scaled for SVM.
*   For breast cancer classification, after hyperparameter tuning with GridSearchCV and 5-fold cross-validation:
    *   The Polynomial kernel achieved the highest average accuracy (0.9807) and F1-score (0.9849), along with a high average recall (0.9972).
    *   The RBF kernel performed comparably well (Accuracy: 0.9789, F1-score: 0.9834).
    *   The Linear kernel also showed strong performance (Accuracy: 0.9754, F1-score: 0.9806).
    *   The Sigmoid kernel had slightly lower average scores across metrics (Accuracy: 0.9701, F1-score: 0.9765).
*   For linear SVM classification on the breast cancer dataset, features like `worst concave points`, `worst perimeter`, and `mean concave points` have the highest absolute coefficients, indicating they are the most influential in the linear decision boundary.
*   The diabetes dataset for regression contains 442 instances and 10 numerical features, along with a continuous target variable. The features were standardized.
*   For diabetes regression, after hyperparameter tuning with GridSearchCV and 5-fold cross-validation using negative mean squared error:
    *   The RBF kernel achieved the lowest average Mean Squared Error (2937.42) and the highest average R-squared score (0.4923), indicating the best performance among the tested kernels.
    *   The Polynomial kernel had the next best average MSE (2990.08) and R-squared (0.4831).
    *   The Linear kernel followed with average MSE (3026.59) and R-squared (0.4772).
    *   The Sigmoid kernel had the highest average MSE (3074.82) and lowest average R-squared (0.4697), indicating the poorest performance.
*   For linear SVR on the diabetes dataset, features such as `s5` (possibly log of serum triglycerides level), `bmi` (body mass index), and `bp` (average blood pressure) have the largest absolute coefficients, suggesting they are the most important features influencing the linear prediction of disease progression.

### Insights or Next Steps

*   For the breast cancer classification task, the Polynomial and RBF kernels appear to be the best choices based on the evaluation metrics. Further analysis could involve comparing these models with other classification algorithms.
*   For the diabetes regression task, the RBF kernel gave the best scores among the tested SVR kernels, although the margin is small (average $R^2$ between 0.47 and 0.49 for all four kernels). The `epsilon` hyperparameter was not included in the grids; since it sets the width of the error-insensitive tube in target units and can noticeably affect SVR performance, adding it to the search is a natural next step.
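
As a hedged sketch of that last step, `epsilon` can simply be added to the parameter grid. The candidate values below are assumptions chosen for illustration, not tuned recommendations; scikit-learn's default is `epsilon=0.1`, which is very small relative to a target that ranges from 25 to 346. The names `X_demo`, `y_demo`, `param_grid_eps`, and `grid_eps` are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_demo, y_demo = load_diabetes(return_X_y=True)

# epsilon is in the units of the target; residuals smaller than epsilon incur no loss
param_grid_eps = {
    "svr__C": [10, 100],
    "svr__gamma": [0.01],
    "svr__epsilon": [0.1, 1, 5, 10],  # illustrative candidates
}

grid_eps = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    param_grid_eps,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_eps.fit(X_demo, y_demo)

print("Best parameters with epsilon in the grid:", grid_eps.best_params_)
print("Best cross-validated MSE:", -grid_eps.best_score_)
```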
