# Support Vector Machine (SVM) and Model Ensemble {-}

In this notebook, we will: 
- Use GridSearchCV to find the best set of SVM hyperparameters.
- Build, train and evaluate the SVM model.
- Separately build, train and evaluate the other four classifiers (Logistic regression, Naive Bayes, Decision Tree, Random Forest) on the same dataset, then compare their performance with the SVM model's.
- Apply three model ensemble techniques, i.e., Bagging, Boosting and Stacking, to solve the problem, then compare their performance with each other and with the individual models. Draw conclusions from what has been observed.

The dataset we will be working on is 'data-breast-cancer.csv'. Each row holds a diagnosis label and ten numeric measurements that serve as features for the prediction model.

In [41]:
# Load the libraries
import pandas as pd
import numpy as np

In [42]:
# Load the dataset
df = pd.read_csv("data-breast-cancer.csv")

In [43]:
# Show some data samples
df.head()

Unnamed: 0.1,Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871
1,1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667
2,2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999
3,3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744
4,4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883


This dataset is used to predict whether breast tissue is malignant or benign. It contains one label column followed by ten feature columns:

- diagnosis: (label) the diagnosis of breast tissues (M = malignant, B = benign).
- radius: distances from center to points on the perimeter.
- texture: standard deviation of gray-scale values.
- perimeter: perimeter of the tumor.
- area: area of the tumor.
- smoothness: local variation in radius lengths.
- compactness: is equal to (perimeter^2 / area - 1.0).
- concavity: severity of concave portions of the contour.
- concave points: number of concave portions of the contour.
- symmetry: symmetry of the tumor shape.
- fractal dimension: "coastline approximation" - 1.



# Analyze Data

In [44]:
# Inspect column types, non-null counts and the overall shape
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              569 non-null    int64  
 1   diagnosis               569 non-null    object 
 2   radius_mean             569 non-null    float64
 3   texture_mean            569 non-null    float64
 4   perimeter_mean          569 non-null    float64
 5   area_mean               569 non-null    float64
 6   smoothness_mean         569 non-null    float64
 7   compactness_mean        569 non-null    float64
 8   concavity_mean          569 non-null    float64
 9   concave points_mean     569 non-null    float64
 10  symmetry_mean           569 non-null    float64
 11  fractal_dimension_mean  569 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.5+ KB


(569, 12)

In [45]:
# ignore_index=True relabels the remaining rows 0, 1, ..., n-1; without it the index keeps gaps where duplicates were removed.
# Try df = df.drop_duplicates(), then df.head(1000), to see the difference.
df = df.drop_duplicates(ignore_index=True)
df = df.iloc[:, 1:]    # Drop the leftover "Unnamed: 0" column
df.shape

(569, 11)
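The effect of `ignore_index=True` is easiest to see on a tiny frame. The sketch below uses made-up toy data (not the breast cancer dataset) with two duplicated rows and compares the row labels that survive with and without the flag.

```python
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 duplicate rows 0 and 2.
toy = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "y", "z"]})

kept_labels = toy.drop_duplicates()                  # original row labels survive
relabelled = toy.drop_duplicates(ignore_index=True)  # labels are reset to 0..n-1

print(list(kept_labels.index))   # [0, 2, 4]
print(list(relabelled.index))    # [0, 1, 2]
```

Both results hold the same rows; only the index differs, which matters later when a boolean mask built on one frame is applied to another.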

# Remove outliers and clean the data

In [46]:
# Keep rows whose value in every feature column lies strictly between the
# 2nd and 98th percentiles, with the percentiles computed on the full dataset.
df_clean = df.copy()
for i in range(1, 11):                    # Columns 1..10 are the features; column 0 is the label
    q1 = df.iloc[:, i].quantile(0.02)     # Lower bound: 2nd percentile
    q2 = df.iloc[:, i].quantile(0.98)     # Upper bound: 98th percentile
    # Build the mask from df_clean itself so its index matches the frame being filtered
    df_clean = df_clean[(df_clean.iloc[:, i] > q1) & (df_clean.iloc[:, i] < q2)]



In [47]:
df_clean

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
1,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667
2,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999
4,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883
6,M,18.25,19.98,119.60,1040.0,0.09463,0.10900,0.11270,0.07400,0.1794,0.05742
7,M,13.71,20.83,90.20,577.9,0.11890,0.16450,0.09366,0.05985,0.2196,0.07451
...,...,...,...,...,...,...,...,...,...,...,...
560,B,14.05,27.15,91.38,600.4,0.09929,0.11260,0.04462,0.04304,0.1537,0.06171
563,M,20.92,25.09,143.00,1347.0,0.10990,0.22360,0.31740,0.14740,0.2149,0.06879
564,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623
565,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533
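The quantile trimming applied above can also be written without an explicit loop. The following is a minimal sketch, assuming the same rule as the notebook (strict inequalities, quantiles computed on the full input); the helper name `trim_by_quantiles` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def trim_by_quantiles(frame, columns, lower=0.02, upper=0.98):
    """Keep rows whose value in every listed column lies strictly
    between that column's lower and upper quantiles."""
    bounds = frame[columns].quantile([lower, upper])   # one row per quantile
    inside = (frame[columns] > bounds.loc[lower]) & (frame[columns] < bounds.loc[upper])
    return frame[inside.all(axis=1)]

# Hypothetical toy data: two mirrored numeric columns and a label column.
toy = pd.DataFrame({"label": ["B"] * 101,
                    "x": range(101),
                    "y": range(100, -1, -1)})
trimmed = trim_by_quantiles(toy, ["x", "y"])
print(len(toy), len(trimmed))
```

On the notebook's data the call would look like `trim_by_quantiles(df, df.columns[1:])`. Because the bounds are computed per column on the untrimmed frame, a row is dropped as soon as any single feature falls in a tail, so the total share of removed rows can be well above 4%.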


In [48]:
# Separate data features by removing the data label.
X = df_clean.drop(columns=["diagnosis"])

# Assign data label to variable y
y = df_clean.diagnosis

# Split train/test with a random state
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10, train_size=0.8)

In [49]:
df_clean[df_clean.isna().any(axis=1)]

Unnamed: 0,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean


In [50]:
# Initialize and use StandardScaler to normalize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_normalized_train = scaler.fit_transform(X_train)     # Fit and transform the training data
X_normalized_test = scaler.transform(X_test)           # Only transform the test data.
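To make the fit/transform split concrete, here is a minimal standard-library sketch of what standardization does to a single feature: the mean and (population) standard deviation are estimated on the training values only and then reused for the test values. The numbers and helper names are illustrative assumptions, not taken from the dataset.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Estimate the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 14.0, 16.0, 18.0]   # hypothetical training values
test = [11.0, 20.0]                      # hypothetical test values

mean, std = fit_standardizer(train)      # statistics come from the training data only
train_z = standardize(train, mean, std)
test_z = standardize(test, mean, std)    # test data reuses the training statistics

print(train_z)
print(test_z)
```

The standardized training values have mean 0 and unit variance by construction; the test values generally do not, which is expected, since refitting on the test set would leak information about it into preprocessing.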

# Use GridSearchCV to find the best set of SVM hyperparameters

In [51]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_normalized_train, y_train)

In [52]:
# Show evaluation metrics on the test set
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.92      0.97      0.94        61
           M       0.93      0.83      0.88        30

    accuracy                           0.92        91
   macro avg       0.92      0.90      0.91        91
weighted avg       0.92      0.92      0.92        91
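The numbers in the report follow from simple counts. The sketch below recomputes them under the assumption that 59 of the 61 benign and 25 of the 30 malignant test samples were classified correctly; these counts are inferred from the rounded report above, not printed by the notebook.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from its error counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts (assumption): rows are true classes, names say what was predicted.
b_correct, b_as_m = 59, 2     # 61 benign test samples
m_correct, m_as_b = 25, 5     # 30 malignant test samples

b_scores = class_metrics(tp=b_correct, fp=m_as_b, fn=b_as_m)
m_scores = class_metrics(tp=m_correct, fp=b_as_m, fn=m_as_b)
accuracy = (b_correct + m_correct) / (b_correct + b_as_m + m_correct + m_as_b)

print("B:", [round(s, 2) for s in b_scores])
print("M:", [round(s, 2) for s in m_scores])
print("accuracy:", round(accuracy, 2))
```

Reading the report this way highlights that recall on class M is the weakest figure: five malignant cases are predicted as benign, which is usually the more costly kind of error in a screening setting.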



In [53]:
from sklearn.model_selection import GridSearchCV
param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
              "gamma": ["scale", 0.001, 0.005, 0.1]}

gridsearch = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", verbose=1)     # cv: number of folds in cross validation.

In [54]:
gridsearch.fit(X_normalized_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
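The "24 candidates, 120 fits" line comes directly from the size of the grid. As a small illustration, the sketch below enumerates the same `param_grid` with `itertools.product`; this mirrors how an exhaustive grid search expands a dictionary of lists, though it is not the library's own code.

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 1, 10, 100, 1000],
              "gamma": ["scale", 0.001, 0.005, 0.1]}
n_folds = 5

# Every combination of one C value with one gamma value is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds   # each candidate is trained once per fold

print(len(candidates), total_fits)       # 24 120
print(candidates[0])
```

The cost grows multiplicatively with every added hyperparameter, which is why the logistic regression grid later in the notebook reaches 1600 candidates.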


In [55]:
gridsearch.best_params_

{'C': 1000, 'gamma': 0.005}

# Build, train and evaluate the SVM model

In [56]:
model = SVC(C=gridsearch.best_params_['C'], gamma=gridsearch.best_params_['gamma'])
model.fit(X_normalized_train, y_train)

In [57]:
print(classification_report(y_test, model.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.94      0.97      0.95        61
           M       0.93      0.87      0.90        30

    accuracy                           0.93        91
   macro avg       0.93      0.92      0.92        91
weighted avg       0.93      0.93      0.93        91



# Separately build, train and evaluate the other four classifiers (Logistic regression, Naive Bayes, Decision Tree, Random Forest) on the same dataset, then compare their performance with the SVM model's

## Logistic Regression

In [58]:
from sklearn.linear_model import LogisticRegression
param_grid = [    
    {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
    'C' : np.logspace(-4, 4, 20),
    'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100, 1000,2500, 5000]
    }
]
logisticmodel = LogisticRegression()
logisticmodel_cv = GridSearchCV(logisticmodel, param_grid = param_grid, cv = 5, scoring="accuracy", verbose=True)

logisticmodel_cv.fit(X_normalized_train, y_train)


Fitting 5 folds for each of 1600 candidates, totalling 8000 fits


5200 fits failed out of a total of 8000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
400 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/homebrew/lib/python3.12/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py", line 1172, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
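The failed fits reported above come from solver and penalty values that cannot be combined. The sketch below counts them under an assumed compatibility table: recent scikit-learn releases reject the string `'none'`, `elasticnet` needs an `l1_ratio` that this grid does not supply, and `l1` is only handled by `liblinear` and `saga`. The table is an assumption about the installed version, but it reproduces the 5200 failed fits in the log.

```python
from itertools import product

penalties = ["l1", "l2", "elasticnet", "none"]
solvers = ["lbfgs", "newton-cg", "liblinear", "sag", "saga"]
n_C, n_max_iter, n_folds = 20, 4, 5

# Assumed: solvers that can fit each penalty with the grid exactly as written.
works = {
    "l1": {"liblinear", "saga"},
    "l2": {"lbfgs", "newton-cg", "liblinear", "sag", "saga"},
    "elasticnet": set(),   # would need l1_ratio, which the grid never sets
    "none": set(),         # the string 'none' is not accepted
}

pairs = list(product(penalties, solvers))
valid_pairs = [(p, s) for p, s in pairs if s in works[p]]
invalid_pairs = [(p, s) for p, s in pairs if s not in works[p]]
failed_fits = len(invalid_pairs) * n_C * n_max_iter * n_folds

print(len(valid_pairs), len(invalid_pairs), failed_fits)   # 7 13 5200

# The same search restricted to valid pairs, in list-of-dicts form.
valid_grid = [
    {"penalty": ["l1", "l2"], "solver": ["liblinear", "saga"]},
    {"penalty": ["l2"], "solver": ["lbfgs", "newton-cg", "sag"]},
]
```

Adding the `C` and `max_iter` lists to each dictionary in `valid_grid` would give a grid that searches only the seven workable pairs, cutting the run from 8000 fits to 2800.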

In [59]:
logisticmodel_cv.best_params_

{'C': 0.23357214690901212,
 'max_iter': 100,
 'penalty': 'l2',
 'solver': 'liblinear'}

In [60]:
logisticmodel = LogisticRegression(C=logisticmodel_cv.best_params_['C'], max_iter=logisticmodel_cv.best_params_['max_iter'], penalty=logisticmodel_cv.best_params_['penalty'], solver=logisticmodel_cv.best_params_['solver'])
logisticmodel.fit(X_normalized_train, y_train)

In [61]:
print(classification_report(y_test, logisticmodel.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.89      0.97      0.93        61
           M       0.92      0.77      0.84        30

    accuracy                           0.90        91
   macro avg       0.91      0.87      0.88        91
weighted avg       0.90      0.90      0.90        91



Logistic regression performs slightly worse than the tuned SVM (accuracy 0.90 vs. 0.93).

## Naive Bayes

In [62]:
from sklearn.naive_bayes import GaussianNB

grid_search={"var_smoothing":[1, 1e-3, 1e-6, 1e-9]}
naive_model = GaussianNB()
naive_model_cv = GridSearchCV(naive_model, grid_search, cv=5, scoring="accuracy")

naive_model_cv.fit(X_normalized_train, y_train)

In [63]:
naive_model_cv.best_params_

{'var_smoothing': 1e-06}

### Train

In [64]:
naive_model = GaussianNB(var_smoothing=naive_model_cv.best_params_['var_smoothing'])
naive_model.fit(X_normalized_train, y_train)

In [65]:
print(classification_report(y_test, naive_model.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.89      0.95      0.92        61
           M       0.88      0.77      0.82        30

    accuracy                           0.89        91
   macro avg       0.89      0.86      0.87        91
weighted avg       0.89      0.89      0.89        91



Naive Bayes performs slightly worse than the tuned SVM (accuracy 0.89 vs. 0.93).

## Decision Tree

### GridSearchCV

In [66]:
from sklearn.tree import DecisionTreeClassifier

params = {"criterion": ["gini", "entropy"],             # Criterion to evaluate the purity.
         "max_depth": [3, 5],                           # Maximum depth of the tree
         "min_samples_split": [4, 8]}                   # Stop splitting condition.

grid_search_dt = GridSearchCV(estimator=DecisionTreeClassifier(), param_grid=params, cv=5, scoring="accuracy")

In [67]:
grid_search_dt.fit(X_normalized_train, y_train)
grid_search_dt.best_params_

{'criterion': 'entropy', 'max_depth': 5, 'min_samples_split': 4}

### Train

In [68]:
model_dt = DecisionTreeClassifier(criterion=grid_search_dt.best_params_['criterion'], max_depth=grid_search_dt.best_params_['max_depth'], min_samples_split=grid_search_dt.best_params_['min_samples_split'])
model_dt.fit(X_normalized_train, y_train)

In [69]:
print(classification_report(y_test, model_dt.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.93      0.93      0.93        61
           M       0.87      0.87      0.87        30

    accuracy                           0.91        91
   macro avg       0.90      0.90      0.90        91
weighted avg       0.91      0.91      0.91        91



The decision tree performs slightly worse than the tuned SVM (accuracy 0.91 vs. 0.93).

## Random Forest

### GridSearchCV

In [70]:
from sklearn.ensemble import RandomForestClassifier

params = {"criterion": ["gini", "entropy"],             # Criterion to evaluate the purity.
         "max_depth": [7, 9, 11],                           # Maximum depth of the tree
         "min_samples_split": [8, 12, 16]}                   # Stop splitting condition.

grid_search_rf = GridSearchCV(estimator=RandomForestClassifier(n_estimators=10, n_jobs=10), param_grid=params, cv= 5, scoring="accuracy") # Number of trees in the forest is 10

In [71]:
grid_search_rf.fit(X_normalized_train, y_train)
grid_search_rf.best_params_

{'criterion': 'gini', 'max_depth': 9, 'min_samples_split': 12}

### Train

In [72]:
model_rf = RandomForestClassifier(n_estimators=10, random_state=1, criterion=grid_search_rf.best_params_['criterion'], max_depth=grid_search_rf.best_params_['max_depth'], min_samples_split=grid_search_rf.best_params_['min_samples_split'])    
model_rf.fit(X_normalized_train, y_train)

In [73]:
print(classification_report(y_test, model_rf.predict(X_normalized_test)))

              precision    recall  f1-score   support

           B       0.89      0.95      0.92        61
           M       0.88      0.77      0.82        30

    accuracy                           0.89        91
   macro avg       0.89      0.86      0.87        91
weighted avg       0.89      0.89      0.89        91



The random forest performs worse than the tuned SVM (accuracy 0.89 vs. 0.93).

# Apply three model ensemble techniques, i.e., Bagging, Boosting and Stacking, to solve the problem, then compare their performance with each other and with the use of individual models. Draw conclusions from what has been observed.

In [74]:
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

## Bagging

In [75]:
base_svm = SVC(kernel='linear', C=1.0)
bagging_clf = BaggingClassifier(estimator=base_svm, n_estimators=10, max_samples=0.5)
bagging_clf.fit(X_train, y_train)

In [76]:
# Making predictions on the test set
y_pred = bagging_clf.predict(X_test)

# Evaluating the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8681318681318682


## Boosting

In [85]:
ada_clf = AdaBoostClassifier(estimator = DecisionTreeClassifier(), n_estimators=10)

# Train the AdaBoost Classifier
ada_clf.fit(X_train, y_train)

# Making predictions on the test set
y_pred_ada = ada_clf.predict(X_test)

# Evaluating the accuracy of the model
accuracy_ada = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Classifier Accuracy:", accuracy_ada)

AdaBoost Classifier Accuracy: 0.8901098901098901




## Stacking

In [78]:
from sklearn.ensemble import VotingClassifier

In [82]:
svm_best = gridsearch.best_estimator_
rf_best = grid_search_rf.best_estimator_
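log_best = logisticmodel_cv    # Whole GridSearchCV object, so fitting the ensemble reruns the 1600-candidate search; logisticmodel_cv.best_estimator_ would pass only the tuned model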
dt_best = grid_search_dt.best_estimator_

In [83]:
estimators=[('dt', dt_best), ('svm', svm_best), ('rf', rf_best), ('log_reg', log_best)]    # Initialize base models in the ensemble
ensemble = VotingClassifier(estimators, voting='hard')  

In [84]:
ensemble.fit(X_train, y_train)          # Train the ensemble on the training set
ensemble.score(X_test, y_test)

Fitting 5 folds for each of 1600 candidates, totalling 8000 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.9010989010989011
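The ensemble above combines its base models by hard voting. Stacking in the stricter sense trains a meta-model on the base models' cross-validated predictions, which scikit-learn provides as `StackingClassifier`. The following is a minimal sketch on synthetic data; the base models, their settings and the `_demo` names are illustrative assumptions rather than the tuned estimators from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split like the notebook's 80/20 split.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, train_size=0.8, random_state=10)

base_models = [
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),   # scaling kept inside the model
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
]

# The final estimator learns how to weight the base models' out-of-fold predictions.
stack_demo = StackingClassifier(estimators=base_models,
                                final_estimator=LogisticRegression(),
                                cv=5)
stack_demo.fit(Xtr, ytr)

stack_accuracy = stack_demo.score(Xte, yte)
print("Stacking accuracy:", stack_accuracy)
```

Unlike hard voting, where every base model counts equally, the meta-model here can learn to rely more on whichever base model is most informative, at the cost of extra cross-validated fits.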

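Observations: Bagging (accuracy 0.87) and Boosting (0.89) do not perform as well as the voting ensemble built in the Stacking section (0.90), yet that ensemble is still less accurate than the tuned SVM on its own (0.93), which is interesting. Two caveats apply to this comparison: the ensembles were fit on the unscaled X_train while the individual models used the standardized features, and the Stacking section uses VotingClassifier with hard voting rather than a stacked meta-learner.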