
# Optimizing SVM with GridSearchCV on the Breast Cancer Dataset
This project demonstrates how to optimize a Support Vector Machine (SVM) model using GridSearchCV on the Breast Cancer Dataset from scikit-learn. The goal is to find the best hyperparameters for the SVM model using cross-validation to improve classification accuracy.

In [23]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [None]:
x, y = datasets.load_breast_cancer(return_X_y=True)  # returns the feature matrix and the labels as NumPy arrays
print(x.shape)
print(y.shape)

(569, 30)
(569,)


## Important point
This notebook splits the data in two stages:
- `train_test_split` separates the test data from the training data.
- k-fold cross-validation then separates validation folds from the training data.

So the data ends up in three parts: training, validation, and test.

In [None]:
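scaler = StandardScaler()
# Note: fitting the scaler on the full dataset lets test-set statistics leak into
# the scaling; the stricter approach is to fit it on the training split only.
x = scaler.fit_transform(x)  # standardize each feature to zero mean and unit variance
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)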
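
One way to keep scaling statistics away from the validation and test data is to wrap the scaler and the model in a pipeline. The following is a minimal sketch, not the notebook's own method; the names `X_raw`, `pipe`, and the `_raw` suffixed variables are introduced here for illustration only, and the scores it prints may differ slightly from the outputs recorded in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_raw, y_raw = load_breast_cancer(return_X_y=True)

# Split the unscaled data first so the test set stays untouched.
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# The pipeline refits StandardScaler on the training part of every fold,
# so validation and test samples never influence the scaling statistics.
pipe = make_pipeline(StandardScaler(), SVC())

cv_result = cross_validate(pipe, X_train_raw, y_train_raw, cv=5, scoring="accuracy")
print("Mean CV accuracy:", cv_result["test_score"].mean())

# Fit on the whole training split, then evaluate once on the held-out test set.
pipe.fit(X_train_raw, y_train_raw)
pipe_test_accuracy = pipe.score(X_test_raw, y_test_raw)
print("Test accuracy:", pipe_test_accuracy)
```

The same `pipe` object could also be passed to `GridSearchCV` in place of a bare `SVC()`, in which case the grid keys need the step prefix (for example `svc__C`).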

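## 1- cross_validate & k-fold
Here we evaluate the model with the `cross_validate` function.
Instead of a single train/validation split, we use k-fold cross-validation: the training data is divided into k folds and the model is trained k times, each time validating on a different fold. Averaging over the k folds gives a more reliable estimate than a single split.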
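
To make the k-fold idea concrete, the loop below sketches roughly what happens behind a cross-validation call. It is an illustration under assumptions: it uses plain `KFold` with shuffling, whereas `cross_validate` with an integer `cv` and a classifier uses stratified folds, so the numbers will not match the notebook's output exactly. The names `X_demo`, `y_demo`, and `fold_scores` are introduced here for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []

for train_idx, val_idx in kfold.split(X_demo):
    # A fresh model per fold: train on k-1 folds, validate on the remaining one.
    fold_model = make_pipeline(StandardScaler(), SVC())
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

print(fold_scores)           # one accuracy per fold
print(np.mean(fold_scores))  # average accuracy across the folds
```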



In [None]:
# SVC is the model, cv is the number of folds, scoring is the evaluation metric
result1 = cross_validate(SVC(), x_train, y_train, cv=5, scoring='accuracy')
print(result1['test_score'])  # accuracy on each fold
print(np.mean(result1['test_score']))  # mean accuracy across the folds


[0.97802198 0.96703297 0.98901099 0.98901099 0.94505495]
0.9736263736263737


In [None]:
model = SVC()
model.fit(x_train, y_train)

In [None]:
test_accuracy = model.score(x_test, y_test)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.9736842105263158


## 2- GridSearchCV & hyperparameters
Kernel choice:
- 'rbf' (default, for data with complex and unknown structure)  
- 'poly' (for data with polynomial relationships)  
- 'linear' (for linearly separable data)  
- 'sigmoid' (for neural network-like behavior)  


Hyperparameters:
(There are others as well; these are the ones we examine here.)
- 'C' (regularization parameter; controls the trade-off between a wide margin and misclassified training points: a small C gives a softer margin, a large C a harder one)
- 'degree' (polynomial degree; only used by the 'poly' kernel)
- 'gamma' (kernel coefficient for 'rbf', 'poly' and 'sigmoid'; controls how far the influence of a single training point reaches)
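
As a rough illustration of how one of these hyperparameters changes the model, the sketch below cross-validates an RBF SVM for several `gamma` values with `C` held at 1.0. The variable names and the choice of `C=1.0` are assumptions made for this example and are not part of the search run later in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

gamma_scores = {}
for gamma in [0.001, 0.01, 0.1, 1, 10]:
    # Only gamma changes between runs; the kernel and C stay fixed.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=gamma))
    gamma_scores[gamma] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f"gamma={gamma}: mean CV accuracy = {gamma_scores[gamma]:.3f}")
```

Very large `gamma` values make each training point's influence so local that the model tends to memorize the training set, which typically shows up as a drop in validation accuracy.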

GridSearchCV:
Like cross_validate, it scores a model with k-fold cross-validation, but it does so for every combination of values in a hyperparameter grid and keeps the combination with the best mean validation score.
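
The search can be pictured as a loop over all combinations in the grid. The sketch below imitates that loop by hand with `ParameterGrid` on a deliberately small grid; `small_grid` and the other names are hypothetical and chosen for illustration, and this is a simplified picture rather than how `GridSearchCV` is implemented internally.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# A small illustrative grid: 2 x 2 x 1 = 4 combinations.
small_grid = {"C": [1, 10], "gamma": [0.001, 0.01], "kernel": ["rbf"]}

best_score, best_params = -1.0, None
for params in ParameterGrid(small_grid):
    clf = make_pipeline(StandardScaler(), SVC(**params))
    score = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    # Keep the combination with the highest mean validation accuracy.
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

In addition to this loop, `GridSearchCV` refits the best combination on all the data passed to `fit` by default (`refit=True`), which is why `best_estimator_` can be used directly for prediction.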
  




In [None]:
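# 'degree' only affects the 'poly' kernel; for 'rbf' the four degree values repeat the same fit
hyperparameters = {'C': [30, 60, 90, 100, 120], 'degree': [1, 2, 3, 4], 'gamma': [0.001, 0.01, 0.1, 1, 10], 'kernel': ['rbf', 'poly']}
result2 = GridSearchCV(SVC(), hyperparameters, cv=5, verbose=3)  # 5-fold CV over every combination

result2.fit(x_train, y_train)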

In [None]:
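print(result2.best_params_)  # best hyperparameter combination found by the search
print(result2.best_score_)  # mean cross-validated accuracy of that combination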

{'C': 30, 'degree': 1, 'gamma': 0.01, 'kernel': 'poly'}
0.9780219780219781


In [None]:
best_model = result2.best_estimator_
test_accuracy = best_model.score(x_test, y_test)
print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.9736842105263158


## 3- Ridge regression model & Lasso regression model

Regularization refers to techniques that prevent overfitting by penalizing model complexity. It appears in this notebook in three places:

 1- Lasso (L1): drives some weights to exactly zero, effectively removing uninformative features.

 2- Ridge (L2): shrinks the weights toward zero but does not make them exactly zero.

 3- The hyperparameter C of the SVM: the regularization strength is inversely proportional to C.
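
The difference between the L1 and L2 penalties can be seen by counting how many coefficients each model leaves nonzero. The following is a small sketch under assumptions: it fits both models on the full standardized dataset with the same `alpha` values used in this notebook, and the names ending in `_demo` are introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_demo = StandardScaler().fit_transform(X_demo)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Count the coefficients that each penalty leaves different from zero.
ridge_nonzero = int(np.sum(ridge_demo.coef_ != 0))
lasso_nonzero = int(np.sum(lasso_demo.coef_ != 0))

print("Ridge nonzero coefficients:", ridge_nonzero, "of", X_demo.shape[1])
print("Lasso nonzero coefficients:", lasso_nonzero, "of", X_demo.shape[1])
```

Increasing `alpha` in `Lasso` generally zeroes out more coefficients, while in `Ridge` it only shrinks them further.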



In [25]:
# Ridge regression model
# Ridge is a regressor, so score() returns R^2 on the 0/1 labels, not classification accuracy
ridge = Ridge(alpha=1.0)  # alpha (often written as lambda) is the regularization strength
ridge.fit(x_train, y_train)
ridge_train_score = ridge.score(x_train, y_train)
ridge_test_score = ridge.score(x_test, y_test)

In [26]:
# Lasso regression model; as a regressor, its score() returns R^2, not accuracy
lasso = Lasso(alpha=0.1)
lasso.fit(x_train, y_train)
lasso_train_score = lasso.score(x_train, y_train)
lasso_test_score = lasso.score(x_test, y_test)

In [27]:
print("Ridge Regression - Train Score:", ridge_train_score)
print("Ridge Regression - Test Score:", ridge_test_score)
print("Lasso Regression - Train Score:", lasso_train_score)
print("Lasso Regression - Test Score:", lasso_test_score)

Ridge Regression - Train Score: 0.7764062577006963
Ridge Regression - Test Score: 0.7358522941887385
Lasso Regression - Train Score: 0.6431052569345049
Lasso Regression - Test Score: 0.6786247259787639
