The **Breast Cancer Wisconsin** (Diagnostic) dataset consists of 569 observations with 30 numeric features and a binary outcome variable (malignant or benign). The features were computed from digitized images of fine needle aspirates (FNA) of breast masses, and they describe characteristics of the cell nuclei visible in each image.

This dataset is a popular benchmark for machine learning classification tasks and has been used in many studies to compare the performance of different algorithms. Its only label is the malignant/benign diagnosis, so it supports binary classification; it carries no information about cancer stage or subtype.

Here are some of the key features of the dataset:

* **The dataset consists** of 569 observations of breast cancer cell features.
* **The features** describe characteristics of the cells, such as radius, texture, and perimeter. Each of the ten measurements is reported as a mean, a standard error (`se`), and a worst value, which gives 30 features in total.
* **The outcome variable** is binary, indicating whether the cancer is malignant or benign.
* **The dataset is available** for download from the UCI Machine Learning Repository.
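
The same data also ships with scikit-learn, so the counts listed here can be checked without any download. This is a minimal sketch, assuming scikit-learn is installed. Note that scikit-learn's copy uses its own column names and encodes malignant as 0 and benign as 1, which differs from the CSV used in this notebook; the variable names `X_check` and `y_check` are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_check, y_check = data.data, data.target

print(X_check.shape)             # (n_samples, n_features)
print(list(data.target_names))   # class names in label order
print(np.bincount(y_check))      # number of samples per class, same order
```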



## **Getting Data**

I used the Kaggle API to download the Breast Cancer Wisconsin Diagnostic Dataset. The API provides a simple way to access data from Kaggle without having to download it manually.

In [1]:
!pip install -q kaggle

In [2]:
!mkdir -p ~/.kaggle/

In [3]:
!cp '/content/drive/MyDrive/Colab Notebooks/kaggle/kaggle.json' ~/.kaggle/

In [4]:
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
!kaggle datasets download -d utkarshx27/breast-cancer-wisconsin-diagnostic-dataset

Downloading breast-cancer-wisconsin-diagnostic-dataset.zip to /content
  0% 0.00/47.7k [00:00<?, ?B/s]
100% 47.7k/47.7k [00:00<00:00, 49.9MB/s]


In [6]:
!unzip /content/breast-cancer-wisconsin-diagnostic-dataset.zip

Archive:  /content/breast-cancer-wisconsin-diagnostic-dataset.zip
  inflating: brca.csv                


In [7]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


In [8]:
# load data
df = pd.read_csv('/content/brca.csv')
df.sample(5)

Unnamed: 0.1,Unnamed: 0,x.radius_mean,x.texture_mean,x.perimeter_mean,x.area_mean,x.smoothness_mean,x.compactness_mean,x.concavity_mean,x.concave_pts_mean,x.symmetry_mean,...,x.texture_worst,x.perimeter_worst,x.area_worst,x.smoothness_worst,x.compactness_worst,x.concavity_worst,x.concave_pts_worst,x.symmetry_worst,x.fractal_dim_worst,y
341,342,10.26,16.58,65.85,320.8,0.08877,0.08066,0.04358,0.02438,0.1669,...,22.04,71.08,357.4,0.1461,0.2246,0.1783,0.08333,0.2691,0.09479,B
194,195,12.56,19.07,81.92,485.8,0.0876,0.1038,0.103,0.04391,0.1533,...,22.43,89.02,547.4,0.1096,0.2002,0.2388,0.09265,0.2121,0.07188,B
168,169,12.18,14.08,77.25,461.4,0.07734,0.03212,0.01123,0.005051,0.1673,...,16.47,81.6,513.1,0.1001,0.05332,0.04116,0.01852,0.2293,0.06037,B
123,124,12.89,14.11,84.95,512.2,0.0876,0.1346,0.1374,0.0398,0.1596,...,17.7,105.0,639.1,0.1254,0.5849,0.7727,0.1561,0.2639,0.1178,B
95,96,12.91,16.33,82.53,516.4,0.07941,0.05366,0.03873,0.02377,0.1829,...,22.0,90.81,600.6,0.1097,0.1506,0.1764,0.08235,0.3024,0.06949,B


In [9]:
df.shape

(569, 32)

In [10]:
df.isnull().sum()

Unnamed: 0             0
x.radius_mean          0
x.texture_mean         0
x.perimeter_mean       0
x.area_mean            0
x.smoothness_mean      0
x.compactness_mean     0
x.concavity_mean       0
x.concave_pts_mean     0
x.symmetry_mean        0
x.fractal_dim_mean     0
x.radius_se            0
x.texture_se           0
x.perimeter_se         0
x.area_se              0
x.smoothness_se        0
x.compactness_se       0
x.concavity_se         0
x.concave_pts_se       0
x.symmetry_se          0
x.fractal_dim_se       0
x.radius_worst         0
x.texture_worst        0
x.perimeter_worst      0
x.area_worst           0
x.smoothness_worst     0
x.compactness_worst    0
x.concavity_worst      0
x.concave_pts_worst    0
x.symmetry_worst       0
x.fractal_dim_worst    0
y                      0
dtype: int64

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           569 non-null    int64  
 1   x.radius_mean        569 non-null    float64
 2   x.texture_mean       569 non-null    float64
 3   x.perimeter_mean     569 non-null    float64
 4   x.area_mean          569 non-null    float64
 5   x.smoothness_mean    569 non-null    float64
 6   x.compactness_mean   569 non-null    float64
 7   x.concavity_mean     569 non-null    float64
 8   x.concave_pts_mean   569 non-null    float64
 9   x.symmetry_mean      569 non-null    float64
 10  x.fractal_dim_mean   569 non-null    float64
 11  x.radius_se          569 non-null    float64
 12  x.texture_se         569 non-null    float64
 13  x.perimeter_se       569 non-null    float64
 14  x.area_se            569 non-null    float64
 15  x.smoothness_se      569 non-null    float64
 ...

In [12]:
df.columns

Index(['Unnamed: 0', 'x.radius_mean', 'x.texture_mean', 'x.perimeter_mean',
       'x.area_mean', 'x.smoothness_mean', 'x.compactness_mean',
       'x.concavity_mean', 'x.concave_pts_mean', 'x.symmetry_mean',
       'x.fractal_dim_mean', 'x.radius_se', 'x.texture_se', 'x.perimeter_se',
       'x.area_se', 'x.smoothness_se', 'x.compactness_se', 'x.concavity_se',
       'x.concave_pts_se', 'x.symmetry_se', 'x.fractal_dim_se',
       'x.radius_worst', 'x.texture_worst', 'x.perimeter_worst',
       'x.area_worst', 'x.smoothness_worst', 'x.compactness_worst',
       'x.concavity_worst', 'x.concave_pts_worst', 'x.symmetry_worst',
       'x.fractal_dim_worst', 'y'],
      dtype='object')

In [13]:
df = df.drop('Unnamed: 0', axis=1)
df.shape

(569, 31)

In [14]:
df['y'].unique()

array(['B', 'M'], dtype=object)

In [15]:
# Encode the target as 0/1 (B -> 0, M -> 1)
df["y"] = (df["y"] == "M").astype(int)
df.sample(5)

Unnamed: 0,x.radius_mean,x.texture_mean,x.perimeter_mean,x.area_mean,x.smoothness_mean,x.compactness_mean,x.concavity_mean,x.concave_pts_mean,x.symmetry_mean,x.fractal_dim_mean,...,x.texture_worst,x.perimeter_worst,x.area_worst,x.smoothness_worst,x.compactness_worst,x.concavity_worst,x.concave_pts_worst,x.symmetry_worst,x.fractal_dim_worst,y
294,15.73,11.28,102.8,747.2,0.1043,0.1299,0.1191,0.06211,0.1784,0.06259,...,14.2,112.5,854.3,0.1541,0.2979,0.4004,0.1452,0.2557,0.08181,0
306,12.54,16.32,81.25,476.3,0.1158,0.1085,0.05928,0.03279,0.1943,0.06612,...,21.4,86.67,552.0,0.158,0.1751,0.1889,0.08411,0.3155,0.07538,0
209,13.46,28.21,85.89,562.1,0.07517,0.04726,0.01271,0.01117,0.1421,0.05763,...,35.63,97.11,680.6,0.1108,0.1457,0.07934,0.05781,0.2694,0.07061,0
0,13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,...,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259,0
362,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,1


In [16]:
df.describe()

Unnamed: 0,x.radius_mean,x.texture_mean,x.perimeter_mean,x.area_mean,x.smoothness_mean,x.compactness_mean,x.concavity_mean,x.concave_pts_mean,x.symmetry_mean,x.fractal_dim_mean,...,x.texture_worst,x.perimeter_worst,x.area_worst,x.smoothness_worst,x.compactness_worst,x.concavity_worst,x.concave_pts_worst,x.symmetry_worst,x.fractal_dim_worst,y
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.372583
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,0.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [17]:
# separate the features (X) from the target (y)
X = df.iloc[:, :-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

print(X.shape)
print(y.shape)

(569, 30)
(569,)


In [19]:
# split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
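
With roughly 37% malignant cases, a purely random split can shift the class ratio between the training and test sets. The sketch below shows the `stratify` option of `train_test_split`, which keeps that ratio close in both parts. It uses scikit-learn's bundled copy of the data as a stand-in so the block runs on its own (an assumption for the example); in this notebook the same call would take the `X` and `y` arrays created earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# In scikit-learn's copy of the data, label 0 means malignant.
frac_all = (y_demo == 0).mean()
frac_train = (ytr == 0).mean()
frac_test = (yte == 0).mean()
print(f"malignant share: all={frac_all:.3f}, train={frac_train:.3f}, test={frac_test:.3f}")
```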


We want to compare several machine learning models on the Breast Cancer Wisconsin dataset to see which one performs best. For this purpose, we will use the following models:

* Support Vector Machine (SVM)
* Linear Regression
* Gaussian Naive Bayes

## **Support Vector Machine (SVM):**

SVM is a supervised learning algorithm that can be used for both classification and regression tasks. In classification, it finds the hyperplane that separates the two classes with the largest possible margin. In regression, it fits a function that keeps the data points within a small tolerance of the prediction. Here we use the classifier, `SVC`.
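
For reference, the margin idea is usually written as the soft-margin optimization problem below. This is the standard textbook form with labels $y_i \in \{-1, +1\}$; the symbols are not taken from the notebook itself.

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
$$

The margin width is $2/\lVert w\rVert$, so minimizing $\lVert w\rVert$ widens it, while $C$ sets the penalty for points that violate the margin. A large value such as the `C=10000` used in the cells that follow tolerates few violations and fits the training data more tightly.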



## **SVM** ( **kernel = rbf** )

In [20]:
#  data modeling
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Create the SVM model with an RBF kernel
# C values tried: [10, 100, 1000, 10000]; C = 10000 gave the best score
clf = SVC(kernel='rbf', C=10000)

# Train the model
clf.fit(X_train, y_train)

# Evaluate the model
print(clf.score(X_test, y_test))


0.9649122807017544
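
The RBF kernel depends on distances between samples, and the features here span very different ranges (areas in the thousands, smoothness below 1), so scaling often matters. The sketch below is one possible variant, not the method used in this notebook: it standardizes the features inside a pipeline and lets `GridSearchCV` choose `C` by cross-validation on the training set rather than by test score. It uses scikit-learn's bundled copy of the data so that it runs on its own, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler sits inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names each step after its lowercased class name ("svc").
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100, 1000]}, cv=5)
grid.fit(Xtr, ytr)

test_acc = grid.score(Xte, yte)  # accuracy on the held-out test set
print(grid.best_params_)
print(test_acc)
```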


## **SVM** ( **kernel = linear** )

In [21]:
# Create the SVM model with a linear kernel
# C values tried: [10, 100, 1000, 10000]; C = 10000 gave the best score
clf = SVC(kernel='linear', C=10000)

# Train the model
clf.fit(X_train, y_train)

# Evaluate the model
print(clf.score(X_test, y_test))

0.9473684210526315


## **Linear Regression:**

Linear regression is a supervised learning algorithm that predicts a continuous value from a set of features by learning a linear relationship between the features and the target. Here it is fit directly to the 0/1 labels, so its predictions are continuous scores rather than class labels.

In [22]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create the linear regression model
model = LinearRegression()

# Create the hyperparameter grid
param_grid = {'fit_intercept': [True, False]}

# Create the grid search
grid_search = GridSearchCV(model, param_grid, cv=5)

# Fit the grid search
grid_search.fit(X_train, y_train)


In [23]:
# Make predictions
predictions = grid_search.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, predictions)

# Print the best parameters and the mean squared error
print(f'Best parameters: {grid_search.best_params_}')
print(f'Mean squared error: {mse}')

Best parameters: {'fit_intercept': True}
Mean squared error: 0.05773149020116077


In [24]:
r2_score(y_test, predictions)

0.7542487891731787

In [25]:
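# Note: 1 - MSE / mean(y) is a normalized-error score, not a classification
# accuracy, so it is not directly comparable to the SVM scores.
percent_accuracy = (1 - (mse / y_test.mean())) * 100
print(f"Model Accuracy: {percent_accuracy}%")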

Model Accuracy: 84.69444213271552%
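
A linear regression fit to 0/1 labels returns continuous scores, so a classification accuracy comparable to the SVM numbers needs a decision rule. One common choice, shown here as a sketch, is to threshold the prediction at 0.5. The 0.5 cutoff and the use of scikit-learn's bundled data (relabelled so that malignant is 1, as in this notebook) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_demo, y_raw = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_raw  # scikit-learn uses 0 = malignant; flip so malignant = 1

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

reg = LinearRegression().fit(Xtr, ytr)
scores = reg.predict(Xte)              # continuous values, not class labels

labels = (scores >= 0.5).astype(int)   # assumed cutoff at 0.5
acc = accuracy_score(yte, labels)
print(acc)
```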


## **Gaussian Naive Bayes:**

Gaussian Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a Gaussian (normal) distribution within each class. The `var_smoothing` parameter adds a small fraction of the largest feature variance to all variances for numerical stability.

In [26]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV


param_grid = {'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3]}

# Create the Gaussian Naive Bayes model
gnb = GaussianNB()

# Hyperparameter tuning with 5-fold cross-validation
clf = GridSearchCV(gnb, param_grid, cv=5)

# Fit the grid search on the training data
clf.fit(X_train, y_train)

In [27]:
print(clf.best_params_)

{'var_smoothing': 1e-09}


In [31]:
# Create the model
gnb = GaussianNB(var_smoothing=1e-09)

# Fit the model to the data
gnb.fit(X_train, y_train)

# Evaluate the model on the test data
score = gnb.score(X_test, y_test)
print(score)

0.9210526315789473
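
Accuracy alone does not show which kind of error a model makes, and in a diagnostic setting a missed malignant case is usually the costlier one. The sketch below gives a per-class breakdown with `confusion_matrix` and `classification_report`. It again uses scikit-learn's bundled copy of the data, relabelled so that malignant is 1; both choices are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_demo, y_raw = load_breast_cancer(return_X_y=True)
y_demo = 1 - y_raw  # flip labels so that 1 = malignant, 0 = benign

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

nb = GaussianNB().fit(Xtr, ytr)
pred = nb.predict(Xte)

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(yte, pred, labels=[0, 1])
print(cm)
print(classification_report(yte, pred, target_names=["benign", "malignant"]))
```

In this layout, `cm[1, 0]` counts malignant cases predicted as benign, which is the cell to watch when comparing models.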
