## **Breast Cancer Detection**

This project detects breast cancer by analyzing patient data to predict whether a tumor is malignant or benign. I used a Logistic Regression model for this task and achieved an **F1 score of 98.6%** on a held-out 20% test split, using 15 features chosen by forward sequential feature selection. The model separates malignant from benign tumors with 98.2% test accuracy, which could support early and reliable diagnosis.


# Download the dataset

In [2]:
!pip install ucimlrepo



### **Import the required libraries**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
X = breast_cancer_wisconsin_diagnostic.data.features 
y = breast_cancer_wisconsin_diagnostic.data.targets 
  
# metadata 
print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# variable information 
print(breast_cancer_wisconsin_diagnostic.variables) 

{'uci_id': 17, 'name': 'Breast Cancer Wisconsin (Diagnostic)', 'repository_url': 'https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic', 'data_url': 'https://archive.ics.uci.edu/static/public/17/data.csv', 'abstract': 'Diagnostic Wisconsin Breast Cancer Database.', 'area': 'Health and Medicine', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 569, 'num_features': 30, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Diagnosis'], 'index_col': ['ID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1993, 'last_updated': 'Fri Nov 03 2023', 'dataset_doi': '10.24432/C5DW2B', 'creators': ['William Wolberg', 'Olvi Mangasarian', 'Nick Street', 'W. Street'], 'intro_paper': {'title': 'Nuclear feature extraction for breast tumor diagnosis', 'authors': 'W. Street, W. Wolberg, O. Mangasarian', 'published_in': 'Electronic imaging', 'year': 1993, 'url': 'https://www.semanticscholar.org/paper/53

In [5]:
X = pd.DataFrame(X)
y = pd.DataFrame(y)
print ("X :\n")
print(X.head())
print("y : \n")
print(y.head())

The dataset's features are:

   radius1  texture1  perimeter1   area1  smoothness1  compactness1  \
0    17.99     10.38      122.80  1001.0      0.11840       0.27760   
1    20.57     17.77      132.90  1326.0      0.08474       0.07864   
2    19.69     21.25      130.00  1203.0      0.10960       0.15990   
3    11.42     20.38       77.58   386.1      0.14250       0.28390   
4    20.29     14.34      135.10  1297.0      0.10030       0.13280   

   concavity1  concave_points1  symmetry1  fractal_dimension1  ...  radius3  \
0      0.3001          0.14710     0.2419             0.07871  ...    25.38   
1      0.0869          0.07017     0.1812             0.05667  ...    24.99   
2      0.1974          0.12790     0.2069             0.05999  ...    23.57   
3      0.2414          0.10520     0.2597             0.09744  ...    14.91   
4      0.1980          0.10430     0.1809             0.05883  ...    22.54   

   texture3  perimeter3   area3  smoothness3  compactness3  concavity

# Preprocessing steps

#### Now that the dataset is loaded, I need to preprocess it.
##### How can I do this?

1- remove the NaN values or impute them

2- remove the duplicates

3- feature selection

# Evaluation
Evaluate the model on each set of features.
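
The first two steps can be sketched on a small made-up frame. The column names and values here are illustrative only, not taken from the dataset. The point is that any row dropped from `X` must also be dropped from `y` so the two stay aligned.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one NaN and one duplicated row
X_toy = pd.DataFrame({
    "radius": [1.0, 2.0, 2.0, np.nan],
    "texture": [5.0, 6.0, 6.0, 8.0],
})
y_toy = pd.Series([0, 1, 1, 0], name="Diagnosis")

# 1- count missing values per column, then drop the incomplete rows
print(X_toy.isna().sum())
X_clean = X_toy.dropna()

# 2- drop duplicated rows (drop_duplicates returns a new frame)
X_clean = X_clean.drop_duplicates()

# Keep only the targets of the rows that survived
y_clean = y_toy.loc[X_clean.index]

print(X_clean.shape, y_clean.shape)
```

Imputing instead of dropping would replace the `dropna` call; which one is appropriate depends on how many rows are affected.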

In [6]:
X.describe()

Unnamed: 0,radius1,texture1,perimeter1,area1,smoothness1,compactness1,concavity1,concave_points1,symmetry1,fractal_dimension1,...,radius3,texture3,perimeter3,area3,smoothness3,compactness3,concavity3,concave_points3,symmetry3,fractal_dimension3
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [7]:
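X.isna().sum()
# there are no NaN values
# drop_duplicates returns a new frame, so assign it and keep y aligned with the remaining rows
X = X.drop_duplicates()
y = y.loc[X.index]
print(len(X.columns))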

30


**I will use sequential feature selection, both forward and backward, to choose the important features**

In [8]:
y['Diagnosis'].unique()
replace_map = {'B': 1, 'M': 0}  # benign becomes the positive class (1), malignant the negative class (0)
y = y.replace(replace_map)
y.value_counts()


Diagnosis
1            357
0            212
Name: count, dtype: int64
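
A possible alternative for the encoding step, shown on a few made-up labels: mapping the column directly returns a 1-D Series, which scikit-learn estimators accept without a shape warning. This is a sketch, not what the notebook ran. Note that with this mapping benign is the positive class (1), so precision, recall and F1 later on describe how well benign tumors are identified.

```python
import pandas as pd

# Hypothetical sample of labels in the same 'B'/'M' format
y_demo = pd.DataFrame({"Diagnosis": ["M", "B", "B", "M", "B"]})

replace_map = {"B": 1, "M": 0}

# Series.map converts each label and yields a 1-D integer Series
y_encoded = y_demo["Diagnosis"].map(replace_map)

print(y_encoded.value_counts())
print(y_encoded.shape)
```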

In [9]:
def extract_features(X, y, direction):
    '''Return the names of the features chosen by sequential feature selection.'''
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    # Define the model
    model = LogisticRegression(fit_intercept=True, penalty=None, random_state=42)

    # Wrap SFS around the model; by default it keeps half of the features
    seq = SequentialFeatureSelector(model, direction=direction)

    # Fit SFS on the training split (np.ravel gives the 1-D y that scikit-learn expects)
    seq = seq.fit(X_train_scaled, np.ravel(y_train))
    feature_names = X.columns[seq.get_support()]

    return feature_names
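
One way to keep scaling, selection and evaluation on the same split is a scikit-learn `Pipeline`. The sketch below is a variant under stated assumptions, not the notebook's method: it uses the copy of the dataset bundled with scikit-learn (`load_breast_cancer`, whose column names differ from the `ucimlrepo` ones), selects only 5 features, and uses the default L2 penalty with 3-fold cross-validation to keep it fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=5, direction="forward", cv=3,
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler, selector and classifier are all fit on the training split only
pipe.fit(X_tr, y_tr)

selected = X_demo.columns[pipe.named_steps["select"].get_support()]
test_f1 = f1_score(y_te, pipe.predict(X_te))
print(list(selected))
print(test_f1)
```

Because the test rows never reach the selector, the test score is not influenced by the feature choice.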

In [10]:
features_F = extract_features(X, y, "forward")
features_b = extract_features(X, y, "backward")
print(features_F)

Index(['radius1', 'texture1', 'perimeter1', 'area1', 'compactness1',
       'concavity1', 'concave_points1', 'symmetry1', 'texture2', 'radius3',
       'texture3', 'perimeter3', 'area3', 'smoothness3', 'symmetry3'],
      dtype='object')


In [11]:
features_b = features_b.sort_values()
features_F = features_F.sort_values()

print(f"the number of features from forward selection is {len(features_F)}: \n {features_F} ")
print(f"the number of features from backward elimination is {len(features_b)}: \n {features_b}")
# check whether both directions picked the same features
are_equal = features_b.equals(features_F)
print(are_equal)  # True only if both sorted indexes hold exactly the same names

the number of features from forward selection is 15: 
 Index(['area1', 'area3', 'compactness1', 'concave_points1', 'concavity1',
       'perimeter1', 'perimeter3', 'radius1', 'radius3', 'smoothness3',
       'symmetry1', 'symmetry3', 'texture1', 'texture2', 'texture3'],
      dtype='object') 
the number of features from backward elimination is 15: 
 Index(['area3', 'compactness1', 'compactness3', 'concave_points1',
       'concave_points3', 'concavity2', 'concavity3', 'fractal_dimension1',
       'fractal_dimension3', 'radius2', 'smoothness1', 'symmetry2',
       'symmetry3', 'texture2', 'texture3'],
      dtype='object')
False
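
To see how far the two selections overlap, the index objects can be compared with set operations. The two lists below are copied from the output above; the variable names are my own.

```python
import pandas as pd

forward = pd.Index([
    "area1", "area3", "compactness1", "concave_points1", "concavity1",
    "perimeter1", "perimeter3", "radius1", "radius3", "smoothness3",
    "symmetry1", "symmetry3", "texture1", "texture2", "texture3",
])
backward = pd.Index([
    "area3", "compactness1", "compactness3", "concave_points1",
    "concave_points3", "concavity2", "concavity3", "fractal_dimension1",
    "fractal_dimension3", "radius2", "smoothness1", "symmetry2",
    "symmetry3", "texture2", "texture3",
])

shared = forward.intersection(backward)        # chosen by both directions
only_forward = forward.difference(backward)    # chosen by forward selection only
only_backward = backward.difference(forward)   # chosen by backward elimination only

print(len(shared), list(shared))
print(len(only_forward), list(only_forward))
print(len(only_backward), list(only_backward))
```

Six features appear in both selections, so the two directions agree on fewer than half of their 15 picks.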


In [12]:
def calculate_metrics(model, X_test_scaled, Y_test):
    '''Get model evaluation metrics on the test set.'''

    # Get model predictions
    y_predict_r = model.predict(X_test_scaled)

    # Calculate evaluation metrics for assessing the performance of the model.
    # Benign is encoded as 1, so precision, recall and F1 refer to the benign class.
    acc = accuracy_score(Y_test, y_predict_r)
    # roc_auc_score on hard 0/1 predictions equals the balanced accuracy;
    # pass model.predict_proba(X_test_scaled)[:, 1] instead to get the usual ROC AUC.
    roc = roc_auc_score(Y_test, y_predict_r)
    prec = precision_score(Y_test, y_predict_r)
    rec = recall_score(Y_test, y_predict_r)
    f1 = f1_score(Y_test, y_predict_r)

    return acc, roc, prec, rec, f1

def add_results(X_train, X_test, y_train, y_test, Name):
    model = LogisticRegression(fit_intercept=True, penalty=None, random_state=42)
    # Fit the model on the standardized features and the diagnosis values
    # (np.ravel turns the single-column y into the 1-D array scikit-learn expects)
    model.fit(X_train, np.ravel(y_train))
    acc, roc, prec, rec, f1 = calculate_metrics(model, X_test, y_test)
    metric_df = pd.DataFrame({"acc": [acc], "roc": [roc], "prec": [prec], "rec": [rec], "f1": [f1]}, index=[Name])
    return metric_df
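
`roc_auc_score` is normally given scores or probabilities rather than hard 0/1 predictions. With hard labels it reduces to the balanced accuracy, which is what `calculate_metrics` returns as `roc`. A minimal sketch of the difference, assuming the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) and the default L2-penalized model rather than the exact setup used here:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)
clf = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)

hard_labels = clf.predict(X_te_s)
positive_proba = clf.predict_proba(X_te_s)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)     # one threshold only
auc_from_proba = roc_auc_score(y_te, positive_proba)   # all thresholds
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels  :", auc_from_labels)
print("balanced accuracy     :", bal_acc)
print("AUC from probabilities:", auc_from_proba)
```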

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,stratify=y, random_state = 123)

# All features are floats. Standardize them using statistics from the training split only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
metric_df = add_results(X_train_scaled, X_test_scaled, y_train, y_test, "All features")
print(metric_df.head())


                   acc       roc      prec       rec        f1
All features  0.947368  0.943452  0.958333  0.958333  0.958333



In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,stratify=y, random_state = 123)

# All features are floats. Standardize them using statistics from the training split only.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns= X_train.columns)[features_F]
X_test_scaled =  pd.DataFrame(scaler.transform(X_test), columns= X_test.columns)[features_F]

metric_f_df = add_results(X_train_scaled, X_test_scaled, y_train, y_test, "Seq F")

metric_df = pd.concat([metric_df, metric_f_df])
print(metric_df.head())

                   acc       roc      prec       rec        f1
All features  0.947368  0.943452  0.958333  0.958333  0.958333
Seq F         0.982456  0.981151  0.986111  0.986111  0.986111
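
The `Seq F` row is consistent with just two misclassified tumors out of 114 test samples (72 benign, 42 malignant): one false positive and one false negative. The sketch below rebuilds label vectors with those counts, which are inferred from the reported metrics rather than read from the notebook, and also reports recall for the malignant class (label 0).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Counts inferred from the reported metrics (benign = 1 is the positive class)
tp, fn, fp, tn = 71, 1, 1, 41

y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Recall for the malignant class: share of malignant tumors that were caught
rec_malignant = recall_score(y_true, y_pred, pos_label=0)

print(round(acc, 6), round(prec, 6), round(rec, 6), round(f1, 6))
print(round(rec_malignant, 6))
```

For a diagnostic task the malignant-class recall is usually the number to watch, since a missed malignant tumor is the costlier error.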



In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,stratify=y, random_state = 123)

# All features are floats. Standardize them using statistics from the training split only.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns= X_train.columns)[features_b]
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns= X_train.columns)[features_b]

metric_b_df = add_results(X_train_scaled, X_test_scaled, y_train, y_test, "Seq b")
metric_df = pd.concat([metric_df, metric_b_df])
print(metric_df.head())

                   acc       roc      prec       rec        f1
All features  0.947368  0.943452  0.958333  0.958333  0.958333
Seq F         0.982456  0.981151  0.986111  0.986111  0.986111
Seq b         0.938596  0.936508  0.957746  0.944444  0.951049



#### **The features from forward selection achieve the highest accuracy, so I will try to make the model more expressive by adding their squares**

In [16]:
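# X_train_scaled currently holds the backward-selection features, so rebuild the forward-selection ones
X_train_f = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)[features_F]
X_test_f = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)[features_F]

# Append the squared features as new columns (axis=1), not as new rows
X_train_square = pd.concat([X_train_f, (X_train_f ** 2).add_suffix("_sq")], axis=1)
X_test_square = pd.concat([X_test_f, (X_test_f ** 2).add_suffix("_sq")], axis=1)

metric_bi_df = add_results(X_train_square, X_test_square, y_train, y_test, "binomial Seq f")
metric_df = pd.concat([metric_df, metric_bi_df])
print(metric_df.head())


                     acc       roc      prec       rec        f1
All features    0.947368  0.943452  0.958333  0.958333  0.958333
Seq F           0.982456  0.981151  0.986111  0.986111  0.986111
Seq b           0.938596  0.936508  0.957746  0.944444  0.951049
binomial Seq f  0.938596  0.936508  0.957746  0.944444  0.951049


The `binomial Seq f` row recorded above is identical to `Seq b` because that run passed the unsquared backward-selection features to `add_results`, so it does not yet show overfitting or a variance problem caused by the squared features. The cell above builds the squared forward-selection features and passes them in; it needs to be re-run before I can draw that conclusion.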
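
A common way to check for overfitting is to compare the score on the training split with the score on the test split; a large gap points to a variance problem. A minimal sketch, assuming the copy of the dataset bundled with scikit-learn (`load_breast_cancer`), all 30 features, and default L2-penalized logistic regression rather than the notebook's exact configuration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=123
)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Append the squared features as additional columns
X_tr_sq = np.hstack([X_tr_s, X_tr_s ** 2])
X_te_sq = np.hstack([X_te_s, X_te_s ** 2])

def train_test_f1(X_train, X_test):
    """Fit on the training split and return (train F1, test F1)."""
    model = LogisticRegression(max_iter=5000).fit(X_train, y_tr)
    return (
        f1_score(y_tr, model.predict(X_train)),
        f1_score(y_te, model.predict(X_test)),
    )

base_scores = train_test_f1(X_tr_s, X_te_s)
squared_scores = train_test_f1(X_tr_sq, X_te_sq)

print("linear  train/test F1:", base_scores)
print("squared train/test F1:", squared_scores)
```

If the squared model scores clearly higher on the training split but lower on the test split than the linear one, that supports the overfitting explanation.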

## Finally! I finished this project and achieved an F1 score of 98.6% with the forward-selected features