# Breast Cancer Prediction with Machine Learning using Keras

## Artificial Neural Network
Neural networks, also known as artificial neural networks (ANNs) or simulated neural networks (SNNs), are a subset of machine learning and are at the heart of deep learning algorithms. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another.

Artificial neural networks (ANNs) consist of node layers: an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, connects to nodes in the next layer and has an associated weight and threshold. If the output of a node is above its threshold value, the node is activated and sends data to the next layer of the network. Otherwise, it passes no data along.
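
As a minimal sketch of the node behaviour described above, the plain-Python snippet below computes a weighted sum of the inputs and passes it on only when it exceeds a threshold. The function name `neuron_output` and all numeric values are illustrative assumptions. The Keras layers used later in this notebook rely on activation functions such as ReLU and sigmoid instead of a hard threshold.

```python
def neuron_output(inputs, weights, threshold):
    """Return the weighted sum if it exceeds the threshold, else 0.0 (node stays inactive)."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum if weighted_sum > threshold else 0.0

# Illustrative values only (not taken from the dataset).
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 1.0]

# The weighted sum is 0.5 - 2.0 + 3.0 = 1.5 in both calls.
active = neuron_output(inputs, weights, threshold=1.0)    # 1.5 exceeds 1.0, so it is passed on
inactive = neuron_output(inputs, weights, threshold=2.0)  # 1.5 does not exceed 2.0
print(active, inactive)
```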

### Import Libraries

In [34]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

### Load the dataset

In [2]:
cancer = pd.read_csv("cancer.csv")
cancer.head()

| | Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 4 | Positive | Positive | 24 | 1 | 60 | Alive |
| 1 | 50 | White | Married | T2 | N2 | IIIA | Moderately differentiated | 2 | Regional | 35 | Positive | Positive | 14 | 5 | 62 | Alive |
| 2 | 58 | White | Divorced | T3 | N3 | IIIC | Moderately differentiated | 2 | Regional | 63 | Positive | Positive | 14 | 7 | 75 | Alive |
| 3 | 58 | White | Married | T1 | N1 | IIA | Poorly differentiated | 3 | Regional | 18 | Positive | Positive | 2 | 1 | 84 | Alive |
| 4 | 47 | White | Married | T2 | N1 | IIB | Poorly differentiated | 3 | Regional | 41 | Positive | Positive | 3 | 1 | 50 | Alive |


## Exploratory Data Analysis

In [3]:
#Find out more about the dataset
cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Age                     4024 non-null   int64 
 1   Race                    4024 non-null   object
 2   Marital Status          4024 non-null   object
 3   T Stage                 4024 non-null   object
 4   N Stage                 4024 non-null   object
 5   6th Stage               4024 non-null   object
 6   differentiate           4024 non-null   object
 7   Grade                   4024 non-null   object
 8   A Stage                 4024 non-null   object
 9   Tumor Size              4024 non-null   int64 
 10  Estrogen Status         4024 non-null   object
 11  Progesterone Status     4024 non-null   object
 12  Regional Node Examined  4024 non-null   int64 
 13  Reginol Node Positive   4024 non-null   int64 
 14  Survival Months         4024 non-null   int64 
 15  Stat

In [5]:
#Check for null values
cancer.isnull().sum()

Age                       0
Race                      0
Marital Status            0
T Stage                   0
N Stage                   0
6th Stage                 0
differentiate             0
Grade                     0
A Stage                   0
Tumor Size                0
Estrogen Status           0
Progesterone Status       0
Regional Node Examined    0
Reginol Node Positive     0
Survival Months           0
Status                    0
dtype: int64

In [10]:
#Check for duplicate values
duplicates = cancer.duplicated()

# Display rows with duplicate values
duplicate_rows = cancer[duplicates]
print("Duplicate Rows except first occurrence:")
print(duplicate_rows)

Duplicate Rows except first occurrence:
     Age   Race Marital Status T Stage  N Stage 6th Stage  \
436   63  White        Married       T1      N1       IIA   

                 differentiate Grade   A Stage  Tumor Size Estrogen Status  \
436  Moderately differentiated     2  Regional          17        Positive   

    Progesterone Status  Regional Node Examined  Reginol Node Positive  \
436            Positive                       9                      1   

     Survival Months Status  
436               56  Alive  


In [14]:
#Drop duplicate values
cancer.drop_duplicates(inplace = True)

In [16]:
# Describe the dataset
cancer.describe()

| | Age | Tumor Size | Regional Node Examined | Reginol Node Positive | Survival Months |
|---|---|---|---|---|---|
| count | 4023.0 | 4023.0 | 4023.0 | 4023.0 | 4023.0 |
| mean | 53.969923 | 30.477007 | 14.358439 | 4.158837 | 71.301765 |
| std | 8.963118 | 21.121253 | 8.100241 | 5.109724 | 22.923009 |
| min | 30.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 47.0 | 16.0 | 9.0 | 1.0 | 56.0 |
| 50% | 54.0 | 25.0 | 14.0 | 2.0 | 73.0 |
| 75% | 61.0 | 38.0 | 19.0 | 5.0 | 90.0 |
| max | 69.0 | 140.0 | 61.0 | 46.0 | 107.0 |


### Detecting Categorical Columns

In [19]:
def detect_categorical_columns(cancer):
    categorical_columns = cancer.select_dtypes(include = ["object"]).columns.tolist()
    return categorical_columns

In [20]:
categorical_cols = detect_categorical_columns(cancer)
print("Categorical Columns: ")
print(categorical_cols)

Categorical Columns: 
['Race', 'Marital Status', 'T Stage ', 'N Stage', '6th Stage', 'differentiate', 'Grade', 'A Stage', 'Estrogen Status', 'Progesterone Status', 'Status']


### Encode the categorical columns

In [21]:
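def encode_categorical_columns(cancer, categorical_columns):
    """Return a copy of the frame with each categorical column label-encoded."""
    cancer_encoded = cancer.copy()

    for col in categorical_columns:
        # Fit a fresh encoder per column; codes follow the sorted order of the labels
        le = LabelEncoder()
        cancer_encoded[col] = le.fit_transform(cancer[col])

    return cancer_encoded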
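
`LabelEncoder` numbers the classes in sorted order, which is why `Alive` becomes 0 in the encoded frame. The function above does not return the fitted encoders, so the label-to-code mapping is not available afterwards. The sketch below mimics the sorted-order rule in plain Python (it does not call scikit-learn) and keeps one mapping per column. `build_label_mapping` and the sample values are hypothetical.

```python
def build_label_mapping(values):
    """Map each distinct label to an integer code, in sorted order of the labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

# Hypothetical sample of two columns; the real data has more rows and categories.
sample = {
    "Status": ["Alive", "Dead", "Alive", "Alive"],
    "Marital Status": ["Married", "Divorced", "Married"],
}

# One mapping per column, kept so that codes can be translated back to labels later.
mappings = {col: build_label_mapping(vals) for col, vals in sample.items()}
encoded = {col: [mappings[col][v] for v in vals] for col, vals in sample.items()}

print(mappings["Status"])
print(encoded["Marital Status"])
```

Keeping the mappings matters when reading the classification report and confusion matrix, where only the integer codes appear.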

### Print the encoded data frame

In [24]:
cancer_encoded = encode_categorical_columns(cancer, categorical_cols)
cancer_encoded.head()

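| | Age | Race | Marital Status | T Stage | N Stage | 6th Stage | differentiate | Grade | A Stage | Tumor Size | Estrogen Status | Progesterone Status | Regional Node Examined | Reginol Node Positive | Survival Months | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 2 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 4 | 1 | 1 | 24 | 1 | 60 | 0 |
| 1 | 50 | 2 | 1 | 1 | 1 | 2 | 0 | 2 | 1 | 35 | 1 | 1 | 14 | 5 | 62 | 0 |
| 2 | 58 | 2 | 0 | 2 | 2 | 4 | 0 | 2 | 1 | 63 | 1 | 1 | 14 | 7 | 75 | 0 |
| 3 | 58 | 2 | 1 | 0 | 0 | 0 | 1 | 3 | 1 | 18 | 1 | 1 | 2 | 1 | 84 | 0 |
| 4 | 47 | 2 | 1 | 1 | 0 | 1 | 1 | 3 | 1 | 41 | 1 | 1 | 3 | 1 | 50 | 0 |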


### Split the dataset into training and test sets

In [26]:
X = cancer_encoded.drop(columns=["Status"]) #Used to make predictions 
y = cancer_encoded.Status.values # To be predicted
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 17, shuffle = True, stratify = y)

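### Normalize the data
#### Fit the scaler after splitting the data into training and test sets, for the following reasons:
- Preventing information leakage: the scaler's mean and standard deviation are computed from the training set only, so no test-set statistics influence training.
- Maintaining test set independence: the test set is transformed with the training statistics and remains a fair estimate of performance on unseen data.
- Ensuring consistency in deployment: new samples are scaled with the same fitted scaler that was used during training.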

In [27]:
scaler = StandardScaler()
scaler.fit(X_train) #Fit only to the training set

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
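
To make the leakage point concrete, the sketch below standardizes a single feature by hand using only training-set statistics. This is the per-column computation `StandardScaler` performs (subtract the mean, divide by the population standard deviation). The helper names and the numbers are illustrative assumptions, not rows from `cancer.csv`.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from the training values only."""
    return fmean(train_values), pstdev(train_values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in values]

# Illustrative tumor sizes, not taken from the dataset.
train_sizes = [10.0, 20.0, 30.0, 40.0]
test_sizes = [25.0, 50.0]

mean, std = fit_standardizer(train_sizes)
train_scaled = standardize(train_sizes, mean, std)
test_scaled = standardize(test_sizes, mean, std)  # reuses the training mean and std

print(mean, std)
print(test_scaled)
```

The test values are never used to compute `mean` or `std`, which mirrors calling `scaler.fit` on `X_train` only.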


### Train the model

In [31]:
model = Sequential()
# Two hidden layers with ReLU activations
model.add(Dense(128, input_dim = X_train_scaled.shape[1], activation = "relu"))
model.add(Dense(64, activation = "relu"))
# A single sigmoid unit outputs the probability of class 1
model.add(Dense(1, activation = "sigmoid"))

model.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["accuracy"])

# Train for 1000 epochs; no validation data is passed to fit
model.fit(X_train_scaled, y_train, epochs = 1000, batch_size = 32)

Epoch 1/1000
Epoch 2/1000
...
Epoch 72/1000
...

<keras.src.callbacks.History at 0x27e8f058700>
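
For a sense of the model's size, the arithmetic below counts the trainable parameters of the three `Dense` layers. It assumes 15 input features, that is, the 16 columns reported by `cancer.info()` minus the `Status` target. Each dense layer has `inputs * units` weights plus one bias per unit. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 15  # assumption: 16 columns in cancer.info() minus the Status target
layer_units = [128, 64, 1]

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_param_count(n_inputs, units))
    n_inputs = units  # the next layer receives this layer's outputs

total_params = sum(counts)
print(counts, total_params)
```

If the feature-count assumption holds, `model.summary()` should report the same per-layer totals.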

### Print the results

In [32]:
loss, accuracy = model.evaluate(X_test_scaled, y_test)
print(f"Test Loss = {loss: .4f}")
print(f"Test Accuracy = {accuracy * 100: .4f}")

y_pred = model.predict(X_test_scaled)
y_pred_binary = (y_pred > 0.5).astype(int)

Test Loss =  2.1174
Test Accuracy =  85.5901
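
A test loss above 2 alongside roughly 86% accuracy is possible because binary cross-entropy penalizes confident mistakes heavily, while accuracy only checks which side of 0.5 a prediction falls on. The sketch below uses made-up probabilities (not the model's outputs) to show the effect. `binary_cross_entropy` is a hand-written helper for illustration, not the Keras implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

y_true = [0, 0, 0, 1]
cautious = [0.1, 0.1, 0.1, 0.4]        # one mild mistake on the last sample
overconfident = [0.0, 0.0, 0.0, 1e-6]  # same mistake, made with near certainty

cautious_loss = binary_cross_entropy(y_true, cautious)
overconfident_loss = binary_cross_entropy(y_true, overconfident)

# Accuracy is identical in both cases; only the loss differs.
print(accuracy(y_true, cautious), cautious_loss)
print(accuracy(y_true, overconfident), overconfident_loss)
```

One plausible reading of the numbers above is that long training without validation has pushed some wrong predictions close to 0 or 1, though the notebook does not record validation loss to confirm this.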


### Print Classification Report

In [33]:
print("classification_report \n", classification_report(y_test, y_pred_binary))

classification_report 
               precision    recall  f1-score   support

           0       0.91      0.93      0.92       682
           1       0.53      0.46      0.50       123

    accuracy                           0.86       805
   macro avg       0.72      0.70      0.71       805
weighted avg       0.85      0.86      0.85       805
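
The per-class numbers in the report follow from counts of true positives, false positives, and false negatives. As a sketch, the helper below derives precision, recall, and F1 for one class from true and predicted labels. The function name and the toy labels are assumptions for illustration, not the notebook's predictions.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 6 samples of class 0 and 4 samples of class 1.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

In the report above, class 1 has a recall of 0.46, so fewer than half of the patients in that class are identified even though overall accuracy is 0.86.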



### Print Confusion Matrix

In [43]:
# cm as confusion matrix
cm = confusion_matrix(y_test, y_pred_binary, labels=[0, 1])
class_names = ["Alive", "Dead"]

# Plot the confusion matrix as an annotated heatmap (Plotly figure factory)
fig = ff.create_annotated_heatmap(
    z=cm,
    x=class_names,
    y=class_names,
    colorscale="Reds",
    showscale=True
)

# Update the layout
fig.update_layout(
    title="Confusion Matrix",
    xaxis=dict(title="Predictions"),
    yaxis=dict(title="True Values"),
    width=1000,
    height=600
)

# Show the plot
fig.show()