<a href="https://colab.research.google.com/github/kushagrathisside/Breast-Cancer-Project/blob/main/Automation_of_Breast_Cancer_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Breast Cancer Prediction**
Breast cancer is cancer that develops from breast tissue. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, milk rejection, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. In those with distant spread of the disease, there may be bone pain, swollen lymph nodes, shortness of breath, or yellow skin.


## **Benign**
A 'benign' tumor is non-cancerous and does not spread. In most cases, a tumor that a doctor diagnoses as benign will be left alone. Benign tumors are generally not aggressive toward the surrounding tissue, although in some cases they may continue to grow. If a tumor keeps growing and causes pain or discomfort by pressing against surrounding organs, it is removed.

## **Malignant**
'Malignant' cancer cells spread across the body, which makes them very dangerous. Malignant tumors are aggressive and cancerous because they damage the surrounding tissue, and they may be removed depending on the severity or aggressiveness of the tumor.

Link to dataset: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer 

## Importing libraries and related functions

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


from google.colab import drive
from google.colab import files

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

## Uploading Files
The user can call *upload_files()* to upload the CSV files. \\
**Parameters:** *None* \\
**Return Type:** *None*

In [49]:
def upload_files():
  uploaded = files.upload()

  for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))

In [50]:
upload_files()

## Importing Data
*import_data()* takes the input mode and the file name (or Drive path) and loads the file into a dataframe. \\
**Parameters:**
1. name_type: "google-drive-link" for a file on Google Drive or "uploaded_file" for an uploaded file.
2. file_name: < *name of the file* > without the ".csv" extension. (The base path, the extension and the Drive mounting are already handled inside the function.)

**Return Type:** *df*, pandas dataframe with the data present in the given file.

In [51]:
def import_data(name_type:str,file_name:str):

  #resolve the CSV path from the chosen input mode
  if name_type=="google-drive-link":
    drive.mount('/content/drive')
    file_path='/content/drive/'+file_name+'.csv'
  elif name_type=="uploaded_file":
    file_path='/content/'+file_name+'.csv'
  else:
    #raised outside any try block so this message is not masked
    raise ValueError('Incorrect parameters given!')

  #only the read itself can fail here; keep the original error as the cause
  try:
    df=pd.read_csv(file_path)
  except Exception as err:
    raise Exception("Unsuccessful import!") from err

  print("CSV Imported!")
  print("Dataset loaded to dataframe!")

  return df

In [52]:
data= import_data("uploaded_file","bcd")

CSV Imported!
Dataset loaded to dataframe!


# Preprocessing Begins...

**Boolean Masking:** \\
*boolean_mask()* applies boolean masking on the column *col_name* of *data* and returns the rows that match each of the two values in *vals*.

**Note:**
'B' is treated as the negative class in this use case, following medical convention. Medical practitioners call a case **positive** only if the disease is found, which here is cancer. \\
So, benign (non-cancerous tumour) is negative and malignant (cancerous tumour) is positive.

In [53]:
def boolean_mask(data, col_name, vals):
  #col_name=diagnosis
  #vals=['B','M']
  neg_class_data = data[data[col_name] == vals[0]]
  pos_class_data = data[data[col_name] == vals[1]]

  return pos_class_data, neg_class_data
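
As a minimal sketch of what boolean masking does, the toy frame below (made-up values, with only the column name *diagnosis* and the labels taken from the notebook) shows the intermediate mask and the rows it selects:

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "diagnosis": ["B", "M", "B", "M", "B"],
    "radius_mean": [11.2, 20.1, 12.5, 18.3, 10.9],
})

# Comparing a column to a value yields a boolean Series (the mask).
is_malignant = toy["diagnosis"] == "M"

# Indexing with the mask keeps only the rows where it is True;
# ~ inverts the mask to get the remaining rows.
pos_rows = toy[is_malignant]
neg_rows = toy[~is_malignant]

print(is_malignant.tolist())
print(len(pos_rows), len(neg_rows))
```

The filtered frames keep the original index labels, which is why the split below slices them by position with *iloc*.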

**Training Splitting:** \\
The train/test split is performed by *train_split()*, which divides the given dataset into two parts for training and testing respectively. The split should follow these basic rules:
- The training partition must be larger than the test partition so that the model gets more data for training (e.g. a train ratio of 0.7 or 0.8).
- The two partitions must not share records, to avoid testing on data the model was trained on.
- **Both partitions should contain a balanced number of records of each class.** E.g. the numbers of 'B's and 'M's should be the same, so that the model is not trained and tested on different class distributions.

In [54]:
def train_split(data, train_ratio : float):

  training_data_len = int(train_ratio*len(data))
  testing_data_len = int((1-train_ratio)*len(data))
  print("Length of Training Data: ", training_data_len)
  print("Length of Testing Data: ", testing_data_len)

  #boolean_mask returns the rows matching vals[1] ('M') first
  #and the rows matching vals[0] ('B') second
  pos_class_data, neg_class_data = boolean_mask(data, "diagnosis",['B','M'])
  print(f"Shape of both classes after Boolean Masking: \n B:{neg_class_data.shape}\n M:{pos_class_data.shape}")

  #the same number of rows is taken from each class for training;
  #the rows left over in each class go to testing, so the test partition
  #keeps the leftover class imbalance and may differ from testing_data_len
  per_class_train_len = training_data_len//2
  train_neg_class_data = neg_class_data.iloc[0:per_class_train_len,:]
  train_pos_class_data = pos_class_data.iloc[0:per_class_train_len,:]
  test_neg_class_data = neg_class_data.iloc[per_class_train_len:,:]
  test_pos_class_data = pos_class_data.iloc[per_class_train_len:,:]

  training_data = pd.concat([train_neg_class_data,train_pos_class_data])
  testing_data = pd.concat([test_neg_class_data,test_pos_class_data])

  #unnamed column (index 32) removed from both partitions
  training_data.drop([data.columns[32]],axis=1,inplace=True)
  testing_data.drop([data.columns[32]],axis=1,inplace=True)

  return training_data, testing_data
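
For comparison, here is a hedged sketch of a different strategy from the one used in *train_split()*: a stratified split, which samples the same fraction from every class so that both partitions keep the class proportions of the full dataset. The helper name *stratified_split* and the toy data are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

# Synthetic stand-in for the dataset: 10 benign and 6 malignant rows.
toy = pd.DataFrame({
    "diagnosis": ["B"] * 10 + ["M"] * 6,
    "feature": range(16),
})

def stratified_split(df, label_col, train_ratio, seed=0):
    """Hypothetical helper: split df while preserving class proportions."""
    # Sample the same fraction of rows from every class.
    train = df.groupby(label_col).sample(frac=train_ratio, random_state=seed)
    # Every row that was not sampled goes to the test partition,
    # so the two partitions cannot overlap (index labels must be unique).
    test = df.drop(train.index)
    return train, test

train, test = stratified_split(toy, "diagnosis", 0.5)
print(train["diagnosis"].value_counts().to_dict())
print(test["diagnosis"].value_counts().to_dict())
```

Unlike the equal-count approach above, this keeps the B:M ratio identical in both partitions rather than balancing the training set.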

**Binary Encoding:** \\
Binary encoding is the process of converting string categorical values into integer values so that they can be used in calculations during training.

Parameters:
*initial_vals*: Initial categories that have to be encoded. \\
*final_vals*: Final values into which *initial_vals* will be encoded.

In [55]:
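def binary_encoding(training_data, testing_data, data, col_index:int, initial_vals, final_vals): 
  #initial_vals: ['B','M']
  #final_vals: [0,1]
  col_name = data.columns[col_index]
  #assign the result back to the column; calling replace(..., inplace=True)
  #on a selected column may not update the dataframe in newer pandas versions
  training_data[col_name] = training_data[col_name].replace(to_replace=initial_vals, value=final_vals)
  testing_data[col_name] = testing_data[col_name].replace(to_replace=initial_vals, value=final_vals)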
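
The encoding step can be illustrated on a small Series. This sketch uses *Series.map* with a dictionary rather than the *replace* call in the function above; the labels and the 0/1 targets follow the notebook, everything else is made up.

```python
import pandas as pd

labels = pd.Series(["B", "M", "M", "B"], name="diagnosis")

# Pair each initial category with its encoded value: {'B': 0, 'M': 1}.
mapping = dict(zip(["B", "M"], [0, 1]))

# map() looks every value up in the dictionary; a value that is
# missing from the mapping would become NaN instead of passing through.
encoded = labels.map(mapping)

print(encoded.tolist())
```

The NaN behaviour is the main difference from *replace*, which leaves unmatched values unchanged.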

## **Correlation and Thresholding** 
Pearson correlation measures the linear relationship between two continuous variables X and Y and takes a value between -1 and 1. In other words, the Pearson correlation coefficient measures how well the relationship between two variables is described by a straight line.

**Correlation** is an important concept in machine learning modelling: it is used to select only those features that are directly related to the target. Irrelevant or weakly correlated columns increase the dimensionality and expose the model to the curse of dimensionality.
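
To make the coefficient concrete, the sketch below computes Pearson's r by hand (covariance of the centred values divided by the product of their spreads) and checks it against pandas. The helper *pearson_r* and the sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up sample in which y is roughly 2 * x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

def pearson_r(a, b):
    """Hypothetical helper: Pearson correlation of two 1-D arrays."""
    a_centred = a - a.mean()
    b_centred = b - b.mean()
    # Sum of products of deviations, normalised by both spreads.
    return (a_centred * b_centred).sum() / np.sqrt(
        (a_centred ** 2).sum() * (b_centred ** 2).sum()
    )

r_manual = pearson_r(x, y)
# pandas uses Pearson correlation by default.
r_pandas = pd.Series(x).corr(pd.Series(y))

print(r_manual, r_pandas)
```

Flipping the sign of one variable flips the sign of r, which is how the coefficient distinguishes increasing from decreasing linear relationships.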

*corr_threshold()* takes the *data* (original dataset), *training_data*, *testing_data* and a list of threshold values in *threshold_vals*, and prints the classification report of each model it trains. \\
**In each iteration within the function**, the following actions are performed:
- the features whose correlation with the target is greater than the threshold are selected.
- a model is trained for the current threshold value.
- the classification report is printed for the current threshold.


In [56]:
def corr_threshold(data, training_data, testing_data, threshold_vals):
  #name of the target column ("diagnosis" in this dataset)
  target = data.columns[1]

  #correlation calculation
  corr_matrix=training_data.corr()

  print("\n\n\nodel testing for different thresholds starts...")
  for threshold in threshold_vals:
    print(f"Current threshold: {threshold}")
    #columns whose correlation with the target exceeds the threshold;
    #the target itself is always selected (correlation 1 with itself)
    above_threshold = corr_matrix[target] > threshold
    key_select = list(above_threshold[above_threshold].index)

    print("Featured selected for modelling, final filtering in action!")
    filtered_training_data = training_data[key_select]
    answers = filtered_training_data[target]
    input_features = filtered_training_data.drop([target],axis=1)

    #the target is dropped by name (not by position) so that the
    #training and testing features always contain the same columns
    filtered_testing_data=testing_data[key_select]
    testing_input_features = filtered_testing_data.drop([target],axis=1)
    testing_answers = filtered_testing_data[target]

    #model training
    print("Modelling process initiated!!")
    naive_bayes_algo = GaussianNB()
    naive_bayes_algo.fit(X=input_features, y=answers)

    print("Your model is ready, let's start the prediction!!")
    exam_answers = naive_bayes_algo.predict(testing_input_features)

    print("Classification Report: ")
    print(classification_report(testing_answers, exam_answers))
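
The selection step inside the loop can be tried in isolation. In this sketch the frame is synthetic and built so that the correlations are known in advance; the column names *strong*, *unrelated* and *inverse* are assumptions for illustration. It also shows one consequence of comparing the signed correlation against the threshold: negatively correlated features are never selected unless the absolute value is used.

```python
import numpy as np
import pandas as pd

# Alternating 0/1 target and a pattern that is uncorrelated with it.
target = np.tile([0, 1], 50)
pattern = np.tile([1.0, 1.0, -1.0, -1.0], 25)

toy = pd.DataFrame({
    "diagnosis": target,
    "strong": 3.0 * target + pattern,   # clearly positive correlation
    "unrelated": pattern,               # correlation of zero
    "inverse": 1.0 - target,            # correlation of -1
})

corr_with_target = toy.corr()["diagnosis"]
threshold = 0.5

# Signed comparison, as in corr_threshold(): keeps positive correlations only.
selected = corr_with_target.index[corr_with_target > threshold].tolist()

# Variant on the absolute value: also keeps strong negative correlations.
selected_abs = corr_with_target.index[corr_with_target.abs() > threshold].tolist()

print(selected)
print(selected_abs)
```

In both cases the target column is part of the selection and has to be dropped before fitting, as the function above does.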

# Final sequential function calls...

In [57]:
train_ratio=float(input("Input Train Ratio:"))
initial_vals=['B','M']
final_vals=[0,1]
col_index=1
threshold_vals=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]

#Splitting
training_data, testing_data = train_split(data, train_ratio)

#Encoding
binary_encoding(training_data, testing_data, data, col_index, initial_vals, final_vals)

#Correlation testing
corr_threshold(data, training_data, testing_data, threshold_vals)

Input Train Ratio:0.7
Length of Training Data:  398
Length of Testing Data:  170
Shape of both classes after Boolean Masking: 
 B:(357, 33)
 M:(212, 33)



odel testing for different thresholds starts...
Current threshold: 0.1
Featured selected for modelling, final filtering in action!
Modelling process initiated!!
Your model is ready, let's start the prediction!!
Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.96      0.97       158
           1       0.61      0.85      0.71        13

    accuracy                           0.95       171
   macro avg       0.80      0.90      0.84       171
weighted avg       0.96      0.95      0.95       171

Current threshold: 0.2
Featured selected for modelling, final filtering in action!
Modelling process initiated!!
Your model is ready, let's start the prediction!!
Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.96      0

## Future Scope:

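- The reports generated for the different thresholds can be compared with each other.
- Individual metrics (e.g. precision, recall, f1-score) can be compared across models.
- The parameters can be changed for each threshold value.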
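
One possible way to approach the comparison, sketched under assumptions: *classification_report* accepts *output_dict=True*, which returns the metrics as a nested dictionary that can be collected into a single table. The labels and predictions below are hypothetical and are not results from the notebook.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions for two thresholds.
y_true = [0, 0, 0, 1, 1, 1]
preds_by_threshold = {
    0.1: [0, 0, 0, 1, 1, 0],
    0.5: [0, 0, 1, 1, 0, 0],
}

rows = []
for threshold, y_pred in preds_by_threshold.items():
    # With output_dict=True the class labels become string keys.
    report = classification_report(y_true, y_pred, output_dict=True)
    rows.append({
        "threshold": threshold,
        "accuracy": report["accuracy"],
        "precision_malignant": report["1"]["precision"],
        "recall_malignant": report["1"]["recall"],
    })

# One row per threshold makes the metrics directly comparable.
summary = pd.DataFrame(rows).set_index("threshold")
best_recall_threshold = summary["recall_malignant"].idxmax()

print(summary)
print(best_recall_threshold)
```

Recall on the malignant class is used for ranking here only as an example; which metric matters most is a design choice left open by the notebook.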