## Feature Prototype (Pre-processing stage)
This prototype aims to implement a baseline model that can accurately classify mammogram images as either malignant or benign. In this stage, irrelevant data is removed to ensure effective training of models in future stages.

The following libraries are required for the data preparation process:

In [35]:
import os
import numpy as np
import pandas as pd
import cv2
from sklearn.model_selection import train_test_split

### Data preparation
The dataset is obtained from the <a href="https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset/data">CBIS-DDSM: Breast Cancer Image Dataset</a>. The downloaded file includes the mammogram images ("jpeg" folder) and CSVs containing information about the images ("csv" folder).

The CSVs will be used to get the image path, as well as the pathology of the image (benign or malignant). These will be used to train the model.

The first row of each CSV dataset is displayed to understand the contents of the provided data.<br>This will be followed by checking which file path is an accurate link to the images provided.

In [36]:
csv_path = "datasets/csv"
csv_dirs = os.listdir(csv_path)

# displaying the first row of each CSV to understand the data provided
for file in csv_dirs:
    df = pd.read_csv(os.path.join(csv_path, file))
    display(df.head(1))

| | patient_id | breast density | left or right breast | image view | abnormality id | abnormality type | calc type | calc distribution | assessment | pathology | subtlety | image file path | cropped image file path | ROI mask file path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | P_00038 | 2 | LEFT | CC | 1 | calcification | PUNCTATE-PLEOMORPHIC | CLUSTERED | 4 | BENIGN | 2 | Calc-Test_P_00038_LEFT_CC/1.3.6.1.4.1.9590.100... | Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.1... | Calc-Test_P_00038_LEFT_CC_1/1.3.6.1.4.1.9590.1... |

| | patient_id | breast density | left or right breast | image view | abnormality id | abnormality type | calc type | calc distribution | assessment | pathology | subtlety | image file path | cropped image file path | ROI mask file path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | P_00005 | 3 | RIGHT | CC | 1 | calcification | AMORPHOUS | CLUSTERED | 3 | MALIGNANT | 3 | Calc-Training_P_00005_RIGHT_CC/1.3.6.1.4.1.959... | Calc-Training_P_00005_RIGHT_CC_1/1.3.6.1.4.1.9... | Calc-Training_P_00005_RIGHT_CC_1/1.3.6.1.4.1.9... |

| | file_path | image_path | AccessionNumber | BitsAllocated | BitsStored | BodyPartExamined | Columns | ContentDate | ContentTime | ConversionType | ... | SecondaryCaptureDeviceManufacturerModelName | SeriesDescription | SeriesInstanceUID | SeriesNumber | SmallestImagePixelValue | SpecificCharacterSet | StudyDate | StudyID | StudyInstanceUID | StudyTime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | CBIS-DDSM/dicom/1.3.6.1.4.1.9590.100.1.2.12930... | CBIS-DDSM/jpeg/1.3.6.1.4.1.9590.100.1.2.129308... | | 16 | 16 | BREAST | 351 | 20160426 | 131732.685 | WSD | ... | MATLAB | cropped images | 1.3.6.1.4.1.9590.100.1.2.129308726812851964007... | 1 | 23078 | ISO_IR 100 | 20160720.0 | DDSM | 1.3.6.1.4.1.9590.100.1.2.271867287611061855725... | 214951.0 |

| | patient_id | breast_density | left or right breast | image view | abnormality id | abnormality type | mass shape | mass margins | assessment | pathology | subtlety | image file path | cropped image file path | ROI mask file path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | P_00016 | 4 | LEFT | CC | 1 | mass | IRREGULAR | SPICULATED | 5 | MALIGNANT | 5 | Mass-Test_P_00016_LEFT_CC/1.3.6.1.4.1.9590.100... | Mass-Test_P_00016_LEFT_CC_1/1.3.6.1.4.1.9590.1... | Mass-Test_P_00016_LEFT_CC_1/1.3.6.1.4.1.9590.1... |

| | patient_id | breast_density | left or right breast | image view | abnormality id | abnormality type | mass shape | mass margins | assessment | pathology | subtlety | image file path | cropped image file path | ROI mask file path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | P_00001 | 3 | LEFT | CC | 1 | mass | IRREGULAR-ARCHITECTURAL_DISTORTION | SPICULATED | 4 | MALIGNANT | 4 | Mass-Training_P_00001_LEFT_CC/1.3.6.1.4.1.9590... | Mass-Training_P_00001_LEFT_CC_1/1.3.6.1.4.1.95... | Mass-Training_P_00001_LEFT_CC_1/1.3.6.1.4.1.95... |

| | SeriesInstanceUID | StudyInstanceUID | Modality | SeriesDescription | BodyPartExamined | SeriesNumber | Collection | Visibility | ImageCount |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.3.6.1.4.1.9590.100.1.2.117041576511324414842... | 1.3.6.1.4.1.9590.100.1.2.229361142710768138411... | MG | ROI mask images | BREAST | 1 | CBIS-DDSM | 1 | 2 |
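
One way to check whether a path column points at files that actually exist is to count how many of its entries resolve on disk. The sketch below is only an illustration: the helper name `count_existing` is hypothetical, and it runs against a temporary folder rather than the real dataset.

```python
import os
import tempfile

def count_existing(paths):
    """Count how many of the given paths point to an existing file."""
    return sum(os.path.isfile(p) for p in paths)

# Demonstration on a temporary folder (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "1-1.jpg")
    open(real, "wb").close()                      # create an empty placeholder file
    missing = os.path.join(tmp, "does_not_exist.jpg")
    found = count_existing([real, missing])

print(found)
```

Applied to a path column, a count close to the column length would suggest that the column links to the provided images.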


Each CSV will now be stored in its own pandas dataframe, keeping only the relevant columns.<br>To prevent duplication of data, ```drop_duplicates()``` is applied after the irrelevant columns are removed.

In [37]:
# reading CSVs as separate datasets
calc_case_description_test_set = pd.read_csv("datasets/csv/calc_case_description_test_set.csv")
calc_case_description_train_set = pd.read_csv("datasets/csv/calc_case_description_train_set.csv")
dicom_info = pd.read_csv("datasets/csv/dicom_info.csv")
mass_case_description_test_set = pd.read_csv("datasets/csv/mass_case_description_test_set.csv")
mass_case_description_train_set = pd.read_csv("datasets/csv/mass_case_description_train_set.csv")
meta = pd.read_csv("datasets/csv/meta.csv")

# choosing columns to keep
keep_set = ["patient_id", "pathology", "image file path"]
keep_dicom = ["image_path", "SeriesInstanceUID"]
keep_meta = ["SeriesInstanceUID", "StudyInstanceUID"]

# modifying datasets to keep relevant columns
calc_case_description_test_set = calc_case_description_test_set[keep_set].drop_duplicates()
calc_case_description_train_set = calc_case_description_train_set[keep_set].drop_duplicates()
dicom_info = dicom_info[keep_dicom].drop_duplicates()
mass_case_description_test_set = mass_case_description_test_set[keep_set].drop_duplicates()
mass_case_description_train_set = mass_case_description_train_set[keep_set].drop_duplicates()
meta = meta[keep_meta].drop_duplicates()
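
To see why duplicates can appear once columns are removed, consider the minimal sketch below. The values are made up for illustration: two abnormalities recorded for the same image become identical rows after the distinguishing column is dropped.

```python
import pandas as pd

# Toy data (hypothetical values): two abnormalities recorded for the same image
toy = pd.DataFrame({
    "patient_id": ["P_1", "P_1", "P_2"],
    "abnormality id": [1, 2, 1],
    "pathology": ["BENIGN", "BENIGN", "MALIGNANT"],
    "image file path": ["a/1/2/img.dcm", "a/1/2/img.dcm", "b/3/4/img.dcm"],
})

keep = ["patient_id", "pathology", "image file path"]
# Selecting columns first makes the first two rows identical,
# so drop_duplicates() keeps only one of them
reduced = toy[keep].drop_duplicates()
print(len(toy), "->", len(reduced))
```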

### Data cleaning
Checking for patients that appear in both ```calc_case_description_test_set``` and ```mass_case_description_test_set``` shows that 3 patients have entries in both datasets. To prevent one patient from being recorded more than once, all entries for these patients are dropped from both datasets.

Note: the same step is applied to the train sets.

In [38]:
# getting all entries with matching patient_id
matching_test = pd.merge(
    calc_case_description_test_set, mass_case_description_test_set,
    how = "inner", left_on = "patient_id", right_on = "patient_id"
)

# getting unique patient_id to remove from original sets
unique_patient_ids = matching_test['patient_id'].unique().tolist()
# print(unique_patient_ids)

# remove entries with matching patient id
calc_case_description_test_set = calc_case_description_test_set[~calc_case_description_test_set['patient_id'].isin(unique_patient_ids)]
mass_case_description_test_set = mass_case_description_test_set[~mass_case_description_test_set['patient_id'].isin(unique_patient_ids)]

# getting all entries with matching patient_id
matching_train = pd.merge(
    calc_case_description_train_set, mass_case_description_train_set,
    how = "inner", left_on = "patient_id", right_on = "patient_id"
)

# getting unique patient_id to remove from original sets
unique_patient_ids = matching_train['patient_id'].unique().tolist()
# print(unique_patient_ids)

# remove entries with matching patient id
calc_case_description_train_set = calc_case_description_train_set[~calc_case_description_train_set['patient_id'].isin(unique_patient_ids)]
mass_case_description_train_set = mass_case_description_train_set[~mass_case_description_train_set['patient_id'].isin(unique_patient_ids)]
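
Since the same steps run for both the test and the train sets, they could be wrapped in a small helper. The following is a sketch on toy data, not the notebook's own code; the function name `drop_shared_patients` is hypothetical, and it uses a set intersection in place of the inner merge.

```python
import pandas as pd

def drop_shared_patients(left, right, key="patient_id"):
    """Return both frames without the rows whose key occurs in both."""
    shared = list(set(left[key]) & set(right[key]))
    return left[~left[key].isin(shared)], right[~right[key].isin(shared)]

# Toy frames (hypothetical ids): P_3 appears in both and is removed from each
calc = pd.DataFrame({"patient_id": ["P_1", "P_2", "P_3"]})
mass = pd.DataFrame({"patient_id": ["P_3", "P_4"]})
calc_clean, mass_clean = drop_shared_patients(calc, mass)

print(calc_clean["patient_id"].tolist())
print(mass_clean["patient_id"].tolist())
```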

We now connect the ```dicom_info``` dataset to each train and test set, so that every entry carries the path to its image for easy image processing. An identifier is extracted from the original ```image file path``` and matched against ```SeriesInstanceUID``` in ```dicom_info```. After merging, only ```pathology``` and the ```image_path``` from ```dicom_info``` are kept.

In [39]:
def connectDicom(dataset):
    # splitting "image file path" on "/" and taking the part at index 2 (the second set of numbers);
    # assign() returns a new dataframe, which avoids a SettingWithCopyWarning on the filtered input
    dataset = dataset.assign(common_id = dataset["image file path"].str.split("/").str[2])
    # merging where common_id matches SeriesInstanceUID
    dataset = pd.merge(dataset, dicom_info, how = "inner", left_on = "common_id", right_on = "SeriesInstanceUID")
    # keeping only "pathology" and "image_path" (drops the original "image file path" and "common_id")
    dataset = dataset[["pathology", "image_path"]]
    return dataset

# running function onto each dataset
calc_case_description_test_set = connectDicom(calc_case_description_test_set)
calc_case_description_train_set = connectDicom(calc_case_description_train_set)
mass_case_description_test_set = connectDicom(mass_case_description_test_set)
mass_case_description_train_set = connectDicom(mass_case_description_train_set)
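
The matching step can be illustrated on toy data. In the sketch below all identifiers and paths are hypothetical and heavily shortened; only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical, shortened identifiers; the real UIDs are much longer
cases = pd.DataFrame({
    "pathology": ["BENIGN"],
    "image file path": ["Calc-Test_P_X/1.2.3/4.5.6/000000.dcm"],
})
dicom = pd.DataFrame({
    "image_path": ["CBIS-DDSM/jpeg/4.5.6/1-1.jpg", "CBIS-DDSM/jpeg/9.9.9/1-1.jpg"],
    "SeriesInstanceUID": ["4.5.6", "9.9.9"],
})

# Index 2 is the third "/"-separated component of the path
cases = cases.assign(common_id=cases["image file path"].str.split("/").str[2])
# The inner merge keeps only rows whose common_id has a match in dicom
linked = cases.merge(dicom, how="inner", left_on="common_id", right_on="SeriesInstanceUID")
linked = linked[["pathology", "image_path"]]
print(linked)
```

Because the merge is an inner join, any case whose identifier has no match in ```dicom_info``` is silently dropped, so comparing row counts before and after the merge is a cheap sanity check.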

### Conducting train-test split
As all 4 datasets now have the same columns, they can be combined and passed to ```train_test_split()``` to split the data into 80% training and 20% testing. After combining the datasets, the image paths are modified so that they point to the actual location of the images in the project. The resulting train and test sets are saved as CSVs for easier access in the next steps.
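
As a rough illustration of what ```train_test_split()``` does with ```test_size = 0.2``` and a fixed ```random_state```, here is a minimal sketch on toy data. The labels and paths are made up; only the parameter values match the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined dataset (hypothetical labels and paths)
toy = pd.DataFrame({
    "pathology": ["BENIGN", "MALIGNANT"] * 5,
    "image_path": [f"datasets/jpeg/example_{i}.jpg" for i in range(10)],
})

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
train, test = train_test_split(toy, test_size=0.2, random_state=42)
train_again, test_again = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train), len(test))
print(test.index.equals(test_again.index))
```

With 10 rows this yields 8 training rows and 2 test rows, and repeating the call with the same ```random_state``` selects the same rows.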

In [40]:
# combining all 4 datasets together to conduct train_test_split
dataset = pd.concat([calc_case_description_test_set, calc_case_description_train_set,
                     mass_case_description_test_set, mass_case_description_train_set])

# modifying path to follow actual image path
def modify_path(img_path):
    new_path =  f'datasets/jpeg/{img_path.split("jpeg/")[1]}'
    return new_path

# changing path in image_path column
dataset["image_path"] = dataset["image_path"].apply(modify_path)
# splitting data into training and testing sets
train_data, test_data = train_test_split(dataset, test_size = 0.2, random_state = 42)

# writing into csv
train_data.to_csv("datasets/csv/train_data.csv", index = False)
test_data.to_csv("datasets/csv/test_data.csv", index = False)

display(train_data.head())
display(test_data.head())

| | pathology | image_path |
| --- | --- | --- |
| 176 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.1925301... |
| 182 | BENIGN | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.4105746... |
| 398 | BENIGN | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.8590696... |
| 893 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.2596841... |
| 101 | BENIGN | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.8708234... |

| | pathology | image_path |
| --- | --- | --- |
| 537 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.3947922... |
| 370 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.3969023... |
| 574 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.3311828... |
| 177 | MALIGNANT | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.2551878... |
| 313 | BENIGN | datasets/jpeg/1.3.6.1.4.1.9590.100.1.2.1490924... |
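
The path rewrite used in the cell above can be checked in isolation. The sketch below restates ```modify_path``` and applies it to a hypothetical, shortened path written in the style of the ```image_path``` column.

```python
def modify_path(img_path):
    # Keep everything after "jpeg/" and prefix it with the local dataset folder
    return f'datasets/jpeg/{img_path.split("jpeg/")[1]}'

# Hypothetical, shortened path (real entries contain a long UID)
example = "CBIS-DDSM/jpeg/1.2.3/1-1.jpg"
print(modify_path(example))  # datasets/jpeg/1.2.3/1-1.jpg
```

A path that does not contain "jpeg/" would raise an ```IndexError``` here, which at least makes unexpected entries fail loudly rather than pass through unchanged.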


## Dataset Citation
- Dataset: Awsaf (2021) CBIS-DDSM: Breast Cancer Image Dataset, Kaggle. Available at: https://www.kaggle.com/datasets/awsaf49/cbis-ddsm-breast-cancer-image-dataset
- Licencing: Creative Commons (no date) CC BY-SA 3.0 Deed | Attribution-ShareAlike 3.0 Unported | Creative Commons. Available at: https://creativecommons.org/licenses/by-sa/3.0/