# Predictive Brain Tumor Image AI Project - Data Wrangling & EDA

The goal is to build an image-classification model that predicts brain tumors from images and can be automated at scale.

# Brain Tumor info and labeling

- Glioma: A type of cancer arising from glial cells in the brain or spinal cord.
- Meningioma: Usually a benign tumor arising from the meninges; rarely malignant.
- Notumor: No tumor was found.
- Pituitary: A tumor of the pituitary gland; mostly benign, rarely cancerous.

My current plan for the training labels is:
- Glioma is labeled as cancer.
- Notumor (no tumor), Meningioma, and Pituitary are labeled as no cancer.
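
The labeling plan above can be sketched as a small lookup table. This is only an illustration: `CANCER_LABEL` and `to_binary_label` are hypothetical names, and the lowercase folder names other than `glioma` are assumed.

```python
# Hypothetical mapping from the four class folder names to the proposed
# binary target (1 = cancer, 0 = no cancer). Folder names are assumed.
CANCER_LABEL = {
    "glioma": 1,
    "meningioma": 0,
    "notumor": 0,
    "pituitary": 0,
}


def to_binary_label(class_name: str) -> int:
    """Return 1 for glioma and 0 for every other class."""
    return CANCER_LABEL[class_name.lower()]
```

Keeping the mapping in one place makes it easy to change later, for example if meningioma should be treated differently.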

# Image Folder Structure

We will most likely use PyTorch and its data-loading utilities. That setup expects one subfolder per class, and the subfolder name is used as the label for every image inside it during training and testing.

# Handle Data Split

The data originally comes split into train and test folders, but I intentionally created a new folder that holds a merged copy of both. Merging makes cross-validation easier, since the folds can then be drawn from the full dataset.

# Initial Setup

In [1]:
import CnnCustLib

In [2]:
ImageDataTogetherPath = './ImageData-Merge'

# Check if any images are corrupted

In [3]:
CnnCustLib.check_for_corrupt_images(ImageDataTogetherPath, remove_corrupted=False)

Checking images in: ./ImageData-Together
All images in './ImageData-Together' are intact.


None of the images are corrupted, so nothing needs to be removed.

In [4]:
# Remove corrupted images (not needed here, since no images are corrupted):
# CnnCustLib.check_for_corrupt_images(ImageDataTogetherPath, remove_corrupted=True)

# Check general info of Image collection 

In [5]:
CnnCustLib.analyze_file_type(ImageDataTogetherPath)

{'file_type_counts': Counter({'.jpg': 7023}), 'total_files_scanned': 7023}

In [6]:
CnnCustLib.analyze_file_size(ImageDataTogetherPath, unit='KB')

{'min_size_KB': 3.3935546875,
 'max_size_KB': 710.845703125,
 'mean_size_KB': 22.054279382252954,
 'median_size_KB': 20.5595703125,
 'std_size_KB': 15.833513470167311,
 'mode_size_KB': 12.908203125,
 'total_files_scanned': 7023}

In [7]:
CnnCustLib.analyze_file_dimensions(ImageDataTogetherPath)

{'dimension_counts': {(512, 512): 4742,
  (225, 225): 332,
  (630, 630): 90,
  (236, 236): 81,
  (201, 251): 58,
  (228, 221): 51,
  (232, 217): 50,
  (300, 168): 49,
  (442, 442): 46,
  (150, 198): 44,
  (200, 252): 43,
  (428, 417): 42,
  (227, 222): 39,
  (173, 201): 36,
  (206, 244): 35,
  (256, 256): 31,
  (192, 192): 31,
  (218, 231): 29,
  (201, 250): 29,
  (215, 234): 28,
  (227, 262): 27,
  (208, 242): 24,
  (468, 444): 24,
  (359, 449): 23,
  (504, 540): 23,
  (642, 361): 22,
  (214, 236): 22,
  (400, 442): 20,
  (236, 213): 19,
  (393, 400): 18,
  (207, 243): 18,
  (276, 326): 17,
  (442, 454): 17,
  (230, 282): 17,
  (420, 280): 17,
  (441, 442): 16,
  (235, 214): 16,
  (550, 664): 16,
  (194, 259): 15,
  (680, 680): 15,
  (339, 340): 13,
  (212, 238): 12,
  (380, 530): 12,
  (275, 183): 12,
  (196, 257): 12,
  (350, 350): 11,
  (235, 227): 10,
  (220, 275): 10,
  (236, 260): 9,
  (177, 197): 9,
  (356, 474): 9,
  (236, 251): 8,
  (236, 262): 7,
  (234, 218): 7,
  (350, 393

All 7023 files are JPEGs, so no format conversion is needed.

For file size, I was specifically looking for anything larger than 1 MB. The largest file is about 711 KB, so every image is a manageable size.

The image dimensions are all over the place, so every image has to be resized to one standard. The most common dimension is 512x512, so I will use that as the new standard for all images. I thought about removing the images with other dimensions, but with our small dataset of under 10K images I believe every image is valuable.
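
The choice of 512x512 can be checked directly from the dimension counts. The sketch below uses a few entries copied from the output above (the full listing is truncated there) together with the 7023 total files; `most_common_dimension` is a hypothetical helper.

```python
from collections import Counter


def most_common_dimension(dimension_counts, total_files):
    """Return the most frequent (width, height) and its share of all files."""
    (size, count), = Counter(dimension_counts).most_common(1)
    return size, count / total_files


# A few entries from the analyze_file_dimensions output; not the full listing.
counts = {(512, 512): 4742, (225, 225): 332, (630, 630): 90}
size, share = most_common_dimension(counts, total_files=7023)
print(size, f"{share:.1%}")
```

Roughly two thirds of the images are already 512x512, so resizing to that standard leaves most of the dataset untouched.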

# Scale & Threshold

In [8]:
# go through all the images and scale their pixel values to be between 0 and 1
CnnCustLib.scale_thresholding_main(
    folder_path='./ImageData-Merge',
    output_path='./ImageData-Merge_Scaled',
    batch_size=16,
    lower=0.0,
    upper=1.0,
    resize=True,
    new_size=(512, 512)  # Resize all images to 512x512
)

Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0000_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0001_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0002_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0003_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0004_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0005_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0006_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0007_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0008_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-glTr_0009_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-gl_0010_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-gl_0011_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-gl_0012_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-gl_0013_rescaled.jpg
Saved: ./ImageData-Together_Scaled\glioma\Te-gl_0014_res
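
As an illustration of the scaling step, here is a minimal sketch of linearly mapping 8-bit pixel intensities into a target range. It assumes that `lower` and `upper` mean the bounds of the output range; the internals of `CnnCustLib.scale_thresholding_main` are not shown in this notebook, so `scale_pixels` is a hypothetical stand-in rather than the actual implementation.

```python
def scale_pixels(pixels, lower=0.0, upper=1.0, max_value=255):
    """Linearly map intensities from [0, max_value] to [lower, upper].

    `pixels` is a 2D list of grayscale values (rows of ints).
    """
    span = upper - lower
    return [[lower + span * (p / max_value) for p in row] for row in pixels]
```

With the defaults, a pixel value of 0 maps to 0.0 and 255 maps to 1.0.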

# Segmentation

We are building an image classification model rather than an image segmentation model, so there is no need to segment the data.

We are also on a tight schedule to meet our company goals, and we have neither the manpower nor the time to segment all of the images by hand.

Segmentation remains a possible option in the future if the company wants a more advanced brain tumor detection model.

# Transformation & Augmentation

With under 10K images, our dataset is fairly small, so one way to increase the amount of training data is image augmentation. This can be done by mirroring the images in different ways or applying other effects, which helps the model recognize brain and tumor images from different perspectives.
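
A minimal sketch of mirroring, written on plain nested lists so it runs without any imaging library. `mirror_variants` is a hypothetical helper; in a PyTorch pipeline the same idea would more likely be applied at load time with a random flip transform such as torchvision's `RandomHorizontalFlip`.

```python
def mirror_variants(image):
    """Return the original image plus its horizontal and vertical mirrors.

    `image` is a 2D list of pixel values (rows of ints or floats).
    """
    horizontal = [row[::-1] for row in image]  # left-right flip
    vertical = image[::-1]                     # top-bottom flip
    return [image, horizontal, vertical]
```

Each input image yields three training samples here, all sharing the original label.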