## Introduction
---
- Machine learning applications in medicine that support the diagnosis of diseases are crucial for catching them at an early stage. To do so, the algorithms must take into account the most telling characteristics of the tests performed on the targeted anatomical part of the patient. This is particularly true for brain tumors, where typical exams are prescribed to determine whether the patient has a tumor in the brain.
- We intend to build a model that classifies imaging results from exams such as CT scans and MRIs into 'healthy' or 'cancer' classes, to support specialists' decisions. We train this model on a dataset of about 4600 unique samples of these exams, consisting of their image results.
- Given time, we will explore classification between different types of brain cancer.
---
## Step 1 - Data Exploration and Preprocessing

- Inspecting the dataset structure and labels:
  - here we separate the dataset by its labels, Cancer/Healthy;
  - once separated, we can start operating on the dataset.

- The first operation needed is to resize the images to a fixed shape, so that they all share the same resolution;
- Next, we also need to split the dataset into "Training Data" and "Testing Data", as stated in the project proposal.
---
#### Process
- For this step we use the "pandas" library to read the CSV metadata and get the information provided about the images.
- Once this is done, we separate the data into "Healthy" and "Cancer" lists.
- After that, the main preprocessing of the data begins.

In [None]:
import numpy as np
import pandas as pd

DEBUG = True

csv_file = "./data/metadata_rgb_only.csv"

def read_data(csv_file):
    """Read the metadata CSV file and return it as a pandas DataFrame."""
    df = pd.read_csv(csv_file)
    return df

data = read_data(csv_file)

if DEBUG:
    print("🔴--------------🔴   Debug   🔴--------------🔴\n")
    print(data)


### Step 1 - Completed

- For now, the variable `data` holds all the information from the metadata file in a pandas DataFrame with N columns. From it we can select the columns needed to separate the images into the different classes.

## Step 2 - Splitting dataset by its label
---

In [None]:
def get_list(data, column):
    """Return the given column of the DataFrame as a pandas Series."""
    return data[column]

# Get the image labels
labels = get_list(data, column="class")

if DEBUG:
    print("🔴--------------🔴   Debug   🔴--------------🔴\n")
    print(labels)

# Get the images list
images = get_list(data, column="image")

if DEBUG:
    print("🔴--------------🔴   Debug   🔴--------------🔴\n")
    print(images)

# Split the images into healthy/cancer using a boolean mask on the labels
cancer  = images[labels=="tumor"]
healthy = images[labels=="normal"]

print(cancer, healthy, sep="\n\n")

### Step 2 - Completed


## Step 3 - Creation of Train and Testing datasets
---
- This step separates the training and testing datasets up front, by copying the images of each class into their own directories.

In [None]:
import os
import shutil
import random as rand

DATA_DIR = "./data"    
TRAIN_DIR = "./data/train"
TEST_DIR  = "./data/test"
TEST_SPLIT = 0.2
SEED = 42

rand.seed(SEED)

classes = ["cancer", "healthy"]

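for cls in classes:
    # Create the per-class output directories if they do not exist yet
    os.makedirs(os.path.join(TRAIN_DIR, cls), exist_ok=True)
    os.makedirs(os.path.join(TEST_DIR, cls), exist_ok=True)

    # Sort before shuffling: os.listdir() returns files in arbitrary order,
    # so sorting keeps the seeded shuffle reproducible
    files = sorted(f for f in os.listdir(os.path.join(DATA_DIR, cls)) if f.lower().endswith(('.png','.jpg','.jpeg')))
    rand.shuffle(files)

    # The first TEST_SPLIT fraction goes to the test set, the rest to the training set
    split_idx = int(len(files) * TEST_SPLIT)
    for f in files[:split_idx]:
        shutil.copy2(os.path.join(DATA_DIR, cls, f), os.path.join(TEST_DIR, cls, f))
    for f in files[split_idx:]:
        shutil.copy2(os.path.join(DATA_DIR, cls, f), os.path.join(TRAIN_DIR, cls, f))

print("✅ Dataset split done!")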


### Step 3 - Completed
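
As a sanity check on the split arithmetic used in Step 3, the per-class logic can be sketched as a pure function that works on a list of file names instead of the file system. This is a minimal sketch: `split_files` and the toy file names are hypothetical and not part of the project code.

```python
import random

def split_files(files, test_split=0.2, seed=42):
    """Shuffle a copy of `files` and return (train, test) lists."""
    shuffled = sorted(files)                 # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)    # local RNG, does not touch global state
    split_idx = int(len(shuffled) * test_split)
    return shuffled[split_idx:], shuffled[:split_idx]

# Toy file names, purely illustrative (not taken from the dataset)
toy_files = [f"img_{i:03d}.jpg" for i in range(10)]
train_files, test_files = split_files(toy_files)
print(len(train_files), len(test_files))  # 8 2
```

Because `int()` truncates, the test set can be slightly smaller than 20% when the number of files in a class is not a multiple of five.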

## Step 4 - Plotting images
---
- Separation of the images into the predetermined binary classes: `cancer` & `healthy`.
- With this we can, for example, determine the class priors.
- The dataset should also be split into `training` and `testing` data at this point.
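
The class priors mentioned above are simply the relative frequencies of each label. A minimal sketch, assuming the labels are available as a plain list; `class_priors` and the toy labels are hypothetical and only illustrate the computation:

```python
from collections import Counter

def class_priors(labels):
    """Return the relative frequency of each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Toy labels, purely illustrative (not the real class balance of the dataset)
toy_labels = ["tumor"] * 3 + ["normal"] * 2
priors = class_priors(toy_labels)
print(priors)  # {'tumor': 0.6, 'normal': 0.4}
```

With the notebook's `labels` Series, the same result should be obtainable with `class_priors(labels.tolist())` or with pandas' own `labels.value_counts(normalize=True)`.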

In [None]:
import os
import random as rand

import matplotlib.pyplot as plt
from PIL import Image


cancer_img_dir = "./data/cancer"
healthy_img_dir = "./data/healthy"

N = 5

def random_plot(image_list, label, image_dir, n=N):
    """Plot up to n randomly chosen images from image_list in a single row."""
    sample = rand.sample(image_list, min(n, len(image_list)))
    plt.figure(figsize=(12, 2))
    for i, img_name in enumerate(sample):
        img_path = os.path.join(image_dir, img_name)
        img = Image.open(img_path)
        plt.subplot(1, n, i+1)
        plt.imshow(img)
        plt.axis("off")
        plt.title(label)
    plt.show()

if DEBUG:
    print("🔴--------------🔴   Debug   🔴--------------🔴")
    random_plot(cancer.to_list(), "tumor", cancer_img_dir)
    random_plot(healthy.to_list(), "normal", healthy_img_dir)
    print(f"Number of cancer images: {len(cancer)}")
    print(f"Number of healthy images: {len(healthy)}")

### Step 4 - Completed

## Step 5 - Image Preprocessing/Normalization/Reshape and Splitting
---
- Since the images have different resolutions and formats, they should be normalized to the same size so that the model is trained on comparable inputs.
- To that end, we resize all images to a 128x128 resolution (the `img_size` value in the code), although that number can be changed.
    - All 'L' (grayscale) format images shall be converted to 'RGB'.

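- The images in `TRAIN_DIR` are split 80/20 into training and validation subsets; both classes have to be reasonably represented in the samples, especially in the training data.
- `datagen.flow_from_directory` reads the images from one subdirectory per class and yields batches of `(images, labels)`: `target_size` sets the resize resolution, `batch_size` the number of images per batch, `class_mode='binary'` produces 0/1 labels, and `subset` selects the `'training'` or `'validation'` part defined by `validation_split`.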
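
Two details of the generator are easy to overlook: class subdirectories are assigned their integer labels in alphanumeric order, and `rescale=1./255` maps pixel values from [0, 255] to [0, 1]. The sketch below mimics both in plain Python; `class_indices` and `rescale_pixel` are hypothetical helpers written for illustration, not Keras functions.

```python
def class_indices(class_names):
    """Map class names to integer labels in sorted (alphanumeric) order."""
    return {name: idx for idx, name in enumerate(sorted(class_names))}

def rescale_pixel(value, scale=1.0 / 255):
    """Apply the same multiplication that rescale=1./255 performs per pixel."""
    return value * scale

indices = class_indices(["healthy", "cancer"])
print(indices)  # {'cancer': 0, 'healthy': 1}
print(rescale_pixel(0), rescale_pixel(128), rescale_pixel(255))
```

On the real generator, the mapping actually in use can be read from `train_gen.class_indices`.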

In [None]:
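from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1./255,        # scale pixel values from [0, 255] to [0, 1]
    validation_split=0.2   # 20% validation, 80% training (within TRAIN_DIR)
)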

batch_size = 4
img_size = (128,128)

train_gen = datagen.flow_from_directory( # flow_from_directory() by default converts to RGB
    TRAIN_DIR,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='binary',
    subset='training'
)

val_gen = datagen.flow_from_directory(
    TRAIN_DIR,
    target_size=img_size,
    batch_size=batch_size,
    class_mode='binary',
    subset='validation'
)

if DEBUG:
    print("🔴--------------🔴   Debug   🔴--------------🔴")
    # Use new names so the `images` and `labels` Series from Step 2 are not overwritten
    batch_images, batch_labels = next(train_gen)  # one batch of batch_size images

    # Print shape of the first image to confirm size
    print("Shape of first image:", batch_images[0].shape)  # should be (128,128,3) since target_size=(128,128)

    # Plot the first images in the batch
    n_show = min(8, len(batch_images))  # just in case batch_size < 8
    plt.figure(figsize=(16, 4))
    for i in range(n_show):
        plt.subplot(1, n_show, i+1)
        plt.imshow(batch_images[i])
        plt.axis('off')
        plt.title('cancer' if batch_labels[i] == 0 else 'healthy')  # classes are indexed alphabetically
    plt.show()


### Step 5 - Completed

## Step 6 - Training and Validating the CNN model
---
- Since the choice of model largely determines the results of our program, we decided to use a CNN (Convolutional Neural Network), an architecture that is well suited to image data and is also applied to video and audio.
- Source of inspiration: https://www.kaggle.com/code/kanncaa1/convolutional-neural-network-cnn-tutorial
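
To make the architecture in the following cell more concrete, the sketch below traces how the spatial size of a 128x128 input shrinks through the three convolution and pooling stages. It assumes the Keras defaults that the model relies on (3x3 kernels with 'valid' padding and stride 1, 2x2 pooling with stride 2); `conv_out` and `pool_out` are hypothetical helpers.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool

size = 128
for _ in range(3):                 # three Conv2D + MaxPooling2D stages
    size = pool_out(conv_out(size))

flat_features = size * size * 64   # 64 filters in the last Conv2D layer
print(size, flat_features)  # 14 12544
```

The `Flatten` layer therefore feeds 12544 values into `Dense(32)`, which is where most of the trainable parameters of this small network sit.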

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(16, (3,3), activation='relu', input_shape=(128,128,3)),
    MaxPooling2D(2,2),
    Conv2D(32, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Flatten(),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

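from tensorflow.keras.metrics import Precision, Recall

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    # Metric objects are used because string names such as 'precision' may not be resolved by every Keras version.
    # Precision/recall treat label 1 as the positive class, i.e. 'healthy' under alphabetical class indexing.
    metrics=['accuracy', Precision(name='precision'), Recall(name='recall')]
)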

history = model.fit(
    train_gen,        
    validation_data=val_gen,  # validation data used to evaluate the model after each epoch
    epochs=30,
    verbose=1
)
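
Once training finishes, `history.history` is a dictionary of per-epoch metric lists. A minimal sketch of how the epoch with the lowest validation loss could be located; `best_epoch` is a hypothetical helper and the toy values are made up for illustration, not results of this model.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return (1-based epoch, value) with the lowest value for `key`."""
    values = history_dict[key]
    best_idx = min(range(len(values)), key=values.__getitem__)
    return best_idx + 1, values[best_idx]

# Toy values, purely illustrative (not measured results)
toy_history = {"val_loss": [0.69, 0.55, 0.48, 0.52]}
epoch, loss = best_epoch(toy_history)
print(epoch, loss)  # 3 0.48
```

In the notebook this would be called as `best_epoch(history.history)`; a validation loss that rises while the training loss keeps falling is a common sign of overfitting.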

