## Hackathon Day 1
### Data Loading and Exploration

On the first day of our hackathon, we learn how to load and preprocess the data and gain some insight into the data structure and exemplary instances.

### Task 1: Select a dataset and initialize it

As the first step, we need to import the data and initialize the Dataset object. It handles the download and the access to single data instances. A second important step when using data for machine learning is to think about a proper preprocessing pipeline. Raw data often has value ranges or data structures that need to be changed before it can be used in machine learning models. A convenient package to define preprocessing transformations is `torchvision` (https://pytorch.org/vision/).
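To make the idea of changing value ranges concrete, here is a minimal NumPy sketch of two typical preprocessing steps. The toy image and the mean and standard deviation of 0.5 are assumptions chosen for illustration, not values taken from PneumoniaMNIST. In `torchvision.transforms.v2`, comparable steps are available as `ToDtype(torch.float32, scale=True)` and `Normalize`.

```python
import numpy as np

# Hypothetical 2x2 grayscale image with raw 8-bit pixel values in [0, 255].
raw = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Step 1: convert to float and rescale the value range to [0, 1].
scaled = raw.astype(np.float32) / 255.0

# Step 2: normalize with an assumed mean and standard deviation of 0.5,
# which maps the range [0, 1] to [-1, 1].
mean, std = 0.5, 0.5
normalized = (scaled - mean) / std

print(normalized.min(), normalized.max())
```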

* Import the PneumoniaMNIST data with the `medmnist` package (https://medmnist.com/) and initialize it using preprocessing.
* The PneumoniaMNIST dataset comes with several image sizes. Find out which sizes are available and visualize the differences.

The MNIST datasets (like many other common datasets) have 3 so-called "splits".

* What are the 3 splits of the data set?
* What is the purpose of each split?
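The idea of splitting can be sketched with plain index arrays. Note that `medmnist` ships its own fixed splits, so you do not need to do this yourself. The sample count and the ratios below are assumptions chosen only to illustrate that the three subsets are disjoint and together cover the whole dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples = 100  # hypothetical dataset size

# Shuffle all sample indices once, then cut the shuffled list into three parts.
indices = rng.permutation(n_samples)

# Assumed, purely illustrative ratios: 70% train, 10% validation, 20% test.
n_train, n_val = 70, 10
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))
```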

In [None]:
from torchvision.transforms import v2 as transforms
import torch

# IMPORT PNEUMONIA MNIST MODULE FROM MEDMNIST HERE

# preprocessing
TRANSFORM = transforms.Compose(
    [
        transforms.ToImage(),
        # the `Compose` function can take a list of transformations (also called pipeline)
        # think of necessary preprocessing functions here and add them to the list
        # the basic transforms will convert the images to suitable data types and normalize the images
        # SPECIFY LIST_OF_TRANSFORMS HERE
    ]
)

# simple initialization of training-, validation-, and test-set:
# DOWNLOAD AND APPLY THE TRANSFORMS TO THE PNEUMONIAMNIST DATASET.
train_dataset = PneumoniaMNIST(split=SPLIT_NAME, transform=TRANSFORM, download=True, size=SIZE)

### Task 2: Visualize the dataset

A visual exploration of the data should be one of your first steps when conducting a machine learning project. It helps you get familiar with the data structure and quality. Especially for use-case-tailored modeling tasks, it is essential to have broad knowledge of the properties and characteristics of your data.

* Print out some properties of the datasets initialized before
* Visualize some example instances by using the `medmnist` package

In [None]:
# VISUALIZE PROPERTIES AND QUALITATIVE SAMPLES OF TRAINING-SET:
print("Training Dataset:")
print(DATASET)

# PLOT SOME SAMPLES FROM THE DATASET FOR TRAINING AND EVALUATION SPLITS

As the dataset information shows, we have a binary classification task. The label '0' represents a normal chest X-Ray, whereas the label '1' represents a chest X-Ray where pneumonia is visible. Thus, it might be useful to check whether we can already see differences between instances with different labels.

* Write code that visualizes 10 example images each for the 'normal' and 'pneumonia' labels from the train dataset, using the `matplotlib` package (https://matplotlib.org/)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, axes = plt.subplots(2,10, figsize=(12, 6))
for n in range(2):
    count=0
    for img, label in train_dataset:
        if label[0]==n and count<10:
            # PLOT_SAMPLES here
            count+=1

# Set titles for the rows
fig.text(0.5, 0.76, "normal", ha='center', fontsize=12)
fig.text(0.5, 0.4, "pneumonia", ha='center', fontsize=12)

# Adjust layout to ensure titles don't overlap with images
plt.tight_layout(rect=[0, 0, 1, 0.95])
plt.show()

Can you see significant differences between both labels? Briefly explain your findings.

### Task 3: Plot class distributions

Another recommended, straightforward exploration method is to look at the class label distribution, i.e. how many instances are available for each label.

* Find a way to extract the labels for each of the 3 data sets (train, validation, test)
* Count the labels for each set
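As a small illustration of counting labels, the following sketch uses `np.unique` with `return_counts=True` on a hand-made label array. The toy labels are an assumption and do not reflect the real PneumoniaMNIST distribution.

```python
import numpy as np

# Hypothetical label array: 0 = normal, 1 = pneumonia.
toy_labels = np.array([1, 0, 1, 1, 0, 1])

# `classes` holds the sorted distinct label values,
# `counts` holds how often each of them occurs.
classes, counts = np.unique(toy_labels, return_counts=True)

for cls, count in zip(classes, counts):
    print(f"label {cls}: {count} instances")
```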

In [None]:
import numpy as np


# Helper function to extract the labels of the dataset
def get_labels(dataset):
    labels = []
    for _, label in dataset:
        labels.extend(label)
    return np.array(labels)


# EXTRACT LABELS FOR EACH SPLIT
labels = get_labels(DATASET)
# REPEAT FOR THE EVALUATION SPLITS

# CALCULATE CLASS COUNTS OF PNEUMONIA/NORMAL SAMPLES FOR EACH SPLIT
# HINT: NumPy has a ready-to-use function that counts the unique values in an array
SPLIT_COUNT = ...

Now that we have the label counts, we want to visualize them, for example as a bar plot. Again, use `matplotlib` for that.

* "easy" task: Make a separate class distribution bar plot for each set
* "advanced" task: Make a stacked bar plot for all class distributions
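For the stacked variant, each bar segment has to start where the previous one ends. One way to sketch this, assuming the counts are arranged as a 2D array with one row per split and one column per class, is to compute the start offsets with a cumulative sum. The counts and the name `class_names` below are made up for illustration.

```python
import numpy as np

# Hypothetical counts: rows = splits, columns = classes (Normal, Pneumonia).
split_names = ["train", "validation", "test"]
class_names = ["Normal", "Pneumonia"]
counts = np.array([[30, 70], [5, 10], [10, 15]])

# The segment of split i starts at the summed heights of all previous splits.
bottoms = np.cumsum(counts, axis=0) - counts

for name, height, bottom in zip(split_names, counts, bottoms):
    # With matplotlib, each row could be drawn as:
    #   ax.bar(class_names, height, bottom=bottom, label=name)
    print(name, height, bottom)
```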

In [None]:
# Labels for plotting
labels_names = ["Normal", "Pneumonia"]

# Easy: Create single bar plots
for counts, split in zip([.., .., ..], ["train", "validation", "test"]):
    fig, ax = plt.subplots()
    #PLOT BARPLOT HERE
    ax.set_xlabel("Class")
    ax.set_ylabel("Count")
    ax.set_title(f"Class distribution for {split} set")
    plt.show()

# Advanced: Create a stacked bar plot
fig, ax = plt.subplots()
#PLOT BARPLOT HERE

ax.set_xlabel("Class")
ax.set_ylabel("Count")
ax.set_title("Class distribution")
ax.legend()
plt.show()

Describe your findings:

* Can you see something remarkable?
* How are the labels distributed?
* Are there implications / difficulties that may occur during model training?

### Task 4 (optional): Data augmentation

#### Torchvision Augmentation

For most machine learning problems, the rule is: the more data, the better. The greater the variety in the data, the more the model can learn from it, which typically improves its accuracy. Of course, the total amount of data is limited, and especially in medical imaging, data acquisition is fairly expensive. However, there are methods to enlarge the dataset with artificial instances that are "close" or "similar" to the real ones, created by manipulating the existing real data. This principle is known as *Data Augmentation*.

In the field of (medical) image processing, we can augment our data e.g. by rotating, mirroring, or cropping existing images. Also changes in brightness, contrast, resolution, or adding random noise are possible. Basically, any augmentation method that leads to realistic representation and transformations of the real data is helpful.
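A few of these manipulations can be sketched directly in NumPy. The random array below is only a stand-in for a real 28x28 image, and the noise level of 0.05 is an assumption. Whether a particular operation, such as mirroring, still produces a realistic chest X-Ray is for you to judge.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a 28x28 grayscale image with values in [0, 1].
image = rng.random((28, 28)).astype(np.float32)

# Mirror the image horizontally (left-right flip).
flipped = np.fliplr(image)

# Rotate the image by 90 degrees counter-clockwise.
rotated = np.rot90(image)

# Add Gaussian noise and clip back to the valid value range.
noisy = np.clip(image + rng.normal(0.0, 0.05, size=image.shape), 0.0, 1.0)

print(flipped.shape, rotated.shape, noisy.shape)
```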

In general, torchvision transforms (https://pytorch.org/vision/stable/transforms.html) provide a module for data transformations and augmentation. The transforms can be composed into a sequential transform and are applied on the fly when passed to the dataset object.


In [None]:
TRANSFORM = transforms.Compose(
    [
        transforms.ToImage(),
        # include list of transformations for data augmentation
    ]
)


train_dataset = PneumoniaMNIST(split=SPLIT_NAME, transform=TRANSFORM, download=True)

To check whether the images are transformed on the fly, sample the same images from the dataset multiple times and check whether they differ between draws.

In [None]:
from mpl_toolkits.axes_grid1 import ImageGrid

fig = plt.figure(figsize=(8, 8))
# we prepare a grid where N_ROWS images are sampled from the dataset N_COLS times
# SPECIFY THE N_ROWS AND N_COLS VALUES HERE
grid = ImageGrid(fig, 111, nrows_ncols=(N_ROWS, N_COLS), axes_pad=0.1)

sample_images = [train_dataset[row_idx][0] for row_idx in range(N_ROWS) for col_idx in range(N_COLS)]
for ax, im in zip(grid, sample_images):
    # PLOT IMAGE HERE
    ax.axis("off")
plt.show()

#### MONAI augmentation

Especially for medical imaging, `MONAI` (https://monai.io/) is a powerful Python framework for machine learning in medical applications and, in particular, also provides image augmentation methods. See https://docs.monai.io/en/stable/transforms.html for the available transforms.

In [None]:
# import monai libraries

import random
from monai.data import CacheDataset
from monai.transforms import Compose, EnsureTyped, ScaleIntensityd

In [None]:
data_transforms = Compose([EnsureTyped(keys=["image"], data_type="tensor"), ScaleIntensityd(keys=["image"], minv=0, maxv=1)])


train_transforms = Compose(
    [
        # INCLUDE MONAI TRANSFORMATIONS HERE
        # HINT: You can also randomly rotate or flip your dataset, apply gaussian noise, etc. to create more variability in the dataset
    ]
)


# If you want to use Monai data transformations you have to use a wrapper class on PneumoniaMNIST to return the images and respective labels in the MONAI compatible format
class WrappedPneumoniaMNIST(PneumoniaMNIST):
    def __getitem__(self, index):
        image, label = super().__getitem__(index)
        image = torch.tensor(np.array(image)).unsqueeze(0)
        return {"image": image, "label": label}


# Load Pneumonia dataset for training, validation and test
size = (
    ...
)  # for first experiments you can use 28x28 to get fast results but in the end you should work with a higher resolution, e.g. 224x224
train_dataset = WrappedPneumoniaMNIST(split="train", download=True, size=size)
val_dataset = ...
test_dataset = ...

# Wrap with MONAI CacheDataset
# Attention! If you use transformations to augment your dataset, apply them only to the training data and not to the validation and test sets, as we want to leave these unchanged.
train_monai_dataset = CacheDataset(data=train_dataset, transform=train_transforms, cache_rate=1.0, num_workers=4)
val_monai_dataset = ...
test_monai_dataset = ...

We can visualize the data augmentation by selecting random samples from the dataset and showing each original image next to its transformed version.




In [None]:
# Show 10 images
num_samples = 10

# CREATE A LIST WITH 10 RANDOM INDICES
random_indices = random.sample(...)

fig, axs = plt.subplots(2, num_samples, figsize=(20, 5))

# HINT: For a specific index *idx* you can call the original data using dataset.data.imgs[idx] and the transformed data using dataset[idx]['image']
for col, idx in enumerate(random_indices):
    orig_img = ...
    transformed_img = ...

    # SHOW IMAGE (address the subplot by column position, not by dataset index)
    axs[0, col].imshow(orig_img, cmap="gray")
    axs[1, col].imshow(transformed_img.squeeze(), cmap="gray")

    # ASSIGN TITLE TO THE PLOT AND THE INDIVIDUAL IMAGES AS NORMAL/PNEUMONIA

plt.show()

### Task 5: Dataloader 

Training on large datasets all at once may exceed memory limits. Batching allows models to handle smaller chunks of data, fitting them into memory more easily. A DataLoader efficiently loads and manages data by handling batching, shuffling, and parallel processing. It feeds data into models during training and evaluation. Usually, we provide a dataset and batch size to the dataloader to create mini-batches out of the whole dataset.
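To illustrate what batching and shuffling do, here is a hand-written NumPy sketch. It is not the PyTorch `DataLoader` and leaves out parallel loading entirely. The function name `iterate_minibatches` and the toy data are assumptions made for this example.

```python
import numpy as np


def iterate_minibatches(data, labels, batch_size, shuffle, seed=0):
    """Yield (data, labels) chunks of at most `batch_size` samples."""
    order = np.arange(len(data))
    if shuffle:
        # Visit the samples in a random order instead of the stored order.
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        yield data[batch_idx], labels[batch_idx]


# Toy dataset with 10 samples; the last batch is smaller than the others.
data = np.arange(10).reshape(10, 1)
labels = np.arange(10) % 2
batches = list(iterate_minibatches(data, labels, batch_size=4, shuffle=True))

print([len(x) for x, _ in batches])
```

Every sample appears in exactly one batch per pass over the data, and only the order changes when `shuffle` is enabled.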

In [None]:
#IMPORT DATALOADER FROM PYTORCH DATA MODULE 
from .. import DataLoader

# PREPARE THE LOADERS FOR TRAIN AND EVALUATION SPLITS
#SPECIFY THE APPROPRIATE ARGUMENTS HERE

# Check which arguments should be set differently for the training and evaluation dataloaders.
train_dataloader = DataLoader(dataset=..,
                              batch_size= ..,
                              shuffle=..)

### Task 6 (optional): ChestMNIST

Now, we have a look at another `medmnist` dataset - ChestMNIST.

* Initialize the ChestMNIST data sets with a preprocessing pipeline

In [None]:
# IMPORT ChestMNIST

# preprocessing
data_transform = transforms.Compose(
    [
        transforms.ToImage(),
        # INCLUDE A LIST OF TRANSFORMS HERE
    ]
)

# INITIALIZE TRAINING AND EVALUATION DATASETS FROM CHESTMNIST
train_dataset = ChestMNIST(split=.., transform=.., download=True)


Print out some data set properties and example images:

In [None]:
# VISUALIZE PROPERTIES AND QUALITATIVE SAMPLES OF TRAINING-SET:
print("Training Dataset:")
print(DATASET)

# VISUALIZE THE VALIDATION AND TEST SETS

What difference can you observe in the ChestMNIST dataset?

The ChestMNIST dataset is a so-called "multi-label" dataset.

* What is the difference between a "multi-label" and a "multi-class" problem?

Now, we have a look into how the labels are represented in the dataset.

* Print out the labels of a few instances and find out, how the labels are represented

The labels are represented in a so-called "multi-hot" encoding. This is a binary label vector where a "0" means absence and a "1" means presence of the respective disease.

* Why is that convenient if you think about model training?
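The encoding itself can be sketched with a few lines of NumPy. The disease names and findings below are hypothetical placeholders; the real ChestMNIST label names come from the dataset information.

```python
import numpy as np

# Hypothetical disease names (placeholders, not the ChestMNIST labels).
disease_names = ["disease_a", "disease_b", "disease_c", "disease_d"]

# For each image, the indices of the diseases that are present.
# The second image shows no finding at all.
findings_per_image = [[0, 2], [], [3]]

# One row per image, one column per disease, all zeros at first.
multi_hot = np.zeros((len(findings_per_image), len(disease_names)), dtype=np.int64)
for row, findings in enumerate(findings_per_image):
    for col in findings:
        multi_hot[row, col] = 1

print(multi_hot)
```

Every label vector has the same fixed length regardless of how many diseases are present, which is what distinguishes it from a single class index.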

Again, we can count the label abundances but with another strategy than before.

* Try to find an efficient way to calculate the counts of each label by avoiding explicit loops
* Plot the results again in a stacked bar diagram
* What can you observe?
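One loop-free possibility, assuming the labels are available as a 2D multi-hot array with one row per instance, is to sum along the instance axis. The toy array below is made up for illustration.

```python
import numpy as np

# Hypothetical multi-hot labels: 4 instances (rows), 3 labels (columns).
multi_hot_labels = np.array(
    [
        [1, 0, 1],
        [0, 0, 1],
        [1, 1, 1],
        [0, 0, 0],
    ]
)

# Summing over the rows (axis=0) counts how often each label occurs.
counts_per_label = multi_hot_labels.sum(axis=0)

# Summing over the columns (axis=1) counts the labels of each instance.
labels_per_instance = multi_hot_labels.sum(axis=1)

print(counts_per_label, labels_per_instance)
```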

In [None]:
import numpy as np

# CALCULATE CLASS COUNT FOR EACH SPLIT
# HINT: you can use a numpy function
train_counts = ..

labels_names = ..

In [None]:
# Advanced: Create a stacked bar plot
fig, ax = plt.subplots()
# CREATE STACKED BARPLOT OF DIFFERENT CLASSES INCLUDING TRAIN, VALIDATION AND TEST SPLITS

ax.set_xlabel("Class")
ax.set_ylabel("Count")
ax.set_title("Class distribution")
ax.tick_params(axis="x", labelrotation=90)
ax.legend()

plt.show()

Since we have a multi-label problem here, it is hard to isolate one disease from another. However, we can do something more interesting: check whether some combinations of diseases are more common than others, i.e. whether the labels are correlated. A nice way of doing that is to calculate the correlation matrix.

* Calculate the correlation matrix of the training labels.
* Plot the correlation matrix
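As a sketch of what such a matrix looks like, the following example applies `np.corrcoef` to a small hand-made multi-hot array. The toy labels are constructed so that the first two labels always co-occur and the third is always their opposite; they are an assumption and say nothing about ChestMNIST. Note that a label column that never varies would produce NaN entries.

```python
import numpy as np

# Hypothetical multi-hot labels: 6 instances (rows), 3 labels (columns).
# Column 1 is identical to column 0, column 2 is its exact complement.
toy_labels = np.array(
    [
        [1, 1, 0],
        [1, 1, 0],
        [0, 0, 1],
        [0, 0, 1],
        [1, 1, 0],
        [0, 0, 1],
    ]
)

# rowvar=False treats each column as one variable (label)
# and each row as one observation (instance).
corr = np.corrcoef(toy_labels, rowvar=False)

print(np.round(corr, 2))
```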

In [None]:
corr_mat = ..

In [None]:
# plot the correlation matrix
fig, ax = plt.subplots()
cax = ax.matshow(corr_mat, cmap="coolwarm")
cbar = fig.colorbar(cax)
cbar.set_label("Correlation coefficient")
ax.set_xticks(np.arange(len(labels_names)))
ax.set_yticks(np.arange(len(labels_names)))
ax.set_xticklabels(labels_names, rotation=90)
ax.set_yticklabels(labels_names)
plt.show()

### If you are interested ...

There are some more interesting Python packages for data exploration and manipulation than `numpy` and `matplotlib`. Here, we briefly mention some recommendable packages.

#### [`pandas`](https://pandas.pydata.org)
Pandas is a comprehensive library for data analysis with Python. It offers efficient data structures and data manipulation methods, mainly for numerical, structured, or tabular data. Additionally, it includes some visualization functions.

#### [`seaborn`](https://seaborn.pydata.org)
Seaborn is a library for statistical data visualization. It is based on matplotlib and complements it regarding convenience functions for plotting and advanced design options enabling efficient data exploration.

#### [`scipy`](https://scipy.org)
SciPy is one of the fundamental libraries for scientific computation with Python. It offers lots of algorithms and functions for optimization, integration, interpolation, linear algebra, differential equations, etc.

#### [`scikit-image`](https://scikit-image.org)
Scikit-image is a collection of algorithms for image processing and manipulation in Python. It offers functions in the context of image segmentation, geometric transforms, color manipulations, filtering, feature extraction and generation, etc.