# Background 

## Task description:

In this competition, we'll **identify and segment functional tissue units (FTUs)** across **five** human organs:

* Prostate
* Spleen
* Lung
* Kidney
* Large Intestine

The challenge in this competition is to build algorithms that **generalize**:
* across different **organs** and
* across differences between **datasets**

=> This is a **semantic segmentation** problem.

## Data description:

The training dataset contains only public *Human Protein Atlas (HPA)* data. The public test set combines private *HPA* data with *Human BioMolecular Atlas Program (HuBMAP)* data, while the private test set contains only *HuBMAP* data.

**File Information**:
1. `train|test.csv`:
* `id` - The image ID.
* `organ` - The organ that the biopsy sample was taken from.
* `data_source` - Whether the image was provided by HuBMAP or HPA.
* `img_height` - The height of the image in pixels.
* `img_width` - The width of the image in pixels.
* `pixel_size` - The height/width of a single pixel from this image in micrometers. All HPA images have a pixel size of 0.4 µm. For HuBMAP imagery the pixel size is 0.5 µm for kidney, 0.2290 µm for large intestine, 0.7562 µm for lung, 0.4945 µm for spleen, and 6.263 µm for prostate.
* `tissue_thickness` - The thickness of the biopsy sample in micrometers. All HPA images have a thickness of 4 µm. The HuBMAP samples have tissue slice thicknesses of 10 µm for kidney, 8 µm for large intestine, 4 µm for spleen, 5 µm for lung, and 5 µm for prostate.
* `rle` - The target column. A run length encoded copy of the annotations. Provided for the training set only.
* `age` - The patient's age in years. Provided for the training set only.
* `sex` - The sex of the patient. Provided for the training set only.
2. `train_annotations`: provided in the format of **points that define the boundaries of the polygon masks of the FTUs**
3. `train|test_images`: the images:
* Expect roughly 550 images in the hidden test set.
* All images used have at least one FTU.
* All tissue data used in this competition is from healthy donors that pathologists identified as pathologically unremarkable tissue.
* HPA details:
    * All HPA images are 3000 x 3000 pixels with a tissue area within the image around 2500 x 2500 pixels.
    * HPA samples were stained with antibodies visualized with 3,3'-diaminobenzidine (DAB) and counterstained with hematoxylin.
* HuBMAP details:
    * The HuBMAP images range in size from 4500 x 4500 down to 160 x 160 pixels.
    * HuBMAP images were prepared using Periodic acid-Schiff (PAS)/hematoxylin and eosin (H&E) stains.
4. `sample_submission.csv`:
* `id` - The image ID.
* `rle` - A run length encoded mask of the FTUs in the image.
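
Because `pixel_size` differs between the two data sources, one option for reducing that gap is to resize every image so that a pixel covers the same physical distance. The sketch below only shows the arithmetic; it is an assumption about how one might use the column, not something the competition prescribes. `target_size` and `REFERENCE_PIXEL_SIZE` are hypothetical names, and the 1000 x 1000 kidney image is made up for illustration.

```python
REFERENCE_PIXEL_SIZE = 0.4  # µm per pixel, the HPA value quoted above


def target_size(img_height, img_width, pixel_size, reference=REFERENCE_PIXEL_SIZE):
    """Return (height, width) after rescaling so one pixel covers `reference` µm."""
    scale = pixel_size / reference
    return round(img_height * scale), round(img_width * scale)


# A hypothetical 1000 x 1000 HuBMAP kidney image (0.5 µm per pixel)
print(target_size(1000, 1000, 0.5))   # (1250, 1250)
# An HPA image is already at the reference scale
print(target_size(3000, 3000, 0.4))   # (3000, 3000)
```

Whether rescaling to the HPA resolution actually helps the model is an empirical question; the same function works with any other reference value.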

## Evaluation metric:

This competition is evaluated on the mean [Dice coefficient](https://radiopaedia.org/articles/dice-similarity-coefficient#:~:text=The%20Dice%20similarity%20coefficient%2C%20also,between%20two%20sets%20of%20data.). The Dice coefficient can be used to compare the pixel-wise agreement between a predicted segmentation and its corresponding ground truth. The formula is given by:

$$\frac{2 \cdot |X \cap Y|}{|X| + |Y|}$$

where 
* X is the predicted set of pixels and Y is the ground truth. 
* The Dice coefficient is defined to be 1 when both X and Y are empty. 
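
As a minimal sketch of the formula above, the following NumPy code computes the Dice coefficient for one pair of binary masks and averages it over several images. The function names `dice_coefficient` and `mean_dice` are illustrative, and the exact way the competition aggregates scores may differ from this simple mean.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denominator = pred.sum() + truth.sum()
    if denominator == 0:
        return 1.0  # both masks are empty: defined as 1
    return 2.0 * np.logical_and(pred, truth).sum() / denominator


def mean_dice(preds, truths):
    """Average the per-image Dice coefficient over paired lists of masks."""
    return float(np.mean([dice_coefficient(p, t) for p, t in zip(preds, truths)]))


pred = np.array([[1, 1, 0], [0, 0, 0]])
truth = np.array([[1, 0, 0], [1, 0, 0]])
print(dice_coefficient(pred, truth))  # 2 * 1 / (2 + 2) = 0.5
```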

**Note**: the metric is used to judge the performance of the model, whereas the loss function is what we minimize to train it.

In this case, our metric is the **mean Dice coefficient**, and we can use different loss functions like 
* **Dice Loss**
* **Jaccard Loss**
* **BCE Loss**
* **Lovasz Loss**
* **Tversky Loss**

to optimize our models.
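
Several of these losses are closely related. The Tversky index weights false positives and false negatives separately; with both weights at 0.5 it equals the (soft) Dice coefficient, and with both at 1 it equals the Jaccard index. The NumPy sketch below only illustrates that arithmetic on predicted probabilities. In actual training the same expression would be written in an autodiff framework, and `tversky_loss` with its `eps` smoothing term is an illustrative choice rather than a reference implementation.

```python
import numpy as np


def tversky_loss(probs, truth, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index, computed on soft predictions in [0, 1]."""
    probs = np.asarray(probs, dtype=float).ravel()
    truth = np.asarray(truth, dtype=float).ravel()
    tp = (probs * truth).sum()            # soft true positives
    fp = (probs * (1.0 - truth)).sum()    # soft false positives
    fn = ((1.0 - probs) * truth).sum()    # soft false negatives
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


probs = np.array([0.9, 0.8, 0.2, 0.1])
truth = np.array([1, 0, 1, 0])

dice_loss = tversky_loss(probs, truth, alpha=0.5, beta=0.5)     # Dice loss
jaccard_loss = tversky_loss(probs, truth, alpha=1.0, beta=1.0)  # Jaccard loss
print(round(dice_loss, 4), round(jaccard_loss, 4))
```

Raising `beta` above `alpha` penalizes missed FTU pixels more than spurious ones, which is one reason Tversky loss is sometimes tried on imbalanced masks.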

## Submission file format

To reduce the submission file size, the metric uses run-length encoding on the pixel values.

Instead of submitting an exhaustive list of indices for our segmentation, we submit pairs of values, each containing a start position and a run length.

E.g. '1 3' implies starting at pixel 1 and running a total of 3 pixels (1,2,3).

Note that, at the time of encoding, the mask should be **binary**:
* The masks for all objects in an image are joined into a single large mask
* The value of 0 should indicate pixels that are not masked
* The value of 1 will indicate pixels that are masked.
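
To make the pair format concrete, here is a small pure-Python sketch that encodes an already flattened binary mask. `encode_runs` is a hypothetical helper written for illustration only; it ignores the question of how a 2-D mask is flattened and just shows how 1-based `start length` pairs are produced.

```python
def encode_runs(flat_mask):
    """Encode a flat 0/1 sequence as 'start length' pairs with 1-based starts."""
    pairs = []
    run_start = None
    for position, value in enumerate(flat_mask, start=1):
        if value == 1 and run_start is None:
            run_start = position                      # a new run of 1s begins
        elif value != 1 and run_start is not None:
            pairs += [run_start, position - run_start]
            run_start = None                          # the run has ended
    if run_start is not None:                         # run reaches the end of the mask
        pairs += [run_start, len(flat_mask) + 1 - run_start]
    return " ".join(map(str, pairs))


print(encode_runs([1, 1, 1, 0, 0, 1, 1, 0]))  # 1 3 6 2
```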

## Methods:

1. Overview:
    * **Run-length encoding (RLE)**: a form of lossless data compression.
    
    Since we already have the RLE masks, we don't need to use the annotations from the `.json` files.
    * **Given**: images (`.tiff`), masks in RLE (we need to convert RLE to binary mask before feeding to our models)
    * **Predict**: masks then convert to RLE for submission
2. Data processing:
    * Resize
    * Normalize
3. Data augmentation:
4. Baseline model: 
    * UNET

5. Testing model:
    * UNET + pretrained model from previous competition
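
The resize and normalize steps from the data processing item above could look roughly like the sketch below. It is written in plain NumPy so that it runs on its own: `resize_nearest` is a hypothetical nearest-neighbour stand-in for a library call such as `cv2.resize`, the target size of 256 is arbitrary, and dividing by 255 assumes 8-bit images. None of these choices come from the notebook itself.

```python
import numpy as np


def resize_nearest(array, out_height, out_width):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    height, width = array.shape[:2]
    rows = np.arange(out_height) * height // out_height  # source row per output row
    cols = np.arange(out_width) * width // out_width     # source column per output column
    return array[rows][:, cols]


def preprocess(image, mask, size=256):
    """Resize an image/mask pair and scale the image to the range [0, 1]."""
    image = resize_nearest(image, size, size).astype(np.float32) / 255.0
    mask = resize_nearest(mask, size, size).astype(np.uint8)  # mask stays binary
    return image, mask


# Synthetic stand-ins for one image and its mask
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 300, 3), dtype=np.uint8)
mask = np.zeros((300, 300), dtype=np.uint8)
mask[100:200, 100:200] = 1

small_image, small_mask = preprocess(image, mask, size=256)
print(small_image.shape, small_mask.shape)  # (256, 256, 3) (256, 256)
```

Nearest-neighbour sampling keeps the mask values at exactly 0 and 1; for the image itself an area or bilinear interpolation is usually the better choice.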

# Imports

In [58]:
import os 
import glob
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

import plotly
from plotly import tools
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline as pyo
import plotly.io as pio
#pio.templates.default = 'plotly_white'
sns.set_theme(style="darkgrid")
import cv2
import tifffile as tiff

import warnings
warnings.simplefilter("ignore")

## Global Config

In [2]:
class config:
    BASE_PATH = Path("../input/hubmap-organ-segmentation/")
    TRAIN_CSV_PATH = BASE_PATH / "train.csv"
    TRAIN_IMAGES_PATH = BASE_PATH / "train_images/"
    TRAIN_ANNOTATIONS_PATH = BASE_PATH / "train_annotations/" # Not needed

# Load datasets

## Train dataset

In [3]:
data = pd.read_csv(config.TRAIN_CSV_PATH)
data.head()

## Train images

Load a single image from `train_images`

In [4]:
id_ = 10274
img = tiff.imread(str(config.TRAIN_IMAGES_PATH/ f"{id_}.tiff"))
print(img.shape)

### Plot image

In [5]:
plt.figure(figsize=(10, 10))
plt.imshow(img)

## Train annotations

**Run-length encoding (RLE)**: a form of lossless data compression

In [6]:
# https://www.kaggle.com/paulorzp/rle-functions-run-length-encode-decode
def mask2rle(img): # encoder
    '''
    img: numpy array, 1 - mask, 0 - background
    Returns the run lengths as a formatted string
    '''
    pixels = img.T.flatten()  # column-major order: down each column, left to right
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return ' '.join(str(x) for x in runs)
 
def rle2mask(mask_rle, shape=(1600,256)): # decoder
    '''
    mask_rle: run lengths as a formatted string (start length)
    shape: (width, height) of the array to return
    Returns numpy array, 1 - mask, 0 - background
    '''
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0]*shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).T

In [7]:
mask = rle2mask(data[data["id"]==id_]["rle"].iloc[-1], (img.shape[1], img.shape[0]))
mask.shape
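
The helper functions above flatten the transposed mask, so pixels are numbered down each column first. That is why `rle2mask` takes `(width, height)` and why the call passes `(img.shape[1], img.shape[0])`. The toy example below, which uses a made-up 2 x 3 mask, shows how the two flattening orders differ and how the reshape-then-transpose step undoes the column-major flattening.

```python
import numpy as np

mask = np.array([[1, 0, 0],
                 [1, 0, 1]], dtype=np.uint8)  # shape (height=2, width=3)

column_major = mask.T.flatten()  # walk down each column, left to right
row_major = mask.flatten()       # walk along each row, top to bottom

print(column_major)  # [1 1 0 0 0 1]
print(row_major)     # [1 0 0 1 0 1]

# Decoding reverses the flattening: reshape to (width, height), then transpose.
restored = column_major.reshape(mask.shape[1], mask.shape[0]).T
print(np.array_equal(restored, mask))  # True
```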

### Plot mask

In [66]:
plt.figure(figsize=(10,10))
plt.imshow(mask, cmap='PuRd', alpha=0.5)

### Overlay image and mask

In [67]:
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.imshow(mask, cmap='PuRd', alpha=0.5)

In [68]:
def plot_mask(image, mask, image_id):
    plt.figure(figsize=(16, 10))
    
    plt.subplot(1, 3, 1)
    plt.imshow(image)
    plt.title(f"Image {image_id}", fontsize=18)
    
    plt.subplot(1, 3, 2)
    plt.imshow(mask, cmap="PuRd", interpolation='none')
    plt.title("Mask", fontsize=18)
    
    plt.subplot(1, 3, 3)
    plt.imshow(image)
    plt.imshow(mask, cmap="PuRd", alpha=0.5)
    plt.title(f"Image {image_id} + mask", fontsize=18)    
    
    
    plt.show()

In [69]:
plot_mask(img, mask, id_)

# Exploratory Data Analysis

## Statistical Description

In [10]:
def EDA(df):
    
    print('\033[1m' +'EXPLORATORY DATA ANALYSIS :'+ '\033[0m\n')
    print('\033[1m' + 'Shape of the data (rows, columns):' + '\033[0m')
    print(df.shape, 
          '\n------------------------------------------------------------------------------------\n')
    
    print('\033[1m' + 'All columns from the dataframe :' + '\033[0m')
    print(df.columns, 
          '\n------------------------------------------------------------------------------------\n')
    
    print('\033[1m' + 'Datatypes and missing values:' + '\033[0m')
    df.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print('\n------------------------------------------------------------------------------------\n')
    
    for col in df.columns:
        print('\033[1m' + 'Unique values in {} :'.format(col) + '\033[0m', len(df[col].unique()))
    print('\n------------------------------------------------------------------------------------\n')
    
    print('\033[1m' + 'Summary statistics for the data :' + '\033[0m')
    print(df.describe(include='all'), 
          '\n------------------------------------------------------------------------------------\n')
    
        
    print('\033[1m' + 'Memory used by the data :' + '\033[0m')
    print(df.memory_usage(), 
          '\n------------------------------------------------------------------------------------\n')
    
    print('\033[1m' + 'Number of duplicate rows :' + '\033[0m')
    print(df.duplicated().sum())
          
EDA(data)

## Data visualization

### Univariate visualization of categorical variables (sex, organ)


In [11]:
# https://www.kaggle.com/code/toomuchsauce/mental-health-plotly-interactive-viz
columns = ['organ','sex']
df = data[columns]

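buttons = []

# One dropdown button per column; each button shows exactly one pie trace
for i, col in enumerate(df.columns):
    vis = [False] * len(df.columns)  # one visibility flag per trace
    vis[i] = True
    buttons.append({'label' : col,
             'method' : 'update',
             'args'   : [{'visible' : vis},
             {'title'  : col}] })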

fig = go.Figure()

for col in df.columns:
    fig.add_trace(go.Pie(
             values = df[col].value_counts(),
             labels = df[col].value_counts().index,
             title = dict(text = 'Distribution of {}'.format(col),
                          font = dict(size=18, family = 'monospace'),
                          ),
             hole = 0.5,
             hoverinfo='label+percent',))

fig.update_traces(hoverinfo='label+percent',
                  textinfo='label+percent',
                  textfont_size=12,
                  opacity = 0.8,
                  showlegend = False,
                  marker = dict(colors = sns.color_palette('PuRd').as_hex(),
                              line=dict(color='#000000', width=1)))
              

fig.update_layout(margin=dict(t=0, b=0, l=0, r=0),
                  updatemenus = [dict(
                        type = 'dropdown',
                        x = 1.15,
                        y = 0.85,
                        showactive = True,
                        active = 0,
                        buttons = buttons)],
                 annotations=[
                             dict(text = "<b>Choose<br>Column</b> : ",
                             showarrow=False,
                             x = 1.06, y = 0.92, yref = "paper", align = "left")])

# Show only the first trace initially; the dropdown toggles the rest
for trace in fig.data[1:]:
    trace.visible = False

fig.show()

### Univariate visualization of non-categorical variables (age)


In [73]:
fig = make_subplots(rows = 1, cols=1)

fig.add_trace(go.Histogram(  # add_trace replaces the deprecated append_trace
                        x = data['age'],
                        nbinsx = 15,
                        #text = ['16', '500', '562', '149', '26', '5', '1'],
                        marker =  dict(color="#EEC8F9")),
                        row=1, col=1)

fig.update_xaxes(        
        title = dict(text = 'Age',
                     font = dict(size = 15,
                                 family = 'monospace')),
        row=1, col=1,
        tickfont = dict(size=15, family = 'monospace', color = 'black'),
        tickmode = 'array',
        #ticktext = ['20-24','25-29', '30-34','35-39', '40-44','45-49', '50-54','55-59', '60-64','65-69', '70-79','80-99'],
        ticklen = 8,
        showline = True,
        showgrid = True,ticks = 'outside')

fig.update_yaxes(
        row=1, col=1,
        title = dict(text = 'Count',
                     font = dict(size = 15,
                                 family = 'monospace')),
    
        tickfont = dict(size=15, family = 'monospace'),
        tickmode = 'array',
        showline = False,
        showgrid = True)

fig.update_traces(
                  marker_line_color='black',
                  marker_line_width = 2,
                  opacity = 0.6,
                  row = 1, col = 1)


fig.update_layout(height=900, width=1200,
                  title = dict(text = 'Univariate visualization of non-categorical variables<br> Age Distribution',
                               x = 0.5,
                               font = dict(size = 16, color ='#9211B5',
                               family = 'monospace')),
                  #plot_bgcolor='#EEC8F9 ',
                  #paper_bgcolor = '#EEC8F9 ',
                  showlegend = False)

fig.show()