# Implementation assignment 1: Segmentation of cell nuclei

## Course: Fundamentals of Computer Vision
### Students
#### João Vitor Rodrigues - 00243705
#### Pedro Sidra Freitas  - 00262537

## Setup

### Libraries

In [1]:
import cv2 
import pandas as pd
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from itertools import chain
from IPython.display import Image
import random

In [2]:
# Optional: the Qt backend gives zoomable plot windows (we prefer it to %matplotlib widget).
# Note that showMaximized() in the plot helpers below is a Qt window method.
%matplotlib qt

### Load data

Data comes from the git repo through [git lfs](https://git-lfs.github.com/)

In [3]:
images = []

for imagePath in Path("./data").glob("*.png"):
    image = cv2.imread(str(imagePath))
    if image is not None:  # cv2.imread returns None on failure instead of raising
        images.append(image)
    else:
        print(f"Failed reading image {imagePath}")

images = np.stack(images)
N = len(images)
print(images.shape)

(4, 1040, 1408, 3)


In [4]:
maximized=True
def colPlot(images, **kwargs):
    fig, axs = plt.subplots(1,len(images))

    for im, ax in zip(images, axs):
        if im.ndim > 2: # if color
            ax.imshow(im[...,::-1]) # opencv is BGR
        else:
            ax.imshow(im, **kwargs)
    
    if maximized:
        figManager = plt.get_current_fig_manager()
        figManager.window.showMaximized()
    fig.tight_layout()

def multiRowPlot(images, titles, nrows, ncols, **kwargs):
    fig, axs = plt.subplots(nrows, ncols, sharey="col", sharex="col")

    for i, (im, title, ax) in enumerate(zip(images, titles, axs.flatten())):
        ax.set_title(title)
        ax.imshow(im, **kwargs)
        ax.axis("off")


    if maximized:
        figManager = plt.get_current_fig_manager()
        figManager.window.showMaximized()
    fig.tight_layout()


def imageTitles(pattern):
    return [pattern.format(i=i) for i in range(1, N+1)]

## Part 1

First we analyse the channels of the image and pick the best way to "grayscale" it.

The red channel highlights the cell nuclei, while the G and B channels (which are equivalent) hold the grayscale, monochromatic image from the microscope.

### Plot all channels (RGB)

In [5]:
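# colPlot already converts BGR to RGB for display, so pass the images unchanged
# (images[:,:,::-1] would reverse the width axis and mirror each image)
colPlot(images)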

### Plot red channel and R-B for comparison

* $I_R-I_B$ clearly shows the nuclei with high contrast and no unwanted features.
* $I_R$ shows more detail for other parts of the cell, but that detail introduces unwanted features whose contrast with the nuclei is not very high.

**Use $I_f=I_R-I_B$ for segmentation**
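
One caveat when computing $I_R-I_B$ with NumPy on `uint8` arrays is that the subtraction wraps around modulo 256 wherever $I_R < I_B$. The sketch below illustrates this on made-up pixel values (not data from the images). `saturating_diff` is a hypothetical helper, and whether clamping to 0 is the desired behaviour here is an assumption.

```python
import numpy as np

def saturating_diff(a, b):
    """Return a - b for uint8 arrays, clamping negative results to 0."""
    # Widen to int16 so that negative differences are representable
    diff = a.astype(np.int16) - b.astype(np.int16)
    return np.clip(diff, 0, 255).astype(np.uint8)

# Made-up example values: the last pixel has R < B
r = np.array([200, 50, 10], dtype=np.uint8)
b = np.array([20, 50, 12], dtype=np.uint8)

wrapped = r - b                  # plain uint8 arithmetic wraps around
clamped = saturating_diff(r, b)  # negative differences become 0
```

If the red channel is never darker than the blue one, both versions give the same result.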

In [6]:
# the R channel
R = images[...,2] 

# the R-B difference image
# NOTE: both operands are uint8, so the result wraps modulo 256 wherever R < B
diffRB = images[...,2] - images[...,0]

# For plotting
imPlots = chain(R, diffRB)
imTitles = chain(imageTitles("$I_{{ {i}R }}$"),
        imageTitles("$I_{{ {i}R }} - I_{{ {i}B }}$"))


In [7]:
multiRowPlot(imPlots, imTitles, nrows=2, ncols=N, cmap="gray")

Comparison between: **Ir** and **Ir-Ib**

<img src="data/analysis/Ir___diff_Ir_Ib.png" width=1000 height=800 align="center"/> 

Zoomed comparison between: **Ir** and **Ir-Ib**

<img src="data/analysis/Ir___diff_Ir_Ib_Zoomed.png" width=1000 height=800 align="center" />

In [8]:
for a,b in zip(images,diffRB):
    fig, axs = plt.subplots(1,2,sharex=True,sharey=True)
    axs[0].imshow(a[...,0], cmap="gray")
    axs[1].imshow(b)

Zoomed Cmap from **Ir-Ib**

<img src="data/analysis/diff_Ir_Ib_Zoomed_cmap.png" width=1000 height=700 align="center" />

### Histograms

We can also see in a log-histogram that $I_R-I_B$ has a more distinct peak, meaning higher contrast between background and features.


In [9]:

def histColPlot(images: np.ndarray, hist_args: dict):
    fig, axs = plt.subplots(2,len(images))
    for i, (im, ax) in enumerate(zip(images, axs[0])):
        ax.set_title(f"Image {i+1}")
        ax.imshow(im, cmap="gray")
        ax.axis("off")

    for i, (im, ax) in enumerate(zip(images, axs[1])):
        ax.set_title(f"Image {i+1} hist")
        ax.hist(im.flatten(), **hist_args)

    fig.tight_layout()

    return fig, axs

In [10]:
histColPlot(R, 
            hist_args=dict(bins=255,log=True))

(<Figure size 640x480 with 8 Axes>,
 array([[<AxesSubplot:title={'center':'Image 1'}>,
         <AxesSubplot:title={'center':'Image 2'}>,
         <AxesSubplot:title={'center':'Image 3'}>,
         <AxesSubplot:title={'center':'Image 4'}>],
        [<AxesSubplot:title={'center':'Image 1 hist'}>,
         <AxesSubplot:title={'center':'Image 2 hist'}>,
         <AxesSubplot:title={'center':'Image 3 hist'}>,
         <AxesSubplot:title={'center':'Image 4 hist'}>]], dtype=object))

In [11]:
histColPlot(diffRB, 
            hist_args=dict(bins=255,log=True))

(<Figure size 640x480 with 8 Axes>,
 array([[<AxesSubplot:title={'center':'Image 1'}>,
         <AxesSubplot:title={'center':'Image 2'}>,
         <AxesSubplot:title={'center':'Image 3'}>,
         <AxesSubplot:title={'center':'Image 4'}>],
        [<AxesSubplot:title={'center':'Image 1 hist'}>,
         <AxesSubplot:title={'center':'Image 2 hist'}>,
         <AxesSubplot:title={'center':'Image 3 hist'}>,
         <AxesSubplot:title={'center':'Image 4 hist'}>]], dtype=object))

Histogram comparison: **Red Channel** and **Ir-Ib**


<img src="data/analysis/Hist_Red.png" width=1200 height=900 align="center" />

<img src="data/analysis/Hist_diff_Ir_Ib.png" width=1200 height=900 align="center" />

### Determine $I_f$

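$I_f$ is the grayscale image we'll use for segmentation. We apply a linear stretch to the $I_R-I_B$ image so that its minimum maps to 0 and its maximum maps to 255, with both extremes taken over all images: $I_f = 255\,\frac{(I_R-I_B)-\min(I_R-I_B)}{\max(I_R-I_B)-\min(I_R-I_B)}$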
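
As a minimal sketch of this linear stretch, the function below maps the global minimum and maximum of a stack of images onto 0 and 255. The name `stretch_to_uint8`, the handling of a constant input, and the tiny input array are assumptions made for illustration.

```python
import numpy as np

def stretch_to_uint8(stack):
    """Linearly map the global [min, max] of `stack` onto [0, 255]."""
    lo = float(stack.min())
    hi = float(stack.max())
    if hi == lo:
        # A constant input has no range to stretch; return zeros
        return np.zeros_like(stack, dtype=np.uint8)
    scaled = (stack.astype(np.float64) - lo) * 255.0 / (hi - lo)
    # astype truncates the fractional part, as in the notebook cell below
    return scaled.astype(np.uint8)

# Two tiny made-up "images" stacked along the first axis
stack = np.array([[[10, 20]], [[30, 50]]], dtype=np.uint8)
stretched = stretch_to_uint8(stack)
```

Because the extremes are global, only the darkest pixel of the whole stack maps to 0 and only the brightest maps to 255, so the images stay comparable with each other.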

In [12]:
useChannel = diffRB

# np.min and np.max are calculated over ALL images
# This means e.g. we don't stretch image 1 more than image 2
minval = np.min(useChannel)
If = (useChannel - minval) * 255.0 / (np.max(useChannel)-minval)

If = If.astype(np.uint8)

In [13]:
histColPlot(If, hist_args=dict(bins=255, log=True))
plt.suptitle("$I_f$ and Histograms")

Text(0.5, 0.98, '$I_f$ and Histograms')

**If** stretched histogram

<img src="data/analysis/Hist_If.png" width=1200 height=900 align="center" />

## Part 2 - Thresholding

We apply two thresholding techniques to segment the cell nuclei using $I_f$.

1. "Handmade" threshold: from the histograms, we choose $T$ such that it isolates the background
2. Otsu's technique: we use the OpenCV implementation of Otsu's thresholding to determine $T$

We then extract the nuclei count and areas using `cv2.connectedComponents`, for result analysis.
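
To make the second technique more concrete, here is a NumPy sketch of the idea behind Otsu's method: try every threshold and keep the one that maximizes the between-class variance. This is an illustrative re-implementation, not OpenCV's code. The function name and the test image are made up, and pixels strictly above the threshold are treated as foreground.

```python
import numpy as np

def otsu_threshold(image):
    """Return the threshold t (0..255) maximizing the between-class variance."""
    hist = np.bincount(image.ravel(), minlength=256)
    levels = np.arange(256)
    w0 = np.cumsum(hist)              # number of pixels with value <= t
    w1 = image.size - w0              # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)   # intensity sum of the lower class
    sum1 = sum0[-1] - sum0            # intensity sum of the upper class
    valid = (w0 > 0) & (w1 > 0)       # both classes must be non-empty
    between = np.zeros(256)
    mean0 = sum0[valid] / w0[valid]
    mean1 = sum1[valid] / w1[valid]
    # Proportional to the between-class variance (constant factor dropped)
    between[valid] = w0[valid] * w1[valid] * (mean0 - mean1) ** 2
    return int(np.argmax(between))

# Made-up image: dark background with a small bright patch
img = np.full((10, 10), 10, dtype=np.uint8)
img[:2, :3] = 200
t = otsu_threshold(img)
foreground = img > t
```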


### Threshold: Manual

In [14]:
 
# for image in diffRB:
#     data = image.flatten()
    
#     count, bins_count = np.histogram(data, bins='auto')

#     # finding the PDF of the histogram using count values
#     pdf = count / sum(count)
    
#     # using numpy np.cumsum to calculate the CDF
#     # We can also find using the PDF values by looping and adding
#     cdf = np.cumsum(pdf)
    
    
    
#     Thresh_values = np.arange (5,40,5)
#     no_of_colors= len(Thresh_values)
#     colorT=["#00"+''.join([random.choice('0123456789ABCDEF') for i in range(4)])
#        for j in range(no_of_colors)]
    
#     # plotting PDF and CDF
#     fig, axs = plt.subplots(1,1,sharex=True,sharey=True)
#     plt.plot( pdf, color="green", label="PDF")
#     plt.plot(bins_count[1:], cdf, label="CDF")
#     plt.xticks(np.arange(0,255,5))
#     for i,T in enumerate(Thresh_values):
#         plt.axvline(T, color='r', label=f"{T=}")
#     plt.legend()



In [15]:
# From the histograms, this seems like a good value for threshold
T = 30

In [16]:
fig, axs = histColPlot(If, hist_args=dict(bins=255, log=True))

for hist in axs[1]:
    hist.axvline(T, color="r", label=f"{T=}")
    hist.legend()
plt.suptitle("$I_f$ histograms and chosen Threshold")

Text(0.5, 0.98, '$I_f$ histograms and chosen Threshold')

<img src="data/analysis/Hist_If_Thresh30.png" width=1200 height=900 align="center" />

In [17]:
threshManual = 255* ( If > T )
threshManual=threshManual.astype(np.uint8)

In [18]:
imPlots = chain(If, threshManual)
imTitles = chain(imageTitles("$I_f{{ {i} }}$"), imageTitles("$T_{i}$"))

multiRowPlot(imPlots, imTitles, nrows=2, ncols=N, cmap="gray")
plt.show()

Binary image Zoomed

<img src="data/analysis/If_Thresh_30_Zoomed.png" width=1200 height=900 align="center"/>

### Threshold: Otsu's

In [19]:
threshOtsu = []
threshOtsuVals = []
for If_i in If:
    # With THRESH_OTSU the threshold argument (127 here) is ignored; Otsu's value is returned in ret
    ret, otsuThresh_i =  cv2.threshold(If_i, 127, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY)
    threshOtsu.append(otsuThresh_i)
    threshOtsuVals.append(ret)

threshOtsu=np.stack(threshOtsu, axis=0)

In [20]:
imPlots = chain(If, threshManual, threshOtsu)
imTitles = chain(imageTitles("$I_f{{ {i} }}$"), 
    (
      title + f",{T=}" 
      for title in imageTitles("$T_{i} Manual$")
    ),
    ( title + f", {T=}"
      for T, title 
      in zip(threshOtsuVals,imageTitles("$T_{i}$ Otsu"))
    ))

multiRowPlot(imPlots, imTitles, nrows=3, ncols=N, cmap="gray")


Zoomed comparison between **Manual Threshold** and **Otsu's**


<img src="data/analysis/Thresh_manual-30_Otsu_Zoomed.png" width=1200 height=900 align="center" />

### Evaluation

`threshManual` looks more consistent overall; `threshOtsu` chooses a value that is slightly too high and leaves out part of some cells.

This makes sense because Otsu's method works best on a clearly **bimodal distribution**, but here the foreground covers only a very small fraction of the image.

We can see this by calculating the percentage of pixels labeled as foreground by manual thresholding:

In [21]:
print("Number of foreground pixels (by image):")
print(np.count_nonzero(threshManual, axis=(1,2)))
print("Percent of foreground pixels (by image):")
percents = np.round(np.count_nonzero(threshManual, axis=(1,2))/threshManual[0].size * 100, 2)
print(", ".join([f"{p}%" for p in percents]))

Number of foreground pixels (by image):
[ 8772  6038 14083  8174]
Percent of foreground pixels (by image):
0.6%, 0.41%, 0.96%, 0.56%


In [22]:
# We picked this threshold:
thresh = threshManual

### Get connected areas

In [23]:
conn = 8
areas = []
for image in thresh:
    n_labels, labels =  cv2.connectedComponents(image, None, conn)

    # count number of pixels in each connected component
    # print(np.array([labels==i for i in range(n_labels)]).shape)
    area = np.count_nonzero([labels==i for i in range(n_labels)], axis=(1,2))

    # First item is background label
    area = area[1:]

    areas.append(area)

* There are a lot of areas with only 1 or 2 pixels. 
    These are likely noise.

    We'll calculate stats including them first, and then again after filtering out these very small areas.
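
The area computation and the small-area filter described above can be sketched with `np.bincount`, which counts the pixels of every label in a single pass. The sketch assumes a label image in which 0 is the background, as returned by `cv2.connectedComponents`. `component_areas`, the example label image and the cutoff value are made up for illustration.

```python
import numpy as np

def component_areas(labels, min_area=0):
    """Pixel count per labelled component, ignoring background label 0."""
    counts = np.bincount(labels.ravel())
    areas = counts[1:]                 # drop the background count
    return areas[areas > min_area]     # keep only areas strictly above the cutoff

# Made-up label image: component 1 has 4 pixels, component 2 has 1 pixel
labels = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 2]], dtype=np.int32)

all_areas = component_areas(labels)
big_areas = component_areas(labels, min_area=2)
```

Compared with building one boolean mask per label, this avoids allocating an array of size `n_labels * height * width`.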

In [24]:
for i, area in enumerate(areas):
    print(f"Image {i+1} areas:")
    print(area)

Image 1 areas:
[   1  641    1    1    1    1   52    1   12    1   15   12    1    7
    1    1  624    1    1  509    1    1  534    1  660    1    1  739
  698  855    1  672    1    1  591    1 1284    1    1  803    1    1
    1    1    1    2    1    1    1   11    1    1    2    1    5    3
    1    1    2    1    1    1]
Image 2 areas:
[   1    1  604    1    1    1    1    1    1  698    1    1    1    1
    1  671    1    1   76    1    1    3    1    8    3    1    1    1
  482    3    2    1    1    5   10    1    2    3  137    1    1    1
   13    2    1    2    1    1   11    1    1    1    1  739    1    1
    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  381    1    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1  376    1    1    1    1    1 1008    1    1    1    1
    1    1    1    1    1    1    1   34  556    1    1    2    1    1
    1   56    1    1    1    1    1    1    1    1    1    1    1    1
    1    1    1

### Stats

**Original** 
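* There are a lot of 1-pixel areas: at least 25% of the components in every image, and more than 50% in images 1, 2 and 4.
* Probably because of this, the mean areas are small; image 2, which has the most components (148), has by far the smallest mean (40.80 pixels).
* The biggest nucleus is in image 3 (1879 pixels), but image 4 has the highest mean area (302.74 pixels). The medians are only 1 or 2 pixels, so they describe the noise rather than the nuclei.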

**With minimum area**
* Now the mean size of images is more similar, with images 3 and 4 having the bigger cells
* Standard deviation is very high, so cells are varying a lot in size.
* We see this in the quartiles: the first quartile is below 50 pixels in images 1 and 2, whereas image 4 has no area below 316 pixels.

**Comparison with manual count**
* The manual count is closer to the counts obtained with the minimum area.
* Either way, these counts differ from the manual count, mainly because nuclei that are close together are counted as a single one.

In [25]:
def print_stats(areas):
    dfs = [pd.DataFrame(data={f"image{i+1}":area}) for i, area in enumerate(areas)]
    print(pd.concat([df.describe().transpose().round(2) for df in dfs]))

def get_areas_and_print_stats(images):
    areas = []
    for i, image in enumerate(images):
        n_labels, labels =  cv2.connectedComponents(image, None, conn)
        # count number of pixels in each connected component
        # print(np.array([labels==i for i in range(n_labels)]).shape)
        area = np.count_nonzero([labels==i for i in range(n_labels)], axis=(1,2))
        # First item is background label
        area = area[1:]
        areas.append(area)
    
    print_stats(areas)

In [26]:
print("====== ALL")
print_stats(areas)

min_area=5
print(f"====== Areas > {min_area=}")
print_stats([a[a>min_area] for a in areas])

        count    mean     std  min  25%  50%     75%     max
image1   62.0  141.48  297.56  1.0  1.0  1.0   11.75  1284.0
image2  148.0   40.80  153.78  1.0  1.0  1.0    1.00  1008.0
image3   68.0  207.10  384.54  1.0  1.0  2.0  182.50  1879.0
image4   27.0  302.74  365.17  1.0  1.0  1.0  661.00  1026.0
        count    mean     std    min     25%    50%     75%     max
image1   18.0  484.39  377.20    7.0   24.25  607.5  691.50  1284.0
image2   18.0  327.22  324.97    8.0   31.00  256.5  592.00  1008.0
image3   24.0  584.25  448.75    6.0  174.75  669.0  784.50  1879.0
image4   12.0  679.83  189.62  316.0  534.50  685.0  768.75  1026.0


## Watershed

### Gradient

In [27]:
gradients=[]
for If_i in If:
    ddepth = cv2.CV_32F
    
    dx = cv2.Sobel(If_i, ddepth, 1, 0)
    dy = cv2.Sobel(If_i, ddepth, 0, 1)

    gradients.append(np.sqrt(dx**2+dy**2))
gradients=np.stack(gradients,axis=0)

In [28]:
multiRowPlot(
    chain(If, gradients),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$ | \\nabla I_{{ f{i} }} | $")),
    2,
    N
)

Gradients from **If**

<img src="data/analysis/Gradients_If_ThreshManual.png" width=1400 height=1000 align="center" />

### Markers

In [29]:
kernel = np.ones((3,3))
markers = [cv2.morphologyEx(m, cv2.MORPH_CLOSE, kernel) for m in thresh]
markers = [cv2.erode(m,  kernel) for m in markers]
markers = np.stack(markers, axis=0)

# kernel = np.ones((35,35))
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(35,35))
bgs = [~cv2.dilate(m,  kernel, iterations=3) for m in thresh]
bgs = np.stack(bgs, axis=0)

In [30]:
plt_images = chain(thresh, markers, bgs)
plt_titles = chain(imageTitles("thresh"), imageTitles("marker"), imageTitles("background"))
multiRowPlot(plt_images , plt_titles , 3, N)

Markers and Backgrounds

<img src="data/analysis/Marker_and_Background.png" width=1600 height=1200 align="center"/>

In [31]:
gradients = gradients*255/gradients.max()

In [32]:
from skimage.segmentation import watershed

watersheds=[]
for m, bg, grad in zip(markers, bgs, gradients):
    conn = 8
    n_labels, labels =  cv2.connectedComponents(m, None, conn)

    labels[bg>0] = labels.max() + 1
    # plt.imshow(labels * 255 / labels.max())
    out = watershed(grad, labels.astype(np.int32))
    out[out==out.max()] = 0
    watersheds.append(out)
watersheds=np.stack(watersheds,axis=0)


In [33]:
multiRowPlot(
    chain(If, markers, watersheds>0),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$marker_{{ B{i} }}$"), imageTitles("$watershed_{{ {i} }}$")),
    3,
    N
)

Zoomed **Watershed**

<img src="data/analysis/Watershed_Zoomed.png" width=1600 height=1200 align="center"/>

In [34]:
print("====== ALL")
print_stats(areas)

min_area=5
print(f"====== Areas > {min_area=}")
print_stats([a[a>min_area] for a in areas])
print(f"====== watershed")
get_areas_and_print_stats(255*(watersheds>0).astype(np.uint8))

        count    mean     std  min  25%  50%     75%     max
image1   62.0  141.48  297.56  1.0  1.0  1.0   11.75  1284.0
image2  148.0   40.80  153.78  1.0  1.0  1.0    1.00  1008.0
image3   68.0  207.10  384.54  1.0  1.0  2.0  182.50  1879.0
image4   27.0  302.74  365.17  1.0  1.0  1.0  661.00  1026.0
        count    mean     std    min     25%    50%     75%     max
image1   18.0  484.39  377.20    7.0   24.25  607.5  691.50  1284.0
image2   18.0  327.22  324.97    8.0   31.00  256.5  592.00  1008.0
image3   24.0  584.25  448.75    6.0  174.75  669.0  784.50  1879.0
image4   12.0  679.83  189.62  316.0  534.50  685.0  768.75  1026.0
        count    mean     std    min     25%    50%    75%     max
image1   17.0  454.29  326.54    5.0   28.00  574.0  601.0  1135.0
image2   15.0  376.33  284.48    4.0   57.00  401.0  590.0   933.0
image3   21.0  672.71  331.72   17.0  512.00  654.0  741.0  1723.0
image4   12.0  659.67  140.01  438.0  565.75  646.5  705.0   969.0


## Improvement 1: different filters / morph ops

(Draft: just an experiment with median blur and morphological operations.)

In [42]:
filtered=[]
for If_i in If:
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(5,5))
    ddepth = cv2.CV_32F
    
    f = If_i
    # f = cv2.medianBlur(f, ksize=7)

    f = 255*(f > T)

    f = cv2.morphologyEx(f.astype(np.uint8), cv2.MORPH_CLOSE, kernel)
    f = cv2.morphologyEx(f.astype(np.uint8), cv2.MORPH_OPEN, kernel)
    # f = cv2.dilate(f.astype(np.uint8),  kernel, iterations=2)
    filtered.append(f)
    
multiRowPlot(
    chain(If, thresh, filtered),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$ Thresh I_{{ f{i} }} $"), imageTitles("$ Filter I_{{ f{i} }} $")),
    3,
    N
)

Zoomed **Watershed** with different filters

<img src="data/analysis/Watershed_Zoomed_Imp1.png" width=1600 height=1200 align="center"/>

## Improvement 2: K-Means segmentation

With two clusters on a single intensity channel, K-Means is effectively just a fancier way of picking a threshold, so it doesn't help much.
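
The following sketch shows why two-cluster K-Means on one intensity channel behaves like a threshold: each sample goes to the nearest center, so the decision boundary is the midpoint between the two centers. It is a plain NumPy version of Lloyd's algorithm with a deterministic initialization, unlike `cv2.KMEANS_RANDOM_CENTERS`. The function name and the sample intensities are made up.

```python
import numpy as np

def two_means_threshold(values, n_iter=20):
    """Lloyd's algorithm for k=2 on 1-D data; returns (centers, threshold)."""
    values = np.asarray(values, dtype=np.float64)
    centers = np.array([values.min(), values.max()])   # deterministic init
    for _ in range(n_iter):
        # Assign each sample to the nearest center
        nearest = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = centers.copy()
        for k in range(2):
            if np.any(nearest == k):
                new_centers[k] = values[nearest == k].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # The decision boundary between two 1-D centers is their midpoint
    return centers, centers.mean()

values = [0, 1, 2, 100, 101, 102]   # made-up intensities
centers, threshold = two_means_threshold(values)
```

For an image, `values` would be the flattened pixel intensities, and comparing each pixel against `threshold` reproduces the cluster assignment.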

In [48]:
kmeans=[]
for If_i in If:
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    # Set flags (Just to avoid line break in the code)
    flags = cv2.KMEANS_RANDOM_CENTERS
    # Apply KMeans
    compactness,labels,centers = cv2.kmeans(
            If_i.ravel().astype(np.float32),
            2,
            None,
            criteria,
            10,
            flags
        )

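    # Cluster indices are arbitrary, so treat the cluster with the brighter center as foreground
    fg_label = int(np.argmax(centers))
    km = (labels.reshape(If_i.shape) == fg_label).astype(np.uint8)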

    kmeans.append(km)

In [49]:
multiRowPlot(
    chain(If, kmeans, thresh),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$ KMeans ~I_{{ f{i} }} $"), imageTitles("$ Thresh~I_{{ f{i} }} $")),
    3,
    N
    )


Zoomed **Watershed**

<img src="data/analysis/Watershed_Zoomed_Imp2.png" width=1600 height=1200 align="center"/>

## Improvement 3: Filter image before thresholding

This variant gives better results. The changes are:

* Filter $I_f$ with a median filter, since it seems to contain a good amount of isolated single-pixel ("shot") noise
* Lower the threshold (from 30 to 25), because the filtered image has much less noise
* Skip the morphology operations on the thresholded binary image $I_b$ and simply use $I_b$ itself as the markers for the watershed
* Use a "Sure-Background" marker for the watersheds, obtained from a large dilation of the union of foreground markers
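
To illustrate the first step, here is a small NumPy sketch of a 3x3 median filter removing an isolated bright pixel while keeping most of a larger blob. It stands in for `cv2.medianBlur(..., ksize=3)` for illustration only. The replicated border handling is an assumption, and the test image is made up.

```python
import numpy as np

def median3x3(image):
    """3x3 median filter with replicated borders."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    # Collect the 9 shifted views that make up each pixel's neighbourhood
    windows = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0).astype(image.dtype)

# Made-up image: flat background with one isolated bright pixel and a 3x3 blob
image = np.zeros((8, 8), dtype=np.uint8)
image[1, 1] = 255          # single-pixel noise
image[4:7, 4:7] = 200      # small "nucleus"

filtered = median3x3(image)
```

The isolated pixel disappears because it is outvoted by its eight dark neighbours. The blob keeps its center and edge pixels but loses its corners, which is the usual rounding effect of a median filter.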

In [50]:
gradients=[]
filts=[]
for If_i in If:
    ddepth = cv2.CV_32F
    
    filt=If_i
    # filt = cv2.blur(filt, ksize=(3,3))
    filt = cv2.medianBlur(filt, ksize=3)
    filts.append(filt)
    dx = cv2.Sobel(filt, ddepth, 1, 0)
    dy = cv2.Sobel(filt, ddepth, 0, 1)

    gradients.append(np.sqrt(dx**2+dy**2))

gradients=np.stack(gradients,axis=0)
filts=np.stack(filts,axis=0)

T = 25

thresh = 255*(filts > T).astype(np.uint8)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(35,35))
bgs = [~cv2.dilate(m,  kernel, iterations=3) for m in thresh]
bgs = np.stack(bgs, axis=0)

Plot filtered image, threshold, markers and background

In [51]:
multiRowPlot(
    chain(If, filts, gradients),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$ Filtered ~I_{{ f{i} }} $"), imageTitles("$ Gradient~I_{{ f{i} }} $")),
    3,
    N
    )

multiRowPlot(
    chain(If, thresh, bgs),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$ thrsh ~I_{{ f{i} }} $"), imageTitles("$ bgs~I_{{ f{i} }} $")),
    3,
    N
    )


Gradients from improvement 3

<img src="data/analysis/Gradients_Imp3.png" width=1400 height=1000 align="center"/>

Background image


<img src="data/analysis/Background_Imp3.png" width=1400 height=1000 align= "center"/>

#### Calc watershed with filtered images

In [52]:
watersheds=[]
for t, bg, grad in zip(thresh, bgs, gradients):
    conn = 8

    n_labels, labels =  cv2.connectedComponents(t, None, conn)

    labels[bg>0] = labels.max() + 1
    # plt.imshow(labels * 255 / labels.max())
    out = watershed(grad, labels.astype(np.int32))
    out[out==out.max()] = 0
    watersheds.append(out)
watersheds=np.stack(watersheds,axis=0)


Result: image 1 has two new nuclei that weren't detected before.

In [53]:
multiRowPlot(
    chain(If, thresh, watersheds>0),
    chain(imageTitles("$ I_{{ f{i} }} $"), imageTitles("$marker_{{ B{i} }}$"), imageTitles("$watershed_{{ {i} }}$")),
    3,
    N
)

Zoomed Watershed Improvement 3


<img src="data/analysis/Watershed_Zoomed_Imp3.png" width=1400 height=1000  align="center"/>

Comparison between **Original Watershed** and **Improved Watershed**


<img src="data/analysis/Watershed_Comp_Orig.png" width=1000 height=800 align="center"/>

<img src="data/analysis/Watershed_Comp_Imp.png" width=1000 height=800 align="center"/>