## Coding question
### Write a Python function `connected_components`.

Given an input binary array A of size h×w, it finds all connected areas of 1s in the array. For a position A(x,y)=1 in the array, any position A(x±1,y±1)=1 is considered connected. The function should return an array of size h×w where the background is assigned 0 and each connected component is assigned a different integer.

### Solution

Iterate through each element of the input 2D array.

If the element equals 1 and has not been labelled yet, run a DFS from it that assigns the current label to every element of its connected area. The code below treats horizontal and vertical neighbors as connected. Then increase the label by 1 and move on to the next element.

Time complexity: O(h * w). The outer loop visits every element once. If an element equals zero, no DFS is needed. If an element equals one, a DFS visits all the ones connected to it and overwrites each with an integer label other than 1, so no element is processed by more than one DFS. In short, the DFS calls together visit only the elements that equal one. In the worst case, all elements of the input array equal one, and the scan plus the DFS take about 2 * h * w visits.

Space complexity: O(h * w). A new 2D array of the same size as the input is created to hold the labels, and in the worst case (all ones) the DFS stack also grows to O(h * w) entries.
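The worst-case count in the time-complexity argument can be checked on a small all-ones array. The following is a minimal sketch, not the solution itself: `label_and_count` is a hypothetical helper that labels a list-of-lists grid with an explicit-stack DFS (an assumption of this sketch) and counts how many elements the scan and the DFS each visit.

```python
def label_and_count(grid):
    """Label 4-connected regions of 1s and count scan and DFS visits."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    scan_visits = 0  # elements touched by the outer double loop
    dfs_visits = 0   # elements popped from the DFS stack
    next_label = 1
    for i in range(rows):
        for j in range(cols):
            scan_visits += 1
            if grid[i][j] == 1 and labels[i][j] == 0:
                labels[i][j] = next_label
                stack = [(i, j)]
                while stack:
                    x, y = stack.pop()
                    dfs_visits += 1
                    for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < rows and 0 <= ny < cols
                                and grid[nx][ny] == 1 and labels[nx][ny] == 0):
                            # Label on push so each element enters the stack once.
                            labels[nx][ny] = next_label
                            stack.append((nx, ny))
                next_label += 1
    return labels, scan_visits, dfs_visits


h, w = 4, 3
all_ones = [[1] * w for _ in range(h)]
labels, scan_visits, dfs_visits = label_and_count(all_ones)
print(scan_visits, dfs_visits)  # 12 12, i.e. 2 * h * w visits in total
```

Because every element is labelled when it is pushed, it is pushed and popped exactly once, which is why the total stays linear in h * w.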

In [2]:
import numpy as np

def assign_cc(input_array):  # assign a different integer to each connected region of the input array

    # x, y: coordinates of the element where the search starts
    # assign_value: the label written to every element of this connected region
    def dfs(x, y, assign_value):
        grid[x][y] = assign_value  # always label the start element, even if it is isolated
        stack = [(x, y)]  # explicit stack instead of recursion, to stay clear of the recursion limit
        while stack:
            cx, cy = stack.pop()
            for c in [[0, 1], [0, -1], [1, 0], [-1, 0]]:  # look at the four direct neighbours
                nx, ny = cx + c[0], cy + c[1]
                if 0 <= nx < row and 0 <= ny < col and grid[nx][ny] == 1:
                    grid[nx][ny] = assign_value  # label on push so no element is pushed twice
                    stack.append((nx, ny))

    grid = np.array(input_array, dtype=int)  # integer copy of the input array; the input is not modified
    row, col = grid.shape
    assign_value = 2  # labels start at 2 because 1 still means "not visited yet"
    for i in range(row):
        for j in range(col):
            if grid[i][j] == 1:
                dfs(i, j, assign_value)
                assign_value += 1

    # shift the labels from 2, 3, ... down to 1, 2, ...
    # so that a single connected region is assigned the value 1
    grid[grid > 0] -= 1

    return grid

In [10]:
# Test the function
import numpy as np
print('processing as below:') 
h, w = 4, 3
np.random.seed(6)
input_array = np.random.randint(0, 2, (h, w))
print(input_array, '\n\n')
output = assign_cc(input_array)
print(output)
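The question's notation A(x±1,y±1) can be read as including diagonal neighbors, while the solution above connects only horizontal and vertical neighbors. The sketch below shows how much that choice matters. It is illustrative only: `count_components`, `FOUR` and `EIGHT` are hypothetical names, and it works on plain lists so that it runs on its own.

```python
FOUR = ((0, 1), (0, -1), (1, 0), (-1, 0))             # horizontal and vertical
EIGHT = FOUR + ((1, 1), (1, -1), (-1, 1), (-1, -1))   # plus the four diagonals


def count_components(grid, offsets):
    """Count connected regions of 1s for a given neighborhood definition."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] == 1 and not seen[i][j]:
                count += 1
                seen[i][j] = True
                stack = [(i, j)]
                while stack:
                    x, y = stack.pop()
                    for dx, dy in offsets:
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < rows and 0 <= ny < cols
                                and grid[nx][ny] == 1 and not seen[nx][ny]):
                            seen[nx][ny] = True
                            stack.append((nx, ny))
    return count


diagonal = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
print(count_components(diagonal, FOUR))   # 3: no two ones share an edge
print(count_components(diagonal, EIGHT))  # 1: the ones touch at their corners
```

If diagonal neighbors should count, the offsets list in the solution would only need the four diagonal steps added.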

## Data analysis question

You have a network for segmenting cells from microscopy images that performs well on some types of tissue images and worse on others, e.g. at the tissue boundary, where the morphology looks somewhat different. To improve the accuracy, it would be beneficial to include the low-performing images in your training set. Labeling images for segmentation is, however, laborious.
What would you do to speed up or facilitate the image labeling process?

*Say the model is working well on type A cells but not type B cells* 

1. Search online (e.g. Google) for any open-source model or publicly annotated dataset for cells similar to type B.

2. Label a few type B images, say 10 images or 10% of them, then use transfer learning to train a model on these images and labels. Check whether the model can segment the remaining type B images.

3. Try more data augmentation techniques on type A images and labels, especially non-rigid transformations such as GridDistortion and ElasticTransform (see the attached pictures and the references). I assume that these non-rigid transformations are not used in the current network. We can retrain the current model with the non-rigid augmented data and then check its performance on the type B images.

4. I assume that the cells in type B segmentations are mostly empty inside, so the main problem is edge detection. In that case, we can try traditional image processing techniques, such as thresholding, seed filling, and random walk, to generate the edge annotations.

5. Try image registration techniques, rigid or non-rigid, to see whether type A images can be registered to type B images by a specific transformation function. If so, we can fine-tune the current model via transfer learning to run inference on type B images based on the registered images.
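Item 4 can be illustrated with a minimal sketch of thresholding followed by seed filling, producing a draft mask that an annotator would correct rather than draw from scratch. Everything here is an assumption for illustration: `threshold` and `seed_fill` are hypothetical helpers, the toy intensity values and the level of 50 are made up, and a real pipeline would work on actual microscopy images with a tuned threshold.

```python
def threshold(image, level):
    """Binary mask: 1 where the pixel intensity is at least `level`."""
    return [[1 if px >= level else 0 for px in row] for row in image]


def seed_fill(mask, seed):
    """Return the 4-connected region of 1s in `mask` that contains `seed`."""
    rows, cols = len(mask), len(mask[0])
    region = [[0] * cols for _ in range(rows)]
    sx, sy = seed
    if mask[sx][sy] != 1:
        return region  # the seed is on background, so nothing to fill
    region[sx][sy] = 1
    stack = [(sx, sy)]
    while stack:
        x, y = stack.pop()
        for dx, dy in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < rows and 0 <= ny < cols
                    and mask[nx][ny] == 1 and region[nx][ny] == 0):
                region[nx][ny] = 1
                stack.append((nx, ny))
    return region


# Toy image: a bright 2x2 "cell" and a separate bright strip on the right edge.
image = [
    [10, 12, 11, 10, 10],
    [11, 90, 95, 10, 80],
    [10, 92, 91, 12, 85],
    [12, 11, 10, 10, 10],
]
mask = threshold(image, 50)
draft = seed_fill(mask, (1, 1))  # one click inside the cell selects only that cell
for row in draft:
    print(row)
```

The seed point plays the role of a single annotator click, and the bright strip on the right is left out because it is not connected to the seed.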

Medical image augmentation ref: https://camo.githubusercontent.com/37427c37dbc48402071f0b8fbc62bda0b4724969b29dc30212dfe8af512c1f2b/68747470733a2f2f686162726173746f726167652e6f72672f776562742f31692f66692f777a2f31696669777a79306c78657463346e776a7673732d37316e6b77302e6a706567

https://github.com/albumentations-team/albumentations#list-of-augmentations

Bioimage augmentation ref: https://github.com/albumentations-team/albumentations_examples/blob/master/notebooks/showcase.ipynb

