# Integration of findblots, parameter filtering and DBSCAN to mark clusters on cluster images

## Background
The main purpose of this tool is to improve cluster marking. It applies percentile thresholds on SNR, relative fluorescence intensity (RFL) and size to remove the markings on fake clusters, and then uses a clustering method, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), to remove the repeat markings that findblots initially places on the same cluster. This analysis tool is written in Python.

The SNR, RFL and size filtering thresholds are all based on percentile removal. Because imaging conditions vary, the appropriate threshold percentiles can differ between images. However, the two DBSCAN parameters, eps and min_samples, should be very similar (or the same) for different images. The principle of DBSCAN clustering is explained in these links: **[document link](https://towardsdatascience.com/dbscan-algorithm-complete-guide-and-application-with-python-scikit-learn-d690cbae4c5d)** and **[youtube link](https://www.youtube.com/watch?v=6jl9KkmgDIw&t=10s)**. Based on the initial evaluation, eps = 3.5 and min_samples = 2 work well for most images.
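
To make the role of eps and min_samples concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation. It relies on the fact that with min_samples = 2 (counting the point itself, as scikit-learn does) every point with at least one other point within eps is a core point, so the clusters are the groups of points linked by chains of distances no larger than eps, and isolated points get the label -1. The function name `group_repeat_marks` and the coordinates are made up for illustration.

```python
import math

def group_repeat_marks(points, eps=3.5):
    """Label points the way DBSCAN does when min_samples = 2.

    Points linked by a chain of distances <= eps share a label (0, 1, ...);
    points with no neighbor within eps get -1 (noise).
    """
    n = len(points)
    parent = list(range(n))          # union-find forest, one tree per group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    has_neighbor = [False] * n
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                has_neighbor[i] = has_neighbor[j] = True
                parent[find(i)] = find(j)   # merge the two groups

    labels, next_label, root_to_label = [], 0, {}
    for i in range(n):
        if not has_neighbor[i]:
            labels.append(-1)
            continue
        root = find(i)
        if root not in root_to_label:
            root_to_label[root] = next_label
            next_label += 1
        labels.append(root_to_label[root])
    return labels

# hypothetical findblots positions: two marks on one cluster, one isolated mark,
# then a chain of three marks on another cluster
marks = [(10.0, 10.0), (12.0, 11.0), (50.0, 50.0),
         (80.0, 20.0), (82.5, 20.0), (85.0, 21.0)]
labels = group_repeat_marks(marks)
print(labels)  # [0, 0, -1, 1, 1, 1]
```

Note that the first and last marks of the chain are more than 3.5 pixels apart but still share a label, because the middle mark links them. This chaining is why a large eps can merge neighboring real clusters.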

## Workflow
The overall workflow is (1) import the Python packages, (2) run the findblots analysis, and (3) analyze the images using either the specific-image scripts or the folder scripts. To run findblots, you can use the findblots GUI; a copy is in this folder (**SequLiteNAS\SequLite Storage\Chemistry\Bioinformatics_KW\Cluster_density\**). Once you are in this folder, double-click **ClusterAnalysisTool.exe**. To be consistent with the Engineering team, analyze the images using the default parameters, except change the **SNR from 1 to 2.5**. Alternatively, you can run findblots using the scripts (step 2) in this notebook.


## (1) Import the Python packages for the analysis

If you have not installed these packages, you will need to install them in your Python environment, either from the Anaconda environment page or from a terminal. Run the following cell to import the Python packages.

In [1]:
#Import package for data analysis 
import sys, os, imageio, csv, glob, re, random, subprocess
import numpy as np
import pandas as pd
from skimage import io as skimageio
from skimage.color import gray2rgb
from statistics import mean
from pathlib import Path
from plotnine import *
from sklearn.cluster import DBSCAN
import warnings; warnings.simplefilter('ignore')
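
If the import cell fails with a ModuleNotFoundError, a quick check like the following sketch can list what is missing before you install anything. The helper name `missing_packages` is hypothetical; the install names on the right are the usual pip/conda package names for the modules imported above.

```python
import importlib.util

# import name -> name to install; list taken from the import cell of this notebook
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "skimage": "scikit-image",
    "sklearn": "scikit-learn",
    "plotnine": "plotnine",
    "imageio": "imageio",
}

def missing_packages(required=REQUIRED):
    """Return the install names of packages that cannot be found in this environment."""
    return [install for module, install in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
print("Missing packages:", missing if missing else "none")
```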

## (2) Run Findblots analysis

You can run findblots either with the GUI or with the scripts here. There are two scripts: one analyzes all images in a folder and the other analyzes a specific image.

### (2a) Analyze the images in a folder

Set **findblot_path** to the folder where the FindBlobs-static.exe file is located and **file_path** to the folder where the images are located. The FindBlobs options are described below:

-C/--useCorners      	join corner connected blobs: [0]
<br>
-f/--filter          	use Fourier space filtering: [[0, 1]]
<br>
-G/--tileGeom        	use tiles with geometry "width height" and interpolation (<1 - piecewise, >=1 - linear): [[50, 0, 0]]
<br>
-I/--pixInt          	type of subpixel interpolation: [0]
<br>
&emsp;0 - center mass of the blob,
<br>
&emsp;1 - correction for subtracted bkgnd over 3x3 area
<br>
&emsp;2 - correction for subtracted bkgnd over 5x5 area
<br>
&emsp;3 - peak position of Gaussian fit over 3x3 area
<br>
&emsp;4 - peak position of spline fit over 3x3 area
<br>
-K/--saveMask        	save binary mask: [0]
<br>
-N/--imageNum        	image number in a stack: [0]
<br>
-p/--pixelLimits     	set blob pixel limits "min max": [[-1, 0]]
<br>
-P/--savePNG         	save view in PNG (or TIFF for Windows) file: [1]
<br>
-Q/--imageQuality    	save image QC CSV file with SNR threshold: [2.5]
<br>
-r/--resolution      	blob resolution in pixels: [0]
<br>
-t/--thresh          	threshold divider: [0.75]
    
You can change the findblots options by changing the inputs in the following cell. **Make sure each input is quoted, e.g. Q = '0', not Q = 0.**

In [2]:
#input the path for data analysis
findblot_path = "C:/Users/kelvin/Documents/Bioinformatics_analysis/"
file_path = "D:/CG_results/Example5/"

#input parameter for the findblot analysis. Make sure to have the quotation like Q = '0' not Q = 0
C = '0'
f = '0, 1'
G = '50, 50, 1'
I = '0'
K = '0'
N = '0'
P = '1'
S = '1'
Q = '2.5'
r = '0'
t = '0.75'

In [3]:
#findblot script - no need to change anything here
excutablefile = os.path.join(findblot_path, "FindBlobs-static.exe")
for filename in os.listdir(file_path):
    if "img" not in filename and "tif" in filename and "Focused" not in filename:
        full_path = os.path.join(file_path, filename)
        
        cmd = [excutablefile,
               '-C', C,
               '-f', f,
               '-G', G,
               '-I', I,
               '-K', K,
               '-N', N,
               '-P', P,
               '-S', S,
               '-Q', Q, 
               '-r', r, 
               '-t', t, full_path]
        subprocess.run(cmd)
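
The reason every option must be quoted is that `subprocess.run` expects each item of the command list to be text and raises a TypeError when it meets a number. The sketch below builds the same command list from a dictionary and checks the values up front. The helper `build_findblobs_cmd` is hypothetical and is not part of findblots; the flags and values are the ones from the parameter cell above.

```python
def build_findblobs_cmd(executable, image, options):
    """Build the argument list for subprocess.run from a {flag: value} dict.

    Every value must already be a string, because subprocess passes each
    list item to the program as one text argument.
    """
    cmd = [executable]
    for flag, value in options.items():
        if not isinstance(value, str):
            raise TypeError(
                f"option -{flag} must be a string such as '{value}', "
                f"got {type(value).__name__}")
        cmd += ["-" + flag, value]
    cmd.append(image)
    return cmd

# same values as the parameter cell above
options = {"C": "0", "f": "0, 1", "G": "50, 50, 1", "I": "0", "K": "0", "N": "0",
           "P": "1", "S": "1", "Q": "2.5", "r": "0", "t": "0.75"}
cmd = build_findblobs_cmd("FindBlobs-static.exe", "example1.tif", options)
print(cmd)
```

Passing `{"Q": 2.5}` instead of `{"Q": "2.5"}` to this helper fails immediately with a readable message, rather than later inside `subprocess.run`.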

### (2b) Analyze a specific image

If you choose to analyze a specific image, run the scripts in the following two cells. In the first cell, set **findblot_path** to the folder where the FindBlobs-static.exe file is located and **image_path** to the image file. You can change the findblots options by changing the inputs in that cell. **Make sure each input is quoted, e.g. Q = '0', not Q = 0.**


In [22]:
#input the path for data analysis
findblot_path = "C:/Users/kelvin/Documents/Bioinformatics_analysis/"
image_path = "D:/CG_results/Example5/example1.tif"

#input parameter for the findblot analysis. Make sure to have the quotation like Q = '0' not Q = 0
C = '0'
f = '0, 1'
G = '50, 50, 1'
I = '0'
K = '0'
N = '0'
P = '1'
S = '1'
Q = '2.5'
r = '0'
t = '0.75'

In [23]:
#findblot script - no need to change anything here
excutablefile = os.path.join(findblot_path, "FindBlobs-static.exe")
cmd = [excutablefile,
       '-C', C,
       '-f', f,
       '-G', G,
       '-I', I,
       '-K', K,
       '-N', N,
       '-P', P,
       '-S', S,
       '-Q', Q, 
       '-r', r, 
       '-t', t, image_path]
subprocess.run(cmd)

CompletedProcess(args=['C:/Users/kelvin/Documents/Bioinformatics_analysis/FindBlobs-static.exe', '-C', '0', '-f', '0, 1', '-G', '50, 50, 1', '-I', '0', '-K', '0', '-N', '0', '-P', '1', '-S', '1', '-Q', '2.5', '-r', '0', '-t', '0.75', 'D:/CG_results/Example5/example1.tif'], returncode=0)
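
The `returncode=0` in the output above is the exit status reported by the program. By convention 0 means success, although whether FindBlobs reports every failure through a nonzero status is an assumption that is not verified here. A minimal sketch of checking the status, using the Python interpreter as a stand-in program so that it runs anywhere (the helper `run_and_check` is hypothetical):

```python
import subprocess
import sys

def run_and_check(cmd):
    """Run a command and report whether it exited with status 0."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("command failed with return code", result.returncode)
    return result.returncode == 0

# stand-in commands: the Python interpreter exiting with status 0 and with status 3
ok = run_and_check([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_check([sys.executable, "-c", "raise SystemExit(3)"])
print(ok, failed)
```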

## (3) Removal of repeat clusters and clusters with low QC values

### (3a) Analyze the images by folder

If you choose to analyze images by folder, start with only the raw or greyscale composite image files (.tif) in the input folder (file_path). After running findblots, you can proceed with the analysis in this section.

The following two cells analyze the images in a folder. The first cell sets the parameters and the folder paths. There are six parameters:
1. **size_thresold** = clusters whose size is below this percentile are removed
2. **size_upper_thresold** = clusters whose size is above this percentile are removed
3. **SNR_thresold** = clusters whose SNR is below this percentile are removed
4. **RFL_thresold** = clusters whose RFL is below this percentile are removed
5. **eps** = the maximum distance between two samples for one to be considered as in the neighborhood of the other
6. **min_samples** = the number of samples (or total weight) in a neighborhood for a point to be considered as a core point (should always be 2 in our case)

There are also two folder paths in the first cell:
1. **file_path** = the input folder. It **must** contain two types of files:  
    1. Composite greyscale or raw images. The four composite image files must be converted to greyscale.
    2. findblots CSV files (000000-loc.csv). These are the CSV files generated by the findblots tool.
2. **output_folder_path** = the folder where you want the results to be stored. Three types of files are written:
    1. Image files with the new cluster markings after filtering by the thresholds and DBSCAN.
    2. CSV files with the new cluster markings after filtering by the thresholds and DBSCAN.
    3. A CSV file with the total number of clusters in each image.
    
After entering the inputs in the first cell, run both cells **sequentially**.

In [4]:
# first cell for the "analyze the images in a folder" 
size_thresold = 5
size_upper_thresold = 98
SNR_thresold = 0
RFL_thresold = 5
eps = 3.5
min_samples = 2

file_path = "D:/CG_results/Example5/"
output_folder_path = "D:/CG_results/Example5/Results"
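
To illustrate what a percentile threshold does, the sketch below applies `np.percentile` to ten made-up SNR values, using the same strict comparisons as the analysis cell. One consequence worth knowing: because the comparison is a strict `>`, a threshold of 0 still drops the cluster(s) holding the minimum value.

```python
import numpy as np

# made-up SNR values for ten clusters
snr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

low_cut = np.percentile(snr, 10)    # value below which roughly 10% of the clusters fall
high_cut = np.percentile(snr, 90)   # value above which roughly 10% of the clusters fall

# keep only the clusters strictly between the two cutoffs
kept = snr[(snr > low_cut) & (snr < high_cut)]
print(low_cut, high_cut, kept)

# a threshold of 0 gives the minimum as the cutoff, so the strict ">" still
# removes the cluster(s) holding the minimum value
kept_zero = snr[snr > np.percentile(snr, 0)]
print(len(snr), len(kept_zero))
```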

In [5]:
# second cell for the "analyze the images in a folder", do not modify the codes here unless you are 
# confident with the modifications

# function to mark the clusters
bar_thickness = 0
half_bar_length = 1
def mark_image_location_rgb(X, Y):
  for i in range(max(0, Y-half_bar_length), min(Y+half_bar_length+1,image_rgb.shape[0])):
    for j in range(max(0, X-half_bar_length), min(X+half_bar_length+1,image_rgb.shape[1])):
      if abs(i-Y) <= bar_thickness or abs(j-X) <= bar_thickness:
        image_rgb[i,j,0]=0
        image_rgb[i,j,1]=0x7FFF
        image_rgb[i,j,2]=0x7FFF

def make_folder(save_folder):
    if not os.path.exists(save_folder):
        os.makedirs(save_folder)
        
# Collect the findblots CSV files and the source images. Both lists are sorted so that the
# i-th CSV is paired with the i-th image (os.listdir alone does not guarantee any order).
cvsfile = sorted(os.path.join(file_path, x) for x in os.listdir(path = file_path) if "000000-loc.csv" in x)
all_image = sorted(os.path.join(file_path, x) for x in os.listdir(path = file_path) if ".tif" in x)
imagefile = [x for x in all_image if '000000-img.tif' not in x]

make_folder(output_folder_path)
# part of every output file name that records the thresholds used
threshold_tag = "size_" + str(size_thresold) + \
                "_SNR_" + str(SNR_thresold) + \
                "_RFL_" + str(RFL_thresold)

key_list = []
item_list = []
for i in range(0, len(cvsfile)):
    db = pd.read_csv(cvsfile[i], sep = ",")
    db['size'] = db['GaussSigmaY'] * db['GaussSigmaX']

    # percentile filtering on size, SNR and RFL (the 'Signal' column)
    db1 = db[db['size'] > 0]
    df1 = db1[(db1['size'] > np.percentile(db1['size'], size_thresold)) & 
              (db1['size'] < np.percentile(db1['size'], size_upper_thresold)) &
              (db1['SNR'] > np.percentile(db1['SNR'], SNR_thresold)) &
              (db1['Signal'] > np.percentile(db1['Signal'], RFL_thresold))].copy()

    # DBSCAN on the X/Y positions; label -1 means no other marking lies within eps
    dbb = DBSCAN(eps = eps, min_samples = min_samples).fit(df1[['X', 'Y']])
    df1['group_ID'] = dbb.labels_

    # replace each group of repeat markings by its mean and keep the single markings as they are
    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
    merged = df1[df1.group_ID != -1].groupby('group_ID').mean(numeric_only=True).reset_index()
    df3 = pd.concat([merged, df1[df1.group_ID == -1]])

    # draw the remaining markings on the image
    imagefile_in = imagefile[i]
    image_gray = skimageio.imread(imagefile_in, True)
    image_rgb = gray2rgb(image_gray)
    for x, y in zip(df3['X'], df3['Y']):
        mark_image_location_rgb(int(x), int(y))

    path_basename = os.path.basename(imagefile_in)
    base_name = os.path.splitext(path_basename)[0]
    # os.path.join keeps the outputs inside output_folder_path with or without a trailing slash
    imagefile_out = os.path.join(output_folder_path, base_name + "_marked_" + threshold_tag)
    skimageio.imsave(imagefile_out + "_filter_dbscan.tif", image_rgb)
    df3.to_csv(imagefile_out + "_filter-loc.csv", sep=',', index=False)
    
    key_list.append(path_basename.replace('.tif', ''))
    item_list.append(df3.shape[0])

# summary table: number of clusters left in each image
cluster_num_dict = dict(zip(key_list, item_list))
df4 = pd.DataFrame(list(cluster_num_dict.items()),columns = ['file_name','number_of_cluster']) 
df4.to_csv(os.path.join(output_folder_path, threshold_tag + "_number_clusters.csv"),
           sep=',', index=False)
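
The merging step of the cell above can be seen on a tiny made-up table: markings that share a DBSCAN label are replaced by one row holding their mean, and markings labeled -1 are kept unchanged. This sketch uses `pd.concat` to join the two parts because `DataFrame.append` is not available in pandas 2.x. Note that every column is averaged, not only X and Y.

```python
import pandas as pd

# made-up positions with DBSCAN labels: rows 0-1 are repeat markings of one cluster,
# row 2 is a single marking (label -1)
marks = pd.DataFrame({
    "X": [10.0, 12.0, 50.0],
    "Y": [10.0, 11.0, 50.0],
    "SNR": [4.0, 6.0, 5.0],
    "group_ID": [0, 0, -1],
})

# one averaged row per group of repeat markings
merged = marks[marks.group_ID != -1].groupby("group_ID").mean().reset_index()
# single markings are carried over as they are
singles = marks[marks.group_ID == -1]
result = pd.concat([merged, singles], ignore_index=True)
print(result)
```

The three input markings become two clusters: one at (11.0, 10.5) and the untouched one at (50.0, 50.0).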

### (3b) Analyze a specific image

The following two cells analyze a specific image. The first cell sets the parameters and the paths. There are again six parameters:
1. **size_thresold** = clusters whose size is below this percentile are removed
2. **size_upper_thresold** = clusters whose size is above this percentile are removed
3. **SNR_thresold** = clusters whose SNR is below this percentile are removed
4. **RFL_thresold** = clusters whose RFL is below this percentile are removed
5. **eps** = the maximum distance between two samples for one to be considered as in the neighborhood of the other
6. **min_samples** = the number of samples (or total weight) in a neighborhood for a point to be considered as a core point (should always be 2 in our case)

There are also three paths in the first cell:
1. **image_file_path** = the path to the image file, either a composite greyscale or a raw image. The four composite image files must be converted to greyscale.
2. **csv_file_path** = the path to the findblots CSV file (000000-loc.csv) generated by the findblots tool for this image.
3. **output_folder_path** = the folder where you want the results to be stored.
    
After entering the inputs in the first cell, run both cells **sequentially**. **The number of clusters is printed when the script finishes.**

In [54]:
# first cell for the "analyze a specific image" 
size_thresold = 0
size_upper_thresold = 98
SNR_thresold = 0
RFL_thresold = 0
eps = 3.5
min_samples = 2

image_file_path = "C:Downloads/CRT68y_Inc_Inc1_composite_b00.00mm_305.26um_0.200s.tif"
csv_file_path = "C:Downloads/CRT68y_Inc_Inc1_composite_b00.00mm_305.26um_0.200s_000000-loc.csv"
output_folder_path = "C:Downloads/"
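
The second cell names its two output files from the image base name and the thresholds. The sketch below reproduces that naming scheme so you can tell in advance which files to expect. The helper `output_names` is hypothetical, and it uses `os.path.join`, so it works whether or not the output folder ends with a slash.

```python
import os

def output_names(image_path, output_folder, size, snr, rfl):
    """Return the marked-image and CSV paths for one input image (hypothetical helper)."""
    base_name = os.path.splitext(os.path.basename(image_path))[0]
    stem = os.path.join(output_folder,
                        f"{base_name}_marked_size_{size}_SNR_{snr}_RFL_{rfl}")
    return stem + "_filter_dbscan.tif", stem + "_filter-loc.csv"

# example values only; substitute your own image path and thresholds
image_out, csv_out = output_names("D:/CG_results/Example5/example1.tif",
                                  "D:/CG_results/Example5/Results", 5, 0, 5)
print(os.path.basename(image_out))  # example1_marked_size_5_SNR_0_RFL_5_filter_dbscan.tif
print(os.path.basename(csv_out))    # example1_marked_size_5_SNR_0_RFL_5_filter-loc.csv
```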

In [55]:
# second cell for the "analyze a specific image", do not modify the codes here unless you are 
# confident with the modifications

# function to mark the clusters
bar_thickness = 0
half_bar_length = 1
def mark_image_location_rgb(X, Y):
  for i in range(max(0, Y-half_bar_length), min(Y+half_bar_length+1,image_rgb.shape[0])):
    for j in range(max(0, X-half_bar_length), min(X+half_bar_length+1,image_rgb.shape[1])):
      if abs(i-Y) <= bar_thickness or abs(j-X) <= bar_thickness:
        image_rgb[i,j,0]=0
        image_rgb[i,j,1]=0x7FFF
        image_rgb[i,j,2]=0x7FFF
        
db = pd.read_csv(csv_file_path, sep = ",")
db['size'] = db['GaussSigmaY'] * db['GaussSigmaX']

# percentile filtering on size, SNR and RFL (the 'Signal' column)
db1 = db[db['size'] > 0]
df1 = db1[(db1['size'] > np.percentile(db1['size'], size_thresold)) & 
          (db1['size'] < np.percentile(db1['size'], size_upper_thresold)) &
          (db1['SNR'] > np.percentile(db1['SNR'], SNR_thresold)) &
          (db1['Signal'] > np.percentile(db1['Signal'], RFL_thresold))].copy()

# DBSCAN on the X/Y positions; label -1 means no other marking lies within eps
dbb = DBSCAN(eps = eps, min_samples = min_samples).fit(df1[['X', 'Y']])
df1['group_ID'] = dbb.labels_

# replace each group of repeat markings by its mean and keep the single markings as they are
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
merged = df1[df1.group_ID != -1].groupby('group_ID').mean(numeric_only=True).reset_index()
df3 = pd.concat([merged, df1[df1.group_ID == -1]])

# create the output folder here so that this cell does not depend on the folder cells
os.makedirs(output_folder_path, exist_ok=True)

# draw the remaining markings on the image
imagefile_in = image_file_path
image_gray = skimageio.imread(imagefile_in, True)
image_rgb = gray2rgb(image_gray)
for x, y in zip(df3['X'], df3['Y']):
    mark_image_location_rgb(int(x), int(y))

path_basename = os.path.basename(imagefile_in)
base_name = os.path.splitext(path_basename)[0]
imagefile_out = os.path.join(output_folder_path, base_name + \
    "_marked_" + \
    "size_" + str(size_thresold) + \
    "_SNR_" + str(SNR_thresold) + \
    "_RFL_" + str(RFL_thresold))
skimageio.imsave(imagefile_out + "_filter_dbscan.tif", image_rgb)
df3.to_csv(imagefile_out + "_filter-loc.csv", sep=',', index=False)

print("There are " + str(df3.shape[0]) + " of clusters identified in this image")

There are 35717 of clusters identified in this image
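
The `mark_image_location_rgb` function used in both analysis cells draws a small plus sign at each cluster position. The sketch below repeats its two loops on a boolean mask instead of an image, to show which pixels are colored for `half_bar_length = 1` and `bar_thickness = 0`. The helper `plus_marker_mask` is hypothetical and exists only for this illustration.

```python
import numpy as np

def plus_marker_mask(shape, x, y, half_bar_length=1, bar_thickness=0):
    """Return a boolean mask of the pixels the marking loops would color."""
    mask = np.zeros(shape, dtype=bool)
    # same loop bounds as mark_image_location_rgb: clipped to the image edges
    for i in range(max(0, y - half_bar_length), min(y + half_bar_length + 1, shape[0])):
        for j in range(max(0, x - half_bar_length), min(x + half_bar_length + 1, shape[1])):
            # keep only the pixels on the horizontal or vertical bar
            if abs(i - y) <= bar_thickness or abs(j - x) <= bar_thickness:
                mask[i, j] = True
    return mask

mask = plus_marker_mask((5, 5), x=2, y=2)
print(mask.astype(int))
```

With the default values each marker covers five pixels (the center and its four direct neighbors). Markers at the image border are clipped, so a marker at the corner (0, 0) covers only three pixels.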
