# TrenchRipper Master Notebook

## Introduction

This notebook contains the entire `TrenchRipper` pipeline, divided into simple steps. This pipeline is ideal for Mother <br>Machine image data where cells possess fluorescent segmentation markers. Segmentation on phase or brightfield data <br>is being developed, but is still an experimental feature.

The steps in this pipeline are as follows:
1. Extracting your Mother Machine data (.nd2) into hdf5 format
2. Identifying and cropping individual trenches into kymographs
3. Segmenting cells with a fluorescent marker
4. Determining lineages and object properties

In each step, the user will dynamically specify parameters using a series of interactive diagnostics on their dataset. <br>Following this, a parameter file will be written to disk and then used to deploy a parallel computation on the <br>dataset, either locally or on a SLURM cluster.


This is intended as an end-to-end solution to analyzing Mother Machine data. As such, **it is not trivial to plug data <br>directly into intermediate steps**, as it will lack the correct formatting and associated metadata. A notable <br>exception to this is using another program to segment data. The library references binary segmentation masks using <br>only metadata derived from their associated kymographs. As such, it is possible to generate segmentations on these <br>kymographs elsewhere and place them into the segmentation data path to have `TrenchRipper` act on those <br>segmentations instead. More on this in the segmentation section...

#### Imports

Run this section to import all relevant packages and libraries used in this notebook. You must run this every time you open a new Python kernel.

In [None]:
import paulssonlab.deaton.trenchripper.trenchripper as tr

import warnings

warnings.filterwarnings(action="once")

import matplotlib

matplotlib.rcParams["figure.figsize"] = [20, 10]

#### Specify Paths

Begin by defining the directory in which all processing will be done, as well as the initial nd2 file we will be <br>processing. This line should be run every time you open a new Python kernel.

The format should be: `headpath = "/path/to/folder"` and `nd2file = "/path/to/file.nd2"`

For example:
```
headpath = "/n/scratch2/de64/2019-05-31_validation_data"
nd2file = "/n/scratch2/de64/2019-05-31_validation_data/Main_Experiment.nd2"
```

Ideally, these files should be placed in a storage location with relatively fast I/O.

In [None]:
headpath = "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/gfp/"
# hdf5inputpath = "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/run/"
nd2file = "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/GFP001.nd2"

## Extract to hdf5 files

In this section, we will be extracting our image data. Currently, this notebook only supports the `.nd2` format; however, <br>there are `.tiff` extractors in the TrenchRipper source files that will be added to `Master.ipynb` soon.

In the abstract, this step will take a single `.nd2` file and split it into a set of `.hdf5` files stored in <br>`headpath/hdf5`. Splitting the file up in this way will facilitate quick processing in later steps. Each field of <br>view will be split into one or more `.hdf5` files, depending on the number of images per file requested (more on <br>this later). 

To keep track of which output files correspond to which FOVs, as well as to keep track of experiment metadata, the <br>extractor also outputs a `metadata.hdf5` file in the `headpath` folder. The data from this step is accessible in <br>that `metadata.hdf5` file under the `global` key. If you would like to look at this metadata, you may use the <br>`tr.utils.pandas_hdf5_handler` to read from this file. Later steps will add additional metadata under different <br>keys into the `metadata.hdf5` file.

#### Start Dask Workers

First, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=40,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please report <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

In [None]:
dask_controller.shutdown()

##### Perform Extraction

Now that we have our cluster scheduler spun up, it is time to convert files. This will be handled by the <br>`hdf5_extractor` object. This extractor will pull up each FOV and split it such that each derived `.hdf5` file <br>contains, at maximum, N timepoints of that FOV per file. The image data stored in these files takes the <br>form of `(N,Y,X)` arrays that are accessible using the desired channel name as a key. 

The arguments for this extractor are:

 - **nd2file** : The filepath to the `.nd2` file you intend to extract.
 
 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **tpts_per_file** : The maximum number of timepoints stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **ignore_fovmetadata** : Used when `.nd2` data is corrupted and does not possess records for stage positions or <br>timepoints. Only set `True` if the extractor throws errors on metadata handling.

 - **nd2reader_override** : Overrides values in metadata recovered using the `nd2reader`. Currently set to <br>`{"z_levels":[],"z_coordinates":[]}` by default to correct a known issue where z coordinates are mistakenly <br>interpreted as a z stack. See the [nd2reader](https://rbnvrw.github.io/nd2reader/) documentation for more info.
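To make the effect of `tpts_per_file` concrete, here is a minimal sketch of how a single FOV's timepoints could be divided into `(N,Y,X)` chunks. This is an illustration only, not the extractor's actual implementation; `split_timepoints` and `fov_stack` are hypothetical names.

```python
import numpy as np


def split_timepoints(n_timepoints, tpts_per_file):
    """Return (start, stop) index pairs covering all timepoints, at most tpts_per_file each."""
    return [
        (start, min(start + tpts_per_file, n_timepoints))
        for start in range(0, n_timepoints, tpts_per_file)
    ]


# Hypothetical single-channel FOV stack: 120 timepoints of 8x8 images, shape (T, Y, X).
fov_stack = np.zeros((120, 8, 8), dtype=np.uint16)
chunks = split_timepoints(fov_stack.shape[0], tpts_per_file=50)

# Each slice is the (N, Y, X) array that would be stored in one output file.
chunk_shapes = [fov_stack[start:stop].shape for start, stop in chunks]
print(chunks)
print(chunk_shapes)
```

With 120 timepoints and `tpts_per_file=50`, the last chunk holds only the remaining 20 timepoints, which is why N is a maximum rather than a fixed size.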

In [None]:
# hdf5_extractor = tr.marlin_extractor(hdf5inputpath, headpath, metaparsestr='metadata_{timepoint:d}.hdf5')

In [None]:
hdf5_extractor = tr.ndextract.hdf5_fov_extractor(
    nd2file,
    headpath,
    tpts_per_file=50,
    ignore_fovmetadata=False,
    nd2reader_override={"z_levels": [], "z_coordinates": []},
)

In [None]:
# hdf5_extractor = tr.ndextract.tiff_extractor(
#     tiffpath,
#     headpath,
#     ["Phase","YFP"],tpts_per_file=50
# )

##### Extraction Parameters

Here, you may set the time interval you want to extract. Useful for cropping data to the period exhibiting the dynamics of interest.

Optionally take notes to add to the `metadata.hdf5` file. Notes may also be taken directly in this notebook.

In [None]:
hdf5_extractor.inter_set_params()

##### Begin Extraction 

Running the following line will start the extraction process. This may be monitored by examining the `Dask Dashboard` <br> under the link displayed earlier. Once the computation is complete, move to the next line.

This step may take a long time, though it is possible to speed it up using additional workers.

In [None]:
hdf5_extractor.extract(dask_controller)

##### Shutdown Dask

Once extraction is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## Kymographs

Now that you have extracted your data into a series of `.hdf5` files, we will perform identification and cropping <br>of the individual trenches/growth channels present in the images. This algorithm assumes that your growth trenches <br>are vertically aligned and that they alternate in their orientation from top to bottom. See the example image for the <br>correct geometry:

![example_image](./resources/example_image.jpg)

The output of this step will be a set of `.hdf5` files stored in `headpath/kymograph`. The image data stored in these <br>files takes the form of `(K,T,Y,X)` arrays where K is the trench index, T is time, and Y,X are the crop dimensions. <br>These arrays are accessible using keys of the form `"[Image Channel]"`. For example, looking up phase channel <br>data will require the key `"Phase"`.

### Test Parameters



##### Initialize the interactive kymograph class

As a first step, initialize the `tr.interactive.kymograph_interactive` class that will help us choose the <br>parameters we will use to generate kymographs. 

In [None]:
interactive_kymograph = tr.kymograph_interactive(headpath)

In [None]:
import numpy as np
import pandas as pd
import copy

In [None]:
def get_grid_lookups(global_df, delta=10):
    first_tpt = global_df.loc[pd.IndexSlice[:, slice(0, 0)], :]

    x_vals = first_tpt["x"].values
    y_vals = first_tpt["y"].values

    x_dist = np.abs(np.subtract.outer(x_vals, x_vals))
    y_dist = np.abs(np.subtract.outer(y_vals, y_vals))

    close_x = x_dist < delta
    close_y = y_dist < delta

    x_groups = []
    for x_idx in range(close_x.shape[0]):
        x_groups.append(np.where(close_x[x_idx])[0])
    x_groups = [np.array(item) for item in set(list(tuple(arr) for arr in x_groups))]
    x_groups = sorted(x_groups, key=lambda x: x[0])
    x_lookup = {item: k for k, group in enumerate(x_groups) for item in group}

    y_groups = []
    for y_idx in range(close_y.shape[0]):
        y_groups.append(np.where(close_y[y_idx])[0])
    y_groups = [np.array(item) for item in set(list(tuple(arr) for arr in y_groups))]
    y_groups = sorted(y_groups, key=lambda x: x[0])
    y_lookup = {item: k for k, group in enumerate(y_groups) for item in group}

    return x_lookup, y_lookup


def get_grid_indices(global_df, delta=10):
    # Pass the caller's tolerance through instead of hard-coding it.
    x_lookup, y_lookup = get_grid_lookups(global_df, delta=delta)

    column_indices = [
        x_lookup[fov_idx]
        for fov_idx in global_df.index.get_level_values("fov").tolist()
    ]
    row_indices = [
        y_lookup[fov_idx]
        for fov_idx in global_df.index.get_level_values("fov").tolist()
    ]

    return column_indices, row_indices

In [None]:
columns, rows = get_grid_indices(interactive_kymograph.metadf)
test = copy.deepcopy(interactive_kymograph.metadf)
test["Column"] = columns
test["Row"] = rows

In [None]:
test

In [None]:
test["column"] = columns  # `columns` is returned by get_grid_indices above

In [None]:
test

In [None]:
def infer_grid():
    # Placeholder: grid inference has not been implemented yet.
    raise NotImplementedError

In [None]:
viewer = tr.hdf5_viewer(headpath, persist_data=False)

##### Examine Images

Here you can manually inspect images before beginning parameter tuning.

In [None]:
viewer.view(width=1200)

You will now want to select a few test FOVs to try out parameters on, the channel you want to detect trenches on, and <br>the time interval on which you will perform your processing.

The arguments for this step are:

- **seg_channel (string)** : The channel name that you would like to segment on.

- **invert (list)** : Whether or not you want to invert the image before detecting trenches. By default, it is assumed that <br>the trenches have a high pixel intensity relative to the background. This should be the case for Phase Contrast and <br>Fluorescence Imaging, but may not be the case for Brightfield Imaging, in which case you will want to invert the image.

- **fov_list (list)** : List of integers corresponding to the FOVs that you wish to make test kymographs of.

- **t_subsample_step (int)** : Step size to be used for subsampling input files in time. We recommend choosing a step that leaves <br>between 5 and 10 timepoints for quick processing.

Hit the "Run Interact" button to lock in your parameters. The button will become transparent briefly and become solid again <br>when processing is complete. After that has occurred, move on to the next step. 

In [None]:
interactive_kymograph.import_hdf5_interactive()

##### Tune "trench-row" detection hyperparameters

The kymograph code begins by detecting the positions of trench rows in the image as follows:

1. Reduce each 2D image to a 1D signal along the y-axis by computing the qth percentile of the data along the x-axis
2. Smooth this signal using a median kernel
3. Normalize the signal by linearly rescaling it so that its minimum and maximum map to 0 and 1, respectively
4. Use a set threshold to determine the trench row positions

The arguments for this step are:

 - **y_percentile (int)** : Percentile to use for step 1.

 - **smoothing_kernel_y_dim_0 (int)** : Median kernel size to use for step 2.

 - **y_percentile_threshold (float)** : Threshold to use in step 4.
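As a rough illustration of steps 1 to 4, the sketch below applies them to a synthetic image using NumPy only. It is not the library's implementation: `median_smooth` and `detect_row_mask` are hypothetical helpers, the median filter is a simple stand-in that assumes an odd kernel size, and the parameter values are arbitrary.

```python
import numpy as np


def median_smooth(signal, kernel_size):
    """Median-filter a 1D signal with an odd kernel size, repeating the edge values."""
    pad = kernel_size // 2
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    return np.median(windows, axis=1)


def detect_row_mask(img, y_percentile, smoothing_kernel_y, y_percentile_threshold):
    # Step 1: collapse the x-axis, leaving one value per y position.
    signal = np.percentile(img, y_percentile, axis=1)
    # Step 2: smooth with a median kernel.
    signal = median_smooth(signal, smoothing_kernel_y)
    # Step 3: rescale so the minimum maps to 0 and the maximum to 1.
    signal = (signal - signal.min()) / (signal.max() - signal.min())
    # Step 4: y positions above the threshold belong to a trench row.
    return signal > y_percentile_threshold


# Synthetic (Y, X) image with two bright trench rows on a dark background.
img = np.zeros((60, 40))
img[10:25] = 100.0
img[35:50] = 100.0
row_mask = detect_row_mask(
    img, y_percentile=99, smoothing_kernel_y=5, y_percentile_threshold=0.5
)
print(np.flatnonzero(row_mask))
```

The resulting boolean mask marks the y positions occupied by trench rows, which is what the threshold line in the widget is meant to help you tune.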

Running the following widget will display the smoothed 1-D signal for each of your timepoints. In addition, the threshold <br>value for each FOV will be displayed as a red line.

In [None]:
interactive_kymograph.preview_y_precentiles_interactive()

##### Tune "trench-row" cropping hyperparameters

Next, we will use the detected rows to perform cropping of the input image in the y-dimension:

1. Determine edges of trench rows based on threshold mask.
2. Filter out rows that are too small.
3. Use the remaining rows to compute the drift in y in each image.
4. Apply the drift to the initially detected rows to get rows in all timepoints.
5. Perform cropping using the "end" of the row as reference (the end referring to the part of the trench farthest from <br>the feeding channel).
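Steps 1 and 2 amount to finding contiguous runs in the thresholded mask and discarding the short ones. A minimal sketch, assuming the mask is a 1D boolean array over y; `find_row_edges` is a hypothetical helper, not part of the library:

```python
import numpy as np


def find_row_edges(mask, y_min_edge_dist):
    """Return (start, stop) pairs for runs of True at least y_min_edge_dist long.

    `stop` is exclusive, so mask[start:stop] is the detected row.
    """
    # Pad with False so runs touching the image border still produce both edges.
    padded = np.concatenate(([False], mask, [False])).astype(int)
    changes = np.diff(padded)
    starts = np.where(changes == 1)[0]
    stops = np.where(changes == -1)[0]
    return [
        (int(start), int(stop))
        for start, stop in zip(starts, stops)
        if stop - start >= y_min_edge_dist
    ]


# Two real rows plus a 2-pixel speck that should be filtered out.
mask = np.zeros(60, dtype=bool)
mask[10:25] = True
mask[35:50] = True
mask[55:57] = True
row_edges = find_row_edges(mask, y_min_edge_dist=10)
print(row_edges)
```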

Step 5 performs a simple algorithm to determine the orientation of each trench:

```
row_orientations = [] # A list of row orientations, starting from the topmost row
if the number of detected rows == 'Number of Rows': 
    row_orientations.append('Orientation')
elif the number of detected rows < 'Number of Rows':
    row_orientations.append('Orientation when < expected rows')
for row in rows[1:]: # The topmost row was already assigned above
    if row_orientations[-1] == downward:
        row_orientations.append(upward)
    elif row_orientations[-1] == upward:
        row_orientations.append(downward)
```

Additionally, if the device trenches face a single direction, alternation of row orientation may be turned off by setting the <br>`Alternate Orientation?` argument to False. The `Use Median Drift?` argument, when set to True, will use the <br>median drift in y across all FOVs for drift correction, instead of doing drift correction independently for each FOV. <br>This can be useful if a large fraction of FOVs are failing drift correction. Note that `Use Median Drift?` <br>sets this behavior for both y and x drift correction.
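The orientation rule, including the `Alternate Orientation?` switch, can be written as a small runnable function. This is a sketch under assumptions: orientations are encoded as 0 (downward opening) and 1 (upward opening), the function and argument names are hypothetical, and raising an error when more rows are detected than expected is a choice made here, not documented behavior.

```python
def get_row_orientations(
    n_detected, n_expected, orientation, orientation_if_fewer, alternate=True
):
    """Return one orientation (0 or 1) per detected row, starting from the topmost row."""
    if n_detected == 0:
        return []
    if n_detected == n_expected:
        first = orientation  # 'Orientation'
    elif n_detected < n_expected:
        first = orientation_if_fewer  # 'Orientation when < expected rows'
    else:
        # Assumption: this case is treated as an error in this sketch.
        raise ValueError("more rows detected than expected")

    orientations = [first]
    for _ in range(n_detected - 1):
        previous = orientations[-1]
        # Flip 0 <-> 1 when alternating; otherwise repeat the previous orientation.
        orientations.append(1 - previous if alternate else previous)
    return orientations


print(get_row_orientations(2, 2, orientation=0, orientation_if_fewer=1))
print(get_row_orientations(1, 2, orientation=0, orientation_if_fewer=1))
print(get_row_orientations(3, 3, orientation=1, orientation_if_fewer=0, alternate=False))
```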

The arguments for this step are:

 - **y_min_edge_dist (int)** : Minimum row length necessary for detection (filters out small detected objects).

 - **padding_y (int)** : Padding to add to the end of trench row when cropping in the y-dimension.

 - **trench_len_y (int)** : Length from the end of each trench row to the feeding channel side of the crop.

 - **Number of Rows (int)** : The number of rows to expect in your image. For instance, two in the example image.
 
 - **Alternate Orientation? (bool)** : Whether or not to alternate the orientation of consecutive rows.

 - **Orientation (int)** : The orientation of the top-most row where 0 corresponds to a trench with a downward-oriented trench <br>opening and 1 corresponds to a trench with an upward-oriented trench opening.

 - **Orientation when < expected rows (int)** : The orientation of the top-most row when the number of detected rows is less than <br>expected. Useful if your trenches drift out of your image in some FOVs.
 
 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.

 - **images_per_row (int)** : How many images to output per row for this widget.

Running the following widget will display y-cropped images for each fov and timepoint.

In [None]:
interactive_kymograph.preview_y_crop_interactive()

##### Tune trench detection hyperparameters

Next, we will detect the positions of trenches in the y-cropped images as follows:

1. Reduce each 2D image to a 1D signal along the x-axis by computing the qth percentile of the data along the y-axis.
2. Determine the signal background by smoothing this signal using a large median kernel.
3. Subtract the background signal.
4. Smooth the resultant signal using a median kernel.
5. Use an [Otsu threshold](https://imagej.net/Auto_Threshold#Otsu) to determine the trench midpoint positions.

After this, x-dimension drift correction of our detected midpoints will be performed as follows:

6. Begin at t=1
7. For each $m \in \{midpoints(t)\}$, assign to $m$ the midpoint $n \in \{midpoints(t-1)\}$ that is closest to $m$;<br>
midpoints at time $t-1$ that are not the closest midpoint to any $m$ will not be mapped.
8. Compute the translation of each mapped midpoint from time $t-1$ to $t$.
9. Take the average of these translations as the x-dimension drift from time $t-1$ to $t$.
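The drift estimate in steps 7 to 9 can be sketched with a pairwise distance matrix. This is an illustrative version, not the library's code; `estimate_x_drift` is a hypothetical name, and the example midpoints are made up.

```python
import numpy as np


def estimate_x_drift(prev_midpoints, curr_midpoints):
    """Average translation from each current midpoint's nearest previous midpoint."""
    prev = np.asarray(prev_midpoints, dtype=float)
    curr = np.asarray(curr_midpoints, dtype=float)
    # Step 7: distance from every midpoint at time t (rows) to every one at t-1 (columns).
    dist = np.abs(curr[:, None] - prev[None, :])
    nearest = np.argmin(dist, axis=1)
    # Step 8: translation of each mapped midpoint.
    translations = curr - prev[nearest]
    # Step 9: the mean translation is the drift estimate.
    return float(np.mean(translations))


prev_midpoints = [10, 30, 50, 70]
curr_midpoints = [12, 32, 52]  # one trench was not detected at time t
drift = estimate_x_drift(prev_midpoints, curr_midpoints)
print(drift)
```

Note that the previous midpoint at 70 is never the nearest neighbor of a current midpoint, so it is simply left unmapped. Nearest-neighbor matching of this kind only works when the drift between frames is small relative to the trench spacing, which is why sparse or discontinuous midpoints degrade the correction.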

The arguments for this step are:

 - **t (int)** : Timepoint to examine the percentiles and threshold in.

 - **x_percentile (int)** : Percentile to use for step 1.

 - **background_kernel_x (int)** : Median kernel size to use for step 2.

 - **smoothing_kernel_x (int)** : Median kernel size to use for step 4.

 - **otsu_scaling (float)** : Scaling factor to apply to the threshold determined by Otsu's method.
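The sketch below strings steps 1 to 5 together on a synthetic y-cropped image, using NumPy only. It is a simplified illustration rather than the library's implementation: the helper names are hypothetical, the median filter assumes odd kernel sizes, the Otsu threshold is a basic histogram-based version, and the default parameter values are arbitrary.

```python
import numpy as np


def median_smooth(signal, kernel_size):
    """Median-filter a 1D signal with an odd kernel size, repeating the edge values."""
    pad = kernel_size // 2
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, kernel_size)
    return np.median(windows, axis=1)


def otsu_threshold(values, nbins=256):
    """Histogram-based Otsu threshold: maximize the between-class variance."""
    hist, edges = np.histogram(values, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    weight_low = np.cumsum(hist)
    weight_high = np.cumsum(hist[::-1])[::-1]
    mean_low = np.cumsum(hist * centers) / np.maximum(weight_low, 1)
    mean_high = (
        np.cumsum((hist * centers)[::-1]) / np.maximum(weight_high[::-1], 1)
    )[::-1]
    between_var = weight_low[:-1] * weight_high[1:] * (mean_low[:-1] - mean_high[1:]) ** 2
    return centers[np.argmax(between_var)]


def detect_trench_midpoints(
    img, x_percentile=85, background_kernel_x=21, smoothing_kernel_x=3, otsu_scaling=1.0
):
    signal = np.percentile(img, x_percentile, axis=0)  # step 1: collapse the y-axis
    background = median_smooth(signal, background_kernel_x)  # step 2
    signal = median_smooth(signal - background, smoothing_kernel_x)  # steps 3 and 4
    mask = signal > otsu_scaling * otsu_threshold(signal)  # step 5
    # The midpoint of each run of True values is a trench midpoint.
    changes = np.diff(np.concatenate(([0], mask.astype(int), [0])))
    starts, stops = np.where(changes == 1)[0], np.where(changes == -1)[0]
    return (starts + stops - 1) / 2


# Synthetic (Y, X) crop: five 4-pixel-wide trenches on a dim background.
img = np.full((20, 100), 10.0)
for x0 in range(10, 100, 20):
    img[:, x0 : x0 + 4] = 100.0
midpoints = detect_trench_midpoints(img)
print(midpoints)
```

The background kernel needs to be much wider than a trench so that the median tracks the background rather than the trenches themselves.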

Running the following widget will display the smoothed 1-D signal for each of your timepoints, with the threshold <br>value for each FOV displayed as a red line. It will also display the detected midpoints for each of your timepoints. <br>If there is too much sparsity or discontinuity, your drift correction will not be accurate.

In [None]:
interactive_kymograph.preview_x_percentiles_interactive()

##### Tune trench cropping hyperparameters

Trench cropping simply uses the drift-corrected midpoints as a reference and crops out some fixed length around them <br>
to produce an output kymograph. **Note that the current implementation does not allow trench crops to overlap**. If your<br>
trench crops do overlap, the error will not be caught here, but will cause issues later in the pipeline. As such, try <br>
to crop your trenches as closely as possible. This issue will be fixed in a later update.

The arguments for this step are:

 - **trench_width_x (int)** : Trench width to use for cropping.

 - **trench_present_thr (float)** : Trenches that appear in less than this percent of FOVs will be eliminated from the dataset.<br>
If not removed, missing positions will be inferred from the image drift.

 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.
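Because overlapping crops are not caught at this stage, it can help to check a candidate `trench_width_x` against your midpoint spacing yourself. A minimal sketch, assuming each crop is a window of width `trench_width_x` centered on its midpoint (an assumption about the crop geometry); `crops_overlap` is a hypothetical helper:

```python
import numpy as np


def crops_overlap(midpoints, trench_width_x):
    """True if any two adjacent crops of the given width would overlap."""
    mids = np.sort(np.asarray(midpoints, dtype=float))
    # Centered windows of equal width overlap when neighbors are closer than that width.
    return bool(np.any(np.diff(mids) < trench_width_x))


midpoints = [11.5, 31.5, 51.5]
print(crops_overlap(midpoints, trench_width_x=20))
print(crops_overlap(midpoints, trench_width_x=22))
```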


Running the following widget will display a random kymograph for each row in each FOV and will also produce midpoint plots <br>showing retained midpoints.

In [None]:
interactive_kymograph.preview_kymographs_interactive()

##### Export and save hyperparameters

Run the following line to register and display the parameters you have selected for kymograph creation.

In [None]:
interactive_kymograph.process_results()

If you are satisfied with the above parameters, run the following line to write these parameters to disk at `headpath/kymograph.par`.<br>
This file will be used to perform kymograph creation in the next section.

In [None]:
interactive_kymograph.write_param_file()

### Generate Kymograph

##### Start Dask Workers

Again, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2 for kymograph creation. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=50,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please report <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

##### Perform Kymograph Cropping

Now that we have our cluster scheduler spun up, we will extract kymographs using the parameters stored in `headpath/kymograph.par`. <br>
This will be handled by the `kymograph_cluster` object. This will detect trenches in all of the files present in `headpath/hdf5` that <br>
you created in the first step. It will then crop these trenches and place the crops in a series of `.hdf5` files in `headpath/kymograph`. <br>
These files will store image data in the form of `(K,T,Y,X)` arrays where K is the trench index, T is time and Y,X are the image dimensions <br>
of the crop.

The arguments for this step are:

 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **trenches_per_file** : The maximum number of trenches stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **paramfile** : Set to `True` if you want to use parameters from `headpath/kymograph.par`. Otherwise, you will have to specify <br>
 parameters as direct arguments to `kymograph_cluster`.
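To picture the `(K,T,Y,X)` layout, the sketch below crops fixed-width windows out of a `(T,Y,X)` stack and stacks them along a new trench axis. It deliberately ignores drift correction and file handling, and `crop_kymographs` and `x_starts` are hypothetical names rather than library API.

```python
import numpy as np


def crop_kymographs(stack, x_starts, trench_width_x):
    """Crop each trench from a (T, Y, X) stack and return a (K, T, Y, X) array."""
    return np.stack(
        [stack[:, :, x0 : x0 + trench_width_x] for x0 in x_starts], axis=0
    )


# Hypothetical y-cropped stack: 5 timepoints, 6 pixels tall, 40 pixels wide.
stack = np.arange(5 * 6 * 40).reshape(5, 6, 40)
kymographs = crop_kymographs(stack, x_starts=[2, 12, 22], trench_width_x=8)
print(kymographs.shape)
```

Indexing `kymographs[k]` then gives the full time series for trench `k`, which is the unit that later segmentation steps operate on.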

In [None]:
kymoclust = tr.kymograph.kymograph_cluster(
    headpath=headpath, trenches_per_file=200, paramfile=True
)

##### Begin Kymograph Cropping 

Running the following line will start the cropping process. This may be monitored by examining the `Dask Dashboard` <br>
under the link displayed earlier. Once the computation is complete, move to the next line.

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.generate_kymographs(dask_controller)

In [None]:
ff = tr.focus_filter(headpath)

In [None]:
ff.choose_filter_channel_inter()

In [None]:
ff.plot_histograms()

In [None]:
ff.plot_focus_threshold_inter()

In [None]:
ff.write_param_file()

##### Post-process Images

After the above step, kymographs will have been created for each `.hdf5` input file. They will now need to be reorganized <br>
into a new set of files such that each file contains, at most, `trenches_per_file` trenches.

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.post_process(dask_controller)

##### Check kymograph statistics

Run the next line to display some statistics from kymograph creation. The outputs are:

 - **fovs processed** : The number of FOVs successfully processed out of the total number of FOVs
 - **rows processed** : The number of rows of trenches processed out of the total number of rows
 - **trenches processed** : The number of trenches successfully processed
 - **row/fov** : The average number of rows successfully processed per FOV
 - **trenches/fov** : The average number of trenches successfully processed per FOV
 - **failed fovs** : A list of failed FOVs. Spot check these FOVs in the viewer to determine potential problems

In [None]:
kymoclust.kymo_report()

##### Shutdown Dask

Once cropping is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## Fluorescence Segmentation

Now that you have cropped your data into kymographs, we will perform segmentation/cell detection <br>
on your kymographs. Currently, this pipeline only supports segmentation of fluorescence images; however, <br>
segmentation of transmitted light imaging techniques is in development.

The output of this step will be a set of `segmentation_[File #].hdf5` files stored in `headpath/fluorsegmentation`.<br>
The image data stored in these files takes the exact same form as the kymograph data, `(K,T,Y,X)` arrays <br>
where K is the trench index, T is time, and Y,X are the crop dimensions. These arrays are accessible using <br>
keys of the form `"[Trench Row Number]"`.

Since no metadata is generated by this step, it is possible to use another segmentation algorithm on the kymograph <br>
data. The output of segmentation must be split into `segmentation_[File #].hdf5` files, where `[File #]` agrees with the<br>
corresponding `kymograph_[File #].hdf5` file. Additionally, the `(K,T,Y,X)` arrays must be of the same shape as the <br>
kymograph arrays and accessible at the corresponding `"[Trench Row Number]"` key. These files must be placed into <br>
their own folder at `headpath/foldername`. This folder may then be used in later steps.

### Test Parameters

##### Initialize the interactive segmentation class

As a first step, initialize the `tr.fluo_segmentation_interactive` class that will be handling all steps of generating a segmentation. 

In [None]:
interactive_segmentation = tr.fluo_segmentation_interactive(headpath)

##### Choose channel to segment on

In [None]:
interactive_segmentation.choose_seg_channel_inter()

#### Import data

Fill in 

You will need to tune the following `args` and `kwargs` (in order):

**fov_idx (int)** :

**n_trenches (int)** :

**t_range (tuple)** :

**t_subsample_step (int)** :

In [None]:
interactive_segmentation.import_array_inter()

##### Process data

In [None]:
interactive_segmentation.plot_processed_inter()

#### Determine Cell Mask Envelope

Fill in.

You will need to tune the following `args` and `kwargs` (in order):

**cell_mask_method (str)** : Thresholding method, can be a local or global Otsu threshold.

**cell_otsu_scaling (float)** : Scaling factor applied to determined threshold.

**local_otsu_r (int)** : Radius of thresholding kernel used in the local Otsu thresholding.

In [None]:
interactive_segmentation.plot_cell_mask_inter()

In [None]:
interactive_segmentation.plot_eig_mask_inter()

In [None]:
interactive_segmentation.plot_dist_mask_inter()

In [None]:
interactive_segmentation.plot_marker_mask_inter()

In [None]:
interactive_segmentation.process_results()

In [None]:
interactive_segmentation.write_param_file()

### Generate Segmentation

#### Start Dask Workers

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="01:00:00",
    local=False,
    n_workers=50,
    memory="1GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

In [None]:
segment = tr.segment.fluo_segmentation_cluster(headpath, paramfile=True)

In [None]:
segment.dask_segment(dask_controller)

#### Stop Dask Workers

In [None]:
dask_controller.shutdown()

## Region Properties (No Lineage)

Note that this does not require a Dask client.

In [None]:
from paulssonlab.deaton.trenchripper.trenchripper import pandas_hdf5_handler
from paulssonlab.deaton.trenchripper.trenchripper import dask_controller

import h5py
import os

import skimage as sk
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.delayed as delayed

from distributed.client import futures_of
from time import sleep

from matplotlib import pyplot as plt


class regionprops_extractor:
    def __init__(
        self,
        headpath,
        segmentationdir,
        intensity_channel_list=None,
        include_background=False,
        props=["centroid", "area", "mean_intensity"],
        unpack_dict={"centroid": ["centroid_y", "centroid_x"]},
    ):
        self.headpath = headpath
        self.intensity_channel_list = intensity_channel_list
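        # Guard against the default `intensity_channel_list=None`, which cannot be enumerated.
        self.intensity_channel_dict = (
            {channel: i for i, channel in enumerate(intensity_channel_list)}
            if intensity_channel_list is not None
            else {}
        )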
        self.include_background = include_background
        self.kymographpath = headpath + "/kymograph"
        self.segmentationpath = headpath + "/" + segmentationdir
        self.metapath = self.kymographpath + "/metadata"
        self.analysispath = headpath + "/analysis"
        self.props = props
        self.unpack_dict = unpack_dict

    def get_file_regionprops(self, file_idx):
        segmentation_file = (
            self.segmentationpath + "/segmentation_" + str(file_idx) + ".hdf5"
        )
        kymograph_file = self.kymographpath + "/kymograph_" + str(file_idx) + ".hdf5"

        with h5py.File(segmentation_file, "r") as segfile:
            seg_arr = segfile["data"][:]
        if self.intensity_channel_list is not None:
            kymo_arr_list = []
            with h5py.File(kymograph_file, "r") as kymofile:
                for intensity_channel in self.intensity_channel_list:
                    kymo_arr_list.append(kymofile[intensity_channel][:])
        all_props_list = []
        for k in range(seg_arr.shape[0]):
            for t in range(seg_arr.shape[1]):
                labels = sk.measure.label(seg_arr[k, t])
                ## Measure regionprops of background pixels; will always be marked as the first object
                if self.include_background:
                    labels += 1
                if self.intensity_channel_list is not None:
                    for i, intensity_channel in enumerate(self.intensity_channel_list):
                        rps = sk.measure.regionprops(labels, kymo_arr_list[i][k, t])
                        props_list = []
                        for obj, rp in enumerate(rps):
                            props_entry = [file_idx, k, t, obj, intensity_channel]
                            for prop_key in self.props:
                                if prop_key in self.unpack_dict.keys():
                                    prop_split = self.unpack_dict[prop_key]
                                    prop_output = rp[prop_key]
                                    props_entry += [
                                        prop_output[i] for i in range(len(prop_split))
                                    ]
                                else:
                                    props_entry += [rp[prop_key]]
                            props_list.append(props_entry)
                        all_props_list += props_list

                #                         for prop_key in self.props:
                #                             props_list.append([file_idx, k, t, obj, intensity_channel])
                #                         props_list = [[file_idx, k, t, obj, intensity_channel]+[getattr(rp, prop_key) for prop_key in self.props] for obj,rp in enumerate(rps)]
                #                         all_props_list+=props_list
                else:
                    rps = sk.measure.regionprops(labels)
                    props_list = []
                    for obj, rp in enumerate(rps):
                        props_entry = [file_idx, k, t, obj]
                        for prop_key in self.props:
                            if prop_key in self.unpack_dict.keys():
                                prop_split = self.unpack_dict[prop_key]
                                prop_output = rp[prop_key]
                                props_entry += [
                                    prop_output[i] for i in range(len(prop_split))
                                ]
                            else:
                                props_entry += [rp[prop_key]]
                        props_list.append(props_entry)
                    all_props_list += props_list

        #                     props_list = [[file_idx, k, t, obj]+[getattr(rp, prop_key) for prop_key in self.props] for obj,rp in enumerate(rps)]
        #                     all_props_list+=props_list

        property_columns = [
            self.unpack_dict[prop] if prop in self.unpack_dict.keys() else [prop]
            for prop in self.props
        ]
        property_columns = [item for sublist in property_columns for item in sublist]

        if self.intensity_channel_list is not None:
            column_list = [
                "File Index",
                "File Trench Index",
                "timepoints",
                "Objectid",
                "Intensity Channel",
            ] + property_columns
            df_out = pd.DataFrame(all_props_list, columns=column_list).reset_index()
        else:
            column_list = [
                "File Index",
                "File Trench Index",
                "timepoints",
                "Objectid",
            ] + property_columns
            df_out = pd.DataFrame(all_props_list, columns=column_list).reset_index()

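        def parquet_index(row):
            # Zero-padded file/trench/timepoint/object fields; the channel code is
            # appended only when intensity channels were measured, since the
            # "Intensity Channel" column does not exist otherwise.
            idx = f"{row['File Index']:04}{row['File Trench Index']:04}{row['timepoints']:04}{row['Objectid']:02}"
            if self.intensity_channel_list is not None:
                idx += f"{self.intensity_channel_dict[row['Intensity Channel']]:02}"
            return int(idx)

        parquet_idx = df_out.apply(parquet_index, axis=1)

        df_out["File Parquet Index"] = list(parquet_idx)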
        df_out = df_out.set_index("File Parquet Index").sort_index()
        del df_out["index"]

        return df_out

    def analyze_all_files(self, dask_cont):
        df = dd.read_parquet(self.metapath)
        file_list = df["File Index"].unique().compute().tolist()
        #         kymo_meta = dd.read_parquet(self.metapath)
        #         file_list = kymo_meta["File Index"].unique().tolist()

        delayed_list = []
        for file_idx in file_list:
            df_delayed = delayed(self.get_file_regionprops)(file_idx)
            delayed_list.append(df_delayed.persist())

        ## filtering out non-failed dataframes ##
        all_delayed_futures = []
        for item in delayed_list:
            all_delayed_futures += futures_of(item)
        while any(future.status == "pending" for future in all_delayed_futures):
            sleep(0.1)

        good_delayed = []
        for item in delayed_list:
            if all([future.status == "finished" for future in futures_of(item)]):
                good_delayed.append(item)

        ## compiling output dataframe ##
        df_out = dd.from_delayed(good_delayed).persist()
        df_out["File Parquet Index"] = df_out.index
        df_out = df_out.set_index("File Parquet Index", drop=True, sorted=False)
        df_out = df_out.repartition(partition_size="25MB").persist()

        kymo_df = dd.read_parquet(self.metapath)
        kymo_df["File Merge Index"] = kymo_df["File Parquet Index"]
        kymo_df = kymo_df.set_index("File Merge Index", sorted=True)
        kymo_df = kymo_df.drop(
            ["File Index", "File Trench Index", "timepoints", "File Parquet Index"],
            axis=1,
        )

        df_out["File Merge Index"] = df_out.apply(
            lambda x: int(
                f'{x["File Index"]:04}{x["File Trench Index"]:04}{x["timepoints"]:04}'
            ),
            axis=1,
        )
        df_out = df_out.reset_index(drop=False)
        df_out = df_out.set_index("File Merge Index", sorted=True)

        df_out = df_out.join(kymo_df)
        df_out = df_out.set_index("File Parquet Index", sorted=True)

        dd.to_parquet(
            df_out,
            self.analysispath,
            engine="fastparquet",
            compression="gzip",
            write_metadata_file=True,
        )

    def export_all_data(self, n_workers=20, memory="8GB"):

        dask_cont = dask_controller(
            walltime="01:00:00",
            local=False,
            n_workers=n_workers,
            memory=memory,
            working_directory=self.headpath + "/dask",
        )
        dask_cont.startdask()
        dask_cont.displaydashboard()
        dask_cont.futures = {}

        try:
            self.analyze_all_files(dask_cont)
        finally:
            # Always release the workers, whether or not the analysis succeeded.
            dask_cont.shutdown()

In [None]:
def get_image_measurements(
    kymographpath, channels, file_idx, output_name, img_fn, *args, **kwargs
):

    df = dd.read_parquet(kymographpath + "/metadata")
    df = df.set_index("File Parquet Index", sorted=True)

    start_idx = int(str(file_idx) + "00000000")
    end_idx = int(str(file_idx) + "99999999")

    working_dfs = []

    proc_file_path = kymographpath + "/kymograph_" + str(file_idx) + ".hdf5"
    with h5py.File(proc_file_path, "r") as infile:
        working_filedf = df.loc[start_idx:end_idx].compute()
        trench_idx_list = working_filedf["File Trench Index"].unique().tolist()
        for trench_idx in trench_idx_list:
            # Copy so the new measurement columns are not written to a view.
            trench_df = working_filedf[
                working_filedf["File Trench Index"] == trench_idx
            ].copy()
            for channel in channels:
                kymo_arr = infile[channel][trench_idx]
                fn_out = [
                    img_fn(kymo_arr[i], *args, **kwargs)
                    for i in range(kymo_arr.shape[0])
                ]
                trench_df[channel + " " + output_name] = fn_out
            working_dfs.append(trench_df)

    out_df = pd.concat(working_dfs)
    return out_df


def get_all_image_measurements(
    headpath, output_path, channels, output_name, img_fn, *args, **kwargs
):
    kymographpath = headpath + "/kymograph"
    df = dd.read_parquet(kymographpath + "/metadata")

    file_list = df["File Index"].unique().compute().tolist()

    delayed_list = []
    for file_idx in file_list:
        df_delayed = delayed(get_image_measurements)(
            kymographpath, channels, file_idx, output_name, img_fn, *args, **kwargs
        )
        delayed_list.append(df_delayed.persist())

    ## filtering out non-failed dataframes ##
    all_delayed_futures = []
    for item in delayed_list:
        all_delayed_futures += futures_of(item)
    while any(future.status == "pending" for future in all_delayed_futures):
        sleep(0.1)

    good_delayed = []
    for item in delayed_list:
        if all([future.status == "finished" for future in futures_of(item)]):
            good_delayed.append(item)

    ## compiling output dataframe ##
    df_out = dd.from_delayed(good_delayed).persist()
    df_out["FOV Parquet Index"] = df_out.index
    df_out = df_out.set_index("FOV Parquet Index", drop=True, sorted=False)
    df_out = df_out.repartition(partition_size="25MB").persist()

    dd.to_parquet(
        df_out,
        output_path,
        engine="fastparquet",
        compression="gzip",
        write_metadata_file=True,
    )

In [None]:
analyzer = regionprops_extractor(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/gfp",
    "fluorsegmentation",
    intensity_channel_list=["RFP-Penta", "GFP-Penta"],
    include_background=True,
)

In [None]:
analyzer.export_all_data()

In [None]:
kymo_df = dd.read_parquet(analyzer.metapath)
kymo_df["File Merge Index"] = kymo_df["File Parquet Index"]
kymo_df = kymo_df.set_index("File Merge Index", sorted=True)
kymo_df = kymo_df.drop(["File Index", "File Trench Index", "timepoints"], axis=1)

analysis_df = dd.read_parquet(analyzer.analysispath)
analysis_df["File Merge Index"] = analysis_df.apply(
    lambda x: int(
        f'{x["File Index"]:04}{x["File Trench Index"]:04}{x["timepoints"]:04}'
    ),
    axis=1,
)
analysis_df = analysis_df.set_index("File Merge Index", sorted=True)

merged_df = analysis_df.join(kymo_df)
merged_df = merged_df.set_index("File Parquet Index", sorted=True)

In [None]:
merged_df
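
The `File Parquet Index` and `File Merge Index` values used in the cells above are integers built by concatenating zero-padded fields. Below is a minimal sketch of that encoding and how to reverse it, assuming the field widths seen in the f-strings above (4 digits each for file, trench and timepoint, 2 each for object and channel code). The helper names are hypothetical and not part of `trenchripper`.

```python
# Hypothetical helpers; field widths copied from the f-strings in the cells above.
MERGE_WIDTHS = (4, 4, 4)          # file, trench, timepoint
OBJECT_WIDTHS = (4, 4, 4, 2, 2)   # file, trench, timepoint, object, channel code


def encode_index(fields, widths):
    """Concatenate zero-padded integer fields into a single integer index."""
    if len(fields) != len(widths):
        raise ValueError("one field per width is required")
    return int("".join(f"{value:0{width}d}" for value, width in zip(fields, widths)))


def decode_index(index, widths):
    """Split an encoded index back into its fields (leading zeros are restored)."""
    digits = str(index).zfill(sum(widths))
    fields, start = [], 0
    for width in widths:
        fields.append(int(digits[start:start + width]))
        start += width
    return tuple(fields)


merge_idx = encode_index((3, 12, 7), MERGE_WIDTHS)
object_idx = encode_index((3, 12, 7, 1, 0), OBJECT_WIDTHS)
print(merge_idx, object_idx)
print(decode_index(object_idx, OBJECT_WIDTHS))
```

Because the layout is fixed-width, dropping the last four digits of an object-level index gives the matching merge index, which is why the two tables can be joined on it. A field that outgrows its width (for example, more than 9999 timepoints) would break this layout.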

In [None]:
import dask.dataframe as dd
from dask import delayed
from distributed.client import futures_of
import numpy as np
import pandas as pd
import h5py
import seaborn as sns
import scipy.signal
import skimage as sk
from time import sleep
from matplotlib import pyplot as plt

In [None]:
region_props = pd.read_pickle(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/gfp/analysis.pkl"
)  # .loc[(slice(None), slice(None), list(range(10,20)), slice(None))]
region_props = region_props.reset_index()
region_props = region_props.set_index(
    ["trenchid", "timepoints", "Intensity Channel", "Objectid"], drop=True
)
region_props = region_props.sort_index()

#### Output

At this point you may want to use your output. The output of this step is a set of `.hdf5` files stored in <br>`headpath/kymograph`. The image data stored in these files takes the form of `(K,T,Y,X)` arrays <br>where K is the trench index, T is time, and Y,X are the crop dimensions.

These arrays are accessible using keys of the form `"[Trench Row Number]/[Image Channel]"`. <br>For example, looking up phase channel data of trenches in the topmost row of an image requires <br>the key `"0/Phase"`. The metadata associated with these files is a large pandas dataframe relating <br>crops to their original FOVs, accessible using the `"kymograph"` key on `headpath/metadata.hdf5`.

To assist in accessing this file, you may use the `trenchripper.pandas_hdf5_handler` object to <br>interface with it.
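
As a rough illustration of the array layout described above, the sketch below builds a small synthetic `(K,T,Y,X)` array with NumPy and indexes it by trench and timepoint. The array contents, the sizes and the `dataset_key` helper are assumptions made for the example; real arrays would be read from the `.hdf5` files in `headpath/kymograph`.

```python
import numpy as np

# Synthetic stand-in for one kymograph dataset; the sizes are arbitrary.
K, T, Y, X = 3, 5, 8, 4
kymographs = np.arange(K * T * Y * X, dtype="uint16").reshape(K, T, Y, X)


def dataset_key(trench_row, channel):
    """Build a key of the form "[Trench Row Number]/[Image Channel]"."""
    return f"{trench_row}/{channel}"


key = dataset_key(0, "Phase")      # phase data for the topmost trench row
trench_movie = kymographs[1]       # all timepoints of trench 1, shape (T, Y, X)
single_frame = kymographs[1, 2]    # trench 1 at timepoint 2, shape (Y, X)
# Lay the timepoints side by side to view one trench as a kymograph image.
kymograph_image = np.concatenate(list(trench_movie), axis=1)  # shape (Y, T * X)

print(key, trench_movie.shape, single_frame.shape, kymograph_image.shape)
```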

## FISH Analysis

##### Start Dask Workers

Again, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2 for kymograph creation. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
headpath = "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/barcodes"

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="02:00:00",
    local=False,
    n_workers=30,
    memory="4GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

#### Get Barcode Signal (Percentile Function)

In [None]:
import numpy as np

tr.get_all_image_measurements(
    headpath,
    headpath + "/percentiles",
    ["RFP", "Cy5", "Cy7"],
    "95th Percentile",
    np.percentile,
    95,
)

In [None]:
dask_controller.daskclient.futures[
    "get_image_measurements-b6cbb727-44de-4fa5-b6b4-b6651aa69849"
].exception

#### Determine Barcodes

In [None]:
fish_test = tr.fish_analysis(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/barcodes",
    "./lDE14_final_df.tsv",
    "./lDE14_final_df.json",
    hamming_thr=0,
    channel_names=["RFP 98th Percentile", "Cy5 98th Percentile", "Cy7 98th Percentile"],
)

In [None]:
fish_test.plot_signal_threshold_inter()

In [None]:
fish_test.get_bit_thresholds()

In [None]:
fish_test.bit_threshold_list = [
    800,
    800,
    1000,
    1200,
    1200,
    1200,
    1500,
    1500,
    1200,
    1500,
    2500,
    4000,
    4000,
    3000,
    3500,
    3000,
    3500,
    3500,
    2500,
    5000,
    400,
    400,
    400,
    300,
    500,
    400,
    400,
    500,
    400,
    400,
]

In [None]:
fish_test.plot_bit_threshold_inter()

In [None]:
fish_test.output_barcode_df()

#### Import Barcode Dataframe

In [None]:
meta_handle = tr.pandas_hdf5_handler(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/barcodes/metadata.hdf5"
)
barcode_df = meta_handle.read_df("barcodes", read_metadata=True)

#### Compute Bit-wise Error (if singleton errors allowed)

In [None]:
true_barcodes = np.array(
    barcode_df["barcode"].apply(lambda x: list(barcode_to_FISH(x))).tolist()
).astype("uint8")

In [None]:
read_barcodes = np.array(
    [list(item) for item in barcode_df["Barcode"].tolist()]
).astype("uint8")

In [None]:
true_barcodes.shape

In [None]:
read_barcodes.shape

In [None]:
bit_error = (
    np.sum(np.logical_xor(true_barcodes, read_barcodes), axis=0)
    / true_barcodes.shape[0]
)

In [None]:
plt.bar(range(len(bit_error)), bit_error)
plt.show()
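
The per-bit error computed above is the fraction of trenches in which a given bit of the read barcode differs from the expected one. Here is a small worked sketch on made-up 4-bit barcodes, written in plain Python so each step is explicit. The data and the function name are illustrative only.

```python
def bitwise_error(true_barcodes, read_barcodes):
    """Per-bit mismatch fraction between two equal-length lists of bit strings."""
    if len(true_barcodes) != len(read_barcodes):
        raise ValueError("need one read barcode per true barcode")
    n_bits = len(true_barcodes[0])
    mismatches = [0] * n_bits
    for true_code, read_code in zip(true_barcodes, read_barcodes):
        for bit in range(n_bits):
            # XOR of the two bits: 1 when they disagree, 0 when they agree.
            mismatches[bit] += int(true_code[bit]) ^ int(read_code[bit])
    return [count / len(true_barcodes) for count in mismatches]


true_codes = ["1010", "1100", "0011", "0110"]
read_codes = ["1010", "1101", "0011", "1110"]
print(bitwise_error(true_codes, read_codes))  # [0.25, 0.0, 0.0, 0.25]
```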

#### Compute Call Rate

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sp
import scipy.stats  # makes sp.stats available
import sklearn as skl
import sklearn.mixture  # makes skl.mixture available
import dask.dataframe as dd

from matplotlib import pyplot as plt

ttl_true = np.sum([item == True for item in barcode_df["dark_gfp"].tolist()])
ttl_false = np.sum([item == False for item in barcode_df["dark_gfp"].tolist()])
ttl_none = np.sum([item == "Unknown" for item in barcode_df["dark_gfp"].tolist()])
ttl_called = ttl_true + ttl_false
ttl_trenches = barcode_df.metadata["Total Trenches"]
ttl_trenches_w_cells = barcode_df.metadata["Total Trenches With Cells"]
percent_called = ttl_called / ttl_trenches
percent_called_w_cells = ttl_called / ttl_trenches_w_cells

In [None]:
percent_called

In [None]:
percent_called_w_cells

#### Import GFP Regionprops Output

In [None]:
analysis_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/gfp/analysis"
)

#### Get Trenchwise GFP Signal

In [None]:
mchy_df = analysis_df[analysis_df["Intensity Channel"] == "RFP-Penta"]
mchy_groupby = mchy_df.groupby(["trenchid", "timepoints"])

gfp_df = analysis_df[analysis_df["Intensity Channel"] == "GFP-Penta"]
gfp_groupby = gfp_df.groupby(["trenchid", "timepoints"])

gfp_intensity_wo_bkd = (
    gfp_groupby.apply(
        lambda x: (
            x["mean_intensity"] - x[x["Objectid"] == 0]["mean_intensity"].iloc[0]
        ).to_dict(),
        meta=("mean_intensity", float),
    )
    .reset_index(drop=True)
    .compute()
    .to_list()
)
gfp_intensity_wo_bkd = {k: v for d in gfp_intensity_wo_bkd for k, v in d.items()}
gfp_intensity_wo_bkd = pd.DataFrame.from_dict(
    gfp_intensity_wo_bkd, orient="index", columns=["mean_intensity_wo_bkd"]
)
gfp_df = gfp_df.join(gfp_intensity_wo_bkd).persist()
del gfp_intensity_wo_bkd

mchy_intensity_wo_bkd = (
    mchy_groupby.apply(
        lambda x: (
            x["mean_intensity"] - x[x["Objectid"] == 0]["mean_intensity"].iloc[0]
        ).to_dict(),
        meta=("mean_intensity", float),
    )
    .reset_index(drop=True)
    .compute()
    .to_list()
)
mchy_intensity_wo_bkd = {k: v for d in mchy_intensity_wo_bkd for k, v in d.items()}
mchy_intensity_wo_bkd = pd.DataFrame.from_dict(
    mchy_intensity_wo_bkd, orient="index", columns=["mean_intensity_wo_bkd"]
)
mchy_df = mchy_df.join(mchy_intensity_wo_bkd).persist()
del mchy_intensity_wo_bkd

gfp_df_nobkd = gfp_df[gfp_df["Objectid"] != 0]
gfp_df_nobkd["Object Parquet Index"] = gfp_df_nobkd.apply(
    lambda x: int(
        f"{x['File Index']:04}{x['File Trench Index']:04}{x['timepoints']:04}{x['Objectid']:02}"
    ),
    axis=1,
)
gfp_df_nobkd = gfp_df_nobkd.set_index("Object Parquet Index")

mchy_df_nobkd = mchy_df[mchy_df["Objectid"] != 0]
mchy_df_nobkd["Object Parquet Index"] = mchy_df_nobkd.apply(
    lambda x: int(
        f"{x['File Index']:04}{x['File Trench Index']:04}{x['timepoints']:04}{x['Objectid']:02}"
    ),
    axis=1,
)
mchy_df_nobkd = mchy_df_nobkd.set_index("Object Parquet Index")

ratio_series = (
    gfp_df_nobkd["mean_intensity_wo_bkd"] / mchy_df_nobkd["mean_intensity_wo_bkd"]
)
gfp_df_nobkd["gfp/mchy Ratio"] = ratio_series

trenchid_groupby = gfp_df_nobkd.groupby("trenchid")
median_ratio = trenchid_groupby["gfp/mchy Ratio"].apply(np.median).compute()
median_ratio = median_ratio.sort_index()

In [None]:
plt.hist(
    median_ratio,
    range=(0, 20),
    bins=50,
    color="grey",
    label="Measured Dark GFP",
    density=True,
)
plt.xlabel("Mean Intensity Ratio", fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
# plt.savefig("./2021-03-10_lDE14_figure_2.png",dpi=300,bbox_inches="tight")
plt.show()

#### Apply GFP Signal Threshold

In [None]:
threshold = 2.0

dark_gfp = median_ratio < threshold
perc_gfp = 1.0 - (np.sum(dark_gfp) / len(median_ratio))
print(perc_gfp)

plt.hist(
    median_ratio[median_ratio < threshold],
    range=(0, 20),
    bins=50,
    color="grey",
    label="Measured Dark GFP",
    density=True,
)
plt.hist(
    median_ratio[median_ratio >= threshold],
    range=(0, 20),
    bins=50,
    color="green",
    label="Measured GFP",
    density=True,
)
plt.xlabel("Mean Intensity Ratio", fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
# plt.savefig("./2021-03-10_lDE14_figure_2.png",dpi=300,bbox_inches="tight")
plt.show()

#### Get Trench Mapping

In [None]:
gfp_kymo_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/gfp/kymograph/metadata"
)
barcode_kymo_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/barcodes/kymograph/metadata"
)

max_gfp_tpt = gfp_kymo_df.loc[:1000]["timepoints"].max().compute()
min_barcode_tpt = barcode_kymo_df.loc[:1000]["timepoints"].min().compute()

last_gfp_tpt_df = gfp_kymo_df[gfp_kymo_df["timepoints"] == max_gfp_tpt].compute()
first_barcode_tpt_df = barcode_kymo_df[
    barcode_kymo_df["timepoints"] == min_barcode_tpt
].compute()

trenchid_map = tr.get_trenchid_map(first_barcode_tpt_df, last_gfp_tpt_df)

#### Get GFP Call Error and Recovery Rate

In [None]:
barcode_df["Measured Dark GFP"] = barcode_df.apply(
    tr.map_Series, axis=1, args=(dark_gfp, trenchid_map)
)
barcode_df["Measured GFP Ratio"] = barcode_df.apply(
    tr.map_Series, axis=1, args=(median_ratio, trenchid_map)
)
called_df = barcode_df[barcode_df["Measured Dark GFP"] != "Unknown"]
ttl_correct = np.sum(called_df["dark_gfp"] == called_df["Measured Dark GFP"])
ttl_called = len(called_df)
recovery_rate = len(called_df) / len(dark_gfp)

print("Error Rate: " + str(1.0 - ttl_correct / ttl_called))
print("Recovery Rate: " + str(recovery_rate))

In [None]:
plt.hist(
    called_df[called_df["Measured Dark GFP"] == True]["Measured GFP Ratio"],
    range=(0, 20),
    bins=50,
    color="grey",
    alpha=0.7,
    label="Predicted Dark GFP",
    density=True,
)
plt.hist(
    called_df[called_df["Measured Dark GFP"] == False]["Measured GFP Ratio"],
    range=(0, 20),
    bins=50,
    color="green",
    alpha=0.7,
    label="Predicted GFP",
    density=True,
)
plt.xlabel("Median Intensity Ratio", fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
# plt.savefig("./2020-11-07_lDE11_figure_1.png",dpi=300,bbox_inches="tight")
plt.show()

#### Get Confusion Matrix

In [None]:
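def get_confusion_mat(df):
    """Return (error, FP error, FN error) as fractions of all rows in `df`.

    A "positive" is a bright GFP strain (`dark_gfp == False`). `dark_gfp` is the
    call predicted from the barcode; "Measured Dark GFP" is the measured call.
    """
    TP = np.sum((df["dark_gfp"] == False) & (df["Measured Dark GFP"] == False))
    TN = np.sum((df["dark_gfp"] == True) & (df["Measured Dark GFP"] == True))
    # FP: predicted GFP but measured dark; FN: predicted dark but measured bright.
    FP = np.sum((df["dark_gfp"] == False) & (df["Measured Dark GFP"] == True))
    FN = np.sum((df["dark_gfp"] == True) & (df["Measured Dark GFP"] == False))

    error = (FP + FN) / (TP + TN + FP + FN)
    FP_error = FP / (TP + TN + FP + FN)
    FN_error = FN / (TP + TN + FP + FN)

    return error, FP_error, FN_error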

In [None]:
error, FP_error, FN_error = get_confusion_mat(called_df)
print("Error: " + str(error))
print("FP error: " + str(FP_error))
print("FN error: " + str(FN_error))

In [None]:
import seaborn as sns

sns.set()

hamming_filters = list(range(1, 5))
hamming_n_barcodes = []
hamming_errors = []
for i in hamming_filters:
    filtered_df = called_df[called_df["Closest Hamming Distance"] >= i]
    n_barcode = len(filtered_df)
    error, FP_error, FN_error = get_confusion_mat(filtered_df)
    error = np.round(100 * error, decimals=2)
    hamming_errors.append(error)
    hamming_n_barcodes.append(n_barcode)

sns.lineplot(x=hamming_filters, y=hamming_errors, linewidth=5)
plt.xticks(fontsize=20)
plt.yticks(
    fontsize=20,
)
plt.show()

sns.lineplot(x=hamming_filters, y=hamming_n_barcodes, linewidth=5)
plt.xticks(fontsize=20)
plt.yticks(
    fontsize=20,
)
plt.show()

In [None]:
def bootstrap_subsample(df, value, subsample_n, bootstrap_n=500):
    """Percentile of `value` among the errors (in %) of random subsamples of `df`.

    Each of the `bootstrap_n` draws takes `subsample_n` rows without replacement.
    """
    bootstrap_errors = []
    for _ in range(bootstrap_n):
        sub_df = df.sample(n=subsample_n)
        error, FP_error, FN_error = get_confusion_mat(sub_df)
        error = np.round(100 * error, decimals=2)
        bootstrap_errors.append(error)
    percentile = sp.stats.percentileofscore(bootstrap_errors, value)

    return percentile

In [None]:
bootstrap_subsample(
    called_df, hamming_errors[-1], hamming_n_barcodes[-1], bootstrap_n=500
)
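
The check above asks whether the error of the Hamming-filtered subset is unusually low compared with random subsets of the same size. Below is a self-contained sketch of that idea on made-up data, using only the standard library. The function name and the data are hypothetical. The percentile used here (share of draws at or below the observed value) is a simple stand-in and may treat ties differently from `scipy.stats.percentileofscore`.

```python
import random


def subsample_percentile(is_error, observed_error, subsample_n, n_draws=500, seed=0):
    """Percentile of observed_error among error rates of random subsamples.

    is_error: list of booleans, True where a call was wrong.
    Subsamples are drawn without replacement.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        sample = rng.sample(is_error, subsample_n)
        draws.append(sum(sample) / subsample_n)
    # Share of subsample error rates that are at or below the observed value.
    at_or_below = sum(rate <= observed_error for rate in draws)
    return 100.0 * at_or_below / n_draws


# Made-up data: 1000 calls with a 10% error rate overall.
calls = [True] * 100 + [False] * 900
print(subsample_percentile(calls, observed_error=0.02, subsample_n=200))
```

A low percentile means random subsets of that size rarely reach an error as low as the observed one, which suggests the filter itself, and not just the smaller sample, is responsible for the improvement.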

In [None]:
def get_gmm_params(values):
    gmm = skl.mixture.GaussianMixture(n_components=2, n_init=10)
    gmm.fit(values.reshape(-1, 1))
    #     probs = gmm.predict_proba(values.reshape(-1,1))
    return gmm.means_[:, 0], ((gmm.covariances_) ** (1 / 2))[:, 0, 0]

In [None]:
import seaborn as sns

sns.set()

n_std = np.linspace(0, 2, 20)
n_barcodes = []
errors = []
FP_errors = []
FN_errors = []

means, stds = get_gmm_params(called_df["Measured GFP Ratio"].values)

if means[0] > means[1]:
    means = means[::-1]
    stds = stds[::-1]

for i in n_std:
    upper_bound = means + stds * i
    lower_bound = means - stds * i

    #     valid_dark = (called_df_barcodes["Measured Median GFP"] < upper_bound[0]) &\
    #     (called_df_barcodes["Measured Median GFP"] > lower_bound[0])
    #     valid_gfp = (called_df_barcodes["Measured Median GFP"] < upper_bound[1]) &\
    #     (called_df_barcodes["Measured Median GFP"] > lower_bound[1])
    #     valid = valid_dark|valid_gfp
    valid_dark = called_df["Measured GFP Ratio"] < upper_bound[0]
    valid_gfp = called_df["Measured GFP Ratio"] > lower_bound[1]
    valid = valid_dark | valid_gfp

    filtered_df = called_df[valid]
    n_barcode = len(filtered_df)
    error, FP_error, FN_error = get_confusion_mat(filtered_df)
    error = np.round(100 * error, decimals=2)
    FP_error = np.round(100 * FP_error, decimals=2)
    FN_error = np.round(100 * FN_error, decimals=2)
    errors.append(error)
    FP_errors.append(FP_error)
    FN_errors.append(FN_error)
    n_barcodes.append(n_barcode)

sns.lineplot(x=n_std, y=errors, linewidth=5, label="Error")
sns.lineplot(x=n_std, y=FP_errors, linewidth=5, label="FP Error")
sns.lineplot(x=n_std, y=FN_errors, linewidth=5, label="FN Error")
plt.xticks(fontsize=20)
plt.yticks(
    fontsize=20,
)
plt.legend()
plt.show()

sns.lineplot(x=n_std, y=n_barcodes, linewidth=5)
plt.xticks(fontsize=20)
plt.yticks(
    fontsize=20,
)
plt.show()

#### Sources of Error

There are roughly twice as many false negatives (predicted to be Dark GFP, but measured as bright) as false positives (predicted to be GFP, but measured as dark).

Some theories for these error classes:

False Positives:

- Mutations in the promoter (should be constant within barcodes)
- Strain variation (should be lower when averaging among strains)
- Misread of barcodes

False Negatives:

- Bleed from adjacent cells (should be corrected by averaging among strains)
- Multiple strains per trench (?)
- Misread of barcodes
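
To make the convention concrete: a "positive" here is a bright (GFP) strain, so a false positive is a barcode predicted to be GFP but measured as dark, and a false negative is one predicted to be dark but measured as bright. The toy sketch below counts each class on made-up calls. The data and the function name are illustrative and not taken from the dataset.

```python
def confusion_counts(predicted_dark, measured_dark):
    """Count TP/TN/FP/FN where 'positive' means bright GFP (dark flag is False)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for predicted, measured in zip(predicted_dark, measured_dark):
        if not predicted and not measured:
            counts["TP"] += 1   # predicted GFP, measured GFP
        elif predicted and measured:
            counts["TN"] += 1   # predicted dark, measured dark
        elif not predicted and measured:
            counts["FP"] += 1   # predicted GFP, measured dark
        else:
            counts["FN"] += 1   # predicted dark, measured bright
    return counts


# Made-up example: 10 trenches.
predicted = [False, False, False, False, True, True, True, True, True, False]
measured = [False, False, False, True, True, True, False, False, True, False]
counts = confusion_counts(predicted, measured)
total = sum(counts.values())
print(counts)
print("FP error:", counts["FP"] / total, "FN error:", counts["FN"] / total)
```

In this toy case there is one false positive and two false negatives, the same 2:1 imbalance described above.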

#### Median GFP Approach

In [None]:
median_gfp_df = called_df.groupby("Barcode")["Measured GFP Ratio"].median()

In [None]:
plt.hist(
    median_gfp_df[median_gfp_df < threshold],
    range=(0, 20),
    bins=50,
    color="grey",
    label="Measured Dark GFP",
)
plt.hist(
    median_gfp_df[median_gfp_df >= threshold],
    range=(0, 20),
    bins=50,
    color="green",
    label="Measured GFP",
)
plt.xlabel("Mean Intensity Ratio", fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
# plt.savefig("./2020-10-10_lDE11_figure_2.png",dpi=300,bbox_inches="tight")
plt.show()

In [None]:
called_df_barcodes = called_df.set_index(["Barcode"]).sort_index()
called_df_barcodes["Measured Median GFP"] = median_gfp_df
# Keep one representative row per barcode.
called_df_barcodes = called_df_barcodes.groupby("Barcode").apply(lambda x: x.iloc[0])

In [None]:
called_df_barcodes

In [None]:
ttl_correct = np.sum(
    called_df_barcodes["dark_gfp"]
    == (called_df_barcodes["Measured Median GFP"] < threshold)
)
ttl_called = len(called_df_barcodes)
print("Percent Correct:" + str(ttl_correct / ttl_called))