# Introduction

This notebook contains the entire `TrenchRipper` pipeline, divided into simple steps. This pipeline is ideal for Mother <br>Machine image data where cells possess fluorescent segmentation markers. Segmentation on phase or brightfield data <br>is being developed, but is still an experimental feature.

The steps in this pipeline are as follows:
1. Extracting your Mother Machine data (.nd2) into hdf5 format
2. Identifying and cropping individual trenches into kymographs
3. Segmenting cells with a fluorescent marker
4. Determining lineages and object properties

In each step, the user will dynamically specify parameters using a series of interactive diagnostics on their dataset. <br>Following this, a parameter file will be written to disk and then used to deploy a parallel computation on the <br>dataset, either locally or on a SLURM cluster.


This is intended as an end-to-end solution to analyzing Mother Machine data. As such, **it is not trivial to plug data <br>directly into intermediate steps**, as it will lack the correct formatting and associated metadata. A notable <br>exception to this is using another program to segment data. The library references binary segmentation masks using <br>only metadata derived from their associated kymographs. As such, it is possible to generate segmentations on these <br>kymographs elsewhere and place them into the segmentation data path to have `TrenchRipper` act on those <br>segmentations instead. More on this in the segmentation section...

#### Imports

Run this section to import all relevant packages and libraries used in this notebook. You must run this every time you open a new Python kernel.

In [None]:
import paulssonlab.deaton.trenchripper.trenchripper as tr

import warnings

warnings.filterwarnings(action="once")

import matplotlib

matplotlib.rcParams["figure.figsize"] = [20, 10]

# Part 1: mVenus

#### Specify Paths

Begin by defining the directory in which all processing will be done, as well as the initial nd2 file we will be <br>processing. This line should be run every time you open a new Python kernel.

The format should be: `headpath = "/path/to/folder"` and `nd2file = "/path/to/file.nd2"`

For example:
```
headpath = "/n/scratch2/de64/2019-05-31_validation_data"
nd2file = "/n/scratch2/de64/2019-05-31_validation_data/Main_Experiment.nd2"
```

Ideally, these files should be placed in a storage location with relatively fast I/O.

In [None]:
headpath = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/"
# hdf5inputpath = "/home/de64/scratch/de64/sync_folder/2021-01-28_lDE14/run/"
nd2file = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/run_100ms_mchry100_yfp50.nd2"

## Extract to hdf5 files

In this section, we will be extracting our image data. Currently this notebook only supports `.nd2` format; however, <br>there are `.tiff` extractors in the TrenchRipper source files that will be added to `Master.ipynb` soon.

In the abstract, this step will take a single `.nd2` file and split it into a set of `.hdf5` files stored in <br>`headpath/hdf5`. Splitting the file up in this way will facilitate quick processing in later steps. Each field of <br>view will be split into one or more `.hdf5` files, depending on the number of images per file requested (more on <br>this later). 

To keep track of which output files correspond to which FOVs, as well as to keep track of experiment metadata, the <br>extractor also outputs a `metadata.hdf5` file in the `headpath` folder. The data from this step is accessible in <br>that `metadata.hdf5` file under the `global` key. If you would like to look at this metadata, you may use the <br>`tr.utils.pandas_hdf5_handler` to read from this file. Later steps will add additional metadata under different <br>keys into the `metadata.hdf5` file.

#### Start Dask Workers

First, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=20,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please register <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

In [None]:
dask_controller.shutdown()

##### Perform Extraction

Now that we have our cluster scheduler spun up, it is time to convert files. This will be handled by the <br>`hdf5_extractor` object. This extractor will pull up each FOV and split it such that each derived `.hdf5` file <br>contains, at maximum, N timepoints of that FOV per file. The image data stored in these files takes the <br>form of `(N,Y,X)` arrays that are accessible using the desired channel name as a key. 

The arguments for this extractor are:

 - **nd2file** : The filepath to the `.nd2` file you intend to extract.
 
 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **tpts_per_file** : The maximum number of timepoints stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **ignore_fovmetadata** : Used when `.nd2` data is corrupted and does not possess records for stage positions or <br>timepoints. Only set `True` if the extractor throws errors on metadata handling.

 - **nd2reader_override** : Overrides values in metadata recovered using the `nd2reader`. Currently set to <br>`{"z_levels":[],"z_coordinates":[]}` by default to correct a known issue where z coordinates are mistakenly <br>interpreted as a z stack. See the [nd2reader](https://rbnvrw.github.io/nd2reader/) documentation for more info.
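As a rough illustration of how `tpts_per_file` bounds the size of each output file, the sketch below splits the timepoints of a single FOV into consecutive chunks. The helper `split_timepoints` is hypothetical and is not part of TrenchRipper; it only mirrors the behaviour described above.

```python
def split_timepoints(n_timepoints, tpts_per_file):
    """Return (start, stop) index pairs, one per output file (hypothetical helper)."""
    if tpts_per_file < 1:
        raise ValueError("tpts_per_file must be at least 1")
    return [
        (start, min(start + tpts_per_file, n_timepoints))
        for start in range(0, n_timepoints, tpts_per_file)
    ]


# A FOV with 120 timepoints and tpts_per_file=50 gives three files;
# the last one holds the 20 leftover timepoints.
chunks = split_timepoints(120, 50)
print(chunks)
```

Smaller values produce more, smaller files, which tends to suit parallel processing at the cost of more file handles.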

In [None]:
# hdf5_extractor = tr.marlin_extractor(hdf5inputpath, headpath, metaparsestr='metadata_{timepoint:d}.hdf5')

In [None]:
hdf5_extractor = tr.ndextract.hdf5_fov_extractor(
    nd2file,
    headpath,
    tpts_per_file=50,
    ignore_fovmetadata=False,
    nd2reader_override={"z_levels": [], "z_coordinates": []},
)

In [None]:
# hdf5_extractor = tr.ndextract.tiff_extractor(
#     tiffpath,
#     headpath,
#     ["Phase","YFP"],tpts_per_file=50
# )

##### Extraction Parameters

Here, you may set the time interval you want to extract. Useful for cropping data to the period exhibiting the dynamics of interest.

Optionally take notes to add to the `metadata.hdf5` file. Notes may also be taken directly in this notebook.

In [None]:
hdf5_extractor.inter_set_params()

##### Begin Extraction 

Running the following line will start the extraction process. This may be monitored by examining the `Dask Dashboard` <br> under the link displayed earlier. Once the computation is complete, move to the next line.

This step may take a long time, though it is possible to speed it up using additional workers.

In [None]:
hdf5_extractor.extract(dask_controller)

##### Shutdown Dask

Once extraction is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## Kymographs

Now that you have extracted your data into a series of `.hdf5` files, we will identify and crop <br>the individual trenches/growth channels present in the images. This algorithm assumes that your growth trenches <br>are vertically aligned and that they alternate in their orientation from top to bottom. See the example image for the <br>correct geometry:

![example_image](./resources/example_image.jpg)

The output of this step will be a set of `.hdf5` files stored in `headpath/kymograph`. The image data stored in these <br>files takes the form of `(K,T,Y,X)` arrays where K is the trench index, T is time, and Y,X are the crop dimensions. <br>These arrays are accessible using keys of the form `"[Image Channel]"`. For example, looking up phase channel <br>data of trenches in the topmost row of an image will require the key `"Phase"`.


### Test Parameters



##### Initialize the interactive kymograph class

As a first step, initialize the `tr.interactive.kymograph_interactive` class that will help us choose the <br>parameters we will use to generate kymographs. 

In [None]:
interactive_kymograph = tr.kymograph_interactive(headpath)

In [None]:
viewer = tr.hdf5_viewer(headpath, persist_data=False)

##### Examine Images

Here you can manually inspect images before beginning parameter tuning.

In [None]:
viewer.view(width=1200)

You will now want to select a few test FOVs to try out parameters on, the channel you want to detect trenches on, and <br>the time interval on which you will perform your processing.

The arguments for this step are:

- **seg_channel (string)** : The channel name that you would like to segment on.

- **invert (list)** : Whether or not you want to invert the image before detecting trenches. By default, it is assumed that <br>the trenches have a high pixel intensity relative to the background. This should be the case for Phase Contrast and <br>Fluorescence Imaging, but may not be the case for Brightfield Imaging, in which case you will want to invert the image.

- **fov_list (list)** : List of integers corresponding to the FOVs that you wish to make test kymographs of.

- **t_subsample_step (int)** : Step size to be used for subsampling input files in time. We recommend choosing a step that leaves <br>between 5 and 10 timepoints for quick processing.
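One way to pick `t_subsample_step` so that roughly 5 to 10 timepoints remain is sketched below. The helper `suggest_subsample_step` is hypothetical (it is not a TrenchRipper function) and assumes that subsampling simply keeps every `step`-th timepoint starting from the first.

```python
import math


def suggest_subsample_step(n_timepoints, max_timepoints=10):
    """Smallest step that keeps at most `max_timepoints` timepoints (hypothetical helper)."""
    return max(1, math.ceil(n_timepoints / max_timepoints))


# For an experiment with 145 timepoints, keep every 15th frame.
step = suggest_subsample_step(145)
kept = list(range(0, 145, step))
print(step, len(kept))
```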

Hit the "Run Interact" button to lock in your parameters. The button will become transparent briefly and become solid again <br>when processing is complete. After that has occurred, move on to the next step. 

In [None]:
interactive_kymograph.import_hdf5_interactive()

##### Tune "trench-row" detection hyperparameters

The kymograph code begins by detecting the positions of trench rows in the image as follows:

1. Reduce each 2D image to a 1D signal along the y-axis by computing the qth percentile of the data along the x-axis.
2. Smooth this signal using a median kernel.
3. Normalize the signal by linearly scaling its minimum and maximum to 0 and 1, respectively.
4. Use a set threshold to determine the trench row positions.

The arguments for this step are:

 - **y_percentile (int)** : Percentile to use for step 1.

 - **smoothing_kernel_y_dim_0 (int)** : Median kernel size to use for step 2.

 - **y_percentile_threshold (float)** : Threshold to use in step 4.

Running the following widget will display the smoothed 1-D signal for each of your timepoints. In addition, the threshold <br>value for each fov will be displayed as a red line.
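The four detection steps listed above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the TrenchRipper implementation: the function names are hypothetical, the image is a list of rows, percentiles use linear interpolation, and the median window is simply truncated at the signal edges.

```python
from statistics import median


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def median_smooth(signal, kernel):
    """Sliding median; the window is truncated at the edges of the signal."""
    half = kernel // 2
    return [median(signal[max(0, i - half): i + half + 1]) for i in range(len(signal))]


def detect_row_mask(image, y_percentile, smoothing_kernel, threshold):
    profile = [percentile(row, y_percentile) for row in image]  # step 1
    smoothed = median_smooth(profile, smoothing_kernel)  # step 2
    lo, hi = min(smoothed), max(smoothed)
    span = (hi - lo) or 1  # avoid dividing by zero on a flat profile
    normalized = [(v - lo) / span for v in smoothed]  # step 3
    return [v > threshold for v in normalized]  # step 4


# Toy image: 8 rows of 4 pixels, with a bright band in rows 2 to 4.
image = [[10] * 4] * 2 + [[100] * 4] * 3 + [[10] * 4] * 3
mask = detect_row_mask(image, y_percentile=85, smoothing_kernel=3, threshold=0.5)
print(mask)
```

Rows where the mask is `True` are candidate trench rows; lowering the threshold widens the detected band.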

In [None]:
interactive_kymograph.preview_y_precentiles_interactive()

In [None]:
interactive_kymograph.preview_y_precentiles_consensus_interactive()

##### Tune "trench-row" cropping hyperparameters

Next, we will use the detected rows to perform cropping of the input image in the y-dimension:

1. Determine edges of trench rows based on threshold mask.
2. Filter out rows that are too small.
3. Use the remaining rows to compute the drift in y in each image.
4. Apply the drift to the initially detected rows to get rows in all timepoints.
5. Perform cropping using the "end" of the row as reference (the end referring to the part of the trench farthest from <br>the feeding channel).
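Steps 1 and 2 above (finding row edges from the threshold mask and discarding short rows) can be sketched as below. The function `find_rows` is a hypothetical helper; it assumes the mask is a 1D sequence of booleans along y and that `y_min_edge_dist` is a minimum length in pixels.

```python
def find_rows(mask, y_min_edge_dist):
    """Return (start, stop) spans of consecutive True values that are long enough."""
    spans, start = [], None
    for y, value in enumerate(list(mask) + [False]):  # sentinel closes a trailing run
        if value and start is None:
            start = y
        elif not value and start is not None:
            if y - start >= y_min_edge_dist:
                spans.append((start, y))
            start = None
    return spans


mask = [False, True, False, True, True, True, True, False, True, True]
rows = find_rows(mask, y_min_edge_dist=3)
print(rows)  # only the four-pixel run survives the length filter
```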

Step 5 performs a simple algorithm to determine the orientation of each trench:

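```
DOWNWARD, UPWARD = 0, 1  # orientation codes used by the 'Orientation' arguments


def get_row_orientations(n_detected, n_expected, orientation, orientation_on_fail):
    """Illustrative version of the orientation logic; argument names are placeholders."""
    if n_detected == n_expected:
        row_orientations = [orientation]  # 'Orientation'
    elif n_detected < n_expected:
        row_orientations = [orientation_on_fail]  # 'Orientation when < expected rows'
    else:
        raise ValueError("more rows detected than expected")  # case not covered here
    # Each remaining row faces the opposite way from the row above it.
    for _ in range(n_detected - 1):
        if row_orientations[-1] == DOWNWARD:
            row_orientations.append(UPWARD)
        else:
            row_orientations.append(DOWNWARD)
    return row_orientations
```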

Additionally, if the device trenches face a single direction, alternation of row orientation may be turned off by setting the<br> `Alternate Orientation?` argument to False. The `Use Median Drift?` argument, when set to True, will use the<br> median drift in y across all FOVs for drift correction, instead of doing drift correction independently for all FOVs. <br>This can be useful if a large fraction of FOVs are failing drift correction. Note that `Use Median Drift?` <br>sets this behavior for both y and x drift correction.
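To see why the median drift is robust to a few failing FOVs, consider the sketch below. It assumes drift is stored as one list of per-timepoint values per FOV; the data layout and the helper `median_drift` are hypothetical and chosen only for illustration.

```python
from statistics import median


def median_drift(drift_by_fov):
    """Per-timepoint median of the drift measured in each FOV."""
    per_timepoint = zip(*drift_by_fov.values())
    return [median(values) for values in per_timepoint]


drift_by_fov = {
    0: [0, 1, 2, 3],
    1: [0, 1, 3, 4],
    2: [0, 40, -25, 60],  # a FOV where drift detection failed
}
consensus = median_drift(drift_by_fov)
print(consensus)
```

The outlying FOV barely affects the consensus, whereas a mean would be pulled far from the drift seen in the well-behaved FOVs.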

The arguments for this step are:

 - **y_min_edge_dist (int)** : Minimum row length necessary for detection (filters out small detected objects).

 - **padding_y (int)** : Padding to add to the end of trench row when cropping in the y-dimension.

 - **trench_len_y (int)** : Length from the end of each trench row to the feeding channel side of the crop.

 - **Number of Rows (int)** : The number of rows to expect in your image. For instance, two in the example image.
 
 - **Alternate Orientation? (bool)** : Whether or not to alternate the orientation of consecutive rows.

 - **Orientation (int)** : The orientation of the top-most row where 0 corresponds to a trench with a downward-oriented trench <br>opening and 1 corresponds to a trench with an upward-oriented trench opening.

 - **Orientation when < expected rows(int)** : The orientation of the top-most row when the number of detected rows is less than <br>expected. Useful if your trenches drift out of your image in some FOVs.
 
 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.

 - **images_per_row(int)** : How many images to output per row for this widget.

Running the following widget will display y-cropped images for each fov and timepoint.

In [None]:
interactive_kymograph.preview_y_crop_interactive()

##### Tune trench detection hyperparameters

Next, we will detect the positions of trenches in the y-cropped images as follows:

1. Reduce each 2D image to a 1D signal along the x-axis by computing the qth percentile of the data along the y-axis.
2. Determine the signal background by smoothing this signal using a large median kernel.
3. Subtract the background signal.
4. Smooth the resultant signal using a median kernel.
5. Use an [Otsu threshold](https://imagej.net/Auto_Threshold#Otsu) to determine the trench midpoint positions.
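Steps 2 through 5 can be sketched as follows, starting from a 1D percentile profile along x. This is an illustrative version only: the function names are hypothetical, the Otsu threshold is computed by exhaustive search over the observed values rather than from a histogram, and midpoints are taken as the centres of the above-threshold runs.

```python
from statistics import mean, median


def median_filter(signal, kernel):
    """Sliding median; the window is truncated at the edges of the signal."""
    half = kernel // 2
    return [median(signal[max(0, i - half): i + half + 1]) for i in range(len(signal))]


def otsu_threshold(values):
    """Exhaustive Otsu: the split that maximizes between-class variance."""
    best_t, best_var = min(values), -1.0
    for t in sorted(set(values))[:-1]:
        low = [v for v in values if v <= t]
        high = [v for v in values if v > t]
        w_low, w_high = len(low) / len(values), len(high) / len(values)
        var = w_low * w_high * (mean(low) - mean(high)) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def trench_mask(profile, background_kernel_x, smoothing_kernel_x, otsu_scaling=1.0):
    background = median_filter(profile, background_kernel_x)  # step 2
    subtracted = [p - b for p, b in zip(profile, background)]  # step 3
    smoothed = median_filter(subtracted, smoothing_kernel_x)  # step 4
    threshold = otsu_threshold(smoothed) * otsu_scaling  # step 5
    return [v > threshold for v in smoothed]


def run_midpoints(mask):
    """Centre of each run of consecutive True values."""
    midpoints, start = [], None
    for x, value in enumerate(list(mask) + [False]):  # sentinel closes a trailing run
        if value and start is None:
            start = x
        elif not value and start is not None:
            midpoints.append((start + x - 1) / 2)
            start = None
    return midpoints


# Toy profile: three 3-pixel-wide trenches on a flat background.
profile = [10, 10, 50, 50, 50, 10, 10, 10] * 3
mask = trench_mask(profile, background_kernel_x=15, smoothing_kernel_x=3)
midpoints = run_midpoints(mask)
print(midpoints)
```

The background kernel needs to be wide relative to a trench so that the median tracks the background rather than the trenches themselves.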

After this, x-dimension drift correction of our detected midpoints will be performed as follows:

6. Begin at t=1.
7. For each $m \in \{midpoints(t)\}$, assign to $m$ the midpoint $n \in \{midpoints(t-1)\}$ that is closest to $m$.<br>
Midpoints at time $t-1$ that are not the closest midpoint to any $m$ will not be mapped.
8. Compute the translation of each midpoint from time $t-1$ to $t$.
9. Take the average of this value as the x-dimension drift from time $t-1$ to $t$.
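The drift estimate in steps 6 through 9 can be sketched as a nearest-neighbour match between consecutive timepoints. The function `x_drift` and the list-of-lists layout for midpoints are assumptions made for this example, not the library's data structures.

```python
from itertools import accumulate
from statistics import mean


def x_drift(prev_midpoints, curr_midpoints):
    """Mean displacement between each current midpoint and its nearest predecessor."""
    shifts = []
    for m in curr_midpoints:
        nearest = min(prev_midpoints, key=lambda n: abs(n - m))  # step 7
        shifts.append(m - nearest)  # step 8
    return mean(shifts)  # step 9


midpoints = [
    [20, 60, 100],  # t = 0
    [22, 62, 102],  # t = 1: everything moved right by 2 pixels
    [21, 61, 101],  # t = 2: moved back left by 1 pixel
]
drifts = [x_drift(midpoints[t - 1], midpoints[t]) for t in range(1, len(midpoints))]
cumulative = list(accumulate(drifts))  # drift relative to t = 0
print(drifts, cumulative)
```

Nearest-neighbour matching only works when the frame-to-frame drift is small compared with the trench spacing, which is why sparse or discontinuous midpoints degrade the correction.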

The arguments for this step are:

 - **t (int)** : Timepoint to examine the percentiles and threshold in.

 - **x_percentile (int)** : Percentile to use for step 1.

 - **background_kernel_x (int)** : Median kernel size to use for step 2.

 - **smoothing_kernel_x (int)** : Median kernel size to use for step 4.

 - **otsu_scaling (float)** : Scaling factor to apply to the threshold determined by Otsu's method.

Running the following widget will display the smoothed 1-D signal for each of your timepoints, with the threshold <br>value for each fov displayed as a red line. It will also display the detected midpoints for each of your timepoints. <br>If there is too much sparsity or discontinuity, your drift correction will not be accurate.

In [None]:
interactive_kymograph.preview_x_percentiles_interactive()

##### Tune trench cropping hyperparameters

Trench cropping simply uses the drift-corrected midpoints as a reference and crops out some fixed length around them <br>
to produce an output kymograph. **Note that the current implementation does not allow trench crops to overlap**. If your<br>
trench crops do overlap, the error will not be caught here, but will cause issues later in the pipeline. As such, try <br>
to crop your trenches as closely as possible. This issue will be fixed in a later update.
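Since overlapping crops are not caught at this stage, a quick manual check can be useful. The sketch below assumes each crop spans `trench_width_x` pixels centred on its midpoint, so two crops overlap when their midpoints are closer than `trench_width_x`; the helper `overlapping_crops` is hypothetical.

```python
def overlapping_crops(midpoints, trench_width_x):
    """Return pairs of neighbouring midpoints whose crops would overlap."""
    ordered = sorted(midpoints)
    return [
        (left, right)
        for left, right in zip(ordered, ordered[1:])
        if right - left < trench_width_x
    ]


midpoints = [20, 60, 75, 120]
clashes = overlapping_crops(midpoints, trench_width_x=30)
print(clashes)  # the trenches at 60 and 75 are only 15 pixels apart
```

An empty result suggests the chosen width is safe for those midpoints; otherwise reduce `trench_width_x`.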

The arguments for this step are:

 - **trench_width_x (int)** : Trench width to use for cropping.

 - **trench_present_thr (float)** : Trenches that appear in less than this percent of FOVs will be eliminated from the dataset.<br>
If not removed, missing positions will be inferred from the image drift.

 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.


Running the following widget will display a random kymograph for each row in each fov and will also produce midpoint plots <br>showing retained midpoints.

In [None]:
interactive_kymograph.preview_kymographs_interactive()

##### Export and save hyperparameters

Run the following line to register and display the parameters you have selected for kymograph creation.

In [None]:
interactive_kymograph.process_results()

If you are satisfied with the above parameters, run the following line to write these parameters to disk at `headpath/kymograph.par`.<br>
This file will be used to perform kymograph creation in the next section.

In [None]:
interactive_kymograph.write_param_file()

### Generate Kymograph

##### Start Dask Workers

Again, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2 for kymograph creation. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=50,
    memory="8GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please register <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

##### Perform Kymograph Cropping

Now that we have our cluster scheduler spun up, we will extract kymographs using the parameters stored in `headpath/kymograph.par`. <br>
This will be handled by the `kymograph_cluster` object. This will detect trenches in all of the files present in `headpath/hdf5` that <br>
you created in the first step. It will then crop these trenches and place the crops in a series of `.hdf5` files in `headpath/kymograph`. <br>
These files will store image data in the form of `(K,T,Y,X)` arrays where K is the trench index, T is time and Y,X are the image dimensions <br>
of the crop.

The arguments for this step are:

 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **trenches_per_file** : The maximum number of trenches stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **paramfile** : Set to `True` if you want to use parameters from `headpath/kymograph.par`. Otherwise, you will have to specify <br>
 parameters as direct arguments to `kymograph_cluster`.

In [None]:
kymoclust = tr.kymograph.kymograph_cluster(
    headpath=headpath, trenches_per_file=50, paramfile=True
)

##### Begin Kymograph Cropping 

Running the following line will start the cropping process. This may be monitored by examining the `Dask Dashboard` <br>
under the link displayed earlier. Once the computation is complete, move to the next line.

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.generate_kymographs(dask_controller)

In [None]:
ff = tr.focus_filter(headpath)

In [None]:
ff.choose_filter_channel_inter()

In [None]:
ff.plot_histograms()

In [None]:
ff.plot_focus_threshold_inter()

In [None]:
ff.write_param_file()

##### Post-process Images

After the above step, kymographs will have been created for each `.hdf5` input file. They will now need to be reorganized <br>
into a new set of files such that each file has, at most, `trenches_per_file` trenches in each file.
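The regrouping can be pictured as flattening all trenches into one list and cutting it into chunks of at most `trenches_per_file`. The sketch below is a simplified model of that idea, not the library's code; `regroup_trenches` and the `(input file, trench)` tuples are hypothetical.

```python
def regroup_trenches(trenches_per_input, trenches_per_file):
    """Group (input file, trench) pairs into output files of bounded size."""
    all_trenches = [
        (input_idx, trench_idx)
        for input_idx, n_trenches in enumerate(trenches_per_input)
        for trench_idx in range(n_trenches)
    ]
    return [
        all_trenches[start:start + trenches_per_file]
        for start in range(0, len(all_trenches), trenches_per_file)
    ]


# Three input files holding 3, 2 and 4 trenches, regrouped 4 trenches per file.
output_files = regroup_trenches([3, 2, 4], trenches_per_file=4)
for file_idx, contents in enumerate(output_files):
    print(file_idx, contents)
```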

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.post_process(dask_controller)

##### Check kymograph statistics

Run the next line to display some statistics from kymograph creation. The outputs are:

 - **fovs processed** : The number of FOVs successfully processed out of the total number of FOVs
 - **rows processed** : The number of rows of trenches processed out of the total number of rows
 - **trenches processed** : The number of trenches successfully processed
 - **row/fov** : The average number of rows successfully processed per FOV
 - **trenches/fov** : The average number of trenches successfully processed per FOV
 - **failed fovs** : A list of failed FOVs. Spot check these FOVs in the viewer to determine potential problems

In [None]:
kymoclust.kymo_report()

##### Shutdown Dask

Once cropping is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## Fluorescence Segmentation

Now that you have cropped your data into kymographs, we will perform segmentation/cell detection <br>
on your kymographs. Currently, this pipeline only supports segmentation of fluorescence images; however, <br>
segmentation of transmitted light imaging techniques is in development.

The output of this step will be a set of `segmentation_[File #].hdf5` files stored in `headpath/fluorsegmentation`.<br>
The image data stored in these files takes the exact same form as the kymograph data, `(K,T,Y,X)` arrays <br>
where K is the trench index, T is time, and Y,X are the crop dimensions. These arrays are accessible using <br>
keys of the form `"[Trench Row Number]"`.

Since no metadata is generated by this step, it is possible to use another segmentation algorithm on the kymograph <br>
data. The output of segmentation must be split into `segmentation_[File #].hdf5` files, where `[File #]` agrees with the<br>
corresponding `kymograph_[File #].hdf5` file. Additionally, the `(K,T,Y,X)` arrays must be of the same shape as the <br>
kymograph arrays and accessible at the corresponding `"[Trench Row Number]"` key. These files must be placed into <br>
their own folder at `headpath/foldername`. This folder may then be used in later steps.
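When writing segmentations from another tool, the file naming and shape requirements above can be checked with a small helper such as the one sketched here. The function names and example paths are hypothetical; only the `kymograph_[File #].hdf5` and `segmentation_[File #].hdf5` naming patterns come from the description above.

```python
import re
from pathlib import PurePosixPath


def segmentation_path_for(kymograph_file, seg_folder):
    """Build the segmentation file path matching a kymograph file (hypothetical helper)."""
    name = PurePosixPath(kymograph_file).name
    match = re.fullmatch(r"kymograph_(\d+)\.hdf5", name)
    if match is None:
        raise ValueError(f"not a kymograph file: {kymograph_file}")
    return PurePosixPath(seg_folder) / f"segmentation_{match.group(1)}.hdf5"


def shapes_match(kymograph_shape, segmentation_shape):
    """Both arrays must share the same (K, T, Y, X) shape."""
    return tuple(kymograph_shape) == tuple(segmentation_shape)


seg_path = segmentation_path_for(
    "/path/to/folder/kymograph/kymograph_12.hdf5",  # placeholder paths
    "/path/to/folder/foldername",
)
print(seg_path)
print(shapes_match((50, 100, 160, 20), (50, 100, 160, 20)))
```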

### Test Parameters

##### Initialize the interactive segmentation class

As a first step, initialize the `tr.fluo_segmentation_interactive` class that will be handling all steps of generating a segmentation. 

In [None]:
interactive_segmentation = tr.fluo_segmentation_interactive(headpath)

##### Choose channel to segment on

In [None]:
interactive_segmentation.choose_seg_channel_inter()

#### Import data

Fill in 

You will need to tune the following `args` and `kwargs` (in order):

**fov_idx (int)** :

**n_trenches (int)** :

**t_range (tuple)** :

**t_subsample_step (int)** :

In [None]:
interactive_segmentation.import_array_inter()

##### Process data

In [None]:
interactive_segmentation.plot_processed_inter()

#### Determine Cell Mask Envelope

Fill in.

You will need to tune the following `args` and `kwargs` (in order):

**cell_mask_method (str)** : Thresholding method, can be a local or global Otsu threshold.

**cell_otsu_scaling (float)** : Scaling factor applied to determined threshold.

**local_otsu_r (int)** : Radius of thresholding kernel used in the local otsu thresholding.

In [None]:
interactive_segmentation.plot_cell_mask_inter()

In [None]:
interactive_segmentation.plot_eig_mask_inter()

In [None]:
interactive_segmentation.plot_dist_mask_inter()

In [None]:
interactive_segmentation.plot_marker_mask_inter()

In [None]:
interactive_segmentation.process_results()

In [None]:
interactive_segmentation.write_param_file()

### Generate Segmentation

#### Start Dask Workers

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="01:00:00",
    local=False,
    n_workers=100,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

In [None]:
segment = tr.segment.fluo_segmentation_cluster(headpath, paramfile=True)

In [None]:
segment.dask_segment(dask_controller)

#### Stop Dask Workers

In [None]:
dask_controller.shutdown()

## Region Properties (No Lineage)

In [None]:
analyzer = tr.analysis.regionprops_extractor(
    headpath,
    "fluorsegmentation",
    intensity_channel_list=["mCherry", "YFP"],
    include_background=True,
)

In [None]:
analyzer.export_all_data(n_workers=100)

In [None]:
import dask.dataframe as dd

In [None]:
dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/analysis"
).loc[:10000].compute()

# Part 2: Barcodes

#### Specify Paths

Begin by defining the directory in which all processing will be done, as well as the initial nd2 file we will be <br>processing. This line should be run every time you open a new Python kernel.

The format should be: `headpath = "/path/to/folder"` and `nd2file = "/path/to/file.nd2"`

For example:
```
headpath = "/n/scratch2/de64/2019-05-31_validation_data"
nd2file = "/n/scratch2/de64/2019-05-31_validation_data/Main_Experiment.nd2"
```

Ideally, these files should be placed in a storage location with relatively fast I/O.

In [None]:
headpath = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/Barcodes/"
hdf5inputpath = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/run"
# nd2file = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/run_100ms_mchry100_yfp50.nd2"

## Extract to hdf5 files

In this section, we will be extracting our image data. Currently this notebook only supports `.nd2` format; however, <br>there are `.tiff` extractors in the TrenchRipper source files that will be added to `Master.ipynb` soon.

In the abstract, this step will take a single `.nd2` file and split it into a set of `.hdf5` files stored in <br>`headpath/hdf5`. Splitting the file up in this way will facilitate quick processing in later steps. Each field of <br>view will be split into one or more `.hdf5` files, depending on the number of images per file requested (more on <br>this later). 

To keep track of which output files correspond to which FOVs, as well as to keep track of experiment metadata, the <br>extractor also outputs a `metadata.hdf5` file in the `headpath` folder. The data from this step is accessible in <br>that `metadata.hdf5` file under the `global` key. If you would like to look at this metadata, you may use the <br>`tr.utils.pandas_hdf5_handler` to read from this file. Later steps will add additional metadata under different <br>keys into the `metadata.hdf5` file.

#### Start Dask Workers

First, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=20,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please register <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

##### Perform Extraction

Now that we have our cluster scheduler spun up, it is time to convert files. This will be handled by the <br>`hdf5_extractor` object. This extractor will pull up each FOV and split it such that each derived `.hdf5` file <br>contains, at maximum, N timepoints of that FOV per file. The image data stored in these files takes the <br>form of `(N,Y,X)` arrays that are accessible using the desired channel name as a key. 

The arguments for this extractor are:

 - **nd2file** : The filepath to the `.nd2` file you intend to extract.
 
 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **tpts_per_file** : The maximum number of timepoints stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **ignore_fovmetadata** : Used when `.nd2` data is corrupted and does not possess records for stage positions or <br>timepoints. Only set `True` if the extractor throws errors on metadata handling.

 - **nd2reader_override** : Overrides values in metadata recovered using the `nd2reader`. Currently set to <br>`{"z_levels":[],"z_coordinates":[]}` by default to correct a known issue where z coordinates are mistakenly <br>interpreted as a z stack. See the [nd2reader](https://rbnvrw.github.io/nd2reader/) documentation for more info.
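
The effect of `tpts_per_file` can be illustrated with a minimal sketch. The helper `chunk_timepoints` below is hypothetical and not part of `TrenchRipper`; it only shows how the timepoints of one FOV would be divided so that no file holds more than `tpts_per_file` of them.

```python
def chunk_timepoints(n_timepoints, tpts_per_file):
    """Split timepoint indices 0..n_timepoints-1 into consecutive chunks.

    Hypothetical helper for illustration only: each chunk stands for one
    output file holding at most `tpts_per_file` timepoints.
    """
    if tpts_per_file < 1:
        raise ValueError("tpts_per_file must be a positive integer")
    return [
        list(range(start, min(start + tpts_per_file, n_timepoints)))
        for start in range(0, n_timepoints, tpts_per_file)
    ]


chunks = chunk_timepoints(n_timepoints=230, tpts_per_file=100)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [100, 100, 30]
```

Smaller values give more, smaller files; the last file of an FOV is simply shorter when the number of timepoints is not a multiple of `tpts_per_file`.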

In [None]:
hdf5_extractor = tr.marlin_extractor(
    hdf5inputpath, headpath, metaparsestr="metadata_{timepoint:d}.hdf5"
)

##### Extraction Parameters

Here, you may set the time interval you want to extract. Useful for cropping data to the period exhibiting the dynamics of interest.

Optionally take notes to add to the `metadata.hdf5` file. Notes may also be taken directly in this notebook.

In [None]:
hdf5_extractor.inter_set_params()

##### Begin Extraction 

Running the following line will start the extraction process. This may be monitored by examining the `Dask Dashboard` <br> under the link displayed earlier. Once the computation is complete, move to the next line.

This step may take a long time, though it is possible to speed it up using additional workers.

In [None]:
hdf5_extractor.extract(dask_controller)

##### Shutdown Dask

Once extraction is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## Kymographs

Now that you have extracted your data into a series of `.hdf5` files, we will identify and crop <br>the individual trenches/growth channels present in the images. This algorithm assumes that your growth trenches <br>are vertically aligned and that they alternate in their orientation from top to bottom. See the example image for the <br>correct geometry:

![example_image](./resources/example_image.jpg)

The output of this step will be a set of `.hdf5` files stored in `headpath/kymograph`. The image data stored in these <br>files takes the form of `(K,T,Y,X)` arrays where K is the trench index, T is time, and Y,X are the crop dimensions. <br>These arrays are accessible using keys of the form `"[Image Channel]"`. For example, looking up phase channel <br>data will require the key `"Phase"`.
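
To make the `(K,T,Y,X)` layout concrete, here is a minimal sketch that uses a plain dictionary of NumPy arrays as a stand-in for one kymograph file. The array sizes are arbitrary illustration values, and a real file would be opened with an HDF5 reader rather than built in memory.

```python
import numpy as np

# Stand-in for one kymograph file: a dict mapping channel names to
# (K, T, Y, X) arrays. The sizes below are arbitrary illustration values.
K, T, Y, X = 4, 6, 32, 8
kymo_file = {"Phase": np.zeros((K, T, Y, X), dtype=np.uint16)}

phase = kymo_file["Phase"]   # all trenches stored under the "Phase" key
trench = phase[2]            # (T, Y, X): every timepoint of trench 2
frame = phase[2, 0]          # (Y, X): trench 2 at the first timepoint

# Place the timepoints side by side to view one trench as a kymograph image.
kymograph = np.concatenate(list(trench), axis=1)
print(trench.shape, frame.shape, kymograph.shape)  # (6, 32, 8) (32, 8) (32, 48)
```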

### Test Parameters



##### Initialize the interactive kymograph class

As a first step, initialize the `tr.interactive.kymograph_interactive` class that will help us choose the <br>parameters we will use to generate kymographs. 

In [None]:
interactive_kymograph = tr.kymograph_interactive(headpath)

In [None]:
viewer = tr.hdf5_viewer(headpath, persist_data=False)

##### Examine Images

Here you can manually inspect images before beginning parameter tuning.

In [None]:
viewer.view(width=1200)

You will now want to select a few test FOVs to try out parameters on, the channel you want to detect trenches on, and <br>the time interval on which you will perform your processing.

The arguments for this step are:

- **seg_channel (string)** : The channel name that you would like to segment on.

- **invert (list)** : Whether or not you want to invert the image before detecting trenches. By default, it is assumed that <br>the trenches have a high pixel intensity relative to the background. This should be the case for Phase Contrast and <br>Fluorescence Imaging, but may not be the case for Brightfield Imaging, in which case you will want to invert the image.

- **fov_list (list)** : List of integers corresponding to the FOVs that you wish to make test kymographs of.

- **t_subsample_step (int)** : Step size to be used for subsampling input files in time. We recommend choosing a step that leaves <br>between 5 and 10 timepoints for quick processing.

Hit the "Run Interact" button to lock in your parameters. The button will become transparent briefly and become solid again <br>when processing is complete. After that has occurred, move on to the next step. 

In [None]:
interactive_kymograph.import_hdf5_interactive()

##### Tune "trench-row" detection hyperparameters

The kymograph code begins by detecting the positions of trench rows in the image as follows:

1. Reduce each 2D image to a 1D signal along the y-axis by computing the qth percentile of the data along the x-axis.
2. Smooth this signal using a median kernel.
3. Normalize the signal by linearly rescaling it so that its minimum and maximum map to 0 and 1, respectively.
4. Use a set threshold to determine the trench row positions.

The arguments for this step are:

 - **y_percentile (int)** : Percentile to use for step 1.

 - **smoothing_kernel_y_dim_0 (int)** : Median kernel size to use for step 2.

 - **y_percentile_threshold (float)** : Threshold to use in step 4.
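
The four steps above can be sketched with NumPy alone. This is an illustrative approximation rather than the library's implementation: the function names are hypothetical, the median filter is a simple edge-padded version, and the test image is synthetic.

```python
import numpy as np


def median_smooth(signal, kernel_size):
    """1D median filter with edge padding; kernel_size is assumed to be odd."""
    half = kernel_size // 2
    padded = np.pad(signal, half, mode="edge")
    windows = np.stack([padded[i : i + len(signal)] for i in range(kernel_size)])
    return np.median(windows, axis=0)


def detect_row_mask(img, y_percentile, smoothing_kernel_y_dim_0, y_percentile_threshold):
    """Return a boolean mask over y marking positions that belong to a trench row."""
    # Step 1: take a percentile along x, leaving one value per y position.
    signal = np.percentile(img, y_percentile, axis=1)
    # Step 2: smooth with a median kernel.
    smoothed = median_smooth(signal, smoothing_kernel_y_dim_0)
    # Step 3: rescale so that the minimum maps to 0 and the maximum to 1.
    span = smoothed.max() - smoothed.min()
    if span == 0:
        return np.zeros(len(smoothed), dtype=bool)  # flat signal: nothing to detect
    normalized = (smoothed - smoothed.min()) / span
    # Step 4: apply the fixed threshold.
    return normalized > y_percentile_threshold


# Synthetic (Y, X) image with one bright band at y = 20..39.
img = np.full((100, 50), 100.0)
img[20:40, :] = 1000.0
mask = detect_row_mask(
    img, y_percentile=99, smoothing_kernel_y_dim_0=9, y_percentile_threshold=0.2
)
row_positions = np.flatnonzero(mask)
print(row_positions.min(), row_positions.max())  # 20 39
```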

Running the following widget will display the smoothed 1-D signal for each of your timepoints. In addition, the threshold <br>value for each fov will be displayed as a red line.

In [None]:
interactive_kymograph.preview_y_precentiles_interactive()

In [None]:
interactive_kymograph.preview_y_precentiles_consensus_interactive()

##### Tune "trench-row" cropping hyperparameters

Next, we will use the detected rows to perform cropping of the input image in the y-dimension:

1. Determine edges of trench rows based on threshold mask.
2. Filter out rows that are too small.
3. Use the remaining rows to compute the drift in y in each image.
4. Apply the drift to the initially detected rows to get rows in all timepoints.
5. Perform cropping using the "end" of the row as reference (the end referring to the part of the trench farthest from <br>the feeding channel).

Step 5 performs a simple algorithm to determine the orientation of each trench:

```
row_orientations = [] # A list of row orientations, starting from the topmost row
if the number of detected rows == 'Number of Rows': 
    row_orientations.append('Orientation')
elif the number of detected rows < 'Number of Rows':
    row_orientations.append('Orientation when < expected rows')
for row in rows[1:]:  # the topmost row was already assigned above
    if row_orientations[-1] == downward:
        row_orientations.append(upward)
    elif row_orientations[-1] == upward:
        row_orientations.append(downward)
```

Additionally, if the device trenches face a single direction, alternation of row orientation may be turned off by setting the<br> `Alternate Orientation?` argument to False. The `Use Median Drift?` argument, when set to True, will use the<br> median drift in y across all FOVs for drift correction, instead of doing drift correction independently for each FOV. <br>This can be useful if a large fraction of FOVs fail drift correction. Note that `Use Median Drift?` <br>sets this behavior for both y and x drift correction.
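
A runnable version of the orientation logic, including the `Alternate Orientation?` switch, might look like the sketch below. The function name and argument names are hypothetical. Orientations are encoded as 0 (opening faces down) and 1 (opening faces up), and raising an error when more rows are detected than expected is an assumption, since that case is not described here.

```python
def get_row_orientations(
    n_detected, n_expected, orientation, orientation_on_fail, alternate=True
):
    """Return one orientation per detected row, starting from the topmost row."""
    if n_detected == 0:
        return []
    if n_detected == n_expected:
        first = orientation
    elif n_detected < n_expected:
        first = orientation_on_fail
    else:
        # Assumption: more rows than expected is treated as a detection failure.
        raise ValueError("more rows detected than expected")

    row_orientations = [first]
    for _ in range(n_detected - 1):
        previous = row_orientations[-1]
        # Flip 0 <-> 1 when alternating, otherwise repeat the same orientation.
        row_orientations.append(1 - previous if alternate else previous)
    return row_orientations


print(get_row_orientations(2, 2, orientation=0, orientation_on_fail=1))  # [0, 1]
print(get_row_orientations(1, 2, orientation=0, orientation_on_fail=1))  # [1]
print(get_row_orientations(3, 3, 0, 1, alternate=False))                 # [0, 0, 0]
```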

The arguments for this step are:

 - **y_min_edge_dist (int)** : Minimum row length necessary for detection (filters out small detected objects).

 - **padding_y (int)** : Padding to add to the end of trench row when cropping in the y-dimension.

 - **trench_len_y (int)** : Length from the end of each trench row to the feeding channel side of the crop.

 - **Number of Rows (int)** : The number of rows to expect in your image. For instance, two in the example image.
 
 - **Alternate Orientation? (bool)** : Whether or not to alternate the orientation of consecutive rows.

 - **Orientation (int)** : The orientation of the top-most row where 0 corresponds to a trench with a downward-oriented trench <br>opening and 1 corresponds to a trench with an upward-oriented trench opening.

 - **Orientation when < expected rows(int)** : The orientation of the top-most row when the number of detected rows is less than <br>expected. Useful if your trenches drift out of your image in some FOVs.
 
 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.

 - **images_per_row(int)** : How many images to output per row for this widget.
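
As a rough illustration of how `y_min_edge_dist`, `padding_y` and `trench_len_y` interact, the sketch below finds row edges in a 1D threshold mask, drops rows that are too short, and computes crop bounds from the closed end of a row. Both helper names are hypothetical, and the exact placement of the crop relative to the row end is an assumption based on the descriptions above.

```python
import numpy as np


def find_rows(mask, y_min_edge_dist):
    """Return (start, end) pairs for runs of True in a 1D mask; end is exclusive."""
    padded = np.concatenate(([False], mask, [False])).astype(int)
    steps = np.diff(padded)
    starts = np.flatnonzero(steps == 1)
    ends = np.flatnonzero(steps == -1)
    # Keep only rows that are at least y_min_edge_dist pixels long.
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= y_min_edge_dist]


def y_crop_bounds(row, orientation, padding_y, trench_len_y, img_height):
    """Crop relative to the closed end of the row, clipped to the image."""
    start, end = row
    if orientation == 0:  # opening faces down, so the closed end is at the top
        top, bottom = start - padding_y, start + trench_len_y
    else:  # opening faces up, so the closed end is at the bottom
        top, bottom = end - trench_len_y, end + padding_y
    return max(top, 0), min(bottom, img_height)


mask = np.zeros(100, dtype=bool)
mask[20:60] = True   # a genuine trench row
mask[70:73] = True   # a small spurious object
rows = find_rows(mask, y_min_edge_dist=10)
print(rows)                                        # [(20, 60)]
print(y_crop_bounds(rows[0], 0, 5, 30, 100))       # (15, 50)
print(y_crop_bounds(rows[0], 1, 5, 30, 100))       # (30, 65)
```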

Running the following widget will display y-cropped images for each fov and timepoint.

In [None]:
interactive_kymograph.preview_y_crop_interactive()

##### Tune trench detection hyperparameters

Next, we will detect the positions of trenches in the y-cropped images as follows:

1. Reduce each 2D image to a 1D signal along the x-axis by computing the qth percentile of the data along the y-axis.
2. Determine the signal background by smoothing this signal using a large median kernel.
3. Subtract the background signal.
4. Smooth the resultant signal using a median kernel.
5. Use an [Otsu threshold](https://imagej.net/Auto_Threshold#Otsu) to determine the trench midpoint positions.

After this, x-dimension drift correction of our detected midpoints will be performed as follows:

6. Begin at t=1.
7. For each midpoint $m \in \{midpoints(t)\}$, assign to $m$ the midpoint $n \in \{midpoints(t-1)\}$ that is closest to it. <br>
Midpoints at time $t-1$ that are not the closest midpoint to any $m$ will not be mapped.
8. Compute the translation of each mapped midpoint from time $t-1$ to $t$.
9. Take the average of these translations as the x-dimension drift from time $t-1$ to $t$.
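
The drift estimate in steps 6 to 9 can be sketched for a single pair of consecutive timepoints as follows. The function name `x_drift` is hypothetical, and returning zero drift when either timepoint has no midpoints is an assumption made so the sketch never fails on empty input.

```python
import numpy as np


def x_drift(prev_midpoints, curr_midpoints):
    """Average translation from each current midpoint to its nearest previous one."""
    prev = np.asarray(prev_midpoints, dtype=float)
    curr = np.asarray(curr_midpoints, dtype=float)
    if prev.size == 0 or curr.size == 0:
        return 0.0  # assumption: nothing to match is treated as zero drift
    # Step 7: distance between every current and every previous midpoint,
    # then pick the closest previous midpoint for each current one.
    distances = np.abs(curr[:, None] - prev[None, :])
    nearest_prev = distances.argmin(axis=1)
    # Step 8: translation of each mapped midpoint.
    translations = curr - prev[nearest_prev]
    # Step 9: the average translation is the drift from t-1 to t.
    return float(translations.mean())


drift = x_drift([10, 30, 50, 70], [12, 32, 52])
print(drift)  # 2.0
```

The previous midpoint at 70 is never the closest match to any current midpoint, so it does not contribute to the estimate.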

The arguments for this step are:

 - **t (int)** : Timepoint to examine the percentiles and threshold in.

 - **x_percentile (int)** : Percentile to use for step 1.

 - **background_kernel_x (int)** : Median kernel size to use for step 2.

 - **smoothing_kernel_x (int)** : Median kernel size to use for step 4.

 - **otsu_scaling (float)** : Scaling factor to apply to the threshold determined by Otsu's method.
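
The detection steps 1 to 5 and the arguments above can be tied together in a short NumPy sketch. All function names here are hypothetical, the Otsu threshold is a plain histogram-based version written for illustration, and taking the center of each above-threshold run as the midpoint is an assumption about how midpoints are derived from the thresholded signal.

```python
import numpy as np


def median_smooth(signal, kernel_size):
    """1D median filter with edge padding; kernel_size is assumed to be odd."""
    half = kernel_size // 2
    padded = np.pad(signal, half, mode="edge")
    windows = np.stack([padded[i : i + len(signal)] for i in range(kernel_size)])
    return np.median(windows, axis=0)


def otsu_threshold(values, nbins=256):
    """Histogram-based Otsu threshold: maximize the between-class variance."""
    counts, edges = np.histogram(values, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    weight_low = np.cumsum(counts)
    weight_high = weight_low[-1] - weight_low
    cumulative_sum = np.cumsum(counts * centers)
    mean_low = cumulative_sum / np.maximum(weight_low, 1)
    mean_high = (cumulative_sum[-1] - cumulative_sum) / np.maximum(weight_high, 1)
    between_class_variance = weight_low * weight_high * (mean_low - mean_high) ** 2
    return centers[np.argmax(between_class_variance)]


def detect_midpoints(img, x_percentile, background_kernel_x, smoothing_kernel_x, otsu_scaling):
    signal = np.percentile(img, x_percentile, axis=0)          # step 1
    background = median_smooth(signal, background_kernel_x)    # step 2
    foreground = signal - background                           # step 3
    smoothed = median_smooth(foreground, smoothing_kernel_x)   # step 4
    mask = smoothed > otsu_scaling * otsu_threshold(smoothed)  # step 5
    # Midpoint of each run of above-threshold positions (end is exclusive).
    padded = np.concatenate(([False], mask, [False])).astype(int)
    steps = np.diff(padded)
    starts, ends = np.flatnonzero(steps == 1), np.flatnonzero(steps == -1)
    return [int((s + e) // 2) for s, e in zip(starts, ends)]


# Synthetic y-cropped (Y, X) image with four 10-pixel-wide trenches.
img = np.full((50, 200), 100.0)
for start in (20, 60, 100, 140):
    img[:, start : start + 10] = 1000.0
midpoints = detect_midpoints(
    img, x_percentile=85, background_kernel_x=41, smoothing_kernel_x=5, otsu_scaling=1.0
)
print(midpoints)  # [25, 65, 105, 145]
```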

Running the following widget will display the smoothed 1-D signal for each of your timepoints, with the threshold <br>value for each FOV displayed as a red line. It will also display the detected midpoints for each of your timepoints. <br>If there is too much sparsity or discontinuity, your drift correction will not be accurate.

In [None]:
interactive_kymograph.preview_x_percentiles_interactive()

##### Tune trench cropping hyperparameters

Trench cropping simply uses the drift-corrected midpoints as a reference and crops out some fixed length around them <br>
to produce an output kymograph. **Note that the current implementation does not allow trench crops to overlap**. If your<br>
trench crops do overlap, the error will not be caught here, but will cause issues later in the pipeline. As such, try <br>
to crop your trenches as closely as possible. This issue will be fixed in a later update.

The arguments for this step are:

 - **trench_width_x (int)** : Trench width to use for cropping.

 - **trench_present_thr (float)** : Trenches that appear in less than this percent of FOVs will be eliminated from the dataset.<br>
If not removed, missing positions will be inferred from the image drift.

 - **Use Median Drift? (bool)** : Whether to use the median detected drift across all FOVs, instead of the drift detected in each FOV individually.
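
Because overlapping crops are not caught at this stage, it can help to check a candidate `trench_width_x` against your midpoint spacing by hand. The sketch below is a hypothetical helper pair, not part of the library: it computes fixed-width crop bounds around each midpoint, reports neighbouring crops that overlap, and stacks the crops into a `(K,Y,X)` array. Bounds are not clipped to the image, so midpoints near the image edge would need extra handling.

```python
import numpy as np


def x_crop_bounds(midpoints, trench_width_x):
    """Return [left, right) bounds of width trench_width_x around each midpoint."""
    half = trench_width_x // 2
    return [(int(m) - half, int(m) + half) for m in sorted(midpoints)]


def find_overlaps(bounds):
    """Return pairs of neighbouring crops whose x-ranges overlap."""
    return [(a, b) for a, b in zip(bounds[:-1], bounds[1:]) if a[1] > b[0]]


img = np.zeros((50, 200))
bounds = x_crop_bounds([25, 65, 105], trench_width_x=20)
overlaps = find_overlaps(bounds)
crops = np.stack([img[:, left:right] for left, right in bounds])
print(bounds)       # [(15, 35), (55, 75), (95, 115)]
print(overlaps)     # []
print(crops.shape)  # (3, 50, 20)

# A width larger than the midpoint spacing produces overlapping crops.
too_wide = find_overlaps(x_crop_bounds([25, 65], trench_width_x=50))
print(too_wide)     # [((0, 50), (40, 90))]
```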


Running the following widget will display a random kymograph for each row in each FOV and will also produce midpoint plots <br>showing retained midpoints.

In [None]:
interactive_kymograph.preview_kymographs_interactive()

##### Export and save hyperparameters

Run the following line to register and display the parameters you have selected for kymograph creation.

In [None]:
interactive_kymograph.process_results()

If you are satisfied with the above parameters, run the following line to write these parameters to disk at `headpath/kymograph.par`.<br>
This file will be used to perform kymograph creation in the next section.

In [None]:
interactive_kymograph.write_param_file()

### Generate Kymograph

##### Start Dask Workers

Again, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2 for kymograph creation. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=100,
    memory="8GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

After running the above line, you will have a running Dask client. Run the line below and click the link to supervise <br>the computation being administered by the scheduler. 

Don't be alarmed if the screen starts mostly blank; it may take time for your workers to spin up. If you get a 404 <br>error on a cluster, it is likely that your ports are not being forwarded properly. If this occurs, please register <br>the issue on GitHub.

In [None]:
dask_controller.daskclient

##### Perform Kymograph Cropping

Now that we have our cluster scheduler spun up, we will extract kymographs using the parameters stored in `headpath/kymograph.par`. <br>
This will be handled by the `kymograph_cluster` object. This will detect trenches in all of the files present in `headpath/hdf5` that <br>
you created in the first step. It will then crop these trenches and place the crops in a series of `.hdf5` files in `headpath/kymograph`. <br>
These files will store image data in the form of `(K,T,Y,X)` arrays where K is the trench index, T is time and Y,X are the image dimensions <br>
of the crop.

The arguments for this step are:

 - **headpath** : The folder in which processing is occurring. Should be the same for each step in the pipeline.

 - **trenches_per_file** : The maximum number of trenches stored in each output `.hdf5` file. Typical values are between 25 <br>and 100.

 - **paramfile** : Set to `True` if you want to use parameters from `headpath/kymograph.par`. Otherwise, you will have to specify <br>
 parameters as direct arguments to `kymograph_cluster`.

In [None]:
kymoclust = tr.kymograph.kymograph_cluster(
    headpath=headpath, trenches_per_file=200, paramfile=True
)

##### Begin Kymograph Cropping 

Running the following line will start the cropping process. This may be monitored by examining the `Dask Dashboard` <br>
under the link displayed earlier. Once the computation is complete, move to the next line.

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.generate_kymographs(dask_controller)

In [None]:
ff = tr.focus_filter(headpath)

In [None]:
ff.choose_filter_channel_inter()

In [None]:
ff.plot_histograms()

In [None]:
ff.plot_focus_threshold_inter()

In [None]:
ff.write_param_file()

##### Post-process Images

After the above step, kymographs will have been created for each `.hdf5` input file. They will now need to be reorganized <br>
into a new set of files such that each file holds, at most, `trenches_per_file` trenches.
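
One way to picture this reorganization is as a mapping from a global trench index to a file and a position within that file. The sketch below is hypothetical and assumes trenches are assigned to files in consecutive order, which may not match the library's actual ordering.

```python
import math


def file_location(trench_index, trenches_per_file):
    """Map a global trench index to (file index, index within that file)."""
    return divmod(trench_index, trenches_per_file)


def n_output_files(n_trenches, trenches_per_file):
    """Number of files needed so that no file exceeds trenches_per_file."""
    return math.ceil(n_trenches / trenches_per_file)


print(file_location(450, 200))    # (2, 50)
print(n_output_files(1234, 200))  # 7
```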

**Do not move on until all tasks are displayed as 'in memory' in Dask.**

In [None]:
kymoclust.post_process(dask_controller)

##### Check kymograph statistics

Run the next line to display some statistics from kymograph creation. The outputs are:

 - **fovs processed** : The number of FOVs successfully processed out of the total number of FOVs
 - **rows processed** : The number of rows of trenches processed out of the total number of rows
 - **trenches processed** : The number of trenches successfully processed
 - **row/fov** : The average number of rows successfully processed per FOV
 - **trenches/fov** : The average number of trenches successfully processed per FOV
 - **failed fovs** : A list of failed FOVs. Spot check these FOVs in the viewer to determine potential problems

In [None]:
kymoclust.kymo_report()

##### Shutdown Dask

Once cropping is complete, it is likely that you will want to shut down your `dask_controller` if you are on a <br>
cluster. This is because the specifications of the current `dask_controller` will not be optimal for later steps. <br>
To do this, run the following line and wait for it to complete. If it hangs, interrupt your kernel and re-run it. <br>
If this also fails to shut down your workers, you will have to manually shut them down using `scancel` in a terminal.

In [None]:
dask_controller.daskclient.restart()

In [None]:
dask_controller.shutdown()

## FISH Analysis

##### Start Dask Workers

Again, we start a `dask_controller` instance which will handle all of our parallel processing. The default parameters <br>here work well on O2 for kymograph creation. The critical arguments here are:

**walltime** : For a cluster, the length of time you will request each node for.

**local** : `True` if you want to perform computation locally. `False` if you want to perform it on a SLURM cluster.

**n_workers** : Number of nodes to request if on the cluster, or number of processes if computing locally.

**memory** : For a cluster, the amount of memory you will request each node for.

**working_directory** : For a cluster, the directory in which data will be spilled to disk. Usually set as a folder in <br>the `headpath`.

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="02:00:00",
    local=False,
    n_workers=10,
    memory="16GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

#### Get Barcode Signal (Percentile Function)

In [None]:
import numpy as np

tr.get_all_image_measurements(
    dask_controller,
    headpath,
    headpath + "/percentiles",
    ["RFP", "Cy5", "Cy7"],
    "98th Percentile",
    np.percentile,
    98,
)

#### Determine Barcodes

In [None]:
fish_test = tr.fish_analysis(
    headpath,
    "./lDE18_final_df.tsv",
    "./lDE18_final_df.json",
    hamming_thr=1,
    channel_names=["RFP 98th Percentile", "Cy5 98th Percentile", "Cy7 98th Percentile"],
)

In [None]:
fish_test.plot_signal_threshold_inter()

In [None]:
fish_test.get_bit_thresholds()

In [None]:
fish_test.bit_threshold_list = [
    1100,
    900,
    1600,
    1400,
    1400,
    1400,
    1500,
    1500,
    1300,
    1500,
    3000,
    6000,
    8000,
    3500,
    6000,
    2000,
    4500,
    5000,
    4000,
    3000,
    500,
    1000,
    600,
    700,
    700,
    700,
    700,
    700,
    600,
    600,
]

In [None]:
fish_test.plot_bit_threshold_inter()

In [None]:
fish_test.output_barcode_df()

# Step 3: Combined Analysis

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy as sp
import sklearn as skl
import dask.dataframe as dd

from matplotlib import pyplot as plt

In [None]:
headpath = "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/Barcodes/"

In [None]:
dask_controller = tr.trcluster.dask_controller(
    walltime="02:00:00",
    local=False,
    n_workers=50,
    memory="16GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()

In [None]:
dask_controller.displaydashboard()

In [None]:
dask_controller.daskclient.restart()

#### Import Barcode Dataframe

In [None]:
meta_handle = tr.pandas_hdf5_handler(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/Barcodes/metadata.hdf5"
)
pandas_barcode_df = meta_handle.read_df("barcodes", read_metadata=True)
barcode_df = dd.from_pandas(pandas_barcode_df, npartitions=500, sort=True)
barcode_df = barcode_df.persist()

In [None]:
ttl_called = len(barcode_df.index)
ttl_trenches = pandas_barcode_df.metadata["Total Trenches"]
ttl_trenches_w_cells = pandas_barcode_df.metadata["Total Trenches With Cells"]
percent_called = ttl_called / ttl_trenches
percent_called_w_cells = ttl_called / ttl_trenches_w_cells

In [None]:
percent_called

In [None]:
percent_called_w_cells

In [None]:
1.0 - percent_called_w_cells

#### Import mVenus Regionprops Output

In [None]:
analysis_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/analysis"
)
mvenus_df = analysis_df[analysis_df["Intensity Channel"] == "YFP"]

#### Get trenchwise mVenus average and standard deviation

In [None]:
trench_timepoints = mvenus_df.apply(
    lambda x: int(f'{x["trenchid"]:04}{x["timepoints"]:04}'), axis=1, meta=int
)
mvenus_df["trench-timepoint"] = trench_timepoints
mvenus_df_trench = mvenus_df.set_index("trench-timepoint", sorted=True)
mvenus_groupby = mvenus_df_trench.groupby("trench-timepoint")
mvenus_intensity_wo_bkd = (
    mvenus_groupby.apply(lambda x: x["mean_intensity"] - x.iloc[0]["mean_intensity"])
    .reset_index(drop=True)
    .to_frame()
)
mvenus_intensity_wo_bkd = mvenus_intensity_wo_bkd.rename(
    columns={"mean_intensity": "mean_intensity_wo_bkd"}
)
mvenus_intensity_wo_bkd.index = mvenus_df.index
mvenus_df = mvenus_df.join(mvenus_intensity_wo_bkd)
mvenus_df = mvenus_df[mvenus_df["Objectid"] != 0].persist()
# mvenus_df["Object Parquet Index"] = mvenus_df.apply(lambda x: int(f"{x['File Index']:04}{x['File Trench Index']:04}{x['timepoints']:04}{x['Objectid']:02}"),axis=1)
# mvenus_df = mvenus_df.set_index("Object Parquet Index")

trenchid_groupby = mvenus_df.groupby("trenchid", sort=True)
mvenus_mean = (
    trenchid_groupby["mean_intensity_wo_bkd"]
    .mean()
    .to_frame()
    .reset_index()
    .set_index("trenchid", sort=True)
)
mvenus_mean = mvenus_mean.rename(
    columns={"mean_intensity_wo_bkd": "mVenus Mean Intensity"}
).persist()
mvenus_sd = (
    trenchid_groupby["mean_intensity_wo_bkd"]
    .std()
    .to_frame()
    .reset_index()
    .set_index("trenchid", sort=True)
)
mvenus_sd = mvenus_sd.rename(
    columns={"mean_intensity_wo_bkd": "mVenus Standard Deviation"}
).persist()
mvenus_fano = (
    (
        (mvenus_sd["mVenus Standard Deviation"] ** 2)
        / mvenus_mean["mVenus Mean Intensity"]
    )
    .to_frame()
    .reset_index()
    .set_index("trenchid", sort=True)
)
mvenus_fano = mvenus_fano.rename(columns={0: "mVenus Fano Factor"}).persist()

scalar_dict = {
    "mVenus Mean Intensity": mvenus_mean,
    "mVenus Standard Deviation": mvenus_sd,
    "mVenus Fano Factor": mvenus_fano,
}

del mvenus_df

#### Get Trench Mapping

In [None]:
def get_mapping(df):
    df1_xy = np.array(df["Frame 1 Position"]).T
    df2_xy = np.array(df["Frame 2 Position"]).T
    ymat = np.subtract.outer(df1_xy[:, 0], df2_xy[:, 0])
    xmat = np.subtract.outer(df1_xy[:, 1], df2_xy[:, 1])
    distmat = (ymat**2 + xmat**2) ** (1 / 2)

    # ensuring map is one-to-one
    mapping = np.argmin(distmat, axis=1)
    invmapping = np.argmin(distmat, axis=0)

    mapping = {
        idx: map_idx
        for idx, map_idx in enumerate(mapping)
        if invmapping[map_idx] == idx
    }

    df1_trenchids = df["Frame 1 Trenchid"]
    df2_trenchids = df["Frame 2 Trenchid"]

    trenchid_map = {
        trenchid: df2_trenchids[mapping[i]]
        for i, trenchid in enumerate(df1_trenchids)
        if i in mapping
    }

    return trenchid_map


def get_trenchid_map(kymodf1, kymodf2):

    fovset1 = set(kymodf1["fov"].unique().compute().tolist())
    fovset2 = set(kymodf2["fov"].unique().compute().tolist())
    fov_intersection = fovset1.intersection(fovset2)

    fovdf1_groupby = kymodf1.set_index("fov", sorted=True).groupby("fov")
    fovdf1_pos = (
        fovdf1_groupby.apply(
            lambda x: [list(x["y (local)"]), list(x["x (local)"])], meta=list
        )
        .loc[list(fov_intersection)]
        .to_frame()
    )
    fovdf1_pos.columns = ["Frame 1 Position"]
    fovdf1_trenchid = (
        fovdf1_groupby.apply(lambda x: list(x["trenchid"]), meta=list)
        .loc[list(fov_intersection)]
        .to_frame()
    )
    fovdf1_trenchid.columns = ["Frame 1 Trenchid"]

    fovdf2_groupby = kymodf2.set_index("fov", sorted=True).groupby("fov")
    fovdf2_pos = (
        fovdf2_groupby.apply(
            lambda x: [list(x["y (local)"]), list(x["x (local)"])], meta=list
        )
        .loc[list(fov_intersection)]
        .to_frame()
    )
    fovdf2_pos.columns = ["Frame 2 Position"]
    fovdf2_trenchid = (
        fovdf2_groupby.apply(lambda x: list(x["trenchid"]), meta=list)
        .loc[list(fov_intersection)]
        .to_frame()
    )
    fovdf2_trenchid.columns = ["Frame 2 Trenchid"]

    combined_df = fovdf1_pos.join([fovdf1_trenchid, fovdf2_pos, fovdf2_trenchid])
    mapping_df = combined_df.apply(get_mapping, axis=1, meta=dict).compute()

    trenchid_map = {
        key: val for item in mapping_df.to_list() for key, val in item.items()
    }

    return trenchid_map


def get_called_df(barcode_df, scalar_dict):

    ## note that scalar index needs to be trenchid sorted
    init_scalar_df = scalar_dict[list(scalar_dict.keys())[0]]
    init_scalar_df_idx = init_scalar_df.index.compute().to_list()
    valid_barcode_df = barcode_df[
        barcode_df["trenchid"].isin(trenchid_map.keys())
    ].compute()
    barcode_df_mapped_trenchids = valid_barcode_df["trenchid"].apply(
        lambda x: trenchid_map[x]
    )
    valid_init_scalar_df_indices = barcode_df_mapped_trenchids.isin(init_scalar_df_idx)
    barcode_df_mapped_trenchids = barcode_df_mapped_trenchids[
        valid_init_scalar_df_indices
    ]
    final_valid_barcode_df_indices = barcode_df_mapped_trenchids.index.to_list()
    final_valid_init_scalar_df_indices = barcode_df_mapped_trenchids.to_list()
    final_scalar_dfs = []
    for key, val in scalar_dict.items():
        final_scalar_dfs.append(
            val.loc[final_valid_init_scalar_df_indices].reset_index(drop=True)
        )

    called_df = barcode_df.loc[final_valid_barcode_df_indices].reset_index(drop=True)
    called_df = called_df.join(final_scalar_dfs)

    return called_df

In [None]:
mvenus_kymo_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/kymograph/metadata"
)
barcode_kymo_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/Barcodes/kymograph/metadata"
)

max_mvenus_tpt = mvenus_kymo_df.loc[:1000]["timepoints"].max().compute()
min_barcode_tpt = barcode_kymo_df.loc[:1000]["timepoints"].min().compute()

kymodf1 = mvenus_kymo_df[mvenus_kymo_df["timepoints"] == max_mvenus_tpt]
kymodf2 = barcode_kymo_df[barcode_kymo_df["timepoints"] == min_barcode_tpt]

last_mvenus_tpt_df = mvenus_kymo_df[mvenus_kymo_df["timepoints"] == max_mvenus_tpt]
first_barcode_tpt_df = barcode_kymo_df[barcode_kymo_df["timepoints"] == min_barcode_tpt]

trenchid_map = get_trenchid_map(first_barcode_tpt_df, last_mvenus_tpt_df)

In [None]:
called_df = get_called_df(barcode_df, scalar_dict)

In [None]:
def get_scalar_mean_median_mode(df, groupby, key):
    # Defined as a function because referencing `key` as a global variable
    # does not play nicely with dask groupby.
    # Despite the name, this adds the median, mean, and SEM (not the mode) of `key`.
    df[key + ": Median"] = groupby.apply(lambda x: np.median(x[key]), meta=float)
    df[key + ": Mean"] = groupby.apply(lambda x: np.mean(x[key]), meta=float)
    df[key + ": SEM"] = groupby.apply(
        lambda x: sp.stats.sem(x[key]), meta=float
    )  # ADDING THIS MESSED UP THE LABELS?
    return df


## HERE

In [None]:
called_df_barcode = called_df.set_index("Barcode", sorted=False)
called_df_barcode_groupby = called_df.groupby(["Barcode"])
barcode_only_df = called_df_barcode_groupby.apply(lambda x: x.iloc[0])

barcode_only_df["trenchids"] = called_df_barcode_groupby.apply(
    lambda x: list(x["trenchid"]), meta=list
)

for key in scalar_dict.keys():
    barcode_only_df = get_scalar_mean_median_mode(
        barcode_only_df, called_df_barcode_groupby, key
    )

for key, _ in scalar_dict.items():
    del barcode_only_df[key]

barcode_only_df["N Trenches"] = barcode_only_df["trenchids"].apply(
    lambda x: len(x), meta=int
)

del barcode_only_df["trenchid"]
del barcode_only_df["Barcode Signal"]
del barcode_only_df["Barcode"]
del barcode_only_df["barcode"]

In [None]:
barcode_only_df = barcode_only_df.compute()
barcode_only_df = barcode_only_df[barcode_only_df["N Trenches"] > 5]

In [None]:
plt.hist(barcode_only_df["N Trenches"], range=(1, 10))
plt.show()

In [None]:
kymo_df = dd.read_parquet(
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/kymograph/metadata"
)

In [None]:
## NEW INDEX
trench_timepoint_index = kymo_df.apply(
    lambda x: int(f'{x["trenchid"]:08}{x["timepoints"]:04}'), axis=1, meta=int
)
kymo_df["Trenchid Timepoint Index"] = trench_timepoint_index
## END
kymo_df = kymo_df.persist()
# dd.to_parquet(outputdf, self.kymographpath + "/metadata",engine='fastparquet',compression='gzip',write_metadata_file=True,overwrite=True)

In [None]:
dd.to_parquet(
    kymo_df,
    "/home/de64/scratch/de64/sync_folder/2021-05-27_lDE18_20x_run_1/mVenus/kymograph/metadata",
    engine="fastparquet",
    compression="gzip",
    write_metadata_file=True,
    overwrite=True,
)

In [None]:
first_indices = kymo_df.loc[list(kymo_df.divisions)].compute()

In [None]:
def set_new_aligned_index(df, index_column):
    ### Sets a column to be the new index
    ### Fast, but assumes index is both sorted and division aligned to the primary index
    first_indices = df.loc[list(df.divisions)].compute()
    new_index_divisions = first_indices[index_column].to_list()
    output_df = df.set_index(
        index_column,
        drop=False,
        sorted=True,
        npartitions=df.npartitions,
        divisions=new_index_divisions,
    )
    return output_df

In [None]:
test_df = set_new_aligned_index(kymo_df, "File Parquet Index")

In [None]:
test_df.loc[0].compute()

In [None]:
test_df["fov"].unique().compute()

In [None]:
filter_level = 0.7
filtered_df = barcode_only_df[
    (
        (
            barcode_only_df["mVenus Fano Factor: SEM"]
            / barcode_only_df["mVenus Fano Factor: Mean"]
        )
        < filter_level
    )
    & (
        (
            barcode_only_df["mVenus Mean Intensity: SEM"]
            / barcode_only_df["mVenus Mean Intensity: Mean"]
        )
        < filter_level
    )
]

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity: Median"],
    filtered_df["mVenus Fano Factor: Median"],
    alpha=0.5,
)
plt.xscale("log")
plt.ylim(0, 100)
plt.show()

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity: Median"],
    filtered_df["mVenus Fano Factor: Median"],
    alpha=0.5,
)
plt.loglog()
plt.show()

In [None]:
filtered_df["Log mVenus Mean Intensity"] = np.log10(
    filtered_df["mVenus Mean Intensity: Median"]
)
filtered_df["Log mVenus Fano Factor"] = np.log10(
    filtered_df["mVenus Fano Factor: Median"]
)

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression().fit(
    filtered_df["Log mVenus Mean Intensity"].values.reshape(-1, 1),
    filtered_df["Log mVenus Fano Factor"].values,
)

In [None]:
reg.score(
    filtered_df["Log mVenus Mean Intensity"].values.reshape(-1, 1),
    filtered_df["Log mVenus Fano Factor"].values,
)

In [None]:
corrected_fano = filtered_df["Log mVenus Fano Factor"].values - (
    reg.coef_[0] * filtered_df["Log mVenus Mean Intensity"].values + reg.intercept_
)
perc = np.percentile(corrected_fano, 97)
high_fano_df = filtered_df[corrected_fano > perc]
high_fano_complement_df = filtered_df[corrected_fano <= perc]

# corrected_fano = filtered_df["Log mVenus Fano Factor"].values - (reg.coef_[0]*filtered_df["Log mVenus Mean Intensity"].values+reg.intercept_)
# perc = np.percentile(corrected_fano,99)
# low_fano_df = filtered_df[corrected_fano<perc]
# low_fano_complement_df= filtered_df[corrected_fano>=perc]

In [None]:
perc

In [None]:
high_fano_df[high_fano_df["mVenus Mean Intensity: Median"] > 1000]["Promoter"].to_list()

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity: Median"],
    filtered_df["mVenus Standard Deviation: Median"]
    / filtered_df["mVenus Mean Intensity: Median"],
    alpha=0.3,
)
plt.xscale("log")
plt.ylim(0.0, 1.0)
plt.show()

In [None]:
barcode_only_df = called_df.groupby("Barcode").apply(lambda x: x.iloc[0])
del barcode_only_df["mVenus Mean Intensity"]
del barcode_only_df["trenchid"]

# barcode_only_df["mVenus Mean Intensity Observations"] = barcode_mvenus_series
# barcode_only_df["mVenus Mean Intensity"] = barcode_only_df.apply(lambda x: np.mean(x["mVenus Mean Intensity Observations"]),axis=1)
# barcode_only_df["mVenus Standard Deviation"] = barcode_only_df.apply(lambda x: np.std(x["mVenus Mean Intensity Observations"]),axis=1)

barcode_only_df["mVenus Mean Intensity"] = barcode_mvenus_mean
barcode_only_df["mVenus Standard Deviation"] = barcode_mvenus_sd

barcode_only_df["mVenus Fano Factor"] = (
    barcode_only_df["mVenus Standard Deviation"] ** 2
) / barcode_only_df["mVenus Mean Intensity"]
barcode_only_df["mVenus CV"] = (
    barcode_only_df["mVenus Standard Deviation"]
    / barcode_only_df["mVenus Mean Intensity"]
)
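
For reference, the two noise measures computed in the cell above can be written as follows. The symbols $\mu$ (per-barcode mean intensity) and $\sigma$ (per-barcode standard deviation) are notation introduced here, not names used elsewhere in the notebook.

$$
F = \frac{\sigma^{2}}{\mu}, \qquad \mathrm{CV} = \frac{\sigma}{\mu}, \qquad \text{so that} \quad F = \mu \,\mathrm{CV}^{2}.
$$

The last identity is why the Fano factor and $\mathrm{CV}^{2}$ scale differently with the mean when they are plotted against mean intensity.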

In [None]:
filtered_df = barcode_only_df[
    (barcode_only_df["mVenus Mean Intensity"] > 2000)
    & (barcode_only_df["mVenus Mean Intensity"] < 50000)
]

In [None]:
meta_handle = tr.pandas_hdf5_handler(headpath + "metadata.hdf5")

In [None]:
filtered_df

In [None]:
# TODO: store this table as parquet instead of HDF5

meta_handle.write_df("mVenus Analysis 2", filtered_df)

In [None]:
meta_handle = tr.pandas_hdf5_handler(headpath + "metadata.hdf5")

In [None]:
filtered_df = meta_handle.read_df("mVenus Analysis 2")

In [None]:
filtered_df["Promoter"].apply(lambda x: x[-12] != "T" and x[-12] != "-")

In [None]:
plt.hist(
    filtered_df[filtered_df["Promoter"].apply(lambda x: x[-12] == "T")][
        "mVenus Fano Factor"
    ],
    density=True,
    range=(0, 10000),
    bins=50,
    alpha=0.5,
)
plt.hist(
    filtered_df[
        filtered_df["Promoter"].apply(lambda x: x[-12] != "T" and x[-12] != "-")
    ]["mVenus Fano Factor"],
    density=True,
    range=(0, 10000),
    bins=50,
    alpha=0.5,
)

plt.show()

In [None]:
np.median(
    filtered_df[filtered_df["Promoter"].apply(lambda x: x[-12] == "T")][
        "mVenus Fano Factor"
    ]
)

In [None]:
# Exclude gapped ("-") promoters as well, matching the histogram filter above
np.median(
    filtered_df[
        filtered_df["Promoter"].apply(lambda x: x[-12] != "T" and x[-12] != "-")
    ]["mVenus Fano Factor"]
)

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity"], filtered_df["mVenus CV"] ** 2, alpha=0.5
)
plt.loglog()
plt.show()

In [None]:
np.logspace(3, 5, 10)

In [None]:
# Bin barcodes by mean intensity on a log-spaced grid
pd.cut(filtered_df["mVenus Mean Intensity"], bins=np.logspace(3, 5, 10))

In [None]:
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.violinplot(
    x=pd.cut(filtered_df["mVenus Mean Intensity"], bins=np.logspace(3.3, 4, 10)),
    y=filtered_df["mVenus Fano Factor"],
    cut=0,
)
plt.ylim(0, 500)

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity"], filtered_df["mVenus Fano Factor"], alpha=0.5
)
plt.loglog()
plt.show()

In [None]:
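# Work on an explicit copy so the new columns are not written to a slice of barcode_only_df
filtered_df = filtered_df.copy()
filtered_df["Log mVenus Mean Intensity"] = np.log10(
    filtered_df["mVenus Mean Intensity"]
)
filtered_df["Log mVenus Fano Factor"] = np.log10(filtered_df["mVenus Fano Factor"])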

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
reg = LinearRegression().fit(
    filtered_df["Log mVenus Mean Intensity"].values.reshape(-1, 1),
    filtered_df["Log mVenus Fano Factor"].values,
)

In [None]:
reg.score(
    filtered_df["Log mVenus Mean Intensity"].values.reshape(-1, 1),
    filtered_df["Log mVenus Fano Factor"].values,
)

In [None]:
corrected_fano = filtered_df["Log mVenus Fano Factor"].values - (
    reg.coef_[0] * filtered_df["Log mVenus Mean Intensity"].values + reg.intercept_
)
perc = np.percentile(corrected_fano, 99)
high_fano_df = filtered_df[corrected_fano > perc]
high_fano_complement_df = filtered_df[corrected_fano <= perc]
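
The detrend-and-threshold step above appears twice in this notebook with different percentiles, so it may be convenient to wrap it in a helper. The following is a minimal sketch, not part of `TrenchRipper`: the function name `high_residual_mask` and the synthetic data are assumptions made for illustration, and `np.polyfit` is swapped in for `LinearRegression` (for a single predictor both give the same ordinary least-squares line).

```python
import numpy as np


def high_residual_mask(x, y, percentile=99):
    """Fit log10(y) against log10(x) by least squares and flag points
    whose residual lies above the given percentile of all residuals."""
    log_x = np.log10(np.asarray(x, dtype=float))
    log_y = np.log10(np.asarray(y, dtype=float))
    # np.polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(log_x, log_y, deg=1)
    residuals = log_y - (slope * log_x + intercept)
    threshold = np.percentile(residuals, percentile)
    return residuals > threshold, residuals, threshold


# Synthetic stand-in data (invented values): Fano factor roughly proportional
# to the mean intensity, with log-normal scatter.
rng = np.random.default_rng(1)
mean_intensity = 10 ** rng.uniform(3.3, 4.7, size=1000)
fano = 0.01 * mean_intensity * 10 ** rng.normal(0, 0.1, size=1000)

mask, residuals, threshold = high_residual_mask(mean_intensity, fano, percentile=99)
print(mask.sum(), threshold)
```

With a dataframe such as `filtered_df`, the mask could be applied as `filtered_df[mask]` and `filtered_df[~mask]` to obtain the high-residual set and its complement.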

In [None]:
plt.hist(corrected_fano)

In [None]:
perc

In [None]:
len(high_fano_df)

In [None]:
high_fano_df.to_csv("./high_fano_df.tsv", sep="\t")
high_fano_complement_df.to_csv("./high_fano_complement_df.tsv", sep="\t")

In [None]:
import pandas as pd

In [None]:
high_fano_df = pd.read_csv("./high_fano_df.tsv", sep="\t")
high_fano_complement_df = pd.read_csv("./high_fano_complement_df.tsv", sep="\t")

In [None]:
from Bio.Seq import Seq
from Bio import motifs

In [None]:
high_fano_seqlist = [
    Seq(item) for item in high_fano_df["Promoter"].tolist() if "-" not in item
]
high_fano_complement_seqlist = [
    Seq(item)
    for item in high_fano_complement_df["Promoter"].tolist()
    if "-" not in item
]

In [None]:
high_fano_df

In [None]:
high_fano_seqlist

In [None]:
high_fano_motif = motifs.create(high_fano_seqlist)
high_fano_comp_motif = motifs.create(high_fano_complement_seqlist)

In [None]:
high_fano_motif.degenerate_consensus

In [None]:
high_fano_freqs[:, -12]

In [None]:
10000 * 0.12

In [None]:
high_fano_motif.weblogo("mymotif.png")

In [None]:
high_fano_comp_motif.degenerate_consensus

In [None]:
high_fano_comp_motif.weblogo("mymotif_comp.png")

In [None]:
import numpy as np

from matplotlib import pyplot as plt

high_fano_freqs = np.array(
    [val for key, val in high_fano_motif.counts.normalize(pseudocounts=0.5).items()]
)
high_fano_comp_freqs = np.array(
    [
        val
        for key, val in high_fano_comp_motif.counts.normalize(pseudocounts=0.5).items()
    ]
)

In [None]:
plt.hist(abs(high_fano_freqs - high_fano_comp_freqs).flatten())

In [None]:
plt.imshow(abs(high_fano_freqs - high_fano_comp_freqs))

In [None]:
high_fano_df["Promoter"].tolist()

In [None]:
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])

In [None]:
sns.lmplot(x="Log mVenus Mean Intensity", y="Log mVenus Fano Factor", data=filtered_df)

In [None]:
plt.scatter(
    filtered_df["mVenus Mean Intensity"], filtered_df["mVenus Fano Factor"], alpha=0.5
)
plt.semilogx()
plt.ylim(0, 1000)
plt.show()

In [None]:
plt.hist(filtered_df["mVenus Fano Factor"], range=(10, 1000), bins=30)
plt.show()
plt.hist(filtered_df["mVenus CV"], range=(0.001, 10), bins=30)
plt.show()

In [None]:
filtered_df

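##### Problem

Some barcodes have implausibly large Fano factors. Inspection indicates that dark ("off") cells are being misassigned to bright trajectories, which inflates the variance. The source of this misassignment still needs to be identified and fixed. One possible workaround is to compute statistics within each trench and use the median across trenches as the barcode-level value.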
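
The within-trench idea can be sketched as follows. This is a hedged illustration on synthetic data, not the notebook's actual fix: the helper name `barcode_median_stats` and all numeric values are assumptions, and only the column names (`Barcode`, `trenchid`, `mVenus Mean Intensity`) are taken from the cells above. Statistics are computed per trench first, and the median across trenches is then taken per barcode, so a single dim trench grouped with bright ones no longer dominates the variance.

```python
import numpy as np
import pandas as pd

# Synthetic per-observation table. Column names follow the notebook; values are invented.
rng = np.random.default_rng(0)
frames = []
trenchid = 0
for barcode in ["A", "B"]:
    for _ in range(3):
        # Trench 2 (barcode "A") is dim, mimicking "off" cells grouped with bright ones
        mu = 100.0 if trenchid == 2 else 5000.0
        frames.append(
            pd.DataFrame(
                {
                    "Barcode": barcode,
                    "trenchid": trenchid,
                    "mVenus Mean Intensity": rng.normal(mu, 0.05 * mu, size=50),
                }
            )
        )
        trenchid += 1
obs_df = pd.concat(frames, ignore_index=True)


def barcode_median_stats(df, value_col="mVenus Mean Intensity"):
    """Per-trench mean, Fano factor and CV, summarized by the median per barcode."""
    per_trench = df.groupby(["Barcode", "trenchid"])[value_col].agg(
        mean="mean", var="var"  # pandas "var" uses ddof=1 (sample variance)
    )
    per_trench["fano"] = per_trench["var"] / per_trench["mean"]
    per_trench["cv"] = np.sqrt(per_trench["var"]) / per_trench["mean"]
    return per_trench.groupby(level="Barcode")[["mean", "fano", "cv"]].median()


# Pooled statistics (all observations of a barcode together) for comparison
pooled = obs_df.groupby("Barcode")["mVenus Mean Intensity"].agg(mean="mean", var="var")
pooled_fano = pooled["var"] / pooled["mean"]

median_stats = barcode_median_stats(obs_df)
print(pd.DataFrame({"pooled fano": pooled_fano, "median fano": median_stats["fano"]}))
```

In this toy example the pooled Fano factor of barcode `A` is inflated by the dim trench, while the median of the per-trench values stays close to that of barcode `B`. The median only helps when misassigned trenches are a minority within a barcode.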