# TrenchRipper Master Notebook

## Experimental Notes



	3. Experiment 3 (File name: 2020-01-10_strains_46_47)
		a. Strains: DE46 and DE47 mixed 1:1
		b. Chip: 1.5W/L25 (height: 1.36 um) snakes (Sylvia)
			i. Baked for ~3 mins
			ii. However, was aged by ~2 weeks
			iii. Same chip as 2020-01-05_strains_46_47
			iv. Passivated overnight in 2.5% F108
		c. Growth: Loaded cells in the morning, grew for ~8 hours
			i. Still some dead mother cells, but more robust growth than in previous days
			ii. Cells loaded from extended (>10 hours) stationary
		d. Results
			i. Less unloading than with the PFA fixation on the same chip (2020-01-05_strains_46_47)
			ii. Loading balanced in strain representation
			iii. Both cycles worked very well, but there were still problems from dead mothers/physiology. Suggest loading cells in early stationary phase if possible. The problem may be alleviated once in a typical (MG1655) strain background.

#### Imports

In [None]:
import copy
import warnings

import ipywidgets as widgets
import matplotlib
import trenchripper as tr
from ipywidgets import (
    Dropdown,
    FloatRangeSlider,
    FloatSlider,
    IntRangeSlider,
    IntSlider,
    IntText,
    SelectMultiple,
    fixed,
    interact,
    interact_manual,
    interactive,
)

matplotlib.rcParams["figure.figsize"] = [20, 10]
warnings.filterwarnings(action="once")

#### Specify Paths

Begin by defining the directory in which all processing will be done, as well as the initial nd2 file we will be processing.

In [None]:
headpath = "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_1"
nd2file = "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_1/mothermachine_cycle_1.nd2"

In [None]:
headpath = "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_2"
nd2file = "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_2/mothermachine_cycle_2.nd2"

#### Transfer files into the scratch folder

In [None]:
sourcedir = "/n/files/SysBio/PAULSSON\ LAB/Personal\ Folders/Daniel/Image_Data/FISH_barcoding/2020-01-10_strains_46_47/cycle_2"
targetdir = "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_2"
tr.cluster.transferjob(sourcedir, targetdir)

### Extract to hdf5 files

#### Start Dask Workers

In [None]:
dask_controller = tr.cluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=10,
    memory="2GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()
dask_controller.daskcluster.start_workers()

In [None]:
dask_controller.displaydashboard()

#### Perform Extraction

In [None]:
hdf5_extractor = tr.ndextract.hdf5_fov_extractor(
    nd2file, headpath, tpts_per_file=100, ignore_fovmetadata=False
)

In [None]:
hdf5_extractor.inter_get_notes()

In [None]:
hdf5_extractor.extract(dask_controller)

#### Shutdown Dask

In [None]:
dask_controller.shutdown()

## Kymographs

### Test Parameters

#### Initialize the interactive kymograph class

As a first step, initialize the `tr.interactive.kymograph_interactive` class that will be handling all steps of generating a kymograph. 

You will need to specify the following `args` and `kwargs` (in order):


**Args**

**input_file_prefix (string)** : File prefix for all input hdf5 files of the form "\[input_file_prefix\]\[number\].hdf5". This should be the default output format for the hdf5 export code, but you will need to rename files if taking input files from a different source.

**all_channels (list)** : list of strings corresponding to the different image channels available in the input hdf5 file, with the channel used for segmenting trenches in the first position. NOTE: these names must match those of the input hdf5 file datasets.

**fov_list (list)** : List of ints corresponding to the fovs that you wish to make kymographs of.

**Kwargs**

**t_subsample_step (int)** : Step size to be used for subsampling input files in time. For quick processing, choose a step that leaves between 5 and 20 timepoints.

**t_range (tuple of ints)** : Range size to be used for subsampling input files in time.
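
As a rough illustration of how `t_range` and `t_subsample_step` interact, the sketch below lists the timepoints that would be kept. It assumes `t_range` is inclusive at both ends (as the slider bounds further down suggest); `subsample_timepoints` is a hypothetical helper, not part of TrenchRipper.

```python
import numpy as np


def subsample_timepoints(t_range, t_subsample_step):
    # Assumption: t_range is (first, last) with both ends included.
    first, last = t_range
    return np.arange(first, last + 1, t_subsample_step)


# 100 timepoints with a step of 10 leaves 10 timepoints,
# inside the recommended 5-20 window.
timepoints = subsample_timepoints((0, 99), 10)
print(len(timepoints), timepoints)
```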

The last line will perform import and subsampling of the input hdf5 image files.

In [None]:
matplotlib.rcParams["figure.figsize"] = [20, 10]
interactive_kymograph = tr.interactive.kymograph_interactive(headpath)
channels, fov_list, timepoints_len = interactive_kymograph.get_image_params()

In [None]:
interact(
    interactive_kymograph.view_image,
    fov_idx=IntText(value=0, description="FOV number:", disabled=False),
    t=IntSlider(
        value=0, min=0, max=timepoints_len - 1, step=1, continuous_update=False
    ),
    channel=Dropdown(
        options=channels, value=channels[0], description="Channel:", disabled=False
    ),
    invert=Dropdown(options=[True, False], value=False),
)

In [None]:
import_hdf5 = interactive(
    interactive_kymograph.import_hdf5_files,
    {"manual": True},
    all_channels=fixed(channels),
    seg_channel=Dropdown(options=channels, value=channels[0]),
    invert=Dropdown(options=[True, False], value=False),
    fov_list=SelectMultiple(options=fov_list),
    t_range=IntRangeSlider(
        value=[0, timepoints_len - 1],
        min=0,
        max=timepoints_len - 1,
        step=1,
        disabled=False,
        continuous_update=False,
    ),
    t_subsample_step=IntSlider(value=10, min=0, max=200, step=1),
)
display(import_hdf5)

In [None]:
imported_array_list = copy.copy(import_hdf5.result)

#### Tune "trench-row" detection hyperparameters

The kymograph code begins by detecting the positions of trench rows in the image as follows:

1. Reduce each 2D image to a 1D signal along the y-axis by computing the qth percentile of the data along the x-axis.
2. Smooth this signal using a median kernel.
3. Use a [triangle threshold](https://imagej.net/Auto_Threshold#Triangle) to determine the trench row positions.

This method uses the following `kwargs`, which you can tune here:

**y_percentile (int)** : Percentile to use for step 1.

**smoothing_kernel_y_dim_0 (int)** : Median kernel size to use for step 2.

**triangle_nbins (int)** : Number of bins to use in the triangle method histogram.

**triangle_scaling (float)** : Scaling factor to apply to the threshold determined by the triangle method.
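
The three steps above can be sketched on a synthetic image as follows. This is a minimal illustration using `scipy.signal.medfilt` and `skimage.filters.threshold_triangle`, not TrenchRipper's own implementation; `detect_row_mask` and the test image are hypothetical.

```python
import numpy as np
from scipy.signal import medfilt
from skimage.filters import threshold_triangle


def detect_row_mask(img, y_percentile=99, smoothing_kernel_y=17,
                    triangle_nbins=50, triangle_scaling=1.0):
    # Step 1: collapse the x-axis (axis 1) to get a 1D profile along y.
    profile = np.percentile(img, y_percentile, axis=1)
    # Step 2: median smoothing (the kernel size must be odd).
    smoothed = medfilt(profile, kernel_size=smoothing_kernel_y)
    # Step 3: triangle threshold, multiplied by the scaling factor.
    thr = threshold_triangle(smoothed, nbins=triangle_nbins) * triangle_scaling
    return smoothed, thr, smoothed > thr


# Synthetic image: dark background with two bright trench rows.
img = np.full((200, 100), 100.0)
img[40:80] = 1000.0
img[120:160] = 1000.0

smoothed, thr, mask = detect_row_mask(img)
print("rows detected at y =", np.flatnonzero(np.diff(mask.astype(int))) + 1)
```

Raising `triangle_scaling` moves the threshold up, which trims the detected rows on real data where the row edges are not as sharp as in this toy image.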


Running the following widget will display the smoothed 1-D signal for each of your timepoints. In addition, the threshold value for each fov will be displayed as a red line.

In [None]:
matplotlib.rcParams["figure.figsize"] = [20, 10]

row_detection = interactive(
    interactive_kymograph.preview_y_precentiles,
    {"manual": True},
    imported_array_list=fixed(imported_array_list),
    y_percentile=IntSlider(value=100, min=0, max=100, step=1),
    smoothing_kernel_y_dim_0=IntSlider(value=17, min=1, max=200, step=2),
    triangle_nbins=IntSlider(value=50, min=10, max=300, step=10),
    triangle_scaling=FloatSlider(value=3.5, min=0.0, max=4.0, step=0.05),
    triangle_threshold_bounds=FloatRangeSlider(
        value=[0, 1.0],
        min=0,
        max=1.0,
        step=0.01,
        disabled=False,
        continuous_update=False,
    ),
)
display(row_detection)

#### Generate "trench-row" detection output

After determining your desired hyperparameters, set them in the next cell and run it to produce output for later steps. **Note: The thresholding parameters do not need to be specified at this point.**

In [None]:
y_percentiles_smoothed_list = copy.copy(row_detection.result)

#### Tune "trench-row" cropping hyperparameters

Next, we will use the detected rows to perform cropping of the input image in the y-dimension:

1. Determine edges of trench rows based on threshold mask.
2. Filter out rows that are too small.
3. Perform cropping using the "end" of the row as reference (the end referring to the part of the trench farthest from the feeding channel).

This method uses the following `kwargs`, which you can tune here:

**y_min_edge_dist (int)** : Minimum row length necessary for detection.

**padding_y (int)** : Padding to be used when cropping in the y-dimension.

**trench_len_y (int)** : Length from the end of the trenches to be used when cropping in the y-dimension.

**top_orientation (int)** : The orientation of the top-most row where 0 corresponds to a trench with a downward-oriented trench opening and 1 corresponds to a trench with an upward-oriented trench opening.

**vertical_spacing (float)** : Parameter for setting the distance of plots being viewed.
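
A minimal sketch of the cropping steps above, assuming the row mask is a 1D boolean array along y. `find_rows` and `crop_row` are hypothetical helpers, and the way `padding_y` is applied here (on the closed-end side only) is an assumption rather than TrenchRipper's documented behaviour.

```python
import numpy as np


def find_rows(mask, y_min_edge_dist=50):
    # Edges of each run of True values; stop indices are exclusive.
    padded = np.concatenate(([False], mask, [False])).astype(int)
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)
    stops = np.flatnonzero(diff == -1)
    # Drop rows shorter than y_min_edge_dist.
    return [(int(a), int(b)) for a, b in zip(starts, stops)
            if b - a >= y_min_edge_dist]


def crop_row(img, row, orientation, padding_y=20, trench_len_y=270):
    start, stop = row
    if orientation == 0:
        # Opening points down, so the closed end is at the top of the row.
        lo = max(start - padding_y, 0)
        hi = min(start + trench_len_y, img.shape[0])
    else:
        # Opening points up, so the closed end is at the bottom of the row.
        lo = max(stop - trench_len_y, 0)
        hi = min(stop + padding_y, img.shape[0])
    return img[lo:hi]


mask = np.zeros(400, dtype=bool)
mask[50:150] = True   # real row
mask[200:210] = True  # too short, filtered out
mask[250:380] = True  # real row
rows = find_rows(mask, y_min_edge_dist=50)

img = np.zeros((400, 10))
top_crop = crop_row(img, rows[0], orientation=0, padding_y=20, trench_len_y=100)
bottom_crop = crop_row(img, rows[1], orientation=1, padding_y=20, trench_len_y=100)
print(rows, top_crop.shape, bottom_crop.shape)
```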

Running the following widget will display y-cropped images for each fov and timepoint.

In [None]:
matplotlib.rcParams["figure.figsize"] = [20, 10]
y_cropping = interactive(
    interactive_kymograph.preview_y_crop,
    {"manual": True},
    y_percentiles_smoothed_list=fixed(y_percentiles_smoothed_list),
    imported_array_list=fixed(imported_array_list),
    y_min_edge_dist=IntSlider(value=50, min=10, max=200, step=10),
    padding_y=IntSlider(value=20, min=0, max=100, step=1),
    trench_len_y=IntSlider(value=270, min=0, max=1000, step=10),
    vertical_spacing=FloatSlider(value=0.9, min=0.0, max=2.0, step=0.01),
    expected_num_rows=IntText(value=2, description="Number of Rows:", disabled=False),
    orientation_detection=Dropdown(
        options=[0, 1, "phase"], value=0, description="Orientation:", disabled=False
    ),
    orientation_on_fail=Dropdown(
        options=[None, 0, 1],
        value=0,
        description="Orientation when < expected rows:",
        disabled=False,
    ),
)
display(y_cropping)

#### Generate "trench-row" cropping output

After determining your desired hyperparameters, set them in the next cell and run it to produce output for later steps.

In [None]:
cropped_in_y_list = copy.copy(y_cropping.result)

#### Tune trench detection hyperparameters

Next, we will detect the positions of trenches in the y-cropped images as follows:

1. Reduce each 2D image to a 1D signal along the x-axis by computing the qth percentile of the data along the y-axis.
2. Determine the signal background by smoothing this signal using a large median kernel.
3. Subtract the background signal.
4. Smooth the resultant signal using a median kernel.
5. Use an [Otsu threshold](https://imagej.net/Auto_Threshold#Otsu) to determine the trench midpoint positions.

This method uses the following `kwargs`, which you can tune here:

**x_percentile (int)** : Percentile to use for step 1.

**background_kernel_x (int)** : Median kernel size to use for step 2.

**smoothing_kernel_x (int)** : Median kernel size to use for step 4.

**otsu_nbins (int)** : Number of bins to use in the Otsu's method histogram.

**otsu_scaling (float)** : Scaling factor to apply to the threshold determined by Otsu's method.

**vertical_spacing (float)** : Parameter for setting the distance of plots being viewed.
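
The five steps above can be sketched as follows on a synthetic y-cropped image. This is an illustration built on `scipy.signal.medfilt` and `skimage.filters.threshold_otsu`; `detect_trench_midpoints` is a hypothetical function, and taking the centre of each above-threshold run as the midpoint is an assumption.

```python
import numpy as np
from scipy.signal import medfilt
from skimage.filters import threshold_otsu


def detect_trench_midpoints(img, x_percentile=85, background_kernel_x=21,
                            smoothing_kernel_x=9, otsu_nbins=50,
                            otsu_scaling=1.0):
    # Step 1: collapse the y-axis (axis 0) to get a 1D profile along x.
    profile = np.percentile(img, x_percentile, axis=0)
    # Steps 2-3: estimate the background with a wide median kernel, subtract it.
    background = medfilt(profile, kernel_size=background_kernel_x)
    flattened = profile - background
    # Step 4: smooth the background-subtracted signal.
    smoothed = medfilt(flattened, kernel_size=smoothing_kernel_x)
    # Step 5: scaled Otsu threshold, then take the centre of each run.
    thr = threshold_otsu(smoothed, nbins=otsu_nbins) * otsu_scaling
    mask = smoothed > thr
    padded = np.concatenate(([False], mask, [False])).astype(int)
    diff = np.diff(padded)
    starts = np.flatnonzero(diff == 1)
    stops = np.flatnonzero(diff == -1)  # exclusive
    return (starts + stops - 1) / 2.0


# Synthetic image: 10-pixel-wide trenches every 30 pixels.
img = np.full((50, 200), 100.0)
for x0 in range(20, 200, 30):
    img[:, x0:x0 + 10] = 1000.0

midpoints = detect_trench_midpoints(img)
print(midpoints)
```

Note that the background kernel has to be wider than roughly twice the trench width; otherwise the median picks up the trenches themselves and subtracts them away.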

Running the following widget will display the smoothed 1-D signal for each of your timepoints. In addition, the threshold value for each fov will be displayed as a red line.

In [None]:
trench_detection = interactive(
    interactive_kymograph.preview_x_percentiles,
    {"manual": True},
    cropped_in_y_list=fixed(cropped_in_y_list),
    t=IntSlider(value=0, min=0, max=cropped_in_y_list[0].shape[4] - 1, step=1),
    x_percentile=IntSlider(value=85, min=50, max=100, step=1),
    background_kernel_x=IntSlider(value=21, min=1, max=601, step=20),
    smoothing_kernel_x=IntSlider(value=9, min=1, max=31, step=2),
    otsu_nbins=IntSlider(value=50, min=10, max=200, step=10),
    otsu_scaling=FloatSlider(value=0.25, min=0.0, max=2.0, step=0.01),
    vertical_spacing=FloatSlider(value=0.9, min=0.0, max=2.0, step=0.01),
)
display(trench_detection)

#### Generate trench detection output

After determining your desired hyperparameters, set them in the next cell and run it to produce output for later steps. **Note: The thresholding parameters do not need to be specified at this point.**

In [None]:
smoothed_x_percentiles_list = trench_detection.result

#### Check midpoint drift

Next, we will perform x-dimension drift correction of our detected midpoints as follows:

1. Begin at $t=1$.
2. For each $m \in \{midpoints(t)\}$, assign to $m$ the midpoint $n \in \{midpoints(t-1)\}$ that is closest to $m$. Midpoints at time $t-1$ that are not the closest midpoint to any $m$ are left unmapped.
3. Compute the translation of each mapped midpoint from time $t-1$ to $t$.
4. Take the average of these translations as the x-dimension drift from time $t-1$ to $t$.

This method uses the following `kwargs`, which you can tune here:

**vertical_spacing (float)** : Parameter for setting the distance of plots being viewed.
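
The nearest-midpoint matching described above can be sketched in a few lines of NumPy. `estimate_x_drift` is a hypothetical name and the midpoint values are made up for illustration.

```python
import numpy as np


def estimate_x_drift(midpoints_prev, midpoints_curr):
    prev = np.asarray(midpoints_prev, dtype=float)
    curr = np.asarray(midpoints_curr, dtype=float)
    # Distance from every midpoint at t to every midpoint at t-1.
    dist = np.abs(curr[:, np.newaxis] - prev[np.newaxis, :])
    # Closest midpoint at t-1 for each midpoint at t.
    nearest = np.argmin(dist, axis=1)
    # Per-midpoint translation, averaged to give the drift from t-1 to t.
    translations = curr - prev[nearest]
    return translations.mean()


prev = [10.0, 40.0, 70.0, 100.0]
curr = [12.0, 72.0, 102.0]  # shifted by 2 px, one trench not detected
drift = estimate_x_drift(prev, curr)
print(drift)
```

This also shows why sparse or discontinuous midpoints are a problem: if the drift between frames approaches half the trench spacing, the nearest neighbour is the wrong trench and the average is biased.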

Running the following widget will display the detected midpoints for each of your timepoints. If there is too much sparsity, or discontinuity, your drift correction will not be accurate.

In [None]:
midpoint_drift = interactive(
    interactive_kymograph.preview_midpoints,
    {"manual": True},
    smoothed_x_percentiles_list=fixed(smoothed_x_percentiles_list),
    vertical_spacing=FloatSlider(value=0.8, min=0.0, max=2.0, step=0.01),
)
display(midpoint_drift)

#### Generate midpoint drift output

After determining your desired hyperparameters, set them in the next cell and run it to produce output for later steps.

In [None]:
all_midpoints_list, x_drift_list = midpoint_drift.result

#### Tune trench cropping hyperparameters

Trench cropping simply uses the drift-corrected midpoints as a reference and crops out some fixed length around them to produce an output kymograph.

This method uses the following `kwargs`, which you can tune here:

**trench_width_x (int)** : Trench width to use for cropping.

**vertical_spacing (float)** : Parameter for setting the distance of plots being viewed.
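
A minimal sketch of this cropping step, assuming an image stack of shape `(t, y, x)`, a midpoint position at the first timepoint, and a cumulative x-drift per timepoint. `crop_kymograph` is a hypothetical helper and does no bounds checking at the image edges.

```python
import numpy as np


def crop_kymograph(stack, midpoint, x_drift, trench_width_x=30):
    half = trench_width_x // 2
    frames = []
    for t in range(stack.shape[0]):
        # Shift the reference midpoint by the drift at this timepoint.
        center = int(round(midpoint + x_drift[t]))
        frames.append(stack[t, :, center - half:center + half])
    return np.stack(frames)


# Synthetic stack whose content shifts right by 2 px per frame.
drift = [0, 2, 4]
stack = np.zeros((3, 20, 100))
for t in range(3):
    stack[t] = np.arange(100) - drift[t]

kymo = crop_kymograph(stack, midpoint=50, x_drift=drift, trench_width_x=10)
print(kymo.shape)
```

With the drift applied, every frame of the kymograph shows the same columns of the trench even though the raw image moved.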

Running the following widget will display a random kymograph for each row in each fov.

It will also produce midpoint plots showing retained midpoints.

In [None]:
matplotlib.rcParams["figure.figsize"] = [20, 10]
interact_manual(
    interactive_kymograph.preview_kymographs,
    cropped_in_y_list=fixed(cropped_in_y_list),
    all_midpoints_list=fixed(all_midpoints_list),
    x_drift_list=fixed(x_drift_list),
    trench_width_x=IntSlider(value=30, min=10, max=50, step=2),
    trench_present_thr=FloatSlider(value=0.0, min=0.0, max=1.0, step=0.05),
    vertical_spacing=FloatSlider(value=0.8, min=0.0, max=2.0, step=0.01),
)

#### Export and save hyperparameters

In [None]:
interactive_kymograph.process_results()

In [None]:
interactive_kymograph.write_param_file()

### Generate Kymograph

#### Start Dask Workers

In [None]:
dask_controller = tr.cluster.dask_controller(
    walltime="04:00:00",
    local=False,
    n_workers=100,
    memory="8GB",
    working_directory=headpath + "/dask",
)
dask_controller.startdask()
dask_controller.daskcluster.start_workers()

In [None]:
dask_controller.displaydashboard()

In [None]:
kymoclust = tr.kymograph.kymograph_cluster(
    headpath=headpath, trenches_per_file=100000, paramfile=True
)

In [None]:
kymoclust.generate_kymographs(dask_controller)

In [None]:
kymoclust.post_process(dask_controller)

#### Check kymograph statistics

In [None]:
kymoclust.kymo_report()

#### Shutdown Dask

In [None]:
dask_controller.shutdown()

In [None]:
import copy

import h5py
import numpy as np
import pandas as pd
import skimage as sk
import trenchripper as tr
from matplotlib import pyplot as plt

In [None]:
meta_handle = tr.utils.pandas_hdf5_handler(
    "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_1/metadata.hdf5"
)

In [None]:
cycle_1_meta = meta_handle.read_df("kymograph")

In [None]:
cycle_1_meta[:1000:100]

In [None]:
cycle_1_dict = {}
with h5py.File(
    "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_1/kymograph/kymograph_0.hdf5"
) as infile:
    for key in infile.keys():
        cycle_1_dict[key] = infile[key][:]

In [None]:
cycle_1_rfp_imgs = cycle_1_dict["RFP"][:, 0, 50:150]
cycle_1_cy5_imgs = cycle_1_dict["Cy5"][:, 0, 50:150]
cycle_1_cy7_imgs = cycle_1_dict["Cy7"][:, 0, 50:150]

In [None]:
cycle_1_rfp = np.percentile(
    cycle_1_rfp_imgs.reshape(cycle_1_rfp_imgs.shape[0], -1), 90, axis=1
)
cycle_1_cy5 = np.percentile(
    cycle_1_cy5_imgs.reshape(cycle_1_cy5_imgs.shape[0], -1), 90, axis=1
)
cycle_1_cy7 = np.percentile(
    cycle_1_cy7_imgs.reshape(cycle_1_cy7_imgs.shape[0], -1), 90, axis=1
)

# cycle_1_rfp = np.mean(cycle_1_rfp_imgs.reshape(cycle_1_rfp_imgs.shape[0],-1),axis=1)
# cycle_1_cy5 = np.mean(cycle_1_cy5_imgs.reshape(cycle_1_cy5_imgs.shape[0],-1),axis=1)
# cycle_1_cy7 = np.mean(cycle_1_cy7_imgs.reshape(cycle_1_cy7_imgs.shape[0],-1),axis=1)

In [None]:
cycle_1_meta["RFP"] = cycle_1_rfp
cycle_1_meta["CY5"] = cycle_1_cy5
cycle_1_meta["CY7"] = cycle_1_cy7

In [None]:
cycle_1_meta[:5]

In [None]:
meta_handle = tr.utils.pandas_hdf5_handler(
    "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_2/metadata.hdf5"
)

In [None]:
cycle_2_meta = meta_handle.read_df("kymograph")

In [None]:
cycle_2_meta[:1000:100]

In [None]:
cycle_2_dict = {}
with h5py.File(
    "/n/scratch2/de64/2020-01-10_strains_46_47/cycle_2/kymograph/kymograph_0.hdf5"
) as infile:
    for key in infile.keys():
        cycle_2_dict[key] = infile[key][:]

In [None]:
cycle_2_rfp_imgs = cycle_2_dict["RFP"][:, 0, 50:150]
cycle_2_cy5_imgs = cycle_2_dict["Cy5"][:, 0, 50:150]
cycle_2_cy7_imgs = cycle_2_dict["Cy7"][:, 0, 50:150]

In [None]:
cycle_2_rfp = np.percentile(
    cycle_2_rfp_imgs.reshape(cycle_2_rfp_imgs.shape[0], -1), 90, axis=1
)
cycle_2_cy5 = np.percentile(
    cycle_2_cy5_imgs.reshape(cycle_2_cy5_imgs.shape[0], -1), 90, axis=1
)
cycle_2_cy7 = np.percentile(
    cycle_2_cy7_imgs.reshape(cycle_2_cy7_imgs.shape[0], -1), 90, axis=1
)

# cycle_2_rfp = np.mean(cycle_2_rfp_imgs.reshape(cycle_2_rfp_imgs.shape[0],-1),axis=1)
# cycle_2_cy5 = np.mean(cycle_2_cy5_imgs.reshape(cycle_2_cy5_imgs.shape[0],-1),axis=1)
# cycle_2_cy7 = np.mean(cycle_2_cy7_imgs.reshape(cycle_2_cy7_imgs.shape[0],-1),axis=1)

In [None]:
cycle_2_meta["RFP"] = cycle_2_rfp
cycle_2_meta["CY5"] = cycle_2_cy5
cycle_2_meta["CY7"] = cycle_2_cy7

In [None]:
cycle_2_meta[:5]

In [None]:
cycle_1_meta[:5]

In [None]:
cycle_1_meta = cycle_1_meta.set_index(
    ["fov", "row", "trench"], drop=True, append=False, inplace=False
)
cycle_1_meta = cycle_1_meta.sort_index()
cycle_2_meta = cycle_2_meta.set_index(
    ["fov", "row", "trench"], drop=True, append=False, inplace=False
)
cycle_2_meta = cycle_2_meta.sort_index()

In [None]:
out_df = []
for fov in cycle_1_meta.index.get_level_values("fov").unique().tolist():
    # Per-FOV tables, indexed by (row, trench).
    working_cycle_1_meta = copy.copy(cycle_1_meta.loc[fov])
    working_cycle_2_meta = copy.copy(cycle_2_meta.loc[fov])
    # Pairwise distances between cycle 1 and cycle 2 trench positions.
    x_diff_mat = np.subtract.outer(
        working_cycle_1_meta["x (global)"], working_cycle_2_meta["x (global)"]
    )
    y_diff_mat = np.subtract.outer(
        working_cycle_1_meta["y (global)"], working_cycle_2_meta["y (global)"]
    )
    dist_mat = (x_diff_mat**2 + y_diff_mat**2) ** (1 / 2)
    # Nearest cycle 2 trench for each cycle 1 trench (not used in the assignment below).
    matched_idx = np.argmin(dist_mat, axis=1)
    matched_cycle_2 = working_cycle_2_meta.iloc[matched_idx]
    # Cycle 2 intensities are aligned on the (row, trench) index;
    # trenches with no cycle 2 counterpart become NaN and are dropped later.
    working_cycle_1_meta["RFP2"] = working_cycle_2_meta["RFP"]
    working_cycle_1_meta["CY52"] = working_cycle_2_meta["CY5"]
    working_cycle_1_meta["CY72"] = working_cycle_2_meta["CY7"]
    working_cycle_1_meta["fov"] = fov
    out_df.append(working_cycle_1_meta)
out_df = pd.concat(out_df)

out_df = out_df[~out_df["RFP2"].isna()]
out_df = out_df[~out_df["CY52"].isna()]
out_df = out_df[~out_df["CY72"].isna()]

out_df = out_df.reset_index(inplace=False)
out_df = out_df.set_index(
    ["fov", "row", "trench"], drop=True, append=False, inplace=False
)
out_df = out_df.sort_index()

In [None]:
all_rfp = np.array(out_df.loc[1:]["RFP"].tolist() + out_df.loc[1:]["RFP2"].tolist())
all_cy5 = np.array(out_df.loc[1:]["CY5"].tolist() + out_df.loc[1:]["CY52"].tolist())
all_cy7 = np.array(out_df.loc[1:]["CY7"].tolist() + out_df.loc[1:]["CY72"].tolist())

In [None]:
plt.hist(out_df.loc[1:]["RFP"].tolist(), bins=200)
plt.xlim(600, 1000)
plt.show()
plt.hist(out_df.loc[1:]["RFP2"].tolist(), bins=200)
plt.xlim(600, 1000)
plt.show()
plt.hist(all_rfp, bins=200)
plt.xlim(600, 1000)
plt.show()

In [None]:
plt.hist(out_df.loc[1:]["CY5"].tolist(), bins=200)
plt.xlim(300, 5000)
plt.show()
plt.hist(out_df.loc[1:]["CY52"].tolist(), bins=200)
plt.xlim(300, 5000)
plt.show()
plt.hist(all_cy5, bins=200)
plt.xlim(300, 5000)
plt.show()

In [None]:
plt.hist(out_df.loc[1:]["CY7"].tolist(), bins=200)
plt.xlim(200, 500)
plt.show()
plt.hist(out_df.loc[1:]["CY72"].tolist(), bins=200)
plt.xlim(200, 500)
plt.show()
plt.hist(all_cy7, bins=200)
plt.xlim(200, 500)
plt.show()

In [None]:
rfp_thr = np.median(all_rfp)
cy5_thr = np.median(all_cy5)
cy7_thr = np.median(all_cy7)

In [None]:
print(rfp_thr, cy5_thr, cy7_thr)

In [None]:
rfp_thr = sk.filters.threshold_triangle(all_rfp)
cy5_thr = sk.filters.threshold_triangle(all_cy5)
cy7_thr = sk.filters.threshold_triangle(all_cy7)

In [None]:
print(rfp_thr, cy5_thr, cy7_thr)

In [None]:
## practical best
rfp_thr = sk.filters.threshold_triangle(all_rfp)
cy5_thr = np.median(all_cy5)
cy7_thr = sk.filters.threshold_triangle(all_cy7)

In [None]:
print(rfp_thr, cy5_thr, cy7_thr)

In [None]:
rfp_ratio = out_df.loc[1:]["RFP"] / out_df.loc[1:]["RFP2"]
cy5_ratio = out_df.loc[1:]["CY5"] / out_df.loc[1:]["CY52"]
cy7_ratio = out_df.loc[1:]["CY7"] / out_df.loc[1:]["CY72"]

In [None]:
plt.hist(rfp_ratio, bins=50)
plt.show()

In [None]:
plt.hist(cy5_ratio, bins=50)
plt.show()

In [None]:
plt.hist(cy7_ratio, bins=50)
plt.show()

In [None]:
rfp_0 = rfp_ratio > 1.05
rfp_1 = rfp_ratio < 0.95
rfp_ambiguous = (~rfp_0) & (~rfp_1)
cy5_0 = cy5_ratio > 1.05
cy5_1 = cy5_ratio < 0.95
cy5_ambiguous = (~cy5_0) & (~cy5_1)
cy7_0 = cy7_ratio > 1.05
cy7_1 = cy7_ratio < 0.95
cy7_ambiguous = (~cy7_0) & (~cy7_1)

In [None]:
ratio_ambiguous = np.any(
    np.array([rfp_ambiguous, cy5_ambiguous, cy7_ambiguous]), axis=0
)

In [None]:
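# rfp_1_onecyc, cy5_1_onecyc and cy7_1_onecyc are defined in the marginal-error cells below; run those first.
one_cyc_ambiguous = rfp_1_onecyc & cy5_1_onecyc & cy7_1_onecyc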

In [None]:
ambiguous = one_cyc_ambiguous & ratio_ambiguous

In [None]:
rfp_1_onecyc = out_df.loc[1:]["RFP"] < (rfp_thr)
TP = np.sum((rfp_1_onecyc & rfp_1) & ~rfp_ambiguous)
FP = np.sum((rfp_1_onecyc & rfp_0) & ~rfp_ambiguous)
FN = np.sum(((~rfp_1_onecyc) & rfp_1) & ~rfp_ambiguous)
TN = np.sum(((~rfp_1_onecyc) & rfp_0) & ~rfp_ambiguous)
# Miscalls relative to correct calls (FP + FN over TP + TN), not a fraction of all calls.
marginal_rfp_err = (FP + FN) / (TP + TN)

In [None]:
print(TP, FP, FN, TN)
print(marginal_rfp_err)

In [None]:
cy5_1_onecyc = out_df.loc[1:]["CY5"] < (cy5_thr)
TP = np.sum((cy5_1_onecyc & cy5_1) & ~cy5_ambiguous)
FP = np.sum((cy5_1_onecyc & cy5_0) & ~cy5_ambiguous)
FN = np.sum(((~cy5_1_onecyc) & cy5_1) & ~cy5_ambiguous)
TN = np.sum(((~cy5_1_onecyc) & cy5_0) & ~cy5_ambiguous)
# Miscalls relative to correct calls (FP + FN over TP + TN), not a fraction of all calls.
marginal_cy5_err = (FP + FN) / (TP + TN)

In [None]:
print(TP, FP, FN, TN)
print(marginal_cy5_err)

In [None]:
cy7_1_onecyc = out_df.loc[1:]["CY7"] < (cy7_thr)
TP = np.sum((cy7_1_onecyc & cy7_1) & ~cy7_ambiguous)
FP = np.sum((cy7_1_onecyc & cy7_0) & ~cy7_ambiguous)
FN = np.sum(((~cy7_1_onecyc) & cy7_1) & ~cy7_ambiguous)
TN = np.sum(((~cy7_1_onecyc) & cy7_0) & ~cy7_ambiguous)
# Miscalls relative to correct calls (FP + FN over TP + TN), not a fraction of all calls.
marginal_cy7_err = (FP + FN) / (TP + TN)

In [None]:
print(TP, FP, FN, TN)
print(marginal_cy7_err)

In [None]:
rfp_ratio_err = np.sum(rfp_0) / (np.sum(rfp_0) + np.sum(rfp_1))
cy5_ratio_err = np.sum(cy5_1) / (np.sum(cy5_0) + np.sum(cy5_1))

In [None]:
print(rfp_ratio_err, cy5_ratio_err)

In [None]:
rfp_noratio_err = np.sum((~rfp_1_onecyc) & (~one_cyc_ambiguous)) / (
    np.sum((rfp_1_onecyc) & (~one_cyc_ambiguous))
    + np.sum((~rfp_1_onecyc) & (~one_cyc_ambiguous))
)
cy5_noratio_err = np.sum((cy5_1_onecyc) & (~one_cyc_ambiguous)) / (
    np.sum((cy5_1_onecyc) & (~one_cyc_ambiguous))
    + np.sum((~cy5_1_onecyc) & (~one_cyc_ambiguous))
)

In [None]:
print(rfp_noratio_err, cy5_noratio_err)

In [None]:
print(np.sum(one_cyc_ambiguous))

In [None]:
cycle_1_cy5_imgs.shape

In [None]:
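# Assumption: tri_thr is not defined elsewhere in this notebook; here it is
# taken to be the triangle threshold of the cycle 1 Cy5 values.
tri_thr = sk.filters.threshold_triangle(cycle_1_cy5)
no_signal = cycle_1_cy5_imgs[cycle_1_cy5 < tri_thr]
signal = cycle_1_cy5_imgs[cycle_1_cy5 > tri_thr]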

In [None]:
plt.imshow(signal[11000])

### Discussion

 - It does not look like any of the dyes is significantly better or worse than the others.
 - Once ambiguous trenches (probably empty) are accounted for, the error rates drop significantly (to under 2%) for RFP and CY5.
     - This is possible to compute because each of them has the same bit throughout the population.
 - As for marginal error (taking the ratio measurement as ground truth), the rates hover around 5%.
     - Interpretation is difficult since two bits are uniform across the population.
     - At least in the case of CY7, the marginal error is ~5%.
 - Overall, I have the sense that the error rate is between 2% and 5% (this will possibly improve with better filtering of empty trenches and other methodological improvements to the signal).
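
To put the 2-5% per-bit estimate in context, the sketch below computes the chance that a whole barcode is read with no errors. It assumes bit errors are independent and uses a 30-bit barcode purely as an illustrative length; `p_error_free` is a hypothetical helper.

```python
def p_error_free(per_bit_error, barcode_len):
    # Probability that every bit of the barcode is read correctly,
    # assuming independent errors with the same rate at each bit.
    return (1.0 - per_bit_error) ** barcode_len


for err in (0.02, 0.05):
    print(f"per-bit error {err:.2f}: "
          f"P(error-free 30-bit read) = {p_error_free(err, 30):.3f}")
```

Even at the low end of the estimate, a large share of reads will contain at least one flipped bit, so assignment has to tolerate a few mismatches rather than require exact matches.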

In [None]:
import copy
import itertools

import numpy as np
from matplotlib import pyplot as plt

In [None]:
import scipy as sp
import scipy.stats

In [None]:
binom = sp.stats.binom(30, 0.5)

In [None]:
x = list(range(30))
y = binom.pmf(x)

N = 100000
plt.bar(x, y)
plt.show()

In [None]:
p_one_away = 1.0 - ((1.0 - y[1]) ** (N))

In [None]:
p_one_away

In [None]:
p_two_away = 1.0 - ((1.0 - y[2]) ** (N))

In [None]:
p_two_away

In [None]:
x = list(range(3))
y = binom.pmf(x)

N = 100000
plt.bar(x, y * N)
plt.show()

In [None]:
barcode_len = 30
error_rate = 0.02
n_barcodes = 100000
exp_barcodes = 10000

barcodes = np.random.choice(np.array([True, False]), size=(n_barcodes, barcode_len))
sampled_idx = np.random.choice(range(barcodes.shape[0]), size=(exp_barcodes,))
sampled_barcodes = barcodes[sampled_idx]
errors = np.random.choice(
    [True, False],
    size=(sampled_barcodes.shape),
    replace=True,
    p=[error_rate, 1.0 - error_rate],
)
read_barcodes = copy.copy(sampled_barcodes)
read_barcodes[errors] = ~read_barcodes[errors]
hamming_arr = barcode_len - np.sum(
    sampled_barcodes[:, np.newaxis, :] == read_barcodes[np.newaxis, :, :], axis=2
)

In [None]:
plt.hist(hamming_arr.flatten(), range=(0, 6), bins=7)
plt.show()

In [None]:
sampled_barcodes[:, np.newaxis, :].shape

In [None]:
read_barcodes[np.newaxis, :, :].shape

In [None]:
sampled_barcodes == read_barcodes

In [None]:
sampled_idx = np.random.choice(range(barcodes.shape[0]), size=(100,))

In [None]:
barcodes[sampled_idx]

In [None]:
def compute_assignment_err(
    barcode_len=18, error_rate=0.05, n_barcodes=100000, exp_barcodes=1000000
):
    # Random barcode library; exp_barcodes of them (drawn with replacement)
    # are taken to be present in the experiment.
    barcodes = np.random.choice(np.array([True, False]), size=(n_barcodes, barcode_len))
    sampled_idx = np.random.choice(range(barcodes.shape[0]), size=(exp_barcodes,))
    sampled_barcodes = barcodes[sampled_idx]

    # One noisy read per sampled barcode: each bit flips with probability error_rate.
    errors = np.random.choice(
        [True, False],
        size=sampled_barcodes.shape,
        replace=True,
        p=[error_rate, 1.0 - error_rate],
    )
    read_barcode_arr = copy.copy(sampled_barcodes)
    read_barcode_arr[errors] = ~read_barcode_arr[errors]

    # Hamming distance from each read to every sampled barcode.
    hdist_arr = []
    for i in range(read_barcode_arr.shape[0]):
        xor_out = np.logical_xor(read_barcode_arr[i], sampled_barcodes)
        hdist_arr.append(np.sum(xor_out, axis=1))
    hdist_arr = np.array(hdist_arr)

    # Assign each read to its nearest barcode; discard reads 3 or more bits away.
    matched_indices = np.argmin(hdist_arr, axis=1)
    matched_idx_hdist = np.min(hdist_arr, axis=1)
    within_tolerence = matched_idx_hdist < 3

    # Read i was generated from sampled barcode i.
    true_match = (matched_indices == np.arange(exp_barcodes))[within_tolerence]
    perc_discarded = np.sum(~within_tolerence) / exp_barcodes
    err_rate = np.sum(~true_match) / np.sum(within_tolerence)

    return err_rate, perc_discarded

In [None]:
compute_assignment_err(30, 0.03, 10000, 1000)

In [None]:
compute_assignment_err(12, 0.02, 10000, 1000)

In [None]:
barcode_lens = [10, 15, 20]
error_rates = [0.0, 0.02, 0.05, 0.1]
err_rates = []
perc_discardeds = []
for barcode_len in barcode_lens:
    err_rates_list = []
    perc_discardeds_list = []
    for error_rate in error_rates:
        err_rate, perc_discarded = compute_assignment_err(
            barcode_len, error_rate, 10000, 1000
        )
        err_rates_list.append(err_rate)
        perc_discardeds_list.append(perc_discarded)
    err_rates.append(err_rates_list)
    perc_discardeds.append(perc_discardeds_list)

In [None]:
err_rates = np.array(err_rates)
perc_discardeds = np.array(perc_discardeds)

In [None]:
import seaborn as sns

In [None]:
subsample_ratio = [np.round((1000 / (2**item)), decimals=4) for item in barcode_lens]

In [None]:
sns.heatmap(err_rates, xticklabels=error_rates, yticklabels=subsample_ratio)

In [None]:
sns.heatmap(perc_discardeds, xticklabels=error_rates, yticklabels=subsample_ratio)

In [None]:
plt.plot(err_rates[:, 1])

In [None]:
plt.plot(err_rates[:, 2])

In [None]:
plt.plot(err_rates[1, :])

In [None]:
plt.plot(err_rates[2, :])