## Ovrlpy

Analyze the vertical signal integrity (VSI) across the tissue section using `ovrlpy-0.2.1`.  

**Tool**: `ovrlpy-0.2.1`  
**Data Link**: [Supplemental Data for: Segmentation-free inference of cell types from in situ transcriptomics data](https://zenodo.org/records/3478502)  
- merfish_barcodes_example.csv: mRNA spot locations


In this notebook, we will use ovrlpy to investigate the [mouse hypothalamus data](https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248) (Moffitt et al., 2018).  
We want to create a signal embedding of the transcriptome, and a vertical signal incoherence map to identify locations with a high risk of containing spatial doublets.  


### packages

In [None]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import ovrlpy

### load the data

#### load the data and define settings and input files

merfish_barcodes_example.csv contains 3739360 rows and 13 columns:  
- Gene_name: Gene  
- Cell_name:   
- Animal_ID: 1  
- Bregma: -0.24  
- Animal_sex: Female  
- Behavior: Naive  
- Centroid_X: the x coordinate  
- Centroid_Y: the y coordinate  
- Centroid_Z: the z coordinate  
- Total_brightness:  
- Area:  
- Error_bit:  
- Error_direction:  

In [None]:
from pathlib import Path

data_folder_path = Path(
    "../data/mouse_hypothalamus/raw/"
)

result_folder = Path("../data/results/barcodes_xmpl")
result_folder.mkdir(exist_ok=True, parents=True)

In [None]:
columns = [
    "Centroid_X",
    "Centroid_Y",
    "Centroid_Z",
    "Gene_name",
    "Cell_name",
    "Total_brightness",
    "Area",
    "Error_bit",
    "Error_direction",
]

coordinate_df = pd.read_csv(
    data_folder_path / "merfish_barcodes_example.csv", usecols=columns
).rename(
    columns={
        "Centroid_X": "x",
        "Centroid_Y": "y",
        "Centroid_Z": "z",
        "Gene_name": "gene",
    }
)
# coordinate_df["gene"] = coordinate_df["gene"].str.decode("utf-8")


# remove dummy molecules; copy right away so the assignments below act on an
# independent DataFrame and do not trigger SettingWithCopyWarning
coordinate_df = coordinate_df.loc[
    ~coordinate_df["gene"].str.contains("Blank|NegControl"),
].copy()

coordinate_df["gene"] = coordinate_df["gene"].astype("category")

# shift the coordinates so that x and y start at zero (avoids negative values)
coordinate_df["x"] = coordinate_df["x"] - coordinate_df["x"].min()
coordinate_df["y"] = coordinate_df["y"] - coordinate_df["y"].min()

# plot every 1000th molecule as a quick overview of the section
coordinate_df[::1000].plot.scatter(x="x", y="y", s=1)
plt.gca().set_aspect("equal", adjustable="box")
plt.show()

coordinate_df.head()
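
The preprocessing steps above can be checked on a small table before running them on the full file. The following is a minimal sketch on made-up data: the gene names and coordinates are illustrative assumptions, not values from `merfish_barcodes_example.csv`.

```python
import pandas as pd

# Toy stand-in for the MERFISH table; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "gene": ["Gad1", "Blank-1", "Slc17a6", "NegControl3", "Gad1"],
        "x": [-5.0, 0.0, 2.5, 4.0, 10.0],
        "y": [-1.0, 3.0, 0.0, 7.0, 2.0],
    }
)

# Drop control barcodes, then copy so later assignments act on an independent frame.
is_control = toy["gene"].str.contains("Blank|NegControl")
toy = toy.loc[~is_control].copy()

# Shift each axis so that its smallest coordinate becomes zero.
toy["x"] = toy["x"] - toy["x"].min()
toy["y"] = toy["y"] - toy["y"].min()

print(toy)
```

Only the three non-control rows should remain, and both coordinate columns should start at zero.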

### Running the ovrlpy pipeline
ovrlpy provides a convenience function `run` that executes the entire pipeline. It returns a signal integrity map, a signal strength map, and a Visualizer object for visualizing the results.

In [None]:
signal_integrity, signal_strength, visualizer = ovrlpy.run(
    df=coordinate_df, cell_diameter=10, n_expected_celltypes=15, n_workers=13
)

In [None]:
# save the signal integrity and strength matrix for subsequent analysis
sig_integrity = pd.DataFrame(signal_integrity)
sig_strength = pd.DataFrame(signal_strength)

sig_integrity.to_csv(result_folder/"barcodes_signal_integrity.csv", index=False, header=False)
sig_strength.to_csv(result_folder/"barcodes_signal_strength.csv", index=False, header=False)
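
Because the matrices are written without index or header, they need to be read back with `header=None`. A minimal round-trip sketch, assuming a small random array in place of the real signal maps and a temporary directory in place of `result_folder`:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Small random matrix standing in for a signal integrity map (illustrative only).
rng = np.random.default_rng(0)
toy_map = rng.random((4, 6))

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "toy_signal_integrity.csv"

    # Same write settings as above: no index column, no header row.
    pd.DataFrame(toy_map).to_csv(out_file, index=False, header=False)

    # header=None keeps the first row as data instead of treating it as column names.
    reloaded = pd.read_csv(out_file, header=None).to_numpy()

print(reloaded.shape)
print(np.allclose(toy_map, reloaded))
```

Omitting `header=None` on reload would silently consume the first matrix row as column labels, so the reloaded map would be one row short.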

### Visualizing results
The visualizer object has a plotting method to show the embeddings of the sampled gene expression signal.

In [None]:
visualizer.plot_fit()

In the same way, the signal integrity map can be visualized, where visualization is cut off at regions below a certain signal strength threshold:

In [None]:
fig, ax = ovrlpy.plot_signal_integrity(
    signal_integrity, signal_strength, signal_threshold=3
)

In [None]:
fig, ax = ovrlpy.plot_signal_integrity(
    signal_integrity, signal_strength, signal_threshold=2
)
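
To get a feel for how the threshold changes what remains visible, the cut-off can be imitated with a NumPy masked array. This is only a sketch of the idea on invented values; the helper `mask_low_signal` is hypothetical, and ovrlpy's own rendering may differ (for example, by fading low-signal pixels rather than masking them outright).

```python
import numpy as np

# Invented 2x3 maps; real maps come from ovrlpy.run.
toy_integrity = np.array([[0.9, 0.4, 0.8], [0.2, 0.95, 0.6]])
toy_strength = np.array([[5.0, 1.0, 3.5], [4.0, 2.5, 0.5]])


def mask_low_signal(integrity, strength, signal_threshold):
    """Hypothetical helper: hide integrity values where signal strength is below the threshold."""
    return np.ma.masked_where(strength < signal_threshold, integrity)


masked_3 = mask_low_signal(toy_integrity, toy_strength, signal_threshold=3)
masked_2 = mask_low_signal(toy_integrity, toy_strength, signal_threshold=2)

# count() returns the number of unmasked (still visible) pixels.
print(masked_3.count(), masked_2.count())
```

Lowering the threshold from 3 to 2 keeps more pixels visible, which mirrors the difference between the two plots above.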

### Detecting doublets
We can detect individual doublet events with ovrlpy, again setting a signal strength threshold to filter out low-transcript regions:

In [None]:
doublet_df = ovrlpy.detect_doublets(
    signal_integrity, signal_strength, minimum_signal_strength=3, integrity_sigma=2
)

doublet_df.shape

In [None]:
doublet_df = ovrlpy.detect_doublets(
    signal_integrity, signal_strength, minimum_signal_strength=2, integrity_sigma=2
)

doublet_df.shape

In [None]:
doublet_df.to_csv(result_folder/"barcodes_doublet_df.csv", index=False)

In [None]:
_ = plt.scatter(
    doublet_df["x"],
    doublet_df["y"],
    c=doublet_df["integrity"],
    s=0.2,
    cmap="viridis",
    vmin=0,
    vmax=1,
)
_ = plt.gca().set_aspect("equal")
_ = plt.colorbar()

Having detected regions of potential doublets, we can pass their (x, y) locations to `ovrlpy.plot_region_of_interest` to view them as close-up clouds of transcript molecules, colored by the Visualizer's learned color embedding.

In [None]:
doublet_case = 0

x, y = doublet_df.loc[doublet_case, ["x", "y"]]

_ = ovrlpy.plot_region_of_interest(
    x, y, coordinate_df, visualizer, signal_integrity, signal_strength, window_size=60
)
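
One way to choose which case to inspect is to rank the candidates by integrity, so that `doublet_case = 0` refers to the lowest-integrity location. A minimal sketch on an invented table that only assumes the `x`, `y`, and `integrity` columns used above:

```python
import pandas as pd

# Invented doublet candidates; the real table comes from ovrlpy.detect_doublets.
toy_doublets = pd.DataFrame(
    {
        "x": [10.0, 42.0, 7.0, 88.0],
        "y": [5.0, 17.0, 60.0, 33.0],
        "integrity": [0.6, 0.2, 0.45, 0.3],
    }
)

# Sort ascending and reset the index so that label 0 is the lowest-integrity case.
ranked = toy_doublets.sort_values("integrity").reset_index(drop=True)

doublet_case = 0
x, y = ranked.loc[doublet_case, ["x", "y"]]
print(x, y)
```

Resetting the index matters here because `.loc` selects by label; without it, `doublet_case = 0` would still point at the first row of the unsorted table.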

### Other functionality
Furthermore, we can save the visualizer object to a file for later use with the `pickle` module:

In [None]:
import pickle

with open(result_folder / "my_analysis.pickle", "wb") as file:
    pickle.dump(visualizer, file)

... and easily reload it if needed.

In [None]:
with open(result_folder / "my_analysis.pickle", "rb") as file:
    visualizer = pickle.load(file)

Additionally, the analysis has produced a global z-level adjustment of the transcriptome coordinates, which can be used to create a z-stack of adjacent, aligned sections in silico:

In [None]:
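fig = plt.figure(figsize=(20, 5))
ax = fig.add_subplot(111, projection="3d")

# depth of each molecule relative to the z-level adjustment computed by ovrlpy
z_relative = coordinate_df.z - coordinate_df.z_delim

for i in range(-2, 3):
    # molecules in the slab [i, i + 1], subsampled to every 100th row
    subset = coordinate_df[z_relative.between(i, i + 1)][::100]

    ax.scatter(
        subset.x,
        subset.y,
        # one z value per plotted point, so the array lengths always match
        np.full(len(subset), i),
        s=1,
        alpha=0.1,
    )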
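
The slab assignment can also be written as a single binning step. Note that `Series.between` includes both endpoints by default, so a molecule lying exactly on a slab boundary is selected by two neighbouring slabs; flooring the relative depth assigns each molecule to exactly one. A minimal sketch on invented values, assuming the same `z` and `z_delim` columns as above:

```python
import numpy as np
import pandas as pd

# Invented depths; in the notebook, z_delim is expected on coordinate_df after the pipeline run.
toy = pd.DataFrame(
    {
        "z": [0.0, 1.5, 3.0, 4.2, 6.9, 2.0],
        "z_delim": [2.0, 2.0, 2.5, 3.0, 3.0, 2.0],
    }
)

# Relative depth, floored to an integer slab index (slab i covers [i, i + 1)).
toy["slab"] = np.floor(toy["z"] - toy["z_delim"]).astype(int)

# Keep the same five slabs as the plot above and count molecules per slab.
in_range = toy["slab"].between(-2, 2)
counts = toy.loc[in_range, "slab"].value_counts().sort_index()
print(counts)
```

The molecule with a relative depth of exactly 0 lands only in slab 0 here, whereas the `between`-based selection would place it in both slab -1 and slab 0.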