# ISS_decoding of preprocessed files

This notebook guides you through the steps necessary for the decoding of ISS image data that have been preprocessed using our `ISS_preprocessing` module.

In this step, we use the starfish library to extract the information contained in our images. starfish is a Python library for processing image-based spatial transcriptomics data. You can read more about it here: https://spacetx-starfish.readthedocs.io/en/latest/

In the following steps, we first convert our images to a starfish-compatible format (SpaceTx) and provide some information about our experimental design.
Once this is completed, we can proceed to the actual decoding of the data from the SpaceTx images.


For this notebook to work correctly, you should have a dedicated folder for each section you want to process, and each folder should contain the following subfolder tree: `/preprocessing/ReslicedTiles`. Under `/preprocessing` you might have other subfolders such as `/mipped/`, `/OME_tiffs/` and `/stitched/`, but these are irrelevant at this stage, and they will not exist in the output folder from `ISS_CARE` if you chose that option for denoising the images.

The important thing is that you have a `/preprocessing/ReslicedTiles/` folder, because the resliced images are the starting point of the decoding.

## We start by importing the necessary modules

In [1]:
import ISS_decoding.SpaceTx_format as STX
import ISS_decoding.decoding as DEC
import ISS_decoding.qc_metrics as QC
import pandas as pd

## Specify the path where your preprocessed sections are saved

Here we input the path to each one of the samples as an element of a list. Each element needs to contain the full path to the sample folder. We consider as "sample folder" the parent folder to `/preprocessing/`.


In [2]:
samples=['/home/marco/Downloads/media/marco/mountstuff/rfl_test/']
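
Before moving on, it can help to confirm that every sample folder really contains the `preprocessing/ReslicedTiles` subfolder described above. The following is a minimal sketch using only the standard library; the helper name `missing_resliced_tiles` and the demo folder names are hypothetical and not part of `ISS_decoding`.

```python
from pathlib import Path
import tempfile

def missing_resliced_tiles(samples):
    """Return the sample folders that lack preprocessing/ReslicedTiles."""
    return [s for s in samples
            if not (Path(s) / "preprocessing" / "ReslicedTiles").is_dir()]

# Demonstration on throwaway folders (hypothetical sample names).
with tempfile.TemporaryDirectory() as root:
    good = Path(root) / "sample_ok"
    bad = Path(root) / "sample_incomplete"
    (good / "preprocessing" / "ReslicedTiles").mkdir(parents=True)
    (bad / "preprocessing").mkdir(parents=True)
    missing = missing_resliced_tiles([str(good), str(bad)])
    missing_names = [Path(m).name for m in missing]

print(missing_names)
```

On your own data you would call `missing_resliced_tiles(samples)` and fix any folder it reports before formatting the images.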


    

## Format the images to SpaceTx format

The first thing to do before we can start the actual decoding is to transform our images (the resliced tiles) to the SpaceTx format.

To read more about the SpaceTx format, read the following: https://github.com/spacetx/sptx-format

To complete this step, you need to provide the path to a `codebook_csv`. This file associates each gene with a unique color sequence across ISS cycles. It is a comma-separated file with no header, in which the first column contains the gene name, while the following columns (one per cycle, i.e. columns 2 to 7 in the case of a 6-cycle experiment) contain numbers representing the expected positive DO_decorator in each cycle for that gene.
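
To make the codebook layout concrete, here is a small sketch that parses a headerless codebook and runs two sanity checks (same number of cycles for every gene, no duplicated barcode). The gene names and digits are invented for illustration, and `read_codebook` is a hypothetical helper, not a function of `ISS_decoding`.

```python
import csv
import io

# Hypothetical codebook content: gene name, then one number per cycle.
codebook_text = """GeneA,1,2,3,4
GeneB,2,2,1,3
GeneC,4,1,1,2
"""

def read_codebook(handle):
    """Parse a headerless codebook into {gene: [digit per cycle]}."""
    codebook = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        gene, *digits = row
        codebook[gene] = [int(d) for d in digits]
    return codebook

codebook = read_codebook(io.StringIO(codebook_text))

# Every gene needs one entry per cycle, and every barcode must be unique.
n_cycles = {len(seq) for seq in codebook.values()}
assert len(n_cycles) == 1, "all genes need the same number of cycles"
sequences = [tuple(seq) for seq in codebook.values()]
assert len(set(sequences)) == len(sequences), "barcodes must be unique"

print(codebook)
```

For a real file, replace the `io.StringIO` object with `open('/path/to/codebook.csv', newline='')`.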

Let's look at the arguments taken by the function:

`filenames` = type: `list`. This list contains the names of the subfolders in the ReslicedTiles folder. Depending on how many cycles you have in your experiment, you will have to shorten it. For example, if you only have 4 cycles, you should remove: 'Base_5_stitched','Base_6_stitched'. 

`tile_dim` = type: `int`. The dimensions of your ReslicedTiles. Default = 2000. Make sure this matches the `tile_dim` size you used in the `tile_stitched_images` function in the `ISS_preprocessing` module.

`pixelscale`  = type: `float`. This is the size of the pixels in microns. It is needed if you want the coordinates of the spots to be output in microns, and it depends on your microscope settings. Set it to 1 if you want the coordinates to stay in pixels. Default = 0.1625.

`channels` = type: `list`. The channels, in the order they were acquired in the microscope. Default = ["AF750", "Cy5", "Cy3", "AF488", "DAPI"]

`DO_decorators` =  type: `list`. **This point can be tricky to understand.** This shows how you associate the numbers in the codebook (i.e. 1,2,3,4) with a specific list of colors (DO_decorators). In our lab, 1,2,3,4 correspond to  ["AF750", "488", "Cy3",  "Cy5"]. Default =  ["AF750", "488", "cy3",  "Cy5"]. Users who follow our barcode design and readout schemes should not change this.

`folder_spacetx`  = this is the name of the folder where the SpaceTx files will be saved. Default = SpaceTX_format.

`nuclei_channel`  = type: `int`. This is the number of the channel that corresponds to your nuclei stained image. This is 1-indexed and our default = 5. 
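
Two of the arguments above lend themselves to a short illustration: building `filenames` for a given number of cycles, and reading codebook numbers through `DO_decorators`. This sketch assumes that the subfolders follow the `Base_<n>_stitched` pattern and that codebook number k points to the k-th entry of `DO_decorators` (1-based), as the text suggests. Both helper functions are hypothetical.

```python
def cycle_filenames(n_cycles):
    """Subfolder names for the first n_cycles cycles ('Base_<n>_stitched')."""
    return [f"Base_{i}_stitched" for i in range(1, n_cycles + 1)]

def digits_to_channels(digits, do_decorators):
    """Translate codebook numbers (1-based) into DO_decorator names."""
    return [do_decorators[d - 1] for d in digits]

# Lab convention quoted in the text: 1,2,3,4 -> AF750, 488, Cy3, Cy5.
DO_decorators = ["AF750", "488", "Cy3", "Cy5"]

print(cycle_filenames(4))
print(digits_to_channels([1, 2, 3, 4], DO_decorators))
print(digits_to_channels([4, 1, 1, 2], DO_decorators))
```

Under these assumptions, a codebook row `GeneC,4,1,1,2` would mean that GeneC is expected in Cy5 in cycle 1, AF750 in cycles 2 and 3, and 488 in cycle 4.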








In [12]:

for sample in samples:
    STX.make_spacetx_format(
        path=sample,
        codebook_csv='/home/marco/Downloads/media/marco/mountstuff/codebook_organoid.csv',
        filenames=['Base_1_stitched', 'Base_2_stitched',
                   'Base_3_stitched', 'Base_4_stitched', 'Base_5_stitched'],
        tile_dim=4000,
        pixelscale=1,
        channels=["AF750", "Cy5", "Cy3", "AF488", "DAPI", "At425"],
        DO_decorators=["AF750", "AF488", "Cy3", "Cy5", "At425"],
        folder_spacetx='SpaceTX_format_rfl',
        nuclei_channel=4,
    )


## Decode the SpaceTx formatted data


At this stage, after running the above code block, you should have successfully transformed your preprocessed resliced tiles into the SpaceTx format. Each `/preprocessing/SpaceTX_format/` folder should now contain a file called `experiment.json`.

To decode the images, we now need to use the `process_experiment` function.


In brief, the function takes each individual SpaceTx image file, looks for the same spot across cycles, deduces a colour sequence for each spot across the different cycles, and matches the extracted sequence to the `codebook` we provided previously.

In detail, this process is a bit more complex, and several steps happen within the `process_experiment` function. We suggest you have a look at the relevant part of the manual for an exhaustive explanation of what happens under the hood.
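
The brief description above can be illustrated with a toy example: for one spot, pick the brightest channel in each cycle and look the resulting sequence up in the codebook. This is only a sketch of the idea behind per-round-max-channel decoding, not the starfish or `process_experiment` implementation; the intensities, gene names and barcodes are invented.

```python
# Toy intensities for one spot: one row per cycle, one column per channel.
channels = ["AF750", "488", "Cy3", "Cy5"]
spot = [
    [0.90, 0.05, 0.03, 0.02],  # cycle 1: AF750 is brightest
    [0.10, 0.70, 0.10, 0.10],  # cycle 2: 488 is brightest
    [0.05, 0.05, 0.80, 0.10],  # cycle 3: Cy3 is brightest
    [0.02, 0.08, 0.10, 0.80],  # cycle 4: Cy5 is brightest
]

# Hypothetical codebook: gene -> expected channel in each cycle.
codebook = {
    "GeneA": ("AF750", "488", "Cy3", "Cy5"),
    "GeneB": ("488", "488", "AF750", "Cy3"),
}

def call_sequence(spot, channels):
    """Pick the brightest channel in every cycle."""
    return tuple(channels[row.index(max(row))] for row in spot)

def decode(spot, channels, codebook):
    """Return the gene whose barcode equals the called sequence, or None."""
    called = call_sequence(spot, channels)
    for gene, barcode in codebook.items():
        if barcode == called:
            return gene
    return None

print(call_sequence(spot, channels))
print(decode(spot, channels, codebook))
```

A spot whose called sequence is not in the codebook returns `None` here, which corresponds to a read without an assigned gene.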

### Changing decoding parameters

By modifying some of the input parameters of the `process_experiment` function, it is possible to explore a range of settings and optimize the decoding for each specific experiment, if necessary.

`Exp_path`: here you have to specify the path to the `experiment.json` described above. This is normally inside the `SpaceTX_format` subfolder.

`Output`: This is the output folder, where the decoded data will be deposited. We recommend that this path points to a new empty folder, since the function generates an individual csv for every tile processed.

`Register`: select `True` if you want to register/align your images and `False` if you don't. In principle, registration should already be taken care of by the `ISS_preprocessing` module. Nevertheless, enabling this step is still recommended, as small adjustments in the registration of the images sometimes improve the results.

`Register_dapi`: specify `True` if you want to align your images based on the DAPI/nuclear staining. If it is set to `False`, the alignment will be done based on a pseudoanchor (please read the glossary in the manual to know what this is). This parameter is only considered if `Register` is `True`.

`Masking radius`: the radius of the top hat filter for spot detection. Depending on the size, it will “smoothen” your data to a different extent. The standard values go between 7 and 15. We recommend 7. Refer to the manual for further explanations.

`Threshold`: Defines the minimum intensity that a spot should have in order to be detected. Decreasing the threshold increases the number of detected spots, but also increases the chance of background signal being counted as a spot. 

`Sigma vals`: correspond to min_sigma, max_sigma, num_sigma. See https://spacetx-starfish.readthedocs.io/en/latest/api/spots/index.html#spot-finding. Our default values are set in a way that allows us to capture signals in our experimental settings, on both Leica and Zeiss microscopes and at both 20x and 40x magnifications.

`Decode_mode`: two options are given: ‘PRMC’ (per round max channel) or ‘MD’ (metric distance). We suggest using PRMC. More information on how the two methods differ is in the manual.

`Normalization method`: ‘MH’ or ‘CPTZ’. These are two alternative image normalization methods. We normally advise using 'MH' for ISS data. Please refer to the manual to understand the differences.

However the above parameters are specified, the output decoded files will always be .csv files consisting essentially of the location (XY position) and identity (gene) of every spot decoded. 
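
To give a feeling for what `sigma_vals = [1, 10, 30]` spans, the sketch below assumes that the three numbers end up as `min_sigma`, `max_sigma` and `num_sigma` of a Laplacian-of-Gaussian blob detector with linearly spaced sigmas, as in scikit-image's `blob_log`, where a 2D blob radius is roughly sqrt(2) times sigma. Whether `process_experiment` uses exactly this detector and spacing is an assumption here; the helper `sigma_grid` is hypothetical.

```python
import math

def sigma_grid(min_sigma, max_sigma, num_sigma):
    """Evenly spaced sigmas from min_sigma to max_sigma (inclusive)."""
    if num_sigma == 1:
        return [float(min_sigma)]
    step = (max_sigma - min_sigma) / (num_sigma - 1)
    return [min_sigma + i * step for i in range(num_sigma)]

sigmas = sigma_grid(1, 10, 30)

# Approximate 2D blob radius (in pixels) probed at each sigma.
radii = [math.sqrt(2) * s for s in sigmas]

print(len(sigmas), sigmas[0], round(sigmas[-1], 6))
print(round(radii[0], 2), round(radii[-1], 2))
```

Under these assumptions, widening the sigma range lets the detector pick up larger spots, at the cost of more computation per tile.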


In [None]:
for sample in samples:
    DEC.process_experiment(
        exp_path=sample + '/SpaceTX_format_rfl/experiment.json',
        output=sample + '/decoded_dense/',
        register=False,
        register_dapi=False,
        masking_radius=7,
        threshold=0.002,
        sigma_vals=[1, 10, 30],  # min_sigma, max_sigma, num_sigma
        decode_mode='PRMC',
        normalization_method='MH',
        dense=True,
    )
    


no FOVS done
decoding fov_000
getting images
not registering images
normalizing channel intensities
decoding with PerRoundMaxChannel
locating spots for channel 0


## Merging the decoded data.

As mentioned above, every single SpaceTx file (corresponding to a ReslicedTile) is decoded independently. This makes the decoding feasible, though slow, even on old laptops.

The last step is then to concatenate all the decoded csv files that belong to a single experiment. For this, we use the following function.

In [None]:
for sample in samples:
    print (sample)
    DEC.concatenate_starfish_output(path=sample+'/decoded_dense/',
                            outpath=sample+'/decoded_dense/')
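
Conceptually, merging boils down to stacking the per-tile tables. The sketch below shows this with pandas on two tiny, invented per-tile files; it is not the implementation of `concatenate_starfish_output`, and the file names, column values and the helper `concatenate_csvs` are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

def concatenate_csvs(folder, pattern="*.csv"):
    """Stack every CSV matching `pattern` in `folder` into one DataFrame."""
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {folder}")
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Demonstration with two tiny per-tile tables (hypothetical names and values).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"target": ["GeneA"], "xc": [10.0], "yc": [5.0]}).to_csv(
        Path(tmp) / "fov_000.csv", index=False)
    pd.DataFrame({"target": ["GeneB", "GeneA"], "xc": [3.0, 7.5], "yc": [1.0, 2.0]}).to_csv(
        Path(tmp) / "fov_001.csv", index=False)
    merged = concatenate_csvs(tmp)

print(merged)
```

Note that in this sketch the pattern `*.csv` would also match a merged file written into the same folder by an earlier run, so a narrower pattern or a separate output folder is the safer choice.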

In [14]:
for sample in samples:
    DEC.process_experiment(
        exp_path=sample + '/SpaceTX_format_rfl/experiment.json',
        output=sample + '/decoded_sparse/',
        register=False,
        register_dapi=False,
        masking_radius=7,
        threshold=0.002,
        sigma_vals=[1, 10, 30],  # min_sigma, max_sigma, num_sigma
        decode_mode='PRMC',
        normalization_method='MH',
        dense=False,
    )

no FOVS done
decoding fov_000
getting images
not registering images
normalizing channel intensities
locating spots
decoding with PerRoundMaxChannel
decoding fov_001
getting images
not registering images
normalizing channel intensities
locating spots
decoding with PerRoundMaxChannel
decoding fov_002
getting images
not registering images
normalizing channel intensities
locating spots
decoding with PerRoundMaxChannel
decoding fov_003
getting images
not registering images
normalizing channel intensities
locating spots
decoding with PerRoundMaxChannel


In [16]:
for sample in samples:
    print (sample)
    DEC.concatenate_starfish_output(path=sample+'/decoded_sparse/',
                            outpath=sample+'/decoded_sparse/')

/home/marco/Downloads/media/marco/mountstuff/rfl_test/
5
decoded.csv


IndexError: list index out of range

_________________

# Explore the decoded data

After combining the csv files of the individual tiles, we should get as output one big table (saved as csv) containing the information about our raw decoded spots. The main columns are:

1. the location of the decoded spot (columns “xc” and “yc”), and

2. the identity of every spot (column “target”).

Similarly to what happens in NGS experiments, these "raw" reads might have variable quality, and one key step before performing downstream analyses is to assess this quality and filter the data if necessary (**spoiler alert: it is always necessary to do some level of filtering**).

From now on, we need to work on each sample individually, because read quality has to be assessed separately for each sample.

First, we read the combined CSV file for one section. Adjust the path below accordingly.

In [51]:
reads=pd.read_csv('/path/to/your/decoding_output/decoded.csv')

The number of extracted raw reads for this section is:

In [52]:
len(reads)

417000

## Understanding read quality

To get an idea of what the read quality looks like, we can generate a violin plot of the qualities per cycle using the `quality_per_cycle` function. The function takes its data from the `reads` table above. The number of cycles needs to be specified manually by the user.

You can have a look at the manual to understand how the quality is calculated. In short, 1 is the theoretical maximum and 0.25 the theoretical minimum in a 4-colour setting. The more your violin plots are shifted towards 1, the better.

In [None]:
QC.quality_per_cycle(reads,cycles=4)
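
One definition that is consistent with the bounds quoted above (1 at best, 0.25 at worst with 4 colours) is the fraction of the total signal carried by the brightest channel in a cycle. The sketch below uses that definition as an assumption; the manual describes the actual calculation. The intensities are invented, and the names `quality_mean` and `quality_minimum` mirror the column names used later in this notebook.

```python
def cycle_quality(intensities):
    """Fraction of the total signal carried by the brightest channel."""
    total = sum(intensities)
    return max(intensities) / total if total > 0 else 0.0

# One spot: one row per cycle, one column per channel (4 colours).
spot = [
    [0.90, 0.05, 0.03, 0.02],  # clean cycle
    [0.25, 0.25, 0.25, 0.25],  # all channels equal: worst case
    [0.60, 0.30, 0.05, 0.05],  # some crosstalk
    [1.00, 0.00, 0.00, 0.00],  # single channel: best case
]

qualities = [cycle_quality(cycle) for cycle in spot]
quality_mean = sum(qualities) / len(qualities)
quality_minimum = min(qualities)

print(qualities)
print(quality_mean, quality_minimum)
```

With this definition, a single ambiguous cycle is enough to pull `quality_minimum` down, while `quality_mean` averages it out over all cycles.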

Another useful plot shows how the quality score metrics relate to whether a read has a match in the codebook ("assigned") or not ("non assigned"). We plot `quality_minimum` and `quality_mean` against the assignment status to understand which is the best strategy to filter our data. Ideally, we should be able to infer a good quality threshold for filtering from this plot.

In [None]:
QC.compare_scores(reads,score1='quality_mean',score2='quality_minimum',hue='assigned',kind='hist',color='#3266a8')


The following command does the same, but only on a single score.

In [None]:
QC.plot_scores(reads,on='quality_mean',hue='assigned',log_scale=False) 

## Filtering the data

There are different strategies to filter the data, depending on what the above plots show. You can refer to the manual for more specific examples.

A general common-sense criterion is to filter out all the reads whose `quality_minimum` is < 0.5. This discards all the reads that show poor quality in **at least one cycle**. This is quite a conservative criterion, but it is a good rule of thumb.


Other methods for filtering are also described in the relevant part of the manual.

In [53]:
reads_filt=QC.filter_reads(reads,min_quality_minimum=0.5)


The number of filtered reads for this section is:

In [54]:
len(reads_filt)

370251
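
The criterion described above amounts to a simple row filter. The sketch below expresses it directly in pandas on a tiny invented table; it assumes a `quality_minimum` column as used above and is not the implementation of `QC.filter_reads`, which may do more. The helper `keep_min_quality` is hypothetical.

```python
import pandas as pd

# Tiny invented table with the columns used for filtering.
reads_demo = pd.DataFrame({
    "target": ["GeneA", "GeneB", "GeneA", "GeneC"],
    "quality_minimum": [0.92, 0.41, 0.50, 0.30],
})

def keep_min_quality(df, threshold=0.5):
    """Keep rows whose quality_minimum is at least `threshold`."""
    return df[df["quality_minimum"] >= threshold].reset_index(drop=True)

filtered_demo = keep_min_quality(reads_demo)

print(len(reads_demo), len(filtered_demo))
print(f"fraction kept: {len(filtered_demo) / len(reads_demo):.2f}")
```

In the run shown above, 370251 of 417000 reads (about 88.8%) pass the filter.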

# Plot expression data

After quality-filtering, a sensible thing to do is to plot the gene expression decoded from your tissue. 

Parameters `xcolumn` and `ycolumn` define the XY positions of your spots in the `reads_filt` dataframe.

`key` points to the column containing the gene identity of your spots in the `reads_filt` dataframe.

`genes` allows you to plot a single gene per plot (=`individual`), or to plot all genes in a single plot (=`all`)

`size`= sets the dot size

`background`= sets the background color

`title_color`= sets the color of the title for each plot.

`colorcode`= sets the color of the expression dots

`figuresize`=(10,10) sets the size of the plot

`save`= defaults to `None`. If `True`, saves the plot in the specified `format`

`format`= saves the plot to a specific format (i.e. `pdf`)


In [None]:
QC.plot_expression(reads_filt, key='target', colorcode=['red'], xcolumn='xc', ycolumn='yc',
                   genes='individual', size=8, background='black', title_color='white',
                   figuresize=(10,10), save=None, format='pdf')

## Other useful functions

`quality_per_gene`: this function plots the quality of the reads assigned to each gene. Since every gene has an associated sequence of colours across cycles, there can be a large quality bias between genes that arises purely from the colour sequence (i.e. if a channel always has a bad signal-to-noise ratio, genes with many cycles in that channel will always have lower quality).

In [None]:
QC.quality_per_gene(reads,on='quality_mean',gene_name='target')
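
If you prefer numbers to a plot, a per-gene summary can be sketched with a pandas `groupby`. The table below is invented; the column names `target` and `quality_mean` are the ones used in this notebook, and the snippet is an illustration rather than what `quality_per_gene` does internally.

```python
import pandas as pd

# Tiny invented table: gene identity and mean quality of each read.
reads_demo = pd.DataFrame({
    "target": ["GeneA", "GeneA", "GeneB", "GeneB", "GeneB"],
    "quality_mean": [0.90, 0.80, 0.55, 0.60, 0.50],
})

# Number of reads and median quality per gene, worst genes first.
per_gene = (reads_demo.groupby("target")["quality_mean"]
            .agg(["count", "median"])
            .sort_values("median"))

print(per_gene)
```

Genes at the top of this table are the ones most affected by a quality threshold.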

`plot_frequencies`: allows us to plot the relative abundance of the decoded dots for each gene. This is useful to identify potential decoding artefacts. For example, a gene that looks 1000 times more abundant than the second most abundant gene should raise an eyebrow.

In [None]:
QC.plot_frequencies(reads,on='target')
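
The same check can be done numerically. The sketch below computes relative abundances with pandas and the ratio between the two most abundant genes, on invented data; it is an illustration and not what `plot_frequencies` does internally.

```python
import pandas as pd

# Invented gene calls: 6 x GeneA, 3 x GeneB, 1 x GeneC.
targets = pd.Series(["GeneA"] * 6 + ["GeneB"] * 3 + ["GeneC"], name="target")

# Relative abundance per gene, sorted from most to least abundant.
freq = targets.value_counts(normalize=True)

# How dominant is the top gene compared with the runner-up?
ratio = freq.iloc[0] / freq.iloc[1]

print(freq)
print(f"{freq.index[0]} is {ratio:.1f}x the second most abundant gene")
```

On real data you would pass `reads['target']` (or the filtered table) instead of the invented series.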