# Probabilistic cell typing

This notebook guides you through the integration of ISS data with pre-existing clustered and annotated scRNAseq datasets, using Probabilistic Cell Typing (PCIseq).

Please have a look at:
https://www.nature.com/articles/s41592-019-0631-4
https://github.com/acycliq/pciSeq


With this method, and provided the genes measured by ISS have been chosen well for the task, it is possible to link the 2 modalities and place the clusters inferred from a matched scRNAseq dataset on a spatial map of the tissue.

## Import the necessary modules

In [None]:
import os
import numpy as np
import pandas as pd
import skimage.color
import matplotlib.pyplot as plt
from scipy.sparse import load_npz, coo_matrix
from ISS_postprocessing import pciseq

## Read the single cell RNA sequencing dataset

In this step we can either input a clustered scRNAseq object, or even just a table with the average expression data per cluster.

In [None]:
sc_file = ('/media/marco/Meola/CHK_NOV22/mean_expression_sc_gallus_input_for_pciseq_82genes.csv')

In [2]:
scRNAseq = pd.read_csv(sc_file, header=None, index_col=0, compression=None, dtype=object)
scRNAseq = scRNAseq.rename(columns=scRNAseq.iloc[0], copy=False).iloc[1:]
scRNAseq = scRNAseq.astype(float)


In [None]:
# We can show the data to have a look and confirm everything looks as it should.
scRNAseq
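
If you start from a clustered scRNAseq object rather than a ready-made table, the average expression per cluster can be computed with a groupby. The block below is only a minimal sketch on made-up numbers; the names `cells`, `cluster_labels` and `mean_expression` are hypothetical and do not come from this notebook.

```python
import pandas as pd

# Toy cells x genes expression matrix (made-up values).
cells = pd.DataFrame(
    {"GeneA": [4, 6, 0, 2], "GeneB": [0, 2, 8, 10]},
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Hypothetical cluster annotation, one label per cell.
cluster_labels = pd.Series(["c1", "c1", "c2", "c2"], index=cells.index)

# Average expression of each gene within each cluster (clusters x genes).
mean_expression = cells.groupby(cluster_labels).mean()

# Transpose to genes x clusters if the downstream step expects genes as rows.
mean_expression_genes_as_rows = mean_expression.T
print(mean_expression_genes_as_rows)
```

Whether the table should have genes as rows or as columns depends on what the downstream function expects, so it is worth checking the orientation before moving on.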

## Import the segmentation mask and the ISS data

In the following blocks of code, we set the paths to the segmentation mask and to the decoded ISS data in the `coo_file` and `spots_file` variables, respectively, and then load both.

In [None]:
coo_file = ('/media/marco/Meola/CHK_NOV22/CHK2_18_CARE/stardist_segmentation_expanded.npz')
spots_file = ('/media/marco/Meola/CHK_NOV22/CHK2_18_CARE/decoding_PRMC_MH/decoded.csv')

In [None]:
coo = load_npz(coo_file)
iss_spots = pd.read_csv(spots_file)
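
To get a feel for what the loaded mask contains, the sketch below builds a tiny sparse label image and inspects it. It assumes that the segmentation mask is a label image in which 0 is background and every other integer identifies one cell; the array `label_image` and the name `toy_mask` are made up for illustration.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy label image: 0 is background, 1 and 2 are two segmented cells.
label_image = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 2],
])
toy_mask = coo_matrix(label_image)

# Only non-zero pixels are stored, so the unique stored values are the cell labels.
cell_labels = np.unique(toy_mask.data)
n_cells = cell_labels.size

# Pixel area of each cell, indexed by label (index 0 is the background and stays empty).
cell_area = np.bincount(toy_mask.data)
print(toy_mask.shape, n_cells, cell_area[1:])
```

The same checks (`shape`, number of unique labels, area per label) can be run on the real `coo` object to confirm that the mask matches the image the spots were decoded from.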

## Preprocessing the spots

If we have not done so already, we can now filter the spots according to a quality criterion. Please read the manual to understand the different filtering criteria and which thresholds are meaningful.

We then restrict both datasets (scRNAseq and ISS) to the genes present in both modalities, since genes measured in only one of them are not informative for the integration.

In [None]:
# Keep only the spots above the chosen quality threshold.
spots_filt = iss_spots.loc[iss_spots['quality_minimum'] > 0.5]
processed_spots = pciseq.preprocess_spots(spots_filt, conversion_factor=0.1625)

# Assumption: `trasposed` is the expression table with genes as rows
# (obtained here by transposing `scRNAseq`), and `scseq` holds its gene names.
trasposed = scRNAseq.T
scseq = trasposed.index

# Select the overlapping gene set between ISS and scRNAseq.
output = list(processed_spots['Gene'].unique())
overlap = list(set(scseq).intersection(output))
scrnaseq_clean = trasposed.filter(items=overlap, axis=0)
processed_spots_clean = processed_spots[processed_spots['Gene'].isin(overlap)]
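
When choosing a quality threshold, it can help to see how many spots survive at a few candidate values. The following is a small sketch on a made-up spot table; only the `quality_minimum` column name is taken from this notebook, and the thresholds are arbitrary examples rather than recommendations.

```python
import pandas as pd

# Toy spot table with made-up quality scores.
toy_spots = pd.DataFrame({
    "Gene": ["A", "A", "B", "B", "C"],
    "quality_minimum": [0.9, 0.4, 0.7, 0.55, 0.2],
})

# Fraction of spots retained at each candidate threshold.
retained = {}
for threshold in (0.3, 0.5, 0.7):
    keep = toy_spots["quality_minimum"] > threshold
    retained[threshold] = keep.mean()
    print(f"threshold {threshold}: {keep.mean():.0%} of spots kept")
```

Running the same loop on `iss_spots` shows how aggressively each threshold prunes the real data.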



Then we set an output directory, where the PCIseq output will be stored.

In [None]:
output_dir='/media/marco/Meola/CHK_NOV22/PCIseq_new/chk2_18_expanded/'
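
It is not stated here whether the output folder is created automatically, so it may be safer to create it beforehand. The sketch below uses a temporary location and the hypothetical name `example_output_dir` so that it can run anywhere; in the notebook the same call would be applied to `output_dir`.

```python
import os
import tempfile

# Hypothetical folder used only for this sketch.
example_output_dir = os.path.join(tempfile.gettempdir(), "pciseq_example_output")

# Create the folder and any missing parents; do nothing if it already exists.
os.makedirs(example_output_dir, exist_ok=True)
print(os.path.isdir(example_output_dir))
```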

## Running PCIseq

Now we can finally run the Probabilistic Cell Typing algorithm. The Python implementation is quite slow, so allow for some time, especially when working with large datasets.

In [None]:
pciseq.run_pciseq(processed_spots_clean, coo, scrnaseq_clean, output_dir, save_output = True)

## Read the PCIseq output and plot the data

PCIseq is **probabilistic** in 2 ways:
1. it calculates, for each cell, the probability that it is one cell type or another
2. it calculates, for each ISS spot, the probability that it belongs to one cell or another
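
To make the first point concrete, the sketch below shows how a table of per-cell class probabilities reduces to a single most probable class and its probability. The numbers and the class names (`TypeA`, `TypeB`, `TypeC`) are invented for illustration and are not produced by PCIseq.

```python
import pandas as pd

# Made-up class probabilities for three cells; each row sums to 1.
class_probs = pd.DataFrame(
    {
        "TypeA": [0.80, 0.30, 0.05],
        "TypeB": [0.15, 0.60, 0.05],
        "TypeC": [0.05, 0.10, 0.90],
    },
    index=["cell_1", "cell_2", "cell_3"],
)

# Most probable class per cell and the probability attached to it.
summary = pd.DataFrame({
    "ClassName": class_probs.idxmax(axis=1),
    "Prob": class_probs.max(axis=1),
})
print(summary)
```

A cell such as `cell_2`, whose best class only reaches 0.60, is a less certain call than `cell_3`, which is why the probability is worth keeping next to the label.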

Now, in the folder we indicated above as output, 3 files should have appeared:

`most_probable.csv`: a table containing a list of cells, their xy position, the most probable cell type each cell has been assigned to and the probability of that assignment.

`geneData.json`: a table containing a list of spots, the cell each spot has been assigned to and the probability of that assignment.

`cellData.json`: a more complex table at the cell level, including secondary probabilities and other data.

We begin by reading the `most_probable.csv` file.

In [10]:
pcifile = ('/media/marco/Meola/CHK_NOV22/PCIseq_new/chk2_18_expanded/most_probable.csv')
pciout = pd.read_csv(pcifile)

The `pciout['ClassName']` column contains the primary assignment for each cell, and the `pciout['Prob']` column contains the probability of that assignment.

For each cluster, we can plot the cells, together with their color-coded PCIseq probability using the following code:

In [None]:
import seaborn as sns

# Draw one figure per assigned cell type, colouring each cell by its assignment probability.
for cluster in pciout['ClassName'].unique():
    plt.figure(figsize=(15, 15))
    print(cluster)
    pcigene = pciout.loc[pciout['ClassName'] == cluster]
    sns.scatterplot(x='X', y='Y', hue='Prob', data=pcigene, palette='rainbow', s=5)
    plt.show()
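
Besides plotting, a quick tabular summary can show how many cells per class remain once low-probability assignments are discarded. This is a minimal sketch on a made-up table that reuses the `ClassName` and `Prob` column names; `toy_pciout` and the cut-off `min_prob` are hypothetical, and 0.7 is an arbitrary value, not a recommended threshold.

```python
import pandas as pd

# Toy stand-in for the most-probable-class table.
toy_pciout = pd.DataFrame({
    "ClassName": ["TypeA", "TypeA", "TypeB", "TypeB", "TypeB"],
    "Prob": [0.95, 0.40, 0.80, 0.75, 0.30],
})

# Arbitrary cut-off chosen only for this example.
min_prob = 0.7
confident = toy_pciout[toy_pciout["Prob"] >= min_prob]

# Number of confidently assigned cells per class.
counts = confident["ClassName"].value_counts()
print(counts)
```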