All the Visium data I've downloaded is here: /oak/stanford/groups/horence/JuliaO/data/visium/
. I copied the "batch download" commands from https://www.10xgenomics.com/resources/datasets/ and ran them in sherlock (separately for each sample). Key files are:
*.tif
: The histology image
As in https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0; see /oak/stanford/groups/horence/JuliaO/data/visium/V1_Mouse_Brain_Sagittal_Posterior
for example downloaded data:
*possorted_genome_bam.bam
: The BAM file.spatial/tissue_positions_list.csv
: maps barcodes to spatial locations (described here: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/images).filtered_feature_bc_matrix/*
: Gives gene expression values to compare against. Should containbarcodes.tsv.gz
,features.tsv.gz
, andmatrix.mtx.gz
As in https://www.biorxiv.org/content/10.1101/2021.08.03.455000v1.full.pdf; see /oak/stanford/groups/horence/JuliaO/data/visium/brain_metastases
for example downloaded data:
*_L001_R{1,2}_001.fastq.gz
: R1 and R2 fastqs
If the BAM was not available for the download, you need to run spaceranger to get the other required files. Spaceranger is downloaded on sherlock here: /home/groups/horence/applications/spaceranger-1.3.1/
.
Example call to spaceranger count with input descriptions from https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/using/count#count
$ cd /home/jdoe/runs
$ spaceranger count --id=sample345 \ #Output directory
--transcriptome=/opt/refdata/GRCh38-2020-A \ #Path to Reference
--fastqs=/home/jdoe/runs/HAWT7ADXX/outs/fastq_path \ #Path to FASTQs
--sample=mysample \ #Sample name from FASTQ filename
--image=/home/jdoe/runs/images/sample345.tiff \ #Path to brightfield image
--slide=V19J01-123 \ #Slide ID
--area=A1 \ #Capture area
--localcores=8 \ #Allowed cores in localmode
--localmem=64 #Allowed memory (GB) in localmode
Example call to spaceranger in this repo: run_spaceranger.sh
The Reference was downloaded using this command:
curl -O https://cf.10xgenomics.com/supp/spatial-exp/refdata-gex-GRCh38-2020-A.tar.gz
The slide
and area
values should be available from wherever you downloaded the data. For example, see "visium gene expression slideserial number" in https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5420749.
Once you run spaceranger, the required files from Step 1 should be available in outs/filtered_feature_bc_matrix/
and outs/spatial/
.
The 10X team are very helpful with debugging, so if you run into errors definitely email them.
Use the notebook visium_meta_clean.ipynb
to create the metadata file. Change the following at the top of the notebook to match your data:
dataname
: Name to use when saving the output filecer_loc_name
: Path to thetissue_positions_list.csv
file discussed in Step 1.im_path
: Path to the tif imagefilt_bc_name
: path to the filtered gene expression matrix discussed in Step 1.
Then run all the cells.
Note: sometimes I've found the images to be rotated according to what I'd expect. For example, you can see that the pixel map on the left of a brain metastasis slide seems to be mismatched - the image is flipped relative to what we're plotting, so we're extracting the wrong pixel values from the image. You can flip around the blur matrix and transpose it until it matches up with the histology (image on the right). You can also change the plot_xcoord
and plot_ycoord
columns to flip the image around so it matches the histology. I haven't found a programmatic way of knowing when this flip is happening.
Outputs:
notebooks/output/visium_meta/{dataname}_blur.png
: Plot of the histology image blurred to the specified level (default: 70)notebooks/output/visium_meta/{dataname}_pixquant.png
: Plot of the pixel values from the blurred image, quantilednotebooks/output/visium_meta/{dataname}_pixval.png
: Plot of the raw pixel values from the blurred imagenotebooks/output/visium_meta/meta_{dataname}.tsv
: The metadata tsv file
Columns in metadata file (some columns from tissue_positions_list.csv
directly, see documentation here: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/images):
barcode
: Barcode that can be mapped to BAM barcodein_tissue
: 1 if in the tissue, 0 if out. All should be 1 because we're subsetting to only spots in the tissue here.array_row
: The row coordinate of the spot in the array from 0 to 77. The array has 78 rows.array_col
: The column coordinate of the spot in the array. In order to express the orange crate arrangement of the spots, this column index uses even numbers from 0 to 126 for even rows, and odd numbers from 1 to 127 for odd rows. Notice then that each row (even or odd) has 64 spots.xcoord
: The row pixel coordinate of the center of the spot in the full resolution image.ycoord
: The column pixel coordinate of the center of the spot in the full resolution image.cell_id
:{dataname}_{barcode[:-2]}
(cutting off the-1
at the end of the barcode; this should match up with names of cells in ReadZS/SpliZ)plot_xcoord
:ycoord
(use for plotting to align with histology image)plot_ycoord
:-xcoord
(use for plotting to align with histology image)pixval
: pixel value at the given spot (with blur)pixquant
: pixel quantile at the given spot (with blur; default is 10 quantiles)
Visium data can be run using the main branch of the ReadZS pipeline without modification. An example config file, samplesheet, and bash script are provided.
In the bash script, change runName
to whatever you want. In the samplesheet, your sample name and its BAM (I haven't run with multiple visium samples with ReadZS, but I suppose that you could). In the config file, include the path to the samplesheet, metadatafile described in Step 3, and other input files required for ReadZS.
Note: It's okay if the ReadZS pipeline doesn't complete, as long as the CALC_ZSCORES
step finishes.
The ReadZS pipeline creates a separate z score file for each chromosome. It's convenient for us to concatenate them into one file. You can use the provided bash script to do this, by letting ZDIR
be the path to the zscore directory from the ReadZS output, and DATANAME
be your dataname (this determines the name the file will be saved with).
The relevant file is the concatenated file created with this script, {dataname}.zscore
. This contains one row per spot per 5000-bp window. Relevant columns are:
window
: 5000-bp ReadZS windowcell_id
: Spot identifierz_scaled
: ReadZS value
Running the SpliZ pipeline is pretty much the same procedure as running the ReadZS. Visium data can be run using the main branch of the SpliZ pipeline without modification. An example config file, samplesheet, and bash script are provided.
Note: It's okay if the SpliZ pipeline doesn't complete, as long as the CALC_SPLIZVD
step finishes.
The relevant SpliZ output file is SpliZ_values/*_subcol.tsv
. This file has one row per gene per spot. The relevant columns are:
gene
: gene namecell
: spot idscZ
: SpliZ value for that gene in that spot
Note that for the remaining code, rather than taking arguments for a bunch of input paths each time, I have all the paths saved in a csv that I access in each script. The access looks something like this:
dataname = "V1_Mouse_Brain_Sagittal_Posterior"
samples = pd.read_csv("notebooks/output/make_samplesheet/spatial.csv",index_col = 0)
row = samples.loc[dataname]
score = "ReadZS"
scores = pd.read_csv("output/make_samplesheet/scores.csv",index_col=0)
srow = scores.loc[score]
I found this easier than having to manually copy the paths each time. Also, it's fine if all the metadata paths (for example) don't follow the same format between datasets, because I just have to put the correct path for each one into the samplesheet and then everything works smoothly. I think this is a good approach if there are a lot of datasets/scores you're iterating over, because it forces you to keep an up-to-date spreadsheet of where all the inputs/outputs are, and you can avoid copy/pasting paths. Also, no need to put an entry in every cell; you can just put an empty string if you don't have the given file for the given dataset.
The process of making this samplesheet is a little manual, since you have to write all the correct paths for all of your datasets. But you only have to do this once for each. I make the samplesheet in a notebook. The columns of the output of the spatial file are:
dataname
: identifier of datamethod
: visium or 10xspliz_vals
: path to thesym_SVD_normdonor_S_0.1_z_0.0_b_5_r_0.01_subcol.tsv
file from the SpliZ pipeline (inSpliZ_values
)readzs_vals
: path to the concatenated z score file from theconcat_zscore.sh
scriptge_vals
: path to parsed gene expression values (Step 6)tissue
: string identifying the tissue (e.g. "Brain")image
: path to the .tif imagemetadata
: path to the metadata file (described in Step 3)pixcorr_ge
: path to correlation of pixel value and gene expression (Step 8)pixcorr_readzs
: path to correlation of pixel value and ReadZS (Step 8)pixcorr_spliz
: path to correlation of pixel value and SpliZ (Step 8)readzs_ge
: path to Parsed ReadZS counts (Step 6)resid_spliz
: path to residual after removing gene expression for SpliZ (Step 7)resid_readzs
: path to residual after removing gene expression for ReadZS (Step 7)bam
: path to BAM filetiss_pos_list
: path totissue_positions_list.csv
described in step 1ge_mat
: path to the filtered gene expression matrix described in step 1readzs_counts
: path to the raw ReadZS count file (output from the pipeline)
The column outputs of the scores file are:
name
: name of the score (e.g.SpliZ
)col
: column in the zscore file that contains this value (e.g.scZ
)genecol
: eithergene
orwindow
cellid
: Name of the cell column (cell
orcell_id
)valname
: name of column inspatial.csv
file that contains zscores
We want to extract gene expression values for each gene so that we can extricate gene expression patterns from splicing patterns. The script parse_gene_expression.py
parses the gene expression matrix returned by spaceranger into a table that we can use in downstream analysis. The sbatch script run_ge.sh
submits the job for this step.
Input parameters:
dataname
: Data name in the dataset samplesheet (Step 5)thresh
: Require more thanthresh
spots to have nonzero values for the gene for it to be reported.genes
: Genes to parse gene expression for. If blank, all will be parsed.norm
: Including the--norm
tag returns values normalized by count in each cell. Otherwise values are just integer counts
Outputs (one row per gene, spot pair):
{dataname}.tsv
: table with all gene counts per cell and metadata{dataname}.pq
: parquet form of table{dataname}_sub_{thresh}.tsv
: table with gene counts per cell if the gene has a nonzero value in> thresh
spots{dataname}_sub_{thresh}.pq
: parquet form of table
Output columns not defined in the metadata:
gene
: gene nameensembl
: ensembl idfrac_count
: if norm: the number of counts mapping to this window in this spot, divided by the number of counts mapping to this spot; if not norm, just the number of counts mapping to this window in this spot.
We want to extract gene expression values for each window so that we can compare with ReadZS patterns. The script readzs_ge.py
parses the count values from the ReadZS output for each window and turns them into a table we can use for downstream analysis. The sbatch script run_ge_readzs.sh
submits the job for this step.
Input parameters:
dataname
: Data name in the dataset samplesheet (Step 5)thresh
: Only report windows with expression in >thresh
spots
Output (one row per window, spot pair):
{dataname}_readzs_ge_{thresh}.tsv
: table with all window counts per spot and metadata{dataname}_readzs_ge_{thresh}.pq
: parquet version of the table
Output columns not defined in metadata:
cell_id
: spot identification columnchr
: chromosomepos
: position on chromosomestrand
: strand on chromosomesample
:dataname
window
: window identifierwindow_count
: The number of reads for the given window for the given cellcell_count
: The number of reads across all windows for the given spotfrac_count
:window_count
/cell_count
If we're interested in SpliZ/ReadZS patterns that are independent of gene expression, it can be helpful to "regress out" gene expression information to get scores with no ability to predict gene expression. We compute the residuals as follows:
- Normalize both SpliZ (or ReadZS) values and gene expression values so they are on the same scale
- Perform a linear regression to predict SpliZ (or ReadZS) value based on the gene expression value
- Subtract the predicted value from the observed value to get the "residual": the variation in the SpliZ unexplained by gene expression
We can create this "normalized" version of the SpliZ, ReadZS, or gene expression scores by doing the following for each gene/window:
- Rank all values from smallest to largest (ties are broken randomly, so the same value can get multiple ranks)
- Assign each rank to a uniform value between 0 and 1 (the value will be
$\frac{r}{R + 1}$ where$r$ is the rank in question and$R$ is the max rank of the dataset) - Use the reverse normal cdf to map each of these values to values from the normal distribution.
Normalization is important because the gene expression and SpliZ/ReadZS values can be on different scales, and can be skewed. For example, for the gene Gng13 the SpliZ and gene counts histograms are skewed before normalization:
After normalization, we can see that these irregularities go away:
We can look at the plots of SpliZ vs gene expression, and compare to normalized SpliZ vs. normalized gene expression:
Now here is the histogram for the residuals (after regressing out gene expression) and the plot of the SpliZ residual vs gene expression (you can see now the correlation between the two is zero):
The script that does all of this is save_residuals.py
. The script to submit the job is run_res.sh
.
Input parameters:
dataname
: Name to use when saving the output filescore
: The score you're interested in (SpliZ
orReadZS
)score2
: What you'll regress out ofscore
(ge
orReadZS_ge
)outname
: path to save outputthresh
: Require > this many spots for the score to keep the given gene/window
Output:
{dataname}_{score}_{score2}_{thresh}.tsv
: One row per gene/spot pair
Output columns not yet defined:
res
: residual score after regressingscore2
out ofscore
{score}_norm
: normalized version ofscore
{score2}_norm
: normalized version ofscore2
One method of determining spatial regulation is to extract the information from the histology image and correlate that with SpliZ/ReadZS values. Specifically, we can extract the pixel value at each spot in the image (as described in step 3) and correlate pixel value with SpliZ/ReadZS value. We can also take this correlation with any function of the pixel value, such as the quantiled value. Nuclei are stained in the histology image, so there is some biological interpretability of the results.
The script that performs this correlation is pixel_correlation.py, and it can be submitted with run_corr.sh.
Input parameters:
dataname
: Name to use when saving the output filethresh
: Only perform correlation for gene/window with non-NA values in >thresh
spotsscore
: The score you're interested in correlating with the pixel value (e.g.SpliZ
orReadZS
)outpath
: path to save output
Output:
{dataname}_{score}_{thresh}.tsv
: One row per gene/window
Output columns not previously defined:
num_spots
: Number of spots with non-NA values for this gene/windowcorr_pixval
: Spearman correlation ofscore
with pixel valuepval_pixval
: Spearman p value ofscore
with pixel valuecorr_pixquant
: Spearman correlation ofscore
with pixel quantilepval_pixquant
: Spearman p value ofscore
with pixel quantilepval_pixval_adj
: Benjamini-Hochberg-corrected pixel value correlation p valuepval_pixquant_adj
: Benjamini-Hochberg-corrected pixel quantile correlation p value
Another method of finding spatial patterns is inspired by the Ising model of magnetism. The test is performed separately for each gene.
Let's define a graph radius
variable in the script to be larger). Let
Note that if the scores are centered around 0 (which they are for the SpliZ and ReadZS), then under the null hypothesis that the arrangement of the scores isn't spatially determined,
The traditional Ising model is discrete, with spots taking values of +1 or -1. Based on the configuration of spins, you can solve for the magnetism of the system (an indication of how "ordered" the system is). Continuous-spin Ising models were introduced in 1969. However, the theory to calculate the magnestism of the continuous system isn't straightforward. Also, a system would have high magnetism even if almost all the spots were the same, which isn't a configuration we're interested in. The Ising model has been applied to biological problems before.
Instead, to get p values I used permutations. At first, I just permuted which scores were assigned to which spot for the given genes, and found how "extreme" the score we observed was compared to the permuted scores. Let the permuted Ising scores be
But this method wasn't the best null for the empirical data. I switched to a null where, for one permutation for a given gene, for each neighboring pair of spots with non-NA values for that gene, I randomly chose a gene for which both of those spots were non-null, and multiplied their scores.
The script that calculates these Ising values is ising.py. It can be submitted using run_ising.sh.
This takes a while for the ReadZS (it's not very well optimized). For that, I use a threshold of 1000 (compared to 100 for SpliZ) and I run separately for each score with a 48-hour time limit.
Input parameters:
dataname
: Name to use when saving the output filescore
: The score you're interested in finding spatial patterns in (e.g.SpliZ
orReadZS
)thresh
: Require >thresh
spots to have non-NA values forscore
num_perms
: Number of permutations to perform
Output:
{dataname}_{score}_{thresh}_{num_perms}.tsv
: one row per gene/window
Output columns not previously defined:
score_cont
: Ising score for the gene/windownum_pairs
: Number of spots neighboring each other with non-NA values for the gene/windowperm_pval
: P value from permutationsperm_pvals_adj
: P value from permutations adjusted with Benjamini-Hochbergmean_score
: The mean value ofscore
for this gene/window
Once genes with significant spatial patterns for a score are determined, the next step is to visualize those patterns. To create a plot with the histology image, score1 values mapped to spot locations, and score2 values mapped to spot locations, you can use plot_gene_val.py. It can be submitted using run_spat_plot.sh. Example plot:
Input parameters:
dataname
: Name to use when saving the output file/identifying data to plotscore
: first score to plot (must be in: "ge","SpliZ","ReadZS","ReadZS_ge", "ReadZS_resid","ReadZS_norm","SpliZ_resid","SpliZ_norm","ReadZS_ge_norm","ge_norm")score2
: second score to plot (must be in: "ge","SpliZ","ReadZS","ReadZS_ge", "ReadZS_resid","ReadZS_norm","SpliZ_resid","SpliZ_norm","ReadZS_ge_norm","ge_norm")window_file
: file with no header, window on each line, saying which genes/windows to plot. It is pre-created for significant genes by the ising script (*_plot.txt
).
Output:
{dataname}_{gene}_{score}_{score2}.png
: plot with three panels{dataname}_{gene}_{score}_{score2}_quant.png
: plot with three panels where the scores are quantiled (4 quantiles)
This creates ReadZS-style peak plots for each window in a file. Cells are divided up by ReadZS quantiles, and then peaks are plotted. The window_file
described above can be used as input. Specify the dataname, and make sure the path to the file is specified. Plotting can be run with the run_plotterfile.sh file, specifiying the following parameters:
Input parameters:
DATANAME
: Name to use when saving plots/accessing dataGFF
: gff file for the organism being analyzedWINDOWFILE
: file with one window per line and no header. These are the windows that will be plotted
To copy over Table 2: scp jolivier@login.sherlock.stanford.edu:/oak/stanford/groups/horence/JuliaO/visium_analysis/notebooks/output/make_tables/spatial_table2.tsv .
To concatenate plot files into one: cat V1*SpliZ_norm*b0_plot.txt | sort | uniq -u > joint_genes.txt
make_overview_table.ipynb
: used to make table 1make_tables.ipynb
: used to make table 2analyze_ising.ipynb
: make correlation plots
[1] Meyer E, Dehghannasiri R, Chaung K, Salzman J. (2021). ReadZS detects developmentally regulated RNA processing programs in single cell RNA-seq and defines subpopulations independent of gene expression. bioRxiv. https://doi.org/10.1101/2021.09.29.462469
[2] Olivieri JE, Dehghannasiri R, Salzman J. (2022) The SpliZ generalizes ‘Percent Spliced In’to reveal regulated splicing at single-cell resolution. Nature Methods. 19(3):307-10. https://doi.org/10.1038/s41592-022-01400-x