
xfuse on Visium data #10

Open
alikhuseynov opened this issue Jun 23, 2020 · 21 comments

@alikhuseynov

Hello,
Your approach looks very interesting! And I hope your pre-print will get accepted.
I would like to run xfuse on a GPU cluster using one Visium dataset. A few questions:

  • does one need to do standard preprocessing/normalization of spatial data, eg in Seurat, or even do some prior deconvolution of spots using reference sc/snRNAseq data (eg using SPOTlight)? And then run xfuse preprocessing, before xfuse run
  • what would be the example to prepare Visium data for xfuse? eg:
    xfuse convert visium --counts s1_filtered_feature_bc_matrix.h5 --image tissue_hires_image.png --transformation-matrix tissue_positions_list.csv --scale 0.3 --output-file section1.h5
    .h5, .png, .csv are supported?
  • in my-config.toml, [analyses.gene_maps] normalize = false — what does that mean exactly, no normalization of counts?
  • is there a full summary of configuration options for xfuse?

Many Thanks

@ludvb
Owner

ludvb commented Jun 23, 2020

Hi, Thanks for your interest in our paper! I'll try to answer point-wise below:

  • The model assumes the data to be unnormalized. Batch effects across sections can be accounted for using design covariates (for an example, see the "section" attributes under the [slides."..."] headings in the example configuration from the README). Integrating with sc reference data is not currently possible but would indeed be an interesting extension.
  • To prepare Visium data, the command would be xfuse convert visium --bc-matrix s1_filtered_feature_bc_matrix.h5 --image image.tif --tissue-positions spatial/tissue_positions_list.csv --scale-factors spatial/scalefactors_json.json --scale 0.3 --output-file section1.h5, where image.tif is the original microscopy image (not the hires image produced by Space Ranger).
  • When normalize = true, the pixel scaling factor s_nxy (cf. Eq. 6 in the paper) is forced to one for all pixels. Effectively, this ignores differences in transcript density across spatial areas, which could be useful if you are interested in the pixel-wise relative expression of a certain gene. But use this option with caution, as predictions in areas with a low number of transcripts are likely unreliable. normalize = false is typically preferable.
  • To create a template config file with all of the available options, run xfuse init config.toml. Documentation is somewhat lacking at the moment, and we're working on improving the situation in the near future. But let me know if you have any questions in the meantime! :)

@alikhuseynov
Author

Many Thanks for the clear explanation!
2 more questions :)

  • what kind of output data does xfuse produce exactly? like a count matrix - genes x pixels, plus segmented or tessellated HE image?
  • can the xfuse output data be used to visualize spatial gene expression in R? similar to what you have in the STUtility package (quite cool btw)
    best,
    Alik

@ludvb
Owner

ludvb commented Jun 24, 2020

Happy to help! :)

The outputs depend on what you specify in the [analyses] config section. We currently support four different output types/analyses:

  • [analyses.gene_maps]: Predicted high-res mean expression for each gene (png)
  • [analyses.metagenes]: Predicted metagene profiles (csv and png summary) and their spatial expression patterns (png)
  • [analyses.imputation]: Posterior samples of the mean expression in predefined spatial areas (csv)
  • [analyses.differential_expression]: Posterior samples of the log-fold change in mean expression between predefined spatial areas (csv)

The last two types of analyses require an annotation file (see #3).

Interop with other packages is unfortunately quite limited at this time. It is of course possible to load the png and csv files listed above in R, but those files are just summarizations of the results. Storing the gene expression maps in a count matrix would be possible but result in a very large output file. Ideas on new output types better suited for data exchange are welcome!

@alikhuseynov
Author

Thank you!

@alikhuseynov
Author

alikhuseynov commented Jan 3, 2021

Hi there,
few questions:

  • is there a way to reproduce the .png plots, adding color scale bar for expression range?
  • is it possible to write out an output file, eg count matrix with genes x pixels (I understand that it will be large)
  • what do you think of using processed count data, ie use Seurat SCTransform, then convert SCT counts to filtered_feature_bc_matrix.h5, and run xfuse using SCT processed .h5?

Many Thanks:)

@ludvb
Owner

ludvb commented Jan 4, 2021

Hi,

The gene maps are based on the Matplotlib inferno color map, so it would be possible to create a color bar by running, for example

import matplotlib as mpl
import numpy as np
from imageio import imsave

w = 100
h = 5
colorbar = np.broadcast_to(mpl.cm.inferno(np.linspace(0, 1, w)), (h, w, 4))
imsave("colorbar.png", colorbar)

A caveat is, of course, that it is unlabeled and does not show the range of expression values.
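If you want a labeled bar that does show the value range, one option is to lean on matplotlib's own colorbar machinery instead. A minimal sketch, assuming you read the range off your gene map yourself (the vmin/vmax values below are placeholders, not xfuse output):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe on a cluster
import matplotlib as mpl
import matplotlib.pyplot as plt

# Placeholder range; in practice use gene_map.min() / gene_map.max()
vmin, vmax = 0.0, 3.5

fig, ax = plt.subplots(figsize=(4, 0.6))
norm = mpl.colors.Normalize(vmin=vmin, vmax=vmax)
mappable = mpl.cm.ScalarMappable(norm=norm, cmap="inferno")
mappable.set_array([])  # needed by some matplotlib versions
fig.colorbar(mappable, cax=ax, orientation="horizontal", label="expression")
fig.savefig("colorbar_labeled.png", bbox_inches="tight")
```

The resulting png has tick labels spanning the chosen range, so it can be placed next to a gene map produced with the same color map.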

PR #27 will make it possible to save gene maps as numpy arrays, which could be stacked post-hoc to produce genes x pixels matrices. You can try it out if you like on the refactor-analysis branch. However, that branch includes name changes to some of the model parameters, so I expect old models may need to be retrained. To produce the tensor maps, you would add something like

[analyses.gene_maps]
type = "gene_maps"
[analyses.gene_maps.options]
gene_regex = ".*"
num_samples = 1
genes_per_batch = 10
predict_mean = true
normalize = false
mask_tissue = true
scale = 1.0
writer = "tensor"

to the config file. This is still a bit experimental, so I can't vouch for stability.
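Once the tensor writer has produced per-gene arrays, they could be stacked post-hoc into a genes x pixels matrix. A hedged sketch, assuming one 2D .npy array per gene (the helper name and directory layout are ours, not part of xfuse):

```python
import glob
import os

import numpy as np

def stack_gene_maps(output_dir):
    """Stack per-gene 2D .npy maps into (genes, pixels) plus gene names."""
    files = sorted(glob.glob(os.path.join(output_dir, "*.npy")))
    genes = [os.path.splitext(os.path.basename(f))[0] for f in files]
    maps = np.stack([np.load(f) for f in files])  # (num_genes, H, W)
    return genes, maps.reshape(len(genes), -1)    # (num_genes, H * W)
```

Keep the memory caveat in mind: restricting gene_regex or lowering scale first makes this far more tractable.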

Alternatively, you could modify these lines of code on master:

gene_map = balance_colors(
    gene_map.cpu().numpy(), q=0, q_high=0.999
)
gene_map = greyscale2colormap(gene_map)
gene_map = mask_background(gene_map, mask)
filename = os.path.join(
    output_dir, slide_name, f"{gene_name}.png",
)
os.makedirs(os.path.dirname(filename), exist_ok=True)
imwrite(filename, gene_map)

Replace them with something like

filename = os.path.join(
    output_dir, slide_name, f"{gene_name}.pt",
)
os.makedirs(os.path.dirname(filename), exist_ok=True)
torch.save(gene_map, filename)

to save the gene maps as pytorch tensors.

I'm not very familiar with SCTransform, so I can't comment about that transformation in particular. It is possible to use non-integer count data on the positive real line, though, so if you'd like to try that out, I'd be interested in hearing the results. However, it's unclear if the expression modeling we're using is a good fit. Are you using SCTransform to normalize for sequencing depth or section-wise batch effects? If the latter, have you tried using section-wise covariates?

@alikhuseynov
Author

Hi, many thanks for your suggestions! It did look like the inferno color scale to me..
I will try out the refactor-analysis branch or the suggested alternative code changes, to get a color scale with the gene expression range per gene map.
Seurat SCTransform (or the original repo https://github.com/ChristophH/sctransform) is quite useful for both scRNAseq and Spatial/Visium data to correct for seq. depth, and one can regress out any other vars/covariates (I haven't used section-wise covariates in xfuse yet; it was one Visium section for now).
I will tell you the results once I have used SCT-normalized data as input for xfuse. It's just an idea, but maybe it will produce even clearer results :)
Also, SCTransform is used, as part of Seurat, in the STUtility package (I think somebody from your lab was developing it).

genes x pixels matrix outputs would be extremely useful for downstream stuff..
Thanks!
A

@ludvb
Owner

ludvb commented Jan 4, 2021

Let me know how it goes! Yep, STUtility is developed by our data scientists. They know a lot more about normalization than I do ;)

Noted. A problem is, as you mentioned, memory usage. A genes x pixels matrix for a 2000x2000 image over 10k genes takes up roughly 150 GB. Though, downsampling or selecting a limited number of genes should be doable.

@alikhuseynov
Author

Yes, I will let you know the xfuse results on SCTransform once I have them ;)

For now, I installed the updated xfuse branch and trained the model again (just one Visium heart section).
Could you give some explanation of these params:

[analyses.analysis-gene_maps.options]
gene_regex = ".*"
num_samples = 10
genes_per_batch = 10
predict_mean = true
normalize = false
mask_tissue = true
scale = 1.0

For gene_maps .jpg output:
what do the .invcv+, .stdv, and .mean outputs show exactly, or generally, how are they calculated?
If you have some preliminary detailed documentation on the new branch, could you share it?

I attached one gene map example; the really black regions are those that originally had low UMI and gene counts per spot
[image: cdh5_exp]

Many thanks!
A

@ludvb
Owner

ludvb commented Jan 19, 2021

Hey,

Sorry for taking a while to get back. Looking forward to hearing about the results! :)

Here's a brief description of the options:

  • gene_regex: You may have noticed that it takes a while to compute the gene maps, and the resulting data files take up quite a lot of space. This option allows you to only process the genes that you are interested in, speeding things up and reducing the total file size. For example, if you are interested in the cadherin genes and a gene "xyz", you could use the regex "^(CDH\d+|xyz)$".
  • num_samples: This is the number of Monte Carlo samples to use to approximate the predictive distribution. Increasing this number improves the estimation but increases runtime linearly. From experience, the marginal benefit of additional samples quite quickly diminishes. If you are exploring all genes, 10 is probably a good number. If you are only exploring a few genes, it could be a good idea to increase this number to, say, 50 or so.
  • genes_per_batch: In order to fit the data in memory, only a few genes can be predicted at a time. In general, you'd want as large a number here as possible without exhausting memory and running into out-of-memory exceptions. The default value of 10 usually works quite well on the GPUs we are using in our lab, which have a memory size of 11 GB, for image sizes of ~2000x2000.
  • predict_mean: This option determines whether to predict the mean expression of genes or draw samples from the resulting count distribution. I haven't really experimented much with this, but I think you need a high num_samples to get good results.
  • normalize: Determines whether to normalize the spatial scaling factor, effectively forcing the variable s in the model to one. This may be useful to highlight regions in which the expression is upregulated in comparison to other genes.
  • mask_tissue: Whether to remove the background (i.e., no tissue) in the output images.
  • scale: This is a scaling factor for the gene maps. Values lower than 1.0 downsample the maps by the corresponding factor. Similar to num_samples, this is mostly a performance consideration: If you're exploring all genes, it could make sense to use a lower factor to speed things up and then construct full-sized maps of the genes you want to study closer.

The mean map shows the mean of the predictive distribution and the stdv map the standard deviation. The invcv+ map shows the inverted coefficient of variation of the expression in each pixel compared to the mean over the entire tissue. It should highlight areas with upregulated expression that the model is more certain about. For example, you can see that the model predicts a high mean in the frame/fiducials of the Visium array on the left edge of the tissue. However, the prediction also has a high standard deviation (uncertainty) in this area. As a result, the fiducials are not as strongly recognized in the invcv+ map as in the mean map.

The file xfuse/analyze/gene_maps.py contains the code for these computations. Essentially:

import numpy as np

# samples is a numpy array of dimension (num_samples, H, W) containing
# the Monte Carlo samples for a given gene

# mean gene map:
mean = samples.mean(0)
# stdv gene map:
stdv = samples.std(0)
# invcv+ gene map:
lfc = np.log2(samples.transpose(1, 2, 0)) - np.log2(samples.mean((1, 2)))
invcv = lfc.mean(-1).clip(0) / lfc.std(-1)
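As a self-contained check of these formulas, here is a runnable version on synthetic data; the gamma-distributed samples are a stand-in for xfuse's Monte Carlo samples, not real output:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for (num_samples, H, W) Monte Carlo samples of one gene;
# strictly positive so the log2 is defined
samples = rng.gamma(shape=2.0, scale=1.0, size=(10, 32, 32)) + 1e-6

mean = samples.mean(0)  # mean gene map, (H, W)
stdv = samples.std(0)   # stdv gene map, (H, W)

# invcv+: per-pixel log2 fold change vs. each sample's tissue-wide mean,
# then mean over samples (clipped at 0) divided by std over samples
lfc = np.log2(samples.transpose(1, 2, 0)) - np.log2(samples.mean((1, 2)))
invcv = lfc.mean(-1).clip(0) / lfc.std(-1)  # (H, W)
```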

Documentation is still lacking for these analysis routines; it is definitely something we want to improve in the future. Hope at least that the above can shed some light :)

@alikhuseynov
Author

Hey, No worries.
Many Thanks for explanations.. now it's clear! :)
File size is indeed an issue (especially for the tensor output ".pt"), so passing 10-30 genes of interest to gene_regex is definitely an option.

@alikhuseynov
Author

Hey there,

I added a small repo as an example of using xfuse on one Visium heart section.

  • 1st approach: run xfuse on original data (ie. SpaceRanger output), some gene_maps are in images folder.

  • 2nd approach: using Seurat SCTransform to get corrected (ie sequencing depth) counts and regressed out ribosomal and mitochondrial content there as well, then saving the modified counts as filtered_feature_bc_matrix_sct.h5.
    Analysis is in Rmd script
    Then, run xfuse to train the model and get the gene_maps -> some are in images_sct

As you will see, the 2nd approach works well for your model's expression predictions.. For some genes the prediction is probably more precise; for others it's hard to say which way is better. Eg MYL2 expression looks better on original counts than on SCT-corrected counts; I think this is related to the predict_mean and num_samples params?
I used the same .toml file for both runs.
In general, I think your model works pretty well, it just depends if one is using original or corrected counts as .h5 input.
Have you also experienced that some genes are not well predicted?

questions:

  • in general, what parameters does one need to change to get better expression predictions? Say, for the genes that were not well predicted?
  • if I increase the H&E image resolution to --scale 0.3 in xfuse convert visium (I was using --scale 0.16) and train with 100K epochs (I trained with 30K epochs), would that help to resolve the gene expression predictions better?
  • I'm using a GPU cluster to run xfuse; unfortunately the walltime max is 24h (not extendable). Once the time is over, the job is terminated but the exception.session is not saved. Apparently it is only saved in interactive mode after pressing ctrl+C. Can one convert, eg, the last epoch-00030000.session to exception.session, so that one can restart xfuse? (I assume that just renaming the file won't work.)
    Thanks a lot! :)
    A

@ludvb
Owner

ludvb commented Jan 28, 2021

Thanks for the results, great stuff! Love the 3D kernel plots btw, look really neat--would be cool to implement something like that for the xfuse gene map analysis in the future :)

The gene maps based on SCT-transformed data look a bit crisper, right? I wonder if the blurriness in the results on untransformed data may be due to the low-count regions. As far as I can tell, the number of reads in those regions is abnormally low (i.e., technical artifacts) and not related to histology. Thus, the correspondence between histology and expression that the method learns becomes distorted, which may impact predictions in other areas as well. The SCT transformation corrects the sequencing depth, which restores the correspondence and results in better predictions.

I think the best way to improve results on the untransformed data may therefore be to drop observations in the low-count regions. The easiest way to do that would be to specify a threshold for the minimum number of reads in each spot. Looking at the raw data, an appropriate threshold would be something like 3000. This can be specified by adding the following slide options:

[slides.section1.options]
min_counts = 3000
always_filter = []
always_keep = [1]

This specifies that all spots with fewer than 3000 reads will be dropped, with the exception (always_keep) of label 1, which is a virtual measurement for the area outside the tissue. It's probably best to retrain the model from scratch after this change.
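To pick a sensible threshold, it can help to inspect the distribution of per-spot totals first. A minimal sketch (spots_passing is our own helper, not part of xfuse; it expects a raw spots x genes count matrix):

```python
import numpy as np

def spots_passing(counts, min_counts=3000):
    """Boolean mask of spots whose total UMI count meets the threshold.

    counts: array-like of shape (num_spots, num_genes), raw counts.
    """
    totals = np.asarray(counts).sum(axis=1)
    return totals >= min_counts
```

For example, spots_passing(matrix).mean() gives the fraction of spots that a given min_counts would keep, which makes it easy to sweep a few candidate thresholds before retraining.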

I think I've seen this kind of blurriness sometimes when using too high a learning rate. But the current value 0.0003 should be fine, so I don't think that's the case here. The image size also looks good to me, and a higher resolution would probably only make inference more difficult.

The epoch-*.session files are actually saved just for cases like this, when the program is terminated unexpectedly, so that training can be resumed from the last checkpoint :) They contain the same data as the exception.session file and can be loaded in the same way (by passing --session checkpoints/epoch-xx.session).

@alikhuseynov
Author

Many Thanks! I'm glad it was also useful for you and may be others..

  • 3D kernel plots can also be done in python version of Plotly library
  • before plotting, one needs to do some image processing on the expression values of spots (or pixels in the case of xfuse), basically applying 2D Gaussian smoothing kernel(s) to the matrix of pixels along the x and y axes R code
    I think it's relatively easy to do these steps in python; I'm just not that experienced there.
  • 2D heatmap of smoothed hotspots of expression R code
  • 3D heatmap of smoothed surface topography of expression R code
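For the smoothing step in Python, scipy's gaussian_filter applies the 2D Gaussian kernel directly. A minimal sketch on random stand-in data (the array and sigma are placeholders to replace with your expression matrix and preferred bandwidth):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Stand-in for a 2D matrix of per-pixel (or per-spot) expression values
rng = np.random.default_rng(0)
expression = rng.random((200, 200))

# 2D Gaussian smoothing along both axes; larger sigma = smoother surface
smoothed = gaussian_filter(expression, sigma=5)
```

The smoothed array can then be fed to any 2D heatmap or 3D surface plot, e.g. in Plotly.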

Yes, the gene-maps results on SCT-transformed data look sharper, but for some genes the predicted expression pattern looks a bit different (eg MYL2), relative to the expression on spots
SCT-transformed gene expression example:
[image: st_sct_xfuse]

Untransformed data example:
[image: st_xfuse]
And I was a bit confused why, eg for MYL2 expression, xfuse predictions on untransformed data seem to work better than on SCT-transformed data?

I will try to retrain as suggested on Untransformed data using min_counts = 3000

Thanks again for your suggestions!

@ludvb
Owner

ludvb commented Feb 1, 2021

Awesome, thanks for the links and pointers! Didn't know plotly existed for Python. Will definitely try this out 👍

I agree that the MYL2 prediction looks too flat with the SCT-transformed data. Looking at the H&E image, I noticed the histology is very homogeneous. My experience is that the model is quicker to learn expression patterns associated with clear morphological signatures. It could be that the model hasn't fully converged yet or that the expression pattern of MYL2 in the ground truth data is too idiosyncratic to drive the creation of a distinct metagene.

I wonder if the more accurate prediction using untransformed data is a scaling effect? You could try setting the option normalize = true in the gene map analysis section of the config to check whether the pattern is maintained. If you'd like to try this out, you will need to pull the latest commit on master, 32126e2, which fixes a problem that caused this option to be ignored.

@alikhuseynov
Author

Great!! Plotly is just amazing :)

Here is the result on using the counts param that you suggested to "correct" UMIs.

[slides.section1.options]
min_counts = 3000
always_filter = []
always_keep = [1]

And I used num_samples = 50 since I was having only specific gene set.
This was my .toml to retrain the model.
The gene maps:
[image: xfuse_2]
I think results are better for MYL2; the rest is kind of similar to the SCT-transformed data results, so your suggested min_counts = 3000 works great!

I agree with you; I will try normalize = true to see if the pattern remains.
If the model hasn't converged, one would need to train for >30K epochs then?
Also, I was using max_metagenes = 50; does increasing or decreasing it help to get a better prediction? I would probably expect 12-14 cell types at most in this data..
Thanks! :)

@ludvb
Owner

ludvb commented Feb 2, 2021

Thanks for the update and great to hear about the improvements! :)

In terms of convergence, my experience is that the number of update steps is usually slightly sublinear with respect to the number of sections used for training. With only one section, I think it could make sense to increase the number of epochs to perhaps something like 50k.

It is best to check convergence by tracking the ELBO, though, and terminate training when the ELBO flattens out. By default, xfuse emits a Tensorboard log file that gets written to the stats subdirectory in the --save-path (it is also possible to log to a csv file by passing the --stats CLI flag when training). The ELBO is quite noisy, so you will usually want to set smoothing to ~0.99.

You can launch Tensorboard by running tensorboard --logdir ./stats from the --save-path. Tensorboard then starts a web server that can be accessed from a loopback address (typically http://localhost:6006). The process is a bit more complicated when running on a remote. The easiest solution is to copy the log files to your own computer and run Tensorboard locally, but this precludes live tracking. Another solution that should be compatible with most remotes is to mount the remote file system with something like sshfs instead of copying. If port forwarding has been enabled on the remote, a better solution yet is to connect with ssh -L 6006:localhost:6006 to bind localhost:6006 to the remote's and run Tensorboard remotely.

Using a high number of max_metagenes should be fine but could slow down the training process and increase memory consumption. With the drop and split strategy, it is probably good to have some redundancy in terms of the number of cell types in the data. If you expect 12--14 cell types, I think a good value of max_metagenes could be in the 20--30 range.

Good luck with the analysis and let me know how it goes :)

@alikhuseynov
Author

Hi there, again thanks for all the feedback on using xfuse!
another point:

  • We discussed this a bit before, just wanted to confirm. Eg if one observes spatially variable expression of certain markers, but the underlying tissue/H&E doesn't show obvious morphological differences, can one force xfuse to learn a model using specific spatially variable markers as metagenes, to then generate gene maps?
    Thanks a lot

@ludvb
Owner

ludvb commented Nov 1, 2021

Hi!

It is currently not possible to use prespecified metagenes, although this is something that definitely would be cool to support in the future. The best you can do right now is to train the model using only the spatially variable markers (by setting the gene_regex option to, for example, "^(marker1|marker2|marker3|and-so-on)$" in the config toml file) and hope for it to pick up the expression patterns despite them not being appreciable in the morphology. But I'm not sure how much this helps compared to training the full model.

I've found sometimes that it is easier to learn subtle gene expression patterns when the training patches show larger swathes of the tissue, so that as much context is given to the recognition network as possible. This can be achieved by either downsampling the tissue images more (lower --scale in xfuse convert) or increasing the patch_size and network_depth parameters.

Let me know how it goes and if you have any questions :)

@alikhuseynov
Author

Thanks a lot for the suggestions! :)
I will give it a try.

@alikhuseynov
Author

Hi again,
will it be easy to implement NMF as well? That would probably be more advantageous, eg via scikit-learn or bignmf:

[analyses.analysis-metagenes.options]
method = "nmf"
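
For reference, a minimal sketch of what an NMF factorization with scikit-learn could look like on a counts matrix (the synthetic data and n_components here are placeholders; this is not xfuse's interface):

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative genes x spots count matrix as a stand-in
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 50)).astype(float)

model = NMF(n_components=14, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(counts)  # genes x metagenes loadings
H = model.components_            # metagenes x spots activations
```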

Many Thanks!
