
Accessing signature matrix when adaptive = False? #6

Open
fojackson8 opened this issue Mar 29, 2023 · 4 comments

@fojackson8 commented Mar 29, 2023

Hi, thanks for the nice paper and code.

As I understand it, the decoder function reconstructs the bulk RNA-seq input B from X (the predicted cell fractions). The learnt weights of the decoder function are then taken as the GEPs, represented in the signature matrix S in the following equation:

$$X \cdot S = B$$

If this is the case, we should be able to access these GEPs on the simulated data even when not in adaptive mode, i.e. even if we take simulated bulk RNA-seq as input and reconstruct it, we should still be able to recover the GEPs, right?
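
To make this concrete, here is a minimal sketch of the relationship I have in mind (purely illustrative: I'm assuming a simple linear decoder without bias, which may differ from the actual implementation):

```python
import torch
import torch.nn as nn

n_celltypes, n_genes = 5, 16793   # illustrative shapes

# Hypothetical simplified decoder: a single linear map from cell fractions to bulk expression.
decoder = nn.Linear(n_celltypes, n_genes, bias=False)

# X: predicted cell fractions for a batch of simulated bulk samples, shape (n_samples, n_celltypes)
X = torch.rand(8, n_celltypes)
X = X / X.sum(dim=1, keepdim=True)    # fractions sum to 1 per sample

B_hat = decoder(X)                    # reconstructed bulk, shape (n_samples, n_genes)
S = decoder.weight.T                  # (n_celltypes, n_genes): the GEPs / signature matrix
assert torch.allclose(B_hat, X @ S)   # i.e. X · S = B_hat
```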

There seems to be no option in the codebase to get sigm when not in the adaptive phase.

Happy to be corrected if I've misunderstood, or alternatively if I've missed something in the code?

@poseidonchan (Owner)

Hi,

You're right. I did not provide access to it because I think it is of limited use: if we want the signature matrix, why don't we just group the single-cell dataframe by cell type? The learned one is only an approximation of the signature matrix. Alternatively, you can obtain the learned one by modifying the code, e.g. calling model.sigmatrix() in the prediction step to get S; it should be very quick and easy.
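
For example, something along these lines (toy dataframe with illustrative column and cell-type names):

```python
import pandas as pd

# Toy cells-x-genes dataframe with a "cell_type" column (names illustrative).
sc_df = pd.DataFrame({
    "gene_A":    [1.0, 2.0, 0.0, 0.5],
    "gene_B":    [0.0, 0.5, 3.0, 2.5],
    "cell_type": ["T cell", "T cell", "B cell", "B cell"],
})

# Averaging expression within each cell type gives a reference signature matrix
# of shape (n_celltypes, n_genes).
S_reference = sc_df.groupby("cell_type").mean()
print(S_reference)
```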

@fojackson8 (Author)

Thanks for the response, that's very helpful. So model.sigmatrix() returns a tensor of shape (5, 16793) on the example data; I assume these are the GEPs of the 5 underlying cell types? If I've understood correctly, this gives the same output as sigm when calling predict with adaptive=True.

The reason we might need the signature matrix even when adaptive=False is that it helps validate the performance/accuracy of the autoencoder model. If the reconstructed GEPs are not similar to the underlying GEPs used as input to the simulated mixing (even when adaptive=False), then perhaps that reduces confidence in the reconstruction process and the inferred cell fractions?
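
As a sketch of the kind of check I mean (hypothetical helper; assuming S_learned comes from model.sigmatrix() and S_true is the GEP matrix used for the simulated mixing, both of shape (n_celltypes, n_genes) with rows in the same order):

```python
import numpy as np

def gep_similarity(S_learned: np.ndarray, S_true: np.ndarray) -> np.ndarray:
    """Per-cell-type Pearson correlation between learned and true GEPs."""
    return np.array([
        np.corrcoef(learned, true)[0, 1]
        for learned, true in zip(S_learned, S_true)
    ])

# Placeholder matrices standing in for model.sigmatrix() output and the simulation GEPs.
S_learned = np.random.rand(5, 16793)
S_true = np.random.rand(5, 16793)
print(gep_similarity(S_learned, S_true))
```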

I think this raises an important related question: say we run this model on new bulk RNA-seq data to get the underlying cell fractions. If the inferred GEPs taken from the decoder weights do not correspond to any signatures from the single-cell RNA-seq reference, how do we interpret this? Does this make the inferred cell fractions unreliable?

@poseidonchan (Owner)

It's a good point. I think this situation would happen if the model is under-fitting. Ideally, if the model is trained well on the simulated datasets, the learned GEPs should be close to the real GEPs from the single-cell dataset. Generally, the big issue in deconvolution is that the chosen reference can be very different from the bulk data, and we cannot evaluate that distance / batch effect by checking how well the model is trained. For batch correction in the deconvolution problem, I suggest you take a look at CIBERSORTx's solution, which uses B mode / S mode to minimize the distance between the reference and the bulk data before deconvolution.

@fojackson8 (Author)

OK, thanks. Agreed that we cannot evaluate the distance between the reference scRNA-seq and any new bulk RNA-seq we want to deconvolute. For new bulk RNA-seq we also don't know the "true" proportions in the mixture, so we just have to hope the estimates are accurate. I guess the best way of evaluating how reliable the estimated cell type proportions are for new bulk RNA-seq data is to check whether the reconstructed GEPs correspond to real GEPs?

Question 2: when you apply the trained model to deconvolute real data, you first train the model on simulated data. What are the dimensions of the simulated data used for training? How many genes and samples, and roughly how long does training take?
