Integration of patient-matched tumor and normal tissue samples #230

Closed
samgest opened this issue Dec 12, 2023 · 3 comments

samgest commented Dec 12, 2023

Hi,

I am analyzing several published kidney tumor datasets (that is, raw UMI counts coming from public repositories) and I want to integrate them, but I'm having some issues with overcorrection.

I have 68 tumor samples coming from 68 different patients and, from some of them, I also have a sample coming from the surrounding healthy tissue (normal). In total, 68 tumor + 19 normal = 87 total scRNA-seq expression matrices (coming from 9 different datasets). I want to integrate all of this data together and remove the batch effect (sample-wise) and the dataset bias (i.e., the bias that arises from using different datasets), but not the tumor-normal differences.

I tried to integrate with RunHarmony as:

dataMerged <- dataMerged %>%
  # Note: "sample_type" metadata not included in the "group.by.vars" argument since that's the variable I don't want to correct.
  RunHarmony(group.by.vars = c("dataset_id", "sample_id"), plot_convergence = TRUE) %>%
  RunUMAP(reduction = "harmony", dims = 1:10)

NOTE: I took the first 10 dimensions based on their standard deviation and where it "plateaus" (Fig. 1).

Fig. 1: standard deviation of each dimension (image: Rplot02)

But the results look quite overcorrected. Although tumor and normal tissues should share some cell types (such as lymphocytes, endothelial cells, etc.), there should be at least one big cluster of cells in the tumor samples that is not present in the normal ones (the malignant / tumoral cells themselves). I see very little difference in the UMAP graph (Fig. 2):

Fig. 2: UMAP of the integrated data (image: Rplot05)

Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

I'm quite new to scRNA-seq analysis, so any comment or suggestion is more than appreciated.
Thanks in advance.

pati-ni (Collaborator) commented Dec 12, 2023 via email

samgest (Author) commented Dec 12, 2023

There you go:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] copykat_1.1.0               cutoff.scATOMIC_0.1.0       agrmt_1.42.12               Rmagic_2.0.3               
 [5] caret_6.0-94                lattice_0.22-5              randomForest_4.7-1.1        plyr_1.8.9                 
 [9] scATOMIC_2.0.2              clustree_0.5.1              ggraph_2.1.0                data.table_1.14.10         
[13] reshape2_1.4.4              DESeq2_1.42.0               GSVA_1.50.0                 BaseSet_0.9.0              
[17] EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.26.0            AnnotationFilter_1.26.0     GenomicFeatures_1.54.1     
[21] reticulate_1.34.0           GSEABase_1.64.0             graph_1.80.0                annotate_1.80.0            
[25] XML_3.99-0.16               AnnotationDbi_1.64.1        HGNChelper_0.8.1            openxlsx_4.2.5.2           
[29] lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1               purrr_1.0.2                
[33] readr_2.1.4                 tidyr_1.3.0                 ggplot2_3.4.4               tidyverse_2.0.0            
[37] tibble_3.2.1                dplyr_1.1.4                 patchwork_1.1.3             pheatmap_1.0.12            
[41] SingleR_2.4.0               celldex_1.12.0              SummarizedExperiment_1.32.0 Biobase_2.62.0             
[45] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1         IRanges_2.36.0              S4Vectors_0.40.2           
[49] BiocGenerics_0.48.1         MatrixGenerics_1.14.0       matrixStats_1.1.0           rhdf5_2.46.1               
[53] Matrix_1.6-4                harmony_1.2.0               Rcpp_1.0.11                 Seurat_5.0.1               
[57] SeuratObject_5.0.1          sp_2.1-2                   

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.34.0           spatstat.sparse_3.0-3         bitops_1.0-7                 
  [4] httr_1.4.7                    RColorBrewer_1.1-3            tools_4.3.2                  
  [7] sctransform_0.4.1             utf8_1.2.4                    R6_2.5.1                     
 [10] HDF5Array_1.30.0              lazyeval_0.2.2                uwot_0.1.16                  
 [13] rhdf5filters_1.14.1           withr_2.5.2                   prettyunits_1.2.0            
 [16] gridExtra_2.3                 progressr_0.14.0              cli_3.6.1                    
 [19] spatstat.explore_3.2-5        fastDummies_1.7.3             spatstat.data_3.0-3          
 [22] ggridges_0.5.4                pbapply_1.7-2                 Rsamtools_2.18.0             
 [25] parallelly_1.36.0             rstudioapi_0.15.0             RSQLite_2.3.4                
 [28] generics_0.1.3                BiocIO_1.12.0                 ica_1.0-3                    
 [31] spatstat.random_3.2-2         zip_2.3.0                     fansi_1.0.6                  
 [34] clipr_0.8.0                   abind_1.4-5                   lifecycle_1.0.4              
 [37] yaml_2.3.7                    recipes_1.0.8                 SparseArray_1.2.2            
 [40] BiocFileCache_2.10.1          Rtsne_0.17                    grid_4.3.2                   
 [43] blob_1.2.4                    promises_1.2.1                ExperimentHub_2.10.0         
 [46] crayon_1.5.2                  miniUI_0.1.1.1                beachmat_2.18.0              
 [49] cowplot_1.1.1                 KEGGREST_1.42.0               pillar_1.9.0                 
 [52] rjson_0.2.21                  future.apply_1.11.0           codetools_0.2-19             
 [55] leiden_0.4.3.1                glue_1.6.2                    vctrs_0.6.5                  
 [58] png_0.1-8                     spam_2.10-0                   gtable_0.3.4                 
 [61] cachem_1.0.8                  gower_1.0.1                   prodlim_2023.08.28           
 [64] S4Arrays_1.2.0                mime_0.12                     tidygraph_1.2.3              
 [67] survival_3.5-7                timeDate_4022.108             SingleCellExperiment_1.24.0  
 [70] iterators_1.0.14              hardhat_1.3.0                 lava_1.7.3                   
 [73] interactiveDisplayBase_1.40.0 ellipsis_0.3.2                fitdistrplus_1.1-11          
 [76] ipred_0.9-14                  ROCR_1.0-11                   nlme_3.1-164                 
 [79] bit64_4.0.5                   progress_1.2.3                filelock_1.0.3               
 [82] RcppAnnoy_0.0.21              rprojroot_2.0.4               irlba_2.3.5.1                
 [85] rpart_4.1.23                  KernSmooth_2.23-22            colorspace_2.1-0             
 [88] DBI_1.1.3                     nnet_7.3-19                   tidyselect_1.2.0             
 [91] bit_4.0.5                     compiler_4.3.2                curl_5.2.0                   
 [94] xml2_1.3.6                    DelayedArray_0.28.0           plotly_4.10.3                
 [97] rtracklayer_1.62.0            scales_1.3.0                  lmtest_0.9-40                
[100] rappdirs_0.3.3                digest_0.6.33                 goftest_1.2-3                
[103] spatstat.utils_3.0-4          XVector_0.42.0                htmltools_0.5.7              
[106] pkgconfig_2.0.3               sparseMatrixStats_1.14.0      dbplyr_2.4.0                 
[109] fastmap_1.1.1                 rlang_1.1.2                   htmlwidgets_1.6.4            
[112] shiny_1.8.0                   DelayedMatrixStats_1.24.0     farver_2.1.1                 
[115] zoo_1.8-12                    jsonlite_1.8.8                BiocParallel_1.36.0          
[118] ModelMetrics_1.2.2.2          BiocSingular_1.18.0           RCurl_1.98-1.13              
[121] magrittr_2.0.3                GenomeInfoDbData_1.2.11       dotCall64_1.1-1              
[124] Rhdf5lib_1.24.0               munsell_0.5.0                 viridis_0.6.4                
[127] pROC_1.18.5                   stringi_1.8.2                 zlibbioc_1.48.0              
[130] MASS_7.3-60                   AnnotationHub_3.10.0          listenv_0.9.0                
[133] ggrepel_0.9.4                 deldir_2.0-2                  graphlayouts_1.0.2           
[136] Biostrings_2.70.1             splines_4.3.2                 tensor_1.5                   
[139] hms_1.1.3                     locfit_1.5-9.8                igraph_1.5.1                 
[142] spatstat.geom_3.2-7           RcppHNSW_0.5.0                biomaRt_2.58.0               
[145] ScaledMatrix_1.10.0           BiocVersion_3.18.1            BiocManager_1.30.22          
[148] foreach_1.5.2                 tweenr_2.0.2                  tzdb_0.4.0                   
[151] httpuv_1.6.13                 RANN_2.6.1                    polyclip_1.10-6              
[154] future_1.33.0                 scattermore_1.2               ggforce_0.4.1                
[157] rsvd_1.0.5                    xtable_1.8-4                  restfulr_0.0.15              
[160] RSpectra_0.16-1               later_1.3.2                   class_7.3-22                 
[163] viridisLite_0.4.2             memoise_2.0.1                 GenomicAlignments_1.38.0     
[166] cluster_2.1.6                 timechange_0.2.0              globals_0.16.2               
[169] here_1.0.1  

pati-ni (Collaborator) commented Dec 12, 2023

> Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

Yes, that is correct. If you do sample-level correction, it basically corrects the latent dimensions for everything that is nested within the samples in that experimental design. Since each of your samples is either tumor or normal, that includes the tumor-normal difference. Performing the correction separately, as you suggest, would be the way to go.
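A minimal sketch of what the separate correction could look like, assuming the metadata column `sample_type` holds the values "tumor" and "normal" (the thread does not state the actual labels). The helper name `integrate_within_type` is hypothetical, and recomputing the variable features, scaling, and PCA within each subset is a choice made for this sketch rather than something prescribed above.

```r
# Hypothetical helper: run Harmony within a single sample type only.
# Packages are called with `::`, so Seurat and harmony must be installed.
integrate_within_type <- function(obj, type, dims = 1:10) {
  # Load the namespaces up front so the Seurat methods are registered.
  stopifnot(requireNamespace("Seurat", quietly = TRUE),
            requireNamespace("harmony", quietly = TRUE))

  # Keep only the cells of the requested sample type
  # (assumed labels: "tumor" / "normal").
  cells <- colnames(obj)[obj$sample_type == type]
  sub <- subset(obj, cells = cells)

  # Recompute the PCA on the subset so the latent space reflects only
  # the variation present within this sample type.
  sub <- Seurat::NormalizeData(sub, verbose = FALSE)
  sub <- Seurat::FindVariableFeatures(sub, verbose = FALSE)
  sub <- Seurat::ScaleData(sub, verbose = FALSE)
  sub <- Seurat::RunPCA(sub, npcs = max(dims), verbose = FALSE)

  # Correct dataset and sample effects. sample_type is constant within
  # the subset, so this step cannot remove tumor-normal differences.
  sub <- harmony::RunHarmony(
    sub,
    group.by.vars = c("dataset_id", "sample_id"),
    dims.use = dims
  )

  # The "harmony" reduction now has length(dims) columns.
  Seurat::RunUMAP(sub, reduction = "harmony", dims = seq_along(dims),
                  verbose = FALSE)
}

# Example usage (requires the merged Seurat object from the question):
# tumor  <- integrate_within_type(dataMerged, "tumor")
# normal <- integrate_within_type(dataMerged, "normal")
```

Each call returns its own object with its own "harmony" reduction; how to bring the two embeddings back together is not covered by this sketch.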

You can, however, still investigate cell abundance within the tumor and normal kidney, which is fine to do this way.
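One simple way to look at cell abundance is to tabulate the fraction of cells per cluster within each sample type. The sketch below uses base R only; the data frame `meta` and its `cluster` labels are invented toy values standing in for the per-cell metadata of the real object, so the column names are assumptions.

```r
# Toy per-cell metadata; all values are invented for illustration.
meta <- data.frame(
  sample_type = c("tumor", "tumor", "tumor", "tumor", "normal", "normal"),
  cluster     = c("T cell", "malignant", "malignant", "endothelial",
                  "T cell", "endothelial")
)

# Count cells per sample type (rows) and cluster (columns).
counts <- table(meta$sample_type, meta$cluster)

# Convert counts to within-sample-type fractions (each row sums to 1).
abundance <- prop.table(counts, margin = 1)
print(round(abundance, 2))
```

With real data, the same two lines can be applied to the metadata of the Seurat object once a clustering or annotation column exists.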

A minor comment on your workflow: if you decide to use only latent dimensions 1:10, run Harmony on just those. I'm not sure how much it will change things, but correcting more dimensions than you use may raise issues with the curse of dimensionality.
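A sketch of restricting Harmony to the dimensions that are actually used downstream, assuming the `dims.use` argument of harmony's Seurat interface and an existing "pca" reduction on the object. The helper name `harmony_on_dims` is hypothetical.

```r
# Hypothetical helper: run Harmony on a chosen subset of PCs only.
# Packages are called with `::`, so Seurat and harmony must be installed.
harmony_on_dims <- function(obj, dims = 1:10) {
  stopifnot(requireNamespace("Seurat", quietly = TRUE),
            requireNamespace("harmony", quietly = TRUE))

  # Correct only the selected PCs of the existing "pca" reduction.
  obj <- harmony::RunHarmony(
    obj,
    group.by.vars = c("dataset_id", "sample_id"),
    dims.use = dims
  )

  # The resulting "harmony" reduction has length(dims) columns,
  # so UMAP indexes them as 1..length(dims).
  Seurat::RunUMAP(obj, reduction = "harmony", dims = seq_along(dims))
}

# Example usage (requires the merged Seurat object from the question):
# dataMerged <- harmony_on_dims(dataMerged, dims = 1:10)
```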

samgest closed this as completed Dec 12, 2023