Integration of patient-matched tumor and normal tissue samples #230

samgest · 2023-12-12T09:45:47Z

Hi,

I am performing an analysis with several published datasets of kidney tumor (that is, raw UMI counts coming from public repositories) and I wanted to integrate them, but I'm having some issues with overcorrection.

I have 68 tumor samples coming from 68 different patients and, from some of them, I also have a sample coming from the surrounding healthy tissue (normal). In total, 68 tumor + 19 normal = 87 total scRNA-seq expression matrices (coming from 9 different datasets). I want to integrate all of this data together and remove the batch effect (sample-wise) and the dataset bias (i.e., the bias that arises from using different datasets), but not the tumor-normal differences.

I tried to integrate with RunHarmony as:

dataMerged <- dataMerged %>% 
  # Note: "sample_type" is deliberately left out of "group.by.vars", since that is the variable I don't want to correct.
  RunHarmony(group.by.vars = c("dataset_id", "sample_id"), plot_convergence = TRUE) %>%
  RunUMAP(reduction = "harmony", dims = 1:10)

NOTE: I took the first 10 dimensions based on their standard deviation and where it "plateaus" (Fig. 1).

Fig. 1

But the results are quite overcorrected. Although tumor and normal tissues should share some cell types (such as lymphocytes, endothelial cells, etc.), there should be at least one big cluster of cells in the tumor samples that is absent from the normal ones (the malignant cells themselves). Instead, I see very little difference between sample types in the UMAP plot (Fig. 2):

Fig. 2

Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

I'm quite new to scRNA-seq analysis, so any comment or suggestion is more than appreciated.
Thanks in advance.


pati-ni · 2023-12-12T12:45:03Z

Hi, are you using the latest version of the software? Please send us the output of sessionInfo().


samgest · 2023-12-12T13:09:44Z

There you go:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Madrid
tzcode source: internal

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] copykat_1.1.0               cutoff.scATOMIC_0.1.0       agrmt_1.42.12               Rmagic_2.0.3               
 [5] caret_6.0-94                lattice_0.22-5              randomForest_4.7-1.1        plyr_1.8.9                 
 [9] scATOMIC_2.0.2              clustree_0.5.1              ggraph_2.1.0                data.table_1.14.10         
[13] reshape2_1.4.4              DESeq2_1.42.0               GSVA_1.50.0                 BaseSet_0.9.0              
[17] EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.26.0            AnnotationFilter_1.26.0     GenomicFeatures_1.54.1     
[21] reticulate_1.34.0           GSEABase_1.64.0             graph_1.80.0                annotate_1.80.0            
[25] XML_3.99-0.16               AnnotationDbi_1.64.1        HGNChelper_0.8.1            openxlsx_4.2.5.2           
[29] lubridate_1.9.3             forcats_1.0.0               stringr_1.5.1               purrr_1.0.2                
[33] readr_2.1.4                 tidyr_1.3.0                 ggplot2_3.4.4               tidyverse_2.0.0            
[37] tibble_3.2.1                dplyr_1.1.4                 patchwork_1.1.3             pheatmap_1.0.12            
[41] SingleR_2.4.0               celldex_1.12.0              SummarizedExperiment_1.32.0 Biobase_2.62.0             
[45] GenomicRanges_1.54.1        GenomeInfoDb_1.38.1         IRanges_2.36.0              S4Vectors_0.40.2           
[49] BiocGenerics_0.48.1         MatrixGenerics_1.14.0       matrixStats_1.1.0           rhdf5_2.46.1               
[53] Matrix_1.6-4                harmony_1.2.0               Rcpp_1.0.11                 Seurat_5.0.1               
[57] SeuratObject_5.0.1          sp_2.1-2                   

loaded via a namespace (and not attached):
  [1] ProtGenerics_1.34.0           spatstat.sparse_3.0-3         bitops_1.0-7                 
  [4] httr_1.4.7                    RColorBrewer_1.1-3            tools_4.3.2                  
  [7] sctransform_0.4.1             utf8_1.2.4                    R6_2.5.1                     
 [10] HDF5Array_1.30.0              lazyeval_0.2.2                uwot_0.1.16                  
 [13] rhdf5filters_1.14.1           withr_2.5.2                   prettyunits_1.2.0            
 [16] gridExtra_2.3                 progressr_0.14.0              cli_3.6.1                    
 [19] spatstat.explore_3.2-5        fastDummies_1.7.3             spatstat.data_3.0-3          
 [22] ggridges_0.5.4                pbapply_1.7-2                 Rsamtools_2.18.0             
 [25] parallelly_1.36.0             rstudioapi_0.15.0             RSQLite_2.3.4                
 [28] generics_0.1.3                BiocIO_1.12.0                 ica_1.0-3                    
 [31] spatstat.random_3.2-2         zip_2.3.0                     fansi_1.0.6                  
 [34] clipr_0.8.0                   abind_1.4-5                   lifecycle_1.0.4              
 [37] yaml_2.3.7                    recipes_1.0.8                 SparseArray_1.2.2            
 [40] BiocFileCache_2.10.1          Rtsne_0.17                    grid_4.3.2                   
 [43] blob_1.2.4                    promises_1.2.1                ExperimentHub_2.10.0         
 [46] crayon_1.5.2                  miniUI_0.1.1.1                beachmat_2.18.0              
 [49] cowplot_1.1.1                 KEGGREST_1.42.0               pillar_1.9.0                 
 [52] rjson_0.2.21                  future.apply_1.11.0           codetools_0.2-19             
 [55] leiden_0.4.3.1                glue_1.6.2                    vctrs_0.6.5                  
 [58] png_0.1-8                     spam_2.10-0                   gtable_0.3.4                 
 [61] cachem_1.0.8                  gower_1.0.1                   prodlim_2023.08.28           
 [64] S4Arrays_1.2.0                mime_0.12                     tidygraph_1.2.3              
 [67] survival_3.5-7                timeDate_4022.108             SingleCellExperiment_1.24.0  
 [70] iterators_1.0.14              hardhat_1.3.0                 lava_1.7.3                   
 [73] interactiveDisplayBase_1.40.0 ellipsis_0.3.2                fitdistrplus_1.1-11          
 [76] ipred_0.9-14                  ROCR_1.0-11                   nlme_3.1-164                 
 [79] bit64_4.0.5                   progress_1.2.3                filelock_1.0.3               
 [82] RcppAnnoy_0.0.21              rprojroot_2.0.4               irlba_2.3.5.1                
 [85] rpart_4.1.23                  KernSmooth_2.23-22            colorspace_2.1-0             
 [88] DBI_1.1.3                     nnet_7.3-19                   tidyselect_1.2.0             
 [91] bit_4.0.5                     compiler_4.3.2                curl_5.2.0                   
 [94] xml2_1.3.6                    DelayedArray_0.28.0           plotly_4.10.3                
 [97] rtracklayer_1.62.0            scales_1.3.0                  lmtest_0.9-40                
[100] rappdirs_0.3.3                digest_0.6.33                 goftest_1.2-3                
[103] spatstat.utils_3.0-4          XVector_0.42.0                htmltools_0.5.7              
[106] pkgconfig_2.0.3               sparseMatrixStats_1.14.0      dbplyr_2.4.0                 
[109] fastmap_1.1.1                 rlang_1.1.2                   htmlwidgets_1.6.4            
[112] shiny_1.8.0                   DelayedMatrixStats_1.24.0     farver_2.1.1                 
[115] zoo_1.8-12                    jsonlite_1.8.8                BiocParallel_1.36.0          
[118] ModelMetrics_1.2.2.2          BiocSingular_1.18.0           RCurl_1.98-1.13              
[121] magrittr_2.0.3                GenomeInfoDbData_1.2.11       dotCall64_1.1-1              
[124] Rhdf5lib_1.24.0               munsell_0.5.0                 viridis_0.6.4                
[127] pROC_1.18.5                   stringi_1.8.2                 zlibbioc_1.48.0              
[130] MASS_7.3-60                   AnnotationHub_3.10.0          listenv_0.9.0                
[133] ggrepel_0.9.4                 deldir_2.0-2                  graphlayouts_1.0.2           
[136] Biostrings_2.70.1             splines_4.3.2                 tensor_1.5                   
[139] hms_1.1.3                     locfit_1.5-9.8                igraph_1.5.1                 
[142] spatstat.geom_3.2-7           RcppHNSW_0.5.0                biomaRt_2.58.0               
[145] ScaledMatrix_1.10.0           BiocVersion_3.18.1            BiocManager_1.30.22          
[148] foreach_1.5.2                 tweenr_2.0.2                  tzdb_0.4.0                   
[151] httpuv_1.6.13                 RANN_2.6.1                    polyclip_1.10-6              
[154] future_1.33.0                 scattermore_1.2               ggforce_0.4.1                
[157] rsvd_1.0.5                    xtable_1.8-4                  restfulr_0.0.15              
[160] RSpectra_0.16-1               later_1.3.2                   class_7.3-22                 
[163] viridisLite_0.4.2             memoise_2.0.1                 GenomicAlignments_1.38.0     
[166] cluster_2.1.6                 timechange_0.2.0              globals_0.16.2               
[169] here_1.0.1

pati-ni · 2023-12-12T14:45:14Z

> Is there any way I could integrate tumor and normal data to remove the batch-effect without removing the intrinsic differences between sample types? Perhaps I should integrate them separately (tumor vs normal) and then integrate the Harmony embeddings with Seurat v5's IntegrateEmbeddings function?

Yes, that is correct. If you do sample level correction, it basically corrects the latent dimensions for everything that is nested in that experimental design. Performing the correction separately, as you suggest, would be the way to go.
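For readers following along, the "correct each sample type separately" idea might look roughly like the sketch below. It is not code from this thread: `harmony_per_sample_type` is a hypothetical helper, the metadata column names are the ones mentioned in the question, and the preprocessing steps (normalization, variable features, scaling, PCA with 50 components) are assumed defaults that may need adapting to your object.

```r
# Hypothetical helper: run Harmony separately within each sample type.
# Assumes `obj` is a Seurat object holding raw counts, with the metadata
# columns "sample_type", "dataset_id" and "sample_id" described in the question.
harmony_per_sample_type <- function(obj,
                                    split_by = "sample_type",
                                    batch_vars = c("dataset_id", "sample_id"),
                                    npcs = 50) {
  # One Seurat object per level of `split_by` (e.g. tumor and normal)
  parts <- Seurat::SplitObject(obj, split.by = split_by)

  lapply(parts, function(part) {
    # Preprocessing is redone within each subset (assumed defaults)
    part <- Seurat::NormalizeData(part, verbose = FALSE)
    part <- Seurat::FindVariableFeatures(part, verbose = FALSE)
    part <- Seurat::ScaleData(part, verbose = FALSE)
    part <- Seurat::RunPCA(part, npcs = npcs, verbose = FALSE)

    # Correct only for dataset and sample, within this sample type
    harmony::RunHarmony(part, group.by.vars = batch_vars)
  })
}
```

Calling `harmony_per_sample_type(dataMerged)` would return a named list with one corrected object per value of `sample_type`. How to bring the two embeddings back together afterwards is left open here, as it is in the thread.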

You can, however, still investigate cell abundance within the tumor and the normal kidney samples; that is fine to do this way.

A minor comment on your workflow: if you decide to use only latent dimensions 1:10, run Harmony on just those. I'm not sure how much it will change things, but running it on more dimensions may raise issues with the curse of dimensionality.
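A minimal sketch of that last suggestion, assuming the Seurat method of `RunHarmony` and its `dims.use` argument; `run_harmony_on_dims` is a hypothetical wrapper, and the batch variables are the ones used in the question:

```r
# Hypothetical wrapper: give Harmony only the PCs that are used downstream.
# Assumes `obj` is a Seurat object that already has a "pca" reduction.
run_harmony_on_dims <- function(obj,
                                dims = 1:10,
                                batch_vars = c("dataset_id", "sample_id")) {
  obj <- harmony::RunHarmony(
    obj,
    group.by.vars = batch_vars,
    reduction.use = "pca",
    dims.use = dims,           # only these PCs are passed to Harmony
    plot_convergence = TRUE
  )

  # The "harmony" reduction should then contain length(dims) columns,
  # so UMAP indexes them as 1..length(dims)
  Seurat::RunUMAP(obj, reduction = "harmony", dims = seq_along(dims))
}
```

With the object from the question this would be called as `dataMerged <- run_harmony_on_dims(dataMerged, dims = 1:10)`, so that Harmony and UMAP see the same ten dimensions.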

samgest closed this as completed Dec 12, 2023
