# Assemble cell type frequencies

In this notebook, we read our scRNA-seq data and clinical labs, and use these to produce frequencies and proportions of cell types at each of our cell type labeling levels for each sample.

These results are combined with sample-specific metadata to enable downstream analysis and comparisons.

In addition to computing the count and the fraction of total cells for each cell type in each sample, we use the Absolute Lymphocyte Count (ALC) from clinical blood counts performed on the same blood draws to estimate absolute cell counts. For each sample, we divide the ALC by the number of lymphocytes (T, B, and NK cells) observed in the scRNA-seq data to get a correction factor (absolute count per observed cell), and then multiply this factor by the number of cells observed for each cell type:

$$C_{type} = N_{type} \times \frac{ALC}{N_{Tcells} + N_{Bcells} + N_{NKcells}}$$

$ALC$: Absolute lymphocyte count  
$C_{type}$: Estimated absolute count for a specific type  
$N_{type}$: count of cells of a specific type in scRNA-seq sample  
$N_{Tcells} + N_{Bcells} + N_{NKcells}$: count of lymphocytes in scRNA-seq sample
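
As a worked instance of the formula above, the following sketch plugs in the values reported for sample KT00001 in the tables later in this notebook (ALC of 1337, 13903 lymphocytes in the scRNA-seq data, 1781 B cells). The variable names are illustrative only and are not used elsewhere in the notebook.

```r
# Values for sample KT00001, copied from the output tables in this notebook
alc      <- 1337    # bc.lymphocyte_count
n_lymph  <- 13903   # scrna.lymphocyte_count (T + B + NK cells)
n_b_cell <- 1781    # AIFI_L1_count for "B cell"

# Correction factor: absolute count per observed scRNA-seq cell
alc_ratio <- alc / n_lymph

# Estimated absolute count for B cells
c_b_cell <- n_b_cell * alc_ratio

print(round(alc_ratio, 7))  # 0.0961663
print(round(c_b_cell, 4))   # 171.2722
```

These match the `alc_ratio` and `AIFI_L1_alc` values shown for the KT00001 B cell row.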

## Output format
Outputs include the following columns:  

**Sample Metadata columns**  
`cohort.cohortGuid`: Cohort ID (BR1 or BR2)  
`subject.subjectGuid`: Subject ID  
`subject.biologicalSex`: Subject Sex (Female or Male)  
`subject.cmv`: Subject CMV Status (Negative or Positive)  
`subject.bmi`: Subject BMI (integer)  
`subject.race`: Subject race  
`subject.ethnicity`: Subject ethnicity  
`subject.birthYear`: Subject Birth Year  
`subject.ageAtFirstDraw`: Subject Age at earliest blood draw in study  
`sample.sampleKitGuid`: Sample Kit ID  
`sample.visitName`: Sample Visit Name  
`sample.drawDate`: Sample Draw Date (Year-Month)  
`sample.subjectAgeAtDraw`: Subject age at time of draw, based on year of Draw Date and Birth Year  
`specimen.specimenGuid`: Specimen ID (pbmc_sample_id in .h5 files)  
  
**Frequency-related columns**  
(for AIFI_L1 as an example; AIFI_L1 is replaced with AIFI_L2 and AIFI_L3 for those levels)  

`AIFI_L1`: Cell Type assignment  
`AIFI_L1_count`: Count of cells within this sample with this cell type assignment  
`total_cells`: Total cells within this sample  
`scrna.lymphocyte_count`: Sum of T, NK, and B cells  
`bc.lymphocyte_count`: Absolute Lymphocyte Count (ALC) from clinical Blood Counts (bc.)  
`alc_ratio`: ALC per scRNA Lymphocyte Count  
`AIFI_L1_frac_total`: Count of cells with this cell type assignment divided by total cells for this sample  
`AIFI_L1_alc`: Estimated absolute count for this cell type assignment (`AIFI_L1_count` × `alc_ratio`)  
`AIFI_L1_clr`: Centered Log Ratio computed using AIFI_L1_frac_total for all types within this sample  


In [1]:
quiet_library <- function(...) { suppressPackageStartupMessages(library(...)) }

quiet_library(hise)
quiet_library(data.table)
quiet_library(dplyr)

CLR Transformation function from Mansi Singh

In [2]:
clr_transform <- function(x) {
  if (length(x) == 0) {
    return(NA)  # return NA for empty vectors
  }
  geom_mean <- exp(mean(log(x)))
  return(log(x / geom_mean))
}
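
To illustrate how this transform behaves, here is a small sketch on hypothetical counts (the numbers are invented for illustration). The function is restated so the example runs on its own. It checks two properties: CLR values sum to zero within a sample, and the result is the same whether raw counts or fractions of the total are supplied, because a constant scaling cancels against the geometric mean.

```r
# Restated from the cell above so this sketch is self-contained
clr_transform <- function(x) {
  if (length(x) == 0) {
    return(NA)
  }
  geom_mean <- exp(mean(log(x)))
  return(log(x / geom_mean))
}

# Hypothetical per-type cell counts for a single sample
counts <- c(50, 30, 15, 5)
fracs  <- counts / sum(counts)

clr_counts <- clr_transform(counts)
clr_fracs  <- clr_transform(fracs)

# Property 1: CLR values sum to (numerically) zero within a sample
stopifnot(abs(sum(clr_fracs)) < 1e-12)

# Property 2: counts and fractions give the same CLR values
stopifnot(isTRUE(all.equal(clr_counts, clr_fracs)))

# A zero entry makes log(0) = -Inf, so every value becomes non-finite
clr_zero <- clr_transform(c(50, 30, 0))
stopifnot(all(!is.finite(clr_zero)))
```

In this notebook the counts come from `group_by()` and `n()`, so a cell type that is absent from a sample produces no row rather than a zero. The zero case would only matter if missing types were filled in before the transform.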

## Retrieve clinical lab results

In [3]:
labs_uuid <- "bd618af8-99e2-45f8-bed8-655348d4cfeb"
res <- cacheFiles(list(labs_uuid))

In [4]:
labs_file <- "cache/bd618af8-99e2-45f8-bed8-655348d4cfeb/diha_assembled_labs_2024-04-09.csv"
labs <- read.csv(labs_file)

In [5]:
alc <- labs %>%
  select(sample.sampleKitGuid, bc.lymphocyte_count)

## Retrieve labeled cell metadata

In [6]:
meta_uuid <- "4a4d94b0-3a15-4403-b0f4-fe22204741e4"
res <- cacheFiles(list(meta_uuid))

In [7]:
meta_file <- "cache/4a4d94b0-3a15-4403-b0f4-fe22204741e4/diha_all_cell_meta_2024-05-05.csv"
meta <- fread(meta_file)
meta <- as.data.frame(meta)

In [8]:
nrow(meta)

In [9]:
sample_meta <- meta %>%
  select(starts_with("cohort"),
         starts_with("subject"),
         starts_with("sample"),
         starts_with("specimen")) %>%
  unique()

In [10]:
nrow(sample_meta)

In [11]:
sample_cols <- names(sample_meta)

## Compute lymphocyte counts per sample

In [12]:
l1_types <- unique(meta$AIFI_L1)
l1_types

In [13]:
lc_l1_types <- c("B cell", "NK cell", "T cell")
sample_lc <- meta %>%
  group_by(sample.sampleKitGuid) %>%
  mutate(total_cells = n()) %>%
  filter(AIFI_L1 %in% lc_l1_types) %>%
  group_by(sample.sampleKitGuid, total_cells) %>%
  summarise(scrna.lymphocyte_count = n(),
            .groups = "keep")

[1m[22m`summarise()` has grouped output by 'sample.sampleKitGuid'. You can override
using the `.groups` argument.
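
To make the per-sample counting concrete, here is a sketch on a toy table (the `toy_meta` data and sample IDs `S1` and `S2` are invented for illustration). It uses a slightly different formulation than the cell above: instead of filtering to lymphocytes before summarising, it counts them with `sum(AIFI_L1 %in% lc_l1_types)`. Under that assumption, a sample containing no lymphocytes is kept with a count of 0 rather than being dropped.

```r
library(dplyr)

# Toy cell-level metadata: one row per cell (hypothetical values)
toy_meta <- data.frame(
  sample.sampleKitGuid = c("S1", "S1", "S1", "S1", "S2", "S2"),
  AIFI_L1 = c("T cell", "B cell", "Monocyte", "NK cell", "Monocyte", "DC")
)

lc_l1_types <- c("B cell", "NK cell", "T cell")

# One row per sample: total cells and lymphocyte cells
toy_lc <- toy_meta %>%
  group_by(sample.sampleKitGuid) %>%
  summarise(
    total_cells = n(),
    scrna.lymphocyte_count = sum(AIFI_L1 %in% lc_l1_types),
    .groups = "drop"
  )

print(toy_lc)
```

Here `S1` has 4 cells, 3 of them lymphocytes, and `S2` has 2 cells and no lymphocytes. A zero lymphocyte count would make the later ALC ratio non-finite, so such samples would need separate handling.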


In [14]:
sample_lc <- sample_lc %>%
  left_join(alc, by = "sample.sampleKitGuid")

[1m[22mJoining with `by = join_by(sample.sampleKitGuid)`


In [15]:
sample_lc <- sample_lc %>%
  mutate(alc_ratio = bc.lymphocyte_count / scrna.lymphocyte_count)

In [16]:
head(sample_lc)

sample.sampleKitGuid,total_cells,scrna.lymphocyte_count,bc.lymphocyte_count,alc_ratio
<chr>,<int>,<int>,<int>,<dbl>
KT00001,18231,13903,1337,0.0961663
KT00002,17766,13888,2173,0.15646601
KT00003,18788,15012,1861,0.12396749
KT00004,16849,14648,1444,0.09858001
KT00006,17550,10503,1406,0.13386651
KT00007,16526,13906,1824,0.1311664
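
Because the ALC values are attached with a left join, two situations are worth checking: a kit with no lab row ends up with `NA` for `bc.lymphocyte_count` (and therefore `alc_ratio`), and a kit with more than one lab row would duplicate rows. The sketch below shows one way to detect both on toy data. The tables `toy_lc` and `toy_alc` are hypothetical and do not reflect the real lab file.

```r
library(dplyr)

# Hypothetical per-sample lymphocyte counts
toy_lc <- data.frame(
  sample.sampleKitGuid = c("S1", "S2", "S3"),
  scrna.lymphocyte_count = c(100L, 200L, 150L)
)

# Hypothetical lab table: S2 is listed twice, S3 is missing
toy_alc <- data.frame(
  sample.sampleKitGuid = c("S1", "S2", "S2"),
  bc.lymphocyte_count = c(1500L, 2000L, 2000L)
)

# Kits with more than one lab row
dup_kits <- toy_alc %>%
  count(sample.sampleKitGuid) %>%
  filter(n > 1)

# Join after dropping exact duplicate lab rows
joined <- toy_lc %>%
  left_join(distinct(toy_alc), by = "sample.sampleKitGuid") %>%
  mutate(alc_ratio = bc.lymphocyte_count / scrna.lymphocyte_count)

# Samples that have no ALC after the join
missing_alc <- joined %>%
  filter(is.na(bc.lymphocyte_count))

print(dup_kits)
print(missing_alc)
```

Note that `distinct()` only removes rows that are identical in every column; conflicting lab values for the same kit would still need to be resolved by hand.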


## Assemble cell type counts

### L1

In [17]:
l1_counts <- meta %>%
  # Count each type per sample
    group_by(sample.sampleKitGuid, AIFI_L1) %>%
    summarise(AIFI_L1_count = n(), .groups = "keep") %>%
  # Add ALC and total sample counts for use below
    left_join(sample_lc, by = "sample.sampleKitGuid") %>%
  # Compute fractions of total cells per sample
    mutate(AIFI_L1_frac_total = AIFI_L1_count / total_cells) %>%
  # Estimate absolute counts by scaling observed counts with the ALC ratio
    mutate(AIFI_L1_alc = AIFI_L1_count * alc_ratio) %>%
  # Regroup by sample and compute CLR for fractions
    group_by(sample.sampleKitGuid) %>%
    mutate(AIFI_L1_clr = clr_transform(AIFI_L1_count / total_cells))

[1m[22mJoining with `by = join_by(sample.sampleKitGuid)`


In [18]:
head(l1_counts)

sample.sampleKitGuid,AIFI_L1,AIFI_L1_count,total_cells,scrna.lymphocyte_count,bc.lymphocyte_count,alc_ratio,AIFI_L1_frac_total,AIFI_L1_alc,AIFI_L1_clr
<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
KT00001,B cell,1781,18231,13903,1337,0.0961663,0.0976907465,171.2721715,2.0880254
KT00001,DC,246,18231,13903,1337,0.0961663,0.0134935001,23.6569086,0.1084267
KT00001,Erythrocyte,15,18231,13903,1337,0.0961663,0.0008227744,1.4424944,-2.6888546
KT00001,ILC,8,18231,13903,1337,0.0961663,0.000438813,0.7693304,-3.3174633
KT00001,Monocyte,4004,18231,13903,1337,0.0961663,0.2196259119,385.0498454,2.8981443
KT00001,NK cell,1202,18231,13903,1337,0.0961663,0.0659316549,115.5918866,1.6948373
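
The derived columns satisfy a few identities that can serve as sanity checks: within each sample the fractions sum to 1, the ALC estimates of the lymphocyte types (B, NK, and T cells) sum back to the clinical ALC, and the CLR values sum to 0. The sketch below demonstrates this on a toy single-sample table with invented counts. It computes the CLR as `log(x) - mean(log(x))`, which is algebraically the same as `log(x / geom_mean)`.

```r
library(dplyr)

lc_l1_types <- c("B cell", "NK cell", "T cell")

# Hypothetical per-type counts for one sample with an ALC of 2000
toy_counts <- data.frame(
  sample.sampleKitGuid = "S1",
  AIFI_L1 = c("B cell", "NK cell", "T cell", "Monocyte"),
  AIFI_L1_count = c(100L, 50L, 350L, 500L),
  bc.lymphocyte_count = 2000L
)

# Derive the same columns the notebook produces
toy_counts <- toy_counts %>%
  group_by(sample.sampleKitGuid) %>%
  mutate(
    total_cells = sum(AIFI_L1_count),
    scrna.lymphocyte_count = sum(AIFI_L1_count[AIFI_L1 %in% lc_l1_types]),
    alc_ratio = bc.lymphocyte_count / scrna.lymphocyte_count,
    AIFI_L1_frac_total = AIFI_L1_count / total_cells,
    AIFI_L1_alc = AIFI_L1_count * alc_ratio,
    AIFI_L1_clr = log(AIFI_L1_frac_total) - mean(log(AIFI_L1_frac_total))
  ) %>%
  ungroup()

# Per-sample sanity checks
checks <- toy_counts %>%
  group_by(sample.sampleKitGuid) %>%
  summarise(
    frac_sum = sum(AIFI_L1_frac_total),
    lymph_alc_sum = sum(AIFI_L1_alc[AIFI_L1 %in% lc_l1_types]),
    alc = first(bc.lymphocyte_count),
    clr_sum = sum(AIFI_L1_clr),
    .groups = "drop"
  )

print(checks)
```

The same `summarise()` step could be applied to `l1_counts` as a quick check; non-lymphocyte types such as monocytes are extrapolated with the same ratio, so their estimates are not constrained by the ALC.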


Add sample metadata and arrange columns

In [19]:
l1_counts <- l1_counts %>%
  left_join(sample_meta, by = "sample.sampleKitGuid") %>%
  select(all_of(sample_cols), everything())

In [20]:
head(l1_counts)

cohort.cohortGuid,subject.subjectGuid,subject.biologicalSex,subject.cmv,subject.bmi,subject.race,subject.ethnicity,subject.birthYear,subject.ageAtFirstDraw,sample.sampleKitGuid,⋯,specimen.specimenGuid,AIFI_L1,AIFI_L1_count,total_cells,scrna.lymphocyte_count,bc.lymphocyte_count,alc_ratio,AIFI_L1_frac_total,AIFI_L1_alc,AIFI_L1_clr
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,B cell,1781,18231,13903,1337,0.0961663,0.0976907465,171.2721715,2.0880254
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,DC,246,18231,13903,1337,0.0961663,0.0134935001,23.6569086,0.1084267
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,Erythrocyte,15,18231,13903,1337,0.0961663,0.0008227744,1.4424944,-2.6888546
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,ILC,8,18231,13903,1337,0.0961663,0.000438813,0.7693304,-3.3174633
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,Monocyte,4004,18231,13903,1337,0.0961663,0.2196259119,385.0498454,2.8981443
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,PB00001-01,NK cell,1202,18231,13903,1337,0.0961663,0.0659316549,115.5918866,1.6948373


Save counts

In [21]:
l1_file <- paste0("output/diha_AIFI_L1_frequencies_", Sys.Date(), ".csv")

In [22]:
write.csv(l1_counts, l1_file,
          row.names = FALSE, quote = FALSE)

### L2

In [23]:
l2_counts <- meta %>%
  # Count each type per sample
    group_by(sample.sampleKitGuid, AIFI_L1, AIFI_L2) %>%
    summarise(AIFI_L2_count = n(), .groups = "keep") %>%
  # Add ALC and total sample counts for use below
    left_join(sample_lc, by = "sample.sampleKitGuid") %>%
  # Compute fractions of total cells per sample
    mutate(AIFI_L2_frac_total = AIFI_L2_count / total_cells) %>%
  # Estimate absolute counts by scaling observed counts with the ALC ratio
    mutate(AIFI_L2_alc = AIFI_L2_count * alc_ratio) %>%
  # Regroup by sample and compute CLR for fractions
    group_by(sample.sampleKitGuid) %>%
    mutate(AIFI_L2_clr = clr_transform(AIFI_L2_count / total_cells))

[1m[22mJoining with `by = join_by(sample.sampleKitGuid)`


Add sample metadata and arrange columns

In [24]:
l2_counts <- l2_counts %>%
  left_join(sample_meta, by = "sample.sampleKitGuid") %>%
  select(all_of(sample_cols), everything())

In [25]:
head(l2_counts)

cohort.cohortGuid,subject.subjectGuid,subject.biologicalSex,subject.cmv,subject.bmi,subject.race,subject.ethnicity,subject.birthYear,subject.ageAtFirstDraw,sample.sampleKitGuid,⋯,AIFI_L1,AIFI_L2,AIFI_L2_count,total_cells,scrna.lymphocyte_count,bc.lymphocyte_count,alc_ratio,AIFI_L2_frac_total,AIFI_L2_alc,AIFI_L2_clr
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,B cell,Effector B cell,97,18231,13903,1337,0.0961663,0.0053206078,9.3281306,-0.41027174
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,B cell,Memory B cell,381,18231,13903,1337,0.0961663,0.0208984696,36.6393584,0.957816657
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,B cell,Naive B cell,1140,18231,13903,1337,0.0961663,0.062530854,109.6295764,2.053800823
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,B cell,Plasma cell,18,18231,13903,1337,0.0961663,0.0009873293,1.7309933,-2.094610961
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,B cell,Transitional B cell,145,18231,13903,1337,0.0961663,0.0079534858,13.9441128,-0.008248976
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,DC,ASDC,7,18231,13903,1337,0.0961663,0.0003839614,0.6731641,-3.03907257


Save counts

In [26]:
l2_file <- paste0("output/diha_AIFI_L2_frequencies_", Sys.Date(), ".csv")

In [27]:
write.csv(l2_counts, l2_file,
          row.names = FALSE, quote = FALSE)

### L3

In [28]:
l3_counts <- meta %>%
  # Count each type per sample
    group_by(sample.sampleKitGuid, AIFI_L1, AIFI_L2, AIFI_L3) %>%
    summarise(AIFI_L3_count = n(), .groups = "keep") %>%
  # Add ALC and total sample counts for use below
    left_join(sample_lc, by = "sample.sampleKitGuid") %>%
  # Compute fractions of total cells per sample
    mutate(AIFI_L3_frac_total = AIFI_L3_count / total_cells) %>%
  # Estimate absolute counts by scaling observed counts with the ALC ratio
    mutate(AIFI_L3_alc = AIFI_L3_count * alc_ratio) %>%
  # Regroup by sample and compute CLR for fractions
    group_by(sample.sampleKitGuid) %>%
    mutate(AIFI_L3_clr = clr_transform(AIFI_L3_count / total_cells))

[1m[22mJoining with `by = join_by(sample.sampleKitGuid)`


Add sample metadata and arrange columns

In [29]:
l3_counts <- l3_counts %>%
  left_join(sample_meta, by = "sample.sampleKitGuid") %>%
  select(all_of(sample_cols), everything())

In [30]:
head(l3_counts)

cohort.cohortGuid,subject.subjectGuid,subject.biologicalSex,subject.cmv,subject.bmi,subject.race,subject.ethnicity,subject.birthYear,subject.ageAtFirstDraw,sample.sampleKitGuid,⋯,AIFI_L2,AIFI_L3,AIFI_L3_count,total_cells,scrna.lymphocyte_count,bc.lymphocyte_count,alc_ratio,AIFI_L3_frac_total,AIFI_L3_alc,AIFI_L3_clr
<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<int>,<int>,<chr>,⋯,<chr>,<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Effector B cell,CD27+ effector B cell,71,18231,13903,1337,0.0961663,0.0038944655,6.8278069,0.3324226
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Effector B cell,CD27- effector B cell,26,18231,13903,1337,0.0961663,0.0014261423,2.5003237,-0.6721608
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Memory B cell,Activated memory B cell,3,18231,13903,1337,0.0961663,0.0001645549,0.2884989,-2.831645
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Memory B cell,CD95 memory B cell,15,18231,13903,1337,0.0961663,0.0008227744,1.4424944,-1.2222071
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Memory B cell,Core memory B cell,329,18231,13903,1337,0.0961663,0.0180461851,31.6387111,1.8658004
BR1,BR1001,Female,Negative,23,Caucasian,Non-Hispanic origin,1987,32,KT00001,⋯,Memory B cell,Early memory B cell,10,18231,13903,1337,0.0961663,0.0005485163,0.961663,-1.6276722


Save counts

In [31]:
l3_file <- paste0("output/diha_AIFI_L3_frequencies_", Sys.Date(), ".csv")

In [32]:
write.csv(l3_counts, l3_file,
          row.names = FALSE, quote = FALSE)

## Store results in HISE

To store the results in HISE, we list the input files used to generate them (the lab and cell metadata files cached above) and upload the output CSV files for use in later steps.

In [33]:
study_space_uuid <- "de025812-5e73-4b3c-9c3b-6d0eac412f2a"
title <- paste("DIHA Cell Type Frequency, ALC, and CLR", Sys.Date())

In [34]:
search_id <- ids::proquint(n_words = 3)
search_id

In [35]:
in_list <- list(labs_uuid, meta_uuid)

In [36]:
out_list <- list(l1_file, l2_file, l3_file)

In [37]:
uploadFiles(
    files = out_list,
    studySpaceId = study_space_uuid,
    title = title,
    inputFileIds = in_list,
    destination = search_id,
    store = "project",
    doPrompt = FALSE
)

In [38]:
sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4       data.table_1.15.4 hise_2.16.0      

loaded via a namespace (and not attached):
 [1] ids_1.0.1        crayon_1.5.2     vctrs_0.6.5      httr_1.4.7      
 [5] cli_3.6.2        rlang_1.1.3      stringi_1.8.3    generics_0.1.3  
 [9] assertthat_0.2.1 jsonlite_1.8.8   glue_1.7.0       RCurl_1.9