ismms-himc/himc_helper_functions

himc_helper_functions

This repo will contain a script with functions that are used for 10x single cell data. These functions include:

  • Input/Output of single cell data (e.g. CytoBank)
  • Filtering sparse matrices (e.g. GEX UMI)
  • De-hashing samples
  • Read/write from/to AWS S3
  • Visualize data for QC purposes
  • Generate cell and feature meta-data
  • Harmonize TCR/BCR Clonotypes across lanes (treat as cell meta-data)

Data Flow Diagram

Disk             In-Memory                Disk
----     ------------------------       ---------
MTX  ->  feature_data  ->   df      ->   parquets    (-> Database)
           (sparse)       (dense)   ->   MTX
                                    ->   CytoBank
                                    ->   Zarr
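The first steps of this flow (MTX on disk, sparse matrix in memory, dense DataFrame, parquet on disk) can be sketched as follows. This is a minimal illustration, not the code in himc_helper_functions.py: the function names `read_mtx_dir`, `to_dense_df`, and `save_parquet` are hypothetical, the directory is assumed to follow the Cell Ranger Version 3 layout, and writing parquet assumes a parquet engine such as pyarrow is installed.

```python
from pathlib import Path

import pandas as pd
from scipy.io import mmread


def read_mtx_dir(mtx_dir):
    """Read a Cell Ranger v3 style directory into (sparse matrix, features, barcodes).

    Hypothetical helper: assumes matrix.mtx.gz, features.tsv.gz and
    barcodes.tsv.gz are present, with feature names in the second column.
    """
    mtx_dir = Path(mtx_dir)
    # mmread opens .gz files based on the file extension
    mat = mmread(str(mtx_dir / "matrix.mtx.gz")).tocsr()
    barcodes = pd.read_csv(mtx_dir / "barcodes.tsv.gz", header=None)[0].tolist()
    features = pd.read_csv(
        mtx_dir / "features.tsv.gz", sep="\t", header=None
    )[1].tolist()
    return mat, features, barcodes


def to_dense_df(mat, features, barcodes):
    """Convert a sparse features-by-cells matrix into a dense DataFrame."""
    return pd.DataFrame(mat.toarray(), index=features, columns=barcodes)


def save_parquet(df, path):
    """Write the dense DataFrame to parquet (requires pyarrow or fastparquet)."""
    df.to_parquet(path)
```

Keeping the sparse-to-dense conversion in its own function makes it easy to filter the sparse matrix first, since the dense form can be much larger in memory.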

himc_helper_functions Roadmap

Warning -- the himc_helper_functions.py script is currently a grab bag full of stuff, but will become more organized over time. This repo may eventually become a pip installable library or may (at least partially) become absorbed by Clustergrammer2.

Single-Cell Data Read/Write

This set of functions will be used to read and write single cell data between several commonly used and custom data formats. This section of the README describes these files, the schemas for what we want to save, and the data formats we want to use.

Metadata Overview

This section outlines the metadata we would like to store for different entities (e.g. cells). For the time being we plan to store this metadata as separate parquet files (with metadata types as columns and entities as rows). The advantage of this workflow is that we can add cell-level metadata as it becomes available (e.g. cell type prediction) without having to re-write the underlying data (e.g. GEX data). We should also look into the AnnData format used by Scanpy.

Embedding cell metadata into the MTX file format could be done (by treating numeric metadata as a 'Custom' feature), but the same cannot be done for feature-based metadata (e.g. we would not want to have a metadata barcode).

  • cell_meta_data

    • Gene Expression Level Meta-data

      • gex_umi_sum
      • num_genes_meas
      • fraction_mito_umi
      • sum_mito_umi
      • mean_mito_umi (can be used in place of multiple genes)
      • mean_ribo_umi (can be used in place of multiple genes)
    • Feature Level Meta-Data

      • adt_umi_sum (if applicable)
      • hto_umi_sum (if applicable)
      • hto_first_vs_second_highest_ratio (if applicable)
    • Cell Type Level Meta-Data

      • cell_type_broad (if applicable)
      • cell_type_narrow (if applicable)
      • cell_state (if applicable)
      • t_cell_clonotype (if applicable)
      • b_cell_clonotype (if applicable)
      • Subject (if applicable)
      • Sample-Metadata (e.g. timepoint, if applicable)
  • gex_meta_data

    • Ensembl ID (for the purpose of exporting to Cell Ranger format)
    • fraction_of_cells_expressing (divide by total number of cells)
    • mean_umi (across all cells)
    • var_umi (across all cells)
  • adt_meta_data

    • fraction_of_cells_expressing (divide by total number of cells)
    • mean_umi_level (across all cells)
  • hto_meta_data

    • fraction_of_cells_expressing (divide by total number of cells)
    • mean_umi_level (across all cells)
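As a rough illustration of how some of the metadata columns above could be derived, the sketch below computes a few cell-level and gene-level values from a dense GEX DataFrame (genes as rows, cells as columns). The function names `make_cell_meta` and `make_gex_meta` are hypothetical, and identifying mitochondrial genes by an "MT-" prefix is an assumption that holds for human gene symbols only. The column is spelled fraction_of_cells_expressing here.

```python
import pandas as pd


def make_cell_meta(gex, mito_prefix="MT-"):
    """Cell-level metadata from a genes-by-cells DataFrame of UMI counts.

    Assumption: mitochondrial genes are those whose name starts with mito_prefix.
    """
    is_mito = gex.index.str.startswith(mito_prefix)
    meta = pd.DataFrame(index=gex.columns)
    meta["gex_umi_sum"] = gex.sum(axis=0)
    meta["num_genes_meas"] = (gex > 0).sum(axis=0)
    meta["sum_mito_umi"] = gex.loc[is_mito].sum(axis=0)
    meta["mean_mito_umi"] = gex.loc[is_mito].mean(axis=0)
    # Cells with zero total UMI produce NaN here
    meta["fraction_mito_umi"] = meta["sum_mito_umi"] / meta["gex_umi_sum"]
    return meta


def make_gex_meta(gex):
    """Gene-level metadata computed across all cells."""
    meta = pd.DataFrame(index=gex.index)
    meta["fraction_of_cells_expressing"] = (gex > 0).sum(axis=1) / gex.shape[1]
    meta["mean_umi"] = gex.mean(axis=1)
    # pandas uses the sample variance (ddof=1) by default
    meta["var_umi"] = gex.var(axis=1)
    return meta
```

Because each function returns a table indexed by entity (cells or genes), new columns such as predicted cell types can be added later and the table re-saved without touching the underlying count data.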

Data Formats

  • Cell Ranger Version 2 Sparse Matrix MTX Format (uncompressed, read-only format)

    • barcodes.tsv
    • genes.tsv
    • matrix.mtx
  • Cell Ranger Version 3 Sparse Matrix MTX Format (compressed, read and write format)

    • barcodes.tsv.gz
    • features.tsv.gz
    • matrix.mtx.gz
  • Parquet Format (read and write format)

    • gex.parquet
    • adt.parquet
    • hto.parquet
    • meta_cell.parquet
    • meta_gex.parquet
    • meta_adt.parquet
    • meta_hto.parquet
  • CytoBank Format

    • merge of top var gex (w/o ribo/mito)

Cytobank Upload Format

  1. General matrix format is 1 row per cell and 1 column per feature
  2. First column should be a cell index, with the column header "cell_index", incremented by 0.01
  3. Features should be indicated by the prefix ADT_, HTO_ or GEX_
  4. Any derived/calculated data will include der after the feature type. By default, always include:
     • GEX_der_umi_sum
     • GEX_der_unique_gene_count
     • GEX_der_mito_proportion_umi (proportion of all gex umi that derive from mitochondrial genes)
     • GEX_der_mito_avg (avg umi expression of all mitochondrial genes)
     • GEX_der_ribo_avg (avg umi expression of all ribosomal genes)
     • HTO_der_umi_sum
     • HTO_der_sn
     • ADT_der_umi_sum
  5. Add random noise to ADT and HTO data to 2 decimal places for visualization purposes
  6. Include top 500 differentially expressed genes, excluding mitochondrial or ribosomal genes
  7. By default, merge data from replicate lanes but add lane as a feature to indicate which cells came from which lane
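A few of the rules above can be sketched in code. This is a hedged illustration with several assumptions that the rules leave open: the function name `make_cytobank_table` is hypothetical, "incremented by 0.01" is interpreted as cell_index values 0.01, 0.02, and so on, the noise is drawn uniformly from [0, 1), and the GEX input is assumed to already contain only the selected genes. Lane merging, HTO_der_sn, and the mito/ribo derived features are left out.

```python
import numpy as np
import pandas as pd


def make_cytobank_table(gex, adt, hto, seed=0):
    """Build a cells-by-features table with ADT_/HTO_/GEX_ prefixed columns.

    Inputs are features-by-cells DataFrames that share the same cell barcodes.
    Assumptions: uniform noise in [0, 1); cell_index steps of 0.01 from 0.01.
    """
    rng = np.random.default_rng(seed)
    parts = []
    for prefix, df, add_noise in [
        ("GEX", gex, False),
        ("ADT", adt, True),
        ("HTO", hto, True),
    ]:
        cells = df.T.astype(float)  # one row per cell, one column per feature
        umi_sum = cells.sum(axis=1)  # derived sum uses the raw counts
        if add_noise:
            noise = rng.uniform(0, 1, size=cells.shape)
            cells = (cells + noise).round(2)
        cells.columns = [f"{prefix}_{name}" for name in cells.columns]
        cells[f"{prefix}_der_umi_sum"] = umi_sum
        parts.append(cells)

    table = pd.concat(parts, axis=1)
    table["GEX_der_unique_gene_count"] = (gex > 0).sum(axis=0)
    # First column is the cell index: 0.01, 0.02, ...
    cell_index = np.round(np.arange(1, len(table) + 1) * 0.01, 2)
    table.insert(0, "cell_index", cell_index)
    return table.reset_index(drop=True)
```

Computing the derived sums before adding noise keeps the `der` columns exact, while the noise only affects the per-feature values used for plotting.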

Sharing 10X Genomics Data

This is the rough layout of how we will deliver data for a single loading sample to users.

Loading-Sample name: "SAMPLE"

Directory Structure Layout

SAMPLE_Shared_10X_Count_Data/

  ############################################
  # Part 1: Cell Ranger Outputs
  ############################################
  
  cell_ranger_outputs/
    metrics_summary.csv
    web_summary.html
    
  raw_feature_bc_matrix/
    barcodes.tsv.gz
    features.tsv.gz
    matrix.mtx.gz
    
  filtered_feature_bc_matrix/
    barcodes.tsv.gz
    features.tsv.gz
    matrix.mtx.gz
  
  vdj/
    all_contig_annotations.csv
    filtered_contig_annotations.csv
    metrics_summary.csv
    web_summary.html
    
  ############################################
  # Part 2: HIMC Metadata
  ############################################    
  
  himc_outputs/
    meta_cell.csv
    meta_sample.csv
    meta_hto.csv
    meta_adt.csv
    meta_gex.csv (probably not going to include)
    0.1_SAMPLE_Pre-Processing_Notebook.html
  

The directory SAMPLE_Shared_10X_Count_Data will be zipped, and a single pre-signed URL will be generated and shared with the user. Two things should be noted: 1) the name of the directory includes the loading sample name, and 2) the loading sample metadata is included in the directory.
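The zip-and-share step could look roughly like the sketch below. The function names `zip_sample_dir` and `share_zip`, the use of the zip file name as the S3 key, and the one-hour default expiry are all assumptions for illustration; the upload step assumes boto3 is installed and AWS credentials are configured.

```python
import shutil
from pathlib import Path


def zip_sample_dir(sample_dir):
    """Zip the shared directory, keeping the directory name inside the archive."""
    sample_dir = Path(sample_dir).resolve()
    archive = shutil.make_archive(
        base_name=str(sample_dir),  # produces <sample_dir>.zip next to the directory
        format="zip",
        root_dir=str(sample_dir.parent),
        base_dir=sample_dir.name,
    )
    return Path(archive)


def share_zip(zip_path, bucket, expires_in=3600):
    """Upload the zip to S3 and return a pre-signed download URL.

    Hypothetical helper: bucket, key naming and expiry are placeholders.
    """
    import boto3  # imported here so zipping works without boto3 installed

    s3 = boto3.client("s3")
    key = Path(zip_path).name
    s3.upload_file(str(zip_path), bucket, key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

Zipping with the directory as the archive root means the loading sample name travels with the data after the user unpacks it.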