# MOFA analysis of the Chromium Single Cell Multiome ATAC + Gene Expression assay

## Contents
1. Description
2. Install packages and load libraries
3. Load data
4. Load metadata
5. Parse Seurat object
6. Normalisation
7. Feature selection
8. Train the MOFA+ model
9. MOFA downstream analysis
10. Conclusions

## 1. Description
This tutorial demonstrates how MOFA can be used to integrate scRNA-seq and scATAC-seq data from the Chromium Single Cell Multiome ATAC + Gene Expression assay recently commercialised by 10x Genomics. It results from a collaboration between the 10x Genomics R&D team and the MOFA team. The data set consists of peripheral blood mononuclear cells (PBMCs) from a single healthy donor and is available here.

MOFA is a factor analysis model that provides a general framework for the integration of multi-omic data sets in an unsupervised fashion. Intuitively, it can be viewed as a versatile and statistically rigorous generalisation of principal component analysis (PCA) to multi-omics data. Briefly, the model performs unsupervised dimensionality reduction simultaneously across multiple data modalities, thereby capturing the global sources of cell-to-cell variability via a small number of inferred factors. Importantly, it distinguishes variation that is shared between assays from variation that is unique to a specific assay. Thus, in this data set MOFA can be useful to disentangle the RNA and the ATAC activity of the different cellular populations that exist in PBMCs.
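To make the PCA analogy concrete, here is a minimal base-R sketch. It is not the MOFA model itself: it simulates two toy views (called `rna` and `atac` here purely for illustration), runs ordinary PCA on their concatenation, and then asks how much variance each component explains in each view. All sizes, loadings and noise levels are assumptions chosen for the example.

```r
set.seed(1)
n_cells <- 200

# One latent factor shared by both views, one specific to the toy RNA view
z_shared <- rnorm(n_cells)
z_rna    <- rnorm(n_cells)

# Hypothetical loadings: 30 "genes" and 50 "peaks"
w_rna_shared  <- rnorm(30)
w_rna_private <- rnorm(30)
w_atac_shared <- rnorm(50)

# Simulated views: low-rank signal plus Gaussian noise
rna <- outer(z_shared, w_rna_shared) + outer(z_rna, w_rna_private) +
  matrix(rnorm(n_cells * 30, sd = 0.5), n_cells, 30)
atac <- outer(z_shared, w_atac_shared) +
  matrix(rnorm(n_cells * 50, sd = 0.5), n_cells, 50)

# Plain PCA on the concatenated views (a crude stand-in for a joint factor model)
pca <- prcomp(cbind(rna, atac), center = TRUE, scale. = FALSE)
factors <- pca$x[, 1:2]

# Fraction of each view's total variance explained by a single component
r2_per_view <- function(view, factors) {
  total <- sum(scale(view, center = TRUE, scale = FALSE)^2)
  sapply(seq_len(ncol(factors)), function(k) {
    fit <- lm.fit(cbind(1, factors[, k]), view)
    1 - sum(fit$residuals^2) / total
  })
}

r2 <- rbind(RNA = r2_per_view(rna, factors), ATAC = r2_per_view(atac, factors))
colnames(r2) <- c("PC1", "PC2")
round(r2, 2)
```

A component with a high value in both rows reflects shared variation, whereas one that is high in a single row reflects view-specific variation. Plain PCA gives no guarantee of separating the two cleanly, which is the gap a multi-view factor model is designed to fill.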

## 2. Install packages and load libraries
Make sure that `MOFA2` is loaded last, to avoid name collisions with functions from other packages. Installing these packages can fail if system dependencies (libcurl, xml, etc.) are missing from your machine, and some R dependencies may be unavailable. I ended up installing these packages through RStudio. I'll try to make this seamless and straightforward once I have the time.

In [4]:
# installing devtools and remotes
install.packages('devtools', clean=TRUE, quiet=TRUE)
install.packages('remotes', clean=TRUE, quiet=TRUE)

# installing bioconductor and dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

# automatically install Bioconductor dependencies
setRepositories(ind=1:2)

Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)



In [5]:
# core packages needed
install.packages('data.table', clean=TRUE, quiet=TRUE)
install.packages('ggplot2', clean=TRUE, quiet=TRUE)

# automatically install Bioconductor dependencies
setRepositories(ind=1:2)

# installing packages from Bioconductor
BiocManager::install(c(
  'AnnotationFilter',
  'BiocGenerics',
  'GenomeInfoDb',
  'GenomicFeatures',
  'GenomicRanges',
  'IRanges',
  'Rsamtools',
  'S4Vectors',
  'JASPAR2020',
  'TFBSTools',
  'ggbio',
  'motifmatchr',
  'AnnotationDbi',
  'Seurat',
  'Signac',
  'msigdbr',
  'BSgenome.Hsapiens.UCSC.hg38')
)

# installing MOFA2
# pass each build option as a separate string
devtools::install_github("bioFAM/MOFA2/MOFA2", build_opts = c("--no-resave-data", "--no-build-vignettes"))

Bioconductor version 3.11 (BiocManager 1.30.10), R 4.0.2 (2020-06-22)

Installing package(s) 'AnnotationFilter', 'BiocGenerics', 'GenomeInfoDb',
  'GenomicFeatures', 'GenomicRanges', 'IRanges', 'Rsamtools', 'S4Vectors',
  'TFBSTools', 'ggbio', 'motifmatchr', 'AnnotationDbi', 'Seurat', 'Signac',
  'msigdbr', 'BSgenome.Hsapiens.UCSC.hg38'

Skipping install of 'MOFA2' from a github remote, the SHA1 (f6825ae7) has not changed since last install.
  Use `force = TRUE` to force installation
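Because installation can fail silently for individual packages, it may help to check what is actually available before loading anything. The sketch below uses only base R; the `required` vector simply lists the packages loaded in the next cell.

```r
# Packages loaded later in this tutorial
required <- c("data.table", "ggplot2", "Seurat", "Signac", "msigdbr",
              "JASPAR2020", "TFBSTools", "BSgenome.Hsapiens.UCSC.hg38", "MOFA2")

# requireNamespace() returns FALSE instead of raising an error when a package
# is missing, and it does not attach the package to the search path
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Any package reported as missing can then be installed individually, which usually makes the underlying error message easier to read.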



In [9]:
# load core libraries
library(data.table)
library(ggplot2)
library(Seurat)
library(Signac)

# for GSEA analysis
library(msigdbr)

# For motif enrichment analysis
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Hsapiens.UCSC.hg38)

# MOFA
library(MOFA2)

## 3. Load data
### 3.1 Load multi-modal Seurat object
We have created a Seurat object with RNA and ATAC data modalities stored as different assays.

Note that the .rds file is 2.5 GB, so it cannot be stored in a GitHub repository without Git LFS. For now I keep the .rds file in my local data directory and will not add it to the repository.
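Since the file lives outside the repository, one option is to download it once and reuse the local copy afterwards. The helper below is a hypothetical sketch (`fetch_if_missing` is not part of any package); the usage example takes its URL and path from the cell that follows and is commented out to avoid an accidental 2.5 GB download.

```r
# Hypothetical helper: download a file only if no local copy exists yet
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    # Large file: raise the download timeout (in seconds) for this call only
    old <- options(timeout = max(3600, getOption("timeout")))
    on.exit(options(old), add = TRUE)
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage:
# fetch_if_missing(
#   "ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/seurat.rds",
#   "data/10x_rna_atac/PBMC/seurat.rds"
# )
```

Reading from a local path with `readRDS()` also avoids re-downloading the object every time the notebook is rerun.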

In [14]:
# run this to download the data set
# seurat <- readRDS(url("ftp://ftp.ebi.ac.uk/pub/databases/mofa/10x_rna_atac_vignette/seurat.rds"))

seurat <- readRDS("data/10x_rna_atac/PBMC/seurat.rds")

seurat

An object of class Seurat 
138109 features across 11909 samples within 2 assays 
Active assay: RNA (29732 features, 0 variable features)
 1 other assay present: ATAC

The metadata slot contains the cell type annotations that were assigned a priori by the 10x Genomics R&D team. These will be useful to characterise the MOFA factors. One could also use the MOFA factors to perform clustering and cell type annotation, but we won't do this here.

In [15]:
head(seurat@meta.data[,c("celltype","broad_celltype","pass_rnaQC","pass_accQC")])

| | celltype | broad_celltype | pass_rnaQC | pass_accQC |
|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` |
| AAACAGCCAAGGAATC | naive CD4 T cells | Lymphoid | TRUE | TRUE |
| AAACAGCCAATCCCTT | memory CD4 T cells | Lymphoid | TRUE | TRUE |
| AAACAGCCAATGCGCT | naive CD4 T cells | Lymphoid | TRUE | TRUE |
| AAACAGCCACACTAAT | | | FALSE | FALSE |
| AAACAGCCACCAACCG | | | FALSE | FALSE |
| AAACAGCCAGGATAAC | | | FALSE | FALSE |


In [16]:
table(seurat@meta.data$celltype)


 CD56 (bright) NK cells     CD56 (dim) NK cells     classical monocytes 
                    407                     472                    1929 
   effector CD8 T cells  intermediate monocytes            MAIT T cells 
                    385                     664                     106 
         memory B cells      memory CD4 T cells              myeloid DC 
                    420                    1611                     242 
          naive B cells       naive CD4 T cells       naive CD8 T cells 
                    295                    1462                    1549 
non-classical monocytes         plasmacytoid DC 
                    383                     107 

In [17]:
table(seurat@meta.data$broad_celltype)


Lymphoid  Myeloid 
    6814     3218 

Keep only the cells that pass QC for both omics.

In [18]:
# subset columns (cells) to those flagged as passing QC in both modalities
seurat <- seurat[, seurat@meta.data$pass_accQC & seurat@meta.data$pass_rnaQC]
seurat

An object of class Seurat 
138109 features across 10032 samples within 2 assays 
Active assay: RNA (29732 features, 0 variable features)
 1 other assay present: ATAC
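The filter reduces the object from 11,909 to 10,032 cells, so 1,877 cells fail QC in at least one modality. To see how such failures split between the two assays, the flags can be cross-tabulated before filtering. The sketch below uses a toy data frame built from the six metadata rows shown earlier rather than the full object.

```r
# Toy metadata copied from the first six rows of the metadata shown earlier
meta <- data.frame(
  pass_rnaQC = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  pass_accQC = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
  row.names = c("AAACAGCCAAGGAATC", "AAACAGCCAATCCCTT", "AAACAGCCAATGCGCT",
                "AAACAGCCACACTAAT", "AAACAGCCACCAACCG", "AAACAGCCAGGATAAC")
)

# Cross-tabulate the two flags to see how many cells fail each modality
qc_table <- table(RNA = meta$pass_rnaQC, ATAC = meta$pass_accQC)
print(qc_table)

# Cells kept by the filter: both flags TRUE (%in% treats NA as a failure)
keep <- meta$pass_rnaQC %in% TRUE & meta$pass_accQC %in% TRUE
rownames(meta)[keep]
```

On the real object the same two lines would be applied to `seurat@meta.data` in place of the toy `meta`.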

The RNA assay consists of 29,732 genes and 10,032 cells.

In [19]:
seurat@assays[["RNA"]]

Assay data with 29732 features for 10032 cells
First 10 features:
 AL627309.1, AL627309.2, AL627309.5, AL627309.4, AP006222.2, AL669831.2,
LINC01409, FAM87B, LINC01128, LINC00115 

The ATAC assay consists of 108,377 peaks and 10,032 cells.

In [20]:
seurat@assays[["ATAC"]]

Assay data with 108377 features for 10032 cells
First 10 features:
 chr1:10109-10357, chr1:180730-181630, chr1:191491-191736,
chr1:267816-268196, chr1:586028-586373, chr1:629721-630172,
chr1:633793-634264, chr1:777634-779926, chr1:816881-817647,
chr1:819912-823500 
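The ATAC feature names encode genomic coordinates as `chr:start-end`. If the coordinates are needed as separate columns, they can be parsed with base R as sketched below, using three of the peak names printed above. The `width` column is simply end minus start; whether to add 1 depends on the coordinate convention, which is an assumption here. Signac also provides `StringToGRanges()` for this kind of conversion.

```r
peaks <- c("chr1:10109-10357", "chr1:180730-181630", "chr1:191491-191736")

# Split "chr:start-end" on ":" and "-" into three components per peak
parts <- do.call(rbind, strsplit(peaks, "[:-]"))

peak_df <- data.frame(
  chr   = parts[, 1],
  start = as.integer(parts[, 2]),
  end   = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)

# Peak width as end minus start (no +1 correction applied)
peak_df$width <- peak_df$end - peak_df$start
peak_df
```

This simple split assumes that chromosome names contain neither `:` nor `-`.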

### 3.2 Load additional information
Collect a list of position frequency matrices (PFMs) from the JASPAR database. We'll use these for the motif analysis downstream, where they serve as the basis for position-specific weight matrices (PWMs).

In [21]:
pfm <- getMatrixSet(JASPAR2020,
  opts = list(species = "Homo sapiens")
)
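For readers unfamiliar with the distinction, a frequency matrix stores raw nucleotide counts per motif position, while a weight matrix stores log-odds scores against a background. The base-R sketch below illustrates that arithmetic on a made-up 3-position motif; the pseudocount and uniform background are assumptions chosen for illustration. TFBSTools offers `toPWM()` for real matrices, and its defaults may differ from this sketch.

```r
# Toy PFM: rows are nucleotides, columns are motif positions (filled column-wise)
pfm_toy <- matrix(
  c(8, 0, 1, 1,    # position 1: mostly A
    0, 9, 0, 1,    # position 2: mostly C
    2, 2, 3, 3),   # position 3: no clear preference
  nrow = 4,
  dimnames = list(c("A", "C", "G", "T"), paste0("pos", 1:3))
)

pseudocount <- 0.8           # assumed value, for illustration only
background  <- rep(0.25, 4)  # assumed uniform background frequencies

# Counts to probabilities, distributing the pseudocount according to background
probs <- sweep(pfm_toy + pseudocount * background, 2,
               colSums(pfm_toy) + pseudocount, "/")

# Log-odds relative to background: positive favours the base, negative disfavours it
pwm_toy <- log2(probs / background)
round(pwm_toy, 2)
```

Scanning a sequence with a weight matrix then amounts to summing the per-position scores of the observed bases.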