Min JL, Hemani G, Davey Smith G, Relton C, Suderman M. Meffil: efficient normalization and analysis of very large DNA methylation datasets. Bioinformatics. 2018 Jun 21.
Efficient algorithms for analyzing DNA methylation data generated using Infinium HumanMethylation450 or MethylationEPIC BeadChips:
- Functional normalization for large datasets using parallelization.
- Normalization of datasets with mixed Infinium HumanMethylation450 and MethylationEPIC BeadChips.
- Inclusion of user-defined fixed and random effects in functional normalization procedure.
- Cell count estimation using predefined and user-defined reference datasets.
- Use of predefined and user-defined microarray probe annotations.
- Epigenome-wide association studies (using data from any normalization pipeline).
- Copy number estimation.
- Report generation summarizing all steps.
- The online wiki can be found here
- Normalization to and epigenome-wide-studies and report generation for methylation matrices stored in Genomic Data Structure files. The methylation matrix never needs to be loaded into memory.
Examples using many of these features can be found in the tests/ directory.
Only a few steps are needed to install
meffil in R. First, start R and then type the following commands:
source("http://bioconductor.org/biocLite.R") install.packages("devtools") # if the devtools package is not installed library(devtools) install_github("perishky/meffil")
library(meffil) options(mc.cores=6) # Generate samplesheet samplesheet <- meffil.create.samplesheet(path_to_idat_files) # Or read in samplesheet samplesheet <- meffil.read.samplesheet(path_to_idat_files) beta <- meffil.normalize.dataset(samplesheet, qc.file="qc/report.html", author="Analyst", study="Illumina450", number.pcs=10)
# Load meffil and set how many cores to use for parallelization library(meffil) options(mc.cores=6) # Generate samplesheet samplesheet <- meffil.create.samplesheet(path_to_idat_files) # Or read in samplesheet samplesheet <- meffil.read.samplesheet(path_to_idat_files) # Background and dye bias correction, sexprediction, cell counts estimates qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069", verbose=TRUE) # Obtain genotypes for comparison with those measured on the microarray genotypes <- meffil.extract.genotypes(plink.files) # Generate QC report qc.summary <- meffil.qc.summary(qc.objects, genotypes=genotypes) meffil.qc.report(qc.summary, output.file="qc/report.html") # Remove outlier samples if necessary qc.objects <- meffil.remove.samples(qc.objects, qc.summary$bad.samples$sample.name) # Plot residuals remaining after fitting control matrix to decide on the number PCs # to include in the normalization below. print(meffil.plot.pc.fit(qc.objects)$plot) # Perform quantile normalization norm.objects <- meffil.normalize.quantiles(qc.objects, number.pcs=10) # Generate normalized probe values norm.beta <- meffil.normalize.samples(norm.objects, cpglist.remove=qc.summary$bad.cpgs$name) # Generate normalization report pcs <- meffil.methylation.pcs(norm.beta) norm.summary <- meffil.normalization.summary(norm.objects, pcs=pcs) meffil.normalization.report(norm.summary, output.file="normalization/report.html")
More info about normalization
We generate a samplesheet automatically from the
samplesheet <- meffil.create.samplesheet(path)
The function creates the following necessary columns:
- Sex (possible values "M"/"F"/NA)
And it also tries to parse the basenames to guess if the Sentrix plate
and positions are present. At this point it is worthwhile to manually
samplesheet data.frame to replace the actual sample IDs
Sample_Name column if necessary, and to add the sex values to
Sex column. Don't change these column names though.
Perform the background correction, dye bias correction, sex prediction and cell count estimates:
qc.objects <- meffil.qc(samplesheet, cell.type.reference="blood gse35069", verbose=TRUE)
A list of available cell type references can be obtained as follows:
New references can be created from a dataset using meffil.create.cell.type.reference().
Obtain the matrix of genotypes for comparison with those measured on the microarray. If such a matrix is available (rows = SNPs, columns = samples), then the following steps can be omitted. Otherwise, it is possible to obtain the matrix from a PLINK dataset as follows:
## save the SNP names in R annotation <- qc.objects[]$annotation writeLines(meffil.snp.names(annotation), con="snp-names.txt") ## run plink at the command line plink --bfile dataset --extract snp-names.txt --recodeA --out genotypes.raw --noweb ## load the genotype data in R filenames <- "genotypes.raw" genotypes <- meffil.extract.genotypes(filenames)
We can now summarise the QC analysis of the raw data
qc.summary <- meffil.qc.summary(qc.objects, genotypes=genotypes)
and generate a report:
This creates the file "qc/report.html" in the current work directory. Should open up in your web browser.
You can remove bad samples prior to performing quantile normalization:
qc.objects <- meffil.remove.samples(qc.objects, qc.summary$bad.samples$sample.name)
Next we determine the number of principal components of the control matrix to include in the quantile normalization. The following function plots the quantile residuals remaining after fitting different numbers of control matrix principal components.
And now remove control probe variance from the sample quantiles:
norm.objects <- meffil.normalize.quantiles(qc.objects, number.pcs=10)
Additional fixed and random effects can be included in the normalization by providing their corresponding column names in the samplesheet. For example, slide effects can be included as follows:
norm.objects <- meffil.normalize.quantiles(qc.objects, random.effects="Slide", number.pcs=10)
Note however that including random effects will greatly increase running time.
beta values can be generated, whilst removing CpGs
that were found to be dodgy in the QC analysis:
norm.beta <- meffil.normalize.samples(norm.objects, cpglist.remove=qc.summary$bad.cpgs$name)
A summary report of the normalization performance can also be generated:
pcs <- meffil.methylation.pcs(norm.beta) norm.summary <- meffil.normalization.summary(norm.objects, pcs=pcs) meffil.normalization.report(norm.summary, output.file="normalization/report.html")
Epigenome-wide association study (EWAS)
Prepare a variable of interest and a set of covariates.
variable <- ... ## variable of interest, one value per sample covariates <- ... ## data.frame of covariates to include (rows = samples, columns = covariates)
Add cell count estimates to the set of covariates.
counts <- t(meffil.cell.count.estimates(norm.objects)) covariates <- cbind(covariates, counts)
Run the EWAS.
ewas.ret <- meffil.ewas(norm.beta, variable=variable, covariates=covariates)
Generate a report for the EWAS, including a table describing all variables, a table describing the relationships between the variable of interest and all covariates, QQ plots, Manhattan plots, a spreadsheet listing all significantly associated CpG sites, plots of the most strongly associated sites, and plots of selected candidate CpG sites.
ewas.parameters <- meffil.ewas.parameters(sig.threshold=1e-20, max.plots=5) candidate.sites <- c("cg04946709","cg06710937","cg12177922","cg15817705","cg20299935","cg21784396") ewas.summary <- meffil.ewas.summary(ewas.ret, norm.beta, selected.cpg.sites=candidate.sites, parameters=ewas.parameter) meffil.ewas.report(ewas.summary, output.file="ewas/report.html")