methylcheck is a Python-based package for filtering and visualizing Illumina methylation array data. The focus is on quality control.
This package contains high-level APIs for filtering processed data from local files. 'High-level' means that the details are abstracted away, and the functions are designed to work with a minimum of knowledge and specification. You can always override the "smart" defaults with custom settings if they don't work for your data.
Before starting, you must first download processed data from the NIH GEO database or process a set of idat files with methylprep. Refer to methylprep for instructions on this step.
This package is available on PyPI. Install it with
pip install methylcheck
or, if your OS defaults to Python 2.x, with
pip3 install methylcheck
This package only works with Python 3.6+.
Importing your data
methylcheck is designed to accept the output from the methylprep package. If you have a bunch of idat files, methylprep will return a single pickled pandas dataframe containing all the beta values for probes.
Load your data in a Jupyter Notebook like this:
import pandas
mydata = pandas.read_pickle('beta_values.pkl')
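As a minimal sketch of what to check right after loading, the snippet below builds a small synthetic stand-in for beta_values.pkl, writes it to a temporary folder, and reads it back the same way. The probe and sample names are made up, and the layout (probes in rows, samples in columns) is an assumption; transpose your own dataframe if it is oriented the other way.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for methylprep output: 5 made-up probes x 3 made-up samples.
rng = np.random.default_rng(0)
fake_betas = pd.DataFrame(
    rng.uniform(0.0, 1.0, size=(5, 3)),
    index=[f"cg{i:08d}" for i in range(5)],
    columns=["sample_1", "sample_2", "sample_3"],
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "beta_values.pkl")
    fake_betas.to_pickle(path)
    mydata = pd.read_pickle(path)

n_probes, n_samples = mydata.shape

# Beta values are proportions, so every non-missing value should lie in [0, 1].
in_range = bool((((mydata >= 0.0) & (mydata <= 1.0)) | mydata.isna()).all().all())

# Count of missing probe values per sample (samples are columns here).
missing_per_sample = mydata.isna().sum()

print(n_probes, n_samples, in_range)
```

With real data, replace the temporary file with the path to your own beta_values.pkl; the shape, range, and missing-value checks stay the same.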
If you processed a large batch of samples using the batch_size option in methylprep process, there's a convenience function in methylize (methylize.load) that will load and combine a bunch of output files in the same folder:
import methylize
df = methylize.load('<path to folder with methylprep output>')
# or
df, meta = methylize.load_both('<path to folder with methylprep output>')
The load_both variant also loads a dataframe of all the metadata associated with the samples, if you are using public GEO data. Some analysis functions require you to specify which samples belong to a treatment group (vs. control), and the meta dataframe can be used for this.
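To illustrate how a meta dataframe can drive the treatment/control split, here is a small sketch using plain pandas. The data are synthetic, and the column names Sample_ID and group are assumptions made for this example; check the columns of your own meta dataframe before adapting it.

```python
import pandas as pd

# Hypothetical beta values: probes in rows, samples in columns.
df = pd.DataFrame(
    {
        "s1": [0.10, 0.80],
        "s2": [0.20, 0.70],
        "s3": [0.15, 0.90],
        "s4": [0.30, 0.60],
    },
    index=["cg_a", "cg_b"],
)

# Hypothetical sample metadata, one row per sample.
# The column names "Sample_ID" and "group" are assumptions for this sketch.
meta = pd.DataFrame(
    {
        "Sample_ID": ["s1", "s2", "s3", "s4"],
        "group": ["treatment", "control", "treatment", "control"],
    }
)

# Look up which sample IDs belong to each group.
treated_ids = meta.loc[meta["group"] == "treatment", "Sample_ID"].tolist()
control_ids = meta.loc[meta["group"] == "control", "Sample_ID"].tolist()

# Select the matching sample columns from the beta value dataframe.
treated = df[treated_ids]
control = df[control_ids]

# Per-probe difference in mean beta value between the two groups.
mean_diff = treated.mean(axis=1) - control.mean(axis=1)
print(mean_diff)
```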
Alternatively, you can import public GEO datasets directly, if they are processed data containing either probe beta values for samples or methylated/unmethylated signal intensities. If you have idat files, process them first with methylprep, or use the methylprep download -i <GEO_ID> option to download and process public data.
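When a public dataset provides methylated/unmethylated signal intensities rather than beta values, the two are related by a simple ratio. The sketch below uses the common convention beta = M / (M + U + offset) with an offset of 100; the function name and the offset default are assumptions for illustration, not necessarily what methylcheck or methylprep use internally.

```python
import pandas as pd


def beta_from_intensities(meth, unmeth, offset=100):
    """Return beta = M / (M + U + offset) for matching intensity dataframes.

    The offset keeps the ratio stable when both intensities are near zero.
    """
    return meth / (meth + unmeth + offset)


# Hypothetical intensities for two probes in one sample.
meth = pd.DataFrame({"s1": [900.0, 100.0]}, index=["cg_a", "cg_b"])
unmeth = pd.DataFrame({"s1": [0.0, 800.0]}, index=["cg_a", "cg_b"])

betas = beta_from_intensities(meth, unmeth)
print(betas)
```

In this toy input, the first probe is almost fully methylated and the second is mostly unmethylated, so their beta values land near the two ends of the 0 to 1 range.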
In general, the best way to import data is to use methylprep and run
from methylprep import run_pipeline
run_pipeline(data_folder, betas=True)
# or from the command line: python -m methylprep process -d <filepath to idats> --all
Then collect the beta_values.pkl file that it saves to disk, and load that in a Jupyter notebook. From there, each data transformation is a single line of code using pandas DataFrames.
methylcheck will keep track of the data format/structures for you, and you can visualize the effect of each filter as you go. You can also export images of your charts for publication.
Refer to the Jupyter notebooks on readthedocs for examples of filtering probes from a batch of samples, removing outlier samples, and generating plots of data.
Quality Control (QC)
The simplest way to generate a battery of plots about your data is to run this function in a Jupyter notebook:
import methylcheck
methylcheck.run_qc('<path to your methylprep processed files>')
methylcheck provides functions to
- predict the sex of samples
- detect probes that differ between two sets of samples within a batch
- remove sex-chromosome-linked probes and control probes
- remove "sketchy" probes, deemed unreliable by researchers
- filter sample outliers based on multi-dimensional scaling
- combine datasets for analysis
- plot sample beta or m-value distributions, or raw uncorrected probe channel intensities
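To give a feel for what a few of these operations do to the data, here is a rough sketch in plain pandas and NumPy. It does not call methylcheck's own functions; the probe names, the exclusion list, and the second dataset are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical beta values: probes in rows, samples in columns.
betas = pd.DataFrame(
    {"s1": [0.5, 0.8, 0.2], "s2": [0.5, 0.9, 0.1]},
    index=["cg_a", "cg_b", "cg_c"],
)

# Remove probes named in an exclusion list (for example, sex-chromosome probes).
# errors="ignore" skips names that are not present in this dataset.
probes_to_drop = ["cg_c", "cg_not_present"]
filtered = betas.drop(index=probes_to_drop, errors="ignore")

# Convert beta values to M-values with the base-2 logit transform.
m_values = np.log2(filtered / (1.0 - filtered))

# Combine with a second hypothetical dataset, keeping only the shared probes.
other = pd.DataFrame({"s3": [0.4, 0.6]}, index=["cg_a", "cg_z"])
combined = pd.concat([filtered, other], axis=1, join="inner")

print(filtered.shape, combined.shape)
```

Comparing the shape before and after each step is a quick way to see how many probes a filter removed.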
Parts of this package were ported from an R package and then extended and developed by the team at Foxo Bioscience, who maintain it. You can write to info@LifeEgx.com to give feedback, ask for help, or suggest improvements. To report bugs, open an issue on our GitHub repo page.