#CLIP-PyL

This is the ALPHA version of the CLIP-PyL python package. The package contains scripts that parse "crosslink signatures" from CLIP-seq data. The acronym "CLIP" stands for CrossLinking ImmunoPrecipitation, a set of methods for preparing immunoprecipitate that is enriched for crosslinked RiboNucleoProtein (RNP) complexes of interest. The abbreviation "seq" refers to deep sequencing, which is employed to identify and quantify the bound RNA fragments. I wrote the CLIP-PyL software to analyze CLIP-seq datasets.

#General Usage

The name of the software refers to its function, which is to produce maps of CLIP-seq read "pile-ups" (hence the name, CLIP-PyL). In bioinformatics parlance, the term "pile-up" refers to "read clusters" or "read peaks": sets of reads that align to a region of the genome (or transcriptome) in an overlapping manner.

The term "pile-up" also refers to the pileup file format that is produced by the SAMTools package. The pileup file format describes aligned read coverage across each base of the reference sequence (e.g. a reference genome or transcriptome build). It also encodes the number of mismatches, indels and aligned read termini that were tallied at each base of the reference. Similarly, the CLIP-PyL software generates basewise coverage metrics, specifically those that are relevant to identifying crosslinked and protected RNA fragments (e.g. crosslink-induced mutations) in CLIP-seq experiments.

The software can also present the data in graphical form. The user supplies a set of CLIP-seq alignment files (in bam format) and a set of genomic intervals to query (in bed format), and the clippyl-graphics script outputs a pdf file containing coverage plots. Additionally, CLIP-PyL contains a script called clippyl-bedgraph that outputs the basewise metrics as text-based bedgraph files, which can be viewed with several popular genome browsers.
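
To make the idea of basewise coverage concrete, here is a minimal Python sketch that counts overlapping reads at each base of a query interval and then writes the result as bedgraph lines (chrom, start, end, value, with 0-based half-open coordinates). It is an illustration only and is not CLIP-PyL's own code: the function names `basewise_coverage` and `bedgraph_lines`, the chromosome name and the toy read coordinates are all hypothetical, and only plain coverage is tallied (no mismatches, indels or read termini).

```python
from itertools import groupby


def basewise_coverage(reads, query_start, query_end):
    """Count the reads overlapping each base of [query_start, query_end).

    `reads` is a list of (start, end) alignments, 0-based and half-open.
    """
    coverage = [0] * (query_end - query_start)
    for read_start, read_end in reads:
        # Only the part of the read that falls inside the query counts.
        first = max(read_start, query_start)
        last = min(read_end, query_end)
        for position in range(first, last):
            coverage[position - query_start] += 1
    return coverage


def bedgraph_lines(chrom, query_start, coverage):
    """Collapse runs of equal coverage into bedgraph lines, skipping zeros."""
    lines = []
    position = query_start
    for value, run in groupby(coverage):
        length = len(list(run))
        if value:
            lines.append(f"{chrom}\t{position}\t{position + length}\t{value}")
        position += length
    return lines


# Three hypothetical overlapping reads forming a small pile-up.
reads = [(100, 105), (102, 108), (104, 106)]
coverage = basewise_coverage(reads, 100, 110)
print(coverage)  # [1, 1, 2, 2, 3, 2, 1, 1, 0, 0]
for line in bedgraph_lines("chr1", 100, coverage):
    print(line)
```

The peak of the pile-up is the single base (position 104) that all three reads cover.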

Currently, HITS-CLIP is the only CLIP-seq method variant that is supported. However, support for PAR-CLIP and iCLIP data will be released if considerable interest arises.

#Data Pre-Processing Requirements

The user MUST pre-process the "raw" HITS-CLIP reads by clipping the adapter sequences. I typically use the FASTQ/A Clipper tool from the FASTX toolkit for this task. However, other suitable options exist. The user must also align the preprocessed reads to a reference sequence assembly (e.g. hg19). Currently, single-end read alignments generated by the bwa fast read alignment software are supported. Finally, the user must use the SAMTools software to generate indexed bam files.
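
The steps above (clip, align, sort, index) might be scripted roughly as follows. This is a sketch under several assumptions, not the exact pipeline used to produce the sample data: the helper name `preprocessing_commands`, the file naming scheme and the `ADAPTER_SEQUENCE` placeholder are hypothetical, `-Q33` assumes Sanger-encoded quality scores, `-c` (discard reads in which no adapter was found) is optional, the reference is assumed to be indexed with `bwa index` already, and `samtools sort -o` assumes a reasonably recent SAMTools release. The function only builds the command strings so that you can inspect them before running anything.

```python
import shlex


def preprocessing_commands(sample, adapter, reference_fasta):
    """Build shell commands for one single-end sample (nothing is executed)."""
    q = shlex.quote
    raw = q(f"{sample}.fastq")
    clipped = q(f"{sample}.clipped.fastq")
    sai = q(f"{sample}.sai")
    bam = q(f"{sample}.bam")
    ref = q(reference_fasta)
    return [
        # 1. Clip the adapter with the FASTX toolkit.
        f"fastx_clipper -Q33 -a {q(adapter)} -c -i {raw} -o {clipped}",
        # 2. Single-end alignment with bwa (aln, then samse).
        f"bwa aln {ref} {clipped} > {sai}",
        # 3. Convert the alignments to a coordinate-sorted bam file.
        f"bwa samse {ref} {sai} {clipped} | samtools sort -o {bam} -",
        # 4. Index the bam file (writes the .bai next to it).
        f"samtools index {bam}",
    ]


commands = preprocessing_commands("SLBP_CLIP", "ADAPTER_SEQUENCE", "hg19.fa")
for command in commands:
    print(command)
```

Each printed line can be pasted into a shell once the placeholder adapter and file names are replaced with your own.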

#Dependencies

CLIP-PyL depends on the pysam and matplotlib packages. If you have multiple python versions installed on your machine, please be certain to install both packages into your python3 environment.

#Installation

Download the most recent version of the package at https://github.com/lb3/CLIP-PyL. You can use git to clone the master branch or use the "Download ZIP" button on the right side of the CLIP-PyL github page. To install CLIP-PyL, simply add the package to your python3 namespace. You can accomplish this by adding the path to the clippyl directory to your PYTHONPATH environment variable. After clippyl is added to your python3 namespace, you will be able to use the executable scripts that ship with clippyl.
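
One quick way to confirm that the PYTHONPATH change took effect is to ask python3 whether it can locate the package. The snippet below is a small sketch that uses only the standard library; the helper name `is_importable` is hypothetical, and it assumes the package is imported under the name `clippyl`.

```python
import importlib.util
import sys


def is_importable(name):
    """Return True if python3 can locate a top-level package called `name`."""
    return importlib.util.find_spec(name) is not None


if is_importable("clippyl"):
    print("clippyl is on the python3 search path")
else:
    # Listing sys.path shows whether the PYTHONPATH entry was picked up.
    print("clippyl was not found; python3 searched these locations:")
    for entry in sys.path:
        print("  ", entry)
```

If the package is not found, check that the directory you added to PYTHONPATH appears in the printed list.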

#Testing the installation with sample data

CLIP-PyL ships with sample data, which consists of a Stem Loop Binding Protein (SLBP) HITS-CLIP dataset. Note that only a subset of the total dataset is provided, to save storage space. CLIP-PyL also ships with unittest routines that utilize the sample data. You can test your installation by issuing the following commands:

#!/bin/bash
# 1. change your current working directory to the directory where you 
#    downloaded or cloned the CLIP-PyL package
cd foo/bar/source-download-directory/
# 2. run the set of basic tests
python3 -m unittest clippyl

If the test runs to completion then you should find a file named CLIP-PyL_graphics_test.pdf in your present working directory. This file contains coverage graphics for each of the replication-dependent histone genes, calculated from the SLBP HITS-CLIP data.

#Using the clippyl-graphics script to draw CLIP-seq peaks across user-specified genomic intervals

The clippyl-graphics script can be used to produce pdf files containing coverage map graphics for sets of genomic intervals, which are specified by the user in a bed file. CLIP-PyL utilizes the third-party graphical back-end known as matplotlib to generate these coverage graphics. Notably, cis-element regions can be included in the coverage graphics if the user supplies additional bed files specifying the locations of these elements.

The clippyl-graphics script has two required input parameters: -i and -q.

Also, utilizing the --n_mapped_reads parameter is highly recommended because normalizing for sequencing depth is imperative. Alternatively, you may use the auto-norm parameter and clippyl will calculate the number of mapped reads from your input bam file(s) for use as the normalization factor(s).
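
To illustrate why the number of mapped reads matters, the sketch below scales raw basewise counts to "reads per million mapped reads", which is one common convention for depth normalization. Whether clippyl applies exactly this scaling is an assumption on my part here; the function name `reads_per_million` and the raw counts are hypothetical, while the read total is the first value from the sample command in this README.

```python
def reads_per_million(coverage, n_mapped_reads):
    """Scale raw basewise counts by sequencing depth (per million mapped reads)."""
    if n_mapped_reads <= 0:
        raise ValueError("n_mapped_reads must be positive")
    scale = 1_000_000 / n_mapped_reads
    return [count * scale for count in coverage]


# Hypothetical raw counts from a library with 17931502 mapped reads.
raw_coverage = [0, 12, 40, 12]
normalized = reads_per_million(raw_coverage, 17931502)
print([round(value, 3) for value in normalized])
```

After this kind of scaling, the same raw count yields a smaller value in a deeper library, so peak heights can be compared across samples with different read totals.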

The -i parameter takes a space-delimited list of indexed bam files as its argument. It is assumed that the corresponding bam index files (.bai) will be present in the same directory as your bam files. Also, please recall that the CLIP-seq reads must be adapter-trimmed upstream (see data pre-processing requirements above).
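
Because a missing index is an easy mistake to make, it can help to check for the .bai files before launching a long run. The following is a minimal pre-flight sketch rather than part of CLIP-PyL: the helper name `bams_missing_index` is hypothetical, and it assumes that either `sample.bam.bai` or `sample.bai` counts as a valid index name. The demonstration creates empty placeholder files in a temporary directory.

```python
import os
import tempfile


def bams_missing_index(bam_paths):
    """Return the bam paths that have no .bai file in the same directory."""
    missing = []
    for bam in bam_paths:
        # Accept both common index naming styles.
        candidates = (bam + ".bai", os.path.splitext(bam)[0] + ".bai")
        if not any(os.path.isfile(path) for path in candidates):
            missing.append(bam)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    indexed = os.path.join(tmp, "indexed.bam")
    unindexed = os.path.join(tmp, "unindexed.bam")
    # Empty placeholder files stand in for real bam and bai files.
    for path in (indexed, indexed + ".bai", unindexed):
        open(path, "w").close()
    missing = bams_missing_index([indexed, unindexed])
    print([os.path.basename(path) for path in missing])  # ['unindexed.bam']
```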

The -q parameter takes a file path to a bed file that lists the set of genomic intervals to be queried and graphed. In other words, this bed file specifies the genomic intervals over which the graphics will be calculated. Each genomic interval listed will produce one page in the output pdf file. Multiple coverage plots will be generated on each page.
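
For readers who are new to the bed format, here is a small sketch of how a query file maps onto pages. It reads the standard bed columns (chrom, start, end, and optionally name, score, strand) and enumerates one page per interval. The parser `parse_bed` and the example intervals are hypothetical and are not taken from the sample data or from CLIP-PyL's own code.

```python
def parse_bed(lines):
    """Parse bed lines into (chrom, start, end, name, strand) tuples."""
    intervals = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments and browser/track header lines.
        if not fields or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        strand = fields[5] if len(fields) > 5 else "."
        intervals.append((chrom, start, end, name, strand))
    return intervals


example_bed = [
    "# hypothetical intervals for illustration only",
    "chr1\t1000\t1500\tgeneA\t0\t+",
    "chr2\t200\t450",
]
intervals = parse_bed(example_bed)
for page, interval in enumerate(intervals, start=1):
    print(f"page {page}: {interval}")
```

Remember that bed coordinates are 0-based and half-open, so the first interval above spans 500 bases.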

Here is an example that utilizes the sample data files:

#!/bin/bash
BAM_FILE_DIR="clippyl/sample_data/HITS-CLIP_SLBP_histone_mRNA_01/bwa_samse_hg19"
BED_FILE="clippyl/sample_data/genomic_interval_queries/RD_Histone_Genes.bed"

clippyl-graphics -i $BAM_FILE_DIR/SLBP_CLIP_high_MW_high_MNase.HISTONLY.discardUnclipped.bam \
                    $BAM_FILE_DIR/SLBP_CLIP_high_MW_low_MNase.HISTONLY.discardUnclipped.bam \
                    $BAM_FILE_DIR/SLBP_CLIP_low_MW_high_MNase.HISTONLY.discardUnclipped.bam \
                    $BAM_FILE_DIR/SLBP_CLIP_low_MW_low_MNase.HISTONLY.discardUnclipped.bam \
                    $BAM_FILE_DIR/mock_CLIP_high_MW_high_MNase.HISTONLY.discardUnclipped.bam \
                    $BAM_FILE_DIR/mock_CLIP_low_MW_high_MNase.HISTONLY.discardUnclipped.bam \
                  --n_mapped_reads 17931502 14698879 20020463 15256431 18992515 21083226 \
                  -q $BED_FILE \
                  --output "CLIP-PyL_graphics_sample.pdf"
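
In the example above there are six bam files and six --n_mapped_reads values, which suggests that the values are matched to the bam files by position. Assuming that is the case, a small wrapper can guard against a mismatch before the script is launched. The helper `graphics_command` below is hypothetical; it only assembles the argument list from the flags shown in this README (-i, --n_mapped_reads, -q, --output) and does not run anything.

```python
def graphics_command(bam_files, n_mapped_reads, bed_file, output_pdf):
    """Assemble a clippyl-graphics argument list, one read total per bam file."""
    if len(bam_files) != len(n_mapped_reads):
        raise ValueError(
            f"{len(bam_files)} bam files but {len(n_mapped_reads)} read totals"
        )
    return [
        "clippyl-graphics",
        "-i", *bam_files,
        "--n_mapped_reads", *(str(total) for total in n_mapped_reads),
        "-q", bed_file,
        "--output", output_pdf,
    ]


bam_dir = "clippyl/sample_data/HITS-CLIP_SLBP_histone_mRNA_01/bwa_samse_hg19"
command = graphics_command(
    [
        f"{bam_dir}/SLBP_CLIP_high_MW_high_MNase.HISTONLY.discardUnclipped.bam",
        f"{bam_dir}/SLBP_CLIP_high_MW_low_MNase.HISTONLY.discardUnclipped.bam",
    ],
    [17931502, 14698879],
    "clippyl/sample_data/genomic_interval_queries/RD_Histone_Genes.bed",
    "CLIP-PyL_graphics_sample.pdf",
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` once clippyl is installed.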
