# Sequence filtering

This notebook filters intron and exon coordinates by excluding regions with low coverage, UCSC blacklisted regions, low-complexity regions and regions with alignment issues.

For more information see the **Genomic coordinates of exons and flanking introns** section in the Materials and Methods.

---

## Output

Two files are generated, *germinal_filtered_introns_coords.txt* and *germinal_filtered_exons_coords.txt*, containing the filtered valid sequences. These files are used for the regression of H3K36me3 and nucleosomes by bins of genes (Figure 4 and Supplementary Figure S6).

The output files are not saved in the **results** folder but in **data/coordinates**.

The output files are tab-separated files with 4 columns: chromosome, start, stop and Ensembl identifier.
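
As a minimal sketch of this layout, the snippet below parses a small tab-separated sample with Python's `csv` module. The rows and the `ENSG_EXAMPLE_*` identifiers are made-up placeholders, not real results. The header names (`chr`, `start`, `position`, `ensembl`) follow the column names that the last code cell of this notebook assigns before writing the files, so the stop coordinate appears under `position`.

```python
import csv
import io

# Hypothetical sample in the layout of the output files (values are illustrative only).
sample = (
    "chr\tstart\tposition\tensembl\n"
    "chr1\t1000\t1200\tENSG_EXAMPLE_1\n"
    "chr1\t1500\t1650\tENSG_EXAMPLE_2\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = [
    {
        "chr": r["chr"],
        "start": int(r["start"]),
        "stop": int(r["position"]),  # the stop column is labelled 'position' in the header
        "ensembl": r["ensembl"],
    }
    for r in reader
]

# Total number of base pairs covered by the sample regions.
total_bp = sum(r["stop"] - r["start"] for r in rows)
print(rows[0], total_bp)
```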

In [1]:
from os import path

import pandas as pd
import pybedtools

## Input

The input files are in the **data** directory; the ``README`` file in that folder describes them in more detail.

- *introns_coords*: file with the coordinates of the introns

- *middle_exons_coords*: file with the coordinates of the middle exons

- *ucsc_blacklisted_file*: file with blacklisted regions due to low complexity or low mappability

- *low_complexity_file*: file with low complexity regions
    
- *alignability_file*: file with the regions *without* alignability problems
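
The blacklist and low-complexity files list regions to exclude, whereas the alignability file lists regions to keep. This is why the notebook subtracts the first two and intersects with the third. The sketch below illustrates the two operations on toy half-open intervals from a single chromosome. The helper names `subtract` and `intersect` are hypothetical pure-Python stand-ins written for this illustration; they are not the pybedtools implementation.

```python
def subtract(region, masks):
    """Return the parts of a half-open (start, end) region not covered by any mask."""
    start, end = region
    kept = []
    cursor = start
    for m_start, m_end in sorted(masks):
        if m_end <= cursor or m_start >= end:
            continue  # mask does not touch the remaining part of the region
        if m_start > cursor:
            kept.append((cursor, m_start))
        cursor = max(cursor, m_end)
        if cursor >= end:
            break
    if cursor < end:
        kept.append((cursor, end))
    return kept


def intersect(region, allowed):
    """Return the parts of a half-open (start, end) region that fall inside allowed intervals."""
    start, end = region
    pieces = []
    for a_start, a_end in sorted(allowed):
        lo, hi = max(start, a_start), min(end, a_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


exon = (100, 200)                      # toy exon
blacklist = [(120, 130), (190, 250)]   # toy regions to exclude
alignable = [(90, 150), (180, 300)]    # toy regions to keep

after_subtract = subtract(exon, blacklist)
after_intersect = intersect(exon, alignable)
print(after_subtract, after_intersect)
```

In both cases one input region can be split into several shorter fragments, so the filtered files may contain more rows than the input files.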


In [2]:
# Data
introns_coords = 'data/coordinates/genes_intron_coords.bed.gz'
middle_exons_coords = 'data/coordinates/genes_middle_exon_coords.bed.gz'

ucsc_blacklisted_file = 'data/mappability/ucsc_blacklist.bed.gz'
low_complexity_file = 'data/mappability/hg19_low_complexity_regions.gz'
alignability_file = 'data/mappability/wgEncodeCrgMapabilityAlign36mer_score1.bed.gz'

In [3]:
# Read input files
exons_coords_df = pd.read_csv(middle_exons_coords, sep="\t", header=None, low_memory=False)
    
introns_coords_df = pd.read_csv(introns_coords, sep="\t", header=None, low_memory=False)

exons_coords_df.columns = ['chr', 'start', 'end', 'ensembl', 'symbol', 'strand']
introns_coords_df.columns = ['chr', 'start', 'end', 'ensembl', 'symbol', 'strand']

exons_coords_symbol_df = exons_coords_df[['chr', 'start', 'end', 'ensembl']]
introns_coords_symbol_df = introns_coords_df[['chr', 'start', 'end', 'ensembl']]

In [4]:
# Total number of base pairs covered by exons and introns before filtering
exons_bp = (exons_coords_symbol_df['end'] - exons_coords_symbol_df['start']).sum()
introns_bp = (introns_coords_symbol_df['end'] - introns_coords_symbol_df['start']).sum()

In [5]:
print(exons_bp, introns_bp)

13632264 794037718


In [6]:
exons_cov_bed = pybedtools.BedTool.from_dataframe(exons_coords_symbol_df) 
introns_cov_bed = pybedtools.BedTool.from_dataframe(introns_coords_symbol_df)

In [None]:

# Filter blacklisted regions
ucsc_blacklisted_df = pd.read_csv(ucsc_blacklisted_file, sep='\t', header=None)

ucsc_blacklisted_bed = pybedtools.BedTool.from_dataframe(ucsc_blacklisted_df)

exons_cov_black = exons_cov_bed.subtract(ucsc_blacklisted_bed)
introns_cov_black = introns_cov_bed.subtract(ucsc_blacklisted_bed)

exons_cov_black_df = pd.read_table(exons_cov_black.fn, names=['chr', 'start', 'end', 'ensembl'])
introns_cov_black_df = pd.read_table(introns_cov_black.fn, names=['chr', 'start', 'end', 'ensembl'])

exons_bp_cov_black = sum(exons_cov_black_df['end'] - exons_cov_black_df['start'])
introns_bp_cov_black = sum(introns_cov_black_df['end'] - introns_cov_black_df['start']) 

print('+ blacklisted regions:\t', exons_bp_cov_black/exons_bp*100, introns_bp_cov_black/introns_bp*100)

# Filter low complexity regions
low_complexity_df = pd.read_csv(low_complexity_file, sep='\t', header=None)

low_complexity_bed = pybedtools.BedTool.from_dataframe(low_complexity_df)

exons_cov_black_compl = exons_cov_black.subtract(low_complexity_bed)
introns_cov_black_compl = introns_cov_black.subtract(low_complexity_bed)

exons_cov_black_compl_df = pd.read_table(exons_cov_black_compl.fn, names=['chr', 'start', 'end', 'ensembl'])
introns_cov_black_compl_df = pd.read_table(introns_cov_black_compl.fn, names=['chr', 'start', 'end', 'ensembl'])

exons_bp_cov_black_compl = sum(exons_cov_black_compl_df['end'] - exons_cov_black_compl_df['start'])
introns_bp_cov_black_compl = sum(introns_cov_black_compl_df['end'] - introns_cov_black_compl_df['start'])

print('+ low complexity:\t', exons_bp_cov_black_compl/exons_bp*100, introns_bp_cov_black_compl/introns_bp*100)

# Filter regions with alignment problems    
alignability_df = pd.read_csv(alignability_file, sep='\t', header=None)

alignability_bed = pybedtools.BedTool.from_dataframe(alignability_df)

exon_cov_black_compl_align = exons_cov_black_compl.intersect(alignability_bed)
intron_cov_black_compl_align = introns_cov_black_compl.intersect(alignability_bed)

exons_cov_black_compl_align_df = pd.read_table(exon_cov_black_compl_align.fn, names=['chr', 'start', 'end', 'ensembl'])
introns_cov_black_compl_align_df = pd.read_table(intron_cov_black_compl_align.fn, names=['chr', 'start', 'end', 'ensembl'])

exons_bp_cov_black_compl_align = sum(exons_cov_black_compl_align_df['end'] - exons_cov_black_compl_align_df['start'])
introns_bp_cov_black_compl_align = sum(introns_cov_black_compl_align_df['end'] - introns_cov_black_compl_align_df['start'])

print('+ filtered:\t\t', exons_bp_cov_black_compl_align/exons_bp*100, introns_bp_cov_black_compl_align/introns_bp*100)
    
new_exons_coords_symbol_df = pd.read_table(exon_cov_black_compl_align.fn, names=['chr', 'start', 'position', 'ensembl'])
new_introns_coords_symbol_df = pd.read_table(intron_cov_black_compl_align.fn, names=['chr', 'start', 'position', 'ensembl'])

new_exons_coords_symbol_df.to_csv(path.join('data/coordinates/', 'germinal_filtered_exons_coords.txt'),
                                      sep='\t', header=True, index=None)
new_introns_coords_symbol_df.to_csv(path.join('data/coordinates/', 'germinal_filtered_introns_coords.txt'),
                                        sep='\t', header=True, index=None)

+ blacklisted regions:	 99.9998532892 99.9802151716
+ low complexity:	 99.7365221213 99.399076531
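
Each percentage printed above is the share of the original base pairs still present after the filters applied so far, so values close to 100 mean that little sequence was removed. A minimal sketch of that calculation on toy intervals follows; the helper `total_bp` and the coordinates are assumptions made for illustration only.

```python
def total_bp(intervals):
    """Sum the lengths of half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


original = [(100, 200), (300, 400)]              # toy regions before filtering
filtered = [(100, 120), (130, 200), (300, 400)]  # same regions after removing (120, 130)

# Percentage of the original base pairs that survive the filter.
retained_pct = total_bp(filtered) / total_bp(original) * 100
print(retained_pct)
```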
