HTSeq (https://htseq.readthedocs.io) is an alternative library for processing NGS data. Most of the functionality it offers is also available in other libraries covered in this book.

HTSeq supports, among others, the FASTA, FASTQ, SAM (via pysam), VCF, **General Feature Format (GFF)**, and **Browser Extensible Data (BED)** file formats. It also includes a set of abstractions for
processing (mapped) genomic data, encompassing concepts such as genomic positions, intervals,
and alignments.

A complete examination of the features of this library is beyond our scope, so we will
concentrate on a small subset of features. We will take this opportunity to also introduce the BED
file format.

The BED format allows for the specification of features for annotation tracks. It has many uses,
but it's common to load BED files into genome browsers to visualize features. Each line includes
at least the position (chromosome, start, and end) and may also carry optional fields such as
name or strand. Full details about the format can be found at https://genome.ucsc.edu/FAQ/FAQformat.html#format1.
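
To make the layout of a BED line concrete, here is a minimal pure-Python sketch that splits one line into its fields. It is not how HTSeq parses BED files; `BedRecord` and `parse_bed_line` are hypothetical names introduced only for this illustration, and the sketch assumes tab-separated fields and only looks at the first six columns. BED coordinates are 0-based and half-open, so the length of a feature is simply end minus start.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    """Hypothetical container for the first six BED columns."""
    chrom: str
    start: int                    # 0-based, inclusive
    end: int                      # 0-based, exclusive
    name: Optional[str] = None
    score: Optional[str] = None   # kept as text; not interpreted here
    strand: Optional[str] = None


def parse_bed_line(line: str) -> Optional[BedRecord]:
    """Parse one BED line; return None for blank, comment, or header lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith(("track", "browser", "#")):
        return None
    fields = line.split("\t")     # assumption: tab-separated fields
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return BedRecord(chrom, start, end, *fields[3:6])


rec = parse_bed_line("2\t135836529\t135837180\tENSE00002202258\t0\t-")
print(rec.chrom, rec.name, rec.strand, rec.end - rec.start)
```

Returning `None` for `track` and `browser` lines keeps the caller's loop simple: it can skip anything that is not a feature.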

Our simple example will use data from the region where the LCT gene is located in the human genome.
The LCT gene encodes lactase, an enzyme involved in the digestion of lactose.
We will take this information from Ensembl. Go to http://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000115850 and choose Export
data. The Output format should be BED Format. Gene information should be selected (you can
choose more if you want).
Take a look at the file before we start. The first few lines of this file are shown here:

In [36]:
# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
 drive	 sample_data  'view?usp=sharing'


In [41]:
# LCT.bed is expected in the current working directory
!head LCT.bed

track name=gene description="Gene information"
2	135836529	135837180	ENSE00002202258	0	-
2	135833110	135833190	ENSE00001660765	0	-
2	135829592	135829676	ENSE00001731451	0	-
2	135823900	135824003	ENSE00001659892	0	-
2	135822019	135822098	ENSE00001777620	0	-
2	135817340	135818061	ENSE00001602826	0	-
2	135812310	135812956	ENSE00000776576	0	-
2	135808442	135809993	ENSE00001008768	0	-
2	135807127	135807396	ENSE00000776573	0	-


The fourth column is the feature name. Its content varies widely from file to file, so you will have to check it each and every time. In our case, the names indicate Ensembl exons (ENSE...), RefSeq mRNA records (NM_...), and coding region information (CCDS) from the Consensus Coding Sequence (CCDS) database (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi).

**Take a look at the following steps:**

**We will start by setting up a reader for our file.** 

In [43]:
from collections import defaultdict
import re
import HTSeq
lct_bed = HTSeq.BED_Reader('LCT.bed')

We are now going to extract all the types of features via their name:

In [44]:
feature_types = defaultdict(int)
for rec in lct_bed:
    last_rec = rec
    # Count records by the leading uppercase prefix of the name (ENSE, NM, CCDS)
    feature_types[re.search('([A-Z]+)', rec.name).group(0)] += 1
print(feature_types)

defaultdict(<class 'int'>, {'ENSE': 27, 'NM': 17, 'CCDS': 17})


We stored the last record so that we can inspect it:

In [45]:
print(last_rec)
print(last_rec.name)
print(type(last_rec))
interval = last_rec.iv
print(interval)
print(type(interval))

<GenomicFeature: BED line 'CCDS2178.11' at 2: 135788543 -> 135788322 (strand '-')>
CCDS2178.11
<class 'HTSeq.features.GenomicFeature'>
2:[135788323,135788544)/-
<class 'HTSeq._HTSeq.GenomicInterval'>


Let’s dig deeper into the interval:

In [46]:
print(interval.chrom, interval.start, interval.end)
print(interval.strand)
print(interval.length)
print(interval.start_d)
print(interval.start_as_pos)
print(type(interval.start_as_pos))

2 135788323 135788544
-
221
135788543
2:135788323/-
<class 'HTSeq._HTSeq.GenomicPosition'>


Note the genomic position (chromosome, start, and end). The trickiest part is dealing with the strand: if the feature lies on the negative strand, you have to be careful when processing it. HTSeq offers the *start_d* and *end_d* fields to help you with this. They are the directional start and end, so they are reversed with regard to start and end if the strand is negative (in the output above, *start_d* is 135788543, the last position covered by the interval).
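
The relationship between the plain and the directional coordinates can be sketched in a few lines of pure Python. This is only an illustration that reproduces the values printed above for a 0-based, half-open interval; `directional_bounds` is a hypothetical helper, not part of HTSeq, and it assumes that any strand other than `-` is treated like the positive strand.

```python
from typing import Tuple


def directional_bounds(start: int, end: int, strand: str) -> Tuple[int, int]:
    """Return (start_d, end_d) for the half-open interval [start, end).

    On the negative strand the feature is read from its highest covered
    position (end - 1) down to, but not including, start - 1.
    """
    if strand == "-":
        return end - 1, start - 1
    # Assumption: '+' and unstranded intervals keep their plain coordinates
    return start, end


# Interval from the last record inspected above: 2:[135788323,135788544)/-
start_d, end_d = directional_bounds(135788323, 135788544, "-")
print(start_d, "->", end_d)  # 135788543 -> 135788322
```

The printed pair matches the `135788543 -> 135788322` shown in the string representation of the last record.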

Finally, let’s extract some statistics from our coding regions (CCDS records). We will use CCDS, since it is probably the best-curated source here:

In [47]:
exon_start = None
exon_end = None
sizes = []
for rec in lct_bed:
    if not rec.name.startswith('CCDS'):
        continue
    interval = rec.iv
    exon_start = min(interval.start, exon_start or interval.start)
    exon_end = max(interval.end, exon_end or interval.end)
    sizes.append(interval.length)
sizes.sort()
print("Num exons: %d / Begin: %d / End %d" % (len(sizes), exon_start, exon_end))
print("Smaller exon: %d / Larger exon: %d / Mean size: %.1f" % (sizes[0], sizes[-1], sum(sizes)/len(sizes)))

Num exons: 17 / Begin: 135788323 / End 135837169
Smaller exon: 79 / Larger exon: 1551 / Mean size: 340.2
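
The `value or default` idiom used above treats a coordinate of 0 as if it were missing, which matters for features that start at the very beginning of a chromosome. As a hedged alternative, the same statistics can be computed with `min` and `max` over the collected coordinates. The sketch below is pure Python and does not use HTSeq; for illustration it takes the first three exon rows from the `head` output shown earlier (these are ENSE records, not the CCDS records summarized above), so the numbers differ from the real result.

```python
# (start, end) pairs copied from the first three rows of LCT.bed shown earlier;
# used only as illustration data, in 0-based half-open coordinates
intervals = [
    (135836529, 135837180),
    (135833110, 135833190),
    (135829592, 135829676),
]

starts, ends = zip(*intervals)
span_start = min(starts)  # leftmost start over all intervals
span_end = max(ends)      # rightmost end over all intervals
sizes = sorted(end - start for start, end in intervals)

print("Num exons: %d / Begin: %d / End %d" % (len(sizes), span_start, span_end))
print("Smallest: %d / Largest: %d / Mean size: %.1f"
      % (sizes[0], sizes[-1], sum(sizes) / len(sizes)))
```

With HTSeq records, the list of pairs could be built from `rec.iv.start` and `rec.iv.end` for the records of interest, as in the loop above.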
