In [12]:
from IPython.display import Image
from IPython.display import clear_output
from IPython.display import FileLink, FileLinks
import matplotlib.pyplot as plt
import pandas        as pd
import os
import time

from IPython.display import HTML
styles = """
<style>
    h1 {
        margin-top: 0 !important;
    }
    h2 {
        margin-top: 0 !important;
    }
</style>
"""
HTML(styles)
display(HTML(styles))

<img src="img/python-logo-master-flat.png" alt="Python Logo" style="width: 120px; float: right; margin: 0 0 10px 10px;" />

## Introduction to Python with Application in Bioinformatics



### Nanjiang Shu

#### 2024-07-16 (Day 2)

## Review of Day 1

- Literals and their associated data types, variables, and basic operations
    - Four basic data types: Numeric, String, Boolean, None
    - Collection of data types: List, Tuple, Set and Dictionary
    - Naming of variables, rules and conventions
- Built-in functions and operations
    - print, type, len, max, sum 
- Loops
    - `for` loop and `while` loop
- How to use if/elif/else statements to control the logic of the code

## Review of the quiz from yesterday

- <a href="https://forms.office.com/Pages/DesignPageV2.aspx?origin=NeoPortalPage&subpage=design&id=DQSIkWdsW0yxEjajBLZtrQAAAAAAAAAAAAa__Yehr4dUMTZSRVUyRVMxMTQzODBTUk1IVVZOVk83Ti4u&analysis=true">Link to the quiz statistics</a>

# Questions?

## Day 2

- __Session 1__:
    - Python Script: Dump your code into a script
    - File I/O: Read and write files
- __Session 2__:
    - Variants identification with VCF (Exercise after lunch)
- __Session 3__:
    - Introduction to the Course Project
- __Session 4__:
    - Team building - Spaghetti tower

## Session 1: Python script, File I/O

## Python script

A Python script is a file containing Python code that can be executed all at once.
- File Extension: `.py`



Note: for example, suppose we have a list of gene IDs and print them one per line. This code can be run in a Jupyter notebook, at the Python prompt, or in PyCharm. We can also save it in a text file named `myscript.py`.

In [4]:
# A list of gene IDs, printed one per line
gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]
for gene_id in gene_ids:
    print(gene_id)

TP53
COX2
EGFR
MTOR


#### Now we will save the code in a text file and try to run it in the terminal

### Things to remember when working with scripts

- Put `#!/usr/bin/env python` at the beginning of the file
- Make the file executable to run with `./script.py`
- Otherwise run script with `python script.py`
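
As a minimal sketch, a complete script following these points might look like the block below. The file name `myscript.py` and the contents of the `gene_ids` list are only illustrative, reusing the gene IDs printed earlier.

```python
#!/usr/bin/env python
# myscript.py (illustrative file name): print one gene ID per line

gene_ids = ["TP53", "COX2", "EGFR", "MTOR"]
for gene_id in gene_ids:
    print(gene_id)
```

On Linux or macOS, the file can typically be made executable with `chmod +x myscript.py`, after which `./myscript.py` should give the same output as `python myscript.py`.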

## 5 mins exercise
Take some of the Python code you wrote yesterday, save it in a Python script, and run it in a terminal with either
___
```bash
    python yourscript.py
```

or
```bash
    ./yourscript.py
```
___
- Replace `yourscript.py` with the actual name you use
- You need to add `#!/usr/bin/env python` at the beginning of the script and also make the script executable in order to run it with the second method 

### Note: in all the examples shown so far, the data is provided directly in the code, which is not very practical in many cases. We usually want to read the data from files and write the results to output files. That requires file reading and writing with Python.

## File I/O - read from file and write to file

### Read from file
___
```python
file = open("filename.txt", "r")
content = file.read()
file.close()
```
___
`'r'` specifies the mode for opening the file as a read-only text file. If the file does not exist, a `FileNotFoundError` will be raised.

Other types of mode:
- `'rb'`: Opens the file as a binary file for reading.
- `'r+'`: Opens the file for reading and writing.
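
The `FileNotFoundError` mentioned above can be seen in a small sketch like the following. It uses `try`/`except`, which may not have been covered yet in the course, and the file name `no_such_file.txt` is hypothetical; a fresh temporary directory is used so that the file is certain not to exist.

```python
import os
import tempfile

# Build a path inside a new, empty temporary directory (hypothetical file name)
missing_path = os.path.join(tempfile.mkdtemp(), "no_such_file.txt")

try:
    file = open(missing_path, "r")
    content = file.read()
    file.close()
except FileNotFoundError:
    # Opening a non-existent file in 'r' mode ends up here
    content = None
    print("File not found:", missing_path)
```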

Extra reading: https://www.w3schools.com/python/python_file_handling.asp


#### Use the `with` keyword to ensure the file handle is closed automatically

```python
file = open("filename.txt", "r")
content = file.read()
file.close()
```
___

```python
with open("filename.txt", "r") as file:
    content = file.read()
```

Note: ASCII stands for the American Standard Code for Information Interchange. It is a character encoding standard for electronic communication. 

#### Specify `encoding='utf-8'` when opening a file in Python to ensure proper handling of text files that contain non-ASCII characters, such as Chinese characters.

```python
with open("filename.txt", "r", encoding='utf-8') as file:
    content = file.read()
```

Note: UTF-8 (Unicode Transformation Format - 8-bit) is one of the most widely used encoding schemes and can represent every character in the Unicode character set.

UTF-8 can represent all 1,112,064 valid Unicode code points, of which over 143,000 have been assigned to characters.
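
A small round trip can illustrate why the encoding matters. This is only a sketch: the text and the file name `utf8_example.txt` are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

text = "DNA 序列 (sequence)"  # made-up text containing two Chinese characters
path = os.path.join(tempfile.mkdtemp(), "utf8_example.txt")

# Write and read back with the same explicit encoding
with open(path, "w", encoding="utf-8") as file:
    file.write(text)

with open(path, "r", encoding="utf-8") as file:
    content = file.read()

print(content == text)

# Non-ASCII characters take more than one byte in UTF-8,
# so the byte count is larger than the character count
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)
```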


I have a file containing a DNA sequence and want to use Python to read its content and count the number of nucleotides, that is, get the length of the DNA sequence.

In [40]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)
    print("Length of DNA sequence:", seqlength)

Length of DNA sequence: 386


In [38]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    for line in file:
        print(line)

>SEQUENCE_1

AGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGC

TTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTA

GCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCT

AGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTT

AGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGC

TAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAA

GCTTAGCTAGCTAGCTAGCTTAGCTA





### Detour: useful methods for `string`


`'string'.strip()` &emsp; &emsp; &emsp; Removes leading and trailing whitespace  
`'string'.split()` &emsp; &emsp; &emsp; Splits on whitespace into a list  

In [27]:
string1  = '  an example string with whitespaces at both ends   '
string2 = string1.strip()
string2

'an example string with whitespaces at both ends'

In [28]:
list1 = string2.split()
list1

['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']

In [30]:
list2 = string1.strip().split()
list2

['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']
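
`split()` also accepts an explicit separator, which is useful for tab- or comma-delimited data. The sketch below uses a made-up, shortened tab-separated record and a comma-separated list of gene IDs.

```python
# A shortened, made-up tab-separated record ending with a newline
record = "5\t1207129\trs111580366\tG\tA\n"

# strip() removes the trailing newline, split("\t") splits on tabs only
fields = record.strip().split("\t")
print(fields)

# The same idea with a comma as separator
gene_list = "TP53,COX2,EGFR,MTOR".split(",")
print(gene_list)
```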

### Write to file
___
```python
file = open("output.txt", "w")
file.write("Hello, Python!")
file.close()
```
___
- `'w'`: Opens a file for writing only. If the file does not exist, it creates a new file. If the file exists, its previous content will be truncated, effectively deleting the content before the new write operations.

Other modes:
- `'a'`: Opens the file for writing, appending any new data you write to the end of the file's existing content, thus preserving the previous content.
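
The difference between `'w'` and `'a'` can be sketched as follows. The file name `modes_example.txt` is hypothetical and the file is created in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_example.txt")

with open(path, "w") as file:
    file.write("first line\n")

# Opening with 'w' again truncates the file: "first line" is lost
with open(path, "w") as file:
    file.write("second line\n")

# Opening with 'a' keeps the existing content and adds to the end
with open(path, "a") as file:
    file.write("third line\n")

with open(path, "r") as file:
    content = file.read()

print(content)
```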

Extra reading: https://www.w3schools.com/python/python_file_handling.asp

```python
with open("output.txt", "w") as file:
    file.write("Hello, Python!")
```

```python
with open("output.txt", "w", encoding = 'utf-8') as file:
    file.write("Hello, Python!")
```

In [4]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))

### Day 2,  Exercise 1
Link: https://python-bioinfo.bioshu.se/exercises.html
___

#### Take a break after the exercise

Note: In bioinformatics, the process of identifying differences (variants) between a particular DNA sequence and a reference sequence, is known as "variant calling."

## Session 2: Variants identification with VCF
- A VCF (Variant Call Format) file is a text file format used in bioinformatics for storing gene sequence variations. 

## Description of the task

You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000, including the start and end positions). What you want to know is whether this sample has any variants in this region, and if so, which variants.


### What is your input? 
<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

- __CHROM__: 1 (Chromosome 1)
- __POS__: 10177 (Position 10177 on the chromosome)
- __ID__: rs367896725 (Reference SNP ID)
- __REF__: A (Reference allele is A)
- __ALT__: AC (Alternative allele is AC)
- __QUAL__: 50 (Quality score)
- __FILTER__: PASS (status, meaning passed all quality control)
- __FORMAT__: GT:DP:CB (Genotype, Depth, Cell Barcode)
- __SAMPLES__: 0/1:30:SM (Sample genotypes and additional info)
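
The fields above can be pulled out of a record with `split('\t')`. The sketch below builds a made-up record from the example values. It assumes the standard VCF column order, which also has an INFO column between FILTER and FORMAT (set to `.` here since no value is given above).

```python
# Made-up record built from the example values; INFO is a placeholder "."
line = "1\t10177\trs367896725\tA\tAC\t50\tPASS\t.\tGT:DP:CB\t0/1:30:SM"

# Names of the nine fixed columns in a VCF record
column_names = ["CHROM", "POS", "ID", "REF", "ALT",
                "QUAL", "FILTER", "INFO", "FORMAT"]

cols = line.split("\t")
record = dict(zip(column_names, cols[:9]))  # fixed columns by name
samples = cols[9:]                          # one column per sample

print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(samples)
```

Looking columns up by name is only a convenience; indexing `cols` by position works equally well.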

#### Genotype field
<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

__`0/1`__
- Humans, like many other organisms, carry two copies of their genetic information, one inherited from each parent.
- `0` means the DNA at this position is the same as the reference allele (`A`)
- `1` means it carries the alternative allele, in this case `AC`

__The alternative allele is not always denoted `1`__
- `0/2`
- `2` means it is the second alternative allele, i.e. `C` in this case
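
As a sketch of how the numbers in a genotype map to alleles, the block below assumes a hypothetical ALT column `AC,C` listing two alternative alleles. It does not handle missing genotypes such as `./.`.

```python
ref = "A"
alt = "AC,C"  # hypothetical ALT column with two alternative alleles

# Index 0 is the reference allele, 1 the first ALT, 2 the second ALT
alleles = [ref] + alt.split(",")

genotype = "0/2"
called = []
for index in genotype.split("/"):
    called.append(alleles[int(index)])

print(called)  # ['A', 'C']
```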

## Start with pseudocode!
- Pseudocode is a description of what you want to do without actually using proper syntax

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

- Open the file and loop over lines (ignore lines starting with `#`)
- Identify lines where chromosome is equal to `5` and position is between 1,200,000 and 1,300,000
- Isolate the column that contains the genotype for sample1, e.g. `0/1:30:SM`
- Extract only the genotypes (e.g. `0/1`) from the column
- Check if the genotype contains any alternate alleles, i.e., if not `0/0`.
- Print any variants containing alternate alleles for this sample within the specified region

#### Open the file and loop over lines (ignore lines starting with `#`), and print the first record

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px; margin:0px;"/> 

In [13]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            print(line)
            break
# Next, find chromosome 5

1	762969	rs115616822	G	A	.	PASS	AA=g;AC=1;AN=120;DP=226;GP=1:773106;BN=132	GT:DP:CB	0/0:1:SMB	0/0:6:SMB	0/0:3:SMB	0/0:5:SMB	0/0:15:SMB	0/0:6:SMB	0/0:4:SMB	0/0:1:SMB	0/0:11:SMB	0/0:0:SMB	0/0:3:SMB	0/0:3:SMB	0/0:5:SMB	0/0:2:SMB	0/0:8:SMB	0/0:2:SMB	0/0:4:SMB	0/0:2:SMB	0/0:4:SMB	0/0:1:SMB	0/0:6:SMB	0/0:3:SMB	0/0:6:SMB	0/0:1:SMB	0/0:2:SMB	0/0:6:SMB	0/0:7:SMB	0/0:0:SMB	0/0:0:SMB	0/0:3:SMB	0/0:4:SMB	0/0:3:SMB	0/0:2:SB	0/0:6:SMB	0/0:6:SMB	0/0:1:SMB	0/0:1:SMB	0/0:0:SMB	0/0:8:SMB	0/0:7:SMB	0/0:0:SMB	0/0:1:SMB	0/0:3:SMB	0/0:0:SMB	0/0:8:SMB	0/0:5:SMB	0/0:4:SMB	0/0:0:SMB	0/0:8:SMB	0/0:0:SMB	0/0:4:SMB	0/0:5:SMB	0/0:7:SMB	0/0:4:SMB	0/0:2:SMB	0/0:0:SMB	0/0:2:SMB	0/0:2:SMB	0/0:5:SMB	0/1:8:SMB


#### Identify lines where chromosome is `5`

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

In [16]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            if cols[0] == '5':
                print(cols)
                break

# Next, find the correct region

['5', '106565', 'rs115608877', 'G', 'T', '.', 'PASS', 'AA=.;AC=7;AN=120;DP=91;GP=5:53565;BN=132', 'GT:DP:CB', '1/0:1:SM', '1/0:3:SM', '0/0:1:SM', '0/0:2:SM', '0/0:0:S', '0/0:0:SM', '0/0:0:SM', '0/0:1:S', '0/0:0:S', '0/0:0:S', '0/0:2:SM', '0/1:1:SM', '0/0:3:SM', '0/1:0:SM', '0/0:2:SM', '1/0:1:S', '0/0:1:SM', '0/0:0:SM', '0/0:2:SM', '0/0:1:SM', '0/0:5:SM', '0/0:0:S', '0/0:1:SM', '0/0:0:S', '0/0:1:SM', '0/0:1:SM', '0/1:1:SM', '0/1:1:SM', '0/0:2:SM', '0/0:3:SM', '0/0:2:SM', '0/0:8:SM', '0/0:0:SM', '0/0:7:SM', '0/0:4:SM', '0/0:0:S', '0/0:2:S', '0/0:0:SM', '0/0:3:SM', '0/0:2:SM', '0/0:1:SM', '0/0:2:SM', '0/0:3:S', '0/0:0:SM', '0/0:3:SM', '0/0:1:SM', '0/0:0:SM', '0/0:0:SM', '0/0:3:S', '0/0:1:SM', '0/0:0:SM', '0/0:0:SM', '0/0:3:SM', '0/0:4:SM', '0/0:2:SM', '0/0:1:SM', '0/0:0:SM', '0/0:0:SM', '0/0:1:SM', '0/0:2:SM']


#### Identify lines where chromosome is `5` and position is between `1,200,000` and `1,300,000`

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

In [17]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                print(cols)
                break
# Next, find the genotypes for sample1

['5', '1207129', 'rs111580366', 'G', 'A', '.', 'PASS', 'AA=G;AC=1;AN=120;DP=329;GP=5:1154129;BN=132', 'GT:DP:CB', '0/0:4:SMB', '0/1:17:SMB', '0/0:9:SMB', '0/0:9:SMB', '0/0:6:SMB', '0/0:8:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:7:SMB', '0/0:1:SMB', '0/0:14:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:7:SMB', '0/0:9:SMB', '0/0:10:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:13:SMB', '0/0:2:SMB', '0/0:9:SMB', '0/0:3:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:6:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:8:SMB', '0/0:5:SMB', '0/0:10:SMB', '0/0:2:SMB', '0/0:7:SMB', '0/0:0:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:5:SMB', '0/0:12:SMB', '0/0:8:SMB', '0/0:15:SMB', '0/0:5:SMB', '0/0:2:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:6:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:3:SMB']


#### Isolate the column that contains the genotype for sample1

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

In [19]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                print(geno_col)
                break
                    
# Next, extract the genotypes only

0/0:4:SMB


Note:
- 0/0: Genotype
- 4: Read Depth (DP)
- SMB: Custom field (e.g., called by S(Sanger), M(UMich), B(BI))

Here's how the genotype format works:

- 0 refers to the reference allele.
- 1, 2, 3, etc., refer to the first, second, third, etc., alternative alleles listed in the ALT field of the VCF file.
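
The sample column can be matched against the FORMAT column to label each value. This is a small sketch using the values shown above.

```python
format_col = "GT:DP:CB"
sample_col = "0/0:4:SMB"

# Pair each FORMAT key with the value at the same position
keys = format_col.split(":")
values = sample_col.split(":")
sample_info = dict(zip(keys, values))

print(sample_info)  # {'GT': '0/0', 'DP': '4', 'CB': 'SMB'}
print(sample_info["GT"])
```

In VCF, `GT` is the first key whenever it is present, so taking the first element after splitting on `:` gives the genotype as well.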

#### Extract the genotypes only from the column

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

In [20]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                print(geno)
                break
# Next, find in which positions sample1 has alternate alleles

0/0


#### Check if the genotype contains any alternate alleles

<img src="../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

In [21]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                if geno not in ["0/0"]:
                    print(geno)
# Next, print nicely

1/1
1/1


#### Print any variants containing alternate alleles for this sample in the specified region

In [26]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            line = line.strip()
            cols = line.split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and (1200000 <= pos <= 1300000):
                geno_col = cols[9]
                geno = geno_col.split(':')[0]
                if geno not in ["0/0"]:
                    ref = cols[3]
                    alt = cols[4]
                    info = chrom + ":" + str(pos) + "_" + ref + "-" + alt + " has genotype: "+ geno
                    print(info)

5:1235651_C-T has genotype: 1/1
5:1277795_G-C has genotype: 1/1
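
The steps above could also be wrapped in a reusable function. This is only a sketch: `find_variants` and its parameters are hypothetical names, functions may not have been covered yet, and the tiny VCF written to a temporary file is made-up data, not the course file.

```python
import os
import tempfile

def find_variants(vcf_path, chrom, start, end, sample_index=9):
    """Collect (pos, ref, alt, genotype) for non-0/0 genotypes in a region."""
    variants = []
    with open(vcf_path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip header lines
            cols = line.strip().split("\t")
            pos = int(cols[1])
            if cols[0] == chrom and start <= pos <= end:
                geno = cols[sample_index].split(":")[0]
                if geno != "0/0":
                    variants.append((pos, cols[3], cols[4], geno))
    return variants

# Made-up VCF content for demonstration only
demo_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "5\t1100000\t.\tG\tA\t.\tPASS\t.\tGT:DP:CB\t1/1:3:SM",  # outside region
    "5\t1235651\t.\tC\tT\t.\tPASS\t.\tGT:DP:CB\t1/1:5:SM",  # variant in region
    "5\t1250000\t.\tA\tG\t.\tPASS\t.\tGT:DP:CB\t0/0:7:SM",  # no alternate allele
    "7\t1260000\t.\tT\tC\t.\tPASS\t.\tGT:DP:CB\t0/1:2:SM",  # wrong chromosome
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vcf")
with open(demo_path, "w", encoding="utf-8") as fh:
    fh.write("\n".join(demo_lines) + "\n")

result = find_variants(demo_path, "5", 1200000, 1300000)
print(result)
```

Passing the chromosome, region and sample column as arguments means the same code can be reused for another sample or region without editing the loop.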


## Take a break 
___
## Quiz for Day 2
- Link: https://python-bioinfo.bioshu.se/quiz.html
___
## Lunch

## Afternoon session
___
### 1. Exercise 2 of Day 2
___
### 2. Introduction to the course project
___
##### Break
___
### 3. Team building

## Day 2, Exercise 2

- Link: https://python-bioinfo.bioshu.se/exercises.html

# Introduction to the course project

Note: Cystic fibrosis (CF) is a genetic (inherited) disease that causes sticky, thick mucus to build up in organs, including the lungs and the pancreas. In people who have CF, thick mucus clogs the airways and makes it difficult to breathe. Management includes ways of clearing lungs and a nutrition plan.

### Background
<div style="overflow: auto;">
    <div style="float: left; margin-left: 20px; margin-top: 40px;">
        <img src="img/cystic_fibrosis.png" alt="cystic fibrosis" style="width: 500px;"/>
        <p style="font-size: 12px;">source: mayoclinic.org</p>
    </div>
    <div style="font-size: 30px; float: right; margin-left: 0px; margin-top: 0px; max-width: 50%;">
        <h3>Cystic fibrosis (CF)</h3>
        <ul>
            <li>Genetic (inherited) disease</li>
            <li>Produces thick and sticky mucus in organs, including lungs and the pancreas</li>
            <li>Clogs patients' airways and makes it difficult for them to breathe</li>
            <li>No cure is available, only symptom management such as airway clearance</li>
        </ul>
    </div>
</div>


Note: CFTR usually makes a gate for chloride ions, a type of mineral with a negative electrical charge. Chloride moves out of the cell, taking water with it, which thins out mucus and makes it more slippery. In people with CF, gene mutations in CFTR prevent this from happening, so the mucus stays sticky and thick.

## Genomic facts of Cystic Fibrosis
- CF is caused by mutations in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene
- The CFTR protein is an ion channel protein, acting like a gate in the cell membrane that controls the traffic of molecules through the membrane
- In healthy people, CFTR forms a gate for chloride ions. Chloride moves out of the cell, taking water with it, which thins the mucus
- In CF patients, mutations in the CFTR gene prevent this from happening, so the mucus stays sticky and thick
<img src="img/CFTR.jpg" alt="cystic fibrosis" style="width: 300px;"/>
<p style="font-size: 12px;">source: cff.org</p>



## More about the CFTR gene

- The CFTR gene is located on chromosome 7 of the human genome
- Over 1,500 mutations are known to cause CF
- One type of mutation:
    - Non-synonymous (amino-acid-changing) mutations that generate a premature termination codon (PTC), which leads to a truncated (shortened) CFTR protein.
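
To illustrate what a premature termination codon looks like in sequence terms, here is a minimal sketch. The two short coding sequences are made up, not CFTR, and `first_stop_codon_index` is a hypothetical helper. It assumes the sequence starts in frame and uses the three standard DNA stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons, written as DNA

def first_stop_codon_index(cds):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

normal = "ATGGCCAAATGA"  # made-up: ATG GCC AAA TGA
mutant = "ATGGCCTAATGA"  # made-up: ATG GCC TAA TGA (AAA changed to TAA)

last_codon = len(normal) // 3 - 1
for name, seq in [("normal", normal), ("mutant", mutant)]:
    stop_at = first_stop_codon_index(seq)
    premature = stop_at is not None and stop_at < last_codon
    print(name, "first stop at codon", stop_at, "premature:", premature)
```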

## Goal of the project

#### Write a Python program that:
- Extracts the correct CFTR transcript from the human genome
- Translates it into its corresponding amino acid sequence
- Determines whether one or more patients have a premature stop codon


__You will be guided step by step towards the final goal__

## Data

- Human reference genome
    - Chromosome 7 in fasta format
    - Gene annotations in GTF (Gene Transfer Format) format
- Genome sequencing data from five patients 
    - Chromosome 7 in fasta format

## Fasta format
```plaintext
>MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC
GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT
ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA
ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA
AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA
ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC
TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT
CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA
CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA
```
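
A FASTA file can hold several records, each starting with a `>` header line. The sketch below collects made-up records into a dictionary keyed by the first word of the header. The lines are given as a list here; with a real file, the same loop would run over the open file handle.

```python
# Made-up FASTA content with two records
fasta_lines = [
    ">seq1 made-up record",
    "GATCACAGGT",
    "CTATCACCCT",
    ">seq2 another made-up record",
    "ATTAACCACT",
]

sequences = {}
name = None
for line in fasta_lines:
    line = line.strip()
    if line.startswith(">"):
        # First word after '>' is used as the sequence ID
        name = line[1:].split()[0]
        sequences[name] = ""
    elif name is not None:
        # Sequence lines are joined into one string per record
        sequences[name] += line

for name, seq in sequences.items():
    print(name, len(seq))
```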

## GTF format
- GTF stands for Gene Transfer Format
- Holds information about gene structure
- Tab-delimited
- Based on the General Feature Format (GFF), with additional structure specific to genes

```
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
```

- seqname: The name of the sequence (typically a chromosome).
- source: The source of the annotation (e.g., ENSEMBL).
- feature: The type of feature (e.g., gene, transcript, exon).
- start: The starting position of the feature in the sequence.
- end: The ending position of the feature in the sequence.
- score: A score between 0 and 1000, or . if not applicable.
- strand: The strand on which the feature is located (+ for the forward strand, - for the reverse strand).
- frame: The reading frame, one of '0', '1' or '2', or `.` if not applicable.
- attribute: A list of key-value pairs providing additional information about the feature.
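
The nine fields can be separated with `split('\t')`, as in this sketch. The GTF line is made up for illustration, and the length calculation relies on GTF coordinates being 1-based with both start and end included.

```python
# Made-up GTF line (not from the course data)
gtf_line = '7\thavana\texon\t1000\t1200\t.\t+\t.\tgene_id "GENE0001"; gene_name "DEMO";'

field_names = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]

cols = gtf_line.split("\t")
feature = dict(zip(field_names, cols))

# Positions are read as text, so convert them to integers
feature["start"] = int(feature["start"])
feature["end"] = int(feature["end"])

# Both start and end are included in the feature
length = feature["end"] - feature["start"] + 1
print(feature["feature"], feature["strand"], length)
```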

## Attributes

Some attributes (always semi-colon separated key-value pairs): 
- gene_id: The stable identifier for the gene
- gene_version: The stable identifier version for the gene
- gene_name: The official symbol of this gene
- gene_source: The annotation source for this gene
- transcript_id: The stable identifier for this transcript
- transcript_name: The symbol for this transcript derived from the gene name
- exon_id: The stable identifier for this exon

```
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
```
```
1	havana	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";

```
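
The attributes column of the example line can be turned into a dictionary roughly as follows. This is a sketch that assumes no `;` appears inside the quoted values.

```python
attributes = ('gene_id "ENSG00000223972"; gene_version "5"; '
              'gene_name "DDX11L1"; gene_source "havana"; '
              'gene_biotype "transcribed_unprocessed_pseudogene";')

parsed = {}
for item in attributes.strip().split(";"):
    item = item.strip()
    if item:  # skip the empty piece after the final ';'
        key, value = item.split(" ", 1)
        parsed[key] = value.strip('"')  # drop the surrounding quotes

print(parsed["gene_id"], parsed["gene_name"])
```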

# Project page

https://python-bioinfo.bioshu.se/project.html