# Day 4, Exercise 2 - module, documentation and biopython

### There are 2 parts to this exercise, with answers listed under each part.


1. Create a module from previous exercises and use it
2. Use Biopython to rewrite the script that calculates the length of sequences (optional).

There are various ways to write the code for these tasks. Here, we present one solution in the answers, but if you have written a different one, that's perfectly fine. Just ensure that you test your code to confirm that it performs as expected.

<hr style="border: 2px solid #000080;">

## 1.  Create a module from previous exercises and use it

### Description:
A Python module is a file containing Python code (functions, classes, variables) which you can import and use in other Python scripts. This allows for better organization and reusability of your code. You import a module using the import statement.
- __Note__: `class` is out of scope for this course.
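
As a minimal sketch of this idea, the snippet below writes a tiny module file into a temporary directory, adds that directory to the import search path, and imports it. The module name `greetings_demo` and the function `greet` are made up for illustration; in the exercise you would simply save `dnatools.py` next to your script, so no path handling would be needed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module content, written to disk only for this demonstration.
module_source = '''
def greet(name):
    return f"Hello, {name}!"
'''

tmpdir = tempfile.mkdtemp()
Path(tmpdir, "greetings_demo.py").write_text(module_source, encoding="utf-8")

# Python looks through the directories listed in sys.path when importing.
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()  # make sure the freshly written file is noticed

import greetings_demo

message = greetings_demo.greet("DNA")
print(message)
```

The directory of the script you run is on the search path automatically, which is why a module saved in the same folder can be imported without any extra setup.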

### Tasks:

#### Save functions to a module
- Save the functions you created from `Day 3, Exercise 2` (calculating the length of sequences and GC content) into a file called `dnatools.py` 

#### Add Documentation
- Add module-level documentation (a docstring at the beginning of the file) describing what the module is about.
- For each function, write a help message (a docstring placed directly under the function definition line, i.e., the line starting with `def`).
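
A small sketch of what a function docstring looks like and how Python stores it. The function `count_bases` is a made-up example, not part of the exercise; `help(count_bases)` would display the same text interactively.

```python
import inspect

def count_bases(seq):
    """
    Count the bases in a DNA sequence.

    Args:
        seq (str): The DNA sequence.

    Returns:
        int: The number of bases.
    """
    return len(seq)

# The docstring is stored on the function object; inspect.getdoc()
# returns it with the indentation and surrounding blank lines cleaned up.
doc = inspect.getdoc(count_bases)
print(doc)
```

A module-level docstring works the same way: it is the first string in the file and is what `help(module_name)` shows at the top.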

#### Testing the module 

1. Import the module 

2. Show the help message of the module `dnatools`

3. Show the help message of the function `get_seqlength`

4. Write a test script called `test_dnatools.py` that tests the functions `get_seqlength`, `readseq` and `get_gc_content` using the sequence file `one_dna_sequence.fa`.

5. Create a Python script called `./cal_seqlen_3.py` that performs the same functionality as the script `./cal_seqlen_2.py` you created in the `Bonus task of Day3, Exercise2` (copy the code from the answers if you haven't done it yet). However, this time you will use the function imported from the module `dnatools`.

Try different ways to import the function from the module, e.g., 
```python
from dnatools import get_seqlength
# Import only the function get_seqlength from the module dnatools
# The function can be called directly as get_seqlength(seqfile)
```

or

```python
import dnatools
# Import the whole module under the namespace dnatools.
# The function can be called by dnatools.get_seqlength(seqfile)
# The prefix makes it clear where each function comes from and avoids name clashes.
```

or 
```python
from dnatools import *
# Import all public functions and variables from the module dnatools.
# All of them can be used directly in the script, without the dnatools prefix.
# This approach can make the code harder to read and debug: it is unclear where
# a name comes from, and imported names can silently overwrite your own.
```
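
Whichever form you choose, Python runs and caches the whole module once; the forms differ only in which names become available in your script. The sketch below illustrates this with the standard-library module `colorsys`, chosen here only as a small stand-in for `dnatools`.

```python
import sys

from colorsys import rgb_to_hsv  # binds a single name in this script

# Even though only one name was bound, the whole module has been
# loaded and is stored in the module cache.
module_loaded = "colorsys" in sys.modules

import colorsys  # reuses the cached module; nothing is executed twice

# Both import styles give access to the very same function object.
same_function = colorsys.rgb_to_hsv is rgb_to_hsv

print(module_loaded, same_function)
```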


### Tips:
- You don't need to add `#!/usr/bin/env python` at the beginning of the module file, since a module is not meant to be executed directly from the command line.

___


### The answers

#### Save functions in a module and add the documentation
- The content of the file `dnatools.py` is shown below.

```python
"""
dnatools.py

This module provides tools for working with DNA sequences. The main functionalities include:
- Calculating the length of a DNA sequence.
- Reading a DNA sequence from a file.
- Calculating the GC content of a DNA sequence.
- Checking the GC content level of a DNA sequence against specified thresholds.

Functions:
- get_seqlength(seqfile): Calculates the length of a DNA sequence.
- readseq(seqfile): Reads a DNA sequence from a file and returns it in uppercase.
- get_gc_content(seq): Calculates the GC content of a DNA sequence as a percentage.
- check_gc_content_level(seqfile, threshold_high=60.0, threshold_low=40.0): Checks the GC content level of a DNA sequence and categorizes it as high, low, or moderate.
"""

def get_seqlength(seqfile):
    """
    Calculate the length of a DNA sequence.

    Args:
        seqfile (str): The path to the sequence file.

    Returns:
        int: The length of the DNA sequence.
    """
    with open(seqfile, "r", encoding="utf-8") as fpin:
        seqlength = 0
        for line in fpin:
            # Skip the FASTA header line, which starts with ">"
            if not line.startswith(">"):
                seqlength += len(line.strip())
    return seqlength

def readseq(seqfile):
    """
    Read a DNA sequence from a file and return it in uppercase.

    Args:
        seqfile (str): The path to the sequence file.

    Returns:
        str: The DNA sequence in uppercase.
    """
    with open(seqfile, "r", encoding="utf-8") as fpin:
        seq = ""
        for line in fpin:
            if not line.startswith(">"):
                seq += line.strip()
    return seq.upper()

def get_gc_content(seq):
    """
    Calculate the GC content of a DNA sequence as a percentage.

    Args:
        seq (str): The DNA sequence (uppercase, as returned by readseq).

    Returns:
        float: The GC content percentage of the DNA sequence
        (0.0 for an empty sequence).
    """
    # Avoid a ZeroDivisionError when the sequence is empty
    if not seq:
        return 0.0
    return (seq.count('G') + seq.count('C')) / len(seq) * 100

def check_gc_content_level(seqfile, threshold_high=60.0, threshold_low=40.0):
    """
    Check the GC content level of a DNA sequence and categorize it as high, 
    low, or moderate based on specified thresholds.

    Args:
        seqfile (str): The path to the sequence file.
        threshold_high (float, optional): The threshold for high GC content. Default is 60.0.
        threshold_low (float, optional): The threshold for low GC content. Default is 40.0.

    Returns:
        str: A message indicating the GC content percentage and its level (high, low, or moderate).
    """
    seq = readseq(seqfile)
    gc_content = get_gc_content(seq)
    if gc_content >= threshold_high:
        return f"The GC content of the sequence from {seqfile} is {gc_content:.2f}%, level is high"
    elif gc_content <= threshold_low:
        return f"The GC content of the sequence from {seqfile} is {gc_content:.2f}%, level is low"
    else:
        return f"The GC content of the sequence from {seqfile} is {gc_content:.2f}%, level is moderate"
```

__Import the module__

```python
import dnatools
```

__Show the help message of the module dnatools__

```python
help(dnatools)
```

__Show the help message of the function `get_seqlength`__

```python
help(dnatools.get_seqlength)
```

__Write a test script called `test_dnatools.py` that tests the functions `get_seqlength`, `readseq` and `get_gc_content` using the sequence file `one_dna_sequence.fa`__

```python
#!/usr/bin/env python
from dnatools import get_seqlength, readseq, get_gc_content

# Path to the test sequence file
seqfile = "../downloads/one_dna_sequence.fa" # replace with the actual path to the file


# Test get_seqlength function
print("\nTesting get_seqlength()")
seq_length = get_seqlength(seqfile)
print(f"Length of {seqfile}: {seq_length}")


# Test readseq function
print("\nTesting readseq()")
sequence = readseq(seqfile)
print(f"sequence in the file {seqfile}: {sequence}")

# Test get_gc_content function
print("\nTesting get_gc_content()")
gc_content = get_gc_content(sequence)
print(f"GC content of {sequence}: {gc_content:.2f}%")
```
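
The script above prints results for you to check by eye. One possible extension, sketched below, is to compare the results against known answers with `assert`, using a small FASTA file whose content you control. The sequence `demo_seq` is made up for this sketch, and the three functions are compact stand-ins so that the block runs on its own; in the exercise you would import them from `dnatools` instead.

```python
import os
import tempfile

# Compact stand-ins for the dnatools functions (assumption: same behavior).
# In the exercise: from dnatools import get_seqlength, readseq, get_gc_content
def readseq(seqfile):
    with open(seqfile, "r", encoding="utf-8") as fpin:
        return "".join(line.strip() for line in fpin if not line.startswith(">")).upper()

def get_seqlength(seqfile):
    return len(readseq(seqfile))

def get_gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq) * 100

# A made-up 8-base sequence split over two lines: 3 G + 3 C out of 8 bases.
fasta_text = ">demo_seq\nacgt\nGGCC\n"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False, encoding="utf-8") as tmp:
    tmp.write(fasta_text)
    demo_file = tmp.name

try:
    assert get_seqlength(demo_file) == 8
    assert readseq(demo_file) == "ACGTGGCC"
    assert get_gc_content(readseq(demo_file)) == 75.0
    print("All checks passed")
finally:
    os.remove(demo_file)  # clean up the temporary file
```

If any check fails, Python stops with an `AssertionError` pointing at the failing line, so a silent run ending in "All checks passed" means the functions behaved as expected on this input.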

__Create a Python script called `./cal_seqlen_3.py` that performs the same functionality as the script `./cal_seqlen_2.py` you created in the `Bonus task of Day3, Exercise2` (copy the code from the answers if you haven't done it yet). However, this time you will use the function imported from the module `dnatools`.__

```python
#!/usr/bin/env python

import sys
from dnatools import get_seqlength

usage = f"USAGE: {sys.argv[0]} SEQFILE [SEQFILE ...]\n"
if len(sys.argv) < 2:
    print(usage)
    sys.exit(1)
for seqfile in sys.argv[1:]:
    print(f"Length of {seqfile}: {get_seqlength(seqfile)}")
```

<hr style="border: 2px solid #000080;">

## 2. Use Biopython to rewrite the script that calculates the length of sequences (optional)

### Description:
Biopython is a collection of Python libraries and tools designed to facilitate biological computations. It provides functionalities for reading and writing different sequence formats, working with sequence alignments, performing various biological computations, and interfacing with popular bioinformatics databases and tools. Biopython aims to simplify the development of scripts and applications for bioinformatics research by offering a flexible and powerful toolkit.

### Tasks:

1. Create a Python script called `./cal_seqlen_4.py` that performs similar functionality to the script `./cal_seqlen.py` you created in the `Bonus task of Day3, Exercise2` (copy the code from the answers if you haven't done it yet). However, this time you will use the module `SeqIO` from Biopython to parse the sequence file.

    - Output the result in the format
```
Sequence {i} in {filename}, ID: {sequence id}, Length: {length}
```

2. Run the script on both the single sequence file `one_dna_sequence.fa` and the multiple sequence file `Ecoli-10seq.fna` (the file can be found in the downloads folder or <a href="https://python-bioinfo.bioshu.se/downloads/Ecoli-10seq.fna">here</a>).

    - Consider whether it would be an easy task to achieve this without using Biopython.

___

### The answers

```python
#!/usr/bin/env python
import sys
from Bio import SeqIO


usage = f"USAGE: {sys.argv[0]} SEQFILE\n"
if len(sys.argv) < 2:
    print(usage)
    sys.exit(1)

seqfile = sys.argv[1]

with open(seqfile, "r") as handle:
    # enumerate() numbers the records starting from 1
    for i, record in enumerate(SeqIO.parse(handle, "fasta"), start=1):
        print(f"Sequence {i} in {seqfile}, ID: {record.id}, Length: {len(record.seq)}")
```

__Run the script__
```bash
python cal_seqlen_4.py ../downloads/one_dna_sequence.fa 
python cal_seqlen_4.py ../downloads/Ecoli-10seq.fna 
```
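
On the question of whether this would be easy without Biopython: for plain FASTA it is doable, but you have to track record boundaries yourself. The sketch below is one possible pure-Python approach, not a replacement for `SeqIO`; the helper `iter_fasta` and the demo data are made up for illustration, and the ID is taken as the first word of the header line, mimicking `record.id`.

```python
import io

def iter_fasta(handle):
    """Yield (sequence_id, sequence) pairs from an open FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    # The last record has no following header, so emit it here.
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Made-up two-record input, held in memory instead of a file.
demo = io.StringIO(">seq1 first demo record\nACGT\nAC\n>seq2\nGGG\n")
records = list(iter_fasta(demo))
for i, (seq_id, seq) in enumerate(records, start=1):
    print(f"Sequence {i} in demo, ID: {seq_id}, Length: {len(seq)}")
```

Even this short version needs special handling for the last record and for blank lines, and it covers only FASTA; Biopython handles such details, along with many other file formats, behind a single `SeqIO.parse` call.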

___