# Genome Mining Software Comparison

#### The test task

In some of our projects, we rely on genome mining software that, given a genome sequence, can find regions of interest responsible for the production of bioactive compounds. These regions of interest are often referred to as biosynthetic gene clusters (BGCs), but we can think of them simply as pairs of start and end coordinates in a long string corresponding to the genome. 
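To make the "pair of coordinates" view concrete, here is a minimal sketch. The `Region` type, the `extract` helper, and the toy sequence are hypothetical names introduced only for illustration, and the 1-based, inclusive coordinate convention is an assumption (it is the convention commonly used in GenBank-style annotations, but each tool's output should be checked).

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A candidate BGC as a pair of coordinates.

    Assumption: coordinates are 1-based and inclusive on both ends.
    """
    start: int
    end: int

    @property
    def length(self) -> int:
        # Inclusive on both ends, hence the +1.
        return self.end - self.start + 1


def extract(genome: str, region: Region) -> str:
    """Return the subsequence covered by a region.

    Python slices are 0-based and end-exclusive, so only the start shifts.
    """
    return genome[region.start - 1:region.end]


# Toy data, not taken from any real genome.
toy_genome = "ACGTACGTAC"
toy_region = Region(start=3, end=6)

print(toy_region.length)                 # 4
print(extract(toy_genome, toy_region))   # GTAC
```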

The task is to write a simple script that compares the outputs of two different genome mining tools for the same genome. 
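One plausible way to compare two tools is to check which predicted regions overlap and by how much. The sketch below assumes each tool's output has already been reduced to a list of inclusive `(start, end)` tuples; the function names and the coordinates are hypothetical and do not come from antiSMASH or GECCO.

```python
def overlap_length(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def match_regions(regions_a, regions_b):
    """List every pair of regions (one per tool) that shares at least one position."""
    matches = []
    for a in regions_a:
        for b in regions_b:
            shared = overlap_length(a, b)
            if shared > 0:
                matches.append((a, b, shared))
    return matches


# Made-up coordinates for illustration only.
tool_a_regions = [(100, 500), (2000, 2600)]
tool_b_regions = [(450, 900), (5000, 5200)]

matches = match_regions(tool_a_regions, tool_b_regions)
print(matches)  # [((100, 500), (450, 900), 51)]
```

The nested loop is quadratic in the number of regions, which should be fine for the few dozen regions typically predicted per genome; sorting the intervals first would be the natural refinement for larger inputs.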

The tools are [antiSMASH](https://antismash.secondarymetabolites.org/#!/about) (the state-of-the-art method) and [GECCO](https://gecco.embl.de/) (an experimental machine learning-based approach). The genome is Streptomyces coelicolor A3(2); its sequence is attached in compressed FASTA format. 

We start by loading the FASTA file. It can be read in Python with the Biopython library, which provides tools for parsing FASTA files efficiently. 

<div class="alert alert-block alert-info">
<b>How to Install:</b> 
Biopython can be installed with pip:

$ pip install biopython

</div>
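Since the genome is distributed in compressed form, the file may need to be opened through `gzip` rather than directly. The helper below is a minimal sketch using only the standard library; `open_fasta_text` is a hypothetical name, and the check relies on the two-byte gzip magic number at the start of the file.

```python
import gzip


def open_fasta_text(path):
    """Open a FASTA file in text mode, whether or not it is gzip-compressed.

    Assumption: a compressed file uses gzip, which is detected by its
    magic number (bytes 0x1f 0x8b) rather than by the file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

The returned handle can be passed to `SeqIO.parse(handle, "fasta")` in place of a file path, so the rest of the reading code stays the same.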

In [1]:
import pandas as pd
from Bio import SeqIO


In [2]:
# path to the FASTA file
file_path = "/Users/Erfan/Downloads/Test_Task_HiWi/NC_003888.3.fna"

# Read the file using Biopython
for record in SeqIO.parse(file_path, "fasta"):
    print(f"ID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence: {record.seq[:100]}...")  # Print only the first 100 bases for brevity
    print(f"Length: {len(record.seq)}")


ID: NC_003888.3
Description: NC_003888.3 Streptomyces coelicolor A3(2) chromosome, complete genome
Sequence: CCCGCGGAGCGGGTACCACATCGCTGCGCGATGTGCGAGCGAACACCCGGGCTGCGCCCGGGTGTTGCGCTCCCGCTCCGCGGGAGCGCTGGCGGGACGC...
Length: 8667507
