# Biopython - BLAST

### Introduction: BLAST and its importance in bioinformatics

- **Sequence Identification**: Identifies unknown DNA/protein sequences.
- **Phylogenetic Analysis**: Studies evolutionary relationships.
- **Functional Annotation**: Predicts protein functions.
- **Efficiency and Speed**: Analyses large datasets quickly.

NCBI BLAST Output:
- Formats: XML, XML2, JSON, HTML, Text, Tabular.
- Access: NCBI website or via command line tools.
- **Usage: quick online searches and manual inspection**.

Biopython BLAST Output:
- Formats: XML, XML2, and Tabular can be parsed; HTML and Text cannot.
- Access: Via `Bio.Blast` module for online searches or local BLAST+ tools.
- **Usage: Ideal for automated processing and integration into Python workflows**.


### Biopython - JSON Format

A useful output format is JSON. When called with `format_type="JSON2"`, `qblast` returns the results as zipped JSON.

**Benefits of JSON Format for BLAST Results**:

- **Easy Access**: Quick access to specific data.
- **Data Manipulation**: Filter and transform data easily.
- **Integration**: Works well with other Python libraries.
- **Automation**: Simplifies repetitive tasks.
- **Storage**: Convenient for saving and retrieving results.
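
Because the JSON2 result arrives zipped, it has to be unpacked before the JSON can be used. The following is a minimal sketch using only the standard library. It assumes the bytes returned by the search form a zip archive with one or more `.json` members and makes no assumption about the keys inside them. The name `load_json_members` is illustrative and not part of Biopython.

```python
import io
import json
import zipfile


def load_json_members(zip_bytes):
    """Return {member_name: parsed_json} for every .json file in a zip archive."""
    documents = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as member:
                    documents[name] = json.load(member)
    return documents
```

With a real search, the input would be the bytes read from the stream that `qblast` returns, for example `load_json_members(result_stream.read())`.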

### BLAST Locally - BLAST+

**Advantages**:
1. **Speed**: May be faster than BLAST over the internet.
2. **Custom Databases**: Allows use of your own database.
3. **Privacy**: Use BLAST without redistributing proprietary or unpublished sequence data.

**Drawbacks**:
1. **Installation**: Requires installing the BLAST+ command line tools.
2. **Setup**: Requires a BLAST database to be set up.
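
Setting up a custom database is usually done with the `makeblastdb` tool that ships with BLAST+. The sketch below only builds and prints the command, so it runs even without BLAST+ installed. The helper name `makeblastdb_command` and the file names are hypothetical.

```python
def makeblastdb_command(fasta_path, db_name, dbtype="nucl"):
    """Build the argument list for BLAST+'s makeblastdb tool."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


# Hypothetical input file and database name
cmd = makeblastdb_command("my_sequences.fasta", "my_db")
print(" ".join(cmd))

# Once BLAST+ is installed and the FASTA file exists, the command can be run with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The resulting database name (`my_db` here) is what would then be passed to `-db` when running `blastn`.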

--
```python
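# This code requires BLAST+ and a local "nt" database
import subprocess

# Pass arguments as a list so no shell is involved; -outfmt 5 selects XML
cmd = ["blastn", "-query", "Sample.fasta", "-db", "nt", "-out", "Sample.xml"]
cmd += ["-evalue", "0.001", "-outfmt", "5"]
subprocess.run(cmd, check=True)  # raise an error if blastn exits with a failure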
```
--

### Biopython - BLAST Over the Internet
We use the function `qblast` in the `Bio.Blast` module to call the online version of BLAST.

NCBI BLAST Usage Guidelines:

- **Server Contact Frequency**: Max once every 10 seconds.
- **RID Polling**: Max once per minute.
- **Identification**: Set the `email` and `tool` URL parameters.
- **High-Volume Searches**: Run scripts on weekends or between 9 pm and 5 am ET on weekdays.
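
When a script submits several searches, the contact-frequency guideline can be respected with a small throttle placed before each request. This is a sketch under stated assumptions: `Throttle` is a hypothetical helper, not part of Biopython, and the clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# One throttle shared by all submissions: at most one contact every 10 seconds
submit_throttle = Throttle(10)
```

Calling `submit_throttle.wait()` immediately before each search submission returns at once the first time and sleeps for the remainder of the interval on later calls.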

Important Arguments:

1. **Program**: BLAST program to use (e.g., blastn, blastp).
2. **Database**: BLAST database to search against (e.g., nr, nt).
3. **Sequence**: Query sequence, FASTA-formatted text, or a sequence identifier (e.g., a GI number).
4. **Optional**: `format_type` (XML, HTML, Text, XML2, JSON2, Tabular).

--
```python
## Example run: perform BLAST using the GI number of the query sequence
from Bio import Blast
result_stream = Blast.qblast("blastn", "nt", "8332116")
```
--

```python
## Example run: perform BLAST using a FASTA file
```

In [None]:
from Bio import Blast

## Read the FASTA file
with open("Sample2.fasta") as txt:
    our_fasta = txt.read()

## Perform BLAST search
result_stream = Blast.qblast("blastn", "nt", our_fasta)

## Alternative query: format a SeqRecord as FASTA
## (assumes `record` is a SeqRecord, e.g. one read with SeqIO.read)
# result_stream = Blast.qblast("blastn", "nt", format(record, "fasta"))

## Save the result (XML by default) so it can be parsed later without repeating the search
with open("my_blast.xml", "wb") as out_stream:
    out_stream.write(result_stream.read())

## Close the result stream
result_stream.close()

## Reopen the saved XML file
result_stream = open("my_blast.xml", "rb")
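
Saving the result before parsing also makes it easy to avoid contacting the server twice for the same query. The sketch below shows that caching pattern. `cached_result` is a hypothetical helper, and `run_search` stands for any zero-argument function that performs the search and returns the raw result as bytes.

```python
from pathlib import Path


def cached_result(path, run_search):
    """Return the bytes stored at `path`, calling run_search() only if the file is missing."""
    path = Path(path)
    if not path.exists():
        path.write_bytes(run_search())
    return path.read_bytes()
```

In the workflow above, `run_search` could wrap the `Blast.qblast(...)` call and return `result_stream.read()`, with `"my_blast.xml"` as the cache path.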

### BLAST Parsing
Biopython can parse BLAST output in the XML, XML2, and tabular formats. It has no parser for HTML or plain-text output.

**Output can be generated from**:
- Biopython BLAST over the internet
- Biopython BLAST local
- BLAST on NCBI website
- BLAST locally without Biopython (**BLAST+ command line tools**)
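
Tabular output produced by the BLAST+ command line tools (`-outfmt 6`) is plain tab-separated text with twelve default columns, so it can also be read without Biopython. The sketch below uses the standard `csv` module instead of Biopython's parser. It assumes the default column order, and the sample line is made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_tabular(handle):
    """Yield one dict per hit line, skipping blank lines and '#' comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        yield hit


# Made-up example line, for illustration only
sample = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t360\n"
for hit in parse_tabular(io.StringIO(sample)):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

If a custom column list was requested on the command line, `COLUMNS` would need to be adjusted to match.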



In [None]:
## One query
# blast_record = Blast.read(result_stream)

## Multiple queries (returns an iterator)
blast_records = Blast.parse(result_stream)

# Alternative: blast_record = next(blast_records)
print(blast_records) # printing makes the parser read and store all records

In [None]:
## A single query: access the first BLAST record
blast_record = blast_records[0]
print(blast_record)

In [None]:
blast_slice = blast_record[:3] # to copy use copy = blast_record[:]
print(blast_slice)

In [None]:
# BLAST Hit
hit = blast_record[0] # 0 top hit; -1 last hit
print(hit)

In [None]:
# BLAST HSP
alignment = hit[0]
print(alignment)
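
Since a record contains hits and each hit contains HSPs, filtering by significance comes down to two nested loops. The sketch below assumes each HSP exposes its E-value as `annotations["evalue"]`. This should be checked against the installed Biopython version. `filter_hits` is a hypothetical helper, not a Biopython function.

```python
def filter_hits(record, max_evalue=1e-5):
    """Return (hit, best_evalue) pairs for hits whose best HSP passes the cutoff."""
    kept = []
    for hit in record:
        # Assumption: every HSP stores its E-value under annotations["evalue"]
        evalues = [hsp.annotations["evalue"] for hsp in hit]
        if evalues and min(evalues) <= max_evalue:
            kept.append((hit, min(evalues)))
    return kept
```

It would be called as `filter_hits(blast_record, max_evalue=0.001)`. Hits without any HSP are skipped.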

In [None]:
from Bio import SeqIO

fasta_file = "Sample2.fasta"
sequences = list(SeqIO.parse(fasta_file, "fasta"))

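def filter_sequences(sequences, min_length=100, motifs=None):
    """Keep records of at least min_length that contain one or more motifs."""
    motifs = motifs or {}  # avoid calling .items() on None
    filtered_seqs = []
    for seq_record in sequences:
        if len(seq_record.seq) < min_length:
            continue
        # Names of all motifs found in this sequence
        matches = [name for motif, name in motifs.items() if motif in seq_record.seq]
        if matches:
            filtered_seqs.append((seq_record, matches))
    return filtered_seqs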

motifs_to_search = {
    "ATG": "Start codon",
    "TGCAT": "IHF binding site",
    "GCGCA": "CRP binding site",
    "GAATTC": "EcoRI restriction site",
    "AAGCTT": "HindIII restriction site",
    "GGATCC": "BamHI restriction site",
    "CTGTTG": "LacI binding site",
    "GATC": "Dam methylation site",
    "CCATGG": "NcoI restriction site",
}
filtered_sequences = filter_sequences(sequences, motifs=motifs_to_search)

for seq_record, matches in filtered_sequences:
    print(f">{seq_record.id}\n{seq_record.seq}\nMatches: {', '.join(matches)}\n")