## Batch PSI-BLAST Tutorial and Data Extraction

The BLAST+ suite from NCBI includes `psiblast`, which can run PSI-BLAST on multiple query sequences in one batch. This tutorial covers the following steps:

<ol>
    <li>Format Ray genome so it only has locus tags (fdh and gp)</li> 
    <li>Install BLAST+</li> 
    <li>Prepare the nr database and extract taxids</li>
    <li>Run BLAST+</li>
        - test that 
    <li>Extract proteins of desired species from output</li>
    </ol>

In [None]:
#written by @Jina Lee, contact jil143@ucsd.edu

### 1. Format Ray genome so it only has locus tags

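This rewrites every FASTA header so that it contains only the locus tag (the fdh and gp identifiers). BLAST reports the query header in its output, so each hit can then be traced back to a locus tag directly.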

In [1]:
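import re

# Read the original genome and write a copy whose headers are just the locus tag
with open("ray_genome.fasta") as genome, open("ray_genome_format.fasta", "w") as formatted:
    for line in genome:
        if line.startswith(">"):
            # Pull the value out of [locus_tag=...]
            match = re.search(r'\[locus_tag=(.*?)\]', line)
            if match:
                formatted.write(">" + match.group(1) + "\n")
            else:
                # No locus tag found: keep the header unchanged instead of crashing
                formatted.write(line)
        else:
            # Sequence lines are copied through untouched
            formatted.write(line)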
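
To check the header rewrite before running it on the whole genome, the same regular expression can be tried on a single header line. This is a minimal sketch: the function name `locus_header` and the example header are made up for illustration and are not taken from the Ray genome file.

```python
import re

# Same pattern as in the formatting cell: captures the value of [locus_tag=...]
LOCUS_TAG = re.compile(r"\[locus_tag=(.*?)\]")

def locus_header(header_line):
    """Return '>locus_tag' for a FASTA header line, or None if it has no tag."""
    match = LOCUS_TAG.search(header_line)
    return ">" + match.group(1) if match else None

# Hypothetical header, invented for this example only
example = ">lcl|XX000000.1_prot_1 [locus_tag=gp001] [protein=hypothetical protein]"
print(locus_header(example))          # >gp001
print(locus_header(">no tag here"))   # None
```

Headers that return `None` are the ones that would have raised an `IndexError` with `re.findall(...)[0]`.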

### 2. Install BLAST+

Install the most recent version of BLAST+, using either conda or apt-get.
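
One way to confirm that the installation worked is to ask `psiblast` for its version from the notebook. The sketch below assumes only that BLAST+ programs accept the `-version` flag; the helper name `blast_version` is hypothetical.

```python
import shutil
import subprocess

def blast_version(program="psiblast"):
    """Return the version text printed by a BLAST+ program, or None if it is not on PATH."""
    path = shutil.which(program)
    if path is None:
        return None
    # BLAST+ programs print their version and exit when given -version
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(blast_version() or "psiblast not found on PATH")
```

If this prints the "not found" message, the environment in which BLAST+ was installed (for example a conda environment) is probably not the one the notebook kernel is using.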


### 3. Prepare the nr database and extract taxids

List of databases provided: ftp://ftp.ncbi.nlm.nih.gov/blast/db/<br>
Database documentation: https://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html

NCBI BLAST database book:
https://www.ncbi.nlm.nih.gov/books/NBK569850/?report=reader

This tutorial may also help: https://angus.readthedocs.io/en/2016/running-command-line-blast.html

We want to run psiblast against the nr database, restricted to tailed phages (taxid 28883). The following commands download the database and write the species-level taxids under 28883 to a file:

`$ update_blastdb.pl --decompress nr`<br>
`$ get_species_taxids.sh -t 28883 > 28883.txids`

For me, `--decompress` did not work. If the command downloads everything but does not decompress it, extract the tar.gz files manually. The database is huge: over 300GB when decompressed.
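
If the archives were downloaded but left compressed, they can be extracted from Python with the standard `tarfile` module. This is only a sketch: the function name `extract_archives` is hypothetical, and it assumes that all downloaded `*.tar.gz` files sit in one directory and come from a trusted source (the NCBI FTP site).

```python
import tarfile
from pathlib import Path

def extract_archives(db_dir):
    """Extract every *.tar.gz in db_dir into db_dir; return the archive names processed."""
    db_dir = Path(db_dir)
    done = []
    for archive in sorted(db_dir.glob("*.tar.gz")):
        # Each archive holds one volume of the database
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=db_dir)
        done.append(archive.name)
    return done
```

Calling `extract_archives("blastdb")`, where `blastdb` stands in for the download directory, leaves the archives in place, so make sure there is enough disk space for both the compressed and decompressed copies, or delete each archive after it has been extracted.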