# Pangenomics
--------------------------------------------

# Searching Graphs with BLAST


## Overview
Here you will learn how to search graphs with BLAST. In other words, you can use a DNA sequence, such as your favorite gene, to search the pangenomic graph, discover the structure of the graph, and explore homologous sequences.

## Learning Objectives
+ Learn how to use BLAST to search a pangenome graph

## Get Started

In this submodule you will extract some yeast (S288C) gene sequences for CUP1-1 and YHR054C and learn how to use them to search pangenomic graphs.

#### Extract gene sequences:
- CUP1-1
- YHR054C

#### BLAST 
- Short BLAST intro
- BLAST graph

----------------------

## Get the CUP1 and YHR054C gene sequences

We will BLAST the CUP1 (YHR053C) and YHR054C gene sequences against a linearized version of the graph, but first we need to extract the gene sequences from S288C. There are multiple copies of each gene, but we'll grab the first instance and use it to identify all copies through BLAST alignment.


<div class="alert alert-block alert-success"> <b>Gene Coordinates:</b><br>  
    CUP1-1 S288C_chrVIII:213043-213228<br>
    YHR054C S288C_chrVIII:213693-214757<br>
    Both are on the "-" strand.</div>
      

1. To extract the sequences, use `samtools faidx` and feed in the coordinates. We'll get the CUP1-1 sequence first and redirect it into a file in the *genes* directory called *genes.fa*. Then we'll extract the YHR054C gene and append it to the same file.

The parameters:

`-i`  reverse-complement  
`input fasta`  
`region coordinates`

In [None]:
!samtools faidx -i assemblies/yprp.chrVIII.fa S288C_chrVIII:213043-213228 > genes/genes.fa

!samtools faidx -i assemblies/yprp.chrVIII.fa S288C_chrVIII:213693-214757 >> genes/genes.fa
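To make the coordinate handling concrete, here is a minimal Python sketch of what the two commands above compute: a 1-based, inclusive slice of the chromosome followed by a reverse complement. The function names and the toy sequence are illustrative assumptions, not part of samtools.

```python
# Illustrative sketch only: mimics the region + reverse-complement logic on a toy string.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(chrom_seq: str, start: int, end: int, revcomp: bool = False) -> str:
    """Extract a 1-based, inclusive region, as in 'chr:start-end' notation."""
    region = chrom_seq[start - 1:end]
    return reverse_complement(region) if revcomp else region


toy_chrom = "AACCGGTTAC"  # hypothetical sequence, not yeast data
print(extract_region(toy_chrom, 3, 6))                # CCGG
print(extract_region(toy_chrom, 1, 4, revcomp=True))  # GGTT
```

Because the coordinates are inclusive, the region 213043-213228 spans 213228 - 213043 + 1 = 186 bases.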

2. Take a look at the file you just made.

In [None]:
!cat genes/genes.fa

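3. Let's rename the sequences so they have the gene names rather than coordinates. Use `sed`. Because we used `-i`, `samtools faidx` marked each sequence name with an `rc` suffix; the `.` in each pattern below matches the separator character in front of `rc`.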

The parameters:

`-i` edit in place

In [None]:
!sed -i 's/S288C_chrVIII:213043-213228.rc/CUP1/' genes/genes.fa

!sed -i 's/S288C_chrVIII:213693-214757.rc/YHR054C/' genes/genes.fa
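The same renaming can be sketched in Python. This is a minimal illustration that assumes the headers carry a `/rc` suffix (our understanding of the default samtools strand marker); the `renames` mapping and `rename_header` helper are hypothetical names introduced here.

```python
import re

# Assumed header form: reverse-complemented regions carry a "/rc" suffix.
headers = [
    ">S288C_chrVIII:213043-213228/rc",
    ">S288C_chrVIII:213693-214757/rc",
]

# Map each coordinate-based name to a gene name (same names as the sed commands).
renames = {
    "S288C_chrVIII:213043-213228/rc": "CUP1",
    "S288C_chrVIII:213693-214757/rc": "YHR054C",
}


def rename_header(line: str) -> str:
    """Replace a coordinate-based name with its gene name."""
    for old, new in renames.items():
        # re.escape makes every character literal, unlike the "." wildcard in sed.
        line = re.sub(re.escape(old), new, line)
    return line


renamed = [rename_header(h) for h in headers]
print(renamed)  # ['>CUP1', '>YHR054C']
```

Escaping the pattern is a slightly stricter choice than the `sed` commands, which rely on `.` matching any single character.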

4. Take a look at it again.

In [None]:
!cat genes/genes.fa

Note that the CUP1 gene (186 bp) is much shorter than YHR054C (1,065 bp).
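One way to check the sequence lengths without counting by eye is a small FASTA reader. The sketch below is a minimal example; `fasta_lengths` and the toy records are hypothetical and only illustrate the idea.

```python
def fasta_lengths(lines):
    """Return {record_name: sequence_length} for FASTA-formatted lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after ">".
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence may wrap over several lines; add them up.
            lengths[name] += len(line)
    return lengths


toy_fasta = [">geneA", "ACGTAC", "GT", ">geneB", "ACG"]  # hypothetical records
print(fasta_lengths(toy_fasta))  # {'geneA': 8, 'geneB': 3}
```

In the notebook, the same function could be applied to the real file with `fasta_lengths(open("genes/genes.fa"))`.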

----------------------

## BLAST

The Basic Local Alignment Search Tool (BLAST) allows you to compare DNA sequences in order to efficiently identify the best matches. Here we will use BLAST to search the DNA sequences in the pangenome for matches to two adjacent genes.

Altschul, Stephen F., et al. "Basic local alignment search tool." Journal of molecular biology 215.3 (1990): 403-410.


### BLAST the graph manually

1. Create a FASTA file containing the graph sequence.


<div class="alert alert-block alert-info"> <b>NOTE:</b> Because each node is exported as its own FASTA sequence, some sequences are very short, including many that are only a single nucleotide long.</div>

In [None]:
!gfatools gfa2fa graphs/yprp.chrVIII.pggb.gfa > graphs/yprp.chrVIII.pggb.fa
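To see why the exported FASTA contains one record per node, here is a minimal Python sketch of the conversion idea: each GFA segment (`S`) line becomes its own FASTA record. This is an illustration on a toy graph, not the gfatools implementation, and `gfa_segments_to_fasta` is a hypothetical helper.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Yield FASTA text for each GFA segment (S) line: one record per node."""
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue  # skip header (H), link (L), path (P) and other lines
        node_id, sequence = fields[1], fields[2]
        if sequence == "*":
            continue  # GFA allows "*" when the sequence is not stored
        yield f">{node_id}\n{sequence}\n"


# Hypothetical toy graph: two nodes joined by one link.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "L\t1\t+\t2\t+\t0M",
]
fasta_text = "".join(gfa_segments_to_fasta(toy_gfa))
print(fasta_text, end="")
```

Node `2` in the toy graph is a single nucleotide, which is the kind of very short record described in the note above.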

2. Build a BLAST database for the FASTA using `makeblastdb`.

The parameters:

`-in` the FASTA file exported from the graph (the file to build a database for)  
`-input_type` the format of the input file (fasta)  
`-dbtype` the type of sequence (nucl = nucleotide, i.e. DNA)

In [None]:
!makeblastdb -in graphs/yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl

3. Now, BLAST the genes against the database you just made. Use tabular format.

The parameters:

`-db`  database  
`-query` query (the genes, in our case)  
`-outfmt` output format (6=tab)

In [None]:
!blastn -db graphs/yprp.chrVIII.pggb.fa -outfmt 6 -query genes/genes.fa > genes/genesXyprp.chrVIII.pggb.fa.txt

The columns are:
+ *qseqid*      query or source (gene) sequence id
+ *sseqid*      subject or target (here, a graph node) sequence id
+ *pident*      percentage of identical positions
+ *length*      alignment length (sequence overlap)
+ *mismatch*    number of mismatches
+ *gapopen*     number of gap openings
+ *qstart*      start of alignment in query
+ *qend*        end of alignment in query
+ *sstart*      start of alignment in subject
+ *send*        end of alignment in subject
+ *evalue*      expect value
+ *bitscore*    bit score
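The column list above maps directly onto a small parser. The following Python sketch reads tabular BLAST output into dictionaries keyed by those column names; the example hit line is hypothetical, and `parse_blast_tab` is a helper introduced here for illustration.

```python
import csv
import io

# Default column order for BLAST tabular output (-outfmt 6), as listed above.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ["length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"]
FLOAT_COLUMNS = ["pident", "evalue", "bitscore"]


def parse_blast_tab(handle):
    """Parse tab-separated BLAST hits into a list of dicts with typed values."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for col in INT_COLUMNS:
            hit[col] = int(hit[col])
        for col in FLOAT_COLUMNS:
            hit[col] = float(hit[col])
        hits.append(hit)
    return hits


# Hypothetical hit line, for illustration only (not real output).
example = "CUP1\t1234\t100.000\t186\t0\t0\t1\t186\t1\t186\t1.5e-95\t344\n"
hits = parse_blast_tab(io.StringIO(example))
print(hits[0]["sseqid"], hits[0]["length"], hits[0]["pident"])  # 1234 186 100.0
```

In the notebook, the real results could be loaded with `parse_blast_tab(open("genes/genesXyprp.chrVIII.pggb.fa.txt"))`.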

4. Take a look at the blast output.

In [None]:
!cat genes/genesXyprp.chrVIII.pggb.fa.txt

There are multiple copies of each gene. Note how some copies are split across nodes.
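One way to tell whole copies from copies split across nodes is to compare how much of the query each hit covers. The sketch below is a rough illustration under stated assumptions: the hit tuples are hypothetical, the query lengths come from the gene coordinates given earlier, and `classify_hits` is a helper invented for this example.

```python
from collections import defaultdict

# Query lengths derived from the gene coordinates (end - start + 1).
QUERY_LENGTHS = {"CUP1": 186, "YHR054C": 1065}

# Hypothetical hits as (qseqid, sseqid, qstart, qend); not real BLAST output.
toy_hits = [
    ("CUP1", "101", 1, 186),      # whole gene inside one node
    ("CUP1", "205", 1, 120),      # first part of a copy split across nodes
    ("CUP1", "206", 121, 186),    # remainder of that copy in the next node
    ("YHR054C", "310", 1, 1065),  # whole gene inside one node
]


def classify_hits(hits, query_lengths):
    """Count hits per query that cover the full query versus only part of it."""
    summary = defaultdict(lambda: {"full": 0, "partial": 0})
    for qseqid, _sseqid, qstart, qend in hits:
        covered = abs(qend - qstart) + 1
        kind = "full" if covered == query_lengths[qseqid] else "partial"
        summary[qseqid][kind] += 1
    return {query: dict(counts) for query, counts in summary.items()}


print(classify_hits(toy_hits, QUERY_LENGTHS))
```

Partial hits whose query ranges are adjacent, such as 1-120 and 121-186 here, are candidates for a single gene copy that spans two nodes. Requiring exact full-length coverage is a deliberately strict choice; a real analysis might allow a small tolerance.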

----------------------

## Conclusion

You learned how to BLAST against a pangenomic graph. Specifically, you searched for the CUP1 and YHR054C genes in the graph.

BLASTing gene sequences allows you to find out where genes of interest are in the FASTA file exported from the pangenomic graph. It also allows you to identify copy numbers of the genes.

In the next chapter you will learn how to visualize graphs, BLAST directly against the graph, and visualize the result.

----------------------

## Clean up

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!</div>