# Pangenomics
--------------------------------------------

# Searching Graphs with BLAST


## Overview
Here you will learn how to search graphs with BLAST. In other words, you can use a DNA sequence, such as your favorite gene, to search the pangenomic graph, discover the structure of the graph, and explore homologous sequences.

## Learning Objectives
+ Learn how to use BLAST to search a pangenome graph

## Get Started

### Get the CUP1 and YHR054C gene sequences

We will blast the CUP1 (YHR053C) and YHR054C gene sequences against a linearized version of the graph.

First, get the gene sequences. There are multiple copies of each, but we'll grab the first instance and use it to identify all copies through BLAST alignment.


<div class="alert alert-block alert-success"> <b>Gene Coordinates:</b><br>  
    CUP1 S288C_chrVIII:213043-213228<br>
    YHR054C S288C_chrVIII:213693-214757<br>
    Both are on the "-" strand. </div>
      

Use `samtools faidx`.

The parameters:

-i  reverse-complement the extracted region (both genes are on the "-" strand)  
input fasta  
region coordinates

In [None]:
!samtools faidx -i yprp.chrVIII.fa S288C_chrVIII:213043-213228 > genes.fa

!samtools faidx -i yprp.chrVIII.fa S288C_chrVIII:213693-214757 >> genes.fa

Take a look at the file you just made.

In [None]:
!cat genes.fa

Let's rename the sequences so they have the gene names rather than coordinates. Use `sed`.

The parameters:

-i edit in place

In [None]:
!sed -i 's/S288C_chrVIII:213043-213228.rc/CUP1/' genes.fa

!sed -i 's/S288C_chrVIII:213693-214757.rc/YHR054C/' genes.fa
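The `sed` patterns above rely on `.` matching any character, including the `/` in the `/rc` suffix that `samtools faidx -i` appends to reverse-complemented names (this assumes samtools' default strand marking; check the headers in your own `genes.fa`). A stricter variant, sketched here on a toy record with a made-up sequence, anchors the pattern to the whole header line and uses `|` as the `sed` delimiter so the `/` can be written literally:

```shell
# Toy FASTA record: the header mimics the samtools output, the sequence is made up.
demo_fa=$(mktemp)
printf '>S288C_chrVIII:213043-213228/rc\nACGT\n' > "$demo_fa"

# ^ and $ anchor the match to the full header line; sequence lines are untouched.
renamed=$(sed 's|^>S288C_chrVIII:213043-213228/rc$|>CUP1|' "$demo_fa")
printf '%s\n' "$renamed"
```

Anchoring the pattern means a header that only partly resembles the target is left unchanged, which is easier to notice than a silent partial rename.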

Take a look at it again.

In [None]:
!cat genes.fa

Note that the CUP1 gene (186 bp) is much shorter than YHR054C (1,065 bp).
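One way to check sequence lengths without reading the FASTA by eye is to sum the sequence lines of each record with `awk`. The sketch below is illustrative only: `fasta_lengths` is a hypothetical helper name, and the demo file contains made-up sequences so the block runs on its own (in a terminal or a `%%bash` cell).

```shell
# Print "name<TAB>length" for every record in a FASTA file.
fasta_lengths() {
  awk '
    /^>/ { if (name != "") print name "\t" len    # emit the previous record
           name = substr($1, 2); len = 0; next }  # start a new record, drop ">"
    { len += length($0) }                         # add up wrapped sequence lines
    END { if (name != "") print name "\t" len }   # emit the last record
  ' "$1"
}

# Toy input (made-up sequences) so the sketch is self-contained.
demo_fa=$(mktemp)
printf '>geneA\nACGTAC\nGT\n>geneB\nACG\n' > "$demo_fa"
fasta_lengths "$demo_fa"
```

Run on the real file with `fasta_lengths genes.fa`; each length should equal the coordinate span of the gene (end - start + 1).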

### BLAST the graph manually

Create a FASTA file containing the graph sequence.

Note: because each node is exported as its own fasta sequence, some sequences are very short, including many that are only a single nucleotide long.

In [None]:
!gfatools gfa2fa yprp.chrVIII.pggb.gfa > yprp.chrVIII.pggb.fa
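To get a feel for how many node sequences are very short, you could count the FASTA records at or below a length threshold. This is a rough sketch, not part of the original workflow: `count_short_seqs` is a hypothetical helper, and the demo file holds made-up node sequences so the block runs standalone.

```shell
# Report how many FASTA records are at most max_len bases long.
# Usage: count_short_seqs <fasta> <max_len>
count_short_seqs() {
  awk -v max="$2" '
    function tally() { if (seen) { total++; if (len <= max) n_short++ } }
    /^>/ { tally(); seen = 1; len = 0; next }   # close previous record, open a new one
    { len += length($0) }                       # accumulate sequence length
    END { tally(); printf "%d of %d sequences are <= %d bp\n", n_short, total, max }
  ' "$1"
}

# Toy input: four made-up nodes, two of them a single nucleotide long.
demo_nodes=$(mktemp)
printf '>s1\nA\n>s2\nACGT\n>s3\nG\n>s4\nACGTACGTAC\n' > "$demo_nodes"
count_short_seqs "$demo_nodes" 1
```

On the real data the call would be `count_short_seqs yprp.chrVIII.pggb.fa 1`.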

Build a BLAST database for the FASTA using `makeblastdb`.

The parameters:

-in fasta_file_from_graph  the file to build a database for  
-input_type fasta  the format of the input file (fasta)  
-dbtype nucl  type of sequence (nucl=DNA)

In [None]:
!makeblastdb -in yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl

Now, BLAST the genes against the database you just made. Use tabular format.

The parameters:

-db  database  
-query query (the genes, in our case)  
-outfmt output format (6=tab-delimited)

In [None]:
!blastn -db yprp.chrVIII.pggb.fa -outfmt 6 -query genes.fa > genesXyprp.chrVIII.pggb.fa.txt

The columns are:
+ *qseqid*      query or source (gene) sequence id
+ *sseqid*      subject or target sequence id (here, a graph node)
+ *pident*      percentage of identical positions
+ *length*      alignment length (sequence overlap)
+ *mismatch*    number of mismatches
+ *gapopen*     number of gap openings
+ *qstart*      start of alignment in query
+ *qend*        end of alignment in query
+ *sstart*      start of alignment in subject
+ *send*        end of alignment in subject
+ *evalue*      expect value
+ *bitscore*    bit score
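Because tabular BLAST output has no header line, it can help to add the column names and sort the hits before reading them. The sketch below assumes the twelve default columns listed above; `blast6_with_header` is a hypothetical helper, and the demo rows (gene names, node names, and scores) are made up purely for illustration.

```shell
# Print the 12 default column names, then the hits sorted by query
# and, within each query, by descending bit score (column 12).
blast6_with_header() {
  printf 'qseqid\tsseqid\tpident\tlength\tmismatch\tgapopen\tqstart\tqend\tsstart\tsend\tevalue\tbitscore\n'
  sort -t "$(printf '\t')" -k1,1 -k12,12nr "$1"
}

# Toy hits table (made-up values) so the sketch runs on its own.
demo_hits=$(mktemp)
printf 'geneA\tn7\t98.5\t60\t1\t0\t1\t60\t1\t60\t1e-20\t100\n'      > "$demo_hits"
printf 'geneA\tn3\t100.0\t120\t0\t0\t61\t180\t1\t120\t1e-50\t220\n' >> "$demo_hits"
printf 'geneB\tn9\t100.0\t40\t0\t0\t1\t40\t5\t44\t1e-10\t75\n'      >> "$demo_hits"
blast6_with_header "$demo_hits"
```

With the real results, the argument would be `genesXyprp.chrVIII.pggb.fa.txt`.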

Take a look at the blast output.

In [None]:
!cat genesXyprp.chrVIII.pggb.fa.txt

There are multiple copies of each gene. Note how some copies are split across nodes.
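A quick per-gene summary can make this pattern easier to see. The sketch below counts the hits for each query and sums their alignment lengths (column 4). `summarize_hits` is a hypothetical helper and the demo table is made up; treat the numbers as a rough guide only, since a large hit count could reflect several copies, copies split across nodes, or both.

```shell
# For each query: number of hits and total alignment length (column 4).
# Output columns: qseqid, hit count, summed alignment length.
summarize_hits() {
  awk -F '\t' '
    { hits[$1]++; aligned[$1] += $4 }
    END { for (q in hits) print q "\t" hits[q] "\t" aligned[q] }
  ' "$1" | sort
}

# Toy hits table (made-up values): geneA aligns to two nodes, geneB to one.
demo_hits=$(mktemp)
printf 'geneA\tn7\t98.5\t60\t1\t0\t1\t60\t1\t60\t1e-20\t100\n'      > "$demo_hits"
printf 'geneA\tn3\t100.0\t120\t0\t0\t61\t180\t1\t120\t1e-50\t220\n' >> "$demo_hits"
printf 'geneB\tn9\t100.0\t40\t0\t0\t1\t40\t5\t44\t1e-10\t75\n'      >> "$demo_hits"
summarize_hits "$demo_hits"
```

In the toy table, the two geneA hits cover adjacent query ranges (1-60 and 61-180) on different nodes, which is the kind of pattern a copy split across nodes would produce.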

## Conclusion

You learned how to BLAST against a pangenomic graph. Specifically, you searched for the CUP1 and YHR054C genes in a linearized version of the graph. In the next chapter you will learn how to visualize graphs, BLAST directly against the graph, and visualize the result.

## Clean up
No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!