# Pangenomics
--------------------------------------------

# Searching Graphs with BLAST


## Overview
In this submodule you will learn how to search graphs with BLAST. In other words, you can use a DNA sequence, such as your favorite gene, to search the pangenomic graph, discover the structure of the graph, and explore homologous sequences.

## Learning Objectives
+ Use BLAST to search a pangenome graph

## Getting Started

In this submodule you will extract some yeast (S288C) gene sequences for CUP1 and YHR054C and learn how to use them to search pangenomic graphs.

#### Extract gene sequences:
- CUP1
- YHR054C

#### BLAST
- Short BLAST intro
- BLAST graph
----------------------

## Get the CUP1 and YHR054C gene sequences

We will BLAST the CUP1 (YHR053C) and YHR054C gene sequences against a linearized version of the graph, but first we need to extract the gene sequences from S288C. There are multiple copies of each gene, but we'll grab the first instance and use it to identify all copies through BLAST alignment.


<div class="alert alert-block alert-success"> <b>Gene Coordinates:</b><br>  
    CUP1 S288C_chrVIII:213043-213228<br>
    YHR054C S288C_chrVIII:213693-214757<br>
    Both are on the "-" strand. </div>
      

1. To extract the sequences, use `samtools faidx` and feed in the coordinates. We'll get the CUP1 sequence first and redirect it into a file in the *genes* directory called *genes.fa*. Then we'll extract the YHR054C gene and append it to the same file.

The parameters:

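`-i`  reverse-complement the extracted sequence  
`input fasta` `region coordinates`  the FASTA file, followed by the region to extract (`sequence_name:start-end`, 1-based and inclusive)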

In [1]:
!samtools faidx -i assemblies/yprp.chrVIII.fa S288C_chrVIII:213043-213228 > genes/genes.fa

!samtools faidx -i assemblies/yprp.chrVIII.fa S288C_chrVIII:213693-214757 >> genes/genes.fa
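To make the coordinate convention concrete, here is a minimal Python sketch of what the extraction does conceptually: the region is 1-based and inclusive, and `-i` returns the reverse complement. The function names and the toy sequence are illustrative assumptions. This is not how samtools is implemented, and it does not read the indexed FASTA file.

```python
# Illustrative only: mimics the coordinate handling of `samtools faidx -i`
# on an in-memory string. Function names are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_region(seq: str, start: int, end: int, revcomp: bool = False) -> str:
    """Extract a 1-based, inclusive region, optionally reverse-complemented."""
    sub = seq[start - 1:end]  # convert to Python's 0-based, half-open slice
    return reverse_complement(sub) if revcomp else sub


toy = "AAACCCGGGTTT"  # toy sequence, not real yeast data
print(extract_region(toy, 4, 6))                 # bases 4 through 6
print(extract_region(toy, 1, 6, revcomp=True))   # reverse complement of bases 1 through 6
print(213228 - 213043 + 1)                       # length of the CUP1 region in bp
```

Because the coordinates are inclusive, the length of a region is `end - start + 1`.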

2. Take a look at the file you just made.

In [2]:
!cat genes/genes.fa

>S288C_chrVIII:213043-213228/rc
ATGTTCAGCGAATTAATTAACTTCCAAAATGAAGGTCATGAGTGCCAATG
CCAATGTGGTAGCTGCAAAAATAATGAACAATGCCAAAAATCATGTAGCT
GCCCAACGGGGTGTAACAGCGACGACAAATGCCCCTGCGGTAACAAGTCT
GAAGAAACCAAGAAGTCATGCTGCTCTGGGAAATGA
>S288C_chrVIII:213693-214757/rc
ATGGTACCCGCTGCTGAAAACCTATCTCCGATACCTGCCTCTATTGATAC
GAACGACATTCCTTTAATTGCTAACGATTTAAAATTACTGGAAACGCAAG
CAAAATTGATAAATATTCTGCAAGGTGTTCCTTTCTACTTGCCAGTAAAT
TTAACCAAAATTGAAAGTCTGTTAGAAACCTTGACTATGGGCGTGAGTAA
TACAGTAGACTTATATTTTCATGACAACGAAGTCAGAAAAGAATGGAAAG
ACACTTTAAATTTTATCAATACCATTGTTTATACAAATTTTTTCCTTTTT
GTTCAAAACGAATCCTCTTTGTCCATGGCAGTTCAACATTCTTCTAACAA
CAATAAGACCTCGAACTCTGAAAGATGTGCAAAGGATCTGATGAAAATTA
TTTCTAATATGCACATTTTTTACTCAATAACATTTAATTTTATCTTCCCC
ATAAAGTCGATAAAGTCATTTTCAAGCGGCAATAATCGCTTTCATTCTAA
TGGTAAAGAATTTTTATTCGCAAATCATTTTATTGAAATCTTACAGAATT
TTATAGCAATCACATTTGCTATTTTCCAACGTTGTGAAGTAATATTATAT
GACGAATTTTACAAAAATCTTTCAAATGAGGAGATTAATGTTCAATTGCT
ATTGATTCATGACAAGATTTTGGAAATTTTAAAAAAAATAGAAATTATCG
TATCCTTTTTACGAGATGAAATGAATAGCAAC

3. Let's rename the sequences so they have the gene names rather than coordinates. Use `sed`.

The parameters:

`-i` edit in place

In [3]:
!sed -i 's/S288C_chrVIII:213043-213228.rc/CUP1/' genes/genes.fa

!sed -i 's/S288C_chrVIII:213693-214757.rc/YHR054C/' genes/genes.fa
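In the `sed` patterns above, the `.` before `rc` is a regular-expression wildcard that happens to match the `/` in the header. As an alternative, the sketch below renames headers in Python by matching each full header exactly. The helper name `rename_headers` and the shortened example sequences are assumptions for illustration only.

```python
# Hypothetical helper: rename FASTA headers using an explicit mapping.
def rename_headers(fasta_text: str, mapping: dict) -> str:
    """Replace header names found in `mapping`; leave all other lines unchanged."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            line = ">" + mapping.get(name, name)
        out.append(line)
    return "\n".join(out) + "\n"


mapping = {
    "S288C_chrVIII:213043-213228/rc": "CUP1",
    "S288C_chrVIII:213693-214757/rc": "YHR054C",
}

# Shortened sequences, for illustration only.
example = (
    ">S288C_chrVIII:213043-213228/rc\nATGTTCAGCG\n"
    ">S288C_chrVIII:213693-214757/rc\nATGGTACCCG\n"
)
print(rename_headers(example, mapping), end="")
```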

4. Take a look at it again.

In [4]:
!cat genes/genes.fa

>CUP1
ATGTTCAGCGAATTAATTAACTTCCAAAATGAAGGTCATGAGTGCCAATG
CCAATGTGGTAGCTGCAAAAATAATGAACAATGCCAAAAATCATGTAGCT
GCCCAACGGGGTGTAACAGCGACGACAAATGCCCCTGCGGTAACAAGTCT
GAAGAAACCAAGAAGTCATGCTGCTCTGGGAAATGA
>YHR054C
ATGGTACCCGCTGCTGAAAACCTATCTCCGATACCTGCCTCTATTGATAC
GAACGACATTCCTTTAATTGCTAACGATTTAAAATTACTGGAAACGCAAG
CAAAATTGATAAATATTCTGCAAGGTGTTCCTTTCTACTTGCCAGTAAAT
TTAACCAAAATTGAAAGTCTGTTAGAAACCTTGACTATGGGCGTGAGTAA
TACAGTAGACTTATATTTTCATGACAACGAAGTCAGAAAAGAATGGAAAG
ACACTTTAAATTTTATCAATACCATTGTTTATACAAATTTTTTCCTTTTT
GTTCAAAACGAATCCTCTTTGTCCATGGCAGTTCAACATTCTTCTAACAA
CAATAAGACCTCGAACTCTGAAAGATGTGCAAAGGATCTGATGAAAATTA
TTTCTAATATGCACATTTTTTACTCAATAACATTTAATTTTATCTTCCCC
ATAAAGTCGATAAAGTCATTTTCAAGCGGCAATAATCGCTTTCATTCTAA
TGGTAAAGAATTTTTATTCGCAAATCATTTTATTGAAATCTTACAGAATT
TTATAGCAATCACATTTGCTATTTTCCAACGTTGTGAAGTAATATTATAT
GACGAATTTTACAAAAATCTTTCAAATGAGGAGATTAATGTTCAATTGCT
ATTGATTCATGACAAGATTTTGGAAATTTTAAAAAAAATAGAAATTATCG
TATCCTTTTTACGAGATGAAATGAATAGCAACGGAAGTTTCAAATCTATT
AAAGGTTTCAACAAGGTTTTGAATCTGATT

Note that the CUP1 gene (186 bp) is much shorter than YHR054C (1,065 bp).
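If you want to check sequence lengths yourself, a small FASTA parser is enough. The sketch below is a minimal example that works on a string. The function name `fasta_lengths` and the toy records are assumptions. To use it on the real file, you could pass in the contents of *genes/genes.fa*.

```python
# Hypothetical helper: compute the length of each record in FASTA-formatted text.
def fasta_lengths(fasta_text: str) -> dict:
    """Return {sequence_name: length}, summing over wrapped sequence lines."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the ID up to the first whitespace
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


toy = ">geneA\nATGC\nAT\n>geneB\nA\n"  # toy records, not real genes
print(fasta_lengths(toy))
```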

----------------------

## BLAST

The Basic Local Alignment Search Tool (BLAST) compares DNA sequences to efficiently identify the best matches ([Altschul, Stephen F., et al. 1990](https://doi.org/10.1016/S0022-2836(05)80360-2)). Here we will use BLAST to search the DNA sequences in the pangenome for matches to two adjacent genes.


### BLAST the graph manually

1. Create a FASTA file containing the graph sequence.


<div class="alert alert-block alert-info"> <b>NOTE:</b> Because each node is exported as its own FASTA sequence, some sequences are very short, including many that are only a single nucleotide long. </div>

In [5]:
!gfatools gfa2fa graphs/yprp.chrVIII.pggb.gfa > graphs/yprp.chrVIII.pggb.fa

[M::main] Version: 0.4-r214-dirty
[M::main] CMD: gfatools gfa2fa graphs/yprp.chrVIII.pggb.gfa
[M::main] Real time: 0.020 sec; CPU: 0.021 sec
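To see why every node becomes its own FASTA record, it helps to look at the GFA format: each segment (node) is a tab-separated `S` line holding a name and a sequence. The sketch below converts `S` lines to FASTA records in Python. It is a simplified illustration and not the gfatools implementation. The function name and the toy graph are assumptions, and details such as line wrapping or segments without a stored sequence are not modeled.

```python
# Simplified illustration of exporting GFA segments as FASTA records.
def gfa_segments_to_fasta(gfa_text: str) -> str:
    """Write one FASTA record per GFA segment (S) line."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # S lines have the form: S <segment name> <sequence> [optional tags]
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            records.append(f">{name}\n{seq}")
    return ("\n".join(records) + "\n") if records else ""


# Toy graph: a header line, two segments, and one link between them.
toy_gfa = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tG\nL\t1\t+\t2\t+\t0M\n"
print(gfa_segments_to_fasta(toy_gfa), end="")
```

In the toy graph, segment `2` is a single nucleotide, so its FASTA record is one base long.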


2. Build a BLAST database for the FASTA using `makeblastdb`.

The parameters:

`-in` fasta_file_from_graph: the file to build a database for  
`-input_type` fasta: the format of the input file (FASTA)  
`-dbtype` nucl: the type of sequence (nucl = nucleotide, i.e. DNA)

In [6]:
!makeblastdb -in graphs/yprp.chrVIII.pggb.fa -input_type fasta -dbtype nucl



Building a new DB, current time: 04/28/2025 21:33:30
New DB name:   /home/jupyter/NIGMS-Sandbox-Pangenomics-Module/module_notebooks/graphs/yprp.chrVIII.pggb.fa
New DB title:  graphs/yprp.chrVIII.pggb.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 19252 sequences in 0.19645 seconds.




3. Now, BLAST the genes against the database you just made. Use tabular format.

The parameters:

`-db`  database  
`-query` query (the genes, in our case)  
`-outfmt` output format (6=tab)

In [7]:
!blastn -db graphs/yprp.chrVIII.pggb.fa -outfmt 6 -query genes/genes.fa > genes/genesXyprp.chrVIII.pggb.fa.txt

The columns are:
+ *qseqid*      query or source (gene) sequence id
+ *sseqid*      subject or target sequence id (here, the ID of a graph node in the exported FASTA)
+ *pident*      percentage of identical positions
+ *length*      alignment length (sequence overlap)
+ *mismatch*    number of mismatches
+ *gapopen*     number of gap openings
+ *qstart*      start of alignment in query
+ *qend*        end of alignment in query
+ *sstart*      start of alignment in subject
+ *send*        end of alignment in subject
+ *evalue*      expect value (the number of equally good or better alignments expected by chance)
+ *bitscore*    bit score
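Because tabular output has no header line, it can help to attach the column names when loading the results. The sketch below parses `-outfmt 6` text into dictionaries using only the Python standard library. The function name `parse_outfmt6` is an assumption, and the single sample line stands in for the real output file.

```python
import csv
import io

# Default column order for blastn -outfmt 6.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_outfmt6(handle):
    """Parse tab-separated BLAST hits into a list of dicts with typed values."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or incomplete lines
        row = dict(zip(COLUMNS, fields))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        rows.append(row)
    return rows


# One sample hit in tabular format.
sample = "CUP1\t7899\t100.000\t186\t0\t0\t1\t186\t456\t271\t9.20e-97\t344\n"
hits = parse_outfmt6(io.StringIO(sample))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["length"])
```

To parse the real results, you could pass an open file handle for *genes/genesXyprp.chrVIII.pggb.fa.txt* instead of the `io.StringIO` sample.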

4. Take a look at the BLAST output.

In [8]:
!cat genes/genesXyprp.chrVIII.pggb.fa.txt

CUP1	7899	100.000	186	0	0	1	186	456	271	9.20e-97	344
CUP1	7715	100.000	186	0	0	1	186	1524	1339	9.20e-97	344
CUP1	7715	100.000	186	0	0	1	186	5523	5338	9.20e-97	344
CUP1	7715	100.000	186	0	0	1	186	7521	7336	9.20e-97	344
CUP1	7715	99.462	186	0	1	1	186	3521	3337	1.54e-94	337
CUP1	7773	99.457	184	1	0	1	184	1130	947	5.54e-94	335
CUP1	7851	100.000	164	0	0	1	164	164	1	1.56e-84	303
CUP1	7790	100.000	164	0	0	1	164	164	1	1.56e-84	303
CUP1	7732	100.000	164	0	0	1	164	164	1	1.56e-84	303
CUP1	7698	100.000	164	0	0	1	164	164	1	1.56e-84	303
CUP1	7638	100.000	164	0	0	1	164	164	1	1.56e-84	303
CUP1	7602	100.000	134	0	0	31	164	134	1	7.42e-68	248
CUP1	7605	100.000	29	0	0	1	29	29	1	1.74e-09	54.7
YHR054C	7715	100.000	1065	0	0	1	1065	3053	1989	0.0	1967
YHR054C	7715	100.000	1065	0	0	1	1065	7052	5988	0.0	1967
YHR054C	7715	100.000	1065	0	0	1	1065	9050	7986	0.0	1967
YHR054C	7715	99.626	1069	0	4	1	1065	5054	3986	0.0	1949
YHR054C	7715	100.000	1054	0	0	1	1054	1054	1	0.0	1947
YHR054C	7899	99.905	1053	1	0	13	1065	1973	9

The CUP1 lines with the header added are below for convenience.

| qseqid | sseqid | pident  | length | mismatch | gapopen | qstart | qend | sstart | send | evalue   | bitscore |
|--------|--------|---------|--------|----------|---------|--------|------|--------|------|----------|----------|
| CUP1   | 7899   | 100.000 | 186    | 0        | 0       | 1      | 186  | 456    | 271  | 9.20e-97 | 344      |
| CUP1   | 7715   | 100.000 | 186    | 0        | 0       | 1      | 186  | 1524   | 1339 | 9.20e-97 | 344      |
| CUP1   | 7715   | 100.000 | 186    | 0        | 0       | 1      | 186  | 5523   | 5338 | 9.20e-97 | 344      |
| CUP1   | 7715   | 100.000 | 186    | 0        | 0       | 1      | 186  | 7521   | 7336 | 9.20e-97 | 344      |
| CUP1   | 7715   | 99.462  | 186    | 0        | 1       | 1      | 186  | 3521   | 3337 | 1.54e-94 | 337      |
| CUP1   | 7773   | 99.457  | 184    | 1        | 0       | 1      | 184  | 1130   | 947  | 5.54e-94 | 335      |
| CUP1   | 7851   | 100.000 | 164    | 0        | 0       | 1      | 164  | 164    | 1    | 1.56e-84 | 303      |
| CUP1   | 7790   | 100.000 | 164    | 0        | 0       | 1      | 164  | 164    | 1    | 1.56e-84 | 303      |
| CUP1   | 7732   | 100.000 | 164    | 0        | 0       | 1      | 164  | 164    | 1    | 1.56e-84 | 303      |
| CUP1   | 7698   | 100.000 | 164    | 0        | 0       | 1      | 164  | 164    | 1    | 1.56e-84 | 303      |
| CUP1   | 7638   | 100.000 | 164    | 0        | 0       | 1      | 164  | 164    | 1    | 1.56e-84 | 303      |
| CUP1   | 7602   | 100.000 | 134    | 0        | 0       | 31     | 164  | 134    | 1    | 7.42e-68 | 248      |
| CUP1   | 7605   | 100.000 | 29     | 0        | 0       | 1      | 29   | 29     | 1    | 1.74e-09 | 54.7     |
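One way to start reading these hits is to count, per graph node, how many alignments cover the whole query, and to note the strand of each hit (in BLAST tabular output, `sstart` greater than `send` means the match is on the reverse strand of the subject). The sketch below does this for the first six CUP1 rows of the raw BLAST output in step 4. The function names are assumptions, and treating full-length hits as a proxy for gene copies is a simplification, since different nodes can represent alternative paths through the graph.

```python
from collections import Counter

# Query lengths derived from the gene coordinates (end - start + 1).
QUERY_LENGTHS = {
    "CUP1": 213228 - 213043 + 1,
    "YHR054C": 214757 - 213693 + 1,
}

# (qseqid, sseqid, qstart, qend, sstart, send), copied from the first six
# CUP1 rows of the raw BLAST output in step 4.
hits = [
    ("CUP1", "7899", 1, 186, 456, 271),
    ("CUP1", "7715", 1, 186, 1524, 1339),
    ("CUP1", "7715", 1, 186, 5523, 5338),
    ("CUP1", "7715", 1, 186, 7521, 7336),
    ("CUP1", "7715", 1, 186, 3521, 3337),
    ("CUP1", "7773", 1, 184, 1130, 947),
]


def full_length_counts(hits, query_lengths):
    """Count hits per (gene, node) whose alignment spans the entire query."""
    counts = Counter()
    for qseqid, sseqid, qstart, qend, _sstart, _send in hits:
        if qstart == 1 and qend == query_lengths[qseqid]:
            counts[(qseqid, sseqid)] += 1
    return counts


def subject_strand(sstart: int, send: int) -> str:
    """Return '-' when the subject coordinates run in reverse, else '+'."""
    return "-" if sstart > send else "+"


counts = full_length_counts(hits, QUERY_LENGTHS)
for (gene, node), n in sorted(counts.items()):
    print(gene, node, n)
print(subject_strand(456, 271))
```

In this subset, node 7715 carries four full-length CUP1 hits at different positions, which is consistent with several tandem copies sitting within one long node.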

Run the flashcard code below for more information on the BLAST output.

In [12]:
from IPython.display import IFrame
IFrame('../html/flashcard_blastout.html', width=800, height=400)

----------------------

## Conclusion

In this submodule, you learned how to BLAST against a pangenomic graph. Specifically, you searched for the CUP1 and YHR054C genes in the graph.

BLASTing gene sequences allows you to find out where genes of interest are in the FASTA file exported from the pangenomic graph. It also allows you to identify copy numbers of the genes.

In the next submodule you will learn how to visualize graphs, BLAST directly against the graph, and visualize the result.

----------------------

## Clean up

<div class="alert alert-warning">No cleanup is necessary for this submodule. Don't forget to shut down your Workbench when you are done working through this module!</div>