# Validation

To further validate the promising results of the PLM classifier on the testing data, we applied the model to a well-characterized microbial genome. We chose the complete genome of Escherichia coli (strain K-12, substrain MG1655, assembly ASM584v2), whose RefSeq accession is GCF_000005845.2.

For organizational purposes, all data used for model validation is stored in a subdirectory of the data directory. To run this code, set the `DATA_DIR` variable below to the location where the data should be stored on your machine.


In [1]:
%load_ext autoreload
%autoreload 2

In [7]:
import sys
# Add the selenobot/ subdirectory to the module search path, so that the modules in this directory are visible from the notebook.
sys.path.append('../selenobot/')

from dataset import Dataset
from classifiers import Classifier
from utils import csv_size, dataframe_from_gff
from extend import extend
import pandas as pd
import re
from typing import NoReturn
import numpy as np
import pickle

In [4]:
DATA_DIR = '/home/prichter/Documents/data/selenobot-test/validation/'

genome_id = 'GCF_000005845.2' # The accession of the genome to download. 
assembly = 'ASM584v2' # The specific genome assembly. 

## Downloading genome data

In [None]:
# Download and unzip the genome data from NCBI. 
! curl 'https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/{genome_id}/download?include_annotation_type=GENOME_FASTA,PROT_FASTA,GENOME_GFF' -o '{DATA_DIR}ncbi_dataset.zip'
! unzip '{DATA_DIR}ncbi_dataset.zip' -d '{DATA_DIR}'

# Create a directory to store the genome files. 
! mkdir -p '{DATA_DIR}{genome_id}/' 
# Move the relevant files into the new directory for organizational purposes. 
! mv '{DATA_DIR}ncbi_dataset/data/{genome_id}/genomic.gff' -t '{DATA_DIR}{genome_id}/'
! mv '{DATA_DIR}ncbi_dataset/data/{genome_id}/protein.faa' -t '{DATA_DIR}{genome_id}/'
! mv '{DATA_DIR}ncbi_dataset/data/{genome_id}/{genome_id}_{assembly}_genomic.fna' '{DATA_DIR}{genome_id}/genome.fna'

# Remove some extraneous files which were also downloaded. 
! rm '{DATA_DIR}README.md'
! rm -R '{DATA_DIR}ncbi_dataset'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2655k    0 2655k    0     0   313k      0 --:--:--  0:00:08 --:--:--  340k
Archive:  /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset.zip
  inflating: /home/prichter/Documents/data/selenobot-test/validation/README.md  
  inflating: /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset/data/assembly_data_report.jsonl  
  inflating: /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset/data/GCF_000005845.2/GCF_000005845.2_ASM584v2_genomic.fna  
  inflating: /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset/data/GCF_000005845.2/protein.faa  
  inflating: /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset/data/GCF_000005845.2/genomic.gff  
  inflating: /home/prichter/Documents/data/selenobot-test/validation/ncbi_dataset/data/dataset_catalog.json

In [8]:
def clean(path:str) -> None:
    '''Standardize the headers of a FASTA file downloaded from NCBI. The file is modified in place.'''
    with open(path, 'r') as f:
        lines = f.readlines()
    # In the FASTA files downloaded from NCBI, the only relevant information is right after the > character.
    # This is the protein accession, which will be called the 'id'.
    cleaned = []
    for line in lines:
        if line.startswith('>'): # This symbol marks the beginning of a header line.
            id_ = re.search(r'>(\S+)', line).group(1)
            cleaned.append(f'>id={id_}\n')
        else:
            cleaned.append(line)
    # This will overwrite the original file. 
    with open(path, 'w') as f:
        f.write(''.join(cleaned))

In [9]:
clean(f'{DATA_DIR}{genome_id}/protein.faa')
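
To make the effect of `clean` concrete, the snippet below applies the same header transformation to a single line. The header string is only an illustrative example of the NCBI protein FASTA format and is not read from the downloaded file.

```python
import re

# Illustrative header in the NCBI protein FASTA format (an assumption, not taken from the file).
header = '>NP_414542.1 thr operon leader peptide [Escherichia coli str. K-12 substr. MG1655]'

# Keep only the accession, i.e. everything between '>' and the first whitespace character.
id_ = re.search(r'>(\S+)', header).group(1)
new_header = f'>id={id_}'
print(new_header)
```

Sequence lines do not begin with `>` and are left unchanged.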

## Predicting selenoproteins

### Generating PLM embeddings

Using the same procedure that was used to embed the UniProt sequences for training the classifier, embeddings are generated for all predicted genes in the model genome. The code for embedding sequence data is provided in the `embed.py` file in the `scripts` directory. Generating the embeddings is computationally intensive, and had to be run on an external computer cluster. The steps of the embedding algorithm, which is implemented by the `PlmEmbedder` class, are given below.

1. **Amino acid sequences are read from a FASTA file.**
2. **All non-standard amino acids are replaced with an “X”.**
3. **Sequences are sorted in ascending order according to length.** Grouping sequences of similar length avoids the addition of unnecessary padding. 
4. **The sequences are processed and tokenized in batches.** Batching speeds up embedding generation while keeping each batch small enough that the GPUs do not crash. The maximum number of sequences in any given batch is 100, the maximum number of amino acids in a batch is 4000, and any sequence longer than 1000 amino acids is processed individually. 
5. **The PLM is used to generate embeddings.** These embeddings have shape `(length, latent dimension)`.
6. **Each embedding is sliced according to the length of the original sequence.** Part of the model output corresponds to the padding tokens, which should be excluded from the final embedding. 
7. **The embeddings are mean-pooled over sequence length.** This step gives every embedding vector the same fixed dimension (the latent dimension of the PLM), which is necessary for passing them to the Selenobot linear classifier. 

The output of the steps above is a dataset containing gene IDs, the original amino acid sequences (which may contain non-standard residues), and columns with the mean-pooled embeddings. 
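
The preprocessing, batching, slicing and pooling steps above can be sketched as follows. This is a minimal illustration and not the `PlmEmbedder` implementation. The function names, the choice of the 20 canonical residues as "standard", and the random array standing in for the PLM output are all assumptions made for the example.

```python
import re
import numpy as np

MAX_BATCH_SEQS = 100       # Maximum number of sequences per batch.
MAX_BATCH_RESIDUES = 4000  # Maximum number of amino acids per batch.
MAX_SEQ_LENGTH = 1000      # Longer sequences are processed on their own.

def preprocess(seqs):
    '''Replace non-standard amino acids with X, then sort by ascending length.'''
    # Assumption: "standard" means the 20 canonical residues.
    seqs = [re.sub(r'[^ACDEFGHIKLMNPQRSTVWY]', 'X', s.upper()) for s in seqs]
    return sorted(seqs, key=len)

def make_batches(seqs):
    '''Group length-sorted sequences into batches that respect the size limits.'''
    batches, batch, n_residues = [], [], 0
    for seq in seqs:
        too_big = len(batch) >= MAX_BATCH_SEQS or n_residues + len(seq) > MAX_BATCH_RESIDUES
        if batch and (too_big or len(seq) > MAX_SEQ_LENGTH):
            batches.append(batch)
            batch, n_residues = [], 0
        batch.append(seq)
        n_residues += len(seq)
        if len(seq) > MAX_SEQ_LENGTH:
            # Flush immediately so that a long sequence is embedded alone.
            batches.append(batch)
            batch, n_residues = [], 0
    if batch:
        batches.append(batch)
    return batches

def mean_pool(outputs, lengths):
    '''Slice off the padding positions and average over sequence length.

    :param outputs: Array of shape (batch size, padded length, latent dimension).
    :param lengths: The true length of each sequence in the batch.
    :return: Array of shape (batch size, latent dimension).
    '''
    return np.stack([out[:n].mean(axis=0) for out, n in zip(outputs, lengths)])

seqs = preprocess(['MKTUA', 'MK', 'MKB'])
batches = make_batches(seqs)

# Stand-in for the PLM output: a random array padded to the longest sequence in the batch.
rng = np.random.default_rng(0)
batch = batches[0]
outputs = rng.normal(size=(len(batch), max(len(s) for s in batch), 1024))
embeddings = mean_pool(outputs, [len(s) for s in batch])
```

A real tokenizer may also add special tokens to each sequence, in which case the slice in `mean_pool` would need to account for them.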

There are several pre-embedded genomes available in the Google Cloud repository, which can be downloaded using the code below. 

### Running the PLM classifier

In [None]:
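# NOTE: dataframe_from_fasta is not imported at the top of the notebook. It is assumed here to
# live in utils alongside dataframe_from_gff; adjust the import if it is defined elsewhere.
from utils import dataframe_from_fasta

# The cleaned protein FASTA file was moved into the genome subdirectory above.
df = dataframe_from_fasta(f'{DATA_DIR}{genome_id}/protein.faa')
dataset = Dataset(df, embedder=None)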

In [None]:
import torch

model = Classifier(latent_dim=1024, hidden_dim=512)
model.load_state_dict(torch.load(f'{DATA_DIR}plm_model_weights.pth'))

In [None]:
def load_genome(path:str) -> str:
    '''Load in the complete nucleotide sequence of the genome.
    
    :param path: A FASTA file from NCBI which contains a complete genome. 
    :return: A string of nucleotides. 
    '''
    with open(path, 'r') as f:
        lines = f.read().splitlines()[1:] # Skip the header line. 
        seq = ''.join(lines)
    return seq
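
`load_genome` skips only the first header line, so it assumes the FASTA file holds a single record. For assemblies with several contigs or plasmids, a per-record loader along the following lines may be more appropriate. This is a sketch only; `load_contigs` is a hypothetical helper and is not part of the Selenobot code.

```python
def load_contigs(path: str) -> dict:
    '''Hypothetical variant of load_genome which returns one sequence per FASTA record.

    :param path: A nucleotide FASTA file, possibly containing several records.
    :return: A dictionary mapping each record ID to its nucleotide sequence.
    '''
    contigs, name = {}, None
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                # The record ID is the text between '>' and the first whitespace character.
                name = line[1:].split()[0]
                contigs[name] = []
            elif name is not None:
                contigs[name].append(line)
    return {name: ''.join(parts) for name, parts in contigs.items()}
```

For a single-record file, the dictionary has one entry whose value matches the string returned by `load_genome`.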

In [None]:
import os

# Eventually, this will need to support a whole list of genomes.
# NOTE: load_coordinates, load_predictions, get_sequences, database_write and known_selenoproteins
# are not defined in this notebook; they are presumably provided by the align module imported below.
def database_build_query() -> None:
    '''Build the query database, which contains the sequences to be searched for homology matches.'''
    # Grab the coordinate information about the predicted selenoproteins only. Exclude known selenoproteins.
    database = load_coordinates(gene_ids=[id_ for id_ in load_predictions() if id_ not in known_selenoproteins])
    # Mark the sequences which will be extended past the first STOP codon. 
    # database['extend'] = [(gene_id not in known_selenoproteins) for gene_id in database.gene_id]
    database['extend'] = True 
    # load_genome takes a single path argument. The genome was saved as genome.fna in the genome subdirectory.
    database = get_sequences(database, load_genome(os.path.join(DATA_DIR, genome_id, 'genome.fna')))
    database_write(database, filename='query.fasta')


def database_build_control() -> None:
    '''Build the control database, which contains the non-extended versions of the same sequences.'''
    # Grab the coordinate information about the predicted selenoproteins only. Exclude known selenoproteins.
    database = load_coordinates(gene_ids=[id_ for id_ in load_predictions() if id_ not in known_selenoproteins])
    database['extend'] = False # Don't extend anything here. 
    database = get_sequences(database, load_genome(os.path.join(DATA_DIR, genome_id, 'genome.fna')))
    database_write(database, filename='control.fasta')
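
The query database differs from the control only in that its sequences are extended past the first STOP codon. As a rough illustration of what such an extension involves, the sketch below continues reading in frame until the next STOP codon. It is not the implementation in the `extend` module. The function name, the 0-based half-open coordinates, and the restriction to the forward strand are all assumptions made for the example.

```python
STOP_CODONS = {'TAA', 'TAG', 'TGA'}

def extend_past_stop(genome: str, start: int, stop: int) -> str:
    '''Hypothetical helper which extends a forward-strand gene to the next in-frame STOP codon.

    :param genome: The nucleotide sequence of the genome.
    :param start: 0-based index of the first nucleotide of the start codon.
    :param stop: 0-based index one past the last nucleotide of the first STOP codon.
    :return: The nucleotide sequence from start through the next in-frame STOP codon,
        or through the last complete codon if no further STOP codon is found.
    '''
    assert (stop - start) % 3 == 0, 'The gene length must be a multiple of three.'
    pos = stop
    while pos + 3 <= len(genome):
        codon = genome[pos:pos + 3]
        pos += 3
        if codon in STOP_CODONS:
            break
    return genome[start:pos]

# Toy example: ATG AAA TGA | CCC GGG TAA, where the first STOP codon (TGA) ends at index 9.
genome = 'ATGAAATGACCCGGGTAAGG'
extended = extend_past_stop(genome, 0, 9)
```

Genes on the reverse strand would need to be reverse-complemented before applying the same logic.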

In [None]:
! pip install -e /home/prichter/Documents/find-a-bug-api

Obtaining file:///home/prichter/Documents/find-a-bug-api
  Preparing metadata (setup.py) ... done
Installing collected packages: Find-A-Bug-API
  Attempting uninstall: Find-A-Bug-API
    Found existing installation: Find-A-Bug-API 0.0.0
    Uninstalling Find-A-Bug-API-0.0.0:
      Successfully uninstalled Find-A-Bug-API-0.0.0
  Running setup.py develop for Find-A-Bug-API
Successfully installed Find-A-Bug-API-0.0.0


In [None]:
from align import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import fabapi
import fabapi.genomes