# Set-up

This notebook describes the set-up procedure for the training, test, and validation datasets used to produce the trained Selenobot model. The three files created during set-up are `train.csv`, `test.csv`, and `val.csv`, which contain the training, testing, and validation data, respectively. Each file contains the following fields:
1. `id` A unique identifier for the amino acid sequence. 
2. `seq` An amino acid sequence. 
3. `label` Either 1 (indicating a truncated selenoprotein) or 0 (indicating a full-length non-selenoprotein).
4. `0 ... 1024` The mean-pooled PLM embedding vector. 
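
As a rough illustration of this layout, the sketch below builds a tiny stand-in table and separates the embedding columns from the metadata. The IDs, sequences, and 4-dimensional embeddings are invented for the example, and the identifier column is assumed to be called `id`, as in the code later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv: two sequences with a 4-dimensional embedding.
# The real embeddings are much wider; every value here is made up.
df = pd.DataFrame({
    'id': ['A0A000[1]', 'P00001'],
    'seq': ['MKT', 'MSTAV'],
    'label': [1, 0],
    '0': [0.1, 0.5], '1': [0.2, 0.6], '2': [0.3, 0.7], '3': [0.4, 0.8],
})

# Embedding columns are the ones whose names are integers.
embedding_cols = [c for c in df.columns if c.isdigit()]
X = df[embedding_cols].to_numpy(dtype=np.float32)  # Shape (n_sequences, latent_dim).
y = df['label'].to_numpy()                         # 1 = truncated selenoprotein, 0 = full-length.

print(X.shape, y)
```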
 
**NOTE:** The steps described in this notebook are not necessary for replicating Selenobot results. Pre-trained models, as well as completed training, testing, and validation datasets are available in a Google Cloud bucket, and instructions for downloading them can be found in the `testing.ipynb` and `training.ipynb` notebooks.

If you want to run this code, be sure to modify the `DATA_DIR` and `CDHIT` variables below to specify where the data will be stored on your machine. `DATA_DIR` is the absolute path specifying the location where the data will be stored, and `CDHIT` is the absolute path to the CD-HIT command. 

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
DATA_DIR = '/home/prichter/Documents/data/selenobot-test/'
CDHIT = '/home/prichter/cd-hit-v4.8.1-2019-0228/cd-hit'

In [6]:
import sys
# Add the selenobot/ subdirectory to the module search path, so that the utils module is visible from the notebook.
sys.path.append('../selenobot/')

from utils import dataframe_from_fasta, dataframe_to_fasta, dataframe_from_clstr, fasta_size, csv_size # Some functions for reading and writing FASTA files. 
import pandas as pd
from typing import NoReturn, Tuple
import numpy as np
from tqdm import tqdm
import subprocess
import re

## Downloading UniProt data

The sequence data used to construct the training, test, and validation sets was obtained from UniProt (release 2023_03, accessed 08/11/2023). The downloaded sequences were (1) all known selenoproteins in the entirety of UniProt, and (2) all SwissProt-reviewed full-length proteins. The selenoproteins can be obtained through the UniProt REST API using the URL below. The SwissProt sequences can be downloaded directly from the UniProt website as a compressed archive (`.tar.gz`), which can be extracted using `tar` and `gunzip`.

**NOTE:** Downloading the UniProt data took over an hour to run when I tried this, so don't worry if it's taking a while!

In [10]:
# Download SwissProt from UniProt (release 2023_03). 
# ! curl -s 'https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2023_03/knowledgebase/uniprot_sprot-only2023_03.tar.gz' -o '{DATA_DIR}sprot.fasta.tar.gz'
# Download known selenoproteins from UniProt. Filter for sequences added before the date we accessed the database (August 11, 2023)
! curl -s 'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28ft_non_std%3Aselenocysteine%29+AND+%28date_created%3A%5B*+TO+2023-08-11%5D%29%29' -o '{DATA_DIR}sec.fasta'
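
The query in the REST URL above is percent-encoded, which makes it hard to read. As a small aid, the sketch below decodes it with the standard library so the search terms are visible; it only parses the URL string and makes no network request.

```python
from urllib.parse import urlsplit, parse_qs

# The same URL passed to curl above, split over two lines for readability.
url = ('https://rest.uniprot.org/uniprotkb/stream?format=fasta&query='
       '%28%28ft_non_std%3Aselenocysteine%29+AND+%28date_created%3A%5B*+TO+2023-08-11%5D%29%29')

# parse_qs decodes both the %XX escapes and the '+' characters (spaces).
params = parse_qs(urlsplit(url).query)
print(params['format'][0])
print(params['query'][0])
```

The decoded query has two parts joined by `AND`: a filter on selenocysteine as a non-standard residue feature, and a filter on the entry creation date.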

In [5]:
# Unzip the SwissProt sequence file. 
! tar -xf '{DATA_DIR}sprot.fasta.tar.gz' -C '{DATA_DIR}'
! gunzip -d '{DATA_DIR}uniprot_sprot.fasta.gz'
! mv '{DATA_DIR}uniprot_sprot.fasta' '{DATA_DIR}sprot.fasta'

# Clean up some of the extra files which were extracted along with the SwissProt FASTA file.
! rm '{DATA_DIR}uniprot_sprot.dat.gz'
! rm '{DATA_DIR}uniprot_sprot_varsplic.fasta.gz'


In [44]:
def clean(path:str) -> None:
    '''Modify the format of downloaded FASTA files to standardize reading and writing FASTA files.'''
    fasta = ''
    with open(path, 'r') as f:
        lines = f.readlines()
    for line in lines:
        if line.startswith('>'): # This symbol marks the beginning of a header line.
            # Extract the identifier found between the first pair of pipe characters. 
            id_ = re.search(r'\|(\w+)\|', line).group(1)
            fasta += f'>id={id_}\n'
        else:
            fasta += line
    # This will overwrite the original file. 
    with open(path, 'w') as f:
        f.write(fasta)

In [45]:
# This reformats the headers in the FASTA files to standardize reading and writing
# from FASTA files across the Selenobot project. 
clean(f'{DATA_DIR}sprot.fasta')
clean(f'{DATA_DIR}sec.fasta')
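
To make the header rewrite concrete, here is a minimal sketch of the same regular expression applied to a single header line. The header below is invented for illustration and simply follows the pipe-delimited UniProt style; it is not taken from the downloaded files.

```python
import re

# An invented header in the UniProt style: >db|ACCESSION|ENTRY_NAME description...
header = '>sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens OX=9606 PE=1 SV=1'

# Capture the text between the first pair of pipe characters.
accession = re.search(r'\|(\w+)\|', header).group(1)
new_header = f'>id={accession}'

print(new_header)
```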

### Removing selenoproteins

The SwissProt file contains some known selenoproteins which passed the review process. However, if we leave them in the SwissProt file, they will not be truncated and labeled as selenoproteins in later steps. This might cause some issues during the training process, as we would potentially bias the classifier against flagging the truncated equivalents. To avoid these problems, we remove all selenoproteins from the `sprot.fasta` file using the function below. 

In [5]:
def remove_selenoproteins(path:str) -> None:
    '''Remove all selenoproteins which are present in the SwissProt download.'''
    df = dataframe_from_fasta(path) # Load the SwissProt data into a pandas DataFrame.
    selenoproteins = df['seq'].str.contains('U') # Determine the indices where the selenoproteins occur. 
    if np.sum(selenoproteins) > 0:
        df = df[~selenoproteins]
        print(f'{np.sum(selenoproteins)} selenoproteins successfully removed from SwissProt.')
        dataframe_to_fasta(df, path=path) # Overwrite the original file. 

In [7]:
remove_selenoproteins(f'{DATA_DIR}sprot.fasta')
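
The filter can be checked on a toy table before running it on the full SwissProt file. This sketch uses three invented sequences, two of which contain selenocysteine (`U`), and applies the same `str.contains` mask as the function above.

```python
import pandas as pd

# Invented sequences; 'b' and 'c' contain selenocysteine (U).
df = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'seq': ['MKTAYIAK', 'MSUGGLK', 'MLLUAAUK'],
})

is_selenoprotein = df['seq'].str.contains('U')  # Boolean mask, one entry per row.
kept = df[~is_selenoprotein]                    # Rows without any U residue.

print(int(is_selenoprotein.sum()), 'removed;', len(kept), 'kept')
```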

### Truncating selenoproteins

The goal of the Selenobot classifier is to distinguish a truncated selenoprotein from a full-length non-selenoprotein, so all selenoproteins present in the training, testing, and validation data should be truncated. Although some selenoproteins contain multiple selenocysteine residues, we chose to truncate at the first selenocysteine residue only. This choice was made because no selenoproteins are identified in GTDB, the database to which we will ultimately apply the trained model. There therefore cannot be any instances of selenoproteins truncated at the second, third, etc. selenocysteine residue.

The `truncate_selenoproteins` function defined below truncates all selenoproteins in the input file (`sec.fasta`). It also appends `[1]` to the gene IDs of truncated proteins, indicating truncation at the first selenocysteine.

In [8]:
def truncate_selenoproteins(in_path:str, out_path:str) -> None:
    '''Truncate the selenoproteins stored in the input file at the first selenocysteine residue. This function 
    assumes that all sequences contained in the file contain selenocysteine, labeled as U.'''
    # Load the selenoproteins into a pandas DataFrame. 
    df = dataframe_from_fasta(in_path)
    df_trunc = {'id':[], 'seq':[]}
    for row in df.itertuples():
        df_trunc['id'].append(row.id + '[1]') # Modify the row ID to mark truncation at the first selenocysteine. 
        df_trunc['seq'].append(row.seq[:row.seq.index('U')]) # Keep only the residues before the first U. 
    df = pd.DataFrame(df_trunc)
    dataframe_to_fasta(df, path=out_path)

In [11]:
truncate_selenoproteins(f'{DATA_DIR}sec.fasta', f'{DATA_DIR}sec_truncated.fasta')
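
Truncation at the first selenocysteine can be illustrated on a single string. The helper name `truncate_at_first_sec` and the example sequences are hypothetical and used only for this sketch.

```python
def truncate_at_first_sec(seq: str) -> str:
    '''Return the residues that precede the first selenocysteine (U).
    Raises ValueError if the sequence contains no U.'''
    return seq[:seq.index('U')]

# A sequence with two selenocysteines is cut at the first one only.
print(truncate_at_first_sec('MKTUAAGUL'))
print(truncate_at_first_sec('MSLLUK'))
```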

### Combining files

To make the UniProt data easier to work with in later steps (specifically, for using the CD-HIT clustering tool), we concatenate the data in the `sec_truncated.fasta` and `sprot.fasta` files into a single `uniprot.fasta` file. This operation can be accomplished in the terminal using the `cat` command. 

In [13]:
# Concatenate the FASTA files in the data directory. 
! cat '{DATA_DIR}sec_truncated.fasta' '{DATA_DIR}sprot.fasta' > '{DATA_DIR}uniprot.fasta'

## Downloading PLM embeddings

Embeddings were generated using a version of the Prot-T5 pre-trained protein language model named `Rostlab/prot_t5_xl_half_uniref50-enc`. This model is the encoder portion of the 3 billion-parameter model, trained using a masking approach. The model weights were obtained from HuggingFace via the `transformers` Python library. 

The code for embedding sequence data is provided in the `embed.py` file in the `scripts` directory. Generating the embeddings was computationally intensive, and had to be run on an external computer cluster. The steps of the embedding algorithm, which is implemented by the `PlmEmbedder` class, are given below.
1. **Amino acid sequences are read from a FASTA file.**
2. **All non-standard amino acids are replaced with an “X”.**
3. **Sequences are sorted in ascending order according to length.** Batching sequences of similar length avoids the addition of unnecessary padding. 
4. **The sequences are processed and tokenized in batches.** Processing sequences in batches enables one to generate the embeddings as quickly as possible, while also preventing the GPUs from crashing. The maximum number of sequences in any given batch is 100, the maximum number of amino acids in a batch is 4000, and any sequence longer than 1000 amino acids is processed individually. 
5. **The PLM is used to generate embeddings.** These embeddings have shape `(length, latent dimension)`.
6. **Each embedding is sliced according to the length of the original sequence.** This is necessary because part of the model output corresponds to the padding tokens, which should be excluded from the final embedding. 
7. **The embeddings are mean-pooled over sequence length.** This step gives every embedding vector the same fixed dimension (the latent dimension of the PLM), which is necessary for passing them to the Selenobot linear classifier. 

The output of the steps above is data containing gene IDs, the original amino acid sequences (which may contain non-standard residues), and columns with the mean-pooled embeddings. 
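
Steps 2 through 4 can be sketched in plain Python. This is not the code in `embed.py`: the function names `mask_nonstandard` and `make_batches` are hypothetical, the set of standard residues is assumed to be the 20 canonical amino acids, and the greedy batching rule is one possible reading of the limits stated above.

```python
import re
from typing import Iterable, Iterator, List

STANDARD = 'ACDEFGHIKLMNPQRSTVWY'  # Assumption: the 20 canonical amino acids.

def mask_nonstandard(seq: str) -> str:
    '''Replace every residue outside the standard alphabet with X.'''
    return re.sub(f'[^{STANDARD}]', 'X', seq)

def make_batches(seqs: Iterable[str], max_seqs: int = 100, max_aas: int = 4000,
                 max_len: int = 1000) -> Iterator[List[str]]:
    '''Greedily group length-sorted sequences into batches that respect the limits.'''
    batch, n_aas = [], 0
    for seq in sorted(seqs, key=len):  # Ascending length keeps padding small.
        if len(seq) > max_len:
            if batch:  # Flush whatever has accumulated so far.
                yield batch
                batch, n_aas = [], 0
            yield [seq]  # Long sequences are processed on their own.
            continue
        if batch and (len(batch) >= max_seqs or n_aas + len(seq) > max_aas):
            yield batch
            batch, n_aas = [], 0
        batch.append(seq)
        n_aas += len(seq)
    if batch:
        yield batch

cleaned = [mask_nonstandard(s) for s in ['MKT' * 400, 'MUK', 'ACDE']]
print([len(b) for b in make_batches(cleaned)])
```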

This process was applied to every sequence in `uniprot.fasta`, producing the `embeddings.csv` file. This file is available for download from a [Google Cloud Bucket](https://storage.googleapis.com/selenobot-data/embeddings.csv). 
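
The slicing and mean-pooling steps of the embedding procedure can be illustrated with NumPy alone. The sketch below uses random numbers in place of real model output, with an invented batch of two sequences and a latent dimension of 4; it shows only the array operations, not the PLM itself.

```python
import numpy as np

rng = np.random.default_rng(0)
lengths = [3, 5]   # Original sequence lengths in the batch.
latent_dim = 4     # Stand-in for the latent dimension of the PLM.

# Stand-in for padded model output with shape (batch, padded length, latent dimension).
padded = rng.normal(size=(len(lengths), max(lengths), latent_dim))

# Slice each embedding to its true length, then average over the length axis.
pooled = np.stack([padded[i, :n].mean(axis=0) for i, n in enumerate(lengths)])

print(pooled.shape)  # One fixed-size vector per sequence.
```

Averaging over the full padded length instead would mix padding positions into the shorter sequence's vector, which is why the slice comes first.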

In [37]:
# Download the PLM embeddings from the Google Cloud Bucket
! curl 'https://storage.googleapis.com/selenobot-data/embeddings.csv' -o '{DATA_DIR}embeddings.csv'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7122M  100 7122M    0     0  27.3M      0  0:04:20  0:04:20 --:--:-- 25.7M


## Splitting the data

### Clustering with CD-HIT

When using sequence data to train machine learning models, it is important to control for homology when constructing the training, testing, and validation sets. A failure to do so may result in data leakage, as the model will learn evolutionary relationships instead of more relevant sequence characteristics (e.g. truncation, in the case of Selenobot). To control for sequence homology, we first processed the data in `uniprot.fasta` using CD-HIT, a widely used program for clustering biological sequences. The clustering parameters used are as follows.

1. **c=0.8** This sets the sequence similarity threshold. In other words, sequences which are given a similarity score of 0.8 or higher are grouped into the same cluster. 
2. **l=5** This parameter specifies the minimum sequence length. Setting this to 5 means that no sequences fewer than 5 amino acids in length are clustered. This is the minimum length allowed by CD-HIT. Some of the truncated selenoproteins did not meet this length requirement, and were discarded. 
3. **n=5** This is the “word length” used by the CD-HIT algorithm to determine sequence similarity. This parameter was kept as the default, which is the recommended word length when using a sequence similarity threshold of 0.8. 

The CD-HIT output is a `uniprot.clstr` file, which maps gene IDs to the ID of the cluster to which they belong. 278931 clusters were generated by the program when it was run for our investigation.
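
For orientation, the sketch below parses a small `.clstr`-style snippet into an ID-to-cluster table. The snippet is invented, the layout (a `>Cluster N` line followed by one member line per sequence) reflects our understanding of the CD-HIT output format, and `parse_clstr` is a hypothetical helper; the `dataframe_from_clstr` function in `utils` may work differently.

```python
import re
import pandas as pd

# Invented example in the .clstr layout, using the '>id=...' headers written by clean().
clstr_text = '\n'.join([
    '>Cluster 0',
    '0\t412aa, >id=P11111... *',
    '1\t398aa, >id=Q22222... at 91.20%',
    '>Cluster 1',
    '0\t57aa, >id=A0A333[1]... *',
])

def parse_clstr(text: str) -> pd.DataFrame:
    '''Map each gene ID to the number of the cluster it belongs to.'''
    rows, cluster = [], None
    for line in text.splitlines():
        if line.startswith('>Cluster'):
            cluster = int(line.split()[1])  # Start of a new cluster.
        elif line.strip():
            gene_id = re.search(r'>id=(\S+?)\.\.\.', line).group(1)
            rows.append({'id': gene_id, 'cluster': cluster})
    return pd.DataFrame(rows)

clstr_example = parse_clstr(clstr_text)
print(clstr_example)
```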


In [14]:
! {CDHIT} -i '{DATA_DIR}uniprot.fasta' -o '{DATA_DIR}uniprot' -n 5 -c 0.8 -l 5

Program: CD-HIT, V4.8.1 (+OpenMP), Aug 24 2023, 13:02:16
Command: /home/prichter/cd-hit-v4.8.1-2019-0228/cd-hit -i
         /home/prichter/Documents/data/selenobot-test/uniprot.fasta
         -o
         /home/prichter/Documents/data/selenobot-test/uniprot
         -n 5 -c 0.8 -l 5

Started: Sat Feb  3 18:41:44 2024
                            Output                              
----------------------------------------------------------------
total seq: 588641
longest and shortest : 35213 and 6
Total letters: 215202198
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 288M
Buffer          : 1 X 25M = 25M
Table           : 1 X 74M = 74M
Miscellaneous   : 7M
Total           : 395M

Table limit with the given memory limit:
Max number of representatives: 674635
Max number of word counting entries: 50531311

comparing sequences from          0  to      41986
..........    10000  finished       6421  clusters
..........    20000  finished      12133  clu

### Partitioning by cluster

We chose the size of the training dataset to be 80 percent of `uniprot.fasta`. Roughly 60 percent of remaining sequences were then sorted into the testing dataset, and the leftover sequences were reserved for the validation dataset. To ensure that no sequences which belong to the same CD-HIT cluster are present in separate datasets, we define a custom `sample` function which uses the information in `uniprot.clstr` to ensure that the entirety of any homology group is contained within the sample. 

In [18]:
train_size = int(0.8 * fasta_size(f'{DATA_DIR}uniprot.fasta'))
test_size = int(0.6 * (fasta_size(f'{DATA_DIR}uniprot.fasta') - train_size))
val_size = fasta_size(f'{DATA_DIR}uniprot.fasta') - (train_size + test_size)

print('Approximate size of training dataset:', train_size)
print('Approximate size of testing dataset:', test_size)
print('Approximate size of validation dataset:', val_size)

Approximate size of training dataset: 470970
Approximate size of testing dataset: 70645
Approximate size of validation dataset: 47098


In [19]:
def sample(df:pd.DataFrame, n:int=None) -> Tuple[pd.DataFrame, pd.DataFrame]:
    '''Sample from the cluster data such that the entirety of any homology group is contained in the sample. This function assumes that
    the size of the sample is smaller than half of the size of the input DataFrame. 
  
    :param df: A pandas DataFrame mapping the gene ID to cluster number. 
    :param n: The size of the sample. 
    :return: A tuple of DataFrames, the first being the sample, and the second being the input data with the sample removed. 
    '''
    assert (len(df) - n) >= n, f'The sample size must be less than half of the input data.'

    groups = {'sample':[], 'remainder':[]}
    curr_size = 0 # Keep track of the sample size. 
    ordered_clusters = df.groupby('cluster').size().sort_values(ascending=False).index # Sort the clusters in descending order of size. 

    add_to = 'sample'
    for cluster in tqdm(ordered_clusters, desc='sample'): # Iterate over the cluster IDs. 
        cluster = df[df.cluster == cluster] # Grab an entire cluster from the DataFrame. 
        if add_to == 'sample' and curr_size < n: # Only add to the sample partition while the size requirement is not yet met. 
            groups['sample'].append(cluster)
            curr_size += len(cluster)
            add_to = 'remainder'
        else:
            groups['remainder'].append(cluster)
            add_to = 'sample'

    sample, remainder = pd.concat(groups['sample']), pd.concat(groups['remainder'])
    # Some final checks to make sure the sampling function behaved as expected. 
    assert len(sample) + len(remainder) == len(df), f'The combined sizes of the partitions do not add up to the size of the original data.'
    assert len(sample) < len(remainder), f'The sample DataFrame should be smaller than the remainder DataFrame.'
    return sample, remainder

In [25]:
uniprot_df = dataframe_from_fasta(f'{DATA_DIR}uniprot.fasta') # Read in the UniProt data.
clstr_df = dataframe_from_clstr(f'{DATA_DIR}uniprot.clstr') # Read in the cluster data generated by CD-HIT

uniprot_df = uniprot_df.merge(clstr_df, on='id') # Add the cluster labels to the UniProt data. 
print(fasta_size(f'{DATA_DIR}uniprot.fasta') - len(uniprot_df), 'sequences were not assigned cluster groups and were dropped.')

72 sequences were not assigned cluster groups and were dropped.


In [26]:
uniprot_df, train_df = sample(uniprot_df, n=len(uniprot_df) - train_size)
val_df, test_df = sample(uniprot_df, n=val_size)

print('Size of training dataset:', len(train_df))
print('Size of testing dataset:', len(test_df))
print('Size of validation dataset:', len(val_df))

sample: 100%|██████████| 279559/279559 [02:49<00:00, 1649.35it/s]
sample: 100%|██████████| 10219/10219 [00:03<00:00, 2686.04it/s]


Size of training dataset: 470969
Size of testing dataset: 70571
Size of validation dataset: 47101
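
A quick way to confirm that the partitioning kept homology groups intact is to check that no cluster ID appears in more than one dataset. The helper `shared_clusters` below is a hypothetical sketch demonstrated on toy tables; the same call could be made with `train_df`, `test_df`, and `val_df`.

```python
import pandas as pd

def shared_clusters(*dfs: pd.DataFrame) -> set:
    '''Return the cluster IDs that occur in more than one of the given DataFrames.'''
    seen, shared = set(), set()
    for df in dfs:
        clusters = set(df['cluster'])
        shared |= seen & clusters  # Clusters already seen in an earlier DataFrame.
        seen |= clusters
    return shared

# Toy partitions: the first set is clean, the second leaks cluster 1 into the test data.
train = pd.DataFrame({'id': ['a', 'b', 'c'], 'cluster': [0, 0, 1]})
test = pd.DataFrame({'id': ['d'], 'cluster': [2]})
val = pd.DataFrame({'id': ['e', 'f'], 'cluster': [3, 3]})
leaky_test = pd.DataFrame({'id': ['d'], 'cluster': [1]})

print(shared_clusters(train, test, val))
print(shared_clusters(train, leaky_test, val))
```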


Each dataset is written to a CSV file (`train.csv`, `test.csv`, and `val.csv`) in the data directory. The datasets now contain the following fields:
1. **id** The gene ID from UniProt. The truncated selenoproteins have a “[1]” appended to their ID, which was added when the `truncate_selenoproteins` function was called.
2. **seq** The amino acid sequence.
3. **cluster** The ID of the cluster to which the sequence belongs. This field is not used after the partitioning step, but is kept in the data for the sake of completeness. 

In [31]:
train_df.set_index('id').to_csv(f'{DATA_DIR}train.csv')
test_df.set_index('id').to_csv(f'{DATA_DIR}test.csv')
val_df.set_index('id').to_csv(f'{DATA_DIR}val.csv')

## Adding labels and embeddings

The final step required to set up the training, testing, and validation datasets is to add labels and PLM embeddings to the files. For labels, we simply add a `label` field, which contains either a `1` (indicating a truncated selenoprotein) or `0` (indicating a full-length non-selenoprotein). These labels are added according to whether or not a “[1]” is present in the ID for a particular sequence. 

In [7]:
for path in [f'{DATA_DIR}train.csv', f'{DATA_DIR}test.csv', f'{DATA_DIR}val.csv']:
    df = pd.read_csv(path)
    # Add a label column, marking those sequences with a [1] in the gene ID with a 1.
    # Make sure to turn off regex for the str.contains function, or it will try to match [1] as a pattern.
    df['label'] = df['id'].str.contains('[1]', regex=False).astype(int)
    df.set_index('id').to_csv(path)
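
The effect of `regex=False` is easy to see on a few invented IDs. With regular expressions enabled, `[1]` is read as a character class that matches any ID containing the digit 1, so full-length proteins could be mislabeled.

```python
import pandas as pd

# Invented IDs: only the first one carries the '[1]' truncation suffix.
ids = pd.Series(['A0A1[1]', 'P12341', 'Q9XYZ0'])

literal = ids.str.contains('[1]', regex=False)  # Match the literal text '[1]'.
as_regex = ids.str.contains('[1]', regex=True)  # Match any '1' character.

print(literal.astype(int).tolist())
print(as_regex.astype(int).tolist())
```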

Adding the embeddings to the datasets is slightly more involved, as the entire `embeddings.csv` file is too large to load into memory. So, we processed the embeddings in chunks using the `add_embeddings` function below. First, we load all gene IDs present in the `embeddings.csv` file. Then, we read the dataset (either `train.csv`, `test.csv`, or `val.csv`) in chunks of size 1000. We use the vector of gene IDs from the file to load in only those rows in `embeddings.csv` whose gene IDs are contained in the dataset chunk. We then add these embeddings to the dataset chunk, and write the chunk (with the embeddings added) to a temporary CSV file. This process is repeated, and dataset chunks are appended to the temporary CSV file until all embeddings have been added to the dataset.

In [8]:
def add_embeddings(path:str, chunk_size:int=1000) -> None:
    '''Add embedding information to a dataset, and overwrite the original dataset with the
    modified dataset (with PLM embeddings added).
    
    :param path: The path to the dataset.
    :param chunk_size: The size of the chunks to split the dataset into for processing.
    '''
    embedding_ids = pd.read_csv(f'{DATA_DIR}embeddings.csv', usecols=['id'])['id'].values.ravel() # Read the IDs in the embedding file to avoid loading the entire thing into memory.
    reader = pd.read_csv(path, index_col=['id'], chunksize=chunk_size) # Use read_csv to load the dataset one chunk at a time. 
    tmp_file_path = f'{DATA_DIR}tmp.csv' # The path to the temporary file to which the modified dataset will be written in chunks.

    is_first_chunk = True
    n_chunks = csv_size(path) // chunk_size + 1
    for chunk in tqdm(reader, desc='add_embeddings', total=n_chunks):
        # Get the indices of the embedding rows corresponding to the data chunk. Make sure to shift the index up by one to account for the header. 
        idxs = np.where(np.isin(embedding_ids, chunk.index, assume_unique=True))[0] + 1 
        idxs = [0] + list(idxs) # Add the header index so the column names are included. 
        # Read in the embedding rows, skipping rows which do not match a gene ID in the chunk. 
        chunk = chunk.merge(pd.read_csv(f'{DATA_DIR}embeddings.csv', skiprows=lambda i : i not in idxs), on='id', how='inner')
        # Check to make sure the merge worked as expected. Subtract 1 from len(idxs) to account for the header row.
        assert len(chunk) == (len(idxs) - 1), f'Data was lost while merging embedding data.'
        
        chunk.to_csv(tmp_file_path, header=is_first_chunk, mode='w' if is_first_chunk else 'a') # Only write the header for the first file. 
        is_first_chunk = False
    # Replace the old dataset with the temporary file. 
    subprocess.run(f'rm {path}', shell=True, check=True)
    subprocess.run(f'mv {tmp_file_path} {path}', shell=True, check=True)

In [9]:
# Add the embedding data to each dataset.
add_embeddings(f'{DATA_DIR}train.csv') 
add_embeddings(f'{DATA_DIR}test.csv') 
add_embeddings(f'{DATA_DIR}val.csv') 

setup.data.detect.add_embeddings_to_file: 100%|██████████| 471/471 [16:03:44<00:00, 122.77s/it]      
setup.data.detect.add_embeddings_to_file: 100%|██████████| 71/71 [50:34<00:00, 42.74s/it]
setup.data.detect.add_embeddings_to_file: 100%|██████████| 48/48 [34:03<00:00, 42.57s/it]
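
The row-selection trick at the heart of `add_embeddings` can be tried on an in-memory CSV. The sketch below uses an invented four-row embeddings table and a two-row dataset chunk, and stores the wanted row numbers in a set rather than a list (a small variation on the function above) so that each membership test made by `skiprows` is constant-time.

```python
import io
import numpy as np
import pandas as pd

# Invented stand-ins for embeddings.csv and one chunk of a dataset.
embeddings_csv = 'id,0,1\na,0.1,0.2\nb,0.3,0.4\nc,0.5,0.6\nd,0.7,0.8\n'
chunk = pd.DataFrame({'seq': ['MKT', 'MLL']}, index=pd.Index(['b', 'd'], name='id'))

# Read only the ID column, then find the file rows whose ID is in the chunk.
embedding_ids = pd.read_csv(io.StringIO(embeddings_csv), usecols=['id'])['id'].values
matches = np.flatnonzero(np.isin(embedding_ids, chunk.index))
keep = {0} | {int(i) + 1 for i in matches}  # Shift by one for the header, which is row 0.

# Skip every row that is not in the keep set, then merge on the gene ID.
rows = pd.read_csv(io.StringIO(embeddings_csv), skiprows=lambda i: i not in keep)
merged = chunk.reset_index().merge(rows, on='id', how='inner')

print(merged)
```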


In [1]:
# Print some information about the selenoprotein content of each dataset.
# NOTE: This cell relies on pd, np, and DATA_DIR. If the kernel has been restarted, re-run the
# import and DATA_DIR cells at the top of the notebook first, or a NameError will be raised.
train_labels = pd.read_csv(f'{DATA_DIR}train.csv', usecols=['label']).label 
test_labels = pd.read_csv(f'{DATA_DIR}test.csv', usecols=['label']).label 
val_labels = pd.read_csv(f'{DATA_DIR}val.csv', usecols=['label']).label 

print('Selenoprotein content of the training dataset:', np.sum(train_labels) / len(train_labels))
print('Selenoprotein content of the testing dataset:', np.sum(test_labels) / len(test_labels))
print('Selenoprotein content of the validation dataset:', np.sum(val_labels) / len(val_labels))