In [4]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from selenobot.files import NcbiXmlFile
import os

DATA_DIR = '/home/prichter/Documents/selenobot/data/'

# Dataset construction

## Binary classification

For binary classification, the goal is to train a model to distinguish between full-length non-selenoproteins and truncated selenoproteins. For this task, we have two datasets:
1. All of the non-fragmented proteins in SwissProt, with the selenoproteins removed. 
2. All selenoproteins registered in UniProt, truncated at the first selenocysteine residue. 
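
A minimal sketch of the truncation step in (2), assuming selenocysteine is encoded as `U` in the sequence string and that the truncated sequence stops just before that residue (as it would if the UGA codon were read as a stop). The function name and the example sequence are hypothetical, for illustration only.

```python
def truncate_at_first_sec(seq: str, sec_symbol: str = 'U') -> str:
    """Return the part of seq that precedes the first selenocysteine.

    Sequences without a selenocysteine are returned unchanged.
    """
    idx = seq.find(sec_symbol)
    return seq if idx == -1 else seq[:idx]


# Made-up sequence with a single selenocysteine at position 10 (0-based).
example = 'MKTAYIAKQRUQISFVKSHFSRQ'
truncated = truncate_at_first_sec(example)
```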

**Problem:** When run on GTDB, the binary model seems to flag too many sequences as selenoproteins. There are several possible explanations: (1) there really are many truncated selenoproteins, which seems unlikely; (2) full-length proteins that resemble truncated selenoproteins are being flagged; or (3) truncated proteins that are not selenoproteins are being flagged. 

## Ternary classification

For ternary classification, the goal is to train a model to distinguish between full-length non-selenoproteins, truncated selenoproteins, and truncated non-selenoproteins. The reasoning is to determine whether the suspected "false positives" that the model classifies as truncated selenoproteins might simply be truncated regular proteins. However, I really have no sense of whether these hits are being identified because they specifically look like truncated selenoproteins (i.e., similar in length, composition, or something else), or just because they are truncated. 

There are a couple options for generating the set of truncated non-selenoproteins.
1. Sample a subset of SwissProt the same size as the selenoprotein dataset. Ensure that the sampled proteins follow roughly the same length distribution as the full-length selenoproteins, and then truncate these sequences so that their lengths roughly match the lengths of the truncated selenoproteins. This controls for length, ensuring that if the model is able to distinguish the truncated non-selenoproteins from the truncated selenoproteins, it is relying on signals other than length. *My logic for selecting a subset the same size as the selenoprotein dataset was that I wanted the model to have equal exposure to both types... however, because I take measures to handle class imbalance, this shouldn't matter. I'll probably just take a subset of around 20,000 for the sake of picking a number.*
2. Subset the SwissProt dataset randomly, ignoring the length of the sampled proteins (I'll check to make sure that the length distribution matches that of the entire dataset). Then, randomly truncate the proteins (possibly even truncate the same protein at multiple different locations). This will mimic arbitrary truncation events. 
3. Same options as above, but instead of sub-sampling SwissProt, I grab a bunch of new bacterial sequences from UniProt. I don't like this idea as much, as it means I am using sequences that are lower-quality. However, it would mean that I could increase the size of the dataset. Because I don't want to accidentally train the model to somehow pick up on little features of low-quality sequences, I think I will go with (1) or (2). 

Approaches (1) and (2) address somewhat different questions. (1) asks whether the model can distinguish truncated selenoproteins from truncated non-selenoproteins using signals that are not based on length or composition, i.e., is there something about truncated selenoprotein embeddings that looks different from the embeddings of normal proteins? (2) asks whether the model can simply distinguish generic truncation events from truncated selenoproteins. 
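
To make the difference between the two truncation schemes concrete, here is a rough sketch of each. The function names, the `min_frac`/`max_frac` bounds, and the toy inputs are all assumptions for illustration. The sketch for approach (1) only covers matching the truncated lengths; it does not cover sampling proteins to match the full-length selenoprotein length distribution.

```python
import numpy as np


def truncate_random(seq, rng, min_frac=0.1, max_frac=0.9):
    """Approach (2): keep a uniformly random fraction of the sequence."""
    frac = rng.uniform(min_frac, max_frac)
    cut = max(1, int(len(seq) * frac))  # always keep at least one residue
    return seq[:cut]


def truncate_length_matched(seqs, target_lengths, rng):
    """Approach (1): truncate each sequence to a length drawn from the
    observed lengths of the truncated selenoproteins.

    Only target lengths shorter than the sequence are eligible; sequences
    with no eligible target length are skipped.
    """
    target_lengths = np.asarray(target_lengths)
    truncated = []
    for seq in seqs:
        candidates = target_lengths[target_lengths < len(seq)]
        if len(candidates) == 0:
            continue
        truncated.append(seq[:int(rng.choice(candidates))])
    return truncated


rng = np.random.default_rng(0)
toy_seqs = ['A' * 100, 'C' * 50, 'D' * 5]   # stand-ins for SwissProt sequences
toy_targets = [10, 20, 30]                  # stand-ins for truncated selenoprotein lengths

random_cut = truncate_random(toy_seqs[0], rng)
matched = truncate_length_matched(toy_seqs, toy_targets, rng)
```

Calling `truncate_random` several times on the same sequence would give the "same protein truncated at multiple locations" variant mentioned in (2).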

Reasons to use approach (2)...
- Ultimately, I don't really care if length or composition are factors in the model's decision-making. If it works, it works!
- I will test the ability of the model to make predictions based on length and composition as controls anyway, which will give a sense of just how important these factors are. 
- Because I am generating the third dataset by downsampling UniProt, I don't want to be picky about the sequences I am selecting. I am worried this would skew the distribution somehow in the full-length dataset, in a way which might be difficult to test.
- I have a hunch that it is more likely that the false positives are generic truncations, not just truncations that look *very similar* to truncated selenoproteins (unless Prodigal is WAY worse than we think). So, it makes more sense to use generic truncations. 
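
The length and composition controls mentioned above need a simple feature representation. One possible version, sketched here under the assumption that composition means the fraction of each of the 20 standard amino acids (non-standard residues are ignored in the fractions), is shown below; the function name is hypothetical.

```python
import numpy as np

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # the 20 standard residues


def length_and_composition(seq):
    """Return a 21-element vector: [length, fraction of each standard residue]."""
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    total = counts.sum()
    composition = counts / total if total > 0 else counts
    return np.concatenate(([len(seq)], composition))


features = length_and_composition('AAC')
```

A classifier trained only on these features would indicate how much of the signal can be explained by length and composition alone.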

However, I am worried that generating the training set in this way will make life too easy for the classifier. In other words, it might produce large length- and composition-based signals that let the classifier distinguish between truncated selenoproteins and truncated non-selenoproteins, but that are not the cause of the suspected false positives. The artificially-generated truncated sequences might not look like "real" truncated sequences, so the results wouldn't be interesting. **I think the only way to pick one is to take a closer look at what the GTDB predictions look like compared to the truncated selenoproteins in the training set.**


In [5]:
# API URL for retrieving bacterial proteins which are unreviewed (TrEMBL) and have evidence at the transcript level. 
# See this link for more info on protein existence: https://www.uniprot.org/help/protein_existence
# https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=xml&query=%28%28taxonomy_id%3A2%29%29+AND+%28reviewed%3Afalse%29+AND+%28existence%3A2%29

xml_file = NcbiXmlFile(os.path.join(DATA_DIR, 'uniprot_trembl.xml'))
xml_file.to_df().to_csv(os.path.join(DATA_DIR, 'uniprot_trembl.csv'))

NcbiXmlFile.__init__: Parsing NCBI XML file, row 98542...: : 39552298it [04:20, 151663.30it/s]
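
For reference, the query URL in the comment above can be rebuilt from its readable form with the standard library. This sketch only constructs the URL string and does not download anything; the variable names are my own.

```python
from urllib.parse import urlencode

BASE_URL = 'https://rest.uniprot.org/uniprotkb/stream'

# taxonomy_id:2 is Bacteria, reviewed:false selects TrEMBL entries,
# and existence:2 selects evidence at the transcript level.
query = '((taxonomy_id:2)) AND (reviewed:false) AND (existence:2)'

params = {'compressed': 'true', 'format': 'xml', 'query': query}
url = BASE_URL + '?' + urlencode(params)
```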


### Additional considerations

**How should I select full-length proteins to truncate?** 

I am thinking it would be beneficial not to remove sequences from the full-length SwissProt dataset, and instead to use UniProt TrEMBL (non-reviewed) sequences which have pretty good evidence for existence (transcript-level). My worry is that pretty much every sequence will have a homolog in the SwissProt database. However, this is probably also true for the truncated selenoproteins. Because I am truncating the TrEMBL sequences, I don't think I need to worry about it too much. 

Actually, I wonder if I should leave the full-length selenoproteins in the set of full-length sequences. The idea is that, in the case of Prodigal at least, there won't be any full-length selenoproteins. However, leaving them in probably wouldn't hurt, especially because there are cys-homologs anyway!

**How should I go about truncating the full-length proteins?**

First, I should run the binary classifier on the GTDB subset, and look at the length distributions of predicted truncated selenoproteins relative to predicted full-length proteins, and compare this to the length distributions in the training set.  
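
A rough sketch of how two length distributions could be compared, using normalized histograms on shared bins and the total variation distance as a single summary number. The choice of distance, the bin count, and the function names are assumptions; the inputs below are toy values, not real lengths.

```python
import numpy as np


def length_histograms(lengths_a, lengths_b, n_bins=20):
    """Bin two sets of sequence lengths on a shared grid and normalize each to sum to 1."""
    lengths_a = np.asarray(lengths_a)
    lengths_b = np.asarray(lengths_b)
    upper = max(lengths_a.max(), lengths_b.max())
    bins = np.linspace(0, upper, n_bins + 1)
    hist_a, _ = np.histogram(lengths_a, bins=bins)
    hist_b, _ = np.histogram(lengths_b, bins=bins)
    return bins, hist_a / hist_a.sum(), hist_b / hist_b.sum()


def total_variation(p, q):
    """Total variation distance: 0 for identical histograms, 1 for disjoint ones."""
    return 0.5 * np.abs(p - q).sum()


# Toy lengths standing in for, e.g., predicted truncated selenoproteins
# versus truncated selenoproteins in the training set.
_, p_same, q_same = length_histograms([10, 20, 30], [10, 20, 30])
_, p_far, q_far = length_histograms([10] * 5, [200] * 5)
tv_same = total_variation(p_same, q_same)
tv_far = total_variation(p_far, q_far)
```

The normalized histograms can also be passed to `plt.bar` or `plt.step` for a visual comparison alongside the summary number.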

