Hey @philippbayer, I don't know how much you want me chipping in with these suggestions? Feel free to ignore this, or tell me to mind my own business :). At least on a github issue, I can post a longer response than on twitter.
I suggest that a quick test would be comparing embedding distances between 3mers and codons. My hypothesis is that the embedding distances between 3mers encoding the same amino acid differ from those between 3mers encoding different amino acids. This should be easy to falsify if it is not the case. You could try this with in-frame 3mers only versus 3mers from all reading frames. My suggested linear models to test this would use the following variables:
* D_ij - the cosine distance between 3mer i and 3mer j.
* encoded_amino_acid - which amino acid the 3mer encodes.
* reading_frame - whether the 3mer is in the reading frame. This would require generating the embeddings so that in-frame and out-of-frame 3mers are treated differently, e.g. by appending a marker character to distinguish them.
* domain - whether the embeddings differ between bacteria and archaea. This would be a test for a phylogenetic signal.
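As a rough starting point, the same-amino-acid comparison could be sketched like this. It assumes the embeddings are available as a dict mapping each 3mer to a NumPy vector; the random vectors below are only placeholders, so no real difference between the groups is expected. The function names are made up for illustration, and the three stop codons are treated as one group.

```python
import itertools

import numpy as np

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def cosine_distance(u, v):
    """D_ij: one minus the cosine similarity of two embedding vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def pairwise_distances(embeddings):
    """Return (3mer_i, 3mer_j, D_ij, same_amino_acid) for every unordered pair."""
    rows = []
    for a, b in itertools.combinations(sorted(embeddings), 2):
        d = cosine_distance(embeddings[a], embeddings[b])
        rows.append((a, b, d, CODON_TABLE[a] == CODON_TABLE[b]))
    return rows


def mean_distance_by_group(rows):
    """Mean D_ij for synonymous pairs and for non-synonymous pairs."""
    same = [d for _, _, d, s in rows if s]
    diff = [d for _, _, d, s in rows if not s]
    return float(np.mean(same)), float(np.mean(diff))


# Placeholder embeddings (an assumption): swap in the real 3mer vectors here.
rng = np.random.default_rng(0)
embeddings = {kmer: rng.normal(size=16) for kmer in CODON_TABLE}

rows = pairwise_distances(embeddings)
same_mean, diff_mean = mean_distance_by_group(rows)
print(f"same amino acid: {same_mean:.3f}, different amino acid: {diff_mean:.3f}")
```

The rows could then be fed into a linear model with the same-amino-acid flag as a predictor, with reading_frame and domain added as further columns once embeddings that distinguish them exist.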
I don't currently have time to work on this, otherwise I would try it out. These are some simple suggestions, feel free to ignore them if they are not useful to you.
Hi @michaelbarton!
By all means please go ahead :) I haven't yet settled on a 'story' for this investigation, but checking 3-mers and codons should definitely yield some interesting insights!
Just like you, I don't have much time to look at this, but I'll have a look on the weekend.
If you have any other ideas feel free to share!
I've started to look a bit into using LSTMs to classify some of my datasets, but it's slow going since the k-mer counting is so slow. For now I'm using k-mers as predicted by HAWK together with random noise, since there I know that the k-mers are different; I'll switch to full datasets later.