Three mer embeddings versus codons #1

michaelbarton · 2017-06-26T08:01:36Z

Hey @philippbayer, I don't know how much you want me chipping in with these suggestions. Feel free to ignore this, or tell me to mind my own business :). At least on a GitHub issue, I can post a longer response than on Twitter.

I suggest that a quick test would be to compare the embedding distances between 3mers against their codon assignments. My hypothesis is that the embedding distances between 3mers encoding the same amino acid differ from those between 3mers that do not encode the same amino acid. This should be easily falsifiable if it is not the case. You could try this with in-frame 3mers only versus 3mers from all reading frames. My suggested linear models to test for this would be:

* D_ij ~ 1 # Null model
* D_ij ~ reading_frame # Baseline with reading frame only
* D_ij ~ encoded_amino_acid
* D_ij ~ encoded_amino_acid + reading_frame
* D_ij ~ encoded_amino_acid + domain
* D_ij ~ encoded_amino_acid + reading_frame + domain

With the following variables:

* D_ij - the cosine distance between the embeddings of 3mer i and 3mer j.
* encoded_amino_acid - which amino acid the 3mer encodes.
* reading_frame - whether the 3mer is in the reading frame. This would require generating the embeddings so that in-frame and out-of-frame 3mers are treated as different tokens, e.g. by appending a character to the end of the 3mer to distinguish them.
* domain - whether the distances differ between bacteria and archaea. This would be a test for a phylogenetic signal.
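To make the variables above concrete, here is a minimal sketch in plain Python, not a tested analysis. It assumes the embeddings are available as a dict mapping each 3mer to a vector, and it reads `encoded_amino_acid` for a pair as a "same amino acid or not" indicator, which is my interpretation. The names `tag_3mers`, `pairwise_rows`, `fit_same_amino_acid` and the `_in`/`_out` suffixes are all made up for illustration. With a single binary predictor, the least-squares fit reduces to a difference of group means, so no stats library is needed here.

```python
from itertools import combinations, product
from math import sqrt

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def tag_3mers(cds):
    """Split a coding sequence into overlapping 3mers and tag each one as
    in-frame or out-of-frame (hypothetical suffixes), so an embedding model
    would see them as different tokens."""
    return [cds[i:i + 3] + ("_in" if i % 3 == 0 else "_out")
            for i in range(len(cds) - 2)]


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_rows(embeddings):
    """One row per unordered 3mer pair: (i, j, D_ij, same_amino_acid)."""
    rows = []
    for i, j in combinations(sorted(embeddings), 2):
        same = CODON_TABLE[i] == CODON_TABLE[j]
        rows.append((i, j, cosine_distance(embeddings[i], embeddings[j]), same))
    return rows


def fit_same_amino_acid(rows):
    """Least-squares fit of D_ij ~ same_amino_acid for a binary predictor.
    The intercept is the mean distance of pairs encoding different amino
    acids; the slope is the shift for pairs encoding the same one."""
    same = [d for _, _, d, s in rows if s]
    diff = [d for _, _, d, s in rows if not s]
    intercept = sum(diff) / len(diff)
    slope = sum(same) / len(same) - intercept
    return intercept, slope


# Toy vectors, invented purely to exercise the functions (not real embeddings).
toy = {"GCT": [1.0, 0.0], "GCC": [1.0, 0.0], "TGG": [0.0, 1.0]}
intercept, slope = fit_same_amino_acid(pairwise_rows(toy))
```

The models with several predictors would need a proper regression package. Also note that pairwise distances share 3mers and so are not independent observations, which means p-values from an ordinary linear model should be read with caution.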

I don't currently have time to work on this, otherwise I would try it out. These are some simple suggestions; feel free to ignore them if they are not useful to you.

philippbayer · 2017-06-26T08:38:05Z

Hi @michaelbarton!
By all means please go ahead :) I haven't yet settled on a 'story' for this investigation, but checking 3-mers and codons should definitely yield some interesting insights!

Just like you, I don't have much time to look at this, but I'll have a look on the weekend.

If you have any other ideas feel free to share!

I've started looking into using LSTMs to classify some of my datasets, but it's slow going because the k-mer counting is so slow. For now I'm using k-mers predicted by HAWK together with random noise, since there I know that the k-mers are different; I will switch to full datasets later.
