Three mer embeddings versus codons #1

Open
michaelbarton opened this issue Jun 26, 2017 · 1 comment

Comments

@michaelbarton

Hey @philippbayer, I'm not sure how much you want me chipping in with suggestions, so feel free to ignore this or tell me to mind my own business :). At least in a GitHub issue I can post a longer response than on Twitter.

I suggest a quick test: compare the embedding distances between 3mers with the codons they correspond to. My hypothesis is that the embedding distance between two 3mers encoding the same amino acid differs from the distance between two 3mers encoding different amino acids. This is easily falsifiable if it is not the case. You could try this with in-frame 3mers only versus 3mers from all reading frames. The linear models I'd suggest for testing this are:

* D_ij ~ 1 # Null model
* D_ij ~ reading_frame # Baseline: reading frame only
* D_ij ~ encoded_amino_acid
* D_ij ~ encoded_amino_acid + reading_frame
* D_ij ~ encoded_amino_acid + domain
* D_ij ~ encoded_amino_acid + reading_frame + domain

With the following variables:

* D_ij - the cosine distance between the embeddings of 3mer i and 3mer j.
* encoded_amino_acid - which amino acid the 3mer encodes.
* reading_frame - whether the 3mer is in the reading frame. This would require generating the embeddings so that in-frame and out-of-frame 3mers are treated as different tokens, e.g. by appending a character to the end to distinguish them.
* domain - whether the embeddings differ between bacteria and archaea. This would be a test for a phylogenetic signal.
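
To make the variables above a bit more concrete, here is a rough Python sketch of how the pairwise table behind these models could be built. Everything in it is an assumption on my part rather than something from the repo: the function names are hypothetical, the embeddings are random placeholder vectors standing in for the real ones, the `+`/`-` suffix is just one possible way of marking reading frame, and I read `encoded_amino_acid` as a pairwise indicator (do both 3mers encode the same amino acid under the standard genetic code?). With a single binary indicator, fitting `D_ij ~ encoded_amino_acid` amounts to comparing the two group means, which is all this sketch does.

```python
import itertools
import math
import random
from statistics import mean

# Standard genetic code (NCBI translation table 1), codons ordered with bases TCAG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def frame_tagged_kmers(cds, k=3):
    """Split a coding sequence into overlapping k-mers, tagging reading frame.

    Assumes `cds` starts at the first base of a codon. In-frame k-mers get a
    '+' suffix and out-of-frame ones a '-' suffix (a hypothetical convention).
    """
    return [
        cds[pos:pos + k] + ("+" if pos % 3 == 0 else "-")
        for pos in range(len(cds) - k + 1)
    ]


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_rows(embeddings):
    """Return (kmer_i, kmer_j, D_ij, same_amino_acid) for every unordered pair.

    Keys may carry a frame tag; only the first three characters are translated.
    """
    rows = []
    for i, j in itertools.combinations(sorted(embeddings), 2):
        same = CODON_TABLE[i[:3]] == CODON_TABLE[j[:3]]
        rows.append((i, j, cosine_distance(embeddings[i], embeddings[j]), same))
    return rows


def group_means(rows):
    """Mean D_ij for same-amino-acid pairs and for different-amino-acid pairs."""
    same = [d for _, _, d, s in rows if s]
    different = [d for _, _, d, s in rows if not s]
    return mean(same), mean(different)


def placeholder_embeddings(dim=8, seed=0):
    """Random vectors standing in for real 3mer embeddings (placeholder data)."""
    rng = random.Random(seed)
    return {codon: [rng.gauss(0, 1) for _ in range(dim)] for codon in CODON_TABLE}


rows = pairwise_rows(placeholder_embeddings())
same_mean, different_mean = group_means(rows)
print(f"mean D_ij, same amino acid:      {same_mean:.3f}")
print(f"mean D_ij, different amino acid: {different_mean:.3f}")
```

With random placeholder vectors the two means should come out roughly equal, which is what the null model predicts. The same rows could be handed to a proper linear-model fit once `reading_frame` and `domain` columns are added.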

I don't currently have time to work on this, otherwise I would try it out myself. These are just some simple suggestions, so feel free to ignore them if they aren't useful to you.

@philippbayer
Owner

Hi @michaelbarton!
By all means please go ahead :) I haven't yet settled on a 'story' for this investigation, but checking 3-mers and codons should definitely yield some interesting insights!

Just like you, I don't have much time to look at this, but I'll have a look on the weekend.

If you have any other ideas feel free to share!

I've started looking into using LSTMs to classify some of my datasets, but it's slow going because the k-mer counting is so slow. For now I'm using k-mers predicted by HAWK together with random noise, since there I know that the k-mers are different; I'll switch to full datasets later.
