DNA Sequences Embedding #11
Hi, yes, you can do this. Please refer to the
Hi, I am sorry that I was wrong in the last response. The 2-dimensional logits here are essentially the classification results for the given sequence, where each dimension stands for the probability that the sequence belongs to that class. The embedding for each sequence should be a 768-dimensional vector. You can achieve it by
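To make the distinction concrete, here is a minimal numpy sketch of the difference: the 2-dimensional logits map to class probabilities via a softmax, while the embedding is a separate 768-dimensional vector. The logits values below are made up for illustration, not real model output.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to class probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Stand-in 2-dimensional logits for one sequence (hypothetical values).
logits = np.array([1.5, -0.5])
probs = softmax(logits)
print(probs.shape)   # (2,): one probability per class
print(probs.sum())   # 1.0
```

The 768-dimensional embedding, by contrast, comes from the model's hidden states rather than its classification head.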
For the current version, if you have sequences longer than 512, you need to either truncate them to 512, or split them into multiple pieces of length 512 and concatenate their embeddings together.
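The split-and-concatenate approach can be sketched as follows. `fake_embed` below is a stand-in for a real model call (it just returns a deterministic random 768-dim vector), so the sketch only demonstrates the chunking and concatenation bookkeeping.

```python
import numpy as np

MAX_LEN = 512

def split_into_chunks(seq, max_len=MAX_LEN):
    """Split a sequence into consecutive pieces of at most max_len."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]

def fake_embed(chunk):
    """Stand-in for the model: one 768-dim vector per chunk."""
    rng = np.random.default_rng(len(chunk))
    return rng.standard_normal(768)

sequence = "ACGT" * 300                 # length 1200 > 512
chunks = split_into_chunks(sequence)    # pieces of 512, 512, 176
embedding = np.concatenate([fake_embed(c) for c in chunks])
print(len(chunks), embedding.shape)     # 3 (3 * 768,)
```

Note that concatenation makes the final embedding length depend on the sequence length; averaging the per-chunk vectors instead would keep it fixed at 768.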
Thanks a lot!
Hi, how do I get embeddings of multiple sequences at once? I tried with a list of sequences, but the output is always a 1x768 vector. Thanks.
I tried this code, and when running it inside a for loop I kept getting the process killed because it was using too much memory; I needed to put the prediction part inside
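The comment above is truncated, but a common cause of memory growth in an inference loop is PyTorch retaining the autograd graph for every forward pass; wrapping the prediction in `torch.no_grad()` avoids that. A minimal sketch, using a tiny `nn.Linear` as a stand-in for the real model:

```python
import torch

model = torch.nn.Linear(768, 2)   # tiny stand-in, not the real DNABERT model
inputs = torch.randn(4, 768)

# Without no_grad, each output carries an autograd graph, which accumulates
# memory if you keep references to the outputs inside a loop.
with torch.no_grad():
    outputs = model(inputs)

print(outputs.requires_grad)   # False: no graph is stored
```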
Thank you, this is very helpful! If you run this snippet from a script that lives in the parent directory, at the same level as the examples folder, you may run into some problems. I had to change the following two imports in modeling_albert.py: to the following:
Also, I made a change to the snippet to get it to work. I changed the following import statement:
to
Hello, I just want to import the DNATokenizer from transformers, but I get "ImportError: cannot import name 'DNATokenizer'". Can you help me solve this? Google doesn't tell me anything about it...
You should clone it into the right directory, because DNATokenizer is inside the DNABERT folder. Try this: !git clone https://github.com/jerryji1993/DNABERT and run the import from that directory.
Actually I didn't use the DNATokenizer, and used the BertTokenizer instead. The code also works; are these two ways different, and does it make much difference?
As far as I understand, DNATokenizer is specifically trained on DNA sequences; the BERT tokenizer, on the other hand, is a tokenizer for sentences. So you will get different input values from the two tokenizers, which affects your output.
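To illustrate the difference: DNABERT represents a DNA sequence as overlapping k-mer tokens, whereas a BERT WordPiece tokenizer splits text against an English subword vocabulary, so the resulting token IDs differ completely. Below is a hypothetical helper sketching the k-mer step (DNABERT ships its own utility for this; `seq_to_kmers` is an illustrative name, not the repo's API).

```python
def seq_to_kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (illustrative helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = seq_to_kmers("ACGTACGTA", k=6)
print(tokens)   # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA']
```

A sentence tokenizer applied to the same string would instead produce subword pieces from its own vocabulary, which is why the two tokenizers cannot be swapped without changing the model's inputs.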
OK, I will try your way of installing the requirements. So far I have just written the code on my own without installing it; I hope the installation will work.
Dear Zihian, I would like to use DNABERT to get short DNA sequence representations, in order to keep the sequential relationships instead of using the one-hot encoding method, and then feed these values into another model. Is it good to use the embedding representation directly, which is output[1], or would the attention scores also work as a representation?
I did some research on the output of this model: output[0] is the last_hidden_state (https://huggingface.co/docs/transformers/main_classes/output). I saw people use output[0][:, 0, :], i.e., the 768-dimensional vector for the 'CLS' token in the last hidden layer, as input to the following model, and that works for me. I don't think the attention scores are meant to serve as the output representation.
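The indexing described above can be sketched with a numpy stand-in for `output[0]` (random values here, not a real model output), to show the shapes involved:

```python
import numpy as np

batch, seq_len, hidden = 4, 128, 768
# Stand-in for output[0], the last_hidden_state: one vector per token.
last_hidden_state = np.random.randn(batch, seq_len, hidden)

# Take the hidden vector of the first token ('CLS') for every sequence.
cls_embeddings = last_hidden_state[:, 0, :]
print(cls_embeddings.shape)   # (4, 768): one 768-dim vector per sequence
```

This turns a per-token tensor of shape (batch, seq_len, 768) into one fixed-size vector per sequence, which is what a downstream classifier needs.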
Dear authors, I have a question about obtaining embedding vectors for my data. May I use And after I have the embedding vectors, I am going to build a classifier on them. It would be nice if you could tell me whether this approach is correct.
I see that the model_input returned by tokenizer.encode_plus does not have padding on the left or right, even though the input size is < 512. If I add padding on the right, the embedding generated by DNABERT changes. So what, in your view, is the correct format? Should I manually add padding on the right, or just ignore it?
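One reason padding can change a pooled embedding is that pad positions still get hidden states, so any pooling that ignores the attention mask mixes them in. A numpy sketch of the effect, with random stand-in hidden states rather than real model output (masking out pad positions restores the unpadded result for mean pooling; CLS pooling has the analogous property since position 0 is never a pad):

```python
import numpy as np

hidden = 8
real_len, padded_len = 5, 12

rng = np.random.default_rng(0)
states_real = rng.standard_normal((real_len, hidden))

# Simulate right padding: pad positions get their own nonzero hidden states,
# which is why the embedding shifts if you pool over every position.
pad_states = rng.standard_normal((padded_len - real_len, hidden))
states_padded = np.vstack([states_real, pad_states])
mask = np.array([1] * real_len + [0] * (padded_len - real_len))

naive_mean = states_padded.mean(axis=0)   # contaminated by padding
masked_mean = (states_padded * mask[:, None]).sum(axis=0) / mask.sum()

print(np.allclose(masked_mean, states_real.mean(axis=0)))   # True
```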
Hi, I just did what you said, but I still got this:
Hi,
Thanks for this very good work. I was wondering if I can retrieve the embeddings of DNA sequences that I train on and use them later for downstream tasks. Can you please confirm this and guide me on how I can get the embeddings?
Thanks!