DNA Sequences Embedding #11

Closed
elbasir opened this issue Jan 17, 2021 · 17 comments
@elbasir

elbasir commented Jan 17, 2021

Hi,
Thanks for this very good work. I was wondering whether I can retrieve embeddings for DNA sequences that I would train on, and then use them later for downstream tasks. Can you please confirm this and guide me on how to get the embeddings?

Thanks!

@Zhihan1996
Collaborator

Hi,

Yes, you can do this. Please refer to line 420 of https://github.com/jerryji1993/DNABERT/blob/master/examples/run_finetune.py. The variable logits is the embedding of the DNA sequence; you can use it directly.

@elbasir
Author

elbasir commented Mar 18, 2021

Hi,
Thanks for your answer. I checked the embedding via the logits variable and found that it is only a 2-dimensional vector for each DNA sequence. I have a sequence longer than 500 bp; would it be possible to extend the embedding size, or do you think the current size is enough to represent a long DNA sequence?

@Zhihan1996
Collaborator

Hi,

I am sorry, I was wrong in my last response. The 2-dimensional logits are essentially the classification result for the given sequence, where each dimension is the probability that the sequence belongs to that class. The embedding of each sequence should be a 768-dimensional vector. You can obtain it with:

import torch
from transformers import BertModel, BertConfig, DNATokenizer

dir_to_pretrained_model = "xxx/xxx"

config = BertConfig.from_pretrained('https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/config.json')
tokenizer = DNATokenizer.from_pretrained('dna6')
model = BertModel.from_pretrained(dir_to_pretrained_model, config=config)

sequence = "AATCTA ATCTAG TCTAGC CTAGCA"
model_input = tokenizer.encode_plus(sequence, add_special_tokens=True, max_length=512)["input_ids"]
model_input = torch.tensor(model_input, dtype=torch.long)
model_input = model_input.unsqueeze(0)   # to generate a fake batch with batch size one

output = model(model_input)

Here output[1] is the embedding of the input sequence.

In the current version, if you have sequences longer than 512 tokens, you need to either truncate them to 512 or split them into multiple pieces of length 512 and concatenate their embeddings.
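For illustration, a minimal sketch of that splitting approach, assuming the tokenizer and model objects from the snippet above (the helper name is just illustrative; 510 k-mers per chunk leaves room for the [CLS] and [SEP] tokens):

import torch

def embed_long_sequence(kmer_sequence, tokenizer, model, chunk_size=510):
    # Split the space-separated k-mer string into chunks of at most chunk_size
    # k-mers, embed each chunk, and concatenate the per-chunk pooled embeddings,
    # as described above.
    kmers = kmer_sequence.split()
    chunk_embeddings = []
    with torch.no_grad():
        for start in range(0, len(kmers), chunk_size):
            chunk = " ".join(kmers[start:start + chunk_size])
            ids = tokenizer.encode_plus(chunk, add_special_tokens=True, max_length=512)["input_ids"]
            ids = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
            output = model(ids)
            chunk_embeddings.append(output[1])  # pooled 768-d embedding of this chunk
    return torch.cat(chunk_embeddings, dim=-1)  # one long concatenated vector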

@elbasir
Author

elbasir commented Mar 19, 2021

Thanks a lot!

@iamysk

iamysk commented Jul 6, 2021

Hi,

How do I get embeddings of multiple sequences at once? I tried with a list of sequences, but the output is always a 1x768 vector.

Thanks.
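A minimal sketch of one way to batch several sequences in a single forward pass, assuming the tokenizer and model from the snippet above and a pad token id of 0 (the usual BERT convention):

import torch

sequences = ["AATCTA ATCTAG TCTAGC CTAGCA",
             "GGTCTA GTCTAG TCTAGG CTAGGA"]

encoded = [tokenizer.encode_plus(s, add_special_tokens=True, max_length=512)["input_ids"]
           for s in sequences]
max_len = max(len(ids) for ids in encoded)

input_ids = torch.zeros(len(encoded), max_len, dtype=torch.long)       # 0 assumed to be [PAD]
attention_mask = torch.zeros(len(encoded), max_len, dtype=torch.long)
for i, ids in enumerate(encoded):
    input_ids[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    attention_mask[i, :len(ids)] = 1

with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)

embeddings = output[1]   # shape (num_sequences, 768)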

@maiskovich

> (quoting @Zhihan1996's 768-dimensional embedding snippet from the reply above)

I tried this code, and when running it inside a for loop the process kept getting killed because it used too much memory. I needed to wrap the prediction part in with torch.no_grad():. It ended up looking like this:

import torch
from transformers import BertModel, BertConfig, DNATokenizer

dir_to_pretrained_model = "xxx/xxx"

config = BertConfig.from_pretrained('https://raw.githubusercontent.com/jerryji1993/DNABERT/master/src/transformers/dnabert-config/bert-config-6/config.json')
tokenizer = DNATokenizer.from_pretrained('dna6')
model = BertModel.from_pretrained(dir_to_pretrained_model, config=config)

sequence = "AATCTA ATCTAG TCTAGC CTAGCA"
model_input = tokenizer.encode_plus(sequence, add_special_tokens=True, max_length=512)["input_ids"]
model_input = torch.tensor(model_input, dtype=torch.long)
model_input = model_input.unsqueeze(0)   # to generate a fake batch with batch size one

with torch.no_grad():   # no gradient tracking, so activations are freed and memory stays flat
    output = model(model_input)

@asimokby

> (quoting @Zhihan1996's 768-dimensional embedding snippet from the reply above)

Thank you, this is very helpful!

If you run this snippet from a script that lives in the parent directory, at the same level as the examples folder, you may run into some problems.

I had to change the following two imports in modeling_albert.py:
from transformers.configuration_albert import AlbertConfig
from transformers.modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer

to the following:

from transformers.models.albert.configuration_albert import AlbertConfig
from transformers.models.bert.modeling_bert import ACT2FN, BertEmbeddings, BertSelfAttention, prune_linear_layer

Also, I made a change to the snippet to get it to work. I changed the following import statement:

from transformers import BertModel, BertConfig, DNATokenizer

to

from src.transformers import DNATokenizer
from transformers import BertModel, BertConfig

@ChengkuiZhao

Hello, I just want to import DNATokenizer from transformers, but I get 'ImportError: cannot import name DNATokenizer'. Can you help me solve this? Google doesn't tell me anything about it...

@aliakay

aliakay commented Jun 13, 2022

> (quoting @ChengkuiZhao's ImportError question above)

You should clone it into the right directory, because DNATokenizer lives inside the DNABERT folder.

Try this,

!git clone https://github.com/jerryji1993/DNABERT
%cd DNABERT
!python3 -m pip install --editable .
%cd examples
!python3 -m pip install -r requirements.txt

and run the import from that directory (cd "DNABERT/examples"); then the DNATokenizer import should work.

@ChengkuiZhao

> (quoting @ChengkuiZhao's ImportError question and @aliakay's install instructions above)

Actually I didn't use DNATokenizer; I used BertTokenizer instead. The code also works. Are the two different, and does it make much of a difference?
Thank you so much for your reply.

@aliakay

aliakay commented Jun 13, 2022

> (quoting the exchange above about the ImportError and using BertTokenizer instead of DNATokenizer)

As far as I understand, DNATokenizer is built specifically for DNA sequences, whereas BertTokenizer is a tokenizer for natural-language sentences, so the two tokenizers will give you different input ids, which affects your output.
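As an illustration, DNABERT expects space-separated, overlapping k-mers as input (k=6 for the dna6 model). A minimal sketch (the helper name is just illustrative):

def seq_to_kmers(seq, k=6):
    # e.g. "AATCTAGCA" -> "AATCTA ATCTAG TCTAGC CTAGCA"
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

A BertTokenizer loaded with the standard English WordPiece vocabulary will split these k-mers into unrelated sub-word ids, so the pretrained DNABERT weights would receive inputs they were never trained on; it should only be equivalent if it is loaded with DNABERT's own vocab.txt.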

@ChengkuiZhao

> (quoting the exchange above about DNATokenizer vs. BertTokenizer)

OK, I will try installing it your way with the requirements. So far I have just written the code on my own without installing the package; I hope the installation will work.

@aliakay

aliakay commented Jun 13, 2022

> (quoting @Zhihan1996's 768-dimensional embedding snippet from the reply above)

Dear Zhihan,

I would like to use DNABERT to get representations of short DNA sequences that preserve the sequential relationships, instead of using one-hot encoding, and feed these values into another model. Is it fine to use the embedding representation output[1] directly, or would the attention scores also work as a representation?

@ChengkuiZhao

> (quoting @Zhihan1996's embedding snippet and @aliakay's question above)

I did some research on the output of this model. output[0] is the last_hidden_state (https://huggingface.co/docs/transformers/main_classes/output). I have seen people use output[0][:, 0, :], i.e. the 768-dimensional vector of the [CLS] token in the last hidden layer, as the representation fed to the downstream model, and that works for me. I don't think the attention scores are meant to be used as the output representation.
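Put as code, a minimal sketch of the options mentioned here, assuming the model and model_input from the snippet above:

import torch

with torch.no_grad():
    output = model(model_input)

last_hidden_state = output[0]                    # shape (1, seq_len, 768)
pooled_output = output[1]                        # tanh-projected [CLS] vector, shape (1, 768)
cls_embedding = last_hidden_state[:, 0, :]       # raw [CLS] vector from the last layer
mean_embedding = last_hidden_state.mean(dim=1)   # mean pooling, another common choice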

@WENHUAN22

Dear authors,

I have a question about obtaining embedding vectors for my data.
For example, the first record in my dataset is "AATCTA ATCTAG TCTAGC CTAGCA".

May I use
model.embeddings.word_embeddings.weight[0]
as the embedding vector of my first sample?
Is there any difference between the method you introduced above (output[1]) and this one?

After I have the embedding vectors, I am going to build a classifier on top of them. It would be nice if you could tell me whether this approach is correct.

@palset

palset commented Sep 8, 2022

> (quoting @Zhihan1996's 768-dimensional embedding snippet from the reply above)

I see that model_input, as returned by tokenizer.encode_plus, has no padding on the left or right, even though the input size is < 512. If I add padding on the right, the embedding generated by DNABERT changes. So what, in your view, is the correct format? Should I manually add padding on the right, or just ignore it?
Thanks for the great work!
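A sketch of one way this is often handled (an assumption on my side, not an authoritative answer): pad to a fixed length but pass the attention mask returned by encode_plus, so the padded positions are masked out of attention and should not change the pooled embedding beyond numerical noise.

import torch

enc = tokenizer.encode_plus(sequence, add_special_tokens=True,
                            max_length=512, pad_to_max_length=True)  # newer versions use padding='max_length'
input_ids = torch.tensor(enc["input_ids"], dtype=torch.long).unsqueeze(0)
attention_mask = torch.tensor(enc["attention_mask"], dtype=torch.long).unsqueeze(0)

with torch.no_grad():
    output = model(input_ids, attention_mask=attention_mask)
embedding = output[1]   # pooled embedding; padded positions are ignored thanks to the mask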

@CharlsonLiu

> (quoting the exchange above about the ImportError and the install instructions)

Hi, I just did what you said, but I still get 'ImportError: cannot import name DNATokenizer'. I don't know which step(s) went wrong.
