
LongCoder encoder #276

Open
boitavoi opened this issue Jul 12, 2023 · 5 comments

Comments


boitavoi commented Jul 12, 2023

Hey!

The LongCoder work is super impressive and important, thank you for that.
I was curious: is it possible to use the LongCoder encoder for obtaining **embeddings only** for long (>2048 tokens) source code snippets?
Currently I use UniXcoder for my research, but I need to handle longer code snippets. Is it possible to use LongCoder for embeddings somehow?
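For context, this is roughly how embeddings are obtained with UniXcoder through Hugging Face `transformers` today (a minimal sketch; mean pooling over the last hidden states is one common choice, not necessarily the setup used in this thread):

```python
# Sketch: sentence-level code embedding with UniXcoder via transformers.
# Mean pooling is an assumption; other pooling strategies are possible.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

code = "def max(a, b): return a if a > b else b"
# The 512-token truncation is exactly the limitation being discussed here.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single embedding vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, 768) for the base model
```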

Contributor

guoday commented Jul 13, 2023

It's hard to use the LongCoder encoder for obtaining **embeddings only** for long source code snippets, because I modified the code so that it only supports decoder-only mode.

If you need it, I can provide you with a script to convert the UniXcoder model to a Longformer model, so that you can use a Longformer model initialized from UniXcoder weights to handle longer code snippets.
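For reference, the usual RoBERTa-to-Longformer conversion recipe copies the encoder weights as-is and tiles the learned position-embedding table out to the longer maximum length. A minimal sketch of the tiling step (the attached `convert.py` may handle details differently, e.g. RoBERTa's reserved padding positions):

```python
# Sketch: extend a learned position-embedding table by tiling it, the core
# trick behind short-context -> Longformer conversion. Details (offset for
# RoBERTa's padding_idx, attention-window config) are omitted here.
import torch

def extend_position_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Tile a (old_len, hidden) position-embedding table to new_len rows."""
    old_len, hidden = pos_emb.shape
    new_emb = pos_emb.new_empty((new_len, hidden))
    # Repeat the original table until the longer table is filled.
    for start in range(0, new_len, old_len):
        end = min(start + old_len, new_len)
        new_emb[start:end] = pos_emb[: end - start]
    return new_emb

# Toy check with a 4-position, 3-dim table extended to 10 positions.
old = torch.arange(12, dtype=torch.float32).reshape(4, 3)
new = extend_position_embeddings(old, 10)
```

The idea is that the tiled embeddings give every position a sensible initialization, which fine-tuning can then adjust.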

Contributor

guoday commented Jul 13, 2023

convert.py.zip

Author

boitavoi commented Jul 17, 2023

> convert.py.zip

Thank you! This is indeed helpful :)
Does it require additional training/fine-tuning, or can I use the Longformer after conversion as is?

Contributor

guoday commented Jul 17, 2023

After conversion, you can use the Longformer directly without additional pre-training. However, it needs to be fine-tuned on downstream tasks.
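To illustrate what "fine-tune on downstream tasks" would look like, here is a runnable toy: a tiny, randomly initialized Longformer with a classification head taking one gradient step. The config sizes are placeholders, not the converted UniXcoder's:

```python
# Toy fine-tuning step on a miniature Longformer classifier.
# All dimensions below are arbitrary placeholders for illustration.
import torch
from transformers import LongformerConfig, LongformerForSequenceClassification

config = LongformerConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    max_position_embeddings=4098, attention_window=[8, 8], num_labels=2,
)
model = LongformerForSequenceClassification(config)

# A fake labeled example; a real setup would use the converted tokenizer.
input_ids = torch.randint(0, 100, (1, 16))
labels = torch.tensor([1])

out = model(input_ids=input_ids, labels=labels)
out.loss.backward()  # an optimizer step would follow in a training loop
```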

@SasCezar

@guoday
I used the convert script; however, I have issues using the converted model.

This is what I tried:

```python
from transformers import LongformerConfig, RobertaTokenizer, pipeline
from models.longcoder import LongcoderModel

config = LongformerConfig.from_pretrained('/path-to-models/longformer-unixcoder')
tokenizer = RobertaTokenizer.from_pretrained('/path-to-models/longformer-unixcoder')
longcoder = LongcoderModel.from_pretrained('/path-to-models/longformer-unixcoder', config=config)

embedding = pipeline('feature-extraction', model=longcoder, tokenizer=tokenizer)

func = "def f(a,b): if a>b: return a else return b"
embedding(func)
```

Then I get the following error:

```
AttributeError: 'LongformerConfig' object has no attribute 'is_decoder_only'
```

If I don't use the pipeline(), switching to the following code:

```python
tokens = tokenizer.tokenize("return maximum value")
longcoder(tokens)
```

I get this error:

```
AttributeError: 'str' object has no attribute 'size'
```
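A likely cause of the second error: `tokenizer.tokenize()` returns a list of subword *strings*, while the model's forward pass expects integer ID *tensors*. A hedged sketch of the fix, using the UniXcoder tokenizer as a stand-in for the converted one (the call on the converted model is left commented, since it depends on the local checkpoint); the first error may separately require setting `config.is_decoder_only = False` before loading, but that is a guess about the custom `LongcoderModel`:

```python
# Sketch: encode to ID tensors instead of passing token strings.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")

tokens = tokenizer.tokenize("return maximum value")  # list of str -> wrong input type
print(type(tokens[0]))  # <class 'str'>

# Calling the tokenizer itself returns a batch of LongTensor input IDs.
inputs = tokenizer("return maximum value", return_tensors="pt")
# longcoder(inputs["input_ids"])  # hypothetical: the converted model above
print(inputs["input_ids"].shape)
```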
