
LongCoder encoder #276

Open
boitavoi opened this issue Jul 12, 2023 · 5 comments

Comments


boitavoi commented Jul 12, 2023

Hey!

The LongCoder work is super impressive and important, thank you for that.
I was curious: is it possible to use the LongCoder encoder for obtaining **embeddings only** for long (>2048 tokens) source code snippets?
Currently I use UniXcoder for my research, but I need to handle longer code snippets. Is it possible to use LongCoder for embeddings somehow?
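For context, this is roughly how embeddings are obtained with UniXcoder through Hugging Face `transformers` today (a minimal sketch; mean pooling over the last hidden states is one common choice, not necessarily the setup used in this thread):

```python
# Sketch: sentence-level code embedding with UniXcoder via transformers.
# Mean pooling is an assumption; other pooling strategies are possible.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

code = "def max(a, b): return a if a > b else b"
# The 512-token truncation is exactly the limitation being discussed here.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single embedding vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, 768) for the base model
```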

Contributor

guoday commented Jul 13, 2023

It's hard to use the LongCoder encoder for obtaining **embeddings only** for long source code snippets, because I modified the code so that it only supports decoder-only mode.

If you need it, I can provide you with a script to convert the UniXcoder model to a Longformer model, so that you can use a Longformer model initialized from UniXcoder weights to handle longer code snippets.
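For reference, the usual RoBERTa-to-Longformer conversion recipe copies the encoder weights as-is and tiles the learned position-embedding table out to the longer maximum length. A minimal sketch of the tiling step (the attached `convert.py` may handle details differently, e.g. RoBERTa's reserved padding positions):

```python
# Sketch: extend a learned position-embedding table by tiling it, the core
# trick behind short-context -> Longformer conversion. Details (offset for
# RoBERTa's padding_idx, attention-window config) are omitted here.
import torch

def extend_position_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Tile a (old_len, hidden) position-embedding table to new_len rows."""
    old_len, hidden = pos_emb.shape
    new_emb = pos_emb.new_empty((new_len, hidden))
    # Repeat the original table until the longer table is filled.
    for start in range(0, new_len, old_len):
        end = min(start + old_len, new_len)
        new_emb[start:end] = pos_emb[: end - start]
    return new_emb

# Toy check with a 4-position, 3-dim table extended to 10 positions.
old = torch.arange(12, dtype=torch.float32).reshape(4, 3)
new = extend_position_embeddings(old, 10)
```

The idea is that the tiled embeddings give every position a sensible initialization, which fine-tuning can then adjust.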

Contributor

guoday commented Jul 13, 2023

convert.py.zip

Author

boitavoi commented Jul 17, 2023

> convert.py.zip

Thank you! This is indeed helpful :)
Does it require additional training/fine-tuning, or can I use the Longformer after conversion as is?

Contributor

guoday commented Jul 17, 2023

After conversion, you can use the Longformer directly without additional pre-training. However, it needs to be fine-tuned on downstream tasks.
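To illustrate what "fine-tune on downstream tasks" would look like, here is a runnable toy: a tiny, randomly initialized Longformer with a classification head taking one gradient step. The config sizes are placeholders, not the converted UniXcoder's:

```python
# Toy fine-tuning step on a miniature Longformer classifier.
# All dimensions below are arbitrary placeholders for illustration.
import torch
from transformers import LongformerConfig, LongformerForSequenceClassification

config = LongformerConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
    max_position_embeddings=4098, attention_window=[8, 8], num_labels=2,
)
model = LongformerForSequenceClassification(config)

# A fake labeled example; a real setup would use the converted tokenizer.
input_ids = torch.randint(0, 100, (1, 16))
labels = torch.tensor([1])

out = model(input_ids=input_ids, labels=labels)
out.loss.backward()  # an optimizer step would follow in a training loop
```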

@SasCezar

@guoday
I used the convert script; however, I have issues using the converted model.

This is what I tried:

```python
from transformers import LongformerConfig, RobertaTokenizer, pipeline
from models.longcoder import LongcoderModel

config = LongformerConfig.from_pretrained('/path-to-models/longformer-unixcoder')
tokenizer = RobertaTokenizer.from_pretrained('/path-to-models/longformer-unixcoder')
longcoder = LongcoderModel.from_pretrained('/path-to-models/longformer-unixcoder', config=config)

embedding = pipeline('feature-extraction', model=longcoder, tokenizer=tokenizer)

func = "def f(a,b): if a>b: return a else return b"
embedding(func)
```

Then I get the following error:

```
AttributeError: 'LongformerConfig' object has no attribute 'is_decoder_only'
```

If I don't use the pipeline(), switching to the following code:

```python
tokens = tokenizer.tokenize("return maximum value")
longcoder(tokens)
```

I get this error:

```
AttributeError: 'str' object has no attribute 'size'
```
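A likely cause of the second error: `tokenizer.tokenize()` returns a list of subword *strings*, while the model's forward pass expects integer ID *tensors*. A hedged sketch of the fix, using the UniXcoder tokenizer as a stand-in for the converted one (the call on the converted model is left commented, since it depends on the local checkpoint); the first error may separately require setting `config.is_decoder_only = False` before loading, but that is a guess about the custom `LongcoderModel`:

```python
# Sketch: encode to ID tensors instead of passing token strings.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")

tokens = tokenizer.tokenize("return maximum value")  # list of str -> wrong input type
print(type(tokens[0]))  # <class 'str'>

# Calling the tokenizer itself returns a batch of LongTensor input IDs.
inputs = tokenizer("return maximum value", return_tensors="pt")
# longcoder(inputs["input_ids"])  # hypothetical: the converted model above
print(inputs["input_ids"].shape)
```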
