
Using CodeBERT for code based semantic search / clustering #13

Closed

JohnGiorgi opened this issue Oct 8, 2020 · 11 comments

Comments

@JohnGiorgi

JohnGiorgi commented Oct 8, 2020

Hi,

I am interested in using CodeBERT for semantic similarity / clustering on code, but my results are rather poor. Here is my process:

Download the data:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm codesearch_data.zip

Grab some examples to embed:

from pathlib import Path

max_instances = 8

# Each line is one example; fields are separated by <CODESPLIT>,
# with the code snippet in the last field.
valid = Path("data/codesearch/train_valid/python/valid.txt").read_text().splitlines()
code = [ex.split("<CODESPLIT>")[-1] for ex in valid][:max_instances]

Embed the examples:

import torch
from transformers import RobertaTokenizer, RobertaModel

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model = model.to(device)

# Prepare the inputs
inputs = tokenizer(
    code, padding=True, truncation=True, return_tensors="pt"
)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

# Embed the inputs; no gradients are needed for inference.
# (On transformers >= 4 the model returns a ModelOutput rather than a
# tuple, so index it instead of unpacking.)
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Use the [CLS] token embedding as the sequence representation
embeddings = sequence_output[:, 0, :]

Then I compute the cosine similarity between the first input's embedding and the remaining embeddings:

from torch.nn import CosineSimilarity

# Perform a cosine based semantic similarity search, using the first function as query 
sim = CosineSimilarity(dim=-1)
cosine = sim(embeddings[0], embeddings[1:])
scores, indices = cosine.topk(5)

print(f"Scores: {scores.tolist()}")
print()
print(f"Query:\n---\n{code[0]}")
print()
topk = '\n'.join([code[i] for i in indices])
print(f"Top K:\n---\n{topk}")

The output:

Scores: [0.9909096360206604, 0.9864522218704224, 0.9837372899055481, 0.9776582717895508, 0.9704807996749878]

Query:
---
def start_transaction ( self , sort , address , price = None , data = None , caller = None , value = 0 , gas = 2300 ) : assert self . _pending_transaction is None , "Already started tx" self . _pending_transaction = PendingTransaction ( sort , address , price , data , caller , value , gas )

Top K:
---
def remove_node ( self , id ) : if self . has_key ( id ) : n = self [ id ] self . nodes . remove ( n ) del self [ id ] # Remove all edges involving id and all links to it. for e in list ( self . edges ) : if n in ( e . node1 , e . node2 ) : if n in e . node1 . links : e . node1 . links . remove ( n ) if n in e . node2 . links : e . node2 . links . remove ( n ) self . edges . remove ( e )
def find_essential_genes ( model , threshold = None , processes = None ) : if threshold is None : threshold = model . slim_optimize ( error_value = None ) * 1E-02 deletions = single_gene_deletion ( model , method = 'fba' , processes = processes ) essential = deletions . loc [ deletions [ 'growth' ] . isna ( ) | ( deletions [ 'growth' ] < threshold ) , : ] . index return { model . genes . get_by_id ( g ) for ids in essential for g in ids }
async def play_now ( self , requester : int , track : dict ) : self . add_next ( requester , track ) await self . play ( ignore_shuffle = True )
def _handleAuth ( fn ) : @ functools . wraps ( fn ) def wrapped ( * args , * * kwargs ) : # auth, , authenticate users, internal from yotta . lib import auth # if yotta is being run noninteractively, then we never retry, but we # do call auth.authorizeUser, so that a login URL can be displayed: interactive = globalconf . get ( 'interactive' ) try : return fn ( * args , * * kwargs ) except requests . exceptions . HTTPError as e : if e . response . status_code == requests . codes . unauthorized : #pylint: disable=no-member logger . debug ( '%s unauthorised' , fn ) # any provider is sufficient for registry auth auth . authorizeUser ( provider = None , interactive = interactive ) if interactive : logger . debug ( 'retrying after authentication...' ) return fn ( * args , * * kwargs ) raise return wrapped
def write_log ( log_path , data , allow_append = True ) : append = os . path . isfile ( log_path ) islist = isinstance ( data , list ) if append and not allow_append : raise Exception ( 'Appending has been disabled' ' and file %s exists' % log_path ) if not ( islist or isinstance ( data , Args ) ) : raise Exception ( 'Can only write Args objects or dictionary' ' lists to log file.' ) specs = data if islist else data . specs if not all ( isinstance ( el , dict ) for el in specs ) : raise Exception ( 'List elements must be dictionaries.' ) log_file = open ( log_path , 'r+' ) if append else open ( log_path , 'w' ) start = int ( log_file . readlines ( ) [ - 1 ] . split ( ) [ 0 ] ) + 1 if append else 0 ascending_indices = range ( start , start + len ( data ) ) log_str = '\n' . join ( [ '%d %s' % ( tid , json . dumps ( el ) ) for ( tid , el ) in zip ( ascending_indices , specs ) ] ) log_file . write ( "\n" + log_str if append else log_str ) log_file . close ( )

Notice that the cosine similarity is very high for all top-5 examples, which is unexpected given that these examples were chosen arbitrarily. Manually inspecting them, they don't appear to be relevant to the query.

My questions:

  • Am I doing something wrong?
  • Is there a better way to do semantic similarity search / clustering with CodeBERT? Here I am following the canonical pipeline for sentence embeddings ([CLS] pooling; a mean-pooling variant is sketched after this list).
  • One possible source of error is the tokenization. Am I supposed to use the CodeBERT tokenizer on code, or just on text?
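
For reference, a minimal sketch of mean pooling over the token embeddings (masking out padding), a common alternative to the [CLS] embedding in sentence-embedding pipelines. It assumes the inputs and sequence_output tensors from the snippet above:

# Sketch: mean-pool token embeddings instead of taking the [CLS] token.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
summed = (sequence_output * mask).sum(dim=1)   # sum over non-padding tokens
counts = mask.sum(dim=1).clamp(min=1)          # number of non-padding tokens
mean_pooled = summed / counts                  # (batch, hidden_size)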
@fengzhangyin
Collaborator

Hi @JohnGiorgi ,
CodeBERT is pre-trained with the masked language modeling and replaced token detection objectives. It needs to be fine-tuned to support downstream tasks, whereas you are using CodeBERT directly for semantic similarity / clustering without any fine-tuning.

We have released a pipeline for the Clone Detection task, which is similar to yours. Please refer to the website.
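
(For intuition, a minimal sketch of what such fine-tuning looks like: encode the pair of snippets jointly and train a binary classifier on top. This is illustrative only, not the released pipeline; CloneClassifier is a made-up name.)

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class CloneClassifier(nn.Module):
    # Illustrative pair classifier on top of CodeBERT (not the repo's model).
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs[0][:, 0, :]  # [CLS] embedding of the encoded pair
        return self.head(cls)      # clone / not-clone logits

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = CloneClassifier()

# One illustrative training step on a single (code1, code2, label) pair
batch = tokenizer("def f(a,b): return a+b", "def add(x,y): return x+y",
                  return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))  # 1 = clone
loss.backward()  # optimizer step omitted for brevity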

@JohnGiorgi
Author

I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task.

I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!

@fengzhangyin
Collaborator

Sorry, we don't have plans for that at the moment. You can use the released pipeline to fine-tune CodeBERT yourself; it shouldn't take too much time.

@JohnGiorgi
Author

JohnGiorgi commented Oct 8, 2020

Thanks a lot.

I just have two more questions:

  • Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
  • Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.

@guoday
Contributor

guoday commented Oct 9, 2020

> Thanks a lot.
>
> I just have two more questions:
>
>   • Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
>   • Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.
  1. Yes, we fine-tuned the model on 2 GPUs. See the last figure here for the training and inference cost.
  2. The POJ-104 dataset contains C/C++ programs, as mentioned in the figures here.

@guody5 guody5 closed this as completed Oct 12, 2020
@shaileshj2803

Hi @JohnGiorgi, I am trying to detect whether two pieces of code are similar using cosine similarity, much like what you described earlier. I would like to know whether you were able to fine-tune the model, and could you share the approach you took?
Thanks a lot

@JohnGiorgi
Author

Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!

@nashid
Contributor

nashid commented Jul 23, 2022

@shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. Curious to know what you ended up doing. Have you used CodeBERT for this purpose, or did you take an alternative approach?

@guoday
Contributor

guoday commented Jul 24, 2022

> @shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. Curious to know what you ended up doing. Have you used CodeBERT for this purpose, or did you take an alternative approach?

Hi @nashid. I suggest you follow this readme: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl.

@nashid
Contributor

nashid commented Jul 24, 2022

@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets, without natural language.

So natural language, like a docstring, is not present in my case.

Would UniXcoder still be effective in my case?

@guoday
Contributor

guoday commented Jul 24, 2022

If you read the readme carefully, you will see that UniXcoder doesn't need natural language.
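
For illustration, a minimal sketch of code-to-code similarity along the lines of the UniXcoder README; it assumes the unixcoder.py wrapper from that repo is on your path, and the exact API may differ:

import torch
from unixcoder import UniXcoder  # wrapper script shipped in the CodeBERT repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

def embed(snippet):
    # Tokenize in encoder-only mode and return a normalized snippet embedding
    tokens_ids = model.tokenize([snippet], max_length=512, mode="<encoder-only>")
    source_ids = torch.tensor(tokens_ids).to(device)
    _, snippet_embedding = model(source_ids)
    return torch.nn.functional.normalize(snippet_embedding, p=2, dim=1)

emb_a = embed("def f(a, b): return a if a > b else b")
emb_b = embed("def max2(x, y): return max(x, y)")

# Cosine similarity of two L2-normalized embeddings is their dot product
print((emb_a @ emb_b.T).item())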
