
Using CodeBERT for code based semantic search / clustering #13

Closed

JohnGiorgi opened this issue Oct 8, 2020 · 11 comments

Comments

@JohnGiorgi

JohnGiorgi commented Oct 8, 2020

Hi,

I am interested in using CodeBERT for semantic similarity / clustering on code, but my results are rather poor. Here is my process:

Download the data:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm codesearch_data.zip

Grab some examples to embed:

from pathlib import Path

max_instances = 8

# Each line is one example; fields are separated by <CODESPLIT>,
# with the code snippet in the last field.
valid = Path("data/codesearch/train_valid/python/valid.txt").read_text().splitlines()
code = [ex.split("<CODESPLIT>")[-1] for ex in valid][:max_instances]

Embed the examples:

import torch
from transformers import RobertaTokenizer, RobertaModel

# Load the model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
model = model.to(device)

# Prepare the inputs
inputs = tokenizer(
    code, padding=True, truncation=True, return_tensors="pt"
)
inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

# Embed the inputs; no gradients are needed for inference.
# (On transformers >= 4 the model returns a ModelOutput rather than a
# tuple, so index it instead of unpacking.)
with torch.no_grad():
    sequence_output = model(**inputs)[0]

# Use the [CLS] token embedding as the sequence representation
embeddings = sequence_output[:, 0, :]

Then I compute the cosine similarity between the first input's embedding and the remaining embeddings:

from torch.nn import CosineSimilarity

# Perform a cosine based semantic similarity search, using the first function as query 
sim = CosineSimilarity(dim=-1)
cosine = sim(embeddings[0], embeddings[1:])
scores, indices = cosine.topk(5)

print(f"Scores: {scores.tolist()}")
print()
print(f"Query:\n---\n{code[0]}")
print()
topk = '\n'.join([code[i] for i in indices])
print(f"Top K:\n---\n{topk}")

The output:

Scores: [0.9909096360206604, 0.9864522218704224, 0.9837372899055481, 0.9776582717895508, 0.9704807996749878]

Query:
---
def start_transaction ( self , sort , address , price = None , data = None , caller = None , value = 0 , gas = 2300 ) : assert self . _pending_transaction is None , "Already started tx" self . _pending_transaction = PendingTransaction ( sort , address , price , data , caller , value , gas )

Top K:
---
def remove_node ( self , id ) : if self . has_key ( id ) : n = self [ id ] self . nodes . remove ( n ) del self [ id ] # Remove all edges involving id and all links to it. for e in list ( self . edges ) : if n in ( e . node1 , e . node2 ) : if n in e . node1 . links : e . node1 . links . remove ( n ) if n in e . node2 . links : e . node2 . links . remove ( n ) self . edges . remove ( e )
def find_essential_genes ( model , threshold = None , processes = None ) : if threshold is None : threshold = model . slim_optimize ( error_value = None ) * 1E-02 deletions = single_gene_deletion ( model , method = 'fba' , processes = processes ) essential = deletions . loc [ deletions [ 'growth' ] . isna ( ) | ( deletions [ 'growth' ] < threshold ) , : ] . index return { model . genes . get_by_id ( g ) for ids in essential for g in ids }
async def play_now ( self , requester : int , track : dict ) : self . add_next ( requester , track ) await self . play ( ignore_shuffle = True )
def _handleAuth ( fn ) : @ functools . wraps ( fn ) def wrapped ( * args , * * kwargs ) : # auth, , authenticate users, internal from yotta . lib import auth # if yotta is being run noninteractively, then we never retry, but we # do call auth.authorizeUser, so that a login URL can be displayed: interactive = globalconf . get ( 'interactive' ) try : return fn ( * args , * * kwargs ) except requests . exceptions . HTTPError as e : if e . response . status_code == requests . codes . unauthorized : #pylint: disable=no-member logger . debug ( '%s unauthorised' , fn ) # any provider is sufficient for registry auth auth . authorizeUser ( provider = None , interactive = interactive ) if interactive : logger . debug ( 'retrying after authentication...' ) return fn ( * args , * * kwargs ) raise return wrapped
def write_log ( log_path , data , allow_append = True ) : append = os . path . isfile ( log_path ) islist = isinstance ( data , list ) if append and not allow_append : raise Exception ( 'Appending has been disabled' ' and file %s exists' % log_path ) if not ( islist or isinstance ( data , Args ) ) : raise Exception ( 'Can only write Args objects or dictionary' ' lists to log file.' ) specs = data if islist else data . specs if not all ( isinstance ( el , dict ) for el in specs ) : raise Exception ( 'List elements must be dictionaries.' ) log_file = open ( log_path , 'r+' ) if append else open ( log_path , 'w' ) start = int ( log_file . readlines ( ) [ - 1 ] . split ( ) [ 0 ] ) + 1 if append else 0 ascending_indices = range ( start , start + len ( data ) ) log_str = '\n' . join ( [ '%d %s' % ( tid , json . dumps ( el ) ) for ( tid , el ) in zip ( ascending_indices , specs ) ] ) log_file . write ( "\n" + log_str if append else log_str ) log_file . close ( )

Notice that the cosine similarity is very high for all top-5 examples, which is unexpected given that these examples were chosen arbitrarily. Manually inspecting them, they don't appear to be relevant to the query.

My questions:

  • Am I doing something wrong?
  • Is there a better way to do semantic similarity search / clustering with CodeBERT? Here I am following the canonical pipeline for sentence embeddings ([CLS] pooling; a mean-pooling variant is sketched after this list).
  • One possible source of error is the tokenization. Am I supposed to use the CodeBERT tokenizer on code, or just on text?
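
For reference, a minimal sketch of mean pooling over the token embeddings (masking out padding), a common alternative to the [CLS] embedding in sentence-embedding pipelines. It assumes the inputs and sequence_output tensors from the snippet above:

# Sketch: mean-pool token embeddings instead of taking the [CLS] token.
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
summed = (sequence_output * mask).sum(dim=1)   # sum over non-padding tokens
counts = mask.sum(dim=1).clamp(min=1)          # number of non-padding tokens
mean_pooled = summed / counts                  # (batch, hidden_size)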
@fengzhangyin
Collaborator

Hi @JohnGiorgi ,
CodeBERT is pre-trained with the masked language modeling and replaced token detection objectives. It needs to be fine-tuned to support downstream tasks, whereas you are using CodeBERT directly for semantic similarity / clustering without any fine-tuning.

We have released a pipeline for the Clone Detection task, which is similar to yours. Please refer to the website.
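
(For intuition, a minimal sketch of what such fine-tuning looks like: encode the pair of snippets jointly and train a binary classifier on top. This is illustrative only, not the released pipeline; CloneClassifier is a made-up name.)

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class CloneClassifier(nn.Module):
    # Illustrative pair classifier on top of CodeBERT (not the repo's model).
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("microsoft/codebert-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = outputs[0][:, 0, :]  # [CLS] embedding of the encoded pair
        return self.head(cls)      # clone / not-clone logits

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = CloneClassifier()

# One illustrative training step on a single (code1, code2, label) pair
batch = tokenizer("def f(a,b): return a+b", "def add(x,y): return x+y",
                  return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))  # 1 = clone
loss.backward()  # optimizer step omitted for brevity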

@JohnGiorgi
Author

I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task.

I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!

@fengzhangyin
Collaborator

Sorry, we don't have plans for that at the moment. You can use the released pipeline to fine-tune CodeBERT yourself; it shouldn't take too much time.

@JohnGiorgi
Author

JohnGiorgi commented Oct 8, 2020

Thanks a lot.

I just have two more questions:

  • Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
  • Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.

@guoday
Contributor

guoday commented Oct 9, 2020

> Thanks a lot.
>
> I just have two more questions:
>
>   • Did you fine-tune the model on 2 GPUs in the documentation here? I just want to make sure we are using the same effective batch size.
>   • Is there a list of programming languages that are in the POJ-104 dataset? I looked at the paper but I can't find it mentioned.
  1. Yes, we fine-tuned the model on 2 GPUs. See the last figure here for the training and inference cost.
  2. The POJ-104 dataset contains C/C++ programs, as mentioned in the figures here.

@guody5 guody5 closed this as completed Oct 12, 2020
@shaileshj2803

Hi @JohnGiorgi, I am trying to detect whether two pieces of code are similar using cosine similarity, much like what you described earlier. I would like to know whether you were able to fine-tune the model, and could you share the approach you took?
Thanks a lot

@JohnGiorgi
Author

Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!

@nashid
Contributor

nashid commented Jul 23, 2022

@shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. Curious to know what you ended up doing. Have you used CodeBERT for this purpose, or did you take an alternative approach?

@guoday
Contributor

guoday commented Jul 24, 2022

> @shaileshj2803 and @JohnGiorgi I am trying to do semantic code search based on cosine similarity. Curious to know what you ended up doing. Have you used CodeBERT for this purpose, or did you take an alternative approach?

Hi @nashid. I suggest you follow this readme: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl.

@nashid
Contributor

nashid commented Jul 24, 2022

@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets, without natural language.

So natural language, like a docstring, is not present in my case.

Would UniXcoder still be effective in my case?

@guoday
Contributor

guoday commented Jul 24, 2022

If you read the readme carefully, you will see that UniXcoder doesn't need natural language.
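
For illustration, a minimal sketch of code-to-code similarity along the lines of the UniXcoder README; it assumes the unixcoder.py wrapper from that repo is on your path, and the exact API may differ:

import torch
from unixcoder import UniXcoder  # wrapper script shipped in the CodeBERT repo

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)

def embed(snippet):
    # Tokenize in encoder-only mode and return a normalized snippet embedding
    tokens_ids = model.tokenize([snippet], max_length=512, mode="<encoder-only>")
    source_ids = torch.tensor(tokens_ids).to(device)
    _, snippet_embedding = model(source_ids)
    return torch.nn.functional.normalize(snippet_embedding, p=2, dim=1)

emb_a = embed("def f(a, b): return a if a > b else b")
emb_b = embed("def max2(x, y): return max(x, y)")

# Cosine similarity of two L2-normalized embeddings is their dot product
print((emb_a @ emb_b.T).item())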
