Using CodeBERT for code based semantic search / clustering #13
Comments
Hi @JohnGiorgi, we have released a new pipeline for the clone detection task, which is similar to your task. Please refer to the website.
I see. So https://huggingface.co/microsoft/codebert-base has not been fine-tuned on code search or a related task. I followed the link but I don't see a pretrained model. Is there a pretrained model available for this pipeline so I do not have to fine-tune it myself? If not, are there plans to release it? It would be great to have a CodeBERT fine-tuned for search on https://huggingface.co/models!
Sorry, we don't have plans for this at the moment. You can use the released pipeline to fine-tune CodeBERT yourself. It won't take you too much time.
Thanks a lot. I just have two more questions:
Hi @JohnGiorgi, I am trying to detect whether two code snippets are similar by using cosine similarity, much like what you described earlier. Were you able to fine-tune the model, and could you share the approach you took?
Hi @shaileshj2803, I didn't end up pursuing this, so I don't have any advice beyond what is in this thread!
@shaileshj2803 and @JohnGiorgi, I am trying to do semantic code search based on cosine similarity. Curious to know what you ended up doing. Did you use CodeBERT for this purpose, or did you take an alternative approach?
Hi, nashid. I suggest following this README: https://github.com/microsoft/CodeBERT/tree/master/UniXcoder#2-similarity-between-code-and-nl.
@guoday thanks for suggesting the link. However, please note that in my case I only have two code snippets and no natural language, so nothing like a docstring is available. Would UniXcoder still be effective in my case?
If you read the README carefully, you will see that UniXcoder doesn't need natural language.
Hi,
I am interested in using CodeBERT for semantic text similarity / clustering on code, but my results are rather poor. Here is my process:
Download the data:
Grab some examples to embed:
Embed the examples
Then I arbitrarily compute the cosine similarity between the first input's embedding and the embeddings of the rest of the inputs:
The output:
Notice that the cosine similarity is very high for the top-5 examples, which is unexpected as these examples were chosen randomly. Manually inspecting them, they don't appear to be very relevant to the query.
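For readers following along, the pool-and-rank step described above could be sketched roughly as follows. This is not the code from the original post, which is not preserved here. It assumes the embeddings are already available as plain lists of floats, and that mean pooling over non-padding tokens is used to get one vector per snippet. The helper names `mean_pool`, `cosine`, and `rank_by_cosine` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(token_vectors: Sequence[Sequence[float]],
              mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero.

    Assumption: mean pooling is one common way to turn per-token
    outputs into a single embedding; the original post may differ.
    """
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query: Sequence[float],
                   candidates: Sequence[Sequence[float]],
                   top_k: int = 5) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query, cand)) for i, cand in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

With real model outputs, `query` would be the pooled embedding of the first input and `candidates` the pooled embeddings of the remaining inputs.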
My questions: