<a href="https://colab.research.google.com/github/hjiang13/CodeBERT/blob/master/Trying_out_CodeBERT_%F0%9F%94%A5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trying out CodeBERT

We recently released [codeBERT](https://huggingface.co/codistai/codeBERT-small-v2) using the mighty HuggingFace library and platform. This short tutorial shows how to use it together with a utility library that we have also made available:



*   [code-bert](https://github.com/autosoft-dev/code-bert)


We have also released [tree-hugger](https://github.com/autosoft-dev/tree-hugger). A second tutorial will show how to set up a full pipeline: reading and mining the code (using tree-hugger), pre-processing it (using code-bert), and then using the model to run some experiments on it.

Stay tuned!



### First thing: Getting set up

In [2]:
!pip3 install -U -q tensorboard transformers

In [3]:
!pip3 install -q dpu-utils invoke


In [4]:
# Let's clone the small code-bert library, as there is no pip wheel for it

!git clone https://github.com/autosoft-dev/code-bert && cd code-bert && pip install -e .

Cloning into 'code-bert'...
remote: Enumerating objects: 383, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 383 (delta 12), reused 0 (delta 0), pack-reused 362
Receiving objects: 100% (383/383), 260.18 KiB | 4.41 MiB/s, done.
Resolving deltas: 100% (204/204), done.
Obtaining file:///content/code-bert
  Preparing metadata (setup.py) ... done
Installing collected packages: CodeBERT
  Running setup.py develop for CodeBERT
Successfully installed CodeBERT-0.2.0


In [5]:
!ls

code-bert  sample_data


# RESTART THE RUNTIME. HERE!! (⌘/Ctrl+M)

In [None]:
import os
# Kill the current kernel process (signal 9) so that Colab restarts the runtime
os.kill(os.getpid(), 9)

Make sure you restarted the runtime as instructed just above. Otherwise the following import will throw an error.
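One way to confirm that the restarted runtime can see the editable install is to ask Python whether it can locate the package before importing from it. This is only a small sketch: the helper name `is_importable` is made up for illustration, and `code_bert` is the package name used by the import in the next cell.

```python
import importlib.util


def is_importable(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# `code_bert` is the package installed from the cloned repository above.
if not is_importable("code_bert"):
    print("code_bert not found: restart the runtime and try again")
```

If the message is printed, the runtime has most likely not been restarted since the `pip install -e .` step.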

Now that we are all set up, we can move on to the code parsing and pre-tokenization step.


Let's import the `process_code` function from the `code_bert.core.data_reader` module.

In [1]:
from code_bert.core.data_reader import process_code

In [2]:
# We are going to read one of the example files that ship with the repository and see how the tokenization works

with open("code-bert/test_files/test_code_add.py") as f:
  code = f.read()
print(code)

def add(a, b):
    """
    sums two numbers and returns the result
    """
    return a + b


def return_all_even(lst):
    """
    numbers that are not really odd
    """
    if not lst:
        return None
    return [a for a in lst if a % 2 == 0]



OK, our code consists of two small functions: `add`, which takes two parameters and returns their sum, and `return_all_even`, which returns the even numbers from a list. Both are defined in the file `test_code_add.py` under the `code-bert/test_files` directory.

When we read it this way, from a Python point of view it is just a string.


Now, let's use `process_code` on the code.

In [3]:
process_code(code)

['def add ( a , b ) : indent',
 '""" sums two numbers and returns the result """',
 'return a + b',
 'dedent def return all even ( lst ) : indent',
 '""" numbers that are not really odd """',
 'if not lst : indent',
 'return none',
 'dedent return [ a for a in lst if a % 2 == 0 ] dedent']

As we can see from the above, the code is parsed in a specific way.

**Here are a few things to note**

*   The return value of `process_code` is a list. Each item in the list is one logical line of code, such as `def add ( a , b ) : indent`
*   All the code tokens are space-separated.
*   There are two special tokens, `indent` and `dedent`, marking where the code makes either an `INDENT` or a `DEDENT`. Both are essential because indentation defines the logical structure of Python code.
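To make the role of the two special tokens concrete, here is a minimal sketch that walks over the output shown above and tracks the block depth after each logical line. The helper name `depth_after_each_line` is made up for illustration, and the sketch assumes that `indent` and `dedent` never occur as ordinary identifiers or docstring words.

```python
# Output of `process_code(code)`, copied from the cell above.
logical_lines = [
    'def add ( a , b ) : indent',
    '""" sums two numbers and returns the result """',
    'return a + b',
    'dedent def return all even ( lst ) : indent',
    '""" numbers that are not really odd """',
    'if not lst : indent',
    'return none',
    'dedent return [ a for a in lst if a % 2 == 0 ] dedent',
]


def depth_after_each_line(lines):
    """Track block depth by counting `indent` and `dedent` tokens."""
    depths, depth = [], 0
    for line in lines:
        for token in line.split(" "):
            if token == "indent":
                depth += 1
            elif token == "dedent":
                depth -= 1
        depths.append(depth)
    return depths


print(depth_after_each_line(logical_lines))
# [1, 1, 1, 1, 1, 2, 2, 0]
```

The depth returns to 0 after the last line, which shows that every `indent` is matched by a `dedent`, so the block structure of the original file can be recovered from the flat token stream.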



## Testing the language model

Now we are going to load and test the MLM model on code.

In [4]:
from transformers import *

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


In [5]:
tokenizer = AutoTokenizer.from_pretrained("codistai/codeBERT-small-v2")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--codistai--codeBERT-small-v2/snapshots/01695bc17a6157b5e24cb003c8d0b0ce88c87894/config.json
Model config RobertaConfig {
  "_name_or_path": "codistai/codeBERT-small-v2",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 1024,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.38.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 60000
}



In [6]:
# AutoModelWithLMHead is deprecated; AutoModelForMaskedLM loads the same RobertaForMaskedLM model
model = AutoModelForMaskedLM.from_pretrained("codistai/codeBERT-small-v2")

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--codistai--codeBERT-small-v2/snapshots/01695bc17a6157b5e24cb003c8d0b0ce88c87894/pytorch_model.bin
Some weights of the model checkpoint at codistai/codeBERT-small-v2 were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of RobertaForMaskedLM were initialized from the model checkpoint at codistai/codeBERT-small-v2.
If your task is similar to the task the model of t

Great. Now we are ready to run some tests using the model.

To do that, we first write a utility function. Given a list of logical lines, it combines them (*optionally only the first `limit` lines, since the model's `max_position_embeddings` is 1024; note that `limit` counts logical lines, not model tokens*) and returns the combined string. If we ask it to, it also replaces the first occurrence of a given token with the special `<mask>` token.

In [7]:
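def combine_logical_lines_of_code(logical_lines, replace_mask_with=None, limit=None):
  # Keep only the first `limit` logical lines; keep all of them when no limit is given
  n_lines = limit if limit else len(logical_lines)
  combined_code = " ".join(logical_lines[:n_lines])
  if not replace_mask_with:
    return combined_code
  # Mask only the first occurrence (a plain substring match, not a token match)
  return combined_code.replace(replace_mask_with, "<mask>", 1)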

Now we use the utility function in composition with `process_code` to get the code string that we are later going to use for our model

In [8]:
c = combine_logical_lines_of_code(process_code(code), "b")
print(c)

def add ( a , <mask> ) : indent """ sums two numbers and returns the result """ return a + b dedent def return all even ( lst ) : indent """ numbers that are not really odd """ if not lst : indent return none dedent return [ a for a in lst if a % 2 == 0 ] dedent
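Masking `b` works here because the first `b` in the string happens to be the parameter itself. Since `str.replace` matches substrings, masking `a` the same way would hit the `a` inside `add` first. A possible token-aware variant is sketched below; `mask_first_token` is a hypothetical helper and not part of the code-bert library.

```python
def mask_first_token(logical_lines, target, mask_token="<mask>"):
    """Join logical lines and mask the first whole token equal to `target`."""
    tokens = " ".join(logical_lines).split(" ")
    for i, token in enumerate(tokens):
        if token == target:
            tokens[i] = mask_token
            break  # only the first matching token is masked
    return " ".join(tokens)


# A shortened, hand-written list of logical lines in the same format as above.
lines = ['def add ( a , b ) : indent', 'return a + b', 'dedent']

print(mask_first_token(lines, "a"))
# def add ( <mask> , b ) : indent return a + b dedent

print(" ".join(lines).replace("a", "<mask>", 1))
# def <mask>dd ( a , b ) : indent return a + b dedent
```

Because the tokens produced by `process_code` are space-separated, comparing whole tokens is enough to avoid the partial match.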


### Define a pipeline and then call it with the final string

In [9]:
p = pipeline('fill-mask', model=model, tokenizer=tokenizer)

In [10]:
p(c)[0]

{'score': 0.9800990223884583,
 'token': 321,
 'token_str': ' b',
 'sequence': 'def add ( a, b ) : indent """ sums two numbers and returns the result """ return a + b dedent def return all even ( lst ) : indent """ numbers that are not really odd """ if not lst : indent return none dedent return [ a for a in lst if a % 2 == 0 ] dedent'}

Whoa!! It correctly predicted the masked token to be `b` with about 98% certainty :) 🎉 👍🏽 🔥 💥
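Each prediction is a plain dictionary, so it is easy to turn into a readable summary. The sketch below formats the result shown above; `describe` is a name made up for illustration, and the dictionary is copied by hand from the output rather than produced by running the model.

```python
# First prediction returned by `p(c)[0]` above (the `sequence` field is left out for brevity).
prediction = {"score": 0.9800990223884583, "token": 321, "token_str": " b"}


def describe(pred):
    """Format one fill-mask prediction as 'token (probability)'."""
    # `token_str` carries a leading space from the tokenizer, so strip it.
    return f"{pred['token_str'].strip()} ({pred['score']:.1%})"


print(describe(prediction))
# b (98.0%)
```

Calling `p(c)` without the `[0]` index returns a list of such dictionaries, so the same helper can be applied to each entry to compare the alternatives the model considered.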