# Using a pre-trained model from Hugging Face
This notebook uses a model pre-trained on the examples from the Singleton paper and on the WolfSSL code. The model is available on Hugging Face. Before loading it, the notebook walks through the same workflow with publicly available models: CodeBERT, CodeParrot and CodeT5.

Our model is a BERT-style model with a single layer of 768 hidden units. It was trained for 100 epochs with a batch size of 64 on a 2080 GPU with 8 GB of memory.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Downloading the model

When we use a model from Hugging Face, we do not have to train a tokenizer or assemble the other elements ourselves. We load the matching tokenizer and the model weights by name and use them directly.


In [None]:
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the tokenizer for the pretrained CodeBERT model
tokenizer = AutoTokenizer.from_pretrained('microsoft/codebert-base')

# load the model
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

# Using the right pipeline

Every model has a set of pipelines defined for it, depending on what the model can do. For example, a model that is trained for text generation will have a pipeline for text generation.

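The model we are using is a BERT-style model trained with masked language modelling, that is, to predict tokens hidden behind a mask. The matching task is therefore `fill-mask`, and we pass it to the `pipeline` function together with the model and the tokenizer.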

In [None]:
# import the pipeline 
from transformers import pipeline

# load the pipeline for filling in masked tokens
fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=tokenizer
)

# define the code snippet
code_snippet = '''def HelloWorld():
'''
# ask the model to fill in the masked token
generatedCode = fill_mask(code_snippet + "return '<mask>'")

# print the suggestions for the masked token
generatedCode
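
The mask placeholder depends on the tokenizer: RoBERTa-style models such as CodeBERT use `<mask>`, while the original BERT checkpoints use `[MASK]`. The following is a minimal sketch of keeping the placeholder out of the prompt text. The helper `build_masked_prompt` and the `{mask}` slot marker are hypothetical names introduced for illustration and are not part of the transformers library. In the notebook, the mask string would come from `tokenizer.mask_token`.

```python
def build_masked_prompt(template: str, mask_token: str, slot: str = "{mask}") -> str:
    """Replace the single slot marker in the template with the model's mask token."""
    # hypothetical helper, not part of transformers
    if template.count(slot) != 1:
        raise ValueError("the template must contain exactly one slot marker")
    return template.replace(slot, mask_token)


# the same template works for tokenizers with different mask tokens
template = "def HelloWorld():\n    return '{mask}'"
roberta_prompt = build_masked_prompt(template, "<mask>")  # CodeBERT-style
bert_prompt = build_masked_prompt(template, "[MASK]")     # original BERT style
print(roberta_prompt)
```

With a loaded tokenizer, the call would look like `fill_mask(build_masked_prompt(template, tokenizer.mask_token))`.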

In [None]:
# now, let's do the same for the CodeParrot model
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# load the tokenizer for the pretrained CodeParrot model
tokenizer = AutoTokenizer.from_pretrained('codeparrot/codeparrot')

# load the model; CodeParrot is a causal language model,
# so we use AutoModelForCausalLM instead of the deprecated AutoModelWithLMHead
model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot")

# load the pipeline for text generation, reusing the model and tokenizer loaded above
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# define the code snippet
code_snippet = '''def HelloWorld():
'''
# generate the code
generatedCode = pipe(code_snippet)

# print the suggestions for the generated code
generatedCode
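
The text-generation pipeline returns a list of dictionaries, each with a `generated_text` entry that by default contains the prompt followed by the continuation. The following is a minimal sketch of separating the two. The helper `continuation_only` is a hypothetical name, and the `outputs` list is a hand-written placeholder in the pipeline's output format, not real output from CodeParrot.

```python
def continuation_only(prompt, outputs):
    """Return each generated text without the leading prompt."""
    results = []
    for item in outputs:
        text = item["generated_text"]
        # strip the prompt only if the text really starts with it
        results.append(text[len(prompt):] if text.startswith(prompt) else text)
    return results


prompt = "def HelloWorld():\n"
# placeholder in the pipeline's output format, NOT real model output
outputs = [{"generated_text": prompt + "    print('Hello World')\n"}]
print(continuation_only(prompt, outputs))
```

In the notebook, the call would be `continuation_only(code_snippet, generatedCode)`. The pipeline also accepts `return_full_text=False`, which should return only the continuation.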

# Using the pipeline to understand code similarity

We do not have to generate the code, as long as we can just generate the embeddings for the code. We can then use the embeddings to find the similarity between the code.

In [None]:
# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the models are already pre-defined, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model="microsoft/codebert-base",
    tokenizer="microsoft/codebert-base",
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class SingletonX1')

# print the embedding of the first token (<s>, the RoBERTa counterpart of [CLS]),
# which is often used as an approximation of the embedding of the whole snippet;
# an alternative is to average all token vectors with np.mean(lstFeatures[0], axis=0)
lstFeatures[0][0]
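
Once each code snippet is reduced to one vector, the similarity between two snippets can be computed as the cosine of the angle between their vectors. The following is a minimal sketch in plain Python. The functions `mean_pool` and `cosine_similarity` are illustrative helpers, and the toy vectors (2 tokens with 3 dimensions each) are stand-ins for the per-token lists in `features(...)[0]`, which have 768 dimensions for CodeBERT.

```python
import math


def mean_pool(token_vectors):
    """Average the per-token vectors into one vector for the whole snippet."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# toy stand-ins for features(snippet)[0]; NOT real embeddings
tokens_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
tokens_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]
tokens_c = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]

vec_a, vec_b, vec_c = mean_pool(tokens_a), mean_pool(tokens_b), mean_pool(tokens_c)
print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: orthogonal vectors
```

The same functions apply to the first-token embedding: pass `features(snippet)[0][0]` directly to `cosine_similarity` instead of the pooled vector.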

## Using models for code summarization

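CodeT5 is an encoder-decoder model for source code. The checkpoint used below is fine-tuned for summarizing source code: it takes a function as input and generates a short natural-language description of it.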

An example model is available here: https://huggingface.co/Salesforce/codet5-base-multi-sum 

In [2]:
from transformers import RobertaTokenizer, T5ForConditionalGeneration


tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

text = """def svg_to_image(string, size=None):
if isinstance(string, unicode):
    string = string.encode('utf-8')
    renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
if not renderer.isValid():
    raise ValueError('Invalid SVG data.')
if size is None:
    size = renderer.defaultSize()
    image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
    painter = QtGui.QPainter(image)
    renderer.render(painter)
return image"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

# this prints a one-line summary of the function, here: "Converts a SVG string to a QImage."

Converts a SVG string to a QImage.


## Using the pre-trained model (from our task)

The model is available here: https://huggingface.co/mstaron/wolfBERTa 

In [6]:
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the tokenizer and the model for the pretrained wolfBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/wolfBERTa')

# load the model
model = AutoModelForMaskedLM.from_pretrained("mstaron/wolfBERTa")

from transformers import pipeline
unmasker = pipeline('fill-mask', 
                    model=model, 
                    tokenizer=tokenizer)
unmasker("int x = <mask>")

[{'score': 0.20149990916252136,
  'token': 274,
  'token_str': ' {',
  'sequence': 'int x = {'},
 {'score': 0.05696285516023636,
  'token': 325,
  'token_str': ' b',
  'sequence': 'int x = b'},
 {'score': 0.050293806940317154,
  'token': 265,
  'token_str': ' 0',
  'sequence': 'int x = 0'},
 {'score': 0.03858976066112518,
  'token': 522,
  'token_str': ' 2',
  'sequence': 'int x = 2'},
 {'score': 0.030387546867132187,
  'token': 278,
  'token_str': ' (',
  'sequence': 'int x = ('}]
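
The fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`, as in the output above. The following is a minimal sketch of post-processing that list, using the output above as input. The helper `best_suggestions` and the threshold of 0.05 are assumptions made for illustration, not part of the transformers library.

```python
def best_suggestions(predictions, min_score=0.05):
    """Keep predictions scoring at least min_score, highest score first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    # token_str carries a leading space, so strip it for display
    return [(p["token_str"].strip(), round(p["score"], 3)) for p in kept]


# the output of unmasker("int x = <mask>") shown above
predictions = [
    {'score': 0.20149990916252136, 'token': 274, 'token_str': ' {', 'sequence': 'int x = {'},
    {'score': 0.05696285516023636, 'token': 325, 'token_str': ' b', 'sequence': 'int x = b'},
    {'score': 0.050293806940317154, 'token': 265, 'token_str': ' 0', 'sequence': 'int x = 0'},
    {'score': 0.03858976066112518, 'token': 522, 'token_str': ' 2', 'sequence': 'int x = 2'},
    {'score': 0.030387546867132187, 'token': 278, 'token_str': ' (', 'sequence': 'int x = ('},
]

suggestions = best_suggestions(predictions)
print(suggestions)
```

With this threshold, three of the five suggestions remain: `{`, `b` and `0`.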