# Using a pre-trained model to make inferences

This workbook is based on the book by Denis Rothman. 

Copyright 2021, Denis Rothman. Denis Rothman adapted a Hugging Face reference notebook to pretrain a transformer model. The next steps would be to work on building a larger dataset and testing several transformer models. 

## What's in this workbook
In contrast to the original workbook, this one contains only the inference pipeline. It loads a pre-trained model and uses it to create embeddings for the source code. 

### The original Hugging Face reference and notes

Notebook edition of the Hugging Face reference blog post ([original](https://huggingface.co/blog/how-to-train)).


In [1]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Define the model

In [2]:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

2023-02-06 16:37:14.553173: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-02-06 16:37:14.553211: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [3]:
print(config)

RobertaConfig {
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
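
Two numbers in the printed configuration are easier to interpret after a little arithmetic. The sketch below copies the values from the output above into plain variables (it does not load the model) and derives the size of each attention head and the number of positions left for real tokens. The position calculation assumes the usual RoBERTa convention of numbering positions from `pad_token_id + 1`.

```python
# Values copied from the printed RobertaConfig above.
hidden_size = 768
num_attention_heads = 12
max_position_embeddings = 514
pad_token_id = 1

# Each attention head works on an equal slice of the hidden vector,
# so the hidden size must be divisible by the number of heads.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# Assumption: positions are numbered starting at pad_token_id + 1,
# so the first pad_token_id + 1 rows of the position table
# are not available for real tokens.
usable_positions = max_position_embeddings - (pad_token_id + 1)

print(head_dim)          # 64
print(usable_positions)  # 512
```

This is why `max_position_embeddings=514` corresponds to a practical input length of 512 tokens.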



# Loading the tokenizer in transformers

In [4]:
#@title Step 8: Re-creating the Tokenizer in Transformers
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("./SingletonSSLBERT", max_length=512)

# Extracting embeddings

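When we extract the embeddings, the feature extraction pipeline returns the hidden states of the last layer of the model: one 768-dimensional vector per input token.
The first element of that list is the embedding of the first token, `<s>`, which in RoBERTa plays the role of BERT's [CLS] token. We use this vector as a representation of the entire sentence. It is not the average over the tokens; averaging the token vectors (mean pooling) is a common alternative that produces a different vector.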
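
To make the difference between the first-token vector and a mean-pooled vector concrete, here is a minimal sketch on a toy array. The nested list `toy_features` is a made-up stand-in for the pipeline output (one input, three tokens, four dimensions instead of 768), and both helper functions are hypothetical names introduced only for this illustration.

```python
# Toy stand-in for the pipeline output: [input][token][dimension].
# The real model returns 768 values per token; the numbers here are made up.
toy_features = [[
    [1.0, 0.0, 2.0, 4.0],  # first token (<s>)
    [3.0, 2.0, 0.0, 0.0],  # a regular token
    [2.0, 4.0, 1.0, 2.0],  # last token (</s>)
]]


def first_token_embedding(features):
    """Return the vector of the first token of the first input."""
    return features[0][0]


def mean_pooled_embedding(features):
    """Average the token vectors of the first input, dimension by dimension."""
    token_vectors = features[0]
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


first_vec = first_token_embedding(toy_features)
mean_vec = mean_pooled_embedding(toy_features)

print(first_vec)  # [1.0, 0.0, 2.0, 4.0]
print(mean_vec)   # [2.0, 2.0, 1.0, 2.0]
```

The two vectors differ, so the choice between them is a modelling decision rather than a matter of notation.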

Please note that we do not check whether the number of tokens in the input is within the limit of the model input (512).
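
Such a check could look like the sketch below. It works on a plain list of token ids, such as the one returned by `tokenizer.encode`, and is not part of the original workbook. The helper name `truncate_token_ids` and the choice to keep the final end-of-sequence id are assumptions made for illustration.

```python
MAX_MODEL_TOKENS = 512  # input limit mentioned in the text


def truncate_token_ids(token_ids, max_tokens=MAX_MODEL_TOKENS):
    """Return (ids, was_truncated).

    If the sequence is too long, keep the first max_tokens - 1 ids and
    re-append the last id (assumed to be the end-of-sequence token).
    """
    if len(token_ids) <= max_tokens:
        return list(token_ids), False
    shortened = list(token_ids[:max_tokens - 1]) + [token_ids[-1]]
    return shortened, True


# A short input passes through unchanged.
short_ids, short_cut = truncate_token_ids([0, 3035, 273, 60, 21, 2])

# A made-up long input (602 ids) is cut down to the limit.
long_ids, long_cut = truncate_token_ids([0] + [5] * 600 + [2])

print(len(short_ids), short_cut)  # 6 False
print(len(long_ids), long_cut)    # 512 True
```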

In [5]:
# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the models are already pre-defined, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model="./SingletonSSLBERT",
    tokenizer="./SingletonSSLBERT", 
    return_tensors=False  # return nested Python lists rather than tensors
)

Some weights of the model checkpoint at ./SingletonSSLBERT were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at ./SingletonSSLBERT and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# tokenize the input string to check how many tokens we get
strInput = 'Class SingletonX1'
tokenizedInput = tokenizer.encode(strInput)

tokens = tokenizer.convert_ids_to_tokens(tokenizedInput)

# print the tokens
print(tokens)

# and print their encoding (BPE in this case)
print(tokenizedInput)

['<s>', 'Class', 'ĠSingleton', 'X', '1', '</s>']
[0, 3035, 273, 60, 21, 2]
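
In the token list above, `<s>` and `</s>` are the start and end markers added by the tokenizer, and the `Ġ` prefix marks a token that was preceded by a space in the input. The sketch below reverses just these two conventions to recover the original string. It is a simplification written for this illustration (the helper `tokens_to_text` is hypothetical and does not undo the full byte-level mapping of the tokenizer).

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def tokens_to_text(tokens):
    """Simplified reconstruction: drop special tokens, turn 'Ġ' into a space."""
    pieces = [token for token in tokens if token not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ")


tokens = ['<s>', 'Class', 'ĠSingleton', 'X', '1', '</s>']
text = tokens_to_text(tokens)
print(text)  # Class SingletonX1
```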


In [7]:
# extract the features == embeddings
lstFeatures = features('Class SingletonX1')

# print the embedding of the first token (<s>, RoBERTa's counterpart of [CLS]),
# which we use as a representation of the whole sentence;
# mean pooling, np.mean(lstFeatures[0], axis=0), is an alternative that gives a different vector
lstFeatures[0][0]

[-0.30440953373908997,
 3.4135632514953613,
 0.2808285653591156,
 -1.3889353275299072,
 -0.3860173225402832,
 -0.9159358739852905,
 0.2484295517206192,
 -1.1038519144058228,
 1.6971060037612915,
 0.5052031874656677,
 1.2371865510940552,
 1.216167688369751,
 2.116595983505249,
 0.5468905568122864,
 0.7566933631896973,
 1.3500117063522339,
 -0.7627058625221252,
 -1.4250905513763428,
 -0.641603946685791,
 0.6625863313674927,
 -1.0787094831466675,
 -0.31401994824409485,
 1.0479545593261719,
 -0.5259291529655457,
 -0.1737566590309143,
 -0.3669564723968506,
 -0.3512457311153412,
 -1.8327451944351196,
 2.205744981765747,
 -1.0959172248840332,
 -0.40989235043525696,
 -0.6881439685821533,
 -0.4288038909435272,
 -1.785530686378479,
 1.0144950151443481,
 0.06939089298248291,
 -0.08995966613292694,
 -0.3894636631011963,
 -1.082709789276123,
 -0.1534804254770279,
 -0.4671172499656677,
 0.22012339532375336,
 0.8331872224807739,
 -1.363080620765686,
 1.233833909034729,
 -1.6185020208358765,
 -0.21296

In [8]:
# now, let's encode another, nearly identical input string
# to see how close its embedding is to the first one
strInput = 'Class SingletonX2'
tokenizedInput = tokenizer.encode(strInput)

tokens = tokenizer.convert_ids_to_tokens(tokenizedInput)
print(tokens)

lstFeatures2 = features('Class SingletonX2')

['<s>', 'Class', 'ĠSingleton', 'X', '2', '</s>']


In [9]:
# compare the two embeddings using cosine distance (1 - cosine similarity)
from scipy import spatial
result1 = spatial.distance.cosine(lstFeatures[0][0], lstFeatures2[0][0])

In [10]:
# encode a third input string which has nothing to do with the Singleton pattern
strInput = 'printf("Hello World");'

tokenizedInput = tokenizer.encode(strInput)

tokens = tokenizer.convert_ids_to_tokens(tokenizedInput)
print(tokens)

lstFeatures3 = features('printf("Hello World");')

# compare the first and the third embedding using cosine distance (1 - cosine similarity)
result2 = spatial.distance.cosine(lstFeatures[0][0], lstFeatures3[0][0])



['<s>', 'printf', '("', 'Hello', 'ĠW', 'orld', '");', '</s>']


In [11]:
print(f'Distance between the two singletons: {result1:.3f}; between a singleton and a non-singleton: {result2:.3f}')

Distance between the two singletons: 0.079; between a singleton and a non-singleton: 0.111
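
The numbers above are cosine distances, that is, one minus the cosine similarity, so smaller values mean more similar vectors. The following sketch spells the formula out in plain Python on toy vectors. It is meant to illustrate the definition and is not a replacement for `scipy.spatial.distance.cosine`.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 - (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy vectors chosen to show the range of the measure.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # 2

print(round(same_direction, 6), orthogonal, opposite)
```

The distance ranges from 0 (same direction) to 2 (opposite direction), so both 0.079 and 0.111 indicate vectors that point in fairly similar directions.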


## Inference for a single file 

In this part, we read the file line by line and extract the embedding for each line. We then save the embeddings to a file.

In [19]:
# read the file from the data directory
with open('/mnt/c/users/miros/documents/data/wolfssl-master/src/internal.c', 'r') as f:
    lstLines = f.readlines()

# now go through all the lines and extract embeddings
# note: the dictionary is keyed by the line text, so repeated lines
# (e.g. blank lines or lone braces) overwrite each other and are stored once
dictEmbeddings = {}

# enumerate provides the line counter, starting at 1
for i, strLine in enumerate(lstLines, start=1):

    # print the progress
    if i % 1000 == 0:
        print(f'Processed {i} lines of {len(lstLines)}')

    # extract the features == embeddings
    lstFeatures = features(strLine)

    # get the embedding of the first token (<s>, RoBERTa's counterpart of [CLS]),
    # used here as a representation of the whole line;
    # mean pooling, np.mean(lstFeatures[0], axis=0), is an alternative
    # that gives a different vector
    lstEmbedding = lstFeatures[0][0]

    # store the embedding in the dictionary
    dictEmbeddings[strLine] = lstEmbedding

Processed 1000 lines of 37009
Processed 2000 lines of 37009
Processed 3000 lines of 37009
Processed 4000 lines of 37009
Processed 5000 lines of 37009
Processed 6000 lines of 37009
Processed 7000 lines of 37009
Processed 8000 lines of 37009
Processed 9000 lines of 37009
Processed 10000 lines of 37009
Processed 11000 lines of 37009
Processed 12000 lines of 37009
Processed 13000 lines of 37009
Processed 14000 lines of 37009
Processed 15000 lines of 37009
Processed 16000 lines of 37009
Processed 17000 lines of 37009
Processed 18000 lines of 37009
Processed 19000 lines of 37009
Processed 20000 lines of 37009
Processed 21000 lines of 37009
Processed 22000 lines of 37009
Processed 23000 lines of 37009
Processed 24000 lines of 37009
Processed 25000 lines of 37009
Processed 26000 lines of 37009
Processed 27000 lines of 37009
Processed 28000 lines of 37009
Processed 29000 lines of 37009
Processed 30000 lines of 37009
Processed 31000 lines of 37009
Processed 32000 lines of 37009
Processed 33000 l
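
One property of the loop above is worth keeping in mind: because the dictionary is keyed by the text of the line, identical lines share a single entry. The sketch below contrasts this with keying by line number. The function `fake_embed` is a hypothetical stand-in for the model call so that the example runs without the model, and the input lines are made up.

```python
def fake_embed(text):
    # Hypothetical stand-in for features(text)[0][0]; NOT the real model.
    return [float(len(text)), float(text.count(" "))]


# Made-up source lines; "int a;" occurs three times.
lines = ["int a;\n", "return 0;\n", "int a;\n", "int a;\n"]

by_text = {}    # keyed by line text, as in the loop above
by_number = {}  # keyed by 1-based line number

for line_no, line in enumerate(lines, start=1):
    vector = fake_embed(line)
    by_text[line] = vector
    by_number[line_no] = (line, vector)

print(len(by_text))    # 2
print(len(by_number))  # 4
```

Keying by text stores each distinct line once, which also means that any later average is taken over distinct lines. Keying by line number keeps one entry per physical line.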

In [23]:
import pandas as pd
dfEmbeddings = pd.DataFrame.from_dict(dictEmbeddings, orient='index')
dfEmbeddings.head()

# save the embeddings to a file
dfEmbeddings.to_csv('./embeddings_lines.csv', sep='$')

In [25]:
# embedding for the entire file,
# computed as the average of the embeddings
# of the individual (distinct) lines stored above
import numpy as np
lstEmbedding = np.mean(dfEmbeddings.values, axis=0)
dictEmbeddingFile = {'internal.c': lstEmbedding}

dfEmbeddingFile = pd.DataFrame.from_dict(dictEmbeddingFile, orient='index')
dfEmbeddingFile.head()

# save the embeddings to a file
dfEmbeddingFile.to_csv('./embeddings_file.csv', sep='$')
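
The files are written with `$` as the separator, so they have to be read back with the same delimiter. A minimal sketch using only the standard library is shown below. The in-memory `sample` imitates the layout of the saved file (an empty header cell for the index, then the column numbers) with made-up values and three dimensions instead of 768, and `read_embeddings` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the layout of embeddings_file.csv.
sample = "$0$1$2\ninternal.c$0.5$-1.25$2.0\n"


def read_embeddings(handle, delimiter="$"):
    """Read rows of the form name$v0$v1$... into a dict of float lists."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader)  # skip the header row
    return {row[0]: [float(value) for value in row[1:]] for row in reader}


embeddings = read_embeddings(io.StringIO(sample))
print(embeddings)  # {'internal.c': [0.5, -1.25, 2.0]}
```

For the real file, the same function could be called on `open('./embeddings_file.csv', newline='')` instead of the in-memory sample.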