<a href="https://colab.research.google.com/github/mostafa-ja/Anomaly-detection/blob/main/swisslog_semantic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install transformers



In [4]:
from transformers import BertTokenizer, BertModel
import torch


[BERT pre-trained models (google-research/bert README)](https://github.com/google-research/bert#pre-trained-models)

Pre-trained models

We are releasing the BERT-Base and BERT-Large models from the paper. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers. Cased means that the true case and accent markers are preserved. Typically, the Uncased model is better unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
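
To make the "uncased" preprocessing concrete, here is a minimal sketch of lowercasing plus accent stripping using only the standard library. The function name `uncased_normalize` is made up for illustration; it approximates the normalization step only and does not perform WordPiece tokenization.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Lowercase text and strip accent marks (approximation of 'uncased' preprocessing)."""
    lowered = text.lower()
    # NFD splits an accented character into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn") and keep the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(uncased_normalize("John Smith"))   # john smith
print(uncased_normalize("Café Zürich"))  # cafe zurich
```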

These models are all released under the same license as the source code (Apache 2.0).

For information about the Multilingual and Chinese model, see the Multilingual README.

When using a cased model, make sure to pass --do_lower_case=False to the training scripts. (Or pass do_lower_case=False directly to FullTokenizer if you're using your own script.)

The released models (download links are in the README linked above):

BERT-Large, Uncased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Large, Cased (Whole Word Masking): 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters

BERT-Base, Multilingual Cased (New, recommended): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

BERT-Base, Multilingual Uncased (Orig, not recommended; use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

In [3]:
#Uncased means that the text has been lowercased before WordPiece tokenization
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)


In [4]:
sentences = ['Another sentence goes here.']

# Tokenize and convert to tensor
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass through the model
with torch.no_grad():
    outputs = model(**inputs)




In [5]:
print(inputs)

{'input_ids': tensor([[ 101, 2178, 6251, 3632, 2182, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}


The outputs variable is a dictionary-like model output object whose fields include:

last_hidden_state: A tensor of shape (batch_size, sequence_length, hidden_size) containing the contextual embedding of each token. Here there is one sentence of 7 tokens (including the [CLS] and [SEP] markers, ids 101 and 102) and bert-base-uncased has a hidden size of 768, so the shape is (1, 7, 768).

pooler_output: A tensor of shape (batch_size, hidden_size) representing the whole input. It is not a mean or max over the tokens: it is the final hidden state of the first token ([CLS]) passed through a linear layer and a tanh activation.

hidden_states: A tuple of tensors holding the output of the embedding layer and of every encoder layer (13 tensors for BERT-Base). It is only returned when the model is called with output_hidden_states=True, which is why it does not appear in the keys printed below.
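
The fields above can be checked without downloading any weights. The sketch below builds a tiny, randomly initialised BERT (the sizes and token ids are arbitrary assumptions, not the notebook's model) and inspects the three outputs, including how `pooler_output` relates to the first token's hidden state.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT with made-up sizes, so nothing is downloaded.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config).eval()

input_ids = torch.tensor([[1, 5, 6, 7, 2]])  # one "sentence" of 5 arbitrary token ids

with torch.no_grad():
    out = tiny_model(input_ids=input_ids, output_hidden_states=True)
    # Recompute the pooled vector by hand: first token -> Linear -> tanh.
    first_token = out.last_hidden_state[:, 0]
    manual_pooled = torch.tanh(tiny_model.pooler.dense(first_token))

print(out.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
print(out.pooler_output.shape)      # (batch_size, hidden_size)
print(len(out.hidden_states))       # embedding output + one entry per encoder layer
print(torch.allclose(manual_pooled, out.pooler_output, atol=1e-6))
```

With the real `bert-base-uncased` checkpoint the same call pattern applies; only the sizes differ.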

In [7]:
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [8]:
outputs.last_hidden_state.shape   #outputs['last_hidden_state'].shape

torch.Size([1, 7, 768])

In [11]:
outputs.last_hidden_state.mean(dim=1).shape # average over the token dimension: one vector per sentence

torch.Size([1, 768])

In [10]:
outputs.pooler_output.shape

torch.Size([1, 768])

In [12]:
a = ['Receiving', 'block', '<*>' ]
sen = ' '.join(a)
sen

'Receiving block <*>'

In [5]:
from transformers import BertTokenizer, BertModel
import torch

def vectorize_sentences(sentences, model_name='bert-base-uncased'):
    # Load the pre-trained BERT tokenizer and model (note: this runs on every call)
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    # Tokenize and convert to tensors for all sentences
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    # Forward pass through the model
    with torch.no_grad():
        outputs = model(**inputs)

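    # Extract sentence embeddings as the mean over the token dimension.
    # Note: with padding=True this plain mean also averages over [PAD] positions
    # whenever the sentences in a batch have different lengths.
    sentence_embeddings = outputs.last_hidden_state.mean(dim=1)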

    return sentence_embeddings
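
Because the tokenizer is called with `padding=True`, a plain `mean(dim=1)` also averages the vectors at padded positions when sentences in a batch differ in length. A common alternative is to weight the average by the attention mask. The sketch below shows that idea on toy tensors; `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
import torch

def masked_mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real tokens only (attention_mask == 1)."""
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)                    # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (batch, 1), avoids division by zero
    return summed / counts

# Toy batch: 2 sentences, 3 positions, hidden size 2.
# The last position of the second sentence is padding.
token_embeddings = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
                                 [[2.0, 2.0], [4.0, 4.0], [9.0, 9.0]]])
attention_mask = torch.tensor([[1, 1, 1],
                               [1, 1, 0]])

plain = token_embeddings.mean(dim=1)                         # padded position is included
masked = masked_mean_pool(token_embeddings, attention_mask)  # padded position is ignored
print(plain)
print(masked)
```

In the function above this would correspond to passing `outputs.last_hidden_state` and `inputs['attention_mask']`. For a single sentence there is no padding, so both versions give the same result.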


In [2]:
sentences_to_vectorize = [
    "This is the first sentence.",
    "Another sentence goes here.",
    "And a third sentence for vectorization."
]

# Get the vector representations of the sentences
vector_representations = vectorize_sentences(sentences_to_vectorize)

# Print the vector representations
print(vector_representations)

tensor([[-0.2604, -0.2182,  0.2819,  ...,  0.0782,  0.1706,  0.0177],
        [ 0.3117,  0.0221,  0.1525,  ...,  0.0644,  0.0831,  0.2578],
        [-0.1589, -0.2530, -0.2959,  ..., -0.0673, -0.1896,  0.2215]])


In [3]:
vector_representations.shape

torch.Size([3, 768])

In [4]:
from tqdm import tqdm
import pickle

t2wPath = '/content/templates.pkl'
outputPath = '/content/bert_encoding.pkl'

with open(t2wPath, 'rb') as f:
    data = pickle.load(f)
# Join each template's word list into one space-separated sentence and encode it
templateSentence = dict()
for i, v in tqdm(data.items()):
    sen = ' '.join(v)
    templateSentence[i] = vectorize_sentences(sen)

with open(outputPath, 'wb') as f:
    pickle.dump(templateSentence,f)

print('Successfully Finished BERT Encoding')

100%|██████████| 35/35 [01:25<00:00,  2.46s/it]

Successfully Finished BERT Encoding
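
The loop above calls `vectorize_sentences` once per template, and that function reloads the tokenizer and model on every call, which is probably where most of the roughly 2.5 s per item goes. One possible restructuring is to build all sentences first and encode them in a single call. The sketch below illustrates only the data flow: `encode_templates` and `fake_encode` are hypothetical names, the second template is toy data, and the stand-in encoder returns word and character counts instead of real embeddings.

```python
def encode_templates(templates, encode_fn):
    """Join each template's words into a sentence and encode all sentences in one call.

    `encode_fn` is an assumed callable: list[str] -> sequence with one vector per sentence.
    """
    keys = list(templates)
    sentences = [" ".join(templates[k]) for k in keys]
    vectors = encode_fn(sentences)  # single call, so a model would be loaded and run once
    return {k: vectors[i] for i, k in enumerate(keys)}

# Stand-in encoder for illustration only: [word_count, character_count] per sentence.
def fake_encode(sentences):
    return [[len(s.split()), len(s)] for s in sentences]

toy_templates = {0: ["Receiving", "block", "<*>"], 1: ["Verification", "succeeded"]}
encoded = encode_templates(toy_templates, fake_encode)
print(encoded)
```

If a real encoder is plugged in and the batch is padded, the pooling step should ignore padded positions, otherwise batched results will differ from the one-sentence-at-a-time results.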





In [6]:
templateSentence[0].shape

torch.Size([1, 768])

In [7]:
templateSentence.keys()

dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])

[sentence-transformers/bert-base-nli-mean-tokens model card](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens)

[All sentence-transformers models on Hugging Face](https://huggingface.co/sentence-transformers?sort_models=downloads#models)

In [1]:
!pip install -U sentence-transformers



In [2]:
from sentence_transformers import SentenceTransformer, util
# Background on this library and its models: https://pypi.org/project/sentence-transformers/0.3.2/
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')


In [6]:
sentences = ["open the file", "run the file", "close the file", "file is not found","file is corrupted"]
embeddings1 = model.encode(sentences)
print(embeddings1.shape)

(5, 768)


In [7]:
embeddings2 = vectorize_sentences(sentences)
print(embeddings2.shape)

torch.Size([5, 768])


In [8]:
util.cos_sim(embeddings1, embeddings1)

tensor([[1.0000, 0.8163, 0.4640, 0.1087, 0.3439],
        [0.8163, 1.0000, 0.5776, 0.2320, 0.4698],
        [0.4640, 0.5776, 1.0000, 0.5535, 0.5734],
        [0.1087, 0.2320, 0.5535, 1.0000, 0.5736],
        [0.3439, 0.4698, 0.5734, 0.5736, 1.0000]])
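
For reference, a pairwise cosine-similarity matrix like the one above can be reproduced with plain PyTorch. This is a rough equivalent for illustration, not the library's own code; `cosine_similarity_matrix` is a made-up name and the vectors are toy values.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Return the matrix S with S[i, j] = cosine similarity between a[i] and b[j]."""
    a_unit = F.normalize(a, p=2, dim=1)  # scale every row to unit length
    b_unit = F.normalize(b, p=2, dim=1)
    return a_unit @ b_unit.T             # dot products of unit vectors are cosines

vectors = torch.tensor([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
sims = cosine_similarity_matrix(vectors, vectors)
print(sims)
```

The diagonal is always 1 (each vector compared with itself) and the matrix is symmetric when both arguments are the same set of vectors. Once the rows are normalized, a plain matrix product is all that remains.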

In [9]:
util.cos_sim(embeddings2, embeddings2)

tensor([[1.0000, 0.8742, 0.8805, 0.6650, 0.7023],
        [0.8742, 1.0000, 0.8193, 0.6852, 0.7128],
        [0.8805, 0.8193, 1.0000, 0.6403, 0.6987],
        [0.6650, 0.6852, 0.6403, 1.0000, 0.8279],
        [0.7023, 0.7128, 0.6987, 0.8279, 1.0000]])
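
One way to compare the two matrices is to look at how widely the off-diagonal similarities are spread. The sketch below copies the values printed above (embeddings1 from the sentence-transformers model, embeddings2 from mean-pooled bert-base-uncased) and reports the minimum, maximum, and range for each. The helper name `off_diagonal` is an assumption for illustration.

```python
from itertools import combinations

# Values copied from the two outputs above.
sbert_sims = [
    [1.0000, 0.8163, 0.4640, 0.1087, 0.3439],
    [0.8163, 1.0000, 0.5776, 0.2320, 0.4698],
    [0.4640, 0.5776, 1.0000, 0.5535, 0.5734],
    [0.1087, 0.2320, 0.5535, 1.0000, 0.5736],
    [0.3439, 0.4698, 0.5734, 0.5736, 1.0000],
]
bert_mean_sims = [
    [1.0000, 0.8742, 0.8805, 0.6650, 0.7023],
    [0.8742, 1.0000, 0.8193, 0.6852, 0.7128],
    [0.8805, 0.8193, 1.0000, 0.6403, 0.6987],
    [0.6650, 0.6852, 0.6403, 1.0000, 0.8279],
    [0.7023, 0.7128, 0.6987, 0.8279, 1.0000],
]

def off_diagonal(matrix):
    """Upper-triangle entries, i.e. each distinct sentence pair once."""
    return [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]

summary = {}
for name, matrix in (("sentence-transformers", sbert_sims), ("bert mean pooling", bert_mean_sims)):
    values = off_diagonal(matrix)
    summary[name] = (min(values), max(values), max(values) - min(values))
    print(f"{name}: min={min(values):.4f} max={max(values):.4f} range={max(values) - min(values):.4f}")
```

On these five sentences the mean-pooled BERT similarities all fall in a narrow, high band, while the sentence-transformers model separates the pairs much more. This is a single small example, so it should be read as an illustration rather than a general result.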

In [12]:
sentences = ["i love apple", "i like apple", "i hate apple", "i love orange","i love her"]
embeddings1 = model.encode(sentences)
print(util.cos_sim(embeddings1, embeddings1))
embeddings2 = vectorize_sentences(sentences)
print('------------------------------------------')
print(util.cos_sim(embeddings2, embeddings2))

tensor([[1.0000, 0.9672, 0.4314, 0.7308, 0.5697],
        [0.9672, 1.0000, 0.4252, 0.7006, 0.4990],
        [0.4314, 0.4252, 1.0000, 0.2827, 0.1134],
        [0.7308, 0.7006, 0.2827, 1.0000, 0.6801],
        [0.5697, 0.4990, 0.1134, 0.6801, 1.0000]])
------------------------------------------
tensor([[1.0000, 0.8813, 0.8662, 0.9577, 0.8105],
        [0.8813, 1.0000, 0.8118, 0.8413, 0.6976],
        [0.8662, 0.8118, 1.0000, 0.8347, 0.7605],
        [0.9577, 0.8413, 0.8347, 1.0000, 0.8032],
        [0.8105, 0.6976, 0.7605, 0.8032, 1.0000]])


[How to compute embeddings with Sentence Transformers](https://www.sbert.net/examples/applications/computing-embeddings/README.html)

In [13]:
model2 = SentenceTransformer('all-MiniLM-L6-v2')

In [14]:
sentences = ["i love apple", "i like apple", "i hate apple", "i love orange","i love her"]
embeddings1 = model.encode(sentences)
print(util.cos_sim(embeddings1, embeddings1))
embeddings2 = model2.encode(sentences)
print('------------------------------------------')
print(util.cos_sim(embeddings2, embeddings2))

tensor([[1.0000, 0.9672, 0.4314, 0.7308, 0.5697],
        [0.9672, 1.0000, 0.4252, 0.7006, 0.4990],
        [0.4314, 0.4252, 1.0000, 0.2827, 0.1134],
        [0.7308, 0.7006, 0.2827, 1.0000, 0.6801],
        [0.5697, 0.4990, 0.1134, 0.6801, 1.0000]])
------------------------------------------
tensor([[1.0000, 0.9163, 0.8457, 0.4465, 0.3444],
        [0.9163, 1.0000, 0.8308, 0.4366, 0.2561],
        [0.8457, 0.8308, 1.0000, 0.3143, 0.2145],
        [0.4465, 0.4366, 0.3143, 1.0000, 0.3292],
        [0.3444, 0.2561, 0.2145, 0.3292, 1.0000]])


In [74]:
sentences =['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A man is riding a horse.',
          'A woman is playing violin.',
          'Two men pushed carts through the woods.',
          'A man is riding a white horse on an enclosed ground.',
          'A monkey is playing drums.',
          ]
embeddings1 = model.encode(sentences)
print(util.cos_sim(embeddings1, embeddings1))
embeddings2 = model2.encode(sentences)
print('------------------------------------------')
print(util.cos_sim(embeddings2, embeddings2))

tensor([[ 1.0000,  0.8437,  0.0013,  0.1889, -0.0633,  0.3480,  0.1330,  0.2096],
        [ 0.8437,  1.0000, -0.0136,  0.0795,  0.0034,  0.2797,  0.1350,  0.2860],
        [ 0.0013, -0.0136,  1.0000,  0.0978,  0.4002,  0.0498, -0.0104,  0.2725],
        [ 0.1889,  0.0795,  0.0978,  1.0000,  0.3309,  0.4154,  0.7206,  0.1594],
        [-0.0633,  0.0034,  0.4002,  0.3309,  1.0000,  0.0259,  0.2359,  0.1104],
        [ 0.3480,  0.2797,  0.0498,  0.4154,  0.0259,  1.0000,  0.3744,  0.0845],
        [ 0.1330,  0.1350, -0.0104,  0.7206,  0.2359,  0.3744,  1.0000,  0.1071],
        [ 0.2096,  0.2860,  0.2725,  0.1594,  0.1104,  0.0845,  0.1071,  1.0000]])
------------------------------------------
tensor([[ 1.0000,  0.6601, -0.0559,  0.2464, -0.0930, -0.0289,  0.1883,  0.0325],
        [ 0.6601,  1.0000, -0.0216,  0.2232,  0.0020, -0.0945,  0.1767,  0.0480],
        [-0.0559, -0.0216,  1.0000, -0.0135, -0.0157, -0.0339, -0.0626, -0.0273],
        [ 0.2464,  0.2232, -0.0135,  1.0000, -0.0748, 

In [None]:
model2.encode('A man is eating food.')

In [17]:
# With normalize_embeddings=True the vectors have unit length, so util.dot_score gives the same values as util.cos_sim and is cheaper to compute
model2.encode('A man is eating food.',convert_to_tensor=True,normalize_embeddings=True)

tensor([ 3.3242e-02,  4.4061e-03, -6.2770e-03,  4.8379e-02, -1.3870e-01,
        -3.3617e-02,  1.0113e-01, -5.4385e-02, -4.3248e-02, -3.9941e-02,
         7.7863e-03, -1.2749e-02, -6.6830e-02, -1.7387e-02,  4.7451e-02,
        -5.7724e-02,  1.0189e-01, -9.1164e-04,  8.2261e-02, -5.0342e-02,
         6.7730e-02,  4.0877e-02, -3.5802e-02, -1.0068e-01, -6.6936e-03,
        -5.3169e-02,  1.0034e-01, -5.4614e-02, -2.2848e-02,  1.3839e-02,
         7.4866e-02, -6.1788e-02,  6.3922e-02,  1.6239e-02, -5.3230e-02,
        -3.8608e-02,  3.1528e-02, -8.1153e-02, -3.3143e-02, -5.3852e-04,
        -3.9607e-03, -1.5273e-02, -9.8640e-04,  9.5799e-02, -5.4292e-02,
         1.8457e-02, -1.0714e-01,  1.3888e-02,  3.9407e-02, -2.6924e-02,
        -9.1599e-02, -1.1420e-02,  3.3814e-02, -2.5844e-02,  6.4262e-02,
         1.2114e-02,  2.1777e-02,  9.1483e-02, -1.0504e-01, -2.1919e-02,
         3.1334e-02, -5.5160e-02,  2.8510e-02, -2.4123e-02,  4.9336e-02,
        -6.8366e-02, -1.9275e-02, -1.2098e-02, -2.4