<a href="https://colab.research.google.com/github/mostafa-ja/Anomaly-detection/blob/main/LogADEmpirical3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generate_embeddings

In [1]:
!wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip'

--2023-07-22 15:34:44--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.225.142.121, 13.225.142.88, 13.225.142.76, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.225.142.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1523785255 (1.4G) [application/zip]
Saving to: ‘crawl-300d-2M.vec.zip’


2023-07-22 15:35:24 (37.2 MB/s) - ‘crawl-300d-2M.vec.zip’ saved [1523785255/1523785255]



In [2]:
!unzip "/content/crawl-300d-2M.vec.zip" -d "/content/"

Archive:  /content/crawl-300d-2M.vec.zip
  inflating: /content/crawl-300d-2M.vec  


In [6]:
import sys

import nltk
nltk.download('stopwords')

import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from typing import List
from time import time
import json

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
template_df = pd.read_csv('/content/HDFS.log_templates.csv')
templates = template_df['EventTemplate'].tolist()
print(templates[:5])
print(len(templates))

['Receiving block <*> src: <*> dest: <*>', 'BLOCK* NameSystem.allocateBlock: <*> <*>', 'PacketResponder <*> for block <*> <*>', 'Received block <*> of size <*> from <*>', 'BLOCK* NameSystem.addStoredBlock: blockMap updated: <*> is added to <*> size <*>']
48


In [7]:
st = time()
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('./crawl-300d-2M.vec', binary=False)
stop_words = set(stopwords.words('english'))
tokenizer = RegexpTokenizer(r'\w+')
print("Loaded word2vec model in {:.2f} seconds".format(time() - st))


Loaded word2vec model in 0.01 seconds
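The `.vec` file is plain text in word2vec format: a header line with the vocabulary size and dimension, then one word and its values per line. The sketch below parses a tiny made-up sample of that layout with the standard library only, to show what a lookup table of this kind can and cannot answer. `TOY_VEC`, `load_vec` and `lookup` are hypothetical names for illustration, not part of gensim, and the zero-vector fallback is just one possible choice for unknown words.

```python
import io

# Made-up miniature of a .vec file: "<vocab_size> <dim>" header, then one word per line.
TOY_VEC = """3 4
block 0.1 0.2 0.3 0.4
receiving 0.5 0.1 0.0 0.2
size 0.3 0.3 0.1 0.0
"""

def load_vec(handle):
    """Parse word2vec text format into a {word: [floats]} dict."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == vocab_size
    return vectors

def lookup(word, vectors, dim=4):
    """Return the stored vector, or a zero vector for an out-of-vocabulary word."""
    return vectors.get(word, [0.0] * dim)

toy_vectors = load_vec(io.StringIO(TOY_VEC))
print(lookup("block", toy_vectors))      # stored vector
print(lookup("blk_12345", toy_vectors))  # not in the table, so the fallback is used
```

With the real gensim object, the equivalent membership test is `word in word2vec_model`, which can be checked before calling `get_vector`.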


In [8]:
# Remove stop words and punctuation, and split camelCase identifiers
def clean_template(template: str, remove_stop_words: bool = True) -> str:
    # Lowercase all-caps tokens (e.g. "BLOCK*") so they are not split letter by letter below
    template = " ".join([word.lower() if word.isupper() else word for word in template.strip().split()])
    template = re.sub('[A-Z]', lambda x: " " + x.group(0), template)  # split camelCase
    word_tokens = tokenizer.tokenize(template)  # keep alphanumeric runs, dropping punctuation
    word_tokens = [w for w in word_tokens if not w.isdigit()]  # drop purely numeric tokens
    if remove_stop_words:  # optional: pass remove_stop_words=False to keep stop words
        # Compare in lowercase so capitalized stop words ("The", "Of") are removed too
        filtered_sentence = [w.lower() for w in word_tokens if w.lower() not in stop_words]
    else:
        filtered_sentence = [w.lower() for w in word_tokens]

    template_clean = " ".join(filtered_sentence)
    return template_clean  # cleaned template as one space-separated string
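To see what these cleaning steps do to actual HDFS templates, here is a standard-library sketch of the same pipeline. It is an approximation under stated assumptions: `re.findall(r"\w+", ...)` stands in for the NLTK `RegexpTokenizer`, `TOY_STOP_WORDS` is a small hypothetical list rather than NLTK's English stop words, and tokens are lowercased before the stop-word check.

```python
import re

# Small hypothetical stop-word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"of", "from", "to", "for", "is", "the"}

def clean_template_sketch(template, stop_words=TOY_STOP_WORDS):
    # Lowercase all-caps tokens so "BLOCK*" is not split letter by letter.
    text = " ".join(w.lower() if w.isupper() else w for w in template.strip().split())
    # Insert a space before each remaining capital letter to split camelCase.
    text = re.sub(r"[A-Z]", lambda m: " " + m.group(0), text)
    # Keep alphanumeric runs only, which drops punctuation and the "<*>" wildcards.
    tokens = re.findall(r"\w+", text)
    tokens = [t.lower() for t in tokens if not t.isdigit()]
    return " ".join(t for t in tokens if t not in stop_words)

print(clean_template_sketch("BLOCK* NameSystem.allocateBlock: <*> <*>"))
# block name system allocate block
print(clean_template_sketch("Received block <*> of size <*> from <*>"))
# received block size
```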

IMPORTANT POINTS:

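1 . The vectors loaded here are plain gensim `KeyedVectors`, so `word2vec_model.get_vector(word)` returns a vector only for words in the 2-million-word vocabulary and raises a `KeyError` for anything else. The vocabulary is large enough to contain many rare or odd-looking tokens, but arbitrary strings are not covered; that would require a full fastText model with subword information.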

2 . TfidfVectorizer is more commonly used as it combines tokenization and TF-IDF
transformation in a single step, making it easier to use for most text
tasks. On the other hand, TfidfTransformer is useful when you already have a
matrix of term frequencies and want to compute the corresponding TF-IDF matrix.
If you have a collection of text documents and want to obtain their TF-IDF
representation, it is more straightforward to use TfidfVectorizer.

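3 . Because each row of `matrix_weight` is already L2-normalized (the `TfidfVectorizer` default, `norm='l2'`), the TF-IDF strategy sums the weighted word vectors directly for each template and does not divide by the number of words to take a mean.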

4 . Use strategy = 'average' when new, previously unseen templates can appear: averaging needs no corpus statistics, whereas the TF-IDF weights are fitted on the known set of templates.

5 . Some points about normalizing word vectors:
- Vectors are usually normalized to unit length before similarity calculations, which makes cosine similarity and the dot product equivalent.
- Most applications of word embeddings do not use the word vectors themselves but the relations between them, for example in similarity and word-relation tasks. Normalized word vectors have been found to improve performance on these tasks, so vector length is typically ignored.
- On the other hand, a word that is consistently used in similar contexts is represented by a longer vector than a word of the same frequency that is used in varied contexts. The length of a word vector, not only its direction, therefore carries information, and combined with term frequency it gives a useful measure of word significance.
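The equivalence between cosine similarity and the dot product of unit-length vectors can be checked on a small example. This is a standard-library sketch with made-up 3-dimensional vectors; the helper names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def unit(u):
    """Scale u to length 1, discarding its original length."""
    n = norm(u)
    return [a / n for a in u]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

a = [3.0, 4.0, 0.0]  # length 5
b = [1.0, 2.0, 2.0]  # length 3

cos_ab = cosine(a, b)
dot_unit = dot(unit(a), unit(b))
print(cos_ab, dot_unit)  # both are 11/15, up to floating-point rounding
print(norm(a), norm(b))  # the lengths that normalization throws away
```

After normalization the two vectors are indistinguishable by length, which is exactly the information the last bullet argues can be useful.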

In [35]:
def generate_embeddings_fasttext(templates, strategy='average'):
    """
    Generate embeddings for templates using fastText word vectors.

    Parameters
    ----------
    templates: list of template strings
    strategy: 'average' or 'tfidf'

    Returns
    -------
    embeddings: dict mapping each template to its embedding (list of 300 floats)
    """
    cleaned_templates = [clean_template(template) for template in templates]
    embeddings = {}

    if strategy == 'average':
        for cleaned_template, template in zip(cleaned_templates, templates):
            words = cleaned_template.split()
            template2vec = np.zeros(300)  # 300 = word vector size
            for word in words:
                template2vec += word2vec_model.get_vector(word) / len(words)
            # Store inside the loop; otherwise only the last template is saved
            embeddings[template] = template2vec.tolist()

    elif strategy == 'tfidf':
        vectorizer = TfidfVectorizer()
        matrix_weight = vectorizer.fit_transform(cleaned_templates)
        dic = vectorizer.vocabulary_  # maps each word to its column index

        for i, (cleaned_template, template) in enumerate(zip(cleaned_templates, templates)):
            template2vec = np.zeros(300)  # 300 = word vector size
            for word in cleaned_template.split():
                # dic.get returns None for a word the vectorizer dropped; dic[word] would raise KeyError
                j = dic.get(word)
                if j is not None:
                    template2vec += matrix_weight[i, j] * word2vec_model.get_vector(word)
            embeddings[template] = template2vec.tolist()

    else:
        raise ValueError("strategy must be 'average' or 'tfidf'")

    return embeddings
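The weights used by the 'tfidf' strategy can be reproduced by hand on a toy corpus. The sketch below is a standard-library approximation of what `TfidfVectorizer` computes with its default settings as I understand them (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The three documents and the function name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

# Toy "cleaned templates"; not taken from the HDFS data.
docs = ["receiving block src dest", "received block size", "deleting block file"]

def tfidf_rows(docs):
    """Return one {word: weight} dict per document, each with unit L2 norm."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    rows = []
    for toks in tokenized:
        raw = {w: tf * idf[w] for w, tf in Counter(toks).items()}
        length = math.sqrt(sum(x * x for x in raw.values()))
        rows.append({w: x / length for w, x in raw.items()})
    return rows

rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    l2 = math.sqrt(sum(x * x for x in row.values()))
    print(doc, {w: round(x, 3) for w, x in row.items()}, "L2 norm:", round(l2, 3))
```

The word "block" occurs in every document, so it receives a smaller weight than the words that distinguish one template from another. Each row has L2 norm 1, but the weights in a row do not sum to 1, so the template vector is a weighted sum rather than a strict weighted mean.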

In [36]:
embeddings = generate_embeddings_fasttext(templates, strategy='tfidf')
with open('/content/embeddings_tfidf.json', 'w') as f:
    json.dump(embeddings, f)
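A later step will need to read this file back. The following is a minimal round-trip sketch using a temporary file and a toy dictionary with 3-dimensional vectors in place of the real 300-dimensional embeddings; the file name `embeddings_toy.json` is hypothetical.

```python
import json
import os
import tempfile

# Toy stand-in for the real {template: 300-float list} dictionary.
toy_embeddings = {
    "Receiving block <*> src: <*> dest: <*>": [0.1, 0.2, 0.3],
    "Received block <*> of size <*> from <*>": [0.0, 0.5, 0.25],
}

path = os.path.join(tempfile.mkdtemp(), "embeddings_toy.json")
with open(path, "w") as f:
    json.dump(toy_embeddings, f)

with open(path) as f:
    loaded = json.load(f)

# Keys are the raw template strings; values come back as plain Python lists.
print(len(loaded), "templates loaded")
print(loaded["Received block <*> of size <*> from <*>"])
```

The values are lists, so they would need to be converted back (for example with `np.array`) before any numeric work.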
