<a href="https://colab.research.google.com/github/mostafa-ja/Anomaly-detection/blob/main/B(vectors).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PART 1: semantic vectors

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import re
import pandas as pd
import json
import gensim.downloader


In [2]:
print(list(gensim.downloader.info()['models'].keys()))


['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [None]:
# Download the 'word2vec-google-news-300' embeddings
word2vec = gensim.downloader.load('word2vec-google-news-300')



In [None]:
# Read log templates file into a DataFrame
df = pd.read_csv('/content/HDFS_templates.csv')
df.head(3)

In [None]:
templates = df['EventTemplate'].tolist()
templates[:3]

In [None]:

# Words such as 'on', 'over', and 'not' are deliberately left out of this stop-word set because they can carry significant meaning
stop_words = {
    'a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren',
    "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by',
    'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don',
    "don't", 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't",
    'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how',
    'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me',
    'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'nor', 'now', 'o',
    'of', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're', 's', 'same', 'shan',
    "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some', 'such', 't', 'than',
    'that', "that'll", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this',
    'those', 'through', 'to', 'too', 'until', 've', 'very', 'was', 'wasn', "wasn't", 'we', 'were', 'weren',
    "weren't", 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'won', "won't",
    'wouldn', "wouldn't", 'y', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself',
    'yourselves'
}

# Pre-compile the pattern once so repeated substitutions do not recompile it; it matches a run of non-word characters or a single digit
pattern = re.compile(r'\W+|\d')

In [None]:
def tokenized(text):
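    """
    Normalize a log template: strip special characters and digits, split camelCase,
    lowercase, and drop stop words, keeping only the most salient tokens
    """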
    # Replace special characters with space and remove digits
    text = pattern.sub(' ', text)

    # Convert camel case to snake case, then replace _ with space
    text = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', text)
    text = re.sub('([a-z0-9])([A-Z])', r'\1_\2', text).lower().replace('_', ' ')

    normalized_tokens = [w for w in text.split() if w not in stop_words]

    # Return the filtered tokens joined into one string, not a list of words
    return ' '.join(normalized_tokens)


In [None]:
tokenized_template = [tokenized(sentence) for sentence in df['EventTemplate'] ]
print(tokenized_template)
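
To make the effect of `tokenized` concrete, here is a minimal, self-contained sketch on two sample strings. The sample templates and the reduced stop-word set `sample_stop_words` are made up for illustration and are not taken from `HDFS_templates.csv`; the function body mirrors the one above.

```python
import re

# Hypothetical, reduced stop-word set (the notebook uses a much longer one)
sample_stop_words = {'of', 'for', 'from', 'to', 'is', 'the'}
pattern = re.compile(r'\W+|\d')

def tokenized(text):
    # Replace special characters and digits with spaces
    text = pattern.sub(' ', text)
    # Split camelCase words with underscores, lowercase, then turn underscores into spaces
    text = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', text)
    text = re.sub('([a-z0-9])([A-Z])', r'\1_\2', text).lower().replace('_', ' ')
    return ' '.join(w for w in text.split() if w not in sample_stop_words)

# Made-up templates written in the style of HDFS log lines
first = tokenized('PacketResponder <*> for block <*> terminating')
second = tokenized('Received block <*> of size <*> from <*>')

print(first)   # packet responder block terminating
print(second)  # received block size
```

The `<*>` placeholders disappear because they consist only of non-word characters, and `PacketResponder` is split into two tokens by the camel-case step.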

IMPORTANT POINTS:

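1 . A subword-based fastText model can return a vector for any token, even a meaningless one, by composing character n-grams. The pretrained `word2vec-google-news-300` vectors loaded above cannot: `word2vec.get_vector(word)` raises a `KeyError` for a word outside the vocabulary, so such words have to be skipped or handled explicitly.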

2 . `TfidfVectorizer` is the more common choice because it combines tokenization, counting, and the TF-IDF transformation in a single step. `TfidfTransformer` is useful when you already have a matrix of term counts and only need to convert it to TF-IDF. Starting from raw text documents, as we do here, `TfidfVectorizer` is the more direct option.

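3 . Because each row of `matrix_weight` is already normalized (`TfidfVectorizer` applies L2 normalization by default, `norm='l2'`), we do not need to take the mean for each template: the template vector is simply the weighted sum of its word vectors.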

4 . We use `strategy = 'average'` when we encounter a new template that we have not seen before.

5 . Some points about normalizing word2vec vectors (these are the reasons we do not normalize them):

Word vectors are often normalized to unit length before similarity calculations, which makes cosine similarity and the dot product equivalent. Most applications of word embeddings use the relations between vectors rather than the vectors themselves, for example in similarity and word-relation tasks. For these tasks normalized vectors were found to improve performance, so vector length is typically ignored. However, length also carries information: a word that is consistently used in similar contexts is represented by a longer vector than a word of the same frequency that is used in varied contexts (two words with the same meaning have the same direction, but their lengths depend on frequency). Combined with term frequency, word vector length is therefore a useful measure of word significance.

6 . With the TF-IDF method, word importance is estimated from the available templates, not from general language use.
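
Points 3 and 6 can be checked on a toy corpus. The sketch below does not call scikit-learn. It recomputes TF-IDF with NumPy using the formula that, to my understanding, scikit-learn documents for its defaults (`smooth_idf=True`, `norm='l2'`): `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three templates are made up for illustration.

```python
import numpy as np

# Made-up, already tokenized templates (not from HDFS_templates.csv)
docs = [
    'receiving block src dest',
    'received block size',
    'deleting block file',
]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}

# Raw term counts, one row per template
counts = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        counts[i, index[w]] += 1

# Smoothed inverse document frequency (assumed to match smooth_idf=True)
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF-IDF, then L2 normalization of each row (assumed to match norm='l2')
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

row_norms = np.linalg.norm(tfidf, axis=1)
block_weights = tfidf[:, index['block']]
print(row_norms)      # every row has unit length
print(block_weights)  # 'block' appears in every template, so its weight is the smallest in each row
```

Every row has unit L2 norm, which is why the weighted sum is not divided by the number of words. The word `block` occurs in all three templates, so it receives the lowest weight in each row; in a different set of templates the same word could be rated as important.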

In [None]:
def generate_embeddings(templates, strategy='tfidf'):
    """
    Generate one semantic vector per template from the pretrained word2vec vectors

    Parameters
    ----------
    templates: list of raw log templates
    strategy: 'average' or 'tfidf'

    Returns
    -------
    embeddings: np.ndarray of shape (num_templates, vector_size)
    """
    cleaned_templates = [tokenized(template) for template in templates]
    embeddings = np.zeros((len(cleaned_templates), word2vec.vector_size))

    if strategy == 'average':
        for i, cleaned_template in enumerate(cleaned_templates):
            # get_vector() raises KeyError for out-of-vocabulary words, so average only the known ones
            words = [w for w in cleaned_template.split() if w in word2vec]
            if words:
                embeddings[i] = np.mean([word2vec.get_vector(w) for w in words], axis=0)

    elif strategy == 'tfidf':
        vectorizer = TfidfVectorizer()
        matrix_weight = vectorizer.fit_transform(cleaned_templates)
        vocabulary = vectorizer.vocabulary_

        for i, cleaned_template in enumerate(cleaned_templates):
            for word in cleaned_template.split():
                # .get() returns None for tokens the vectorizer dropped (e.g. single-character
                # tokens), whereas vocabulary[word] would raise a KeyError
                j = vocabulary.get(word)
                if j is not None and word in word2vec:
                    embeddings[i] += matrix_weight[i, j] * word2vec.get_vector(word)

    else:
        raise ValueError(f"Unknown strategy: {strategy!r} (expected 'average' or 'tfidf')")

    return embeddings
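
The difference between the two strategies can be seen on a toy example. This is a minimal sketch, not the function above: the 3-dimensional vectors in `toy_vectors` and the weights in `toy_weights` are hypothetical stand-ins for the pretrained model and for one TF-IDF row, and skipping words that have no vector is a choice made in this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the pretrained model
toy_vectors = {
    'block':    np.array([1.0, 0.0, 0.0]),
    'received': np.array([0.0, 1.0, 0.0]),
    'size':     np.array([0.0, 0.0, 1.0]),
}

# Hypothetical per-word weights for one template (for example, one TF-IDF row)
toy_weights = {'block': 0.4, 'received': 0.6, 'size': 0.6, 'blk': 0.3}

def weighted_sum(words, weights, vectors, dim=3):
    """Sum weight * vector over the words that have both a weight and a vector."""
    total = np.zeros(dim)
    for word in words:
        if word in weights and word in vectors:
            total += weights[word] * vectors[word]
    return total

def plain_average(words, vectors, dim=3):
    """Unweighted mean over the words that have a vector."""
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# 'blk' has a weight but no vector, so both helpers ignore it
words = 'received block blk size'.split()
tfidf_vec = weighted_sum(words, toy_weights, toy_vectors)
avg_vec = plain_average(words, toy_vectors)

print(tfidf_vec)
print(avg_vec)
```

The weighted sum lets rarer words dominate the template vector, while the plain average treats every known word equally and needs no fitted weights.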

In [None]:
# generate semantic vectors from templates
embeddings = generate_embeddings(templates, strategy = 'tfidf')
embeddings


In [None]:
# Save semantic vectors; a NumPy array is not JSON serializable, so convert it to nested lists first
with open('/content/embeddings_tfidf.json', 'w') as f:
    json.dump(embeddings.tolist(), f)
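
A note on JSON and NumPy: the standard `json` module cannot serialize an `ndarray` directly, so the array has to be converted to nested lists when saving and back to an array when loading. A minimal sketch of the round trip, using a made-up 2 x 3 matrix in place of the real embeddings and an in-memory string in place of a file:

```python
import json
import numpy as np

# Made-up 2 x 3 matrix standing in for the real embeddings
toy_embeddings = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])

# Passing the array itself to json fails with a TypeError
try:
    json.dumps(toy_embeddings)
    direct_dump_failed = False
except TypeError:
    direct_dump_failed = True

# Nested Python lists serialize without problems
serialized = json.dumps(toy_embeddings.tolist())

# Loading returns nested lists; np.asarray turns them back into an array
restored = np.asarray(json.loads(serialized))

print(direct_dump_failed)
print(restored.shape)
```

When reading the saved file later, the same `np.asarray(json.load(f))` step restores a 2-D array with one row per template.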