# Vulture

## Introduction to Text Operations with Vulture

In [1]:
import os
import pickle
import pathlib
import pandas as pd

from TELF.pre_processing import Vulture

## 0. Load Dataset

### Input

In [2]:
DATA_DIR = os.path.join('..', '..', 'data')
DATA_DIR = pathlib.Path(DATA_DIR).resolve()

In [3]:
DATA_FILE = 'documents.p'

In [4]:
# open the pickle file with a context manager so the handle is closed after loading
with open(os.path.join(DATA_DIR, DATA_FILE), 'rb') as f:
    documents = pickle.load(f)
len(documents)

9

### Output

In [5]:
RESULTS_DIR = 'results'
RESULTS_DIR = pathlib.Path(RESULTS_DIR).resolve()

In [19]:
RESULTS_FILE = 'operated_documents'

In [7]:
# create the results directory if it does not already exist
RESULTS_DIR.mkdir(exist_ok=True)

### Examine Data Format

In [8]:
# keys serve as unique document IDs
list(documents.keys())

['ad68055e-677f-11ee-95d4-4ab2673ea3f0',
 'ad680626-677f-11ee-95d4-4ab2673ea3f0',
 'ad680658-677f-11ee-95d4-4ab2673ea3f0',
 'ad680680-677f-11ee-95d4-4ab2673ea3f0',
 'ad6806a8-677f-11ee-95d4-4ab2673ea3f0',
 'ad6806d0-677f-11ee-95d4-4ab2673ea3f0',
 'ad6806f8-677f-11ee-95d4-4ab2673ea3f0',
 'ad680716-677f-11ee-95d4-4ab2673ea3f0',
 'ad68073e-677f-11ee-95d4-4ab2673ea3f0']

In [9]:
# values are the text that needs to be cleaned
documents[next(iter(documents))]

'Supervisory Control and Data Acquisition (SCADA) systems often serve as the nervous system for substations within power grids. These systems facilitate real-time monitoring, data acquisition, control of equipment, and ensure smooth and efficient operation of the substation and its connected devices. As the dependence on these SCADA systems grows, so does the risk of potential malicious intrusions that could lead to significant outages or even permanent damage to the grid. Previous work has shown that dimensionality reduction-based approaches, such as Principal Component Analysis (PCA), can be used for accurate identification of anomalies in SCADA systems. While not specifically applied to SCADA, non-negative matrix factorization (NMF) has shown strong results at detecting anomalies in wireless sensor networks. These unsupervised approaches model the normal or expected behavior and detect the unseen types of attacks or anomalies by identifying the events that deviate from the expected 
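The input examined above is a plain dictionary that maps a unique document ID to its raw text. As a minimal sketch, assuming you want to prepare your own texts in the same layout, the snippet below builds such a dictionary and round-trips it through `pickle`. The sample texts, the `my_documents.p` file name, the temporary directory, and the choice of `uuid.uuid4()` for the IDs are illustrative assumptions, not part of the TELF data set.

```python
import os
import pickle
import tempfile
import uuid

# Hypothetical raw texts; replace with your own corpus.
texts = [
    "SCADA systems often serve as the nervous system for substations.",
    "Non-negative matrix factorization (NMF) can detect anomalies.",
]

# Map a unique string ID to each text, mirroring the {id: text} layout above.
my_documents = {str(uuid.uuid4()): text for text in texts}

# Write the dictionary to a pickle file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_documents.p")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(my_documents, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded))
```

The reloaded dictionary has the same shape as `documents` in this notebook: string keys and string values.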

## NER Operation

The Vulture library is composed of multiple operation modules that can run one after another to perform a custom operation on the text. These modules are flexible, and their order can be rearranged to suit the user's needs. By default, Vulture implements an NER pipeline so that new users can get started quickly. In this section we examine the default Vulture pipeline and apply named entity recognition (NER) to the sample text.

The pipeline is just a list of Vulture modules that are applied sequentially. The default pipeline contains a single module - the ```NEDetector```.

In [10]:
Vulture.DEFAULT_OPERATOR_PIPELINE

[NEDetector(module_type='OPERATOR', backend=None)]
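The idea of a pipeline as a list of modules that run in order can be illustrated in plain Python. The sketch below is not TELF's implementation: `AcronymFinder`, `WordCounter`, `run_pipeline`, and the toy documents are hypothetical stand-ins, chosen only to show how a list of callable modules might be applied to an `{id: text}` dictionary, with the result of each module collected under its class name.

```python
class AcronymFinder:
    """Toy operator (hypothetical): collects all-uppercase tokens per document."""

    def __call__(self, documents):
        results = {}
        for doc_id, text in documents.items():
            # Strip surrounding punctuation before testing each token.
            tokens = (tok.strip("().,;:") for tok in text.split())
            results[doc_id] = {tok for tok in tokens if len(tok) > 1 and tok.isupper()}
        return results


class WordCounter:
    """Toy operator (hypothetical): counts whitespace-separated tokens."""

    def __call__(self, documents):
        return {doc_id: len(text.split()) for doc_id, text in documents.items()}


def run_pipeline(documents, pipeline):
    """Run each module in order and collect (module name, per-document results)."""
    return [(type(module).__name__, module(documents)) for module in pipeline]


toy_documents = {
    "doc-1": "Principal Component Analysis (PCA) can flag anomalies in SCADA systems.",
    "doc-2": "Topic modeling is a key analytic technique.",
}

toy_results = run_pipeline(toy_documents, [AcronymFinder(), WordCounter()])
for name, per_document in toy_results:
    print(name, per_document)
```

Reordering or extending the list passed to `run_pipeline` changes which toy operations run and in what order, which is the flexibility the text describes.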

### Setup Vulture

Create a single-node, multi-process Vulture object. With ```n_jobs = 1``` it runs a single job.

In [11]:
vulture = Vulture(n_jobs  = 1, 
                  verbose = 10,  # Disable == 0, Verbose >= 1
                 )

### Apply NER

If we do not pass ```save_path```, ```operate``` returns a list of results with one entry per operation in the pipeline.

In [12]:
operation_results =  vulture.operate(documents)                   

[Vulture]: Cleaning 9 documents
  0%|          | 0/1 [00:00<?, ?it/s]

[Vulture]: Running NEDetector module
100%|██████████| 9/9 [00:02<00:00,  3.12it/s]
100%|██████████| 1/1 [00:02<00:00,  2.89s/it]


Each entry is a tuple: index 0 is the name of the operation, and index 1 is a dictionary holding the results of that operation.

In [14]:
operation_results[0][0]

'NEDetector'

The results dictionary is keyed by document ID. Each value holds the operation results for that document, in this case the named entities grouped by entity label.

In [15]:
operation_results[0][1]

{'ad68055e-677f-11ee-95d4-4ab2673ea3f0': {'ORG': {'LANL',
   'Los Alamos National Laboratory',
   'SCADA'},
  'GPE': {'Los Alamos County'}},
 'ad680626-677f-11ee-95d4-4ab2673ea3f0': {'PRODUCT': {'SeNMFk'},
  'CARDINAL': {'two'}},
 'ad680658-677f-11ee-95d4-4ab2673ea3f0': {'ORG': {'NVIDIA'},
  'PRODUCT': {'NMFk'},
  'CARDINAL': {'4096', 'approximately 25,000'},
  'QUANTITY': {'11 Exabyte', '340 Terabyte'}},
 'ad680680-677f-11ee-95d4-4ab2673ea3f0': {'PRODUCT': {'HNMFk Classifier',
   'the HNMFk Classifier'},
  'CARDINAL': {'0.80', 'nearly 2,900', 'nearly 388,000'}},
 'ad6806a8-677f-11ee-95d4-4ab2673ea3f0': {},
 'ad6806d0-677f-11ee-95d4-4ab2673ea3f0': {'PRODUCT': {'Malware-DNA',
   'R&D100',
   'SmartTensors AI Platform'},
  'DATE': {'2021'},
  'ORDINAL': {'first'},
  'CARDINAL': {'one'}},
 'ad6806f8-677f-11ee-95d4-4ab2673ea3f0': {'CARDINAL': {'One',
   'two',
   '~2 million+'},
  'ORG': {'SeNMFk', 'arXiv'}},
 'ad680716-677f-11ee-95d4-4ab2673ea3f0': {'CARDINAL': {'One', 'one', 'two'},
  'O
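The nested layout shown above (document ID, then entity label, then a set of entities) is easy to flatten for inspection. The following is a minimal sketch that uses three entries copied from the output above as `sample_results`; the variable names are illustrative, and the same loops should apply to `operation_results[0][1]` assuming it has the structure displayed.

```python
from collections import Counter

# Sample copied from the output above: {document ID: {entity label: {entities}}}.
sample_results = {
    "ad680626-677f-11ee-95d4-4ab2673ea3f0": {"PRODUCT": {"SeNMFk"}, "CARDINAL": {"two"}},
    "ad680658-677f-11ee-95d4-4ab2673ea3f0": {
        "ORG": {"NVIDIA"},
        "PRODUCT": {"NMFk"},
        "CARDINAL": {"4096", "approximately 25,000"},
        "QUANTITY": {"11 Exabyte", "340 Terabyte"},
    },
    "ad6806a8-677f-11ee-95d4-4ab2673ea3f0": {},
}

# Flatten into sorted (document ID, label, entity) rows.
rows = sorted(
    (doc_id, label, entity)
    for doc_id, labels in sample_results.items()
    for label, entities in labels.items()
    for entity in entities
)

# Count how many entities were found per label across all documents.
label_counts = Counter(label for _, label, _ in rows)

# Documents for which no entity was detected.
empty_docs = [doc_id for doc_id, labels in sample_results.items() if not labels]

print(label_counts)
print(empty_docs)
```

Flat rows of this kind are convenient for filtering by label or for loading into a table.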

In [20]:
%time vulture.operate(documents, save_path=RESULTS_DIR, file_name=RESULTS_FILE)                   

[Vulture]: Cleaning 9 documents
  0%|          | 0/1 [00:00<?, ?it/s][Vulture]: Running NEDetector module
100%|██████████| 9/9 [00:01<00:00,  6.21it/s]
100%|██████████| 1/1 [00:01<00:00,  1.45s/it]

CPU times: user 8.29 s, sys: 7.11 s, total: 15.4 s
Wall time: 1.45 s





In [22]:
saved_file = ! ls $RESULTS_DIR
saved_file

['operated_documents_NEDetector.p']
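The `! ls` shell escape only works inside IPython or Jupyter. As a hedged alternative for plain Python scripts, the sketch below lists pickle files with `pathlib` and loads the first one. The temporary directory and the placeholder content stand in for the real `RESULTS_DIR` and the file written by Vulture; only the file name is taken from the output above.

```python
import pathlib
import pickle
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    results_dir = pathlib.Path(tmp_dir)  # stand-in for RESULTS_DIR

    # Placeholder for the file written by Vulture; the content is made up.
    placeholder = {"doc-1": {"ORG": {"LANL"}}}
    with open(results_dir / "operated_documents_NEDetector.p", "wb") as f:
        pickle.dump(placeholder, f)

    # Portable counterpart of `! ls $RESULTS_DIR`, restricted to pickle files.
    saved_files = sorted(p.name for p in results_dir.glob("*.p"))
    print(saved_files)

    # Load the first result file; the context manager closes the handle.
    with open(results_dir / saved_files[0], "rb") as f:
        reloaded = pickle.load(f)
```

Unlike `ls`, the `glob("*.p")` pattern skips any non-pickle files that may sit in the same directory.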

### Look at Operated Documents

In [23]:
with open(os.path.join(RESULTS_DIR, saved_file[0]), 'rb') as f:
    operated_documents = pickle.load(f)

In [24]:
def to_df(documents, operated_documents):
    # one row per document: ID, original text, and its operation result (None if missing)
    data = {
        'id': [],
        'text': [],
        'operation_result': []
    }

    for i, text in documents.items():
        data['id'].append(i)
        data['text'].append(text)
        data['operation_result'].append(operated_documents.get(i))

    return pd.DataFrame.from_dict(data)

In [25]:
df = to_df(documents, operated_documents)
df

| | id | text | operation_result |
|---|---|---|---|
| 0 | ad68055e-677f-11ee-95d4-4ab2673ea3f0 | Supervisory Control and Data Acquisition (SCAD... | {'ORG': {'LANL', 'Los Alamos National Laborato... |
| 1 | ad680626-677f-11ee-95d4-4ab2673ea3f0 | Highly specific datasets of scientific literat... | {'PRODUCT': {'SeNMFk'}, 'CARDINAL': {'two'}} |
| 2 | ad680658-677f-11ee-95d4-4ab2673ea3f0 | We propose an efficient distributed out-of-mem... | {'ORG': {'NVIDIA'}, 'PRODUCT': {'NMFk'}, 'CARD... |
| 3 | ad680680-677f-11ee-95d4-4ab2673ea3f0 | Identification of the family to which a malwar... | {'PRODUCT': {'the HNMFk Classifier', 'HNMFk Cl... |
| 4 | ad6806a8-677f-11ee-95d4-4ab2673ea3f0 | Malware is one of the most dangerous and costl... | {} |
| 5 | ad6806d0-677f-11ee-95d4-4ab2673ea3f0 | Malware is one of the most dangerous and costl... | {'PRODUCT': {'SmartTensors AI Platform', 'Malw... |
| 6 | ad6806f8-677f-11ee-95d4-4ab2673ea3f0 | Topic modeling is one of the key analytic tech... | {'CARDINAL': {'~2 million+', 'One', 'two'}, 'O... |
| 7 | ad680716-677f-11ee-95d4-4ab2673ea3f0 | Non-negative matrix factorization (NMF) with m... | {'CARDINAL': {'one', 'One', 'two'}, 'ORDINAL':... |
| 8 | ad68073e-677f-11ee-95d4-4ab2673ea3f0 | We propose an efficient, distributed, out-of-m... | {'CARDINAL': {'1'}} |
