# Vulture

## Introduction to Text Operations with Vulture

In [1]:
import os
import pickle
import pathlib
import pandas as pd

from TELF.pre_processing import Vulture
from TELF.pre_processing.Vulture.modules import SubstitutionOperator

## 0. Load Dataset

In [2]:
DATA_DIR = os.path.join('..', '..', 'data')
DATA_DIR = pathlib.Path(DATA_DIR).resolve()
DATA_FILE = 'documents.p'

with open(os.path.join(DATA_DIR, DATA_FILE), 'rb') as f:
    documents = pickle.load(f)
print(len(documents))
documents


9


{'ad68055e-677f-11ee-95d4-4ab2673ea3f0': 'Supervisory Control and Data Acquisition (SCADA) systems often serve as the nervous system for substations within power grids. These systems facilitate real-time monitoring, data acquisition, control of equipment, and ensure smooth and efficient operation of the substation and its connected devices. As the dependence on these SCADA systems grows, so does the risk of potential malicious intrusions that could lead to significant outages or even permanent damage to the grid. Previous work has shown that dimensionality reduction-based approaches, such as Principal Component Analysis (PCA), can be used for accurate identification of anomalies in SCADA systems. While not specifically applied to SCADA, non-negative matrix factorization (NMF) has shown strong results at detecting anomalies in wireless sensor networks. These unsupervised approaches model the normal or expected behavior and detect the unseen types of attacks or anomalies by identifying t
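
For readers without access to `documents.p`, the following is a minimal sketch of the input format, assuming (as the output above suggests) that the pickle holds a dictionary mapping document IDs to raw text. The toy texts, the temporary directory, and the variable names `toy_documents` and `loaded` are illustrative and not part of the original dataset.

```python
import pickle
import tempfile
import uuid
from pathlib import Path

# Toy corpus in the assumed format: {document ID (str): document text (str)}.
toy_documents = {
    str(uuid.uuid1()): "Malware detection with NMF.",
    str(uuid.uuid1()): "Topic modeling of scientific literature.",
}

# Write the corpus to a temporary pickle file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "documents.p"
    with open(path, "wb") as f:
        pickle.dump(toy_documents, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(len(loaded))
```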

### Output

In [3]:
RESULTS_DIR = 'results'
RESULTS_DIR = pathlib.Path(RESULTS_DIR).resolve()
RESULTS_FILE = 'operated_documents'

# Create the results directory if it does not already exist.
RESULTS_DIR.mkdir(exist_ok=True)

### Setup Vulture

In [4]:
vulture = Vulture(
    n_jobs=1,
    verbose=10,  # 0 disables logging; values >= 1 enable it
)

### Apply Substitutions

If `save_path` is not passed, `operate` returns a list of results, with one entry per operation. The cell below prints the document IDs as a template for the per-document substitution dictionary.

In [5]:
for d in documents.keys():
    print(f'"{d}":"",')

"ad68055e-677f-11ee-95d4-4ab2673ea3f0":"",
"ad680626-677f-11ee-95d4-4ab2673ea3f0":"",
"ad680658-677f-11ee-95d4-4ab2673ea3f0":"",
"ad680680-677f-11ee-95d4-4ab2673ea3f0":"",
"ad6806a8-677f-11ee-95d4-4ab2673ea3f0":"",
"ad6806d0-677f-11ee-95d4-4ab2673ea3f0":"",
"ad6806f8-677f-11ee-95d4-4ab2673ea3f0":"",
"ad680716-677f-11ee-95d4-4ab2673ea3f0":"",
"ad68073e-677f-11ee-95d4-4ab2673ea3f0":"",


In [6]:
document_substitutions = {
    "ad68055e-677f-11ee-95d4-4ab2673ea3f0":{"Supervisory Control and Data Acquisition (SCADA)": "supervisory_control_and_data_acquisition", "SCADA":"supervisory_control_and_data_acquisition"},
    "ad680626-677f-11ee-95d4-4ab2673ea3f0":{'Highly specific datasets of scientific literature': "dense_domain_specific_datasets"},
    "ad680658-677f-11ee-95d4-4ab2673ea3f0":{},
    "ad680680-677f-11ee-95d4-4ab2673ea3f0":{"malware":'software_designed_to_harm'},
    "ad6806a8-677f-11ee-95d4-4ab2673ea3f0":{"Malware":'malicious_software'},
    "ad6806d0-677f-11ee-95d4-4ab2673ea3f0":{},
    "ad6806f8-677f-11ee-95d4-4ab2673ea3f0":{"Topic modeling": 'matrix_decomposition'},
    "ad680716-677f-11ee-95d4-4ab2673ea3f0":{"NMF":'matrix_factorization_where_no_values_are_negative'},
    "ad68073e-677f-11ee-95d4-4ab2673ea3f0":{"We propose": "we show"},
}

corpus_substitutions = {
    'malware': 'malicious_software',
    'NMFk': 'nonnegative_matrix_factorization_k',
    'NMF': 'nonnegative_matrix_factorization'
}


In [7]:
substitution_step = SubstitutionOperator(document_substitutions=document_substitutions,
                                         corpus_substitutions=corpus_substitutions,
                                         document_priority=True)
vulture.operate(documents, steps=[substitution_step],
                save_path=RESULTS_DIR, file_name=RESULTS_FILE)

[Vulture]: Cleaning 9 documents
  0%|          | 0/1 [00:00<?, ?it/s][Vulture]: Running SubstitutionOperator module
100%|██████████| 9/9 [00:00<00:00, 7332.70it/s]
100%|██████████| 1/1 [00:00<00:00, 344.93it/s]
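
To illustrate how document-level and corpus-level substitutions can interact, here is a minimal pure-Python sketch. It is not Vulture's implementation: the function `apply_substitutions` is hypothetical. It assumes that `document_priority=True` means the document-level entry wins when both dictionaries define the same key, and that longer keys are matched first so that `NMFk` is not consumed by `NMF`.

```python
import re


def apply_substitutions(text, document_subs, corpus_subs, document_priority=True):
    """Hypothetical helper: replace terms in `text` using two mappings."""
    # Merge the mappings; on a key conflict the prioritized mapping wins (assumption).
    if document_priority:
        merged = {**corpus_subs, **document_subs}
    else:
        merged = {**document_subs, **corpus_subs}
    if not merged:
        return text

    # Try longer keys first so 'NMFk' is matched before its prefix 'NMF'.
    keys = sorted(merged, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    # A single regex pass means already-replaced text is never substituted again.
    return pattern.sub(lambda m: merged[m.group(0)], text)


corpus_subs = {
    "malware": "malicious_software",
    "NMFk": "nonnegative_matrix_factorization_k",
    "NMF": "nonnegative_matrix_factorization",
}
document_subs = {"malware": "software_designed_to_harm"}

result = apply_substitutions("NMFk extends NMF to detect malware.",
                             document_subs, corpus_subs)
print(result)
```

With `document_priority=False` in this sketch, the corpus-level replacement for `malware` would be used instead.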


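Each entry of the returned list is a tuple: index 0 is the name of the operation and index 1 is the results of that operation as a dictionary. Because `save_path` was passed in the call above, the results were saved to disk and are loaded back below.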
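
The tuple layout described above can be sketched as follows. The `results` list here is a hand-written stand-in, not actual Vulture output: the document ID and text are made up, and the shape of the per-document dictionary is assumed to match the saved results shown later in this notebook.

```python
# Hypothetical stand-in for the returned list: one (operation name, results) tuple per step.
results = [
    ("SubstitutionOperator",
     {"doc-1": {"replaced_text": "malicious_software spreads quickly."}}),
]

# Index the results by operation name for convenient lookup.
results_by_operation = {name: per_document for name, per_document in results}

for name, per_document in results_by_operation.items():
    print(name, len(per_document))
```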

In [13]:
with open(os.path.join(RESULTS_DIR, 'operated_documents_SubstitutionOperator.p'), 'rb') as f:
    operation_results = pickle.load(f)

In [15]:
operation_results

{'ad68055e-677f-11ee-95d4-4ab2673ea3f0': {'replaced_text': 'Supervisory Control and Data Acquisition (supervisory_control_and_data_acquisition) systems often serve as the nervous system for substations within power grids. These systems facilitate real-time monitoring, data acquisition, control of equipment, and ensure smooth and efficient operation of the substation and its connected devices. As the dependence on these supervisory_control_and_data_acquisition systems grows, so does the risk of potential malicious intrusions that could lead to significant outages or even permanent damage to the grid. Previous work has shown that dimensionality reduction-based approaches, such as Principal Component Analysis (PCA), can be used for accurate identification of anomalies in supervisory_control_and_data_acquisition systems. While not specifically applied to supervisory_control_and_data_acquisition, non-negative matrix factorization (nonnegative_matrix_factorization) has shown strong results a

The results are keyed by document ID. Each value is a dictionary holding that document's operation results, in this case the substituted text under `replaced_text`.
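
As a quick way to inspect a results dictionary of this shape, the sketch below lists the documents whose text was actually changed. The helper `changed_documents` and the toy inputs are hypothetical. Treating a missing `replaced_text` key as "unchanged" is an assumption, not documented Vulture behavior.

```python
def changed_documents(documents, operation_results):
    """Return the IDs whose substituted text differs from the original text."""
    changed = []
    for doc_id, text in documents.items():
        entry = operation_results.get(doc_id) or {}
        # Assumption: an entry without 'replaced_text' means nothing was replaced.
        if entry.get("replaced_text", text) != text:
            changed.append(doc_id)
    return changed


toy_documents = {"a": "NMF finds topics.", "b": "No acronyms here."}
toy_results = {
    "a": {"replaced_text": "nonnegative_matrix_factorization finds topics."},
    "b": {"replaced_text": "No acronyms here."},
}
print(changed_documents(toy_documents, toy_results))  # ['a']
```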

The helper below collects the original and substituted text into a DataFrame. Some entries may lack the `replaced_text` key, in which case indexing it directly raises `KeyError: 'replaced_text'`; the helper therefore falls back to the original text.

In [39]:
def to_df(documents, operated_documents):
    """Build a DataFrame of document IDs, original text, and substituted text."""
    data = {
        'id': [],
        'text': [],
        'replaced_text': []
    }

    for doc_id, text in documents.items():
        data['id'].append(doc_id)
        data['text'].append(text)

        # An entry may be missing or may lack 'replaced_text'. Fall back to the
        # original text (assumption: nothing was substituted) instead of raising KeyError.
        operation_current_doc = operated_documents.get(doc_id) or {}
        data['replaced_text'].append(operation_current_doc.get('replaced_text', text))

    return pd.DataFrame.from_dict(data)

In [41]:
df = to_df(documents, operation_results)
df