# Vulture

## Introduction to Text Operations with Vulture

In [None]:
import os
import pickle
import pathlib
import pandas as pd

from TELF.pre_processing import Vulture

## 0. Load Dataset

### Input

In [None]:
DATA_DIR = os.path.join('..', '..', 'data')
DATA_DIR = pathlib.Path(DATA_DIR).resolve()

In [None]:
DATA_FILE = 'documents.p'

In [None]:
# Use a context manager so the file handle is closed after loading
with open(os.path.join(DATA_DIR, DATA_FILE), 'rb') as f:
    documents = pickle.load(f)
len(documents)

### Output

In [None]:
RESULTS_DIR = 'results'
RESULTS_DIR = pathlib.Path(RESULTS_DIR).resolve()

In [None]:
RESULTS_FILE = 'operated_documents'

In [None]:
# Create the results directory if it does not already exist
RESULTS_DIR.mkdir(exist_ok=True)

### Examine Data Format

In [None]:
# keys serve as unique document IDs
list(documents.keys())

In [None]:
# values are the text that the operations will be applied to
documents[next(iter(documents))]

## NER Operation

The Vulture library is composed of multiple operation modules that can run one after another to perform a custom operation on the text. These modules are flexible, and their order can be rearranged to suit the user's needs. By default, Vulture implements an NER pipeline so that new users can get started quickly. In this section we examine the default Vulture pipeline and apply named entity recognition (NER) to the sample text.

The pipeline is just a list of Vulture modules that are applied sequentially. The default pipeline contains a single module: the ```NEDetector```.

In [None]:
Vulture.DEFAULT_OPERATOR_PIPELINE
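The idea of a pipeline as an ordered list of modules can be sketched in plain Python. This is a minimal illustration and not Vulture's implementation: `CapitalizedWordFinder`, `NumberFinder`, and `run_pipeline` are hypothetical names invented for the example, and the `(name, results)` output shape is chosen only to resemble the list of tuples that `operate` returns in this notebook.

```python
import re


class CapitalizedWordFinder:
    """Hypothetical module: collects capitalized words from a text."""
    name = "CapitalizedWordFinder"

    def __call__(self, text):
        return set(re.findall(r"\b[A-Z][a-z]+\b", text))


class NumberFinder:
    """Hypothetical module: collects runs of digits from a text."""
    name = "NumberFinder"

    def __call__(self, text):
        return set(re.findall(r"\d+", text))


def run_pipeline(documents, pipeline):
    """Apply each module, in list order, to every document."""
    results = []
    for module in pipeline:
        per_document = {doc_id: module(text) for doc_id, text in documents.items()}
        results.append((module.name, per_document))
    return results


# Toy input in the same {document ID: text} format used in this notebook
toy_documents = {"doc1": "Alice met Bob in 2020.", "doc2": "No entities here"}
toy_results = run_pipeline(toy_documents, [CapitalizedWordFinder(), NumberFinder()])
```

Reordering the list passed to `run_pipeline` changes the order of the entries in `toy_results`, which is the sense in which a pipeline's modules can be rearranged.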

### Setup Vulture

Create a single-node, multi-process Vulture object (limited here to a single job with ```n_jobs = 1```).

In [None]:
vulture = Vulture(n_jobs  = 1, 
                  verbose = 10,  # Disable == 0, Verbose >= 1
                 )

### Apply NER

If we do not pass ```save_path```, ```operate``` returns a list of results with one entry for each operation in the pipeline.

In [None]:
operation_results = vulture.operate(documents)

Each entry is a tuple: index 0 is the name of the operation, and index 1 holds the results of that operation as a dictionary.

In [None]:
operation_results[0][0]

The results dictionary maps each document ID to the result of the operation for that document, in this case the NER output.

In [None]:
operation_results[0][1]
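The nesting described above can be walked with ordinary loops. The sketch below uses mock data rather than real Vulture output: the entity label and value are invented, and the exact shape of each per-document NER result is an assumption made for illustration.

```python
# Mock data shaped like the description above: a list of
# (operation name, {document ID: result}) tuples.
# The entity label and value are invented for illustration only.
mock_results = [
    ("NEDetector", {
        "doc1": {"PERSON": {"Ada Lovelace"}},
        "doc2": {},
    }),
]

# Index the results by operation name for easier lookup
results_by_operation = dict(mock_results)

# Visit every (operation, document) pair
for operation_name, per_document in mock_results:
    for doc_id, result in per_document.items():
        print(operation_name, doc_id, result)
```

With a single-module pipeline the list has one entry, so `results_by_operation` holds one key; longer pipelines would add one key per operation.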

In [None]:
%time vulture.operate(documents, save_path=RESULTS_DIR, file_name=RESULTS_FILE)

In [None]:
saved_file = ! ls $RESULTS_DIR
saved_file
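The `! ls` shell escape only works inside IPython or Jupyter and on systems that provide `ls`. A portable alternative can be sketched with `pathlib`; `list_saved_files` is a hypothetical helper, and the file name in the demonstration is made up.

```python
import pathlib
import tempfile


def list_saved_files(directory):
    """Return the sorted names of regular files directly inside `directory`."""
    return sorted(p.name for p in pathlib.Path(directory).iterdir() if p.is_file())


# Demonstration on a throwaway directory; the file name is made up
with tempfile.TemporaryDirectory() as tmp_dir:
    (pathlib.Path(tmp_dir) / "example_output.p").write_bytes(b"")
    demo_listing = list_saved_files(tmp_dir)
```

In this notebook, `list_saved_files(RESULTS_DIR)` would give a list comparable to `saved_file`.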

### Look at Operated Documents

In [None]:
# Load the first saved results file, closing the handle afterwards
with open(os.path.join(RESULTS_DIR, saved_file[0]), 'rb') as f:
    operated_documents = pickle.load(f)

In [None]:
def to_df(documents, operated_documents):
    """Build a DataFrame pairing each document's text with its operation result."""
    data = {
        'id': [],
        'text': [],
        'operation_result': []
    }

    for i, text in documents.items():
        data['id'].append(i)
        data['text'].append(text)
        data['operation_result'].append(operated_documents.get(i))

    return pd.DataFrame.from_dict(data)

In [None]:
df = to_df(documents, operated_documents)
df