# Using Selectors and Adding Features and Tags to Kodexa Documents

In this notebook, we'll explore two key elements of a document's content nodes: features and tags.  We'll also learn how to choose specific nodes within a document using selectors.

Features give us a way to store additional information about a content node, such as spatial coordinates or other node-enriching data.  Tags are a special type of feature whose "feature_type" property is always "tag".
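To make the relationship between features and tags concrete, here is a minimal sketch in plain Python. The `SketchFeature` class, its field names, and the `bbox` example are hypothetical and only mirror the description above; this is not the Kodexa feature class.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchFeature:
    """Hypothetical stand-in for a feature; NOT the Kodexa class."""
    feature_type: str   # e.g. 'spatial' or 'tag'
    name: str           # e.g. 'bbox' or 'has_flue'
    value: Any = None   # optional payload attached to the node

    def is_tag(self) -> bool:
        # A tag is simply a feature whose type is 'tag'
        return self.feature_type == "tag"


# An invented spatial feature and a tag, both attached to the same kind of node
spatial = SketchFeature("spatial", "bbox", [10, 20, 110, 40])
tag = SketchFeature("tag", "has_flue")

print(spatial.is_tag(), tag.is_tag())
```

The point of the sketch is that a tag needs no separate storage mechanism: it travels with the node like any other feature and is distinguished only by its type.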

You can read more about features and tags in our [Developer Documentation](https://developer.kodexa.com/kodexa-platform/key-concepts/selectors-tags-and-features).

Adding tags and features to documents is widely used in Kodexa.  Tagging allows users to identify content and "label" it for later use, such as data extraction or logical decision making.  Tags also allow users to add data to a document without altering the document's underlying structure.


This notebook will give you examples of adding features and tags to documents, and using those tags to perform additional processing steps.


## Set up our imports
1. We'll be building pipelines to process our document, so we'll import Kodexa's Pipeline module
2. We'll import the FolderConnector - it can be provided to a Pipeline in order to access files
3. Some of our processing will need to happen in the cloud, so we'll import the KodexaPlatform and RemoteAction modules
4. All files that have been processed/parsed in Kodexa become Kodexa Documents, so we'll import that module as well.

We're also setting the CLOUD_URL value to the platform environment on which we want to perform our processing.


In [1]:
from kodexa import Document, Pipeline, KodexaPlatform, RemoteAction, FolderConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you've entered the access token value, press Enter to complete the action in the cell.  **You will then need to manually move focus to the next cell.**

In [2]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································


In [3]:

# We'll be using one of the text files (parsed and saved as kdxa) to work through the examples - setting the path to it here
import os

# Setting up location of data folders and files
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
TEXT_DATA_FILE = 'tongue_twister.kdxa'

TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER)
FULL_PATH = os.path.join(TEXT_FOLDER_PATH, TEXT_DATA_FILE)


## Get pre-parsed document from .kdxa file

In [4]:

kodexa_doc = Document.from_kdxa(FULL_PATH)

# Print the processed doc's contents
for n in kodexa_doc.get_root().children:
    print(f'\nNode uuid: {n.uuid} - contents: {n.get_all_content()}')
    for f in n.get_features():
        print(f'Feature: {f}')



Node uuid: 88781031-165e-44e0-9a67-b27cf1b646ea - contents: A flea and a fly got stuck in a flue.

Node uuid: 3e297ffc-1adc-48bf-962f-13b272657cd5 - contents: Said the flea to the fly, "What shall we do?"

Node uuid: efaae33d-aec9-4e7b-8b31-8d002194b7a8 - contents: Said the fly, "Let us flee!"

Node uuid: a2254d7c-6abe-46c6-99ce-8171bf7b59b1 - contents: Said the flea, "Let us fly!"

Node uuid: edb867e2-144c-475b-8da3-1c2ce35a13bc - contents: So they flew through a flaw in the flue.


## We can see each node's contents and features

Each of the child nodes has a content value, but none of them has any features.  We'll tag a few nodes and then re-inspect them.


## Let's tag all the nodes that contain the word 'flue'

We're going to set up a pipeline and use the NodeTagger action to perform the tagging.
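Before running the tagger, it can help to predict which lines the `.*flue.*` pattern used in the next cell should pick out. The sketch below applies that regular expression to the five lines printed above using Python's standard `re` module. It assumes the pattern is tested against each node's text content, which is an assumption about how `contentRegex` behaves rather than a description of the Kodexa implementation.

```python
import re

# The five lines of the tongue twister, as printed from the document above
lines = [
    'A flea and a fly got stuck in a flue.',
    'Said the flea to the fly, "What shall we do?"',
    'Said the fly, "Let us flee!"',
    'Said the flea, "Let us fly!"',
    'So they flew through a flaw in the flue.',
]

# Same pattern that is passed to the selector in the next cell
pattern = re.compile(r".*flue.*")

# Zero-based positions of the lines whose content matches the pattern
matching = [i for i, text in enumerate(lines) if pattern.match(text)]

print(matching)  # [0, 4]
```

Only the first and last lines contain 'flue'; note that 'flew', 'flee' and 'flaw' do not match.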


In [5]:

# Creating a new pipeline.  We'll use the document we just loaded as the connector
pipeline = Pipeline(kodexa_doc)

# We only want those nodes that contain the word 'flue'.  Tag any nodes that match with 'has_flue'
pipeline.add_step(RemoteAction(slug='kodexa/node-tagger', 
                                     options={'selector':'//*[contentRegex(".*flue.*")]', 'tag_to_apply':'has_flue', 'node_only':True}))

# Just do it!
pipeline.run()

# Get the document off the context, freshly tagged
kodexa_doc = pipeline.context.output_document


## Re-inspect the document's node contents and features

In [6]:
# Print the processed doc's contents
for n in kodexa_doc.get_root().children:
    print(f'\nNode uuid: {n.uuid} - contents: {n.get_all_content()}')
    for f in n.get_features():
        print(f'Feature: {f}')



Node uuid: 88781031-165e-44e0-9a67-b27cf1b646ea - contents: A flea and a fly got stuck in a flue.
Feature: Feature [type='tag' name='has_flue' value='[{'start': None, 'end': None, 'value': None, 'data': None}]' single='True']

Node uuid: 3e297ffc-1adc-48bf-962f-13b272657cd5 - contents: Said the flea to the fly, "What shall we do?"

Node uuid: efaae33d-aec9-4e7b-8b31-8d002194b7a8 - contents: Said the fly, "Let us flee!"

Node uuid: a2254d7c-6abe-46c6-99ce-8171bf7b59b1 - contents: Said the flea, "Let us fly!"

Node uuid: edb867e2-144c-475b-8da3-1c2ce35a13bc - contents: So they flew through a flaw in the flue.
Feature: Feature [type='tag' name='has_flue' value='[{'start': None, 'end': None, 'value': None, 'data': None}]' single='True']


## Hey!  Look at that!  We've got features!

Remember, features are used to add additional data to a document.  In the examples in notebook 1 (Kodexa_Document_Content_Model), we printed the features generated during PDF spatial analytic processing.  In that case, those spatial feature values are used to identify exactly where in a document each node would be rendered.

When we tag a document, we're adding a special type of feature with type 'tag'.  Tags are used to mark or identify nodes, and are also used to enrich nodes with extra meaning/information.  In this simple example, we tagged those nodes that contain the word 'flue'.  Now that they're tagged, we can pick those nodes out of the document using the tag name.


## Selector version 1 - use of hasTag selector to get the tagged nodes

In [7]:
tagged_nodes_1 = kodexa_doc.select("//*[hasTag('has_flue')]")

tagged_nodes_1

[<kodexa.model.model.ContentNode at 0x7fd710c52a50>,
 <kodexa.model.model.ContentNode at 0x7fd718329c50>]

## Selector version 2 - use tagRegex to get the tagged nodes

In [8]:
tagged_nodes_2 = kodexa_doc.select("//*[tagRegex('has_flue')]")

tagged_nodes_2

[<kodexa.model.model.ContentNode at 0x7fd710c52a50>,
 <kodexa.model.model.ContentNode at 0x7fd718329c50>]

## Selector version 3 - using hasFeature to get the tagged nodes

In [9]:
tagged_nodes_3 = kodexa_doc.select("//*[hasFeature('tag','has_flue')]")

tagged_nodes_3

[<kodexa.model.model.ContentNode at 0x7fd710c52a50>,
 <kodexa.model.model.ContentNode at 0x7fd718329c50>]

## The different selectors return the same nodes

You can see that each of these selector calls returns the same nodes, even though they use different syntax.  For more information on selector syntax, review our [Developer Documentation](https://developer.kodexa.com/developers/documentation/selectors).
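The three selectors agree here because the document carries only one tag name. The plain-Python sketch below illustrates where they could diverge. Everything in it is hypothetical: the dictionary-based node model, the helper functions, and the invented `has_flue_twice` tag. In particular, treating `tagRegex` as a prefix-style `re.match` on the tag name is an assumption to check against the selector documentation.

```python
import re

# Hypothetical node model: label -> list of (feature_type, feature_name) pairs
nodes = {
    "line 1": [("tag", "has_flue")],
    "line 2": [],
    "line 5": [("tag", "has_flue")],
    "extra": [("tag", "has_flue_twice")],  # invented tag, not in the notebook's document
}


def tag_names(features):
    # Tags are the features whose type is 'tag'
    return [name for ftype, name in features if ftype == "tag"]


def has_tag(features, name):
    # Exact tag-name comparison
    return name in tag_names(features)


def tag_regex(features, pattern):
    # ASSUMPTION: the regex is matched from the start of each tag name
    return any(re.match(pattern, name) for name in tag_names(features))


def has_feature(features, ftype, name):
    # Exact comparison on both feature type and feature name
    return (ftype, name) in features


by_has_tag = [k for k, f in nodes.items() if has_tag(f, "has_flue")]
by_tag_regex = [k for k, f in nodes.items() if tag_regex(f, "has_flue")]
by_has_feature = [k for k, f in nodes.items() if has_feature(f, "tag", "has_flue")]

print(by_has_tag)      # ['line 1', 'line 5']
print(by_tag_regex)    # ['line 1', 'line 5', 'extra']
print(by_has_feature)  # ['line 1', 'line 5']
```

Under these assumptions, the exact-match forms are the safer choice when tag names share a common prefix, while the regex form is useful for selecting a family of related tags in one call.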