# Adding Features and Tagging Kodexa Documents

In this notebook, we'll explore two key elements of a document's content nodes: features and tags.

Features give us a way to store additional information about a content node, such as spatial coordinates or other node-enriching data.  Tags are a special type of feature whose "feature_type" property is always "tag".
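
To make the relationship between features and tags concrete, here's a small plain-Python sketch.  It is a toy model only, not Kodexa's implementation: the `Feature` and `ToyNode` classes, their method names, and the example feature names are all assumptions made up for illustration.  The one idea it borrows from the text above is that a tag is simply a feature whose feature type is "tag".

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Feature:
    # Hypothetical stand-in for a feature: a typed, named value on a node
    feature_type: str
    name: str
    value: Any


@dataclass
class ToyNode:
    # Hypothetical stand-in for a content node (not the Kodexa class)
    content: str
    features: List[Feature] = field(default_factory=list)

    def add_feature(self, feature_type: str, name: str, value: Any) -> None:
        # Attach extra information without touching the node's content
        self.features.append(Feature(feature_type, name, value))

    def tag(self, name: str) -> None:
        # A tag is stored as an ordinary feature with feature_type "tag"
        self.add_feature("tag", name, {})

    def get_tags(self) -> List[str]:
        # Tags are recovered by filtering features on their type
        return [f.name for f in self.features if f.feature_type == "tag"]


node = ToyNode('Said the fly, "Let us flee!"')
node.add_feature("spatial", "bbox", [0, 0, 120, 14])  # made-up coordinates
node.tag("has_fly")
print(node.get_tags())
```

Notice that tagging adds to the node's feature list and leaves `content` unchanged, which is the property that lets tags carry data without altering the document's structure.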

You can read more about features & tags in our [Developer Documentation](https://developer.kodexa.com/kodexa-platform/key-concepts/selectors-tags-and-features).

Adding tags and features to documents is common practice in Kodexa.  Tagging allows users to identify content and "label" it for later use, such as data extraction or logical decision making.  Tags also allow users to add data to a document without altering the document's underlying structure.


This notebook will give you examples of adding features and tags to documents, and using those tags to perform additional processing steps.


## Set up our imports
1. We'll be building pipelines to process our document, so we'll import Kodexa's Pipeline module.
2. We'll import the FolderConnector and provide it to the Pipeline in order to access our file.
3. Some of our parsing will need to happen in the cloud, so we'll import the KodexaPlatform and KodexaAction modules.
4. All files that have been processed/parsed in Kodexa become Kodexa Documents, so we'll import the Document module as well.

We're also setting the CLOUD_URL value to the platform environment on which we want to perform our processing.


In [1]:
from kodexa import Document, Pipeline, KodexaPlatform, KodexaAction, FolderConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token that you created in the environment specified by CLOUD_URL.
If you haven't created a token yet, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you're done entering the access token value, hit enter to complete the action in the cell.

In [2]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
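
If you run this notebook unattended, the interactive prompt can get in the way.  One possible alternative, sketched below, is to look for the token in an environment variable first and only fall back to the masked prompt when it isn't set.  The helper `read_access_token` and the variable name `KODEXA_ACCESS_TOKEN` are assumptions chosen for this example, not something this notebook defines elsewhere.

```python
import getpass
import os


def read_access_token(env_var="KODEXA_ACCESS_TOKEN", prompt="Enter access token:"):
    # Hypothetical helper: prefer a token supplied through the environment
    token = os.environ.get(env_var)
    if token and token.strip():
        # Strip stray whitespace or newlines picked up when the variable was set
        return token.strip()
    # Otherwise fall back to the same masked prompt used in the cell above
    return getpass.getpass(prompt)


# Usage (would replace the getpass call in the cell above):
# ACCESS_TOKEN = read_access_token()
```

Either way, the token never needs to be written into the notebook itself.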


In [3]:

# We'll be using one of the text files to work through the examples - setting the path to it here
import os

# Setting up location of data folders and files
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
TEXT_DATA_FILE = 'tongue_twister.txt'
TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER)
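
Because the path above is built relative to the current working directory, it's easy to point at the wrong place if the notebook is launched from a different folder.  The sketch below shows one way we might confirm that the file exists before handing the folder to a pipeline.  The helper name `resolve_input_file` is an assumption made for this example.

```python
import os


def resolve_input_file(folder, filename):
    # Hypothetical helper: build an absolute path and fail early if it's missing
    path = os.path.abspath(os.path.join(folder, filename))
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Expected input file not found: {path}")
    return path


# Usage with the names defined in the cell above:
# resolve_input_file(TEXT_FOLDER_PATH, TEXT_DATA_FILE)
```

Failing here gives a clearer error message than discovering the problem later, when the pipeline finds no documents to process.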

## Create a Pipeline and parse our text file

In [4]:

pipeline = Pipeline(FolderConnector(path=TEXT_FOLDER_PATH, file_filter=TEXT_DATA_FILE))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the first 200 characters of the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()[:200]}')


Processed doc's contents:
A flea and a fly got stuck in a flue.
Said the flea to the fly, "What shall we do?"
Said the fly, "Let us flee!"
Said the flea, "Let us fly!"
So they flew through a flaw in the flue.


## Let's tag all the nodes that contain the word 'fly'