# Extracting Data to Stores

## Set up our imports
1. We'll be building pipelines to process our document, so we'll import Kodexa's Pipeline module
2. Import the FolderConnector, which we'll provide to the Pipeline in order to access our file
3. Some of our parsing will need to happen in the cloud, so we'll import the KodexaPlatform and KodexaAction modules
4. All files that have been processed/parsed in Kodexa become Kodexa Documents, so we'll import that module as well.

We're also setting the CLOUD_URL value to the platform environment on which we want to perform our processing.


In [2]:
from kodexa import Document, Pipeline, KodexaPlatform, KodexaAction, FolderConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token that you created in the environment specified by CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you're done entering the access token value, hit enter to complete the action in the cell.  **You will then need to manually move to the next cell.**

In [3]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
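
If you run this notebook often, typing the token every time gets tedious. Here is a minimal sketch of one alternative: read the token from an environment variable and fall back to the masked prompt only when the variable is unset. The helper `read_access_token` and the variable name `KODEXA_ACCESS_TOKEN` are names we made up for this example; they are not something this notebook or Kodexa is confirmed to define.

```python
import getpass
import os


def read_access_token(env_var="KODEXA_ACCESS_TOKEN", prompt="Enter access token:"):
    """Return the token from the environment, or prompt for it if unset.

    Both the helper and the default variable name are hypothetical,
    chosen only for this sketch.
    """
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the same masked prompt used in the cell above
    return getpass.getpass(prompt)
```

The returned value could then be passed to `KodexaPlatform.set_access_token(...)` exactly as in the cell above.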


In [4]:

# We'll be using one of the text files (parsed and saved as kdxa) to work through the examples - setting the path to it here
import os

# Setting up location of data folders and files
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
TEXT_DATA_FILE = 'tongue_twister.kdxa'

TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER)
FULL_PATH = os.path.join(TEXT_FOLDER_PATH, TEXT_DATA_FILE)

# Getting our pre-processed document from the .kdxa file
kodexa_doc = Document.from_kdxa(FULL_PATH)

## Let's tag all the nodes that contain the word 'flue' AND write that data to a TableDataStore

We're going to set up a pipeline and use the NodeTagger action to perform the tagging.

In [None]:

# Creating a new pipeline.  We'll use the document we just loaded as the connector
pipeline = Pipeline(kodexa_doc)

# We only want the nodes that contain the word 'flue'.  Tag any matching nodes with 'has_flue'
pipeline.add_step(KodexaAction(slug='kodexa/node-tagger', options={
    'selector': '//*[contentRegex(".*flue.*")]',
    'tag_to_apply': 'has_flue',
    'node_only': True
}))

# Extract the tags to key/value pairs and write them to a store named 'tagged_data'
pipeline.add_step(KodexaAction(slug='kodexa/tags-to-key-value-pair-extractor', options={
    "store_name": "tagged_data",
    "include_node_content": True
}))

pipeline.run()

# Getting the processed document off the pipeline
tagged_kodexa_doc = pipeline.context.output_document
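
To get a feel for which lines the `.*flue.*` pattern in the selector should pick out, here is a minimal sketch that applies the same pattern with Python's standard `re` module to the five lines of the tongue twister. It is only an approximation: we're assuming `contentRegex` behaves like a match against a node's whole content, which this notebook doesn't confirm, and `lines_matching` is a helper name made up for the example.

```python
import re

# The five lines of the tongue twister, one per content node
lines = [
    'A flea and a fly got stuck in a flue.',
    'Said the flea to the fly, "What shall we do?"',
    'Said the fly, "Let us flee!"',
    'Said the flea, "Let us fly!"',
    'So they flew through a flaw in the flue.',
]

# Same pattern that appears inside contentRegex(...) in the selector
pattern = re.compile(r".*flue.*")


def lines_matching(pattern, lines):
    """Return the lines that the pattern matches in full (hypothetical helper)."""
    return [line for line in lines if pattern.fullmatch(line)]


matching = lines_matching(pattern, lines)
for line in matching:
    print(line)
```

Only the first and last lines contain 'flue', so those are the two we'd expect to carry the 'has_flue' tag.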


In [None]:

# Retrieve the store that the extractor step wrote to
tagged_pairs_store = pipeline.context.get_store('tagged_data')


In [None]:
tagged_pairs_store.to_df()

In [8]:
for n in tagged_kodexa_doc.get_root().children:
    print(f'\nNode uuid: {n.uuid} - contents: {n.get_all_content()}')
    for f in n.get_features():
        print(f'Feature: {f}')


Node uuid: 88781031-165e-44e0-9a67-b27cf1b646ea - contents: A flea and a fly got stuck in a flue.
Feature: Feature [type='tag' name='has_flue' value='[{'start': None, 'end': None, 'value': None, 'data': None}]' single='True']

Node uuid: 3e297ffc-1adc-48bf-962f-13b272657cd5 - contents: Said the flea to the fly, "What shall we do?"

Node uuid: efaae33d-aec9-4e7b-8b31-8d002194b7a8 - contents: Said the fly, "Let us flee!"

Node uuid: a2254d7c-6abe-46c6-99ce-8171bf7b59b1 - contents: Said the flea, "Let us fly!"

Node uuid: edb867e2-144c-475b-8da3-1c2ce35a13bc - contents: So they flew through a flaw in the flue.
Feature: Feature [type='tag' name='has_flue' value='[{'start': None, 'end': None, 'value': None, 'data': None}]' single='True']
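
If we only want the tagged nodes rather than a printout of every node, a small filter along these lines could work. This is a sketch under assumptions: it relies on `get_features()` as used in the cell above, and it assumes each feature exposes a `name` attribute (suggested by the printed `name='has_flue'`, but not confirmed here). The `nodes_with_tag` and `make_node` helpers are hypothetical, and the nodes below are simple stand-ins so the example runs without Kodexa.

```python
from types import SimpleNamespace


def nodes_with_tag(nodes, tag_name):
    """Return the nodes that carry a feature named tag_name.

    Assumes each feature has a `name` attribute (an assumption, see above).
    """
    return [
        n for n in nodes
        if any(getattr(f, "name", None) == tag_name for f in n.get_features())
    ]


def make_node(content, tag_names):
    """Build a stand-in node with just enough shape for this sketch."""
    features = [SimpleNamespace(name=t) for t in tag_names]
    return SimpleNamespace(content=content, get_features=lambda: features)


nodes = [
    make_node("A flea and a fly got stuck in a flue.", ["has_flue"]),
    make_node('Said the fly, "Let us flee!"', []),
    make_node("So they flew through a flaw in the flue.", ["has_flue"]),
]

tagged = nodes_with_tag(nodes, "has_flue")
for n in tagged:
    print(n.content)
```

With real documents, the list passed in would be `tagged_kodexa_doc.get_root().children`, as in the loop above.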
