# Providing Data to Pipelines with Connectors

Connectors provide data to Pipelines.  Just as there are many types of unstructured data, there are multiple ways to connect that data to pipelines (we're just so flexible).  You can read files, folders, or URLs, or just pass in plain text.  We've provided examples of the connectors you're most likely to find useful in your applications.

* Plain text
* Existing Kodexa document
* From a Kodexa JsonDocumentStore
* From a folder, using the FolderConnector
* From a file, using the FileHandleConnector
* From a URL, using the UrlConnector

When creating a Kodexa Document from plain text or reading a JsonDocumentStore of fully-formed Kodexa Documents, you do not need to add a step to the pipeline to parse the documents.  The pipeline can process these inputs directly, because they are already in the Kodexa Document structure.

When using connectors of other types (files, folders, URLs) to read non-Kodexa Documents, you will need to add a pipeline step to parse the document so that it's fully-formed.  If you do not provide a parser, Kodexa Documents will still be returned by the pipeline, but there will be no content node text on the document, only metadata describing the connector and document source details.
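
To make the distinction concrete, here is a toy model in plain Python.  It does not use the Kodexa API: `ToyDocument`, `read_source`, `parse_text`, and `run` are hypothetical stand-ins that only mimic the behavior described above.  A connector on its own yields a document with source metadata and no content, and a parsing step fills the content in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document; not the Kodexa Document class."""
    metadata: dict = field(default_factory=dict)
    raw: bytes = b""
    content: Optional[str] = None  # stays None until a parser step runs


def read_source(name: str, raw: bytes) -> ToyDocument:
    """Plays the role of a connector: records where the data came from."""
    return ToyDocument(metadata={"source": name}, raw=raw)


def parse_text(doc: ToyDocument) -> ToyDocument:
    """Plays the role of a parser step: turns the raw bytes into content."""
    doc.content = doc.raw.decode("utf-8")
    return doc


def run(doc: ToyDocument, steps: List[Callable[[ToyDocument], ToyDocument]]) -> ToyDocument:
    """Apply each step in order, the way a pipeline would."""
    for step in steps:
        doc = step(doc)
    return doc


# Same source, once without a parser step and once with one
unparsed = run(read_source("tongue_twister.txt", b"A flea and a fly"), steps=[])
parsed = run(read_source("tongue_twister.txt", b"A flea and a fly"), steps=[parse_text])

print(unparsed.metadata, unparsed.content)
print(parsed.metadata, parsed.content)
```

The unparsed document keeps its metadata but has no content, which mirrors what the pipeline returns when no parser step is added.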

## Set up our imports
1. We'll be building pipelines to process our documents, so we'll import Kodexa's Pipeline module
2. Import all the connector types that we plan to try out
3. Some of our parsing will need to happen in the cloud, so we'll import the KodexaPlatform and RemoteAction modules
4. All files that have been processed/parsed in Kodexa become Kodexa Documents, so we'll import that module as well.

We're also setting the CLOUD_URL value to the platform environment on which we want to perform our processing.


In [2]:
from kodexa import Document, Pipeline, KodexaPlatform, RemoteAction, JsonDocumentStore, FileHandleConnector, FolderConnector, UrlConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter your access token that you've created in the environment specified by the CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you're done entering the access token value, hit enter to complete the action in the cell.  **You will then need to manually move focus to the next cell.**

In [3]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
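
If you run this notebook unattended, typing the token at a prompt isn't practical.  Here is a minimal sketch, assuming you're comfortable keeping the token in an environment variable.  The variable name `KODEXA_ACCESS_TOKEN` and the helper `get_access_token` are our own choices for illustration, not something the Kodexa library reads on its own.

```python
import getpass
import os


def get_access_token(env_var: str = "KODEXA_ACCESS_TOKEN") -> str:
    """Return the token from the environment, or prompt for it if unset."""
    token = os.environ.get(env_var)
    if token:
        return token.strip()
    # Fall back to the same masked prompt used in the cell above
    return getpass.getpass("Enter access token:").strip()
```

The returned value can be passed to `KodexaPlatform.set_access_token(...)` in the same way as `ACCESS_TOKEN` above.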


In [4]:
import os

# Setting up location of data folders and files
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
JSON_STORE_FOLDER = 'json_doc_stores'
TEXT_DATA_FILE = 'tongue_twister.txt'

TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER)
FULL_PATH = os.path.join(TEXT_FOLDER_PATH, TEXT_DATA_FILE)
JSON_STORE_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, JSON_STORE_FOLDER, 'text_json_store')
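
A mistyped folder name here may only show up later, when a pipeline has nothing to read.  As a hedged sketch, the helper below (`missing_paths` is a name we made up) reports which of the configured locations don't exist, so you can catch that before running any pipeline.

```python
import os
from typing import Dict, List


def missing_paths(paths: Dict[str, str]) -> List[str]:
    """Return the labels of any paths that do not exist on disk."""
    return [label for label, path in paths.items() if not os.path.exists(path)]
```

In this notebook you could call it as `missing_paths({'TEXT_FOLDER_PATH': TEXT_FOLDER_PATH, 'FULL_PATH': FULL_PATH, 'JSON_STORE_PATH': JSON_STORE_PATH})`.  An empty list means all three locations were found.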

## Using plain text

In [5]:
# Using a childhood tongue twister for our input text
text = 'A flea and a fly got stuck in a flue.  Said the flea to the fly, "What shall we do?" \
Said the fly, "Let us flee!" Said the flea, "Let us fly!" So they flew through a flaw in the flue.'

# Create the pipeline using our text string
# Since this Kodexa Document is being created from plain text (not from a file or other input),
# we don't need to add a parsing action to the pipeline
pipeline = Pipeline.from_text(text)

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
A flea and a fly got stuck in a flue.  Said the flea to the fly, "What shall we do?" Said the fly, "Let us flee!" Said the flea, "Let us fly!" So they flew through a flaw in the flue.


## Using existing Kodexa Document

In [6]:
# Create the pipeline using the kodexa_doc created in the previous example cell.
pipeline = Pipeline(kodexa_doc)

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
A flea and a fly got stuck in a flue.  Said the flea to the fly, "What shall we do?" Said the fly, "Let us flee!" Said the flea, "Let us fly!" So they flew through a flaw in the flue.


## Using a JsonDocumentStore

A JsonDocumentStore contains Kodexa documents that have already been processed and have been stored in JSON format.

In [7]:

# instantiate the store and provide the location of our already-prepared data
json_doc_store = JsonDocumentStore(store_path=JSON_STORE_PATH)

# Using the store as the connector for the pipeline.  We don't need to parse these documents as they're already in the Kodexa Document structure
pipeline = Pipeline(json_doc_store)
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nContents from JsonDocumentStore doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


AttributeError: 'NoneType' object has no attribute 'get_root'
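
The `AttributeError` above tells us that `pipeline.context.output_document` was `None` when the cell ran, meaning the pipeline produced no document.  We can't tell from the error alone why that happened (an empty or missing store at `JSON_STORE_PATH` is one possibility).  A small guard makes this easier to spot.  `content_or_message` is a helper name we're introducing here, and it relies only on the `get_root().get_all_content()` calls used throughout this notebook.

```python
def content_or_message(doc, source_name: str) -> str:
    """Return the document's text, or a readable message if there is no document."""
    if doc is None:
        return f'No document was returned from {source_name}'
    return doc.get_root().get_all_content()
```

In the cell above, the final line could then become `print(content_or_message(pipeline.context.output_document, 'JsonDocumentStore'))`.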

## Using a FolderConnector


In [8]:

# Create the pipeline with the full path to the folder.  
# We can also specify a file_filter to limit selection.  In this case, we're 
# filtering by the file name, but we could also have passed in the file extension ('*.txt'), and it would have pulled all text files
pipeline = Pipeline(FolderConnector(path=TEXT_FOLDER_PATH, file_filter=TEXT_DATA_FILE))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nContents from FolderConnector doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')



Contents from FolderConnector doc's contents:
A flea and a fly got stuck in a flue.
Said the flea to the fly, "What shall we do?"
Said the fly, "Let us flee!"
Said the flea, "Let us fly!"
So they flew through a flaw in the flue.
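
To get a feel for what a filter pattern would select, you can try it against a list of file names with Python's `fnmatch` module.  This is only an approximation: we're assuming `file_filter` behaves like a shell-style wildcard pattern, which is how the `'*.txt'` example reads, and the file names below are made up.

```python
from fnmatch import fnmatch

# Hypothetical folder listing, for illustration only
file_names = ['tongue_twister.txt', 'limerick.txt', 'notes.md']


def select(names, pattern):
    """Return the names matching a shell-style wildcard pattern."""
    return [name for name in names if fnmatch(name, pattern)]


# An exact file name matches only that file
print(select(file_names, 'tongue_twister.txt'))

# A wildcard on the extension matches every text file
print(select(file_names, '*.txt'))
```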


## Using a UrlConnector

In [9]:

# We're going to read the Privacy page from Kodexa's website
HTML_FILE_URL = 'https://www.kodexa.com/privacy.html'

# Create the Pipeline and provide the URL
pipeline = Pipeline.from_url(HTML_FILE_URL)

# We know we're reading HTML, so our first step in the Pipeline is to parse it.
# The parser will produce a Kodexa document, which we'll retrieve after the pipeline runs
pipeline.add_step(RemoteAction(slug='kodexa/html-parser', options={"summarize":True,"encoding":"utf8"}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')




Processed doc's contents:
Your privacy is important to us. It is Kodexa, Inc's policy to respect your privacy regarding any information we may collect from you across our website,
       
      , and other sites we own and operate.
      http://www.kodexa.com We only ask for personal information when we truly need it to provide a service to you. We collect it by fair and lawful means, with your knowledge and consent. We also let you know why we’re collecting it and how it will be used. We only retain collected information for as long as necessary to provide you with your requested service. What data we store, we’ll protect within commercially acceptable means to prevent loss and theft, as well as unauthorized access, disclosure, copying, use or modification. We don’t share any personally identifying information publicly or with third-parties, except when required to by law. Our website may link to external sites that are not operated by us. Please be aware that we have no control over
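
The text above still carries line breaks and indentation from the page's HTML.  If you want a single clean line for display, one option is to collapse the whitespace in plain Python after the pipeline runs.  This is a post-processing sketch of our own (`collapse_whitespace` is not a Kodexa function), shown on a short sample string rather than the real document.

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and newlines with a single space."""
    return ' '.join(text.split())


# A short sample shaped like the output above
sample = 'across our website,\n       \n      , and other sites we own and operate.'
print(collapse_whitespace(sample))
```

In the notebook, you would pass it the string returned by `kodexa_doc.get_root().get_all_content()`.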

## Using a FileHandleConnector

In [10]:
"""
# When using the FileHandleConnector, provide the entire path to the source file
pipeline = Pipeline(FileHandleConnector(FULL_PATH))

# When using a FileHandleConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')
"""
