# Parsing Sources with Kodexa's Parsers


In [1]:
from kodexa import Document, Pipeline, KodexaPlatform, RemoteAction, FolderConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token yet, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note: The text you enter in the prompt field will be masked. Once you've entered the access token, press Enter to complete the cell. **You will then need to manually move focus to the next cell.**

In [2]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
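
If you run this notebook unattended, an interactive prompt can get in the way. The following is a minimal sketch, not part of the Kodexa API: it looks for the token in an environment variable first and only falls back to the masked prompt when none is set. The variable name `KODEXA_ACCESS_TOKEN` and the helper `resolve_access_token` are assumptions made for illustration.

```python
import getpass
import os


def resolve_access_token(env=None, prompt="Enter access token:"):
    """Return the token from the environment if present, else prompt for it.

    KODEXA_ACCESS_TOKEN is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    token = env.get("KODEXA_ACCESS_TOKEN", "").strip()
    if token:
        return token
    # Fall back to the same masked prompt used in the cell above
    return getpass.getpass(prompt)
```

The result could then be passed to `KodexaPlatform.set_access_token(...)` in place of the value read directly from `getpass`.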


In [3]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'

TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'texts')
EXCEL_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'excel_workbooks')
PDF_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'pdfs')
HTML_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'htmls')
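
The paths above are built relative to the current working directory, so they depend on where the notebook is launched from. As a hedged sketch, a small check like the one below can flag missing folders before any pipeline runs. The helper name `missing_folders` is hypothetical and uses only the standard library.

```python
import os


def missing_folders(paths):
    """Return the paths that do not point to an existing directory."""
    return [p for p in paths if not os.path.isdir(p)]
```

For example, calling `missing_folders([TEXT_FOLDER_PATH, EXCEL_FOLDER_PATH, PDF_FOLDER_PATH, HTML_FOLDER_PATH])` returns an empty list when all four data folders are in place.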


## Parsing a Text File

We'll add a step to the pipeline that uses the 'kodexa/text-parser'.

In [4]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=TEXT_FOLDER_PATH, file_filter='tongue_twister.txt'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
A flea and a fly got stuck in a flue.
Said the flea to the fly, "What shall we do?"
Said the fly, "Let us flee!"
Said the flea, "Let us fly!"
So they flew through a flaw in the flue.


## Parsing an Excel File

We'll add a step to the pipeline that uses the 'kodexa/excel-parser'.

In [8]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=EXCEL_FOLDER_PATH, file_filter='2019_Business_Expenses.xlsx'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/excel-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the first 100 chars of the processed doc's contents
print(f'\nProcessed doc\'s contents (first 100 chars):\n{kodexa_doc.get_root().get_all_content()[:100]}....')


Processed doc's contents (first 100 chars):
2019 Reimbursables UTILITIES Phone ISP Electricity Natural Gas Water Sewer Amt due from Business Jan....
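
The print statement above always appends `....`, even when the content is shorter than 100 characters. A minimal sketch of a preview helper that only adds the marker when text was actually cut off is shown here. The name `preview` and its parameters are hypothetical and are not part of the Kodexa library.

```python
def preview(text, limit=100, marker="...."):
    """Return text unchanged if it fits within limit, else a truncated copy plus marker."""
    if len(text) <= limit:
        return text
    return text[:limit] + marker
```

It could be used as `print(preview(kodexa_doc.get_root().get_all_content()))` in any of the cells that print a truncated result.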


## Parsing a PDF File

We'll add a step to the pipeline that uses the 'kodexa/pdf-parser'.

In [9]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=PDF_FOLDER_PATH, file_filter='Kodexa_Privacy.pdf'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/pdf-parser', options={"layout_analysis_options":{"rollup":"word","space_multiplier":1},
                                      "analyze_layout":True}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the first 100 chars of the processed doc's contents
print(f'\nProcessed doc\'s contents (first 100 chars):\n{kodexa_doc.get_root().get_all_content()[:100]}....')


Processed doc's contents (first 100 chars):
6/12/2020 Page | Canvas  PRIVACY POLICY Your privacy is important to us. It is Kodexa, Inc's policy....


## Parsing an HTML File

This pipeline reads an HTML file from a local folder and processes it using the 'kodexa/html-parser'.

In [10]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=HTML_FOLDER_PATH, file_filter='Kodexa_Privacy.html'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/html-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the first 100 chars of the processed doc's contents
print(f'\nProcessed doc\'s contents (first 100 chars):\n{kodexa_doc.get_root().get_all_content()[:100]}....')


Processed doc's contents (first 100 chars):
Your privacy is important to us. It is Kodexa, Inc's policy to respect your privacy regarding any in....
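
The four folder-based examples above differ mainly in the parser slug, which has to be chosen by hand for each file. As a rough sketch, that choice could be driven by the file extension. The mapping below only contains the extensions and slugs that appear in this notebook, and `parser_slug_for` is a hypothetical helper, not a Kodexa function.

```python
import os

# Extension-to-slug mapping taken from the examples in this notebook
PARSER_SLUGS = {
    ".txt": "kodexa/text-parser",
    ".xlsx": "kodexa/excel-parser",
    ".pdf": "kodexa/pdf-parser",
    ".html": "kodexa/html-parser",
}


def parser_slug_for(filename):
    """Return the parser slug for a filename, based on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSER_SLUGS[ext]
    except KeyError:
        raise ValueError(f"No parser configured for extension {ext!r}") from None
```

The returned slug could then be passed as the `slug` argument of `RemoteAction`. Parser-specific options, such as the layout analysis settings used for the PDF parser, would still need to be supplied separately.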


## Parsing an SEC EDGAR Filing Document (text)

This is a specialty parser designed to parse EDGAR filings. These filings are available on the SEC's website.

We'll add a step to the pipeline that uses the 'kodexa/sec-edgar-parser'.


In [11]:

# The EDGAR document we're parsing is on the SEC's website.
# We'll use the 'from_url' method to connect that source to the pipeline
pipeline = Pipeline.from_url("https://www.sec.gov/Archives/edgar/data/1606180/0001564590-20-014614.txt")

# Specify the parser that should be used for this document
pipeline.add_step(RemoteAction(slug='kodexa/sec-edgar-parser', options={}))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the first 100 chars of the processed doc's contents
print(f'\nProcessed doc\'s contents (first 100 chars):\n{kodexa_doc.get_root().get_all_content()[:100]}....')



Processed doc's contents (first 100 chars):
aac-nt10k_20191231.DOCX.htm SEC FILE NUMBER 001-36643 CUSIP NUMBER 000307108  UNITED STATES SECURITI....


## Parsing an RSS Feed

We'll be adding a step to the pipeline that uses the 'kodexa/rss-parser'

In [12]:
## Demo coming soon!

## Using the Slack Event Parser

We'll be adding a step to the pipeline that uses the 'kodexa/slack-parser'


In [13]:
## Demo coming soon!