# Parsing Sources with Kodexa's Parsers


In [1]:
from kodexa import Document, Pipeline, KodexaPlatform, KodexaAction, FolderConnector

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note: The text you type into the prompt is masked. When you've finished entering the access token, press Enter to complete the cell.

In [2]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
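
For runs where nobody is available to type at the prompt, the token could instead be read from an environment variable, falling back to the masked prompt when the variable is unset. This is only a minimal sketch: `KODEXA_ACCESS_TOKEN` and `resolve_access_token` are hypothetical names chosen for illustration, not something the Kodexa library defines.

```python
import getpass
import os


def resolve_access_token(env_var="KODEXA_ACCESS_TOKEN"):
    """Return the token from the environment if set, else prompt for it.

    The variable name is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    if token:
        return token
    # Fall back to the same masked prompt used in the cell above
    return getpass.getpass("Enter access token:")
```

The returned value can then be passed to `KodexaPlatform.set_access_token` exactly as in the cell above.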


In [3]:
import os

# Set up the locations of the data folders
DATA_FOLDER = '_data'

TEXT_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'texts')
EXCEL_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'excel_workbooks')
PDF_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'pdfs')
HTML_FOLDER_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, 'htmls')
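
Because these paths are built relative to the current working directory, a quick existence check before building any pipeline can save a confusing failure later. The helper below is a sketch using only the standard library; `missing_folders` is a hypothetical name introduced here.

```python
import os


def missing_folders(paths):
    """Return the entries in paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


# Example usage with the constants defined above:
# missing_folders([TEXT_FOLDER_PATH, EXCEL_FOLDER_PATH,
#                  PDF_FOLDER_PATH, HTML_FOLDER_PATH])
```

An empty list means every folder was found.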


## Parsing a Text File

We'll add a step to the pipeline that uses the 'kodexa/text-parser'.

In [7]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=TEXT_FOLDER_PATH, file_filter='tongue_twister.txt'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
A flea and a fly got stuck in a flue.
Said the flea to the fly, "What shall we do?"
Said the fly, "Let us flee!"
Said the flea, "Let us fly!"
So they flew through a flaw in the flue.


## Parsing an Excel File

We'll add a step to the pipeline that uses the 'kodexa/excel-parser'.

In [5]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=EXCEL_FOLDER_PATH, file_filter='2019_Business_Expenses.xlsx'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/excel-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
2019 Reimbursables UTILITIES Phone ISP Electricity Natural Gas Water Sewer Amt due from Business January 60.5 59.99 4.316046428571429 6.192921428571429 2.7588785714285717 133.75784642857144 February 60.5 59.99 3.5990035714285713 5.315514285714285 3.0407928571428573 132.4453107142857 March 57.79375 59.99 3.7471107142857147 5.297639285714286 2.674610714285714 129.5031107142857 April 60.512499999999996 59.99 3.318621428571429 0.6991678571428571 2.495860714285714 127.01615 May 60.512499999999996 69.99 3.3875678571428574 1.4126357142857144 2.651117857142857 137.95382142857144 June 60.5125 69.99 5.086714285714286 1.0694357142857145 3.8941964285714286 140.55284642857143 July 60.7125 69.99 9.156596428571428 0.9213285714285714 5.1970285714285716 145.97745357142855 August 60.53125 69.99 7.94365 0.9213285714285714 4.774667857142857 144.16089642857142 September 60.51875 69.99 6.726107142857143 1.10825 4.85485 143.19795714285712 October 191.84875 69.99 3.7455785714285716 
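
The printed content carries the spreadsheet's full floating-point precision, which makes the output hard to scan. If the goal is only a readable display, the string could be tidied after the fact. This sketch uses a plain regular expression on the printed text and is not a feature of the Excel parser; `round_numbers` is a hypothetical helper.

```python
import re


def round_numbers(text, places=2):
    """Round every decimal number found in text to `places` digits.

    Integers such as years are left untouched because the pattern
    requires a decimal point.
    """
    def _round(match):
        return f"{float(match.group()):.{places}f}"

    return re.sub(r"\d+\.\d+", _round, text)
```

For example, `round_numbers(kodexa_doc.get_root().get_all_content())` would produce the same text with shorter numbers, for display purposes only.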

## Parsing a PDF File

We'll add a step to the pipeline that uses the 'kodexa/pdf-parser'.

In [7]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=PDF_FOLDER_PATH, file_filter='Kodexa_Privacy.pdf'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/pdf-parser', options={"layout_analysis_options":{"rollup":"word","space_multiplier":1},
                                      "analyze_layout":True}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
6/12/2020 Page | Canvas  PRIVACY POLICY Your privacy is important to us. It is Kodexa, Inc's policy to respect your privacy regarding any information we may collect from you across our website, http://www.kodexa.com, and other sites we own and operate. We only ask for personal information when we truly need it to provide a service to you. We collect it by fair and lawful means, with your knowledge and consent. We also let you know why we’re collecting it and how it will be used. We only retain collected information for as long as necessary to provide you with your requested service. What data we store, we’ll protect within commercially acceptable means to prevent loss and theft, as well as unauthorized access, disclosure, copying, use or modi cation. We don’t share any personally identifying information publicly or with third-parties, except when required to by law. Our website may link to external sites that are not operated by us. Please be aware that we h

## Parsing an HTML File

We'll add a step to the pipeline that uses the 'kodexa/html-parser'.

In [5]:
# Create the pipeline with the full path to the folder.
# We're also limiting the results to our specific file with the file_filter
pipeline = Pipeline(FolderConnector(path=HTML_FOLDER_PATH, file_filter='Kodexa_Privacy.html'))

# When using a FolderConnector, we must specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/html-parser', options={}, attach_source=True))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')


Processed doc's contents:
Your privacy is important to us. It is Kodexa, Inc's policy to respect your privacy regarding any information we may collect from you across our website,  , and other sites we own and operate. http://www.kodexa.com We only ask for personal information when we truly need it to provide a service to you. We collect it by fair and lawful means, with your knowledge and consent. We also let you know why we’re collecting it and how it will be used. We only retain collected information for as long as necessary to provide you with your requested service. What data we store, we’ll protect within commercially acceptable means to prevent loss and theft, as well as unauthorized access, disclosure, copying, use or modification. We don’t share any personally identifying information publicly or with third-parties, except when required to by law. Our website may link to external sites that are not operated by us. Please be aware that we have no control over the content and pr
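
The four folder-based examples above differ mainly in which parser slug goes with which file type. One way to capture that pairing is a small lookup keyed on file extension. The mapping below only restates the pairings used in this notebook; the names `PARSER_BY_EXTENSION` and `parser_slug_for` are hypothetical and are not part of the Kodexa library.

```python
import os

# Pairings taken from the examples in this notebook
PARSER_BY_EXTENSION = {
    ".txt": "kodexa/text-parser",
    ".xlsx": "kodexa/excel-parser",
    ".pdf": "kodexa/pdf-parser",
    ".html": "kodexa/html-parser",
}


def parser_slug_for(filename):
    """Return the parser slug this notebook uses for the file's extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No parser listed for extension {ext!r}") from None
```

The result could be passed as the `slug` argument of `KodexaAction`. Parser-specific options, such as the layout settings in the PDF example, would still need to be supplied separately.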

## Parsing an SEC EDGAR Filing Document (text)

This is a specialty parser designed for EDGAR filings, which are available on the SEC's website.

We'll add a step to the pipeline that uses the 'kodexa/sec-edgar-parser'.


In [12]:

# The EDGAR document we're parsing is on the SEC's website.
# We'll use the 'from_url' method to connect that source to the pipeline
pipeline = Pipeline.from_url("https://www.sec.gov/Archives/edgar/data/1606180/0001564590-20-014614.txt")

# Specify the parser that should be used for this document
pipeline.add_step(KodexaAction(slug='kodexa/sec-edgar-parser', options={}))

# Run the pipeline and get the pipeline's context.  We'll then get the processed document from the context
pipeline.run()
kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')




Processed doc's contents:
aac-nt10k_20191231.DOCX.htm SEC FILE NUMBER 001-36643 CUSIP NUMBER 000307108  UNITED STATES SECURITIES AND EXCHANGE COMMISSION WASHINGTON, DC 20549  FORM 12b-25  NOTIFICATION OF LATE FILING (Check One)  ☒ Form 10-K ☐ Form 20-F ☐ Form 11-K ☐ Form 10-Q ☐ Form 10-D ☐ Form N-SAR ☐ Form N-CSR  For Period Ended: December 31, 2019  ☐ Transition Report on Form 10-K ☐ Transition Report on Form 20-F ☐ Transition Report on Form 11-K ☐ Transition Report on Form 10-Q ☐ Transition Report on Form N-SAR  For the Transition Period Ended:  Nothing in this form shall be construed to imply that the Commission has verified any information contained herein. If the notification relates to a portion of the filing checked above, identify the Item(s) to which the notification relates:  PART I - REGISTRANT INFORMATION AAC HOLDINGS, INC. (Full Name of Registrant) N/A (Former Name if Applicable) 200 Powell Place  (Address of Principal Executive Office (Street and Number) ) Brentwood, Ten

## Parsing an RSS Feed

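We'll add a step to the pipeline that uses the 'kodexa/rss-parser'. The code in the cell below is wrapped in a string literal, so it is shown for reference only and does not execute.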

In [13]:
"""
pipeline = Pipeline.from_text('hello')

pipeline.add_step(KodexaAction(slug='kodexa/rss-parser', 
                               options={'urls':['https://www.fool.com/a/feeds/foolwatch?format=rss2&id=foolwatch&apikey=foolwatch-feed']}, 
                               attach_source=True))
pipeline.run()

kodexa_doc = pipeline.context.output_document

# Print the processed doc's contents
print(f'\nProcessed doc\'s contents:\n{kodexa_doc.get_root().get_all_content()}')
"""

## Using the Slack Event Parser

We'll add a step to the pipeline that uses the 'kodexa/slack-parser'.
