# Working with PDF files in Kodexa

PDF (Portable Document Format) files are ubiquitous in the business world and are used for all kinds of communications, contracts, and agreements.  Kodexa provides the means to parse PDF files into our document content model, find and tag values within the document, and use the document with a variety of Kodexa actions.  This notebook demonstrates how to access data from a PDF file.  We will be using concepts introduced in the "Getting Started" notebooks, so you may want to refer to them as you proceed through this example.

By the end of this example, you'll be able to parse a PDF file, examine the existing data, tag and extract tables in the PDF file, and export that data to a datastore that you could use for further processing.

All of our processing will occur in Kodexa's cloud environment.  In order to access the platform, you'll need to register for an account and generate an access token.  If you haven't done that already, follow the steps in our [Getting Started](https://developer.kodexa.com/kodexa-cloud/accessing-kodexa-cloud) guide.

Import the dependencies:
1. Since all of our actions occur in the cloud, we'll need to import KodexaPlatform and RemoteAction.
2. Import Kodexa's Pipeline so we can build pipelines and process our document.
3. The PDF file we'll be processing is located in a file folder, so we'll import the FolderConnector in order to access it.
4. All files that have been parsed in Kodexa (Excel, PDF, etc.) become Kodexa Documents, so we'll import Document as well.

We're also setting the CLOUD_URL value to the platform environment where we want our processing to run.


In [3]:
# The kodexa package is public
from kodexa import Document, Pipeline, FolderConnector, KodexaPlatform, RemoteAction

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter your access token that you've created in the environment specified by the CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you've entered the access token value, press Enter to complete the action in the cell.  **You will then need to manually select the next cell to continue.**

In [4]:
import getpass

# Only request a login if we aren't logged in

if KodexaPlatform.get_access_token() is None:

    ACCESS_TOKEN = getpass.getpass("Enter access token:")

    KodexaPlatform.set_url(CLOUD_URL)
    KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
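
If you run this notebook non-interactively, a prompt is inconvenient. The sketch below shows one possible way to read the token from an environment variable first and fall back to the masked prompt. The variable name `KODEXA_ACCESS_TOKEN` and the helper `read_access_token` are hypothetical names chosen for this illustration; they are not part of the kodexa package.

```python
import getpass
import os


def read_access_token(env=os.environ, prompt=getpass.getpass):
    """Return a token from the environment, or ask for it with a masked prompt.

    KODEXA_ACCESS_TOKEN is a hypothetical variable name used for this sketch.
    """
    token = env.get('KODEXA_ACCESS_TOKEN')
    if token:
        return token
    return prompt('Enter access token:')


# Usage in the cell above (not executed here, since it may prompt):
# ACCESS_TOKEN = read_access_token()
```

The result can then be passed to `KodexaPlatform.set_access_token` exactly as in the cell above.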


In [5]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
PDF_FOLDER = 'pdfs'
DATA_FILE = 'Navient_Test.pdf'

FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, DATA_FILE)

print(f'\nThis is where the PDF document is located: {FULL_PATH}\n')



This is where the PDF document is located: /home/skep/Projects/Kodexa/get-started-with-python/2_Examples_by_Source_Type/../_data/pdfs/Navient_Test.pdf
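
The printed path still contains a `..` segment, and nothing has yet confirmed that the file is really there. As a small optional sketch, the standard library can normalise the path and check for the file before the pipeline is built. The helper name `resolve_data_file` is made up for this example.

```python
import os


def resolve_data_file(base_dir, *parts):
    """Join the parts onto base_dir and collapse any '..' segments."""
    return os.path.normpath(os.path.join(base_dir, *parts))


# Same folder layout as the cell above
full_path = resolve_data_file(os.getcwd(), '..', '_data', 'pdfs', 'Navient_Test.pdf')

if os.path.isfile(full_path):
    print(f'Found PDF: {full_path}')
else:
    print(f'No file at {full_path}; check DATA_FOLDER, PDF_FOLDER and DATA_FILE')
```

Failing early here gives a clearer message than an error raised later from inside the pipeline.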



In [6]:
# Create a pipeline in order to access the PDF file
pipeline = Pipeline.from_file(FULL_PATH)

# Our first step in the Pipeline is to parse the PDF file.  
# The parser will produce a Kodexa document, which we'll retrieve after the pipeline runs
pipeline.add_step(RemoteAction(slug='kodexa/pdf-parser', options={}, attach_source=True))

# Bam!
pipeline.run()

kodexa_doc = pipeline.context.output_document

## Let's see what was returned

Like any Kodexa document, all of the base methods and properties for the Document class are available to us.  We can access the metadata, get the content, etc.  Even though the source of this document was originally a PDF file, it still follows the same tree-like ContentNode structure as all Kodexa documents.

We can try it out here by viewing the first 1000 characters of content:


In [9]:
kodexa_doc.get_root().get_all_content()[:1000]

'Navient Private Education Loan Trust 2016-A Monthly Servicing Report Distribution Date 07/15/2020 Collection Period 06/01/2020 - 06/30/2020 Navient Credit Funding, LLC - Depositor Navient Solutions - Servicer and Administrator Deutsche Bank National Trust Company - Indenture Trustee Deutsche Bank Trust Company Americas - Trustee Navient Credit Funding - Excess Distribution Certificateholder Page 1 of 10 I. Deal Parameters A Student Loan Portfolio Characteristics 02/04/2016 05/31/2020 06/30/2020 Principal Balance $ 702,816,146.01 $ 352,280,935.24 $ 348,066,397.59 Interest to be Capitalized Balance 9,495,421.47 1,766,212.19 1,590,403.89 Pool Balance $ 712,311,567.48 $ 354,047,147.43 $ 349,656,801.48 Weighted Average Coupon (WAC) 7.54% 7.07% 7.09% Weighted Average Remaining Term 164.44 165.93 165.96 Number of Loans 62,798 34,345 33,909 Number of Borrowers 47,570 26,492 26,151 Pool Factor 0.497039728 0.490876208 Since Issued Constant Prepayment Rate 8.86% 8.80% B Debt Securities Cusip/Isi
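
Because `get_all_content()` returns an ordinary Python string, ordinary string tools work on it. The sketch below pulls the report dates out of the opening text with regular expressions. This is plain Python rather than Kodexa's tagging, and it assumes the labels and single-space layout seen in the output above; `find_report_dates` is a name invented for this example.

```python
import re


def find_report_dates(text):
    """Extract the distribution date and collection period, if present."""
    dist = re.search(r'Distribution Date (\d{2}/\d{2}/\d{4})', text)
    period = re.search(
        r'Collection Period (\d{2}/\d{2}/\d{4}) - (\d{2}/\d{2}/\d{4})', text)
    return {
        'distribution_date': dist.group(1) if dist else None,
        'collection_period': period.groups() if period else None,
    }


# Sample copied from the start of the output above
sample = ('Navient Private Education Loan Trust 2016-A Monthly Servicing Report '
          'Distribution Date 07/15/2020 Collection Period 06/01/2020 - 06/30/2020')
report_dates = find_report_dates(sample)
print(report_dates)
```

With the parsed document, the same function could be called on `kodexa_doc.get_root().get_all_content()`.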

## What makes this document special?

This is still a standard Kodexa document with all the basic document properties, but because it represents a PDF, it differs from a base Kodexa document in a few ways.

The ContentNodes in this document are all one of the following types:  'content-area', 'figure-line', 'line', 'page', 'rect', 'root', or 'word'.  When PDF documents are parsed, we apply a layer of logic to determine logical groupings that most likely represent lines of text, the content-areas that contain them (sort of like a paragraph), and tables (containing the 'rect' and 'figure-line' nodes).  We've found that it's important to know where in a document, and where on a page, text is found.  This 'spatial analysis' process helps us retain that information and represent it through node types.

We can inspect the document for all node types present with a quick bit of code:

In [10]:
# get all nodes, then create a set of unique types 
all_nodes = kodexa_doc.select('//*')

all_types = {n.node_type for n in all_nodes}

print(all_types)

{'word', 'line', 'content-area', 'figure-line', 'page', 'rect', 'root'}
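
The set shows which node types exist but not how many of each there are. A minimal sketch of counting them with `collections.Counter` follows. It relies only on the `node_type` attribute used above, and it runs on stand-in objects so that it works without a parsed document; `count_node_types` is a name made up for this example.

```python
from collections import Counter
from types import SimpleNamespace


def count_node_types(nodes):
    """Tally how many nodes of each node_type appear in the iterable."""
    return Counter(n.node_type for n in nodes)


# Stand-in nodes so the sketch runs without a parsed document
demo_nodes = [SimpleNamespace(node_type=t)
              for t in ('root', 'page', 'line', 'word', 'word')]
demo_counts = count_node_types(demo_nodes)
print(demo_counts.most_common())
```

With the parsed document, `count_node_types(kodexa_doc.select('//*'))` would give the real tallies.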


In [11]:
# Let's take a look at the first page

print(kodexa_doc.select('//page')[0].get_all_content())

Navient Private Education Loan Trust 2016-A Monthly Servicing Report Distribution Date 07/15/2020 Collection Period 06/01/2020 - 06/30/2020 Navient Credit Funding, LLC - Depositor Navient Solutions - Servicer and Administrator Deutsche Bank National Trust Company - Indenture Trustee Deutsche Bank Trust Company Americas - Trustee Navient Credit Funding - Excess Distribution Certificateholder Page 1 of 10
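
To skim every page rather than just the first, the same two calls can be wrapped in a loop. The sketch below builds a one-line preview per page. It assumes the pages come back in document order, and it uses a stand-in class so that it runs on its own; `preview_pages` and `FakePage` are hypothetical names for this illustration.

```python
def preview_pages(pages, width=80):
    """Return one short preview string per page, numbered from 1."""
    previews = []
    for number, page in enumerate(pages, start=1):
        text = page.get_all_content()
        previews.append(f'Page {number}: {text[:width]}')
    return previews


class FakePage:
    """Stand-in exposing the one method the sketch needs."""

    def __init__(self, text):
        self._text = text

    def get_all_content(self):
        return self._text


demo_previews = preview_pages(
    [FakePage('Navient Private Education Loan Trust 2016-A'),
     FakePage('I. Deal Parameters')],
    width=20)
for line in demo_previews:
    print(line)
```

With the parsed document, `preview_pages(kodexa_doc.select('//page'))` would produce a preview for each page.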
