# Kodexa's Document Content Model

In this notebook, we'll cover the core features of Kodexa's content model (document), the structure of a Kodexa Document, and how that structure makes working with unstructured data of various types easier.


## What is the Kodexa Document content model?

Kodexa's content model, the "Document", provides a structure for describing the features and relationships of elements within a data source.  Because unstructured data is found in many different source types and file formats, our document content model must be flexible enough to capture and reflect the specific differences between those source types.

We know that learning to work with different source types/file formats is time-consuming, and integrating processing capabilities across different source types can be difficult.  To make working with multiple source types easier, we’ve built capabilities (Actions) that work off a single common document content model.  While different source types may produce Kodexa documents with slightly different features, they share common elements and the approach to working with them is the same across all source types.

As you learn to “think in Kodexa” and solve problems using Kodexa, you will find that the flexibility of the Document is your friend. It provides a consistent way to work across use-cases, and since the model and API are consistent across source types, you can write reusable code that can be leveraged in multiple use-cases.


### A common approach to different source types

That means that if you parse an Excel file to a Kodexa Document, the relationship between each cell and its row is maintained, as well as the relationship between the row and its parent worksheet.  It’s all taken care of by the Document’s content model.  If you parse a PDF file into a Kodexa Document, the relationships between a line of text (node of type line) and its parent/child/sibling nodes are also maintained by the Document’s content model.  Both of these data sources (Excel & PDF) can be parsed and transformed into a common format, and common actions can be applied to them.
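
To make the idea concrete, here is a minimal sketch that uses plain nested dicts as stand-ins for parsed documents. The dict layout and the `count_of_type` helper are illustrative assumptions, not Kodexa objects or Kodexa APIs; the point is only that one function can serve both trees because they share a single shape.

```python
# Toy stand-ins for parsed documents: nested dicts, NOT Kodexa objects.
excel_like = {
    "type": "workbook",
    "children": [
        {"type": "worksheet", "children": [
            {"type": "row", "children": [
                {"type": "cell", "children": []},
                {"type": "cell", "children": []},
            ]},
        ]},
    ],
}

pdf_like = {
    "type": "root",
    "children": [
        {"type": "page", "children": [
            {"type": "line", "children": [
                {"type": "word", "children": []},
                {"type": "word", "children": []},
                {"type": "word", "children": []},
            ]},
        ]},
    ],
}


def count_of_type(node, wanted_type):
    """Count this node and all of its descendants whose type equals wanted_type."""
    own = 1 if node["type"] == wanted_type else 0
    return own + sum(count_of_type(child, wanted_type) for child in node["children"])


# The same (hypothetical) helper works on both trees without any source-specific code.
excel_cells = count_of_type(excel_like, "cell")
pdf_words = count_of_type(pdf_like, "word")
print(excel_cells, pdf_words)
```

With a real Kodexa Document you would use selectors rather than a hand-written walk; the sketch only illustrates why a shared tree structure makes reusable code possible.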


## Importing Kodexa

The kodexa package is public and available for installation with a simple 'pip install kodexa'.  It's already been included in the environment.yml provided with this repository, and if you've set up your conda environment according to those instructions, it should already be available to you by selecting the 'kodexa_python_quickstart' kernel.


## Load up a document and take a look inside

So that we can skip the 'how-to' of parsing and just focus on the Document and its structures, we're going to use an example PDF that's already been parsed and saved to a *.kdxa file (our Kodexa file format). The original source PDF is located in the _data/pdfs folder ('../_data/pdfs/Kodexa_Privacy.pdf').


In [2]:

# The kodexa package is public
from kodexa import Document

# Setting up location of the example data file
import os
DATA_FOLDER = '_data'
PDF_FOLDER = 'pdfs'
PDF_KDXA_FILE = 'Kodexa_Privacy.kdxa'
PDF_KDXA_FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, PDF_KDXA_FILE)

print(f'\nThe .kdxa of our PDF file is located at: {PDF_KDXA_FULL_PATH}\n')


The .kdxa of our PDF file is located at: /home/skep/Projects/Kodexa/get-started-with-python/1_Getting_Started/../_data/pdfs/Kodexa_Privacy.kdxa
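
If you prefer `pathlib`, the same location can be built as sketched below, with a quick existence check before loading. This is an optional alternative, not something the notebook requires; the variable names `kdxa_path` and `file_found` are illustrative.

```python
from pathlib import Path

# Same folder layout as the cell above, expressed with pathlib.
kdxa_path = Path.cwd() / ".." / "_data" / "pdfs" / "Kodexa_Privacy.kdxa"

# exists() returns False (rather than raising) when the file is missing,
# so it is a cheap sanity check to run before trying to load the document.
file_found = kdxa_path.exists()
print(f"{kdxa_path} exists: {file_found}")
```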



In [3]:

# Load the *.kdxa file as a Kodexa Document
kodexa_doc_pdf = Document.from_kdxa(PDF_KDXA_FULL_PATH)


## Let's dive into the Document structure

Kodexa Documents are represented in a generalized document structure consisting of a collection of metadata and a set of content nodes.  When source content has been parsed and structured in a Kodexa Document, it's broken up into a tree of nodes with a "root" content node at the top and zero or more child content nodes branching off beneath it.  Each of these nodes can capture a portion of the original content (text), and may also have one or more "features". Features allow flexibility in the way that we capture metadata about content, and allow us to add new information to a node that we may want to use in later processing.

Getting started, there are two properties on the Document that you'll want to be familiar with:

* **metadata**:  Metadata about/for this document
* **content_node**: The top-most ContentNode in the document, also called the "root_node".  It will have zero-to-many child content nodes. 

## Digging into the ContentNode

The ContentNode is the heart of the content model.  A content node has several core properties: type, parent, features, content, content_parts, and children.

* **type:** The specific type of this content node.
* **parent:** Each node is aware of its parent. Only a root content node on a document would not have a parent.
* **features:** Features are a way to store/add additional information about a content node.  The features property on a content node is a collection of features.
* **content:** The text representation of the content.
* **content_parts:** An array version of the content, this is used to break the content and intersperse it with the children, allowing us to understand where the child nodes fit into the content.  It is not always present, and is only present if the structure allows for child nodes to be embedded into the content at specific locations.
* **children:** The child content nodes that roll up to this content node.

While at a generic level everything can be thought of as a content node, we leverage the 'type' property on the content node to provide meaning to the hierarchy. If we were working with an HTML file, we would expect to see content nodes with types such as 'div', 'p', 'span', etc.  If we were parsing a PDF file, we may see type values such as 'page' or 'line'.

For this example, we'll be focusing on the **type**, **features**, and **content** of our content nodes.
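
As a rough mental model, the properties listed above can be pictured with a small stand-in class. `ToyNode` and `add_child` are hypothetical names invented for this sketch; this is not Kodexa's ContentNode implementation, features are simplified to (type, name) pairs, and content_parts is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyNode:
    """Hypothetical stand-in mirroring the properties described above."""
    type: str
    content: Optional[str] = None
    # Simplification: each feature is just a (feature type, feature name) pair
    features: List[Tuple[str, str]] = field(default_factory=list)
    children: List["ToyNode"] = field(default_factory=list)
    # Excluded from repr/compare so parent <-> child links cannot recurse forever
    parent: Optional["ToyNode"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "ToyNode") -> "ToyNode":
        """Attach child to this node and record this node as its parent."""
        child.parent = self
        self.children.append(child)
        return child


root = ToyNode(type="root")
page = root.add_child(ToyNode(type="page"))
line = page.add_child(ToyNode(type="line", features=[("spatial", "bbox")]))
word = line.add_child(ToyNode(type="word", content="PRIVACY"))

print(root.parent)       # only the root has no parent
print(word.parent.type)  # every other node knows its parent
```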

More info on Kodexa's Content Model and Document structure can be found here:  [Kodexa Developer Guide](https://developer.kodexa.com/developers/what-is-kodexa/document-structure)


## Let's check out the content model


In [4]:

print(f'\nmetadata :: {kodexa_doc_pdf.metadata}')
print(f'\nroot_node :: {kodexa_doc_pdf.get_root()}\n')



metadata :: {'connector_options': {'id': '7684716fb8344b539aee792f5b1f4d72'}, 'connector': 'cloud-content'}

root_node :: ContentNode [type:root] (0 features, 1 children) [None]



## The document's metadata information has been set and there's a root ContentNode present

We should see these properties on all Kodexa Documents, regardless of what source type they originate from.  The actual data will vary between document and document source types, but the general structure will be the same.

The metadata on this document indicates that it was parsed via a cloud-content connector.  We discuss connectors and joining them to pipelines in notebook 2, Connecting_Data_to_Pipelines.


## The original source of this document was PDF.  Let's see what kind of ContentNode types are available on it

In [5]:

# Selecting all nodes on the kodexa_doc_pdf
pdf_doc_nodes = kodexa_doc_pdf.select('//*')

pdf_node_set = set()
for t in pdf_doc_nodes:
    pdf_node_set.add(t.type)
    
print(f'\nOur sample PDF document has the following node types: {pdf_node_set}')



Our sample PDF document has the following node types: {'word', 'image', 'root', 'content-area', 'line', 'page'}
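
A set tells us which node types exist; if you also want to know how many nodes there are of each type, `collections.Counter` is a small step further. The sketch below uses `SimpleNamespace` objects as stand-ins so it runs without a document; `count_node_types` is an illustrative helper, and the only thing it assumes about a node is the `type` attribute used in the cell above.

```python
from collections import Counter
from types import SimpleNamespace


def count_node_types(nodes):
    """Return a Counter mapping each node's .type to the number of occurrences."""
    return Counter(node.type for node in nodes)


# Stand-in nodes; with a real document you would pass the list returned by select('//*')
sample_nodes = [
    SimpleNamespace(type=t)
    for t in ("root", "page", "line", "word", "word", "line", "word")
]

type_counts = count_node_types(sample_nodes)
print(type_counts.most_common())
```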


## Using selectors to choose nodes on a document

Selectors allow us to identify specific nodes on a document so we can inspect their contents or perform operations on them.  You'll find that many of Kodexa's Actions, available for use in pipelines, include a 'selector' parameter so you can narrow down the specific nodes the action should use.  The syntax for selectors is *similar* to XPath, and is designed so you can build and combine them to varying degrees of complexity.

Additional documentation on selectors, including examples of usage, can be found in our documentation: [Developer Documentation - Selectors](https://developer.kodexa.com/developers/documentation/selectors)

In [6]:

# How many pages are in this PDF?

# We can write the selector as '//*[typeRegex("page")]' or as '//page'

num_pdf_pages = len(kodexa_doc_pdf.select('//*[typeRegex("page")]'))
print(f'\nThere are {num_pdf_pages} pages in this PDF\n')



There are 1 pages in this PDF



In [20]:

# Let's get the lines from the first (and only) page and print out the first 5
# The select operation returns a list of ContentNodes, and it is available on every ContentNode

# We can write the selector as '//*[typeRegex("line")]' or as '//line'
page_one_lines = kodexa_doc_pdf.select('//*[typeRegex("page")]')[0].select('//line')

for l in page_one_lines[:5]:
    print(f'Node uuid: {l.uuid}')  #printing the uuid for the node
    print(f'\tContent is : {l.content}')  #printing the content for THIS node
    for f in l.get_features():
        print(f'\tFeature type: {f.feature_type} - feature name: {f.name}')  #printing the feature
    
    print(f'The node\'s content AND the content for all its children is:  {l.get_all_content()}\n')  #printing the content for this node AND all children

Node uuid: 2ea988bf-277c-4846-8197-ed5614a387c4
	Content is : None
	Feature type: spatial - feature name: statistics
	Feature type: spatial - feature name: bbox
The node's content AND the content for all its children is:  6/12/2020 Page | Canvas

Node uuid: 10c46e64-51c3-4d47-ae5f-58dcfb50a6ce
	Content is : None
	Feature type: spatial - feature name: statistics
	Feature type: spatial - feature name: bbox
The node's content AND the content for all its children is:  

Node uuid: 91b280d7-b42e-49d3-8c8e-b118ceb23358
	Content is : None
	Feature type: spatial - feature name: statistics
	Feature type: spatial - feature name: bbox
The node's content AND the content for all its children is:  PRIVACY POLICY

Node uuid: 664a21e3-bbac-4371-a729-f6e518c0453a
	Content is : None
	Feature type: spatial - feature name: statistics
	Feature type: spatial - feature name: bbox
The node's content AND the content for all its children is:  Your privacy is important to us. It is Kodexa, Inc's policy to respe

## ContentNodes don't necessarily have a content value

In the example above, we can see that none of the nodes of type 'line' have a content value - the child nodes contain the content.  That's because the 'line' nodes act as the parent for each of the 'word' nodes - the content model reflects the relationship of words to lines.
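
The relationship between a node's own content and the content of its children can be pictured with a toy helper. This is only a sketch of the idea, not how Kodexa implements `get_all_content()`; the `all_content` function, the `SimpleNamespace` stand-ins, and the single-space separator are all assumptions made for illustration.

```python
from types import SimpleNamespace


def all_content(node, separator=" "):
    """Join a node's own content with the content of all of its descendants."""
    parts = [node.content] if node.content else []
    parts.extend(all_content(child, separator) for child in node.children)
    # Drop empty strings produced by subtrees that carry no text
    return separator.join(part for part in parts if part)


# A 'line' stand-in with no content of its own, and two 'word' children that hold the text
words = [SimpleNamespace(content=text, children=[]) for text in ("PRIVACY", "POLICY")]
line = SimpleNamespace(content=None, children=words)

print(line.content)
print(all_content(line))
```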

Each of these line nodes also has features of type 'spatial' with values for 'bbox' and 'statistics'.  These features are added to the document during the PDF parsing process when we perform spatial analysis.  They are the values we use to know where each word/line is located on a page.

## Do a bit more inspection on the nodes and their children

We can look a little deeper into one of the lines and see what the children look like:


In [23]:

# Let's look at the child nodes of the first line
for c in page_one_lines[0].children:
    print(f'\nNode uuid: {c.uuid} - type: {c.type}')
    print(f'\tContent: {c.content}')
    for f in c.get_features():
        print(f'\tFeature type: {f.feature_type} - feature name: {f.name}')  #printing the feature
    


Node uuid: e7c2a09c-c4d5-459b-bea8-1dc493f2990b - type: word
	Content: 6/12/2020
	Feature type: text - feature name: font
	Feature type: spatial - feature name: size
	Feature type: spatial - feature name: upright
	Feature type: spatial - feature name: bbox

Node uuid: 60d6ef99-b9d1-4d46-94fd-7eea18ddee73 - type: word
	Content: Page
	Feature type: text - feature name: font
	Feature type: spatial - feature name: size
	Feature type: spatial - feature name: upright
	Feature type: spatial - feature name: bbox

Node uuid: f3504741-d3d9-4bf0-888c-5c8ab958226c - type: word
	Content: |
	Feature type: text - feature name: font
	Feature type: spatial - feature name: size
	Feature type: spatial - feature name: upright
	Feature type: spatial - feature name: bbox

Node uuid: 0aafa0d2-cdf3-4086-9a2e-03f0bdc64c1f - type: word
	Content: Canvas
	Feature type: text - feature name: font
	Feature type: spatial - feature name: size
	Feature type: spatial - feature name: upright
	Feature type: spatial - fea

## Child nodes are also ContentNodes, and they can have the same types of values

We can see from the results above that these child nodes have the same types of elements as their parent nodes.  This is because all nodes are ContentNodes.
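
Because every node exposes its features in the same way, a small helper can summarize them for any node. The sketch below groups feature names by feature type; `feature_names_by_type` is an illustrative helper, and the stand-in objects simply copy the `feature_type` / `name` pairs printed for the 'word' nodes above.

```python
from collections import defaultdict
from types import SimpleNamespace


def feature_names_by_type(features):
    """Group feature names under their feature type, preserving the input order."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature.feature_type].append(feature.name)
    return dict(grouped)


# Stand-ins for the features printed for a 'word' node in the output above
word_features = [
    SimpleNamespace(feature_type="text", name="font"),
    SimpleNamespace(feature_type="spatial", name="size"),
    SimpleNamespace(feature_type="spatial", name="upright"),
    SimpleNamespace(feature_type="spatial", name="bbox"),
]

grouped = feature_names_by_type(word_features)
print(grouped)
```

With a real node, the list returned by `get_features()` could be passed in place of `word_features`.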

We'll now go through the same exercise using a different document source type so you can see how the same syntax/logic can be applied to multiple document source types.

## Create a Kodexa document from an existing parsed Excel file

Let's load an Excel file into a Kodexa document and navigate its nodes in the same manner as above.  This Excel file has already been parsed and saved as a .kdxa file, and we'll use that so we don't have to go through the details of parsing.  The parsed file is located in the _data/excel_workbooks folder ('../_data/excel_workbooks/2019_Business_Expenses.kdxa').


In [24]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
EXCEL_FOLDER = 'excel_workbooks'
EXCEL_FILE = '2019_Business_Expenses.kdxa'

EXCEL_FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, EXCEL_FOLDER, EXCEL_FILE)

print(f'\nThis is where the Excel document is located: {EXCEL_FULL_PATH}\n')



This is where the Excel document is located: /home/skep/Projects/Kodexa/get-started-with-python/1_Getting_Started/../_data/excel_workbooks/2019_Business_Expenses.kdxa



In [25]:

# Load the *.kdxa file as a Kodexa Document
kodexa_doc_excel = Document.from_kdxa(EXCEL_FULL_PATH)

print(f'\nmetadata :: {kodexa_doc_excel.metadata}')
print(f'\nroot_node :: {kodexa_doc_excel.get_root()}\n')



metadata :: {'connector_options': {'id': 'f73a9e04965446c9a69563bcd467cd19'}, 'connector': 'cloud-content'}

root_node :: ContentNode [type:workbook] (0 features, 7 children) [None]



## Let's see what kind of ContentNode types are available on this Excel-based Kodexa document

In [28]:

# Selecting all nodes on the kodexa_doc_excel
excel_doc_nodes = kodexa_doc_excel.select('//*')

excel_node_set = set()
for e in excel_doc_nodes:
    excel_node_set.add(e.type)
    
print(f'\nOur sample Excel document has the following node types: {excel_node_set}')



Our sample Excel document has the following node types: {'worksheet', 'workbook', 'row', 'cell'}


## Ah!  So the node types are different between Excel and PDF sources

This makes sense, right?  PDFs are made of pages, paragraphs, lines, words, images, etc., while Excel files are workbooks that have worksheets, rows, and cells.  
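
A quick way to confirm the difference is to compare the two sets of node types directly. The sets below are copied from the outputs printed earlier in this notebook; with live documents you could use `pdf_node_set` and `excel_node_set` instead.

```python
# Node types copied from the two outputs above
pdf_types = {"word", "image", "root", "content-area", "line", "page"}
excel_types = {"worksheet", "workbook", "row", "cell"}

shared_types = pdf_types & excel_types   # types present in both documents
only_pdf = pdf_types - excel_types       # types that appear only in the PDF document

print(f"Shared node types: {sorted(shared_types)}")
print(f"PDF-only node types: {sorted(only_pdf)}")
```

The two documents share no node types at all, yet the same selector syntax and node properties apply to both.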

## Use the same syntax to navigate documents from different source types

Let's see how we can use selectors to choose the nodes in a Kodexa document that was sourced from Excel.

In [29]:

# Let's get the rows from the last worksheet returned by the selector (index -1) and print out the first 5 rows

# We can write the selector as '//*[typeRegex("row")]' or as '//row'

worksheet_one_rows = kodexa_doc_excel.select('//*[typeRegex("worksheet")]')[-1].select('//row')

for l in worksheet_one_rows[:5]:
    print(f'Node uuid: {l.uuid}')  #printing the uuid for the node
    print(f'\tContent is : {l.content}')  #printing the content for THIS node
    for f in l.get_features():
        print(f'\tFeature type: {f.feature_type} - feature name: {f.name}')  #printing the feature
        
    print(f'The node\'s content AND the content for all its children is:  {l.get_all_content()}\n')  #printing the content for this node AND all children


Node uuid: a323b530-cf82-44e0-8b41-a30db89757c3
	Content is : None
The node's content AND the content for all its children is:  Date Miles Reimbursement Amount Starting Point Ending Point Round Trip? Client Comments

Node uuid: dd183674-6f4d-4b0b-86d7-17d38f4a6fbf
	Content is : None
The node's content AND the content for all its children is:  2019-01-07 00:00:00 0.58 80.8 46.864 Home 1 Client's office address Yes Client 1 Client in-office day

Node uuid: 8bec740f-d48f-4991-a0ba-7728177c07f4
	Content is : None
The node's content AND the content for all its children is:  2019-01-14 00:00:00 0.58 80.8 46.864 Home 1 Client's office address Yes Client 1 Client in-office day

Node uuid: b2c57491-9968-45be-a427-b4dead5c9c5e
	Content is : None
The node's content AND the content for all its children is:  2019-01-21 00:00:00 0.58 80.8 46.864 Home 1 Client's office address Yes Client 1 Client in-office day

Node uuid: 8bf3fedd-6d08-4a82-97ec-d0c8895e4082
	Content is : None
The node's content AND 

## ContentNodes don't necessarily have features

In the document sourced from an Excel file, the row nodes we printed have no features.  This is OK!  Features are used to add additional details and information to ContentNodes - in this case none have been set.


## Creating a Kodexa Document from plain text

Here's a method to create a Kodexa Document from simple text:


In [30]:

#Simple document from text
kodexa_doc_text = Document.from_text('Hello!  Today is a warm, sunny day!')


print(f'\nmetadata :: {kodexa_doc_text.metadata}')
print(f'\nroot_node :: {kodexa_doc_text.get_root()}\n')



metadata :: {}

root_node :: ContentNode [type:text] (0 features, 0 children) [Hello!  Today is a warm, sunny day!]



## Simple source, but same structure

This document still has the main structures found in documents created from richer sources: metadata (empty here, as shown above) and a root ContentNode.

Let's look at its ContentNode types.

In [31]:

# Selecting all nodes on the kodexa_doc_text
text_doc_nodes = kodexa_doc_text.select('//*')

text_node_set = set()
for t in text_doc_nodes:
    text_node_set.add(t.type)
    
print(f'\nOur sample text document has the following node types: {text_node_set}')



Our sample text document has the following node types: {'text'}


## Connecting the Kodexa Document object to other types of sources

In these examples, we've connected our Document object to pre-parsed .kdxa documents.  We did this so we could skip past the explanations of how to set up pipelines with parsers to process files (cheating!).

There are additional methods available on the Document object that will allow you to connect to different file types, but most require you to parse the documents to make the contents available.  These methods are:


* from_text - no parsing needed.  A root ContentNode of type 'text' is created and its content is set to the text provided
* from_file - requires an appropriate parser
* from_dict - the dict should represent the Kodexa Document structure
* from_json - the json should represent the Kodexa Document structure
* from_msgpack - the msgpack should represent the Kodexa Document structure
* from_url - requires an appropriate parser


### Parsing Examples

Complete examples of using pipelines and wiring in parsers can be found in notebook 3, Parsing_Documents.
