# Kodexa's Document Content Model

In this notebook, we'll cover the core features of Kodexa's content model (document), the structure of a Kodexa document, and how the elements of a document make working with unstructured data easier.


## So what is the Kodexa document content model?

Kodexa's content model, the "Document", provides a structure for describing the features of elements within a data source and the relationships between those elements.  Because unstructured data is found in many different source types and file formats, our document content model must be flexible enough to capture and reflect the specific differences between source types.

We know that learning to work with different source types and file formats is time-consuming, and integrating processing capabilities across them can be difficult.  To make working with multiple source types easier, we’ve built capabilities (Actions) that work off this single common document content model.  While different source types may produce Kodexa documents with slightly different features, their common elements and the approach to working with them remain the same across all source types.

As you learn to “think in Kodexa” and solve problems using Kodexa, you will learn that the flexibility of the Document is your friend. It gives you a consistent way to work across use-cases, and because the model and API are consistent across source types, you can write reusable code that applies to multiple use-cases.


### What does that mean to me?

That means that if you parse an Excel file to a Kodexa Document, the relationship between each cell and its row is maintained, as well as the relationship between the row and its parent worksheet.  It’s all taken care of by the Document’s content model.  If you parse an HTML file into a Kodexa Document, the relationships between an HTML node and its parent/child/sibling tags are also maintained by the Document’s content model.  Both of these data sources (Excel & HTML) can be parsed and transformed into a common format, and common actions can be applied to them.
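
To make the idea concrete, here is a minimal sketch in plain Python. The `Node` class and `path_to_root` helper are hypothetical stand-ins written for this illustration; they are not Kodexa's classes or API. The point is only that once Excel-like and HTML-like content share one tree shape, the same function can walk either one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a content node; not Kodexa's ContentNode."""
    type: str
    content: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Node") -> "Node":
        # Record the relationship in both directions.
        child.parent = self
        self.children.append(child)
        return child


def path_to_root(node: Node) -> List[str]:
    """Follow parent links upward, collecting node types along the way."""
    types = []
    current: Optional[Node] = node
    while current is not None:
        types.append(current.type)
        current = current.parent
    return types


# Excel-like shape: worksheet -> row -> cell
worksheet = Node("worksheet")
row = worksheet.add_child(Node("row"))
cell = row.add_child(Node("cell", content="42.50"))

# HTML-like shape: body -> p -> span
body = Node("body")
paragraph = body.add_child(Node("p"))
span = paragraph.add_child(Node("span", content="hello"))

# The same helper works on both trees.
excel_path = path_to_root(cell)
html_path = path_to_root(span)
print(excel_path)
print(html_path)
```

The type names used here ('worksheet', 'row', 'cell', 'body', 'p', 'span') are illustrative; the types a real parser emits may differ.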


## Load up a document and take a look inside

So that we can skip the 'how-to' of parsing and focus on the Document and its structures, we're going to use an example PDF that has already been parsed and saved to a *.kdxa file (our Kodexa file format).  The original source PDF is located in the _data/pdfs folder ('../_data/pdfs/Kodexa_Privacy.pdf').

### Importing Kodexa

We'll need to import the Document class from kodexa.  The kodexa package is public and can be installed with a simple 'pip install kodexa'.

In [17]:

# The kodexa package is public
from kodexa import Document

# Setting up location of data file
import os
DATA_FOLDER = '_data'
PDF_FOLDER = 'pdfs'
PDF_KDXA_FILE = 'Kodexa_Privacy.kdxa'
PDF_KDXA_FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, PDF_KDXA_FILE)

print(f'\nThe .kdxa of our PDF file is located at: {PDF_KDXA_FULL_PATH}\n')


The .kdxa of our PDF file is located at: /home/skep/Projects/Kodexa/kodexa-demo-notebooks/1_Getting_Started/../_data/pdfs/Kodexa_Privacy.kdxa
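
The printed path still contains the '..' segment because `os.path.join` only concatenates its arguments. If a tidier path is preferred, `os.path.normpath` can collapse it. The sketch below uses a hypothetical directory in place of `os.getcwd()` so that it does not depend on where the notebook runs.

```python
import os

# Hypothetical working directory, standing in for os.getcwd().
notebook_dir = os.path.join(os.sep, "home", "user", "notebooks", "1_Getting_Started")

# Same join pattern as the cell above: the '..' segment is kept verbatim.
raw_path = os.path.join(notebook_dir, "..", "_data", "pdfs", "Kodexa_Privacy.kdxa")

# normpath resolves '..' lexically, without touching the file system.
clean_path = os.path.normpath(raw_path)

print(raw_path)
print(clean_path)
```

Because `normpath` works purely on the string, it can change the meaning of a path that passes through symbolic links; for this notebook's plain folder layout that is not a concern.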



In [18]:

# Load the *.kdxa file as a Kodexa Document
kodexa_doc = Document.from_kdxa(PDF_KDXA_FULL_PATH)


## Let's dive into the Document structure

Kodexa Documents are represented in a generalized document structure consisting of a collection of metadata and a set of content nodes.  When source content has been parsed and structured in a Kodexa document, it's broken up into a tree of nodes with a "root" content node at the top and child content nodes branching off beneath it.  Each of these nodes can capture a portion of the original content (text), and may also have one or more "features". Features allow flexibility in the way that we capture metadata about content, and allow us to add new information to a node that we may want to use in later processing.

Getting started, there are two properties on the Document that you'll want to be familiar with:

* **metadata**:  Metadata about/for this document
* **content_node**: The top-most ContentNode in the document, also called the "root node".  It will have zero-to-many child content nodes.
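
As a rough mental model of those two properties, the sketch below builds a document-like object from scratch. `SketchDocument` and `SketchNode` are hypothetical classes written for this illustration, and the feature is stored as a plain dictionary entry; Kodexa's own classes and feature representation are richer than this.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(eq=False)
class SketchNode:
    """Hypothetical node: a type, optional text, features, and children."""
    type: str
    content: Optional[str] = None
    features: Dict[str, Any] = field(default_factory=dict)
    children: List["SketchNode"] = field(default_factory=list)


@dataclass(eq=False)
class SketchDocument:
    """Hypothetical document: metadata plus a single root node."""
    metadata: Dict[str, Any]
    content_node: SketchNode


line = SketchNode("line", content="Privacy Policy")
page = SketchNode("page", children=[line])
doc = SketchDocument(
    metadata={"connector": "file-handle"},
    content_node=SketchNode("root", children=[page]),
)

# A later processing step can attach new information to a node as a feature.
line.features["is_heading"] = True

root = doc.content_node
print(root.type, len(root.children))
print(line.features)
```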

## Now we'll dig into the ContentNode

The ContentNode is the heart of the content model.  A content node has several core properties: type, parent, features, content, content_parts, and children.
* **type:** The specific type of this content node.
* **parent:** Each node is aware of its parent. Only a root content node on a document would not have a parent.
* **features:** Features are a way to store/add additional information about a content node.  The features property on a content node is a collection of features.
* **content:** The text representation of the content.
* **content_parts:** An array version of the content, this is used to break the content and intersperse it with the children, allowing us to understand where the child nodes fit into the content.  It is not always present, and is only present if the structure allows for child nodes to be embedded into the content at specific locations.
* **children:** The child content nodes that roll up to this content node.
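
The relationship between **content_parts** and **children** can be easier to see with a small example. The sketch below assumes that string entries are literal text and integer entries are indexes into the list of children; that layout is one reading of the description above, not a confirmed detail of Kodexa's implementation, and the `render` helper is hypothetical.

```python
from typing import List, Union


def render(content_parts: List[Union[str, int]], children: List[str]) -> str:
    """Rebuild the full text by splicing child text in at the marked positions."""
    pieces = []
    for part in content_parts:
        if isinstance(part, int):
            # Assumption: an integer marks where the child at that index belongs.
            pieces.append(children[part])
        else:
            pieces.append(part)
    return "".join(pieces)


# Child nodes are reduced to their text here to keep the sketch short.
children = ["bold", "italic"]
content_parts = ["Some ", 0, " and ", 1, " text"]

full_text = render(content_parts, children)
print(full_text)
```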

While at a generic level everything can be thought of as a content node, we leverage the 'type' property on the content node to provide meaning to the hierarchy. If we were working with an HTML file, we would expect to see content nodes with types such as 'div', 'p', 'span', etc.  If we were parsing a PDF file, we may see type values such as 'page' or 'line'.

For this example, we'll be focusing on the **type**, **features**, and **content** of our content nodes.

More info on Kodexa's Content Model and Document structure can be found here:  [Kodexa Developer Guide](https://developer.kodexa.com/developers/what-is-kodexa/document-structure)


## Let's see the content model in action

We're going to create two different Kodexa documents from two different types of sources, text and Excel.  We'll demonstrate how both are represented with the same content model.

In [29]:

print(f'\nmetadata :: {kodexa_doc.metadata}')
print(f'\nroot_node :: {kodexa_doc.get_root()}\n')



metadata :: {'connector_options': {'id': '7684716fb8344b539aee792f5b1f4d72'}, 'connector': 'cloud-content'}

root_node :: ContentNode [type:root] (0 features, 1 children) [None]



In [9]:
from kodexa import Document

text = 'A flea and a fly got stuck in a flue.\n \
Said the flea to the fly, "What shall we do?"\n \
Said the fly, "Let us flee!"\n \
Said the flea, "Let us fly!"\n \
So they flew through a flaw in the flue.'

kodexa_doc_text = Document.from_text(text)

print(kodexa_doc_text.get_root())

ContentNode [type:text] (0 features, 0 children) [A flea and a fly got stuck in a flue.
 Said the flea to the fly, "What shall we do?"
 Said the fly, "Let us flee!"
 Said the flea, "Let us fly!"
 So they flew through a flaw in the flue.]


In [10]:
text_nodes = kodexa_doc_text.select('//*')

print()

for t in text_nodes:
    print(t.type)


text
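
The selector returned a single node because a document built from a plain string holds all of its text in one node with no children. To show what a "match every node" query does on a deeper tree, here is a plain-Python sketch. The `walk` function and the dictionary-based trees are hypothetical and only illustrate the traversal idea; they are not Kodexa's selector engine.

```python
from typing import Dict, Iterator


def walk(node: Dict) -> Iterator[Dict]:
    """Yield the node itself, then every descendant, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


# A single-node tree, like the text document above.
single = {"type": "text", "children": []}

# A deeper, PDF-like tree with illustrative type names.
nested = {
    "type": "root",
    "children": [
        {
            "type": "page",
            "children": [
                {"type": "line", "children": []},
                {"type": "line", "children": []},
            ],
        }
    ],
}

single_types = [n["type"] for n in walk(single)]
nested_types = [n["type"] for n in walk(nested)]
print(single_types)
print(nested_types)
```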


In [11]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
EXCEL_FOLDER = 'excel_workbooks'
EXCEL_FILE = '2019_Business_Expenses.xlsx'

EXCEL_FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, EXCEL_FOLDER, EXCEL_FILE)

print(f'\nThis is where the Excel document is located: {EXCEL_FULL_PATH}\n')



This is where the Excel document is located: /home/skep/Projects/Kodexa/kodexa-demo-notebooks/1_Getting_Started/../_data/excel_workbooks/2019_Business_Expenses.xlsx



In [12]:
kodexa_doc_excel = Document.from_file(EXCEL_FULL_PATH)

In [13]:
kodexa_doc_excel.get_root()

In [14]:
# Setting up location of data file
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
TEXT_FILE = 'tongue_twister.txt'

TEXT_FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER, TEXT_FILE)

print(f'\nThis is where the text document is located: {TEXT_FULL_PATH}\n')



This is where the text document is located: /home/skep/Projects/Kodexa/kodexa-demo-notebooks/1_Getting_Started/../_data/texts/tongue_twister.txt



In [20]:
text_file_doc = Document.from_file(TEXT_FULL_PATH)

text_file_doc.get_root()

def print_doc_info(doc):
    # Print the document's top-level properties, one per line
    print(f'\nmetadata:: {doc.metadata}\nroot_node:: {doc.get_root()}\nmixins:: {doc.get_mixins()}\nuuid:: {doc.uuid}\nsource::{doc.source}')


print_doc_info(text_file_doc)



metadata:: {'connector': 'file-handle', 'connector_options': {'file': '/home/skep/Projects/Kodexa/kodexa-demo-notebooks/1_Getting_Started/../_data/texts/tongue_twister.txt'}}
root_node:: None
mixins:: ['core']
uuid:: faae1970-ba4f-5436-b4e4-88dfc001d971
source::<kodexa.model.model.SourceMetadata object at 0x7f2742e07c90>


In [None]:
from_json


In [None]:
from_dict


In [None]:
from_url