# Working with text files in Kodexa


All of our processing will occur in Kodexa's cloud environment.  In order to access the platform, you'll need to register for an account and generate an access token.  If you haven't done that already, follow the steps in our [Getting Started](https://developer.kodexa.com/kodexa-cloud/accessing-kodexa-cloud) guide.

Import the dependencies:
1. Since all of our actions occur in the cloud, we'll need to import the KodexaPlatform and RemoteAction classes.
2. Import Kodexa's Pipeline class so we can build pipelines and process our document.
3. The text file we'll be processing is located in a file folder, so we'll import the FolderConnector in order to access it.
4. All files that have been processed/parsed in Kodexa (Excel, PDF, etc.) become Kodexa Documents, so we'll import that class as well.

We also set CLOUD_URL to the platform environment where we want to run our processing.


In [1]:
# The kodexa package is public
from kodexa import Document, Pipeline, FolderConnector, KodexaPlatform, RemoteAction

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted for the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token yet, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note:  The text you enter in the prompt field will be masked.  When you've finished entering the access token, press Enter to complete the cell.  **You will then need to manually move focus to the next cell.**

In [3]:
import getpass

# Only request a login if we aren't logged in

if KodexaPlatform.get_access_token() is None:
    
    ACCESS_TOKEN = getpass.getpass("Enter access token:")

    KodexaPlatform.set_url(CLOUD_URL)
    KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································
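If you run this notebook unattended, an interactive prompt may not be practical. The following is a minimal sketch of one way to look for the token in an environment variable first and fall back to a prompt. The helper `resolve_access_token` and the variable name `KODEXA_ACCESS_TOKEN` are assumptions made for illustration; neither is part of the Kodexa package.

```python
from typing import Callable, Mapping


def resolve_access_token(
    env: Mapping[str, str],
    prompt: Callable[[], str],
    var_name: str = "KODEXA_ACCESS_TOKEN",  # hypothetical name, not defined by Kodexa
) -> str:
    """Return the token from `env` if present, otherwise ask via `prompt`."""
    token = env.get(var_name, "").strip()
    if token:
        return token

    # Nothing in the environment, so fall back to the interactive prompt
    token = prompt().strip()
    if not token:
        raise ValueError("An access token is required.")
    return token
```

In the notebook this could be called as `resolve_access_token(os.environ, lambda: getpass.getpass("Enter access token:"))`, with the result passed to `KodexaPlatform.set_access_token` as in the cell above.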


In [15]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
TEXT_FOLDER = 'texts'
DATA_FILE = 'news_story.txt'

FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER, DATA_FILE)

print(f'\nThis is where the text document is located: {FULL_PATH}\n')


This is where the text document is located: /home/skep/Projects/Kodexa/get-started-with-python/2_Examples_by_Source_Type/../_data/texts/news_story.txt
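The printed path still contains the `..` segment, and nothing has yet confirmed that the file exists. As a hedged sketch, a small helper (the name `resolve_data_file` is hypothetical) could normalise the path and fail early with a clear message before anything is sent to the platform:

```python
from pathlib import Path


def resolve_data_file(base_dir, *parts) -> Path:
    """Join the path parts, collapse any '..' segments, and check the file exists."""
    path = Path(base_dir, *parts).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Expected a text file at {path}")
    return path
```

For this notebook the call would look like `resolve_data_file(os.getcwd(), '..', DATA_FOLDER, TEXT_FOLDER, DATA_FILE)`, and the returned path can be converted with `str()` where a plain string is needed.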



In [16]:

# Create a pipeline in order to access the text file
pipeline = Pipeline.from_file(FULL_PATH)

# Our first step in the Pipeline is to parse the text file.
# The parser will produce a Kodexa document, which we'll retrieve after the pipeline runs
pipeline.add_step(RemoteAction(slug='kodexa/text-parser', options={}, attach_source=True))

# Do it!
pipeline.run()

# Let's get that freshly parsed document!
kodexa_doc = pipeline.context.output_document

In [17]:
kodexa_doc.get_root().get_all_content()[:1000]

'The UK will remain in the "containment" stage of its response to the coronavirus following an emergency Cobra meeting.\n\nIt comes as the country\'s chief medical adviser confirmed a fourth person had died from the virus in the UK.\n\nThere were 319 confirmed cases in the UK as of 09:00 GMT on Monday, a rise of 46 since the same time on Sunday.\n\nHowever, measures to delay the virus\' spread with "social distancing" will not be introduced yet, ministers said.\n\nNumber 10 said it accepted that the virus "is going to spread in a significant way", however.\n\nThe latest person to die from the virus was in their 70s and had underlying health conditions, according to the UK government\'s chief medical adviser Prof Chris Whitty.\n\nHe said the patient, who was being treated at a hospital in Wolverhampton, appeared to have contracted the virus in the UK and that officials were tracing people they had been in contact with.\n\nFollowing the Cobra meeting, Downing Street said the prime minist
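The preview above shows that paragraphs in the parsed content are separated by blank lines (`\n\n`). As a rough sketch, plain string handling is enough to break such content into paragraphs for inspection. The helper `split_paragraphs` is illustrative only, and the sample string is copied from the start of the output above rather than fetched from the document.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and drop empty fragments."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


# Sample copied from the beginning of the output shown above
sample = (
    'The UK will remain in the "containment" stage of its response to the '
    "coronavirus following an emergency Cobra meeting.\n\n"
    "It comes as the country's chief medical adviser confirmed a fourth "
    "person had died from the virus in the UK."
)

for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, paragraph)
```

In the notebook, the same function could be applied to the string returned by `kodexa_doc.get_root().get_all_content()`.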

## Tag NER (apply named entity recognition tagging)

In [18]:

pipeline = Pipeline(kodexa_doc)

# Alternative tagger, left commented out:
# pipeline.add_step(RemoteAction(slug='kodexa/ner-tagger', options={}))
pipeline.add_step(RemoteAction(slug='aws/ner-tagger', options={}))

# Do it!
pipeline.run()

# Let's get the freshly tagged document!
kodexa_doc = pipeline.context.output_document

In [19]:
kodexa_doc.get_root().get_all_tags()

['PERSON', 'DATE', 'EVENT', 'OTHER', 'QUANTITY', 'ORGANIZATION', 'LOCATION']

In [20]:
kodexa_doc.get_root().get_features()

[<kodexa.model.model.ContentFeature at 0x7f0ebfe2e510>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2ec50>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2e8d0>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2ec10>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2edd0>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2ef50>,
 <kodexa.model.model.ContentFeature at 0x7f0ebfe2ead0>]
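
The default representation of each feature only shows its memory address, which says little about the tags themselves. The sketch below summarises a list of feature-like objects as tuples. It assumes each feature exposes `feature_type`, `name`, and `value` attributes, which this notebook does not confirm, so the helper reads them defensively and is demonstrated on stand-in objects rather than real `ContentFeature` instances.

```python
from types import SimpleNamespace


def describe_features(features):
    """Return (feature_type, name, value) for each feature, using None for missing attributes."""
    return [
        (
            getattr(feature, "feature_type", None),  # assumed attribute name
            getattr(feature, "name", None),          # assumed attribute name
            getattr(feature, "value", None),         # assumed attribute name
        )
        for feature in features
    ]


# Stand-in objects for illustration only; these values are not real tagger output
stand_ins = [
    SimpleNamespace(feature_type="tag", name="PERSON", value="example value"),
    SimpleNamespace(feature_type="tag", name="DATE", value="example value"),
]

for row in describe_features(stand_ins):
    print(row)
```

If the assumption holds, passing `kodexa_doc.get_root().get_features()` to the same function would list one row per tag.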