# Reading, Saving and Loading Penn Treebank files

Given a raw Penn Treebank file with labeled sentences (**instances**), the `postag` library can read it and parse it into the library's own notation as Python objects. Here is what a Penn Treebank instance looks like in a .txt file:

```
((NP (DT an) (NNP Oct.) (CD 19) (NN review)))
```

The outermost parentheses indicate that this line represents a single instance, whose root node has the class `NP`.
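
To make the notation concrete, here is a minimal sketch of how such a line could be turned into nested Python lists. The `parse_bracketed` function is a hypothetical helper written for illustration only; it is not part of `postag` and is not meant to describe how the library parses files internally.

```python
def parse_bracketed(text):
    """Parse one Penn Treebank bracketed line into nested Python lists."""
    # Pad parentheses with spaces so that a plain split() tokenizes the line.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    position = 0

    def read_node():
        nonlocal position
        position += 1  # skip the opening "("
        children = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                children.append(read_node())  # nested node
            else:
                children.append(tokens[position])  # label or word
                position += 1
        position += 1  # skip the closing ")"
        return children

    return read_node()


tree = parse_bracketed("((NP (DT an) (NNP Oct.) (CD 19) (NN review)))")
print(tree)
# [['NP', ['DT', 'an'], ['NNP', 'Oct.'], ['CD', '19'], ['NN', 'review']]]
```

The outer list with a single element mirrors the outermost parentheses, and `tree[0][0]` is the `NP` label of the root node.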

## Reading a raw labeled file

To read a labeled file with `postag`, import the `Treebank` class, then call its `read_file` method with the file path as the argument.

In [2]:
import time
from postag import Treebank

start = time.time()
ptb = Treebank.read_file('data/traindata')
end = time.time()

print("It took " + str(end-start) + " seconds to read it!")

It took 11.052475690841675 seconds to read it!


In [3]:
len(ptb.instances) # Number of instances in the PT bank set

39831

You can access the tree of any object in the instance set and walk through it by indexing. For example, `ptb[0]` returns the first instance loaded into `ptb`. Printing an instance produces a human-readable structure that makes the object easier to interpret.

In [4]:
print(ptb[1]) # the second instance

((S (NP-SBJ (NNP Ms.)
    (NNP Haag))
  (VP (VBZ plays)
    (NP (NNP Elianti)))
  (. .)))


In [5]:
print(ptb[1][0][1][0]) # accessing an element in the tree

(VBZ plays)
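
The chained indexing above can be read as a path through the tree: root node, then its second child, then that child's first child. The sketch below reproduces the idea with a hypothetical `Node` class whose `__getitem__` returns the child at a given position. The class, its attributes, and the single-line string format are assumptions made for illustration; they are not `postag`'s actual implementation.

```python
class Node:
    """Hypothetical tree node: a label plus either child nodes or a word."""

    def __init__(self, label, children=(), value=None):
        self.label = label
        self.children = list(children)
        self.value = value  # set only on word-level nodes

    def __getitem__(self, index):
        # Indexing a node returns the child at that position.
        return self.children[index]

    def __str__(self):
        if self.value is not None:
            return "(%s %s)" % (self.label, self.value)
        inner = " ".join(str(child) for child in self.children)
        return "(%s %s)" % (self.label, inner)


sentence = Node("S", [
    Node("NP-SBJ", [Node("NNP", value="Ms."), Node("NNP", value="Haag")]),
    Node("VP", [Node("VBZ", value="plays"),
                Node("NP", [Node("NNP", value="Elianti")])]),
    Node(".", value="."),
])
instance = [sentence]  # the outermost parentheses wrap a single root node

print(instance[0][1][0])  # S -> VP -> VBZ, prints "(VBZ plays)"
```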


You can list the nodes of an instance, or fetch all nodes from the treebank, using `.get_nodes()`. As the output below shows, for an instance it yields the word-level nodes in sentence order.

In [15]:
# List nodes of the second instance
for node in ptb[1].get_nodes():
    print(node)

(NNP Ms.)
(NNP Haag)
(VBZ plays)
(NNP Elianti)
(. .)


In [23]:
# Print the sentence as a row:
print(" ".join([node.value for node in ptb[1].get_nodes()]))

# A more readable alternative would be:
for node in ptb[1].get_nodes():
    print("%s " % node.value, end='')

Ms. Haag plays Elianti .
Ms. Haag plays Elianti .
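
Collecting word-level nodes like this amounts to a depth-first, left-to-right traversal of the tree. Here is a minimal sketch of that traversal over a plain nested-list tree. The `iter_leaves` function and the nested-list layout are assumptions chosen for illustration and are independent of how `.get_nodes()` is implemented in `postag`.

```python
def iter_leaves(tree):
    """Yield (tag, word) pairs from a nested-list tree, left to right."""
    # A word-level node is a two-item list of strings: [tag, word].
    if len(tree) == 2 and all(isinstance(item, str) for item in tree):
        yield tuple(tree)
        return
    for child in tree:
        if isinstance(child, list):  # skip labels, descend into sub-trees
            yield from iter_leaves(child)


tree = [["S",
         ["NP-SBJ", ["NNP", "Ms."], ["NNP", "Haag"]],
         ["VP", ["VBZ", "plays"], ["NP", ["NNP", "Elianti"]]],
         [".", "."]]]

leaves = list(iter_leaves(tree))
sentence = " ".join(word for _tag, word in leaves)
print(sentence)  # prints "Ms. Haag plays Elianti ."
```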

## Saving and loading

With the treebank structure loaded in memory, call the `.save` method with the storage path as the argument.

In [10]:
ptb.save('data/dumps/my_ptb_struct')

'data/dumps/my_ptb_struct'

Loading a saved treebank structure is just as simple. The `.load` method loads the structure into an existing `Treebank` object. Note that any instances the object already holds will be deleted.

In [11]:
start = time.time()

my_new_ptb = Treebank()
my_new_ptb.load('data/dumps/my_ptb_struct')

end = time.time()

print("It took " + str(end-start) + " seconds to load!")

It took 5.60997462272644 seconds to load!


In [12]:
print(my_new_ptb[1])

((S (NP-SBJ (NNP Ms.)
    (NNP Haag))
  (VP (VBZ plays)
    (NP (NNP Elianti)))
  (. .)))
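
The replace-on-load behavior described in this section can be illustrated with a small round trip. The sketch below uses a hypothetical `TinyBank` class and Python's standard `pickle` module; both are assumptions for illustration, since the on-disk format that `postag` uses is not described here.

```python
import os
import pickle
import tempfile


class TinyBank:
    """Hypothetical stand-in for a treebank container (not postag's class)."""

    def __init__(self):
        self.instances = []

    def save(self, path):
        # Serialize the instance list and return the path that was written.
        with open(path, "wb") as handle:
            pickle.dump(self.instances, handle)
        return path

    def load(self, path):
        # Replace (not extend) whatever this object currently holds.
        with open(path, "rb") as handle:
            self.instances = pickle.load(handle)


bank = TinyBank()
bank.instances = [["NP", ["DT", "an"], ["NN", "review"]]]

with tempfile.TemporaryDirectory() as folder:
    dump_path = bank.save(os.path.join(folder, "my_ptb_struct"))

    other = TinyBank()
    other.instances = [["X", ["Y", "stale"]]]  # discarded by load()
    other.load(dump_path)

print(other.instances == bank.instances)  # prints True
```

Assigning to `self.instances` in `load`, rather than extending the list, is what makes previously held instances disappear.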
