# Reading, Saving and Loading PTbank files

Given a raw Penn Treebank file of labeled sentences (**instances**), the `postag` library can read it and parse it into the library's own notation as Python objects. Here is what a single Penn Treebank instance looks like in a .txt file:

```
((NP (DT an) (NNP Oct.) (CD 19) (NN review)))
```

The outermost parentheses indicate that this line represents a single instance, which begins with an `NP` class.
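To make the bracketed notation concrete, here is a minimal sketch of how such a line could be turned into nested Python lists and then into (word, tag) pairs. This is not the `postag` implementation, whose internals are not shown here; the function names `tokenize`, `parse` and `tagged_words` are hypothetical and exist only for this illustration.

```python
def tokenize(line):
    """Split a bracketed line into "(", ")" and symbol tokens."""
    return line.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively turn a token list into nested Python lists.

    Hypothetical helper for illustration; it assumes well-formed input.
    """
    token = tokens.pop(0)
    if token != "(":
        return token  # a bare symbol: either a label or a word
    node = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # discard the matching ")"
    return node


def tagged_words(tree):
    """Collect (word, tag) pairs from the leaves of a parsed tree."""
    if len(tree) == 2 and all(isinstance(part, str) for part in tree):
        tag, word = tree
        return [(word, tag)]
    pairs = []
    for child in tree:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs


line = "((NP (DT an) (NNP Oct.) (CD 19) (NN review)))"
tree = parse(tokenize(line))
pairs = tagged_words(tree)

print(tree)
print(pairs)
```

The outer list corresponds to the outermost parentheses (one instance), and the inner list starting with `NP` holds one `[tag, word]` pair per token.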

## Reading a raw labeled file

To read a labeled file with `postag`, import the `Treebank` class, then call its `read_file` method with the file path as the argument.

In [9]:
import time
from postag import Treebank

start = time.time()
ptb = Treebank.read_file('data/traindata')
end = time.time()

print(f"It took {end - start} seconds to read it!")

It took 46.40592646598816 seconds to read it!


In [11]:
len(ptb.instances) # Number of instances in the PT bank set

39831

As you can see, a large instance set can take a long time to parse and load into memory. That's why storing the generated treebank structure itself is a good way to reduce the loading time.

## Saving and loading

With the treebank structure loaded in memory, call the `.save` method with the storage path as the argument. Note that the generated file is larger than the original one.

In [13]:
ptb.save('data/dumps/my_ptb_struct')

Loading a saved treebank structure is just as simple. The `.load` method loads the structure into an existing instance of `Treebank`. Note that any instances the current `Treebank` object already holds will be deleted.

In [14]:
start = time.time()

my_new_ptb = Treebank()
my_new_ptb.load('data/dumps/my_ptb_struct')

end = time.time()

print(f"This time it just took {end - start} seconds to load!")

This time it just took 4.44110107421875 seconds to load!
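The parse-once, reload-later pattern shown above can be wrapped in a small helper. The sketch below is an assumption-laden illustration: it uses Python's standard `pickle` module as a stand-in, since the on-disk format used by `postag`'s `.save` and `.load` is not documented here, and the names `load_or_build` and `slow_parse` are hypothetical.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if the file exists, else build and cache it.

    Hypothetical helper; pickle is an assumed stand-in for postag's format.
    """
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    data = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data


calls = []  # records how many times the expensive step actually runs


def slow_parse():
    """Stand-in for an expensive parse of a raw treebank file."""
    calls.append(1)
    return [("an", "DT"), ("Oct.", "NNP"), ("19", "CD"), ("review", "NN")]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "my_ptb_struct.pkl")
    first = load_or_build(cache_path, slow_parse)   # parses and writes the cache
    second = load_or_build(cache_path, slow_parse)  # reads the cache instead

print(first == second, len(calls))
```

With the real library, the `build` step would presumably be the `Treebank.read_file` call and the cached branch the `.load` call. If pickle is used for the cache, only load files from a trusted source, because unpickling can execute arbitrary code.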
