In [None]:
# Preparing where we were at the end of the last notebook
from htrc_features import FeatureReader
fr = FeatureReader(['data/sample-file1.basic.json.bz2', 'data/sample-file2.basic.json.bz2'])
vol = fr.first()
tokens = vol.tokens_per_page()
tl = vol.tokenlist()

# Loading a Token List

The information contained in `vol.tokens_per_page()` is minimal: a single sum of all words in the body of each page.

The Extracted Features dataset also provides token counts with much more granularity: for every part of speech (e.g. noun, verb) of every occurring capitalization of every word of every section (i.e. header, footer, body) of every page of the volume. 

`tokens_per_page()` keeps only the "for every page" grouping. To get section-, part-of-speech-, and word-specific details, use `vol.tokenlist()`:

In [None]:
tl = vol.tokenlist()
# Let's look at some words deeper into the book:
# from 1000th to 1100th row, skipping by 10 [1000:1100:10]
tl[1000:1100:10]

page,section,token,pos,count
24,body,years,NNS,1
25,body,7,CD,1
25,body,Oh,UH,1
25,body,asked,VBD,1
25,body,could,MD,1
25,body,give,VB,1
25,body,him,PRP,2
25,body,lace,NN,1
25,body,may,MD,1
25,body,n't,RB,1


As before, the data is returned as a Pandas DataFrame. This time, there is much more information. Consider a single row:

In [None]:
tl[1000:1001]

page,section,token,pos,count
24,body,years,NNS,1


The first four columns, rendered in bold in the notebook, are an index. Unlike the typical one-dimensional index seen before, this index has four dimensions: page, section, token, and pos. This row says that on the 24th page, in the body section (i.e. ignoring any words in the header or footer), the word 'years' occurs 1 time as a plural noun. The part-of-speech tag for a plural noun, `NNS`, follows the Penn Treebank ([https://goo.gl/6NVDJv](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)) definition.
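To make the four-level index more concrete, here is a minimal sketch that rebuilds a few of the rows shown above by hand as an ordinary Pandas DataFrame and looks values up by index. The name `sample` and the hand-copied rows are only for illustration; they stand in for the DataFrame that `vol.tokenlist()` returns and are not produced by the Feature Reader.

```python
import pandas as pd

# Hand-copied rows from the output above; `sample` is an illustrative stand-in
# for the DataFrame returned by vol.tokenlist().
rows = [
    (24, 'body', 'years', 'NNS', 1),
    (25, 'body', 'asked', 'VBD', 1),
    (25, 'body', 'him', 'PRP', 2),
    (25, 'body', 'lace', 'NN', 1),
]
sample = pd.DataFrame(rows, columns=['page', 'section', 'token', 'pos', 'count'])
sample = sample.set_index(['page', 'section', 'token', 'pos'])

# The index has four named levels rather than one
print(sample.index.names)

# Look up one fully specified row: page 24, body, 'years' as a plural noun
years_count = sample.loc[(24, 'body', 'years', 'NNS'), 'count']
print(years_count)

# Select every row for page 25 by matching on a single index level
page_25 = sample.xs(25, level='page')
print(page_25)
```

The same `.loc` and `.xs` lookups should work on the full `tl` DataFrame, since it uses the same four index levels.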

> The "words" on the first page seem to be OCR errors from the cover of the book. In the HTRC Feature Reader, a "page" is the $n^{th}$ scanned image of the volume, not the number actually printed on the page. This is why "page 1" in this example is the cover.

Tokenlists can be retrieved with arguments that fold certain dimensions, such as `case`, `pos`, or `pages`. You may also notice that only the 'body' section is returned by default; this can be overridden with the `section` argument.

Look at the following list of commands: can you guess what the output will look like? Try for yourself and observe how the output changes.

```python
vol.tokenlist(case=False)
vol.tokenlist(pos=False)
vol.tokenlist(pages=False, case=False, pos=False)
vol.tokenlist(section='header')
vol.tokenlist(section='group')
```
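As a hint for the exercise above, folding a dimension amounts to summing the counts over it. The sketch below mimics that idea on a tiny hand-made table using plain Pandas. The rows and the names `toy`, `no_pos`, `no_case`, and `totals` are invented for illustration, and this is only a conceptual equivalent, not the Feature Reader's own implementation.

```python
import pandas as pd

# Invented rows for illustration only (not taken from the sample volume)
rows = [
    (1, 'body', 'The', 'DT', 2),
    (1, 'body', 'the', 'DT', 5),
    (1, 'body', 'lace', 'NN', 1),
    (2, 'body', 'lace', 'NN', 3),
    (2, 'body', 'lace', 'VB', 1),
]
toy = pd.DataFrame(rows, columns=['page', 'section', 'token', 'pos', 'count'])

# Roughly what pos=False does: drop the pos dimension and sum the counts
no_pos = toy.groupby(['page', 'section', 'token'])[['count']].sum()

# Roughly what case=False does: lowercase the tokens, then sum matching rows
lowered = toy.assign(lowercase=toy['token'].str.lower())
no_case = lowered.groupby(['page', 'section', 'lowercase', 'pos'])[['count']].sum()

# Folding pages, case, and pos together leaves one count per lowercased word
totals = lowered.groupby(['section', 'lowercase'])[['count']].sum()
print(totals)
```

Each fold yields fewer rows and larger counts, because rows that differ only in the folded dimension are merged.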

The arguments that `tokenlist()` accepts are described in the documentation ([http://goo.gl/hgCqgJ](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume.tokenlist)) for the Feature Reader.