# Get all Proper Nouns

This example shows how one might gather all the proper nouns from a collection of books using the HTRC Feature Reader.

In [None]:
from htrc_features import FeatureReader
import pandas as pd

First, collect the list of files that you hope to extract the nouns from.

In [None]:
import glob
paths = glob.glob('../data/PZ-volumes/*.basic.json.bz2')
fr = FeatureReader(paths)

For now, let's walk through what we would do with just one volume. We'll assign the first volume from the FeatureReader to `vol` and get its token list without page-level information.

In [None]:
vol = next(fr.volumes())
tl = vol.tokenlist(pages=False)
tl[:1]

I'm interested in the occurrence of words across years, so we'll add a `date` column and absorb it into the MultiIndex as a new level. At the same time, we'll drop the `section` level, since it's all redundant information. You can read about [Pandas MultiIndexes in the Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/advanced.html).

In [None]:
# Remove 'section'
tl.index = tl.index.droplevel(0)
# Add date column, convert to index level, and reorder levels
tl['date'] = vol.year
tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])

Here's what the DataFrame looks like now:

In [None]:
tl[:1]
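If you don't have the volumes at hand, the same reshaping can be tried on a small hand-built DataFrame. This is only a sketch: the tokens, counts, and year below are invented, and the `section`/`token`/`pos` index layout is assumed from the steps above.

```python
import pandas as pd

# Hand-built stand-in for vol.tokenlist(pages=False); all values are invented.
index = pd.MultiIndex.from_tuples(
    [("body", "London", "NNP"), ("body", "walk", "VB"), ("body", "Alps", "NNPS")],
    names=["section", "token", "pos"],
)
toy = pd.DataFrame({"count": [4, 2, 3]}, index=index)

# Drop the redundant 'section' level, here by name rather than by position.
toy.index = toy.index.droplevel("section")

# Add the year as a column, move it into the index, and put it first.
toy["date"] = 1895  # hypothetical publication year
toy = toy.set_index("date", append=True).reorder_levels(["date", "token", "pos"])

print(toy.index.names)
print(toy)
```

Dropping the level by name instead of by position is a small design choice: it keeps working even if the order of the index levels differs.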

To get all the proper nouns, we'll 'slice' all the rows that have `NNP` or `NNPS` as the part-of-speech (POS) value.

Slicing involves using `.loc[]` to ask for, in order: all `date` values, all `token` values, and just the `pos` values that match `NNP` or `NNPS`. Below I use `IndexSlice` simply for a more familiar syntax, but `idx[:,:,('NNP', 'NNPS')]` is equivalent to asking for `(slice(None),slice(None),('NNP', 'NNPS'))`.

In [None]:
idx = pd.IndexSlice
proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
# Show only proper nouns that occur more than once
proper_nouns[proper_nouns['count'] > 1].sort_values('count', ascending=False)[:10]
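To see that `IndexSlice` is only a convenience, here is a minimal check on a toy DataFrame that compares it with the explicit `slice(None)` form. The data is invented, and the POS labels are passed as a list (the form the Pandas documentation uses for slicers) rather than a tuple.

```python
import pandas as pd

# Invented rows with the same (date, token, pos) index layout as above.
index = pd.MultiIndex.from_tuples(
    [(1895, "Alps", "NNPS"), (1895, "London", "NNP"), (1895, "walk", "VB")],
    names=["date", "token", "pos"],
)
toy = pd.DataFrame({"count": [3, 4, 2]}, index=index)

idx = pd.IndexSlice

# Same request written two ways: every date, every token, two POS tags.
with_idx = toy.loc[idx[:, :, ["NNP", "NNPS"]], :]
with_slices = toy.loc[(slice(None), slice(None), ["NNP", "NNPS"]), :]

print(with_idx)
print(with_idx.equals(with_slices))
```

Both selections keep the `Alps` and `London` rows and drop `walk`.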

That's it. Let's collect the info for all our volumes.

In [None]:
idx = pd.IndexSlice

def get_proper_nouns(vol):
    tl = vol.tokenlist(pages=False)
    tl.index = tl.index.droplevel(0)
    tl['date'] = vol.year
    tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])
    try:
        proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
        proper_nouns.index = proper_nouns.index.droplevel(2)
        return proper_nouns[proper_nouns['count'] > 1]
    except KeyError:
        # No NNP/NNPS tokens were found in this volume
        return pd.DataFrame()

In [None]:
# Collect all results in a list, then concat the dataframes together
nnp_dfs = []
for vol in fr.volumes():
    nnp_dfs.append(get_proper_nouns(vol))
all_nnp = pd.concat(nnp_dfs)
del nnp_dfs 

In [None]:
# all_nnp was already built in the previous cell (nnp_dfs has since been deleted)
all_nnp.sort_values('count', ascending=False)[:100]

Of course, these counts are biased by the fact that there are only 15 books in the sample. Let's look at which terms occurred at least twice in the greatest number of books.

In [None]:
# Mark every (volume, token) row with a 1, then sum the markers per token
all_nnp['occurred'] = 1
all_nnp.reset_index().groupby('token')[['count', 'occurred']].sum()\
       .sort_values('occurred', ascending=False)[:20]
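The marker-column trick is easier to follow on a few invented rows. The sketch below assumes results shaped like the output of `get_proper_nouns` (a `date`/`token` index and a `count` column); the tokens and numbers are made up.

```python
import pandas as pd

# Invented per-volume results: one row per (date, token) pair.
index = pd.MultiIndex.from_tuples(
    [(1890, "London"), (1890, "Paris"), (1901, "London"),
     (1910, "London"), (1910, "Paris")],
    names=["date", "token"],
)
toy = pd.DataFrame({"count": [5, 2, 7, 3, 4]}, index=index)

# One marker per row; summing the markers counts rows per token.
toy["occurred"] = 1
summary = (
    toy.reset_index()
       .groupby("token")[["count", "occurred"]]
       .sum()
       .sort_values("occurred", ascending=False)
)
print(summary)
```

Here `London` appears in three rows and `Paris` in two. Note that `occurred` counts rows rather than distinct books, so a token tagged as both `NNP` and `NNPS` in the same volume would be counted twice.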