# Get all Proper Nouns

This example shows how one might gather all the proper nouns from a collection of books using the HTRC Feature Reader.

In [None]:
from htrc_features import FeatureReader
import pandas as pd

First, collect the list of files that you hope to extract the nouns from.

In [None]:
import glob
paths = glob.glob('../data/PZ-volumes/*.basic.json.bz2')
fr = FeatureReader(paths)

For now, let's walk through what we would do with just one volume. We'll set the first volume of the FeatureReader to `vol` and return a tokenlist, without page-level information.

In [None]:
vol = next(fr.volumes())
tl = vol.tokenlist(pages=False)
tl[:2]

| section | token | pos | count |
| --- | --- | --- | --- |
| body | ! | . | 279 |
| body | !—it | , | 1 |


I'm interested in the occurrence of words across years, so we'll add a `date` column and absorb it into the MultiIndex as a new level. At the same time, we'll drop the `section` level, since it is redundant here. You can read about [Pandas MultiIndexes in the Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/advanced.html).

In [None]:
# Remove 'section', which is level 0 of the MultiIndex
tl.index = tl.index.droplevel(0)
# Add date column, convert to index level, and reorder levels
tl['date'] = vol.year
tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])

Here's what the DataFrame looks like now:

In [None]:
tl[:2]

| date | token | pos | count |
| --- | --- | --- | --- |
| 1901 | ! | . | 279 |
| 1901 | !—it | , | 1 |


The Extracted Features dataset uses the part-of-speech tags from the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). In Penn, proper nouns are labelled `NNP` and plural proper nouns are labelled `NNPS`.
 
To get all the proper nouns, we'll 'slice' all the rows that have `NNP` or `NNPS` as the part-of-speech (POS) value.

Slicing uses the `.loc[]` indexer. Note that we ask for `idx[:,:,('NNP', 'NNPS')]` below. This is asking, in order, for 

 1. any matching `date`, 
 2. any matching `token`, and 
 3. only `pos` rows that match `NNP` or `NNPS`. 
    
Below I use `IndexSlice` simply for a more familiar syntax where colons can be used to ask for everything or a range of options, but `idx[:,:,('NNP', 'NNPS')]` is equivalent to asking for `(slice(None),slice(None),('NNP', 'NNPS'))`. [More details on slicing MultiIndexes](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index).
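To make the equivalence concrete, here is a minimal sketch on a made-up token list with the same three index levels. The `toy` DataFrame and its values are invented for illustration, and the tags are passed as a list, which is how pandas documents requesting several keys on one level.

```python
import pandas as pd

# Toy token list with the same index levels as above (values are made up)
toy = pd.DataFrame(
    {'count': [3, 1, 2, 5]},
    index=pd.MultiIndex.from_tuples(
        [('1901', 'Naples', 'NNP'),
         ('1901', 'Alps', 'NNPS'),
         ('1901', 'walk', 'VB'),
         ('1901', 'the', 'DT')],
        names=['date', 'token', 'pos']),
).sort_index()  # a sorted index keeps MultiIndex slicing well-behaved

idx = pd.IndexSlice

# IndexSlice form: colons mean "everything at this level"
with_idx = toy.loc[idx[:, :, ['NNP', 'NNPS']], :]

# Explicit form: slice(None) plays the role of each colon
with_slices = toy.loc[(slice(None), slice(None), ['NNP', 'NNPS']), :]

# Both selections return the same rows
same = with_idx.equals(with_slices)
```

Here `same` should be `True`, and both results contain only the `Naples` and `Alps` rows.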

In [None]:
idx = pd.IndexSlice
proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
# Show only proper nouns that occur more than once
proper_nouns[proper_nouns['count'] > 1].sort_values('count', ascending=False)[:10]

| date | token | pos | count |
| --- | --- | --- | --- |
| 1901 | Carmela | NNP | 236 |
| 1901 | Gigli | NNP | 77 |
| 1901 | Roberto | NNP | 59 |
| 1901 | Traetta | NNP | 59 |
| 1901 | Rocco | NNP | 58 |
| 1901 | Naples | NNP | 58 |
| 1901 | Gargiulo | NNP | 53 |
| 1901 | Captain | NNP | 53 |
| 1901 | Cecilia | NNP | 42 |
| 1901 | Minino | NNP | 41 |


That's it. Let's collect the info for all our volumes. Depending on the system, this may take a little while. If you're looking to run a similar analysis over much larger sets of volumes, I prepared a short example [on how to parallelize this code](./GetAllProperNouns-Parallel.ipynb).

In [None]:
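idx = pd.IndexSlice

def get_proper_nouns(vol):
    """Return proper nouns occurring more than once in a volume, indexed by (date, token)."""
    tl = vol.tokenlist(pages=False)
    # Drop 'section', then move the volume's year into the index
    tl.index = tl.index.droplevel(0)
    tl['date'] = vol.year
    tl = tl.set_index('date', append=True).reorder_levels(['date', 'token', 'pos'])
    try:
        proper_nouns = tl.loc[idx[:,:,('NNP', 'NNPS')],]
        # Drop the 'pos' level, leaving (date, token)
        proper_nouns.index = proper_nouns.index.droplevel(2)
        return proper_nouns[proper_nouns['count'] > 1]
    except KeyError:
        # No matching tags in this volume: return an empty frame
        return pd.DataFrame()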
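Whether `.loc` raises when a requested tag is missing from a volume may depend on the pandas version, so that behaviour is worth checking on your own setup. As a hedged alternative, the sketch below selects proper nouns with a boolean mask on the `pos` level, which simply yields an empty result when nothing matches. `select_proper_nouns` is a hypothetical helper and the `toy` data is made up; neither is part of the HTRC Feature Reader.

```python
import pandas as pd

def select_proper_nouns(tl):
    """Keep rows whose 'pos' index level is NNP or NNPS (hypothetical helper)."""
    is_proper = tl.index.get_level_values('pos').isin(['NNP', 'NNPS'])
    return tl[is_proper]

# Made-up token list with the same index levels as above
toy = pd.DataFrame(
    {'count': [3, 2]},
    index=pd.MultiIndex.from_tuples(
        [('1901', 'Naples', 'NNP'), ('1901', 'walk', 'VB')],
        names=['date', 'token', 'pos']),
)
no_nouns = toy.iloc[[1]]  # a frame with no proper nouns at all

found = select_proper_nouns(toy)       # one row: Naples
empty = select_proper_nouns(no_nouns)  # zero rows, same columns, no exception
```

Because the empty result keeps the `count` column and the index levels, it concatenates cleanly with results from other volumes.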

In [None]:
# Collect all results in a list, then concat the list of dataframes together into a single dataframe
nnp_dfs = []
for vol in fr.volumes():
    nnp_dfs.append(get_proper_nouns(vol))
all_nnp = pd.concat(nnp_dfs)
del nnp_dfs 

In [None]:
all_nnp.sort_values('count', ascending=False)[:100]

| date | token | count |
| --- | --- | --- |
| 1901 | Ranald | 1146 |
| 1860 | Fanny | 800 |
| 1860 | Madame | 793 |
| 1902 | Garwood | 720 |
| 1920 | Prince | 603 |
| 1855 | Chryssa | 592 |
| 1860 | Marie | 546 |
| 1860 | Roche | 504 |
| 1916 | June | 492 |
| 1891 | Guy | 477 |


Of course, these counts are biased by the fact that there are only 15 books in the sample. Let's look at which terms occurred in the greatest number of books, remembering that we're only looking at words that occurred more than once within a volume.

In [None]:
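# Each row is one volume's entry for a token, so summing 'occurred' counts entries per token
all_nnp['occurred'] = 1
# Sum only the numeric columns of interest, so 'date' is not aggregated
all_nnp.reset_index().groupby('token')[['count', 'occurred']].sum()\
       .sort_values('occurred', ascending=False)[:20]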

| token | count | occurred |
| --- | --- | --- |
| Poor | 78 | 14 |
| God | 461 | 14 |
| Mr. | 1025 | 13 |
| Sunday | 90 | 11 |
| II | 66 | 10 |
| Mrs. | 684 | 10 |
| St | 190 | 9 |
| X | 34 | 9 |
| No, | 61 | 9 |
| Father | 79 | 9 |
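One caveat: because the `pos` level was dropped, a token tagged as both `NNP` and `NNPS` in the same volume contributes two rows, so `occurred` counts rows rather than distinct books. A minimal sketch of a stricter count follows, assuming each row also carried a volume identifier. The `vol_id` column and all values here are hypothetical and are not part of the DataFrame built above.

```python
import pandas as pd

# Made-up rows: one per (volume, token, POS tag).
# 'vol_id' is a hypothetical column used only for this illustration.
rows = pd.DataFrame({
    'vol_id': ['v1', 'v1', 'v2', 'v2', 'v3'],
    'token':  ['God', 'God', 'God', 'Naples', 'Naples'],
    'pos':    ['NNP', 'NNPS', 'NNP', 'NNP', 'NNP'],
    'count':  [10, 2, 7, 58, 3],
})

summary = rows.groupby('token').agg(
    count=('count', 'sum'),         # total occurrences across all rows
    volumes=('vol_id', 'nunique'),  # number of distinct volumes, not rows
).sort_values('volumes', ascending=False)
```

In this toy data `God` has three rows but appears in only two volumes, which is the distinction `nunique` captures.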
