# Library Carpentry: Tools for Humanists

## Python Lesson

### Part 3: Working with HathiTrust Extracted Features

#### Objectives
 - Explore HathiTrust extracted-feature datasets for textual analysis
 - Work with core elements for the computational analysis of texts, including token counts, part-of-speech (POS) tags, and collocations
 - Practice using the pandas library in Python to query, sort, and group datasets

#### 1. HathiTrust as a data source
 - Collection of digitized material from major research libraries
 - Full text available for _selected_ works in the public domain 
 - Permits _non-consumptive_ uses of works in copyright
   - The [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) model

##### Exercise
Visit the [HathiTrust catalog](https://www.hathitrust.org/) and look for the edition of Newton's _Opticks_ published by William Innys in 1730. 

##### Notes
https://catalog.hathitrust.org/Record/007709419?type%5B%5D=all&lookfor%5B%5D=newton%20opticks&ft=ft

#### 2. Getting extracted features
 - Getting the volume ID
 - Installing the feature reader library
 - Using the `Volume` class to load the dataset into Python

Open the `Full View` link. The volume ID can be found in the URL of the volume, between the `id=` and the `&view` parts of the URL: `https://babel.hathitrust.org/cgi/pt?id=uc2.ark:/13960/t3ws8zp9j&view=1up&seq=3&skin=2021`

In [None]:
newton_id = 'uc2.ark:/13960/t3ws8zp9j'
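If you would rather not copy the ID by hand, a minimal sketch like the following pulls it out of the URL with Python's standard `urllib.parse` module. The variable names `url` and `extracted_id` are illustrative and not part of the lesson's code.

```python
from urllib.parse import urlparse, parse_qs

# The Full View URL shown above
url = 'https://babel.hathitrust.org/cgi/pt?id=uc2.ark:/13960/t3ws8zp9j&view=1up&seq=3&skin=2021'

# Split the URL into parts, then parse the query string into a dictionary of lists
query = parse_qs(urlparse(url).query)

# Each query parameter maps to a list of values; we want the first `id` value
extracted_id = query['id'][0]
print(extracted_id)
```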

We need to install a specific Python library in order to retrieve the extracted features for this volume. To do so, we use the Python tool `pip`. 

Note the exclamation point: it tells the notebook to run `pip` as a shell command rather than as Python code.

In [None]:
!pip install htrc_features

In [None]:
from htrc_features import Volume

In [None]:
vol = Volume(newton_id)

Basic metadata -- we use the "dot" notation to access _attributes_ of a complex Python object.

In [None]:
vol.title

In [None]:
vol.author

In [None]:
vol.pub_place

In [None]:
vol.publisher

#### 3. Working with tokens and token counts
 - A _token_ is a quantifiable unit of text
 - For English text, it's typical to treat the following as tokens:
   - Individual words (separated by white space and/or punctuation)
   - Punctuation marks

The `term_volume_freqs` method returns a pandas DataFrame showing the frequency of each token in the volume.

In [None]:
term_counts = vol.term_volume_freqs(page_freq=False)
term_counts

##### Exercise 1
Compare these token counts with those you obtained in our Python lesson #1. How does your tokenization differ from what's represented here?

##### Notes
 - We only split on whitespace, which included punctuation marks as part of tokens
 - Including part-of-speech tags distinguishes between homonyms, e.g. "that" as relative pronoun ("the book that I read") and determiner ("that book is a good one").

In [None]:
term_counts.loc[term_counts.token == 'that']
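To see the first note in action, here is a rough sketch comparing a whitespace split with a simple regular expression that separates punctuation from words. The sample sentence and the pattern are assumptions for illustration; this is not the tokenizer HathiTrust uses.

```python
import re
from collections import Counter

# A made-up sample sentence (not taken from the volume)
sample = "The Rays of Light, and the Colours of Light."

# Splitting on whitespace leaves punctuation attached to words
whitespace_tokens = sample.split()

# One simple alternative: match runs of word characters OR single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(whitespace_tokens)
print(regex_tokens)

# Count how often the bare token "Light" appears under each approach
print(Counter(whitespace_tokens)['Light'])
print(Counter(regex_tokens)['Light'])
```

With the whitespace split, "Light," and "Light." are counted as two different tokens, and the bare token "Light" is never counted at all.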

##### Exercise 2
Use the `term_counts` DataFrame to find out how many times the token "light" appears in this text. 

If it helps, you can look up the meaning of a `pos` tag on the [Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) page.

##### Notes
 - Use `.loc` to filter on the condition where the `token` column matches a certain string. 
 - Looking for lowercase "light" returns only adjectival uses. 
 - Because of the period in which this was printed, the substantive uses are capitalized.
 - OCR quality makes a difference, too.

In [None]:
term_counts.loc[term_counts.token == 'light']

In [None]:
term_counts.loc[term_counts.token == 'Light']

We can use the `str.startswith` method to find words similar to "light." Notice some that seem like OCR errors.

In [None]:
term_counts.loc[term_counts.token.str.startswith('Ligh')]
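One way to get a single total for "light" regardless of capitalization or part of speech is to lowercase the `token` column before comparing, then sum the matching counts. The sketch below uses a small made-up DataFrame (`toy_counts`) whose column names are assumed to match those of `term_counts`; the numbers are invented.

```python
import pandas as pd

# Invented data standing in for term_counts (assumed columns: token, pos, count)
toy_counts = pd.DataFrame({
    'token': ['Light', 'light', 'Light', 'Lights', 'that'],
    'pos':   ['NN',    'JJ',    'NNP',   'NNS',    'IN'],
    'count': [10,      2,       3,       4,        7],
})

# Lowercase the tokens, then test for an exact match with "light"
is_light = toy_counts.token.str.lower() == 'light'

# Select the count column for the matching rows and add the counts up
total_light = toy_counts.loc[is_light, 'count'].sum()
print(total_light)
```

Because the comparison is an exact match, "Lights" is left out; swapping in `str.startswith` would include it.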

#### 4. Doing more with tokens: collocation
 - Using a cleaner text
 - Getting unique values from a column
 - Grouping by part of speech
 - Filtering by part of speech
 - Grouping by page
 - Finding collocations

Let's try with a cleaner version of Newton's text. [This edition](https://catalog.hathitrust.org/Record/100956962) was published in 1931.

Note that the full text is not available to view, but we can still access the extracted features.

In [None]:
vol1931 = Volume('umn.319510005360571')

This time we'll use the `tokenlist` method to get a DataFrame showing token count and type by page.

Passing `case=False` performs a _casefold_ on the tokens (converting them to lowercase), allowing us to ignore the difference between capitalized and uncapitalized terms.

In [None]:
tokens = vol1931.tokenlist(case=False)

We use the `reset_index` method to make the DataFrame easier to work with.

In [None]:
df = tokens.reset_index()
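As a rough illustration of what `reset_index` does, the sketch below builds a tiny DataFrame with a multi-level index and then flattens it. The level names (`page`, `section`, `lowercase`, `pos`) are an assumption about how the token list is organized, and the values are invented.

```python
import pandas as pd

# Invented rows: each index entry is (page, section, lowercase, pos)
index = pd.MultiIndex.from_tuples(
    [(1, 'body', 'light', 'NN'),
     (1, 'body', 'the', 'DT'),
     (2, 'body', 'light', 'NN')],
    names=['page', 'section', 'lowercase', 'pos'],
)
toy_tokens = pd.DataFrame({'count': [3, 5, 2]}, index=index)

# Before: the only regular column is `count`
print(list(toy_tokens.columns))

# After: each index level becomes an ordinary column we can filter on
toy_df = toy_tokens.reset_index()
print(list(toy_df.columns))
```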

##### Exercise 3
Take a moment to explore the structure of this DataFrame. How does it differ from the previous one?

##### Notes
 - Includes token counts for each page/section of the page
 - Can't get a total count of token frequency simply by filtering

We can use the `.loc` indexer to limit the DataFrame to just rows containing the token(s) we are interested in. Then we can access the token column and call the `unique` method to get a list of the _unique_ values in that column. 

Note that the token column is now called `lowercase` because we passed the `case=False` argument to the `tokenlist` method.

In [None]:
light_words = df.loc[df.lowercase.str.startswith('ligh')]
light_words.lowercase.unique()

The same method can help us see what kinds of parts of speech are present in the text.

In [None]:
df.pos.unique()

The `unique` method doesn't tell us how frequently these occur. For that, we need to summarize all the occurrences across all pages and all tokens.

##### Performing aggregations in pandas
 - Write out a section of the DataFrame on the board, showing multiple pages
 - Walk through what a group operation would look like (for grouping by pages)
 - Sorting the DataFrame by a different column changes the ordering of the elements
 - The `groupby` function as implicit iteration
 - Steps:
   - Identify the column (or columns) that holds the groups => in this case, it's `pos`
   - Create a groupby object using that column
   - Identify the column you want to summarize (it should be numeric)
   - Apply an aggregate function to that column
 - Time permitting, show `groupby` "under the hood"

In [None]:
# Note: it's not necessary to sort the DataFrame before using groupby
# We're doing it here just for illustration
df.sort_values(by='pos')

In [None]:
pos_groups = df.groupby('pos')
pos_groups['count'].sum()

##### Under the hood of the `groupby` method

- The `groupby` method creates a `GroupBy` object; iterating over it yields Python tuples (a data structure similar to lists), one per group
- The first element in each tuple is the label for each group -- in this case, the part of speech
- These labels are equivalent to the result of `df.pos.unique()`
- The second element in each tuple is a slice of the original DataFrame corresponding to rows where that label is in the groupby column
- The latter is equivalent to `df.loc[df.pos==label]` where `label` has the value of the first element in the tuple. 
- Applying an aggregation function -- like `sum` -- to a column on a `GroupBy` object performs that function _on each slice of the original DataFrame_ and returns a new pandas `Series`.
- In the Series, the index consists of the labels, and the values consist of the results of the aggregation.

In [None]:
it = iter(pos_groups)
next(it)

In [None]:
next(it)

In [None]:
next(it)
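The description above can be sketched as an explicit loop. This example uses an invented DataFrame (`toy_df`) and checks that summing each slice by hand gives the same totals as the built-in aggregation.

```python
import pandas as pd

# Invented token counts across two pages
toy_df = pd.DataFrame({
    'page':      [1, 1, 2, 2],
    'lowercase': ['light', 'the', 'ray', 'the'],
    'pos':       ['NN', 'DT', 'NN', 'DT'],
    'count':     [3, 5, 2, 4],
})

# Iterating over the GroupBy object yields (label, slice) tuples
manual_totals = {}
for label, group in toy_df.groupby('pos'):
    # `group` holds the same rows as toy_df.loc[toy_df.pos == label]
    manual_totals[label] = int(group['count'].sum())

# The built-in aggregation does the same work in one step
built_in_totals = toy_df.groupby('pos')['count'].sum()

print(manual_totals)
print(built_in_totals.to_dict())
```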

##### Exercise 4
Can you recreate the kind of token count we got from `term_volume_freqs` by using the `groupby` method on the page-level DataFrame?

##### Notes
 - `df.groupby('lowercase')['count'].sum()`
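Since `groupby` also accepts a list of column names, a variation on this answer keeps each part of speech separate, which is closer to a table that has one row per token and tag. A minimal sketch on invented data:

```python
import pandas as pd

# Invented page-level counts; "light" appears as both noun and adjective
toy_df = pd.DataFrame({
    'page':      [1, 1, 2, 2],
    'lowercase': ['light', 'light', 'light', 'ray'],
    'pos':       ['NN', 'JJ', 'NN', 'NN'],
    'count':     [3, 1, 2, 4],
})

# Grouping by token alone merges all parts of speech
by_token = toy_df.groupby('lowercase')['count'].sum()

# Grouping by token AND part of speech keeps them apart
by_token_pos = toy_df.groupby(['lowercase', 'pos'])['count'].sum().reset_index()

print(by_token)
print(by_token_pos)
```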

##### Collocations
A popular method of text analysis consists in seeing which words occur in close proximity to other words. In fact, many sophisticated techniques, such as word vectorization, rely on a version of collocation. 

With collocation, we treat the text as having more structure than a simple "bag of words."

We can perform a kind of collocation analysis on our extracted features dataset, using the _page_ as the measure of collocation. In other words, which tokens occur most frequently with which other tokens on the same page of Newton's text? 

We'll stick to the token `light` to make things easier to implement. Which other tokens appear on the same page as the token `light`?

Let's also restrict the kinds of tokens we're including by part of speech. Otherwise, we'll end up with a lot of words like "the" and "of" in the top collocations.

Another approach to doing this is to use [stop words](https://en.wikipedia.org/wiki/Stop_word).

First we define a list of parts of speech. Let's stick to singular and plural nouns.

In [None]:
pos = ['NN', 'NNS']

We use the `isin` method with `.loc` to filter our DataFrame of tokens to just those whose parts of speech are in our list.

In [None]:
noun_df = df.loc[df.pos.isin(pos)]
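To make the filtering step concrete, here is a small sketch on invented data. The `isin` method returns one `True` or `False` per row, and `.loc` keeps only the rows marked `True`. The names `toy_df`, `noun_tags`, and `toy_nouns` are illustrative.

```python
import pandas as pd

# Invented tokens with their part-of-speech tags
toy_df = pd.DataFrame({
    'lowercase': ['light', 'the', 'rays', 'refract'],
    'pos':       ['NN', 'DT', 'NNS', 'VB'],
    'count':     [3, 5, 2, 1],
})

noun_tags = ['NN', 'NNS']

# A boolean Series: True where the row's tag is in our list
is_noun = toy_df.pos.isin(noun_tags)
print(is_noun.tolist())

# Keep only the rows where the test is True
toy_nouns = toy_df.loc[is_noun]
print(toy_nouns)
```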

Now we want to filter our DataFrame to just those _pages_ where the token "light" appears. Note that it's not sufficient to filter like this: `noun_df.loc[noun_df.lowercase == 'light']`. That would exclude all rows where the token was anything _but_ "light." But those are the tokens we are trying to count!

What we want to do is filter by _groups_, where each group consists of all the tokens on a particular page. 

We can do that with a special `filter` method on a `GroupBy` object.

##### Functions that act on other functions

We've used functions and methods in Python with arguments. The `groupby` method is one: as an argument, it takes the name of a DataFrame column or a list of column names.

But Python functions can also take _other functions_ as their arguments. If you have taken a course in logic or calculus, this is loosely analogous to the idea of `F(G(x))`, where `F` and `G` are both well-defined mathematical or logical expressions or equations.
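A minimal plain-Python sketch of the idea, with no pandas involved. Both function names (`shout` and `apply_to_each`) are made up for illustration.

```python
def shout(word):
    """Return the word in capitals with an exclamation point."""
    return word.upper() + '!'

def apply_to_each(func, items):
    """Call `func` on every item and collect the results in a list."""
    return [func(item) for item in items]

# We pass `shout` itself, without parentheses; apply_to_each does the calling
result = apply_to_each(shout, ['light', 'ray'])
print(result)
```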

What we need is a function that will take a DataFrame slice as its argument and return `True` if _at least one of the rows contains the token "light"_, and `False` otherwise.

What would such a function look like?

Let's take an example page of our DataFrame.

In [None]:
page100 = noun_df.loc[noun_df.page==100]

This page happens to contain the word "light."

In [None]:
page100

How could we tell without looking? One way would be to write this.

In [None]:
page100.lowercase == 'light'

Most of the rows here evaluate to `False`, because the token in that row is not "light." But for the row that contains "light," the result is `True`. 

But we want a single result -- `True` or `False` -- depending on whether ANY row in this DataFrame slice contains the token "light." Handily, there's a Python function called `any` for just these situations.

In [None]:
any(page100.lowercase == 'light')
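The same test can be tried on invented data, including a case where no row matches. The sketch also shows the pandas `any` method on a Series, which should give the same answer as the built-in function here; `toy_page` is a made-up stand-in for a page of tokens.

```python
import pandas as pd

# An invented "page" with three tokens
toy_page = pd.DataFrame({'lowercase': ['ray', 'light', 'prism']})

# One True/False value per row
matches = toy_page.lowercase == 'light'

# Built-in any(): True if at least one value is True
print(any(matches))

# The Series method gives the same answer
print(bool(matches.any()))

# A token that does not occur on the page
no_matches = toy_page.lowercase == 'telescope'
print(any(no_matches))
```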

So now we can write a function to perform this test on an arbitrary slice of our DataFrame.

The argument to our function is a DataFrame slice. The return value will be `True` or `False`.

In [None]:
def has_light(df_slice):
    return any(df_slice.lowercase == 'light')

Then we group by the `page` column and pass our `has_light` function to the `filter` method. Note that we are not _calling_ our `has_light` function -- there are no parentheses. The `filter` method will take care of that for us.

In [None]:
with_light = noun_df.groupby('page').filter(has_light)

The result will include rows for _just those pages_ that contain the token "light."

Now we can group a second time, this time by token (`lowercase`), aggregating over the `count` column. 

Finally, we can use the `sort_values` method with `ascending=False` to show the most frequent occurrences first.

In [None]:
with_light.groupby('lowercase')['count'].sum().sort_values(ascending=False)
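To collocate with a token other than "light", the same steps can be wrapped in one function. The sketch below is one possible generalization, run on an invented DataFrame. The function name `page_collocates` and its parameters are hypothetical, and it additionally drops the target token itself from the results, which the cell above does not do.

```python
import pandas as pd

# Invented noun counts on three pages; "light" is on pages 1 and 3 only
toy_nouns = pd.DataFrame({
    'page':      [1, 1, 1, 2, 2, 3, 3],
    'lowercase': ['light', 'ray', 'prism', 'ray', 'glass', 'light', 'ray'],
    'count':     [4, 2, 1, 3, 1, 2, 5],
})

def page_collocates(token_df, target):
    """Sum the counts of tokens that share a page with `target`."""
    def has_target(page_slice):
        # True if any row on this page holds the target token
        return any(page_slice.lowercase == target)

    # Keep only the pages where the target appears
    same_page = token_df.groupby('page').filter(has_target)
    # Drop the target itself so that only its neighbors remain
    others = same_page.loc[same_page.lowercase != target]
    # Total each remaining token across pages, most frequent first
    return others.groupby('lowercase')['count'].sum().sort_values(ascending=False)

collocates = page_collocates(toy_nouns, 'light')
print(collocates)
```

In this toy data, "glass" never shares a page with "light," so it does not appear in the result.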

##### Exercise 5
Find a volume in HathiTrust that interests you. See if you can replicate the steps above to do the following: 
 1. Load the extracted features dataset.
 2. Find the most common tokens.
 3. Find the most common noun tokens (or some other part of speech).
 4. Pick a particular token and find which other tokens occur most commonly with it on the same page.