<a href="https://colab.research.google.com/github/mkane968/Extracted-Features/blob/master/Experiments_with_HTRC_Feature_Reader_UPDATED_8_1_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🪐 Introduction

This notebook is an updated version of the HTRC_SF_experiments notebook: https://github.com/gwijthoff/HTRC_SF_experiments/blob/main/htrc_sf_experiments.ipynb 

In this notebook, you will learn how to analyze over three thousand speculative fiction novels using HathiTrust Research Center (HTRC) Analytics. Rather than working with the complete text of these novels, we will use "Extracted Features": a data format devised by HathiTrust in order to enable text analysis on [post-1926 books still under copyright protection](https://en.wikipedia.org/wiki/Public_domain_in_the_United_States).

Beginning with a print book that looks like this...

<img src="img/wells_moon_print.jpg" alt="first page of Wells The First Men in the Moon" style="width: 400px;"/>

...then scanning and OCRing it to grab its text...

```
THE FIRST MEN IN 
 THE MOON 
 
 MR. BEDFORD MEETS MR. CAVOR AT LYMPNE 
 
 As I sit down to write here amidst the 
 shadows of vine-leaves under the blue sky of 
 southern Italy, it comes to me with a certain 
 quality of astonishment that my participation 
 in these amazing adventures of Mr. Cavor 
 was, after all, the outcome of the purest acci- 
 dent. It might have been any one. I fell 
 into these things at a time when I thought 
 myself removed from the slightest possibility 
 of disturbing experiences. I had gone to 
 Lympne because I had imagined it the most 
 uneventful place in the world. " Here, at any 
 rate," said I, " I shall find peace and a chance 
 to work ! " 
 ' And this book is the sequel. So utterly at 
```

...HTRC finally transforms that text into Extracted Features: a compressed `.json` file no longer readable by human eyes ("consumptive" reading), yet containing "quantitative abstractions of a book's written content" that we can explore through text analysis ("non-consumptive" reading):

```json
":1,"l":1,"r":1,"o":1},"tokenPosCount":{"rate":{"NN":1},"accident":{"NN":1},"IN":{"IN":1},"astonishment":{"NN":1},"down":{"RB":1},"slightest":{"JJS":1},"quality":{"NN":1},"find":{"VB":1},"disturbing":{"JJ":1},"AT":{"IN":1},"vine-leaves":{"NNS":1},"any":{"DT":2},"southern":{"JJ":1},"myself":{"PRP":1},"have":{"VB":1},"is":{"VBZ":1},"MOON":{"NN":1},"said":{"VBD":1},"Lympne":{"NNP":1},"sit":{"VBP":1},"thought":{"VBD":1},".":{".":5},"adventures":{"NNS":1},"blue":{"JJ":1},"THE":{"DT":2},"world":{"NN":1},"fell":{"VBD":1},"CAVOR":{"NNP":1},"all":{"DT":1},"book":{"NN":1},"had":{"VBD":2},"imagined":{"VBN":1},"it":{"PRP":2},"!":{".":1},"A":{"DT":1},"a":{"DT":3},"And":{"CC":1},"utterly":{"RB":1},"sky":{"NN":1},"shadows":{"NNS":1},"outcome":{"NN":1},"Here":{"RB":1},"because":{"IN":1},"Mr.":{"NNP":1},"purest":{"JJS":1},"removed":{"VBD":1},"certain":{"JJ":1},"comes":{"VBZ":1},"MEN":{"NNP":1},"I":{"PRP":7},"LYMPNE":{"NNP":1},"work":{"VB":1},"that":{"WDT":1},"possibility":{"NN":1},"to":{"TO":4},"participation":{"NN":1},"MEETS":{"VBZ":1},",":{",":6},"most":{"RBS":1},"here":{"RB":1},"these":{"DT":2},"was":{"VBD":1},"at":{"IN":3},"been":{"VBN":1},"FIRST":{"NNP":1},"'":{"''":1},"my":{"PRP$":1}
```

Each element you see in the `.json` sample above is a *feature,* a "quantifiable marker of something measurable, a datum," as Peter Organisciak and Boris Capitanu put it in their [*Programming Historian* tutorial](https://programminghistorian.org/en/lessons/text-mining-with-extracted-features) on text mining with HTRC. They continue:

> A computer cannot understand the meaning of a sentence implicitly, but it can understand the counts of various words and word forms, or the presence or absence of stylistic markers, from which it can be trained to better understand text. Many text features are non-consumptive in that they don't retain enough information to reconstruct the book text.

As we'll see, Extracted Features files will allow us not only to count "tokens" (words) in each "volume" (published book), but also to filter by parts of speech, browse extensive bibliographic metadata, view quantitative information about each printed page in the dataset, use named entity recognition (NER) to identify people, places, or organizations in the text, graph these elements, and more.
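To make the idea of a feature concrete, here is a minimal sketch that parses a hand-picked subset of the `tokenPosCount` object from the `.json` sample above and tallies it. The variable names are illustrative only, and the snippet uses Python's standard library rather than Feature Reader, which handles this parsing for us later in the notebook.

```python
import json
from collections import Counter

# A small subset of the "tokenPosCount" object shown above (one page).
# Each token maps to its part-of-speech tags, and each tag maps to a count.
page_json = (
    '{"rate":{"NN":1},"any":{"DT":2},"I":{"PRP":7},'
    '"to":{"TO":4},"at":{"IN":3},"had":{"VBD":2}}'
)
token_pos_count = json.loads(page_json)

# Total tokens in this subset: add up every count under every token.
total_tokens = sum(
    sum(pos_counts.values()) for pos_counts in token_pos_count.values()
)

# The same counts, regrouped by part-of-speech tag.
pos_totals = Counter()
for token, pos_counts in token_pos_count.items():
    for pos, count in pos_counts.items():
        pos_totals[pos] += count

print(total_tokens)
print(pos_totals.most_common(3))
```

Notice that the word order of the page cannot be recovered from these counts, which is what makes the format "non-consumptive."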

# 🪐 Overview of HTRC data

HTRC provides a number of [specialized, recommended worksets](https://analytics.hathitrust.org/staticrecommendedworksets): genre- or period-specific collections of books (or "volumes") that have been digitized by HathiTrust. Here, we'll work with David Mimno and Laure Thompson's **20th Century English-Language Speculative Fiction** workset. It contains 3,236 "volumes of speculative fiction [from 1900-1999] identified both through matching titles and authors to [Worlds Without End](https://www.worldswithoutend.com/), an extensive fan-built database of speculative fiction, and via computational text similarity analysis techniques."

As we explore this workset of speculative fiction, it is important to remember the ways that databases and bibliographies of SF are inherently exclusionary. Suzanne Boswell, who uses The Internet Speculative Fiction Database (ISFDB) in a network analysis of women writers in pulp magazines of the 1920s-40s ([article](https://www.liverpooluniversitypress.co.uk/journals/article/61993/) | [data](https://github.com/sfboswell/Gender_Pulps)), argues that standard bibliographies of SF are almost never representative.

> The ISFDB chooses what constitutes science fiction, horror, and fantasy by deciding what to archive in its database. For the early twentieth century, this mainly means the pulps. This decision makes it difficult to track the contributions of women to the science fiction genre: if the pulps excluded women, and bibliographic archives only count the pulps as science fiction, where do we find the women? Another example: in the early twentieth century, most science fiction by Black authors was published outside of the pulps (W. E. B. Dubois's “The Comet” [1920]; Pauline Hopkins's *Of One Blood* [1902–1903]; George Schuyler's short stories in *The Pittsburgh Courier* [1936–1937]). The ISFDB will have the bibliographic information for, say, George Schuyler—but it does not have bibliographic information for other Black-dominated magazines, or Black fiction pamphlets, where other Black speculative fiction writers may exist outside of sf archives. Marginalized authors who write outside the pulps enter science fiction archives as exceptions: their community does not come with them. In this way, science fiction archives repeat the exclusionary patterns of the early twentieth century.

Building text corpora using databases like WWE and ISFDB can severely limit the diverse range of voices that contributed to 20th-century speculative fiction.

# 🪐 Downloading HTRC data

The [HTRC Analytics interface](https://analytics.hathitrust.org) provides a few different, pre-defined ways of accessing, analyzing, and downloading the data in their recommended worksets. I have already downloaded all of the files needed for this tutorial and included them in this GitHub repository. So, if it's speculative fiction you want to work with, feel free to move on to the tutorial's [next 🪐 section](#Working-with-metadata-for-entire-SF-workset) to explore that data. Otherwise, if you would like to use one of HTRC's recommended worksets, create one of your own, or learn the steps I took to get these files, continue reading below the line.

***********

In order to download our volumes, we'll use Feature Reader, a Python library for working with the HTRC Extracted Features dataset.

For future reference: another way of downloading and analyzing Extracted Features is by generating an `rsync` file on HathiTrust's website. You can browse a number of recommended, genre- or period-specific worksets by navigating to <https://analytics.hathitrust.org/worksets> and logging in or creating an account. Here are the full [instructions for doing so](https://analytics.hathitrust.org/algorithms/Extracted_Features_Download_Helper?wsname=20th+Century+English-Language+Speculative+Fiction%40htrc). That process generates a shell script, `EF_Rsync.sh`. Once downloaded and run, that script will download the Extracted Features files for all volumes in that workset, storing them in a tricky pairtree file/folder format. Using that process, you can also download pre-processed analyses of the texts in the workset, including named entity recognition, token counts, and topic modeling.

But in this tutorial, we'll do the downloading right in this notebook using Python. Again, *I've already done so and downloaded all of the Extracted Features data into the [`data/SF_Extracted_Features`](https://github.com/gwijthoff/HTRC_SF_experiments/tree/main/data/SF_Extracted_Features) directory of this repo.*

Thompson and Mimno's original list actually included 5,168 identified volumes of SF in HathiTrust, but there are many duplicate copies (e.g., the [1998 Anchor Books edition](https://catalog.hathitrust.org/Record/004963699) and the [2006 Everyman edition](https://catalog.hathitrust.org/Record/009819360) of Margaret Atwood's *The Handmaid's Tale*). So, let's begin by reading a .tsv file of all those volumes using [Pandas](https://pandas.pydata.org/), a Python library for working with tabular data. Then let's remove the duplicates.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# create dataframe from Thompson and Mimno's full list of 5,168 SF volumes
import pandas as pd
# the file is tab-separated, so pass an explicit '\t' separator
df = pd.read_csv('thompson-mimno-SF-final-matches.tsv', sep='\t')
df

In [None]:
# remove duplicate titles, leaving us with 3,236 unique works
df_minus_duplicates = df.drop_duplicates(subset=['Title'])
df_minus_duplicates
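As a quick illustration of what `drop_duplicates` is doing here, the sketch below runs it on a tiny, made-up dataframe (the HTIDs are placeholders, not real HathiTrust IDs). By default pandas keeps the first row it sees for each title and discards later ones.

```python
import pandas as pd

# Hypothetical rows: two editions of the same title plus one other book.
toy = pd.DataFrame({
    "HTID": ["aaa.001", "bbb.002", "ccc.003"],
    "Title": ["The Handmaid's Tale", "The Handmaid's Tale", "Solaris"],
})

# keep="first" is the default, so the first row for each Title survives.
deduped = toy.drop_duplicates(subset=["Title"])
print(deduped["HTID"].tolist())
```

Matching is on the exact title string, so two editions whose titles differ in punctuation or capitalization would both be kept.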

GitHub limits us to storing 1,000 files per folder, so let's take a random sample of 1,000 of these volumes to download and work with. *If you are running this notebook and downloading locally, you can skip this step and download all 3,236 volumes!*

In [None]:
# create a new dataframe with 1,000 volumes randomly sampled from our list
# random_state=1 keeps the same 1,000 works, rather than re-sampling
df_sample = df_minus_duplicates.sample(n=1000, random_state=1)
df_sample
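The role of `random_state` can be checked on a small, made-up dataframe: sampling twice with the same seed returns the same rows, while omitting the seed generally does not. The IDs below are placeholders.

```python
import pandas as pd

# Twenty placeholder volume IDs.
toy = pd.DataFrame({"HTID": [f"vol.{i:03d}" for i in range(20)]})

# The same seed yields the same sample every time the cell is run.
first = toy.sample(n=5, random_state=1)
second = toy.sample(n=5, random_state=1)

print(first.equals(second))
```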

In [None]:
# create a list of the HathiTrust IDs column
sf_ids = df_sample['HTID'].tolist()
sf_ids[:10] # view the first 10 results

Next we need to install HTRC's Feature Reader, the Python library that will allow us to download and explore the Extracted Features data.

In [None]:
# install the HTRC Feature Reader
! pip install htrc-feature-reader

We can now download the Extracted Features files. Altogether, these 1,000 volumes will be 164.5 MB and will take a few minutes to download. If you're downloading all 3,236 volumes, it will be 534 MB. (For details on the download process and some other methods for doing so, see [HTRC's documentation on downloading Extracted Features](https://github.com/htrc/htrc-feature-reader/blob/master/examples/ID_to_Rsync_Link.ipynb).)

In [None]:
# download Extracted Features files to /content, Colab's working directory
# ONLY RUN THIS ONCE!
# Takes a couple of minutes to finish
from htrc_features import utils
utils.download_file(htids=sf_ids, outdir = '/content')

**********************

# 🪐 Pulling data on one volume

In [None]:
# Recall the first dataframe we created...
df_sample.sample(5)

In [None]:
# let's reset the index to number each row (drop=True discards the old index)
SFlist = df_sample.reset_index(drop=True)

In [None]:
SFlist

We could use a HathiTrust ID to pull Extracted Features information online. For example, take the row at dataframe index 996 (the number in the leftmost column), Stanisław Lem's *Solaris*, and use the ID in its first column, `mdp.39015002969064`.

In [None]:
from htrc_features import Volume
vol = Volume("mdp.39015002969064")
vol

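However, we want to work with the local files we just downloaded. So, we need to create a new column in our dataframe that points to the path for the Extracted Features file of each volume. Each file is named for its HathiTrust ID, suffixed with its filetype `.json.bz2`. Because the download cell above saved the files to `/content`, Colab's working directory, the bare filename is enough here. If your files live elsewhere (for example, in `data/SF_Extracted_Features/`), preface each filename with that directory location.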

In [None]:
# create a new column combining the existing HTID column
# with the filetype string
SFlist['Initial Path'] = SFlist['HTID'] + '.json.bz2'
SFlist.sample(5)

One thing to note is that some HathiTrust IDs contain `:` and `/` characters, which imply local directory locations. When downloading Extracted Features files, Feature Reader replaces those characters with `+` and `=`, respectively. For those volumes, we need to replace those characters to reflect the actual location. For example, the path for Ursula K. Le Guin's *A Wizard of Earthsea* will change from

`dul1.ark:/13960/t51g5w741.json.bz2`

to 

`dul1.ark+=13960=t51g5w741.json.bz2`.

In [None]:
# to see which files contain those characters
# SFlist[SFlist['HTID'].str.contains('/')]

# create a new Path column for our final location
# regex=False treats ':' and '/' as literal characters in every pandas version
SFlist['Path'] = SFlist['HTID'].str.replace(':', '+', regex=False).str.replace('/', '=', regex=False)

# add the file type
SFlist['Path'] = SFlist['Path'] + '.json.bz2'

# remove Initial Path column with incorrect location
SFlist = SFlist.drop(columns=['Initial Path'])

# verify that the Le Guin volume path is now correct
SFlist[SFlist['Title'].str.contains('Earthsea')]
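The same character substitution can be written as a small standalone function, which is handy for checking a single ID by hand. This is only a sketch: `htid_to_filename` is a hypothetical helper name, not part of Feature Reader, and it assumes the `+`/`=` convention described above.

```python
def htid_to_filename(htid: str, suffix: str = ".json.bz2") -> str:
    """Turn a HathiTrust ID into the local filename used in this notebook.

    ':' becomes '+' and '/' becomes '=', then the file type is appended.
    """
    return htid.replace(":", "+").replace("/", "=") + suffix


# The Le Guin volume from the example above.
print(htid_to_filename("dul1.ark:/13960/t51g5w741"))
# An ID without special characters only gains the suffix.
print(htid_to_filename("mdp.39015002969064"))
```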

Now, we can select the file path for one row in our dataframe, and pull the Extracted Features for that volume. We'll use `iloc` to select a cell at row 996 and column 5 (counting over from the left beginning at 0). Once we have that local file path, we can access the Extracted Features data for that volume using Feature Reader.

In [None]:
# select row 996 for Solaris and column 5 for the file path
lem_path = SFlist.iloc[996,5]
# pull the Extracted Features
vol = Volume(lem_path)
vol

And we can interact with the Extracted Features in many ways. For example, we can pull various metadata fields and insert them into a string of text:

In [None]:
print(f'There are {vol.page_count}pp. in {vol.title}, which was published in {vol.pub_place} and can be identified by OCLC number {vol.oclc}.')
print(f'This copy of {vol.title} was originally digitized at {vol.source_institution}.')

To view a list of all available metadata fields for this volume (other volumes may have more or fewer fields available), run this:

In [None]:
vol.parser.meta.keys()

To view the values for any of those fields, simply enter `vol.FIELD`. For example:

In [None]:
# date of the latest copyright update for this book
vol.last_rights_update_date

In [None]:
# count the number of tokens per page
tokenspp = vol.tokens_per_page()
tokenspp.head()

In [None]:
# graph the count of tokens per page over the course of the book
tokenspp.plot()

Return details on each token: its part of speech (tagged with the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tagset for English), the number of times it appears, and the section of the page in which it appears.

In [None]:
tl = vol.tokenlist()
tl.sample(10)
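To see how a token list of this shape can be queried, here is a sketch on a tiny, hand-built stand-in. It assumes the token list is a dataframe with a single `count` column and a four-level index named `page`, `section`, `token`, and `pos`. Check `tl.index.names` on your own volume before relying on those names.

```python
import pandas as pd

# A made-up token list with the assumed (page, section, token, pos) index.
index = pd.MultiIndex.from_tuples(
    [
        (1, "body", "space", "NN"),
        (1, "body", "the", "DT"),
        (2, "body", "space", "NN"),
        (2, "body", "ocean", "NN"),
        (2, "body", "saw", "VBD"),
    ],
    names=["page", "section", "token", "pos"],
)
toy_tl = pd.DataFrame({"count": [2, 5, 1, 3, 1]}, index=index)

# Tokens per page: sum the counts within each page.
tokens_per_page = toy_tl.groupby(level="page")["count"].sum()

# Filter by part of speech: keep only rows tagged as singular nouns (NN).
nouns = toy_tl[toy_tl.index.get_level_values("pos") == "NN"]

print(tokens_per_page.to_dict())
print(nouns["count"].sum())
```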

In [None]:
# create a list of all the unique tokens / words in the book
words = vol.tokens()
# count those words / measure size of the book's vocabulary
print(len(words))

In [None]:
# view other section features, counting number of tokens, 
# lines, and sentences on each page of the book
features = vol.section_features()
features[100:110]

# 🪐 Getting Word Frequencies in One and Multiple Texts

In [None]:
#Get all occurrences of a word in one text (Solaris, example text from above)
count_df = tl.loc[(slice(None), slice(None), "space")]
count_df.head()

In [None]:
#Get sum of all occurrences of "space" in Solaris
count = count_df['count'].sum()

#Print results of word count
print(f'The word "space" appears {count} times in {vol.title}')

In [None]:
#Get multiple texts to compare word frequency
##The whole corpus could be used, but it takes a very long time to run
##(and a word that never appears in a volume raises an error that would need handling)
##Specific files could also be selected; here we just compare a random sample
df = SFlist.sample(5)
df

In [None]:
#Prepare new dataframe to add counts: we only need title, author and path
df = df.drop(columns=['HTID', 'WWE Novel ID', 'Hand / Auto?'])
df

In [None]:
#Get all occurrences of a word in the selected texts
texts = df
space_count = []

for item in texts['Path']:
    vol = Volume(item)
    tl = vol.tokenlist()
    count_df = tl.loc[(slice(None), slice(None), "space")]
    space_count.append(count_df['count'].sum())
    

In [None]:
#Check that counts have been tallied
space_count[:5]
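The comments above mention that the lookup fails for volumes in which the word never appears. One possible workaround, sketched here on a made-up token list, is to select rows with a boolean mask instead of a label lookup, so a missing word simply sums to zero. `count_token` is a hypothetical helper, and the index level name `token` is an assumption about the token list's layout.

```python
import pandas as pd


def count_token(tokenlist: pd.DataFrame, word: str) -> int:
    """Sum the counts for `word`, returning 0 if it never appears."""
    # A boolean mask never raises for a missing word; it is just all False.
    mask = tokenlist.index.get_level_values("token") == word
    return int(tokenlist.loc[mask, "count"].sum())


# A made-up token list with the assumed (page, section, token, pos) index.
index = pd.MultiIndex.from_tuples(
    [
        (1, "body", "space", "NN"),
        (1, "body", "the", "DT"),
        (2, "body", "space", "NN"),
    ],
    names=["page", "section", "token", "pos"],
)
toy_tl = pd.DataFrame({"count": [2, 5, 1]}, index=index)

print(count_token(toy_tl, "space"))
print(count_token(toy_tl, "robot"))
```

The comparison is case-sensitive, so "Space" at the start of a sentence would be counted separately from "space".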

In [None]:
#Add the summed occurrences to the dataframe as a new column
#To make this neater, could drop columns so only title and count are left
df['Space Count'] = space_count
df

In [None]:
#Create bar plot based on frequency
import matplotlib.pyplot as plt

df.plot.bar(x="Title", y="Space Count", rot=70, title="Frequency of 'Space' in SF Texts")

# render the figure
plt.show()


# 🪐 Enriching our bibliography

Next, we can pull some of that Extracted Features data for each volume back into `SFlist`, the dataframe of all the volumes we're working with. We'll create a new dataframe for all of this bibliographic data, calling it `SFbib`.

In [None]:
# clean up the df, dropping columns we won't use
SFbib = SFlist.drop(columns=['WWE Novel ID', 'Hand / Auto?'])
SFbib.sample(5)

In [None]:
# for some reason it looks like
# mdp.39015058731913 is not in the EF dataset
# and returns non-zero exit status 23 when trying to dl
# so let's remove it. index #868 in the current DF
SFbib = SFbib.drop(868)

Let's loop through each line of our dataframe, pulling the published year of each volume.

In [None]:
# Pull the publication year for every volume.
# Years are appended in the same row order as the 'Path' column, so assigning
# the list to a new column afterwards keeps each year aligned with its title.

test = SFbib
years = []

for item in test['Path']:
    vol = Volume(item)
    years.append(vol.year)

In [None]:
test['Pub Year'] = years
test
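If you would rather not rely on list order for alignment, one alternative is to let pandas apply a lookup function to the `Path` column, which keeps each result attached to its own row label. The sketch below uses a made-up dictionary in place of reading real Extracted Features files: `lookup_year`, `fake_years`, and the filenames are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for reading a year from each Extracted Features file.
fake_years = {"a.json.bz2": 1961, "b.json.bz2": 1968}


def lookup_year(path):
    """Return the year for a path, or None when the file is unknown."""
    return fake_years.get(path)


# The index has a gap (no row 2), as it would after dropping a row.
toy_bib = pd.DataFrame(
    {
        "Title": ["Solaris", "A Wizard of Earthsea", "Unknown"],
        "Path": ["a.json.bz2", "b.json.bz2", "c.json.bz2"],
    },
    index=[0, 1, 3],
)

# map() returns a Series that shares toy_bib's index, so rows stay aligned.
toy_bib["Pub Year"] = toy_bib["Path"].map(lookup_year)
print(toy_bib)
```

In the notebook itself, the body of `lookup_year` could wrap the `Volume(item).year` call from the loop above, optionally inside a `try`/`except` so that one unreadable file does not stop the whole run.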

In [None]:
# Graph the distribution of publication years for volumes in the list
df = test['Pub Year'].value_counts()

In [None]:
# render inline plots at high ("retina") resolution
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina')

In [None]:
#Create bar plot based on publication years
import pandas as pd
import matplotlib.pyplot as plt

# count volumes per year and sort chronologically
# (errors='coerce' turns any non-numeric year into NaN, which value_counts() skips)
year_counts = pd.to_numeric(test['Pub Year'], errors='coerce').value_counts().sort_index()

# declare the figure; figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(12, 7))
year_counts.plot.bar(ax=ax, title="Titles Published Per Year")
ax.set_xlabel("Year")
ax.set_ylabel("Year Count")

plt.show()

# 🪐 Filtering our new bibliographic dataframe

We can now use this DataFrame containing bibliographic and file path info to query Extracted Features data on each book.

Let's filter the author column to all volumes containing the string "Delany," to find works included in this list written by Samuel R. Delany.

In [None]:
SFbib[SFbib['Author'].str.contains('Delany')]
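The filter above matches the exact, case-sensitive string and can raise an error if any author cell is empty. A slightly more forgiving variant is sketched below on a made-up dataframe. The rows and the author formatting are invented for illustration.

```python
import pandas as pd

# Made-up rows, including inconsistent capitalization and a missing author.
toy_bib = pd.DataFrame({
    "Title": ["Nova", "Dhalgren", "Solaris", "Untitled"],
    "Author": ["Delany, Samuel R.", "DELANY, SAMUEL R", "Lem, Stanisław", None],
})

# case=False ignores capitalization; na=False treats missing authors as non-matches.
matches = toy_bib[toy_bib["Author"].str.contains("delany", case=False, na=False)]
print(matches["Title"].tolist())
```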

# 🪐 What's next

Check out these tutorials for more examples of how to work with Extracted Features data and HTRC's Feature Reader.

- [README.ipynb](https://github.com/htrc/htrc-feature-reader/blob/master/README.ipynb) from the HTRC GitHub documentation
- [Text Mining in Python through the HTRC Feature Reader](https://programminghistorian.org/en/lessons/text-mining-with-extracted-features) from the *Programming Historian*
- [Analyzing Documents with TF-IDF](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf) at the *Programming Historian* also uses HTRC Extracted Features
- visualize the rise and fall of topic models across a book with [htrc-book-models](https://github.com/organisciak/htrc-book-models)