# Visualize all illustrations for a given query

The project data can be unwieldy to work with. In many cases, it is desirable to isolate a subset of the 2.5+ million illustrated regions. Analysis can then be done at a smaller scale and more quickly.

One interesting question about early 19th-century publishing concerns the range of artistic styles employed by a given publisher. Did publishers tend to suit their illustrations to the genre, perhaps employing specialist engraving workshops for different types of books? Or did they more or less draw on a common stock of available engravings?

This notebook shows how to get started with such research. The goal will be to find the metadata for all books published in 1800-1850 by the Boston firm Munroe & Francis.

## Step 1: Search Hathifile for publisher

Hathifiles can be very big, so we iteratively search them for field (column) values matching a query. This can take some finesse, since publisher names are often very similar and the name of a firm can be written in slightly different ways (e.g. '&' vs. 'and').

In [1]:
import pandas as pd
import numpy as np
import os, random, re, sys
from glob import glob
from annoy import AnnoyIndex

In [2]:
# the volumes used in the ACS project
HATHIFILE = "google_ids_1800-1850.txt.gz"
HATHICOLS = "hathifiles/hathi_field_list.txt"

In [3]:
def search_hathifile(ht_file, col_file, search_col, search_expr):
    """
    Given a hathifile and field names, return dataframe of rows
    where search_col contains search_expr (a regex)
    """
    # Use iterative method to scale to full hathifiles
    with open(col_file, "r") as fp:
        col_names = fp.readline().strip('\n').split('\t')
        num_cols = len(col_names)

    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
    iter_csv = pd.read_csv(
        ht_file, 
        sep='\t', 
        header=None,
        names=col_names,
        engine='c',
        # quicker if we can assert some types for the fields
        dtype={
            'htid': 'str',
            'rights_date_used': 'object', # values NOT guaranteed to be numeric
            'pub_place': 'str', # sadly, this is just the partner lib
            'imprint': 'str'
        },
        iterator=True,
        chunksize=5000,
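        # skip malformed rows with a warning; on_bad_lines replaces the
        # error_bad_lines flag, which was removed in pandas 2.0
        on_bad_lines='warn')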

    # collect matching rows chunk by chunk, then concatenate once at the end
    matches = []
    for chunk in iter_csv:
        condition = chunk[search_col].str.contains(search_expr, na=False, flags=re.IGNORECASE)

        # hathifile idx has no relation to Neighbor tree: ignore
        matches.append(chunk[condition])

    if not matches:
        return pd.DataFrame(columns=col_names)
    return pd.concat(matches, ignore_index=True)
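
The chunked scan can be tried on a tiny in-memory table before pointing it at a full hathifile. The sketch below is illustrative only: the three-column layout, the sample rows, and the names `demo_tsv` and `demo_hits` are made up for the example and are not part of the project data.

```python
import io
import re

import pandas as pd

# Hypothetical three-column TSV standing in for a hathifile (no header row)
demo_tsv = (
    "mdp.001\tA book\tMunroe and Francis, 1822.\n"
    "mdp.002\tAnother book\tWells and Lilly, 1822.\n"
    "hvd.003\tThird book\tPrinted by Munroe & Francis, 1807.\n"
    "hvd.004\tNo imprint\t\n"
)

# chunksize makes read_csv return an iterator of small dataframes
reader = pd.read_csv(
    io.StringIO(demo_tsv),
    sep="\t",
    header=None,
    names=["htid", "title", "imprint"],
    dtype=str,
    chunksize=2,
)

matches = []
for chunk in reader:
    # na=False treats a missing imprint as a non-match
    keep = chunk["imprint"].str.contains("francis", na=False, flags=re.IGNORECASE)
    matches.append(chunk[keep])

demo_hits = pd.concat(matches, ignore_index=True)
print(demo_hits["htid"].tolist())
```

Only one chunk is held in memory at a time, which is why this approach is practical for files far larger than this toy example.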

In [4]:
# find publishers "Munroe, Francis", "Munroe and Francis", "Munroe & Francis" (using a non-capturing group)
search_col = 'imprint'
search_expr = r"\bMunroe(?:,| and| &) Francis\b"

# a label for the results of this experiment (in case you want to compare later)
search_label = "munroe-francis"

df = search_hathifile(HATHIFILE, HATHICOLS, search_col, search_expr)
df.shape

(360, 26)
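
Because a full scan takes a while, it can help to check the expression against a few imprint strings first. This is a small sketch using only the standard library. The first three strings are adapted from imprints in this notebook's results (the third is shortened), and the last one is a made-up non-matching example.

```python
import re

# Same expression as search_expr above, compiled case-insensitively
pattern = re.compile(r"\bMunroe(?:,| and| &) Francis\b", flags=re.IGNORECASE)

samples = [
    "Munroe and Francis, 1847.",
    "Printed by Munroe & Francis, 1807.",
    "Printed by Munroe, Francis & Parker",
    "Wells and Lilly, 1822.",  # hypothetical imprint that should not match
]

results = {s: bool(pattern.search(s)) for s in samples}
for imprint, matched in results.items():
    print(matched, imprint)
```

Note that the third string matches even though the firm name there is "Munroe, Francis & Parker", so results may need manual review if only the two-partner firm is of interest.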

In [5]:
df.columns # use title, rights_date_used, imprint

Index(['htid', 'access', 'rights', 'ht_bib_key', 'description', 'source',
       'source_bib_num', 'oclc_num', 'isbn', 'issn', 'lccn', 'title',
       'imprint', 'rights_reason_code', 'rights_timestamp', 'us_gov_doc_flag',
       'rights_date_used', 'pub_place', 'lang', 'bib_fmt', 'collection_code',
       'content_provider_code', 'responsible_entity_code',
       'digitization_agent_code', 'access_profile_code', 'author'],
      dtype='object')

In [7]:
# convert date objects to integers, for the year of publication
df['rights_date_used'] = pd.to_numeric(df['rights_date_used']).astype(int)
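
The comment in `search_hathifile` notes that `rights_date_used` is not guaranteed to be numeric, and the conversion above raises an error if a value cannot be parsed or is missing. One possible workaround, sketched here on made-up values rather than the project data, is to coerce unparseable entries to `NaN` and drop them before casting.

```python
import pandas as pd

# Hypothetical raw values: two valid years, one malformed, one missing
raw_years = pd.Series(["1833", "1847", "18uu", None])

# errors="coerce" turns anything unparseable into NaN instead of raising
years = pd.to_numeric(raw_years, errors="coerce")

# NaN cannot be cast to int, so drop those rows first
clean_years = years.dropna().astype(int)
print(clean_years.tolist())
```

Whether dropping undated volumes is acceptable depends on the research question, so treat this as one option rather than the notebook's method.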

In [23]:
# show a few results -- just the search field and the date published
df[[search_col, 'title', 'rights_date_used']].sample(10)

| | imprint | title | rights_date_used |
|---|---|---|---|
| 185 | Munroe and Francis, Charles S. Francis, 1833. | The children's friend; tr. from the French of ... | 1833 |
| 330 | Munroe and Francis, 1847. | Paul Preston's voyages,travels and remarkable ... | 1847 |
| 295 | Munroe and Francis, 1822. | An essay concerning tussis convulsiva, or, who... | 1822 |
| 175 | Printed by Munroe, Francis & Parker, for thems... | The works of William Shakespeare. In nine volu... | 1812 |
| 95 | Published by E. Sargeant, and M. & W. Ward; Mu... | The Spectator; a new edition corrected from th... | 1810 |
| 277 | Printed for Wells & Lilly, Richardson & Lord, ... | The works of Cornelius Tacitus: with an essay ... | 1822 |
| 27 | [Munroe and Francis], 1817- | Spirit of the English magazines. | 1823 |
| 57 | Printed by Munroe & Francis, 1807. | The dramatick works of William Shakespeare : p... | 1807 |
| 7 | Munroe and Francis [etc.] | The Monthly anthology, and Boston review. | 1810 |
| 223 | Munroe and Francis, 1804-1811. | The Monthly anthology, and Boston review. | 1804 |


## Step 2: Find search result matches in illustration metadata

We have a bunch of `htid`s from the Hathifile, but many of them will not contain any illustrations. To narrow down our set of results, we need to look up the `htid`s in our illustration metadata. This can be done with the main CSV file or with the vectors.tar file. Either way, the goal is to get a list of all image or vector files corresponding to specific regions of interest (illustrations) for the volumes returned in our search.

If you want to work with the vectors in `vectors.tar`, you will need to convert each `htid` to HTRC's stubbytree path format.

In [9]:
# Utility functions from Hathi's feature datasets
# https://github.com/htrc/htrc-feature-reader/blob/39010fd41c049f4f86b9c8ff4a44e000217093c2/htrc_features/utils.py
def _id_encode(id):
    '''
    :param id: A Pairtree ID. If it's a Hathitrust ID, this is the part after the library
        code; e.g. the part after the first period for vol.123/456.
    :return: A sanitized id. e.g., 123/456 will return as 123=456 to avoid filesystem issues.
    '''
    return id.replace(":", "+").replace("/", "=").replace(".", ",")

def _id_decode(id):
    '''
    :param id: A sanitized Pairtree ID.
    :return: An original Pairtree ID.
    '''
    return id.replace("+", ":").replace("=", "/").replace(",", ".")

def clean_htid(htid):
    '''
    :param htid: A HathiTrust ID of form lib.vol; e.g. mdp.1234
    :return: A sanitized version of the HathiTrust ID, appropriate for filename use.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)
    return '.'.join([libid, volid_clean])

def id_to_stubbytree(htid, format = None, suffix = None, compression = None):
    '''
    Take an HTRC id and convert it to a 'stubbytree' location.
    '''
    libid, volid = htid.split('.', 1)
    volid_clean = _id_encode(volid)

    suffixes = [s for s in [format, compression] if s is not None]
    filename = ".".join([clean_htid(htid), *suffixes])
    path = os.path.join(libid, volid_clean[::3], filename)
    return path
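
To make the stubbytree layout concrete, here is a compact restatement of the path logic above, applied to two volume IDs that appear later in this notebook. The helper name `stubby_path` is made up for this sketch, and it leaves out the optional format and compression suffixes.

```python
import os

def stubby_path(htid):
    """Compact restatement of id_to_stubbytree above, without suffix handling."""
    libid, volid = htid.split(".", 1)
    # same character substitutions as _id_encode
    volid_clean = volid.replace(":", "+").replace("/", "=").replace(".", ",")
    # the middle directory is every third character of the cleaned volume id
    return os.path.join(libid, volid_clean[::3], f"{libid}.{volid_clean}")

for htid in ["uc1.b3082741", "nyp.33433112037308"]:
    print(htid, "->", stubby_path(htid))
```

With POSIX path separators, the two results are `uc1/b84/uc1.b3082741` and `nyp/33130/nyp.33433112037308`.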

In [10]:
# keep a mapping from unencoded htids from the hathifile...
stubby_dict = {id_to_stubbytree(htid): htid for htid in df.htid.values}

In [12]:
# N.B. this assumes the roi-vectors.tar file has been extracted to a directory named roi-vectors
# adjust the path as necessary
VEC_DIR = os.path.abspath("roi-vectors")

In [13]:
# for each volume, find associated .npy vectors within stubbytree directory -- store in dictionary
munroe_francis = {}

for stubby_id in stubby_dict.keys():
    vol_path = os.path.join(VEC_DIR, stubby_id + "*.npy")
    vol_vectors = glob(vol_path)
    if len(vol_vectors) != 0:
        munroe_francis[stubby_id] = vol_vectors

## Step 3: Reformat ROIs with metadata for PixPlot

We can reformat our selected ROIs, taking selected columns and renaming them. If we are able to acquire image data, this will allow us to attach the metadata and build a PixPlot visualization.

See https://github.com/YaleDHLab/pix-plot for more details.

In [24]:
# columns we want to keep from hathifile: these will map to 'description' and 'year' in PixPlot's format
col_map = {
    'rights_date_used': 'year',
    'title': 'description'
}

rows = []
for k, v in munroe_francis.items():
    
    # derive the .jpg image filename from the .npy vector filename (basename only)
    for npy_file in v:
        
        vec_base = os.path.basename(npy_file)
        img_base = os.path.splitext(vec_base)[0] + '.jpg'
        
        # remember the unencoded htid
        htid = stubby_dict[k]
        
        # row to be added to df_pixplot
        row = {}
        
        # get metadata for this volume
        metadata = df[df['htid'] == htid][list(col_map)]
        
        # tricky, since values could be a list or object
        for col in metadata.columns:
            row[col_map[col]] = metadata[col].values[0]

        # add img_base path and label
        row['filename'] = img_base
        row['label'] = search_label
        
        rows.append(row)

In [25]:
# turn the list of row dicts into a dataframe -- 'filename' shows the convention for image paths used in the project
df_pixplot = pd.DataFrame(rows)
df_pixplot.tail(5)

| | year | description | filename | label |
|---|---|---|---|---|
| 1472 | 1846 | Peter Parley's book of Bible stories for child... | hvd.hwrcv7_00000255_00.jpg | munroe-francis |
| 1473 | 1836 | The year book : an astronomical and philosophi... | uc1.b3082741_00000006_00.jpg | munroe-francis |
| 1474 | 1836 | The year book : an astronomical and philosophi... | uc1.b3082741_00000254_00.jpg | munroe-francis |
| 1475 | 1836 | The year book : an astronomical and philosophi... | nyp.33433112037308_00000012_00.jpg | munroe-francis |
| 1476 | 1836 | The year book : an astronomical and philosophi... | nyp.33433112037308_00000260_00.jpg | munroe-francis |
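
The filenames above appear to follow the pattern `<filename-safe htid>_<page sequence>_<region index>`. That reading is an inference from the sample rows, not something the notebook states, so the sketch below should be treated as an assumption. The helper name `parse_roi_filename` is hypothetical.

```python
import os

def parse_roi_filename(filename):
    """Split an ROI filename into (clean htid, page number, region number).

    Assumes the last two underscore-separated fields are numeric.
    """
    stem = os.path.splitext(os.path.basename(filename))[0]
    # split from the right, in case the volume id itself contains underscores
    clean_id, page, region = stem.rsplit("_", 2)
    return clean_id, int(page), int(region)

parsed = parse_roi_filename("nyp.33433112037308_00000260_00.jpg")
print(parsed)
```

Note that the first element is the filename-safe form of the ID. For volumes whose IDs contain `:`, `/`, or `.` after the library code, `_id_decode` from Step 2 would be needed to recover the original `htid`.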


In [18]:
# use the search label to make a metadata path
metadata_csv = "{}_metadata.csv".format(search_label)

# save as a CSV that PixPlot can accept
df_pixplot.to_csv(metadata_csv, sep=',', header=True, index=False)
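
Since PixPlot attaches metadata to images through the `filename` column, a quick consistency check on that column can catch problems early. The function below is a hypothetical helper written for this sketch, not part of PixPlot, and the demo rows reuse filenames from the table above with one deliberate duplicate.

```python
import pandas as pd

def check_metadata(frame):
    """Return a list of problems found in a PixPlot-style metadata frame."""
    problems = []
    if "filename" not in frame.columns:
        problems.append("missing 'filename' column")
        return problems
    if frame["filename"].isna().any():
        problems.append("empty filename values")
    if frame["filename"].duplicated().any():
        problems.append("duplicate filename values")
    return problems

demo_ok = pd.DataFrame({
    "year": [1836, 1836],
    "filename": ["uc1.b3082741_00000006_00.jpg", "uc1.b3082741_00000254_00.jpg"],
})
# same rows plus a repeated filename, to show what a failure looks like
demo_dup = pd.concat([demo_ok, demo_ok.iloc[[0]]], ignore_index=True)

print(check_metadata(demo_ok))
print(check_metadata(demo_dup))
```

In the notebook, the same function could be called on `df_pixplot` before writing the CSV.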

## Step 4 (optional): Create Annoy index using project vectors

You can experiment with building a smaller Annoy index with just these results.

In [None]:
# Modified from: https://github.com/spotify/annoy
f = 1000
t = AnnoyIndex(f, 'angular')
i = 0

# Find all vectors per volume and index them from 0
for k,v in munroe_francis.items():
    for vec in v:
        item = np.load(vec)
        # flatten the vector, since Annoy expects a 1-D sequence of length f, e.g. (1000,) not (1,1000)
        t.add_item(i, item.ravel())
        i += 1

# Try with 1000 trees
t.build(1000)
t.save('munroe-francis.ann')

In [None]:
u = AnnoyIndex(f, 'angular')
u.load('munroe-francis.ann')
print(u.get_nns_by_item(0, 10))
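
The neighbors come back as integer item IDs, so a lookup from ID to vector file is needed to interpret them. The sketch below rebuilds that lookup by walking a dictionary in the same order as the indexing loop. `volume_vectors` is a made-up stand-in for `munroe_francis`, and `neighbor_ids` is a placeholder for whatever `get_nns_by_item` returns.

```python
import os

# Hypothetical stand-in for the munroe_francis dictionary built in Step 2
volume_vectors = {
    "uc1/b84/uc1.b3082741": [
        "roi-vectors/uc1/b84/uc1.b3082741_00000006_00.npy",
        "roi-vectors/uc1/b84/uc1.b3082741_00000254_00.npy",
    ],
    "nyp/33130/nyp.33433112037308": [
        "roi-vectors/nyp/33130/nyp.33433112037308_00000012_00.npy",
    ],
}

# Walk the dictionary in the same order as the indexing loop,
# so that list position equals the Annoy item id
id_to_path = [vec for vecs in volume_vectors.values() for vec in vecs]

# Placeholder ids; in the notebook these would come from u.get_nns_by_item(0, 10)
neighbor_ids = [0, 2]
neighbor_files = [os.path.basename(id_to_path[i]) for i in neighbor_ids]
print(neighbor_files)
```

This relies on dictionaries preserving insertion order (Python 3.7+) and on the lists not changing between building and querying. Because `glob` does not guarantee a stable order across runs, it may be safer to save `id_to_path` alongside the `.ann` file.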