Synthesizing overlap data from HathiTrust

Ruth edited this page Apr 29, 2024 · 14 revisions

This documentation was created when we were receiving monthly HathiTrust holdings overlap files. These files allowed us to provide links through which patrons could check out digitized HathiTrust copies of physical books in our collection that could not circulate physically during the COVID-19 lockdown.

Thanks to Chad Nelson from Temple University Libraries for posting this article on his process, from which we cribbed heavily. The steps below are very similar to Chad's, with a couple of small changes:

HathiTrust full data process

  1. Download the monthly HathiTrust file from the HathiFiles page.

  2. Pare the monthly file down to just the needed data: OCLC number, HathiTrust id, HathiTrust bib_key, and HathiTrust access code

    1. csvcut to limit to just the needed columns
    2. csvgrep to eliminate rows without required fields (compact)
    3. sort and uniq to eliminate duplicates
    gunzip -c hathi_full_#{args[:period]}.txt.gz | \
    csvcut -t -c 8,1,4,2 -z 1310720 | \
    csvgrep -c 1,2,3,4 -r ".+" | \
    sort | uniq > hathi_all_dedupe.csv
  3. Process the lines with multiple OCLC numbers

    1. Extract the lines with multiple OCLC numbers
    csvgrep -c 1 -r "," hathi_all_dedupe_with_headers.csv > hathi_multi_oclc.csv
    2. Split each multi-OCLC line into one line per OCLC number
    3. Merge the split lines back into the deduped single-OCLC file and re-add the headers
    csvgrep -c 1 -r "," -i hathi_all_dedupe_with_headers.csv > hathi_single_oclc.csv
    cat hathi_single_oclc.csv hathi_multi_oclc_split.csv > hathi_full_dedupe_with_headers.csv
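No command is shown above for the "split" step. A minimal sketch of it, assuming hathi_multi_oclc.csv has the columns oclc,htid,bib_key,access and that multi-OCLC cells are quoted, comma-separated values (the toy input below is illustrative, not real HathiTrust data):

```shell
# Toy input so the sketch is self-contained (illustrative values):
printf '%s\n' 'oclc,htid,bib_key,access' '"1234,5678",mdp.001,100,allow' \
  > hathi_multi_oclc.csv
python3 - <<'EOF'
import csv
with open('hathi_multi_oclc.csv', newline='') as src, \
     open('hathi_multi_oclc_split.csv', 'w', newline='') as dst:
    rows = csv.reader(src)
    out = csv.writer(dst, lineterminator='\n')
    next(rows)  # drop the header; it comes back when the files are merged
    for row in rows:
        # one output row per OCLC number, other columns repeated
        for oclc in row[0].split(','):
            out.writerow([oclc.strip()] + row[1:])
EOF
```

After this, hathi_multi_oclc_split.csv holds one row per OCLC number and no header, so concatenating it after hathi_single_oclc.csv (which keeps its header) yields a single well-formed file.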

Overlap report data process

Take the overlap report HathiTrust provides and extract the unique set of OCLC numbers for records that have some value for access (either allow or deny). Records with a deny value are the ones available for checkout through ETAS:

csvgrep -t -c 4 -r ".+" #{args[:overlap_file]} | \
csvcut -c 1,3 | \
sort | uniq > overlap_all_unique.csv
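If csvkit isn't handy, the shape of this extraction can be sketched with awk on toy data. The file name and sample rows below are illustrative; only the column positions (oclc in 1, item_type in 3, access in 4) mirror the commands above:

```shell
# Toy, tab-separated stand-in for the overlap report (illustrative data):
printf 'oclc\tlocal_id\titem_type\taccess\n1234\tb1\tmono\tdeny\n1234\tb2\tmono\tdeny\n5678\tb3\tserial\t\n' \
  > overlap_sample.tsv
# Keep rows with any access value, project to oclc,item_type, dedupe:
awk -F'\t' 'NR > 1 && $4 != "" { print $1 "," $3 }' overlap_sample.tsv |
  sort | uniq > overlap_all_unique_sample.csv
```

The row with an empty access column drops out, and the duplicate OCLC collapses to a single line.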

Synthesize

Then combine the two outputs:

  1. Split the overlap report by item_type: mono and multi/serial

    csvgrep -H -c 2 -m "mono" overlap_all_unique.csv | \
    csvcut -c 1 | \
    sort | uniq > overlap_mono_unique.csv
    
    csvgrep -H -c 2 -m "mono" -i overlap_all_unique.csv | \
    csvcut -c 1 | \
    sort | uniq > overlap_multi_unique.csv
  2. Filter the pared-down HathiTrust data, using the overlap OCLC numbers as the filter input:

    csvgrep -c 1 -f overlap_mono_unique.csv hathi_full_dedupe_with_headers.csv | \
    csvcut -C 3 | \
    sort | uniq > final_hathi_mono_overlap.csv
    
    csvgrep -c 1 -f overlap_multi_unique.csv hathi_full_dedupe_with_headers.csv | \
    csvcut -C 2 | \
    sort | uniq > final_hathi_multi_overlap.csv
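The csvgrep -c 1 -f step matches each row's first column exactly against the lines of the filter file. For simple, unquoted CSV, a minimal awk stand-in (file names and data below are illustrative) behaves the same way:

```shell
# Illustrative filter list and data file:
printf '1234\n9999\n' > oclc_list_sample.txt
printf 'oclc,htid,bib_key,access\n1234,mdp.001,100,allow\n5678,mdp.002,101,deny\n' \
  > hathi_sample.csv
# First pass loads the OCLC list; second pass keeps the header plus any
# row whose first field is in the list.
awk -F',' 'NR == FNR { keep[$1]; next } FNR == 1 || $1 in keep' \
  oclc_list_sample.txt hathi_sample.csv > filtered_sample.csv
```

Only the 1234 row (plus the header) survives; 5678 is not in the list, and 9999 matches nothing.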

To generate these two files locally, run RUBY_ENVIRONMENT=dev bundle exec rake hathitrust:process_hathi_overlap in the psulib_traject repo. Make sure to edit the hathi_overlap_path, hathi_load_period, and overlap_file settings if needed, and to download the overlap report for the given period from the Box folder into the hathi_overlap_path directory before running the hathitrust:process_hathi_overlap task.

We upload these two files to blackcat01qa and import them into our Solr repository to produce links to check out materials from HathiTrust based on an existing OCLC number match and, importantly, to update the "Access Online" facet so that patrons can see a much larger set of items available online.

More info on importing HathiTrust data: https://github.com/psu-libraries/psulib_blacklight_deploy#hathitrust-files
