In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pathlib
import pandas as pd

# Stanford AI-ETD 2020 Work-Cycle
Last October, Stanford University Libraries initiated a two-and-a-half-week 
exploratory project for automatically assigning OCLC [FAST][FAST] subject 
headings to over 7,000 publicly accessible electronic theses and dissertations (ETDs) in 
Stanford's Digital Repository. 

Currently, Stanford Libraries does not assign subject headings to ETDs, making this material harder 
to find in [SearchWorks](https://searchworks.stanford.edu/), Stanford's discovery environment. 
To build a full-text corpus, we downloaded the PDFs, converted them to text, and saved each result using the Druid 
(an internal identifier used in Stanford's Digital Repository) as the filename. 
 
[FAST]: http://fast.oclc.org/
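
The naming convention described above can be sketched as follows. This is a minimal illustration, not the project's actual conversion script: the function name `save_full_text` and the `pdf_to_text` callable are hypothetical, and the PDF converter is replaced by a stand-in because the source does not say which tool was used.

```python
import pathlib
import tempfile
from typing import Callable


def save_full_text(
    druid: str,
    pdf_path: pathlib.Path,
    out_dir: pathlib.Path,
    pdf_to_text: Callable[[pathlib.Path], str],
) -> pathlib.Path:
    """Convert one PDF to text and write it to <out_dir>/<druid>.txt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / f"{druid}.txt"
    text_path.write_text(pdf_to_text(pdf_path), encoding="utf-8")
    return text_path


def fake_pdf_to_text(pdf_path: pathlib.Path) -> str:
    # Stand-in converter (assumption): a real run would call a PDF-to-text tool here.
    return f"text extracted from {pdf_path.name}"


with tempfile.TemporaryDirectory() as tmp:
    saved = save_full_text(
        "yq942qc7340",
        pathlib.Path("yq942qc7340.pdf"),
        pathlib.Path(tmp),
        fake_pdf_to_text,
    )
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")
```

Keeping the Druid as the file stem means each text file can be traced back to its repository object without a separate lookup table.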

In [2]:
# Count all of the available full-text files
full_text_etds = pathlib.Path('/Users/jpnelson/2020/sul-dlss/tmp/etd-10-27/results')
len(list(full_text_etds.glob('*.txt')))

7284

In [3]:
first_druid = next(full_text_etds.glob('*.txt'))
print(f"File name: {first_druid.name} size: {first_druid.stat().st_size} bytes")
yq942qc7340_text = first_druid.read_text()

File name: yq942qc7340.txt size: 231897 bytes


In [4]:
print(yq942qc7340_text[0:1000])

   SIMPLIFICATION ALGORITHMS FOR LARGE VIRTUAL
                      WORLDS
                   A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
     AND THE COMMITTEE ON GRADUATE STUDIES
              OF STANFORD UNIVERSITY
    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
                 FOR THE DEGREE OF
               DOCTOR OF PHILOSOPHY
                       Tahir Azim
                     December 2013

                       © 2013 by Tahir Azim. All Rights Reserved.
           Re-distributed by Stanford University under license with the author.
                            This work is licensed under a Creative Commons Attribution-
                            Noncommercial 3.0 United States License.
                            http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/yq942qc7340
                                               ii

I certify that I have read this dissertation and that, in my opinion, it 

In [6]:
print(yq942qc7340_text[30000:33000])

 worlds have greatly gained in popularity over the past decade or so. This
includes online communities, such as Second Life [64] and Onverse [55], and massive
multiplayer online role-playing games (MMORGs), such as World of Warcraft [78]
and Eve Online [28]. Today, millions of people sign in to virtual worlds such as these
to meet other users, create and explore virtual neighborhoods, play games, and even
engage in educational activities.
   However, there still exists a wide gulf between the visions of virtual worlds in
▯ction and what they look like in reality. Fictional descriptions of virtual worlds
describe richly detailed 3D environments, with sweeping vistas, highly interactive user
interfaces, and objects with complex behaviors. In contrast, due to the limitations of
networks and graphics resources, virtual worlds today have relatively simple graphical
content, display limited scenes, and make it di▯cult to program objects with complex
behavior.
   This dissertation focuses on 

## Initial Approach
We investigated two similar approaches,
both leveraging HuggingFace's [transformers](https://huggingface.co/transformers/) 
library, to fine-tune a pre-trained BERT model for FAST classification of ETDs.

- The May 27, 2020 article [Transformers for Multi-Label Classification made simple](https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1) by Ronak Patel, which uses [PyTorch](https://pytorch.org/) to build
a multi-label classification transformer. 
- Kaushal Trivedi's [Fast-Bert](https://github.com/kaushaltrivedi/fast-bert) library for training BERT-based text classifiers.

Either approach requires labeled training and validation data sets.

Given the scope of the project and the time constraints, we decided to focus on the subset of ETDs
associated with a Biology-related department. We identified these by querying [SearchWorks](https://searchworks.stanford.edu/) for each ETD's
MODS metadata and extracting the department. This also allowed us to restrict the FAST subject headings to a much
smaller subset of around 1,700 Biology-related headings.
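
As a rough illustration of the department filter, the sketch below keeps only ETDs whose department name contains a Biology-related keyword. The keyword list and the `is_biology_related` helper are assumptions made for this example; the project's actual list of Biology-related departments is not given here. The Druid and department pairs are taken from the tables shown elsewhere in this notebook.

```python
# Hypothetical keyword list (assumption): the project's real list of
# Biology-related departments is not reproduced here.
BIOLOGY_KEYWORDS = ("biology", "biochemistry", "bioengineering", "neurosciences")


def is_biology_related(department: str) -> bool:
    """Return True when the department name contains a Biology-related keyword."""
    name = department.lower()
    return any(keyword in name for keyword in BIOLOGY_KEYWORDS)


# Druid/department pairs as they might look after extraction from MODS metadata.
etds = [
    {"druid": "zd879dy3740", "department": "Department of Biochemistry."},
    {"druid": "yq942qc7340", "department": "Department of Computer Science."},
    {"druid": "sc349gp0346", "department": "Department of Biology."},
]

biology_etds = [etd for etd in etds if is_biology_related(etd["department"])]
biology_druids = [etd["druid"] for etd in biology_etds]
```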

## Training Data Set Attempt #1 - Preexisting Google Books
Our first attempt at creating a training data set used the full-text monographs 
available through Stanford's partnership with Google that have already been cataloged 
with Library of Congress subject headings. A member of the project team, a 
metadata cataloger at Stanford, found the corresponding FAST subject headings 
and provided a spreadsheet that maps each FAST label to the Druid of the Google Book.

In [7]:
# CSV file of Google Books
google_bks = pd.read_csv("/Users/jpnelson/Google Drive File Stream/Shared drives/SUL AI 2020/Project - ETDs/data/google_books_qh-qr_druids_fast.csv",
                         names=["Druid", "FAST Label"])
print(f"Size {len(google_bks)}")
google_bks.head()

Size 6642


Unnamed: 0,Druid,FAST Label
0,druid:mm024dj8321,Communication in science
1,druid:mm024dj8321,Discourse analysis
2,druid:mm024dj8321,Evolution
3,druid:mj309px4330,Damselflies
4,druid:mj309px4330,Dragonflies


From this listing of Google Books Druids and FAST labels, we took the corresponding FAST URIs and batched the full text of each Google Book to create a DataFrame. Each row contains the druid, a 512-character text batch, and, for each Biology FAST URI, a 1 if the book has that subject heading and a 0 otherwise.
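
The row layout described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's batching code: the helper names `batch_text` and `build_rows` are hypothetical, the fixed-width slicing is the simplest possible reading of "512-character text batch", and the pairing of the example book with a FAST URI is made up.

```python
from typing import Dict, List, Set


def batch_text(text: str, size: int = 512) -> List[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_rows(
    druid: str, text: str, book_uris: Set[str], all_uris: List[str], size: int = 512
) -> List[Dict[str, object]]:
    """Build one row per text batch with a 0/1 column for every FAST URI."""
    labels = {uri: 1 if uri in book_uris else 0 for uri in all_uris}
    return [{"druid": druid, "text": chunk, **labels} for chunk in batch_text(text, size)]


# Two of the FAST URI columns from the DataFrame; the label assignment is illustrative.
all_uris = [
    "http://id.worldcat.org/fast/870268",
    "http://id.worldcat.org/fast/894932",
]
rows = build_rows(
    "example-druid",  # hypothetical identifier
    "x" * 1100,       # placeholder full text of 1,100 characters
    {"http://id.worldcat.org/fast/894932"},
    all_uris,
)
```

Every batch from the same book carries the same label vector, which is why the DataFrame has far more rows than there are books. A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`.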

In [8]:
goog_bks_sample = pd.read_csv("/Users/jpnelson/2020/sul-dlss/goog-bks-csv/b-druids-all.csv")
goog_bks_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34817 entries, 0 to 34816
Columns: 1795 entries, druid to http://id.worldcat.org/fast/43064
dtypes: float64(1793), object(2)
memory usage: 476.8+ MB


In [9]:
goog_bks_sample.head()

Unnamed: 0,druid,text,http://id.worldcat.org/fast/870268,http://id.worldcat.org/fast/894932,http://id.worldcat.org/fast/917265,http://id.worldcat.org/fast/887377,http://id.worldcat.org/fast/897386,http://id.worldcat.org/fast/1204623,http://id.worldcat.org/fast/1205427,http://id.worldcat.org/fast/1065823,...,http://id.worldcat.org/fast/530762,http://id.worldcat.org/fast/179858,http://id.worldcat.org/fast/434309,http://id.worldcat.org/fast/1423826,http://id.worldcat.org/fast/1423871,http://id.worldcat.org/fast/235269,http://id.worldcat.org/fast/185500,http://id.worldcat.org/fast/675822,http://id.worldcat.org/fast/444817,http://id.worldcat.org/fast/43064
0,bx738nq4215,\nGENERA\nEUPHORBIACEARUM\nA1an Radcliffe Smith\n,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,bx738nq4215,\nGENERA\nEUPHORBIACEARUM\n,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,bx738nq4215,\n5\nw\n7.\n4\n13\n9\n8.\n5\n10\n2\n1. Acalyph...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,bx738nq4215,GENERA\nEUPHORBIACEARUM\nAlan Radcliffe-Smith\...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,bx738nq4215,"© 2001 The Board of Trustees, Royal Botanic Ga...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


From this list, we downloaded a sample of the Google full text and attempted to 
train a model using both the Fast-Bert and Patel approaches, but ran into problems because of limited GPU computing resources. 

## Training Data Set Attempt #2 - SpaCy and Streamlit App

After running into problems using Google Books as a training set, we created a 
small [Streamlit](https://www.streamlit.io/) application 
that allows catalogers to assign FAST subject headings to a particular ETD's abstract based on 
suggestions from a custom [spaCy][SPACY] pipeline. These assignments were saved to a JSON file, with the intention of using 
these manual assignments as our training data set.

The dataset for the app was generated by harvesting the ETD MODS metadata. The app displays a random abstract that has been tagged with FAST subject headings by the spaCy pipeline. The cataloger selects the headings that are relevant and can also search for other FAST subject headings that might apply. The assignments are then saved using Google Firebase.

The app is hosted on Heroku at https://biology-fast-etds.herokuapp.com/, with the source code 
available on GitHub at [https://github.com/sul-dlss-labs/biology-fast-etds](https://github.com/sul-dlss-labs/biology-fast-etds).

[SPACY]: https://spacy.io/

In [10]:
bio_abstracts = pd.read_pickle("/Users/jpnelson/02021/sul-dlss/labs/biology-fast-etds/data/biology.pkl")
bio_abstracts.head()

Unnamed: 0,druids,abstracts,departments,title
0,zd879dy3740,Circumstantial evidence suggests vast post-tra...,Department of Biochemistry.,Regulation of gene expression by RNA-binding p...
1,sc349gp0346,The brain is a complex organ formed from billi...,Department of Biology.,Cell identity and wiring specificity in the Dr...
2,gh439vr9294,Because of their capability of assembling hier...,Department of Chemistry,Controlled surface grafting of poly (Gamma-ben...
3,qv108wv0750,L-type voltage-gated calcium channels (LTCs) p...,Neurosciences Program.,From calcium channels to autism
4,cr369nn5134,"During cell division, chromosome segregation m...",Department of Biochemistry,Chemical inhibitor studies of polo-like kinase...


The custom [spaCy][SPACY] Named Entity Recognition (NER) pipeline was restricted
to the previously identified Biology-related FAST subject headings. 

[SPACY]: https://spacy.io/
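
To give a feel for what restricting suggestions to a fixed list of headings means, here is a deliberately simplified stand-in written in plain Python. It is not the spaCy pipeline used by the app: it performs only a case-insensitive phrase lookup, the `find_headings` helper is hypothetical, and the abstract and the short heading list are invented for the example.

```python
import re
from typing import List


def find_headings(abstract: str, headings: List[str]) -> List[str]:
    """Return the headings whose label occurs as a whole phrase in the abstract."""
    found = []
    for label in headings:
        pattern = r"\b" + re.escape(label) + r"\b"
        if re.search(pattern, abstract, flags=re.IGNORECASE):
            found.append(label)
    return found


# Illustrative labels only (assumption): the real list held around 1,700 headings.
headings = ["Gene expression", "Calcium channels", "Evolution"]
abstract = (
    "We study how L-type voltage-gated calcium channels "
    "regulate gene expression in neurons."
)
suggested = find_headings(abstract, headings)
```

Here `suggested` contains only labels from the restricted list, which mirrors the constraint placed on the NER pipeline: the cataloger is never shown a heading outside the Biology-related subset.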

## ETD Clustering 
Inspired by Andromeda Yelton's Hamlet project (see her recent [blog post](https://andromedayelton.com/2020/12/11/though-these-be-matrices-yet-there-is-method-in-them/)), we decided to try more general K-Means clustering of the ETDs based on their abstracts. We built the dataset with a process similar to the one used for the previous app, then normalized each abstract by removing stopwords and lowercasing the text. We used a BERT model to create an embedding for each abstract and built a Streamlit visualization app that lets catalogers and other interested users vary the number of clusters.

The app is hosted on Heroku at https://etd-abstract-similarity.herokuapp.com/. The source code for this app (along with the data) is available on GitHub at https://github.com/sul-dlss-labs/etd-abstract-similarity. 
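
The two steps of the clustering workflow, normalizing abstracts and grouping vectors with K-Means, can be sketched with the standard library alone. Everything here is an assumption made for illustration: the stopword set is a tiny hand-picked sample, the 2D points stand in for BERT embeddings, and the K-Means loop uses the first `k` points as starting centroids to stay deterministic. In practice a library implementation would be the usual choice.

```python
import re
from typing import List, Sequence

# Tiny illustrative stopword set (assumption), not the list used by the project.
STOPWORDS = {"a", "an", "and", "are", "i", "in", "is", "of", "the", "this", "to", "was", "where"}


def normalize_abstract(abstract: str) -> str:
    """Lowercase the abstract, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", abstract.lower())
    return " ".join(token for token in tokens if token not in STOPWORDS)


def kmeans(points: Sequence[Sequence[float]], k: int, iterations: int = 10) -> List[int]:
    """Assign each point to one of k clusters using plain Lloyd iterations."""
    centroids = [list(point) for point in points[:k]]  # deterministic start
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: pick the nearest centroid by squared Euclidean distance.
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


cleaned = normalize_abstract("Metaverses are virtual worlds where users create entire worlds.")

# Toy 2D vectors standing in for abstract embeddings.
toy_embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
clusters = kmeans(toy_embeddings, k=2)
```

Changing `k` changes how finely the abstracts are partitioned, which is the parameter the visualization app exposes to its users.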

In [11]:
etd_similarity = pd.read_pickle("/Users/jpnelson/2020/sul-dlss/etd-abstract-similarity/data/abstracts.pkl")
etd_similarity.head()

Unnamed: 0,druids,abstracts,abstracts_cleaned,departments,title,area
0,yq942qc7340,Metaverses are virtual worlds where users crea...,metaverses virtual worlds users create entire ...,Department of Computer Science.,Simplification algorithms for large virtual wo...,Computer Science
1,yx753dx0216,"In this dissertation, I use two novel test sco...",dissertation use two novel test score data se...,Graduate School of Education.,Gender disparities in U.S. educational achieve...,Graduate School of Business
2,yd874rr2274,A 20-minute documentary film was created to ac...,minute documentary film created accelerate ...,Department of Geological and Environmental Sci...,"matched pair, cluster randomized, controlled t...","Earth, Energy and Environmental Sciences"
3,sh260yn9550,The usage of hydrogen as an alternative energy...,usage hydrogen alternative energy carrier beco...,Department of Bioengineering.,Improving functions of redox proteins for hydr...,Bioengineering
4,nk877ng0918,High-order methods in Computational Fluid Dyna...,high order methods computational fluid dynamic...,Department of Aeronautics and Astronautics.,analysis of stability of the flux reconstructi...,Aeronautics & Astronautics


## Challenges
- Small data set of around 7,000 ETDs, not uniformly distributed among the
  different Stanford departments and divisions.
- Large number of FAST Subject Headings
- Lack of GPUs and other computing resources