Parse a list of names from raw text into a dictionary of unique authors. [for use with LIS authority control]
The lib-name-parser was originally intended for use with the dublin-core-text-parser. It will be used to process a list of unformatted strings containing names (e.g. "Dr. Douglas Raymond Murrow III") and create a uniquely identified and combined object used for matching.
This will (hopefully) enable the linking of similar names across multiple occurrences ("Douglas Murrow", "Douglas Ray Murrow", "Doug R. Murrow"), and potentially allow for the detection of typos or misspellings ("Daug R. Marrow"). With some extra work, it could also automate the creation of authority records between issues and within collections.
The program follows the lineage of Josh Fraser's original php-name-parser and Garve Hays's Java port, NameParser. Its naming scheme is meant to reflect this genealogy.
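As a rough illustration of the idea above, here is a minimal Python sketch that splits a raw name into parts and groups variants under a shared key. Everything in it is an assumption made for illustration: `ParsedName`, `parse_name`, `match_key`, `group_names`, and the tiny salutation/suffix lists are hypothetical and are not this library's actual API. The key (surname plus first initial) is just one simple choice of matching rule.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical, deliberately tiny lookup sets; a real parser needs far more.
SALUTATIONS = {"dr", "mr", "mrs", "ms", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii", "iv", "phd", "md"}


@dataclass
class ParsedName:
    salutation: str = ""
    first: str = ""
    middle: str = ""
    last: str = ""
    suffix: str = ""


def _key(token: str) -> str:
    """Normalize a token for lookup: drop periods and commas, lowercase."""
    return re.sub(r"[.,]", "", token).lower()


def parse_name(raw: str) -> ParsedName:
    """Split a raw name string into salutation, first, middle, last, suffix."""
    tokens = raw.replace(",", " ").split()
    parsed = ParsedName()
    # A leading token such as "Dr." is treated as a salutation.
    if tokens and _key(tokens[0]) in SALUTATIONS:
        parsed.salutation = tokens.pop(0)
    # Trailing tokens such as "III" or "PhD" are treated as suffixes.
    suffixes = []
    while tokens and _key(tokens[-1]) in SUFFIXES:
        suffixes.insert(0, tokens.pop())
    parsed.suffix = " ".join(suffixes)
    # Of what remains: first token, last token, everything between is middle.
    if tokens:
        parsed.first = tokens.pop(0)
    if tokens:
        parsed.last = tokens.pop()
    parsed.middle = " ".join(tokens)
    return parsed


def match_key(name: ParsedName) -> str:
    """Assumed matching rule: lowercase surname plus first initial."""
    return f"{name.last.lower()}|{name.first[:1].lower()}"


def group_names(raw_names):
    """Group raw name strings that share the same match key."""
    groups = defaultdict(list)
    for raw in raw_names:
        groups[match_key(parse_name(raw))].append(raw)
    return dict(groups)
```

With this rule, "Douglas Murrow", "Douglas Ray Murrow", and "Doug R. Murrow" all land under the same key, while a misspelled surname such as "Marrow" does not, which is why typo detection needs a separate, fuzzier comparison.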
Automate the complex, time-consuming, and often mind-numbing process of manual authority control in an extensible, manageable way.
Going through any periodical or other succession of similarly sourced items and logging metadata about its contributors, you often find small inconsistencies with how the names are displayed.
Whether this comes in the form of typos/OCR errors (Doug Murrow <=> Dog Murrow), progressive revelation (Doug Murrow => Douglas Murrow => Douglas R. Murrow), or title acquisition (Douglas R. Murrow => Dr. Douglas R. Murrow, PhD), these minor changes over time can complicate searching and collocation in catalogs, databases, etc.
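The first two kinds of variation can be sketched with Python's standard library. This is only an illustration under stated assumptions: `name_similarity` and `is_initial_or_prefix` are hypothetical helper names, and `difflib.SequenceMatcher` is one convenient similarity measure among many, not necessarily what this library uses. Title acquisition is a parsing concern (stripping "Dr." or "PhD") and is not covered here.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case, periods, and extra spaces."""
    def normalize(s: str) -> str:
        return " ".join(s.replace(".", "").lower().split())

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_initial_or_prefix(short: str, full: str) -> bool:
    """True if `short` could be an earlier, shorter form of `full`.

    Covers progressive revelation, e.g. "R." -> "Ray" or "Doug" -> "Douglas".
    """
    stem = short.rstrip(".").lower()
    return bool(stem) and full.lower().startswith(stem)
```

A typo pair like "Doug Murrow" and "Dog Murrow" scores close to 1, while unrelated names score much lower; where to draw the line between the two is a tuning decision left to the caller.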
This software is intended for use as an external library (software sense) to other applications, helping them reconcile these disparate references to the same author/individual by combining them under a single universal ID, object, and/or authority record (maybe doing this, we'll see) in a way that can remain controlled/monitored or be fully automated.
I intend to ensure that the program itself remains agnostic as to the balance between standardization (fixing things to make them match up) and provenance (retaining how they originally appeared), by providing the means to run it at varying levels of automation and with certain features turned off. Since this will be an open source project, users can of course make more advanced edits according to need.
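One way such adjustable automation could look is sketched below. This is purely hypothetical: `MatchOptions`, its fields, the default thresholds, and `decide` are invented for illustration and do not describe options the library actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchOptions:
    # Hypothetical settings; names and default values are assumptions.
    auto_merge_at: float = 0.95  # at or above this score, merge without review
    review_at: float = 0.80      # at or above this score, flag for a human
    detect_typos: bool = True    # when False, only exact matches are merged


def decide(score: float, options: MatchOptions) -> str:
    """Map a similarity score in [0, 1] to 'merge', 'review', or 'separate'."""
    if not options.detect_typos:
        # Provenance-first mode: never merge names that differ at all.
        return "merge" if score == 1.0 else "separate"
    if score >= options.auto_merge_at:
        return "merge"
    if score >= options.review_at:
        return "review"
    return "separate"
```

Raising `auto_merge_at` above 1.0 would send every candidate match to review, giving a fully monitored workflow, while lowering `review_at` to match `auto_merge_at` removes the manual step entirely.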
Name parsing is a very common operation in software development. I wanted to find (or create) a standard algorithm for doing this with LIS software, but found that @joshfraser had already created one in PHP and JavaScript (links below).
Looking into its derivatives, I first forked @gkhays's 'NameParser', but found that mine would be a much heftier implementation given the time/occurrences component, and that it might be wiser to create my own from scratch than to submit a gigantic alteration as a pull request.
This software, as mentioned, follows this authorial lineage from Josh to Garve to myself. The application of this algorithm to library authority control is, to my knowledge, a novel and meaningful contribution.
- 2010 article by Patrick McKenzie of Kalzumeus Software on 'Falsehoods Programmers Believe About Names'.
- academic paper entitled "Accuracy of simple, initials-based methods for author name disambiguation" by Stasa Milojevic.
- ...
- Josh Fraser's original 2009 article 'splitting names' describing the algorithm
- associated PHP and JS GitHub repositories.
- port by Garve Hays in Java that was very helpful
- short introductory write-up by the library at Florida International University (FIU)
- Wikipedia entry on authority control with a section specifically about authority records
- long but approachable 2007 article on 'the purpose of authority control'
- A presentation by Karen Calhoun from the Taxonomic Authority File (TAF) Workshop giving an overview of the history and agreement on authority control.
Note: I did come up with a very similar algorithm and parsing scheme independently of these sources, but given the clarity of their documentation, the prior existence of their implementations, and their usefulness to me, it would be dishonest to claim full credit for it.