The code necessary to replicate Marx/Fuegi 2019 is contained in this directory. It operates on, and assumes the presence of, a set of files from the Microsoft Academic Graph (MAG) and USPTO non-patent literature (NPL) references, described below.
The code is unsupported and largely undocumented. It is provided primarily for those interested in understanding how the NPL linkages to MAG were accomplished. It is executable only in a Sun Grid Engine (or similar) Unix environment with Stata installed, along with several packages including ftools and gtools and the Perl module Text::LevenshteinXS. It assumes the directory structure described below and contains hardcoded, fully-qualified pathnames. You will also need at least 5 terabytes of disk space, perhaps as much as 10.
There are four general steps in executing the matches: First, preparing the MAG data. Second, preparing the NPL data. Third, generating a first-pass set of "loose" matches. Fourth, scoring those "loose" matches and picking the best match for each NPL. Each of these major steps includes a number of sub-steps; there is no "master" script to run the process from beginning to end.
Many of the programs assume the prefix /project/nb/marxnsf1/dropbox/. This can be replaced by another prefix, but it must be a fully-qualified pathname, not a relative reference (no environment variable is set to make substitution easy, sorry). Beneath that directory, the necessary structure is:
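As a rough sketch, the following creates the subdirectories that the steps in this README refer to, under a prefix of your choice. The list is assembled from those steps only and may be incomplete. SUBDIRS and make_layout are hypothetical names, not part of the distributed code.

```python
from pathlib import Path

# Subdirectories mentioned in the steps of this README (assumed, possibly incomplete).
SUBDIRS = [
    "mag/txt",
    "mag/code",
    "mag/dta",
    "nplmatch/inputs/mag/magbyyear",
    "nplmatch/inputs/nplbyrefyear",
    "nplmatch/inputs/journalabbrev",
    "nplmatch/process_matches/peryearuniqmatches/mag",
    "nplmatch/process_matches/pieces",
    "nplmatch/splitword",
    "nplmatch/splittitle/year_regex_scripts_mag",
    "nplmatch/splityear",
    "nplmatch/splitcode",
    "nplmatch/sort_scored_matches",
]


def make_layout(prefix):
    """Create every entry of SUBDIRS under an absolute prefix; return the paths."""
    root = Path(prefix)
    if not root.is_absolute():
        # The programs expect fully-qualified pathnames, so refuse relative ones.
        raise ValueError("prefix must be a fully-qualified pathname")
    created = []
    for sub in SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)  # also creates parent directories
        created.append(path)
    return created
```

Calling make_layout("/some/absolute/prefix") only creates empty directories. The hardcoded pathnames inside the scripts still have to be edited by hand.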
PROGRAMS TO RUN
STEP 1: PREPARE MAG FILES
- download from MAG the following files into the mag/txt directory: Papers.txt, ConferenceSeries.txt, Journals.txt, Authors.txt, and Affiliations.txt. Instructions for accessing MAG are here. You will need to create an Azure account in order to download the files. You may be able to download a recent snapshot here.
- in the mag/txt directory, execute the script createsubsets_tsv_nochopNOPAT.sh to create a number of derivative files from the MAG originals in the mag/txt directory.
- in the mag/txt directory, run "cat papertitle.tsv | translategreekletters.sh > papertitle-transliteratedgreek.tsv" to write out all Greek characters in MAG titles as alphabetic text.
- in the mag/txt directory, run fixauthornames.sh to reverse the order of first and last names in the MAG data. This calls process_lastnames-justauthoridname.pl from mag/code to create authorname-fixed.txt and authorname-surfirst.txt in the mag/txt directory.
- execute the Stata script buildmagdata.do in the mag/code directory. This will read a number of files from mag/txt and output Stata-formatted versions of them in mag/dta.
- in the nplmatch/inputs/mag directory, execute the Stata script dumpmagfieldsfornpl.do. This will combine the individual MAG files from mag/dta and write them out in a single mergedmagfornpl.dta file in the nplmatch/inputs/mag directory as well as a tab-delimited mergedmagfornpl.tsv file.
- in the nplmatch/inputs/mag directory, run fixnamesfornplmatch.sh to lowercase the MAG files and swap the order of given names and surnames in the author field. This creates a file mergedmagfornpl-fixednames.tsv.
- in the nplmatch/inputs/mag directory, run terracemag.sh to split up the MAG papers by year into nplmatch/inputs/mag/magbyyear.
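The last step above splits one large tab-delimited file into one file per year. A minimal sketch of that idea follows. It is not the code in terracemag.sh: the position of the year column, the output file names, and the "unknown" fallback are all assumptions, and terrace_by_year is a hypothetical name.

```python
from pathlib import Path


def terrace_by_year(tsv_path, out_dir, year_col, fallback="unknown"):
    """Stream a TSV and append each row to <out_dir>/<year>.tsv.

    year_col is the zero-based index of the year field (an assumption;
    check the actual column order of your file). Rows without a four-digit
    year go to <fallback>.tsv. Returns a dict of row counts per output file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}  # one open file per year; roughly 220 years fits most ulimits
    counts = {}
    try:
        with open(tsv_path, encoding="utf-8", newline="") as src:
            for line in src:
                fields = line.rstrip("\n").split("\t")
                year = fields[year_col].strip() if year_col < len(fields) else ""
                key = year if (year.isdigit() and len(year) == 4) else fallback
                if key not in handles:
                    handles[key] = open(
                        out_dir / f"{key}.tsv", "w", encoding="utf-8", newline=""
                    )
                handles[key].write(line if line.endswith("\n") else line + "\n")
                counts[key] = counts.get(key, 0) + 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts
```

Streaming line by line keeps memory use flat, which matters for files of this size.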
STEP 2: PREPARE NPL FILES
- download and extract the "otherreference" file from Patentsview (www.patentsview.org/download; depending on the release date, the exact URL to the file may change). Copy the extracted file otherreference.tsv to this directory.
- type the command "cut -f2,3 otherreference.tsv > npl.1976-present.tsv" to extract just the fields needed.
- in nplmatch/inputs, run the command "cat npl.1926.1975.tsv | ocrtrim.sh | ocrnpldash.pl > npl.1926.1975-patnplOCRautofix.tsv" to clean up OCRed NPLs prior to 1976.
- in nplmatch/inputs, run the command "cat npl.1926.1975-patnplOCRautofix.tsv npl.1976-present.tsv | tr '[:upper:]' '[:lower:]' | perl screen_npljunk.pl > npl.1926.2018-lowercaseOCRautofixnononsci.tsv" to combine the NPL files, lowercase them, and strip out the bulk of references that point not to papers but to other things such as websites and product brochures.
- in nplmatch/inputs, run terracenpl.sh to split the combined NPLs into individual years as well as a "fake" year, 1799, for NPLs with no year in them. All of these files are placed in nplmatch/inputs/nplbyrefyear.
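How terracenpl.sh decides which year an NPL belongs to is not documented here. The sketch below shows one plausible way to do it, assuming a year is any standalone four-digit number between 1800 and 2018 and that the first such number wins. Both rules, and the names candidate_years and year_bucket, are assumptions made for illustration.

```python
import re

FAKE_YEAR = 1799  # the "fake" year this README uses for NPLs with no year

# A four-digit number starting 18, 19, 200 or 201, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(1[89]\d\d|20[01]\d)(?!\d)")


def candidate_years(npl, first=1800, last=2018):
    """Return the distinct plausible years in an NPL string, in order of appearance."""
    years = []
    for match in YEAR_RE.finditer(npl):
        year = int(match.group(1))
        if first <= year <= last and year not in years:
            years.append(year)
    return years


def year_bucket(npl):
    """Pick the first plausible year, or FAKE_YEAR when none is found."""
    years = candidate_years(npl)
    return years[0] if years else FAKE_YEAR
```

A rule like this will misread page ranges such as "pp. 1900-1910" as years, which is one reason the later matching steps cannot rely on the year alone.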
STEP 3: DO THE "LOOSE" FIRST-PASS MATCHING
- copy the files journalabbrevs.tsv and journalabbrevs-extended.tsv to the nplmatch/inputs/journalabbrev directory.
- copy the files commonsurnames.csv, verycommonsurnames.csv, and probablyonlywords.txt to nplmatch/process_matches.
- in nplmatch/splitword, run splitword.sh to submit an array job that runs splitword.pl for each year from 1800-2018. This creates a hash of all words in the NPLs in subdirectories of nplmatch/splitword such as '1980/a/c/achieve', containing all 1980 NPLs that include the word "achieve".
- in nplmatch/splittitle, run "buildtitleregex_1799_lev.pl mag" to generate rules for NPLs without years in nplmatch/splittitle/year_regex_scripts_mag. These rules are based on primary author surname in MAG and finding either the longest or second-longest word in the title.
- in nplmatch/splittitle, run sge_buildtitleregex_magLEV.sh to generate rules for NPLs with years in nplmatch/splittitle/year_regex_scripts_mag. These rules are based on primary author surname in MAG and finding either the longest or second-longest word in the title.
- in nplmatch/splittitle, run set_sge_lev_mag_splittitle.sh to simultaneously apply the generated rules against the NPL data.
- in nplmatch/splityear, run splityear.sh to submit an array job that runs splityear.pl for each year from 1800-2018. This creates a hash of all numbers in the NPLs in subdirectories of nplmatch/splityear such as '1963/1/15330', containing all 1963 NPLs that include the number 15330.
- in nplmatch/splitcode, run "buildsplitregex_1799_lev.pl mag" to generate matching rules without titles for NPLs missing years. These rules are based on primary author surname in MAG and finding the first page of the article (or volume if there is no first page in MAG).
- in nplmatch/splitcode, run sge_buildsplitregex_lev_mag.sh to generate matching rules without titles for NPLs with years. These rules are based on primary author surname in MAG and finding the first page of the article (or volume if there is no first page in MAG).
- in nplmatch/splitcode, run set_sge_lev_mag_splitcode.sh to launch thousands of simultaneous scripts to apply the generated rules against the NPL data.
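To make the bucket layout and the title-based rules in this step more concrete, here is a small in-memory sketch. It mirrors the example paths '1980/a/c/achieve' and '1963/1/15330' and looks up NPLs through the two longest title words plus a fuzzy surname check. The tokenization, the distance threshold of 1, the tie-breaking among equally long words, and every function name are assumptions. The real pipeline uses Perl scripts and on-disk files.

```python
import re
from collections import defaultdict


def word_bucket(year, word):
    """Bucket path for a word, e.g. word_bucket(1980, 'achieve') -> '1980/a/c/achieve'."""
    return "/".join([str(year)] + list(word[:2]) + [word])


def number_bucket(year, number):
    """Bucket path for a number string, e.g. number_bucket(1963, '15330') -> '1963/1/15330'."""
    return "/".join([str(year), number[0], number])


def build_index(year, npls):
    """Map each bucket path to the set of NPL strings containing that token."""
    index = defaultdict(set)
    for npl in npls:
        text = npl.lower()
        for word in set(re.findall(r"[a-z]+", text)):
            index[word_bucket(year, word)].add(npl)
        for number in set(re.findall(r"\d+", text)):
            index[number_bucket(year, number)].add(npl)
    return index


def longest_title_words(title, n=2):
    """Return the n longest distinct words of a title (ties broken alphabetically)."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return sorted(words, key=lambda w: (-len(w), w))[:n]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def loose_matches(index, year, surname, title, max_dist=1):
    """NPLs that share a long title word and contain a token close to the surname."""
    hits = set()
    for word in longest_title_words(title):
        for npl in index.get(word_bucket(year, word), ()):
            tokens = re.findall(r"[a-z]+", npl.lower())
            if any(levenshtein(surname.lower(), t) <= max_dist for t in tokens):
                hits.add(npl)
    return hits
```

Looking up a bucket first and only then computing edit distances keeps the expensive comparison confined to a small candidate set, which is presumably why the NPLs are hashed by word and number before the rules are applied.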
STEP 4: GATHER THE "LOOSE" MATCHES AND SCORE THEM
- in nplmatch/process_matches, run sge_collectmatches_mag.sh to gather the output of the loose-match processes (both title-based and non-title-based) into nplmatch/process_matches/peryearuniqmatches/mag/, one file per year (and one file for missing years). This process also sorts the matches and removes duplicates.
- in nplmatch/process_matches, run "cat peryearuniqmatches/mag/* > bothmatchestoscore_mag.tsv" to combine the matches for each year into a single file.
- run score_matches_mag.sh, which splits bothmatchestoscore_mag.tsv into hundreds of smaller, parallelizable pieces in nplmatch/process_matches/pieces, scores the loose matches, and retains all matches for every NPL (with confidence score > 0) in scoredmag.tsv.
- copy or move scoredmag.tsv from nplmatch/process_matches to nplmatch/sort_scored_matches.
- in nplmatch/sort_scored_matches, run sort_scored_mag.sh to pick the best match for each NPL from those with confidence score > 0, creating the output file scoredmag_bestonly.tsv (and along the way, scoredmag_sorted.tsv, though this file can be ignored).
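The final selection can be pictured as keeping, for each NPL, the candidate with the highest positive confidence score. The sketch below assumes each scored row is a tab-separated line holding the NPL, a MAG paper identifier, and a numeric score, in that order. The actual column layout of scoredmag.tsv is not documented here, ties are resolved by keeping the first row seen, and parse_scored and best_matches are hypothetical names.

```python
def parse_scored(lines):
    """Yield (npl, mag_id, score) from tab-separated lines (assumed column order)."""
    for line in lines:
        npl, mag_id, score = line.rstrip("\n").split("\t")[:3]
        yield npl, mag_id, float(score)


def best_matches(rows):
    """Return {npl: (mag_id, score)} with the highest-scoring match per NPL.

    Rows with a score of 0 or below are ignored. On a tie, the row seen
    first is kept (an arbitrary choice made for this sketch).
    """
    best = {}
    for npl, mag_id, score in rows:
        if score <= 0:
            continue
        if npl not in best or score > best[npl][1]:
            best[npl] = (mag_id, score)
    return best
```

For example, best_matches(parse_scored(open("scoredmag.tsv", encoding="utf-8"))) would build the mapping in one pass, provided the file really has the assumed three leading columns.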