linkages from non-patent literature (NPL) references from USPTO patents since 1947 to academic papers since 1800 using Microsoft Academic Graph
Type Name Latest commit message Commit time
README.md splityear fix Jul 12, 2019
READMEnplprep initial commit for 5/31 release Jun 19, 2019
buildmagdata.do initial commit for 5/31 release Jun 19, 2019
buildmagyearauthorfreq.sh initial commit for 5/31 release Jun 19, 2019
buildsplitregex_1799_lev.pl initial commit for 5/31 release Jun 19, 2019
buildsplitregex_byyear_lev.pl initial commit for 5/31 release Jun 19, 2019
buildtitleregex_1799_lev.pl initial commit for 5/31 release Jun 19, 2019
buildtitleregex_byyear_lev.pl initial commit for 5/31 release Jun 19, 2019
combinematches_mag.sh initial commit for 5/31 release Jun 19, 2019
commonwords added input files needed for scoring Jun 25, 2019
createsubsets_tsv_nochopNOPAT.sh removing extraneous stuff from the preprocesing Jun 25, 2019
dumpmagfieldsfornpl.do initial commit for 5/31 release Jun 19, 2019
exportmatches-mag.do initial commit for 5/31 release Jun 19, 2019
findbest_match.pl initial commit for 5/31 release Jun 19, 2019
fixauthornames.sh initial commit for 5/31 release Jun 19, 2019
fixnamesfornplmatch.sh initial commit for 5/31 release Jun 19, 2019
journalabbrevs-extended.tsv added input files needed for scoring Jun 25, 2019
journalabbrevs.tsv added input files needed for scoring Jun 25, 2019
nonsge_soundexmatchalltitlespieces.sh initial commit for 5/31 release Jun 19, 2019
noyearnpl.pl initial commit for 5/31 release Jun 19, 2019
npl.1926.1975 initial commits Jun 25, 2019
ocrnpldash.pl initial commit for 5/31 release Jun 19, 2019
ocrtrim.sh initial commit for 5/31 release Jun 19, 2019
prepinputs.sh initial commit for 5/31 release Jun 19, 2019
prepnpls.sh initial commit for 5/31 release Jun 19, 2019
probablyonlywords.txt added input files needed for scoring Jun 25, 2019
process_lastnames-justauthoridname.pl initial commits Jun 25, 2019
score_matches.pl initial commit for 5/31 release Jun 19, 2019
score_matches_mag.sh initial commit for 5/31 release Jun 19, 2019
screen_npljunk.pl initial commit for 5/31 release Jun 19, 2019
set_sge_lev_mag_splitcode.sh initial commit for 5/31 release Jun 19, 2019
set_sge_lev_mag_splittitle.sh initial commit for 5/31 release Jun 19, 2019
sge_buildsplitregex_lev_mag.sh initial commit for 5/31 release Jun 19, 2019
sge_buildtitleregex_magLEV.sh initial commit for 5/31 release Jun 19, 2019
sge_collectmatches_mag.sh initial commit for 5/31 release Jun 19, 2019
sge_runregex_lev_mag.sh initial commit for 5/31 release Jun 19, 2019
sge_soundexmatchalltitlespieces.sh initial commit for 5/31 release Jun 19, 2019
somewhatcommonsurnames.csv added input files needed for scoring Jun 25, 2019
sort_scored_mag.sh initial commit for 5/31 release Jun 19, 2019
splitword.pl initial commit for 5/31 release Jun 19, 2019
splitword.sh initial commit for 5/31 release Jun 19, 2019
splityear.pl initial commit for 5/31 release Jun 19, 2019
splityear.sh initial commit for 5/31 release Jun 19, 2019
terracemag.sh initial commit for 5/31 release Jun 19, 2019
terracemagtitles.sh initial commit for 5/31 release Jun 19, 2019
terracenpl.sh initial commit for 5/31 release Jun 19, 2019
terracenplsnotmatchedwtitles.sh initial commit for 5/31 release Jun 19, 2019
translategreekletters.sh initial commit for 5/31 release Jun 19, 2019
verycommonsurnames.csv added input files needed for scoring Jun 25, 2019

README.md

The code necessary to replicate Marx/Fuegi 2019 is contained in this directory. It operates on, and assumes the presence of, a set of files from the Microsoft Academic Graph (MAG) and USPTO non-patent literature (NPL) references, described below.

DISCLAIMERS

The code is unsupported and largely undocumented. It is provided primarily for those interested in understanding how the NPL linkages to MAG were accomplished. It is executable only in a Sun Grid Engine (or similar) Unix environment with Stata installed, as well as several Stata packages, including ftools and gtools, and the Perl module Text::LevenshteinXS. It assumes the directory structure described below and contains hardcoded, fully-qualified pathnames. You will also need at least 5 terabytes of disk space, perhaps as much as 10.
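
As a rough aid, the prerequisites listed above can be probed before starting. This is a minimal sketch, not part of the repository: it assumes Grid Engine is detectable through its `qsub` command and that Stata is installed under the executable name `stata` (sites often use another name such as `stata-mp`). The function names are hypothetical, and the Stata packages are not checked.

```shell
#!/bin/sh
# Sketch: report which prerequisites named in the disclaimer are visible here.
# Assumptions (not from the README): Grid Engine is detected via `qsub`, and
# Stata via an executable called `stata`. Adjust both names for your site.

have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

have_perl_module() {
    # `perl -MModule -e 1` exits non-zero if the module cannot be loaded.
    perl -M"$1" -e 1 >/dev/null 2>&1
}

report_prereqs() {
    target_dir=${1:-.}
    for cmd in qsub perl stata; do
        if have_cmd "$cmd"; then
            echo "found    $cmd"
        else
            echo "missing  $cmd"
        fi
    done
    if have_perl_module Text::LevenshteinXS; then
        echo "found    Text::LevenshteinXS"
    else
        echo "missing  Text::LevenshteinXS"
    fi
    # Column 4 of POSIX `df -Pk` output is the available space in 1024-byte blocks.
    avail_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')
    echo "free KB under $target_dir: $avail_kb (5 TB is about 5368709120 KB)"
}

report_prereqs .
```

Pass the intended data directory to `report_prereqs` instead of `.` to check free space on the filesystem that will actually hold the files.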

There are four general steps in executing the matches: First, preparing the MAG data. Second, preparing the NPL data. Third, generating a first-pass set of "loose" matches. Fourth, scoring those "loose" matches and picking the best match for each NPL. Each of these major steps includes a number of sub-steps; there is no "master" script to run the process from beginning to end.

DIRECTORY STRUCTURE

Many of the programs assume the prefix /project/nb/marxnsf1/dropbox/. This can be replaced by another prefix, which should be a fully-qualified pathname rather than a relative reference (sorry, no environment variable is set to make substitution easy). Beneath that directory, the necessary structure is:

  • mag
  • mag/code
  • mag/dta
  • mag/txt
  • nplmatch
  • nplmatch/inputs
  • nplmatch/inputs/mag
  • nplmatch/inputs/mag/magbyyear
  • nplmatch/inputs/npl
  • nplmatch/inputs/npl/nplbyrefyear
  • nplmatch/inputs/journalabbrev
  • nplmatch/splityear
  • nplmatch/splitword
  • nplmatch/splittitle
  • nplmatch/splittitle/year_regex_scripts_mag
  • nplmatch/splittitle/year_regex_output_mag
  • nplmatch/splitcode
  • nplmatch/splitcode/year_regex_scripts_mag
  • nplmatch/splitcode/year_regex_output_mag
  • nplmatch/process_matches
  • nplmatch/process_matches/peryearuniqmatches
  • nplmatch/process_matches/peryearuniqmatches/mag
  • nplmatch/process_matches/pieces
  • nplmatch/sort_scored_matches
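
The tree above can be created in one pass, and the hardcoded prefix rewritten in local copies of the scripts. The following is a sketch under stated assumptions rather than anything shipped with the repository: the function names `make_tree` and `replace_prefix` are hypothetical, and the rewrite assumes the new prefix contains no `|`, `&`, or backslash characters.

```shell
#!/bin/sh
# Sketch: build the directory layout listed above and swap the hardcoded prefix.
# Only leaf directories are listed; `mkdir -p` creates the parents.

OLD_PREFIX=/project/nb/marxnsf1/dropbox

make_tree() {  # make_tree NEW_PREFIX
    prefix=$1
    while IFS= read -r dir; do
        if [ -n "$dir" ]; then
            mkdir -p "$prefix/$dir" || return 1
        fi
    done <<'EOF'
mag/code
mag/dta
mag/txt
nplmatch/inputs/mag/magbyyear
nplmatch/inputs/npl/nplbyrefyear
nplmatch/inputs/journalabbrev
nplmatch/splityear
nplmatch/splitword
nplmatch/splittitle/year_regex_scripts_mag
nplmatch/splittitle/year_regex_output_mag
nplmatch/splitcode/year_regex_scripts_mag
nplmatch/splitcode/year_regex_output_mag
nplmatch/process_matches/peryearuniqmatches/mag
nplmatch/process_matches/pieces
nplmatch/sort_scored_matches
EOF
}

replace_prefix() {  # replace_prefix NEW_PREFIX FILE...
    new=$1
    shift
    for f in "$@"; do
        # Skip files that do not mention the old prefix at all.
        grep -qF "$OLD_PREFIX" "$f" || continue
        tmp=$(mktemp) || return 1
        # Write back with `cat` so the script keeps its permissions.
        sed "s|$OLD_PREFIX|$new|g" "$f" > "$tmp" && cat "$tmp" > "$f"
        rm -f "$tmp"
    done
}
```

For example, `make_tree /data/me` followed by `replace_prefix /data/me *.sh *.pl *.do` run inside a copy of the code directory; review the resulting diff before running anything.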

PROGRAMS TO RUN

STEP 1: PREPARE MAG FILES

  1. download from MAG the following files into the mag/txt directory: Papers.txt, ConferenceSeries.txt, Journals.txt, Authors.txt, and Affiliations.txt. Instructions for accessing MAG are here. You will need to create an Azure account in order to download the files. You may be able to download a recent snapshot here.
  2. in the mag/txt directory, execute the script createsubsets_tsv_nochopNOPAT.sh to create a number of derivative files from the MAG originals in the mag/txt directory.
  3. in the mag/txt directory, run "cat papertitle.tsv | translategreekletters.sh > papertitle-transliteratedgreek.tsv" to write out all Greek characters in MAG titles as alphabetic text.
  4. in the mag/txt directory, run fixauthornames.sh to reverse the order of first and last names in the MAG data. This calls process_lastnames-justauthoridname.pl from mag/code to create authorname-fixed.txt and authorname-surfirst.txt in the mag/txt directory.
  5. execute the Stata script buildmagdata.do in the mag/code directory. This will read a number of files from mag/txt and output Stata-formatted versions of them in mag/dta.
  6. in the nplmatch/inputs/mag directory, execute the Stata script dumpmagfieldsfornpl.do. This will combine the individual MAG files from mag/dta and write them out in a single mergedmagfornpl.dta file in the nplmatch/inputs/mag directory as well as a tab-delimited mergedmagfornpl.tsv file.
  7. in the nplmatch/inputs/mag directory, run fixnamesfornplmatch.sh to lowercase the MAG files and swap the order of given names and surnames in the author field. This creates the file mergedmagfornpl-fixednames.tsv.
  8. in the nplmatch/inputs/mag directory, run terracemag.sh to split up the MAG papers by year into nplmatch/inputs/mag/magbyyear.
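
The by-year split in the last step can be pictured with a short sketch. It is not the contents of terracemag.sh: the function name `split_by_year`, the default year column (2), and the `YEAR.tsv` output naming are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: split a tab-delimited file into one file per publication year.
# Assumptions: the year sits in column 2 unless another column is given, and
# output files are named YEAR.tsv. Rows without a four-digit year are skipped.

split_by_year() {  # split_by_year INFILE OUTDIR [YEAR_COLUMN]
    infile=$1
    outdir=$2
    col=${3:-2}
    mkdir -p "$outdir" || return 1
    awk -F '\t' -v col="$col" -v outdir="$outdir" '
        $col ~ /^[0-9][0-9][0-9][0-9]$/ {
            file = outdir "/" $col ".tsv"
            # Append and close each time so hundreds of years do not exhaust
            # the open-file limit of older awk implementations.
            print >> file
            close(file)
        }
    ' "$infile"
}
```

Because the function appends, the output directory should be emptied before a rerun.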

STEP 2: PREPARE NPL FILES

  1. download and extract the "otherreference" file from Patentsview (www.patentsview.org/download; the exact URL of the file may change depending on the release date). Copy the extracted file otherreference.tsv to this directory.
  2. type the command "cut -f2,3 otherreference.tsv > npl.1976-present.tsv" to extract just the fields needed.
  3. in nplmatch/inputs, run the command "cat npl.1926.1975.tsv | ocrtrim.sh | ocrnpldash.pl > npl.1926.1975-patnplOCRautofix.tsv" to clean up OCRed NPLs prior to 1976.
  4. in nplmatch/inputs, run the command "cat npl.1926.1975-patnplOCRautofix.tsv npl.1976-present.tsv | tr '[:upper:]' '[:lower:]' | perl screen_npljunk.pl > npl.1926.2018-lowercaseOCRautofixnononsci.tsv" to combine the NPL files, lowercase them, and strip out the bulk of references that point not to papers but to other things such as websites, product brochures, etc.
  5. in nplmatch/inputs, run terracenpl.sh to split the combined NPLs into individual years as well as a "fake" year, 1799, for NPLs with no year in them. All of these files are placed in nplmatch/inputs/nplbyrefyear.
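
The idea of a "fake" year for undated NPLs can be sketched as follows. How terracenpl.sh actually detects years is not documented here, so the heuristic below (take the first standalone four-digit number between 1800 and 2018, else 1799) is purely an assumption, as are the function name `tag_ref_year` and the premise that the NPL text is in column 2.

```shell
#!/bin/sh
# Sketch: prefix each NPL line with a reference year, using 1799 when none is found.
# Assumptions: input is "patent<TAB>npl text"; a year is a four-digit number in
# 1800..2018 that is not part of a longer digit run. Only the first hit is used.

tag_ref_year() {  # reads TSV on stdin, writes "year<TAB>original line" on stdout
    awk -F '\t' -v OFS='\t' '
    {
        year = 1799
        # Pad with spaces so a year at either end still has non-digit neighbours.
        text = " " $2 " "
        while (match(text, /[^0-9][0-9][0-9][0-9][0-9][^0-9]/)) {
            cand = substr(text, RSTART + 1, 4) + 0
            if (cand >= 1800 && cand <= 2018) {
                year = cand
                break
            }
            # Resume at the trailing non-digit so it can open the next match.
            text = substr(text, RSTART + 5)
        }
        print year, $0
    }'
}
```

A reference citing several years would need a policy for choosing among them; this sketch simply keeps the first.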

STEP 3: DO THE "LOOSE" FIRST-PASS MATCHING

  1. copy the files journalabbrevs.tsv and journalabbrevs-extended.tsv to nplmatch/inputs/journalabbrev.
  2. copy the files somewhatcommonsurnames.csv, verycommonsurnames.csv, and probablyonlywords.txt to nplmatch/process_matches.
  3. in nplmatch/splitword, run splitword.sh to submit an array job that runs splitword.pl for each year from 1800-2018. This creates a hash of all words in the NPLs in subdirectories of nplmatch/splitword such as '1980/a/c/achieve' containing all 1980 NPLs that include the word "achieve".
  4. in nplmatch/splittitle, run "buildtitleregex_1799_lev.pl mag" to generate rules for NPLs without years in nplmatch/splittitle/year_regex_scripts_mag. These rules are based on primary author surname in MAG and finding either the longest or second-longest word in the title.
  5. in nplmatch/splittitle, run sge_buildtitleregex_magLEV.sh to generate rules for NPLs with years in nplmatch/splittitle/year_regex_scripts_mag. These rules are based on primary author surname in MAG and finding either the longest or second-longest word in the title.
  6. in nplmatch/splittitle, run set_sge_lev_mag_splittitle.sh to simultaneously apply the generated rules against the NPL data.
  7. in nplmatch/splityear, run splityear.sh to submit an array job that runs splityear.pl for each year from 1800-2018. This creates a hash of all numbers in the NPLs in subdirectories of nplmatch/splityear such as '1963/1/15330' containing all 1963 NPLs that include the number 15330.
  8. in nplmatch/splitcode, run "buildsplitregex_1799_lev.pl mag" to generate matching rules without titles for NPLs missing years. These rules are based on primary author surname in MAG and finding the first page of the article (or volume if there is no first page in MAG).
  9. in nplmatch/splitcode, run sge_buildsplitregex_lev_mag.sh to generate matching rules without titles for NPLs with years. These rules are based on primary author surname in MAG and finding the first page of the article (or volume if there is no first page in MAG).
  10. in nplmatch/splitcode, run set_sge_lev_mag_splitcode.sh to launch thousands of simultaneous scripts to apply the generated rules against the NPL data.
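
The on-disk word "hash" used above (for example '1980/a/c/achieve') can be illustrated with a small sketch. This is not splitword.pl: the function names are hypothetical, and the choices to index only lowercase ASCII words of two or more letters and to process lines one at a time are assumptions made to keep the example short.

```shell
#!/bin/sh
# Sketch: index NPL lines by word into YEAR/first-letter/second-letter/WORD files.
# Assumptions: input is already lowercased; only a-z words of 2+ letters are indexed.

bucket_path() {  # bucket_path YEAR WORD -> YEAR/f/s/WORD
    year=$1
    word=$2
    first=$(printf '%s' "$word" | cut -c1)
    second=$(printf '%s' "$word" | cut -c2)
    printf '%s/%s/%s/%s\n' "$year" "$first" "$second" "$word"
}

index_npls() {  # index_npls YEAR INFILE OUTROOT
    year=$1
    infile=$2
    outroot=$3
    while IFS= read -r line; do
        # Turn every run of non-letters into a newline, then keep unique words
        # so a line is written to each bucket at most once.
        for word in $(printf '%s\n' "$line" | tr -cs 'a-z' '\n' | sort -u); do
            [ "${#word}" -ge 2 ] || continue
            path="$outroot/$(bucket_path "$year" "$word")"
            mkdir -p "$(dirname "$path")"
            printf '%s\n' "$line" >> "$path"
        done
    done < "$infile"
}
```

With such a layout, finding every NPL from a given year that contains a given word is a single file read, which is what makes applying many per-paper rules in parallel practical.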

STEP 4: GATHER THE "LOOSE" MATCHES AND SCORE THEM

  1. in nplmatch/process_matches, run sge_collectmatches_mag.sh to gather the output of the loose-match processes (both title-based and non-title-based) into nplmatch/process_matches/peryearuniqmatches/mag/, one file per year (and one file for missing years). This process also sorts the matches and removes duplicates.
  2. in nplmatch/process_matches, run "cat peryearuniqmatches/mag/* > bothmatchestoscore_mag.tsv" to combine the matches for each year into a single file.
  3. run score_matches_mag.sh, which splits bothmatchestoscore_mag.tsv into hundreds of smaller, parallelizable pieces in nplmatch/process_matches/pieces, scores the loose matches, and retains all matches for every NPL with confidence score > 0 in scoredmag.tsv.
  4. copy or move scoredmag.tsv from nplmatch/process_matches to nplmatch/sort_scored_matches.
  5. in nplmatch/sort_scored_matches, run sort_scored_mag.sh to pick the best match for each NPL from those with confidence score > 0, creating the output file scoredmag_bestonly.tsv (and along the way, scoredmag_sorted.tsv, though this file can be ignored).
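
Picking the best match per NPL amounts to a sort followed by keeping the first row for each NPL. The sketch below is not sort_scored_mag.sh, and the column layout of scoredmag.tsv is not documented here: it assumes, hypothetically, that column 1 is the confidence score and column 2 identifies the NPL.

```shell
#!/bin/sh
# Sketch: keep the highest-scoring match for each NPL.
# Assumed (hypothetical) input columns: score<TAB>npl_id<TAB>paper_id.

best_match_per_npl() {  # reads TSV on stdin, writes one row per NPL on stdout
    tab=$(printf '\t')
    # Group by NPL id, highest score first; LC_ALL=C keeps "." as the decimal point.
    LC_ALL=C sort -t "$tab" -k2,2 -k1,1nr |
        awk -F '\t' '$1 > 0 && !seen[$2]++'
}
```

Ties are broken by sort order here; a real run may want an explicit tie-breaking rule.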