RCALLSTRINGDIST: Call R's stringdist package from Stata using rcall
- 0.3.0 11jul2019:
  - adds options to clean strings before comparison (ignorecase, ascii, whitespace, punctuation)
  - moved handling from csv to .dta files (faster when merging with the original data) using the R package haven (which appears to handle diacritics better than readstata13)
  - small speed improvements
- 0.2.0 16apr2019:
  - adds several options: matrix (for one and two variables), duplicates, and sortwords
  - (0.2.3) significant increases in speed
- 0.1.0 15apr2019:
  - first version of the command
I'd like to thank the authors of both packages:
stringdist was written by Mark van der Loo, Jan van der Laan, R Core Team, Nick Logan, and Chris Muir.
rcall was written by E. F. Haghish.
- Install R directly or with RStudio for a graphical interface.
- Install this package using the github command by E. F. Haghish. This will also install dependencies automatically.
    net install github, from("https://haghish.github.io/github/") replace
    github install luispfonseca/stata-rcallstringdist
- Commands from gtools by Mauricio Caceres Bravo are used to speed up the command when available.
If R is installed on your machine, all these dependencies will be installed automatically when you follow the instructions above. The file dependency.do is executed automatically after the rcallstringdist package is installed. Make sure R is installed on your machine before you attempt to install these packages in Stata.
See the help file in Stata for details about each option.
    * Comparing two lists of strings
    clear
    input str30 nameA
    "Gates Bill"
    "Gates, Bill"
    "bill gates"
    "William H. Gates III"
    end
    input str30 nameB
    "Bill Gates"
    "Bill Gates"
    "Bill Gates"
    "William Henry Gates III"
    end
    compress

    ** Comparing two variables, row by row
    *** default method (osa), default arguments, default generated variable name
    rcallstringdist nameA nameB
    *** specific variable names
    rcallstringdist nameA nameB, gen(osa)
    rcallstringdist nameA nameB, method(cosine) q(3) gen(cosine)

    *** sometimes it's worth sorting words within each string.
    *** the first row will now be a perfect match
    rcallstringdist nameA nameB, gen(osa_sortw) sortwords

    *** it can also be worth cleaning up the strings before feeding them
    *** (e.g. lowercase, remove punctuation and diacritics)
    gen nameAclean = lower(nameA)
    gen nameBclean = lower(nameB)
    rcallstringdist nameAclean nameBclean, gen(osa_clean)
    rcallstringdist nameAclean nameBclean, gen(osa_clean_sortw) sortwords

    ** Comparing two variables, all possible combinations
    *** by calling the matrix option, we can compare all possible combinations
    *** of strings from one variable with the other variable
    *** be aware: this option will clear your current working dataset from memory
    *** see the following example
    clear
    input str30 nameA
    "Gates Bill"
    "Gates, Bill"
    "bill gates"
    "William H. Gates III"
    "Bill Gates"
    "Bill Gates"
    end
    input str30 nameB
    "Bill Gates"
    "William Henry Gates III"
    "Bill Gates"
    end
    compress
    save example_dataset

    *** each string of nameA will be compared with each string of nameB
    *** nameA has 5 unique strings, while nameB has 2
    *** 10 pairs will be compared
    rcallstringdist nameA nameB, matrix

    * Comparing one list of strings with itself, all possible combinations
    *** if only one variable is passed, compare all pairs of strings within
    *** we have 5 unique strings, 5x4/2=10 combinations
    use example_dataset, clear
    rcallstringdist nameA, matrix
    *** to keep all permutations (5x4=20), we can use the keepduplicates option
    use example_dataset, clear
    rcallstringdist nameA, matrix keepduplicates
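The examples above clean strings with lower() only, although the comment also mentions removing punctuation. As a minimal sketch of what fuller manual cleaning might look like, the block below uses only core Stata string functions (lower, subinstr, stritrim, strtrim). The variable name nameAclean follows the example above; the choice of which characters to strip is an assumption for illustration, and diacritics are not handled here. The changelog lists ignorecase, ascii, whitespace, and punctuation options since 0.3.0, which may make manual cleaning unnecessary; see the help file for their exact behavior.

```stata
* Sketch only: manual string cleaning with core Stata functions.
* The characters removed here (commas and periods) are an illustrative choice.
clear
input str30 nameA
"Gates,  Bill"
"bill gates"
end

* lowercase everything
gen nameAclean = lower(nameA)

* remove every comma and period (the "." argument means "all occurrences")
replace nameAclean = subinstr(nameAclean, ",", "", .)
replace nameAclean = subinstr(nameAclean, ".", "", .)

* collapse runs of internal blanks, then trim leading and trailing blanks
replace nameAclean = strtrim(stritrim(nameAclean))

list nameA nameAclean
```

The cleaned variable can then be passed to rcallstringdist in place of the raw one, as in the nameAclean and nameBclean calls above.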
London Business School
lfonseca london edu