The accompanying code and data for the Springer 2017 publication "What's missing in geographical parsing?" in Language Resources and Evaluation.

What's Missing In Geoparsing?

"Science is a wonderful thing if one does not have to earn one's living at it." -- Albert Einstein


Thanks for stopping by! In this repository, you will find the accompanying code and data for the publication "What's missing in geographical parsing?" in the journal Language Resources and Evaluation. In the unlikely case that any files are missing, please track me down and I'll upload them 👍

What's included

  1. data - The output of all systems on both datasets (2 * 5 files) plus the gold standard (2 files).
  2. WikToR(SciPaper).xml - The original WikToR dataset as described and used in the paper.
  3. lgl.xml - The LGL dataset, which is also used for evaluation.
  4. Essential experiment files (plus supporting scripts).

How to replicate

You should have some basic Python libraries such as NumPy, NLTK and Matplotlib (if you want graphics) to start with.

  • is the main python script for running the experiments (requires the script)
  • Please install GeoPy to calculate the distances between coordinates.
  • Also install Wikipedia for Python, a nice API wrapper 👍
  • Scroll down to the end of the file to see example usage; I included all necessary instructions and comments.
  • Enjoy!
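
To give a feel for the distance step mentioned above, here is a minimal sketch of a great-circle distance between two (latitude, longitude) pairs. It is a standard-library haversine written for illustration, not the code the experiments use; the function name `haversine_km` and the radius constant are assumptions of this sketch. With GeoPy installed, `geopy.distance.great_circle(a, b).km` should give a comparable figure.

```python
import math

# Mean Earth radius in kilometres (an assumption of this sketch).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = (math.radians(v) for v in a)
    lat2, lon2 = (math.radians(v) for v in b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: h is the squared half-chord length on a unit sphere.
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


if __name__ == "__main__":
    predicted = (52.2053, 0.1218)   # Cambridge, UK (approximate)
    gold = (42.3736, -71.1097)      # Cambridge, MA (approximate)
    print(round(haversine_km(predicted, gold), 1), "km")
```

A spherical model like this is usually close enough to compare a predicted coordinate against a gold one; an ellipsoidal (geodesic) distance is slightly more accurate.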

How to (re)create and modify WikToR

The dataset (WikToR) can be created (and unit tested) from scratch, extended, reduced, with more or fewer sentences added, etc. If you wish to do that, great! Here's what you need:

  • The file is the python script used to (re)generate and unit test WikToR.
  • Download the allCountries.txt data dump from GeoNames and save in the same directory as the script.
  • Please sign up for a GeoNames account to get a USERNAME, which you will need to fill in on line 42 of the script to ensure the API query works.
  • The first half of is for CORPUS CREATION, the second half is for CORPUS TESTING.
  • Enjoy!
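
If you want to inspect the GeoNames dump before regenerating the corpus, the following is a rough sketch of reading allCountries.txt. It assumes the tab-separated, 19-column layout documented in the GeoNames readme; the helper name `iter_geonames` and the sample row (a made-up place, "Exampleville") are hypothetical and are not taken from the repository's script.

```python
# Column order of the GeoNames "geoname" table, as listed in the GeoNames readme.
COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
]


def iter_geonames(lines):
    """Yield one dict per well-formed row of a GeoNames dump."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip rows that do not match the expected layout
        record = dict(zip(COLUMNS, fields))
        record["latitude"] = float(record["latitude"])
        record["longitude"] = float(record["longitude"])
        record["population"] = int(record["population"] or 0)
        yield record


# A made-up row for illustration only; real rows come from allCountries.txt.
SAMPLE_ROW = "\t".join([
    "1", "Exampleville", "Exampleville", "", "10.5", "-20.25", "P", "PPL",
    "XX", "", "", "", "", "", "1234", "", "15", "Etc/UTC", "2017-01-01",
])

if __name__ == "__main__":
    for place in iter_geonames([SAMPLE_ROW + "\n"]):
        print(place["name"], place["latitude"], place["longitude"])
    # For the real dump, pass an open file instead:
    # with open("allCountries.txt", encoding="utf-8") as f:
    #     for place in iter_geonames(f): ...
```

Reading the file lazily, one line at a time, keeps memory use low, which helps because the full dump is large.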

"The science of today is the technology of tomorrow." -- Edward Teller