# Easy Automated Geocoding of Text with CLIFF-up

**By [Andy Halterman](http://www.andrewhalterman.com)**. 
**Modified by [Rahul Bhargava](https://twitter.com/@rahulbot)**

MIT's [CLIFF](http://cliff.mediameter.org/) is a piece of software for extracting geolocation data from text, bundled into a server that can be accessed via API calls. I've bundled CLIFF into a Vagrant virtual machine for people (like me) who aren't thrilled about learning how to set up Tomcat servers and get Java configurations right. See the [CLIFF-up repo on GitHub](https://github.com/c4fcm/CLIFF-up) for the code to get CLIFF running easily inside a Vagrant virtual machine. Follow the instructions there to get CLIFF-up running, then come back here for a walkthrough of how to use CLIFF.

CLIFF is built on a number of free and open source projects, including Berico Technologies' [CLAVIN](https://github.com/Berico-Technologies/CLAVIN) geoparsing software, Stanford's [CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml) natural language software, and [Geonames.org](http://www.geonames.org/)'s free gazetteer of place names and coordinates.

You can use the [mediameter-cliff Python module](https://pypi.python.org/pypi/mediameter-cliff) to make queries to your server. Install it with `pip install mediameter-cliff` and then fire up this Python notebook.

In [60]:
from mediameter.cliff import Cliff
cliff = Cliff('http://localhost',8999)

Give it a sentence you're interested in geolocating. (From the [New York Times](http://www.nytimes.com/2014/11/15/world/europe/sweden-confirms-mystery-vessel-in-its-waters-was-a-foreign-submarine.html?_r=0)):

In [61]:
sentence = "In Sweden, the episode brought back memories of another incident in 1981, when Sweden discovered that a Soviet submarine had run aground off Swedish shores at Karlskrona in the south of the country."
print sentence

In Sweden, the episode brought back memories of another incident in 1981, when Sweden discovered that a Soviet submarine had run aground off Swedish shores at Karlskrona in the south of the country.


You can simply call the `parseText` method to get some results.

In [62]:
data = cliff.parseText(sentence)
data

{u'milliseconds': 179,
 u'results': {u'organizations': [],
  u'people': [],
  u'places': {u'focus': {u'cities': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'P',
      u'featureCode': u'PPLA',
      u'id': 2701713,
      u'lat': 56.16156,
      u'lon': 15.58661,
      u'name': u'Karlskrona',
      u'population': 32309,
      u'score': 1,
      u'stateCode': u'02',
      u'stateGeoNameId': u'2721357'}],
    u'countries': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'A',
      u'featureCode': u'PCLI',
      u'id': 2661886,
      u'lat': 62.0,
      u'lon': 15.0,
      u'name': u'Kingdom of Sweden',
      u'population': 9555893,
      u'score': 3,
      u'stateCode': u'00',
      u'stateGeoNameId': u''}],
    u'states': [{u'countryCode': u'SE',
      u'countryGeoNameId': u'2661886',
      u'featureClass': u'A',
      u'featureCode': u'ADM1',
      u'id': 2721357,
      u'lat': 56.33333,
      u'lon': 15.

You'll see that it returns some data about the query (time elapsed, status, version), a list of organizations mentioned, a list of people, and the places in the story. One of CLIFF's big selling points is that it distinguishes between "focus" places (the locations the text is really about) and "mention" places that appear only peripherally in the text. You'll also notice that it's fast: 179 milliseconds for the query above, which is the payoff from the long Lucene index building process that ran when you did `vagrant up`.

We can cut the results down to just the "focus" places, which is presumably what we're interested in.

In [63]:
data['results']['places']['focus']

{u'cities': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'P',
   u'featureCode': u'PPLA',
   u'id': 2701713,
   u'lat': 56.16156,
   u'lon': 15.58661,
   u'name': u'Karlskrona',
   u'population': 32309,
   u'score': 1,
   u'stateCode': u'02',
   u'stateGeoNameId': u'2721357'}],
 u'countries': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'A',
   u'featureCode': u'PCLI',
   u'id': 2661886,
   u'lat': 62.0,
   u'lon': 15.0,
   u'name': u'Kingdom of Sweden',
   u'population': 9555893,
   u'score': 3,
   u'stateCode': u'00',
   u'stateGeoNameId': u''}],
 u'states': [{u'countryCode': u'SE',
   u'countryGeoNameId': u'2661886',
   u'featureClass': u'A',
   u'featureCode': u'ADM1',
   u'id': 2721357,
   u'lat': 56.33333,
   u'lon': 15.33333,
   u'name': u'Blekinge',
   u'population': 152315,
   u'score': 1,
   u'stateCode': u'02',
   u'stateGeoNameId': u'2721357'}]}

Or we can pare it down even further and just look at the city names:

In [64]:
for i in data['results']['places']['focus']['cities']:
    print i['name']

Karlskrona
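The same "focus" dictionary can also be flattened across all three levels (cities, states, countries) into uniform rows, which is handy if you want to map or tabulate the results. Here is a minimal sketch of that idea. It runs without a server because it uses an abridged copy of the output shown above, and `flatten_focus` is a hypothetical helper name, not part of the mediameter-cliff module. It is written so that it also runs under Python 3.

```python
# Abridged copy of the "focus" dictionary printed above (only the fields used here).
focus = {
    "cities": [{"name": "Karlskrona", "lat": 56.16156, "lon": 15.58661}],
    "countries": [{"name": "Kingdom of Sweden", "lat": 62.0, "lon": 15.0}],
    "states": [{"name": "Blekinge", "lat": 56.33333, "lon": 15.33333}],
}


def flatten_focus(focus):
    """Return one (level, name, lat, lon) tuple per focus place.

    Hypothetical helper for illustration; not part of mediameter-cliff.
    """
    rows = []
    for level in ("cities", "states", "countries"):
        # .get() tolerates responses in which a level is missing
        for place in focus.get(level, []):
            rows.append((level, place["name"], place["lat"], place["lon"]))
    return rows


for row in flatten_focus(focus):
    print(row)
```

With a live server you would pass `data['results']['places']['focus']` instead of the hand-copied dictionary.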


### Multi-sentence example: Syria

The point of automated geocoding is obviously to do it at scale, perhaps as part of an [event data project](https://github.com/openeventdata).

Let's take a look at an example a little closer to what we'll be doing with event data, where we'd like to extract the places where events are happening. Our event extraction software, [PETRARCH](https://github.com/openeventdata/petrarch), handles the event extraction, but we will rely on a separate program to figure out the places associated with the events in each sentence.

Here, I'm giving it a list of sentences from a recent [Reuters story](http://www.reuters.com/article/2014/11/05/us-mideast-crisis-turkey-idUSKBN0IP10B20141105) about Syria. Normally, I would split the paragraph into sentences automatically using CoreNLP's sentence splitter function, but I've done that step by hand here to keep this example lightweight.
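If you want to automate the splitting without setting up CoreNLP, a regular expression can serve as a rough stand-in for short, clean text. The sketch below is not CoreNLP's splitter and is much cruder: `naive_split` is a hypothetical helper that breaks after sentence-ending punctuation followed by a capital letter, and the sample text is abridged from the story.

```python
import re


def naive_split(text):
    """Very rough sentence splitter (hypothetical helper, not CoreNLP).

    Splits wherever '.', '!' or '?' is followed by whitespace and a capital letter.
    """
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [part.strip() for part in parts if part.strip()]


text = ("The death toll was expected to rise, it said. "
        "Fighters linked to al Qaeda also took ground last week in Idlib.")

for sentence in naive_split(text):
    print(sentence)

# Abbreviations defeat this approach: the line below is wrongly split after "U.S."
print(naive_split("U.S. Central Command said so."))
```

The abbreviation case is exactly why a trained splitter is the better choice for news text like this story, which mentions "U.S. Central Command" in its first sentence.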

In [65]:
paragraph = ["The United States continued its assault on Islamic State militants this week, conducting 14 airstrikes in recent days in Syria and Iraq, U.S. Central Command said, three of them near the predominantly Kurdish border town of Kobani.", "Turkish President Tayyip Erdogan has criticized the U.S.-led coalition's focus on Kobani, which has been besieged by Islamic State for more than a month, and warned its attention needed to be turned to other parts of the conflict.", "The Syrian civil war has killed close to 200,000 people and forced more than 3 million refugees to flee the country, according to the United Nations.", "At least 11 children were killed in Damascus when mortars fell on a school in an eastern district of the Syrian capital, the Britain-based Syrian Observatory for Human Rights, which monitors the war, said on Wednesday.","The school was in a rebel-held part of Qaboun, a district in the east of the city which is contested between government and rebel forces, the monitoring group said.","The death toll was expected to rise because a number of those wounded were in critical condition, it said.","Fighters linked to al Qaeda also took ground from moderate Syrian rebels last week in the northern province of Idlib, expanding their control.","A member of the Syrian rebel forces based in southeastern Turkey said on Wednesday the Nusra Front had made further gains in recent days."]

for i in paragraph:
    print i
    print "\n"

The United States continued its assault on Islamic State militants this week, conducting 14 airstrikes in recent days in Syria and Iraq, U.S. Central Command said, three of them near the predominantly Kurdish border town of Kobani.


Turkish President Tayyip Erdogan has criticized the U.S.-led coalition's focus on Kobani, which has been besieged by Islamic State for more than a month, and warned its attention needed to be turned to other parts of the conflict.


The Syrian civil war has killed close to 200,000 people and forced more than 3 million refugees to flee the country, according to the United Nations.


At least 11 children were killed in Damascus when mortars fell on a school in an eastern district of the Syrian capital, the Britain-based Syrian Observatory for Human Rights, which monitors the war, said on Wednesday.


The school was in a rebel-held part of Qaboun, a district in the east of the city which is contested between government and rebel forces, the monitoring group s

In [66]:
for num, sentence in enumerate(paragraph):
    response = cliff.parseText(sentence)
    focus = response['results']['places']['focus']
    # Only report sentences whose response includes focus cities
    if 'cities' in focus:
        for city in focus['cities']:
            place = city['name']
            lat = city['lat']
            lon = city['lon']
            print "Cities in sentence " + str(num) + ": " + place + " (" + str(lat) + ", " + str(lon) + ")"


Cities in sentence 0: ‘Ayn al ‘Arab (36.89095, 38.35347)
Cities in sentence 3: Damascus (33.5102, 36.29128)
Cities in sentence 4: Qābūn (33.54309, 36.33604)
Cities in sentence 6: Idlib (35.93062, 36.63393)


Although the first one doesn't look right (the first sentence [#0] is about Kobani), it turns out that 
‘Ayn al ‘Arab is actually the same place. ([See it on a map](https://www.google.com/maps/place/Ayn+al-Arab%2FAleppo+Governorate,+Turkey/@36.8897215,38.355556,15z/data=!3m1!4b1!4m7!1m4!3m3!1s0x15315e572c090bcd:0xe533e57bae2a797a!2sUnnamed+Rd,+Kubani,+Syria!3b1!3m1!1s0x15315e570e650e43:0x948999c15c6032ef)).
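Since the point is to do this at scale, you will probably want to collect the results rather than print them. The following is a minimal sketch of one way to do that, written for Python 3. `geocode_sentences` and `fake_parse` are hypothetical names: the first takes any callable that returns a CLIFF-style response (for example `cliff.parseText`), and the second is a stand-in so the example runs without a server.

```python
import csv
import io


def geocode_sentences(sentences, parse):
    """Collect the focus cities of each sentence as dictionaries.

    `parse` is any callable that takes a string and returns a CLIFF-style
    response, e.g. cliff.parseText. Hypothetical helper for illustration.
    """
    rows = []
    for num, sentence in enumerate(sentences):
        focus = parse(sentence)["results"]["places"]["focus"]
        for city in focus.get("cities", []):
            rows.append({"sentence": num, "name": city["name"],
                         "lat": city["lat"], "lon": city["lon"]})
    return rows


def fake_parse(text):
    # Stand-in for cliff.parseText; the coordinates are copied from the output above.
    focus = {}
    if "Damascus" in text:
        focus["cities"] = [{"name": "Damascus", "lat": 33.5102, "lon": 36.29128}]
    return {"results": {"places": {"focus": focus}}}


sentences = ["Mortars fell on a school in Damascus.",
             "The death toll was expected to rise."]
rows = geocode_sentences(sentences, fake_parse)

# Write the rows as CSV to an in-memory buffer; swap in open(...) for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sentence", "name", "lat", "lon"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Passing the parser in as an argument keeps the collection logic separate from the server connection, so the same function works against a live CLIFF instance or a stub.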

As the makers of CLIFF point out, assessing automated document geolocation is very difficult. Over the next few weeks, we at the Open Event Data Alliance will start evaluating CLIFF for our needs to see whether we can move it into production as our geolocating service.