# Comparing the publications volume of cities in Europe

## Part 2: Enriching Cities Data with Geonames.org Lat/Long

This notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to measure how publications are distributed across European cities. For the purpose of this exercise, we will look at a single year, 2018.

In this section we want to take all the cities information and enhance it by adding geocoordinates. These will come in handy later on if we want to plot the data on a map.

We will use the free service [geonames.org](http://www.geonames.org/export/web-services.html) for this. 

> The Geonames API is freely available, but one must create an account/username in order to use it: see also http://www.geonames.org/export/credits.html.

Geonames information will also let us do more data cleanup: because of co-authorship, the cities list we currently have contains some non-European places (e.g. New York, because authors from NY happened to co-author papers with European authors).
These values can easily be filtered out once the cities information from Dimensions is enriched with country codes (from geonames.org).

#### Prerequisites

In [2]:
# data analysis libraries 
import time 
import pandas as pd 
import requests  
import json 
# Dimensions API query helper
import dimcli
dimcli.login()
dsl = dimcli.Dsl()
# 

DimCli v0.5.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


Also, let's store the geonames settings we'll be using later.

In [5]:
# ****** REPLACE AS NEEDED ******
GEONAMES_USER = "YOUR_USERNAME_HERE"
#
GEONAMES_URL = "http://api.geonames.org/hierarchyJSON?geonameId=%s&username=%s"
#
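
As a quick check of how the URL template is filled in, the sketch below formats it with one geonames ID (2643743, London). The username `demo_user` is a placeholder, not a real account, and `build_geonames_url` is a hypothetical helper introduced only for illustration.

```python
GEONAMES_URL = "http://api.geonames.org/hierarchyJSON?geonameId=%s&username=%s"

def build_geonames_url(city_id, username):
    """Fill the two %s placeholders: first the geonames ID, then the username."""
    return GEONAMES_URL % (str(city_id), username)

# "demo_user" is a placeholder; replace it with your own geonames username
example_url = build_geonames_url(2643743, "demo_user")
print(example_url)
```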

## Let's load the data we extracted in Part-1 of this tutorial

In [3]:
df1 = pd.read_csv("data/merged_cities_data.csv")
df1.head()

Unnamed: 0.1,Unnamed: 0,id,name,count
0,0,2643743,London,468
1,1,2988507,Paris,293
2,2,3128760,Barcelona,176
3,3,2759794,Amsterdam,172
4,4,3117735,Madrid,171


In [4]:
# Drop the extra index column and rename the remaining ones
df1.drop(['Unnamed: 0'], axis=1, inplace=True)
df1.columns = ['geonamesId', 'name', 'count']
df1.head()

Unnamed: 0,geonamesId,name,count
0,2643743,London,468
1,2988507,Paris,293
2,3128760,Barcelona,176
3,2759794,Amsterdam,172
4,3117735,Madrid,171


# Querying the Geonames API

We want to set up a function that queries geonames.org based on the geonames city IDs we extracted from Dimensions. 

Then, for each city record, we will extract the latitude and longitude, plus the country code and name.

Note: the geonames API has an hourly limit of 1000 credits/requests, hence we will slow down the execution using `time.sleep()` later on.
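
To make the hourly limit concrete: staying under 1000 requests per hour means spacing calls at least 3.6 seconds apart on average. The sketch below only does that arithmetic; `min_delay_seconds` and `hours_needed` are hypothetical helpers, not part of this notebook or of any geonames client.

```python
HOURLY_LIMIT = 1000  # credits/requests per hour, as stated above

def min_delay_seconds(requests_per_hour):
    """Smallest average pause between calls that stays within the hourly quota."""
    return 3600 / requests_per_hour

def hours_needed(n_requests, delay_seconds):
    """Rough wall-clock time for n_requests when pausing delay_seconds after each."""
    return n_requests * delay_seconds / 3600

delay = min_delay_seconds(HOURLY_LIMIT)
print("pause between calls (s):", delay)
print("hours for 1500 places:", hours_needed(1500, delay))
```

With the 1-second pause used in the loop further down, a run over more than 1000 unresolved places can hit the hourly quota. The rerun-until-complete approach described later absorbs that, at the cost of extra runs.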

In [6]:
# eg http://api.geonames.org/hierarchyJSON?geonameId=2643743&username=mpasin
def open_geonames(city_id):
    r = requests.get(GEONAMES_URL % (str(city_id), GEONAMES_USER))
    return r.json()

def geonames_details(city_id):
    city_id = int(city_id) # make sure it's a number
    data = open_geonames(city_id)
    try:
        for x in data['geonames']:
            if x['geonameId'] == city_id:
                lat = x['lat']
                lng = x['lng']
                countryCode = x['countryCode']
                countryName = x['countryName']
                return [lat, lng, countryCode, countryName]
    except Exception as e: 
        print(e)
        print("Error parsing JSON: %s" % str(data))
        return None
    

The function above works like this:

```
geonames_details(3522210)
=> ['20.11697', '-98.73329', 'MX', 'Mexico']
```
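
To illustrate the parsing step without calling the API, the sketch below applies the same matching logic to hand-written payloads. Both payloads are made-up, heavily trimmed stand-ins (a real hierarchy response carries more fields and more levels); only the values for ID 3522210 are taken from the example above. `extract_details` is a hypothetical name for the parsing part of `geonames_details`.

```python
def extract_details(data, city_id):
    """Return [lat, lng, countryCode, countryName] for the entry matching city_id, or None."""
    city_id = int(city_id)  # IDs may arrive as strings
    for entry in data.get("geonames", []):
        if entry.get("geonameId") == city_id:
            return [entry["lat"], entry["lng"],
                    entry["countryCode"], entry["countryName"]]
    return None  # error payload, or no entry with that ID

# Illustrative payload: a placeholder parent level followed by the city itself
sample_response = {
    "geonames": [
        {"geonameId": 1, "name": "placeholder parent"},
        {"geonameId": 3522210, "lat": "20.11697", "lng": "-98.73329",
         "countryCode": "MX", "countryName": "Mexico"},
    ]
}

# Illustrative error payload: no "geonames" key at all
error_response = {"status": {"message": "placeholder error", "value": 0}}

print(extract_details(sample_response, "3522210"))
print(extract_details(error_response, 3522210))
```

Returning `None` for both failure cases keeps the caller simple: any falsy result means "try this place again later".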

To avoid losing data if the geonames API returns an error, we store the cities data in a dict, which we can save as JSON.  
This can be updated with geonames info incrementally, using separate runs (if necessary).

We use a list of dictionaries so as to preserve the original order.

In [20]:
temp = {}
temp['data'] = []
for x in df1['geonamesId']:
    temp['data'].append({x: None})
with open('tmp/geonames_temp.json', 'w') as outfile:  
    json.dump(temp, outfile)

Read the data back in for iteration

In [35]:
with open('tmp/geonames_temp.json') as infile:  
    tempfiledata = json.load(infile)
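
One detail worth noting when reading the file back: JSON object keys are always strings, so the integer IDs written earlier come back as strings (which is why `geonames_details` converts its argument with `int()`). A small sketch of that round trip, using two IDs from the table above and `json.dumps`/`json.loads` in place of a file:

```python
import json

ids = [2643743, 2988507]  # London and Paris, from the table above
temp = {"data": [{x: None} for x in ids]}

# Serialising and parsing has the same effect as dumping to a file and loading it
restored = json.loads(json.dumps(temp))

first_key = next(iter(restored["data"][0]))
print(type(first_key).__name__, first_key)         # the key is now a string
print([next(iter(d)) for d in restored["data"]])   # list order is preserved
```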

Iterate and enrich the dict with the geonames details. 

If the geonames API fails, the value for a place ID remains null. 

The iteration only processes elements whose place info is still null, so we can rerun this cell as many times as needed to get data for all places. 

In [36]:
counter = 0
for ddict in tempfiledata['data']:
    _id = next(iter(ddict)) # get first element
    if not ddict[_id]:
        print(counter, "...")
        res = geonames_details(_id)
        if res:
            tempfiledata['data'][counter][_id] = res
        time.sleep(1)
    else:
        pass 
        # print("skipping", counter)
    counter += 1

# now save to file
with open('tmp/geonames_temp.json', 'w') as outfile:  
    json.dump(tempfiledata, outfile)
print("DONE: data saved")

1435 ...
DONE: data saved
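
To decide whether another run is needed, it can help to count how many places are still unresolved. A minimal sketch, assuming the same `{"data": [{id: details-or-null}, ...]}` layout as the temp file; `count_unresolved` is a hypothetical helper and the sample data is illustrative.

```python
def count_unresolved(filedata):
    """Count entries whose single value is still empty (None or missing details)."""
    return sum(1 for d in filedata["data"] if not next(iter(d.values())))

# Illustrative data: one place already enriched, one still waiting
sample = {
    "data": [
        {"2643743": ["51.50853", "-0.12574", "GB", "United Kingdom"]},
        {"2988507": None},
    ]
}

remaining = count_unresolved(sample)
print(remaining, "place(s) still to fetch")
```

When the count reaches zero, every place has geonames details and the cell above no longer needs rerunning.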


Finally, add the geonames data to the original data frame. 

We can do this by creating a list for each of the new columns, making sure they have exactly the same number of elements as the original data frame.

In [37]:
lats, longs, countryCodes, countryNames = [], [], [], []
for x in tempfiledata['data']:
    _id = next(iter(x)) # get first element
    if not x[_id]:
        # no geonames data for this place: use empty placeholders
        lats.append("")
        longs.append("")
        countryCodes.append("")
        countryNames.append("")
    else:
        lats.append(x[_id][0])
        longs.append(x[_id][1])
        countryCodes.append(x[_id][2])
        countryNames.append(x[_id][3])

len(df1) == len(lats) == len(longs) == len(countryCodes) == len(countryNames)

True

In [38]:
# when finished, update the dataframe and save 
df1['lat'] = lats
df1['lng'] = longs
df1['countryCode'] = countryCodes
df1['countryName'] = countryNames
df1.to_csv('tmp/enriched_cities_data_2019-04-23.csv')

Finally, we can remove non-European cities from the data table.

In [41]:
df1 = pd.read_csv("tmp/enriched_cities_data_2019-04-23.csv")
europe_countries = ["AD","AL","AT","AX","BA","BE","BG","BY","CH","CZ","DE","DK","EE","ES","FI","FO","FR","GB","GG","GI","GR","HR","HU","IE","IM","IS","IT","JE","LI","LT","LU","LV","MC","MD","ME","MK","MT","NL","NO","PL","PT","RO","RS","RU","SE","SI","SJ","SK","SM","UA","VA"]

to_drop = []

for x,y in enumerate(df1['countryCode']):
    if y not in europe_countries:
        to_drop += [x]
    
df1.drop(df1.index[to_drop], inplace=True)
df1.head()



Unnamed: 0.1,Unnamed: 0,geonamesId,name,count,lat,lng,countryCode,countryName
0,0,2643743,London,468,51.50853,-0.12574,GB,United Kingdom
1,1,2988507,Paris,293,48.85341,2.3488,FR,France
2,2,3128760,Barcelona,176,41.38879,2.15899,ES,Spain
3,3,2759794,Amsterdam,172,52.37403,4.88969,NL,Netherlands
4,4,3117735,Madrid,171,40.4165,-3.70256,ES,Spain
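
The same filter can also be written as a boolean mask, which avoids collecting row positions by hand. This is an alternative sketch rather than the notebook's method; it uses a shortened country list and a toy data frame for illustration.

```python
import pandas as pd

europe_countries = ["ES", "FR", "GB", "NL"]  # shortened list for the example

# Toy data frame standing in for the enriched cities table
df = pd.DataFrame({
    "name": ["London", "Paris", "New York"],
    "countryCode": ["GB", "FR", "US"],
})

# Keep only rows whose country code appears in the list
mask = df["countryCode"].isin(europe_countries)
df_europe = df[mask]
print(df_europe["name"].tolist())
```

Rows with a missing country code are dropped as well, since a missing value never matches an entry in the list.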


In [51]:
df1.describe()

Unnamed: 0,geonamesId,count
count,808.0,808.0
mean,2627860.0,13.493812
std,813793.4,30.379793
min,251833.0,1.0
25%,2641306.0,1.0
50%,2825690.0,3.0
75%,3031248.0,13.0
max,6543862.0,468.0


There are now **808** European cities listed. 

As a final step, let's remove the first column and save the data as CSV. 

In [None]:
df1.drop(['Unnamed: 0'], axis=1, inplace=True)

In [52]:
df1.to_csv('cities_data_final_2019-04-23.csv')

That's it. This is the final dataset, which includes only European cities.

---
# What next?


This tutorial is the second of a three-part series. The other parts can be found on the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many other tutorials and reusable Jupyter notebooks for scholarly data analytics. 