# Comparing the publication volume of cities in Europe

## Part 1: Extracting Publications Data Based on Countries

This notebook shows how to use the [Dimensions Analytics API](https://www.dimensions.ai/dimensions-apis/) to measure how the total number of publications is distributed across European cities. For the purpose of this exercise, we will look at a single year, 2018.

This data will allow us to create a nice-looking map where cities producing more research can be identified visually.

> **Customize this notebook:** simply change the geographical area and time-span to adapt this analysis to your needs! 


#### Prerequisites

In [1]:
# data analysis libraries 
import time 
import pandas as pd 
import requests  
import json 
# Dimensions API query helper
import dimcli
dimcli.login()
dsl = dimcli.Dsl()
# 

DimCli v0.5.4 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)


Let's put the countries we need into a Python list (note: the country codes follow GeoNames, see https://www.geonames.org/countries/). We will use this list later to generate queries dynamically.

In [2]:
EUROPEAN_COUNTRIES = ["AD","AL","AT","AX","BA","BE","BG","BY","CH","CZ","DE","DK","EE","ES","FI","FO","FR","GB","GG","GI","GR","HR","HU","IE","IM","IS","IT","JE","LI","LT","LU","LV","MC","MD","ME","MK","MT","NL","NO","PL","PT","RO","RS","RU","SE","SI","SJ","SK","SM","UA","VA"]

We are looking at a specific year, hence:

In [3]:
YEAR = 2018

## 1. Getting the publications data from Dimensions

The Dimensions DSL API can return a maximum of 1000 results per call, so to get the data we are interested in we need to break the query down into smaller ones.

We can create a number of **smaller queries**, one for **each European country**, for a single year, and extract the cities information from each of them.

This is what this query would look like, for a single country:

```
search publications
where type="article" and year="2018" and research_org_countries in ["GB"]
return research_org_cities limit 1000
```

The result of this query contains a list of cities and publication counts, like this:

```
n	count	id		name
0	63893	2643743	London
1	16173	2653941	Cambridge
2	16073	2640729	Oxford
3	10695	2643123	Manchester
4	10403	2650225	Edinburgh
```
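
As a minimal sketch of how such a result maps onto pandas, the rows above can be written as a list of dictionaries and loaded into a DataFrame. The `sample_rows` literal copies the figures shown above; its dictionary layout (and the use of strings for `id`) is an assumption about the shape of the API response.

```python
import pandas as pd

# Rows copied from the sample output above. Assumed response shape:
# one dictionary per city with "count", "id" and "name" keys.
sample_rows = [
    {"count": 63893, "id": "2643743", "name": "London"},
    {"count": 16173, "id": "2653941", "name": "Cambridge"},
    {"count": 16073, "id": "2640729", "name": "Oxford"},
    {"count": 10695, "id": "2643123", "name": "Manchester"},
    {"count": 10403, "id": "2650225", "name": "Edinburgh"},
]

# One row per city, one column per key
sample_df = pd.DataFrame(sample_rows)
print(sample_df)
```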

In essence, we want to run the query above for every country in our European list, then collate the data into a single table.

We can set up a parametrized query loop as follows:

In [4]:
q_template = f"""search publications
where type="article" and date="{YEAR}" and research_org_countries in ["%s"]
return research_org_cities limit 1000"""
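
To see what the template expands to, the sketch below renders it for one country. `build_query` is a hypothetical helper name, not part of dimcli; it only performs string formatting and makes no API call.

```python
YEAR = 2018

# Same template as in the cell above: the f-string fills in YEAR right away,
# while "%s" is left as a placeholder for the country code.
q_template = f"""search publications
where type="article" and date="{YEAR}" and research_org_countries in ["%s"]
return research_org_cities limit 1000"""


def build_query(country_code):
    """Return the DSL query text for a single country (hypothetical helper)."""
    return q_template % country_code


rendered_query = build_query("GB")
print(rendered_query)
```

Mixing an f-string with `%s` works here because the two substitutions happen at different times: the year is fixed once, the country changes on every loop iteration.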

In [5]:
# dataframe container object
df_master = pd.DataFrame()
# query loop
for c in EUROPEAN_COUNTRIES:
    query = q_template % (c)
    res = dsl.query(query, show_results=False)
    data = res['research_org_cities']
    if data:
        if df_master.empty:
            df_master = pd.DataFrame.from_dict(data)
        else:
            df_temp = pd.DataFrame.from_dict(data)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            df_master = pd.concat([df_master, df_temp], sort=True)
    print("Querying country:", c, ". Results Rows:", len(df_master))
    time.sleep(1.5) # pause between calls to stay within the API limit of 30 queries per minute
print("--")    
print("Done")    

Querying country: AD . Results Rows: 0
Querying country: AL . Results Rows: 6
Querying country: AT . Results Rows: 179
Querying country: AX . Results Rows: 179
Querying country: BA . Results Rows: 184
Querying country: BE . Results Rows: 485
Querying country: BG . Results Rows: 514
Querying country: BY . Results Rows: 585
Querying country: CH . Results Rows: 963
Querying country: CZ . Results Rows: 1108
Querying country: DE . Results Rows: 1711
Querying country: DK . Results Rows: 2001
Querying country: EE . Results Rows: 2019
Querying country: ES . Results Rows: 2464
Querying country: FI . Results Rows: 2621
Querying country: FO . Results Rows: 2621
Querying country: FR . Results Rows: 3219
Querying country: GB . Results Rows: 3942
Querying country: GG . Results Rows: 3942
Querying country: GI . Results Rows: 3942
Querying country: GR . Results Rows: 4120
Querying country: HR . Results Rows: 4129
Querying country: HU . Results Rows: 4183
Querying country: IE . Results Rows: 4258
...
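
An alternative way to structure the loop, sketched here under stated assumptions, is to collect one DataFrame per country in a list and concatenate once at the end. To keep the example runnable offline, `fake_query` is a hypothetical stand-in for `dsl.query` and returns invented rows, not real Dimensions data; `collect_cities` is likewise an illustrative name.

```python
import pandas as pd


def fake_query(country_code):
    """Hypothetical stand-in for dsl.query(); returns invented rows."""
    canned = {
        "AA": [{"count": 10, "id": "1", "name": "City A"}],
        "BB": [
            {"count": 5, "id": "1", "name": "City A"},
            {"count": 7, "id": "2", "name": "City B"},
        ],
    }
    return {"research_org_cities": canned.get(country_code, [])}


def collect_cities(country_codes, run_query):
    """Run one query per country and return all rows in a single DataFrame."""
    frames = []
    for code in country_codes:
        rows = run_query(code)["research_org_cities"]
        if rows:  # skip countries that return no cities
            frames.append(pd.DataFrame(rows))
    if not frames:
        return pd.DataFrame()
    # A single concat at the end avoids copying the growing table on every iteration
    return pd.concat(frames, ignore_index=True, sort=True)


demo = collect_cities(["AA", "BB", "CC"], fake_query)
print(len(demo))
```

In the real notebook, `run_query` would wrap the `dsl.query(...)` call and the pause between requests.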

#### That's it! We have extracted the cities/publications data for all the European countries in our list.

Let's save the raw data as a CSV file so that we can use it later. 

In [8]:
df_master.to_csv('data/raw_cities_data.csv')
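
Two practical details when saving: `to_csv` does not create missing directories, and by default it writes the index as an extra column, which shows up as `Unnamed: 0` when the file is read back. The sketch below shows one way to handle both. It uses a temporary directory and invented rows rather than the real `data/` folder, so it can be run without touching the notebook's files.

```python
import os
import tempfile

import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame(
    {"count": [10, 7], "id": ["1", "2"], "name": ["City A", "City B"]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, "raw_cities_data.csv")

    toy.to_csv(path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(path, dtype={"id": str})  # keep ids as text

print(list(reloaded.columns))
```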

## 2. Cleaning the data by removing duplicates 

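Because the same city can be returned by more than one country query, the raw table contains repeated cities. Next up, we want to combine the raw data so that we have only one entry per city, with its counts summed.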

In [9]:
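# group rows by city and add up their counts (the string "sum" avoids the deprecated builtin-function form)
df_master_merged = df_master.pivot_table(index=['id', 'name'], aggfunc="sum").sort_values(by=['count'], ascending=False)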

In [10]:
# fix column names (https://stackoverflow.com/questions/33290374/pandas-pivot-table-column-names)
df_master_merged.reset_index(inplace=True)
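
To make the effect of the merge concrete, here is a small sketch on invented rows (not Dimensions data) in which one city appears twice. It follows the same `pivot_table` pattern, with `values="count"` spelled out explicitly, so the two rows for the repeated city collapse into one with their counts added.

```python
import pandas as pd

# Invented rows: "City A" appears twice, as it would if two country
# queries had both returned it.
toy_raw = pd.DataFrame(
    [
        {"count": 10, "id": "1", "name": "City A"},
        {"count": 5, "id": "1", "name": "City A"},
        {"count": 7, "id": "2", "name": "City B"},
    ]
)

toy_merged = (
    toy_raw.pivot_table(index=["id", "name"], values="count", aggfunc="sum")
    .sort_values(by="count", ascending=False)
    .reset_index()  # turn the (id, name) index back into ordinary columns
)
print(toy_merged)
```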

Let's save this intermediary step as a different CSV. 

In [12]:
df_master_merged.to_csv('data/merged_cities_data.csv')

Let's take a quick look at the data we collected:

In [11]:
len(df_master_merged)

1466

In [17]:
df_master_merged.head(10)

Unnamed: 0.1,Unnamed: 0,id,name,count
0,0,2643743,London,468
1,1,2988507,Paris,293
2,2,3128760,Barcelona,176
3,3,2759794,Amsterdam,172
4,4,3117735,Madrid,171
5,5,3173435,Milan,163
6,6,3169070,Rome,154
7,7,2657896,Zürich,147
8,8,2950159,Berlin,135
9,9,2673730,Stockholm,126


---
# What next?


This tutorial is the first of a three-part series. The other parts can be found on the [Dimensions API Lab](https://digital-science.github.io/dimensions-api-lab/) website, which contains many other tutorials and reusable Jupyter notebooks for scholarly data analytics.