# Domestic Collaboration Queries

Use case:

> How can we identify domestic collaborations, including also publications that have both domestic and international collaborations? For example, a publication may have authors from the University of Auckland (NZ), University of Otago (NZ) and Stanford University (US), so we'd like to include it in a report focusing on New Zealand.

So the goal is to extract all publications that are affiliated with more than one NZ institution, irrespective of whether they also have international affiliations.


In [0]:
# fill in as needed!
username = ""
password = ""

In [0]:
!pip install dimcli --quiet
import dimcli
dimcli.login(username, password)

DimCli v0.6 - Succesfully connected to <https://app.dimensions.ai> (method: manual login)


In [0]:
dsl = dimcli.Dsl()

## Approach 1

With this approach we can identify publications that have more than one affiliation, all of them from a specific country (e.g. NZ). This takes two steps:

#### Step 1. Identify the countries that co-appear with NZ in publications with more than one research organization

```
search publications
where count(research_orgs) > 1
and research_org_country_names = "New Zealand"
return research_org_countries limit 1000
```

#### Step 2. Use the list of countries to form a new query that filters them out (so that only publications whose research orgs are all from NZ are returned)

```
search publications
where count(research_orgs) > 1
and research_org_country_names = "New Zealand" 
and research_org_country_names != "United States" 
and research_org_country_names != "Australia"
[etc....]
return funders limit 1000
```

> **Limitation of this approach**
The main limitation of this approach is that it returns no results for collaborations between two NZ organizations and an international one, because every publication with a non-NZ affiliation is filtered out.
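
To make the limitation concrete, the sketch below applies the same filtering logic to three made-up publications in plain Python. The publication IDs, the `pubs` dictionary and both function names are hypothetical; they only mimic what the DSL filter does and are not API output.

```python
# Hypothetical toy data: publication id -> list of (organization, country) affiliations
pubs = {
    "pub.A": [("University of Auckland", "New Zealand"),
              ("University of Otago", "New Zealand")],
    "pub.B": [("University of Auckland", "New Zealand"),
              ("University of Otago", "New Zealand"),
              ("Stanford University", "United States")],
    "pub.C": [("University of Auckland", "New Zealand"),
              ("Stanford University", "United States")],
}

def approach_1(pubs, country="New Zealand"):
    """Mimic Approach 1: more than one org, and every org is in `country`."""
    return [pid for pid, affs in pubs.items()
            if len(affs) > 1 and all(c == country for _, c in affs)]

def domestic_collaborations(pubs, country="New Zealand"):
    """The actual goal: at least two distinct orgs from `country`."""
    return [pid for pid, affs in pubs.items()
            if len({org for org, c in affs if c == country}) > 1]

print(approach_1(pubs))               # ['pub.A']
print(domestic_collaborations(pubs))  # ['pub.A', 'pub.B']
```

`pub.B` is the Auckland / Otago / Stanford case from the use case: it is a domestic collaboration, but Approach 1 drops it because of the US affiliation.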

### Implementation

In [0]:
all_countries = dsl.query("""
search publications
where count(research_orgs) > 1
and research_org_country_names = "New Zealand"
return research_org_countries limit 1000""").as_dataframe()

Returned Research_org_countries: 228


In [0]:
countries = list(all_countries['name'])
countries.remove("New Zealand")

In [0]:
query_component = ""
for x in countries:
    query_component += """and research_org_country_names != "{}" """.format(x)

In [0]:
query_component

'and research_org_country_names != "United States" and research_org_country_names != "Australia" and research_org_country_names != "United Kingdom" and research_org_country_names != "Germany" and research_org_country_names != "Canada" and research_org_country_names != "China" and research_org_country_names != "France" and research_org_country_names != "Italy" and research_org_country_names != "Netherlands" and research_org_country_names != "Japan" and research_org_country_names != "Spain" and research_org_country_names != "Switzerland" and research_org_country_names != "Sweden" and research_org_country_names != "Belgium" and research_org_country_names != "India" and research_org_country_names != "Brazil" and research_org_country_names != "Singapore" and research_org_country_names != "Austria" and research_org_country_names != "Finland" and research_org_country_names != "Denmark" and research_org_country_names != "South Korea" and research_org_country_names != "Ireland" and research_org
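
The filter string can also be built with a small reusable helper. This is a minimal sketch, not part of the original notebook: the function name `build_exclusion_filter` and the `sample` list are assumptions made for illustration.

```python
def build_exclusion_filter(countries, keep="New Zealand"):
    """Return one `and research_org_country_names != "..."` clause per country,
    skipping the country we want to keep."""
    clauses = [
        'and research_org_country_names != "{}"'.format(name)
        for name in countries
        if name != keep
    ]
    return " ".join(clauses)

# Made-up input, standing in for list(all_countries['name'])
sample = ["New Zealand", "United States", "Australia"]
print(build_exclusion_filter(sample))
# and research_org_country_names != "United States" and research_org_country_names != "Australia"
```

Compared with the loop above, this version does not modify the input list and does not fail if the country to keep is missing from it.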

For example, to return funders we can do the following:

In [0]:
all_funders = dsl.query(f"""
search publications
where count(research_orgs) > 1
and research_org_country_names = "New Zealand" {query_component}
return funders limit 1000""").as_dataframe()

Returned Funders: 197


In [0]:
all_funders.head()

| | id | count | acronym | name | country_name |
|---|---|---|---|---|---|
| 0 | grid.431594.e | 1493 | MBIE | Ministry of Business, Innovation and Employment | New Zealand |
| 1 | grid.452999.a | 758 | HRC | Health Research Council of New Zealand | New Zealand |
| 2 | grid.431457.0 | 532 | RSNZ | Royal Society of New Zealand | New Zealand |
| 3 | grid.417738.e | 276 | | AgResearch (New Zealand) | New Zealand |
| 4 | grid.419676.b | 231 | NIWA | National Institute of Water and Atmospheric Re... | New Zealand |


## Approach 2

Another approach, which leads to more complete results, is to:
1. extract all publication records, and
2. aggregate them by research org / country in order to highlight collaboration patterns.

The main drawback of this approach is that it takes longer since we are downloading lots of data. In particular, since the maximum one can download from a single API search is 50k records, the source dataset (= all the publications) might need to be created by concatenating the results of multiple queries. 
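
One possible way to split the download is to generate one query per year and run them separately. The sketch below only builds the query strings; `QUERY_TEMPLATE` and `yearly_queries` are hypothetical names, and splitting by year assumes that each single year stays below the 50k limit.

```python
# Query text modelled on the one used in this notebook; {year} and {country} are filled in below
QUERY_TEMPLATE = """
search publications where count(research_orgs) > 1
and year = {year}
and research_org_country_names = "{country}"
return publications
"""

def yearly_queries(first_year, last_year, country="New Zealand"):
    """Return a {year: query string} mapping, one query per year (inclusive)."""
    return {
        year: QUERY_TEMPLATE.format(year=year, country=country)
        for year in range(first_year, last_year + 1)
    }

queries = yearly_queries(2015, 2017)
print(sorted(queries))  # [2015, 2016, 2017]

# Each query could then be run on its own and the resulting tables concatenated,
# e.g. (not executed here: it needs a logged-in dimcli session and pandas):
# frames = [dsl.query_iterative(q).as_dataframe_authors_affiliations() for q in queries.values()]
# data = pd.concat(frames, ignore_index=True)
```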

For the purpose of this example, we only take records from 2015.  

In [0]:
data = dsl.query_iterative("""
search publications where count(research_orgs) > 1 
and year = 2015
and research_org_country_names = "New Zealand"
return publications
""", limit=500).as_dataframe_authors_affiliations()

500 / 7591
1000 / 7591
1500 / 7591
2000 / 7591
2500 / 7591
3000 / 7591
3500 / 7591
4000 / 7591
4500 / 7591
5000 / 7591
5500 / 7591
6000 / 7591
6500 / 7591
7000 / 7591
7500 / 7591
7591 / 7591


Now we can keep only the affiliations from a given country, e.g. NZ.

In [0]:
country_affiliations = data.query('aff_country == "New Zealand"')
country_affiliations.head()

| | aff_name | aff_id | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1028580728 | | Edward J. | Gane |
| 19 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1028580728 | ur.016520575067.32 | Catherine A.M. | Stedman |
| 34 | Christchurch Hospital | grid.414299.3 | Christchurch | 2192360.0 | New Zealand | NZ | | | pub.1030129103 | ur.0712013775.41 | Nicholas M. | Douglas |
| 48 | Auckland University of Technology | grid.252547.3 | Auckland | 2193730.0 | New Zealand | NZ | | | pub.1013651732 | ur.013116146115.26 | Tao | Gao |
| 49 | Auckland University of Technology | grid.252547.3 | Auckland | 2193730.0 | New Zealand | NZ | | | pub.1013651732 | ur.010060432633.19 | Nikola | Kasabov |


Next we want the list of publications that have at least two separate GRID IDs. In other words, the publications that:
* appear more than once in the table above
* have at least 2 different GRID IDs

In [0]:
# work on an explicit copy, so that adding columns does not trigger pandas' SettingWithCopyWarning
country_affiliations = country_affiliations.copy()
# number of NZ affiliation rows per publication
country_affiliations["grids_tot"] = country_affiliations.groupby(['pub_id'])['aff_id'].transform('count')
# number of distinct NZ GRID IDs per publication
country_affiliations["grids_unique"] = country_affiliations.groupby(['pub_id'])['aff_id'].transform('nunique')


We have added two new columns to the table: `grids_tot` tells us how many affiliations we have per publication (normally one per author), while `grids_unique` tells us how many of these are distinct within that publication.
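
The difference between the two columns is easiest to see on a tiny example. The sketch below uses made-up rows (the `toy` frame and its IDs are hypothetical) and the same `groupby(...).transform(...)` calls as above.

```python
import pandas as pd

# Hypothetical rows: one row per (author, affiliation) pair, NZ affiliations only
toy = pd.DataFrame({
    "pub_id": ["pub.X", "pub.X", "pub.Y", "pub.Y", "pub.Y"],
    "aff_id": ["grid.1", "grid.1", "grid.1", "grid.2", "grid.2"],
})

# rows per publication
toy["grids_tot"] = toy.groupby("pub_id")["aff_id"].transform("count")
# distinct GRID ids per publication
toy["grids_unique"] = toy.groupby("pub_id")["aff_id"].transform("nunique")

print(toy)
# pub.X rows get grids_tot=2, grids_unique=1 (a single institution)
# pub.Y rows get grids_tot=3, grids_unique=2 (a domestic collaboration)
```

Only `pub.Y` would pass a `grids_unique > 1` filter, even though both publications have more than one affiliation row.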

In [0]:
country_affiliations.head()

| | aff_name | aff_id | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | grids_tot | grids_unique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1028580728 | | Edward J. | Gane | 2 | 1 |
| 19 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1028580728 | ur.016520575067.32 | Catherine A.M. | Stedman | 2 | 1 |
| 34 | Christchurch Hospital | grid.414299.3 | Christchurch | 2192360.0 | New Zealand | NZ | | | pub.1030129103 | ur.0712013775.41 | Nicholas M. | Douglas | 1 | 1 |
| 48 | Auckland University of Technology | grid.252547.3 | Auckland | 2193730.0 | New Zealand | NZ | | | pub.1013651732 | ur.013116146115.26 | Tao | Gao | 2 | 1 |
| 49 | Auckland University of Technology | grid.252547.3 | Auckland | 2193730.0 | New Zealand | NZ | | | pub.1013651732 | ur.010060432633.19 | Nikola | Kasabov | 2 | 1 |


Now we can pull out the rows where `grids_unique` > 1 in order to identify NZ collaborations.

In [0]:
country_affiliations.query("grids_unique > 1").head(10)

| | aff_name | aff_id | aff_city | aff_city_id | aff_country | aff_country_code | aff_state | aff_state_code | pub_id | researcher_id | first_name | last_name | grids_tot | grids_unique |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 91 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.01166573331.81 | Tracy R. | Melzer | 8 | 3 |
| 94 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.01205755463.23 | Michael R. | MacAskill | 8 | 3 |
| 96 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.0660231346.35 | Toni L. | Pitcher | 8 | 3 |
| 98 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.0615411424.15 | Leslie | Livingston | 8 | 3 |
| 103 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.0777753224.45 | John C. | Dalrymple-Alford | 8 | 3 |
| 104 | University of Canterbury | grid.21006.35 | Christchurch | 2192360.0 | New Zealand | NZ | | | pub.1004307959 | ur.0777753224.45 | John C. | Dalrymple-Alford | 8 | 3 |
| 106 | University of Otago | grid.29980.3a | Dunedin | 2191560.0 | New Zealand | NZ | | | pub.1004307959 | ur.07523657762.60 | Tim J. | Anderson | 8 | 3 |
| 107 | Christchurch Hospital | grid.414299.3 | Christchurch | 2192360.0 | New Zealand | NZ | | | pub.1004307959 | ur.07523657762.60 | Tim J. | Anderson | 8 | 3 |
| 128 | University of Waikato | grid.49481.30 | Hamilton | 2190320.0 | New Zealand | NZ | | | pub.1003242215 | ur.01275740267.59 | Mathew G. | Allan | 3 | 2 |
| 137 | University of Waikato | grid.49481.30 | Hamilton | 2190320.0 | New Zealand | NZ | | | pub.1003242215 | ur.010451702061.39 | David P. | Hamilton | 3 | 2 |


## Next steps 

* The list of publications with `grids_unique > 1` can be used for further analyses of affiliation names or cities.
* The same list can be used to identify funders, e.g. `search publications where id in [.....] return funders`.
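
As a rough sketch of the funders step, the helper below turns a list of publication IDs into one or more `return funders` queries. The function name `funders_queries` and the default `chunk_size` of 300 are assumptions, chosen only to keep each query reasonably short; the list syntax follows the `id in [.....]` pattern mentioned above.

```python
import json

def funders_queries(pub_ids, chunk_size=300):
    """Yield one `return funders` query per chunk of publication ids."""
    unique_ids = list(dict.fromkeys(pub_ids))  # de-duplicate while keeping order
    for start in range(0, len(unique_ids), chunk_size):
        chunk = unique_ids[start:start + chunk_size]
        # json.dumps renders the chunk as ["pub.1", "pub.2"]
        yield "search publications where id in {} return funders limit 1000".format(
            json.dumps(chunk)
        )

# Two ids from the table above (one repeated), standing in for the
# pub_id column filtered on grids_unique > 1
ids = ["pub.1004307959", "pub.1004307959", "pub.1003242215"]
for q in funders_queries(ids, chunk_size=2):
    print(q)
# search publications where id in ["pub.1004307959", "pub.1003242215"] return funders limit 1000
```

If the IDs are split over several queries, the funder counts returned by each query would need to be summed per funder afterwards.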