# Collaborations

Examining worldwide collaborations on academic articles relating to the 2020 coronavirus outbreak.

## Source Datasets
* COVID-19 Open Research Dataset Challenge
    * *https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge*
    * 29,000 full text scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses

* Google Geocode API

## Methods

### Stage 1
The first stage of the analysis took place in a separate Kaggle notebook (https://www.kaggle.com/jeffmacinnes/covid-collaborations/edit), which walks through the COVID-19 Open Research Dataset of 29,000+ academic articles and creates the following data files:

1. `allAuthors.csv`: a table showing the **article id**, **author name**, and **author institution** of every author on every article in the corpus. Authors without an associated institution were removed.

2. `allArticles.csv`: a table showing the **article id**, **publication source**, **doi**, and **pubmed_id** of each article in the corpus.

3. `institutions.csv`: a table showing the **institution name**, **address**, **lat**, and **lng** of every institution represented in the corpus. The institution info was obtained by first taking the `institution` name from each author affiliation in the article metadata. Next, the address and lat/lng of each institution were obtained by passing the institution name to the Google Geocode API. This approach returned a geocoded location for approximately 85% of the 12,000+ unique institutions (*however, see **limitations** below*). Institutions that could not be matched to a location were removed from the dataset.
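
The geocoding code itself lives in the separate Stage 1 notebook and is not reproduced here. As a rough sketch of what a single lookup might involve, the snippet below builds a request URL for the Google Geocoding API's JSON endpoint and extracts the address and coordinates from a response. The helper names (`build_geocode_url`, `parse_geocode_response`), the `YOUR_API_KEY` placeholder, and the hand-written sample response are illustrative assumptions, not the code that was actually run.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(institution, api_key):
    # Hypothetical helper: the institution name is sent as the free-text address
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": institution, "key": api_key})


def parse_geocode_response(payload):
    # Return (address, lat, lng) for the first match, or None when nothing matched
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]
    loc = top["geometry"]["location"]
    return top["formatted_address"], loc["lat"], loc["lng"]


# Hand-written response in the documented shape; the values are borrowed from
# the institutions.csv preview shown later in this notebook
sample = {
    "status": "OK",
    "results": [{
        "formatted_address": "23 Chemin des Capelles, 31300 Toulouse, France",
        "geometry": {"location": {"lat": 43.598014, "lng": 1.380306}},
    }],
}

url = build_geocode_url("Ecole Nationale Vétérinaire de Toulouse", "YOUR_API_KEY")
match = parse_geocode_response(sample)
no_match = parse_geocode_response({"status": "ZERO_RESULTS", "results": []})
```

Returning `None` for an unmatched name makes it straightforward to drop those institutions afterwards, which is how the roughly 15% of unmatched names are handled above.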

### Stage 2
This notebook uses the source data created in Stage 1 to calculate all cross-institute collaborations present across this body of literature. 

A *collaboration* is defined as **any two authors working together on the same article**. A *cross-institute collaboration* is a collaboration in which those two authors are affiliated with separate institutions. For instance, given the following hypothetical author list on a given article:

| Author | Institution |
| ------ | ------ |
| Jane Doe | Institute A |
| John Doe | Institute B |
| Jackie Doe | Institute B |
| Jerimiah Doe | Institute C |

the *cross-institute collaborations* on this article would be:

| Collaborations |  |
| ------ | ------ |
| Institute A | Institute B |
| Institute A | Institute C |
| Institute B | Institute C |

This notebook is used to assemble a dataframe of all cross-institute collaborations found in this body of literature.
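
As a minimal sketch of the definition above, the hypothetical author list can be reduced to its unique institutions and then expanded into every unordered pair with `itertools.combinations`. The variable names here are illustrative only.

```python
from itertools import combinations

# Hypothetical author list from the example table
authors = [
    ("Jane Doe", "Institute A"),
    ("John Doe", "Institute B"),
    ("Jackie Doe", "Institute B"),
    ("Jerimiah Doe", "Institute C"),
]

# Keep each institution once, in order of first appearance
institutions = list(dict.fromkeys(inst for _, inst in authors))

# Every unordered pair of distinct institutions is one cross-institute collaboration
collabs = list(combinations(institutions, 2))

for inst_a, inst_b in collabs:
    print(inst_a, "|", inst_b)
```

Because Institute B is deduplicated before pairing, the two authors from Institute B contribute a single A-B and a single B-C collaboration rather than two of each.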

## Limitations
* Due to the way institutions are named in the article metadata, `institutions.csv` may contain multiple instances of the same institution. For example, "University of California-Berkeley" is considered a separate institution from "University of California at Berkeley", and so on. However, since these two instances would have the same lat/lng coordinates, this is not a problem for the visualization.

* The institution name was used to obtain a geocoded location for that institution, so mapping the collaborations on a given article to a specific location is only as precise as the institution names listed on that article. For example, if an author works at the University of California - Berkeley, but the institution name is listed only as 'University of California', the author will be placed at whichever UC campus the Google Geocode API returns for 'University of California'.
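
To illustrate the first limitation, one way to spot name variants is to group institutions by their coordinates. This is only a sketch on invented data: the toy coordinates are made up, and it assumes the geocoder returns exactly identical coordinates for each spelling, which may not always hold.

```python
import pandas as pd

# Toy stand-in for institutions.csv; the coordinates are invented for illustration
toy_inst_df = pd.DataFrame({
    "institution": [
        "University of California-Berkeley",
        "University of California at Berkeley",
        "University of Freiburg",
    ],
    "lat": [37.87, 37.87, 47.99],
    "lng": [-122.26, -122.26, 7.85],
})

# Collect the names that landed on exactly the same point
variants = (
    toy_inst_df.groupby(["lat", "lng"])["institution"]
    .apply(list)
    .reset_index(name="names")
)

# Locations reached by more than one spelling
duplicated = variants[variants["names"].str.len() > 1]
```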

In [1]:
import pandas as pd
import numpy as np
import os

from os.path import join
from itertools import combinations

import matplotlib.pyplot as plt

In [2]:
dataDir = '../data'

In [3]:
author_df = pd.read_csv(join(dataDir, 'allAuthors.csv'))

In [4]:
author_df.head()

Unnamed: 0,paper_id,firstName,lastName,middle,institution
0,25621281691205eb015383cbac839182b838514f,Dominik,Dornfeld,,University of Freiburg
1,25621281691205eb015383cbac839182b838514f,Alexandra,Dudek,H,University of Freiburg
2,25621281691205eb015383cbac839182b838514f,Sira,Günther,C,University of Freiburg
3,25621281691205eb015383cbac839182b838514f,Judd,Hultquist,F,University of California
4,25621281691205eb015383cbac839182b838514f,Sebastian,Giese,,University of Freiburg


# Prep Institutions data

In [78]:
inst_df = pd.read_csv(join(dataDir, 'institutions.csv'))

In [79]:
inst_df.head()

Unnamed: 0,institution,addr,lat,lng
0,CHU de Bicê tre,"78 Rue du Général Leclerc, 94270 Le Kremlin-Bi...",48.810596,2.35225
1,Ecole Nationale Vétérinaire de Toulouse,"23 Chemin des Capelles, 31300 Toulouse, France",43.598014,1.380306
2,"Zhuhai United Laboratories Co., Ltd","Yulin Rd, Jinwan Qu, Zhuhai Shi, Guangdong She...",22.038476,113.33548
3,Hedong District,"Hedong, Tianjin, China",39.128291,117.251586
4,National United University,"360, Taiwan, Miaoli County, Miaoli City, 恭敬里聯大1號",24.545804,120.81232


In [80]:
inst_df['instID'] = inst_df.index

In [82]:
inst_df = inst_df.drop(columns=['addr'])

In [94]:
inst_df.to_json(join(dataDir, 'allInstitutions.json'), orient='records', indent=0)

## Define function to return a dataframe of cross-institute collabs on given article

In [5]:
test_id = author_df.loc[0, 'paper_id']
test_df = author_df[author_df['paper_id'] == test_id]

In [6]:
test_df

Unnamed: 0,paper_id,firstName,lastName,middle,institution
0,25621281691205eb015383cbac839182b838514f,Dominik,Dornfeld,,University of Freiburg
1,25621281691205eb015383cbac839182b838514f,Alexandra,Dudek,H,University of Freiburg
2,25621281691205eb015383cbac839182b838514f,Sira,Günther,C,University of Freiburg
3,25621281691205eb015383cbac839182b838514f,Judd,Hultquist,F,University of California
4,25621281691205eb015383cbac839182b838514f,Sebastian,Giese,,University of Freiburg
5,25621281691205eb015383cbac839182b838514f,Daria,Khokhlova-Cubberley,,Zymo Research Corp
6,25621281691205eb015383cbac839182b838514f,Yap,Chew,C,Zymo Research Corp
7,25621281691205eb015383cbac839182b838514f,Nevan,Krogan,J,University of California
8,25621281691205eb015383cbac839182b838514f,Martin,Schwemmle,,University of Freiburg


In [7]:
instList = test_df['institution'].unique().tolist()

uniqueInsts = []
for combo in combinations(instList, 2):
    uniqueInsts.append([test_id, combo[0], combo[1]])

collabs_df = pd.DataFrame(uniqueInsts, columns=['paper_id', 'institute A', 'institute B'])
collabs_df

Unnamed: 0,paper_id,institute A,institute B
0,25621281691205eb015383cbac839182b838514f,University of Freiburg,University of California
1,25621281691205eb015383cbac839182b838514f,University of Freiburg,Zymo Research Corp
2,25621281691205eb015383cbac839182b838514f,University of California,Zymo Research Corp


# Apply func to full article dataframe

In [8]:
def getArticleCollabs(article_df):
    # return a dataframe of unique cross-institute collaborations on this article
    instList = article_df['institution'].unique().tolist()
    
    # remove nans and strip leading/trailing whitespace
    instList = [x.strip() for x in instList if str(x) != 'nan']
    uniqueInsts = []
    for combo in combinations(instList, 2):
        uniqueInsts.append([combo[0], combo[1]])

    return pd.DataFrame(uniqueInsts, columns=['institute A', 'institute B'])

In [9]:
collabs_df = author_df.groupby('paper_id').apply(getArticleCollabs).reset_index()

In [10]:
collabs_df.shape

(35818, 4)

This is still a lot to visualize. One option is to remove duplicate pairs so that each institution has a single connection to every other institution it collaborates with. The number of articles two institutions collaborated on could then be represented by the color or thickness of the arc.
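
A possible way to do that collapsing, sketched on toy data with the same column names as `collabs_df`: sort each pair so that its order does not matter, then count distinct articles per pair. The `inst_lo`, `inst_hi`, and `n_articles` column names are assumptions introduced for this example.

```python
import pandas as pd

# Toy collaborations table with the same column names as collabs_df
toy_collabs = pd.DataFrame({
    "paper_id": ["p1", "p2", "p3", "p3"],
    "institute A": ["L", "M", "M", "O"],
    "institute B": ["M", "L", "A", "B"],
})

# Sort each pair so that (L, M) and (M, L) end up as the same pair
pairs = [sorted(p) for p in zip(toy_collabs["institute A"], toy_collabs["institute B"])]
toy_collabs["inst_lo"] = [p[0] for p in pairs]
toy_collabs["inst_hi"] = [p[1] for p in pairs]

# Number of distinct articles shared by each pair of institutions
pair_counts = (
    toy_collabs.groupby(["inst_lo", "inst_hi"])["paper_id"]
    .nunique()
    .reset_index(name="n_articles")
)
```

Keeping the two names in separate columns, rather than joining them into one string, avoids ambiguity for institution names that themselves contain commas.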

In [11]:
collabs_df.head(10)

Unnamed: 0,paper_id,level_1,institute A,institute B
0,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,0,Chi Mei Medical Center,National Taiwan University Hospital
1,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,1,Chi Mei Medical Center,New Taipei City
2,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,2,Chi Mei Medical Center,National Taiwan University College of Medicine
3,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,3,National Taiwan University Hospital,New Taipei City
4,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,4,National Taiwan University Hospital,National Taiwan University College of Medicine
5,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,5,New Taipei City,National Taiwan University College of Medicine
6,001d8d54a7e73e761f779c81661595cc5ae2ca08,0,University of Antwerp,Vrije Universiteit Amsterdam
7,001ec025dd65db8a47827c8f49d9c41d60f35d00,0,University of Murcia,University of Zagreb
8,001ec025dd65db8a47827c8f49d9c41d60f35d00,1,University of Murcia,Uludag University
9,001ec025dd65db8a47827c8f49d9c41d60f35d00,2,University of Murcia,University of Evora


## Try to remove any collaboration that is not represented by a valid institution
(as defined by the institutions df created above)

In [63]:
def isValidInst(instName):
    # check if the given name exactly matches an institution in inst_df
    try:
        return (inst_df['institution'] == instName).any()
    except Exception:
        return False

def areInstitutesValid(row):
    for inst in ['institute A', 'institute B']:
        if not isValidInst(row[inst]):
            return False
    return True


In [64]:
collabs_df['validInstitutes'] = collabs_df.apply(areInstitutesValid, axis=1)  # takes ~5 min

In [65]:
collabs_df = collabs_df[collabs_df['validInstitutes']]

In [66]:
collabs_df.shape

(30496, 5)
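
The row-by-row check above scans `inst_df` twice for every collaboration, which is likely why it takes around 5 minutes. As a sketch of a faster alternative, the same exact-match test can be expressed with `Series.isin` on whole columns. The toy frames below are stand-ins for `inst_df` and `collabs_df`, not the real data.

```python
import pandas as pd

# Toy stand-ins for inst_df and collabs_df, using the same column names
toy_inst_df = pd.DataFrame({"institution": ["University of Freiburg", "Zymo Research Corp"]})
toy_collabs = pd.DataFrame({
    "institute A": ["University of Freiburg", "University of Freiburg"],
    "institute B": ["Zymo Research Corp", "Ludwig-Maximilians-University"],
})

# Exact-match membership test on whole columns, with no Python-level loop over rows
known = set(toy_inst_df["institution"])
valid = toy_collabs["institute A"].isin(known) & toy_collabs["institute B"].isin(known)

toy_valid_collabs = toy_collabs[valid]
```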

In [67]:
collabs_df.head()

Unnamed: 0,paper_id,level_1,institute A,institute B,validInstitutes
0,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,0,Chi Mei Medical Center,National Taiwan University Hospital,True
1,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,1,Chi Mei Medical Center,New Taipei City,True
2,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,2,Chi Mei Medical Center,National Taiwan University College of Medicine,True
3,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,3,National Taiwan University Hospital,New Taipei City,True
4,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,4,National Taiwan University Hospital,National Taiwan University College of Medicine,True


## Add lat/lng for each inst

In [68]:
inst = 'Ludwig-Maximilians-University'

isValidInst(inst)

False

In [59]:
(inst_df['institution'] == 'University of Tiradentes').any()

True

In [56]:
a

Unnamed: 0,institution,addr,lat,lng


In [48]:
a['lat']

32.111154

In [69]:
def getLngLat(inst):
    # get the [lng, lat] coordinates of the first record matching the specified institution
    instRecord = inst_df[inst_df['institution'] == inst].iloc[0]
    return [instRecord['lng'], instRecord['lat']]
    

In [70]:
collabs_df['instA_coords'] = collabs_df['institute A'].apply(getLngLat)

In [71]:
collabs_df['instB_coords'] = collabs_df['institute B'].apply(getLngLat)
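
The same lookup could also be done without filtering `inst_df` once per row. The sketch below builds a name-to-coordinates Series once and maps it onto each column; the toy data (including the deliberately duplicated name with `0.0` coordinates) is invented for illustration.

```python
import pandas as pd

# Toy stand-ins for inst_df and collabs_df; the second "New Taipei City" row is made up
toy_inst_df = pd.DataFrame({
    "institution": ["Chi Mei Medical Center", "New Taipei City", "New Taipei City"],
    "lat": [23.0207771, 25.0169826, 0.0],
    "lng": [120.2219369, 121.4627868, 0.0],
})
toy_collabs = pd.DataFrame({
    "institute A": ["Chi Mei Medical Center"],
    "institute B": ["New Taipei City"],
})

# One row per institution name; keeping the first mirrors the .iloc[0] in getLngLat
lookup = toy_inst_df.drop_duplicates("institution").set_index("institution")

# Build a name -> [lng, lat] Series once, then map it onto each column
coords = pd.Series(lookup[["lng", "lat"]].values.tolist(), index=lookup.index)
toy_collabs["instA_coords"] = toy_collabs["institute A"].map(coords)
toy_collabs["instB_coords"] = toy_collabs["institute B"].map(coords)
```

Unlike `getLngLat`, a name missing from the lookup yields `NaN` here instead of raising an `IndexError`, so this variant assumes invalid institutions have already been filtered out.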

In [72]:
collabs_df.head()

Unnamed: 0,paper_id,level_1,institute A,institute B,validInstitutes,instA_coords,instB_coords
0,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,0,Chi Mei Medical Center,National Taiwan University Hospital,True,"[120.22193689999999, 23.0207771]","[121.5189863, 25.0407391]"
1,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,1,Chi Mei Medical Center,New Taipei City,True,"[120.22193689999999, 23.0207771]","[121.4627868, 25.0169826]"
2,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,2,Chi Mei Medical Center,National Taiwan University College of Medicine,True,"[120.22193689999999, 23.0207771]","[121.51953259999999, 25.0395902]"
3,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,3,National Taiwan University Hospital,New Taipei City,True,"[121.5189863, 25.0407391]","[121.4627868, 25.0169826]"
4,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,4,National Taiwan University Hospital,National Taiwan University College of Medicine,True,"[121.5189863, 25.0407391]","[121.51953259999999, 25.0395902]"


In [73]:
collabs_df = collabs_df.drop(['level_1', 'validInstitutes'], axis=1)

### Add inst ID to each

In [86]:
def getID(inst):
    # get the instID of the first record matching the specified institution
    instRecord = inst_df[inst_df['institution'] == inst].iloc[0]
    return instRecord['instID']

In [87]:
collabs_df['instA_id'] = collabs_df['institute A'].apply(getID)

In [88]:
collabs_df['instB_id'] = collabs_df['institute B'].apply(getID)

In [89]:
collabs_df.head()

Unnamed: 0,paper_id,institute A,institute B,instA_coords,instB_coords,instA_id,instB_id
0,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,Chi Mei Medical Center,National Taiwan University Hospital,"[120.22193689999999, 23.0207771]","[121.5189863, 25.0407391]",1134,5009
1,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,Chi Mei Medical Center,New Taipei City,"[120.22193689999999, 23.0207771]","[121.4627868, 25.0169826]",1134,10160
2,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,Chi Mei Medical Center,National Taiwan University College of Medicine,"[120.22193689999999, 23.0207771]","[121.51953259999999, 25.0395902]",1134,6086
3,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,National Taiwan University Hospital,New Taipei City,"[121.5189863, 25.0407391]","[121.4627868, 25.0169826]",5009,10160
4,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,National Taiwan University Hospital,National Taiwan University College of Medicine,"[121.5189863, 25.0407391]","[121.51953259999999, 25.0395902]",5009,6086


In [90]:
collabs_df = collabs_df.drop(columns=["institute A", "institute B"])

### Write out to disk

In [91]:
collabs_df.to_json(join(dataDir, 'allCollabs.json'), orient='records', indent=0)
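
For reference, `orient='records'` writes one JSON object per row. The sketch below shows the expected record shape on a single toy row built from the values displayed above; it serializes to a string rather than to `allCollabs.json`.

```python
import json
import pandas as pd

# One toy row with the same columns as the final collabs_df
toy_collabs = pd.DataFrame({
    "paper_id": ["000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a"],
    "instA_coords": [[120.2219369, 23.0207771]],
    "instB_coords": [[121.5189863, 25.0407391]],
    "instA_id": [1134],
    "instB_id": [5009],
})

# With no path argument, to_json returns the JSON text instead of writing a file
records = json.loads(toy_collabs.to_json(orient="records"))
```

Each record carries the article id, a `[lng, lat]` pair for each institution, and the two institution IDs, which can be joined against `allInstitutions.json` on `instID`.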

# Sandbox

In [92]:
inst_df

Unnamed: 0,institution,lat,lng,instID
0,CHU de Bicê tre,48.810596,2.352250,0
1,Ecole Nationale Vétérinaire de Toulouse,43.598014,1.380306,1
2,"Zhuhai United Laboratories Co., Ltd",22.038476,113.335480,2
3,Hedong District,39.128291,117.251586,3
4,National United University,24.545804,120.812320,4
...,...,...,...,...
10779,the University of Texas MD Anderson Cancer Center,29.706815,-95.397155,10779
10780,Yamaguchi University,34.148619,131.467873,10780
10781,Universidad Católica del Norte,-23.681115,-70.410535,10781
10782,Institut Pasteur de Bangui,4.373818,18.573630,10782


In [103]:
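# sortInstitutes is missing from the exported notebook; this version is an
# assumption reconstructed from the 'sorted' column shown below
def sortInstitutes(row):
    return ','.join(sorted([row['institute A'], row['institute B']]))

test = {
    "institute A": ["L", "M", "M", "O", "P"],
    "institute B": ["M", "A", "L ", "B", "C"]
}

test_df = pd.DataFrame(test)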

In [104]:
test_df['sorted'] = test_df.apply(sortInstitutes, axis=1)

In [105]:
test_df

Unnamed: 0,institute A,institute B,sorted
0,L,M,"L,M"
1,M,A,"A,M"
2,M,L,"L ,M"
3,O,B,"B,O"
4,P,C,"C,P"


In [106]:
tesCount_df = test_df.groupby('sorted').count().reset_index()

In [107]:
tesCount_df

Unnamed: 0,sorted,institute A,institute B
0,"A,M",1,1
1,"B,O",1,1
2,"C,P",1,1
3,"L ,M",1,1
4,"L,M",1,1


# Projection Resources

* https://github.com/vasturiano/three-globe
* https://medium.com/@xiaoyangzhao/drawing-curves-on-webgl-globe-using-three-js-and-d3-draft-7e782ffd7ab
* http://learningthreejs.com/blog/2013/09/16/how-to-make-the-earth-in-webgl/
* rotate camera to look at specific point: https://discourse.threejs.org/t/solved-rotate-camera-to-face-point-on-sphere/1676
* rotate camera to specific lat/lng: https://stackoverflow.com/questions/35465654/rotating-a-sphere-to-a-specific-point-not-the-camera
* design inspo: http://shinyeonpark.com/project/ana-flight-connections/