# Collaborations

Examining worldwide collaborations on academic articles relating to the 2020 coronavirus outbreak.

## Source Datasets
* COVID-19 Open Research Dataset Challenge
    * *https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge*
    * 29,000 full text scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses

* Google Geocode API

## Methods

### Stage 1
The first stage of the analysis took place in a separate Kaggle notebook (https://www.kaggle.com/jeffmacinnes/covid-collaborations/edit) that was used to walk through the COVID-19 Open Research Dataset of 29,000+ academic articles, and create the following datafiles:

1. `allAuthors.csv`: a table showing the **article id**, **author name**, and **author institution** of every author on every article in the corpus. Authors without an associated institution were removed.

2. `allArticles.csv`: a table showing the **article id**, **publication source**, **doi**, and **pubmed_id** of each article in the corpus.

3. `institutions.csv`: a table showing the **institution name**, **address**, **lat**, and **lng** of every institution represented in the corpus. The institution info was obtained by first grabbing the `institution` name from each author affiliation in the metadata for each article in the corpus. Next, the address and lat/lng of each institution were obtained by passing the institution name to the Google Geocode API. This approach returned a geocoded location for approximately 85% of the 12,000+ unique institutions (*however, see **limitations** below*). Institutions that could not be matched to a location were removed from the dataset.
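The Stage 1 notebook is not reproduced here, so the following is only a minimal sketch of how a single institution name could be geocoded, not the code that was actually run. It assumes the standard JSON endpoint of the Google Geocoding web service and a valid API key; the function names `parse_geocode_response` and `geocode_institution` are hypothetical, and the sample payload is made up so the parsing step can be shown offline.

```python
import json
import urllib.parse
import urllib.request

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def parse_geocode_response(payload):
    """Return (address, lat, lng) from a geocoding payload, or None if unmatched."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    top = payload["results"][0]  # keep only the first match
    loc = top["geometry"]["location"]
    return top["formatted_address"], loc["lat"], loc["lng"]


def geocode_institution(name, api_key):
    """Look up one institution name (needs network access and a valid key)."""
    query = urllib.parse.urlencode({"address": name, "key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}") as resp:
        return parse_geocode_response(json.load(resp))


# Offline illustration with a hand-written payload (all values are made up).
sample = {
    "status": "OK",
    "results": [{
        "formatted_address": "Example Street 1, Example City",
        "geometry": {"location": {"lat": 48.0, "lng": 7.8}},
    }],
}
matched = parse_geocode_response(sample)
unmatched = parse_geocode_response({"status": "ZERO_RESULTS", "results": []})
```

Returning `None` for an unmatched name makes it straightforward to drop institutions without a location, as described above.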

### Stage 2
This notebook uses the source data created in Stage 1 to calculate all cross-institute collaborations present across this body of literature. 

A *collaboration* is defined as **any two authors working together on the same article**. A *cross-institute collaboration* is a collaboration in which those two authors are affiliated with separate institutions. For instance, given the following hypothetical author list on a given article:

| Author | Institution |
| ------ | ------ |
| Jane Doe | Institute A |
| John Doe | Institute B |
| Jackie Doe | Institute B |
| Jerimiah Doe | Institute C |

the *cross-institute collaborations* on this article would be:

| Institution 1 | Institution 2 |
| ------ | ------ |
| Institute A | Institute B |
| Institute A | Institute C |
| Institute B | Institute C |
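As a small worked example, the pairs for this hypothetical author list can be derived with `itertools.combinations`. The author tuples below simply restate the table above; the variable names are illustrative.

```python
from itertools import combinations

# Hypothetical author list from the table above: (author, institution)
authors = [
    ("Jane Doe", "Institute A"),
    ("John Doe", "Institute B"),
    ("Jackie Doe", "Institute B"),
    ("Jerimiah Doe", "Institute C"),
]

# Keep each institution once, in order of first appearance
institutions = list(dict.fromkeys(inst for _, inst in authors))

# Every unordered pair of distinct institutions is one cross-institute collaboration
pairs = list(combinations(institutions, 2))
print(pairs)
# [('Institute A', 'Institute B'), ('Institute A', 'Institute C'), ('Institute B', 'Institute C')]
```

Note that the two authors from Institute B contribute only one institution to the list, so the pair Institute B / Institute B never appears.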

This notebook assembles a dataframe of all cross-institute collaborations found in this body of literature.

## Limitations
* Due to the way institutions are named in the article metadata, `institutions.csv` may contain multiple instances of the same institution. For example, "University of California-Berkeley" is treated as a separate institution from "University of California at Berkeley", and so on. However, since these two instances would have the same lat/lng coordinates, this is not a problem for the visualization.

* The institution name was used to obtain a geocoded location for that institution, so mapping the collaborations on a given article to a specific location is only as precise as the institution names listed on that article. For example, if an author works at the University of California - Berkeley, but the institution name is only listed as 'University of California', the author will be placed at whatever UC campus the Google Geocode API returns for 'University of California'.
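To gauge how often the first limitation occurs, one option is to group institutions by their coordinates and list the names that share a location. The sketch below is not part of the original analysis: the column names `institution`, `lat`, and `lng` are assumed (the exact headers of `institutions.csv` are not shown here), and the rows and coordinates are illustrative only.

```python
import pandas as pd

# Hypothetical rows standing in for institutions.csv (column names assumed)
institutions = pd.DataFrame({
    "institution": [
        "University of California-Berkeley",
        "University of California at Berkeley",
        "University of Freiburg",
    ],
    "lat": [37.87, 37.87, 47.99],
    "lng": [-122.26, -122.26, 7.85],
})

# Round coordinates so tiny floating-point differences do not split a group
coords = institutions[["lat", "lng"]].round(4)

# Collect every institution name that maps to the same (lat, lng) pair
aliases = (
    institutions.assign(lat=coords["lat"], lng=coords["lng"])
    .groupby(["lat", "lng"])["institution"]
    .agg(list)
)

# Locations shared by more than one name are likely naming variants
shared = aliases[aliases.map(len) > 1]
```

Two genuinely different institutions can also share an address, so a list like `shared` is a starting point for manual review rather than an automatic merge.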

In [1]:
import pandas as pd
import numpy as np
import os

from os.path import join
from itertools import combinations

In [2]:
dataDir = '../data'

In [3]:
author_df = pd.read_csv(join(dataDir, 'allAuthors.csv'))

In [4]:
author_df.head()

Unnamed: 0,paper_id,firstName,lastName,middle,institution
0,25621281691205eb015383cbac839182b838514f,Dominik,Dornfeld,,University of Freiburg
1,25621281691205eb015383cbac839182b838514f,Alexandra,Dudek,H,University of Freiburg
2,25621281691205eb015383cbac839182b838514f,Sira,Günther,C,University of Freiburg
3,25621281691205eb015383cbac839182b838514f,Judd,Hultquist,F,University of California
4,25621281691205eb015383cbac839182b838514f,Sebastian,Giese,,University of Freiburg


## Define a function to return a dataframe of cross-institute collabs on a given article

In [8]:
test_id = author_df.loc[0, 'paper_id']
test_df = author_df[author_df['paper_id'] == test_id]

In [9]:
test_df

Unnamed: 0,paper_id,firstName,lastName,middle,institution
0,25621281691205eb015383cbac839182b838514f,Dominik,Dornfeld,,University of Freiburg
1,25621281691205eb015383cbac839182b838514f,Alexandra,Dudek,H,University of Freiburg
2,25621281691205eb015383cbac839182b838514f,Sira,Günther,C,University of Freiburg
3,25621281691205eb015383cbac839182b838514f,Judd,Hultquist,F,University of California
4,25621281691205eb015383cbac839182b838514f,Sebastian,Giese,,University of Freiburg
5,25621281691205eb015383cbac839182b838514f,Daria,Khokhlova-Cubberley,,Zymo Research Corp
6,25621281691205eb015383cbac839182b838514f,Yap,Chew,C,Zymo Research Corp
7,25621281691205eb015383cbac839182b838514f,Nevan,Krogan,J,University of California
8,25621281691205eb015383cbac839182b838514f,Martin,Schwemmle,,University of Freiburg


In [22]:
instList = test_df['institution'].unique().tolist()

uniqueInsts = []
for combo in combinations(instList, 2):
    uniqueInsts.append([test_id, combo[0], combo[1]])

collabs_df = pd.DataFrame(uniqueInsts, columns=['paper_id', 'institute A', 'institute B'])
collabs_df

Unnamed: 0,paper_id,institute A,institute B
0,25621281691205eb015383cbac839182b838514f,University of Freiburg,University of California
1,25621281691205eb015383cbac839182b838514f,University of Freiburg,Zymo Research Corp
2,25621281691205eb015383cbac839182b838514f,University of California,Zymo Research Corp


## Apply func to full article dataframe

In [43]:
def getArticleCollabs(article_df):
    # Return a dataframe of the unique cross-institute collaborations on one article.
    # article_df holds the author rows for a single paper_id.
    instList = article_df['institution'].unique().tolist()

    # Each unordered pair of distinct institutions is one collaboration
    uniqueInsts = [[instA, instB] for instA, instB in combinations(instList, 2)]

    return pd.DataFrame(uniqueInsts, columns=['institute A', 'institute B'])

In [49]:
collabs_df = author_df.groupby('paper_id').apply(getArticleCollabs).reset_index()

In [50]:
collabs_df.shape

(38085, 4)

In [51]:
author_df.shape

(109276, 5)

In [52]:
collabs_df.dropna(inplace=True)

In [53]:
collabs_df.shape

(35818, 4)

This is still a lot to visualize, so it may be worth removing duplicates and drawing a single connection between each pair of collaborating institutions instead. The number of articles two institutions collaborated on could then be represented by the color or thickness of the arc.
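One possible way to do this, sketched under the assumption that the collaborations are held in a dataframe shaped like `collabs_df`, is to sort each pair so that (A, B) and (B, A) are treated as the same connection, then count the rows per pair. The rows below are hypothetical stand-ins and the column name `n_articles` is made up for illustration.

```python
import pandas as pd

# Small stand-in for collabs_df (hypothetical rows)
collabs = pd.DataFrame({
    "paper_id": ["p1", "p2", "p3"],
    "institute A": ["University of Freiburg", "Zymo Research Corp", "University of Freiburg"],
    "institute B": ["Zymo Research Corp", "University of Freiburg", "University of California"],
})

# Sort each pair alphabetically so (A, B) and (B, A) become identical
pairs = [tuple(sorted(p)) for p in zip(collabs["institute A"], collabs["institute B"])]
ordered = pd.DataFrame(pairs, columns=["institute A", "institute B"])

# One row per institution pair, weighted by the number of shared articles
weights = (
    ordered.groupby(["institute A", "institute B"])
    .size()
    .reset_index(name="n_articles")
)
```

The `n_articles` column could then drive the color or thickness of each arc.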

In [54]:
collabs_df.head(100)

Unnamed: 0,paper_id,level_1,institute A,institute B
0,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,0,Chi Mei Medical Center,National Taiwan University Hospital
1,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,1,Chi Mei Medical Center,New Taipei City
2,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,2,Chi Mei Medical Center,National Taiwan University College of Medicine
3,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,3,National Taiwan University Hospital,New Taipei City
4,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,4,National Taiwan University Hospital,National Taiwan University College of Medicine
5,000affa746a03f1fe4e3b3ef1a62fdfa9b9ac52a,5,New Taipei City,National Taiwan University College of Medicine
8,001d8d54a7e73e761f779c81661595cc5ae2ca08,0,University of Antwerp,Vrije Universiteit Amsterdam
9,001ec025dd65db8a47827c8f49d9c41d60f35d00,0,University of Murcia,University of Zagreb
10,001ec025dd65db8a47827c8f49d9c41d60f35d00,1,University of Murcia,Uludag University
11,001ec025dd65db8a47827c8f49d9c41d60f35d00,2,University of Murcia,University of Evora


## Projection Resources

* https://github.com/vasturiano/three-globe
* https://medium.com/@xiaoyangzhao/drawing-curves-on-webgl-globe-using-three-js-and-d3-draft-7e782ffd7ab
* http://learningthreejs.com/blog/2013/09/16/how-to-make-the-earth-in-webgl/
* rotate camera to look at specific point: https://discourse.threejs.org/t/solved-rotate-camera-to-face-point-on-sphere/1676
* rotate camera to specific lat/lng: https://stackoverflow.com/questions/35465654/rotating-a-sphere-to-a-specific-point-not-the-camera
* design inspo: http://shinyeonpark.com/project/ana-flight-connections/