# AI research drift

I quantify the research "jump" (delta) of AI researchers who shifted their attention to COVID-19. This notebook does the following:

- Read a table with bioRxiv, medRxiv and arXiv publications.
- Identify COVID-19 papers using a [keyword matching approach](https://blogs.cornell.edu/arxiv/2020/03/30/new-covid-19-quick-search/).
- Preprocess paper abstracts and find trigrams.
- Train a word2vec model, handpick ML-related terms and use a keyword matching approach to flag papers as AI.
- Create a dataframe with paper IDs and author IDs.
- Create a TFIDF projection of the AI papers on arXiv.
- Reduce the dimensionality of the TFIDF vectors with SVD.
- Fit UMAP to project the SVD vectors to 3D and visualise them.
- Identify the authors in arXiv that have at least 3 AI papers, at least one of which is on COVID-19.
- Develop a vector-based diversity metric.
- Measure the research diversity of authors without and with their COVID-19 contributions.
- Plot the delta.

In [1]:
%load_ext autoreload

In [7]:
%autoreload 2
%matplotlib inline

import pandas as pd
import numpy as np
import cord19
from cord19.utils.utils import get_engine
from cord19.transformers.nlp import tfidf_vectors
from cord19.transformers.dim_reduction import svd, umap_embeddings
from cord19.visualisation.plot import scatter_3d, bar_chart
from cord19.estimators.diversity import distance
from cord19.transformers.nlp import clean_and_tokenize
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Word2Vec

## Read *rxiv from DB

In [3]:
%%time
# Connect to db
con = get_engine(f"{cord19.project_dir}/innovation-mapping-5712.config")

# Read papers in chunks
columns = cord19.config["rxiv_columns"]
chunks = pd.read_sql_table("arxiv_articles", con, columns=columns, chunksize=1000)
papers = pd.concat(chunks)

# Drop index
papers = papers.reset_index(drop=True)

# Drop papers without a title or abstract
papers = papers.dropna(subset=["title", "abstract"])

# Keep the year from the publication date
papers["year"] = papers.created.apply(lambda x: x.year)

# # Store interim table
# papers.to_csv(f"{cord19.project_dir}/data/interim/papers.csv", index=False)

CPU times: user 2min 39s, sys: 21.9 s, total: 3min 1s
Wall time: 8min 44s


In [4]:
papers.head(1)

| | id | created | title | abstract | mag_id | citation_count | article_source | mag_authors | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 704.0001 | 2007-04-02 | Calculation of prompt diphoton production cros... | A fully differential calculation in perturba... | 1529925000.0 | 35.0 | arxiv | [{'author_id': 2303728598, 'author_name': 'Csa... | 2007 |


## Identify COVID-19 papers using this [query](https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=COVID-19&terms-0-field=title&terms-1-operator=OR&terms-1-term=SARS-CoV-2&terms-1-field=abstract&terms-3-operator=OR&terms-3-term=COVID-19&terms-3-field=abstract&terms-4-operator=OR&terms-4-term=SARS-CoV-2&terms-4-field=title&terms-5-operator=OR&terms-5-term=coronavirus&terms-5-field=title&terms-6-operator=OR&terms-6-term=coronavirus&terms-6-field=abstract&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=200&order=-announced_date_first&source=home-covid-19)

title=COVID-19; OR abstract=SARS-CoV-2; OR abstract=COVID-19; OR title=SARS-CoV-2; OR title=coronavirus; OR abstract=coronavirus

In [5]:
# Flag covid-19 papers
covid_keywords = cord19.config["keywords"]["covid_19"]
papers["is_Covid"] = [
    int(any(term in abstract or term in title for term in covid_keywords))
    for title, abstract in zip(papers["title"], papers["abstract"])
]

print(f"Total COVID-19 papers in *rxiv: {papers.is_Covid.sum()}")

Total COVID-19 papers in *rxiv: 4104
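The matching rule can be tried in isolation. The snippet below is a minimal sketch that assumes the three terms from the arXiv query (`COVID-19`, `SARS-CoV-2`, `coronavirus`) as the keyword list; the actual list is read from `cord19.config` and may differ. The helper `flag_covid` and the case-insensitive comparison are my own assumptions for illustration, not part of the notebook.

```python
# Hypothetical keyword list taken from the arXiv query; the real one lives in cord19.config.
COVID_TERMS = ("COVID-19", "SARS-CoV-2", "coronavirus")


def flag_covid(title, abstract, terms=COVID_TERMS):
    """Return 1 if any term occurs in the title or abstract (ignoring case), else 0."""
    text = f"{title} {abstract}".lower()
    return int(any(term.lower() in text for term in terms))


# Made-up titles and abstracts, for illustration only.
examples = [
    ("Forecasting the spread of COVID-19", "We fit an SEIR model."),
    ("Diphoton production", "A perturbative QCD calculation."),
    ("Chest X-ray triage", "Detecting Coronavirus pneumonia with CNNs."),
]
flags = [flag_covid(title, abstract) for title, abstract in examples]
print(flags)  # [1, 0, 1]
```

Lower-casing both sides avoids missing spellings such as "Coronavirus" at the start of a sentence, at the cost of being slightly more permissive than an exact substring match.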


## Preprocess abstracts

In [8]:
%%time
# Tokenise paper abstracts
abstracts = [clean_and_tokenize(d, remove_stops=True) for d in papers.abstract]

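# Create trigrams: learn bigrams first, then learn phrases on the bigram-merged corpus
phrases = Phrases(abstracts, min_count=5, threshold=10)
bigram = Phraser(phrases)
trigram = Phrases(bigram[abstracts], min_count=5, threshold=3)
# Apply the trigram model to the bigram-merged abstracts, which is what it was trained on
abstracts_with_ngrams = list(trigram[bigram[abstracts]])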

2020-05-23 15:28:11,904 - gensim.models.phrases - INFO - collecting all words and their counts
2020-05-23 15:28:11,910 - gensim.models.phrases - INFO - PROGRESS: at sentence #0, processed 0 words and 0 word types
2020-05-23 15:28:13,254 - gensim.models.phrases - INFO - PROGRESS: at sentence #10000, processed 681139 words and 510553 word types
2020-05-23 15:28:14,494 - gensim.models.phrases - INFO - PROGRESS: at sentence #20000, processed 1362977 words and 909150 word types
2020-05-23 15:28:15,678 - gensim.models.phrases - INFO - PROGRESS: at sentence #30000, processed 2024987 words and 1250529 word types
2020-05-23 15:28:16,928 - gensim.models.phrases - INFO - PROGRESS: at sentence #40000, processed 2687913 words and 1566950 word types
2020-05-23 15:28:18,107 - gensim.models.phrases - INFO - PROGRESS: at sentence #50000, processed 3371642 words and 1872857 word types
2020-05-23 15:28:19,317 - gensim.models.phrases - INFO - PROGRESS: at sentence #60000, processed 4047617 words and 21617
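To make the roles of `min_count` and `threshold` more concrete, here is a simplified, pure-Python sketch of two-pass phrase detection on a made-up corpus. It is not gensim's implementation: `learn_phrases` and `apply_phrases` are hypothetical helpers, the score follows the general form proposed by Mikolov et al. (2013), and details such as how the vocabulary size is counted are assumptions.

```python
from collections import Counter


def learn_phrases(sentences, min_count=1, threshold=0.5):
    """Return adjacent token pairs whose collocation score exceeds `threshold`."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    phrases = set()
    for (a, b), n_ab in bigrams.items():
        # Pairs that co-occur often relative to their individual counts score high.
        score = (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            phrases.add((a, b))
    return phrases


def apply_phrases(sentence, phrases):
    """Greedily merge known pairs from left to right, joining them with '_'."""
    merged, i = [], 0
    while i < len(sentence):
        pair = tuple(sentence[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged


# Toy corpus, for illustration only.
corpus = [["deep", "neural", "network"]] * 3 + [
    ["deep", "model"], ["network", "model"], ["model", "data"], ["data", "model"],
]

# First pass: learn and apply bigrams.
bigram_phrases = learn_phrases(corpus)
corpus_bigrams = [apply_phrases(s, bigram_phrases) for s in corpus]

# Second pass: learn phrases over the merged tokens to obtain trigrams.
trigram_phrases = learn_phrases(corpus_bigrams)
corpus_trigrams = [apply_phrases(s, trigram_phrases) for s in corpus_bigrams]
print(corpus_trigrams[0])  # ['deep_neural_network']
```

The second-pass model only knows pairs of already merged tokens such as `("deep_neural", "network")`, so it has to be applied to the output of the first pass; applied to the raw tokens it would leave `["deep", "neural", "network"]` unchanged.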

## Train word2vec

In [10]:
%%time
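# Train a word2vec model
# (`size` and `iter` are gensim 3.x argument names; gensim 4 renamed them to `vector_size` and `epochs`)
w2v = Word2Vec(
    abstracts_with_ngrams, size=300, window=10, min_count=5, seed=42, iter=2
)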

2020-05-23 16:05:49,438 - gensim.models.word2vec - INFO - collecting all words and their counts
2020-05-23 16:05:49,442 - gensim.models.word2vec - INFO - PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-05-23 16:05:49,694 - gensim.models.word2vec - INFO - PROGRESS: at sentence #10000, processed 559341 words, keeping 78389 word types
2020-05-23 16:05:49,973 - gensim.models.word2vec - INFO - PROGRESS: at sentence #20000, processed 1119377 words, keeping 121077 word types
2020-05-23 16:05:50,207 - gensim.models.word2vec - INFO - PROGRESS: at sentence #30000, processed 1662871 words, keeping 153314 word types
2020-05-23 16:05:50,458 - gensim.models.word2vec - INFO - PROGRESS: at sentence #40000, processed 2208663 words, keeping 180619 word types
2020-05-23 16:05:50,677 - gensim.models.word2vec - INFO - PROGRESS: at sentence #50000, processed 2771012 words, keeping 205197 word types
2020-05-23 16:05:50,903 - gensim.models.word2vec - INFO - PROGRESS: at sentence #60000,
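Handpicking ML-related terms from a trained model usually starts from a few seed terms and their nearest neighbours in the embedding space. The notebook does not show that step, so the following is only an illustrative sketch with made-up 3-dimensional vectors. `nearest_terms` is a hypothetical helper; with a trained gensim model the equivalent lookup would typically go through `w2v.wv.most_similar`.

```python
import numpy as np

# Made-up embeddings for illustration only; real vectors would come from the trained model.
vocab = ["machine_learning", "deep_learning", "neural_network", "galaxy", "protein"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.9, 0.2],
])


def nearest_terms(seed, vocab, vectors, topn=2):
    """Rank vocabulary terms by cosine similarity to the seed term."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(seed)]
    order = np.argsort(-sims)
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != seed]
    return ranked[:topn]


candidates = nearest_terms("machine_learning", vocab, vectors)
print([term for term, _ in candidates])  # ['deep_learning', 'neural_network']
```

The candidate list would still need manual review before being added to the keyword configuration, since nearest neighbours can include generic or off-topic terms.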

## Identify AI papers

In [11]:
ml_keywords = cord19.config["keywords"]["ai"]
papers["is_AI"] = [
    1 if any(k in tokens for k in ml_keywords) else 0
    for tokens in abstracts_with_ngrams
]
print(f"Total AI papers in *rxiv: {papers.is_AI.sum()}")

Total AI papers in *rxiv: 81675
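Unlike the COVID-19 flag, this check works on whole tokens rather than substrings, so multi-word keywords only match if phrase detection merged them into a single token. A minimal sketch of that behaviour follows; the keyword set below is hypothetical (the real one is read from `cord19.config["keywords"]["ai"]`), and the `_` delimiter assumes gensim's default for merged phrases.

```python
# Hypothetical keyword set; the real one is read from cord19.config["keywords"]["ai"].
AI_TERMS = {"machine_learning", "deep_learning", "neural_network"}


def flag_ai(tokens, terms=AI_TERMS):
    """Return 1 if any whole token is an AI keyword, else 0."""
    return int(not terms.isdisjoint(tokens))


# Made-up tokenised abstracts, for illustration only.
docs = [
    ["we", "train", "neural_network", "chest", "xray"],
    ["machine", "learning", "methods"],  # not merged into one token, so no match
    ["galaxy", "rotation", "curves"],
]
ai_flags = [flag_ai(tokens) for tokens in docs]
print(ai_flags)  # [1, 0, 0]
```

Keeping the keywords in a set also makes each lookup independent of the abstract length.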


In [None]:
# # Save papers, processed abstracts and models
# import pickle
#
# papers.to_csv(f"{cord19.project_dir}/data/interim/papers.csv", index=False)

# with open(
#     f"{cord19.project_dir}/data/interim/processed_abstracts.pickle", "wb"
# ) as h:
#     pickle.dump(abstracts_with_ngrams, h)

# w2v.save(f"{cord19.project_dir}/models/word2vec.model")

## Author-level research "jumps"

### Create a paper IDs | author IDs table

In [12]:
# Use only arXiv and AI
ai_papers_arxiv = papers[(papers.article_source=='arxiv') & (papers.is_AI==1)]
print(f'Number of AI papers in arXiv: {ai_papers_arxiv.shape[0]}')

Number of AI papers in arXiv: 76540


In [13]:
%%time
author_ids = []
author_names = []
paper_ids = []
for _, row in ai_papers_arxiv.iterrows():
    if isinstance(row['mag_authors'], list):
        for author in row['mag_authors']:
            paper_ids.append(row['id'])
            author_ids.append(author['author_id'])
            author_names.append(author['author_name'])
            
mag_paper_authors = pd.DataFrame({'id':paper_ids, 'author_id':author_ids, 'author_name':author_names})
mag_paper_authors.head(2)

CPU times: user 16.6 s, sys: 2.59 s, total: 19.2 s
Wall time: 22.9 s


| | id | author_id | author_name |
|---|---|---|---|
| 0 | 704.0047 | 2060993184 | tadej kosel |
| 1 | 704.0047 | 2210224347 | igor grabec |


### Create TFIDF vectors for the AI papers and reduce dimensionality with SVD and UMAP

In [15]:
X = tfidf_vectors(ai_papers_arxiv.abstract, cord19.config["tfidf"]["max_features"])

# Dim reductions with SVD
X = svd(X, cord19.config["svd"]["n_components"])

# Dim reduction with UMAP
umap_config = cord19.config["umap"]
embed = umap_embeddings(X, **umap_config)
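The helpers `tfidf_vectors` and `svd` come from the project package and their implementation is not shown in this notebook. As a rough illustration of what the first two steps compute, here is a NumPy-only sketch on a made-up corpus. `tfidf_matrix` and `reduce_with_svd` are hypothetical names, the smoothed IDF and L2 normalisation are assumptions, and the UMAP step is left out.

```python
import numpy as np


def tfidf_matrix(docs):
    """Build an L2-normalised TF-IDF matrix from tokenised documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    column = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok in doc:
            counts[i, column[tok]] += 1
    doc_freq = (counts > 0).sum(axis=0)
    # Smoothed inverse document frequency: rare terms get larger weights.
    idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1
    weighted = counts * idf
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    return weighted / np.where(norms == 0, 1.0, norms), vocab


def reduce_with_svd(X, n_components):
    """Project the rows of X onto the top singular directions (truncated SVD)."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


# Toy tokenised abstracts, for illustration only.
docs = [
    ["neural", "network", "image", "classification"],
    ["neural", "network", "covid", "xray"],
    ["graph", "kernel", "clustering"],
]
X, vocab = tfidf_matrix(docs)
Z = reduce_with_svd(X, n_components=2)
print(X.shape, Z.shape)  # (3, 9) (3, 2)
```

In the reduced space the two documents that share "neural network" end up close together, while the third one lies along a separate direction. The SVD vectors, not the 3D UMAP coordinates, are what the later cells look up by row position.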

### Visualise arXiv's AI papers with a covid-19 flag.

In [None]:
scatter_3d(embed, ai_papers_arxiv)

### Find the authors with covid-19 publications
I will keep authors with at least three publications in order to measure the delta between the sets with and without covid-19.

In [16]:
# Reset index to fetch the TFIDF vector by it
ai_papers_arxiv = ai_papers_arxiv.reset_index()

# Add a covid-19 flag
mag_paper_authors = mag_paper_authors.merge(ai_papers_arxiv[['id', 'is_Covid']], left_on='id', right_on='id')

# Author IDs with covid-19 publications
author_ids_with_covid_pub = mag_paper_authors[mag_paper_authors.is_Covid==1]['author_id'].values

# Group paper IDs by author IDs
g = mag_paper_authors[mag_paper_authors.author_id.isin(author_ids_with_covid_pub)].groupby('author_id')['id'].apply(list)

# Keep only authors with at least 3 publications
d = {idx:len(item) for idx, item in g.items()}
ids = [k for k, v in d.items() if v > 2]

# Subset mag_paper_authors by the ids
authors_covid_contrib = mag_paper_authors[mag_paper_authors.author_id.isin(ids)]

# Paper IDs and arrays - only for authors who work in AI and have covid-19 contributions
ids = []
arr = []
for idx, row in ai_papers_arxiv[ai_papers_arxiv.id.isin(authors_covid_contrib.id.unique())].iterrows():
    ids.append(row['id'])
    arr.append(X[idx])
    
arrays = pd.DataFrame({'id':ids, 'arr':arr})

authors_covid_contrib = authors_covid_contrib.merge(arrays, left_on='id', right_on='id')

print(f'Unique authors with at least 3 papers and at least one covid-19 contribution: {authors_covid_contrib.author_id.unique().shape[0]}')

authors_covid_contrib.head(1)

Unique authors with at least 3 papers and at least one covid-19 contribution: 117


| | id | author_id | author_name | is_Covid | arr |
|---|---|---|---|---|---|
| 0 | 809.2553 | 2439529108 | ming li | 0 | [0.1349755253034826, -0.058632180914855336, 0.... |
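The selection rule used in this cell (at least three AI papers, at least one of them on COVID-19) can be checked on a toy example. The sketch below uses plain Python instead of pandas; `eligible_authors` is a hypothetical helper and all records are made up.

```python
from collections import defaultdict


def eligible_authors(records, min_papers=3):
    """Return ids of authors with >= min_papers papers, at least one flagged as COVID-19."""
    papers_by_author = defaultdict(set)
    covid_authors = set()
    for paper_id, author_id, is_covid in records:
        papers_by_author[author_id].add(paper_id)
        if is_covid:
            covid_authors.add(author_id)
    return sorted(
        author
        for author, papers in papers_by_author.items()
        if author in covid_authors and len(papers) >= min_papers
    )


# Toy (paper id, author id, is_Covid) rows; all values are made up.
records = [
    ("p1", "a", 0), ("p2", "a", 0), ("p3", "a", 1),  # 3 papers, 1 on COVID-19
    ("p4", "b", 0), ("p5", "b", 1),                  # only 2 papers
    ("p6", "c", 0), ("p7", "c", 0), ("p8", "c", 0),  # no COVID-19 paper
]
selected = eligible_authors(records)
print(selected)  # ['a']
```

Requiring three papers leaves at least two non-COVID papers for an author with a single COVID-19 contribution, which is the minimum needed for a dispersion-style comparison between the two sets.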


### Measure author-level diversity with and without covid-19 publications

In [20]:
author_div = {}
for id_ in authors_covid_contrib.author_id.unique():
    frame = authors_covid_contrib[authors_covid_contrib.author_id==id_]
    try:
        author_div[id_] = distance(frame)
    except ZeroDivisionError:
        # Skip authors for whom the metric raises a division by zero
        continue
        
# Author-level diversity deltas
author_div_diff = {}
for k, v in author_div.items():
    author_div_diff[k] = (v['with_covid'] - v['no_covid'])[0][0]

# Read as dataframe and rename column
author_div_diff = pd.DataFrame.from_dict(author_div_diff, orient='index')
author_div_diff = author_div_diff.rename(index=str, columns={0:'delta'})
author_div_diff = author_div_diff.reset_index()
author_div_diff.head(2)



| | index | delta |
|---|---|---|
| 0 | 2439529108 | 0.025292 |
| 1 | 2139473605 | 0.002215 |
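The implementation of `distance` lives in `cord19.estimators.diversity` and is not shown in this notebook, so the snippet below should not be read as the metric used above. It is a sketch of one plausible vector-based diversity measure, the mean pairwise cosine distance between an author's paper vectors, computed with and without the COVID-19 papers. The function names and the toy vectors are assumptions.

```python
import numpy as np


def mean_pairwise_cosine_distance(vectors):
    """Average cosine distance over all unordered pairs of row vectors."""
    V = np.asarray(vectors, dtype=float)
    if len(V) < 2:
        raise ValueError("need at least two vectors to measure diversity")
    unit = V / np.linalg.norm(V, axis=1, keepdims=True)
    sims = unit @ unit.T
    upper = np.triu_indices(len(V), k=1)  # each pair counted once
    return float(np.mean(1.0 - sims[upper]))


def diversity_delta(vectors, is_covid):
    """Diversity of all papers minus diversity of the non-COVID papers."""
    V = np.asarray(vectors, dtype=float)
    covid = np.asarray(is_covid, dtype=bool)
    return mean_pairwise_cosine_distance(V) - mean_pairwise_cosine_distance(V[~covid])


# Toy author: two identical non-COVID papers and one orthogonal COVID-19 paper.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
is_covid = [0, 0, 1]
delta = diversity_delta(vectors, is_covid)
print(round(delta, 4))  # 0.6667
```

A positive delta means the COVID-19 papers moved the author away from their usual topics, and a value near zero means the COVID-19 work sits close to what they already published. Under this particular definition, authors with fewer than two non-COVID papers have no defined baseline and would need to be skipped.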


In [21]:
bar_chart(author_div_diff, 'index', 'delta', 'Diversity delta due to covid-19 publications')