# Machines of healing grace?

Code for the basic analysis and results in the AI v Covid paper.

**Sections**

1. Descriptive analysis
  * How much Covid and AI activity do we detect in our data sources?
  * Is AI over- or underrepresented in Covid research?
  * How has AI activity evolved over time?
2. Topical analysis
  * What is the topical composition of Covid research and in what areas is AI focusing?
  * What are some examples of AI research to tackle Covid?
  * How has it evolved over time?
3. Geography
  * Where is AI research happening?
  * Who is doing it?
  * Do we find any differences in the topics that different countries focus on?
  * What explains whether a country focuses on Covid research? Demand pull or supply push?
4. Knowledge base
  * What topics do AI researchers draw on?
5. **Analysis of quality**
  * What are the levels of quality (impact) of Covid AI research papers?
  * What are the levels of experience of AI researchers focusing on Covid?
  * How does the above differ between AI research clusters?
  * Could we look at other data sources such as altmetrics?

## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
import altair as alt
from altair_saver import save
from toolz.curried import *
import random
import geopandas as gp

In [None]:
FIG_PATH = f"{project_dir}/reports/figures/report_1"
SRC_PATH = f"{project_dir}/data/processed/ai_research"


In [None]:
pd.options.mode.chained_assignment = None

In [None]:
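def save_fig(figure, name):
    """Save an Altair chart as a PNG in FIG_PATH."""
    # DRIVER (a selenium webdriver) is not defined in this notebook; it is
    # presumably created by notebook_preamble.ipy.
    save(figure, f"{FIG_PATH}/{name}.png", method="selenium",
         webdriver=DRIVER, scale_factor=3)


def preview(x):
    """Print the head and shape of a dataframe and return it, for use with .pipe()."""
    print(x.head())
    print(x.shape)
    return x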


## 1. Read data

In [None]:
#All arXiv data
xiv = pd.read_csv(f"{SRC_PATH}/xiv_papers_labelled.csv",dtype={'id':str}).pipe(preview)

In [None]:
xiv.columns = [x.lower() for x in xiv.columns]

In [None]:
ai_ids = set(xiv.loc[xiv['is_ai'] == True, 'id'])

In [None]:
#Create a cov df

cov = xiv.query("is_covid == True").reset_index(drop=True).pipe(preview)

In [None]:
#All topics
cluster_memberships = pd.read_csv(f"{SRC_PATH}/paper_cluster.csv",header=None)
cluster_lookup = cluster_memberships.set_index(0).to_dict()[1]
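
The lookup built above can be illustrated on toy data. This is a minimal sketch that assumes `paper_cluster.csv` has two unnamed columns, a paper id followed by a cluster label; the ids and labels below are invented. It also passes `dtype={0: str}`, which the cell above does not do, to show one way of keeping ids that look numeric as strings so they can match `xiv['id']`.

```python
import io

import pandas as pd

# Hypothetical stand-in for paper_cluster.csv: no header, id then cluster label.
toy_csv = io.StringIO("2004.10,cluster_a\n2005.20,cluster_b\n2006.30,cluster_a\n")

# dtype={0: str} keeps "2004.10" as text instead of parsing it as the float 2004.1.
toy_memberships = pd.read_csv(toy_csv, header=None, dtype={0: str})

# set_index(0) leaves column 1 only; to_dict() is keyed by column, so [1]
# returns the {id: cluster} mapping.
toy_lookup = toy_memberships.set_index(0).to_dict()[1]
print(toy_lookup)
```

Whether the real ids need this treatment depends on their format, which is not shown here.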

In [None]:
# Load the author data HERE (or recalculate myself)

## 2. Analyse data

In [None]:
# How do the levels of citations for Covid and non-Covid research compare?

In [None]:
xiv_2020 = xiv.query('year == 2020')

In [None]:
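# Half-open bins [a, b): citation counts of 1000 or more fall outside the last
# bin, become NaN and are excluded from the normalised shares.
b = pd.cut(xiv_2020['citation_count'],bins=[0,1,2,3,5,10,20,100,1000],right=False,include_lowest=True)
b.value_counts(normalize=True)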
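
To make the binning behaviour concrete, here is a small sketch on invented citation counts, using the same bin edges as above. The `toy_` names are hypothetical, and `sort_index()` is added only so the shares print in bin order rather than by frequency.

```python
import pandas as pd

# Invented citation counts, including one value on the upper edge (1000).
toy_citations = pd.Series([0, 0, 1, 2, 4, 5, 19, 20, 1000])
bins = [0, 1, 2, 3, 5, 10, 20, 100, 1000]

# right=False gives left-closed, right-open intervals such as [0, 1) and [1, 2).
toy_bins = pd.cut(toy_citations, bins=bins, right=False)

# Shares are computed over binned values only; NaN entries are dropped.
shares = toy_bins.value_counts(normalize=True).sort_index()
print(shares)

# Values at or above the last edge are not assigned to any bin.
n_unbinned = int(toy_bins.isna().sum())
print(f"unbinned values: {n_unbinned}")
```

If papers with 1000 or more citations matter for the comparison, one option is to extend the last edge, for example with `float("inf")`.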

In [None]:
cit_groups = xiv_2020.groupby(
    ['is_covid','is_ai','article_source'])['citation_count'].mean().reset_index().pipe(preview)

alt.Chart(cit_groups).mark_bar().encode(x='is_covid:N',y='citation_count',
                                        column='is_ai:N',
                                        row='article_source').properties(height=100,width=50)
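
The grouped table that feeds the chart can be sketched on toy data. The rows and the `article_source` values below are invented for illustration. The sketch also reports the group size alongside the mean, which the cell above does not do, because a mean over a handful of papers is easy to over-read.

```python
import pandas as pd

# Invented papers; column names follow the notebook, values are hypothetical.
toy = pd.DataFrame({
    "is_covid": [True, True, False, False, False, True],
    "is_ai": [True, False, True, False, False, True],
    "article_source": ["arxiv", "arxiv", "arxiv", "biorxiv", "biorxiv", "arxiv"],
    "citation_count": [4, 0, 2, 10, 20, 6],
})

# One row per observed (is_covid, is_ai, article_source) combination,
# with the mean citation count and the number of papers behind it.
toy_groups = (
    toy.groupby(["is_covid", "is_ai", "article_source"])["citation_count"]
    .agg(["mean", "count"])
    .reset_index()
)
print(toy_groups)
```

Only combinations that occur in the data appear as rows, so missing facets in the chart correspond to empty groups rather than zero citations.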