# ArXiv

This notebook introduces the arXiv dataset for analysis in the AI / CI project.

The provenance of this data is as follows:

1. Collect all papers from arXiv
2. Match them with Microsoft Academic Graph (a publication database) on titles to get institutions
3. Match institutions with GRID (a research institution database) to get locations
4. Identify AI papers through a semantic analysis
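
Step 2 can be pictured with a minimal sketch. The exact matching procedure is not documented in this notebook, so the exact match on normalised titles below is an assumption. The toy frames, the `normalise_title` helper and the column names `title`, `mag_title` and `institution_name` are hypothetical.

```python
import re
import pandas as pd


def normalise_title(title: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    title = re.sub(r'[^a-z0-9 ]', ' ', title.lower())
    return ' '.join(title.split())


# Toy stand-ins for the arXiv and Microsoft Academic Graph tables (hypothetical columns)
arxiv = pd.DataFrame({
    'article_id': ['a1', 'a2'],
    'title': ['Deep Learning: A Survey', 'Graph  Kernels'],
})
mag = pd.DataFrame({
    'mag_title': ['deep learning a survey', 'Something else'],
    'institution_name': ['Uni X', 'Uni Y'],
})

# Build the same normalised key on both sides, then join on it
arxiv['title_key'] = arxiv['title'].map(normalise_title)
mag['title_key'] = mag['mag_title'].map(normalise_title)

matched = arxiv.merge(mag, on='title_key', how='inner')
```

Papers with no title match drop out of an inner join like this one, which is one way a pipeline of this kind can end up covering only a subset of arXiv.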

## Preamble

In [None]:
%run notebook_preamble.ipy

In [None]:
# Functions etc here

## 1. Download data

As with CrunchBase, you can access the data with one of the data getters.

In [None]:
from data_getters.arxiv_grid import get_arxiv_grid

my_config_path = '../mysqldb_team.config'

df = get_arxiv_grid(my_config_path)

In [None]:
df.head()

In [None]:
df.shape

375K institution - paper pairs

In [None]:
#Unique papers?
len(set(df['article_id']))

We have only included papers in computer science and statistics/machine learning because most of the other fields in arXiv (Physics, Biology, etc.) are unlikely to be relevant for the CIs.

In [None]:
# These are the ids for AI papers based on Kostas' analysis for the Women in AI report
# I will send you the file separately
ai_path = '../../../ai_analysis/data/external/dl_paper_ids.csv'

# Read paper_id as a string so the ids are not parsed as numbers
ml_ids = pd.read_csv(ai_path, dtype={'paper_id': str})

# Keep only the ids flagged as AI; a set gives fast membership checks
ml_ids_set = set(ml_ids.loc[ml_ids['is_AI'] == True, 'paper_id'])

In [None]:
df['ai'] = [x in ml_ids_set for x in df['article_id']]

df.drop_duplicates('article_id')['ai'].sum()

60K AI papers
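
The `dtype={'paper_id': str}` argument used when reading the ids is worth a quick illustration. The sketch below assumes the ids look like arXiv identifiers (e.g. `0704.0001`), which is an assumption about the file. The in-memory CSV is toy data.

```python
import io
import pandas as pd

# Toy CSV with the same two columns as the ids file (the values are made up)
csv_text = "paper_id,is_AI\n0704.0001,True\n1801.00010,False\n"

# Without a dtype, pandas parses the ids as floats and loses leading/trailing zeros
as_float = pd.read_csv(io.StringIO(csv_text))

# With dtype=str the ids survive unchanged
as_str = pd.read_csv(io.StringIO(csv_text), dtype={'paper_id': str})

# Set of AI ids, built the same way as ml_ids_set
ai_ids = set(as_str.loc[as_str['is_AI'], 'paper_id'])
```

If the ids were parsed as floats, none of them would match string ids in `df['article_id']`, and the `ai` flag would silently be `False` everywhere.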

## 2. Tour of the data

Most of the information in the data is self-explanatory.

Some observations:



### ArXiv categories

The taxonomy of arXiv categories is [here](http://arxitics.com/help/categories?group=cs).



In [None]:
df['arxiv_categories'] = [x.split(' ') for x in df['arxiv_categories']]

In [None]:
df['arxiv_categories'].head(n=10)

Sometimes there is more than one category per paper.
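
Because each paper can carry several categories, counting categories needs one row per paper-category pair. A minimal sketch on toy data, assuming `arxiv_categories` already holds lists as produced by the split above:

```python
import pandas as pd

# Toy frame: paper a1 appears twice because it has two institutions
toy = pd.DataFrame({
    'article_id': ['a1', 'a1', 'a2'],
    'arxiv_categories': [['cs.LG', 'stat.ML'], ['cs.LG', 'stat.ML'], ['cs.CV']],
})

# Keep one row per paper so institutions do not inflate the counts
papers = toy.drop_duplicates('article_id')

# Number of categories attached to each paper
n_categories = papers['arxiv_categories'].str.len()

# One row per (paper, category) pair, then count papers per category
paper_cats = papers.explode('arxiv_categories')
category_counts = paper_cats['arxiv_categories'].value_counts()
```

Deduplicating on `article_id` first matters here: the data holds institution-paper pairs, so exploding the full frame would count a paper once per institution.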

### Multinational

Institutions with a presence in multiple countries (multinationals) get matched with all of them, which isn't great.

This is not a problem when analysing the global picture (e.g. papers in general), as we can group by paper id and remove duplicates in the names of participant organisations. When we do the geographical analysis we should drop any is_multinational matches.

In [None]:
df.loc[df['is_multinational']==True].head()
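
The two treatments of multinationals described above can be sketched on toy data. The column names `institution_name` and `country` are hypothetical placeholders for whatever the real institution and location columns are called. `article_id` and `is_multinational` come from the data.

```python
import pandas as pd

# Toy frame: Acme Labs is a multinational matched to two countries for paper a1
toy = pd.DataFrame({
    'article_id': ['a1', 'a1', 'a1', 'a2'],
    'institution_name': ['Acme Labs', 'Acme Labs', 'Uni X', 'Uni X'],  # hypothetical column
    'country': ['US', 'GB', 'FR', 'FR'],  # hypothetical column
    'is_multinational': [True, True, False, False],
})

# Global picture: one row per paper-institution pair, whatever the location
global_pairs = toy.drop_duplicates(['article_id', 'institution_name'])

# Geographical analysis: drop every multinational match.
# astype(bool) guards against the flag being stored as 0/1 integers.
geo = toy.loc[~toy['is_multinational'].astype(bool)]
```

Here `global_pairs` keeps Acme Labs once for paper a1, while `geo` drops it entirely, since its location is ambiguous.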