# Collect CORD-19 data

This notebook outlines preliminary findings from collecting CORD-19 data.

Screenshots and other relevant images can be found under the `img/` directory, and can be displayed in this notebook using

    ![title](img/picture.png)

where applicable.

## Source of the data

The [Semantic Scholar](https://www.semanticscholar.org/cord19) team at the Allen Institute for AI has partnered with leading research groups to provide CORD-19 (the COVID-19 Open Research Dataset).

## Collection options

Here we show different options for collection, where applicable. Any prototype code used for data collection is provided in `collect.py`.

### Option a) 
CORD-19 is released **daily**. Each release can be downloaded as a single ZIP file containing all the data from the [historical releases page](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html). The ZIP contains the following files:
- `changelog`: A text file summarizing changes between this and the previous version.
- `cord_19_embeddings.tar.gz`: A collection of precomputed SPECTER document embeddings for each CORD-19 paper.
- `document_parses.tar.gz`: A collection of JSON files that contain full text parses of a subset of CORD-19 papers.
- `metadata.csv`: Metadata for all CORD-19 papers.

You can find a detailed description (and data dictionary) for each file on this [page](https://github.com/allenai/cord19#overview).

**Important note**: The Semantic Scholar team's recommendation is to _primarily use metadata.csv & augment data when needed with full text in document_parses/_.
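
As a rough illustration of that recommendation, the sketch below streams a metadata file and keeps only the rows that point at a full-text parse, so that `document_parses/` only needs to be consulted for those papers. The column names (`cord_uid`, `pdf_json_files`, `pmc_json_files`) are assumed from the CORD-19 data dictionary and should be checked against the release you download. `rows_with_full_text` is a hypothetical helper and is not part of `collect.py`.

```python
import csv
import io


def rows_with_full_text(csv_file):
    """Yield metadata rows that reference at least one full-text JSON parse.

    Assumes the columns `pdf_json_files` and `pmc_json_files` exist
    (names assumed from the CORD-19 data dictionary; verify per release).
    """
    for row in csv.DictReader(csv_file):
        if row.get("pdf_json_files") or row.get("pmc_json_files"):
            yield row


# Tiny in-memory stand-in for metadata.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "cord_uid,title,pdf_json_files,pmc_json_files\n"
    "aaa111,Paper with a PDF parse,document_parses/pdf_json/x.json,\n"
    "bbb222,Paper with metadata only,,\n"
)
matched = [row["cord_uid"] for row in rows_with_full_text(sample)]
print(matched)
```

For the real file, the same function could be called on `open("metadata.csv", newline="", encoding="utf-8")`.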

### Option b)

You can also access individual files like this:  
`https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/<date_iso_str>/<file_name>`

Replace `<date_iso_str>` with the release date formatted as `YYYY-MM-DD`, and `<file_name>` with one of the following:

- Paper metadata: `metadata.csv`
- Full text JSON: `document_parses.tar.gz`
- SPECTER embeddings: `cord_19_embeddings.tar.gz`
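
The URL pattern above can be sketched as a small helper. `release_file_url` and `KNOWN_FILES` are hypothetical names introduced for illustration; they are not taken from `collect.py`, whose internals may differ.

```python
from datetime import date

# Base URL as given in the text above.
BASE_URL = "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com"
KNOWN_FILES = ("metadata.csv", "document_parses.tar.gz", "cord_19_embeddings.tar.gz")


def release_file_url(release_date, file_name, base_url=BASE_URL):
    """Build the URL of one file in one dated CORD-19 release."""
    if file_name not in KNOWN_FILES:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # date.isoformat() produces YYYY-MM-DD, the format the URL expects.
    return f"{base_url}/{release_date.isoformat()}/{file_name}"


# The date is an arbitrary example, not a claim that this release exists.
url = release_file_url(date(2020, 9, 1), "metadata.csv")
print(url)
```

The resulting URL could then be passed to `urllib.request.urlretrieve(url, "metadata.csv")` to save the file locally.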

Here, I show how to collect the most recent `metadata.csv` file from the CORD19 dataset.

In [1]:
from collect import get_latest_cord19

In [2]:
get_latest_cord19?

Signature:
get_latest_cord19(
    URL='https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com',
    files=['metadata.csv', 'document_parses.tar.gz', 'cord_19_embeddings.tar.gz'],
)
Docstring: Download the latest CORD19 files and store them in the working directory.
File:      ~/Desktop/nesta/data-triage/datasets/cord19/collect/collect.py
Type:      function


In [3]:
get_latest_cord19(files=["metadata.csv"])

## Practical considerations
This is where we consider CPU time, financial cost, disk space requirements, and last (but not least) development time/uncertainty.

### CPU time
#### Integrated collection time
*This is an estimate of the time required to collect the data, without batching or parallelisation.*

As shown below, fetching all files takes ~15min.

In [3]:
%%time
get_latest_cord19()

CPU times: user 1min 21s, sys: 29.4 s, total: 1min 50s
Wall time: 15min 9s


#### Can the procedure be batched? Are there any caveats to this?
No. The full collection takes about 15 minutes on a single machine, so batching is unnecessary.

#### Real world collection time / cost
*Assume a maximum of 200 concurrent 8GB 2-core machines*

*NB (at time of writing based on [this](https://aws.amazon.com/ec2/pricing/on-demand/)) such a machine would cost $0.0944 per hour*
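
Combining that hourly price with the ~15 min wall time measured above gives a rough single-machine cost per full collection. This is only a back-of-the-envelope sketch: it assumes pro rata billing and that a cloud machine would download at about the same speed as the timed run, neither of which has been verified here.

```python
# Figures quoted in this notebook.
WALL_TIME_SECONDS = 15 * 60 + 9   # "Wall time: 15min 9s"
PRICE_PER_HOUR_USD = 0.0944       # on-demand price quoted above

# Assumption: one machine, billed pro rata, same download speed as the timed run.
cost_usd = PRICE_PER_HOUR_USD * WALL_TIME_SECONDS / 3600
print(f"~${cost_usd:.4f} per full collection")  # ~$0.0238 per full collection
```

Under these assumptions a single full collection costs roughly two cents, so there is no need for concurrent machines.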

### Disk space (GB)

#### By entity type, estimate how many "rows" there are to collect (e.g. 100s, 1000s, etc)

100,000s. At the time of writing, the main file (`metadata.csv`) has 274,033 rows.
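
One way to reproduce a row count like this is to count parsed CSV records rather than raw lines, since a plain line count would overcount if any quoted field contains a line break. The sketch below uses a made-up two-record sample; `count_records` is a hypothetical helper, not part of `collect.py`.

```python
import csv
import io


def count_records(csv_file):
    """Count data rows (excluding the header) in an open CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# Made-up sample: the first record has a quoted field spanning two lines.
sample = io.StringIO('cord_uid,abstract\na1,"line one\nline two"\nb2,short\n')
n_records = count_records(sample)
print(n_records)  # 2
```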

#### By entity type, and based on the field types, what is the estimated disk space?

Disk space at the time of writing:
- Main file (`metadata.csv`): 380MB
- `cord_19_embeddings.tar.gz` (compressed): 1.77GB
- `document_parses.tar.gz` (compressed): 2.95GB
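
As a sanity check, these sizes can be set against the wall time measured earlier to estimate the effective download throughput. The sketch assumes decimal units (1 GB = 1000 MB) and that the sizes listed here correspond to the same release as the timed run, which this notebook does not confirm.

```python
# File sizes quoted above, in GB (decimal units assumed).
SIZES_GB = {
    "metadata.csv": 0.38,
    "cord_19_embeddings.tar.gz": 1.77,
    "document_parses.tar.gz": 2.95,
}
WALL_TIME_SECONDS = 15 * 60 + 9  # "Wall time: 15min 9s" from the timed run

total_gb = sum(SIZES_GB.values())
throughput_mb_s = total_gb * 1000 / WALL_TIME_SECONDS
print(f"{total_gb:.2f} GB in {WALL_TIME_SECONDS} s, about {throughput_mb_s:.1f} MB/s")
```

That works out to about 5.1 GB in total, at a little under 6 MB/s for the timed run.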

#### What does this imply for database storage costs?

Negligible

### Development time
*How long do you think it will take to develop the codebase for the collection?*

Semantic Scholar has done the heavy lifting by aggregating the data from various sources. `collect.py` gives a solid starting point for fetching the data dumps.

*What uncertainties can you foresee?*

We don't know when Semantic Scholar will stop updating the database.