# Collect Elsevier publication data (focussing on Vaccine and Vaccine-x)

This notebook outlines preliminary findings on collecting Vaccine: X data.

Screenshots and other relevant images can be found under the `img/` directory, and are displayed in this notebook where applicable.

## Sources

The source is [Vaccine: X](https://www.journals.elsevier.com/vaccine-x), the open access mirror journal of Elsevier's [Vaccine](https://www.journals.elsevier.com/vaccine) journal. In general, any approach taken here should be applicable to all of Elsevier's journals, including their [open access journals](https://www.elsevier.com/about/open-science/open-access/open-access-journals/mirror-journals).

For example:
![Example of an article on Vaccine: X](img/vaccine-x-preview.png)

## Collection options

### Option a) CrossRef + PlumX (abstracts + extended metadata)

Unfortunately, everything under `journals.elsevier.com` is generated with JavaScript, so Selenium would be required to scrape it directly. After some digging around, I found a semi-hidden PlumX API which *is* open, with URLs in the following format:

    https://plu.mx/api/v1/artifact/doi/{{doi}}?hideUsage=true

where, for example, [{{doi}} = "10.1016/j.jvacx.2018.100001"](https://plu.mx/api/v1/artifact/doi/10.1016/j.jvacx.2018.100001?hideUsage=true)
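
As a minimal sketch of the URL format above, the helper below builds the PlumX address for a given DOI. The function name `plumx_url` and the `hide_usage` argument are illustrative choices, not part of any PlumX client library.

```python
from urllib.parse import quote, urlencode

# Endpoint as quoted in the text above
PLUMX_BASE = "https://plu.mx/api/v1/artifact/doi"


def plumx_url(doi: str, hide_usage: bool = True) -> str:
    """Build the PlumX artifact URL for a DOI (hypothetical helper)."""
    # Keep "/" unescaped so the DOI appears as in the example link
    path = quote(doi, safe="/")
    query = urlencode({"hideUsage": str(hide_usage).lower()})
    return f"{PLUMX_BASE}/{path}?{query}"


print(plumx_url("10.1016/j.jvacx.2018.100001"))
# https://plu.mx/api/v1/artifact/doi/10.1016/j.jvacx.2018.100001?hideUsage=true
```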

However, this strategy requires collecting the DOIs from somewhere, which is where CrossRef comes in.

Strategy: 

- Use CrossRef to get all DOIs for every article in Vaccine: X.
- Use PlumX to retrieve abstracts and extended metadata from each DOI.

Note that the strategy was found to work for any Elsevier article, not just the open access ones. For example, it also works for the journal "Vaccine", not only for "Vaccine: X".

In [42]:
%%time
import requests
from habanero import Crossref
crossref = Crossref(mailto='email address', ua_string='a good identifier')

# Get all Vaccine: X DOIs
# No trailing comma after the call, so `response` is the response dict rather than a 1-tuple
response = crossref.works(filter={"container-title": "Vaccine: X"}, limit=1000)
items = response['message']['items']
cr_fields = {k for item in items for k in item}  #<--- Summary of all field names from CrossRef
dois = [item['DOI'] for item in items]

CPU times: user 127 ms, sys: 20.2 ms, total: 147 ms
Wall time: 17.8 s


In [44]:
len(dois)

1000
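
The count above equals the `limit` passed to CrossRef, so it may be worth checking that every returned item really belongs to the intended journal. The sketch below keeps only items whose `container-title` list contains an exact (case-insensitive) match. The helper name and the sample records are hypothetical; only the `DOI` and `container-title` field names come from the CrossRef work records.

```python
def dois_for_journal(items, journal_title):
    """Return DOIs of items whose container-title matches journal_title exactly."""
    wanted = journal_title.strip().lower()
    dois = []
    for item in items:
        # CrossRef stores container-title as a list of strings; it may be absent
        titles = item.get("container-title") or []
        if any(title.strip().lower() == wanted for title in titles):
            dois.append(item["DOI"])
    return dois


# Hypothetical sample records, shaped like CrossRef work items
sample = [
    {"DOI": "10.1016/j.jvacx.2018.100001", "container-title": ["Vaccine: X"]},
    {"DOI": "10.0000/example", "container-title": ["Vaccine"]},
    {"DOI": "10.0000/no-title"},
]
print(dois_for_journal(sample, "Vaccine: X"))
# ['10.1016/j.jvacx.2018.100001']
```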

In [45]:
%%time
# Example of PlumX metadata + abstract
doi = dois[0]
r = requests.get(f"https://plu.mx/api/v1/artifact/doi/{doi}", params={'hideUsage':False})
metadata = r.json()
px_fields = set(metadata.keys()) #<--- Summary of all field names from PlumX

CPU times: user 22.2 ms, sys: 3.42 ms, total: 25.6 ms
Wall time: 464 ms


In [71]:
#metadata   # uncomment to see response
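
The single-DOI request above could be extended to the full list of DOIs along the following lines. This is only a sketch: `collect_metadata` and its `fetch` argument are hypothetical names, and the fetch function is passed in so that the loop can be tried out without network access.

```python
import time


def collect_metadata(dois, fetch, pause=0.0):
    """Fetch metadata for each DOI, recording failures instead of stopping."""
    results, failures = {}, {}
    for doi in dois:
        try:
            results[doi] = fetch(doi)
        except Exception as exc:  # broad on purpose: carry on and inspect later
            failures[doi] = repr(exc)
        if pause:
            time.sleep(pause)  # optional politeness delay between requests
    return results, failures


def fake_fetch(doi):
    """Stand-in for a real request, used only for this demonstration."""
    if doi == "bad":
        raise ValueError("no record")
    return {"doi": doi}


results, failures = collect_metadata(["10.1/a", "bad"], fake_fetch)
print(results)
print(failures)
```

In the notebook, `fetch` could wrap the `requests.get(...).json()` call shown in the previous cell.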

### Option b) Bonus: full-text (open access only)

Following from option a), the PlumX API also returns the PII (Publisher Item Identifier) which is then enough to identify Elsevier's page for this article.

    https://www.sciencedirect.com/science/article/pii/{{PII}}
    
where, for example, [{{PII}} = "S2590136218300019"](http://www.sciencedirect.com/science/article/pii/S2590136218300019). Again, this page is generated with JavaScript, so the strategy is to use Selenium to render the full text into the HTML, which is very well structured.
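
A minimal sketch of this step is given below, assuming headless Chrome and a matching driver are available. The helper names, the fixed 3 second wait and the use of Chrome are all assumptions made for illustration; nothing here has been tested against ScienceDirect.

```python
# URL pattern as quoted in the text above
SCIENCEDIRECT_BASE = "https://www.sciencedirect.com/science/article/pii"


def sciencedirect_url(pii: str) -> str:
    """Build the ScienceDirect article URL for a PII (hypothetical helper)."""
    return f"{SCIENCEDIRECT_BASE}/{pii}"


def fetch_rendered_html(url: str, wait_seconds: float = 3.0) -> str:
    """Load a page in headless Chrome and return the rendered HTML."""
    # Imported lazily so the URL helper works without Selenium installed
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude wait for the JavaScript to render
        return driver.page_source
    finally:
        driver.quit()  # always release the browser


print(sciencedirect_url("S2590136218300019"))
# https://www.sciencedirect.com/science/article/pii/S2590136218300019
```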

### Option c) Microsoft Academic Graph (will give similar data to CrossRef + PlumX)

Nesta already has the tools to collect the abstract data using MAG. The only disadvantage here is that the abstract text is reconstructed from the MAG "inverted abstract", and so a small amount of data quality is lost in the process, compared with extracting the data from PlumX.
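
To illustrate the reconstruction step, the sketch below rebuilds text from an inverted index, assuming the MAG layout of an `IndexLength` count plus an `InvertedIndex` mapping from each token to its positions. It is not Nesta's implementation, and the example record is made up.

```python
def rebuild_abstract(inverted):
    """Rebuild abstract text from a MAG-style inverted index."""
    words = [""] * inverted["IndexLength"]
    for token, positions in inverted["InvertedIndex"].items():
        for position in positions:
            words[position] = token
    # Tokens are re-joined with single spaces; any gaps are skipped
    return " ".join(word for word in words if word)


# Made-up example record
example = {
    "IndexLength": 5,
    "InvertedIndex": {"vaccines": [0, 4], "save": [1], "lives;": [2], "use": [3]},
}
print(rebuild_abstract(example))
# vaccines save lives; use vaccines
```

Because tokens are re-joined with single spaces, the original whitespace and line structure cannot be recovered, which is one way quality can be lost.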

In [63]:
%%time
from nesta.packages.magrxiv.collect_magrxiv import get_magrxiv_articles
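out = []
# MAG_API_KEY must be defined before running this cell
for i, article in enumerate(get_magrxiv_articles('vaccine', MAG_API_KEY)):
    if i == 100:
        break
    out.append(article)  # keep the first 100 articles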

# >>> CPU times: user 325 ms, sys: 26.1 ms, total: 351 ms
# >>> Wall time: 9.7 s

## Practical considerations
This is where we consider CPU time, financial cost, disk space requirements, and last (but not least) development time/uncertainty.

### CPU time
#### Integrated collection time
*This is an estimate of the time required to collect the data, without batching or parallelisation.*

##### Option a) CrossRef + PlumX

With the CrossRef + PlumX method, collection takes less than 500 ms per article.

For `Vaccine: X`, the number of articles is currently very low (only 57 since April 2019), so data collection would take only around 30 seconds.

For `Vaccine`, there are over 25,000 peer-reviewed articles, which would take a few hours (around 3.5 hours at 500 ms per article).

##### Option b) Full-texts with Selenium

Budgeting 3 seconds per article, a further 3 minutes would be required to collect the `Vaccine: X` full-texts, since there are so few of them.

##### Option c) Microsoft Academic Graph

For MAG, collecting all `Vaccine` abstracts would take only around 45 minutes.
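
The estimates in this section follow from simple arithmetic, sketched below. The per-article rates are taken from the timings in this notebook: 0.5 s as an upper bound for PlumX, a 3 s budget for Selenium, and 100 MAG articles in 9.7 s. The function name is illustrative.

```python
def collection_time_seconds(n_articles, seconds_per_article):
    """Serial collection time, without batching or parallelisation."""
    return n_articles * seconds_per_article


estimates = {
    "a) Vaccine: X via CrossRef + PlumX": collection_time_seconds(57, 0.5),
    "a) Vaccine via CrossRef + PlumX": collection_time_seconds(25_000, 0.5),
    "b) Vaccine: X full-texts via Selenium": collection_time_seconds(57, 3),
    "c) Vaccine via MAG": collection_time_seconds(25_000, 9.7 / 100),
}
for label, seconds in estimates.items():
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```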

#### Can the procedure be batched? Are there any caveats to this?

Batching is unnecessary, except in the case of Option b) if this were to be scaled up to all open access journals. The main caveat here is that running Selenium in Docker has quite a long development cycle.

#### Real world collection time / cost
*Assume a maximum of 200 concurrent 8GB 2-core machines*

*NB: at the time of writing, based on [this](https://aws.amazon.com/ec2/pricing/on-demand/), such a machine would cost 0.0944 dollars per hour.*

**Options a) and c)** Costs are effectively zero, and collection time is less than a few hours.

**Option b)** Around 2 dollars to process 1,000 articles. Realistically, batches of 1,000 per machine per hour would be feasible, such that with 200 machines up to 200,000 articles could be processed in an hour at a total cost of 400 dollars.
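
The Option b) figures can be reproduced with the short calculation below. It assumes that the batch rate of 1,000 articles per hour applies to each machine, which is what makes 200 machines reach 200,000 articles in an hour. The raw rental figure is printed only for comparison, since the text does not break down the 2 dollar estimate.

```python
MACHINE_PRICE_PER_HOUR = 0.0944   # dollars, from the AWS note above
MAX_MACHINES = 200                # from the stated assumption
COST_PER_1000_ARTICLES = 2.0      # dollars, the estimate quoted above
ARTICLES_PER_MACHINE_HOUR = 1000  # assumed to be a per-machine rate


def option_b_cost(n_articles):
    """Total cost in dollars at the quoted rate per 1,000 articles."""
    return n_articles / 1000 * COST_PER_1000_ARTICLES


def option_b_hours(n_articles, machines=MAX_MACHINES):
    """Wall-clock hours when the work is spread over `machines` machines."""
    return n_articles / (machines * ARTICLES_PER_MACHINE_HOUR)


print(option_b_cost(200_000))   # 400.0
print(option_b_hours(200_000))  # 1.0
# Raw rental for all machines for one hour, for comparison only
print(round(MAX_MACHINES * MACHINE_PRICE_PER_HOUR, 2))
```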

### Disk space (GB)

#### By entity type, estimate how many "rows" there are to collect (e.g. 100s, 1000s, etc)

- **Articles**: Over 25,000

#### By entity type, and based on the field types, what is the estimated disk space?

Mainly string types, totalling around 5 kB per article. Total disk space should be a maximum of around 125 MB.

In [55]:
len(str(metadata))

3733
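
The disk space estimate is the article count multiplied by the per-article size, as sketched below. Note that `len(str(metadata))` counts characters rather than bytes, so the measured value is only a rough guide to the stored size. The function name is illustrative.

```python
def total_storage_mb(n_articles, kb_per_article):
    """Total storage in MB, using decimal units (1 MB = 1000 kB)."""
    return n_articles * kb_per_article / 1000


print(total_storage_mb(25_000, 5))  # 125.0
# Using the measured size of one PlumX record (3733 characters)
print(round(total_storage_mb(25_000, 3.733), 1))
```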

#### What does this imply for database storage costs?

Negligible.

### Development time
*How long do you think it will take to develop the codebase for the collection?*
*What uncertainties can you foresee?*

#### Option a) CrossRef + PlumX: 2 days for development + 1 day for collection

#### Option b) Selenium: 1 week for development + 1 day for collection
There are significant uncertainties in this, possibly adding another week to development, and potentially changing the cost and time estimates as well.

#### Option c) MAG: 0 days for development + 1 day for collection