# Updating the Inventory
---
This notebook provides the code needed to update the Biodata Resource Inventory using the trained models.

The steps include:
* Run new query on EuropePMC
* Classify new articles
* Run NER to get resource names for predicted positives
* Get URLs for predicted positives
* Gather other metadata



# Setup
---
### Mount Drive

First, mount Google Drive to have access to files necessary for the run:


In [None]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/GitHub/inventory_2022/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/GitHub/inventory_2022


In [None]:
! make setup_colab

Next, run the `setup_for_updating` make target to install the Python dependencies.

You may see the error: `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed`, but the code should run regardless.



In [1]:
! make setup_for_updating

python3.8 -m pip install -r requirements.txt
Collecting numpy==1.19 (from -r requirements.txt (line 4))
  Using cached numpy-1.19.0-cp38-cp38-manylinux2010_x86_64.whl (14.6 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.0
    Uninstalling numpy-1.23.0:
      Successfully uninstalled numpy-1.23.0
Successfully installed numpy-1.19.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
echo "import nltk \nnltk.download('punkt')" | python3 /dev/stdin
[nltk_data] Downloading package punkt to /home/ken/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
pip install --upgrade numpy==1.23
Collecting numpy==1.23
  Using cached numpy-1.23.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Installing colle

If you need to download the model checkpoints for the best classifier and NER models, run the cell below. If the `train_and_predict` pipeline was run, then the models are already present.

In [4]:
# Get trained article classification model
# Create output directory
! mkdir -p out/classif_train_out/best
# Print name of model to the necessary file
! echo "out/classif_train_out/article_classifier.pt" > out/classif_train_out/best/best_checkpt.txt
# Download the model
! wget -O "out/classif_train_out/article_classifier.pt" https://huggingface.co/globalbiodata/inventory/resolve/main/article_classifier.pt
# Check that it downloaded properly
! echo "5718a7f70becacb46d46501734c83aab81c86feec563594f6a25c116aa31b521 out/classif_train_out/article_classifier.pt" | sha256sum -c

# Get trained NER model
! mkdir -p out/ner_train_out/best
! echo "out/ner_train_out/named_entity_recognition.pt" > out/ner_train_out/best/best_checkpt.txt
! wget -O "out/ner_train_out/named_entity_recognition.pt" https://huggingface.co/globalbiodata/inventory/resolve/main/named_entity_recognition.pt
! echo "dc0bc8b4929e33da52bc92e12720260b392421883889e0a36c809cb0b5c40f5d out/ner_train_out/named_entity_recognition.pt" | sha256sum -c

--2024-01-27 11:08:30--  https://huggingface.co/globalbiodata/inventory/resolve/main/article_classifier.pt
Resolving huggingface.co (huggingface.co)... 18.155.173.64, 18.155.173.45, 18.155.173.122, ...
Connecting to huggingface.co (huggingface.co)|18.155.173.64|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs.huggingface.co/repos/0c/71/0c71d65c1ad08d6d90c7a439d78f4e3d7683b4b87ed90f6649fde934a1e6a69d/5718a7f70becacb46d46501734c83aab81c86feec563594f6a25c116aa31b521?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27article_classifier.pt%3B+filename%3D%22article_classifier.pt%22%3B&Expires=1706639144&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwNjYzOTE0NH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9yZXBvcy8wYy83MS8wYzcxZDY1YzFhZDA4ZDZkOTBjN2E0MzlkNzhmNGUzZDc2ODNiNGI4N2VkOTBmNjY0OWZkZTkzNGExZTZhNjlkLzU3MThhN2Y3MGJlY2FjYjQ2ZDQ2NTAxNzM0YzgzYWFiODFjODZmZWVjNTYzNTk0ZjZhMjV
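
The `sha256sum -c` checks in the cell above can also be done from Python, which may help on systems where `sha256sum` is not available. The following is a minimal sketch and not part of the repository: the helper name `sha256_of` is an assumption, while the file paths and expected digests are copied from the download cell.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the download cell above.
EXPECTED_SHA256 = {
    "out/classif_train_out/article_classifier.pt":
        "5718a7f70becacb46d46501734c83aab81c86feec563594f6a25c116aa31b521",
    "out/ner_train_out/named_entity_recognition.pt":
        "dc0bc8b4929e33da52bc92e12720260b392421883889e0a36c809cb0b5c40f5d",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Report the status of each checkpoint without raising if a file is missing.
for model_path, expected in EXPECTED_SHA256.items():
    if not Path(model_path).exists():
        print(f"{model_path}: not found")
    elif sha256_of(model_path) == expected:
        print(f"{model_path}: OK")
    else:
        print(f"{model_path}: checksum mismatch, download again")
```

A mismatch usually points to an interrupted download, in which case rerunning the `wget` command for that file is the first thing to try.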

# Setting up Configurations

Before running the automated pipelines, first update the configuration file `config/update_inventory.yml`. It can be accessed in Google Drive, though you may need to download it and edit it in a text editor such as Notepad, then reupload it.

* **Europe PMC query publication date range**: These are stored as the variables `query_from_date` and `query_to_date` in that file. Note that the dates are inclusive. For example, to get papers published in 2022, both of those variables should be 2022.
* **Previous inventory file**: During strict deduplication and flagging for manual review, the results of the previous inventory are taken into account. Specify the location of the most recent inventory output file in the variable `previous_inventory`.
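
In Colab, the file can also be edited in place instead of being downloaded and reuploaded. The sketch below assumes that `config/update_inventory.yml` stores these settings as flat, top-level `key: value` lines. That layout, the helper name `set_flat_yaml_values`, and the example values (including the previous inventory path) are assumptions for illustration, so check them against the actual file before using anything like this.

```python
import re


def set_flat_yaml_values(text, updates):
    """Replace the values of top-level `key: value` lines; other lines are kept as-is.

    Assumes a flat YAML layout. Inline comments on replaced lines are dropped.
    """
    remaining = dict(updates)
    out_lines = []
    for line in text.splitlines():
        match = re.match(r"(\w+)\s*:", line)  # indented (nested) keys do not match
        if match and match.group(1) in remaining:
            key = match.group(1)
            out_lines.append(f"{key}: {remaining.pop(key)}")
        else:
            out_lines.append(line)
    if remaining:
        raise KeyError(f"Keys not found in config: {sorted(remaining)}")
    return "\n".join(out_lines) + "\n"


# Demonstration on an in-memory example (assumed layout, hypothetical values).
example = (
    "query_from_date: 2021\n"
    "query_to_date: 2021\n"
    "previous_inventory: path/to/old_inventory.csv\n"
)
updated = set_flat_yaml_values(
    example,
    {
        "query_from_date": 2022,
        "query_to_date": 2022,
        "previous_inventory": "path/to/previous_inventory.csv",  # hypothetical path
    },
)
print(updated)
```

To apply this to the real file, read it with `pathlib.Path("config/update_inventory.yml").read_text()`, pass the text through the helper, and write the result back with `write_text()`. Keeping a copy of the original file first is a sensible precaution.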

# Running the pipeline
---
Now we are ready to run the pipeline.

## Run it

The following cell runs the pipeline described above. It may take a while; running on a GPU speeds it up considerably.

In [9]:
! make update_inventory

snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml -c1
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                      count    min threads    max threads
---------------------  -------  -------------  -------------
all                          1              1              1
classify_papers              1              1              1
extract_urls                 1              1              1
filter_positives             1              1              1
flag_for_review              1              1              1
initial_deduplication        1              1              1
ner_predict                  1              1              1
process_names                1              1              1
total                        8              1              1

Select jobs to execu

# Selective Manual Review

Once the initial pipeline has run, the inventory is flagged for selective manual review.

The file to be reviewed is located at:

`out/new_query/for_manual_review/predictions.csv`

Review the flagged columns according to the instruction sheet ([doi: 10.5281/zenodo.7768363](https://doi.org/10.5281/zenodo.7768363)), then place the manually reviewed file in the following folder:

`out/new_query/manually_reviewed/`

The file must still be named `predictions.csv`.
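
If the review was done on a copy saved elsewhere, the file can be put in place from a notebook cell. This is a minimal sketch: the helper name `stage_reviewed_file` and the example source path in the comment are hypothetical, while the destination folder and required file name come from the text above.

```python
import shutil
from pathlib import Path

# Destination folder named in the instructions above.
REVIEWED_DIR = Path("out/new_query/manually_reviewed")


def stage_reviewed_file(reviewed_csv, dest_dir=REVIEWED_DIR):
    """Copy a reviewed CSV into dest_dir under the required name predictions.csv."""
    reviewed_csv = Path(reviewed_csv)
    if not reviewed_csv.is_file():
        raise FileNotFoundError(f"Reviewed file not found: {reviewed_csv}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    destination = dest_dir / "predictions.csv"
    shutil.copyfile(reviewed_csv, destination)
    return destination


# Example call (hypothetical source path):
# stage_reviewed_file("/content/predictions_reviewed.csv")
```

Copying rather than moving leaves the reviewed original untouched in case the processing step needs to be repeated.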

# Processing Manual Review

Next, further processing is performed on the manually reviewed inventory.

In [None]:
! make process_manually_reviewed_update

snakemake \
-s snakemake/update_inventory.smk \
--configfile config/update_inventory.yml \
-c 1 \
--until process_countries
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
check_urls               1              1              1
get_epmc_meta            1              1              1
process_countries        1              1              1
total                    3              1              1

Select jobs to execute...

[Mon Jan 30 20:13:29 2023]
rule check_urls:
    input: out/new_query/processed_manual_review/predictions.csv
    output: out/new_query/url_checking/predictions.csv
    jobid: 11
    reason: Missing output files: out/new_query/url_checking/predictions.csv
    resou

## Final inventory

The final inventory, including names, URLs, and metadata, is found in the file:
*    `out/new_query/processed_countries/predictions.csv`
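
As a quick sanity check, the final file can be summarized without assuming anything about its columns. The sketch below uses only the standard library. The helper name `summarize_csv` is an assumption, and the file is assumed to be UTF-8 encoded with a header row.

```python
import csv
from pathlib import Path

# Location of the final inventory, as given above.
FINAL_INVENTORY = Path("out/new_query/processed_countries/predictions.csv")


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows at all
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if FINAL_INVENTORY.exists():
    columns, n_rows = summarize_csv(FINAL_INVENTORY)
    print(f"{n_rows} rows; columns: {columns}")
else:
    print(f"{FINAL_INVENTORY} not found; run the steps above first")
```

Using `csv.reader` rather than counting lines keeps the row count correct when quoted fields contain line breaks.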