# Running Training and Prediction Pipeline
---
This notebook provides all the commands needed to reproduce the reported results: training the models and running prediction on the full corpus.

You do not need to run this process to update the inventory. It is only needed to reproduce the reported results (it is the process that originally produced them).

This pipeline has the following steps:

*   Split the manually curated datasets
*   Train all models on the classification and NER tasks
*   Select the best model for each task
*   Evaluate the best model for each task on its test set
*   Perform classification of full corpus
*   Run NER model on predicted biodata resource papers
*   Extract URLs from predicted positives
*   Gather extra metadata from predicted positives

### ***Warning***:

Running the full pipeline trains many models, and their "checkpoint" files are quite large (~0.5GB per model, ~15GB in total). Running prediction alone requires far fewer resources, including storage space.
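
As a quick sanity check before training, free disk space can be compared against the ~15GB figure above. This is a minimal sketch: the helper name `has_enough_space` and the choice to check the current directory are assumptions for illustration, not part of the pipeline.

```python
import shutil

# Approximate total checkpoint size, taken from the warning above.
REQUIRED_GB = 15


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if not has_enough_space():
    print(f"Warning: less than {REQUIRED_GB} GB free; training checkpoints may not fit.")
```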

### Other use-cases

If you want to compare a new model to the models already evaluated, you can add another row to `config/models_info.tsv`. This pipeline will train the new model and compare it to the others. If the checkpoint files of the other trained models are still present from a previous run, those models will not be retrained.
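
The row can be added by hand in any editor. If you prefer to do it from the notebook, the following is a hedged sketch: `append_model_row` is a hypothetical helper, and it deliberately reads the column names from the existing file rather than assuming them, since the columns of `config/models_info.tsv` are not listed here.

```python
import csv


def append_model_row(tsv_path, new_row):
    """Append `new_row` (a dict) to a TSV, using the file's own header as column order."""
    with open(tsv_path, newline="") as fh:
        text = fh.read()
    header = next(csv.reader(text.splitlines(), delimiter="\t"))

    # Refuse rows whose keys do not match the existing header exactly.
    missing = set(header) - set(new_row)
    unknown = set(new_row) - set(header)
    if missing or unknown:
        raise ValueError(
            f"missing columns: {sorted(missing)}, unknown columns: {sorted(unknown)}"
        )

    with open(tsv_path, "a", newline="") as fh:
        # Make sure the new row starts on its own line.
        if not text.endswith("\n"):
            fh.write("\n")
        writer = csv.DictWriter(
            fh, fieldnames=header, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(new_row)
```

Call it with the path `config/models_info.tsv` and a dict whose keys match the header of that file; the values to use depend on the model you are adding.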

# Setup
---
### Mount Drive

First, mount Google Drive to have access to files necessary for the run:


In [1]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/GitHub/inventory_2022/

Mounted at /content/drive
/content/drive/MyDrive/GitHub/inventory_2022


Run the `make` target to install Python dependencies, then obtain the full corpus that was used during training and evaluation.

In [None]:
! make setup
# Add command here for getting the full corpus, once we have it archived
# For now, manually place a file called "full_corpus.csv" into the "data/" directory
# Manually curated data is already stored in the GitHub repository
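
Because the corpus currently has to be placed by hand, it can help to confirm the file is where the pipeline expects it before starting a long run. A minimal sketch, where `corpus_is_ready` is a hypothetical helper and the path is the one named in the comments above:

```python
from pathlib import Path

# Location named in the setup comments above.
corpus_path = Path("data/full_corpus.csv")


def corpus_is_ready(path):
    """Return True if `path` exists and is a non-empty file."""
    path = Path(path)
    return path.is_file() and path.stat().st_size > 0


if not corpus_is_ready(corpus_path):
    print(f"{corpus_path} is missing or empty; place the full corpus there first.")
```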

# Running the pipeline
---
Now we are ready to run the pipeline.

## Previewing what has to be done

The following can be run to get a preview of what has to be done.

In [None]:
! make dryrun

## Run it

The following cell will run the entire pipeline described above. It takes a while, even with GPU acceleration. Without a GPU it will take a very long time, if it is able to finish at all.

In [None]:
! make train_and_predict

# Results
---
Once the pipeline has run, there are a few important output files:


## Final inventory

The final inventory, including names, URLs, and metadata, is found in the file:
*    `data/full_corpus_predictions/urls/predictions.csv`

## Model training stats

The per-epoch training statistics for all models are in the files:

*    `out/ner_train_out/best/*/combined_stats.csv`
*    `out/classif_train_out/best/*/combined_stats.csv`
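
To look at these files from within the notebook, the glob patterns above can be expanded and each file read with the standard `csv` module. This is a sketch only: `load_stats` is a hypothetical helper, and since the column names of `combined_stats.csv` are not listed here, it only reports how many rows each file contains.

```python
import csv
import glob


def load_stats(pattern):
    """Return {file path: list of row dicts} for every CSV matching `pattern`."""
    stats = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as fh:
            stats[path] = list(csv.DictReader(fh))
    return stats


# Both output directories named above; nothing is printed if no files exist yet.
for task_dir in ("ner_train_out", "classif_train_out"):
    pattern = f"out/{task_dir}/best/*/combined_stats.csv"
    for path, rows in load_stats(pattern).items():
        print(f"{path}: {len(rows)} rows")
```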

## Test set evaluation

Performance measures of the best model on the test set are located in the files:

*    `out/ner_train_out/best/test_set_evaluation/metrics.csv`
*    `out/classif_train_out/best/test_set_evaluation/metrics.csv`
