## Setup

*Make sure to run this notebook using a GPU runtime (Runtime > Change runtime type > GPU). Run the gray-shaded code cells by pressing the run button in the left corner of each cell. Run all code cells in order. The setup (first 6 code cells) may take up to 5 minutes.*

In this notebook, we demonstrate how to use our models to detect graphical representations of (latent) variables and path coefficients in a given PDF file. The outputs of our functions are saved in two folders: ```cropped_imgs``` (extracted conceptual model figures) and ```final_imgs``` (detections of variables and coefficients in the figures). Before we can start, we need to set a few things up. <br>
First, we clone our GitHub repo:

In [None]:
!git clone https://github.com/purplesweatshirt/icispaper

Next, we change the directory to our cloned repo:

In [None]:
%cd ./icispaper

We have to install two dependencies to convert each page of a PDF into an image file:

In [None]:
!apt-get install -y poppler-utils
!pip install pdf2image

We also have to run the ```make``` command to be able to use the YOLOv4 model:

In [None]:
!make

We download our models' weights. Unfortunately, we couldn't upload them directly to GitHub because the files are too large.

In [None]:
!wget --output-document=fig_det.weights https://sync.academiccloud.de/index.php/s/U13SnHdpPAnPKI0/download
!wget --output-document=var_det.weights https://sync.academiccloud.de/index.php/s/RBm4jpUvxzwQOAu/download
!wget --output-document=sem_class.h5 https://sync.academiccloud.de/index.php/s/excRqLnqE5xN4fM/download
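
A failed download can leave a missing or empty file behind, which would only surface later as a confusing error during inference. As a minimal sketch (not part of our repository; the helper name `find_missing_or_empty` is hypothetical), the following check reports any of the three files named in the `wget` commands above that is absent or has size zero:

```python
from pathlib import Path

# File names taken from the wget commands above.
EXPECTED_FILES = ["fig_det.weights", "var_det.weights", "sem_class.h5"]


def find_missing_or_empty(names, base_dir="."):
    """Return the names of files that do not exist or have size zero."""
    problems = []
    for name in names:
        path = Path(base_dir) / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


problems = find_missing_or_empty(EXPECTED_FILES)
if problems:
    print("Missing or empty:", ", ".join(problems))
else:
    print("All model files are present.")
```

If a file is reported, re-running the corresponding `wget` line is usually the simplest fix.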

Finally, we download a PDF file to demonstrate our pipeline:<br>
(*Note: You can use a paper of your choice by replacing the URL in the code cell below with a link to your desired PDF file. Do not modify anything but the URL.*)

In [None]:
!wget --output-document=test.pdf http://docshare01.docshare.tips/files/7052/70528799.pdf
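
If you swap in your own URL, the link may return an HTML landing page instead of the PDF itself. The sketch below (a hypothetical helper, not part of our repository) checks whether the downloaded file starts with the `%PDF-` header that PDF files normally begin with. It is a deliberately strict check: some PDF readers tolerate a few leading bytes before the header, and this helper does not.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes '%PDF-'."""
    try:
        with open(path, "rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable file.
        return False


if not looks_like_pdf("test.pdf"):
    print("test.pdf is missing or does not look like a PDF; check the URL.")
```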

## Inference

We created a Python file that contains all of our wrapper functions. We import them here because they are used in the following steps.

In [7]:
from detection_utils_new import *

We convert all PDF pages into image files using our ```store_images``` function. These images are stored in the ```temp_imgs``` folder. Then we run our ```classify_pages``` function, which classifies each image (i.e., whether or not it contains a graphical representation of a conceptual model) and keeps only the relevant pages in the ```temp_imgs``` folder.

In [None]:
# Enter the path to the pdf file
PATH_TO_PDF = 'test.pdf'

store_images(PATH_TO_PDF)
classify_pages(model_path='sem_class.h5')
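
For readers curious about the page-to-image step, here is a rough sketch of how such a conversion can be done with `pdf2image`, which we installed during setup. This is an illustration only and not necessarily how ```store_images``` is implemented; the function names, the ```page_previews``` folder, the file naming scheme, and the `dpi` value are all assumptions. It writes to a separate folder so that it does not interfere with ```temp_imgs```.

```python
from pathlib import Path


def page_image_path(out_dir, page_number):
    """Build a zero-padded file name such as page_previews/page_003.jpg."""
    return Path(out_dir) / f"page_{page_number:03d}.jpg"


def pdf_to_page_images(pdf_path, out_dir="page_previews", dpi=200):
    """Render each PDF page to a JPEG file and return the written paths."""
    # Imported lazily so the helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    written = []
    for number, page in enumerate(pages, start=1):
        target = page_image_path(out_dir, number)
        page.save(target, "JPEG")
        written.append(target)
    return written
```

Calling `pdf_to_page_images('test.pdf')` would then produce one JPEG per page in the hypothetical ```page_previews``` folder.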

We pass the name/path of our weights to the ```detect_figures``` function. This function detects the SEM figures in the image files from ```temp_imgs```. The images are cropped to the size of each detection. These cropped images are stored in the ```cropped_imgs``` folder and can be used by databases to provide images of the conceptual models of a paper.

In [None]:
detect_figures(weights='fig_det.weights')
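
To illustrate what cropping an image to a detection involves, the sketch below converts a bounding box into pixel corners. It assumes the common YOLO convention of a normalized box given as center coordinates plus width and height; we do not claim that ```detect_figures``` represents its boxes this way internally, and the function name is hypothetical.

```python
def yolo_box_to_pixels(cx, cy, w, h, img_w, img_h):
    """Convert a normalized (center, size) box to integer pixel corners.

    Returns (left, top, right, bottom), clamped to the image bounds.
    """
    left = max(0, round((cx - w / 2) * img_w))
    top = max(0, round((cy - h / 2) * img_h))
    right = min(img_w, round((cx + w / 2) * img_w))
    bottom = min(img_h, round((cy + h / 2) * img_h))
    return left, top, right, bottom


# A box covering the central half of a 200x100 image.
box = yolo_box_to_pixels(0.5, 0.5, 0.5, 0.5, img_w=200, img_h=100)
print(box)  # (50, 25, 150, 75)
```

The resulting tuple has the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so it could be passed to that method directly.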

Our ```detect_variables``` function uses the cropped images from the previous step to detect latent variables, items and path coefficients in the SEM figures. The resulting images are stored in the ```final_imgs``` folder. In the future, the bounding boxes will be used to extract the names via OCR and store this information in a database together with the intermediate images and the paper itself.

In [None]:
detect_variables(weights='var_det.weights')
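
Once the pipeline has finished, you may want a quick overview of what was written to the two output folders. The following sketch (a hypothetical helper, not part of our repository) lists the image files in ```cropped_imgs``` and ```final_imgs```. The set of file extensions is an assumption, so adjust it if your outputs use a different format.

```python
from pathlib import Path

# Assumed image extensions; adjust if the outputs use another format.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(folder):
    """Return the image files in a folder, sorted by name (empty if absent)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


for folder in ("cropped_imgs", "final_imgs"):
    files = list_images(folder)
    print(f"{folder}: {len(files)} image(s)")
    for path in files:
        print("  ", path.name)
```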