This project uses Fonduer to extract information from spreadsheets. It aims to extract the data values and corresponding row/column headers from the TROY200 data set.
The project contains 3 different experiments:
- Fonduer without cell annotations
- Fonduer with cell annotations (gold data)
- Fonduer with cell annotations (predictions)
The data folder contains the spreadsheet which have the annotations indicated by the fonts. The gold standard is in data/troy200_gold.csv , and the spreadsheets with gold-annotations/predictions in gold/predictions, respectively. The data folder needs to be extracted first, in order to run the experiments.
Clone the repository and execute inside the folder.
docker build -t kaikun/fonduer-troy200 .
docker-compose up
It will spin up a docker container with the jupyter notebook and one running the postgres database.
The jupyter notebook access token is printed in the console.
The credentials for the postgres database are username user
and password venron
by default.
They can be changed in the docker-compose.yml
and will need to be adjusted in the notebook too.
If docker-compose
is not installed, run the two containers independently within a local network to be able to access the service by container name. E.g.:
1.) Create a user-defined bridge network
docker network create my-net
2.) Run postgres (latest) docker image
docker pull postgres
mkdir -p $HOME/docker/volumes/pg
docker run --rm --name postgres --network my-net -e POSTGRES_PASSWORD=venron -e POSTGRES_USER=user -d -p 5432:5432 -v $HOME/docker/volumes/pg:/var/lib/postgresql/data postgres
3.) Run the code
docker build -t kaikun/fonduer-troy200 .
docker run --rm --name app-docker --network my-net -e PGPASSWORD=venron -e PGUSER=user -d -p 8890:8888 kaikun/fonduer-troy200
4.) Check the access token if no password for the notebook server is set
docker ps
docker logs CONTAINER_ID
Now the jupyter notebook will be accessible on localhost:8890
and able to connect directly to the database.
Please note that it is discouraged to use the database password as environment variable as it is potentially exposed to other users on the network.
Especially if you use shared resources switch to a password file.
In case the code runs on a remote server and you need to connect via SSH, simply use SSH tunnelling to connect to the notebook server.
ssh -N -L 8080:localhost:8890 USER@HOST_IP -p PORT
This will make the notebook server accessible on the the 8080
port on your local machine. Simply access in the browser localhost:8080
.