The code in this repository relates to the VisPubData collection of publications in the field of visualization. For more information please check the corresponding publication about the project:
Petra Isenberg, Florian Heimerl, Steffen Koch, Tobias Isenberg, Panpan Xu, Charles D. Stolper, Michael Sedlmair, Jian Chen, Torsten Möller, and John Stasko. vispubdata.org: A Metadata Collection about IEEE Visualization (VIS) Publications. IEEE Transactions on Visualization and Computer Graphics, 23(9):2199–2206, September 2017. doi: 10.1109/TVCG.2016.2615308
If you use this data, we would appreciate a citation to our paper:
@article{Isenberg:2017:VMC,
author = {Petra Isenberg and Florian Heimerl and Steffen Koch and Tobias Isenberg and Panpan Xu and Charles D. Stolper and Michael Sedlmair and Jian Chen and Torsten M{\"o}ller and John Stasko},
title = {vispubdata.org: A Metadata Collection about {IEEE} Visualization ({VIS}) Publications},
journal = {IEEE Transactions on Visualization and Computer Graphics},
year = {2017},
volume = {23},
number = {9},
month = sep,
pages = {2199--2206},
doi = {10.1109/TVCG.2016.2615308},
shortdoi = {10/dx8vv8},
oa_hal_url = {https://hal.science/hal-01376597},
url = {http://www.vispubdata.org/site/vispubdata/},
}
Contributors to the code are: Petra Isenberg, Natkamon Tovanich, and Tobias Isenberg
There are two types of code in this repository, one for reproduing one of the figures we show in the paper mentioned above (for the purpose of showing reproducibility via the Graphics Replicability Stamp Initiative), and the other code for supporting the continued update of the dataset.
The code in the reproducibility/
subdirectory facilitates the reproduction of the plot in Figure 1 of the corresponding paper, albeit adjusted to the updated data in the dataset (as of writing this text, years 2016–2023 of the IEEE VIS conference have been added). The other figures in the paper are a manually created overview of the conference evolution (Figure 2; for that see the files in the naming-graphic/
subdirectory of this repository) and screenshots of the dataset in use by other tools (Figures 3–5). The original version of Figure 1 from the paper looks as follows:
(image is in the public domain)
- a Python 3 installation; e.g., https://www.python.org/downloads/ or https://www.anaconda.com/download/
- dedicated Python libraries installed with
pip3
orconda
as follows (or similar):altair
:pip3 install altair
orconda install -c conda-forge altair
(see https://altair-viz.github.io/)vl-convert
:pip3 install vl-convert-python
orconda install -c conda-forge vl-convert-python
(see https://altair-viz.github.io/user_guide/saving_charts.html)pandas
:pip3 install pandas
(see https://pandas.pydata.org/docs/getting_started/install.html; already included in Anaconda)- the file
reproducibility/requirements.txt
includes all of these requirements, install them all in one go withpip3 install -r reproducibility/requirements.txt
The reproducibility/reproducibility.py
script essentially loads the current state of the dataset and then directly produces the new version of the figure in the local directory. There are two ways to get the data. By default, the data is pulled directly from the respective Google Spreadsheets, and the script should run without any further configuration.
Alternatively, one can also download the data to local csv
files and then run everything locally. For that, please set
useFiles = True
in the configuration section of the reproducibility/reproducibility.py
script at the top, and then download the data as follows: Please first go to the shared Google Spreadsheet that contains the VisPubData dataset, make sure that the first tab on the bottom is selected ("Main dataset"), and then download the data using the menu via File > Download > Comma Separated Values (.csv) and then save the file to reproducibility/vispubdata.csv
. Next, please go to the shared Google Spreadsheet that contains the data about the journal presentations, and then download that dataset the same way as before and save the file to reproducibility/vis-journal-presentations.csv
. Then everything is in place to run the script.
To run the script in either case, simply do:
cd reproducibility/
python3 reproducibility.py
The script then produces the equivalent of Figure 1 of the paper as reproducibility/reproducibility.pdf
, but updated to the most recent version of the dataset, which looks like this (2023 version):
(the image is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license; please attribute the image to the contributors named above and cite the mentioned journal paper)
Notice that the labels have been reworded slightly to reflect the changes that happened in the conference in the meantime as well as to make the distinction between journal conference papers and pure journal papers presented at the conference more clear, and that the labels are ordered differently from the original figure due to the use of a new plotting tool.
This code will allow you to create an update of the VisPubData dataset. If you have only small fixes of the data to report you might be better off to leave a comment on the Google spreadsheet with the data and to e-mail petra.isenberg@inria.fr to make the change for you. If, however, you would like to, for example, add a new year to the dataset or create a local update for yourself, then read on.
- A Python 3 installation, we recommend Anaconda.
- Several additional Python packages:
- lxml
- requests
- crossrefapi
- tqdm
- The file
requirements.txt
includes all of these requirements, install them all in one go withpip3 install -r requirements.txt
.
- Get an IEEEXplore API key: https://developer.ieee.org/ (this may take a few days). Then make a copy of
vispubdata-update/ieeexplore-apikey-template.txt
and call itvispubdata-update/ieeexplore-apikey.txt
, and replace thexxxxxxxxxxxxxxxxxxxxxxxx
in the renamed copy you made with your API key. - Download the latest data from the DBLP: go to https://dblp.org/xml/ and download the files
dblp.xml.gz
anddblp.dtd
and put them into thedblp-data-extraction/data/
subfolder in this repository. Also do not forget to extract thedblp.xml.gz
todblp.xml
. - You need to get a list of papers actually accepted to and presented at IEEE VIS (i.e., those papers initially submitted at the end-of-March deadline and then accepted to IEEE VIS). Ask the IEEE VIS program chairs for the titles of the year of IEEE VIS you'd like to add. Also find the DOIs of the papers awarded in the year of the conference you would like to add (you can try https://ieeevis.org, or check on https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=2945, or on https://www.computer.org/csdl/journal/tg; the IEEE VIS proceedings are typically the first issue in a TVCG volume/year). You will need this data in the process to cross-check and verify which papers to include from the downloaded data from IEEE Xplore.
- Get a list of papers with a graphics replicability stamp, not only for the year you want to add but for all years because the stamps are awarded retroactively. This list should contain the DOIs of the papers with a GRSI stamp (in no particular order), and it can include more papers than in the VisPubData database (because the GRSI is not limited to visualization papers). The script
extract-tvcg-dois-with-stamp.py
in Tobias Isenberg's Visualization-Reproducibility repository generates such a list, if you need help just contact Tobias. Place the resulting CSV file into thevispubdata-update/
subdirectory, overwriting the existingvispubdata-update/tvcg-dois-with-stamp.csv
file. This step is not essential to be able to run things, as the repository already contains avispubdata-update/tvcg-dois-with-stamp.csv
file with data on the GRSI up to a point. But it would, of course, be better to have up-to-date data.
- Navigate to the main folder of this repsitory (the top folder) using the command line. If you are on Anaconda Python it is best to do this using the Anaconda prompt.
- Start the Jupyter notebook server there by calling:
This will open a browser window and show the subfolders of the repository and the included files.
jupyter notebook
- If this does not work out of the box, you can get more help here: https://docs.jupyter.org/en/latest/install.html
Then use the newly opened browser window to open the the Jupyter notebooks in the respective folders in the order they are named below (double-click on the folder, then double-click on the respective ipynb
file, which opens it in a new browser tab). Each Jupyter notebook contains additional prerequisites and the instructions for running it.
dblp-data-extraction/ParseDBLP-VIS-Authors.ipynb
- The last step in this notebook will take several minutes, depending on your machine and the size of the data. The script, however, shows its progress in iterations (in batches of 100,000) and iterations per second.
- Unfortunately it is not possible to compute a percentage as the total number of needed iterations is not known ahead of time. The total number of needed iterations depends on the size of the DBLP data when downloaded and should be in the order of 97,000,000 (at time of writing these instructions) or more.
- So just wait as long as these iteration counts continue to be updated.
- If error messages appear check that you have the downloaded DBLP data files (according to the instructions above) and placed them into the
dblp-data-extraction/data/
subfolder and that you also uncompressed thedblp.xml.gz
file. - The script is done when you see a note that reports how many articles were found and the script prints
DONE
. - You can the close the window with the Jypiter notebook, as the results have been saved as an updated version of
dblp-data-extraction/data/VIS-author-articles.csv
.
vispubdata-update/Vispubdata update IEEE VIS papers.ipynb
- This process is quite involved, carefully read and follow the instructions in the notebook.
- Again, some of its processing may take a while (e.g., the CrossRef data check in the order of an hour or more), but there are one again progress indicators.
- Once done, many of the data files in the subdirectory are updated and you can close the browser tab.
vispubdata-update/journals-at-vis-update.ipynb
- Again, some of its processing may take a while (e.g., the CrossRef data check in the order of 10 minutes or more), but there are one again progress indicators.
- Once done, many of the data files in the subdirectory are updated and you can close the browser tab.
aminer-citation-update/first_scan.ipynb
aminer-citation-update/merge_data.ipynb
The final results of the process can then be found in the files vispubdata-update/results/vispubdata-update.csv
and vispubdata-update/results/vispubdata-update-journals.csv
as well as the Aminer citations at vispubdata-update/results/vispubdata-update.csv-aminer.csv
and vispubdata-update/results/vispubdata-update-journals.csv-aminer.csv
, which could technically be used to update the VisPubData Google spreadsheet (for those with access to it).
If you have any problems then please contact petra.isenberg@inria.fr.