Structure Preserving Pre-Training

Directory Structure:

Our code expects a root directory whose path is stored in the environment variable PROJECTS_BASE. That directory in turn contains a directory named graph_augmented_pt, which stores the raw and processed data for these experiments, as well as the directories in which run checkpoints and result files will eventually be stored.

Within PROJECTS_BASE, you must have the directories raw_datasets/treeoflife, raw_datasets/string (for our protein pre-training, or PT, experiments), raw_datasets/TAPE (for protein fine-tuning, or FT), raw_datasets/ogbn_mag (for scientific articles PT), and raw_datasets/scibert (for scientific articles FT).
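
The layout described above can be created up front. The following is a minimal sketch, not part of the repository: the helper name ensure_layout is hypothetical, and the directory list is copied from the paragraph above.

```python
from pathlib import Path

# Sub-directories named in the text above, relative to PROJECTS_BASE.
RAW_DATASET_DIRS = [
    "raw_datasets/treeoflife",
    "raw_datasets/string",
    "raw_datasets/TAPE",
    "raw_datasets/ogbn_mag",
    "raw_datasets/scibert",
]


def ensure_layout(base):
    """Create any missing raw-dataset directories under `base`.

    `ensure_layout` is a hypothetical helper for illustration only.
    Returns the list of directory paths, in the order listed above.
    """
    base = Path(base)
    paths = []
    for rel in RAW_DATASET_DIRS:
        path = base / rel
        # exist_ok=True leaves directories that already exist untouched.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```

A typical call would be ensure_layout(os.environ["PROJECTS_BASE"]) after importing os, which raises a KeyError if the variable is not set.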

Datasets & Pre-trained Models:

Pre-training Datasets:

The treeoflife dataset can be obtained from the Stanford SNAP page. For proteins, you must also download the strings species dataset, accessible here, which must be placed in the raw_datasets/strings folder. The OGBN-MAG graph can be obtained from the Open Graph Benchmark (OGB), here, and the associated abstracts can be downloaded from the Microsoft Academic Graph (MAG).

Pre-training data must be pre-processed via the scripts/unify_abstracts.py and scripts/unify_sequences.py scripts. No arguments are needed.
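
Since neither script takes arguments, both can be run back to back. The sketch below makes several assumptions that the text does not confirm: that the scripts are launched from the repository root, that their order does not matter, and that PROJECTS_BASE is already set in the environment. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as given in the text above, relative to the repository root.
PREPROCESSING_SCRIPTS = [
    "scripts/unify_abstracts.py",
    "scripts/unify_sequences.py",
]


def preprocessing_commands(repo_root="."):
    """Build one argument-free command per pre-processing script."""
    root = Path(repo_root)
    return [[sys.executable, str(root / script)] for script in PREPROCESSING_SCRIPTS]


def run_preprocessing(repo_root="."):
    """Run each script in turn (assumed order), stopping at the first failure."""
    for cmd in preprocessing_commands(repo_root):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True, cwd=repo_root)
```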

Fine-tuning Datasets:

TAPE datasets can be obtained according to the directions in the TAPE Paper. Each TAPE FT dataset should be stored in its own directory within PROJECTS_BASE/raw_datasets/TAPE. SciBERT datasets can be obtained according to the directions in the SciBERT Paper (simply clone the entire SciBERT GitHub repository into the raw_datasets/scibert folder).

Initializing Models:

Follow the instructions in the TAPE Repository to obtain the initial TAPE pre-trained model. SciBERT can be obtained directly via Hugging Face. The PLUS model can be obtained via the instructions here. Note that the PLUS model's base files must be downloaded and stored in a raw_models subdirectory of PROJECTS_BASE.
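
For the SciBERT step, a minimal sketch of loading the model through the transformers library might look as follows. The checkpoint name allenai/scibert_scivocab_uncased is an assumption: it is a SciBERT checkpoint published on the Hugging Face Hub, but the text does not say which variant these experiments use. The helper name load_scibert is hypothetical.

```python
# Assumed checkpoint; the README does not state which SciBERT variant is used.
SCIBERT_MODEL_ID = "allenai/scibert_scivocab_uncased"


def load_scibert(model_id=SCIBERT_MODEL_ID):
    """Download (or load from the local cache) the tokenizer and model."""
    # Imported lazily so that this module can be imported without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

The first call needs network access; later calls reuse the local Hugging Face cache.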

Synthetic Data:

The dump of sentences from Wikipedia used as node features in our synthetic experiments can be downloaded here. It should be placed in the directory PROJECTS_BASE/synthetic_datasets/. Additional synthetic dataset processing should be run via the notebook synthetic_experiments/'Generate Synthetic Data Node Features & Topics.ipynb'. For the manifolds experiments, you must additionally run the notebook 'Preprocessing Topics for Simplicial Alignment.ipynb'.

Environment Setup

Linux Environment Setup

Navigate to the root directory.
Decide where you want to store your output dir: export OUTPUT_ENV_PATH=[INSERT YOUR PATH HERE]
Install the base conda environment: conda env create -f conda.yml -p $OUTPUT_ENV_PATH
If the process completes successfully, something weird has happened, but just go with it. If the process complains about a non-pip-related issue and rolls back the transaction, something else is wrong, and I don't know how to solve it. If the process fails with a pip installation error, continue below.
To install the broken pip dependencies, first activate the partial conda env: conda activate $OUTPUT_ENV_PATH
Next, install the pip dependencies:
0. $OUTPUT_ENV_PATH/bin/pip install tape_proteins
1. $OUTPUT_ENV_PATH/bin/pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
2. Presuming that succeeds, set some env variables: export TORCH="1.7.0"; export CUDA="cu101"
3. $OUTPUT_ENV_PATH/bin/pip install torch-scatter==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
4. $OUTPUT_ENV_PATH/bin/pip install torch-sparse==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
5. $OUTPUT_ENV_PATH/bin/pip install torch-cluster==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
6. $OUTPUT_ENV_PATH/bin/pip install torch-spline-conv==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-${TORCH}.html
7. $OUTPUT_ENV_PATH/bin/pip install torch-geometric
8. $OUTPUT_ENV_PATH/bin/pip install transformers
9. $OUTPUT_ENV_PATH/bin/pip install pygraphviz --install-option="--include-path=$OUTPUT_ENV_PATH/include/" --install-option="--library-path=$OUTPUT_ENV_PATH/lib/"
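
After working through the steps above, it can help to confirm that every package is importable from the new environment. This is a minimal sketch rather than part of the repository: the helper name find_missing is hypothetical, and the import names are an assumed mapping from the pip package names in the list (for example, tape_proteins is assumed to be imported as tape).

```python
import importlib.util

# Assumed import names for the pip packages installed in steps 0 to 9.
REQUIRED_MODULES = [
    "tape",
    "torch",
    "torchvision",
    "torchaudio",
    "torch_scatter",
    "torch_sparse",
    "torch_cluster",
    "torch_spline_conv",
    "torch_geometric",
    "transformers",
    "pygraphviz",
]


def find_missing(modules=REQUIRED_MODULES):
    """Return the names in `modules` that cannot be located for import.

    find_spec only locates a top-level module; it does not import it,
    so this does not catch packages that are installed but broken.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]
```

Running find_missing() with $OUTPUT_ENV_PATH/bin/python should return an empty list if all of the installs succeeded.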

Synthetic Experiment Reproduction Instructions

Follow the instructions above to obtain and pre-process the synthetic data.
Follow the various 'Reproduction *.ipynb' notebooks in the synthetic_experiments directory.

Pre-training & Fine-tuning Runs

To run these experiments, first make a directory for your run and add to it an argument configuration JSON file in line with graph_augmented_pt/args.py.
Then run the run_pretraining.py script, pointing it at those arguments.
To fine-tune, run the run_finetuning.py script.
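
Creating the run directory and its configuration file can be sketched as below. Everything specific here is an assumption: the helper name write_run_config and the file name args.json are hypothetical, and the valid argument names must be taken from graph_augmented_pt/args.py, which this sketch does not reproduce.

```python
import json
from pathlib import Path


def write_run_config(run_dir, args, filename="args.json"):
    """Create `run_dir` if needed and write `args` into it as JSON.

    `filename` is a placeholder; use whatever name the run scripts expect.
    Returns the path of the written file.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    config_path = run_dir / filename
    with open(config_path, "w") as f:
        # indent=2 keeps the file easy to read and edit by hand.
        json.dump(args, f, indent=2)
    return config_path
```

The args dictionary should contain only fields that graph_augmented_pt/args.py defines; no example keys are given here because the text does not list them.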

mmcdermott/structure_preserving_pre-training
