Improving Language Understanding from Screenshots

This repository contains the code, data, and models for the paper Improving Language Understanding from Screenshots. In this paper, we focus on improving the language understanding ability of "screenshot LMs" (models that process everything -- including text -- within visual inputs) and propose patch-and-text prediction (PTP), a novel pre-training objective for screenshot LMs.

Environment

First, install the latest compatible version of PyTorch.

Then, install all the required packages by running:

pip install -r requirements.txt

We strongly recommend using the exact transformers and accelerate versions listed in requirements.txt for best reproducibility. Please check out the renderer readme to make sure that the renderer is correctly configured.
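As a quick sanity check on the version recommendation above, the following minimal sketch compares the installed transformers and accelerate versions with the ones pinned in requirements.txt. It assumes the file pins packages with `name==version` lines, which may not hold for every entry; `parse_pins` and `find_mismatches` are hypothetical helper names, not part of this repository.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form `name==version`.

    Assumption: pins use `==`. Lines in any other format are skipped.
    """
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop trailing comments and environment markers, then whitespace.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, packages=("transformers", "accelerate")):
    """Return {package: (pinned, installed)} for packages whose installed
    version differs from the pin. `installed` is None if the package is missing."""
    mismatches = {}
    for name in packages:
        if name not in pins:
            continue  # nothing to compare against
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pins[name]:
            mismatches[name] = (pins[name], installed)
    return mismatches


# Example usage, run from the repository root:
#   with open("requirements.txt") as f:
#       print(find_mismatches(parse_pins(f.read())))
```

An empty dictionary means the installed versions match the pins that were found.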

Preparing the data

For our encoder-decoder experiments and the train-from-scratch autoregressive screenshot LM experiments, we use Wikipedia+BookCorpus as the pre-training data. The already-tokenized dataset is available on Hugging Face (princeton-nlp/ptp_data). You can download it by running:

git clone https://huggingface.co/datasets/princeton-nlp/ptp_data data

This folder contains four files:

wikibook_256_opt_tk_train.npy and wikibook_256_opt_tk_val.npy: Wiki+Book using OPT tokenizer, 256 tokens per example (for encoder-decoder).
wikibook_512_llama_tk_train.npy and wikibook_512_llama_tk_val.npy: Wiki+Book using LLaMA tokenizer, 512 tokens per example (for train-from-scratch autoregressive).
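To confirm that a download is intact before starting a long run, a file can be inspected without loading it fully into memory. The sketch below assumes each .npy file stores a plain 2D integer array of shape (number of examples, tokens per example); that layout is inferred from the descriptions above and is not verified here. `describe_token_file` is a hypothetical helper name.

```python
import numpy as np


def describe_token_file(path, expected_len=None):
    """Memory-map a .npy file and report its shape and dtype.

    Assumption: the file holds a 2D array (examples x tokens per example).
    If `expected_len` is given, raise ValueError when that layout does not hold.
    """
    tokens = np.load(path, mmap_mode="r")  # maps the file instead of reading it all
    if expected_len is not None:
        if tokens.ndim != 2 or tokens.shape[1] != expected_len:
            raise ValueError(
                f"expected a 2D array with {expected_len} tokens per example, "
                f"found shape {tuple(tokens.shape)}"
            )
    return {"shape": tuple(tokens.shape), "dtype": str(tokens.dtype)}


# Example usage, after cloning the dataset into ./data:
#   print(describe_token_file("data/wikibook_256_opt_tk_val.npy", expected_len=256))
```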

To continue pre-training Sheared-LLaMA on screenshots, we use Sheared-LLaMA's pipeline for processing RedPajama data. Please follow the Sheared-LLaMA guideline for processing the data. Our example config uses ./data/sheared-llama-rp/for_ft for continued pre-training and ./data/sheared-llama-rp/eval for evaluation.

Reproducing our pre-trained models

To reproduce our models, run the following command (requires 8 GPUs):

NUM_GPU=8 bash run_multiple_gpus.sh {CONFIG PATH}

There are three example configs:

run_configs/ptp.yaml: our main PTP model (encoder-decoder).
run_configs/screenshot-llama-380m.yaml: train-from-scratch autoregressive screenshot LM.
run_configs/screenshot-llama-1.3b-from-sheared-llama.yaml: continued pre-training from Sheared-LLaMA.

You can also run the single-GPU script run_single_gpu.sh for testing. To keep the same hyperparameters (in particular, the total batch size), adjust the per-GPU batch size (per_device_train_batch_size) or the gradient accumulation steps (gradient_accumulation_steps) accordingly if you are not using 8 GPUs or if your GPUs cannot fit our preset batch sizes.
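The adjustment described above comes down to simple arithmetic. In standard data-parallel training, the total batch size is the number of GPUs times per_device_train_batch_size times gradient_accumulation_steps; the sketch below assumes this repository's trainer follows that convention. `accumulation_steps` is a hypothetical helper, and the numbers in the usage comment are illustrative only, not the values from the example configs.

```python
def accumulation_steps(total_batch_size, num_gpus, per_device_batch_size):
    """Return the gradient_accumulation_steps value that keeps the total batch size fixed.

    Assumption: total = num_gpus * per_device_batch_size * gradient_accumulation_steps.
    """
    per_step = num_gpus * per_device_batch_size  # examples processed per forward/backward pass
    if per_step <= 0:
        raise ValueError("num_gpus and per_device_batch_size must be positive")
    if total_batch_size % per_step != 0:
        raise ValueError(
            f"total batch size {total_batch_size} is not divisible by "
            f"{num_gpus} GPUs x {per_device_batch_size} per device"
        )
    return total_batch_size // per_step


# Illustrative numbers: if a config reached a total batch size of 256 with
# 8 GPUs x 16 per device x 2 accumulation steps, a single GPU that still fits
# 16 examples would need accumulation_steps(256, 1, 16) accumulation steps.
```

If the division is not exact, pick a per-device batch size that divides the total evenly rather than rounding, since rounding would silently change the total batch size.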

Downloading our models

We provide the following pre-trained models on Huggingface:

Fine-tuning PTP models

Coming soon!

Bugs or questions?

If you have any questions related to the paper, feel free to email Tianyu (tianyug@cs.princeton.edu). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so that we can help you better and faster!

Citation

Please cite our paper if you use PTP in your work:

@article{gao2024improving,
   title={Improving Language Understanding from Screenshots},
   author={Gao, Tianyu and Wang, Zirui and Bhaskar, Adithya and Chen, Danqi},
   year={2024}
}
