OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Keiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba

About OpenWebMath

OpenWebMath is a dataset containing the majority of the high-quality mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and finetuning large language models.

You can download the dataset using Hugging Face:

from datasets import load_dataset
ds = load_dataset("open-web-math/open-web-math")
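If you only want to inspect a few documents, downloading the full dataset may be unnecessary. The following is a minimal sketch using the streaming mode of the `datasets` library. The `first_n` and `stream_openwebmath` helpers are illustrative names rather than part of this repository, and the `"train"` split name is an assumption.

```python
from itertools import islice
from typing import Any, Iterable, List


def first_n(records: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(records, n))


def stream_openwebmath():
    """Open the dataset lazily instead of downloading every shard up front."""
    # Imported here so first_n stays usable without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the data is exposed under a single "train" split.
    return load_dataset("open-web-math/open-web-math", split="train", streaming=True)


# Example usage (requires network access and the `datasets` package):
#   for record in first_n(stream_openwebmath(), 3):
#       print(record["url"])
```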

Code Structure

We provide code in this repository to reproduce our processing pipeline. The code is organized into three separate folders:

text_extraction contains the code for extracting text and LaTeX from HTML documents.
extract_from_cc contains the code for extracting the dataset from Common Crawl, including prefiltering, language identification, MathScore filtering, and perplexity filtering.
filtering includes many of the manual filtering steps, including blacklisted domains.

To run the extract_from_cc code, either run it in Apache Spark or run extract_from_warc.py manually, passing a WARC file as an argument.

For deduplication, please use the text-dedup library.

Finally, for filtering, filter.py contains the code to load a Hugging Face dataset and filter it based on our heuristics.

The MathScore model and KenLM model will be released in the near future.

OpenWebMath Pipeline

OpenWebMath builds on the massive Common Crawl dataset, which contains over 200B HTML documents. We filtered the data to include only documents that (1) are in English, (2) contain mathematical content, and (3) are of high quality. We also put a strong emphasis on extracting LaTeX content from the HTML documents and on reducing boilerplate in comparison to other web datasets.

The OpenWebMath pipeline consists of five steps:

Prefiltering HTML Documents:
- We apply a simple prefilter to all HTML documents in Common Crawl to skip documents without mathematical content and avoid unnecessary processing time.
Text Extraction:
- Extract text, including LaTeX content, from the HTML documents while removing boilerplate.
Content Classification and Filtering:
- Apply a FastText language identification model to keep only English documents.
- Filter out high-perplexity documents using a KenLM model trained on Proof-Pile.
- Filter out non-mathematical documents using our own MathScore model.
Deduplication:
- Deduplicate the dataset using SimHash in text-dedup.
Manual Inspection:
- Inspect the documents gathered from previous steps and remove low-quality pages.
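The automated steps above can be pictured as a chain of cheap-to-expensive filters. The sketch below is only an illustration of that control flow, not the code in extract_from_cc: the marker strings, the `Scorers` container, the thresholds, and the filter order within the classification step are all assumptions, and the language, perplexity, and MathScore models are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

# Hypothetical marker strings; the actual prefilter rules may differ.
MATH_MARKERS = ("<math", "\\frac", "\\begin{", "mathjax", "katex", "$$")


def looks_mathematical(html: str) -> bool:
    """Cheap string check used to skip pages before any expensive processing."""
    lowered = html.lower()
    return any(marker in lowered for marker in MATH_MARKERS)


@dataclass
class Scorers:
    """Stand-ins for the FastText, KenLM, and MathScore models."""
    language: Callable[[str], str]      # returns a language code such as "en"
    perplexity: Callable[[str], float]  # lower means more fluent text
    math_score: Callable[[str], float]  # higher means more mathematical


def filter_documents(
    html_docs: Iterable[str],
    extract_text: Callable[[str], str],
    scorers: Scorers,
    max_perplexity: float,
    min_math_score: float,
) -> Iterator[str]:
    """Yield extracted text for documents that pass every automated filter."""
    for html in html_docs:
        if not looks_mathematical(html):
            continue  # prefilter: no sign of math in the raw HTML
        text = extract_text(html)
        if scorers.language(text) != "en":
            continue  # keep English documents only
        if scorers.perplexity(text) > max_perplexity:
            continue  # drop high-perplexity documents
        if scorers.math_score(text) < min_math_score:
            continue  # drop non-mathematical documents
        yield text
```

Running the string check first means text extraction and the model-based scorers only see pages that already show some sign of mathematical content.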

For a detailed discussion on the processing pipeline, please refer to our paper.
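For readers unfamiliar with SimHash, the idea behind the deduplication step can be sketched in a few lines. This is a simplified, standard-library illustration and not the text-dedup implementation: it hashes whitespace-separated tokens with MD5 and uses a fixed 64-bit fingerprint, both of which are assumptions made for brevity.

```python
import hashlib


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * 64
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(64):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents are treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold itself is a tuning parameter that this sketch does not fix.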

OpenWebMath Contents

The dataset is structured as follows:

{
  "text": ...,  # document text.
  "url": ...,  # document url.
  "date": ...,  # date the page was crawled.
  "metadata": ...,  # JSON containing information from the extraction process.
}
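As a rough illustration of working with this structure, the sketch below normalizes one record into a plain dictionary. It assumes the `metadata` field arrives as a JSON-encoded string (it also accepts an already-decoded dictionary); `parse_record` is a hypothetical helper, not something provided by this repository.

```python
import json
from typing import Any, Dict
from urllib.parse import urlparse


def parse_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the metadata field and attach the host name of the document URL."""
    metadata = record["metadata"]
    if isinstance(metadata, str):
        # Assumption: metadata is serialized as a JSON string.
        metadata = json.loads(metadata)
    return {
        "text": record["text"],
        "url": record["url"],
        "host": urlparse(record["url"]).netloc,
        "date": record["date"],
        "metadata": metadata,
    }
```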

OpenWebMath contains documents from over 130k different domains, including data from forums, educational pages, and blogs. The dataset contains documents covering mathematics, physics, statistics, computer science, and more. The following table shows the most common domains in OpenWebMath by character count.

Domain	# Characters	% Characters
stackexchange.com	4,655,132,784	9.55%
nature.com	1,529,935,838	3.14%
wordpress.com	1,294,166,938	2.66%
physicsforums.com	1,160,137,919	2.38%
github.io	725,689,722	1.49%
zbmath.org	620,019,503	1.27%
wikipedia.org	618,024,754	1.27%
groundai.com	545,214,990	1.12%
blogspot.com	520,392,333	1.07%
mathoverflow.net	499,102,560	1.02%
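A table of this kind can be approximated from the `url` and `text` fields. The sketch below is an assumption about how such statistics might be computed, not the script used to produce the table above. In particular, `base_domain` keeps only the last two labels of the host name, which is a simplification that mishandles suffixes such as `co.uk`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse


def base_domain(url: str) -> str:
    """Reduce a URL to its last two host labels (simplified domain grouping)."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def domain_character_shares(
    records: Iterable[Dict[str, str]],
) -> List[Tuple[str, int, float]]:
    """Return (domain, character count, percent of all characters), largest first."""
    counts: Counter = Counter()
    for record in records:
        counts[base_domain(record["url"])] += len(record["text"])
    total = sum(counts.values())
    if total == 0:
        return []
    return [(domain, n, 100.0 * n / total) for domain, n in counts.most_common()]
```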

License

OpenWebMath is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.

Citation Information

@misc{paster2023openwebmath,
      title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
      author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
      year={2023},
      eprint={2310.06786},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}
