AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters

Paper

Authors: Li Lucy, Suchin Gururangan, Luca Soldaini, Emma Strubell, David Bamman, Lauren Klein, Jesse Dodge

Abstract: Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.

Preprint

Dataset

Models

Our RoBERTa classifier for tagging tokens that refer to social roles: Link

Our reproduced quality filters can be found in data/filter_data/combined/. These filters require scikit-learn 1.2.2.
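
Because the filters are tied to a specific scikit-learn release, it may help to check the installed version before loading anything from data/filter_data/combined/. The snippet below is a minimal sketch of such a check, not part of this repository: the helper names `installed_version` and `versions_match` are hypothetical, and the exact-match rule is an assumption based only on the version stated above. It does not load the filters themselves, since the file format is not described here.

```python
from importlib import metadata
from typing import Optional

# Version stated in this README; everything else here is illustrative.
REQUIRED_SKLEARN = "1.2.2"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_match(installed: Optional[str], required: str = REQUIRED_SKLEARN) -> bool:
    """Return True only when the installed version string equals the required one."""
    return installed is not None and installed.strip() == required


if __name__ == "__main__":
    found = installed_version("scikit-learn")
    if versions_match(found):
        print(f"scikit-learn {found} matches the required version.")
    else:
        print(f"Expected scikit-learn {REQUIRED_SKLEARN}, found {found}.")
```

Running the check in the same environment that will load the filters reports a mismatch early, before any filter file is opened.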

Code Directory

This code directory map is under construction.

Code

cluster
- cluster.py
- train_clusterer.py
filter
- lr
  - hyperparameters.py
  - lr_quality_filters.py
  - train.py
  - util.py
- evaluate_ft_models.py
- quality_data_org.py
- rule_based_scores.py
- sample_openwebtext2.py
- score_manager.py
- text_normalizer.py
- wikipedia_perplexity.py
- zreader.py
get_data
- bloomfilter.py
- dataset_statistics.py
- get_random_pages.py
- url_processor.py
- website_expander.py
identity_measures
- geography
- personas
- roberta_classifier
- person_vs_orgs.py
- spacy_helper.py

