This repository contains code and data associated with the paper "Word embeddings quantify 100 years of gender and ethnic stereotypes". PDF available here.

If you use the content in this repository, please cite:

Garg, N., Schiebinger, L., Jurafsky, D. & Zou, J. Word embeddings quantify 100 years of gender and ethnic stereotypes. PNAS 201720347 (2018). doi:10.1073/pnas.1720347115

To re-run all analyses and plots:

  1. download vectors from online sources and normalize each vector by its L2 norm (links in paper and below)
  2. set up parameters to run as in run_params.csv
  3. run
  4. run
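The normalization in step 1 can be sketched as follows. This is a minimal illustration, not the repository's own script: the function names `l2_normalize` and `normalize_embeddings` and the in-memory dictionary layout are assumptions made for the example, and zero vectors are left unchanged by choice.

```python
import math
from typing import Dict, List


def l2_normalize(vec: List[float]) -> List[float]:
    """Return vec scaled to unit L2 (Euclidean) norm.

    An all-zero vector has no direction, so it is returned unchanged
    (an assumption of this sketch, not something the README specifies).
    """
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def normalize_embeddings(emb: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Apply L2 normalization to every word vector in a {word: vector} map."""
    return {word: l2_normalize(vec) for word, vec in emb.items()}


if __name__ == "__main__":
    # Hypothetical toy vectors, only to show the call.
    toy = {"doctor": [3.0, 4.0], "nurse": [1.0, 1.0]}
    for word, vec in normalize_embeddings(toy).items():
        print(word, vec)
```

After normalization every non-zero vector has length 1, so distances and dot products between vectors from different sources are on a comparable scale.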

dataset_utilities/ contains various helper scripts to preprocess files and create word vectors. From a corpus, for example LDC95T21-North-American-News, that contains many text files (each containing an article) from a given year, first run to create a single text file per year (with only valid words). Then, run on each of these files to create vectors, potentially combining multiple years into a single training set. contains utilities to standardize the vectors.
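As a rough sketch of the first preprocessing step described above (merging one year's article files into a single text file of valid words), something like the following could be used. It is not the script from dataset_utilities/: the names `valid_words` and `merge_year` are hypothetical, and "valid word" is assumed here to mean a lowercase run of ASCII letters, which may differ from the rule the authors used.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Assumption: a "valid word" is a run of ASCII letters, lowercased.
TOKEN_RE = re.compile(r"[a-z]+")


def valid_words(text: str) -> List[str]:
    """Lowercase the text and keep only alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def merge_year(article_paths: Iterable[Path], out_path: Path) -> int:
    """Write one line of valid words per article to out_path.

    Returns the total number of words written. Articles with no
    valid words are skipped.
    """
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(article_paths):
            text = path.read_text(encoding="utf-8", errors="ignore")
            words = valid_words(text)
            if words:
                out.write(" ".join(words) + "\n")
                count += len(words)
    return count
```

A call such as `merge_year(Path("1995").glob("*.txt"), Path("1995.txt"))` (directory and file names are placeholders) would produce the single per-year file that a vector-training tool can then consume.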

We uploaded the New York Times embeddings generated for this paper.

2021/04/05 update: Unfortunately, the files are no longer available (the links died upon my graduation, before I was able to back them up). However, the original text data is still available as the New York Times Annotated Corpus, so the vectors can be trained as described in the paper.

We use the following embeddings publicly available online. If you use these embeddings, please cite the associated papers.

  1. Google News, word2vec
  2. Genre-Balanced American English (1830s-2000s), SGNS and SVD
  3. Wikipedia, GloVe

Note: the paper mistakenly indicates that the Genre-Balanced American English embeddings contain data from both Google Books and the Corpus of Historical American English (COHA). They contain only data from COHA, though the same website also provides embeddings trained using Google Books.

