This repo contains the R and Python code to create this wordcloud of the top 200 most frequent words in Man the Hunter (Lee and DeVore, 1968):
Download the repo. To install the Python dependencies, first install uv (a Python package and project manager). Then in a terminal:
cd /path/to/mth
uv sync
uv run python -m spacy download en_core_web_sm
Then run the R code in wordcloud.R (first install any missing packages), which will create the file wordcloud.pdf:
source("wordcloud.R")

- The Python code is called from the R script, using the reticulate package.
- Running the R code in RStudio will cause a crash. Use Positron or a terminal.
- The text of Man the Hunter was obtained from the Internet Archive, and will be downloaded automatically by the R script.
- The text contains miscellaneous errors, presumably introduced when the original was converted to text.
- Miscellaneous text cleaning included an attempt to "de-hyphenate" words hyphenated at the ends of lines, e.g., "compara-tive".
- Stop words were removed using the tidytext package.
- Some stop words in the tidytext package, specifically "man", "men", "group", "groups", "area", and "areas", were not removed.
- Words were lemmatized using the spaCy Python package.
- Word embeddings were computed using the GloVe algorithm from the text2vec package.
- Reduction of the high-dimensional word embedding space to 2D for visualization was done with PaCMAP.
- The wordcloud was plotted using the ggrepel package.
- Neither the PaCMAP nor the ggrepel algorithm is deterministic. A seed was set for reproducibility, but differences in hardware and operating systems can cause differences in the final appearance of the wordcloud.
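
The de-hyphenation and stop-word steps described in the notes above can be illustrated with a minimal Python sketch. This is not the code in wordcloud.R, which does this work in R with tidytext. The regular expression, the function names `dehyphenate` and `top_words`, and the tiny `STOP_WORDS` set are all assumptions made for illustration, and lemmatization is omitted.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for a real stop-word list
# (the repo itself uses the tidytext stop words in R).
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is",
              "man", "men", "group", "groups", "area", "areas"}

# Stop words that the notes say were deliberately NOT removed.
KEEP = {"man", "men", "group", "groups", "area", "areas"}


def dehyphenate(text: str) -> str:
    """Join words split by a hyphen at the end of a line.

    Assumption: any 'letter-<newline>letter' sequence is a line-break
    hyphen. This also joins genuine compounds that happen to break
    across lines, so it is only an approximation.
    """
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)


def top_words(text: str, n: int = 200) -> list[tuple[str, int]]:
    """Return the n most frequent non-stop words with their counts."""
    tokens = re.findall(r"[a-z]+", dehyphenate(text).lower())
    drop = STOP_WORDS - KEEP  # exempt the kept words from removal
    counts = Counter(t for t in tokens if t not in drop)
    return counts.most_common(n)


sample = "The compara-\ntive study of man and the hunting group in the area."
result = top_words(sample)
print(result)
```

In this sketch, "compara-" and "tive" are rejoined before counting, and "man", "group", and "area" survive filtering because they are subtracted from the set of words to drop.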