This code is intended for a casual analysis of preprints on the topic of depression. It goes from data collection via webscraping to analysis of the preprints with Latent Dirichlet Allocation. A writeup of the analysis can be found [here].
The three notebooks' purposes are as follows:
- "01 OSF Scraper" consists of scraping the Open Science Foundation website for information on preprints in general, as an initial survey of sorts for the topic.
- "02 PsyArXiv Scraper" is a more targeted scraper, covering only PsyArXiv. It is similar to the preceding notebook, but with less analysis and the addition of downloading the preprints.
- "03 Extract Paper Texts" is the analysis of the preprints themselves. It performs analysis using both Latent Dirichlet Allocation and non-negative matrix factorization, though only the former was used in the writeup. Much of the code is based on an example in the
scikit-learn
documentation.
The two CSV files include information on the preprints' pages that were scraped. Including the preprints themselves was not practical due to size (a few hundred megabytes).