This repository accompanies the blog post "How Well Can GPT Do Scientific Literature Meta-analysis?", exploring the use of GPT-4 for conducting meta-analysis of scientific literature. The project investigates whether Large Language Models (LLMs) like GPT-4 can assist in synthesizing research findings, potentially offering a more efficient, consistent, and cost-effective approach than traditional methods.
A prototype results-browser app is available at https://llm-metaanalysis.finedataproducts.com/, where you can view the output of the analysis on a selection of open-access papers.
- `analyze_papers.py`: Workflow for analyzing papers using GPT models.
- `common.py`: Shared utility functions and classes.
- `data/`: Directory containing datasets and annotations.
- `decision_tree_chat.py`: Interactive decision tree for paper classification.
- `process_papers.py`: Workflow for initial processing of papers.
- `requirements.{in,txt}`: Python dependencies for the project.
- `results_browser.py`: A Streamlit app for browsing results.
- `scripts/`: Utility scripts for data processing and analysis.
- Clone the repository:

  ```
  git clone https://github.com/mmacpherson/funk-et-al-2008-llm-meta-analysis.git
  cd funk-et-al-2008-llm-meta-analysis
  ```

- Set up the environment and install Python dependencies:

  ```
  make env # Requires `pyenv` and `pyenv-virtualenv`.
  ```

  (The included `requirements.txt` supplies the dependencies needed to run the analysis, if you prefer to manage your virtualenvs with something other than pyenv/pyenv-virtualenv.)

- Configure OpenAI API access by creating a file called `.env` with an entry like:

  ```
  OPENAI_API_KEY={key_here}
  ```
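For reference, `.env` files use simple `KEY=value` lines. The project itself may load this file via a library such as python-dotenv; the sketch below is only an illustration of the expected file format, with a hypothetical `parse_env` helper:

```python
import os

def parse_env(text: str) -> dict:
    """Parse KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Example .env contents; substitute your real API key for the placeholder.
env = parse_env("# API credentials\nOPENAI_API_KEY={key_here}\n")
os.environ.update(env)
```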
To process PDF papers into a vector store, run this script:

```
python process_papers.py run
```

Run `python process_papers.py run --help` to see available arguments.
As provided here, this workflow assumes that you've looked up some set of papers using Semantic Scholar's API and stored them in a `papers` table with this schema:

```sql
CREATE TABLE papers (
    semantic_scholar_id TEXT PRIMARY KEY NOT NULL,
    semantic_scholar_json TEXT NOT NULL
)
```
It also assumes a `pdfs` table with this schema, containing the PDF content for each paper:

```sql
CREATE TABLE pdfs (
    doi TEXT PRIMARY KEY NOT NULL,
    pdf_content BLOB NOT NULL,
    pdf_md5 TEXT NOT NULL,
    direct INTEGER NOT NULL -- Treated as boolean; could we download directly from the open internet, as opposed to UC?
)
```
If those tables exist, the downstream SQLite tables and ChromaDB vector store will be created automatically.
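As a sketch of what the two input tables look like when populated, here is a self-contained example using an in-memory database and placeholder values (your real database path, IDs, and PDF bytes are the obvious substitutions):

```python
import hashlib
import sqlite3

# In-memory database for illustration; point this at your papers database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE papers (
    semantic_scholar_id TEXT PRIMARY KEY NOT NULL,
    semantic_scholar_json TEXT NOT NULL
);
CREATE TABLE pdfs (
    doi TEXT PRIMARY KEY NOT NULL,
    pdf_content BLOB NOT NULL,
    pdf_md5 TEXT NOT NULL,
    direct INTEGER NOT NULL  -- boolean: fetched from the open internet?
);
""")

pdf_bytes = b"%PDF-1.4 placeholder"  # real rows hold the full PDF bytes
conn.execute(
    "INSERT INTO papers VALUES (?, ?)",
    ("placeholder-s2-id", '{"title": "Example Paper"}'),
)
conn.execute(
    "INSERT INTO pdfs VALUES (?, ?, ?, ?)",
    ("10.1000/example", pdf_bytes, hashlib.md5(pdf_bytes).hexdigest(), 1),
)
conn.commit()
n_papers = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
```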
The SQLite database provided at `data/funk-etal-2008.selected-open-access.db` contains example data.
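To peek at what tables a given database contains, a small helper like the following works (the `list_tables` function is a hypothetical convenience, not part of this repo):

```python
import sqlite3

def list_tables(db_path: str) -> list:
    """Return the names of all tables in a SQLite database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# e.g. list_tables("data/funk-etal-2008.selected-open-access.db")
```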
To run the meta-analysis itself:

```
python analyze_papers.py run
```

Run `python analyze_papers.py run --help` to see available arguments. See e.g. `scripts/run-pilot-set` for the command used to run the analysis over our pilot/training set.
To browse the results in the Streamlit app:

```
streamlit run results_browser.py
```
This repository provides the code to reproduce the analysis described in the accompanying blog post. As it is meant primarily for replication purposes, active development is limited.
However, if you have any questions, comments, or suggestions, please feel free to open an issue! I'm happy to answer questions about the methodology, discuss the findings, or hear any ideas you may have for extending the analysis.
Contributions in the form of bug reports, feature requests, or pull requests are also welcome, though I can't guarantee very active maintenance. I'm sharing this code in the hopes that others may find it useful or build upon it in their own work.
This project is licensed under the CC0 1.0 Universal License. For more details, see the LICENSE file in this repository or visit Creative Commons CC0 1.0 Universal.