
mitmedialab/LLM-MedQA


This is the GitHub repository for the paper:

People trust AI-generated medical responses and view them to be as valid as doctors, despite low accuracy

Authors

  • Shruthi Shekar, MIT Media Lab, Massachusetts Institute of Technology, MA, USA
  • Pat Pataranutaporn, MIT Media Lab, Massachusetts Institute of Technology, MA, USA
  • Chethan Sarabu, Stanford Medicine, Stanford University, CA, USA
  • Guillermo A. Cecchi, IBM Research, NY, USA
  • Pattie Maes, MIT Media Lab, Massachusetts Institute of Technology, MA, USA

*The first two authors contributed equally to this paper; e-mail: sshekar@mit.edu, patpat@mit.edu

Abstract

This paper presents a comprehensive analysis of how AI-generated medical responses are perceived and evaluated by non-experts. A total of 300 participants evaluated medical responses that were either written by a medical doctor on an online healthcare platform, or generated by a large language model and labeled by medical experts as having high or low accuracy. Results showed that participants could not effectively distinguish between AI-generated and Doctors' responses and demonstrated a preference for AI-generated responses, rating High Accuracy AI-generated responses as significantly more valid, trustworthy, and complete/satisfactory. Low Accuracy AI-generated responses on average performed very similarly to, if not better than, Doctors' responses. Participants not only found these low-accuracy AI-generated responses to be valid, trustworthy, and complete/satisfactory but also indicated a high tendency to follow the advice and seek medical attention as a result of the response provided. This reaction was similar, if not superior, to the reaction they displayed towards Doctors' responses. Such misplaced trust in inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of High Accuracy AI-generated responses when told they were given by a doctor, and experts rated AI-generated responses significantly higher when the source of the response was unknown. Both experts and non-experts exhibited bias, finding AI-generated responses to be more thorough and accurate than Doctors' responses while still valuing the involvement of a doctor in the delivery of their medical advice.
Ultimately, implementing AI systems in collaboration with medical professionals should be the future direction of using AI for the delivery of medical advice, in order to mitigate the risk of misinformation while reaping the benefits of such cutting-edge technology.

Repository Overview

  • Raw Experiment Data

• This folder contains the raw data for Experiments 1, 2, and 3, as completed by the participants on Prolific and acquired from Qualtrics.
  • Cleaned Experiment Data

• This folder contains the data after removing responses from participants who failed the screeners or did not complete the study.
  • Organized Experiment Data

• This folder contains an organized, reformatted dataset for each of the three experiments. Each participant's response scores are recorded with a clear indication of the Evaluation Metric being measured, the source of the Medical Response provided, the score given by the participant, and other relevant details.
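The reorganization described above amounts to reshaping the wide Qualtrics export (one column per metric/source combination) into a long table with one score per row. A minimal sketch of that transformation with pandas, using entirely hypothetical column names and values (the real export's columns are not documented here):

```python
import pandas as pd

# Hypothetical raw Qualtrics export: one row per participant, one column
# per (evaluation metric, response source) pair. Column names are
# illustrative only, not the repository's actual schema.
raw = pd.DataFrame({
    "participant_id": ["P1", "P2"],
    "trust_ai_high": [6, 5],
    "trust_doctor": [4, 5],
    "validity_ai_high": [7, 6],
    "validity_doctor": [5, 6],
})

# Reshape wide -> long so each row holds one score together with the
# evaluation metric and the source of the medical response.
long = raw.melt(id_vars="participant_id", var_name="item", value_name="score")
long[["evaluation_metric", "response_source"]] = long["item"].str.split(
    "_", n=1, expand=True
)
long = long.drop(columns="item")
print(long)
```

The long format is what downstream group-by analyses and plotting expect: each row is one observation, keyed by participant, metric, and response source.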
  • Expert Evaluation & Dataset Generation

• This folder contains files with the original 150 medical questions and answers (both AI-generated and provided by Doctors) and their respective Accuracy, Strength, and Completeness scores, as given by our medical expert evaluators across the Blind and Non-Blind evaluations.
Python Files

• Contains the Python scripts used to (1) clean the raw data, (2) organize the cleaned data, and (3) graph and analyze the organized data.
• Contains the Python script used to analyze participant demographics.
• Contains the Python script used to analyze the expert evaluation scores of the medical responses (two-way ANOVA and t-test).
• Contains the Python script used to perform a linguistic analysis of the medical responses (basic ANOVA).
Statistical Analysis

• Contains the R files used to perform Hierarchical Linear Model (HLM)-based statistical analyses of the results from Experiments 1, 2, and 3.
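A hierarchical (mixed-effects) linear model of this kind treats each participant as a grouping level, so repeated ratings from the same person share a random intercept. The repository fits these models in R; a conceptually equivalent sketch in Python with statsmodels, on synthetic data with illustrative variable names, looks like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic repeated-measures ratings: each participant scores several
# responses from different sources. Names and values are illustrative only;
# the repository's actual models are specified in its R files.
rng = np.random.default_rng(1)
n_participants, n_items = 30, 6
df = pd.DataFrame({
    "participant": np.repeat(
        [f"P{i}" for i in range(n_participants)], n_items
    ),
    "source": np.tile(["doctor", "ai_high", "ai_low"] * 2, n_participants),
    "score": rng.normal(5, 1, n_participants * n_items),
})

# Random intercept per participant accounts for the repeated measures;
# the fixed effect of interest is the response source.
model = smf.mixedlm("score ~ C(source)", df, groups=df["participant"])
result = model.fit()
print(result.summary())
```

Compared with a plain ANOVA, the random intercept prevents between-participant baseline differences from inflating the apparent effect of response source.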
  • Expert Evals Dataset

    • Raw data from Blind and Non-Blind expert evaluations of Accuracy, Strength, and Completeness of the medical responses.
