Empowering Conversations: LLM Project Enhancements
llmproject uses a range of NLP models to evaluate model performance within the healthcare domain. The codebase compares the outputs of several open-source models against expert-annotated answers from the ExpertQA dataset, using evaluation metrics such as smoothed BLEU, BERTScore, and cosine similarity. The primary goal is to assess how well these models can replicate, or improve upon, expert-level answers to a variety of questions.
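At a high level, each model's generated answers are paired with the corresponding ExpertQA reference answers before scoring. A minimal sketch of that pairing step, assuming `expertqa.jsonl` stores one JSON object per line with `question`/`answer` keys and that the output files map questions to generated answers (the real schemas may differ):

```python
import json

# Load expert-annotated reference answers (schema assumed for illustration).
references = {}
with open("expertqa.jsonl") as f:
    for line in f:
        record = json.loads(line)
        references[record["question"]] = record["answer"]

# Load one model's generated answers (schema assumed for illustration).
with open("model_outputs/mistral.json") as f:
    outputs = json.load(f)

# Pair each generated answer with its expert reference for scoring.
pairs = [(references[q], a) for q, a in outputs.items() if q in references]
print(f"{len(pairs)} pairs ready for BLEU / BERTScore / cosine scoring")
```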
| | Feature | Description |
|---|---|---|
| ⚙️ | Architecture | The project architecture is deliberately simple. Each file in `generation_code` is an independent script that either calls OpenAI's GPT API or builds out a full pipeline to run inference on models like Mistral. Because each file is independent of the others, the design stays modular, which aids scalability and maintainability if the models are ever updated or further tuned. |
| 📄 | Documentation | Along with this README, the project contains documentation in the form of a research paper that can be found here. |
| 🔌 | Integrations | Key integrations include Hugging Face Transformers for model management, NLTK for BLEU score calculation, and PyTorch for model quantization and inference (a quantized-loading sketch follows this table). Data handling relies on Python's built-in `json` library for JSON and JSONL files. |
| 🧩 | Modularity | The codebase is structured with a clear separation into two main directories: `generation_code` for model operations and `eval_code` for evaluation metrics. The remaining directories contain results and outputs. |
| 📦 | Dependencies | Dependencies include Python libraries such as `transformers`, `torch`, `pandas`, `nltk`, and `matplotlib` for data processing, model management, and visualization. The setup requires handling several data formats and integrating multiple machine learning and NLP models. |
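The Integrations row above mentions PyTorch quantization. As a rough sketch of what 4-bit quantized loading with Hugging Face Transformers can look like (the model ID and quantization settings here are illustrative assumptions, not necessarily what the scripts use):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model ID; BioMistral is one of the evaluated models.
model_id = "BioMistral/BioMistral-7B"

# 4-bit quantization keeps a 7B-parameter model within a single GPU's memory.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available devices automatically
)
```

Quantized loading is what makes running 7B-parameter models feasible on commodity hardware for this kind of evaluation.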
```
└── llmproject/
├── LLM Plots cosine_and_bert
│ ├── bertscore_across_all_models.png
│ ├── bertscore_across_question_type.png
│ ├── bertscore_across_specific_field.png
│ ├── cosine_across_all_models.png
│ ├── cosine_across_question_type.png
│ └── cosine_across_specific_field.png
├── README.md
├── bleu_results
│ ├── Cleaned Evaluation Bleu Scores.ipynb
│ ├── Evaluation for LLM Project - Madeline.ipynb
│ ├── biomstrl_with_qtype_field_smoothed_scores.csv
│ ├── gpt_with_qtype_field_smoothed_scores.csv
│ ├── medchatbot_with_qtype_field_smoothed_scores.csv
│ ├── smoothed_bleu_score_biomstrl.csv
│ ├── smoothed_bleu_score_gpt.csv
│ └── smoothed_bleu_score_medicalchatbot.csv
├── cosine_bert_with_question_types
│ ├── biomistral_bert_score_types.csv
│ ├── biomistral_cosine_similarity_types.csv
│ ├── gpt_cosine_similarity.csv
│ ├── gpt_cosine_similarity_types.csv
│ ├── medical_chatbot_bert_score_types.csv
│ ├── medical_chatbot_cosine_similarity_types.csv
│ ├── mistral_bert_score_types.csv
│ └── mistral_cosine_similarity_types.csv
├── cosine_bert_without_question_types
│ ├── biomistral_bert_score.csv
│ ├── biomistral_cosine_similarity.csv
│ ├── gpt_bert_score 2.44.52 PM.csv
│ ├── gpt_bert_score_types.csv
│ ├── medical_chatbot_bert_score.csv
│ ├── medical_chatbot_cosine_similarity.csv
│ ├── mistral_bert_score.csv
│ └── mistral_cosine_similarity.csv
├── eval_code
│ ├── Bleu Eval Graphs.ipynb
│ └── LLM_Evaluation_Cosine_BERT.ipynb
├── expertqa.jsonl
├── generation_code
│ ├── biomistral.py
│ ├── gpt.py
│ ├── medical_chatbot.ipynb
│ └── mistral.py
└── model_outputs
├── biomistral.json
├── gpt.json
├── medical_chatbot.json
└── mistral.json
```
| File | Summary |
|---|---|
| `expertqa.jsonl` | Dataset from the ExpertQA project. Within llmproject, this dataset is filtered to pull medical-domain questions along with their expert-annotated answers, which serve as the ground truth (see the filtering sketch below). |
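A minimal sketch of that filtering step, assuming each JSONL record carries a label identifying its domain (the `field` key and the healthcare label are assumptions about the schema):

```python
import json

# Keep only the healthcare-domain records from ExpertQA.
healthcare = []
with open("expertqa.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if "healthcare" in record.get("field", "").lower():  # assumed key/label
            healthcare.append(record)

print(f"{len(healthcare)} healthcare question-answer pairs")
```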
generation_code
| File | Summary |
|---|---|
| `medical_chatbot.ipynb` | Reads ExpertQA question-answer pairs from a JSONL file, iterates over only the healthcare questions, generates responses with a transformer model from Hugging Face, and stores the results in `medical_chatbot.json`. |
| `gpt.py` | Handles the ingestion of JSON data, queries the GPT model for answers, and logs the responses to an output file. |
| `mistral.py` | Reads ExpertQA question-answer pairs from a JSONL file, iterates over only the healthcare questions, generates responses with a transformer model from Hugging Face, and stores the results in `mistral.json` (a generation sketch follows this table). |
| `biomistral.py` | Reads ExpertQA question-answer pairs from a JSONL file, iterates over only the healthcare questions, generates responses with a transformer model from Hugging Face, and stores the results in `biomistral.json`. |
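The three transformer-based generators share the same shape: load a model, loop over the healthcare questions, generate, and persist. A minimal sketch of that loop using a Transformers `pipeline` (the model ID, decoding parameters, and output schema are all illustrative assumptions):

```python
import json
from transformers import pipeline

# Illustrative model ID and decoding settings; the actual scripts may load
# and quantize the model differently (see the sketch under Integrations).
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.1")

questions = [
    "What are common first-line treatments for hypertension?",  # example input
]

results = {}
for question in questions:
    out = generator(question, max_new_tokens=256, do_sample=False)
    results[question] = out[0]["generated_text"]

# Persist results in the same spirit as model_outputs/mistral.json
# (the output schema here is an assumption).
with open("model_outputs/mistral.json", "w") as f:
    json.dump(results, f, indent=2)
```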
bleu_results
| File | Summary |
|---|---|
| [Cleaned Evaluation Bleu Scores.ipynb](https://github.com/mshroff123/llmproject/blob/master/bleu_results/Cleaned%20Evaluation%20Bleu%20Scores.ipynb) | Implements the smoothed BLEU score evaluation and includes the code for all BLEU-related graphs (a smoothed-BLEU sketch follows this table). |
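A minimal sketch of how a smoothed BLEU score can be computed with NLTK (the specific smoothing method is an assumption; the notebooks may use a different one):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Beta blockers reduce heart rate and blood pressure.".split()
candidate = "Beta blockers lower heart rate and reduce blood pressure.".split()

# method1 adds epsilon counts to zero n-gram matches so short or partially
# matching candidates do not collapse to a BLEU of 0.
smoother = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoother)
print(f"Smoothed BLEU: {score:.4f}")
```

Smoothing matters here because free-form expert answers rarely share long n-grams with a generated response, so unsmoothed BLEU would frequently be zero.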
eval_code
| File | Summary |
|---|---|
| [Bleu Eval Graphs.ipynb](https://github.com/mshroff123/llmproject/blob/master/eval_code/Bleu%20Eval%20Graphs.ipynb) | Implements the smoothed BLEU score evaluation and includes the code for all BLEU-related graphs. |
| `LLM_Evaluation_Cosine_BERT.ipynb` | Implements the cosine similarity and BERTScore evaluations and includes the code for all related graphs (a metric sketch follows this table). |
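A minimal sketch of those two metrics, using the `bert-score` package and a `sentence-transformers` embedding model (both package and model choices here are assumptions about how the notebook computes them):

```python
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

reference = "Beta blockers reduce heart rate and blood pressure."
candidate = "Beta blockers lower heart rate and reduce blood pressure."

# BERTScore: token-level similarity computed over contextual embeddings.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {f1.item():.4f}")

# Cosine similarity between whole-sentence embeddings
# (the embedding model below is an illustrative choice).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([reference, candidate], convert_to_tensor=True)
print(f"Cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.4f}")
```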
Contributions are welcome! Here are several ways you can contribute:
- Submit Pull Requests: Review open PRs, and submit your own PRs.
- Join the Discussions: Share your insights, provide feedback, or ask questions.
- Report Issues: Submit bugs found or log feature requests for llmproject.
Contributing Guidelines
- Fork the Repository: Start by forking the project repository to your GitHub account.
- Clone Locally: Clone the forked repository to your local machine using a Git client.
```sh
git clone https://github.com/mshroff123/llmproject
```
- Create a New Branch: Always work on a new branch, giving it a descriptive name.
```sh
git checkout -b new-feature-x
```
- Make Your Changes: Develop and test your changes locally.
- Commit Your Changes: Commit with a clear message describing your updates.
```sh
git commit -m 'Implemented new feature x.'
```
- Push to GitHub: Push the changes to your forked repository.
```sh
git push origin new-feature-x
```
- Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
Once your PR is reviewed and approved, it will be merged into the main branch.
- Mistral AI for their open-source model.
- BioMistral team for fine-tuning the Mistral-7B model on the BioASQ dataset.
- OpenAI for their GPT-3.5 model.
- Hugging Face for their Transformers library.
Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong (2022). QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization. Preprint. arXiv. https://doi.org/10.48550/arXiv.2112.08542.
Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth (2024). ExpertQA: Expert-Curated Questions and Attributed Answers. Preprint. arXiv. https://doi.org/10.48550/arXiv.2309.07852.
Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar, Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris, Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah (2023). Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint. arXiv. https://doi.org/10.48550/arXiv.2304.13714.
Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G. Nestor, Ali Soroush, Pierre A. Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F. Rousseau, Chunhua Weng, Yifan Peng (2023). Evaluating Large Language Models on Medical Evidence Summarization. npj Digit. Med. 6, 158. https://doi.org/10.1038/s41746-023-00896-7.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi (2020). BERTScore: Evaluating Text Generation With BERT. Preprint. arXiv. https://doi.org/10.48550/arXiv.1904.09675.