LLMPROJECT

Empowering Conversations: LLM Project Enhancements


Developed with Jupyter, Python, and JSON.




Overview

The llmproject codebase evaluates the performance of open-source NLP models in the healthcare domain. It compares each model's outputs against expert-annotated answers from ExpertQA using metrics such as smoothed BLEU, BERTScore, and cosine similarity. The primary goal is to assess how well these models can replicate, or improve upon, expert-level answers to a variety of questions.
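
As a concrete illustration of the smoothed BLEU metric named above, here is a minimal sketch using NLTK. The example strings and the choice of smoothing method (method1) are assumptions for illustration, not taken from the repository's notebooks.

```python
# Minimal smoothed-BLEU sketch using NLTK. The example texts and the
# smoothing method are illustrative assumptions; the repo's notebooks
# may tokenize or smooth differently.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Beta blockers reduce heart rate and blood pressure."
candidate = "Beta blockers lower heart rate and decrease blood pressure."

# sentence_bleu expects a list of tokenized references and one tokenized hypothesis.
smoother = SmoothingFunction().method1
score = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=smoother,
)
print(f"Smoothed BLEU: {score:.3f}")
```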


Features

Feature Description
⚙️ Architecture The project architecture is intentionally simple. Each file within generation_code is an independent script that either calls OpenAI's GPT API or builds out a full pipeline to run inference on models such as Mistral. Because each file is independent of the others, the design is modular, which keeps the project scalable and maintainable if the models are ever updated or further tuned.
📄 Documentation Along with this README, the project contains documentation in the form of a research paper that can be found here.
🔌 Integrations Key integrations include Hugging Face Transformers for model management, NLTK for BLEU score calculation, and PyTorch for model quantization and inference (a hedged loading sketch follows this list). External dependencies are limited to handling JSON and JSONL files with Python's built-in json library.
🧩 Modularity The codebase is structured with a clear separation into two main directories: generation_code for model operations and eval_code for evaluation metrics. The remaining directories contain results and outputs.
📦 Dependencies Dependencies include Python libraries such as transformers, torch, pandas, nltk, and matplotlib for data processing, model management, and visualization. The setup requires handling multiple data formats and integrating several machine learning and NLP models.
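
As referenced in the Integrations row above, quantized inference with Transformers and PyTorch commonly looks like the sketch below. The model ID and the BitsAndBytesConfig settings here are assumptions; the repository's actual configuration lives in generation_code.

```python
# Hedged sketch: loading a Mistral-family model with 4-bit quantization.
# The model ID and quantization settings are assumptions; see
# generation_code/mistral.py for the configuration actually used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed checkpoint

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run computation in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPUs/CPU
)
```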

Repository Structure

└── llmproject/
    ├── LLM Plots cosine_and_bert
    │   ├── bertscore_across_all_models.png
    │   ├── bertscore_across_question_type.png
    │   ├── bertscore_across_specific_field.png
    │   ├── cosine_across_all_models.png
    │   ├── cosine_across_question_type.png
    │   └── cosine_across_specific_field.png
    ├── README.md
    ├── bleu_results
    │   ├── Cleaned Evaluation Bleu Scores.ipynb
    │   ├── Evaluation for LLM Project - Madeline.ipynb
    │   ├── biomstrl_with_qtype_field_smoothed_scores.csv
    │   ├── gpt_with_qtype_field_smoothed_scores.csv
    │   ├── medchatbot_with_qtype_field_smoothed_scores.csv
    │   ├── smoothed_bleu_score_biomstrl.csv
    │   ├── smoothed_bleu_score_gpt.csv
    │   └── smoothed_bleu_score_medicalchatbot.csv
    ├── cosine_bert_with_question_types
    │   ├── biomistral_bert_score_types.csv
    │   ├── biomistral_cosine_similarity_types.csv
    │   ├── gpt_cosine_similarity.csv
    │   ├── gpt_cosine_similarity_types.csv
    │   ├── medical_chatbot_bert_score_types.csv
    │   ├── medical_chatbot_cosine_similarity_types.csv
    │   ├── mistral_bert_score_types.csv
    │   └── mistral_cosine_similarity_types.csv
    ├── cosine_bert_without_question_types
    │   ├── biomistral_bert_score.csv
    │   ├── biomistral_cosine_similarity.csv
    │   ├── gpt_bert_score 2.44.52 PM.csv
    │   ├── gpt_bert_score_types.csv
    │   ├── medical_chatbot_bert_score.csv
    │   ├── medical_chatbot_cosine_similarity.csv
    │   ├── mistral_bert_score.csv
    │   └── mistral_cosine_similarity.csv
    ├── eval_code
    │   ├── Bleu Eval Graphs.ipynb
    │   └── LLM_Evaluation_Cosine_BERT.ipynb
    ├── expertqa.jsonl
    ├── generation_code
    │   ├── biomistral.py
    │   ├── gpt.py
    │   ├── medical_chatbot.ipynb
    │   └── mistral.py
    └── model_outputs
        ├── biomistral.json
        ├── gpt.json
        ├── medical_chatbot.json
        └── mistral.json

Modules

. (repository root)
File Summary
expertqa.jsonl Dataset from the ExpertQA project. Within llmproject, this dataset is parsed to pull the medical-domain questions along with their respective expert-annotated answers, which serve as the ground truth. A hedged parsing sketch follows this entry.
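
The sketch below shows how such a JSONL file can be filtered down to the healthcare questions. The record keys used here ("field", "question", "answer") are hypothetical, since the exact ExpertQA schema is not reproduced in this README.

```python
# Hedged sketch: filtering expertqa.jsonl down to healthcare questions.
# The record keys ("field", "question", "answer") are hypothetical;
# check the actual ExpertQA schema before relying on them.
import json

healthcare_pairs = []
with open("expertqa.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        if record.get("field") == "Healthcare":
            healthcare_pairs.append(
                {"question": record["question"], "answer": record["answer"]}
            )

print(f"Loaded {len(healthcare_pairs)} healthcare QA pairs")
```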
generation_code
File Summary
medical_chatbot.ipynb This notebook reads ExpertQA question-answer pairs from a JSONL file, iterates over only the healthcare questions, generates responses with a Hugging Face transformer model, and stores the results in medical_chatbot.json.
gpt.py This script handles the ingestion of JSON data, queries the GPT model for answers, and logs the responses to an output file.
mistral.py Reads ExpertQA question-answer pairs from a JSONL file, iterates over only the healthcare questions, generates responses with a Hugging Face transformer model, and stores the results in mistral.json.
biomistral.py Runs the same pipeline as mistral.py with the BioMistral model and stores the results in biomistral.json. A hedged sketch of this shared generation loop follows this table.
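
The generation scripts described above share the same shape: iterate over the healthcare questions, generate an answer, and write the results to JSON. A hedged sketch of that loop follows; the pipeline settings, stand-in questions, and output layout are assumptions, not a copy of the repository's scripts.

```python
# Hedged sketch of the shared generation loop: ask a Hugging Face model
# each healthcare question and dump the answers to a JSON file. The
# settings and output layout are assumptions; see the scripts above.
import json
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="BioMistral/BioMistral-7B",  # assumed checkpoint
    max_new_tokens=256,
)

# Stand-in questions; the real scripts read them from expertqa.jsonl.
questions = ["What are the side effects of beta blockers?"]

results = []
for question in questions:
    generated = generator(question)[0]["generated_text"]
    results.append({"question": question, "model_answer": generated})

with open("model_outputs/biomistral.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
```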
bleu_results
File Summary
[Cleaned Evaluation Bleu Scores.ipynb](https://github.com/mshroff123/llmproject/blob/master/bleu_results/Cleaned%20Evaluation%20Bleu%20Scores.ipynb) Implements evaluation for the smoothed BLEU score and includes code for all BLEU-related graphs.
eval_code
File Summary
[Bleu Eval Graphs.ipynb](https://github.com/mshroff123/llmproject/blob/master/eval_code/Bleu%20Eval%20Graphs.ipynb) Implements evaluation for the smoothed BLEU score and includes code for all BLEU-related graphs.
LLM_Evaluation_Cosine_BERT.ipynb Implements evaluation for cosine similarity and BERTScore, and includes code for all related graphs. A hedged metric sketch follows this table.
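
For the cosine-similarity and BERTScore evaluations, the sketch below uses the sentence-transformers and bert-score packages. The embedding model and language setting are assumptions; LLM_Evaluation_Cosine_BERT.ipynb may make different choices.

```python
# Hedged sketch: cosine similarity via sentence-transformers and F1 via
# bert-score. The embedding model and language setting are assumptions.
from sentence_transformers import SentenceTransformer, util
from bert_score import score as bert_score

model_answer = "Beta blockers lower heart rate and decrease blood pressure."
expert_answer = "Beta blockers reduce heart rate and blood pressure."

# Cosine similarity between sentence embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
emb = embedder.encode([model_answer, expert_answer], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

# BERTScore precision/recall/F1 over the same pair.
P, R, F1 = bert_score([model_answer], [expert_answer], lang="en")

print(f"Cosine similarity: {cosine:.3f}, BERTScore F1: {F1.item():.3f}")
```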

Contributing

Contributions are welcome! Here is how you can contribute:

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a Git client.
    git clone https://github.com/mshroff123/llmproject
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.

Once your PR is reviewed and approved, it will be merged into the main branch.


Acknowledgments

  • MistralAI for their open-source model.
  • BioMistral team for adapting the Mistral-7B model to the biomedical domain.
  • OpenAI for their GPT-3.5 model.
  • Hugging Face for their Transformers library.

References

Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong (2022). QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization. Preprint. arXiv. https://doi.org/10.48550/arXiv.2112.08542.

Chaitanya Malaviya, Subin Lee, Sihao Chen, Elizabeth Sieber, Mark Yatskar, Dan Roth (2024). ExpertQA: Expert-Curated Questions and Attributed Answers. Preprint. arXiv. https://doi.org/10.48550/arXiv.2309.07852.

Debadutta Dash, Rahul Thapa, Juan M. Banda, Akshay Swaminathan, Morgan Cheatham, Mehr Kashyap, Nikesh Kotecha, Jonathan H. Chen, Saurabh Gombar, Lance Downing, Rachel Pedreira, Ethan Goh, Angel Arnaout, Garret Kenn Morris, Honor Magon, Matthew P Lungren, Eric Horvitz, Nigam H. Shah (2023). Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery. Preprint. arXiv. https://doi.org/10.48550/arXiv.2304.13714.

Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G. Nestor, Ali Soroush, Pierre A. Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F. Rousseau, Chunhua Weng, Yifan Peng (2023). Evaluating Large Language Models on Medical Evidence Summarization. npj Digit. Med. 6, 158. https://doi.org/10.1038/s41746-023-00896-7.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, Yoav Artzi (2020). BERTScore: Evaluating Text Generation With BERT. Preprint. arXiv. https://doi.org/10.48550/arXiv.1904.09675.

