LLM Finetuning with llama-factory and Grobid

This repository contains code and resources for finetuning large language models (LLMs) using the llama-factory library and the Grobid tool for extracting structured data from scientific publications.

Overview

The goal of this project is to finetune LLMs such as LLaMA and Mistral on a corpus of scientific publications, improving their performance on scientific-literature tasks such as summarization, question answering, and information extraction.

The workflow involves the following steps:

Data Collection: Gather a dataset of scientific publications in PDF format.
Data Preprocessing: Use Grobid to extract structured data (title, abstract, body text, etc.) from the PDF files.
QA Extraction: Use a local model to generate question-answer (QA) pairs from the extracted text.
Model Finetuning: Use llama-factory to finetune a base LLM on the preprocessed scientific corpus.
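
As an illustration of the preprocessing step, here is a minimal sketch of pulling the title, abstract, and body paragraphs out of the TEI XML that Grobid returns. It assumes the usual TEI layout (title under titleStmt, abstract under profileDesc, paragraphs under text/body); the function names parse_tei and _text are hypothetical and are not taken from generate_qa.py.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by Grobid output documents.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def _text(elem):
    """Return the element's full text with whitespace collapsed ('' if missing)."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def parse_tei(tei_xml: str) -> dict:
    """Extract title, abstract, and body paragraphs from a TEI XML string.

    Hypothetical helper: assumes the standard TEI element layout.
    """
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    # Walk every <p> below <body>, whatever its nesting depth.
    paragraphs = body.iter(f"{{{TEI_URI}}}p") if body is not None else []
    return {
        "title": _text(title),
        "abstract": _text(abstract),
        "body": [_text(p) for p in paragraphs],
    }
```

The resulting dictionary keeps paragraphs as a list, so later steps can group them into chunks sized for the QA-generating model.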

Requirements

Python 3.8+
llama-factory
Grobid

Installation

Clone this repository:

git clone https://github.com/msalvaris/llm_finetuning
cd llm_finetuning

Install the required Python packages:

pip install -r requirements.txt

Run the Grobid docker container: https://grobid.readthedocs.io/en/latest/Install-Grobid/

make run_grobid

Build and run the llama-factory Docker container:

make build_lfactory
make run_lfactory

Run the processing script to generate the QA pairs:

python generate_qa.py
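
For orientation, the sketch below shows one way generated QA pairs could be turned into Alpaca-style records (instruction, input, output), a dataset layout that llama-factory can read. Whether generate_qa.py emits exactly this format is an assumption; to_alpaca_records and write_dataset are hypothetical helper names.

```python
import json
from pathlib import Path


def to_alpaca_records(qa_pairs, context=""):
    """Convert (question, answer) tuples into Alpaca-style dicts.

    Hypothetical helper: pairs with an empty question or answer are dropped.
    """
    records = []
    for question, answer in qa_pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue  # skip incomplete pairs produced by the model
        records.append(
            {"instruction": question, "input": context, "output": answer}
        )
    return records


def write_dataset(records, path):
    """Write the records as a single JSON array (UTF-8)."""
    Path(path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

A call such as write_dataset(to_alpaca_records(pairs), "scientific_qa.json") would produce one file per run; the file name here is only an example.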

Model Finetuning: Use llama-factory to finetune a base LLM on the data preprocessed from the PDFs:

make fine_tune
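
llama-factory looks up custom datasets through entries in its dataset_info.json file. The sketch below builds such an entry for an Alpaca-style file; the dataset name and file name are placeholders, and how this repository's Makefile actually wires the dataset into the fine_tune target is not shown here and may differ.

```python
import json


def dataset_info_entry(name, file_name):
    """Build a dataset_info.json entry for an Alpaca-style JSON file.

    Hypothetical helper: `name` and `file_name` are placeholders.
    """
    return {
        name: {
            "file_name": file_name,
            # Map llama-factory's column roles to the Alpaca field names.
            "columns": {
                "prompt": "instruction",
                "query": "input",
                "response": "output",
            },
        }
    }


if __name__ == "__main__":
    entry = dataset_info_entry("scientific_qa", "scientific_qa.json")
    print(json.dumps(entry, indent=2))
```

The printed entry would be merged into the existing dataset_info.json, and the dataset name is then referenced in the training configuration.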

Evaluation: Evaluate the finetuned model's performance:

make cli

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or bug fixes.

License

This project is licensed under the MIT License.

Acknowledgments

The llama-factory library: https://github.com/hiyouga/LLaMA-Factory
The Grobid tool: https://github.com/kermitt2/grobid
