36 changes: 36 additions & 0 deletions advanced_tutorials/llm_pdfs/README.md
@@ -0,0 +1,36 @@
# ⚙️ Index Private PDFs for RAG and create Fine-Tuning Datasets from them

This project reads the PDF files in a Google Drive folder that you provide, indexes them as vector embeddings in Hopsworks for retrieval-augmented generation (RAG), and creates an instruction dataset for fine-tuning using a teacher model (GPT).


![Hopsworks Architecture for Private PDFs Indexed for LLMs](../../images/llm-pdfs-architecture.gif)

## 📖 Feature Pipeline
The Feature Pipeline does the following:

* Downloads any new PDFs from Google Drive.
* Extracts chunks of text from the PDFs and stores them in a Feature Group in Hopsworks.
* Uses GPT to generate an instruction set for fine-tuning a foundation LLM and stores it as a Feature Group in Hopsworks.
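
The chunking step can be pictured with a minimal sketch. This is not the project's actual implementation: the function name `chunk_text`, the word-based splitting, and the default sizes are all assumptions chosen for illustration. It splits extracted text into overlapping windows so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `chunk_size` words.

    Hypothetical helper for illustration; the sizes are arbitrary defaults.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each returned string would then be embedded and written as one row of the Feature Group; the embedding model and the Hopsworks write are outside the scope of this sketch.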

## 🏃🏻‍♂️ Training Pipeline
The Training Pipeline does the following:

* Uses the instruction dataset and LoRA to fine-tune the open-source LLM (Mistral-7B-Instruct-v0.2 by default).
* Saves the fine-tuned model to the Hopsworks Model Registry.
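
To see why LoRA makes fine-tuning a 7B-parameter model practical, it helps to count trainable parameters. LoRA freezes a weight matrix of shape `d_out × d_in` and trains only a low-rank update made of two matrices of shapes `d_out × r` and `r × d_in`. The sketch below does that arithmetic; the function name and the dimensions in the usage note are illustrative assumptions, not values taken from this project's training configuration.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    full: parameters updated by ordinary fine-tuning (d_out * d_in).
    lora: parameters in the low-rank factors B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora
```

For example, `lora_param_counts(4096, 4096, 8)` compares about 16.8 million parameters for the full matrix with 65,536 for the rank-8 adapter, which is under half a percent of the original.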

## 🚀 Inference Pipeline
* A chatbot written in Streamlit that answers questions about the PDFs you uploaded using RAG and an embedded LLM.
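
The retrieval half of RAG can be sketched in a few lines. In the project, similarity search runs against the embeddings indexed in Hopsworks; the plain-Python version below is only a stand-in to show the idea, and every name in it (`cosine_similarity`, `retrieve`, `build_prompt`, the prompt wording) is hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float],
             indexed_chunks: list[tuple[str, list[float]]],
             k: int = 2) -> list[str]:
    """Return the text of the k chunks whose embeddings are closest to the query."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the retrieved chunks and the user question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The chatbot would embed the user's question with the same model used at indexing time, call something like `retrieve`, and pass the output of `build_prompt` to the fine-tuned LLM.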

## 🕵🏻‍♂️ Google Drive Credentials Creation

To create your Google Drive credentials, please follow the steps outlined in this guide: [Google Drive API Quickstart with Python](https://developers.google.com/drive/api/quickstart/python). This guide will walk you through setting up your project and downloading the necessary credentials files.

After completing the setup, you will have two files: `credentials.json` and `client_secret.json`. These are your authentication files from your Google Cloud account.

Next, integrate these files into your project:

1. Create a directory named `credentials` at the root of your forked repository.

2. Place both `credentials.json` and `client_secret.json` inside this `credentials` directory.

Now you are ready to download your PDFs from Google Drive!
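
Before running the pipeline, it can be useful to confirm that both files are where the steps above put them. The helper below is a small sketch, not part of the project; the function name `missing_credentials` is hypothetical, while the directory and file names come from the steps above.

```python
from pathlib import Path

# File names as listed in the setup steps above.
REQUIRED_FILES = ("credentials.json", "client_secret.json")


def missing_credentials(repo_root: str) -> list[str]:
    """Return the required credential files that are absent from <repo_root>/credentials."""
    cred_dir = Path(repo_root) / "credentials"
    return [name for name in REQUIRED_FILES if not (cred_dir / name).is_file()]
```

Calling `missing_credentials(".")` from the repository root returns an empty list when both files are in place, and otherwise names the files that still need to be added.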
Binary file added images/llm-pdfs-architecture.gif