
Chat with your docs!

A RAG (Retrieval Augmented Generation) setup for further exploration of chatting with company documents



How to use this repo

NB: this repo has been tested on a Windows platform

Preparation

  1. Clone this repo to a folder of your choice
  2. Create a subfolder vector_stores in the root folder of the cloned repo
  3. In the root folder, create a file named ".env"
  4. When using the OpenAI API, enter your OpenAI API key in the first line of this file (a combined example .env file is shown after this list):
    OPENAI_API_KEY="sk-....."
  5. Save and close the .env file
  6. When using Azure OpenAI Services, enter the following variables in the .env file:
  • AZURE_OPENAI_API_KEY
  • AZURE_OPENAI_ENDPOINT
  • AZURE_OPENAI_API_VERSION (e.g. "2024-02-01")
  • AZURE_OPENAI_LLM_DEPLOYMENT_NAME
  • AZURE_OPENAI_EMB_DEPLOYMENT_NAME
    The values of these variables can be found in your Azure OpenAI Services subscription
  7. In case you want to use one of the open-source model APIs available on Huggingface, enter your Huggingface API key in the ".env" file:
    HUGGINGFACEHUB_API_TOKEN="hf_....."
  • If you don't have a Huggingface API key yet, you can register at https://huggingface.co/join
  • Once registered and logged in, you can find your API key in your Huggingface profile settings
  8. This repository also allows for using one of the Ollama open-source models on-premise. You can do this by following the steps below:
  • In Windows, go to "Turn Windows features on or off" and check the features "Virtual Machine Platform" and "Windows Subsystem for Linux"
  • Download and install the Ubuntu Windows Subsystem for Linux (WSL) by opening a terminal window and typing wsl --install
  • Start WSL by opening a terminal and typing wsl, then install Ollama with curl -fsSL https://ollama.com/install.sh | sh
  • When you decide to use a local LLM and/or embedding model, make sure that the Ollama server is running by:
    • opening a terminal and typing wsl
    • starting the Ollama server with ollama serve. This makes any downloaded models accessible through the Ollama API
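
Putting steps 4 to 8 together, a filled-in .env file could look like the example below. The values are placeholders; include only the variables for the provider(s) you actually use:

```
OPENAI_API_KEY="sk-....."

AZURE_OPENAI_API_KEY="....."
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"
AZURE_OPENAI_API_VERSION="2024-02-01"
AZURE_OPENAI_LLM_DEPLOYMENT_NAME="....."
AZURE_OPENAI_EMB_DEPLOYMENT_NAME="....."

HUGGINGFACEHUB_API_TOKEN="hf_....."
```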

Conda virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create the conda environment with conda env create -f appl-docchat.yml
    NB: The name of the environment is appl-docchat by default. It can be changed to a name of your choice in the first line of the file appl-docchat.yml
  3. Activate this environment with conda activate appl-docchat

Pip virtual environment setup

  1. Open an Anaconda prompt or other command prompt
  2. Go to the root folder of the project and create a virtual environment with python -m venv venv
    This will create a basic virtual environment folder named venv in the root of your project folder
    NB: The environment folder is named venv here; it can be changed to a name of your choice
  3. Activate this environment with venv\Scripts\activate
  4. All required packages can now be installed with pip install -r requirements.txt

Setting parameters

The file settings_template.py contains all parameters that can be used and needs to be copied to settings.py. In settings.py, fill in the parameter values you want to use for your use case. Examples and restrictions for the parameter values are given in the comment lines.
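
For orientation, a filled-in settings.py might contain entries along these lines. The parameter names below are purely illustrative; the authoritative names and allowed values are documented in settings_template.py itself:

```python
# settings.py -- hypothetical excerpt, for illustration only.
# The real parameter names and allowed values are documented in the
# comment lines of settings_template.py.
LLM_PROVIDER = "openai"        # e.g. "openai", "azure", "huggingface", "ollama"
EMBEDDING_PROVIDER = "openai"
TEXT_SPLITTER = "NLTKTextSplitter"
CHUNK_SIZE = 1000              # characters per chunk
CHUNK_OVERLAP = 200
```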

nltk.tokenize.punkt module

When the NLTKTextSplitter is used for chunking the documents, the punkt module of NLTK needs to be downloaded first.
This can be done in the activated environment by starting an interactive Python session (type python), then running import nltk followed by nltk.download('punkt').
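
The same download as a ready-to-run snippet, using NLTK's standard downloader:

```python
# One-off download of the NLTK 'punkt' tokenizer models used by NLTKTextSplitter
import nltk
nltk.download('punkt')
```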

Ingesting documents

The file ingest.py can be used to vectorize all documents in a chosen folder and store the vectors and texts in a vector database for later use.
Execution is done in the activated virtual environment with python ingest.py
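
The actual ingestion pipeline is configured through settings.py. Purely as a sketch of what the ingest step amounts to, a hand-rolled equivalent with LangChain could look like this; the loader, splitter, embedding and vector store choices here are assumptions, not the repo's defaults:

```python
# Minimal RAG ingestion sketch: load -> chunk -> embed -> persist.
# Assumes the langchain, langchain-community, langchain-openai and chromadb
# packages are installed and OPENAI_API_KEY is set in the environment.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import NLTKTextSplitter  # requires NLTK's punkt
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

docs = PyPDFLoader("docs/my_folder/report.pdf").load()   # hypothetical document
chunks = NLTKTextSplitter(chunk_size=1000).split_documents(docs)
Chroma.from_documents(chunks, OpenAIEmbeddings(),
                      persist_directory="vector_stores/my_folder")
```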

Querying documents

The file query.py can be used to query any folder with documents, provided that the associated vector database exists.
Execution is done in the activated virtual environment with python query.py
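
Again purely as a sketch (not the repo's actual implementation), querying the store persisted in the ingest sketch above could look like:

```python
# Minimal retrieval-then-answer sketch against the persisted vector store.
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

store = Chroma(persist_directory="vector_stores/my_folder",
               embedding_function=OpenAIEmbeddings())
question = "What does the report conclude?"
hits = store.similarity_search(question, k=4)             # top-4 relevant chunks
context = "\n\n".join(doc.page_content for doc in hits)
answer = ChatOpenAI().invoke(
    f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```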

Querying multiple documents with multiple questions in batch

The file review.py uses the standard question-answer technique but allows you to ask multiple questions to each document in a folder.

  • First, create a subfolder named review in the folder docs/XXXX, with XXXX the name of your document folder
  • Secondly, add a file named questions.txt to the review folder with all your questions. The file expects a header line with the column names Question Type and Question. On the following lines, add your question types ('Initial', or 'Follow Up' when the question refers to the previous question) and questions, tab-separated. You can find an example in the docs/CAP_nis folder, and an illustration below this list.
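
For illustration, a minimal questions.txt could look like this (the two columns are separated by a tab character; the questions are made up):

```
Question Type	Question
Initial	What is the main topic of the document?
Follow Up	In which year was it published?
```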

Execution is done in the activated virtual environment with python review.py. All results, including the answers and the sources used to create the answers, are stored in a file result.csv in the review subfolder.

Ingesting and querying documents through a Streamlit User Interface

The functionalities described above can also be used through a User Interface.
In the activated virtual environment, the UI can be started with streamlit run streamlit_app.py
When this command is used, a browser session will open automatically

Summarizing documents

The file summarize.py can be used to summarize every file individually in a document folder. Three options for summarization are implemented:

  • Map Reduce: this will create a summary in a fast way. The time (and quality) to create a summary depends on the number of centroids chosen, which is a parameter in settings.py (see the sketch after this list)
  • Refine: this will create a more refined summary, but can take a long time to run, especially for larger documents
  • Hybrid: combines both methods above
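
To make the centroid idea concrete, here is a rough, self-contained sketch of centroid-based map-reduce summarization. It is not the repo's implementation, and the model and library choices are assumptions:

```python
# Centroid-based map-reduce summarization sketch.
# Assumes numpy, scikit-learn and langchain-openai are installed and
# OPENAI_API_KEY is set; the model name below is an assumption.
import numpy as np
from sklearn.cluster import KMeans
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

def summarize_map_reduce(chunks: list[str], n_centroids: int = 5) -> str:
    # Map: embed every chunk and cluster the embeddings.
    vectors = np.array(OpenAIEmbeddings().embed_documents(chunks))
    kmeans = KMeans(n_clusters=n_centroids, n_init="auto").fit(vectors)

    # Keep only the chunk closest to each cluster centroid as a representative;
    # more centroids -> more representatives -> slower but richer summary.
    reps = [chunks[int(np.argmin(np.linalg.norm(vectors - c, axis=1)))]
            for c in kmeans.cluster_centers_]

    # Reduce: summarize the representative chunks in a single LLM call.
    prompt = "Summarize the following excerpts:\n\n" + "\n\n".join(reps)
    return ChatOpenAI(model="gpt-4o-mini").invoke(prompt).content
```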

Execution is done in the activated virtual environment with python summarize.py. The user will be prompted for the summarization method, either "Map_Reduce", "Refine" or "Hybrid"

Evaluation of Question Answer results

The file evaluate.py can be used to evaluate the generated answers for a list of questions, provided that the file eval.json exists, containing not only the list of questions but also the related list of desired answers (ground truth).
Evaluation is done at folder level in the activated virtual environment with python evaluate.py
It is also possible to run an evaluation over all folders with python evaluate_all.py
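
The exact schema of eval.json is defined by the repo. Purely as a hypothetical illustration of the idea (the field names are made up), it pairs each question with its desired answer:

```json
{
  "question": ["What is the main topic?", "In which year was it published?"],
  "ground_truth": ["The document discusses ...", "It was published in ..."]
}
```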

Monitoring the evaluation results through a Streamlit User Interface

All evaluation results can be viewed by using a dedicated User Interface.
In the activated virtual environment, this evaluation UI can be started with streamlit run streamlit_evaluate.py
When this command is used, a browser session will open automatically

