Simple document summarization webapp powered by LLMs. It includes:
- LLM deployment using HuggingFace Text Generation Inference;
- Streamlit for the WebApp interface (UI);
- LangChain as the backend framework.
This repo can be used as a base for other applications that use LLMs.
An Ubuntu or Windows machine with the tools used below installed (Conda, and Docker with NVIDIA GPU support). You could use WSL2 on a Windows machine as an alternative to an Ubuntu machine.
Text Generation Inference (TGI) is a Rust, Python, and gRPC server for serving text-generation models, featuring optimized architectures, tensor parallelism, and production-ready capabilities. HuggingFace uses it for several of its own services. For more info, visit the official GitHub repo.
Very quickly:
- Clone this repo and navigate inside it:
git clone https://github.com/pandego/LLMCondense.git
cd LLMCondense
- Create your environment and install the necessary libraries using the requirements.txt file:
conda create -n llmcondense python=3.11.4
conda activate llmcondense
pip install -r requirements.txt
- Define a shared volume where your LLM will be downloaded to. This avoids re-downloading the weights on every run:
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
- Choose the LLM you will be working with.
model=tiiuae/falcon-7b-instruct
- If you are thinking of using a model like llama2, you will also need to include a HuggingFace token:
token=hf_your_token_here
- Run the following command to pull and run the container; your model will be downloaded and served:
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model
Note 1: If you want to use more than one GPU, you can add -e CUDA_VISIBLE_DEVICES=0,1 to shard the model across 2 processes.
Note 2: You can add -e NCCL_P2P_DISABLE=1 if you get a "Some NCCL operations have failed or timed out" error while loading the model, as described in Issue 654 of the Text Generation Inference repo.
Note 3: You can decrease --max-batch-prefill-tokens to reduce memory needs.
Note 4: Pay special attention to the arguments --max-input-length and --max-total-tokens.
- For example:
docker run --gpus all --shm-size 1g -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --max-batch-prefill-tokens 2048
or
docker run --gpus all --shm-size 1g -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --max-input-length 2048 --max-total-tokens 4096
And that's it, your LLM should be served!
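Optionally, before moving on, you can confirm the server is responding from Python. Below is a minimal sketch that queries TGI's /info endpoint (part of the same API exposed at /docs below); it assumes the requests package is installed:

```python
import requests

# Base URL of the TGI server started above (assumed to be listening on localhost:8080).
TGI_URL = "http://127.0.0.1:8080"

# /info returns metadata about the loaded model (model id, max supported tokens, etc.).
info = requests.get(f"{TGI_URL}/info", timeout=10)
info.raise_for_status()
print(info.json())
```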
- To test it, you can simply navigate to http://127.0.0.1:8080/docs and "Try it out" using one of the two methods - /generate or /generate_stream:
  - /generate will output the response in a "one-shot" fashion.
  - /generate_stream will output the response in a "word-by-word" fashion.
- Alternatively, you can send a POST request and check out the server's response for each method:
# /generate
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
# /generate_stream
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
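If you prefer Python over curl, the same two endpoints can also be reached through the text-generation client library that accompanies TGI (pip install text-generation). A minimal sketch, assuming the server from the previous step is running on port 8080:

```python
from text_generation import Client

# Point the client at the TGI server started above.
client = Client("http://127.0.0.1:8080")

# /generate: returns the response in a "one-shot" fashion.
response = client.generate("What is Deep Learning?", max_new_tokens=20)
print(response.generated_text)

# /generate_stream: yields the response token by token.
for chunk in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not chunk.token.special:
        print(chunk.token.text, end="", flush=True)
print()
```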
- This part is straightforward as well; simply run the next command from your terminal:
streamlit run main.py
- Navigate to the Streamlit WebApp at http://localhost:8501.
That's it! 🥳 You can now start using LLM Condense to summarize your texts!
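For reference, a bare-bones Streamlit front end along these lines could look like the sketch below. This is not the repo's actual main.py; the prompt wording, token limit, and the use of the text-generation client are assumptions:

```python
import streamlit as st
from text_generation import Client  # TGI Python client, assumed to be installed

# TGI server started in the previous steps.
client = Client("http://127.0.0.1:8080")

st.title("LLM Condense")
text = st.text_area("Paste the text you want to summarize:")

if st.button("Summarize") and text:
    # Hypothetical prompt; the real app may use a different template.
    prompt = f"Summarize the following text:\n\n{text}\n\nSummary:"
    with st.spinner("Summarizing..."):
        response = client.generate(prompt, max_new_tokens=256)
    st.write(response.generated_text)
```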
- Include LangChain framework for (see the sketch after this list):
  - Prompt templating
  - Document loading
  - Chain handling
- Docker Compose
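As a rough illustration of the LangChain item above, the snippet below sketches how prompt templating, document loading, and chain handling could be wired to the TGI endpoint via LangChain's HuggingFaceTextGenInference wrapper. The file name, chunk sizes, and prompt wording are assumptions, not this repo's implementation:

```python
from langchain.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# LLM served by the TGI container started earlier.
llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:8080",
    max_new_tokens=256,
    temperature=0.1,
)

# Prompt templating: how each chunk should be summarized.
prompt = PromptTemplate(
    input_variables=["text"],
    template="Write a concise summary of the following text:\n\n{text}\n\nSummary:",
)

# Document loading and splitting (the file name is hypothetical).
docs = TextLoader("example_document.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200).split_documents(docs)

# Chain handling: a map-reduce summarization chain over the chunks.
chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt)
print(chain.run(chunks))
```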
🎊 Et voilà! 🎊