LLMCondense - A Simple LLM Document Summarizer

Short description

Simple document summarization webapp using the power of LLMs. This repo can also be used as a base for other applications that build on LLMs.

Pre-requisites

Ubuntu or Windows with the following installed (everything used in the steps below):

  • Git
  • Conda (e.g. Miniconda)
  • Docker with GPU support (NVIDIA drivers and the NVIDIA Container Toolkit)

You could use WSL2 on a Windows machine as an alternative to an Ubuntu machine.

Set up your machine (local or remote server)

1. Serve your LLM using HuggingFace 🤗 - Text Generation Inference

Text Generation Inference is a Rust, Python, and gRPC server designed for text generation inference, featuring optimized architectures, tensor parallelism, and production-ready capabilities. This is used by HuggingFace for various services. For more info, visit the official GitHub repo.

Very quickly:

  • Clone this repo and navigate inside it:
git clone https://github.com/pandego/LLMCondense.git
cd LLMCondense
  • Create your environment and install the necessary libraries in it using the requirements.txt file:
conda create -n llmcondense python=3.11.4
conda activate llmcondense
pip install -r requirements.txt
  • Define a shared volume where the model weights will be downloaded, so they are not re-downloaded on every run:
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
  • Choose the LLM you will be working with:
model=tiiuae/falcon-7b-instruct
  • If you plan to use a gated model such as Llama 2, you will also need to provide a HuggingFace token:
token=hf_your_token_here
  • Run the following command to pull and run the container; your model will be downloaded and served:
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model

Note 1: If you want to use more than one GPU, you can add -e CUDA_VISIBLE_DEVICES=0,1 in order to shard the model across two GPUs.

Note 2: You can add -e NCCL_P2P_DISABLE=1 in case you get a "Some NCCL operations have failed or timed out" error while loading the model, as described in Issue 654 of the Text Generation Inference repo.

Note 3: You can decrease --max-batch-prefill-tokens in order to reduce memory requirements.

Note 4: Pay special attention to the arguments --max-input-length (the maximum number of prompt tokens) and --max-total-tokens (the maximum of prompt plus generated tokens); set them according to the length of the documents you intend to summarize.

  • For example:
docker run --gpus all --shm-size 1g -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --max-batch-prefill-tokens 2048

or

docker run --gpus all --shm-size 1g -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_P2P_DISABLE=1 -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.0.3 --model-id $model --max-input-length 2048 --max-total-tokens 4096

And that's it, your LLM should be served!

2. Test your served LLM

  • To test it, you can simply navigate to http://127.0.0.1:8080/docs and "Try it out" using one of the two endpoints - /generate or /generate_stream:
    • /generate will output the response in a "one-shot" fashion.
    • /generate_stream will output the response in a "token-by-token" (streaming) fashion.
  • Alternatively, you can send a POST request and check the server's response for each endpoint:
# /generate 
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

# /generate_stream
curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
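
The same two endpoints can also be called from Python. Below is a minimal sketch using the requests library (the library choice is an assumption; any HTTP client works):

import requests

payload = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20},
}

# /generate returns the whole completion in a single JSON response
resp = requests.post("http://127.0.0.1:8080/generate", json=payload)
print(resp.json()["generated_text"])

# /generate_stream returns server-sent events, one token at a time
with requests.post("http://127.0.0.1:8080/generate_stream", json=payload, stream=True) as stream:
    for line in stream.iter_lines():
        if line:
            print(line.decode("utf-8"))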

3. Launch the Streamlit App

  • This part is straightforward as well; simply run the following command from your terminal:
streamlit run main.py

That's it! 🥳 You can now start using LLM Condense to summarize your texts!
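
If you are curious how such an app can talk to the TGI server, here is a hypothetical, minimal Streamlit sketch; the actual implementation lives in main.py, and the prompt wording and parameters below are only illustrative:

import requests
import streamlit as st

TGI_URL = "http://127.0.0.1:8080/generate"  # the server started in step 1

st.title("LLMCondense - Document Summarizer")
text = st.text_area("Paste the text you want to summarize")

if st.button("Summarize") and text:
    payload = {
        "inputs": f"Summarize the following text:\n\n{text}\n\nSummary:",
        "parameters": {"max_new_tokens": 256, "temperature": 0.1},
    }
    resp = requests.post(TGI_URL, json=payload)
    st.write(resp.json()["generated_text"])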

TODOs

  • Include the LangChain framework (see the sketch after this list) for:
    • Prompt templating
    • Document loading
    • Chain handling
  • Docker Compose
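
A rough, untested sketch of what the LangChain integration could look like, assuming the HuggingFaceTextGenInference wrapper and a "stuff" summarization chain (document.txt is a placeholder path):

from langchain.llms import HuggingFaceTextGenInference
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader

# Point the LLM wrapper at the TGI server started in step 1
llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:8080",
    max_new_tokens=256,
    temperature=0.1,
)

# Document loading
docs = TextLoader("document.txt").load()

# Prompt templating
prompt = PromptTemplate(
    input_variables=["text"],
    template="Write a concise summary of the following text:\n\n{text}\n\nSummary:",
)

# Chain handling
chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)
print(chain.run(docs))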

🎊 Et voilà! 🎊
