## Install vLLM + Haystack

- We install vLLM with pip ([docs](https://docs.vllm.ai/en/latest/getting_started/installation.html)), together with Haystack (`haystack-ai`).
- For production use cases, vLLM offers other deployment options, including Docker ([docs](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)).

In [7]:
!gdown https://drive.google.com/uc?id=1PKltXvQMnncerz_wvy2s_IMo7--GzNvF

Downloading...
From (original): https://drive.google.com/uc?id=1PKltXvQMnncerz_wvy2s_IMo7--GzNvF
From (redirected): https://drive.google.com/uc?id=1PKltXvQMnncerz_wvy2s_IMo7--GzNvF&confirm=t&uuid=7d350cd0-c941-404a-a654-a306b950a440
To: /content/final_model.zip
100% 1.99G/1.99G [00:17<00:00, 111MB/s]


In [8]:
!unzip  final_model.zip

Archive:  final_model.zip
   creating: final_weights_new/
  inflating: final_weights_new/tokenizer.json  
  inflating: final_weights_new/tokenizer.model  
  inflating: final_weights_new/model.safetensors  
  inflating: final_weights_new/generation_config.json  
  inflating: final_weights_new/config.json  
  inflating: final_weights_new/special_tokens_map.json  
  inflating: final_weights_new/tokenizer_config.json  
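
Before pointing the server at the extracted folder, it can help to confirm that the archive unpacked completely. The sketch below checks for the file names shown in the unzip output; the helper name `missing_model_files` is hypothetical, and treating every listed file as required is an assumption made for illustration.

```python
from pathlib import Path

# File names taken from the unzip output above. Treating all of them as
# required is an assumption of this sketch.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("/content/final_weights_new")
    print("missing files:", missing or "none")
```

An empty list means every file from the archive listing is present.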


In [9]:
!pip install vllm haystack-ai

Collecting vllm
  Downloading vllm-0.5.0.post1-cp310-cp310-manylinux1_x86_64.whl (130.2 MB)
Collecting haystack-ai
  Downloading haystack_ai-2.2.3-py3-none-any.whl (345 kB)
Collecting ninja (from vllm)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Collecting fastapi (from vllm)
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting openai (from vllm)
  Downloading openai-1.35.3-py3-none-any.whl (327 kB)

In [10]:
# we prepend "nohup" and append "&" so the server keeps running in the background of the Colab cell
! nohup python -m vllm.entrypoints.openai.api_server \
                  --model /content/final_weights_new \
                  --dtype auto \
                  --max-model-len 2048 \
                  > vllm.log &

nohup: redirecting stderr to stdout
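
The same background launch could also be done from Python rather than with `nohup` and `&`. The following is a minimal sketch, not the notebook's method: the helper names `build_server_command` and `start_server` are hypothetical, while the module path and the three flags are the ones used in the shell command above.

```python
import subprocess
import sys


def build_server_command(model_path, max_model_len=2048, dtype="auto"):
    """Build the argument list for vLLM's OpenAI-compatible server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", str(model_path),
        "--dtype", dtype,
        "--max-model-len", str(max_model_len),
    ]


def start_server(command, log_path="vllm.log"):
    """Start command in the background, writing stdout and stderr to log_path."""
    log_file = open(log_path, "w")
    # Popen returns immediately, so the notebook cell does not block.
    return subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)


# Usage (requires vllm to be installed):
# server = start_server(build_server_command("/content/final_weights_new"))
# ...
# server.terminate()  # stop the server when finished
```

Keeping the `Popen` handle around makes it possible to stop the server explicitly, which the shell version does not offer directly.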


In [11]:
# we check the logs until the server has been started correctly
!while ! grep -q "Application startup complete" vllm.log; do tail -n 1 vllm.log; sleep 5; done

INFO 06-22 09:19:13 selector.py:51] Using XFormers backend.
INFO 06-22 09:19:13 selector.py:51] Using XFormers backend.
INFO 06-22 09:19:22 model_runner.py:160] Loading model weights took 2.0512 GB
INFO 06-22 09:19:23 gpu_executor.py:83] # GPU blocks: 30294, # CPU blocks: 11915
INFO 06-22 09:19:28 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-22 09:19:28 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-22 09:19:28 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utiliz
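
The shell loop above keeps polling forever if the server crashes before printing its startup message. A variant with a timeout can be sketched in Python as follows; the function name `wait_for_log_line` and the default timeout are assumptions chosen for illustration, while the log file name and the marker string come from the cell above.

```python
import time
from pathlib import Path


def wait_for_log_line(log_path, marker, timeout=300.0, poll_interval=5.0):
    """Poll log_path until marker appears; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    path = Path(log_path)
    while True:
        # The log file may not exist yet right after the server is launched.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Usage:
# ready = wait_for_log_line("vllm.log", "Application startup complete")
# if not ready:
#     print("server did not start in time; inspect vllm.log")
```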

In [12]:
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
import string
import random

# the server only needs a placeholder API key, so we generate a random
# 20-character string of uppercase letters and digits
N = 20
res = ''.join(random.choices(string.ascii_uppercase + string.digits, k=N))

generator = OpenAIChatGenerator(
    api_key=Secret.from_token(res),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="/content/final_weights_new",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 1024},
)
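
The `model` argument has to match the name under which the server exposes the model, which here is the path passed to `--model`. One way to check this is to query the server's OpenAI-style `/models` endpoint. The sketch below uses only the standard library; the helper names `parse_model_ids` and `list_served_models` are hypothetical, and the response shape assumed is the usual OpenAI list format with a `data` array of objects carrying an `id`.

```python
import json
from urllib.request import urlopen


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(api_base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the server's /models endpoint and return the served model ids."""
    with urlopen(f"{api_base_url}/models", timeout=timeout) as response:
        return parse_model_ids(json.load(response))


# Usage (requires the server from the previous cells to be running):
# print(list_served_models())
```

If the id returned by the server differs from the `model` value given to the generator, requests are likely to be rejected.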


In [14]:
messages = []

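while True:
  msg = input("Enter your message or Q to exit\n🧑 ")
  if msg == "Q":
    break
  # add the user turn to the history, then send the whole conversation to the model
  messages.append(ChatMessage.from_user(msg))
  response = generator.run(messages=messages)
  assistant_resp = response['replies'][0]
  print("🤖 " + assistant_resp.content)
  # keep the assistant reply so the next turn sees the full context
  messages.append(assistant_resp)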

Enter your message or Q to exit
🧑 is alias mental?
🤖 Yes, "alias mental" is a real word. It is also known as a synonym. A synonym is a word that means the same thing as the original word, but in a different way. In this case, "alias" is a synonym for "mental." They mean exactly the same thing. There is no difference in meaning between the two words. Both are valid ways of expressing the same idea. You can use either one depending on the context in which you are speaking or writing. I hope this information helps! If you have any other questions, please don't hesitate to ask. I am always happy to help. Best Wishes,

The Virtual Assistant.

Note: This information is for educational purposes only. It is not intended as a substitute for professional advice. If you have a medical problem or concern, please seek professional help. If you are experiencing symptoms of depression, please refer to the National Suicide Prevention Lifeline at 1-800-273-8255 for resources in your area. The number ma
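
The chat loop sends the entire history on every turn, while the server was started with `--max-model-len 2048` and each reply may use up to 1024 tokens, so a long conversation will eventually no longer fit. A possible mitigation is to drop the oldest turns before each request. The sketch below works on plain `(role, text)` tuples and uses a character budget as a crude stand-in for a token count; both the helper name `trim_history` and the budget are assumptions for illustration, not part of Haystack or vLLM.

```python
def trim_history(messages, max_chars=2000):
    """Keep the most recent messages whose combined text fits within max_chars.

    messages is a list of (role, text) tuples, oldest first. The newest
    message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    used = 0
    # Walk from newest to oldest and stop once the budget would be exceeded.
    for role, text in reversed(messages):
        if kept and used + len(text) > max_chars:
            break
        kept.append((role, text))
        used += len(text)
    kept.reverse()  # restore oldest-first order
    return kept


history = [("user", "a" * 10), ("assistant", "b" * 10), ("user", "c" * 10)]
print(len(trim_history(history, max_chars=20)))
```

With the `ChatMessage` objects from the loop, the same idea would apply to each message's `content`.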