# OpenVINO GenAI

[OpenVINO™](https://github.com/openvinotoolkit/openvino) is an open-source toolkit for optimizing and deploying AI inference. OpenVINO™ Runtime can run the same optimized model across various hardware [devices](https://github.com/openvinotoolkit/openvino?tab=readme-ov-file#supported-hardware-matrix). It accelerates deep learning inference across use cases such as language models (LLMs), computer vision, and automatic speech recognition.

OpenVINO models can be run locally through the `OpenVINOLLM` [class](https://python.langchain.com/docs/integrations/llms/openvino_geni). This integration is a wrapper around the [`openvino_genai`](https://github.com/openvinotoolkit/openvino.genai) library.

To use it, you need the `openvino_genai` Python [package installed](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide/genai-guide.html).

In [None]:
%pip install openvino_genai --quiet

### Model Export

You can [export your model](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#export) to the OpenVINO IR format with the CLI, then load it from a local folder.

In [None]:
!optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 ov_model_dir  # int4 = 4-bit weight quantization

You can also download an OpenVINO-optimized model from the [OpenVINO model hub](https://huggingface.co/OpenVINO).

In [1]:
import huggingface_hub as hf_hub

model_id = "OpenVINO/TinyLlama-1.1B-Chat-v1.0-int4-ov"
hf_hub.snapshot_download(model_id, local_dir="ov_model_dir")


'/home/ethan/intel/langchain/docs/docs/integrations/llms/ov_model_dir'
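Before loading, it can help to confirm that the folder actually contains OpenVINO IR files. An IR model is stored as an `.xml` file (network topology) paired with a `.bin` file (weights). The helper below is a minimal sketch using only the standard library; `find_ir_models` is a hypothetical name introduced for illustration, not part of `openvino_genai` or LangChain, and the exact file names in your folder may differ.

```python
from pathlib import Path


def find_ir_models(model_dir):
    """Return sorted stems of .xml files that have a matching .bin beside them.

    Hypothetical helper: it only inspects file names, it does not load a model.
    """
    root = Path(model_dir)
    if not root.is_dir():
        # A missing folder simply means nothing was exported or downloaded yet.
        return []
    return sorted(
        xml_path.stem
        for xml_path in root.glob("*.xml")
        if xml_path.with_suffix(".bin").is_file()
    )


# "ov_model_dir" is the folder used in the cells above.
print(find_ir_models("ov_model_dir"))
```

An empty list suggests the export or download did not complete, or that the path points at the wrong folder.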

### Model Loading

Models can be loaded by specifying the model parameters using the `from_model_path` method.

If you have an Intel GPU or NPU, you can specify `device="GPU"` or `device="NPU"` to run inference on it.

In [1]:
from langchain_community.llms import OpenVINOLLM


ov_llm = OpenVINOLLM.from_model_path(
    model_path="ov_model_dir",
    device="CPU",
)
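If the same notebook has to run on machines with different hardware, the device string can be chosen at runtime rather than hard-coded. The sketch below is an illustration under assumptions: `pick_device` is a hypothetical helper, and the preference order (GPU first, then CPU) is only an example. On a real system the list of names could come from `openvino.Core().available_devices`; here it is passed in as a plain list so the snippet runs without OpenVINO installed.

```python
def pick_device(available, preferred=("GPU", "CPU")):
    """Return the first available device matching the preferred families.

    Hypothetical helper. `available` is a list of device names such as
    ["CPU", "GPU.0"]; falls back to "CPU" when nothing matches.
    """
    for family in preferred:
        for name in available:
            # Accept the bare family name ("GPU") and indexed names ("GPU.0").
            if name == family or name.startswith(family + "."):
                return name
    return "CPU"


# Example list; replace with the devices reported on your machine.
device = pick_device(["CPU", "GPU.0"])
print(device)
```

The returned string can then be passed as the `device` argument of `from_model_path`.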



You can set generation config parameters through `ov_llm.config`. The supported parameters are listed in the [source code of OpenVINO GenAI](https://github.com/openvinotoolkit/openvino.genai/blob/master/src/python/py_generation_config.cpp).

In [16]:
ov_llm.config.max_new_tokens = 10

### Create Chain

With the model loaded into memory, you can compose it with a prompt to form a chain.

In [17]:
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

chain = prompt | ov_llm

question = "What is electroencephalography?"

print(chain.invoke({"question": question}))

Electroencephalography (EEG) is
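The output above is cut short because `max_new_tokens` was set to 10. To see the text the model receives before it starts generating, the template can be filled in by hand. This is a stand-in using plain `str.format`, on the assumption that it renders this simple single-placeholder template the same way the prompt step of the chain does; it does not call LangChain.

```python
template = """Question: {question}

Answer: Let's think step by step."""

# Fill the single {question} placeholder, as the prompt step of the chain would.
rendered = template.format(question="What is electroencephalography?")
print(rendered)
```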


### Streaming

You can use the `stream` method to receive the LLM output incrementally as it is generated.

In [None]:
ov_llm.config.max_new_tokens = 50

chain = prompt | ov_llm

for chunk in chain.stream({"question": question}):
    print(chunk, end="", flush=True)
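If you need the complete text as well as the live output, the chunks can be accumulated while they are printed. The sketch below assumes each chunk is a plain string; `collect_stream` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in generator so the snippet runs without a loaded model.

```python
def collect_stream(chunks):
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_stream():
    # Hypothetical stand-in for the iterator returned by chain.stream(...).
    yield from ["Electro", "encephalo", "graphy"]


full_text = collect_stream(fake_stream())
```

With a real chain, pass the iterator returned by `chain.stream` in place of `fake_stream()`.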

For more information, refer to:

* [OpenVINO LLM guide](https://docs.openvino.ai/2024/learn-openvino/llm_inference_guide.html).

* [OpenVINO Documentation](https://docs.openvino.ai/2024/home.html).

* [OpenVINO Get Started Guide](https://www.intel.com/content/www/us/en/content-details/819067/openvino-get-started-guide.html).
  
* [RAG Notebook with LangChain](https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/llm-rag-langchain).