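# Converting the Model to ONNX Format
OpenVINO can load optimized models from the Hugging Face Hub and build pipelines that run inference on OpenVINO Runtime through the Hugging Face API. In practice, replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class is enough to convert the model. With `export=True`, the PyTorch checkpoint is exported through ONNX (as the log below shows) and then saved in OpenVINO's format.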

In [1]:
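from optimum.intel.openvino import OVModelForCausalLM

model_dir = "llama-2-chat-7b/ov_model"  # output directory for the converted model
pt_model_id = "llama-2-chat-7b"         # local path or Hub ID of the PyTorch checkpoint

# export=True converts the PyTorch weights on load; compile=False skips device compilation.
ov_model = OVModelForCausalLM.from_pretrained(pt_model_id, export=True, compile=False)
ov_model.half()  # convert the weights to FP16 to reduce the model size
ov_model.save_pretrained(model_dir)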

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino


Framework not specified. Using pt to export to ONNX.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using framework PyTorch: 2.0.1+cu117
Overriding 1 configuration item(s)
	- use_cache -> True
Asked a sequence length of 16, but a sequence length of 1 will be used with use_past == True for `input_ids`.
  if seq_len > self.max_seq_len_cached:
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
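
Because the export takes a while and prints many warnings, a quick check that the output directory is complete can be useful. The sketch below assumes that `save_pretrained` writes the OpenVINO IR as `openvino_model.xml` and `openvino_model.bin`, the file names optimum-intel normally uses; `missing_ir_files` is a hypothetical helper, not part of any library.

```python
from pathlib import Path
from typing import List

# Assumed file names for the saved OpenVINO IR (topology + weights).
IR_FILES = ("openvino_model.xml", "openvino_model.bin")


def missing_ir_files(model_dir: str) -> List[str]:
    """Return the expected IR files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in IR_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_ir_files("llama-2-chat-7b/ov_model")
    if missing:
        print("Export incomplete, missing:", ", ".join(missing))
    else:
        print("OpenVINO IR files found.")
```

An empty list means both files are present and the directory can be passed to `OVModelForCausalLM.from_pretrained` for inference.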


# CPU Inference
Set the deployment device to CPU so that the model runs inference on an Intel CPU.

In [2]:
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoConfig, AutoTokenizer

model_dir = "llama-2-chat-7b/ov_model"  # converted model saved in the previous step
model_name = "llama-2-chat-7b"          # original checkpoint, used here for the tokenizer

# Optimize for latency with a single inference stream; CACHE_DIR is left empty.
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}
tok = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the converted model and compile it for the CPU device.
ov_model = OVModelForCausalLM.from_pretrained(
    model_dir,
    device="cpu",
    ov_config=ov_config,
    config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
    trust_remote_code=True,
)


The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
Compiling the model to CPU ...
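
The `ov_config` dictionary above leaves `CACHE_DIR` empty. As a minimal sketch, the helper below builds the same configuration and optionally points `CACHE_DIR` at a directory, assuming (as in OpenVINO's model-caching feature) that a non-empty value lets the runtime reuse compiled models across runs. `make_ov_config` is a hypothetical name introduced for illustration.

```python
from pathlib import Path
from typing import Dict, Optional


def make_ov_config(cache_dir: Optional[str] = None) -> Dict[str, str]:
    """Build the latency-oriented config used above, optionally with a cache directory."""
    config = {
        "PERFORMANCE_HINT": "LATENCY",  # favor single-request response time
        "NUM_STREAMS": "1",             # one inference stream
        "CACHE_DIR": "",                # empty: no cache directory configured
    }
    if cache_dir is not None:
        path = Path(cache_dir)
        path.mkdir(parents=True, exist_ok=True)  # create the directory if needed
        config["CACHE_DIR"] = str(path)
    return config
```

Calling `make_ov_config()` reproduces the dictionary from the cell above, while `make_ov_config("ov_cache")` would be the variant to try if repeated start-up compilation becomes a bottleneck.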


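## Exposing the Model Through a Network Interface
LangChain's default model is OpenAI's ChatGPT. For applications on a local network, information-security requirements mean that private data must not leave the gateway, so a local model has to be set up. The LLM deployment is the most expensive hardware cost of the whole application, so the most economical approach is to deploy one shared instance of each type of LLM per local network, which keeps the hardware fully utilized. The biggest benefit of deploying the LLM and LangChain separately is flexibility. LangChain is already a very good design template: it only integrates resources, while every storage-heavy or compute-heavy service is deployed remotely, leaving the LangChain application plenty of room to grow.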
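
To illustrate this separation, here is a minimal sketch of the client side: how a LangChain-side application could query a remotely deployed model over HTTP. It assumes the service accepts a POST form field named `ask` and listens on `http://127.0.0.1:8080/`, matching the Flask service defined later in this section. `build_request` and `ask_llm` are hypothetical helper names, not part of LangChain or OpenVINO.

```python
from urllib import parse, request

# Assumed address of the model service; adjust to the actual deployment.
DEFAULT_URL = "http://127.0.0.1:8080/"


def build_request(query: str, url: str = DEFAULT_URL) -> request.Request:
    """Create a POST request carrying the question in the form field `ask`."""
    body = parse.urlencode({"ask": query}).encode("utf-8")
    return request.Request(url, data=body, method="POST")


def ask_llm(query: str, url: str = DEFAULT_URL, timeout: float = 120.0) -> str:
    """Send the question to the service and return the reply as text."""
    with request.urlopen(build_request(query, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the service running, `ask_llm("What is OpenVINO?")` would return the generated answer as a string. The generous timeout is a guess that allows for slow CPU generation.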

In [6]:
from transformers import (
    AutoTokenizer,
    TextIteratorStreamer,
)

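def llama_partial_text_processor(partial_text, new_text):
    # Strip any instruction markers the model echoes, then append the new chunk.
    new_text = new_text.replace("[INST]", "").replace("[/INST]", "")
    partial_text += new_text
    return partial_text


def fun(query: str):
    # Llama 2 chat prompt. The Chinese system prompt says: "You are a helpful,
    # respectful and honest assistant. Always help as much as possible while
    # staying safe. Please answer in Chinese."
    start = '''<s>[INST] <<SYS>>
    你是一个乐于助人、尊重他人、诚实的助手。在安全的情况下，始终尽可能提供帮助。请用中文回答
    <</SYS>>
    '''
    end = "[/INST]"
    string = start + query + end
    input_tokens = tok(string, return_tensors="pt", add_special_tokens=False)
    streamer = TextIteratorStreamer(tok, timeout=30.0, skip_prompt=True, skip_special_tokens=True)
    # generate() runs to completion here; the streamer then yields the buffered text chunks.
    ov_model.generate(**input_tokens, max_new_tokens=256, streamer=streamer)
    partial_text = ""
    for new_text in streamer:
        partial_text = llama_partial_text_processor(partial_text, new_text)
    # Remove spaces, which suits the Chinese output requested by the system prompt.
    result = partial_text.replace(" ", "")
    return result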
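
The triple-quoted string in `fun` carries its four-space source indentation into the prompt. A small helper that assembles the Llama 2 chat markers explicitly avoids this. The sketch below follows the commonly documented single-turn Llama 2 layout; `build_llama2_prompt` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced for illustration.

```python
# Placeholder system prompt; substitute the one your application needs.
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."


def build_llama2_prompt(user_message: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Wrap one user turn in the Llama 2 chat markers without stray indentation."""
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt.strip()}\n"
        "<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )
```

The returned string can be passed to the tokenizer exactly as `string` is in `fun`, still with `add_special_tokens=False`, since the `<s>` marker is already part of the text.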

In [None]:
from flask import Flask, request

app = Flask(__name__)


@app.route('/', methods=['GET', 'POST'])
def index():
    if request.method == 'POST':
        # The question arrives in the form field "ask".
        query = request.form.get('ask')
        print(query)
        result = fun(query)
        print(result)
        return result
    else:
        return 'Hello, GET!'


if __name__ == '__main__':
    # Listen on the loopback interface only, port 8080.
    app.run(host='127.0.0.1', port=8080)

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:8080
Press CTRL+C to quit
