<a href="https://colab.research.google.com/github/m25c1049/generative_AI/blob/main/Toyama_Uni_LangChain_Llama2_7b_Q_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LLM Application Sample Code
## Structure
- LLM: mmnga/ELYZA-japanese-Llama-2-7b-instruct-gguf on Hugging Face
- Embedding: sentence-transformers/distiluse-base-multilingual-cased-v2
- RAG (vector DB): Chroma
- RAG Data Source: https://www.aozora.gr.jp/cards/000081/files/43754_17659.html

## Environment Resource Requirements
- RAM: more than 13 GB
- Disk: 50 GB free
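
Before downloading the model it can help to confirm that the runtime meets these figures. The following is a minimal sketch using only the standard library; the function name `check_resources` and its parameters are hypothetical, the thresholds are simply the numbers listed above, and the RAM lookup relies on `os.sysconf`, which is only available on POSIX systems such as Colab.

```python
import os
import shutil

GIB = 1024 ** 3  # bytes per GiB


def check_resources(path="/", min_ram_gb=13, min_disk_gb=50):
    """Report total RAM and free disk space in GiB with pass/fail flags.

    Hypothetical helper: the default thresholds mirror the requirements above.
    """
    # Free space on the filesystem that contains `path`.
    disk_free_gb = shutil.disk_usage(path).free / GIB

    # Total physical RAM; os.sysconf is POSIX-only, so report None elsewhere.
    try:
        ram_total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GIB
    except (AttributeError, ValueError, OSError):
        ram_total_gb = None

    return {
        "ram_total_gb": ram_total_gb,
        "ram_ok": ram_total_gb is not None and ram_total_gb >= min_ram_gb,
        "disk_free_gb": disk_free_gb,
        "disk_ok": disk_free_gb >= min_disk_gb,
    }


print(check_resources())
```

If `ram_ok` or `disk_ok` is `False`, a larger runtime or a smaller quantized model file may be needed.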

## Details
- Install the libraries with pip.
- Configure the embedding model.
- Scrape data from the website.
- Split the loaded text into chunks.
- Insert the chunks into the vector DB.
- Configure the vector DB as a retriever (RAG).
- Create the prompt as a ChatPromptTemplate instance.
- Define the chain with LangChain Expression Language (LCEL).
- Execute the chain: retrieve with RAG, inject the context and the question into the prompt, run the LLM, and receive the response as a stream.
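
To illustrate the chunking step, here is a simplified sketch of fixed-size splitting with overlap, using the same 256/64 sizes as the notebook. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to split on separators such as newlines); the function `split_with_overlap` is a hypothetical stand-in that only shows how consecutive chunks share text.

```python
def split_with_overlap(text, chunk_size=256, chunk_overlap=64):
    """Split text into fixed-size character windows that overlap.

    Hypothetical simplification: a plain sliding window, with no
    separator-aware splitting.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(600))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])
# The last 64 characters of one chunk reappear at the start of the next.
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.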

In [None]:
!pip install langchain
!pip install langchain-core
!pip install langchain-community
!pip install langchain-huggingface
!pip install llama-cpp-python
!pip install huggingface-hub
!pip install sentence-transformers
!pip install chromadb
!pip install beautifulsoup4
!pip install lxml
!pip install requests
!pip install numpy
!pip install transformers
!pip install torch
!pip install tokenizers
# Explicitly install requests==2.32.4 to satisfy the google-colab dependency.
# This may produce a warning related to langchain-community.
!pip install requests==2.32.4 --force-reinstall

Collecting requests<3,>=2.32.5 (from langchain-community)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Using cached requests-2.32.5-py3-none-any.whl (64 kB)
Installing collected packages: requests
  Attempting uninstall: requests
    Found existing installation: requests 2.32.4
    Uninstalling requests-2.32.4:
      Successfully uninstalled requests-2.32.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.
Successfully installed requests-2.32.5
Collecting requests==2.32.4
  Using cached requests-2.32.4-py3-none-any.whl.metadata (4.9 kB)
Collecting charset_normalizer<4,>=2 (from requests==2.32.4)
  Using cached charset_normalizer-3.4.3-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl.metadata (36 k
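
The log above shows `google-colab` and `langchain-community` asking for different `requests` versions, so it can be useful to check which versions are actually installed after the cell finishes. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions(["requests", "langchain-community", "google-colab"]))
```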

In [None]:
from huggingface_hub import hf_hub_download
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WebBaseLoader
# RecursiveCharacterTextSplitter lives in langchain_text_splitters; the older
# langchain.text_splitter path is not available in every LangChain release.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_huggingface import HuggingFaceEmbeddings

CONTEXT_SIZE = 2048
LLM_REPO_ID = "mmnga/ELYZA-japanese-Llama-2-7b-instruct-gguf"
LLM_FILE = "ELYZA-japanese-Llama-2-7b-instruct-q4_K_S.gguf"
CHUNK_SIZE = 256
CHUNK_OVERLAP = 64
EMB_MODEL = "sentence-transformers/distiluse-base-multilingual-cased-v2"
COLLECTION_NAME = "langchain"
SRC_INFO_URL = "https://www.aozora.gr.jp/cards/000081/files/43754_17659.html"

# Download the GGUF model file and create the LLM
model_path = hf_hub_download(repo_id=LLM_REPO_ID, filename=LLM_FILE)
llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=128,
    n_ctx=CONTEXT_SIZE,
    f16_kv=True,
    verbose=True,
    seed=0
)

# Prepare the model that generates the embeddings
embeddings = HuggingFaceEmbeddings(model_name=EMB_MODEL)

# Load the source document from the given URL
loader = WebBaseLoader(SRC_INFO_URL)
data = loader.load()

# Split the loaded text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP
)
all_splits = text_splitter.split_documents(data)

# Embed the chunks and store them in the vector DB
vector_store = Chroma.from_documents(
    documents=all_splits, embedding=embeddings
)

# Use the vector DB as a LangChain retriever; k sets the number of chunks to retrieve
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

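# Llama 2 prompt template. The Japanese text is kept because the model is Japanese-tuned.
# System prompt: "You are a sincere and capable Japanese assistant. Answer using only the given context."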
template = """<s>[INST] <<SYS>>
あなたは誠実で優秀な日本人のアシスタントです。前提条件の情報だけで回答してください。
<</SYS>>

前提条件：{context}

質問：{question} [/INST]"""

# Build the chain with LangChain LCEL
prompt = ChatPromptTemplate.from_template(template)
output_parser = StrOutputParser()
setup_and_retrieval = RunnableParallel(
    {"context": retriever, "question": RunnablePassthrough()}
)
chain = setup_and_retrieval | prompt | llm | output_parser

# Run the chain and stream the answer.
# The question asks: "What animals did the two gentlemen have with them?"
for s in chain.stream("2人の紳士が連れていた動物は何ですか？"):
    print(s, end="", flush=True)

llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--mmnga--ELYZA-japanese-Llama-2-7b-instruct-gguf/snapshots/2d708f9c52bde588049a494e95b986f5bedba76f/ELYZA-japanese-Llama-2-7b-instruct-q4_K_S.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ELYZA-japanese-Llama-2-7b-instruct
llama_model_loader: - kv   2:       general.source.hugginface.repository str              = elyza/ELYZA-japanese-Llama-2-7b-instruct
llama_model_loader: - kv   3:                   llama.tensor_data_layout str              = Meta AI original pth
llama_model_loader: - kv   4:                       llama.context_length u32              = 4096
llama_model_loader: - kv   5:                     ll

  犬

llama_perf_context_print:        load time =  196637.92 ms
llama_perf_context_print: prompt eval time =  196634.67 ms /   816 tokens (  240.97 ms per token,     4.15 tokens per second)
llama_perf_context_print:        eval time =    3235.84 ms /     5 runs   (  647.17 ms per token,     1.55 tokens per second)
llama_perf_context_print:       total time =  199882.79 ms /   821 tokens
llama_perf_context_print:    graphs reused =         80
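
In the chain above, the retriever's output is passed straight into `{context}`, so, as far as can be told from the code, the prompt receives the string form of a list of documents, metadata included. A common refinement is to join only the chunk text before filling the template. The sketch below shows the idea without LangChain: `Doc` is a hypothetical stand-in for a LangChain `Document`, `format_docs` is an assumed helper name, and the chunk texts are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks and drop their metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


# Same placeholders as the notebook's template, without the Llama 2 control tags.
TEMPLATE = "前提条件：{context}\n\n質問：{question}"

docs = [
    Doc("first retrieved chunk", {"source": "https://example.invalid"}),
    Doc("second retrieved chunk"),
]
prompt_text = TEMPLATE.format(context=format_docs(docs), question="(question text)")
print(prompt_text)
```

In the LCEL chain this would correspond to piping the retriever into the helper, for example `{"context": retriever | format_docs, "question": RunnablePassthrough()}`. That could also shorten the prompt, which matters here because prompt evaluation dominates the run time in the log above.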
