# Appendix C: RAG (Retrieval-Augmented Generation)


## Setup

In [1]:
!curl -L -o genaibook.zip https://github.com/oreilly-japan/hands-on-generative-ai-ja/releases/download/genaibook/genaibook.zip



In [2]:
!unzip genaibook.zip

Archive:  genaibook.zip
Made with MacWinZipper (http://tidajapan.com/macwinzipper)
  inflating: genaibook/__init__.py   
  inflating: genaibook/core.py       


## Data Preprocessing


In [3]:
import urllib.request

# Define the file name and URL
file_name = "The-AI-Act.pdf"
url = "https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf"

# Download the file
urllib.request.urlretrieve(url, file_name)
print(f"{file_name} downloaded successfully.")

The-AI-Act.pdf downloaded successfully.


In [4]:
%pip install langchain_community pypdf langchain-text-splitters


In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_name)
docs = loader.load()
print(len(docs))

108


In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=100
)
chunks = text_splitter.split_documents(docs)
print(len(chunks))

851
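
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a much simpler splitter. The sketch below is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line boundaries); it only cuts fixed-size character windows. The function name `split_with_overlap` and the sample string are hypothetical, chosen for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share an overlap.

    Simplified illustration only; not the LangChain algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.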


In [7]:
chunked_text = [chunk.page_content for chunk in chunks]
chunked_text[404]

'user or for own use on the Union market for its intended purpose; \n(12) ‘intended purpose’ means the use for which an AI system is intended by the provider, \nincluding the specific context and conditions of use,  as specified in the information \nsupplied by the provider in the instructions for use, promotional or sales materials \nand statements, as well as in the technical documentation; \n(13) ‘reasonably foreseeable misuse’ means the use of an AI system in a way that is not in'

### Document Embeddings


In [8]:
from sentence_transformers import SentenceTransformer, util

sentences = ["I'm happy", "I'm full of happiness"]
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Compute the embeddings of the two sentences
embedding_1 = model.encode(sentences[0], convert_to_tensor=True)
embedding_2 = model.encode(sentences[1], convert_to_tensor=True)


In [9]:
embedding_1.shape

torch.Size([384])

In [10]:
util.pytorch_cos_sim(embedding_1, embedding_2)

tensor([[0.8367]], device='cuda:0')

In [11]:
embedding_1 @ embedding_2

tensor(0.8367, device='cuda:0')

In [12]:
import torch

torch.dot(embedding_1, embedding_2)

tensor(0.8367, device='cuda:0')
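
The cosine similarity and the two dot products above all return 0.8367. That agreement is what one would expect if the embeddings are unit-length (L2-normalized), because cosine similarity divides the dot product by the two norms, and both norms are then 1. The sketch below checks this relationship with hand-made 2D vectors rather than real embeddings; all function names are hypothetical helpers, not part of `sentence_transformers`.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two lengths."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


raw_u, raw_v = [3.0, 4.0], [4.0, 3.0]
u, v = normalize(raw_u), normalize(raw_v)

# For unit vectors the two quantities coincide
print(cosine_similarity(u, v), dot(u, v))
# For unnormalized vectors they generally differ
print(cosine_similarity(raw_u, raw_v), dot(raw_u, raw_v))
```

If the embeddings were not normalized, the dot product would also reward longer vectors, so the two scores could rank chunks differently.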

In [13]:
chunk_embeddings = model.encode(chunked_text, convert_to_tensor=True)

In [14]:
chunk_embeddings.shape

torch.Size([851, 384])

## Retrieval

In [15]:
def search_documents(query, top_k=5):
    # Encode the query as a vector
    query_embedding = model.encode(query, convert_to_tensor=True)

    # Compute cosine similarity between the query and every document chunk
    similarities = util.pytorch_cos_sim(query_embedding, chunk_embeddings)

    # Get the indices of the top-k most similar chunks as plain Python ints
    top_k_indices = similarities[0].topk(top_k).indices.tolist()

    # Look up the corresponding document chunks
    results = [chunked_text[i] for i in top_k_indices]

    return results

In [16]:
search_documents("What are prohibited ai practices?", top_k=2)

['TITLE II \nPROHIBITED ARTIFICIAL INTELLIGENCE PRACTICES \nArticle 5 \n1. The following artificial intelligence practices shall be prohibited: \n(a) the placing on the market, putting into service or use of an A I system that \ndeploys subliminal techniques beyond a person’s consciousness in order to \nmaterially distort a person’s behaviour in a manner that causes or is likely to \ncause that person or another person physical or psychological harm;',
 'low or minimal risk. The list of prohibited practices in Title II comprises all those AI systems \nwhose use is considered unacceptable as contravening Unio n values, for instance by violating \nfundamental rights. The prohibitions covers practices that have a significant potential to \nmanipulate persons  through subliminal techniques beyond their consciousness or exploit']
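
The retrieval step is just "score every chunk, keep the best k". The following sketch reproduces that logic in plain Python on toy 2D vectors, so it can be run without a model or a GPU. The names `top_k_search`, `toy_docs`, and `toy_vecs` are hypothetical, and unlike `search_documents` this variant also returns the scores, which is an assumption about what might be useful (for example, to discard weak matches).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, doc_vecs, docs, top_k=2):
    """Return the top_k (document, score) pairs, best match first."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    # Sort document indices by descending similarity
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in ranked[:top_k]]


# Toy "embeddings": made-up numbers, not produced by any model
toy_docs = ["prohibited practices", "high-risk systems", "transparency rules"]
toy_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

hits = top_k_search([0.9, 0.1], toy_vecs, toy_docs, top_k=2)
for text, score in hits:
    print(f"{score:.3f}  {text}")
```

This brute-force scan compares the query against every chunk, which is fine for a few hundred chunks like the 851 here.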

## Response Generation

In [17]:
from transformers import pipeline

from genaibook.core import get_device

device = get_device()
generator = pipeline(
    "text-generation", model="HuggingFaceTB/SmolLM-135M-Instruct", device=device
)


Device set to use cuda


In [18]:
def generate_answer(query):
    # Retrieve the relevant chunks
    context_chunks = search_documents(query, top_k=2)

    # Join the chunks into a single string to use as context
    context = "\n".join(context_chunks)

    # Build a prompt that asks for an answer grounded in the context
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Define the messages in the chat format the model expects
    system_prompt = (
        "You are a friendly assistant that answers questions about the AI Act. "
        "If the user is not making a question, you can ask for clarification"
    )
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]

    response = generator(messages, max_new_tokens=300)
    # The last message in the returned conversation is the assistant's reply
    return response[0]["generated_text"][-1]["content"]
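
When debugging a RAG pipeline it can help to inspect the exact prompt before any text is generated. The sketch below pulls the prompt assembly out into a standalone function that needs no model. `build_messages` is a hypothetical helper, and numbering the chunks with `[1]`, `[2]` is an addition for readability that the cell above does not do.

```python
def build_messages(query, context_chunks, system_prompt):
    """Assemble chat messages from a query and retrieved chunks.

    Hypothetical helper for illustration; it mirrors the prompt layout
    used in generate_answer but numbers each chunk.
    """
    numbered = [f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)]
    context = "\n".join(numbered)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt},
    ]


messages = build_messages(
    "What is banned?",
    ["Chunk A", "Chunk B"],
    "You answer questions about the AI Act.",
)
print(messages[1]["content"])
```

Printing the user message shows exactly which retrieved text the model sees, which makes it easier to tell a retrieval problem from a generation problem.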

In [19]:
answer = generate_answer("What are prohibited ai practices in the EU act?")
print(answer)

To answer this question, we need to consider the key provisions of the EU Act and the specific guidance provided by the EU Commission.

The EU Act is a comprehensive set of laws and regulations that aim to promote the development and use of artificial intelligence, as well as ensure that AI systems are designed and deployed in a way that respects human rights and dignity. The Act provides a framework for the development and deployment of AI systems, including the rights to privacy, freedom of expression, and equal protection under the law.

The EU Act also includes a set of rules and guidelines for the development and use of AI systems, which are designed to ensure that they are designed and deployed in a way that respects human rights and dignity. These rules and guidelines are intended to promote the development of AI systems that can benefit society as a whole, rather than just their technical capabilities.

In the context of artificial intelligence, the EU Act prohibits certain pra