<a href="https://colab.research.google.com/github/kylevirtuous1211/Cache-Replacement-Computer-Architecture-final_project_2024/blob/master/RAG_implementation_(Consumer).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training vectorstore (RAG)
## TAs: please start from the section "TAs: start running from here" (助教直接從這邊開始跑)

In [None]:
import os
root_dir = "upload_docs/"
os.makedirs(root_dir, exist_ok=True)

In [None]:
!pip install -U langchain langchain-community pypdf python-docx sentence-transformers faiss-cpu


In [None]:
!pip install pandas langchain_core



## Pipeline packages explained

1. Loading documents (TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader)

2. Splitting text (RecursiveCharacterTextSplitter)

3. Creating embeddings (HuggingFaceEmbeddings)

4. Storing vectors for similarity search (FAISS)
    * Stores all passage vectors in an index.
    * Given a query vector, it finds the top-K closest vectors.
    * FAISS offers approximate index types (like HNSW or IVF) that scale to millions of documents. The index that LangChain's `FAISS.from_documents` builds by default appears to be a flat, exact-search index, which is sufficient at this notebook's scale.
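
The top-K lookup described above can be illustrated with a brute-force search in plain Python. This is only a sketch of the idea, not how FAISS is implemented: the function name `top_k_nearest` and the 2-D toy vectors are assumptions made for illustration.

```python
import heapq
import math

def top_k_nearest(query, vectors, k=2):
    """Return (index, distance) pairs for the k vectors closest to query (L2 distance)."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    # nsmallest keeps only the k best candidates instead of sorting everything
    return [(i, d) for d, i in heapq.nsmallest(k, distances)]

# Toy 2-D "passage vectors" (hypothetical data; real embeddings have hundreds of dimensions)
passage_vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
results = top_k_nearest((1.0, 1.0), passage_vectors, k=2)
print(results)
```

A flat index does essentially this comparison against every stored vector; approximate indexes trade a little accuracy for speed on large collections.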


In [None]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Embeddings and vector stores live in langchain_community; the langchain.* import paths are deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS


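### Goal of the E5 embedding model:
1. Map queries and passages with similar meanings to nearby vectors.
2. Override two methods so that inputs match the training format. The E5 models were trained using contrastive learning, where:
* Queries were always prefixed with "query: ..."
* Passages were always prefixed with "passage: ...", so adding the prefixes gives much better results.

### Goal of RAG
1. The user enters a query.
2. The embedding model turns the query into a vector.
3. FAISS is searched for the passage embeddings most similar to that vector.
4. The retrieved passages are added to the LLM prompt to generate the answer.

`embed_documents`: for indexing documents/passages into FAISS
* Adds the prefix "passage: " so the model treats the text as a retrievable passage

`embed_query`: for real-time user queries
* Adds the prefix "query: " so the model treats it as a search query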

In [None]:
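from langchain_community.embeddings import HuggingFaceEmbeddings

class CustomE5Embedding(HuggingFaceEmbeddings):
    """HuggingFaceEmbeddings that adds the prefixes E5 models expect."""

    def embed_documents(self, texts):
        # Indexed text is marked as a passage
        texts = [f"passage: {t}" for t in texts]
        return super().embed_documents(texts)

    def embed_query(self, text):
        # User input is marked as a query
        return super().embed_query(f"query: {text}")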
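
The prefix logic can be checked without downloading the model. The sketch below applies the same two overrides to a hypothetical stub base class (`StubEmbeddings` is not part of LangChain) that simply returns the text it receives, so the prefixes become visible.

```python
class StubEmbeddings:
    """Hypothetical stand-in for HuggingFaceEmbeddings: returns the text it was given."""

    def embed_documents(self, texts):
        return list(texts)

    def embed_query(self, text):
        return text

class PrefixedStub(StubEmbeddings):
    # Same overrides as CustomE5Embedding, applied to the stub base class
    def embed_documents(self, texts):
        return super().embed_documents([f"passage: {t}" for t in texts])

    def embed_query(self, text):
        return super().embed_query(f"query: {text}")

stub = PrefixedStub()
print(stub.embed_documents(["red mug", "blue scarf"]))
print(stub.embed_query("warm jacket"))
```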

In [None]:
from langchain_core.documents import Document
import pandas as pd

folder_path = root_dir
documents = []

def load_xlsx_as_documents(file_path, max_rows=30000):
    """Sample up to max_rows spreadsheet rows and store each one as a Document."""
    df = pd.read_excel(file_path)
    # Cap the sample size so sheets with fewer rows do not raise a ValueError
    selected_rows = df.sample(n=min(max_rows, len(df)))

    for _, row in selected_rows.iterrows():
        # Render the row as "column: value" lines
        text = "\n".join(f"{col}: {row[col]}" for col in df.columns)
        documents.append(Document(page_content=text))

for file in os.listdir(folder_path):
    path = os.path.join(folder_path, file)
    if file.endswith(".txt"):
        loader = TextLoader(path)
    elif file.endswith(".xlsx"):
        # Rows are appended directly, so skip loader.load() and move to the next file
        load_xlsx_as_documents(path)
        continue
    elif file.endswith(".pdf"):
        loader = PyPDFLoader(path)
    elif file.endswith(".docx"):
        loader = UnstructuredWordDocumentLoader(path)
    else:
        continue
    documents.extend(loader.load())
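
To make the spreadsheet step concrete, here is a small sketch of how one row becomes the text of a document. The helper name `row_to_text` and the column names are hypothetical; the real columns come from the uploaded file.

```python
def row_to_text(row, columns):
    """Join 'column: value' pairs with newlines, one line per column."""
    return "\n".join(f"{col}: {row[col]}" for col in columns)

# Hypothetical row for illustration only
columns = ["Description", "Quantity", "UnitPrice"]
row = {"Description": "WHITE METAL LANTERN", "Quantity": 6, "UnitPrice": 3.39}
text = row_to_text(row, columns)
print(text)
```

Keeping the column name next to each value gives the embedding model some context about what each value means.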

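splitter: an object that recursively splits long text into meaningful chunks such as paragraphs and sentences
* chunk_size: the maximum number of characters in one chunk
* chunk_overlap: the number of characters shared by two consecutive chunks, which helps preserve context across chunk boundaries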
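
The effect of `chunk_size` and `chunk_overlap` can be seen with a simplified fixed-window chunker. This is a sketch under a strong simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, while the hypothetical `sliding_chunks` below only slides a character window.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk starts with the last four characters of the previous one, which is the overlap that keeps context from being cut at a boundary.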

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
split_docs = splitter.split_documents(documents)

### Build a FAISS vector store (a semantic search engine) from your split documents using a dense embedding model.


①	Loops through each document chunk in split_docs

②	Calls embedding_model.embed_documents(...) to get vector embeddings

③	Stores all vectors in a FAISS index

④	Returns a vectorstore you can later use to search by vector

#### Each chunk => one semantic vector!!
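
The four steps can be sketched with a toy in-memory store. Everything here is an assumption for illustration: `toy_embed` uses word counts instead of a dense E5 vector, and `ToyVectorStore` is not the LangChain or FAISS API.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical embedding: word counts instead of a dense E5 vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self, chunks):
        # One vector per chunk, stored next to the chunk text
        self.entries = [(toy_embed(c), c) for c in chunks]

    def similarity_search(self, query, k=1):
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

store = ToyVectorStore(["green waterproof jacket", "ceramic coffee mug", "wool winter scarf"])
hits = store.similarity_search("jacket for rain, green", k=1)
print(hits)
```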

In [None]:
embedding_model = CustomE5Embedding(model_name="intfloat/multilingual-e5-small")
vectorstore = FAISS.from_documents(split_docs, embedding_model)



In [None]:
vectorstore.save_local("faiss_db")
# !zip -r faiss_Online_Retail_db.zip faiss_db

# TAs: start running from here (助教直接從這邊開始跑)

In [None]:
!pip install -U langchain langchain-community pypdf python-docx sentence-transformers faiss-cpu
!pip install pandas langchain_core


In [None]:
import os
import gdown
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredWordDocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Replace FILE_ID with your actual Google Drive file ID
file_id = "1XbCzLTnsKkqemk1escZLtVP14dZreNNm"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, output="faiss_Online_Retail_db.zip", quiet=False)


Downloading...
From (original): https://drive.google.com/uc?id=1XbCzLTnsKkqemk1escZLtVP14dZreNNm
From (redirected): https://drive.google.com/uc?id=1XbCzLTnsKkqemk1escZLtVP14dZreNNm&confirm=t&uuid=60ff9366-2b9e-4186-ae93-9293dd0d6f3f
To: /content/faiss_Online_Retail_db.zip
100%|██████████| 44.4M/44.4M [00:00<00:00, 289MB/s]


'faiss_Online_Retail_db.zip'

In [None]:
!unzip faiss_Online_Retail_db.zip

Archive:  faiss_Online_Retail_db.zip
   creating: faiss_db/
  inflating: faiss_db/index.faiss    
  inflating: faiss_db/index.pkl      


In [None]:

!pip install -U langchain langchain-community sentence-transformers faiss-cpu gradio openai




### Retrieve vectorstore db

In [None]:
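# CustomE5Embedding is defined in the embedding cell earlier in this notebook; run that cell first
embedding_model = CustomE5Embedding(model_name="intfloat/multilingual-e5-small")
# index.pkl is a pickle file, so loading requires this flag; only load indexes you trust
db = FAISS.load_local("faiss_db", embedding_model, allow_dangerous_deserialization=True)
retriever = db.as_retriever()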

Try out the gemma2-9b model.

In [None]:
import os
from openai import OpenAI
import gradio as gr
from google.colab import userdata

# The Groq API key is stored as a Colab secret named 'Groq'
api_key = userdata.get('Groq')
os.environ["OPENAI_API_KEY"] = api_key
model = "gemma2-9b-it"
base_url = "https://api.groq.com/openai/v1"

client = OpenAI(
    base_url=base_url  # not needed when calling OpenAI's own API
)


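## Prompt design
To make the answers more useful to the business, the LLM is asked to respond as a member of the company's target customer group.

### System prompt
You are a target customer of an online retail shop. Please explain why you would or would not want to buy this item.

### Prompt template
Answer according to the prompt below:
{retrieved_chunks}

User question: {user_prompt}

The company is launching a new product on its online retail shop. Please describe your purchase intention.
You need to argue why the product is worth buying or why it does not satisfy you.
At the end, summarize the key points of your argument
and give a final verdict on the product.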

In [None]:
system_prompt = "You are a target customer of an online retail shop. Please give your reasons for whether or not you want to buy this item."

prompt_template = """
Answer according to the prompt below:
{retrieved_chunks}

User question: {user_prompt}

The company is launching a new product on its online retail shop. Please describe your purchase intention.
You need to argue why the product is worth buying or why it does not satisfy you.
At the end, summarize the key points of your argument
and give a final verdict on the product.
"""
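
How the template is filled can be shown with a short standalone sketch. `TEMPLATE` is a shortened stand-in for the notebook's template, and `build_prompt`, its `use_rag` flag, and the sample chunks are hypothetical; passing a blank context gives a no-RAG baseline for comparison.

```python
TEMPLATE = """Context:
{retrieved_chunks}

user question: {user_prompt}
"""

def build_prompt(user_prompt, chunks, use_rag=True):
    """Fill the template; with use_rag=False the context slot stays blank."""
    context = "\n\n".join(chunks) if use_rag else " "
    return TEMPLATE.format(retrieved_chunks=context, user_prompt=user_prompt)

# Hypothetical retrieved chunks for illustration only
chunks = [
    "Description: RAIN JACKET\nUnitPrice: 25.0",
    "Description: FLEECE HAT\nUnitPrice: 8.5",
]
with_rag = build_prompt("Dark green jacket for 30 USD", chunks)
without_rag = build_prompt("Dark green jacket for 30 USD", chunks, use_rag=False)
print(with_rag)
```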

In [None]:

def generate_response(user_prompt):
    # Retrieve the passages most similar to the user's input
    docs = retriever.invoke(user_prompt)
    retrieved_chunks = "\n\n".join([doc.page_content for doc in docs])

    # Pass the retrieved text into the prompt (use " " here to reproduce the no-RAG baseline)
    prompt = prompt_template.format(retrieved_chunks=retrieved_chunks, user_prompt=user_prompt)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content


iface = gr.Interface(
    fn=generate_response,
    inputs=gr.Textbox(lines=2, placeholder="Describe the product the company wants to sell..."),
    outputs="text",
    title="Product Purchase Intention Analyzer",
    description="Analyze user purchase intention based on provided product information."
)

iface.launch(debug=True)






## User prompt: Dark Green North Face Jacket for 30 USD, perfect for outdoor activities in slightly cold weather

## Answer with RAG
This dark green North Face jacket for $30 sounds tempting!  Here's my breakdown:

**Why I might buy it:**

* **Brand Recognition:** North Face is known for quality outdoor wear. I trust their products to be durable and functional, especially for milder cold weather conditions.
* **Price Point:** $30 is a steal for a North Face jacket, even a slightly older or past-season model.  This makes it incredibly appealing as a budget-friendly option.
* **Versatile Use:** Hiking, camping, walking the dog – this jacket could be useful for many outdoor activities. The dark green color is also versatile and can be easily paired with different outfits.

**Why I might hesitate:**

* **Specific Details:** I need to know more about the jacket's features. What is the exact weight?  What type of insulation does it have?  Is the waterproof rating sufficient for light rain or snow?  
* **Size and Fit:** Online shopping always has the risk of fit issues. I'd want to check the size chart thoroughly and possibly read reviews from other buyers about how the jacket fits true to size.

**Key Points:**

* Price is a major draw, but brand recognition also plays a role in my decision.
* I need specific details about the jacket's features and fit to make a confident purchase.


**Verdict:**

Overall, the $30 price tag on a North Face jacket is extremely enticing. However, I'd need more information about the jacket's specific features and fit before making a final decision. If it checks those boxes, I'm definitely buying it!  

## Answer without RAG
##  Hmmm, a Dark Green North Face Jacket for $30…

A North Face jacket for $30 is definitely tempting! I love the outdoors and I'm always looking for good deals on quality gear.  However, I need to weigh a few things before clicking "add to cart."

**Here's why it might be worth buying:**

* **North Face quality:**  North Face is known for making durable and reliable outdoor clothing. Even if this is a discounted model, I'd expect decent build quality and a good fit.
* **Perfect for slightly cold weather:** This is a big selling point for me.  I live in a place with unpredictable weather, so a versatile jacket like this would be perfect for those chilly spring or fall days.
* **Versatile color:** Dark green is a classic color that goes with pretty much everything. I could easily wear this jacket hiking, running errands, or even just hanging out.

**Here's why it might not be worth buying:**

* **Too good to be true?**  A North Face jacket for $30 seems incredibly cheap. I wonder if it's a heavily discounted older model, or perhaps a smaller size that's harder to sell.
* **Missing details:**  I need to know more about the jacket itself.  What is the fill weight? Does it have a hood? What are the pockets like?  Without knowing these details, I'm hesitant to commit.

**Key Points:**

* Price is incredibly attractive
*  Brand reputation is positive
* Need more information about the jacket's features and condition

**Verdict:**

I'm definitely interested, but I need more information before making a decision. I'd check the product description carefully, read customer reviews, and maybe even contact the company directly to ask about the jacket's specifics. If it checks out, I'd be happy to buy it! This deal is just too good to pass up without a closer look.  

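# Comparison
In this more detailed example, the answer with RAG can draw on richer information for a deeper analysis, including the product's materials, technical features, weight, suitable environments, and even real user reviews. This makes its purchase decision more rational and lets it raise specific doubts and considerations.

The answer without RAG, by contrast, stays at a fairly superficial level: its main considerations are still brand and price, and it lacks enough understanding of the product's actual functions and suitability to reach a better-informed decision.

This example shows more clearly how RAG (Retrieval-Augmented Generation) can significantly improve the quality and usefulness of an answer by grounding it in more complete and more specific information.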