# Raw Document Analysis (Optional)

In [1]:
def word_count(file_name: str) -> None:
    # More accurate word count for Chinese
    with open(file_name, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Count Chinese characters (each Chinese char ≈ 1 word)
    chinese_chars = len([c for c in content if '\u4e00' <= c <= '\u9fff'])
    # Count English words
    english_words = len([w for w in content.split() if any('a' <= c.lower() <= 'z' for c in w)])
    
    print(f"Total characters: {len(content)}")
    print(f"Chinese characters: {chinese_chars} (≈ words)")
    print(f"English words: {english_words}")
    print(f"Approximate total words: {chinese_chars + english_words}")

word_count("cn1.md")

Total characters: 8085
Chinese characters: 6433 (≈ words)
English words: 24
Approximate total words: 6457
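
The counting rule above can be checked on a small in-memory string. The sketch below lifts the same two rules into a hypothetical helper, `count_words_in_text` (not part of the notebook), and shows one edge case: a token that mixes Chinese and Latin characters contributes to both counts.

```python
from typing import Tuple

def count_words_in_text(content: str) -> Tuple[int, int]:
    """Return (chinese_chars, english_words) using the same rules as word_count()."""
    # CJK Unified Ideographs, basic block only (U+4E00 to U+9FFF)
    chinese_chars = sum(1 for c in content if '\u4e00' <= c <= '\u9fff')
    # A whitespace-separated token counts as an English word if it has any a-z letter
    english_words = sum(
        1 for w in content.split()
        if any('a' <= c.lower() <= 'z' for c in w)
    )
    return chinese_chars, english_words

sample = "哈利 Potter 用了 magic"
mixed = "荧光闪烁Lumos"

print(count_words_in_text(sample))  # → (4, 2)
print(count_words_in_text(mixed))   # → (4, 1)
```

Because of the second case, the "approximate total" can slightly overcount text where Latin terms are embedded in Chinese sentences without surrounding spaces.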


>**Five chunks are enough for a short story (e.g., 1,897 characters).** This gives (i) fast retrieval, since there are only 5 embeddings to search through, and (ii) context preservation, since each chunk holds ~400 characters, enough for a complete scene.
>
>**Use 10-20+ chunks when** (i) the document is 10,000+ characters (a long article or book); (ii) queries are specific and factual, requiring granular retrieval; or (iii) the document covers multiple topics.
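
As a rough check on these rules of thumb, the number of chunks a fixed-size splitter produces can be estimated from the document length, chunk size, and overlap. This is a back-of-the-envelope sketch: `estimate_chunk_count` is a hypothetical helper that models a plain sliding window, and a separator-aware splitter will usually produce somewhat more chunks because it cuts at paragraph and sentence boundaries before a chunk is full.

```python
import math

def estimate_chunk_count(doc_chars: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Estimate chunks for a window of chunk_size that advances by (chunk_size - chunk_overlap)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_chars <= chunk_size:
        return 1 if doc_chars > 0 else 0
    stride = chunk_size - chunk_overlap
    # Smallest n such that (n - 1) * stride + chunk_size >= doc_chars
    return math.ceil((doc_chars - chunk_overlap) / stride)

print(estimate_chunk_count(1897, 400))       # → 5
print(estimate_chunk_count(8085, 800, 150))  # → 13
```

The second call uses the character count reported for `cn1.md` with a chunk size of 800 and an overlap of 150; the estimate is best read as a floor for what a boundary-respecting splitter returns.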

# Splitting and Chunking Strategy

In [2]:
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_into_chunks(
    doc_file: str, 
    chunk_size: int = 800, 
    chunk_overlap: int = 150
) -> List[str]:
    with open(doc_file, 'r', encoding='utf-8') as file:
        content = file.read()
    
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
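        # Chinese text uses full-width punctuation, so list those variants
        # alongside the ASCII ones
        separators=["\n## ", "\n# ", "\n\n", "\n", "。", "！", "？", "；", "!", "?", ";", " ", ""]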
    )
    return text_splitter.split_text(content)

chunks = split_into_chunks("cn1.md")
print(f"Created {len(chunks)} chunks\n")


Created 15 chunks
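
Before indexing, it can help to look at the chunk length distribution, for example to spot very short chunks that carry little context. A minimal sketch, assuming the input is a list of strings like the one returned by `split_into_chunks`; the helper name `chunk_length_stats` and the sample list are illustrative only.

```python
from statistics import mean
from typing import Dict, List

def chunk_length_stats(chunks: List[str]) -> Dict[str, float]:
    """Summarize chunk lengths in characters."""
    if not chunks:
        raise ValueError("chunks is empty")
    lengths = [len(c) for c in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }

# Stand-in data; in the notebook this would be the `chunks` list
sample_chunks = ["第一章", "哈利举起魔杖。", "甘道夫微微一笑，语气温和。"]
stats = chunk_length_stats(sample_chunks)
print(stats["count"], stats["min"], stats["max"])  # → 3 3 13
```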



# Indexing and Storage
Embedding model options:
- `shibing624/text2vec-base-chinese` (good for Chinese; also used in `rag0`)
- `BAAI/bge-base-zh-v1.5` (better Chinese performance)
- `moka-ai/m3e-base` (bilingual Chinese-English)

In [3]:
import chromadb
from sentence_transformers import SentenceTransformer
from typing import List

# ======================================== 
# Initialize models
# ========================================
embedding_model = SentenceTransformer("shibing624/text2vec-base-chinese")


# ======================================== 
# Create ChromaDB with correct settings
# ========================================
def create_db():
    client = chromadb.PersistentClient(path="./chroma_db")

    # If you want to GUARANTEE cosine space, delete+recreate:
    try:
        client.delete_collection(name="default")
        print("Deleted old collection")
    except Exception as e:
        # ok if it doesn't exist; still good to know unexpected errors
        print(f"(delete_collection) {e}")

    collection = client.create_collection(
        name="default",
        metadata={"hnsw:space": "cosine"}
    )

    print(f"Created collection with metadata: {collection.metadata}")
    return collection

chromadb_collection = create_db()


# ======================================== 
# Embed
# ========================================
def embed_chunk(chunk: str) -> List[float]:
    emb = embedding_model.encode(chunk)  # numpy array
    return emb.tolist()


# ======================================== 
# Store with metadata
# ========================================
def save_embeddings(
    collection,
    chunks: List[str],
    embeddings: List[List[float]],
    source_file: str,
) -> None:
    if not chunks:
        raise ValueError("chunks is empty")

    if len(chunks) != len(embeddings):
        raise ValueError(f"chunks ({len(chunks)}) and embeddings ({len(embeddings)}) length mismatch")
    
    if not source_file:
        raise ValueError("source_file is empty")

    # Use stable IDs that won't collide across different files
    ids = [f"{source_file}:{i}" for i in range(len(chunks))]

    metadatas = [
        {
            "chunk_id": i,
            "source": source_file,
            "chunk_length": len(chunk),
            "chunk_index": i,
        }
        for i, chunk in enumerate(chunks)
    ]

    # Use upsert so reruns don't explode on duplicate ids
    collection.upsert(
        documents=chunks,
        embeddings=embeddings,
        ids=ids,
        metadatas=metadatas,
    )

    print(f"Saved {len(chunks)} chunks to ChromaDB")

# embeddings
embeddings = [embed_chunk(c) for c in chunks]
print(f"Generated {len(embeddings)} embeddings")
print(f"Embedding dimension: {len(embeddings[0])}")

save_embeddings(chromadb_collection, chunks, embeddings, "cn1.md")





BertModel LOAD REPORT from: shibing624/text2vec-base-chinese
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Deleted old collection
Created collection with metadata: {'hnsw:space': 'cosine'}
Generated 15 embeddings
Embedding dimension: 768
Saved 15 chunks to ChromaDB
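
The cell above encodes one chunk per call. `SentenceTransformer.encode` also accepts a list of strings, so the chunks can be embedded in batches instead. The sketch below shows the batching pattern with a stand-in encoder so that it runs without downloading a model; `embed_in_batches` and `fake_encode` are hypothetical names, and in the notebook the `encode_fn` argument would be `embedding_model.encode`.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    encode_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    texts: List[str],
    batch_size: int = 32,
) -> List[List[float]]:
    """Encode texts in fixed-size batches and return plain Python lists."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # encode_fn returns one vector per input text (list rows or numpy rows)
        for row in encode_fn(batch):
            embeddings.append([float(x) for x in row])
    return embeddings

def fake_encode(batch: List[str]) -> List[List[float]]:
    # Stand-in for a real model: a 2-dimensional "embedding" per text
    return [[float(len(t)), 1.0] for t in batch]

vectors = embed_in_batches(fake_encode, ["哈利", "甘道夫", "佛罗多"], batch_size=2)
print(len(vectors), len(vectors[0]))  # → 3 2
```

With a real model, `encode` already batches internally through its own `batch_size` parameter, so an explicit loop like this mainly helps bound memory on very large inputs.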


# Retrieval

In [4]:
def retrieve(query: str, top_k: int = 5, score_threshold=None):
    # score_threshold is a maximum cosine *distance* (lower means closer);
    # None disables filtering.
    query_embedding = embed_chunk(query)
    
    results = chromadb_collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=['documents', 'distances', 'metadatas']
    )
    
    if not results['documents'][0]:
        print("No results found!")
        return []
    
    retrieved = []
    for i, (doc, dist, meta) in enumerate(zip(
        results['documents'][0],
        results['distances'][0],
        results['metadatas'][0]
    )):
        if score_threshold is None or dist <= score_threshold:
            retrieved.append({
                'document': doc,
                'distance': dist,
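                # Monotone score derived from the distance; this is not the
                # cosine similarity itself (that would be 1 - dist in cosine space)
                'similarity': 1 / (1 + dist),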
                'metadata': meta,
                'rank': i
            })
    
    print(f"Retrieved {len(retrieved)}/{top_k} chunks")
    return retrieved
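
With `hnsw:space` set to `cosine`, Chroma reports cosine distance, which is 1 minus cosine similarity and therefore lies in [0, 2]. The `similarity` field above uses `1 / (1 + dist)`, a different monotone mapping that preserves the ranking but is not the cosine similarity itself. The sketch below, with hypothetical helper names, compares the two on plain Python lists.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine distance as used for a cosine space: 1 - cosine similarity
    return 1.0 - cosine_similarity(a, b)

def inverse_distance_score(dist: float) -> float:
    # The mapping used in retrieve(): maps [0, 2] onto [1/3, 1]
    return 1.0 / (1.0 + dist)

dist = 0.422  # an example cosine distance
print(f"{1.0 - dist:.3f}")                    # → 0.578 (cosine similarity)
print(f"{inverse_distance_score(dist):.3f}")  # → 0.703 (score used in retrieve())
```

Either value works for ranking; the distinction matters only when the number is interpreted as a cosine similarity or compared against thresholds from other systems.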

## Retrieval Testing

In [5]:
# Test retrieval with a sample query
# (English: "What magic did Harry Potter use to defeat Sauron?")
query = "哈利波特用了什么魔法打败了索伦？"
results = retrieve(query, top_k=5, score_threshold=None)

def print_top_k_result(results):
    print("\n" + "="*70)
    print("Results for ", query)
    print("="*70)
    for i, r in enumerate(results):
        print(f"\nRank {i+1}:")
        print(f"  Distance: {r['distance']:.3f}")  # cosine distance, expected in [0.0, 2.0]
        print(f"  Similarity: {r['similarity']:.3f}")
        print(f"  Text: {r['document'][:80]}...")

print_top_k_result(results)

Retrieved 5/5 chunks

Results for  哈利波特用了什么魔法打败了索伦？

Rank 1:
  Distance: 0.422
  Similarity: 0.703
  Text: ## 第五章：魔法的融合

临战前夜，霍格沃茨的钟楼熄灯，取而代之的是城堡上空缓缓旋转的守护星光魔阵。夜色中笼罩着一股难以名状的寂静，仿佛整个世界都屏息等待。
...

Rank 2:
  Distance: 0.439
  Similarity: 0.695
  Text: # 魔戒与魔杖：两个世界的交汇

## 第一章：神秘的传送门

霍格沃茨的禁林，夜色正浓，月光从浓密树冠的缝隙中洒落，投下斑驳的银色光影。空气中弥漫着湿润的苔藓...

Rank 3:
  Distance: 0.457
  Similarity: 0.686
  Text: 火焰之眼骤然收缩、崩塌，在空中发出最后一声爆响后，化作万千光点飘散。半兽人军团仿佛失去灵魂的傀儡，纷纷倒地，化为虚无。

天边第一道金光破晓。

哈利缓缓放下魔...

Rank 4:
  Distance: 0.464
  Similarity: 0.683
  Text: “准备好了？”赫敏看着他。

“其实……没有。”罗恩苦笑一声，“但我们也没得选了。”

赫敏点头，挥杖落下最后一道光线。

整个魔法阵骤然亮起，泛出如水般的银光...

Rank 5:
  Distance: 0.477
  Similarity: 0.677
  Text: “还有我，”赫敏抬起头，“如果允许，我想继续研究裂缝魔法，也许这会是巫师历史新的开端。”

邓布利多微微一笑：“霍格沃茨的图书馆，将永远为你敞开。”

甘道夫从...
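
The five distances above span only 0.422 to 0.477, so the embedding search barely separates the candidates, which is one motivation for a reranking step. The sketch below applies a distance cutoff, like `score_threshold` in `retrieve`, to those reported values; the helper name `filter_by_distance` and the cutoff of 0.45 are illustrative assumptions, not tuned values.

```python
from typing import Dict, List

def filter_by_distance(results: List[Dict], max_distance: float) -> List[Dict]:
    """Keep results whose distance is at or below max_distance."""
    return [r for r in results if r["distance"] <= max_distance]

# Distances copied from the retrieval output above
reported = [0.422, 0.439, 0.457, 0.464, 0.477]
results = [{"rank": i, "distance": d} for i, d in enumerate(reported)]

kept = filter_by_distance(results, max_distance=0.45)
spread = max(reported) - min(reported)

print([r["rank"] for r in kept])  # → [0, 1]
print(f"{spread:.3f}")            # → 0.055
```

With a spread this small, a fixed cutoff is fragile: moving it slightly changes how many chunks survive.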


# Rerank

In [6]:
from sentence_transformers import CrossEncoder

def rerank(
    query: str,
    retrieved_results: List[dict],
    top_k: int
) -> List[dict]:
    # Note: the model is loaded on every call; hoist this to module level
    # if rerank() is called repeatedly.
    cross_encoder = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

    # Score each (query, chunk) pair jointly
    pairs = [(query, result['document']) for result in retrieved_results]
    scores = cross_encoder.predict(pairs)

    # Copy each result dict and attach its rerank score,
    # so the caller's list is not mutated or reordered
    scored = [
        {**result, 'rerank_score': float(score)}
        for result, score in zip(retrieved_results, scores)
    ]

    # Sort by rerank score, highest first
    scored.sort(key=lambda x: x['rerank_score'], reverse=True)

    return scored[:top_k]

# Usage
retrieved = retrieve(query, top_k=5)
reranked = rerank(query, retrieved, top_k=3)

for i, result in enumerate(reranked):
    print(f"[{i}] Rerank: {result['rerank_score']:.3f} | {result['document']}\n")

Retrieved 5/5 chunks



XLMRobertaForSequenceClassification LOAD REPORT from: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
Key                             | Status     |  | 
--------------------------------+------------+--+-
roberta.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


[0] Rerank: 2.668 | “准备好了？”赫敏看着他。

“其实……没有。”罗恩苦笑一声，“但我们也没得选了。”

赫敏点头，挥杖落下最后一道光线。

整个魔法阵骤然亮起，泛出如水般的银光，直冲天际。火焰之眼被吸引，缓缓偏转方向，似乎被那魔戒的气息诱惑。
甘道夫与邓布利多互看一眼，同时发动终极魔法。

“Fianto Duri！”——钢铁守卫之盾。

“Repello Inimicum！”——驱逐黑暗之敌。

两道金色光柱自地面升起，交汇成一股炽烈光矛，直接击中火焰之眼侧翼。

索伦发出痛苦咆哮，光明凤凰趁机直刺其中心。

三股力量——哈利的光明守护神、两位巫师的联合光柱，以及甘道夫法杖中的太阳之火——汇聚成一道三色神光，灼穿天际，刺入那燃烧的瞳孔之中。

黑暗开始崩塌。

火焰退却，夜色翻滚。

但索伦，尚未消亡。

他正凝聚最后的黑暗力量，准备发动反扑——一击，足以吞噬所有。

而这一刻，佛罗多缓缓举起至尊魔戒，做出了那个他一直不愿做出的选择……

[1] Rerank: 0.816 | # 魔戒与魔杖：两个世界的交汇

## 第一章：神秘的传送门

霍格沃茨的禁林，夜色正浓，月光从浓密树冠的缝隙中洒落，投下斑驳的银色光影。空气中弥漫着湿润的苔藓气息，偶尔传来远处夜行魔兽的低吟。十六岁的哈利·波特紧握着魔杖，悄然穿行在林中。他的任务是寻找一只独角兽的毛发，作为奇兽饲育课的实地作业。

“别走太远，”海格临走前叮嘱他，“那边有些地方连我都不熟。”

突然，一道耀眼的金色光芒从一棵古老橡树后爆发出来，仿佛空间本身被撕裂。哈利猛然转身，举起魔杖高声念道：“荧光闪烁！”（Lumos）

光芒之中，一个圆形漩涡缓缓旋转，宛如银河坠落凡间。等那道光慢慢散去，哈利惊讶地看到：面前出现了五个陌生人。

一位身穿灰袍、头戴宽边帽、留着长须的高大老人静静站着，手持一根雕刻繁复的木杖。他身旁，是四个个子矮小但神情坚定的小个子生物，他们的脚上长满毛发，身着粗布外套，神情警觉。

“你是谁？”哈利小心翼翼地问道。

那老人微微一笑，语气温和：“我是甘道夫，一名中土世界的灰袍巫师。这四位是霍比特人——佛罗多、山姆、皮平和梅里。我们本应追踪一位古老的黑暗势力——索伦，却意外被卷入了这个世界。”

哈利睁大眼睛，脑中一片混乱。他本以为霍格沃茨已经是个充满奇迹的地方，但眼前这些人的出
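
The rerank scores above are unbounded (2.668 exceeds 1), which suggests raw logits rather than probabilities. If a 0 to 1 score is more convenient, for example to apply a fixed cutoff, a logistic sigmoid is one common mapping. This is a sketch under that assumption, not something the notebook does; `sigmoid` and `keep_above` are hypothetical helpers, and the 0.8 cutoff is arbitrary.

```python
import math
from typing import Dict, List

def sigmoid(x: float) -> float:
    # Numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def keep_above(results: List[Dict], min_prob: float) -> List[Dict]:
    """Keep results whose sigmoid-mapped rerank score is at least min_prob."""
    return [r for r in results if sigmoid(r["rerank_score"]) >= min_prob]

# Scores copied from the rerank output above
reranked_scores = [{"rerank_score": 2.668}, {"rerank_score": 0.816}]

for r in reranked_scores:
    print(f"{r['rerank_score']:.3f} -> {sigmoid(r['rerank_score']):.3f}")

print(len(keep_above(reranked_scores, min_prob=0.8)))  # → 1
```

The mapping is monotone, so it does not change the ranking; it only makes thresholds easier to state.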

# LLM Generation