# <p style="text-align: center; color: #1E90FF; font-size: 30px;">DocuWiki Chatbot: Document Search and Location </p>

### Table of Contents
* [Import OpenAI, Test Connection & Embeddings](#1)
* [Crawling Wikipedia & Generate PDFs](#2)
* [Text Chunk Dividing & Embedding](#3)
* [Testing with Questions](#4)

In [50]:
import numpy as np
import faiss
from openai import OpenAI
import os
import requests
from urllib.parse import quote
import re
from PyPDF2 import PdfReader

<div style="background-color: #EAEAEA; padding: 0px; border-radius: 5px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <h1 style="text-align: center; color: #333333; font-size: 20px; font-weight: bold;" id=1>Import OpenAI, Test Connection & Embeddings</h1>
</div>

In [None]:
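# API key removed before publishing. Paste your own key here, or delete this
# line if OPENAI_API_KEY is already set in your environment.
os.environ["OPENAI_API_KEY"] = ""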

In [19]:
client = OpenAI()

# Test connection
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "If the connection is valid, say 'Hello, Li Feng' to me."}
    ]
)
print(response.choices[0].message.content)

Hello, Li Feng!


In [32]:
# Test embeddings
sample_text = "Chopin is a great composer."
embedding = client.embeddings.create(
    model = "text-embedding-3-small",
    input = sample_text
)

print("Embedding Length:", len(embedding.data[0].embedding),
      "\nEmbedding snippet:", embedding.data[0].embedding[:10]
)

Embedding Length: 1536 
Embedding snippet: [-0.0022081956267356873, -0.03175782039761543, -0.05779833719134331, -0.03731878846883774, 0.03745278716087341, -0.029859496280550957, -0.01409225258976221, -0.0009505570633336902, -0.011015853844583035, -0.041271764785051346]


<div style="background-color: #EAEAEA; padding: 0px; border-radius: 5px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <h1 style="text-align: center; color: #333333; font-size: 20px; font-weight: bold;" id=2>Crawling Wikipedia & Generate PDFs</h1>
</div>

In [42]:
titles = [
    "High-intensity interval training",
    "Sergei Rachmaninoff"
]

def fetch_pdf_bytes(title):
    """Download the ready-made PDF from Wikipedia REST."""
    safe_title = quote(title, safe="")
    url = f"https://en.wikipedia.org/api/rest_v1/page/pdf/{safe_title}"
    # A timeout keeps the notebook from hanging if the server does not respond
    resp = requests.get(url, headers={'User-Agent': 'MyWikiBot/1.0'}, timeout=30)
    resp.raise_for_status()
    return resp.content


def save_pdf_from_bytes(pdf_bytes, output_path):
    with open(output_path, 'wb') as f:
        f.write(pdf_bytes)

def sanitize_filename(title):
    return re.sub(r"[^\w\-_\.]", "_", title)

if __name__ == "__main__":
    os.makedirs("output", exist_ok=True)

    for title in titles:
        try:
            pdf = fetch_pdf_bytes(title)
        except requests.HTTPError as e:
            if e.response.status_code == 404:
                print(f"No PDF available for {title!r}, skipping.")
                continue
            else:
                raise
        fn = sanitize_filename(title) + ".pdf"
        save_pdf_from_bytes(pdf, os.path.join("output", fn))


def extract_text_from_pdf(path):
    reader = PdfReader(path)
    text = ""
    for page in reader.pages:
        page_text = page.extract_text()
        if page_text:
            text += page_text
    return text

base_dir = os.getcwd()
pdf_folder = os.path.join(base_dir, "output")
pdf_texts = {}

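for filename in sorted(os.listdir(pdf_folder)):
    # Skip anything in the folder that is not a PDF (e.g. hidden system files)
    if not filename.lower().endswith(".pdf"):
        continue
    path = os.path.join(pdf_folder, filename)
    text = extract_text_from_pdf(path)
    pdf_texts[filename] = text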
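
To make the naming scheme in the cell above concrete, the sketch below shows how a page title maps to a REST URL and to a local filename. It reuses the same `quote` call and regular expression as the notebook and downloads nothing. The helper name `build_pdf_url` is hypothetical and only mirrors the URL construction inside `fetch_pdf_bytes`.

```python
import re
from urllib.parse import quote

def build_pdf_url(title):
    # Hypothetical helper: same URL construction as fetch_pdf_bytes, without the request.
    return "https://en.wikipedia.org/api/rest_v1/page/pdf/" + quote(title, safe="")

def sanitize_filename(title):
    # Same pattern as the notebook: any character that is not a word
    # character, '-', '_' or '.' is replaced by '_'.
    return re.sub(r"[^\w\-_\.]", "_", title)

title = "High-intensity interval training"
url = build_pdf_url(title)
filename = sanitize_filename(title) + ".pdf"
print(url)
print(filename)
```

Passing `safe=""` to `quote` also percent-encodes `/`, so a title containing a slash stays a single path segment in the URL.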

<div style="background-color: #EAEAEA; padding: 0px; border-radius: 5px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <h1 style="text-align: center; color: #333333; font-size: 20px; font-weight: bold;" id=3>Text Chunk Dividing & Embedding</h1>
</div>

In [43]:
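def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i+chunk_size])
    return chunks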

all_chunks = []
for filename, text in pdf_texts.items():
    chunks = chunk_text(text)
    for chunk in chunks:
        all_chunks.append((filename, chunk))

print("Total chunks created:", len(all_chunks))
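# all_chunks[0] is a (filename, text) tuple, so [:200] slices the tuple and prints both
# elements in full; use all_chunks[0][1][:200] to preview only the first 200 characters.
print("Example chunk:", all_chunks[0][:200])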

Total chunks created: 166
Example chunk: ('High-intensity_interval_training.pdf', 'HIIT with dumbbells\nHigh-intensity interval training\nHigh-intensity interval training  (HIIT ) is a training\nprotocol alternating short periods of intense or explosive\nanaerobic exercise  with brief recovery periods until the point\nof exhaustion.[1] HIIT  involves exercises performed in\nrepeated quick bursts at maximum or near maximal effort\nwith periods of rest or low activity between bouts. The very\nhigh level of intensity , the interval duration, and number of\nbouts distinguish it from aerobic  (cardiovascular) activity ,\nbecause the body significantly recruits anaerobic energy\nsystems (although not completely to the exclusion of aerobic\npathways).[1] The method thereby relies on "the anaerobic\nenergy releasing system almost maximally".[1]\nAlthough there are varying forms of HIIT -style workouts\nwhich may involve exercises associated with both\ncardiovascular activity and also resistanc
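
The sliding window is easier to see on a short string. The sketch below applies the same slicing logic to the alphabet with a window of 10 and an overlap of 3, so each chunk starts 7 characters after the previous one. The name `sliding_chunks` is a hypothetical stand-in, chosen so that it does not overwrite `chunk_text`.

```python
def sliding_chunks(text, chunk_size, overlap):
    # Same logic as chunk_text: the window advances by chunk_size - overlap.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

toy = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
toy_chunks = sliding_chunks(toy, chunk_size=10, overlap=3)
for n, chunk in enumerate(toy_chunks):
    print(n, repr(chunk))
```

Every chunk begins with the last three characters of the one before it, and the final chunk is shorter because the text runs out. With the notebook defaults (1000 and 100) the window advances 900 characters at a time.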

In [45]:
# Chunk embedding
embeddings = []
metadatas = []

for filename, chunk in all_chunks:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunk
    )
    emb = resp.data[0].embedding
    embeddings.append(emb)
    metadatas.append({"source": filename, "text": chunk})

embeddings = np.array(embeddings).astype("float32")

dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)

print("Index built with", index.ntotal, "vectors")

Index built with 166 vectors
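
The loop above sends one request per chunk. The embeddings endpoint also accepts a list of strings as `input`, so the chunks could be grouped to reduce the number of requests. The sketch below shows only the grouping step; the helper `batched` and the batch size of 64 are assumptions, and the API call is left as a comment so the block runs without a key.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Stand-in for the 166 (filename, chunk) pairs built earlier.
demo_chunks = [("doc.pdf", f"chunk {n}") for n in range(166)]

batch_sizes = []
for batch in batched(demo_chunks, 64):
    texts = [chunk for _, chunk in batch]
    # With a configured client, each batch could be sent in one request, e.g.:
    # resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    # vectors = [d.embedding for d in sorted(resp.data, key=lambda d: d.index)]
    batch_sizes.append(len(texts))

print(batch_sizes)
```

Sorting the returned items by their `index` field keeps the vectors aligned with `metadatas`.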


<div style="background-color: #EAEAEA; padding: 0px; border-radius: 5px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <h1 style="text-align: center; color: #333333; font-size: 20px; font-weight: bold;" id=4>Testing with Questions</h1>
</div>

In [49]:
def answer_question(query, k=2):
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    q_emb = np.array([resp.data[0].embedding]).astype("float32")

    D, I = index.search(q_emb, k)

    context_chunks = []
    sources = []
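    for idx in I[0]:
        if idx < 0:
            # FAISS pads the result with -1 when fewer than k vectors are available
            continue
        meta = metadatas[idx]
        context_chunks.append(meta["text"])
        # chunk_index is the position in the combined all_chunks list
        sources.append((meta["source"], int(idx)))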

    context = "\n\n".join(context_chunks)

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context. If not in context, say 'Not found in the documents.'"},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ]
    )

    answer = response.choices[0].message.content.strip()

    result = {
        "answer": answer,
        "sources": [{"document": s[0], "chunk_index": s[1]} for s in sources]
    }
    return result

query = "In which year was Sergei Rachmaninoff born?"
result = answer_question(query, k=2)

print("Answer:", result["answer"])
for s in result["sources"]:
    print(f"Source: {s['document']} | Location: chunk {s['chunk_index']}")

Answer: Sergei Rachmaninoff was born in the year 1873.
Source: Sergei_Rachmaninoff.pdf | Location: chunk 50
Source: Sergei_Rachmaninoff.pdf | Location: chunk 52
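
The retrieval step inside `answer_question` relies on `IndexFlatL2`, which compares the query against every stored vector and ranks the results by squared Euclidean distance. The dependency-free sketch below reproduces that idea on toy 2-D vectors. `l2_search` is a hypothetical illustration, not part of the FAISS API.

```python
def squared_l2(a, b):
    # Sum of squared coordinate differences between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
distances, indices = l2_search(stored, [1.0, 1.0], k=2)
print(distances, indices)
```

The returned indices play the same role as `I[0]` in the notebook: they are positions in the list of stored vectors, which is why they can be used to look up `metadatas`.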


In [53]:
query = "Can you describe one study related to HIIT? The answer should be no more than 50 words."
result = answer_question(query, k=2)

print("Answer:", result["answer"])
for s in result["sources"]:
    print(f"Source: {s['document']} | Location: chunk {s['chunk_index']}")

Answer: A 2019 clinical trial found that HIIT improved vascular health in inactive individuals, demonstrating decreases in arterial wall thickness and improved endothelial function. It was noted that HIIT had stronger effects compared to medium intensity continuous training in enhancing blood flow and vasodilation.
Source: High-intensity_interval_training.pdf | Location: chunk 18
Source: High-intensity_interval_training.pdf | Location: chunk 17
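
The chunk number reported above is a position in the combined `all_chunks` list, not a position inside the source document. If a per-document location is more useful, it can be derived from the number of chunks each document produced and the chunking parameters. The sketch below assumes the default `chunk_size=1000` and `overlap=100`. The names `locate_chunk` and `demo_counts` are hypothetical, and the counts are illustrative rather than taken from the run above.

```python
def locate_chunk(global_index, chunk_counts, chunk_size=1000, overlap=100):
    """Map an index into the combined chunk list to a per-document location.

    chunk_counts: list of (filename, number_of_chunks) pairs, in the order
    in which the documents were chunked.
    """
    if global_index < 0:
        raise IndexError("global_index must be non-negative")
    step = chunk_size - overlap
    offset = 0
    for filename, count in chunk_counts:
        if global_index < offset + count:
            local = global_index - offset
            return {
                "document": filename,
                "local_chunk": local,
                "char_start": local * step,
                # Upper bound: the last chunk of a document may end earlier.
                "char_end": local * step + chunk_size,
            }
        offset += count
    raise IndexError("global_index is outside the combined chunk list")

# Illustrative counts only.
demo_counts = [("a.pdf", 3), ("b.pdf", 5)]
loc = locate_chunk(4, demo_counts)
print(loc)
```

The character offsets refer to the text returned by `extract_text_from_pdf`, so they locate a passage in the extracted text rather than on a specific PDF page.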
