<a href="https://colab.research.google.com/github/jamesemansfield2/company_policies_generative_ai_rag_gitlab/blob/main/Copy_of_Company_Policies_GenAI_RAG_202509v2_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Company Policy Assistant, Generative AI RAG

This notebook demonstrates building a policy assistant to help employees efficiently search for company policies. The goal is to provide quick and relevant answers to employee questions based on indexed company policies.

The assistant is designed to act as a strict company policy assistant for GitLab, adhering to the following rules:
- Answer ONLY from the provided context.
- Quote exact lines where possible.
- Include policy title and effective date.
- If the information is not found, state "Not found in the indexed policies" and suggest who to ask.
- Keep answers concise, using bullets when useful.

Sample questions used to test the assistant include:
- "What is the acceptable use policy for company systems?"
- "How are conflicts of interest handled?"
- "What is the code of conduct reporting path?"

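This cell sets the `GEMINI_API_KEY` environment variable, which is required to authenticate with the Gemini API. Replace the placeholder with your own key before running: `%env` stores everything after `=` verbatim, so the unedited line sets the variable to the placeholder text itself.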

In [None]:
%env GEMINI_API_KEY = # [INSERT GOOGLE GEMINI API KEY]

env: GEMINI_API_KEY=# [INSERT GOOGLE GEMINI API KEY]
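
As the output above shows, an unedited placeholder ends up as the key value. A minimal sketch of a guard that fails early in that case follows; `require_api_key` and `PLACEHOLDER_MARKERS` are hypothetical names that are not part of the notebook, and the marker strings are assumptions based on the placeholders used here.

```python
import os

# Hypothetical placeholder markers; adjust to whatever template text you use.
PLACEHOLDER_MARKERS = ("INSERT", "YOUR_API_KEY_HERE")

def require_api_key(env=None, name="GEMINI_API_KEY"):
    """Return the key from `env`, or raise if it is missing or still a placeholder."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    looks_like_placeholder = value.startswith("#") or any(m in value for m in PLACEHOLDER_MARKERS)
    if not value or looks_like_placeholder:
        raise RuntimeError(f"{name} is not set to a real key")
    return value
```

Calling `require_api_key()` before creating the client would surface a missing key as a clear error rather than as an authentication failure later on.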


This cell installs the necessary libraries and packages for the policy assistant, including `torch`, `sentence-transformers`, `faiss-cpu`, `rank-bm25`, and `google-genai`.

In [None]:
# ==== Clean install (Colab / Py3.12) — manual restart ====

# Quiet the pydevd "frozen modules" warning (harmless, but noisy)
import os
os.environ["PYDEVD_DISABLE_FILE_VALIDATION"] = "1"

# Keep pip tooling stable; pin setuptools <81 to avoid pkg_resources deprecation spam
!pip -q install -U pip "setuptools<81" wheel

# Pin a Colab-friendly, NumPy-2-safe stack that avoids resolver thrash
!pip -q install --upgrade --force-reinstall --no-cache-dir --prefer-binary \
  "numpy==2.0.2" \
  "scipy==1.15.1" \
  "pandas==2.2.2" \
  "requests==2.32.3" \
  "google-auth==2.38.0" \
  "packaging==24.2" \
  "cryptography==43.0.3" \
  "fsspec==2025.3.2" \
  "scikit-learn==1.6.1" \
  "jedi>=0.19.1" \
  "faiss-cpu==1.8.0.post1" \
  "torch==2.4.1" "torchvision==0.19.1" "torchaudio==2.4.1" \
  "sentence-transformers>=2.7.0" \
  "rank-bm25==0.2.2" \
  "beautifulsoup4>=4.12.3" "lxml>=5.2.2" "pdfminer.six>=20231228" \
  "python-docx>=1.1.2" "tldextract>=5.1.2" "google-genai"

print("✅ Install done. Now: Runtime → Restart runtime (once), then run the next cell.")


ERROR: Cannot install faiss-cpu==1.8.0.post1, numpy==2.0.2, pandas==2.2.2, scikit-learn==1.6.1 and scipy==1.15.1 because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
✅ Install done. Now: Runtime → Restart runtime (once), then run the next cell.


In [None]:
import numpy as np, pandas as pd, scipy, torch
print("numpy", np.__version__, "| pandas", pd.__version__, "| scipy", scipy.__version__, "| torch", torch.__version__)
import faiss
print("faiss", getattr(faiss, "__version__", "n/a"))


numpy 2.0.2 | pandas 2.2.2 | scipy 1.15.1 | torch 2.4.1+cu121
faiss 1.12.0


This cell imports the required libraries and sets up the Gemini API client. It also defines the `MODEL_ID` to be used.

In [None]:
import os
from google import genai
from google.genai import types

# Put your key here or via Colab "Secrets":  Settings ▶️ Variables ▶️ GEMINI_API_KEY
os.environ.setdefault("GEMINI_API_KEY", "YOUR_API_KEY_HERE")

# Developer API (AI Studio) client. For Vertex AI, see the commented block below.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL_ID = "gemini-2.5-flash"

# --- If you prefer Vertex AI instead of the Developer API, use this:
# os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "true"
# os.environ["GOOGLE_CLOUD_PROJECT"] = "your-project-id"
# os.environ["GOOGLE_CLOUD_LOCATION"] = "us-central1"
# client = genai.Client()  # uses env above
# MODEL_ID = "gemini-2.5-flash"


This cell defines the crawling process to collect policy documents from specified URLs within allowed domains and paths. It saves the raw HTML content and returns a list of crawled page URLs.

The RAG (Retrieval-Augmented Generation) functionality is not a single function; it is spread across several cells and functions:

- Data loading and indexing: the cell with ID `B5F9xCXArVsN` processes the crawled documents, extracts text, splits it into chunks, and builds the searchable index using FAISS (vector similarity) and BM25 (keyword matching). This prepares the "Retrieval" part of RAG.
- Hybrid search: the `hybrid_search` function in the cell with ID `bJwHTfHkrXWW` combines the vector-search and BM25 results to retrieve the document chunks most relevant to the user's query. This is the "Retrieval" step itself.
- Context creation and answer generation: the `_context` and `gemini_generate` functions, also in the cell with ID `bJwHTfHkrXWW`, format the retrieved chunks as context and call the Gemini model to generate an answer from that context. This is the "Augmented Generation" part of RAG.

In [None]:
import os, re, tldextract, requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
from collections import deque

DATA_DIR = "/content/policy_data"
RAW_DIR = f"{DATA_DIR}/raw"
os.makedirs(RAW_DIR, exist_ok=True)

COMPANY = "GitLab"
ALLOW_DOMAINS = {"about.gitlab.com"}
PATH_ALLOW_PREFIXES = ("/handbook", "/support", "/community", "/company/policies")
SEED_URLS = [
    "https://about.gitlab.com/handbook/",
    "https://about.gitlab.com/support/general-policies/",
    "https://about.gitlab.com/community/contribute/code-of-conduct/",
]
MAX_PAGES = 200
TIMEOUT = 15
HEADERS = {"User-Agent": "policy-rag-colab/1.0"}

def sha1(s: str) -> str:
    import hashlib
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:12]

def allowed(url:str)->bool:
    p = urlparse(url)
    if p.scheme not in ("http", "https"):
        return False
    d = tldextract.extract(url)
    domain = ".".join(part for part in [d.domain, d.suffix] if part)
    host = f"{(d.subdomain+'.') if d.subdomain else ''}{domain}"
    if host not in ALLOW_DOMAINS:
        return False
    return any(urlparse(url).path.startswith(pref) for pref in PATH_ALLOW_PREFIXES)

def canonicalize(url:str, base:str)->str:
    url = urljoin(base, url)
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}{p.path}"  # strip query/fragment

def crawl(seed_urls, max_pages=200):
    seen, pages = set(), []
    q = deque(seed_urls)
    while q and len(pages) < max_pages:
        url = canonicalize(q.popleft(), "")
        if url in seen or not allowed(url):
            continue
        seen.add(url)
        try:
            r = requests.get(url, headers=HEADERS, timeout=TIMEOUT)
            if r.status_code != 200 or "text/html" not in r.headers.get("Content-Type",""):
                continue
            html = r.text
            out = os.path.join(RAW_DIR, f"{sha1(url)}.html")
            with open(out, "w", encoding="utf-8") as f: f.write(html)
            pages.append(url)
            soup = BeautifulSoup(html, "lxml")
            for a in soup.find_all("a", href=True):
                nxt = canonicalize(a["href"], url)
                if nxt not in seen and allowed(nxt):
                    q.append(nxt)
        except Exception:
            pass
    return pages

pages = crawl(SEED_URLS, MAX_PAGES)
len(pages)


54
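
The scoping rules used by the crawler can be tried on a few links without any network access. The sketch below is a simplified stand-in: it compares `urlparse(...).hostname` against the allow-list rather than using `tldextract` as the notebook does, and `normalize` and `in_scope` are illustrative names. The host and path prefixes mirror the crawl cell, and the example links are made up.

```python
from urllib.parse import urljoin, urlparse

# Values mirror the crawl cell; the helper names are illustrative.
ALLOW_HOSTS = {"about.gitlab.com"}
ALLOW_PREFIXES = ("/handbook", "/support", "/community", "/company/policies")

def normalize(href, base):
    """Resolve href against base and drop the query string and fragment."""
    p = urlparse(urljoin(base, href))
    return f"{p.scheme}://{p.netloc}{p.path}"

def in_scope(url):
    """True when the URL is http(s), on an allowed host, and under an allowed path."""
    p = urlparse(url)
    return (
        p.scheme in ("http", "https")
        and p.hostname in ALLOW_HOSTS
        and p.path.startswith(ALLOW_PREFIXES)
    )

base = "https://about.gitlab.com/handbook/"
for href in ["../support/?tab=1#top", "/pricing/", "https://docs.gitlab.com/handbook/"]:
    url = normalize(href, base)
    print(url, in_scope(url))
```

Only the first link stays in scope: the second fails the path-prefix check and the third is on a host outside the allow-list.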

This cell processes the raw HTML files, extracts text content, splits the content into chunks, and builds a searchable index using FAISS for vector similarity search and BM25 for keyword matching.

In [None]:
import os, re, pickle, json, numpy as np, pandas as pd, faiss
from bs4 import BeautifulSoup
from pdfminer.high_level import extract_text as pdf_extract_text
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi

INDEX_DIR = f"{DATA_DIR}/index"
os.makedirs(INDEX_DIR, exist_ok=True)

def read_file_text(path: str) -> str:
    ext = path.lower().split('.')[-1]
    if ext in ("txt","md"):
        return open(path, "r", encoding="utf-8", errors="ignore").read()
    if ext in ("html","htm"):
        soup = BeautifulSoup(open(path, "r", encoding="utf-8", errors="ignore").read(), "lxml")
        for tag in soup(["script","style","noscript"]): tag.decompose()
        for i in range(1,7):
            for h in soup.find_all(f"h{i}"):
                h.insert_before(f"\n{'#'*i} {h.get_text(strip=True)}\n")
        return soup.get_text("\n", strip=True)
    if ext == "pdf":
        return pdf_extract_text(path) or ""
    if ext == "docx":
        import docx
        d = docx.Document(path)
        return "\n".join(p.text for p in d.paragraphs)
    raise ValueError(f"Unsupported file: {path}")

_heading_re = re.compile(r'^(#{1,6})\s+(.+)$', re.MULTILINE)
def _split_by_words(block, target=350, overlap=40):
    w = block.split()
    if len(w) <= target + 100: return [block.strip()]
    out, step = [], target - overlap
    for i in range(0, len(w), step):
        piece = " ".join(w[i:i+target]).strip()
        if piece: out.append(piece)
        if i + target >= len(w): break
    return out

def split_by_headings(text, target=350, overlap=40):
    text = re.sub(r'\r', '', text)
    hs = list(_heading_re.finditer(text))
    parts = []
    if hs:
        for i, m in enumerate(hs):
            s, e = m.start(), hs[i+1].start() if i+1<len(hs) else len(text)
            parts.extend(_split_by_words(text[s:e].strip(), target, overlap))
    else:
        parts = _split_by_words(text, target, overlap)
    return [c for c in (p.strip() for p in parts) if len(c.split())>25]

def discover_files(root):
    exts = (".pdf",".docx",".html",".htm",".md",".txt")
    acc = []
    for r,_,fs in os.walk(root):
        for f in fs:
            if f.lower().endswith(exts): acc.append(os.path.join(r,f))
    return acc

paths = discover_files(RAW_DIR)
rows, chunks = [], []
def sha1(s: str):
    import hashlib
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:12]

for path in paths:
    try:
        raw = read_file_text(path)
    except Exception:
        continue
    title = re.sub(r'[_-]+',' ', os.path.splitext(os.path.basename(path))[0]).title()
    doc_id = sha1(path)
    eff = None
    for rx in [re.compile(r"Effective Date[:\s]+([A-Za-z]{3,9}\s+\d{1,2},\s*\d{4})", re.I),
               re.compile(r"Last Updated[:\s]+([A-Za-z]{3,9}\s+\d{1,2},\s*\d{4})", re.I)]:
        m = rx.search(raw)
        if m: eff = m.group(1); break
    parts = split_by_headings(raw)
    for j, part in enumerate(parts):
        chunks.append({
            "chunk_id": f"{doc_id}-{j:04d}",
            "doc_id": doc_id,
            "title": title,
            "path": path,
            "effective_date": eff,
            "text": part
        })
    rows.append({"doc_id": doc_id, "title": title, "path": path, "effective_date": eff, "n_chunks": len(parts)})

pd.DataFrame(rows).to_csv(f"{INDEX_DIR}/catalog.csv", index=False)
with open(f"{INDEX_DIR}/chunks.pkl","wb") as f: pickle.dump({c["chunk_id"]: c for c in chunks}, f)

# Embeddings + FAISS
texts = [c["text"] for c in chunks]
print(f"Embedding {len(texts)} chunks…")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vecs = model.encode(texts, batch_size=64, show_progress_bar=True, normalize_embeddings=True)
vecs = np.array(vecs, dtype="float32")
fa = faiss.IndexFlatIP(vecs.shape[1]); fa.add(vecs)
faiss.write_index(fa, f"{INDEX_DIR}/faiss.index")

# BM25
tok = [t.lower().split() for t in texts]
bm25 = BM25Okapi(tok)
with open(f"{INDEX_DIR}/bm25.pkl","wb") as f: pickle.dump({"bm25": bm25, "tokenized": tok}, f)
with open(f"{INDEX_DIR}/positions.json","w") as f: json.dump({"chunk_ids":[c["chunk_id"] for c in chunks]}, f)

print("✅ Index built.")


Embedding 667 chunks…


✅ Index built.
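
The overlapping word windows produced by `_split_by_words` are easier to see on a tiny input. The sketch below is a simplified variant: it drops the notebook's `target + 100` slack, so short inputs are split as soon as they exceed `target`. `split_words` is an illustrative name and the demo text is synthetic.

```python
def split_words(text, target=350, overlap=40):
    """Split text into windows of `target` words that overlap by `overlap` words."""
    words = text.split()
    if len(words) <= target:
        return [" ".join(words)] if words else []
    step = target - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target]))
        if start + target >= len(words):
            break  # the last window already reached the end of the text
    return chunks

demo = " ".join(f"w{i}" for i in range(10))
for chunk in split_words(demo, target=4, overlap=1):
    print(chunk)
```

With `target=4` and `overlap=1`, each window starts on the last word of the previous one, which is the behaviour the default `overlap=40` provides at full scale.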


This cell contains the core functions for the policy assistant:
- `load_index`: Loads the created index components.
- `embed_query`: Embeds the user's query into a vector.
- `hybrid_search`: Performs a combined vector and keyword search to retrieve relevant policy chunks.
- `_cite`: Formats citations for the retrieved policy chunks.
- `_context`: Creates a combined context from the retrieved chunks for the language model.
- `gemini_generate`: Calls the Gemini API to generate an answer from the provided context and system prompt; accepts `temperature` and `top_p` sampling parameters.
- `answer`: Orchestrates the search, context creation, and answer generation, passing `temperature` and `top_p` through to `gemini_generate`.

In [None]:
import json, numpy as np, pickle, faiss
from google.genai import types # Import types for top_p

def load_index():
    with open(f"{INDEX_DIR}/chunks.pkl","rb") as f: chunk_map = pickle.load(f)
    with open(f"{INDEX_DIR}/bm25.pkl","rb") as f: bm25 = pickle.load(f)["bm25"]
    with open(f"{INDEX_DIR}/positions.json","r") as f: chunk_ids = json.load(f)["chunk_ids"]
    fa = faiss.read_index(f"{INDEX_DIR}/faiss.index")
    return chunk_map, bm25, chunk_ids, fa

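from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def _embedder() -> SentenceTransformer:
    # Load the embedding model once and reuse it across queries
    return SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_query(q: str) -> np.ndarray:
    v = _embedder().encode([q], normalize_embeddings=True)
    return np.array(v, dtype="float32")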

def hybrid_search(question: str, top_k: int = 8):
    chunk_map, bm25, chunk_ids, fa = load_index()
    qv = embed_query(question)
    D, I = fa.search(qv, min(50, len(chunk_ids)))
    vec_scores, vec_idx = D[0], I[0]
    bm25_scores = bm25.get_scores(question.lower().split())

    def norm(x):
        x = np.array(x)
        if x.size==0: return x
        mn, mx = x.min(), x.max()
        return np.zeros_like(x) if mx-mn<1e-9 else (x-mn)/(mx-mn)

    n_bm25 = norm(bm25_scores)
    n_vec = np.zeros_like(n_bm25); n_vec[vec_idx] = norm(vec_scores)
    fused = 0.55*n_vec + 0.45*n_bm25
    top_idx = np.argsort(-fused)[:top_k]
    return [(chunk_ids[i], float(fused[i])) for i in top_idx]

def _cite(meta):
    bits = [meta.get("title") or "Untitled"]
    if meta.get("effective_date"): bits.append(f"(Effective: {meta['effective_date']})")
    bits.append(f"[{os.path.basename(meta.get('path',''))}]")
    return " ".join(bits)

def _context(chunks, max_chars=6000):
    ctx, used = [], 0
    for c in chunks:
        header = f"### {_cite(c)}\n"
        piece = header + c["text"].strip() + "\n\n"
        if used + len(piece) > max_chars and ctx: break
        ctx.append(piece); used += len(piece)
    return "".join(ctx)

def gemini_generate(system_prompt: str, user_prompt: str, temperature: float = 0.2, top_p: float = 0.95) -> str:
    # Pass the system prompt as system_instruction so the assistant rules reach the model
    cfg = types.GenerateContentConfig(
        system_instruction=system_prompt,
        temperature=temperature,
        top_p=top_p,
    )
    resp = client.models.generate_content(
        model=MODEL_ID,
        contents=[user_prompt],
        config=cfg,
    )
    # Robust text extraction across SDK responses
    if getattr(resp, "text", None):
        return resp.text
    try:
        return "".join(
            part.text for cand in resp.candidates for part in cand.content.parts if getattr(part, "text", None)
        )
    except Exception:
        return ""

def answer(question: str, top_k: int = 6, temperature: float = 0.2, top_p: float = 0.95):
    chunk_map, _, _, _ = load_index()
    hits = hybrid_search(question, top_k=top_k)
    ranked = [chunk_map[cid] for cid,_ in hits]
    context = _context(ranked)

    sys_prompt = (
        "You are a strict company-policy assistant for GitLab. "
        "Rules: Answer ONLY from the provided context; quote exact lines where possible; "
        "include policy title and effective date; if not found, say 'Not found in the indexed policies' "
        "and suggest who to ask. Keep answers concise with bullets when useful."
    )
    user_prompt = f"Question: {question}\n\nContext:\n{context}"
    out = gemini_generate(sys_prompt, user_prompt, temperature=temperature, top_p=top_p) or "No answer."
    cites = [_cite(c) for c in ranked[:min(5, len(ranked))]]
    return {"answer": out, "citations": cites}
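
The score fusion inside `hybrid_search` can be illustrated in isolation. The sketch below is a pure-Python stand-in (the notebook uses NumPy) with made-up scores; `min_max` and `fuse` are illustrative names. It assumes both score lists cover every chunk in the same order, whereas the notebook only has vector scores for the top FAISS hits and leaves the rest at zero.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi - lo < 1e-9:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(vec_scores, bm25_scores, w_vec=0.55, w_bm25=0.45):
    """Weighted sum of the two normalized score lists (same chunk order)."""
    nv, nb = min_max(vec_scores), min_max(bm25_scores)
    return [w_vec * v + w_bm25 * b for v, b in zip(nv, nb)]

# Made-up scores for three chunks, for illustration only.
vec = [0.2, 0.8, 0.5]
bm25 = [4.0, 0.0, 2.0]
fused = fuse(vec, bm25)
ranking = sorted(range(len(fused)), key=lambda i: -fused[i])
print(ranking, [round(f, 3) for f in fused])
```

Normalizing each list first keeps the unbounded BM25 scores from swamping the cosine similarities, so the 0.55/0.45 weights express the intended balance between the two signals.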

This cell demonstrates how to use the `answer` function with sample questions and prints the generated answers and citations.

In [None]:
for q in [
    "What is the acceptable use policy for company systems?",
    "How are conflicts of interest handled?",
    "What is the code of conduct reporting path?"
]:
    res = answer(q)
    print("\nQ:", q)
    print(res["answer"])
    print("Citations:")
    for c in res["citations"]:
        print(" -", c)



Q: What is the acceptable use policy for company systems?
Based on the provided documents, there is no information about the acceptable use policy for company systems. The documents cover topics such as spending company money, transparency, support, legal and corporate affairs departments, and corrective actions.
Citations:
 - 59Ded5F81A63 [59ded5f81a63.html]
 - 59Ded5F81A63 [59ded5f81a63.html]
 - A5B26Fb746Ed [a5b26fb746ed.html]
 - 2D6Ff0E6A0D1 [2d6ff0e6a0d1.html]
 - 018A581E1Ad0 [018a581e1ad0.html]

Q: How are conflicts of interest handled?
The provided text does not explicitly detail a process for handling conflicts of interest.

However, it does state a principle related to avoiding actions that can arise from conflicts of interest:
*   "Don’t show favoritism as it breeds resentment, destroys employee morale, and creates disincentives for good performance. Seek out ways to be fair to everyone."

This implies that actions resulting from conflicts of interest (like favoritism) are t
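
The citations in the output above show hashed file names where a policy title is expected, because the indexing cell derives `title` from the saved file name. One possible remedy is to read the page's `<title>` element instead. The sketch below uses only the standard library; `TitleParser`, `html_title`, and the sample HTML are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are ignored
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

def html_title(html, fallback="Untitled"):
    """Return the stripped page title, or `fallback` when none is present."""
    parser = TitleParser()
    parser.feed(html)
    return (parser.title or "").strip() or fallback

# Hypothetical page content, for illustration only.
sample = "<html><head><title> Code of Conduct | GitLab </title></head><body><h1>Hi</h1></body></html>"
print(html_title(sample))
```

In the indexing loop, a value obtained this way could replace the file-name-based `title` for `.html` inputs, with the file name kept as the fallback for other formats.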

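This cell strips ipywidgets metadata that GitHub cannot render, for use just before saving the notebook to GitHub.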

In [None]:
# --- Colab → GitHub preview fixer (robust) ---
# Use this cell just before: File → Save a copy to GitHub…

def _clean_notebook_dict(nb: dict) -> dict:
    # Remove broken ipywidgets metadata that GitHub can't render
    md = nb.get("metadata", {})
    md.pop("widgets", None)
    for cell in nb.get("cells", []):
        # cell-level widget metadata
        if isinstance(cell.get("metadata"), dict):
            cell["metadata"].pop("widgets", None)
        # outputs: drop widget MIME bundles that GitHub can't render
        outs = cell.get("outputs", [])
        if isinstance(outs, list):
            for out in outs:
                data = out.get("data")
                if isinstance(data, dict):
                    data.pop("application/vnd.jupyter.widget-view+json", None)
                    data.pop("application/vnd.jupyter.widget-state+json", None)
    return nb

def clean_current_notebook_for_github(drive_fallback=True):
    # 1) Try to clean the live (in-memory) notebook so "Save a copy to GitHub" uploads the fixed JSON.
    try:
        from google.colab import _message
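        resp = _message.blocking_request("get_ipynb", {})
        # Assumption: Colab wraps the notebook JSON under an "ipynb" key; otherwise use the raw response
        nb = resp.get("ipynb", resp) if isinstance(resp, dict) else resp
        if not isinstance(nb, dict) or "cells" not in nb:
            raise RuntimeError("Unexpected notebook structure from Colab frontend.")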
        nb = _clean_notebook_dict(nb)
        # Write cleaned JSON back to the frontend
        _message.blocking_request("set_ipynb", {"ipynb": nb})
        print("✅ Cleaned in memory. Now use: File → Save a copy to GitHub…")
        return True
    except Exception as e:
        print(f"⚠️ Could not update in-memory notebook: {e}")

    # 2) Fallback: save a cleaned copy to Drive (open it, then Save to GitHub)
    if drive_fallback:
        try:
            from google.colab import _message, drive
            import nbformat as nbf
            drive.mount("/content/drive", force_remount=True)
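            resp = _message.blocking_request("get_ipynb", {})
            # Assumption: the notebook JSON may be wrapped under an "ipynb" key
            nb = resp.get("ipynb", resp) if isinstance(resp, dict) else resp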
            nb = _clean_notebook_dict(nb)
            out_path = "/content/drive/MyDrive/Colab Notebooks/clean_for_github.ipynb"
            nbf.write(nbf.from_dict(nb), out_path, version=4)
            print("✅ Saved cleaned copy to Drive:", out_path)
            print("Next: File → Open notebook → Google Drive → open 'clean_for_github.ipynb',")
            print("then File → Save a copy to GitHub…")
            return True
        except Exception as e2:
            print(f"❌ Fallback also failed: {e2}")
    return False

# Run it:
clean_current_notebook_for_github()


⚠️ Could not update in-memory notebook: Unexpected notebook structure from Colab frontend.
Mounted at /content/drive
❌ Fallback also failed: Notebook could not be converted from version 1 to version 2 because it's missing a key: cells


False
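
Since both in-Colab paths failed in the run above, an offline route may be more dependable: download the notebook as an `.ipynb` file and clean it with plain `json`. This is a sketch under that assumption; `strip_widget_metadata` and `clean_ipynb_file` are illustrative names, and the file paths are whatever you choose.

```python
import json

WIDGET_MIME_TYPES = (
    "application/vnd.jupyter.widget-view+json",
    "application/vnd.jupyter.widget-state+json",
)

def strip_widget_metadata(nb):
    """Remove ipywidgets metadata and widget outputs from a notebook dict, in place."""
    nb.get("metadata", {}).pop("widgets", None)
    for cell in nb.get("cells", []):
        cell.get("metadata", {}).pop("widgets", None)
        for out in cell.get("outputs", []):
            data = out.get("data")
            if isinstance(data, dict):
                for mime in WIDGET_MIME_TYPES:
                    data.pop(mime, None)
    return nb

def clean_ipynb_file(src, dst):
    """Read a notebook file, strip widget metadata, and write the result to dst."""
    with open(src, "r", encoding="utf-8") as f:
        nb = json.load(f)
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(strip_widget_metadata(nb), f, indent=1, ensure_ascii=False)
    return dst
```

The cleaned file can then be committed to the repository directly, without going through the Colab frontend.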