# Vector Database Initialization

## Dependencies

In [11]:
%pip install beautifulsoup4 requests pandas langchain tiktoken pyarrow fastparquet chromadb sentence_transformers --user

Note: you may need to restart the kernel to use updated packages.


In [12]:
%pip install pydantic==1.10.11 --user

Note: you may need to restart the kernel to use updated packages.


## Setup

### Document URL Scraping

In [14]:
from urllib.parse import urldefrag, urljoin

import requests
from bs4 import BeautifulSoup

root_url = "https://airflow.apache.org/docs/apache-airflow/stable/"
root_response = requests.get(root_url)
root_response.raise_for_status()  # fail early on HTTP errors
root_html = root_response.content.decode("utf-8")
soup = BeautifulSoup(root_html, "html.parser")

root_links = soup.find_all("a", attrs={"class": "reference internal"})

result = set()
for root_link in root_links:
    href = root_link.get("href")
    if not href:
        continue  # skip anchors without a target
    # Resolve the relative link against the root URL and drop any "#fragment",
    # without going through the local filesystem
    url, _fragment = urldefrag(urljoin(root_url, href))
    result.add(url)
urls = list(result)
print(*urls, sep="\n")

https://airflow.apache.org/docs/apache-airflow/stable/operators-and-hooks-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/integration.html
https://airflow.apache.org/docs/apache-airflow/stable/license.html
https://airflow.apache.org/docs/apache-airflow/stable/privacy_notice.html
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/howto/index.html
https://airflow.apache.org/docs/apache-airflow/stable/project.html
https://airflow.apache.org/docs/apache-airflow/stable/deprecated-rest-api-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/start.html
https://airflow.apache.org/docs/apache-airflow/stable/ui.html
https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/faq.html
https://airflow.apache.org/docs/apache-airflow/stabl
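
Link resolution is easy to check in isolation. A minimal sketch of how relative hrefs resolve against the root URL with `urllib.parse.urljoin` and `urldefrag`; the helper name `resolve_links` and the sample hrefs are hypothetical and only for illustration:

```python
from urllib.parse import urldefrag, urljoin


def resolve_links(base_url, hrefs):
    """Resolve relative hrefs against base_url, dropping fragments and duplicates."""
    resolved = set()
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href attribute
        absolute = urljoin(base_url, href)
        # Remove "#fragment" so each page appears only once
        absolute, _fragment = urldefrag(absolute)
        resolved.add(absolute)
    return sorted(resolved)


# Hypothetical hrefs, as they might appear in the page's <a> tags
sample_hrefs = ["start.html", "howto/index.html#section", "./faq.html", None]
base = "https://airflow.apache.org/docs/apache-airflow/stable/"
print(*resolve_links(base, sample_hrefs), sep="\n")
```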

In [15]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)

loader = WebBaseLoader(urls)
documents = loader.load()

# Select one of the following:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=100)

splitted_documents = text_splitter.split_documents(documents)
print("Total documents: ", len(splitted_documents))

Total documents:  1488
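
With `chunk_size=1000` and `chunk_overlap=100`, each chunk holds at most 1000 characters and neighbouring chunks can share up to 100 of them. A simplified sketch of that sliding-window idea follows. It is not how `RecursiveCharacterTextSplitter` is implemented (the real splitter also prefers to break on separators such as paragraphs and newlines), and `chunk_text` is a hypothetical helper:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```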


In [16]:
import pandas as pd

page_contents = []
sources = []
titles = []
languages = []

for document in splitted_documents:
    page_contents.append(document.page_content)
    # Always append one value per column so all lists stay the same length,
    # even when a document has no metadata
    metadata = document.metadata or {}
    sources.append(metadata.get("source", "Unknown"))
    titles.append(metadata.get("title", "Unknown"))
    languages.append(metadata.get("language", "Unknown"))

documents_df = pd.DataFrame(
    {
        "page_content": page_contents,
        "source": sources,
        "title": titles,
        "language": languages,
    }
)
documents_df.fillna("Unknown", inplace=True)
documents_df.head()

| | page_content | source | title | language |
|---|---|---|---|---|
| 0 | Operators and Hooks Reference — Airflow Docume... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 1 | Announcements\n \n\... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 2 | Database Migrations\nDatabase ERD Schema\n\n\n... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 3 | Module\nGuides\n\n\n\nairflow.hooks.base\n\n\n... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 4 | Previous\n\n\nNext\n\n\n\n\n\n\n\n\nWas this e... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |


In [17]:
# Replace each newline or tab with a space
documents_df["page_content"] = documents_df["page_content"].str.replace(
    r"[\n\t]", " ", regex=True
)
# Remove leading and trailing whitespace
documents_df["page_content"] = documents_df["page_content"].str.strip()

In [18]:
documents_df.head()

| | page_content | source | title | language |
|---|---|---|---|---|
| 0 | Operators and Hooks Reference — Airflow Docume... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 1 | Announcements ... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 2 | Database Migrations Database ERD Schema ... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 3 | Module Guides airflow.hooks.base airflow.... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
| 4 | Previous Next Was this entry helpful... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en |
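
Replacing each newline or tab with a space leaves one space per replaced character, so a run such as `\n\n\n` becomes three spaces. If that is unwanted, one option is to collapse every whitespace run into a single space. A minimal sketch on plain strings (`normalize_whitespace` is a hypothetical helper; in pandas the same pattern could be passed to `Series.str.replace(..., regex=True)`):

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


raw = "Database Migrations\nDatabase ERD Schema\n\n\n\tModule"
print(normalize_whitespace(raw))
```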


In [19]:
from urllib.parse import urlparse


def decompose_url(url):
    parsed = urlparse(url)

    # The root url will be the scheme plus '://' plus the netloc
    root_url = parsed.scheme + "://" + parsed.netloc

    # Split the path on '/' and drop empty parts, so an empty path yields no parts
    path_parts = [part for part in parsed.path.split("/") if part]

    # The section and page depend on how many parts there are
    if len(path_parts) >= 2:
        # The section will be the second to last part
        section = path_parts[-2]
        # The page will be the last part
        page = path_parts[-1]
    elif len(path_parts) == 1:
        # If there's only one part, we'll assume it's the page
        section = None
        page = path_parts[0]
    else:
        # If there are no parts, then both section and page will be None
        section = None
        page = None

    return root_url, section, page


# Apply the function to the 'source' column of your dataframe
documents_df["root_url"], documents_df["section"], documents_df["page"] = zip(
    *documents_df["source"].map(decompose_url)
)

In [20]:
documents_df.head()

| | page_content | source | title | language | root_url | section | page |
|---|---|---|---|---|---|---|---|
| 0 | Operators and Hooks Reference — Airflow Docume... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en | https://airflow.apache.org | stable | operators-and-hooks-ref.html |
| 1 | Announcements ... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en | https://airflow.apache.org | stable | operators-and-hooks-ref.html |
| 2 | Database Migrations Database ERD Schema ... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en | https://airflow.apache.org | stable | operators-and-hooks-ref.html |
| 3 | Module Guides airflow.hooks.base airflow.... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en | https://airflow.apache.org | stable | operators-and-hooks-ref.html |
| 4 | Previous Next Was this entry helpful... | https://airflow.apache.org/docs/apache-airflow... | Operators and Hooks Reference — Airflow Docume... | en | https://airflow.apache.org | stable | operators-and-hooks-ref.html |


In [21]:
documents_df.isnull().sum()

page_content    0
source          0
title           0
language        0
root_url        0
section         0
page            0
dtype: int64

In [22]:
documents_df.to_parquet("./parquets/documents_with_rec-char-split.parquet")
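
The write above expects the `./parquets` directory to exist already; pandas does not, as far as I know, create missing parent directories. A small sketch for creating it up front (`ensure_parent_dir` is a hypothetical helper, not part of pandas):

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


output_path = ensure_parent_dir("./parquets/documents_with_rec-char-split.parquet")
print(output_path.parent.is_dir())
```

The returned path can then be passed straight to `documents_df.to_parquet(output_path)`.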

## Storage

In [23]:
import chromadb

client = chromadb.PersistentClient(path="./db")

In [24]:
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection_name = "airflow_docs_stable"

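# Delete any existing collection with this name so the notebook can be re-run.
# Depending on the chromadb version, list_collections() yields Collection
# objects or plain names; getattr handles both.
existing_names = [getattr(c, "name", c) for c in client.list_collections()]
if collection_name in existing_names:
    client.delete_collection(name=collection_name)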
print(f"Creating collection: '{collection_name}'")
collection = client.create_collection(
    name=collection_name, embedding_function=sentence_transformer_ef
)

Creating collection: 'airflow_docs_stable'


In [25]:
for index, row in documents_df.iterrows():
    # Build the metadata fresh for every row so values never carry over
    # from a previous iteration; null fields are left out
    metadata = {
        key: row[key]
        for key in ("source", "section", "page")
        if pd.notnull(row[key])
    }
    collection.add(
        documents=[row["page_content"]],
        metadatas=[metadata] if metadata else None,
        ids=[str(index)],
    )
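
Adding rows one at a time runs the embedding function once per document. `collection.add` accepts lists, so rows can be grouped into batches instead. A minimal sketch, assuming a hypothetical `add_in_batches` helper and an arbitrarily chosen batch size of 100; `collection` is any object with Chroma's `add(documents=..., metadatas=..., ids=...)` signature:

```python
def add_in_batches(collection, documents, metadatas, ids, batch_size=100):
    """Add parallel lists to a collection in slices of batch_size; return the batch count."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas and ids must have the same length")
    batches = 0
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        batches += 1
    return batches
```

With the dataframe from this notebook, the inputs could be built as `documents_df["page_content"].tolist()`, `documents_df[["source", "section", "page"]].to_dict("records")`, and `[str(i) for i in documents_df.index]`.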

## Testing

### Retrieving Information from Vector Store

In [27]:
question = "How to create a DAG?"
results = collection.query(
    query_texts=[question],
    n_results=3,
)
formatted_result = "\n\n".join(results["documents"][0])
print(formatted_result)

Named Arguments¶  -d, --dag-id The id of the dag  --limit Return a limited number of records  -o, --output Possible choices: table, json, yaml, plain Output format. Allowed values: json, yaml, plain, table (default: table) Default: “table”  --state Possible choices: queued, running, success, failed Only list the dag runs corresponding to the state  -v, --verbose Make logging output more verbose Default: False

Using the Public Interface for DAG Authors¶  DAGs¶ The DAG is Airflow’s core entity that represents a recurring workflow. You can create a DAG by instantiating the DAG class in your DAG file. You can also instantiate them via :class::~airflow.models.dagbag.DagBag class that reads DAGs from a file or a folder. DAGs can also have parameters specified via :class::~airflow.models.param.Param class. Airflow has a set of example DAGs that you can use to learn how to write DAGs   airflow.example_dags   You can read more about DAGs in DAGs. References for the modules used in DAGs are her
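
`collection.query` returns parallel lists per query, so each hit's text can be paired with its metadata to show where it came from. A minimal sketch, assuming the result dict carries `documents` and `metadatas` keys (Chroma includes both by default) and using a hypothetical `format_hits` helper with made-up sample data:

```python
def format_hits(results, query_index=0):
    """Render the hits of one query as '[source] text' blocks separated by blank lines."""
    documents = results["documents"][query_index]
    metadatas = results["metadatas"][query_index]
    blocks = []
    for text, metadata in zip(documents, metadatas):
        # Fall back to "Unknown" when a hit has no metadata or no source
        source = (metadata or {}).get("source", "Unknown")
        blocks.append(f"[{source}] {text}")
    return "\n\n".join(blocks)


# Made-up result in the shape returned by collection.query
sample_results = {
    "documents": [["You can create a DAG by instantiating the DAG class."]],
    "metadatas": [[{"source": "https://example.org/dags.html"}]],
}
print(format_hits(sample_results))
```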

### Loading Vector Store

In [31]:
import chromadb
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

client = chromadb.PersistentClient(path="./db")
embeddings = SentenceTransformerEmbeddings(
    model_name="all-mpnet-base-v2",
)
vector_db = Chroma(
    client=client,
    collection_name="airflow_docs_stable",
    embedding_function=embeddings,
)
print(f"Documents Loaded: {vector_db._collection.count()}")

Documents Loaded: 1488
