# Vector Database Initialization

## Dependencies

In [6]:
%pip install beautifulsoup4 requests pandas langchain tiktoken pyarrow fastparquet chromadb typing-inspect==0.8.0 typing_extensions==4.5.0 --user

Collecting typing_extensions==4.5.0
  Using cached typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Collecting pydantic<2,>=1 (from langchain)
  Using cached pydantic-1.10.11-py3-none-any.whl (158 kB)
  Using cached pydantic-1.9.0-py3-none-any.whl (140 kB)
Installing collected packages: typing_extensions, pydantic
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pydantic-core 2.1.2 requires typing-extensions!=4.7.0,>=4.6.0, but you have typing-extensions 4.5.0 which is incompatible.
Successfully installed pydantic-1.9.0 typing_extensions-4.5.0
Note: you may need to restart the kernel to use updated packages.


In [7]:
%pip install pydantic -U

Collecting pydantic
  Using cached pydantic-2.0.2-py3-none-any.whl (359 kB)
Collecting typing-extensions>=4.6.1 (from pydantic)
  Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Installing collected packages: typing-extensions, pydantic
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.5.0
    Uninstalling typing_extensions-4.5.0:
      Successfully uninstalled typing_extensions-4.5.0
  Attempting uninstall: pydantic
    Found existing installation: pydantic 1.9.0
    Uninstalling pydantic-1.9.0:
      Successfully uninstalled pydantic-1.9.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastapi 0.85.1 requires pydantic!=1.7,!=1.7.1,!=1.7.2,!=1.7.3,!=1.8,!=1.8.1,<2.0.0,>=1.6.2, but you have pydantic 2.0.2 which is incompatible.
langchain 0.0.230 requires pydantic<2,>=1, but you have pydantic 2.0.2
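
The two install cells above pull pydantic in opposite directions: the log shows langchain 0.0.230 declaring `pydantic<2,>=1`, while `%pip install pydantic -U` installs 2.0.2. A minimal sketch of a pre-flight check that could run before importing LangChain, assuming plain dotted numeric version strings. The helper names `satisfies_langchain_pin` and `installed_version` are hypothetical and not part of pip or LangChain.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def satisfies_langchain_pin(pydantic_version: str) -> bool:
    """Return True if the version falls inside `pydantic<2,>=1`, i.e. major version 1.

    Assumes a plain dotted numeric version such as "1.10.11" (no pre-release tags).
    """
    major = int(pydantic_version.split(".")[0])
    return major == 1


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pydantic")
    if found is None:
        print("pydantic is not installed")
    elif satisfies_langchain_pin(found):
        print(f"pydantic {found} satisfies pydantic<2,>=1")
    else:
        print(f"pydantic {found} violates pydantic<2,>=1")
```

With the versions from the logs above, `1.10.11` and `1.9.0` would pass this check and `2.0.2` would not.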

## Setup

### Documents URL Scraping

In [8]:
from urllib.parse import urldefrag, urljoin
from bs4 import BeautifulSoup
import requests

root_url = "https://airflow.apache.org/docs/apache-airflow/stable/"
root_response = requests.get(root_url, timeout=30)
root_response.raise_for_status()  # Fail early on HTTP errors
root_html = root_response.content.decode("utf-8")
soup = BeautifulSoup(root_html, 'html.parser')

# Internal documentation links carry the "reference internal" class
root_links = soup.find_all("a", attrs={"class": "reference internal"})

result = set()
for root_link in root_links:
    href = root_link.get("href")
    if not href:
        continue  # Skip anchors without a target
    # Resolve the relative href against the root URL and drop any #fragment
    url, _ = urldefrag(urljoin(root_url, href))
    result.add(url)
urls = list(result)
print(*urls, sep="\n")

https://airflow.apache.org/docs/apache-airflow/stable/project.html
https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html
https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/
https://airflow.apache.org/docs/apache-airflow/stable/deprecated-rest-api-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/public-airflow-interface.html
https://airflow.apache.org/docs/apache-airflow/stable/license.html
https://airflow.apache.org/docs/apache-airflow/stable/tutorial/index.html
https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/index.html
https://airflow.apache.org/docs/apache-airflow/stable/extra-packages-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/index.html
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html
https://airflow.apache.org/docs/apache-airflow/stable/migrations-ref.html
https://airflow.ap

In [9]:
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

loader = WebBaseLoader(urls)
documents = loader.load()

# Select one of the following:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, chunk_overlap=100)

splitted_documents = text_splitter.split_documents(documents)
print("Total documents: ", len(splitted_documents))

PydanticUserError: If you use `@root_validator` with pre=False (the default) you MUST specify `skip_on_failure=True`. Note that `@root_validator` is deprecated and should be replaced with `@model_validator`.

For further information visit https://errors.pydantic.dev/2.0.2/u/root-validator-pre-skip
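
The `chunk_size=1000, chunk_overlap=100` arguments in the cell above control how much text each chunk holds and how much of it is repeated in the next one. A minimal sketch of that idea using fixed character windows. This is a simplification for illustration, not LangChain's `RecursiveCharacterTextSplitter`, which prefers to break on separators. `split_with_overlap` is a hypothetical helper.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # How far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
    # → ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.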

In [10]:
import pandas as pd

page_contents = []
sources = []
titles = []
languages = []

for document in splitted_documents:
    page_contents.append(document.page_content)
    # Always append one value per column so all lists stay the same length
    metadata = document.metadata or {}
    sources.append(metadata.get('source', "Unknown"))
    titles.append(metadata.get('title', "Unknown"))
    languages.append(metadata.get('language', "Unknown"))

documents_df = pd.DataFrame({
    'page_content': page_contents,
    'source': sources,
    'title': titles,
    'language': languages
})
documents_df.fillna("Unknown", inplace=True)
documents_df.head()

| | page_content | source | title | language |
|---|---|---|---|---|
| 0 | Public Interface of Airflow — Airflow Document... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 1 | Announcements\n \n\... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 2 | Task Instance Keys\nairflow.models.taskinstanc... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 3 | Database Migrations\nDatabase ERD Schema\n\n\n... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 4 | Timetables\nairflow.timetables\n\n\nListeners\... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |


In [11]:
# Replace \n and \t with a space
documents_df["page_content"] = documents_df["page_content"].replace('\n', ' ', regex=True)
documents_df["page_content"] = documents_df["page_content"].replace('\t', ' ', regex=True)
# Remove leading and trailing spaces
documents_df["page_content"] = documents_df["page_content"].str.strip()
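
The cell above swaps each newline and tab for a single space, which can leave long runs of spaces inside the text. A minimal sketch of a stricter variant that collapses any run of whitespace into one space, shown on a plain string. `normalize_whitespace` is a hypothetical helper, not part of the notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_whitespace("Announcements\n \n\tfoo "))
    # → Announcements foo
```

Assuming the DataFrame from the earlier cell, the same function could be applied with `documents_df["page_content"].map(normalize_whitespace)`.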

In [12]:
documents_df.head()

| | page_content | source | title | language |
|---|---|---|---|---|
| 0 | Public Interface of Airflow — Airflow Document... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 1 | Announcements ... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 2 | Task Instance Keys airflow.models.taskinstance... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 3 | Database Migrations Database ERD Schema ... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |
| 4 | Timetables airflow.timetables Listeners Extr... | https://airflow.apache.org/docs/apache-airflow... | Public Interface of Airflow — Airflow Document... | en |


In [13]:
documents_df.isnull().sum()

page_content    0
source          0
title           0
language        0
dtype: int64

In [14]:
documents_df.to_parquet('./parquets/documents_with_rec-char-split.parquet')
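
The write above targets `./parquets/`, and would be expected to fail if that directory does not exist yet. A minimal sketch of creating the parent directory first, using only the standard library. `ensure_parent_dir` is a hypothetical helper, not part of the notebook or pandas.

```python
from pathlib import Path


def ensure_parent_dir(file_path: str) -> Path:
    """Create the parent directory of file_path if needed and return the path as a Path."""
    path = Path(file_path)
    # parents=True creates intermediate folders; exist_ok=True makes the call repeatable
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

Assuming the DataFrame from the earlier cells, the save could then read `documents_df.to_parquet(ensure_parent_dir('./parquets/documents_with_rec-char-split.parquet'))`.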

## Storage

In [15]:
import chromadb
from chromadb.config import Settings
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./db/"
))

In [19]:
collection_name = "airflow_docs_stable"
# Drop any existing collection with this name so it is rebuilt from scratch
if collection_name in [c.name for c in client.list_collections()]:
    client.delete_collection(name=collection_name)
print(f"Creating collection: '{collection_name}'")
collection = client.create_collection(name=collection_name)

Creating collection: 'airflow_docs_stable'


In [20]:
for index, row in documents_df.iterrows():
    # Build metadata per row, falling back to "Unknown" for any missing field
    metadata = {
        key: row[key] if pd.notnull(row[key]) else "Unknown"
        for key in ('source', 'title', 'language')
    }
    collection.add(
        documents=[row['page_content']],
        metadatas=[metadata],
        ids=[str(index)],
    )

/home/jovyan/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:05<00:00, 14.0MiB/s]
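
The loop above calls `collection.add` once per row, although the call takes lists for `documents`, `metadatas`, and `ids`. A minimal sketch of slicing rows into batches so several rows can go into each call. `iter_batches` is a hypothetical helper and the batch size of 100 in the comment is an arbitrary assumption.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of items, each holding at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    print(list(iter_batches([1, 2, 3, 4, 5], batch_size=2)))
    # → [[1, 2], [3, 4], [5]]

    # Hypothetical use with the notebook's objects (not executed here):
    # records = documents_df.to_dict("records")
    # for offset, batch in enumerate(iter_batches(records, 100)):
    #     collection.add(
    #         documents=[r["page_content"] for r in batch],
    #         metadatas=[{k: r[k] for k in ("source", "title", "language")} for r in batch],
    #         ids=[str(offset * 100 + i) for i in range(len(batch))],
    #     )
```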


In [21]:
client.persist()

True

## Testing

In [22]:
question = "How to create a DAG?"
results = collection.query(
    query_texts=[question],
    n_results=3,
)
formatted_result = "\n\n".join(results["documents"][0])
print(formatted_result)

Using the Public Interface for DAG Authors¶  DAGs¶ The DAG is Airflow’s core entity that represents a recurring workflow. You can create a DAG by instantiating the DAG class in your DAG file. You can also instantiate them via :class::~airflow.models.dagbag.DagBag class that reads DAGs from a file or a folder. DAGs can also have parameters specified via :class::~airflow.models.param.Param class. Airflow has a set of example DAGs that you can use to learn how to write DAGs   airflow.example_dags   You can read more about DAGs in DAGs. References for the modules used in DAGs are here:   airflow.models.dag airflow.models.dagbag airflow.models.param     Operators¶ Operators allow for generation of certain types of tasks that become nodes in the DAG when instantiated. There are 3 main types of operators:

Positional Arguments¶  dag_id The id of the dag  execution_date The execution date of the DAG (optional)     Named Arguments¶  -c, --conf JSON string that gets pickled into the DagRun’s con
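
The query cell above prints only the matched text. Chroma's query result also holds a `metadatas` list nested the same way as `documents`, so each hit can be shown with its source URL. A minimal sketch that works on any dict of that shape. `format_results` is a hypothetical helper, and the sample dict in the demo is made-up data, not real query output.

```python
def format_results(results: dict, query_index: int = 0) -> str:
    """Render the hits for one query as numbered blocks of '[rank] source' followed by the text."""
    documents = results["documents"][query_index]
    all_metadatas = results.get("metadatas")
    metadatas = all_metadatas[query_index] if all_metadatas else [None] * len(documents)

    blocks = []
    for rank, (text, metadata) in enumerate(zip(documents, metadatas), start=1):
        # A hit may have no metadata at all, so fall back to "Unknown"
        source = (metadata or {}).get("source", "Unknown")
        blocks.append(f"[{rank}] {source}\n{text}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up sample mirroring the nested-list layout of a query result
    sample = {
        "documents": [["first chunk", "second chunk"]],
        "metadatas": [[{"source": "https://example.org/a.html"}, None]],
    }
    print(format_results(sample))
```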