# Create Knowledge Base

## Notes

- Check for missing dependencies (especially ones only added in the ingestion phase currently)
- Check for multi-core processing for SBert models, seems to only use one core right now

## TODO

- Check results from DB without reranker vs no-reranker
- Check token sizes / limits for Sentence Transformer / Bi-Encoder (also check used models again, mentioned parameters depend on those)
- Check why `title` is incorrect in last crawls
- Overall: check quality of crawled content per domain before ingesting

## Set the domain to ingest

Set the domain you want to have scanned, this will include all subpages on that domain (and that domain only). Excludes links with parameters (`?`) and anchors (`#`).

Also set a name that will be used to create files to persist ingested content.

In [33]:
name = 'kickstartds_com'
url = 'https://www.kickstartDS.com/'

## Some more dependencies

Install `jsonlines`, if not already available.

In [None]:
%pip install jsonlines

## Create knowledge base

Create a knowledge base using SBert and Sentence Transformers, based on the sections of ingested pages.

This is can all still be done completely "offline", and without any third party APIs.

In [34]:
import torch
import jsonlines
from sentence_transformers import SentenceTransformer

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook.")
    
print('Creating SBert knowledge base.')    

bi_encoder = SentenceTransformer('msmarco-distilbert-cos-v5')
bi_encoder.max_seq_length = 256

def getSectionContent(section):
    return section['content'].replace('\n', ' ').strip()

sections = []
with jsonlines.open('pages-' + name + '_extracted_sections.jsonl', 'r') as pages:
    for page in pages:
        for page_section in page['sections']:
            section = dict()
            section['page'] = dict()
            section['page']['url'] = page['url']
            section['page']['title'] = page['title']
            section['page']['summary'] = page['summaries']['sbert']
            section['content'] = page_section['content']['raw']
            section['tokens'] = page_section['tokens']
            sections.append(section)

passages = []
passages.extend(map(getSectionContent, sections))

print('Passages:', len(passages))

corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True)

print('Corpus embeddings created.')
print('Corpus embedding size:', corpus_embeddings.shape)

Creating SBert knowledge base.
Passages: 3745
Corpus embeddings created.
Corpus embedding size: torch.Size([3745, 768])


## Define (offline) search function

This function will search through our corpus, and retrieve the most relevant sections in relation to the user query given as a parameter.

In [11]:
from sentence_transformers import CrossEncoder, util

top_k = 32
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def search(query):
    print("Input question:", query)

    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Change back to .cuda() when GPU is available on Codespace
    question_embedding = question_embedding.cpu()
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]

    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    print("\n-------------------------\n")
    print("Top-5 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))

## Run queries against (offline) knowledge base

Using the "search" function defined before we can start querying our knowledge base.

In [12]:
search("What is a Design System?")

Input question: What is a Design System?

-------------------------

Top-5 Cross-Encoder Re-ranker hits
	8.710	https://www.kickstartDS.com/docs/guides/design-system-initiative: A Design System Initiative is the process an organization undertakes when considering to initialize a Design System. This generally occurs in a couple of steps where usually different kinds of workshops are run. It starts way before the pure design and development process itself by impacting many disciplines and stakeholders within the organization.
	8.474	https://www.kickstartDS.com/blog/everything-meta-and-everything-matters/: A design system meta framework is a set of guidelines, principles and best practices that are used to create, maintain, and evolve a Design System. A Design System is a collection of standardized design elements, such as colors, typography, iconography, and components, that are used to create a consistent, high-quality user experience across multiple products and platforms. In this sense

# Database

## Install DB dependencies

We'll need `psycopg` for the connection, and `pgvector` to register the vector type for our embeddings to the PostgreSQL connection.

We also use `python-dotenv` to load our environment, namely the `DB_PASS` for the database connection. Ensure settings this variable in your environment, or add a `.env` file next to the notebook where the variable is defined to avoid putting it into your host context.

In [None]:
%pip install psycopg pgvector python-dotenv

## Connect to DB

Establish a connection with the database.

In [13]:
from dotenv import load_dotenv
from pgvector.psycopg import register_vector
import psycopg
import os

load_dotenv()

conn_string = "dbname=postgres user=postgres password=" + os.getenv('DB_PASS') + " host=db.pzdzoelitkqizxopmwfg.supabase.co port=5432"
conn = psycopg.connect(conn_string, row_factory=psycopg.rows.dict_row)
conn.autocommit = True
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

## Create database tables

Create a table to hold all received questions and their respective response data.

In [26]:
def createDbTables():
    conn.execute('DROP TABLE IF EXISTS question_answer_sections')

    conn.execute('DROP TABLE IF EXISTS sections')
    conn.execute('CREATE TABLE sections (id bigserial PRIMARY KEY, created_at timestamptz, updated_at timestamptz, page_url text, page_title text, page_summary text, tokens integer, content text, embedding vector(768))')
    conn.execute('CREATE INDEX ON sections USING ivfflat (embedding vector_cosine_ops)')

    conn.execute('DROP TABLE IF EXISTS questions')
    conn.execute('CREATE TABLE questions (id bigserial PRIMARY KEY, created_at timestamptz, updated_at timestamptz, question text, prompt text, prompt_length integer, answer text, embedding vector(768))')
    conn.execute('CREATE INDEX ON questions USING ivfflat (embedding vector_cosine_ops)')

    conn.execute('CREATE TABLE question_answer_sections (question_id bigserial REFERENCES questions, section_id bigserial REFERENCES sections, similarity real, PRIMARY KEY (question_id, section_id))')

createDbTables()

## Write embeddings to DB

Write all page embeddings into PostgreSQL / pgvector / Supabase.

In [35]:
from datetime import datetime

timestamp = datetime.now()

# Change back to .cuda() when GPU is available on Codespace
section_embeddings = corpus_embeddings.cpu()

for index, page_embedding in enumerate(section_embeddings.detach().numpy()):
    conn.execute("""
        INSERT INTO sections (created_at, updated_at, page_url, page_title, page_summary, tokens, content, embedding)
        VALUES (%(created_at)s, %(updated_at)s, %(page_url)s, %(page_title)s, %(page_summary)s, %(tokens)s, %(content)s, %(embedding)s);
    """, ({
        'created_at': timestamp,
        'updated_at': timestamp,
        'page_url': sections[index]['page']['url'],
        'page_title': sections[index]['page']['title'],
        'page_summary': sections[index]['page']['summary'],
        'content': sections[index]['content'],
        'tokens': sections[index]['tokens'],
        'embedding': page_embedding
    }))

conn.execute('REINDEX TABLE sections')

<psycopg.Cursor [COMMAND_OK] [IDLE] (host=db.pzdzoelitkqizxopmwfg.supabase.co database=postgres) at 0x7f13a1733cc0>

## Define (online) search function

This function can be used to run a search against the PostgreSQL vector table.

In [14]:
from sentence_transformers import CrossEncoder, SentenceTransformer

bi_encoder = SentenceTransformer('msmarco-distilbert-cos-v5')
bi_encoder.max_seq_length = 256

top_k = 32
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def search(query):
    print("Input question:", query)

    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Change back to .cuda() when GPU is available on Codespace
    question_embedding = question_embedding.cpu()

    hits = conn.execute('SELECT * FROM sections ORDER BY embedding <-> %s LIMIT ' + str(top_k), (question_embedding.detach().numpy(),)).fetchall()

    cross_inp = [[query, hit['content']] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    print("\n-------------------------\n")
    print("Top-5 Cross-Encoder Re-ranker hits from PostgreSQL")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:5]:
        print("\t{:.3f}\t{}\t{}".format(hit['cross-score'], hit['heading'], hit['content']))


In [15]:
search('What are the 5 main benefits of using Design Token in your Design System?')

Input question: What are the 5 main benefits of using Design Token in your Design System?

-------------------------

Top-5 Cross-Encoder Re-ranker hits from PostgreSQL
	4.090	Features	This starter is already quite rich in features that are enabled out-of-the-box for you.. To give you some orientation, while also describing the intention behind features, and ensuring you can actually make the most out of your Design System.. Design Token integration One important part of a Design System is having a well structured and semantic token system in place.. Learn about customizing your Design Token set in our dedicated section above, helping you to adapt your own branding / CI / CD.. Design Token can be initialized by changing the values in src/token/branding-token.json and calling yarn init-tokens, and compiled to CSS Custom Properties by running yarn build-tokens.. While yarn init-tokens generates your Design Token set in src/token/dictionary, yarn build-tokens reads it from there, and outp

## Call edge function with cURL locally

This calls the edge function using cURL. Edge function is served locally by first using `yarn supabase start && yarn supabase functions serve`.

```
curl --request POST 'http://localhost:54321/functions/v1/answer' \
  --header 'Authorization: Bearer ADD_YOUR_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{ "question":"What is the definition of a Design System?" }'
```

## Close DB connection

After we're done, we'll close down the connection to PostgreSQL again.

In [5]:
conn.close()