### Read .md files

In [4]:
from pathlib import Path 

data_path = Path("data")

list(data_path.glob("*.md"))

[WindowsPath('data/An introduction to the vector database LanceDB.md'),
 WindowsPath('data/API trafiklab (1).md'),
 WindowsPath('data/API trafiklab.md'),
 WindowsPath('data/Azure static web app deploy react app.md'),
 WindowsPath('data/Chat with your excel data - xlwings lite (1).md'),
 WindowsPath('data/Chat with your excel data - xlwings lite.md'),
 WindowsPath('data/Course structure for Azure two weeks course.md'),
 WindowsPath('data/Data platform course structure.md'),
 WindowsPath('data/data processing course  structure.md'),
 WindowsPath('data/data storytelling.md'),
 WindowsPath('data/dbt modeling snowflake.md'),
 WindowsPath('data/docker setup windows.md'),
 WindowsPath('data/FastAPI and scikit-learn API connect to streamlit frontend.md'),
 WindowsPath('data/Fastapi CRUD app.md'),
 WindowsPath('data/Hands on regularization.md'),
 WindowsPath('data/How does LLM work_.md'),
 WindowsPath('data/Logistic regression hands on with scikit learn.md'),
 WindowsPath('data/Logistic regress

### Quick checks
- confirm table and columns

In [2]:
import lancedb
from backend.constants import VECTOR_DATABASE_PATH

db = lancedb.connect(uri=VECTOR_DATABASE_PATH)
tbl = db["transcripts"]
df = tbl.to_pandas()
print(df.columns)

Index(['md_id', 'filepath', 'filename', 'content', 'embedding'], dtype='object')


In [3]:
print(df.shape)

(53, 5)


In [4]:
print(df.head(1).T)

                                                           0
md_id         An introduction to the vector database LanceDB
filepath   C:\Users\Katrin\Documents\github\yt-rag-assist...
filename      An introduction to the vector database LanceDB
content    # An introduction to the vector database Lance...
embedding  [-0.038686633, 0.0036908067, 0.02178414, -0.07...


#### Check the embedding column and inspect a vector
- Does it exist?
- Is it a list/array?
- Does the length equal the embedding dimension?

In [5]:
emb = df.loc[0, "embedding"]
print(type(emb), len(emb))

<class 'numpy.ndarray'> 3072
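The cell above inspects only row 0. A minimal sketch of how the same three questions could be asked of every row; `check_embeddings` and `expected_dim` are hypothetical names, and the value 3072 is taken from the output above.

```python
import math
from typing import Iterable, Optional, Sequence


def check_embeddings(
    vectors: Iterable[Optional[Sequence[float]]], expected_dim: int
) -> list[str]:
    """Return a list of problems found; an empty list means every row passed."""
    problems = []
    for i, vec in enumerate(vectors):
        # Does it exist?
        if vec is None:
            problems.append(f"row {i}: embedding is missing")
            continue
        # Does the length equal the embedding dimension?
        if len(vec) != expected_dim:
            problems.append(f"row {i}: expected {expected_dim} dimensions, got {len(vec)}")
            continue
        # Are all components usable numbers (no NaN or inf)?
        if not all(math.isfinite(x) for x in vec):
            problems.append(f"row {i}: contains non-finite values")
    return problems


# Tiny made-up example; with the notebook's DataFrame the call would be
# check_embeddings(df["embedding"], expected_dim=3072)
problems = check_embeddings([[0.1, 0.2], [0.3], [float("nan"), 1.0]], expected_dim=2)
for p in problems:
    print(p)
```

The function only relies on `len()` and iteration, so it should accept Python lists as well as NumPy arrays.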


#### Sanity check: vector norms and non-zero check
- norms > 0 (not all zeros)
- they should be on a roughly similar scale.

In [7]:
import numpy as np
embs = df["embedding"].apply(lambda x: np.array(x, dtype=float))
norms = embs.apply(np.linalg.norm)
print("min, median, max norm:", norms.min(), norms.median(), norms.max())

min, median, max norm: 0.9999998880146814 1.0000000087852439 1.0000001546808057
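All three norms are approximately 1, which suggests the stored vectors are L2-normalized. For unit-length vectors, cosine similarity reduces to a plain dot product. A small sketch with made-up 2-dimensional vectors (the helper names are assumptions, not part of the project):

```python
import math


def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed from its definition."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


# Two made-up unit vectors
a = [0.6, 0.8]
b = [1.0, 0.0]

# For unit vectors the two quantities agree up to floating-point error
print(math.isclose(cosine_similarity(a, b), dot(a, b)))
```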


## Data preprocessing for Transcripts
- Removing Non-Semantic Metadata (Timestamps - strings like [00:01:23] which are noise)
- Eliminating Transcription Artifacts (Strikethroughs)
- Structural Flattening (Whitespace & Line Breaks)
- Reducing "Token Waste": collapsing multiple spaces (\s+) and removing fillers

In [1]:
import re

def clean_transcript(text: str) -> str:
    # Remove timestamps: [00:00:00]
    text = re.sub(r"\[\d{2}:\d{2}:\d{2}\]", "", text)
    
    # Remove strikethrough artifacts: ~~example~~
    # '.*?' is non-greedy, so each ~~...~~ pair is removed on its own instead of
    # everything between the first and last tildes on a line ('.' does not match newlines here).
    text = re.sub(r"~~.*?~~", "", text)
    
    # Clean up excessive line breaks and leading/trailing whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = " ".join(lines)
    
    # Collapse multiple spaces into one
    text = re.sub(r"\s+", " ", text)
    
    return text.strip()
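To see why the non-greedy quantifier matters, this minimal sketch compares `.*` with `.*?` on a made-up line (the sample string is an assumption, not taken from the transcripts):

```python
import re

# Hypothetical one-line sample with two separate strikethrough spans
line = "keep ~~drop1~~ middle ~~drop2~~ end"

greedy = re.sub(r"~~.*~~", "", line)       # matches from the first ~~ to the last ~~
non_greedy = re.sub(r"~~.*?~~", "", line)  # matches each ~~...~~ pair separately

print(repr(greedy))      # 'keep  end'
print(repr(non_greedy))  # 'keep  middle  end'
```

Both variants leave double spaces behind, which is why the whitespace is collapsed in a later step.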

In [2]:
raw_sample = """
[00:00:00] Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api~~ we have ~~we have several APIs that we can work with. And~~ one, ~~the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and ~~see ~~see more details, how to work with this one. [00:01:00] And this re robot plan is used for ~~planning~~ planning your trip. For example, you want to travel ~~from Sweden to ~~from UBO to Stockholm, ~~for example. ~~Ubo to Malmo. You can find out. ~~The trips~~ which type of trains and buses there are and their stops, et cetera.
"""

cleaned_sample = clean_transcript(raw_sample)

print("--- RAW ---")
print(raw_sample)
print("\n--- CLEANED ---")
print(cleaned_sample)

--- RAW ---

[00:00:00] Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api~~ we have ~~we have several APIs that we can work with. And~~ one, ~~the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and ~~see ~~see more details, how to work with this one. [00:01:00] And this re robot plan is used for ~~planning~~ planning your trip. For example, you want to travel ~~from Sweden to ~~from UBO to Stockholm, ~~f

In [3]:
from pathlib import Path

file_path = Path("data/API trafiklab.md")

if file_path.exists():
    raw_text = file_path.read_text(encoding="utf-8")
    processed_text = clean_transcript(raw_text)
    
    # Preview the first 500 characters
    print(processed_text[:500])
else:
    print("File not found. Please check the path.")

# API trafiklab Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now. So here I'm in the web browser and I've gone into Trafiklab. se slash API. Let me move mysel


### Improve preprocessing by removing fillers and preventing the chunks from becoming one big blob of text

In [5]:
import re

# Raw snippet from API trafiklab (1).md
raw_text = """# API trafiklab

[00:00:00] Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api~~ we have ~~we have several APIs that we can work with. And~~ one, ~~the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and ~~see ~~see more details, how to work with this one. [00:01:00] And this re robot plan is used for ~~planning~~ planning your trip. For example, you want to travel ~~from Sweden to ~~from UBO to Stockholm, ~~for example. ~~Ubo to Malmo. You can find out. ~~The trips~~ which type of trains and buses there are and their stops, et cetera.
"""

# 1. Artifacts
step1 = re.sub(r"\[\d{2}:\d{2}:\d{2}\]", " ", raw_text)
step1 = re.sub(r"~~.*?~~", " ", step1)
print(f"--- STEP 1 (Artifacts Removed) ---\n{step1}\n")

--- STEP 1 (Artifacts Removed) ---
# API trafiklab

  Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api we have several APIs that we can work with. And the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and  see more details, how to work with this one.   And this re robot plan is used for   planning your trip. For example, you want to travel  from UBO to Stockholm,  Ubo to Malmo. You can find out.   whic

In [11]:
# 2. Fillers
# Remove conversational filler combinations
step2 = re.sub(r"(?i)\b(so|and|then|now)\b\s+(you|we|i)\s+\b(basically|actually|just)\b\s*", "", step1)

# Remove "so" at the start of paragraphs/lines FIRST (preserve the newlines)
step2 = re.sub(r"(?i)(\n+)\s*so\b(?!\s+(?:that|as))\s*", r"\1", step2)

# Remove "and" at the start of paragraphs/lines (preserve the newlines)
step2 = re.sub(r"(?i)(\n+)\s*and\b\s*", r"\1", step2)

# Remove "so" at the start of sentences (after punctuation)
step2 = re.sub(r"(?i)([.!?])\s+so\b(?!\s+(?:that|as))", r"\1 ", step2)

# Remove "and" at the start of sentences (after punctuation)
step2 = re.sub(r"(?i)([.!?])\s+and\b\s*", r"\1 ", step2)

print(f"--- STEP 2 (Fillers Removed) ---\n{step2}\n")

--- STEP 2 (Fillers Removed) ---
# API trafiklab

  Hello and welcome to this video where we'll go into getting data from an API. the API that we've chosen is Trafiklab. from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.

here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api we have several APIs that we can work with. the ones that we will pick are those that are in s robot.  here race robot stalled. It's tabella here. You can get the timetables for different stops.

we'll go in and  see more details, how to work with this one. this re robot plan is used for   planning your trip. For example, you want to travel  from UBO to Stockholm,  Ubo to Malmo. You can find out.   which type of trains and buses the
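The negative lookahead `(?!\s+(?:that|as))` used in the "so" patterns above is what keeps connectives such as "so that" intact. A minimal sketch on a made-up sentence (the sentence is an assumption; the pattern is copied from the cell above):

```python
import re

# Same pattern as the sentence-start "so" removal above
SO_START_RE = re.compile(r"(?i)([.!?])\s+so\b(?!\s+(?:that|as))")

# Hypothetical sample: one "So that" (should stay) and one filler "So" (should go)
sample = "We test. So that it works, we log. So we move on."

result = SO_START_RE.sub(r"\1 ", sample)
print(repr(result))  # 'We test. So that it works, we log.  we move on.'
```

The removal leaves a double space and a lowercase sentence start, matching the artifacts visible in the Step 2 output.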

In [14]:
# 3. The Split
paragraphs = step2.split("\n\n")
print(f"--- STEP 3 (Paragraph Count) ---\nDetected {len(paragraphs)} paragraphs.\n")

# 4. The Final Assembly
final_blocks = []
for i, para in enumerate(paragraphs):
    # Just strip whitespace from each paragraph, don't collapse internal lines
    clean_para = para.strip()
    if clean_para:  # Only add non-empty paragraphs
        final_blocks.append(clean_para)
        print(f"Paragraph {i} content: {clean_para[:50]}...")

final_text = "\n\n".join(final_blocks)

--- STEP 3 (Paragraph Count) ---
Detected 7 paragraphs.

Paragraph 0 content: # API trafiklab...
Paragraph 1 content: Hello and welcome to this video where we'll go int...
Paragraph 2 content: It's good to, understand a little bit about the da...
Paragraph 3 content: here I'm in the web browser and I've gone into Tra...
Paragraph 4 content: se slash API. Let me move myself here....
Paragraph 5 content: In Traffic Lab SC slash api we have several APIs t...
Paragraph 6 content: we'll go in and  see more details, how to work wit...


In [16]:
final_text = "\n\n".join(final_blocks)

# DEBUG: Show the actual structure
print("\n--- FINAL TEXT (repr) ---")
print(repr(final_text[:300]))
print("\n--- FINAL TEXT (normal) ---")
print(final_text)


--- FINAL TEXT (repr) ---
"# API trafiklab\n\nHello and welcome to this video where we'll go into getting data from an API. the API that we've chosen is Trafiklab. from this API, you will be able to get data on public transport.\n\nIt's good to, understand a little bit about the data set so that you could, for example, monitor if"

--- FINAL TEXT (normal) ---
# API trafiklab

Hello and welcome to this video where we'll go into getting data from an API. the API that we've chosen is Trafiklab. from this API, you will be able to get data on public transport.

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.

here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api we have several APIs that we can work with. the ones that we will pick are those that are in s robot.  here r

### Debug normalize_text(): find which regex pattern clumps paragraphs together after more patterns were added
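This kind of search can also be automated by counting paragraphs after every substitution and reporting the first step that lowers the count. A hedged sketch: `first_paragraph_merger` and `ALL_WHITESPACE_RE` are hypothetical names introduced for illustration, and the `\s+` pattern is a deliberately faulty example rather than a pattern confirmed to be in `normalize_transcripts.py`.

```python
import re


def first_paragraph_merger(text, steps):
    """Apply (name, pattern, replacement) steps in order.

    Returns the name of the first step after which the number of
    "\n\n"-separated paragraphs drops, or None if no step merges paragraphs.
    """
    count = len(text.split("\n\n"))
    for name, pattern, replacement in steps:
        text = pattern.sub(replacement, text)
        new_count = len(text.split("\n\n"))
        if new_count < count:
            return name
        count = new_count
    return None


steps = [
    ("HORIZONTAL_SPACE_RE", re.compile(r"[ \t]+"), " "),  # safe: ignores newlines
    ("ALL_WHITESPACE_RE", re.compile(r"\s+"), " "),       # faulty on purpose: eats "\n\n"
]

culprit = first_paragraph_merger("first  para\n\nsecond para", steps)
print(culprit)  # ALL_WHITESPACE_RE
```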

In [17]:
import re

# Copy all the regex patterns from normalize_transcripts.py
TIMESTAMP_RE = re.compile(r"\[\d{2}:\d{2}:\d{2}\]")
TILDE_RE = re.compile(r"~~.*?~~")
SPEAKER_LABEL_RE = re.compile(r"(?i)\*\*Kokchun Giang-\d+:\*\*\s*")
HORIZONTAL_SPACE_RE = re.compile(r"[ \t]+")

VERBAL_FILLERS_RE = re.compile(r"(?i)\b(basically|actually|sort of|kind of|you know|et cetera)\b")
SO_PARA_RE = re.compile(r"(?i)(\n+)\s*so\b(?!\s+(?:that|as))\s*")
SO_START_RE = re.compile(r"(?i)([.!?])\s+so\b(?!\s+(?:that|as))")
AND_PARA_RE = re.compile(r"(?i)(\n+)\s*and\b\s*")
AND_START_RE = re.compile(r"(?i)([.!?])\s+and\b\s*")

CONVERSATIONAL_RE = re.compile(
    r"(?i)\b(so|and|then|now)\b\s+(you|we|i)\s+\b(basically|actually|just|sort of|kind of)\b\s*"
)

# Test sample
raw_text = """It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab."""

print("=== ORIGINAL ===")
print(repr(raw_text))
print("\n" + raw_text + "\n")

# Step-by-step debugging
text = raw_text
text = TIMESTAMP_RE.sub(" ", text)
text = TILDE_RE.sub(" ", text)
text = SPEAKER_LABEL_RE.sub("", text)
print("After artifacts removal:")
print(repr(text))

text = CONVERSATIONAL_RE.sub("", text)
print("\nAfter CONVERSATIONAL_RE:")
print(repr(text))

text = VERBAL_FILLERS_RE.sub("", text)
print("\nAfter VERBAL_FILLERS_RE:")
print(repr(text))

text = SO_PARA_RE.sub(r"\1", text)
print("\nAfter SO_PARA_RE:")
print(repr(text))

text = AND_PARA_RE.sub(r"\1", text)
print("\nAfter AND_PARA_RE:")
print(repr(text))

text = SO_START_RE.sub(r"\1 ", text)
print("\nAfter SO_START_RE:")
print(repr(text))

text = AND_START_RE.sub(r"\1 ", text)
print("\nAfter AND_START_RE:")
print(repr(text))

# Check paragraph split
paragraphs = text.split("\n\n")
print(f"\n=== PARAGRAPH SPLIT: {len(paragraphs)} paragraphs ===")
for i, para in enumerate(paragraphs):
    print(f"Para {i}: {repr(para)}")

=== ORIGINAL ===
"It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.\n\nSo here I'm in the web browser and I've gone into Trafiklab."

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

After artifacts removal:
"It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.\n\nSo here I'm in the web browser and I've gone into Trafiklab."

After CONVERSATIONAL_RE:
"It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or t

In [18]:
import re

# Test sample with problematic punctuation
test_text = """It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.

here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here."""

print("=== ORIGINAL ===")
print(repr(test_text))
print("\n" + test_text + "\n")

text = test_text

# 2.6 Fix punctuation issues
print("\n--- Testing punctuation fixes ---\n")

# Remove multiple consecutive punctuation: ".. " or ". ." → "."
text = re.sub(r'([.!?,])[ \t]*\1+', r'\1', text)
print("After removing duplicate punctuation:")
print(repr(text))

# Clean up punctuation combinations: "., " or ".  ," → "."
text = re.sub(r'\.[ \t]*,[ \t]*', '. ', text)
print("\nAfter fixing ., pattern:")
print(repr(text))

text = re.sub(r',[ \t]*\.[ \t]*', '. ', text)
print("\nAfter fixing ,. pattern:")
print(repr(text))

# Remove stray punctuation at paragraph starts: "\n\n. " or "\n\n, "
text = re.sub(r'(\n\n+)[ \t]*[.,?!]+[ \t]*', r'\1', text)
print("\nAfter removing stray punctuation at paragraph starts:")
print(repr(text))

# 2.7 Capitalize first letter after sentence punctuation or paragraph start
# Use [ \t]+ rather than \s+ so the match cannot swallow a "\n\n" paragraph break
text = re.sub(r'([.!?])[ \t]+([a-z])', lambda m: m.group(1) + ' ' + m.group(2).upper(), text)
print("\nAfter capitalizing after punctuation:")
print(repr(text))

text = re.sub(r'(\n\n+)([a-z])', lambda m: m.group(1) + m.group(2).upper(), text)
print("\nAfter capitalizing after paragraph break:")
print(repr(text))

# Check paragraph split
paragraphs = text.split("\n\n")
print(f"\n=== FINAL PARAGRAPH SPLIT: {len(paragraphs)} paragraphs ===")
for i, para in enumerate(paragraphs):
    print(f"Para {i}: {repr(para)}")

=== ORIGINAL ===
"It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.\n\nhere I'm in the web browser and I've gone into Trafiklab.\n\nse slash API. Let me move myself here."

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.

here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.


--- Testing punctuation fixes ---

After removing duplicate punctuation:
"It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains.  yes moving on, we'll go into the web browser directly now.\n\nhere I'm in the web browser and I've gone into Trafiklab.\n\nse slash API. Let me move myself here."

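The capitalization step can be wrapped in a small helper that never touches paragraph breaks. This is a sketch, not the project's implementation: `capitalize_sentences` is a hypothetical name, and handling the very first character of the text via `\A` is an addition to the cell above.

```python
import re


def capitalize_sentences(text: str) -> str:
    """Capitalize sentence starts without merging paragraphs."""
    # Within a line: [ \t]+ matches spaces and tabs only, so "\n\n" is never consumed
    text = re.sub(
        r"([.!?])[ \t]+([a-z])",
        lambda m: f"{m.group(1)} {m.group(2).upper()}",
        text,
    )
    # At the start of the text and at the start of every paragraph
    text = re.sub(
        r"(\A|\n\n+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


# Made-up sample with a double space and a paragraph break
fixed = capitalize_sentences("one.  two\n\nthree. four")
print(repr(fixed))  # 'One. Two\n\nThree. Four'
```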
