### Read .md files

In [4]:
from pathlib import Path 

data_path = Path("data")

list(data_path.glob("*.md"))

[WindowsPath('data/An introduction to the vector database LanceDB.md'),
 WindowsPath('data/API trafiklab (1).md'),
 WindowsPath('data/API trafiklab.md'),
 WindowsPath('data/Azure static web app deploy react app.md'),
 WindowsPath('data/Chat with your excel data - xlwings lite (1).md'),
 WindowsPath('data/Chat with your excel data - xlwings lite.md'),
 WindowsPath('data/Course structure for Azure two weeks course.md'),
 WindowsPath('data/Data platform course structure.md'),
 WindowsPath('data/data processing course  structure.md'),
 WindowsPath('data/data storytelling.md'),
 WindowsPath('data/dbt modeling snowflake.md'),
 WindowsPath('data/docker setup windows.md'),
 WindowsPath('data/FastAPI and scikit-learn API connect to streamlit frontend.md'),
 WindowsPath('data/Fastapi CRUD app.md'),
 WindowsPath('data/Hands on regularization.md'),
 WindowsPath('data/How does LLM work_.md'),
 WindowsPath('data/Logistic regression hands on with scikit learn.md'),
 WindowsPath('data/Logistic regress
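
The listing contains pairs such as `API trafiklab.md` and `API trafiklab (1).md`. Whether these are true duplicates is an assumption, not something the listing proves. A minimal sketch of one way to check, grouping files by a hash of their bytes (the helper name `find_identical_files` is hypothetical):

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_identical_files(folder: Path, pattern: str = "*.md") -> list[list[str]]:
    """Group file names whose bytes are identical.

    Only groups with two or more files are returned.
    """
    by_digest = defaultdict(list)
    for path in sorted(folder.glob(pattern)):
        # Identical content gives an identical SHA-256 digest
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]


# Possible usage in the notebook (assumes the same "data" folder as above):
# find_identical_files(Path("data"))
```

An empty result would mean no two files are byte-identical; files that differ only slightly would not be caught by this check.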

### Quick checks
- confirm table and columns

In [2]:
import lancedb
from backend.constants import VECTOR_DATABASE_PATH

db = lancedb.connect(uri=VECTOR_DATABASE_PATH)
tbl = db["transcripts"]
df = tbl.to_pandas()
print(df.columns)

Index(['md_id', 'filepath', 'filename', 'content', 'embedding'], dtype='object')


In [3]:
print(df.shape)

(53, 5)


In [4]:
print(df.head(1).T)

                                                           0
md_id         An introduction to the vector database LanceDB
filepath   C:\Users\Katrin\Documents\github\yt-rag-assist...
filename      An introduction to the vector database LanceDB
content    # An introduction to the vector database Lance...
embedding  [-0.038686633, 0.0036908067, 0.02178414, -0.07...


#### Check the embedding column and inspect a vector:
- does it exist?
- is it a list/array?
- does the length equal the embedding dim?

In [5]:
emb = df.loc[0, "embedding"]
print(type(emb), len(emb))

<class 'numpy.ndarray'> 3072
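
The cell above inspects only row 0. A minimal sketch for running the same length check over every row, written in plain Python so it accepts lists or NumPy arrays alike (the helper names are hypothetical):

```python
from collections import Counter
from typing import Iterable, Sequence


def embedding_lengths(vectors: Iterable[Sequence[float]]) -> Counter:
    """Count how many vectors have each length."""
    return Counter(len(v) for v in vectors)


def all_have_dim(vectors: Iterable[Sequence[float]], expected_dim: int) -> bool:
    """True only if there is at least one vector and every vector has expected_dim entries."""
    return set(embedding_lengths(vectors)) == {expected_dim}


# Possible usage in the notebook (3072 is the length printed above):
# print(embedding_lengths(df["embedding"]))
# print(all_have_dim(df["embedding"], 3072))
```

A `Counter` with more than one key would point to rows embedded with a different model or to a failed embedding call.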


#### Sanity check: vector norms and non-zero check
- norms > 0 (not all zeros)
- they should be on a roughly similar scale.

In [7]:
import numpy as np
embs = df["embedding"].apply(lambda x: np.array(x, dtype=float))
norms = embs.apply(np.linalg.norm)
print("min, median, max norm:", norms.min(), norms.median(), norms.max())

min, median, max norm: 0.9999998880146814 1.0000000087852439 1.0000001546808057
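
All three norms are 1 to within about 1e-7, so the stored vectors are unit-length (L2-normalised). For unit vectors the dot product equals the cosine similarity. A minimal sketch of that relationship on toy 2-d vectors (the vectors and helper names are made up for illustration, not taken from the table):

```python
import math
from typing import Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (l2_norm(a) * l2_norm(b))


def is_unit_norm(v: Sequence[float], tol: float = 1e-3) -> bool:
    """True if the vector length is within tol of 1."""
    return math.isclose(l2_norm(v), 1.0, abs_tol=tol)


# Toy unit vectors, chosen only for illustration
a = [0.6, 0.8]
b = [1.0, 0.0]
print(is_unit_norm(a), is_unit_norm(b))
# For unit vectors these two values agree up to floating-point rounding
print(dot(a, b), cosine_similarity(a, b))
```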


## Data preprocessing for Transcripts
- Removing Non-Semantic Metadata (Timestamps - strings like [00:01:23] which are noise)
- Eliminating Transcription Artifacts (Strikethroughs)
- Structural Flattening (Whitespace & Line Breaks)
- Reducing "Token Waste": collapsing runs of whitespace (\s+) into a single space and removing fillers (filler removal is not implemented in clean_transcript below)
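
The last bullet mentions removing fillers. A minimal sketch of what such a step could look like, assuming a small hand-picked filler list (the list `FILLERS` and the function `remove_fillers` are hypothetical and not part of the notebook):

```python
import re

# Hypothetical filler list, for illustration only
FILLERS = ("um", "uh", "you know")


def remove_fillers(text: str, fillers: tuple[str, ...] = FILLERS) -> str:
    """Delete whole-word fillers (case-insensitive) and tidy the whitespace."""
    # Longest first, so multi-word fillers are tried before shorter ones
    alternatives = "|".join(
        re.escape(f) for f in sorted(fillers, key=len, reverse=True)
    )
    # \b stops "um" from matching inside "umbrella";
    # a comma directly after the filler is dropped with it
    pattern = re.compile(rf"\b(?:{alternatives})\b,?", flags=re.IGNORECASE)
    text = pattern.sub("", text)
    # Collapse the gaps left behind
    return re.sub(r"\s+", " ", text).strip()


print(remove_fillers("So um we open the uh browser you know"))
```

A fixed word list is blunt: phrases like "you know" can also carry meaning, so the list would need checking against real transcripts before use.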

In [1]:
import re

def clean_transcript(text: str) -> str:
    # Remove timestamps: [00:00:00]
    text = re.sub(r"\[\d{2}:\d{2}:\d{2}\]", "", text)
    
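    # Remove strikethrough artifacts: ~~example~~
    # '.*?' is non-greedy, so each ~~...~~ pair is removed on its own instead of
    # everything between the first and last tildes on a line.
    # Substitute a space so neighbouring words are not fused ("api~~ we have ~~we" -> "api we");
    # the extra whitespace is collapsed further down.
    text = re.sub(r"~~.*?~~", " ", text)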
    
    # Clean up excessive line breaks and leading/trailing whitespace
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = " ".join(lines)
    
    # Collapse multiple spaces into one
    text = re.sub(r"\s+", " ", text)
    
    return text.strip()
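
A minimal sketch of why the strikethrough pattern is non-greedy, on a made-up line. The matches are replaced with a space here so the neighbouring words stay separated:

```python
import re

line = "keep ~~drop one~~ keep too ~~drop two~~ end"


def squeeze(text: str) -> str:
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Non-greedy: each ~~...~~ pair is matched separately
non_greedy = squeeze(re.sub(r"~~.*?~~", " ", line))

# Greedy: one match from the first ~~ to the last ~~ on the line
greedy = squeeze(re.sub(r"~~.*~~", " ", line))

print(non_greedy)  # keep keep too end
print(greedy)      # keep end
```

Without `re.DOTALL`, `.` does not match a newline, so a struck-out span that continues onto the next line is left untouched by either pattern.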

In [2]:
raw_sample = """
[00:00:00] Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api~~ we have ~~we have several APIs that we can work with. And~~ one, ~~the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and ~~see ~~see more details, how to work with this one. [00:01:00] And this re robot plan is used for ~~planning~~ planning your trip. For example, you want to travel ~~from Sweden to ~~from UBO to Stockholm, ~~for example. ~~Ubo to Malmo. You can find out. ~~The trips~~ which type of trains and buses there are and their stops, et cetera.
"""

cleaned_sample = clean_transcript(raw_sample)

print("--- RAW ---")
print(raw_sample)
print("\n--- CLEANED ---")
print(cleaned_sample)

--- RAW ---

[00:00:00] Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. 

It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now.

So here I'm in the web browser and I've gone into Trafiklab.

se slash API. Let me move myself here.

In Traffic Lab SC slash api~~ we have ~~we have several APIs that we can work with. And~~ one, ~~the ones that we will pick are those that are in s robot. So here race robot stalled. It's tabella here. You can get the timetables for different stops.

So we'll go in and ~~see ~~see more details, how to work with this one. [00:01:00] And this re robot plan is used for ~~planning~~ planning your trip. For example, you want to travel ~~from Sweden to ~~from UBO to Stockholm, ~~f

In [3]:
from pathlib import Path

file_path = Path("data/API trafiklab.md")

if file_path.exists():
    raw_text = file_path.read_text(encoding="utf-8")
    processed_text = clean_transcript(raw_text)
    
    # Preview the first 500 characters
    print(processed_text[:500])
else:
    print("File not found. Please check the path.")

# API trafiklab Hello and welcome to this video where we'll go into getting data from an API. And the API that we've chosen is Trafiklab. And from this API, you will be able to get data on public transport. It's good to, understand a little bit about the data set so that you could, for example, monitor if there's delays in the trams or trains. So yes moving on, we'll go into the web browser directly now. So here I'm in the web browser and I've gone into Trafiklab. se slash API. Let me move mysel
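
The cell above cleans a single file. A minimal sketch for running a cleaning function over a whole folder and comparing character counts before and after, as a rough proxy for the "token waste" removed (the helper `cleaning_report` is hypothetical; the cleaning function is passed in so the block stays self-contained):

```python
from pathlib import Path
from typing import Callable


def cleaning_report(
    folder: Path, clean: Callable[[str], str]
) -> list[tuple[str, int, int]]:
    """Return (file name, raw length, cleaned length) for each .md file in folder."""
    report = []
    for path in sorted(folder.glob("*.md")):
        raw = path.read_text(encoding="utf-8")
        report.append((path.name, len(raw), len(clean(raw))))
    return report


# Possible usage in the notebook, with clean_transcript defined earlier:
# for name, before, after in cleaning_report(Path("data"), clean_transcript):
#     print(f"{name}: {before} -> {after} characters")
```

Character counts are only an approximation of token counts; the exact saving depends on the tokenizer of the embedding model.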
