# Vector Store Creation

This notebook demonstrates how to read markdown files, process them, and add them to the vector store.

In [1]:
from ragchallenge.api.utils.documentstore import DocumentStore



## Instantiate the Document Store

In [3]:
database = DocumentStore(
    model_name="thenlper/gte-small",
    persist_directory="../data/vectorstore",
    device="mps",
)

## Process Markdown Files

First, we read the markdown files as plain text and convert them into LangChain documents.

In [4]:
directory_path = "../data/raw/"
documents = database.load_markdown_documents(directory_path)
print("Number of documents: ", len(documents))

Number of documents:  3
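
The implementation of `load_markdown_documents` is not shown in this notebook. As a rough illustration of the idea only, the sketch below reads every `.md` file in a directory as plain text and wraps it in a small text-plus-metadata container. `PlainDocument` and `load_markdown_as_text` are hypothetical names invented for this example; `PlainDocument` merely mimics the `page_content` and `metadata` attributes that the results printed later in this notebook expose.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PlainDocument:
    """Hypothetical stand-in for a document object: raw text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_markdown_as_text(directory):
    """Read every *.md file in `directory` as plain UTF-8 text (no markdown parsing)."""
    documents = []
    # Sort the paths so the document order is deterministic across runs.
    for path in sorted(Path(directory).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        documents.append(
            PlainDocument(page_content=text, metadata={"source": str(path)})
        )
    return documents
```

Under these assumptions, `load_markdown_as_text("../data/raw/")` would return one object per markdown file, with the file path stored under the `source` metadata key.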


Now we split the documents at each markdown `##` header, assuming that everything within a section belongs to the same topic.

In [5]:
split_documents = database.split_documents_by_header(documents, header="##")
print("Number of documents after splitting by header: ", len(split_documents))


Number of documents after splitting by header:  370
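
To make the header-based split concrete, here is a minimal pure-Python sketch of the idea. It is not the code behind `split_documents_by_header`, which this notebook does not show; `split_by_header` is a hypothetical helper that cuts a markdown string at every line beginning with the given header marker followed by a space.

```python
def split_by_header(text, header="##"):
    """Split markdown text into (title, body) pairs at lines starting with `header` + space.

    Text that appears before the first matching header is returned with title None.
    Deeper headers (for example "###") stay inside the body of their parent section.
    """
    prefix = header + " "
    sections = []
    title, body = None, []

    def flush():
        # Keep a section only if it has a title or some non-blank content.
        if title is not None or any(line.strip() for line in body):
            sections.append((title, "\n".join(body).strip()))

    for line in text.splitlines():
        if line.startswith(prefix):
            flush()
            title, body = line[len(prefix):].strip(), []
        else:
            body.append(line)
    flush()
    return sections
```

One limitation of this sketch: a line starting with `## ` inside a fenced code block (for instance a comment in a shell snippet) would also be treated as a section header.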


Finally, we split the subsections into overlapping chunks of at most 192 tokens (with a 64-token overlap) so that each chunk is a manageable size for the encoder.

In [6]:
documents_chunked = database.split_documents_by_token_count(split_documents, chunk_size=192, chunk_overlap=64)
print("Number of documents after chunking: ", len(documents_chunked))

Number of documents after chunking:  926
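
The sliding-window logic behind overlapping chunks can be sketched as follows. This is an illustration under assumptions, not the body of `split_documents_by_token_count`: the hypothetical `chunk_tokens` helper works on any list of tokens, whereas the real method presumably counts tokens with the encoder's own tokenizer.

```python
def chunk_tokens(tokens, chunk_size=192, chunk_overlap=64):
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    # Each new window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the settings used above, each window starts 128 tokens after the previous one, so the last 64 tokens of one chunk reappear as the first 64 tokens of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.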


## Add Documents to Database

In [7]:
database.add_documents_to_vector_store(documents_chunked)
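
Conceptually, adding documents to a vector store means embedding each chunk and saving the vector alongside its text, so that a later query can be embedded the same way and compared against the stored vectors. The toy sketch below illustrates that flow with the standard library only. Everything in it is an assumption made for illustration: `ToyVectorStore`, `embed`, and `cosine` are hypothetical names, and the "embedding" is a plain word-count vector rather than the output of the `thenlper/gte-small` model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real store would call the encoder here."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector store (no persistence, no real encoder)."""

    def __init__(self):
        self._entries = []  # list of (vector, text) pairs

    def add_documents(self, texts):
        # Embed each chunk once at insertion time and keep it next to its text.
        for text in texts:
            self._entries.append((embed(text), text))

    def query(self, question, k=2):
        # Embed the question the same way, then rank stored chunks by similarity.
        q = embed(question)
        ranked = sorted(self._entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

A real vector store differs mainly in scale and quality: it uses dense embeddings from a trained model, persists them to disk, and usually relies on an approximate nearest-neighbour index instead of sorting every entry.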

## Test Retriever

In [8]:
# Query the vector store
user_query = "How to start conda?"
results = database.query_vector_store(user_query)

# Print results
for result_id, result in enumerate(results):
    print(f"\n Document {result_id + 1}:")
    print(result.metadata)
    print(result.page_content)


{'cleaned_source': 'conda tutorial', 'cleaned_title': 'GettingStartedWithConda', 'source': '../data/raw/conda-tutorial.md', 'title': '1.3 **Getting Started With Conda**'}
Conda is a powerful package manager and environment manager that you use with command line commands at the Anaconda Prompt for Windows, or in a Terminal window for macOS or Linux.

This 20-minute guide to getting started with conda lets you try out the major features of conda. You should understand how conda works when you finish this guide.

SEE ALSO: Getting started with Anaconda Navigator, a graphical user interface that lets you use conda in a weblike interface without having to enter manual commands. Compare the Getting started guides for each to see which program you prefer.

{'cleaned_source': 'conda tutorial', 'cleaned_title': 'GettingStartedWithConda', 'source': '../data/raw/conda-tutorial.md', 'title': '1.3 **Getting Started With Conda**'}
Conda is a powerful package manager and environment manager that you