# Vector Store Creation

This notebook reads markdown files, splits and augments them, and adds the resulting chunks to the vector store.

In [1]:
from ragchallenge.api.interfaces.database import DocumentStore
from ragchallenge.api.interfaces.generator import HypotheticalQuestionGenerator



## Instantiate the Document Store

In [2]:
database = DocumentStore(model_name = "thenlper/gte-small",
                            persist_directory = "./data/vectorstore_augmented",
                            device = "mps")

## Process Markdown Files

First, we read the markdown files as plain text and convert them into LangChain documents.

In [3]:
directory_path = "../data/raw/"
documents = database.load_markdown_documents(directory_path)
print("Number of documents: ", len(documents))

Number of documents:  3


Next, we split the documents at each markdown header "##", assuming that everything within such a section covers the same topic.

In [4]:
documents_splited = database.split_documents_by_header(documents, header="##")
print("Number of documents after splitting by header: ", len(documents_splited))

Number of documents after splitting by header:  370
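
The internals of `split_documents_by_header` are not shown in this notebook. As a rough sketch of the idea, assuming a section starts at every line that begins with `## ` and runs until the next such line, a plain-Python version might look like the following. `split_by_header` is a hypothetical helper written for illustration, not part of `DocumentStore`.

```python
from typing import Dict, List


def split_by_header(text: str, header: str = "##") -> List[Dict[str, str]]:
    """Split markdown text into sections that start at the given header level.

    Hypothetical helper for illustration; not the DocumentStore implementation.
    Text before the first matching header is kept as a section with an empty title.
    """
    prefix = header + " "
    sections: List[Dict[str, str]] = []

    def flush(section_title: str, lines: List[str]) -> None:
        # Store the section unless it has neither a title nor any content
        content = "\n".join(lines).strip()
        if section_title or content:
            sections.append({"title": section_title, "content": content})

    title: str = ""
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith(prefix):
            # A new header closes the previous section
            flush(title, body)
            title, body = line[len(prefix):].strip(), []
        else:
            # Deeper headers such as "###" stay inside the current section
            body.append(line)
    flush(title, body)
    return sections


sample = "# Guide\nIntro text.\n## Install\nRun the installer.\n### Details\nMore.\n## Use\nCall conda."
for section in split_by_header(sample):
    print(repr(section["title"]), "->", repr(section["content"]))
```

This sketch does not special-case fenced code blocks, so a line starting with `## ` inside a code sample would also be treated as a header.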


We filter out very short sections, here those below the 25-token minimum.

In [5]:
documents_splited = database.filter_documents_by_token_length(documents_splited, min_token_length=25)
print("Number of documents after filtering by token length: ", len(documents_splited))

Number of documents after filtering by token length:  357
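
In the run above, this step drops 13 of the 370 sections. The filter itself can be sketched in a few lines. Two things in this sketch are assumptions: tokens are counted by whitespace splitting rather than with the encoder's tokenizer, and a section with exactly `min_token_length` tokens is kept. `filter_by_token_length` is a hypothetical stand-in, not the `DocumentStore` method.

```python
from typing import Callable, List


def filter_by_token_length(
    texts: List[str],
    min_token_length: int = 25,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Keep only texts with at least `min_token_length` tokens.

    `tokenize` defaults to whitespace splitting, which is an assumption made
    for this sketch; a model tokenizer would give different counts.
    """
    return [text for text in texts if len(tokenize(text)) >= min_token_length]


sections = ["See also.", "word " * 30, "TIP: read the guide first."]
kept = filter_by_token_length(sections, min_token_length=25)
print(f"Kept {len(kept)} of {len(sections)} sections")
```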


Finally, we split the sections into chunks of 256 tokens with a 64-token overlap, a size the encoder can handle.

In [6]:
documents_chunked = database.split_documents_by_token_count(documents_splited, chunk_size=256, chunk_overlap=64)
print("Number of documents after chunking: ", len(documents_chunked))

Number of documents after chunking:  788
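
With `chunk_size=256` and `chunk_overlap=64`, each new chunk advances by 256 - 64 = 192 tokens and repeats the last 64 tokens of the previous chunk. The sketch below shows this sliding window on a plain token list. It is a simplified illustration under that assumption: the actual `split_documents_by_token_count` may use a different tokenizer or prefer to cut at sentence or paragraph boundaries.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 256, chunk_overlap: int = 64
) -> List[List[str]]:
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    stride = chunk_size - chunk_overlap  # number of new tokens per chunk
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [256, 256, 216]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.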


## Augment Documents with Hypothetical Questions

In [7]:
import os
from langchain.prompts import ChatPromptTemplate
from langchain.schema import SystemMessage, HumanMessage
from langchain_huggingface import ChatHuggingFace
from langchain_huggingface import HuggingFaceEndpoint

# Define the prompt template to generate hypothetical questions
messages_hypothetical = [
    SystemMessage(
        role="system",
        content="Generate 5 hypothetical questions based on the following text. "
                "The results should be formatted as a list, with each question separated by a newline."
    ),
    HumanMessage(
        role="user",
        content="Here is the text: {text}\n"
                "Generate 5 hypothetical questions about the above text."
    ),
]

# Create the ChatPromptTemplate from the messages
prompt_template_hypothetical = ChatPromptTemplate.from_messages(
    [(msg.role, msg.content) for msg in messages_hypothetical]
)

# Define the Hugging Face model to use for generating hypothetical questions
repo_id = "HuggingFaceH4/zephyr-7b-beta"  # Model ID from Hugging Face
task = "text-generation"  # Task type

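# Parameters for generation (adjust as needed).
# Note: "max_length" and "sampling" are not HuggingFaceEndpoint fields, so they are
# passed through model_kwargs (see the warning in the cell output below). The
# endpoint's own fields for these settings are "max_new_tokens" and "do_sample".
generation_params = {
    "temperature": 0.7,
    "max_length": 512,
    "top_p": 0.9,
    "repetition_penalty": 1.2,
    "sampling": True,
}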

# Create the Hugging Face Endpoint using the specified parameters
endpoint = HuggingFaceEndpoint(
    repo_id=repo_id,
    task=task,
    **generation_params,  # Pass the generation parameters 
)

# Wrap the endpoint in a LangChain chat model
llm = ChatHuggingFace(llm=endpoint)

generator = HypotheticalQuestionGenerator(
    model=llm, prompt_template=prompt_template_hypothetical)

# # Example text input
# document = "Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda quickly installs, runs, and updates packages and their dependencies."

# # Generate hypothetical questions
# questions = generator.generate(document)

# # Output the generated hypothetical questions
# print("\nGenerated Hypothetical Questions:")
# for idx, question in enumerate(questions, 1):
#     print(f"{idx}. {question}")

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.
                    sampling was transferred to model_kwargs.
                    Please make sure that sampling is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/julianschelb/.cache/huggingface/token
Login successful


In [8]:
from typing import List
from langchain.schema import Document
import time

def prepend_generated_questions(generator, documents: List[Document]) -> List[Document]:
    """
    Prepend generated hypothetical questions to the page content for a list of documents.

    :param generator: The question generator (HypotheticalQuestionGenerator) object.
    :param documents: List of Document objects.
    :return: List of updated Document objects with prepended hypothetical questions.
    """
    updated_documents = []

    for document in documents:
        # Extract the original page content
        original_content = document.page_content
        
        # Generate 5 alternative questions based on the document's content
        questions = generator.generate(document.page_content)
        
        # Format the questions as a list
        questions_content = "\n".join(f"- {question}" for question in questions)
        
        # Prepend the generated questions to the page content
        new_content = (
            f"Related Questions:\n"
            f"{questions_content}\n\n"
            f"Page Content:\n{original_content}"
        )
        
        # Create a new document with the updated content and the same metadata
        updated_doc = document.model_copy(update={"page_content": new_content})
        updated_documents.append(updated_doc)
        time.sleep(0.2)  # Sleep to avoid rate limiting

    return updated_documents
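
The fixed `time.sleep(0.2)` spaces out requests, but a single failed call would still abort the whole loop over all chunks. One possible hardening, shown here only as a sketch, is to retry a failed call with exponentially growing delays. `call_with_backoff` and its parameters are hypothetical names introduced for illustration, and the demo uses a stand-in function instead of the real `generator.generate`.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying with exponentially growing delays on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    attempt = 1
    while True:
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5 s, 1 s, 2 s, ...
            attempt += 1


# Demo with a stand-in that fails twice before succeeding
attempts = {"count": 0}


def flaky_generate() -> List[str]:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated rate limit")
    return ["What is conda?"]


delays: List[float] = []
questions = call_with_backoff(flaky_generate, sleep=delays.append)  # record delays instead of sleeping
print(questions, delays)
```

Inside `prepend_generated_questions`, the generation call could then be wrapped as `call_with_backoff(lambda: generator.generate(original_content))`.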

In [9]:
documents_augmented = prepend_generated_questions(generator, documents_chunked)

In [10]:
print(documents_augmented[0].page_content)

Related Questions:
- 1. How can I quickly start using conda and what resources are available for learning the basics?
- 2. What are the different functions that I can perform using the conda command? Can you provide examples of frequently used command options?
- 3. How can I abbreviate command options in conda for easier usage? Is there a limitation to which options can be abbreviated?
- 4. What is the best way to access detailed information about each con

Page Content:
This page provides an overview of how to use conda. For an overview of what conda is and what it does, please see the *front page*.

The quickest way to start using conda is to go through the 20-minute *Getting started with conda* guide.

The conda command is the primary interface for managing installations of various packages. It can:
- Query and search the Anaconda package index and current Anaconda installation.

- Create new conda environments.

- Install and update packages into existing conda environments.

TIP: 
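
In the output above, each question appears as `- 1. ...` because the model numbers its own answers and the function then adds a `- ` bullet. If a single list marker is preferred, the generated lines could be normalized before they are joined. The following is a minimal sketch; `clean_questions` is a hypothetical helper and assumes the generator returns one question per list item.

```python
import re
from typing import List

# Matches a leading list marker such as "1.", "2)", "-", or "*"
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+")


def clean_questions(raw_lines: List[str]) -> List[str]:
    """Strip leading list markers and drop empty lines from generated questions."""
    cleaned = []
    for line in raw_lines:
        question = _LIST_MARKER.sub("", line).strip()
        if question:
            cleaned.append(question)
    return cleaned


raw = [
    "1. How can I start using conda?",
    "",
    "2) What does conda do?",
    "- Can options be abbreviated?",
]
print(clean_questions(raw))
```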

## Augment Documents with Metadata

In [11]:
from typing import List
from langchain.schema import Document

def prepend_metadata_to_content(documents: List[Document]) -> List[Document]:
    """
    Prepend cleaned title and source to the page content and store the original content in metadata.

    :param documents: List of Document objects.
    :return: List of updated Document objects with prepended metadata.
    """
    updated_documents = []

    for document in documents:
        # Extract the cleaned title and source from the metadata
        cleaned_title = document.metadata.get("cleaned_title", "")
        cleaned_source = document.metadata.get("cleaned_source", "")
        
        # Keep the incoming page content so it can be stored in metadata below
        original_content = document.page_content
        
        # Prepend the metadata to the page content
        new_content = (
            f"Page title: {cleaned_title}\n"
            f"Filename: {cleaned_source}\n"
            f"\n{document.page_content}"
        )
        
        # Create a new document with the updated content and metadata including the original content
        updated_doc = document.model_copy(update={
            "page_content": new_content,
            "metadata": {**document.metadata, "original_page_content": original_content}
        })
        
        updated_documents.append(updated_doc)

    return updated_documents

In [12]:
documents_augmented = prepend_metadata_to_content(documents_augmented)

In [13]:
print(documents_augmented[3].page_content)

Page title:  Conda Environments
Filename: conda tutorial

Related Questions:
- 1. What are conda environments, and how can they be useful in managing different versions of packages? Provide an example to elaborate on the concept.
- 2. Why is it important to have different environments for different versions of packages, and how can changing one environment without affecting others be achieved?
- 3. Can you explain how to activate or deactivate environments in conda, and what happens when an environment is activated?
- 4. How can I share a specific collection of

Page Content:
A conda environment is a directory that contains a specific collection of conda packages that you have installed.

For example, you may have one environment with NumPy 1.7 and its dependencies, and another environment with NumPy 1.6 for legacy testing. If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them. Y

## Add Documents to Database

In [14]:
database.add_documents_to_vector_store(documents_augmented)

## Test Retriever

In [15]:
# Query the vector store
#user_query = "How to start conda?"
user_query = "How to create a git repo?"
results = database.query_vector_store(user_query)

# Print results
for result_id, result in enumerate(results):
    print(f"\nDocument {result_id + 1}:")
    #print(result.metadata)
    print(result.page_content)


Document 1:
Page title: Initializing A Repository In An Existing Directory
Filename: git tutorial

Related Questions:
- 1. If I have a directory that is not currently being version controlled with Git, what command do I type to start controlling it with Git? (Answer: $ git init)
- 2. Where in the file system should I navigate to in order to type the command to start controlling my project directory with Git? (Answer: To the project directory)
- 3. How do the directions for navigating to the project directory differ depending on the operating system? (Answer:

Page Content:
If you have a project directory that is currently not under version control and you want to start controlling it with Git, you first need to go to that project's directory. If you've never done this, it looks a little different depending on which system you're running: for Linux:
$ cd /home/user/my_project for macOS:
$ cd /Users/user/my_project for Windows:
$ cd C:/Users/user/my_project and type:
$ git init This cre
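
The retrieved text includes the title, filename, and generated questions that were prepended for embedding. If a downstream step only needs the original chunk text, it can be recovered by splitting on the `Page Content:` marker written by `prepend_generated_questions`. This is a minimal sketch: `extract_page_content` is a hypothetical helper, and it assumes the marker does not occur earlier in the augmented text.

```python
def extract_page_content(augmented_text: str, marker: str = "Page Content:\n") -> str:
    """Return the text after the first `marker`, or the input unchanged if it is absent."""
    _, found, remainder = augmented_text.partition(marker)
    return remainder if found else augmented_text


augmented = (
    "Page title: Conda Environments\n"
    "Filename: conda tutorial\n"
    "\n"
    "Related Questions:\n"
    "- What are conda environments?\n"
    "\n"
    "Page Content:\n"
    "A conda environment is a directory that contains a specific collection of conda packages."
)
print(extract_page_content(augmented))
```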