# **Text Embeddings**
## Create Chunks
If you are working with large documents, split them into chunks sized to match the queries you expect.
- You can chunk at the sentence level or at the paragraph level.
- You can add context to each chunk, such as the document title or some of the text before or after the chunk.
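
As a rough sketch of the second point, the helper below prefixes each chunk with the document title and pads it with neighboring chunks. The name `add_context`, the `overlap` parameter, and the `title: text` layout are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
def add_context(chunks, title, overlap=1):
    """Prefix each chunk with the title and pad it with neighboring chunks.

    `overlap` is a hypothetical parameter: the number of neighboring
    chunks to include on each side of the current chunk.
    """
    contextualized = []
    for i, chunk in enumerate(chunks):
        before = chunks[max(0, i - overlap):i]    # chunks just before this one
        after = chunks[i + 1:i + 1 + overlap]     # chunks just after this one
        body = " ".join(before + [chunk] + after)
        contextualized.append(f"{title}: {body}")
    return contextualized


sentences = [
    "A perceptron is a linear classifier.",
    "It learns weights from data.",
    "It outputs one of two classes.",
]
for item in add_context(sentences, title="Perceptrons"):
    print(item)
```

Larger overlaps give each chunk more surrounding context but also make neighboring chunks more similar to one another, so the value is worth tuning against the queries you expect.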

In [None]:
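def split_text(text, max_length, min_length):
    """Split text into word-aligned chunks of roughly min_length to max_length characters."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0  # length of " ".join(current_chunk)

    for word in words:
        # Length of the chunk if this word were appended (plus a separating space)
        new_length = current_length + len(word) + (1 if current_chunk else 0)

        # Flush first if the word would push the chunk past max_length
        if current_chunk and new_length > max_length:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            new_length = len(word)

        current_chunk.append(word)
        current_length = new_length

        # Close the chunk as soon as it is longer than min_length
        if current_length > min_length:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0

    # If the last chunk didn't reach the minimum length, add it anyway
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks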


## Search Index
Before we can retrieve anything, we need to build a search index over our knowledge base. The index stores our embeddings and can quickly return the chunks most similar to a query, even in a large database. We can create the index locally with scikit-learn:

In [None]:
from sklearn.neighbors import NearestNeighbors

embeddings = flattened_df["embeddings"].to_list()

# Create the search index
nbrs = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(embeddings)

# To query the index, use the kneighbors method. Here every stored embedding
# is used as a query, so each row's closest neighbor is typically itself.
distances, indices = nbrs.kneighbors(embeddings)
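
To make the query step concrete, here is a brute-force sketch in plain Python of the kind of result a nearest-neighbor query returns: distances and row indices, sorted from closest to farthest by Euclidean distance. The function `brute_force_kneighbors` and the toy vectors are made up for illustration; a real index such as a ball tree avoids comparing the query against every row.

```python
import math


def brute_force_kneighbors(embeddings, query, n_neighbors=2):
    """Return (distances, indices) of the n_neighbors rows closest to query."""
    scored = [
        (math.dist(vector, query), i) for i, vector in enumerate(embeddings)
    ]
    scored.sort()  # smallest Euclidean distance first
    top = scored[:n_neighbors]
    return [d for d, _ in top], [i for _, i in top]


toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]  # made-up 2-D vectors
distances, indices = brute_force_kneighbors(toy_embeddings, [0.9, 0.0])
print(indices)  # [1, 0]
print(distances)
```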


## Re-ranking
Once you have queried the index, you may need to sort the results by relevance. A re-ranking LLM uses machine learning to improve search results by reordering the retrieved chunks from most to least relevant to the query.
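
As a simple stand-in for a learned re-ranker, the sketch below reorders retrieved candidates by cosine similarity to the query vector. The names `cosine_similarity` and `rerank_by_cosine` and the toy vectors are assumptions for illustration; a real re-ranking model would score each (query, chunk) pair with a trained model instead of a fixed similarity formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank_by_cosine(query_vector, candidates):
    """Sort (chunk, vector) pairs from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), chunk)
        for chunk, vector in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]


# Made-up candidates, as if returned by the index in arbitrary order
candidates = [
    ("chunk about pandas", [0.0, 1.0]),
    ("chunk about perceptrons", [1.0, 0.1]),
    ("chunk about neurons", [0.7, 0.7]),
]
print(rerank_by_cosine([1.0, 0.0], candidates))
```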

In [None]:
# Find the most similar documents
# (query_vector is the embedding of the user's question)
distances, indices = nbrs.kneighbors([query_vector])

# Print the three most similar documents, closest first
for distance, index in zip(distances[0][:3], indices[0][:3]):
    if index < len(flattened_df):
        print(flattened_df["chunks"].iloc[index])
        print(flattened_df["path"].iloc[index])
        print(distance)
    else:
        print(f"Index {index} not found in DataFrame")


In [None]:
user_input = "what is a perceptron?"


def chatbot(user_input):
    # Convert the question to a query vector
    query_vector = create_embeddings(user_input)

    # Find the most similar documents
    distances, indices = nbrs.kneighbors([query_vector])

    # Add the retrieved chunks to the query to provide context
    history = []
    for index in indices[0]:
        history.append(flattened_df["chunks"].iloc[index])

    # Combine the retrieved context and the user input into a single prompt
    history.append(user_input)
    prompt = "\n\n".join(history)

    # Create the messages list, sending the context along with the question
    messages = [
        {
            "role": "system",
            "content": "You are an AI assistant that helps with AI questions.",
        },
        {"role": "user", "content": prompt},
    ]

    # Use chat completion to generate a response
    response = openai.chat.completions.create(
        model="gpt-4", temperature=0.7, max_tokens=800, messages=messages
    )

    return response.choices[0].message


chatbot(user_input)
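
One way to check that the retrieved chunks actually reach the model is to build the prompt in a small helper that can be tested without calling the API. The function `build_messages` and the `Context:` / `Question:` layout are assumptions for this sketch, not a required format.

```python
def build_messages(context_chunks, user_input):
    """Build a chat messages list that carries the retrieved context."""
    context = "\n\n".join(context_chunks)
    return [
        {
            "role": "system",
            "content": "You are an AI assistant that helps with AI questions.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {user_input}",
        },
    ]


# Made-up chunks standing in for the rows retrieved from the index
messages = build_messages(
    ["A perceptron is a linear classifier.", "It learns weights from data."],
    "what is a perceptron?",
)
print(messages[1]["content"])
```

Keeping the context and the question clearly labeled makes it easier to inspect the prompt when an answer looks wrong.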
