# Open Source Vector Database for LLM Local Memory

-------------------------------------------------------------
Notebook created by Marcelo Amaral.
Notebook developed with assistance from GPT (July 2023). Designed to interface with a local Milvus database setup. Ensure Milvus is configured and running locally before execution.
-------------------------------------------------------------

The goal is to put into production a system that leverages large language models for our specific research.

This tutorial covers the following steps:

1. Extract Text from LaTeX: Use a library like text_splitter, TexSoup or pylatexenc to extract and clean text from your LaTeX papers and other documents.

2. Embed: Use Sentence Transformers to generate embeddings for the extracted text.

3. Store in Milvus: Insert the generated embeddings into a Milvus collection for storage and retrieval. Remember to also store any necessary metadata (like paper IDs or partition information) so you can associate embeddings with their corresponding papers.

4. Semantic Search: Use Milvus to perform semantic searches. Given a query, convert it to an embedding using the same method as in step 2, and then query Milvus to find the most similar embeddings in the database.

5. Reasoning with GPT: Once you have the search results, you can pass the corresponding text to a GPT model for further processing. The plan is to provide this as a plugin.
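To make the data flow concrete, here is a toy, in-memory walk through the five steps. It is only a sketch under stated assumptions, not the implementation used later in this notebook: extract_text, ToyEmbedder, InMemoryStore and build_prompt are made-up names, the "embedding" is a normalized bag of words standing in for a Sentence Transformers model, and a Python list stands in for Milvus.

```python
import math
import re


def extract_text(latex):
    """Step 1 (toy): drop comments, commands and braces, then collapse whitespace."""
    text = re.sub(r"(?<!\\)%.*", "", latex)        # LaTeX comments
    text = re.sub(r"\\[a-zA-Z]+\*?", " ", text)    # \commands (their arguments are kept)
    text = re.sub(r"[{}$]", " ", text)             # grouping and math delimiters
    return " ".join(text.split())


def words(text):
    return re.findall(r"\w+", text.lower())


class ToyEmbedder:
    """Step 2 (toy): unit-length bag-of-words vector over a fixed vocabulary."""

    def __init__(self, texts):
        vocabulary = sorted({w for t in texts for w in words(t)})
        self.index = {w: i for i, w in enumerate(vocabulary)}

    def encode(self, text):
        vec = [0.0] * len(self.index)
        for w in words(text):
            if w in self.index:
                vec[self.index[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]


class InMemoryStore:
    """Steps 3 and 4 (toy): keep (vector, metadata) rows, rank by inner product."""

    def __init__(self):
        self.rows = []

    def insert(self, vector, metadata):
        self.rows.append((vector, metadata))

    def search(self, query_vector, limit=3):
        scored = [
            (sum(q * v for q, v in zip(query_vector, vector)), metadata)
            for vector, metadata in self.rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


def build_prompt(question, hits):
    """Step 5 (toy): the context that would be handed to a GPT model."""
    context = "\n".join(f"[{meta['paper_id']}] {meta['text']}" for _, meta in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


papers = {
    "p1": r"\section{Intro} Quasicrystals and spin foams in \emph{quantum gravity}.",
    "p2": r"\section{Setup} Docker compose for a vector database. % draft note",
}
texts = {paper_id: extract_text(latex) for paper_id, latex in papers.items()}
embedder = ToyEmbedder(texts.values())

store = InMemoryStore()
for paper_id, text in texts.items():
    store.insert(embedder.encode(text), {"paper_id": paper_id, "text": text})

question = "quasicrystals and spin foams"
hits = store.search(embedder.encode(question), limit=1)
print(build_prompt(question, hits))
```

The query and the stored texts go through the same encode call, which is the point of step 4: a query is only comparable to the stored vectors if it is embedded in exactly the same way.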

## Install and Administer Milvus

Check the requirements for your system here: https://milvus.io/docs/prerequisite-docker.md (visited Jul 10, 2023)

I will install on Linux, but the procedure should be similar for other systems: https://milvus.io/docs/install_standalone-docker.md

On Debian-based Linux, we need to install Docker and Docker Compose:

    sudo apt-get update
    sudo apt-get upgrade
    sudo apt-get install docker-compose

Alternatively, install Docker Engine with the Docker Compose plugin:

https://docs.docker.com/engine/install/ubuntu/#installation-methods

    sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

With the plugin, the commands below are written without the "-", that is, "docker compose" instead of "docker-compose". For example:

    docker compose up -d

Download the YAML file

    wget https://github.com/milvus-io/milvus/releases/download/v2.2.11/milvus-standalone-docker-compose.yml -O docker-compose.yml

Start Milvus

In the same directory as the docker-compose.yml file, start up Milvus by running:

    sudo docker-compose up -d

Check that the containers are up and running:

    sudo docker-compose ps

Connect to Milvus

Verify which local port the Milvus server is listening on. Replace the container name with your own.
       
    sudo docker port milvus-standalone 19530/tcp

Stop Milvus

To stop Milvus standalone, run:

    sudo docker-compose down

To delete data after stopping Milvus, run:

    sudo rm -rf volumes


To see the container ID:

    sudo docker ps -a

To log in to the container as root:

    sudo docker exec -u 0 -it yourcontainer /bin/bash

To install a text editor inside the container:

    apt-get update
    apt-get install nano

To update the IP address:

Edit docker-compose.yml so that the Milvus ports are bound to your host IP:

    ports:
      - "x.xx.xx.xx:19530:19530"
      - "x.xx.xx.xx:9091:9091"

In my case, the end of the file contained only the following, and there was no need to create a specific network:

    networks:
      default:

Then restart the containers:

    docker-compose down
    docker-compose up -d


Now we can connect from Python, for example:
      
    connections.connect("default", host="xx.xx.xx.xx", port="19530")
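Before calling connections.connect, it can help to confirm that the Milvus port is reachable from the client at all, since a wrong IP binding or a firewall otherwise only shows up as a connection error. The following is a minimal sketch using only the standard library. The helper name port_is_open is made up for this note, and the host and port are placeholders to replace with your own values.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False


# Placeholder values: use the IP and port configured in docker-compose.yml
print(port_is_open("127.0.0.1", 19530))
```

A True result only says that something is listening on that port. It does not prove that the listener is Milvus, so the connect call is still the real test.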

Currently, installing pymilvus on a client requires these two prerequisites to be installed first:

    pip install protobuf==3.20.0 grpcio-tools

then

    pip install pymilvus

## Import necessary libraries

In [1]:
# Standard-library modules used later for file handling
import json
import os
import shutil

import numpy as np
import pandas as pd
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)
# The powerful transformer models
from sentence_transformers import SentenceTransformer

In [2]:
# In case you need to update the module while working: import importlib,
# which contains functions that help you control
# the runtime process of Python scripts, especially those related to importing and reloading modules.
import importlib

# Then import the qgr_data_processing module under the alias qgr.
from packages import qgr_data_processing as qgr

# Reload the qgr_data_processing module. The purpose of this is to ensure that the
# latest version of the module is in use, especially if the module has been modified
# since the start of the Python session.
importlib.reload(qgr)

<module 'packages.qgr_data_processing' from '/home/mamaral/Documents/qgr/codes/python/notebooks/computational_essays/packages/qgr_data_processing.py'>
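The effect of importlib.reload can be seen on a throwaway module. The sketch below writes a temporary module (the name scratch_mod is made up for the illustration), imports it, edits the file on disk, and reloads it. It uses only the standard library and does not touch qgr_data_processing.

```python
import importlib
import os
import sys
import tempfile

# Demo-only setting: avoid writing .pyc caches for the throwaway module
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp_dir:
    module_path = os.path.join(tmp_dir, "scratch_mod.py")

    # First version of the module
    with open(module_path, "w") as fh:
        fh.write("VALUE = 1\n")

    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    import scratch_mod

    before = scratch_mod.VALUE

    # Edit the module on disk, as you would in an editor
    with open(module_path, "w") as fh:
        fh.write("VALUE = 42  # edited\n")

    # Without a reload, the session still sees the old value
    stale = scratch_mod.VALUE

    importlib.reload(scratch_mod)
    after = scratch_mod.VALUE

    sys.path.remove(tmp_dir)

print(before, stale, after)
```

Because reload re-executes the code inside the existing module object, an alias such as qgr picks up the changes. Names copied out earlier with "from module import name" keep pointing at the old objects.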

## Connecting and managing the database

In [9]:
# Configure /milvus/configs/milvus.yaml with your host IP, then set the same IP here
# (or comment this line out and uncomment the localhost line below)
connections.connect("default", host="192.168.1.90", port="19530")
#connections.connect("default", host="localhost", port="19530")


In [10]:
# List existing collections (this will return an empty list if no collections exist)
collections = utility.list_collections()
print("Existing collections:", collections)

Existing collections: ['YT_Videos', 'QGRmemory']


### Tables

The function create_milvus_collection_with_partitions creates a Milvus collection with a specified name and dimensionality for the vector field, and with specific partitions. The cell below performs the same steps inline, with the values written directly in the code.

It takes three arguments:

- collection_name: the name of the collection to be created ("QGRmemory" in the cell below).
- dim: the dimensionality of the vector field (384 in the cell below).
- partition_names: a list of partition names to be created within the collection.

The first step is to define the schema for the collection. The schema includes an id field, eight VARCHAR fields (documentId, title, date, authors, abstract, keywords, category, and content), and a FLOAT_VECTOR field (content_vector) for the document embeddings. The id field is marked as the primary field and is set to auto-generate IDs.

Next comes a check for whether a collection with the given name already exists. If it does, the existing collection is dropped. In the cell below this check is commented out.

After that, a new collection is created with the given name and schema.

Partitions are then created within the collection, one for each name in the partition_names list.

Once the collection and partitions are set up, an index is created on the content_vector field to speed up similarity searches. The type of the index is IVF_FLAT, and the similarity metric is IP (inner product), with L2 (Euclidean distance) as the other option.

The output of the cell is the status returned by the index creation.

When inserting data into the collection, you can specify the partition to insert into. This allows you to keep related data together and can improve search performance.
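As a rough illustration of what an insert into one partition involves, the sketch below arranges row dictionaries into one list per field, in schema order. It assumes the column-based data format accepted by Collection.insert in pymilvus 2.2.x and uses the field names of the schema defined in the next cell. rows_to_columns and make_row are hypothetical helpers, and all values are placeholders.

```python
# Schema field order used in this notebook, minus the auto-generated "id" field
FIELD_ORDER = [
    "documentId", "title", "date", "authors", "abstract",
    "keywords", "category", "content", "content_vector",
]


def rows_to_columns(rows, field_order=FIELD_ORDER):
    """Convert row dicts into one list per field, in schema order."""
    for position, row in enumerate(rows):
        missing = [name for name in field_order if name not in row]
        if missing:
            raise KeyError(f"row {position} is missing fields: {missing}")
    return [[row[name] for row in rows] for name in field_order]


def make_row(document_id, content, vector):
    """Build one illustrative row; the metadata values are placeholders."""
    return {
        "documentId": document_id, "title": "Example title", "date": "Unknown",
        "authors": "A. Author", "abstract": "Example abstract.",
        "keywords": "example", "category": "paper",
        "content": content, "content_vector": vector,
    }


rows = [
    make_row("doc-1", "First chunk of text.", [0.0] * 384),
    make_row("doc-1", "Second chunk of text.", [0.0] * 384),
]
columns = rows_to_columns(rows)
print(len(columns), len(columns[0]))

# With a live connection, the columns would then go into a single partition:
# collection.insert(columns, partition_name="mypapers")
```

Several chunks can share one documentId, which is what later allows all entities of a paper to be found and grouped.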

In [86]:

# Define the schema for the new collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="documentId", dtype=DataType.VARCHAR, max_length=256, auto_id=False),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=1024), 
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="authors", dtype=DataType.VARCHAR, max_length=1024), 
    FieldSchema(name="abstract", dtype=DataType.VARCHAR, max_length=4096), 
    FieldSchema(name="keywords", dtype=DataType.VARCHAR, max_length=1024), 
    FieldSchema(name="category", dtype=DataType.VARCHAR, max_length=256),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=1024),   
    FieldSchema(name="content_vector", dtype=DataType.FLOAT_VECTOR, dim=384)
]
collection_name = "QGRmemory"

# Drop the collection if it already exists (uncomment to recreate it from scratch)
#if utility.has_collection(collection_name):
#    utility.drop_collection(collection_name)

# Create the collection schema
data_schema = CollectionSchema(fields=fields, description='QGR Content Data')

# Create the collection
collection = Collection(name=collection_name, schema=data_schema)

# List of partition names.
partition_names = ['mypapers', 'papers', 'notes', 'books', 'others', 'chats']

# Create the partitions
for partition_name in partition_names:
    collection.create_partition(partition_name)

# Create a vector index for semantic search
index_params = {
    'metric_type': "IP", #other option L2
    'index_type': "IVF_FLAT",
    'params': {"nlist": 2048}
}

collection.create_index(field_name='content_vector', index_params=index_params)


Status(code=0, message=)
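The content field in the schema above is limited to max_length=1024, so longer texts have to be split before insertion. Below is a minimal sketch of such a splitter, written for this note. It is not the splitting done inside qgr_data_processing, whose code is not shown. It assumes the limit should be checked against the UTF-8 byte length, which is the stricter reading, and chunk_text and max_bytes are illustrative names.

```python
def chunk_text(text, max_bytes=1024):
    """Split text on whitespace into chunks of at most max_bytes UTF-8 bytes."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (the longest UTF-8 character)")
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than the limit is cut into hard slices
        while len(word.encode("utf-8")) > max_bytes:
            if current:
                chunks.append(current)
                current = ""
            head = word.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")
            chunks.append(head)
            word = word[len(head):]
        candidate = f"{current} {word}" if current else word
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


sample = "spin foam " * 300  # 600 short words, 3000 characters
pieces = chunk_text(sample, max_bytes=1024)
print(len(pieces), max(len(p.encode("utf-8")) for p in pieces))
```

Splitting on whitespace keeps words intact but ignores sentence boundaries. A sentence-aware splitter may give better embeddings, at the cost of more code.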

Since we will be using the sentence-transformers/multi-qa-MiniLM-L6-cos-v1 model to generate embeddings, and we are normalizing them, it makes sense to use a cosine similarity metric for our index instead of the L2 metric. The Inner Product (IP) metric, applied to normalized vectors, yields cosine similarity.
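This can be checked numerically. For unit-length vectors, the inner product equals the cosine similarity, and the squared L2 distance equals 2 minus twice the inner product, so IP and L2 rank neighbours identically once the embeddings are normalized. The sketch below verifies both identities in plain Python on made-up vectors; no Milvus instance or embedding model is involved.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return inner_product(u, v) / (
        math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    )


def squared_l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Made-up raw vectors standing in for embeddings
query = [1.0, 2.0, 2.0]
docs = {"a": [2.0, 4.0, 4.5], "b": [-1.0, 0.5, 3.0], "c": [3.0, -1.0, 0.0]}

q = normalize(query)
unit_docs = {name: normalize(vec) for name, vec in docs.items()}

for name, d in unit_docs.items():
    ip = inner_product(q, d)
    # IP on unit vectors equals the cosine similarity of the raw vectors
    assert math.isclose(ip, cosine(query, docs[name]), abs_tol=1e-12)
    # Squared L2 distance on unit vectors equals 2 - 2 * IP
    assert math.isclose(squared_l2(q, d), 2 - 2 * ip, abs_tol=1e-12)

by_ip = sorted(unit_docs, key=lambda n: inner_product(q, unit_docs[n]), reverse=True)
by_l2 = sorted(unit_docs, key=lambda n: squared_l2(q, unit_docs[n]))
print(by_ip, by_l2)
```

Note the direction of the two metrics: with IP a larger score means more similar, while with L2 a smaller distance means more similar.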

## Preparing and Inserting Data 

Choosing the best model for your text embedding task can depend on the specifics of your use case. For tasks involving academic papers or scientific texts, the following models might be suitable:

- SciBERT: A variant of BERT pre-trained on a large corpus of scientific texts. It may perform better on scientific texts due to its exposure to scientific jargon and concepts.
- BioBERT: Tailored for biology papers, this BERT model is pre-trained on a large-scale biomedical corpus, making it adept at handling biomedical terminology.
- MathBERT: A BERT variant specifically designed for mathematical papers.
- GPT-4 or GPT-3: OpenAI's general-purpose models, which can be powerful for various language understanding and generation tasks.
- LaBSE (Language-agnostic BERT Sentence Embeddings): Ideal for multilingual sentence-level embeddings, this model is particularly useful for cross-lingual tasks.
- sentence_transformers: A library offering models like multi-qa-MiniLM-L6-cos-v1 and paraphrase-MiniLM-L6-v2, optimized for sentence embeddings.
- multi-qa-MiniLM-L6-cos-v1: Proven effective for embedding in other tests, this model may also be suitable for other contexts.

Given the specific needs of your database, multi-qa-MiniLM-L6-cos-v1 might be a good choice to try.
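Whichever model is chosen, its output dimension has to match the dim of the content_vector field (384 in the schema above). A small guard like the following can catch a mismatch before any data is inserted. It is a sketch: check_embedding_dim is a hypothetical helper, and FakeModel only stands in for a loaded model so that the example runs without downloading anything. get_sentence_embedding_dimension is the method that SentenceTransformer models expose for this purpose.

```python
SCHEMA_DIM = 384  # dim of the content_vector field in the schema above


def check_embedding_dim(model, expected_dim=SCHEMA_DIM):
    """Raise if the model's sentence-embedding size differs from the schema."""
    actual = model.get_sentence_embedding_dimension()
    if actual != expected_dim:
        raise ValueError(
            f"model produces {actual}-dimensional vectors, "
            f"but the collection expects {expected_dim}"
        )
    return actual


class FakeModel:
    """Stand-in exposing the same method name as a SentenceTransformer model."""

    def __init__(self, dim):
        self._dim = dim

    def get_sentence_embedding_dimension(self):
        return self._dim


print(check_embedding_dim(FakeModel(384)))

try:
    check_embedding_dim(FakeModel(768))  # e.g. a BERT-base-sized model
except ValueError as err:
    print(err)
```

With the real model, the call would be check_embedding_dim(sbert_model) right after loading it. Switching to a model with a different output size means recreating the collection with a matching dim.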

In [87]:
# Assuming already connected to Milvus
collection_name = "QGRmemory"
collection = Collection(name=collection_name)

# Partition name for the data
partition_name = "mypapers"

#logging.basicConfig(level=logging.ERROR)

# Load the pre-trained SBERT model
sbert_model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1') 

# Initialize a list to keep track of processed file paths
processed_file_name = "processed_files.json"
processed_files = qgr.load_processed_files(processed_file_name)

# Folders to process
folder_path = "mypapers"
folder_path_not_processed = "notprocessed"
os.makedirs(folder_path_not_processed, exist_ok=True)  # destination for files that fail
files_processed = 0
files_process_max = 40
while files_processed < files_process_max:
    try:
        entry, file_path = qgr.process_file_from_folder(folder_path)
    except Exception as e:  # Report any error raised while reading the folder, then stop
        print(f"error folder_path: {folder_path}")
        print(f"Error details: {str(e)}")
        break
        
    if entry is None:
        break

    try:
        qgr.process_and_insert_documents(entry, sbert_model, collection, partition_name)

        # Delete the file after insertion
        os.remove(file_path)
        
        # Append the successfully processed file path to the list
        processed_files.append(file_path)
        
        #print(file_path)
        files_processed += 1  # Increment the counter
        
        # Flush the data and save the processed file paths every 20 files
        if files_processed % 20 == 0:
            collection.flush()
            with open(processed_file_name, 'w') as json_file:
                json.dump(processed_files, json_file)
            print(f"Flushed and saved processed files at {files_processed}")

        
    except Exception as e:  # Report any error raised while processing this file
        print(f"file not fully processed: {file_path}")
        print(f"Error details: {str(e)}")
        
        # Move the file to the "notprocessed" folder
        destination_path = os.path.join(folder_path_not_processed, os.path.basename(file_path))
        shutil.move(file_path, destination_path)
        
# Flush the data
collection.flush()
print(f"Flushed and saved processed files at {files_processed}")

# Save the updated list of processed file paths to the JSON file
with open(processed_file_name, 'w') as json_file:
    json.dump(processed_files, json_file)

Flushed and saved processed files at 20
Flushed and saved processed files at 28
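The loop relies on qgr.load_processed_files to resume from earlier runs, but the code of that helper is not shown here. As a rough idea of what such bookkeeping can look like, here is a sketch with two hypothetical functions, load_processed and save_processed. It assumes the JSON file holds a flat list of paths, as the json.dump(processed_files, ...) calls in the loop suggest, and it writes through a temporary file so that an interrupted run does not leave a half-written list behind.

```python
import json
import os
import tempfile


def load_processed(path):
    """Return the list of processed file paths, or [] if the file does not exist."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"{path} should contain a JSON list")
    return data


def save_processed(path, processed):
    """Write the list atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(processed, fh)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    state_file = os.path.join(workdir, "processed_files.json")
    first_run = load_processed(state_file)       # nothing processed yet
    save_processed(state_file, first_run + ["mypapers/a.tex"])
    second_run = load_processed(state_file)      # picks up where the first run stopped
print(first_run, second_run)
```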


In [95]:
# Flush the data
collection.flush()

In [12]:
# Assuming already connected to Milvus
collection_name = "QGRmemory"
collection = Collection(name=collection_name)

num_entities = collection.num_entities
print(f"The number of entities in the collection is: {num_entities}")


The number of entities in the collection is: 392712


## Reading Data

About the search:

- limit: This parameter determines the number of returned results. Depending on your use case, you might want to return more results for further processing, or fewer results for speed and simplicity.

- expr: This parameter allows you to filter the results based on conditions. If you have other fields in your collection that could help refine your search results, you can use this parameter to add conditions. For example, if you have a "date" field and you only want papers from the last five years, you could set expr="date > '2018'".

- output_fields: Here you specify additional fields that you want to retrieve for each result. You've already added "title" and "content", but you can add more fields if they are defined in your collection.

- partition_names: You're already searching within the "mypapers" partition. If you want to search within multiple partitions, you can add more partition names to the list.
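To show how these four parameters fit together, the sketch below assembles the keyword arguments for the Collection.search method of pymilvus. build_search_kwargs is a hypothetical helper written for this note; it is not the body of qgr.search_documents, which is not shown. The field name content_vector, the IP metric and the partition names come from the cells above, while the nprobe value of 16 is an arbitrary placeholder.

```python
def build_search_kwargs(
    query_vector,
    limit=3,
    expr=None,
    output_fields=("title", "content"),
    partition_names=("mypapers",),
    nprobe=16,
):
    """Assemble keyword arguments for pymilvus Collection.search."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {
        "data": [list(query_vector)],        # one query vector per search request
        "anns_field": "content_vector",      # vector field defined in the schema
        "param": {"metric_type": "IP", "params": {"nprobe": nprobe}},
        "limit": limit,
        "expr": expr,                        # e.g. "date > '2018'", or None for no filter
        "output_fields": list(output_fields),
        "partition_names": list(partition_names),
    }


kwargs = build_search_kwargs(
    [0.0] * 384,  # placeholder; a real query vector comes from the embedding model
    limit=5,
    expr="date > '2018'",
    output_fields=("title", "content", "authors"),
    partition_names=("mypapers", "papers"),
)
print({key: value for key, value in kwargs.items() if key != "data"})

# With a loaded collection, the call would be:
# collection.load()
# hits = collection.search(**kwargs)
```

Keeping the arguments in a plain dictionary makes it easy to log exactly what was searched and to vary one parameter at a time.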

About refining for better search results:

The nlist parameter is used when building the index: it sets the number of clusters into which the IVF_FLAT index groups the vectors. A higher nlist gives smaller clusters, but will increase the index file size and the time it takes to create the index.

On the other hand, nprobe is used during the search process. It determines the number of clusters to inspect during the search. A higher nprobe can increase the recall rate of a search but will also increase the search time.

In practice, it's common to set nlist to a relatively high value (so that each cluster stays small) and then adjust nprobe depending on the speed-accuracy trade-off you're willing to accept for your searches.
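One way to make this trade-off measurable is to compare the IDs returned at a given nprobe against a reference result and compute the recall. For IVF_FLAT, a search with nprobe equal to nlist inspects every cluster and can serve as that reference. The sketch below shows only the bookkeeping, on made-up ID lists. recall_at_k and smallest_nprobe are illustrative names, and the numbers are not measurements from this collection.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the reference top-k that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def smallest_nprobe(results_by_nprobe, exact_ids, target_recall=0.9):
    """Return the smallest nprobe whose recall reaches the target, or None."""
    for nprobe in sorted(results_by_nprobe):
        if recall_at_k(results_by_nprobe[nprobe], exact_ids) >= target_recall:
            return nprobe
    return None


# Made-up example: reference top-5 and the IDs returned at three nprobe settings
exact = [11, 42, 7, 93, 58]
results = {
    1: [11, 42, 3, 8, 15],     # recall 0.4
    8: [11, 42, 7, 93, 15],    # recall 0.8
    32: [11, 42, 7, 93, 58],   # recall 1.0
}

for nprobe in sorted(results):
    print(nprobe, recall_at_k(results[nprobe], exact))
print(smallest_nprobe(results, exact, target_recall=0.9))
```

In a real test, the recall would be averaged over a set of representative queries instead of a single one.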

Query Refinement: Make sure your query is representative of the information you are looking for. Since multi-qa-MiniLM-L6-cos-v1 is a transformer-based model, it might be sensitive to the phrasing and choice of words in the query.

Check Data Preprocessing: Make sure that the text preprocessing is appropriate and consistent between your dataset and your search queries. Ensure that the LaTeX text is correctly parsed and cleaned before being passed to the embedding model.
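A quick way to check the cleaning step is to scan the extracted text for leftover LaTeX markup before embedding it. The following is a heuristic sketch, not a parser. latex_residue and looks_clean are hypothetical helpers, and the three patterns (backslash commands, unbalanced braces, an odd number of dollar signs) are an assumption about what typically survives a failed extraction.

```python
import re

COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def latex_residue(text):
    """Report leftover LaTeX markup found in text that should be plain prose."""
    return {
        "commands": sorted(set(COMMAND_RE.findall(text))),
        "unbalanced_braces": text.count("{") != text.count("}"),
        "odd_dollar_signs": text.count("$") % 2 == 1,
    }


def looks_clean(text):
    report = latex_residue(text)
    return not (
        report["commands"] or report["unbalanced_braces"] or report["odd_dollar_signs"]
    )


clean = "We define quasicrystalline spin networks as a subspace."
dirty = r"We define \emph{quasicrystalline spin networks as a subspace of $H."

print(looks_clean(clean), looks_clean(dirty))
print(latex_residue(dirty))
```

Running the same check on stored chunks and on query strings is one way to keep the preprocessing of both sides consistent.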

In [16]:
# Load the pre-trained SBERT model
sbert_model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1') 
query_text = "quasicrystals and spin foams"
results = qgr.search_documents(query_text, "QGRmemory", sbert_model, "mypapers", limit=3)
grouped_results = qgr.group_by_document_id(results)
print(grouped_results)


{'54b8943ededb63c826aa061b9767067cc31f09162778c3581365f8828bef4503': {'title': 'QUASICRYSTALLINE SPIN FOAM WITH MATTER: DEFINITIONS AND EXAMPLES', 'date': 'Unknown', 'authors': 'Marcelo Amaral, Richard Clawson, Klee Irwin', 'abstract': 'In this work, we define quasicrystalline spin networks as a subspace within the standard Hilbert space of loop quantum gravity, effectively constraining the states to coherent states that align with quasicrystal geometry structures. We introduce quasicrystalline spin foam amplitudes, a variation of the EPRL spin foam model, in which the internal spin labels are constrained to correspond to the boundary data of quasicrystalline spin networks. Within this framework, the quasicrystalline spin foam amplitudes encode the dynamics of quantum geometries that exhibit aperiodic structures. Additionally, we investigate the coupling of fermions within the quasicrystalline spin foam amplitudes. We present calculations for three-dimensional examples and then explore

In [18]:
# Deleting
# Step 1: Search for the documentId to get the primary key (id)
collection_name = "QGRmemory"
collection = Collection(name=collection_name)

document_id_to_search = "54b8943ededb63c826aa061b9767067cc31f09162778c3581365f8828bef4503"

search_results = collection.query(f"documentId == '{document_id_to_search}'")
print(search_results)

# Step 2: Extract the primary keys from the search results
primary_keys_to_delete = [result['id'] for result in search_results]

# Step 3: Delete the entities using the primary keys
delete_expr = f"id in {primary_keys_to_delete}"
#collection.delete(delete_expr)


[]
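The query above returned an empty list, in which case the expression built in step 3 would read "id in []". A small helper can skip that case and split long key lists into batches. This is a sketch: build_delete_exprs and the batch size are illustrative, and the expressions keep the "id in [...]" form used in the cell above.

```python
def build_delete_exprs(primary_keys, batch_size=1000):
    """Return 'id in [...]' expressions, one per batch; no keys means no expressions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    keys = [int(key) for key in primary_keys]
    return [
        f"id in {keys[start:start + batch_size]}"
        for start in range(0, len(keys), batch_size)
    ]


print(build_delete_exprs([]))
print(build_delete_exprs([101, 102, 103], batch_size=2))

# With a live collection (left commented out, as in the cell above):
# for expr in build_delete_exprs(primary_keys_to_delete):
#     collection.delete(expr)
```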


In [91]:
# Re-indexing
# Release the collection
collection.release()
# Drop the existing index
collection.drop_index()
# Create a new index
index_params = {
    'metric_type': "IP",
    'index_type': "IVF_FLAT",
    'params': {"nlist": 2048}
}
collection.create_index(field_name='content_vector', index_params=index_params)


Status(code=0, message=)
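Re-indexing is also the moment to revisit nlist. A rule of thumb often quoted for IVF indexes in Milvus is nlist of about 4 times the square root of n, where n is the number of entities in a segment. Treat it as a starting point to verify against the Milvus documentation for your version, not as a fixed rule. The sketch below applies it to the entity count reported earlier (392712), under the simplifying assumption that all entities sit in one segment. suggest_nlist is a hypothetical helper, and the clamp range is likewise an assumption.

```python
import math


def suggest_nlist(num_entities, minimum=1, maximum=65536):
    """Rule-of-thumb nlist = 4 * sqrt(n), clamped to [minimum, maximum]."""
    if num_entities < 1:
        raise ValueError("num_entities must be positive")
    return max(minimum, min(maximum, round(4 * math.sqrt(num_entities))))


n = 392712  # entity count printed earlier in this notebook
print(suggest_nlist(n))
```

The result is in the same range as the 2048 used in this notebook. After the new index is created, the collection has to be loaded again with collection.load() before it can be searched.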