# Load Processed Data into Vector Database

This notebook loads the output of Data Prep Kit into a Milvus vector database.

**Step-5 in this workflow**

![](media/rag-overview-2.png)


## Step-1: Configuration

In [1]:
from my_config import MY_CONFIG
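
The contents of `my_config` are not shown in this notebook. As a rough guide, a minimal sketch of what it might contain is given below; the `MyConfig` class is an assumption, and the values are simply the ones printed in the cell outputs of this notebook.

```python
# Hypothetical sketch of my_config.py; the real module may look different.
class MyConfig:
    pass


MY_CONFIG = MyConfig()

# Values taken from the outputs printed in this notebook
MY_CONFIG.OUTPUT_FOLDER_FINAL = "output/output_final"  # folder holding the .parquet files
MY_CONFIG.DB_URI = "./rag_2.db"                        # local Milvus database file
MY_CONFIG.COLLECTION_NAME = "rag_milvus"               # collection to create and fill

# MY_CONFIG.EMBEDDING_LENGTH is not set here; Step-2 derives it from the data.
```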

## Step-2: Load Parquet Data

Load all `.parquet` files in the given directory.

In [2]:
import pandas as pd
import glob

print ('Loading data from : ', MY_CONFIG.OUTPUT_FOLDER_FINAL)

# Get a list of all Parquet files in the directory
parquet_files = glob.glob(f'{MY_CONFIG.OUTPUT_FOLDER_FINAL}/*.parquet')
print ("Number of parquet files to read : ", len(parquet_files))
print ()

# Create an empty list to store the DataFrames
dfs = []

# Loop through each Parquet file and read it into a DataFrame
for file in parquet_files:
    df = pd.read_parquet(file)
    print (f"Read file: '{file}'.  number of rows = {df.shape[0]}")
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
data_df = pd.concat(dfs, ignore_index=True)

print (f"\nTotal number of rows = {data_df.shape[0]}")

Loading data from :  output/output_final
Number of parquet files to read :  2

Read file: 'output/output_final/attention.parquet'.  number of rows = 27
Read file: 'output/output_final/granite.parquet'.  number of rows = 33

Total number of rows = 60
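
If the folder contains no `.parquet` files, `pd.concat` fails with a `ValueError` that does not mention the folder. The loading loop above could be wrapped in a small helper that fails with a clearer message. This is only a sketch; `load_parquet_dir` is a hypothetical name, not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_parquet_dir(folder: str) -> pd.DataFrame:
    """Read every .parquet file in `folder` into one DataFrame.

    Hypothetical helper, not part of the original notebook.
    """
    # sorted() makes the read order, and so the row order, reproducible
    files = sorted(glob.glob(os.path.join(folder, "*.parquet")))
    if not files:
        # pd.concat would raise a less helpful ValueError on an empty list
        raise FileNotFoundError(f"No .parquet files found in '{folder}'")
    frames = [pd.read_parquet(f) for f in files]
    return pd.concat(frames, ignore_index=True)


# Usage in this notebook:
# data_df = load_parquet_dir(MY_CONFIG.OUTPUT_FOLDER_FINAL)
```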


In [3]:

## Shape the data

MY_CONFIG.EMBEDDING_LENGTH =  len(data_df.iloc[0]['embeddings'])
print ('embedding length: ', MY_CONFIG.EMBEDDING_LENGTH)

# rename 'embeddings' to 'vector' (to match the default schema) and 'contents' to 'text'
data_df = data_df.rename( columns= {'embeddings' : 'vector', 'contents' : 'text'})

# keep only these columns
data_df = data_df[['text', 'vector','filename']]

print (data_df.info())
data_df.head(3)

embedding length:  768
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      60 non-null     object
 1   vector    60 non-null     object
 2   filename  60 non-null     object
dtypes: object(3)
memory usage: 1.5+ KB
None


| | text | vector | filename |
|---|---|---|---|
| 0 | Provided proper attribution is provided, Googl... | [-0.009165785, 0.021550555, -0.032464594, 0.04... | attention.pdf |
| 1 | ## Attention Is All You Need\n\nAshish Vaswani... | [-0.0015829674, 0.010807318, 0.0343509, 0.0365... | attention.pdf |
| 2 | ## Abstract\n\nThe dominant sequence transduct... | [0.012864714, -0.036180798, 0.00082880055, 0.0... | attention.pdf |
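
The embedding length above is read from the first row only, while the collection is created with a single fixed dimension. A quick check that every row has the same vector length can catch bad data before it reaches the database. The sketch below uses a tiny made-up DataFrame in place of `data_df`; `check_vector_lengths` is a hypothetical helper.

```python
import pandas as pd


def check_vector_lengths(df: pd.DataFrame, column: str = "vector") -> int:
    """Return the common vector length, or raise if rows disagree.

    Hypothetical helper, not part of the original notebook.
    """
    lengths = df[column].map(len).unique()
    if len(lengths) != 1:
        raise ValueError(f"Inconsistent vector lengths: {sorted(lengths)}")
    return int(lengths[0])


# Tiny stand-in for data_df, with 3-dimensional vectors
sample_df = pd.DataFrame({
    "text": ["a", "b"],
    "vector": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "filename": ["x.pdf", "x.pdf"],
})
dim = check_vector_lengths(sample_df)
```

In this notebook, the same call on `data_df` should return the value stored in `MY_CONFIG.EMBEDDING_LENGTH`.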


## Step-3: Connect to Vector Database

Milvus can run in embedded mode, storing the database in a local file, which makes it easy to use.

<span style="color:blue;">Note: If you see an error saying the database cannot be loaded, try this: </span>

- <span style="color:blue;">In **VS Code**: **restart the kernel** of the previous notebook. This will release the db.lock </span>
- <span style="color:blue;">In **Jupyter**: Do `File --> Close and Shutdown Notebook` on the previous notebook. This will release the db.lock</span>
- <span style="color:blue;">Re-run this cell</span>




In [4]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(MY_CONFIG.DB_URI)

print ("✅ Connected to Milvus instance:", MY_CONFIG.DB_URI)

✅ Connected to Milvus instance: ./rag_2.db


## Step-4: Create a Collection



In [None]:
# # if we already have a collection, clear it first
# if milvus_client.has_collection(collection_name=MY_CONFIG.COLLECTION_NAME):
#     milvus_client.drop_collection(collection_name=MY_CONFIG.COLLECTION_NAME)
#     print ('✅ Cleared collection :', MY_CONFIG.COLLECTION_NAME)

if not milvus_client.has_collection(collection_name=MY_CONFIG.COLLECTION_NAME):
    milvus_client.create_collection(
        collection_name=MY_CONFIG.COLLECTION_NAME,
        dimension=MY_CONFIG.EMBEDDING_LENGTH,
        metric_type="IP",  # Inner product distance
        consistency_level="Strong",  # Strong consistency level
        auto_id=True
    )
    print ("✅ Created collection :", MY_CONFIG.COLLECTION_NAME)


✅ Created collection : rag_milvus
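
With `metric_type="IP"`, search results are ranked by inner product, and a larger score means a closer match. The toy example below illustrates the scoring with NumPy; the vectors are made up for illustration and have nothing to do with the real 768-dimensional embeddings.

```python
import numpy as np

query = np.array([1.0, 0.0])
doc_a = np.array([0.6, 0.8])  # partly aligned with the query
doc_b = np.array([0.0, 1.0])  # orthogonal to the query

# Inner product: sum of element-wise products
score_a = float(np.dot(query, doc_a))
score_b = float(np.dot(query, doc_b))

# doc_a scores higher than doc_b, so it would rank first for this query
ranked = sorted([("doc_a", score_a), ("doc_b", score_b)],
                key=lambda pair: pair[1], reverse=True)
```

All three vectors here have unit length, so the inner product equals the cosine similarity. For vectors that are not normalized, the two measures can rank results differently.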


## Step-5: Insert Data into Collection

In [6]:
res = milvus_client.insert(collection_name=MY_CONFIG.COLLECTION_NAME, data=data_df.to_dict('records'))

print('inserted # rows', res['insert_count'])

milvus_client.get_collection_stats(MY_CONFIG.COLLECTION_NAME)

inserted # rows 60


{'row_count': 60}
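
The cell above inserts all 60 rows in a single call, which is fine at this size. For much larger datasets, it may be preferable to insert in batches. A minimal sketch of a batching helper follows; `batched` is a hypothetical name and the batch size in the usage comment is an arbitrary assumption.

```python
from typing import Iterator, List


def batched(records: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Usage in this notebook (a batch size of 1000 is an arbitrary assumption):
# records = data_df.to_dict('records')
# for batch in batched(records, 1000):
#     milvus_client.insert(collection_name=MY_CONFIG.COLLECTION_NAME, data=batch)
```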

## Step-6: Close DB Connection

Close the connection so that the lock files are released and other notebooks can access the database.

In [7]:
milvus_client.close()

print ("✅ SUCCESS")


✅ SUCCESS
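
If an earlier cell fails, the `close()` call above never runs and the lock stays held. One way to guard against this in a script is `contextlib.closing`, which calls `close()` even when an exception is raised. The sketch below is generic; `run_with_client` is a hypothetical helper, and the Milvus usage is shown only as a comment.

```python
from contextlib import closing


def run_with_client(make_client, work):
    """Create a client, run work(client), and always close the client.

    Hypothetical helper; works with any object that has a close() method.
    """
    with closing(make_client()) as client:
        return work(client)


# Usage with Milvus (assumes pymilvus is installed and MY_CONFIG is loaded):
# from pymilvus import MilvusClient
# stats = run_with_client(
#     lambda: MilvusClient(MY_CONFIG.DB_URI),
#     lambda c: c.get_collection_stats(MY_CONFIG.COLLECTION_NAME),
# )
```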


## Test your data by doing a Vector Search

See notebook [vector_search.ipynb](vector_search.ipynb)