# Upload PDFs to a Vector Database

## Overview
This notebook walks you through uploading a sample PDF dataset to a vector database.
You should already have a sample [Milvus](https://milvus.io/docs/install_standalone-docker.md) vector database set up by the Workbench project and configured to run on port `19530`.

In [1]:
!pip install pypdf

Defaulting to user installation because normal site-packages is not writeable


## Unzip Dataset
The dataset used in this example consists of PDF files containing NVIDIA blogs and press releases. These files have already been scraped and stored in `../data/corp-comms-dataset.zip`.

In [2]:
!unzip -n ../data/corp-comms-dataset.zip -d ../data/

Archive:  ../data/corp-comms-dataset.zip
   creating: ../data/dataset/
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjIvMTIvMDEvbW9uZGF2aS1tb25hcmNoLXNtYXJ0LWVsZWN0cmljLWpldHNvbi10cmFjdG9yLw==.pdf  
  inflating: ../data/dataset/RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDMvMjIvc3V0c2tldmVyLW9wZW5haS1ndGMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDcvMTAvdHJlay1iaWN5Y2xlLXRvdXItZGUtZnJhbmNlLWdwdXMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDcvMTIvbW9zYWljbWwv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDIvMTYvYWktbWV0YXZlcnNlLXNoYXBpbmctYXV0b21vdGl2ZS1pbmR1c3RyeS1ndGMv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjMvMDQvMjEvZXBpYy1iZW5lZml0cy1vbW5pdmVyc2UtY29ubmVjdG9yLXVucmVhbC1lbmdpbmUv.pdf  
  inflating: ../data/dataset/YmxvZ3MubnZpZGlhLmNvbS9ibG9nLzIwMjIvMTEvMTcveHItdGVjaG5vbG9naWVzLw==.pdf  
  inflati
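
The file names in the listing above look like base64-encoded strings, and the ones shown decode to a blog URL or an original file name. This is an observation from the listing rather than something the dataset documents, so the helper below (`decode_pdf_name` is a hypothetical name, not part of the project) is only a sketch for checking which source each PDF came from.

```python
import base64
from pathlib import Path


def decode_pdf_name(file_path: str) -> str:
    """Decode the base64 stem of a dataset file name (the encoding is an assumption)."""
    stem = Path(file_path).stem  # drop the directory and the trailing ".pdf"
    return base64.urlsafe_b64decode(stem).decode("utf-8")


# A file name copied from the unzip listing above.
print(decode_pdf_name("../data/dataset/RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf"))
```

A name that is not valid base64 raises `binascii.Error`, so wrap the call in a `try` block if the folder may contain other files.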

## Set Up NVIDIA Embedding Model
The embedding model, [embed-qa-4](https://build.nvidia.com/nvidia/embed-qa-4), is a fine-tuned E5-large model deployed as a NIM and hosted on the [NVIDIA API catalog](https://build.nvidia.com/).

*⚠️* Be sure to populate the app's config variables before running the next cell!

In [3]:
from chain_server.configuration import config
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

# NVIDIA API keys have the form "nvapi-xxx". If this variable is not configured yet, set it before running this cell.
embedding_model = NVIDIAEmbeddings(
    model=config.embedding_model.name,
    base_url=str(config.embedding_model.url),
    api_key=config.nvidia_api_key,
    truncate="END"
)
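
Because the comment above gives `nvapi-xxx` as the key format, a quick format check can catch an unset or mistyped key before any request is sent. The sketch below assumes the key is available as a plain string; `looks_like_nvidia_api_key` is a hypothetical helper, and it only checks the prefix, not whether the key is valid.

```python
def looks_like_nvidia_api_key(value) -> bool:
    """Return True if value is a string that starts with "nvapi-" and has more after it."""
    prefix = "nvapi-"
    return isinstance(value, str) and value.startswith(prefix) and len(value) > len(prefix)


# In the notebook this could be called as
# looks_like_nvidia_api_key(config.nvidia_api_key), assuming that attribute is a str.
print(looks_like_nvidia_api_key("nvapi-example"))
print(looks_like_nvidia_api_key(""))
```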

## Set Up Milvus Vector Database
[Milvus](https://milvus.io/docs/install_standalone-docker.md) should already be running through NVIDIA Workbench. Milvus is a database that stores, indexes, and manages embedding vectors at large scale.
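
If you are unsure whether Milvus is actually up, a plain TCP check against its port is a cheap first test. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and the host name in the usage comment is an assumption (the real address comes from `config.milvus.url`). An open port shows only that something is listening, not that Milvus is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example call, assuming Milvus is reachable on localhost:
# is_port_open("localhost", 19530)
```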

In [4]:
print(config.milvus.collection_name)

collection_1


In [5]:
from langchain_milvus.vectorstores.milvus import Milvus

vector_store = Milvus(
    embedding_function=embedding_model,
    connection_args={"uri": config.milvus.url},
    collection_name=config.milvus.collection_name,
    auto_id=True,
)



## Upload PDFs to Milvus Vector Database

In [6]:
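import glob
from itertools import islice

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def upload_document(file_path):
    """Load one PDF, split it into chunks, and add the chunks to the vector store."""
    loader = PyPDFLoader(str(file_path))
    data = loader.load()  # one Document per PDF page
    text_splitter = RecursiveCharacterTextSplitter()  # default chunk size and overlap
    all_splits = text_splitter.split_documents(data)
    vector_store.add_documents(documents=all_splits)

    return f"uploaded {file_path}"

def upload_pdf_files(folder_path, num_files):
    """Upload at most `num_files` PDFs from `folder_path`."""
    # sorted() makes the selection reproducible; islice() stops after num_files paths
    # (and uploads nothing when num_files is 0).
    pdf_paths = sorted(glob.glob(f"{folder_path}/*.pdf"))
    for file_path in islice(pdf_paths, num_files):
        print(upload_document(file_path))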
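
The cell above calls `RecursiveCharacterTextSplitter()` with its default settings. To show what its `chunk_size` and `chunk_overlap` parameters control, here is a deliberately simplified fixed-window splitter. It is not the library's algorithm (the real splitter breaks text recursively on separators such as paragraphs and lines); `split_with_overlap` is a hypothetical helper for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a chunk boundary is still partly visible in both chunks.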

In [7]:
NUM_DOCS_TO_UPLOAD = 10
upload_pdf_files("../data/dataset", NUM_DOCS_TO_UPLOAD)

2026-01-21 18:09:13,700 [ERROR][handler]: RPC error: [insert_rows], <DataNotMatchException: (code=1, message=Insert missed an field `simple_file_name` to collection without set nullable==true or set default_value)>, <Time:{'RPC start': '2026-01-21 18:09:13.700345', 'RPC error': '2026-01-21 18:09:13.700638'}> (decorators.py:140)
Failed to insert batch starting at entity: 0/3


DataNotMatchException: <DataNotMatchException: (code=1, message=Insert missed an field `simple_file_name` to collection without set nullable==true or set default_value)>
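
The error above says that the existing collection has a `simple_file_name` field with no default value and that the inserted rows did not supply it. One possible workaround is to add that key to each chunk's metadata before inserting. In the sketch below, `add_simple_file_name` is a hypothetical helper, and using the file's base name as the value is an assumption; check what the application that created the collection actually stores there.

```python
import os
from types import SimpleNamespace


def add_simple_file_name(splits, file_path):
    """Set metadata["simple_file_name"] on every chunk (the value used is an assumption)."""
    file_name = os.path.basename(file_path)
    for doc in splits:
        doc.metadata["simple_file_name"] = file_name
    return splits


# Stand-in objects with a .metadata dict, shaped like LangChain Document objects.
demo_splits = [SimpleNamespace(page_content="chunk text", metadata={"page": 0})]
add_simple_file_name(demo_splits, "../data/dataset/example.pdf")
print(demo_splits[0].metadata)
```

In `upload_document`, this would be called on `all_splits` just before `vector_store.add_documents(...)`. If the collection schema requires other fields as well, the insert may still fail until those are supplied too.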

In [8]:
query = "How is NVIDIA working with Mercedes Benz?"
docs = vector_store.similarity_search(query)
print(docs[0])

page_content='Arch. Biol. Sci., Belgrade, 65 (1), 1-7, 2013 DOI:10.2298/ABS1301001O
1
POPULATION GENETIC CHARACTERISTICS OF HORSE CHESTNUT IN S ERBIA
MIRJANA OCOKOLJIĆ*, DRAGICA VILOTIĆ and MIRJANA ŠIJAČIĆ-NIKOLIĆ 
Faculty of Forestry, University of Belgrade, 11030 Belgrade, Serbia
Abstract – The general population genetic characteristics of cultivated horse chestnut trees excelling in growth, phenotype 
characteristics, type of inflorescence, productivity and resistance to the leafminer Cameraria ohridella Deschka and Dimić 
were analyzed in Serbia. The analyzed population genetic parameters point to fundamental differences in the genetic struc-
ture among the cultivated populations in Serbia. The study shows the variability in all properties among the populations 
and inter-individual variability within the populations. The variability and differential characteristics were assessed using 
statistical parameters, taking into account the satisfactory reflection of the hereditary potent