# **myAI: Populating the Pinecone RAG Database**

Daniel M. Ringel  
Kenan-Flagler Business School  
*The University of North Carolina at Chapel Hill*  
dmr@unc.edu

### ***Adding Documents to Pinecone without RAGLoader***

***Note***: We deprecated RAGLoader after I completed my teaching at KFBS in the Spring 2025 Semester.

[myAI on GitHub](https://github.com/dringel/myAI)

*April 28, 2025*  
Version 1.0

# From Raw Content to Pinecone


This notebook walks through a simple, end-to-end process for indexing unstructured text into Pinecone using semantic chunking. It starts by installing the required Python packages, then creates a Pinecone index tailored to a specific embedding model. Raw documents are split into smaller, semantically meaningful chunks, which are then upserted into the index for vector search. The final step demonstrates how to query the indexed content using natural language.

**WARNING**: myAI ([myAI on GitHub](https://github.com/dringel/myAI)) was originally set up to work with OpenAI embeddings, which it generated directly rather than relying on Pinecone's built-in embedding integration. If you have not changed anything in your GitHub repository and Vercel deployment, you need to use OpenAI embeddings (see the bottom of this notebook) to insert documents (chunks) into Pinecone.

**IMPORTANT**: The myAI repo on GitHub was updated in April 2025 so that you can choose between OpenAI embeddings and Pinecone's integrated embeddings. Your chatbot will be faster with Pinecone's integrated embeddings, but you may face volume limits on a free plan and you are locked in to Pinecone.

> #### **Please note that you cannot use the two in tandem**, or your upserting process will fail. If you wish to use OpenAI embeddings, install the prerequisites and then skip to the end of this notebook.

### To start, add your secret keys (`PINECONE_API_KEY` and, if you use OpenAI embeddings, `OPENAI_API_KEY`) in the Secrets panel of Google Colab in the left sidebar.

![Secrets](https://mapxp.app/BUSI488/secret-collab-pinecone-s.png)
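
A minimal sketch of how a stored key can be read back. `get_secret` is a hypothetical helper that is not part of myAI; it assumes the secret names used later in this notebook and falls back to environment variables when the code runs outside Colab.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's Secrets panel, or from the environment outside Colab."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    value = None
    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            value = None  # secret missing, or notebook access not granted

    if not value:
        value = os.environ.get(name)  # fallback for local runs
    if not value:
        raise KeyError(f"Secret {name!r} not found; add it in the Secrets panel.")
    return value
```

Calling `get_secret("PINECONE_API_KEY")` before creating the Pinecone client surfaces a missing key early, with a readable message.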



## ***Prerequisites***

You need to install a few packages / libraries to start.

In [None]:
# Running these installations should take a couple of minutes

%pip install pinecone # for creating and managing vector indexes
%pip install "unstructured[all-docs]" # for parsing various document formats (PDFs, Word, HTML, etc.) and chunking
%pip install openai # for generating embeddings if not using Pinecone’s built-in models
%pip install pdf2image pytesseract

## **Step 1**: Using Unstructured and Ensuring Correct Setups

What is `unstructured`?

The `unstructured` Python library provides flexible tools for parsing and extracting text from a wide range of document types. It supports both structured and unstructured content, and is designed to work with PDFs, Word documents, HTML, emails, spreadsheets, and more. It is particularly useful for preparing raw content for downstream tasks such as chunking, which we will use later.

At its core, `unstructured` uses a modular pipeline to “partition” documents into clean, structured elements like titles, paragraphs, tables, and images. This makes it easy to convert complex files into plain text or structured formats suitable for LLM and vector workflows. The next code block tests that the packages are installed correctly.

As per the documentation, `unstructured` supports the file types listed below. If you want to use a partitioning function for a specific file type, refer to the documentation [here](https://docs.unstructured.io/open-source/core-functionality/partitioning). If you wish to include an image, upload a text file describing the image instead, and provide the link to the image as the source URL later.

📄 Text Documents
 • Plain Text (`.txt`, `.text`, `.log`)
 • Markdown (`.md`)
 • ReStructuredText (`.rst`)
 • Rich Text Format (`.rtf`)
 • Org Mode (`.org`)
 • XML (`.xml`)

📝 Word Processing
 • Microsoft Word (`.doc`, `.docx`)
 • OpenOffice (`.odt`)
 • EPUB (`.epub`)

📊 Spreadsheets & Tables
 • Excel (`.xlsx`, `.xls`)
 • CSV (`.csv`)
 • TSV (`.tsv`)

📧 Emails
 • EML (`.eml`)
 • MSG (`.msg`)

🌐 Web & Code
 • HTML (`.html`, `.htm`)
 • Code Files (`.js`, `.py`, `.java`, `.cpp`, `.cc`, `.cxx`, `.c`, `.cs`, `.php`, `.rb`, `.swift`, `.ts`, `.go`)

📽️ Presentations
 • PowerPoint (`.ppt`, `.pptx`)

📷 Images
 • Image Files (`.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp`, `.heic`)

📚 PDFs
 • PDF (`.pdf`)
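
Before uploading, it can help to check a file name against this list. The sketch below is a convenience only: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and the set simply copies the extensions listed above rather than querying `unstructured` itself.

```python
from pathlib import Path

# Extensions copied from the list above; not read from the unstructured library.
SUPPORTED_EXTENSIONS = {
    ".txt", ".text", ".log", ".md", ".rst", ".rtf", ".org", ".xml",
    ".doc", ".docx", ".odt", ".epub",
    ".xlsx", ".xls", ".csv", ".tsv",
    ".eml", ".msg",
    ".html", ".htm",
    ".js", ".py", ".java", ".cpp", ".cc", ".cxx", ".c", ".cs",
    ".php", ".rb", ".swift", ".ts", ".go",
    ".ppt", ".pptx",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".heic",
    ".pdf",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension appears in the list above (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("report.PDF")` is true, while `is_supported("lecture.mp4")` is false.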



In [None]:
from google.colab import files
from unstructured.partition.auto import partition

uploaded = files.upload() # upload a file (PDF, DOCX, TXT, etc.)

file_path: str = list(uploaded.keys())[0] # pick the first file as a test

elements = partition(filename=file_path)

RAW_TEXT_EXAMPLE: str = "\n\n".join([str(el) for el in elements])
print(RAW_TEXT_EXAMPLE) # you should see the raw text of whatever file you just uploaded

*If you do not see the raw text of your document, try reinstalling the packages or restarting your session so that the latest versions are loaded. If your code works, congratulations: we can now move on to the next step!*

## **Step 2:** Create an Index

What is a Pinecone Index? [Documentation](https://docs.pinecone.io/guides/indexes/create-an-index)

A Pinecone index is a cloud-hosted vector database that stores and retrieves embeddings for fast and scalable similarity search. You can think of an index as a smart container for vectorized data (like text embeddings), where each record is stored with a unique ID and optional metadata.

When you create an index, you define:
 • Name: A unique identifier for your index

 • Cloud and region: Where the index is hosted (e.g., AWS, GCP)

 • Embedding model: Either managed by Pinecone or external (like OpenAI)

 • Field mapping: How your data maps to the embedding input (e.g., `"text": "chunk_text"`)

Once your index is created, you can:

 • Upsert: Insert or update records (vectors with IDs and metadata)

 • Query: Search for similar records using a new input vector

 • Delete: Remove records by ID

 • Fetch: Retrieve records by ID
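
To make these four operations concrete, here is a toy in-memory stand-in. It is not how Pinecone is implemented, and `ToyIndex` is a hypothetical class; it only illustrates that upserting overwrites records with the same ID and that querying ranks records by similarity (cosine similarity is assumed here).

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """In-memory illustration of upsert, query, delete, and fetch."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def upsert(self, record_id: str, values: list[float], metadata: Optional[dict] = None) -> None:
        # Insert a new record, or overwrite the record that already has this ID.
        self.records[record_id] = {"values": values, "metadata": metadata or {}}

    def fetch(self, record_id: str) -> Optional[dict]:
        return self.records.get(record_id)

    def delete(self, record_id: str) -> None:
        self.records.pop(record_id, None)

    def query(self, vector: list[float], top_k: int = 2) -> list[tuple[str, float]]:
        # Score every stored record against the query vector and keep the best top_k.
        scored = [(rid, cosine(vector, rec["values"])) for rid, rec in self.records.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = ToyIndex()
index.upsert("a", [1.0, 0.0], {"chunk_text": "cats"})
index.upsert("b", [0.0, 1.0], {"chunk_text": "stocks"})
index.upsert("a", [0.9, 0.1], {"chunk_text": "cats and kittens"})  # same ID, so this updates "a"
top = index.query([1.0, 0.0], top_k=1)
```

After these calls the toy index holds two records, and `top` names record `"a"` as the closest match.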

In this notebook, we focus only on upserting and querying. But first, let's create an index.

In [None]:
from pinecone import Pinecone
from google.colab import userdata

# Before running this cell, add your API keys (Pinecone and, if needed, OpenAI) in the Secrets tab on the left and grant this notebook access to them
pc: Pinecone = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))

INDEX_NAME: str = input("Enter a name for your Pinecone Index: ")

# the fields cloud, region, and embed are all customizable and you should refer to documentation if you wish to change things.
if not pc.has_index(INDEX_NAME):
    pc.create_index_for_model(
        name=INDEX_NAME,
        cloud="aws",
        region="us-east-1",
        embed={
            "model":"multilingual-e5-large",
            "field_map":{"text": "chunk_text"}
        }
    )

*Note that an index created with integrated embeddings is tied to that embedding model: you cannot later switch it to externally generated embeddings (such as OpenAI's), or the other way around. To be safe, create a new index with this notebook rather than reusing an existing one.*

## **Step 3:** Making Chunks from Documents

Chunks are smaller sections of a larger document. Instead of sending an entire file to an embedding or vector database, we break it down into manageable pieces that preserve meaning. This is especially helpful when working with long-form content like PDFs, articles, or transcripts.

Breaking text into chunks improves search accuracy and performance. Each chunk becomes a standalone unit for embedding, indexing, and querying. We clean chunks before upserting to improve meaning. When a user asks a question, the system searches across all the chunks to find the most relevant ones.

In this notebook, we use the `unstructured` library to create smart chunks that align with natural language structure, not just fixed word or character counts. This helps each chunk retain context and stay semantically useful. In this section, we will chunk the actual files you want to use.
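
To illustrate the idea of structure-aware chunking, here is a deliberately simplified sketch. It is not the algorithm `unstructured` uses; `pack_paragraphs` and `max_chars` are hypothetical names. It packs whole paragraphs into chunks up to a character budget, so a chunk never ends in the middle of a paragraph.

```python
def pack_paragraphs(text: str, max_chars: int = 500) -> list[str]:
    """Group whole paragraphs into chunks, aiming for at most max_chars per chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph still fits in the current chunk
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a paragraph longer than max_chars stays whole.
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample_text = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 30
sample_chunks = pack_paragraphs(sample_text, max_chars=25)
```

With a budget of 25 characters, the two short paragraphs share one chunk and the long paragraph becomes a chunk of its own.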

*You can rerun the following two code cells as many times as you like to insert additional records into Pinecone; you do not need to rerun the whole notebook. However, you can only insert one file at a time.*

In [None]:
from unstructured.chunking.basic import chunk_elements
from google.colab import files
from unstructured.partition.auto import partition

uploaded = files.upload() # upload your file

file_path: str = list(uploaded.keys())[0]
source_citation: str = input("Enter the source description: ") # you can instead assign your metadata directly on this line
source_url: str = input("Enter the url of the source: ")

elements = partition(filename=file_path)
chunked_elements = chunk_elements(elements)

chunks: list[str] = []

for chunk in chunked_elements:
    chunks.append(chunk.text.replace("\n", " "))

## **Step 4:** Refining and Upserting Document Chunks to Pinecone


What is Upserting?

Upserting is the process of inserting new records - vectors and texts - into a Pinecone index, or updating them if they already exist. In this notebook, we take each chunk of text and prepare it with extra metadata before sending it to Pinecone.

We start by padding the list of chunks with an empty string at the beginning and at the end. This lets us safely look up the text before and after every chunk (its pre- and post-context), including the first and last ones, which is helpful for improving search relevance later.

Then we loop through the chunks and build a list of dictionaries, each representing a record to be indexed. Each record includes:

 • a unique `id` (generated using `uuid`)

 • the chunk text

 • the order of the chunk

 • the text that comes before and after the chunk

 • source metadata (description and URL)
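
The padding trick is easiest to see on a tiny example. The sketch below wraps the same logic as the upsert cell in a hypothetical helper, `build_records`, and runs it on three made-up chunks; the source description and URL are placeholders.

```python
import uuid


def build_records(chunks: list[str], source_description: str, source_url: str) -> list[dict]:
    """Attach an ID, order, neighboring text, and source metadata to each chunk."""
    padded = [""] + chunks + [""]  # empty sentinels give the first and last chunk neighbors
    records = []
    for i in range(1, len(padded) - 1):
        if len(padded[i]) < 3:  # skip near-empty chunks, as the notebook does
            continue
        records.append({
            "id": str(uuid.uuid4()),
            "chunk_text": padded[i],
            "order": i - 1,
            "pre_context": padded[i - 1],
            "post_context": padded[i + 1],
            "source_description": source_description,
            "source_url": source_url,
        })
    return records


records = build_records(
    ["Intro.", "Main point.", "Conclusion."],
    "Example notes",              # placeholder description
    "https://example.com/notes",  # placeholder URL
)
```

Here the middle record has `"Intro."` as its `pre_context` and `"Conclusion."` as its `post_context`, while the first and last records get an empty string on their outer side.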

 **This code block is using integrated embeddings, so if you plan to use another type of embedding model this code will not work.**


In [None]:
import uuid

data = [""] + chunks + [""] # chunks from the previous cell

data_to_upsert = []

# create pre and post contexts with accurate metadata
for i in range(1, len(data) - 1):
    if len(data[i]) < 3:
        continue
    data_to_upsert += [{"id": str(uuid.uuid4()), "text": data[i], "order": i - 1, "post_context": data[i + 1], "pre_context": data[i - 1], "source_description": source_citation, "source_url": source_url}]

refined_chunks = [data_to_upsert[i:min(i + 48, len(data_to_upsert))] for i in range(0, len(data_to_upsert), 48)]

# upsert in batches; the first argument of upsert_records is the namespace
# (this notebook uses the index name as the namespace)
target_index = pc.Index(INDEX_NAME)
for chunk_batch in refined_chunks:
    formatted_batch = []
    for record in chunk_batch:
        formatted_batch.append({
            "id": record["id"],
            "chunk_text": record["text"],
            "order": record["order"],
            "pre_context": record["pre_context"],
            "post_context": record["post_context"],
            "source_description": record["source_description"],
            "source_url": record["source_url"]
        })
    # upsert once per batch, after all of its records have been formatted
    target_index.upsert_records(INDEX_NAME, formatted_batch)




## **Step 5:** Querying Records

What is Querying?

Once your chunks are indexed in Pinecone, you can search them using natural language queries. This is done through Pinecone’s `search_records` method, which compares your query text to all stored vectors and returns the most relevant matches.

In this notebook, we use the integrated embedding model to automatically convert your search text into a vector behind the scenes. You don’t need to generate the embedding yourself; just pass in your input text.

The call below returns the two chunks that are most semantically similar to your search. You can increase `top_k` to get more results.

Each result includes the original chunk text, metadata (like source URL), and a similarity score. This lets you surface the most relevant information from your indexed documents in response to user questions.
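
As a sketch of how these pieces can be read out, the helper below formats each hit as one line with its score, source URL, and text. `format_hits` is a hypothetical name, the sample dictionary is hand-written, and the `_score` key is an assumption about the response shape; the `result`, `hits`, and `fields` keys follow the query cell in this notebook.

```python
def format_hits(results: dict) -> list[str]:
    """Return one 'score | source_url | chunk_text' line per hit."""
    lines = []
    for hit in results["result"]["hits"]:
        fields = hit.get("fields", {})
        score = hit.get("_score", 0.0)  # assumed key name for the similarity score
        lines.append(
            f"{score:.3f} | {fields.get('source_url', 'n/a')} | {fields.get('chunk_text', '')}"
        )
    return lines


# Hand-written sample shaped like the dictionary the query cell iterates over.
sample_results = {
    "result": {
        "hits": [
            {"_id": "1", "_score": 0.87,
             "fields": {"chunk_text": "First chunk.", "source_url": "https://example.com/a"}},
            {"_id": "2", "_score": 0.55,
             "fields": {"chunk_text": "Second chunk."}},
        ]
    }
}

formatted = format_hits(sample_results)
```

Printing the source URL next to each chunk makes it easy to check which document an answer came from.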

In [None]:
search_text: str = input("Enter a search query: ")
index_name: str = input("Enter the name of the index you want to query: ")

response = pc.Index(index_name).search_records(
    index_name,
    query= {
        "inputs": {"text": search_text},
        "top_k": 2
    }
)

results: dict = response.to_dict()

for match in results["result"]["hits"]:
    print(match["fields"]["chunk_text"])




# **Using OpenAI Embeddings**: as originally done with myAI before April 2025

[myAI on GitHub](https://github.com/dringel/myAI)

If you would rather use OpenAI embeddings, you can chunk and embed your documents here. You can change the embedding model, but make sure that its output dimension matches the dimension of your Pinecone index.
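
A small sanity check can catch a dimension mismatch before any upsert fails. The sketch below is illustrative: `EMBEDDING_DIMENSIONS` and `check_dimension` are hypothetical names, and the table lists default output dimensions for a few common models only.

```python
# Default output dimensions of a few embedding models (illustrative, not exhaustive).
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "multilingual-e5-large": 1024,
}


def check_dimension(model: str, index_dimension: int) -> None:
    """Raise an error if the model's default dimension differs from the index dimension."""
    expected = EMBEDDING_DIMENSIONS.get(model)
    if expected is None:
        raise KeyError(f"Unknown model {model!r}; look up its dimension in the provider's documentation.")
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}."
        )


check_dimension("text-embedding-ada-002", 1536)  # passes silently
```

For instance, `check_dimension("text-embedding-ada-002", 1024)` raises a `ValueError`, because that model produces 1536-dimensional vectors.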

**WARNING**: myAI was originally set up to work with OpenAI embeddings. If you have not changed anything in your GitHub repository and Vercel deployment, you need to use this part of the notebook to insert documents (chunks) into Pinecone.

In [None]:
import uuid

from google.colab import files, userdata
from openai import OpenAI
from pinecone import Pinecone
from unstructured.chunking.basic import chunk_elements
from unstructured.partition.auto import partition

uploaded = files.upload() # upload your file

file_path = list(uploaded.keys())[0]
source_citation = input("Enter the source description: ")
source_url = input("Enter the url of the source: ")

elements = partition(filename=file_path)
chunked_elements = chunk_elements(elements)

chunks: list[str] = []

for chunk in chunked_elements:
    chunks.append(chunk.text.replace("\n", " "))


data = [""] + chunks + [""]

data_to_upsert = []

# create pre and post contexts with accurate metadata
for i in range(1, len(data) - 1):
    if len(data[i]) < 3:
        continue
    data_to_upsert += [{"id": str(uuid.uuid4()), "text": data[i], "order": i - 1, "post_context": data[i + 1], "pre_context": data[i - 1], "source_description": source_citation, "source_url": source_url}]

refined_chunks = [data_to_upsert[i:min(i + 48, len(data_to_upsert))] for i in range(0, len(data_to_upsert), 48)]


# connect to Pinecone and OpenAI
index: str = input("What is the name of the Pinecone index you wish to upsert to: ") # you can instead assign your index name directly
pc = Pinecone(api_key=userdata.get("PINECONE_API_KEY"))
if not pc.has_index(index): # you must have an existing Pinecone index with the correct dimensions
    raise ValueError("Index does not exist.")

pinecone_index = pc.Index(name=index, pool_threads=5)
client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))

# embed each chunk with OpenAI, then upsert one batch at a time
for chunk_batch in refined_chunks:
    formatted_batch = []
    for record in chunk_batch:
        response = client.embeddings.create(
            input=record["text"],
            model="text-embedding-ada-002" # customize this as you wish
        )
        embedding = response.data[0].embedding
        # print(embedding) # uncomment to inspect each embedding as it is created
        formatted_batch.append({
            "id": record["id"],
            "values": embedding,
            "metadata": {
                "order": record["order"],
                "pre_context": record["pre_context"],
                "post_context": record["post_context"],
                "source_description": record["source_description"],
                "source_url": record["source_url"],
                "chunk_text": record["text"]
            }
        })
    # upsert once per batch, after all of its records have been embedded
    pinecone_index.upsert(vectors=formatted_batch)


## ***Reading and Querying Your Non-Integrated Pinecone Index***

In [None]:
res = client.embeddings.create(
    input="INPUT TEXT HERE", # replace with the text you want to search for
    model="text-embedding-ada-002"
)

query_vector = res.data[0].embedding


response = pinecone_index.query(
    namespace="",
    vector=query_vector,
    top_k=3,
    include_metadata=True,
    include_values=True
)

results: dict = response.to_dict()

for match in results["matches"]:
    print(match["metadata"]["chunk_text"])
