# AI Vector Store with LangChain and Pinecone

This guide shows how to build a vector store with LangChain and the Pinecone package (not the deprecated pinecone-client) from three documents: a text file (Alice in Wonderland), a Markdown file (the Flask README, standing in for Flask documentation), and a PDF file (the research paper Attention Is All You Need).

## Setup and Installation

First, let's install all necessary dependencies with the updated Pinecone package:

In [1]:
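# Install required packages with the new Pinecone package.
# langchain-community, langchain-pinecone and python-dotenv are needed by the imports used below.
# !pip install langchain langchain-openai langchain-community langchain-pinecone pinecone python-dotenv pypdf tiktoken requests unstructured markdown faiss-cpu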

## Environment Setup

Let's set up our environment variables:

In [2]:
import os
from pathlib import Path
from dotenv import load_dotenv
import pinecone
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_pinecone import PineconeVectorStore

# Load environment variables from .env file (if you have one)
load_dotenv()

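# Read your API keys. load_dotenv() has already copied any values from .env
# into os.environ, so OPENAI_API_KEY only needs to be checked, not reassigned
# (assigning os.getenv(...) back into os.environ raises TypeError when the key is unset).
if not os.getenv('OPENAI_API_KEY'):
    raise RuntimeError("OPENAI_API_KEY is not set")
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') # serverless region, e.g. us-west-2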
# print(os.environ["OPENAI_API_KEY"])
# print(PINECONE_API_KEY)
# print(PINECONE_ENVIRONMENT)

# Create data directory
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

MODEL_GPT = 'gpt-4o-mini' # 'gpt-3.5-turbo'
PINECONE_INDEX_NAME = "mixed-document-types"
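
A missing key is easier to diagnose before any API call is made. The following is a minimal sketch of such a check, assuming the three variable names used in this notebook; the helper `missing_env_vars` and the list `REQUIRED_VARS` are hypothetical names introduced only for illustration.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Hypothetical list: the three variables this notebook reads.
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT"]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing `env` explicitly keeps the helper easy to try with a plain dictionary instead of the real environment.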

## Download Specific Files

We'll download the three files used in this guide:

In [3]:
import requests

def download_file(url, path):
    """Download a file from a URL to a specific path."""
    # A timeout keeps the cell from hanging if the server never responds
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        with open(path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded {path.name}")
    else:
        print(f"Failed to download {url} (HTTP {response.status_code})")

In [4]:
# Download Alice in Wonderland (txt)
alice_url = "https://www.gutenberg.org/files/11/11-0.txt"
alice_path = data_dir / "alice_in_wonderland.txt"
if not alice_path.exists():
    download_file(alice_url, alice_path)
else:
    print(f"{alice_path.name} already exists.")

Downloaded alice_in_wonderland.txt


In [5]:
# Download Flask documentation (markdown)
flask_docs_url = "https://raw.githubusercontent.com/pallets/flask/main/README.md"
flask_docs_path = data_dir / "flask_docs.md"
if not flask_docs_path.exists():
    download_file(flask_docs_url, flask_docs_path)
else:
    print(f"{flask_docs_path.name} already exists.")

Downloaded flask_docs.md


In [6]:
# Download a research paper (PDF)
research_paper_url = "https://arxiv.org/pdf/1706.03762.pdf"
research_paper_path = data_dir / "attention_paper.pdf"
if not research_paper_path.exists():
    download_file(research_paper_url, research_paper_path)
else:
    print(f"{research_paper_path.name} already exists.")

Downloaded attention_paper.pdf


In [7]:
print(alice_path)
print(flask_docs_path)
print(research_paper_path)

data\alice_in_wonderland.txt
data\flask_docs.md
data\attention_paper.pdf


## Load and Process Different File Types

Now we'll load the documents using the appropriate loaders for each file type:

In [8]:
from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredMarkdownLoader

# Create a list to store all documents
all_documents = []

# Load TXT file - Alice in Wonderland
print(f"Loading {alice_path.name}...")
# text_loader = TextLoader(alice_path)
text_loader = TextLoader(str(alice_path), encoding='utf-8')
all_documents.extend(text_loader.load())

# Load Markdown file - Flask docs
print(f"Loading {flask_docs_path.name}...")
# md_loader = UnstructuredMarkdownLoader(flask_docs_path)
md_loader = UnstructuredMarkdownLoader(str(flask_docs_path))
all_documents.extend(md_loader.load())

# Load PDF file - Research paper
print(f"Loading {research_paper_path.name}...")
# pdf_loader = PyPDFLoader(research_paper_path)
pdf_loader = PyPDFLoader(str(research_paper_path))
all_documents.extend(pdf_loader.load())

print(f"Loaded {len(all_documents)} documents")

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)
chunks = text_splitter.split_documents(all_documents)

print(f"Split into {len(chunks)} chunks")

Loading alice_in_wonderland.txt...
Loading flask_docs.md...
Loading attention_paper.pdf...
Loaded 17 documents
Split into 245 chunks
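
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts a string into fixed-size character windows sharing an overlap. It is an illustration only: `sliding_chunks` is a hypothetical helper, and RecursiveCharacterTextSplitter itself prefers to break on separators such as paragraph and line breaks rather than at fixed offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration, not the algorithm LangChain uses.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Each chunk repeats the last 2 characters of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.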


## Create Embeddings

We'll use OpenAI embeddings to vectorize our document chunks:

In [9]:
# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings()

# Test embedding a sample text
sample_text = "This is a test document for embeddings."
sample_embedding = embeddings.embed_query(sample_text)
print(f"Embedding dimension: {len(sample_embedding)}")

Embedding dimension: 1536
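
The Pinecone index later in this guide is created with `metric="cosine"`, so retrieval ranks chunks by the cosine similarity between the query embedding and each chunk embedding. A minimal pure-Python sketch of that measure follows; `cosine_similarity` is a hypothetical helper written for illustration, and the short vectors stand in for real 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction scores near 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on direction, two texts with similar meaning can score highly even when their embeddings differ in magnitude.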


## Create Vector Store with Pinecone

Now let's initialize Pinecone and create our vector store using the updated Pinecone package:

In [10]:
# Initialize the Pinecone client
pc = pinecone.Pinecone(api_key=PINECONE_API_KEY)

# Set index name with a unique identifier
# index_name = "mixed-document-types"
index_name = PINECONE_INDEX_NAME

# Check if index already exists, if not create it
existing_indexes = [index.name for index in pc.list_indexes()]

if PINECONE_INDEX_NAME not in existing_indexes:
    # Determine dimension from the OpenAI embeddings
    dimension = len(sample_embedding)
    # Create the index with the Pinecone API
    pc.create_index(
        name=PINECONE_INDEX_NAME,
        dimension=dimension,
        metric="cosine",
        spec=pinecone.ServerlessSpec(
            cloud="aws",
            # region="us-west-2"
            region=PINECONE_ENVIRONMENT
        )
    )
    print(f"Created new Pinecone index: {PINECONE_INDEX_NAME}")
else:
    print(f"Using existing Pinecone index: {PINECONE_INDEX_NAME}")

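# Connect to the index (useful for direct calls such as index.describe_index_stats())
index = pc.Index(PINECONE_INDEX_NAME)

# Embed the chunks and upload them to the index through LangChain.
# Note: this also runs when the index already existed, so re-running the
# cell uploads the same chunks again.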
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=PINECONE_INDEX_NAME
)

print("Documents loaded into Pinecone successfully!")

Created new Pinecone index: mixed-document-types
Documents loaded into Pinecone successfully!
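
A newly created serverless index may take a moment before it accepts writes. The sketch below polls the index status before uploading; it assumes that `pc.describe_index(name).status["ready"]` is available in the installed pinecone version, and `wait_until_ready` with its `timeout_s` and `poll_s` parameters is a hypothetical helper, not part of the Pinecone package.

```python
import time

def wait_until_ready(pc, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll describe_index until the index reports ready.

    Returns True once ready, or False if timeout_s elapses first.
    Assumes the client exposes describe_index(name).status["ready"].
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if pc.describe_index(index_name).status["ready"]:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)

# Possible usage between pc.create_index(...) and from_documents(...):
# if not wait_until_ready(pc, PINECONE_INDEX_NAME):
#     raise RuntimeError("Pinecone index did not become ready in time")
```

Taking the client as an argument keeps the helper independent of the notebook's global `pc` and makes it easy to exercise with a stand-in object.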


## Create Retrieval System

Let's set up our retrieval system with LangChain:

In [11]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Create a RetrievalQA chain
llm = ChatOpenAI(model_name=MODEL_GPT)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
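
The "stuff" chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows the idea in plain Python; `build_stuffed_prompt` is a hypothetical helper, and the prompt wording is illustrative rather than the template LangChain actually uses.

```python
def build_stuffed_prompt(question, passages):
    """Join retrieved passages into one context block and append the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Two made-up passages standing in for retrieved chunks.
prompt = build_stuffed_prompt(
    "Who does Alice follow?",
    ["Alice saw a White Rabbit run past.", "She followed it down a rabbit-hole."],
)
print(prompt)
```

Since every retrieved chunk goes into one prompt, the value of `k` and the chunk size together determine how much of the model's context window each question consumes.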

## Test with Questions About the Documents

Let's test our system with five questions about our specific documents:

In [12]:
# Function to query and display results
def ask_question(question):
    print(f"\n\nQuestion: {question}")
    print("-" * 50)
    
    # result = qa_chain({"query": question}) # deprecated
    result = qa_chain.invoke({"query": question})
    
    print("Answer:")
    print(result["result"])
    
    print("\nSource Documents:")
    for i, doc in enumerate(result["source_documents"]):
        print(f"\nDocument {i+1}:")
        source = doc.metadata.get('source', 'Unknown')
        if source.endswith('.txt'):
            doc_type = "Alice in Wonderland"
        elif source.endswith('.md'):
            doc_type = "Flask Documentation"
        elif source.endswith('.pdf'):
            doc_type = "Attention Research Paper"
        else:
            doc_type = "Unknown"
            
        print(f"Source: {doc_type} ({source})")
        if 'page' in doc.metadata:
            print(f"Page: {doc.metadata['page']}")
        print(f"Content: {doc.page_content[:200]}...")

# List of questions about the documents
questions = [
    "What happens to Alice when she drinks from the bottle?",
    "What are the key features of Flask framework?",
    "What is the attention mechanism described in the research paper?",
    "How does the White Rabbit appear in Alice in Wonderland?",
    "Compare the self-attention architecture with RNNs according to the paper."
]

# Ask each question
for question in questions:
    ask_question(question)



Question: What happens to Alice when she drinks from the bottle?
--------------------------------------------------
Answer:
When Alice drinks from the bottle labeled "DRINK ME," she finds it very nice and finishes it off. The context suggests that drinking from the bottle likely causes her to change in size, although the specific effect of that particular drink on her size is not detailed in the provided text.

Source Documents:

Document 1:
Source: Alice in Wonderland (data\alice_in_wonderland.txt)
Content: There seemed to be no use in waiting by the little door, so she went
back to the table, half hoping she might find another key on it, or at
any rate a book of rules for shutting people up like telesco...

Document 2:
Source: Alice in Wonderland (data\alice_in_wonderland.txt)
Content: It was all very well to say “Drink me,” but the wise little Alice was
not going to do _that_ in a hurry. “No, I’ll look first,” she said,
“and see whether it’s marked ‘_poison_’ or not”; for she had 

## Clean Up (Optional)

If you want to delete the Pinecone index after you're done:

In [13]:
# Delete the index using modern Pinecone API (uncomment if needed)
# pc.delete_index(PINECONE_INDEX_NAME)
# print(f"Deleted Pinecone index: {PINECONE_INDEX_NAME}")