# ```.pdf``` loader and splitter

1. Create ```/pdfs``` directory then place all .docx files inside 
2. Run app and all ```.pdf``` files within the directory will be loaded into your vector store
3. Use helper function to delet db 
4. Use Chat functions to test
5. Trace using LangServe
---

### Directory Structure

```
├── pinecone-langchain-pdf-dataloader.ipynb
├── pdfs
│   ├── client_file_01.pdf
│   ├── client_file_02.pdf
│   ├── client_file_03.pdf
```
---

1. ```check_and_load_pdf_from_dir``` : checks directory for files,loads documents with the ```Docx2txtLoader```, then splits the docs with langchains ```RecursiveCharacterTextSplitter```.

In [None]:
! pip install --upgrade --quiet  docx2txt
! pip install langchain
! pip install tiktoken
! pip install pinecone-client

In [15]:
# DOCX - Check Path ald Load .pdf

import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders.pdf import PyPDFLoader


def check_and_load_pdf_from_dir(directory):
    # Ensure the directory path exists
    if not os.path.exists(directory):
        print("Directory does not exist.")
        return False

    # Check if directory is actually a directory
    if not os.path.isdir(directory):
        print("The specified path is not a directory.")
        return False

    # List all files in the directory
    all_files = os.listdir(directory)
    print(f'.pdf files within the /pdfs directory: {all_files}')
    
    # Check if each file ends with .docx
    for file in all_files:
        if not file.endswith(".pdf"):
            print(f"Non-PDF file found: {file}")
            return False
 
 # load directory path
    directory_path = directory 
    # add list comprehension for file os.path.join usage
    pdf_files = [f for f in all_files if f.endswith(".pdf")]
    # Create emptly list container
    documents = []
    # Loop through the directory, bundle and load docx files. 
    for pdf_file in pdf_files:
        file_path = os.path.join(directory_path, pdf_file)
        loader = PyPDFLoader(file_path)
        documents = loader.load()


    text_splitter = RecursiveCharacterTextSplitter( 
        chunk_size=1000,  # Maximum size of each chunk
        chunk_overlap=100,  # Number of overlapping characters between chunks
    )

    chunks = text_splitter.split_documents(documents)

    return chunks

chunks = check_and_load_pdf_from_dir('pdfs')

# Call Chunks 
print(len(chunks))
print(chunks[0].page_content)
print(chunks[1].page_content)

.pdf files within the /pdfs directory: ['CodeLlamaMeta.pdf', 'ArjanCodes-SDev-Guide.pdf']


Ignoring wrong pointing object 11 0 (offset 0)
Ignoring wrong pointing object 24 0 (offset 0)
Ignoring wrong pointing object 26 0 (offset 0)
Ignoring wrong pointing object 37 0 (offset 0)


17
Software Design Guide          
A 7 step-guide to designing great software © ArjanCodes
1  Before We Dive In Have you ever been stuck trying to find a way to write software that can solve a complex problem, but that doesn’t become a huge mess of spaghetti code? Do you often end up in a situation where you know what your software should eventually do, but you have no idea how or where to start? I’ve been there many times, just like you, and I’ve written this 7-step plan to help you create consistently great software designs. I’ve been developing software for as long as I can remember (but I have pretty bad memory, so there’s that


In [9]:
# Split Data Text With Cost Calculation
# How much it costs to embed
def calculate_and_display_embedding_cost(texts):
    import tiktoken
    enccoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    total_tokens = sum([len(enccoding.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD:{total_tokens / 1000 * 0.0004:.6f}')

calculate_and_display_embedding_cost(chunks)

Total Tokens: 2439
Embedding Cost in USD:0.000976


3. Create and delete pinecone index functions
- ```delete_index_with_same_name```: deletes pincone index of the same name 
- ```load_or_create_embeddings_index```: If the index already exists it will just load data. If the idex is brand new it will create an then load the data. 

In [10]:
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import time

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY") or 'PINECONE_API_KEY'
)

def delete_index_with_same_name(index_name): 
    
    # Delete index if any incdexes of the same name are present
    if index_name in pc.list_indexes().names():
        print(f'Deleting the {index_name} vector database')
        pc.delete_index(index_name)


def load_or_create_embeddings_index(index_name, chunks, namespace):
    
    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings...', end='')
        vector_store = PineconeVectorStore.from_documents(
        documents=chunks, embedding=OpenAIEmbeddings(), index_name=index_name, namespace=namespace)
        
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)
        
        print('Done')
    else:
        print(f'Creating index {index_name} and embeddings ...', end = '')
        pc.create_index(name=index_name, dimension=1536, metric='cosine',  spec=ServerlessSpec(
                cloud='aws',
                region='us-west-2'
            ))
        
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)
        # Add to vectorDB using LangChain 
        vector_store = PineconeVectorStore.from_documents(
        documents=chunks, embedding=OpenAIEmbeddings(), index_name=index_name, namespace=namespace)
        print('Done')   
    return vector_store

  from tqdm.autonotebook import tqdm


2. ```calculate_and_display_embedding_cost```: calculate embedding cost using using tiktoken
- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

4. Create your index and load all data. 

In [11]:
index_name='pdf-rag-test-1'
chunks = chunks
namespace = "pdf_documents"

vector_store = load_or_create_embeddings_index(index_name=index_name, chunks=chunks, namespace=namespace)

Creating index pdf-rag-test-1 and embeddings ...Done


5. Delete your index

In [None]:
# Delete vector store 
index_name='pdf-rag-test-1'
delete_index_with_same_name(index_name)

6. Set up a LCEL conversation chain for to test data

In [None]:
# Experiment 3
# Create Retrieval Chain Q+A Chain Chaining Retrieval with Create retrival/hitory 
def create_history_aware_retriever_with_hub(vector_store, question, chat_history=[]):
    from langchain_community.chat_models import ChatOpenAI
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain.chains import create_history_aware_retriever
    from langchain.chains import create_retrieval_chain
    from langchain.prompts import ChatPromptTemplate
    from langchain_core.prompts import MessagesPlaceholder
    from langchain import hub
   
    # https://smith.langchain.com/hub/langchain-ai/retrieval-qa-chat?organizationId=80b05ae8-a524-5b5f-b4d7-36207f821772
    rephrase_prompt = hub.pull("langchain-ai/chat-langchain-rephrase")

    llm = ChatOpenAI(temperature=1)
    
    # Grab your Pinecone and set the search type    
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':3})
    
    # Chain 1    
    chat_retriever_chain = create_history_aware_retriever(
        llm, retriever, rephrase_prompt
    )
    
    qa_system_prompt = """You are an assistant for question-answering tasks. \
    Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know. \
    Use three sentences maximum and keep the answer concise.\

    {context}"""

    qa_prompt = ChatPromptTemplate.from_messages(
        [
            ("system", qa_system_prompt),
            MessagesPlaceholder("chat_history"),
            ("human", "{input}"),
        ]
    )
    # Chain 2
    question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
    # combine both chains
    rag_chain = create_retrieval_chain(chat_retriever_chain, question_answer_chain)
    # invoke the chain to get a response
    result = rag_chain.invoke({"input": question, "chat_history": chat_history })
    # Append to Chat History 
    chat_history.append((question, result['answer']))
    
    return result, chat_history

7. This does something with LangSmith trace
https://docs.smith.langchain.com/tracing

In [None]:
# LangSmith Trace

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = os.environ['LANGCHAIN_API_KEY']

8. Chat with your database to test

In [None]:
chat_history = []
question = "What is Llama?"
result, chat_history = create_history_aware_retriever_with_hub(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)