## ```.docx``` Pinecone/LangChain dataloader and splitter

1. Create a ```/docxs``` directory and place all of your .docx files inside it.
2. Run the app; every file in ```/docxs``` is loaded, split, and added to your vector store.
3. Use the helper function to delete your Pinecone index when you no longer need it.
4. Use the chat functions to test retrieval.
5. Trace runs using LangSmith.
---

### Directory Structure 

```
├── pinecone-langchain-docx-dataloader.ipynb
├── docxs
│   ├── client_file_01.docx
│   ├── client_file_02.docx
│   ├── client_file_03.docx
```



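1. ```check_and_load_docx_from_dir```: checks the directory for files, loads each document with ```Docx2txtLoader```, then splits the documents with LangChain's ```RecursiveCharacterTextSplitter```. The function returns ```False``` if the path does not exist, is not a directory, or contains any non-.docx file.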
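
As a rough sketch of the directory check described above, the following standard-library helper lists the .docx files in a folder. The name `list_docx_files` is hypothetical, and unlike the notebook's function it skips other files rather than returning `False`; that difference is an assumption about what may be convenient, not the notebook's behavior.

```python
from pathlib import Path


def list_docx_files(directory):
    """Return sorted paths of the .docx files in `directory` (hypothetical helper)."""
    path = Path(directory)
    if not path.is_dir():
        # Covers both a missing path and a path that is not a directory
        raise NotADirectoryError(f"{directory!r} is not an existing directory")
    # Keep only regular files whose suffix is .docx, ignoring case
    return sorted(
        p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".docx"
    )
```

Each returned path could then be passed to a loader one at a time, in the same way the notebook's loop does.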

In [None]:
! pip install --upgrade --quiet docx2txt
! pip install langchain langchain-community langchain-text-splitters
! pip install langchain-openai langchain-pinecone
! pip install tiktoken
! pip install pinecone-client

In [2]:
# DOCX - Check path and load .docx

import os
from langchain_community.document_loaders.word_document import Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


def check_and_load_docx_from_dir(directory):
    # Ensure the directory path exists
    if not os.path.exists(directory):
        print("Directory does not exist.")
        return False

    # Check if directory is actually a directory
    if not os.path.isdir(directory):
        print("The specified path is not a directory.")
        return False

    # List all files in the directory
    all_files = os.listdir(directory)
    print(f'Heres the list of all .docx files within the {directory} directory: {all_files}')
    
    # Check if each file ends with .docx
    for file in all_files:
        if not file.endswith(".docx"):
            print(f"Non-DOCX file found: {file}")
            return False
    
    # Keep the directory path for building full file paths
    directory_path = directory
    # Collect the .docx file names to join with the directory path
    docx_files = [f for f in all_files if f.endswith(".docx")]
    # Create an empty list to hold the loaded documents
    documents = []
    # Loop through the .docx files and load each one
    for docx_file in docx_files:
        file_path = os.path.join(directory_path, docx_file)
        loader = Docx2txtLoader(file_path)
        documents.extend(loader.load())

    # Split text
    text_splitter = RecursiveCharacterTextSplitter( 
        chunk_size=1000,  # Maximum size of each chunk
        chunk_overlap=100,  # Number of overlapping characters between chunks
        )
    
    # Split every loaded document into chunks
    chunks = text_splitter.split_documents(documents)
        
    return chunks

chunks = check_and_load_docx_from_dir('docxs')

chunks


# Usage example
# directory_path = 'docxs'
# check_and_load_docx_from_dir(directory_path)

Heres the list of all .docx files within the docxs directory: ['Trasaterra_not_marketing_company.docx', 'wdttd.docx']


[Document(page_content='Trasaterra Is Not a Marketing Company !\n\n\n\nTrasaterra is not a marketing company. It is a design agency that specializes in branding, comprehensive design services, and both web design and development. Additionally, Trasaterra can include content creation as a component of its core offerings when necessary. But content creation is not part of its core offering.', metadata={'source': 'docxs/Trasaterra_not_marketing_company.docx'}),
 Document(page_content='What does trasaterra do? \n\n\n\n\n\nTrasaterra is a New York City-based design studio founded in 2010 by Ewa Orzech and Jason Paul Guzman, leverages a combined wealth of branding, design and digital expertise to craft transformative, refined, and forthright creative solutions for a diverse range of clients across media, art, finance, non-profit, and technology industries.\n\n\n\nTrasaterra\xa0creates\xa0work that\xa0attracts,\xa0converts,\xa0engages\xa0and\xa0retains\xa0Client audiences\n\nOur core value is
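
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window chunker. It is a sketch only: the hypothetical `chunk_text` function slices by character position, whereas `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split `text` into windows of `chunk_size` characters that overlap
    by `chunk_overlap` characters (hypothetical, simplified helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 100-character overlap plays in the notebook's settings.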

2. ```calculate_and_display_embedding_cost```: calculates the embedding cost using tiktoken
- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

In [3]:
# Estimate how much it costs to embed the chunks
def calculate_and_display_embedding_cost(texts):
    import tiktoken
    # gpt-3.5-turbo maps to the cl100k_base encoding in tiktoken
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    total_tokens = sum(len(encoding.encode(page.page_content)) for page in texts)
    print(f'Token Amount: {total_tokens}')
    # Price assumed by this notebook: 0.0004 USD per 1,000 tokens
    print(f'Embedding Cost in USD:{total_tokens / 1000 * 0.0004:.6f}')

calculate_and_display_embedding_cost(chunks)

Token Amount: 631
Embedding Cost in USD:0.000252
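
The cost figure above is plain arithmetic: the token count divided by 1,000, multiplied by a price per 1,000 tokens. The sketch below isolates that step. `embedding_cost_usd` is a hypothetical helper, and the default price is the 0.0004 USD figure used in this notebook, which is an assumption that may not match current pricing for your embedding model.

```python
def embedding_cost_usd(total_tokens, price_per_1k_tokens=0.0004):
    """Estimate the embedding cost in USD (hypothetical helper).

    The default price is the rate assumed by the notebook, not a quoted price.
    """
    return total_tokens / 1000 * price_per_1k_tokens


# 631 tokens is the count reported in the output above
print(f"{embedding_cost_usd(631):.6f}")  # 0.000252
```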


3. Create and delete Pinecone index functions
- ```delete_index_with_same_name```: deletes the Pinecone index with the given name, if one exists.
- ```load_or_create_embeddings_index```: if the index already exists, it loads the data into it. If the index is new, it creates the index and then loads the data.

In [4]:
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import time

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY") or 'PINECONE_API_KEY'
)

def delete_index_with_same_name(index_name): 
    
    # Delete index if any indexes of the same name are present
    if index_name in pc.list_indexes().names():
        print(f'Deleting the {index_name} vector database')
        pc.delete_index(index_name)


def load_or_create_embeddings_index(index_name, chunks, namespace):

    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings...', end='')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        # dimension must match the embedding size (1536 for OpenAI's text-embedding-ada-002)
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric='cosine',
            spec=ServerlessSpec(cloud='aws', region='us-west-2'),
        )

    # Wait until the index is ready before writing to it
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

    # Embed the chunks and add them to the vector DB using LangChain
    vector_store = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=OpenAIEmbeddings(),
        index_name=index_name,
        namespace=namespace,
    )
    print('Done')
    return vector_store
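
The readiness loop above polls once per second with no upper bound. A minimal sketch of the same polling pattern with a timeout follows. `wait_until_ready` and its parameters are hypothetical names, and `is_ready` stands in for a check such as `pc.describe_index(index_name).status['ready']`.

```python
import time


def wait_until_ready(is_ready, timeout=60.0, interval=1.0):
    """Call `is_ready()` until it returns True, or raise TimeoutError
    once `timeout` seconds have passed (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Resource not ready after {timeout} seconds")
        time.sleep(interval)


# Possible usage with the notebook's client (assumes `pc` and `index_name` exist):
# wait_until_ready(lambda: pc.describe_index(index_name).status['ready'])
```

Raising after a deadline keeps a stalled index creation from blocking the notebook indefinitely.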



4. Create your index and load all data. 

In [5]:
index_name='dataloader-test-1'
namespace = "docxs_documents"

vector_store = load_or_create_embeddings_index(index_name=index_name, chunks=chunks, namespace=namespace)

Creating index dataloader-test-1 and embeddings ...Done
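
The index is created with `metric='cosine'`, and the retrievers later in this notebook ask for the `k` most similar chunks. As an illustration only, the sketch below computes cosine similarity and a top-k ranking over toy vectors in pure Python. The function names and the 2-dimensional vectors are hypothetical; Pinecone performs the real search server-side on the stored embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Return the ids of the k vectors most similar to `query`."""
    ranked = sorted(
        vectors,
        key=lambda item_id: cosine_similarity(query, vectors[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: ids mapped to made-up 2-dimensional "embeddings"
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
print(top_k([1.0, 0.0], toy_vectors, k=2))  # ['a', 'c']
```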


5. Delete your index (optional; skip this step if you plan to run the chat cells below, because they query the index)

In [None]:
# Delete the vector store; set index_name to the index you want to remove
index_name='pipelinetest-4'
delete_index_with_same_name(index_name)

6. Generate an answer without chat history (single-turn Q&A over the retrieved documents)
- https://github.com/atef-ataya/LangChain-Tutorial/blob/master/Building%20QA%20application%20using%20OpenAI%2C%20Pinecone%2C%20and%20LangChain.ipynb

In [None]:
# Q&A Chat Function 
def generate_answer_from_vector_store(vector_store, question):
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model='gpt-4-turbo', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    answer = chain.invoke(question)
    
    return answer
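
With `chain_type="stuff"`, the retrieved chunks are placed together into a single prompt alongside the question. The sketch below mimics that idea with plain strings. `build_stuffed_prompt` and the prompt wording are hypothetical and are not LangChain's actual template.

```python
def build_stuffed_prompt(question, retrieved_texts):
    """Join all retrieved texts into one context block and append the
    question (hypothetical illustration of the "stuff" strategy)."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuffed_prompt(
    "What is Trasaterra?",
    ["Trasaterra is a design agency.", "It was founded in 2010."],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the value of `k` and the chunk size together determine how much text the model receives per question.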

7. Set up a chain for having a conversation based on retrieved documents.
- https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain
- https://github.com/atef-ataya/LangChain-Tutorial/blob/master/Building%20QA%20application%20using%20OpenAI%2C%20Pinecone%2C%20and%20LangChain.ipynb

In [6]:
# Creating Conversation Logic and ChatHistory
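def conduct_conversation_with_context(vector_store, question, chat_history=None):
    # Use a fresh list per call; a mutable default would be shared across calls
    if chat_history is None:
        chat_history = []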
    from langchain.chains import ConversationalRetrievalChain
    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':3})
    
    crc = ConversationalRetrievalChain.from_llm(llm, retriever)
    
    result = crc.invoke({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))
    
    return result, chat_history
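
A note on the `chat_history` parameter: Python evaluates default argument values once, when the function is defined, so a list used as a default is shared by every call that omits the argument. The sketch below, with hypothetical function names, shows the difference between a list default and a `None` default.

```python
def append_shared(item, history=[]):
    # The default list is created once and reused by every call that omits `history`
    history.append(item)
    return history


def append_fresh(item, history=None):
    # A None default lets each call start from its own new list
    if history is None:
        history = []
    history.append(item)
    return history


append_shared("q1")
print(append_shared("q2"))  # ['q1', 'q2']: the earlier call's item is still there

append_fresh("q1")
print(append_fresh("q2"))   # ['q2']
```

For a chat function, the shared list would let one conversation's history leak into the next whenever the caller does not pass its own list.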

8. Enable LangSmith tracing so that chain runs are recorded in your LangSmith project
- https://docs.smith.langchain.com/tracing

In [None]:
# LangSmith Trace

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
# LANGCHAIN_API_KEY must already be set in your environment
if 'LANGCHAIN_API_KEY' not in os.environ:
    raise KeyError('LANGCHAIN_API_KEY is not set')
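
Several cells in this notebook depend on credentials read from environment variables. A minimal sketch of a pre-flight check follows. `missing_env_vars` is a hypothetical helper, and the list of names assumes you use Pinecone, OpenAI, and LangSmith as the notebook does (`OPENAI_API_KEY` is the variable the OpenAI client libraries read by default).

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env`
    (defaults to os.environ; hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Assumed set of variables for this notebook's services
required = ["PINECONE_API_KEY", "OPENAI_API_KEY", "LANGCHAIN_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this before the first API call surfaces a missing key as one clear message rather than as a failure partway through loading.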

9. Chat with your database to test

In [7]:
chat_history = []
question = "What is Trasaterra?"
result, chat_history = conduct_conversation_with_context(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

Trasaterra is a New York City-based design studio that specializes in branding, comprehensive design services, web design, and development. They focus on crafting transformative creative solutions for clients in various industries such as media, art, finance, non-profit, and technology by leveraging authenticity in branding and design expertise. Trasaterra provides services like logos, naming, brand books, identity systems, infographics, iconography, presentation design, social media graphics, motion graphics, website design, website development, user experience, e-commerce, email campaigns, site maintenance, and SEO.
[('What is Trasaterra?', 'Trasaterra is a New York City-based design studio that specializes in branding, comprehensive design services, web design, and development. They focus on crafting transformative creative solutions for clients in various industries such as media, art, finance, non-profit, and technology by leveraging authenticity in branding and design expertise. 

In [8]:
chat_history = []
question = "Is Trasaterra a marketing company?"
result, chat_history = conduct_conversation_with_context(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

No, Trasaterra is not a marketing company. It is a design agency that specializes in branding, comprehensive design services, web design, and development. Content creation can also be included as a component of its core offerings when necessary.
[('Is Trasaterra a marketing company?', 'No, Trasaterra is not a marketing company. It is a design agency that specializes in branding, comprehensive design services, web design, and development. Content creation can also be included as a component of its core offerings when necessary.')]


