# ```.csv``` loader and splitter

1. Create a ```/csvs``` directory, then place all ```.csv``` files inside.
2. Run the app; all ```.csv``` files within the directory will be loaded into your vector store.
3. Use the helper function to delete the index.
4. Use the chat functions to test.
5. Trace using LangSmith.
---

### Directory Structure

```
├── pinecone-langchain-csv-dataloader.ipynb
├── csvs
│   ├── client_file_01.csv
│   ├── client_file_02.csv
│   ├── client_file_03.csv
```
---

1. ```check_and_load_csv_from_dir```: checks that the directory exists and contains only ```.csv``` files, loads the documents with the ```DirectoryLoader```, then splits them with LangChain's ```RecursiveCharacterTextSplitter```.
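
The directory check described above can be tried in isolation. The following is a minimal sketch that uses only the standard library; `find_non_csv_files` and the temporary file names are hypothetical and not part of the notebook. It lists every offending file instead of stopping at the first one, which may help when a stray file blocks loading.

```python
from pathlib import Path
import tempfile


def find_non_csv_files(directory: str) -> list[str]:
    """Return the names of entries in `directory` that do not end in .csv."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"{directory} is not a directory")
    # Compare suffixes case-insensitively so DATA.CSV is accepted too
    return sorted(p.name for p in root.iterdir() if p.suffix.lower() != ".csv")


# Demonstration on a throwaway directory (file names are made up)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "client_file_01.csv").write_text("name,pay\nA,1\n")
    Path(tmp, "notes.txt").write_text("not a csv")
    offenders = find_non_csv_files(tmp)
    print(offenders)
```

An empty result means every entry in the directory is a ```.csv``` file.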

In [None]:
! pip install --upgrade --quiet unstructured
! pip install langchain langchain-community langchain-openai langchain-pinecone
! pip install tiktoken
! pip install pinecone-client

https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.directory.DirectoryLoader.html

In [None]:
# CSV - check path and load .csv files

import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader

def check_and_load_csv_from_dir(directory):
    # Ensure the directory path exists
    if not os.path.exists(directory):
        print("Directory does not exist.")
        return False

    # Check if directory is actually a directory
    if not os.path.isdir(directory):
        print("The specified path is not a directory.")
        return False

    # List all files in the directory
    all_files = os.listdir(directory)
    print(f'Files within the {directory} directory: {all_files}')

    # Check that each file ends with .csv
    for file in all_files:
        if not file.endswith(".csv"):
            print(f"Non-CSV file found: {file}")
            return False

    # Load every .csv file under the directory (recursively)
    loader = DirectoryLoader(directory, glob="**/*.csv")
    documents = loader.load()

    # Split the loaded documents into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,  # Maximum size of each chunk
        chunk_overlap=100,  # Number of overlapping characters between chunks
    )

    chunks = text_splitter.split_documents(documents)

    return chunks

chunks = check_and_load_csv_from_dir('csvs')

chunks
# # Call Chunks 
# print(len(chunks))
# print(chunks[0].page_content)
# print(chunks[1].page_content)
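
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as newlines); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so text cut at a chunk boundary still appears intact in one of the two neighbouring chunks.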

In [None]:
# Embedding cost calculation
# How much it costs to embed the chunks
def calculate_and_display_embedding_cost(texts):
    import tiktoken
    # Token counts are based on the gpt-3.5-turbo encoding
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    total_tokens = sum(len(encoding.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 USD per 1,000 tokens is the rate hard-coded here; check current pricing
    print(f'Embedding Cost in USD:{total_tokens / 1000 * 0.0004:.6f}')

calculate_and_display_embedding_cost(chunks)

Total Tokens: 99487
Embedding Cost in USD:0.039795
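
The figure above is plain arithmetic: tokens divided by 1,000, times the per-1K rate. The sketch below reproduces it with a hypothetical helper, `embedding_cost_usd`, so the rate can be swapped out; 0.0004 is the value hard-coded in the notebook and is an assumption that may not match current pricing.

```python
def embedding_cost_usd(total_tokens: int, usd_per_1k_tokens: float) -> float:
    """Cost of embedding `total_tokens` at a given price per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens


# 99487 tokens is the count reported in the output above
cost = embedding_cost_usd(99487, 0.0004)
print(f"{cost:.6f}")  # 0.039795
```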


3. Create and delete Pinecone index functions
- ```delete_index_with_same_name```: deletes the Pinecone index of the same name, if one exists.
- ```load_or_create_embeddings_index```: if the index already exists, it just loads the data. If the index is new, it creates the index and then loads the data.

In [None]:
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
import time

pc = Pinecone(
    api_key=os.environ.get("PINECONE_API_KEY") or 'PINECONE_API_KEY'
)

def delete_index_with_same_name(index_name): 
    
    # Delete the index if an index of the same name is present
    if index_name in pc.list_indexes().names():
        print(f'Deleting the {index_name} vector database')
        pc.delete_index(index_name)


def load_or_create_embeddings_index(index_name, chunks, namespace):
    
    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings...', end='')
        vector_store = PineconeVectorStore.from_documents(
        documents=chunks, embedding=OpenAIEmbeddings(), index_name=index_name, namespace=namespace)
        
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)
        
        print('Done')
    else:
        print(f'Creating index {index_name} and embeddings ...', end = '')
        pc.create_index(name=index_name, dimension=1536, metric='cosine',  spec=ServerlessSpec(
                cloud='aws',
                region='us-west-2'
            ))
        
        while not pc.describe_index(index_name).status['ready']:
            time.sleep(1)
        # Add to vectorDB using LangChain 
        vector_store = PineconeVectorStore.from_documents(
        documents=chunks, embedding=OpenAIEmbeddings(), index_name=index_name, namespace=namespace)
        print('Done')   
    return vector_store
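
The `while not ... ['ready']` loops above poll once per second with no upper bound, so a stuck index would block the notebook indefinitely. One possible variation, sketched here with only the standard library, adds a timeout. `wait_until_ready` and `fake_ready` are hypothetical names; the readiness check is passed in as a callable so the sketch runs without a Pinecone account.

```python
import time
from typing import Callable


def wait_until_ready(is_ready: Callable[[], bool],
                     timeout_s: float = 60.0,
                     poll_s: float = 1.0) -> None:
    """Poll `is_ready` until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in for an index that reports ready on the third check
state = {"calls": 0}

def fake_ready() -> bool:
    state["calls"] += 1
    return state["calls"] >= 3

wait_until_ready(fake_ready, timeout_s=5, poll_s=0.01)
print(state["calls"])  # 3
```

In the notebook, the callable could be `lambda: pc.describe_index(index_name).status['ready']`, which is the same check the loops above already perform.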

2. ```calculate_and_display_embedding_cost``` (defined above): calculates the embedding cost using tiktoken
- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

4. Create your index and load all data. 

In [None]:
index_name='csv-rag-test-2'
# chunks was produced by the loader cell above
namespace = "csv_documents"

vector_store = load_or_create_embeddings_index(index_name=index_name, chunks=chunks, namespace=namespace)

Index csv-rag-test-2 already exists. Loading embeddings...Done


5. Delete your index

In [None]:
# Delete vector store (run only when you are done testing; the chat cells below need the index)
index_name='csv-rag-test-2'
delete_index_with_same_name(index_name)

6. Generate an answer without conversation history (single-turn Q&A)
- https://github.com/atef-ataya/LangChain-Tutorial/blob/master/Building%20QA%20application%20using%20OpenAI%2C%20Pinecone%2C%20and%20LangChain.ipynb

In [None]:
# Q&A Chat Function 
def generate_answer_from_vector_store(vector_store, question):
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model='gpt-4-turbo', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    answer = chain.invoke(question)
    
    return answer
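
The retriever above asks for the `k=3` most similar chunks, and the index is created with `metric='cosine'`. As a rough illustration of what that ranking means, the sketch below scores toy 2-dimensional vectors in pure Python. The vectors, the row names, and the `top_k` helper are all made up for the example; real embeddings here have 1536 dimensions and the search runs inside Pinecone.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Names of the `k` stored vectors most similar to `query`, best first."""
    ranked = sorted(vectors.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


toy_vectors = {
    "row_a": [1.0, 0.0],
    "row_b": [0.7, 0.7],
    "row_c": [0.0, 1.0],
    "row_d": [-1.0, 0.0],
}
nearest = top_k([1.0, 0.1], toy_vectors, k=3)
print(nearest)  # ['row_a', 'row_b', 'row_c']
```

Only the top `k` chunks are passed to the model, so a fact stored in a chunk ranked fourth or lower will not be available to the answer.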

7. Set up a chain for having a conversation based on retrieved documents.
- https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain
- https://github.com/atef-ataya/LangChain-Tutorial/blob/master/Building%20QA%20application%20using%20OpenAI%2C%20Pinecone%2C%20and%20LangChain.ipynb

In [None]:
# Creating Conversation Logic and ChatHistory
def conduct_conversation_with_context(vector_store, question, chat_history=None):
    from langchain.chains import ConversationalRetrievalChain
    from langchain_openai import ChatOpenAI

    # Start a fresh history when none is passed; a mutable default ([]) would be shared across calls
    if chat_history is None:
        chat_history = []

    llm = ChatOpenAI(temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k':3})

    crc = ConversationalRetrievalChain.from_llm(llm, retriever)

    # The chain uses the question plus all earlier (question, answer) pairs
    result = crc.invoke({'question': question, 'chat_history': chat_history})
    chat_history.append((question, result['answer']))

    return result, chat_history
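
The chat history used above is a list of `(question, answer)` tuples that the caller passes back in on each turn. The sketch below shows that flow with a stand-in answer function instead of a real chain, so it runs without API keys. `ask` and `echo_answer` are hypothetical names, not LangChain APIs.

```python
from typing import Callable, Optional

History = list[tuple[str, str]]


def ask(answer_fn: Callable[[str, History], str],
        question: str,
        chat_history: Optional[History] = None) -> tuple[str, History]:
    """Answer `question`, then record the (question, answer) pair in the history."""
    history = [] if chat_history is None else chat_history
    answer = answer_fn(question, history)
    history.append((question, answer))
    return answer, history


def echo_answer(question: str, history: History) -> str:
    # Stand-in for the chain: reports how many turns it has seen so far
    return f"answer #{len(history) + 1} to: {question}"


_, history = ask(echo_answer, "first question")
_, history = ask(echo_answer, "follow-up", history)  # continues the same conversation
_, fresh = ask(echo_answer, "unrelated")             # no history passed: starts over
print(len(history), len(fresh))  # 2 1
```

Passing the returned history into the next call is what lets a follow-up question refer back to earlier answers. Starting each cell with an empty list gives independent single-turn questions.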

8. Enable LangSmith tracing so that chain runs are recorded for inspection
- https://docs.smith.langchain.com/tracing

In [None]:
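# LangSmith Trace

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
# LANGCHAIN_API_KEY must already be set in the environment; fail early if it is missing
if not os.environ.get('LANGCHAIN_API_KEY'):
    raise RuntimeError('Set LANGCHAIN_API_KEY before enabling LangSmith tracing.')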
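
The notebook relies on several API keys being present in the environment. A small pre-flight check, sketched below, can report all missing keys at once. `missing_env_vars` is a hypothetical helper. The list of names is an assumption: `PINECONE_API_KEY` and `LANGCHAIN_API_KEY` appear in the notebook, and `OPENAI_API_KEY` is the variable the OpenAI client classes read by default.

```python
import os
from typing import Mapping, Optional


def missing_env_vars(names: list[str],
                     env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


required = ["OPENAI_API_KEY", "PINECONE_API_KEY", "LANGCHAIN_API_KEY"]

# Demonstration with a fake environment instead of the real one
demo_env = {"PINECONE_API_KEY": "pc-demo", "LANGCHAIN_API_KEY": ""}
print(missing_env_vars(required, demo_env))
# ['OPENAI_API_KEY', 'LANGCHAIN_API_KEY']
```

Calling `missing_env_vars(required)` with no second argument checks the real environment.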

9. Chat with your database to test

In [None]:
chat_history = []
question = "What employee is the Fire Suppression Captain?"
result, chat_history = conduct_conversation_with_context(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

The employees who hold the position of Fire Suppression Captain in the provided data are MICHAEL CASTAGNOLA, MICHAEL DELANE, and GREG WYRSCH.
[('What employee is the Fire Suppression Captain?', 'The employees who hold the position of Fire Suppression Captain in the provided data are MICHAEL CASTAGNOLA, MICHAEL DELANE, and GREG WYRSCH.')]


In [None]:
chat_history = []
question = "How much money does Jeffrey Covits make?"
result, chat_history = conduct_conversation_with_context(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

Jeffrey Covitz makes a total of $160,383.74, according to the provided information for the year 2011.
[('How much money does Jeffrey Covits make?', 'Jeffrey Covitz makes a total of $160,383.74, according to the provided information for the year 2011.')]


In [None]:
chat_history = []
question = "How much more money does Jeffrey Covits make compared to Jonathan Huggins?"
result, chat_history = conduct_conversation_with_context(vector_store, question, chat_history)
print(result['answer'])
print(chat_history)

To calculate how much more money Jeffrey Covitz makes compared to Jonathan Huggins, we can look at their TotalPay:

- Jeffrey Covitz's TotalPay: $160,383.74
- Jonathan Huggins's TotalPay: $162,557.43

Now we subtract Jonathan Huggins' TotalPay from Jeffrey Covitz's TotalPay:

$160,383.74 - $162,557.43 = -$2,173.69

Therefore, Jeffrey Covitz makes $2,173.69 less than Jonathan Huggins.
[('How much more money does Jeffrey Covits make compared to Jonathan Huggins?', "To calculate how much more money Jeffrey Covitz makes compared to Jonathan Huggins, we can look at their TotalPay:\n\n- Jeffrey Covitz's TotalPay: $160,383.74\n- Jonathan Huggins's TotalPay: $162,557.43\n\nNow we subtract Jonathan Huggins' TotalPay from Jeffrey Covitz's TotalPay:\n\n$160,383.74 - $162,557.43 = -$2,173.69\n\nTherefore, Jeffrey Covitz makes $2,173.69 less than Jonathan Huggins.")]
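
Because language models can slip on arithmetic, it may be worth re-checking a computed answer like the one above. The sketch below repeats the subtraction with `decimal.Decimal`, using the two TotalPay figures quoted in the model's answer. It only verifies the arithmetic, not whether those figures match the CSV.

```python
from decimal import Decimal

# TotalPay values as quoted in the answer above
covitz_total_pay = Decimal("160383.74")
huggins_total_pay = Decimal("162557.43")

difference = covitz_total_pay - huggins_total_pay
print(difference)  # -2173.69
```

The negative result agrees with the answer: the first employee's total is $2,173.69 lower than the second's.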
