This repository has two major parts: adding PDF files to a Pinecone DB, and a chatbot that answers questions from that data.
Following are the steps for adding PDF files to the Pinecone DB.
• Read all the PDF files in the given folder.
• Remove any special characters, such as “\xa0”, that appear when the PDF text is read.
• Divide the PDF text into chunks of 500 tokens.
• Pass the chunks through an open-source embedding model.
• Get the vector embeddings of all the chunks.
• Add the vector embeddings to the Pinecone DB.
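The cleaning and chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `clean_text` and `chunk_text` are hypothetical helper names, whitespace-separated words stand in for tokens, and collapsing repeated whitespace is an extra assumption. A real run would count tokens with the embedding model's own tokenizer.

```python
import re


def clean_text(raw: str) -> str:
    """Replace non-breaking spaces (\\xa0) and collapse repeated whitespace."""
    text = raw.replace("\xa0", " ")
    # Assumption: extra whitespace left over from PDF extraction is not meaningful.
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Assumption: words approximate tokens; swap in a real tokenizer if needed.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
```

Each chunk returned by `chunk_text` would then be passed to the embedding model and uploaded to Pinecone together with its text.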
Following are the steps for the chatbot.
• Retrieve the vector embeddings from the Pinecone DB.
• Add a simple prompt for the Claude LLM.
• Take input from the user.
• Pass the user input to the embedding model.
• Query the Pinecone DB with the embedding vector of the user input.
• Get the top 3 text chunks from the Pinecone DB that are most relevant to the user's question.
• Pass the user's question and the top 3 relevant chunks to the Claude LLM.
• Return the output to the user.
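To make the retrieval and prompt steps above concrete, here is a small in-memory sketch. It is not the Pinecone or Anthropic API: in the real chatbot Pinecone performs the similarity search and the prompt is sent to Claude. The names `cosine_similarity`, `top_k`, and `build_prompt`, the use of cosine similarity, and the prompt wording are all assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float],
          records: list[tuple[list[float], str]],
          k: int = 3) -> list[str]:
    """Return the texts of the k records most similar to the query vector."""
    ranked = sorted(
        records,
        key=lambda rec: cosine_similarity(query_vec, rec[0]),
        reverse=True,
    )
    return [text for _, text in ranked[:k]]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Combine the user's question with the retrieved chunks (hypothetical wording)."""
    context_block = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )
```

With `records` holding `(embedding, text)` pairs, `build_prompt(question, top_k(query_vec, records))` produces the text that would be passed to the LLM.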
- Python 3.10
- pypdf
- langchain
- langchain-community
- huggingface-hub
- pinecone-client
- anthropic
Following are the key variables that need to be set at the start of the notebook.
- pdf_folder: Folder containing the new PDF files.
- pine_cone_api: Pinecone API key.
- database_index: Name of the database index.
- embedding_model_name: Name of the model used to create the embedding vectors.
- from_scratch: If false, the new files are appended to the previously uploaded data; otherwise, the previous data is removed from Pinecone before the new files are uploaded.
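As an illustration, the first notebook cell might look like the sketch below. Every value is a placeholder assumption: the folder name, the index name, the `PINECONE_API_KEY` environment variable, and the embedding model are examples only, not values taken from this repository.

```python
import os

# Hypothetical placeholder values; replace each with your own.
pdf_folder = "data/pdfs"                                   # assumed folder name
pine_cone_api = os.environ.get("PINECONE_API_KEY", "")     # assumed env variable
database_index = "pdf-chatbot"                             # assumed index name
# One example of an open-source embedding model; any compatible model can be used.
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
from_scratch = False                                       # False: append to existing data
```

Reading the API key from an environment variable keeps it out of the notebook itself.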
Following are the key variables that need to be set at the start of the code.
- pine_cone_api = Pinecone API key
- database_index = Name of the database index
- claude_api = Claude API key
- embedding_model_name = Name of the model used to create the embedding vectors