# **How to ingest data into Pinecone using LangChain**

- This notebook shows how to ingest data from a PDF document into a Pinecone database.
- It uses LangChain to simplify splitting the document into chunks and ingesting them.

### **1. Import dependencies**

In [21]:
import os
import pinecone
import openai
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.chains.question_answering import load_qa_chain

### **2. Get environment variables**

In [22]:
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
PINECONE_API_ENV = os.getenv('PINECONE_ENVIRONMENT')
PINECONE_INDEX_NAME = 'qa-app'
OPENAI_API_KEY = os.getenv('AZURE_OPENAI_APIKEY')
OPENAI_API_BASE = os.getenv('AZURE_OPENAI_BASE_URI')
OPENAI_API_TYPE = 'azure'
OPENAI_API_VERSION = '2023-03-15-preview'
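
`os.getenv` returns `None` when a variable is unset, so a missing key tends to surface only later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of LangChain, OpenAI, or Pinecone):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

For example, `PINECONE_API_KEY = require_env('PINECONE_API_KEY')` would stop the notebook at this cell instead of at the first API call.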

### **3. Load your data file and split it into chunks**

In [23]:
loader = UnstructuredPDFLoader("./docs/NET-Microservices-Architecture-for-Containerized-NET-Applications.pdf")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(data)

print(f'You have a total of {len(chunks)} chunks')

detectron2 is not installed. Cannot use the hi_res partitioning strategy. Falling back to partitioning with the fast strategy.


You have a total of 917 chunks
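
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks before falling back to smaller pieces); the function name `split_fixed` is hypothetical.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text,
        # so no chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4))
print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With `chunk_overlap=0`, as in the cell above, no text is shared between neighbouring chunks, so a sentence that straddles a boundary is cut in two.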


### **4. Init Azure OpenAI and Pinecone clients**

In [8]:
# chunk_size=1 sends one text per embedding request.
embeddings = OpenAIEmbeddings(openai_api_base=OPENAI_API_BASE, openai_api_key=OPENAI_API_KEY, openai_api_type=OPENAI_API_TYPE, chunk_size=1)

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_API_ENV  
)

### **5. Create and store embeddings into Pinecone**

In [24]:
docsearch = Pinecone.from_texts([t.page_content for t in chunks], embeddings, index_name=PINECONE_INDEX_NAME)
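
Uploading several hundred vectors is usually done in batches rather than in one large request. The sketch below shows the batching idea in plain Python; `batched` is a hypothetical helper, and the batch size of 32 is an arbitrary value chosen for illustration, not a setting read from this notebook.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most batch_size elements.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for the page_content of the 917 chunks produced earlier.
texts = [f"chunk {i}" for i in range(917)]
batches = list(batched(texts, batch_size=32))
print(f"{len(batches)} batches, last one holds {len(batches[-1])} texts")
```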

### **6. Ask a question**

In [25]:
llm = AzureChatOpenAI(temperature=0, openai_api_base=OPENAI_API_BASE, openai_api_key=OPENAI_API_KEY, openai_api_version=OPENAI_API_VERSION, deployment_name='gpt-4')
chain = load_qa_chain(llm, chain_type="stuff")

query = "Enumerate 3 strategies to handle partial failures?"
docs = docsearch.similarity_search(query, include_metadata=True)

result = chain.run(input_documents=docs, question=query)

print(f"Answer: \n {result}")

Answer: 
 1. Use asynchronous communication: Implement message-based communication across internal microservices to minimize ripple effects and enforce a higher level of microservice autonomy.

2. Implement retries with exponential backoff: In case of transient faults, use the "Retry pattern" to retry the operation with increasing time intervals between attempts, allowing the system to recover from temporary issues.

3. Use the Circuit Breaker pattern: Track the number of failed requests, and if the error rate exceeds a configured limit, trip a "circuit breaker" to prevent further attempts. After a timeout period, try again and close the circuit breaker if new requests are successful.
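
The `stuff` chain type places the text of every retrieved document into a single prompt alongside the question. A rough sketch of that idea in plain Python, assuming a made-up prompt template and a hypothetical `build_stuff_prompt` helper (LangChain's actual template is worded differently):

```python
from typing import List


def build_stuff_prompt(documents: List[str], question: str) -> str:
    """Concatenate all documents into one context block and append the question.

    Hypothetical helper with an illustrative template.
    """
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


example_prompt = build_stuff_prompt(
    ["Retries with exponential backoff handle transient faults.",
     "A circuit breaker stops calls to a failing service."],
    "Enumerate 3 strategies to handle partial failures?",
)
print(example_prompt)
```

Because everything goes into one prompt, the retrieved chunks together with the question have to fit within the model's context window.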
