In [17]:
import os
from dotenv import load_dotenv

# Load all environment variables from .env file
load_dotenv()

## LLM
openai_api_key = os.getenv('OPENAI_API_KEY')

## Pinecone Vector Database
pinecone_api_key = os.getenv('PINECONE_API_KEY')
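
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` only surfaces later as an authentication error. The helper below is a minimal sketch of a fail-fast check; `require_env` is a hypothetical name and is not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With this in place, the assignments above could be written as `openai_api_key = require_env("OPENAI_API_KEY")`.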

In [2]:
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=pinecone_api_key)



In [18]:
import time

index_name = "rag-hyde-index" # change if desired

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
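
The readiness loop above polls indefinitely, so it never returns if the index never becomes ready. The sketch below shows one way to bound the wait; `wait_until` and its parameters are hypothetical names, and the default timeout is an arbitrary assumption.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool],
               timeout_s: float = 120.0,
               poll_s: float = 1.0) -> None:
    """Poll `is_ready` until it returns True, or raise TimeoutError after `timeout_s` seconds.

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout_s} seconds")
        time.sleep(poll_s)
```

In this notebook the call would look like `wait_until(lambda: pc.describe_index(index_name).status["ready"])`.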

In [19]:
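# Load the PDF, split it into chunks, and index the chunks in Pinecone
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
# Note: this import shadows the `Pinecone` client class imported earlier.
# The client `pc` is already constructed, so the cells still run in order.
from langchain_community.vectorstores import Pinecone
from pprint import pprint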

#### INDEXING ####

# Load Document (Uploading one file at a time)
pdf_file_path = "./data/langchain_turing.pdf"
loader = PyPDFLoader(pdf_file_path)

docs = loader.load()

# Load multiple PDF files from a directory
# pdf_file_paths = <enter your path here>
# loader = PyPDFDirectoryLoader(pdf_file_paths)

# docs_dir = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=2000, 
    chunk_overlap=500)

# Make splits
splits = text_splitter.split_documents(docs)

# Index
vectorstore = Pinecone.from_documents(
    documents=splits, 
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"), 
    index_name=index_name
)
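
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below applies a fixed sliding window to a list of tokens. It is a simplification, not the splitter's actual algorithm: `RecursiveCharacterTextSplitter` also splits recursively on separators and measures length with a tiktoken encoding, whereas `sliding_chunks` (a hypothetical helper) works on any pre-tokenized list.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens.

    Simplified illustration only; not the RecursiveCharacterTextSplitter algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Ten toy tokens, windows of 4 with an overlap of 1
demo_chunks = sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Under this simplified model, the settings used above (`chunk_size=2000`, `chunk_overlap=500`) would advance each window by 1500 tokens.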


In [20]:
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 5, "score_threshold": 0.5},
)
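
The two values in `search_kwargs` act as successive filters: candidates scoring below `score_threshold` are dropped, and at most `k` of the rest are returned. The sketch below mimics that behavior with plain cosine similarity. It is an assumption-laden illustration: LangChain works with normalized relevance scores whose exact definition depends on the vector store, and `top_k_above_threshold` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_above_threshold(query_vec, doc_vecs, k, score_threshold):
    """Return up to `k` (doc_id, score) pairs with score >= threshold, best first.

    Hypothetical helper; a real vector store computes and normalizes scores itself.
    """
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    kept = [pair for pair in scored if pair[1] >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy 2-D vectors: "a" matches the query exactly, "c" is orthogonal, "d" is opposite
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
hits = top_k_above_threshold([1.0, 0.0], toy_vectors, k=5, score_threshold=0.5)
print([doc_id for doc_id, _ in hits])
```

A query can therefore return fewer than `k` documents, or none at all, when few candidates clear the threshold.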

# Hypothetical Document Embeddings (HyDE)

![RAG HyDE](./images/rag_hyde.png)

In [21]:
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def generate_docs_for_retrieval(question):
    # HyDE document generation
    template = """Please write a scientific paper passage to answer the question
    Question: {question}
    Passage:"""
    prompt_hyde = ChatPromptTemplate.from_template(template)

    llm = ChatOpenAI(model = "gpt-4o-mini", temperature=1)

    docs_for_retrieval = llm.invoke(prompt_hyde.format_prompt(question = question))

    return docs_for_retrieval.content
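
The key idea in HyDE is that the search query is the embedding of a generated passage, not of the question itself. The toy sketch below illustrates that flow offline. Everything in it is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `fake_llm` returns a canned passage in place of `generate_docs_for_retrieval`, and the corpus is invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(question):
    """Hypothetical stand-in for the LLM call; ignores the question and returns a canned passage."""
    return "access control uses granular permissions and sandboxing to isolate processes"


def hyde_retrieve(question, corpus, generate=fake_llm):
    """Retrieve the corpus entry closest to the *hypothetical* answer, not to the question."""
    hypothetical_doc = generate(question)   # step 1: write a plausible answer
    query_vec = embed(hypothetical_doc)     # step 2: embed the answer
    return max(corpus, key=lambda doc: cosine(query_vec, embed(doc)))  # step 3: nearest document


toy_corpus = [
    "the text splitter divides documents into overlapping chunks",
    "granular permissions and sandboxing isolate processes from sensitive data",
]
best_match = hyde_retrieve("how is security handled", toy_corpus)
print(best_match)
```

The question shares no words with either toy document, while the hypothetical passage overlaps heavily with the relevant one. That gap is what the generated passage is meant to bridge.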



In [22]:
question = "How does LangChain ensure security when integrating external services like vector databases and API providers in LLM applications?"
docs_for_retrieval = generate_docs_for_retrieval(question)
docs_for_retrieval

"**Title: Security Mechanisms in LangChain for Integrating External Services in LLM Applications**\n\n**Abstract:** The integration of external services such as vector databases and API providers is crucial for enhancing the functionality of Large Language Model (LLM) applications. However, this integration poses significant security challenges that must be addressed to ensure data integrity, confidentiality, and overall system robustness. In this paper, we explore the security measures implemented by LangChain to safeguard LLM applications when interfacing with external services.\n\n**Introduction:** As LLM applications increasingly rely on external service integration to operate efficiently and effectively, the necessity for robust security protocols becomes paramount. LangChain addresses potential vulnerabilities associated with connecting to vector databases and API providers through a multifaceted security approach that encompasses authentication, data encryption, access control, 

In [23]:
from IPython.display import Markdown
Markdown(docs_for_retrieval)

**Title: Security Mechanisms in LangChain for Integrating External Services in LLM Applications**

**Abstract:** The integration of external services such as vector databases and API providers is crucial for enhancing the functionality of Large Language Model (LLM) applications. However, this integration poses significant security challenges that must be addressed to ensure data integrity, confidentiality, and overall system robustness. In this paper, we explore the security measures implemented by LangChain to safeguard LLM applications when interfacing with external services.

**Introduction:** As LLM applications increasingly rely on external service integration to operate efficiently and effectively, the necessity for robust security protocols becomes paramount. LangChain addresses potential vulnerabilities associated with connecting to vector databases and API providers through a multifaceted security approach that encompasses authentication, data encryption, access control, and efficient error handling.

**Security Mechanisms:**

1. **Authentication and Authorization:** LangChain employs OAuth 2.0 and API keys to secure access to external services. By requiring token-based authentication, LangChain ensures that only authorized applications and users can request data from or send data to vector databases and APIs. This authentication process is crucial for mitigating the risk of unauthorized access.

2. **Data Encryption:** To protect sensitive data during transmission and storage, LangChain implements Transport Layer Security (TLS) protocols. All data exchanged between the LLM application and external services is encrypted, significantly reducing the risk of data interception and man-in-the-middle attacks. Additionally, LangChain encourages the use of encryption for sensitive information stored in vector databases.

3. **Access Control Policies:** LangChain incorporates role-based access control (RBAC) mechanisms, enabling developers to define granular permissions for different components of the application. By limiting access based on user roles, LangChain minimizes the likelihood of malicious actors exploiting vulnerabilities in the system.

4. **Audit Logging and Monitoring:** Continuous monitoring and logging of interactions with external services are integrated into the LangChain framework. This feature allows for real-time analysis of API calls and database queries, enabling the detection of any anomalous behaviors or security breaches. The comprehensive audit logs facilitate forensic investigations and help improve the overall security posture.

5. **Error Handling and Rate Limiting:** LangChain employs robust error handling mechanisms to manage exceptions that may arise during interaction with external services. This includes implementing rate limiting to prevent abuse from excessive requests, which can lead to service denial or degradation. By managing both successes and errors systematically, LangChain enhances stability and security.

**Conclusion:** The secure integration of external services in LLM applications is vital for protecting sensitive data and maintaining trust in automated systems. LangChain's comprehensive approach, which includes robust authentication, encryption, access control, monitoring, and error handling, ensures that these integrations are carried out securely, thus mitigating potential risks. Future work will focus on enhancing these mechanisms in response to evolving security threats, ensuring that LangChain remains at the forefront of secure LLM application development.

**Keywords:** LangChain, security, LLM applications, external services, vector databases, API integration, authentication, encryption, access control, monitoring.

In [24]:
# Chain HyDE generation into the retriever: question -> hypothetical passage -> similar chunks
retrieval_chain = generate_docs_for_retrieval | retriever
retrieved_docs = retrieval_chain.invoke(question)
retrieved_docs

[Document(metadata={'author': '', 'creationdate': '2024-11-06T10:08:55+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2024-11-06T10:08:55+00:00', 'page': 12.0, 'page_label': '13', 'producer': 'pdfTeX-1.40.26', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.26 (TeX Live 2024) kpathsea version 6.4.0', 'source': './data/langchain_turing.pdf', 'subject': '', 'title': '', 'total_pages': 14.0, 'trapped': '/False'}, page_content='LangChain 13\nLangChain’s security model addresses many of these concerns, yet challenges\npersist, particularly in sectors with rigorous compliance standards, such as fi-\nnance and healthcare. Key areas for ongoing improvement include:\n– DynamicPermissionAdjustment :CurrentpermissionsettingsinLangChain\nare defined at deployment, but in dynamic applications, permissions may\nneed to adapt based on user interactions. Implementing adaptive permis-\nsions responsive to application state or user roles could enhance security.\n–

In [25]:
def generate_response(question):
    """Answer a question using documents retrieved via the HyDE retrieval chain."""
    # Retrieve context using the hypothetical document as the search query
    retrieved_docs = retrieval_chain.invoke(question)

    # RAG prompt: answer the question from the retrieved context
    template = """Answer the following question based on this context:

    {context}

    Question: {question}
    """

    prompt = ChatPromptTemplate.from_template(template)

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=1)

    response = llm.invoke(prompt.format_prompt(context=retrieved_docs, question=question))

    return response.content
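
`generate_response` passes the retrieved `Document` objects straight into the prompt, so their metadata is rendered alongside the text. One optional refinement is to join only the page text before formatting the prompt. The sketch below shows that step in isolation; `Doc` is a hypothetical stand-in for LangChain's `Document`, and `format_docs` is an illustrative helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join the text of each document, leaving metadata out of the prompt context."""
    return separator.join(doc.page_content.strip() for doc in docs)


demo_context = format_docs([
    Doc("LangChain enforces least privilege. ", {"page": 12}),
    Doc("Sandboxing isolates processes.", {"page": 13}),
])
print(demo_context)
```

Because the helper reads only the `page_content` attribute, it would also work on the objects returned by the retriever, for example `context=format_docs(docs)` in the prompt call.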


In [26]:
hyde_response = generate_response(question)


In [27]:
from IPython.display import Markdown
Markdown(hyde_response)

LangChain addresses security concerns when integrating external services, such as vector databases and API providers, through several key measures:

1. **Granular Permissions**: LangChain enforces the principle of least privilege, which allows developers to specify limited permissions. This minimizes the risk of unauthorized actions and ensures that applications only have access to necessary resources.

2. **Sandboxing**: The framework utilizes sandboxed environments to protect sensitive data. By isolating application processes, sandboxing reduces the exposure of vulnerabilities that may arise from external integrations.

3. **Defense in Depth**: LangChain employs layered security measures to create multiple barriers against threats. This approach enhances the overall security posture of applications by making it more difficult for potential breaches to succeed.

4. **Auditability and Monitoring**: LangChain includes tools like LangSmith, which provide detailed logging and monitoring capabilities. This allows developers to track application usage, detect anomalies in real-time, and respond proactively to any potential security incidents.

5. **Management of External Risks**: Recognizing the risks associated with reliance on third-party services, LangChain encourages thorough vetting of external providers and continuous monitoring of their security practices. This is essential to mitigate risks such as data exposure and third-party dependency vulnerabilities.

By implementing these strategies, LangChain aims to secure LLM applications that rely on external services, thus addressing critical security concerns while enhancing functionality.