### What is LangChain?
LangChain is an open-source framework for building applications powered by large language models (LLMs). Its core components are Models (LLM integration), Prompt Templates (structured prompts), Memory (context retention), Indexes & Retrievers (document retrieval), Agents (dynamic decision-making), and Chains (workflow automation). For retrieval-augmented generation (RAG), it covers document ingestion, retrieval, context-grounded responses, and state management on top of embeddings and vector databases, which makes it easier to connect an LLM to external data sources. 🚀

This notebook installs the following LangChain packages:

1. **`langchain-community`** – Contains community-maintained integrations and tools for working with various LLM providers, databases, and APIs.  
2. **`langchain-experimental`** – Includes experimental and early-stage features for advanced LangChain applications, such as novel retrieval methods and agent capabilities.  
3. **`langchain-groq`** – Provides integration with **Groq's LLMs**, enabling fast and efficient model inference.  
4. **`langchain-huggingface`** – Facilitates the use of **Hugging Face models** (transformers, embeddings, and pipelines) within LangChain applications. 🚀

##Imports

In [None]:
!pip install --upgrade --quiet langchain langchain-community langchain-experimental langchain-groq langchain-huggingface
!pip install --upgrade --quiet  sentence-transformers
!pip install --upgrade --quiet transformers
!pip install --upgrade --quiet neo4j tiktoken yfiles_jupyter_graphs
!pip install --upgrade --quiet pypdf

##Uploading the PDF

In [None]:
# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from google.colab import files
import os

# Upload the PDF file using Google Colab's file upload utility
uploaded = files.upload()

# Get the file path
pdf_path = list(uploaded.keys())[0]

# Load the PDF with LangChain's PyPDFLoader (one Document per page)
loader = PyPDFLoader(pdf_path)
documents = loader.load()


Saving BTSA-FTE Offer Letter.pdf to BTSA-FTE Offer Letter.pdf


In [None]:
type(documents)

list

In [None]:
documents[0]

Document(metadata={'producer': 'PDFKit.NET 12.3.320.0 DMV10', 'creator': 'PyPDF', 'creationdate': '2025-05-21T01:17:06-07:00', 'moddate': '2025-05-21T01:17:06-07:00', 'source': 'BTSA-FTE Offer Letter.pdf', 'total_pages': 8, 'page': 0, 'page_label': '1'}, page_content="May 21, 2025\n \nCONFIDENTIAL\n \nSubhadip De\nAdhikari Ghosh Road, Hatthuba, PO :- Habra\nDistrict - North 24 Parganas\nHabra, West Bengal 743263\nDear Subhadip:\nWe are pleased to extend you an offer to join ZS Associates India Private Ltd. (‘ZS’) as a\nBusiness Technology Solutions Associate , to be based in our Bengaluru office with a start \ndate of June 2, 2025 . We hope that you give this opportunity with ZS serious \nconsideration.\n \nZS has a special culture of collaboration and innovation and intensity. We produce work of \noutstanding quality and are proud of the client-first approach we bring to every \nengagement. ZSers bring passion to make an impact, commitment to continuous learning, \nself-improvement an

In [None]:
len(documents)

8

##Setting up the Environment for Developing

### Environment in a Development Project
In a development project, an **environment** is a configured setup in which software is developed, tested, and deployed. **Environment variables** keep sensitive values such as API keys out of the source code. In **Google Colab**, store these values as **secrets**, read them with `userdata.get("API_KEY")`, and copy them into `os.environ` so that libraries can find them. Avoid assigning a key literally (e.g., `os.environ["API_KEY"] = "your_key"`), because that hardcodes the secret into the notebook. This keeps credentials secure and makes configuration easy to change across environments. 🚀

In [None]:
import os
from google.colab import userdata

os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["NEO4J_URI"] = userdata.get('NEO4J_URI')
os.environ["NEO4J_USERNAME"] = userdata.get('NEO4J_USERNAME')
os.environ["NEO4J_PASSWORD"] = userdata.get('NEO4J_PASSWORD')
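
A quick sanity check after this cell can catch a missing secret before any API call fails. The sketch below is not part of the original notebook: `missing_env_vars` and `REQUIRED` are hypothetical names, and the list of variables simply mirrors the ones set above.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Hypothetical list mirroring the variables this notebook sets
REQUIRED = ["GROQ_API_KEY", "HF_TOKEN", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing an explicit mapping as `environ` makes the helper easy to try out without touching the real environment.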

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

from typing import Tuple, List, Optional

from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser

from langchain_core.runnables import ConfigurableField

from yfiles_jupyter_graphs import GraphWidget
from neo4j import GraphDatabase

from langchain_community.vectorstores import Neo4jVector
from langchain_community.graphs import Neo4jGraph

from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
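try:
  # Enable third-party Jupyter widgets (needed by yfiles_jupyter_graphs) when running in Colab
  from google.colab import output
  output.enable_custom_widget_manager()
except ImportError:
  # Not running in Colab; nothing to enable
  pass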

In [None]:
from langchain_core.runnables import (
    RunnableBranch,
    RunnableLambda,
    RunnableParallel,
    RunnablePassthrough,
)

##Extracting Text from Wikipedia Pages
--Using WikipediaLoader from LangChain

In [None]:
# from langchain.document_loaders import WikipediaLoader
# raw_documents = WikipediaLoader(query="The Merchant of Venice").load()




##Constants

In [None]:
chunk_size = 512
chunk_overlap = 24

model_name = "deepseek-r1-distill-llama-70b"
embedding_model = "sentence-transformers/all-mpnet-base-v2"
temperature = 0.3
tokens_per_minute = 900

##Text Splitting using Recursive Character Text Splitter

In [None]:
# # For Wikipedia
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# documents = text_splitter.split_documents(raw_documents[:4])

In [None]:
# For PDF (Custom Upload)
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
document_chunks = text_splitter.split_documents(documents)
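
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch. It is an assumption-laden illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph and line breaks, while the hypothetical `sliding_window_chunks` below cuts at fixed character offsets. Both measure length in characters by default.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the same idea as the 24-character overlap configured above: text near a boundary appears in both neighboring chunks.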

In [None]:
document_chunks

[Document(metadata={'producer': 'PDFKit.NET 12.3.320.0 DMV10', 'creator': 'PyPDF', 'creationdate': '2025-05-21T01:17:06-07:00', 'moddate': '2025-05-21T01:17:06-07:00', 'source': 'BTSA-FTE Offer Letter.pdf', 'total_pages': 8, 'page': 0, 'page_label': '1'}, page_content='May 21, 2025\n \nCONFIDENTIAL\n \nSubhadip De\nAdhikari Ghosh Road, Hatthuba, PO :- Habra\nDistrict - North 24 Parganas\nHabra, West Bengal 743263\nDear Subhadip:\nWe are pleased to extend you an offer to join ZS Associates India Private Ltd. (‘ZS’) as a\nBusiness Technology Solutions Associate , to be based in our Bengaluru office with a start \ndate of June 2, 2025 . We hope that you give this opportunity with ZS serious \nconsideration.'),
 Document(metadata={'producer': 'PDFKit.NET 12.3.320.0 DMV10', 'creator': 'PyPDF', 'creationdate': '2025-05-21T01:17:06-07:00', 'moddate': '2025-05-21T01:17:06-07:00', 'source': 'BTSA-FTE Offer Letter.pdf', 'total_pages': 8, 'page': 0, 'page_label': '1'}, page_content='consideration

##Initializing a Large Language Model (LLM) and Graph Transformer instance

In [None]:
from langchain_groq import ChatGroq

llm = ChatGroq(
    model_name=model_name,
    temperature=temperature,
    max_tokens=None,
    groq_api_key=os.environ["GROQ_API_KEY"],
    timeout=60,
)

In [None]:
# Import the LLMGraphTransformer for converting text into a structured graph
from langchain_experimental.graph_transformers import LLMGraphTransformer

# Initialize the Graph Transformer with a Large Language Model (LLM)
llm_transformer = LLMGraphTransformer(llm=llm)

# Convert a list of textual documents into a structured graph representation
graph_documents = llm_transformer.convert_to_graph_documents(document_chunks)

# The output 'graph_documents' contains entities (nodes) and their relationships (edges),
# which can be used for knowledge graph construction, search, and reasoning.
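
The structure of a graph document can be pictured as a list of (source, relationship, target) triples. The sketch below uses plain dataclasses as stand-ins: `Node`, `Relationship`, and `to_triples` here are hypothetical and only mirror the `id` and `type` fields visible in the printed output, not the library's own classes. The sample values are taken from that output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    type: str

@dataclass(frozen=True)
class Relationship:
    source: Node
    target: Node
    type: str

def to_triples(relationships):
    """Flatten relationships into (source id, relationship type, target id) tuples."""
    return [(rel.source.id, rel.type, rel.target.id) for rel in relationships]

org = Node("Zs Associates India Private Ltd.", "Organization")
role = Node("Business Technology Solutions Associate", "Role")
city = Node("Bengaluru", "Location")

relationships = [
    Relationship(org, role, "OFFERS"),
    Relationship(role, city, "BASED_IN"),
]

for source, rel_type, target in to_triples(relationships):
    print(f"{source} - {rel_type} -> {target}")
```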


In [None]:
graph_documents

[GraphDocument(nodes=[Node(id='Subhadip De', type='Person', properties={}), Node(id='Zs Associates India Private Ltd.', type='Organization', properties={}), Node(id='Business Technology Solutions Associate', type='Role', properties={}), Node(id='Bengaluru', type='Location', properties={}), Node(id='June 2, 2025', type='Date', properties={})], relationships=[Relationship(source=Node(id='Zs Associates India Private Ltd.', type='Organization', properties={}), target=Node(id='Subhadip De', type='Person', properties={}), type='OFFERED_TO', properties={}), Relationship(source=Node(id='Zs Associates India Private Ltd.', type='Organization', properties={}), target=Node(id='Business Technology Solutions Associate', type='Role', properties={}), type='OFFERS', properties={}), Relationship(source=Node(id='Business Technology Solutions Associate', type='Role', properties={}), target=Node(id='Bengaluru', type='Location', properties={}), type='BASED_IN', properties={}), Relationship(source=Node(id='B

In [None]:
# Initializing Neo4j Instance
graph = Neo4jGraph()


In [None]:
# Write the extracted graph to the Neo4j instance
graph.add_graph_documents(
    graph_documents,
    baseEntityLabel=True,  # Adds a shared __Entity__ label to every node, alongside labels like Person or Organization
    include_source=True  # Stores each source chunk as a Document node linked to its entities via MENTIONS, for traceability
)

In [None]:
# Default Cypher query for the visualization: up to 50 relationships, excluding MENTIONS links to source documents
default_cypher = "MATCH (s)-[r:!MENTIONS]->(t) RETURN s,r,t LIMIT 50"

In [None]:
from yfiles_jupyter_graphs import GraphWidget
from neo4j import GraphDatabase

In [None]:
# Visualizing the graph through GraphWidget
def showGraph(cypher: str = default_cypher):
    # Open a Neo4j driver and session; both are closed when the with-block exits
    with GraphDatabase.driver(
        uri=os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
    ) as driver, driver.session() as session:
        # Run the query and load the resulting nodes and relationships into the widget
        widget = GraphWidget(graph=session.run(cypher).graph())
    # Label each node with its 'id' property
    widget.node_label_mapping = 'id'
    display(widget)
    return widget

In [None]:
showGraph()

GraphWidget(layout=Layout(height='800px', width='100%'))


##Creating Embeddings

In [None]:
# Load the sentence-embedding model from Hugging Face (runs on CPU)
embeddings = HuggingFaceEmbeddings(
    model_name=embedding_model,
    model_kwargs={'device': 'cpu'},
)
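
Embeddings are useful because texts with similar meaning map to vectors that point in similar directions. The sketch below illustrates that idea with cosine similarity on made-up three-dimensional vectors; the vectors, the labels, and the helper names `cosine_similarity` and `top_k` are assumptions for illustration (real vectors from this model have hundreds of dimensions), and this is not how the vector store is implemented internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors standing in for real embeddings
toy_index = [
    ("broadband allowance", [0.9, 0.1, 0.0]),
    ("relocation assistance", [0.1, 0.9, 0.0]),
    ("start date", [0.0, 0.1, 0.9]),
]

print([text for text, _ in top_k([1.0, 0.0, 0.0], toy_index, k=1)])
# ['broadband allowance']
```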

In [None]:
from langchain_community.vectorstores import Neo4jVector

# Use the embeddings with Neo4jVector
vector_index = Neo4jVector.from_existing_graph(
    embeddings,
    search_type="hybrid",
    node_label="Document",
    text_node_properties=["text"],
    embedding_node_property="embedding"
)

In [None]:
graph.query("CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

[]

##Extracting Entities (Nodes) from the Input Text

In [None]:
from pydantic import BaseModel, Field
# Extract entities from text
class Entities(BaseModel):
    """Identifying information about entities."""

    names: List[str] = Field(
        ...,
        description="All the person, organization, or business entities that "
        "appear in the text",
    )

In [None]:
# Creating prompt templates with LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.prompts.prompt import PromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are extracting organization and person entities from the text.",
        ),
        (
            "human",
            "Use the given format to extract information from the following "
            "input: {question}",
        ),
    ]
)

In [None]:
entity_chain = prompt | llm.with_structured_output(Entities)

In [None]:
entity_chain.invoke({"question": "Will ZS offer any Broadband Allowances?"}).names

['ZS', 'Broadband Allowances']

##Graph Retrieval from the Question

In [None]:
# Builds a Neo4j full-text (Lucene) query: strips special characters, allows up to
# two character edits per word (~2), and requires every word to match (AND).
from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars

def generate_full_text_query(text: str) -> str:
    words = [el for el in remove_lucene_chars(text).split() if el]
    # Returns an empty string when no words remain, instead of raising IndexError
    return " AND ".join(f"{word}~2" for word in words)
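
To see what kind of query string this produces without a Neo4j or LangChain install, the sketch below re-creates the idea with the standard library only. `strip_lucene_chars` is a hypothetical, simplified stand-in for `remove_lucene_chars` (its exact character list is an assumption), and `fuzzy_and_query` is an illustrative name.

```python
import re

# Assumption: a simplified set of Lucene special characters to blank out
LUCENE_SPECIAL = re.compile(r'[+\-&|!(){}\[\]^"~*?:\\/]')

def strip_lucene_chars(text: str) -> str:
    """Replace Lucene special characters with spaces (simplified stand-in)."""
    return LUCENE_SPECIAL.sub(" ", text)

def fuzzy_and_query(text: str, max_edits: int = 2) -> str:
    """Join the cleaned words with AND, appending a fuzzy-match suffix to each."""
    words = strip_lucene_chars(text).split()
    return " AND ".join(f"{word}~{max_edits}" for word in words)

print(fuzzy_and_query("ZS Associates"))
# ZS~2 AND Associates~2
print(fuzzy_and_query("Broadband Allowances?"))
# Broadband~2 AND Allowances~2
```

The `~2` suffix asks Lucene to accept terms within two character edits of the word, which helps when the extracted entity name differs slightly from the node `id` stored in the graph.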

In [None]:
# Look up each extracted entity in the full-text index and collect its graph neighborhood
def structured_retriever(question: str) -> str:
    lines = []
    entities = entity_chain.invoke({"question": question})
    for entity in entities.names:
        response = graph.query(
            """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
            YIELD node,score
            CALL {
              WITH node
              MATCH (node)-[r:!MENTIONS]->(neighbor)
              RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
              UNION ALL
              WITH node
              MATCH (node)<-[r:!MENTIONS]-(neighbor)
              RETURN neighbor.id + ' - ' + type(r) + ' -> ' +  node.id AS output
            }
            RETURN output LIMIT 50
            """,
            {"query": generate_full_text_query(entity)},
        )
        lines.extend(el['output'] for el in response)
    # Join once at the end so results for different entities stay on separate lines
    return "\n".join(lines)

In [None]:
print(structured_retriever("Will ZS offer any Broadband Allowances?"))

Zs - OFFERS -> Group Insurance Plan
Zs - OFFERS -> Preventive Healthcare Coverage
Zs - OFFERS -> Accident Insurance
Zs - OFFERS -> Business Travel Insurance
Zs - OFFERS -> Life Insurance
Zs - EMPHASIZES -> Collaboration
Zs - EMPHASIZES -> Innovation
Zs - EMPHASIZES -> Intensity
Zs - EMPHASIZES -> Quality
Zs - FOLLOWS -> Client-First Approach
Zs - FOLLOWS -> Monthly Payment Schedule
Zs - PROVIDES -> Annual Gross Salary
Zs - PROVIDES -> Starting Bonus
Zs - PROVIDES -> Performance Bonus
Zs - PROVIDES -> Emerging Leader Reward Program (Elrp)
Zs - PROVIDES -> Elrp
Zs - PROVIDES -> Annual Leave
Zs - PROVIDES -> Holidays
Zs - PROVIDES -> Sick Time
Zs - PROVIDES -> Transportation Allowance
Zs - PROVIDES -> Meal Allowance
Zs - PROVIDES -> Relocation Assistance
Zs - PROVIDES -> Guest_House
Zs - PROVIDES -> Relocation_Allowance
Zs - EMPLOYS -> Associates
Zs - EMPLOYS -> Employee
Zs - EMPLOYS -> Employment
Zs - CONDUCTS -> Salary Review
Zs - PAYS -> Starting Bonus
Zs - REQUIRES -> High_Speed_Broad

##Combining results from a structured retriever and a vector-based similarity search


In [None]:
# Retrieves structured (graph) and unstructured (vector search) context for the input question.

def retriever(question: str) -> str:
    print(f"Search query: {question}")
    structured_data = structured_retriever(question)
    unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
    # Separate the retrieved chunks with a "#Document " marker
    unstructured_text = "#Document ".join(unstructured_data)
    final_data = f"""Structured data:
{structured_data}
Unstructured data:
{unstructured_text}
"""
    return final_data

In [None]:
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

In [None]:
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)
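
To preview what the condensing prompt looks like once filled in, the template can be rendered with plain `str.format`. This is only a sketch: `condense_template` is a copy of the template above, `format_history` is a hypothetical helper, and the `Human:`/`AI:` prefixes are an illustrative assumption rather than the exact string LangChain produces from message objects. The sample conversation is the one from the commented-out cell at the end of this notebook.

```python
condense_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question,
in its original language.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

def format_history(chat_history):
    """Render (human, ai) pairs as alternating labeled lines."""
    return "\n".join(f"Human: {human}\nAI: {ai}" for human, ai in chat_history)

history = [("Which house did Elizabeth I belong to?", "House Of Tudor")]
rendered = condense_template.format(
    chat_history=format_history(history),
    question="When was she born?",
)
print(rendered)
```

The LLM is expected to answer this prompt with a question that no longer depends on the history, so that the retriever can search for it directly.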

In [None]:
def _format_chat_history(chat_history: List[Tuple[str, str]]) -> List:
    buffer = []
    for human, ai in chat_history:
        buffer.append(HumanMessage(content=human))
        buffer.append(AIMessage(content=ai))
    return buffer

In [None]:
_search_query = RunnableBranch(
    # If input includes chat_history, we condense it with the follow-up question
    (
        RunnableLambda(lambda x: bool(x.get("chat_history"))).with_config(
            run_name="HasChatHistoryCheck"
        ),  # Condense follow-up question and chat into a standalone_question
        RunnablePassthrough.assign(
            chat_history=lambda x: _format_chat_history(x["chat_history"])
        )
        | CONDENSE_QUESTION_PROMPT
        | llm
        | StrOutputParser(),
    ),
    # Else, we have no chat history, so just pass through the question
    RunnableLambda(lambda x : x["question"]),
)
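
The branching logic above can be read as an ordinary if/else. The plain-Python sketch below mirrors it under stated assumptions: `build_search_query` is a hypothetical name, and `fake_condense` is a stand-in for the prompt-plus-LLM step, returning a fixed pattern instead of calling a model.

```python
from typing import Callable, List, Tuple

def build_search_query(
    inputs: dict,
    condense: Callable[[List[Tuple[str, str]], str], str],
) -> str:
    """Condense the follow-up with the chat history if there is any; otherwise pass the question through."""
    history = inputs.get("chat_history")
    if history:
        return condense(history, inputs["question"])
    return inputs["question"]

def fake_condense(history, question):
    # Hypothetical stand-in for the LLM: append the last human turn as context
    last_human, _ = history[-1]
    return f"{question} (context: {last_human})"

print(build_search_query({"question": "Will ZS offer any Broadband Allowances?"}, fake_condense))
print(build_search_query(
    {
        "question": "When was she born?",
        "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
    },
    fake_condense,
))
```

An empty or missing `chat_history` takes the pass-through path, which matches the `bool(x.get("chat_history"))` check in the `RunnableBranch`.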

In [None]:
template = """Answer the question based only on the following context:
{context}

Question: {question}

Provide a concise answer (2-3 sentences), following these guidelines:
        1. Identify the main topic or metric discussed.
        2. Mention any relevant time periods or comparisons.
        3. Include any key figures or percentages that provide important context.
        4. Do not use phrases like "This chunk discusses" or "This section provides". Instead, state the answer directly.

Answer:"""


In [None]:
prompt = ChatPromptTemplate.from_template(template)

In [None]:
# Builds the answer chain: derive a search query, retrieve context for it, fill the prompt, call the LLM, and parse the reply into a string.
chain = (
    RunnableParallel(
        {
            "context": _search_query | retriever,
            # Pass only the question text to the prompt, not the whole input dict
            "question": RunnableLambda(lambda x: x["question"]),
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
chain.invoke({"question": "Will ZS offer any Broadband Allowances?"})

Search query: Will ZS offer any Broadband Allowances?




"<think>\nOkay, I need to figure out if ZS offers any Broadband Allowances. Let me go through the provided context step by step.\n\nFirst, looking at the structured data, I see that Zs Associates India Private Ltd. offers a position called Business Technology Solutions Associate to Subhadip De. The unstructured data has several documents. \n\nIn the first document, it mentions that ZS will provide a broadband allowance of INR ₹1,500 per month through payroll. They also reimburse a one-time installation charge of INR ₹500. They expect the employee to have a high-speed connection, at least 2.0 MBPS, for remote work. Additionally, ZS can audit the usage of this allowance randomly.\n\nThe second document is the salary breakup, which includes the Broadband Allowance as a separate component amounting to INR 18,000 annually. This is calculated as INR 1,500 per month. \n\nThe other documents talk about the offer being contingent on background verification and relocation assistance, but those a

In [None]:
# chain.invoke(
#     {
#         "question": "When was she born?",
#         "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
#     }
# )
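
The answer returned by the chain above starts with a `<think>` block, because this model emits its reasoning before the final answer. If only the answer is wanted, a small post-processing step can drop that block. The sketch below is an optional addition, not part of the original pipeline: `strip_reasoning` is a hypothetical helper and the sample string is invented for illustration.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Remove complete <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()

# Invented sample output, for illustration only
raw = "<think>\nCheck the context for allowances.\n</think>\n\nFinal answer goes here."
print(strip_reasoning(raw))
# Final answer goes here.
```

Text without a closing `</think>` tag is left untouched, so a truncated response is not silently emptied.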