# **`Legal AI Chatbot`**


This project implements a **Graph-based Retrieval-Augmented Generation (RAG) system** for legal documents. The core idea is to **combine a knowledge graph (Neo4j) with a language model (LLM)** to answer user questions accurately and with context.


## Core Components (Graph-RAG)


## 1. Document Ingestion and Chunking
- Legal PDF documents are loaded using `PyPDFLoader`.
- Long documents are split into smaller chunks using `RecursiveCharacterTextSplitter` to optimize retrieval.
- Each chunk is stored with metadata (`id`, `source`, etc.) for traceability.

**Purpose:** Breaks down large legal documents into manageable, semantically meaningful pieces for retrieval.
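
As a rough illustration of chunking with overlap, the sketch below splits text into fixed-size character windows and tags each one with a unique `id`. It is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph and line boundaries, and the helper name `chunk_text` is hypothetical.

```python
import uuid


def chunk_text(text: str, chunk_size: int = 800, chunk_overlap: int = 100) -> list[dict]:
    """Split text into overlapping character windows, each tagged with a unique id.

    Hypothetical helper for illustration only; not the LangChain splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"id": str(uuid.uuid4()), "start": start, "text": piece})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "A" * 1000 + "B" * 1000
chunks = chunk_text(sample)
print(len(chunks), [c["start"] for c in chunks])
```

With a 2,000-character input, a window of 800, and an overlap of 100, consecutive windows start 700 characters apart, so the last 100 characters of one chunk reappear at the start of the next.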

---

## 2. Embedding and Vector Store
- Text chunks are converted into **dense vector embeddings** using `HuggingFaceEmbeddings`.
- Embeddings are stored in **Neo4j** as a **vector store**, enabling semantic search.

**Purpose:** Allows the system to find relevant chunks based on semantic similarity rather than keyword matching.
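
To make "semantic similarity" concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The three-dimensional vectors and the chunk names are made-up toy values, not real embeddings; in the notebook, Neo4j's vector index performs this ranking.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy vectors (assumed values for illustration only)
toy_vectors = {
    "liability_clause": [0.9, 0.1, 0.0],
    "notice_clause": [0.1, 0.9, 0.0],
    "payment_clause": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(top_k(query, toy_vectors, k=2))
```

The query vector points mostly along the first axis, so the chunk whose vector does the same ranks first, regardless of whether the two texts share any keywords.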

---

## 3. Neo4j Knowledge Graph
- `Neo4jVector` and `Neo4jGraph` connect to a Neo4j instance.
- Each chunk is a **node** in the graph; relationships can be defined between related chunks.
- Queries can retrieve relevant chunks using **keywords or semantic similarity**.

**Purpose:** Provides **contextual knowledge retrieval** from the document corpus, forming the “retrieval” part of RAG.

---

## 4. Large Language Model (LLM)
- The LLM is loaded via HuggingFace (`AutoModelForCausalLM` and `pipeline`).
- The LLM **generates answers** by conditioning on the retrieved chunks.
- Prompts are structured with `PromptTemplate` to include:
  - Knowledge graph context
  - Document context
  - User question

**Purpose:** This is the “generation” part of RAG — synthesizing human-like answers based on retrieved evidence.

---

## 5. Retrieval-Augmented Generation (RAG) Pipeline
- Steps:
  1. **User submits a question.**
  2. **Relevant chunks are retrieved** from Neo4j using semantic search.
  3. **LLM generates an answer** using the retrieved context.
- Optionally, `RetrievalQAWithSourcesChain` can provide answers along with source citations.

**Purpose:** Combines structured retrieval with powerful generation to produce accurate, context-aware responses.
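
The three steps can be sketched as a single function that takes a retriever and a generator as plain callables. Everything here is a stand-in: `fake_retrieve` and `fake_generate` are hypothetical stubs that replace the Neo4j retriever and the LLM, so the example shows only the control flow.

```python
from typing import Callable, List


def answer_question(question: str,
                    retrieve: Callable[[str], List[str]],
                    generate: Callable[[str], str]) -> str:
    """Retrieve context for the question, build a prompt, and generate an answer."""
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    return generate(prompt)


seen_prompts = []  # records what the stub generator was asked


def fake_retrieve(question: str) -> List[str]:
    # Stub: a real retriever would run a semantic search here
    return ["placeholder clause text"]


def fake_generate(prompt: str) -> str:
    # Stub: a real LLM would complete the prompt here
    seen_prompts.append(prompt)
    return "stub answer"


result = answer_question("What is the liability cap?", fake_retrieve, fake_generate)
print(result)
```

Passing the retriever and generator in as arguments keeps the pipeline easy to test without a database or a model.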

---

## 6. Chat History
- Questions and answers are stored in a local database (`sqlite3`) for **persistent chat memory**.
- History can be retrieved and displayed to maintain conversation context.

**Purpose:** Allows users to **review previous interactions** and improves continuity in long chats.
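
A minimal sketch of persistent chat memory, assuming an in-memory SQLite database and fixed example timestamps so the result is reproducible. The names `demo_conn`, `demo_save_chat`, and `demo_recent_chats` are hypothetical and chosen to avoid clashing with the notebook's own functions.

```python
import sqlite3
import uuid

# In-memory database for the demo; the notebook uses a file on disk instead
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE chats (id TEXT PRIMARY KEY, question TEXT, answer TEXT, created_at TEXT)"
)


def demo_save_chat(question: str, answer: str, created_at: str) -> None:
    """Insert one question-answer pair with an ISO-8601 timestamp."""
    demo_conn.execute(
        "INSERT INTO chats (id, question, answer, created_at) VALUES (?,?,?,?)",
        (str(uuid.uuid4()), question, answer, created_at),
    )
    demo_conn.commit()


def demo_recent_chats(limit: int = 5):
    """Return the newest chats first."""
    return demo_conn.execute(
        "SELECT question, answer FROM chats ORDER BY created_at DESC LIMIT ?",
        (limit,),
    ).fetchall()


demo_save_chat("first question", "first answer", "2025-01-01T10:00:00")
demo_save_chat("second question", "second answer", "2025-01-01T10:05:00")
print(demo_recent_chats(limit=1))
```

Storing timestamps as ISO-8601 text works with `ORDER BY` because strings in that format sort chronologically, provided they all use the same format and time zone.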

---

**Key Takeaways:**
- Neo4j serves as the **retrieval backend** for semantic search.
- LLM serves as the **generative backend** for human-like answers.
- The RAG architecture is intended to keep answers **grounded in legal documents**.

---

##  Try the Chatbot

You can access the running Streamlit app here:  
[**Open Legal AI Chatbot**](https://26175c46c778.ngrok-free.app/)

---

##  References & Helpful Blogs

This notebook and implementation were inspired and guided by the following resources:

1. **RAG Tutorial: How to Build a RAG System on a Knowledge Graph**  
   [Read the blog](https://neo4j.com/blog/developer/rag-tutorial/)

2. **What is GraphRAG?**  
   [Read the blog](https://neo4j.com/blog/genai/what-is-graphrag/)

3. **LangChain Library Full Support for Neo4j Vector Index**  
   [Read the blog](https://neo4j.com/blog/developer/langchain-library-full-support-neo4j-vector-index/)

These resources helped in understanding **Neo4j Graph + LangChain RAG integration** and building this interactive legal chatbot.


## Install Required Packages

In [1]:
!pip install --quiet langchain langchain-community langchain_neo4j sentence-transformers pypdf transformers accelerate torch neo4j bitsandbytes tiktoken


## Imports and Device Setup

In [2]:
import os
import uuid
import textwrap
import sqlite3
from datetime import datetime
from typing import List

# LangChain (current package locations; the old `langchain.*` import paths are deprecated)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_neo4j import Neo4jGraph, Neo4jVector
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from langchain_core.documents import Document

# Transformers
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Neo4j
from neo4j import GraphDatabase

# Device
DEVICE = 0 if torch.cuda.is_available() else -1
print("Device:", "GPU" if DEVICE == 0 else "CPU")

Device: GPU


## Upload PDFs and load them as `Document` objects using `PyPDFLoader`.  
This prepares the documents for chunking and embedding.


In [3]:
from google.colab import files

uploaded = files.upload()
pdf_paths = list(uploaded.keys())
print("Uploaded PDFs:", pdf_paths)

documents = []
for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())
print(f"Loaded {len(documents)} documents")

Saving d (1).pdf to d (1).pdf
Uploaded PDFs: ['d (1).pdf']
Loaded 5 documents


## Split documents into smaller chunks for semantic retrieval.  
Each chunk is assigned a unique `id` and metadata for traceability.


In [4]:
splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
documents = splitter.split_documents(documents)

for i, doc in enumerate(documents):
    doc.metadata['id'] = str(uuid.uuid4())

print(f"Total chunks: {len(documents)}")
print("Sample chunk metadata:", documents[0].metadata)
print("Sample chunk content:", documents[-3].page_content)


Total chunks: 14
Sample chunk metadata: {'producer': 'www.ilovepdf.com', 'creator': 'Microsoft® Word 2016', 'creationdate': '2025-09-14T07:16:48+00:00', 'author': 'Ahmed', 'moddate': '2025-09-14T07:16:48+00:00', 'source': 'd (1).pdf', 'total_pages': 5, 'page': 0, 'page_label': '1', 'id': 'f22890b2-86da-4036-bc8e-927c5079f060'}
Sample chunk content: events beyond its reasonable control, including but not limited to natural disasters, acts 
.of government, labor disputes, cyberattacks, or pandemics (“Force Majeure Event”) 
 
ure Event shall notify the other Party as soon as The Party affected by a Force Maje 12.2
.reasonably possible and use reasonable efforts to resume performance 
 
 
Notices. 13 
All notices under this Agreement shall be in writing and delivered via email, registered 
listed above or to such other addresses as the Parties  mail, or courier to the addresses
:may designate in writing. Notices shall be deemed received


In [5]:
print(f"First 2 chunks:\n{documents[:2]}")

First 2 chunks:
[Document(metadata={'producer': 'www.ilovepdf.com', 'creator': 'Microsoft® Word 2016', 'creationdate': '2025-09-14T07:16:48+00:00', 'author': 'Ahmed', 'moddate': '2025-09-14T07:16:48+00:00', 'source': 'd (1).pdf', 'total_pages': 5, 'page': 0, 'page_label': '1', 'id': 'f22890b2-86da-4036-bc8e-927c5079f060'}, page_content='SERVICE AGREEMENT \n \n This Service Agreement (“Agreement”) is entered into on this 1st day of September 2025\n:(“Effective Date”) by and between \n \nProvider: Zenith Solutions Ltd., a company incorporated under the laws of the State of \nprincipal office at 245 Lexington Avenue, New York, NY 10016, New York, having its \n United States, represented herein by its Chief Executive Officer, Mr. Daniel H. Carter\n”);Service Provider(“ \n \nClient: Brightwave Technologies Inc., a company incorporated under the laws of \nare, having its principal office at 980 Market Street, Wilmington, DE 19801, United Delaw\n.States, represented herein by its Chief Operat

## Create embeddings for each text chunk  
Use `HuggingFaceEmbeddings` to convert text into vectors for semantic search.


In [6]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-mpnet-base-v2")


## Store embeddings in Neo4j Vector Store  
Create a Neo4j-backed vector store for semantic retrieval and set up a retriever to fetch the top 5 relevant chunks.


In [7]:
NEO4J_URI = "neo4j+s://2fa5080d.databases.neo4j.io"
NEO4J_USER = "2fa5080d"
# Assumption: the password is supplied through a NEO4J_PASSWORD environment variable instead of being hardcoded.
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]


neo4j_vector = Neo4jVector.from_documents(
    documents,
    embeddings,
    url=NEO4J_URI,
    username=NEO4J_USER,
    password=NEO4J_PASSWORD,
    database="2fa5080d"
)

retriever = neo4j_vector.as_retriever(search_kwargs={"k": 5})
print("Neo4j Vector Store ready")

Neo4j Vector Store ready


In [8]:
# `invoke` replaces the deprecated `get_relevant_documents`
sample_docs = retriever.invoke("The company prepared a detailed report outlining all procedures and compliance requirements clearly.")
print("Sample retrieved docs:")
for i, doc in enumerate(sample_docs):
    print(f"\n--- Doc {i+1} ---")
    print(doc)


Sample retrieved docs:

--- Doc 1 ---
page_content=':maintenance services, including but not limited to 
 
;Custom software development for Client’s internal applications 
;party software solutions-Integration of third 
;d troubleshootingTechnical support an 
;System upgrades and performance optimization 
.Regular reporting on project progress and milestones 
Service Provider shall perform the Services with the highest degree of  1.2
compliance with  professionalism, in accordance with industry standards, and in
.applicable laws 
 
Any additional services beyond the Scope of Services shall require a written  1.3
.amendment to this Agreement, executed by both Parties' metadata={'moddate': '2025-09-14T07:16:48+00:00', 'creationdate': '2025-09-14T07:16:48+00:00', 'creator': 'Microsoft® Word 2016', 'author': 'Ahmed', 'source': 'd (1).pdf', 'page': 0, 'name': 'Page 0', 'total_pages': 5, 'producer': 'www.ilovepdf.com', 'page_label': '1'}

--- Doc 2 ---
page_content=':maintenance services, in

## Neo4j Knowledge Graph Setup  (Structured Data)
Initialize a Neo4jGraph instance to manage nodes and relationships for document chunks, enabling structured queries and graph-based retrieval.


In [9]:
## Neo4j Graph Setup
neo4j_graph = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USER,
    password=NEO4J_PASSWORD,
    database="2fa5080d"
)
print("Neo4j Graph ready")


Neo4j Graph ready


## Update Node Properties in Neo4j

In [10]:
## Update Node Properties in Neo4j
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

with driver.session(database="2fa5080d") as session:
    for d in documents:
        cypher = """
        MERGE (c:Chunk {id: $id})
        SET c.page = $page,
            c.source = $source,
            c.text = $text,
            c.name = $name
        """
        session.run(cypher,
                    id=d.metadata["id"],
                    page=d.metadata.get("page", 0),
                    source=d.metadata.get("source", ""),
                    text=d.page_content,
                    name=f"Page {d.metadata.get('page', 0)}")
print("Node properties updated with name")

Node properties updated with name


## Create Relationships Between Nodes

In [11]:
## Create Relationships Between Nodes
with driver.session(database="2fa5080d") as session:
    cypher = """
    MATCH (c1:Chunk), (c2:Chunk)
    WHERE c1.source = c2.source AND c1.page + 1 = c2.page
    MERGE (c1)-[:NEXT]->(c2)
    """
    session.run(cypher)
print("Relationships created")

Relationships created
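
To see which nodes the `NEXT` pattern connects, the sketch below reproduces the same condition in plain Python over a toy list of chunks. The chunk ids and file names are hypothetical. Note that the condition compares page numbers, so every chunk on page *p* is linked to every chunk on page *p + 1* of the same source, rather than each chunk being linked only to the chunk that follows it.

```python
from itertools import product


def next_pairs(chunks):
    """Return (id1, id2) for every pair that satisfies the NEXT condition."""
    return [
        (c1["id"], c2["id"])
        for c1, c2 in product(chunks, repeat=2)
        if c1["source"] == c2["source"] and c1["page"] + 1 == c2["page"]
    ]


# Toy chunks (assumed values for illustration only)
toy_chunks = [
    {"id": "a", "source": "contract.pdf", "page": 0},
    {"id": "b", "source": "contract.pdf", "page": 0},
    {"id": "c", "source": "contract.pdf", "page": 1},
    {"id": "d", "source": "other.pdf", "page": 1},
]
print(next_pairs(toy_chunks))
```

Both page-0 chunks link to the page-1 chunk of the same file, and the chunk from the other file is left unconnected.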


## Knowledge Graph Query Function  
Define `dynamic_kg_query` to retrieve relevant chunks from the Neo4j knowledge graph based on keywords extracted from the user question.  
It returns a concatenated string of chunk texts to provide context for the LLM.


In [12]:
## Knowledge Graph Query Function
def dynamic_kg_query(question: str):
    try:
        keywords = [w for w in question.split() if len(w) > 3]
        if not keywords:
            keywords = ["legal"]
        with driver.session(database="2fa5080d") as session:
            cypher = """
            MATCH (n:Chunk)
            WHERE any(k IN $keywords WHERE toLower(n.name) CONTAINS toLower(k)
                       OR toLower(n.text) CONTAINS toLower(k))
            RETURN n.text AS text
            LIMIT 10
            """
            results = session.run(cypher, keywords=keywords)
            return "\n".join([r["text"] for r in results])
    except Exception as e:
        print("KG query failed:", e)
        return ""
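
The keyword filter used by `dynamic_kg_query` can be examined in isolation. The sketch below mirrors its logic in plain Python (it checks only the chunk text, not the `name` property, and the helper names are hypothetical). It shows that generic words such as "What" and "this" pass the length filter and that punctuation stays attached to the last word, which may help explain why loosely related chunks can match.

```python
def extract_keywords(question: str) -> list[str]:
    """Keep whitespace-separated tokens longer than 3 characters, as in the query function."""
    keywords = [w for w in question.split() if len(w) > 3]
    return keywords or ["legal"]  # same fallback as dynamic_kg_query


def matches(text: str, keywords: list[str]) -> bool:
    """Case-insensitive substring test, like `toLower(n.text) CONTAINS toLower(k)`."""
    lowered = text.lower()
    return any(k.lower() in lowered for k in keywords)


question = "What services is the Service Provider obligated to perform under this Agreement?"
keywords = extract_keywords(question)
print(keywords)
print(matches("Analysis of what lawyers look for", keywords))
print(matches("Notices shall be in writing", keywords))
```

The first sample text matches only because it contains the word "what", while the second matches none of the keywords.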

In [13]:
with driver.session(database="2fa5080d") as session:
    count_nodes = session.run("MATCH (n:Chunk) RETURN count(n) AS c").single()["c"]
    print("Total nodes in graph:", count_nodes)

Total nodes in graph: 100


## Load LLM (Large Language Model)  
Define `load_llm` to load a quantized 7B model (`wizardLM-7B-HF`) for text generation.  
If loading fails, it falls back to a smaller Flan-T5 model.  
The returned object is a `HuggingFacePipeline` ready for RAG-style generation.


In [14]:
def load_llm(model_name="TheBloke/wizardLM-7B-HF", max_length=512):
    try:
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            quantization_config=quant_config,
            trust_remote_code=True,
            low_cpu_mem_usage=True
        )
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        pipe = pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=max_length,
            do_sample=False
        )
        return HuggingFacePipeline(pipeline=pipe)
    except Exception as e:
        print("⚠️ Fallback to Flan-T5", e)
        pipe = pipeline(
            "text2text-generation",
            model="google/flan-t5-base",
            tokenizer="google/flan-t5-base",
            device=DEVICE,
            max_length=max_length,
            do_sample=False
        )
        return HuggingFacePipeline(pipeline=pipe)

llm = load_llm()



## Prompt Template for Legal AI  

The `professional_prompt` structures how the LLM answers questions:  

- Uses **knowledge graph context** (`kg_context`)  
- Uses **document context** (`doc_context`)  
- Includes the **user question** (`question`)  

The LLM is instructed to give **short, direct answers in a legal style**.


In [15]:
professional_prompt = PromptTemplate(
    template="""
You are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.

Context from Knowledge Graph:
{kg_context}

Context from Documents:
{doc_context}

Question: {question}

Answer (short, direct, sharp, legal style):
""",
    input_variables=["kg_context", "doc_context", "question"]
)


## Setup RetrievalQAWithSourcesChain

In [16]:
# qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
#     llm=llm,
#     chain_type="map_reduce",
#     retriever=retriever,
#     return_source_documents=True
# )

In [17]:
# chain_response = qa_chain.invoke({"question": "What are the applications of Artificial Intelligence in modern world?"})
# chain_response["answer"]

## Chat History Database

- **SQLite** database is used to persist all user interactions.
- Table `chats` stores:
  - `id` → unique identifier for each chat
  - `question` → user question
  - `answer` → AI-generated answer
  - `created_at` → timestamp of the chat

**Function `save_chat`**:
- Generates a unique ID
- Inserts the question-answer pair into the database
- Commits the change to persist it


In [18]:
DB = "chat_history.db"
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS chats (
    id TEXT PRIMARY KEY,
    question TEXT,
    answer TEXT,
    created_at TEXT
)
""")
conn.commit()

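from datetime import timezone  # needed for a timezone-aware UTC timestamp

def save_chat(question: str, answer: str):
    """Persist one question-answer pair with a UTC timestamp."""
    chat_id = str(uuid.uuid4())
    cur.execute(
        "INSERT INTO chats (id, question, answer, created_at) VALUES (?,?,?,?)",
        # datetime.utcnow() is deprecated; use an aware UTC datetime instead
        (chat_id, question, answer, datetime.now(timezone.utc).isoformat())
    )
    conn.commit()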

## Question-Answer Function (`ask_professional_question`)

This function handles **end-to-end retrieval and generation** for the legal chatbot:

1. **Vector Store Retrieval**:
   - Retrieves the top 5 chunks from the Neo4j vector store and keeps the first 3.
   - Prepares `doc_context` from the first 150 characters of each chunk.

2. **Knowledge Graph Retrieval**:
   - Queries the Neo4j knowledge graph for keyword matches.
   - Prepares `kg_context` from lines 2 to 15 of the concatenated results (the slice `[1:15]` skips the first line).

3. **Prompt Construction**:
   - Fills `professional_prompt` with:
     - Knowledge graph context
     - Document context
     - User question

4. **Answer Generation**:
   - Sends the prompt to the LLM.
   - Generates a concise, legal-style answer.

5. **Chat Persistence**:
   - Saves question and answer to SQLite database for history.

6. **Output**:
   - Prints the question and answer.
   - Returns the answer string.


In [19]:
def ask_professional_question(question: str):
    """Answer a question using vector-store and knowledge-graph context."""
    # Neo4j Vector DB retrieval: keep the first 3 retrieved chunks
    # (`invoke` replaces the deprecated `get_relevant_documents`)
    res = retriever.invoke(question)[:3]
    doc_context = "\n".join([d.page_content[:150] + "..." for d in res])

    # Knowledge Graph retrieval (the slice skips the first line of the result)
    kg_full = dynamic_kg_query(question)
    kg_context = "\n".join(kg_full.split("\n")[1:15]) if kg_full else ""

    # Build prompt
    prompt_text = professional_prompt.format(
        kg_context=kg_context,
        doc_context=doc_context,
        question=question
    )

    # Generate answer (`invoke` replaces the deprecated direct call `llm(...)`)
    answer = llm.invoke(prompt_text, max_new_tokens=150, do_sample=False)

    # Save chat (optional, without sources)
    save_chat(question, answer)

    # Print
    print("\n❓ Question:", question)
    print("\n🤖 Answer:\n", answer)

    return answer
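
In the example runs that follow, the returned answer begins with the full prompt text, which suggests the text-generation pipeline returns the prompt together with the completion. One possible remedy, sketched here as an assumption rather than as the notebook's method, is to strip the echoed prompt before saving or printing; the helper name `strip_prompt_echo` and the sample strings are hypothetical. The Hugging Face text-generation pipeline also accepts `return_full_text=False`, which may achieve the same effect at the source.

```python
def strip_prompt_echo(prompt: str, generated: str) -> str:
    """Remove the prompt from the start of the generated text, if it was echoed."""
    if generated.startswith(prompt):
        return generated[len(prompt):].strip()
    return generated.strip()


demo_prompt = "Question: placeholder question\n\nAnswer:"
echoed_output = demo_prompt + " placeholder answer"
print(strip_prompt_echo(demo_prompt, echoed_output))
```

If the model output does not start with the prompt, the text is returned unchanged apart from trimmed whitespace.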

In [20]:
dynamic_kg_query("What services is the Service Provider obligated to perform under this Agreement?").split("\n")[1:20]

['CONTRACT UNDERSTANDING ATTICUS DATASET ',
 ' ',
 'Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of more ',
 'than 13,000 labels in 510 commercial legal contracts that have been ',
 'manually labeled to identify 41 categories of important clauses that ',
 'lawyers look for when reviewing contracts in connection with corporate ',
 'transactions. ',
 ' ',
 'CUAD is curated and maintained by The Atticus Project, Inc. to support ',
 'NLP research and development in legal contract review. Analysis of CUAD ',
 'can be found at https://arxiv.org/abs/2103.06268. Code for replicating ',
 'the results and the trained model can be found at ',
 'https://github.com/TheAtticusProject/cuad. ',
 ' ',
 'FORMAT',
 'contract in the dataset and include the text context and human-input ',
 'answers corresponding to the categories. The human-input answers are ',
 'derived from the text context and are formatted to a unified form. ']

## Example Queries

In [21]:
ask_professional_question("What limitations of liability are included in the Agreement, and what types of liability are excluded from those limitations?")




❓ Question: What limitations of liability are included in the Agreement, and what types of liability are excluded from those limitations?

🤖 Answer:
 
You are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.

Context from Knowledge Graph:
answers corresponding to the categories. The human-input answers are 
derived from the text context and are formatted to a unified form. 
 
- 1 SQuAD-style JSON: this file is derived from the master clauses CSV 
to follow the same format as SQuAD 2.0 
(https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/), a 
question answering dataset whose answers are similarly spans of the 
input text. The exact format of the JSON format exactly mimics that of 
SQuAD 2.0 for compatibility with prior work. We also provide Python 
scripts for processing this data for further ease of use. 
 
- 28 Excels: a collection of Excel files containing clauses responsive
contracts that are considered important by experi



'\nYou are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.\n\nContext from Knowledge Graph:\nanswers corresponding to the categories. The human-input answers are \nderived from the text context and are formatted to a unified form. \n \n- 1 SQuAD-style JSON: this file is derived from the master clauses CSV \nto follow the same format as SQuAD 2.0 \n(https://rajpurkar.github.io/SQuAD-explorer/explore/v2.0/dev/), a \nquestion answering dataset whose answers are similarly spans of the \ninput text. The exact format of the JSON format exactly mimics that of \nSQuAD 2.0 for compatibility with prior work. We also provide Python \nscripts for processing this data for further ease of use. \n \n- 28 Excels: a collection of Excel files containing clauses responsive\ncontracts that are considered important by experienced attorneys in \ncontract review in connection with a corporate transaction. Such \n\nContext from Documents:\nAgreement and that 

In [22]:
ask_professional_question("What provisions in the Agreement protect the confidentiality of information shared between the Parties?")


❓ Question: What provisions in the Agreement protect the confidentiality of information shared between the Parties?

🤖 Answer:
 
You are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.

Context from Knowledge Graph:
contract review in connection with a corporate transaction. Such 
transactions include mergers & acquisitions, investments, initial 
public offering, etc. 
 
Each category supports a contract review task which is to extract from 
an underlying contract (1) text context (clause) and (2) human-input 
answers that correspond to each of the categories in these contracts. 
For example, in response to the “Governing Law” category, the clause 
states “This Agreement is accepted by Company in the State of Nevada 
and shall be governed by and construed in accordance with the laws 
thereof, which laws shall prevail in the event of any conflict.”. The 
answer derived from the text context is Nevada.
contract (string, date, or combina



'\nYou are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.\n\nContext from Knowledge Graph:\ncontract review in connection with a corporate transaction. Such \ntransactions include mergers & acquisitions, investments, initial \npublic offering, etc. \n \nEach category supports a contract review task which is to extract from \nan underlying contract (1) text context (clause) and (2) human-input \nanswers that correspond to each of the categories in these contracts. \nFor example, in response to the “Governing Law” category, the clause \nstates “This Agreement is accepted by Company in the State of Nevada \nand shall be governed by and construed in accordance with the laws \nthereof, which laws shall prevail in the event of any conflict.”. The \nanswer derived from the text context is Nevada.\ncontract (string, date, or combination thereof), we represent answers \nin consistent formats. For example, if the Agreement Date in a contract \n

In [23]:
ask_professional_question("What services is the Service Provider obligated to perform under this Agreement?")


❓ Question: What services is the Service Provider obligated to perform under this Agreement?

🤖 Answer:
 
You are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.

Context from Knowledge Graph:
CONTRACT UNDERSTANDING ATTICUS DATASET 
 
Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of more 
than 13,000 labels in 510 commercial legal contracts that have been 
manually labeled to identify 41 categories of important clauses that 
lawyers look for when reviewing contracts in connection with corporate 
transactions. 
 
CUAD is curated and maintained by The Atticus Project, Inc. to support 
NLP research and development in legal contract review. Analysis of CUAD 
can be found at https://arxiv.org/abs/2103.06268. Code for replicating 
the results and the trained model can be found at 
https://github.com/TheAtticusProject/cuad. 
 

Context from Documents:
ce Provider’s total liability under this Agreement shall not exceed the tota



'\nYou are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.\n\nContext from Knowledge Graph:\nCONTRACT UNDERSTANDING ATTICUS DATASET \n \nContract Understanding Atticus Dataset (CUAD) v1 is a corpus of more \nthan 13,000 labels in 510 commercial legal contracts that have been \nmanually labeled to identify 41 categories of important clauses that \nlawyers look for when reviewing contracts in connection with corporate \ntransactions. \n \nCUAD is curated and maintained by The Atticus Project, Inc. to support \nNLP research and development in legal contract review. Analysis of CUAD \ncan be found at https://arxiv.org/abs/2103.06268. Code for replicating \nthe results and the trained model can be found at \nhttps://github.com/TheAtticusProject/cuad. \n \n\nContext from Documents:\nce Provider’s total liability under this Agreement shall not exceed the total fees Servi 7.2\n.paid by Client to Service Provider under this Agreement ...\nce Pr

## View Last 5 Chats

In [24]:
rows = cur.execute(
    "SELECT question, answer, created_at FROM chats ORDER BY created_at DESC LIMIT 5"
).fetchall()

for r in rows:
    print("\nQuestion:", r[0])
    print("Answer:", r[1])
    print("Time:", r[2])


Question: What services is the Service Provider obligated to perform under this Agreement?
Answer: 
You are a professional legal AI assistant. Answer **precisely and concisely** using the provided contexts.

Context from Knowledge Graph:
CONTRACT UNDERSTANDING ATTICUS DATASET 
 
Contract Understanding Atticus Dataset (CUAD) v1 is a corpus of more 
than 13,000 labels in 510 commercial legal contracts that have been 
manually labeled to identify 41 categories of important clauses that 
lawyers look for when reviewing contracts in connection with corporate 
transactions. 
 
CUAD is curated and maintained by The Atticus Project, Inc. to support 
NLP research and development in legal contract review. Analysis of CUAD 
can be found at https://arxiv.org/abs/2103.06268. Code for replicating 
the results and the trained model can be found at 
https://github.com/TheAtticusProject/cuad. 
 

Context from Documents:
ce Provider’s total liability under this Agreement shall not exceed the total fees

## Deployment

In [29]:
%%writefile app.py
import streamlit as st
import os, uuid, sqlite3
from datetime import datetime
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_neo4j import Neo4jGraph, Neo4jVector
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from neo4j import GraphDatabase
import torch
import tempfile

# --- Device ---
DEVICE = 0 if torch.cuda.is_available() else -1

# --- Streamlit page ---
st.set_page_config(
    page_title="Legal AI Chatbot",
    page_icon="🤖",
    layout="wide"
)

st.markdown("""
<style>
body { background-color: #f7f9fc; }
.main-header { font-size: 2.8rem; color: #1f3d7a; font-weight: 700; margin-bottom:10px;}
.sub-header { font-size: 1.3rem; color: #4a4a4a; margin-bottom:20px;}
.question { color:#1f3d7a; font-weight:600; padding:10px; background-color:#e6f0ff; border-radius:8px; margin:10px 0;}
.answer { color:#2e7d32; padding:10px; background-color:#e6fff0; border-radius:8px; margin:10px 0;}
.divider { border-top: 2px solid #d6d6d6; margin:20px 0;}
.upload-box { background-color:#fff; padding:20px; border-radius:10px; box-shadow:0 2px 5px rgba(0,0,0,0.1);}
.footer { margin-top:50px;text-align:center;color:#6c6c6c;font-size:0.9rem;}
</style>
""", unsafe_allow_html=True)

st.markdown('<p class="main-header">Legal AI Chatbot</p>', unsafe_allow_html=True)
st.markdown('<p class="sub-header">Upload legal PDFs and get AI-powered answers instantly.</p>', unsafe_allow_html=True)

# --- Upload PDFs ---
documents = []
with st.container():
    uploaded_files = st.file_uploader("Upload PDFs", type="pdf", accept_multiple_files=True)
    if uploaded_files:
        for uploaded_file in uploaded_files:
            with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                tmp_file.write(uploaded_file.read())
                tmp_path = tmp_file.name
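            loader = PyPDFLoader(tmp_path)
            documents.extend(loader.load())
            os.remove(tmp_path)  # the temporary copy is no longer needed once parsed
        # PyPDFLoader yields one Document per page, so this count is in pages.
        st.success(f"Loaded {len(documents)} pages")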

# --- Split Documents ---
if documents:
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    documents = splitter.split_documents(documents)
    for d in documents:
        d.metadata['id'] = str(uuid.uuid4())
    st.info(f"Documents split into {len(documents)} chunks")

# --- Embeddings ---
@st.cache_resource
def load_embeddings():
    return HuggingFaceEmbeddings(
        model_name="sentence-transformers/paraphrase-mpnet-base-v2",
        model_kwargs={"device":"cuda" if DEVICE>=0 else "cpu"}
    )
embeddings = load_embeddings()

# --- Neo4j setup ---
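# Credentials are read from the environment so that secrets are not stored in the code.
# The variable names below are a convention chosen here; set them before launching the app.
NEO4J_URI = os.environ["NEO4J_URI"]
NEO4J_USER = os.environ["NEO4J_USERNAME"]
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
NEO4J_DB = os.environ["NEO4J_DATABASE"]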

neo4j_vector = None
retriever = None
if documents:
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))
    with driver.session(database=NEO4J_DB) as session:
        session.run("MATCH (n) DETACH DELETE n")
    neo4j_vector = Neo4jVector.from_documents(
        documents, embeddings,
        url=NEO4J_URI,
        username=NEO4J_USER,
        password=NEO4J_PASSWORD,
        database=NEO4J_DB
    )
    retriever = neo4j_vector.as_retriever(search_kwargs={"k":3})
    st.success("Neo4j Vector Store ready")

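def dynamic_kg_query(question:str):
    # `driver` is only created once documents have been uploaded and indexed.
    if not documents:
        return ""
    keywords = [w for w in question.split() if len(w)>3] or ["legal"]
    with driver.session(database=NEO4J_DB) as session: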
        cypher="""
        MATCH (n:Chunk)
        WHERE any(k IN $keywords WHERE toLower(n.name) CONTAINS toLower(k)
               OR toLower(n.text) CONTAINS toLower(k))
        RETURN n.text AS text
        LIMIT 5
        """
        results = session.run(cypher, keywords=keywords)
        return "\n".join([r["text"] for r in results])

# --- Load LLM ---
@st.cache_resource
def load_llm(model_name="NousResearch/Nous-Hermes-13b", max_length=512):
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        trust_remote_code=True,
        quantization_config=quant_config
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=max_length,
        do_sample=False
    )
    return HuggingFacePipeline(pipeline=pipe)

llm = load_llm()

professional_prompt = PromptTemplate(
    template="""
You are a professional legal AI assistant. Answer concisely using provided contexts.

Context from Knowledge Graph:
{kg_context}

Context from Documents:
{doc_context}

Question: {question}

Answer:
""",
    input_variables=["kg_context","doc_context","question"]
)

# --- Chat History ---
DB="chat_history.db"
conn = sqlite3.connect(DB)
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS chats (
    id TEXT PRIMARY KEY,
    question TEXT,
    answer TEXT,
    created_at TEXT
)
""")
conn.commit()

# --- Load chat history ---
def load_history():
    cur.execute("SELECT question, answer FROM chats ORDER BY created_at ASC")
    return cur.fetchall()

# --- Display chat history ---
st.subheader("Chat History")
history = load_history()
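import html  # stored chat text is escaped before being embedded in raw HTML
for q, a in history:
    st.markdown(f'<div class="question">👤 You: {html.escape(q)}</div>', unsafe_allow_html=True)
    st.markdown(f'<div class="answer">🤖 Bot: {html.escape(a)}</div>', unsafe_allow_html=True)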


def save_chat(question:str, answer:str):
    chat_id = str(uuid.uuid4())
    cur.execute(
        "INSERT INTO chats (id, question, answer, created_at) VALUES (?,?,?,?)",
        (chat_id, question, answer, datetime.utcnow().isoformat())
    )
    conn.commit()

# --- Ask question ---
st.subheader("Ask your question")
question = st.text_input("Type your question here:")

if st.button("Send") and question:
    doc_context = ""
    if retriever:
        res = retriever.get_relevant_documents(question)[:5]
        doc_context = "\n".join([d.page_content[:250]+"..." for d in res])
    kg_full = dynamic_kg_query(question)
    # Keep the first 15 lines of graph context (the original slice [1:15] dropped the first line).
    kg_context = "\n".join(kg_full.split("\n")[:15]) if kg_full else ""
    prompt_text = professional_prompt.format(
        kg_context=kg_context,
        doc_context=doc_context,
        question=question
    )
    answer = llm(prompt_text, max_new_tokens=300, do_sample=False)
    import html  # escape user and model text before embedding it in raw HTML
    st.markdown(f'<div class="question">👤 You: {html.escape(question)}</div>', unsafe_allow_html=True)
    st.markdown(f'<div class="answer">🤖 Bot: {html.escape(answer)}</div>', unsafe_allow_html=True)
    save_chat(question, answer)

st.markdown('<div class="footer">© 2025 Legal AI Chatbot | Built with Streamlit</div>', unsafe_allow_html=True)


Overwriting app.py
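
In `app.py`, `dynamic_kg_query` builds its keyword list with `question.split()`, so punctuation stays attached to words (for example `liability?`), and such a keyword will not match chunk text with `CONTAINS`. The following is a minimal sketch of a stricter keyword step, together with the per-chunk truncation used for the document context. The helper names `extract_keywords` and `build_doc_context` and the small stop-word set are assumptions made for illustration; they are not part of the app.

```python
import re

# Hypothetical stop-word set, chosen only for this illustration.
STOPWORDS = {"what", "does", "that", "this", "with", "from"}

def extract_keywords(question: str, min_len: int = 4, fallback: str = "legal") -> list:
    """Return lowercase, de-duplicated keywords of at least min_len characters."""
    # Tokenise on letters, digits and apostrophes so trailing punctuation is dropped.
    words = re.findall(r"[A-Za-z0-9']+", question.lower())
    seen, keywords = set(), []
    for w in words:
        if len(w) >= min_len and w not in STOPWORDS and w not in seen:
            seen.add(w)
            keywords.append(w)
    # Same fallback idea as app.py: never send an empty keyword list to Cypher.
    return keywords or [fallback]

def build_doc_context(chunks: list, max_chars: int = 250) -> str:
    """Join chunk texts, truncating each to max_chars and appending '...' as app.py does."""
    return "\n".join(c[:max_chars] + "..." for c in chunks)

print(extract_keywords("What clause does Orange limit its liability?"))
print(build_doc_context(["x" * 300, "short"]))
```

The returned list could be passed as the `$keywords` parameter of the Cypher query in place of the raw split result.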


In [2]:
!pip install streamlit pyngrok



In [3]:
from pyngrok import ngrok, conf

# Read the ngrok auth token from the environment rather than hard-coding it in the notebook.
import os
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]

!ngrok config add-authtoken $NGROK_AUTH_TOKEN

Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml


In [4]:
from pyngrok import ngrok
!streamlit run app.py &>/dev/null &
url = ngrok.connect(8501)
print('Chatbot running at:', url)

Chatbot running at: NgrokTunnel: "https://26175c46c778.ngrok-free.app" -> "http://localhost:8501"


In [27]:
## What is the document about?
## In what clause does Orange limit its liability for damages from its services?
## In what section are users prohibited from copying Orange content without permission?
## In what clause does Orange specify that its content is for personal use only?
## In what part can Orange modify, suspend, or terminate its services?
## In what section does Orange describe its handling of users’ personal information?
## In what clause does Orange disclaim responsibility for external website content?
## In what part does Orange claim rights over content submitted by users?
## In what clause are users prohibited from hacking or damaging Orange services?
## In what section does Orange state it can amend the terms and conditions without notice?
## In what part does Orange explain that users may incur charges for certain services?
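
Sample questions like those listed above can also be run as a batch and logged to the same `chats` table schema that `app.py` uses. The sketch below assumes a hypothetical `answer_fn` callable standing in for the retrieval and LLM call. It uses an in-memory SQLite database and a stub answer function so that it runs without Neo4j or a model; `run_batch` and `stub_answer` are illustrative names, not part of the app.

```python
import sqlite3
import uuid
from datetime import datetime, timezone
from typing import Callable, List, Tuple

SAMPLE_QUESTIONS = [
    "What is the document about?",
    "In what clause does Orange limit its liability for damages from its services?",
]

def run_batch(questions: List[str],
              answer_fn: Callable[[str], str],
              conn: sqlite3.Connection) -> List[Tuple[str, str]]:
    """Answer each question with answer_fn and store the pair in the chats table."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chats (
            id TEXT PRIMARY KEY,
            question TEXT,
            answer TEXT,
            created_at TEXT
        )
    """)
    for q in questions:
        conn.execute(
            "INSERT INTO chats (id, question, answer, created_at) VALUES (?,?,?,?)",
            (str(uuid.uuid4()), q, answer_fn(q), datetime.now(timezone.utc).isoformat()),
        )
    conn.commit()
    # rowid breaks ties when two rows share the same timestamp.
    cur = conn.execute(
        "SELECT question, answer FROM chats ORDER BY created_at ASC, rowid ASC"
    )
    return cur.fetchall()

def stub_answer(question: str) -> str:
    """Placeholder for the real pipeline; reports only the question length."""
    return f"[stub] {len(question.split())} words"

conn = sqlite3.connect(":memory:")
rows = run_batch(SAMPLE_QUESTIONS, stub_answer, conn)
for q, a in rows:
    print(q, "->", a)
```

Replacing `stub_answer` with a function that assembles the prompt and calls the LLM would log real answers in the same way.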