# Author: Omkar Bare

## Project: Knowledge Graph-Based Retrieval-Augmented Generation (RAG) System for U.S. Securities and Exchange Commission (SEC) Filings Data (Company: NetApp, Inc.)

NetApp, Inc. is an American data infrastructure company that provides unified data storage, integrated data services, and cloud operations (CloudOps) solutions to enterprise customers. https://en.wikipedia.org/wiki/NetApp

This notebook implements:
- A RAG pipeline for question answering (QA) over the SEC forms.

---------------------------

The nodes for Form 10-K are constructed in this notebook: https://colab.research.google.com/drive/1UnQEfa66TtXmlHAM-IfC-R0ZahIcFEl5?usp=sharing

The relationships between the Form 10-K and Form 13 nodes are constructed in this notebook:
https://colab.research.google.com/drive/13GCkyUvMs42voBzg_3dwyJ4M26iBMVRW?usp=sharing



Data used in this project:
 - SEC Form 10-K for NetApp, Inc. (retrieved from the SEC website and stored in `.json` format): Publicly traded companies are required to file a Form 10-K each year with the Securities and Exchange Commission (SEC).


 - SEC Form 13 for NetApp, Inc. (retrieved from the SEC website and stored in `.csv` format): Investment management firms must report their investments in companies to the SEC by filing a document called **Form 13**.

In [1]:
# Quote the openai specifier so the shell does not treat ">" as an output redirect
!pip install neo4j langchain==0.3.18 langchain_community==0.3.17 langchain_openai==0.3.6 "openai>=1.6.1" langchain_text_splitters==0.3.6

In [2]:
import json
import textwrap

from google.colab import userdata

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# Warning control
import warnings
warnings.filterwarnings("ignore")

In [4]:
NEO4J_URI = userdata.get('NEO4J_URI')
NEO4J_USERNAME = userdata.get('NEO4J_USERNAME')
NEO4J_PASSWORD = userdata.get('NEO4J_PASSWORD')
NEO4J_DATABASE = userdata.get('NEO4J_DATABASE')

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
OPENAI_ENDPOINT = userdata.get('OPENAI_BASE_URL') + '/embeddings'

In [5]:
# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'

In [6]:
kg = Neo4jGraph(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE
)

# RAG With Cypher Query Augmentation

In [26]:
investment_retrieval_query = """
MATCH (node)-[:PART_OF]->(f:Form),
    (f)<-[:FILED]-(com:Company),
    (com)<-[owns:OWNS_STOCK_IN]-(mgr:Manager)
WITH node, score, mgr, owns, com
    ORDER BY owns.shares DESC LIMIT 10
WITH collect (
    mgr.managerName +
    " owns " + owns.shares +
    " shares in " + com.companyName +
    " at a value of $" +
    apoc.number.format(toInteger(owns.value)) + "."
) AS investment_statements, node, score
RETURN apoc.text.join(investment_statements, "\n") +
    "\n" + node.text AS text,
    score,
    {
      source: node.source
    } as metadata
"""

In [29]:
vector_store_with_investment = Neo4jVector.from_existing_index(
    OpenAIEmbeddings(model="text-embedding-3-small", openai_api_key=OPENAI_API_KEY),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=investment_retrieval_query,
)
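
To make the retrieval query easier to reason about, here is a rough pure-Python sketch of the text it assembles for each matched chunk: the ten largest holdings by share count each become one sentence, and those sentences are prepended to the chunk text. The function name `build_augmented_text` and the sample holdings are hypothetical, and the thousands-separator formatting only approximates what `apoc.number.format` produces.

```python
def build_augmented_text(chunk_text, holdings, limit=10):
    """Approximate, in plain Python, the text built by the Cypher retrieval query.

    `holdings` is a list of dicts with keys managerName, shares, value and
    companyName (hypothetical sample data, not real filings).
    """
    # Mirrors: ORDER BY owns.shares DESC LIMIT 10
    top = sorted(holdings, key=lambda h: h["shares"], reverse=True)[:limit]
    # Mirrors: collect(mgr.managerName + " owns " + ... + ".")
    statements = [
        f'{h["managerName"]} owns {h["shares"]} shares in {h["companyName"]}'
        f' at a value of ${int(h["value"]):,}.'
        for h in top
    ]
    # Mirrors: apoc.text.join(investment_statements, "\n") + "\n" + node.text
    return "\n".join(statements) + "\n" + chunk_text


sample_holdings = [
    {"managerName": "MANAGER A", "shares": 100, "value": 6500.0, "companyName": "NETAPP INC"},
    {"managerName": "MANAGER B", "shares": 2500, "value": 162500.0, "companyName": "NETAPP INC"},
]
print(build_augmented_text("Chunk text from the 10-K.", sample_holdings))
```

Because the statements are placed ahead of the chunk text, the language model sees the ownership facts and the filing excerpt in a single context string.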

In [None]:
# Create a retriever from the vector store
retriever_with_investments = vector_store_with_investment.as_retriever()

In [34]:
# Create a chatbot Question & Answer chain from the retriever
investment_chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0, model="gpt-4.1-mini", openai_api_key=OPENAI_API_KEY),
    chain_type="stuff",
    retriever=retriever_with_investments
)

In [35]:
question = "In a single sentence, tell me about Netapp investors."

In [36]:
# Calling the chain object directly is deprecated in LangChain 0.3; use invoke()
investment_chain.invoke(
    {"question": question},
    return_only_outputs=True,
)

{'answer': 'FINAL ANSWER: The largest investors in NetApp include Vanguard Group Inc., BlackRock Inc., and Primecap Management Co., owning approximately 27.6 million, 18.2 million, and 15.5 million shares respectively.\n\n',
 'sources': 'https://www.sec.gov/Archives/edgar/data/1002047/000095017023027948/0000950170-23-027948-index.htm'}
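
The answer above carries a literal `FINAL ANSWER:` prefix and trailing newlines from the model. If cleaner text is wanted, a small post-processing step could be applied. The helper below, `clean_answer`, is a hypothetical illustration and not part of LangChain; the example dictionary is made-up sample input in the same shape as the chain output.

```python
def clean_answer(result: dict) -> str:
    """Strip a leading 'FINAL ANSWER:' marker and surrounding whitespace (hypothetical helper)."""
    answer = result.get("answer", "").strip()
    prefix = "FINAL ANSWER:"
    if answer.startswith(prefix):
        answer = answer[len(prefix):].strip()
    return answer


# Made-up sample in the same shape as the chain output
example = {
    "answer": "FINAL ANSWER: NetApp has several large institutional investors.\n\n",
    "sources": "",
}
print(clean_answer(example))
```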

# RAG With Cypher Generation

In [15]:
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

In [16]:
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"],
    template=CYPHER_GENERATION_TEMPLATE
)
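
As a quick illustration of what the prompt template does, the sketch below fills the two placeholders with plain `str.format`, which uses the same `{name}` placeholder syntax as the template above. The shortened template, the schema string, and the question are stand-ins chosen for this example; in the notebook the schema is expected to come from the connected graph rather than being typed by hand.

```python
# Shortened stand-in for CYPHER_GENERATION_TEMPLATE (assumption: same two placeholders)
template = "Schema:\n{schema}\n\nThe question is:\n{question}"

# Hypothetical schema text, written by hand for illustration only
schema = "(:Manager)-[:OWNS_STOCK_IN]->(:Company)"
question = "Which managers own NETAPP INC stock?"

# Substitute both placeholders to obtain the final prompt text
rendered = template.format(schema=schema, question=question)
print(rendered)
```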

In [17]:
cypherChain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0, model="gpt-4.1-mini", openai_api_key=OPENAI_API_KEY),
    graph=kg,
    verbose=True,
    cypher_prompt=CYPHER_GENERATION_PROMPT,
    allow_dangerous_requests=True,
)
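
Since `allow_dangerous_requests=True` lets model-generated Cypher run against the database, it may be worth screening statements before trusting them. The following is a minimal sketch of such a check, assuming a simple keyword scan is acceptable; `is_read_only` and `WRITE_CLAUSES` are hypothetical names, the clause list is not exhaustive (procedures invoked through `CALL` can also write), and it is not wired into the chain. Using database credentials with read-only permissions is a more robust safeguard.

```python
import re

# Clauses that modify the graph; deliberately conservative and non-exhaustive
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b",
    re.IGNORECASE,
)


def is_read_only(cypher: str) -> bool:
    """Return True when no write clause keyword appears in the statement."""
    return WRITE_CLAUSES.search(cypher) is None


print(is_read_only("MATCH (m:Manager)-[r:OWNS_STOCK_IN]->(c:Company) RETURN r.shares"))
print(is_read_only("MATCH (n) DETACH DELETE n"))
```

The scan can reject harmless queries that merely mention one of these words inside a string literal, which is the intended trade-off for a conservative filter.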

In [18]:
def prettyCypherChain(question: str) -> None:
    # Run the Cypher QA chain and print the answer wrapped at 60 columns
    response = cypherChain.invoke({"query": question})["result"]
    print(textwrap.fill(response, 60))

In [19]:
prettyCypherChain("SHELTON CAPITAL MANAGEMENT owns how many shares of NETAPP INC?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Manager {managerName: "SHELTON CAPITAL MANAGEMENT"})- [r:OWNS_STOCK_IN]->(c:Company {companyName: "NETAPP INC"})
RETURN r.shares AS shares_owned
Full Context:
[{'shares_owned': 39124}]

> Finished chain.
SHELTON CAPITAL MANAGEMENT owns 39,124 shares of NETAPP INC.


In [25]:
prettyCypherChain("D. E. Shaw & Co., Inc. owns how many shares of NETAPP INC?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Manager)-[r:OWNS_STOCK_IN]->(c:Company {companyName: "NETAPP INC"})
WHERE m.managerName = "D. E. Shaw & Co., Inc."
RETURN sum(r.shares) AS totalSharesOwned
Full Context:
[{'totalSharesOwned': 323440}]

> Finished chain.
D. E. Shaw & Co., Inc. owns 323,440 shares of NETAPP INC.
