# Semantic Search
Semantic search is [defined as "search with meaning."](https://en.wikipedia.org/wiki/Semantic_search) It is key for effective knowledge retrieval.

Unlike traditional lexical search, which finds matches based on keywords, semantic search seeks to improve search quality and accuracy by understanding search intent and returning results that match the user’s contextual meaning. The term is often used to refer to text embeddings and vector similarity search, but that is just one implementation aspect. Knowledge graphs and symbolic query logic can also play a critical role in making semantic search a reality.

If all you care about is analyzing a set of documents on a file system, then vector indexing and search may be sufficient. However, once you need to retrieve and make inferences about the people, places, and things connected to those documents, a knowledge graph becomes key.

If documents are the entities of interest, for example "find all documents that talk about pharma-related things," then text embeddings with vector similarity search suffice. But what if we want second- or third-order entities related to the documents? For example, given "find investors who are most focused on pharma-related strategies," how would we search for them efficiently at scale in an enterprise setting?

This is what we demonstrate below.  We will also show how you can use graph relationships and Graph Data Science algorithms to further improve search results, especially in common scenarios where the presence of text data is inconsistent or sparse.

## Connect to Neo4j

In [1]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://7e098dc5.databases.neo4j.io' #Eg 'neo4j+s://ccc5f4f5.databases.neo4j.io'
NEO4J_PASSWORD = '<your-password>'
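
Pasting credentials into a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the values have been exported as environment variables. The helper name and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions for illustration, not something this notebook defines:

```python
import os
from typing import Mapping, Tuple


def load_neo4j_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str, str]:
    """Return (uri, username, password) read from an environment-style mapping.

    The variable names are hypothetical; adapt them to your own setup.
    """
    uri = env.get("NEO4J_URI")
    password = env.get("NEO4J_PASSWORD")
    if not uri or not password:
        raise KeyError("Set NEO4J_URI and NEO4J_PASSWORD before connecting")
    # The username falls back to 'neo4j', the default mentioned above
    username = env.get("NEO4J_USERNAME", "neo4j")
    return uri, username, password
```

With the variables set, the three constants above could then be assigned in one line: `NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD = load_neo4j_credentials()`.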

In [2]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth = (NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds = True
)

## Neo4j Vector Index
We need a vector index for similarity search on Document nodes. Neo4j offers a vector index that enables approximate nearest neighbor (ANN) search. Let's create one.

In [None]:
gds.run_cypher("CALL db.index.vector.createNodeIndex('document-embeddings', 'Document', 'textEmbedding', 768, 'cosine')")
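
An approximate index trades a little recall for speed. For intuition, the exact computation it approximates is a cosine-similarity ranking over every stored vector. A minimal pure-Python sketch with toy 3-dimensional vectors (the index above is configured for 768 dimensions); the function names and the data are hypothetical and only illustrate the idea:

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbors: score every vector, keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data, not taken from the database
toy_vectors = {
    "doc-energy": [0.9, 0.1, 0.0],
    "doc-pharma": [0.1, 0.9, 0.1],
    "doc-retail": [0.0, 0.2, 0.9],
}
top = exact_top_k([1.0, 0.2, 0.0], toy_vectors, k=2)
```

The brute-force scan costs one similarity computation per stored vector for every query, which is the cost an ANN index is designed to avoid on large collections.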

You can confirm that the vector index has been created using `SHOW INDEXES`:

In [5]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "VECTOR"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,document-embeddings,VECTOR,[Document],[textEmbedding],"{'indexProvider': 'vector-2.0', 'indexConfig':..."


## Deep Semantic Search with Knowledge Graph
Now that we have an index, let’s put it to use.
In this case, we will answer the question "What are the companies associated with energy, oil and gas?"
Remember that we do not have documents on investment managers, only on companies, and there can be multiple documents for each company.

## Dense Retrieval using Neo4j

In [6]:
import pandas as pd
from langchain.docstore.document import Document
from typing import List, Tuple

def to_df(results: List[Tuple[Document, float]]):
    return pd.DataFrame({
        "doc_id": [r[0].metadata.get('documentId') for r in results],
        "score": [r[1] for r in results],
        "text": [r[0].page_content for r in results]
    })

In [7]:
from langchain.vectorstores import Neo4jVector
from langchain_google_vertexai import VertexAIEmbeddings
from langchain.vectorstores.neo4j_vector import SearchType
import pandas as pd

EMBEDDING_MODEL = 'text-embedding-004'
db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(model_name=EMBEDDING_MODEL),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.VECTOR
)
query = 'What are the companies associated with energy, oil and gas?'
results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,BERKSHIRE HATHAWAY INC25,0.78765,The Great Britain distribution companies consi...
1,BERKSHIRE HATHAWAY INC30,0.765144,The natural gas pipelines are subject to regul...
2,BERKSHIRE HATHAWAY INC34,0.763698,BHE and its energy subsidiaries continue to fo...
3,BERKSHIRE HATHAWAY INC26,0.763532,"Northern Natural, based in Nebraska, operates ..."
4,BERKSHIRE HATHAWAY INC23,0.761165,Berkshire currently has a 91.1% ownership inte...
5,BERKSHIRE HATHAWAY INC42,0.753069,Lubrizol operates its business on a global bas...
6,APPLIED INDUSTRIAL TECHNOLOGIES IN12,0.751196,. Our Fluid Power & Flow Control segment incl...
7,BERKSHIRE HATHAWAY INC47,0.749413,Industrial Products\n supplies construction fa...
8,BERKSHIRE HATHAWAY INC24,0.747765,MidAmerican Energy Company (“MEC”) is a regula...
9,APPLIED INDUSTRIAL TECHNOLOGIES IN8,0.747416,Our operations are primarily based in the U.S....


Let's look at the first result and check whether the chunk contains the information we asked for.

In [8]:
res_df['text'][0]

'The Great Britain distribution companies consist of Northern Powergrid (Northeast) plc and Northern Powergrid (Yorkshire) plc, which own a substantial electricity distribution network that delivers electricity to end-users in northeast England in an area covering approximately 10,000 square miles. The distribution companies primarily charge supply companies regulated tariffs for the use of their distribution systems. \n\n\nAltaLink L.P. (“AltaLink”) is a regulated electric transmission-only utility company headquartered in Calgary, Alberta. AltaLink’s high voltage transmission lines and related facilities transmit electricity from generating facilities to major load centers, cities and large industrial plants throughout its 87,000 square mile service territory. \n\n\nThe natural gas pipelines consist of BHE GT&S, LLC (“BHE GT&S”), Northern Natural Gas Company (“Northern Natural”) and Kern River Gas Transmission Company (“Kern River”). BHE GT&S was acquired on November 1, 2020.\n\n\nBH

## Re-rank results
If you explore the results, you may find some that are irrelevant. The retriever has to be efficient over large document collections with millions of entries, and that efficiency comes at the cost of occasionally returning irrelevant candidates.
A re-ranker can then order the final results for the user. The query and each retrieved document are passed together, as a pair, to a transformer network, which outputs a single score indicating how relevant the document is to the query. Higher scores mean more relevant; depending on the model and its output activation, the score may lie between 0 and 1 or be an unbounded value, as in the output below.

We will use a cross-encoder model named `ms-marco-MiniLM-L-6-v2` from the SentenceTransformers library. In a production setting, you can also consider fine-tuning the re-ranker.

In [11]:
def rerank_results(res_df):
    # Imported here so the model library is only loaded when re-ranking is used
    from sentence_transformers import CrossEncoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    # Score every (query, passage) pair with the cross-encoder.
    # `query` is the global query string defined earlier in the notebook.
    cross_inp = [[query, f"Text: {hit['text']}"] for _, hit in res_df.iterrows()]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the scores; note this adds a column to the caller's DataFrame
    res_df['cross-score'] = cross_scores

    # Sort by cross-encoder score, most relevant first, and print the ranking
    print("Top-10 Cross-Encoder Re-ranker hits")
    hits = res_df.sort_values('cross-score', ascending=False).reset_index(drop=True)
    for _, hit in hits.iterrows():
        print("\t{:.3f}\t{}".format(hit['cross-score'], hit['doc_id']))
    return hits

In [12]:
ranked_results = rerank_results(res_df)
ranked_results



Top-10 Cross-Encoder Re-ranker hits
	-2.413	BERKSHIRE HATHAWAY INC23
	-3.615	BERKSHIRE HATHAWAY INC24
	-4.773	BERKSHIRE HATHAWAY INC25
	-5.920	BERKSHIRE HATHAWAY INC34
	-6.956	APPLIED INDUSTRIAL TECHNOLOGIES IN8
	-7.015	BERKSHIRE HATHAWAY INC26
	-7.671	BERKSHIRE HATHAWAY INC42
	-8.045	BERKSHIRE HATHAWAY INC30
	-9.600	BERKSHIRE HATHAWAY INC47
	-10.158	APPLIED INDUSTRIAL TECHNOLOGIES IN12


Unnamed: 0,doc_id,score,text,cross-score
0,BERKSHIRE HATHAWAY INC23,0.761165,Berkshire currently has a 91.1% ownership inte...,-2.412768
1,BERKSHIRE HATHAWAY INC24,0.747765,MidAmerican Energy Company (“MEC”) is a regula...,-3.615499
2,BERKSHIRE HATHAWAY INC25,0.78765,The Great Britain distribution companies consi...,-4.772871
3,BERKSHIRE HATHAWAY INC34,0.763698,BHE and its energy subsidiaries continue to fo...,-5.92015
4,APPLIED INDUSTRIAL TECHNOLOGIES IN8,0.747416,Our operations are primarily based in the U.S....,-6.95618
5,BERKSHIRE HATHAWAY INC26,0.763532,"Northern Natural, based in Nebraska, operates ...",-7.01503
6,BERKSHIRE HATHAWAY INC42,0.753069,Lubrizol operates its business on a global bas...,-7.670772
7,BERKSHIRE HATHAWAY INC30,0.765144,The natural gas pipelines are subject to regul...,-8.044989
8,BERKSHIRE HATHAWAY INC47,0.749413,Industrial Products\n supplies construction fa...,-9.599635
9,APPLIED INDUSTRIAL TECHNOLOGIES IN12,0.751196,. Our Fluid Power & Flow Control segment incl...,-10.158342


As seen above, the cross-encoder finds this passage to be the most relevant to the query:

In [13]:
ranked_results['text'][0]

'Berkshire currently has a 91.1% ownership interest in Berkshire Hathaway Energy Company (“BHE”). BHE is a global energy company with subsidiaries and affiliates that generate, transmit, store, distribute and supply energy. BHE’s locally managed businesses are organized as separate operating units. BHE’s domestic regulated energy interests are comprised of four regulated utility companies serving approximately 5.2\xa0million retail customers, five interstate natural gas pipeline companies with approximately 21,100 miles of operated pipeline having a design capacity of approximately 21\xa0billion cubic feet of natural gas per day and ownership interests in electricity transmission businesses. BHE’s Great Britain electricity distribution subsidiaries serve about 3.9\xa0million electricity end-users and its electricity transmission-only business in Alberta, Canada serves approximately 85% of Alberta’s population. BHE’s interests also include a diversified portfolio of independent power pr
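
The cross-scores printed above are negative, so they are raw model outputs rather than values between 0 and 1. If a bounded score is more convenient, one option is to pass each value through a logistic sigmoid. This is an illustrative post-processing step, not something the notebook does, and it leaves the ranking unchanged because the sigmoid is monotonic. The helper name is hypothetical; the inputs are the top three cross-scores from the table above:

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Top three cross-encoder scores copied from the re-ranked table above
cross_scores = {
    "BERKSHIRE HATHAWAY INC23": -2.412768,
    "BERKSHIRE HATHAWAY INC24": -3.615499,
    "BERKSHIRE HATHAWAY INC25": -4.772871,
}

bounded_scores = {doc_id: sigmoid(score) for doc_id, score in cross_scores.items()}

# Ranking by the bounded score gives the same order as ranking by the raw score
ranked_by_bounded = sorted(bounded_scores, key=bounded_scores.get, reverse=True)
ranked_by_raw = sorted(cross_scores, key=cross_scores.get, reverse=True)
```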

## Hybrid Search with Neo4j
Vector search is not the only option. A hybrid approach, combining vector and full-text search, often performs better. You can do both in Neo4j.

To run a full-text search, let's first create a full-text index:

In [14]:
gds.run_cypher("CREATE FULLTEXT INDEX doc_text IF NOT EXISTS FOR (n:Document) ON EACH [n.text]")

In [15]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "FULLTEXT"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,doc_text,FULLTEXT,[Document],[text],"{'indexProvider': 'fulltext-1.0', 'indexConfig..."


In [16]:
from langchain_google_vertexai import VertexAIEmbeddings
from langchain.vectorstores import Neo4jVector
from langchain.vectorstores.neo4j_vector import SearchType

db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(model_name=EMBEDDING_MODEL),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.HYBRID # Hybrid search type
)

results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,BERKSHIRE HATHAWAY INC25,1.0,The Great Britain distribution companies consi...
1,BERKSHIRE HATHAWAY INC26,1.0,"Northern Natural, based in Nebraska, operates ..."
2,BERKSHIRE HATHAWAY INC30,0.971426,The natural gas pipelines are subject to regul...
3,BERKSHIRE HATHAWAY INC34,0.96959,BHE and its energy subsidiaries continue to fo...
4,BERKSHIRE HATHAWAY INC23,0.966374,Berkshire currently has a 91.1% ownership inte...
5,BERKSHIRE HATHAWAY INC42,0.956096,Lubrizol operates its business on a global bas...
6,APPLIED INDUSTRIAL TECHNOLOGIES IN12,0.953718,. Our Fluid Power & Flow Control segment incl...
7,BERKSHIRE HATHAWAY INC47,0.951454,Industrial Products\n supplies construction fa...
8,BERKSHIRE HATHAWAY INC24,0.949362,MidAmerican Energy Company (“MEC”) is a regula...
9,APPLIED INDUSTRIAL TECHNOLOGIES IN8,0.948919,Our operations are primarily based in the U.S....
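
Two rows above score exactly 1.0. That is what one would expect if each result list (vector and full-text) is divided by its own maximum score before the lists are merged, so that the best hit of each list lands at 1.0. A sketch of that style of score fusion, under the assumption that the hybrid mode works roughly this way; the function name and the toy scores are hypothetical:

```python
from typing import Dict, List, Tuple


def fuse_max_normalized(
    vector_hits: Dict[str, float], keyword_hits: Dict[str, float], k: int
) -> List[Tuple[str, float]]:
    """Normalize each list by its top score, keep each doc's best score, return top k."""
    fused: Dict[str, float] = {}
    for hits in (vector_hits, keyword_hits):
        if not hits:
            continue
        top_score = max(hits.values())
        if top_score <= 0:
            continue  # nothing meaningful to normalize against
        for doc_id, score in hits.items():
            normalized = score / top_score
            # A document found by both searches keeps its higher normalized score
            fused[doc_id] = max(fused.get(doc_id, 0.0), normalized)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)[:k]


# Toy scores on different scales: cosine-like similarities and full-text relevance
vector_hits = {"doc-a": 0.79, "doc-b": 0.76, "doc-c": 0.75}
keyword_hits = {"doc-b": 4.2, "doc-d": 2.1}
fused = fuse_max_normalized(vector_hits, keyword_hits, k=3)
```

Normalizing first matters because vector similarities and full-text relevance scores live on different scales, so comparing them directly would let one search dominate the other.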


Let's re-rank our results for relevance:

In [17]:
ranked_results = rerank_results(res_df)
ranked_results



Top-10 Cross-Encoder Re-ranker hits
	-2.413	BERKSHIRE HATHAWAY INC23
	-3.615	BERKSHIRE HATHAWAY INC24
	-4.773	BERKSHIRE HATHAWAY INC25
	-5.920	BERKSHIRE HATHAWAY INC34
	-6.956	APPLIED INDUSTRIAL TECHNOLOGIES IN8
	-7.015	BERKSHIRE HATHAWAY INC26
	-7.671	BERKSHIRE HATHAWAY INC42
	-8.045	BERKSHIRE HATHAWAY INC30
	-9.600	BERKSHIRE HATHAWAY INC47
	-10.158	APPLIED INDUSTRIAL TECHNOLOGIES IN12


Unnamed: 0,doc_id,score,text,cross-score
0,BERKSHIRE HATHAWAY INC23,0.966374,Berkshire currently has a 91.1% ownership inte...,-2.412768
1,BERKSHIRE HATHAWAY INC24,0.949362,MidAmerican Energy Company (“MEC”) is a regula...,-3.615499
2,BERKSHIRE HATHAWAY INC25,1.0,The Great Britain distribution companies consi...,-4.772871
3,BERKSHIRE HATHAWAY INC34,0.96959,BHE and its energy subsidiaries continue to fo...,-5.92015
4,APPLIED INDUSTRIAL TECHNOLOGIES IN8,0.948919,Our operations are primarily based in the U.S....,-6.95618
5,BERKSHIRE HATHAWAY INC26,1.0,"Northern Natural, based in Nebraska, operates ...",-7.01503
6,BERKSHIRE HATHAWAY INC42,0.956096,Lubrizol operates its business on a global bas...,-7.670772
7,BERKSHIRE HATHAWAY INC30,0.971426,The natural gas pipelines are subject to regul...,-8.044989
8,BERKSHIRE HATHAWAY INC47,0.951454,Industrial Products\n supplies construction fa...,-9.599635
9,APPLIED INDUSTRIAL TECHNOLOGIES IN12,0.953718,. Our Fluid Power & Flow Control segment incl...,-10.158342


In [18]:
ranked_results['text'][4]

'Our operations are primarily based in the U.S. where 87% of our fiscal 2022 sales were generated.  We also have international operations, the largest of which is in Canada (7% of fiscal 2022 sales) with the balance (6% of fiscal 2022 sales) in Mexico, Australia, New Zealand, and Singapore. \nSUPPLIERS\nWe are a leading distributor of products including bearings, power transmission products, engineered fluid power components and systems, specialty flow control solutions, advanced automation products, industrial rubber products, linear motion components, tools, safety products, and other industrial and maintenance supplies.\nThese products are generally supplied to us by manufacturers whom we serve as a non-exclusive distributor.  The suppliers also may provide us product training, as well as sales and marketing support.  Authorizations to represent particular suppliers and product lines vary by geographic region, particularly for our fluid power, flow control, and automation businesses

We don't have many chunked documents from other companies. Once you load more, hybrid search can surface additional results, and the re-ranker can then help order them.

## Semantic Search with Multi-hop retrieval using Knowledge Graph

First, let's create a database connection.

In [19]:
db = Neo4jVector(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(model_name=EMBEDDING_MODEL)
)

Now let's embed the query and use the resulting vector to search for companies. Remember, companies have multiple documents, so we need a graph traversal on top of the document lookup to find which companies are most similar.

In [20]:
query_vector = db.embedding.embed_query(text=query)
query_vector[0:5]

[0.005244616884738207,
 -0.0270429328083992,
 -0.048955537378787994,
 -0.017425144091248512,
 0.018080072477459908]

In [21]:
# Search for similar companies
res_df = db.query("""
CALL db.index.vector.queryNodes('document-embeddings', 10, $queryVector)
YIELD node AS similarDocuments, score

MATCH (similarDocuments)<-[:HAS]-(c:Company)
RETURN c.companyName as companyName, avg(score) AS score
ORDER BY score DESC LIMIT 10
""", params =  {'queryVector': query_vector})
res_df

[{'companyName': 'Berkshire Hathaway B', 'score': 0.7614296078681946},
 {'companyName': 'Applied Materials', 'score': 0.749306321144104}]

The above result is based on the limited set of documents we have in the database. Each company's score is the average similarity score across its matching chunks.
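
The `avg(score)` aggregation can be reproduced by hand from the ten chunk scores in the dense retrieval table earlier. A sketch, assuming each chunk belongs to the company named in its `doc_id` prefix (in the graph that grouping comes from the `HAS` relationship, not from string matching); the function name is hypothetical:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (company prefix of doc_id, similarity score), copied from the dense retrieval results
chunk_scores: List[Tuple[str, float]] = [
    ("BERKSHIRE HATHAWAY INC", 0.78765),
    ("BERKSHIRE HATHAWAY INC", 0.765144),
    ("BERKSHIRE HATHAWAY INC", 0.763698),
    ("BERKSHIRE HATHAWAY INC", 0.763532),
    ("BERKSHIRE HATHAWAY INC", 0.761165),
    ("BERKSHIRE HATHAWAY INC", 0.753069),
    ("APPLIED INDUSTRIAL TECHNOLOGIES IN", 0.751196),
    ("BERKSHIRE HATHAWAY INC", 0.749413),
    ("BERKSHIRE HATHAWAY INC", 0.747765),
    ("APPLIED INDUSTRIAL TECHNOLOGIES IN", 0.747416),
]


def average_by_company(rows: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Group chunk scores by company and rank companies by their mean score."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for company, score in rows:
        grouped[company].append(score)
    averaged = {company: mean(scores) for company, scores in grouped.items()}
    return sorted(averaged.items(), key=lambda pair: pair[1], reverse=True)


company_scores = average_by_company(chunk_scores)
```

The two averages agree with the query output above to about six decimal places, which suggests the traversal is averaging exactly those ten chunks.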

Now let's take this one step further and find the investment managers who are most heavily focused on energy. This will involve a bit more Cypher for a 2-hop traversal.

In [22]:
# Search for managers with significant investments in the area
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
WITH c, avg(score) AS score ORDER BY score DESC LIMIT 100
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, score*r.value as weightedScore
WITH m.managerName AS managerName, sum(weightedScore) AS aggScore, sum(value) AS aggValue
RETURN managerName, aggScore/aggValue AS score ORDER BY score DESC LIMIT 1000

""", params =  {'queryVector': query_vector})
res_df

Unnamed: 0,managerName,score
0,"Alley Investment Management Company, LLC",0.706942
1,SAYBROOK CAPITAL /NC,0.706942
2,STALEY CAPITAL ADVISERS INC,0.706942
3,"Keeley-Teton Advisors, LLC",0.706942
4,MENLO ADVISORS LLC,0.706942
...,...,...
995,"Global Trust Asset Management, LLC",0.684052
996,"FinTrust Capital Advisors, LLC",0.684052
997,Financial Council Asset Management Inc,0.684052
998,"Naviter Wealth, LLC",0.684052
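
The manager score in the query is a value-weighted average: `sum(score * value) / sum(value)` over the manager's holdings that matched a scored company. A small sketch of that arithmetic with made-up holdings; the manager names, company names, and values are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_manager_scores(
    holdings: List[Tuple[str, str, float]], company_scores: Dict[str, float]
) -> Dict[str, float]:
    """Value-weighted mean of company scores per manager.

    holdings holds (manager, company, position value) triples. Companies
    without a score are skipped, mirroring how the MATCH only follows
    companies that had similar documents.
    """
    agg_score: Dict[str, float] = defaultdict(float)
    agg_value: Dict[str, float] = defaultdict(float)
    for manager, company, value in holdings:
        if company not in company_scores:
            continue
        agg_score[manager] += company_scores[company] * value
        agg_value[manager] += value
    return {m: agg_score[m] / agg_value[m] for m in agg_score if agg_value[m] > 0}


company_scores = {"EnergyCo": 0.76, "ToolsCo": 0.70}
holdings = [
    ("Manager A", "EnergyCo", 300.0),
    ("Manager A", "ToolsCo", 100.0),
    ("Manager B", "ToolsCo", 50.0),
    ("Manager B", "NoDocsCo", 900.0),  # no documents, so it does not affect the score
]
manager_scores = weighted_manager_scores(holdings, company_scores)
```

A manager whose matched holdings consist of a single company receives exactly that company's score regardless of position size, which would be consistent with the many identical scores in the output above.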


And we can see that our top result is:

In [23]:
top_manager = res_df['managerName'][0]
top_manager

'Alley Investment Management Company, LLC'

## Expanding Available Data for Knowledge Retrieval
Not every element in your data will have rich text attached. Much like we only have 10-K documents for some companies, your use cases may also have incomplete, unevenly distributed text data.

We can check this for our top-ranked investment manager.

In [24]:
gds.run_cypher('''
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)-[:HAS]->(d:Document)
WITH m, count(DISTINCT c) AS ownedCompaniesWithDocs
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)
RETURN m.managerName AS managerName, ownedCompaniesWithDocs, count(DISTINCT c) AS totalOwnedCompanies
''', params =  {'managerName':top_manager})

Unnamed: 0,managerName,ownedCompaniesWithDocs,totalOwnedCompanies
0,"Alley Investment Management Company, LLC",1,81


This manager owns far more companies without documents than with them (80 of 81 have none). We can use Graph Data Science Node Similarity to find the managers whose holdings overlap most with this one, which should help surface other energy-focused results that we missed due to sparse text data.

In [25]:
g, _ = gds.graph.project('proj', ['Company', 'Manager'], {'OWNS':{'properties':['value']}})


In [26]:
gds.nodeSimilarity.write(
    g,
    writeRelationshipType = 'SIMILAR', writeProperty = 'score',
    relationshipWeightProperty = 'value', similarityMetric = 'COSINE',
    similarityCutoff = 0.2, degreeCutoff = 4
)


preProcessingMillis                                                       3
computeMillis                                                          7616
writeMillis                                                             601
postProcessingMillis                                                      0
nodesCompared                                                          5418
relationshipsWritten                                                  49978
similarityDistribution    {'min': 0.19999980926513672, 'p5': 0.247662544...
configuration             {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object
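
For intuition, weighted cosine similarity between two managers can be sketched by treating each manager's holdings as a sparse vector of position values keyed by company. This is a simplified illustration under that assumption, not the GDS implementation; the function name and holdings are hypothetical:

```python
import math
from typing import Dict


def weighted_cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse holdings vectors (company -> value)."""
    shared = set(a) & set(b)
    dot = sum(a[company] * b[company] for company in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


manager_1 = {"CompanyA": 3.0, "CompanyB": 4.0}
manager_2 = {"CompanyB": 4.0, "CompanyC": 3.0}
manager_3 = {"CompanyC": 5.0}

sim_overlap = weighted_cosine(manager_1, manager_2)   # share CompanyB only
sim_disjoint = weighted_cosine(manager_1, manager_3)  # no shared companies

# Keep only pairs at or above the cutoff used in the call above
SIMILARITY_CUTOFF = 0.2
kept = [s for s in (sim_overlap, sim_disjoint) if s >= SIMILARITY_CUTOFF]
```

Managers with no holdings in common score 0 and fall below the cutoff, so a `SIMILAR` relationship is written only where portfolios genuinely overlap.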

In [27]:
g.drop()

graphName                                                             proj
database                                                             neo4j
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                            21871
relationshipCount                                                  1600580
configuration            {'relationshipProjection': {'OWNS': {'aggregat...
density                                                           0.003346
creationTime                           2024-09-12T07:47:16.311362783+00:00
modificationTime                       2024-09-12T07:47:18.349305851+00:00
schema                   {'graphProperties': {}, 'nodes': {'Manager': {...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Manager': {...
Name: 0, dtype: object

And now we can pull back other relevant results

In [28]:
gds.run_cypher('''
MATCH (m0:Manager {managerName: $managerName})-[r:SIMILAR]->(m:Manager)
WHERE r.score IS NOT NULL
RETURN DISTINCT m.managerName AS managerName, r.score AS score
ORDER BY score DESC LIMIT 10
''', params =  {'managerName':top_manager})

Unnamed: 0,managerName,score
0,"BMS Financial Advisors, LLC",0.303577
1,"MONOGRAPH WEALTH ADVISORS, LLC",0.275661
2,"Pinnacle Wealth Management Advisory Group, LLC",0.272319
3,"CONCENTRIC WEALTH MANAGEMENT, LLC",0.271114
4,"Decatur Capital Management, Inc.",0.260563
5,Fragasso Group Inc.,0.258156
6,"Chicago Capital, LLC",0.255217
7,DAVIS R M INC,0.254797
8,"NATIXIS ADVISORS, L.P.",0.251811
9,BARCLAYS PLC,0.249026


## Clean Up

In [None]:
gds.run_cypher('MATCH (M:Manager)-[s:SIMILAR]->() DELETE s')