# Semantic Search
Semantic search is [defined as "search with meaning."](https://en.wikipedia.org/wiki/Semantic_search) It is key for effective knowledge retrieval.

Unlike traditional lexical search, which finds matches based on keywords, semantic search seeks to improve search quality and accuracy by understanding search intent and returning results that match the user's contextual meaning. The term is often used to refer to text embeddings and vector similarity search, but that is just one way to implement it. Knowledge graphs and symbolic query logic can also play a critical role in making semantic search a reality.
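To make the contrast concrete, here is a minimal sketch comparing a keyword-overlap score with cosine similarity over embeddings. The documents, the query, and the three-dimensional vectors are all made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def keyword_overlap(query: str, text: str) -> int:
    """Lexical score: number of distinct lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical documents with hand-made 3-d "embeddings";
# the axes loosely stand for (energy, art, finance).
docs = {
    "crude petroleum extraction company": [0.9, 0.0, 0.3],
    "oil painting supplies store": [0.1, 0.9, 0.1],
}
query_text = "oil and gas producers"
query_vec = [0.95, 0.05, 0.2]  # hypothetical embedding of the query

# Keyword matching latches onto the shared word "oil".
lexical_best = max(docs, key=lambda d: keyword_overlap(query_text, d))
# Vector similarity follows the meaning instead of the surface words.
semantic_best = max(docs, key=lambda d: cosine(query_vec, docs[d]))

print("lexical :", lexical_best)
print("semantic:", semantic_best)
```

In this toy setup the keyword score prefers the painting store because it shares the word "oil", while the vector score prefers the petroleum company even though it shares no words with the query.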

If all you care about is analyzing a set of documents on a file system, then vector indexing and search may be sufficient. However, once you need to retrieve and make inferences about people, places, and things connected to those documents, a knowledge graph becomes key.

If documents are the entities of interest, for example "find all documents that talk about pharma-related things," then text embeddings with vector similarity search suffice. But what if we want second- or third-order entities related to the documents? For example, given "find investors who are most focused on pharma-related strategies," how would we efficiently search for them at scale in an enterprise setting?

This is what we demonstrate below.  We will also show how you can use graph relationships and Graph Data Science algorithms to further improve search results, especially in common scenarios where the presence of text data is inconsistent or sparse.

## Connect to Neo4j

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://6688b25b.databases.neo4j.io'
NEO4J_PASSWORD = '_kogrNk53u8oTk5be55kmit1kHGdhZj98yJlG-VYSR'

In [5]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth = (NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds = True
)

## Neo4j Vector Index
We need to create a vector index for similarity search on Document nodes. Neo4j offers a vector index that enables Approximate Nearest Neighbor (ANN) search. Let's create an index.

In [None]:
gds.run_cypher("CALL db.index.vector.createNodeIndex('document-embeddings', 'Document', 'textEmbedding', 768, 'cosine')")

You can confirm that the vector index has been created with `SHOW INDEXES`.

In [None]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "VECTOR"
''')
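For intuition about what an approximate index speeds up, the sketch below performs the exact version of the same search: it scores every stored vector against the query and keeps the best `k`. The vectors and document ids are hypothetical, and this is not how Neo4j implements its index; it only shows the brute-force computation that an ANN index approximates.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_top_k(query_vec, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items())
    return heapq.nlargest(k, scored)

# Hypothetical 2-d vectors keyed by document id.
vectors = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}

top = exact_top_k([1.0, 0.1], vectors, k=2)
for score, doc_id in top:
    print(f"{score:.3f}\t{doc_id}")
```

The exact search touches every vector, so its cost grows linearly with the collection; an approximate index trades a little recall for much faster lookups on large collections.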

## Deep Semantic Search with Knowledge Graph
Now that we have an index, let's put it to use.
In this case, we will answer the question "What are the companies associated with energy, oil and gas?"
Remember that we do not have documents on investment managers, only on companies, and there can be multiple documents for each company.

## Dense Retrieval using Neo4j

In [6]:
from typing import List, Tuple

import pandas as pd  # needed by to_df below
from langchain.docstore.document import Document

def to_df(results: List[Tuple[Document, float]]):
    return pd.DataFrame({
        "doc_id": [r[0].metadata.get('documentId') for r in results],
        "score": [r[1] for r in results],
        "text": [r[0].page_content for r in results]
    })

In [27]:
from langchain.vectorstores import Neo4jVector
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.vectorstores.neo4j_vector import SearchType
import pandas as pd

db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.VECTOR
)
query = 'What are the companies associated with energy, oil and gas?'
results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,BERKSHIRE HATHAWAY INC10,0.847954,"As railroads streamline, rationalize and other..."
1,BERKSHIRE HATHAWAY INC21,0.84245,"Certain Marmon business, including the Rail an..."
2,BERKSHIRE HATHAWAY INC11,0.841626,"As vertically integrated utilities, BHE’s dome..."
3,ROPER TECHNOLOGIES I2,0.840261,Logitech\n - provides equipment and consumable...
4,BERKSHIRE HATHAWAY INC20,0.839466,Electrical\n produces electrical wire for use ...
5,BERKSHIRE HATHAWAY INC27,0.837863,"The BH Shoe Holdings Group, headquartered in G..."
6,Goldman sachs group inc56,0.836494,Goldman Sachs 2021 Form 10-K\n \n \n \n \n \n ...
7,BERKSHIRE HATHAWAY INC13,0.835542,AltaLink is regulated by the Alberta Utilities...
8,Johnson Controls Intern3,0.834024,"Building Solutions North America designs, sell..."
9,BERKSHIRE HATHAWAY INC12,0.833583,"BHE Renewables, based in Iowa, owns interests ..."


Let's take a look at the first result and check whether the chunk has the information we asked for.

In [28]:
res_df['text'][0]

'As railroads streamline, rationalize and otherwise enhance their franchises, competition among rail carriers intensifies. BNSF Railway’s primary rail competitor in the Western region of the United States is the Union Pacific Railroad Company. Other Class\xa0I railroads and numerous regional railroads and motor carriers also operate in parts of the same territories served by BNSF Railway. \n\n\nUtilities and Energy Businesses—Berkshire Hathaway Energy \n\n\nBerkshire currently has a 91.1% ownership interest in Berkshire Hathaway Energy Company (“BHE”). BHE is a global energy company with subsidiaries and affiliates that generate, transmit, store, distribute and supply energy. BHE’s locally managed businesses are organized as separate operating units. BHE’s domestic regulated energy interests are comprised of four regulated utility companies serving approximately 5.2\xa0million retail customers, five interstate natural gas pipeline companies with approximately 21,100 miles of operated p

## Re-rank results
If you explore the results, you may find some that are irrelevant. The retriever has to be efficient for large document collections with millions of entries, so it may return irrelevant candidates.
A re-ranker based on a cross-encoder can substantially improve the final results for the user. The query and a candidate document are passed together to a transformer network, which outputs a single relevance score for the pair; higher means more relevant. Note that the scores in the output below are not confined to the range 0 to 1 (they are negative), so treat them as relative rankings rather than probabilities.

We will use a cross-encoder model named `ms-marco-MiniLM-L-6-v2` from the SentenceTransformers library.

In [9]:
def rerank_results(res_df):
    # Imported lazily so the model library loads only when re-ranking is used
    from sentence_transformers import CrossEncoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    # Score every retrieved passage against the global `query` string
    cross_inp = [[query, f"Text: {hit['text']}"] for _, hit in res_df.iterrows()]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder scores (adds a column to res_df in place)
    res_df['cross-score'] = cross_scores

    # Sort by cross-encoder score and print the re-ranked hits
    print("Top-10 Cross-Encoder Re-ranker hits")
    hits = res_df.sort_values('cross-score', ascending=False).reset_index(drop=True)
    for _, hit in hits.iterrows():
        print("\t{:.3f}\t{}".format(hit['cross-score'], hit['doc_id']))
    return hits

In [29]:
ranked_results = rerank_results(res_df)
ranked_results

Top-10 Cross-Encoder Re-ranker hits
	-2.966	BERKSHIRE HATHAWAY INC10
	-3.291	BERKSHIRE HATHAWAY INC11
	-5.627	BERKSHIRE HATHAWAY INC12
	-6.222	BERKSHIRE HATHAWAY INC13
	-6.797	Goldman sachs group inc56
	-6.881	BERKSHIRE HATHAWAY INC20
	-7.527	ROPER TECHNOLOGIES I2
	-8.108	Johnson Controls Intern3
	-9.081	BERKSHIRE HATHAWAY INC21
	-9.925	BERKSHIRE HATHAWAY INC27


Unnamed: 0,doc_id,score,text,cross-score
0,BERKSHIRE HATHAWAY INC10,0.847954,"As railroads streamline, rationalize and other...",-2.966444
1,BERKSHIRE HATHAWAY INC11,0.841626,"As vertically integrated utilities, BHE’s dome...",-3.290705
2,BERKSHIRE HATHAWAY INC12,0.833583,"BHE Renewables, based in Iowa, owns interests ...",-5.627203
3,BERKSHIRE HATHAWAY INC13,0.835542,AltaLink is regulated by the Alberta Utilities...,-6.222227
4,Goldman sachs group inc56,0.836494,Goldman Sachs 2021 Form 10-K\n \n \n \n \n \n ...,-6.797363
5,BERKSHIRE HATHAWAY INC20,0.839466,Electrical\n produces electrical wire for use ...,-6.880808
6,ROPER TECHNOLOGIES I2,0.840261,Logitech\n - provides equipment and consumable...,-7.526938
7,Johnson Controls Intern3,0.834024,"Building Solutions North America designs, sell...",-8.107805
8,BERKSHIRE HATHAWAY INC21,0.84245,"Certain Marmon business, including the Rail an...",-9.08062
9,BERKSHIRE HATHAWAY INC27,0.837863,"The BH Shoe Holdings Group, headquartered in G...",-9.924603


As seen above, the cross-encoder moves the passages that are more relevant to the query toward the top of the list.
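The cross-encoder scores printed above are negative, so they are not directly readable as probabilities. If a bounded score is more convenient for thresholds or display, one option is to pass each score through a logistic sigmoid, which maps any real number into (0, 1) without changing the ranking. The sketch below applies it to three scores copied from the table above; treating the raw scores as logits is an assumption on my part.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Cross-encoder scores copied from the re-ranked table above.
cross_scores = {
    "BERKSHIRE HATHAWAY INC10": -2.966444,
    "BERKSHIRE HATHAWAY INC11": -3.290705,
    "BERKSHIRE HATHAWAY INC27": -9.924603,
}

bounded = {doc_id: sigmoid(score) for doc_id, score in cross_scores.items()}

# The sigmoid is monotonic, so sorting by either score gives the same order.
for doc_id in sorted(bounded, key=bounded.get, reverse=True):
    print(f"{bounded[doc_id]:.4f}\t{doc_id}")
```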

## Hybrid Search with Neo4j
Vector search is not the only solution. Often a hybrid approach, combining vector and full-text search, performs better. You can do both in Neo4j.

To do a full-text search, let's first create a full-text index.

In [12]:
gds.run_cypher("CREATE FULLTEXT INDEX doc_text IF NOT EXISTS FOR (n:Document) ON EACH [n.text]")

In [13]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "FULLTEXT"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,doc_text,FULLTEXT,[Document],[text],"{'indexProvider': 'fulltext-1.0', 'indexConfig..."


In [31]:
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain.vectorstores.neo4j_vector import SearchType

db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.HYBRID # Hybrid search type
)

results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,CANADIAN PACIFIC KANSAS CITY LTD7,1.0,export company owned in equal shares by Nutrie...
1,BERKSHIRE HATHAWAY INC12,0.990436,"BHE Renewables, based in Iowa, owns interests ..."
2,DELTA AI8,0.985763,If we or our subsidiaries are unable to reach ...
3,"Goldman Sachs BDC, Inc.48",0.954804,Net Assets\n\n\n \n\n\n\n\n\n\nSoftware\n\n\n ...
4,Martin Marietta Materia5,0.953364,The cement operations of the Building Material...
5,CANADIAN PACIFIC KANSAS CITY LTD8,0.949872,Petroleum products consist of commodities such...
6,BERKSHIRE HATHAWAY INC11,0.932074,"As vertically integrated utilities, BHE’s dome..."
7,Delta Apparel Inc5,0.928153,The focus on reducing our overall energy inten...
8,DELTA AI9,0.894565,The competitive nature of the airline industry...
9,BERKSHIRE HATHAWAY INC16,0.88496,Investment casting technology involves a multi...


Let's re-rank our results for relevancy.

In [32]:
ranked_results = rerank_results(res_df)
ranked_results

Top-10 Cross-Encoder Re-ranker hits
	-0.917	CANADIAN PACIFIC KANSAS CITY LTD8
	-3.291	BERKSHIRE HATHAWAY INC11
	-5.437	DELTA AI9
	-5.627	BERKSHIRE HATHAWAY INC12
	-6.176	BERKSHIRE HATHAWAY INC16
	-6.386	Martin Marietta Materia5
	-6.993	CANADIAN PACIFIC KANSAS CITY LTD7
	-7.620	Delta Apparel Inc5
	-7.949	DELTA AI8
	-8.604	Goldman Sachs BDC, Inc.48


Unnamed: 0,doc_id,score,text,cross-score
0,CANADIAN PACIFIC KANSAS CITY LTD8,0.949872,Petroleum products consist of commodities such...,-0.916921
1,BERKSHIRE HATHAWAY INC11,0.932074,"As vertically integrated utilities, BHE’s dome...",-3.290705
2,DELTA AI9,0.894565,The competitive nature of the airline industry...,-5.437329
3,BERKSHIRE HATHAWAY INC12,0.990436,"BHE Renewables, based in Iowa, owns interests ...",-5.627203
4,BERKSHIRE HATHAWAY INC16,0.88496,Investment casting technology involves a multi...,-6.175798
5,Martin Marietta Materia5,0.953364,The cement operations of the Building Material...,-6.386041
6,CANADIAN PACIFIC KANSAS CITY LTD7,1.0,export company owned in equal shares by Nutrie...,-6.992681
7,Delta Apparel Inc5,0.928153,The focus on reducing our overall energy inten...,-7.620418
8,DELTA AI8,0.985763,If we or our subsidiaries are unable to reach ...,-7.948834
9,"Goldman Sachs BDC, Inc.48",0.954804,Net Assets\n\n\n \n\n\n\n\n\n\nSoftware\n\n\n ...,-8.603694


In [34]:
ranked_results['text'][5]

'The cement operations of the Building Materials business produce Portland and specialty cements. Similar to aggregates, cement is used in infrastructure projects, nonresidential and residential construction, and the railroad, agricultural, utility and environmental industries. \nConsequently, the cement industry is cyclical and dependent on the strength of the construction sector.\n\xa0\xa0\n\n\nCement consumption is dependent on the time of year and prevalent weather conditions. According to the Portland Cement Association, nearly two-thirds of U.S. cement consumption occurs in the six months between May and October.\xa0\xa0Approximately 70% to 75% of all cement shipments are sent to ready mixed concrete operators. The rest are shipped to manufacturers of concrete-related products, contractors, materials dealers and oil well/mining/drilling companies, as well as government entities.\n\n\n\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSOAR to a Sustainable Future\n\n\n\n\n\n\n\n\nForm 

Hybrid search brought in additional results from companies that did not appear in the vector-only search but have content related to energy, oil, and gas. The re-ranker then ordered the results by relevance.
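Vector scores and full-text scores live on different scales, so they have to be made comparable before they can be merged. One simple approach is to divide each result list by its own top score and keep the best normalized score per document. The sketch below illustrates that idea with made-up scores and document ids; it is an assumption for illustration and not necessarily how the library combines results, although a top hybrid score of exactly 1.0 in the table above is consistent with some form of max normalization.

```python
def normalize_by_max(hits):
    """Scale positive scores so that the best hit in the list gets 1.0."""
    top = max(hits.values())
    return {doc_id: score / top for doc_id, score in hits.items()}

def fuse(vector_hits, keyword_hits, k):
    """Merge two result lists, keeping the best normalized score per document."""
    combined = {}
    for hits in (normalize_by_max(vector_hits), normalize_by_max(keyword_hits)):
        for doc_id, score in hits.items():
            combined[doc_id] = max(score, combined.get(doc_id, 0.0))
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]

# Hypothetical scores: cosine-style values for vector hits,
# unbounded relevance values for full-text hits.
vector_hits = {"doc-1": 0.84, "doc-2": 0.83}
keyword_hits = {"doc-2": 4.0, "doc-3": 5.0}

fused = fuse(vector_hits, keyword_hits, k=3)
for doc_id, score in fused:
    print(f"{score:.3f}\t{doc_id}")
```

Here `doc-3` is found only by the full-text index, yet it ends up tied for first place, which is the kind of additional result the hybrid search contributes.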

## Semantic Search with Multi-hop retrieval using Knowledge Graph

First, let's create a database connection.

In [17]:
db = Neo4jVector(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings()
)

Next, we embed the query and use the resulting vector to search for companies. Remember, companies have multiple documents, so we need a graph traversal on top of the document lookup to find which companies are most similar.

In [18]:
query_vector = db.embedding.embed_query(text=query)
query_vector[0:5]

[0.06058839336037636,
 0.0039505185559391975,
 -0.05093487352132797,
 0.01724359206855297,
 0.0011352039873600006]

In [35]:
# Search for similar companies
res_df = db.query("""
CALL db.index.vector.queryNodes('document-embeddings', 10, $queryVector)
YIELD node AS similarDocuments, score

MATCH (similarDocuments)<-[:HAS]-(c:Company)
RETURN c.companyName as companyName, avg(score) AS score
ORDER BY score DESC LIMIT 10
""", params =  {'queryVector': query_vector})
res_df

[{'companyName': 'ROPER TECHNOLOGIES', 'score': 0.8402605056762695},
 {'companyName': 'Berkshire Hathaway Inc B', 'score': 0.8397833960396902},
 {'companyName': 'Goldman Sachs Grp', 'score': 0.836493730545044},
 {'companyName': 'Johnson &amp; Johnson', 'score': 0.8340243101119995}]

The result above is based on the limited set of documents we have in the database. Each company's score is the average similarity score across its matching chunks.
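The averaging step can be reproduced by hand. The sketch below takes the chunk scores from the vector-only results table earlier in this notebook and averages them per company. The mapping from document ids to a company key is an assumption made for illustration; in the database it comes from the `HAS` relationship.

```python
from collections import defaultdict

# (company, chunk score) pairs copied from the vector-only results table.
# The company keys are illustrative labels, not values read from the database.
chunk_scores = [
    ("Berkshire Hathaway", 0.847954),
    ("Berkshire Hathaway", 0.84245),
    ("Berkshire Hathaway", 0.841626),
    ("Berkshire Hathaway", 0.839466),
    ("Berkshire Hathaway", 0.837863),
    ("Berkshire Hathaway", 0.835542),
    ("Berkshire Hathaway", 0.833583),
    ("ROPER TECHNOLOGIES", 0.840261),
]

def average_by_company(pairs):
    """Average the chunk scores of each company, like avg(score) in the Cypher query."""
    totals = defaultdict(lambda: [0.0, 0])  # company -> [sum of scores, count]
    for company, score in pairs:
        totals[company][0] += score
        totals[company][1] += 1
    return {company: total / count for company, (total, count) in totals.items()}

averages = average_by_company(chunk_scores)
for company, score in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.6f}\t{company}")
```

This also explains the ordering above: a company with a single well-matching chunk can outrank one whose best chunk scores higher but whose other chunks pull its average down.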

Now let's take this one step further and find the investment managers who are most heavily focused on energy. This involves a bit more Cypher for a 2-hop traversal: each manager's score is the average of the company scores, weighted by the value of each holding.

In [36]:
# Search for managers with significant investments in the area
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
WITH c, avg(score) AS score ORDER BY score DESC LIMIT 100
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, score*r.value as weightedScore
WITH m.managerName AS managerName, sum(weightedScore) AS aggScore, sum(value) AS aggValue
RETURN managerName, aggScore/aggValue AS score ORDER BY score DESC LIMIT 1000

""", params =  {'queryVector': query_vector})
res_df

Unnamed: 0,managerName,score
0,Rempart Asset Management Inc.,0.801099
1,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",0.800587
2,TIGER MANAGEMENT L.L.C.,0.796342
3,"Emery Howard Portfolio Management, Inc.",0.794092
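The manager score in the query above is a value-weighted average: each company's similarity score is multiplied by the value of the manager's position, and the sum is divided by the total value. A minimal sketch of that arithmetic follows, with hypothetical holdings; the function name and the numbers are made up for illustration.

```python
def value_weighted_score(holdings):
    """holdings: list of (company_score, position_value) pairs for one manager.

    Mirrors sum(score * value) / sum(value) from the Cypher query.
    """
    total_value = sum(value for _, value in holdings)
    return sum(score * value for score, value in holdings) / total_value

# Hypothetical manager with a large position in a closely matching company
# and a smaller position in a weaker match.
holdings = [(0.84, 3_000_000.0), (0.80, 1_000_000.0)]

score = value_weighted_score(holdings)
print(f"{score:.4f}")
```

Weighting by position value means a manager's score is driven by where most of the money sits rather than by how many matching companies appear in the portfolio.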


In [39]:
top_mgr = res_df['managerName'][0]
top_mgr

'Rempart Asset Management Inc.'

## Expanding Available Data for Knowledge Retrieval
Not every element in your data will have rich text data. Just as we only have 10-K documents for some companies, your use cases may also have incomplete, unevenly distributed text data.

We can check this for our top-ranked investment manager.

In [40]:
gds.run_cypher('''
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)-[:HAS]->(d:Document)
WITH m, count(DISTINCT c) AS ownedCompaniesWithDocs
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)
RETURN m.managerName AS managerName, ownedCompaniesWithDocs, count(DISTINCT c) AS totalOwnedCompanies
''', params =  {'managerName': top_mgr})

Unnamed: 0,managerName,ownedCompaniesWithDocs,totalOwnedCompanies
0,Rempart Asset Management Inc.,17,41


Of the 41 companies this manager owns, only 17 have documents. We can use Graph Data Science Node Similarity to find the managers whose holdings overlap most with this one, which should surface other energy-related results that we missed due to sparse text data.

In [43]:
g, _ = gds.graph.project('proj', ['Company', 'Manager'], {'OWNS':{'properties':['value']}})

In [44]:
gds.nodeSimilarity.write(g, writeRelationshipType = 'SIMILAR', writeProperty = 'score', relationshipWeightProperty = 'value')

preProcessingMillis                                                       0
computeMillis                                                             3
writeMillis                                                              20
postProcessingMillis                                                     -1
nodesCompared                                                             4
relationshipsWritten                                                     10
similarityDistribution    {'min': 6.620865315198898e-05, 'p5': 6.6208653...
configuration             {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object
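For intuition about the scores written to the `SIMILAR` relationships: as I understand the GDS documentation, Node Similarity uses Jaccard similarity by default, and when a `relationshipWeightProperty` is supplied it uses the weighted form, the sum of per-company minimum weights divided by the sum of per-company maximum weights. The sketch below computes that quantity for two hypothetical managers; the names and values are made up.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two {company: position value} dicts."""
    companies = set(a) | set(b)
    shared = sum(min(a.get(c, 0.0), b.get(c, 0.0)) for c in companies)
    total = sum(max(a.get(c, 0.0), b.get(c, 0.0)) for c in companies)
    return shared / total if total else 0.0

# Hypothetical holdings: the two managers only share "Company B".
manager_1 = {"Company A": 100.0, "Company B": 50.0}
manager_2 = {"Company B": 50.0, "Company C": 200.0}

similarity = weighted_jaccard(manager_1, manager_2)
print(f"{similarity:.4f}")
```

Because large non-overlapping positions inflate the denominator, managers with big holdings that share only a few companies can end up with very small scores.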

In [45]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                              247
relationshipCount                                                      172
configuration            {'relationshipProjection': {'OWNS': {'aggregat...
density                                                           0.002831
creationTime                           2023-10-04T06:46:59.593541431+00:00
modificationTime                       2023-10-04T06:46:59.602932015+00:00
schema                   {'graphProperties': {}, 'nodes': {'Manager': {...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Manager': {...
Name: 0, dtype: object

Now we can pull back other relevant results.

In [46]:
gds.run_cypher('''
MATCH (m0:Manager {managerName: $managerName})-[r:SIMILAR]->(m:Manager)
RETURN DISTINCT m.managerName AS managerName, r.score AS score
ORDER BY score DESC LIMIT 10
''', params =  {'managerName': top_mgr})

Unnamed: 0,managerName,score
0,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",0.02311
1,TIGER MANAGEMENT L.L.C.,0.000284


## Clean Up

In [47]:
gds.run_cypher('MATCH (M:Manager)-[s:SIMILAR]->() DELETE s')