# Semantic Search
Semantic search is [defined as "search with meaning."](https://en.wikipedia.org/wiki/Semantic_search) It is key for effective knowledge retrieval.

Unlike traditional lexical search, which finds matches based on keywords, semantic search seeks to improve search quality and accuracy by understanding search intent and returning results that match the user's contextual meaning. The term is often used to refer to text embeddings and vector similarity search, but that is just one implementation aspect of it. Knowledge graphs and symbolic query logic can also play a critical role in making semantic search a reality.

If all you care about is analyzing a set of documents on a file system, then vector indexing and search may be sufficient. However, once you need to retrieve and make inferences about the people, places, and things connected to those documents, a knowledge graph becomes key.

If documents are the entities of interest, for example "find all documents that talk about pharma-related things," then text embeddings with vector similarity search suffice. But what if we want second- or third-order entities related to the documents? For example, given the request "find investors who are most focused on pharma-related strategies," how would we efficiently search for them at scale in an enterprise setting?

This is what we demonstrate below.  We will also show how you can use graph relationships and Graph Data Science algorithms to further improve search results, especially in common scenarios where the presence of text data is inconsistent or sparse.
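Before connecting to a database, it may help to see what "vector similarity search" means in miniature. The sketch below is only an illustration: it ranks a few documents by cosine similarity using exact, brute-force comparison, whereas a vector index would use an approximate method. The three-dimensional vectors and document names are hypothetical stand-ins for real text embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Exact (brute-force) nearest neighbours: score every document, keep the best k."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 3-d "embeddings"; real text embeddings have hundreds of dimensions.
toy_docs = {
    "pharma_report": [0.9, 0.1, 0.0],
    "mining_report": [0.1, 0.9, 0.1],
    "retail_report": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # pretend this encodes "pharma related things"

top_hits = top_k(toy_query, toy_docs, k=2)
for doc_id, score in top_hits:
    print(f"{doc_id}\t{score:.3f}")
```

This answers the first-order question (which documents are similar); it says nothing yet about the companies or investors connected to those documents.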

## Connect to Neo4j

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match
NEO4J_URI = 'neo4j+s://xxxxx.databases.neo4j.io'
NEO4J_PASSWORD = 'password'
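If you would rather not keep credentials in the notebook, one option is to read them from environment variables. This is a minimal sketch, not part of the original setup: the function name `load_neo4j_settings` and the environment variable names are hypothetical, and the fallbacks are simply the placeholder values shown above.

```python
import os

def load_neo4j_settings(env=None):
    """Read connection settings from a mapping (defaults to os.environ).

    The variable names below are assumptions for illustration; use whatever
    names your deployment defines.
    """
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "neo4j+s://xxxxx.databases.neo4j.io"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
    }

# Passing a plain dict makes the behaviour easy to check without touching the real environment.
settings = load_neo4j_settings({"NEO4J_URI": "neo4j+s://example.invalid"})
print(settings["uri"], settings["username"])
```

The returned values can then be assigned to `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` before connecting.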

In [5]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

## Neo4j Vector Index
We need a vector index for similarity search on Document nodes. Neo4j offers a vector index that enables approximate nearest neighbor (ANN) search. Let's create one.

In [None]:
gds.run_cypher("CALL db.index.vector.createNodeIndex('document-embeddings', 'Document', 'textEmbedding', 768, 'cosine')")

You can confirm that the vector index has been created using `SHOW INDEXES`.

In [8]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "VECTOR"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,document-embeddings,VECTOR,[Document],[textEmbedding],"{'indexProvider': 'vector-1.0', 'indexConfig':..."


## Deep Semantic Search with Knowledge Graph
Now that we have an index, let's put it into action.
In this case, we will answer the question: "Which investors are most focused on pharma and healthcare?"
Remember that we do not have documents on investment managers, just on companies, and there can be multiple documents for each company.

### Vector Search with Neo4j

In [283]:
import pandas as pd
from langchain.docstore.document import Document
from typing import (
    List,
    Tuple,
)
import json

def to_df(results: List[Tuple[Document, float]]) -> pd.DataFrame:
    # Flatten (Document, score) pairs returned by the vector store into a DataFrame
    return pd.DataFrame({
        "doc_id": [r[0].metadata.get('documentId') for r in results],
        "score": [r[1] for r in results],
        "text": [r[0].page_content for r in results]
    })

In [325]:
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain.vectorstores.neo4j_vector import SearchType
import pandas as pd

db = Neo4jVector.from_existing_index(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    embedding=VertexAIEmbeddings(),
    index_name='document-embeddings',
    keyword_index_name='doc_text',
    search_type=SearchType.VECTOR
)
query = 'What are the companies associated with nickel, lithium or cobalt?'
results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,BERKSHIRE HATHAWAY INC40,0.849097,Lubrizol \n\n\nThe Lubrizol Corporation (“Lubr...
1,"Goldman Sachs BDC, Inc.34",0.847477,Conergy Asia & ME Pte. LTD.\n\n\nConstruction ...
2,"STANLEY BLACK & DECKER, INC45",0.841106,"In addition, many of the Company’s products in..."
3,BERKSHIRE HATHAWAY INC25,0.840315,The Great Britain distribution companies consi...
4,ROPER TECHNOLOGIES I5,0.838356,"FTI\n - provides flow meter calibrators, and c..."
5,BERKSHIRE HATHAWAY INC46,0.83734,Transportation Products\n serves the automotiv...
6,BERKSHIRE HATHAWAY INC62,0.837216,Forest River is subject to regulations of the ...
7,General motors co16,0.836687,"In some instances, we purchase systems, compon..."
8,"Goldman Sachs BDC, Inc.19",0.836241,6.75%\n\n\nL + 5.75%\n\n\n1.00%\n\n\n04/02/25\...
9,Martin Marietta Materia14,0.835686,The Magnesia Specialties business produces and...


In [318]:
res_df['text'][0]

'Lubrizol \n\n\nThe Lubrizol Corporation (“Lubrizol”) is a specialty chemical and performance materials company that manufactures products and supplies technologies for the global transportation, industrial and consumer markets. Lubrizol currently operates two business segments: Lubrizol Additives, which produces engine lubricant additives, driveline lubricant additives and industrial specialties products; and Lubrizol Advanced Materials, which includes engineered materials (engineered polymers and performance coatings) and life sciences (beauty and personal care, and health and home care solutions). \n\n\nLubrizol Additives’ products are used in a broad range of applications including engine oils, transmission fluids, gear oils, specialty driveline lubricants, fuels, metalworking fluids and compressor lubricants for transportation and industrial applications. Lubrizol Advanced Materials’ products are used in many different types of applications including beauty, personal care, home ca

### Re-rank results

In [313]:
def rerank_results(res_df):
    # Imported here so the model library is only loaded when re-ranking is used
    from sentence_transformers import CrossEncoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    ##### Re-Ranking #####
    # Score every retrieved passage against the notebook-level `query` with the cross-encoder
    cross_inp = [[query, f"Text: {hit['text']}"] for _, hit in res_df.iterrows()]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder scores (this adds a column to res_df in place)
    res_df['cross-score'] = cross_scores

    # Print every hit in re-ranked order (despite the "Top-3" label, all rows are listed)
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = res_df.sort_values('cross-score', ascending=False).reset_index(drop=True)
    for _, hit in hits.iterrows():
        print("\t{:.3f}\t{}".format(hit['cross-score'], hit['doc_id']))
    return hits
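The function above follows a two-stage pattern: retrieve candidates cheaply, then re-score each (query, passage) pair with a more expensive model. The sketch below illustrates only that pattern and is not the notebook's method: `rerank` and `token_overlap` are hypothetical names, and the token-overlap scorer is a deliberately crude stand-in for the cross-encoder so the example runs without downloading a model. It also passes the query explicitly rather than reading a global.

```python
from typing import Callable, Dict, List

def rerank(query: str, hits: List[Dict], score_fn: Callable[[str, str], float]) -> List[Dict]:
    """Re-score each hit against the query and return new dicts sorted best-first."""
    scored = [dict(hit, rerank_score=score_fn(query, hit["text"])) for hit in hits]
    return sorted(scored, key=lambda h: h["rerank_score"], reverse=True)

def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0

# Hypothetical first-stage hits, in their original retrieval order.
toy_hits = [
    {"doc_id": "a", "text": "engine lubricant additives"},
    {"doc_id": "b", "text": "raw materials such as cobalt and lithium"},
]
reranked = rerank("cobalt lithium supply", toy_hits, token_overlap)
for hit in reranked:
    print(f"{hit['rerank_score']:.3f}\t{hit['doc_id']}")
```

Because `rerank` builds new dictionaries, the original hit list is left unchanged, which makes it easier to compare the two orderings side by side.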

In [315]:
ranked_results = rerank_results(res_df)
ranked_results

Top-3 Cross-Encoder Re-ranker hits
	-4.094	STANLEY BLACK & DECKER, INC45
	-8.700	Goldman Sachs BDC, Inc.19
	-9.284	General motors co16
	-9.936	BERKSHIRE HATHAWAY INC25
	-10.056	BERKSHIRE HATHAWAY INC62
	-10.341	BERKSHIRE HATHAWAY INC40
	-10.517	BERKSHIRE HATHAWAY INC46
	-10.604	Goldman Sachs BDC, Inc.34
	-11.034	Martin Marietta Materia14
	-11.184	ROPER TECHNOLOGIES I5


Unnamed: 0,doc_id,score,text,cross-score
0,"STANLEY BLACK & DECKER, INC45",0.841106,"In addition, many of the Company’s products in...",-4.094014
1,"Goldman Sachs BDC, Inc.19",0.836241,6.75%\n\n\nL + 5.75%\n\n\n1.00%\n\n\n04/02/25\...,-8.700157
2,General motors co16,0.836687,"In some instances, we purchase systems, compon...",-9.283574
3,BERKSHIRE HATHAWAY INC25,0.840315,The Great Britain distribution companies consi...,-9.936461
4,BERKSHIRE HATHAWAY INC62,0.837216,Forest River is subject to regulations of the ...,-10.055683
5,BERKSHIRE HATHAWAY INC40,0.849097,Lubrizol \n\n\nThe Lubrizol Corporation (“Lubr...,-10.340531
6,BERKSHIRE HATHAWAY INC46,0.83734,Transportation Products\n serves the automotiv...,-10.517445
7,"Goldman Sachs BDC, Inc.34",0.847477,Conergy Asia & ME Pte. LTD.\n\n\nConstruction ...,-10.604258
8,Martin Marietta Materia14,0.835686,The Magnesia Specialties business produces and...,-11.033838
9,ROPER TECHNOLOGIES I5,0.838356,"FTI\n - provides flow meter calibrators, and c...",-11.184478


In [317]:
ranked_results['text'][0]

"In addition, many of the Company’s products incorporate battery technology. As the world moves towards a lower-carbon economy and as other industries begin to adopt similar battery technology for use in their products or increase their current consumption of battery technology, the increased demand could place capacity constraints on the Company’s supply chain. In addition, increased demand for battery technology may also increase the costs to the Company for both the battery cells as well as the underlying raw materials such as cobalt and lithium, among others. If the Company is unable to mitigate any possible supply constraints, related increased costs or drive alternative technology through innovation, its profitably and financial results could be negatively impacted.\nUncertainty about the financial stability of economies outside the U.S. could have a significant adverse effect on the Company's business, results of operations and financial condition. \n15\nThe Company generates ap

### Hybrid Search with Neo4j

In [228]:
gds.run_cypher("CREATE FULLTEXT INDEX doc_text IF NOT EXISTS FOR (n:Document) ON EACH [n.text]")

In [229]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "FULLTEXT"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,doc_text,FULLTEXT,[Document],[text],"{'indexProvider': 'fulltext-1.0', 'indexConfig..."


In [319]:
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain.vectorstores.neo4j_vector import SearchType

db = Neo4jVector.from_existing_index(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    embedding=VertexAIEmbeddings(),
    index_name='document-embeddings',
    keyword_index_name='doc_text',
    search_type=SearchType.HYBRID
)

results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,"Goldman Sachs BDC, Inc.50",1.0,The accompanying notes are part of these conso...
1,"STANLEY BLACK & DECKER, INC10",0.998252,Raw Materials\nThe Company’s products are manu...
2,BERKSHIRE HATHAWAY INC36,0.916378,Precision Castparts Corp. (“PCC”) manufactures...
3,"Goldman Sachs BDC, Inc.19",0.915651,6.75%\n\n\nL + 5.75%\n\n\n1.00%\n\n\n04/02/25\...
4,"APPLIED OPTOELECTRONICS, INC.55",0.89173,‑\n \n\n\n \ndifficulties integrating the busi...
5,BERKSHIRE HATHAWAY INC40,0.849097,Lubrizol \n\n\nThe Lubrizol Corporation (“Lubr...
6,"Goldman Sachs BDC, Inc.34",0.847477,Conergy Asia & ME Pte. LTD.\n\n\nConstruction ...
7,"STANLEY BLACK & DECKER, INC45",0.841106,"In addition, many of the Company’s products in..."
8,BERKSHIRE HATHAWAY INC25,0.840315,The Great Britain distribution companies consi...
9,ROPER TECHNOLOGIES I5,0.838356,"FTI\n - provides flow meter calibrators, and c..."
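The scores in this table show a pattern worth noting: the top hit scores exactly 1.0, while several rows (for example `BERKSHIRE HATHAWAY INC40` at 0.849097) keep the same score they had in the vector-only run. That is consistent with a merge in which keyword scores are divided by the best keyword score and each document keeps the higher of its two scores. The sketch below illustrates that merge strategy under this assumption; it is not confirmed by the notebook to be the library's exact logic, and `hybrid_merge` and the toy scores are hypothetical.

```python
def hybrid_merge(vector_hits, keyword_hits, k):
    """Merge two ranked lists of (doc_id, score) pairs.

    Assumed strategy: max-normalize keyword scores to [0, 1], keep vector
    scores as they are, and give each document the higher of its scores.
    """
    top_keyword = max((score for _, score in keyword_hits), default=0.0)
    merged = {}
    for doc_id, score in vector_hits:
        merged[doc_id] = max(merged.get(doc_id, 0.0), score)
    for doc_id, score in keyword_hits:
        normalized = score / top_keyword if top_keyword > 0 else 0.0
        merged[doc_id] = max(merged.get(doc_id, 0.0), normalized)
    ranked = sorted(merged.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical scores: cosine similarities and raw full-text relevance scores.
toy_vector_hits = [("doc_a", 0.85), ("doc_b", 0.84)]
toy_keyword_hits = [("doc_c", 7.0), ("doc_b", 3.5)]

hybrid_hits = hybrid_merge(toy_vector_hits, toy_keyword_hits, k=3)
for doc_id, score in hybrid_hits:
    print(f"{doc_id}\t{score:.3f}")
```

Under this strategy the best keyword match always receives a score of 1.0, regardless of how relevant it is in absolute terms, which is one reason a re-ranking step remains useful after a hybrid search.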


In [321]:
res_df['text'][0]

'The accompanying notes are part of these consolidated financial statements.\n\n\n92\n\n\n\xa0\n\n\n\xa0\n\n\n\n\nTable of Contents\n\n\nGoldman Sachs BDC, Inc.\n\n\nConsolidated Schedule of Investments as of \nDecember 31\n, 2020 (continued)\n\n\n(in thousands, except share and per share amounts)\n\n\n\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nInvestment *\n\n\nIndustry\n\n\nInterest\nRate (+)\n\n\nReference Rate\n \nand Spread (+)\n\n\nFloor (+)\n\n\n \n\xa0\n\n\nMaturity\n\n\nPar\n(++)\n\n\n \n\xa0\n\n\nCost\n\n\n \n\xa0\n\n\nFair Value\n\n\n \n\xa0\n\n\nFootnotes\n\n\n\n\n\n\nIracore International Holdings, Inc.\n\n\nEnergy Equipment & Services\n\n\n10.00%\n\n\nL + 9.00%\n\n\n1.00%\n\n\n \n\xa0\n\n\n04/12/21\n\n\n$\n\n\n2,361\n\n\n \n\xa0\n\n\n$\n\n\n2,361\n\n\n \n\xa0\n\n\n$\n\n\n2,361\n\n\n \n\xa0\n\n\n^ (4)\n\n\n\n\n\n\nJill Acquisition LLC (dba J. Jill)\n\n\nSpecialty Retail\n\n\n6.00%\n\n\nL + 5.00%\n\n\n1.00%\n\n\n \n\xa0

In [320]:
ranked_results = rerank_results(res_df)
ranked_results

Top-3 Cross-Encoder Re-ranker hits
	-3.510	BERKSHIRE HATHAWAY INC36
	-3.588	STANLEY BLACK & DECKER, INC10
	-4.094	STANLEY BLACK & DECKER, INC45
	-8.700	Goldman Sachs BDC, Inc.19
	-9.374	Goldman Sachs BDC, Inc.50
	-9.936	BERKSHIRE HATHAWAY INC25
	-10.341	BERKSHIRE HATHAWAY INC40
	-10.604	Goldman Sachs BDC, Inc.34
	-11.184	ROPER TECHNOLOGIES I5
	-11.215	APPLIED OPTOELECTRONICS, INC.55


Unnamed: 0,doc_id,score,text,cross-score
0,BERKSHIRE HATHAWAY INC36,0.916378,Precision Castparts Corp. (“PCC”) manufactures...,-3.509796
1,"STANLEY BLACK & DECKER, INC10",0.998252,Raw Materials\nThe Company’s products are manu...,-3.587836
2,"STANLEY BLACK & DECKER, INC45",0.841106,"In addition, many of the Company’s products in...",-4.094014
3,"Goldman Sachs BDC, Inc.19",0.915651,6.75%\n\n\nL + 5.75%\n\n\n1.00%\n\n\n04/02/25\...,-8.700157
4,"Goldman Sachs BDC, Inc.50",1.0,The accompanying notes are part of these conso...,-9.373708
5,BERKSHIRE HATHAWAY INC25,0.840315,The Great Britain distribution companies consi...,-9.936461
6,BERKSHIRE HATHAWAY INC40,0.849097,Lubrizol \n\n\nThe Lubrizol Corporation (“Lubr...,-10.340531
7,"Goldman Sachs BDC, Inc.34",0.847477,Conergy Asia & ME Pte. LTD.\n\n\nConstruction ...,-10.604258
8,ROPER TECHNOLOGIES I5,0.838356,"FTI\n - provides flow meter calibrators, and c...",-11.184478
9,"APPLIED OPTOELECTRONICS, INC.55",0.89173,‑\n \n\n\n \ndifficulties integrating the busi...,-11.215431


In [322]:
ranked_results['text'][0]

'Precision Castparts Corp. (“PCC”) manufactures complex metal components and products, provides high-quality investment castings, forgings, fasteners/fastener systems and aerostructures for critical aerospace and power and energy applications. PCC also manufactures seamless pipe for coal-fired, industrial gas turbine (“IGT”) and nuclear power plants; downhole casing and tubing, fittings and various mill forms in a variety of nickel and steel alloys for severe-service oil and gas environments; investment castings and forgings for general industrial, armament, medical and other applications; nickel and titanium alloys in all standard mill forms from large ingots and billets to plate, foil, sheet, strip, tubing, bar, rod, extruded shapes, rod-in-coil, wire and welding consumables, as well as cobalt alloys, for the aerospace, chemical processing, oil and gas, pollution control and other industries; revert management solutions; fasteners for automotive and general industrial markets; specia

### Semantic Search with Knowledge Graph

In [326]:
# Create a Neo4jVector handle that we can use to embed queries and run Cypher directly
db = Neo4jVector(
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    embedding=VertexAIEmbeddings()
)

Now let's embed a query and use the resulting vector to search for companies. Remember, companies have multiple documents, so we need a graph traversal on top of the document lookup to find which companies are most similar.

In [365]:
query_vector = db.embedding.embed_query(text='pharma, healthcare, biotech related companies')
query_vector[0:5]

[0.018738236278295517,
 -0.04106052964925766,
 -0.026169316843152046,
 0.05331621691584587,
 0.0431697852909565]

In [366]:
# Search for similar companies
res_df = db.query("""
CALL db.index.vector.queryNodes('document-embeddings', 10, $queryVector)
YIELD node AS similarDocuments, score

MATCH (similarDocuments)<-[:HAS]-(c:Company)
RETURN c.companyName as companyName, avg(score) AS score
ORDER BY score DESC LIMIT 10
""", params =  {'queryVector': query_vector})
res_df

[{'companyName': 'UnitedHealth Group', 'score': 0.8696630597114563},
 {'companyName': 'Goldman Sachs Grp', 'score': 0.8675507545471192},
 {'companyName': 'JOHNSON & JOHNSON', 'score': 0.8666969537734985},
 {'companyName': 'BERKSHIRE HATHAWAY CLASS B', 'score': 0.864528238773346},
 {'companyName': 'ROPER TECHNOLOGIES', 'score': 0.8645270466804504}]
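The Cypher above does two things: it looks up similar documents, then follows `HAS` relationships back to companies and averages the document scores per company. A minimal pure-Python sketch of that roll-up, assuming each document belongs to exactly one company, may make the aggregation clearer. The function name, document ids, and scores are hypothetical.

```python
from collections import defaultdict

def average_score_by_company(doc_hits, doc_to_company):
    """Group (doc_id, score) hits by owning company and average the scores."""
    totals = defaultdict(lambda: [0.0, 0])  # company -> [score sum, hit count]
    for doc_id, score in doc_hits:
        company = doc_to_company[doc_id]
        totals[company][0] += score
        totals[company][1] += 1
    averaged = {company: total / count for company, (total, count) in totals.items()}
    return sorted(averaged.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity hits and document ownership.
toy_doc_hits = [("d1", 0.90), ("d2", 0.80), ("d3", 0.86)]
toy_doc_to_company = {"d1": "CompanyA", "d2": "CompanyA", "d3": "CompanyB"}

company_scores = average_score_by_company(toy_doc_hits, toy_doc_to_company)
for company, score in company_scores:
    print(f"{company}\t{score:.3f}")
```

Only documents returned by the index lookup contribute, which is why a query for 10 documents can yield fewer than 10 companies, as in the five rows above.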

You may recognize some of these companies; if not, a quick Google search will confirm that their business is involved in healthcare and pharma, so this seems to be working.

Now let's take this one step further and find the investment managers who are most heavily focused on pharma. This will involve a bit more Cypher for a 2-hop traversal.

In [367]:
# Search for managers with significant investments in the area
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
WITH c, avg(score) AS score ORDER BY score DESC LIMIT 100
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, score*r.value as weightedScore
WITH m.managerName AS managerName, sum(weightedScore) AS aggScore, sum(value) AS aggValue
RETURN managerName, aggScore/aggValue AS score ORDER BY score DESC LIMIT 1000

""", params =  {'queryVector': query_vector})
res_df

Unnamed: 0,managerName,score
0,Rempart Asset Management Inc.,0.824773
1,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",0.823473
2,"Emery Howard Portfolio Management, Inc.",0.821982
3,TIGER MANAGEMENT L.L.C.,0.819795


And we can see that our top result is `Rempart Asset Management Inc.`
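The manager score returned by the query is `sum(score * value) / sum(value)`, that is, an average of company scores weighted by the value of each holding. The sketch below reproduces that arithmetic on toy data; the function name, manager and company names, scores, and values are all hypothetical.

```python
from collections import defaultdict

def value_weighted_scores(holdings, company_scores):
    """Value-weighted average company score per manager.

    holdings: iterable of (manager, company, value) tuples.
    Holdings in companies with no score are skipped, mirroring the fact
    that the query only traverses from companies that matched the search.
    """
    agg_score = defaultdict(float)
    agg_value = defaultdict(float)
    for manager, company, value in holdings:
        if company not in company_scores:
            continue
        agg_score[manager] += company_scores[company] * value
        agg_value[manager] += value
    results = [(m, agg_score[m] / agg_value[m]) for m in agg_score if agg_value[m] > 0]
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Hypothetical company scores and holdings.
toy_company_scores = {"PharmaCo": 0.9, "SteelCo": 0.7}
toy_holdings = [
    ("ManagerOne", "PharmaCo", 300.0),
    ("ManagerOne", "SteelCo", 100.0),
    ("ManagerTwo", "PharmaCo", 100.0),
    ("ManagerTwo", "SteelCo", 300.0),
    ("ManagerTwo", "NoDocsCo", 1000.0),  # no documents, so it is ignored
]

manager_scores = value_weighted_scores(toy_holdings, toy_company_scores)
for manager, score in manager_scores:
    print(f"{manager}\t{score:.3f}")
```

Note that the large `NoDocsCo` holding has no effect on the result: a manager's score reflects only the part of the portfolio that has matching text.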

## Expanding Available Data for Knowledge Retrieval
Not every element in your data will have rich text attached. Just as we only have 10-K documents for some companies, your use cases may also have incomplete, unevenly distributed text data.

We can check our top-result investment manager to see this.

In [368]:
gds.run_cypher('''
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)-[:HAS]->(d:Document)
WITH m, count(DISTINCT c) AS ownedCompaniesWithDocs
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)
RETURN m.managerName AS managerName, ownedCompaniesWithDocs, count(DISTINCT c) AS totalOwnedCompanies
''', params =  {'managerName':'Rempart Asset Management Inc.'})

Unnamed: 0,managerName,ownedCompaniesWithDocs,totalOwnedCompanies
0,Rempart Asset Management Inc.,17,41


This manager owns more companies without documents (24 of 41) than with them. We can use Graph Data Science Node Similarity to find the managers whose holdings overlap most with this one, which should surface other biotech-related results that we missed due to sparse text data.

In [357]:
g, _  = gds.graph.project('proj', ['Company', 'Manager'], {'OWNS':{'properties':['value']}})

In [358]:
gds.nodeSimilarity.write(g, writeRelationshipType='SIMILAR', writeProperty='score', relationshipWeightProperty='value')

preProcessingMillis                                                       0
computeMillis                                                             3
writeMillis                                                              15
postProcessingMillis                                                     -1
nodesCompared                                                             4
relationshipsWritten                                                     10
similarityDistribution    {'min': 6.620865315198898e-05, 'p5': 6.6208653...
configuration             {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object
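To give a feel for what the `score` property on each `SIMILAR` relationship measures: Node Similarity compares two managers by the overlap of the companies they own, and with a relationship weight it uses a weighted variant of the Jaccard metric (sum of the smaller weights over sum of the larger weights). The sketch below computes that metric on toy portfolios; it assumes the default Jaccard metric and uses hypothetical names and values.

```python
def weighted_jaccard(holdings_a, holdings_b):
    """Weighted Jaccard similarity between two {company: value} portfolios."""
    companies = set(holdings_a) | set(holdings_b)
    overlap = sum(min(holdings_a.get(c, 0.0), holdings_b.get(c, 0.0)) for c in companies)
    total = sum(max(holdings_a.get(c, 0.0), holdings_b.get(c, 0.0)) for c in companies)
    return overlap / total if total > 0 else 0.0

# Hypothetical portfolios: company -> value held.
manager_one = {"CompanyA": 100.0, "CompanyB": 50.0}
manager_two = {"CompanyA": 50.0, "CompanyC": 50.0}

similarity = weighted_jaccard(manager_one, manager_two)
print(f"{similarity:.2f}")
```

Because the metric depends only on `OWNS` relationships and their values, it can relate managers even when none of the companies involved have documents.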

In [359]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                              153
relationshipCount                                                      172
configuration            {'relationshipProjection': {'OWNS': {'aggregat...
density                                                           0.007396
creationTime                           2023-09-26T15:50:35.663277734+00:00
modificationTime                       2023-09-26T15:50:35.671715265+00:00
schema                   {'graphProperties': {}, 'nodes': {'Manager': {...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Manager': {...
Name: 0, dtype: object

Now we can pull back other relevant results.

In [360]:
gds.run_cypher('''
MATCH (m0:Manager {managerName: $managerName})-[r:SIMILAR]->(m:Manager)
RETURN m.managerName AS managerName, r.score AS score
ORDER BY score DESC LIMIT 10
''', params =  {'managerName':'Rempart Asset Management Inc.'})

Unnamed: 0,managerName,score
0,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",0.02311
1,TIGER MANAGEMENT L.L.C.,0.000284


## Clean Up

In [356]:
gds.run_cypher('MATCH (M:Manager)-[s:SIMILAR]->() DELETE s')