# Semantic Search
Semantic search is [defined as "search with meaning."](https://en.wikipedia.org/wiki/Semantic_search) It is key for effective knowledge retrieval.

Unlike traditional lexical search, which finds matches based on keywords, semantic search seeks to improve search quality and accuracy by understanding search intent and returning results that match the user's contextual meaning. The term is often used to refer to text embeddings and vector similarity search, but that is just one way to implement it. Knowledge graphs and symbolic query logic can also play a critical role in making semantic search a reality.
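
To make the contrast concrete, here is a minimal sketch that scores two short texts against a query, first by keyword overlap and then by cosine similarity over embeddings. The texts and the three-dimensional vectors are invented for illustration; a real embedding model would produce the vectors and they would have hundreds of dimensions.

```python
import math

def keyword_overlap(query: str, text: str) -> int:
    """Lexical match: count the distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

toy_query = "oil and gas companies"
toy_docs = {
    "petroleum": "petroleum exploration and refining firm",
    "telecom": "wireless and broadband services for companies",
}

# Made-up toy embeddings (NOT from a real model), chosen so that the
# petroleum text points in roughly the same direction as the query.
toy_vectors = {
    "query": [0.9, 0.1, 0.1],
    "petroleum": [0.8, 0.2, 0.1],
    "telecom": [0.1, 0.9, 0.2],
}

lexical = {name: keyword_overlap(toy_query, text) for name, text in toy_docs.items()}
semantic = {
    name: cosine_similarity(toy_vectors["query"], toy_vectors[name])
    for name in toy_docs
}
```

Keyword overlap prefers the telecom text because it happens to share the word "companies" with the query, while the toy embeddings place the petroleum text closest, which is the behavior semantic search aims for.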

If all you care about is analyzing a set of documents on a file system, then vector indexing and search may be sufficient. However, once you need to retrieve and make inferences about the people, places, and things connected to those documents, a knowledge graph becomes key.

If documents are the entities of interest, as in "find all documents that talk about pharma-related things," then text embeddings with vector similarity search suffice. But what if we want second- or third-order entities related to the documents? For example, given "find investors who are most focused on pharma-related strategies," how would we efficiently search for them at scale in an enterprise setting?

This is what we demonstrate below.  We will also show how you can use graph relationships and Graph Data Science algorithms to further improve search results, especially in common scenarios where the presence of text data is inconsistent or sparse.

## Connect to Neo4j

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://6688b25b.databases.neo4j.io'
NEO4J_PASSWORD = '_kogrNk53u8oTk5be55kmit1kHGdhZj98yJlG-VYSR'
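
Rather than editing credentials into the notebook, one option is to read them from environment variables. The sketch below assumes variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD`, and a helper called `load_neo4j_settings`; these are hypothetical choices for illustration, not something Neo4j requires.

```python
import os

def load_neo4j_settings(env=os.environ):
    """Read connection settings from environment variables.

    The variable names are hypothetical; use whatever fits your deployment.
    """
    uri = env.get("NEO4J_URI")
    password = env.get("NEO4J_PASSWORD")
    if not uri or not password:
        raise RuntimeError("Set NEO4J_URI and NEO4J_PASSWORD before connecting")
    return {
        "uri": uri,
        # Fall back to the default user name when none is provided
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": password,
    }
```

The returned values can then be assigned to `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` so the rest of the notebook runs unchanged.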

In [2]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth = (NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds = True
)

## Neo4j Vector Index
We need a vector index for similarity search on Document nodes. Neo4j offers a vector index that enables approximate nearest neighbor (ANN) search. Let's create one over the `textEmbedding` property, using 768 dimensions and cosine similarity.

In [3]:
gds.run_cypher("CALL db.index.vector.createNodeIndex('document-embeddings', 'Document', 'textEmbedding', 768, 'cosine')")

You can confirm that the vector index has been created using `SHOW INDEXES`.

In [4]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "VECTOR"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,document-embeddings,VECTOR,[Document],[textEmbedding],"{'indexProvider': 'vector-1.0', 'indexConfig':..."


## Deep Semantic Search with Knowledge Graph
Now that we have an index, let's put it to use.
In this case, we will answer the question "What are the companies associated with energy, oil and gas?"
Remember that we do not have documents on investment managers, only on companies, and there can be multiple documents for each company.

## Dense Retrieval using Neo4j

In [5]:
from typing import List, Tuple

import pandas as pd
from langchain.docstore.document import Document

def to_df(results: List[Tuple[Document, float]]) -> pd.DataFrame:
    # Flatten (Document, score) pairs into one row per retrieved chunk
    return pd.DataFrame({
        "doc_id": [r[0].metadata.get('documentId') for r in results],
        "score": [r[1] for r in results],
        "text": [r[0].page_content for r in results]
    })

In [6]:
from langchain.vectorstores import Neo4jVector
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.vectorstores.neo4j_vector import SearchType
import pandas as pd

db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.VECTOR
)
query = 'What are the companies associated with energy, oil and gas?'
results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,General motors co8,0.832602,HYDROTEC\n We are developing hydrogen fuel ce...
1,General motors co0,0.821065,>Item 1. Business \nGeneral Motors Company (so...
2,Verizon Communications INC8,0.817423,Local services\n. We offer an array of local d...
3,General motors co17,0.816134,"In North America, GM Financial offers a sub-pr..."
4,General motors co16,0.815707,"In some instances, we purchase systems, compon..."
5,General motors co2,0.814699,"In September 2021, we announced three new driv..."
6,Schwab Charles Corp0,0.81386,>Item 1. Business\nGeneral Corporate Overv...
7,General motors co12,0.812042,"As discussed above, total vehicle sales and ma..."
8,General motors co31,0.811816,"Through December 31, 2021, we implemented proj..."
9,Verizon Communications INC6,0.811811,Global Enterprise offers a broad portfolio of ...


Let's take a look at the first result and check whether the chunk contains the information we asked for.

In [7]:
res_df['text'][0]

"HYDROTEC\n  We are developing hydrogen fuel cell applications across transportations and industries, including mobile power generation, class 7/8 truck, locomotive, aerospace and marine applications. The development of HYDROTEC is another element of our long-term strategy and commitment toward the reduction of petroleum consumption and GHG emissions. We believe hydrogen fuel cells will play an important role in many automotive and other mobility applications where customers will derive additional benefits from the ability to refuel quickly, an extended range, suitability for heavier payloads and central refueling of large fleets. GM and Honda, through our long-term strategic alliance to collaborate in research and advanced engineering efforts, are developing and commercializing fuel cell systems. In 2021, GM announced it will supply HYDROTEC to Navistar, Inc., which is developing hydrogen-powered heavy trucks to launch in 2024, and to Liebherr-Aerospace, which is developing hydrogen-p

## Re-rank results
If you explore the results, you may find some that are irrelevant. The retriever has to be efficient for large document collections with millions of entries, so it trades some precision for speed and may return irrelevant candidates.
A re-ranker based on a cross-encoder can substantially improve the final results for the user. The query and a candidate document are passed together to a transformer network, which outputs a single relevance score for the pair; higher means more relevant. Depending on the model and its configuration, the score may be squashed into the range 0 to 1 or left as an unbounded logit, which is why the scores printed below are negative.

We will use a cross-encoder model named `ms-marco-MiniLM-L-6-v2` from the SentenceTransformers library.

In [8]:
def rerank_results(res_df):
    from sentence_transformers import CrossEncoder
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    # Score every retrieved passage against the notebook-level `query`
    cross_inp = [[query, f"Text: {hit['text']}"] for _, hit in res_df.iterrows()]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder scores to the results (adds a column to res_df)
    res_df['cross-score'] = cross_scores

    # Sort by cross-encoder score, best first, and print the re-ranked hits
    print("Top-10 Cross-Encoder Re-ranker hits")
    hits = res_df.sort_values('cross-score', ascending=False).reset_index(drop=True)
    for _, hit in hits.iterrows():
        print("\t{:.3f}\t{}".format(hit['cross-score'], hit['doc_id']))
    return hits

In [9]:
ranked_results = rerank_results(res_df)
ranked_results

Top-10 Cross-Encoder Re-ranker hits
	-7.386	General motors co31
	-9.050	General motors co8
	-9.898	General motors co0
	-10.299	Schwab Charles Corp0
	-10.334	General motors co17
	-10.509	General motors co16
	-10.627	General motors co2
	-10.954	Verizon Communications INC6
	-11.068	General motors co12
	-11.102	Verizon Communications INC8


Unnamed: 0,doc_id,score,text,cross-score
0,General motors co31,0.811816,"Through December 31, 2021, we implemented proj...",-7.385767
1,General motors co8,0.832602,HYDROTEC\n We are developing hydrogen fuel ce...,-9.049635
2,General motors co0,0.821065,>Item 1. Business \nGeneral Motors Company (so...,-9.897727
3,Schwab Charles Corp0,0.81386,>Item 1. Business\nGeneral Corporate Overv...,-10.298505
4,General motors co17,0.816134,"In North America, GM Financial offers a sub-pr...",-10.333549
5,General motors co16,0.815707,"In some instances, we purchase systems, compon...",-10.508667
6,General motors co2,0.814699,"In September 2021, we announced three new driv...",-10.627051
7,Verizon Communications INC6,0.811811,Global Enterprise offers a broad portfolio of ...,-10.953615
8,General motors co12,0.812042,"As discussed above, total vehicle sales and ma...",-11.06806
9,Verizon Communications INC8,0.817423,Local services\n. We offer an array of local d...,-11.101725
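
The cross-scores in the table above are negative, which suggests the model is returning raw logits rather than probabilities (an inference from the printed values, not something the notebook states). If a score in the range 0 to 1 is more convenient, a logistic sigmoid maps each logit into that range without changing the ranking. A minimal sketch using the three highest cross-scores from the table:

```python
import math

def sigmoid(logit: float) -> float:
    """Map an unbounded logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Top three cross-scores copied from the re-ranked table above
cross_scores = {
    "General motors co31": -7.385767,
    "General motors co8": -9.049635,
    "General motors co0": -9.897727,
}

probabilities = {doc_id: sigmoid(score) for doc_id, score in cross_scores.items()}

# Sigmoid is monotonic, so sorting by probability keeps the logit ordering
ranking = sorted(probabilities, key=probabilities.get, reverse=True)
```

All three values come out below 0.001. Read this way, the re-ranker does not consider any of the retrieved chunks a strong answer to the question; it only orders them relative to one another.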


As seen above, the cross-encoder ranks the following passage as the most relevant to the query:

In [10]:
ranked_results['text'][0]

'Through December 31, 2021, we implemented projects and signed renewable energy contracts globally that brought our total renewable energy capacity to over one gigawatt by 2023, which represents approximately 75% of our U.S. electricity use and approximately 40% of our global electricity use. In 2019 and 2020, we executed two of our largest green tariffs to date with DTE Energy Company, sourcing 840,000 megawatt hours of renewable energy that began supplying us in early 2021 in phase 1 with the remainder expected in mid-2023 in phase 2. Additionally, in 2020 we executed our largest power purchase agreement to date, with 180 megawatts of solar electricity supplying our U.S. operations starting in 2023. We continue to seek opportunities for a diversified renewable energy portfolio including wind, solar and landfill gas, including executing a 28 megawatts solar green tariff with TVA to supply our Bowling Green Assembly Plant. In 2021, Energy Star certified two assembly plants and four bui

## Hybrid Search with Neo4j
Vector search is not the only option. Often a hybrid approach, combining vector and full-text search, performs better. You can do both in Neo4j.

To do a full-text search, let's first create a full-text index.

In [11]:
gds.run_cypher("CREATE FULLTEXT INDEX doc_text IF NOT EXISTS FOR (n:Document) ON EACH [n.text]")

In [12]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "FULLTEXT"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,doc_text,FULLTEXT,[Document],[text],"{'indexProvider': 'fulltext-1.0', 'indexConfig..."


In [13]:
from langchain.embeddings.vertexai import VertexAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Neo4jVector
from langchain.vectorstores.neo4j_vector import SearchType

db = Neo4jVector.from_existing_index(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings(),
    index_name = 'document-embeddings',
    keyword_index_name = 'doc_text',
    search_type = SearchType.HYBRID # Hybrid search type
)

results = db.similarity_search_with_score(query, k=10)
res_df = to_df(results)
res_df

Unnamed: 0,doc_id,score,text
0,General motors co30,1.0,"In 2021, we launched our new global GM Zero Wa..."
1,General motors co8,0.832602,HYDROTEC\n We are developing hydrogen fuel ce...
2,General motors co0,0.821065,>Item 1. Business \nGeneral Motors Company (so...
3,Verizon Communications INC8,0.817423,Local services\n. We offer an array of local d...
4,General motors co17,0.816134,"In North America, GM Financial offers a sub-pr..."
5,General motors co16,0.815707,"In some instances, we purchase systems, compon..."
6,General motors co2,0.814699,"In September 2021, we announced three new driv..."
7,Schwab Charles Corp0,0.81386,>Item 1. Business\nGeneral Corporate Overv...
8,General motors co12,0.812042,"As discussed above, total vehicle sales and ma..."
9,General motors co31,0.811816,"Through December 31, 2021, we implemented proj..."
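
The score of exactly 1.0 on the first row is what you would expect if full-text scores are divided by their maximum before being merged with the vector scores. That is an assumption about how the two result sets are combined, not something this notebook confirms. The sketch below shows one such fusion rule in plain Python; `fuse_scores` and the keyword scores are hypothetical.

```python
def fuse_scores(vector_hits, keyword_hits, k=10):
    """Merge two {doc_id: score} result sets into one ranked list.

    Keyword scores are divided by their maximum so they land in [0, 1],
    then each document keeps the higher of its two scores.
    """
    fused = dict(vector_hits)
    max_keyword = max(keyword_hits.values(), default=0.0)
    for doc_id, score in keyword_hits.items():
        normalized = score / max_keyword if max_keyword > 0 else 0.0
        fused[doc_id] = max(fused.get(doc_id, 0.0), normalized)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]

# Vector scores copied from the table above; keyword scores are made up
vector_hits = {"General motors co8": 0.832602, "General motors co0": 0.821065}
keyword_hits = {"General motors co30": 4.2, "General motors co8": 2.1}

fused = fuse_scores(vector_hits, keyword_hits)
```

Under this rule the best keyword match always scores 1.0 and so lands at the top of the merged list, even if it was absent from the vector results. This is one reason a re-ranking step after hybrid retrieval is useful.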


Let's re-rank our results for relevance.

In [14]:
ranked_results = rerank_results(res_df)
ranked_results

Top-10 Cross-Encoder Re-ranker hits
	-7.386	General motors co31
	-9.050	General motors co8
	-9.898	General motors co0
	-10.088	General motors co30
	-10.299	Schwab Charles Corp0
	-10.334	General motors co17
	-10.509	General motors co16
	-10.627	General motors co2
	-11.068	General motors co12
	-11.102	Verizon Communications INC8


Unnamed: 0,doc_id,score,text,cross-score
0,General motors co31,0.811816,"Through December 31, 2021, we implemented proj...",-7.385767
1,General motors co8,0.832602,HYDROTEC\n We are developing hydrogen fuel ce...,-9.049635
2,General motors co0,0.821065,>Item 1. Business \nGeneral Motors Company (so...,-9.897727
3,General motors co30,1.0,"In 2021, we launched our new global GM Zero Wa...",-10.088181
4,Schwab Charles Corp0,0.81386,>Item 1. Business\nGeneral Corporate Overv...,-10.298505
5,General motors co17,0.816134,"In North America, GM Financial offers a sub-pr...",-10.333549
6,General motors co16,0.815707,"In some instances, we purchase systems, compon...",-10.508667
7,General motors co2,0.814699,"In September 2021, we announced three new driv...",-10.627051
8,General motors co12,0.812042,"As discussed above, total vehicle sales and ma...",-11.06806
9,Verizon Communications INC8,0.817423,Local services\n. We offer an array of local d...,-11.101725


In [15]:
ranked_results['text'][4]

'>Item 1. \xa0\xa0\xa0\xa0Business\nGeneral Corporate Overview\nThe Charles Schwab Corporation (CSC) is a savings and loan holding company. CSC engages, through its subsidiaries (collectively referred to as Schwab or the Company), in wealth management, securities brokerage, banking, asset management, custody, and financial advisory services. At December\xa031, 2021, Schwab had $8.14 trillion\xa0in client assets,\xa033.2 million active brokerage accounts, 2.2 million corporate retirement plan participants, and 1.6 million banking accounts.\nPrincipal business subsidiaries of CSC include the following:\n•\nCharles Schwab & Co., Inc. (CS&Co), incorporated in 1971, a securities broker-dealer;\n•\nTD Ameritrade, Inc., an introducing securities broker-dealer;\n•\nTD Ameritrade Clearing, Inc. (TDAC), a securities broker-dealer that provides trade execution and clearing services to TD Ameritrade, Inc.; \n•\nCharles Schwab Bank, SSB (CSB), our principal banking entity; and\n•\nCharles Schwab In

We don't have many chunked documents from other companies. Once you do, hybrid search can surface additional results, and the re-ranker can then help order them.

## Semantic Search with Multi-hop retrieval using Knowledge Graph

First, let's create a database connection.

In [16]:
db = Neo4jVector(
    url = NEO4J_URI,
    username = NEO4J_USERNAME,
    password = NEO4J_PASSWORD,
    embedding = VertexAIEmbeddings()
)

Now let's embed the query and use the resulting vector to search for companies. Remember that companies have multiple documents, so we need a graph traversal on top of the document lookup to find which companies are most similar.

In [17]:
query_vector = db.embedding.embed_query(text=query)
query_vector[0:5]

[0.06058839336037636,
 0.0039505185559391975,
 -0.05093487352132797,
 0.01724359206855297,
 0.0011352039873600006]

In [18]:
# Search for similar companies
res_df = db.query("""
CALL db.index.vector.queryNodes('document-embeddings', 10, $queryVector)
YIELD node AS similarDocuments, score

MATCH (similarDocuments)<-[:HAS]-(c:Company)
RETURN c.companyName as companyName, avg(score) AS score
ORDER BY score DESC LIMIT 10
""", params =  {'queryVector': query_vector})
res_df

[{'companyName': 'General Electric Co', 'score': 0.8177236488887242},
 {'companyName': 'VERIZON', 'score': 0.814616858959198},
 {'companyName': 'SCHWAB STRATEGIC TR US LRG CAP ETF',
  'score': 0.8138596415519714}]

The above result is based on the limited set of documents we have in the database. Each company's score is the average similarity score across its matching chunks.
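
The aggregation in the query above can be sketched in plain Python: group chunk-level scores by their parent company, average each group, and sort. The company names and scores here are hypothetical stand-ins for what the vector index returns after following the `HAS` relationship.

```python
from collections import defaultdict

def average_score_by_company(chunk_hits):
    """Average chunk-level similarity scores for each parent company."""
    grouped = defaultdict(list)
    for company, score in chunk_hits:
        grouped[company].append(score)
    averages = {
        company: sum(scores) / len(scores) for company, scores in grouped.items()
    }
    # Highest average first, mirroring ORDER BY score DESC
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Hypothetical (company, chunk score) pairs
chunk_hits = [
    ("Company A", 0.80),
    ("Company A", 0.90),
    ("Company B", 0.84),
]

ranked_companies = average_score_by_company(chunk_hits)
```

Note that the average only covers the chunks returned by the index lookup, so a company with a single strong chunk can outrank one with many moderately similar chunks.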

Now let's take this one step further and find the investment managers who are most heavily focused on energy. This involves a bit more Cypher for a 2-hop traversal.

In [19]:
# Search for managers with significant investments in the area
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
WITH c, avg(score) AS score ORDER BY score DESC LIMIT 100
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, score*r.value as weightedScore
WITH m.managerName AS managerName, sum(weightedScore) AS aggScore, sum(value) AS aggValue
RETURN managerName, aggScore/aggValue AS score ORDER BY score DESC LIMIT 1000

""", params =  {'queryVector': query_vector})
res_df

Unnamed: 0,managerName,score
0,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",0.794981
1,Rempart Asset Management Inc.,0.786834
2,PRIVATE ASSET MANAGEMENT INC,0.783902
3,"Emery Howard Portfolio Management, Inc.",0.781502
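
The last two `WITH` clauses of the query above compute, for each manager, a value-weighted average of company scores: `sum(score * value) / sum(value)`. A minimal sketch of that arithmetic, using hypothetical managers, companies, and position values:

```python
from collections import defaultdict

def value_weighted_scores(holdings, company_scores):
    """Return sum(score * value) / sum(value) for each manager."""
    weighted_sum = defaultdict(float)
    total_value = defaultdict(float)
    for manager, company, value in holdings:
        if company not in company_scores:
            continue  # companies without a similarity score contribute nothing
        weighted_sum[manager] += company_scores[company] * value
        total_value[manager] += value
    return {
        manager: weighted_sum[manager] / total_value[manager]
        for manager in weighted_sum
        if total_value[manager] > 0
    }

# Hypothetical inputs: average similarity per company, and
# (manager, company, position value) holdings
company_scores = {"Company A": 0.9, "Company B": 0.5}
holdings = [
    ("Manager X", "Company A", 300.0),
    ("Manager X", "Company B", 100.0),
    ("Manager Y", "Company B", 50.0),
]

manager_scores = value_weighted_scores(holdings, company_scores)
```

Manager X holds three times as much of the high-scoring company as of the low-scoring one, so its score lands closer to 0.9 than to 0.5. Holdings in companies without documents are ignored entirely, which is the limitation the next section addresses.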


And we can see that our top result is:

In [20]:
top_manager = res_df['managerName'][0]
top_manager

'BASSETT HARGROVE INVESTMENT COUNSEL, LLC'

## Expanding Available Data for Knowledge Retrieval
Not every element in your data will have rich text attached. Just as we only have 10-K documents for some companies, your use cases may also have incomplete, unevenly distributed text data.

We can check this for our top-ranked investment manager.

In [21]:
gds.run_cypher('''
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)-[:HAS]->(d:Document)
WITH m, count(DISTINCT c) AS ownedCompaniesWithDocs
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)
RETURN m.managerName AS managerName, ownedCompaniesWithDocs, count(DISTINCT c) AS totalOwnedCompanies
''', params =  {'managerName':top_manager})

Unnamed: 0,managerName,ownedCompaniesWithDocs,totalOwnedCompanies
0,"BASSETT HARGROVE INVESTMENT COUNSEL, LLC",2,51


This manager owns many more companies without documents than with them: only 2 of the 51 have any. We can use Graph Data Science Node Similarity to find the managers whose holdings overlap most with this one, which should lead us to other energy investments that we missed due to sparse text data.

In [22]:
g, _ = gds.graph.project('proj', ['Company', 'Manager'], {'OWNS':{'properties':['value']}})

In [23]:
gds.nodeSimilarity.write(g, writeRelationshipType = 'SIMILAR', writeProperty = 'score', relationshipWeightProperty = 'value')

preProcessingMillis                                                       1
computeMillis                                                            11
writeMillis                                                              33
postProcessingMillis                                                     -1
nodesCompared                                                             4
relationshipsWritten                                                     10
similarityDistribution    {'min': 0.01062631607055664, 'p5': 0.010626316...
configuration             {'writeProperty': 'score', 'writeRelationshipT...
Name: 0, dtype: object
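
Node Similarity compares nodes by the neighbors they share. Assuming the default Jaccard metric in its weighted form (sum of per-neighbor minima over sum of per-neighbor maxima), the score for a pair of managers can be sketched as follows. The holdings are hypothetical.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two {neighbor: weight} maps."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(key, 0.0), b.get(key, 0.0)) for key in keys)
    denominator = sum(max(a.get(key, 0.0), b.get(key, 0.0)) for key in keys)
    return numerator / denominator if denominator else 0.0

# Hypothetical holdings: company -> position value
manager_x = {"Company A": 100.0, "Company B": 50.0}
manager_y = {"Company B": 50.0, "Company C": 50.0}

similarity = weighted_jaccard(manager_x, manager_y)
```

The two managers share only Company B, so the similarity is 50 / 200 = 0.25. Managers with large positions that the other does not hold pull the score down, which helps explain why the similarity values in this dataset are small.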

In [24]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                              224
relationshipCount                                                      289
configuration            {'relationshipProjection': {'OWNS': {'aggregat...
density                                                           0.005786
creationTime                           2023-10-09T13:54:37.030415913+00:00
modificationTime                       2023-10-09T13:54:37.850611254+00:00
schema                   {'graphProperties': {}, 'nodes': {'Manager': {...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'Manager': {...
Name: 0, dtype: object

And now we can pull back other relevant results:

In [25]:
gds.run_cypher('''
MATCH (m0:Manager {managerName: $managerName})-[r:SIMILAR]->(m:Manager)
RETURN DISTINCT m.managerName AS managerName, r.score AS score
ORDER BY score DESC LIMIT 10
''', params =  {'managerName':top_manager})

Unnamed: 0,managerName,score
0,PRIVATE ASSET MANAGEMENT INC,0.045189
1,Rempart Asset Management Inc.,0.02311
2,"Emery Howard Portfolio Management, Inc.",0.010626


## Clean Up

In [26]:
gds.run_cypher('MATCH (M:Manager)-[s:SIMILAR]->() DELETE s')