# Semantic Search with Neo4j and Vertex AI

Semantic search is loosely defined as "search with meaning" and it is key for effective knowledge retrieval.

As opposed to traditional, lexical, search which finds matches based on keywords, semantic search seeks to improve search quality and accuracy by understanding search intent and pulling results that match the user’s contextual meaning.


Semantic search is often used in reference to text embedding and vector similarity search, but this is just one implementation aspect of it. Knowledge graph and symbolic query logic can also play a critical role in making semantic search a reality. 

If all you care about is analyzing a set of documents on a file system, then sure, vector indexing and search may be sufficient.  However, once you need to retrieve and make inferences about people, places, and things connected to those documents, Knowledge graph becomes key. 


To understand this, consider our updated data model with documents from 10K filings. 

![](images/data-model.png)


If documents are the entities of interest. For example: "find all documents that talk about pharma related things" then text embeddings with vector similarity search suffices.

But what if we want second or third-order entities related to the documents?  For example: "find investors who are most focused on pharma related strategies" how would we efficiently search for them at scale in an enterprise setting?

This is what we demonstrate below.  We will also show how you can use graph relationships and Graph Data Science algorithms to further improve search results, especially in common scenarios where the presence of text data is inconsistent or sparse. 


## Setup
First, check to ensure you're using the `neo4j_genai` kernel with the following command. This kernel has the necessary runtime and dependencies for this notebook. If you see a different kernel, try changing the kernel to `neo4j_genai` in the upper right corner of the screen.

In [1]:
import sys
import os
os.path.basename(sys.executable.replace("/bin/python",""))

'neo4j_genai'

In [11]:
import json
import numpy as np
import os
import re
from string import Template

# Vertexai
import vertexai
from vertexai.language_models import TextEmbeddingModel

# Neo4j
from graphdatascience import GraphDataScience

Connect to Neo4j.

In [5]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'
# You will need to change these to match
NEO4J_URI = '<neo4j+s://xxxxx.databases.neo4j.io>'
NEO4J_PASSWORD = '<password>'

In [6]:
gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

## Neo4j Vector Index

We will need to create a vector index for similarity search on Document nodes. Neo4j offers a vector index that enables Approximate Nearest Neighbor Search (ANN). Let's creat an index.

In [7]:
gds.run_cypher("CALL db.index.vector.createNodeIndex('document-embeddings', 'Document', 'textEmbedding', 768, 'cosine')")

You can see that vector index has been created using `SHOW INDEXES`

In [9]:
gds.run_cypher(''' 
SHOW INDEXES YIELD name, type, labelsOrTypes, properties, options
WHERE type = "VECTOR"
''')

Unnamed: 0,name,type,labelsOrTypes,properties,options
0,document-embeddings,VECTOR,[Document],[textEmbedding],"{'indexProvider': 'vector-1.0', 'indexConfig':..."


## Deep Semantic Search with Knowledge Graph
Now that we have an index.  Let’s use it in action.
In this case, we will Answer The Question - "What Investors are most focused in pharma, medicine, and healthcare?" 
Remember we do not have documents on investment managers, just companies, and there can be multiple documents for each company. 

In [14]:
# Semantic query, lets use these key words to search
semantic_query = ['pharma, medicine, healthcare']

In [15]:
# Create a query vector by embedding the query using Vertex AI text embedding

EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [16]:
# Get query vector
#emb_result =[e.values for e in EMBEDDING_MODEL.get_embeddings(semantic_query)][0]
query_vector = EMBEDDING_MODEL.get_embeddings(semantic_query)[0].values
query_vector[:10]

[0.007500611711293459,
 0.0003618541522882879,
 -0.027530884370207787,
 0.05591048300266266,
 0.040699176490306854,
 -0.04322928935289383,
 0.013355609029531479,
 0.008204770274460316,
 -0.06950043886899948,
 0.02362251654267311]

Now let's use that query vector to search for companies.  Remember, companies have multiple documents so we will need to use a graph traversal on top of a document lookup to find which companies are most similar.

In [27]:
%%time

# Search for similar companies
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
RETURN c.companyName as companyName, avg(score) AS score
ORDER BY score DESC LIMIT 100
""", params =  {'queryVector': query_vector})
res_df

CPU times: user 16.1 ms, sys: 4.21 ms, total: 20.3 ms
Wall time: 101 ms


Unnamed: 0,companyName,score
0,MONTEREY BIO ACQUISITION COR,0.854617
1,FORIAN INC,0.850504
2,LANDEC CORP,0.850140
3,Penumbra Inc,0.848445
4,Mesa Laboratories Inc,0.845195
...,...,...
95,"Alphatec Holdings, Inc.",0.837868
96,10X GENOMICS,0.837864
97,Checkpoint Therapeutics,0.837857
98,LENSAR INC,0.837833


You may recognize some of these companies, and if not a quick google search will con firm that their business is involved in healthcare and pharma, so this seems to be working.

Now let's take this one step further and find investment managers who are most heavily focused in pharma.  This will involve a bit more Cypher for a 2-hop traversal. 

In [49]:
%%time

# Search for managers with significiant investments in area
res_df = gds.run_cypher("""
CALL db.index.vector.queryNodes('document-embeddings', 1000, $queryVector)
YIELD node AS similarDocuments, score
MATCH (similarDocuments)<-[:HAS]-(c:Company)
WITH c, avg(score) AS score ORDER BY score LIMIT 100
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, score*r.value as weightedScore
WITH m.managerName AS managerName, sum(weightedScore) AS aggScore, sum(value) AS aggValue
RETURN managerName, aggScore/aggValue AS score ORDER BY score DESC LIMIT 1000

""", params =  {'queryVector': query_vector})
res_df

CPU times: user 36.7 ms, sys: 203 µs, total: 36.9 ms
Wall time: 76.9 ms


Unnamed: 0,managerName,score
0,INTERNATIONAL BIOTECHNOLOGY TRUST PLC,0.839136
1,"Old Well Partners, LLC",0.839002
2,Cannell & Co.,0.838979
3,"CI Private Wealth, LLC",0.838979
4,Allworth Financial LP,0.838979
...,...,...
368,Cowen Investment Management LLC,0.831560
369,"Whitefort Capital Management, LP",0.831560
370,"Equitable Holdings, Inc.",0.831560
371,Wealth Alliance,0.831560


And we can see that our top result is a specialized investment trust in Bio-Technology
[INTERNATIONAL BIOTECHNOLOGY TRUST PLC](https://ibtplc.com/)

## Expanding Available Data for Knowledge Retrieval

Not every element in your data will have rich text data, and further, much like we only have 10K documents for some companies, your use cases may also have incomplete, unevenly distributed text data. 

We can check our top result investment manager to this. 

In [39]:
gds.run_cypher('''
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)-[:HAS]->(d:Document)
WITH m, count(DISTINCT c) AS ownedCompaniesWithDocs
MATCH (m:Manager {managerName: $managerName})-[:OWNS]->(c:Company)
RETURN m.managerName AS managerName, ownedCompaniesWithDocs, count(DISTINCT c) AS totalOwnedCompanies
''', params =  {'managerName':'INTERNATIONAL BIOTECHNOLOGY TRUST PLC'})

Unnamed: 0,managerName,ownedCompaniesWithDocs,totalOwnedCompanies
0,INTERNATIONAL BIOTECHNOLOGY TRUST PLC,1,54


This manager has significantly more other companies they own without documents.  We can use Graph Data Science Node Similarity to find the managers that have the most overlap to this one which should give us other Biotech companies that we missed due to sparse text data.

In [64]:
g, _  = gds.graph.project('proj', ['Company', 'Manager'], {'OWNS':{'properties':['value']}})

In [66]:
gds.nodeSimilarity.write(g, writeRelationshipType='SIMILAR', writeProperty='score', relationshipWeightProperty='value')

NodeSimilarity:   0%|          | 0/100 [00:00<?, ?%/s]

preProcessingMillis                                                       0
computeMillis                                                         10398
writeMillis                                                            1045
postProcessingMillis                                                     -1
nodesCompared                                                          6027
relationshipsWritten                                                  60162
similarityDistribution    {'p1': 0.0007693730258324649, 'max': 1.0000076...
configuration             {'topK': 10, 'writeConcurrency': 4, 'similarit...
Name: 0, dtype: object

In [67]:
g.drop()

graphName                                                             proj
database                                                             neo4j
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                            21855
relationshipCount                                                  1600322
configuration            {'relationshipProjection': {'OWNS': {'orientat...
density                                                           0.003351
creationTime                           2023-08-22T02:39:49.573096253+00:00
modificationTime                       2023-08-22T02:39:50.035230514+00:00
schema                   {'graphProperties': {}, 'relationships': {'OWN...
schemaWithOrientation    {'graphProperties': {}, 'relationships': {'OWN...
Name: 0, dtype: object

And now we can pull back other relevant results

In [68]:
gds.run_cypher('''
MATCH (m0:Manager {managerName: $managerName})-[r:SIMILAR]->(m:Manager)
RETURN m.managerName AS managerName, r.score AS score
ORDER BY score DESC LIMIT 10
''', params =  {'managerName':'INTERNATIONAL BIOTECHNOLOGY TRUST PLC'})

Unnamed: 0,managerName,score
0,SECTOR GAMMA AS,0.121268
1,"Sofinnova Investments, Inc.",0.112067
2,SECTORAL ASSET MANAGEMENT INC,0.09249
3,Privium Fund Management B.V.,0.085084
4,BENDER ROBERT & ASSOCIATES,0.078136
5,SPHERA FUNDS MANAGEMENT LTD.,0.071602
6,"AIMZ Investment Advisors, LLC",0.067568
7,GREAT POINT PARTNERS LLC,0.066668
8,HighVista Strategies LLC,0.061513
9,"Slow Capital, Inc.",0.059504


## Clean Up

In [69]:
gds.run_cypher('MATCH (M:Manager)-[s:SIMILAR]->() DELETE s')