# Search & Summarization with Neo4j and GenAI Services


Below we use a graph of investment managers who own stocks in different companies from 2022 through 2023-08. Some companies have a variable number of Documents, extracted from 10-K filings, which describe their business in free text.

![](images/data-model.png)

In the context of our use cases, think of search as finding \<entities\> which meet some \<criteria\>. The criteria are defined by the user, while the entities (people, places, things) are defined by both the user and the enterprise system.

By using text embeddings in a vector similarity search, we can make the above work well when documents are the entities of interest. However, what if we want second- or third-order entities related or connected to those documents? How would we efficiently search for them at scale in an enterprise setting?

For example, if we simply wanted to find the documents that match a prompt, e.g. "find all documents that talk about pharma related things", we would not need a graph database, just a simple vector search mechanism. But if we instead ask "find investors who are most focused on pharma related things", it becomes a different story: the answer depends on how investors connect to those documents through the companies they own.
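To make the document-only case concrete, here is a minimal sketch of a vector similarity search in plain Python. The three-dimensional vectors and document names are made up for illustration; real text embeddings have hundreds of dimensions and would come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", not real model output.
documents = {
    "doc_pharma": [0.9, 0.1, 0.0],
    "doc_railroad": [0.0, 0.2, 0.9],
    "doc_biotech": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

# Rank documents from most to least similar to the query vector.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

This ranks documents only. Ranking the investors connected to those documents needs the relationships between managers, companies, and documents, which is where the graph comes in.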

__In this Notebook:__
1. Semantic Search: Answer the question "Which investors are most focused on: \[x\]?"
2. \[TODO\]: Gradio embedded UI? LLM for summarizing the response? More examples?

## Setup
First, check that this notebook is running in the Python environment you installed by following the readme. Make sure the kernel selected in the top right of this notebook is the one for that environment (py38 in the readme). The command below prints the Python version the kernel is running; in the run recorded here it is 3.10.10.

In [1]:
import sys
sys.version

'3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]'

Next, we install and import some libraries.

In [2]:
%%capture
%pip install --user "google-cloud-aiplatform>=1.25.0" --upgrade
%pip install --user "google-cloud-aiplatform[pipelines]>=1.25.0"
%pip install --user graphdatascience
%pip install --user nltk

Now restart the kernel. That will allow the Python environment to import the new packages.

In [3]:
import json
import numpy as np
import os
import re
from string import Template
from dotenv import load_dotenv

# Vertexai
import vertexai
from vertexai.preview.language_models import TextGenerationModel

# Neo4j
from graphdatascience import GraphDataScience

In [4]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'
# You will need to change these to match your own Neo4j instance
NEO4J_URI = 'neo4j+s://7bcc79e1.databases.neo4j.io' #'<neo4j+s://xxxxx.databases.neo4j.io>'
NEO4J_PASSWORD = 'genai123' #'<password>'
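
As an alternative to hardcoding credentials in the notebook, the connection settings could be read from environment variables. The sketch below is one way to do that; the function name `load_neo4j_settings` and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions chosen for illustration, not something the notebook or readme defines. The `load_dotenv` function imported earlier can populate `os.environ` from a `.env` file before this runs.

```python
import os

def load_neo4j_settings(env=None):
    """Read Neo4j connection settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    uri = env.get("NEO4J_URI")
    password = env.get("NEO4J_PASSWORD")
    # Fail early with a clear message instead of at connection time.
    missing = [name for name, value in
               (("NEO4J_URI", uri), ("NEO4J_PASSWORD", password)) if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {
        "uri": uri,
        "username": env.get("NEO4J_USERNAME", "neo4j"),  # neo4j is the default user
        "password": password,
    }
```

The returned values can then be passed to `GraphDataScience` in place of the constants above.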

In [5]:
gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

## Semantic Search
In this case, we will answer the question: which investors are most focused on pharma?

In [6]:
#semantic_query = ['railroad cargo shipping']
#semantic_query = ['smart phones, tablets']
#semantic_query = ['water heaters and boilers, manufacturing, north america, operations']
semantic_query = ['pharma, medicine, healthcare']

In [7]:
from vertexai.language_models import TextEmbeddingModel

EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [8]:
# Get query vector
#emb_result =[e.values for e in EMBEDDING_MODEL.get_embeddings(semantic_query)][0]
query_vector = EMBEDDING_MODEL.get_embeddings(semantic_query)[0].values

In [9]:
%%time

# Rank companies by their best-matching document.
# Note: this cell scores documents with Pearson correlation (gds.similarity.pearson).
res_df = gds.run_cypher('''
MATCH (b:Document)
WITH b, gds.similarity.pearson($queryVector, b.textEmbedding) AS similarity
MATCH (b)<-[:HAS]-(c:Company)
RETURN c.companyName AS companyName, max(similarity) AS score
ORDER BY score DESC
''', params={'queryVector': query_vector})
res_df

CPU times: user 23.1 ms, sys: 0 ns, total: 23.1 ms
Wall time: 1.12 s


| | companyName | score |
|---|---|---|
| 0 | UNITED THERAPEUTICS | 0.738174 |
| 1 | ORGANON & CO. | 0.731347 |
| 2 | FORIAN INC | 0.730147 |
| 3 | LIQUIDIA CORP | 0.729185 |
| 4 | WINDTREE THERAPEUTICS INC | 0.726925 |
| ... | ... | ... |
| 525 | ARK RESTAURANTS CORP | 0.563173 |
| 526 | MIDDLEFIELD BANC CORP | 0.559373 |
| 527 | SAFEGUARD SCIENTIFICS INC | 0.555271 |
| 528 | TREASURE GLOBAL INC | 0.554142 |
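
The company query above scores every document against the query vector and then keeps, for each company, the `max` of its document scores, so a company ranks as high as its single best-matching document. A minimal sketch of that aggregation step in plain Python follows; the company names and scores are made up for illustration.

```python
def best_score_per_company(doc_scores):
    """Keep the highest document score per company, sorted best first."""
    best = {}
    for company, score in doc_scores:
        if company not in best or score > best[company]:
            best[company] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical (company, document similarity) pairs, one per document.
doc_scores = [
    ("Company A", 0.61),
    ("Company A", 0.74),
    ("Company B", 0.70),
    ("Company B", 0.58),
]
ranking = best_score_per_company(doc_scores)
print(ranking)
```

Using the maximum favors companies with at least one strongly related document, regardless of how many unrelated documents they also have.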


In [10]:
%%time

# Search for managers with significant investments in the area.
# Score = value-weighted average of company similarity across each manager's holdings.
res_df = gds.run_cypher('''
MATCH (b:Document)
WITH b, gds.similarity.cosine($queryVector, b.textEmbedding) AS cosineSimilarity
MATCH (b)<-[:HAS]-(c:Company)
WITH c, avg(cosineSimilarity) AS mcos
MATCH (c)<-[r:OWNS]-(m:Manager)
WITH m, r.value as value, mcos*r.value as weightedmcos
WITH m.managerName AS managerName, sum(weightedmcos) AS aggwmcos, sum(value) AS aggValue
RETURN managerName, aggwmcos/aggValue AS score ORDER BY score DESC LIMIT 1000

''', params =  {'queryVector': query_vector})
res_df

CPU times: user 35.9 ms, sys: 0 ns, total: 35.9 ms
Wall time: 1.22 s


| | managerName | score |
|---|---|---|
| 0 | ORACLE INVESTMENT MANAGEMENT INC | 0.670796 |
| 1 | Grey Street Capital, LLC | 0.670796 |
| 2 | Brainard Capital Management LLC | 0.670796 |
| 3 | Mendota Financial Group, LLC | 0.667671 |
| 4 | West Oak Capital, LLC | 0.666400 |
| ... | ... | ... |
| 995 | NorthRock Partners, LLC | 0.570197 |
| 996 | CONDOR CAPITAL MANAGEMENT | 0.570197 |
| 997 | Bison Wealth, LLC | 0.570197 |
| 998 | Eisler Capital (US) LLC | 0.570197 |
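
The manager query above averages document similarity per company, then scores each manager as the sum of (company similarity × holding value) divided by the sum of holding values. A minimal sketch of that value-weighted average in plain Python follows; the manager names, company names, and numbers are made up for illustration.

```python
from collections import defaultdict

def manager_scores(company_similarity, holdings):
    """Value-weighted average of company similarity per manager, best first."""
    weighted_sum = defaultdict(float)
    total_value = defaultdict(float)
    for manager, company, value in holdings:
        # Holdings in companies without a similarity score are skipped,
        # just as companies without documents never match in the query.
        if company not in company_similarity:
            continue
        weighted_sum[manager] += company_similarity[company] * value
        total_value[manager] += value
    scores = {m: weighted_sum[m] / total_value[m] for m in total_value if total_value[m]}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical average document similarity per company.
company_similarity = {"PharmaCo": 0.70, "RailCo": 0.50}
# Hypothetical (manager, company, holding value) rows.
holdings = [
    ("Manager X", "PharmaCo", 300.0),
    ("Manager X", "RailCo", 100.0),
    ("Manager Y", "RailCo", 50.0),
    ("Manager Y", "NoDocsCo", 500.0),
]
result = manager_scores(company_similarity, holdings)
print(result)
```

Under this scoring, a manager whose counted holdings are all in one company gets exactly that company's average similarity, which may be why several managers share identical scores in the output above.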


If you investigate the top investors, you will see their relation to pharma and healthcare.

For example, [ORACLE INVESTMENT MANAGEMENT INC](https://oraclepartners.com/) describes itself as a "fundamental research driven investment management company that is exclusively focused on the global health care and bioscience industries".