<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/Neo4j/Neo4J_vector_index_langchain_Openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neo4j as Vector Index

LangChain integration docs: https://python.langchain.com/docs/integrations/vectorstores/neo4jvector/

The Neo4j vector index supports:

- Approximate nearest neighbor search
- Euclidean similarity and cosine similarity
- Hybrid search combining vector and keyword searches
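
To make the two similarity measures concrete, here is a minimal plain-Python sketch. The helper names `cosine_similarity` and `euclidean_distance` are illustrative only; Neo4j computes its own scores inside the index, so these raw values are not necessarily the scores Neo4j returns.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))   # 1.0
print(cosine_similarity(query, orthogonal))       # 0.0
print(euclidean_distance(query, same_direction))  # 1.0
```

Cosine similarity ignores vector length, so `query` and `same_direction` count as identical, while Euclidean distance still separates them.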


https://python.langchain.com/docs/integrations/document_loaders/wikipedia/

Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system called MediaWiki. Wikipedia is the largest and most-read reference work in history.


https://python.langchain.com/docs/integrations/text_embedding/openai/

Embeddings are created with OpenAI's embedding models.


# RAG

https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/

https://www.promptingguide.ai/techniques/rag

RAG (retrieval-augmented generation) combines an information retrieval component with a text generator model. RAG can be fine-tuned, and its internal knowledge can be modified efficiently without retraining the entire model.

RAG takes an input and retrieves a set of relevant/supporting documents from a source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator, which produces the final output. This makes RAG adaptable to situations where facts evolve over time, which is very useful because an LLM's parametric knowledge is static. RAG lets language models bypass retraining and access the latest information through retrieval, so they can generate more reliable outputs.
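
The retrieve-then-concatenate flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions: `retrieve` and `build_prompt` are hypothetical helpers, and retrieval here is a simple word-overlap ranking, whereas this notebook retrieves with vector search in Neo4j.

```python
import re


def words(text):
    """Lowercase a text and split it into a set of alphanumeric words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, corpus, k=2):
    """Toy retriever: rank documents by the number of words shared with the question."""
    question_words = words(question)
    scored = [(len(question_words & words(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(question, context_docs):
    """Concatenate the retrieved documents as context in front of the question."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


corpus = [
    "Neo4j is a graph database.",
    "WhatsApp is a messaging service.",
    "YouTube is a video sharing platform.",
]
question = "What is Neo4j?"
prompt = build_prompt(question, retrieve(question, corpus, k=1))
print(prompt)  # this string is what would be sent to the text generator
```

Updating the answers only requires changing `corpus`; the generator itself stays untouched, which is the point of the approach.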

In [None]:
!pip install langchain openai wikipedia tiktoken neo4j langchain_openai --quiet

In [3]:
import os

from langchain_community.vectorstores.neo4j_vector import Neo4jVector
from langchain_community.document_loaders import WikipediaLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.documents import Document

In [4]:
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('KEY_OPENAI')

In [24]:
topics = ["Youtube", "Twitter", "Facebook", "Google", "Microsoft", "Neo4J", "Amazon", "Langchain", "python", "Whatsapp"]

In [25]:
# Define the chunking strategy once; sizes are measured in tiktoken tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=20
)

documents = []
for topic in topics:
    # Load the Wikipedia articles matching the topic
    raw_documents = WikipediaLoader(query=topic).load()
    # Split each article into chunks
    docs = text_splitter.split_documents(raw_documents)
    for d in docs:
        # Drop the summary from the metadata; pop() tolerates a missing key
        d.metadata.pop("summary", None)
    documents.extend(docs)




In [26]:
len(documents)

232
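
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is an assumption-laden illustration, not the algorithm `CharacterTextSplitter` uses: `chunk_tokens` is a hypothetical helper, and it works on a plain list of tokens rather than on separators and tiktoken counts.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window has reached the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = "a b c d e f g h i j".split()
for chunk in chunk_tokens(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence cut at a boundary keeps a little of its context on both sides.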

In [52]:
documents[0]

Document(page_content='YouTube is an American online video sharing platform owned by Google. Accessible worldwide, it was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim, three former employees of PayPal. Headquartered in San Bruno, California, United States, it is the second most visited website in the world, after Google Search. YouTube has more than 2.5 billion monthly users, who collectively watch more than one billion hours of videos every day. As of May 2019, videos were being uploaded to the platform at a rate of more than 500 hours of content per minute, and as of 2021, there were approximately 14 billion videos in total.\nIn October 2006, YouTube was purchased by Google for $1.65 billion (equivalent to $2.31 billion in 2023). Google expanded YouTube\'s business model of generating revenue from advertisements alone, to offering paid content such as movies and exclusive content produced by and for YouTube. It also offers YouTube Premium, a paid subscri

In [61]:
documents[-1]

Document(page_content='This is a list of most-visited websites worldwide as of March 2024, along with their change in ranking compared to the previous month.\n\n\n== List ==\n\nData is compiled from Similarweb and Semrush as of March 2024. This list does not factor subpages that use the same domain as the parent site.\n\n\n== References ==', metadata={'title': 'List of most-visited websites', 'source': 'https://en.wikipedia.org/wiki/List_of_most-visited_websites'})

In [62]:
OpenAIEmbeddings()

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7bf808132590>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7bf808132ef0>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None)

In [53]:
text = documents[0].page_content
text

'YouTube is an American online video sharing platform owned by Google. Accessible worldwide, it was launched on February 14, 2005, by Steve Chen, Chad Hurley, and Jawed Karim, three former employees of PayPal. Headquartered in San Bruno, California, United States, it is the second most visited website in the world, after Google Search. YouTube has more than 2.5 billion monthly users, who collectively watch more than one billion hours of videos every day. As of May 2019, videos were being uploaded to the platform at a rate of more than 500 hours of content per minute, and as of 2021, there were approximately 14 billion videos in total.\nIn October 2006, YouTube was purchased by Google for $1.65 billion (equivalent to $2.31 billion in 2023). Google expanded YouTube\'s business model of generating revenue from advertisements alone, to offering paid content such as movies and exclusive content produced by and for YouTube. It also offers YouTube Premium, a paid subscription option for watch

In [58]:
OpenAIEmbeddings().embed_documents(texts=[text])[0][:10]

[-0.02060152733574137,
 -0.02013070835126636,
 -0.004071950730049511,
 -0.05061944181156403,
 -0.0031684867447656978,
 -0.00636560467577303,
 -0.003071459922378797,
 -0.015409789188681007,
 -0.012324013673181393,
 -0.044587861122684924]

In [56]:
len(OpenAIEmbeddings().embed_documents(texts=[text])[0])

1536

In [28]:
url = "bolt://44.204.228.84:7687"
username = "neo4j"
password = "setup-output-escape"

neo4j_db = Neo4jVector.from_documents(
    documents,
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    database="neo4j",  # neo4j by default
    index_name="wikipedia",  # vector by default
    node_label="WikipediaArticle",  # Chunk by default
    text_node_property="info",  # text by default
    embedding_node_property="vector",  # embedding by default
    create_id_index=True,  # True by default
)

# Unique node property constraints
Unique node property constraints, or node property uniqueness constraints, ensure that property values are unique for all nodes with a specific label. For property uniqueness constraints on multiple properties, the combination of the property values is unique. Node property uniqueness constraints do not require all nodes to have a unique value for the properties listed (nodes without all properties on which the constraint exists are not subject to this rule).
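
The rule above can be modeled in plain Python. This is only a sketch of the semantics, not how Neo4j enforces constraints: `find_uniqueness_violations` and the dictionary layout of the nodes are hypothetical.

```python
def find_uniqueness_violations(nodes, label, properties):
    """Return pairs of node names that share the same values for `properties`."""
    seen = {}
    violations = []
    for node in nodes:
        if label not in node["labels"]:
            continue  # the constraint only applies to nodes with this label
        if not all(p in node["props"] for p in properties):
            continue  # nodes missing a constrained property are exempt
        key = tuple(node["props"][p] for p in properties)
        if key in seen:
            violations.append((seen[key], node["name"]))
        else:
            seen[key] = node["name"]
    return violations


nodes = [
    {"name": "n1", "labels": {"WikipediaArticle"}, "props": {"id": "a"}},
    {"name": "n2", "labels": {"WikipediaArticle"}, "props": {"id": "a"}},
    {"name": "n3", "labels": {"WikipediaArticle"}, "props": {}},
    {"name": "n4", "labels": {"WikipediaArticle"}, "props": {}},
    {"name": "n5", "labels": {"Other"}, "props": {"id": "a"}},
]
print(find_uniqueness_violations(nodes, "WikipediaArticle", ["id"]))
# [('n1', 'n2')]
```

Only `n1` and `n2` conflict: `n3` and `n4` have no `id` property, and `n5` carries a different label.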

In [29]:
neo4j_db.query("SHOW CONSTRAINTS")


[{'id': 5,
  'name': 'constraint_e5da4d45',
  'type': 'UNIQUENESS',
  'entityType': 'NODE',
  'labelsOrTypes': ['WikipediaArticle'],
  'properties': ['id'],
  'ownedIndex': 'constraint_e5da4d45',
  'propertyType': None}]

In [30]:
neo4j_db.query(
    """SHOW INDEXES
       YIELD name, type, labelsOrTypes, properties, options
       WHERE type = 'VECTOR'
    """
)

[{'name': 'wikipedia',
  'type': 'VECTOR',
  'labelsOrTypes': ['WikipediaArticle'],
  'properties': ['vector'],
  'options': {'indexProvider': 'vector-2.0',
   'indexConfig': {'vector.dimensions': 1536,
    'vector.similarity_function': 'cosine'}}}]

In [31]:
neo4j_db.add_documents(
    [
        Document(
            page_content="LangChain is the coolest library since the second world wide",
            metadata={"author": "Olonok", "confidence": 1.0}
        )
    ],
    ids=["langchain"],
)

['langchain']

In [32]:
existing_index = Neo4jVector.from_existing_index(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    index_name="wikipedia",
    text_node_property="info",  # Need to define if it is not default
)

In [33]:
print(existing_index.node_label)
print(existing_index.embedding_node_property)

WikipediaArticle
vector


In [35]:
retrieval_query = """
OPTIONAL MATCH (node)<-[:EDITED_BY]-(p)
WITH node, score, collect(p) AS editors
RETURN node.info AS text,
       score,
       node {.*, vector: Null, info: Null, editors: editors, score: score} AS metadata
"""

existing_index_return = Neo4jVector.from_existing_index(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    database="neo4j",
    index_name="wikipedia",
    text_node_property="info",
    retrieval_query=retrieval_query,
)

In [37]:
existing_index_return.similarity_search("What do you know about Youtube?", k=2)

[Document(page_content='YouTube (formerly YouTube Spotlight) is the official YouTube channel for the American video-sharing platform YouTube, spotlighting videos and events on the platform. Events shown on the channel include YouTube Comedy Week and the YouTube Music Awards. Additionally, the channel uploaded annual installments of YouTube Rewind between 2010 and 2019. For a brief period in late 2013, this channel was ranked as the most-subscribed on the platform. As of March 2024, the channel has earned 39.8 million subscribers and 3.05 billion video views.\n\n\n== History ==\nThe YouTube channel was registered on February 1, 2005. On November 2, 2013, the YouTube channel briefly surpassed PewDiePie\'s channel, to become the most-subscribed channel on the website. The channel ascended to the top position through auto-suggesting and pre-selecting itself as a subscription option upon new user registration for YouTube. Throughout December 2013, the channel and PewDiePie struggled for the

In [39]:
existing_index_return.similarity_search("What is WhatsApp?", k=1)

[Document(page_content='WhatsApp (officially WhatsApp Messenger) is an instant messaging (IM) and voice-over-IP (VoIP) service owned by technology conglomerate Meta. It allows users to send text, voice messages and video messages, make voice and video calls, and share images, documents, user locations, and other content. WhatsApp\'s client application runs on mobile devices, and can be accessed from computers. The service requires a cellular mobile telephone number to sign up. In January 2018, WhatsApp released a standalone business app called WhatsApp Business which can communicate with the standard WhatsApp client.\nThe service was created by WhatsApp Inc. of Mountain View, California, which was acquired by Facebook in February 2014 for approximately US$19.3 billion. It became the world\'s most popular messaging application by 2015, and had more than 2 billion users worldwide by February 2020. By 2016, it had become the primary means of Internet communication in regions including Lat