# Google Spanner

> [Spanner](https://cloud.google.com/spanner) is a database that combines unlimited scalability with relational semantics, such as secondary indexes, strong consistency, schemas, and SQL, and offers up to 99.999% availability in one solution.

This notebook shows how to use `Spanner` for GraphRAG with the `SpannerPropertyGraphStore` class.

Learn more about the package on [GitHub](https://github.com/googleapis/llama-index-spanner-python/).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/googleapis/llama-index-spanner-python/blob/main/docs/property_graph_store.ipynb)

## Before You Begin

To run this notebook, you will need to do the following:

 * [Create a Google Cloud Project](https://developers.google.com/workspace/guides/create-project)
 * [Enable the Cloud Spanner API](https://console.cloud.google.com/flows/enableapi?apiid=spanner.googleapis.com)
 * [Create a Spanner instance](https://cloud.google.com/spanner/docs/create-manage-instances)
 * [Create a Spanner database](https://cloud.google.com/spanner/docs/create-manage-databases)

### 🦜🔗 Library Installation
The integration lives in its own `llama-index-spanner` package, so we need to install it.

In [1]:
%pip install --upgrade --quiet llama-index-spanner llama-index-llms-google-genai llama-index-embeddings-google-genai llama-index-readers-wikipedia wikipedia pyvis

Note: you may need to restart the kernel to use updated packages.


**Colab only:** Run the following cell to restart the kernel, or use the restart button.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### 🔐 Authentication
Authenticate to Google Cloud as the IAM user logged into this notebook in order to access your Google Cloud Project.

* If you are using Colab to run this notebook, use the cell below and continue.
* If you are using Vertex AI Workbench, check out the setup instructions [here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/setup-env).

In [None]:
from google.colab import auth

auth.authenticate_user()
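
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. Below is a minimal sketch of a guard, assuming you want the same notebook to run in both environments. The helper name `running_in_colab` is hypothetical and not part of any library.

```python
import sys


def running_in_colab() -> bool:
    """Return True when the Colab runtime has loaded the `google.colab` module."""
    return "google.colab" in sys.modules


if running_in_colab():
    # Import inside the guard so other environments do not raise ImportError.
    from google.colab import auth

    auth.authenticate_user()
else:
    print("Not running in Colab; skipping Colab authentication.")
```

Outside Colab, credentials have to come from another source, for example the Vertex AI Workbench setup linked above.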

### ☁ Set Your Google Cloud Project
Set your Google Cloud project so that you can leverage Google Cloud resources within this notebook.

If you don't know your project ID, try the following:

* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [None]:
# @markdown Please fill in the value below with your Google Cloud project ID and then run the cell.

PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
!gcloud config set project {PROJECT_ID}
%env GOOGLE_CLOUD_PROJECT={PROJECT_ID}
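
Because `PROJECT_ID` starts out empty, it can help to check the value before passing it to `gcloud`. Below is a rough sketch based on the documented format for project IDs (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen). The helper name `is_valid_project_id` is hypothetical, and the check does not cover legacy domain-scoped IDs.

```python
import re

# 1 leading letter + 4 to 28 middle characters + 1 trailing letter or digit
# gives a total length of 6 to 30 characters.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if the string matches the standard project ID format."""
    return bool(_PROJECT_ID_PATTERN.fullmatch(project_id))


for candidate in ["my-project-123", "", "My-Project", "my-project-"]:
    print(repr(candidate), is_valid_project_id(candidate))
```

A format check only catches typos and empty values; it does not confirm that the project exists or that you can access it.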

### 💡 API Enablement
The `llama-index-spanner` package requires that you [enable the Spanner API](https://console.cloud.google.com/flows/enableapi?apiid=spanner.googleapis.com) in your Google Cloud Project.

In [None]:
# enable Spanner API
!gcloud services enable spanner.googleapis.com

## Basic Usage

### Prepare documents, llm and embed_model

Load documents from Wikipedia to add to Spanner Graph, and set up the `llm` and `embed_model` used to build the graph.

In [26]:
# Get graph documents from Wikipedia
from llama_index.readers.wikipedia import WikipediaReader
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.llms.google_genai import GoogleGenAI

loader = WikipediaReader()
documents = loader.load_data(pages=["Google"], auto_suggest=False)

llm = GoogleGenAI(
    model="gemini-2.5-flash-preview-05-20",
)
embed_model = GoogleGenAIEmbedding(
    model_name="text-embedding-004", embed_batch_size=100
)

### Set Spanner database values
Find your database values in the [Spanner Instances page](https://console.cloud.google.com/spanner).

NOTE:
- The database identified by INSTANCE and DATABASE must be created beforehand.
- The graph does NOT need to be created beforehand.
- If a graph with the specified name already exists, this library builds on it. To avoid unexpected errors, make sure the existing graph was also created with this library. If it was not, create a new graph with a different name.


In [None]:
# @title Set Your Values Here { display-mode: "form" }
INSTANCE = ""  # @param {type: "string"}
DATABASE = ""  # @param {type: "string"}
GRAPH_NAME = ""  # @param {type: "string"}
USE_FLEXIBLE_SCHEMA = False  # @param {type: "boolean"}

### SpannerPropertyGraphStore

To initialize the `SpannerPropertyGraphStore` class, provide the three required arguments below. All other arguments are optional; pass them only when you need a value other than the default.

1.   a Spanner instance ID;
2.   a Spanner database ID that belongs to that instance;
3.   a Spanner graph name, used to create a graph in that database.

#### SpannerPropertyGraphStore with flexible schema

By default, `SpannerPropertyGraphStore` creates one underlying table for each type of node and each type of edge.
A graph with many different node and edge types therefore produces many underlying tables.

`SpannerPropertyGraphStore` also provides a flexible schema mode that stores all nodes in a single node table and all edges in a single edge table.

To use it, set `USE_FLEXIBLE_SCHEMA` to `True`.
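
To make the trade-off concrete, here is a small sketch that counts the underlying tables in each mode, following the description above. The function `count_underlying_tables` and the sample type names are illustrative only; this is not how the library computes or names its tables.

```python
def count_underlying_tables(node_types, edge_types, use_flexible_schema):
    """Estimate how many tables back the graph in each schema mode."""
    if use_flexible_schema:
        # One shared node table plus one shared edge table.
        return 2
    # One table per distinct node type and per distinct edge type.
    return len(set(node_types)) + len(set(edge_types))


# Illustrative type names, not taken from the library.
node_types = ["text_chunk", "COMPANY", "PERSON", "PRODUCT"]
edge_types = [
    "COMPANY_HAS_PRODUCT",
    "PERSON_WORKED_ON_COMPANY",
    "PRODUCT_PART_OF_COMPANY",
]

print(count_underlying_tables(node_types, edge_types, use_flexible_schema=False))
print(count_underlying_tables(node_types, edge_types, use_flexible_schema=True))
```

The default mode grows with the number of distinct types, while the flexible mode stays constant, which matters most when an LLM extractor is free to invent new relation types.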

In [None]:
from llama_index_spanner import SpannerPropertyGraphStore

graph_store = SpannerPropertyGraphStore(
    instance_id=INSTANCE,
    database_id=DATABASE,
    graph_name=GRAPH_NAME,
    use_flexible_schema=USE_FLEXIBLE_SCHEMA,
)

print("Clean up existing data...")
graph_store.clean_up()

### Insert documents into SpannerPropertyGraphStore using PropertyGraphIndex

`PropertyGraphIndex`, together with `kg_extractors`, converts the documents into a knowledge graph and inserts it into the `PropertyGraphStore` (in our case, `SpannerPropertyGraphStore`).

In [None]:
# Allow nested event loops; Jupyter already runs its own event loop
import nest_asyncio

nest_asyncio.apply()

In [None]:
from typing import Literal

from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor

PropertyGraphIndex.from_documents(
    documents,
    kg_extractors=[
        SchemaLLMPathExtractor(
            possible_entities=Literal["PERSON", "COMPANY", "PRODUCT"],
            strict=False,
            llm=llm,
            max_triplets_per_chunk=1000,
            num_workers=4,
        )
    ],
    property_graph_store=graph_store,
    use_async=False,
    llm=llm,
    embed_kg_nodes=True,
    embed_model=embed_model,
    show_progress=True,
)

Inserting documents...


Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 45.72it/s]
Extracting paths from text with schema: 100%|██████████| 18/18 [01:58<00:00,  6.56s/it]
Generating embeddings: 100%|██████████| 18/18 [00:01<00:00, 11.72it/s]
Generating embeddings: 100%|██████████| 840/840 [00:13<00:00, 62.40it/s]


Waiting for DDL operations to complete...
Insert nodes of type `text_chunk`...
Waiting for DDL operations to complete...
Insert nodes of type `COMPANY`...
Insert nodes of type `PERSON`...
Insert nodes of type `PRODUCT`...
Waiting for DDL operations to complete...
Insert edges of type `COMPANY_PART_OF_COMPANY`...
Insert edges of type `PERSON_WORKED_ON_COMPANY`...
Insert edges of type `COMPANY_HAS_PRODUCT`...
Insert edges of type `COMPANY_HAS_COMPANY`...
Insert edges of type `COMPANY_HAS_ALIAS_COMPANY`...
Insert edges of type `PERSON_LOCATED_IN_COMPANY`...
Insert edges of type `PERSON_WORKED_ON_PRODUCT`...
Insert edges of type `COMPANY_HAS_ALIAS_PRODUCT`...
Insert edges of type `PERSON_HAS_COMPANY`...
Insert edges of type `COMPANY_USED_BY_COMPANY`...
Insert edges of type `PERSON_PART_OF_COMPANY`...
Insert edges of type `COMPANY_WORKED_ON_COMPANY`...
Insert edges of type `PRODUCT_PART_OF_COMPANY`...
Insert edges of type `COMPANY_WORKED_ON_PERSON`...
Insert edges of type `COMPANY_HAS_PERSO

<llama_index.core.indices.property_graph.base.PropertyGraphIndex at 0x7f7eb15ea1e0>
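
The edge type names in the log above, such as `COMPANY_HAS_PRODUCT`, look like a source label, a relation, and a target label joined by underscores. Below is a sketch of grouping triplets by type under that assumption. The naming rule is inferred from the log rather than confirmed from the library, and the sample triplets and the helper `group_edges_by_type` are hypothetical.

```python
from collections import defaultdict

# Hypothetical triplets: (source id, source label, relation, target id, target label)
triplets = [
    ("Google", "COMPANY", "HAS", "YouTube", "PRODUCT"),
    ("Larry Page", "PERSON", "WORKED_ON", "Google", "COMPANY"),
    ("Google", "COMPANY", "HAS", "Gemini", "PRODUCT"),
]


def group_edges_by_type(triplets):
    """Group (source id, target id) pairs under an assumed edge type name."""
    grouped = defaultdict(list)
    for src_id, src_label, relation, dst_id, dst_label in triplets:
        # Assumed naming rule, inferred from the insert log.
        edge_type = f"{src_label}_{relation}_{dst_label}"
        grouped[edge_type].append((src_id, dst_id))
    return dict(grouped)


edges_by_type = group_edges_by_type(triplets)
for edge_type, pairs in edges_by_type.items():
    print(edge_type, pairs)
```

With the default schema, each distinct key in `edges_by_type` would correspond to one edge table, which is why the log reports one insert step per edge type.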

#### Query the graph
Use `structured_query` to run a graph query that traverses the graph in the graph store.

In [33]:
sample_query = """
  MATCH (n WHERE REGEXP_CONTAINS(n.id, 'Google')) -[e]-{1, 2} (m)
  RETURN ARRAY_AGG(DISTINCT m.id) AS google_related_nodes
"""

print(graph_store.structured_query(sample_query))

[{'google_related_nodes': ['Maps', 'Granite Systems', 'Google Search engine', 'AdSense for Mobile', 'email service', 'Google Docs', 'women', 'Marissa Mayer', 'Google Cloud Platform', 'Sun Microsystems', 'Matt Brittin', 'UniSuper', 'Unit 8200', 'BackRub', 'DoubleClick', 'Google Nest', 'suggestion feature', 'search engine', 'Google Ads', 'Andy Rubin', 'Google Home Mini', 'Larry Page', 'Israeli Defense Forces', 'YouTube', 'Gemini', 'navigation service', 'AdWords', 'Assaf Rappaport', 'Imagen', 'Yahoo!', 'Drive', 'electricity', 'operating system', 'Ubisoft', 'rooftop photovoltaic power station', 'Meet', 'Competitive Enterprise Institute', 'Craig Silverstein', 'Incognito browsing mode', 'Mathematical Sciences Research Institute', 'Google search', 'Ron Conway', 'Translate', 'National Labor Relations Board', 'Sequoia Capital', 'Google Sheets', 'Google Drive', 'NotebookLM', 'mapping service', 'Israel', '114 megawatts of power', 'SynthID Detector', 'Take-Two', 'AdSense', 'David Cheriton', 'Jungl
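
As the output above shows, `structured_query` returns a list of dicts keyed by the column alias. Here is a sketch of post-processing such a result offline. The `rows` literal copies a few values from the output above instead of calling Spanner, and the helper `names_with_prefix` is hypothetical.

```python
# A few values copied from the query output above.
rows = [
    {
        "google_related_nodes": [
            "Maps",
            "Granite Systems",
            "Google Docs",
            "Marissa Mayer",
            "Google Drive",
            "YouTube",
        ]
    }
]


def names_with_prefix(rows, column, prefix):
    """Collect the values of an array column that start with the given prefix."""
    names = []
    for row in rows:
        names.extend(name for name in row[column] if name.startswith(prefix))
    return sorted(names)


google_named = names_with_prefix(rows, "google_related_nodes", "Google")
print(google_named)
```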

#### Visualize the graph

In [None]:
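from pyvis.network import Network

# `display` and `HTML` are imported from `IPython.display`;
# importing them from `IPython.core.display` is deprecated.
from IPython.display import display, HTML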

net = Network(
    notebook=True,
    cdn_resources="remote",
    bgcolor="#222222",
    font_color="white",
    height="500px",
    width="50%",
    directed=True,
)

node_query = """
  MATCH (n)
  RETURN n.id
"""

edge_query = """
  MATCH -[e]->
  RETURN e.id AS src_id, e.target_id AS dst_id, labels(e)[0] AS label
"""

nodes = graph_store.structured_query(node_query)
edges = graph_store.structured_query(edge_query)

net.add_nodes([node["id"] for node in nodes])
for edge in edges:
    net.add_edge(edge["src_id"], edge["dst_id"], title=edge["label"])

net.toggle_physics(True)
net.show("graph.html")
display(HTML("graph.html"))
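
pyvis's `add_edge` raises an `AssertionError` when either endpoint has not been added as a node, so it can help to separate out such edges before drawing. Below is a sketch that assumes rows shaped like the query results above. The sample rows and the helper `split_valid_edges` are hypothetical.

```python
def split_valid_edges(node_ids, edges):
    """Split edges into those whose endpoints are both known nodes and the rest."""
    known = set(node_ids)
    valid, dangling = [], []
    for edge in edges:
        if edge["src_id"] in known and edge["dst_id"] in known:
            valid.append(edge)
        else:
            dangling.append(edge)
    return valid, dangling


# Hypothetical rows in the same shape as the node and edge query results.
sample_nodes = [{"id": "Google"}, {"id": "YouTube"}]
sample_edges = [
    {"src_id": "Google", "dst_id": "YouTube", "label": "COMPANY_HAS_PRODUCT"},
    {"src_id": "Google", "dst_id": "Missing node", "label": "COMPANY_HAS_PRODUCT"},
]

valid, dangling = split_valid_edges([n["id"] for n in sample_nodes], sample_edges)
print(len(valid), "drawable edges;", len(dangling), "skipped")
```

Only the `valid` edges would then be passed to `net.add_edge`, and the `dangling` list can be inspected to find out why an endpoint is missing.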

#### Clean up the graph

> USE IT WITH CAUTION!

Clean up all the nodes/edges in your graph and remove your graph definition.

In [29]:
graph_store.clean_up()

Waiting for DDL operations to complete...
Waiting for DDL operations to complete...
Waiting for DDL operations to complete...
