# 1. Graph construction
This notebook covers how to construct a Kùzu graph from data stored in multiple Parquet files. We use
Parquet here for simplicity, but the data could just as well have come from
external databases, such as DuckDB or PostgreSQL tables. See our documentation on [RDBMS extensions](https://docs.kuzudb.com/extensions/rdbms/)
to set up such integrations in your workflows.

## Create a database and start a connection
We can start by creating an empty Kùzu database and opening a connection to it.

In [1]:
import shutil
from pathlib import Path

import kuzu
import polars as pl

# Start from a clean slate: remove any database directory left over from a previous run
shutil.rmtree("db", ignore_errors=True)
db = kuzu.Database("db")
conn = kuzu.Connection(db)
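
The cleanup step above assumes the database lives in a directory. Depending on the Kùzu release, an on-disk database may instead be a single file, which `shutil.rmtree` will not remove. A minimal sketch of a reset helper that handles both cases follows; `reset_db_path` and the demo file names are hypothetical and not part of Kùzu.

```python
import shutil
import tempfile
from pathlib import Path


def reset_db_path(path: str) -> None:
    """Delete whatever is stored at `path`: a directory tree, a single file, or nothing."""
    p = Path(path)
    if p.is_dir():
        shutil.rmtree(p)
    elif p.exists():
        p.unlink()


# Demo on throwaway paths inside a temp directory, not on the notebook's real "db".
with tempfile.TemporaryDirectory() as tmp:
    as_dir = Path(tmp) / "db_as_dir"
    as_dir.mkdir()
    (as_dir / "placeholder.bin").write_text("stale")  # made-up file name
    as_file = Path(tmp) / "db_as_file"
    as_file.write_text("stale")

    # The third path never existed; the helper should simply do nothing for it.
    for target in (as_dir, as_file, Path(tmp) / "never_created"):
        reset_db_path(str(target))

    leftovers = sorted(p.name for p in Path(tmp).iterdir())

print(leftovers)
```

In the notebook, this would be called as `reset_db_path("db")` right before `kuzu.Database("db")`.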

## Inspect the data

We can scan the reviews file with `LOAD FROM` to get a sense of the data we are working with.

In [2]:
DATA_PATH = "./data/final"
df = conn.execute(f"LOAD FROM '{DATA_PATH}/winemag-reviews.parquet' RETURN *").get_as_pl()
df.head()

id,title,country,description,variety,points,price,region_1,taster_name,taster_twitter_handle
i64,str,str,str,str,i64,f64,str,str,str
0,"""Nicosia 2013 Vulkà Bianco (Et…","""Italy""","""Aromas include tropical fruit,…","""White Blend""",87,,"""Etna""","""Kerin O’Keefe""","""@kerinokeefe"""
1,"""Quinta dos Avidagos 2011 Avida…","""Portugal""","""This is ripe and fruity, a win…","""Portuguese Red""",87,15.0,,"""Roger Voss""","""@vossroger"""
2,"""Rainstorm 2013 Pinot Gris (Wil…","""US""","""Tart and snappy, the flavors o…","""Pinot Gris""",87,14.0,"""Willamette Valley""","""Paul Gregutt""","""@paulgwine """
3,"""St. Julian 2013 Reserve Late H…","""US""","""Pineapple rind, lemon pith and…","""Riesling""",87,13.0,"""Lake Michigan Shore""","""Alexander Peartree""",
4,"""Sweet Cheeks 2012 Vintner's Re…","""US""","""Much like the regular bottling…","""Pinot Noir""",87,65.0,"""Willamette Valley""","""Paul Gregutt""","""@paulgwine """


In [3]:
(
    df.group_by(["country", "variety"])
    .agg(pl.mean("price"))
    .drop_nulls()
    .sort("price", descending=False)
).head()

country,variety,price
str,str,f64
"""Ukraine""","""Rosé""",6.0
"""Ukraine""","""Cabernet Sauvignon""",6.0
"""Portugal""","""Trajadura""",7.0
"""Chile""","""Moscato""",7.0
"""Romania""","""Moscato""",7.2


In [4]:
import duckdb

In [5]:
tbl = duckdb.sql(
    """
    SELECT country, variety, AVG(price) AS avg_price
    FROM df
    GROUP BY country, variety
    ORDER BY avg_price ASC
    LIMIT 5
    """
)
tbl

┌──────────┬────────────────────┬───────────┐
│ country  │      variety       │ avg_price │
│ varchar  │      varchar       │  double   │
├──────────┼────────────────────┼───────────┤
│ Ukraine  │ Cabernet Sauvignon │       6.0 │
│ Ukraine  │ Rosé               │       6.0 │
│ NULL     │ Riesling           │       6.0 │
│ Chile    │ Moscato            │       7.0 │
│ Portugal │ Trajadura          │       7.0 │
└──────────┴────────────────────┴───────────┘

## Graph data modeling

The following raw data files are available in the `data/final/` directory. They describe
customers, the wines they purchased from the reviews dataset, the reviewers they follow, and the
countries they live in, alongside the original wine reviews from the previous section.

```
.
├── final
    ├── customers.parquet
    ├── follows.parquet
    ├── lives_in.parquet
    ├── purchases.parquet
    ├── tasted.parquet
    ├── tasters.parquet
    └── winemag-reviews.parquet
```

Some of these are structured as node files, with each column representing the node's properties.
Others are structured as edge files, with the first and second columns representing the source (FROM)
and target (TO) nodes, respectively. The files are shown here in Parquet format, but they could
just as well have been sitting in a relational database or data lake.

Our goal is to use this data to construct a graph with the following nodes and relationships:

<img src="./assets/graph_schema_wines.png" height=300/>

We first define the graph schema, starting with the node tables and their properties. The relationship tables are created further below, once the nodes are loaded.

In [6]:
# Create customer node table
def create_customer_node_table(conn: kuzu.Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Customer(
                customer_id INT64,
                name STRING,
                age INT64,
                PRIMARY KEY (customer_id)
            )
        """
    )

# Create taster node table
def create_taster_node_table(conn: kuzu.Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Taster(
                taster_twitter_handle STRING,
                taster_name STRING,
                taster_id STRING,
                PRIMARY KEY (taster_id)
            )
        """
    )

# Create wine node table
def create_wine_node_table(conn: kuzu.Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Wine(
                id INT64,
                title STRING,
                country STRING,
                description STRING,
                variety STRING,
                points INT64,
                price DOUBLE,
                state STRING,
                taster_name STRING,
                taster_twitter_handle STRING,
                PRIMARY KEY (id)
            )
        """
    )

# Create country node table
def create_country_node_table(conn: kuzu.Connection) -> None:
    conn.execute(
        """
        CREATE NODE TABLE
            Country(
                country STRING,
                PRIMARY KEY (country)
            )
        """
    )

In [7]:
# Run node table creation
create_customer_node_table(conn)
create_wine_node_table(conn)
create_taster_node_table(conn)
create_country_node_table(conn)

## Insert data into the graph
Once the tables are created, it's time to insert the data into the node and relationship tables.
No Python for-loops are needed: the `COPY` command in Cypher handles this and is
the fastest way to bulk-insert data into a node or relationship table.

In [8]:
# Insert nodes into graph
conn.execute("COPY Customer FROM 'data/final/customers.parquet'");
conn.execute("COPY Wine FROM 'data/final/winemag-reviews.parquet'");
conn.execute("COPY Taster FROM 'data/final/tasters.parquet'");
conn.execute("COPY Country FROM (LOAD FROM 'data/final/winemag-reviews.parquet' WHERE country IS NOT NULL RETURN DISTINCT country)");

In [9]:
# Check number of wine nodes
conn.execute("MATCH (w:Wine) RETURN count(w) AS num_wines").get_as_pl()

num_wines
i64
129971


In [10]:
# Check number of customer nodes
conn.execute("MATCH (c:Customer) RETURN count(c) AS num_customers").get_as_pl()

num_customers
i64
25


In a similar way, we can create the relationship tables and insert the necessary data into them.
Note that the final relationship table, `IsFrom`, needs no separate edge file: its source and target keys (`id` and `country`) come directly from the `winemag-reviews.parquet` file,
using a `LOAD FROM` subquery whose predicate filters out rows with no country.

In [11]:
# Create relationship tables
conn.execute("CREATE REL TABLE LivesIn(FROM Customer TO Country)");
conn.execute("CREATE REL TABLE Purchased(FROM Customer TO Wine)");
conn.execute("CREATE REL TABLE Follows(FROM Customer TO Taster)");
conn.execute("CREATE REL TABLE Tasted(FROM Taster TO Wine)");
conn.execute("CREATE REL TABLE IsFrom(FROM Wine TO Country)");

# Insert relationships into graph
conn.execute("COPY LivesIn FROM 'data/final/lives_in.parquet'");
conn.execute("COPY Purchased FROM 'data/final/purchases.parquet'");
conn.execute("COPY Follows FROM 'data/final/follows.parquet'");
conn.execute("COPY Tasted FROM 'data/final/tasted.parquet'");
conn.execute("COPY IsFrom FROM (LOAD FROM 'data/final/winemag-reviews.parquet' WHERE country IS NOT NULL RETURN id, country)");

## Query graph
We can now run some queries that ask questions of the connected data.

In [12]:
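# Purchases of wines reviewed by a particular reviewer. Note: count(*) counts every matching
# customer-wine path; use count(DISTINCT c.customer_id) to count unique customers instead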
conn.execute(
    """
    MATCH (c:Customer)-[p:Purchased]->(w:Wine)<-[t:Tasted]-(r:Taster)
    WHERE r.taster_name = "Kerin O’Keefe"
    RETURN count(*) AS num_customers
    """
).get_as_pl()

num_customers
i64
4


In [13]:
# Top 3 reviewers by number of wines reviewed
conn.execute(
    """
    MATCH (t:Taster)-[r:Tasted]->(w:Wine)
    RETURN t.taster_name, count(r) AS num_reviews
    ORDER BY num_reviews DESC LIMIT 3
    """
).get_as_pl()

t.taster_name,num_reviews
str,i64
"""Roger Voss""",25514
"""Michael Schachner""",15134
"""Kerin O’Keefe""",10776


## Visualize the graph schema
We can also inspect the graph visually using [Kùzu Explorer](https://docs.kuzudb.com/visualization/)
and run more complex queries to answer questions on customer-taster-wine relationships.

Use the provided compose file to start Kùzu Explorer in Docker and connect to the database.

```bash
docker compose up
```