# Performance comparison

In this notebook, we briefly demonstrate that ingesting data with `COPY FROM` is more than an
order of magnitude faster than ingesting it with Cypher's `CREATE` statement. The difference in
performance arises because we have a separate query processing pipeline specialized for
`COPY`, which assumes that a large amount of data is being inserted.

In [1]:
import kuzu
import shutil

## Generate some mock data

You can generate mock data for this test case by running the following cells.
The data is written to `data/person_profiles.csv` and contains mock person profiles and their metadata.
Just like real-world data, it combines integers, floats and short- and long-form strings.

To run the cells below, install the `faker` and `polars` libraries within your Python environment:

```bash
uv pip install faker polars
```

In [2]:
# Pre-collect the entire list of first names from the Faker library
import random
from faker.providers.person.en import Provider

SEED = 37
random.seed(SEED)

first_names = list(set(Provider.first_names))
random.shuffle(first_names)
first_names[:10]

['Ester',
 'Aleck',
 'Birtha',
 'Chantel',
 'Sydnie',
 'Lisandro',
 'Caitlyn',
 'Normand',
 'Stephania',
 'Berton']

In [3]:
# uv pip install faker polars
from faker import Faker
import polars as pl

Faker.seed(SEED)
fake = Faker()

NUM_RECORDS = 5000
OUTPUT_PATH = "data/person_profiles.csv"

def generate_person_profiles(num: int) -> None:
    profiles = []
    # Loop over the `num` argument rather than the global NUM_RECORDS constant
    for i in range(1, num + 1):
        profile = dict()
        profile["id"] = i
        profile["name"] = first_names[i]
        profile["age"] = fake.random_int(min=18, max=75)
        profile["net_worth"] = fake.pyfloat(positive=True, min_value=10245, max_value=100_321_251)
        profile["email"] = f"{fake.domain_word()}@{fake.free_email_domain()}"
        profile["address"] = fake.address().replace("\n", ", ")
        profile["phone"] = fake.phone_number()
        profile["comments"] = fake.text(max_nb_chars=200)
        profiles.append(profile)
    print(f"---\nGenerated {num} synthetic person profiles.\n")
    # Output to CSV file using Polars
    df = pl.DataFrame(profiles)
    df.write_csv(OUTPUT_PATH, separator="|")

generate_person_profiles(NUM_RECORDS)


---
Generated 5000 synthetic person profiles.



## Create a Kùzu database

Just as in the other notebook, we will create a Kùzu database and start a connection.

In [4]:
DB_NAME = "./db_large"
shutil.rmtree(DB_NAME, ignore_errors=True)
db = kuzu.Database(DB_NAME)
conn = kuzu.Connection(db)

## Create a Kùzu table

A node table is created with the below schema that matches the columns in the CSV file.

In [5]:
def create_node_table(name: str) -> None:
    conn.execute(
        f"""
        CREATE NODE TABLE {name} (
            id STRING,
            name STRING,
            age INT64,
            net_worth DOUBLE,
            email STRING,
            address STRING,
            phone STRING,
            comments STRING,
            PRIMARY KEY (id)
        )
        """
    )

create_node_table("Person")


## Read CSV data

The CSV data is read into a list of dicts in Python.

In [6]:
import csv

def read_csv(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f, delimiter="|")
        for line in reader:
            data.append(line)
    return data

records = read_csv(OUTPUT_PATH)

In [7]:
len(records)

5000
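
Note that `csv.DictReader` returns every field as a string, which is why the numeric columns are cast explicitly before being passed as query parameters in the next section. A minimal sketch of doing that cast up front is shown below; the `cast_record` helper and the `sample` row are hypothetical and not part of the original notebook.

```python
def cast_record(record: dict[str, str]) -> dict[str, object]:
    """Return a copy of a CSV row with its numeric columns converted.

    Hypothetical helper: `age` and `net_worth` are cast to match the INT64 and
    DOUBLE columns of the node table; all other fields stay as strings.
    """
    typed: dict[str, object] = dict(record)
    typed["age"] = int(record["age"])
    typed["net_worth"] = float(record["net_worth"])
    return typed


# Made-up row shaped like one item returned by csv.DictReader
sample = {"id": "1", "name": "Alice", "age": "42", "net_worth": "10245.5"}
typed_sample = cast_record(sample)
print(typed_sample)
```

The original row is left unmodified, so the helper can be applied to `records` with a list comprehension without side effects.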

## Method 1: Use `CREATE` to ingest the nodes

The most naive way to ingest nodes into Kùzu via Cypher is using the `CREATE` clause. In this case,
we iterate through the list of records and create a node for each record.

The `CREATE` clause adds a node only if no node with the same primary key value already exists; if one exists, a runtime error is raised. The related `MERGE` clause instead matches an existing node with that primary key value or creates the node if it is missing, and it can also update the properties of a matched node.

In [8]:
%%time

conn.execute("BEGIN TRANSACTION")
for record in records:
    conn.execute(
        """
        CREATE (person:Person {id: $id})
        SET person.name = $name,
            person.age = $age,
            person.net_worth = $net_worth,
            person.email = $email,
            person.address = $address,
            person.phone = $phone,
            person.comments = $comments
        """,
        parameters={
            "id": record["id"],
            "name": record["name"],
            "age": int(record["age"]),
            "net_worth": float(record["net_worth"]),
            "email": record["email"],
            "address": record["address"],
            "phone": record["phone"],
            "comments": record["comments"],
        }
    )
conn.execute("COMMIT")

CPU times: user 1.67 s, sys: 876 ms, total: 2.55 s
Wall time: 1.77 s


<kuzu.query_result.QueryResult at 0x10aa67b60>

Note the wall time of the above cell: about 1.8 seconds to insert 5,000 nodes one statement at a time.
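
For comparison, the `MERGE` clause mentioned earlier in this section could be used for an upsert along the following lines. This is only a sketch: the `upsert_person` helper is hypothetical, it sets just two properties for brevity, and it assumes that `MERGE` accepts `ON CREATE SET` and `ON MATCH SET` sub-clauses as in openCypher. The `conn` argument is expected to behave like the connection used above, that is, to offer `execute(query, parameters=...)`.

```python
# Cypher text for the upsert; ON CREATE / ON MATCH support is an assumption
MERGE_QUERY = """
    MERGE (person:Person {id: $id})
    ON CREATE SET person.name = $name, person.age = $age
    ON MATCH SET person.name = $name, person.age = $age
"""


def upsert_person(conn, record: dict) -> None:
    """Create the Person node if its id is new, otherwise update it.

    Hypothetical helper, not part of the original notebook.
    """
    conn.execute(
        MERGE_QUERY,
        parameters={
            "id": record["id"],
            "name": record["name"],
            "age": int(record["age"]),  # CSV values arrive as strings
        },
    )
```

Calling `upsert_person(conn, record)` inside the same loop would avoid the duplicate-key error on a rerun, at the cost of an extra existence check per node.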

## Method 2: Use `COPY FROM` to ingest the nodes

The next step is to perform the same task using the `COPY FROM` statement. As the timing numbers below show, it finishes in tens of milliseconds, whereas the individual `CREATE` statements took well over a second.

In [9]:
# Drop the table and recreate it
conn.execute("DROP TABLE Person")
create_node_table("Person")

In [10]:
%%time
conn.execute("COPY Person FROM 'data/person_profiles.csv' (header = true, delim = '|', parallel = false)");

CPU times: user 22.7 ms, sys: 58.8 ms, total: 81.5 ms
Wall time: 31.2 ms
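
To put the two measurements side by side, the speedup can be computed directly from the wall times reported above. The numbers below are copied from this particular run and will differ across machines and dataset sizes; the variable names are only illustrative.

```python
NUM_RECORDS = 5000

create_wall_s = 1.77       # Wall time of the CREATE loop, in seconds
copy_wall_s = 31.2 / 1000  # Wall time of COPY FROM, converted from milliseconds

# Ratio of the two wall times
speedup = create_wall_s / copy_wall_s
print(f"COPY FROM was about {speedup:.0f}x faster in this run")

# Average cost per inserted node, in microseconds
create_us = create_wall_s / NUM_RECORDS * 1e6
copy_us = copy_wall_s / NUM_RECORDS * 1e6
print(f"CREATE: {create_us:.0f} us per node, COPY FROM: {copy_us:.2f} us per node")
```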
