# Pixeltable Fundamentals

## Part 1: Tables and Data Operations

In this tutorial, we'll learn the basics of creating tables and manipulating data in Pixeltable. This is Part 1 of the [Pixeltable Fundamentals](...) tutorial.

First, let's ensure the Pixeltable library is installed in your environment.

In [None]:
%pip install -q pixeltable

### Tables

All data in Pixeltable is stored in tables. At a high level, a Pixeltable table behaves similarly to an ordinary SQL database table, but with many additional capabilities to support complex AI workflows. We'll introduce those advanced capabilities gradually throughout this series of tutorials; for now, the focus is on basic table and data operations.

Tables in Pixeltable are grouped into __directories__, which are simply user-defined namespaces. The following command creates a new directory, `fundamentals`, which we'll use to store the tables in our tutorial.

In [1]:
import pixeltable as pxt

# First we drop the `fundamentals` directory and all its contents,
# in order to ensure a clean environment for the tutorial.
pxt.drop_dir('fundamentals', force=True)

# Now we create the directory.
pxt.create_dir('fundamentals')

Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `fundamentals`.


<pixeltable.catalog.dir.Dir at 0x3451b7e20>

Now let's create our first table. To create a table, we must give it a name and a __schema__ that describes the table structure. Note that prefacing the name with `fundamentals` causes it to be placed in our newly-created directory.

In [2]:
films_t = pxt.create_table('fundamentals.films', {
    'film_name': pxt.StringType(),
    'year': pxt.IntType(),
    'revenue': pxt.FloatType()
})

Created table `films`.


To insert data into a table, we use the `insert()` method, passing it a list of Python dicts.

In [3]:
films_t.insert([
    {'film_name': 'Jurassic Park', 'year': 1993, 'revenue': 1037.5},
    {'film_name': 'Titanic', 'year': 1997, 'revenue': 2257.8},
    {'film_name': 'Avengers: Endgame', 'year': 2019, 'revenue': 2797.5}
])

Computing cells:   0%|                                                    | 0/3 [00:00<?, ? cells/s]
Inserting rows into `films`: 3 rows [00:00, 1051.56 rows/s]
Computing cells: 100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 553.90 cells/s]
Inserted 3 rows with 0 errors.


UpdateStatus(num_rows=3, num_computed_values=3, num_excs=0, updated_cols=[], cols_with_excs=[])

If you're inserting just a single row, you can use an alternate syntax that is sometimes more convenient.

In [4]:
films_t.insert(film_name='Inside Out 2', year=2024, revenue=1462.7)

Computing cells:   0%|                                                    | 0/1 [00:00<?, ? cells/s]
Inserting rows into `films`: 1 rows [00:00, 1148.50 rows/s]
Computing cells: 100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 509.39 cells/s]
Inserted 1 row with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

We can peek at the data in our table with the `head()` method.

In [5]:
films_t.head()

film_name,year,revenue
Jurassic Park,1993,1037.5
Titanic,1997,2257.8
Avengers: Endgame,2019,2797.5
Inside Out 2,2024,1462.7


Pixeltable keeps track of the insertion order of data in its tables, and `head()` will always return the _earliest_ rows in the table. By default, it returns (at most) 10 rows, but you can specify a different value:

In [6]:
films_t.head(2)

film_name,year,revenue
Jurassic Park,1993,1037.5
Titanic,1997,2257.8


To see the _most recently inserted_ rows in a table, use `tail()`.

In [7]:
films_t.tail(2)

film_name,year,revenue
Avengers: Endgame,2019,2797.5
Inside Out 2,2024,1462.7


### Filtering and Selecting Data

Often you want to select only certain data or certain columns in a table. You can do this with the `where()` and `select()` methods.

In [8]:
films_t.where(films_t.revenue >= 2000.0).head()

film_name,year,revenue
Titanic,1997,2257.8
Avengers: Endgame,2019,2797.5


In [9]:
films_t.select(films_t.film_name, films_t.year).head()

film_name,year
Jurassic Park,1993
Titanic,1997
Avengers: Endgame,2019
Inside Out 2,2024


Note the expressions that appear inside the calls to `where()` and `select()`, such as `films_t.year`. These are __column references__ that point to specific columns within a table. In place of `films_t.year`, you can also use dictionary syntax and type `films_t['year']`, which means exactly the same thing but is sometimes more convenient.

In [10]:
films_t.select(films_t['film_name'], films_t['year']).head()

film_name,year
Jurassic Park,1993
Titanic,1997
Avengers: Endgame,2019
Inside Out 2,2024


In addition to selecting columns directly, you can use column references inside various kinds of expressions. For example, our `revenue` numbers are given in millions of dollars. Let's say we wanted to select revenue in thousands of dollars instead; we could do that as follows:

In [11]:
films_t.select(films_t.film_name, films_t.revenue * 1000).head()

film_name,col_1
Jurassic Park,1037500.0
Titanic,2257800.0
Avengers: Endgame,2797500.0
Inside Out 2,1462700.0


Note that since we selected an abstract expression rather than a specific column, Pixeltable gave it the generic name `col_1`. You can assign it a more informative name with Python keyword syntax:

In [12]:
films_t.select(films_t.film_name, revenue_thousands=films_t.revenue * 1000).head()

film_name,revenue_thousands
Jurassic Park,1037500.0
Titanic,2257800.0
Avengers: Endgame,2797500.0
Inside Out 2,1462700.0


### Tables are Persistent

This is a good time to mention a few key differences between Pixeltable tables and other familiar data structures, such as Python dicts or Pandas dataframes.

First, **Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database**. When you reset a notebook kernel or start a new Python session, you'll have access to all the data you've stored previously in Pixeltable. Let's demonstrate this by using the IPython `%reset -f` command to clear out all our notebook variables, so that `films_t` is no longer defined.

In [13]:
%reset -f
films_t.head()  # Throws an exception now

NameError: name 'films_t' is not defined

The `films_t` variable is no longer defined, but that's OK, because it wasn't the source of record for our data. The `films_t` variable is just a reference to the underlying database table. We can recover it with the `get_table` command, referencing the `films` table by name.

In [14]:
import pixeltable as pxt

films_t = pxt.get_table('fundamentals.films')
films_t.head()

film_name,year,revenue
Jurassic Park,1993,1037.5
Titanic,1997,2257.8
Avengers: Endgame,2019,2797.5
Inside Out 2,2024,1462.7


You can always get a list of all existing tables with the Pixeltable `list_tables` command.

In [15]:
pxt.list_tables()

['fundamentals.films']

<div class="alert alert-block alert-info">
Note that if you're running Pixeltable on Colab or Kaggle, the database will persist only as long as your Colab/Kaggle session. If you're running it locally or on your own server, then your database will persist indefinitely (until you actively delete it).
</div>

### Tables are Typed

The second major difference is that **Pixeltable is strongly typed**. Because Pixeltable is a database, every column has a data type: that's why we specified `StringType`, `IntType`, and `FloatType` for the three columns when we created the table. These __type specifiers__ are _mandatory_ when creating tables, and they become part of the table schema. You can always see the table schema with the `describe()` method.

In [16]:
films_t.describe()

Column Name,Type,Computed With
film_name,string,
year,int,
revenue,float,


Besides the column names and types, there's a third element to the schema, `Computed With`. We'll explain what this means in the next section of the tutorial, [Computed Columns](...).

All of the methods we've discussed so far, such as `insert()` and `get_table()`, are documented in the [Pixeltable API](https://pixeltable.github.io/pixeltable/) Documentation. The following pages are particularly relevant to this section of the tutorial:
- [pixeltable](https://pixeltable.github.io/pixeltable/api/pixeltable/) package reference
- [pxt.Table](https://pixeltable.github.io/pixeltable/api/table/) class reference
- [API Cheat Sheet](https://pixeltable.github.io/pixeltable/api-cheat-sheet/)

### A Real-World Example: Earthquake Data

Now let's dive a little deeper into Pixeltable's data operations. To showcase all the features, it'll be helpful to have a real-world dataset, rather than our toy dataset with four movies. The dataset we'll be using consists of earthquake data drawn from the US Geological Survey: all recorded earthquakes that occurred within 100 km of Seattle, Washington, between January 1, 2023 and June 30, 2024.

The dataset is in CSV format, and we can easily load it into Pixeltable with the handy `import_csv()` function, which creates a new Pixeltable table from the contents of a CSV file.

In [17]:
eq_t = pxt.io.import_csv(
    'fundamentals.earthquakes',  # Name for the new table
    'https://raw.githubusercontent.com/aaron-siegel/pixeltable/fundamentals-tutorial/docs/source/data/earthquakes.csv',
    parse_dates=[3]  # Interpret column 3 as a timestamp
)

Created table `earthquakes`.
Computing cells:   0%|                                                 | 0/1823 [00:00<?, ? cells/s]
Inserting rows into `earthquakes`: 0 rows [00:00, ? rows/s]
Inserting rows into `earthquakes`: 1823 rows [00:00, 12150.16 rows/s]
Computing cells: 100%|███████████████████████████████████| 1823/1823 [00:00<00:00, 11744.85 cells/s]
Inserted 1823 rows with 0 errors.


<div class="alert alert-block alert-info">
In Pixeltable, you can always import external data by giving a URL instead of a local file path. This applies to CSV datasets, media files (such as images and video), and other types of content. The URL will often be an http:// URL, but it can also be an s3:// URL referencing an S3 bucket.
</div>

Let's have a peek at our new dataset.

In [18]:
eq_t.head()

id,magnitude,location,timestamp,longitude,latitude
0,1.15,"10 km NW of Belfair, Washington",2023-01-01 08:10:37.050,-122.93,47.51
1,0.29,"23 km ENE of Ashford, Washington",2023-01-02 01:02:43.950,-121.76,46.85
2,0.2,"23 km ENE of Ashford, Washington",2023-01-02 12:05:01.420,-121.75,46.86
3,0.52,"15 km NNE of Ashford, Washington",2023-01-02 12:45:14.220,-121.95,46.89
4,1.56,"0 km WSW of Esperance, Washington",2023-01-02 13:19:27.200,-122.36,47.79
5,0.72,"8 km ESE of Buckley, Washington",2023-01-03 05:51:24.760,-121.93,47.14
6,1.9,"1 km ENE of Enetai, Washington",2023-01-03 05:56:23.780,-122.58,47.59
7,1.36,"6 km SE of Black Diamond, Washington",2023-01-03 06:34:56.990,-121.93,47.28
8,1.47,"2 km NE of Kapowsin, Washington",2023-01-03 20:41:01.760,-122.2,47.01
9,0.97,"4 km NW of Picnic Point, Washington",2023-01-03 22:40:42.400,-122.36,47.91


And the schema:

In [19]:
eq_t.describe()

Column Name,Type,Computed With
id,int,
magnitude,float,
location,string,
timestamp,timestamp,
longitude,float,
latitude,float,


Note that while specifying a schema is mandatory when _creating_ a table, it's not always required when _importing_ data. This is because Pixeltable uses the structure of the imported data to infer the column types, when feasible. You can always override the inferred column types with the `schema_overrides` parameter of `import_csv()`.

The following examples showcase some common data operations.

In [20]:
eq_t.count()  # Number of rows in the table

1823

In [21]:
# 10 highest-magnitude earthquakes

eq_t.order_by(eq_t.magnitude, asc=False).head()

id,magnitude,location,timestamp,longitude,latitude
1002,4.3,"Port Townsend, WA",2023-10-09 02:21:08.960,-122.73,48.04
1226,4.04,"6 km W of Quilcene, Washington",2023-12-24 15:14:04.220,-122.96,47.82
699,3.91,"9 km NNE of Snoqualmie, Washington",2023-08-08 10:17:23.910,-121.77,47.6
1281,3.48,"7 km SSW of River Road, Washington",2024-01-15 07:25:05.920,-123.17,48.0
1355,3.42,"17 km WSW of Brinnon, Washington",2024-02-16 16:30:18.830,-123.09,47.59
1367,3.38,"13 km E of Lake Marcel-Stillwater, Washington",2024-02-22 13:02:49.580,-121.73,47.67
1123,3.32,"2 km SSE of Stanwood, Washington",2023-11-17 00:47:40.540,-122.36,48.22
1286,3.26,"1 km NNE of Coupeville, Washington",2024-01-18 03:47:41.020,-122.68,48.24
1408,3.12,"7 km WSW of Snoqualmie, Washington",2024-03-07 01:48:35.340,-121.91,47.49
455,3.11,"6 km N of Camano, Washington",2023-05-23 07:21:34.690,-122.52,48.23


In [22]:
# 10 highest-magnitude earthquakes between June 1 and September 30, 2023

eq_t.where((eq_t.timestamp >= '2023-06-01') & (eq_t.timestamp < '2023-10-01')) \
  .order_by(eq_t.magnitude, asc=False).head()

id,magnitude,location,timestamp,longitude,latitude
699,3.91,"9 km NNE of Snoqualmie, Washington",2023-08-08 10:17:23.910,-121.77,47.6
799,2.86,"5 km E of Ashford, Washington",2023-08-27 10:10:23.770,-121.96,46.77
710,2.84,"8 km ENE of Fall City, Washington",2023-08-08 11:51:12.750,-121.79,47.6
577,2.79,"0 km NE of Maple Valley, Washington",2023-07-04 15:52:54.430,-122.04,47.4
769,2.73,"16 km NE of Ashford, Washington",2023-08-22 23:44:12.250,-121.88,46.87
606,2.58,"Puget Sound region, Washington",2023-07-11 18:12:58.050,-122.92,47.53
584,2.57,"20 km S of Friday Harbor, Washington",2023-07-07 09:12:34.430,-123.0,48.35
926,2.5,"9 km NNE of Snoqualmie, Washington",2023-09-17 07:08:54.480,-121.76,47.6
643,2.42,"11 km SW of Brinnon, Washington",2023-07-21 17:39:03.540,-123.0,47.61
509,2.35,"21 km NE of Ashford, Washington",2023-06-08 22:34:11.200,-121.87,46.91


Note that Pixeltable uses Pandas-like operators for filtering data:

```
(eq_t.timestamp >= '2023-06-01') & (eq_t.timestamp < '2023-10-01')
```

means _both_ conditions must be true; similarly (say),

```
(eq_t.timestamp < '2023-06-01') | (eq_t.timestamp >= '2024-01-01')
```

would mean _either_ condition must be true.
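The parentheses around each comparison matter. In Python, `&` and `|` bind more tightly than comparison operators such as `>=` and `<`, so an unparenthesized filter is grouped differently from what was intended. The sketch below illustrates this with a small stand-in; `Col` and `Expr` are hypothetical classes written for this example and are not Pixeltable's implementation.

```python
class Expr:
    """Toy predicate that records how Python grouped the operators."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Expr(f'({self.text} AND {other.text})')

    def __or__(self, other):
        return Expr(f'({self.text} OR {other.text})')

    def __repr__(self):
        return self.text


class Col:
    """Toy column reference (hypothetical; stands in for something like eq_t.timestamp)."""

    def __init__(self, name):
        self.name = name

    def __ge__(self, value):
        return Expr(f'{self.name} >= {value!r}')

    def __lt__(self, value):
        return Expr(f'{self.name} < {value!r}')


ts = Col('timestamp')

# With parentheses, each comparison is evaluated first, then combined.
both = (ts >= '2023-06-01') & (ts < '2023-10-01')
either = (ts < '2023-06-01') | (ts >= '2024-01-01')
print(both)
print(either)

# Without parentheses, Python evaluates '2023-06-01' & ts first.
# In this toy model that fails, because str and Col do not support `&` together.
unparenthesized_failed = False
try:
    ts >= '2023-06-01' & ts < '2023-10-01'
except TypeError:
    unparenthesized_failed = True
print(unparenthesized_failed)
```

How a real library reacts to the unparenthesized form depends on which operators it overloads, so the safest habit is to wrap every comparison in parentheses before combining it with `&` or `|`.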

You can also use the special `isin` operator to select just those values that appear within a particular list:

In [23]:
# Earthquakes with specific ids

eq_t.where(eq_t.id.isin([123,456,789])).head()

id,magnitude,location,timestamp,longitude,latitude
123,1.23,"7 km SW of Rainier, Washington",2023-02-17 00:28:25.460,-122.75,46.84
456,0.23,Washington,2023-05-23 08:49:02.450,-121.98,46.87
789,1.67,"Puget Sound region, Washington",2023-08-26 04:04:11.200,-122.57,47.6


In [24]:
# Min and max magnitudes

eq_t.select(min=pxt.functions.min(eq_t.magnitude), max=pxt.functions.max(eq_t.magnitude)).head()

min,max
-0.83,4.3


### Extracting Data from Tables into Python/Pandas

Sometimes it's handy to pull out data from a table into a Python object. We've actually already done this; all the calls to `head()` and `tail()` return an in-memory result set, which we can then dereference in various ways. For example:

In [25]:
result = eq_t.head()
result[0]  # Get the first row of the table as a dict

{'id': 0,
 'magnitude': 1.15,
 'location': '10 km NW of Belfair, Washington',
 'timestamp': datetime.datetime(2023, 1, 1, 8, 10, 37, 50000),
 'longitude': -122.93,
 'latitude': 47.51}

In [26]:
result['timestamp']  # Get a list of the `timestamp` field of the first 10 rows

[datetime.datetime(2023, 1, 1, 8, 10, 37, 50000),
 datetime.datetime(2023, 1, 2, 1, 2, 43, 950000),
 datetime.datetime(2023, 1, 2, 12, 5, 1, 420000),
 datetime.datetime(2023, 1, 2, 12, 45, 14, 220000),
 datetime.datetime(2023, 1, 2, 13, 19, 27, 200000),
 datetime.datetime(2023, 1, 3, 5, 51, 24, 760000),
 datetime.datetime(2023, 1, 3, 5, 56, 23, 780000),
 datetime.datetime(2023, 1, 3, 6, 34, 56, 990000),
 datetime.datetime(2023, 1, 3, 20, 41, 1, 760000),
 datetime.datetime(2023, 1, 3, 22, 40, 42, 400000)]
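Because these values are ordinary Python `datetime` objects, standard-library tools work on them directly. As a small sketch, the following copies the first four timestamps from the output above into a list and computes the gap between consecutive events. The variable names are illustrative; in a live session you would start from `result['timestamp']` rather than a hand-copied list.

```python
import datetime

# First four timestamps, copied from the output shown above.
timestamps = [
    datetime.datetime(2023, 1, 1, 8, 10, 37, 50000),
    datetime.datetime(2023, 1, 2, 1, 2, 43, 950000),
    datetime.datetime(2023, 1, 2, 12, 5, 1, 420000),
    datetime.datetime(2023, 1, 2, 12, 45, 14, 220000),
]

# Pair each timestamp with the next one and subtract to get timedeltas.
gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

# Express each gap in hours, rounded to two decimal places.
gap_hours = [round(gap.total_seconds() / 3600, 2) for gap in gaps]
print(gap_hours)
```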

In [27]:
df = result.to_pandas()  # Convert the result set into a Pandas data frame
df['magnitude'].describe()

count    10.000000
mean      1.014000
std       0.572367
min       0.200000
25%       0.570000
50%       1.060000
75%       1.442500
max       1.900000
Name: magnitude, dtype: float64

`head()` and `tail()` return a fixed maximum number of rows. If you want to return _all_ the rows from a query, use `collect()`. Be careful! `collect()` pulls the entire contents of a query into memory. For very large tables, this could result in out-of-memory errors. In this example, the 1823 rows in the table fit comfortably into a data frame.

In [28]:
df = eq_t.collect().to_pandas()
df['magnitude'].describe()

count    1823.000000
mean        0.900378
std         0.625492
min        -0.830000
25%         0.420000
50%         0.850000
75%         1.310000
max         4.300000
Name: magnitude, dtype: float64

You can use `limit()` to prune the results of `collect()`. Unlike `head()` and `tail()`, which go by insertion order of the table, `collect()` will return the results in an arbitrary order (which might be insertion order, but not necessarily), unless an `order_by()` clause is specified.

In [29]:
eq_t.where(eq_t.timestamp >= '2024-01-01').limit(5).collect()

id,magnitude,location,timestamp,longitude,latitude
1236,0.83,"1 km NNE of Lake Marcel-Stillwater, Washington",2024-01-01 14:32:21.920,-121.91,47.7
1237,1.17,"3 km WNW of Eatonville, Washington",2024-01-02 19:56:08.460,-122.31,46.87
1238,1.88,"10 km NE of Lake Marcel-Stillwater, Washington",2024-01-02 22:10:16.890,-121.84,47.77
1239,1.19,"3 km S of Quilcene, Washington",2024-01-03 03:42:54.260,-122.87,47.79
1240,0.5,"23 km ENE of Ashford, Washington",2024-01-03 10:31:55.890,-121.76,46.85


### Adding Columns

Like other database tables, Pixeltable tables aren't fixed entities: they're meant to evolve over time. Suppose we want to add a new column to hold user-specified comments about particular earthquake events. We can do this with the `add_column()` method:

In [30]:
eq_t.add_column(note=pxt.StringType(nullable=True))

Computing cells: 100%|████████████████████████████████████| 1823/1823 [00:00<00:00, 3019.30 cells/s]
Added 1823 column values with 0 errors.


UpdateStatus(num_rows=1823, num_computed_values=1823, num_excs=0, updated_cols=[], cols_with_excs=[])

Here, `pxt.StringType` specifies the type of the new column, and `nullable=True` makes it an _optional_ field. `nullable=True` is mandatory for newly added columns: none of the existing rows have a note yet, so making `note` a required field would be inconsistent with the existing data.

An alternate syntax is sometimes convenient for adding columns:

In [31]:
eq_t['contact_email'] = pxt.StringType(nullable=True)

Computing cells: 100%|████████████████████████████████████| 1823/1823 [00:00<00:00, 2902.28 cells/s]
Added 1823 column values with 0 errors.


Let's have a look at the revised schema.

In [32]:
eq_t.describe()

Column Name,Type,Computed With
id,int,
magnitude,float,
location,string,
timestamp,timestamp,
longitude,float,
latitude,float,
note,string,
contact_email,string,


### Updating and Deleting Data

Table rows can be modified and deleted with the SQL-like `update()` and `delete()` commands.

In [33]:
# Add a comment to records with IDs 123 and 127

eq_t.where(eq_t.id.isin([123,127])).update({'note': 'Still investigating.', 'contact_email': 'contact@pixeltable.com'})

Inserting rows into `earthquakes`: 2 rows [00:00, 746.58 rows/s]


UpdateStatus(num_rows=2, num_computed_values=0, num_excs=0, updated_cols=['earthquakes.note', 'earthquakes.contact_email'], cols_with_excs=[])

In [34]:
eq_t.where(eq_t.id >= 120).select(eq_t.id, eq_t.magnitude, eq_t.note, eq_t.contact_email).head()

id,magnitude,note,contact_email
120,1.17,,
121,1.87,,
122,0.34,,
123,1.23,Still investigating.,contact@pixeltable.com
124,0.13,,
125,0.29,,
126,1.48,,
127,0.63,Still investigating.,contact@pixeltable.com
128,1.54,,
129,0.7,,


In [35]:
# Delete all records in 2024

eq_t.where(eq_t.timestamp >= '2024-01-01').delete()

UpdateStatus(num_rows=587, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])

In [36]:
eq_t.count()  # How many are left after deleting?

1236

That's the end of this section of the Fundamentals tutorial! Continue on to the next section:
- [Computed Columns](...)