
# Pixeltable Fundamentals

## Section 2: Computed Columns and Expressions

Welcome to Section 2 of the [Pixeltable Fundamentals](...) tutorial, __Computed Columns and Expressions__.

In the previous section, [Tables and Data Operations](...), we learned how to create tables, populate them with data, and query and manipulate their contents. In this section, we'll introduce one of Pixeltable's most essential and powerful concepts: computed columns.

First, let's ensure the Pixeltable library is installed in your environment.

In [None]:
%pip install -qU pixeltable

### Computed Columns

Let's start with a simple example that illustrates the basic concepts behind computed columns. We'll use a table of world population data for our example.

In [1]:
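import pixeltable as pxt

# Start from a clean slate: drop any existing `fundamentals` directory, then recreate it.
pxt.drop_dir('fundamentals', force=True)
pxt.create_dir('fundamentals')

# Import the CSV into a new table. The path below is local to the author's
# machine; point it at your own copy of world-population-data.csv.
pop_t = pxt.io.import_csv(
    'fundamentals.population',
    '/Users/asiegel/Dropbox/workspace/pixeltable/pixeltable/docs/source/data/world-population-data.csv'
)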

Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `fundamentals`.
Created table `population`.
Inserting rows into `population`: 234 rows [00:00, 10583.91 rows/s]
Inserted 234 rows with 0 errors.


Remember that `pop_t.head()` returns the first few rows of a table, and typing the table name `pop_t` by itself gives the schema.

In [2]:
pop_t.head(5)

cca3,country,continent,pop_2023,pop_2022,pop_2000,area__km__
IND,India,Asia,1428627663,1417173173,1059633675,3287590.0
CHN,China,Asia,1425671352,1425887337,1264099069,9706961.0
USA,United States,North America,339996563,338289857,282398554,9372610.0
IDN,Indonesia,Asia,277534122,275501339,214072421,1904569.0
PAK,Pakistan,Asia,240485658,235824862,154369924,881912.0


In [3]:
pop_t

Column Name,Type,Computed With
cca3,string,
country,string,
continent,string,
pop_2023,int,
pop_2022,int,
pop_2000,int,
area__km__,float,


Now let's suppose we want to add a new column for the year-over-year population change from 2022 to 2023. In the previous tutorial section, [Tables and Data Operations](...), we saw how one might `select()` such a quantity into a Pixeltable `DataFrame`:

In [11]:
pop_t.select(pop_t.country, yoy_change=(pop_t.pop_2023 - pop_t.pop_2022)).head(5)

country,yoy_change
India,11454490
China,-215985
United States,1706706
Indonesia,2032783
Pakistan,4660796


A __computed column__ is a way of turning such a selection into a permanent part of the table. Here's how it works:

In [5]:
pop_t.add_column(yoy_change=(pop_t.pop_2023 - pop_t.pop_2022))

Added 234 column values with 0 errors.


UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])

As soon as the column is added, Pixeltable will (by default) automatically compute its value for all rows in the table, storing the results in the new column. If we now inspect the schema of `pop_t`, we see the new column and its definition.

In [6]:
pop_t

Column Name,Type,Computed With
cca3,string,
country,string,
continent,string,
pop_2023,int,
pop_2022,int,
pop_2000,int,
area__km__,float,
yoy_change,int,pop_2023 - pop_2022
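
One way to picture what just happened is that the table now stores a named expression and has evaluated it once for every existing row. The following is a minimal plain-Python sketch of that idea, using three of the rows shown earlier. The `computed` registry and the `add_computed_column` helper are hypothetical names invented for illustration; this is a conceptual model, not Pixeltable's implementation.

```python
# Conceptual model only: a "computed column" is a named expression stored with
# the table and evaluated once for every existing row when it is added.
rows = [
    {'country': 'India', 'pop_2023': 1428627663, 'pop_2022': 1417173173},
    {'country': 'China', 'pop_2023': 1425671352, 'pop_2022': 1425887337},
    {'country': 'United States', 'pop_2023': 339996563, 'pop_2022': 338289857},
]

# Hypothetical registry of column definitions (column name -> function of a row).
computed = {}


def add_computed_column(name, expr):
    """Remember the definition, then backfill it for all existing rows."""
    computed[name] = expr
    for row in rows:
        row[name] = expr(row)


add_computed_column('yoy_change', lambda r: r['pop_2023'] - r['pop_2022'])

for row in rows:
    print(row['country'], row['yoy_change'])
```

The values printed for India, China, and the United States agree with the `select()` output shown earlier, because the stored expression is the same subtraction.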


The new column can be queried in the usual manner.

In [7]:
pop_t.select(pop_t.country, pop_t.yoy_change).head(5)

country,yoy_change
India,11454490
China,-215985
United States,1706706
Indonesia,2032783
Pakistan,4660796


Computed columns can be "chained" with other computed columns. Here's an example that expresses population change as a percentage:

In [8]:
pop_t.add_column(yoy_percent_change=(pop_t.yoy_change * 100 / pop_t.pop_2022))

Added 234 column values with 0 errors.


UpdateStatus(num_rows=234, num_computed_values=234, num_excs=0, updated_cols=[], cols_with_excs=[])

In [9]:
pop_t

Column Name,Type,Computed With
cca3,string,
country,string,
continent,string,
pop_2023,int,
pop_2022,int,
pop_2000,int,
area__km__,float,
yoy_change,int,pop_2023 - pop_2022
yoy_percent_change,float,(yoy_change * 100) / pop_2022


In [10]:
pop_t.select(pop_t.country, pop_t.yoy_change, pop_t.yoy_percent_change).head(5)

country,yoy_change,yoy_percent_change
India,11454490,0.808
China,-215985,-0.015
United States,1706706,0.505
Indonesia,2032783,0.738
Pakistan,4660796,1.976
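
As a sanity check on the chained definition, the percentages can be reproduced by hand from the population figures in the first five rows. The sketch below is plain Python rather than Pixeltable code. The two helper functions are hypothetical stand-ins for the two computed columns, and rounding to three decimals is an assumption made only to match the precision of the displayed output.

```python
# Population figures copied from the first five rows shown earlier.
pop_2023 = {
    'India': 1428627663,
    'China': 1425671352,
    'United States': 339996563,
    'Indonesia': 277534122,
    'Pakistan': 240485658,
}
pop_2022 = {
    'India': 1417173173,
    'China': 1425887337,
    'United States': 338289857,
    'Indonesia': 275501339,
    'Pakistan': 235824862,
}


def yoy_change(country):
    """Upstream column: absolute change from 2022 to 2023."""
    return pop_2023[country] - pop_2022[country]


def yoy_percent_change(country):
    """Downstream column: defined in terms of yoy_change, not the raw inputs."""
    return yoy_change(country) * 100 / pop_2022[country]


# Round to three decimals to compare against the displayed table.
percent = {c: round(yoy_percent_change(c), 3) for c in pop_2022}
print(percent)
```

Because the downstream function calls the upstream one, any change to an input population flows through both values, which is the sense in which the columns are chained.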


Although computed columns appear superficially similar to DataFrames, there is a key difference: a DataFrame is evaluated only when you run the query, whereas a computed column is a permanent part of the table and is updated automatically whenever new rows are added. These updates propagate through any other computed columns that are "downstream" of the new data.

<div class="alert alert-block alert-info">
In traditional data workflows, it is commonplace to recompute entire pipelines when the input dataset is changed or enlarged. In Pixeltable, by contrast, <b>all updates are applied incrementally</b>. When new data appear in a table or existing data are altered, Pixeltable will recompute only those rows that are dependent on the changed data.
</div>

Let's see how this works in practice. For purposes of illustration, we'll add an entry for California to the table, as if it were a country.

In [13]:
pop_t.insert(
    cca3='',
    country='California',
    continent='North America',
    pop_2023=39110000,
    pop_2022=39030000,
    pop_2000=33990000,
    area__km__=432970.0
)

Computing cells:   0%|                                                    | 0/5 [00:00<?, ? cells/s]
Inserting rows into `population`: 1 rows [00:00, 194.20 rows/s]
Computing cells: 100%|███████████████████████████████████████████| 5/5 [00:00<00:00, 568.46 cells/s]
Inserted 1 row with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])

Observe that the computed columns `yoy_change` and `yoy_percent_change` have been automatically updated in response to the new data.

In [15]:
pop_t.tail(5)

cca3,country,continent,pop_2023,pop_2022,pop_2000,area__km__,yoy_change,yoy_percent_change
FLK,Falkland Islands,South America,3791,3780,3080,12173.0,11,0.291
NIU,Niue,Oceania,1935,1934,2074,261.0,1,0.052
TKL,Tokelau,Oceania,1893,1871,1666,12.0,22,1.176
VAT,Vatican City,Europe,518,510,651,0.44,8,1.569
,California,North America,39110000,39030000,33990000,432970.0,80000,0.205
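
The incremental behavior seen here can be modeled with a toy table that counts how many computed cells it evaluates. This is a hedged sketch in plain Python: the `TinyTable` class and its `evaluations` counter are hypothetical and say nothing about how Pixeltable is implemented internally. The point is only that inserting one row triggers work for that row alone, while existing rows are left untouched.

```python
class TinyTable:
    """Toy model of incremental computed columns; not Pixeltable's implementation."""

    def __init__(self):
        self.rows = []
        self.computed = {}    # column name -> function of a row, in definition order
        self.evaluations = 0  # number of computed cells evaluated so far

    def _fill(self, row, names):
        # Evaluate the named computed columns for a single row.
        for name in names:
            row[name] = self.computed[name](row)
            self.evaluations += 1

    def add_computed_column(self, name, expr):
        self.computed[name] = expr
        for row in self.rows:  # backfill: one evaluation per existing row
            self._fill(row, [name])

    def insert(self, **values):
        row = dict(values)
        # Only the new row is computed; definition order keeps upstream
        # columns available to downstream ones.
        self._fill(row, list(self.computed))
        self.rows.append(row)


t = TinyTable()
t.insert(country='India', pop_2023=1428627663, pop_2022=1417173173)
t.insert(country='China', pop_2023=1425671352, pop_2022=1425887337)
t.add_computed_column('yoy_change', lambda r: r['pop_2023'] - r['pop_2022'])
t.add_computed_column('yoy_percent_change', lambda r: r['yoy_change'] * 100 / r['pop_2022'])

before = t.evaluations  # cells evaluated while backfilling the two existing rows
t.insert(country='California', pop_2023=39110000, pop_2022=39030000)
print(t.evaluations - before)  # cells evaluated for the single new row
print(t.rows[-1]['yoy_change'], round(t.rows[-1]['yoy_percent_change'], 3))
```

In this toy model the insert evaluates one cell per computed column for the new row only, and the values it produces for California agree with the last row of the output shown here.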


<div class="alert alert-block alert-info">
Remember that <b>all tables in Pixeltable are persistent</b>. This includes computed columns: when you create a computed column, its definition is stored in the database. You can think of computed columns as setting up a persistent compute workflow: if you close your notebook or restart your Python instance, computed columns (along with the relationships between them, and any data contained in them) will be preserved.
</div>

### A More Interesting Example: Image Processing