# 📚 Basic Data Manipulations with Pixeltable

In this tutorial, you'll be using Pixeltable's table interface to handle common data wrangling tasks. We'll be creating tables, populating them with data, querying them with various filters, and even performing basic transformations.

## About Pixeltable:

[Pixeltable](https://github.com/pixeltable/pixeltable) is a Python library that provides a data infrastructure for AI development. It offers several key benefits:
- **Data-centric**: Store, transform, index, and iterate on your data within the same table interface (text, images, embeddings, video, and more).
- **Transparency and Reproducibility**: Built-in lineage and versioning ensure you track the origin of your data and model outputs.
- **Experimentation**: Benefit from incremental updates, so you only re-run pipelines on modified data, saving time and resources.
- **Flexibility**: Integrate with your existing Python code and libraries, using the models, tools, and AI practices you prefer.

In [1]:
%pip install -q pixeltable

In [2]:
import pixeltable as pxt

pxt.drop_dir('dir', force=True)  # First drop the directory `dir` to ensure a clean environment
pxt.create_dir('dir')  # Create the directory `dir`

Creating a Pixeltable instance at: /root/.pixeltable
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/root/.pixeltable/pgdata
Created directory `dir`.


<pixeltable.catalog.dir.Dir at 0x7b800036ece0>

### Creating and Populating Tables

Pixeltable makes it easy to define [tables](https://pixeltable.github.io/pixeltable/api/table/) with typed columns.

- `create_table` creates a table with specified column names and types.
- `table.insert` adds rows of data to the table as dictionaries.

In [3]:
# Create a new table and define the table schema with named columns and their data types.
table = pxt.create_table('dir.my_data', {
    'id': pxt.IntType(),
    'text': pxt.StringType(),
    'value': pxt.FloatType(),
    'image': pxt.ImageType(nullable=True),  # Optional image column
    'video': pxt.VideoType(nullable=True), # Optional video column
    'document': pxt.DocumentType(nullable=True) # Optional document column
})

# Insert some example data (each row is a dictionary)
table.insert([
    {'id': 1, 'text': 'First row', 'value': 3.14},
    {'id': 1, 'text': 'Second row', 'value': 2.718},
    {'id': 2, 'text': 'Third row with image', 'value': 1.618, 'image': 'https://raw.github.com/pixeltable/pixeltable/master/docs/source/data/images/000000000025.jpg'},
    {'id': 2, 'text': 'Fourth row with video', 'value': 10.15, 'video': 'http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4'},
    {'id': 3, 'text': 'Fourth row with document', 'value': 4.6, 'document': 'https://us.mieleusa.com/MieleMedia/docs/products/OpIn/manuals_pdf/Hoods/DA1280_Manual.pdf'}
])

Created table `my_data`.
Computing cells: 100%|████████████████████████████████████████████| 5/5 [00:02<00:00,  2.12 cells/s]
Inserting rows into `my_data`: 5 rows [00:00, 61.22 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 5/5 [00:02<00:00,  2.02 cells/s]
Inserted 5 rows with 0 errors.


UpdateStatus(num_rows=5, num_computed_values=5, num_excs=0, updated_cols=[], cols_with_excs=[])

**Pixeltable is persistent. Unlike in-memory Python libraries such as Pandas, Pixeltable is a database**. When working locally or against a hosted version of Pixeltable, use [`get_table`](https://pixeltable.github.io/pixeltable/api/pixeltable/#pixeltable.get_table) at any time to retrieve an existing table you'd like to work with.

⚠️ In Google Colab, your session is ephemeral: its storage is discarded when the session ends, so you won't be able to retrieve your tables later.


In [4]:
# Let's see the Pixeltable table that we have created
pxt.list_tables()

['dir.my_data']

In [5]:
# Retrieve the schema
table.describe()

Column Name,Type,Computed With
id,int,
text,string,
value,float,
image,image,
video,video,
document,document,


In [6]:
# Display all rows of `table`
table.collect()

id,text,value,image,video,document
1,First row,3.14,,,
1,Second row,2.718,,,
2,Third row with image,1.618,,,
2,Fourth row with video,10.15,,,
3,Fourth row with document,4.6,,,


### Viewing and Filtering your Data with Queries

Pixeltable offers convenient functions to inspect your data and filter rows based on specific criteria:

In [7]:
# Use head(n) or tail(n) to show the first or last n rows (10 by default)
table.head(2) # or table.tail(2)

id,text,value,image,video,document
1,First row,3.14,,,
1,Second row,2.718,,,


In [8]:
# Count the # of rows in the table
table.count()

5

Retrieve specific columns based on [filter](https://pixeltable.github.io/pixeltable/api/data-frame/#pixeltable.DataFrame) conditions and order the result. Use where() with conditions and logical operators (&, |, ~) to filter.
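
Python does not allow `and`, `or`, and `not` to be overloaded, which is why expression-style APIs rely on `&`, `|`, and `~` instead. Because these operators bind more tightly than comparisons, each condition needs its own parentheses. The sketch below shows the mechanism in plain Python; `Col`, `Pred`, and the `where` helper are hypothetical stand-ins written for this illustration, not Pixeltable classes.

```python
class Pred:
    """A deferred row predicate (toy stand-in, not a Pixeltable class)."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, row):
        return self.fn(row)

    def __and__(self, other):
        # `a & b` keeps a row only if both predicates hold
        return Pred(lambda row: self(row) and other(row))

    def __or__(self, other):
        # `a | b` keeps a row if either predicate holds
        return Pred(lambda row: self(row) or other(row))

    def __invert__(self):
        # `~a` keeps the rows that `a` rejects
        return Pred(lambda row: not self(row))


class Col:
    """A column reference whose comparisons build predicates instead of booleans."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return Pred(lambda row: row[self.name] > value)

    def __eq__(self, value):
        return Pred(lambda row: row[self.name] == value)


def where(rows, pred):
    """Return the rows for which the predicate is true."""
    return [row for row in rows if pred(row)]


rows = [
    {"id": 1, "text": "First row", "value": 3.14},
    {"id": 1, "text": "Second row", "value": 2.718},
    {"id": 2, "text": "Third row with image", "value": 1.618},
]
value, text = Col("value"), Col("text")

both = where(rows, (value > 2) & (text == "First row"))
either = where(rows, (value > 3) | (text == "Second row"))
negated = where(rows, ~(value > 2))
print([r["text"] for r in both], [r["text"] for r in either], [r["text"] for r in negated])
```

Without the parentheses, `value > 2 & text == "First row"` would be parsed as a chained comparison around `2 & text`, which is not the intended filter.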

In [9]:
# Filter for rows where 'value' is greater than 5 and sort in ascending order by 'value'
filtered_table = table.where(table.value > 5).order_by(table.value, asc=True).collect()
filtered_table

id,text,value,image,video,document
2,Fourth row with video,10.15,,,


In [10]:
# Set column value to 10 for all rows where text is First row
table.where(table.text == 'First row').update({'value': 10}, cascade=True) # cascade set to True will update all computed columns that transitively depend on the updated columns.

# Retrieve 2 specific columns and order the result set
table.select(table.text, table.value).order_by(table.value, asc=False).collect()

Inserting rows into `my_data`: 1 rows [00:00, 82.95 rows/s]


text,value
Fourth row with video,10.15
First row,10.0
Fourth row with document,4.6
Second row,2.718
Third row with image,1.618


Modify the structure of the table by adding or removing columns.

In [11]:
# Add a computed column whose values are derived from an existing column
table["double_value"] = table.value * 2
table.select(table.double_value).collect()

Added 5 column values with 0 errors.


double_value
5.436
3.236
20.3
9.2
20.0
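
To make the link between computed columns and `cascade=True` concrete, here is a toy model in plain Python. `ToyTable` and its methods are hypothetical names invented for this sketch; it only mimics the behavior described in this tutorial (a derived column that is recomputed when its input changes) and says nothing about how Pixeltable implements it.

```python
class ToyTable:
    """Minimal in-memory stand-in for a table with computed columns."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]
        self.computed = {}  # column name -> function(row) -> value

    def add_computed(self, name, fn):
        """Register a computed column and fill it for the existing rows."""
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)

    def update(self, match, changes, cascade=True):
        """Apply `changes` to matching rows; optionally recompute derived columns."""
        for row in self.rows:
            if match(row):
                row.update(changes)
                if cascade:
                    for name, fn in self.computed.items():
                        row[name] = fn(row)


toy = ToyTable([
    {"text": "First row", "value": 3.14},
    {"text": "Second row", "value": 2.718},
])
toy.add_computed("double_value", lambda row: row["value"] * 2)

# Same update as in the tutorial: set `value` to 10 for 'First row'
toy.update(lambda row: row["text"] == "First row", {"value": 10})
print(toy.rows)
```

After the update, the first row's `double_value` follows the new `value`, while the untouched row keeps its original result.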


In [12]:
# Adding a new empty column accepting null values
table['new_column'] = pxt.FloatType(nullable=True)
table.head(1)

Added 5 column values with 0 errors.


id,text,value,image,video,document,double_value,new_column
1,First row,10.0,,,,20.0,


In [13]:
# Let's remove the column we just added
table.drop_column('new_column')

In [14]:
# Use `group_by` to apply aggregate functions per group; the keyword argument names the resulting column
table.group_by(table.id).select(table.id, sum_value=pxt.functions.sum(table.value)).collect()

id,sum_value
1,12.718
2,11.768
3,4.6
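
For readers who want to check the aggregation by hand, the same grouping can be sketched in plain Python. The rows below copy the `id` and `value` columns as they stand after the earlier update; `sum_by_key` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# (id, value) pairs as they appear in the table after the update above
rows = [
    {"id": 1, "value": 10.0},
    {"id": 1, "value": 2.718},
    {"id": 2, "value": 1.618},
    {"id": 2, "value": 10.15},
    {"id": 3, "value": 4.6},
]


def sum_by_key(rows, key, field):
    """Group rows by `key` and sum `field` within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return dict(totals)


sum_value = sum_by_key(rows, "id", "value")
for group_id, total in sorted(sum_value.items()):
    # Round for display only; float addition can leave tiny residues
    print(group_id, round(total, 3))
```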


In [15]:
# Select a boolean expression under an alias (`superior_3`), a common pattern in data wrangling.
table.select(table.id, table.text, superior_3=table.value > 3).collect()

id,text,superior_3
1,Second row,False
2,Third row with image,False
2,Fourth row with video,True
3,Fourth row with document,True
1,First row,True


In [16]:
# Combine multiple conditions
filtered_table = table.where((table.value > 2) & (table.text == "First row"))
filtered_table.collect()

id,text,value,image,video,document,double_value
1,First row,10.0,,,,20.0


In [17]:
# Filter on multiple values using 'isin'
table.where(table.id.isin([1, 3])).collect()

id,text,value,image,video,document,double_value
1,Second row,2.718,,,,5.436
3,Fourth row with document,4.6,,,,9.2
1,First row,10.0,,,,20.0


In [18]:
# Create additional columns to transform the images
table['resize'] = table.image.resize((224, 125)) # Resize to a specific size
table['crop'] = table.image.crop(box=(50, 50, 125, 125)) # Crop a region of interest
table['rotate'] = table.image.rotate(30) # Rotate the image

Added 5 column values with 0 errors.
Added 5 column values with 0 errors.
Added 5 column values with 0 errors.


In [19]:
table.select(table.image, table.resize, table.crop, table.rotate).collect()

image,resize,crop,rotate
,,,
,,,
,,,
,,,
,,,


## Learn more about Computed Columns and Data Transformations

Computed columns can express almost any data transformation: you can add custom logic with [User-Defined Functions](https://pixeltable.readme.io/docs/user-defined-functions-udfs), call third-party integrations (e.g. [Together AI](https://pixeltable.readme.io/docs/together-ai)), or use built-in functions such as the [Document Splitter](https://pixeltable.readme.io/docs/document-indexing-and-rag).

## Real-World Example with Earthquake Data using [Pandas](https://pixeltable.github.io/pixeltable/api/io/#pixeltable.io.import_pandas)

Let's start with a common Python task: building a DataFrame from the response of a GET endpoint.

In [20]:
import requests
import pandas as pd

# Let's fetch some public Earthquake Data
base_url = "https://earthquake.usgs.gov/fdsnws/event/1/query?"
params = {
    "format": "geojson",
    "starttime": "2023-01-01",  # Start date of your interest
    "endtime": "2024-06-28",  # End date
    "latitude": 47.6062,      # Seattle's Latitude
    "longitude": -122.3321,    # Seattle's Longitude
    "maxradiuskm": 100       # Radius around Seattle
}

response = requests.get(base_url, params=params)
data = response.json()

# We build a Pandas DataFrame
df = pd.json_normalize(data["features"])
df = df[["properties.mag", "properties.place", "properties.time", "geometry.coordinates"]]
df.columns = ["magnitude", "location", "timestamp", "coordinates"]

# We need to convert the timestamp column
df["timestamp"] = pd.to_datetime(df["timestamp"], unit='ms')

# We extract the Latitude and Longitude
df['longitude'] = df['coordinates'].apply(lambda x: x[0])
df['latitude'] = df['coordinates'].apply(lambda x: x[1])
df = df.drop("coordinates", axis=1)
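
Two details of the response are easy to miss in the cell above: the `time` property is expressed in milliseconds since the Unix epoch, and GeoJSON coordinates are ordered longitude first, then latitude (the USGS feed adds depth as a third element). The standard-library sketch below flattens a single feature the same way. The sample feature is hand-written for illustration: its magnitude, place, time, and position mirror one row shown in this tutorial, the depth value is made up, and `flatten_feature` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def flatten_feature(feature: dict) -> dict:
    """Turn one GeoJSON earthquake feature into a flat record."""
    props = feature["properties"]
    # GeoJSON order is [longitude, latitude, ...]; ignore anything after that
    longitude, latitude = feature["geometry"]["coordinates"][:2]
    return {
        "magnitude": props["mag"],
        "location": props["place"],
        # Integer milliseconds since the epoch, converted without float rounding
        "timestamp": EPOCH + timedelta(milliseconds=props["time"]),
        "longitude": longitude,
        "latitude": latitude,
    }


sample_feature = {
    "properties": {
        "mag": 1.15,
        "place": "10 km NW of Belfair, Washington",
        "time": 1672560637050,
    },
    "geometry": {"coordinates": [-122.93, 47.514, 20.0]},  # depth is a made-up value
}

record = flatten_feature(sample_feature)
print(record["timestamp"].isoformat())  # 2023-01-01T08:10:37.050000+00:00
```

Unlike `pd.to_datetime(..., unit='ms')`, which returns timezone-naive timestamps, this sketch attaches UTC explicitly.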


`import_pandas` simplifies the creation of a Pixeltable table from a Pandas DataFrame, inferring the column types automatically. Pixeltable also supports `import_csv` and `import_excel`.
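
As a rough intuition for what type inference involves, and for why a column such as `magnitude` may need to be declared nullable, here is a toy inference routine in plain Python. It is a sketch under simplifying assumptions and not Pixeltable's algorithm: `infer_schema` is a hypothetical function, and it reports Python type names rather than Pixeltable column types.

```python
from typing import Any, Dict, List, Tuple


def infer_schema(rows: List[Dict[str, Any]]) -> Dict[str, Tuple[str, bool]]:
    """Map each column name to (python_type_name, nullable)."""
    names: List[str] = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)

    schema: Dict[str, Tuple[str, bool]] = {}
    for name in names:
        values = [row.get(name) for row in rows]  # a missing key counts as None
        present = [v for v in values if v is not None]
        kinds = {type(v) for v in present}
        if kinds == {int, float}:
            kinds = {float}  # promote mixed int/float columns to float
        if len(kinds) != 1:
            raise TypeError(f"cannot infer a single type for column {name!r}")
        nullable = len(present) < len(values)
        schema[name] = (kinds.pop().__name__, nullable)
    return schema


rows = [
    {"magnitude": 1.15, "location": "10 km NW of Belfair, Washington"},
    {"magnitude": None, "location": "2 km S of Enetai, Washington"},
]
schema = infer_schema(rows)
print(schema)
```

If the sample contains no missing values, inference alone cannot know that a column should accept nulls later, which is the kind of case an explicit override addresses.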

In [21]:
# Create a new table from a Pandas `DataFrame`. The schema is inferred from the DataFrame unless overridden via `schema_overrides`.
earthquake = pxt.io.import_pandas("dir.eq_table", df, schema_overrides={'magnitude': pxt.FloatType(nullable=True)})
earthquake.tail()

Created table `eq_table`.

magnitude,location,timestamp,longitude,latitude
0.97,"4 km NW of Picnic Point, Washington",2023-01-03 22:40:42.400,-122.365,47.908
1.47,"2 km NE of Kapowsin, Washington",2023-01-03 20:41:01.760,-122.202,47.007
1.36,"6 km SE of Black Diamond, Washington",2023-01-03 06:34:56.990,-121.933,47.276
1.9,"1 km ENE of Enetai, Washington",2023-01-03 05:56:23.780,-122.58,47.592
0.72,"8 km ESE of Buckley, Washington",2023-01-03 05:51:24.760,-121.928,47.136
1.56,"0 km WSW of Esperance, Washington",2023-01-02 13:19:27.200,-122.36,47.787
0.52,"15 km NNE of Ashford, Washington",2023-01-02 12:45:14.220,-121.95,46.89
0.2,"23 km ENE of Ashford, Washington",2023-01-02 12:05:01.420,-121.754,46.858
0.29,"23 km ENE of Ashford, Washington",2023-01-02 01:02:43.950,-121.757,46.852
1.15,"10 km NW of Belfair, Washington",2023-01-01 08:10:37.050,-122.93,47.514


If you would rather work with external files, see the [dedicated tutorial](https://pixeltable.readme.io/docs/working-with-external-files).

### Let's Filter with Multiple Conditions (Logical Operators):

In [22]:
filtered_table = earthquake.where((earthquake.magnitude >= 1) & (earthquake.location == '22 km ENE of Ashford, Washington')).collect()
filtered_table

magnitude,location,timestamp,longitude,latitude
1.41,"22 km ENE of Ashford, Washington",2024-05-14 15:34:05.550,-121.77,46.853
1.37,"22 km ENE of Ashford, Washington",2024-04-26 13:31:03.580,-121.766,46.846
1.07,"22 km ENE of Ashford, Washington",2024-01-11 22:56:57.340,-121.764,46.853
1.06,"22 km ENE of Ashford, Washington",2023-10-28 12:56:05.460,-121.759,46.845
1.13,"22 km ENE of Ashford, Washington",2023-10-14 22:13:46.870,-121.763,46.852
1.6,"22 km ENE of Ashford, Washington",2023-04-29 05:39:06.970,-121.764,46.842
1.44,"22 km ENE of Ashford, Washington",2023-04-29 05:32:43.730,-121.763,46.846


## Pixeltable also provides the capabilities to build [Views](https://pixeltable.github.io/pixeltable/api/pixeltable/#pixeltable.create_view).
You create a view by specifying a base table and defining either a filter or an iterator (or both).

In [23]:
# In this example, I create a snapshot view with an additional column and a filter.
filtered_snapshot = pxt.create_view('dir.my_snapshot',
                                    earthquake.where(earthquake.timestamp >= '2024-06-25 18:09:00.720'),
                                    schema={'lagitude': earthquake.longitude*earthquake.latitude},
                                    is_snapshot=True)

Inserting rows into `my_snapshot`: 14 rows [00:00, 2139.48 rows/s]
Created view `my_snapshot` with 14 rows, 0 exceptions.


In [24]:
# Here, I'll create a regular (non-snapshot) view with a simple filter.
filtered_view = pxt.create_view('dir.my_view',
                                earthquake.where(earthquake.latitude >= 46),
                                ignore_errors=False)

Inserting rows into `my_view`: 1805 rows [00:00, 4186.64 rows/s]
Created view `my_view` with 1805 rows, 0 exceptions.


In [25]:
filtered_view.tail(5)

magnitude,location,timestamp,longitude,latitude
1.56,"0 km WSW of Esperance, Washington",2023-01-02 13:19:27.200,-122.36,47.787
0.52,"15 km NNE of Ashford, Washington",2023-01-02 12:45:14.220,-121.95,46.89
0.2,"23 km ENE of Ashford, Washington",2023-01-02 12:05:01.420,-121.754,46.858
0.29,"23 km ENE of Ashford, Washington",2023-01-02 01:02:43.950,-121.757,46.852
1.15,"10 km NW of Belfair, Washington",2023-01-01 08:10:37.050,-122.93,47.514


In [26]:
# You can list all the views and snapshots of a specific table
earthquake.list_views()

['dir.my_snapshot', 'dir.my_view']

Views behave like regular tables within Pixeltable. You can:

- **Query Them**: Apply filters, select specific columns, and perform calculations.
- **Update Them**: (If not a snapshot view) Modify the underlying data, with changes automatically reflected in the view.
- **Chain Them**: Create views on top of existing views for increasingly specific or complex transformations.
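
The difference between a snapshot and a regular view can be pictured with a small plain-Python model. All class names here (`ToyBase`, `LiveView`, `SnapshotView`) are hypothetical and exist only for this sketch. It is also a simplification: the live view below re-filters the base on every query, whereas Pixeltable updates views incrementally when rows are inserted.

```python
class ToyBase:
    """A base table reduced to a list of row dictionaries."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]

    def insert(self, row):
        self.rows.append(dict(row))


class LiveView:
    """A filtered view that always reflects the current base rows."""

    def __init__(self, base, predicate):
        self.base = base
        self.predicate = predicate

    def collect(self):
        return [row for row in self.base.rows if self.predicate(row)]


class SnapshotView:
    """A filtered view frozen at creation time."""

    def __init__(self, base, predicate):
        self.rows = [dict(row) for row in base.rows if predicate(row)]

    def collect(self):
        return list(self.rows)


base = ToyBase([
    {"location": "10 km NW of Belfair, Washington", "latitude": 47.514},
    {"location": "23 km ENE of Ashford, Washington", "latitude": 46.852},
])


def north_of_46(row):
    return row["latitude"] >= 46


live = LiveView(base, north_of_46)
snapshot = SnapshotView(base, north_of_46)

# A new row arrives in the base table after both views were created
base.insert({"location": "2 km S of Enetai, Washington", "latitude": 46.544})

print(len(live.collect()), len(snapshot.collect()))  # 3 2
```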

See how to leverage multiple views of a table for instance to [experiment with chunking strategies](https://pixeltable.readme.io/docs/rag-operations).

### Let's update the non-snapshot view by simply inserting a new row into the underlying data

In [27]:
# Add a new row to the base table `earthquake` ('eq_table') and see how the related view is updated incrementally
new_data = [
    {'location': '2 km S of Enetai, Washington', 'timestamp': '2024-06-25 18:09:00.720', 'longitude': -124.456, 'latitude': 46.544},
]
earthquake.insert(new_data)

Computing cells:   0%|                                                    | 0/1 [00:00<?, ? cells/s]
Inserting rows into `eq_table`: 1 rows [00:00, 181.24 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 70.88 cells/s]
Inserting rows into `my_view`: 1 rows [00:00, 249.32 rows/s]
Inserted 2 rows with 0 errors.


UpdateStatus(num_rows=2, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [28]:
filtered_view.where(filtered_view.latitude == 46.544).collect()

magnitude,location,timestamp,longitude,latitude
,"2 km S of Enetai, Washington",2024-06-25 18:09:00.720,-124.456,46.544


**You have reached the end of this tutorial and now know how to insert data into tables and manipulate them to build the datasets you need.**

## What's Next?

- [Video & Audio Analysis](https://pixeltable.readme.io/docs/transcribing-and-indexing-audio-and-video): Use FrameIterator and AudioSegmenter for video and audio analysis.
- [Document Parsing](https://pixeltable.readme.io/docs/document-indexing-and-rag): Process PDFs and other documents with DocumentSplitter.
- [Large Language Model (LLM) Integration](https://pixeltable.readme.io/docs/working-with-fireworks): Generate embeddings, summaries, or answers using OpenAI or other providers.
- [Vector Similarity Search](#indexing): Build semantic search applications directly on your data.
