# Data Wrangling 101 in Pixeltable


In this tutorial, we'll guide you through using Pixeltable's intuitive table interface to handle common data wrangling tasks. We'll cover creating tables, populating them with data, querying them with various filters, and even performing basic transformations.

#### Pixeltable simplifies data preparation with:

- Unified Interface: Handle diverse data types (images, text, embeddings, time series) in a single table format.
- Reproducibility: Track data lineage and changes, ensuring transparency and enabling you to retrace your steps.
- Efficiency: Incremental updates mean you only recompute what's changed, saving valuable time and resources.

#### See more examples for:

- [Comparing Object Detection Models](https://dash.readme.com/project/pixeltable/v1.0/docs/object-detection-in-videos) (Computer Vision)
- [Build a Q&A System in Minutes](https://pixeltable.readme.io/docs/build-a-qa-system-in-minutes-with-pixeltable) (LLM)
- [Working with OpenAI](https://pixeltable.readme.io/docs/working-with-openai) (Integrations)



In [1]:
%pip install -q pixeltable

In [2]:
import pixeltable as pxt

###  Creating and Populating Tables

Pixeltable makes it easy to define tables with typed columns. Let's create a sample table to track information about things with wings and legs:

In [3]:
# Drop the table if it already exists to avoid a name conflict
pxt.drop_table('first_table', ignore_errors=True)

# Columns are non-nullable by default, so `num_wings` is required in every row
t = pxt.create_table('first_table', {
    'num_legs': pxt.IntType(nullable=True),
    'num_wings': pxt.IntType(),
    'name': pxt.StringType(nullable=True),
    'image': pxt.ImageType(nullable=True)
})

# Insert rows (each row is a dictionary)
t.insert([{'num_wings': 2, 'name': 'jake'},
          {'num_legs': 3, 'num_wings': 2},
          {'num_legs': 4, 'num_wings': 8, 'name': 'kev'}])

Creating a Pixeltable instance at: /root/.pixeltable
Connected to Pixeltable database at: postgresql://postgres:@/pixeltable?host=/root/.pixeltable/pgdata
Created table `first_table`.
Inserted 3 rows with 0 errors.


UpdateStatus(num_rows=3, num_computed_values=3, num_excs=0, updated_cols=[], cols_with_excs=[])
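
To make the role of `nullable` concrete, here is a plain-Python sketch of the rule the schema above implies: a row may omit any nullable column, but it must supply `num_wings`. This is an illustration only, not Pixeltable's validation code; `SCHEMA_NULLABLE` and `missing_required` are hypothetical names, and the third row is a made-up bad row.

```python
# Illustrative only: mimics the "non-nullable columns must be present" rule
# in plain Python. Pixeltable performs its own validation on insert.
SCHEMA_NULLABLE = {
    'num_legs': True,
    'num_wings': False,   # declared as pxt.IntType() without nullable=True
    'name': True,
    'image': True,
}

def missing_required(row: dict) -> list:
    """Return the non-nullable columns that the row omits or sets to None."""
    return [col for col, nullable in SCHEMA_NULLABLE.items()
            if not nullable and row.get(col) is None]

rows = [
    {'num_wings': 2, 'name': 'jake'},
    {'num_legs': 3, 'num_wings': 2},
    {'num_legs': 4, 'name': 'kev'},   # hypothetical bad row: no num_wings
]
problems = [missing_required(r) for r in rows]
print(problems)
```

The first two rows match rows inserted above and pass the check; only the hypothetical third row is flagged.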

### Viewing and Filtering your Data with Queries

Pixeltable offers convenient functions to inspect your data and filter rows based on specific criteria:

In [4]:
# Show only the first n rows (e.g., n=1)
# t.show(1)

# Retrieve all rows in the table
t.collect()

num_legs,num_wings,name,image
,2,jake,
3.0,2,,
4.0,8,kev,


In [5]:
# Show rows where num_wings is greater than or equal to 7
t.where(t.num_wings >= 7).show()

num_legs,num_wings,name,image
4,8,kev,


In [6]:
# Filter on multiple values using 'isin'
t.where(t.num_wings.isin([2, 8])).collect()

num_legs,num_wings,name,image
,2,jake,
3.0,2,,
4.0,8,kev,


### Basic Column Transformations

Pixeltable allows you to perform calculations directly on columns, creating new values on the fly:

In [7]:
# Select computed expressions for the rows where name is 'kev'
t.where(t.name == 'kev').select(leg_calc=t.num_legs + 4 * 12, wing_calc=t.num_wings * t.num_legs).show()

leg_calc,wing_calc
52,32
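
The values in this output follow from ordinary Python operator precedence, which applies here because the expression is written with Python operators: multiplication binds tighter than addition, so `t.num_legs + 4 * 12` means `num_legs + 48`. A quick check in plain Python, using kev's values from the table:

```python
# kev's row from the table above
num_legs, num_wings = 4, 8

leg_calc = num_legs + 4 * 12        # 4 + 48, not (4 + 4) * 12
wing_calc = num_wings * num_legs    # 8 * 4

print(leg_calc, wing_calc)  # 52 32
```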


#### Transform image data to a new column using built-in functions

This example uses a copy of a source image from the Pixeltable GitHub repo. In practice, images can come from anywhere, such as an S3 bucket or the local file system.

In [8]:
# Every new row must specify `num_wings`: the column was defined without `nullable=True`, and the default is `nullable=False`
t.insert([{'num_wings': 2, 'image':'https://raw.github.com/pixeltable/pixeltable/master/docs/source/data/images/000000000025.jpg'}])

Inserted 1 row with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=1, num_excs=0, updated_cols=[], cols_with_excs=[])

In [9]:
t.collect()

num_legs,num_wings,name,image
,2,jake,
3.0,2,,
4.0,8,kev,
,2,,


In [10]:
# Add a computed column: rotate each image by 45 degrees, then resize it to 120x120
t['transformed'] = t.image.rotate(45).resize((120, 120))

Added 4 column values with 0 errors.


In [11]:
t.show()

num_legs,num_wings,name,image,transformed
,2,jake,,
3.0,2,,,
4.0,8,kev,,
,2,,,


## Real-World Example with Earthquake Data

Let's start by creating a Pixeltable table from a pandas DataFrame.

`import_pandas` creates a Pixeltable table from a pandas DataFrame and infers the column types automatically.

In [12]:
import requests
import pandas as pd

# 1. Fetch earthquake data from the USGS event API
base_url = "https://earthquake.usgs.gov/fdsnws/event/1/query?"
params = {
    "format": "geojson",
    "starttime": "2023-01-01",  # Start of the date range
    "endtime": "2024-06-28",    # End of the date range
    "latitude": 47.6062,        # Seattle's latitude
    "longitude": -122.3321,     # Seattle's longitude
    "maxradiuskm": 100          # Search radius around Seattle, in km
}

response = requests.get(base_url, params=params)
response.raise_for_status()  # Stop early if the request failed
data = response.json()

# 2. Flatten the GeoJSON features into a pandas DataFrame
df = pd.json_normalize(data["features"])
df = df[["properties.mag", "properties.place", "properties.time", "geometry.coordinates"]]
df.columns = ["magnitude", "location", "timestamp", "coordinates"]

# 3. Convert the timestamp from epoch milliseconds to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"], unit='ms')

# 4. Extract longitude and latitude (coordinates are ordered longitude, latitude, depth)
df['longitude'] = df['coordinates'].apply(lambda x: x[0])
df['latitude'] = df['coordinates'].apply(lambda x: x[1])
df = df.drop("coordinates", axis=1)

# 5. Import the DataFrame into Pixeltable and preview the first rows
earthquake = pxt.io.import_pandas("eq_table", df)
earthquake.show()

Created table `eq_table`.
Inserted 1804 rows with 0 errors.


magnitude,location,timestamp,longitude,latitude
0.14,"7 km S of Seabeck, Washington",2024-06-27 23:45:29.670,-122.830333,47.567667
0.46,"13 km NNE of Ashford, Washington",2024-06-27 21:26:18.340,-121.9345,46.864167
1.06,"6 km NE of Fall City, Washington",2024-06-27 19:03:08.410,-121.826333,47.609833
0.24,"22 km ENE of Ashford, Washington",2024-06-27 14:07:02.800,-121.766667,46.8535
1.66,"10 km S of Seabeck, Washington",2024-06-27 10:33:50.370,-122.848833,47.550167
0.57,"6 km W of Lake Cavanaugh, Washington",2024-06-27 04:49:55.000,-122.105667,48.322167
1.09,"6 km NE of Duvall, Washington",2024-06-26 13:51:21.480,-121.919333,47.7785
0.46,"22 km ENE of Ashford, Washington",2024-06-26 13:38:50.670,-121.767833,46.847333
0.0,"12 km NE of Ashford, Washington",2024-06-26 06:33:16.760,-121.921833,46.840667
0.53,"22 km ENE of Ashford, Washington",2024-06-26 06:29:44.970,-121.779333,46.859167
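
The two conversions in the cell above (epoch milliseconds to a datetime, and splitting the coordinate list) can be checked offline on a single record. The sketch below uses only the standard library and a made-up feature shaped like one element of `data["features"]`; the values are hypothetical and not taken from the USGS response.

```python
from datetime import datetime, timezone

# Hypothetical GeoJSON-style feature; all values are made up for illustration.
feature = {
    "properties": {"mag": 1.2, "place": "example place", "time": 1_700_000_000_000},
    "geometry": {"coordinates": [-122.33, 47.61, 10.0]},  # [longitude, latitude, depth]
}

# Epoch milliseconds -> timezone-aware UTC datetime
ts = datetime.fromtimestamp(feature["properties"]["time"] / 1000, tz=timezone.utc)

# GeoJSON lists longitude first, then latitude
longitude, latitude = feature["geometry"]["coordinates"][:2]

print(ts.isoformat(), longitude, latitude)
```

Note that `pd.to_datetime(..., unit='ms')` in the cell above produces timezone-naive values, whereas this sketch attaches UTC explicitly.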


### Filtering with Multiple Conditions (Logical Operators)

In [13]:
filtered_table = earthquake.where((earthquake.magnitude >= 1) & (earthquake.location == '22 km ENE of Ashford, Washington')).collect()
filtered_table

magnitude,location,timestamp,longitude,latitude
1.41,"22 km ENE of Ashford, Washington",2024-05-14 15:34:05.550,-121.770333,46.853167
1.37,"22 km ENE of Ashford, Washington",2024-04-26 13:31:03.580,-121.7655,46.8455
1.07,"22 km ENE of Ashford, Washington",2024-01-11 22:56:57.340,-121.763833,46.852667
1.06,"22 km ENE of Ashford, Washington",2023-10-28 12:56:05.460,-121.758833,46.845167
1.13,"22 km ENE of Ashford, Washington",2023-10-14 22:13:46.870,-121.762833,46.851667
1.6,"22 km ENE of Ashford, Washington",2023-04-29 05:39:06.970,-121.763667,46.842333
1.44,"22 km ENE of Ashford, Washington",2023-04-29 05:32:43.730,-121.7635,46.8455
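
Two Python details explain the shape of the filter above. First, Python's `and` cannot be overloaded, so column conditions are combined with `&`. Second, `&` binds more tightly than comparison operators, so each comparison needs its own parentheses. The toy classes below (`Col` and `Pred`, both hypothetical names) sketch how operator overloading can turn such an expression into a reusable predicate. This is a teaching model, not Pixeltable's implementation.

```python
# Toy model of column expressions built with operator overloading.
class Pred:
    def __init__(self, fn, text):
        self.fn, self.text = fn, text

    def __and__(self, other):
        # Combine two predicates; both must hold for a row to match
        return Pred(lambda row: self.fn(row) and other.fn(row),
                    f"({self.text}) & ({other.text})")

    def __call__(self, row):
        return self.fn(row)


class Col:
    def __init__(self, name):
        self.name = name

    def __ge__(self, value):
        return Pred(lambda row: row[self.name] >= value, f"{self.name} >= {value!r}")

    def __eq__(self, value):
        return Pred(lambda row: row[self.name] == value, f"{self.name} == {value!r}")


magnitude, location = Col("magnitude"), Col("location")
pred = (magnitude >= 1) & (location == "A")   # parentheses are required

rows = [
    {"magnitude": 1.4, "location": "A"},
    {"magnitude": 0.5, "location": "A"},
    {"magnitude": 2.0, "location": "B"},
]
matches = [r for r in rows if pred(r)]
print(pred.text)
print(matches)
```

Without the parentheses, Python would group the expression as `magnitude >= (1 & location) == "A"`, which is not the intended condition.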


## Creating Views

To create a view, specify a base table and define a filter, an iterator, or both. The filter in the example below keeps only earthquakes recorded at or after a given timestamp.

In [14]:
filtered_view = pxt.create_view('my_view', earthquake, filter=earthquake.timestamp >= '2024-06-25 18:09:00.720')

Created view `my_view` with 13 rows, 0 exceptions.


In [15]:
filtered_view.show()

magnitude,location,timestamp,longitude,latitude
0.14,"7 km S of Seabeck, Washington",2024-06-27 23:45:29.670,-122.830333,47.567667
0.46,"13 km NNE of Ashford, Washington",2024-06-27 21:26:18.340,-121.9345,46.864167
1.06,"6 km NE of Fall City, Washington",2024-06-27 19:03:08.410,-121.826333,47.609833
0.24,"22 km ENE of Ashford, Washington",2024-06-27 14:07:02.800,-121.766667,46.8535
1.66,"10 km S of Seabeck, Washington",2024-06-27 10:33:50.370,-122.848833,47.550167
0.57,"6 km W of Lake Cavanaugh, Washington",2024-06-27 04:49:55.000,-122.105667,48.322167
1.09,"6 km NE of Duvall, Washington",2024-06-26 13:51:21.480,-121.919333,47.7785
0.46,"22 km ENE of Ashford, Washington",2024-06-26 13:38:50.670,-121.767833,46.847333
0.0,"12 km NE of Ashford, Washington",2024-06-26 06:33:16.760,-121.921833,46.840667
0.53,"22 km ENE of Ashford, Washington",2024-06-26 06:29:44.970,-121.779333,46.859167


Views behave like regular tables within Pixeltable. You can:

- Query them: Apply filters, select specific columns, and perform calculations.
- Update them: Unless the view is a snapshot, changes to the underlying data are automatically reflected in the view.
- Chain them: Create views on top of existing views for increasingly specific or complex transformations.
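
One way to picture the update and chaining behavior is to treat a view as a stored filter over its base rather than a separate copy of the data. The plain-Python sketch below is only a mental model with hypothetical names (`ToyTable`, `ToyView`); it makes no claim about how Pixeltable stores or maintains views internally.

```python
class ToyTable:
    def __init__(self, rows=None):
        self._rows = list(rows or [])

    def insert(self, row):
        self._rows.append(row)

    def rows(self):
        return list(self._rows)


class ToyView:
    """A filter applied to its base (a table or another view) on every read."""
    def __init__(self, base, predicate):
        self.base, self.predicate = base, predicate

    def rows(self):
        return [r for r in self.base.rows() if self.predicate(r)]


quakes = ToyTable([{"magnitude": 0.4, "year": 2023},
                   {"magnitude": 1.6, "year": 2024}])
recent = ToyView(quakes, lambda r: r["year"] >= 2024)           # view on a table
strong_recent = ToyView(recent, lambda r: r["magnitude"] >= 1)  # view on a view

before = len(strong_recent.rows())
quakes.insert({"magnitude": 2.1, "year": 2024})   # update the base table
after = len(strong_recent.rows())
print(before, after)
```

In this model, the row inserted into the base table shows up in both views on the next read, which is the behavior the list above describes for non-snapshot views.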
