# Import data from Parquet files

Load columnar data from Parquet files into Pixeltable tables for processing and analysis.

## Problem

You have data stored in Parquet, a columnar format common in analytics, data lakes, and ML pipelines. You need to load this data so you can process it with AI models or combine it with other data sources.

| Source | Size | Use case |
|--------|------|----------|
| embeddings.parquet | 1M vectors | Add to similarity search |
| transactions.parquet | 10M rows | Analyze with computed columns |
| features.parquet | 500K rows | Combine with media data |

## Solution

**What's in this recipe:**

- Import Parquet files directly into tables
- Export tables to Parquet for external tools
- Add computed columns and set a primary key on imported tables

Use `pxt.create_table()` with the `source` parameter to create a table from a Parquet file. Pixeltable infers column types from the Parquet schema automatically.

### Setup

In [1]:
%pip install -qU pixeltable pyarrow pandas

In [2]:
import pixeltable as pxt
import pandas as pd
import tempfile
from pathlib import Path

### Create sample Parquet file

First, create a sample Parquet file to demonstrate the import process:

In [3]:
# Create sample data
sample_data = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'name': ['Widget A', 'Widget B', 'Gadget X', 'Gadget Y', 'Tool Z'],
    'price': [29.99, 39.99, 149.99, 199.99, 79.99],
    'category': ['widgets', 'widgets', 'gadgets', 'gadgets', 'tools'],
    'in_stock': [True, False, True, True, False]
})

# Save to temporary Parquet file
temp_dir = tempfile.mkdtemp()
parquet_path = Path(temp_dir) / 'products.parquet'
sample_data.to_parquet(parquet_path, index=False)
sample_data

,product_id,name,price,category,in_stock
0,1,Widget A,29.99,widgets,True
1,2,Widget B,39.99,widgets,False
2,3,Gadget X,149.99,gadgets,True
3,4,Gadget Y,199.99,gadgets,True
4,5,Tool Z,79.99,tools,False


### Import Parquet file

Use `create_table` with the `source` parameter to create a table directly from the Parquet file:

In [4]:
# Create a fresh directory
pxt.drop_dir('parquet_demo', force=True)
pxt.create_dir('parquet_demo')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'parquet_demo'.


<pixeltable.catalog.dir.Dir at 0x17f0ca920>

In [5]:
# Import Parquet file into a new table
products = pxt.create_table(
    'parquet_demo.products',
    source=str(parquet_path)
)

Created table 'products'.





Inserted 5 rows with 0 errors.


In [6]:
# View imported data
products.collect()

product_id,name,price,category,in_stock
1,Widget A,29.99,widgets,True
2,Widget B,39.99,widgets,False
3,Gadget X,149.99,gadgets,True
4,Gadget Y,199.99,gadgets,True
5,Tool Z,79.99,tools,False


### Add computed columns

Once imported, the table behaves like any other Pixeltable table, so you can add computed columns:

In [7]:
# Add a computed column for discounted price
products.add_computed_column(sale_price=products.price * 0.9)

Added 5 column values with 0 errors.


5 rows updated, 10 values computed.

In [8]:
# View with computed column
products.select(products.name, products.price, products.sale_price).collect()

name,price,sale_price
Widget A,29.99,26.991
Widget B,39.99,35.991
Gadget X,149.99,134.991
Gadget Y,199.99,179.991
Tool Z,79.99,71.991


### Import with primary key

Specify a primary key when you need upsert behavior or unique constraints:

In [9]:
# Import with a primary key
products_pk = pxt.create_table(
    'parquet_demo.products_with_pk',
    source=str(parquet_path),
    primary_key='product_id'
)

Created table 'products_with_pk'.





Inserted 5 rows with 0 errors.


In [10]:
# View the table
products_pk.collect()

product_id,name,price,category,in_stock
1,Widget A,29.99,widgets,True
2,Widget B,39.99,widgets,False
3,Gadget X,149.99,gadgets,True
4,Gadget Y,199.99,gadgets,True
5,Tool Z,79.99,tools,False


### Export table to Parquet

Export your processed data back to Parquet for use with other tools:

In [11]:
# Export to Parquet (note: image columns require inline_images=True)
export_path = Path(temp_dir) / 'exported_products'

pxt.io.export_parquet(
    products.select(products.name, products.price, products.sale_price),
    parquet_path=export_path
)

In [12]:
# Verify export by reading back
import pyarrow.parquet as pq

exported_table = pq.read_table(export_path)
exported_table.to_pandas()

,name,price,sale_price
0,Widget A,29.99,26.990999
1,Widget B,39.990002,35.991001
2,Gadget X,149.990005,134.990997
3,Gadget Y,199.990005,179.990997
4,Tool Z,79.989998,71.990997
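
The values read back differ slightly from the originals (for example, `39.990002` instead of `39.99`). That pattern is consistent with the float columns being written at 32-bit precision during export, although this recipe does not confirm the storage type. The following minimal sketch uses only the standard library; the helper name `to_float32` is hypothetical and simply rounds a 64-bit Python float to the nearest 32-bit float so you can compare the result with the output above:

```python
import struct

def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float."""
    # Pack as a 4-byte IEEE 754 single, then unpack back to a Python float.
    return struct.unpack('<f', struct.pack('<f', value))[0]

# Prices from the sample data in this recipe
prices = [29.99, 39.99, 149.99, 199.99, 79.99]

for price in prices:
    # Print with six decimal places for comparison with the read-back table
    print(f'{price} -> {to_float32(price):.6f}')
```

If downstream tools need exact decimal prices, consider rounding after you read the file back or storing prices as integer cents.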


## Explanation

**When to use Parquet import:**

| Scenario | Recommendation |
|----------|----------------|
| Data lake / analytics data | Use `create_table(source=path)` |
| ML feature stores | Use `create_table` with `primary_key` |
| Small datasets | Consider CSV for simplicity |
| Streaming data | Use direct `insert()` instead |

**Key features:**

- Automatic schema inference from Parquet metadata
- Support for partitioned datasets (directory of files)
- Export with `pxt.io.export_parquet` for interoperability
- Primary key support for upsert workflows

## See also

- [Import CSV files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports
- [Import JSON files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-json) - For JSON data