# How to Read and Write Parquet Files in Python: Complete Guide with Pandas and PyArrow

This tutorial shows how to read and write Parquet files in Python with pandas and PyArrow. It covers compression, custom schemas, column pruning, filter pushdown, partitioned datasets, appending data, and cloud storage, and closes with best practices for performance.

## Table of Contents
1. Creating Sample Data for Examples
2. Writing Parquet Files with Pandas
3. Writing Parquet Files with PyArrow
4. Reading Parquet Files Efficiently
5. Advanced Reading Techniques
6. Working with Partitioned Datasets
7. Appending Data to Existing Parquet Files
8. Cloud Storage Integration
9. Best Practices and Performance Tips

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import os
import time
from datetime import datetime, timedelta

## 1. Creating Sample E-commerce Dataset for Examples

Let's create a realistic e-commerce dataset to demonstrate various Parquet operations.

In [None]:
# Generate sample e-commerce data
np.random.seed(42)
n_records = 50000

# Date range for orders
start_date = datetime(2023, 1, 1)
end_date = datetime(2024, 1, 1)

# Create dataset
ecommerce_data = {
    'order_id': [f'ORD-{i:06d}' for i in range(n_records)],
    'customer_id': np.random.randint(1000, 5000, n_records),
    'order_date': pd.date_range(start_date, end_date, periods=n_records),
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], n_records),
    'product_name': [f'Product_{i % 1000}' for i in range(n_records)],
    'quantity': np.random.randint(1, 10, n_records),
    'unit_price': np.round(np.random.uniform(10, 500, n_records), 2),
    'discount_percent': np.random.choice([0, 5, 10, 15, 20, 25], n_records),
    'shipping_country': np.random.choice(['USA', 'UK', 'Germany', 'France', 'Italy', 'Spain'], n_records),
    'payment_method': np.random.choice(['Credit Card', 'PayPal', 'Bank Transfer', 'Cash'], n_records),
    'is_premium_customer': np.random.choice([True, False], n_records, p=[0.3, 0.7]),
    'customer_rating': np.random.choice([1, 2, 3, 4, 5], n_records, p=[0.05, 0.1, 0.2, 0.35, 0.3])
}

df = pd.DataFrame(ecommerce_data)

# Calculate total amount
df['total_amount'] = df['quantity'] * df['unit_price'] * (1 - df['discount_percent']/100)

print("E-commerce Dataset Created:")
print(f"Shape: {df.shape}")
print(f"\nColumn Information:")
df.info()
print(f"\nFirst 5 rows:")
df.head()

## 2. How to Write Parquet Files with Pandas

### Basic Write Operations

In [None]:
# Method 1: Basic write with default settings
df.to_parquet('ecommerce_basic.parquet')

# Method 2: Specify PyArrow engine explicitly
df.to_parquet('ecommerce_pyarrow.parquet', engine='pyarrow')

# Method 3: Write without index
df.to_parquet('ecommerce_no_index.parquet', index=False)

# Method 4: Write with specific compression
df.to_parquet('ecommerce_compressed.parquet', compression='snappy', index=False)

print("✅ Parquet files created successfully!")

### Compression Options Comparison

Different compression algorithms offer various trade-offs between file size and speed.

In [None]:
# Compare different compression methods
compression_methods = ['snappy', 'gzip', 'brotli', 'lz4', 'zstd', None]
compression_stats = []

for compression in compression_methods:
    filename = f'test_{compression if compression else "uncompressed"}.parquet'
    
    # Measure write time
    start = time.time()
    df.to_parquet(filename, compression=compression)
    write_time = time.time() - start
    
    # Get file size
    file_size = os.path.getsize(filename) / (1024 * 1024)  # MB
    
    # Measure read time
    start = time.time()
    _ = pd.read_parquet(filename)
    read_time = time.time() - start
    
    compression_stats.append({
        'compression': compression if compression else 'none',
        'file_size_mb': round(file_size, 2),
        'write_time_s': round(write_time, 3),
        'read_time_s': round(read_time, 3),
        'compression_ratio': round((df.memory_usage(deep=True).sum() / (1024**2)) / file_size, 2)
    })
    
    # Clean up
    os.remove(filename)

compression_df = pd.DataFrame(compression_stats)
print("Compression Methods Performance Comparison:")
compression_df.sort_values('file_size_mb')

## 3. How to Write Parquet Files with PyArrow (Advanced)

PyArrow exposes writer options that pandas does not surface directly, such as dictionary encoding, row group size, and data page size.

In [None]:
# Convert pandas DataFrame to PyArrow Table
table = pa.Table.from_pandas(df)

# Write with custom options
pq.write_table(
    table,
    'ecommerce_custom.parquet',
    compression='snappy',
    use_dictionary=True,  # Enable dictionary encoding (all columns)
    compression_level=None,  # Use the codec's default compression level
    use_byte_stream_split=False,  # Optional encoding for floating point data
    # column_encoding is omitted: PyArrow raises a ValueError when it is
    # combined with use_dictionary=True
    row_group_size=20000,  # Maximum rows per row group
    data_page_size=1024*1024  # Target data page size: 1 MB
)

print("✅ Custom Parquet file created with PyArrow")

In [None]:
# Write with custom schema
# Define specific data types
schema = pa.schema([
    pa.field('order_id', pa.string()),
    pa.field('customer_id', pa.int64()),
    pa.field('order_date', pa.timestamp('ns')),
    pa.field('product_category', pa.dictionary(pa.int32(), pa.string())),  # Dictionary encoding
    pa.field('quantity', pa.int32()),
    pa.field('unit_price', pa.float64()),
    pa.field('total_amount', pa.float64()),
    pa.field('is_premium_customer', pa.bool_())
])

# Select columns and create table with schema
selected_columns = ['order_id', 'customer_id', 'order_date', 'product_category', 
                   'quantity', 'unit_price', 'total_amount', 'is_premium_customer']
subset_df = df[selected_columns]
table_with_schema = pa.Table.from_pandas(subset_df, schema=schema)

# Write with schema
pq.write_table(table_with_schema, 'ecommerce_with_schema.parquet')
print("✅ Parquet file with custom schema created")

## 4. How to Read Parquet Files in Python

### Basic Read Operations

In [None]:
# Save a reference file for reading examples
df.to_parquet('ecommerce_data.parquet', compression='snappy')

# Method 1: Basic read with pandas
df_read = pd.read_parquet('ecommerce_data.parquet')
print(f"✅ DataFrame loaded: {df_read.shape}")

# Method 2: Read with specific engine
df_pyarrow = pd.read_parquet('ecommerce_data.parquet', engine='pyarrow')
print(f"✅ Read with PyArrow: {df_pyarrow.shape}")

# Display first few rows
df_read.head(3)

### Read Specific Columns (Column Pruning)

One of Parquet's key advantages is the ability to read only the columns you need.

In [None]:
# Read only specific columns
columns_to_read = ['order_id', 'customer_id', 'order_date', 'total_amount']
df_subset = pd.read_parquet('ecommerce_data.parquet', columns=columns_to_read)

print(f"✅ Columns read: {df_subset.columns.tolist()}")
print(f"Shape: {df_subset.shape}")

# Performance comparison: reading all vs specific columns
start = time.time()
_ = pd.read_parquet('ecommerce_data.parquet')
time_all = time.time() - start

start = time.time()
_ = pd.read_parquet('ecommerce_data.parquet', columns=columns_to_read)
time_subset = time.time() - start

print(f"\nPerformance improvement: {time_all/time_subset:.2f}x faster when reading only 4 columns")

## 5. Advanced Reading: Filters and Predicates

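PyArrow supports predicate pushdown: filters passed at read time are checked against row group statistics and partition values, so data that cannot match is skipped instead of being loaded and then filtered in pandas.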

In [None]:
# Filter 1: Simple equality filter
filters = [('product_category', '==', 'Electronics')]
df_electronics = pd.read_parquet('ecommerce_data.parquet', filters=filters)
print(f"Electronics orders: {len(df_electronics):,} rows")

# Filter 2: Multiple conditions (AND)
filters = [
    ('product_category', '==', 'Electronics'),
    ('total_amount', '>', 1000)
]
df_expensive_electronics = pd.read_parquet('ecommerce_data.parquet', filters=filters)
print(f"Expensive electronics (>$1000): {len(df_expensive_electronics):,} rows")

# Filter 3: OR conditions (list of lists)
filters = [
    [('product_category', '==', 'Electronics')],
    [('product_category', '==', 'Books')]
]
df_electronics_or_books = pd.read_parquet('ecommerce_data.parquet', filters=filters)
print(f"Electronics OR Books: {len(df_electronics_or_books):,} rows")

# Filter 4: Complex filters with different operators
filters = [
    ('customer_rating', '>=', 4),
    ('is_premium_customer', '==', True),
    ('shipping_country', 'in', ['USA', 'UK']),
    ('order_date', '>=', pd.Timestamp('2023-06-01'))
]
df_premium_satisfied = pd.read_parquet('ecommerce_data.parquet', filters=filters)
print(f"Premium satisfied customers (rating>=4) in USA/UK after June 2023: {len(df_premium_satisfied):,} rows")

### Reading with PyArrow for More Control

In [None]:
# Open Parquet file with PyArrow
parquet_file = pq.ParquetFile('ecommerce_data.parquet')

# Get file metadata
print("📊 File Metadata:")
print(f"Total rows: {parquet_file.metadata.num_rows:,}")
print(f"Number of row groups: {parquet_file.num_row_groups}")
print(f"Created by: {parquet_file.metadata.created_by}")

# Schema information
print(f"\n📋 Schema:")
for i, field in enumerate(parquet_file.schema):
    print(f"{i}: {field.name} - {field.physical_type}")

# Read specific row groups
first_row_group = parquet_file.read_row_group(0)
print(f"\n📦 First row group: {first_row_group.num_rows:,} rows")

# Read with filters using PyArrow
table = pq.read_table('ecommerce_data.parquet',
                     filters=[('total_amount', '>', 2000)],
                     columns=['order_id', 'customer_id', 'total_amount'])
df_high_value = table.to_pandas()
print(f"\n💰 High value orders (>$2000): {len(df_high_value):,} rows")

## 6. Working with Partitioned Parquet Datasets

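Partitioning splits a dataset into one directory per value of the partition column (for example `product_category=Electronics/`), so queries that filter on that column only touch the matching directories.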

In [None]:
# Create partitioned dataset by product category
partitioned_path = 'ecommerce_partitioned'

# Write partitioned data
df.to_parquet(
    partitioned_path,
    partition_cols=['product_category'],
    engine='pyarrow',
    compression='snappy'
)

# Display partition structure
print("📁 Partition Structure:")
for root, dirs, files in os.walk(partitioned_path):
    level = root.replace(partitioned_path, '').count(os.sep)
    indent = ' ' * 2 * level
    print(f"{indent}{os.path.basename(root)}/")
    subindent = ' ' * 2 * (level + 1)
    for file in files:
        if file.endswith('.parquet'):
            file_size = os.path.getsize(os.path.join(root, file)) / 1024  # KB
            print(f"{subindent}{file} ({file_size:.1f} KB)")

In [None]:
# Read partitioned data
# Read all partitions
df_all_partitions = pd.read_parquet(partitioned_path)
print(f"All partitions: {df_all_partitions.shape}")

# Read a single partition directory directly
df_electronics_partition = pd.read_parquet(f'{partitioned_path}/product_category=Electronics')
print(f"Electronics partition only: {df_electronics_partition.shape}")
print("Note: product_category is missing here because its value is stored in the directory name, not in the files")

# Read with filters on partitioned data (very efficient!)
df_filtered_partition = pd.read_parquet(
    partitioned_path,
    filters=[('product_category', 'in', ['Electronics', 'Books'])]
)
print(f"\nFiltered partitions (Electronics & Books): {df_filtered_partition.shape}")

## 7. Appending Data to Existing Parquet Files

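Parquet files cannot be modified in place once written. To append rows, either rewrite the file with the combined data or add new files alongside the existing one and read them together as a dataset.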

In [None]:
# Create new data to append
new_orders = pd.DataFrame({
    'order_id': [f'ORD-NEW-{i:04d}' for i in range(1000)],
    'customer_id': np.random.randint(1000, 5000, 1000),
    'order_date': pd.date_range('2024-01-01', periods=1000, freq='h'),  # 'H' is deprecated in pandas 2.2+
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books'], 1000),
    'product_name': [f'NewProduct_{i}' for i in range(1000)],
    'quantity': np.random.randint(1, 5, 1000),
    'unit_price': np.round(np.random.uniform(20, 300, 1000), 2),
    'discount_percent': np.random.choice([0, 10, 20], 1000),
    'shipping_country': np.random.choice(['USA', 'UK'], 1000),
    'payment_method': np.random.choice(['Credit Card', 'PayPal'], 1000),
    'is_premium_customer': np.random.choice([True, False], 1000),
    'customer_rating': np.random.choice([3, 4, 5], 1000)
})
new_orders['total_amount'] = new_orders['quantity'] * new_orders['unit_price'] * (1 - new_orders['discount_percent']/100)

# Method 1: Read, concatenate, and rewrite
existing_df = pd.read_parquet('ecommerce_data.parquet')
combined_df = pd.concat([existing_df, new_orders], ignore_index=True)
combined_df.to_parquet('ecommerce_data_updated.parquet')

print(f"✅ Original file: {len(existing_df):,} rows")
print(f"✅ New data: {len(new_orders):,} rows")
print(f"✅ Updated file: {len(combined_df):,} rows")

# Method 2: For large files, use PyArrow dataset API
# This is more memory efficient for large datasets

## 8. Cloud Storage Integration

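pandas can read and write Parquet directly on cloud object stores: remote URLs are resolved through `fsspec`, which needs the matching backend package installed.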

In [None]:
# Examples of cloud storage URLs (requires appropriate credentials)

# Amazon S3
# df = pd.read_parquet('s3://my-bucket/path/to/file.parquet')
# df.to_parquet('s3://my-bucket/path/to/output.parquet')

# Azure Blob Storage
# df = pd.read_parquet('abfs://container@account.dfs.core.windows.net/path/to/file.parquet')

# Google Cloud Storage
# df = pd.read_parquet('gs://my-bucket/path/to/file.parquet')

print("☁️ Cloud Storage Integration:")
print("1. Install the fsspec backend: s3fs (AWS), adlfs (Azure), gcsfs (GCS)")
print("2. Configure credentials (AWS CLI, Azure CLI, gcloud)")
print("3. Use appropriate URL format: s3://, abfs://, gs://")
print("4. pandas passes the URL to fsspec; use storage_options= for explicit credentials")

## Best Practices and Performance Tips

### 1. Compression Guidelines
- **Snappy**: Best balance of speed and compression (default)
- **Gzip**: Better compression, slower
- **Brotli/Zstd**: Higher compression ratios; Zstd typically reaches them with much faster writes and reads than Brotli or Gzip
- **LZ4**: Fastest compression/decompression

### 2. Performance Optimization
- Use column pruning - read only needed columns
- Apply filters during read when possible
- Partition by frequently filtered columns
- Set an appropriate row_group_size (in PyArrow this is a row count, not a byte size; recent versions default to about 1 million rows per group)

### 3. Schema Best Practices
- Use dictionary encoding for categorical data
- Consider int32 vs int64 for memory efficiency
- Preserve data types to avoid conversion overhead

### 4. Large Dataset Guidelines
- Use partitioning for datasets > 1GB
- Consider PyArrow Dataset API for very large files
- Process data in chunks when memory is limited

## Clean Up

In [None]:
# Remove all temporary files
import shutil

files_to_remove = [
    'ecommerce_basic.parquet', 'ecommerce_pyarrow.parquet', 
    'ecommerce_no_index.parquet', 'ecommerce_compressed.parquet',
    'ecommerce_custom.parquet', 'ecommerce_with_schema.parquet',
    'ecommerce_data.parquet', 'ecommerce_data_updated.parquet'
]

for file in files_to_remove:
    if os.path.exists(file):
        os.remove(file)

# Remove partitioned directory
if os.path.exists('ecommerce_partitioned'):
    shutil.rmtree('ecommerce_partitioned')

print("✅ All temporary files removed")

## Summary

In this comprehensive guide, you learned:

1. **Writing Parquet files** with pandas and PyArrow
2. **Compression options** and their trade-offs
3. **Reading efficiently** with column selection and filters
4. **Partitioning strategies** for large datasets
5. **Cloud storage integration** patterns
6. **Best practices** for optimal performance

### Next Steps:
- Explore PyArrow Dataset API for multi-file datasets
- Learn about schema evolution and compatibility
- Integrate with big data tools like Spark and Dask
- Tune row group size, sorting, and partitioning for your analytical (OLAP) query patterns; Parquet is a poor fit for row-at-a-time OLTP workloads