# Complete Guide to Apache Parquet Files in Python: Tutorial with Examples

## What is Apache Parquet? A Comprehensive Introduction

Apache Parquet is an open-source **columnar storage file format** optimized for use with big data processing frameworks. This tutorial walks through the essentials of working with Parquet files in Python using pandas and PyArrow: writing, reading, inspecting metadata, and choosing a compression codec.

### Key Features of Parquet Files:
- **Columnar Storage Format**: Data is organized by column rather than by row
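- **Efficient Compression**: Column-wise encoding and compression often shrink files by 70-90% compared to CSV, though the exact ratio depends on the data and the codec
- **Schema Evolution**: Columns can be added over time, and readers such as Spark or PyArrow datasets can reconcile files whose schemas differ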
- **Performance Optimized**: Faster read/write operations for analytical queries
- **Cross-Platform Compatibility**: Works with Spark, Pandas, Arrow, and more

### Parquet vs CSV: Performance Comparison
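1. **File Size**: Parquet files are often 70-90% smaller than CSV, depending on the data and the compression codec
2. **Read Speed**: Often several times faster for analytical queries, especially when only a subset of columns is needed
3. **Data Type Preservation**: The schema is stored in the file, so data types do not have to be re-inferred on read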
4. **Complex Data Support**: Handles nested structures and arrays

## How to Install Required Libraries for Parquet in Python

To work with Parquet files in Python, you need:
- **pandas**: Data manipulation and analysis library
- **pyarrow**: Apache Arrow Python bindings for Parquet support
- **fastparquet**: Alternative Parquet engine (optional)
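
If you are unsure which Parquet engine is installed, a quick check along these lines can help. The helper name `available_parquet_engines` is hypothetical; it only tests whether each package can be found, without importing it.

```python
import importlib.util


def available_parquet_engines():
    """Return the names of the Parquet engines that can be found in this environment."""
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
print(f"Installed Parquet engines: {engines}")
```

pandas accepts the engine by name, for example `df.to_parquet(path, engine="pyarrow")`; with the default `engine="auto"` it tries PyArrow first and falls back to fastparquet.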

In [None]:
# Install required packages for Parquet file handling
# Run this cell if packages are not already installed
# !pip install pandas pyarrow
# Optional alternative engine: !pip install fastparquet

In [None]:
# Import necessary libraries for Parquet operations
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import os
import time

# Display library versions
print(f"Pandas version: {pd.__version__}")
print(f"PyArrow version: {pa.__version__}")

## Parquet vs CSV: Real-World Performance Benchmark

Let's create a sample dataset and compare Parquet and CSV formats in terms of file size, read/write speed, and data type preservation.

In [None]:
# Create a sample dataset for benchmarking
np.random.seed(42)

n_rows = 100000
sample_data = {
    'id': range(n_rows),
    'user_name': [f'User_{i}' for i in range(n_rows)],
    'age': np.random.randint(18, 80, n_rows),
    'salary': np.random.randint(20000, 150000, n_rows),
    'department': np.random.choice(['IT', 'HR', 'Sales', 'Marketing', 'R&D'], n_rows),
    'hire_date': pd.date_range('2010-01-01', periods=n_rows, freq='h'),  # lowercase 'h': the 'H' alias is deprecated in pandas 2.2+
    'is_active': np.random.choice([True, False], n_rows, p=[0.85, 0.15])
}

df = pd.DataFrame(sample_data)

print("Dataset Information:")
print(f"Shape: {df.shape}")
print(f"\nData Types:")
print(df.dtypes)
print(f"\nFirst 5 rows:")
df.head()

In [None]:
# Compare file sizes: CSV vs Parquet
csv_file = 'sample_data.csv'
parquet_file = 'sample_data.parquet'

# Save as CSV
df.to_csv(csv_file, index=False)

# Save as Parquet
df.to_parquet(parquet_file, index=False)

# Compare file sizes
csv_size = os.path.getsize(csv_file) / (1024 * 1024)  # MB
parquet_size = os.path.getsize(parquet_file) / (1024 * 1024)  # MB

print(f"CSV file size: {csv_size:.2f} MB")
print(f"Parquet file size: {parquet_size:.2f} MB")
print(f"Size reduction: {(1 - parquet_size/csv_size)*100:.1f}%")
print(f"\nParquet is {csv_size/parquet_size:.1f}x smaller than CSV")

## Read Performance Comparison: Parquet vs CSV

In [None]:
# Benchmark read performance (single run each; results vary with disk caching)
# CSV read time
start_time = time.perf_counter()
df_csv = pd.read_csv(csv_file)
csv_read_time = time.perf_counter() - start_time

# Parquet read time
start_time = time.perf_counter()
df_parquet = pd.read_parquet(parquet_file)
parquet_read_time = time.perf_counter() - start_time

print(f"CSV read time: {csv_read_time:.3f} seconds")
print(f"Parquet read time: {parquet_read_time:.3f} seconds")
print(f"\nParquet is {csv_read_time/parquet_read_time:.1f}x faster than CSV")

## Understanding Parquet File Structure and Metadata

Parquet files have a hierarchical internal structure:
- **Row Groups**: Horizontal slices of the table, each holding many rows (often tens of thousands to millions)
- **Column Chunks**: The data for a single column within one row group
- **Pages**: The smallest unit of encoding and compression within a column chunk

In [None]:
# Examine Parquet file metadata
parquet_file_obj = pq.ParquetFile(parquet_file)

print("Parquet File Metadata:")
print(f"Total rows: {parquet_file_obj.metadata.num_rows}")
print(f"Number of row groups: {parquet_file_obj.num_row_groups}")
print(f"\nFile Schema:")
print(parquet_file_obj.schema)

# Row group information
for i in range(min(3, parquet_file_obj.num_row_groups)):
    rg = parquet_file_obj.metadata.row_group(i)
    print(f"\nRow group {i}: {rg.num_rows} rows, {rg.total_byte_size:,} bytes")

## Data Type Preservation: Parquet vs CSV

One major advantage of Parquet is automatic data type preservation.

In [None]:
# Compare data types after reading
print("Original DataFrame Data Types:")
print(df.dtypes)
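print("\nData Types from CSV (hire_date comes back as plain strings):")
print(df_csv.dtypes)
print("\nData Types from Parquet (preserved correctly):")
print(df_parquet.dtypes)

# CSV loses the datetime type: hire_date is read back as strings unless you pass parse_dates.
# Booleans written as True/False are still inferred as bool by read_csv.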

## Working with Different Compression Algorithms

Parquet supports multiple compression algorithms. Let's compare their performance.

In [None]:
# Test different compression methods
# Codec availability depends on how PyArrow was built; standard wheels typically include all five
compression_methods = ['snappy', 'gzip', 'brotli', 'lz4', 'zstd']
compression_results = []

for compression in compression_methods:
    filename = f'test_{compression}.parquet'
    
    # Write time
    start = time.perf_counter()
    df.to_parquet(filename, compression=compression, index=False)
    write_time = time.perf_counter() - start
    
    # File size
    file_size = os.path.getsize(filename) / (1024 * 1024)  # MB
    
    # Read time
    start = time.perf_counter()
    _ = pd.read_parquet(filename)
    read_time = time.perf_counter() - start
    
    compression_results.append({
        'compression': compression,
        'file_size_mb': round(file_size, 2),
        'write_time_s': round(write_time, 3),
        'read_time_s': round(read_time, 3)
    })
    
    os.remove(filename)

compression_df = pd.DataFrame(compression_results)
print("Compression Methods Comparison:")
compression_df

## Clean Up Temporary Files

In [None]:
# Remove temporary files
for file in [csv_file, parquet_file]:
    if os.path.exists(file):
        os.remove(file)
print("Temporary files removed.")

## Key Takeaways: Why Use Parquet Files?

1. **Storage Efficiency**: Parquet files are often 70-90% smaller than CSV, though the ratio depends on your data
2. **Better Performance**: Reads are often several times faster, especially for column-oriented queries
3. **Data Type Preservation**: No need to parse or infer data types
4. **Compression Options**: Multiple algorithms for different use cases
5. **Big Data Ready**: Optimized for distributed computing frameworks

### Best Practices:
- Use **snappy** compression (the pandas default) for a balance of size and speed
- Use **brotli** or **zstd** when file size matters more than write speed
- Choose Parquet for analytical workloads
- Leverage column pruning for better performance

### Next Steps:
In the next tutorial, we'll explore advanced Parquet operations including:
- Reading specific columns
- Filtering data during read
- Working with partitioned datasets
- Cloud storage integration