## Parquet file format

Parquet's file layout is central to its efficiency. Data is organized so that analytical readers can fetch only the columns and row ranges they need, which reduces both storage size and I/O.

### Parquet File Structure:

1. **Row Groups:**
    - Parquet files are divided into row groups, which are logical divisions of data within the file.
    - Each row group contains a subset of rows from the dataset. The size of a row group can be configured.
    - Row groups enable parallelism and efficient processing, allowing readers to work on different row groups concurrently.
2. **Column Chunks:**
    - Within each row group, data is stored column by column, with one column chunk per column.
    - A column chunk holds the values of a single column for the rows in that row group, stored contiguously in the file.
    - Because of this layout, a reader can fetch only the columns it needs without reading the rest, which supports column-wise operations.
3. **Metadata:**
    - Parquet files contain metadata that describes the schema and statistics about the data stored within.
    - Metadata includes information about data types, compression codecs used, encoding methods, and min/max statistics per column chunk.
    - This metadata is stored in the footer of the Parquet file and helps readers to understand the file's structure without needing to read the entire file.
4. **Page Structure:**
    - Data within column chunks is divided into pages for efficient storage and retrieval.
    - Pages can be of different types: data pages, dictionary pages (for dictionary encoding), and index pages.
    - Data pages store the actual values for a column, while dictionary pages store the unique values for dictionary encoding.
    
### Logical Representation:

The logical structure of a Parquet file can be represented hierarchically:

- **File → Row Group(s) → Column Chunk(s) → Page(s)**
    - File: Contains one or more row groups.
    - Row Group: Contains column chunks.
    - Column Chunk: Contains pages of a specific column's data.
    - Page: The smallest unit of storage. A data page holds encoded column values, and a dictionary page holds the dictionary used to encode them.

In [3]:
!pip install pyarrow
import pyarrow as pa
import pyarrow.parquet as pq

Collecting pyarrow
  Downloading pyarrow-14.0.2-cp38-cp38-win_amd64.whl (24.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-14.0.2


In [5]:
parquet_file = pq.ParquetFile(r'C:\Users\HP\Pyspark\parquet\part-r-00000-1a9822ba-b8fb-4d8e-844a-ea30d0801b9e.gz.parquet')

In [11]:
# accessing file-level metadata
parquet_file.metadata

<pyarrow._parquet.FileMetaData object at 0x000001C7ADD65BD0>
  created_by: parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d)
  num_columns: 3
  num_rows: 255
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 658

In [8]:
# accessing the metadata of the first (and only) row group
parquet_file.metadata.row_group(0)

<pyarrow._parquet.RowGroupMetaData object at 0x000001C7AF437630>
  num_columns: 3
  num_rows: 255
  total_byte_size: 5642

In [9]:
# inspecting the first column chunk of the first row group
parquet_file.metadata.row_group(0).column(0)

<pyarrow._parquet.ColumnChunkMetaData object at 0x000001C7AF437D60>
  file_offset: 4
  file_path: 
  physical_type: BYTE_ARRAY
  num_values: 255
  path_in_schema: DEST_COUNTRY_NAME
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x000001C7AF437E00>
      has_min_max: True
      min: Afghanistan
      max: Vietnam
      null_count: 0
      distinct_count: None
      num_values: 255
      physical_type: BYTE_ARRAY
      logical_type: String
      converted_type (legacy): UTF8
  compression: GZIP
  encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 1242
  total_uncompressed_size: 1974

In [10]:
# column chunk statistics: min, max, null count, etc.
parquet_file.metadata.row_group(0).column(0).statistics

<pyarrow._parquet.Statistics object at 0x000001C7AF447180>
  has_min_max: True
  min: Afghanistan
  max: Vietnam
  null_count: 0
  distinct_count: None
  num_values: 255
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8