### Introduction to PyArrow

* PyArrow is the Python library for Apache Arrow, a cross-language development platform for in-memory data.
* It serves as a foundation for building high-performance analytics applications.
* Developed as part of the Apache Arrow project, it aims to improve data interoperability between languages and tools.
* In-memory Columnar Data Representation: Efficiently represents complex data structures in a memory-optimized way.
* Zero-Copy Reads: Enables rapid data sharing between Python and other languages without serialization overhead.
* Schema and Metadata Support: Enables rich, self-describing data structures.



### PyArrow and Parquet

* PyArrow can read and write Parquet files efficiently.
* Column Pruning: Reads only the necessary columns from a Parquet file, reducing I/O.
```python
import pyarrow.parquet as pq
table = pq.read_table('your_file.parquet', columns=['column1', 'column2'])
df = table.to_pandas()
```
* Row Group Filtering: Allows selective reading of row groups based on conditions, optimizing data retrieval.
  * For example, read only the row groups in a Parquet file whose dates fall in the requested range.


### Apache Arrow

> A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

*Apache Arrow project*

<div align="center">
<img src="https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/img/with_arrow.jpg" width=700>
</div>
[picture source](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/)

In [4]:
# !pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-9.0.0-cp38-cp38-macosx_11_0_arm64.whl (21.6 MB)
     |████████████████████████████████| 21.6 MB 3.3 MB/s eta 0:00:01
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
# !wget https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2022-06.parquet

In [2]:
import pyarrow.parquet as pq

# Read the entire Parquet file into a PyArrow Table
table = pq.read_table('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')
table

pyarrow.Table
hvfhs_license_num: string
dispatching_base_num: string
originating_base_num: string
request_datetime: timestamp[us]
on_scene_datetime: timestamp[us]
pickup_datetime: timestamp[us]
dropoff_datetime: timestamp[us]
PULocationID: int64
DOLocationID: int64
trip_miles: double
trip_time: int64
base_passenger_fare: double
tolls: double
bcf: double
sales_tax: double
congestion_surcharge: double
airport_fee: double
tips: double
driver_pay: double
shared_request_flag: string
shared_match_flag: string
access_a_ride_flag: string
wav_request_flag: string
wav_match_flag: string
----
hvfhs_license_num: [["HV0003","HV0003","HV0003","HV0003","HV0005",...,"HV0005","HV0003","HV0005","HV0005","HV0005"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],...,["HV0005","HV0005","HV0005","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0005"]]
di

In [11]:
import os
import psutil
def print_mem():
    # Resident set size (RSS) of the current Python process, converted from bytes to GiB
    gig = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"{gig} gigabytes")

print_mem()

3.9536590576171875 gigabytes


In [13]:
parquet_file = pq.ParquetFile('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')

In [18]:
# Read only the first row group (index 0) into a PyArrow Table
pq_rg = parquet_file.read_row_group(0)
pq_rg

pyarrow.Table
hvfhs_license_num: string
dispatching_base_num: string
originating_base_num: string
request_datetime: timestamp[us]
on_scene_datetime: timestamp[us]
pickup_datetime: timestamp[us]
dropoff_datetime: timestamp[us]
PULocationID: int64
DOLocationID: int64
trip_miles: double
trip_time: int64
base_passenger_fare: double
tolls: double
bcf: double
sales_tax: double
congestion_surcharge: double
airport_fee: double
tips: double
driver_pay: double
shared_request_flag: string
shared_match_flag: string
access_a_ride_flag: string
wav_request_flag: string
wav_match_flag: string
----
hvfhs_license_num: [["HV0003","HV0003","HV0003","HV0003","HV0005",...,"HV0003","HV0003","HV0003","HV0003","HV0005"]]
dispatching_base_num: [["B03404","B03404","B03404","B03404","B03406",...,"B03404","B03404","B03404","B03404","B03406"]]
originating_base_num: [["B03404","B03404","B03404","B03404",null,...,"B03404","B03404","B03404","B03404",null]]
request_datetime: [[2022-06-01 00:15:35.000000,2022-06-01 00:3

In [19]:
import pandas as pd

df = pd.read_parquet('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')

In [20]:
# Sort so that each row group written later covers a narrow min/max range for these columns
df = df.sort_values(by=['pickup_datetime', 'DOLocationID', 'base_passenger_fare'])

In [24]:
import pyarrow as pa
import pyarrow.parquet as pq

# Convert DataFrame to PyArrow Table
table = pa.table(df)

# Write back to Parquet with at most 100,000 rows per row group
pq.write_table(table, 'optimized_parquet_file.parquet', row_group_size=100000)

In [28]:
parquet_file = pq.ParquetFile('optimized_parquet_file.parquet')

In [29]:
num_row_groups = parquet_file.num_row_groups

for rg_index in range(num_row_groups):
    row_group = parquet_file.metadata.row_group(rg_index)
    
    num_columns = row_group.num_columns
    
    print(f"Row Group {rg_index}")
    
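    for col_index in range(num_columns):
        col = row_group.column(col_index)
        stats = col.statistics

        # Statistics are optional in Parquet, so guard against missing values
        if stats is None or not stats.has_min_max:
            print(f"  Column {col_index}: no min/max statistics")
            continue

        print(f"  Column {col_index}: Min = {stats.min}, Max = {stats.max}")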


Row Group 0
  Column 0: Min = HV0003, Max = HV0005
  Column 1: Min = B02395, Max = B03406
  Column 2: Min = B02026, Max = B03406
  Column 3: Min = 2022-05-31 16:17:22, Max = 2022-06-01 08:40:00
  Column 4: Min = 2022-05-31 22:44:16, Max = 2022-06-01 08:20:27
  Column 5: Min = 2022-06-01 00:00:00, Max = 2022-06-01 08:20:29
  Column 6: Min = 2022-06-01 00:02:31, Max = 2022-06-01 14:52:19
  Column 7: Min = 3, Max = 265
  Column 8: Min = 1, Max = 265
  Column 9: Min = -0.0, Max = 187.168
  Column 10: Min = 1, Max = 26106
  Column 11: Min = -31.83, Max = 501.42
  Column 12: Min = -0.0, Max = 65.98
  Column 13: Min = -0.0, Max = 15.95
  Column 14: Min = -0.0, Max = 37.76
  Column 15: Min = -0.0, Max = 5.5
  Column 16: Min = -0.0, Max = 5.0
  Column 17: Min = -0.0, Max = 75.0
  Column 18: Min = -0.0, Max = 364.28
  Column 19: Min = N, Max = N
  Column 20: Min = N, Max = Y
  Column 21: Min =  , Max = N
  Column 22: Min = N, Max = Y
  Column 23: Min = N, Max = Y
  Column 24: Min = 0, Max = 1213