### Introduction to PyArrow

*   PyArrow is the Python library for Apache Arrow, a cross-language development platform for in-memory data.
*   Its primary goal is to boost the performance of analytics applications.
*   Emerging from the Apache Arrow project, PyArrow aims to make data interoperability better across different languages and systems.
*   It uses an in-memory columnar data representation, offering an optimized memory footprint for complex data structures.
*   With zero-copy reads, it facilitates quick data sharing between Python and other languages, sidestepping the need for serialization.
*   It supports schemas and metadata, providing data structures that are rich and self-describing.


### PyArrow and Parquet

*   PyArrow offers seamless reading and writing operations for Parquet files.
*   With column pruning, you can selectively read only the necessary columns from a Parquet file, reducing I/O time.


In [None]:
import pyarrow.parquet as pq 
table = pq.read_table('your_file.parquet', columns=['column1', 'column2']) 
# Optionally convert the table to pandas if needed for more sophisticated slicing and dicing.
df = table.to_pandas()

### Apache Arrow

> The core feature of Apache Arrow is its in-memory columnar format. This language-agnostic standard is designed to store structured, table-like datasets efficiently in memory. The data format supports a rich set of data types, including nested and user-defined types, making it suitable for analytic databases, data frame libraries, and more.
>
> *The Apache Arrow Project*





<div align="center">
<img src="https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/img/with_arrow.jpg" width=700>
</div>

[picture source](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/)

In [None]:
# !pip install pyarrow

### PyArrow Data Structures

*   PyArrow offers a suite of low-level data structures and methods optimized for both speed and flexibility.
*   These structures can be used seamlessly across multiple languages.

### Arrow Array

*   An Arrow Array is essentially a column of data stored in an efficient, contiguous block of memory.
*   Unlike Python lists, these arrays are optimized for high-speed operations and can be transferred across languages without incurring serialization costs.

In [6]:
import pyarrow as pa
arrow_array = pa.array([1, 2, 3, 4, 5])
print(type(arrow_array))
print("---------")
print(arrow_array)

<class 'pyarrow.lib.Int64Array'>
---------
[
  1,
  2,
  3,
  4,
  5
]


### Arrow Buffer

* While not a data structure per se, Arrow Buffers are pivotal in understanding Arrow functionality.
* Buffers are blocks of memory that house the data for Arrow Arrays, contributing to efficient storage.
* You can even access the buffer's content directly.




In [7]:
buffer = arrow_array.buffers()[1]
print(buffer)


<pyarrow.Buffer address=0x59974030140 size=40 is_cpu=True is_mutable=True>


In [8]:
byte_data = buffer.to_pybytes()
print(byte_data)

b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00'


### Arrow Buffer - Cont'd

* Here, the buffer holds 40 bytes: 8 bytes for each of the 5 `int64` elements in the array.

* You can use this buffer data to create a new NumPy array, showing that Arrow and NumPy can share memory.




In [10]:
import numpy as np 
numpy_array = np.frombuffer(buffer, dtype=np.int64)
numpy_array


array([1, 2, 3, 4, 5])

In [11]:
np.shares_memory(arrow_array, numpy_array)

True

### Arrow Buffer - Cont'd

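* Both `arrow_array` and `numpy_array` share the same underlying data, demonstrating the concept of zero-copy.
* You can confirm this by modifying a value through the NumPy array and then inspecting the Arrow array.
  * Both arrays will show the updated value.
  * This is for demonstration only: Arrow arrays are designed to be immutable, so avoid writing to shared buffers in real code.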
    
    

In [12]:
numpy_array[1] = 0
numpy_array

array([1, 0, 3, 4, 5])

In [13]:
arrow_array

<pyarrow.lib.Int64Array object at 0x105c51340>
[
  1,
  0,
  3,
  4,
  5
]

### Schema

* A schema in PyArrow defines the column names and data types of a Record Batch or Table, along with optional metadata.
* Schemas are crucial because they set the framework for data manipulation and operations in Arrow.
  * They tell Arrow how to encode the data.



In [20]:
schema = pa.schema([('column1', pa.int64()), ('column2', pa.string())])
print(schema)

column1: int64
column2: string


### Chunked Array

*   A Chunked Array in PyArrow is like a single Arrow Array but divided into smaller "chunks."
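*   This structure lets a column be assembled from separately allocated pieces without copying them into one contiguous block, which helps with data that arrives incrementally or is too large for a single allocation.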
*   It's commonly used in distributed computing frameworks and data streaming scenarios.

* For example:
  * you could have data sent in chunks to optimize throughput
  * you might have multiple nodes in a distributed system each producing Arrow Arrays that are collected and represented as a ChunkedArray by the master node.

* From a user perspective, a Chunked Array appears as a contiguous sequence of data.




In [22]:
results_node_1 = pa.array([0,1,2,3,4])
results_node_2 = pa.array([5,6,7,8,9,10])
chunked_array = pa.chunked_array([results_node_1, results_node_2])
chunked_array


<pyarrow.lib.ChunkedArray object at 0x1269cc3b0>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  [
    5,
    6,
    7,
    8,
    9,
    10
  ]
]

### Chunked Array - Cont'd

* You can index a single position or slice across multiple chunks, which makes data handling more versatile.
* You can also access individual chunks, allowing for parallel processing.

In [23]:
chunked_array[3:6]

<pyarrow.lib.ChunkedArray object at 0x1269cc4a0>
[
  [
    3,
    4
  ],
  [
    5
  ]
]

In [24]:
chunked_array.chunk(0)

<pyarrow.lib.Int64Array object at 0x1268b69a0>
[
  0,
  1,
  2,
  3,
  4
]

### Table

* A Table in PyArrow is a container for multiple named columns with a common schema.
* Each column in the Table is a Chunked Array, and all columns share the same length.
* Tables are a natural fit for dataframe-style data.
* A Table can be held in memory on a single machine, sent across a network, or written out as a dataset partitioned across multiple files for large-scale storage.






In [25]:
column1 = pa.array([0, 1, 2, 3, 4]) 
column2 = pa.array(['a', 'b', 'c', 'd', 'e'])
table = pa.table({'column1': column1, 'column2': column2})  

table

pyarrow.Table
column1: int64
column2: string
----
column1: [[0,1,2,3,4]]
column2: [["a","b","c","d","e"]]

### Record Batch

*   A Record Batch is a collection of Arrow Arrays (columns) with the same length, all of which are bundled together with a schema.
*   Much like a Chunked Array is a collection of Arrow Arrays, a Table in Apache Arrow is a collection of Record Batches.

* Conceptual Relationship
  *   In Apache Arrow, the concept of a Record Batch is to a Table what an Arrow Array is to a Chunked Array.
    *   Arrays can be grouped together to form a Chunked Array.
    *   Record Batches can be grouped together to form a Table.




### Record Batch - Cont'd

* Use Cases
  *   The choice between a Record Batch and a Table often depends on your specific needs.
  *   Streaming data: if you need to process data on the fly, for example in a streaming application that handles each chunk as it arrives, Record Batches are a good choice.
    *   You can serialize and process each Record Batch independently as it arrives, without waiting for the entire dataset.


In [35]:

column1_array = pa.array([1, 2, 3, 4, 5])
column2_array = pa.array(['a', 'b', 'c', 'd', 'e'])
schema = pa.schema([('column1', pa.int64()), ('column2', pa.string())])

record_batch = pa.record_batch([column1_array, column2_array], schema=schema)
record_batch


pyarrow.RecordBatch
column1: int64
column2: string


In [30]:
record_batch.columns

[<pyarrow.lib.Int64Array object at 0x1269a54c0>
 [
   1,
   2,
   3,
   4,
   5
 ],
 <pyarrow.lib.StringArray object at 0x1269a5100>
 [
   "a",
   "b",
   "c",
   "d",
   "e"
 ]]

In [31]:
record_batch["column1"]

<pyarrow.lib.Int64Array object at 0x1269b1fa0>
[
  1,
  2,
  3,
  4,
  5
]

In [39]:

column1_array_new = pa.array([6, 7, 8, 9, 10])
column2_array_new = pa.array(['f', 'g', 'h', 'i', 'j'])
record_batch_new = pa.record_batch([column1_array_new, column2_array_new], schema=schema)


table = pa.Table.from_batches([record_batch, record_batch_new], schema=schema)
table


pyarrow.Table
column1: int64
column2: string
----
column1: [[1,2,3,4,5],[6,7,8,9,10]]
column2: [["a","b","c","d","e"],["f","g","h","i","j"]]

### Record Batch - Cont'd

* In the example above, two Record Batches are combined to create a single Table. 
  * This is analogous to how individual Arrow Arrays can be combined to create a Chunked Array
  * Reinforces the idea that a Record Batch is to a Table what an Arrow Array is to a Chunked Array.


### Dive Into Real Data: Parquet and Memory Efficiency

1.  Let's get hands-on and read a Parquet file using Apache Arrow.
2.  Take note: the in-memory size of the data as an Arrow Table is substantially smaller than as a Pandas DataFrame (roughly 3 GiB versus 9.9 GiB in the run below).
3.  Think of this as a little teaser to whet your appetite for data science goodness.

**Note**: Here, I'm using the `parquet` module from the PyArrow package. This module knows how to read Parquet files among other things.



In [40]:
import pyarrow.parquet as pq
table = pq.read_table('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')
table


pyarrow.Table
hvfhs_license_num: string
dispatching_base_num: string
originating_base_num: string
request_datetime: timestamp[us]
on_scene_datetime: timestamp[us]
pickup_datetime: timestamp[us]
dropoff_datetime: timestamp[us]
PULocationID: int64
DOLocationID: int64
trip_miles: double
trip_time: int64
base_passenger_fare: double
tolls: double
bcf: double
sales_tax: double
congestion_surcharge: double
airport_fee: double
tips: double
driver_pay: double
shared_request_flag: string
shared_match_flag: string
access_a_ride_flag: string
wav_request_flag: string
wav_match_flag: string
----
hvfhs_license_num: [["HV0003","HV0003","HV0003","HV0003","HV0005",...,"HV0005","HV0003","HV0005","HV0005","HV0005"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],...,["HV0005","HV0005","HV0005","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0005"]]
di

In [45]:
import sys
sys.getsizeof(table) / 1024 / 1024 / 1024

3.03908724244684

In [42]:
# import os
# import psutil
# def print_mem():
#     gig = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
#     print(f"{gig} gigabytes")

# print_mem()


3.64825439453125 gigabytes


In [46]:
import pandas as pd
df = pd.read_parquet('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')
sys.getsizeof(df) / 1024 / 1024 / 1024


9.87973692920059


### Apache Arrow Datasets


*   Datasets in PyArrow let you work with large tabular data, even when it is larger than your machine's memory.
*   They offer lazy data access, meaning you don't have to load the entire dataset into memory.
*   Datasets support data discovery, partitioning, and various file systems, including AWS S3, Google Cloud Storage, and local storage.
  * Reading from S3 or Google Cloud Storage typically requires no additional packages, because PyArrow bundles these file system implementations.

* Import the dataset module as:

```python
import pyarrow.dataset as ds
```


### Dataset Overview

* Provider: New York City Taxi and Limousine Commission (TLC)
* Data is hosted on AWS S3 in the Ursa Labs bucket (`ursa-labs-taxi-data`).
* Contains data on millions of taxi and limousine trips in NYC
* Time Period: 2009 to 2019


In [49]:
# **Note**: In the AWS S3 listing, "PRE" stands for "prefix," essentially representing a folder or directory.

!aws s3 ls "s3://ursa-labs-taxi-data/"

                           PRE 2009/
                           PRE 2010/
                           PRE 2011/
                           PRE 2012/
                           PRE 2013/
                           PRE 2014/
                           PRE 2015/
                           PRE 2016/
                           PRE 2017/
                           PRE 2018/
                           PRE 2019/


In [50]:
!aws s3 ls "s3://ursa-labs-taxi-data/2009/"

                           PRE 01/
                           PRE 02/
                           PRE 03/
                           PRE 04/
                           PRE 05/
                           PRE 06/
                           PRE 07/
                           PRE 08/
                           PRE 09/
                           PRE 10/
                           PRE 11/
                           PRE 12/


In [53]:
import pyarrow.dataset as ds
dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
dataset

<pyarrow._dataset.FileSystemDataset at 0x126e47dc0>

In [54]:
len(dataset.files)

125

In [55]:
dataset.files[0:10]

['ursa-labs-taxi-data/2009/01/data.parquet',
 'ursa-labs-taxi-data/2009/02/data.parquet',
 'ursa-labs-taxi-data/2009/03/data.parquet',
 'ursa-labs-taxi-data/2009/04/data.parquet',
 'ursa-labs-taxi-data/2009/05/data.parquet',
 'ursa-labs-taxi-data/2009/06/data.parquet',
 'ursa-labs-taxi-data/2009/07/data.parquet',
 'ursa-labs-taxi-data/2009/08/data.parquet',
 'ursa-labs-taxi-data/2009/09/data.parquet',
 'ursa-labs-taxi-data/2009/10/data.parquet']

In [57]:
# Load just one file (a fragment) and inspect its partition expression:

frag = next(dataset.get_fragments())
frag.partition_expression

<pyarrow.compute.Expression ((year == 2009) and (month == 1))>

#### Play with a Single File

* Let's read the data from this single fragment into a Table.
* Then take a look at the data.
* Finally, list the column names and count the rows.
    

In [60]:
%%time
frag_table = frag.to_table()
frag_table

CPU times: user 4.68 s, sys: 2.84 s, total: 7.53 s
Wall time: 1min 27s


pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["VTS","VTS","VTS","DDS","DDS",...,"DDS","CMT","CMT","CMT","CMT"],["CMT","DDS","DDS","CMT","DDS",...,"CMT","CMT","CMT","CMT","CMT"],...,["CMT","CMT","DDS","CMT","CMT",...,"VTS","CMT","VTS","VTS","VTS"],["VTS","VTS","VTS","VTS","VTS",...,"VTS","VTS","CMT","VTS","CMT"]]
pickup_at: [[2009-01-04 02:52:00.000000,2009-01-04 03:31:00.000000,2009-01-03 15:43:00.000000,2009-01-01 20:52:58.000000,2009-01-24 16:18:23.000000,...,2009-01-01 22:42:49.000000,2009-01-04 18:27:32.000000,2009-01-04 11:48:33.000000,2009-01-04 23:21:04.000000,2009-01-04 16:11:27.000000],[2009-01-04 21:54:44.000000,2009

In [62]:
frag_table.column_names

['vendor_id',
 'pickup_at',
 'dropoff_at',
 'passenger_count',
 'trip_distance',
 'pickup_longitude',
 'pickup_latitude',
 'rate_code_id',
 'store_and_fwd_flag',
 'dropoff_longitude',
 'dropoff_latitude',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'total_amount']

In [61]:
frag_table.num_rows


14092413

#### Chunks: The Building Blocks

* Remember how we talked about Arrow tables having columns that can be split into chunks?
* If you take a look, each column of this table is divided into 216 chunks.
  * This confirms that the table is built the way we discussed earlier.
* You can also take just a slice of the data with `slice`.

In [67]:
frag_table.slice(0, 5)

pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["VTS","VTS","VTS","DDS","DDS"]]
pickup_at: [[2009-01-04 02:52:00.000000,2009-01-04 03:31:00.000000,2009-01-03 15:43:00.000000,2009-01-01 20:52:58.000000,2009-01-24 16:18:23.000000]]
dropoff_at: [[2009-01-04 03:02:00.000000,2009-01-04 03:38:00.000000,2009-01-03 15:57:00.000000,2009-01-01 21:14:00.000000,2009-01-24 16:24:56.000000]]
passenger_count: [[1,3,5,1,1]]
trip_distance: [[2.63,4.55,10.35,5,0.4]]
pickup_longitude: [[-73.99196,-73.9821,-74.00259,-73.974266,-74.00158]]
pickup_latitude: [[40.721565,40.73629,40.739746,40.790955,40.719383]]
rate_code_id: [5 nulls]
store_and_fwd_fla

In [68]:
[frag_table[col_name].num_chunks for col_name in frag_table.column_names]


[216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216]

### The Essentials of Apache Arrow Tables and Record Batches

*   Tables in Apache Arrow are essentially collections of record batches, and `to_batches()` returns them as a list.
*   You can easily pull data from columns like `payment_type`, `fare_amount`, or `tip_amount`.
* Once we pick out a single record batch, managing the data is pretty straightforward.
  * Each column of the batch used here holds 65,536 values.


In [70]:
record_batch_3 = frag_table.to_batches()[3]
record_batch_3

pyarrow.RecordBatch
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float

In [79]:
record_batch_3.num_rows

65536

In [74]:
record_batch_3["vendor_id"]

<pyarrow.lib.StringArray object at 0x126aeb3a0>
[
  "DDS",
  "DDS",
  "CMT",
  "CMT",
  "CMT",
  "CMT",
  "CMT",
  "DDS",
  "DDS",
  "CMT",
  ...
  "VTS",
  "VTS",
  "CMT",
  "VTS",
  "VTS",
  "VTS",
  "VTS",
  "VTS",
  "VTS",
  "VTS"
]

In [78]:
record_batch_3['tip_amount']

<pyarrow.lib.FloatArray object at 0x126ac3220>
[
  0,
  0,
  0,
  0,
  0.76,
  2.67,
  2,
  0,
  0,
  0,
  ...
  0,
  0,
  0,
  5.06,
  0,
  0,
  1,
  0,
  0,
  0
]

In [77]:
record_batch_3['payment_type']

<pyarrow.lib.StringArray object at 0x126ac31c0>
[
  "CASH",
  "CASH",
  "Cash",
  "Cash",
  "Credit",
  "Credit",
  "Credit",
  "CASH",
  "CASH",
  "Cash",
  ...
  "CASH",
  "CASH",
  "Cash",
  "Credit",
  "CASH",
  "CASH",
  "Credit",
  "CASH",
  "CASH",
  "CASH"
]

#### PyArrow's Computational Capabilities

*   PyArrow separates data storage concerns from computational functionality.    
    *   Structures like Arrow Arrays, Record Batches, and Tables handle data storage and serialization.
    *   For actual data operations, there's the `pyarrow.compute` module.
*   The `pyarrow.compute` module offers a range of functions for filtering, transforming, and aggregating data.    
    *   While it does provide basic operations, it's not a full-blown analytical tool. For more complex tasks, you'd typically use something like Pandas or Spark.

* Let's perform some computations like calculating the sum of tips and fares, etc.


In [80]:
import pyarrow.compute as pc
pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount'])

<pyarrow.lib.FloatArray object at 0x126e47a00>
[
  4.9,
  10.5,
  4.2,
  8.2,
  4.56,
  20.47,
  11.8,
  6.9,
  3.7,
  10.5,
  ...
  45,
  6.9,
  6.2,
  30.359999,
  5.7,
  25.3,
  6.3,
  24.1,
  6.9,
  22.1
]

* How about finding the maximum total amount for a trip, including the tip?

In [81]:
pc.max(pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount']))

<pyarrow.FloatScalar: 164.0>

* And the average?


In [82]:
pc.mean(pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount']))

<pyarrow.DoubleScalar: 10.015554052642983>

* We can also perform operations on string data, like converting the case of `payment_type`, which has been recorded inconsistently.


In [84]:
upper_cased_payment_type = pc.utf8_upper(record_batch_3["payment_type"])
upper_cased_payment_type

<pyarrow.lib.StringArray object at 0x126abf520>
[
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CREDIT",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH",
  ...
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH"
]

* You can then filter data based on whether the payment type was "CASH."


In [88]:
is_cash = pc.equal(upper_cased_payment_type, pa.scalar('CASH'))
is_cash 

<pyarrow.lib.BooleanArray object at 0x126abf100>
[
  true,
  true,
  true,
  true,
  false,
  false,
  false,
  true,
  true,
  true,
  ...
  true,
  true,
  true,
  false,
  true,
  true,
  false,
  true,
  true,
  true
]

In [89]:
filtered_record_batch_3 = pc.filter(record_batch_3, is_cash)
filtered_record_batch_3
filtered_record_batch_3.num_rows

51341


#### Working with Parquet Files

*   You can read Parquet data into PyArrow as a `ParquetDataset` and then work with its individual Parquet file fragments.
* Recall that:
    * Each fragment has its own metadata.
    * You can also get statistics about each row group within the fragment.
* Row-group statistics are most useful when the data is sorted on the columns you query often.
    * You can sort the table and save it to a new Parquet file for optimized data retrieval.


In [95]:
import pyarrow as pa 
import pyarrow.parquet as pq
dataset = pq.ParquetDataset('s3://ursa-labs-taxi-data/2009/', partitioning=["month"])
dataset

<pyarrow.parquet.core._ParquetDatasetV2 at 0x127fb2c70>

In [98]:
data_table = dataset.fragments[0].to_table() 
sorted_indices = pc.sort_indices(data_table, sort_keys=[("pickup_at", "ascending"), ("fare_amount", "ascending")])
sorted_indices

<pyarrow.lib.UInt64Array object at 0x1279bb9a0>
[
  11489987,
  3964040,
  543513,
  8582999,
  11812099,
  3708729,
  10177659,
  12142978,
  4811616,
  5665566,
  ...
  10604876,
  10079366,
  1839956,
  4631967,
  8528489,
  347071,
  3174063,
  6071930,
  1328472,
  7684378
]

In [99]:
sorted_table = data_table.take(sorted_indices)


In [None]:
# Uncomment to write the sorted table; the next section reads this file back.
# pq.write_table(sorted_table, 'optimized_parquet_file.parquet', row_group_size=65536)


#### Exploring Sorted Parquet Files

*   When you read the sorted file back into PyArrow, it's easier to work with.
  * We can inspect the row groups' metadata and read only the row groups we are interested in.
  * In other words, you can delve into the metadata to understand your data better.



In [100]:
optimized_parquet_file = pq.ParquetFile('optimized_parquet_file.parquet')

rg0_metadata = optimized_parquet_file.metadata.row_group(0)
rg0_metadata.to_dict()





{'num_columns': 18,
 'num_rows': 65536,
 'total_byte_size': 1645654,
 'columns': [{'file_offset': 8548,
   'file_path': '',
   'physical_type': 'BYTE_ARRAY',
   'num_values': 65536,
   'path_in_schema': 'vendor_id',
   'is_stats_set': True,
   'statistics': {'has_min_max': True,
    'min': 'CMT',
    'max': 'DDS',
    'null_count': 0,
    'distinct_count': 0,
    'num_values': 65536,
    'physical_type': 'BYTE_ARRAY'},
   'compression': 'SNAPPY',
   'encodings': ('RLE_DICTIONARY', 'PLAIN', 'RLE'),
   'has_dictionary_page': True,
   'dictionary_page_offset': 4,
   'data_page_offset': 34,
   'total_compressed_size': 8544,
   'total_uncompressed_size': 9856},
  {'file_offset': 196427,
   'file_path': '',
   'physical_type': 'INT64',
   'num_values': 65536,
   'path_in_schema': 'pickup_at',
   'is_stats_set': True,
   'statistics': {'has_min_max': True,
    'min': datetime.datetime(2009, 1, 1, 0, 0),
    'max': datetime.datetime(2009, 1, 1, 4, 22, 17),
    'null_count': 0,
    'distinct_co

In [None]:

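from datetime import datetime

# Map each column name to its position in the file's schema.
name_2_pos = {name: pos for pos, name in enumerate(optimized_parquet_file.schema_arrow.names)}
col_idx = name_2_pos['pickup_at']

# Timestamp to look for in the sorted file.
datetime_obj = datetime.strptime("2009-1-1 14:00:00", "%Y-%m-%d %H:%M:%S")

# Row-group statistics store each column's min and max, so the matching
# row group can be found without reading any data pages.
for i in range(optimized_parquet_file.num_row_groups):
    col_stats = optimized_parquet_file.metadata.row_group(i).column(col_idx).statistics
    if col_stats.min <= datetime_obj <= col_stats.max:
        print(f"found it, it's row_group {i}")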
    

### Bonus Questions

* Can you compute the average transaction amount between 2:00 PM and 2:59 PM?
* Which day has the highest average tip?
* Which time of day has the highest tip?

### Resources

1.  [Apache Arrow Homepage](https://arrow.apache.org/)
2.  [PyArrow Documentation](https://arrow.apache.org/docs/python/)
3.  [PyArrow GitHub Repository](https://github.com/apache/arrow/tree/master/python/pyarrow)