In [285]:
record_batch_3.num_rows

65536

In [277]:
results_node_3 = pa.array([11,12,13])
new_chunked_array= pa.chunked_array([chunked_array.chunk(i) for i in range(chunked_array.num_chunks)] + [results_node_3])

In [278]:
[x.buffers() for x in new_chunked_array.chunks]

[[None,
  <pyarrow.Buffer address=0x414f0020680 size=40 is_cpu=True is_mutable=True>],
 [None,
  <pyarrow.Buffer address=0x414f0020100 size=48 is_cpu=True is_mutable=True>],
 [None,
  <pyarrow.Buffer address=0x414f0020540 size=24 is_cpu=True is_mutable=True>]]

### Introduction to PyArrow 

* PyArrow is the Python library for Apache Arrow, a cross-language development platform for in-memory data.
* It serves as a foundation for building high-performance analytics applications.
* Developed as part of the Apache Arrow project, it aims to improve data interoperability between languages and systems.
* In-memory Columnar Data Representation: Efficiently represents complex data structures in a memory-optimized way.
* Zero-Copy Reads: Enables rapid data sharing between Python and other languages without serialization overhead.
* Schema and Metadata Support: Enables rich, self-describing data structures.



### PyArrow and Parquet

* PyArrow can read and write Parquet files efficiently and seamlessly.
* Column Pruning: Reads only the necessary columns from a Parquet file, reducing I/O.
```python
import pyarrow.parquet as pq
table = pq.read_table('your_file.parquet', columns=['column1', 'column2'])
df = table.to_pandas()
```
* Row Group Filtering: Allows selective reading of row groups based on conditions, optimizing data retrieval.
  * For example, read only the row groups of a Parquet file whose dates fall within a requested range.


### Apache Arrow

> A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
>
> Apache Arrow Project

<div align="center">
<img src="https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/img/with_arrow.jpg" width=700>
</div>
[picture source](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/)

In [4]:
# !pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-9.0.0-cp38-cp38-macosx_11_0_arm64.whl (21.6 MB)
     |████████████████████████████████| 21.6 MB 3.3 MB/s eta 0:00:01
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0
Note: you may need to restart the kernel to use updated packages.


### PyArrow Data Structures
* PyArrow offers a suite of low-level data structures and methods optimized for speed and flexibility.
* These structures are highly interoperable and can be shared between languages seamlessly.


In [None]:
### Arrow Array
* Arrow Arrays are columns of data stored in a contiguous, efficient format. 
  * Unlike Python lists, they are optimized for speed and can be transferred between languages without serialization overhead.



In [16]:
import pyarrow as pa

# Create an Arrow Array from a Python list
arrow_array = pa.array([1, 2, 3, 4, 5])

# Display the array
print(type(arrow_array))
print(arrow_array)

<class 'pyarrow.lib.Int64Array'>
[
  1,
  2,
  3,
  4,
  5
]


### Arrow Buffer
* While not a data structure in their own right, Buffers are an important concept and need to be introduced to appreciate some of Arrow's functionality.
  * Buffers are contiguous blocks of memory where data arrays are stored. 
  * In PyArrow, they usually back the data in Arrow Arrays, making the storage highly efficient.



In [17]:
# Get the buffer
buffer = arrow_array.buffers()[1]

# Convert buffer to Python bytes
byte_data = buffer.to_pybytes()
print(byte_data)

b'\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00'


In [21]:
import numpy as np
numpy_array = np.frombuffer(buffer, dtype=np.int64)
numpy_array

array([1, 2, 3, 4, 5])

In [22]:
arrow_array

<pyarrow.lib.Int64Array object at 0x13f95f1c0>
[
  1,
  2,
  3,
  4,
  5
]

In [24]:
np.shares_memory(arrow_array, numpy_array)

True

In [27]:
numpy_array[1] = 1

In [28]:
numpy_array

array([1, 1, 3, 4, 5])

In [29]:
arrow_array

<pyarrow.lib.Int64Array object at 0x13f95f1c0>
[
  1,
  1,
  3,
  4,
  5
]

In [118]:
arrow_array2 = pa.array(['a', 'bb', 'c', 'ddddddd', 'e'])
arrow_array2.buffers()

[None,
 <pyarrow.Buffer address=0x37b24030e80 size=24 is_cpu=True is_mutable=True>,
 <pyarrow.Buffer address=0x37b24030e00 size=12 is_cpu=True is_mutable=True>]

In [119]:
null_bitmap_buffer, offsets_buffer, data_buffer = arrow_array2.buffers()
# Decode the offsets buffer
# Note: The offsets are likely 32-bit integers (given the 24 byte size for 6 offsets).
offsets = np.frombuffer(offsets_buffer, dtype=np.int32)
print("Offsets:", offsets)


Offsets: [ 0  1  3  4 11 12]


In [None]:
# Decode the data buffer
data = data_buffer.to_pybytes().decode('utf-8')
print("Data:", data)


In [120]:
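# Extract strings using offsets
# Note: the offsets are byte offsets; slicing the decoded str works here only because the data is ASCII
strings = [data[offsets[i]:offsets[i+1]] for i in range(len(offsets) - 1)]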
print("Decoded Strings:", strings)

Decoded Strings: ['a', 'bb', 'c', 'ddddddd', 'e']


### Schema
* Schemas define the structure of tabular Arrow data structures (such as Record Batches and Tables) by specifying the column names and types.



In [113]:
schema = pa.schema([
    ('column1', pa.int64())
])

print(schema)

column1: int64


### Chunked Array 
* A Chunked Array is like an Arrow Array but consists of multiple chunks, each of which is an Arrow Array of the same type.
  * This allows it to hold large datasets that cannot (or need not) be stored in a single contiguous block of memory.
    * A large array is broken down into smaller, more manageable pieces (chunks).
    * Each chunk is a full-fledged Arrow Array, and the Chunked Array is essentially a sequence of such chunks.
* Used, for example:
  * To aggregate incoming data streams where each incoming message is a chunk of data.
    * E.g., sensor data can be sent in chunks, instead of one observation at a time, to optimize throughput.
  * In a distributed computing framework, where multiple nodes process data and produce Arrow Arrays.
    * The master node can collect the data from each worker as a Chunked Array.

* A Chunked Array appears contiguous: it can be indexed and sliced across chunk boundaries as if it were a single array.


In [80]:
results_node_1 = pa.array([0,1,2,3,4])
results_node_2 = pa.array([5,6,7,8,9,10])
chunked_array = pa.chunked_array([results_node_1, results_node_2])
chunked_array

<pyarrow.lib.ChunkedArray object at 0x13fb25b80>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  [
    5,
    6,
    7,
    8,
    9,
    10
  ]
]

In [82]:
chunked_array[5]

<pyarrow.Int64Scalar: 5>

In [84]:
chunked_array[3:6]

<pyarrow.lib.ChunkedArray object at 0x13fb694a0>
[
  [
    3,
    4
  ],
  [
    5
  ]
]

In [92]:
chunked_array.chunk(0)

<pyarrow.lib.Int64Array object at 0x13f9aa280>
[
  0,
  1,
  2,
  3,
  4
]

In [104]:
results_node_3 = pa.array([11,12,13])


In [107]:
chunked_array

<pyarrow.lib.ChunkedArray object at 0x13fb25b80>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  [
    5,
    6,
    7,
    8,
    9,
    10
  ]
]

In [112]:
new_chunked_array= pa.chunked_array([chunked_array.chunk(i) for i in range(chunked_array.num_chunks)] + [results_node_3])
new_chunked_array

<pyarrow.lib.ChunkedArray object at 0x13f9e11d0>
[
  [
    0,
    1,
    2,
    3,
    4
  ],
  [
    5,
    6,
    7,
    8,
    9,
    10
  ],
  [
    11,
    12,
    13
  ]
]

### Table

* A data structure that is composed of multiple columns. 
  * Each column within this table is represented as one or more pyarrow.Array objects of the same type. 
    The purpose of the Table structure is to enable more extensive operations on a larger dataset, as opposed to individual chunks like RecordBatch.

* Homogeneous Columns: Each column consists of Arrow Arrays of a consistent type.
* Schema: Every table has an associated schema that describes the column names and types.
* Chunked Storage: While the table represents a logical sequence of records, its physical representation can be chunked.
  * This allows operations to be performed on smaller chunks, enhancing performance especially for large datasets.


In [99]:
import pyarrow as pa

# Define data arrays for two columns
data1 = pa.array([1, 2, 3, 4, 5])
data2 = pa.array(['a', 'b', 'c', 'd', 'e'])

# Define the schema for the table
schema = pa.schema([
    ('my_ints_column', pa.int32()),
    ('my_chars_column', pa.string())
])

table = pa.Table.from_arrays([data1, data2], schema=schema)
table

pyarrow.Table
my_ints_column: int32
my_chars_column: string
----
my_ints_column: [[1,2,3,4,5]]
my_chars_column: [["a","b","c","d","e"]]

In [100]:
table["my_ints_column"]

<pyarrow.lib.ChunkedArray object at 0x13f9ad770>
[
  [
    1,
    2,
    3,
    4,
    5
  ]
]

In [103]:
# sliced_table = table.slice(0, 3)
sliced_table = table[0:3]

sliced_table

pyarrow.Table
my_ints_column: int32
my_chars_column: string
----
my_ints_column: [[1,2,3]]
my_chars_column: [["a","b","c"]]

In [61]:
# Convert record batches into tables
table1 = pa.Table.from_batches([record_batch])
table2 = pa.Table.from_batches([record_batch])

# Concatenate tables
final_table = pa.concat_tables([table1, table2])

print(final_table)
Directly using the pa.table constructor: If you already have the data in the form of Arrow arrays, you can create a table by specifying the data and the schema directly.


pyarrow.RecordBatch
column1: int64
column2: string

In [60]:
table = pa.table([record_batch, record_batch])
table

ValueError: Must pass names or schema when constructing Table or RecordBatch.

### Record Batch

* A Record Batch is a collection of Arrow Arrays (columns) with the same length. 
  * A single "row chunk" of a table, where each array in the batch is a column.

* A Record Batch relates to a Table the way an Array relates to a Chunked Array: a Table can be made up of several Record Batches that share a schema.

  * The choice between a Record Batch and a Table often comes down to your specific use case and the constraints you're operating under. One reason to prefer a Record Batch:
  
  * Streaming data: when you need to process each chunk as it arrives, Record Batches are a good choice.
    * You can serialize and process each Record Batch independently as it arrives.





In [49]:
import pyarrow as pa

# Create two Arrow Arrays of the same length
arrow_array1 = pa.array([1, 2, 3, 4, 5])
arrow_array2 = pa.array(['a', 'bb', 'c', 'ddddddd', 'e'])

# Create a schema
schema = pa.schema([
    ('column1', pa.int64()),
    ('column2', pa.string())
])

# Create a record batch using a list of arrays
record_batch = pa.record_batch([arrow_array1, arrow_array2], schema=schema)

print(record_batch)


record_batch.schema


record_batch['column1']


record_batch["column2"][2]


# record_batch.column("column2")
record_batch["column2"][2:4]

pyarrow.RecordBatch
column1: int64
column2: string


In [1]:
#!wget -P https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_2022-06.parquet

In [130]:
print_mem()

8.470672607421875 gigabytes


#### In what follows, we play with some real data.
The following cells show that:
1. We can easily read a Parquet file, and
2. The resulting table is substantially smaller in memory than the equivalent pandas object.

In [131]:
import pyarrow.parquet as  pq
table = pq.read_table('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')
table

pyarrow.Table
hvfhs_license_num: string
dispatching_base_num: string
originating_base_num: string
request_datetime: timestamp[us]
on_scene_datetime: timestamp[us]
pickup_datetime: timestamp[us]
dropoff_datetime: timestamp[us]
PULocationID: int64
DOLocationID: int64
trip_miles: double
trip_time: int64
base_passenger_fare: double
tolls: double
bcf: double
sales_tax: double
congestion_surcharge: double
airport_fee: double
tips: double
driver_pay: double
shared_request_flag: string
shared_match_flag: string
access_a_ride_flag: string
wav_request_flag: string
wav_match_flag: string
----
hvfhs_license_num: [["HV0003","HV0003","HV0003","HV0003","HV0005",...,"HV0005","HV0003","HV0005","HV0005","HV0005"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],...,["HV0005","HV0005","HV0005","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0003"],["HV0003","HV0005","HV0003","HV0003","HV0003",...,"HV0003","HV0003","HV0003","HV0003","HV0005"]]
di

In [132]:
print_mem()

6.8361968994140625 gigabytes


In [139]:
import sys

sys.getsizeof(table) / 1024 / 1024 / 1024

3.03908724244684

In [138]:
import pandas as pd

df = pd.read_parquet('/Users/mahdi/Downloads/fhvhv_tripdata_2022-06.parquet')
sys.getsizeof(df) / 1024 / 1024 / 1024

9.87973692920059

### Datasets
* Datasets are an abstraction for working with large tabular data, potentially bigger than memory and distributed across multiple files.
* Datasets provide lazy access to the data, avoiding the need to load it all into memory at once.
* In most cases, the Acero compute engine accepts Datasets in place of Tables.

Datasets provide:

* A unified interface for different sources, like Parquet and Feather.
* Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
* Optimized reading with predicate pushdown (filtering rows), projection (selecting columns), parallel reading, or fine-grained management of tasks.

In [34]:
import pyarrow.dataset as ds

### New York City Taxi and Limousine Commission (TLC) Trip Record Data

Source: https://registry.opendata.aws/nyc-tlc-trip-records-pds/

* Tags: cities, transportation, urban
* Description: Data of trips taken by taxis and for-hire vehicles in New York City. Note: access to this dataset is free, however direct S3 access does require an AWS account. Anonymous downloads are accessible from the dataset's documentation webpage listed below.
* Update frequency: As soon as new data is available to be shared publicly.
* License: http://www1.nyc.gov/home/terms-of-use.page
* Documentation: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
* Managed by: City of New York Taxi and Limousine Commission
* Contact: research@tlc.nyc.gov
* How to cite: New York City Taxi and Limousine Commission (TLC) Trip Record Data was accessed on DATE from https://registry.opendata.aws/nyc-tlc-trip-records-pds.

In [None]:
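In the `aws s3 ls` output below, `PRE` stands for "prefix". S3 has no real directories, so keys that share a common prefix (here `2009/`, `2010/`, and so on) are listed as `PRE` entries, which behave like folders. The bucket is organized by year and, within each year, by month.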
(base) ➜  ~ aws s3 ls "s3://ursa-labs-taxi-data/"
                           PRE 2009/
                           PRE 2010/
                           PRE 2011/
                           PRE 2012/
                           PRE 2013/
                           PRE 2014/
                           PRE 2015/
                           PRE 2016/
                           PRE 2017/
                           PRE 2018/
                           PRE 2019/
                            
                            

(base) ➜  ~ aws s3 ls "s3://ursa-labs-taxi-data/2009/"
                           PRE 01/
                           PRE 02/
                           PRE 03/
                           PRE 04/
                           PRE 05/
                           PRE 06/
                           PRE 07/
                           PRE 08/
                           PRE 09/
                           PRE 10/
                           PRE 11/
                           PRE 12/

In [1]:
import pyarrow.dataset as ds


In [2]:
dataset = ds.dataset("s3://ursa-labs-taxi-data/", partitioning=["year", "month"])
dataset

<pyarrow._dataset.FileSystemDataset at 0x120a1a5e0>

In [157]:
dataset.files[0:3]

['ursa-labs-taxi-data/2009/01/data.parquet',
 'ursa-labs-taxi-data/2009/02/data.parquet',
 'ursa-labs-taxi-data/2009/03/data.parquet']

In [154]:
dataset.count_rows()

1547741381

In [158]:
dir(dataset)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_get_fragments',
 '_scan_options',
 '_scanner_options',
 'count_rows',
 'files',
 'filesystem',
 'filter',
 'format',
 'from_paths',
 'get_fragments',
 'head',
 'join',
 'partition_expression',
 'partitioning',
 'replace_schema',
 'scanner',
 'schema',
 'sort_by',
 'take',
 'to_batches',
 'to_table']

In [24]:
len(dataset.files)

125

In [None]:
What is the dataset made of?

In [34]:
len(list(dataset.get_fragments()))


125

In [129]:
frag = next(dataset.get_fragments())


<pyarrow.compute.Expression ((year == 2009) and (month == 1))>

In [131]:
frag.partition_expression

<pyarrow.compute.Expression ((year == 2009) and (month == 1))>

<pyarrow.compute.Expression is_valid(((year == 2009) and (month == 1)))>

In [132]:
frag.partition_expression['year']

TypeError: 'pyarrow._compute.Expression' object is not subscriptable

In [37]:
frag_table = frag.to_table()

In [40]:
frag_table.column_names

['vendor_id',
 'pickup_at',
 'dropoff_at',
 'passenger_count',
 'trip_distance',
 'pickup_longitude',
 'pickup_latitude',
 'rate_code_id',
 'store_and_fwd_flag',
 'dropoff_longitude',
 'dropoff_latitude',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'total_amount']

In [41]:
frag_table.num_rows

14092413

In [126]:
frag_table.partition_expression

AttributeError: 'pyarrow.lib.Table' object has no attribute 'partition_expression'

In [42]:
frag_table

pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["VTS","VTS","VTS","DDS","DDS",...,"DDS","CMT","CMT","CMT","CMT"],["CMT","DDS","DDS","CMT","DDS",...,"CMT","CMT","CMT","CMT","CMT"],...,["CMT","CMT","DDS","CMT","CMT",...,"VTS","CMT","VTS","VTS","VTS"],["VTS","VTS","VTS","VTS","VTS",...,"VTS","VTS","CMT","VTS","CMT"]]
pickup_at: [[2009-01-04 02:52:00.000000,2009-01-04 03:31:00.000000,2009-01-03 15:43:00.000000,2009-01-01 20:52:58.000000,2009-01-24 16:18:23.000000,...,2009-01-01 22:42:49.000000,2009-01-04 18:27:32.000000,2009-01-04 11:48:33.000000,2009-01-04 23:21:04.000000,2009-01-04 16:11:27.000000],[2009-01-04 21:54:44.000000,2009

In [47]:
frag_table['vendor_id'].num_chunks

216

In [48]:
[frag_table[col_name].num_chunks for col_name in frag_table.column_names]

[216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216,
 216]

In [57]:
type(frag_table)

pyarrow.lib.Table

In [73]:
frag_table.slice(0, 5)

pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["VTS","VTS","VTS","DDS","DDS"]]
pickup_at: [[2009-01-04 02:52:00.000000,2009-01-04 03:31:00.000000,2009-01-03 15:43:00.000000,2009-01-01 20:52:58.000000,2009-01-24 16:18:23.000000]]
dropoff_at: [[2009-01-04 03:02:00.000000,2009-01-04 03:38:00.000000,2009-01-03 15:57:00.000000,2009-01-01 21:14:00.000000,2009-01-24 16:24:56.000000]]
passenger_count: [[1,3,5,1,1]]
trip_distance: [[2.63,4.55,10.35,5,0.4]]
pickup_longitude: [[-73.99196,-73.9821,-74.00259,-73.974266,-74.00158]]
pickup_latitude: [[40.721565,40.73629,40.739746,40.790955,40.719383]]
rate_code_id: [5 nulls]
store_and_fwd_fla

`rate_code_id` is a field in the New York City Taxi and Limousine Commission (NYC-TLC) dataset that indicates the rate code in effect for the trip, which categorizes the fare rate applied. For example:

* 2: JFK (fixed fare to/from JFK Airport)

In this 2009-01 fragment the column has the Arrow type `null`, meaning that every value is missing.

In [90]:
record_batch_3 = frag_table.to_batches()[3]
record_batch_3

pyarrow.RecordBatch
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float

In [89]:
record_batch_3['payment_type']

<pyarrow.lib.StringArray object at 0x13020ab80>
[
  "CASH",
  "CASH",
  "Cash",
  "Cash",
  "Credit",
  "Credit",
  "Credit",
  "CASH",
  "CASH",
  "Cash",
  ...
  "CASH",
  "CASH",
  "Cash",
  "Credit",
  "CASH",
  "CASH",
  "Credit",
  "CASH",
  "CASH",
  "CASH"
]

In [86]:
record_batch_3['fare_amount']

<pyarrow.lib.FloatArray object at 0x1302c5880>
[
  4.9,
  10.5,
  4.2,
  8.2,
  3.8,
  17.8,
  9.8,
  6.9,
  3.7,
  10.5,
  ...
  45,
  6.9,
  6.2,
  25.3,
  5.7,
  25.3,
  5.3,
  24.1,
  6.9,
  22.1
]

In [96]:
record_batch_3['tip_amount']

<pyarrow.lib.FloatArray object at 0x1302fb520>
[
  0,
  0,
  0,
  0,
  0.76,
  2.67,
  2,
  0,
  0,
  0,
  ...
  0,
  0,
  0,
  5.06,
  0,
  0,
  1,
  0,
  0,
  0
]

In [None]:
### Computation in PyArrow
* PyArrow separates the representation of data from the operations performed on that data.

  * Data structures like Arrow Arrays, Record Batches, and Tables are primarily concerned with the efficient storage and serialization of data.
  * They provide the backbone for data storage but don't themselves offer a rich set of operations on that data.
* Their main strength lies in the standardized memory layout, cross-language interoperability, and I/O operations.

* The `pyarrow.compute` module, on the other hand, is specifically designed to perform computations on Arrow data structures.

* It includes a wide variety of functions for tasks like filtering, transforming, and aggregating data.
  * It provides a set of fundamental operations, such as sorting, filtering, and scalar transformations, but isn't as extensive as dedicated analytical libraries like Pandas.
  * For complex data manipulations or analytics, one might still rely on Pandas, Spark, or other tools.


    
    

In [92]:
import pyarrow.compute as pc

In [97]:
pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount'])

<pyarrow.lib.FloatArray object at 0x1302fb0a0>
[
  4.9,
  10.5,
  4.2,
  8.2,
  4.56,
  20.47,
  11.8,
  6.9,
  3.7,
  10.5,
  ...
  45,
  6.9,
  6.2,
  30.359999,
  5.7,
  25.3,
  6.3,
  24.1,
  6.9,
  22.1
]

In [100]:
pc.max(pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount']))

<pyarrow.FloatScalar: 164.0>

In [102]:
pc.mean(pc.add(record_batch_3['tip_amount'], record_batch_3['fare_amount']))

<pyarrow.DoubleScalar: 10.015554052642983>

In [106]:
record_batch_3["payment_type"]

<pyarrow.lib.StringArray object at 0x1306559a0>
[
  "CASH",
  "CASH",
  "Cash",
  "Cash",
  "Credit",
  "Credit",
  "Credit",
  "CASH",
  "CASH",
  "Cash",
  ...
  "CASH",
  "CASH",
  "Cash",
  "Credit",
  "CASH",
  "CASH",
  "Credit",
  "CASH",
  "CASH",
  "CASH"
]

In [108]:
pc.utf8_upper(record_batch_3["payment_type"])

<pyarrow.lib.StringArray object at 0x130499b20>
[
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CREDIT",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH",
  ...
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH"
]

In [112]:
upper_cased_payment_type = pc.utf8_upper(record_batch_3["payment_type"])
upper_cased_payment_type

<pyarrow.lib.StringArray object at 0x130495e80>
[
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CREDIT",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH",
  ...
  "CASH",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CREDIT",
  "CASH",
  "CASH",
  "CASH"
]

In [109]:
pc.utf8_upper(record_batch_3["payment_type"])[0]

<pyarrow.StringScalar: 'CASH'>

In [114]:
is_cash = pc.equal(upper_cased_payment_type, pa.scalar('CASH'))
is_cash

<pyarrow.lib.BooleanArray object at 0x130499b80>
[
  true,
  true,
  true,
  true,
  false,
  false,
  false,
  true,
  true,
  true,
  ...
  true,
  true,
  true,
  false,
  true,
  true,
  false,
  true,
  true,
  true
]

In [121]:
pc.filter(upper_cased_payment_type, is_cash)

<pyarrow.lib.StringArray object at 0x130495f40>
[
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  ...
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH",
  "CASH"
]

In [122]:
len(pc.filter(upper_cased_payment_type, is_cash))

51341

In [120]:
filtered_record_batch_3 = pc.filter(record_batch_3, is_cash)
filtered_record_batch_3.num_rows

51341

In [160]:
import pyarrow as pa
import pyarrow.parquet as pq


dataset = pq.ParquetDataset('s3://ursa-labs-taxi-data/2009/', partitioning=["month"])
dataset

<pyarrow.parquet.core._ParquetDatasetV2 at 0x130d09190>

In [167]:
dataset.fragments[0]


<pyarrow.dataset.ParquetFileFragment path=ursa-labs-taxi-data/2009/01/data.parquet partition=[month=1]>

In [175]:
dataset.fragments[0].metadata

<pyarrow._parquet.FileMetaData object at 0x130bc9e50>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 18
  num_rows: 14092413
  num_row_groups: 216
  format_version: 1.0
  serialized_size: 324078

In [168]:
dataset.fragments[0].num_row_groups

216

In [None]:
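The fragment's Parquet metadata reports 216 row groups, which matches the 216 chunks per column that we found earlier after calling `frag.to_table()`. It appears that PyArrow turned each Parquet row group into one chunk of the resulting table, which shows how closely PyArrow and Parquet are integrated.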

In [156]:
row_group = dataset.fragments[0].row_groups[0]
row_group.metadata

<pyarrow._parquet.RowGroupMetaData object at 0x13016def0>
  num_columns: 18
  num_rows: 65536
  total_byte_size: 1972220

In [170]:
row_group.statistics

{'vendor_id': {'min': 'CMT', 'max': 'VTS'},
 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 10, 51),
  'max': datetime.datetime(2009, 1, 31, 23, 59)},
 'dropoff_at': {'min': datetime.datetime(2009, 1, 1, 0, 17, 38),
  'max': datetime.datetime(2009, 2, 1, 17, 53)},
 'passenger_count': {'min': 0, 'max': 6},
 'trip_distance': {'min': 0.0, 'max': 46.5},
 'pickup_longitude': {'min': -76.11469268798828, 'max': 0.005876999814063311},
 'pickup_latitude': {'min': -0.00596500001847744, 'max': 81.53500366210938},
 'store_and_fwd_flag': {'min': '0', 'max': '0'},
 'dropoff_longitude': {'min': -76.09280395507812, 'max': 0.004495000001043081},
 'dropoff_latitude': {'min': -0.005026999861001968, 'max': 81.53500366210938},
 'payment_type': {'min': 'CASH', 'max': 'No Charge'},
 'fare_amount': {'min': 2.5, 'max': 159.3000030517578},
 'extra': {'min': 0.0, 'max': 1.0},
 'tip_amount': {'min': 0.0, 'max': 55.0},
 'tolls_amount': {'min': 0.0, 'max': 16.0},
 'total_amount': {'min': 2.5, 'max': 167.8000

In [218]:
 dataset.fragments[0]

<pyarrow.dataset.ParquetFileFragment path=ursa-labs-taxi-data/2009/01/data.parquet partition=[month=1]>

In [183]:
for rg in dataset.fragments[0].row_groups[0:5]:
    print(rg.statistics['pickup_at'])

{'min': datetime.datetime(2009, 1, 1, 0, 10, 51), 'max': datetime.datetime(2009, 1, 31, 23, 59)}
{'min': datetime.datetime(2009, 1, 1, 0, 3, 40), 'max': datetime.datetime(2009, 1, 31, 23, 59, 41)}
{'min': datetime.datetime(2009, 1, 1, 0, 1, 19), 'max': datetime.datetime(2009, 1, 31, 23, 58)}
{'min': datetime.datetime(2009, 1, 1, 0, 3, 40), 'max': datetime.datetime(2009, 1, 31, 23, 59, 49)}
{'min': datetime.datetime(2009, 1, 1, 0, 5, 2), 'max': datetime.datetime(2009, 1, 31, 23, 59, 52)}


In [187]:
# Get the indices that sort the table by 'pickup_at', then by 'fare_amount'
data_table = dataset.fragments[0].to_table()
sorted_indices = pc.sort_indices(data_table, sort_keys=[("pickup_at", "ascending"), ("fare_amount", "ascending")])
sorted_indices

<pyarrow.lib.UInt64Array object at 0x1302e3520>
[
  11489987,
  3964040,
  543513,
  8582999,
  11812099,
  3708729,
  10177659,
  12142978,
  4811616,
  5665566,
  ...
  10604876,
  10079366,
  1839956,
  4631967,
  8528489,
  347071,
  3174063,
  6071930,
  1328472,
  7684378
]

In [207]:
sorted_table = data_table.take(sorted_indices)
sorted_table

pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["CMT","CMT","CMT","CMT","CMT",...,"CMT","CMT","CMT","CMT","CMT"]]
pickup_at: [[2009-01-01 00:00:00.000000,2009-01-01 00:00:00.000000,2009-01-01 00:00:02.000000,2009-01-01 00:00:04.000000,2009-01-01 00:00:07.000000,...,2009-01-31 23:59:59.000000,2009-01-31 23:59:59.000000,2009-01-31 23:59:59.000000,2009-01-31 23:59:59.000000,2009-01-31 23:59:59.000000]]
dropoff_at: [[2009-01-01 00:05:03.000000,2009-01-01 00:04:12.000000,2009-01-01 00:05:40.000000,2009-01-01 00:03:08.000000,2009-01-01 00:19:01.000000,...,2009-02-01 00:07:04.000000,2009-02-01 00:06:17.000000,2009-02-01 00:11:54.000000

In [199]:
pq.write_table(sorted_table, 'optimized_parquet_file.parquet', row_group_size=65536)

In [200]:
optimized_parquet_file = pq.ParquetFile('optimized_parquet_file.parquet')

In [209]:
optimized_parquet_file.schema


<pyarrow._parquet.ParquetSchema object at 0x130edba40>
required group field_id=-1 schema {
  optional binary field_id=-1 vendor_id (String);
  optional int64 field_id=-1 pickup_at (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int64 field_id=-1 dropoff_at (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
  optional int32 field_id=-1 passenger_count (Int(bitWidth=8, isSigned=true));
  optional float field_id=-1 trip_distance;
  optional float field_id=-1 pickup_longitude;
  optional float field_id=-1 pickup_latitude;
  optional int32 field_id=-1 rate_code_id (Null);
  optional binary field_id=-1 store_and_fwd_flag (String);
  optional float field_id=-1 dropoff_longitude;
  optional float field_id=-1 dropoff_latitude;
  optional binary field_id=-1 payment_type (String);
  optional float field_id=-1 fare_amount;
  optional float field_

In [216]:
rg0_metadata = optimized_parquet_file.metadata.row_group(0)
rg0_metadata

<pyarrow._parquet.RowGroupMetaData object at 0x130bee0e0>
  num_columns: 18
  num_rows: 65536
  total_byte_size: 1645654

In [233]:
optimized_parquet_file.metadata.row_group(0).to_dict()

{'num_columns': 18,
 'num_rows': 65536,
 'total_byte_size': 1645654,
 'columns': [{'file_offset': 8548,
   'file_path': '',
   'physical_type': 'BYTE_ARRAY',
   'num_values': 65536,
   'path_in_schema': 'vendor_id',
   'is_stats_set': True,
   'statistics': {'has_min_max': True,
    'min': 'CMT',
    'max': 'DDS',
    'null_count': 0,
    'distinct_count': 0,
    'num_values': 65536,
    'physical_type': 'BYTE_ARRAY'},
   'compression': 'SNAPPY',
   'encodings': ('RLE_DICTIONARY', 'PLAIN', 'RLE'),
   'has_dictionary_page': True,
   'dictionary_page_offset': 4,
   'data_page_offset': 34,
   'total_compressed_size': 8544,
   'total_uncompressed_size': 9856},
  {'file_offset': 196427,
   'file_path': '',
   'physical_type': 'INT64',
   'num_values': 65536,
   'path_in_schema': 'pickup_at',
   'is_stats_set': True,
   'statistics': {'has_min_max': True,
    'min': datetime.datetime(2009, 1, 1, 0, 0),
    'max': datetime.datetime(2009, 1, 1, 4, 22, 17),
    'null_count': 0,
    'distinct_co

In [235]:
optimized_parquet_file.metadata.row_group(0).column(0).statistics

<pyarrow._parquet.Statistics object at 0x121193e00>
  has_min_max: True
  min: CMT
  max: DDS
  null_count: 0
  distinct_count: 0
  num_values: 65536
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

In [246]:
schema = optimized_parquet_file.schema.to_arrow_schema()
schema.names


['vendor_id',
 'pickup_at',
 'dropoff_at',
 'passenger_count',
 'trip_distance',
 'pickup_longitude',
 'pickup_latitude',
 'rate_code_id',
 'store_and_fwd_flag',
 'dropoff_longitude',
 'dropoff_latitude',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'total_amount']

In [247]:
name_2_pos = {name: pos for pos, name in enumerate(schema.names)}
name_2_pos


{'vendor_id': 0,
 'pickup_at': 1,
 'dropoff_at': 2,
 'passenger_count': 3,
 'trip_distance': 4,
 'pickup_longitude': 5,
 'pickup_latitude': 6,
 'rate_code_id': 7,
 'store_and_fwd_flag': 8,
 'dropoff_longitude': 9,
 'dropoff_latitude': 10,
 'payment_type': 11,
 'fare_amount': 12,
 'extra': 13,
 'mta_tax': 14,
 'tip_amount': 15,
 'tolls_amount': 16,
 'total_amount': 17}

In [249]:
optimized_parquet_file.metadata.row_group(0).column(name_2_pos['pickup_at']).statistics

<pyarrow._parquet.Statistics object at 0x131346450>
  has_min_max: True
  min: 2009-01-01 00:00:00
  max: 2009-01-01 04:22:17
  null_count: 0
  distinct_count: 0
  num_values: 65536
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [250]:
optimized_parquet_file.metadata.row_group(1).column(name_2_pos['pickup_at']).statistics

<pyarrow._parquet.Statistics object at 0x1301ae2c0>
  has_min_max: True
  min: 2009-01-01 04:22:17
  max: 2009-01-01 13:09:17
  null_count: 0
  distinct_count: 0
  num_values: 65536
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [251]:
optimized_parquet_file.metadata.row_group(2).column(name_2_pos['pickup_at']).statistics

<pyarrow._parquet.Statistics object at 0x1312351d0>
  has_min_max: True
  min: 2009-01-01 13:09:17
  max: 2009-01-01 16:33:48
  null_count: 0
  distinct_count: 0
  num_values: 65536
  physical_type: INT64
  logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
  converted_type (legacy): NONE

In [252]:
optimized_parquet_file.metadata.row_group(2).column(name_2_pos['pickup_at']).statistics.min

datetime.datetime(2009, 1, 1, 13, 9, 17)

In [253]:
optimized_parquet_file.metadata.row_group(2).column(name_2_pos['pickup_at']).statistics.max

datetime.datetime(2009, 1, 1, 16, 33, 48)

In [254]:
from datetime import datetime

In [255]:
datetime_obj = datetime.strptime("2009-1-1 14:00:00", "%Y-%m-%d %H:%M:%S")


In [256]:
sample_row_group_metadate = optimized_parquet_file.metadata.row_group(2).column(name_2_pos['pickup_at'])
sample_row_group_metadate

<pyarrow._parquet.ColumnChunkMetaData object at 0x130819950>
  file_offset: 3278253
  file_path: 
  physical_type: INT64
  num_values: 65536
  path_in_schema: pickup_at
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x131235900>
      has_min_max: True
      min: 2009-01-01 13:09:17
      max: 2009-01-01 16:33:48
      null_count: 0
      distinct_count: 0
      num_values: 65536
      physical_type: INT64
      logical_type: Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false)
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 3160424
  data_page_offset: 3228650
  total_compressed_size: 117829
  total_uncompressed_size: 146111

In [258]:
sample_row_group_metadate.statistics.min <= datetime_obj <= sample_row_group_metadate.statistics.max

True

In [260]:

col_idx = name_2_pos['pickup_at']

datetime_obj = datetime.strptime("2009-1-1 14:00:00", "%Y-%m-%d %H:%M:%S")

# Scan the pickup_at min/max statistics of every row group and report
# each row group whose range contains the target timestamp.
for i in range(optimized_parquet_file.num_row_groups):
    col_stats = optimized_parquet_file.metadata.row_group(i).column(col_idx).statistics
    if col_stats.min <= datetime_obj <= col_stats.max:
        print(f"found it, it's row_group {i}")

found it, it's row_group 2
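
The single-timestamp lookup generalizes to a time window: a row group is relevant whenever its `[min, max]` range intersects the window. The sketch below is a plain-Python illustration of that overlap test. The helper name `overlapping_row_groups` and the `bounds` list are hypothetical (not part of PyArrow); the bounds are copied by hand from the `pickup_at` statistics printed for row groups 0 to 2.

```python
from datetime import datetime


def overlapping_row_groups(bounds, start, end):
    """Return indices of row groups whose [min, max] range intersects [start, end].

    `bounds` is a list of (min, max) tuples, one per row group (hypothetical input).
    """
    return [i for i, (lo, hi) in enumerate(bounds) if lo <= end and hi >= start]


# pickup_at (min, max) per row group, copied from the statistics output above
bounds = [
    (datetime(2009, 1, 1, 0, 0, 0), datetime(2009, 1, 1, 4, 22, 17)),
    (datetime(2009, 1, 1, 4, 22, 17), datetime(2009, 1, 1, 13, 9, 17)),
    (datetime(2009, 1, 1, 13, 9, 17), datetime(2009, 1, 1, 16, 33, 48)),
]

# Window of interest: 2:00 PM to 2:59:59 PM on 2009-01-01
start = datetime(2009, 1, 1, 14, 0, 0)
end = datetime(2009, 1, 1, 14, 59, 59)

hits = overlapping_row_groups(bounds, start, end)
print(hits)
```

Adjacent row groups share a boundary value here (row group 0 ends at 04:22:17 and row group 1 starts at 04:22:17), so a timestamp on a boundary can match two row groups. With the real file, the bounds could presumably be built from `metadata.row_group(i).column(col_idx).statistics`, and the matching groups read in one call with `optimized_parquet_file.read_row_groups(hits)`.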


In [261]:
optimized_parquet_file.read_row_group(2)

pyarrow.Table
vendor_id: string
pickup_at: timestamp[us]
dropoff_at: timestamp[us]
passenger_count: int8
trip_distance: float
pickup_longitude: float
pickup_latitude: float
rate_code_id: null
store_and_fwd_flag: string
dropoff_longitude: float
dropoff_latitude: float
payment_type: string
fare_amount: float
extra: float
mta_tax: float
tip_amount: float
tolls_amount: float
total_amount: float
----
vendor_id: [["CMT","CMT","CMT","CMT","DDS",...,"CMT","CMT","CMT","CMT","CMT"]]
pickup_at: [[2009-01-01 13:09:17.000000,2009-01-01 13:09:17.000000,2009-01-01 13:09:18.000000,2009-01-01 13:09:18.000000,2009-01-01 13:09:18.000000,...,2009-01-01 16:33:46.000000,2009-01-01 16:33:47.000000,2009-01-01 16:33:47.000000,2009-01-01 16:33:48.000000,2009-01-01 16:33:48.000000]]
dropoff_at: [[2009-01-01 13:15:10.000000,2009-01-01 13:18:08.000000,2009-01-01 13:13:46.000000,2009-01-01 13:13:55.000000,2009-01-01 13:23:13.000000,...,2009-01-01 16:43:29.000000,2009-01-01 16:39:27.000000,2009-01-01 16:43:35.000000

In [None]:
Can you get the average transaction between 2:00 and 2:59 PM?

In [None]:
Which day, on average, has the highest tip?

In [None]:
Which time of the day has the highest tip?