### The Arrow Library

* Built on `Arrow`'s columnar in-memory format.

* Column-oriented, so querying and processing slices or whole columns of data is faster than with a row-oriented layout.

* Supports nested column types.

* Allows zero-copy hand-offs to standard machine learning tools such as NumPy, pandas, PyTorch, and TensorFlow.

* Language-agnostic, so the same in-memory data can be used from different programming languages.
 


### Apache Arrow

> A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (including nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.
>
> Source: Apache Arrow Project

### Features 

![Without Arrow](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/img/without_arrow.jpg)

[Picture source](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/)


![With Arrow](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/img/with_arrow.jpg)

[Picture source](https://blog.djnavarro.net/posts/2021-11-19_starting-apache-arrow-in-r/)

In [4]:
pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-9.0.0-cp38-cp38-macosx_11_0_arm64.whl (21.6 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-9.0.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
from pyarrow import csv

In [7]:
%%timeit
fn = '~/Downloads/test.csv'
table = csv.read_csv(fn)

3.12 s ± 550 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
table

pyarrow.Table
location_key: string
date: date32[day]
place_id: string
wikidata_id: string
datacommons_id: string
country_code: string
country_name: string
iso_3166_1_alpha_2: string
iso_3166_1_alpha_3: string
aggregation_level: int64
new_confirmed: int64
new_deceased: int64
cumulative_confirmed: int64
cumulative_deceased: int64
cumulative_tested: int64
new_persons_vaccinated: int64
cumulative_persons_vaccinated: int64
new_persons_fully_vaccinated: int64
cumulative_persons_fully_vaccinated: int64
new_vaccine_doses_administered: int64
cumulative_vaccine_doses_administered: int64
population: int64
population_male: int64
population_female: int64
population_rural: int64
population_urban: int64
population_density: double
human_development_index: double
population_age_00_09: int64
population_age_10_19: int64
population_age_20_29: int64
population_age_30_39: int64
population_age_40_49: int64
population_age_50_59: int64
population_age_60_69: int64
population_age_70_79: int64
population_age_80_a

In [13]:
table["location_key"]


<pyarrow.lib.ChunkedArray object at 0x10c91d540>
[
  [
    "AD",
    "AD",
    "AD",
    "AD",
    "AD",
    ...
    "AE",
    "AE",
    "AE",
    "AE",
    "AE"
  ],
  [
    "AE",
    "AE",
    "AE",
    "AE",
    "AE",
    ...
    "AE",
    "AE",
    "AE",
    "AE",
    "AE"
  ],
...,
  [
    "BR_AP_160015",
    "BR_AP_160015",
    "BR_AP_160015",
    "BR_AP_160015",
    "BR_AP_160015",
    ...
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020"
  ],
  [
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020",
    "BR_AP_160020",
    ...
    "BR_AP_160021",
    "BR_AP_160021",
    "BR_AP_160021",
    "BR_AP_160021",
    "BR_AP_160021"
  ]
]

In [17]:
table.num_rows

999999

In [53]:
subset = list(range(0, 100_000))
data = table.take(subset)

In [54]:
location_key_data = data["location_key"]

In [55]:
type(location_key_data)

pyarrow.lib.ChunkedArray

In [56]:
location_key_data

<pyarrow.lib.ChunkedArray object at 0x10ad6b5e0>
[
  [
    "AD",
    "AD",
    "AD",
    "AD",
    "AD",
    ...
    "AR_B_260",
    "AR_B_260",
    "AR_B_260",
    "AR_B_260",
    "AR_B_260"
  ]
]