<a href="https://colab.research.google.com/github/revendrat/Big-Data-Analytics/blob/main/Apache_Arrow_Installation_%26_Basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the Apache Arrow Python library (`pyarrow`) on Colab with the command below.

In [None]:
pip install pyarrow==6.0.*

Collecting pyarrow==6.0.*
  Downloading pyarrow-6.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.6 MB)
     |████████████████████████████████| 25.6 MB 1.7 MB/s
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 3.0.0
    Uninstalling pyarrow-3.0.0:
      Successfully uninstalled pyarrow-3.0.0
Successfully installed pyarrow-6.0.1


## Creating Arrays and Tables
 
*   Arrays in Arrow are collections of data of uniform type.
*   Because every element has the same type, Arrow can store the values compactly and run computations on them efficiently.
*   Each array has data and a type.

In [2]:

import pyarrow as pa

# Check the pyarrow version (the module is imported as `pa`)
# pa.__version__

In [3]:
# Create a pyarrow array days
days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
# Create a pyarrow array months
months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
# Create a pyarrow array years
years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())

# From pyarrow arrays days, months & years, create pyarrow table birthdays_table
birthdays_table = pa.table([days, months, years], names=["days", "months", "years"])

In [4]:
# View the pyarrow table birthdays_table
birthdays_table

pyarrow.Table
days: int8
months: int8
years: int16

## View each column using syntax similar to a pandas DataFrame

In [4]:
# view days
birthdays_table['days']

<pyarrow.lib.ChunkedArray object at 0x7f27ff6a4dd0>
[
  [
    1,
    12,
    17,
    23,
    28
  ]
]

In [5]:
# view months
birthdays_table['months']

<pyarrow.lib.ChunkedArray object at 0x7f27ff6506b0>
[
  [
    1,
    3,
    5,
    7,
    1
  ]
]

In [6]:
# view years
birthdays_table['years']

<pyarrow.lib.ChunkedArray object at 0x7f27ff6a4fb0>
[
  [
    1990,
    2000,
    1995,
    2000,
    1995
  ]
]

## Saving and Loading Tables
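Parquet is a columnar file format. Arrow tables can be written to and read from Parquet files with the `pyarrow.parquet` module.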

In [5]:
import pyarrow.parquet as pq

pq.write_table(birthdays_table, 'birthdays.parquet')

* The data in `birthdays_table` is now on disk; load it back with a single call to `pq.read_table`.
* Arrow is optimized for memory use and speed, so loading the data is fast.

In [6]:
reloaded_birthdays = pq.read_table('birthdays.parquet')

reloaded_birthdays

pyarrow.Table
days: int8
months: int8
years: int16

## Computations
* Perform computations with the `pyarrow.compute` module.
* Compute functions apply transformations and aggregations to arrays and table columns.

In [7]:
# Count how many times each year occurs
import pyarrow.compute as pc

pc.value_counts(birthdays_table["years"])

<pyarrow.lib.StructArray object at 0x7f72d2db3440>
-- is_valid: all not null
-- child 0 type: int16
  [
    1990,
    2000,
    1995
  ]
-- child 1 type: int64
  [
    1,
    2,
    2
  ]