<a href="https://colab.research.google.com/github/revendrat/Big-Data-Analytics/blob/main/04_Creating_Arrow_Objects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Create the following objects in Arrow
* Arrays
* Tables
* Table from Plain Types
* Record Batches
* Store Categorical Data

## Creating Arrays
* Arrow stores data in contiguous arrays optimised for memory footprint and SIMD analyses.
* In Python it's possible to build a pyarrow.Array from Python lists (or sequence types in general), numpy arrays and pandas Series.
* Create an array using pyarrow.array()

In [1]:
import pyarrow as pa

array = pa.array([1, 2, 3, 4, 5])

In [2]:
array

<pyarrow.lib.Int64Array object at 0x7f17d4e0f7c0>
[
  1,
  2,
  3,
  4,
  5
]

In [3]:
print(array)

[
  1,
  2,
  3,
  4,
  5
]


## Masking for null values
* pyarrow.array() accepts a mask that specifies which values should be treated as null: a True entry in the mask marks the value at that position as null.
Illustration below:

In [4]:
import numpy as np

array_masked = pa.array([1, 2, 3, 4, 5],
                        mask=np.array([True, False, True, False, True]))
array_masked

<pyarrow.lib.Int64Array object at 0x7f17d3bae2f0>
[
  null,
  2,
  null,
  4,
  null
]

In [5]:
print(array_masked)

[
  null,
  2,
  null,
  4,
  null
]


## Numpy & Pandas Integration
* Arrow provides optimised conversion paths for building arrays from NumPy arrays and pandas Series <br/>
Illustration below: 

In [6]:
import numpy as np
import pandas as pd

array_from_numpy = pa.array(np.arange(5))
array_from_pandas = pa.array(pd.Series([1, 2, 3, 4, 5]))

In [7]:
# Verify array created from numpy
array_from_numpy

<pyarrow.lib.Int64Array object at 0x7f17d4e0f980>
[
  0,
  1,
  2,
  3,
  4
]

In [8]:
# Verify array created from pandas
array_from_pandas

<pyarrow.lib.Int64Array object at 0x7f17d3b45f30>
[
  1,
  2,
  3,
  4,
  5
]

## Creating Tables
* Tabular data in Arrow takes the form of a pyarrow.Table
* Each column is represented by a pyarrow.ChunkedArray
* Tables can be created by pairing multiple arrays with names for their columns

In [9]:
table_random = pa.table([
    pa.array([111, 112, 113, 114, 115]),
    pa.array(["apple", "bean", "car", "dog", "egg"]),
    pa.array([21.5, 15.0, 1000000.0, 40000.0, 10.0])
], names=["number", "something", "price"])

table_random

pyarrow.Table
number: int64
something: string
price: double
----
number: [[111,112,113,114,115]]
something: [["apple","bean","car","dog","egg"]]
price: [[21.5,15,1000000,40000,10]]

In [10]:
print(table_random)

pyarrow.Table
number: int64
something: string
price: double
----
number: [[111,112,113,114,115]]
something: [["apple","bean","car","dog","egg"]]
price: [[21.5,15,1000000,40000,10]]


## Create Table from Plain Python Types
* Besides the NumPy and pandas paths, Arrow Arrays and Tables can be created from plain Python data structures such as lists and dictionaries. The Python objects are converted into Arrow memory, so this path copies the data.
* The pyarrow.table() function allows creation of Tables from a variety of inputs, including plain Python objects <br/>
Illustration below:

In [11]:
table_plain = pa.table({
    "col1": [1, 2, 3, 4, 5],
    "col2": ["a", "b", "c", "d", "e"]
})

table_plain

pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]

In [12]:
print(table_plain)

pyarrow.Table
col1: int64
col2: string
----
col1: [[1,2,3,4,5]]
col2: [["a","b","c","d","e"]]


## Creating Record Batches
* In Arrow, a RecordBatch can be seen as a slice of a table.
* Most I/O operations in Arrow happen when shipping batches of data to their destination.
* A pyarrow.RecordBatch holds a batch of rows as a collection of columns (arrays) of equal length.


In [13]:
batch = pa.RecordBatch.from_arrays([
    pa.array([1, 3, 5, 7, 9]),
    pa.array([2, 4, 6, 8, 10])
], names=["odd", "even"])

In [14]:
batch

pyarrow.RecordBatch
odd: int64
even: int64

## Combine multiple batches into a table using pyarrow.Table.from_batches()



In [15]:
second_batch = pa.RecordBatch.from_arrays([
    pa.array([11, 13, 15, 17, 19]),
    pa.array([12, 14, 16, 18, 20])
], names=["odd", "even"])

batch_table = pa.Table.from_batches([batch, second_batch])
batch_table

pyarrow.Table
odd: int64
even: int64
----
odd: [[1,3,5,7,9],[11,13,15,17,19]]
even: [[2,4,6,8,10],[12,14,16,18,20]]

### pyarrow.Table can be converted to a list of pyarrow.RecordBatch using the pyarrow.Table.to_batches() method

In [16]:
record_batches = batch_table.to_batches(max_chunksize=5)
print(len(record_batches))

2


In [17]:
record_batches

[pyarrow.RecordBatch
 odd: int64
 even: int64, pyarrow.RecordBatch
 odd: int64
 even: int64]

In [18]:
print(record_batches)

[pyarrow.RecordBatch
odd: int64
even: int64, pyarrow.RecordBatch
odd: int64
even: int64]


## Store Categorical Data
* Arrow provides the pyarrow.DictionaryArray type to represent categorical data
* Each distinct category is stored once in a dictionary and the array holds only integer indices into it. This avoids repeating the categories over and over and reduces memory use when the values are large (such as text).
* Use pyarrow.Array.dictionary_encode() to convert an array containing repeated categorical data into a pyarrow.DictionaryArray

In [19]:
arr = pa.array(["red", "green", "blue", "blue", "green", "red"])

categorical = arr.dictionary_encode()
print(categorical)


-- dictionary:
  [
    "red",
    "green",
    "blue"
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    1,
    0
  ]


### If you already know the categories, skip the encode step and create the DictionaryArray directly using pyarrow.DictionaryArray.from_arrays()

In [20]:
categorical = pa.DictionaryArray.from_arrays(
    indices=[0, 1, 2, 2, 1, 0],
    dictionary=["red", "green", "blue"]
)
print(categorical)


-- dictionary:
  [
    "red",
    "green",
    "blue"
  ]
-- indices:
  [
    0,
    1,
    2,
    2,
    1,
    0
  ]
