# **ORC (`O`ptimized `R`ow `C`olumnar) format**

ORC (Optimized Row Columnar) is a columnar storage file format used primarily in big data processing frameworks like Apache Hive and Apache Spark. It is designed to provide efficient storage, compression, and processing of large-scale structured data.

In Python, you can work with ORC files using libraries such as pyarrow or pyorc, both of which support reading and writing ORC files. This notebook uses pyorc.

## **Deep understanding**

`Source`: medium.com

* **Definition**

An ORC (Optimized Row Columnar) file is a data storage format designed for Hadoop and other big data processing systems. It is a columnar storage format, which means that the data is stored in a way that is optimized for column-based operations like filtering and aggregation.
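
To make the idea of column-based operations concrete, here is a minimal sketch that contrasts a row layout with a column layout using plain Python lists. The sample records and the column names (`id`, `gender`, `amount`) are made up for illustration, and real ORC files use a binary on-disk layout rather than Python objects.

```python
# Toy illustration only: hypothetical records, not data from the ORC demo files.
rows = [
    (1, "M", 500),
    (2, "F", 700),
    (3, "M", 300),
]

# Row-oriented layout: every record is visited in full, even to read one field.
total_row_layout = sum(record[2] for record in rows)

# Column-oriented layout: the values of each column are stored together.
columns = {
    "id": [r[0] for r in rows],
    "gender": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# Aggregation touches only the "amount" column.
total_column_layout = sum(columns["amount"])

# Filtering scans a single column and returns the matching row positions.
male_positions = [i for i, g in enumerate(columns["gender"]) if g == "M"]

print(total_row_layout, total_column_layout, male_positions)
```

Both layouts give the same total, but the column layout reads one list instead of every field of every record, which is the property that columnar formats exploit.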

<img src="./images/orc_structure.webp" width="400px"/>

Internally, ORC stores data in a series of stripes, where each stripe is a collection of rows. Within a stripe, the data is organized by column, so the values of each column are stored together. Column values are encoded with techniques such as dictionary encoding and run-length encoding, and the encoded data can then be compressed with a general-purpose codec such as zlib.
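
The two encodings named above can be sketched in a few lines of plain Python. This is a simplified illustration of the general techniques, not ORC's actual binary encoding; the helper names `dictionary_encode`, `run_length_encode`, and `run_length_decode` are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with an index into a sorted list of distinct values."""
    dictionary = sorted(set(values))
    index_of = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index_of[v] for v in values]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A string column with long runs of repeated values (made-up sample data).
column = ["Primary"] * 4 + ["Secondary"] * 3 + ["Primary"] * 2

dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)

# Decoding reverses both steps and restores the original column.
decoded = [dictionary[i] for i in run_length_decode(runs)]

print(dictionary)
print(runs)
```

Nine strings shrink to a two-entry dictionary plus three short runs, which is why these encodings work well on columns with few distinct values.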

ORC also stores metadata about the file, such as the schema, at the end of the file. This metadata is used to quickly read the data without having to scan the entire file. Additionally, ORC can store indexes for specific columns, allowing for faster retrieval of specific rows.
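
The benefit of keeping metadata at the end of the file can be shown with a toy format. The sketch below is an assumption-laden simplification: it uses a JSON footer and a 4-byte length suffix, whereas real ORC files use their own binary footer layout. The functions `write_toy_file` and `read_toy_footer` are hypothetical helpers, not part of any library.

```python
import io
import json
import struct


def write_toy_file(buffer, rows, schema):
    """Write the body first, then a footer, then the footer length as the last 4 bytes."""
    body = json.dumps(rows).encode("utf-8")
    footer = json.dumps(
        {"schema": schema, "num_rows": len(rows), "body_length": len(body)}
    ).encode("utf-8")
    buffer.write(body)
    buffer.write(footer)
    buffer.write(struct.pack("<I", len(footer)))


def read_toy_footer(buffer):
    """Read only the tail of the file to recover the metadata."""
    buffer.seek(-4, io.SEEK_END)
    (footer_length,) = struct.unpack("<I", buffer.read(4))
    buffer.seek(-(4 + footer_length), io.SEEK_END)
    return json.loads(buffer.read(footer_length))


buffer = io.BytesIO()
write_toy_file(buffer, [[0, "Test 0"], [1, "Test 1"]], "struct<col0:int,col1:string>")

footer = read_toy_footer(buffer)
print(footer["schema"], footer["num_rows"])
```

The reader learns the schema and the row count with two seeks from the end of the file and never touches the body, which mirrors how a footer lets a reader plan its work before scanning any data.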


* **Advantages**

One of the main advantages of ORC files is that they offer significant performance improvements over row-based storage formats such as plain text and Avro for analytical queries. Because a columnar format stores the values of a single column together, a query that needs only a few columns can read and process just those columns. ORC also supports predicate pushdown, which lets the reader use stored statistics to skip data that cannot match a filter before it is loaded into memory, further improving performance.
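
Predicate pushdown can be sketched with per-group minimum and maximum statistics. The code below is a simplified model, not pyorc's or ORC's implementation: `build_row_groups` and `scan_greater_than` are hypothetical helpers, and the group size of 10 is an arbitrary choice for the example.

```python
def build_row_groups(values, stride):
    """Split a column into groups of `stride` values and record min/max per group."""
    groups = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        groups.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return groups


def scan_greater_than(groups, threshold):
    """Return values > threshold, skipping groups whose max rules out any match."""
    groups_read = 0
    matches = []
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove that no value in this group can match
        groups_read += 1
        # Extra per-row filter added in this sketch; a reader that works at
        # row-group granularity may return the whole group instead.
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


values = list(range(1, 101))  # made-up column holding 1..100
groups = build_row_groups(values, stride=10)
matches, groups_read = scan_greater_than(groups, threshold=85)

print(len(matches), groups_read, len(groups))
```

Only 2 of the 10 groups are read for this filter; the other 8 are skipped on the basis of their statistics alone.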



## **Install pyorc**

In [None]:
%pip install pyorc

## **Read Data From an ORC File**

The example files, taken from the pyorc GitHub repository, are in this folder:

* `demo-12-zlib.orc`
* `TestOrcFile.testStripeLevelStats.orc`
* `TestStringDictionary.testRowIndex.orc`

In [32]:
# import pyorc
import pyorc

# open the orc file
example = open("./demo-12-zlib.orc", "rb")
reader = pyorc.Reader(example)

# see the schema of the selected file
"""
A schema is the structure or layout of the data stored within the file.
It defines the columns, their data types, and any additional metadata associated with the columns.
"""
schema = reader.schema
schema = str(schema)
print(f'Schema:\n{schema}\n')

# check the number of rows in the file by calling len() on the Reader
number_rows = len(reader)
print(f'Number of rows: {number_rows}\n')

# read a row
"""
The Reader is an iterable object that yields a new row on every iteration
"""
first_row = next(reader)
second_row = next(reader)
print(f'1st row: {first_row}\n' \
      f'2nd row: {second_row}\n')

"""
Iterating over the file’s content is the preferred way to process its rows, but we can also read the entire file into memory with the read method. This method has an optional parameter that limits the maximum number of rows to read:
"""
rows_read = reader.read(10_000)
print(f'Rows:\n{rows_read}\n')


"""
Using this optional parameter for larger ORC files is highly recommended!

After all the rows are read, the Reader object has no more rows to yield. There’s a seek method to jump to a specific row in the file and continue reading from that point:
"""
reader.seek(1000)
row = next(reader)
print(f'New row: {row}\n')

"""
By default, all fields are loaded from an ORC file, but that can be changed by passing either the column_indices or the column_names parameter to the Reader:
"""
reader = pyorc.Reader(example, column_names=("_col0", "_col5"))
row = next(reader)
print(f'Row with only 2 cols:\n{row}\n')

# We can also change the representation of a struct from tuple to dictionary
from pyorc.enums import StructRepr

reader = pyorc.Reader(example, column_indices=(1, 5), struct_repr=StructRepr.DICT)
row = next(reader)
print(f'Row with another representation:\n{row}\n')

"""
Stripes:

ORC files are divided into stripes. Stripes are independent of each other. Let’s open another ORC file that has multiple stripes in it:
"""
example = open("./TestOrcFile.testStripeLevelStats.orc", "rb")
reader = pyorc.Reader(example)
num_stripes = reader.num_of_stripes
print(f'Number of stripes: {num_stripes}\n')

"""
The num_of_stripes property of the Reader shows how many stripes are in the file. We can read a specific stripe using the read_stripe method, which takes a zero-based stripe index:
"""
# get the stripe at index 2 (indices are zero-based, so this is the third stripe)
stripe2 = reader.read_stripe(2)

# as it is an iterator, we can use next to read its content row by row
first_row_stripe = next(stripe2)
print(f'1st row of the stripe at index 2:\n{first_row_stripe}\n')


"""
Filtering row groups

It is possible to skip certain records in an ORC file using simple filter predicates (or search arguments). Passing a predicate expression to the Reader excludes row groups that don’t satisfy the condition during reading:
"""
# open the file and create a reader without a predicate
example = open("./TestStringDictionary.testRowIndex.orc", "rb")
reader = pyorc.Reader(example)

# display first row
first_row = next(reader)
print(f'Row before filtering:\n{first_row}\n')

# filter
reader = pyorc.Reader(example, predicate=pyorc.predicates.PredicateColumn(pyorc.TypeKind.STRING, "str") > "row 004096")

# display first row after filtering
first_row = next(reader)
print(f'Row after filtering:\n{first_row}\n')


"""
The predicate can be used to select a single row group, but not an individual record. The size of the row group is determined by the row_index_stride, which is set when the file is written. You can create more complex predicates using logical expressions:

pred = (PredicateColumn(TypeKind.INT, "c0") > 300) & (PredicateColumn(TypeKind.STRING, "c1") == "A")
"""


Schema:
struct<_col0:int,_col1:string,_col2:string,_col3:string,_col4:int,_col5:string,_col6:int,_col7:int,_col8:int>

Number of rows: 1920800

1st row: (1, 'M', 'M', 'Primary', 500, 'Good', 0, 0, 0)
2nd row: (2, 'F', 'M', 'Primary', 500, 'Good', 0, 0, 0)

Rows:
[(3, 'M', 'S', 'Primary', 500, 'Good', 0, 0, 0), (4, 'F', 'S', 'Primary', 500, 'Good', 0, 0, 0), (5, 'M', 'D', 'Primary', 500, 'Good', 0, 0, 0), (6, 'F', 'D', 'Primary', 500, 'Good', 0, 0, 0), (7, 'M', 'W', 'Primary', 500, 'Good', 0, 0, 0), (8, 'F', 'W', 'Primary', 500, 'Good', 0, 0, 0), (9, 'M', 'U', 'Primary', 500, 'Good', 0, 0, 0), (10, 'F', 'U', 'Primary', 500, 'Good', 0, 0, 0), (11, 'M', 'M', 'Secondary', 500, 'Good', 0, 0, 0), (12, 'F', 'M', 'Secondary', 500, 'Good', 0, 0, 0), (13, 'M', 'S', 'Secondary', 500, 'Good', 0, 0, 0), (14, 'F', 'S', 'Secondary', 500, 'Good', 0, 0, 0), (15, 'M', 'D', 'Secondary', 500, 'Good', 0, 0, 0), (16, 'F', 'D', 'Secondary', 500, 'Good', 0, 0, 0), (17, 'M', 'W', 'Secondary', 500, 'Good', 0, 0


In [33]:
# Write an ORC file
"""
To write a new ORC file, we need to open a binary file-like object and pass it to a Writer object together with an ORC schema description. The schema can be a TypeDescription or a plain ORC schema string:
"""
# open the output file and create the writer with the schema
output = open("./new.orc", "wb")
writer = pyorc.Writer(output, "struct<col0:int,col1:string>")

# display the writer
print(f'{writer}\n')

# We can add rows to the file with the write method:
writer.write((0, "Test 0"))
writer.write((1, "Test 1"))

# Don’t forget to close the writer so that the necessary metadata is written out;
# otherwise the result won’t be a valid ORC file.
writer.close()
# close the underlying file object as well
output.close()

"""
For simpler use, the Writer object can be used as a context manager. You can also change the struct representation to use dictionaries as rows instead of tuples:
"""
with open("./new2.orc", "wb") as output:
    with pyorc.Writer(output, "struct<col0:int,col1:string>", struct_repr=StructRepr.DICT) as writer:
        # let's use a for loop to populate the file
        for i in range(10):
            writer.write({"col0": i, "col1": "Test {}".format(i)})



<pyorc.writer.Writer object at 0x00000204349E3BF0>

