In [1]:
import sys
import logging
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

logger.info("Test Logging Output")

INFO - Test Logging Output


In [2]:
sys.executable

'/Users/jacopo/.scientific_jupyter/bin/python3'

# Data Formats

When you want to store data, one of the most important considerations is which format to use. 

Popular choices include JSON, CSV, Avro, and Parquet.

As an example, let's take a few employee records and see how these four formats handle this kind of data.

In [3]:
employees = [
    {
        'name': "Bruce Wayne",
        'company': "Wayne Enterprises",
        'role': "Chairman",
        'quote': "I am the Night",
    },
    {
        'name': "Clark Kent",
        'company': "Daily Planet",
        'role': "Reporter",
    },
    {
        'name': "Steve Rogers",
        'company': "United States Army",
        'role': "Captain",
        'quote': "Avengers! Assemble.",
    },
]

## JSON (JavaScript Object Notation)

> JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.

In [4]:
import json

for emp in employees:
    print("\n" + json.dumps(emp, sort_keys=True))


{"company": "Wayne Enterprises", "name": "Bruce Wayne", "quote": "I am the Night", "role": "Chairman"}

{"company": "Daily Planet", "name": "Clark Kent", "role": "Reporter"}

{"company": "United States Army", "name": "Steve Rogers", "quote": "Avengers! Assemble.", "role": "Captain"}


## CSV

> A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields.

In [5]:
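import csv

# The first record contains every field, so its keys can serve as the header
keys = employees[0].keys()

# Write to a CSV file; newline='' lets the csv module control line endings
with open('/tmp/employees.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(employees)

# Read the raw text back from the CSV file
with open('/tmp/employees.csv', 'r') as input_file:
    print(input_file.read())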

name,company,role,quote
Bruce Wayne,Wayne Enterprises,Chairman,I am the Night
Clark Kent,Daily Planet,Reporter,
Steve Rogers,United States Army,Captain,Avengers! Assemble.



### CSV

Pros:
+ Compact

Cons:
- The file is hard to interpret without a header row
- No types are enforced: every value is read back as text
- Values may contain commas or line breaks, which must be quoted and make the file harder to read and parse
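
To make the last point concrete, here is a minimal sketch using only the standard library. The record is hypothetical (it is not part of the `employees` list) and was chosen because its fields contain a comma, double quotes and a line break.

```python
import csv
import io

# Hypothetical record whose fields contain a comma, double quotes and a line break
row = {"name": "Kent, Clark", "quote": 'Up, up\nand "away"'}

# Write the record to an in-memory buffer instead of a file
buffer = io.StringIO(newline="")
dict_writer = csv.DictWriter(buffer, fieldnames=["name", "quote"])
dict_writer.writeheader()
dict_writer.writerow(row)

raw_text = buffer.getvalue()
print(raw_text)

# Splitting on line breaks sees three lines, not a header plus one record
naive_lines = raw_text.splitlines()
print(len(naive_lines))

# A CSV-aware reader understands the quoting and restores the original record
parsed_rows = list(csv.DictReader(io.StringIO(raw_text, newline="")))
print(parsed_rows)
```

The `csv` module quotes such fields automatically, but any tool that treats the file as plain lines of comma-separated text will misread them.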

### JSON

Pros:
+ No header needed; each record is self-describing

Cons:
- Bulky: field names are repeated in every record
- No types are enforced across records
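
A short sketch of what "no types are enforced" means in practice, using only the standard library. The records below are hypothetical and exist only for illustration.

```python
import json
from datetime import date

# Hypothetical record mixing several Python types
record = {"id": 7, "ratio": 1.618, "active": True, "tags": ("a", "b")}

# Numbers, booleans and strings survive a round trip; the tuple comes back as a list
restored = json.loads(json.dumps(record))
print(restored)

# Dates have no JSON representation, so json.dumps refuses them by default
try:
    json.dumps({"hired": date(2014, 8, 14)})
    date_error = None
except TypeError as exc:
    date_error = str(exc)
print(date_error)

# Nothing stops two records from disagreeing on the type of the same field
lines = ['{"temp": 22}', '{"temp": "waldo"}']
temp_types = [type(json.loads(line)["temp"]).__name__ for line in lines]
print(temp_types)
```

JSON does distinguish numbers, strings and booleans, but it has no way to declare that a given field must always hold one of them.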

# Enter: Data Types and Schemas

In [6]:
integers = [-1, 7, 1134, -456]
decimals = [3.14, 1.618, 10.0]
varchars = ["Hello World", "DSR"]
booleans = [True, False]
dates = ["2014-08-14", "1995-03-10"]
datetimes = ["23-08-14 12:34:56", "1995-03-10 00:34:11"]
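
Text formats hand values like the dates above back as plain strings, so the reader has to convert them. The sketch below shows one way to do that with the standard library. The `'%Y-%m-%d %H:%M:%S'` layout is an assumption about how the timestamps are written, and the malformed value is hypothetical.

```python
from datetime import date, datetime

dates = ["2014-08-14", "1995-03-10"]

# Convert ISO-formatted strings (YYYY-MM-DD) into date objects
parsed_dates = [date.fromisoformat(text) for text in dates]

# Assumed timestamp layout: 'YYYY-MM-DD HH:MM:SS'
stamp = datetime.strptime("1995-03-10 00:34:11", "%Y-%m-%d %H:%M:%S")

# Typed values support arithmetic that strings do not
days_between = (parsed_dates[0] - parsed_dates[1]).days
print(days_between)

# A malformed value (hypothetical) is only caught at conversion time
try:
    date.fromisoformat("2014-13-40")
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(conversion_failed)
```

A schema moves this conversion and its error handling out of every reader and into the format itself.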

## Pros and Cons of using Binary data formats

Pros:

- More compact than text formats, takes less space on disk
- Schema acts as documentation
- Enables type checking when reading or writing data

Cons: 

- Not human-readable, which makes files harder to inspect and debug
- Usually need additional libraries to read and write
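
As a rough illustration of the compactness claim, the sketch below encodes one hypothetical reading both as JSON text and as packed binary with the standard `struct` module. Avro and Parquet use their own encodings, so this only shows the general idea and not what those formats write to disk.

```python
import json
import struct

# Hypothetical weather reading: a timestamp and a temperature
reading = {"time": 1433269388, "temp": 22}

# Text encoding: field names and digits are stored as characters
as_text = json.dumps(reading).encode("utf-8")

# Binary encoding: the layout ("<qi" = little-endian 8-byte integer followed
# by a 4-byte integer) is agreed on separately, so only the values are stored
LAYOUT = "<qi"
as_binary = struct.pack(LAYOUT, reading["time"], reading["temp"])

print(len(as_text), "bytes as JSON text")
print(len(as_binary), "bytes as packed binary")

# Reading the bytes back requires the same layout
time_value, temp_value = struct.unpack(LAYOUT, as_binary)
print(time_value, temp_value)
```

The bytes are meaningless without the layout, which is the role a schema plays in a binary format.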

## Apache Avro

> Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.

In [7]:
from fastavro import writer, reader, parse_schema

avro_schema = {
    'doc': 'A weather reading.',
    'name': 'Weather',
    'namespace': 'test',
    'type': 'record',
    'fields': [
        {'name': 'station', 'type': 'string'},
        {'name': 'time', 'type': 'long'},
        {'name': 'temp', 'type': 'int'},
    ],
}
parsed_schema = parse_schema(avro_schema)

In [8]:
# 'records' can be an iterable (including generator)
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
]

# Writing
with open('/tmp/weather.avro', 'wb') as out:
    writer(out, parsed_schema, records)

# Reading
with open('/tmp/weather.avro', 'rb') as fo:
    for record in reader(fo):
        print(record)

{'station': '011990-99999', 'time': 1433269388, 'temp': 0}
{'station': '011990-99999', 'time': 1433270389, 'temp': 22}
{'station': '011990-99999', 'time': 1433273379, 'temp': -11}
{'station': '012650-99999', 'time': 1433275478, 'temp': 111}


### What about type checks? 

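Let's see what happens when a record contains a value of the wrong type. Below, the second record has the string `'waldo'` in the `time` field, which the schema declares as `long`.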

In [9]:
# The second record has a string in 'time', which the schema declares as long
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 'waldo'},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
]

# Writing
try:
    with open('/tmp/weather.avro', 'wb') as out:
        writer(out, parsed_schema, records)
except Exception as e:
    logger.exception(e)

ERROR - an integer is required on field time
Traceback (most recent call last):
  File "fastavro/_write.pyx", line 353, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 70, in fastavro._write.write_long
TypeError: an integer is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<ipython-input-9-52ead7c5d03e>", line 12, in <module>
    writer(out, parsed_schema, records)
  File "fastavro/_write.pyx", line 686, in fastavro._write.writer
  File "fastavro/_write.pyx", line 639, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 389, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 379, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 323, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 388, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 353, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 70, in fastavro
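
The general idea of checking a record against a schema can be sketched in plain Python. This is not how fastavro implements it: the `PYTHON_TYPES` mapping and the `find_type_errors` helper are hypothetical names introduced here, and only the three primitive types used in the weather schema are covered.

```python
# Hypothetical mapping from the Avro primitive names used above to Python types
PYTHON_TYPES = {"string": str, "long": int, "int": int}

fields = [
    {"name": "station", "type": "string"},
    {"name": "time", "type": "long"},
    {"name": "temp", "type": "int"},
]


def find_type_errors(record, fields):
    """Return one message per field whose value has the wrong Python type."""
    errors = []
    for field in fields:
        value = record.get(field["name"])
        expected = PYTHON_TYPES[field["type"]]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{field['name']}: expected {field['type']}, "
                f"got {type(value).__name__}"
            )
    return errors


good = {"station": "011990-99999", "temp": 22, "time": 1433270389}
bad = {"station": "011990-99999", "temp": 22, "time": "waldo"}

print(find_type_errors(good, fields))  # []
print(find_type_errors(bad, fields))   # ['time: expected long, got str']
```

Collecting every problem in a record, rather than stopping at the first, tends to give more useful feedback when cleaning data before a write.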

## Apache Parquet

> Apache Parquet is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem.

> It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

What does **column-oriented data storage** mean? 

![Parquet Columnar Storage](images/parquet_columnar_format.png)
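
The difference between the two layouts can be sketched with plain Python containers. This only illustrates the idea and says nothing about Parquet's actual on-disk structure. The rows are a subset of the weather records used earlier.

```python
from itertools import groupby

# Row-oriented: each record keeps all of its fields together
rows = [
    {"station": "011990-99999", "time": 1433269388, "temp": 0},
    {"station": "011990-99999", "time": 1433270389, "temp": 22},
    {"station": "012650-99999", "time": 1433275478, "temp": 111},
]

# Column-oriented: all values of one field are kept together
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Answering a question about one field touches a single list...
max_temp = max(columns["temp"])
# ...whereas the row layout has to visit every complete record
max_temp_from_rows = max(row["temp"] for row in rows)
print(max_temp, max_temp_from_rows)

# Similar values sit next to each other, so they can be stored as
# (value, repeat count) pairs, a simple form of run-length encoding
station_runs = [
    (value, len(list(group))) for value, group in groupby(columns["station"])
]
print(station_runs)
```

Keeping similar values adjacent is one reason columnar formats tend to compress well.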

In [10]:
import pandas as pd

MULTIPLIER = 10_000
records = [
    {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
    {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
    {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
    {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
] * MULTIPLIER

pd_df = pd.DataFrame(records)

Writing Parquet from pandas requires either pyarrow or fastparquet. I prefer pyarrow, as fastparquet has some issues with the latest (11.0) LLVM version. I also prefer Arrow's design, as it is more flexible.

In [19]:
# Write to parquet

# from fastparquet import write
file_path = '/tmp/outfile.parquet'
# write(file_path, pd_df)
pd_df.to_parquet(file_path)
logger.info(f"Wrote a parquet file containing {len(pd_df.index)} records at {file_path}")

INFO - Wrote a parquet file containing 40000 records at /tmp/outfile.parquet


# Data Formats - Exercise 1

Prove that reading a single column from a Parquet file is faster than reading all of its columns.