---
Serialization Frameworks
===

---
Why do we need Serialization Frameworks?
----

I'm a _fancy_ Data Scientist and <3 my .csvs.

However, distributed systems move __a lot__ of data over the wire.

We need to be efficient.

---
What are Serialization Frameworks?
----

![](http://www.codingeek.com/wp-content/uploads/2014/11/Serialization-deserialization-in-Java-Object-Streams.jpg)

Serialization helps data in memory speak to the wire.

Serialization is the process of translating data structures or object state into a format that can be stored and reconstructed later.
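
A minimal sketch of that round trip, using Python's built-in `json` module. The `user` record is a made-up example, not data from this session.

```python
import json

# A hypothetical in-memory object (made up for illustration).
user = {"id": 3, "name": "Tom", "age": 28}

# Serialize: in-memory object -> bytes that can be stored or sent over the wire.
payload = json.dumps(user).encode("utf-8")

# Deserialize: bytes -> an equivalent in-memory object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, len(payload))
print(restored == user)
```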

---
By the end of this session you should know:
----
- Why Serialization Frameworks matter and what their key aspects are
- The common Big Data Serialization Frameworks:
    - Protocol buffers
    - Thrift
    - Avro
    - Parquet

----
You already know Serialization Frameworks
----

__JSON__:

````
{"menu": {
  "id": "file",
  "value": "File",
  "popup": {
    "menuitem": [
      {"value": "New", "onclick": "CreateNewDoc()"},
      {"value": "Open", "onclick": "OpenDoc()"},
      {"value": "Close", "onclick": "CloseDoc()"}
    ]
  }
}}
````

__Pickle__:

![](https://www.safaribooksonline.com/library/view/head-first-python/9781449397524/httpatomoreillycomsourceoreillyimages1368712.png.jpg)

<details><summary>
What are the limitations of JSON and pickles?
</summary>

JSON is "heavy": it is verbose text that repeats every field name in every record, and the precision of numbers cannot be specified.

<br>
Pickle is Python-specific (even Python-version-specific).
</details>
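
To put rough numbers on "heavy", here is a small sketch comparing JSON with a fixed binary layout built with the standard `struct` module. The readings are made up, and the binary layout (`int32` plus `float32`) is an assumption chosen for illustration, not a format used by any framework in this session.

```python
import json
import pickle
import struct

# 1,000 made-up (id, value) readings.
readings = [(i, i * 0.5) for i in range(1000)]

# JSON repeats every field name and stores numbers as text.
as_json = json.dumps([{"id": i, "value": v} for i, v in readings]).encode("utf-8")

# A fixed binary layout: little-endian int32 + float32 = 8 bytes per reading.
# Here the numeric precision is chosen explicitly, which JSON cannot express.
as_binary = b"".join(struct.pack("<if", i, v) for i, v in readings)

# Pickle round-trips Python objects, but only Python can read the result.
as_pickle = pickle.dumps(readings)

print(len(as_json), len(as_binary), len(as_pickle))
```

The binary version is 8 bytes per reading by construction, while the JSON version pays for field names and digits on every record.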

----
Graph schemas sidebar
----

Graphs are a nice CS way of modeling the world.

### Elements of a graph schema
![Visualizing the relationship between FaceSpace facts](https://s3-us-west-2.amazonaws.com/dsci6007/assets/fig2-16.png)
* Nodes are the entities in the system.
* Edges are relationships between nodes.
* Properties are information about entities.

### The need for an enforceable schema
Suppose you chose to represent Tom’s age using JSON:
```json
{"id": 3, "field":"age", "value":28, "timestamp":1333589484}
```
There’s no way to ensure that all subsequent facts will follow the same format.
```json
{"name": "Alice", "field":"age", "value":25, "timestamp":"2012/03/29 08:12:24"}
{"id":2, "field":"age", "value":36}
```
Both of these examples are valid JSON, but they have inconsistent formats or missing data.
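
Without a schema, the check has to be written by hand. The sketch below does that for the three facts above; the expected field set is an assumption inferred from the first example, and `conforms` is a hypothetical helper.

```python
import json

# Expected shape of a fact, inferred from the first example above (an assumption).
EXPECTED_FIELDS = {"id": int, "field": str, "value": int, "timestamp": int}


def conforms(fact):
    """Return True if the fact has exactly the expected fields and types."""
    if set(fact) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(fact[name], kind) for name, kind in EXPECTED_FIELDS.items())


lines = [
    '{"id": 3, "field":"age", "value":28, "timestamp":1333589484}',
    '{"name": "Alice", "field":"age", "value":25, "timestamp":"2012/03/29 08:12:24"}',
    '{"id":2, "field":"age", "value":36}',
]

results = [conforms(json.loads(line)) for line in lines]
print(results)
```

Only the first fact passes. An enforceable schema moves this kind of check out of ad hoc application code and into the serialization layer.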

---
"Advanced"  Serialization Frameworks
---

---
Protocol buffers
---

![](http://i62.tinypic.com/otfel3.jpg)

Protocol buffers are Google's 
- language-neutral
- platform-neutral
- extensible mechanism for serializing structured data

Think XML, but smaller, faster, and simpler. 

You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.


```
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}
```

```
Person brian = Person.newBuilder()
    .setId(42)
    .setName("Brian Spiering")
    .setEmail("brian.spiering@galvanize.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
brian.writeTo(output);
output.close();
```
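
To see why the result is "smaller", the sketch below hand-encodes the same `Person` following the Protocol Buffers wire format (a tag per field, varints for integers, length-prefixed strings). It is an illustration only: real code would use the generated classes, the helper names are hypothetical, and this version handles only non-negative integers.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = n & 0x7F
        n >>= 7
        if n:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string_field(field_number, text):
    """Tag (wire type 2 = length-delimited), byte length, then UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_int_field(field_number, value):
    """Tag (wire type 0 = varint), then the value as a varint."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


encoded = (
    encode_string_field(1, "Brian Spiering")
    + encode_int_field(2, 42)
    + encode_string_field(3, "brian.spiering@galvanize.com")
)
print(len(encoded), encoded.hex())
```

Field names never appear in the output; only the field numbers from the `.proto` definition do, which is why both sides need the schema.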

Check for understanding
----

<details><summary>
When don't you want to use a protocol buffer?
</summary>

- You aren’t prepared to tie the data model to a schema
- You don’t have the bandwidth to add another tool to your arsenal
- You need or want data to be human readable
- Data from the service is directly consumed by a web browser
</details>

---
Apache Thrift 
---

![](http://www.symfony-project.org/uploads/plugins/f4618597b8cf4bdd736298718bf66af5.png)

Allows different components to talk to each other

Google Translate for machines and services

---
What does Thrift do?
----

- Lets you quickly define your service
- Compiles client and server wrappers for API calls
- Makes all networking and serialization transparent
- Takes care of low-level, repetitive details
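
The idea of generated client and server wrappers can be sketched by hand. This is not the Thrift API: Thrift generates such wrappers from an interface definition file, whereas the `Calculator` service, the JSON encoding, and the in-process "transport" below are all assumptions made for illustration.

```python
import json


class CalculatorHandler:
    """Server-side implementation of a hypothetical service."""

    def add(self, a, b):
        return a + b


def handle_request(handler, request_bytes):
    """Server wrapper: decode the call, dispatch it, encode the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = getattr(handler, request["method"])(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


class CalculatorClient:
    """Client wrapper: looks like a local object, but every call is serialized."""

    def __init__(self, transport):
        self._transport = transport  # any callable taking bytes and returning bytes

    def add(self, a, b):
        request = json.dumps({"method": "add", "args": [a, b]}).encode("utf-8")
        response = self._transport(request)
        return json.loads(response.decode("utf-8"))["result"]


# Stand-in for a network connection: call the server wrapper directly.
handler = CalculatorHandler()
client = CalculatorClient(lambda request: handle_request(handler, request))
print(client.add(2, 3))
```

The caller only sees `client.add(2, 3)`; the serialization and transport details stay inside the wrappers.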

----
Apache Avro
----

![](https://avro.apache.org/images/avro-logo.png)

Avro stores both the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information.

1. Store the data definition (data types and protocols) in JSON
2. Store the data itself in a compact and efficient binary format
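
The two steps above can be sketched with the standard library alone. This toy container is not the Avro file format: the header layout, the `PageView` schema, and the use of `struct` format codes in place of Avro types are all assumptions, chosen only to show a reader recovering the schema from the file itself.

```python
import io
import json
import struct

# Hypothetical schema: field names plus struct format codes (not Avro types).
schema = {"name": "PageView", "fields": [["user_id", "q"], ["nonce", "q"]]}
rows = [(1, 100), (2, 101), (3, 102)]


def write_container(stream, schema, rows):
    header = json.dumps(schema).encode("utf-8")
    stream.write(struct.pack("<I", len(header)))  # header length
    stream.write(header)                          # 1. data definition as JSON
    row_format = "<" + "".join(code for _, code in schema["fields"])
    for row in rows:
        stream.write(struct.pack(row_format, *row))  # 2. data as compact binary


def read_container(stream):
    (header_length,) = struct.unpack("<I", stream.read(4))
    schema = json.loads(stream.read(header_length).decode("utf-8"))
    row_format = "<" + "".join(code for _, code in schema["fields"])
    names = [name for name, _ in schema["fields"]]
    row_size = struct.calcsize(row_format)
    records = []
    while True:
        chunk = stream.read(row_size)
        if len(chunk) < row_size:
            break
        records.append(dict(zip(names, struct.unpack(row_format, chunk))))
    return schema, records


buffer = io.BytesIO()
write_container(buffer, schema, rows)
buffer.seek(0)
recovered_schema, records = read_container(buffer)
print(recovered_schema["name"], records)
```

The reader needs no prior knowledge of the fields: everything required to decode the rows travels with them.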

![](images/avro.png)

#### Primitive Types
The set of primitive type names is:  
* `null`: no value
* `boolean`: a binary value
* `int`: 32-bit signed integer
* `long`: 64-bit signed integer
* `float`: single precision (32-bit) IEEE 754 floating-point number
* `double`: double precision (64-bit) IEEE 754 floating-point number
* `bytes`: sequence of 8-bit unsigned bytes
* `string`: Unicode character sequence  

Primitive types have no specified attributes.

Primitive type names are also defined type names. Thus, for example, the schema "string" is equivalent to:

    {"type": "string"}

#### Complex Types
Avro supports six kinds of complex types: `records`, `enums`, `arrays`, `maps`, `unions` and `fixed`.

See http://avro.apache.org/docs/current/spec.html for more details.
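
As a quick illustration, the made-up schema below uses each of the six complex types once. It is only built and printed as JSON here; the `PageVisit` record and its field names are hypothetical.

```python
import json

# A made-up record schema that uses each of the six complex types once.
page_visit_schema = {
    "type": "record",
    "name": "PageVisit",
    "namespace": "example.avro",
    "fields": [
        # enum: a fixed set of symbols
        {"name": "device", "type": {"type": "enum", "name": "Device",
                                    "symbols": ["DESKTOP", "MOBILE"]}},
        # array: a list of items of one type
        {"name": "tags", "type": {"type": "array", "items": "string"}},
        # map: string keys to values of one type
        {"name": "counters", "type": {"type": "map", "values": "long"}},
        # union: written as a JSON array of alternatives
        {"name": "referrer", "type": ["null", "string"]},
        # fixed: a fixed number of bytes
        {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}},
    ],
}

schema_json = json.dumps(page_visit_schema, indent=2)
print(schema_json)
```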

Fun Fact
---

<details><summary>
Why does Avro work better for Hadoop?
</summary>
 Avro files include markers that can be used to split large data sets into subsets suitable for MapReduce processing.
<br>

</details>

In [1]:
import json
import avro.schema

### Nodes

In [2]:
PersonID = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonID1",
        "fields": [
            {
                "name": "cookie",
                "type": "string"
            }
        ]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonID2",
        "fields": [
            {
                "name": "user_id",
                "type": "long"
            }
        ]
    }]

In [3]:
PageID = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageID",
        "fields": [
            {
                "name": "url",
                "type": "string"
            }
        ]
    }]

In [4]:
Nodes = PersonID + PageID

### Edges

In [5]:
EquivEdge = {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "EquivEdge",
        "fields": [
            {
                "name": "id1",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "id2",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            }
        ]
    }

In [7]:
PageViewEdge = {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageViewEdge",
        "fields": [
            {
                "name": "person",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "page",
                "type": "PageID"
            },
            {
                "name": "nonce",
                "type": "long"
            }
        ]
    }

In [8]:
Edges = [EquivEdge, PageViewEdge]

### Properties

#### Page Properties

In [9]:
PageProperties = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PagePropertyValue",
        "fields": [
            {
                "name": "page_views",
                "type": "int"
            }
        ]
    }, 
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageProperty",
        "fields": [
            {
                "name": "id",
                "type": "PageID"
            },
            {
                "name": "property",
                "type": "PagePropertyValue"
            }
        ]
    }]

or

In [10]:
PageProperties = [{
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PageProperty",
        "fields": [
            {
                "name": "id",
                "type": "PageID"
            },
            {
                "name": "property",
                "type": {
                    "type": "record",
                    "name": "PagePropertyValue",
                    "fields": [
                        {
                            "name": "page_views",
                            "type": "int"
                        }
                    ]
                }
            }
        ]
    }]

#### Person Properties

In [11]:
PersonProperties = [
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Location",
        "fields": [
            {"name": "city", "type": ["string", "null"]},
            {"name": "state", "type": ["string", "null"]},
            {"name": "country", "type": [ "string","null"]}
        ]
    },
    {
        "namespace": "analytics.avro",
        "type": "enum",
        "name": "GenderType",
        "symbols": ["MALE", "FEMALE"]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "PersonProperty",
        "fields": [
            {
                "name": "id",
                "type": [
                    "PersonID1",
                    "PersonID2"
                ]
            },
            {
                "name": "property",
                "type": [
                    {
                        "type": "record",
                        "name": "PersonPropertyValue1",
                        "fields": [{"name": "full_name", "type": "string"}]
                    },
                    {
                        "type": "record",
                        "name": "PersonPropertyValue2",
                        "fields": [{"name": "gender", "type": "GenderType"}]
                    },
                    {
                        "type": "record",
                        "name": "PersonPropertyValue3",
                        "fields": [{"name": "location", "type": "Location"}]
                    }
                ]
            }
        ]
    }]

### Tying everything together into data objects

In [12]:
Data = [
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Pedigree",
        "fields": [{"name": "true_as_of_secs", "type": "int"}]
    },
    {
        "namespace": "analytics.avro",
        "type": "record",
        "name": "Data",
        "fields": [
            {
                "name": "pedigree",
                "type": "Pedigree"
            },
            {
                "name": "dataunit",
                "type": [
                    {
                        "type": "record",
                        "name": "DataUnit1",
                        "fields": [{"name": "person_property", "type": "PersonProperty"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit2",
                        "fields": [{"name": "page_property", "type": "PageProperty"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit3",
                        "fields": [{"name": "equiv", "type": "EquivEdge"}]
                    },
                    {
                        "type": "record",
                        "name": "DataUnit4",
                        "fields": [{"name": "page_view", "type": "PageViewEdge"}]
                    }
                ]
            }
        ]
    }
]

In [13]:
schema = avro.schema.parse(json.dumps(Nodes + Edges + PageProperties + PersonProperties + Data))

---
Parquet
---

![](http://www.bauwerk-parkett.com/parquet-images/floor/1200x800/2444/parquet-acacia-monopark-steamed-strip-470x70x96mm.jpg)

![](https://pbs.twimg.com/profile_images/474255479032385537/OGYr_m6J.jpeg)

Parquet is an efficient data store for analytics.

![](images/parquet.png)

---
Graph Schema Flashback
---

![](images/schema_columns.png)

---
Parquet Features
---

- __Columnar__ File Format
- Not tied to any commercial framework
- Supports Nested Data Structures
- Accessible by Hive, Spark, Pig, Drill, MapReduce (MR)
- R/W in HDFS or local file system

---
Columnar FTW!
---

Features

- Limits IO to data needed
- Compresses better
- Type specific encodings available
- Enables vectorized execution engines
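
A small sketch of the first two points, using plain Python lists rather than Parquet itself. The rows are made up, and run-length encoding stands in here as one simple example of an encoding that benefits from a columnar layout.

```python
from itertools import groupby

# Made-up page-view rows.
rows = [
    {"country": "US", "url": "/home", "views": 3},
    {"country": "US", "url": "/about", "views": 1},
    {"country": "US", "url": "/home", "views": 7},
    {"country": "FR", "url": "/home", "views": 2},
    {"country": "FR", "url": "/about", "views": 5},
]

# Row layout -> column layout: one list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Limits IO: an aggregate over one column reads only that column.
total_views = sum(columns["views"])

# Compresses better: values of one type sit together, so a simple
# run-length encoding collapses the repeated countries.
country_runs = [(value, len(list(run))) for value, run in groupby(columns["country"])]

print(total_views, country_runs)
```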

![](images/layout.png)

![](images/read.png)

![](images/encoding.png)

---
Delta Encoding Deep Dive
---

![](images/delta.png)

[Go way down the rabbit hole](http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013)

WARNING: bit packing!
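
The core idea fits in a few lines. This sketch covers only the basic delta step on made-up timestamps; the bit packing that Parquet layers on top is not implemented here, only hinted at by comparing bit widths.

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Rebuild the original values with a running sum."""
    values = []
    total = 0
    for delta in encoded:
        total += delta
        values.append(total)
    return values


timestamps = [1333589484, 1333589487, 1333589488, 1333589495, 1333589496]
encoded = delta_encode(timestamps)

# Bits needed for the largest delta vs. the largest raw value.
delta_bits = max(encoded[1:]).bit_length()
raw_bits = max(timestamps).bit_length()

print(encoded, delta_bits, raw_bits)
```

Sorted or slowly changing columns, such as timestamps, produce small deltas, which is what makes packing them into a few bits each worthwhile.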

Check for understanding

<details><summary>
Where else have we seen delta encoding?
</summary>
git/GitHub
<br>

</details>

![](images/file.png)

![](images/file_format.png)

Recommended Reading: 
- [Dremel made simple with parquet](https://blog.twitter.com/2013/dremel-made-simple-with-parquet)
- [Understanding parquet](https://dzone.com/articles/understanding-how-parquet)

---
Common Workflows
----

![](images/workflow.png)

---
Out of Scope
---

### Actor Model

The actor model is a programming model for concurrency in a single process. 

This is hugely important. The most interesting, cutting-edge systems are using this. 

Check out the [Akka Actor Model](http://doc.akka.io/docs/akka/current/general/actors.html)

---

### Schema Evolution

![](images/schema_break.png)

Things change. You will break stuff. Have a plan. Prepare to suffer.
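
One common plan is to give every new field a default so that old data stays readable. The sketch below shows that idea in plain Python; the schema representation and the `read_with_schema` helper are hypothetical simplifications, not the resolution rules of any particular framework.

```python
# Version 1 records were written before the "email" field existed.
old_records = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]

# Hypothetical version 2 schema: field name -> default value.
schema_v2 = {
    "id": None,
    "name": "",
    "email": "unknown",
}


def read_with_schema(record, schema):
    """Project a record onto the reader's schema: fill defaults, drop unknown fields."""
    return {field: record.get(field, default) for field, default in schema.items()}


upgraded = [read_with_schema(record, schema_v2) for record in old_records]
print(upgraded)
```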


---

---
Summary
-------------------------------------------------

- Serialization Frameworks are efficient and common ways of transferring and storing data.
- It is important to have a schema, and for it to be self-describing and flexible.
- Protocol buffers and Thrift are general, language-agnostic ways of encoding data with schemas.
- Avro also specifies the structure of the data being encoded and is better suited to Hadoop.
- Parquet is well suited to Data Science workloads: a compact, columnar way to store data for efficient analytic querying.

<br>

---