# <center> STATS 607 - LECTURE 7
## <center> 09/26/2018

# <center> Serialization

Thank you Prof. Kerby Shedden for making available this material.

Serialization is the act of taking an essentially arbitrary data object and converting it into a stream of bytes. These bytes can be written to storage or transmitted over a network. A serialized object should be self-describing, meaning that given the byte stream, it should be possible to reconstruct the original object without any additional information.

Serialization is an active area of research and development. Many languages support an optimized serialization format that is tailored to the idiosyncrasies of the language. There are also language-neutral formats that can be used to share data between different languages.

The native Python serialization format is called a pickle. To illustrate, let's create a dictionary:

In [1]:
x = {j : j*j for j in range(1000)}

Now we can pickle the dictionary and write it to a file. Pickle is a binary format, so we open the file in binary mode.

In [2]:
import pickle

# The with statement closes the file automatically, even if dump raises an error.
with open("dict.pkl", "wb") as fid:
    pickle.dump(x, fid)

Now we can load the dictionary back into the interpreter (“deserialize” it):

In [3]:
# Open in binary mode for reading; the file is closed when the block exits.
with open("dict.pkl", "rb") as fid:
    y = pickle.load(fid)

You can check that y and x are distinct objects but have identical values.

In [4]:
print(x is y)
print(x == y)

False
True
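
To look at the byte stream itself rather than a file, here is a minimal sketch using `pickle.dumps` and `pickle.loads`, which serialize to and from an in-memory `bytes` object. The names `small`, `payload`, and `restored` are chosen only for this illustration.

```python
import pickle

# A small dictionary, built the same way as x above (illustrative name).
small = {j: j * j for j in range(5)}

# dumps returns the serialized object as a bytes value instead of writing a file.
payload = pickle.dumps(small)
print(type(payload))

# loads reconstructs an equal, but distinct, object from the bytes alone.
restored = pickle.loads(payload)
print(restored == small)
print(restored is small)
```

The same `payload` bytes could be written to a file opened in binary mode or sent over a network connection.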


An alternative and very popular serialization format is called “JSON” (it stands for “JavaScript Object Notation”, but the connection to the JavaScript language is mainly historical). JSON is a text format so you can open a JSON file in a text viewer, and nearly any programming language has JSON encoders and decoders.

Here is a basic illustration of using JSON serialization in Python:

In [5]:
import json

# JSON is text, so the file is opened in text mode; with closes it automatically.
with open("dict.json", "w") as fid:
    json.dump(x, fid)

Now we can read the dictionary back in from the file:

In [6]:
# The default mode is text mode for reading.
with open("dict.json") as fid:
    y = json.load(fid)

You will notice that x and y are not equal. JSON objects (used to represent Python dict objects) only allow string keys, so the integer keys in x have become strings in y.

In [7]:
print(type(list(x.keys())[0]))
print(type(list(x.values())[0]))

<class 'int'>
<class 'int'>


In [8]:
print(type(list(y.keys())[0]))
print(type(list(y.values())[0]))

<class 'str'>
<class 'int'>
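
If the original integer keys are needed, one possible fix is to convert the keys back after loading. The sketch below assumes that every key was an integer to begin with; the names `squares`, `decoded`, and `restored` are illustrative.

```python
import json

squares = {j: j * j for j in range(1000)}

# Round-trip through a JSON string; the integer keys come back as strings.
decoded = json.loads(json.dumps(squares))
print(decoded == squares)

# Convert each key back to int (valid here only because every key was an int).
restored = {int(k): v for k, v in decoded.items()}
print(restored == squares)
```

This conversion is not part of the JSON format itself, so a program reading the file has to know that the keys are meant to be integers.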


# <center> Classes

Python has a class system that can be used to create compound data types and to write object oriented programs. For work focusing on data, we use classes a lot, but we do not need to create them as often. So we will cover classes only briefly here.

First, we can consider using classes to define a compound data type (like a C struct). The minimal class definition in Python is:

In [9]:
class Myclass(object):
    pass

Now we can use the class as follows:

In [10]:
x = Myclass()

x.country = "Mexico"
x.population = 119530753
x.capital = "Mexico City"

The first line above creates an instance of the class, and the remaining lines create attributes in the class instance. Since Python is a dynamic language, we can add attributes to a class instance after it is created, as is done above. Thus, different instances of the class may have different attributes (but usually you would not want to actually do this).

In [11]:
y = Myclass()
y.country = "US"
y.parameter = 'Something'

In [14]:
print(type(x))
print(type(y))
x.parameter

<class '__main__.Myclass'>
<class '__main__.Myclass'>


AttributeError: 'Myclass' object has no attribute 'parameter'
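
Since instances of the same class may carry different attributes, it can help to check for an attribute before using it. Here is a small sketch using the built-in functions `hasattr`, `getattr`, and `vars`; the instance name `z` is chosen only for this illustration.

```python
class Myclass(object):
    pass

z = Myclass()
z.country = "Mexico"

# hasattr reports whether the attribute exists on this instance.
print(hasattr(z, "country"))
print(hasattr(z, "parameter"))

# getattr with a third argument returns that default instead of raising AttributeError.
print(getattr(z, "parameter", None))

# vars returns the dictionary of attributes currently stored on the instance.
print(vars(z))
```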

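If you want to be a bit more strict you can store attributes in variables that are private by convention (named with a leading underscore, which Python does not enforce) and access them through getter and setter methods: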

In [15]:
class Country(object):

    def __init__(self):
        self._name = ""
        self._capital = ""
        self._population = None

    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

    # similar for population and capital

The getter and setter implemented above are methods. These are basically functions that are called with dot notation on a class instance and take that instance as an implicit first argument. We will discuss the implicit passing of the class instance below, but for now just note that self is a variable name that could in principle be anything, but by convention it is nearly always given as self. This means that we can call the methods using:

In [16]:
c = Country()
c.set_name("Mexico")

When calling c.set_name(x), inside the method body self is bound to c and name is bound to x.

In [21]:
c.get_name()

'Mexico'
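
To make the implicit first argument concrete, the sketch below calls the same method in two ways: through the instance, and through the class with the instance passed explicitly. It uses a trimmed copy of the Country class so that it runs on its own.

```python
class Country(object):

    def __init__(self):
        self._name = ""

    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

c = Country()

# Usual method call: c is passed implicitly as self.
c.set_name("Mexico")
first = c.get_name()

# Equivalent call through the class, passing the instance explicitly as self.
Country.set_name(c, "Canada")
second = c.get_name()

print(first, second)
```

The two forms do the same thing; the first is the one you would normally write.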

The class definition above also contains an __init__ method that is called automatically when a new class instance is created. We could also allow the caller to pass in initial values for the class attributes, by using the following __init__ implementation:

In [22]:
# This definition replaces __init__ inside the body of the Country class.
def __init__(self, name="", capital="", population=None):
    self._name = name
    self._capital = capital
    self._population = population
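
As a sketch of how this constructor might be used, here is a trimmed version of the class with the new `__init__` in place. The instance names `m` and `d` are illustrative, and only one getter is included to keep the example short.

```python
class Country(object):

    def __init__(self, name="", capital="", population=None):
        self._name = name
        self._capital = capital
        self._population = population

    def get_name(self):
        return self._name

# Keyword arguments set the initial values; omitted ones keep their defaults.
m = Country(name="Mexico", capital="Mexico City")
print(m.get_name())

# With no arguments, every attribute takes its default value.
d = Country()
print(repr(d.get_name()))
```

Because every parameter has a default, the earlier call `Country()` with no arguments continues to work.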

More advanced object oriented programming centers around the idea of inheritance, meaning that you can build a class by extending an existing class. This can get somewhat complex. The most basic type of inheritance is single inheritance, meaning that a class has only one ancestor. Here is a simplistic example of single inheritance, based on the Country class that we defined above.

In [23]:
class EconomicCountry(Country):

    def set_gdp(self, gdp):
        self.gdp = gdp

    def set_taxrate(self, taxrate):
        self.taxrate = taxrate

An instance of EconomicCountry will have the methods and attributes of Country along with any additional methods and attributes that were added in the class definition of EconomicCountry.
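
Here is a short sketch of the subclass in use, with trimmed copies of both classes so that it runs on its own. The GDP value is an arbitrary placeholder, not real data.

```python
class Country(object):

    def __init__(self):
        self._name = ""

    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

class EconomicCountry(Country):

    def set_gdp(self, gdp):
        self.gdp = gdp

e = EconomicCountry()

# set_name and get_name are inherited from Country.
e.set_name("Mexico")
print(e.get_name())

# set_gdp is defined only on the subclass (placeholder value, not real data).
e.set_gdp(1.0)
print(e.gdp)

# An EconomicCountry instance is also a Country instance.
print(isinstance(e, Country))
print(issubclass(EconomicCountry, Country))
```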

There is a lot more to classes, but we will not cover it in this course since you are unlikely to need it for data analysis.