# <center> STATS 607 - LECTURE 7
## <center> 09/26/2018

# <center> Serialization

Thanks to Prof. Kerby Shedden for making this material available.

Serialization is the act of taking an essentially arbitrary data object and converting it into a stream of bytes. These bytes can be written to storage or transmitted over a network. A serialized object should be self-describing, meaning that given the byte stream, it should be possible to reconstruct the original object without any additional information.

Serialization is an active area of research and development. Many languages support an optimized serialization format that is tailored to the idiosyncrasies of the language. There are also language-neutral formats that can be used to share data between different languages.

The native Python serialization format is called a pickle. To illustrate, let's create a dictionary:

In [1]:
x = {j : j*j for j in range(1000)}

Now we can pickle the dictionary (serialization) and write it to a file. Pickle is a binary format, so we need to open the file in binary mode.

In [2]:
import pickle

# The with statement closes the file automatically, even if an error occurs
with open("dict.pkl", "wb") as fid:
    pickle.dump(x, fid)

Now we can load the dictionary back into the interpreter (“deserialize” it):

In [3]:
with open("dict.pkl", "rb") as fid:
    y = pickle.load(fid)

You can check that y and x are distinct objects but have identical values. You can use 'id' to show that they are distinct objects, or test identity directly with 'is'.

In [4]:
print(id(x))
print(id(y))
print(x is y)
print(x == y)

4381499464
4381499752
False
True
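
The same round trip can also be done entirely in memory, which makes the byte stream itself visible. The following is a minimal sketch using pickle.dumps and pickle.loads; the variable names (squares, payload, restored) are illustrative and not part of the lecture code.

```python
import pickle

# A small dictionary, analogous to the one used above
squares = {j: j * j for j in range(5)}

# Serialize to a bytes object instead of writing to a file
payload = pickle.dumps(squares)
print(type(payload))

# Deserialize from the bytes object
restored = pickle.loads(payload)

print(restored == squares)  # same values
print(restored is squares)  # but a distinct object
```

Note that unpickling can execute arbitrary code, so pickle data should only be loaded from sources you trust.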


An alternative and very popular serialization format is called “JSON” (it stands for “JavaScript Object Notation”, but the connection to the JavaScript language is mainly historical). JSON is a text format so you can open a JSON file in a text viewer, and nearly any programming language has JSON encoders and decoders.

Here is a basic illustration of using JSON serialization in Python:

In [5]:
import json

with open("dict.json", "w") as fid:
    json.dump(x, fid)

Now we can read the dictionary back in from the file:

In [6]:
with open("dict.json") as fid:
    y = json.load(fid)

You will notice that x and y are not exactly equivalent. JSON only supports string-indexed maps (used to represent Python dict objects). So the integer keys in x have become strings in y.

In [7]:
print(type(list(x.keys())[0]))
print(type(list(x.values())[0]))

<class 'int'>
<class 'int'>


In [8]:
print(type(list(y.keys())[0]))
print(type(list(y.values())[0]))

<class 'str'>
<class 'int'>
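
If the integer keys are needed after a JSON round trip, they have to be converted back by hand. A minimal sketch, assuming all keys were originally integers (the names original, decoded, and recovered are illustrative):

```python
import json

original = {j: j * j for j in range(5)}

# Round trip through a JSON string; the integer keys come back as strings
text = json.dumps(original)
decoded = json.loads(text)
print(decoded == original)

# Convert the keys back to integers
recovered = {int(k): v for k, v in decoded.items()}
print(recovered == original)
```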


# <center> Classes

Python has a class system that can be used to create compound data types and to write object oriented programs. First, we can consider using classes to define a compound data type (like a C struct). The minimal class definition in Python is:

In [9]:
class Myclass(object):
    pass

Now we can use the class as follows:

In [10]:
x = Myclass()

x.country = "Mexico"
x.population = 119530753
x.capital = "Mexico City"

The first line above creates an instance of the class, and the remaining lines create attributes in the class instance. Since Python is a dynamic language, we can add attributes to a class instance after it is created, as is done above. Thus, different instances of the class may have different attributes (but usually you would not want to actually do this).

In [11]:
y = Myclass()
y.country = "US"
y.parameter = 'Something'

In [14]:
print(type(x))
print(type(y))
x.parameter

<class '__main__.Myclass'>
<class '__main__.Myclass'>


AttributeError: 'Myclass' object has no attribute 'parameter'
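
Because instances of the same class may carry different attributes, it can be useful to check for an attribute before using it. The sketch below uses the built-in functions hasattr, getattr (with a default value), and vars; the class is redefined so the example stands alone.

```python
class Myclass(object):
    pass

x = Myclass()
x.country = "Mexico"

# Test whether an attribute exists on this particular instance
print(hasattr(x, "country"))
print(hasattr(x, "parameter"))

# getattr with a default avoids the AttributeError
value = getattr(x, "parameter", None)
print(value)

# vars returns the instance attributes as a dictionary
print(vars(x))
```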

If you want to be a bit more strict you can store attributes in variables marked as private (by convention with a leading underscore; Python does not enforce this) and provide getter and setter methods:

In [15]:
class Country(object):

    def __init__(self):
        self._name = ""
        self._capital = ""
        self._population = None
        
    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

    # similar for population and capital

The getter and setter implemented above are methods. These are basically functions that are called with dot notation on a class instance and take that instance as an implicit first argument. We will discuss the implicit passing of the class instance below, but for now just note that self is a variable name that could in principle be anything, but by convention it is nearly always self. This means that we can call the methods using:

In [16]:
c = Country()
c.set_name("Portugal")

When calling c.set_name(x), inside the method body self is bound to c and name is bound to x.

In [17]:
c.get_name()

'Portugal'
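
One way to see the implicit first argument is to call the method through the class and pass the instance explicitly. This is a small sketch with a cut-down version of the Country class; the second instance d is only there for illustration.

```python
class Country(object):

    def __init__(self):
        self._name = ""

    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

c = Country()
c.set_name("Portugal")        # c is passed implicitly as self

d = Country()
Country.set_name(d, "Spain")  # the same method, with self passed explicitly

print(c.get_name())
print(Country.get_name(d))
```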

The class definition above also contains an __init__ method that is called automatically when a new class instance is created. We could also allow the caller to pass in initial values for the class attributes, by using the following __init__ implementation:

In [18]:
class Country(object):

    def __init__(self, name="", capital="", population=None):
        self._name = name
        self._capital = capital
        self._population = population
        
    def get_name(self):
        return self._name

    def set_name(self, name):
        self._name = name

    # similar for population and capital

In [19]:
c = Country()
c = Country(capital = 'Lisbon', name = 'Portugal')
c.get_name()

'Portugal'

More advanced object oriented programming centers around the idea of inheritance, meaning that you can build a class by extending an existing class. This can get somewhat complex. The most basic type of inheritance is single inheritance, meaning that a class has only one ancestor. Here is a simplistic example of single inheritance, based on the Country class that we defined above.

In [20]:
class EconomicCountry(Country):

    def set_gdp(self, gdp):
        self.gdp = gdp

    def set_taxrate(self, taxrate):
        self.taxrate = taxrate

In [21]:
c = EconomicCountry()
dir(c)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_capital',
 '_name',
 '_population',
 'get_name',
 'set_gdp',
 'set_name',
 'set_taxrate']
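
The listing shows that the subclass has both the inherited methods and its own. The sketch below checks the same relationship more directly with isinstance and issubclass; the classes are shortened versions of the ones above so that the example stands alone.

```python
class Country(object):

    def __init__(self, name="", capital="", population=None):
        self._name = name
        self._capital = capital
        self._population = population

    def get_name(self):
        return self._name

class EconomicCountry(Country):

    def set_gdp(self, gdp):
        self.gdp = gdp

# The subclass reuses the __init__ and get_name defined in Country
c = EconomicCountry(name="Portugal", capital="Lisbon")
print(c.get_name())

# An EconomicCountry is also a Country, but not the other way around
print(isinstance(c, Country))
print(issubclass(EconomicCountry, Country))
print(hasattr(Country(), "set_gdp"))
```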

# <center> Numpy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides several array-like data structures, including a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

The most commonly used array-like data structure is the ndarray (“n-dimensional array”) object. An ndarray is a Python wrapper around a contiguous chunk of memory that allows it to be manipulated like an array.

Conventionally 'Numpy' is abbreviated as np:

In [22]:
import numpy as np

x = np.zeros(5)
print(type(x))
x

<class 'numpy.ndarray'>


array([0., 0., 0., 0., 0.])

Numpy arrays are homogeneous, contiguous, typed arrays. This makes them dramatically faster than core Python lists for many operations, since the Python list stores all values by indirection and is dynamically typed. The main exception to this would be if you need to store heterogeneous data, and/or you need to shrink or grow the array frequently, in which case the Python list type may actually be more efficient.

There are currently 24 Numpy data types, called “dtypes”, described in the NumPy documentation. This includes the usual 12 numeric types (1, 2, 4, and 8 byte signed and unsigned integers, 4 and 8 byte floating point values, and 8 and 16 byte complex number values). In addition there are string, date/time, and Python object dtypes. The default type for many array creation operations is float64, which is an 8 byte floating point value that is mostly interchangeable with a regular Python float value.
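
The storage size of a dtype can be checked through its itemsize attribute (in bytes). The following sketch inspects a few of the numeric dtypes; the list of names is just a sample chosen for illustration.

```python
import numpy as np

names = ["int8", "uint16", "int32", "int64",
         "float32", "float64", "complex64", "complex128"]

# Map each dtype name to its size in bytes
sizes = {name: np.dtype(name).itemsize for name in names}
for name, size in sizes.items():
    print(name, size)

# A float64 scalar is also an instance of the Python float type
print(isinstance(np.float64(1.5), float))
```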


The np.zeros function creates an array of zeros, defaulting to float64 type. The following are all equivalent:

In [23]:
m = 10
x = np.zeros(m)
print(x.dtype)
x = np.zeros(m, np.float64)
print(x.dtype)
x = np.zeros(m, dtype=np.float64)
print(x.dtype)
x = np.zeros(m, dtype=float)
print(x.dtype)
x = np.zeros(m, dtype='d')
print(x.dtype)
x = np.zeros(m, dtype='double')
print(x.dtype)

float64
float64
float64
float64
float64
float64


The following examples create arrays of zeros with other data types:

In [24]:
x = np.zeros(m, np.int32)
print(x.dtype)
x = np.zeros(m, np.uint8)
print(x.dtype)
x = np.zeros(m, np.int64)
print(x.dtype)

int32
uint8
int64


Below are some other ways to create arrays. Each of these functions can take the dtype argument specifying any dtype, but we use the defaults here (float64 for np.ones, and an integer dtype for np.arange when it is given integer arguments):

In [25]:
x = np.ones(m)   # Sets all values to 1
print(x, x.dtype)
x = np.arange(m) # 0, 1, 2, ..., m-1
print(x, x.dtype)

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] float64
[0 1 2 3 4 5 6 7 8 9] int64


Unlike Python lists, Numpy arrays behave like mathematical vectors and matrices with respect to arithmetic operations, e.g. you can do something like this:

In [26]:
x = np.arange(5)
print(x)
y = np.arange(1, 6)
print(y)

x + y  # Pointwise sum
x - y  # Pointwise difference
x / y  # Pointwise quotient
x ** y # Pointwise exponentiation
x % y  # Pointwise remainder
x * y  # Pointwise product

[0 1 2 3 4]
[1 2 3 4 5]


array([ 0,  2,  6, 12, 20])

There is a lot more going on here: each of these operators has to be implemented separately for each dtype, i.e. + for int64 is a different function than + for float32. If you pass in mixed dtypes, e.g. multiplying an int64 array by a float32 array, there will be a type promotion, which in this case means that both arrays are converted to float64 (the common dtype that NumPy's promotion rules assign to this pair) before the multiplication function is called. These hidden type promotions can degrade performance so sometimes it is better to convert your data to a common dtype before an intensive calculation begins. Conversions can be done with the astype method:

In [27]:
x = np.arange(5, dtype=np.float64)
y = x.astype(np.int32)
z = x.astype(np.float32)
x,y,z

(array([0., 1., 2., 3., 4.]),
 array([0, 1, 2, 3, 4], dtype=int32),
 array([0., 1., 2., 3., 4.], dtype=float32))

Another place where type promotion occurs is in performing division. Dividing two integer arrays with / produces a float64 result, whereas dividing two float32 arrays produces a float32 result.
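
The result dtype of a mixed operation can be checked directly, either with np.result_type or by inspecting the dtype of the result. A short sketch (the array names are illustrative):

```python
import numpy as np

a = np.arange(5, dtype=np.int64)
b = np.ones(5, dtype=np.float32)

# dtype that NumPy will use when combining these two arrays
mixed = np.result_type(a, b)
print(mixed)
print((a * b).dtype)

# Division of integer arrays gives floats
i = np.arange(1, 6, dtype=np.int64)
print((i / i).dtype)

# Division of float32 arrays stays in float32
f = np.ones(5, dtype=np.float32)
print((f / f).dtype)
```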

Indexing and slicing Numpy arrays behaves similarly to indexing and slicing Python lists, with one important difference: a list slice is a copy, whereas an array slice will normally return a “view” of the underlying data, meaning that if you change a slice, the same values will change in the parent array. Compare the list and array versions below:

In [28]:
x = list(range(10))
y = x[3:6]
y[0] = 99
print(x,y)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] [99, 4, 5]


In [29]:
x = np.arange(10)
y = x[3:6]
y[0] = 99
print(x,y)

[ 0  1  2 99  4  5  6  7  8  9] [99  4  5]


Note that views can result even for certain types of non-contiguous slices, such as slices with a step. See the following example:

In [30]:
x = np.arange(20)
y = x[::2]
y[4] = 99
print(x,y)

[ 0  1  2  3  4  5  6  7 99  9 10 11 12 13 14 15 16 17 18 19] [ 0  2  4  6 99 10 12 14 16 18]
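
When an independent array is needed, the slice can be copied explicitly. The sketch below uses the copy method and np.shares_memory to tell a view from a copy; the variable names are illustrative.

```python
import numpy as np

x = np.arange(10)
view = x[3:6]            # shares storage with x
copied = x[3:6].copy()   # independent storage

print(np.shares_memory(x, view))
print(np.shares_memory(x, copied))

copied[0] = 99   # does not affect x
view[1] = -1     # changes x[4]

print(x)
```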


Numpy uses a very general approach for indexing from the array to its low-level memory block. As a result, for many operations we do not need to copy the underlying data. For example, we may have a very large array x, and if we create a new variable y = x.T holding the transpose of x, then y and x share the same storage. In addition to the example above, see the following example:

In [31]:
x = np.random.normal(size=(3, 2))
y = x.T

print('This is x:\n', x, '\n')
print('This is y:\n', y, '\n')

print(id(x))
print(id(y))
print(x.flags.owndata)
print(y.flags.owndata)

This is x:
 [[ 2.19800643  1.18288158]
 [-0.6247888  -0.3859218 ]
 [ 0.92072699  1.5511588 ]] 

This is y:
 [[ 2.19800643 -0.6247888   0.92072699]
 [ 1.18288158 -0.3859218   1.5511588 ]] 

4433123328
4433121968
True
False
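
The mechanism behind this is the strides attribute, which gives the number of bytes to step in memory along each axis. A transpose only swaps the strides and leaves the data in place. A small sketch, using an array of zeros so that the values are predictable:

```python
import numpy as np

x = np.zeros((3, 2), dtype=np.float64)
y = x.T

# Each float64 takes 8 bytes; a row of x holds 2 values (16 bytes)
print(x.strides)
print(y.strides)

# Writing through the transpose changes the original array
y[0, 2] = 7.0
print(x[2, 0])
print(np.shares_memory(x, y))
```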


The Python data analysis tools (core Python, Numpy, Pandas, and others) lack a high performance and universal way to represent missing values. The current work-around is to use NaN and None to represent missing values, but this approach has limitations. One issue is that by definition, NaN is the only value that is not equal to itself:

In [32]:
x = float('nan')
print(x == x)

False


This means, for example, that you cannot count the NaN values in an array in the obvious way:

In [33]:
x = np.array([1, np.nan, 2, np.nan])
print((x == np.nan).sum())

0


The proper way to detect NaN values is with the np.isnan function:

In [34]:
print(np.isnan(x).sum())

2
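
Once the NaN values can be detected, they can be skipped in calculations. The sketch below shows two options: filtering with a boolean mask, and the NaN-aware reductions np.nanmean and np.nansum.

```python
import numpy as np

x = np.array([1, np.nan, 2, np.nan])

# An ordinary reduction propagates NaN
plain_mean = np.mean(x)
print(plain_mean)

# NaN-aware reductions ignore the missing values
print(np.nanmean(x))
print(np.nansum(x))

# Alternatively, drop the NaN values with a boolean mask
clean = x[~np.isnan(x)]
print(clean)
```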


Another issue is that NaN exists for float type variables (float32 and float64 in Numpy) but not for other variable types, e.g. integers. Therefore, when you build an array from integers together with a missing value, the result is promoted to float type, and assigning NaN into an existing integer array raises an error.
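
The following sketch illustrates how integer data and NaN interact. The variable outcome is only used to record whether the assignment succeeded.

```python
import numpy as np

# Integers mixed with NaN at creation time give a float array
mixed = np.array([1, 2, np.nan])
print(mixed.dtype)

# An existing integer array cannot hold NaN
ints = np.arange(3)
try:
    ints[0] = np.nan
    outcome = "assigned"
except ValueError:
    outcome = "ValueError"
print(outcome)

# Converting to float first makes the assignment possible
as_float = ints.astype(np.float64)
as_float[0] = np.nan
print(as_float)
```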

Numpy provides two main ways to work with string data. The first approach, which is much more common, lets Python manage the strings as ordinary str objects, and simply places references to them into the ndarray. This produces an array of dtype object, e.g.

In [35]:
x = np.array(["cat", "dog", "fish"], dtype = 'O')
x.dtype

dtype('O')

You can see that this array only contains object references by running the following, which shows that the array element and the variable s have the same id:

In [36]:
s = "fish"
x[0] = s
print(id(s), id(x[0]))
x

4349454072 4349454072


array(['fish', 'dog', 'fish'], dtype=object)

Note that this array can actually hold references to any Python object, not just strings:

In [37]:
x[0] = {i : i for i in range(10)}
x

array([{0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
       'dog', 'fish'], dtype=object)

The other way to store strings in a ndarray is to use a fixed string width, in which case the string data is actually packed into the array directly:

In [38]:
x = np.array(["cat", "dog"])
print(x)
x.dtype

['cat' 'dog']


dtype('<U3')

The dtype “<U3” refers to a Unicode string of at most 3 characters (the “<” indicates little-endian byte order). Note that in this setting, if you attempt to assign a string that does not fit into the allotted storage, the string is truncated:

In [39]:
x[0] = "fish"
print(x)

['fis' 'dog']
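
To avoid truncation, a wider fixed width can be requested when the array is created, or an existing array can be converted with astype. A minimal sketch, where the width of 10 characters is an arbitrary choice for illustration:

```python
import numpy as np

# Reserve room for up to 10 characters per element
x = np.array(["cat", "dog"], dtype="U10")
x[0] = "fish"
print(x)

# Widening an existing fixed-width array
narrow = np.array(["cat", "dog"])
wide = narrow.astype("U10")
wide[1] = "horse"
print(wide)

# Each character takes 4 bytes in a Unicode dtype
print(narrow.dtype.itemsize, wide.dtype.itemsize)
```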
