In this mini-lecture, we explore Python's 'pickle' module, which data scientists widely use in large-scale projects.

In [15]:
import numpy as np
import pandas as pd
import pickle
import bz2
import gzip
import os

A data scientist works with data in many forms, such as dictionaries, 'DataFrame' objects, lists and other data types. We often want to save these objects to a file so that we can use them later or send them to someone. For example, instead of recreating a final dataset that has to go through a lengthy script, we can simply fetch the 'pickled' dataset for later use (as if from a jar!). This is also useful for ML: once you have trained a model, you want to save it so that you can make new predictions later without rewriting everything or retraining the model all over again.

This is the basic idea of 'pickling': it is used for serializing and de-serializing Python object structures, also called **marshalling** or **flattening**. **Serialization** refers to the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network. Later on, this byte stream can be retrieved and deserialized back into a Python object. **Pickling** is not the same as compression. The former converts an object from one representation to another (data in RAM to a byte stream), while the latter encodes data with fewer bits in order to save disk space.
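To make the difference between serialization and compression concrete, here is a minimal sketch. The `record` dictionary is a made-up example, not part of the dataset used later in this notebook.

```python
import bz2
import pickle

# Hypothetical example object, used only for illustration.
record = {"name": "example", "values": list(range(1000))}

# Serialization: Python object -> bytes.
serialized = pickle.dumps(record)

# Compression: bytes -> (usually fewer) bytes. This is a separate step.
compressed = bz2.compress(serialized)

# Reverse both steps to get an equal object back.
restored = pickle.loads(bz2.decompress(compressed))

print(type(serialized), len(serialized), len(compressed))
print(restored == record)
```

The restored object is equal to the original but is a new object in memory.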

Pickling is specific to the Python language, so if you want to use data across different programming languages, pickle is not recommended. Also, unpickling a file that was pickled in a different version of Python may not always work properly. In Python, we can pickle objects of the following data types: booleans, integers, floats, complex numbers, tuples, strings, lists, sets, dictionaries, classes and functions defined at the top level of a module, 'DataFrame' objects etc. However, not everything can be pickled easily. Examples include generators, inner classes, lambda functions and 'defaultdict' objects whose default factory is a lambda. In the case of lambda functions, you need to use an additional package called 'dill'. With 'defaultdict' objects, you need to create them with a module-level function as the default factory.
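The following sketch tries to pickle a few of the objects mentioned above and reports which attempts succeed. The helper `can_pickle` is a hypothetical name introduced here for illustration; the built-in `int` stands in for a module-level default factory.

```python
import pickle
from collections import defaultdict


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


results = {
    "list": can_pickle([1, 2.5, "three"]),
    "generator": can_pickle(x for x in range(3)),
    "lambda": can_pickle(lambda x: x + 1),
    "defaultdict(int)": can_pickle(defaultdict(int)),
    "defaultdict(lambda: 0)": can_pickle(defaultdict(lambda: 0)),
}
for label, ok in results.items():
    print(f"{label}: {'picklable' if ok else 'not picklable'}")
```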

Pickle serves a similar purpose to JSON, which stands for 'JavaScript Object Notation'. JSON is a lightweight data-interchange format that is easily readable by humans. Although it was derived from JavaScript, JSON is standardized and language-independent. This is a serious advantage over pickle. It is also more secure, because loading JSON cannot execute arbitrary code, whereas unpickling data from an untrusted source can. JSON can also be faster for simple data. However, if you only need to use Python, then the pickle module is still a good choice for its ease of use and its ability to reconstruct complete Python objects. In Python 2, an alternative was 'cPickle'. It is nearly identical to pickle, but written in C, which makes it up to 1000 times faster. For small files, however, you won't notice the difference in speed. Both produce the same data streams, which means that pickle and 'cPickle' can use the same files. In Python 3 there is no separate 'cPickle' module: 'pickle' uses its C implementation automatically when it is available.
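As a small illustration of the trade-off, the sketch below sends the same made-up `record` through both formats. JSON gives readable text but turns the tuple into a list, while pickle gives opaque bytes but restores the tuple as a tuple.

```python
import json
import pickle

# Hypothetical record containing a tuple, used only for illustration.
record = {"name": "Ozzy", "scores": (3, 8)}

as_json = json.dumps(record)      # human-readable text (str)
as_pickle = pickle.dumps(record)  # binary data (bytes)

from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)

print(as_json)
print(from_json["scores"], type(from_json["scores"]))
print(from_pickle["scores"], type(from_pickle["scores"]))
```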


In [7]:
os.chdir("C:\\Users\\GAO\\GAO_Jupyter_Notebook\\Datasets")

Let's first pickle a dictionary called dogs_dict and call the pickled file 'dogs'. To pickle this dictionary, you first need to specify the name of the file you will write it to, which is dogs in this case. No file extension is needed.

To open the file for writing, simply use the open() function. The first argument should be the name of your file. The second argument is 'wb'. The 'w' means that you'll be writing to the file, and 'b' refers to binary mode. This means that the data will be written in the form of bytes objects. If you forget the 'b', an error will show up. You may sometimes come across a slightly different notation, like 'w+b'. It also opens the file for reading, but for writing a pickle it works the same way.
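To see why the 'b' matters, the sketch below attempts the same dump in text mode and in binary mode. It writes to a temporary directory, and the file name `demo_pickle` is a made-up name for this illustration only.

```python
import os
import pickle
import tempfile

data = {"Ozzy": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pickle")  # hypothetical file name

    # Text mode ('w'): pickle produces bytes, so the write fails.
    try:
        with open(path, "w") as text_file:
            pickle.dump(data, text_file)
        text_mode_error = None
    except TypeError as exc:
        text_mode_error = exc

    # Binary mode ('wb'): the write succeeds.
    with open(path, "wb") as binary_file:
        pickle.dump(data, binary_file)
    bytes_written = os.path.getsize(path)

print("text mode error:", text_mode_error)
print("bytes written in binary mode:", bytes_written)
```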

Once the file is opened for writing, you can use pickle.dump(), which takes two arguments: the object you want to pickle and the file to which the object has to be saved. In this case, the former will be dogs_dict, while the latter will be outfile.

In [8]:
dogs_dict = { 'Ozzy': 3, 'Filou': 8, 'Luna': 5, 'Skippy': 10, 'Barco': 12, 'Balou': 9, 'Laika': 16 }
filename = 'dogs' # step 1
outfile = open(filename,'wb') # step 2
pickle.dump(dogs_dict,outfile) # step 3
outfile.close()

So to summarize, you can pickle a dataset in the following steps:
   1. name the jar of the pickle
   2. open the jar of the pickle
   3. dump the pickle in the jar
   4. close the lid of the pickle
   
Now let's discuss unpickling files. The process of loading a pickled file back into a Python program is similar. The steps are:

   1. open the jar of the pickle
   2. name the bowl in your Python code that will contain the pickle, and load the pickle from the jar to the bowl
   3. close the jar
   
Below, let's load the pickled file 'dogs':

In [12]:
infile=open(filename, 'rb') # step 1
new_dict=pickle.load(infile) # step 2
infile.close() # step 3
print(new_dict) # Stare at the pickle, smell it, and drool as much as you can (make sure you validate what you see)

{'Ozzy': 3, 'Filou': 8, 'Luna': 5, 'Skippy': 10, 'Barco': 12, 'Balou': 9, 'Laika': 16}
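The open/close steps above can also be written with a `with` block, which closes the file automatically even if an error occurs. This is a sketch of that common idiom, not a change to the cells above. It uses a shortened copy of the dictionary and a temporary directory so that it does not touch the 'dogs' file.

```python
import os
import pickle
import tempfile

dogs = {"Ozzy": 3, "Filou": 8, "Luna": 5}  # shortened copy, for illustration

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dogs")

    # Dump: the file is closed as soon as the block ends.
    with open(path, "wb") as outfile:
        pickle.dump(dogs, outfile)

    # Load: same pattern, opened in 'rb' mode.
    with open(path, "rb") as infile:
        loaded = pickle.load(infile)

print(outfile.closed, infile.closed)
print(loaded == dogs)
```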


If you are unpickling Python 2 objects in Python 3, we recommend passing the additional argument encoding='latin1' to pickle.load() to avoid possible errors.
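The sketch below shows the kind of error this argument avoids. The byte string `py2_pickle` is hand-written to look like what Python 2 would produce for the one-byte string '\xe9'. It is an illustrative assumption, not data from this notebook.

```python
import pickle

# Hand-written protocol-2 pickle of the one-byte Python 2 str '\xe9'
# (an illustrative assumption, not data produced in this notebook).
py2_pickle = b"\x80\x02U\x01\xe9q\x00."

# With the default encoding (ASCII), the non-ASCII byte cannot be decoded.
try:
    pickle.loads(py2_pickle)
    default_error = None
except UnicodeDecodeError as exc:
    default_error = exc

# With encoding='latin1', every byte maps to a character, so loading works.
decoded = pickle.loads(py2_pickle, encoding="latin1")

print("default encoding failed:", default_error is not None)
print(repr(decoded))
```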

Additionally, you can compress the pickled file. This can be done with the 'bz2' or 'gzip' modules from the standard library. They both compress files, but 'bz2' is a bit slower, whereas 'gzip' usually produces larger files (the exact ratio depends on the data).

In [16]:
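smallerfile='dogs.bz2' # separate name, so the uncompressed 'dogs' file is not overwritten
sfile=bz2.BZ2File(smallerfile,'w') # pass the variable, not the string 'smallerfile'
pickle.dump(dogs_dict, sfile)
sfile.close()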
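As a sketch of the full round trip, the example below writes a shortened copy of the dictionary with both compressors, reads it back, and reports the file sizes. The file names are made up and live in a temporary directory. For such a tiny object the sizes say little about how the two formats compare on real datasets.

```python
import bz2
import gzip
import os
import pickle
import tempfile

dogs = {"Ozzy": 3, "Filou": 8, "Luna": 5}  # shortened copy, for illustration

with tempfile.TemporaryDirectory() as tmp_dir:
    bz2_path = os.path.join(tmp_dir, "dogs.bz2")  # hypothetical file names
    gz_path = os.path.join(tmp_dir, "dogs.gz")

    # Write the pickle through each compressor.
    with bz2.open(bz2_path, "wb") as f:
        pickle.dump(dogs, f)
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(dogs, f)

    # Read it back: decompression and unpickling happen together.
    with bz2.open(bz2_path, "rb") as f:
        from_bz2 = pickle.load(f)
    with gzip.open(gz_path, "rb") as f:
        from_gzip = pickle.load(f)

    sizes = {"bz2": os.path.getsize(bz2_path), "gzip": os.path.getsize(gz_path)}

print(from_bz2 == dogs, from_gzip == dogs)
print(sizes)
```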

We can also pickle a function (lambda expressions excluded). Note that a function is pickled by reference: the pickle stores the names of the module and the function, not the function's code, so the function must be importable when the file is unpickled. A method very similar to pickle.dump() is pickle.dumps(), which returns the pickled representation of the object as a bytes object instead of writing it to a file. Notice that pickle.dump() is different from pickle.dumps(). Similarly, pickle.loads() reads a pickled object hierarchy from a bytes object. Bytes past the pickled object's representation are ignored. More details of pickling protocols and methods can be found here: https://docs.python.org/3/library/pickle.html

In [38]:
def linear(x):
    return 3*x+4
linear(0.5)

5.5

In [42]:
show_pickle = pickle.dumps(linear) # this does not dump the pickle into the jar
show_pickle

b'\x80\x03c__main__\nlinear\nq\x00.'
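The output above contains only the module name and the function name. The sketch below illustrates this, along with the behavior of pickle.loads() on trailing bytes. It uses `math.sqrt` as a stand-in for a user-defined function such as linear, so that it runs on its own.

```python
import math
import pickle

payload = pickle.dumps([1, 2, 3])
print(type(payload))

# Bytes after the end of the pickled object are ignored by loads().
with_trailing = pickle.loads(payload + b"trailing bytes")

# A function is pickled by reference: the stream holds the module and
# function names, not the function's code.
sqrt_payload = pickle.dumps(math.sqrt)
restored_sqrt = pickle.loads(sqrt_payload)

print(with_trailing)
print(b"math" in sqrt_payload, b"sqrt" in sqrt_payload)
print(restored_sqrt is math.sqrt)
```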

Let's actually pickle the function now:

In [51]:
filename='pickled_function'
outfile=open(filename, 'wb')
pickle.dump(linear, outfile)
outfile.close()

infile=open(filename, 'rb') 
new_function=pickle.load(infile) 
infile.close()
print(new_function) 
new_function(-0.5)

<function linear at 0x00000245B399B378>


2.5

For lambda expressions, use the third-party 'dill' package, which offers the same dump() and load() functions as 'pickle':

In [60]:
import dill

func=lambda x: x-2

filename='pickled_lambda_exp'
outfile=open(filename, 'wb')
dill.dump(func, outfile)
outfile.close()

infile=open(filename, 'rb') 
new_lambda_expr=dill.load(infile) 
infile.close()

new_lambda_expr(2)

0

References:
   - https://www.datacamp.com/community/tutorials/pickle-python-tutorial
   - https://medium.com/@emlynoregan/serialising-all-the-functions-in-python-cd880a63b591
   - https://docs.python.org/2/library/pickle.html
   - https://docs.python.org/3/library/pickle.html