# Cython and the Buffer Protocol for loading binary data

There are times when you need to load data into NumPy / Pandas, but the data is in a binary format or in some irregularly structured ASCII format, and you __don't__ already have a good reader for it.  Today, I'm going to show you some simple code you can use to do the loading and get back to analysis!

### Binary Formats

We're going to focus on loading binary data that is stored in a record format.  The specific format is just for demonstration, though: all the code we'll develop here can be easily modified to fit whatever binary format your data is in.  You could even use the same methods to efficiently load irregularly structured text data.

We'll consider binary data where each record is laid out something like this:

![alt text](images/SampleBinaryLayout.png "Sample Binary Layout")

### Simplest Case:  All records have an identical, fixed length format

Let's say our binary data has the layout above and the file contains only a single record type, which consists of:

* 4 byte header ( as defined above )
* 9 bytes total in the message body
    * First 4 bytes encode an unsigned int ( [little endian](https://en.wikipedia.org/wiki/Endianness) byte ordering )
    * Next 5 bytes encode a character array
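To make the layout concrete, here is a minimal sketch that builds one such record with Python's standard `struct` module.  The format strings and the helper name `pack_record` are illustrative assumptions: the header is taken to be two little-endian 2-byte unsigned integers (body length, then message type).

```python
import struct

# '<' means little endian with no padding between fields.
# H = 2-byte unsigned int, I = 4-byte unsigned int, 5s = 5-byte character array.
HEADER_FORMAT = '<HH'    # body_length, msg_type
BODY_FORMAT = '<I5s'     # number, name

def pack_record(number, name, msg_type=1):
    """Hypothetical helper: build one 13-byte record (4-byte header + 9-byte body)."""
    body = struct.pack(BODY_FORMAT, number, name)   # short names are padded with null bytes
    header = struct.pack(HEADER_FORMAT, len(body), msg_type)
    return header + body

record = pack_record(1, b'one')
print(len(record))                       # total record length in bytes
print(struct.unpack('<HHI5s', record))   # header fields followed by body fields
```

Note that `struct` keeps the null padding in the character array, so the name comes back as `b'one\x00\x00'` here.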

Ultimately, we want to get our data into a Pandas DataFrame, and to do this we're going to first load our data into a NumPy array.  Once we have a NumPy array, it's just a one-liner to create the Pandas DataFrame.

The only tricky part here is that NumPy arrays are homogeneous, meaning that all the elements in the array have to be of the same type.  Fortunately, NumPy lets us define structured types where the dtype contains several named fields, each with its own type.  So what we'll do next is construct a NumPy dtype which has the same structure as our binary records.  If you want to read the docs you can do that [here](https://NumPy.org/devdocs/reference/arrays.dtypes.html), but specifying the NumPy dtype is really pretty simple:

In [1]:
import numpy as np
import pandas as pd

# define a np.dtype that matches our binary record layout
dt = np.dtype([
    ('body_length', '<u2'),   # two byte unsigned integer (little endian)
    ('msg_type', '<u2'),      
    ('number', '<u4'),        # four byte unsigned integer (little endian)
    ('name', 'S5')            # 5 byte character array
])

In the companion notebook WorkingWithBytes, I already set up a binary file with the format above.  So now, with the np.dtype defined, we can go from binary data to a Pandas DataFrame in just a few lines:

In [2]:
with open('data/simple_binary.bin', 'rb') as f:
    b = f.read()                 # read in the binary file as bytes
    
np_data = np.frombuffer(b, dt)   # creates a NumPy array
df = pd.DataFrame(np_data)
df

|   | body_length | msg_type | number | name |
|---|---|---|---|---|
| 0 | 9 | 1 | 1 | b'one' |
| 1 | 9 | 1 | 2 | b'two' |
| 2 | 9 | 1 | 3 | b'three' |


Couldn't be easier, right?  One little thing to take care of, however, is that the 'name' column in our data is holding objects of type 'bytes'.  We'd probably rather have strings, so let's use the Series.str.decode() method to convert from bytes to Python strings:

In [3]:
df['name'] = df['name'].str.decode('utf-8')
df

|   | body_length | msg_type | number | name |
|---|---|---|---|---|
| 0 | 9 | 1 | 1 | one |
| 1 | 9 | 1 | 2 | two |
| 2 | 9 | 1 | 3 | three |


### What about binary data with multiple record types?

Loading the binary data above was super easy but unfortunately, binary data is usually not structured so nicely.  Typically there are many different record types all mixed together in a single file and we need a way to load these into one or more dataframes.  

The challenge here is that NumPy only knows how to load binary data that is stored in a 'simple' format where the data exists in a contiguous block of memory consisting of identical records stacked back to back.  In the example above, our data had only a single fixed-length record type, and that made it very easy to load.

In general, however, to load binary data into NumPy we'll need to split it into one or more homogeneous arrays, as shown below:

![alt text](images/CollateMultipleRecords.png "Collate multiple records")

One way to do the split above is to write some preprocessing code (pick any language you want!) to split the binary data into one or more files.  If you go that route then you can simply do your preprocessing and then load the individual files like we did above.  The downside to this approach is that the preprocessing will create multiple copies of your data on disk which isn't very elegant and could potentially be a hassle.   

So instead of writing out separate files, we'll show how to set up memory arrays in Cython, one for each record type that we're interested in, and efficiently fill them with our binary records.  We'll then expose these arrays to NumPy by using the buffer protocol from the Python C-API.  We could do all of this in pure Python, but we'll use Cython because we want our solution to be fast (binary files are sometimes quite large).  There's quite a bit here, but it turns out you can do a lot with just a little bit of code, so let's get started!

### The Python C-API and the Buffer Protocol

The Python C-API is the doorway into a lower level implementation of Python.  It allows programmers to extend Python with code written in C/C++ and also lets you embed Python into other programming languages.  We won't need to know much about the C-API though.  All we need is a high level understanding of the buffer protocol.

The buffer protocol operates at the C-API level and defines a way for Python objects to access and share each other's memory.  When we call np.frombuffer on an object that implements the buffer protocol, NumPy goes down into the C-API and asks the object for a view of its internal memory.  If successful, NumPy goes on to set up an array using the shared data.  Note that there is no copying going on here!  After the call to np.frombuffer, both the original buffer object and the NumPy array share the same underlying memory.  A simplified version of the process looks something like this:

![alt text](images/BufferProtocolUsage.png "Buffer Protocol Diagram")
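The same sharing can be seen without NumPy.  As a small standard-library illustration (not part of the loader we're building), the sketch below uses `memoryview` as the consumer and `bytearray` as the object that exposes its memory:

```python
raw = bytearray(b'\x01\x02\x03\x04')   # bytearray implements the buffer protocol
view = memoryview(raw)                 # request a view of its memory; no bytes are copied

raw[0] = 0xFF                          # a change made through the owner...
print(view[0])                         # ...is visible through the view

view[1] = 0xEE                         # and a change made through the view...
print(raw[1])                          # ...is visible through the owner

print(view.obj is raw)                 # the view keeps a reference to the exporting object
```

Because both names refer to the same block of memory, a write through either one is immediately visible through the other.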

Rather than use the C-API directly, however, we're going to interact with it via Cython because that's a lot easier than writing code directly in C/C++.  It's pretty simple to implement the buffer protocol from Cython, but first let's do a quick Hello World in Cython to make sure everything is set up correctly.

### Cython:  Hello World

Cython is a superset of Python that lets you mix Python code with C/C++ types and function calls.  Code compiled from Cython often runs much faster than pure Python and gives you the ability to use functions and classes from C/C++ libraries.  The process happens in two stages:

1.  Write some Cython code and compile it to C/C++ with Cython
2.  Compile the C/C++ code to create a Python module that you can import

In IPython, these steps can be combined and simplified by using some IPython magic.  Let's try it out with a simple 'Hello World':

In [4]:
# Cython is included in many common Python distributions, but if not you'll need to
# do a '!pip install Cython'.  As always, it's best to use a dedicated virtual environment,
# but here we'll assume that Cython is already installed.

In [5]:
# first load the magic Cython extension
%load_ext Cython

In [6]:
%%cython --cplus

# The magic %%cython command above has to be the first line of the cell.  Evaluating this
# cell will compile the Cython code to C++, then compile the C++, and finally import the
# HelloWorldBuffer class into our IPython session
cdef class HelloWorldBuffer:
    def __cinit__(self, b):
        print("I was initialized with '{}'".format(b))
        
    def say_something(self):
        print('hello world')

In [7]:
# Let's try it
h = HelloWorldBuffer('hi')    # should print "I was initialized with 'hi'"
h.say_something()             # should print "hello world"

I was initialized with 'hi'
hello world


Hopefully the above cell worked for you.  Now let's create a more useful class that implements the buffer protocol!

### Cython:  implement the buffer protocol

Our first goal here is to set up a Python object that implements the buffer protocol.  Once we've done that, we'll go back and write a little bit of extra code to create multiple buffer objects and fill them with records from the binary file.

Implementing the buffer protocol from Cython just requires us to implement two methods \_\_getbuffer__ and \_\_releasebuffer__.  Behind the scenes, Cython has some special handling of these so that they get correctly tied to our object in the C-API but we don't need to worry about that;  all we need to do is implement the two methods and they're both pretty simple for us.  Here's what they do:

**\_\_getbuffer__(self, Py_buffer *, int)**  This method will be called by any consumer object that wants a view of our memory.  It has two arguments: a pointer to a simple C struct Py_buffer, and an integer of bit flags.  The flags indicate details about the data format that the consumer is expecting.  In our case, we'll support just the simplest type, which is one dimensional data stored in a contiguous block of memory.  So all we have to do in \_\_getbuffer__ is check that the flags indicate a simple buffer, and then fill in a few fields in the Py_buffer struct.  In our case these fields are all self-explanatory ( see below ).

**\_\_releasebuffer__(self, Py_buffer *)**  The purpose of \_\_releasebuffer__ is to allow reference counting so that our code knows when it can release and/or reallocate memory in the Py_buffer structure.  NumPy, however, doesn't respect this and expects that buffers maintain their data even after calls to \_\_releasebuffer__.  Because of this we don't need to do anything with the \_\_releasebuffer__ method.
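For comparison, Python's built-in `bytearray` does follow the reference counting convention: it refuses to resize itself while any view on its memory is open.  The sketch below is a standard-library illustration of that convention only; it is not the Cython class we're about to write.

```python
raw = bytearray(b'abcd')
view = memoryview(raw)        # triggers the exporter's getbuffer; one view is now open

try:
    raw.extend(b'efgh')       # resizing could move the memory the view points at
    resized_while_viewed = True
except BufferError:
    resized_while_viewed = False

view.release()                # triggers the exporter's releasebuffer; no views remain
raw.extend(b'efgh')           # resizing is allowed again

print(resized_while_viewed, bytes(raw))
```

The first `extend` is rejected with a `BufferError`, and the second succeeds once the view has been released.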

In [8]:
%%cython --cplus
# or optionally %%cython --cplus --annotate, which will show information about how
# the code is translated to C/C++

from cpython cimport Py_buffer
from cpython.buffer cimport PyBUF_SIMPLE, PyBUF_WRITEABLE
from libcpp.vector cimport vector
from libc.stdint cimport uint8_t
from libc.string cimport memcpy


cdef class SimplestBuffer:
    cdef:
        vector[uint8_t] buf   # We're using vector from C++ to manage memory allocation
     
    # in Cython, methods defined with 'def' are slower but accessible from Python
    # extend will add bytes to our internal memory
    def extend(self, input_bytes):
        self.add_bytes(input_bytes, len(input_bytes))   
    
    # methods defined with 'cdef' may be faster but are accessible only from Cython
    cdef add_bytes(self, char *b, int num_bytes):  
        cdef int curr_size = self.buf.size()
        self.buf.resize(curr_size + num_bytes)                     # resize vector if necessary
        memcpy(&(self.buf[curr_size]), <uint8_t *>b, num_bytes)    # copy bytes into self.buf
    
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        # if the requested buffer type is not PyBUF_SIMPLE then error out
        # we will allow either readonly or writeable buffers however
        if flags != PyBUF_SIMPLE and flags != PyBUF_SIMPLE | PyBUF_WRITEABLE:
            raise BufferError
            
        buffer.buf = <char *>&(self.buf[0])  # points to our buffer memory
        buffer.format = NULL                 # NULL format means bytes 
        buffer.internal = NULL               # this is for our own use if needed
        buffer.itemsize = 1                     
        buffer.len = self.buf.size()
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = not (flags & PyBUF_WRITEABLE)
        buffer.shape = NULL                  # none of shapes, strides or suboffsets are used for PyBUF_SIMPLE
        buffer.strides = NULL
        buffer.suboffsets = NULL    

    # the buffer protocol requires this method
    def __releasebuffer__(self, Py_buffer *buffer):
        pass       

Let's try it out!  All we need to do is create one of these SimplestBuffer objects, fill it with some byte data and then use np.frombuffer() just like we did earlier:

In [9]:
with open('data/simple_binary.bin', 'rb') as f:
    b = f.read() 
    
sb = SimplestBuffer() 
sb.extend(b)  # we implemented this method above.  it fills the buffer with the bytes b   
df = pd.DataFrame(np.frombuffer(sb, dt))
df

|   | body_length | msg_type | number | name |
|---|---|---|---|---|
| 0 | 9 | 1 | 1 | b'one' |
| 1 | 9 | 1 | 2 | b'two' |
| 2 | 9 | 1 | 3 | b'three' |


So if you made it this far, congratulations!  The hard part is done.  Now all we need to do is write a little more code to take binary data with mixed record types and fan the data out to multiple buffers.  Note that while SimplestBuffer is a fairly generic, reusable class, the function fan_bytes in this next bit of code should be specialized to the exact format of **your** binary data:

In [10]:
%%cython --cplus
# or optionally %%cython --cplus --annotate, which will show information about how
# the code is translated to C/C++

from cpython cimport Py_buffer
from cpython.buffer cimport PyBUF_SIMPLE, PyBUF_WRITEABLE
from libcpp.vector cimport vector
from libc.stdint cimport uint8_t, uint16_t
from libc.string cimport memcpy, strlen
from cython.operator cimport dereference as deref

# this is the same as the class above but we need to repeat it here
cdef class SimplestBuffer:
    cdef:
        vector[uint8_t] buf   # We're using vector from C++ to manage memory allocation
     
    # in Cython, methods defined with 'def' are slower but accessible from Python
    # extend will add bytes to our internal memory
    def extend(self, input_bytes):
        self.add_bytes(input_bytes, len(input_bytes))   
    
    # methods defined with 'cdef' may be faster but are accessible only from Cython
    cdef add_bytes(self, char *b, int num_bytes):  
        cdef int curr_size = self.buf.size()
        self.buf.resize(curr_size + num_bytes)                     # resize vector if necessary
        memcpy(&(self.buf[curr_size]), <uint8_t *>b, num_bytes)    # copy bytes into self.buf
    
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        # if the requested buffer type is not PyBUF_SIMPLE then error out
        # we will allow either readonly or writeable buffers however
        if flags != PyBUF_SIMPLE and flags != PyBUF_SIMPLE | PyBUF_WRITEABLE:
            raise BufferError
            
        buffer.buf = <char *>&(self.buf[0])  # points to our buffer memory
        buffer.format = NULL                 # NULL format means bytes 
        buffer.internal = NULL               # this is for our own use if needed
        buffer.itemsize = 1                     
        buffer.len = self.buf.size()
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = not (flags & PyBUF_WRITEABLE)
        buffer.shape = NULL                  # none of shapes, strides or offsets are used for PyBUF_SIMPLE
        buffer.strides = NULL
        buffer.suboffsets = NULL    

    # the buffer protocol requires this method
    def __releasebuffer__(self, Py_buffer *buffer):
        pass   

# this function walks through the input_bytes and copies each record into one 
# or the other of the buffers.  ( depending on the value of msg_type )
def fan_bytes(bytes input_bytes, SimplestBuffer buf1, SimplestBuffer buf2):
    cdef int num_bytes = len(input_bytes)
    cdef char *b = <char *>input_bytes                # you can cast bytes objects to char *
    cdef int cursor = 0
    cdef uint16_t msg_type
    cdef uint16_t body_len
    
    # here we step through the character array by doing some C/C++ pointer arithmetic...
    while cursor < num_bytes:
        body_len = deref(<uint16_t*>(b + cursor)) 
        msg_type = deref(<uint16_t*>(b + cursor + 2))
        
        # copy bytes into either buf1 or buf2 depending on msg_type.  
        if msg_type == 1:
            buf1.add_bytes(b + cursor, body_len + 4)  # body_len + 4 is our record length incl. the 4 byte header
        elif msg_type == 2:
            buf2.add_bytes(b + cursor, body_len + 4)
            
        cursor += body_len + 4                        # advance cursor forward to beginning of next record 
    

The fan_bytes function above is specialized to handle binary data with the example structure that we illustrated earlier.  That is, all records have a 4 byte header where the first two bytes tell the length ( in bytes ) of the message body and the second two bytes encode a 'message type' which labels how to decode the message body into a set of data fields.  To decode your binary data, you'll need to look up the reference documentation for your binary format and modify fan_bytes as appropriate.  Note that in this example, the fan_bytes function doesn't need to know anything about the structure of the message body; all it does is copy bytes from the binary record onto one of our buffer objects.
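If it helps to see the walk without the pointer arithmetic, here is a pure-Python sketch of the same idea using the standard `struct` module.  It is much slower than the Cython version and is meant only for illustration or for checking results on small files; the name `fan_bytes_py` and the dictionary of buffers keyed by message type are assumptions made for this example.

```python
import struct

def fan_bytes_py(input_bytes, buffers):
    """Pure-Python sketch: copy each record into buffers[msg_type], skipping unknown types."""
    cursor = 0
    while cursor + 4 <= len(input_bytes):
        # read the 4-byte header: body length, then message type (little endian)
        body_len, msg_type = struct.unpack_from('<HH', input_bytes, cursor)
        record_len = body_len + 4                 # record length including the header
        if msg_type in buffers:
            buffers[msg_type] += input_bytes[cursor:cursor + record_len]
        cursor += record_len                      # advance to the start of the next record
    return buffers

# two type-1 records (9-byte body) interleaved with one type-2 record (16-byte body)
rec1 = struct.pack('<HHI5s', 9, 1, 1, b'one')
rec2 = struct.pack('<HH4i', 16, 2, 1, 2, 3, 4)
rec3 = struct.pack('<HHI5s', 9, 1, 2, b'two')

out = fan_bytes_py(rec1 + rec2 + rec3, {1: bytearray(), 2: bytearray()})
print(len(out[1]), len(out[2]))
```

As in the Cython function, only the header is interpreted; the body bytes are copied through untouched.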

So let's see it in action!  Here we'll use SimplestBuffer and fan_bytes together to decode a binary file that contains a mix of two different record types.  The data was set up in the notebook WorkingWithBytes and is similar to our previous binary data, except that now we have mixed in a second record type which has the same header but whose message body consists of 4 consecutive 32 bit integers.

In [11]:
with open('data/simple_binary_mixed.bin', 'rb') as f:
    b = f.read()

# create two buffers and use the helper function above to fill them
sb1 = SimplestBuffer()
sb2 = SimplestBuffer()
fan_bytes(b, sb1, sb2)

# dt1 is the same dtype we used before.  dt2 specifies a 20 byte record:  same header followed by four 32 bit integers
dt1 = np.dtype([('body_length', '<u2'), ('msg_type', '<u2'), ('number', '<u4'), ('name', 'S5')])
dt2 = np.dtype([('body_length', '<u2'), ('msg_type', '<u2'), ('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])

# fan_bytes already loaded up the SimplestBuffer objects so now we just convert these to dataframes
df1 = pd.DataFrame(np.frombuffer(sb1, dt1))
df2 = pd.DataFrame(np.frombuffer(sb2, dt2))

# and display them!
from IPython.display import display
display(df1)
display(df2)

|   | body_length | msg_type | number | name |
|---|---|---|---|---|
| 0 | 9 | 1 | 1 | b'one' |
| 1 | 9 | 1 | 2 | b'two' |
| 2 | 9 | 1 | 3 | b'three' |


|   | body_length | msg_type | a | b | c | d |
|---|---|---|---|---|---|---|
| 0 | 16 | 2 | 1 | 2 | 3 | 4 |
| 1 | 16 | 2 | 2 | 4 | 6 | 8 |
| 2 | 16 | 2 | 3 | 6 | 9 | 12 |


So that's progress!  At this point we've successfully loaded a binary file containing mixed record types into two dataframes, one for each record type.  Before wrapping up, however, there are a few improvements that are worth making.

First we need to improve the memory safety of SimplestBuffer so that the underlying memory can't get reallocated while NumPy or Pandas is sharing the memory.  In the buffer protocol, this is supposed to be done by doing reference counting based on calls to getbuffer and releasebuffer and we implement this below.  See comments in the code for more details.  

Second, we're adding two optional abilities: preallocating memory on the buffer and reading bytes directly from a file.  Note, however, that vector already does a decent job of reallocating memory efficiently ( i.e., it reserves successively larger blocks of memory rather than reallocating every time you extend the vector by a few bytes ).  And with regard to reading from files, it's often faster to read all the binary data into an intermediate buffer before processing than to make many small reads on the file system.  Nevertheless, both of these can lead to speedups, so we include them in our SimpleBuffer implementation here.
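To show what reading record by record from a file looks like, here is a pure-Python sketch of the header-then-body loop.  It is an illustration only: the name `fan_file_py` and the dictionary of buffers are assumptions for this example, and `io.BytesIO` stands in for a file opened with `open(path, 'rb')`.

```python
import io
import struct

def fan_file_py(fp, buffers):
    """Pure-Python sketch: read header then body from an open binary file, record by record."""
    while True:
        header = fp.read(4)
        if len(header) < 4:                  # end of file (or a truncated header)
            break
        body_len, msg_type = struct.unpack('<HH', header)
        if msg_type in buffers:
            buffers[msg_type] += header + fp.read(body_len)
        else:
            fp.seek(body_len, io.SEEK_CUR)   # skip record types we don't care about
    return buffers

data = (struct.pack('<HHI5s', 9, 1, 1, b'one')
        + struct.pack('<HHd', 8, 7, 2.5)     # a made-up record type that we ignore
        + struct.pack('<HH4i', 16, 2, 1, 2, 3, 4))
out = fan_file_py(io.BytesIO(data), {1: bytearray(), 2: bytearray()})
print(len(out[1]), len(out[2]))
```

Each record costs two small reads here, which is exactly the pattern that reading the whole file into memory first avoids.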

And finally, it's often useful to create loadable modules from Cython rather than putting all of your Cython into Jupyter notebooks.  So in the cells below, rather than use the %%cython magic as we did above, we're going to write the Cython code to a file and use setuptools to create a loadable module:

In [12]:
%%writefile binbuffer.pyx

# the next line is not just a comment, it tells Cython to build C++ code
# distutils: language = c++

from cpython cimport Py_buffer
from cpython.buffer cimport PyBUF_SIMPLE, PyBUF_WRITEABLE
from libcpp.vector cimport vector
from libcpp cimport bool
from libc.stdio cimport FILE, fread, fopen, fclose, fseek, SEEK_CUR
from libc.stdint cimport uint8_t, uint16_t
from libc.string cimport memcpy
from cython.operator cimport dereference

"""
Here is our generic SimplestBuffer reimplemented with:

1) access restrictions ( no reallocating memory if the buffer has ever been accessed )
2) ability to preallocate memory if desired ( for some small speed gains )
3) method to read bytes directly from a file

NOTE:  If you examine SimpleBuffer below, you'll see that the view_count and buffer_accessed are somewhat redundant
and you could actually remove view_count entirely without changing any functionality.  We've included
view_count for illustration purposes however, because it shows how reference counting is supposed to work
with the buffer protocol:  ie, if there are any open views on the buffer, then the buffer memory should not be changed
because other objects are referencing it.  Unfortunately, NumPy currently doesn't respect this protocol
and expects buffers to exist and be unchanged even after calls to releasebuffer.  For this reason, we added the bool
buffer_accessed to prevent any reallocation of buffer memory once a view on the memory has been requested
via getbuffer.

"""
cdef class SimpleBuffer:
    cdef: 
        vector[uint8_t] buf   # vector is useful here.  We get a contiguous block of memory but don't have to manage memory ourselves.
        unsigned int cursor            # keep track of where to put next elements in buf
        int view_count        # reference counting for open views
        bool buffer_accessed  # we need this because NumPy expects buffers to exist even after releasebuffer

    def __cinit__(self):
        self.view_count = 0   
        self.cursor = 0
        self.buffer_accessed = False  
        
    def extend(self, b):
        self.add_bytes(b, len(b))
    
    # we split out this method so that we can preallocate if we want.  
    cdef maybe_allocate(self, unsigned int n):
        if self.buffer_accessed or self.view_count > 0:
            raise RuntimeError('Buffer has been locked to changes in size')
        if self.buf.size() < self.cursor + n:
            self.buf.resize(self.cursor + n)
        
    cdef add_bytes(self, char *b, unsigned int n):
        self.maybe_allocate(n)  
        memcpy(&(self.buf[self.cursor]), b, n)
        self.cursor += n
        
    cdef add_bytes_from_file(self, FILE *fp, unsigned int n):
        self.maybe_allocate(n)
        fread(&(self.buf[self.cursor]), 1, n, fp)
        self.cursor += n
            
    def __getbuffer__(self, Py_buffer *buffer, int flags):
        if flags != PyBUF_SIMPLE and flags != PyBUF_SIMPLE | PyBUF_WRITEABLE:
            raise BufferError
            
        buffer.buf = <char *>&(self.buf[0])
        buffer.format = NULL                    # NULL format means bytes 
        buffer.internal = NULL                  # see References
        buffer.itemsize = 1
        buffer.len = self.buf.size()
        buffer.ndim = 1
        buffer.obj = self
        buffer.readonly = not (flags & PyBUF_WRITEABLE)
        buffer.shape = NULL
        buffer.strides = NULL
        buffer.suboffsets = NULL    
        
        self.view_count += 1
        self.buffer_accessed = True

    def __releasebuffer__(self, Py_buffer *buffer):
        self.view_count -= 1  
        

def fan_bytes(bytes input_bytes, SimpleBuffer buf1, SimpleBuffer buf2):
    cdef int num_bytes = len(input_bytes)
    cdef char *b = <char *>input_bytes  # you can cast bytes objects to char *
    cdef int cursor = 0
    cdef uint16_t msg_type
    cdef uint16_t body_len
    
    # here we step through the character array by doing some C/C++ pointer arithmetic
    while cursor < num_bytes:
        body_len = dereference(<uint16_t*>(b + cursor)) 
        msg_type = dereference(<uint16_t*>(b + cursor + 2))
        
        if msg_type == 1:
            buf1.add_bytes(b + cursor, body_len + 4)  # body_len + 4 is our total record length including the 4 byte header
        elif msg_type == 2:
            buf2.add_bytes(b + cursor, body_len + 4)
            
        cursor += body_len + 4


def fan_binary_file(bytes filename, SimpleBuffer buf1, SimpleBuffer buf2):
    cdef uint16_t header[2]
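    cdef FILE *fp = fopen(filename, "rb")   # "rb": open in binary mode
    if fp == NULL:                          # fopen returns NULL if the file can't be opened
        raise OSError("could not open file: %r" % filename)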
    while fread(header, 1, 4, fp) == 4:
        if header[1] == 1:
            buf1.add_bytes(<char *>header, 4)
            buf1.add_bytes_from_file(fp, header[0])
        elif header[1] == 2:
            buf2.add_bytes(<char *>header, 4)
            buf2.add_bytes_from_file(fp, header[0])
        else:
            fseek(fp, header[0], SEEK_CUR)
    
    fclose(fp)


Overwriting binbuffer.pyx


The above cell wrote the Cython code to the file 'binbuffer.pyx'.  This next cell writes a file 'setup.py' that we'll need for compiling and installing the Cython module using setuptools:

In [13]:
%%writefile setup.py

from setuptools import setup
from Cython.Build import cythonize

setup(
    name='binbuffer',
    ext_modules=cythonize('binbuffer.pyx', language_level=3)
)

Overwriting setup.py


In [14]:
# and now compile and install the binary in the local directory
!python setup.py build_ext --inplace

Compiling binbuffer.pyx because it changed.
[1/1] Cythonizing binbuffer.pyx
running build_ext
building 'binbuffer' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/opt/anaconda3/include -arch x86_64 -I/opt/anaconda3/include -arch x86_64 -I/opt/anaconda3/include/python3.7m -c binbuffer.cpp -o build/temp.macosx-10.9-x86_64-3.7/binbuffer.o
g++ -bundle -undefined dynamic_lookup -L/opt/anaconda3/lib -arch x86_64 -L/opt/anaconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.9-x86_64-3.7/binbuffer.o -o build/lib.macosx-10.9-x86_64-3.7/binbuffer.cpython-37m-darwin.so
copying build/lib.macosx-10.9-x86_64-3.7/binbuffer.cpython-37m-darwin.so -> 


In [15]:
# now we can import the binbuffer module that we built above
from binbuffer import SimpleBuffer, fan_bytes, fan_binary_file

In [16]:
import numpy as np
import pandas as pd
with open('data/simple_binary_mixed.bin', 'rb') as f:
    b = f.read()

# create two buffers and use the helper function above to fill them
sb1 = SimpleBuffer()
sb2 = SimpleBuffer()
fan_bytes(b, sb1, sb2)

# alternately, instead of using fan_bytes, here's an example that loads the buffers directly from the binary file
#fan_binary_file(b'data/simple_binary_mixed.bin', sb1, sb2)

# dt1 is the same dtype we used before.  dt2 specifies a 20 byte record:  same header followed by four 32 bit integers
dt1 = np.dtype([('body_length', '<u2'), ('msg_type', '<u2'), ('number', '<u4'), ('name', 'S5')])
dt2 = np.dtype([('body_length', '<u2'), ('msg_type', '<u2'), ('a', '<i4'), ('b', '<i4'), ('c', '<i4'), ('d', '<i4')])

# fan_bytes ( or fan_binary_file ) already loaded up the SimpleBuffer objects so now we just convert these to dataframes
df1 = pd.DataFrame(np.frombuffer(sb1, dt1))
df2 = pd.DataFrame(np.frombuffer(sb2, dt2))

# and display them!
from IPython.display import display
display(df1)
display(df2)

|   | body_length | msg_type | number | name |
|---|---|---|---|---|
| 0 | 9 | 1 | 1 | b'one' |
| 1 | 9 | 1 | 2 | b'two' |
| 2 | 9 | 1 | 3 | b'three' |


|   | body_length | msg_type | a | b | c | d |
|---|---|---|---|---|---|---|
| 0 | 16 | 2 | 1 | 2 | 3 | 4 |
| 1 | 16 | 2 | 2 | 4 | 6 | 8 |
| 2 | 16 | 2 | 3 | 6 | 9 | 12 |


### Final Remarks

I hope you've found this notebook useful and that it helps you to load your binary data and get back to analysis!!!  Before leaving, I want to add just a few more remarks.

**Evaluation Speed**:  We didn't do any benchmarking here, but in my tests I've generally found that loading binary data using the above methods is about as fast as loading equivalent dataframes from pickled binaries (sometimes it's even faster!).  One area that is not fast, however, is the conversion of byte arrays to strings ( using Series.str.decode('utf-8') ).  In my experience this conversion is often the slowest part of loading binary data.  If it's causing you headaches, consider leaving some or all of your character data as byte arrays rather than native string objects.  Also, Pandas 1.0 introduced a new string type, but it's still experimental.  Perhaps something there can make the conversion faster.

**Variable Record Lengths**:  In the examples here, our record types all had fixed lengths, but in the wild, binary records often have variable lengths due either to variable length character arrays or to repeating groups within the record.  To handle records of this type, you'll have to truncate the character arrays to some maximum length and find a way to deal with any repeating groups.  The general tools above are all you really need, however, so just be aware that this is something you may have to deal with; you should be able to come up with a solution that works for your situation.
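As a minimal sketch of the truncation idea, assuming a maximum width of 5 bytes for the character field (the helper name `to_fixed_width` and the sample records are made up for this example):

```python
import struct

MAX_NAME_LEN = 5   # assumed maximum width for the character field

def to_fixed_width(name, width=MAX_NAME_LEN):
    """Truncate names that are too long and pad short ones with null bytes."""
    return name[:width].ljust(width, b'\x00')

# hypothetical variable-length records: a 4-byte number followed by a name of any length
variable_records = [(1, b'one'), (2, b'seventeen'), (3, b'three')]

fixed = b''.join(struct.pack('<I5s', number, to_fixed_width(name))
                 for number, name in variable_records)

print(len(fixed))                      # three records of 9 bytes each
print(to_fixed_width(b'seventeen'))    # long names are cut to the maximum width
```

Once every record has the same width, the bytes can be described by a single fixed dtype again.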

**Irregularly Structured ASCII data**:  In this article we focused on binary data, but I want to note quickly that if you have large quantities of irregularly structured ASCII data, you can use the same techniques to efficiently process and load it.  Again, just figure out a final structure that works as a dataframe, and then write some Cython to parse your ASCII file into one or more buffers as we did above.
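Here is a small pure-Python sketch of that idea, under the assumption of a made-up text format in which the first word on each line says which fields follow.  In practice the parsing loop is the part you would move into Cython.

```python
import struct

# hypothetical irregular text: the fields on each line depend on the leading tag
lines = [
    'NAME 1 one',
    'VALS 1 2 3 4',
    '# a comment line that should be ignored',
    'NAME 2 two',
]

names = bytearray()   # will hold fixed-width (number, name) records
vals = bytearray()    # will hold fixed-width (a, b, c, d) records

for line in lines:
    tag, *fields = line.split()
    if tag == 'NAME':
        names += struct.pack('<I5s', int(fields[0]), fields[1].encode('ascii'))
    elif tag == 'VALS':
        vals += struct.pack('<4i', *map(int, fields))

print(len(names), len(vals))
```

Each bytearray now holds identical fixed-length records, so it can be handed to np.frombuffer with a matching dtype just like the buffers above.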