# STRUCTURED DATA CONTAINER: Performance

This notebook compares structured data containers using a dataset typical of my projects (nrows=17302, ncols=187). The performance of the following containers is tested:
- Pandas dataframes.
- Numpy basic arrays.
- Numpy structured arrays.
- Numpy record arrays (from a Pandas dataframe).
- Custom array: Numpy basic array with a column selector.

In [1]:
import sys
import pandas as pd
import numpy as np

### arguments

In [2]:
file_input = 'meteo.csv'

### classes

In [3]:
## basic numpy array with columns selector
class Array:
    """
    Basic numpy array with columns selector.

    Attributes:
    data -- array of X data.
    col -- column names.
    ncol -- number of columns.
    nrow -- number of rows.
    """

    # constructor
    def __init__(self, array:'np array', columns:list):
        """
        Array constructor.
        array -- array of X data.
        columns -- column names.
        """
        # validation
        assert type(array)==np.ndarray, "The data is not a numpy array."
        assert type(columns)==list, "The columns argument is not a list."
        assert array.shape[1]==len(columns), "The columns list does not match the data shape."
        # set attributes
        self.data = array
        self.col = columns 
        # calculate number of rows and x columns
        self.nrow = self.data.shape[0]
        self.ncol = self.data.shape[1]
    # columns selector
    def sel(self,query:'string or list of strings'):
        """
        Data column selector
        """
        if type(query)==str: return self.data[:,self.col.index(query)]
        elif type(query)==list:
            if len(query)==0: return self.data
            elif len(query)==1: return self.data[:,self.col.index(query[0])]
            else: return self.data[:,[self.col.index(iq) for iq in query]]
        else: return None
        
    # custom display
    def __repr__(self):
        return "<Array nrow:%s ncol:%s>" % (self.nrow, self.ncol)
    
    def __str__(self):
        return "Array: nrow=%s ncol=%s" % (self.nrow, self.ncol)
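
The selector above looks each name up with `list.index`, which scans the column list on every call. A minimal sketch of a possible variant that builds a name-to-position dictionary once is shown below. The class name `IndexedArray` and the `_pos` attribute are hypothetical and not part of this notebook, and unlike `Array.sel` the sketch does not special-case empty or single-item lists.

```python
import numpy as np


class IndexedArray:
    """Hypothetical variant of Array: column lookup through a dict."""

    def __init__(self, array, columns):
        if array.shape[1] != len(columns):
            raise ValueError("columns must match array.shape[1]")
        self.data = array
        self.col = list(columns)
        # name -> position, built once so each lookup avoids a list scan
        self._pos = {name: i for i, name in enumerate(self.col)}

    def sel(self, query):
        """Return one column (str) or several columns (list of str)."""
        if isinstance(query, str):
            return self.data[:, self._pos[query]]
        return self.data[:, [self._pos[name] for name in query]]


demo = IndexedArray(np.arange(6.0).reshape(3, 2), ["a", "b"])
one_col = demo.sel("b")          # 1-D view of column "b"
two_cols = demo.sel(["b", "a"])  # 2-D copy with the columns reordered
```

Whether this changes the timings for the real dataset was not measured here.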

### read data

In [4]:
# read data
dfdata = pd.read_csv(file_input).set_index('dt')
# include a string variable
dfdata['strcol'] = ['this is a str value' for i in range(len(dfdata))]

In [5]:
print('nrows = %s   ncols = %s'%(len(dfdata),len(dfdata.columns.values)))
type(dfdata)

nrows = 17302   ncols = 187


pandas.core.frame.DataFrame

### data preparation

In [6]:
## numpy basic array
data = dfdata.values
type(data)

numpy.ndarray
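
Since a string column was added to the dataframe, it is worth checking which dtype the basic array ends up with. The sketch below uses a hypothetical two-column toy frame (not the `meteo.csv` data) to show how a mix of numeric and string columns is converted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one float column plus one string column.
toy = pd.DataFrame({"y": [1.0, 2.0, 3.0], "strcol": ["a", "b", "c"]})

mixed = toy.to_numpy()        # float + string columns share one common dtype
y_only = toy["y"].to_numpy()  # a single numeric column keeps its own dtype

print(mixed.dtype)
print(y_only.dtype)
```

If `data.dtype` is `object` in this notebook as well, that would be consistent with the slower `np.mean` and filtering timings of the basic array further down, although this was not verified against the original dataset.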

In [7]:
## numpy structured array
tp = np.dtype({'names':tuple(dfdata.dtypes.index.tolist()),
          'formats':tuple(dfdata.dtypes.tolist())})
data_stru = np.zeros(len(dfdata), dtype=tp)
for ic in dfdata.columns.values: data_stru[ic] = dfdata[ic].tolist()
type(data_stru)

numpy.ndarray
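
The same construction can be followed more easily on a small example. This is only a sketch on a hypothetical toy frame; it fills each field with `to_numpy()` instead of `tolist()`, which is an assumption about what is convenient, not a measured improvement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({"hforecast": [6, 12, 18], "y": [0.5, 1.5, 2.5]})

# One (name, dtype) pair per column describes the record layout.
layout = np.dtype([(name, toy[name].dtype) for name in toy.columns])

stru = np.zeros(len(toy), dtype=layout)
for name in toy.columns:
    stru[name] = toy[name].to_numpy()  # fill the array field by field

second_row = stru[1]  # one record, with fields accessible by name
```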

In [8]:
## numpy record array
data_rec = dfdata.to_records()
type(data_rec)

numpy.recarray
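
Note that `to_records()` keeps the dataframe index as an extra field by default, so the record array built here probably carries the `dt` index in addition to the data columns. A small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a named index, mimicking set_index('dt').
toy = pd.DataFrame({"dt": ["2020-01-01", "2020-01-02"],
                    "y": [0.5, 1.5]}).set_index("dt")

rec = toy.to_records()                    # index kept as the first field
rec_no_idx = toy.to_records(index=False)  # data columns only

# Key access and attribute access return the same values.
same = np.array_equal(rec["y"], rec.y)
```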

In [9]:
## my own array object (numpy basic array with data selector)
odata = Array(data,list(dfdata.columns.values))
odata

<Array nrow:17302 ncol:187>

## PERFORMANCE

### Memory usage

In [10]:
MB = 1024*1024
print("Pandas %d MB / %d B " % (sys.getsizeof(dfdata)/MB,sys.getsizeof(dfdata)))
print("Numpy Basic %d MB / %d B  " % (sys.getsizeof(data)/MB,sys.getsizeof(data)))
print("Numpy Structured %d MB / %d B  " % (sys.getsizeof(data_stru)/MB,sys.getsizeof(data_stru)))
print("Numpy Record %d MB / %d B  " % (sys.getsizeof(data_rec)/MB,sys.getsizeof(data_rec)))
print("Custom Array %d MB / %d B  " % (sys.getsizeof(odata)/MB,sys.getsizeof(odata)))

Pandas 27 MB / 28375304 B 
Numpy Basic 0 MB / 112 B  
Numpy Structured 24 MB / 25883888 B  
Numpy Record 24 MB / 26022328 B  
Custom Array 0 MB / 56 B  
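
The 112 B and 56 B figures are unlikely to be the real data sizes. `sys.getsizeof` counts the data buffer of a NumPy array only when the array owns it, and for a plain Python object such as `Array` it does not follow the `data` attribute. A minimal sketch of the difference, on hypothetical toy data:

```python
import sys

import numpy as np
import pandas as pd

toy = pd.DataFrame({"y": np.arange(1000, dtype=float)})

owner = toy["y"].to_numpy().copy()  # this array owns its buffer
view = owner[:]                     # this one shares the buffer of `owner`

print(sys.getsizeof(owner), sys.getsizeof(view))  # header + data vs header only
print(owner.nbytes, view.nbytes)                  # data bytes in both cases
print(toy.memory_usage(deep=True).sum())          # pandas' own accounting
```

For a fairer comparison, `data.nbytes`, `data_stru.nbytes`, `data_rec.nbytes`, `odata.data.nbytes` and `dfdata.memory_usage(deep=True).sum()` could be reported instead. Those numbers were not computed for this notebook.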



### Data access: integer

In [11]:
print('Pandas:')
%timeit dfdata['hforecast']
print('Numpy Basic:')
%timeit data[:,0]
print('Numpy Structured:')
%timeit data_stru['hforecast']
print('Numpy Record:')
%timeit data_rec['hforecast']
%timeit data_rec.hforecast
print('Custom Array:')
%timeit odata.sel('hforecast')

Pandas:
1.56 µs ± 4.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Basic:
213 ns ± 1.32 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Structured:
140 ns ± 1.68 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Numpy Record:
2.81 µs ± 18.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.95 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Custom Array:
659 ns ± 6.29 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
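
`%timeit` is an IPython magic, so these cells only run inside a notebook or IPython shell. A rough equivalent with the standard `timeit` module is sketched below. The array `arr` is a hypothetical stand-in for `data`, and the lambda adds a small call overhead that `%timeit` does not have.

```python
import statistics
import timeit

import numpy as np

arr = np.zeros((1000, 5))  # hypothetical stand-in for `data`
loops = 100_000

# Seven runs of `loops` calls each, like the "7 runs" reported by %timeit.
totals = timeit.repeat(lambda: arr[:, 0], repeat=7, number=loops)
per_loop_ns = [t / loops * 1e9 for t in totals]

print("%.0f ns ± %.1f ns per loop" % (statistics.mean(per_loop_ns),
                                      statistics.stdev(per_loop_ns)))
```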


### Data access: float

In [12]:
print('Pandas:')
%timeit dfdata['y']
print('Numpy Basic:')
ii = list(dfdata.columns.values).index("y")
%timeit list(dfdata.columns.values).index("y")
%timeit data[:,ii]
print('Numpy Structured:')
%timeit data_stru['y']
print('Numpy Record:')
%timeit data_rec['y']
%timeit data_rec.y
print('Custom Array:')
%timeit odata.sel('y')

Pandas:
1.59 µs ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Basic:
5.67 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
221 ns ± 1.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Structured:
134 ns ± 2.88 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Numpy Record:
2.78 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Custom Array:
2.45 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### Data access: string

In [13]:
print('Pandas:')
%timeit dfdata['strcol']
print('Numpy Basic:')
%timeit data[:,-1]
print('Numpy Structured:')
%timeit data_stru['strcol']
print('Numpy Record:')
%timeit data_rec['strcol']
%timeit data_rec.strcol
print('Custom Array:')
%timeit odata.sel('strcol')

Pandas:
1.53 µs ± 27.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Basic:
219 ns ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Numpy Structured:
133 ns ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Numpy Record:
2.76 µs ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
19.6 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Custom Array:
2.68 µs ± 8.28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


### Operations on a column (unfiltered): mean

In [14]:
print('Pandas:')
%timeit dfdata['y'].mean()
print('Numpy Basic:')
ii = list(dfdata.columns.values).index("y")
%timeit list(dfdata.columns.values).index("y")
%timeit np.mean(data[:,ii])
print('Numpy Structured:')
%timeit np.mean(data_stru['y'])
print('Numpy Record:')
%timeit np.mean(data_rec['y'])
%timeit np.mean(data_rec.y)
print('Custom Array:')
%timeit np.mean(odata.sel('y'))

Pandas:
42.8 µs ± 719 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy Basic:
6.07 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
163 µs ± 859 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy Structured:
59.3 µs ± 304 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy Record:
63.7 µs ± 171 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
82.3 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Custom Array:
169 µs ± 2.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


### Operations on a column (unfiltered): vectorized log

In [16]:
print('Pandas:')
%timeit np.log(dfdata['y'])
print('Numpy Basic:')
#ii = list(dfdata.columns.values).index("y")
#%timeit list(dfdata.columns.values).index("y")
#%timeit np.log(data[:,ii])
print('Numpy Structured:')
%timeit np.log(data_stru['y'])
print('Numpy Record:')
%timeit np.log(data_rec['y'])
%timeit np.log(data_rec.y)
#print('Custom Array:')
#%timeit np.log(odata.sel('y'))

Pandas:
58.8 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Numpy Basic:
Numpy Structured:
481 µs ± 4.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Numpy Record:
479 µs ± 858 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
497 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


### Filtering: rows and one column

In [65]:
print('Pandas:')
%timeit dfdata[(dfdata['hforecast']>=6) & (dfdata['hforecast']<=12)]['hforecast']
print('Numpy Basic:')
%timeit data[(data[:,0]>=6) & (data[:,0]<=12)][:,0]
print('Numpy Structured:')
%timeit data_stru[(data_stru['hforecast']>=6) & (data_stru['hforecast']<=12)]['hforecast']
print('Numpy Record:')
%timeit data_rec[(data_rec['hforecast']>=6) & (data_rec['hforecast']<=12)]['hforecast']
%timeit data_rec[(data_rec.hforecast>=6) & (data_rec.hforecast<=12)].hforecast

Pandas:
3.99 ms ± 4.07 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy Basic:
50.4 ms ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy Structured:
40.4 ms ± 66.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy Record:
38.5 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.2 ms ± 521 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
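
The NumPy expressions above filter whole rows (all columns) and only then pick one column, and they evaluate the column lookup twice. A possible cheaper form is to select the column first and apply the mask to it. This is a sketch on hypothetical random data and was not timed against the dataset of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical table: 1000 rows, 5 columns, values between 0 and 23.
table = rng.integers(0, 24, size=(1000, 5)).astype(float)

col = table[:, 0]                # view of the column, no copy
mask = (col >= 6) & (col <= 12)  # both comparisons computed once

rows_then_col = table[mask][:, 0]  # copies every column of the kept rows
col_then_rows = col[mask]          # copies only the values that are needed
```

Both expressions return the same values, so the comparison with Pandas may look different once the column is selected before filtering.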


### Operations on a column (filtered): mean

In [67]:
print('Pandas:')
%timeit dfdata[(dfdata['hforecast']>=6) & (dfdata['hforecast']<=12)]['hforecast'].mean()
print('Numpy Basic:')
%timeit np.mean(data[(data[:,0]>=6) & (data[:,0]<=12)][:,0])
print('Numpy Structured:')
%timeit np.mean(data_stru[(data_stru['hforecast']>=6) & (data_stru['hforecast']<=12)]['hforecast'])
print('Numpy Record:')
%timeit np.mean(data_rec[(data_rec['hforecast']>=6) & (data_rec['hforecast']<=12)]['hforecast'])
%timeit np.mean(data_rec[(data_rec.hforecast>=6) & (data_rec.hforecast<=12)].hforecast)

Pandas:
4.31 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Numpy Basic:
50.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy Structured:
40.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Numpy Record:
39.5 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.5 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
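
On the Pandas side, the same filtered mean can also be written with `Series.between` and `.loc`, which avoids building an intermediate dataframe with every column. A minimal sketch on a hypothetical toy frame (not benchmarked here):

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({"hforecast": [0, 6, 9, 12, 18],
                    "y": [1.0, 2.0, 3.0, 4.0, 5.0]})

mask = toy["hforecast"].between(6, 12)  # inclusive on both ends by default
filtered_mean = toy.loc[mask, "hforecast"].mean()

print(filtered_mean)
```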


## **Ranking:**
- **Memory usage** (largest first, as reported by `sys.getsizeof`): Pandas > Numpy Structured ≈ Numpy Record >> Numpy Basic > Custom Array. The Numpy Basic and Custom Array figures (112 B and 56 B) count only the wrapper object, not the underlying data buffer.
- **Data (column) access performance**: Numpy Structured > Numpy Basic >> Pandas > Numpy Record > Numpy Record (attribute). **Note**: the position of Custom Array depends on the column accessed; it is 3rd for `hforecast` and 4th for `y` and `strcol`.
- **Data (column) operation (mean) performance**: Pandas > Numpy Structured > Numpy Record > Numpy Basic > Custom Array.
- **Data (column) operation (vectorized log) performance**: Pandas >> Numpy Structured ≈ Numpy Record. Numpy Basic and Custom Array were not measured (their lines are commented out).
- **Data filtering: select one column and filter rows**: Pandas >> Numpy Record > Numpy Structured > Numpy Basic.

## **Conclusions:**

- For **Exploratory Data Analysis (EDA)** in particular, and for data handling during the development phase in general, the best option is **Pandas dataframes**, because memory usage is not so important there and the available tools are extremely useful.
- For an **operative predictive model** in particular, and for Machine Learning development in general, the best option is **Numpy basic arrays**, because their memory usage is lower and their performance is not so different from that of Pandas.

#### **Pandas usage exceptions:**
In the **data preparation** phase we may still need Pandas because:
- it provides tools that we need for our purposes.
- it filters data much faster.
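
The split described in these conclusions (Pandas for preparation, NumPy for the model) could look roughly like the sketch below. The toy frame and the variable names `prepared`, `X` and `names` are hypothetical, and dropping the non-numeric columns is an assumption about what the model needs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with an integer, a float and a string column.
toy = pd.DataFrame({
    "hforecast": [0, 6, 12, 18],
    "y": [1.0, 2.0, 3.0, 4.0],
    "strcol": ["a", "b", "c", "d"],
})

# Preparation in pandas: filter rows, then keep the numeric columns only.
prepared = toy[toy["hforecast"].between(6, 12)].select_dtypes(include="number")

# Hand-off to NumPy for the model: one typed array plus the column names.
X = prepared.to_numpy(dtype=float)
names = list(prepared.columns)
```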

### References:
- [Structured Data: NumPy's Structured Arrays](https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html)
- [Numpy Vs Pandas Performance Comparison](http://gouthamanbalaraman.com/blog/numpy-vs-pandas-comparison.html)