# NumPy Serialisation and I/O

In this notebook we focus on NumPy's built-in support for **serialisation** and **I/O**. In other words, we will learn how to save and load NumPy `ndarray` objects in NumPy's native (binary) format for easy sharing. We will also see how NumPy can load data from external files.

In [1]:
import numpy as np

## Comma-separated values (CSV)

A very common format for data files is comma-separated values (CSV), or a related format such as TSV (tab-separated values).

To read data from such a file into NumPy arrays we can use the `numpy.genfromtxt` function.

In [7]:
# In Jupyter, all commands starting with ! are mapped as SHELL commands
!head stockholm_td_adj.dat

Year Month Day T_6 T12 T18 Valid 
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1


In [9]:
np.genfromtxt?

Signature:
np.genfromtxt(
    fname,
    dtype=<class 'float'>,
    comments='#',
    delimiter=None,
    skip_header=0,
    skip_footer=0,
    converters=None,
    missing_values=None,
    filling_values=None,
    usecols=None,
    names=None,

In [14]:
st_temperatures = np.genfromtxt('stockholm_td_adj.dat', skip_header=1)

In [15]:
st_temperatures.shape

(77431, 7)
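To try `np.genfromtxt` without the data file at hand, here is a minimal sketch that reads from an in-memory text buffer instead. The three sample rows are hypothetical and only mimic the layout of the file shown above.

```python
import io
import numpy as np

# Hypothetical three-row sample mimicking the layout of the file above
sample = io.StringIO(
    "Year Month Day T_6 T12 T18 Valid\n"
    "1800 1 1 -6.1 -6.1 -6.1 1\n"
    "1800 1 2 -15.4 -15.4 -15.4 1\n"
    "1800 1 3 -15.0 -15.0 -15.0 1\n"
)

# delimiter=None (the default) splits on any run of whitespace;
# skip_header=1 drops the line holding the column names
demo = np.genfromtxt(sample, skip_header=1)

print(demo.shape)  # (3, 7)
print(demo.dtype)  # float64
```

Because the default `dtype` is `float`, every column (including the year and the validity flag) is read as a floating point number.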

### DIY

Let's play a bit with the loaded data in `st_temperatures`, combining **fancy indexing** with boolean masks (i.e. defining conditions to select a subset of the data) and very simple statistics.

For example:

In [16]:
st_temperatures[:10, ]

array([[ 1.80e+03,  1.00e+00,  1.00e+00, -6.10e+00, -6.10e+00, -6.10e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  2.00e+00, -1.54e+01, -1.54e+01, -1.54e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  3.00e+00, -1.50e+01, -1.50e+01, -1.50e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  4.00e+00, -1.93e+01, -1.93e+01, -1.93e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  5.00e+00, -1.68e+01, -1.68e+01, -1.68e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  6.00e+00, -1.14e+01, -1.14e+01, -1.14e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  7.00e+00, -7.60e+00, -7.60e+00, -7.60e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  8.00e+00, -7.10e+00, -7.10e+00, -7.10e+00,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  9.00e+00, -1.01e+01, -1.01e+01, -1.01e+01,
         1.00e+00],
       [ 1.80e+03,  1.00e+00,  1.00e+01, -9.50e+00, -9.50e+00, -9.50e+00,
         1.00e+00]])

In [17]:
st_temperatures.dtype

dtype('float64')

In [19]:
## Calculate which and how many years we have in our data
years = np.unique(st_temperatures[:, 0]).astype(int)  # np.int was removed in NumPy 1.24; use the built-in int
years, len(years)

(array([1800, 1801, 1802, 1803, 1804, 1805, 1806, 1807, 1808, 1809, 1810,
        1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821,
        1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 1832,
        1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840, 1841, 1842, 1843,
        1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851, 1852, 1853, 1854,
        1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862, 1863, 1864, 1865,
        1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873, 1874, 1875, 1876,
        1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887,
        1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898,
        1899, 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909,
        1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920,
        1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931,
        1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942,
        1943, 1944, 1945, 1946, 1947, 

In [20]:
## Calculate the mean midday temperature in February 1984



In [21]:
## ....
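As a hint for the exercises above, here is a sketch of the boolean-mask pattern on a tiny, made-up array with the same column layout. It assumes that column 4 (`T12`) holds the midday reading; the values are hypothetical and not taken from the Stockholm data.

```python
import numpy as np

# Hypothetical rows: Year, Month, Day, T_6, T12, T18, Valid
toy = np.array([
    [1984, 1, 31, -2.0, -1.0, -3.0, 1],
    [1984, 2,  1, -4.0,  2.0, -5.0, 1],
    [1984, 2,  2, -6.0,  4.0, -7.0, 1],
    [1985, 2,  1, -8.0,  9.0, -9.0, 1],
])

# Combine two conditions element-wise with & (each one needs parentheses)
mask = (toy[:, 0] == 1984) & (toy[:, 1] == 2)

# Select the matching rows and the assumed midday column (index 4)
feb_noon = toy[mask, 4]
print(feb_noon.mean())  # 3.0
```

The same mask expression can be applied to `st_temperatures` once you have decided which column represents midday.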

## NumPy's native file format

* Useful when storing and reading back NumPy array data.

* Use the functions `np.save` and `np.load`:

### `np.save`

In [22]:
np.save("st_temperatures.npy", st_temperatures)

**See also**:

- `np.savez` : save several NumPy arrays into one single file
- `np.savez_compressed` : like `np.savez`, but the resulting archive is compressed
- `np.savetxt` : save an array to a plain-text file
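The functions listed above can be sketched as follows. The array names and file names are made up for illustration, and everything is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile
import numpy as np

x = np.arange(6).reshape(2, 3)
y = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    # np.savez stores several arrays in one .npz archive, keyed by keyword name
    npz_path = os.path.join(tmp, "arrays.npz")
    np.savez(npz_path, x=x, y=y)

    # np.load returns a dict-like archive for .npz files
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        x_back = archive["x"]
        y_back = archive["y"]

    # np.savetxt writes a human-readable text file; np.loadtxt reads it back
    txt_path = os.path.join(tmp, "x.txt")
    np.savetxt(txt_path, x, fmt="%d")
    x_text = np.loadtxt(txt_path, dtype=int)

print(names)                       # ['x', 'y']
print(np.array_equal(x_back, x))   # True
print(np.array_equal(x_text, x))   # True
```

`np.savez_compressed` is called the same way as `np.savez`; it trades some saving time for a smaller file.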

### `np.load`

In [23]:
T = np.load("st_temperatures.npy")
print(T.shape, T.dtype)

(77431, 7) float64


---

## NumPy for MATLAB Users (really?)


If you are a MATLAB® user, I recommend reading [NumPy for MATLAB Users](https://docs.scipy.org/doc/numpy-1.15.0/user/numpy-for-matlab-users.html).

### NumPy arrays can be saved to and loaded from native MATLAB® files

---

### The `Matrix` Array Type

In addition to the `numpy.ndarray` type, NumPy also supports a very specific data type called `matrix` (`numpy.matrix`).

This special type of object was introduced to allow for API and programming compatibility with MATLAB®.

**Note**: The most relevant feature of this _array type_ is that the standard arithmetic operators `+, -, *` follow matrix algebra, as they would in MATLAB®. The NumPy documentation no longer recommends `numpy.matrix` for new code; regular arrays with the `@` operator are preferred.

In [2]:
from numpy import matrix

In [3]:
a = np.arange(0, 5)
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])

In [4]:
a

array([0, 1, 2, 3, 4])

In [5]:
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [6]:
M = matrix(A)
v = matrix(a).T # make it a column vector

In [7]:
a

array([0, 1, 2, 3, 4])

In [8]:
M * M

matrix([[ 300,  310,  320,  330,  340],
        [1300, 1360, 1420, 1480, 1540],
        [2300, 2410, 2520, 2630, 2740],
        [3300, 3460, 3620, 3780, 3940],
        [4300, 4510, 4720, 4930, 5140]])

In [9]:
A @ A  # @ operator equivalent to np.dot(A, A)

array([[ 300,  310,  320,  330,  340],
       [1300, 1360, 1420, 1480, 1540],
       [2300, 2410, 2520, 2630, 2740],
       [3300, 3460, 3620, 3780, 3940],
       [4300, 4510, 4720, 4930, 5140]])

In [10]:
# Element wise multiplication in NumPy
A * A

array([[   0,    1,    4,    9,   16],
       [ 100,  121,  144,  169,  196],
       [ 400,  441,  484,  529,  576],
       [ 900,  961, 1024, 1089, 1156],
       [1600, 1681, 1764, 1849, 1936]])

In [11]:
M * v

matrix([[ 30],
        [130],
        [230],
        [330],
        [430]])

In [12]:
A * a

array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])

In [13]:
# inner product
v.T * v

matrix([[30]])

In [14]:
# with matrix objects, standard matrix algebra applies
v + M*v

matrix([[ 30],
        [131],
        [232],
        [333],
        [434]])

If we try to add, subtract or multiply objects with incompatible shapes we get an error:

In [15]:
v_incompat = matrix(list(range(1, 7))).T

In [16]:
M.shape, v_incompat.shape

((5, 5), (6, 1))

In [17]:
M * v_incompat

ValueError: shapes (5,5) and (6,1) not aligned: 5 (dim 1) != 6 (dim 0)

See also the related functions: `np.inner`, `np.outer`, `np.cross`, `np.kron`, `np.tensordot`.

Try for example `help(np.inner)`.
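As a quick illustration of some of these functions on plain `ndarray` objects, here is a small sketch; the vectors `u` and `w` are arbitrary examples.

```python
import numpy as np

u = np.array([1, 2, 3])
w = np.array([4, 5, 6])

inner_uw = np.inner(u, w)   # scalar: 1*4 + 2*5 + 3*6
outer_uw = np.outer(u, w)   # 3x3 array with entries u[i] * w[j]
cross_uw = np.cross(u, w)   # vector orthogonal to both u and w
kron_demo = np.kron(np.eye(2, dtype=int), np.array([[1, 2]]))  # block structure

print(inner_uw)        # 32
print(outer_uw.shape)  # (3, 3)
print(cross_uw)        # [-3  6 -3]
print(kron_demo)       # [[1 2 0 0]
                       #  [0 0 1 2]]
```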

---

## Loading and Saving `.mat` files

Let's create a `numpy.ndarray` object:

In [21]:
A = np.random.rand(10000, 300, 50)  # note: 150 million float64 values (about 1.2 GB), so this may take a while

In [22]:
A

array([[[0.30788845, 0.60569692, 0.74159203, ..., 0.99513856,
         0.86615676, 0.65581839],
        [0.29972906, 0.1727805 , 0.73877596, ..., 0.57321798,
         0.52657155, 0.15148499],
        [0.91677054, 0.30289045, 0.47086303, ..., 0.91076997,
         0.15659756, 0.74502433],
        ...,
        [0.16246413, 0.57601666, 0.64519549, ..., 0.04166688,
         0.71115738, 0.75984878],
        [0.99626814, 0.89529207, 0.89520696, ..., 0.927474  ,
         0.46998733, 0.809978  ],
        [0.52545775, 0.42922203, 0.40999633, ..., 0.7497839 ,
         0.26582518, 0.68821719]],

       [[0.93763072, 0.68660253, 0.03060252, ..., 0.08489496,
         0.3368953 , 0.0040575 ],
        [0.17680589, 0.44922269, 0.32552186, ..., 0.49081397,
         0.7718607 , 0.91216332],
        [0.48935017, 0.28293444, 0.57762148, ..., 0.64988995,
         0.96036063, 0.62395338],
        ...,
        [0.77554755, 0.23174591, 0.80126054, ..., 0.34982511,
         0.13648038, 0.63953428],
        [0.4

### Introducing SciPy (ecosystem)

![scipy](images/scipy.png)

### `scipy.io`

In [20]:
from scipy import io as spio

### NumPy $\mapsto$ MATLAB :  `scipy.io.savemat`

In [23]:
spio.savemat('numpy_to.mat', {'A': A}, oned_as='row')  # savemat expects a dictionary

### MATLAB $\mapsto$ NumPy: `scipy.io.loadmat`

In [24]:
data_dictionary = spio.loadmat('numpy_to.mat')


In [25]:
list(data_dictionary.keys())

['__header__', '__version__', '__globals__', 'A']

In [26]:
data_dictionary['A']

array([[[0.30788845, 0.60569692, 0.74159203, ..., 0.99513856,
         0.86615676, 0.65581839],
        [0.29972906, 0.1727805 , 0.73877596, ..., 0.57321798,
         0.52657155, 0.15148499],
        [0.91677054, 0.30289045, 0.47086303, ..., 0.91076997,
         0.15659756, 0.74502433],
        ...,
        [0.16246413, 0.57601666, 0.64519549, ..., 0.04166688,
         0.71115738, 0.75984878],
        [0.99626814, 0.89529207, 0.89520696, ..., 0.927474  ,
         0.46998733, 0.809978  ],
        [0.52545775, 0.42922203, 0.40999633, ..., 0.7497839 ,
         0.26582518, 0.68821719]],

       [[0.93763072, 0.68660253, 0.03060252, ..., 0.08489496,
         0.3368953 , 0.0040575 ],
        [0.17680589, 0.44922269, 0.32552186, ..., 0.49081397,
         0.7718607 , 0.91216332],
        [0.48935017, 0.28293444, 0.57762148, ..., 0.64988995,
         0.96036063, 0.62395338],
        ...,
        [0.77554755, 0.23174591, 0.80126054, ..., 0.34982511,
         0.13648038, 0.63953428],
        [0.4

In [27]:
A_load = data_dictionary['A']

In [28]:
np.all(A == A_load)

True

In [30]:
type(A_load)

numpy.ndarray
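The same round trip can be sketched on much smaller arrays, which also makes the effect of `oned_as='row'` visible. This example assumes SciPy is installed; the variable names are made up, and an in-memory buffer is used instead of a file on disk.

```python
import io
import numpy as np
from scipy import io as spio

small = np.arange(12, dtype=float).reshape(3, 4)
vec = np.array([1.0, 2.0, 3.0])

# savemat and loadmat accept open file-like objects as well as file names
buffer = io.BytesIO()
spio.savemat(buffer, {"small": small, "vec": vec}, oned_as="row")
buffer.seek(0)
loaded = spio.loadmat(buffer)

# Keys wrapped in double underscores are metadata added by SciPy
user_keys = sorted(k for k in loaded if not k.startswith("__"))

print(user_keys)            # ['small', 'vec']
print(loaded["vec"].shape)  # (1, 3): the 1-D array was stored as a row
print(np.array_equal(loaded["small"], small))  # True
```

MATLAB arrays always have at least two dimensions, so a 1-D NumPy array comes back as a 2-D row (or column, with `oned_as='column'`).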