While I don't use NumPy arrays directly so much anymore, they underlie the types I do use, so the concepts are still useful to understand.

Python provides types for collections of things, namely `list`, `set`, and `tuple`.  However, when dealing with large amounts of data, these types use a lot of memory and operations on them tend to be slow.

NumPy arrays, in contrast to the Python collection types, require all elements to have the same type, but they store the data more compactly and implement many operations as loops in C.
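
As a rough illustration of both points, the sketch below compares the memory used by a list of integers with an equivalent array, and shows what happens when the inputs have mixed types. The variable names are made up for this example, and the exact byte count for the list depends on the Python version and platform.

```python
import sys

import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list stores pointers to separate int objects, so count both (roughly).
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# An array stores the raw values back to back: 1000 elements * 8 bytes.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)

# Mixed inputs are converted to one common type, here float64.
mixed = np.array([1, 2.5])
print(mixed.dtype)
```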

## NumPy arrays

`np.array` takes its argument and converts it to a NumPy N-D array.

In [1]:
import numpy as np

np.array(range(10))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [2]:
np.array([[1, 0], [0, 1]])

array([[1, 0],
       [0, 1]])

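`np.arange` and `np.linspace` are slightly different approaches to creating evenly spaced arrays: `np.arange` takes a step size and leaves out the stop value, while `np.linspace` takes the number of points and includes it. `np.geomspace` is a similar function that spaces the points evenly on a logarithmic scale.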

In [3]:
np.arange(1, 3, 0.1)

array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2,
       2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9])

In [4]:
np.linspace(1, 3, 10)

array([1.        , 1.22222222, 1.44444444, 1.66666667, 1.88888889,
       2.11111111, 2.33333333, 2.55555556, 2.77777778, 3.        ])

In [5]:
np.geomspace(1, 100, 3)

array([  1.,  10., 100.])
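
A quick way to see what "evenly spaced" means for each function is to look at the differences and ratios between neighboring elements. This is only a sketch with illustrative variable names.

```python
import numpy as np

stepped = np.arange(0, 10, 2)     # step of 2, stop value excluded
lin = np.linspace(1, 3, 10)       # 10 points, stop value included
geo = np.geomspace(1, 100, 3)     # 3 points, stop value included

lin_steps = np.diff(lin)          # constant difference between neighbors
geo_ratios = geo[1:] / geo[:-1]   # constant ratio between neighbors

print(stepped)
print(lin_steps)
print(geo_ratios)
```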

You can do most arithmetic on numeric arrays, and there are methods to find the sum, mean, and standard deviation.

In [6]:
a = np.arange(10)
b = np.arange(10, 0, -1)

a + b

array([10, 10, 10, 10, 10, 10, 10, 10, 10, 10])

In [7]:
a - b

array([-10,  -8,  -6,  -4,  -2,   0,   2,   4,   6,   8])

In [8]:
a * b

array([ 0,  9, 16, 21, 24, 25, 24, 21, 16,  9])

In [9]:
a / b

array([0.        , 0.11111111, 0.25      , 0.42857143, 0.66666667,
       1.        , 1.5       , 2.33333333, 4.        , 9.        ])

In [10]:
# Integer division, quotient
a // b

array([0, 0, 0, 0, 0, 1, 1, 2, 4, 9])

In [11]:
# Integer division, remainder
a % b

array([0, 1, 2, 3, 4, 0, 2, 1, 0, 0])

In [12]:
(a // b) * b + a % b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
a ** 2

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [14]:
a.sum()

45

In [15]:
a.mean()

4.5

In [16]:
a.std()

2.8722813232690143

In [17]:
np.median(a)

4.5
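
One detail worth knowing: `std` divides by the number of elements by default, which gives the population standard deviation. If the sample standard deviation is wanted instead, the `ddof` argument changes the divisor to `N - ddof`. A brief sketch:

```python
import numpy as np

a = np.arange(10)

population_std = a.std()      # divides by N
sample_std = a.std(ddof=1)    # divides by N - 1

print(population_std, sample_std)
```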

NumPy arrays can perform some operations faster than Python builtins can, once the arrays are big enough.

In [18]:
%timeit np.sum(np.arange(1000))

12.6 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [19]:
%timeit sum(range(1000))

23.4 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
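
The timings above include the cost of building the array or range. Outside a notebook, a comparison along the same lines can be sketched with the standard `timeit` module. The size and repeat count here are arbitrary choices, and the numbers will vary by machine, so none are shown.

```python
import timeit

import numpy as np

n = 10_000
arr = np.arange(n)
values = list(range(n))

# Both approaches should agree on the answer.
numpy_total = int(arr.sum())
python_total = sum(values)

# Time only the summation, not the construction of the inputs.
numpy_time = timeit.timeit(arr.sum, number=20)
python_time = timeit.timeit(lambda: sum(values), number=20)

print(f"NumPy:  {numpy_time:.6f} s")
print(f"Python: {python_time:.6f} s")
```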


It is also possible to perform sums over only some axes of an array:

In [20]:
a = np.arange(9).reshape(3, 3)
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [21]:
# Sum of each row across columns
a.sum(1)

array([ 3, 12, 21])

In [22]:
# Sum of each column down rows
a.sum(0)

array([ 9, 12, 15])

In [23]:
# Sum of the whole array together
a.sum()

36
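
The axis argument names the axis that gets collapsed. Writing it as a keyword and comparing against explicit Python loops may make this easier to remember. The following is a small sketch, not the only way to spell it.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

row_sums = a.sum(axis=1)   # collapse the column axis: one value per row
col_sums = a.sum(axis=0)   # collapse the row axis: one value per column

# The same results with explicit loops over rows and columns.
row_sums_loop = [int(row.sum()) for row in a]
col_sums_loop = [int(col.sum()) for col in a.T]

print(row_sums, row_sums_loop)
print(col_sums, col_sums_loop)
```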

It is also possible to pull out every other row, or every other column, or a single element.

In [24]:
a[::2, ::2]

array([[0, 2],
       [6, 8]])

In [25]:
a[::2]

array([[0, 1, 2],
       [6, 7, 8]])

In [26]:
a[:, ::2]

array([[0, 2],
       [3, 5],
       [6, 8]])

In [27]:
a[1, 1]

4
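
The same indexing syntax also covers a single row, a single column, and negative steps. A few more cases on the same 3x3 array, as a sketch:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

middle_row = a[1]         # one index selects a whole row
last_column = a[:, -1]    # negative indices count from the end
reversed_rows = a[::-1]   # a negative step walks the rows backwards

print(middle_row)
print(last_column)
print(reversed_rows)
```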

## NumPy dtypes

There are five broad classes of data that NumPy arrays can work with efficiently: boolean (True/False), integers, real numbers, strings, and time-related data.  Most of these have a few variants based on how much memory each element of the array takes up.

### Boolean

There are only two boolean values, `True` and `False`.  This data type is always stored as one byte per array element.

The NumPy dtype for this is `numpy.bool_`.
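
Boolean arrays most often show up as the result of a comparison. A short sketch that also checks the one-byte-per-element storage (the names are illustrative):

```python
import numpy as np

a = np.arange(10)
mask = a > 5                # elementwise comparison gives a boolean array

print(mask)
print(mask.dtype, mask.itemsize, mask.nbytes)

# True counts as 1 in a sum, so this counts the matching elements.
n_matching = int(mask.sum())
print(n_matching)
```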

### Integers

You will generally want to work with signed integer types, such as `numpy.int8`, `numpy.int16`, `numpy.int32`, and `numpy.int64`.  These types take up one, two, four, and eight bytes per element.  You can find what numbers each can represent with `numpy.iinfo`.

In [28]:
import numpy as np

np.iinfo(np.int8)

iinfo(min=-128, max=127, dtype=int8)

In [29]:
np.iinfo(np.int32)

iinfo(min=-2147483648, max=2147483647, dtype=int32)
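
To see the sizes and ranges of all four signed types side by side, a loop over the types works. This is a sketch; `int_types` and `sizes` are names chosen for the example.

```python
import numpy as np

int_types = (np.int8, np.int16, np.int32, np.int64)

# Bytes per element for each type, keyed by the dtype name.
sizes = {np.dtype(t).name: np.dtype(t).itemsize for t in int_types}

for t in int_types:
    info = np.iinfo(t)
    name = np.dtype(t).name
    print(f"{name}: {sizes[name]} byte(s), {info.min} to {info.max}")
```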

If you know exactly what integers will be going into the array, using the smallest type that can represent all of them will save a bit of memory.  If not, using `np.int64` will hold off surprises due to integer overflow for as long as possible.

In [30]:
np.array([np.iinfo(np.int8).max], dtype=np.int8) + 1

array([-128], dtype=int8)
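
If the range of values is known ahead of time, picking the smallest type can be automated with `np.iinfo`. The helper below, `smallest_int_dtype`, is hypothetical and written only for this illustration; it is not part of NumPy.

```python
import numpy as np


def smallest_int_dtype(low, high):
    """Hypothetical helper: smallest signed dtype that can hold low..high."""
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= low and high <= info.max:
            return np.dtype(candidate)
    raise OverflowError("range does not fit in a 64-bit signed integer")


print(smallest_int_dtype(-5, 100))
print(smallest_int_dtype(0, 200))
print(smallest_int_dtype(0, 2**31))
```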

#### Unsigned integers

Unsigned integers are usually recommended only if you are doing bitwise operations on the binary representation of the integers and don't much care about their interpretation as integers.  The dtypes are `np.uint8`, `np.uint16`, `np.uint32`, and `np.uint64`.
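
As an example of the bitwise use case, the sketch below treats each `np.uint8` element as eight independent bits. The `flags` array and its values are made up for illustration.

```python
import numpy as np

flags = np.array([0b00000101, 0b00000011, 0b10000000], dtype=np.uint8)

lowest_bit_set = (flags & 1) == 1   # test bit 0 of each element
shifted = flags >> 1                # shift every element right by one bit
inverted = ~flags                   # flip all eight bits

print(lowest_bit_set)
print(shifted)
print(inverted)
```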

### Real numbers

Real numbers are represented as floating-point values, either single-precision (32-bit), double-precision (64-bit), long double precision (96- or 128-bit), or sometimes half-precision (16-bit).  You can find characteristics of the types with `np.finfo`.

In [31]:
np.finfo(np.float32)

finfo(resolution=1e-06, min=-3.4028235e+38, max=3.4028235e+38, dtype=float32)

Resolution is a rough estimate of how many digits the type keeps track of during arithmetic.  The minimum and maximum are the most negative and most positive finite numbers the type can represent; results beyond that range overflow to infinity, as `-np.inf` or `np.inf`.
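
The overflow to infinity can be seen by doubling the largest finite `np.float32` value. This is a small sketch; `np.errstate` is used only to silence the overflow warning NumPy would otherwise print.

```python
import numpy as np

largest = np.array([np.finfo(np.float32).max], dtype=np.float32)

# Suppress the overflow warning for this deliberate overflow.
with np.errstate(over="ignore"):
    too_big = largest * 2
    too_small = -largest * 2

print(too_big, too_small)
```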

In [32]:
np.finfo(np.float32).tiny

1.1754944e-38

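Tiny is the smallest positive normal number the type can represent; values between it and zero are either stored with reduced precision (as subnormal numbers) or underflow to zero.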

Floating-point numbers only keep so many digits around when doing arithmetic, which means results will not be exact.

In [33]:
0.1 + 0.1 + 0.1 - 0.3

5.551115123125783e-17

In [34]:
np.finfo(np.double).resolution

1e-15
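
Because of this roundoff, comparing floating-point results with `==` is fragile. A common alternative is to compare within a tolerance, for example with `np.isclose`. The sketch below uses its default tolerances, which may or may not suit a given application.

```python
import numpy as np

total = 0.1 + 0.1 + 0.1

exactly_equal = total == 0.3            # False: the roundoff differs
close_enough = np.isclose(total, 0.3)   # True: equal within tolerance

print(exactly_equal, close_enough)
```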

Python's built-in `float` type corresponds to `np.double`, which is usually `np.float64`.

Generally people select which type to use by considering the uncertainties in the numbers: if you are only sure about the first two or three digits of a number, it often doesn't make sense to keep track of fifteen digits.  The big exception is when you sum two sequences of numbers to get two totals of similar magnitude and then subtract one total from the other: the leading digits cancel, and the accumulated roundoff is what remains.

In [35]:
np.finfo(np.float16).resolution

0.001

In [36]:
a = np.arange(1, 10, 0.01, dtype=np.float16)
b = np.arange(10, 1, -0.01, dtype=np.float16)

a.sum() - b.sum()

-988.0

In [37]:
a.sum()

4852.0

In [38]:
np.arange(1, 10, 0.01, dtype=np.longdouble).sum()

4945.500000000003593

In [39]:
b.sum()

5840.0

In [40]:
np.arange(10, 1, -0.01, dtype=np.longdouble).sum()

4954.500000000086235

Each addition in the sum keeps around three digits of precision, but there are around nine hundred additions, so the roundoff errors accumulate, leaving slightly under one digit of precision by the end (a little more when summing in ascending order).

For `np.float32`, eating three digits of precision from the six or so it keeps (per the `np.finfo` resolution above) leaves around three digits of precision.  Since meteorological uncertainties are often around five to ten percent, this is usually plenty, but if you are doing a larger sum or are very sure of the numbers you are adding, `np.float64` will give you around fifteen digits of precision.
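
One further option, not used in the cells above, is to keep the data in a small type but ask `sum` to accumulate in a wider one through its `dtype` argument. The sketch below uses made-up data (a thousand copies of 0.1 stored as `np.float16`) rather than the arrays from this section, and only compares the two sums against a double-precision reference.

```python
import numpy as np

data = np.full(1000, 0.1, dtype=np.float16)

# Reference: the stored (already rounded) value times the count, in float64.
reference = float(data[0]) * len(data)

narrow = float(data.sum())                 # accumulate in float16
wide = float(data.sum(dtype=np.float64))   # accumulate in float64

print(reference, narrow, wide)
print(abs(narrow - reference), abs(wide - reference))
```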

There are a few tricks to make addition more accurate.  The first, mentioned earlier, is to do the addition in ascending order of magnitude; the second is to center the numbers around zero:

In [41]:
(a - a.mean()).sum() + a.mean() * len(a)

4850.68359375

In [42]:
(a - 5).sum() + 5 * len(a)

4850.75

Subtracting a number close to the mean of the array allows more of the precision in the sum to go to adding up the differences from that number.  The effect is more pronounced if the numbers in question are, say, temperatures in the range 250 to 310 Kelvin, where two digits go to saying the number is in the normal range for temperatures in Earth's atmosphere.

### String data

NumPy can represent string data, but only provides some of Python's string-processing functionality.

In [43]:
import string

alpha = np.array(list(string.ascii_lowercase))
alpha

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
       'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'],
      dtype='<U1')

In [44]:
np.char.upper(alpha)

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
      dtype='<U1')

In [45]:
np.char.center(alpha, 3)

array([' a ', ' b ', ' c ', ' d ', ' e ', ' f ', ' g ', ' h ', ' i ',
       ' j ', ' k ', ' l ', ' m ', ' n ', ' o ', ' p ', ' q ', ' r ',
       ' s ', ' t ', ' u ', ' v ', ' w ', ' x ', ' y ', ' z '],
      dtype='<U3')
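
The `U1` and `U3` in these dtypes describe fixed-width Unicode strings: `U` is the kind and the number is the maximum length in characters, with each character taking four bytes. One consequence, sketched below with an illustrative array, is that assigning a longer string into an existing array silently truncates it.

```python
import numpy as np

letters = np.array(["a", "b", "c"])
padded = np.char.center(letters, 3)

print(letters.dtype.kind, letters.itemsize)   # width 1: 4 bytes per element
print(padded.dtype.kind, padded.itemsize)     # width 3: 12 bytes per element

# The array only has room for one character per element,
# so the extra characters are dropped without an error.
letters[0] = "xyz"
print(letters)
```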

In [46]:
string2 = np.char.add(alpha, alpha[::-1])
string2

array(['az', 'by', 'cx', 'dw', 'ev', 'fu', 'gt', 'hs', 'ir', 'jq', 'kp',
       'lo', 'mn', 'nm', 'ol', 'pk', 'qj', 'ri', 'sh', 'tg', 'uf', 've',
       'wd', 'xc', 'yb', 'za'], dtype='<U2')

In [47]:
np.char.count(string2, "m")

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0])
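
The counts can be turned into a boolean mask to pick out the matching elements. A short sketch that rebuilds the arrays from the cells above (`pairs` and `has_m` are names chosen for this example):

```python
import string

import numpy as np

alpha = np.array(list(string.ascii_lowercase))
pairs = np.char.add(alpha, alpha[::-1])

# True wherever the two-letter string contains at least one "m".
has_m = np.char.count(pairs, "m") > 0
matching = pairs[has_m]

print(matching)
```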