# The NumPy library for array operations

* The NumPy library provides efficient array data types for scientific computing.
* NumPy arrays, unlike Python's native lists, hold elements of a single type at uniform spacing in a contiguous block of memory, which makes them very fast to access.
* NumPy is one of the bedrocks of data science and scientific programming with Python.
* The SciPy library, the matplotlib visualization library, pandas, and scikit-learn all build on NumPy arrays.
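
The memory layout described above can be inspected directly. This is a minimal sketch; the array `a` and the explicit `int64` type are illustrative choices, not part of the original notebook.

```python
import numpy as np

# Illustrative array; int64 is chosen explicitly so the sizes are predictable.
a = np.array([1, 2, 3, 4], dtype=np.int64)

print(a.dtype)                   # every element shares this one type
print(a.itemsize)                # bytes per element
print(a.strides)                 # byte step between consecutive elements
print(a.flags["C_CONTIGUOUS"])   # True when stored in one contiguous block
```

A Python list, by contrast, stores references to separately allocated objects, so its elements can have different types and sizes.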

# Importing the NumPy module

In [1]:
import numpy as np

# Creating arrays

## Create an array from a list:

In [2]:
L = [1,2,3]
a = np.array(L)
a

array([1, 2, 3])

Create a 2-dimensional array from a list of lists:

In [3]:
L = [[1,2,3,4],[5,6,7,8]]
b = np.array(L)  # two-dimensional array
b

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
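
As a small illustrative addition, the attributes below report how NumPy interpreted the nested list. The integer type that is inferred depends on the platform, so it is printed rather than assumed.

```python
import numpy as np

b = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(b.ndim)   # number of dimensions
print(b.shape)  # (rows, columns)
print(b.size)   # total number of elements
print(b.dtype)  # element type inferred from the list
```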

## Create an array with initializers

In [4]:
np.ones(5)  # 5 is the length of the array.

array([1., 1., 1., 1., 1.])

Create an array of zeros with 4 rows and 3 columns.

In [5]:
np.zeros((4,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
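
The initializers return floating-point arrays by default. The sketch below shows a few related options: the `dtype` argument, `np.full`, and `np.ones_like`. The names `z`, `f`, and `o` are illustrative only.

```python
import numpy as np

z = np.zeros((4, 3), dtype=int)   # integer zeros instead of the default floats
f = np.full((2, 3), 7.5)          # every element set to the same value
o = np.ones_like(z)               # ones with the shape and dtype of z

print(z.dtype, f.shape, o.shape)
```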

Initialize with random numbers drawn uniformly from the interval [0, 1):

In [6]:
np.random.rand(3,4)  # same as np.random.random_sample((3,4))

array([[0.16378137, 0.77590153, 0.23676943, 0.91485007],
       [0.46534414, 0.02904816, 0.90726369, 0.17473491],
       [0.31905187, 0.748765  , 0.37158787, 0.38771239]])

## Create regular arrays

Generate a regular sequence from start, stop, and step size values. (Note that the end point is not included.)

In [7]:
np.arange(-1,1,0.25)

array([-1.  , -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75])

Generate a regular sequence where we specify the length of the array. (Note that the end point is included by default.)

In [8]:
np.linspace(-1,1,9)

array([-1.  , -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75,  1.  ])
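
The two functions can be made to agree. As a sketch, `linspace` accepts `endpoint=False` to drop the end point, and `retstep=True` to also return the step size it computed.

```python
import numpy as np

x_arange = np.arange(-1, 1, 0.25)                      # step given, end point excluded
x_open = np.linspace(-1, 1, 8, endpoint=False)         # length given, end point excluded
x_closed, step = np.linspace(-1, 1, 9, retstep=True)   # also return the step size

print(np.allclose(x_arange, x_open))
print(step)
```

With a non-integer step, `linspace` is often the safer choice, because the number of elements `arange` produces can be affected by floating-point rounding.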

# Array Operations

The main advantage of arrays is that an operation is applied to every element:

In [10]:
a = np.array([1,2,3,4,5])
a + 2

array([3, 4, 5, 6, 7])

Regular lists do not support this:

In [11]:
L = [1,2,3,4,5]
L + 2

TypeError: can only concatenate list (not "int") to list

Arithmetic operations on arrays are performed elementwise:

In [12]:
a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])

print("a+b =", a+b)
print("a-b =", a-b)
print("a*b =", a*b)
print("a/b =", a/b)

a+b = [ 5  7  9 11 13]
a-b = [-3 -3 -3 -3 -3]
a*b = [ 4 10 18 28 40]
a/b = [0.25       0.4        0.5        0.57142857 0.625     ]
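
Because `*` is elementwise, it is not the dot product of the two vectors. The sketch below contrasts the two, using the same `a` and `b` as above.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])

elementwise = a * b   # array of pairwise products
dot = a @ b           # dot product: the sum of the pairwise products

print(elementwise)
print(dot, elementwise.sum())
```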


Similarly, mathematical functions defined in NumPy (so-called *ufuncs*) are applied to each element.
This is called **vectorization**.

In [13]:
np.sin(a), np.sqrt(a)

(array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 , -0.95892427]),
 array([1.        , 1.41421356, 1.73205081, 2.        , 2.23606798]))

For comparison, the built-in `math` library is not vectorized:

In [14]:
import math
math.sqrt([1,2,3,4,5])

TypeError: must be real number, not list

We need to map the function explicitly to the list elements:

In [15]:
[math.sqrt(x) for x in [1,2,3,4,5]]

[1.0, 1.4142135623730951, 1.7320508075688772, 2.0, 2.23606797749979]

We can use the `%%timeit` magic to compare the speed of the two methods. In the runs below, the vectorized NumPy operation is about 10 times faster. Both timings include generating the random array.

In [16]:
%%timeit
a = np.random.rand(100000)
np.sqrt(a)

927 μs ± 80.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [17]:
%%timeit
[math.sqrt(_) for _ in np.random.rand(100000)]

9.71 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
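
The `%%timeit` magic is only available in IPython and Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used. Here the data is generated once, outside the timed code, so that only the square root is measured. The repeat count of 5 is an arbitrary choice, and the timings will differ from machine to machine.

```python
import math
import timeit

import numpy as np

data = np.random.rand(100_000)  # generated once, outside the timed code
repeats = 5                     # arbitrary number of repetitions

t_numpy = timeit.timeit(lambda: np.sqrt(data), number=repeats)
t_loop = timeit.timeit(lambda: [math.sqrt(x) for x in data], number=repeats)

print(f"np.sqrt:   {t_numpy / repeats * 1e3:.3f} ms per call")
print(f"math.sqrt: {t_loop / repeats * 1e3:.3f} ms per call")
```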


# Array indexing

Arrays are indexed and sliced in a way similar to Python lists.

## 1-dimensional arrays

In [18]:
a = np.array([2,4,6,8,10,12,14,16])
print(a)
a[0], a[2], a[-1]

[ 2  4  6  8 10 12 14 16]


(np.int64(2), np.int64(6), np.int64(16))

In [19]:
a[2:7]

array([ 6,  8, 10, 12, 14])

In [20]:
a[2:7:2]

array([ 6, 10, 14])
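
One difference from lists is worth noting: slicing a list creates a new list, while slicing an array returns a view onto the same data. A minimal sketch of the consequence (the names `s` and `c` are illustrative):

```python
import numpy as np

a = np.array([2, 4, 6, 8, 10, 12, 14, 16])

s = a[2:7]           # a view: no data is copied
s[0] = -1            # writing through the view changes a as well
print(a)
print(np.shares_memory(a, s))

c = a[2:7].copy()    # an explicit copy is independent of a
c[1] = 0
print(a[3])          # unchanged by the write to c
```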

## 2-dimensional arrays
In two (or more) dimensions we can use the `[i,j]` notation.

In [21]:
a = np.array([[1,2,-5],[3,0,4],[-1,5,6]])
a

array([[ 1,  2, -5],
       [ 3,  0,  4],
       [-1,  5,  6]])

In [22]:
a[1,2]

np.int64(4)

To select all elements along a specific axis, use a colon.

In [23]:
a[:,1] # second element in each row

array([2, 0, 5])

In [24]:
a[2,:] # all columns in third row

array([-1,  5,  6])
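
Slices can be used on both axes at once. The sketch below extracts a sub-block, the last column, and every other row from the same array; the variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2, -5], [3, 0, 4], [-1, 5, 6]])

block = a[0:2, 1:3]           # rows 0-1, columns 1-2
last_col = a[:, -1]           # negative indices count from the end
every_other_row = a[::2, :]   # rows 0 and 2

print(block)
print(last_col)
print(every_other_row)
```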

## Fancy indexing

We can select elements by providing a list of indices.

In [25]:
a = np.arange(10,21)
a

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

In [26]:
a[[1,4,7]] # get elements at index 1, 4, 7

array([11, 14, 17])
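
Two further properties of fancy indexing, shown as a sketch: the result is a copy rather than a view, and in two dimensions a pair of index lists selects individual elements. The 2-dimensional array `m` reuses the values from the previous section.

```python
import numpy as np

a = np.arange(10, 21)
picked = a[[1, 4, 7]]
picked[0] = 0                 # fancy indexing returns a copy, so a is unchanged
print(a[1])

m = np.array([[1, 2, -5], [3, 0, 4], [-1, 5, 6]])
rows = m[[0, 2]]              # whole rows 0 and 2
pairs = m[[0, 2], [1, 2]]     # elements at positions (0, 1) and (2, 2)
print(rows)
print(pairs)
```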

## Boolean indexing

We can select elements satisfying a Boolean criterion.

In [27]:
a = np.arange(10,31)
a

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30])

In [28]:
a[a>20]

array([21, 22, 23, 24, 25, 26, 27, 28, 29, 30])

The relational operator generates an array of Boolean (True/False) values.

NumPy returns the elements for which the value is `True`.

In [29]:
a>20

array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

Conditions can be combined:

In [30]:
a[(a>20) & (a<24)]  # "and" operation

array([21, 22, 23])

In [31]:
a[(a<13) | (a>25)] # "or" operation

array([10, 11, 12, 26, 27, 28, 29, 30])

In [32]:
a[~(a>20)]  # "not" operation

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
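
A Boolean mask is useful beyond selection. As a sketch, it can be counted, converted to positions, or used as the target of an assignment. The cap value of 25 is an arbitrary illustration.

```python
import numpy as np

a = np.arange(10, 31)
mask = (a > 20) & (a < 24)    # parentheses are needed around each comparison

print(mask.sum())             # number of True entries
print(np.where(mask)[0])      # positions of the True entries

b = a.copy()
b[b > 25] = 25                # assign through a mask: cap the values at 25
print(b)
```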

The same can be done with arrays of two or more dimensions. The result is a 1-dimensional array of the selected elements.

In [33]:
np.random.seed(321)  # to get the same random sequence every time
a = np.random.random((4,5))
a

array([[0.88594794, 0.07791236, 0.97964616, 0.24767146, 0.75288472],
       [0.52667564, 0.90755375, 0.8840703 , 0.08926896, 0.5173446 ],
       [0.34362129, 0.21229369, 0.36067344, 0.27077517, 0.76162502],
       [0.4780419 , 0.09899468, 0.27539478, 0.79442731, 0.51397031]])

In [34]:
a[a>0.5]

array([0.88594794, 0.97964616, 0.75288472, 0.52667564, 0.90755375,
       0.8840703 , 0.5173446 , 0.76162502, 0.79442731, 0.51397031])

Select the columns whose entry in the first row is larger than 0.5.

In [35]:
ind = a[0,:]>0.5
ind

array([ True, False,  True, False,  True])

In [36]:
a[:,ind]

array([[0.88594794, 0.97964616, 0.75288472],
       [0.52667564, 0.8840703 , 0.5173446 ],
       [0.34362129, 0.36067344, 0.76162502],
       [0.4780419 , 0.27539478, 0.51397031]])
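
The same pattern works along the other axis. The sketch below selects rows based on a condition on the first column; the small `data` array is hypothetical, chosen so that the result is easy to check by eye.

```python
import numpy as np

# Hypothetical data: each row is a sample, each column a measurement.
data = np.array([[0.9, 0.1, 0.7],
                 [0.2, 0.8, 0.4],
                 [0.6, 0.3, 0.5]])

row_mask = data[:, 0] > 0.5        # condition on the first column
selected_rows = data[row_mask, :]  # keep only the rows that satisfy it

print(row_mask)
print(selected_rows)
```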

# Statistical functions

In [37]:
a = np.array([12,  7, 10, 20, 16, 15, 18, 28, 21, 25, 23, 24, 27, 17, 5, 22, 13, 4,  1, 11])

Sum of all elements

In [38]:
np.sum(a) # or, a.sum()

np.int64(319)

The array of cumulative sums

In [39]:
np.cumsum(a) # or, a.cumsum()

array([ 12,  19,  29,  49,  65,  80,  98, 126, 147, 172, 195, 219, 246,
       263, 268, 290, 303, 307, 308, 319])

Minimum and maximum values in the array

In [40]:
np.min(a), np.max(a) # or, a.min(), a.max()

(np.int64(1), np.int64(28))

Indices of minimum and maximum elements in the array

In [41]:
np.argmin(a), np.argmax(a)

(np.int64(18), np.int64(7))

The range (peak-to-peak difference) of the array

In [42]:
np.ptp(a)

np.int64(27)

The mean, median and standard deviation

In [43]:
np.mean(a), np.median(a), np.std(a)

(np.float64(15.95), np.float64(16.5), np.float64(7.742577090349182))
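
Note that `np.std` computes the population standard deviation by default (it divides by N). A sketch of how to obtain the sample standard deviation with `ddof=1`, together with a check of the default against the definition:

```python
import numpy as np

a = np.array([12, 7, 10, 20, 16, 15, 18, 28, 21, 25,
              23, 24, 27, 17, 5, 22, 13, 4, 1, 11])

pop_std = np.std(a)             # divides by N (the default, ddof=0)
sample_std = np.std(a, ddof=1)  # divides by N - 1

# The population value computed directly from the definition
manual = np.sqrt(np.mean((a - a.mean()) ** 2))

print(pop_std, sample_std, manual)
```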

The first and third quartile values

In [44]:
np.percentile(a,25), np.percentile(a,75)

(np.float64(10.75), np.float64(22.25))
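
Several percentiles can be requested in one call by passing a list. The sketch below also derives the interquartile range, a quantity the original notebook does not compute.

```python
import numpy as np

a = np.array([12, 7, 10, 20, 16, 15, 18, 28, 21, 25,
              23, 24, 27, 17, 5, 22, 13, 4, 1, 11])

q1, med, q3 = np.percentile(a, [25, 50, 75])  # all three quartiles at once
iqr = q3 - q1                                 # interquartile range

print(q1, med, q3, iqr)
```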

Weighted average of an array

In [45]:
np.average([1,2,1,2,3], weights=[0.1, 0.1, 0.25, 0.25,0.5])

np.float64(2.125)
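
The weighted average is the sum of weight times value, divided by the sum of the weights. The weights therefore need not add up to 1 (here they add up to 1.2). A sketch of the same computation written out by hand:

```python
import numpy as np

x = np.array([1, 2, 1, 2, 3])
w = np.array([0.1, 0.1, 0.25, 0.25, 0.5])

manual = np.sum(w * x) / np.sum(w)   # sum(w_i * x_i) / sum(w_i)
builtin = np.average(x, weights=w)

print(manual, builtin)
```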

# Random numbers

Uniformly distributed random numbers between 0 and 1.

In [46]:
np.random.rand(10)

array([0.45329481, 0.25515125, 0.1139766 , 0.82431305, 0.3177535 ,
       0.15230703, 0.21497959, 0.91211032, 0.04311515, 0.37595241])

In [47]:
np.random.rand(3,5)

array([[0.31796557, 0.35403302, 0.93335757, 0.3885452 , 0.89593944],
       [0.14550322, 0.4903603 , 0.9233404 , 0.8013113 , 0.84837182],
       [0.66544598, 0.14321914, 0.11609391, 0.07739594, 0.38291192]])

Uniformly distributed numbers between given limits

In [48]:
np.random.uniform(-2,2,10)

array([-1.41428059, -0.20857075, -0.57789057, -0.26743228,  1.20322655,
       -0.57997729, -1.8089976 ,  1.39831358,  0.49370272, -1.43360428])

Normally distributed numbers. The default is mean 0 and standard deviation 1; `loc` and `scale` set a different mean and standard deviation.

In [49]:
np.random.normal(size = 20)

array([ 1.69462233,  0.11776364, -0.17195471,  0.01965903,  0.97882293,
       -0.40133133,  0.57007085,  0.86167782, -2.05625746,  0.1875131 ,
        1.12667228,  0.40734397,  2.16171343,  0.27018242, -0.07639843,
        1.20306559,  0.97070569,  1.03987951, -0.56445753,  0.45275902])

In [50]:
np.random.normal(loc=1, scale=2, size=(4,5))

array([[ 1.53909317, -0.54037335,  1.24018008, -1.07891589,  4.08025415],
       [ 0.04623951, -0.41051629,  2.50610179,  0.67669923,  3.12599829],
       [ 1.40707539, -1.25923409,  0.79897728,  1.38759348, -1.41053601],
       [ 3.06309689,  1.26099733,  1.58688438,  1.8357458 ,  1.48535766]])

Random integers between given limits (`low` is included, `high` is excluded)

In [51]:
np.random.randint(low=-2, high=5, size=20)

array([ 3,  1, -2, -1, -1,  3, -1, -1, -2,  2,  0, -2,  0,  0,  2,  2,  4,
        0,  3,  0])

Randomly select elements from an array-like object (with replacement by default).

In [52]:
np.random.choice(["a","b","c","d"], 10)

array(['b', 'a', 'b', 'c', 'c', 'c', 'a', 'd', 'a', 'c'], dtype='<U1')

Selecting without replacement.

In [53]:
np.random.choice(np.arange(1,50), 6, replace=False)  # 6/49 lotto draw

array([36, 42, 48, 27, 47, 15])
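
The functions above use NumPy's legacy global random state. NumPy also provides a `Generator` interface, created with `np.random.default_rng`. The sketch below repeats the draws from this section with a seeded generator; the seed 321 is arbitrary, and the numbers it produces will differ from the outputs shown above.

```python
import numpy as np

rng = np.random.default_rng(321)  # seeded generator; the seed value is arbitrary

u = rng.random((3, 5))                        # uniform on [0, 1)
v = rng.uniform(-2, 2, 10)                    # uniform on [-2, 2)
n = rng.normal(loc=1, scale=2, size=(4, 5))   # mean 1, standard deviation 2
k = rng.integers(low=-2, high=5, size=20)     # high is excluded, as in randint
lotto = rng.choice(np.arange(1, 50), 6, replace=False)  # no repeated numbers

print(u.shape, v.shape, n.shape, k.shape)
print(lotto)
```

Creating the generator again with the same seed reproduces the same sequence, which plays the role of `np.random.seed` in the legacy interface.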

# Reading from and writing to CSV files

Read a simple whitespace-delimited file and load it into a NumPy array.

In [54]:
%%writefile grades.csv
86 48 75
75 66 99
27 63 61
85 74 78
66 55 49

Overwriting grades.csv


In [55]:
grades = np.loadtxt("grades.csv")
grades

array([[86., 48., 75.],
       [75., 66., 99.],
       [27., 63., 61.],
       [85., 74., 78.],
       [66., 55., 49.]])

Multiply the grades by 2 and write the result, formatted with two decimal places, to a new file *grades2.csv*.

In [56]:
np.savetxt("grades2.csv", grades*2, fmt="%.2f")
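
To see what `savetxt` actually writes, the sketch below saves a small array to a temporary directory, prints the raw file contents, and reads the file back. The two-row array and the use of `tempfile` are illustrative choices that keep the example from touching the notebook's own files.

```python
import os
import tempfile

import numpy as np

grades = np.array([[86., 48., 75.],
                   [75., 66., 99.]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grades2.csv")
    np.savetxt(path, grades * 2, fmt="%.2f")   # space-delimited by default
    with open(path) as fh:
        text = fh.read()                       # the raw file contents
    reloaded = np.loadtxt(path)                # read the file back into an array

print(text)
print(reloaded)
```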

Load the Wholesale customers data:
```
Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
2,3,6353,8808,7684,2405,3516,7844
...
```

* `loadtxt` requires all values to be numeric, so skip the header row.
* Set the delimiter to a comma.
* Set the data type to integer (the default is float).

In [57]:
wcd = np.loadtxt("data_ex8_3.csv", delimiter=",",
                 skiprows=1, dtype=int)
wcd

array([[    2,     3, 12669, ...,   214,  2674,  1338],
       [    2,     3,  7057, ...,  1762,  3293,  1776],
       [    2,     3,  6353, ...,  2405,  3516,  7844],
       ...,
       [    1,     2,  9351, ...,  8170,   442,   868],
       [    1,     2,     3, ..., 15601,    15,   550],
       [    1,     2,  2617, ...,  9584,   573,  1942]], shape=(340, 8))
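
The options can be tried without the data file, since `loadtxt` also accepts a file-like object. This sketch feeds it only the header and the three rows quoted above, and additionally uses `usecols` to pick out the Fresh and Milk columns.

```python
import io

import numpy as np

csv_text = """Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
2,3,12669,9656,7561,214,2674,1338
2,3,7057,9810,9568,1762,3293,1776
2,3,6353,8808,7684,2405,3516,7844
"""

# Skip the header, split on commas, and read integers
sample = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=int)

# Read only columns 2 and 3 (Fresh and Milk)
fresh_and_milk = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                            skiprows=1, usecols=[2, 3], dtype=int)

print(sample.shape)
print(fresh_and_milk)
```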

# Data analysis

In [58]:
grades = np.loadtxt("data_ex8_1.csv", delimiter=",",
                 skiprows=1, usecols=[2,3,4,5],dtype=int)
grades

array([[62, 71, 51, 81],
       [58, 62, 60, 61],
       [63, 51, 50, 83],
       [61, 53, 55, 87],
       [49, 60, 51, 68],
       [50, 68, 62, 52],
       [59, 70, 44, 47],
       [70, 59, 48, 78],
       [47, 67, 35, 67],
       [66, 65, 52, 67],
       [67, 46, 64, 68],
       [44, 68, 57, 73],
       [56, 57, 41, 62],
       [60, 73, 56, 82],
       [54, 64, 50, 66],
       [75, 82, 41, 54],
       [63, 54, 57, 66],
       [64, 70, 48, 86],
       [74, 69, 44, 74],
       [57, 38, 56, 75],
       [55, 35, 47, 50],
       [51, 63, 33, 69],
       [58, 61, 32, 63],
       [65, 55, 46, 59],
       [52, 49, 47, 69],
       [57, 61, 60, 72],
       [56, 60, 43, 65],
       [65, 66, 57, 68],
       [66, 57, 17, 72],
       [66, 61, 49, 69],
       [57, 60, 59, 67],
       [38, 65, 33, 62],
       [38, 59, 51, 62],
       [55, 66, 50, 61],
       [53, 61, 60, 81],
       [63, 64, 54, 61],
       [72, 68, 46, 70],
       [66, 65, 49, 62],
       [59, 66, 47, 82],
       [46, 54, 58, 82],


In [59]:
np.mean(grades)

np.float64(59.545)

In [60]:
np.mean(grades, axis=0)

array([58.19, 60.64, 50.3 , 69.05])

In [61]:
np.mean(grades, axis=1)

array([66.25, 60.25, 61.75, 64.  , 57.  , 58.  , 55.  , 63.75, 54.  ,
       62.5 , 61.25, 60.5 , 54.  , 67.75, 58.5 , 63.  , 60.  , 67.  ,
       65.25, 56.5 , 46.75, 54.  , 53.5 , 56.25, 54.25, 62.5 , 56.  ,
       64.  , 53.  , 61.25, 60.75, 49.5 , 52.5 , 58.  , 63.75, 60.5 ,
       64.  , 60.5 , 63.5 , 60.  , 63.75, 59.5 , 59.  , 62.  , 57.  ,
       65.75, 64.25, 53.25, 50.75, 55.5 , 61.5 , 60.5 , 61.75, 63.75,
       50.  , 57.25, 71.25, 59.5 , 58.  , 69.75, 60.5 , 54.25, 58.5 ,
       61.  , 55.75, 58.  , 68.  , 53.75, 60.5 , 60.5 , 60.25, 60.75,
       58.5 , 62.5 , 58.75, 56.75, 59.75, 64.  , 55.5 , 57.5 , 58.75,
       51.5 , 52.  , 61.  , 53.5 , 55.  , 60.25, 60.5 , 58.  , 62.  ,
       59.25, 64.25, 58.5 , 65.75, 53.25, 68.75, 62.5 , 64.25, 59.75,
       66.  ])
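
The `axis` argument names the axis that is collapsed. The sketch below uses only the first three rows of the `grades` array shown above, so the results are easy to verify by hand. Reading rows as students and columns as exams is an assumption; the data file itself is not described in this notebook.

```python
import numpy as np

# First three rows of the grades array shown above
g = np.array([[62, 71, 51, 81],
              [58, 62, 60, 61],
              [63, 51, 50, 83]])

overall = np.mean(g)             # one number for the whole array
per_column = np.mean(g, axis=0)  # collapses the rows: one mean per column
per_row = np.mean(g, axis=1)     # collapses the columns: one mean per row

print(overall)
print(per_column)
print(per_row)
```

The per-row means of these three rows match the first three entries of the `axis=1` output above.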

In [62]:
np.median(grades, axis=0)

array([58. , 61.5, 50. , 68. ])

In [63]:
np.percentile(grades, q=25, axis=0)

array([52., 55., 45., 62.])