Let us go deeper into the NumPy library for array computing. We will discuss array manipulations and algorithms in more detail.

The NumPy ndarray consists of the following:

• A pointer to data—that is, a block of data in RAM or in a memory-mapped file

• The data type or dtype, describing fixed-size value cells in the array

• A tuple indicating the array’s shape

• A tuple of strides, integers indicating the number of bytes to “step” in order to advance one element along a dimension
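
As a small illustration of the four pieces listed above, the sketch below inspects the dtype, shape, and strides of an array. The array and its shape are chosen only for this example; the stride values assume the default C (row-major) memory layout.

```python
import numpy as np

# A 3 x 4 x 5 array of float64 (8 bytes per element), in the default C order
arr = np.ones((3, 4, 5), dtype=np.float64)

print(arr.dtype)    # float64
print(arr.shape)    # (3, 4, 5)
print(arr.strides)  # (160, 40, 8)

# Transposing only swaps the shape and strides; the data block is shared
print(arr.T.strides)                # (8, 40, 160)
print(np.shares_memory(arr, arr.T)) # True
```

Stepping one element along the last axis moves 8 bytes, along the middle axis 5 * 8 = 40 bytes, and along the first axis 4 * 5 * 8 = 160 bytes.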

Sometimes you may need to check whether an array contains integers, floating-point numbers, strings, or Python objects.
As you may know from our previous sessions, there are multiple types of floating-point numbers (float16 through float128).
To check data types, there are superclasses such as np.integer and np.floating, which can be used in conjunction with the np.issubdtype function.

In [1]:
import numpy as np

In [2]:
ints = np.ones(10, dtype = np.uint16)

In [3]:
floats = np.ones(10,dtype = np.float32)

In [4]:
np.issubdtype(ints.dtype, np.integer) #Checks if ints is integer

True

In [5]:
np.issubdtype(floats.dtype, np.floating)

True

In [6]:
# See all the parent classes by calling the type's mro method
np.float64.mro()

[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]

In [7]:
np.int32.mro()

[numpy.int32,
 numpy.signedinteger,
 numpy.integer,
 numpy.number,
 numpy.generic,
 object]
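
The superclass checks above can be wrapped in a small helper. This is only a sketch: `kind_of` is a hypothetical name introduced for this example, not a NumPy function.

```python
import numpy as np

def kind_of(arr):
    """Hypothetical helper: name the broad category of an array's dtype."""
    if np.issubdtype(arr.dtype, np.integer):
        return "integer"
    if np.issubdtype(arr.dtype, np.floating):
        return "floating"
    return "other"

# Every integer width maps to "integer" and every float width to "floating"
kinds = [kind_of(np.ones(3, dtype=dt))
         for dt in (np.uint16, np.int64, np.float16, np.float64)]
print(kinds)                           # ['integer', 'integer', 'floating', 'floating']
print(kind_of(np.array(['a', 'b'])))   # other
```

Because the check is made against the superclass, the helper does not need to list every concrete type such as float16, float32, and float64.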

### Advanced Array Manipulation
Although much of the heavy lifting for data analysis applications is handled by higher-level functions in pandas, we may at some point need to write a data algorithm that is not found in one of the existing libraries.

### Reshaping Arrays

In [8]:
arr = np.arange(8)

In [9]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7])

In [10]:
arr.reshape(4,2) # Elements fill the new shape in row-major (C) order

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

![image.png](attachment:image.png)

In [11]:
# A multidimensional array can be reshaped too
arr = arr.reshape((4,2))
arr.reshape(2,4)

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

One of the shape dimensions passed to reshape can be -1, in which case the value used for that dimension will be inferred from the data.

In [12]:
arr = np.arange(15)
arr.reshape(5,-1)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

Since an array's shape attribute is a tuple, it can be passed to reshape, too.

In [13]:
other_arr = np.ones((3,5))
other_arr.shape

(3, 5)

In [14]:
arr.reshape(other_arr.shape)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

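The opposite operation of reshape from one-dimensional to a higher dimension is known as flattening or raveling. ravel does not produce a copy of the underlying values if the values in the result were contiguous in the original array. The flatten method behaves like ravel, except that it always returns a copy of the data.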

In [15]:
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [16]:
arr.flatten()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
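
The difference between ravel and flatten is not visible in the printed output, so here is a minimal check using np.shares_memory. The array below is made up for this example.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
raveled = arr.ravel()      # a view here, because arr is contiguous
flattened = arr.flatten()  # always a copy

print(np.shares_memory(arr, raveled))    # True
print(np.shares_memory(arr, flattened))  # False

# Writing through the raveled view changes arr; the flattened copy is unaffected
raveled[0] = 99
print(arr[0, 0], flattened[0])           # 99 0

# Raveling a transposed array needs a copy, since its values are not contiguous in C order
print(np.shares_memory(arr, arr.T.ravel()))  # False
```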

### Concatenating and Splitting Arrays

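numpy.concatenate takes a sequence of arrays and joins them in order along the given axis. There are also convenience functions, such as vstack and hstack, for common kinds of concatenation.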

In [17]:
arr_1 = np.array([[1, 2, 3], [4, 5, 6]])
arr_2 = np.array([[7, 8, 9], [10, 11, 12]])

In [18]:
np.concatenate([arr_1,arr_2], axis = 0)

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [19]:
np.vstack((arr_1,arr_2))

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [20]:
np.hstack((arr_1,arr_2))

array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

In [21]:
arr = np.random.randn(5,2)
arr

array([[ 0.32625394, -1.83969914],
       [ 0.98725594, -0.28305893],
       [-1.84786857, -0.44970751],
       [-1.76751707, -2.22571328],
       [ 0.24366421,  1.00805949]])

In [22]:
first,second, third = np.split(arr,[1,3])

The value [1, 3] passed to np.split indicates the indices at which to split the array into pieces.

In [23]:
first

array([[ 0.32625394, -1.83969914]])

In [24]:
second

array([[ 0.98725594, -0.28305893],
       [-1.84786857, -0.44970751]])

In [25]:
third

array([[-1.76751707, -2.22571328],
       [ 0.24366421,  1.00805949]])

![image.png](attachment:image.png)

In [26]:
arr = np.arange(6)

In [27]:
arr1 = arr.reshape((3,2))

In [28]:
arr2 = np.random.randn(3,2)

In [29]:
r = np.r_[arr1,arr2]
r

array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [ 2.38332109,  0.81701595],
       [-0.47924417,  0.87462366],
       [ 0.4568432 ,  0.57400828]])

In [30]:
np.c_[r,arr]

array([[ 0.        ,  1.        ,  0.        ],
       [ 2.        ,  3.        ,  1.        ],
       [ 4.        ,  5.        ,  2.        ],
       [ 2.38332109,  0.81701595,  3.        ],
       [-0.47924417,  0.87462366,  4.        ],
       [ 0.4568432 ,  0.57400828,  5.        ]])

In [31]:
np.c_[1:6,-10:-5]

array([[  1, -10],
       [  2,  -9],
       [  3,  -8],
       [  4,  -7],
       [  5,  -6]])
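
The r_ and c_ objects used in the cells above are shorthand for stacking. The sketch below, with arrays made up for this example, compares them against the explicit function calls they correspond to for these inputs.

```python
import numpy as np

a = np.arange(6).reshape((3, 2))
b = np.arange(6, 12).reshape((3, 2))
col = np.arange(3)

stacked_rows = np.r_[a, b]        # same as np.concatenate([a, b], axis=0)
stacked_cols = np.c_[a, col]      # same as np.column_stack((a, col))
from_slices = np.c_[1:6, -10:-5]  # slices are expanded like np.arange

print(stacked_rows.shape)  # (6, 2)
print(stacked_cols.shape)  # (3, 3)
print(from_slices.shape)   # (5, 2)
```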

In [32]:
arr =np.arange(3)
arr.repeat(3)

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [33]:
arr.repeat([2,3,4]) # 0 is repeated 2 times, 1 is repeated 3 times, and 2 is repeated 4 times

array([0, 0, 1, 1, 1, 2, 2, 2, 2])

In [34]:
# For 2D 
arr = np.random.randn(2,2)
arr

array([[-0.44380556,  0.32672173],
       [-1.10361628,  0.54648908]])

In [35]:
arr.repeat(2, axis = 0)

array([[-0.44380556,  0.32672173],
       [-0.44380556,  0.32672173],
       [-1.10361628,  0.54648908],
       [-1.10361628,  0.54648908]])

In [36]:
arr.repeat(2, axis = 1) 

array([[-0.44380556, -0.44380556,  0.32672173,  0.32672173],
       [-1.10361628, -1.10361628,  0.54648908,  0.54648908]])

In [37]:
arr.repeat([2,3], axis = 0)

array([[-0.44380556,  0.32672173],
       [-0.44380556,  0.32672173],
       [-1.10361628,  0.54648908],
       [-1.10361628,  0.54648908],
       [-1.10361628,  0.54648908]])

In [38]:
arr.repeat([2,3], axis = 1)

array([[-0.44380556, -0.44380556,  0.32672173,  0.32672173,  0.32672173],
       [-1.10361628, -1.10361628,  0.54648908,  0.54648908,  0.54648908]])

tile is a shortcut for stacking copies of an array along an axis. The second argument is the number of tiles; with a scalar, the tiling is made row by row, rather than column by column. The second argument to tile can also be a tuple indicating the layout of the “tiling”.

In [39]:
np.tile(arr, 2)

array([[-0.44380556,  0.32672173, -0.44380556,  0.32672173],
       [-1.10361628,  0.54648908, -1.10361628,  0.54648908]])

In [40]:
np.tile(arr,(2,1))

array([[-0.44380556,  0.32672173],
       [-1.10361628,  0.54648908],
       [-0.44380556,  0.32672173],
       [-1.10361628,  0.54648908]])

In [41]:
np.tile(arr,(3,2))

array([[-0.44380556,  0.32672173, -0.44380556,  0.32672173],
       [-1.10361628,  0.54648908, -1.10361628,  0.54648908],
       [-0.44380556,  0.32672173, -0.44380556,  0.32672173],
       [-1.10361628,  0.54648908, -1.10361628,  0.54648908],
       [-0.44380556,  0.32672173, -0.44380556,  0.32672173],
       [-1.10361628,  0.54648908, -1.10361628,  0.54648908]])
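
To make the contrast between repeat and tile easier to see than with random values, here is the same pair of operations on a small integer array chosen for this example.

```python
import numpy as np

arr = np.array([0, 1, 2])

repeated = np.repeat(arr, 2)  # each element is repeated in place
tiled = np.tile(arr, 2)       # the whole array is laid down twice

print(repeated)  # [0 0 1 1 2 2]
print(tiled)     # [0 1 2 0 1 2]
```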

### Fancy Indexing 

In [42]:
arr = np.arange(10)*100

In [43]:
inds = [1,2,4,5]

In [44]:
arr[inds]

array([100, 200, 400, 500])

In [45]:
arr.put(inds, 42)

In [46]:
arr

array([  0,  42,  42, 300,  42,  42, 600, 700, 800, 900])

In [47]:
arr.put(inds,[40,41,42,43])

In [48]:
arr

array([  0,  40,  41, 300,  42,  43, 600, 700, 800, 900])

In [49]:
inds = [2,1,3,1]

In [50]:
arr = np.random.randn(2,4)

In [51]:
arr

array([[ 0.28361532, -0.11443697,  0.21524197,  0.37933352],
       [ 1.11400527, -0.02489187, -0.79798509,  0.40299933]])

In [52]:
arr.take(inds, axis = 1)

array([[ 0.21524197, -0.11443697,  0.37933352, -0.11443697],
       [-0.79798509, -0.02489187,  0.40299933, -0.02489187]])
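
The take and put methods used above mirror fancy indexing. The sketch below checks the equivalence on made-up arrays; note that put has no axis argument and indexes into the flattened array.

```python
import numpy as np

arr = np.arange(10) * 100
inds = [1, 2, 4, 5]

# take selects the same elements as fancy indexing
same_selection = np.array_equal(arr.take(inds), arr[inds])

# put assigns in place, like fancy-index assignment
via_put = arr.copy()
via_put.put(inds, [40, 41, 42, 43])
via_index = arr.copy()
via_index[inds] = [40, 41, 42, 43]

# With axis=1, take matches indexing the columns
mat = np.arange(8).reshape((2, 4))
cols = mat.take([2, 1, 3, 1], axis=1)

print(same_selection)  # True
print(via_put)         # [  0  40  41 300  42  43 600 700 800 900]
```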

### Broadcasting
Broadcasting describes how arithmetic works between arrays of different shapes.

In [53]:
arr = np.random.randn(4,3)
arr

array([[ 0.56160873, -0.09015331, -1.72124332],
       [ 0.77894036,  0.58058981,  0.2836494 ],
       [ 1.41709836, -1.96539168,  0.29572325],
       [ 0.73875489, -2.11831264, -0.5614973 ]])

In [54]:
arr.mean(0)

array([ 0.87410059, -0.89831696, -0.42584199])

In [55]:
dearr = arr - arr.mean(0)

In [56]:
dearr

array([[-0.31249186,  0.80816365, -1.29540132],
       [-0.09516022,  1.47890676,  0.70949139],
       [ 0.54299778, -1.06707472,  0.72156524],
       [-0.13534569, -1.21999569, -0.13565531]])

In [57]:
dearr.mean(0) # The column means are now zero, up to floating-point error

array([-2.77555756e-17,  0.00000000e+00,  1.38777878e-17])

### The Broadcasting Rule
Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is then performed over the missing or length 1 dimensions.
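
The rule can be sketched as a small function that walks two shapes from the trailing dimension backward. `broadcast_shape` is a hypothetical helper written for this illustration; NumPy applies the rule internally and does not need it.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rule to two shapes."""
    result = []
    # Compare axis lengths starting from the end; a missing axis counts as length 1
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"incompatible axis lengths {a} and {b}")
        result.append(b if a == 1 else a)
    return tuple(reversed(result))

print(broadcast_shape((4, 3), (3,)))       # (4, 3)
print(broadcast_shape((4, 3), (4, 1)))     # (4, 3)
print(broadcast_shape((3, 4, 2), (4, 2)))  # (3, 4, 2)

try:
    broadcast_shape((4, 3), (4,))  # trailing lengths 3 and 4 do not match
except ValueError as exc:
    print("not compatible:", exc)
```

The last case shows why subtracting row means of shape (4,) from a (4, 3) array fails until the means are reshaped to (4, 1).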


![image.png](attachment:image.png)


In [58]:
arr 

array([[ 0.56160873, -0.09015331, -1.72124332],
       [ 0.77894036,  0.58058981,  0.2836494 ],
       [ 1.41709836, -1.96539168,  0.29572325],
       [ 0.73875489, -2.11831264, -0.5614973 ]])

In [59]:
row_means = arr.mean(1)
row_means.shape

(4,)

In [60]:
row_mean = row_means.reshape((4,1))

In [61]:
dearr = arr - row_mean

In [62]:
dearr.mean(1)

array([ 0.00000000e+00,  0.00000000e+00, -5.55111512e-17, -7.40148683e-17])

See the figure below for another illustration, this time adding a two-dimensional array to a three-dimensional one across axis 0.
![image.png](attachment:image.png)
![image.png](attachment:image.png)

To learn more about broadcasting, see:
https://jakevdp.github.io/PythonDataScienceHandbook/02.05-computation-on-arrays-broadcasting.html

In [63]:
arr = np.zeros((4,4)) # The shape is passed as a tuple (x, y), where x and y are integers

In [64]:
arr3d = arr[:,np.newaxis,:]

We use the special np.newaxis attribute along with “full” slices to insert the new axis

In [65]:
arr3d.shape

(4, 1, 4)

In [66]:
arr3d # The new axis of length 1 sits between the two original axes

array([[[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]]])

In [67]:
arr1d = np.random.normal(size = 3)

In [68]:
arr1d[:, np.newaxis]

array([[ 0.42381188],
       [-1.0491771 ],
       [ 1.6601505 ]])

In [69]:
arr1d[np.newaxis, :]

array([[ 0.42381188, -1.0491771 ,  1.6601505 ]])

![image.png](attachment:image.png)

In [70]:
# To demean along axis 2, we can write the following
arr = np.random.randn(3,4,5)
arr

array([[[ 1.21467048, -1.37896326,  0.59295685,  0.54908628,
         -1.01547873],
        [ 0.26440978, -0.36000685,  0.00644868, -0.58845657,
          0.09285069],
        [ 0.12471863, -0.41191218,  0.34160397, -0.69223213,
          0.29162891],
        [-0.35257296, -1.19059213,  1.43920473,  0.15793257,
         -0.10594664]],

       [[ 0.69246087, -0.82101283, -1.84639206,  0.50911123,
          0.10970794],
        [-2.19386294,  0.14791042,  0.99888599, -0.03083411,
          0.84940784],
        [ 1.30662848, -0.03692118,  0.32776931,  1.2851988 ,
         -0.30430415],
        [ 0.3331467 , -0.67193516, -0.03713158,  3.42742645,
         -0.6233804 ]],

       [[ 0.92904544, -0.98235397,  0.32955699, -0.45132798,
          1.63703526],
        [ 0.52837574, -0.1171106 , -0.17705816, -0.51373544,
          0.06328441],
        [ 0.44332857,  0.20215225,  0.56859325,  1.63305489,
         -0.54868263],
        [ 0.02236786,  2.53124575,  1.04609538,  0.20243989,
         -1

In [71]:
depth_means = arr.mean(2)

In [72]:
depth_means.shape

(3, 4)

In [73]:
depth_means

array([[-0.00754568, -0.11695085, -0.06923856, -0.01039488],
       [-0.27122497, -0.04569856,  0.51567425,  0.4856252 ],
       [ 0.29239115, -0.04324881,  0.45968927,  0.54767485]])

In [74]:
demeaned = arr - depth_means[:,:, np.newaxis]
demeaned.shape

(3, 4, 5)

In [75]:
demeaned

array([[[ 1.22221616, -1.37141759,  0.60050253,  0.55663196,
         -1.00793305],
        [ 0.38136064, -0.243056  ,  0.12339953, -0.47150572,
          0.20980155],
        [ 0.19395719, -0.34267362,  0.41084253, -0.62299357,
          0.36086747],
        [-0.34217807, -1.18019724,  1.44959962,  0.16832746,
         -0.09555176]],

       [[ 0.96368584, -0.54978786, -1.57516709,  0.7803362 ,
          0.38093291],
        [-2.14816438,  0.19360898,  1.04458455,  0.01486445,
          0.8951064 ],
        [ 0.79095423, -0.55259543, -0.18790494,  0.76952455,
         -0.8199784 ],
        [-0.1524785 , -1.15756036, -0.52275678,  2.94180125,
         -1.10900561]],

       [[ 0.63665429, -1.27474512,  0.03716584, -0.74371913,
          1.34464411],
        [ 0.57162455, -0.07386179, -0.13380935, -0.47048663,
          0.10653322],
        [-0.0163607 , -0.25753702,  0.10890398,  1.17336563,
         -1.00837189],
        [-0.52530699,  1.9835709 ,  0.49842054, -0.34523496,
         -1

In [76]:
demeaned.mean(2)

array([[ 4.44089210e-17, -1.11022302e-17, -1.11022302e-17,
        -4.44089210e-17],
       [-1.11022302e-17, -2.22044605e-17,  0.00000000e+00,
         0.00000000e+00],
       [ 0.00000000e+00, -1.38777878e-17,  4.44089210e-17,
         4.44089210e-17]])

You might be wondering if there’s a way to generalize demeaning over an axis without
sacrificing performance. There is, but it requires some indexing gymnastics

In [77]:
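def demean_axis(arr, axis = 2):
    means = arr.mean(axis)

    # Build an index like [:, :, np.newaxis], with np.newaxis at the reduced axis
    index = [slice(None)]*arr.ndim
    index[axis] = np.newaxis
    # Index with a tuple; recent NumPy versions reject a plain list of slices
    return arr - means[tuple(index)]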

In [78]:
a = demean_axis(arr)

  


In [79]:
a.mean(2)

array([[ 4.44089210e-17, -1.11022302e-17, -1.11022302e-17,
        -4.44089210e-17],
       [-1.11022302e-17, -2.22044605e-17,  0.00000000e+00,
         0.00000000e+00],
       [ 0.00000000e+00, -1.38777878e-17,  4.44089210e-17,
         4.44089210e-17]])
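
A shorter route to the same result is the keepdims argument of mean, which leaves the reduced axis in place with length 1 so the result broadcasts directly. This is an alternative sketch, not the approach used above; `demean_keepdims` is a name introduced for this example.

```python
import numpy as np

def demean_keepdims(arr, axis=2):
    # keepdims=True keeps the reduced axis as length 1, so no manual indexing is needed
    return arr - arr.mean(axis, keepdims=True)

rng = np.random.default_rng(0)
sample = rng.standard_normal((3, 4, 5))

centered = demean_keepdims(sample, axis=2)
print(centered.shape)                      # (3, 4, 5)
print(np.allclose(centered.mean(2), 0.0))  # True
```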

### Setting Array Values by Broadcasting

In [80]:
arr = np.zeros((4,3))

In [81]:
arr[:] = 5

In [82]:
arr

array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])

In [83]:
column = np.array([1.28,5.67,0.44,0.89])

In [84]:
arr[:] = column[:,np.newaxis]

In [85]:
arr # Each row is filled with the corresponding value from column

array([[1.28, 1.28, 1.28],
       [5.67, 5.67, 5.67],
       [0.44, 0.44, 0.44],
       [0.89, 0.89, 0.89]])

In [86]:
arr[:2] = [[-1.44],[0.54]]

In [87]:
arr

array([[-1.44, -1.44, -1.44],
       [ 0.54,  0.54,  0.54],
       [ 0.44,  0.44,  0.44],
       [ 0.89,  0.89,  0.89]])

In [88]:
arr[2:] = [[-1.44],[0.54]]

In [89]:
arr

array([[-1.44, -1.44, -1.44],
       [ 0.54,  0.54,  0.54],
       [-1.44, -1.44, -1.44],
       [ 0.54,  0.54,  0.54]])

### Advanced ufunc Usage

In [90]:
arr = np.arange(10)

In [91]:
np.add.reduce(arr) # Alternative way to sum elements in an array

45

In [92]:
arr.sum()

45

In [93]:
np.random.seed(1234)

In [94]:
arr = np.random.randn(5,5)

In [95]:
arr[::2].sort(1) # sort a few rows
arr

array([[-1.19097569e+00, -7.20588733e-01, -3.12651896e-01,
         4.71435164e-01,  1.43270697e+00],
       [ 8.87162940e-01,  8.59588414e-01, -6.36523504e-01,
         1.56963721e-02, -2.24268495e+00],
       [-2.02125482e+00, -3.34077366e-01,  9.53324128e-01,
         9.91946022e-01,  1.15003572e+00],
       [ 2.11836468e-03,  4.05453412e-01,  2.89091941e-01,
         1.32115819e+00, -1.54690555e+00],
       [-6.55969344e-01, -2.02646325e-01,  1.93421376e-01,
         5.53438911e-01,  1.31815155e+00]])

In [96]:
b = arr[:,:-1] < arr[:, 1:]

In [97]:
b

array([[ True,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True,  True,  True],
       [ True, False,  True, False],
       [ True,  True,  True,  True]])

In [98]:
arr[:,:-1]

array([[-1.19097569, -0.72058873, -0.3126519 ,  0.47143516],
       [ 0.88716294,  0.85958841, -0.6365235 ,  0.01569637],
       [-2.02125482, -0.33407737,  0.95332413,  0.99194602],
       [ 0.00211836,  0.40545341,  0.28909194,  1.32115819],
       [-0.65596934, -0.20264632,  0.19342138,  0.55343891]])

In [99]:
arr[:, 1:]

array([[-0.72058873, -0.3126519 ,  0.47143516,  1.43270697],
       [ 0.85958841, -0.6365235 ,  0.01569637, -2.24268495],
       [-0.33407737,  0.95332413,  0.99194602,  1.15003572],
       [ 0.40545341,  0.28909194,  1.32115819, -1.54690555],
       [-0.20264632,  0.19342138,  0.55343891,  1.31815155]])

In [100]:
# logical_and.reduce is equivalent to the all method.
np.logical_and.reduce(b, axis=1)

array([ True, False,  True, False,  True])

accumulate is related to reduce like cumsum is related to sum. It produces an array of the same size with the intermediate “accumulated” values

In [101]:
arr = np.arange(15).reshape((3,5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [102]:
np.add.accumulate(arr,axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]], dtype=int32)
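
Since accumulate is a method of every binary ufunc, it is not limited to addition. The sketch below checks the cumsum equivalence and then uses np.maximum to compute a running maximum; the small input array is made up for this example.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))

# add.accumulate along an axis gives the same values as cumsum
matches_cumsum = np.array_equal(np.add.accumulate(arr, axis=1), arr.cumsum(axis=1))
print(matches_cumsum)  # True

# The same method on np.maximum yields a running maximum
running_max = np.maximum.accumulate(np.array([3, 1, 4, 1, 5]))
print(running_max)     # [3 3 4 4 5]
```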

In [103]:
arr = np.arange(3).repeat([1, 2, 2])
arr

array([0, 1, 1, 2, 2])

In [104]:
np.multiply.outer(arr,np.arange(5))

array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8],
       [0, 2, 4, 6, 8]])

In [105]:
x, y = np.random.randn(3, 4), np.random.randn(5)
result = np.subtract.outer(x, y)
print(x,y)

[[-0.46930528  0.67555409 -1.81702723 -0.18310854]
 [ 1.05896919 -0.39784023  0.33743765  1.04757857]
 [ 1.04593826  0.86371729 -0.12209157  0.12471295]] [-0.32279481  0.84167471  2.39096052  0.07619959 -0.56644593]


In [106]:
result.shape

(3, 4, 5)
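
The shape of an outer result is the shape of the first input followed by the shape of the second. The same values can be produced with broadcasting, as this sketch on freshly generated arrays suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
y = rng.standard_normal(5)

result = np.subtract.outer(x, y)
# Adding a trailing axis to x lets y broadcast across it
by_broadcasting = x[:, :, np.newaxis] - y

print(result.shape == x.shape + y.shape)     # True
print(np.allclose(result, by_broadcasting))  # True
```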

In [107]:
result

array([[[-0.14651048, -1.31098   , -2.8602658 , -0.54550487,
          0.09714065],
        [ 0.99834889, -0.16612063, -1.71540643,  0.5993545 ,
          1.24200002],
        [-1.49423242, -2.65870194, -4.20798774, -1.89322681,
         -1.2505813 ],
        [ 0.13968627, -1.02478325, -2.57406906, -0.25930813,
          0.38333739]],

       [[ 1.38176399,  0.21729447, -1.33199133,  0.9827696 ,
          1.62541512],
        [-0.07504542, -1.23951494, -2.78880074, -0.47403982,
          0.1686057 ],
        [ 0.66023246, -0.50423706, -2.05352286,  0.26123807,
          0.90388358],
        [ 1.37037338,  0.20590386, -1.34338194,  0.97137899,
          1.6140245 ]],

       [[ 1.36873306,  0.20426354, -1.34502226,  0.96973867,
          1.61238419],
        [ 1.1865121 ,  0.02204258, -1.52724322,  0.7875177 ,
          1.43016322],
        [ 0.20070323, -0.96376629, -2.51305209, -0.19829116,
          0.44435436],
        [ 0.44750776, -0.71696176, -2.26624756,  0.04851337,
          0

reduceat performs a “local reduce,” in essence an array groupby operation in which slices of the array are aggregated together.

In [108]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [109]:
np.add.reduceat(arr,[0,5,9]) # sum of 0:5 and 5:9 and 9:

array([10, 26,  9], dtype=int32)
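
To make the bin edges concrete, the sketch below computes the same three sums by slicing by hand and compares them with the reduceat result.

```python
import numpy as np

arr = np.arange(10)

local = np.add.reduceat(arr, [0, 5, 9])
# The bin edges [0, 5, 9] correspond to the slices 0:5, 5:9, and 9:
by_hand = np.array([arr[0:5].sum(), arr[5:9].sum(), arr[9:].sum()])

print(local)    # [10 26  9]
print(by_hand)  # [10 26  9]
```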

In [110]:
arr = np.multiply.outer(np.arange(4), np.arange(5))

In [111]:
arr

array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])

In [112]:
np.add.reduceat(arr, [0, 2, 4], axis=1)

array([[ 0,  0,  0],
       [ 1,  5,  4],
       [ 2, 10,  8],
       [ 3, 15, 12]], dtype=int32)

![image.png](attachment:image.png)

### Writing New Ufuncs in Python

In [113]:
def add_elements(x,y):
    return x+y

numpy.frompyfunc accepts a Python function along with a specification for the number of inputs and outputs

In [114]:
add_the = np.frompyfunc(add_elements, 2,1)

In [115]:
add_the(np.arange(8),np.arange(8))

array([0, 2, 4, 6, 8, 10, 12, 14], dtype=object)

frompyfunc always returns arrays of Python objects, which can be inconvenient. Fortunately, there is an alternative (but slightly less featureful) function, numpy.vectorize, that allows you to specify the output type.

In [116]:
add_the = np.vectorize(add_elements, otypes =[np.float64])

In [117]:
add_the(np.arange(8),np.arange(8))

array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.])

In [118]:
arr = np.random.randn(10000)

In [119]:
# Let's time it (%timeit is an IPython magic, so no import is needed)
%timeit add_the(arr,arr)

3.38 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [120]:
%timeit np.add(arr,arr)

7.62 µs ± 70.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


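As the timings above show, add_the is several hundred times slower in this run than np.add, which is a built-in ufunc. np.vectorize still calls the Python function once per element, so it offers convenience rather than speed.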

### Structured and Record Arrays
A structured array is an ndarray in which each element can be thought of as representing a struct in C (hence the “structured” name) or a row in a SQL table with multiple named fields:

In [121]:
# Different datatype for individual elements
data = [('x', np.float64),('y',np.int32)]

In [122]:
sar = np.array([(1.5, 6), (np.pi, -2)], dtype=(data))

In [123]:
sar

array([(1.5       ,  6), (3.14159265, -2)],
      dtype=[('x', '<f8'), ('y', '<i4')])

In [124]:
sar[0]

(1.5, 6)

In [125]:
sar[0]['y']

6

In [126]:
dtype = [('x', np.int64, 3), ('y', np.int32)]

In [127]:
ar = np.zeros(4,dtype = dtype)

In [128]:
ar

array([([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0), ([0, 0, 0], 0)],
      dtype=[('x', '<i8', (3,)), ('y', '<i4')])

In [129]:
ar[0]['x']

array([0, 0, 0], dtype=int64)

In [130]:
ar['x']

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=int64)

In [131]:
ar['y']

array([0, 0, 0, 0])
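
Two further operations on structured arrays may be useful: viewing one as a record array, which allows attribute access to fields, and sorting by a named field. This is a brief sketch using a small array built for this example.

```python
import numpy as np

dtype = [('x', np.float64), ('y', np.int32)]
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)

# A record array view exposes fields as attributes (rec.x instead of sarr['x'])
rec = sarr.view(np.recarray)
print(rec.x)        # same values as sarr['x']
print(rec.y)        # same values as sarr['y']

# np.sort accepts a field name through the order argument
by_y = np.sort(sarr, order='y')
print(by_y['y'])    # [-2  6]
```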

### More about Sorting

In [132]:
arr = np.random.randn(6)

In [133]:
arr.sort()

In [134]:
arr

array([-1.03537573, -0.57602859, -0.05836937,  0.20624816,  1.14154299,
        1.40299899])

In [135]:
arr = np.random.randn(3,5)

In [136]:
arr

array([[ 0.6499097 , -0.78484081, -0.33888732,  0.42015745,  2.13360954],
       [ 0.46804556,  1.12776536,  0.17051599,  1.55297386, -1.73467752],
       [ 0.00433219, -1.1737013 , -0.00943135, -0.8157525 , -0.69938172]])

In [137]:
arr[:,0].sort() # Sort the first column in place

In [138]:
arr

array([[ 0.00433219, -0.78484081, -0.33888732,  0.42015745,  2.13360954],
       [ 0.46804556,  1.12776536,  0.17051599,  1.55297386, -1.73467752],
       [ 0.6499097 , -1.1737013 , -0.00943135, -0.8157525 , -0.69938172]])

In [139]:
arr.sort(axis = 1) # Sort the values within each row
arr

array([[-0.78484081, -0.33888732,  0.00433219,  0.42015745,  2.13360954],
       [-1.73467752,  0.17051599,  0.46804556,  1.12776536,  1.55297386],
       [-1.1737013 , -0.8157525 , -0.69938172, -0.00943135,  0.6499097 ]])

Python users are familiar with the “trick” that for a list values, values[::-1] returns a list in reverse order. The same is true for ndarrays

In [140]:
arr[:,::-1]

array([[ 2.13360954,  0.42015745,  0.00433219, -0.33888732, -0.78484081],
       [ 1.55297386,  1.12776536,  0.46804556,  0.17051599, -1.73467752],
       [ 0.6499097 , -0.00943135, -0.69938172, -0.8157525 , -1.1737013 ]])

### Indirect Sorts: argsort and lexsort
In data analysis you may need to reorder datasets by one or more keys. For example, a table of data about some students might need to be sorted by last name, then by first name. This is an example of an indirect sort, and if you’ve read the pandas-related chapters you have already seen many higher-level examples. Given a key or keys (an array of values or multiple arrays of values), you wish to obtain an array of integer indices (colloquially called indexers) that tells you how to reorder the data to be in sorted order. Two methods for this are argsort and numpy.lexsort. (Source: Python for Data Analysis.)

In [141]:
value = np.array([5,0,1,2,3])

In [142]:
index = value.argsort()

In [143]:
index

array([1, 2, 3, 4, 0], dtype=int64)

In [144]:
value[index]

array([0, 1, 2, 3, 5])

In [145]:
# Another example
arr = np.random.randn(3,5)

In [146]:
arr[0] = value

In [147]:
arr

array([[ 5.        ,  0.        ,  1.        ,  2.        ,  3.        ],
       [-3.81785587, -0.8050689 , -1.00043174, -0.44801792, -0.25285489],
       [ 1.42305365,  0.3354397 ,  0.38831871, -0.29577236, -0.35137157]])

In [148]:
arr[:,arr[0].argsort()]

array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [-0.8050689 , -1.00043174, -0.44801792, -0.25285489, -3.81785587],
       [ 0.3354397 ,  0.38831871, -0.29577236, -0.35137157,  1.42305365]])

In [149]:
# A stable sort (mergesort) keeps equal keys in their original relative order
values = np.array(['2:first', '2:second', '1:first', '1:second','1:third'])
key = np.array([2, 2, 1, 1, 1])
index = key.argsort(kind='mergesort')

In [150]:
index

array([2, 3, 4, 0, 1], dtype=int64)

In [151]:
values.take(index)

array(['1:first', '1:second', '1:third', '2:first', '2:second'],
      dtype='<U8')
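
numpy.lexsort was mentioned above but not shown. It performs an indirect sort on several key arrays at once, and the last key passed is the primary one. The sketch below sorts the student example by last name, then first name; the names are sample data.

```python
import numpy as np

first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jones', 'Arnold', 'Arnold', 'Jones', 'Walters'])

# Keys are given in reverse priority: last_name is the primary key here
sorter = np.lexsort((first_name, last_name))
print(sorter)  # [1 2 3 0 4]

for last, first in zip(last_name[sorter], first_name[sorter]):
    print(last, first)
```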

In [152]:
arr = np.array([0,2,3,5,6,7,8,23]) # searchsorted expects a sorted array

In [153]:
arr.searchsorted(5)

3

searchsorted performs a binary search on a sorted array and returns the index at which the value would need to be inserted to keep the array sorted.

In [154]:
arr = np.array([0, 0, 0, 1, 1, 1, 1])
arr.searchsorted([0, 1])

array([0, 3], dtype=int64)

In [155]:
arr.searchsorted([0, 1], side='right')

array([3, 7], dtype=int64)

In [156]:
data = np.floor(np.random.uniform(0, 10000, size=50))
bins = np.array([0, 100, 1000, 5000, 10000])
data

array([7828., 2860., 9619., 5120., 6297., 6403., 4870., 4566., 5200.,
       1935., 6168.,  943., 2357., 5177., 1667., 1387.,  606., 7410.,
       7968., 7477., 9986., 1091., 7097., 1965., 6576., 2233., 9719.,
       7623., 7917., 5965., 2892., 9001., 7509., 3790., 2786., 8948.,
       7650., 3569., 3527., 2702., 9036., 1038.,  982., 3472., 6088.,
       9663., 4933., 9184., 8891., 6329.])

To then get a labeling of which interval each data point belongs to (where 1 would mean the bucket (0, 100], since the default side='left' places a value equal to an edge in the lower bucket), we can simply use searchsorted.

In [157]:
labels = bins.searchsorted(data)
labels

array([4, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 2, 3, 4, 3, 3, 2, 4, 4, 4, 4, 3,
       4, 3, 4, 3, 4, 4, 4, 4, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 4, 3, 2, 3,
       4, 4, 3, 4, 4, 4], dtype=int64)

In [158]:
# Combined with pandas’s groupby
import pandas as pd
pd.Series(data).groupby(labels).mean()

2     843.666667
3    2823.157895
4    7566.035714
dtype: float64
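
The binning step can also be written with np.digitize. For increasing bins, np.digitize with right=True gives the same labels as searchsorted with its default side='left'. The data values below are chosen for this example so that the edge cases are visible.

```python
import numpy as np

data = np.array([50., 100., 101., 999., 2500., 7000.])
bins = np.array([0, 100, 1000, 5000, 10000])

labels = bins.searchsorted(data)
print(labels)  # [1 1 2 2 3 4]

# A value equal to an edge (100.) lands in the lower bucket in both cases
same_labels = np.array_equal(labels, np.digitize(data, bins, right=True))
print(same_labels)  # True
```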

### Fast NumPy Functions with Numba
Numba is an open source project that creates fast functions for NumPy-like data using CPUs, GPUs, or other hardware.
Let's understand this with the following Python example.

In [159]:
import numpy as np

In [160]:
def mean_D(x,y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] -y[i]
        count += 1
    return result/count
x = np.random.randn(1000000)
y = np.random.randn(1000000)



In [161]:
%timeit mean_D(x,y)

1.09 s ± 44.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [162]:
%timeit (x-y).mean()

9.04 ms ± 55.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Let's see how we can make the above function faster by compiling it with Numba.

In [163]:
import numba as nb
numba_mean_D = nb.jit(mean_D)

In [164]:
@nb.jit  # The same compilation, applied as a decorator
def mean_D(x,y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
    return result / count

In [165]:
%timeit numba_mean_D(x,y)

2.18 ms ± 49.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


jit(nopython=True) has a shorter alias numba.njit

In [166]:
from numba import float64, njit
@njit(float64(float64[:], float64[:]))
def mean_distance(x, y):
    return (x - y).mean()

In [167]:
%timeit mean_distance(x,y)

10.6 ms ± 413 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Multithreading with Numba

<span style="color: red">**WARNING**</span>: *Due to the CPU restrictions on notebook execution, the benefits of multithreading are going to be erratic.  The code in this notebook will run on Colabs or any other Notebook executor, but for reasonable benchmarks, you should download and run this notebook on your own system.*

Numba supports several approaches to multithreading:

* Automatic multithreading of array expressions and reductions
* Explicit multithreading of loops with `prange()`
* External multithreading with tools like concurrent.futures or Dask.

The first two options make use of the *ParallelAccelerator* optimization pass (contributed by Intel) in Numba.  ParallelAccelerator is only supported on 64-bit platforms, and is not available for Python 2.7 on Windows.  It is also only effective when compiling in nopython mode.
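
The second option in the list, explicit loops with `prange()`, could look roughly like the sketch below. `mean_distance_parallel` is a name made up for this example. The fallback branch is only an assumption to keep the cell runnable where Numba is not installed; in that case the function runs as ordinary, single-threaded Python.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumed fallback for this sketch only: no compilation, plain Python loop
    prange = range
    def njit(**options):
        return lambda func: func

@njit(parallel=True)
def mean_distance_parallel(x, y):
    total = 0.0
    # Iterations of a prange loop may run on different threads;
    # the += on total is treated as a reduction across them
    for i in prange(len(x)):
        total += x[i] - y[i]
    return total / len(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = rng.standard_normal(10_000)
print(np.isclose(mean_distance_parallel(x, y), (x - y).mean()))  # True
```

Because threads add their partial sums in a different order than a sequential loop, the result can differ from `(x - y).mean()` in the last few digits, which is why the comparison uses np.isclose.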

In [168]:
from numba import jit

@jit(nopython=True, parallel=True)
def gaussians(x, means, widths):
    '''Return the value of gaussian kernels.
    
    x - location of evaluation
    means - array of kernel means
    widths - array of kernel widths
    '''
    n = means.shape[0]
    result = np.exp( -0.5 * ((x - means) / widths)**2 ) / widths
    return result / n

In [169]:
means = np.random.uniform(-1, 1, size=1000000)
widths = np.random.uniform(0.1, 0.3, size=1000000)

gaussians(0.4, means, widths)

array([5.15464617e-17, 1.07810512e-07, 7.16586228e-26, ...,
       4.10855667e-09, 8.22113149e-10, 5.61803034e-12])

In [170]:
gaussians_nothread = jit(nopython=True)(gaussians.py_func)

%timeit gaussians_nothread(0.4, means, widths)
%timeit gaussians(0.4, means, widths)

29.7 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.82 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [171]:
%timeit gaussians.py_func(0.4, means, widths) # compare to pure NumPy

59.3 ms ± 390 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


You can also create a custom NumPy ufunc object with Numba. The numba.vectorize function creates compiled NumPy ufuncs, which behave like built-in ufuncs. Let’s consider a Python implementation of numpy.add.

In [172]:
from numba import vectorize
@vectorize
def nb_add(x, y):
    return x + y

In [173]:
x = np.arange(10)

In [174]:
nb_add(x,x)

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)

In [175]:
%timeit nb_add(x,x)

831 ns ± 5.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [176]:
def nb_add1(x, y):
    return x + y

In [177]:
%timeit nb_add1(x,x) 

958 ns ± 15.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


### Advanced Array Input and Output
We were introduced to np.save and np.load for storing arrays in binary format on disk. There are a number of additional options to consider for more sophisticated use. In particular, memory maps have the additional benefit of enabling you to work with datasets that do not fit into RAM.

### Memory-Mapped Files
A memory-mapped file is a method for interacting with binary data on disk as though
it is stored in an in-memory array. NumPy implements a memmap object that is
ndarray-like, enabling small segments of a large file to be read and written without
reading the whole array into memory.

In [178]:
memory_map = np.memmap('mymmap', dtype = 'float64', mode = 'w+',shape = (10000,10000))
memory_map

memmap([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [179]:
# Slicing memory_map
section = memory_map[:5]

In [180]:
section[:] = np.random.randn(5,10000)

In [181]:
memory_map.flush() # You can write it to disk by calling flush

In [182]:
memory_map

memmap([[ 0.23666986, -0.14993497, -0.69173602, ...,  0.39788219,
         -0.12114594, -0.16014347],
        [ 0.24285773, -0.5238142 ,  1.09358265, ...,  0.26032179,
         -1.34337409,  2.15478028],
        [ 0.0300722 ,  0.58223574,  0.04159271, ...,  0.19704506,
         -1.02921312, -0.48447341],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])

In [183]:
del memory_map
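
After the memmap object is deleted, the data remains in the file on disk and can be opened again. The file stores raw bytes with no metadata, so the dtype and shape must be given again when reopening. The sketch below uses a small array and a temporary directory, both chosen for this example.

```python
import os
import tempfile

import numpy as np

# A temporary location, so the example does not leave a file in the working directory
path = os.path.join(tempfile.mkdtemp(), 'example_mmap')

mmap = np.memmap(path, dtype='float64', mode='w+', shape=(100, 100))
mmap[:5] = 1.0   # fill the first five rows
mmap.flush()     # write pending changes to disk
del mmap

# Reopen read-only; dtype and shape must match what was written
reopened = np.memmap(path, dtype='float64', mode='r', shape=(100, 100))
print(reopened[:5].sum())  # 500.0
print(reopened[5:].sum())  # 0.0
```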