<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# The Basics of NumPy Arrays

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.
This section will present several examples of using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays.
These operations are essential building blocks for many analysis tasks.

The categories of basic array manipulation discussed here are:

- *Attributes of arrays*: Determining the size, shape, memory consumption, and data types of arrays
- *Indexing of arrays*: Getting and setting the value of individual array elements
- *Slicing of arrays*: Getting and setting smaller subarrays within a larger array
- *Reshaping of arrays*: Changing the shape of a given array
- *Joining and splitting of arrays*: Combining multiple arrays into one, and splitting one array into many

## NumPy Array

Compared with Python lists, NumPy arrays consume less memory and run faster. NumPy arrays also provide the convenience of integrated mathematical operations not available in lists.

### Exercise: 
Add a value of one (1) to all elements within a ```list``` of numbers [1, 2, 3, 4] in Python. 

In [None]:
# todo: Exercise



By using NumPy, you can add one (1) to an array of numbers by

In [None]:
import numpy as np
a = np.array([1,2,3,4])
a = a + 1
print(a)

### Create arrays

As the previous example shows, a NumPy array can be created as follows:

In [None]:
import numpy as np
a = np.array([1,2,3,4])

The next cell creates three random arrays (one-dimensional, two-dimensional, and three-dimensional) to use as examples.

In [None]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

a1 = np.random.randint(10, size=6)  # One-dimensional array
a2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
a3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

print ("1-D array", "\n\n", a1)
print ("2-D array", "\n\n", a2)
print ("3-D array", "\n\n", a3)

#### Do you know?
NumPy's random number seed makes the random numbers reproducible. Try changing the seed value to another number in the previous code and observe the numbers generated.
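
As a quick check of this behaviour, the sketch below reseeds the generator with the same value twice and compares the two draws. The seed value 42 and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

np.random.seed(42)                         # arbitrary seed, chosen for illustration
first_draw = np.random.randint(10, size=5)

np.random.seed(42)                         # reseed with the same value
second_draw = np.random.randint(10, size=5)

# The same seed restarts the same pseudo-random sequence
same_seed_matches = np.array_equal(first_draw, second_draw)
print(same_seed_matches)  # True
```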

### Attributes of array
Each array has attributes ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), and ``size`` (the total size of the array):

In [None]:
print("a3 ndim: ", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size: ", a3.size)

Another useful attribute is the ``dtype``, the data type of the array.

In [None]:
print("dtype:", a3.dtype)

Other attributes include ``itemsize``, which lists the size (in bytes) of each array element, and ``nbytes``, which lists the total size (in bytes) of the array:

In [None]:
print("itemsize:", a3.itemsize, "bytes")
print("nbytes:", a3.nbytes, "bytes")

In general, we expect that ``nbytes`` is equal to ``itemsize`` times ``size``.
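
The relationship can be checked directly. The sketch below uses a separate array with an explicit ``int64`` dtype (an assumption made here so that the byte counts do not depend on the platform's default integer size).

```python
import numpy as np

# Explicit dtype so the sizes below are the same on every platform
arr = np.zeros((3, 4, 5), dtype=np.int64)

print(arr.size)      # 60 elements (3 * 4 * 5)
print(arr.itemsize)  # 8 bytes per int64 element
print(arr.nbytes)    # 480 bytes in total
print(arr.nbytes == arr.itemsize * arr.size)  # True
```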

#### Creating unwritable NumPy arrays 

In [None]:
array_immutable = np.arange(5)
print (array_immutable)
array_immutable.flags.writeable = False
try:
    array_immutable[0] = 1  # writing to a read-only array fails
except ValueError as err:
    print("ValueError:", err)  # assignment destination is read-only

#### More on creating arrays
The following are various ways to create arrays

In [None]:
# Create an array of ones
np.ones((3,4))

```numpy.zeros``` and ```numpy.empty``` differ in that ```numpy.empty``` does not set the array values to zero, so `numpy.empty` may be marginally faster. However, an array of arbitrary leftover values usually has to be filled in manually at a later stage. As such, `numpy.zeros` is the recommended choice.

In [None]:
# Create an array of zeros
np.zeros((2,3,4), dtype=np.int16)

In [None]:
# Create an empty array
np.empty((3,2), dtype=np.int16)

In [None]:
# Create an array with random values
np.random.random((2,2))

In [None]:
# Create a full array
np.full((2,2),7)

In [None]:
# Create an array of evenly-spaced values (start, stop, step size; stop is excluded)
np.arange(10,25,5)

In [None]:
# Create an array of evenly-spaced values (start, stop, number of samples; stop is included)
np.linspace(0,2,9)

In [None]:
# Create an array with ones on the k-th diagonal (k=1 is one above the main diagonal) and zeros elsewhere
np.eye(3, k=1)

In [None]:
# Create a square array with ones on the main diagonal
np.identity(3)

### Exercise - Observe the difference

Measure the time taken to multiply a list by itself, element by element, and compare it with a NumPy array multiplied by itself. Carry out the observation at size = 1000000.

In [None]:
import numpy as np
import timeit 

# size of arrays and lists
size = 1000000

x = list(range(size)) # declare a list (range alone is not a list in Python 3)
y = np.arange(size) # declare a numpy array

# NumPy array
startTime = timeit.default_timer()
resultArray = y * y

# print and calculate execution time
print("Time taken by NumPy Array :", (timeit.default_timer() - startTime), "seconds")

# list
startTime = timeit.default_timer()
resultList = [(a * a) for a in x]

# print and calculate execution time
print("Time taken by List :", (timeit.default_timer() - startTime), "seconds")


``%timeit`` gives more reliable measurements than timing a single run by hand because:

* it repeats the tests many times to reduce the influence of other tasks on your machine, such as disk flushing and OS scheduling.
* it disables the garbage collector to prevent that process from skewing the results by scheduling a collection run at an inopportune moment.
* it picks the most accurate timer for your OS: ``time.time`` or ``time.clock`` in Python 2, and ``time.perf_counter()`` in Python 3.
See timeit.default_timer.
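
Outside a notebook, where the ``%timeit`` magic is not available, the standard-library ``timeit`` module offers similar repetition. The following is a minimal sketch; the reduced size and the ``number`` and ``repeat`` values are assumptions chosen so that it runs quickly.

```python
import timeit

import numpy as np

size = 100_000                       # smaller than the exercise so the sketch runs quickly
values_list = list(range(size))
values_array = np.arange(size)

# timeit.repeat runs each callable `number` times per trial and returns one total per trial
list_times = timeit.repeat(lambda: [v * v for v in values_list], number=5, repeat=3)
array_times = timeit.repeat(lambda: values_array * values_array, number=5, repeat=3)

# The minimum is the trial least disturbed by other activity on the machine
print("list :", min(list_times) / 5, "seconds per run")
print("array:", min(array_times) / 5, "seconds per run")
```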

In [None]:
size = 1000000
x = list(range(size)) # declare a list
%timeit resultList = [(a * a) for a in x]

In [None]:
size = 1000000
y = np.arange(size) # declare a numpy array
%timeit resultArray = y * y

## Array Indexing: Accessing Single Elements

In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets.

In [None]:
a1

In [None]:
a1[0]

In [None]:
a1[4]

To index from the end of the array, you can use negative indices:

In [None]:
a1[-1]

In [None]:
a1[-2]

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [None]:
a2

In [None]:
a2[0, 0]

In [None]:
a2[2, 0]

In [None]:
a2[2, -1]

Values can also be modified using any of the above index notation:

In [None]:
a2[0, 0] = 12
a2

#### Important
Unlike Python lists, NumPy arrays have a fixed type.
If you attempt to insert a floating-point value into an integer array, the value will be silently truncated.

In [None]:
import numpy as np
aa = np.zeros([4], dtype = float)

bb = np.zeros([4], dtype = int)

aa[0] = 3.14159  
bb[0] = 3.14159  # this will be truncated!

print("aa[0]:", aa[0]) 
print("bb[0]:", bb[0]) 

## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.


### One-dimensional subarrays

In [None]:
x = np.arange(10)
x

In [None]:
x[:5]  # first five elements

In [None]:
x[5:]  # elements from index 5 onward

In [None]:
x[4:7]  # middle sub-array

In [None]:
x[::2]  # every other element

In [None]:
x[1::2]  # every other element, starting at index 1

A potentially confusing case is when the ``step`` value is negative.
In this case, the defaults for ``start`` and ``stop`` are swapped.
This becomes a convenient way to reverse an array:

In [None]:
x[::-1]  # all elements, reversed

In [None]:
x[5::-2]  # reversed every other from index 5

### Multi-dimensional subarrays

NumPy slices can slice through multiple dimensions. Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:

In [None]:
a2

In [None]:
a2[:2, :3]  # two rows, three columns

In [None]:
a2[:3, ::2]  # all rows, every other column

Finally, subarray dimensions can even be reversed together:

In [None]:
a2[::-1, ::-1]

### Exercise - Observe the slices

All arrays produced by NumPy basic slicing are views of (references to) the original array, while slices of lists are shallow copies. Verify that a slice of a NumPy array does reference the original array by changing all values in the slice to zero.

* Generate a 1-dimensional NumPy array of 10 random integers (from 0 to 9)
* Take the slice from index 1 to index 4 (inclusive)
* Change all values in the slice to zero
* Determine whether the change affected the original NumPy array

In [None]:
#todo exercise
a5 = np.random.randint(10, size=10) # One-dimensional array

b5 = a5.tolist()
print("original array is \n", a5, "\n")

a5slice = a5[1:5]
print("obtained slice of array is \n", a5slice, "\n")

a5slice *= 0

print("modified slice of array is \n", a5slice,"\n")

print("original array after slice modified is \n", a5, "\n")

print("original list is \n", b5, "\n")

b5slice = b5[1:5]
print("obtained list slice is \n", b5slice, "\n")

for i in range(len(b5slice)):
    b5slice[i] = 0
print("modified list slice is \n", b5slice,"\n")
print("original list after slice modified is \n", b5, "\n")


#### Accessing array rows and columns

One commonly needed routine is accessing of single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (``:``):

In [None]:
print(a2)

In [None]:
print(a2[:, 0])  # first column of a2

In [None]:
print(a2[0, :])  # first row of a2

In the case of row access, the empty slice can be omitted for a more compact syntax:

In [None]:
print(a2[0])  # equivalent to a2[0, :]

### Subarrays as no-copy views

Again, recall that NumPy array slices return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.
Let's consider a two-dimensional array in this case:

In [None]:
print(a2)

Let's extract a $2 \times 2$ subarray from this:

In [None]:
a2_sub = a2[:2, :2]
print(a2_sub)

Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [None]:
a2_sub[0, 0] = 88
print(a2_sub)

In [None]:
print(a2)

This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.

### Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:

In [None]:
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)

If we now modify this subarray, the original array is not touched:

In [None]:
a2_sub_copy[0, 0] = 42
print(a2_sub_copy)

In [None]:
print(a2)
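
Whether two arrays use the same underlying buffer can also be checked without modifying anything, using ``np.shares_memory``. The sketch below uses its own small example array (the names ``arr``, ``sub_view``, and ``sub_copy`` are illustrative) rather than ``a2``.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

sub_view = arr[:2, :2]          # basic slicing: a view on the same buffer
sub_copy = arr[:2, :2].copy()   # explicit copy: its own buffer

print(np.shares_memory(arr, sub_view))  # True
print(np.shares_memory(arr, sub_copy))  # False

sub_view[0, 0] = 88             # visible through arr
sub_copy[0, 1] = 42             # not visible through arr
print(arr[0, 0], arr[0, 1])     # 88 1
```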

## Reshaping of Arrays

Reshaping is a useful operation that changes the shape of an array without changing its data.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [None]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

Note that for this to work, the size of the initial array must match the size of the reshaped array. 
Where possible, the ``reshape`` method returns a no-copy view of the initial array, but this is not always possible when the memory buffer is non-contiguous. A contiguous array is stored in one unbroken block of memory, so moving to the next value in the array simply means moving to the next memory address.
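
The difference can be observed with ``np.shares_memory``. In the sketch below, a transposed array serves as one example of a non-contiguous layout; the variable names are illustrative.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)       # C-contiguous 2x3 array
flat_view = a.reshape(6)             # contiguous source: reshape can return a view
print(np.shares_memory(a, flat_view))    # True

a_t = a.T                            # transposed view: same buffer, no longer C-contiguous
print(a_t.flags['C_CONTIGUOUS'])     # False
flat_copy = a_t.reshape(6)           # elements are not stored in this order, so NumPy copies
print(np.shares_memory(a, flat_copy))    # False
```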

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
This can be done with the ``reshape`` method, or more easily by using ``np.newaxis`` within a slice operation. Each use of ```newaxis``` adds one dimension of length 1 to the existing array.

> For example, a shape of (3, 4) means that the array has 2 dimensions: the first dimension has 3 elements and the second has 4.

In [None]:
x = np.array([1, 2, 3])
print(x, "has shape", x.shape,"\n")

# row vector via reshape
x2 = x.reshape((1, 3))
print(x2, "has shape", x2.shape,"\n")

x3 = x2.reshape((3, 1))
print(x3, "has shape", x3.shape,"\n")

x4 = np.array([[1,2],[2,2], [3,2]])
print(x4, "has shape", x4.shape,"\n")

In [None]:
# row vector via newaxis
x5 = x[np.newaxis, :]
print(x5, "has shape", x5.shape)

In [None]:
# column vector via reshape
x.reshape((3, 1))

In [None]:
# column vector via newaxis
x[:, np.newaxis]
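
``np.newaxis`` is an alias for ``None``, and it can appear more than once in a single index expression. The following short sketch shows the resulting shapes.

```python
import numpy as np

x = np.array([1, 2, 3])

print(np.newaxis is None)                   # True
print(x[np.newaxis, :].shape)               # (1, 3)
print(x[None, :].shape)                     # (1, 3), same as above
print(x[:, np.newaxis].shape)               # (3, 1)
print(x[np.newaxis, :, np.newaxis].shape)   # (1, 3, 1)
```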

### Exercise

Make a copy of the following NumPy array, x, and reshape the copy of x into a shape of (1, 6).

```python
x = np.array([[1, 2, 3], [4, 5, 6]])
```

In [None]:
#todo: exercise


## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

You can also concatenate more than two arrays at once:

In [None]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

It can also be used for two-dimensional arrays:

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

In [None]:
# concatenate along the first axis
np.concatenate([grid, grid])

In [None]:
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)

For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

# vertically stack the arrays
np.vstack([x, grid])

In [None]:
# horizontally stack the arrays
y = np.array([[99],
              [99]])
np.hstack([grid, y])

Similarly, ``np.dstack`` will stack arrays along the third axis.

In [None]:
z = np.array([[1, 2, 3],
              [5, 6, 7]])

np.dstack([grid, z])

### Splitting of arrays

The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``.  For each of these, we can pass a list of indices giving the split points:

In [None]:
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

Notice that *N* split points lead to *N + 1* subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:

In [None]:
grid = np.arange(16).reshape((4, 4))
grid

In [None]:
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

In [None]:
left, right = np.hsplit(grid, [2])
print(left)
print(right)

Similarly, ``np.dsplit`` will split arrays along the third axis.
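
As a minimal sketch of ``np.dsplit``, the example below splits a small three-dimensional array (a hypothetical ``cube`` created only for this illustration) at one point along the third axis.

```python
import numpy as np

cube = np.arange(16).reshape((2, 2, 4))   # hypothetical 3-D example array

# One split point along the third axis gives two subarrays
front, back = np.dsplit(cube, [1])
print(front.shape)  # (2, 2, 1)
print(back.shape)   # (2, 2, 3)

# Stacking the pieces along the third axis restores the original
print(np.array_equal(np.dstack([front, back]), cube))  # True
```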

# Array mathematics

Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the NumPy module.

In [None]:
import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

v = np.array([9,10])
w = np.array([11, 12])

Elementwise addition

In [None]:
print(x + y)
print(np.add(x, y))

Elementwise subtraction

In [None]:
print(x - y)
print(np.subtract(x, y))

Elementwise multiplication

In [None]:
print(x * y)
print(np.multiply(x, y))

Elementwise division

In [None]:
print(x / y)
print(np.divide(x, y))

Elementwise square root

In [None]:
print(np.sqrt(x))

Inner product of vectors

In [None]:
print(v.dot(w))
print(np.dot(v, w))

Matrix-vector and matrix-matrix products

In [None]:
print(x.dot(v))
print(np.dot(x, v))
print()
print(x.dot(y))
print(np.dot(x, y))
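
For one- and two-dimensional arrays like these, the ``@`` operator (available since Python 3.5) gives the same results as ``dot``. The sketch below redefines the arrays so that it runs on its own.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]], dtype=np.float64)
y = np.array([[5, 6], [7, 8]], dtype=np.float64)
v = np.array([9, 10])

matrix_vector = x @ v
matrix_matrix = x @ y

print(matrix_vector)   # [29. 67.]
print(matrix_matrix)   # rows are [19. 22.] and [43. 50.]

# Same values as the dot-based versions
print(np.array_equal(matrix_vector, np.dot(x, v)))  # True
print(np.array_equal(matrix_matrix, np.dot(x, y)))  # True
```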

Sum of all elements

In [None]:
print(x, "\n")
print(np.sum(x))

Sum of each column

In [None]:
print(np.sum(x, axis=0))

Sum of each row

In [None]:
print(np.sum(x, axis=1))

# Structured Data: NumPy's Structured Arrays

While data can often be well represented by a homogeneous array of values, this is not always the case. NumPy's *structured arrays* and *record arrays* provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas ``DataFrame``s.

In [None]:
import numpy as np

Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program.
It would be possible to store these in three separate arrays:

In [None]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

But this is a bit clumsy. There's nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data.
NumPy can handle this through structured arrays, which are arrays with compound data types.

Recall that previously we created a simple array using an expression like this:

In [None]:
x = np.zeros(4, dtype=int)

We can similarly create a structured array using a compound data type specification:

In [None]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

Here ``'U10'`` translates to "Unicode string of maximum length 10," ``'i4'`` translates to "4-byte (i.e., 32 bit) integer," and ``'f8'`` translates to "8-byte (i.e., 64 bit) float."
We'll discuss other options for these type codes in the following section.

Now that we've created an empty container array, we can fill the array with our lists of values:

In [None]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

As we had hoped, the data is now arranged together in one convenient block of memory.
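
The size of that block follows from the field formats. The sketch below rebuilds the same compound data type under different variable names (``people_dtype`` and ``people`` are illustrative) and inspects the record size, assuming the default packed layout with no alignment padding.

```python
import numpy as np

people_dtype = np.dtype({'names': ('name', 'age', 'weight'),
                         'formats': ('U10', 'i4', 'f8')})
people = np.zeros(4, dtype=people_dtype)

# 'U10' stores 10 characters at 4 bytes each (40), plus 4 for 'i4' and 8 for 'f8'
print(people.dtype.itemsize)  # 52 bytes per record
print(people.nbytes)          # 208 bytes for the four records
```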

The handy thing with structured arrays is that you can now refer to values either by index or by name:

In [None]:
# Get all names
data['name']

In [None]:
# Get first row of data
data[0]

In [None]:
# Get the name from the last row
data[-1]['name']

Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:

In [None]:
# Get names where age is under 30
data[data['age'] < 30]['name']

Note that if you'd like to do any operations that are any more complicated than these, you should probably consider the Pandas package.
The Pandas package provides a ``DataFrame`` object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more.

## Creating Structured Arrays

Structured array data types can be specified in a number of ways.
Earlier, we saw the dictionary method:

In [None]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})

For clarity, numerical types can be specified using Python types or NumPy ``dtype``s instead:

In [None]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})

A compound type can also be specified as a list of tuples:

In [None]:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:

In [None]:
np.dtype('S10,i4,f8')

The shortened string format codes may seem confusing, but they are built on simple principles.
The first (optional) character is ``<`` or ``>``, which means "little endian" or "big endian," respectively, and specifies the byte order used to store multi-byte values.
The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).
The last character or characters represents the size of the object in bytes.

| Character        | Description           | Example                             |
| ---------        | -----------           | -------                             | 
| ``'b'``          | Byte                  | ``np.dtype('b')``                   |
| ``'i'``          | Signed integer        | ``np.dtype('i4') == np.int32``      |
| ``'u'``          | Unsigned integer      | ``np.dtype('u1') == np.uint8``      |
| ``'f'``          | Floating point        | ``np.dtype('f8') == np.float64``    |
| ``'c'``          | Complex floating point| ``np.dtype('c16') == np.complex128``|
| ``'S'``, ``'a'`` | String                | ``np.dtype('S5')``                  |
| ``'U'``          | Unicode string        | ``np.dtype('U') == np.str_``        |
| ``'V'``          | Raw data (void)       | ``np.dtype('V') == np.void``        |
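
A few of these codes can be checked interactively. The sketch below compares type codes with the corresponding NumPy types and shows how the byte-order prefix changes the stored bytes; the value 1 is an arbitrary choice for illustration.

```python
import numpy as np

# Type codes compare equal to the corresponding NumPy scalar types
print(np.dtype('i4') == np.int32)        # True
print(np.dtype('f8') == np.float64)      # True
print(np.dtype('c16') == np.complex128)  # True

# The trailing number is a size in bytes, except for 'U', where it counts characters
print(np.dtype('S5').itemsize)   # 5
print(np.dtype('U5').itemsize)   # 20 (4 bytes per character)

# The byte-order prefix changes how the same value is laid out in memory
little = np.array([1], dtype='<i4').tobytes()
big = np.array([1], dtype='>i4').tobytes()
print(little)  # b'\x01\x00\x00\x00'
print(big)     # b'\x00\x00\x00\x01'
```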

## More Advanced Compound Types

It is possible to define even more advanced compound types.
For example, you can create a type where each element contains an array or matrix of values.
Here, we'll create a data type with a ``mat`` component consisting of a $3\times 3$ floating-point matrix:

In [None]:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])

Now each element in the ``X`` array consists of an ``id`` and a $3\times 3$ matrix.
Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?
The reason is that this NumPy ``dtype`` directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.
If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!
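
The memory layout behind that mapping can be inspected from Python. The sketch below prints the record size and field offsets of the same data type; by default NumPy packs the fields without padding, so this roughly corresponds to a packed C struct holding a 64-bit integer followed by a 3 x 3 array of doubles (that C description is an illustrative assumption, not taken from a particular library).

```python
import numpy as np

tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])

print(tp.itemsize)           # 80 bytes: 8 for id + 9 * 8 for mat
print(tp.fields['id'][1])    # 0, byte offset of the id field
print(tp.fields['mat'][1])   # 8, byte offset of the mat field

X = np.zeros(1, dtype=tp)
raw = X.tobytes()            # the raw buffer for one record
print(len(raw))              # 80
```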

## RecordArrays: Structured Arrays with a Twist

NumPy also provides the ``np.recarray`` class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.
Recall that we previously accessed the ages by writing:

In [None]:
data['age']

If we view our data as a record array instead, we can access this with slightly fewer keystrokes:

In [None]:
data_rec = data.view(np.recarray)
data_rec.age

The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:

In [None]:
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age

Whether the more convenient notation is worth the additional overhead will depend on your own application.

### Exercise

Create a structured array for the following table of data:

| index | stock | product ID | name |
|---|---|---|---|
|1|200|Z143|Batteries|
|2|100|BOK9|Books|
|3|250|C982|Bicycles|

In [None]:
#todo exercise

# Use a compound data type for structured arrays


### Sorting
NumPy arrays can be sorted in-place using the sort method:

In [None]:
array1 = np.random.randn(5)
array1.sort()
print(array1)

In [None]:
array2 = np.random.randn(5, 5)

array2_a = array2.copy()
array2_b = array2.copy()

print(array2,"\n")

# sort one copy along axis 0 (within each column) and the other along axis 1 (within each row)
array2_a.sort(0)
array2_b.sort(1)

print (array2_a,"\n")
print (array2_b,"\n")
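
When the original order needs to be kept, the ``np.sort`` function returns a sorted copy instead of sorting in place. The following minimal sketch contrasts the two, using a small illustrative array.

```python
import numpy as np

values = np.array([3, 1, 2])

sorted_copy = np.sort(values)   # returns a new sorted array
print(values)                   # [3 1 2], original order is unchanged
print(sorted_copy)              # [1 2 3]

values.sort()                   # sorts in place and returns None
print(values)                   # [1 2 3]
```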

#### Save arrays on disk in binary format

In [None]:
print(array2)
np.save('array_on_disk',array2)

#### Load arrays from disk in binary format

In [None]:
np.load('array_on_disk.npy')

#### Save array to text file

In [None]:
np.savetxt('array_txt.txt', array2, delimiter=',' )

#### Load array from text file

In [None]:
np.loadtxt('array_txt.txt', delimiter=',')
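
To confirm that saved data survives a round trip, the loaded arrays can be compared with the original. The sketch below writes to a temporary directory so that it leaves no files behind; the file names are hypothetical.

```python
import os
import tempfile

import numpy as np

original = np.random.randn(5, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    binary_path = os.path.join(tmp_dir, 'roundtrip.npy')   # hypothetical file names
    text_path = os.path.join(tmp_dir, 'roundtrip.txt')

    # Binary format stores the exact bytes of the array
    np.save(binary_path, original)
    from_binary = np.load(binary_path)

    # Text format stores a decimal representation of each value
    np.savetxt(text_path, original, delimiter=',')
    from_text = np.loadtxt(text_path, delimiter=',')

print(np.array_equal(original, from_binary))  # True
print(np.allclose(original, from_text))       # True
```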