## CS2101 - Programming for Science and Finance
Prof. Götz Pfeiffer<br />
School of Mathematical and Statistical Sciences<br />
University of Galway

***

### Objects and Classes
# Week 5: Matrices vs Arrays

* At a large scale, in terms of both space and time, plain Python is not as efficient as other programming languages.
* The `numpy` package restores some of this efficiency.
* To understand the problem, and its solution, we now look into the inner workings of Python's data types.
* First, we review `python`'s basic data types, and some of their properties.
* Then we introduce the `numpy` package and discuss its `array` data type for **homogeneous** lists of data.

* `numpy` arrays improve on `python` lists in many ways.
* `numpy` arrays are **homogeneous** **multi-dimensional** collections of data.
* As such, a `numpy` array has:
    * a **shape**, specifying its size in each dimension;
    * a common **data type** for all its elements.
* These (and related) **attributes** of an array can be directly accessed.
* Among many other operations, `numpy` extends `python`'s set of **indexing** and **slicing** operators.


## Dynamic Typing vs Static Typing

* A **statically-typed** language like `C` or `Java` requires each 
  variable to be explicitly declared, together with a type.
* In a **dynamically-typed** language like `python` this kind of
  specification is not needed: a variable is implicitly declared
  when it is first assigned a value.

* For example, in the `C` language one might specify a particular operation as follows:
  ```C
  /* C code */
  int result = 0;
  for (int i = 0; i < 100; i++) {
      result += i;
  }
  ```
  Note how every variable (`result` and `i`) in this code is declared to be of type `int`.
* In `C`, the data type of each variable (`int` for integer) is
  explicitly declared (and thus known to the compiler).

* In `python` the equivalent operation could be written this way:
  ```python
  # Python code
  result = 0
  for i in range(100):
      result += i
  ```
  Here, the variables (`result` and `i`) have no declared type; they just happen to have values of type `int`.
* In `python`, the data type is **dynamically** inferred (at runtime) from the **value** of the variable.
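
As a small illustration of the point above, the following sketch rebinds one name to values of different types. The name `value` and the three `*_type` variables are only chosen for this example.

```python
# Dynamic typing: the type belongs to the value, not to the name.
value = 42
first_type = type(value)       # int

value = 4.2                    # rebinding the same name to a float is allowed
second_type = type(value)      # float

value = "forty-two"            # ... and to a string
third_type = type(value)       # str

print(first_type, second_type, third_type)
```

In `C`, the corresponding reassignments would be rejected by the compiler, because the declared type of a variable cannot change.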

* The standard `python` interpreters are implemented in `C`.  Thus, at runtime, every `python` object is really a `C` object.
* However, there is a big difference between the **memory** needed for storing an integer value in a `C` variable
  or in a `python` variable ...
* In `C`, an integer variable is simply a label for a **slot** in 
  machine memory whose bytes encode an integer value.
* A `python` variable is a **pointer** to a complex data structure, which contains administrative information about a
  `python` object (such as its **type**) in addition to the 
  integer value.

* The `C` type definition for a `python` (long) integer effectively looks like this:
  ```C
  /* C code */
  struct _longobject {
      long ob_refcnt;
      PyTypeObject *ob_type;
      size_t ob_size;
      long ob_digit[1];
  };
  ```
* A single integer object in Python thus actually contains four pieces of information:
  * `ob_refcnt`, a **reference** count that helps Python silently handle memory allocation and deallocation
  * `ob_type`, which encodes the **type** of the object
  * `ob_size`, which specifies the **size** of the object (i.e., its number of "digits")
  * `ob_digit`, the array of digits representing an actual integer value.
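
Some of this administrative information can be observed from within Python. The sketch below assumes the standard CPython interpreter; the names `n`, `obj` and `alias` are only illustrative, and the exact reference counts reported by `sys.getrefcount` depend on the interpreter version, so only their difference is compared.

```python
import sys

n = 12345
print(type(n))                       # the information stored via ob_type
print(sys.int_info.bits_per_digit)   # bits per "digit" (30 on typical 64-bit builds)

# Reference counts are easiest to observe on a fresh mutable object.
obj = []
before = sys.getrefcount(obj)
alias = obj                          # a second name for the same object
after = sys.getrefcount(obj)
print(after - before)                # one more reference than before
```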

![c vs python](images/cint_vs_pyint.png)

* The function `sys.getsizeof` reveals that in `python`, there is 
  an **administrative overhead** of $24$ bytes for each integer, whose
  value only requires $4$ bytes (if it is small, i.e. a single digit) ...

In [None]:
from sys import getsizeof
print(getsizeof(100))
print(getsizeof(2**30-1))
print(getsizeof(2**30))
print(getsizeof(0))

* This memory overhead becomes even more drastic when it comes to 
  **lists** of integers.

## The Size of a List

* Dynamic typing allows lists in `python` to be **heterogeneous**.

In [None]:
L = [True, 2, 3.0, "4"]
[type(x) for x in L]

* This flexibility comes at a price, as each object in the list
  needs to store its own administrative information, in addition to the list's own overhead.

In [None]:
[(x, getsizeof(x)) for x in L]

In [None]:
getsizeof(L)

* Clearly, if all the objects in a list are of the same type, most of
  this information is redundant.
* It would be more efficient to store these data in a
  fixed-type array, or even as a string ...

In [None]:
P = [2, 3, 5, 7, 11, 13, 17, 19]
print(P)
print(sum(getsizeof(x) for x in P) + getsizeof(P))
print(str(P))
print(getsizeof(str(P)))
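
Even without `numpy`, the standard library offers a fixed-type container in its `array` module. The following sketch (not part of the original notes) stores the same primes with type code `'i'`, a `C` signed int, and compares the bytes used by the raw values with the bytes used by the individual `python` integer objects. The names `packed`, `payload` and `boxed` are chosen for this example only.

```python
from array import array
from sys import getsizeof

P = [2, 3, 5, 7, 11, 13, 17, 19]
packed = array('i', P)                    # 'i': C signed int, one fixed type for all items

payload = packed.itemsize * len(packed)   # bytes used by the raw values
boxed = sum(getsizeof(x) for x in P)      # bytes used by the int objects in P

print(packed.itemsize, payload, boxed)
```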

## `NumPy` Arrays

* This is where the `numpy` package comes in.
* `numpy` provides efficient **storage** and efficient **operations** on array-based data.

![array vs list](images/array_vs_list.png)

* When imported, `numpy` usually gets the short nickname `np`.

In [None]:
import numpy as np

* `numpy` provides a new data type, `np.ndarray`, for **homogeneous** lists of (lists of ...) data; such arrays are created with the function `np.array`.
* Such a `numpy` fixed-type array can easily be constructed from a `python` list.

In [None]:
l = [3,1,4,1,5,9,2,6]
a = np.array(l)
a

In [None]:
print("Size of list: ", sum(getsizeof(x) for x in l) + getsizeof(l))
print("Size of array: ", getsizeof(a))
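
To see where the bytes of an array go, one can compare the size of the raw data buffer (`nbytes`) with the total reported by `getsizeof`. This is a sketch under the assumption of a standard CPython and `numpy` installation; the concrete numbers vary between platforms, so none are quoted here.

```python
from sys import getsizeof
import numpy as np

l = [3, 1, 4, 1, 5, 9, 2, 6]
a = np.array(l)

list_total = sum(getsizeof(x) for x in l) + getsizeof(l)
print("bytes per element: ", a.itemsize)
print("data buffer:       ", a.nbytes)       # size * itemsize
print("array incl. header:", getsizeof(a))
print("list incl. items:  ", list_total)
```

The array pays its administrative overhead once, for the whole array, rather than once per element.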

* If necessary (and possible) `numpy` will **upcast** to
  make all values in the array have the same type.

In [None]:
b = np.array([3.14, 1, 5, 9, 2, 6, 5, 3])
b

In [None]:
getsizeof(b)
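
The effect of upcasting is visible in the `dtype` of the resulting array. The following sketch uses illustrative names (`ints`, `mixed`, `texts`, `forced`) and also shows that an explicit `dtype` argument overrides the inferred type.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([3.14, 1, 5])        # one float forces all entries to float
texts = np.array([1, 2.5, "three"])   # a string forces all entries to strings

print(ints.dtype, mixed.dtype, texts.dtype)
print(mixed)

# An explicit dtype overrides the inferred one.
forced = np.array([1, 2, 3], dtype=float)
print(forced.dtype)
```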

* `numpy` arrays can be **multidimensional**, e.g., 2-d matrices.

In [None]:
ll = [list(range(i, i+4)) for i in [2,5,7]]
print(ll)
np.array(ll)

* There are many ways to create arrays from scratch, using
  `numpy`'s builtin routines.

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create an array of five values evenly spaced between 0 and 1 (inclusive)
np.linspace(0, 1, 5)

* Matrices with random entries can sometimes serve as useful examples or test cases.
* `numpy` provides a **random number generator** for that purpose.

In [None]:
# create a random number generator
rng = np.random.default_rng()

In [None]:
# Create a 3x4 array of uniformly distributed
# random values between 0 and 1
rng.random((3, 4))

In [None]:
# Create a 3x5 array of random integers in the interval [0, 10)
rng.integers(0, 10, (3, 5))
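
When random arrays serve as test cases, it can help to make them reproducible. As a sketch, `np.random.default_rng` accepts a `seed` argument; two generators created with the same seed produce the same values. The seed `2101` and the names `rng_a`, `rng_b` are arbitrary choices for this example.

```python
import numpy as np

rng_a = np.random.default_rng(seed=2101)   # 2101 is an arbitrary seed
rng_b = np.random.default_rng(seed=2101)

sample_a = rng_a.integers(0, 10, (3, 5))
sample_b = rng_b.integers(0, 10, (3, 5))
print(np.array_equal(sample_a, sample_b))
```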

In [None]:
# Create a 3x3 identity matrix
np.eye(3, dtype=int)
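
A few further creation routines that are not covered above, shown here as a brief sketch: `np.full` fills an array with a given value, `np.arange` mirrors `python`'s `range`, and `np.diag` builds a square matrix from its diagonal.

```python
import numpy as np

sevens = np.full((2, 3), 7)          # 2x3 array with every entry equal to 7
steps = np.arange(0, 20, 5)          # like range(0, 20, 5), but returns an array
diagonal = np.diag([1, 2, 3])        # square matrix with the given diagonal

print(sevens)
print(steps)
print(diagonal)
```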

## Array Attributes

* A `numpy` array is a **multi-dimensional** **homogeneous** collection of data.
* In **mathematics** and **physics**, such an object is often called a **tensor**.
* A `numpy` array has a **shape** and a **dtype**.
* Let's investigate these in some simple examples.
* Start with three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array.

In [None]:
x0 = rng.integers(0, 10) # A single random integer
print(x0)

In [None]:
x1 = rng.integers(0, 10, size=4)  # One-dimensional array; here size means shape :-(
x1

In [None]:
x2 = rng.integers(0, 10, size=(3, 4))  # Two-dimensional array: shape 3 x 4
x2

In [None]:
x3 = rng.integers(0, 10, size=(2, 3, 4))  # Three-dimensional: shape 2 x 3 x 4
x3

Each array has the attributes
* `dtype`: the **data type** of the array, and
* `shape`: the **size in each dimension**.

In [None]:
print("x2 dtype:", x2.dtype)
print("x2 shape:", x2.shape)

In [None]:
print("x3 dtype:", x3.dtype)
print("x3 shape:", x3.shape)

Further attributes of interest are
* `ndim`: the **number of dimensions**, and
* `size`: the **total number** of elements.

In [None]:
print("x3 ndim: ", x3.ndim)
print("x3 size: ", x3.size)

By definition, `size` is the product of the numbers in the tuple `shape`, and `ndim` is the length of that tuple.

In [None]:
from math import prod 
prod(x3.shape) == x3.size

In [None]:
len(x3.shape) == x3.ndim
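
The attributes discussed above can be collected in one place. The helper `describe` below is a hypothetical convenience function written for this example, not part of `numpy`; it also reports `itemsize` (bytes per element) and `nbytes` (bytes in the data buffer), two further attributes of every array.

```python
import numpy as np

def describe(arr):
    """Collect the basic attributes of an array in a dictionary (hypothetical helper)."""
    return {
        "dtype": arr.dtype,
        "shape": arr.shape,
        "ndim": arr.ndim,
        "size": arr.size,
        "itemsize": arr.itemsize,   # bytes per element
        "nbytes": arr.nbytes,       # size * itemsize
    }

cube = np.zeros((2, 3, 4), dtype=float)
info = describe(cube)
print(info)
```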

## Indexing: Accessing Single Elements

* In a one-dimensional array, the $i^{th}$ value (counting from **zero**) can be accessed by specifying the desired index in square brackets, just as with `python` lists:

In [None]:
x1

In [None]:
print(x1[0])
print(x1[3])
print(x1[-1])

* **NEW:** In a **multi-dimensional** array, items can be accessed using **comma-separated indices**:

In [None]:
x2

In [None]:
print(x2[0, 0])
print(x2[2, -1])

* Comma-separated indices can also be used for assignments.

In [None]:
x2[0, 0] = 12
x2
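
For comparison, a nested `python` list needs one pair of brackets per level and rejects comma-separated indices. The following sketch uses a small fixed example (`nested`, `grid`) rather than the random array `x2`.

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6]]
grid = np.array(nested)

print(nested[1][2])     # lists need one pair of brackets per level
print(grid[1, 2])       # arrays accept a single tuple of indices
print(grid[1][2])       # chained indexing also works, via an intermediate row

try:
    nested[1, 2]
except TypeError as err:
    print("list indexing failed:", err)
```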

## Slicing: Accessing Subarrays

* The `numpy` slicing syntax follows that of the standard `python` list.
* To access a slice of an array ``x``, use
  ``` python
  x[start:stop:step]
  ```
  where the `:step` part is optional.
* If any of these are unspecified, they default to the values 
$0$ for `start`, the size (of the dimension) for `stop`, and $1$ for `step`.
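
The defaults can be made explicit with `python`'s built-in `slice` objects, which is what the colon notation creates behind the scenes. In this sketch, `slice.indices(n)` reports the concrete `(start, stop, step)` that a slice stands for on a sequence of length `n`; the name `first_five` is illustrative.

```python
import numpy as np

x = np.arange(10)

first_five = slice(None, 5)               # same as :5
print(x[first_five])

# slice.indices(n) fills in the defaults for a sequence of length n
print(slice(None, None, 2).indices(10))   # forward step
print(slice(None, None, -1).indices(10))  # negative step: start at the end
```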

### One-dimensional slicing

In [None]:
x = np.arange(10)
x

In [None]:
x[:5]  # first five elements

In [None]:
x[5:]  # elements from index 5 onwards

In [None]:
x[4:7]  # middle sub-array

In [None]:
x[::2]  # every other element

In [None]:
x[1::2]  # every other element, starting at index 1

* **Note:** When the `step` value is **negative**, the defaults for `start` and `stop` are **swapped**.
* This gives a convenient way to reverse an array.

In [None]:
x[::-1]  # all elements, reversed

In [None]:
x[7::-2]  # every other element in reverse order, starting at index 7

### Multi-dimensional slicing

* **NEW:** Multi-dimensional slices work similarly, with multiple **slices separated by commas**.

In [None]:
x2

In [None]:
x2[:2, :3]  # two rows, three columns

In [None]:
x2[:, ::2]  # all rows, every other column

In [None]:
x2[::-1, ::-1]  # reversing both rows and cols

### Accessing array rows and columns

* Single rows or columns of an array can be accessed by **combining indexing and slicing**.

In [None]:
x2[:, 0]  # first column of x2

In [None]:
x2[0, :]  # first row of x2

* Trailing empty slices can be omitted.

In [None]:
x2[0]  # equivalent to x2[0, :]
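
One detail worth checking is the shape of the result: an integer index removes its dimension, while a slice of length one keeps it. A minimal sketch, using a fixed $3 \times 4$ array `m` instead of the random `x2`:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)

col = m[:, 0]        # integer index: the column axis is dropped
col2d = m[:, 0:1]    # length-1 slice: the column axis is kept

print(col.shape, col2d.shape)
```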

### Subarrays are no-copy views!

* Recall that, for a `python` list `l`, the slice `l[:]` is a convenient way of making a copy of the list `l`.
* **CAUTION:** Array slices are **views** rather than **copies** of the array data.
* This means that they refer to (and modify) the same underlying data as the original array.

In [None]:
print(x2)

* Let's extract a $2 \times 2$ subarray from this:

In [None]:
x22 = x2[:2, :2]
print(x22)

* Now, if we modify this subarray, the original array is changed, too!

In [None]:
x22[0, 0] = 99
print(x22)
print(x2)

* When working with **large datasets**, this behaviour allows us to access and process pieces of these datasets without the need to copy the entire underlying data buffer.
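
The contrast with `python` lists can be seen side by side in the following sketch, where the same modification is applied to a list slice and to an array slice. The names `values`, `arr` and their slices are chosen for this example.

```python
import numpy as np

values = [1, 2, 3, 4]
list_slice = values[:2]      # a new list
list_slice[0] = 99
print(values)                # unchanged

arr = np.array([1, 2, 3, 4])
arr_slice = arr[:2]          # a view on arr's data
arr_slice[0] = 99
print(arr)                   # first entry changed
```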

### Creating copies of arrays

* To make an explicit copy of the data within an array or a subarray use the `copy()` method:

In [None]:
x22copy = x2[:2, :2].copy()
print(x22copy)

* Now, if we modify this copied subarray, the original array is not affected:

In [None]:
x22copy[0, 0] = 42
print(x22copy)
print(x2)
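
Whether two arrays refer to the same underlying data can also be tested directly, without modifying anything. The sketch below uses `np.shares_memory` on a view and on a copy of a fixed example array `m`.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
view = m[:2, :2]
copy = m[:2, :2].copy()

print(np.shares_memory(m, view))
print(np.shares_memory(m, copy))
```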

## Reshaping of Arrays

* Another useful type of operation is reshaping of arrays.
* The most flexible way of doing this is with the `reshape` method.
* For example, to put the numbers 1 through 9 into a $3 \times 3$ matrix grid, you can do the following:

In [None]:
grid = np.arange(1, 10)
print(grid)
grid = np.arange(1, 10).reshape(3, 3)
print(grid)

In [None]:
grid.shape

* Note that for this to work, the **size** of the initial array **must match** the size of the reshaped array. 
* **CAUTION:** Where possible, the ``reshape`` method will use a **no-copy view** of the initial array.
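
Both points can be checked in a short sketch. It also uses the fact that `reshape` accepts `-1` for one dimension, in which case `numpy` infers that dimension from the total size. The names `flat` and `grid` are local to this example.

```python
import numpy as np

flat = np.arange(12)
grid = flat.reshape(3, -1)       # -1: let numpy infer this dimension
print(grid.shape)

grid[0, 0] = 99                  # writes through to flat if grid is a view
print(flat[0], np.shares_memory(flat, grid))

try:
    flat.reshape(5, 3)           # 12 elements cannot fill a 5x3 array
except ValueError as err:
    print("reshape failed:", err)
```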

## References

### `python`

* `sys.getsizeof`: [[doc]](https://docs.python.org/2/library/sys.html#sys.getsizeof)
determines the size (in bytes) of an object
* `l[i]`: indexing [[doc]](https://docs.python.org/3/library/stdtypes.html?highlight=mutable%20sequence#sequence-types-list-tuple-range)
* `l[start:stop:step]`: slicing [[doc]](https://docs.python.org/3/library/stdtypes.html?highlight=mutable%20sequence#sequence-types-list-tuple-range)
* `slice` [[doc]](https://docs.python.org/3/library/functions.html#slice)

### `numpy`

* `np.array`: [[doc]](https://numpy.org/doc/stable/user/basics.creation.html)
  constructs a `numpy` multidimensional array.
* `dtype`: [[doc]](https://numpy.org/doc/stable/reference/arrays.dtypes.html)
  the common type of the entries of a `numpy` array.
* `np.zeros`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html) fills an array with $0$ values of the specified type
* `np.ones`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.ones.html) fills an array with $1$ values
* `np.eye`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.eye.html) creates the identity matrix of the given shape
* `np.empty`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.empty.html) creates an array without initializing its values
* `np.linspace`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) constructs an array of equally spaced values.
* `np.random`: [[doc]](https://numpy.org/doc/stable/reference/random/index.html) random sampling.
* indexing, slicing: [[doc]](https://numpy.org/doc/stable/reference/arrays.indexing.html)
* `reshape`: [[doc]](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)

##  Exercises

1. Create a `numpy` array with entries $2, 3, 5, 7, 11, 13, 17, 19$
   and then use `sys.getsizeof` to compare its size with the size
   of the `python` list with the same entries.   

2. Construct a `numpy` $3 \times 3 \times 3$ array of $1$s (of type `int`).

3. Construct a `numpy` $3 \times 4 \times 5$ array of random integers
   in the range $1$ to $99$ (inclusive).   

4. Create an array of $21$ values, evenly spaced between $0$ and $100$.

5. Determine the basic attributes of the above arrays.

6. Create an array with a sequence of integers,
   starting at $1950$, ending at $2015$, stepping by $5$.   

7. Create a list of all odd squares between $0$ and $10000$.

8. Starting with a $1$-dimensional array of length $60$,
   reshape it into a $3$-dimensional array with dimensions
   of sizes $5$, $4$ and $3$, respectively.

9. Create a `numpy` array from the matrix
   ```python
   ma = Matrix(
    Vector(1, 0, 1),
    Vector(2, 1, 1),
    Vector(0, 1, 1),
    Vector(1, 1, 2)
   )
   ```

In [None]:
from python.matrix import Vector, Matrix
ma = Matrix(
    Vector(1, 0, 1),
    Vector(2, 1, 1),
    Vector(0, 1, 1),
    Vector(1, 1, 2)
)
na = np.array(ma)
na