[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/joshmaglione/CS102-Jupyter/main?labpath=.%2FWeek02.ipynb)

<a href="https://colab.research.google.com/github/joshmaglione/CS102-Jupyter/blob/main/Week02.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[View on GitHub](https://github.com/joshmaglione/CS102-Jupyter/blob/main/Week02.ipynb)

# Week 2: Understanding Data Types in Python

Effective data-driven science and computation requires understanding how data is stored and manipulated.

## Dynamic vs Static typing

In `Python`, data is *dynamically typed*. This means that variable types are determined and checked at **runtime**. 

In `C`, data is *statically typed*. This means that variable types are determined and checked during **compilation**.

For example, one operation in `C` might go as follows:

```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
    result += i;
}
```

The same operation in `Python` would look like:

```python
# Python code
result = 0
for i in range(100):
    result += i
```

Notice `result` was declared an `int` in `C`, whereas in `Python` it was not.

As another example, the following code will run properly in `Python`, but in `C` it won't compile.

In [None]:
x = 4
x = "four"
x = [4]
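
To watch the type change at runtime, one can inspect `type(x)` after each assignment. A minimal sketch:

```python
# Each assignment rebinds the name x to an object of a different type.
x = 4
print(type(x))      # <class 'int'>
x = "four"
print(type(x))      # <class 'str'>
x = [4]
print(type(x))      # <class 'list'>
```

The name `x` carries no type of its own; the type belongs to the object it currently refers to.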

There is more going on under the hood of a Python type.

## More than they seem

The standard `Python` implementation is written in `C`.

Every `Python` object is a `C` structure. 

Consider an `int` for example. 

In `C`, an `int` is essentially a label for a location in memory whose bytes encode the value.

Looking at the `Python` 3.12.1 [source code](https://github.com/python/cpython/blob/main/Objects/clinic/longobject.c.h), an integer is encoded as a 'longobject', and the definition looks essentially like

```C
struct _longobject {
    long ob_refcnt;
    PyTypeObject *ob_type;
    size_t ob_size;
    long ob_digit[1];
};
```

A single integer in Python actually contains four pieces:
- `ob_refcnt` : reference count that helps Python silently handle memory allocation and deallocation
- `ob_type` : encodes the type of the variable
- `ob_size` : specifies the size of the following data members
- `ob_digit` : contains the actual integer value that we expect the Python variable to represent.

Therefore, `Python` integers take more data than `C` integers. 
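
Two of these fields can be observed from Python itself: `type()` reports what `ob_type` records, and `sys.getrefcount` reports the reference count. The sketch below assumes CPython; the absolute counts are implementation details, so only the change is meaningful. It uses a fresh `object()` rather than an integer because small integers are shared by the interpreter, which makes their counts uninformative. The names `obj` and `holder` are illustrative.

```python
import sys

obj = object()                  # a fresh object that nothing else refers to
before = sys.getrefcount(obj)   # includes a temporary reference held by the call
holder = [obj]                  # storing obj in a list adds one reference
after = sys.getrefcount(obj)

print(type(1))                  # the type recorded for the integer 1
print(after - before)           # the list accounts for one extra reference
```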

The following illustrates the two different encodings of the integer `1`

![](imgs/cint_vs_pyint.png)

Here '`PyObject_HEAD`' is a container for `ob_refcnt`, `ob_type`, and `ob_size`.

We can check this directly. In `C`, an `int` typically takes 4 bytes. In `Python`, we can use the `sys` module to check.

In [None]:
from sys import getsizeof
print(getsizeof(1))         # Number of bytes for 1
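
Because `ob_digit` is an array, a Python integer is not limited to a fixed width; larger values simply use more digits and more memory. A small sketch, assuming CPython (the exact byte counts vary between versions and platforms):

```python
from sys import getsizeof

# Print the bit length and the size in bytes of increasingly large integers.
for n in [1, 2**30, 2**60, 2**1000]:
    print(n.bit_length(), getsizeof(n))

small = getsizeof(1)
large = getsizeof(2**1000)
```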

## Lists in Python

Dynamic typing allows for *heterogeneous* lists.

In [None]:
L = [True, 1, "one", [1], 1.0]
[(x, type(x)) for x in L]

In [None]:
print(f"Number of bytes for L : {getsizeof(L)}")
[(x, type(x), getsizeof(x)) for x in L]

Now consider a list whose elements all have the same type.

In [None]:
P = [2, 3, 5, 7, 11, 13, 17, 19]
print(f"Number of bytes for P : {getsizeof(P)}")
[getsizeof(x) for x in P]

Since every element has the same type, storing the type information with each element is redundant, so we could expect a more efficient representation.
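
To put a rough number on that overhead, the sketch below adds the size of the list to the sizes of the objects it points to and compares the total with eight raw 64-bit values. It is only an estimate: small integers are shared objects in CPython, and the byte counts depend on the Python version. The variable names are illustrative.

```python
from sys import getsizeof

P = [2, 3, 5, 7, 11, 13, 17, 19]

list_bytes = getsizeof(P)                       # the list object and its block of pointers
element_bytes = sum(getsizeof(x) for x in P)    # the integer objects it points to
raw_bytes = 8 * len(P)                          # eight 64-bit values in a plain C array

print(list_bytes + element_bytes, raw_bytes)
```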

One way to get better storage efficiency is to use the `array` module from Python's standard library.
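
As a brief sketch of that option, the type code `"q"` below asks for signed integers of at least 8 bytes each; the choice of type code is an assumption made for illustration.

```python
from array import array
from sys import getsizeof

P = [2, 3, 5, 7, 11, 13, 17, 19]
A = array("q", P)               # every element is stored as a raw signed integer

print(A.itemsize)               # bytes per element
print(A.itemsize * len(A))      # bytes used by the raw values
print(getsizeof(A))             # raw values plus one fixed object header

try:
    A.append("twenty-three")    # only integers are accepted
except TypeError as err:
    print("TypeError:", err)
```

The array stores the values themselves rather than pointers to full Python objects, which is where the saving comes from.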

Since we want to do operations on arrays, `numpy` is the better approach.

## NumPy arrays

If you want to expand upon the foundations here, check out [Numpy Fundamentals](https://numpy.org/doc/stable/user/basics.html) to learn more about, well, the fundamental ideas and philosophy present in NumPy.

Python lists contain a pointer to a block of pointers, each of which points to a full Python object (like the Python integer that we saw earlier).

NumPy arrays have a fixed type and are essentially pointers to a contiguous block of data.

![](imgs/array_vs_list.png)

We import `numpy` as follows.

In [6]:
import numpy as np

We have standards. 

We don't use anything other than `np` to abbreviate `numpy`.

Code can get **hard** to read. By sticking to this standard, you never have to think when you see `np`.

We can construct `numpy` arrays, called `ndarray`, from Python lists. 

In [None]:
a = np.array([2, 3, 5, 7, 11, 19])
print(type(a))
print(a)
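
Because the type is fixed, values assigned into an existing array are converted to the array's type instead of changing it. A minimal sketch of that behaviour:

```python
import numpy as np

a = np.array([2, 3, 5, 7, 11, 19])
a[0] = 2.9              # the float is converted to the array's integer type
print(a[0])             # 2
print(a.dtype)          # still an integer type

try:
    a[1] = "three"      # this string cannot be converted to an integer
except ValueError as err:
    print("ValueError:", err)
```

Note that the fractional part is dropped silently, so this is worth keeping in mind when mixing integers and floats.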

When you feed the animal a variety of food, it makes a choice. 

In [None]:
print(np.array([True, 1, 1.0]))
print(np.array([True, 1, 1.0, "one"]))

This is called *upcasting*. 

Can you figure out the order of precedence by playing around? (You could just look it up, but that's not as fun.)
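
One way to play around is to print the `dtype` that NumPy picks for different mixtures. A small starting point (the particular mixtures are arbitrary; try your own):

```python
import numpy as np

mixtures = ([True, 1], [True, 1, 1.0], [True, 1, 1.0, "one"])
results = [np.array(items) for items in mixtures]

# Show the chosen data type next to the resulting array.
for arr in results:
    print(arr.dtype, arr)
```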

Arrays can be multi-dimensional.

In [None]:
M = np.array([(i, i + 3, i - 1) for i in range(5) if i != 2])
print(M)

In [None]:
T = np.array([[(i, i + 1) for i in range(4*j, 4*j + 4, 2)] for j in range(2)])
print(T)

There are many ways to create arrays.

In [None]:
# Create a length-10 integer array filled with zeros
print(np.zeros(10, dtype=int))

In [None]:
# Create a 3x5 floating-point array filled with ones
print(np.ones((3, 5), dtype=float))

In [None]:
# Create a 3x5 array filled with 42
print(np.full((3, 5), 42))

In [None]:
# Create an array filled with a linear sequence. Starting at 0, ending at 20,
# stepping by 2
print(np.arange(0, 21, 2))

In [None]:
# Create an array of five values evenly spaced between 0 and 1
print(np.linspace(0, 1, 5))

But wait, there's more!

In [None]:
# Create a 3x4 array of continuously uniformly distributed random values between
# 0 and 1
print(np.random.random_sample((3, 4)))

In [None]:
# Create a 3x4 array of normally distributed random values with mean 0 and
# standard deviation 1
print(np.random.normal(0, 1, (3, 4)))

In [None]:
# Create a 6x4 array of random integers in the interval [0, 10)
print(np.random.randint(0, 10, (6, 4)))
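
The NumPy documentation recommends the `Generator` interface for new code. The three random examples above could be sketched with it as follows; the seed value and the variable names are arbitrary choices made here for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(seed=102)   # seeding makes the results repeatable

uniform = rng.random((3, 4))            # uniform on [0, 1)
normal = rng.normal(0, 1, (3, 4))       # mean 0, standard deviation 1
ints = rng.integers(0, 10, (6, 4))      # integers in [0, 10)

print(uniform)
print(normal)
print(ints)
```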

In [None]:
# Create a 3x3 identity matrix
print(np.eye(3, dtype=int))

## What makes NumPy arrays so special?

![](imgs/BigL.png)

[Big Lebowski clip](https://youtu.be/4LGX8TbvGew?si=-VpNkvj6FjsS0RbG)

It's not a matter of opinion that `ndarrays` are fantastic. 🙃

The fundamental difference is that an `ndarray` is stored in a *homogeneous and contiguous block of memory*. 

- Computations on arrays can be written in `C`.
- Knowing the address of the memory block and the data type, it is just simple arithmetic to loop over all items.
- Spatial locality in memory access patterns results in performance gains notably due to the CPU cache.
- NumPy can take advantage of vectorized instructions of modern CPUs.
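
The "simple arithmetic" in the second point can be made concrete with the `strides` attribute, which gives the number of bytes to step along each dimension. A minimal sketch, using an `int64` array so that each item takes 8 bytes:

```python
import numpy as np

M = np.arange(12, dtype=np.int64).reshape(3, 4)

print(M.flags["C_CONTIGUOUS"])  # True: rows are stored one after another
print(M.strides)                # (32, 8): bytes to the next row, bytes to the next column

# Offset in bytes of M[i, j] from the start of the data block.
i, j = 2, 1
offset = i * M.strides[0] + j * M.strides[1]
flat = M.ravel()                # the same data seen as one dimension
print(flat[offset // M.itemsize], M[i, j])   # both are 9
```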

One downside of this approach is that the block has a fixed size: if you want to add a row or column to your matrix, for example, NumPy must allocate a new array and copy the data.

There are ways to avoid things like this. See [Section 4.5 of the 'Cookbook'](https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/).
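
As a small illustration of the copy, the sketch below adds a row with `np.vstack` and checks that the result does not share memory with the original. It also shows one common workaround, assuming the final shape is known in advance: allocate once and fill in place. The names `bigger` and `out` are illustrative.

```python
import numpy as np

M = np.zeros((2, 3), dtype=int)
row = np.ones(3, dtype=int)

bigger = np.vstack([M, row])            # allocates a new (3, 3) array and copies both inputs
print(bigger.shape)
print(np.shares_memory(M, bigger))      # False: bigger is independent of M

# If the final shape is known, allocate once and fill it in place.
out = np.empty((3, 3), dtype=int)
out[:2] = M
out[2] = row
print(np.array_equal(out, bigger))      # True
```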

Two names may refer to the same array, and two arrays may share the same memory. Be careful out there.

In [None]:
X = np.array([1, 2, 3])
Y = X                       # Sneaky
X[0] = 4
print(Y)
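
To tell these situations apart, one can compare object identity with `is` and memory with `np.shares_memory`. A short sketch that extends the example above with a slice and an explicit copy (`V` and `C` are names introduced here):

```python
import numpy as np

X = np.array([1, 2, 3])
Y = X               # another name for the same array object
V = X[:2]           # a slice is a view onto the same memory
C = X.copy()        # an independent copy

X[0] = 4
print(Y is X, Y)                    # True [4 2 3]
print(np.shares_memory(X, V), V)    # True [4 2]
print(np.shares_memory(X, C), C)    # False [1 2 3]
```

When a change to one array must not affect another, `copy()` is the explicit way to ask for separate memory.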

### A quick note on data types within `ndarray`

One can specify the data types for an `ndarray` using the `dtype` keyword argument.

We did this above. 

It might be useful to know what some of the types mean. Here's a table:

| Data type   | Description |
|-------------|-------------|
| `bool_`     | Boolean (True or False) stored as a byte |
| `int_`      | Default integer type (same as C `long`; normally either `int64` or `int32`)| 
| `intc`      | Identical to C `int` (normally `int32` or `int64`)| 
| `intp`      | Integer used for indexing (same as C `ssize_t`; normally either `int32` or `int64`)| 
| `int8`      | Byte (-128 to 127)| 
| `int16`     | Integer (-32768 to 32767)|
| `int32`     | Integer (-2147483648 to 2147483647)|
| `int64`     | Integer (-9223372036854775808 to 9223372036854775807)| 
| `uint8`     | Unsigned integer (0 to 255)| 
| `uint16`    | Unsigned integer (0 to 65535)| 
| `uint32`    | Unsigned integer (0 to 4294967295)| 
| `uint64`    | Unsigned integer (0 to 18446744073709551615)| 
| `float_`    | Shorthand for `float64` (removed in NumPy 2.0; use `float64`)| 
| `float16`   | Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| `float32`   | Single-precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| `float64`   | Double-precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| `complex_`  | Shorthand for `complex128` (removed in NumPy 2.0; use `complex128`)| 
| `complex64` | Complex number, represented by two 32-bit floats| 
| `complex128`| Complex number, represented by two 64-bit floats| 
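
The ranges in the table do not need to be memorised; `np.iinfo` and `np.finfo` report them. A brief sketch, with an arbitrary choice of types:

```python
import numpy as np

# Name, size in bytes, smallest and largest value for a few integer types.
for t in (np.int8, np.uint16, np.int64):
    info = np.iinfo(t)
    print(np.dtype(t).name, np.dtype(t).itemsize, info.min, info.max)

# For floats, eps is the gap between 1.0 and the next representable number.
eps64 = np.finfo(np.float64).eps
print(eps64 == 2.0**-52)        # True, matching the 52-bit mantissa
```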

## Attributes of `ndarrays`

Recall that *attributes* are data associated with an object. They are accessed without parentheses or arguments.

In [21]:
x1 = np.random.randint(10, size=6)              # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))         # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))      # Three-dimensional array
xs = [x1, x2, x3]

Each array has the following attributes:
- `dtype` : the data type of the entries,
- `ndim` : the number of dimensions,
- `shape` : the size of each dimension,
- `size` : the total number of entries in the array.

In [None]:
print(f"The data type of the arrays : {[x.dtype for x in xs]}")
print(f"The dimensions of the arrays : {[x.ndim for x in xs]}")
print(f"The shapes of the arrays : {[x.shape for x in xs]}")
print(f"The sizes of the arrays : {[x.size for x in xs]}")
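
These attributes are related to one another, and two further attributes, `itemsize` and `nbytes`, connect them to memory use. A short sketch of the relationships, recreating an array like `x3` so the block runs on its own:

```python
import numpy as np

x3 = np.random.randint(10, size=(3, 4, 5))

print(x3.ndim == len(x3.shape))             # ndim is the length of the shape
print(x3.size == 3 * 4 * 5)                 # size is the product of the shape
print(x3.itemsize)                          # bytes per entry, determined by the dtype
print(x3.nbytes == x3.size * x3.itemsize)   # total bytes of data
```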

## Exercises

1. Construct a NumPy $4 \times 4 \times 4$ array of all ones (of type `int`).

2. Construct a NumPy $7 \times 4 \times 6$ array of random integers in the range $1$ to $99$ (inclusive).
   
3. Create an array of $71$ values, evenly spaced between $0$ and $100$.

4. Create an array with a sequence of integers, starting at $1950$, ending at $2015$, stepping by $5$.
   
5. Create a list of all odd squares between $0$ and $10000$.

6. Make a few NumPy arrays, with random entries or ranges of integers, of varying dimensions.

7. Determine the basic attributes of these arrays.