<a href="https://colab.research.google.com/github/paigemb4/DS1002/blob/main/numpyfeb14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### NumPy 1
### University of Virginia
### DS 1002: Programming for Data Science
---  

### PREREQUISITES
- import
- functions
- for ... in

### SOURCES
- https://numpy.org/
- https://en.wikipedia.org/wiki/NumPy
- https://www.scipy.org/
- https://en.wikipedia.org/wiki/SciPy
- https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html
- https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.randint.html

### OBJECTIVES
- Introduction to NumPy

### CONCEPTS
- The numpy package contains useful functions for math operations
- The ndarray is the workhorse of the package

# NumPy

**A new data structure**

Essentially, NumPy introduces a new data structure to Python -- the **n-dimensional array**. Along with it comes a collection of functions and methods that take advantage of this data structure.

The data structure is designed to support the use of **numerical methods**: algorithmic approximations to the problems of mathematical analysis.

**New Functions**

It also provides a new way of applying functions to data, made possible by the data structure -- **vectorized functions**.  
Vectorized functions replace the use of loops and comprehensions to apply a function to a set of data.
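
As a small illustration of the idea, the sketch below squares a handful of numbers twice: once with a list comprehension and once with a vectorized expression. The variable names and values are made up for this example.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # illustrative data

# Plain Python: a comprehension visits each element explicitly
squared_loop = [v ** 2 for v in values]

# NumPy: the operation is applied to the whole array at once
arr = np.array(values)
squared_vec = arr ** 2

# A vectorized NumPy function works the same way
roots = np.sqrt(arr)

print(squared_loop)
print(squared_vec)
print(roots)
```

Both approaches give the same numbers; the vectorized form needs no explicit loop.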

In addition, given the data structure, it provides a library of **linear algebra** functions.

**New Data Types**

NumPy also introduces a bunch of new **data types**.

**Python for Science**

Finally, because [numerical methods](https://www.britannica.com/science/numerical-analysis) are so important to so many sciences, NumPy is the basis of what is called **the scientific "stack"** in Python, which consists of SciPy, Matplotlib, scikit-learn, and pandas. All of these assume that you have some knowledge of NumPy.

Let's take a look at it.

In [3]:
import numpy as np

NumPy is by widespread convention aliased as `np`.

# The ndarray

The ndarray is a multidimensional array object.

Let's explore it some.

First, let's generate some fake data using NumPy's built-in random number generator.

**`numpy.random.randn()`**  
[documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html)

In [4]:
# generate a single random number
number = np.random.randn()
print(number, type(number))

1.2104957444467896 <class 'float'>


In [5]:
# generate some data - what do the arguments specify?

data = np.random.randn(2, 3)

In [6]:
# look at the data
data

array([[-0.66273253, -0.79844063,  0.42973552],
       [-0.06500314,  0.29317367,  0.05101617]])

In [7]:
data * data

array([[0.43921441, 0.63750743, 0.18467262],
       [0.00422541, 0.0859508 , 0.00260265]])

^ Displayed like a list of lists; `*` multiplies the two arrays element by element.

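Dot Product

Two ways to multiply an array by a scalar:

1) array * the multiplier
2) **`np.dot()`** (with a scalar argument, this is the same as ordinary multiplication)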

[documentation](https://numpy.org/doc/stable/reference/generated/numpy.dot.html)  
[dot product](https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/null-column-space/v/matrix-vector-products)

In [8]:
data * 10

array([[-6.62732531, -7.98440627,  4.2973552 ],
       [-0.65003137,  2.93173667,  0.51016165]])

In [9]:
b = 10 # the multiplier; passing 10 directly, as in np.dot(data, 10), also works
np.dot(data, b)

array([[-6.62732531, -7.98440627,  4.2973552 ],
       [-0.65003137,  2.93173667,  0.51016165]])
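
With a scalar, `np.dot()` looks identical to `*`. The difference only shows up when both arguments are arrays. The sketch below uses small made-up arrays (`a`, `b`, `v`, `w` are not part of the lesson data) to contrast element-wise multiplication with the dot product.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Element-wise: each entry is multiplied by the entry in the same position
elementwise = a * b

# Dot product of two 2D arrays: rows of a combined with columns of b
matrix_product = np.dot(a, b)

# Dot product of two 1D arrays: 1*4 + 2*5 + 3*6
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
inner = np.dot(v, w)

print(elementwise)
print(matrix_product)
print(inner)
```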

Addition

In [10]:
data + data

array([[-1.32546506, -1.59688125,  0.85947104],
       [-0.13000627,  0.58634733,  0.10203233]])

**`.shape`**

In [11]:
data.shape

(2, 3)

In [12]:
data.dtype

dtype('float64')

## About Dimensions

The term dimension is ambiguous.
* Sometimes refers to the dimensions of things in the world, such as space and time.
* Sometimes refers to the dimensions of a data structure, independent of what it represents in the world.

NumPy dimensions are the latter, although they can be used to represent the former, as physicists do.

The dimensions of data structures are sometimes called **axes**.

Consider this: Three-dimensional space can be represented as three columns in a two-dimensional table OR as three axes in a data cube.
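
The two representations can be sketched as follows. The names `points` and `cube`, and the sizes chosen, are assumptions made for this illustration.

```python
import numpy as np

# Option 1: a two-dimensional table, one row per point, columns x, y, z
points = np.array([[0, 0, 0],
                   [1, 2, 3],
                   [4, 5, 6]])
print(points.ndim, points.shape)

# Option 2: a data cube with one axis per spatial direction
cube = np.zeros((4, 5, 6))
cube[1, 2, 3] = 1.0  # mark the location x=1, y=2, z=3
print(cube.ndim, cube.shape)
```

In the first case the array has two dimensions even though the data describe three-dimensional space; in the second, the array itself has three axes.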


# Creating ndarrays

**`np.array()`**  
- takes an object (such as a list) and converts it to an array data structure  
- [documentation](https://numpy.org/doc/stable/reference/generated/numpy.array.html)

In [13]:
data1 = [6, 7.5, 8, 0, 1] # create a list
arr1 = np.array(data1) # turn list into a numpy array
arr1

array([6. , 7.5, 8. , 0. , 1. ])

In [14]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [15]:
arr2.ndim

2

In [16]:
arr2.shape

(2, 4)

In [17]:
arr1.dtype

dtype('float64')

In [18]:
arr2.dtype

dtype('int64')

**`np.zeros()`**
- returns new array of given shape and type filled with zeros
- [documentation](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html)

In [19]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [20]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

**`np.arange()`**  
- returns evenly spaced values within a given interval (start, stop, step)  
- [documentation](https://numpy.org/doc/stable/reference/generated/numpy.arange.html)

In [21]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
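
Like Python's built-in `range`, `np.arange()` also accepts a start and a step, and the stop value is excluded. A brief sketch with arbitrary example values:

```python
import numpy as np

from_three = np.arange(3, 8)         # start, stop (stop is excluded)
by_fives = np.arange(0, 20, 5)       # start, stop, step
quarters = np.arange(0, 1, 0.25)     # unlike range, non-integer steps are allowed

print(from_three)
print(by_fives)
print(quarters)
```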

### Student Exercises

* Convert `sudoku_game` into a NumPy array called `sudoku_array`.  
* Print the class `type()` of `sudoku_array` to check that your code has worked properly.



In [None]:
# sudoku_game is a Python list containing a sudoku game

sudoku_game = [[0, 0, 4, 3, 0, 0, 2, 0, 9],
               [0, 0, 5, 0, 0, 9, 0, 0, 1],
               [0, 7, 0, 0, 6, 0, 0, 4, 3],
               [0, 0, 6, 0, 0, 2, 0, 8, 7],
               [1, 9, 0, 0, 0, 7, 4, 0, 0],
               [0, 5, 0, 0, 8, 3, 0, 0, 0],
               [6, 0, 0, 0, 0, 0, 1, 0, 5],
               [0, 0, 3, 5, 0, 8, 6, 9, 0],
               [0, 4, 2, 9, 1, 0, 3, 0, 0]]

Create and print an array filled with zeros called `zero_array`, which has two rows and four columns.

Create and print an array of random floats between 0 and 1 called `random_array`, which has three rows and six columns.

Your task is to create a scatter plot with the values from another array called `doubling_array` (not yet defined) on the y-axis.  

To do this we will need to import another package for plots, called `pyplot`, which itself is part of the larger `matplotlib`.

A scatter plot can be created using the following code:  

plt.scatter(x_values, y_values)  
plt.show()  

* Using np.arange(), create a 1D array called `one_to_ten` which holds all integers from one to ten (inclusive).  

* Create a scatterplot with `doubling_array` as the y values and `one_to_ten` as the x values.  

## Data Types for ndarrays

**ndarrays must have a single data type associated with them**

- In NumPy you can specify the number of bits used per element (8, 16, 32, 64, etc.)
- More on bits and bytes [here](https://web.stanford.edu/class/cs101/bits-bytes.html)
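
A short sketch of what a single data type means in practice. The arrays here (`mixed`, `small`, `large`) are illustrative and not used elsewhere in the lesson.

```python
import numpy as np

# Mixing ints and floats: NumPy picks one type that can hold every value
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# The same values stored with different sizes
small = np.array([1, 2, 3], dtype=np.int8)    # 8 bits per element
large = np.array([1, 2, 3], dtype=np.int64)   # 64 bits per element

print(small.itemsize, large.itemsize)  # bytes per element
print(small.nbytes, large.nbytes)      # total bytes for the data
```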

In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr1.dtype

In [None]:
arr2 = np.array([1, 2, 3], dtype=np.int32)
arr2.dtype

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

In [None]:
arr.astype(np.int32)

In [None]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)  # np.string_ was removed in NumPy 2.0
numeric_strings.astype(float)

In [None]:
int_array = np.arange(10)
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)
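
Two details of `.astype()` are worth seeing in isolation: converting floats to integers drops the fractional part rather than rounding, and the result is a new array, so the original is left alone. The values below are chosen only for illustration.

```python
import numpy as np

floats = np.array([3.7, -1.2, -2.6, 12.9])

# Conversion to int drops the fractional part (truncation toward zero)
truncated = floats.astype(np.int32)

# To round to the nearest integer, round first and then convert
rounded = np.round(floats).astype(np.int32)

print(truncated)
print(rounded)
print(floats)  # the original array is unchanged
```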

**NumPy Data Types**

```
i - integer
b - boolean
u - unsigned integer
f - float
c - complex float
m - timedelta
M - datetime
O - object
S - string
U - unicode string
V - fixed chunk of memory for other type ( void )
```
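
These one-letter codes can be read from any array through `dtype.kind`. The sketch below builds a few small arrays (the dictionary and its keys are made up for this example) and collects the code for each.

```python
import numpy as np

examples = {
    "integers": np.array([1, 2, 3]),
    "floats": np.array([1.0, 2.0]),
    "booleans": np.array([True, False]),
    "text": np.array(["a", "bc"]),
    "complex": np.array([1 + 2j]),
}

# Map each example name to the one-letter kind code of its dtype
kinds = {name: arr.dtype.kind for name, arr in examples.items()}

for name, arr in examples.items():
    print(name, arr.dtype, kinds[name])
```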

## Arithmetic

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr.shape

In [None]:
arr * arr

In [None]:
arr - arr

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

In [None]:
arr2 > arr
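
Arithmetic with a scalar follows the same element-wise pattern, and a comparison returns an array of booleans that can itself be counted. A brief sketch, reusing the values of `arr` from above; `mask` is a name introduced here.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

reciprocal = 1 / arr   # each element becomes 1 / element
squared = arr ** 2     # each element is squared

mask = arr > 3         # boolean array, True where the condition holds
count = mask.sum()     # True counts as 1, so this counts the matches

print(reciprocal)
print(squared)
print(mask)
print(count)
```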

# Basic Indexing and Slicing

In [None]:
arr = np.arange(10)
arr

In [None]:
arr[5]

In [None]:
arr[5:8]

Notice that if we assign a scalar to a slice, all of the elements of the slice get that value. This is called **broadcasting**.

In [None]:
arr[5:8] = 12

In [None]:
arr
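
Broadcasting is not limited to slice assignment. It also applies when an operation combines arrays of different but compatible shapes. A minimal sketch, with `grid` and `row` as made-up names:

```python
import numpy as np

grid = np.zeros((2, 3))
row = np.array([1.0, 2.0, 3.0])

plus_scalar = grid + 10   # the scalar is applied to every element
plus_row = grid + row     # the 1D array is applied to each row of grid

print(plus_scalar)
print(plus_row)
```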

Also, notice that changes to slices are changes to the arrays they are slices of. They are **views**, not copies.

In [None]:
arr_slice = arr[5:8]
arr_slice

In [None]:
arr_slice[1] = 12345
arr

In [None]:
arr_slice[:] = 64
arr

In [None]:
arr_slice

As NumPy has been designed with large data use cases in mind, you could imagine performance and memory problems if NumPy insisted on copying data left and right.

⭐ If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array; for example `arr[5:8].copy()`.
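
One way to check whether two arrays share the same underlying data is `np.shares_memory()`. The sketch below compares a view with an explicit copy; the names `view` and `independent` are chosen for this example.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                 # a view onto arr's data
independent = arr[5:8].copy()   # a separate copy of the same values

view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, independent)
print(view_shares, copy_shares)

view[:] = 0          # changes arr as well
independent[:] = -1  # leaves arr alone
print(arr)
```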

**Higher Dimensional Arrays**

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d

In [None]:
arr2d[2]

In [None]:
arr2d[0][2]

**Simplified notation**

In [None]:
arr2d[0, 2]

A nice visual of a 2D array

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172112.png" height="50%" width="50%"/>

**Two-Dimensional Array Slicing**

<img src="https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781449323592/files/httpatomoreillycomsourceoreillyimages2172114.png" height="50%" width="50%"/>
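
A few slicing expressions on the 3 × 3 `arr2d` from above, as a sketch of how row and column slices combine. The result names are introduced here for illustration.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_two_rows = arr2d[:2]       # rows 0 and 1, all columns
upper_right = arr2d[:2, 1:]      # rows 0 and 1, columns 1 onward
part_of_row = arr2d[1, :2]       # row 1, first two columns (1D result)
first_column = arr2d[:, :1]      # every row, first column (still 2D)

print(first_two_rows)
print(upper_right)
print(part_of_row)
print(first_column)
```

Mixing an integer index with a slice, as in `arr2d[1, :2]`, lowers the number of dimensions; using slices on both axes keeps the result two-dimensional.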

**3D arrays**

In [None]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

In [None]:
arr3d.shape

In [None]:
arr3d


💡 **Here is a way to visualize 3 and higher dimensional data:**

```python
[ # AXIS 0                     CONTAINS 2 ELEMENTS (arrays)
    [ # AXIS 1                 CONTAINS 2 ELEMENTS (arrays)
        [1, 2, 3], # AXIS 2    CONTAINS 3 ELEMENTS (integers)
        [4, 5, 6]  # AXIS 2
    ],  
    [ # AXIS 1
        [7, 8, 9],
        [10, 11, 12]
    ]
]
```
Each axis is a level in the nested hierarchy, i.e. a tree or DAG (directed-acyclic graph).

* Each axis is a container.
* There is only one top container.
* Only the bottom containers have data.
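
The nesting can be checked directly: the length of each level matches the corresponding entry of `.shape`. A small sketch using the same `arr3d` values:

```python
import numpy as np

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print(arr3d.shape)        # one length per axis

# Length of the container at each level of nesting
level_lengths = (len(arr3d), len(arr3d[0]), len(arr3d[0][0]))
print(level_lengths)

# One index per axis reaches a single value in a bottom container
value = arr3d[1, 0, 2]
print(value)
```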

**Omit later indices**

In multidimensional arrays, if you omit later indices, the returned object will be a **lower-dimensional ndarray** consisting of all the data along the remaining dimensions.

So in the 2 × 2 × 3 array `arr3d`:

In [None]:
arr3d[0]

Saving data before modifying an array.

In [None]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

Putting the data back.

In [None]:
arr3d[0] = old_values
arr3d

Similarly, `arr3d[1, 0]` gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:

In [None]:
arr3d[1, 0]

In [None]:
x = arr3d[1]
x

In [None]:
x[0]

## Student Exercises

In the first lesson, you created a `sudoku_game` two-dimensional NumPy array. Perhaps you have hundreds of sudoku game arrays, and you'd like to save the solution for this one, `sudoku_solution`, as part of the same array as its corresponding game in order to organize your sudoku data better. You could accomplish this by stacking the two 2D arrays on top of each other to create a 3D array.



In [None]:
sudoku_solution = [[8, 6, 4, 3, 7, 1, 2, 5, 9],
                   [3, 2, 5, 8, 4, 9, 7, 6, 1],
                   [9, 7, 1, 2, 6, 5, 8, 4, 3],
                   [4, 3, 6, 1, 9, 2, 5, 8, 7],
                   [1, 9, 8, 6, 5, 7, 4, 3, 2],
                   [2, 5, 7, 4, 8, 3, 9, 1, 6],
                   [6, 8, 9, 7, 3, 4, 1, 2, 5],
                   [7, 1, 3, 5, 2, 8, 6, 9, 4],
                   [5, 4, 2, 9, 1, 6, 3, 7, 8]]


* Create a 3D array called `game_and_solution` by stacking the two 2D arrays, created from `sudoku_game` and `sudoku_solution`, on top of one another; in the final array, `sudoku_game` should appear before `sudoku_solution`.  

* Print `game_and_solution`.

* Create another 3D array called `new_game_and_solution` with a different 2D game and 2D solution pair: `new_sudoku_game` and `new_sudoku_solution`. `new_sudoku_game` should appear before `new_sudoku_solution`.   

* Create a 4D array called `games_and_solutions` by making an array out of the two 3D arrays: `game_and_solution` and `new_game_and_solution`, in that order.  

* Print the shape of `games_and_solutions`  

In [None]:
new_sudoku_game = [[0, 0, 4, 3, 0, 0, 0, 0, 0],
                   [8, 9, 0, 2, 0, 0, 6, 7, 0],
                   [7, 0, 0, 9, 0, 0, 0, 5, 0],
                   [5, 0, 0, 0, 0, 8, 1, 4, 0],
                   [0, 7, 0, 0, 3, 2, 0, 6, 0],
                   [6, 0, 0, 0, 0, 1, 3, 0, 8],
                   [0, 0, 1, 7, 5, 0, 9, 0, 0],
                   [0, 0, 5, 0, 4, 0, 0, 1, 2],
                   [9, 8, 0, 0, 0, 6, 0, 0, 5]]

new_sudoku_solution = [[2, 5, 4, 3, 6, 7, 8, 9, 1],
                       [8, 9, 3, 2, 1, 5, 6, 7, 4],
                       [7, 1, 6, 9, 8, 4, 2, 5, 3],
                       [5, 3, 2, 6, 9, 8, 1, 4, 7],
                       [1, 7, 8, 4, 3, 2, 5, 6, 9],
                       [6, 4, 9, 5, 7, 1, 3, 2, 8],
                       [4, 2, 1, 7, 5, 3, 9, 8, 6],
                       [3, 6, 5, 8, 4, 9, 7, 1, 2],
                       [9, 8, 7, 1, 2, 6, 4, 3, 5]]

* Flatten `sudoku_game` so that it is a 1D array, and save it as `flattened_game`.  

* Print the `.shape` of `flattened_game`.

* Hint: look up the documentation on `.flatten`.

* Reshape the `flattened_game` back to its original shape of nine rows and nine columns; save the new array as `reshaped_game`.  

* Hint: look up the documentation on `.reshape`.

* Does NumPy keep the array elements in the same order after being flattened and reshaped?