### Data Science Week: Day 1

### Introduction to NumPy

##### CSAA2022

#### Day 1, Part 1: Overview

* What are numpy arrays (ndarrays)
* Why ndarrays
* Lists vs ndarrays
* Creating ndarrays
* Array operations
    * Slicing and Indexing
    * Boolean Indexing
    * Mathematical and statistical operations with arrays
* Additional resources

### What is NumPy?

* powerful package for scientific computing
* contains n-dimensional array object (ndarray)
* provides support for linear algebra 
* loads of useful functions
* integrates nicely with the scientific Python stack: matplotlib, scipy, etc

### Ndarrays

**ndarray** stands for n-dimensional array.

Why would we use arrays? 

* Some operations are easily expressed as array operations
* Makes linear algebra easier
* Efficient in both memory use and computation time
* Deep learning: speech recognition, image recognition, etc


#### Recap: lists in Python (1)

***Let's remind ourselves of what we can do with lists***


Lists can hold **different data types**. 

For example, the list `simplel` below contains a string, an int, and a list.

In [2]:
simplel = ['week',3,['day1','day2']]

#### Recap: lists in Python (2)

We can **update** the contents of a list. Let's say we want to update the 2nd element to be 2 instead of 3. Remember that in Python we start indexing from 0, so the 2nd element is at index 1.

In [3]:
simplel [1] = 2
print (simplel)

['week', 2, ['day1', 'day2']]


#### Recap: lists in Python (3)

We can always **augment a list and add more elements to it**. 

In [4]:
simplel.append(10)
print (simplel)

['week', 2, ['day1', 'day2'], 10]


####  But what is an array and how is it different from what we've been using so far?

Unlike lists, arrays:

* have a fixed, predefined size (or "shape")
* have a fixed, predefined type (all elements have the same type)
* are inherently multidimensional
* are rectangular: a 2D array, for example, must have the same number of columns in each row.

Arrays **are** typically mutable: we can change the values they hold after creation, so we can write new values into an existing array.
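
The properties above can be checked directly. The following is a minimal sketch, with made-up values, of what "fixed type", "fixed size" and "mutable" mean in practice; note that the exact integer width NumPy picks depends on the platform.

```python
import numpy as np

# An array stores a single dtype for all of its elements.
arr = np.array([1, 2, 3])
print(arr.dtype.kind)            # 'i': an integer type (exact width is platform-dependent)

# Arrays are mutable: write a new value into an existing position.
arr[0] = 10
print(arr.tolist())              # [10, 2, 3]

# The dtype stays fixed: a float written into an integer array is truncated.
arr[1] = 7.9
print(arr.tolist())              # [10, 7, 3]

# The size stays fixed too: np.append builds a new array instead of growing arr.
bigger = np.append(arr, 4)
print(arr.shape, bigger.shape)   # (3,) (4,)
```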


### Numpy and ndarrays

Main characteristics of arrays:
* **dtype**: the type of the elements contained in the array
* **shape**: the dimensions of the array; for a 2D array, the shape is presented in the form of **rows** by **columns**

**dtype** just stands for the *d*ata *type*, and it is the data type of every element in an array (what kind of number it is). For the moment, we will assume we get floating point elements in our arrays: we'll discuss this in more detail later. Every element has the same *dtype*; it applies to the whole array.

Common *dtypes* are:
* `float64` double-precision float numbers
* `float32` single-precision float numbers
* `int32` 32-bit integers

though there are many more.
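
As a small illustration of the two attributes, the sketch below creates an array with an explicitly requested dtype and inspects it. The array contents are arbitrary.

```python
import numpy as np

# Request a dtype explicitly at creation time.
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
print(a.dtype)    # float32
print(a.shape)    # (2, 3): 2 rows by 3 columns
print(a.ndim)     # 2: the number of dimensions

# astype returns a new array converted to another dtype; a is left unchanged.
b = a.astype(np.int32)
print(b.dtype)    # int32
```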

### Creating NumPy Arrays (ndarrays)

* From Python sequence objects (e.g. lists and tuples)
* Using numpy functions
    * np.arange(start, stop, step)
    * np.linspace(start, stop, num=50)
    * np.zeros(shape)
    * np.ones(shape)
    * np.full(shape, value)

In [5]:
# traditional way of importing numpy
import numpy as np

# Creating an ndarray from a list
my_list1 = [ 2, 7, -1, 10]

my_array1 = np.array(my_list1) # dtype = int64 on most platforms

One of the simplest ways to create an ndarray is to convert a sequence data object (e.g. a list) to an array.

* Note that the data type of the array created depends on the data type of the elements of the sequence data object. 

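* To check the size of the array created, use the built-in `len()` function or the array's `size` attribute. **`len` and `size` return the same value for one-dimensional arrays**; for multidimensional arrays, `len` returns the length of the first dimension, while `size` returns the total number of elements.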

* If the original sequence contains another sequence (e.g. list of lists), 
the sub-sequences need to be of the same length.

In [6]:
print (len(my_array1))

print (my_array1.size)

4
4


Tuples can also be converted to ndarrays.

The elements are **promoted** to the most general data type in the original sequence (in the example below, the ints become floats).



In [7]:
my_list2 = (2, 3, 4.3, 7.9)

my_array2 = np.array(my_list2) #dtype = float64

print (my_array2)

print (my_array2.dtype)

[2.  3.  4.3 7.9]
float64


What about a list of lists? 

**multidimensional array**

In [8]:
my_list3 = [[2, 4, 5], [12.2, 9, 6 + 2j]]

my_array3 = np.array(my_list3) #dtype = complex128

print (len(my_array3))
print (my_array3.size)

2
6
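
The difference between `len` and `size` follows from the array's `shape`. A minimal sketch, using an arbitrary 2 by 3 array:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

rows, cols = grid.shape
print(grid.shape)                 # (2, 3)
print(len(grid) == rows)          # True: len is the length of the first dimension
print(grid.size == rows * cols)   # True: size is the total number of elements
```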


#### Creating ndarrays (cont)

In addition to converting lists to arrays, numpy provides us with a range of functions. Let's see some examples. You will also have the chance to practice some of those in the afternoon. 

        np.arange(start, stop, step)
        
        np.linspace(start, stop,num) 
        
        np.ones(shape)
        
        np.zeros(shape)
        
        np.full(shape, value): could also be done with np.empty followed by the array's fill() method


In [9]:
print (np.arange(10,20,2))

print (np.linspace(1,10,20))

print (np.ones((2,2)))
       
print (np.full((3,3),np.pi))

[10 12 14 16 18]
[ 1.          1.47368421  1.94736842  2.42105263  2.89473684  3.36842105
  3.84210526  4.31578947  4.78947368  5.26315789  5.73684211  6.21052632
  6.68421053  7.15789474  7.63157895  8.10526316  8.57894737  9.05263158
  9.52631579 10.        ]
[[1. 1.]
 [1. 1.]]
[[3.14159265 3.14159265 3.14159265]
 [3.14159265 3.14159265 3.14159265]
 [3.14159265 3.14159265 3.14159265]]
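
Two details of these functions are easy to miss: `np.arange` excludes the stop value while `np.linspace` includes it by default, and `np.full` gives the same result as `np.empty` followed by the array's `fill()` method. A short sketch with arbitrary values:

```python
import numpy as np

# arange excludes the stop value; linspace includes it by default.
print(np.arange(0, 1, 0.25).tolist())   # [0.0, 0.25, 0.5, 0.75]
print(np.linspace(0, 1, 5).tolist())    # [0.0, 0.25, 0.5, 0.75, 1.0]

# np.full in one step...
filled = np.full((2, 3), 7.0)

# ...or np.empty (uninitialised memory) followed by fill().
manual = np.empty((2, 3))
manual.fill(7.0)

print(np.array_equal(filled, manual))   # True
```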


#### Ndarrays are more efficient

Why would we prefer arrays over lists?
Let's have a look and compare how long a simple operation would take when using lists and when using arrays.

In [2]:
import numpy as np

In [3]:
a_array = np.arange(10000000)
b_array = np.arange(10000000)

a_list = list(a_array)
b_list = list(b_array)


In [4]:
%%timeit
for i in range(len(a_list)):
        a_list[i] += 2 * b_list[i]

3.49 s ± 223 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [5]:
%%timeit
newArray = a_array + 2*b_array

49.8 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
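
The `%%timeit` magic only works inside IPython or Jupyter. As a rough sketch, a similar comparison can be run as a plain script with the standard-library `timeit` module. The function names and the smaller array size here are choices made for this example, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than in the cells above, to keep the run short
a_array = np.arange(n)
b_array = np.arange(n)
a_list = a_array.tolist()
b_list = b_array.tolist()

def with_lists():
    # Explicit Python-level loop over the elements.
    return [a + 2 * b for a, b in zip(a_list, b_list)]

def with_arrays():
    # A single vectorised expression; the loop runs inside NumPy.
    return a_array + 2 * b_array

# Both approaches compute the same values.
assert with_lists() == with_arrays().tolist()

t_lists = timeit.timeit(with_lists, number=10)
t_arrays = timeit.timeit(with_arrays, number=10)
print(f"lists:  {t_lists:.4f} s for 10 runs")
print(f"arrays: {t_arrays:.4f} s for 10 runs")
```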


#### Element wise operations

Some further examples of avoiding for loops: 
* Addition 
* Subtraction
* Multiplication
* Division (e.g. used to normalise data features in data analysis)

In [None]:
A = np.arange(1,10)
B = np.arange(11,20)

X = A + B 
print(X)

Y = 2*A + B-3
print(Y)

Z = Y / 5
print(Z)

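# The '@' operator (or np.matmul()) performs matrix multiplication on 2-D arrays;
# for 1-D arrays such as A and B it returns their inner (dot) product, a single number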
Q = A @ B
print(Q)
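
To make the difference between `*` and `@` concrete, here is a small sketch on short, arbitrary arrays where the results can be checked by hand.

```python
import numpy as np

A = np.arange(1, 4)    # [1 2 3]
B = np.arange(4, 7)    # [4 5 6]

print((A * B).tolist())   # element-wise product: [4, 10, 18]
print(A @ B)              # inner product: 1*4 + 2*5 + 3*6 = 32

M = np.array([[1, 2], [3, 4]])
I = np.eye(2, dtype=int)  # 2x2 identity matrix

print((M * I).tolist())   # element-wise: [[1, 0], [0, 4]]
print((M @ I).tolist())   # matrix product with the identity: [[1, 2], [3, 4]]
```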

### Mutability and copying 


In [13]:
x = np.array([1,2,3])
z = x
x[0] = 0 

In [14]:
print (x)
print (z)

[0 2 3]
[0 2 3]


In [15]:
x = np.array([1,2,3])
y = np.array(x) # copy
z = x.copy()  #same as the line above
x[0] = 0

print (x)
print (y)
print (z)

[0 2 3]
[1 2 3]
[1 2 3]
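
A related point worth knowing is that basic slices of an array are views onto the same data rather than copies, so writing through a slice also changes the original. The sketch below uses `np.shares_memory` to check this; the variable names are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
alias = x                 # another name for the same array object
view = x[1:3]             # a basic slice: a view onto x's data
copied = x[1:3].copy()    # an independent copy of the same elements

view[0] = 99              # writes through to x

print(x.tolist())                     # [1, 99, 3, 4]
print(alias is x)                     # True
print(np.shares_memory(x, view))      # True
print(np.shares_memory(x, copied))    # False
print(copied.tolist())                # [2, 3]: unaffected by the write
```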


#### Indexing and slicing


With lists we were able to: 

* access an element at a particular index
* use negative indexing 
* take a slice 

In [None]:
mytest = [1,2,4,5,6,6,8]

print (mytest[0])

print (mytest[-2])

print(mytest[2:6])

#### Indexing and slicing in arrays 

In [None]:
array1 = np.arange(1,20)
print (array1)
print (array1[0]) # first element
print(array1[-1]) # last element
print (array1[2:6]) #slicing
print (array1[1:15:2])

#### What about multidimensional arrays? 

In [None]:
array2 = np.arange(1,21).reshape(4,5)
print (array2)
print ("__________")
print (array2[0]) #first row of the array (a subarray)
print (array2[0,1]) #first row (0 index) and second column
#print (array2[0][1]) #equivalent to the line above

In [None]:
print (array2[0:2,:])
print (array2[:,0:2])
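
For reference, the sketch below spells out what a few row and column slices of the same 4 by 5 array select. The last two slices are extra examples not used elsewhere in these notes.

```python
import numpy as np

array2 = np.arange(1, 21).reshape(4, 5)

# First two rows, all columns.
print(array2[0:2, :].tolist())    # [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]

# All rows, first two columns.
print(array2[:, 0:2].tolist())    # [[1, 2], [6, 7], [11, 12], [16, 17]]

# Rows 1-2 and columns 2-3: a 2x2 block from the middle.
print(array2[1:3, 2:4].tolist())  # [[8, 9], [13, 14]]

# Last row, every second column.
print(array2[-1, ::2].tolist())   # [16, 18, 20]
```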

In [None]:
gas = np.loadtxt("data/gas.csv",delimiter=',')
np.set_printoptions(suppress=True)
print (gas.shape)
print (gas[:3])

In [None]:
print (gas[::4])

#### Boolean indexing

* For a range of tasks, we would like to select only elements of an array that meet a certain condition. 
* When the condition is applied to the array itself, Boolean indexing results in a 1-D array containing the elements of the original array for which the condition evaluates to *True*.
* For example, let's say we want to select only the array elements greater than 5

In [17]:
booleanexample = np.arange(1,15)
print (booleanexample)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14]


In [None]:
booleanexample>5 #boolean values specifying if we are meeting a condition

In [None]:
# What we actually need is members of the original array that are > 5, not boolean values!
# We obtain this by using the array to index into the original array as follows

booleanexample[booleanexample>5]

#### Some further examples

In [None]:
booleanexample[booleanexample%2==0]

my_multi_arr = np.arange(20).reshape(4,5) #reshaping to obtain a multidimensional array
print(my_multi_arr)

my_multi_arr[my_multi_arr.sum(axis=1)>20]
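
Conditions can also be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. A mask with one entry per row selects whole rows, as in the last example above. A minimal sketch with freshly defined arrays:

```python
import numpy as np

values = np.arange(1, 15)

# Both conditions must hold: greater than 5 AND even.
both = values[(values > 5) & (values % 2 == 0)]
print(both.tolist())       # [6, 8, 10, 12, 14]

# Either condition may hold: less than 3 OR greater than 12.
either = values[(values < 3) | (values > 12)]
print(either.tolist())     # [1, 2, 13, 14]

# A mask with one boolean per row keeps or drops entire rows.
grid = np.arange(20).reshape(4, 5)
row_sums = grid.sum(axis=1)
print(row_sums.tolist())   # [10, 35, 60, 85]
selected = grid[row_sums > 20]
print(selected.shape)      # (3, 5): the last three rows
```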

#### Some useful functions


* sum(axis): In the earlier example, we used the function sum() with axis=1, which adds up the values along each row and returns one sum per row. With axis=0, it returns one sum per column.

* mean(axis)

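* randn(d0, d1, ...): samples from the standard normal distribution; the dimensions are passed as separate arguments rather than as a single shape tuple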


In [16]:
testarray = np.random.randn(10,2)
print (testarray)

[[-0.26267404 -1.81399199]
 [ 1.50271015  0.62088865]
 [-0.2345111   0.27922498]
 [-1.57344788 -0.65686361]
 [-0.7295151   2.57619549]
 [-0.75209127 -0.3942504 ]
 [ 1.22801891 -0.3541633 ]
 [-0.79990598 -1.08630509]
 [-1.36365505 -0.11788272]
 [ 0.74967716  3.04068524]]


In [None]:
print (testarray.sum(axis=0))
print (testarray.sum(axis=1))
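
Because `testarray` is random, its sums differ on every run. The sketch below uses a small fixed array so that the effect of `axis` can be checked by hand. The `default_rng` lines at the end show NumPy's newer random generator interface as an alternative to `randn`; the seed value is arbitrary.

```python
import numpy as np

data = np.array([[1, 2],
                 [3, 4],
                 [5, 6]])   # 3 rows, 2 columns

print(data.sum(axis=0).tolist())    # one sum per column: [9, 12]
print(data.sum(axis=1).tolist())    # one sum per row: [3, 7, 11]
print(data.mean(axis=0).tolist())   # one mean per column: [3.0, 4.0]
print(data.sum())                   # no axis: sum of all elements, 21

# Newer interface for random numbers; here the shape IS passed as a tuple.
rng = np.random.default_rng(0)
sample = rng.standard_normal((10, 2))
print(sample.shape)                 # (10, 2)
```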