# Intro to numpy

NumPy is a very popular Python module for managing numeric datasets and for mathematical operations in general.

It is already installed in Colab and is usually imported like this:

In [1]:
import numpy as np

The core type you'll manage with numpy is the ``ndarray``, which stands for N-dimensional array, since it can support an arbitrary number of dimensions. Let's start with one.

In [4]:
#a standard python list
my_list = [1, 2, 3, 4, 5, 0, 7, 1, 9]

#a numpy 1D array
my_array = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

At a superficial level they look similar:

In [5]:
print(my_list)
print(my_array)

[1, 2, 3, 4, 5, 0, 7, 1, 9]
[1 2 3 4 5 0 7 1 9]


But they are of different types:

In [6]:
print(type(my_list))
print(type(my_array))

<class 'list'>
<class 'numpy.ndarray'>


NumPy arrays support a long list of [attributes](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-attributes) and [methods](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-methods). For example, you can retrieve the size of each dimension of an array through the ``.shape`` attribute.

In [8]:
my_array.shape

(9,)

We see that by default `numpy` does not specify the second dimension for one-dimensional arrays (vectors). If you want to make this explicit (which may turn out to be helpful for some array operations and to avoid ambiguous results), you can use the **method** `reshape()`:

In [9]:
my_array = my_array.reshape(len(my_array),1)
print(my_array.shape)

(9, 1)
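
As a small complement to the cell above, `reshape()` also accepts `-1` for one dimension, in which case numpy works out that length from the number of elements. The names `vector`, `column`, `row` and `flat` below are only illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

# -1 asks numpy to infer that dimension from the total number of elements
column = vector.reshape(-1, 1)   # 9 rows, 1 column
row = vector.reshape(1, -1)      # 1 row, 9 columns
print(column.shape)
print(row.shape)

# going back to a flat, one-dimensional array
flat = column.reshape(-1)
print(flat.shape)
```

This avoids having to pass `len(my_array)` by hand.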


Going up in dimensions:

In [10]:
my_array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(my_array_2d)
print(my_array_2d.shape) 

[[1 2 3 4]
 [5 6 7 8]]
(2, 4)


In [11]:
my_array_3d = np.array([[[1, 2, 3, 4], [5, 6, 7, 8]], [[9, 10, 11, 12], [13, 14, 15, 16]]])
print(my_array_3d)
print(my_array_3d.shape)

[[[ 1  2  3  4]
  [ 5  6  7  8]]

 [[ 9 10 11 12]
  [13 14 15 16]]]
(2, 2, 4)
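
Besides ``.shape``, two other attributes help when checking how many dimensions an array really has: ``.ndim`` and ``.size``. The sketch below uses illustrative names (`flat_values`, `table`) that are not part of the notebook.

```python
import numpy as np

flat_values = np.array([1, 2, 3, 4])
table = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# ndim: number of dimensions (axes)
print(flat_values.ndim, table.ndim)

# shape: length of each dimension
print(flat_values.shape, table.shape)

# size: total number of elements
print(flat_values.size, table.size)
```

The number of dimensions matches the nesting depth of the square brackets passed to `np.array()`.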


# Typing

All the elements of a NumPy array must have the same type. When the input values are mixed, they are all converted to a common type:

In [14]:
my_array1 = np.array([1, 2, 3])
print(my_array1)

my_array2 = np.array([1, 2, 3.14])
print(my_array2)

[1 2 3]
[1.   2.   3.14]


We can explicitly access the [data type](https://numpy.org/doc/stable/reference/arrays.scalars.html#arrays-scalars-built-in) via the ``.dtype`` property:

In [15]:
print(my_array1.dtype)
print(my_array2.dtype)

int64
float64
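
The type can also be chosen explicitly instead of being inferred. A minimal sketch, using the `dtype` argument of `np.array()` and the `astype()` method (the variable names are illustrative):

```python
import numpy as np

# request a type at creation time
as_float = np.array([1, 2, 3], dtype=float)
print(as_float, as_float.dtype)

# convert an existing array; astype() returns a new array
# and the decimal part is dropped, not rounded
as_int = np.array([1.9, 2.5, 3.14]).astype(int)
print(as_int, as_int.dtype)
```

The exact integer type printed (e.g. `int64` or `int32`) may depend on the platform.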


# Missing values

NumPy doesn't support "holes" in arrays. However, there are two special values, "not a number" (`np.nan`) and "infinity" (`np.inf`), that can be used for that purpose. Both are floating-point values, so an array that contains them has a float type.

In [23]:
my_array = np.array([1, 2, 3, np.nan, np.inf])
print(my_array)
print(my_array.dtype)

[ 1.  2.  3. nan inf]
float64


Since they are numeric values, technically you can do operations with them.

In [27]:
#predict the output!
print(np.nan + 3)
print(np.inf + 3)
print(np.nan + np.inf)

nan
inf
nan


There are also functions to detect them. The resulting boolean array can then be used to select or count them:

In [29]:
my_array = np.array([1, 2, 3, np.nan, np.inf])
print(np.isnan(my_array))

[False False False  True False]
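
To go from detecting to counting, one possible approach is to sum the boolean array, since each `True` counts as 1. The sketch below also uses `np.isinf()` and `np.isfinite()`, which are not shown elsewhere in this notebook; `values` is an illustrative name.

```python
import numpy as np

values = np.array([1, 2, 3, np.nan, np.inf])

# True counts as 1 when summed, so this counts the matching elements
print(np.isnan(values).sum())      # how many NaN
print(np.isinf(values).sum())      # how many infinite values
print(np.isfinite(values).sum())   # how many ordinary numbers

# NaN never compares equal to anything, itself included,
# so == cannot be used to find it
print(np.nan == np.nan)
```

This is why a dedicated function like `np.isnan()` is needed rather than a comparison with `np.nan`.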


# Slicing

With "slicing" we refer to the act of selecting subsets of numpy arrays. The general syntax uses square brackets ``[]`` to indicate the operation of subsetting, and the colon ``:`` to indicate an interval, like this:

- use array indices for slicing: `[start:end]`
- a step argument can be added: `[start:end:step]`

Keep in mind that ``start`` is the coordinate (zero-indexed) of the first element to be included, while ``end`` is the coordinate (zero-indexed) of the first element to be **excluded**.

In [39]:
#declaring a 1-dimensional array
arr_1d = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

#predict the output!
print(arr_1d[:])   
print(arr_1d[3:6]) 
print(arr_1d[:3])  
print(arr_1d[7:])  
print(arr_1d[::2]) 

[1 2 3 4 5 0 7 1 9]
[4 5 0]
[1 2 3]
[1 9]
[1 3 5 7 9]


You can also do "negative slicing", passing negative values to refer to an index from the end: 

In [None]:
#declaring a 1-dimensional array
arr_1d = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

#predict the output!
print(arr_1d[-3:])  
print(arr_1d[-3:-1])  
print(arr_1d[2:-3]) 
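
The same syntax extends to arrays with more dimensions: one slice per dimension, separated by commas. A short sketch on a two-dimensional array (`grid` is an illustrative name, not used elsewhere in the notebook):

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(grid[0, :])      # first row
print(grid[:, 1])      # second column
print(grid[0:2, 1:3])  # rows 0 and 1, columns 1 and 2
print(grid[1, 2])      # a single element: row 1, column 2
```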

# Operations

Since numpy arrays are big chunks of numeric data, they support a lot of mathematical operations, applied element by element:

In [35]:
arr_1d = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

#predict the output!
print(arr_1d + 3)
print(arr_1d + 2.5)
print(arr_1d * 5)
print(arr_1d + arr_1d)

[ 4  5  6  7  8  3 10  4 12]
[ 3.5  4.5  5.5  6.5  7.5  2.5  9.5  3.5 11.5]
[ 5 10 15 20 25  0 35  5 45]
[ 2  4  6  8 10  0 14  2 18]


It works on more dimensions, too:

In [38]:
my_array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

#predict the output!
print(my_array_2d + 1000)
print(my_array_2d + my_array_2d)
print(my_array_2d * my_array_2d)

[[1001 1002 1003 1004]
 [1005 1006 1007 1008]]
[[ 2  4  6  8]
 [10 12 14 16]]
[[ 1  4  9 16]
 [25 36 49 64]]


But the dimensions need to be compatible:

In [40]:
arr_1d = np.array([1, 2, 3, 4, 5, 0, 7, 1, 9])

#predict the output!
print(arr_1d[0:3] + arr_1d)

ValueError: operands could not be broadcast together with shapes (3,) (9,)
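
"Compatible" does not always mean "identical": numpy follows a set of rules called broadcasting, under which a one-dimensional array can be combined with each row of a two-dimensional one when the lengths match. A minimal sketch, with illustrative names `table` and `row`:

```python
import numpy as np

table = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])   # shape (2, 4)
row = np.array([10, 20, 30, 40])                 # shape (4,)

# the row is added to every row of the table
print(table + row)

# shapes (2, 4) and (3,) cannot be matched, so numpy raises ValueError
try:
    table + np.array([1, 2, 3])
except ValueError as err:
    print("ValueError:", err)
```

Adding a single number to an array, as in the earlier cells, is the simplest case of the same mechanism.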

# Filtering

It's easy to select elements of an array when you know their indexes, but it's far more interesting to extract subsets based on the actual values.

Luckily, any boolean list (or array) with the same length as the array can be used for indexing: only the elements in the positions marked ``True`` are kept.

See also the W3Schools [array filter tutorial](https://www.w3schools.com/python/numpy/numpy_array_filter.asp) and [exercises](https://www.w3schools.com/python/numpy/numpy_exercises.asp).

In [42]:
my_array = np.array([1, 2, 4, 8, 16])

#predict the output!

my_first_filter = [True, False, True, False, True]
print(my_array[my_first_filter]) 

my_second_filter = my_array > 5
print(my_array[my_second_filter]) 

[ 1  4 16]
[ 8 16]
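
Conditions can also be combined. The sketch below assumes the usual numpy element-wise operators `&` (and), `|` (or) and `~` (not); each condition has to be wrapped in parentheses. `powers` is an illustrative name.

```python
import numpy as np

powers = np.array([1, 2, 4, 8, 16])

# greater than 1 AND smaller than 16
print(powers[(powers > 1) & (powers < 16)])

# smaller than 2 OR greater than 8
print(powers[(powers < 2) | (powers > 8)])

# NOT greater than 5
print(powers[~(powers > 5)])
```

Python's own `and`, `or` and `not` keywords do not work element by element on arrays, which is why the symbols are used instead.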


---

# ASSIGNMENT! (filtering)

* use the basic Python [range()](https://www.w3schools.com/python/ref_func_range.asp) function to fill a numpy array with all the integer numbers between 100 and 200.
* filter the array to select only the elements that are divisible by seven (remember the ``%`` operator)
---

In [None]:
#your solution here

# Random

The ``numpy.random`` submodule is useful for generating values that follow a given distribution. This comes in handy for examples and for simulating data.

As an example, you can use the [randint()](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) function to generate a random integer between 0 and 100 (the upper bound passed to the function, 101, is excluded).

In [3]:
from numpy import random

print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))

87
64
89
36
64


Each successive call is a new (pseudo)random number, and each of you has a different result. It is possible however to "synchronize" our random generators, through an operation called "setting the seed".

In [6]:
random.seed(42)
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))

print('')

random.seed(42)
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))
print(random.randint(101))

51
92
14
71
60

51
92
14
71
60
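
The same idea can be sketched with numpy's newer `Generator` interface, which the numpy documentation recommends for new code. Here the seed is passed to `np.random.default_rng()` and integers are drawn with `integers()`; the names `rng`, `first_run` and `second_run` are illustrative. A generator seeded this way is not expected to reproduce the numbers printed above by `random.seed(42)`.

```python
import numpy as np

# a generator object carries its own seed and state
rng = np.random.default_rng(42)
first_run = rng.integers(0, 101, size=5)   # upper bound excluded, as in randint()

# a second generator with the same seed restarts the same sequence
rng = np.random.default_rng(42)
second_run = rng.integers(0, 101, size=5)

print(first_run)
print(second_run)
print(np.array_equal(first_run, second_run))
```

Note the `size` argument, which returns several values at once instead of calling the function five times.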


---

# ASSIGNMENT! (random)

Look up the numpy function needed to generate samples from a normal (gaussian) distribution, then use it to generate 100 samples from a distribution with mean = 5 and standard deviation = 2.

---

In [None]:
#your solution here

# Functions

There are a lot of useful functions in numpy. To name just a few:

* array manipulation
  * [transpose()](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html) : reverse or permute the axes 
  * [sort()](https://numpy.org/doc/stable/reference/generated/numpy.sort.html) : sort the array 
  * [concatenate()](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) : join arrays along an existing axis
* maths
  * [sum()](https://numpy.org/devdocs/reference/generated/numpy.sum.html#numpy.sum) : adding elements
  * [mean()](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) : compute the average 
  * [sqrt()](https://numpy.org/doc/stable/reference/generated/numpy.sqrt.html) : square root
  * [sin()](https://numpy.org/doc/stable/reference/generated/numpy.sin.html) [cos()](https://numpy.org/doc/stable/reference/generated/numpy.cos.html) ... : all the trig functions
  * [max()](https://numpy.org/doc/stable/reference/generated/numpy.max.html), [argmax()](https://numpy.org/devdocs/reference/generated/numpy.argmax.html#numpy.argmax) : find the maximum value, or the index of the maximum value
* logical
  * [any()](https://numpy.org/doc/stable/reference/generated/numpy.any.html) : return ``True`` if at least one of the elements of the array is ``True``
  * [all()](https://numpy.org/doc/stable/reference/generated/numpy.all.html#numpy.all) : return ``True`` if all the elements of the array are ``True``
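
A quick tour of some of the functions listed above, on two small arrays chosen only for illustration (`a`, `b` and `m` are not used elsewhere in the notebook):

```python
import numpy as np

a = np.array([4, 9, 1, 16])
b = np.array([25, 36])

# array manipulation
print(np.sort(a))               # sorted copy, a itself is unchanged
print(np.concatenate([a, b]))   # a followed by b

# maths
print(np.sum(a), np.mean(a))    # total and average
print(np.sqrt(a))               # element-wise square root
print(np.max(a), np.argmax(a))  # largest value and its position

# logical
print(np.any(a > 10), np.all(a > 10))

# transpose swaps rows and columns of a 2D array
m = np.array([[1, 2, 3], [4, 5, 6]])
print(np.transpose(m).shape)
```

Many of these are also available as methods, e.g. `a.sum()` or `a.mean()`.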

---

# ASSIGNMENT! (final numpy)

In this final assignment you'll test a mix of what we have seen. You are required to:

* create an array of 100 random integers between 1 and 10 picked from a uniform distribution
* compute and print the average
* subtract the average from the array
* compute and print the average again (can you predict this value?)
* compute the third quartile value (i.e. the threshold that separates the top 25% of values from the rest). Hint: there's a quantile function in numpy...
* print the values in the top quartile
* how many values are in the top quartile?
---

In [None]:
#your solution here