<a href="https://colab.research.google.com/github/kimdanny/RSPW-23/blob/main/Exercise/Day1/03_numpy_matplotlib_exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Numpy & matplotlib

This notebook is modified from Tim Hillel's (tim.hillel@ucl.ac.uk) material for the UCL Department of Civil, Environmental and Geomatic Engineering (CEGE) Introduction to Python sessions.

Please contact him before distributing or reusing the material below.


## Overview 

This notebook will introduce two new Python *libraries*:
* Numpy
* Matplotlib

## Numpy

Data structures (e.g. `lists` and `dictionaries`) are fine for 1-D data, but what if we have data in two or more dimensions?

Many (all?) of you have used Matlab, where the answer is to use arrays!

Python has Matlab-like array functionality through the *numpy* library.

Python is popular in data science because it is easy to learn and is supported by a number of scientific computing libraries. Numpy is one of the core libraries: it handles mathematical computation and lets users work with multi-dimensional data structures efficiently and easily.

Great documentation and tutorials are available for numpy: 

https://docs.scipy.org/doc/numpy/user/quickstart.html

### Importing libraries

Python is geared towards code reuse, and there is a huge number of libraries you can use to add functionality to Python.

Anaconda has the most useful libraries for data science pre-installed, including `numpy`

We can import a library using the `import` keyword, e.g.

    import numpy
    
However, we usually give numpy the alias *np* using the keyword `as`

In [None]:
import numpy as np

In [None]:
a = np.array([1, 2, 3, 4]) # create a rank 1 array
print("Type of a: ", type(a))
print("Shape of a: ", a.shape)
print("The first element of a: ", a[0])
print("The last element of a: ", a[-1])

### Why NumPy?

You might be wondering, why do I need Numpy? Can I not do things the usual way, with for loops? One of the main reasons is speed.

Let's compare two functions that perform the same operation, one with Numpy and one without.

In [None]:
# A library to help us measure how fast our algorithms are
import timeit

def numpy_method(n):
    return np.arange(n) ** 2
    
def for_loop_method(n):
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result
        
%timeit numpy_method(1000)
%timeit for_loop_method(1000)

NumPy is a lot faster than our for loop method. This shows us the beauty of NumPy: we get performance close to that of a low-level language (C) while writing in a high-level language (Python).
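The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run in a plain Python script with the standard `timeit` module. The function names and the repetition count of 200 are illustrative choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np


def squares_numpy(n):
    # Vectorised: the loop over elements runs in compiled code
    return np.arange(n) ** 2


def squares_loop(n):
    # Pure Python: one interpreter step per element
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result


# Both approaches should produce the same values
same_values = squares_numpy(1000).tolist() == squares_loop(1000)
print("Same values:", same_values)

# Total time in seconds for 200 calls of each function
numpy_time = timeit.timeit(lambda: squares_numpy(1000), number=200)
loop_time = timeit.timeit(lambda: squares_loop(1000), number=200)
print("NumPy total time:", numpy_time)
print("Loop total time: ", loop_time)
```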

### How do you initialise numpy arrays / matrices?

You will need to use NumPy's documentation to answer the questions below.

In [None]:
# create a matrix full of ones
b = np.ones((2, 2))
print("Matrix b")
print(b)

# create a matrix full of zeros
c = np.zeros((2, 3))
print("\nMatrix c")
print(c)

# create an identity matrix
d = np.eye(3)
print("\nMatrix d")
print(d)

# create a matrix filled with random numbers between 0 and 1
e = np.random.random((2, 2))
print("\nMatrix e")
print(e)

# create an array which has 0-9 as its elements in sorted order
f = np.arange(10)
print("\nMatrix f")
print(f)

# create a matrix placeholder, without initializing entries (elements in the matrix).
g = np.empty((5, 3))
print("\nMatrix g")
print(g)

### Matrix Calculation

In machine learning, we will deal with a lot of matrix calculations. It is therefore good for us to get accustomed to some of the common operations we perform on them. Here is a list of the first few:

- `np.transpose()` : Transpose of an array
- `np.dot(a, b)` : Dot product of two arrays
- `np.linalg.inv()` : Inverse of a matrix (only defined for square n x n matrices that are non-singular)
- `np.diagonal()` : Diagonal components of a two-dimensional array (`np.diag()` does the same for a 2D input)
- `a.reshape(x, y)` : Reshape an array to x rows and y columns

Now let's check what each of them does.

In [None]:
# Initialise the data we will use below
x = np.array([
    [3, 11, 1],
    [7, 5, 2],
    [6, 8, 9],
    [0, 10, 4]
])
x

In [None]:
# Transpose an array
transposed = x.T
transposed

In [None]:
# Dot product of two arrays: original x and x_transposed
# (4x3) dot (3x4) should give you (4x4)

y = x.dot(x.T)
y

In [None]:
x

In [None]:
# Element-wise multiplication of 'broadcaster' and 'x'
# You will see what we mean by 'broadcast' once you check your result.

broadcaster = np.array([
    [0],
    [1],
    [2],
    [3]
])
print("broadcaster: \n{}\n".format(broadcaster))

elementwise_broadcasting = x * broadcaster
print("broadcasted: \n{}".format(elementwise_broadcasting))
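To make the idea concrete, here is a small sketch that stretches the `(4, 1)` column explicitly with `np.broadcast_to` and checks that the product matches what NumPy does implicitly. The variable name `stretched` is just for illustration.

```python
import numpy as np

x = np.array([
    [3, 11, 1],
    [7, 5, 2],
    [6, 8, 9],
    [0, 10, 4]
])
broadcaster = np.array([[0], [1], [2], [3]])  # shape (4, 1)

# Repeat the single column across all three columns of x explicitly
stretched = np.broadcast_to(broadcaster, x.shape)
print(stretched)

# Multiplying by the stretched array matches the broadcast result
print(np.array_equal(x * broadcaster, x * stretched))
```

In other words, each row of `x` is multiplied by the single value in the matching row of `broadcaster`.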

In [None]:
# Extracting the diagonal elements of an array x

diagonal = np.diag(x)
print(diagonal)

In [None]:
# Reshaping an array x to one that has 6 rows and 2 columns

reshaped = x.reshape(6, 2)
print(reshaped)
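The list above also mentions `np.linalg.inv()`, which cannot be applied to `x` because `x` is not square. A minimal sketch with a hypothetical 2x2 matrix `m` (chosen so that its determinant is non-zero) might look like this:

```python
import numpy as np

# Hypothetical invertible 2x2 matrix (determinant is 4*6 - 7*2 = 10)
m = np.array([[4.0, 7.0],
              [2.0, 6.0]])

m_inv = np.linalg.inv(m)
print(m_inv)

# A matrix times its inverse gives the identity, up to floating-point error
print(np.allclose(m.dot(m_inv), np.eye(2)))
```

If the matrix is singular (determinant of zero), `np.linalg.inv` raises `np.linalg.LinAlgError` instead.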

### Statistics in Numpy

When we deal with large amounts of data, we will often want to know things about the data as a whole. This is where NumPy's statistics come to the rescue. Most of them are self-explanatory:

- `np.sum()` : Sum of all elements in an array
- `np.max()` : Maximum value of an array
- `np.min()` : Minimum value of an array
- `np.mean()` : Mean of elements in an array
- `np.median()` : Median value among elements
- `np.var()` : Variance of the elements in the array
- `np.std()` : Standard deviation of the elements in the array

As before, fill in the cells below to get used to these methods.

In [None]:
x = np.array(
    [34, 56, 6, 3, 9, 89, 120, 12, 201],
    dtype = np.int32
)

In [None]:
# Summation of elements 

summation = np.sum(x)
print(summation)

In [None]:
# Get Minimum element in the array

minimum = np.min(x)
print(minimum)

In [None]:
# Maximum element in the array

maximum = np.max(x)
print(maximum)

In [None]:
# Average value of elements in the array

mean = np.mean(x)
print(mean)

In [None]:
# Median element in the array

median = np.median(x)
print(median)

In [None]:
# Variance of x

variance = np.var(x)
print(variance)

In [None]:
# Standard deviation of the array

std = np.std(x)
print(std)

### Quick recap: Creating and manipulating arrays

Unlike in Matlab, we need to explicitly call `numpy` functions when we are using arrays in Python.

For example, we can create an array using the `np.array()` function with a list of lists:

In [None]:
np.array([[1,2,3],[3,2,1]])

Numpy can also easily create arrays with a regular pattern. Use the `arange` function to create a vector with the numbers 0 to 19, and store it as `a1`.

In [None]:
# create vector a1


Arrays have lots of methods you can use with them. Use the `reshape` method to turn the array into a 4x5 array, and store it as `a2`. Can you reshape it to 7x3? What happens if you try?

In [None]:
# reshape a1 to 4x5 and store the result in a2


Any 2D array can be turned back into a vector by reshaping it with -1. Try reshaping `a2` to a vector (do not store it!)
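As a quick illustration of how `-1` works, here is a sketch on a separate, hypothetical array called `demo` (not the `a1` or `a2` from the exercise):

```python
import numpy as np

demo = np.arange(12)           # hypothetical vector: 0, 1, ..., 11
grid = demo.reshape(3, 4)      # 12 elements fit a 3x4 array exactly
print(grid.shape)

# -1 asks NumPy to work out that dimension from the number of elements
print(grid.reshape(-1).shape)
print(demo.reshape(2, -1).shape)
```

Only one dimension can be given as `-1`, since NumPy infers it from the total number of elements and the other dimensions.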

### Indexing arrays

Indexing arrays is similar to lists, except now we can specify a row and a column. 
Get the 2nd value in the 3rd row of `a2`

We can also slice arrays. Try extracting (from a2):
* the 2nd row of the array: `[5, 6, 7, 8, 9]`
* the 3rd column of the array: `[2, 7, 12, 17]`
* the top-right 3x2 sub-array: `[[3, 4], [8, 9], [13, 14]]`

*Hint*: remember the `:` can be used to select all values in a row or column
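The indexing patterns can be sketched on a smaller, hypothetical array `demo` so that the exercise on `a2` is left for you:

```python
import numpy as np

demo = np.arange(12).reshape(3, 4)  # hypothetical 3x4 array
print(demo)

single = demo[1, 2]        # one element: row index 1, column index 2
first_row = demo[0, :]     # whole first row
last_col = demo[:, -1]     # whole last column
block = demo[:2, 1:3]      # first two rows, columns 1 and 2

print(single)
print(first_row)
print(last_col)
print(block)
```

Remember that indices start at 0, so "the 2nd row" is row index 1.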

### Attributes and methods

Arrays have a datatype, which we can check with the `dtype` *attribute*. Note, as it is an attribute, we do not call it!

In [None]:
a2.dtype

Try dividing the array by two and storing it as a3, and then checking the dtype

Numpy arrays have several attributes and methods. Try checking `a3`'s `shape`. What about the `max` value? How about the index of the max value?
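The difference between attributes and methods can be sketched on a hypothetical array `demo`. Note that `argmax` returns an index into the flattened array; `np.unravel_index` is one way to convert it to a (row, column) position.

```python
import numpy as np

demo = np.arange(6).reshape(2, 3)  # hypothetical integer array
print(demo.dtype)                  # an integer type (platform dependent)

halved = demo / 2                  # true division produces floats
print(halved.dtype)

print(halved.shape)                # attribute: no parentheses
print(halved.max())                # method: called with parentheses

flat_index = halved.argmax()       # index into the flattened array
position = np.unravel_index(flat_index, halved.shape)
print(flat_index, position)
```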

### Random numbers

We can also use numpy to generate (pseudo) random numbers, using the `random` submodule. 

When generating random numbers, it is a good idea to set the `seed`, so that we can generate the same numbers when we repeat our experiments.

In [None]:
np.random.seed(42)
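A short sketch of what the seed does: resetting it restarts the sequence of pseudo-random numbers, while drawing again without a reset simply continues the sequence. The variable names are illustrative.

```python
import numpy as np

np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)          # resetting the seed restarts the sequence
second = np.random.rand(3)

third = np.random.rand(3)   # no reset: the sequence simply continues

print(np.array_equal(first, second))
print(np.array_equal(first, third))
```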

Create an array of uniform random floats between -2 and 2 with the same shape as `a3`, using the `rand` function in the `random` submodule. Call it `a4`.

In [None]:
np.random.seed(42)
# create a4


Try calculating the mean and variance (using `sum` and `size`)

Compare the answers you get to using the `mean` and `var` methods
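For reference, these are the standard definitions of the two quantities, writing $N$ for the number of elements (the array's `size`) and $x_i$ for the individual values:

$$
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2
$$

By default, NumPy's `var` divides by $N$ rather than $N - 1$, so this is the form to use if you want your manual result to match.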

### Boolean arrays and indexing

We can use boolean conditions, e.g. `==` (is equal to) and `>=` (is greater than or equal to), on arrays to create boolean arrays.

Try creating a boolean array of all the values larger than 1 in `a4`

A boolean array can be used as a *mask*, which extracts only the elements with a true value, as follows:
    
    <array>[<boolean_mask>]
    
Try extracting all of the values in a4 smaller than -1
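The masking pattern can be sketched on a small, hypothetical array `demo`, leaving the exercise on `a4` for you:

```python
import numpy as np

demo = np.array([-3, 0, 2, 5, -1, 4])  # hypothetical values

mask = demo >= 2           # element-wise comparison gives a boolean array
selected = demo[mask]      # keeps only the elements where mask is True
negatives = demo[demo < 0] # the condition can also be written inline

print(mask)
print(selected)
print(negatives)
```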

## Plotting

As with `numpy`, Python borrows its primary plotting interface from Matlab.

The main library for plotting in Python is `matplotlib`. Matplotlib has multiple interfaces; the most commonly used is the `pyplot` interface. We normally give it the alias `plt`.

In [None]:
import matplotlib.pyplot as plt

We can generate a plot with `plt.plot()`. Try plotting the array `a4` and see what happens. Can you explain the plot?

In [None]:
a5 = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])

Plot the two rows of `a5` as two lines ([1, 2, 3, 4] and [5, 6, 7, 8]).

Try adding a3 and a4 and plotting all of the data as a single line. Remember we can *reshape* a 2D array into a vector
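As a hedged sketch of how `plot` treats 2D input, the example below uses a hypothetical 2x3 array `demo` (not the arrays from the exercises) and draws it three ways side by side. The figure size and titles are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

demo = np.array([[1, 3, 2],
                 [4, 6, 5]])  # hypothetical 2x3 array

fig, axes = plt.subplots(1, 3, figsize=(9, 3))

# A 2D array is plotted column by column: here 3 lines of 2 points each
lines_by_column = axes[0].plot(demo)
axes[0].set_title("plot(demo)")

# Transposing makes each original row its own line: 2 lines of 3 points
lines_by_row = axes[1].plot(demo.T)
axes[1].set_title("plot(demo.T)")

# Reshaping to a vector first gives a single line through all 6 values
single_line = axes[2].plot(demo.reshape(-1))
axes[2].set_title("plot(demo.reshape(-1))")

fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a plain script you would typically add `plt.show()`.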

## Exercise

You will need to use the matplotlib documentation to achieve this! Feel free to work together and ask for help!

You are going to try and recreate the following plot:

![](https://github.com/kimdanny/RSPW-23/blob/main/data/day2-example-plot.jpg?raw=true)

* Use the numpy random module to generate a 2x500 array of normally distributed numbers with mean 10 and standard deviation 20. Use the random seed 404 to generate the random points
* Plot the data as a 2D scatter plot (so that each column represents one data point x, y). Use orange crosses for the data points. Add the label 'data' to the points.
* Add the mean to the plot as a large blue dot. Add the label 'mean' to the plot.
* Label the axes 'x' and 'y'.
* Add a legend
* Save the plot as normal_scatter.jpeg

### Want to learn more?
Helpful websites for your further study of NumPy:
- [A Visual Intro to NumPy and Data Representation](https://jalammar.github.io/visual-numpy/)
- [Stanford CS231n Python Numpy Tutorial](http://cs231n.github.io/python-numpy-tutorial/)
- [DataCamp Python Numpy Array Tutorial](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
- [Machine Learning Plus 101 Numpy Exercises for Data Analysis (Python)](https://www.machinelearningplus.com/python/101-numpy-exercises-python/)

Also check out https://github.com/UCLAIS/Machine-Learning-Tutorials/blob/master/notebooks/Session01-Matplotlib-Solution.ipynb
for advanced Matplotlib!