# Python "from zero to hero" - Basics of data analysis

July 5th, 2022

The content of the following notebook is almost entirely based on the course [https://it.softpython.org/](https://it.softpython.org/), whose materials are made available under the CC-BY 4.0 license. The above-mentioned website was created with funds provided by the departments of Information Science and Engineering, Mathematics, and Sociology of the University of Trento, and it was written by David Leoni, Marco Caresia, Alessio Zamboni, Luca Bosotti, and Massimiliano Luca.

## Part II.1 - Packages

We have used only built-in functions so far. However, a useful feature of Python is how easy it is to use third-party libraries, which let us rely on ready-to-use code for the most diverse purposes.

Third-party libraries are not included in the basic Python installation: you need to download and install them. Don't worry, it is easier than you think, at least at the beginning...

Python packages are made available on repositories on the Internet, from which they can be downloaded and used.
The most common repositories are [PyPI](https://pypi.org/), which is the official Python repository, and [Conda](https://docs.conda.io/en/latest/) if you are working within the Anaconda environment.

Most of the Python packages you will ever need are available on PyPI and can be installed using `pip` (or `pip3` on Linux systems). Usually `pip` is already available with the basic Python installation; otherwise, take a look at the following [page](https://pip.pypa.io/en/stable/cli/pip_install/).

### Basic `pip` usage

The command to install a package is usually:

    pip install <package_name>

Sometimes, more complex options must be specified. It is good practice to check the installation instructions in the official documentation of the package.

For instance, to install Numpy, a common data handling library, it is enough to use the following command:

    pip install numpy

To find more specific installation options, read the official [page](https://numpy.org/install/).

<div class="alert alert-warning">

**ATTENTION**

A package is installed only once by using the command line. Then, it can be imported and used in any Python script without being installed each time.
</div>

<div class="alert alert-warning">

**ATTENTION**

Some packages are already available with the basic Python installation, e.g., `os`, `sys`, `itertools`, and `math`.
</div>
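
As a minimal sketch, one way to check from Python whether a package is already installed (without importing it) uses `importlib.util.find_spec` from the standard library. The helper name `is_installed` is made up for this example.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None

# `math` ships with Python, so it is always available.
print(is_installed("math"))
# Third-party packages such as numpy are found only after `pip install`.
print(is_installed("numpy"))
```

If the second line prints `False`, the package still has to be installed from the command line.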

### Import a package

Packages are imported in the Python namespace only when they are required. This can be done using the following command:

In [1]:
# to import numpy
import numpy
# ... easy!

numpy.__version__

'1.21.1'

Alternatively, one can give a package a custom name and use that name to refer to the package:

In [2]:
import numpy as np
np.__version__

'1.21.1'

When you want to use `numpy` in your code, you need to import it in your working environment (the Python interpreter) before you can use the code of the library.

    # The keyword `as` lets us rename the numpy package
    import numpy as np

## Part II.2 - The `numpy` library

There are two ways to represent matrices in Python:
as lists of lists, or with the external library `numpy`.
`numpy` is a very popular choice; let's see why, and what the main differences from lists of lists are.

Lists of lists
- are native in Python
- are not efficient
- are everywhere in Python, so you will very likely meet them sooner or later
- provide an idea of how to represent nested data in Python
- can be useful for understanding pointers and copies

Numpy
- is not natively available in Python
- is efficient
- is used in several scientific libraries
- the syntax to access the elements of a matrix is slightly different from lists of lists
- very rarely, it could raise compatibility issues with other packages, or with old Python versions

In this part of the course we will see the basics of the `numpy` library.
The way we will use `numpy` here is far from the good practices that exploit its speed of calculation: although nested `for` loops are inefficient, we will use them anyway.
Fully exploiting the speed and efficiency of `numpy` requires operations on whole vectors and matrices, which are outside the scope of this seminar.
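
To give an idea of the difference, the sketch below computes the same total twice: once with explicit nested loops and once with a single call that operates on the whole array. The small matrix is an arbitrary example.

```python
import numpy as np

mat = np.array([[1, 2, 3],
                [4, 5, 6]])

# Explicit nested loops: easy to read, but slow on large arrays.
total_loop = 0
rows, columns = mat.shape
for i in range(rows):
    for j in range(columns):
        total_loop += mat[i, j]

# Whole-array version: the loop runs inside numpy's compiled code.
total_vec = np.sum(mat)

print(total_loop, total_vec)  # both are 21
```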

### Create a matrix

To create a matrix containing only zeros, the function `zeros` can be used.

In [4]:
mat = np.zeros((3,4))
mat

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

The only argument provided here is a tuple containing the dimensions of the matrix. It is possible to create $N$-dimensional arrays by adding more entries to the tuple.
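
As a short sketch of the $N$-dimensional case, a tuple with three entries produces a three-dimensional array. The name `cube` is chosen only for this example.

```python
import numpy as np

cube = np.zeros((2, 3, 4))   # 2 blocks, each with 3 rows and 4 columns
print(cube.ndim)    # number of dimensions: 3
print(cube.shape)   # (2, 3, 4)
print(cube.size)    # total number of elements: 24
```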

The type of object produced by `zeros` is an `ndarray`, which is the basic data type in `numpy`.

In [5]:
type(mat)

numpy.ndarray

The content of an `ndarray` looks like that of a list of lists.
However, the internal representation of an `ndarray` is a linear sequence of elements, which allows Python to access the numbers faster than it does with lists of lists.

An `ndarray` can be created starting from a list of lists:

In [6]:
arr = [
    [1, 2, 3, 4],
    [5, 6, 7, 8]
]

mat = np.array(arr)
mat

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

#### Create a matrix full of 1

In [7]:
np.ones((2, 4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

#### Create a matrix full of a number $k$

In [8]:
        #Dimension # number
np.full((3, 4),    9)

array([[9, 9, 9, 9],
       [9, 9, 9, 9],
       [9, 9, 9, 9]])

#### Shape of a matrix

The shape of a matrix is stored in the `shape` attribute of the `ndarray` object.
The attribute `shape` is a tuple.

In [9]:
mat = np.ones((1,3))
print(mat)
rows, columns = mat.shape

print('rows', rows, 'columns', columns)

[[1. 1. 1.]]
rows 1 columns 3


### Reading and writing

To access an element of an array, `numpy` allows us to use the `[` `]` notation, within which the indexes of the element are separated by commas.

In [10]:
arr = [
    [1, 2, 3, 4],
    [5, 6, 7, 8]
]
mat = np.array(arr)

mat[1,3]

8

In [11]:
mat[0,2] = 9
mat

array([[1, 2, 9, 4],
       [5, 6, 7, 8]])

In [12]:
mat[0,0] = "c"

ValueError: invalid literal for int() with base 10: 'c'

In [13]:
mat[1,1.0]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
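
The first error above arises because all the cells of an array share a single element type, stored in the attribute `dtype`: a string cannot be converted to an integer. The sketch below, on an arbitrary integer matrix, shows a related pitfall: a float written into an integer array is converted silently.

```python
import numpy as np

mat = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]])
print(mat.dtype)        # an integer type (the exact name depends on the platform)

# Assigning a float to an integer array silently drops the decimal part.
mat[0, 0] = 2.7
print(mat[0, 0])        # 2

# To store decimals, convert to a float array first.
fmat = mat.astype(float)
fmat[0, 0] = 2.7
print(fmat[0, 0])       # 2.7
```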

### Slicing

Slicing is one of the most useful features of `numpy`: it gives access to 'slices' of arrays.
To access a slice of an array, we can use ranges, separated by commas, within the same brackets.

In [14]:
mat = np.array([[5, 8, 1],
                [4, 3, 2],
                [6, 7, 9],
                [9, 3, 4],
                [8, 2, 7]])

mat[0:4, 1:3]

array([[8, 1],
       [3, 2],
       [7, 9],
       [3, 4]])

In [15]:
mat[0:1,0:3]

array([[5, 8, 1]])

In [16]:
mat[0:1,:]

array([[5, 8, 1]])

In [17]:
mat[0:5, 0:1]

array([[5],
       [4],
       [6],
       [9],
       [8]])

In [18]:
mat[:, 0:1]

array([[5],
       [4],
       [6],
       [9],
       [8]])

It is also possible to specify a step as the third parameter of a range after an additional `:`.

In [19]:
mat[0:5:2, :]

array([[5, 8, 1],
       [6, 7, 9],
       [8, 2, 7]])

<div class="alert alert-warning">

**ATTENTION: when we modify a slice of an array, the changes persist in the original matrix**
</div>

Unlike lists of lists, slicing a `numpy` array produces a *view* on the array.
This means that changing the view also changes the original matrix.
Assigning a slice to a new variable does not create a new, independent `array`.

In [20]:
mat = np.array([[5, 8, 1],
                [4, 3, 2],
                [6, 7, 9],
                [9, 3, 4],
                [8, 2, 7]])

sotto_mat = mat[0:4, 1:3]
sotto_mat

array([[8, 1],
       [3, 2],
       [7, 9],
       [3, 4]])

In [21]:
sotto_mat[0,0] = 999
mat

array([[  5, 999,   1],
       [  4,   3,   2],
       [  6,   7,   9],
       [  9,   3,   4],
       [  8,   2,   7]])
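
The contrast with plain Python lists can be sketched as follows. A flat list is used to keep the example short, and `np.shares_memory` is used to check whether two arrays refer to the same data.

```python
import numpy as np

# Slicing a Python list creates a new list: the original is untouched.
lst = [5, 8, 1]
sub_lst = lst[0:2]
sub_lst[0] = 999
print(lst)                              # [5, 8, 1]

# Slicing an array creates a view that shares memory with the original.
arr = np.array([5, 8, 1])
sub_arr = arr[0:2]
print(np.shares_memory(arr, sub_arr))   # True
sub_arr[0] = 999
print(arr)                              # [999   8   1]

# An explicit copy breaks the link.
safe = arr[0:2].copy()
print(np.shares_memory(arr, safe))      # False
```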

#### Writing a constant in a slice

We can write a constant in all the cells of a region using slicing and setting them equal to a constant.

In [22]:
mat = np.array( [ [5, 8, 1],
                  [4, 3, 2],
                  [6, 7, 9],
                  [9, 3, 4],
                  [8, 2, 5]])

mat[0:4, 1:3]  = 7

mat

array([[5, 7, 7],
       [4, 7, 7],
       [6, 7, 7],
       [9, 7, 7],
       [8, 2, 5]])

#### Writing a matrix in a slice

We can overwrite a region of an array by selecting it with slicing and assigning to it a matrix of the same shape, whose cells are copied into the region.

In [23]:
mat = np.array( [ [5, 8, 1],
                  [4, 3, 2],
                  [6, 7, 9],
                  [9, 3, 4],
                  [8, 2, 5]])

mat[0:4, 1:3]  = np.array([
                            [10,50],
                            [11,51],
                            [12,52],
                            [13,53],
                        ])

mat

array([[ 5, 10, 50],
       [ 4, 11, 51],
       [ 6, 12, 52],
       [ 9, 13, 53],
       [ 8,  2,  5]])

### Copy of arrays

To create a copy of an array, it is possible to use the method `copy` on the array that we want to copy and to assign the result to a variable.

In [24]:
va = np.array([1,2,3])
vc = va.copy()
vc

array([1, 2, 3])

In [25]:
vc[0] = 100
vc

array([100,   2,   3])

In [26]:
va

array([1, 2, 3])

### Calculations

A strength of `numpy` is the possibility of using operations on arrays that resemble those used in algebra.

Let's see some of these.

In [27]:
va = np.array([5,9,7])
vb = np.array([6,8,0])
vc = va + vb
vc

array([11, 17,  7])

Notice that the sum produced a new array that was assigned to the variable `vc`.

#### Scalar multiplication

In [28]:
m = np.array([[5, 9, 7],
              [6, 8, 0]])

3 * m

array([[15, 27, 21],
       [18, 24,  0]])

#### Scalar summation

In [29]:
3 + m

array([[ 8, 12, 10],
       [ 9, 11,  3]])

#### Multiplication (element-wise)

The multiplication of two arrays using the operator `*` is performed element-wise.
Therefore, the two arrays are expected to have the same shape.

In [30]:
ma = np.array([[1,  2,  3],
               [10, 20, 30]])

mb = np.array([[1,  0,  1],
               [4,  5,  6]])

ma * mb

array([[  1,   0,   3],
       [ 40, 100, 180]])

If we want to perform a matrix multiplication (rows by columns), i.e., the one learned in an algebra course, we must use the operator `@` and make sure that the dimensions of the matrices are compatible.

In [31]:
mc = np.array([[1,  2,  3],
               [10, 20, 30]])
md = np.array([[1, 4],
               [0, 5],
               [1, 6]])

mc @ md

array([[  4,  32],
       [ 40, 320]])
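
What "compatible" means can be sketched with two matrices of ones: the number of columns of the left matrix must equal the number of rows of the right one, otherwise `numpy` raises a `ValueError`. The flag `compatible` is introduced only for this example.

```python
import numpy as np

mc = np.ones((2, 3))
md = np.ones((3, 2))

# (2, 3) @ (3, 2): the inner dimensions match, the result is (2, 2).
print((mc @ md).shape)

# (2, 3) @ (2, 3): the inner dimensions 3 and 2 differ, so numpy raises ValueError.
try:
    mc @ mc
    compatible = True
except ValueError:
    compatible = False
print(compatible)   # False
```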

#### Scalar division

In [32]:
ma = np.array([[1,  2,  0.0],
               [10, 0.0, 30]])

ma / 4

array([[0.25, 0.5 , 0.  ],
       [2.5 , 0.  , 7.5 ]])

**ATTENTION**: when an array is divided by `0.0`, the program **does not** raise an error: it emits a warning and the execution continues!
The affected cells of the result are filled with `inf` and `nan` values.
`inf` (infinity) is the result of dividing a non-zero number by zero, whereas `nan` (not a number) is the result of dividing zero by zero.

In [33]:
ma / 0.0


array([[inf, inf, nan],
       [inf, nan, inf]])
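
When the division by zero is expected, one possible approach (a sketch, not the only option) is to silence the warnings locally with `np.errstate` and then locate the special values with `np.isinf` and `np.isnan`.

```python
import numpy as np

ma = np.array([[1, 2, 0.0],
               [10, 0.0, 30]])

# np.errstate temporarily silences the division warnings inside the block.
with np.errstate(divide="ignore", invalid="ignore"):
    res = ma / 0.0

print(np.isinf(res))   # True where a non-zero number was divided by zero
print(np.isnan(res))   # True where zero was divided by zero
```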

#### Aggregation functions

Numpy provides several functions to perform operations on arrays. A few examples are presented in the following.

In [34]:
m = np.array([[5, 4, 6],
              [3, 7, 1]])
np.sum(m)

26

In [35]:
np.max(m)

7

In [36]:
np.min(m)

1

Aggregation over rows or columns is also possible by adding the parameter `axis`. With `axis=0` the aggregation runs down the rows and produces one value per column, whereas with `axis=1` it runs across the columns and produces one value per row.

In [37]:
np.max(m, axis=0)

array([5, 7, 6])

In [38]:
np.sum(m, axis=0)

array([ 8, 11,  7])

In [39]:
np.max(m, axis=1)

array([6, 7])

In [40]:
np.sum(m, axis=1)

array([15, 11])
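
A way to remember the meaning of `axis` is that the chosen axis is the one that disappears from the shape of the result, as this short sketch on the same matrix suggests.

```python
import numpy as np

m = np.array([[5, 4, 6],
              [3, 7, 1]])
print(m.shape)                    # (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows: one value per column remains.
print(np.sum(m, axis=0).shape)    # (3,)

# axis=1 collapses the columns: one value per row remains.
print(np.sum(m, axis=1).shape)    # (2,)
```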

### Filter an array

One of Numpy's most useful features is the filtering of elements according to a criterion.

In [41]:
mat = np.array([[5, 2, 6],
                [1, 4, 3]])
mat

array([[5, 2, 6],
       [1, 4, 3]])

Assume that we want to obtain an array with all the numbers greater than `2` from `mat`.

We can filter by writing the condition between `[` `]` right after the array name; the condition itself refers again to the array by name.

In [42]:
mat[ mat > 2 ]

array([5, 6, 4, 3])

The expression that we wrote within brackets produces a boolean array, which tells Numpy which elements to select.

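It is possible to use multiple criteria for the selection of elements by combining them with the operator `&`, which works as an *and*, and the operator `|`, which works as an *or*. Each condition must be enclosed in parentheses, because `&` and `|` take precedence over comparisons such as `>` and `<`; the last examples below show what happens otherwise.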

In [43]:
mat = np.array([[5, 2, 6],
                [1, 4, 3]])
mat[(mat > 3) & (mat < 6)]

array([5, 4])

In [44]:
mat = np.array([[5, 2, 6],
                [1, 4, 3]])
mat[(mat < 2) | (mat > 4)]

array([5, 6, 1])

In [45]:
mat = np.array([[5, 2, 6],
                [1, 4, 3]])
mat[mat > 3 & (mat < 6)]

array([5, 2, 6, 4, 3])

In [46]:

mat = np.array([[5, 2, 6],
                [1, 4, 3]])
mat[mat > 3 & mat < 6]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [47]:
mat > 3 & (mat < 6)

array([[ True,  True,  True],
       [False,  True,  True]])

In [48]:
3 & (mat < 6)

array([[1, 1, 0],
       [1, 1, 1]], dtype=int32)

**ATTENTION** With Numpy arrays, the Python operators `and` and `or` do not work: use `&` and `|` instead!
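
The surprising results of the last cells can be traced back to operator precedence. The sketch below makes the implicit grouping explicit; the variable names are chosen only for this example.

```python
import numpy as np

mat = np.array([[5, 2, 6],
                [1, 4, 3]])

# `&` binds more tightly than `>`, so the two lines below are equivalent.
without_parens = mat > 3 & (mat < 6)
explicit = mat > (3 & (mat < 6))
print(np.array_equal(without_parens, explicit))   # True

# The intended filter needs parentheses around each comparison.
intended = (mat > 3) & (mat < 6)
print(mat[intended])                              # [5 4]

# Python's `and` asks for a single True/False value, which an array cannot give.
try:
    (mat > 3) and (mat < 6)
except ValueError as err:
    print("ValueError:", err)
```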

It might be useful to retrieve the indexes of the elements that satisfy our conditions.
This is possible through the function `np.where`, to which we pass the expressions that encode our conditions.

In [50]:
             #0  1  2  3  4  5
v = np.array([30,60,20,70,40,80])

idx = np.where((v < 40) | (v > 60))

In [51]:
v[idx]

array([30, 20, 70, 80])
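
It may help to look at what `np.where` actually returns when it receives only a condition: a tuple with one array of indexes per dimension. A short sketch on the same vector:

```python
import numpy as np

v = np.array([30, 60, 20, 70, 40, 80])
idx = np.where((v < 40) | (v > 60))

print(type(idx))   # a tuple, with one index array per dimension
print(idx[0])      # [0 2 3 5]
```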

`np.where` can do things that are far more sophisticated.
For instance, we may want to build a matrix that takes its elements from one matrix where a condition is satisfied, and from another matrix where it is not.

In [52]:
ma = np.array([
    [ 1, 2, 3, 4],
    [ 5, 6, 7, 8],
    [ 9,10,11,12]
])

mb = np.array([
    [ -1, -2, -3, -4],
    [ -5, -6, -7, -8],
    [ -9,-10,-11,-12]
])


mat = np.array([
    [40,70,10,80],
    [20,30,60,40],
    [10,60,80,90]
])

np.where(mat < 50, ma, mb)

array([[  1,  -2,   3,  -4],
       [  5,   6,  -7,   8],
       [  9, -10, -11, -12]])
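
The second and third arguments do not have to be matrices: a single number works as well. As a sketch, the call below keeps the values below 50 and replaces all the others with 0; the name `capped` is chosen only for this example.

```python
import numpy as np

mat = np.array([[40, 70, 10, 80],
                [20, 30, 60, 40]])

# Where the condition holds take the value from `mat`, elsewhere use 0.
capped = np.where(mat < 50, mat, 0)
print(capped)
```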

### `arange` and `linspace` sequences

Two useful Numpy utility functions are `arange` and `linspace`.
`arange` can be considered an extension of the Python function `range`: in particular, `arange` allows non-integer increments, which are not possible using `range`.

In [53]:
np.arange(0.0, 1.0, 0.2)

array([0. , 0.2, 0.4, 0.6, 0.8])

As an alternative to `arange`, it is possible to use `linspace`, which returns `n` evenly spaced values over a range, including the left and right boundaries.

In [52]:
np.linspace(0, 0.8, 5)

array([0. , 0.2, 0.4, 0.6, 0.8])

In [53]:
np.linspace(0, 0.8, 10)

array([0.        , 0.08888889, 0.17777778, 0.26666667, 0.35555556,
       0.44444444, 0.53333333, 0.62222222, 0.71111111, 0.8       ])
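
The relation between the number of points and the spacing can be sketched as follows: `n` points delimit `n - 1` intervals, and, unlike `arange`, `linspace` includes the right boundary by default.

```python
import numpy as np

points = np.linspace(0, 0.8, 5)
# 5 points delimit 4 intervals, each of width (0.8 - 0) / 4 = 0.2.
print(np.diff(points))

# arange excludes the stop value, linspace includes it by default.
print(np.arange(0, 4, 1))     # [0 1 2 3]
print(np.linspace(0, 4, 5))   # [0. 1. 2. 3. 4.]
```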

### `NaN` and `Inf`

Floating-point values can be ordinary numbers, *non-numbers*, and infinities.
During calculations, some extreme conditions may occur, for instance, dividing zero by zero or producing a result too large to be represented.
In the first case we end up with a `NaN`, i.e., a special float value.
In the second case we obtain an `Inf`, which represents infinity.
Finding such values in our calculations may lead to unexpected results, so it is worth knowing what to do in such cases.
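
Both special values can be produced with plain Python floats, as the following sketch shows. The variable names are chosen only for this example.

```python
import math

too_big = 1e308 * 10              # exceeds the largest representable float
print(too_big)                    # inf

undefined = math.inf - math.inf   # has no meaningful numeric answer
print(undefined)                  # nan
```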

#### `NaN`

`NaN` means *not a number*.
In fact, `NaN`s are a special type of float, and they present the following property:

<div class="alert alert-warning">

`NaN` is not equal to itself.
</div>

In [54]:
np.nan == np.nan

False

The same is true if we use the `nan` object from the Python module `math`. `NaN` values are so common that Numpy provides its own `np.nan`; however, it behaves exactly like the one in the `math` module.

In [54]:
import math
math.nan == math.nan

False

#### Find `NaN` values

Finding `NaN` values is of fundamental importance, both when we do calculations and when we analyze data (e.g., when the fields of a table are empty).

In [55]:
a = np.nan
np.isnan(a)

True
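
`np.isnan` also works element-wise on whole arrays, which makes it possible to count or filter out the missing values. The sketch below uses a small made-up vector; `np.nansum` is the variant of `np.sum` that skips `NaN` entries.

```python
import numpy as np

data = np.array([1.5, np.nan, 3.0, np.nan])

mask = np.isnan(data)        # boolean array, True where the value is NaN
print(mask.sum())            # 2 missing values
print(data[~mask])           # [1.5 3. ]

# A plain sum is "contaminated" by NaN; nansum skips the NaN entries.
print(np.sum(data))          # nan
print(np.nansum(data))       # 4.5
```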

#### Operations with `NaN`

Operations with `NaN` are possible, but the results will be something *strange*. Let's see some examples.

In [56]:
5 * math.nan

nan

In [57]:
math.nan + math.nan

nan

In [58]:
math.nan / math.nan

nan

Let's see where we can find `NaN`s in practice.
Consider the function `log`.

In [61]:
np.log(-1)


nan

In this case, Numpy notified us of the invalid operation with a warning, but the program did not stop.
When an operation is invalid for some elements, Numpy carries out the calculation anyway and stores the results; the results of the *bad* operations are stored as `NaN`.

In [62]:
np.log(np.array([3,7,-1,9]))


array([1.09861229, 1.94591015,        nan, 2.19722458])

#### The `Inf` type

Operations that produce a number too large to be represented are not that rare.
For such occurrences, floating-point numbers include a special value, which Numpy exposes as `np.inf`.
Some important properties of `Inf` are:
- `Inf` is not equal to `NaN`
- positive infinity is not equal to negative infinity
- `np.inf` denotes positive infinity
- infinity is equal to itself

In [59]:
np.inf == np.inf

True

In [60]:
math.inf == math.inf

True

Also in this case, instances of `Inf` can be identified using the function `isinf`.
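
The properties listed above, together with `isinf`, can be checked with a short sketch. `np.isfinite` is a related function that is `True` only for ordinary numbers.

```python
import numpy as np

print(np.inf == np.nan)      # False
print(np.inf == -np.inf)     # False
print(np.inf == +np.inf)     # True

values = np.array([1.0, np.inf, -np.inf, np.nan])
print(np.isinf(values))      # [False  True  True False]
print(np.isfinite(values))   # [ True False False False]
```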

### Exercises

1) Given an `n` x `m` matrix, return a new matrix with 1 column and `n` rows that contains the average of the values in the rows of the input matrix.

    3 2 1 4
    6 2 3 5
    4 3 6 2
    4 6 5 4
    7 2 9 3

2) Given an `n` x `m` matrix, write a function that returns a new `n` x `m` matrix in which the values in the rows with an even index (0, 2, ...) are multiplied by two, while all the other values are the same as in the input matrix.

In [61]:
m  = np.array( [
                    [ 2, 5, 6, 3],
                    [ 8, 4, 3, 5],
                    [ 7, 1, 6, 9],
                    [ 5, 2, 4, 1],
                    [ 6, 3, 4, 3]
               ])

In [79]:
import numpy as np

# Loop-based solution, kept commented out for reference:
# def radalt(mat):
#     nrows, ncol = mat.shape
#     ret = np.zeros( (nrows, ncol) )

#     for i in range(nrows):
#         for j in range(ncol):
#             if i % 2 == 0:
#                 ret[i,j] = mat[i,j] * 2
#             else:
#                 ret[i,j] = mat[i,j]
#     return ret

# Slicing-based solution: double every row with an even index.
def radalt(m):
    r = np.copy(m)
    r[::2,:] = r[::2,:] * 2
    return r

In [80]:
radalt(m)

array([[ 4, 10, 12,  6],
       [ 8,  4,  3,  5],
       [14,  2, 12, 18],
       [ 5,  2,  4,  1],
       [12,  6,  8,  6]])

3) **The chessboard**: return a matrix with `n` rows and `n` columns, where the cells alternate ones and zeros.

In [62]:
def chessboard(n):
    mat = np.zeros( (n,n)  )

    for i in range(0,n, 2):
        for j in range(0,n, 2):
            mat[i, j] = 1

    for i in range(1,n, 2):
        for j in range(1,n, 2):
            mat[i, j] = 1

    return mat

def chessboard_pro(n):
    ret = np.zeros((n, n))
    ret[::2, ::2] = 1
    ret[1::2, 1::2] = 1
    return ret
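
As a further sketch, the same pattern can be obtained from the row and column indexes of the cells: a cell holds a one when the sum of its indexes is even. The function name `chessboard_indices` is hypothetical and not part of the original exercise; `np.indices` is used to build the index grids.

```python
import numpy as np

def chessboard_indices(n):
    # Hypothetical alternative to the two solutions above.
    rows, cols = np.indices((n, n))     # row and column index of every cell
    return ((rows + cols) % 2 == 0).astype(float)

print(chessboard_indices(3))
```

For `n = 3` the ones land on the corners and on the central cell, as in the loop-based version.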

4) Given a matrix with dimensions `2n` x `2n`, divide the matrix into four parts (four quadrants) and return a `2` x `2` matrix containing the average of each quadrant.

    1 2 5 7
    4 1 8 0
    2 0 5 1
    0 2 1 1

In [65]:
import numpy as np

# def quadrants(matrix):
#     ret = np.zeros( (2,2) )

#     dim = matrix.shape[0]
#     n = dim // 2
#     elements_per_q = n * n

#     for i in range(n):
#         for j in range(n):
#             ret[0,0] += matrix[i,j]
#     ret[0,0] /= elements_per_q

#     for i in range(n,dim):
#         for j in range(n):
#             ret[1,0] += matrix[i,j]
#     ret[1,0] /= elements_per_q

#     for i in range(n,dim):
#         for j in range(n,dim):
#             ret[1,1] += matrix[i,j]
#     ret[1,1] /= elements_per_q

#     for i in range(n):
#         for j in range(n,dim):
#             ret[0,1] += matrix[i,j]
#     ret[0,1] /= elements_per_q

#     return ret

def quadrants(matrix):
    n = matrix.shape[0]
    av = np.zeros((2,2))
    i = n // 2
    av[0,0] = np.average(matrix[:i,:i])
    av[0,1] = np.average(matrix[:i,i:])
    av[1,0] = np.average(matrix[i:,:i])
    av[1,1] = np.average(matrix[i:,i:])
    return av

In [66]:
m2 = np.array( [ [1.0, 2.0 , 5.0 , 7.0],
                 [4.0, 1.0 , 8.0 , 0.0],
                 [2.0, 0.0 , 5.0 , 1.0],
                 [0.0, 2.0 , 1.0 , 1.0] ])

av = quadrants(m2)
print(av)

[[2. 5.]
 [1. 2.]]
