<a href="https://colab.research.google.com/github/maurapintor/ai4dev/blob/main/AI4Dev_01_intro_numpy_lab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to libraries and basic operations

## Basic Operations with arrays

In Python, we can use the `numpy` library to represent data in a structured way. NumPy arrays are usually a better choice than generic Python lists because:
* arrays are easier to index, especially in more than one dimension
* the APIs are targeted to numerical applications (including ML!)
* computational efficiency: the interface is in Python, but the operations run on an efficient compiled backend written mostly in C

And these are just a few aspects. There are way more that we will not cover here.
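
To make the comparison with lists concrete, here is a minimal sketch showing how the `+` and `*` operators behave on a list and on an array. The variable names are illustrative and not part of the lab.

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# On lists, + concatenates and * repeats the list
list_sum = values_list + values_list      # [1, 2, 3, 1, 2, 3]
list_times_two = values_list * 2          # [1, 2, 3, 1, 2, 3]

# On arrays, the same operators act element by element
array_sum = values_array + values_array   # array([2, 4, 6])
array_times_two = values_array * 2        # array([2, 4, 6])

print(list_sum, list_times_two)
print(array_sum, array_times_two)
```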

We first introduce the array creation operations to either wrap existing data structures into numpy arrays, or to generate arrays with known properties (e.g., a matrix of zeros).

In [None]:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)  # prints the dimensions of the array
print(a.dtype)  # prints the data type of the elements

a = a.astype(np.float64)  # casting operation: returns a converted copy
print(a.dtype)

n_rows, n_cols = 2, 4
# creates a matrix of zero-valued elements with the given shape
a = np.zeros(shape=(n_rows, n_cols))

print(a)

## More methods to create arrays

Here we see other APIs for creating arrays.
These are useful to sample from distributions with known properties, or to create helper structures for transforming arrays (e.g., masking operations, indexing).

In [None]:
a = np.ones(shape=(n_rows, n_cols))  # creates matrix of ones
print(a)

a = np.eye(n_rows, n_cols)  # ones on the main diagonal, zeros elsewhere (identity matrix when square)
print(a)

# random numbers from Normal distribution
# with zero mean and unit variance
a = np.random.randn(n_rows, n_cols)
print(a)

# random numbers from Uniform distribution in [0, 1)
a = np.random.rand(n_rows, n_cols)
print(a)

a = np.random.randint(0, 5, [n_rows, n_cols])  # random integers in [0, 5), i.e., 5 is excluded
print(a)
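
As a possible alternative to the functions above, NumPy also offers a `Generator` object created with `np.random.default_rng`. The sketch below assumes that we want reproducible draws; the seed value is arbitrary and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed value chosen arbitrarily

n_rows, n_cols = 2, 4
normal = rng.standard_normal(size=(n_rows, n_cols))            # zero mean, unit variance
uniform = rng.random(size=(n_rows, n_cols))                    # uniform in [0, 1)
integers = rng.integers(low=0, high=5, size=(n_rows, n_cols))  # 0 to 4, 5 is excluded

print(normal.shape, uniform.shape, integers.shape)
```

Creating a new generator with the same seed reproduces the same sequence of numbers, which helps when debugging experiments.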


## Array Indexing

Sometimes we are interested in extracting certain elements from the arrays. We use indexing operations for this.

With Python lists, we can index individual elements or ranges of elements. NumPy arrays support the same operations and extend them.

Specifically, to index elements in multi-dimensional arrays, we can now use a more compact and intuitive notation.

To extract one element, it is sufficient to list the indices consecutively, one per dimension. For example, in a 2D array, the compact notation below returns the element in row 0, column 1:

```python
a[0, 1]
```

More generally, we can select subsets of the elements by using the notation

```
<start>:<stop>:<step>
```

where each part, if omitted, defaults to:
* start $\rightarrow$ the beginning of the array (first index)
* stop $\rightarrow$ the end of the array (the last index is included)
* step $\rightarrow$ one, i.e., take all elements without skipping any

Then, we can extract submatrices from the arrays by specifying a slice for each dimension. An explicit `stop` index is excluded from the selection. For example, if we start from a 2D array, we can extract a submatrix by using:

```python
a[0:2, 0:2]  # extracts the submatrix of rows 0 and 1, columns 0 and 1
```

where each slice is applied to one dimension of the array (thus this instruction returns the 2x2 block in the top-left corner).

We can also select entire rows or columns (or, more generally, entire dimensions) with the colon operator (`:`), which omits all three parameters.

For example, in the previous array, we can select row 0 by using the colon operator in the second dimension:

```python
a[0, :]
```

and column 0 by using the colon operator in the first dimension:

```python
a[:, 0]
```


In [None]:
a = np.eye(3)
print(a)

element = ...  # picks the first element (returns float)
print(element)

submatrix = ...  # selects submatrix with slicing operators
print(submatrix)

row = ...  # picks the first row (returns flat array)
print(row)

column = ...  # picks the second column (returns flat array)
print(column)
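
As a separate illustration of the same notation, the sketch below indexes a 3x4 array filled with consecutive integers, so that each result is easy to check by eye. The array `m` and the variable names are made up for this example.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

single = m[1, 2]             # row 1, column 2 -> 6
block = m[0:2, 1:3]          # rows 0-1, columns 1-2 -> [[1, 2], [5, 6]]
last_row = m[2, :]           # whole row 2 -> [8, 9, 10, 11]
every_other_col = m[:, ::2]  # columns 0 and 2 -> [[0, 2], [4, 6], [8, 10]]

print(single)
print(block)
print(last_row)
print(every_other_col)
```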

We can also index arrays with other arrays of boolean type.
Boolean arrays can be the result of a boolean comparison in numpy.

In [None]:
a = np.eye(3)
b = ...
indexed_a = ...
print(indexed_a)

# boolean_operator
indexed_a = ...
print(indexed_a)
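
A minimal sketch of boolean indexing on a different array may help here. The array `values` is illustrative; the comparison builds a boolean mask, and indexing with the mask returns the selected elements as a flat array.

```python
import numpy as np

values = np.array([[1, -2, 3], [-4, 5, -6]])

mask = values > 0         # boolean array with the same shape as values
positives = values[mask]  # 1-D array with the selected elements: [1, 3, 5]

# masks can be combined element-wise with & (and), | (or), ~ (not)
small_positives = values[(values > 0) & (values < 4)]  # [1, 3]

print(mask)
print(positives)
print(small_positives)
```

Note that each comparison needs its own parentheses when combined with `&` or `|`, because these operators bind more tightly than `>` and `<`.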


The complete reference on array indexing can be found in the docs:
- https://numpy.org/doc/stable/reference/arrays.indexing.html

## Other operations on arrays

There are other operations that can be used to transform arrays.

For example, we can transpose arrays, stack them vertically or horizontally, and perform standard operations (e.g., sums, multiplications, etc.).

In [None]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([[1, 2, 3], [4, 5, 6]])

transpose = ...  # swaps rows and columns
print(transpose)

vertical_stack = ...  # stack rows (vertical stacking)
print(vertical_stack)

horizontal_stack = ... # stack columns (horizontal stacking)
print(horizontal_stack)

# element-wise operations
array_sum = ...
print(array_sum)

array_product = ...
print(array_product)

scalar_product = ...  # scalar (dot) product of a and b
print(scalar_product)

# inner dimensions must match for matrix operations
# 1x3 and 3x2 --> result is 1x2
scalar_product_with_matrix = ...
print(scalar_product_with_matrix)
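
The sketch below shows one possible way to express these operations, using different arrays from the ones in the cell above. The names `u`, `v`, and `m` are illustrative; other NumPy functions (e.g., `np.dot`, `np.concatenate`) could be used for the same purpose.

```python
import numpy as np

u = np.array([1, 0, 2])
v = np.array([3, 1, 1])
m = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)

m_t = m.T                 # transpose, shape (2, 3)
rows = np.vstack([u, v])  # stacks u and v as rows, shape (2, 3)
flat = np.hstack([u, v])  # joins u and v side by side, shape (6,)

elementwise = u * v       # element-wise product: [3, 0, 2]
dot = u @ v               # scalar product: 1*3 + 0*1 + 2*1 = 5
vec_mat = u @ m           # (3,) @ (3, 2) -> (2,): [11, 14]

print(m_t.shape, rows.shape, flat.shape)
print(elementwise, dot, vec_mat)
```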

## Exercise

Define a function `extract_subset(x, y, y0)` that takes as input:

- a 2D matrix `x` and a 1D array `y` with one value for each row of `x`
- a target `y0` (e.g., `y0=0`)

and returns the matrix containing only the rows of `x` where `y` is equal to `y0`.

In [None]:
x = np.array([
        [ 0.33990211,  0.94182274,  0.66611658,  0.72773846],
        [ 0.20281557,  0.24280422,  0.3627702,   0.80495032],
        [ 0.5016927,   0.29465024,  0.61690932,  0.25302243],
        [ 0.01744464,  0.82521145,  0.82226041,  0.89858553],
        [ 0.33772606,  0.17433791,  0.7705529,   0.11211808]
    ])

y = np.array([0, 0, 1, 1, 1])  # one value for each row of x

def extract_subset(x, y, y0):
    return ...

result = extract_subset(x, y, y0=0)
print(result)


## Handling data with Pandas

Pandas provides two types of classes for handling data:

* **Series**: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects, etc.
* **DataFrame**: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

Creating a Series by passing a list of values makes pandas create a default RangeIndex to index the series. This is a sequence of integer indexes from 0 to the length of the series minus one.


In [None]:
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
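
To see the default index explicitly, and how it differs from a user-defined one, consider the following sketch. The labels `"x"`, `"y"`, `"z"` are hypothetical and chosen only for illustration.

```python
import pandas as pd

default_index = pd.Series([10, 20, 30])
print(default_index.index)  # RangeIndex(start=0, stop=3, step=1)

# the index can also be passed explicitly, one label per value
labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(labeled["y"])         # 20
```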

We can also create a DataFrame by passing a NumPy array. Optionally, we can explicitly specify the index to use. For example, we can create a datetime index using `date_range()`. Additionally, we can create named columns by passing the labels during creation:

In [None]:
dates = ...
data = ...
df = ...

print(df.head())  # prints only the first rows (five by default)

print(df.tail(3))  # prints only the last (three) values
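
As a separate illustration, the sketch below builds a small DataFrame from a NumPy array with a datetime index and named columns. The start date, the column labels `"low"` and `"high"`, and the variable names are all hypothetical and not the ones expected in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical start date and labels, chosen only for this example
days = pd.date_range("2024-01-01", periods=4)  # four consecutive days
readings = np.arange(8).reshape(4, 2)          # 4 rows, 2 columns
frame = pd.DataFrame(readings, index=days, columns=["low", "high"])

print(frame.head(2))  # first two rows
print(frame.tail(1))  # last row
```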

Then, we can retrieve the index and the columns:

In [None]:
print(df.index)
print(df.columns)

To select elements from the dataframe, we can use the getitem operator (`[]`) like we do with numpy, but we can now also index columns by name.

In [None]:
column_a = ...  # extract the Series "a" from the df
print(column_a)

some_rows = ...  # extracts a sub-frame by row position
print(some_rows)

sliced_by_index = ...  # extracts a sub-frame by index labels
print(sliced_by_index)
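
The following sketch illustrates the three uses of `[]` on a small, hypothetical DataFrame (the dates and the column labels `"low"` and `"high"` are made up for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for this example
days = pd.date_range("2024-01-01", periods=4)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=days, columns=["low", "high"])

low = frame["low"]                           # a column name returns a Series
first_two = frame[0:2]                       # an integer slice selects rows by position (stop excluded)
by_dates = frame["2024-01-02":"2024-01-03"]  # a label slice selects rows by index value (both ends included)

print(low)
print(first_two)
print(by_dates)
```

Note the difference between the two slices: positional slices exclude the stop value, while label slices include it.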

We can also make the selection more explicit by using the dedicated accessors that select elements by label (`df.loc`) or by integer position (`df.iloc`).

In [None]:
by_label = ...
print(by_label)

by_position = ...
print(by_position)
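
A minimal sketch of the two accessors on a hypothetical DataFrame follows (dates and column labels are made up for this example). The same element and the same sub-frame are selected once by label and once by position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for this example
days = pd.date_range("2024-01-01", periods=4)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=days, columns=["low", "high"])

value_by_label = frame.loc[days[1], "high"]  # row label, column label
value_by_position = frame.iloc[1, 1]         # row position, column position

# label slices include both endpoints; position slices exclude the stop
rows_by_label = frame.loc[days[0]:days[1], ["low"]]
rows_by_position = frame.iloc[0:2, 0:1]

print(value_by_label, value_by_position)
print(rows_by_label.shape, rows_by_position.shape)
```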

Finally, we can select rows with boolean arrays/series (like we did with numpy).

In [None]:
boolean_indexed = ...
print(boolean_indexed)
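
As with NumPy, a comparison on a column produces a boolean Series that can be used to filter rows. The sketch below uses a hypothetical DataFrame and threshold, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, used only for this example
days = pd.date_range("2024-01-01", periods=4)
frame = pd.DataFrame(np.arange(8).reshape(4, 2), index=days, columns=["low", "high"])

mask = frame["low"] > 2  # boolean Series, one value per row
filtered = frame[mask]   # keeps only the rows where the mask is True

print(mask)
print(filtered)
```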

Beyond this, most of the operations performed with numpy are also applicable to dataframes. There are more specific operations that we will not cover in this course, but you can find good examples in the documentation of the library.

Reference for Pandas tutorial:
* https://pandas.pydata.org/docs/user_guide/10min.html