# Introduction

Welcome!

This notebook is for your first practical session of the AI for Health course. Over the course of multiple practical sessions, you will learn how to understand the code that underlies applications of Artificial Intelligence (AI) in healthcare. You will also learn how to implement AI models by yourself. Programming is a skill that is acquired by hands-on experience and typically requires a lot of training. During this course we will not be able to make you an expert programmer, but we aim to introduce you to the different ingredients that come into play when creating an AI model. 

In any AI project, your data is your most crucial ingredient. In this session, we will start by introducing some ways to get a better understanding of your data. Along the way we will introduce several Python packages that are very often used when running an AI project. Some basic programming knowledge is assumed (things like defining variables, for loops, if statements, etc.). If you haven't done any Python introduction yet, please make use of the RealPython accounts offered to you.

## Learning objectives
- get familiar with the look and feel of Python
- get familiar with Google Colab
- get familiar with the Python package Numpy

Now, let's get started!

# Getting started with Numpy

There are many ways to represent your data and perform calculations on it in Python. We will focus on the two most used packages for this: Numpy, which we will cover first, and Pandas, which will be introduced in the next practical session.

Numpy is a fundamental package for scientific computing with Python. Its most important feature is storing data in large arrays and enabling you to perform all kinds of computations on these arrays. 

Let's start by importing Numpy.

In [None]:
import numpy as np

The most important feature of Numpy is its arrays: objects that store data and on which you can easily perform all kinds of operations. To create an array you can use the function np.array().

You can get information about a function by typing a double question mark (??) before its name. A separate help window opens where you can see what a function expects as input and what it returns as output.



In [None]:
??np.array

Now let's create our first array.

In [None]:
my_array = np.array([1,2,3,4,5])

To see the contents of an array, you can use a print statement.

In [None]:
print(my_array)

The array you just created has several properties, called 'attributes', automatically assigned to it. One of these is the number of elements in the array, called the 'size'. This is very useful when your arrays become larger.

In [None]:
print(my_array.size)

You can also print the shape of the array. This gives the number of elements per dimension. Our array is 1-dimensional, so the shape contains only 1 number.

In [None]:
print(my_array.shape)

Let's make a 2-dimensional array and print its size and shape.

In [None]:
my_array = np.array([[0,1,2],[3,4,5]])
print(my_array)
print(my_array.size)
print(my_array.shape)

You can see that our array contains 6 elements divided into 2 rows (1st dimension) and 3 columns (2nd dimension).
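
The same attributes work for arrays with more dimensions. The sketch below uses a hypothetical 3-dimensional array (the name `volume` and its values are made up for illustration) and also prints `ndim`, the attribute that holds the number of dimensions. Note that the size is always the product of the numbers in the shape.

```python
import numpy as np

# Hypothetical example: 2 blocks, each with 3 rows and 4 columns
volume = np.array([[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]],
                   [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])

print(volume.ndim)   # number of dimensions
print(volume.shape)  # number of elements per dimension
print(volume.size)   # total number of elements: 2 * 3 * 4
```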

Arrays can have any number of dimensions. Higher-dimensional arrays can be hard to print or visualize, which often requires you to select a subsection of your elements that you can easily print or visualize. This brings us to our next subject: indexing arrays.

# Indexing arrays
Indexing is used to select elements from your array. There are different ways to index arrays, which we will demonstrate here.

In Python indices start from 0 (in contrast to for example Matlab where indices start from 1).

Let's select the first element of an array.

In [None]:
my_array = np.array([1,2,3,4,5])
print(my_array[0])

To select elements from a 2D array you indicate the position of the element in every dimension, separated by commas.

In [None]:
my_array = np.array([[1,2,3],[4,5,6]])
print(my_array[1,2])

We selected the element on row 1, column 2, which is 6 (remember indices start from 0).

Try to select the number 4 from the array below.

In [None]:
my_array = np.array([[2,5,1,4],[7,0,6,3]])

print(my_array[...])

To select elements from the end of the array, you can use negative indices.

In [None]:
my_array = np.array([[2,5,1,4],[7,0,6,3]])

print(my_array[-1,-1])

To select a range of elements you can use the ':' symbol. This is called slicing.

In [None]:
my_array = np.array([5,0,4,1,3])
print(my_array[0:3])

Note that the elements that are selected include the start index, but exclude the end index (so index 0, 1, and 2, but not 3).
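
A handy consequence of this rule is that a slice always contains `end - start` elements. The short sketch below checks this on the same array; the variable names `start`, `end` and `selection` are only chosen for this illustration.

```python
import numpy as np

my_array = np.array([5, 0, 4, 1, 3])

start, end = 1, 4
selection = my_array[start:end]

print(selection)                      # elements at index 1, 2 and 3
print(len(selection) == end - start)  # the slice holds end - start elements
```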

If you want to select all elements from a certain point onwards you can leave out the end index.

In [None]:
my_array = np.array([5,0,4,1,3])
print(my_array[2:])

This can also be done with negative indices.

In [None]:
my_array = np.array([5,0,4,1,3])
print(my_array[2:-1])

Slicing in more dimensions is also possible.

In [None]:
my_array = np.array([[2,5,1,4],[7,0,6,3]])

print(my_array[1,1:4])

Can you select from both rows the first 3 elements? 

In [None]:
my_array = np.array([[2,5,1,4],[7,0,6,3]])

print(my_array[...])

# Data types

Another important property of Numpy arrays is the data type. This is an important concept because it determines how much storage space our data requires. There are many different data types. The standard data types of Python are:

- string - used to represent text data, written between quotation marks. e.g. "Hello"
- integer - used to represent integer numbers. e.g. 1, 2, 3
- float - used to represent real numbers. e.g. 2.1, 11.53
- boolean - used to represent True or False.
- complex - used to represent complex numbers. e.g. 2.2 + 3.0j, 2.6 + 2.1j

There are also subtypes for these data types. For example, int8 can store integers from -128 to 127 and int16 can store integers from -32768 to 32767. Storing a number as int16 takes twice as much memory as storing it as int8. It is thus good practice to store your data in a type that uses as little space as possible, especially when working with larger data sets.
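
You do not have to memorize these ranges. As a minimal sketch, the function `np.iinfo` reports the limits of an integer type, and the `itemsize` of a data type gives the number of bytes needed per element.

```python
import numpy as np

for int_type in [np.int8, np.int16, np.int32, np.int64]:
    info = np.iinfo(int_type)                # limits of this integer type
    n_bytes = np.dtype(int_type).itemsize    # bytes needed to store 1 element
    print(info.dtype, info.min, info.max, n_bytes)
```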

Let's print the data type of our array.

In [None]:
int_array = np.array([1,2,3,4,5])

print(int_array.dtype)

On most systems, Numpy automatically stores these integers as int64, the largest signed integer data type it offers. If we know we will only use integers between -128 and 127, we can store them much more efficiently.

In [None]:
int_array = int_array.astype('int8')

print(int_array.dtype)

We have now stored the data in a way that takes 8 times less memory.
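
You can verify the memory use yourself with the `nbytes` attribute, which gives the total number of bytes taken up by the elements of an array. In this sketch the starting data type is set explicitly to int64 so the outcome does not depend on your system; the names `big` and `small` are only used for this illustration.

```python
import numpy as np

# dtype is set explicitly so the result is the same on every system
big = np.array([1, 2, 3, 4, 5], dtype='int64')
small = big.astype('int8')

print(big.nbytes)    # 5 elements * 8 bytes each
print(small.nbytes)  # 5 elements * 1 byte each
```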

Next, let's have a look at decimal numbers. 

In [None]:
float_array = np.array([1.3,2.1,3.5,4.2,5.7])

print(float_array.dtype)

This array is automatically assigned a float64 data type, because the int data type cannot store decimals. Floats can also be stored more efficiently by reducing the number of digits that are stored (the precision). This can lead to small rounding errors that can be a problem in some cases, but are generally negligible for machine learning applications.
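
To get a feeling for the size of such errors, the sketch below stores the single number 0.1 once as a float64 and once as a float32 and compares the two. The variable names are made up for this illustration.

```python
import numpy as np

x64 = np.float64(0.1)
x32 = np.float32(0.1)

# Convert both to Python floats so the difference is computed in full precision
error = abs(float(x32) - float(x64))

print(error)         # tiny, but not exactly zero
print(error < 1e-7)  # far smaller than what usually matters in practice
```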

Try to convert the array to a float32 yourself.

In [None]:
float_array = ...

print(float_array.dtype)

Another often-used data type is the boolean. Booleans are used to store binary, or True / False, values. If you create an array of 0's and 1's, Numpy does not automatically turn it into a boolean array, but treats it as an array of integers.

In [None]:
bool_array = np.array([0, 1, 0, 0, 1])

print(bool_array.dtype)
print(bool_array)

Try to convert it to an array of data type 'bool' yourself.

In [None]:
bool_array = ...

print(bool_array.dtype)
print(bool_array)

By making it a boolean array the values of 0 and 1 are converted to False and True. 

# Calculating with arrays
The whole point of Numpy is not to just store the data in arrays, but to do calculations on your data. The Numpy package is ideal for performing these calculations and much faster than doing it without Numpy.

Let's explore some of the basic calculations you can perform on Numpy arrays, starting with basic arithmetic.

In [None]:
my_array =  np.array([1,2,3,4,5])

a = my_array + 1
b = my_array - 3
c = my_array * 2
d = my_array / 4


print(a, '\n',
      b, '\n',
      c, '\n',
      d)

Interestingly, in one of these 4 cases the array changed its data type. Can you find which one, and what it changed to? Why is that?

In [None]:
print(... .dtype)

If you found the correct answer, you have noticed that division changed the data type. This is because we now suddenly need to store decimal numbers. It is also possible to do an 'integer division' using '//'. The results are then rounded down to the nearest integer (also called flooring).

In [None]:
my_array =  np.array([1,2,3,4,5])

print(my_array // 4)
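
Flooring always rounds down, which matters for negative numbers: the result moves away from zero instead of towards it. A small sketch with made-up values:

```python
import numpy as np

values = np.array([-7, -1, 1, 7])

print(values // 2)  # rounded down: [-4 -1  0  3]
print(values / 2)   # normal division: [-3.5 -0.5  0.5  3.5]
```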

You can also do arithmetic between 2 arrays. These work element-wise by default.

In [None]:
array_1 = np.array([[1,2,3],
              [4,5,6]])

array_2 = np.array([[10,11,12],
              [13,14,15]])

array_3 = array_1 + array_2

print(array_3)
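
Element-wise also applies to the other operators. As a short sketch on the same two arrays, '*' multiplies each element with the element at the same position in the other array, which is different from the matrix multiplication discussed later in this notebook.

```python
import numpy as np

array_1 = np.array([[1, 2, 3],
                    [4, 5, 6]])
array_2 = np.array([[10, 11, 12],
                    [13, 14, 15]])

print(array_1 * array_2)  # element-wise product
print(array_2 - array_1)  # element-wise difference
```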

Besides this basic arithmetic, there are many built-in functions in Numpy to do calculations on arrays. For example, you can take the sum, product or mean of all elements in an array, or the differences between consecutive elements.

In [None]:
my_array =  np.array([1,2,3,4,5])

print(np.sum(my_array))
print(np.prod(my_array))
print(np.diff(my_array))
print(np.mean(my_array))
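
For arrays with more than 1 dimension, many of these functions accept an `axis` argument that tells Numpy along which dimension to calculate. The sketch below shows this for `np.sum` on a small made-up 2D array; functions such as `np.mean` take the same argument.

```python
import numpy as np

my_array = np.array([[1, 2, 3],
                     [4, 5, 6]])

print(np.sum(my_array))          # sum of all elements: 21
print(np.sum(my_array, axis=0))  # add the rows together, one value per column: [5 7 9]
print(np.sum(my_array, axis=1))  # add the columns together, one value per row: [ 6 15]
```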

Lastly, you can perform multiplications between different arrays in Numpy, something that is widely used in Machine Learning and Deep Learning. This operation often takes up most of the computation time of an algorithm, because we use very large arrays (especially in Deep Learning), where millions or billions of numbers are multiplied.

A dot product (an operation that multiplies the elements of 2 vectors and sums the result) is performed as follows.

In [None]:
array_1 = np.array([1,2,3,4,5])
array_2 = np.array([6,7,8,9,10])

print(np.dot(array_1,array_2))
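
To see what happens inside the dot product, you can also compute it in two steps: first multiply the arrays element-wise, then sum the result. The names `products` and `by_hand` are only used for this illustration.

```python
import numpy as np

array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8, 9, 10])

products = array_1 * array_2  # [ 6 14 24 36 50]
by_hand = np.sum(products)    # 6 + 14 + 24 + 36 + 50 = 130

print(by_hand)
print(by_hand == np.dot(array_1, array_2))
```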



Matrix multiplication on 2D matrices (an operation where every row of the first matrix is multiplied with every column of the second matrix and the result is summed, similar to performing several dot products at the same time) is performed in the following way.

In [None]:
# array_1 and array_2 are 1D here, so .T changes nothing and the result equals the dot product
print(np.matmul(array_1.T,array_2))
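
For two arrays that are really 2D, a minimal sketch could look as follows. The matrices `matrix_a` (2 rows, 3 columns) and `matrix_b` (3 rows, 2 columns) are made up for this example. The number of columns of the first matrix has to equal the number of rows of the second, and every element of the result is the dot product of one row of the first matrix with one column of the second.

```python
import numpy as np

matrix_a = np.array([[1, 2, 3],
                     [4, 5, 6]])   # shape (2, 3)
matrix_b = np.array([[7, 8],
                     [9, 10],
                     [11, 12]])    # shape (3, 2)

result = np.matmul(matrix_a, matrix_b)

print(result)        # [[ 58  64]
                     #  [139 154]]
print(result.shape)  # (2, 2)

# The top-left element is the dot product of row 0 and column 0
print(np.dot(matrix_a[0, :], matrix_b[:, 0]))
```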

# Combining and reshaping arrays
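You can combine two 2D arrays if their shapes match along the dimension that is not being combined: stacking horizontally requires the same number of rows, and stacking vertically requires the same number of columns.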


In [None]:
array_1 = np.array([[1,2,3],
               [4,5,6]])

array_2 = np.array([[7,8,9],
               [10,11,12]])

array_3 = np.hstack((array_1, array_2))

array_4 = np.vstack((array_1, array_2))

print(array_3)

print(array_4)
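
Printing the shapes makes clear what each function does. The sketch below repeats the two arrays and adds a hypothetical `extra_column` to show that horizontal stacking only needs the number of rows to match.

```python
import numpy as np

array_1 = np.array([[1, 2, 3],
                    [4, 5, 6]])
array_2 = np.array([[7, 8, 9],
                    [10, 11, 12]])

print(np.hstack((array_1, array_2)).shape)  # (2, 6): columns placed side by side
print(np.vstack((array_1, array_2)).shape)  # (4, 3): rows placed on top of each other

extra_column = np.array([[0], [0]])         # shape (2, 1)
print(np.hstack((array_1, extra_column)))   # works, because both have 2 rows
```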

Reshaping arrays also often comes in handy. In this way you can change how the elements are distributed over the dimensions, as long as the total number of elements stays the same.

In [None]:
array_1 = np.array([[1,2,3],
               [4,5,6]])
array_2 = np.reshape(array_1, (3,2))

print(array_1)
print(array_2)
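
Reshaping fills the new shape row by row and only works when the new shape holds exactly as many elements as the original array. A short sketch of both points:

```python
import numpy as np

array_1 = np.array([[1, 2, 3],
                    [4, 5, 6]])

print(np.reshape(array_1, (6,)))    # flattened to 1 dimension: [1 2 3 4 5 6]
print(np.reshape(array_1, (3, 2)))  # elements are refilled row by row

try:
    np.reshape(array_1, (4, 2))     # 8 places for 6 elements does not fit
except ValueError as error:
    print('Reshape failed:', error)
```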

# Conditions on arrays
Often we would like to select data from an array that meets a certain condition. You can do this in two ways in Numpy: either by creating a boolean array, or with the 'where' function, which returns the indices of the elements that meet the given condition.

In [None]:
my_array = np.array([2,5,1,4,7,0,6,3])

# Create boolean array where data meets condition
boolean = my_array>2

# Find indices where data meets condition
indices = np.where(my_array>2)

print(boolean)
print(my_array[boolean])
print(indices)
print(my_array[indices])
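
Boolean arrays can also be combined and counted. As a minimal sketch on the same array, '&' combines two conditions (each condition needs its own parentheses), and because True counts as 1, summing a boolean array gives the number of elements that meet the condition. The name `in_range` is only used for this illustration.

```python
import numpy as np

my_array = np.array([2, 5, 1, 4, 7, 0, 6, 3])

# Elements larger than 2 AND smaller than 6
in_range = (my_array > 2) & (my_array < 6)

print(my_array[in_range])    # [5 4 3]
print(np.sum(my_array > 2))  # number of elements larger than 2: 5
```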

# Exercise
To get used to the concepts you learned above, here is a small exercise.

Below is an array representing the climate of different cities around the world. The rows represent different cities and the columns represent months. The values of the elements represent the average monthly temperature of a city.

Can you answer the following questions using Numpy?
- Which city has the highest average temperature, and which month is on average the warmest?
- Which city has the most months with a temperature above 20 degrees?
- Which city has the biggest difference between the coldest month and the warmest month?

In [None]:
temperatures = np.array([[1,3,8,14,17,20,22,23,23,20,15,8],
                         [2,5,9,15,19,23,26,26,23,20,14,7],
                         [-1,-2,5,10,12,15,20,20,17,13,8,2],
                         [-5,-3,3,11,17,24,25,23,20,15,8,-2],
                         [8,12,15,17,19,20,22,21,20,15,12,8],
                         [5,12,21,25,28,28,30,29,25,17,12,5],
                         [-10,-5,-3,5,10,12,15,12,10,3,-4,-9],
                         [16,17,17,19,23,27,27,27,20,19,18,16],
                         [1,5,10,17,20,25,27,28,24,20,15,8],
                         [2,5,8,12,14,15,17,19,18,15,9,4]])

# Discussion
Now that you have finished this notebook you should have familiarized yourself with the way Python looks and feels within the Google Colab environment and have a basic understanding of the Numpy package.

In our next notebook we will look at our first data set and use Numpy to understand our data.