# Introduction to Data Science PC Lab 02: Basics of Numpy Arrays

Author: Jan Verwaeren - Arne Deloose

Course: Introduction to Data Science

Welcome back, dear reader. In this notebook, we will cover the basics of Numpy arrays. As usual, each of the three sections starts with illustrations followed by exercises. The illustrations in this notebook are similar to those in the theory slides for lecture 2 and can be used to follow along.

The code below detects whether you are running the notebook in Google Colab. This will be important later when we load arrays from files.

In [None]:
try:
    import google.colab
    in_colab = True #set in_colab to True if you're running the code from google colab
except ImportError:
    in_colab = False #otherwise, it's false

**What is Numpy?**

Before we start, let us briefly discuss what Numpy is. Numpy is a Python library that adds support for arrays and matrices. Libraries are collections of functions that extend the capabilities of Python. We will see many other libraries during this course. You may have noticed that we already used the *requests* library in PC lab 01 to load in data from a url. 

If a library is already installed, it can be imported using the *import* statement as illustrated below. With *as* we can give it a shorter name. The convention is to shorten numpy to *np*.

In [None]:
#import the numpy module and call it np
import numpy as np

If a library is not yet installed, it needs to be downloaded first using the package manager *pip*. If we want the *jellyfish* library for example (this library is used to phonetically match strings), we need to install it first using:

    !pip install jellyfish

Before we can import it with:

    import jellyfish

Commonly used libraries like *Numpy* are installed by default on Colab, so we do not need to worry about this for now.

To use a function from a library, we write *library.function()*, for example *np.array()* or *jellyfish.levenshtein_distance()*.

## 1. Creating Numpy arrays

With Numpy, we can work with arrays, so the logical first step is to create these arrays. This can be done in three different ways: with *np.array*, with another Numpy function, or by loading from a file.

**ILLUSTRATION**

**1.a. With *np.array***

The most basic way to create an array is with the function *np.array()*. The first input of this function is the object we want as an array. To get a single row, we give the elements separated by commas and enclosed in square brackets, as illustrated below. Multiple rows are given as separate bracketed rows. However, these must be enclosed again in square brackets, because the object must be given as a single input. Without the extra brackets, the second row would be considered a second input.

In [None]:
#using np.array
A = np.array([1, 2, 3]) #one row
B = np.array([[1, 2, 3],[4, 5, 6]]) #two rows
A.dtype #show type

In [None]:
B #show B

Once again, the type is detected automatically. Numpy always uses the most general type among the elements, so if one number is given as a float, the entire array gets a float type.

In [None]:
B = np.array([[1, 2.0, 3],[4, 5, 6]]) #two rows
print(B)
print(B.dtype)
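
As a small sketch of this rule (the array names below are illustrative and not part of the lab), mixing types in one array gives the most general dtype:

```python
import numpy as np

ints = np.array([1, 2, 3])            # all integers -> integer dtype
mixed = np.array([1, 2.0, 3])         # one float -> the whole array becomes float
with_bool = np.array([True, 2, 3.5])  # bool + int + float -> float

print(ints.dtype)       # an integer type (the exact width depends on the platform)
print(mixed.dtype)      # float64
print(with_bool.dtype)  # float64
print(with_bool[0])     # True is stored as 1.0
```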

Note that dtype is an *attribute* of our array, not a function (no brackets needed). *dtype* is not the same as *type*, because the type of B is:

In [None]:
type(B)

This type is the same for every Numpy array, even if the internal type (*dtype*) is different. For example:

In [None]:
#different array type
#implicit type, analogous to a single variable
A = np.array([True, False, False])
print(A.dtype)
print(type(A))

With the *dtype* argument, we can explicitly define the type as well.

In [None]:
A = np.array([1, 2, 3], dtype='float') #one row
A

**1.b. Other functions**

There are several other functions to create arrays. *zeros* can be used to create an array filled with zeros, *arange* and *linspace* can fill in numbers between two points and *random* fills in random numbers. To get floats, we need *random.random* (there are other random generators such as *random.randint* for random integers)

In [None]:
A = np.zeros((5, 3), dtype = 'float') # 5 rows, 3 columns of zeros

B = np.arange(0, 10, 2)               # from 0 to (just not) 10 in steps of 2

C = np.linspace(0, 1, 5)              # 5 equally spaced between 0 and 1

D = np.random.random((3, 3))          # 3 by 3 matrix of random numbers in [0, 1)

In [None]:
B
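
The difference between *arange* (step size, endpoint excluded) and *linspace* (number of points, endpoint included) is easy to mix up. A minimal sketch, using illustrative variable names:

```python
import numpy as np

steps = np.arange(0, 10, 2)    # start, stop (excluded), step size
points = np.linspace(0, 1, 5)  # start, stop (included), number of points

print(steps)   # [0 2 4 6 8]
print(points)  # five values: 0, 0.25, 0.5, 0.75 and 1

rand_ints = np.random.randint(0, 10, size=(2, 3))  # random integers from 0 to 9
print(rand_ints.shape)  # (2, 3)
```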

**1.c. From a file**

Finally, arrays can be loaded from a file using *np.loadtxt*. Despite its name, it can read other text-based file types as well, such as *.csv*, not just *.txt*. The first argument of this function is the location of the file, which can be a local path or a network location (in Colab, network locations are recommended). *delimiter* specifies the separator (usually a comma). The *dtype* can be provided as well, and *skiprows* skips a number of rows at the top of the file (useful if you have a header).

In [None]:
#notebook running locally:
if not in_colab:
    X = np.loadtxt("files_IDS/iris_features.csv",
                    delimiter = ",",
                    skiprows=1)

    Y = np.loadtxt("files_IDS/iris_labels.csv", delimiter = ",",
                    skiprows=1, 
                    dtype = str)
#google colab
else:
    #loadtxt can work with network locations
    X = np.loadtxt("https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_features.csv",
                    delimiter = ",",
                    skiprows=1)

    Y = np.loadtxt("https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_labels.csv", 
                    delimiter = ",",
                    skiprows=1, 
                    dtype = str)

In [None]:
print(X)
print(Y)
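
To see what *delimiter* and *skiprows* do without needing a file, the sketch below reads from an in-memory text buffer instead. The column names and values are made up for illustration:

```python
import io
import numpy as np

# a tiny in-memory "file" with one header row (made-up values)
csv_text = "length,width\n5.1,3.5\n4.9,3.0\n6.2,2.9\n"

values = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
print(values.shape)  # (3, 2): the header row was skipped
print(values.dtype)  # float64, the default dtype of loadtxt
```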

**EXERCISES**

**Exer 1**

Create an array with the numbers 0 to 21 in steps of 3 (including the endpoints 0 and 21)

In [None]:
#...

**Exer 2**

Read in the file *heart_reduced* from Github. This is a small version of the Cleveland heart disease dataset, a popular public dataset that is commonly used to illustrate machine learning and data science. The 'goal' of the dataset is to predict heart diseases based on several variables (such as age and blood pressure). As will be seen later, these variables have a different type.

The original dataset is available here: https://archive.ics.uci.edu/ml/datasets/heart+disease

In [1]:
#...

## 2. Data Manipulation

Now that we have our arrays, we need to be able to actually do something with them. In this section, we will see 5 ways of interacting with an array: attributes, indexing, slicing, reshaping and joining + splitting

**ILLUSTRATION**

**2.a. Attributes**

Every array has certain *attributes* that can be accessed with the dot notation. We have already seen one before: *array.dtype*
Other useful attributes are *ndim* for the number of dimensions and *shape* for the size of an array. 

Be careful here: attributes look a lot like methods, but the syntax is different. *X.mean()* will give us the mean of the numbers in X (brackets because mean is a *method*), while *X.shape* will give us the size (no brackets because this is an *attribute*).

In [None]:
# get nr of dimensions
print(Y.ndim) 
print(X.ndim)
# get shape -> result is always tuple
print(Y.shape)
print(X.shape)
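
The same attributes on a small hand-made array, as a sketch (the name *table* and the *size* attribute are additions for illustration):

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(table.ndim)    # 2
print(table.shape)   # (2, 3): 2 rows, 3 columns
print(table.size)    # 6, the total number of elements
print(table.mean())  # 3.5, a method, so brackets are needed
```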

**2.b. Indexing**

If we want to access a specific element of an array, we can do this with indexing. For a one-dimensional array, this is a single number enclosed in square brackets, such as [3] for the fourth element (indexing starts at 0). Counting backwards can be done with a negative number: [-2] is the second element counting backwards from the end. This means the first element has index 0 and the last element has index -1.

For 2D arrays, we can specify a row and a column separated by a comma: [1, 2] for the second row, third column. (3D arrays can be accessed with a third number, such as [1, 2, 3].)

In [None]:
#accessing values
print(Y[1])          # square brackets with positive integer index
print(X[1, 2])       # square brackets with positive integer indices
print(Y[-1])         # negative index -> index backward from the end of axis
print(X[1, -1])      # negative index -> index backward from the end of axis

By assigning a new value to a specific element, we can change an element in an array

In [None]:
#changing values
Y[1] = 'hello'
print(Y)
X[1, 2] = 9.9
print(X)

**Note:** an assigned value is converted to the dtype of the array, so it must be compatible with that type. Inserting a string such as 'hello' into a float array like X is not possible and raises an error.
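
A minimal sketch of how assignment interacts with the dtype, using small made-up arrays. Note that some conversions succeed silently but lose information:

```python
import numpy as np

ints = np.array([1, 2, 3])
ints[0] = 9.9             # the float is truncated to fit the integer array
print(ints)               # [9 2 3]

labels = np.array(['cat', 'dog'])  # each element holds at most 3 characters
labels[0] = 'horse'       # longer strings are cut off
print(labels[0])          # hor

floats = np.array([1.0, 2.0, 3.0])
try:
    floats[0] = 'hello'   # cannot be converted to a float
except ValueError as err:
    print("ValueError:", err)
```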

**2.c. Slicing**

Indexing can be extended to multiple values, which we call slicing. For this, a range of values must be provided, such as [0:2] for the first and second element (indices 0 and 1; the endpoint is not included). If a bound is omitted, Python will select the largest range possible, meaning that [:2] is equivalent to [0:2] and [2:] will select from index 2 until the end of the array. To select everything along an axis, we can use a single colon, such as [2, :] for the entire third row and [:, 2] for the entire third column. [2, :] can be shortened to [2,] or [2], but we recommend always writing out [2, :] to avoid confusion.

In [None]:
print(Y[0:2]) 
print(X[0:2, 1:4])

In [None]:
print(Y[:2]) #from start to 2
print(Y[2:]) #from 2 until the end

In [None]:
print(X[1, :])     # extract one row
print(X[:, 2])     # extract one column

Using triple values, we can select ranges with step sizes

In [None]:
print(X[0:150:2, 0:4:2]) #index 0 to 150 in steps of 2 + 0 to 4 in steps of 2
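
On a small array, the effect of each slice is easier to check by eye. A sketch with an illustrative 3 by 4 array:

```python
import numpy as np

grid = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11]])

print(grid[0:2, 1:3].tolist())  # rows 0-1, columns 1-2 -> [[1, 2], [5, 6]]
print(grid[2, :].tolist())      # third row -> [8, 9, 10, 11]
print(grid[:, 2].tolist())      # third column -> [2, 6, 10]
print(grid[::2, ::2].tolist())  # every second row and column -> [[0, 2], [8, 10]]
```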

We can change multiple values similar to what we did with indexing. Numpy will *broadcast* the input we provide. This means that if we want to replace three numbers with a single number, Numpy will automatically repeat the single number three times to make the sizes compatible. So 10 becomes [10, 10, 10]. The same principle can be extended to rows and columns. If you ask something impossible such as: 

    Y[0:3] = ['hello', 'world']
    
You will get the following error:

    ValueError: could not broadcast input array from shape (2,) into shape (3,)

This indicates the shapes are not compatible with each other.

In [None]:
#change multiple values
X[0, 1:3] = 10 #change multiple values to the same one
X[1, 1:4] = [1, 2, 3] #change each value to another value
print(X)
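
The broadcasting rule for assignment can be tried out safely on a small array. A sketch with an illustrative 1D array, where the incompatible assignment is caught so the cell keeps running:

```python
import numpy as np

row = np.zeros(5)
row[0:3] = 10          # one value is repeated for all three positions
row[3:5] = [1, 2]      # two values for two positions
print(row.tolist())    # [10.0, 10.0, 10.0, 1.0, 2.0]

try:
    row[0:3] = [7, 8]  # two values cannot fill three positions
except ValueError as err:
    print("ValueError:", err)
```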

**2.d. Reshaping**

We can also reshape an array, as long as the total number of elements stays the same. For example, a 6 by 2 array can be reshaped into a 4 by 3 array. We could also reshape a 12 element vector into a 3 by 4 array. Note that we can also reshape it into a 12 by 1 array. This may look similar to the original object, but a 12 by 1 array is a 2D object whereas a vector is a 1D object.

By default, arrays are filled row by row, but the order argument can be used to change this.

In [None]:
A = np.arange(1, 13) # 1D array
#reshape is not an in-place method, so you need to assign it to a variable
A = A.reshape((3, 4)) #reshape
print(A)

In [None]:
#change order
A = np.arange(1, 13) # 1D array
A = A.reshape((3, 4), order='F') #reshape, fill in per column
print(A)
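
A short sketch of the two points made above: the difference between a vector and a 12 by 1 array, and the effect of the *order* argument. The variable names are illustrative:

```python
import numpy as np

vec = np.arange(1, 13)       # shape (12,): a 1D vector
col = vec.reshape((12, 1))   # shape (12, 1): a 2D array with one column
print(vec.ndim, col.ndim)    # 1 2

by_row = vec.reshape((3, 4))             # default: fill row by row
by_col = vec.reshape((3, 4), order='F')  # fill column by column
print(by_row[0, :].tolist())  # [1, 2, 3, 4]
print(by_col[0, :].tolist())  # [1, 4, 7, 10]
```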

**2.e. Joining and splitting**

Arrays can be joined (concatenated) together using *np.concatenate*

Splitting can be done by assigning a slice to a new object

In [None]:
#concatenate
X_extra = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
X_new = np.concatenate([X_extra, X]) 
print(X_new)

In [None]:
Y_extra = np.array(['new', 'new'])
Y_new = np.concatenate([Y_extra, Y]) 

In [None]:
#using axis
X_new2 = np.concatenate((X, X[:,0].reshape(-1, 1)), axis = 1)

In [None]:
#split
X_partial1 = X[:9, :]
X_partial2 = X[9:, :]
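
Checking the shapes before and after makes the role of *axis* clearer. A sketch with small made-up arrays:

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6]])
stacked = np.concatenate([top, bottom])             # axis=0 (default): add rows
print(stacked.shape)                                # (3, 2)

extra_col = np.array([[10], [20]])
widened = np.concatenate([top, extra_col], axis=1)  # axis=1: add columns
print(widened.shape)                                # (2, 3)

first_part = stacked[:2, :]    # splitting by slicing: first two rows
second_part = stacked[2:, :]   # the remaining row
print(first_part.shape, second_part.shape)          # (2, 2) (1, 2)
```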

**EXERCISES**

**Exer 1**

As mentioned earlier, there are different types of variables in the heart disease dataset. Split the original dataset into the following parts:

- X_num: this should contain the numerical variables: age (first column), trestbps (resting blood pressure, column four) and chol (cholesterol level, column five). 
- X_cat: an array with the categorical variables: sex (second column), cp (chest pain type, third column), fbs (fasting blood sugar high or low, sixth column) and restecg (resting electrocardiographic anomalies, seventh column). While these look numerical, they are in fact categorical variables encoded as numbers (each variable can only take a fixed set of values).
- Y_heart: the target variable that we are trying to predict (whether the patient has a heart disease). This is the last column in the original array.

Hint: it can be easier if you first make a zeros array with the correct shape (pre-allocation) and then overwrite the right columns.

In [None]:
#...

## 3. Computation

Finally, we can perform computations with arrays. We will discuss universal functions, aggregations, comparisons + masks + boolean logic, fancy indexing and sorting

**ILLUSTRATION**

**3.a. Universal functions**

Universal functions are operations that are performed element by element on an array. Most of these functions have a shortcut, just like + is a shortcut for the *add()* function. There are three ways of performing these operations, which are illustrated below.

In [None]:
#square 2nd column

#option 1
#using a for loop to fill in a pre-allocated array
A = np.zeros((150,))
for i in range(150):
    A[i] = X[i,1]**2

#option 2
#using vectorization
B = X[:,1]**2

#option 3
#using the universal function np.power
B = np.power(X[:,1], 2)    

We will usually prefer the second option, because this is a logical extension of regular arithmetic. Some more examples are given below using this notation

In [None]:
#other arithmetic functions
print(X[:,1] + 5) # short for np.add(X[:,1], 5)
print(X[:,1] + X[:,2]) # short for np.add(X[:,1], X[:,2])
print(X[:,1] - 2) # subtract
print(X[:,1] * 2) # multiply
print(X[:,1] / 2) # divide
print(X[:,1] ** 2) # square

However, for some functions such as trigonometric ones, we will need the third option.

In [None]:
#absolute value, trigonometric, exponents and log functions
a = np.array([-1, 0.5, 1.25, 2.0])
print(np.abs(a)) # absolute value of each element
print(np.sin(a)) # sine of each element (angle assumed in radians)
print(np.cos(a)) # cosine of each element (angle assumed in radians)
print(np.exp(a)) # exponential function: e^a
print(np.log(a)) # natural logarithm (ln(a))

Notice that Numpy gives a warning here, because the logarithm of a negative number is not defined. The result for that element is nan (not a number).
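
One possible way to deal with this, sketched below, is to detect the nan values with *np.isnan* or to keep only the valid inputs before taking the logarithm. The use of *np.errstate* to silence the warning is an addition for this example and is not part of the lab:

```python
import numpy as np

vals = np.array([-1, 0.5, 1.25, 2.0])

with np.errstate(invalid='ignore'):  # silence the warning inside this block only
    logs = np.log(vals)

print(np.isnan(logs))       # [ True False False False]

positive = vals[vals > 0]   # keep only valid inputs (masks are covered in 3.c)
print(np.log(positive))     # no warning, no nan
```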

**3.b. Aggregations**

Aggregations are a specific type of function that compile statistical information about a variable such as the mean or the variance. Unlike universal functions, aggregations can be called as functions *or* as methods. Meaning that:

    np.mean(a)
    a.mean()
    
Are both valid ways to calculate the mean.

In [None]:
print(X.mean(axis = 0)) #column means
print(X.mean(axis = 1)) #row means
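
The *axis* argument is easiest to verify on a small array. A sketch with made-up numbers:

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(scores.mean())                    # 3.5, over all elements
print(scores.mean(axis=0).tolist())     # [2.5, 3.5, 4.5], one value per column
print(scores.mean(axis=1).tolist())     # [2.0, 5.0], one value per row
print(np.sum(scores, axis=0).tolist())  # [5, 7, 9], function form instead of method
```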

Below, an overview of the aggregations is given

![alt text](files_IDS/aggregations.png "Aggregations")

By combining mean and std, we can standardise a variable. Here, we use axis to specify we want the column mean

In [None]:
#standardise (broadcasting)
X_standardized = (X - X.mean(axis = 0)) / X.std(axis=0)
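
After standardising, every column should have a mean of (approximately) 0 and a standard deviation of 1. A sketch that checks this on a small made-up array:

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

col_mean = data.mean(axis=0)  # shape (2,): one mean per column
col_std = data.std(axis=0)    # shape (2,): one std per column

# the (3, 2) array is combined with the (2,) arrays through broadcasting
standardized = (data - col_mean) / col_std

print(np.allclose(standardized.mean(axis=0), 0))  # True
print(np.allclose(standardized.std(axis=0), 1))   # True
```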

**3.c. Comparisons, masks and boolean logic**

If we perform a comparison between a scalar and an array, the result will be a boolean array with the same shape as the original array. If both sides of the comparison are arrays of the same shape, an element-wise comparison will be performed.

Since a True value is counted as 1, the sum function can be used to count the number of True elements of a comparison. So if we want to count how many values in the first column are below 5, we can do this with:

In [None]:
#comparisons
A = X[:,0] < 5
print(A)
print(np.sum(A)) #number of True values (True: 1, False: 0)

An overview of these operators is shown below

![alt text](files_IDS/comparison.png "Comparison")

In [None]:
A = np.arange(1, 10).reshape((3,3))
print(A >= 5) # elements that are greater than or equal to 5
print(A % 3 == 0) # remainder after division by 3 equals zero

We can also use boolean logic, such as AND, OR and NOT

In [None]:
#boolean logic
print((A > 5)  &  (A % 2 == 0)) #AND
print((A > 5)  |  (A % 2 == 0)) #OR
print(~(A > 5)) #NOT
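
Two details are worth trying out: the comparisons need their own brackets, and the Python keywords *and*, *or* and *not* do not work element by element. A sketch with an illustrative array:

```python
import numpy as np

nums = np.arange(1, 10)

both = (nums > 5) & (nums % 2 == 0)  # element-wise AND
print(both.tolist())
print(np.sum(both))                  # 2 (the values 6 and 8)

try:
    nums > 5 and nums % 2 == 0       # the keyword 'and' needs a single True/False
except ValueError as err:
    print("ValueError:", err)
```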

This can be combined with other methods such as slicing

In [None]:
print((X[:,0] < 5) & (X[:,2] < 1.5)) #slicing + comparison + boolean logic
print(np.sum((X[:,0] < 5) & (X[:,2] < 1.5)))

An array of True and False values can also be used as an index (a mask). Only the elements at the True positions are returned, so [False, False, True] will return the third element.

In [None]:
#masks
print(A[A % 2 == 0])
print(A[(A > 5) | (A % 2 == 0)])

In [None]:
#masks combined with replacing values
A[A % 2 == 0] = 0

In [None]:
#masks as partial index
print(A[A[:,0] % 2 == 1, :])

mask = A[:,0] % 2 == 1 # using an intermediate variable for readability
print(A[mask, :])

Putting all this together, we can perform specific tasks, such as:

In [None]:
#examples

#Compute average sepal length of all ‘short’ sepals ( < 5 )
mask = X[:,0] < 5
print(np.mean(X[mask, 0]))

#Compute average sepal length of all ‘setosa’ flowers
mask = Y == 'setosa'
print(np.mean(X[mask, 0]))

#Compute standard deviation of all variables for ‘setosa’ flowers
mask = Y == 'setosa'
print(np.std(X[mask, :], axis = 0))
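
The same pattern on a few made-up measurements, as a sketch where the mask can be checked by hand:

```python
import numpy as np

lengths = np.array([4.8, 5.4, 6.1, 4.6])
species = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])

mask = species == 'setosa'   # boolean array built from the labels
print(mask)                  # [ True False False  True]
print(lengths[mask])         # only the lengths of the setosa flowers
print(lengths[mask].mean())  # mean of 4.8 and 4.6
```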

**3.d. Fancy indexing and sorting**

Finally, we have fancy indexing, where an array (or list) of indices is used to index another array. Just like we did with True and False values, we can use [1, 2, 4] to select the second, third and fifth element of an array. Below, we see some examples of this.

In [None]:
#passing an array of indices
idx = [1, 2, 4]
print(B[idx])

#row + column
rowidx = [1, 2, 2]
colidx = [0, 0, 1]
print(A[rowidx, colidx])

#combined with slicing
rowidx = [0, 2]
print(A[rowidx, :])

In [None]:
#random
rand = np.random.RandomState(1) #create a random state with a fixed seed
#this makes the code reproducible
rowidx = rand.choice(150, 60, replace=False) #select 60 values from 0 to 149 without replacement
#no replacement = value can only be selected once
print(X[rowidx, :])
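
A small sketch of what the fixed seed buys us: two random states created with the same seed produce the same selection, and sampling without replacement never repeats an index. The sizes used here are arbitrary:

```python
import numpy as np

first = np.random.RandomState(1).choice(10, 4, replace=False)
second = np.random.RandomState(1).choice(10, 4, replace=False)

print(np.array_equal(first, second))  # True: same seed, same selection
print(len(np.unique(first)))          # 4: every index was selected only once
```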

Finally, *sort* can be used to sort an array (ascending by default) and *argsort* returns the indices that would sort the array.

In [None]:
#sorting
B = np.array([5, 8, 7, 3, 1, 2])
print(np.sort(B)) #sort ascending
print(np.argsort(B)) #indices sorted elements

In [None]:
#Sorting rows of a 2D array on a column
idx = np.argsort(X[:,1])
print(X[idx, :])
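
The link between *sort* and *argsort* can be seen by using the result of *argsort* as a fancy index. In this sketch, reversing the index with [::-1] is one possible way to sort in descending order:

```python
import numpy as np

heights = np.array([5, 8, 7, 3, 1, 2])
order = np.argsort(heights)

print(order.tolist())                 # [4, 5, 3, 0, 2, 1]
print(heights[order].tolist())        # [1, 2, 3, 5, 7, 8], same as np.sort(heights)
print(heights[order[::-1]].tolist())  # [8, 7, 5, 3, 2, 1], descending
```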

**EXERCISES**

**Exer 1**

X_cat shows numerical variables, but these actually encode specific categories. Adapt this array to show the full category name instead. Use the following conversions:

* sex: 0 --> 'female', 1 --> 'male' 
* cp (chest pain type): 0 --> 'typical angina', 1 --> 'atypical angina', 2 --> 'non-anginal pain', 3 --> 'asymptomatic'
* fbs (fasting blood sugar): 0 --> 'low', 1 --> 'high'
* restecg (ST-T wave abnormality): 0 --> 'normal', 1 --> 'abnormal'

In order to use strings in an array, you need to specify the space that will be reserved. Using dtype=str will only reserve a single character. If you want more characters, you can do this the following way:

- np.zeros((3, 2), dtype='<U5')

This creates a 3 by 2 array in which each element can hold up to 5 characters. Anything longer will be cut off after the 5th character.
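
A minimal sketch of this truncation, using an illustrative array name:

```python
import numpy as np

print(np.zeros(2, dtype=str).dtype)  # <U1: only one character per element

names = np.zeros((2,), dtype='<U5')  # up to 5 characters per element
names[0] = 'male'
names[1] = 'female'                  # 6 characters, so the last one is cut off
print(names.tolist())                # ['male', 'femal']
```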

Hint: one way to solve this exercise is to first create a new array with the correct type and then fill this one up using the slicing mask to select the correct values.

In [None]:
#...

**Exer 2**

What percentage of patients have typical angina? Is this the same for each gender?

In [None]:
#...