# Introduction to Data Science PC Lab 02: Basics of Numpy Arrays

Author: Jan Verwaeren - Arne Deloose

Course: Introduction to Data Science

Welcome back, dear reader. In this notebook, we will cover the basics of Numpy arrays. As usual, each section (three sections once again) starts with illustrations, followed by exercises. The illustrations in this notebook match the theory slides for lecture 2 (with slight deviations) and can be used to follow along.

The code below detects whether the notebook is running in Google Colab. This will be important later, when we load arrays from files.

In [2]:
try:
    import google.colab
    in_colab = True #the import only succeeds when the code runs in google colab
except ImportError:
    in_colab = False #otherwise, the notebook is running somewhere else

**What is Numpy?**

Before we start, let us briefly discuss what Numpy is. Numpy is a Python library that adds support for arrays and matrices. Libraries are collections of functions that extend the capabilities of Python. We will see many other libraries during this course. You may have noticed that we already used the *requests* library in PC lab 01 to load data from a URL.

If a library is already installed, it can be imported very easily with the *import* statement, as illustrated below. With *as*, we can give it a shorter name. The convention is to shorten numpy to *np*.

In [3]:
#import the numpy module and call it np
import numpy as np

If a library is not yet installed, it needs to be downloaded first using the package manager *pip*. If we want the *jellyfish* library for example (this library is used to phonetically match strings), we need to install it first using:

    !pip install jellyfish

Before we can import it with:

    import jellyfish

Commonly used libraries like *Numpy* are installed by default on Colab, so we do not need to worry about this for now.

To use a function from a library, we write *library.function()*, for example *np.array()* or *jellyfish.levenshtein_distance()*.

## 1. Creating Numpy arrays

With Numpy, we can work with arrays, so the logical first step is to create them. This can be done in three different ways: with *np.array*, with another array-creating function, or by loading from a file.

**ILLUSTRATION**

**1.a. With *np.array***

The most basic way to create an array is with the function *np.array()*. The first input of this function is the object we want to convert. To get a single row, we give the elements separated by commas and enclosed in square brackets, as illustrated below. Multiple rows are given as separate bracketed rows. However, these must be enclosed again in an outer pair of square brackets, because the object must be given as a single input. Without the extra brackets, the second row would be considered a second input.

In [7]:
#using np.array
A = np.array([1, 2, 3]) #one row
B = np.array([[1, 2, 3],[4, 5, 6]]) #two rows
A.dtype #show type

dtype('int32')

In [13]:
B #show B

array([[1, 2, 3],
       [4, 5, 6]])
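
To see why the outer brackets matter, the sketch below compares the correct call with a call that omits them. The variable names are illustrative only. Without the outer brackets, the second list is passed as the second argument of *np.array* (the dtype), so the call fails.

```python
import numpy as np

# Correct: one nested list, so both rows arrive as a single input
two_rows = np.array([[1, 2, 3], [4, 5, 6]])
print(two_rows.shape)

# Without the outer brackets, the second list is treated as the dtype argument
try:
    np.array([1, 2, 3], [4, 5, 6])
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print("Error:", err)
```

The exact error message depends on the Numpy version, which is why the sketch only checks that an error is raised.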

Once again, the type is detected automatically, this time by Numpy. The most general type is always used, so if one number is given as a float, the entire array becomes a float type.

In [23]:
B = np.array([[1, 2.0, 3],[4, 5, 6]]) #two rows
print(B)
print(B.dtype)

[[1. 2. 3.]
 [4. 5. 6.]]
float64
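
The same rule applies to other combinations of types. The following sketch uses made-up example values to show which dtype Numpy picks when the inputs are mixed. The exact integer and string widths can differ per platform, so only the general kind of type is of interest here.

```python
import numpy as np

# Each array mixes different Python types; Numpy picks the most general one
examples = {
    "integers only": np.array([1, 2, 3]),
    "integer + float": np.array([1, 2.5, 3]),
    "boolean + integer": np.array([True, 2, 3]),
    "number + string": np.array([1, 2.5, "three"]),
}

for name, arr in examples.items():
    print(name, "->", arr.dtype, arr)
```

Note that mixing numbers with a string turns every element into a string, which is rarely what you want for numerical work.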


Note that dtype is an attribute of our array, not a function (no brackets needed). *dtype* is not the same as *type*, because the type of B is:

In [24]:
type(B)

numpy.ndarray

This type is the same for every numpy array, even if the internal type (the dtype) is different. For example:

In [28]:
#different array type
#the dtype is inferred from the values, analogous to a single variable
A = np.array([True, False, False])
print(A.dtype)
print(type(A))

bool
<class 'numpy.ndarray'>


With the *dtype* argument, we can explicitly define the type as well.

In [31]:
A = np.array([1, 2, 3], dtype='float') #one row
A

array([1., 2., 3.])

**1.b. Other functions**

There are several other functions to create arrays. *zeros* can be used to create an array filled with zeros, *arange* and *linspace* can fill in numbers between two points, and the *random* module fills in random numbers. To get floats, we need *random.random* (there are other random generators, such as *random.randint* for random integers).

In [6]:
A = np.zeros((5, 3), dtype = 'float') # 5 rows, 3 columns of zeros

B = np.arange(0, 10, 2)               # from 0 to (just not) 10 in steps of 2

C = np.linspace(0, 1, 5)              # 5 equally spaced between 0 and 1

D = np.random.random((3, 3))          # 3 by 3 matrix of random numbers in [0, 1)

In [7]:
B

array([0, 2, 4, 6, 8])
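
The difference between *arange* and *linspace* is easy to overlook: *arange* takes a step size and excludes the end point, while *linspace* takes a number of points and includes the end point. A minimal sketch with illustrative variable names:

```python
import numpy as np

steps = np.arange(0, 1, 0.25)     # step size 0.25, end point 1 is excluded
points = np.linspace(0, 1, 5)     # 5 points, end point 1 is included
ints = np.random.randint(0, 10, size=(2, 3))  # random integers from 0 to 9

print(steps)    # [0.   0.25 0.5  0.75]
print(points)   # [0.   0.25 0.5  0.75 1.  ]
print(ints.shape)
```

As with *arange*, the upper bound of *random.randint* is not included.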

**1.c. from a file**

Finally, arrays can be loaded from a file using *np.loadtxt*. Despite its name, it can read other delimited text files as well (such as *.csv*), not just *.txt*. The first argument of this function is the location, which can be local, or it can be a network location (in Colab, network locations are recommended). *delimiter* specifies the separator (usually a comma). The *dtype* can be provided as well, and *skiprows* will skip a number of rows (useful if you have a header).

In [8]:
#notebook running locally:
if not in_colab:
    X = np.loadtxt("files_IDS/iris_features.csv",
                    delimiter = ",",
                    skiprows=1)

    Y = np.loadtxt("files_IDS/iris_labels.csv", delimiter = ",",
                    skiprows=1, 
                    dtype = str)
#google colab
else:
    #loadtxt can work with network locations
    X = np.loadtxt("https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_features.csv",
                    delimiter = ",",
                    skiprows=1)

    Y = np.loadtxt("https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/iris_labels.csv", 
                    delimiter = ",",
                    skiprows=1, 
                    dtype = str)
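
If you want to experiment with *np.loadtxt* without downloading anything, you can feed it an in-memory text file. The sketch below is an illustration only: the three-line dataset is made up, and *usecols* is an additional *np.loadtxt* argument, not used elsewhere in this notebook, that selects which columns to read.

```python
import io
import numpy as np

# A made-up, miniature csv file with a header row
csv_text = """sepal_length,sepal_width,label
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

# Numerical columns: skip the header and read columns 0 and 1 as floats
features = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                      skiprows=1, usecols=(0, 1))

# Text column: read column 2 as strings
labels = np.loadtxt(io.StringIO(csv_text), delimiter=",",
                    skiprows=1, usecols=2, dtype=str)

print(features)
print(labels)
```

Reading numbers and text separately mirrors the way the iris features and labels are stored in two files above.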

In [9]:
print(X)
print(Y)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

**EXERCISES**

**Exer 1**

Create an array with the numbers 0 to 21 in steps of 3 (including the endpoints 0 and 21).

In [56]:
M = np.arange(0, 22, 3)
M

array([ 0,  3,  6,  9, 12, 15, 18, 21])

**Exer 2**

Read in the file heart_reduced from GitHub. This dataset contains a number of variables (such as age and blood pressure) that can be used to try and predict heart disease.

In [11]:
#local method
if not in_colab:
    heart = np.loadtxt("files_IDS/heart_reduced.csv", 
                       delimiter = ",",
                       skiprows=1)
else:
    heart = np.loadtxt("https://raw.githubusercontent.com/jverwaer/IntroDataScience/main/PCLabs/files_IDS/heart_reduced.csv",
                       delimiter = ",",
                       skiprows=1)

## 2. Data Manipulation

**ILLUSTRATION**

**2.a. attributes**

In [12]:
# get nr of dimensions
print(Y.ndim) 
print(X.ndim)
# get shape -> result is always a tuple
print(Y.shape)
print(X.shape)

1
2
(150,)
(150, 4)


**2.b. Indexing**

In [13]:
#accessing values
print(Y[1])          # square brackets with positive integer index
print(X[1, 2])       # square brackets with positive integer indices
print(Y[-1])         # negative index -> index backward from the end of axis
print(X[1, -1])      # negative index -> index backward from the end of axis

setosa
1.4
virginica
0.2


In [14]:
#changing values
Y[1] = 'hello'
print(Y)
X[1, 2] = 9.9
print(X)

['setosa' 'hello' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 've

**Note that values can only be changed to a value of the array's own type. Inserting a string into X is not possible, because X holds floats.**
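
The sketch below illustrates this on small, made-up arrays. Besides the error for strings in a float array, it shows two cases where Numpy silently adapts the new value to the existing dtype instead of raising an error.

```python
import numpy as np

# A string cannot be stored in a float array
numbers = np.array([1.0, 2.0, 3.0])
try:
    numbers[0] = "hello"
    raised = False
except ValueError as err:
    raised = True
    print("Error:", err)

# A float assigned into an integer array loses its decimals
whole = np.array([1, 2, 3])
whole[0] = 9.9
print(whole)     # [9 2 3]

# A long string assigned into a short string array is cut off
words = np.array(["ab", "cd"])
words[0] = "hello"
print(words)     # ['he' 'cd']
```

The silent cases are the more dangerous ones, so it is worth checking the dtype before overwriting values.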

**2.c. Slicing**

Indexing: single value

Slicing: multiple values

In [15]:
print(Y[0:2]) 
print(X[0:2, 1:4])

['setosa' 'hello']
[[3.5 1.4 0.2]
 [3.  9.9 0.2]]
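
One property of slices that is useful to know: a slice of a Numpy array is a view on the original data, not a copy. Changing the slice therefore changes the original array. A minimal sketch with illustrative names:

```python
import numpy as np

data = np.arange(10)

# A slice is a view: writing to it also changes `data`
view = data[2:5]
view[0] = 99
print(data)      # [ 0  1 99  3  4  5  6  7  8  9]

# An explicit copy is independent of the original
independent = data[2:5].copy()
independent[1] = -1
print(data[3])   # 3
```

Use *.copy()* whenever you want to modify a part of an array without touching the original.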


In [16]:
print(Y[:2]) #from the start up to (but not including) index 2
print(Y[2:]) #from index 2 until the end

['setosa' 'hello']
['setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa' 'setosa'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'versicolor' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' '

In [17]:
print(X[1, :])     # extract one row
print(X[:, 2])     # extract one column

[4.9 3.  9.9 0.2]
[1.4 9.9 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 1.7 1.5 1.7 1.5 1.  1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.
 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.  4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.
 4.9 4.7 4.3 4.4 4.8 5.  4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.
 4.4 4.6 4.  3.3 4.2 4.2 4.2 4.3 3.  4.1 6.  5.1 5.9 5.6 5.8 6.6 4.5 6.3
 5.8 6.1 5.1 5.3 5.5 5.  5.1 5.3 5.5 6.7 6.9 5.  5.7 4.9 6.7 4.9 5.7 6.
 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
 5.7 5.2 5.  5.2 5.4 5.1]


In [18]:
print(X[0:150:2, 0:4:2]) #every second row (index 0, 2, ..., 148) and every second column (index 0 and 2)

[[5.1 1.4]
 [4.7 1.3]
 [5.  1.4]
 [4.6 1.4]
 [4.4 1.4]
 [5.4 1.5]
 [4.8 1.4]
 [5.8 1.2]
 [5.4 1.3]
 [5.7 1.7]
 [5.4 1.7]
 [4.6 1. ]
 [4.8 1.9]
 [5.  1.6]
 [5.2 1.4]
 [4.8 1.6]
 [5.2 1.5]
 [4.9 1.5]
 [5.5 1.3]
 [4.4 1.3]
 [5.  1.3]
 [4.4 1.3]
 [5.1 1.9]
 [5.1 1.6]
 [5.3 1.5]
 [7.  4.7]
 [6.9 4.9]
 [6.5 4.6]
 [6.3 4.7]
 [6.6 4.6]
 [5.  3.5]
 [6.  4. ]
 [5.6 3.6]
 [5.6 4.5]
 [6.2 4.5]
 [5.9 4.8]
 [6.3 4.9]
 [6.4 4.3]
 [6.8 4.8]
 [6.  4.5]
 [5.5 3.8]
 [5.8 3.9]
 [5.4 4.5]
 [6.7 4.7]
 [5.6 4.1]
 [5.5 4.4]
 [5.8 4. ]
 [5.6 4.2]
 [5.7 4.2]
 [5.1 3. ]
 [6.3 6. ]
 [7.1 5.9]
 [6.5 5.8]
 [4.9 4.5]
 [6.7 5.8]
 [6.5 5.1]
 [6.8 5.5]
 [5.8 5.1]
 [6.5 5.5]
 [7.7 6.9]
 [6.9 5.7]
 [7.7 6.7]
 [6.7 5.7]
 [6.2 4.8]
 [6.4 5.6]
 [7.4 6.1]
 [6.4 5.6]
 [6.1 5.6]
 [6.3 5.6]
 [6.  4.8]
 [6.7 5.6]
 [5.8 5.1]
 [6.7 5.7]
 [6.3 5. ]
 [6.2 5.4]]


In [19]:
#change multiple values
X[0, 1:3] = 10 #change multiple values to the same one
X[1, 1:4] = [1, 2, 3] #change each value to another value
print(X)

[[ 5.1 10.  10.   0.2]
 [ 4.9  1.   2.   3. ]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.2]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.6  1.4  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5

**2.d. Reshaping**

In [20]:
A = np.arange(1, 13) # 1D array
#reshape does not change A in place but returns the reshaped array, so assign the result to a variable
A = A.reshape((3, 4)) #reshape
print(A)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


In [21]:
#change order
A = np.arange(1, 13) # 1D array
A = A.reshape((3, 4), order='F') #reshape, fill in per column
print(A)

[[ 1  4  7 10]
 [ 2  5  8 11]
 [ 3  6  9 12]]


**Note**: There is a difference between a vector (1D) and an array with only a single row/column (2D)
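
The difference is visible in the shape. The sketch below, with illustrative names, turns the same three numbers into a vector, a single-row array and a single-column array. The value -1 in *reshape* tells Numpy to work out that dimension itself.

```python
import numpy as np

vector = np.arange(3)               # 1D, shape (3,)
row = vector.reshape((1, -1))       # 2D with one row, shape (1, 3)
column = vector.reshape((-1, 1))    # 2D with one column, shape (3, 1)

print(vector.shape, row.shape, column.shape)

# Indexing behaves differently as well
print(vector[0])   # a single number
print(row[0])      # the entire (only) row, itself a 1D array
```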

**2.e. Joining and splitting**

In [22]:
#concatenate
X_extra = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
X_new = np.concatenate([X_extra, X]) 
print(X_new)

[[ 1.   2.   3.   4. ]
 [ 5.   6.   7.   8. ]
 [ 5.1 10.  10.   0.2]
 [ 4.9  1.   2.   3. ]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.2]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.6  1.4  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3

In [23]:
Y_extra = np.array(['new', 'new'])
Y_new = np.concatenate([Y_extra, Y]) 

In [24]:
#using axis
X_new2 = np.concatenate((X, X[:,0].reshape(-1, 1)), axis = 1)
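
The role of *axis* is easier to follow on small arrays. The sketch below uses made-up values. It also shows *np.split*, which is not used in the cells above and serves here as one possible way to undo a concatenation.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6]])

# axis=0 (the default): add rows, the number of columns must match
stacked_rows = np.concatenate([top, bottom], axis=0)
print(stacked_rows.shape)    # (3, 2)

# axis=1: add columns, the number of rows must match
extra_col = np.array([10, 20]).reshape(-1, 1)
stacked_cols = np.concatenate([top, extra_col], axis=1)
print(stacked_cols.shape)    # (2, 3)

# np.split cuts an array at the given indices (here: before row 1)
first, rest = np.split(stacked_rows, [1])
print(first)
print(rest)
```

The reshape to a single column is needed because *np.concatenate* requires all inputs to have the same number of dimensions.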

**EXERCISES**

**Exer 1**

Split the heart disease dataset into three parts. X_num should contain the numerical variables: age (first column), trestbps (resting blood pressure, column four) and col (cholesterol level, column five). X_cat should contain the categorical variables: sex (second column), cp (chest pain type, third column), fbs (fasting blood sugar high or low, sixth column) and restecg (resting electrocardiographic anomalies, seventh column). Finally, the last variable is the target (whether the patient has a heart disease); name this Y_heart.

In [25]:
#allocate a zeroes array
X_num = np.zeros((heart.shape[0], 3))
#fill in first column
X_num[:, 0] = heart[:, 0]
#fill in other two columns
X_num[:, 1:] = heart[:, 3:5]

#same method for X_cat
#allocate a zeroes array
X_cat = np.zeros((heart.shape[0], 4))
#fill in the first two columns
X_cat[:, 0:2] = heart[:, 1:3] #sex and cp
#fill in the last two columns
X_cat[:, 2:] = heart[:, 5:7] #fbs and restecg

Y_heart = heart[:, -1] #last column
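
A shorter alternative uses a list of column indices (fancy indexing, see section 3.d). The sketch below is an illustration on a hypothetical toy array that stands in for *heart*; it assumes the same column order as described in the exercise.

```python
import numpy as np

# Hypothetical stand-in for `heart`: 3 patients, 8 columns
toy = np.arange(24).reshape((3, 8))

num_part = toy[:, [0, 3, 4]]       # age, trestbps, cholesterol
cat_part = toy[:, [1, 2, 5, 6]]    # sex, cp, fbs, restecg
target = toy[:, -1]                # last column

print(num_part)
print(cat_part)
print(target)
```

Indexing with a list returns a new array, so no zeros array has to be allocated first.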

## 3. Computation

**ILLUSTRATION**

**3.a. Universal functions**

In [26]:
#square 2nd column

#option 1
A = np.zeros((150,))
for i in range(150):
    A[i] = X[i,1]**2

#option 2
B = X[:,1]**2

#option 3
B = np.power(X[:,1], 2)    

In [27]:
#other arithmetic functions
print(X[:,1] + 5) # short for np.add(X[:,1], 5)
print(X[:,1] + X[:,2]) # short for np.add(X[:,1], X[:,2])
print(X[:,1] - 2) # subtract
print(X[:,1] * 2) # multiply
print(X[:,1] / 2) # divide
print(X[:,1] ** 2) # square

[15.   6.   8.2  8.1  8.6  8.9  8.4  8.4  7.9  8.1  8.7  8.4  8.   8.
  9.   9.4  8.9  8.5  8.8  8.8  8.4  8.7  8.6  8.3  8.4  8.   8.4  8.5
  8.4  8.2  8.1  8.4  9.1  9.2  8.1  8.2  8.5  8.6  8.   8.4  8.5  7.3
  8.2  8.5  8.8  8.   8.8  8.2  8.7  8.3  8.2  8.2  8.1  7.3  7.8  7.8
  8.3  7.4  7.9  7.7  7.   8.   7.2  7.9  7.9  8.1  8.   7.7  7.2  7.5
  8.2  7.8  7.5  7.8  7.9  8.   7.8  8.   7.9  7.6  7.4  7.4  7.7  7.7
  8.   8.4  8.1  7.3  8.   7.5  7.6  8.   7.6  7.3  7.7  8.   7.9  7.9
  7.5  7.8  8.3  7.7  8.   7.9  8.   8.   7.5  7.9  7.5  8.6  8.2  7.7
  8.   7.5  7.8  8.2  8.   8.8  7.6  7.2  8.2  7.8  7.8  7.7  8.3  8.2
  7.8  8.   7.8  8.   7.8  8.8  7.8  7.8  7.6  8.   8.4  8.1  8.   8.1
  8.1  8.1  7.7  8.2  8.3  8.   7.5  8.   8.4  8. ]
[20.   3.   4.5  4.6  5.   5.6  4.8  4.9  4.3  4.6  5.2  5.   4.4  4.1
  5.2  5.9  5.2  4.9  5.5  5.3  5.1  5.2  4.6  5.   5.3  4.6  5.   5.
  4.8  4.8  4.7  4.9  5.6  5.6  4.6  4.4  4.8  5.   4.3  4.9  4.8  3.6
  4.5  5.1  5.7  4.4  5.4  

In [28]:
#absolute value, trigonometric, exponents and log functions
a = np.array([-1, 0.5, 1.25, 2.0])
print(np.abs(a)) # absolute value of each element
print(np.sin(a)) # sine of each element (angle assumed in radians)
print(np.cos(a)) # cosine of each element (angle assumed in radians)
print(np.exp(a)) # exponential function: e^a
print(np.log(a)) # natural logarithm (ln(a))

[1.   0.5  1.25 2.  ]
[-0.84147098  0.47942554  0.94898462  0.90929743]
[ 0.54030231  0.87758256  0.31532236 -0.41614684]
[0.36787944 1.64872127 3.49034296 7.3890561 ]
[        nan -0.69314718  0.22314355  0.69314718]
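
The first value of the last output is *nan* (not a number), because the logarithm of a negative number is undefined. Numpy also prints a warning in that case. The sketch below shows one possible way to handle this: *np.errstate* suppresses the warning for a block of code and *np.isnan* locates the undefined results.

```python
import numpy as np

a = np.array([-1, 0.5, 1.25, 2.0])

# Temporarily silence the "invalid value" warning for the negative input
with np.errstate(invalid="ignore"):
    logs = np.log(a)

n_invalid = np.isnan(logs).sum()   # number of undefined results
print(logs)
print(n_invalid)                   # 1
```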




**3.b. Aggregations**

In [29]:
print(X.mean(axis = 0)) #column means
print(X.mean(axis = 1)) #row means

[5.84333333 3.08733333 3.81933333 1.218     ]
[6.325 2.725 2.35  2.35  2.55  2.85  2.425 2.525 2.225 2.4   2.7   2.5
 2.325 2.125 2.8   3.    2.75  2.575 2.875 2.675 2.675 2.675 2.35  2.65
 2.575 2.45  2.6   2.6   2.55  2.425 2.425 2.675 2.725 2.825 2.425 2.4
 2.625 2.5   2.225 2.55  2.525 2.1   2.275 2.675 2.8   2.375 2.675 2.35
 2.675 2.475 4.075 3.9   4.1   3.275 3.85  3.575 3.975 2.9   3.85  3.3
 2.875 3.65  3.3   3.775 3.35  3.9   3.65  3.4   3.6   3.275 3.925 3.55
 3.8   3.7   3.725 3.85  3.95  4.1   3.725 3.2   3.2   3.15  3.4   3.85
 3.6   3.875 4.    3.575 3.5   3.325 3.425 3.775 3.4   2.9   3.45  3.525
 3.525 3.675 2.925 3.475 4.525 3.875 4.525 4.15  4.375 4.825 3.4   4.575
 4.2   4.85  4.2   4.075 4.35  3.8   4.025 4.3   4.2   5.1   4.875 3.675
 4.525 3.825 4.8   3.925 4.45  4.55  3.9   3.95  4.225 4.4   4.55  5.025
 4.25  3.925 3.925 4.775 4.425 4.2   3.9   4.375 4.45  4.35  3.875 4.55
 4.55  4.3   3.925 4.175 4.325 3.95 ]


![alt text](files_IDS/aggregations.png "Aggregations")

In [30]:
#standardise (broadcasting)
X_standardized = (X - X.mean(axis = 0)) / X.std(axis=0)
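
Broadcasting means that the array of column means (one value per column) is automatically repeated for every row before the subtraction. The sketch below checks the result on a small, made-up array: after standardising, every column should have mean 0 and standard deviation 1.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

col_means = data.mean(axis=0)   # shape (2,): one mean per column
col_stds = data.std(axis=0)     # shape (2,): one std per column

# The (3, 2) array and the (2,) arrays are combined row by row
standardized = (data - col_means) / col_stds

print(standardized.mean(axis=0))   # approximately [0. 0.]
print(standardized.std(axis=0))    # approximately [1. 1.]
```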

**3.c. Comparisons, masks and boolean logic**

In [31]:
#comparisons
A = X[:,0] < 5
print(A)
print(np.sum(A)) #number of True values (True: 1, False: 0)

[False  True  True  True False False  True False  True  True False  True
  True  True False False False False False False False False  True False
  True False False False False  True  True False False False  True False
 False  True  True False False  True  True False False  True False  True
 False False False False False False False False False  True False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False  True False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False]
22


![alt text](files_IDS/comparison.png "Comparison")

In [32]:
A = np.arange(1, 10).reshape((3,3))
print(A >= 5) # elements that are greater than or equal to 5
print(A % 3 == 0) # remainder after division by 3 equals zero

[[False False False]
 [False  True  True]
 [ True  True  True]]
[[False False  True]
 [False False  True]
 [False False  True]]


In [33]:
#boolean logic
print((A > 5)  &  (A % 2 == 0)) #AND
print((A > 5)  |  (A % 2 == 0)) #OR
print(~(A > 5)) #NOT

[[False False False]
 [False False  True]
 [False  True False]]
[[False  True False]
 [ True False  True]
 [ True  True  True]]
[[ True  True  True]
 [ True  True False]
 [False False False]]


In [34]:
print((X[:,0] < 5) & (X[:,2] < 1.5)) #slicing + comparison + boolean logic
print(np.sum((X[:,0] < 5) & (X[:,2] < 1.5)))

[False False  True False False False  True False  True False False False
  True  True False False False False False False False False  True False
 False False False False False False False False False False False False
 False  True  True False False  True  True False False  True False  True
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False]
12
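
Two details are worth pointing out. The element-wise operators are `&`, `|` and `~`, not the Python keywords *and*, *or* and *not*. The parentheses around each comparison are also required, because `&` is evaluated before `<` or `==`. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([1, 4, 6, 9])

# Element-wise AND: each comparison sits in its own parentheses
both = (values > 3) & (values % 2 == 0)
print(both)    # [False  True  True False]

# The keyword `and` needs a single True/False and fails on arrays
try:
    (values > 3) and (values % 2 == 0)
    raised = False
except ValueError as err:
    raised = True
    print("Error:", err)
```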


In [35]:
#masks
print(A[A % 2 == 0])
print(A[(A > 5) | (A % 2 == 0)])

[2 4 6 8]
[2 4 6 7 8 9]


In [36]:
#masks combined with replacing values
A[A % 2 == 0] = 0
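
Assigning through a mask changes the array in place. If the original should stay intact, *np.where* is one possible alternative: it builds a new array, taking the first value where the condition is True and the second where it is False. A sketch with an illustrative array name:

```python
import numpy as np

grid = np.arange(1, 10).reshape((3, 3))

# Where the condition holds, use 0; elsewhere, keep the value from grid
replaced = np.where(grid % 2 == 0, 0, grid)

print(replaced)
print(grid[0, 1])   # 2: the original array is unchanged
```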

In [37]:
#masks as partial index
print(A[A[:,0] % 2 == 1, :])

mask = A[:,0] % 2 == 1 # using an intermediate variable for readability
print(A[mask, :])

[[1 0 3]
 [7 0 9]]
[[1 0 3]
 [7 0 9]]


In [38]:
#examples

#Compute average sepal length of all ‘short’ sepals ( < 5 )
mask = X[:,0] < 5
print(np.mean(X[mask, 0]))

#Compute average sepal length of all ‘setosa’ flowers
mask = Y == 'setosa'
print(np.mean(X[mask, 0]))

#Compute standard deviation of all variables for ‘setosa’ flowers
mask = Y == 'setosa'
print(np.std(X[mask, :], axis = 0))

4.690909090909091
5.0081632653061225
[0.35215763 1.00065368 1.21920287 0.10517632]


**3.d. fancy indexing and sorting**

In [39]:
#passing an array of indices
idx = [1, 2, 4]
print(B[idx])

#row + column
rowidx = [1, 2, 2]
colidx = [0, 0, 1]
print(A[rowidx, colidx])

#combined with slicing
rowidx = [0, 2]
print(A[rowidx, :])

[ 1.   10.24 12.96]
[0 7 0]
[[1 0 3]
 [7 0 9]]


In [40]:
#random
rand = np.random.RandomState(1) #generate a random state
#this makes the code reproducible
rowidx = rand.choice(150, 60, replace=False) #select 60 values from 0 to 149 without replacement
#no replacement = each value can only be selected once
print(X[rowidx, :])

[[5.8 4.  1.2 0.2]
 [5.1 2.5 3.  1.1]
 [6.6 3.  4.4 1.4]
 [5.4 3.9 1.3 0.4]
 [7.9 3.8 6.4 2. ]
 [6.3 3.3 4.7 1.6]
 [6.9 3.1 5.1 2.3]
 [5.1 3.8 1.9 0.4]
 [4.7 3.2 1.6 0.2]
 [6.9 3.2 5.7 2.3]
 [5.6 2.7 4.2 1.3]
 [5.4 3.9 1.7 0.4]
 [7.1 3.  5.9 2.1]
 [6.4 3.2 4.5 1.5]
 [6.  2.9 4.5 1.5]
 [4.4 3.2 1.3 0.2]
 [5.8 2.6 4.  1.2]
 [5.6 3.  4.5 1.5]
 [5.4 3.4 1.5 0.4]
 [5.  3.2 1.2 0.2]
 [5.5 2.6 4.4 1.2]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.  3.5 1.3 0.3]
 [7.2 3.2 6.  1.8]
 [5.7 2.8 4.1 1.3]
 [5.5 4.2 1.4 0.2]
 [5.1 3.8 1.5 0.3]
 [6.1 2.8 4.7 1.2]
 [6.3 2.5 5.  1.9]
 [6.1 3.  4.6 1.4]
 [7.7 3.  6.1 2.3]
 [5.6 2.5 3.9 1.1]
 [6.4 2.8 5.6 2.1]
 [5.8 2.8 5.1 2.4]
 [5.3 3.7 1.5 0.2]
 [5.5 2.3 4.  1.3]
 [5.2 3.4 1.4 0.2]
 [6.5 2.8 4.6 1.5]
 [6.7 2.5 5.8 1.8]
 [6.8 3.  5.5 2.1]
 [5.1 3.5 1.4 0.3]
 [6.  2.2 5.  1.5]
 [6.3 2.9 5.6 1.8]
 [6.6 2.9 4.6 1.3]
 [7.7 2.6 6.9 2.3]
 [5.7 3.8 1.7 0.3]
 [5.  3.6 1.4 0.2]
 [4.8 3.  1.4 0.3]
 [5.2 2.7 3.9 1.4]
 [5.1 3.4 1.5 0.2]
 [5.5 3.5 1.3 0.2]
 [7.7 3.8 6.

In [41]:
#sorting
B = np.array([5, 8, 7, 3, 1, 2])
print(np.sort(B)) #sort ascending
print(np.argsort(B)) #indices that would sort the array

[1 2 3 5 7 8]
[4 5 3 0 2 1]
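
The indices returned by *argsort* can be used for fancy indexing. This makes it possible to sort in descending order, or to sort a second array in the same order as the first. The sketch below uses made-up names and values.

```python
import numpy as np

scores = np.array([5, 8, 7, 3, 1, 2])
names = np.array(["a", "b", "c", "d", "e", "f"])

order = np.argsort(scores)       # indices from smallest to largest score

ascending = scores[order]        # same result as np.sort(scores)
descending = scores[order[::-1]] # reverse the indices for descending order
names_sorted = names[order]      # reorder a second array the same way

print(ascending)
print(descending)
print(names_sorted)
```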


In [42]:
#Sorting rows of a 2D array on a column
idx = np.argsort(X[:,1])
print(X[idx, :])

[[ 4.9  1.   2.   3. ]
 [ 5.   2.   3.5  1. ]
 [ 6.   2.2  4.   1. ]
 [ 6.   2.2  5.   1.5]
 [ 6.2  2.2  4.5  1.5]
 [ 5.   2.3  3.3  1. ]
 [ 4.5  2.3  1.3  0.3]
 [ 5.5  2.3  4.   1.3]
 [ 6.3  2.3  4.4  1.3]
 [ 4.9  2.4  3.3  1. ]
 [ 5.5  2.4  3.8  1.1]
 [ 5.5  2.4  3.7  1. ]
 [ 6.3  2.5  5.   1.9]
 [ 4.9  2.5  4.5  1.7]
 [ 5.7  2.5  5.   2. ]
 [ 6.3  2.5  4.9  1.5]
 [ 5.1  2.5  3.   1.1]
 [ 5.6  2.5  3.9  1.1]
 [ 5.5  2.5  4.   1.3]
 [ 6.7  2.5  5.8  1.8]
 [ 5.5  2.6  4.4  1.2]
 [ 7.7  2.6  6.9  2.3]
 [ 5.7  2.6  3.5  1. ]
 [ 6.1  2.6  5.6  1.4]
 [ 5.8  2.6  4.   1.2]
 [ 5.8  2.7  5.1  1.9]
 [ 5.2  2.7  3.9  1.4]
 [ 5.6  2.7  4.2  1.3]
 [ 6.   2.7  5.1  1.6]
 [ 6.4  2.7  5.3  1.9]
 [ 6.3  2.7  4.9  1.8]
 [ 5.8  2.7  5.1  1.9]
 [ 5.8  2.7  4.1  1. ]
 [ 5.8  2.7  3.9  1.2]
 [ 6.4  2.8  5.6  2.2]
 [ 5.7  2.8  4.5  1.3]
 [ 7.4  2.8  6.1  1.9]
 [ 6.8  2.8  4.8  1.4]
 [ 6.2  2.8  4.8  1.8]
 [ 6.4  2.8  5.6  2.1]
 [ 6.1  2.8  4.7  1.2]
 [ 7.7  2.8  6.7  2. ]
 [ 5.7  2.8  4.1  1.3]
 [ 5.6  2.8

**EXERCISES**

**Exer 1**

Adapt X_cat to show strings instead. Use the following conversions:

* sex: 0 --> 'female', 1 --> 'male' 
* cp (chest pain type): 0 --> 'typical angina', 1 --> 'atypical angina', 2 --> 'non-anginal pain', 3 --> 'asymptomatic'
* fbs (fasting blood sugar): 0 --> 'low', 1 --> 'high'
* restecg (ST-T wave abnormality): 0 --> 'normal', 1 --> 'abnormal'

When you allocate a string array in advance (for example with *np.zeros*), you need to specify how much space is reserved for each string. Using dtype=str will only reserve a single character. If you want more characters, you can specify this as follows:

- np.zeros((3, 2), dtype='<U5')

This creates a 3 by 2 array in which each element can hold up to 5 characters. Anything longer will be cut off after the 5th character.
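
A minimal sketch of this behaviour, with illustrative variable names:

```python
import numpy as np

# Room for 5 characters per element
short = np.zeros((3, 2), dtype="<U5")
short[0, 0] = "asymptomatic"
print(short[0, 0])    # asymp

# dtype=str reserves a single character per element
single = np.zeros(2, dtype=str)
single[0] = "male"
print(single.dtype)   # <U1
print(single[0])      # m
```

No error or warning is given when a string is cut off, so it is safer to reserve more space than you expect to need.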

In [59]:
X_cat_new = np.zeros((X_cat.shape[0], X_cat.shape[1]), dtype='<U30')

#sex
mask = X_cat[:, 0]==0 #get the females
X_cat_new[mask, 0] = 'female' #replace the values with 'female'
X_cat_new[~mask, 0] = 'male' #replace all other values with 'male'

#fbs
mask = X_cat[:, 2]==0
X_cat_new[mask, 2] = 'low'
X_cat_new[~mask, 2] = 'high'

#restecg
mask = X_cat[:, 3]==0
X_cat_new[mask, 3] = 'normal'
X_cat_new[~mask, 3] = 'abnormal'

#cp
mask = X_cat[:, 1]==0
X_cat_new[mask, 1] = 'typical angina'
mask = X_cat[:, 1]==1
X_cat_new[mask, 1] = 'atypical angina'
mask = X_cat[:, 1]==2
X_cat_new[mask, 1] = 'non-anginal pain'
mask = X_cat[:, 1]==3
X_cat_new[mask, 1] = 'asymptomatic'
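
For a variable with many categories, such as cp, one mask per category becomes repetitive. A possible alternative, sketched below on hypothetical codes, stores the names in an array and uses the integer codes as indices (fancy indexing).

```python
import numpy as np

# Hypothetical cp codes, stored as floats just like the columns of X_cat
cp_codes = np.array([0.0, 3.0, 1.0, 2.0, 0.0])

# Position i holds the name that belongs to code i
cp_names = np.array(["typical angina", "atypical angina",
                     "non-anginal pain", "asymptomatic"])

# Indices must be integers, so convert the codes first
cp_labels = cp_names[cp_codes.astype(int)]
print(cp_labels)
```

This only works when the codes are consecutive integers starting at 0, which is the case for the conversions listed in this exercise.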

In [60]:
X_cat_new

array([['male', 'typical angina', 'low', 'abnormal'],
       ['male', 'typical angina', 'high', 'normal'],
       ['male', 'typical angina', 'low', 'abnormal'],
       ...,
       ['male', 'typical angina', 'low', 'normal'],
       ['female', 'typical angina', 'low', 'normal'],
       ['male', 'typical angina', 'low', 'abnormal']], dtype='<U30')

**Exer 2**

What percentage of patients have typical angina? Is this the same for each gender?

In [44]:
#patients with typical angina
result = np.sum(X_cat_new[:, 1] == 'typical angina')
#divide by total and multiply by 100
result = 100 * result/X_cat_new.shape[0]
print(result)

#per gender
result_female = np.sum((X_cat_new[:, 1] == 'typical angina') & (X_cat_new[:, 0] == 'female'))
#divide by total female and multiply by 100
result_female = 100 * result_female/np.sum(X_cat_new[:, 0]=='female')
print('female: ' + str(result_female))
#repeat for male
result_male = np.sum((X_cat_new[:, 1] == 'typical angina') & (X_cat_new[:, 0] == 'male'))
#divide by total male and multiply by 100
result_male = 100 * result_male/np.sum(X_cat_new[:, 0]=='male')
print('male: ' + str(result_male))

48.48780487804878
female: 42.62820512820513
male: 51.051893408134646
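
Because True counts as 1 and False as 0, the mean of a boolean array is the fraction of True values. This gives a more compact way to compute percentages. The sketch below uses a small, made-up set of patients, so its numbers are unrelated to the results above.

```python
import numpy as np

# Made-up data for five patients
cp = np.array(["typical angina", "asymptomatic", "typical angina",
               "atypical angina", "typical angina"])
sex = np.array(["male", "female", "female", "male", "male"])

is_typical = cp == "typical angina"

overall = 100 * np.mean(is_typical)                      # all patients
female = 100 * np.mean(is_typical[sex == "female"])      # females only
male = 100 * np.mean(is_typical[sex == "male"])          # males only

print(overall, female, male)
```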
