### APPF3 | Spring Semester 2020

# Storing and Operating on Data with NumPy
## Autosave Your Notebook
* Activate autosave for your current notebook by using `%autosave`:

In [None]:
%autosave 30

## NumPy: Numerical Python
* NumPy: Python library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
* NumPy documentation: https://docs.scipy.org/doc/ 
  * Use your NumPy version number to access the corresponding documentation

In [None]:
import numpy as np
np.__version__

* _Note_: We are going to use the `np` alias for the `numpy` module in all the code samples on the following slides

## NumPy Arrays
* Python's vanilla lists are heterogeneous: Each item in the list can be of a different data type
 * Comes at a cost: Each item in the list must contain its own type info and other information 
 * It is much more efficient to store data in a fixed-type array (all elements are of the same type)
* NumPy arrays are homogeneous: Each item in the array is of the same type
 * They are much more efficient for storing and manipulating data
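
A minimal sketch of this difference; the variable names are only chosen for illustration. When `np.array()` receives a mix of integers and floats, it picks one common type for all elements:

```python
import numpy as np

# A Python list can hold items of different types
mixed_list = [1, 2.5, "three"]
list_types = [type(item).__name__ for item in mixed_list]
print(list_types)

# A NumPy array stores a single type for all elements:
# integers mixed with floats are all converted to floats
mixed_array = np.array([1, 2.5, 3])
print(mixed_array.dtype)
```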

## Creating NumPy Arrays
* Use the `np.array()` function to create a NumPy array from a Python list:

In [None]:
example = np.array([0,1,2,3])
example

## Multidimensional NumPy Arrays
* _One-dimensional_ array: we only need one coordinate to address a single item, namely an integer index
* _Multidimensional_ array: we now need multiple indices to address a single item
 * For an $n$-dimensional array we need $n$ indices to address a single item
 * We're going to mainly work with two-dimensional arrays in this course, i.e. $n=2$ 

In [None]:
twodim = np.array([[1,2,3],
                   [4,5,6],
                   [7,8,9]])
twodim

## Array Indexing
* Array indexing for one-dimensional arrays works as usual: `onedim[0]`
* Accessing items in a two-dimensional array requires you to specify two indices: `twodim[0,1]`
* First index is the row number (here `0`), second index is the column number (here `1`)
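
The indexing rules above can be tried out as follows. This is only a sketch; `onedim` and the values used here are made up for illustration:

```python
import numpy as np

onedim = np.array([10, 20, 30, 40])
twodim = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

first_item = onedim[0]     # first element of the one-dimensional array
last_item = onedim[-1]     # negative indices count from the end
row0_col1 = twodim[0, 1]   # row 0, column 1
twodim[2, 0] = 70          # indexing can also be used to modify an item

print(first_item, last_item, row0_col1)
print(twodim)
```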

## Objects in Python
* Almost everything in Python is an object, with its properties and methods
 * For example, a dictionary is an object that provides an `items()` method, which can only be called on a dictionary object (which is the same as a value of the dictionary type, or a dictionary value)
* An object can also provide attributes next to methods, which may describe properties of the specific object
 * For example, for an array object it might be interesting to see how many elements it contains at the moment, so we might want to provide a size attribute storing information about this specific property
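
A short sketch of the difference between methods and attributes, using a made-up dictionary and array:

```python
import numpy as np

# A dictionary object provides the items() method
ages = {"Alice": 31, "Bob": 27}
pairs = list(ages.items())
print(pairs)

# An array object provides attributes that describe it, e.g. size
numbers = np.array([4, 8, 15, 16, 23, 42])
print(numbers.size)   # attribute: accessed without parentheses
print(numbers.max())  # method: called with parentheses
```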
 
### NumPy Array Attributes
* The type of a NumPy array is `numpy.ndarray` ($n$-dimensional array):

In [None]:
example = np.array([0,1,2,3])
type(example)

* Useful array attributes
 * `ndim`: The number of dimensions, e.g. 2 for a two-dimensional array
 * `shape`: Tuple containing the size of each dimension
 * `size`: The total size of the array (total number of elements)
 * `dtype`: The data type of the array elements

In [None]:
rng = np.random.RandomState(41) # Ensure that the same random numbers are generated each time we run this code
x1 = rng.randint(10, size=6) # One-dimensional array
x2 = rng.randint(10, size=(3, 4)) # Two-dimensional array
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x2 dtype: ", x2.dtype)

## Creating Arrays from Scratch
* NumPy provides a wide range of functions for the creation of arrays:<br>
  https://docs.scipy.org/doc/numpy-1.15.4/reference/routines.array-creation.html#routines-array-creation 
 * For example: `np.arange`, `np.zeros`, `np.ones`, `np.linspace`, etc.
* NumPy also provides functions to create arrays filled with random data:<br>
  https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.random.html
 * For example: `np.random.random`, `np.random.randint`, etc.

In [None]:
np.zeros(10, dtype=int)

In [None]:
np.ones((3, 5), dtype=float)

In [None]:
np.full((3, 5), 3.14)

In [None]:
np.arange(0, 20, 2)

In [None]:
np.linspace(0, 1, 5)

In [None]:
np.random.random((3, 3))

In [None]:
np.random.randint(0, 10, (3, 3))

## NumPy Data Types
* Use the keyword `dtype` to specify the data type of the array elements:

In [None]:
floats = np.array([0,1,2,3], dtype="float32")
floats

 * Overview of available data types: https://docs.scipy.org/doc/numpy-1.15.4/user/basics.types.html 
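
Besides setting `dtype` at creation time, an existing array can be converted to another type. The sketch below uses the `astype()` method, which is not covered on this slide, with made-up values:

```python
import numpy as np

ints = np.array([0, 1, 2, 3])
floats = ints.astype("float32")   # returns a new array with the requested type

# Converting floats to integers cuts off the fractional part
truncated = np.array([1.7, 2.2, 3.9]).astype(int)

print(ints.dtype, floats.dtype)
print(truncated)
```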

## Array Slicing: One-Dimensional Subarrays
* The NumPy slicing syntax follows that of the standard Python list: `x[start:stop:step]`

In [None]:
x = np.arange(10)
x

In [None]:
x[:5]

In [None]:
x[5:]

In [None]:
x[::-1]
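
A few more slices that combine `start`, `stop`, and `step`, as a sketch with illustrative variable names:

```python
import numpy as np

x = np.arange(10)

middle = x[4:7]           # elements at index 4, 5, and 6
every_other = x[::2]      # every second element, starting at index 0
stepped = x[1:8:3]        # start at index 1, stop before index 8, step 3
reversed_odd = x[::-2]    # every second element, starting from the end

print(middle, every_other, stepped, reversed_odd)
```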

## Array Slicing: Multidimensional Subarrays
* Let `x2` be a two-dimensional NumPy array. Multiple slices are now separated by commas: `x2[start:stop:step, start:stop:step]`

In [None]:
x2

In [None]:
x2[:2, :3]

In [None]:
x2[:3, ::2] # All rows, every other column

In [None]:
x2[:, 0] # Select the first column of x2

In [None]:
x2[1, :] # Select the second row of x2

In [None]:
x2[1] # Select the second row of x2

## Array Views and Copies
* With Python lists, slices are _copies_: If we modify the slice, the original list stays unchanged
* With NumPy arrays, slices are _views_: If we modify the subarray, the original array gets changed, too
 * Very useful: When working with large datasets, we don't need to copy any data (costly operation)
* Creating copies: We can use the `copy()` method of a slice to create a copy of the specific subarray
 * Note: The type of a slice is again `numpy.ndarray`

In [None]:
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy

In [None]:
x2_sub_copy[0, 0] = 42

In [None]:
x2

In [None]:
x2_sub_copy
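
For contrast, a sketch of what happens without `copy()`. The arrays here are made up so that `x2` stays untouched:

```python
import numpy as np

original = np.arange(12).reshape((3, 4))
view = original[:2, :2]     # a NumPy slice is a view on the same data
view[0, 0] = 99             # modifying the view ...
print(original[0, 0])       # ... also changes the original array

py_list = [0, 1, 2, 3]
list_slice = py_list[:2]    # a list slice is a copy
list_slice[0] = 99
print(py_list[0])           # the original list is unchanged
```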

## Reshaping
* We can use the `reshape()` method of a NumPy array to obtain an array with the same data but a different shape:

In [None]:
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

* For this to work, the size of the initial array must match the size of the reshaped array
* _Important_: `reshape()` will return a new view if possible; otherwise, it will be a copy
 * In case of a view, if you change an entry of the reshaped array, it will also change the initial array
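
Both points can be checked with a small sketch. The array names are illustrative, and `np.shares_memory()` is only used here to confirm that the reshaped array is a view in this particular case:

```python
import numpy as np

flat = np.arange(6)
table = flat.reshape((2, 3))   # 6 elements fit into 2 x 3
shares_memory = np.shares_memory(flat, table)

table[0, 0] = 100              # here reshape() returned a view ...
print(flat)                    # ... so the initial array changes as well

try:
    flat.reshape((4, 2))       # 6 elements do not fit into 4 x 2
except ValueError as error:
    print("Reshape failed:", error)
```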

## Array Concatenation and Splitting
* Concatenation, or joining of two or more arrays in NumPy, can be accomplished through the functions `np.concatenate`, `np.vstack`, and `np.hstack`
 * Join multiple two-dimensional arrays: `np.concatenate([twodim1, twodim2,…], axis=0)`
   * A two-dimensional array has two axes: The first running vertically downwards across rows (axis `0`), and the second running horizontally across columns (axis `1`)
* The opposite of concatenation is splitting, which is provided by the functions `np.split`, `np.hsplit` (split horizontally), and `np.vsplit` (split vertically)
 * For each of these we can pass a list of indices giving the split points

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

In [None]:
grid = np.array([[1, 2, 3], [4, 5, 6]])
np.concatenate([grid, grid])

In [None]:
np.concatenate([grid, grid], axis=1)

In [None]:
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

np.vstack([x, grid])

In [None]:
y = np.array([[99],
              [99]])

np.hstack([grid, y])

In [None]:
grid = np.arange(16).reshape((4, 4))
grid

In [None]:
upper, lower = np.vsplit(grid, [2])

In [None]:
upper

In [None]:
lower
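
The other two splitting functions work the same way. A sketch with made-up arrays; note that $k$ split points produce $k+1$ subarrays:

```python
import numpy as np

sequence = np.array([1, 2, 3, 99, 99, 3, 2, 1])
# Two split points give three parts: [:3], [3:5], and [5:]
part1, part2, part3 = np.split(sequence, [3, 5])
print(part1, part2, part3)

table = np.arange(16).reshape((4, 4))
left, right = np.hsplit(table, [2])   # split after the second column
print(left)
print(right)
```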

## Faster Operations Instead of Slow `for` Loops
* Looping over arrays to operate on each element can be quite slow in Python
* One reason the `for` loop approach is so slow is the type checking and function dispatching that must be done at each iteration
 * Python needs to examine the object's type and do a dynamic lookup of the correct function to use for that type

In [None]:
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)

compute_reciprocals(values)

In [None]:
big_array = np.random.randint(1, 100, size=10000)
%timeit compute_reciprocals(big_array)

## NumPy's Universal Functions
* NumPy provides very fast, vectorized operations which are implemented via _universal functions_ (ufuncs), whose main purpose is to quickly execute repeated operations on values in NumPy arrays
 * A _vectorized operation_ is written once for the whole array; NumPy then applies it to each element
* Instead of computing the reciprocals with a `for` loop, let's use a vectorized operation:

In [None]:
%timeit (1.0 / big_array)

 * We can use ufuncs to apply an operation between a scalar and an array, but we can also operate between two arrays

In [None]:
np.array([4,5,6]) / np.array([1,2,3])
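
The arithmetic operators used above are shorthand for ufuncs such as `np.add` or `np.divide`. A brief sketch with illustrative arrays:

```python
import numpy as np

a = np.array([4, 5, 6])
b = np.array([1, 2, 3])

# The + operator and np.add() perform the same element-wise addition
sum_operator = a + b
sum_ufunc = np.add(a, b)
print(sum_operator, sum_ufunc)

# Other ufuncs are only available as functions
roots = np.sqrt(np.array([1, 4, 9]))
magnitudes = np.abs(np.array([-2, 0, 2]))
print(roots, magnitudes)
```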

## Advanced Ufunc Features: Specifying Output and Aggregates
* ufuncs provide a few specialized features
* We can specify where to store a result by passing an existing array as the `out` argument (useful for large calculations)
* If no `out` argument is provided, a newly-allocated array is returned (can be costly memory-wise)

In [None]:
x = np.random.random(10)
y = np.zeros(10)
np.multiply(x, 3, out=y) # Write the result directly into y

* _Reduce_: Repeatedly apply a given operation to the elements of an array until only one single result remains
 * For example, `np.add.reduce(x)` applies addition to the elements until the one result remains, namely the sum of all elements
* _Accumulate_: Almost the same as reduce, but also stores the intermediate results of the computation

In [None]:
x = np.array([1,2,3,4,5])
np.add.reduce(x)

In [None]:
x = np.array([1,2,3,4,5])
np.add.accumulate(x)
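
`reduce` and `accumulate` are available for other binary ufuncs as well. A sketch with `np.multiply`, using an illustrative array name:

```python
import numpy as np

factors = np.array([1, 2, 3, 4, 5])

product = np.multiply.reduce(factors)              # 1 * 2 * 3 * 4 * 5
running_product = np.multiply.accumulate(factors)  # keeps the intermediate products

print(product)
print(running_product)
```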

## Aggregates
* If we want to compute summary statistics for the data in question, aggregates are very useful
  * Common summary statistics: mean, standard deviation, median, minimum, maximum, quantiles, etc.
* NumPy provides fast built-in aggregation functions for working with arrays:

In [None]:
x = np.random.random(10000)
%timeit np.max(x) # NumPy aggregation function
%timeit max(x)    # Python built-in function

* Summing values in an array:

In [None]:
%timeit np.sum(x) # NumPy aggregation function
%timeit sum(x)    # Python built-in function
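
The summary statistics mentioned above can be computed as in the following sketch; the sample values are made up:

```python
import numpy as np

samples = np.array([1, 2, 3, 4, 10])

mean = np.mean(samples)
median = np.median(samples)
std = np.std(samples)                     # standard deviation
minimum, maximum = np.min(samples), np.max(samples)
first_quartile = np.percentile(samples, 25)

print(mean, median, std, minimum, maximum, first_quartile)
```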

## Multidimensional Aggregates
* By default, each NumPy aggregation function will return the aggregate over the entire array
* Aggregation functions take an additional argument specifying the axis along which the aggregate is computed
 * For example, we can find the minimum value within each column by specifying `axis=0`:

In [None]:
twodim = np.array([[1,2,3],[0.12, -1, 0.41],[10,9,8]])
twodim.min(axis=0)
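
The `axis` argument names the dimension that is collapsed. The following sketch compares the three variants on the same array:

```python
import numpy as np

twodim = np.array([[1, 2, 3], [0.12, -1, 0.41], [10, 9, 8]])

overall_min = twodim.min()        # minimum over the entire array
column_min = twodim.min(axis=0)   # collapse the rows: one value per column
row_min = twodim.min(axis=1)      # collapse the columns: one value per row

print(overall_min)
print(column_min)
print(row_min)
```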

## Comparison Operators as ufuncs
* NumPy also implements comparison operators as element-wise ufuncs
* The result of these comparison operators is always an array with a Boolean data type:

In [None]:
np.array([1,2,3]) < 2

* It is also possible to do an element-by-element comparison of two arrays:

In [None]:
np.array([1,2,3]) < np.array([0,4,2])

## Working with Boolean Arrays: Counting Entries
* The `np.count_nonzero()` function will count the number of `True` entries in a Boolean array

In [None]:
nums = np.array([1,2,3,4,5])
np.count_nonzero(nums < 4)

* We can also use the `np.sum()` function to accomplish the same. In this case, `True` is interpreted as `1` and `False` as `0`:

In [None]:
np.sum(nums < 4)

* NumPy also implements bitwise logic operators as element-wise ufuncs
* We can use these bitwise logic operators to construct compound conditions (consisting of multiple conditions)

In [None]:
(nums < 2) | (nums > 3)
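
A sketch of the other bitwise operators, `&` (and) and `~` (not). The parentheses around each condition are required, because the bitwise operators bind more tightly than the comparisons:

```python
import numpy as np

nums = np.array([1, 2, 3, 4, 5])

between = (nums > 1) & (nums < 5)   # both conditions must hold
not_three = ~(nums == 3)            # negate a condition
count = np.count_nonzero(between)

print(between)
print(not_three)
print(count)
```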

## Boolean Arrays as Masks
* In the previous slides we looked at aggregates computed directly on Boolean arrays
* Once we have a Boolean array from, say, a comparison, we can select the entries that meet the condition by using the Boolean array as a _mask_

In [None]:
x = np.array([[3,1,5],[10,32,100],[-1,3,4]])
x[x<5]
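
Masking always returns a one-dimensional array, and a mask can also be used on the left-hand side of an assignment. A sketch with illustrative names:

```python
import numpy as np

table = np.array([[3, 1, 5], [10, 32, 100], [-1, 3, 4]])

mask = table < 5
selected = table[mask]        # one-dimensional array of the matching entries

clipped = table.copy()        # work on a copy to keep table unchanged
clipped[clipped > 10] = 10    # replace every entry larger than 10

print(selected)
print(clipped)
```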

## Reading and Writing Data with NumPy

* We can use the `np.savetxt()` function to save NumPy data to a file
* We can use the `np.loadtxt()` function to load data from a file
  * *Remember*: We can only store elements of a single type in a NumPy array
* Use the shell commands `!ls` and `!pwd` and the magic command `%cd` within our notebook to navigate the file system if necessary (`!cd` runs in a temporary subshell, so its directory change does not persist)
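
A minimal round trip, assuming a temporary directory and a hypothetical file name `example.txt`. The header line written by `np.savetxt()` starts with `#`, so `np.loadtxt()` treats it as a comment and skips it:

```python
import os
import tempfile

import numpy as np

measurements = np.array([[1.5, 2.5], [3.5, 4.5]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name
    np.savetxt(path, measurements, fmt="%.2f", delimiter=",", header="a, b")
    loaded = np.loadtxt(path, delimiter=",")

print(loaded)
```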

### Split-Up Data Example
1. First, we generate some data and store it in multiple files
2. Next, we read the same split-up data back into a single NumPy array
3. **Note**: Please create a `smarthome` folder within the `datasets` folder; we are going to store the files there


In [None]:
def generate_split_data():
    seconds_in_a_day = 24 * 60 * 60 - 1
    data_size = (seconds_in_a_day,1)

    days = np.arange(1,31)

    rng = np.random.RandomState(42)

    for day in days:
        fridge_temperature = rng.normal(loc=5, scale=2.0, size=data_size)
        room_temperature = rng.normal(loc=20, scale=3.0, size=data_size)
        outside_temperature = rng.normal(loc=10, scale=2.0, size=data_size)
        data = np.concatenate((outside_temperature, room_temperature, fridge_temperature), axis=1) # Concatenate column-wise
        # Important: use :02d since it allows us to sort the filenames
        np.savetxt("./datasets/smarthome/day_{:02d}.txt".format(day), data, fmt="%.4f", delimiter=",", header="outside_temperature_celsius, room_temperature_celsius, fridge_temperature_celsius")
        print("day #{:02d} done".format(day))
    
    print("All files have been successfully created!")
        
generate_split_data()

In [None]:
from os import listdir
from os.path import isfile, join

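def read_split_data():
    files_dir = "./datasets/smarthome"
    # Keep only regular files and sort them so that the days are read in order
    files = sorted(f for f in listdir(files_dir) if isfile(join(files_dir, f)))

    daily_data = []

    for f in files:
        # skiprows=1 skips the header line written by np.savetxt()
        new_data = np.loadtxt(join(files_dir, f), skiprows=1, delimiter=",")
        daily_data.append(new_data)
        print("Done with loading {}".format(f))

    # Stack all days once at the end instead of growing the array in every iteration
    all_data = np.vstack(daily_data)
    print("Data shape: {}".format(all_data.shape))
    return all_data

all_data = read_split_data()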
    

## Reading CSV Data with NumPy
* Some CSV data contains a mix of numbers and strings, or might have missing values
* We can use the `np.genfromtxt()` function to load mixed data from such a file into a NumPy array

In [None]:
# Let's play around with the FIFA 2019 player statistics data set and see how we can work with mixed data
columns = [("ID", int), ("Name", "U50"), ("Age", int), ("Overall rating", int), ("Overall potential", int)]
with open("./datasets/fifa2019_player_statistics.csv", "r") as csv_file:  # Close the file after reading
    data = np.genfromtxt(csv_file, delimiter=",", skip_header=1, missing_values=-1, usecols=(1, 2, 3, 7, 8), dtype=columns)
data
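
To see how `np.genfromtxt()` handles mixed and missing data without needing the data set, here is a sketch on a tiny in-memory CSV. The contents and field names are made up. The result is a structured array whose columns are accessed by name, and the missing integer is replaced by the default fill value `-1`:

```python
import io

import numpy as np

# Made-up CSV content; the age in the last row is missing
csv_text = io.StringIO(
    "ID,Name,Age\n"
    "1,Alice,31\n"
    "2,Bob,\n"
)

people = np.genfromtxt(csv_text, delimiter=",", skip_header=1,
                       dtype=[("ID", int), ("Name", "U20"), ("Age", int)])

print(people["Name"])
print(people["Age"])
```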