# Multi-dimensional data video notes
So far, 

* We've seen how to work with single columns of data. 
* But data are often in tables.

Consider:

In [1]:
import numpy as np
x = np.array([[1,2,3], [4,5,6], [7,8,9]])
x

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

This is a *multi-dimensional array*. 
To access its elements, note that: 

In [2]:
print(x[0])
print(x[1])
print(x[2])
print(x.shape)

[1 2 3]
[4 5 6]
[7 8 9]
(3, 3)


* `x.shape` is the shape of the array: 3 rows by 3 columns. 
* `x[0]` through `x[2]` are the rows. 
* `x[i][j]` is the element at row `i`, column `j`. 

In [3]:
x[2][2]

9
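NumPy also accepts both indices inside a single pair of brackets, which is the more common idiom and extends naturally to pulling out a whole column. A minimal sketch using the same `x` (the names `corner` and `middle_column` are only for illustration):

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# x[i, j] selects the same element as x[i][j]
corner = x[2, 2]

# A colon means "everything along this axis", so this is column 1
middle_column = x[:, 1]

print(corner)
print(middle_column)
```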

You'll be happy to know that the things you are used to for single-dimensional arrays still work, e.g., 

In [4]:
x + 4

array([[ 5,  6,  7],
       [ 8,  9, 10],
       [11, 12, 13]])

In [5]:
x > 5


array([[False, False, False],
       [False, False,  True],
       [ True,  True,  True]])

In [6]:
x[x>5]

array([6, 7, 8, 9])

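Oops. That doesn't do quite what we might want. Indexing with a table of True/False values of the same shape *flattens* the result: we get a one-dimensional array of the elements that match, and the row and column structure is lost. 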
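If the goal is to keep the two-dimensional shape, one option is `np.where`, which picks between two values element by element. This is only a sketch: the placeholder `0` for non-matching elements is an arbitrary choice for illustration, and the name `masked` is not from the original notes.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Where the test is True keep the element of x; elsewhere use 0 (arbitrary placeholder)
masked = np.where(x > 5, x, 0)

print(masked.shape)
print(masked)
```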

# The concept of an axis
Most meaningful operations on multi-dimensional arrays act on rows or columns. We might want to remove some rows or columns, or we might want to filter out all rows matching some criteria.

An *axis* is a number that names the dimension along which an operation is applied. For a two-dimensional array, axis 0 runs down the rows and axis 1 runs across the columns. 

Consider, e.g., 


In [7]:
print("input = ")
print(x)
print("x.sum(axis=0)={}".format(x.sum(axis=0)))
print("x.sum(axis=1)={}".format(x.sum(axis=1)))

input = 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
x.sum(axis=0)=[12 15 18]
x.sum(axis=1)=[ 6 15 24]


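* The first parameter of `sum` is `axis` (0 or 1 for a two-dimensional array). 
* `axis` is 0 --> add the *rows* together, which gives one sum per *column*: `[12 15 18]`. 
* `axis` is 1 --> add the *columns* together, which gives one sum per *row*: `[ 6 15 24]`.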
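With a square array it is easy to mix the two cases up. The sketch below uses a hypothetical 2x3 array `m` (not from the original notes), so the length of each result shows which axis was collapsed.

```python
import numpy as np

# 2 rows, 3 columns: the axes have different lengths, so the results are easy to tell apart
m = np.array([[1, 2, 3],
              [4, 5, 6]])

column_sums = m.sum(axis=0)  # collapses axis 0 (the rows): one sum per column
row_sums = m.sum(axis=1)     # collapses axis 1 (the columns): one sum per row

print(m.shape)
print(column_sums, column_sums.shape)
print(row_sums, row_sums.shape)
```

The axis you name is the one that disappears from the shape: `(2, 3)` becomes `(3,)` for `axis=0` and `(2,)` for `axis=1`.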

# Broadcasting

* When NumPy is faced with arrays of the exact same shape, things proceed normally. 
* When shapes differ, NumPy uses *broadcasting*, an idea that many other libraries have copied. 
* We've already seen this in the single-dimensional case, but here it is in the multi-dimensional case. 
* See [A gentle introduction to broadcasting](https://machinelearningmastery.com/broadcasting-with-numpy-arrays)
* This is a bit counter-intuitive: 

In [8]:
a = np.array([[1,2,3],[4,5,6]])
b = np.array([10,11,12])
a + b

array([[11, 13, 15],
       [14, 16, 18]])

Simply stated, computing `a + b` behaves as if `b` were replicated into a two-dimensional array `b'`, with the row copied once for each row of `a`. (NumPy does not actually build that copy.) The result is identical to: 

In [9]:
a = np.array([[1,2,3], [4,5,6]])
c = [10,11,12]
d = np.array([c, c])
print("a=")
print(a)
print("d=")
print(d)
a+d

a=
[[1 2 3]
 [4 5 6]]
d=
[[10 11 12]
 [10 11 12]]


array([[11, 13, 15],
       [14, 16, 18]])
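To see the stretching directly, the sketch below uses `np.broadcast_to`, which returns a read-only view of `b` in the shape of `a` without copying data. It also shows a column-shaped array being stretched the other way, and a shape that cannot be broadcast. The names `b_stretched` and `col` and the values in `col` are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 11, 12])

# View b as if it had a's shape; the rows are repeated without copying data
b_stretched = np.broadcast_to(b, a.shape)
print(b_stretched)
print(np.shares_memory(b, b_stretched))

# An array of shape (2, 1) is stretched across the columns instead
col = np.array([[100], [200]])
col_sum = a + col
print(col_sum)

# Shapes that cannot be lined up raise an error
failed = False
try:
    a + np.array([1, 2])
except ValueError as err:
    failed = True
    print("cannot broadcast:", err)
```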

# What is the point of "broadcasting"?
* Very often we want to repeat the same comparison across all rows. 
* Consider the following really counter-intuitive but very useful pattern.

In [10]:
# convention: -1 means missing
data = np.array([[1,2,1],
                 [-1,5,2],
                 [3,3,-1],
                 [1,1,4],
                 [2,1,2]])
print("data is:")
print(data)
print("data != -1 is:")
print(data != -1)
# True means corresponding line has no missing data
choices = (data != -1).all(axis=1)
print("(data != -1).all(axis=1) =")
print(choices)
# select lines without missing data. 
print("data[choices] is")
data[choices]

data is:
[[ 1  2  1]
 [-1  5  2]
 [ 3  3 -1]
 [ 1  1  4]
 [ 2  1  2]]
data != -1 is:
[[ True  True  True]
 [False  True  True]
 [ True  True False]
 [ True  True  True]
 [ True  True  True]]
(data != -1).all(axis=1) =
[ True False False  True  True]
data[choices] is


array([[1, 2, 1],
       [1, 1, 4],
       [2, 1, 2]])

In other words, I just selected all rows that don't have missing values. 

Let's take this apart carefully. 
* By convention, data that is missing is represented by `-1`. 
* Thus there are two rows in which data is missing. 
* We want to exclude any row with a `-1` in it. 
* We compare every element to `-1`, which gives a table of True/False values the same shape as `data`. 
* Then we compute the logical "and" along axis 1, that is, across the columns of each row. This gives one summary flag per row, which is True only if every element in that row passed the test. 
* Then we use those flags to select rows (axis 0): only the rows whose flag is True are kept. 
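The steps above can be sketched with each intermediate result given a name, so its shape can be checked along the way. The names `present`, `complete`, `clean`, and `complete_alt` are chosen here for illustration. The last line is an equivalent way to build the same flags, phrased as "no element of the row is missing".

```python
import numpy as np

# convention: -1 means missing
data = np.array([[1, 2, 1],
                 [-1, 5, 2],
                 [3, 3, -1],
                 [1, 1, 4],
                 [2, 1, 2]])

present = data != -1             # shape (5, 3): one test per element
complete = present.all(axis=1)   # shape (5,): one flag per row
clean = data[complete]           # keeps only the rows whose flag is True

# Same flags, built from "is any element of this row missing?" and then negated
complete_alt = ~(data == -1).any(axis=1)

print(present.shape, complete.shape, clean.shape)
print(np.array_equal(complete, complete_alt))
```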

# A compelling example

Suppose we want to examine all rows in which at least one element is more than 1 standard deviation from the mean of its column. Consider this code: 

In [11]:
data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14],
                 [14, 44, 11], 
                 [10, 39, 15],
                 [13, 43, 14]])
stdev = data.std(axis=0)
print("stdev = {}".format(stdev))
means = data.mean(axis=0)
print("means = {}".format(means))
mins = means - stdev
print("mins = {}".format(mins))
maxs = means + stdev
print("maxs = {}".format(maxs))
gt_lower = (data > mins).all(axis=1)
lt_higher = (data < maxs).all(axis=1)
in_bounds = gt_lower & lt_higher
outliers = np.invert(in_bounds)
print("outlier choices are:")
print(outliers)
print("outlier rows are:")
print(data[outliers])

stdev = [ 1.34370962 14.8548533   1.34370962]
means = [12.16666667 35.         13.16666667]
mins = [10.82295704 20.1451467  11.82295704]
maxs = [13.51037629 49.8548533  14.51037629]
outlier choices are:
[False  True False  True  True False]
outlier rows are:
[[13  2 13]
 [14 44 11]
 [10 39 15]]


# A really curious logic

* What we want is a table of flags marking the rows to choose (`outliers = [False, True, False, True, True, False]`).

How we do that: 
* compute the mean and standard deviation of each *column*. 
* compute upper and lower bounds for each *column*. 
* broadcast those bounds in comparisons with each *row*. 
* compute from that whether each row is within bounds in every column. 
* invert that to get the rows that are not. 
* select these via the `data[selection]` pattern.

We can do most of this in one line: 

In [12]:
data[np.invert(((data > mins) & (data < maxs)).all(axis=1))]

array([[13,  2, 13],
       [14, 44, 11],
       [10, 39, 15]])
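The same logic can be packaged as a small reusable function. This is a sketch, not part of the original notes: the function name `outlier_rows` and the parameter `k` (how many standard deviations count as "far") are hypothetical. It tests the distance from the column mean directly instead of building `mins` and `maxs`, which selects the same rows for this data.

```python
import numpy as np

def outlier_rows(data, k=1.0):
    """Return the rows with at least one element k or more standard deviations from its column mean."""
    means = data.mean(axis=0)   # one mean per column
    stdev = data.std(axis=0)    # one standard deviation per column
    # means and stdev have one entry per column, so they broadcast against every row
    far = np.abs(data - means) >= k * stdev
    return data[far.any(axis=1)]

data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14],
                 [14, 44, 11],
                 [10, 39, 15],
                 [13, 43, 14]])

print(outlier_rows(data))        # same rows as the one-line version
print(outlier_rows(data, k=2))   # a stricter threshold keeps fewer rows
```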

# Aside: functional programming

This seemingly curious logic is part of a movement in Computer Science toward what is called *functional programming*. 
* Express things in terms of functions that transform data. 
* Avoid the "for" loop at all costs. 

This way of thinking is theoretically desirable. Functional programs are often: 
* much easier to debug and correct. 
* much easier to speed up through parallel computing. 

In fact, *the use of "for" loops makes these things more difficult!* 
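As a rough illustration of the difference in style, here is the missing-data selection written twice: once with explicit loops and once as a single array expression. Both are sketches with illustrative names (`kept`, `loop_result`, `array_result`). They produce the same rows.

```python
import numpy as np

# convention: -1 means missing
data = np.array([[1, 2, 1],
                 [-1, 5, 2],
                 [3, 3, -1],
                 [1, 1, 4],
                 [2, 1, 2]])

# Loop version: walk the rows and test each element by hand
kept = []
for row in data:
    ok = True
    for value in row:
        if value == -1:
            ok = False
    if ok:
        kept.append(row)
loop_result = np.array(kept)

# Array version: one expression describes the whole transformation
array_result = data[(data != -1).all(axis=1)]

print(np.array_equal(loop_result, array_result))
```

The loop version has mutable state (`kept`, `ok`) that has to be tracked while reading it. The array version has none.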

# Patterns for functional data programming
* *operations on rows:* broadcasting and parallel selection. 
* *transposition:* switch rows and columns: operate on columns as rows in order to use row patterns! 

Consider the following: 

In [13]:
columns = data.transpose()
print(columns)
# Let's remove column 1, which is now row 1!
c2 = columns[[True, False, True]]
print(c2)
d2 = c2.transpose()
d2

[[12 13 11 14 10 13]
 [42  2 40 44 39 43]
 [12 13 14 11 15 14]]
[[12 13 11 14 10 13]
 [12 13 14 11 15 14]]


array([[12, 12],
       [13, 13],
       [11, 14],
       [14, 11],
       [10, 15],
       [13, 14]])

# What happened? 
* The pattern `data[selections]` works on rows. 
* I wanted to remove a column. 

So:

* transpose rows and columns. 
* remove the new "row" (the old "column").
* transpose rows and columns back. 

The result is the original data with a column removed! 
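For comparison, NumPy can also select along the column axis directly, without transposing. The sketch below checks that two such alternatives, slicing with a mask and `np.delete`, give the same result as the transpose pattern. The names `keep`, `via_transpose`, `via_slice`, and `via_delete` are chosen here for illustration.

```python
import numpy as np

data = np.array([[12, 42, 12],
                 [13,  2, 13],
                 [11, 40, 14],
                 [14, 44, 11],
                 [10, 39, 15],
                 [13, 43, 14]])
keep = np.array([True, False, True])

# Transpose, select "rows", transpose back
via_transpose = data.transpose()[keep].transpose()

# Index the column axis directly: ":" keeps every row, keep picks the columns
via_slice = data[:, keep]

# np.delete removes index 1 along axis 1 and returns a new array
via_delete = np.delete(data, 1, axis=1)

print(np.array_equal(via_transpose, via_slice))
print(np.array_equal(via_transpose, via_delete))
```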

# Afterword: I hate the term "broadcasting". 

A better term might be "distribution". But the term "broadcasting" is in common use and used to describe the same process in multiple languages. So we're stuck with it. 

When you are done with this, please [proceed to complete the related exercise](03-02-multi-dimensional-data.ipynb).