# Status

Let's take stock of where we're at in the course.

* Week 1: GitHub and getting Python to run
* Weeks 2-5: Built-in Python functionality
* Weeks 6-8: Important Python packages for data analysis
  * NumPy, SciPy (this week)
  * Pandas (steals all the good stuff from R)
  * Visualization (MatPlotLib, Seaborn)
* Week 9: Select open source GIS and spatial analysis packages

#### This week
* Arrays. A powerful data structure. Learning how to use these will be the majority of this week's Notebook.
* NumPy functionality. The most popular math package for python; most other mathematical packages depend on it.
* SciPy functionality. Specialized math tools. Brief introduction at the end.

Go __slowly__ through this notebook. Before you run each cell, look at the code and try to predict what will happen.

# NumPy and SciPy

#### Pronunciation
 * num-pie not num-pee
 * sci-pie not sci-pee

    *For years I used the wrong pronunciation for numpy; and as you know, [old habits are hard to break](https://youtu.be/HcYdLQ8-XZE). So you will hear me using both terms as I walk the long road to recovery.*

#### Background
["NumPy is the fundamental package for scientific computing with Python" (source: numpy.org)](http://promobiledj.com/wp-content/uploads/2011/10/self-promotion2.jpg). This is a pretty strong statement, but in my experience it's true. NumPy has been around in various forms since 1998, and in its current form since 2006. It is not part of the default set of Python packages. It is, however, a default package in Anaconda Python... but as you recall, `Anaconda != Python`.

* Provides common mathematical and numerical routines in pre-compiled, fast functions
* Provides basic routines for manipulating large arrays of data
* SciPy (Scientific Python) extends NumPy with many additional algorithms
* NumPy and SciPy combined make Python a powerful competitor to R and MATLAB
* Open-source

# Arrays

Arrays are the central component of the numpy package. The main difference between standard python lists and numpy arrays is the fact that the elements of a numpy array have to be of the same type. Numpy arrays are far more efficient than python lists. Essentially, an array can be seen like a list with the following differences:

 * All elements have to be of the same type, i.e. integer, float (real), complex numbers, strings
 * The number of elements has to be known a priori, i.e., when the array is created. 

### Creating arrays

In [2]:
import numpy as np

**Note**: The convention is to use the `np` alias for `numpy`.

Lists or tuples can be used to build arrays. We pass the list or tuple to `np.array()` to create the array.

In [3]:
my_list = [3,4,5]
a1 = np.array(my_list)
a1

array([3, 4, 5])

In [None]:
a1 = np.array([3,4,5])
a1

**Note**: In the two cells above, the first one creates a variable `my_list` and then passes it to `np.array`, the second one just plugs the list in directly.

In [None]:
print(a1)

**Note**: You get a different _representation_ when you use `print(a1)` vs. just `a1`. Printing an array makes it look like a list; but as you see in the next cell, it is still an array.

In [4]:
type(a1)

numpy.ndarray

In [None]:
a2 = np.array([[3,4,5],[6,7,8]])
a2

The `shape` attribute tells you how many elements are along each dimension of your array.

In [None]:
a1.shape

In [None]:
a2.shape

**Note**: `a1` has one dimension (like a vector) that has `3` elements in it. `a2` has two dimensions, `2` "rows" and `3` "columns." Arrays can have any number of dimensions.
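
To make "any number of dimensions" concrete, here is a minimal sketch with a three dimensional array. The name `cube` is made up for this example, and the `ndim` and `size` attributes have not been introduced above; they report the number of dimensions and the total number of elements.

```python
import numpy as np

# hypothetical 3-dimensional array: 3 blocks, each with 2 rows and 2 columns
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]],
                 [[9, 10], [11, 12]]])

print(cube.shape)  # number of elements along each dimension
print(cube.ndim)   # number of dimensions
print(cube.size)   # total number of elements
```

Each extra level of nesting in the list adds one more entry to `shape`.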

An array can contain only one type of data. Recall that a list or tuple can contain multiple types of data.

In [None]:
a1

The `dtype` attribute indicates the type of every element within the array.

In [None]:
a1.dtype

Numpy will try to guess the data type when you initially build the array. However, you can force the array to take a particular data type, as seen in the cell below.

In [None]:
a3 = np.array([3,4,5], dtype=float)
a3

In [None]:
a3.dtype

You can try to trick numpy (as in the cells below), but it is pretty robust to your [shenanigans](https://youtu.be/Ok85BmPyl_I). 

In [None]:
a4 = np.array([34, 67.8])
a4

In [None]:
a4.dtype

**Note**: In the above case numpy saw an integer and a float, and made both elements floats. Numpy has a bunch more data types than standard python; the example above uses `float64` (a 64 bit float). You don't need to worry about the `64` part, just that numpy converted both objects to the same float type.

In [None]:
a4 = np.array(['dog', 34, 67.8])
a4

In [None]:
a4.dtype

**Note**: `<U4` is a type of string. In this case it's a string with up to 4 characters. The number in the dtype for a string is more important than for the float: it indicates the maximum number of characters for elements in the array. The longest element in the array above was `67.8`, with its 4 characters. (Depending on your NumPy version, you may see a wider type such as `<U32` instead, in which case `elephant` in the next cell fits without being cut off.) Notice what happens when we try to put a longer string into the `<U4` array.

In [None]:
a4[1] = 'elephant'
a4

**Note**: Recall, unlike lists, all the elements of numpy arrays must be the same type.

#### Slicing arrays

For one dimensional arrays, slicing is identical to slicing on a list.

In [None]:
a1

In [None]:
a1[1]

Things get a little different for multidimensional arrays. 

In [None]:
a2

In [None]:
a2[0,2]

**Note**: In the cell above, we asked for the element in the `0` row and the `2` column. As always, counting starts at 0 in python.

**Action**: In the cell below, grab some different elements from the array `a2`.

**Action**: In the cell below you'll see a list of lists containing the same data as `a2`. Slice out the same values from the list as you did from the array.  Notice the different syntax.

In [None]:
list2 = [[3,4,5],[6,7,8]]
list2

The colon (`:`) by itself means to get all values along that dimension. In the first example below, we're asking for all columns from the `0` row.

In [None]:
a2[0,:]

The second example asks for all rows for the `2` column.

In [None]:
a2[:,2]

**Note**: You'll notice that numpy slicing is more powerful (and more intuitive in my opinion) compared to slicing on a nested list of lists. Executing the slice in the cell above is not easy to do on lists.
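
The comparison can be sketched side by side. The names `nested` and `arr` below are placeholders holding the same data as `list2` and `a2`; the list version needs a loop (here a list comprehension) to collect a column, while the array version is a single slice.

```python
import numpy as np

nested = [[3, 4, 5], [6, 7, 8]]  # same data as list2
arr = np.array(nested)           # same data as a2

# single element: two sets of brackets for the list, one for the array
print(nested[0][2], arr[0, 2])

# a whole column: the list needs a loop over the rows
last_col_list = [row[2] for row in nested]

# the array does it with one slice
last_col_arr = arr[:, 2]

print(last_col_list)
print(last_col_arr)
```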

Some of the tricks we learned for lists are still applicable for arrays.

In [None]:
a2

In [None]:
a2[1,1:]

In [None]:
a2[:,-1]

#### The array elements are mutable

You cannot append an element to an array like we have been doing to lists. This fixed structure makes the array faster than a list when working with large datasets. However, the elements within the array are mutable. See example below.

In [None]:
a2 = np.array([[3,4,5],[6,7,8]])
a2

In [None]:
a2[1,2]

In [None]:
a2[1,2] = 99
a2

#### Selecting array elements

Array elements can be selected by passing an array or list of booleans (`True` or `False` values) to it.

In [None]:
b1 = np.array([8,2,6,3])
b1

Notice what we get when applying a comparison operator to an array.

In [None]:
s = b1 > 4
s

**Note**: In the output from the cell above it is like we did four individual comparisons: `8>4`, `2>4`, `6>4` and `3>4`, and then put the results into an array. The symmetry between `b1` and `s` is important: the `True` and `False` values in `s` match the positions in `b1`.

We can now pass the array of booleans (`s`) to `b1` using the square brackets to extract the values from `b1` that are greater than `4`, i.e., the `True` values from `s`.

In [None]:
b1[s]

Oftentimes we just combine it all into one step.

In [None]:
b1[b1 > 4]
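
As a rough sketch of what the boolean selection is doing behind the scenes, the cell below pairs each value with its `True`/`False` flag and keeps only the flagged ones. The names `values`, `mask`, `picked_loop` and `picked` are made up for this illustration.

```python
import numpy as np

values = np.array([8, 2, 6, 3])
mask = values > 4  # one True/False per element, same positions as values

# the long way: walk through values and mask together, keep the True ones
picked_loop = [int(v) for v, keep in zip(values, mask) if keep]

# the numpy way: hand the mask to the array
picked = values[mask]

print(mask)
print(picked_loop)
print(picked)
```

Both approaches keep the same values; the numpy version just skips the explicit loop.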

#### Reshaping arrays

numpy provides some nice tools to easily change the "shape" of an array.

In [None]:
a2 = np.array([[3,4,5],[6,7,99]])
a2

`flatten` takes any multidimensional array and returns a new array that is flattened into a one dimensional array.

In [None]:
a2.flatten()

This didn't actually change `a2`.

In [None]:
a2

`reshape` gives the array the shape passed in by the user.

In [None]:
a2.reshape((3,2))

In [None]:
a2.reshape((6,1))

Again, `a2` was not changed.

In [None]:
a2
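
One detail worth a quick sketch: `reshape` only works when the new shape holds exactly the same number of elements as the original. The names `grid` and `tall` are placeholders; the failing shape `(4, 2)` is chosen just to trigger the error.

```python
import numpy as np

grid = np.array([[3, 4, 5], [6, 7, 99]])  # 6 elements

tall = grid.reshape((3, 2))  # 3 * 2 = 6, so this works
print(tall)
print(grid.shape)            # the original is untouched

try:
    grid.reshape((4, 2))     # 4 * 2 = 8, which does not match 6
except ValueError as err:
    print("reshape failed:", err)
```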

If you want to change the shape of an array in place, you can assign a new value to its `shape` attribute.

In [None]:
a2.shape = (6,1)

Finally, we actually changed `a2`!!

In [None]:
a2

That was too weird. Let's get `a2` back to how it belongs.

In [None]:
a2.shape = (2,3)
a2

In general, numpy is used for mathematical operations. A common transformation in math is the transpose. This makes the rows into columns.

In [None]:
a2.transpose()

Since the transpose is such a common action, it can also be called with just `.T`

In [None]:
a2.T

Much of the functionality in numpy can be called in different ways. Below is an alternate way of calling the transpose.

In [None]:
np.transpose(a2)

**Note**: Using methods of an array tends to be faster than calling a numpy function.

#### Copying an array

You saw a similar example in the last Notebook about how lists can be linked together. We'll show it again for arrays.

In [None]:
c1 = np.array([8,4,2])

In [None]:
c1

In [None]:
c2 = c1

In [None]:
c2

In [None]:
c2[1] = 99

In [None]:
c1

In [None]:
c2

**Note**: The elements in `c1` and `c2` are pointing to the same locations in memory.

Numpy arrays have a `copy` method that can be used to break this link.

In [None]:
c1

In [None]:
c3 = c1.copy()

In [None]:
c3

In [None]:
c3[1] = 9999

In [None]:
c3

In [None]:
c1

**Note**: How numpy "copies" data is complicated. Whenever you're working with a subset of an array, be sure to test the arrays to be sure they are functioning as you expect.
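
Here is a small sketch of the kind of test that note has in mind. Slicing an array hands back a "view" onto the same data, so a change made through the slice shows up in the original. The names `base`, `part_view` and `part_copy` are made up, and `np.shares_memory` is a numpy helper not used elsewhere in this Notebook.

```python
import numpy as np

base = np.array([8, 4, 2, 7])

part_view = base[1:3]         # a slice: still linked to base
part_copy = base[1:3].copy()  # an explicit copy: independent of base

part_view[0] = 99             # this changes base too
part_copy[1] = -1             # this does not

print(base)
print(np.shares_memory(base, part_view))
print(np.shares_memory(base, part_copy))
```

When in doubt, call `.copy()` on the subset before changing it.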

#### Initializing an array

Numpy offers a number of tools for getting an array started. We will start with `np.arange`, which is similar to `range`, which you learned earlier.

In [None]:
a3 = np.arange(3, 20, 4)
a3

**Note**: In the above example we start at `3` and step through at increments of `4`, stopping before we reach the end point (`20`). Just like `range`, the end point itself is never included.

`np.linspace` takes a start and end and fills in the values between.

In [None]:
a4 = np.linspace(3, 20, 4)
a4

**Note**: In the above example, we create `4` equally spaced values from `3` to `20`. Unlike `np.arange`, both the start and the end point are included.
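
The two functions are easy to mix up, so here is a minimal side-by-side sketch. The third argument means "step size" for `np.arange` but "number of values" for `np.linspace`. The variable names are placeholders.

```python
import numpy as np

stepped = np.arange(3, 20, 4)    # start, stop (excluded), step size
spaced = np.linspace(3, 20, 4)   # start, stop (included), number of values

print(stepped)
print(spaced)

# the stop value is excluded by arange even when a step lands exactly on it
print(np.arange(3, 19, 4))

# the gaps between linspace values are all the same size
print(np.diff(spaced))
```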

`np.zeros` and `np.ones` create arrays filled with zeros and ones respectively. In both cases you pass in the shape you want for the final array.

In [None]:
a3 = np.zeros((2,5))
a3

In [None]:
a4 = np.ones((2,5))
a4

**Note**: It is often a good strategy to initialize an array with the correct dimensions first, and then go back to fill in the values.  
- But you cannot really make an "empty" array, it needs to have a value in every slot upon instantiation. 
- Be aware that the data type you use to start the array needs to match the type of data you'll put in later.
- Recall the strategy for lists was to create an _empty_ list and then `append` the values one-by-one. 
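
A minimal sketch of that "build first, fill in later" strategy, including what can go wrong when the data types don't match. The names `squares` and `counts` are made up for this example.

```python
import numpy as np

# build the array first (np.zeros gives floats by default)...
squares = np.zeros(5)

# ...then go back and fill in each slot
for i in range(5):
    squares[i] = i ** 2
print(squares)

# type mismatch: an integer array cannot hold the fractional part
counts = np.zeros(3, dtype=int)
counts[0] = 2.9   # stored as an integer, the .9 is dropped
print(counts)
```

No error is raised in the second case, which is why it pays to check the `dtype` before filling an array.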

### Operators and Arrays

To this point, arrays have been presented as something very similar to a list. While arrays have benefits as a generic data structure, they are optimized to work with numbers. In this section you'll see how doing math on arrays is much easier than on lists.

#### Elementwise computations

The default way that numpy performs math on two arrays is to line them up and do the math element by element.

In [None]:
a1 = np.array([2,5,9])
a2 = np.array([4,5,8])

In [None]:
a1

In [None]:
a2

In [None]:
a1 + a2

**Note**: The above cell did three additions: `2+4`, `5+5` and `9+8`, and then stored them in a new array.

Subtraction works the same way.

In [None]:
a1 - a2

The same for multiplication.

In [None]:
a1 * a2

Increasing the dimensions doesn't change anything. The computations are still done element by element.

In [None]:
a3 = np.array([[2,5],[8,3],[9,4]])
a4 = np.array([[7,2],[3,6],[5,8]])

In [None]:
a3

In [None]:
a4

In [None]:
a4.shape

In [None]:
a5 = a3 + a4
a5

In [None]:
a5.shape

**Note**: When the dimensions of the arrays match, the mathematics are done elementwise. Meaning that the element in slot 0 of the first array is matched up with the element in slot 0 of the second array, and so on. The result will also be of the same dimensions.

#### Broadcasting

When the arrays are different shapes, numpy will try to "broadcast" one array onto the other.

In [None]:
a1

In [None]:
a6 = np.array([3])
a6

In the next cell we are adding a one dimensional array with three elements to a one dimensional array with one element.

In [None]:
a1 + a6

**Note**: In the above case, numpy adds `3` to each value in `a1`.  In the case below it multiplies `3` by each value in `a1`.

In [None]:
a1 * a6

If you want to broadcast a single value, it doesn't need to be inside an array.

In [None]:
a1 * 3

Numpy has rules to intelligently decide how to relate the two objects, but sometimes a match is not possible. See the example below.

In [None]:
a1

In [None]:
a7 = np.array([2,7])
a7

In [None]:
a1 + a7

Now see what happens when we add a 3x2 array to a one dimensional array with two values.

In [None]:
a3

In [None]:
a7

In [None]:
a3 + a7

**Action**: Look closely at the `a3 + a7` output to see how the data was matched up.
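
A rough way to think about it: numpy compares the shapes starting from the last dimension, and the sizes need to either match or be `1`. The sketch below uses placeholder names (`table`, `pair`, `triple`) with the same values as `a3`, `a7` and `a1`, and reshapes `triple` into a column to show the other direction of matching.

```python
import numpy as np

table = np.array([[2, 5], [8, 3], [9, 4]])  # shape (3, 2)
pair = np.array([2, 7])                     # shape (2,)
triple = np.array([2, 5, 9])                # shape (3,)

# last dimensions match (2 and 2): pair is added to every row
by_row = table + pair
print(by_row)

# last dimensions differ (3 and 2) and neither is 1: no match possible
try:
    triple + pair
except ValueError as err:
    print("cannot broadcast:", err)

# as a (3, 1) column, triple is added to every column instead
by_col = table + triple.reshape((3, 1))
print(by_col)
```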

### Methods of Arrays

Earlier in this Notebook you were introduced to some of the array methods related to the shape of the array. Now we'll look at some mathematical methods.

In [None]:
type(a1)

In [None]:
print(dir(a1))

As always when reading the output of `dir`, ignore the items with leading underscores. 

In [None]:
a1

Some mathematical methods.

In [None]:
a1.sum()

In [None]:
a1.mean()

In [None]:
a1.min()

In [None]:
a1.cumsum()

**Action**: If you don't understand what happened in any of the above cells, you can use the built-in help tool to see what the method does. The next cell has an example.

In [None]:
help(a1.cumsum)

Things get a little more interesting for multidimensional arrays.

In [None]:
a3

In [None]:
a3.sum()

**Note**: Simply calling `sum` on an array adds up all the numbers and returns a single value. The `axis` argument (see below) gives results along a particular dimension.

In [None]:
a3.sum(axis=0)

In [None]:
a3.sum(axis=1)
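
The `axis` argument trips up a lot of people, so here is a small sketch. One way to remember it: the axis you name is the one that gets "collapsed". The array `m` below is a placeholder with the same values as `a3`.

```python
import numpy as np

m = np.array([[2, 5],
              [8, 3],
              [9, 4]])   # 3 rows, 2 columns

col_totals = m.sum(axis=0)  # collapse the rows: one total per column
row_totals = m.sum(axis=1)  # collapse the columns: one total per row

print(col_totals)    # [19 12]
print(row_totals)    # [ 7 11 13]
print(m.sum())       # 31
```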

The following method will convert an array to a list.

In [None]:
a1.tolist()

#### Speed

Let's see how numpy arrays can speed up your code.

Goal: create 1 million numbers, and multiply each by 100. First we'll solve the problem using a `for` loop. Second we'll solve it using numpy arrays.

The `%%timeit` "cell magic" will tell us how fast the code in that cell runs.

In [None]:
%%timeit
# pure Python solution
x = range(1000000)
y = []
for i in x:
    y.append(i*100)

In [None]:
%%timeit
# numpy solution
x = np.arange(1000000)
y = x*100

**Note**: On my computer, the numpy array solution was over 100 times faster than the `for` loop... your mileage may vary. The point is, `for` loops can be very expensive in python, and numpy arrays can often help you avoid them.

While we're exercising the computer, the cell below shows that a _list comprehension_ is faster than the "classic" approach of appending to an empty list, but numpy is still way faster than either.

In [None]:
%%timeit
# pure Python solution
y = [i*100 for i in range(1000000)]
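
The `%%timeit` magic only works inside Jupyter/IPython. If you ever need the same comparison in a plain Python script, the standard library's `timeit` module is one option. This is just a sketch: the function names, the smaller `n`, and the number of repeats are all arbitrary choices for illustration.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the Notebook example so it runs quickly


def loop_version():
    # pure Python: append one value at a time
    y = []
    for i in range(n):
        y.append(i * 100)
    return y


def numpy_version():
    # numpy: multiply the whole array at once
    return np.arange(n) * 100


# total seconds needed to run each function 5 times
loop_time = timeit.timeit(loop_version, number=5)
numpy_time = timeit.timeit(numpy_version, number=5)

print("loop: ", loop_time)
print("numpy:", numpy_time)
```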

# NumPy Functions

Besides the numpy array, numpy has a ton of mathematical functions.

In [None]:
print(dir(np))

That's a long list.  "How long?" you ask. Run the next cell to find out.

In [None]:
len(dir(np))

#### Universal functions (a few examples)

Some math functions you'd probably expect.

In [None]:
np.sin(76)

In [None]:
np.sqrt(16)

In [None]:
np.sum([6,4,2])

To compute correlation coefficients, you build an array and then pass it to `np.corrcoef`. Each row of the array is treated as one variable.

In [None]:
corr_data = np.array([[3,4,1,9,5,1],[8,4,2,7,6,1],[3,2,8,6,1,6]])
corr_data

In [None]:
np.corrcoef(corr_data)

**Note**: Although the `math` package is useful, typically you'll be working with numpy arrays so code will run faster if you keep it all within numpy.

#### Linear algebra

You can also do linear algebra in numpy.

In [None]:
a3

We saw transpose earlier.

In [None]:
a3.T

In [None]:
a4

Earlier you saw elementwise multiplication of two arrays `array1 * array2`. However, this is not the way you learned to multiply matrices in statistics class. Matrix multiplication is more complicated. [This might help jog your memory](https://www.mathsisfun.com/algebra/matrix-multiplying.html). To do this in numpy you use `np.dot`.

In [None]:
np.dot(a3.T, a4)
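
To see the difference between the two kinds of multiplication, here is a minimal sketch using placeholder arrays `left` and `right` with the same values as `a3` and `a4`. It also uses the `@` operator, which is not covered above but does the same matrix multiplication as `np.dot` for two dimensional arrays.

```python
import numpy as np

left = np.array([[2, 5], [8, 3], [9, 4]])   # shape (3, 2)
right = np.array([[7, 2], [3, 6], [5, 8]])  # shape (3, 2)

# elementwise: shapes must match, result has the same shape
elementwise = left * right
print(elementwise.shape)

# matrix multiplication: (2, 3) times (3, 2) gives (2, 2)
product = np.dot(left.T, right)
print(product)

# the @ operator is shorthand for the same thing
print(np.array_equal(product, left.T @ right))
```

The inner dimensions have to agree (here both are `3`), which is why the transpose is needed.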

The following will take the inverse of an array.

In [None]:
sq_mat = np.array([[2,5,7],[1,3,9],[8,4,6]])
np.linalg.inv(sq_mat)
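
A quick sanity check on an inverse: multiplying a matrix by its inverse should give the identity matrix (ones on the diagonal, zeros elsewhere). The sketch below uses `np.eye` and `np.allclose`, two numpy functions not introduced above; `np.allclose` is used because floating point rounding means the result is only _very nearly_ the identity.

```python
import numpy as np

sq = np.array([[2, 5, 7], [1, 3, 9], [8, 4, 6]])
sq_inv = np.linalg.inv(sq)

# matrix times its inverse should be (almost exactly) the identity
check = np.dot(sq, sq_inv)
print(np.round(check, 6))

print(np.allclose(check, np.eye(3)))
```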

More linear algebra tools.

In [None]:
print(dir(np.linalg))

#### Random

Python has a built-in `random` module (you used it for the random vacation destination selector), but so does numpy.

In [None]:
np.random.uniform(2,10)

The cell above gives a draw from a uniform distribution. Rerun it a few times to see that the number changes each time.

The cell below draws from the same uniform distribution, but this time it sets a seed first.

In [None]:
np.random.seed(3456)
np.random.uniform(2,10)

Let me guess... is your number... hmmmmm... hold on a second... is it... `7.145364096`?

"How", you might ask, "how could he know the value of a random number!" Could it be his [divine and borderline mystical ways](https://youtu.be/9m_dT0wsrGI)? Regrettably no. If I could guess random numbers, I would have long ago won the lottery and be lying on a beach right now.

As we discussed a few weeks ago, these are technically pseudo-random numbers since they are generated by an algorithm. Numpy needs a "seed" to start its algorithm, which it finds by looking at the computer's current time or other operating system tools. The "problem" with this is that it is nearly impossible for you and me to get the same "random" number. We solve this problem by explicitly giving numpy the seed. 

In the cell below, change the seed to a different integer. Then rerun the cell a few times to see that the result is different from the result above, yet stays the same each time you run the cell.

In [None]:
np.random.seed(3456)
np.random.uniform(2,10)
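
The effect of the seed can also be checked in code rather than by rerunning a cell. In this sketch the variable names are made up; resetting the seed to the same integer restarts the sequence of "random" numbers from the same place.

```python
import numpy as np

np.random.seed(3456)
first = np.random.uniform(2, 10)

np.random.seed(3456)              # same seed again
second = np.random.uniform(2, 10)

third = np.random.uniform(2, 10)  # no reseeding: next number in the sequence

print(first == second)  # same seed, same number
print(first, third)
```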

A few more distributions available ([there are lots](https://docs.scipy.org/doc/numpy/reference/routines.random.html)).

In [None]:
np.random.normal(5,2)

In [None]:
np.random.randint(2,10)

For most of the numpy random functions, the default is to return a single number. As can be seen below, most of them take a `size` argument, which returns multiple random values in one shot.

In [None]:
np.random.randint(2,10,size=(5,3))

You can also randomly rearrange elements in an array.

In [None]:
perm_data = np.arange(10)
perm_data

In [None]:
np.random.permutation(perm_data)

In [None]:
print(dir(np.random))

**Note**: In `np.random.randint`, the `low` value is inclusive but the `high` value is exclusive. Recall that for `randint` in the built-in `random` module, `high` was inclusive.
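
One way to convince yourself of the inclusive/exclusive difference is to draw a lot of integers from each module and look at the smallest and largest values. This is only a sketch: the number of draws (`10_000`) and the variable names are arbitrary.

```python
import random

import numpy as np

# numpy: high (10) is exclusive, so the largest possible value is 9
np_draws = np.random.randint(2, 10, size=10_000)

# built-in random: high (10) is inclusive, so 10 can show up
py_draws = [random.randint(2, 10) for _ in range(10_000)]

print(np_draws.min(), np_draws.max())
print(min(py_draws), max(py_draws))
```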

# SciPy

>### What is the difference between NumPy and SciPy?

>In an ideal world, NumPy would contain nothing but the array data type and the most basic operations: indexing, sorting, reshaping, basic elementwise functions, et cetera. All numerical code would reside in SciPy. However, one of NumPy’s important goals is compatibility, so NumPy tries to retain all features supported by either of its predecessors. Thus NumPy contains some linear algebra functions, even though these more properly belong in SciPy. In any case, SciPy contains more fully-featured versions of the linear algebra modules, as well as many other numerical algorithms. If you are doing scientific computing with python, you should probably install both NumPy and SciPy. Most new features belong in SciPy rather than NumPy.

http://www.scipy.org/scipylib/faq.html

In [None]:
import scipy as sp

In [None]:
print(dir(sp))

Is that longer than the numpy list?!?!?!

Let's check.

In [None]:
len(dir(sp))

You get most of the stuff from NumPy plus extra SciPy only stuff.

#### SciPy Organization

Subpackage | Description
---------- | -----------
cluster | Clustering algorithms
constants | Physical and mathematical constants
fftpack | Fast Fourier Transform routines
integrate | Integration and ordinary differential equation solvers
interpolate | Interpolation and smoothing splines
io | Input and Output
linalg | Linear algebra
ndimage | N-dimensional image processing
odr | Orthogonal distance regression
optimize | Optimization and root-finding routines
signal | Signal processing
sparse | Sparse matrices and associated routines
spatial | Spatial data structures and algorithms
special | Special functions
stats | Statistical distributions and functions
weave | C/C++ integration

You need to import these subpackages explicitly. Notice that you will not find `sparse` in the `dir(sp)` output above. So the following line will fail.

In [None]:
sp.sparse

But the following lines are fine. This is a decision made by the scipy developers, which is kind of annoying.

In [None]:
import scipy.sparse

In [None]:
from scipy import sparse

In [None]:
print(dir(sparse))

I will not go into these subpackages in detail. They each provide a lot of functionality, and are targeted to specific uses. 

# Test Yourself

1) Using python syntax, extract the `6` from `my_array`.

In [None]:
my_array = np.array([[9,3],[7,4],[6,8]])
my_array

---

2) Create a 4x2 numpy array (4 rows and 2 columns) containing values from one of the random distribution functions available in numpy (some were introduced above, but [there are a lot more](https://docs.scipy.org/doc/numpy/reference/routines.random.html)).

---

3) In the following cell mark each statement as true or false.

    a. You can add a new row to an existing array.  [True/False]
    b. You can change an element within an existing array.  [True/False]
    c. All elements in an array must be of a single type.  [True/False]

---

4) Using numpy syntax, sum the columns of the array in the following cell.

In [None]:
array3 = np.array([[4,2,9],[8,3,5]])
array3

---

5) The following cell has a variable called `radii` which has 10 values. Let's assume they are the radii of some circles. Create an array of circle _areas_ from these radii. Recall that you can work with the entire array at once, no need to loop over the various elements one by one. Use numpy's pi instead of the math package.

$area = \pi * radius^2$

In [None]:
radii = np.random.randint(1,21,10) # 10 random integers, drawn from the range 1 to 20
radii

---

6) Write some one-liners.

a) Select the values less than 10 from the following array.

In [None]:
np.random.seed(999)
vals = np.random.randint(1,20,10)
vals

b) Multiply all the values in the following array by 100.

In [None]:
np.random.seed(999)
vals = np.random.randint(1,20,10)
vals

c) In one line, select the values less than 10, and multiply them by 100.

In [None]:
np.random.seed(999)
vals = np.random.randint(1,20,10)
vals

---

7) For the following examples, would a list or numpy array be preferred? Why?

a) You want to store a student's exam responses; where some responses are numeric and some are text.

b) You want to systematically walk around your house and store the name of each of your cats, but you don't know how many cats you have.

c) You have the height and weight of 100 subjects in your health study, and you need to compute each person's body mass index (BMI).