# Handling data (i.e. numpy)

The reason python is such a useful language is not because it's the fastest, the most feauture rich or even very well written.

It's strength comes from the fact it's: relatively readable, high-functioning (you can do complex things in just a few lines) but most of all because it's very easy to import new tools to suit almost any task.

We're going to quickly introduce a couple of these that you'll use heavily over the next couple of days: numpy and matplotlib (in the next notebook)

## Numpy - big data in a simple "list" (technically an array)

Numpy may be the most useful tool for scientific coding. It allows you to write short code that runs very quickly operating on huge "lists" of data. N.b. I'm calling them, unofficially, "lists" here, because that's a good description of them in plain english. However a list is a different python object and numpy technically deals in ARRAYS, just a name for their particular kind of list that we'll use from now on.

### Importing modules

I'm going to assume you have numpy installed (because we asked you to!) hopefully through a simple python package manager like Anaconda (or maybe pip).

But we still have to tell python we want to use numpy, or in python jargon, we have to "import the numpy module"

The way we do this is lovely and simple. We're going to complicate it a little by using a useful but extra shorthand, we could just import numpy and when we want to call a function in it say numpy.someFunction(), but that will quickly get tiresome. So instead we tell python that we're going to call this new module "np", we could choose anything here, like "reallyUsefulPythonModule", but that would kind of defeat the point...

In [1]:
import numpy as np

Now it's importted and we could use it any and everywhere.

### What does numpy do?

Let's start working with these arrays. There's a whole bunch of ways to make an array, here's just a few:

In [4]:
a=np.array([1,1,2,3,5,8,13,21,34,55]) #simple 1D array of user inputted values (here the fibonacci sequence)
print('a= ',a)
b=np.arange(5) #1D array of the number 0,1,2,3,4 
#(note that numpy inteprets this as "give me the number up to but not including 5")
print('b= ',b)

#in fact, arange can do a lot of things
#you can always google these commands for more info but here's an example:
c=np.arange(10,-10,-2) #this gives me the numbers (10,8,6,4,2,0,-2,-4,-6,-8)
print('c= ',c)

#arrays can have more than one dimension, for example:
d=np.zeros((3,2)) #this is a 3 by 2 array of zeros (more useful than you might think!)
print('d= \n',d) #(\n just means new line, makes it easier to read)

#you can even make arrays of the same dimensions as existing arrays, e.g.
e=np.ones_like(d)
print('e= \n',e)

a=  [ 1  1  2  3  5  8 13 21 34 55]
b=  [0 1 2 3 4]
c=  [10  8  6  4  2  0 -2 -4 -6 -8]
d= 
 [[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
e= 
 [[ 1.  1.]
 [ 1.  1.]
 [ 1.  1.]]


So we can make numpy arrays, but why do we use them? Because they allow us to do stuff to EVERYTHING in the array in one action (rather than looping through entry by entry)

E.g.

In [5]:
print('We can add: ',e+2)
print('or: ',a+c) #note that these both have the same dimension
#you can't add tall and skinny to short and fat!
print('We can multiply: ',0.5*c)
print('And even: ',a*c)

We can add:  [[ 3.  3.]
 [ 3.  3.]
 [ 3.  3.]]
or:  [11  9  8  7  7  8 11 17 28 47]
We can multiply:  [ 5.  4.  3.  2.  1.  0. -1. -2. -3. -4.]
And even:  [  10    8   12   12   10    0  -26  -84 -204 -440]


## What else does numpy do?

We can make a bunch of arrays, and then start throwing them about, but that's only part of what numpy does. There are (probably) thousands of commands, some of them incredibly useful, some of them you'll probably never see. As always google is your friend if you want to do something specific, but here's one that will likely be useful to everyone immediately:

### The stack commands
We can all admit that we'd rather be playing with Lego at basically any moment. Luckily with numpy we almost can. The "stack" commands stick arrays together, either head to tail or side to side.

Imagine we have 3 anchovies, tall and thin. We could stick them together head to tail (i.e. vertically), and make a really bad rope. Or we could stick them together side to side (i.e. horizontally) and fit the in a sardine tin.

In this case, as in most, the more useful form is obvious (unless anyone can think of a good use for a fish-rope?). Honestly, I tend to work out the right one by trying one, and if it doesn't work trying the other, but feel free to show me up!

In [7]:
aHorizontal=np.hstack((a,a)) #note the second set of brackets!
print('aHorizontal: \n',aHorizontal) #fish-rope
aVertical=np.vstack((a,a))
print('aVertical: \n',aVertical) #anchovies in a tin

acVertical=np.vstack((a,c)) #as long as two arrays have compatible dimensions we can stack em
print('acVertical: \n',acVertical)

deHorizontal=np.hstack((d,e)) #and we can do just the same for 2(+) dimensional arrays
print('deHorizontal: \n',deHorizontal)

aHorizontal: 
 [ 1  1  2  3  5  8 13 21 34 55  1  1  2  3  5  8 13 21 34 55]
aVertical: 
 [[ 1  1  2  3  5  8 13 21 34 55]
 [ 1  1  2  3  5  8 13 21 34 55]]
acVertical: 
 [[ 1  1  2  3  5  8 13 21 34 55]
 [10  8  6  4  2  0 -2 -4 -6 -8]]
deHorizontal: 
 [[ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]
 [ 0.  0.  1.  1.]]


### Pulling out bits of arrays

There are quite a few ways to pull out some part of an array, let's start with the most basic: grabbing everything in one dimension.

In [8]:
#We can pull out individual elements, e.g.
print('The third element in a: ',a[3])
#Remember that (almost) every computing language starts counting at 0, hence
print('The zeroth element in c: ',c[0])
#Numpy also has a useful for shorthand for pulling out stuff at the back to the array
print('The second element from the end of b: ',b[-2])

#For multidimensional arrays (such as acVertical above) we can pull out individual elements
#by telling numpy how for DOWN and then ACROSS the element we want is
print('Zeroth row, first column of acVertical: ',acVertical[0,1])
print('First row, zeroth column of acVertical: ',acVertical[1,0])

#We can also pull out how rows using ":" (which basically means everything)
print('Second row of deHorizontal: ',deHorizontal[2,:])
print('Thrid column of acVertical: ',acVertical[:,3])

#Finally, we can use the ":" command to define a range of values to pull out
print('Second, Third and Fourth elements of a: ',a[2:5])
print('Fifth to penultimate elements from first row of acVertical: \n',acVertical[1,5:-1])

The third element in a:  3
The zeroth element in c:  10
The second element from the end of b:  3
Zeroth row, first column of acVertical:  1
First row, zeroth column of acVertical:  10
Second row of deHorizontal:  [ 0.  0.  1.  1.]
Thrid column of acVertical:  [3 4]
Second, Third and Fourth elements of a:  [2 3 5]
Fifth to penultimate elements from first row of acVertical: 
 [ 0 -2 -4 -6]


### The where command

Sometimes we want more specific parts of an array, that's where the WHERE command comes in.

The where command tells you where something is true.
It's really useful for looking for relevant data in a long array.

In [16]:
aIndex=np.argwhere((a > 5) & (a < 20)) #the locations of elements in a that satisfy this condition
print('aIndex: \n',aIndex)
print('and the corresponding entries: \n',a[aIndex])

#You can also pick data out of multidimensional arrays, for example the entries of 
#acVertical for which the first row is positive
acIndex=np.argwhere(acVertical[1,:] > 0)
print('the indices: \n',acIndex)
print('and the corresponding entries: \n',acVertical[1,acIndex])

aIndex: 
 [[5]
 [6]]
and the corresponding entries: 
 [[ 8]
 [13]]
the indices: 
 [[0]
 [1]
 [2]
 [3]
 [4]]
and the corresponding entries: 
 [[10]
 [ 8]
 [ 6]
 [ 4]
 [ 2]]


### Random, mean and standard deviation

When we have a whole bunch of data we often want to reduce it to just 2 numbers, a mean and a standard deviation.

In [104]:
# First let's make some random data to work with
randomData=np.random.random(10) # yes I know this is a function stupid name...
print('10 random numbers between 0 and 1: \n',randomData)
print('Mean: ',np.mean(randomData))
print('Error in the mean: ',np.std(randomData)/np.sqrt(10))

# Notice what happens with more data points
newData=np.random.random(1000000) # WAAAAY more data
print('New mean: ',np.mean(newData))
print('and new error: ',np.std(randomData)/np.sqrt(1000000))

10 random numbers between 0 and 1: 
 [ 0.7792396   0.18407444  0.45118353  0.39815263  0.25962708  0.02948794
  0.11975714  0.0878027   0.23987961  0.61515646]
Mean:  0.316436111836
Error in the mean:  0.0726744968058
New mean:  0.499951321469
and new error:  0.000229816937713


### The sort command

The final tool we'll introduce here is the SORT command. This does exactly what it says, takes your data and sorts it. Sometimes more useful than a command that does the sorting for us is ARGSORT, which gives us the indices we'd need to resort the array ourselves.

In [75]:
g=np.array([6,3,7,3,2,1,9,10])
print('g: ',g)
gSorted=np.sort(g)
print('gSorted: ',gSorted) #g in ascending order

#or we could do it in two steps, with argsort 
#(one of those commands that doesn't seem useful until it's really useful...)
sortedIndices=np.argsort(g)
print('sortedIndices: ',sortedIndices)
print('and the sorted array: ',g[sortedIndices])

#We don't need to sort the array into ascending order,
#we can ask for the sorted indeces of anther array and use those instead. E.g.
cMagSorted=np.argsort(np.abs(c)) #the ABS command gives the absolute value
print('cMagSorted: ',cMagSorted)
print('c in absolute value order: ',c[cMagSorted])
print('and finally, absolute values of c sorted in absolute value order: ',np.abs(c[cMagSorted]))

g:  [ 6  3  7  3  2  1  9 10]
gSorted:  [ 1  2  3  3  6  7  9 10]
sortedIndices:  [5 4 1 3 0 2 6 7]
and the sorted array:  [ 1  2  3  3  6  7  9 10]
cMagSorted:  [5 4 6 3 7 2 8 1 9 0]
c in absolute value order:  [ 0  2 -2  4 -4  6 -6  8 -8 10]
and finally, absolute values of c sorted in absolute value order:  [ 0  2  2  4  4  6  6  8  8 10]


## Challenge - how random is random
Make a big random array (100 elements should do) of random numbers between -1 and 1, sort it and find the mean difference between consecutive values? How does this relate to the size?

(Challenge challenge - try doing the same with a for-loop and use the %timeit command to compare how long it takes)

In [7]:
challenge=2*(np.array(np.random.random(100))-0.5)
print challenge
challengeSort=np.sort(challenge)
print '\n'
print challengeSort

diff=challengeSort[1:]-challengeSort[:-1]
diffMean=np.sort(diff)
print '\n',diffMean

#n=0
#for (challenge(n) < 100):
    

[ -3.41059806e-01  -8.86292379e-01  -5.24010754e-01   1.35982733e-01
   6.76548003e-01  -6.60935410e-01  -7.20577817e-01  -7.49170753e-01
   4.22335763e-01  -9.72529574e-01  -5.85026113e-01   6.08525454e-01
  -7.12849704e-01  -4.82737844e-01  -2.74758807e-01  -8.36264264e-01
  -6.10567443e-01  -9.96270733e-01  -3.01472118e-01  -5.99347418e-01
  -9.41817035e-01  -8.27956675e-01   8.05518010e-02   4.18812392e-01
  -6.60543235e-01   8.83494934e-01  -9.13757153e-01   7.25647813e-01
   2.99429673e-01   3.16177603e-03  -6.51422380e-01   2.51418849e-01
  -3.80673679e-01   1.58825541e-01  -1.49005576e-01  -2.04602455e-03
   3.04029095e-01  -4.31323400e-01  -4.56691214e-01  -2.93378551e-01
  -2.06095756e-01  -8.09173454e-01  -5.27060880e-01  -5.59544039e-01
  -3.17862314e-01  -8.58855493e-01   5.68509458e-01   2.58406237e-01
   2.31062716e-01   8.64913978e-01   1.71714494e-01  -4.52444750e-01
  -2.26836124e-01  -4.47696220e-01   5.83667179e-01  -9.55491352e-01
  -4.69268167e-01   5.32723351e-01