# LASP REU Python Tutorial Day 2

## Imports

Remember the huge stack of science libraries (or the list of built-in libraries)?

So, how to get those?

With imports:

In [None]:
from math import pi
tau = 2*pi
print("The real circular constant is", tau)
# read tauday.com if you don't believe me. ;)

Python is very flexible about how you import other libraries, their functions, or their values.

There are two main styles:
* Import a function directly into the current `namespace`:
```python
from math import sin
```
* Import the top-level module and access its contents via the dot `.`:
```python
import math
print(math.sin(0.5))
```

In both cases one can rename the object being imported.

This is a very powerful feature of Python: you are free to choose the names under which you call things.

In [None]:
from math import sin as stupid_sine
print(stupid_sine(0.5))

In [None]:
import math as i_hate_math
print(i_hate_math.cos(-1))
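
As a small sketch of what these import styles have in common: whichever spelling you pick, you end up with a name for the very same function object. The aliases `m` and `sine` below are made up for this illustration.

```python
import math
import math as m                # same module, shorter name
from math import sin            # the function itself, pulled into this namespace
from math import sin as sine    # the same function under a different name

# All four spellings reach one and the same function object.
same_function = (math.sin is m.sin) and (math.sin is sin) and (sin is sine)
print(same_function)
```

So renaming on import only changes the label you use, never the thing being imported.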

## Function returns, packing and unpacking

Functions can return values.

(you saw this in the last exercise already):

In [None]:
from math import log10

def mylog(value):
    if value <= 0:  # log10 is undefined for zero as well
        return "Logarithm not defined."
    else:
        return log10(value)

In [None]:
result = mylog(-1)
result

In [None]:
import math
math.log10(-1)

Maybe you want to return more than one value?

Python will automatically pack things up:

In [None]:
from math import log10

def mylog(value):
    if value <= 0:  # log10 is undefined for zero as well
        return "Logarithm not defined."
    else:
        return value, log10(value)

In [None]:
result = mylog(3)
result

In [None]:
print(type(result))
len(result)

Tuples are the immutable version of lists:

In [None]:
result[0] = 4

In [None]:
mylist = list(result)
mylist

In [None]:
mylist[0] = 5
mylist

But tuples are "iterable" just like lists (meaning I can loop over them):

In [None]:
for item in result:
    print(item)

In [None]:
waves = [5000, 6000]

In [None]:
for wave in waves:
    print(wave)

In [None]:
d = {'a':5, 'b':6}
d

In [None]:
for key, val in d.items():
    print(key, val)

# POLL time!

Python always packs multiple return values automatically (as it did above with `result`), but it never unpacks them automatically.
You have to ask for it by putting several names on the left-hand side:

In [None]:
val, res = mylog(3)
print(val)
res

Because of the automatic packing, you can always catch everything in a single variable. If you unpack into several names, though, their number has to match the number of returned values exactly:

In [None]:
a, b, c = mylog(17)

In [None]:
out = mylog(17)

In [None]:
type(out)

In [None]:
out[0]

In [None]:
out[1]
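
The packing and unpacking rules can be sketched in one place. The function `min_max_mean` is a made-up example, not part of the tutorial; the starred name in the last line is standard Python syntax for "collect whatever is left".

```python
def min_max_mean(values):
    """Return three results at once; Python packs them into one tuple."""
    return min(values), max(values), sum(values) / len(values)

packed = min_max_mean([2, 4, 9])         # one name: receives the whole tuple
lo, hi, avg = min_max_mean([2, 4, 9])    # three names: one per element
first, *rest = min_max_mean([2, 4, 9])   # a starred name collects the remainder in a list

print(packed)
print(lo, hi, avg)
print(first, rest)
```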

### Interlude: strings

They are also iterables:

In [None]:
for char in "Han":  # Remember that I decide on the name of my temporary variable
    print(char)

This result often causes confusion, because most likely you wanted to loop over `['Han']` but forgot to put the string into a list:

In [None]:
for item in ['Han']:
    print(item)

In [None]:
for item in 'Han':
    print(item)

Strings have a lot of useful support functions (object lingo: `methods`) inside them:

In [None]:
s = "Han shot first!"

In [None]:
s.center(100)

In [None]:
s.split()   # by default, spaces are assumed to be the separator

In [None]:
s.split('s')  #  note that the separator can be anything, but it will be removed!

You can even call methods directly on a string literal, without assigning it to a variable first:

In [None]:
'{1}, I am your {0}.'.format('father', 'Luke')

A newer, more compact and readable way of doing the same is:

In [None]:
parent = 'father'
name = 'Luke'

In [None]:
f"{name}, I am your {parent}"

Even an empty string or a string containing only a space has these methods. A very useful one is the `join` method:

In [None]:
' '.join(["It's", 'a','trap!'])
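
`split` and `join` work roughly as opposites of each other, which the following sketch tries to show. The sample `line` is invented for this illustration.

```python
line = "2013-01-31,5000,6000"

fields = line.split(",")      # the separator is removed, a list of strings remains
rejoined = ";".join(fields)   # the string in front of .join is placed between the items

print(fields)
print(rejoined)
```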

A more useful example:

In [None]:
mylist = ['ls', '-la']
" ".join(mylist)

### Back to functions: Default values
Functions can have optional arguments that hold a default value:

In [None]:
def sub_reverser(alist, index=0):
    reversed_list = list(reversed(alist))
    return ''.join(reversed_list[index:])

In [None]:
''.join(list(reversed('astring')))

In [None]:
sub_reverser('astring')

In [None]:
sub_reverser('astring', 3)

This is a very powerful design feature of Python as well: I only need to write one function, but can use it in very different ways, depending on which optional arguments I override. Arguments passed by name are called `keyword arguments`.
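
Here is a minimal sketch of the different ways to call a function with defaults. The function `scale` and its parameters `factor` and `offset` are hypothetical names chosen for this example.

```python
def scale(values, factor=2, offset=0):
    """Multiply each value by `factor` and then add `offset`."""
    return [v * factor + offset for v in values]

defaults_only = scale([1, 2, 3])                   # both defaults are used
positional = scale([1, 2, 3], 10)                  # positional: fills `factor`
by_name = scale([1, 2, 3], offset=1)               # by name: skips `factor`, sets `offset`
any_order = scale([1, 2, 3], offset=1, factor=3)   # named arguments can come in any order

print(defaults_only, positional, by_name, any_order)
```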

### Namespaces of functions

This topic deals with the sometimes confusing but powerful way in which functions find and use variables.

In [None]:
a = 3
def print_a(a):  # an `a` parameter is given!
    print(a)  # this a is available and that's all the function needs.

In [None]:
print_a(5)

In [None]:
print(a)

In [None]:
def print_a_no_param():  # no a provided
    print(a)  # the function is asked to print a, there's no `a` in ITS namespace, so it goes and finds one `outside`

In [None]:
print_a_no_param()

In [None]:
print(a)   # this call doesn't know about any function content, it looks in its OWN namespace (i.e. `root`)

In [None]:
def change_a():  
    a = 4  # what will happen??
    print(a)

In [None]:
change_a()

This is GOOD! It means a function cannot accidentally change the world `outside` of itself!

In [None]:
def increase_a():
    a = a + 1
    print(a)

In [None]:
increase_a()

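Some people call it inconsistent that this fails, and I tend to agree. The reason: assigning to `a` anywhere inside the function makes `a` a local name for the whole function, so the read on the right-hand side finds a local variable that has no value yet.

But it's kinda lazy (and dangerous) to simply make use of variables outside the function anyway.

So, Guido van Rossum just thought: this is as much as I support the use of undeclared variables.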

And who's to argue with the inventor of a thing as nice as Python!?

In [None]:
def use_a():
    b = a + 1
    print(b)

In [None]:
use_a()

In [None]:
print(a)
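
A cleaner pattern, sketched here with made-up names (`counter`, `increased`, `broken_increase`), is to pass the value in as a parameter and hand the result back with `return`. The second function repeats the failing pattern only to show which error Python raises.

```python
counter = 3

def increased(value):
    """Work only on the parameter and hand the result back."""
    return value + 1

new_counter = increased(counter)
print(counter, new_counter)   # the outer variable itself is untouched

def broken_increase():
    counter = counter + 1     # the assignment makes `counter` local, so the read fails
    return counter

try:
    broken_increase()
except UnboundLocalError as err:
    error_name = type(err).__name__
    print(error_name)
```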

Ok, let's go to some more meaty science libraries.

## matplotlib gallery

* Can't go into depth of matplotlib library here, very rich and powerful
* Best way to learn: Go to their gallery page http://matplotlib.org/gallery.html

In [None]:
from IPython.display import IFrame
IFrame("https://matplotlib.org/gallery.html", width=800, height=500)

## numpy

For any serious array/vector/matrix based math, you should use the numpy library.

It is faster because, in contrast to lists, it insists on keeping every item the same data type. This lets numpy store the data as C objects under the hood and pass them to C or Fortran math libraries.

These libraries are decades-old standards with very well-researched behaviour!

Contrary to the above-mentioned freedom for importing, there are some conventions that you should just adopt if you don't want to confuse yourself when searching for tutorials, blogs etc. to help you out.

These standard imports are:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Okay, let's do some math:

In [None]:
mylist = list(range(10))
mylist

In [None]:
arr = np.array(mylist)
arr

The most important feature of numpy arrays is support for vector math, which lists can't do:

In [None]:
mylist / 3

In [None]:
arr / 3
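
To make the contrast concrete, here is a small sketch of what you would have to write for a list to get the same result as the array expression, plus a reminder of what `*` means for each type. The sample numbers are arbitrary.

```python
import numpy as np

mylist = [0, 3, 6, 9]
arr = np.array(mylist)

# With a plain list you have to spell out the loop yourself ...
list_result = [x / 3 for x in mylist]

# ... while the array applies the operation to every element at once.
arr_result = arr / 3

print(list_result)
print(arr_result)
print(mylist * 2)   # for lists, * means repetition, not multiplication
print(arr * 2)      # for arrays, * multiplies each element
```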

Lingo: `Lists` are Python lists, `arrays` are numpy arrays.

Basic indexing works the same:

In [None]:
mylist[2:4]

In [None]:
arr[2:4]

numpy has its own range function: `arange()` (standing for array-range):

In [None]:
np.arange(10)

In [None]:
np.arange?

In [None]:
arr = np.arange(1, 11, 2, dtype='float')  # works the same as range, but also type-able
arr

In [None]:
arr.dtype

# Poll time

In [None]:
np.arange(11, 3, -2)

Handy array creators:

In [None]:
np.ones(3)

In [None]:
np.zeros(5)

Multi-dimensional:

In [None]:
np.ones((2,3))  # rows first, then columns

Note that the `ones` function wants the dimensions as a tuple if more than 1, to not confuse things with other arguments.

Here's 3D:

In [None]:
n3d = np.zeros((2,4,3))  # the depth dimension (how many 2D arrays) first!
n3d

In [None]:
n3d.shape
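
The order of the dimensions is easier to see when the array holds distinct numbers instead of zeros. This sketch uses `reshape` on an `arange` array; the names `flat` and `cube` are made up for the example.

```python
import numpy as np

flat = np.arange(24)             # 24 numbers in a row, shape (24,)
cube = flat.reshape((2, 4, 3))   # 2 blocks, each with 4 rows and 3 columns

print(cube.shape)
print(cube[0].shape)    # picking one block leaves a 4x3 array
print(cube[1, 0])       # block 1, row 0: three values
print(cube[1, 0, 2])    # block 1, row 0, column 2: a single number
```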

Lots of methods inside the numpy array object:

In [None]:
arr.  # press <TAB> after the dot to see the list of methods

Most useful:

In [None]:
arr

In [None]:
arr.mean()

In [None]:
np.median(arr)

In [None]:
arr.std()

In [None]:
arr.dot(arr)

In [None]:
np.linalg.norm(arr)**2

In [None]:
np.ones((3,3))

In [None]:
np.ones((3,3)).diagonal()

In [None]:
arr.max()

In [None]:
arr.argmax()  # WHERE is the max?

Fancy indexing is one of the most important features:

In [None]:
arr

In [None]:
arr < 3

In [None]:
arr[arr<3]  # lingo: give me array WHERE condition is TRUE
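
Conditions can also be combined before they are used for selection. The following is a minimal sketch with the same `arr` values as above; the name `mask` is just a common convention.

```python
import numpy as np

arr = np.arange(1, 11, 2, dtype='float')   # [1., 3., 5., 7., 9.]

mask = (arr > 2) & (arr < 8)   # element-wise AND; the parentheses are required
selected = arr[mask]           # keeps only the entries where the mask is True

print(mask)
print(selected)
print(mask.sum())              # True counts as 1, so this counts the matches
```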

2D numpy arrays (= matrices) have a lot of matrix features:

In [None]:
arr2d = np.random.randn(2,3)  # randn provides Gaussian-distributed random values.
arr2d

In [None]:
%matplotlib inline

plt.hist(np.random.randn(1000), bins=100, color='green');

In [None]:
arr2d

In [None]:
arr2d.transpose()

Absolutely impossible to cover `numpy` even half here, but it's one of the most important tools in Python.

## pandas

The most important high-level data analysis library (in my opinion).

It uses `numpy` under the hood.

So, my recommendation:
1. Learn the numpy basics, but don't try to take it all in; that's hardly possible.
   * For example, via the scipy quickstart tutorial.
2. Put your main effort into pandas, and whenever its docs refer to an unknown numpy feature, look that up.

Pandas has a nice 10 minutes intro here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

The most important objects in `pandas` are Series and DataFrames.

In [None]:
import pandas as pd

s = pd.Series(np.arange(10)*24.1)
s

In [None]:
s[4]  # integer indexing works just like for numpy arrays and lists

A pandas.Series is just like a 1-dim numpy array, but with more bells and whistles. Most importantly, it has an index.

An important concept in `pandas` is the index. `pandas` always keeps the relationship between index labels and data intact!
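
One consequence of this, sketched with two made-up Series `a` and `b`: when you combine Series, the values are matched by their index labels, not by their position.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['z', 'y', 'x'])   # same labels, other order

total = a + b   # values are matched by label, not by position

print(total)
print(total['x'])
```

With plain numpy arrays, the same addition would simply pair up the first, second and third elements.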

Series filtering works the same as for numpy arrays, as it's built on them:

In [None]:
s[s<100]

DataFrames are the 2D version of Series (akin to 2D numpy arrays), complete with column names.

Another very enticing feature of `pandas` is its datetime handling:

In [None]:
dates = pd.date_range('20130101', periods=10, freq='m')  # 'm' = month end; newer pandas versions expect 'ME'

print(dates)
df = pd.DataFrame(np.random.randn(10,4), index=dates, columns=list('ABCD'))
df

In [None]:
df.loc['2013-01','A':'C']

The coolest part is the automatic datetime and multi-graph plotting of DataFrames:

In [None]:
df.plot()

It is using `matplotlib` under the hood, so you could do this yourself, but with much more code.

A newer plotting interface for pandas is called `hvplot`:

In [None]:
import hvplot.pandas

In [None]:
df.hvplot()

In [None]:
df.head()

In [None]:
df.tail(3)

In [None]:
df.loc['2013-01':'2013-06', 'B':'C']

The `.loc` attribute used above is the official way to select data in a dataframe in all possible ways. The first part always selects the rows you want, the second part the columns.

But if you only want to filter/select on your current index, you can put these conditions directly into brackets `[]`, like an indexing operation:

In [None]:
df['2013-01':'2013-06']  # note, confusingly, but usefully, pandas decided to make the last index INCLUSIVE!!

In [None]:
df[df > 0]  # fancy indexing works with pandas as well! (as it's built upon numpy)

In [None]:
df.describe()

In real-life interactive data analysis, one often calculates new sets of values based on measured stuff.

It's a breeze with pd.DataFrames:

In [None]:
df['E'] = df.A + df.B

In [None]:
df['my new column'] = df.C + df.D

In [None]:
df['my new column']

In [None]:
df.head()

### Combine fancy indexes

Also works in numpy:

In [None]:
df.A < 0

In [None]:
df.B < 0

In [None]:
(df.A<0) & (df.B<0) 

In [None]:
indexer = (df.A<0) & (df.B<0) 

In [None]:
df[indexer]

### differences to numpy indexing for dataframes

In [None]:
df.A[0]

In [None]:
df[0]

In [None]:
df['A']  # direct brackets uses column IDs first!

In [None]:
df.iloc[0]  # using iloc or loc uses rows first again!

In [None]:
df.values[0]  # values always goes `down` to the numpy level
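
The different access paths can be compared side by side on a tiny frame. The frame `small` and its labels are invented for this sketch; `to_numpy()` is used here as the method form of `.values`.

```python
import pandas as pd

small = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [10, 20, 30]},
    index=['r0', 'r1', 'r2'],
)

column = small['A']              # plain brackets with a label: a column
row_by_label = small.loc['r1']   # .loc: row by index label
row_by_pos = small.iloc[1]       # .iloc: row by integer position
cell = small.loc['r1', 'B']      # rows first, then columns
raw = small.to_numpy()[1]        # drop down to a plain numpy array, then take row 1

print(column.tolist(), row_by_label.tolist(), cell, raw)
```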

# Poll time

In [None]:
df

In [None]:
df.iloc[2, 1]

In [None]:
df.get(2, 1)

In [None]:
df.idxmax(axis=1)

### apply any function to a row and create new one

In [None]:
def calc_mean(row):  # call this row, because the apply method gives a row to this
    return row.mean()

In [None]:
df['mean'] = df.apply(calc_mean)

**NOTE**

An important and powerful concept in Python: you can pass functions around like any other object!

To **EXECUTE** a function/method, you always need to use `()` at the end.

To tell Python about a function defined elsewhere, use its name **WITHOUT** the parentheses.

Other languages would call this "a pointer to the function"; in Python, the name is simply a reference to the function object.
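
The difference between calling a function and handing it over can be sketched without pandas. The helper `apply_to_all` is a made-up stand-in for methods like `df.apply` that expect a function as an argument.

```python
def double(x):
    return 2 * x

def apply_to_all(func, values):
    """Call `func` on every item; `func` arrives as an object, not as a result."""
    return [func(v) for v in values]

called = double(4)                          # parentheses: the function runs right now
mapped = apply_to_all(double, [1, 2, 3])    # no parentheses: the function itself is passed

also_double = double                        # a second name for the same function object
print(called, mapped, also_double is double)
```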

In [None]:
df

What happened??

Read the docs:

In [None]:
df.apply?

<details>
<summary>Solution</summary>
df.apply(calc_mean, axis=1)
</details>

In [None]:
df['mean'] = df.apply(calc_mean, axis=1)  # axis=1 hands each ROW to calc_mean

In [None]:
calc_mean(df)

In [None]:
df['test'] = calc_mean(df)
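
What the `axis` argument of `apply` does is easiest to see on a tiny frame with known numbers. The frame `small` below is invented for this sketch.

```python
import pandas as pd

small = pd.DataFrame({'A': [1.0, 3.0], 'B': [2.0, 6.0]}, index=['r0', 'r1'])

per_column = small.apply(lambda col: col.mean(), axis=0)   # default: the function gets each column
per_row = small.apply(lambda row: row.mean(), axis=1)      # the function gets each row

print(per_column)   # indexed by column name: A, B
print(per_row)      # indexed like the rows: r0, r1
```

Only the `axis=1` result shares its index with the rows of the frame, which is why only that one can be stored as a new column without producing missing values.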

Another quick mention of a useful method of dataframes:

Find the location of extrema!

In [None]:
df.idxmax(axis=1)

### groupby

Let's have a look at one powerful feature: Grouping and per-group stats.

First, we need to create something to group by:

In [None]:
import random

In [None]:
group = [random.choice('abc') for _ in range(len(df))]
group

What did I do here?

Each time you call `random.choice` it provides one of the given elements.
I need to do that `n` times, with n being the length of the dataframe.
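
Written out as an explicit loop, the one-liner looks like the sketch below. The fixed seed and the small `n` are assumptions made only so that both versions produce the same letters on every run.

```python
import random

n = 5            # stand-in for len(df)

# The one-line list comprehension ...
random.seed(0)   # fixed seed, only to make the two versions comparable
short = [random.choice('abc') for _ in range(n)]

# ... is shorthand for this explicit loop.
random.seed(0)
long = []
for _ in range(n):
    long.append(random.choice('abc'))

print(short)
print(short == long)
```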


In [None]:
len(df) == df.shape[0]

In [None]:
df.shape   # just like numpy!

In [None]:
df['group'] = group
df

In [None]:
g = df.groupby('group')
g.size()

In [None]:
g.mean()

Can you imagine how much code you would have to write to do this from scratch?
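
For comparison, here is a rough sketch of a per-group mean in plain Python, for one single column only. The `groups` and `values` lists are made-up stand-ins for the `group` column and one data column of the dataframe.

```python
groups = ['a', 'b', 'a', 'c', 'b']
values = [1.0, 2.0, 3.0, 4.0, 6.0]

# Step 1: collect the values that belong to each group label.
collected = {}
for label, value in zip(groups, values):
    collected.setdefault(label, []).append(value)

# Step 2: reduce each group to a single number.
means = {label: sum(vals) / len(vals) for label, vals in collected.items()}

print(means)
```

`groupby` does both steps for every numeric column at once and keeps the result labeled.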

Ok, exercise time!