In [1]:
#data we need later on

import pandas as pd

knmi = pd.read_csv('data/knmi.tsv', low_memory=False, sep=',')
knmi = knmi.loc[knmi.STN == 270]
knmi.set_index(['STN', 'YYYYMMDD'], inplace = True)
knmi[['TG', 'TX', 'RH']] = knmi[['TG', 'TX', 'RH']].astype(float)

knmi.head()

                TG   TN    TX   SQ    DR     RH
STN YYYYMMDD
270 20000101  42.0   -4  79.0   49    15   11.0
    20000102  55.0   33  74.0   12     0   -1.0
    20000103  74.0   49  89.0    0   124  172.0
    20000104  46.0   22  75.0    4    13   11.0
    20000105  41.0   14  56.0   56     0    0.0


## Python Builtin functions

Builtins are functions that ship with Python and are available without any import. Their advantage is that you do not need to know the implementation details; knowing how to call them is enough. This notebook looks at some of the most useful ones. A full overview can be found at https://docs.python.org/3/library/functions.html

We already know many of these builtin functions, such as `set()` and `enumerate()`.

In this notebook we look at some useful builtin functions that you can use in combination with iterable objects, such as a list or the pandas dataframe. These are:

- higher-order functions: `map()`, `filter()` and `reduce()` (note that in Python 3 `reduce()` is no longer a builtin; it is imported from the `functools` module)
- the `zip()` function
- the `sorted()` function compared to the list method `sort()`

Furthermore, we will look at lambda functions, because they are often used in combination with operations on iterables.

## `lambda` functions

In Python, a lambda function is a function written as a single expression, with **no name** and **without the keyword `def`**. Because they have no name, lambda functions are also called **anonymous functions**. The syntax is as follows:

    lambda arguments : expression

An example is below. The argument is `x`, the expression is `x*x`.


In [2]:
lambda x: x*x

<function __main__.<lambda>(x)>

By assigning the lambda function to a variable, I can call it.

In [3]:
square = lambda x: x*x
print(square(4))

16


This does exactly the same as the regular (non-anonymous) function with the name `square` and the keyword `def`:

In [4]:
def square(x):
    return x*x
print(square(4))

16


Like regular functions, lambda functions can take multiple arguments. In the example above, the lambda function takes `x` as its input argument and returns the result of the expression `x*x`. With multiple input arguments it looks like this:

In [5]:
ml = lambda x,y: x*y
print(ml(3,4))

12



So `x` and `y` are the **input arguments** (these are before the `:`) and `x*y` is the **expression** (this is after the `:`) 

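The expression can be any single Python expression, including a conditional expression. Statements such as assignment statements, loops or `return` are not allowed inside a lambda.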

In [6]:
f = lambda a,b: a if (a > b) else b
print(f(47,11))
print(f(11,47))

47
47


Why is this useful? I could just use a regular function, right? That is indeed possible. But for small operations an anonymous lambda function is often more convenient, especially when I want to apply the same operation to every item of an iterable object, such as a pandas column, or inside a list comprehension. Lambda functions are also very often used in combination with the higher-order functions `map()`, `filter()` and `reduce()`.
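
As a small illustration of this point, the sketch below uses one lambda inside a list comprehension and passes another one inline as the `key` of the builtin `min()`. The names `temps`, `to_celsius` and `closest` and the values are made up for this example.

```python
# Hypothetical readings, stored in tenths of a degree
temps = [42.0, 55.0, 74.0, 46.0, 41.0]

# A lambda bound to a name and reused inside a list comprehension
to_celsius = lambda t: t / 10
celsius = [to_celsius(t) for t in temps]

# A lambda passed inline as the key of a builtin:
# pick the reading that lies closest to 5.0 degrees
closest = min(celsius, key=lambda c: abs(c - 5.0))

print(celsius)
print(closest)
```

For a one-off operation like the `key` above, the inline lambda saves defining a separate named function that is used only once.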

## Higher order functions: `map()`, `filter()` and `reduce()`

`map()` and `filter()` are builtin functions of Python; `reduce()` is available from the `functools` module. All three are higher-order functions: they **take another function as a parameter along with an iterable object** (a list, a tuple, a dict, or the like). `map()` and `filter()` apply the function to each item and return a lazy iterator over the results, while `reduce()` combines the items into a single value. These functions support a functional programming style in Python. The syntax of `map()` is as follows:

    map(function, iterable_object)

The function parameter defines an operation that is applied to each item in the iterable_object. It can be a built-in function, a self-defined function or a lambda function. Below you can see some examples:

### `map()` examples

In [7]:
def newfunc(a):
    return a*a
x = map(newfunc, (1,2,3,4))
print(x)         # x is a map object (a lazy iterator)
print(tuple(x))  # converting to a tuple consumes the iterator and shows the mapped values

<map object at 0x11d9b1c70>
(1, 4, 9, 16)


In [8]:
def complement_map(seq):
    complements = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
    # str.join() accepts any iterable, so the map object can be passed directly
    return ''.join(map(lambda x: complements[x], seq))

complement_map('TCCAAAGGT')

'AGGTTTCCA'
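
Two further properties of `map()` are worth a quick sketch: a list comprehension gives the same result, and `map()` accepts more than one iterable, in which case the function receives one item from each per call. The variable names below are illustrative only.

```python
values = (1, 2, 3, 4)

# map() is lazy; list() consumes the iterator
squares_map = list(map(lambda v: v * v, values))

# The equivalent list comprehension
squares_comp = [v * v for v in values]

# With two iterables the function gets one item from each per call
sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))

print(squares_map)
print(squares_comp)
print(sums)
```

Which form to use is mostly a matter of taste; the comprehension is often easier to read when the operation is a short expression.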

In [28]:
def deci(t):
    return t/10

#examples with new columns
print(knmi.head(2), '\n')
knmi['TG_mapped'] = knmi.TG.map(deci)
print(knmi.head(2), '\n')
knmi['TG_lambda'] = knmi.TG.map(lambda x:x/10)
print(knmi.head(2), '\n')
#examples with adjusted column
knmi.TG = knmi.TG.map(lambda x:x/10)
print(knmi.head(2), '\n')

                TG     TN    TX     SQ     DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD                                                            
270 20120204 -10.5   -168 -58.0     71      0  0.0      -1.05      -1.05
    20100126  -9.2   -127 -43.0     75      0  0.0      -0.92      -0.92 

                TG     TN    TX     SQ     DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD                                                            
270 20120204 -10.5   -168 -58.0     71      0  0.0      -1.05      -1.05
    20100126  -9.2   -127 -43.0     75      0  0.0      -0.92      -0.92 

                TG     TN    TX     SQ     DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD                                                            
270 20120204 -10.5   -168 -58.0     71      0  0.0      -1.05      -1.05
    20100126  -9.2   -127 -43.0     75      0  0.0      -0.92      -0.92 

                TG     TN    TX     SQ     DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD                                 

`Series.map()` works on a single column (a `pandas.Series`). To apply an expression to an entire dataframe, use `apply()` or `applymap()`. In pandas 2.1 and later, `applymap()` is deprecated in favour of `DataFrame.map()`.

In [33]:
knmi = knmi.applymap(lambda x:x*10)
print(knmi.head(2), '\n')

                   TG                                                 TN  \
STN YYYYMMDD                                                               
270 20120204 -10500.0   -168 -168 -168 -168 -168 -168 -168 -168 -168 ...   
    20100126  -9200.0   -127 -127 -127 -127 -127 -127 -127 -127 -127 ...   

                   TX                                                 SQ  \
STN YYYYMMDD                                                               
270 20120204 -58000.0     71   71   71   71   71   71   71   71   71 ...   
    20100126 -43000.0     75   75   75   75   75   75   75   75   75 ...   

                                                             DR   RH  \
STN YYYYMMDD                                                           
270 20120204      0    0    0    0    0    0    0    0    0 ...  0.0   
    20100126      0    0    0    0    0    0    0    0    0 ...  0.0   

              TG_mapped  TG_lambda  
STN YYYYMMDD                        
270 20120204    -1050.0    

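`applymap()` is only available on a DataFrame and performs an element-wise operation across the whole DataFrame, whereas `apply()` passes a whole row or column to the function at a time. Note in the output above that the columns that were not cast to float (such as `TN`) hold text, so `x*10` repeats the string ten times instead of multiplying a number.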
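
The sketch below shows one way to avoid the string repetition: convert the columns to numbers first and then scale them column by column with `apply()`. The small dataframe `df` is made up for this example and only mimics the mixed numeric and text columns of the KNMI data.

```python
import pandas as pd

# Made-up frame: one numeric column and one column that was read in as text
df = pd.DataFrame({'TG': [42.0, 55.0], 'TN': ['-4', '33']})

# Multiplying text repeats it instead of scaling it
print('33' * 3)

# Convert every column to numbers, then scale each column with apply()
numeric = df.apply(pd.to_numeric)
scaled = numeric.apply(lambda col: col / 10)

print(scaled)
```

Here `apply()` hands each column to the function as a whole Series, so the division is performed once per column rather than once per cell.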

## `filter()`

The `filter()` function returns an iterator over the items for which the function returns a truthy value. Like `map()`, `filter()` takes as arguments a function and an iterable object. Its syntax is as follows:

    filter(function, iterable_object)

In [10]:
def func(x):
    # the predicate returns True or False; filter() keeps the items for which it is True
    return x >= 3
y = filter(func, (1,2,3,4))
print(y)
print(tuple(y))

<filter object at 0x11d9bd040>
(3, 4)


We can do the same with `lambda`

In [11]:
y = filter(lambda x: (x>=3), (1,2,3,4))
print(tuple(y))

(3, 4)
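
One pitfall is worth a short sketch: `filter()` looks at the truthiness of what the function returns. A function that returns the item itself instead of a boolean silently drops falsy items such as `0`. The variable names below are illustrative.

```python
values = (0, 1, 2, 3, 4)

# A predicate that returns True/False keeps 0, as intended
non_negative = tuple(filter(lambda v: v >= 0, values))

# Returning the item itself drops 0, because 0 is falsy
buggy = tuple(filter(lambda v: v if v >= 0 else None, values))

# Passing None as the function keeps only the truthy items
truthy = tuple(filter(None, values))

# The equivalent generator expression for the first case
same = tuple(v for v in values if v >= 0)

print(non_negative)
print(buggy)
print(truthy)
```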


## filtering in pandas

Attention! The pandas `DataFrame.filter()` method subsets the rows or columns of a dataframe according to the labels in the specified index. Note that it does not filter a dataframe on its contents; the filter is applied to the index labels only.

In [12]:
knmi_TG = knmi.filter(['TG']) #filter columns
knmi_TG.head(2)

               TG
STN YYYYMMDD
270 20000101  4.2
    20000102  5.5


In [13]:
knmi_re = knmi.filter(regex='T') #filter columns with a T in its name
knmi_re.head(2)

               TG   TN    TX  TG_mapped  TG_lambda
STN YYYYMMDD
270 20000101  4.2   -4  79.0        4.2        4.2
    20000102  5.5   33  74.0        5.5        5.5


In [14]:
knmi_dd = knmi.filter(regex='20000101', axis = 0) #filter on rows containing date '20000101'

In [15]:
knmi_dd

               TG   TN    TX   SQ   DR    RH  TG_mapped  TG_lambda
STN YYYYMMDD
270 20000101  4.2   -4  79.0   49   15  11.0        4.2        4.2


To subset a dataframe on its contents, use a boolean expression inside the square brackets (boolean indexing). The expression returns True or False for each row, and only the rows where it is True are selected:

    df = df[df.column expression]
 

In [16]:
print(len(knmi))
knmi_freeze = knmi[knmi.TG < 0 ]
print(len(knmi_freeze))
knmi_freeze.head(2)

6940
360


               TG   TN    TX   SQ   DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD
270 20000123 -0.4  -52  32.0   49   20  7.0       -0.4       -0.4
    20000124 -1.6  -53   7.0   14    0  0.0       -1.6       -1.6
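
To make the mechanism visible, the sketch below prints the boolean mask itself and then combines two conditions. The dataframe `df` is a made-up stand-in with only two columns; the column names follow the KNMI data.

```python
import pandas as pd

# Made-up stand-in for the KNMI frame
df = pd.DataFrame({'TG': [-0.4, 4.2, -1.6, 5.5],
                   'RH': [7.0, 11.0, 0.0, -1.0]})

# The expression yields one True/False per row
mask = df.TG < 0
print(mask.tolist())

# Indexing with the mask keeps only the True rows
freezing = df[mask]

# Combine conditions with & (and) or | (or); each needs its own parentheses
freezing_dry = df[(df.TG < 0) & (df.RH == 0)]

print(freezing)
print(freezing_dry)
```

The parentheses matter because `&` binds more tightly than the comparison operators.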


## Using higher order functions in conjunction: `map(func, filter(func, iter))`

When you nest these functions, the inner function is evaluated first and its output becomes the input of the outer function.

Let's try passing the result of `filter()` as the iterable parameter of `map()`.

The code below first checks whether the condition `x >= 3` is True for each item. The items that pass, 3 and 4, are then used as the iterable object for the `map()` function. Because `map()` applies the lambda function `lambda x: x+x`, input 3 becomes 6 and input 4 becomes 8.

In [17]:
# map(function, iterable_object)
# function = lambda x:x+x
# iterable_object = outcome of expression filter(lambda x: (x>=3), (1,2,3,4))
c = map(lambda x:x+x,filter(lambda x: (x>=3), (1,2,3,4)))
print(tuple(c))

(6, 8)
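
`reduce()` was named earlier in this notebook but not demonstrated. The sketch below is a minimal example, assuming the Python 3 import from `functools`; the variable names are illustrative. It folds an iterable into a single value and can be chained behind `map()` and `filter()`.

```python
from functools import reduce

values = (1, 2, 3, 4)

# reduce() folds the iterable into one value: ((1 + 2) + 3) + 4
total = reduce(lambda acc, v: acc + v, values)

# The optional third argument is the starting value of the accumulator
product = reduce(lambda acc, v: acc * v, values, 1)

# Chained: keep the items >= 3, double them, then add them up
combined = reduce(lambda acc, v: acc + v,
                  map(lambda v: v + v, filter(lambda v: v >= 3, values)))

print(total)
print(product)
print(combined)
```

For plain sums the builtin `sum()` is simpler; `reduce()` is mainly useful when the combining step is something other than addition.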


## The `zip()` function
The `zip()` function returns a zip object: an iterator of tuples in which the first item of each iterable is paired together, then the second item of each iterable, then the third, and so on. If the iterables have different lengths, the shortest one determines the length of the result. The syntax is as follows:

    zip(iterator1, iterator2, iterator3 ...) 

In [18]:
oneLetters = ['a','c','d','e','f']
threeLetters = ['Ala','Cys','Asp','Glu','Phe']
combined = zip(oneLetters, threeLetters)
print(dict(combined))

{'a': 'Ala', 'c': 'Cys', 'd': 'Asp', 'e': 'Glu', 'f': 'Phe'}
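
The truncation to the shortest input can be sketched as follows. The example also shows two related tools: `zip(*pairs)` to undo the pairing and `itertools.zip_longest()` to pad instead of truncate. The variable names and the fill value `'?'` are choices made for this example.

```python
from itertools import zip_longest

one_letters = ['a', 'c', 'd']
three_letters = ['Ala', 'Cys', 'Asp', 'Glu', 'Phe']

# zip() stops at the shortest input: 'Glu' and 'Phe' are dropped
pairs = list(zip(one_letters, three_letters))
print(pairs)

# zip(*pairs) reverses the pairing ("unzip")
letters, names = zip(*pairs)
print(letters, names)

# zip_longest() pads the shorter input instead of truncating
padded = list(zip_longest(one_letters, three_letters, fillvalue='?'))
print(padded)
```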


This can be handy for building data structures for plotting, for example the nested categories and flattened counts for a Bokeh `ColumnDataSource`:

    x = [ (m, y) for m in months for y in years ] # this creates [ ("jan", "2019"), ("jan", "2020"),...]
    counts = sum(zip(df[2019].tolist(), df[2020].tolist()), ()) 
    source = ColumnDataSource(data=dict(x=x, counts=counts))
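
The `sum(zip(...), ())` idiom in the snippet above flattens the pairs into one tuple, interleaving the two years. The sketch below shows this with plain lists, so it runs without a dataframe or plotting library; the lists `counts_2019` and `counts_2020` are made up, and `itertools.chain.from_iterable()` is shown as an alternative.

```python
from itertools import chain

# Made-up counts for three months in two years
counts_2019 = [5, 3, 4]
counts_2020 = [6, 2, 7]

# zip() pairs the years per month; sum(..., ()) concatenates the pairs
interleaved = sum(zip(counts_2019, counts_2020), ())
print(interleaved)

# The same result with chain.from_iterable()
same = tuple(chain.from_iterable(zip(counts_2019, counts_2020)))
print(same)
```

The interleaved order (month 1 in both years, then month 2, and so on) matches the order of the `(month, year)` pairs built by the list comprehension.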


## python `sorted()` and `sort()`

Sorting is very easy in Python using the built-in function `sorted()`. `sorted()` accepts any iterable and always returns a new list with the elements in sorted order, without modifying the original. Syntax:

      sorted(iterable_object, key=None, reverse=False)

where `key` and `reverse` are optional keyword arguments. `key` is a function that is called on each element to compute the value to sort by; `reverse=True` sorts in descending order (the default is ascending).

In [19]:
x = [2, 8, 1, 4, 6, 3, 7] 
  
print("Sorted List returned :")
print(sorted(x))
  
print("\nReverse sort :")
print(sorted(x, reverse = True))
  
print("\nOriginal list not modified :") 
print(x) 

Sorted List returned :
[1, 2, 3, 4, 6, 7, 8]

Reverse sort :
[8, 7, 6, 4, 3, 2, 1]

Original list not modified :
[2, 8, 1, 4, 6, 3, 7]


One can pass a `key` argument to indicate what the sort should be based on:

In [20]:
L = ["cccc", "b", "dd", "aaa"] 
  
print("Normal sort :", sorted(L)) 
print("Sort with len :", sorted(L, key = len))

Normal sort : ['aaa', 'b', 'cccc', 'dd']
Sort with len : ['b', 'dd', 'aaa', 'cccc']


In the above case the `len` function is used as the key, but it can also be a self-defined function. The function below sorts amino acids by weight:

In [21]:
aaWeights = {'gly':75, 'ala':89, 'glu':147, 'his':155, 'pro':115, 'tyr':181}
def aaSorter(aa):
      return aaWeights[aa.lower()]
l = ['Gly','Ala','pro','His','his','glu','tyr']
print(sorted(l, key=aaSorter))

['Gly', 'Ala', 'pro', 'glu', 'His', 'his', 'tyr']
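
The key can also be a lambda, and it may return a tuple to sort on several criteria at once. The sketch below reuses the weights from the example above under the illustrative name `aa_weights`; the list `words` is made up.

```python
aa_weights = {'gly': 75, 'ala': 89, 'glu': 147, 'his': 155, 'pro': 115, 'tyr': 181}

# Sort the (name, weight) pairs by weight, heaviest first
by_weight = sorted(aa_weights.items(), key=lambda item: item[1], reverse=True)
print(by_weight)

# A tuple key sorts on length first and alphabetically within equal lengths
words = ['cccc', 'b', 'dd', 'aaa', 'a', 'bb']
by_len_then_alpha = sorted(words, key=lambda w: (len(w), w))
print(by_len_then_alpha)
```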


The list method `sort()` sorts a list in place: it changes the list itself and returns `None`. The function `sorted()` leaves the original untouched and returns a new list, which you need to assign to a variable if you want to keep it. The difference is demonstrated below:

In [22]:
vegetables = ['squash', 'pea', 'carrot', 'potato']
print(vegetables)
new_list = sorted(vegetables)
print(new_list)
print(vegetables)
vegetables.sort()
print(vegetables)

['squash', 'pea', 'carrot', 'potato']
['carrot', 'pea', 'potato', 'squash']
['squash', 'pea', 'carrot', 'potato']
['carrot', 'pea', 'potato', 'squash']
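
A common mistake that follows from this difference is assigning the result of `sort()`. The short sketch below shows what happens, and also that `sorted()` accepts iterables other than lists; the variable names are illustrative.

```python
vegetables = ['squash', 'pea', 'carrot', 'potato']

# list.sort() sorts in place and returns None
result = vegetables.sort()
print(result)
print(vegetables)

# sorted() works on any iterable, for example a string, and returns a list
letters = sorted('pea')
print(letters)

# list.sort() takes the same key and reverse arguments as sorted()
vegetables.sort(key=len, reverse=True)
print(vegetables)
```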


## numpy `sort()` and `argsort()`


There are many kinds of sorting algorithms. `np.sort()` and `np.argsort()` let you choose one through the `kind` parameter: 'quicksort', 'mergesort', 'heapsort' or 'stable'. The array method `arr.sort()` sorts in place, whereas the function `np.sort()` returns a sorted copy. `argsort()` is an indirect sort: it returns the indices that would sort the array.

See also https://en.wikipedia.org/wiki/Sorting_algorithm for an overview of sorting algorithms. A visual representation is https://www.youtube.com/watch?v=kPRA0W1kECg and a nice explanation is https://www.youtube.com/watch?v=kgBjXUE_Nwc



In [23]:
import numpy as np
arr = np.random.randn(2,3)
print('\nunsorted')
print(arr)
print('\nsorted on row')
arr.sort(axis=1)  # sort the values within each row, in place
print(arr)
print('\nsorted on column')
arr.sort(axis=0)  # sort the values within each column, in place
print(arr)


unsorted
[[-0.80899401 -0.48515855 -0.07336271]
 [ 0.27898235 -0.08632479  1.24483513]]

sorted on row
[[-0.80899401 -0.48515855 -0.07336271]
 [-0.08632479  0.27898235  1.24483513]]

sorted on column
[[-0.80899401 -0.48515855 -0.07336271]
 [-0.08632479  0.27898235  1.24483513]]
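
The difference between the function `np.sort()` and the array method `sort()` can be sketched with a small fixed array, so the result does not depend on random numbers. The array contents and variable names are made up for this example.

```python
import numpy as np

arr = np.array([[3, 1, 2],
                [9, 7, 8]])

# np.sort() returns a sorted copy; arr itself is left untouched
sorted_copy = np.sort(arr, axis=1)
unchanged = arr.tolist()
print(sorted_copy)
print(arr)

# The array method sorts in place and returns None
arr.sort(axis=1)
print(arr)

# The algorithm can be selected with the kind parameter
stable = np.sort(np.array([3, 1, 2]), kind='stable')
print(stable)
```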


`np.argsort()` returns the indices that would sort the array:

In [24]:
arr = np.random.randn(5)
print('\nunsorted')
print(arr)
print('\nusing np.argsort()')
print(arr.argsort())
print('\n arr with indexes')
print(arr[arr.argsort()])


unsorted
[ 0.61571673  0.3601718  -1.09654169  0.5332182   0.36805026]

using np.argsort()
[2 1 4 3 0]

 arr with indexes
[-1.09654169  0.3601718   0.36805026  0.5332182   0.61571673]
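
A typical use of these indices is to reorder a second array in the same way. The sketch below sorts made-up amino acid names by their weights; the arrays `weights` and `names` are assumptions for this example.

```python
import numpy as np

weights = np.array([147, 75, 181, 89])
names = np.array(['glu', 'gly', 'tyr', 'ala'])

# Indices that would sort the weights in ascending order
order = np.argsort(weights)
print(order)

# Use the same indices to reorder the parallel array
names_by_weight = names[order]
print(names_by_weight)

# Reverse the index array for descending order
heaviest_first = names[order[::-1]]
print(heaviest_first)
```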


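Related functions are `np.searchsorted()`, which finds the position where a value should be inserted into an already sorted array, and `np.lexsort()`, which sorts indirectly on multiple keys. These functions also work on a `pd.Series`.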
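
A minimal sketch of both functions, using made-up arrays. Note that `np.lexsort()` treats the last key in the tuple as the primary sort key.

```python
import numpy as np

# searchsorted() expects an array that is already sorted
sorted_arr = np.array([10, 20, 30, 40])
pos = np.searchsorted(sorted_arr, 25)
print(pos)  # 25 would be inserted before index 2

# lexsort() sorts on several keys; the last key is the primary one
tg = np.array([5, 3, 5, 3])
tn = np.array([2, 9, 1, 4])
order = np.lexsort((tn, tg))  # sort by tg first, then by tn
print(order)
print(tg[order], tn[order])
```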

## pandas `sort_values()` and `sort_index()`
A pandas DataFrame does not have a `sort()` method. Instead it has `sort_values()`, which sorts by the values of one or more columns, and `sort_index()`, which sorts by the index labels.

In [25]:
print(knmi.head(2))
knmi = knmi.sort_values(by=['TG', 'TN'], ascending = True)
print(knmi.head(2))

               TG     TN    TX     SQ     DR    RH  TG_mapped  TG_lambda
STN YYYYMMDD                                                            
270 20000101  4.2     -4  79.0     49     15  11.0        4.2        4.2
    20000102  5.5     33  74.0     12      0  -1.0        5.5        5.5
                TG     TN    TX     SQ     DR   RH  TG_mapped  TG_lambda
STN YYYYMMDD                                                            
270 20120204 -10.5   -168 -58.0     71      0  0.0      -10.5      -10.5
    20100126  -9.2   -127 -43.0     75      0  0.0       -9.2       -9.2
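
The sketch below shows `sort_index()` next to `sort_values()` on a small made-up dataframe with a date-like index; the values only mimic the KNMI data. Both methods return a new dataframe by default and leave the original unchanged.

```python
import pandas as pd

# Made-up frame with an unordered date-like index
df = pd.DataFrame(
    {'TG': [5.5, -0.4, 4.2], 'TN': [33, -52, -4]},
    index=pd.Index([20000102, 20000123, 20000101], name='YYYYMMDD'),
)

# Sort by the index labels
by_date = df.sort_index()
print(by_date)

# Sort by the values of a column, ascending and descending
coldest_first = df.sort_values(by='TG')
warmest_first = df.sort_values(by='TG', ascending=False)
print(coldest_first)
print(warmest_first)
```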
