# Advanced Data Analysis - week 2 - functions

In the advanced data analysis course, we assume basic knowledge of Python, as could be acquired by attending the *Introduction to Programming* bridging course.

This notebook presents more examples of functions and their use in Pandas.



## Preliminaries

Let's start by importing the Pandas and os libraries.

In [1]:
# imports pandas
import pandas as pd


import os

## Examples

We will start by creating a simple DataFrame with the population and language of six countries.

In [2]:
data = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
    "language": ["Portuguese", "Spanish", "German", "Portuguese", "Spanish", "Spanish"],
})

print(data)

  country  population    language
0      PT    10276617  Portuguese
1      ES    46937060     Spanish
2      DE    83019213      German
3      BR   211049519  Portuguese
4      MX   127575529     Spanish
5      UY     3461731     Spanish


Suppose we want to compute the number of people that speak each language.

We need to group the rows by language and, for each group, sum the population.

In [3]:
stats = data[["population","language"]].groupby("language").sum()

print( stats)

            population
language              
German        83019213
Portuguese   221326136
Spanish      177974320
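
To make the "group, then sum" idea concrete, the sketch below repeats the same computation by hand with a plain Python dictionary. It is only an illustration of the logic, not how Pandas implements ```groupby```; the variable names (```languages```, ```populations```, ```totals```) are chosen for this example.

```python
# The population and language columns of the DataFrame, as plain lists.
populations = [10276617, 46937060, 83019213, 211049519, 127575529, 3461731]
languages = ["Portuguese", "Spanish", "German", "Portuguese", "Spanish", "Spanish"]

# One running total per language: the dictionary key plays the role of the group.
totals = {}
for language, population in zip(languages, populations):
    totals[language] = totals.get(language, 0) + population

print(totals)
```

The totals agree with the values printed by ```groupby("language").sum()``` above.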


What if we wanted to print the information about the country with the largest population for each language?

As the function ```nlargest``` is not defined for the result of the ```groupby```, we need to define a function that calls ```nlargest``` ourselves. The ```apply``` method takes a function and applies it to each group individually: the data of each group is passed to the function as a DataFrame, so we can call any method defined on DataFrames, including ```nlargest```. Given this, our function only needs to return the row with the largest population. Pandas will automatically call the function once for each group.

The syntax for defining a function is the following:

```
def nameOfFunction( arguments):
    statement
    return expression
```

where ```arguments``` is a possibly empty list of variables, ```statement``` is any Python statement, and ```expression``` is any Python expression.
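
As a small illustration of this syntax, here are two functions that are not part of the original notebook: one with a single argument and one with an empty argument list. The names ```population_in_millions``` and ```separator``` are made up for this example.

```python
def population_in_millions(population):
    # Statement: compute an intermediate value.
    millions = population / 1_000_000
    # Return expression: the value handed back to the caller.
    return round(millions, 1)


def separator():
    # A function may take no arguments at all.
    return "-" * 25


print(population_in_millions(10276617))  # prints 10.3
print(separator())
```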


In [6]:
def mylargest( p):
    return p.nlargest(1,["population"])

stats2 = data.groupby("language").apply(mylargest)

print( stats2)

             country  population    language
language                                    
German     2      DE    83019213      German
Portuguese 3      BR   211049519  Portuguese
Spanish    4      MX   127575529     Spanish


To better show what is going on, let's add some printing to the code of the function.

We will print the type of the variable received, to confirm that it is a DataFrame, and the contents of the DataFrame.

In [7]:
def mylargest2( p):
    print( "The type of variable p is " + str(type(p)))
    print( "The contents of p :")
    print( p)
    print( "-------------------------")
    return p.nlargest(1,["population"])

stats2 = data.groupby("language").apply(mylargest2)

print( "Final result")
print( stats2)

The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population language
2      DE    83019213   German
-------------------------
The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population    language
0      PT    10276617  Portuguese
3      BR   211049519  Portuguese
-------------------------
The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population language
1      ES    46937060  Spanish
4      MX   127575529  Spanish
5      UY     3461731  Spanish
-------------------------
Final result
             country  population    language
language                                    
German     2      DE    83019213      German
Portuguese 3      BR   211049519  Portuguese
Spanish    4      MX   127575529     Spanish


As you can see, the function was called three times, once for each group.
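
One way to picture this is an explicit loop over the groups. The sketch below is a simplification, assuming the same data as above: it is not how Pandas implements ```apply``` internally, and its result does not carry the extra ```language``` level in the index.

```python
import pandas as pd

data = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
    "language": ["Portuguese", "Spanish", "German", "Portuguese", "Spanish", "Spanish"],
})

largest_rows = []
# Iterating over a groupby yields one (key, DataFrame) pair per group.
for language, group in data.groupby("language"):
    # Keep the row with the largest population in this group.
    largest_rows.append(group.nlargest(1, "population"))

# Stack the one-row DataFrames into a single result.
result = pd.concat(largest_rows)
print(result)
```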

Functions defined in one cell are available in the following cells. For example, we can use the function ```mylargest2``` without defining it again.

In [8]:
stats2 = data.groupby("language").apply(mylargest2)

print( "Final result")
print( stats2)

The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population language
2      DE    83019213   German
-------------------------
The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population    language
0      PT    10276617  Portuguese
3      BR   211049519  Portuguese
-------------------------
The type of variable p is <class 'pandas.core.frame.DataFrame'>
The contents of p :
  country  population language
1      ES    46937060  Spanish
4      MX   127575529  Spanish
5      UY     3461731  Spanish
-------------------------
Final result
             country  population    language
language                                    
German     2      DE    83019213      German
Portuguese 3      BR   211049519  Portuguese
Spanish    4      MX   127575529     Spanish


When performing data analysis, the functions used are often simple, containing a single line.

For simple functions, instead of defining a function using the normal syntax, it is possible to use a lambda function. A lambda function is defined with the following syntax:

```lambda arguments : expression```

This is equivalent to defining the function

```
def myfun( arguments):
    return expression
```
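
The equivalence can be checked with a small example in plain Python. The list ```rows``` and the function name ```by_population``` are made up for this illustration; both calls to ```max``` use a key function that returns the population of a row.

```python
rows = [("PT", 10276617), ("ES", 46937060), ("UY", 3461731)]


def by_population(row):
    return row[1]


# Key function defined with def.
largest_def = max(rows, key=by_population)
# The same key function written inline as a lambda.
largest_lambda = max(rows, key=lambda row: row[1])

print(largest_def == largest_lambda)  # prints True
```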

So, let's rewrite the previous code using a lambda function.

In [7]:
stats2 = data.groupby("language").apply(lambda p: p.nlargest(1,["population"]) )

print( stats2)

             country  population    language
language                                    
German     2      DE    83019213      German
Portuguese 3      BR   211049519  Portuguese
Spanish    4      MX   127575529     Spanish


In data analysis code it is common to have functions defined as lambda functions.
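
As one more illustration, which is not part of the original examples, a lambda can also be passed to the ```apply``` method of a single column. The sketch below assumes the same data as above and expresses each population in millions; the variable name ```millions``` is chosen for this example.

```python
import pandas as pd

data = pd.DataFrame({
    "country": ["PT", "ES", "DE", "BR", "MX", "UY"],
    "population": [10276617, 46937060, 83019213, 211049519, 127575529, 3461731],
    "language": ["Portuguese", "Spanish", "German", "Portuguese", "Spanish", "Spanish"],
})

# The lambda is called once per value of the population column.
millions = data["population"].apply(lambda n: round(n / 1_000_000, 1))

print(millions)
```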

I hope this notebook has helped consolidate your knowledge of functions. As a final exercise, compute the number of people that speak each language (our first example in this notebook) using a function you define yourself.