This document is a Python exploration of this R-based document: https://m-clark.github.io/data-processing-and-visualization/functions.html.  Code is *not* optimized for anything but learning.  In addition, all the content is located with the main document, not here, so some sections may not be included.  I only focus on reproducing the code chunks.

# Writing Functions

You can’t do anything in data science without using functions, but have you ever written your own? Why would you?

- Efficiency
- Customized functionality
- Reproducibility
- Extend the work that’s already been done

There are many benefits to writing your own functions, and it’s actually easy to do. Once you get the basic concept down, you’ll likely find yourself using your own functions more and more.

Python users typically need less convincing to write functions.  I believe this partly stems from Python being a general-purpose programming language rather than a data-science-specific one: many courses teach basic programming before data science applications, even when the latter is the focus.  In addition, R and other statistical programming languages assume interactive, line-by-line use, whereas Python does not, and many people use Python in ways quite different from what is most convenient for data science.

In general, if something out there is available that is tested and already does the job, I suggest using it before reinventing the wheel, which goes with the typical DRY approach of programming.

### A Starting Point

In [1]:
import pandas as pd
import numpy as np

A custom function to calculate some values of interest and return a `DataFrame` object.

In [2]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.mean(x),
        'sd': np.std(x),
        'N_missing': np.sum(np.isnan(x))
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out


In [3]:
my_summary([1,2,3])

Unnamed: 0,mean,sd,N_missing
row1,2.0,0.816497,0
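The `sd` of about 0.816 may surprise readers coming from R, where `sd()` returns 1 for the same input. The difference is the divisor: `np.std` divides by N by default, while R's `sd()` divides by N - 1. A minimal sketch of the two conventions (the variable names are only illustrative):

```python
import numpy as np

values = [1, 2, 3]

# NumPy's default is ddof=0, i.e. the population standard deviation (divide by N).
population_sd = np.std(values)

# ddof=1 gives the sample standard deviation (divide by N - 1), matching R's sd().
sample_sd = np.std(values, ddof=1)

print(population_sd)
print(sample_sd)
```

If the goal is to match the R results exactly, passing `ddof=1` is one way to do it; the rest of this document keeps the NumPy default.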


Works fine. However, data typically isn’t that pretty. It often has missing values.
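To see why missing values matter here, the small sketch below (with made-up values) compares the plain and NaN-aware versions of the mean. The plain version propagates the missing value, while the NaN-aware version ignores it.

```python
import numpy as np

with_missing = [1.0, 2.0, np.nan]   # made-up values with one missing entry

plain_mean = np.mean(with_missing)         # the NaN propagates to the result
nan_aware_mean = np.nanmean(with_missing)  # the NaN is dropped before averaging

print(plain_mean)
print(nan_aware_mean)
```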

In [4]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.nanmean(x),
        'sd': np.nanstd(x),
        'N_missing': np.sum(np.isnan(x))
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out

In [5]:
gapminder = pd.read_csv('../data/gapminder_small.csv')
my_summary(gapminder.lifeExp)

Unnamed: 0,mean,sd,N_missing
row1,72.658152,7.233072,3


In [6]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.nanmean(x),
        'sd': np.nanstd(x),
        'N_observed': np.sum(np.logical_not(np.isnan(x))), 
        'N_missing': np.sum(np.isnan(x)),
        'N_total': len(x)            
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out

In [7]:
my_summary(gapminder.lifeExp)

Unnamed: 0,mean,sd,N_observed,N_missing,N_total
row1,72.658152,7.233072,184,3,187


Now let's do it for every column!

In [8]:
gapminder.dtypes

country        object
year            int64
lifeExp       float64
pop             int64
gdpPercap       int64
giniPercap    float64
continent      object
dtype: object

In [9]:
# this was a good example of where the tidy approach is more straightforward due to purrr and other functionality; 
# this was about as good as I could come up with.
init = gapminder.select_dtypes(exclude='object')

pd.concat([my_summary(init[i]) for i in init.columns])

Unnamed: 0,mean,sd,N_observed,N_missing,N_total
row1,2018.0,0.0,187,0,187
row1,72.65815,7.233072,184,3,187
row1,40617140.0,147328700.0,187,0,187
row1,17983.37,19574.69,187,0,187
row1,38.80749,7.51135,187,0,187

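One drawback of the result above is that every row is labeled `row1`, so it is not obvious which column each row summarizes. As a possible alternative, the sketch below returns a `Series` from the summary function and lets `DataFrame.apply` do the iteration, so the rows are labeled by column name. The function name `summarize_column` and the toy data are assumptions for illustration, not part of the original document.

```python
import numpy as np
import pandas as pd

def summarize_column(x):
    # Returning a Series lets DataFrame.apply assemble the results into a table.
    return pd.Series({
        'mean': np.nanmean(x),
        'sd': np.nanstd(x),
        'N_observed': x.notna().sum(),
        'N_missing': x.isna().sum(),
        'N_total': len(x),
    })

# Toy data standing in for the gapminder file, which is not bundled here.
toy = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [4, 5, 6],
    'label': ['x', 'y', 'z'],
})

# Keep numeric columns, summarize each, and transpose so each row is one column.
summary = toy.select_dtypes(include='number').apply(summarize_column).T
print(summary)
```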

Playing with functions.  Create a function that returns another function.

In [10]:
def center(type):
    if (type == 'mean'): 
        return np.mean
    else:
        return np.median

center(type = 'mean')

myfun = center(type = 'mean')

myfun([1,2,3])

2.0
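Note that the version above silently falls back to the median for any input other than `'mean'`, and its argument name shadows Python's built-in `type`. One possible variation, sketched here under the assumption that an unknown name should be treated as a mistake, looks the function up in a dictionary and raises an error otherwise. The names `center_strict` and `kind` are hypothetical.

```python
import numpy as np

def center_strict(kind='mean'):
    # Map each supported name to the function that computes it.
    available = {'mean': np.mean, 'median': np.median}
    if kind not in available:
        raise ValueError(f"kind must be one of {sorted(available)}, got {kind!r}")
    return available[kind]

my_median = center_strict('median')
print(my_median([1, 2, 10]))
```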

Set default values for the inputs.

In [11]:

def hi(name = 'Beyoncé'):
    return 'Hi ' + name + '!'


hi()

'Hi Beyoncé!'

In [12]:
hi(name = 'Jay-Z')

'Hi Jay-Z!'

In [13]:
mpg = pd.read_csv('../data/mpg.csv')
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
0,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
1,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
2,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
3,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
4,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
229,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
230,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
231,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
232,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


### **D**on't **R**epeat **Y**ourself

An oft-quoted mantra in programming is Don’t Repeat Yourself. One context regards iterative programming, where we would rather write one line of code than one-hundred. More generally though, we would like to gain efficiency where possible. A good rule of thumb is, if you are writing the same set of code more than twice, you should write a function to do it instead.

In [14]:
def good_mileage(
    cylinder = 4,
    mpg_cutoff = 30,
    displ_fun = np.mean,
    displ_low = True,
    cls = "compact"
):
    if displ_low:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].le(displ_fun(mpg['displ'])))
        ]
    else:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].ge(displ_fun(mpg['displ'])))
        ]
    
    return result
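The function above still repeats most of its filter in both branches, so the DRY idea can be pushed a little further. The sketch below is one possible refactoring, not the original author's version: only the displacement comparison depends on `displ_low`, so only that piece sits inside the conditional. The name `good_mileage_dry`, the `data` argument, and the toy data frame are assumptions made so the example is self-contained.

```python
import numpy as np
import pandas as pd

def good_mileage_dry(data, cylinder=4, mpg_cutoff=30, displ_fun=np.mean,
                     displ_low=True, cls='compact'):
    # Only the displacement comparison differs between the two cases.
    displ_ref = displ_fun(data['displ'])
    if displ_low:
        displ_ok = data['displ'].le(displ_ref)
    else:
        displ_ok = data['displ'].ge(displ_ref)

    # The shared conditions are written once.
    keep = (
        data['cyl'].eq(cylinder) &
        data['hwy'].ge(mpg_cutoff) &
        data['class'].eq(cls) &
        displ_ok
    )
    return data[keep]

# Toy rows standing in for the mpg data.
toy_mpg = pd.DataFrame({
    'displ': [1.8, 2.0, 5.3, 1.9],
    'cyl': [4, 4, 8, 4],
    'hwy': [29, 31, 20, 44],
    'class': ['compact', 'compact', 'suv', 'compact'],
})

result = good_mileage_dry(toy_mpg, mpg_cutoff=30)
print(result)
```

Passing the data frame as an argument, rather than relying on a global `mpg` object, is a design choice of this sketch; it makes the function easier to test on small inputs.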

### Conditionals

The core of the above function is a conditional statement with the standard `if`…`else` structure. The `if` part determines whether some condition holds. If it does, Python runs the indented block beneath it. If not, it skips to the `else` part. We can also add further conditional branches (`elif`), drop the `else` part entirely, nest conditionals within other conditionals, etc. Like loops, conditional statements look very similar across all programming languages.
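As a small illustration of chaining several conditions, the sketch below uses `if`, `elif`, and `else` together. The function name `describe_mileage` and the cutoffs of 30 and 20 are arbitrary choices for the example.

```python
def describe_mileage(hwy):
    # Conditions are checked top to bottom; the first one that holds wins.
    if hwy >= 30:
        return 'high'
    elif hwy >= 20:
        return 'medium'
    else:
        return 'low'

print(describe_mileage(44))
print(describe_mileage(26))
print(describe_mileage(15))
```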

In any case, with our function at the ready, we can now do the things we want to as needed:

In [15]:
good_mileage(mpg_cutoff = 40)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
212,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact


In [16]:
good_mileage(
    cylinder = 8,
    mpg_cutoff = 15,
    displ_low = False,
    displ_fun = np.median,
    cls = 'suv'
)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
18,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
19,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,11,15,e,suv
20,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
21,chevrolet,c1500 suburban 2wd,5.7,1999,8,auto(l4),r,13,17,r,suv
22,chevrolet,c1500 suburban 2wd,6.0,2008,8,auto(l4),r,12,17,r,suv
28,chevrolet,k1500 tahoe 4wd,5.3,2008,8,auto(l4),4,14,19,r,suv
30,chevrolet,k1500 tahoe 4wd,5.7,1999,8,auto(l4),4,11,15,r,suv
31,chevrolet,k1500 tahoe 4wd,6.5,1999,8,auto(l4),4,14,17,d,suv
58,dodge,durango 4wd,4.7,2008,8,auto(l5),4,13,17,r,suv
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,13,17,r,suv


Let’s extend the functionality by adding a year argument (the only values available are 2008 and 1999).

In [17]:
def good_mileage(
    cylinder = 4,
    mpg_cutoff = 30,
    displ_fun = np.mean,
    displ_low = True,
    cls = "compact",
    yr = 2008
):
    if displ_low:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].le(displ_fun(mpg['displ']))) &
            (mpg['year'].eq(yr))
        ]
    else:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].ge(displ_fun(mpg['displ']))) &
            (mpg['year'].eq(yr))
        ]
    
    return result

In [18]:
good_mileage(
  cylinder = 8,
  mpg_cutoff = 19,
  displ_low = False,
  cls = 'suv',
  yr = 2008
)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
18,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
20,chevrolet,c1500 suburban 2wd,5.3,2008,8,auto(l4),r,14,20,r,suv
28,chevrolet,k1500 tahoe 4wd,5.3,2008,8,auto(l4),4,14,19,r,suv
81,ford,explorer 4wd,4.6,2008,8,auto(l6),4,13,19,r,suv
127,jeep,grand cherokee 4wd,4.7,2008,8,auto(l5),4,14,19,r,suv
139,mercury,mountaineer 4wd,4.6,2008,8,auto(l6),4,13,19,r,suv


### Anonymous functions

Oftentimes we just need a quick and easy function for a one-off application.  For example, the first call below applies an existing named function (`np.std`) to every column, while the second applies a function defined on the fly that halves every value.


In [19]:
mtcars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', index_col=0)

In [20]:
mtcars.apply(np.std, axis=0)

mpg       5.932030
cyl       1.757795
disp    121.986781
hp       67.483071
drat      0.526258
wt        0.963048
qsec      1.758801
vs        0.496078
am        0.491132
gear      0.726184
carb      1.589762
dtype: float64

In [21]:
mtcars.apply(lambda x: x/2, axis=0).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,10.5,3.0,80.0,55.0,1.95,1.31,8.23,0.0,0.5,2.0,2.0
Mazda RX4 Wag,10.5,3.0,80.0,55.0,1.95,1.4375,8.51,0.0,0.5,2.0,2.0
Datsun 710,11.4,2.0,54.0,46.5,1.925,1.16,9.305,0.5,0.5,2.0,0.5
Hornet 4 Drive,10.7,3.0,129.0,55.0,1.54,1.6075,9.72,0.5,0.0,1.5,0.5
Hornet Sportabout,9.35,4.0,180.0,87.5,1.575,1.72,8.51,0.0,0.0,1.5,1.0


The difference between the two is that for the latter, our function didn’t have to be a named object already available. We created a function on the fly just to serve a specific purpose. No ready-made function exists that does nothing but divide by two, but since the operation is so simple, we just created one as needed.
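To make the comparison concrete, the sketch below defines the same operation twice, once as a named function and once as a `lambda`, and checks that both give the same result. The name `halve` and the toy data frame are made up for the illustration.

```python
import pandas as pd

toy = pd.DataFrame({'a': [2, 4], 'b': [10, 20]})

def halve(x):
    # A named function: reusable, but more ceremony for a one-off.
    return x / 2

named_result = toy.apply(halve)
anonymous_result = toy.apply(lambda x: x / 2)  # same operation, defined inline

print(named_result.equals(anonymous_result))
```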

To further illustrate this, we’ll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.

In [22]:
from statsmodels import robust

# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
mtcars.apply(lambda x: (x - np.median(x))/robust.mad(x)).head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,0.332625,0.0,-0.258406,-0.168622,0.291096,-0.91887,-0.88284,,inf,0.0,1.34898
Mazda RX4 Wag,0.332625,0.0,-0.258406,-0.168622,0.291096,-0.586513,-0.487328,,inf,0.0,1.34898
Datsun 710,0.66525,-0.67449,-0.628575,-0.389129,0.220097,-1.309879,0.635645,inf,inf,0.0,-0.67449
Hornet 4 Drive,0.406542,0.0,0.439219,-0.168622,-0.873287,-0.14337,1.221851,inf,,-0.67449,-0.67449
Hornet Sportabout,-0.092396,0.67449,1.165319,0.67449,-0.773888,0.149887,-0.487328,,,-0.67449,0.0
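The `inf` and `NaN` entries above come from columns whose median absolute deviation is zero. Once the logic needs a check like that, a named function may read better than a lambda. The sketch below is one possible way to handle it, returning an all-missing column when the MAD is zero. To stay self-contained it computes the MAD with NumPy and the usual normal-consistency constant of about 1.4826 instead of calling `robust.mad`; the function name `robust_scale` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def robust_scale(x):
    med = np.median(x)
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    mad = np.median(np.abs(x - med)) * 1.4826
    if mad == 0:
        # Scaling is undefined here, so flag the whole column as missing.
        return pd.Series(np.nan, index=x.index)
    return (x - med) / mad

# Toy data: 'am' is mostly zeros, so its MAD is zero.
toy = pd.DataFrame({
    'wt': [1.0, 2.0, 3.0, 4.0, 10.0],
    'am': [0, 0, 0, 0, 1],
})

scaled = toy.apply(robust_scale)
print(scaled)
```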


Even if you don’t use anonymous functions (sometimes called lambda functions), it’s important to understand them, because you’ll often see other people’s code using them.

## Writing Functions Exercises

### Exercise 1

Write a function that takes the log of the sum of two values (i.e. just two single numbers) using a log function such as `np.log`. Just remember that within a function, you can write Python code just like you normally would.

In [23]:
# def log_sum(a, b):
#     ?

### Exercise 1b

What happens if the sum of the two numbers is negative? You can’t take the log of a negative value: `np.log` returns `nan` with a warning, and `math.log` raises a `ValueError`. How might we deal with this? Try using a conditional to provide an error message. The first part is basically identical to the function you just did. But given that result, you will need to check for whether it is negative or not. The message can be whatever you want (i.e. just return a string of some kind).
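Before writing the conditional, it may help to see what Python does with a negative input on its own. The sketch below only demonstrates the behavior of the two common log functions and leaves the exercise itself open; the variable names are illustrative.

```python
import math
import numpy as np

# np.log returns nan for a negative input (normally with a RuntimeWarning,
# which is silenced here so the output stays clean).
with np.errstate(invalid='ignore'):
    numpy_result = np.log(-1.0)

# math.log raises an exception instead of returning a value.
try:
    math.log(-1.0)
    math_raised = False
except ValueError:
    math_raised = True

print(numpy_result)
print(math_raised)
```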

In [24]:
# def log_sum(a, b):
#     if (?):
#         ?
#     else:
#         ?

### Exercise 2

Let’s write a function that will take a numeric variable and convert it to a *string* label of ‘positive’ vs. ‘negative’. We can use an `if`…`else` structure, or other means. In this case, the input is a single array of numbers, and the output will recode any negative value to ‘negative’ and positive values to ‘positive’ (or whatever you want). Here is an example of how we would just do the comparison as a one-off.

In [25]:
np.random.seed(123)  # so you get the exact same 'random' result

x = np.random.normal(size = 10)

np.array2string(x < 0)

'[ True False False  True  True False  True  True False  True]'
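The one-off above yields booleans rather than labels. If string labels are wanted directly, `np.where` is one option, since it picks between two values element by element. The sketch below uses a small made-up array; note that with this test a value of exactly zero would be labeled 'positive'.

```python
import numpy as np

values = np.array([-1.2, 0.5, 3.0, -0.1])   # made-up numbers

# Where the condition holds use the first label, otherwise the second.
labels = np.where(values < 0, 'negative', 'positive')

print(labels)
```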

Now try your hand at writing a function for that using a conditional statement.  If that's too easy, try converting it to return `True` if the input is less than some number, and `False` otherwise.

In [26]:
# def pos_neg(?):
#   ?