These notes are the Python version of the R code [here](https://m-clark.github.io/data-processing-and-visualization/).  Much more detail and demonstration is found there.

# Writing Functions

You can’t do anything in data science without using functions, but have you ever written your own? Why would you?

- Efficiency
- Customized functionality
- Reproducibility
- Extend the work that’s already been done

There are many benefits to writing your own functions, and it’s actually easy to do. Once you get the basic concept down, you’ll likely find yourself using your own functions more and more.

Python users typically need less convincing to write functions.  I believe this is partly because Python is a general-purpose programming language rather than one built specifically for data science, and many courses teach basic programming before data science applications, even when the latter is the focus.  In addition, R and other statistical programming languages assume interactive, line-by-line use, whereas Python does not, and many people use it in ways quite different from what is most useful for data science.

In general, if a tested tool that already does the job is available, I suggest using it rather than reinventing the wheel, in keeping with the DRY (Don’t Repeat Yourself) approach to programming.

### A Starting Point

In [None]:
import pandas as pd
import numpy as np

Here is a custom function that calculates some values of interest and returns them as a `DataFrame` object.

In [None]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.mean(x),
        'sd': np.std(x),   # population sd (ddof=0); pass ddof=1 for the sample sd
        'N_missing': np.sum(np.isnan(x))
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out


In [None]:
my_summary([1,2,3])

Works fine. However, data typically isn’t that pretty. It often has missing values.
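
To see why this matters, here is a small check on a made-up array (the values are purely illustrative). With a missing value present, `np.mean` returns `nan`, while the NaN-aware `np.nanmean` skips the missing entry.

```python
import numpy as np

# made-up values for illustration, with one missing entry
x = np.array([1.0, 2.0, np.nan, 3.0])

plain_mean = np.mean(x)           # the missing value propagates to the result
nan_aware_mean = np.nanmean(x)    # the missing value is skipped
n_missing = np.sum(np.isnan(x))   # count of missing entries

print(plain_mean, nan_aware_mean, n_missing)
```

The updated function swaps in the NaN-aware versions for the same reason.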

In [None]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.nanmean(x),
        'sd': np.nanstd(x),
        'N_missing': np.sum(np.isnan(x))
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out

In [None]:
gapminder = pd.read_csv('../data/gapminder_small.csv')
my_summary(gapminder.lifeExp)

In [None]:
def my_summary(x):
    out = pd.DataFrame(
        {
        'mean': np.nanmean(x),
        'sd': np.nanstd(x),
        'N_observed': np.sum(np.logical_not(np.isnan(x))), 
        'N_missing': np.sum(np.isnan(x)),
        'N_total': len(x)            
        },
        index = ['row1']   # index is required for 1 row result
    )
    return out

In [None]:
my_summary(gapminder.lifeExp)

Now let's do it for every numeric column!

In [None]:
gapminder.dtypes

In [None]:
# this was a good example of where the tidy approach is more straightforward due to purrr and other functionality; 
# this was about as good as I could come up with.
init = gapminder.select_dtypes(exclude='object')

pd.concat([my_summary(init[i]) for i in init.columns])
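
One drawback of the result above is that every row is labeled `row1`. A possible variation, sketched here on a small made-up data frame (`toy` and its values are hypothetical stand-ins for gapminder), passes the column names as `keys` to `pd.concat` and then drops the inner `row1` level, so each row is labeled by the column it summarizes. Selecting with `include='number'` is also just one choice for picking out the numeric columns.

```python
import numpy as np
import pandas as pd

def my_summary(x):
    # same idea as the summary function above, shortened for the sketch
    return pd.DataFrame(
        {
            'mean': np.nanmean(x),
            'sd': np.nanstd(x),
            'N_missing': np.sum(np.isnan(x)),
        },
        index=['row1'],
    )

# hypothetical toy data standing in for gapminder
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'lifeExp': [70.0, np.nan, 80.0],
    'pop': [1.0, 2.0, 3.0],
})

numeric = toy.select_dtypes(include='number')

# keys= adds the column name as an outer index level;
# droplevel(1) then removes the inner 'row1' label
summary = pd.concat(
    [my_summary(numeric[col]) for col in numeric.columns],
    keys=numeric.columns,
).droplevel(1)

print(summary)
```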

Playing with functions.  Create a function that returns another function.

In [None]:
def center(kind):   # `kind` rather than `type`, which would shadow the built-in
    if kind == 'mean':
        return np.mean
    else:
        return np.median

center(kind = 'mean')

myfun = center(kind = 'mean')

myfun([1,2,3])
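
Note that `center` quietly returns the median for any input other than `'mean'`, including typos. As a rough alternative, the sketch below uses a dictionary lookup and raises an error for unknown names. The name `center_lookup` and its `kind` argument are made up for this illustration.

```python
import numpy as np

def center_lookup(kind='mean'):
    """Return the function that computes the requested measure of center."""
    choices = {'mean': np.mean, 'median': np.median}
    if kind not in choices:
        raise ValueError(f"kind must be one of {sorted(choices)}, got {kind!r}")
    return choices[kind]

my_median = center_lookup('median')   # a function, not a number
result = my_median([1, 2, 10])        # calling it computes the median

print(result)
```

Adding another measure of center then only requires a new dictionary entry.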

Set default values for the inputs.

In [None]:

def hi(name = 'Beyoncé'):
    return 'Hi ' + name + '!'


hi()

In [None]:
hi(name = 'Jay-Z')

In [None]:
mpg = pd.read_csv('../data/mpg.csv')
mpg

### **D**on't **R**epeat **Y**ourself

An oft-quoted mantra in programming is Don’t Repeat Yourself. One context is iterative programming, where we would rather write one line of code than one hundred. More generally, though, we would like to gain efficiency wherever possible. A good rule of thumb: if you find yourself writing the same block of code more than twice, write a function to do it instead.

In [None]:
def good_mileage(
    cylinder = 4,
    mpg_cutoff = 30,
    displ_fun = np.mean,
    displ_low = True,
    cls = "compact"
):
    if displ_low:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].le(displ_fun(mpg['displ'])))
        ]
    else:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].ge(displ_fun(mpg['displ'])))
        ]

    return result

### Conditionals

The core of the above function is a conditional statement with the standard if…else structure. The `if` part checks whether some condition holds. If it does, Python runs the indented block that follows; if not, it skips to the `else` part. We can also add further conditions (`elif`, Python’s version of else if), drop the `else` part entirely, nest conditionals within other conditionals, and so on. Like loops, conditional statements look very similar across programming languages.
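
As a small standalone illustration of the `elif` form, here is a toy function (the name `mileage_label` and its cutoffs are arbitrary, chosen only for this sketch) that sorts a highway mileage value into one of three bins.

```python
def mileage_label(hwy):
    """Classify a highway mpg value; the cutoffs are arbitrary."""
    if hwy >= 30:
        label = 'high'
    elif hwy >= 20:      # only checked when the first condition fails
        label = 'medium'
    else:                # reached when neither condition holds
        label = 'low'
    return label

labels = [mileage_label(v) for v in (35, 24, 12)]
print(labels)
```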

In any case, with our function at the ready, we can now do the things we want to as needed:

In [None]:
good_mileage(mpg_cutoff = 40)

In [None]:
good_mileage(
    cylinder = 8,
    mpg_cutoff = 15,
    displ_low = False,
    displ_fun = np.median,
    cls = 'suv'
)

Let’s extend the functionality by adding a year argument (the only values available are 2008 and 1999).

In [None]:
def good_mileage(
    cylinder = 4,
    mpg_cutoff = 30,
    displ_fun = np.mean,
    displ_low = True,
    cls = "compact",
    yr = 2008
):
    if displ_low:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].le(displ_fun(mpg['displ']))) &
            (mpg['year'].eq(yr))
        ]
    else:
        result = mpg[
            (mpg['cyl'].eq(cylinder)) &
            (mpg['hwy'].ge(mpg_cutoff)) &
            (mpg['class'].eq(cls)) &
            (mpg['displ'].ge(displ_fun(mpg['displ']))) &
            (mpg['year'].eq(yr))
        ]

    return result

In [None]:
good_mileage(
  cylinder = 8,
  mpg_cutoff = 19,
  displ_low = False,
  cls = 'suv',
  yr = 2008
)
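
The function still repeats most of its filter in both branches and relies on `mpg` existing as a global. In the spirit of DRY, one possible refactoring is sketched below: the data frame is passed as an argument, and the conditional only decides the direction of the displacement comparison. The name `good_mileage2`, the `data` argument, and the toy data are assumptions made for this example, not part of the original notes.

```python
import numpy as np
import pandas as pd

def good_mileage2(
    data,
    cylinder=4,
    mpg_cutoff=30,
    displ_fun=np.mean,
    displ_low=True,
    cls='compact',
    yr=2008,
):
    displ_cutoff = displ_fun(data['displ'])

    # the only part that differs between the two cases
    if displ_low:
        displ_ok = data['displ'].le(displ_cutoff)
    else:
        displ_ok = data['displ'].ge(displ_cutoff)

    keep = (
        data['cyl'].eq(cylinder)
        & data['hwy'].ge(mpg_cutoff)
        & data['class'].eq(cls)
        & data['year'].eq(yr)
        & displ_ok
    )
    return data[keep]

# made-up rows with the same column names as the mpg data
toy_mpg = pd.DataFrame({
    'cyl':   [4, 4, 8, 8],
    'hwy':   [35, 28, 20, 14],
    'class': ['compact', 'compact', 'suv', 'suv'],
    'displ': [1.8, 2.0, 5.3, 6.0],
    'year':  [2008, 2008, 2008, 1999],
})

small = good_mileage2(toy_mpg)
big = good_mileage2(toy_mpg, cylinder=8, mpg_cutoff=15, displ_low=False, cls='suv')
print(small)
print(big)
```

Passing the data in explicitly also means the same function can be reused on any data frame with these columns.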

### Anonymous functions

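Oftentimes we just need a quick and easy function for a one-off application.  For example, the first call below passes an existing named function (`np.std`) to compute the standard deviation of each column, while the second defines a function on the spot to halve every value.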


In [None]:
mtcars = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv', index_col=0)

In [None]:
mtcars.apply(np.std, axis=0)

In [None]:
mtcars.apply(lambda x: x/2, axis=0).head()

The difference between the two is that for the latter, our function didn’t have to be a named object already available. We created a function on the fly to serve a specific purpose. No ready-made function exists whose only job is to divide by two, but since the task is so simple, we just created one as needed.
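
To make the equivalence concrete, the sketch below applies a named function and a lambda with the same body to a tiny made-up data frame (`toy` and `halve` are hypothetical names) and checks that the results match.

```python
import pandas as pd

# tiny made-up data frame for illustration
toy = pd.DataFrame({'a': [2, 4], 'b': [6, 8]})

def halve(x):
    return x / 2

named = toy.apply(halve, axis=0)                 # named function
anonymous = toy.apply(lambda x: x / 2, axis=0)   # same body, no name

same = named.equals(anonymous)
print(same)
```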

To further illustrate this, we’ll create a robust standardization function that uses the median and median absolute deviation rather than the mean and standard deviation.

In [None]:
from statsmodels import robust

# some variables have a mad = 0, and so return Inf (x/0) or NaN (0/0)
mtcars.apply(lambda x: (x - np.median(x))/robust.mad(x)).head()
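
Once the logic needs a guard like the one hinted at in the comment, a named function is usually easier to read than a lambda. Below is a rough numpy-only sketch; `robust_scale` and the toy data are made up for illustration. It uses the raw median absolute deviation, whereas statsmodels' `robust.mad` rescales by a normal-consistency constant by default, so the numbers will differ from the cell above by that factor. Returning all `NaN` when the MAD is zero is just one possible choice.

```python
import numpy as np
import pandas as pd

def robust_scale(x):
    """Center by the median and scale by the raw (unscaled) MAD."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        # avoid dividing by zero; all-NaN keeps the problem visible
        return x * np.nan
    return (x - med) / mad

# made-up data: column 'b' has a MAD of zero
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 100.0],
    'b': [0.0, 0.0, 0.0, 0.0, 1.0],
})

scaled = toy.apply(robust_scale)
print(scaled)
```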

Even if you don’t use anonymous functions (sometimes called lambda functions), it’s important to understand them, because you’ll often see other people’s code using them.