# Functions
## Data 765 tutoring

Functions are reusable blocks of code. You've been using built-in functions throughout the course, such as `print()`. I also wrote examples of functions in other Notebooks.

Like iteration, functions are integral to your scripts and programs.

Functions vastly increase the readability and maintainability of code. A well-tested function reduces the amount of code that you have to write. Functions also provide a single point to test and repair. For example, fixing or improving a function carries those changes to every place it is called.

Another benefit is that functions separate logic into discrete blocks that reduce a programmer's cognitive load. Reading well-defined functions is easier than parsing hundreds of lines all at once.

**D.R.Y.** or Don't Repeat Yourself is a primary impetus for writing functions. If you find yourself copying and pasting blocks of code or writing similar lines of code, you can likely refactor them into a function.

# Basic functions
Let's write a function to calculate the mean of a `list`!

As a side note, I don't recommend rolling your own mean function over using the built-in [statistics](https://docs.python.org/3/library/statistics.html) module or [NumPy](https://numpy.org/).

In [1]:
def mean(a):
    """Calculate the arithmetic mean of an iterable.

    Parameters
    ----------
    a: Iterable
        An Iterable of numbers.

    Returns
    -------
    Mean of a as an integer or float.
    """
    sum_a = 0

    for x in a:
        sum_a += x
    
    return sum_a/len(a)

N.b., again: I also don't recommend writing a mean function using a `for` loop. But you shouldn't forget `for` loops, so I'll ensure you can't escape them!

This is more succinct (and still worse than using NumPy or the statistics module):

In [2]:
def better_mean(a):
    """Like mean(a) but BETTER!"""
    return sum(a)/len(a)
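
For comparison, here's a minimal sketch of the standard library route recommended above. `statistics.fmean()` (Python 3.8+) always returns a float, while `statistics.mean()` tries to preserve the input's numeric type. The sample values are made up for illustration.

```python
import statistics

# Made-up sample values, purely for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

# fmean() converts everything to float and is generally the faster option.
fast_mean = statistics.fmean(sample)

# mean() keeps exact types where it can (useful for Fraction or Decimal data).
exact_mean = statistics.mean(sample)

print(fast_mean, exact_mean)
```

Either call replaces both `mean()` and `better_mean()` from the cells above.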

Functions may take zero or more parameters. `mean()` takes one parameter, `a`, which is the array for which to calculate the mean. Something that is passed into a function is known as an argument.

Parameters are scoped within a function. The parameters declared in a function's signature are valid within the function and can be used in the code block that defines the function.

Thus:

In [3]:
import random

random_numbers = [random.gauss(5., 10.) for _ in range(1000)]
x_bar = mean(random_numbers)
print(x_bar)

5.035592718898639


`random_numbers` is the argument passed into `mean()`. That argument is bound to the parameter `a`. Argument and parameter are often interchangeable colloquially, so you don't need to memorize the exact distinction.
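
Here's a small sketch of what "scoped within a function" means in practice. The function `double()` and its parameter name are hypothetical and only exist for this illustration.

```python
def double(number_to_double):
    # number_to_double is the parameter. It exists only while double() runs.
    return number_to_double * 2

# 21 is the argument bound to the parameter for this one call.
result = double(21)

try:
    number_to_double  # Not defined out here: the name lived inside double().
except NameError as error:
    scope_message = str(error)

print(result)
print(scope_message)
```

The `NameError` shows that the parameter's name never leaks out of the function.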

Python functions that don't take parameters, or that have defaults for every parameter (more on that later), are called as you would expect: `f()`.

In [4]:
from datetime import datetime, timedelta

now = datetime.now().strftime("%B %d, %Y")

print(f"It is {now} (at the time of running my code at least).")

It is October 20, 2021 (at the time of running my code at least).


`datetime.now()` is called here without arguments because its only parameter, `tz`, is optional. Declaring a function without parameters also looks as you would expect.

In [5]:
def yesterday():
    return datetime.now() - timedelta(days=1)

yesterday_f = yesterday().strftime("%A, %B %d")

print(f"And yesterday was {yesterday_f}.")

And yesterday was Tuesday, October 19.


**Questions:**
1. What happens if we remove `name` from the function declaration below?

In [None]:
def say_hello(name):
    print(f"Hi {name}!!")

# Why doesn't this work?
def say_hello():
    print(f"Hi {name}!!")

# Default arguments

Default arguments allow programmers to provide reasonable defaults for their functions. A function may have many parameters to customize execution. [matplotlib](https://matplotlib.org/) is a plotting and graphics library. Its classes and functions tend to have _tons_ of parameters to customize calls. Take a look at [some of the functions here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) for examples.

We'll look at the [scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.scatter.html#matplotlib.axes.Axes.scatter) function as a demonstration.

A scatter plot in _matplotlib_ requires two arrays of the `x` and `y` pairs to plot. Beyond that, the size, point color, edge color, color map, alpha, and other parameters may be set.

So, think about it: would you really want to pass an argument for each parameter every time you need a simple scatter plot? **NO. AHH!** 😱

In order to prevent programmers from going totally nuts, the _matplotlib_ sages wisely provided default arguments for every parameter except `x` and `y`.

Default arguments make certain parameters optional. Parameters without defaults are mandatory. Thus, in the `scatter()` function linked above, you _always_ have to pass `x` and `y` but can pass the other arguments as necessary.

Now let's take a look at a few examples.

In [6]:
from faker import Faker

def generate_names(amount=2):
    """Generate random names using Faker.

    Parameters
    ----------
    amount: int
        Number of fake names to generate.
    
    Returns
    -------
    A list[str] of random names.
    """
    fake = Faker()
    return [fake.name() for _ in range(amount)]

def random_person(people, n=1):
    """Return n random people.

    Parameters
    ----------
    people: list[str]
        A list or iterable of people.
    n: int
        Amount of people to return.
    
    Returns
    -------
    A list[str] of people.
    """
    return [random.choice(people) for _ in range(n)]

def random_groups(people, groups=2):
    """Return people separated into groups.

    Parameters
    ----------
    people: list[str]
        A list or iterable of people.
    groups: int
        Number of people per group, despite the name. The final group
        may be smaller.

    Returns
    -------
    List of groups of people. list[list[str]]
    """
    length = len(people)
    # Shadowing, not mutation
    # Shuffle would mutate
    people = random.sample(people, length)
    return [people[i:i + groups] for i in range(0, length, groups)]

The functions above all have reasonable defaults. `generate_names()` returns realistic, localized names using [Faker](https://faker.readthedocs.io/en/master/); the function defaults to two names. We can call it by passing in a number or without arguments to use the default.

In [7]:
# Generates two names
names = generate_names()
print(f"Two names via default argument: {names}")

# Generates five names by specifically using the keyword argument.
names = generate_names(amount=5)
print(f"Five names via specifically passing 5 to 'amount': {names}")

# Generates 11 names by passing 11 by position.
names = generate_names(11)

Two names via default argument: ['Alyssa Griffin', 'Ricky Terry']
Five names via specifically passing 5 to 'amount': ['Sandra Randall', 'Peter Compton', 'Katrina Grant', 'Alexandra Walter', 'Seth Hansen']


The other functions I wrote have similar default arguments. `random_person()` returns a random person from a `list` of people (you've seen something like this in class). The default is to return a single person. `random_groups()` returns randomized groups; as written, its `groups` argument sets how many people go in each group, with a default of two.

In [8]:
two_people = random_person(names, 2)
print(f"Two randomly selected people: {two_people}")

# Put three people in each group. With 11 names that yields four groups.
three_groups = random_groups(names, 3)
print(f"Three randomly shuffled groups: {three_groups}")

Two randomly selected people: ['Jordan Martinez', 'Jordan Martinez']
Three randomly shuffled groups: [['Patrick Lyons', 'Katie Branch', 'Teresa Wiggins'], ['Mike Patrick', 'Travis Jackson', 'James Torres'], ['Donald Martin', 'William Hanson', 'Dawn Andrews'], ['Jordan Martinez', 'Anna Schwartz']]


The `people` parameter for `random_person()` and `random_groups()` is a mandatory positional argument.
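
To see what "mandatory" means at run time, here's a minimal sketch with a hypothetical `greet()` function (not part of the code above). Leaving out the parameter that lacks a default raises a `TypeError`.

```python
def greet(name, greeting="Hello"):
    """Hypothetical helper: `name` is required, `greeting` is optional."""
    return f"{greeting}, {name}!"

default_greeting = greet("Ada")                     # Uses the default.
custom_greeting = greet("Ada", greeting="Welcome")  # Overrides the default.

try:
    greet()  # Missing the mandatory `name` argument.
except TypeError as error:
    missing_message = str(error)

print(default_greeting)
print(custom_greeting)
print(missing_message)
```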

**Questions**
1. What's the difference between `random_groups(names, groups=4)` and `random_groups(names, 4)`?
2. Which of the following are correct?

* `random_person(n=2, people=names)`

* `random_person(names, 3)`

* `random_person(people=names, n=4)`

* `random_person(4, names)`

# Variadic arguments (\*args and \*\*kwargs)

Variadic arguments are arbitrarily sized. You can pass in as many arguments to `*args` or `**kwargs` as you wish. Like default arguments, variadic arguments are designed to ease calling functions by providing flexibility.

[seaborn](https://seaborn.pydata.org/) is a plotting library built on _matplotlib_. The plotting functions generally take `**kwargs` that are passed down to the _matplotlib_ functions.

We can take a look at the [kdeplot()](https://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot) function for a great example. The documentation mentions that the `**kwargs` are all passed down to specific _matplotlib_ functions depending on other parameters.

We'll take a look at the basic, canonical examples before looking at better uses next week.

In [9]:
def concatenate(*args, sep=' '):
    """Combine strings separated by sep."""
    # You should use string's "join" method instead of this.
    temp = ""
    last_i = len(args) - 1
    for i, arg in enumerate(args):
        temp += str(arg)
        if i != last_i:
            temp += sep
    return temp

concatenate("I", "like", "cats", "meow", ["look", "it's", "a", "list"])

'I like cats meow [\'look\', "it\'s", \'a\', \'list\']'

`concatenate()` is something of the canonical example for `*args`. `print()` works somewhat similarly. Rather than taking a `list` of elements, `print()` takes `*args` that are printed by taking the string representation of each object.
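
Since the cell above only exercises `*args`, here's a rough sketch of `**kwargs` as well, plus the `join` version mentioned in the comment. The names `concatenate_join()`, `collect()`, and `loud_concatenate()` are hypothetical helpers for this illustration.

```python
def concatenate_join(*args, sep=" "):
    """Same idea as concatenate(), written with str.join."""
    return sep.join(str(arg) for arg in args)

def collect(*args, **kwargs):
    """Return what was passed in, to show how Python packs arguments."""
    # args arrives as a tuple and kwargs as a dict.
    return args, kwargs

def loud_concatenate(*args, **kwargs):
    """Forward every argument to concatenate_join(), then upper-case."""
    return concatenate_join(*args, **kwargs).upper()

positional, keyword = collect(1, 2, sep="-", end="!")
shouted = loud_concatenate("i", "like", "cats", sep="-")

print(positional, keyword)
print(shouted)
```

`loud_concatenate()` never names `sep` itself. It passes whatever keyword arguments it receives straight down, which is the same pattern _seaborn_ uses when handing `**kwargs` to _matplotlib_.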

# Composition

Functions are often composed of other functions. In other words, functions build on each other. They're not monoliths that encompass a varied range of actions. Individual functions should be distinct in the sense that each does one action well. Other functions can use those functions in service of their goals.

Since we're statisticians and data scientists, I'll demonstrate composition via simple equations as functions.

I'll use loops instead of comprehensions because loops are easier to follow and comprehensions aren't mandatory for this class.

In [10]:
import math

def variance(a):
    """Calculate the population variance of an array (divides by n)."""
    mean_a = mean(a)
    diff_squares = 0
    
    for xi in a:
        diff_squares += (xi - mean_a)**2
    
    return diff_squares/len(a)

def stddev(a):
    """Calculate the standard deviation of an array."""
    return math.sqrt(variance(a))

stddev(random_numbers)

10.261722924872414

`stddev()` calls `variance()` which in turn calls `mean()`. The functions build on each other rather than reimplementing functionality. We could technically define `stddev()` like so:

In [11]:
def bad_stddev(a):
    # Mean of a
    a_mean = sum(a)/len(a)
    # Variance
    a_var = [(xi - a_mean)**2 for xi in a]
    a_var = sum(a_var)/len(a_var)
    
    # And finally standard deviation
    return a_var**0.5

However, that clearly duplicates work that should be separated into multiple functions. The benefit of multiple functions is that updates to one function (e.g., a faster algorithm) extend to every function calling that function. You know, composition!

_Not_ writing separate functions means that if we were to define a `variance()` function later we'd end up duplicating the code to calculate a mean again for no gain.
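
As a rough illustration of that benefit, the sketch below swaps the body of `mean()` for `statistics.fmean()` and leaves `variance()` and `stddev()` untouched. The data values are made up, and the comparison against `statistics.pstdev()` assumes the population formula (dividing by n), which is what the functions above compute.

```python
import math
import statistics

def mean(a):
    # Swapped-in implementation: delegate to the standard library.
    return statistics.fmean(a)

def variance(a):
    """Population variance, built on mean()."""
    mean_a = mean(a)
    diff_squares = 0
    for xi in a:
        diff_squares += (xi - mean_a) ** 2
    return diff_squares / len(a)

def stddev(a):
    """Population standard deviation, built on variance()."""
    return math.sqrt(variance(a))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # Made-up values.
result = stddev(data)
reference = statistics.pstdev(data)

print(result, reference)
```

Only `mean()` changed, yet `stddev()` picks up the new implementation automatically.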

# Lambdas and closures

Lambdas are anonymous functions. Closures are functions, anonymous or named, that enclose an environment by keeping references to certain variables.

Lambdas are useful for executing a small action without having to write an entire function. You can think of them as throwaway or single use functions.

Python's lambdas are limited to one expression. Because assignments and other statements aren't expressions, a lambda can't contain them. While you can technically write multiline lambdas using [glorious hacky logic](https://stackoverflow.com/questions/1233448/no-multiline-lambda-in-python-why-not) and assign to local variables using the walrus `:=` operator, doing so is not idiomatic Python and looks hilariously ugly. Side effects with lambdas are not idiomatic in Python (and most other languages, probably).

Lambdas are written like so:
`lambda var1, var2, varN: expression`.

They're usually passed directly to functions that take callables rather than stored in variables. For example, Python's [map()](https://docs.python.org/3/library/functions.html#map) calls a function, such as a lambda, on each element of an iterable. `map()` is lazily evaluated: it returns an iterator that produces results on demand. We have to consume the iterator if we want all of the outputs in a data structure, like a `list`.

In [12]:
# 20 random lists of numbers of 1000 numbers each.
nested_random = [[random.gauss(10, random.randint(1, 10)) for _ in range(1000)]
                 for _ in range(20)]

# Notice that map()'s output is consumed by list.
nested_random_means = list(map(lambda x: sum(x)/len(x), nested_random))

print(nested_random_means)

[10.359360139651542, 10.192140300436932, 9.781905257336646, 10.435219304944404, 10.068903564469576, 10.13980122450079, 9.904777265195513, 10.119542004578289, 9.864959970025332, 9.864811602658659, 10.204753693382155, 9.800811585394055, 10.25152672952076, 9.552711278618277, 9.732936889182817, 10.161206584584026, 9.949628668779145, 10.0521472901358, 10.079544905986616, 10.011901840232888]
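
One consequence of that laziness is worth a quick sketch: the object `map()` returns can only be consumed once. The numbers below are made up.

```python
numbers = [1, 2, 3]
squares = map(lambda x: x * x, numbers)

# Nothing has been computed yet. list() pulls every result out.
first_pass = list(squares)

# The iterator is now exhausted, so a second pass finds nothing left.
second_pass = list(squares)

print(first_pass)
print(second_pass)
```

If you need the results more than once, store the `list` rather than the `map` object.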


Lambdas as well as functions can also be stored in variables. You can pass standalone lambdas, functions, stored lambdas, and stored functions anywhere function objects are expected.

In [13]:
# Lambs aren't mean.
mean_lamb = lambda x: sum(x)/len(x)

# Storing the mean function from earlier into another variable.
meanest = mean

_ = map(mean_lamb, nested_random)
_ = map(meanest, nested_random)
_ = map(mean, nested_random)

Functions, like basically everything else in Python, are objects. Assigning functions to variables is similar to assigning the same string to two different variables. The variables simply hold the reference to the functions rather than cloning the definition.

You can call a lambda like a function after assigning it to a variable. This should make sense given the logic above. Defining a function grants it a name, such as `mean()`, which allows you to call the function. If you copy the reference into another variable as shown above, you can still call the function using the new name like you would normally.

(I'm going somewhere with this).

In [14]:
_ = mean_lamb(random_numbers)
_ = meanest(random_numbers)
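
A short sketch of the "functions are objects" idea: two names can refer to the same function object, and functions can be stored in data structures like any other value. The dictionary and its labels are hypothetical.

```python
def mean(a):
    return sum(a) / len(a)

# A second name for the same function object. Nothing is copied.
meanest = mean

same_object = meanest is mean
original_name = meanest.__name__

# Functions can sit in a dict and be called later.
summaries = {"mean": mean, "max": max, "min": min}
report = {label: fn([3, 1, 2]) for label, fn in summaries.items()}

print(same_object, original_name)
print(report)
```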

This means that you can pass functions into _other_ functions. You saw this above with `map()` which is a function that takes another function, such as a lambda, and calls the function on each element of an iterable.

Writing a function that takes a callable (functions, lambdas, callable classes) as a parameter isn't different from what you've seen so far.

The function below calls a function (`fn`) on `iterable` in overlapping windows of length `size`. The last few windows are shorter than `size` because the slice runs past the end of the sequence.

In [15]:
def rolling(iterable, size, fn):
    """Map over overlapping windows.

    Parameters
    ----------
    iterable: Sequence
        A sequence (supports len() and slicing) to call a function over.
    size: int
        Window size. Must be positive.
    fn: Callable
        A callable, such as a function, to map over each window.

    Returns
    -------
    An iterator of fn's result for each window.
    """
    iter_len = len(iterable)
    
    # Yielding None (or NaN) is reasonable here instead of raising an error.
    # Raising an error would force messy try blocks everywhere whereas NaN
    # would be more idiomatic in a data context.
    if not iter_len:
        # Returning an iterable to keep the return types the same.
        # "yield None" doesn't work because the function stops and
        # breaks the return statement at the bottom.
        # So, this is a bit hacky until I find a better solution.
        return iter([None])
    
    if not size or size < 0:
        raise ValueError(f"Window size must be positive. Got: {size}")
    
    return (fn(iterable[i:i + size]) for i in range(iter_len))

If you'd prefer a loop and _not_ a generator: 

In [16]:
def rolling_loop(iterable, size, fn):
    outputs = []
    for i in range(len(iterable)):
        outputs.append(fn(iterable[i:i + size]))
    return outputs

And here's `rolling()` in action by calculating a rolling mean:

In [17]:
roll_mean = list(rolling(random_numbers, 3, mean))
roll_mean_loop = rolling_loop(random_numbers, 3, mean)

print(f"Example output: {roll_mean[:10]}")

Example output: [9.578188870789948, 14.495991378074308, 20.04375293359957, 22.318810090728864, 15.345636389336233, 8.46145917008217, -1.8756206721897992, -1.1099176354194518, -4.267736015345626, 5.600331781848614]
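
To make the window edges easier to see, here's a small sketch on four numbers. `rolling()` is a compact copy of the windowing logic above without the input checks, and `rolling_full()` is a hypothetical variant that stops at the last full window.

```python
def rolling(iterable, size, fn):
    """Compact copy of the windowing logic used above (no input checks)."""
    return (fn(iterable[i:i + size]) for i in range(len(iterable)))

def rolling_full(iterable, size, fn):
    """Hypothetical variant that only yields full-sized windows."""
    last_start = len(iterable) - size + 1
    return (fn(iterable[i:i + size]) for i in range(max(last_start, 0)))

values = [1, 2, 3, 4]

# Windows: [1, 2], [2, 3], [3, 4], [4]
with_partial = list(rolling(values, 2, sum))

# Windows: [1, 2], [2, 3], [3, 4]
full_only = list(rolling_full(values, 2, sum))

print(with_partial)
print(full_only)
```

Which behavior you want depends on the analysis. Partial windows keep the output the same length as the input, while full windows keep every statistic comparable.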


## Closures

Closures "close" over an environment by keeping references to variables from scopes outside of the function. Decorators, which aren't covered in this course, are syntactic sugar for wrapping one function in another and usually rely on closures.

Closures in Python are most easily defined by returning a function from within another function.
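
Before the rolling example, here's a bare-bones sketch of that pattern. `make_scaler()` is a hypothetical function. The inner function keeps a reference to `factor` even after `make_scaler()` has returned.

```python
def make_scaler(factor):
    """Return a function that multiplies its input by factor."""
    def scale(value):
        # factor isn't a parameter of scale(). It's captured from make_scaler().
        return value * factor
    return scale

triple = make_scaler(3)
halve = make_scaler(0.5)

tripled = triple(5)
halved = halve(5)

# The captured variable lives in the function's closure cells.
captured = triple.__closure__[0].cell_contents

print(tripled, halved, captured)
```

Each call to `make_scaler()` creates a separate closure with its own `factor`.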

Let's say we want to simplify our calls of `rolling()` by covering a few common use cases.

In [18]:
import functools
import statistics
from typing import Union
import numpy as np

def rolling_with(func):
    """Wrap func to be called in moving windows.
    
    Parameters
    ----------
    func: Callable[[Sequence], Union[int, float, np.number]]
        Callable to apply to each window.
    
    Returns
    -------
    Callable[[Sequence, int], Iterator[Union[int, float, np.number]]]
        Wrapped function that takes a sequence and a window size.
    """
    # Fixes the __name__ attribute for the returned function
    # so that it has the right name.
    @functools.wraps(func)
    def rolling_with_inner(iterable, size):
        return rolling(iterable, size, func)
    
    return rolling_with_inner

# Example calls
rolling_mean = rolling_with(mean)
rolling_med = rolling_with(statistics.median)
rolling_mode = rolling_with(statistics.mode)
rolling_max = rolling_with(max)

Python provides syntactic sugar for calling `rolling_with()` on a function: decorators. A decorator is placed directly above a function definition as `@name_of_decorator`, and Python passes the function being defined to it, so you don't have to pass the function's name yourself.

In [19]:
@rolling_with
def rolling_mean(x):
    """Calculate a rolling mean over an iterable.
    
    Parameters
    ----------
    x: Iterable[Union[int, float, np.number]]
    size: int
        Size of rolling window.
    
    Returns
    -------
    Iterable[Union[int, float, np.number]]
    """
    return sum(x)/len(x)

roll_test = list(rolling_mean(random_numbers, 3))

print(f"Testing rolling_mean: {roll_test[:10]}")

Testing rolling_mean: [9.578188870789948, 14.495991378074308, 20.04375293359957, 22.318810090728864, 15.345636389336233, 8.46145917008217, -1.8756206721897992, -1.1099176354194518, -4.267736015345626, 5.600331781848614]
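
As a sketch of what the `@` line does, the two spellings below are equivalent. `logged()` is a hypothetical decorator that records the name of each function it wraps when that function is called.

```python
import functools

call_log = []

def logged(func):
    """Hypothetical decorator that records each call's function name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_log.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper

@logged
def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

# The manual spelling of what @logged does for add().
multiply = logged(multiply)

total = add(2, 3)
product = multiply(2, 3)

print(total, product, call_log)
```

Thanks to `functools.wraps`, `add.__name__` is still `'add'` rather than `'wrapper'`.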


# Basic error handling

Python is dynamically typed, which means that types (such as integers or strings) are determined at run time. This means that you can call `mean()` with nonsense values such as strings. However, Python is also strongly typed, which in turn means that operations on incompatible types raise an exception.

In [20]:
# Dynamic typing
calculate_this = "i like vegan pie lolol."

# But strongly checked at runtime
mean(calculate_this)

TypeError: unsupported operand type(s) for +=: 'int' and 'str'

Python isn't pell-mell about typing! Notice the error as well. The error explains that the function is trying to do some operation that the value doesn't support. The error may seem unclear at first, but you can get a sense of what's wrong by the final line which states `unsupported operand type(s) for +=: 'int' and 'str'`.

We can add explicit error checks to functions to provide more information to the caller. 

In [21]:
from array import array

def mean(a):
    """Calculate the arithmetic mean of an iterable.

    Parameters
    ----------
    a: Iterable
        An Iterable of numbers.

    Returns
    -------
    Mean of a as an integer or float.
    """
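    # Check the container type first.
    if not isinstance(a, (list, array)):
        raise TypeError("You need to pass in an iterable of numbers.")

    # Empty arrays cause division by zero. This check must come before
    # indexing a[0], which would otherwise raise IndexError.
    if not len(a):
        raise ValueError("You can't calculate the mean of an empty array.")

    # Spot-check the first element's type.
    if not isinstance(a[0], (int, float)):
        raise TypeError("You need to pass in an iterable of numbers.")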

    sum_a = 0

    for x in a:
        sum_a += x
    
    return sum_a/len(a)

_ = mean(["Dark Souls", "is a", "great series."])

TypeError: You need to pass in an iterable of numbers.

Some Pythonistas would argue that checking the type of `a` is superfluous. Python would raise an error anyway, and explicitly checking if an argument is a `list` could preclude a lot of other types due to duck typing. In other words, we don't really need a `list` explicitly when other iterables would work fine as well.
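
Here's a rough sketch of that duck-typed alternative. `duck_mean()` is a hypothetical name. It accepts any iterable, checks only for emptiness, and lets Python's own `TypeError` surface for non-numeric data.

```python
def duck_mean(values):
    """Mean of any iterable of numbers, without checking container types."""
    values = list(values)  # Accept generators, tuples, sets, and so on.
    if not values:
        raise ValueError("You can't calculate the mean of an empty iterable.")
    # Non-numeric elements fail here with Python's own TypeError.
    return sum(values) / len(values)

from_tuple = duck_mean((1, 2, 3))
from_generator = duck_mean(x for x in range(5))

try:
    duck_mean(["Dark Souls", "is a", "great series."])
except TypeError as error:
    type_message = str(error)

print(from_tuple, from_generator)
print(type_message)
```

The trade-off is a less friendly error message in exchange for supporting far more input types.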

Other Python code can raise exceptions. Uncaught exceptions bubble up the stack until they're either caught or crash the program. Unhandled exceptions are fine if you can justify them. For example, crashing if the user running a script doesn't have permission to load a required file is probably fine.

Catching exceptions, as you learned last class, requires wrapping code in `try-except-finally` blocks.

In [None]:
import json

poke_starters = {
    "grass": ["Bulbasaur",
              "Chikorita",
              "Treecko"],
    "fire": ["Charmander",
             "Cyndaquil",
             "Torchic"],
    "water": ["Squirtle",
              "Totodile",
              "Mudkip"],
    "electric": ["Pikachu"],
    "psychic": ["Espeon"]
}

try:
    with open("poke_starters.json", 'w') as poke_json:
        json.dump(poke_starters, poke_json)
except OSError as e:
    print(f"Failed to write {e.filename}.\nOS: {e.strerror}")
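
The cell above only uses `try` and `except`. As a sketch of the full `try-except-else-finally` shape, here's a hypothetical `load_json()` helper. It works inside a temporary directory so it doesn't leave files behind.

```python
import json
import os
import tempfile

attempts = []

def load_json(path):
    """Hypothetical helper: return parsed JSON, or None if the file is missing."""
    try:
        handle = open(path, "r")
    except FileNotFoundError:
        # Runs only if open() raised this specific exception.
        result = None
    else:
        # Runs only if the try block finished without an exception.
        with handle:
            result = json.load(handle)
    finally:
        # Runs in every case, even when an unexpected exception propagates.
        attempts.append(path)
    return result

with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "starters.json")
    with open(good_path, "w") as out:
        json.dump({"electric": ["Pikachu"]}, out)

    loaded = load_json(good_path)
    missing = load_json(os.path.join(folder, "missing.json"))

print(loaded, missing, len(attempts))
```

Keeping the `try` block down to the one call that can fail makes it clear which error each `except` is meant to handle.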

# Longer PokéAPI example

Let's revisit the [Pokémon API](https://pokeapi.co/) example from last week.

I designed this more as a script, but you shouldn't take it as the absolute best way to tackle this problem. The site lists several API implementations in different languages, including Python, that you should use instead.

I dislike global variables, but I included them as a demonstration. Ideally, you'd refactor this code into a class instead.

In [None]:
import io
import matplotlib.pyplot as plt
import json
import time

from requests import Session, HTTPError
from IPython.display import display
from PIL import Image

POKEAPI = "https://pokeapi.co/api/v2/pokemon/{}"
CACHE_PATH = "pokeapi_cache.json"
THROTTLE = 30

# Globals don't have to be explicitly laid out like this but it looks cleaner.
session = None
poke_cache = None
last_time = None

def load_cache(path):
    """Create or load Pokémon API data cache.

    Parameters
    ----------
    path: str
        File path. Loads cached Pokémon API data as JSON if the file exists
        or else starts with an empty in-memory cache.
    """
    global poke_cache
    try:
        with open(path, 'r') as cache:
            poke_cache = json.load(cache)
    except FileNotFoundError:
        # Create an empty cache if the file doesn't exist.
        # Also...just bubble up the rest of the errors.
        poke_cache = {}

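def check_cache(pokenum):
    """Retrieve Pokémon data from cache if exists.

    Parameters
    ----------
    pokenum: int
        Pokémon number as integer.
    """
    # JSON object keys are always strings, so a cache loaded from disk
    # is keyed by "25" rather than 25. Normalize before looking up.
    return poke_cache.get(str(pokenum))

def update_cache(new_pokemon):
    """Update cache with new data.

    Parameters
    ----------
    new_pokemon: Dict[int, Dict]
        Dictionary of new Pokémon to add.
    """
    # Store keys as strings to match what json.load() returns.
    poke_cache.update({str(num): data for num, data in new_pokemon.items()})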

def check_time():
    """Throttle requests.
    """
    global last_time

    # Sleep for whatever remains of the THROTTLE window (in seconds).
    sleep_time = THROTTLE - (time.monotonic() - last_time)
    if sleep_time > 0:
        print(f"Pausing for {sleep_time} seconds.")
        time.sleep(sleep_time)
    # Update last_time to the current time.
    last_time = time.monotonic()

def create_url(pokenum):
    """Create a PokéAPI URL from a Pokémon number.

    Parameters
    ----------
    pokenum: int
        Pokémon number, such as 25 for Pikachu.
    
    Returns
    -------
    str
        Formatted URL.
    """
    return POKEAPI.format(pokenum)

def init_scraper(path=CACHE_PATH):
    """Initialize all required scraper variables.

    Parameters
    ----------
    path: str
        Path to PokéAPI cache.
    """
    global poke_cache
    global session
    global last_time

    if not poke_cache:
        load_cache(path)
    if not session:
        session = Session()
    if not last_time:
        last_time = time.monotonic()

def get_pokemon(pokemon_nums, **kwargs):
    """Retrieve data for an Iterable of Pokédex numbers.

    Parameters
    ----------
    pokemon_nums: Iterable[int]
        Pokédex numbers.
    **kwargs:
        Keyword arguments to pass down to session.get().
    
    Returns
    -------
    Dict[int, Dict[str, Any]]
        Pokémon data.
    """
    global session
    pokedata = {}

    for pokenum in pokemon_nums:
        # Check if the Pokémon data exists in the cache
        # instead of scraping again.
        if data := check_cache(pokenum):
            pokedata[pokenum] = data
            continue

        # Throttle to avoid spamming the API
        check_time()
        try:
            url = create_url(pokenum)
            resp = session.get(url, **kwargs)
            # Raise an error for a non-200 HTTP status.
            resp.raise_for_status()
            resp = resp.json()
            pokedata[pokenum] = resp
            # Always update the cache if the request succeeded
            update_cache({pokenum: resp})
        except HTTPError as e:
            print(f"Got {e} for Pokémon number {pokenum}.")
    
    return pokedata

# Basic usage
init_scraper()
scrape_results = get_pokemon([54, 25, 300])
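
The code shown here loads the cache but doesn't write it back to disk. Below is a minimal sketch of a saving step, assuming the cache is a plain `dict`. `save_cache()` is a hypothetical function, and the data is a stand-in for a real API response. The sketch also shows that JSON turns integer keys into strings on the way out.

```python
import json
import os
import tempfile

def save_cache(cache, path):
    """Hypothetical counterpart to load_cache(): write the cache as JSON."""
    with open(path, "w") as cache_file:
        json.dump(cache, cache_file)

# Stand-in for scraped data. Real responses are much larger.
cache = {25: {"name": "pikachu"}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "pokeapi_cache.json")
    save_cache(cache, path)
    with open(path, "r") as cache_file:
        reloaded = json.load(cache_file)

# JSON object keys are always strings, so 25 comes back as "25".
reloaded_keys = list(reloaded)

print(reloaded_keys)
```

Because of that conversion, lookups against a reloaded cache need string keys.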

[Next: Midterm review](https://github.com/joshuamegnauth54/data765-intro-python-tutoring/blob/main/notebooks/05-midterm_miscellanea.ipynb)