# Python for R users
# Part 3: Functions

In this notebook we will explore how R and Python differ in the creation and usage of functions.

First, a general note: whenever you find yourself doing something more than once, you should put the relevant code inside a function and call that function instead.  Then you can reuse that code whenever you need it in the future. This is in keeping with the general software engineering philosophy known as "Don't Repeat Yourself" (DRY).  It's also good practice to keep these functions organized in a central place so that you can always find them; we will come back to that later.

First we need to tell Jupyter to let us use R within this Python notebook.

In [6]:
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


## Creating a function

Let's say that we want to create our own function that divides two numbers but returns NaN ("not a number") if one tries to divide by zero. Here is how we would define this function in R.  

In [7]:
%%R

my_divide <- function(i, j){
    if (j == 0){
        return(NaN)
    } else {
        return(i/j)
    }
}

my_divide(1,0)

[1] NaN


Now let's write the analogous function in Python. First, we need to import the math library where nan is defined, then we can define the function. Note that it's generally good practice to put all of the imports at the top of the file, but we will leave it here for clarity.

In [4]:
import math

In [13]:
def my_divide(i, j):
    if j == 0:
        return(math.nan)
    else:
        return(i / j)
    
my_divide(1, 0)

nan
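
One practical point when a function returns NaN: NaN does not compare equal to anything, including itself, so the result has to be checked with `math.isnan()` rather than `==`. The sketch below repeats the function so that it runs on its own.

```python
import math

def my_divide(i, j):
    # same behavior as the function above: NaN instead of an error
    if j == 0:
        return math.nan
    return i / j

result = my_divide(1, 0)

# NaN is never equal to anything, including itself,
# so an equality test cannot detect it
print(result == math.nan)   # False
print(math.isnan(result))   # True
```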

#### Catching exceptions using try/except

One of the coding philosophies of Python is that it's easier to ask for forgiveness than permission (summarized as *EAFP*). This means that rather than checking for various conditions, we should assume that everything will work as intended (e.g. files will exist, operations will work) but be prepared to deal with the errors that can occur.  In Python, when an error occurs it results in an *exception* being raised.  For example, let's see what happens if we try to divide by zero:

In [1]:
1/0

ZeroDivisionError: division by zero

This results in a "ZeroDivisionError" exception being raised.  Note that the exception specifies the kind of error that is occurring.  Similarly, if we try to open a file that doesn't exist, we will see a "FileNotFoundError" exception:

In [2]:
my_file = open('nonexistent_file','r')

FileNotFoundError: [Errno 2] No such file or directory: 'nonexistent_file'

There is a programming construct called *try/except* (known as try/catch in many other languages) that allows us to deal gracefully with errors.  Let's rewrite our my_divide function to use this, rather than checking for the problematic condition up front:

In [5]:
def my_divide(i, j):
    try:
        return(i / j)
    except ZeroDivisionError:
        return(math.nan)
    finally:
        pass
        
    
my_divide(1, 0)

nan

Here we see that it deals appropriately with the division by zero error.  The *finally* section contains code that always runs when the *try* statement finishes, whether or not an exception occurred; here it does nothing (*pass*) and could be omitted.  Any exception that is not matched by an *except* clause is passed up the chain to the caller.  For example, if we use a string instead of a number, it will raise a "TypeError" exception that gets passed back to us:

In [7]:
my_divide(1,'0')

TypeError: unsupported operand type(s) for /: 'int' and 'str'

0.0
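
To make the order of execution concrete, here is a minimal sketch using a hypothetical variant of the function, `divide_with_log`, which records which blocks run. It also uses the optional *else* clause, which runs only when no exception was raised. The name and the `log` argument are illustrative assumptions, not part of the original function.

```python
import math

def divide_with_log(i, j, log):
    """Hypothetical variant of my_divide that records which blocks run."""
    try:
        result = i / j
    except ZeroDivisionError:
        log.append('except')
        result = math.nan
    else:
        # runs only if the try block raised no exception
        log.append('else')
    finally:
        # runs in every case, even while an uncaught exception is propagating
        log.append('finally')
    return result

log = []
divide_with_log(1, 2, log)         # normal case
divide_with_log(1, 0, log)         # ZeroDivisionError is caught
try:
    divide_with_log(1, '0', log)   # TypeError is not caught inside the function
except TypeError:
    log.append('TypeError reached the caller')

print(log)
```

The log shows that *finally* ran in all three calls, including the one where the TypeError escaped to the caller.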

#### Variable scope

Similar to R, a variable defined outside of a function in the global workspace is visible within the function, and variables defined within a function are not visible outside of the function.

*NOTE*: It is generally bad practice to use global variables within a function, unless they are meant to serve as constants that will never be modified.  Otherwise it can be very hard to debug problems that arise. It is customary to write the names of constants in all caps so that it's clear that they are different.

In [61]:
# a constant
MY_PLANET = 'Earth'

def testfunc():
    print(MY_PLANET)
    defined_inside = 'local'
    
testfunc()

# try printing the variable defined inside the function - will raise an error
print(defined_inside)

Earth


NameError: name 'defined_inside' is not defined
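
A related point that often surprises people: assigning to a name inside a function creates a new local variable, even if a global variable with the same name exists (much as `<-` does inside an R function). The sketch below uses hypothetical function names to show this, along with the usual alternative of passing the value in and returning the result.

```python
counter = 0

def increment_local():
    # assignment creates a new local variable named counter;
    # the global one is untouched
    counter = 1
    return counter

def increment_returned(value):
    # preferred: take the value as an argument and return the new one
    return value + 1

increment_local()
print(counter)   # 0

counter = increment_returned(counter)
print(counter)   # 1
```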

#### Optional arguments

Often there are arguments that we want to be optional.  For example, let's say that we want to create standardized scores for a set of numbers.  Let's start by creating a version that creates standardized scores with a mean of zero and a standard deviation of 1 --- i.e. Z-scores.

Here we will also add some documentation to the function, so that users can know what it's for.  We do this by placing what is called a *docstring* on the first indented line of the function.  We use triple quotation marks, which allows us to continue the string over several lines, though for a docstring we use them even if it's just a single line.

In [9]:
import numpy

def my_std_score(v):
    """returns a standard score for a list or array
    with mean zero and standard deviation 1
    
    Parameters
    ----------
    v: list or numpy array 
        numbers to be scaled
    
    Returns
    -------
    std_score: numpy array
        the standardized values
    """
    
    # make sure it's an appropriate kind of variable (list or numpy array)
    assert type(v) in [list, numpy.ndarray]
    
    # convert list to numpy array
    if type(v) == list:
        v = numpy.array(v)

    std_score = (v  - numpy.mean(v))/numpy.std(v)
    return(std_score)


In [10]:
v = numpy.array([1,2,3,4])
my_std_score(v)

array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])

Now that we added the docstring, the help() function will give us useful information.

In [11]:
help(my_std_score)

Help on function my_std_score in module __main__:

my_std_score(v)
    returns a standard score for a list or array
    with mean zero and standard deviation 1
    
    Parameters
    ----------
    v: list or numpy array 
        numbers to be scaled
    
    Returns
    -------
    std_score: numpy array
        the standardized values



Sometimes we might want to create scores with a different mean or standard deviation. To enable this, we can add optional arguments specifying the mean and standard deviation, setting them to zero and one by default.

In [12]:
def my_std_score2(v, mean=0, sd=1):
    """returns a standard score for a list or array
    with arbitrary mean and standard deviation
    
    Parameters
    ----------
    v: list or numpy array 
        numbers to be scaled
    mean: float, optional
        mean of the standard scores (default = 0)
    sd: float, optional
        standard deviation of the standard scores (default = 1)
    
    Returns
    -------
    std_score: numpy array
        the standardized values
    
    """
    
    # make sure it's an appropriate kind of variable (list or numpy array)
    assert type(v) in [list, numpy.ndarray]
    
    # convert list to numpy array
    if type(v) == list:
        v = numpy.array(v)

    # first create Z-score, then convert to units of interest
    z_score = (v  - numpy.mean(v))/numpy.std(v)
    std_score = z_score*sd + mean
    return(std_score)


Now if we run it without specifying the optional arguments, we will still get Z-scores:

In [13]:
my_std_score2(v)

array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])

But we can also modify those arguments to obtain differently scaled scores.  Note that including the names of the arguments is not required, but is good practice in general to be clear about which optional arguments are being set.

In [14]:
my_std_score2(v, mean=100, sd=10)

array([ 86.58359214,  95.52786405, 104.47213595, 113.41640786])
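
To illustrate why naming the arguments is the safer habit, here is a small sketch with a hypothetical, stripped-down function called `rescale` (not the full `my_std_score2`). Positional arguments are matched by order, so swapping them silently gives a different result, whereas keyword arguments can be given in any order.

```python
import numpy

def rescale(v, mean=0, sd=1):
    """Hypothetical, stripped-down version of my_std_score2."""
    v = numpy.asarray(v, dtype=float)
    z_score = (v - numpy.mean(v)) / numpy.std(v)
    return z_score * sd + mean

v = [1, 2, 3, 4]

by_position = rescale(v, 100, 10)          # mean=100, sd=10, matched by order
by_keyword = rescale(v, sd=10, mean=100)   # same call, order no longer matters
swapped = rescale(v, 10, 100)              # easy mistake: mean=10, sd=100

print(numpy.allclose(by_position, by_keyword))   # True
print(numpy.allclose(by_position, swapped))      # False
```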

#### Return values

So far we have seen functions that return a single variable, but a function can also return multiple variables.  The easiest way to do this is to return a *tuple* of items, which is similar to a list but is immutable, meaning that its values can't be modified.

Let's create an example of a function that will take in a vector and return the mean and standard deviation of the values. It will also by default print a report.

In [15]:
def my_summary_stats(v, print_report=True):
    """compute the mean and sd of a numpy array
    
    Parameters
    ----------
    v: numpy array 
        values to be processed
    print_report: bool, optional
        a flag for whether to print a report (default = True)
    
    Returns
    -------
    mean: float
        mean of vector
    sd: float
        sd of vector
    """
    mean = numpy.mean(v)
    sd = numpy.std(v)
    
    if print_report:
        print(f'Mean = {mean:.2f}')
        print(f'SD = {sd:.2f}')

    return((mean, sd))


In [16]:
vec = numpy.array([1,2,3,4])
output = my_summary_stats(vec)
print(output)

# if you try to change a value of the output tuple, it will raise an error:

output[1] = 1

Mean = 2.50
SD = 1.12
(2.5, 1.118033988749895)


TypeError: 'tuple' object does not support item assignment
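
In practice the returned tuple is often unpacked straight into separate variables. The sketch below uses a hypothetical helper, `mean_and_sd`, so that it runs on its own; the same pattern should apply to `my_summary_stats`.

```python
import numpy

def mean_and_sd(v):
    """Hypothetical helper returning two values as a tuple."""
    return numpy.mean(v), numpy.std(v)

vec = numpy.array([1, 2, 3, 4])

# unpack the tuple directly into two variables
mean, sd = mean_and_sd(vec)
print(mean)   # 2.5

# or keep the tuple and index into it
output = mean_and_sd(vec)
print(output[0] == mean)   # True
```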

## Classes

The way that you generally write programs in R is known as *procedural programming*: The program is basically a script that outlines a procedure, and R goes through it sequentially to perform the procedure.  Python allows this kind of programming, but there is another way of programming that is used by most modern programming languages, known as *object-oriented programming*.  Understanding the basics of object-oriented programming is essential for working with Python code as well as for building code that you can easily reuse, following the DRY principle that we outlined above.

Tutorials:
- https://realpython.com/python3-object-oriented-programming/


In object-oriented programming, we define the objects that we are interested in, and then characterize them in terms of their properties or attributes and the functions that they perform or that are performed on them.  The fundamental concept that defines an object in Python is known as a *class*.  

Let's say that we want to build a program to keep track of information about our cats.  To do this, we would define a class to describe a cat (generically).  First let's think about the different attributes that a cat might have. Keeping it simple, we could say that we want to keep track of their weight and whether they are hungry (which we will assume is true by default - these are cats after all).  Let's start by creating a class that stores these.

First we will create the class definition, dissecting it in the comments.

In [67]:
#The first line says that we want to create a new class called "Cat", which is a kind of object.  
#This is an example of *inheritance*, in which an object can inherit the features of another 
#kind of object.  In this case, Cat inherits the features of Python's most generic kind of object, 
#called simply *object*.

class Cat:
    # the next section is the docstring, providing info about the object
    # the parameters refer to the values that are passed into the __init__ function
    """an object describing a cat

    Parameters
    ----------
    weight_lbs: float
        cat's weight, in pounds
    is_hungry: bool, optional
        Is the cat hungry? (default = True)
    """
    
    # the next line is a *class attribute* - this is a value that is shared by 
    # all instances of the class
    species = "Felis catus"
    
    # The next line defines a special method called __init__(), which Python runs 
    # automatically whenever a new instance of the class is created - in this case, a 
    # description of a specific cat.  Names that start and end with two underscores 
    # mark special ("dunder") methods that Python itself calls; we do not normally call 
    # them directly.  The first argument, which by convention is called *self*, 
    # refers to the object itself; whenever we create a function inside a class, we usually  
    # need to put *self* as the first argument.  The next two arguments refer to the specific 
    # features of the cat that we want to define: weight in pounds, and whether the cat is hungry.  
    # We use descriptive variable names to make clear what each one refers to.
    def __init__(self, weight_lbs, is_hungry=True):
        # the next two lines make sure that the values passed for the arguments
        # are of the correct type: weight should be a number (either integer or 
        # floating point) and is_hungry should be a Boolean value. 
        # It is generally good practice to make sure that values passed to
        # a class or function are of the correct type
        assert isinstance(weight_lbs, int) or isinstance(weight_lbs, float)
        assert isinstance(is_hungry, bool)
        
        # These two lines take the two settings and use them to create variables that are part of 
        # the object, which we call *attributes*.  These are called *self.weight_lbs* and 
        # *self.is_hungry* respectively.  The period is used to denote that this is an attribute 
        # of the current instance (*self*). 
        self.weight_lbs = weight_lbs
        self.is_hungry = is_hungry


Now let's create a new instance of the cat object, for my cat Coco.

In [68]:
Coco = Cat(8,True)

Because we added a docstring, we can also obtain help if we forget how to use the class:

In [71]:
help(Cat)

Help on class Cat in module __main__:

class Cat(builtins.object)
 |  an object describing a cat
 |  
 |  Parameters
 |  ----------
 |  weight_lbs: float
 |      cat's weight, in pounds
 |  is_hungry: bool, optional
 |      Is the cat hungry? (default = True)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, weight_lbs, is_hungry=True)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  species = 'Felis catus'



We can refer to the attributes from outside of the class using the dot operator:

In [69]:
print(Coco.weight_lbs)
print(Coco.is_hungry)
print(Coco.species)

8
True
Felis catus


We can also assign the value of an attribute directly.  Let's say that Coco gains some weight, up to 9 pounds.  We can assign this new value to the attribute.

In [70]:
Coco.weight_lbs = 9
print(Coco.weight_lbs)

9
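
The difference between class attributes and instance attributes can be seen with two instances. The sketch below uses a hypothetical stripped-down class, `MiniCat`, and a second, made-up cat so that it does not overwrite the `Cat` class defined above.

```python
class MiniCat:
    species = "Felis catus"   # class attribute, shared by every instance

    def __init__(self, weight_lbs):
        self.weight_lbs = weight_lbs   # instance attribute, one per cat

coco = MiniCat(8)
luna = MiniCat(11)   # a second, hypothetical cat

# changing one instance's attribute leaves the other instance alone
coco.weight_lbs = 9
print(luna.weight_lbs)                # 11

# the class attribute is the same for both
print(coco.species == luna.species)   # True
```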


#### Class methods

In addition to having attributes, a class also has the ability to do things, which we refer to as *methods*.  For example, let's say that we want to feed the cat.  We can create a new function that is part of the class, which we will call *feed*. When this method is invoked, it moves the cat from a hungry state to a not hungry state. 

In [72]:
class Cat:
    """an object describing a cat

    Parameters
    ----------
    weight_lbs: float
        cat's weight, in pounds
    is_hungry: bool, optional
        Is the cat hungry? (default = True)
        
    """

    def __init__(self, weight_lbs, is_hungry=True):
        assert isinstance(weight_lbs, int) or isinstance(weight_lbs, float)
        assert isinstance(is_hungry, bool)
        
        self.weight_lbs = weight_lbs
        self.is_hungry = is_hungry
        
    def feed(self):
        """
        A function that feeds the cat, moving is_hungry from true to false
        """
        self.is_hungry = False

In [73]:
Coco = Cat(8,True)
print(Coco.is_hungry)
Coco.feed()
print(Coco.is_hungry)

True
False
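
It may help to see what *self* is doing when a method is called. In the sketch below, the class name `MiniCat` and the `meal_lbs` argument are illustrative assumptions rather than part of the class above. Calling the method on an instance is equivalent to calling it on the class and passing the instance explicitly as the first argument.

```python
class MiniCat:
    def __init__(self, weight_lbs, is_hungry=True):
        self.weight_lbs = weight_lbs
        self.is_hungry = is_hungry

    def feed(self, meal_lbs=0.0):
        """Feed the cat; meal_lbs is a hypothetical optional argument."""
        self.is_hungry = False
        self.weight_lbs += meal_lbs

coco = MiniCat(8)
coco.feed(0.25)            # Python passes coco as self automatically
print(coco.weight_lbs)     # 8.25

luna = MiniCat(11)
MiniCat.feed(luna, 0.5)    # equivalent explicit form: self is luna
print(luna.is_hungry)      # False
```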


### A more realistic example: designing an analysis method using the scikit-learn pattern

Now we will look at an example of how to build a class that does something interesting.  We will build on a design pattern that we will use extensively in Psych 253, which is the pattern used by the scikit-learn package.

In scikit-learn, each analysis method is represented by a class. Each of these classes has a set of methods that are standard, which is nice because we don't have to learn a new interface for each function. The interface includes several common methods:

* *fit()*: This fits the model.
* *predict()*: This returns the predicted values from the model
* *score()*: This returns the goodness of fit (R^2) of the model

Let's build a simple version of this to see how it would work.



In [17]:
import numpy

def check_variable_sizes(X, y=None):
    """
    makes sure that vectors are properly shaped for linear modeling
    
    Parameters
    ----------
    X: numpy array (N x k)
        design matrix for a linear model
    y: numpy array (N or N x 1), optional
        vector of dependent variables for model
    
    Returns
    -------
    y: numpy array (N x 1)
        vector of dependent variables, reshaped
        if necessary (or None if y is None)
    """

    # make sure X is two-dimensional
    assert len(X.shape) == 2

    if y is not None:
        # if y is one-dimensional then add a second dimension
        # so that it matches X
        if len(y.shape) == 1:
            y = y[:, numpy.newaxis]

        assert y.shape[0] == X.shape[0]

        return(y)
    else:
        return(None)


class MyLinearRegression:
    """Class to perform linear regression using scikit-learn API
            
    Attributes
    ----------
    coef_: numpy array, k X 1
        estimated regression coefficients (None until model is fitted)
    residuals_: numpy array, N
        residuals from model fit (None until model is fitted)
        
    Notes
    -----
    This class assumes that the design matrix already includes an intercept
    column; the fit_intercept argument is currently unused
    
    Example
    -------
    >>> lr = MyLinearRegression()
    >>> X = numpy.random.randn(100,3)
    >>> X[:,2] = 1 # intercept column
    >>> y = X.dot(numpy.array([3,-2, 5]))
    >>> lr.fit(X, y)
    >>> lr.coef_
    array([[ 3.],
       [-2.],
       [ 5.]])
    """
    
    
    def __init__(self, fit_intercept=True):
        # placeholder for the regression coefficients
        # that we will estimate
        self.coef_ = None
        self.residuals_ = None
       

        
    def fit(self, X, y):
        """function to fit the model
        
        Parameters
        ----------
        X: numpy array (N x k)
            design matrix for the model
        y: numpy array (N or N x 1)
            vector of dependent variables for model
        """
        
        # check variable sizes
        y = check_variable_sizes(X, y)
        
        # compute the regression coefficients using ordinary least squares
        self.coef_ = numpy.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        
        self.residuals_ = y[:, 0] - self.predict(X)
        
        
    def predict(self, X):
        """Predicted values using fitted coefficients
        
        Parameters
        ----------
        X: numpy array (N x k)
            design matrix for the model
            
        Returns
        -------
        pred: numpy array (N)
            predicted values
        """
        
        # make sure the model has been fit
        if self.coef_ is None:
            print('Model must be fitted first!')
            return(None)
        
        # check variable sizes
        check_variable_sizes(X)
       
        # make sure X has same dimension as coefficients
        # raise an exception if it doesn't
        assert self.coef_.shape[0] == X.shape[1]
        
        return(X.dot(self.coef_)[:, 0])

    
    def score(self, X, y):
        """returns coefficient of determination (R**2) for fitted model
        
        Parameters
        ----------
        X: numpy array (N x k)
            design matrix for the model
        y: numpy array (N or N x 1)
            vector of dependent variables for model

        Returns
        -------
        r2: float
            coefficient of determination
       
        Notes
        -----
        Computed as in sklearn: The coefficient R^2 is defined as (1 - u/v), 
        where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() 
        and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). 
        
        """
        
        if self.coef_ is None:
            self.fit(X, y)

        y = check_variable_sizes(X, y)

        y_pred = self.predict(X)
        
        # return to a 1d vector to match y_pred
        y_true = y[:, 0]
        
        # compute r-squared using sum of squares formulation, based on
        # the X and y passed in rather than the stored training residuals
        RSS = ((y_true - y_pred) ** 2).sum()
        TSS = ((y_true - y_true.mean()) ** 2).sum()
        r2 = (1 - RSS/TSS)
        
        return(r2)
        
        

In [18]:
reg = MyLinearRegression()

num_points = 10
X = numpy.random.randn(num_points, 2)
# add intercept
X = numpy.append(X, numpy.ones(num_points)[:, numpy.newaxis], 1)
y = X.dot([3, 4, -2]) + numpy.random.randn(num_points)

# running predict before fitting the model prints a message and returns None
reg.predict(X)

reg.fit(X, y)
print(reg.coef_)

predicted = reg.predict(X)

reg.score(X, y)

Model must be fitted first!
[[ 2.2906098 ]
 [ 4.113673  ]
 [-2.53111071]]


0.9686900814459842
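
As a sanity check on this kind of implementation, the normal-equations estimate can be compared against NumPy's built-in least-squares solver. The sketch below is self-contained and does not use the class above. The seed, sample size, and variable names are arbitrary choices made for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)   # fixed seed so the sketch is repeatable
num_points = 50
X = rng.standard_normal((num_points, 2))
X = numpy.column_stack([X, numpy.ones(num_points)])   # intercept column
y = X.dot([3, 4, -2]) + rng.standard_normal(num_points)

# normal equations, as in MyLinearRegression.fit
coef_normal = numpy.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)

# reference solution from numpy's least-squares solver
coef_lstsq, _, _, _ = numpy.linalg.lstsq(X, y, rcond=None)

# R^2 from the residual and total sums of squares
y_pred = X.dot(coef_normal)
rss = ((y - y_pred) ** 2).sum()
tss = ((y - y.mean()) ** 2).sum()
r2 = 1 - rss / tss

print(numpy.allclose(coef_normal, coef_lstsq))   # True
print(0 <= r2 <= 1)                              # True
```

For well-conditioned problems the two estimates should agree. Explicitly inverting `X.T.dot(X)` is fine for teaching purposes, but a solver such as `numpy.linalg.lstsq` is generally the more numerically robust choice.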