# Chapter 3: Building Functions

Often we perform statistical operations over a large dataset, and it can be difficult to understand the meaning conveyed by the resulting measures. Learning to program presents an opportunity to better understand how these functions work. In this chapter we will create some basic statistical functions and compare their output to the functions built into Python. By writing each function yourself, you will come to understand what the summation signs in its formula mean. Computing these statistics by hand would be laborious and time-consuming. Once a function is constructed, it can calculate statistics in a fraction of the time.

## Building a Function

| New Concepts | Description |
| --- | --- |
| _return obj_ from function | A function may return an object, which can be saved by assigning the function call to a variable, e.g., _var1 = function(obj1, obj2, . . .)_ |

So far, we have built programs on the fly. For purposes of pedagogy, this is fine. As you develop your skills, you will want to form good practices. These include building functions for repeated use as well as building classes. In this chapter we will concentrate on functions. Build all of your functions in the same file, _statsFunctions.py_.

In Python, functions take the form:

In [None]:
def function_name(object1, object2, objectn):
    <operations>

If the function allows, you pass an object to it by placing the object in the parentheses that follow the function name. The first function that we build will be the total() function. We define the function algebraically as the sum of all values in a list of length n:

$\sum_{i=0}^{n-1} x_{i}$

Since list indices start with the integer 0, we will write our functions as starting with _i = 0_ and processing elements up to the index of value _n - 1_. Since the range function in Python automatically counts to one less than the value passed to it, the for-loop used will take the form:

In [2]:
list_obj = [3, 6, 9]
n = len(list_obj)
for i in range(n):
    # i runs from 0 through n - 1
    value = list_obj[i]
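
To see exactly which index values such a loop visits, the following minimal sketch collects them in a list. The list `values` is an assumption made for illustration and is not part of _statsFunctions.py_.

```python
# Hypothetical example list, used only to illustrate range()
values = [3, 6, 9, 12, 15]
n = len(values)

# range(n) yields 0, 1, ..., n - 1, which are exactly the valid indices
indices = list(range(n))
print(indices)

# the last index visited is n - 1, so this prints the final element
print(values[indices[-1]])
```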

We will use this loop inside total() to return the sum of the values in a list. After building the function, we will pass a list to it:

In [3]:
#statsFunctions.py
def total(list_obj):
    total = 0
    n = len(list_obj)
    for i in range(n):
        total += list_obj[i]
    return total

list1 = [3, 6, 9, 12, 15]
list2 = [i ** 2 for i in range(3,9)]
total_list1 = total(list1)
total_list2 = total(list2)
print(total_list1)
print(total_list2)

45
199


The total() function is a simple function that will be used in many of the other functions that we write. You can find this and other functions from this chapter in [statsFunctions.py](https://github.com/jlcatonjr/Learn-Python-for-Stats-and-Econ/tree/master/Chapter%203).
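
The introduction to this chapter mentions comparing our functions with those built into Python. As a minimal sketch of such a check, the block below re-declares total() so that it runs on its own and compares its result with the built-in sum(). The variable names `our_total` and `builtin_total` are chosen here for illustration.

```python
# Re-declared here so that the snippet runs on its own
def total(list_obj):
    total_ = 0
    n = len(list_obj)
    for i in range(n):
        total_ += list_obj[i]
    return total_

list1 = [3, 6, 9, 12, 15]

our_total = total(list1)
# sum() is Python's built-in function for adding the elements of a list
builtin_total = sum(list1)
print(our_total, builtin_total, our_total == builtin_total)
```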

## Statistical Functions
| New Concepts | Description |
| --- | --- |
| Operators e.g., _!=_, _%&nbsp;_, _+=_, _\*\*_ | The operator != tests whether the values on either side of the operator are not equal; _a % b_ returns the remainder of $a / b$; _a += b_ sets a equal to $a + b$; _a ** b_ raises a to the b power ($a^b$). |
| Dictionary | A dictionary is a data structure that uses keys instead of index values. Each unique key references an object linked to that key. |
| Dictionary Methods e.g., _dct.values()_ | dct.values() returns a view of the objects that are referenced by the dictionary's keys; pass it to list() to obtain a list.|
| Default Function Values | A function may assume a default value for an argument that is not passed to it, e.g., _def function(val1 = 0, val2 = 2, …)_ |
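
The concepts in the table can be tried out in a few lines. This is only an illustrative sketch: the helpers `is_odd()` and `power()` and the dictionary `ages` are hypothetical and do not appear in _statsFunctions.py_.

```python
def is_odd(n):
    # n % 2 is the remainder of n / 2; != checks that it is not 0
    return n % 2 != 0

def power(base, exponent=2):
    # exponent takes the default value 2 when the caller does not pass it
    return base ** exponent

counter = 0
counter += 1  # same as counter = counter + 1

# a dictionary links each key to an object
ages = {"Ann": 30, "Bob": 25}
# values() returns a view, so wrap it in list() to get a list
age_list = list(ages.values())

print(is_odd(7), is_odd(10))
print(power(3), power(3, 3))
print(counter, age_list)
```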

### Average Statistics

We define the mean of a set of numbers:
$\frac{\sum_{i=0}^{n-1} x_{i}} {n}$

The numerator of this expression is the same as the notation that represents the sum of a list of numbers. Thus, in mean(), we call total() and divide the result by the length of the list. Then, we use the function to calculate the mean of each list and save that value as an object:

In [4]:
#statsFunctions.py
#. . . 
def mean(list_obj):
    n = len(list_obj)
    mean_ = total(list_obj) / n
    return mean_

# . . . 
mean_list1 = mean(list1)
mean_list2 = mean(list2)
print("mean_list1:", mean_list1)
print("mean_list2:", mean_list2)

mean_list1: 9.0
mean_list2: 33.166666666666664


Now that we have set up total and mean functions, we are ready to calculate 
other core statistical values: 

1. median
2. mode
3. variance
4. standard deviation
5. covariance
6. correlation

Statistical values provide information about the shape and structure of data. These values are aggregates: they sum some characteristic from the dataset and transform it into a value representative of the whole dataset. Above, we have already calculated the mean. Now we shall calculate the other average values, median and mode.

The median is defined as the middle-most number in a sorted list. In a list of odd length, this is straightforward to find: since Python indices start at 0, the middle value sits at index (n - 1) / 2. To identify whether a list is of odd or even length, we divide the length of the list by 2 using the _%&nbsp;_ operator, which returns the remainder. If the remainder does not equal (_!=_) zero, then the list is of odd length. If the remainder is 0, then the list is of even length, and we take the average of the two middle terms:

In [5]:
#statsFunctions.py 
#. . .
def median(list_obj):
    n = len(list_obj)
    list_obj = sorted(list_obj)
    #lists of even length divided by 2 have remainder 0
    if n % 2 != 0:
        #list length is odd
        middle_index = int((n - 1) / 2)
        median_ = list_obj[middle_index]
    else:
        upper_middle_index = int(n / 2)
        lower_middle_index = upper_middle_index - 1
        # pass slice with two middle values to mean()
        median_ = mean(list_obj[lower_middle_index : upper_middle_index + 1])
        
    return median_
# . . . 
median_list1 = median(list1)
median_list2 = median(list2)
print("median_list1:", median_list1)
print("median_list2:", median_list2)

median_list1: 9
median_list2: 30.5
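
Python's standard library includes a `statistics` module, which offers one way to check results like those above. The sketch below does not use the chapter's own functions; it only calls `statistics.mean()` and `statistics.median()` on the same two lists so that the values can be compared with the printed output.

```python
import statistics

list1 = [3, 6, 9, 12, 15]
list2 = [i ** 2 for i in range(3, 9)]

# odd length: the single middle value of the sorted list
print("median_list1:", statistics.median(list1))
# even length: the mean of the two middle values, 25 and 36
print("median_list2:", statistics.median(list2))
# the mean should agree with the mean() function built earlier
print("mean_list2:", statistics.mean(list2))
```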


The mode of a list is defined as the number that appears the most in the list. In order to quickly and cleanly identify the mode, we are going to use a new data structure: the dictionary. The dictionary is like a list, but elements are called by a key, not by elements from an ordered set of index numbers. We are going to use the values from the list passed to the function as keys. Every time a value is passed, the dictionary will indicate that it has appeared an additional time by adding one to the value pointed to by the key. We will pass the lists that we used in the previous exercises:

In [6]:
#statsFunctions.py
# . . .
def mode(list_obj):
    # use to record value(s) that appear most times
    max_count = 0
    # use to count occurrences of each value in list
    counter_dict = {}
    for value in list_obj:
        # count for each value should start at 0
        counter_dict[value] = 0
    for value in list_obj:
        # add on to the count of the value for each occurrence in list_obj
        counter_dict[value] += 1
    # make a list of the counts (the dictionary's values, not its keys)
    count_list = list(counter_dict.values())
    # and find the max value
    max_count = max(count_list)
    # use a list comprehension to collect the values (keys) whose number of 
    # occurrences in the list matches max_count
    mode_ = [key for key in counter_dict if counter_dict[key] == max_count]
    
    return mode_

# . . .
mode_list1 = mode(list1)
mode_list2 = mode(list2)
print("mode_list1:", mode_list1)
print("mode_list2:", mode_list2)

mode_list1: [3, 6, 9, 12, 15]
mode_list2: [9, 16, 25, 36, 49, 64]


Note that instead of using the command _for i in range(n)_, the command _for value in list_obj_ is used. The first command counts from 0 to _n - 1_ using _i&nbsp;_ and can be used to call elements in the list of interest by passing _i&nbsp;_ in the form *list_obj[i]*. In the cases above, we called the values directly, passing them to *counter_dict* to count the number of times each value appears in the list. We first initialize the dictionary by setting to 0 the value linked to each key, then add 1 each time a value appears. We identify the maximum number of times a value appears by taking the maximum of *count_list*, which is simply a list of the values held by *counter_dict*. Finally, we collect the keys that point to that maximum by comparing the value linked to each key to *max_count*.
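
Because every value in list1 and list2 is unique, the output above does not show what happens when values repeat. The following sketch uses a list with repeats and an alternative counting tool, `collections.Counter` from the standard library, rather than the chapter's two-loop approach. The function name `mode_with_counter` and the list `repeats` are assumptions made for illustration.

```python
from collections import Counter

def mode_with_counter(list_obj):
    # Counter builds the value -> count dictionary in a single step
    counts = Counter(list_obj)
    max_count = max(counts.values())
    # keep every value whose count matches the maximum
    return [value for value, count in counts.items() if count == max_count]

# hypothetical list in which 2 and 3 each appear twice
repeats = [1, 2, 2, 3, 3, 4]
print(mode_with_counter(repeats))
```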

### Statistics Describing Distributions

Average values do not provide a robust description of the data. An average does not tell us the shape of a distribution. In this section, we will build functions to calculate statistics describing distribution of variables and their relationships. The first of these is the variance of a list of numbers.

We define population variance as:

$var_{pop} = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})^2} {n}$

When we are dealing with a sample, which is a subset of a population of observations, we divide by (n - 1) instead of n to correct for bias in the estimate.

$var_{samp} = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})^2} {n - 1}$

We will first build functions that calculate a population's variance and standard deviation. We will include an option for calculating sample variance and sample standard deviation.

In [7]:
def variance(list_obj, sample = False):
    # popvar(list) = sum((xi - list_mean)**2) / n for all xi in list
    # save mean value of list
    list_mean = mean(list_obj)
    # use n to calculate average of sum squared diffs
    n = len(list_obj)
    # create value we can add squared diffs to
    sum_sq_diff = 0
    for val in list_obj:
        # adds each squared diff to sum_sq_diff
        sum_sq_diff += (val - list_mean) ** 2
    if sample == False:
        # normalize result by dividing by n
        variance_ = sum_sq_diff / n
    else:
        # for samples, normalize by dividing by (n-1)
        variance_ = sum_sq_diff / (n - 1)
    
    return variance_

# . . . 
variance_list1 = variance(list1)
variance_list2 = variance(list2)
print("variance_list1:", variance_list1)
print("variance_list2:", variance_list2)

variance_list1: 18.0
variance_list2: 359.1388888888889


We include an option to calculate sample variance through the _sample_ argument: when _sample_ is True, the else branch of the function divides by (n - 1) instead of n. We can call the sample variance by adding the following to the script:

In [8]:
sample_variance_list1 = variance(list1, sample = True)
sample_variance_list2 = variance(list2, sample = True)
print("sample_variance_list1:", sample_variance_list1)
print("sample_variance_list2:", sample_variance_list2)

sample_variance_list1: 22.5
sample_variance_list2: 430.9666666666667
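
As a cross-check that does not rely on the chapter's own code, the standard library's `statistics` module provides `pvariance()` (divides by n) and `variance()` (divides by n - 1). The sketch below applies both to list1; the variable names are chosen here for illustration.

```python
import statistics

list1 = [3, 6, 9, 12, 15]
n = len(list1)

pop_var = statistics.pvariance(list1)   # population variance, divides by n
samp_var = statistics.variance(list1)   # sample variance, divides by n - 1
print("pop_var:", pop_var)
print("samp_var:", samp_var)

# the two differ only by the factor n / (n - 1)
print(abs(samp_var - pop_var * n / (n - 1)) < 1e-9)
```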


From a list’s variance, we also calculate its standard deviation as the square root of the variance.

$SD = \sqrt{var}$

This is true for both the population and sample standard deviations. The function and its employment are listed below:

In [9]:
#statsFunctions.py
# . . .
def SD(list_obj, sample = False):
    # Standard deviation is the square root of variance
    SD_ = variance(list_obj, sample) ** (1/2)
    
    return SD_

# . . . 
SD_list1 = SD(list1)
SD_list2 = SD(list2)
print("SD_list1:", SD_list1)
print("SD_list2:", SD_list2)
sample_SD_list1 = SD(list1, sample = True)
sample_SD_list2 = SD(list2, sample = True)
print("sample_SD_list1:", sample_SD_list1)
print("sample_SD_list2:", sample_SD_list2)

SD_list1: 4.242640687119285
SD_list2: 18.950960104672504
sample_SD_list1: 4.743416490252569
sample_SD_list2: 20.75973667141919


We have yet to build functions for covariance, correlation, skewness, and kurtosis. Covariance measures the average relationship between two variables. Correlation normalizes the covariance statistic to a value between -1 and 1.

To calculate covariance, we sum the products of the differences between each observed value and the mean of its list, for _i = 0_ through _n - 1_ where _n_ is the number of observations, and divide that sum by _n_:

$cov_{pop}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n}$

We pass two lists through the covariance() function. As with the _variance()_ and _SD()_ functions, we can take the sample covariance.

$cov_{sample}(x,y) = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})(y_{i} - y_{mean})} {n - 1}$

In order for covariance to be calculated, the lists passed to the function must be of equal length. So we check this condition with an if statement:

In [10]:
#statsFunctions.py

def covariance(list_obj1, list_obj2, sample = False):
    # determine the mean of each list
    mean1 = mean(list_obj1)
    mean2 = mean(list_obj2)
    # instantiate a variable holding the value of 0; this will be used to 
    # sum the values generated in the for loop below
    cov = 0
    n1 = len(list_obj1)
    n2 = len(list_obj2)
    # check list lengths are equal
    if n1 == n2:
        n = n1
        # sum the product of the differences
        for i in range(n1):
            cov += (list_obj1[i] - mean1) * (list_obj2[i] - mean2)
        if sample == False:
            cov = cov / n
        # account for sample by dividing by one less than number of elements in list
        else:
            cov = cov / (n - 1)
        # return covariance
        return cov
    else:
        print("List lengths are not equal")
        print("List1:", n1)
        print("List2:", n2)
    
# . . . 
population_cov = covariance(list1, list2)
print("population_cov:", population_cov)
sample_cov = covariance(list1, list2, sample = True)
print("sample_cov:", sample_cov)            

List lengths are not equal
List1: 5
List2: 6
population_cov: None
List lengths are not equal
List1: 5
List2: 6
sample_cov: None
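
Note that when the lengths differ, covariance() prints a message and returns None, which is why None appears in the output. One alternative, which is not the approach taken in this chapter, is to raise an exception so that the caller cannot accidentally continue with a missing value. The function name `covariance_strict` is hypothetical, and the sketch uses the built-in sum() so that it runs on its own.

```python
def covariance_strict(list_obj1, list_obj2, sample=False):
    # hypothetical variant of covariance() that raises instead of printing
    n1 = len(list_obj1)
    n2 = len(list_obj2)
    if n1 != n2:
        raise ValueError(f"List lengths are not equal: {n1} and {n2}")
    mean1 = sum(list_obj1) / n1
    mean2 = sum(list_obj2) / n2
    cov = 0
    for i in range(n1):
        cov += (list_obj1[i] - mean1) * (list_obj2[i] - mean2)
    # divide by n - 1 for a sample, by n for a population
    return cov / (n1 - 1) if sample else cov / n1

try:
    covariance_strict([3, 6, 9, 12, 15], [9, 16, 25, 36, 49, 64])
except ValueError as error:
    print("Caught:", error)
```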


The list lengths are not equal. Let's shorten list2 so that it has 5 elements:

In [11]:
list2 = [i ** 2 for i in range(3,8)]
population_cov = covariance(list1, list2)
print("population_cov:", population_cov)
sample_cov = covariance(list1, list2, sample = True)
print("sample_cov:", sample_cov) 

population_cov: 60.0
sample_cov: 75.0


We can transform the covariance into a correlation value by dividing by the product of the standard deviations. 

$corr_{pop}(x,y) = \frac{cov_{pop}(x, y)} {\sigma_x \sigma_y}$

We thus divide the average of the products of the deviations from the mean of each variable by the product of the standard deviations. This normalizes the covariance, providing an easily interpretable value between -1 and 1. The correlation() function that we build will make use of the covariance() function that we have already constructed as well as the SD() function.

In [12]:
#statsFunctions.py
# . . .
def correlation(list_obj1, list_obj2):
    # corr(x,y) = cov(x, y) / (SD(x) * SD(y))
    cov = covariance(list_obj1, list_obj2)
    SD1 = SD(list_obj1)
    SD2 = SD(list_obj2)
    corr = cov / (SD1 * SD2)
    return corr

# . . . 
corr_1_2 = correlation(list1, list2)
print("corr_1_2:", corr_1_2)

corr_1_2: 0.9930726528736967
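
The two lists above rise together, so their correlation is close to 1. Correlation can also be negative: when one list falls as the other rises, the value approaches -1. The sketch below illustrates this with a compact, hypothetical helper `pop_correlation()` that follows the population formula but uses the built-in sum() and zip() so that it runs on its own. The lists `rising` and `falling` are made up for the example.

```python
def pop_correlation(x, y):
    # hypothetical helper: population covariance over the product of
    # the population standard deviations
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = (sum((a - mean_x) ** 2 for a in x) / n) ** 0.5
    sd_y = (sum((b - mean_y) ** 2 for b in y) / n) ** 0.5
    return cov / (sd_x * sd_y)

rising = [1, 2, 3, 4, 5]
falling = [10, 8, 6, 4, 2]

# a list compared with itself: correlation close to 1
print(pop_correlation(rising, rising))
# one list rises while the other falls: correlation close to -1
print(pop_correlation(rising, falling))
```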


Not all distributions are normal, so we need statistics that reflect differences in shapes between distributions.

Skewness is a measure of asymmetry of a population of data about the mean. It is the expected value of the cubed deviation from the mean, measured in units of the standard deviation.

$skew_{pop}(x) = \frac{\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {n\sigma^3}$


$skew_{sample}(x) = \frac{n\sum_{i=0}^{n-1}{(x_{i} - x_{mean})^3}} {(n-1)(n-2)\sigma^3}$

In the sample formula, $\sigma$ is the sample standard deviation.

Asymmetry in a distribution exists due to the existence of either long or fat tails. If a tail is long, this means that it contains values that are relatively far from the mean value of the data. If a tail is fat, there exists a greater number of observations whose values are relatively far from the mean than is predicted by a normal distribution. Skewness may sometimes be thought of as the direction in which a distribution leans. This can be due to the existence of asymmetric fat tails, long tails, or both. For example, if a distribution includes a long tail on the right side, but is normal otherwise, it is said to have a positive skew. The same can be said of a distribution with a fat right tail. Skewness can be ambiguous concerning the shape of the distribution. If a distribution has a fat right tail and a long left tail that is not fat, it is possible that its skewness will be zero, even though the shape of the distribution is asymmetric.
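
The sign of the skewness statistic can be illustrated with three small lists. This is a sketch only: `pop_skew()` is a hypothetical, compact helper that implements the population formula with the built-in sum(), and the three lists are made up for the example.

```python
def pop_skew(data):
    # hypothetical helper: sum of cubed deviations over n * sigma ** 3
    n = len(data)
    mean_ = sum(data) / n
    sd = (sum((x - mean_) ** 2 for x in data) / n) ** 0.5
    return sum((x - mean_) ** 3 for x in data) / (n * sd ** 3)

symmetric = [1, 2, 3, 4, 5]      # evenly spread around the mean
right_tail = [1, 2, 3, 4, 20]    # one value far above the mean
left_tail = [-14, 2, 3, 4, 5]    # one value far below the mean

print("symmetric:", pop_skew(symmetric))
print("right tail is positive:", pop_skew(right_tail) > 0)
print("left tail is negative:", pop_skew(left_tail) < 0)
```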

In [13]:
#statsFunctions.py
# . . . 
def skewness(list_obj, sample = False):
    mean_ = mean(list_obj)
    SD_ = SD(list_obj, sample)
    skew = 0
    n = len(list_obj)
    for val in list_obj:
        # sum the cubed deviations from the mean
        skew += (val - mean_) ** 3
    # normalize once, after the loop has finished summing
    if not sample:
        skew = skew / (n * SD_ ** 3)
    else:
        skew = n * skew / ((n - 1) * (n - 2) * SD_ ** 3)
        
    return skew

# . . . 
population_skew_list1 = skewness(list1, sample = False)
population_skew_list2 = skewness(list2, sample = False)
# round to 4 decimal places for display
print("population_skew_list1:", round(population_skew_list1, 4))
print("population_skew_list2:", round(population_skew_list2, 4))
sample_skew_list1 = skewness(list1, sample = True)
sample_skew_list2 = skewness(list2, sample = True)
print("sample_skew_list1:", round(sample_skew_list1, 4))
print("sample_skew_list2:", round(sample_skew_list2, 4))

population_skew_list1: 0.0
population_skew_list2: 0.2913
sample_skew_list1: 0.0
sample_skew_list2: 0.4342


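Kurtosis is an absolute measure of the weight of outliers. While skewness describes the ‘lean’ of a distribution, kurtosis describes the weight of a distribution that is held in the tails. Kurtosis is the average of each observation's deviation from the mean, measured in standard deviations and raised to the fourth power. As with the other statistical values, kurtosis can be taken for a population and for a sample. Note that the sample formula below subtracts a correction term, so it reports excess kurtosis (kurtosis relative to that of a normal distribution), while the population formula reports raw kurtosis.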

$kurt_{pop} = \frac{\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {n\sigma^4}$

$kurt_{sample} = \frac{n(n+1)\sum_{i=0}^{n-1} (x_{i} - x_{mean})^4} {(n - 1)(n - 2)( n - 3)\sigma^4} - \frac{3(n - 1)^2}{(n - 2)(n - 3)}$

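If an observation is less than one standard deviation from the mean, its contribution will be relatively insignificant compared to that of an observation that is farther from the mean, because raising a value smaller than 1 to the fourth power shrinks it while raising a value larger than 1 to the fourth power magnifies it.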
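
A quick calculation shows how strongly the fourth power favors distant observations. The two deviations below, measured in standard deviations from the mean, are illustrative values rather than data from the chapter.

```python
# deviations from the mean, in units of the standard deviation (illustrative)
near = 0.5
far = 3.0

near_contribution = near ** 4   # 0.0625
far_contribution = far ** 4     # 81.0

print(near_contribution, far_contribution)
# the distant observation contributes 1296 times as much
print(far_contribution / near_contribution)
```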

In [14]:
#statsFunctions.py
# . . .
def kurtosis(list_obj, sample = False):
    mean_ = mean(list_obj)
    kurt = 0
    SD_ = SD(list_obj, sample)
    n = len(list_obj)
    for x in list_obj:
        # sum the deviations from the mean raised to the fourth power
        kurt += (x - mean_) ** 4
    if not sample:
        kurt = kurt / (n * SD_ ** 4)
    else:
        # the first denominator includes (n - 3), as in the sample formula
        kurt = n * (n + 1) * kurt / ((n - 1) * (n - 2) * (n - 3) * SD_ ** 4) \
            - (3 * (n - 1) ** 2) / ((n - 2) * (n - 3))
    
    return kurt

# . . .
population_kurt_list1 = kurtosis(list1)
population_kurt_list2 = kurtosis(list2)
# round to 4 decimal places for display
print("population_kurt_list1:", round(population_kurt_list1, 4))
print("population_kurt_list2:", round(population_kurt_list2, 4))
sample_kurt_list1 = kurtosis(list1, sample = True)
sample_kurt_list2 = kurtosis(list2, sample = True)
print("sample_kurt_list1:", round(sample_kurt_list1, 4))
print("sample_kurt_list2:", round(sample_kurt_list2, 4))


population_kurt_list1: 1.7
population_kurt_list2: 1.7528
sample_kurt_list1: -1.2
sample_kurt_list2: -0.9887


## Using a Nested Dictionary to Organize Statistics

| New Concepts | Description |
| --- | --- |
| Filling dictionary with for loop | When a dictionary is called in a for loop in the form, for key in dct, the for loop will iterate through dct.keys().|

Using a dictionary, we can cleanly organize the statistics that we have generated. Create a new script that includes all of the functions that we created in the previous lesson. We will use the same two lists that we previously created.

Next create a dictionary named stats_dict that will hold not only these lists, but also statistics associated with the lists. At the top level, the dictionary will have two keys: 1 and 2, referring to list1 and list2, respectively. In the next layer, we will first save the appropriate list identified by each top layer key (i.e., 1 or 2) under the second layer key of “list”.

In [15]:
#statsFunctions.py
# . . .  
### Build a nested dictionary with lists ###  
stats_dict = {}  
# 1 refers to list1 and attributes associated with it  
stats_dict[1] = {}  
stats_dict[1]["list"] = list1  
# 2 refers to list2 and attributes associated with it  
stats_dict[2] = {}  
stats_dict[2]["list"] = list2  

Now that the dictionary has been created, you can view the keys of stats_dict by typing the following command in the console:

In [16]:
stats_dict.keys()

dict_keys([1, 2])
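
Values in a nested dictionary are reached by chaining keys: the first key selects the inner dictionary and the second key selects an object within it. The sketch below rebuilds a small stats_dict so that it runs on its own; the variable `inner_keys` is introduced here for illustration.

```python
# rebuilt here so that the snippet runs on its own
stats_dict = {
    1: {"list": [3, 6, 9, 12, 15]},
    2: {"list": [9, 16, 25, 36, 49]},
}

# first key picks the inner dictionary, second key picks the list
print(stats_dict[1]["list"])
# a third index reaches into the list itself
print(stats_dict[2]["list"][0])

# iterating over a dictionary yields its keys
inner_keys = {}
for key in stats_dict:
    inner_keys[key] = list(stats_dict[key].keys())
print(inner_keys)
```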

We see that stats_dict has two keys: 1 and 2. These each have been linked to their own dictionaries, thus creating a dictionary of dictionaries. Next, we will use a for loop to call these keys and create entries of each appropriate statistic (population) for the lists saved in the dictionary.

In [17]:
# for loop will call keys from stats_dict (i.e., first 1, and then 2)  
for key in stats_dict:  
    # save the list associated with key as lst; this will be easier to access  
    lst = stats_dict[key]["list"]  
    # use the functions to calculate each statistic and save in stats_dict[key]  
    stats_dict[key]["total"] = total(lst)  
    stats_dict[key]["mean"] = mean(lst)  
    stats_dict[key]["median"] = median(lst)  
    stats_dict[key]["mode"] = mode(lst)  
    stats_dict[key]["variance"] = variance(lst)  
    stats_dict[key]["standard deviation"] = SD(lst)    
    # round the shape statistics to 4 decimal places for readability
    stats_dict[key]["skewness"] = round(skewness(lst), 4)
    stats_dict[key]["kurtosis"] = round(kurtosis(lst), 4)
  
print(stats_dict)  


{1: {'list': [3, 6, 9, 12, 15], 'total': 45, 'mean': 9.0, 'median': 9, 'mode': [3, 6, 9, 12, 15], 'variance': 18.0, 'standard deviation': 4.242640687119285, 'skewness': 0.0, 'kurtosis': 1.7}, 2: {'list': [9, 16, 25, 36, 49], 'total': 135, 'mean': 27.0, 'median': 25, 'mode': [9, 16, 25, 36, 49], 'variance': 202.8, 'standard deviation': 14.24078649513432, 'skewness': 0.2913, 'kurtosis': 1.7528}}
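
Printing the whole dictionary at once produces a dense line. One possible way to make it easier to read, shown here only as a sketch, is to loop over both layers and print one statistic per line. The small dictionary below is a stand-in that copies a few of the values printed above, and the list `lines` is introduced for illustration.

```python
# stand-in for stats_dict holding a few of the values printed above
stats_dict = {
    1: {"total": 45, "mean": 9.0, "variance": 18.0},
    2: {"total": 135, "mean": 27.0, "variance": 202.8},
}

lines = []
# outer loop: one pass per list; inner loop: one pass per statistic
for key, stats in stats_dict.items():
    for name, value in stats.items():
        lines.append(f"list{key} {name}: {value}")

print("\n".join(lines))
```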


### Exercises
1. Create a list of random numbers between 0 and 100 whose length is 1000. (Hint: import random; search "python random" to learn more about the library.)

2. Use the variance function from the textbook to find the variance of this list. Assume that the list represents an entire population.

3. Create 9 more lists of the same length whose elements are random numbers between 0 and 100. Use a nested dictionary to house and identify these lists. Keys for the first layer should be the numbers 1 through 10. Lists should be stored using a second key as follows: dict_name[index]["list"]. Index represents the particular integer key between 1 and 10 as noted above.

4. Find the variance of each list and store it as follows: dict_name[index]["variance"].

5. At the end of chapter 2, we used for loops to find min and max values. Create a min() function and max() function and pass the values from the list in question 1 to each of these to determine the min and max values in that list.

6. Explain why it might be advantageous to create a function instead of building all commands from scratch as you create a script.

### Exploration:
1. Visit the Python Essentials lesson from Sargent and Stachurski. Complete exercise 3. Pass 3 other sentences to the function that you create. Include a paragraph that explains in detail how the function operates.

2. Visit the Python Essentials lesson from Sargent and Stachurski. Complete exercise 4. Pass 3 pairs of unique lists to the function. Include a paragraph explaining in detail how the function operates, including explanation for a solution that uses set().

