# Algorithms For Calculating Summary Statistics

## Lesson Overview

Some of the most important and fundamental algorithms in computer science involve calculating a summary statistic from a dataset. Summary statistics are simple aggregated statistics of a dataset that help to summarize a key attribute of the dataset.

Examples of summary statistics include:

- minimum value
- maximum value
- mean/average value
- mode (most common value)
- median (middle value)
- standard deviation

This lesson doesn't go into the technical definition of a general summary statistic, but you can read more about them [here](https://en.wikipedia.org/wiki/Summary_statistics).
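For orientation, Python's standard library can already compute each of the statistics listed above. The sketch below uses the built-in `min` and `max` functions and the `statistics` module on an illustrative dataset (the values in `data` are made up for this example). The rest of the lesson implements several of these by hand.

```python
import statistics

# Illustrative dataset; these values are an assumption made for the example.
data = [2, 1, 4, 5, 2, 3, 7, 6]

summary = {
  'minimum': min(data),
  'maximum': max(data),
  'mean': statistics.mean(data),
  'mode': statistics.mode(data),
  'median': statistics.median(data),
  # pstdev is the population standard deviation.
  'standard deviation': statistics.pstdev(data),
}

for name, value in summary.items():
  print(name, value)
```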

### Function to calculate the mean

In a previous lesson on complexity, we already saw the implementation for calculating some of these summary statistics for an array.

For example, the function `mean` calculates the mean value of an array. This function has a time complexity of $O(n)$ and a space complexity of $O(1)$.

In [None]:
def mean(arr):
  """Finds the mean of a list of integers."""
  sum = 0
  len = 0
  for i in arr:
    sum += i
    len += 1
  
  # Coerce sum to float here so that the division will be float, not int.
  return float(sum) / len

In [None]:
mean([2, 1, 4, 5, 2, 3, 7, 6])

## Question 1

Write a function to calculate the maximum value in a numeric array, without using the built-in `max` function.

In [None]:
def maximum(arr):
  """Returns the maximum value in a list of numerical values."""
  # TODO(you): Implement
  print('This function has not been implemented.')

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
print(maximum([1.1, 2.2, 3.3, 4.4, 5.5]))
# Should print: 5.5

print(maximum([-10, 10, 20, -20, 20]))
# Should print: 20

print(maximum([1.994, 1994, 199.4, -19.94]))
# Should print: 1994

### Solution

In [None]:
def maximum(arr):
  """Returns the maximum value in a list of numerical values."""
  # Initialize the maximum value at negative infinity.
  max_value = float("-Inf")

  # Iterate through the list, and reset max_value if a higher value is found.
  for i in arr:
    if i > max_value:
      max_value = i
  
  return max_value
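
Note that the solution above returns negative infinity when given an empty list. As a minimal sketch of one alternative (the name `maximum_strict` is hypothetical and not part of the lesson), the running maximum can be seeded with the first element and an empty input rejected explicitly:

```python
def maximum_strict(arr):
  """Returns the maximum of a non-empty list; raises ValueError if empty."""
  if not arr:
    raise ValueError('Cannot take the maximum of an empty list.')

  # Seed with the first element instead of negative infinity.
  max_value = arr[0]
  for value in arr:
    if value > max_value:
      max_value = value

  return max_value


print(maximum_strict([-10, 10, 20, -20, 20]))
```

Whether an empty list should raise an error or return a sentinel value is a design choice; the lesson's version uses the sentinel.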

## Question 2

What is the big-O time and space complexity of the function below which calculates the maximum of an array?

In [None]:
def maximum(arr):
  """Returns the maximum value in a list of numerical values."""
  # Initialize the maximum value at negative infinity.
  max_value = float("-Inf")

  # Iterate through the list, and reset max_value if a higher value is found.
  for i in arr:
    if i > max_value:
      max_value = i
  
  return max_value

In [None]:
#freetext

### Solution

Suppose the input array `arr` has $n$ elements.

**Time complexity**

All of the operations outside the `for` loop are $O(1)$. Within the `for` loop, each operation is $O(1)$, so the time complexity is the number of iterations of the `for` loop. The `for` loop iterates through each element in `arr`, so it has $n$ iterations. Therefore, the time complexity is $O(n)$.

**Space complexity**

Only two variables, `max_value` and `i`, are allocated to memory and no new memory allocations are made within the `for` loop, so this function is $O(1)$ for space complexity.
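
One way to see the linear growth concretely is to count loop iterations. The sketch below is an instrumented copy of `maximum`; the name `maximum_with_count` and the `iterations` counter are additions for illustration only.

```python
def maximum_with_count(arr):
  """Returns (maximum, number of loop iterations) for a list of numbers."""
  max_value = float("-Inf")
  iterations = 0

  for i in arr:
    iterations += 1
    if i > max_value:
      max_value = i

  return max_value, iterations


# The iteration count grows in step with the input size.
for n in [10, 100, 1000]:
  _, iterations = maximum_with_count(list(range(n)))
  print(n, iterations)
```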

## Question 3

Write a function to calculate the [mode(s)](https://en.wikipedia.org/wiki/Mode_(statistics)) or most common value(s) in a list. Since there can be multiple modes (if different elements have the same frequency), your output should be a list. Do not use any external package functionality.

In [None]:
def mode(arr):
  """Returns the most common value in a list of numerical values."""
  modes = []

  # TODO(you): Implement
  print('This function has not been implemented.')

  return modes

### Hint

Use a dictionary.

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
print(mode([1, 2, 3, 4, 5]))
# Should print: [1, 2, 3, 4, 5]

print(mode([1, 2, 2, 3, 3]))
# Should print: [2, 3]

print(mode([4, 1, 4, 4, 1]))
# Should print: [4]

### Solution

This algorithm is more complicated than finding the minimum, maximum, and mean of an array.

In [None]:
def mode(arr):
  """Returns the most common value in a list of numerical values."""
  modes = []

  counts = {} # dictionary mapping each unique element to its count
  highest_count = 0 # the count of the highest occurring element

  for i in arr:
    if i not in counts:
      counts[i] = 1
    else:
      counts[i] += 1

    # If i has the equal highest count, add it to the list of modes.
    if counts[i] == highest_count:
      modes.append(i)
    # If i has the highest count, it becomes the new mode.
    elif counts[i] > highest_count:
      highest_count = counts[i]
      modes = [i]

  return modes

## Question 4

What is the big-O time and space complexity of the function below which calculates the mode of an array?

In [None]:
def mode(arr):
  """Returns the most common value in a list of numerical values."""
  modes = []

  counts = {} # dictionary mapping each unique element to its count
  highest_count = 0 # the count of the highest occurring element

  for i in arr:
    if i not in counts:
      counts[i] = 1
    else:
      counts[i] += 1

    # If i has the equal highest count, add it to the list of modes.
    if counts[i] == highest_count:
      modes.append(i)
    # If i has the highest count, it becomes the new mode.
    elif counts[i] > highest_count:
      highest_count = counts[i]
      modes = [i]

  return modes

In [None]:
#freetext

### Solution

Suppose the array has $n$ elements.

**Time complexity**

Despite containing more logic than the functions for calculating the minimum, maximum, and mean, the time complexity of `mode` is still $O(n)$. Each operation within the `for` loop, including the dictionary lookups and updates, is $O(1)$ on average, and the number of iterations is `len(arr)`. Therefore, the big-O time complexity is $O(n)$.

**Space complexity**

While calculating the minimum, maximum, and mean only requires storing a few numerical values, calculating the mode requires storing a dictionary. The number of elements in the dictionary depends on the number of unique elements in `arr`.

In the best case, `arr` contains only one unique element. If this element is called `x`, then the `counts` dictionary is just `{x: len(arr)}` and `modes` is just `[x]`. Therefore the best case space complexity is $O(1)$.

In the worst case, `arr` contains $n$ unique elements. In this case, `counts` maps each of the $n$ elements to 1 (requiring $2n$ memory allocations), and `modes` is identical to `arr` (requiring $n$ memory allocations). Therefore the worst case space complexity is $O(n)$.

The average case depends on how the input is distributed. If the values are drawn from a very large range of possible integers, then most (or all) of the integers in `arr` are likely to be unique, so the average case space complexity is also $O(n)$.
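
The best and worst cases can be checked by counting dictionary entries. The sketch below assumes a helper named `build_counts` (hypothetical, introduced only for this illustration) that builds the same value-to-count dictionary as `mode`.

```python
def build_counts(arr):
  """Builds the same value -> count dictionary that `mode` uses."""
  counts = {}
  for i in arr:
    if i not in counts:
      counts[i] = 1
    else:
      counts[i] += 1
  return counts


n = 1000
best_case = [7] * n          # one unique value repeated n times
worst_case = list(range(n))  # n unique values

# The number of dictionary entries equals the number of unique values.
print(len(build_counts(best_case)))
print(len(build_counts(worst_case)))
```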

## Question 5

Write an algorithm to calculate the minimum element of a linked list of integers. Assume that you have `LinkedListElement` and `LinkedList` classes as defined here.

In [None]:
class LinkedListElement:

  def __init__(self, value):
    self.value = value
    self.next = None

In [None]:
class LinkedList:

  def __init__(self):
    self.first = None

In [None]:
# This creates the linked list: 1 -> 2 -> 3

linked_list = LinkedList()

linked_list.first = LinkedListElement(1)
linked_list.first.next = LinkedListElement(2)
linked_list.first.next.next = LinkedListElement(3)

In [None]:
class LinkedList:

  def __init__(self):
    self.first = None
  
  def minimum(self):
    """Returns the minimum value in a linked list of integers."""
    # TODO(you): Implement
    print('This method has not been implemented.')

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
# This creates the linked list: 2 -> 4 -> -4 -> -2

linked_list = LinkedList()

linked_list.first = LinkedListElement(2)
linked_list.first.next = LinkedListElement(4)
linked_list.first.next.next = LinkedListElement(-4)
linked_list.first.next.next.next = LinkedListElement(-2)

print(linked_list.minimum())
# Should print: -4

# This creates the linked list: 1 -> 2 -> 3 -> 1

linked_list = LinkedList()

linked_list.first = LinkedListElement(1)
linked_list.first.next = LinkedListElement(2)
linked_list.first.next.next = LinkedListElement(3)
linked_list.first.next.next.next = LinkedListElement(1)

print(linked_list.minimum())
# Should print: 1

### Solution

In [None]:
class LinkedList:

  def __init__(self):
    self.first = None
  
  def minimum(self):
    """Returns the minimum value in a linked list of integers."""
    if self.first is None:
      raise ValueError('Input linked list has no first element.')

    min_value = float("Inf")

    el = self.first
    while el is not None:
      # If the value is less than the current minimum, make it the new minimum.
      if el.value < min_value:
        min_value = el.value
      # Iterate to the next element.
      el = el.next

    return min_value
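
The unit tests above build each list by chaining `.next` attributes, which becomes verbose for longer lists. As a minimal sketch, the hypothetical helpers `linked_list_from` and `to_list` below (neither is part of the lesson) convert between Python lists and linked lists. The two classes are repeated so that the cell runs on its own.

```python
class LinkedListElement:

  def __init__(self, value):
    self.value = value
    self.next = None


class LinkedList:

  def __init__(self):
    self.first = None


def linked_list_from(values):
  """Builds a LinkedList holding `values` in order (hypothetical helper)."""
  linked_list = LinkedList()
  previous = None
  for value in values:
    element = LinkedListElement(value)
    if previous is None:
      linked_list.first = element
    else:
      previous.next = element
    previous = element
  return linked_list


def to_list(linked_list):
  """Collects the values of a LinkedList into a Python list."""
  values = []
  el = linked_list.first
  while el is not None:
    values.append(el.value)
    el = el.next
  return values


print(to_list(linked_list_from([2, 4, -4, -2])))
```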

## Question 6

How do the time and space complexity of the algorithm to find the minimum of a linked list compare to the time and space complexity of finding the minimum of an array? Remember that finding the maximum of an array (or linked list) is an equivalent algorithm to finding the minimum of an array (or linked list), in terms of complexity.

In [None]:
#freetext

### Solution

The big-O time and space complexity of finding the maximum of an array is $O(n)$ and $O(1)$ respectively.

For the linked list, it is still the case that each operation is $O(1)$, so the time complexity is the number of iterations. While the linked list uses a `while` loop and the array uses a `for` loop, both are iterating through the number of elements in the data structure, which is $n$. Therefore the time complexity is $O(n)$. And similarly to the array algorithm, only `min_value` and `el` are allocated to memory, so the space complexity is $O(1)$.

## Question 7

The [weighted mean](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean) of an array is the mean of an array where each element is weighted by some non-negative number.

Weighted means are often used to calculate a final grade in a course. For example, your Data Structures and Algorithms final grade could be composed of 20% homeworks, 30% midterm project, and 50% final project. In that case, your final grade is calculated as $0.2 \times grade_{hw} + 0.3 \times grade_{midterm} + 0.5 \times grade_{final}$.
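
In general, for values $x_1, \dots, x_n$ with non-negative weights $w_1, \dots, w_n$ (not all zero), the weighted mean can be written as:

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

When the weights sum to 1, as in the grading example, the denominator is 1 and the formula reduces to the weighted sum. For instance, with hypothetical grades of 80, 90, and 70 for homeworks, midterm project, and final project:

$$0.2 \times 80 + 0.3 \times 90 + 0.5 \times 70 = 16 + 27 + 35 = 78$$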

Write an algorithm to calculate the weighted mean of an array. Your function should raise an error if any element of the `weights` list is negative.

In [None]:
def weighted_mean(arr, weights):
  """Returns the weighted mean of the list arr weighted by the list weights."""
  # Raise an error if arr and weights don't have the same number of elements.
  if len(arr) != len(weights):
    raise ValueError('len(arr) = %d should be the same as len(weights) = %d' %
                     (len(arr), len(weights)))
  if sum(weights) != 1:
    raise ValueError('sum(weights) should be 1, but is %.2f' % sum(weights))

  # TODO(you): Implement
  print('This function has not been implemented.')

### Unit Tests

Run the following cell to check your answer against some unit tests.

In [None]:
weights1 = [0.1, 0.5, 0.05, 0.2, 0.15]
weights2 = [0.45, 0.3, 0.1, 0.05, 0.1]
arr1 = [68, 14, -5, -108, 25]
arr2 = [0, -4, 3, 1, 8]


print(round(weighted_mean(arr1, weights1), 2))
# Should print: -4.3

print(round(weighted_mean(arr1, weights2), 2))
# Should print: 31.4

print(round(weighted_mean(arr2, weights1), 2))
# Should print: -0.45

print(round(weighted_mean(arr2, weights2), 2))
# Should print: -0.05

### Solution

In [None]:
def weighted_mean(arr, weights):
  """Returns the weighted mean of the list arr weighted by the list weights."""
  # Raise an error if arr and weights don't have the same number of elements.
  if len(arr) != len(weights):
    raise ValueError('len(arr) = %d should be the same as len(weights) = %d' %
                     (len(arr), len(weights)))
  if sum(weights) != 1:
    raise ValueError('sum(weights) should be 1, but is %.2f' % sum(weights))

  weighted_sum = 0
  for i in range(len(arr)):
    if weights[i] < 0:
      raise ValueError('All elements of the weights list must be non-negative.')
    weighted_sum += arr[i] * weights[i]

  # The weights sum to 1, so the weighted sum is already the weighted mean.
  return float(weighted_sum)

## Question 8

How does the time complexity of `weighted_mean` compare to that of `mean`? Both are shown here.

In [None]:
def weighted_mean(arr, weights):
  """Returns the weighted mean of the list arr weighted by the list weights."""
  # Raise an error if arr and weights don't have the same number of elements.
  if len(arr) != len(weights):
    raise ValueError('len(arr) = %d should be the same as len(weights) = %d' %
                     (len(arr), len(weights)))
  if sum(weights) != 1:
    raise ValueError('sum(weights) should be 1, but is %.2f' % sum(weights))

  weighted_sum = 0
  for i in range(len(arr)):
    if weights[i] < 0:
      raise ValueError('All elements of the weights list must be non-negative.')
    weighted_sum += arr[i] * weights[i]

  # The weights sum to 1, so the weighted sum is already the weighted mean.
  return float(weighted_sum)

In [None]:
def mean(arr):
  """Finds the mean of a list of integers."""
  sum = 0
  len = 0
  for i in arr:
    sum += i
    len += 1
  
  # Coerce sum to float here so that the division will be float, not int.
  return float(sum) / len

In [None]:
#freetext

### Solution

`weighted_mean` involves more computations than `mean`. First, we need to calculate `len(arr)` and `len(weights)` and compare them. Second, we need to check that `sum(weights)` is 1. Third, we need to check whether any element of `weights` is negative. Finally, we need to multiply each element of `arr` by its corresponding element in `weights`.

Nevertheless, the big-O time complexity is $O(n)$. The calls to `len` are $O(1)$ (see [this discussion](https://stackoverflow.com/questions/1115313/cost-of-len-function) of the complexity of `len`), and `sum(weights)` is a single $O(n)$ pass over the weights. Within the `for` loop, each computation is individually $O(1)$ and the loop runs `len(arr)` times, so the loop is also $O(n)$. A constant number of $O(n)$ passes is still $O(n)$, so `weighted_mean` has the same big-O time complexity as `mean`.
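
As an illustration of how the separate checks could be folded into one loop, here is a minimal sketch under stated assumptions: the name `weighted_mean_single_pass` is hypothetical, the weight total is compared to 1 with `math.isclose` because floating-point weights may not sum to exactly 1, and the function returns the standard weighted mean (the weighted sum divided by the total weight). It is still $O(n)$; only the constant factor changes.

```python
import math


def weighted_mean_single_pass(arr, weights):
  """Weighted mean computed in one loop (illustrative variant)."""
  if len(arr) != len(weights):
    raise ValueError('arr and weights must have the same number of elements.')

  weighted_sum = 0.0
  total_weight = 0.0
  for value, weight in zip(arr, weights):
    if weight < 0:
      raise ValueError('All elements of the weights list must be non-negative.')
    weighted_sum += value * weight
    total_weight += weight

  # Compare with a tolerance rather than testing for exact equality.
  if not math.isclose(total_weight, 1.0):
    raise ValueError('sum(weights) should be 1, but is %.2f' % total_weight)

  return weighted_sum / total_weight


# Hypothetical grades weighted 20% / 30% / 50%.
print(round(weighted_mean_single_pass([80, 90, 70], [0.2, 0.3, 0.5]), 2))
```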

## Question 9

Your friend is a school teacher, and is calculating the final grade for the students in their class. The final grade is a weighted mean of several assessments during the class.

The data is in a dictionary, where:

- The keys represent the weight of the assessments
- The values represent the grade the student received

Here is your friend's code. Your friend tells you that while the function works, it seems inefficient. The time and space complexity are both $O(n)$. Can you optimize this function to reduce the time and/or space complexity?

In [None]:
def weighted_average_score(map):
  """Calculate the weighted average score for a student.

  Args:
    map: The keys are the weights of the assessment. The values are the grades.
  
  Returns:
    The weighted average score for the student.
  """

  # TODO(you): Can you optimize the time and/or space complexity?

  # Use names that do not shadow the built-in sum and len functions, since
  # len is called below.
  total = 0
  count = 0
  weights = []
  grades = []

  for key in map.keys():
    weights.append(key)

  for value in map.values():
    grades.append(value)

  for i in range(len(grades)):
    total += grades[i] * weights[i]
    count += 1

  # Coerce total to float here so that the division will be float, not int.
  return float(total) / count

### Solution

The current implementation is inefficient in two ways:

- The weights and grades can be used within the `for` loop and do not need to be stored in arrays.
- The three `for` loops can be made into one.

Instead of iterating through `map.keys()` and `map.values()` separately, it is more efficient to iterate through `map.items()`, which contains keys *and* values. This `for` loop is still $O(n)$. This allows us to not store each value of `weights` and `grades` in an array, but use the values directly. Not creating the arrays reduces the space complexity from $O(n)$ to $O(1)$. This also allows us to use a single `for` loop, instead of three.

In [None]:
def weighted_average_score(map):
  """Calculate the weighted average score for a student.

  Args:
    map: The keys are the weights of the assessment. The values are the grades.
  
  Returns:
    The weighted average score for the student.
  """

  sum = 0
  len = 0

  for key, val in map.items():
    sum += key * val
    len += 1

  # Coerce sum to float here so that the division will be float, not int.
  return float(sum) / len
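
One caveat of keying the dictionary by weight is that dictionary keys are unique, so two assessments with the same weight cannot both be stored. The sketch below shows the issue and one possible alternative layout, a list of `(weight, grade)` pairs. The sample grades and the variable names are made up for illustration.

```python
# Two assessments share the weight 0.25; the second overwrites the first.
scores_by_weight = {0.25: 90, 0.25: 80, 0.5: 70}
print(len(scores_by_weight))  # only 2 entries survive

# A list of (weight, grade) pairs keeps every assessment.
score_pairs = [(0.25, 90), (0.25, 80), (0.5, 70)]

weighted_total = 0
for weight, grade in score_pairs:
  weighted_total += weight * grade

print(weighted_total)
```

Iterating over the list of pairs is still a single $O(n)$ loop with $O(1)$ extra space, so the complexity analysis above is unaffected.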

## Question 10

Calculating the length of a list in Python is constant time. Calling `len(my_list)` invokes the list's built-in `__len__` method, so it is equivalent to calling `my_list.__len__()`.

A Python list keeps track of its own length as elements are added and removed, so the length can be retrieved in constant time without counting the elements. Thus, `my_list.__len__()` has the same complexity as `my_list[0]`, which is $O(1)$. See [here](https://stackoverflow.com/questions/1115313/cost-of-len-function) for more information.

In [None]:
my_list = [1, 2, 3, 4]
print(len(my_list))
print(my_list.__len__())
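
The same idea can be sketched with a user-defined class. The toy `Playlist` container below (a hypothetical example, not part of the lesson) updates a stored size on every insertion, so its `__len__` method returns in constant time, and the built-in `len` function calls that method.

```python
class Playlist:
  """Toy container (hypothetical) that tracks its own size."""

  def __init__(self):
    self._songs = []
    self._size = 0  # updated on every insertion, so no counting is needed later

  def add(self, song):
    self._songs.append(song)
    self._size += 1

  def __len__(self):
    # Returns the stored size in constant time.
    return self._size


playlist = Playlist()
playlist.add('first song')
playlist.add('second song')

print(len(playlist))
print(playlist.__len__())
```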

Python also offers a `sum` function for a list of integers. Assume that this function is $O(n)$. Given these functions, can you simplify the standard `mean` function? Does this change the time complexity?

In [None]:
def mean(arr):
  """Finds the mean of a list of integers."""
  # TODO(you): Modify this function to use the built-in sum and len functions.

  sum = 0
  len = 0
  for i in arr:
    sum += i
    len += 1
  
  # Coerce sum to float here so that the division will be float, not int.
  return float(sum) / len

### Solution

Instead of calculating the variables `sum` and `len` ourselves, we can use the built-in `sum` and `len` functions.

In [None]:
def mean(arr):
  """Finds the mean of a list of integers."""
  # Coerce sum to float here so that the division will be float, not int.
  return float(sum(arr)) / len(arr)

Since `sum` is $O(n)$ and `len` is $O(1)$, this function is $O(n)$. Therefore the big-O time complexity is the same as the original function. However, this function has been reduced from several lines to one. Moreover, although the big-O complexity is unchanged, our code does less work per element (we no longer add 1 to a counter at every iteration), and it no longer declares the `sum` and `len` variables itself. The space complexity remains $O(1)$.
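
One edge case is worth noting: for an empty list, `len(arr)` is 0 and the division raises `ZeroDivisionError`. The sketch below assumes a variant named `mean_or_none` (hypothetical) that returns `None` for empty input, and compares its result with `statistics.mean` from the standard library.

```python
import statistics


def mean_or_none(arr):
  """Returns the mean of a list of numbers, or None for an empty list."""
  if len(arr) == 0:
    return None  # avoids dividing by len(arr) == 0
  return float(sum(arr)) / len(arr)


values = [2, 1, 4, 5, 2, 3, 7, 6]
print(mean_or_none(values))
print(statistics.mean(values))
print(mean_or_none([]))
```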