# Analyzing Complexity

## Lesson Overview

As you become familiar with more algorithms, it becomes increasingly important to analyze and calculate complexities. In any real-world application, it is fundamental to understand the time and space complexities of the algorithms employed in a program.

An algorithm's complexity is usually quoted for the **average case**. In the previous lesson, we introduced the concept of **best and worst case complexity**: the best case complexity is the lowest possible complexity of the algorithm, and the worst case complexity is the highest possible complexity of the algorithm.

### Average case complexity

> The **average case time complexity** of a program is the mean time complexity of the program averaged over all possible inputs.

Formally calculating the average case time complexity of an algorithm is often cumbersome, as it involves calculating the time complexity for *every* possible input, then taking the mean.

### Analyzing the time complexity of an algorithm

The following function returns the first index of an array of integers at which the cumulative sum exceeds a threshold. This function has many possible applications, such as finding the first day on which a company's cumulative gross revenue exceeds a target, or finding whether (and if so, when) a country's total carbon emissions exceed a quota.

In [None]:
def threshold_exceeded(arr, threshold):
  """Returns the index at which the cumulative sum of array exceeds threshold.
  
  Returns -1 if the total sum of the array does not exceed the threshold."""
  cumulative_sum = 0

  for i in range(len(arr)):
    cumulative_sum += arr[i]

    if cumulative_sum > threshold:
      return i
  
  return -1

In [None]:
print(threshold_exceeded([2, 4, 5, 1, 3], 7))
print(threshold_exceeded([2, 4, 5, 1, 3], 70))

Suppose the input array has $n$ elements. All of the single-line operations in this function have an $O(1)$ time complexity, so the time complexity is determined by the number of iterations of the `for` loop before the function returns.

**Worst case**

The maximum number of iterations that this function can have is $n$, since the `for` loop iterates over `range(len(arr))`. This occurs either if the total sum of the array does not exceed `threshold`, or if the cumulative sum only exceeds it at the final index. The worst case number of iterations is $n$, so the worst case time complexity is $O(n)$.

**Best case**

The minimum number of iterations that this function can have is 1, in the case that `arr[0] > threshold`. The best case number of iterations is 1, so the best case time complexity is $O(1)$.
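To check the best and worst cases empirically, here is a minimal sketch that adds an iteration counter to the same logic. The helper name `count_iterations` and the example inputs are illustrative assumptions, not part of the lesson's code.

```python
def count_iterations(arr, threshold):
  """Returns (index, iterations) for the cumulative-sum threshold search.

  Hypothetical helper: same logic as threshold_exceeded, plus a counter."""
  cumulative_sum = 0
  iterations = 0

  for i in range(len(arr)):
    iterations += 1
    cumulative_sum += arr[i]

    if cumulative_sum > threshold:
      return i, iterations

  return -1, iterations

# Best case: the first element alone exceeds the threshold.
print(count_iterations([8, 4, 5, 1, 3], 7))
# Worst case: the threshold is exceeded only at the last index.
print(count_iterations([2, 4, 5, 1, 3], 14))
# Worst case: the threshold is never exceeded.
print(count_iterations([2, 4, 5, 1, 3], 70))
```

The first call should stop after a single iteration, while the other two should run through all five elements.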

### Average case complexity of an algorithm

To formally calculate the average case time complexity, we would need to find the time complexity for all possible combinations of `arr` and `threshold`, and calculate the mean complexity. This is often tricky, since the number of possible input values for `arr` and `threshold` is infinite! Instead, we can use a few simple assumptions to calculate the average case time complexity.

In [None]:
def threshold_exceeded(arr, threshold):
  """Returns the index at which the cumulative sum of array exceeds threshold.
  
  Returns -1 if the total sum of the array does not exceed the threshold."""
  cumulative_sum = 0

  for i in range(len(arr)):
    cumulative_sum += arr[i]

    if cumulative_sum > threshold:
      return i
  
  return -1

Let's consider two scenarios separately:

1. The total sum of `arr` does not exceed `threshold`.

   In this scenario, the function must iterate through the entire array to calculate the total sum of `arr`, find that it does not exceed `threshold`, and `return -1`. Therefore the number of iterations in this scenario is always $n$.

1. The total sum of `arr` exceeds `threshold`.

   In this scenario, the number of iterations can vary from 1 (if `arr[0] > threshold`, which is the best case) to $n$ (if the cumulative sum exceeds `threshold` after all $n$ iterations but not after $n-1$ iterations). To take an average, we can make the simplifying assumption that the loop is equally likely to stop at each of the $n$ indices. Under this assumption, the mean number of iterations lies halfway between the minimum 1 and the maximum $n$, so the mean number of iterations in this scenario is $\frac{n+1}{2}$.

Now, we need to calculate the mean across *both* of these scenarios. Assume that the first scenario occurs with some unknown but constant probability $p$, so the second scenario occurs with probability $1-p$. Using the [law of total expectation](https://en.wikipedia.org/wiki/Law_of_total_expectation), the average number of iterations is

$$pn + (1-p)\frac{n+1}{2} = n \left( p + \frac{1-p}{2} \right) + \frac{1-p}{2}$$

This looks complicated, but the multiplicative and additive constants can be ignored in big-O analysis. Therefore the average case time complexity is $O(n)$.
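As a sanity check on the algebra, the following sketch evaluates the average with exact fractions and confirms that it matches the rearranged form. The helper name `expected_iterations` and the value $p = \frac{1}{4}$ are arbitrary choices for illustration, not values from the lesson.

```python
from fractions import Fraction

def expected_iterations(n, p):
  """Mean iterations: p * n + (1 - p) * (n + 1) / 2.

  Hypothetical helper; p is the assumed probability that the total sum
  never exceeds the threshold."""
  return p * n + (1 - p) * Fraction(n + 1, 2)

p = Fraction(1, 4)  # arbitrary illustrative value
for n in [10, 100, 1000]:
  mean = expected_iterations(n, p)
  # Rearranged form: n * (p + (1 - p) / 2) + (1 - p) / 2
  rearranged = n * (p + (1 - p) / 2) + (1 - p) / 2
  assert mean == rearranged
  # The ratio mean / n approaches the constant p + (1 - p) / 2 as n grows.
  print(n, float(mean), float(mean / n))
```

Because the ratio of the mean to $n$ settles toward a constant, the mean grows linearly in $n$, which is what $O(n)$ expresses.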

In practice, you shouldn't need to calculate average case complexities too often. However it is important to understand *how* average case complexities are derived, and to know the average case complexities of common algorithms.

## Question 1

Which *one* of the following best defines the average case time complexity of an algorithm?

**a)** The average of the best and worst case time complexity

**b)** The time complexity when run on average inputs

**c)** The median time complexity over all possible inputs

**d)** The mean time complexity over all possible inputs

### Solution

The correct answer is **d)**.

**a)** Average case complexity averages more than just the best and worst case complexity.

**b)** The average case complexity is the average complexity, not the complexity on average inputs.

**c)** Close, but not quite. "Average" generally refers to the mean, not the median.

## Question 2

Your friend is taking a course in data structures and algorithms. As part of their homework assignment, they had to write a function that checks if a given value is in an array. Below is their function.

In [None]:
def find(arr, val):
  """Returns True if val is in the list arr."""
  result = False

  for i in arr:
    if i == val:
      result = True

  return result

When your friend received the homework assignment back, they only received partial credit for this function. The teacher said that while the function is correct, it is inefficient in terms of the best case time complexity.

What is the best case time complexity of your friend's implementation?

In [None]:
#freetext

### Solution

All of the computations within and outside the `for` loop are $O(1)$. In all cases (best, worst, average), this implementation inspects all the $n$ elements of `arr` and the `for` loop has $n$ iterations. Therefore, this implementation has a best (and worst and average) case complexity of $O(n)$.
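A quick way to see this is to count the iterations directly. The sketch below uses a hypothetical helper, `find_iterations`, which repeats the logic of the function above and also reports how many times the loop body ran.

```python
def find_iterations(arr, val):
  """Returns (result, iterations) for the implementation without early exit.

  Hypothetical helper: same logic as find above, plus a counter."""
  result = False
  iterations = 0

  for i in arr:
    iterations += 1
    if i == val:
      result = True

  return result, iterations

print(find_iterations([7, 3, 9, 4], 7))  # val is the first element
print(find_iterations([7, 3, 9, 4], 5))  # val is not in arr
```

Both calls should report 4 iterations, regardless of where (or whether) `val` appears.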

## Question 3

How can you modify this function to reduce its best case time complexity? What is the new best case time complexity?

In [None]:
def find(arr, val):
  """Returns True if val is in the list arr."""
  result = False

  for i in arr:
    if i == val:
      result = True

  return result

In [None]:
#freetext

### Solution

The algorithm can be made more efficient by returning `True` as soon as we find `val` in `arr`, instead of checking every element. This means that we do not check the rest of the array if we have already found `val`.

In [None]:
def find(arr, val):
  """Returns True if val is in the list arr."""
  for i in arr:
    if i == val:
      return True

  return False

Now, in the best case, `arr[0] == val`, so the function exits after one iteration. This implementation therefore has a best case time complexity of $O(1)$.
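To confirm the effect of the early `return`, here is a small sketch with an iteration counter added. The helper name `find_early_exit_iterations` is an assumption made for this example.

```python
def find_early_exit_iterations(arr, val):
  """Returns (result, iterations) for the early-exit implementation.

  Hypothetical helper: same logic as the improved find, plus a counter."""
  iterations = 0

  for i in arr:
    iterations += 1
    if i == val:
      return True, iterations

  return False, iterations

print(find_early_exit_iterations([7, 3, 9, 4], 7))  # best case: first element
print(find_early_exit_iterations([7, 3, 9, 4], 4))  # worst case: last element
print(find_early_exit_iterations([7, 3, 9, 4], 5))  # worst case: not present
```

Only the first call stops after one iteration; the other two still inspect all four elements.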

## Question 4

As per the previous questions, the big-O time complexity of the following function that finds if a value exists in an array is constant in the best case and linear in the worst case.

In [None]:
def find(arr, val):
  """Returns True if val is in the list arr."""
  for i in arr:
    if i == val:
      return True
  return False

What is the average case time complexity?

In [None]:
#freetext

### Hint

`val` is in `arr` with some unknown probability $p$. Consider the two cases separately. Calculate the mean time complexity if `val` is in `arr`, then calculate the mean time complexity if `val` is not in `arr`. Then, use $p$ to calculate the overall mean. Note that $p$ should not appear in the final answer, as it is a constant and can be ignored for big-O.

### Solution

The function finishes executing as soon as a match is found, if at all. The time complexity is determined by the number of iterations of the `for` loop. If a match is found at the first element, this is the best case: there is 1 iteration, which is $O(1)$. If a match is not found, or is found at the last element, this is the worst case: there are $n$ iterations, which is $O(n)$.

To calculate the average case time complexity, we should calculate the average number of iterations over all possible cases. Let's break this down into two categories:

1. `val` is in `arr`, with some probability $p$
2. `val` is not in `arr`, with some probability $1-p$

If `val` is in `arr`, assume it is equally likely to be in any position, from 1 through $n$. If it is in position $i$, then the number of iterations is $i$. Therefore the average number of iterations, given that `val` is in `arr`, is

\begin{align*}
\frac{1}{n} \sum_{i=1}^n i &= \frac{1}{n} \frac{n(n+1)}{2} \\
&= \frac{n+1}{2},
\end{align*}

where the first equality comes from the formula for an [arithmetic series](https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF). Therefore if `val` is in `arr`, the average number of iterations is $\frac{n+1}{2}$.

If `val` is not in `arr`, the algorithm still needs to check each of the $n$ elements in the array, therefore the number of iterations is always $n$.

Using the [law of total expectation](https://en.wikipedia.org/wiki/Law_of_total_expectation), the average number of iterations averaged over both categories is

\begin{align*}
p \frac{n+1}{2} + (1-p) n &= n\left(1-\frac{p}{2} \right) + \frac{p}{2}.
\end{align*}

Since $p$ is a constant, these coefficients that are expressions of $p$ can be ignored for the purposes of big-O. Therefore, the average case time complexity of this algorithm is $O(n)$.
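The derivation can also be checked by brute force for a small array. The sketch below counts iterations for every possible position of `val`, averages them exactly, and then combines the two categories. The helper names and the values `n = 10` and `p = 3/4` are assumptions chosen only for illustration.

```python
from fractions import Fraction

def iterations_to_find(arr, val):
  """Returns the number of loop iterations the early-exit find performs."""
  iterations = 0
  for item in arr:
    iterations += 1
    if item == val:
      break
  return iterations

def mean_iterations_if_present(n):
  """Exact mean iterations when val is equally likely to be at any position."""
  arr = list(range(n))  # n distinct values, so each val has exactly one position
  total = sum(iterations_to_find(arr, val) for val in arr)
  return Fraction(total, n)

n = 10
p = Fraction(3, 4)  # arbitrary illustrative probability that val is in arr
present = mean_iterations_if_present(n)          # matches (n + 1) / 2
absent = iterations_to_find(list(range(n)), -1)  # -1 is not in arr: n iterations
overall = p * present + (1 - p) * absent

# The combined mean agrees with the closed form n * (1 - p / 2) + p / 2.
assert overall == n * (1 - p / 2) + p / 2
print(present, absent, overall)
```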

## Question 5

A colleague on your team at work has developed a function to check if an array of integers is sorted. The function is working fine, but your manager wants to make sure that the function is not too computationally expensive. What are the best and worst case **time and space** complexities of `is_sorted`, below?

In [None]:
def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False
  
  return True

In [None]:
#freetext

### Solution

- Time complexity

  - In the **best case** (which minimizes time complexity), `arr[0] > arr[1]`, since in this case, the function exits after only one iteration. Therefore the best case time complexity is $O(1)$.

  - In the **worst case** (which maximizes time complexity), the `for` loop completes all $n-1$ iterations. Therefore the worst case time complexity is $O(n)$. (It turns out that the **average case** time complexity is also $O(n)$.)

- Space complexity

  In all cases, the only variable created is `i`, so the space complexity is always $O(1)$.
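The time complexity claims above can be illustrated by counting comparisons. This is a minimal sketch; the helper name `is_sorted_comparisons` is hypothetical and simply adds a counter to the same loop.

```python
def is_sorted_comparisons(arr):
  """Returns (result, comparisons) for the sortedness check.

  Hypothetical helper: same logic as is_sorted, plus a counter."""
  comparisons = 0

  for i in range(len(arr) - 1):
    comparisons += 1
    if arr[i] > arr[i+1]:
      return False, comparisons

  return True, comparisons

print(is_sorted_comparisons([5, 1, 2, 3, 4]))  # best case: first pair is out of order
print(is_sorted_comparisons([1, 2, 3, 4, 5]))  # worst case: array is sorted
print(is_sorted_comparisons([1, 2, 3, 5, 4]))  # worst case: only the last pair is out of order
```

For an array of 5 elements, the worst case performs $n-1 = 4$ comparisons, while the best case performs only one.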

## Question 6

One of the most rudimentary algorithms for sorting an array of integers is "bogosort" (also known as "permutation sort" and "stupid sort"). The way this algorithm works is to repeatedly shuffle the elements of the array and check if the result is sorted, stopping if the sorted array is generated.

In [None]:
def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False
  
  return True

To see all the permutations of `arr` as `bogosort` looks for a sorted one, uncomment the `print` command.

In [None]:
import random # needed to generate random shuffles of the array

def bogosort(arr):
  """Sorts an array of integers from lowest to highest."""
  while not is_sorted(arr):
    # print(arr)
    random.shuffle(arr)

  return arr

In [None]:
bogosort([1, 4, 2, 3])

What is the worst case time and space complexity of bogosort?

In [None]:
#freetext

### Solution

- Time complexity

  Because this implementation of bogosort is randomized, it is possible that the algorithm *never* produces a sorted array. Consider a run in which `random.shuffle` happens to produce an unsorted array at every iteration. No finite number of iterations is guaranteed to be enough, so the running time is unbounded and the worst case time complexity is infinite.

- Space complexity

  All shuffling is done in-place (this means that the array is not copied before shuffling, so all elements are moved around in the input array) and no new variables are created. Therefore the worst (and best and average) case space complexity is $O(1)$.
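To get a feel for how unpredictable the running time is, the sketch below counts the shuffles needed across a few runs. The helper name `bogosort_count`, the use of a seeded `random.Random` instance, and the choice of seeds are assumptions made for this example; the counts will vary from seed to seed.

```python
import random

def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False

  return True

def bogosort_count(arr, rng):
  """Sorts arr in place and returns the number of shuffles performed.

  Hypothetical helper: rng is a random.Random instance, used so that
  each run is reproducible."""
  shuffles = 0
  while not is_sorted(arr):
    rng.shuffle(arr)  # in-place shuffle, as in bogosort above
    shuffles += 1

  return shuffles

for seed in range(5):
  data = [1, 4, 2, 3]
  shuffles = bogosort_count(data, random.Random(seed))
  print(seed, shuffles, data)
```

Nothing bounds the number of shuffles in any single run, which is why no finite worst case can be stated.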

## Question 7

[Advanced] In the previous question, we saw that the worst case time complexity of bogosort (code below) is infinite. What is the average case time complexity?

You don't need to *prove* what the average case time complexity is. Consider what the probability of randomly arriving at the sorted array is, in this implementation. And remember that `is_sorted` is $O(n)$.

In [None]:
def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False
  
  return True

In [None]:
import random # needed to generate random shuffles of the array

def bogosort(arr):
  """Sorts an array of integers from lowest to highest."""
  while not is_sorted(arr):
    # print(arr)
    random.shuffle(arr)

  return arr

In [None]:
#freetext

### Hint

What is the total number of possible permutations of `arr` (assuming all elements of `arr` are distinct)?

If this number is $M$, then the probability of the sorted array being hit is $\frac{1}{M}$ since each permutation is equally likely at each shuffle. In this case, the expected number of iterations before a success is just $M$.

### Solution

If all of the $n$ elements of `arr` are distinct, then there are $n!$ possible permutations of the elements. Therefore the probability of any given permutation being chosen at each shuffle is $\frac{1}{n!}$ and the expected number of iterations before the sorted array is chosen is $n!$.

(If you are familiar with statistics, the number of iterations before a success follows a [geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution).)
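For readers who want the intermediate step, the expected value can be sketched as follows. Here $q = \frac{1}{n!}$ is the probability that a single shuffle produces the sorted array and $K$ is the number of shuffles until the first success; both symbols are introduced only for this illustration, and the input is assumed to be unsorted to begin with.

$$E[K] = \sum_{k=1}^{\infty} k \, (1-q)^{k-1} q = \frac{1}{q} = n!$$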

Therefore, the expected number of iterations of the `while` loop is $n!$. But remember that `is_sorted`, which is called at every iteration, is $O(n)$ (`random.shuffle` is also $O(n)$, which does not change the result). So, on average, this function performs $n!$ iterations of a loop whose body is $O(n)$. Therefore, the average case time complexity is $O(n \cdot n!)$.

Bogosort is one of the least efficient sorting algorithms. By comparison, there are several sorting algorithms whose average case time complexity is $O(n^2)$ or $O(n\log(n))$.
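To put these growth rates side by side, here is a small sketch that tabulates $n \log_2 n$, $n^2$ and $n \cdot n!$ for a few array sizes. The helper name `growth_row` and the chosen sizes are assumptions for illustration; the numbers are operation counts up to constant factors, not measured running times.

```python
import math

def growth_row(n):
  """Returns (n * log2(n), n ** 2, n * n!) for an array of size n."""
  return n * math.log2(n), n ** 2, n * math.factorial(n)

print(f"{'n':>3} {'n log n':>9} {'n^2':>6} {'n * n!':>12}")
for n in [2, 4, 6, 8, 10]:
  n_log_n, n_squared, n_times_factorial = growth_row(n)
  print(f"{n:>3} {n_log_n:>9.1f} {n_squared:>6} {n_times_factorial:>12}")
```

Even for 10 elements, the $n \cdot n!$ column is in the tens of millions, while the other two columns stay at or below 100.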

## Question 8

[Advanced] The main problem with the bogosort implementation in the previous questions is that the worst case time complexity is infinite. That means the program could literally run forever!

A different implementation of bogosort addresses this, by creating all possible permutations of the input array and *then* checking one by one which is sorted.

In [None]:
def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False
  
  return True

To see all the permutations of `arr`, pass `print_all=True`.

In [None]:
import itertools # needed to generate all possible permutations of an array

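def bogosort2(arr, print_all=False):
  """Sorts an array of integers from lowest to highest."""
  # Generate all possible permutations of arr and store them in a list.
  # itertools.permutations returns a one-shot iterator, so without list()
  # printing the permutations would exhaust it before the search loop runs.
  all_permutations = list(itertools.permutations(arr))
  if print_all:
    print(all_permutations)
  for p in all_permutations:
    if is_sorted(p):
      return list(p)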

In [None]:
bogosort2([1, 4, 2, 3], print_all=True)

What is the worst case time *and* space complexity of this implementation of bogosort?

Assume that the time and space complexity of `itertools.permutations(arr)` is equal to the number of permutations generated.

In [None]:
#freetext

### Solution

Most of the computational complexity comes from the `itertools.permutations(arr)` call, which generates all possible permutations of `arr`. An array of $n$ elements has $n!$ permutations (`itertools.permutations` treats elements as distinct by position, so this count holds even if some values repeat), so under the stated assumption the time and space complexity of this line is $O(n!)$.

- Time complexity

  Since `all_permutations` has $n!$ elements, the `for` loop has up to $n!$ iterations in the worst case. Each `p` in `all_permutations` is a tuple of length $n$, so `is_sorted(p)` is $O(n)$. Therefore the `for` loop has time complexity $O(n \cdot n!)$. Remembering that creating `all_permutations` is $O(n!)$, the total time complexity is

  $$O(n!) + O(n \cdot n!) = O(n \cdot n!).$$

- Space complexity

  There are only two variables created in `bogosort2`: `all_permutations` (which is $O(n!)$ as discussed) and `p` (which is an $n$-array so is $O(n)$). Therefore, the space complexity is

  $$O(n!) + O(n) = O(n!).$$
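The worst case for the `for` loop can be observed directly by counting how many permutations are inspected before a sorted one is found. In this sketch, the helper name `permutations_checked` is hypothetical. It relies on `itertools.permutations` emitting the input order first and the reversed input last, so an array sorted from highest to lowest forces the loop through all $n!$ permutations.

```python
import itertools
import math

def is_sorted(arr):
  """Returns True if arr is sorted lowest to highest."""
  for i in range(len(arr) - 1):
    if arr[i] > arr[i+1]:
      return False

  return True

def permutations_checked(arr):
  """Returns how many permutations are inspected before a sorted one is found.

  Hypothetical helper: it only counts, and does not store the permutations."""
  for count, p in enumerate(itertools.permutations(arr), start=1):
    if is_sorted(p):
      return count
  # A sorted permutation always exists, so the loop always returns.

print(permutations_checked([1, 2, 3, 4]))  # already sorted: found immediately
print(permutations_checked([4, 3, 2, 1]))  # reverse sorted: found last
print(math.factorial(4))                   # total number of permutations
```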