# Running Time Functions and Complexity


Do this first:
+ Download this notebook and place it in your Lectures folder.

Review:
+ Practice implementing sequential designs.

Key ideas for running time analysis:

+ A running time function of a program is a function of the program's input size.

+ Running time complexity is a set of running time functions.

+ Running time complexity is not exact.  It describes magnitudes.

+ Running time complexity describes the magnitudes of growth.

+ An "additive" difference makes no difference in complexity.

+ A "multiplicative" difference makes no difference in complexity.



### Warm-up

**Task: Design a Python function that computes the average horsepower of cars.**

First, get the data. Second, articulate the sequential steps you need to do to accomplish the task.

Reference: https://github.com/mwaskom/seaborn-data/blob/master/mpg.csv

Goal:
+ See if you can design a relatively simple sequential program.
+ See if you can clean the data. Preliminary processing.

In [16]:
import requests
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv'
result = requests.get(url)

In [None]:
result.text


* First, get the data. 
* Second, clean it up.
    + Split the text into a list of lines.
        + Split each line into a list of things.
* Give the cleaned data to the function.
    + Go through each item, skipping the header.
    + Add up the horsepower values and return the average.

In [13]:
d = '28.0,4,120.0,79.0,2625,18.6,82,usa,ford ranger\n31.0,4,119.0,82.0,2720,19.4,82,usa,chevy s-10\n'
line1 = d.strip().split('\n')[0]
line1.split(',')

['28.0', '4', '120.0', '79.0', '2625', '18.6', '82', 'usa', 'ford ranger']

GOAL: given a set of instructions written in English, you have to be able to translate to Python, i.e. to implement a design.

In [34]:
def clean_data(text):
    # strip and split text into a list of lines
    lines = text.strip().split('\n')
    output = []
    
    # go through each line, 
    #    split it into a list of things
    #    append that list to output
    for line in lines:
        things = line.split(',')
        output.append(things)
    
    return output

def compute_average_hp( data ):
    # go through each item in the data, skipping the first item (header)
    #   pick out the horse power from the current item, 
    #   and add the horse power to a current sum.
    # return the average
    cur_sum = 0
    count = 0
    for i in range(1, len(data)):
        if data[i][3] != '':
            hp = float(data[i][3])
            cur_sum += hp
            count += 1
    return cur_sum / count


In [35]:
import requests
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv'
result = requests.get(url)
data = clean_data(result.text)
average_hp = compute_average_hp(data)

In [37]:
average_hp

104.46938775510205

---
**Task: Compute the average horsepower of cars for each origin.**

First, get the data. Second, articulate the sequential steps you need to do to accomplish the task.

Steps:
+ Go through each line in the data.
    + Get the hp and origin (usa, europe, japan).
    + Accumulate the sum of hp for that origin.
+ Calculate the average hp for each origin.
+ Return the average hp's.

Design decision: how to keep track of total hp for each origin.
+ Solution A: a dedicated variable (e.g. usa_hp) for each origin.  Problem: if there's a new origin, you need to create a new variable. Also, you have to analyze the data beforehand to know which origins exist.
+ Solution B: have a list of sum hp's. Problem: lists are indexed by numbers.
+ Solution C: have a dictionary that stores total hp for each origin.  We can index a dictionary using strings (e.g. 'usa', 'japan', etc.)
    * Start with an empty dict.
    * When an origin arrives, we check if it's already in the dict.


GOAL: implement the design.  We want to be able to translate English instructions to Python.
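One possible implementation of Solution C is sketched below. It assumes `data` is the list of lists produced by `clean_data`, with a header row first, horsepower in column 3, and origin in column 7. The function name `compute_average_hp_by_origin` and the last two sample rows are made up for illustration.

```python
def compute_average_hp_by_origin(data):
    # Assumed layout: data[0] is the header row,
    # column 3 holds horsepower and column 7 holds origin.
    totals = {}   # origin -> sum of horsepower seen so far
    counts = {}   # origin -> number of cars that had a horsepower value
    for row in data[1:]:
        hp_text, origin = row[3], row[7]
        if hp_text == '':
            continue  # skip cars with missing horsepower
        if origin not in totals:
            # First time this origin arrives: start its sum and count.
            totals[origin] = 0.0
            counts[origin] = 0
        totals[origin] += float(hp_text)
        counts[origin] += 1
    return {origin: totals[origin] / counts[origin] for origin in totals}


# Small sample in the same shape as the cleaned data (last two rows are hypothetical).
sample = [
    ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
     'acceleration', 'model_year', 'origin', 'name'],
    ['28.0', '4', '120.0', '79.0', '2625', '18.6', '82', 'usa', 'ford ranger'],
    ['31.0', '4', '119.0', '82.0', '2720', '19.4', '82', 'usa', 'chevy s-10'],
    ['25.0', '4', '110.0', '', '2300', '15.0', '80', 'europe', 'car a'],
    ['32.0', '4', '91.0', '67.0', '1965', '15.7', '82', 'japan', 'car b'],
]
averages = compute_average_hp_by_origin(sample)
print(averages)
```

An origin whose cars all lack a horsepower value (here 'europe') never enters the dictionary, which avoids dividing by zero.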

### Analyze running times of sequential programs.

In [38]:
def clean_data(text):
    lines = text.strip().split('\n')
    output = []
    for line in lines:
        things = line.split(',')
        output.append(things)
    
    return output

What is the running time of this program?

Possible answers:
* n

What is n? The number of items (lines) in the input.

What's the unit of measurement?
+ iterations? problem: an iteration can take a lot of time.
+ seconds? minutes? problem: this depends on the hardware.
+ **steps**. This is abstract.  How fast is a step?
    + This depends on the hardware.
    + BUT... it remains **more or less** constant once we choose a hardware frame of reference.
    
We have n iterations.

Each iteration can have many steps.  But every iteration takes the same number of steps.
+ In each iteration, we split a line into 9 columns and append them, as a list, to the output.


So, is "n steps" the correct answer? Yes.

Although it is more precise to say 10*n steps, it is also correct to say n steps (if we ignore/absorb the constant 10).

A constant is always implicit when we say "n steps".


Summary:
+ There's an implicit constant in front of the variable n.

+ It's more correct to say that, e.g., this program takes c*n steps.

+ Saying that this program takes exactly 10*n steps is likely inaccurate, because it's nearly impossible to count the exact number of steps.

But because we know the number of steps per iteration is constant, it's safer to say that the program takes c*n steps.
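To make the implicit constant concrete, here is a minimal sketch that counts steps for the loop in `clean_data` under an assumed cost model. The function `count_clean_data_steps` and the cost `steps_per_line=2` (one split plus one append) are illustrative assumptions, not measurements.

```python
def count_clean_data_steps(text, steps_per_line=2):
    # steps_per_line is an assumed cost: one split plus one append per line.
    lines = text.strip().split('\n')
    steps = 0
    for line in lines:
        _ = line.split(',')       # the work done in each iteration
        steps += steps_per_line   # charge the same cost every iteration
    return len(lines), steps


for n in (10, 100, 1000):
    text = '\n'.join(['a,b,c'] * n)   # n identical made-up lines
    n_lines, steps = count_clean_data_steps(text)
    print(n_lines, steps, steps / n_lines)
```

Whatever value is chosen for `steps_per_line`, the ratio steps / n stays the same as n grows. That fixed ratio is the constant c in c*n.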



### Running time analysis

+ Key concepts: running time function, running time complexity
+ Big-O, upper bound

The running time function of a program specifies the number of steps the program takes.

Definition: T(n) is the running time function of prob1, where n is the number of items on the input list, L.

n is what we call "input size".

The running time function is a function of input size.

T(10) is the running time of this program when the input has 10 items.

T(21) is the running time of this program when the input has 21 items.

T(n) is the running time of this program when the input has n items.

How do we determine T(n)?  Answer: we sum up all the steps in the program.

In [2]:
def prob1(L):     # T(n)
    s = 0         # 1 step (a1 steps) 
    for x in L:   # n iterations (L has n items)
        s += x    #   each iteration takes: b steps
    s = s * 5     # 1 step (a2 steps)
    return s      # a3 steps

$T(n) = a + b*n$

a = a1+a2+a3


If n = 10, T(10) = a + 10b.

This running time function depends linearly on n.
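The sum-of-steps idea can be sketched by instrumenting prob1 with a step counter. The helper `prob1_steps` and the default costs (a1 = a2 = a3 = b = 1) are assumptions for illustration; the real costs depend on the hardware.

```python
def prob1_steps(L, a1=1, b=1, a2=1, a3=1):
    # The step costs a1, b, a2, a3 are assumed values, not measurements.
    steps = a1            # s = 0
    for _ in L:
        steps += b        # s += x, once per item
    steps += a2           # s = s * 5
    steps += a3           # return s
    return steps


for n in (0, 1, 10, 100):
    print(n, prob1_steps(list(range(n))))
```

With these costs, a = 3 and b = 1, so the counter reports 3 + n. Adding one more item to the list always adds exactly b steps, which is what linear dependence on n means.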

What is the running time complexity of prob1? Is it O(n)?

What's the difference between T(n) and the complexity?

The complexity is a shorter description of T(n).

The complexity includes T(n).

What is O(n)?
+ Order of n
+ As the input grows, the running time grows at most linearly.

Big-O or O is a measurement of complexity.

Is it correct to say the complexity of prob1 is O(n)?  Yes.

Is it correct to say the complexity of prob1 is O(n^2)? Yes.


### O (or Big-O)

O is a measurement of complexity.

O is a way to specify upper bounds of (resource) functions.

#### A running time complexity of O(n) means that the program takes **at most** a constant times n steps.

O(n) means: $T(n) \le c * n$ for some constant c and for all n larger than some number.

Upper bound estimation is very useful for complex programs. 

$T(n) = a + b*n \le a*n + b*n$ for $n>1$

Here we upper bound T(n) by replacing the term $a$ with $a*n$, which is valid because $n>1$.

$T(n) \le (a+b)*n$

In other words, $T(n) \le c * n$.

Therefore, $T(n)$ is in O(n).

$O(n)$ is a set of functions.

$10n + 5 \in O(n)$

$20n + 1 \in O(n)$

$5 \in O(n)$
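A quick numerical sanity check of such membership claims can be sketched as follows. The helper name `holds_up_to`, the chosen constants c, and the cutoff `n_max` are assumptions for illustration. A finite check like this supports a claim but does not prove it.

```python
def holds_up_to(T, f, c, n0, n_max=10_000):
    # Checks T(n) <= c * f(n) for every integer n with n0 <= n <= n_max.
    # Passing is evidence only: a proof must cover all n >= n0.
    return all(T(n) <= c * f(n) for n in range(n0, n_max + 1))


print(holds_up_to(lambda n: 10 * n + 5, lambda n: n, c=15, n0=1))
print(holds_up_to(lambda n: 20 * n + 1, lambda n: n, c=21, n0=1))
print(holds_up_to(lambda n: 5, lambda n: n, c=5, n0=1))
# A constant that is too small fails: 10n + 5 is never <= 10n.
print(holds_up_to(lambda n: 10 * n + 5, lambda n: n, c=10, n0=1))
```

The first three checks succeed and the last one fails, which shows that membership in O(n) depends on finding some workable c, not on every c working.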

Revisiting the complexity of prob1.

We've established that $T(n) \le c * n \le c * n^2$ for all $n>1$.

Therefore, by definition, $T(n) \in O(n^2)$.

### Intuitive understanding

$T(n) \in O(n)$ means this running time function is bounded above by a constant times n (for large enough n).

$T(n) \in O(n^2)$ means this running time function is bounded above by a constant times $n^2$ (for large enough n).


### True or False?

+ $5n^3 + 10 \in O(n^2)$  FALSE

+ $5n^3 + 10 \in O(n^3)$  TRUE

+ $5n^3 + 10 \in O(n^4)$  TRUE

We can show this using the definition of O (this is what you'll need to do in upcoming assignments)

$5n^3 + 10 \le 5n^3 + 10n^3 = 15n^3$, for all $n>1$.

I have chosen the constant c to be 15.  With this choice, we've shown that $5n^3 + 10 \in O(n^3)$.


$T(n) \in O(n^3)$ means $T(n) \le c * n^3$ for some constant c and when n is greater than some value.

If you want to prove/show that $T(n) \in O(n^3)$, you have to identify a specific constant c.
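To see why the first statement is false while the second is true, one can compare the ratios $T(n)/n^2$ and $T(n)/n^3$ as n grows. This is an illustrative sketch (the sample values of n are arbitrary), not a proof.

```python
def T(n):
    return 5 * n**3 + 10


for n in (10, 100, 1000, 10000):
    # If T(n) <= c * f(n) held for large n, then T(n) / f(n) would stay below c.
    print(n, T(n) / n**2, T(n) / n**3)
```

The ratio against $n^2$ keeps growing, so no fixed constant c can satisfy $T(n) \le c * n^2$ for all large n. The ratio against $n^3$ stays below 15, consistent with the choice c = 15 above.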

#### An example

Show that $n^4 + 5 \in O(n^7)$

Answer:

$n^4 + 5 \le n^7 + 5n^7 = 6n^7$, for all $n>1$

With the choice of c=6, $n^4 + 5 \le c*n^7$.  In other words, $n^4 + 5 \in O(n^7)$

Why are we doing this?  This is because the definition of O states that: $T(n) \in O(n^7)$ if $T(n) \le c * n^7$ for some constant c and for all n greater than a number.


$n^7$ is not a tight upper bound, but it's still an upper bound.

### Quick summary

We should understand:
+ What running time functions are.
+ What Big-O complexity means.
+ How to upper bound a running time function to show its upper bound complexity. 

Definition of O:

$T(n) \in O( f(n) )$ if $T(n) \le c * f(n)$ for some specific value of c and for n larger than a number.

This means f(n) is an upper bound of T(n), up to a constant factor.

Let's estimate upper bounds for some functions.

In [13]:
#
# L is a list of numbers.
# We don't know in advance how many numbers, so we'll just say "n"
#
def prob2(L):                         # T(n)
    s = 0                             # a1 steps
    for i in range(len(L)):           # n iterations
        for j in range(0, len(L)):    #    n iterations
            s += L[i]*L[j]            #       b steps
    return s                          # a2 steps

In [19]:
prob2([1,80,3,5,3,30])

14884

What is the upper-bound complexity of prob2?

What do we need to answer this question? T(n)

$T(n) = a + b*n^2 \in O(n^2)$

a=a1+a2


In [24]:
for i in range(3):           # always 3 iterations
    for c in list('hello'):  # always 5 iterations
        print(i,c)
# there are 3 * 5 = 15 iterations in total

0 h
0 e
0 l
0 l
0 o
1 h
1 e
1 l
1 l
1 o
2 h
2 e
2 l
2 l
2 o


In [25]:
#
# L is a list of numbers.
# We don't know in advance how many numbers, so we'll just say "n"
#
def prob3(L):                         # T(n)
    s = 0                             # a1 steps
    for i in range(len(L)):           # n iterations
        for j in range(i+3, len(L)):  #    at most n iterations
            s += L[i]*L[j]            #       b steps
    return s                          # a2 steps

$T(n) = a + bn^2$ is no longer correct because line 8 (the inner loop) does not always take n iterations.

The number of iterations on line 8 depends on i.

$T(n) = a + ???$

We don't need to know it exactly.

But we can upper bound it.

n * (at most n) * b = at most b*n^2

$T(n) \le a + b*n^2$

$T(n) \le an^2 + bn^2 = (a+b)n^2$ for all $n>1$

By definition, $T(n) \in O(n^2)$

The upper-bound complexity of both prob2 and prob3 is $O(n^2)$, even though prob3 is a little faster (takes fewer steps) than prob2.
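The comparison can be checked by counting how many times the innermost statement runs in each program. The helper names `inner_iterations_prob2` and `inner_iterations_prob3` are made up for this sketch; they copy only the loop structure of prob2 and prob3.

```python
def inner_iterations_prob2(n):
    count = 0
    for i in range(n):
        for j in range(0, n):      # same loop bounds as prob2
            count += 1
    return count


def inner_iterations_prob3(n):
    count = 0
    for i in range(n):
        for j in range(i + 3, n):  # same loop bounds as prob3
            count += 1
    return count


for n in (3, 6, 10, 100):
    print(n, inner_iterations_prob2(n), inner_iterations_prob3(n), n * n)
```

prob2 runs its inner statement exactly $n^2$ times, while prob3 runs it fewer times, yet never more than $n^2$. That is all the upper bound $O(n^2)$ claims.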

Using upper bounds appropriately can be very effective in making design decisions.

### Summary:

+ Understand what running time functions (T(n)) are.
+ Understand what Big-O is
+ Understand how to determine upper bound complexity for running time functions. To do this, you have to know how to do upper-bound estimates.
+ Understand how to figure out T(n) for a simple iterative program.