# 2. Analysis
## 2.1. Objectives
* To understand why algorithm analysis is important.
* To be able to use “Big-O” to describe execution time.
* To understand the “Big-O” execution time of common operations on Python lists and dictionaries.
* To understand how the implementation of Python data impacts algorithm analysis.
* To understand how to benchmark simple Python programs.
## 2.2. What Is Algorithm Analysis?
It is very common for beginning computer science students to compare their programs with one another. You may also have noticed that it is common for computer programs to look very similar, especially the simple ones. An interesting question often arises. When two programs solve the same problem but look different, is one program better than the other?

In order to answer this question, we need to remember that there is an important difference between a program and the underlying algorithm that the program is representing. As we stated in Chapter 1, an algorithm is a generic, step-by-step list of instructions for solving a problem. It is a method for solving any instance of the problem such that given a particular input, the algorithm produces the desired result. A program, on the other hand, is an algorithm that has been encoded into some programming language. There may be many programs for the same algorithm, depending on the programmer and the programming language being used.

To explore this difference further, consider the function shown in ActiveCode 1. This function solves a familiar problem, computing the sum of the first n integers. The algorithm uses the idea of an accumulator variable that is initialized to 0. The solution then iterates through the n integers, adding each to the accumulator.

In [1]:
def sumOfN(n):
   theSum = 0
   for i in range(1,n+1):
       theSum = theSum + i

   return theSum

print(sumOfN(10))

55


Now look at the function in ActiveCode 2. At first glance it may look strange, but upon further inspection you can see that this function is essentially doing the same thing as the previous one. The reason this is not obvious is poor coding. We did not use good identifier names to assist with readability, and we used an extra assignment statement during the accumulation step that was not really necessary.

In [2]:
def foo(tom):
    fred = 0
    for bill in range(1,tom+1):
       barney = bill
       fred = fred + barney

    return fred

print(foo(10))


55


The question we raised earlier asked whether one function is better than another. The answer depends on your criteria. The function `sumOfN` is certainly better than the function `foo` if you are concerned with readability. In fact, you have probably seen many examples of this in your introductory programming course since one of the goals there is to help you write programs that are easy to read and easy to understand.

**Algorithm analysis** is concerned with comparing algorithms based upon the amount of **computing resources** that each algorithm uses. We want to be able to consider two algorithms and say that one is better than the other because it is more efficient in its use of those resources or perhaps because it simply uses fewer.

At this point, it is important to think more about what we really mean by **computing resources**. There are two different ways to look at this. One way is to consider the amount of **space or memory** an algorithm requires to solve the problem.

As an alternative to space requirements, we can analyze and compare algorithms based on the **amount of time** they require to execute. This measure is sometimes referred to as the “execution time” or “running time” of the algorithm. One way we can measure the execution time for the function `sumOfN` is to do a benchmark analysis. This means that we will track the actual time required for the program to compute its result. In Python, we can benchmark a function by noting the starting time and ending time with respect to the system we are using. In the `time` module there is a function called `time` that returns the current system clock time in seconds since a fixed starting point (the epoch). By calling this function twice, at the beginning and at the end, and then computing the difference, we get the elapsed number of seconds (a fraction in most cases) for execution, to within the resolution of the system clock.

In [3]:
import time

def sumOfN2(n):
   start = time.time()

   theSum = 0
   for i in range(1,n+1):
      theSum = theSum + i

   end = time.time()

   return theSum,end-start

The function returns a tuple consisting of the result and the amount of time (in seconds) required for the calculation. If we perform 5 invocations of the function, each computing the sum of the first 10,000 integers, we get the following:

In [9]:
for i in range(5):
    print("Sum is %d required %10.7f seconds" %sumOfN2(10000))

Sum is 50005000 required  0.0010009 seconds
Sum is 50005000 required  0.0009995 seconds
Sum is 50005000 required  0.0009995 seconds
Sum is 50005000 required  0.0010002 seconds
Sum is 50005000 required  0.0009999 seconds


We discover that the time is fairly consistent: on average it takes about 0.0010 seconds to execute that code. What if we run the function adding the first 100,000 integers?

Again, the time required for each run, although longer, is very consistent, averaging about 10 times as long. For `n` equal to 1,000,000 we get:

In [10]:
for i in range(5):
    print("Sum is %d required %10.7f seconds"%sumOfN2(1000000))

Sum is 500000500000 required  0.0730007 seconds
Sum is 500000500000 required  0.0620048 seconds
Sum is 500000500000 required  0.0610042 seconds
Sum is 500000500000 required  0.0620167 seconds
Sum is 500000500000 required  0.0749931 seconds


In this case, the average again turns out to be about 10 times the previous.

Now consider ActiveCode 3, which shows a different means of solving the summation problem. This function, `sumOfN3`, takes advantage of the closed-form equation $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ to compute the sum of the first `n` integers without iterating.

In [6]:
import time

def sumOfN3(n):
    start = time.time()
    result = (n*(n+1))//2  # integer division keeps the result an exact int
    end = time.time()
    return result, end-start

for i in [10000, 100000, 1000000, 10000000, 100000000]:
    print("Sum is %d required %10.7f seconds"%sumOfN3(i))

Sum is 50005000 required  0.0000000 seconds
Sum is 5000050000 required  0.0000000 seconds
Sum is 500000500000 required  0.0000000 seconds
Sum is 50000005000000 required  0.0000000 seconds
Sum is 5000000050000000 required  0.0000000 seconds


The benchmark above runs `sumOfN3` with five different values for `n` (10,000, 100,000, 1,000,000, 10,000,000, and 100,000,000). There are two important things to notice about this output.

First, the times recorded above are **shorter** than any of the previous examples. Second, they are very consistent no matter what the value of `n`. It appears that `sumOfN3` is **hardly impacted** by the number of integers being added.
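The zeros reported for `sumOfN3` most likely mean that each call finished in less time than `time.time()` can resolve on that machine, not that the call took no time at all. A minimal sketch of one way to get a usable number, assuming the standard-library `timeit` module is acceptable here, is to repeat each call many times and divide by the number of calls. The names `sum_iterative`, `sum_closed_form`, and `time_per_call`, as well as the repeat count, are illustrative choices and not part of the original text.

```python
import timeit

def sum_iterative(n):
    """Sum 1..n by looping, in the style of sumOfN."""
    total = 0
    for i in range(1, n + 1):
        total = total + i
    return total

def sum_closed_form(n):
    """Sum 1..n with the closed-form expression, in the style of sumOfN3."""
    return n * (n + 1) // 2

def time_per_call(func, n, number=20):
    """Average seconds per call over `number` calls (hypothetical helper)."""
    total_seconds = timeit.timeit(lambda: func(n), number=number)
    return total_seconds / number

for n in (10_000, 100_000):
    loop_time = time_per_call(sum_iterative, n)
    formula_time = time_per_call(sum_closed_form, n)
    print(f"n={n:>7}  loop: {loop_time:.7f} s  closed form: {formula_time:.9f} s")
```

The absolute numbers will differ from machine to machine, which is exactly the limitation of benchmarking discussed in this section.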

But what does this benchmark really tell us? Intuitively, we can see that the iterative solutions seem to be doing more work since some program steps are being repeated. This is likely the reason it is taking longer. Also, the time required for the iterative solution seems to increase as we increase the value of `n`.

We need a better way to characterize these algorithms with respect to execution time. The benchmark technique computes the actual time to execute. It does not really provide us with a useful measurement, because it is dependent on a **particular machine**, **program**, **time of day**, **compiler**, and **programming language**. Instead, we would like to have a characterization that is independent of the program or computer being used. This measure would then be useful for judging the algorithm alone and could be used to compare algorithms across implementations.

## 2.3. Big-O Notation
When trying to characterize an algorithm’s efficiency in terms of execution time, independent of any particular program or computer, it is important to quantify the number of operations or steps that the algorithm will require. If each of these steps is considered to be a basic unit of computation, then the execution time for an algorithm can be expressed as the number of steps required to solve the problem. Deciding on an appropriate basic unit of computation can be a complicated problem and will depend on how the algorithm is implemented.
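As one illustration of choosing a basic unit of computation, the sketch below counts assignment statements in the iterative summation from `sumOfN`. Treating assignments as the unit is an assumption made for this example, and the function name `count_assignments` is hypothetical. Under that choice the count is 1 (for the initialization) plus n (one per loop iteration), which can be written as T(n) = 1 + n, where n is the size of the problem.

```python
def count_assignments(n):
    """Run the sumOfN loop and count assignment statements along the way."""
    steps = 0

    the_sum = 0
    steps += 1              # counts: theSum = 0

    for i in range(1, n + 1):
        the_sum = the_sum + i
        steps += 1          # counts: theSum = theSum + i

    return the_sum, steps

for n in (10, 100, 1000):
    total, steps = count_assignments(n)
    print(f"n={n:>5}  sum={total:>7}  assignments={steps}")
```

Unlike a benchmark, this count does not change with the machine or the time of day; it depends only on n.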

Computer scientists prefer to take this analysis technique one step further. Let T(n) denote the number of steps the algorithm needs to solve a problem of size n. It turns out that the exact number of operations is not as important as determining the most dominant part of the T(n) function. In other words, as the problem gets larger, some portion of the T(n) function tends to overpower the rest. This dominant term is what, in the end, is used for comparison. The **order of magnitude** function describes the part of T(n) that increases the fastest as the value of n increases. Order of magnitude is often called **Big-O** notation (for “order”) and written as O(f(n)). It provides a useful approximation to the actual number of steps in the computation. The function f(n) provides a simple representation of the dominant part of the original T(n).
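To see a dominant term take over, suppose, purely for illustration, that some algorithm needs T(n) = 5n² + 27n + 1005 steps. This step count is made up for the example and does not describe any function in this section. The sketch below reports what fraction of T(n) comes from the 5n² term as n grows; the helper name `dominant_share` is hypothetical.

```python
def T(n):
    """A made-up step count used only for illustration."""
    return 5 * n**2 + 27 * n + 1005

def dominant_share(n):
    """Fraction of T(n) contributed by the 5n^2 term."""
    return (5 * n**2) / T(n)

for n in (1, 10, 100, 1000, 10000):
    print(f"n={n:>6}  T(n)={T(n):>10}  share of 5n^2 term={dominant_share(n):.4f}")
```

For small n the constant 1005 makes up most of T(n), but as n grows the n² term accounts for nearly all of it, so this T(n) would be described as O(n²).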

A number of very common order of magnitude functions will come up over and over as you study algorithms. These are shown in Table 1. In order to decide which of these functions is the dominant part of any T(n) function, we must see how they compare with one another as n gets large.

![big o](images/big-o.png)

Notice that when n is small, the functions are not clearly separated from one another, and it is hard to tell which is dominant. However, as n grows, there is a definite relationship and it is easy to see how they compare with one another.
![plot](images/newplot.png)
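The same comparison can be made numerically. The sketch below tabulates a set of growth functions for a few values of n. It assumes the usual list (constant, logarithmic, linear, log linear, quadratic, cubic, exponential), since the table itself is only available here as an image, and it assumes base-2 logarithms. The names `GROWTH_FUNCTIONS` and `growth_row` are illustrative.

```python
import math

# (label, function) pairs, ordered from slowest to fastest growing.
GROWTH_FUNCTIONS = [
    ("1", lambda n: 1),
    ("log n", lambda n: math.log2(n)),
    ("n", lambda n: n),
    ("n log n", lambda n: n * math.log2(n)),
    ("n^2", lambda n: n**2),
    ("n^3", lambda n: n**3),
    ("2^n", lambda n: 2**n),
]

def growth_row(n):
    """Evaluate every growth function at n."""
    return [func(n) for _, func in GROWTH_FUNCTIONS]

header = f"{'n':>4}" + "".join(f"{name:>14}" for name, _ in GROWTH_FUNCTIONS)
print(header)
for n in (1, 2, 4, 8, 16, 32):
    cells = "".join(f"{value:>14.0f}" for value in growth_row(n))
    print(f"{n:>4}{cells}")
```

For small n some columns tie or appear out of order (at n = 8, 2^n is 256 while n^3 is 512), but by n = 32 every column is larger than the one to its left.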

## 2.4. An Anagram Detection Example
A good example problem for showing algorithms with different orders of magnitude is the classic anagram detection problem for strings. One string is an anagram of another if the second is simply a rearrangement of the first. For example, `'heart'` and `'earth'` are anagrams. For the sake of simplicity, we will assume that the two strings in question are of equal length and that they are made up of symbols from the set of 26 lowercase alphabetic characters. Our goal is to write a boolean function that will take two strings and return whether they are anagrams.
### 2.4.1. Solution 1: Checking Off
Our first solution to the anagram problem will check the lengths of the strings and then check that each character in the first string actually occurs in the second. If it is possible to “check off” each character, then the two strings must be anagrams. Checking off a character will be accomplished by replacing it with the special Python value `None`. However, since strings in Python are immutable, the first step in the process will be to convert the second string to a list. Each character from the first string can be checked against the characters in the list and, if found, checked off by replacement.

In [1]:
def anagramSolution1(s1,s2):
    # Strings of different lengths can never be anagrams, so stop early.
    if len(s1) != len(s2):
        return False

    alist = list(s2)

    pos1 = 0
    stillOK = True

    while pos1 < len(s1) and stillOK:
        pos2 = 0
        found = False
        while pos2 < len(alist) and not found:
            if s1[pos1] == alist[pos2]:
                found = True
            else:
                pos2 = pos2 + 1

        if found:
            alist[pos2] = None
        else:
            stillOK = False

        pos1 = pos1 + 1
