## Profiling and Timing Code

When developing code and building data processing pipelines, you can often trade off between several implementations. Early in the development of an algorithm, worrying about such things can be counterproductive. As Donald Knuth famously quipped, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

Once your code is working, however, it can be useful to dig into its efficiency. Sometimes you want to check the execution time of a given command or set of commands; other times you want to dig into a multiline process and determine where the bottleneck lies in a complicated series of operations. IPython provides a wide array of functionality for this kind of timing and profiling. Here we'll discuss the following IPython magic commands:

- ``%time``: time the execution of a single statement
- ``%timeit``: time repeated executions of a single statement and report the average, for a more accurate result
- ``%prun``: run code with the profiler
- ``%lprun``: run code with the line-by-line profiler
- ``%memit``: measure the memory use of a single statement
- ``%mprun``: run code with the line-by-line memory profiler

Of these, ``%lprun``, ``%memit``, and ``%mprun`` are not bundled with IPython; they come from the ``line_profiler`` and ``memory_profiler`` extensions.

### Timing Code Snippets: ``%timeit`` and ``%time``

``%timeit`` times the repeated execution of a snippet of code. Because the first operation below is so fast, ``%timeit`` automatically runs a large number of repetitions (1,000,000 loops per run). For slower commands, such as the nested loop that follows it, ``%timeit`` adjusts automatically and performs fewer repetitions (100 loops per run). For more information on ``%time`` and ``%timeit``, including their available options, use the IPython help functionality (i.e., type ``%time?`` at the IPython prompt).

In [10]:
%timeit  sum(range(10))

214 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [11]:
%%timeit
total = 0
for i in range(100):
    for j in range(100):
        total += i * (-1) ** j

2.63 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
L = range(100)
%timeit [i**2 for i in L]

23.3 µs ± 32.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [13]:
import numpy as np
a = np.arange(100)
%timeit a**2

451 ns ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
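
Outside of IPython, the standard-library ``timeit`` module supports a comparable workflow. The sketch below is not how ``%timeit`` is implemented; it only illustrates the same idea of letting the tool choose the loop count. The helper name ``time_statement`` and the choice of three repeats are assumptions made for this example.

```python
import timeit


def time_statement(stmt, setup="pass", repeat=3):
    """Time `stmt` with an automatically chosen loop count (hypothetical helper)."""
    timer = timeit.Timer(stmt, setup=setup)
    # autorange() increases the loop count until one run takes at least 0.2 s
    loops, _ = timer.autorange()
    totals = timer.repeat(repeat=repeat, number=loops)
    # Convert the total time of each run into a per-loop time
    per_loop = [total / loops for total in totals]
    return loops, per_loop


slow_stmt = """
total = 0
for i in range(100):
    for j in range(100):
        total += i * (-1) ** j
"""

fast_loops, fast_times = time_statement("sum(range(10))")
slow_loops, slow_times = time_statement(slow_stmt)

print(f"fast statement: {fast_loops} loops, best {min(fast_times) * 1e9:.0f} ns per loop")
print(f"slow statement: {slow_loops} loops, best {min(slow_times) * 1e3:.2f} ms per loop")
```

The fast statement should end up with a far larger loop count than the nested loop, which is the same adjustment visible in the ``%timeit`` output above.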


### Timing Longer-Running Operations: ``%time``
Sometimes repeating an operation is not the best option. For example, if we have a list that we'd like to sort, a repeated operation can mislead us. Sorting a presorted list is much faster than sorting an unsorted one, so every repetition after the first measures the already-sorted case and skews the result:

In [14]:
import random
L = [random.random() for i in range(100000)]
%timeit L.sort()

365 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
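
If repeated measurements are still wanted, one possible workaround is to rebuild the unsorted list before every measurement. The sketch below uses the standard-library ``timeit.repeat`` with ``number=1``, so that the setup string runs again before each timed sort; the variable names are assumptions made for this example.

```python
import timeit

# The setup string runs once per repeat, so every timed sort sees fresh data
setup_unsorted = "import random; L = [random.random() for _ in range(100000)]"
# Same data, but sorted during setup so the timed call sees a presorted list
setup_presorted = setup_unsorted + "; L.sort()"

# number=1: each measurement is a single sort of a newly prepared list
unsorted_times = timeit.repeat("L.sort()", setup=setup_unsorted, repeat=5, number=1)
presorted_times = timeit.repeat("L.sort()", setup=setup_presorted, repeat=5, number=1)

print(f"unsorted list:  best of 5 = {min(unsorted_times) * 1e3:.2f} ms")
print(f"presorted list: best of 5 = {min(presorted_times) * 1e3:.2f} ms")
```

Because each repeat sorts only once, the unsorted figure is no longer diluted by later passes over already-sorted data.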


In this situation, the ``%time`` magic function may be a better choice. It is also a good choice for longer-running commands, where short, system-related delays are unlikely to affect the result. Let's time the sorting of an unsorted and a presorted list:

In [15]:
import random
L = [random.random() for i in range(100000)]
print("sorting an unsorted list:")
%time L.sort()

sorting an unsorted list:
CPU times: user 14.4 ms, sys: 307 µs, total: 14.7 ms
Wall time: 14.7 ms


In [16]:
print("sorting an already sorted list:")
%time L.sort()

sorting an already sorted list:
CPU times: user 7.06 ms, sys: 360 µs, total: 7.42 ms
Wall time: 8.24 ms


Notice how much faster the presorted list is to sort, but notice also how much longer the timing takes with ``%time`` than with ``%timeit``, even for the presorted list (about 8 ms of wall time here versus 365 µs above). This is because ``%timeit`` does some clever things under the hood to keep system calls from interfering with the timing. For example, it prevents cleanup of unused Python objects (known as *garbage collection*), which might otherwise affect the timing. For this reason, ``%timeit`` results are usually noticeably faster than ``%time`` results.

With ``%time``, as with ``%timeit``, the double-percent-sign cell magic syntax allows timing of multiline scripts:

In [17]:
%%time
total = 0
for i in range(1000):
    for j in range(1000):
        total += i * (-1) ** j

CPU times: user 359 ms, sys: 3.15 ms, total: 362 ms
Wall time: 360 ms
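
To make the difference between CPU time and wall time in the ``%time`` output concrete, here is a minimal sketch that takes a comparable single measurement in plain Python. It is an approximation, not IPython's implementation; ``time_once`` and ``alternating_sum`` are hypothetical names introduced for this example.

```python
import os
import time


def time_once(func, *args, **kwargs):
    """Run func once; return its result plus user, sys, and wall times in seconds."""
    cpu_before = os.times()            # cumulative CPU times of this process
    wall_before = time.perf_counter()  # high-resolution wall clock
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_before
    cpu_after = os.times()
    timings = {
        "user": cpu_after.user - cpu_before.user,
        "sys": cpu_after.system - cpu_before.system,
        "wall": wall,
    }
    return result, timings


def alternating_sum(n):
    """Same nested loop as the timed cell above, with the size as a parameter."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * (-1) ** j
    return total


total, timings = time_once(alternating_sum, 300)
print(f"CPU times: user {timings['user'] * 1e3:.1f} ms, sys {timings['sys'] * 1e3:.1f} ms")
print(f"Wall time: {timings['wall'] * 1e3:.1f} ms")
```

A single measurement like this includes whatever else the system happens to be doing, which is why it suits longer-running code better than very fast snippets.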


### Profiling Full Scripts: ``%prun``

A program is made of many single statements, and sometimes timing these statements in context is more important than timing them on their own. Python contains a built-in code profiler (which you can read about in the Python documentation), but IPython offers a much more convenient way to use it, in the form of the magic function ``%prun``. As an example, the following cell defines a simple function that does some calculations and profiles a call to it:

In [18]:
def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total

%prun sum_of_lists(1000000)

 

The result is a table that indicates, in order of total time spent in each function call, where the execution is spending the most time. In this case, the bulk of the execution time is in the list comprehension inside ``sum_of_lists``. From here, we could start thinking about what changes might improve the performance of the algorithm.

For more information on ``%prun``, including its available options, use the IPython help functionality (i.e., type ``%prun?`` at the IPython prompt).
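
The built-in profiler mentioned above is available as the standard-library ``cProfile`` module, and it can be driven directly when IPython is not available. The following is a minimal sketch that produces a similar table sorted by total time; the smaller input size and the limit of five rows are assumptions chosen to keep the example quick.

```python
import cProfile
import io
import pstats


def sum_of_lists(N):
    total = 0
    for i in range(5):
        L = [j ^ (j >> i) for j in range(N)]
        total += sum(L)
    return total


profiler = cProfile.Profile()
# runcall() profiles a single function call and returns its result
result = profiler.runcall(sum_of_lists, 100000)

# Write the report into a string buffer, sorted by time spent inside each function
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("tottime")
stats.print_stats(5)  # show only the five most expensive entries
report = buffer.getvalue()
print(report)
```

On Python 3.12 and later, comprehensions are inlined into the enclosing function, so their time is reported on the ``sum_of_lists`` row rather than on a separate ``<listcomp>`` row.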
