# Python for Machine Learning



## Built-in Data Structures

- Python's array type is called a “list,” and it expands automatically as items are added.

- Associative arrays (or hash tables) are called “dict.” 

- A “tuple” is a read-only list.

- A “set” is a container for unique items.

- We can make use of a dict to build a counter:

In [10]:
sentence = "Portez ce vieux whisky au juge blond qui fume"
counter = {}
for char in sentence:
    if char not in counter:
        counter[char] = 0
    counter[char] += 1

print(counter)

{'P': 1, 'o': 2, 'r': 1, 't': 1, 'e': 5, 'z': 1, ' ': 8, 'c': 1, 'v': 1, 'i': 3, 'u': 5, 'x': 1, 'w': 1, 'h': 1, 's': 1, 'k': 1, 'y': 1, 'a': 1, 'j': 1, 'g': 1, 'b': 1, 'l': 1, 'n': 1, 'd': 1, 'q': 1, 'f': 1, 'm': 1}
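
- As a side note, the standard library offers collections.Counter, a dict subclass built for this kind of tally. The following is a minimal sketch of the same count; the variable name counts is only illustrative.

```python
from collections import Counter

sentence = "Portez ce vieux whisky au juge blond qui fume"

# Counter is a dict subclass that tallies hashable items in one pass
counts = Counter(sentence)

print(counts["e"])            # count of a single character
print(counts.most_common(3))  # the three most frequent characters
print(counts["Z"])            # missing keys count as 0 instead of raising KeyError
```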


- We can use + to concatenate lists. In the example below, we use += to extend the list A.

In [12]:
A = [1, 2, "fizz", 4, "buzz", "fizz", 7]
A += [8, "fizz", "buzz", 11, "fizz", 13, 14, "fizzbuzz"]
print(A)

[1, 2, 'fizz', 4, 'buzz', 'fizz', 7, 8, 'fizz', 'buzz', 11, 'fizz', 13, 14, 'fizzbuzz']


- We can swap two variables with a very clean syntax using tuples:

In [13]:
a = 42
b = "foo"
print("a is %s; b is %s" % (a,b))
a, b = b, a # swap
print("After swap, a is %s; b is %s" % (a,b))

a is 42; b is foo
After swap, a is foo; b is 42


- Python strings support substitution on the fly with the % operator:

In [14]:
template = "Square root of %d is %.3f"
n = 10
answer = template % (n, n**0.5)
print(answer)

Square root of 10 is 3.162
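
- The same substitution can also be written with an f-string (Python 3.6 and later) or with str.format(). This is a small sketch for comparison; the name same_answer is only illustrative.

```python
n = 10

# The same template written as an f-string; :d and :.3f are format specifiers
answer = f"Square root of {n:d} is {n**0.5:.3f}"
print(answer)

# str.format() accepts the same format specifiers
same_answer = "Square root of {:d} is {:.3f}".format(n, n**0.5)
print(same_answer)
```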


### Special variables

- One notable “special” variable that you may often see in Python code is _, just an underscore character. By convention, it denotes a variable whose value we do not care about.

- We use it to hold a return value from a function that we do not need:

In [9]:
import pandas as pd
A = pd.DataFrame([[1,2,3],[2,3,4],[3,4,5],[5,6,7]], columns=["x","y","z"])
print(A)

# A.iterrows() yields (index, row) pairs; we do not care about the index, so we assign it to _
for _, row in A.iterrows():
    print(row["z"])

   x  y  z
0  1  2  3
1  2  3  4
2  3  4  5
3  5  6  7
3
4
5
7


## Built-in functions

- List of Python built-in functions:

abs()
aiter()
all()
any()
anext()
ascii()
bin()
bool()
breakpoint()
bytearray()
bytes()
callable()
chr()
classmethod()
compile()
complex()
delattr()
dict()
dir()
divmod()
enumerate()
eval()
exec()
filter()
float()
format()
frozenset()
getattr()
globals()
hasattr()
hash()
help()
hex()
id()
input()
int()
isinstance()
issubclass()
iter()
len()
list()
locals()
map()
max()
memoryview()
min()
next()
object()
oct()
open()
ord()
pow()
print()
property()
range()
repr()
reversed()
round()
set()
setattr()
slice()
sorted()
staticmethod()
str()
sum()
super()
tuple()
type()
vars()
zip()
__import__()

- zip() allows you to combine multiple lists element by element. It stops at the shortest input, so the extra item 9 in b below is dropped:

In [5]:
a = ["x", "y", "z"]
b = [3, 5, 7, 9]
c = [2.1, 2.5, 2.9]
for x in zip(a, b, c):
    print(x)

('x', 3, 2.1)
('y', 5, 2.5)
('z', 7, 2.9)


- zip() also helps us “pivot” a list of lists.

- Calling zip(*a) unpacks the list a so that each inner list is passed to zip() as a separate argument. The result is an iterator of tuples, where the i-th tuple contains the i-th element from each of the inner lists.

In [8]:
a = [['x', 3, 2.1], ['y', 5, 2.5], ['z', 7, 2.9]]
p,q,r = zip(*a)
print(p)
print(q)
print(r)

('x', 'y', 'z')
(3, 5, 7)
(2.1, 2.5, 2.9)


- enumerate() numbers the items of a list, for example:

In [9]:
a = ["quick", "brown", "fox", "jumps", "over"]
for num, item in enumerate(a):
    print("item %d is %s" % (num, item))

item 0 is quick
item 1 is brown
item 2 is fox
item 3 is jumps
item 4 is over


- Some functions that work on a list (or list-like data structures, which Python calls “iterables”):

    max(a): To find the maximum value in list a
    
    min(a): To find the minimum value in list a
    
    sum(a): To find the sum of values in list a
    
    reversed(a): To iterate over list a from the back
    
    sorted(a): To return a copy of list a with elements in sorted order
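
- A short sketch of these functions on a sample list (the values in a are arbitrary):

```python
a = [3, 1, 4, 1, 5, 9, 2, 6]

print(max(a))             # largest value
print(min(a))             # smallest value
print(sum(a))             # total of all values
print(list(reversed(a)))  # reversed() returns an iterator, so wrap it in list()
print(sorted(a))          # a new sorted list; a itself is unchanged
```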


## Python Debugging Tools 

- The built-in debugger is pdb

- A debugger provides you with a slow-motion button to control the flow of a program.

- It also allows you to freeze the program at a certain time and examine the state.

- The simplest operation under a debugger is to step through the code. That is to run one line of code at a time and wait for your acknowledgment before proceeding to the next. The reason we want to run the program in a stop-and-go fashion is to allow us to check the logic and value or verify the algorithm.

- For large code, debuggers also provide a breakpoint feature that will kick in when a specific line of code is reached. From that point onward, we can step through it line by line.

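- To run a program with the Python debugger, we launch it from the command line with python -m pdb followed by the name of the script (the full command for our example is shown after the list of debugger commands below).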


- At the prompt, you can type in the debugger commands. 

EOF    c          d        h         list      q        rv       undisplay
a      cl         debug    help      ll        quit     s        unt
alias  clear      disable  ignore    longlist  r        source   until
args   commands   display  interact  n         restart  step     up
b      condition  down     j         next      return   tbreak   w
break  cont       enable   jump      p         retval   u        whatis
bt     continue   exit     l         pp        run      unalias  where

- h: for help

- n: to run the next line, stepping over function calls

- s: to step into a function called on the current line, such as f()

- until: to let the debugger run the program until a given line is reached (until 11)

- b: (breakpoint) to stop at a particular line whenever it is about to be run (b 40)

- b: can also place a breakpoint with a condition so that it will stop only if the condition is met (b 40, r1 > 0.5)

- c: to continue until a breakpoint is hit

- bt: to show the traceback to check how we reached that point (for example, after a breakpoint is hit)

- p: to print the variables (or an expression) to check what value they are holding (p r1, r2)

- l: to list the code around the current statement

- We can also manipulate variables while debugging: after inspecting them with p r1, r2, we can assign a new value at the prompt with r1 = 0.2

- up: moves our focus one level up the call stack (for example, after bt has shown the call stack)

- q: to quit, or hit Ctrl-D if your terminal supports it.

In [None]:
python -m pdb pso.py

- pdb is suitable only for programs that we start from scratch under the debugger.

- The Python extension for GDB helps with a program that is already running but is stuck.

In [None]:
## the trailing & makes the program run in the background
python simpleqt.py &

## check for its process ID
ps a | grep python

## attach gdb to the running process; 3997 is the process ID that ps reported for the stuck program
gdb python 3997

- The commands supported under GDB are py-list, py-bt, py-up, py-down, and py-print. They are comparable to the same commands in pdb without the py- prefix.

## Profiling Python Code

- Profiling is a technique to figure out how time is spent in a program. 

- It helps us find the “hot spots” of a program and think about ways to improve them.

- For example, to concatenate many short strings, we can use the join() method of strings or the + operator. Which one is more efficient?

In [15]:
python -m timeit 'longstr=""' 'for x in range(1000): longstr += str(x)'
python -m timeit '"".join([str(x) for x in range(1000)])'

- The above commands load the timeit module and pass it a line of code to measure. In the first case, we have two statements, and they are passed on to the timeit module as two separate arguments.

- The -s option allows us to provide “setup” code, which is executed before the profiling and is not timed:

In [None]:
python -m timeit '[x**0.5 for x in range(1000)]'
python -m timeit -s 'from math import sqrt' '[sqrt(x) for x in range(1000)]'
python -m timeit -s 'from numpy import sqrt' '[sqrt(x) for x in range(1000)]'

- You can also run timeit in Python code:



In [None]:
import timeit
measurements = timeit.repeat('[x**0.5 for x in range(1000)]', number=10000)
print(measurements)
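
- The string concatenation comparison from the command line can be repeated inside a script as well. The following is a minimal sketch, assuming two helper functions (concat_plus and concat_join, names chosen here for illustration) and a small number of runs to keep it quick:

```python
import timeit

def concat_plus():
    # build the string with repeated +=
    longstr = ""
    for x in range(1000):
        longstr += str(x)
    return longstr

def concat_join():
    # build the string with a single join() call
    return "".join([str(x) for x in range(1000)])

# timeit.timeit() accepts a callable; number is how many times it is called
t_plus = timeit.timeit(concat_plus, number=200)
t_join = timeit.timeit(concat_join, number=200)
print(f"+=  : {t_plus:.4f} s for 200 runs")
print(f"join: {t_join:.4f} s for 200 runs")
```

- The measured times depend on the machine and the Python version, so it is worth running the comparison on your own setup rather than relying on a fixed answer.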

### The Profile Module

- A program can generally run slowly for two reasons: a part is running slowly, or a part is running too many times, adding up and taking too much time. We call these “performance hogs” the hot spots.

- Run the profiler on a script as follows:

In [None]:
python -m cProfile hillclimb.py

- To see which function is called the most times, we can sort by ncalls:

In [None]:
python -m cProfile -s ncalls hillclimb.py

- The other sort options are as follows:
- calls: Call count
- cumulative: Cumulative time
- cumtime: Cumulative time
- file: File name
- filename: File name
- module: File name
- ncalls: Call count
- pcalls: Primitive call count
- line: Line number
- name: Function name
- nfl: Name/file/line
- stdname: Standard name
- time: Internal time
- tottime: Internal time

- We can save the profiler’s statistics for further processing as follows.

- This will run the program, but instead of printing the statistics to the screen, it saves them into a file:

In [None]:
python -m cProfile -o hillclimb.stats hillclimb.py

- Afterward, we can use the pstats module as follows to open the statistics file and get a prompt for manipulating the data:

In [None]:
python -m pstats hillclimb.stats

- Using the profiler inside code: we can also import cProfile in a script to profile only a specific part of the program:

In [None]:
import cProfile as profile
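
- One way to profile only part of a program is to create a cProfile.Profile object and switch it on and off around the code of interest. The following is a minimal sketch under that assumption; slow_sum is a hypothetical stand-in for the code being measured.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # deliberately naive loop so the profiler has something to measure
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()           # start collecting statistics
result = slow_sum(100_000)
profiler.disable()          # stop collecting statistics

# Format the collected statistics into a string, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)        # limit the report to the top 5 entries
report = buffer.getvalue()
print(report)
```

- Only the code between enable() and disable() is measured, so the rest of the program does not clutter the report.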

## Static Analyzers in Python

- Static analyzers are tools that help you check your code without really running your code.

- The most basic form of static analyzer is the syntax highlighter.

- Three useful libraries are: 
   
    - Pylint
    - Flake8
    - mypy

### Pylint


In [None]:
pip install pylint

- Pylint can check one script or the entire directory.

- We can ask Pylint to tell us how good our code is before even running it:

In [None]:
pylint lenet5-notworking.py

- If you provide the root directory of a module to Pylint, all components of the module will be checked by Pylint. In that case, you will see the path of different files at the beginning of each line.

- Pylint may give false positives, i.e., it may report an error where there is none.

- Pylint helps us align our code with the PEP8 coding style.

- Pylint also helps us identify potential issues.

- If you know what Pylint should stop complaining about, you can ask it to ignore those issues:

In [None]:
pylint -d E0611 lenet5-notworking.py

- All errors of code E0611 will be ignored by Pylint.

- You can disable multiple codes by passing a comma-separated list to -d.

- You can also disable some issues on only a specific line or a specific part of the code using special comments:

In [None]:
from tensorflow.keras.datasets import mnist  # pylint: disable=no-name-in-module
from tensorflow.keras.models import Sequential # pylint: disable=E0611
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical

- The magic keyword pylint: will introduce Pylint-specific instructions. The code E0611 and the name no-name-in-module are the same. In the example above, Pylint will complain about the last two import statements but not the first two because of those special comments.

### Flake8

- Flake8 is a wrapper over PyFlakes, McCabe, and pycodestyle. All of these are installed as dependencies when you install flake8 with:

In [None]:
pip install flake8

- Similar to Pylint, we can pass in a script or a directory for analysis.

- But the focus of Flake8 is inclined toward coding style.

- The error codes beginning with letter E are from pycodestyle, and those beginning with letter F are from PyFlakes.

- It complains about coding style issues, such as writing (5,5) without a space after the comma.

- It can also identify the use of variables before assignment.

- Similar to Pylint, we can also ask Flake8 to ignore some complaints and those lines will not be printed in the output. For example,

In [None]:
flake8 --ignore E501,E231 lenet5-notworking.py

- We can also use magic comments to disable some complaints

In [None]:
import tensorflow as tf  # noqa: F401
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential

- Flake8 will look for the comment # noqa: to skip some complaints on those particular lines.

### Mypy

- Python is not a statically typed language, so unlike C or Java, you do not need to declare the types of functions or variables before use.

- But lately, Python has introduced type hint notation, so we can specify what type a function or variable is intended to be, without the interpreter enforcing compliance as a statically typed language would.

- One of the biggest benefits of using type hints in Python is to provide additional information for static analyzers to check. Mypy is the tool that can understand type hints.

- Even without type hints, Mypy can still provide complaints similar to Pylint and Flake8.
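
- To make the notation concrete, here is a minimal sketch of a function and a variable with type hints. The function mean and the variable average are hypothetical examples, not part of the lenet5 script checked below.

```python
from typing import List

def mean(values: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# A variable can carry a type hint as well
average: float = mean([1.0, 2.0, 4.5])
print(average)

# Python itself does not enforce the hints: the call below would be accepted
# by the interpreter and fail only at runtime, whereas a checker such as mypy
# can flag the wrong argument type before the code is executed.
# mean("not a list")
```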

In [None]:
pip install mypy

In [None]:
mypy lenet5-notworking.py


- Mypy expects all the libraries we use to come with a stub so that type checking can be done. This is because type hints are optional.


- Some of the libraries have typing stubs available that enable mypy to check them better.

- In conclusion, the three tools we introduced above can be complementary to each other. You may consider running all of them to look for any possible bugs in your code or to improve the coding style.

## Running a Python Script in Command Line

- Running a Python script in command line is powerful because you can pass in additional parameters to the script.

In [None]:
import sys

n = int(sys.argv[1])
print(n+1)

In [None]:
python commandline.py 15
16

- The list sys.argv contains the name of our script and all the arguments (all strings), which in the above case, is ["commandline.py", "15"].

- Python provides the argparse library to help with multiple and more complicated sets of arguments. To see what such arguments look like, consider the following rsync command:

In [None]:
rsync -a -v --exclude="*.pyc" -B 1024 --ignore-existing 192.168.0.3:/tmp/ ./

- The optional arguments are introduced by “-” or “--”, where a single hyphen carries a single-character “short option” (such as -a, -B, and -v above), and two hyphens are for multi-character “long options” (such as --exclude and --ignore-existing above).

- The optional arguments may have additional parameters, such as in -B 1024 or --exclude="*.pyc"; the 1024 and "*.pyc" are parameters to -B and --exclude, respectively. 

- Additionally, we may also have compulsory arguments, which we just put into the command line. The part 192.168.0.3:/tmp/ and ./ above are examples. The order of compulsory arguments is important. For example, the rsync command above will copy files from 192.168.0.3:/tmp/ to ./ instead of the other way round.

- The following replicates the above example in Python using argparse:

In [None]:
import argparse

parser = argparse.ArgumentParser(description="Just an example",
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-a", "--archive", action="store_true", help="archive mode")
parser.add_argument("-v", "--verbose", action="store_true", help="increase verbosity")
parser.add_argument("-B", "--block-size", help="checksum blocksize")
parser.add_argument("--ignore-existing", action="store_true", help="skip files that exist")
parser.add_argument("--exclude", help="files to exclude")
parser.add_argument("src", help="Source location")
parser.add_argument("dest", help="Destination location")
args = parser.parse_args()
config = vars(args)
print(config)

- If you run the above script without any arguments, you will see:

In [None]:
python argparse_example.py
usage: argparse_example.py [-h] [-a] [-v] [-B BLOCK_SIZE] [--ignore-existing] [--exclude EXCLUDE] src dest
argparse_example.py: error: the following arguments are required: src, dest

- This means you didn’t provide the compulsory arguments for src and dest. 
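
- To see what a successful parse looks like without leaving Python, we can hand parse_args() a list of strings in place of the real command line. The following is a shortened sketch of the parser above; the type=int on --block-size is an addition for illustration and is not part of the original script.

```python
import argparse

parser = argparse.ArgumentParser(description="Just an example")
parser.add_argument("-v", "--verbose", action="store_true", help="increase verbosity")
parser.add_argument("-B", "--block-size", type=int, help="checksum blocksize")
parser.add_argument("--exclude", help="files to exclude")
parser.add_argument("src", help="Source location")
parser.add_argument("dest", help="Destination location")

# Passing a list to parse_args() stands in for sys.argv[1:]
args = parser.parse_args(["-v", "-B", "1024", "--exclude", "*.pyc", "192.168.0.3:/tmp/", "./"])
config = vars(args)
print(config)
```

- Note that the long option --block-size becomes the key block_size, since argparse replaces hyphens with underscores when it derives the attribute name.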

## Working on the Command Line

- Empowering your Python script with command line arguments can bring it to a new level of reusability.

- Let’s look at a simple example of fitting an ARIMA model to a GDP time series. The World Bank collects historical GDP data from many countries. We can make use of the pandas_datareader package to read the data.

- The series we use is NY.GDP.MKTP.CN; we can get the data of a country in the form of a pandas DataFrame.

- Fitting an ARIMA model and using the model for predictions is not difficult. In the following, we fit it using the first 40 data points and forecast the next 3. Then we compare the forecast with the actual values in terms of relative error.

- We use argparse so that we can change some parameters from the command line:

In [None]:
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd

# Parse command line arguments
parser = ArgumentParser(formatter_class=ArgumentDefaultsHelpFormatter)
parser.add_argument("-c", "--country", default="SE", help="Two-letter country code")
parser.add_argument("-l", "--length", default=40, type=int, help="Length of time series to fit the ARIMA model")
parser.add_argument("-s", "--start", default=0, type=int, help="Starting offset to fit the ARIMA model")
args = vars(parser.parse_args())

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

- If we run the code above in a command line, we can see it can now accept arguments:

In [None]:
python gdp_arima.py --help
usage: gdp_arima.py [-h] [-c COUNTRY] [-l LENGTH] [-s START]

optional arguments:
  -h, --help            show this help message and exit
  -c COUNTRY, --country COUNTRY
                        Two-letter country code (default: SE)
  -l LENGTH, --length LENGTH
                        Length of time series to fit the ARIMA model (default: 40)
  -s START, --start START
                        Starting offset to fit the ARIMA model (default: 0)
python gdp_arima.py
                   Actual      Forecast  Rel Error
2000-12-31  2408151000000  2.367152e+12  -0.017025
2001-12-31  2503731000000  2.449716e+12  -0.021574
2002-12-31  2598336000000  2.516118e+12  -0.031643

python gdp_arima.py -c NO
                   Actual      Forecast  Rel Error
2000-12-31  1507283000000  1.337229e+12  -0.112821
2001-12-31  1564306000000  1.408769e+12  -0.099429
2002-12-31  1561026000000  1.480307e+12  -0.051709

 - In the last command above, we pass in -c NO to apply the same model to the GDP data of Norway (NO) instead of Sweden (SE). Hence, without the risk of messing up the code, we reused our code for a different dataset.
 
- The power of introducing a command line argument is that we can easily test our code with varying parameters. 

- For example, we want to see if the ARIMA(1,1,1) model is a good model for predicting GDP, and we want to verify this with different time windows for the Nordic countries:

    - Denmark (DK)
    - Finland (FI)
    - Iceland (IS)
    - Norway (NO)
    - Sweden (SE)

- We want to check a window of 40 years but with different starting points (1960, 1965, 1970, and 1975). On Linux and macOS, you can build a for loop using the bash shell syntax.

- as the shell syntax permits, we can put everything in one line:
 

In [None]:
for C in DK FI IS NO SE; do for S in 0 5 10 15; do python gdp_arima.py -c $C -s $S ; done ; done

- If you’re using Windows, you can use the following syntax in command prompt:



In [None]:
for %C in (DK FI IS NO SE) do for %S in (0 5 10 15) do python gdp_arima.py -c %C -s %S



- or the following in PowerShell:


In [None]:
foreach ($C in "DK","FI","IS","NO","SE") { foreach ($S in 0,5,10,15) { python gdp_arima.py -c $C -s $S } }


## Alternative to command line arguments

There are at least two other ways to pass parameters to a Python script:

    using environment variables
    using config files

Environment variables are features from your OS to keep a small amount of data in memory. We can read environment variables in Python using the following syntax:

In [None]:
import os
print(os.environ["MYVALUE"])

- Saved as show_env.py, the above two-line script works with the Windows command prompt as follows:

In [None]:
C:\MLM> set MYVALUE=hello

C:\MLM> python show_env.py
hello
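
- Indexing os.environ raises a KeyError when the variable is not set. A more forgiving variant, sketched below, uses os.environ.get() with a fallback. The default values and the variable name MYLENGTH are hypothetical and chosen only for illustration.

```python
import os

# os.environ["MYVALUE"] raises KeyError when the variable is not set;
# os.environ.get() returns a fallback instead ("default" is an arbitrary choice)
value = os.environ.get("MYVALUE", "default")
print(value)

# Environment variables are always strings, so convert them explicitly.
# MYLENGTH is a hypothetical variable name used only for this sketch.
length = int(os.environ.get("MYLENGTH", "40"))
print(length)
```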

- In case we have a lot of options to set, it is better to save the options to a file rather than overwhelming the command line. 

- Depending on the format we chose, we can use the configparser or json module from Python to read the Windows INI format or JSON format, respectively. 

- We may also use the third-party library PyYAML to read the YAML format.
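
- For the INI format, a minimal sketch with configparser could look like the following. The section name arima and the file name config.ini are assumptions made for this example, and the text is read from a string here so that the snippet is self-contained.

```python
import configparser

# Contents that would normally live in a file such as config.ini (hypothetical name)
ini_text = """
[arima]
country = SE
length = 40
start = 0
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)   # use parser.read("config.ini") to load from a file

section = parser["arima"]
args = {
    "country": section.get("country"),
    "length": section.getint("length"),   # INI values are strings; getint() converts
    "start": section.getint("start"),
}
print(args)
```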

- For the above example running the ARIMA model on GDP data, we can modify the code to use a YAML config file:

In [None]:
import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd
import yaml

# Load config from YAML file
with open("config.yaml", "r") as fp:
    args = yaml.safe_load(fp)

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

- The YAML config file is named config.yaml, and its content is as follows:

In [None]:
country: SE
length: 40
start: 0

- The JSON counterpart is very similar, where we use the load() function from the json module:

In [None]:
import json
import warnings
warnings.simplefilter("ignore")

from pandas_datareader.wb import WorldBankReader
import statsmodels.api as sm
import pandas as pd

# Load config from JSON file
with open("config.json", "r") as fp:
    args = json.load(fp)

# Set up parameters
series = "NY.GDP.MKTP.CN"
country = args["country"]
length = args["length"]
start = args["start"]
steps = 3
order = (1,1,1)

# Read the GDP data from WorldBank database
gdp = WorldBankReader(series, country, start=1960, end=2020).read()
# Drop country name from index
gdp = gdp.droplevel(level=0, axis=0)
# Sort data in chronological order and set data point at year-end
gdp.index = pd.to_datetime(gdp.index)
gdp = gdp.sort_index().resample("y").last()
# Convert pandas dataframe into pandas series
gdp = gdp[series]
# Fit arima model
result = sm.tsa.ARIMA(endog=gdp[start:start+length], order=order).fit()
# Forecast, and calculate the relative error
forecast = result.forecast(steps=steps)
df = pd.DataFrame({"Actual":gdp, "Forecast":forecast}).dropna()
df["Rel Error"] = (df["Forecast"] - df["Actual"]) / df["Actual"]
# Print result
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(df)

- And the JSON config file, config.json, would be:

In [None]:
{
    "country": "SE",
    "length": 40,
    "start": 0
}

## What is the decorator pattern, and why is it useful?

- The decorator pattern is a software design pattern that allows us to dynamically add functionality to classes without creating subclasses and affecting the behavior of other objects of the same class.

- By using the decorator pattern, we can easily generate different permutations of functionality that we might want without creating an exponentially increasing number of subclasses, making our code increasingly complex and bloated.

- Decorators are usually implemented as sub-interfaces of the main interface that we want to implement and store an object of the main interface’s type.

- It will then modify the methods to which it wants to add certain functionality by overriding the methods in the original interface and calling on methods from the stored object.
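
- The classic, class-based form of the pattern can be sketched in Python as follows. All class names here (Text, PlainText, TextDecorator, Bold, Italic) are hypothetical and serve only to illustrate the structure described above.

```python
class Text:
    """The main interface: anything that can render itself as a string."""
    def render(self) -> str:
        raise NotImplementedError

class PlainText(Text):
    """A concrete implementation of the main interface."""
    def __init__(self, content: str):
        self.content = content

    def render(self) -> str:
        return self.content

class TextDecorator(Text):
    """Implements the same interface and stores an object of that interface."""
    def __init__(self, wrapped: Text):
        self.wrapped = wrapped

    def render(self) -> str:
        return self.wrapped.render()

class Bold(TextDecorator):
    def render(self) -> str:
        # add functionality around the stored object's method
        return "<b>" + self.wrapped.render() + "</b>"

class Italic(TextDecorator):
    def render(self) -> str:
        return "<i>" + self.wrapped.render() + "</i>"

# Decorators can be combined freely without writing a BoldItalicText subclass
fancy = Bold(Italic(PlainText("hello")))
print(fancy.render())
```

- Each decorator only knows about the interface, so any combination and order of decorators can be assembled at run time.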

### Function Decorators in Python

- A function decorator is an incredibly useful feature in Python. It is built upon the idea that functions and classes are first-class objects in Python.

- Since a Python function is an object, we can pass a function as an argument to another function. For example, a function that executes another function twice can be written as follows:


In [18]:
def repeat(fn):
    fn()
    fn()

def hello_world():
    print("Hello world!")

repeat(hello_world)

Hello world!
Hello world!


- Since a Python function is an object, we can also make a function that returns another function, which in turn executes yet another function twice. This is done as follows:

In [19]:
def repeat_decorator(fn):
    def decorated_fn():
        fn()
        fn()
    # returns a function
    return decorated_fn

def hello_world():
    print ("Hello world!")

hello_world_twice = repeat_decorator(hello_world)

# call the function
hello_world_twice()

Hello world!
Hello world!


- In the above, we passed the hello_world function as an argument to the repeat_decorator() function, and it returns the decorated_fn function, which is assigned to hello_world_twice. Afterward, we can invoke hello_world_twice() since it is now a function.

- The idea of decorator pattern applies here. But we do not need to define the interface and subclasses explicitly. In fact, hello_world is a name defined as a function in the above example. There is nothing preventing us from redefining this name to something else.

In [20]:
# function decorator that calls the function twice
def repeat_decorator(fn):
    def decorated_fn():
        fn()
        fn()
    # returns a function
    return decorated_fn

# using the decorator on hello_world function
@repeat_decorator
def hello_world():
    print ("Hello world!")

# call the function
hello_world()

Hello world!
Hello world!


- In the above code, @repeat_decorator before a function definition means to pass the function into repeat_decorator() and reassign its name to the output. That is, to mean hello_world = repeat_decorator(hello_world). The @ line is the decorator syntax in Python.

- We can also implement decorators that take in arguments, but this would be a bit more complicated as we need to have one more layer of nesting. If we extend our example above to define the number of times to repeat the function call:

In [21]:
def repeat_decorator(num_repeats = 2):
    # repeat_decorator should return a function that's a decorator
    def inner_decorator(fn):
        def decorated_fn():
            for i in range(num_repeats):
                fn()
        # return the new function
        return decorated_fn
    # return the decorator that actually takes the function in as the input
    return inner_decorator

# use the decorator with num_repeats argument set as 5 to repeat the function call 5 times
@repeat_decorator(5)
def hello_world():
    print("Hello world!")

# call the function
hello_world()

Hello world!
Hello world!
Hello world!
Hello world!
Hello world!


- The repeat_decorator() takes in an argument and returns a function which is the actual decorator for the hello_world function (i.e., invoking repeat_decorator(5) returns inner_decorator with the local variable num_repeats = 5 set).
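
- The decorated functions so far take no arguments and return nothing. A common extension, sketched below as one possible variant, forwards arbitrary arguments with *args and **kwargs and uses functools.wraps so that the decorated function keeps its original name and docstring. The function greet is a hypothetical example.

```python
import functools

def repeat_decorator(num_repeats=2):
    def inner_decorator(fn):
        @functools.wraps(fn)  # copy fn's name and docstring onto the wrapper
        def decorated_fn(*args, **kwargs):
            result = None
            for _ in range(num_repeats):
                # forward whatever arguments the caller supplied
                result = fn(*args, **kwargs)
            return result  # value returned by the last call
        return decorated_fn
    return inner_decorator

@repeat_decorator(3)
def greet(name, punctuation="!"):
    """Print a greeting and return it."""
    message = "Hello " + name + punctuation
    print(message)
    return message

last = greet("world")
print(greet.__name__)
```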

## The Use Cases of Decorators

- One of the most common use cases is to convert data implicitly. For example, we may define a function that assumes all operations are based on numpy arrays and then make a decorator to ensure that happens by modifying the input:

In [None]:
import numpy as np

# function decorator to ensure numpy input
def ensure_numpy(fn):
    def decorated_function(data):
        # converts input to numpy array
        array = np.asarray(data)
        # calls fn on input numpy array
        return fn(array)
    return decorated_function

- We can further add to our decorator by modifying the output of the function, such as rounding off floating point values:

In [None]:
import numpy as np

# function decorator to ensure numpy input
# and round off output to 4 decimal places
def ensure_numpy(fn):
    def decorated_function(data):
        array = np.asarray(data)
        output = fn(array)
        return np.around(output, 4)
    return decorated_function

- Let’s consider the example of finding the sum of an array. A numpy array has sum() built-in, as does pandas DataFrame. 

- But the latter is to sum over columns rather than sum over all elements. 

- Hence a numpy array will sum to one floating point value while a DataFrame will sum to a vector of values. 

- But with the above decorator, we can write a function that gives you consistent output in both cases:

In [26]:
import numpy as np
import pandas as pd

# function decorator to ensure numpy input
# and round off output to 4 decimal places
def ensure_numpy(fn):
    def decorated_function(data):
        array = np.asarray(data)
        output = fn(array)
        return np.around(output, 4)
    return decorated_function

@ensure_numpy
def numpysum(array):
    return array.sum()

x = np.random.randn(10,3)
print(x)
y = pd.DataFrame(x, columns=["A", "B", "C"])
print(y)

# output of numpy .sum() function
print("x.sum():", x.sum())
print()

# output of pandas .sum() function
print("y.sum():", y.sum())
print()

# calling decorated numpysum function
print("numpysum(x):", numpysum(x))
print("numpysum(y):", numpysum(y))

[[ 1.10826376 -1.8346967  -0.8656575 ]
 [ 0.83011178 -0.4462485   1.62806834]
 [ 0.57775645 -1.0687193  -0.75038706]
 [-0.20930878 -0.73197545 -0.39379396]
 [-0.29453276 -1.185483   -0.75592191]
 [-1.12707315  0.2844536   0.38370484]
 [-0.28469777  0.5582686  -2.05787598]
 [ 0.76308829 -0.36139811 -1.09883896]
 [ 0.47075571  0.53695196 -0.39533697]
 [-0.90400514  0.09515254  1.00264799]]
          A         B         C
0  1.108264 -1.834697 -0.865658
1  0.830112 -0.446249  1.628068
2  0.577756 -1.068719 -0.750387
3 -0.209309 -0.731975 -0.393794
4 -0.294533 -1.185483 -0.755922
5 -1.127073  0.284454  0.383705
6 -0.284698  0.558269 -2.057876
7  0.763088 -0.361398 -1.098839
8  0.470756  0.536952 -0.395337
9 -0.904005  0.095153  1.002648
x.sum(): -6.526727148650348

y.sum(): A    0.930358
B   -4.153694
C   -3.303391
dtype: float64

numpysum(x): -6.5267
numpysum(y): -6.5267


- In the above code, @ensure_numpy before a function definition means to pass the function into ensure_numpy() and reassign its name to the output. That is, to mean numpysum = ensure_numpy(numpysum).

## Techniques to Write Better Python Code


- Sanitation and assertive programming
- Guard rails and offensive programming
- Good practices to avoid bugs

### Sanitation and Assertive Programming

- As Python is a duck-typing language, it is easy for a function that expects numbers to be called with strings. For example:


In [2]:
def add(a, b):
    return a + b

c = add("one", "two")
c

'onetwo'

- One common thing that fairly long code does is sanitize the input. For example, we may rewrite our function above as the following:

In [None]:
def add(a, b):
    if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
        raise ValueError("Input must be numbers")
    return a + b

- Or, better, convert the input into a floating-point number whenever possible:

In [None]:
def add(a, b):
    try:
        a = float(a)
        b = float(b)
    except (ValueError, TypeError):  # float() raises TypeError for inputs such as None
        raise ValueError("Input must be numbers")
    return a + b

- The key here is to do some “sanitization” at the beginning of a function, so subsequently, we can assume the input is in a certain format. 

- Another reason to sanitize the input is for canonicalization. This means we should make the input in a standardized format.

- For example, a URL should start with “http://”, and a file path should always be a full absolute path like /etc/passwd instead of something like /tmp/../etc/././passwd.
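
- Canonicalization of this kind might be sketched as follows. The helper names canonical_path() and canonical_url() and the rule "prepend http:// when no scheme is present" are assumptions made for illustration. Note that normpath() only collapses "." and ".." components; it does not turn a relative path into an absolute one.

```python
import posixpath

def canonical_path(path):
    # Collapse "." and ".." components. posixpath is used so that the
    # result is the same on every platform for this illustration.
    return posixpath.normpath(path)

def canonical_url(url, default_scheme="http://"):
    # Hypothetical rule: prepend a scheme only when none is present
    url = url.strip()
    if "://" not in url:
        url = default_scheme + url
    return url

print(canonical_path("/tmp/../etc/././passwd"))  # /etc/passwd
print(canonical_url("example.com/index.html"))   # http://example.com/index.html
```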

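- Input checks can also be written with assert. Since assertions are skipped when Python runs with the -O option, the correct way of using assert is to help us debug while developing our code rather than to guard inputs in production. For example,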

In [None]:
def add(a, b):
    assert isinstance(a, (int, float)), "`a` must be a number"
    assert isinstance(b, (int, float)), "`b` must be a number"
    return a + b

In [None]:
def evenitems(arr):
    newarr = []
    for i in range(len(arr)):
        if i % 2 == 0:
            newarr.append(arr[i])
    assert len(newarr) * 2 >= len(arr)
    return newarr

- While we develop this function, we are not sure our algorithm is correct. There are many things to check, but here we want to be sure that if we extracted every even-indexed item from the input, the result is at least half the length of the input array.

- When we try to optimize the algorithm or polish the code, this condition must not be invalidated. We keep the assert statement at strategic locations to make sure we didn’t break our code after modifications. 

- Using assert this way is to check the steps inside a function.

- If we write a complex algorithm, it is helpful to add assert to check for loop invariants, namely, the conditions that a loop should uphold. Consider the following code of binary search in a sorted array:

In [None]:
def binary_search(array, target):
    """Binary search on array for target

    Args:
        array: sorted array
        target: the element to search for
    Returns:
        index n on the array such that array[n]==target
        if the target not found, return -1
    """
    s,e = 0, len(array)
    while s < e:
        m = (s+e)//2
        if array[m] == target:
            return m
        elif array[m] > target:
            e = m
        elif array[m] < target:
            s = m+1
        # only check when another iteration will run; s >= e means the loop is done
        assert s >= e or m != (s+e)//2, "we didn't move our midpoint"
    return -1

- The assert statement at the end of the loop body checks our loop invariant: every iteration must update the start cursor s or the end cursor e so that the midpoint m moves in the next iteration. It catches mistakes in the logic that updates s and e.

- If we replaced s = m+1 with s = m in the last elif branch and used the function on certain targets that do not exist in the array, the assert statement will warn us about this bug.
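
- The scenario just described can be reproduced with a small sketch. The function name buggy_binary_search() is made up for this illustration, and its assertion is written with an extra s >= e guard so that it is only evaluated when the loop is going to run again. Without the assertion, the call below would loop forever.

```python
def buggy_binary_search(array, target):
    s, e = 0, len(array)
    while s < e:
        m = (s + e) // 2
        if array[m] == target:
            return m
        elif array[m] > target:
            e = m
        else:
            s = m  # BUG on purpose: this should be m + 1
        # Loop invariant: if we iterate again, the midpoint must have moved
        assert s >= e or m != (s + e) // 2, "we didn't move our midpoint"
    return -1

try:
    buggy_binary_search([1, 3, 5], 4)  # 4 is not in the array
except AssertionError as err:
    print("Caught:", err)
```

- The search gets stuck with s = 1 and e = 2, where the midpoint stays at 1, and the assertion reports the problem instead of letting the loop spin.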

### Guard Rails and Offensive Programming

- Python comes with a built-in NotImplementedError exception. This is useful for what we call offensive programming.

- While input sanitation helps align the input to a format that our code expects, sometimes it is not easy to sanitize everything, or doing so is inconvenient for our future development.

- One example is the following, in which we define a registering decorator and some functions:

In [None]:
import math

REGISTRY = {}

def register(name):
    def _decorator(fn):
        REGISTRY[name] = fn
        return fn
    return _decorator

@register("relu")
def rectified(x):
    return x if x > 0 else 0

@register("sigmoid")
def sigmoid(x):
    return 1/(1 + math.exp(-x))

def activate(x, funcname):
    if funcname not in REGISTRY:
        raise NotImplementedError(f"Function {funcname} is not implemented")
    else:
        func = REGISTRY[funcname]
        return func(x)

print(activate(1.23, "relu"))
print(activate(1.23, "sigmoid"))
print(activate(1.23, "tanh"))

- We raised NotImplementedError with a custom error message in our function activate().

- We can raise NotImplementedError in places where the condition is not entirely invalid, but we are simply not ready to handle those cases yet.

- This is useful when we gradually develop our program, in which we implement one case at a time and address some corner cases later.

- The principle here is that you should never let an anomaly proceed silently, because your algorithm will not behave correctly and may sometimes have dangerous effects.
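
- From the caller's side, the loud failure can then be handled explicitly. The sketch below repeats a trimmed version of the registry (only "relu" is registered) so that it runs on its own; the loop and the message text are assumptions for illustration.

```python
REGISTRY = {}

def register(name):
    def _decorator(fn):
        REGISTRY[name] = fn
        return fn
    return _decorator

@register("relu")
def rectified(x):
    return x if x > 0 else 0

def activate(x, funcname):
    if funcname not in REGISTRY:
        raise NotImplementedError(f"Function {funcname} is not implemented")
    return REGISTRY[funcname](x)

for name in ["relu", "tanh"]:
    try:
        print(name, "->", activate(1.23, name))
    except NotImplementedError as err:
        # The unsupported case is reported instead of silently ignored
        print("Not supported yet:", err)
```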

### Good Practices to Avoid Bugs

- First is the use of the functional paradigm. While Python has constructs that allow us to write an algorithm in functional syntax, the principle behind functional programming is that function calls produce no side effects.

- The “no side effect” principle is powerful in avoiding a lot of bugs, since a function that changes nothing outside itself cannot change something by mistake.

- In particular, we should be careful if the argument to our function is a mutable object.

In [None]:
def func(a=[]):
    a.append(1)
    return a

- It is trivial to see what this function does. However, when we call it without any argument, the default is used and it returns [1]. When we call it again, the same default object is used and it returns [1, 1]. This is because the list [] in the function declaration is created only once, when the function is defined, and is then reused as the default value for argument a. When we append a value to it, this object is mutated, and the next call to the function sees the mutated object.
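
- A common way around this is to use None as the default and create the list inside the function. The sketch below contrasts the two versions; the names func_shared() and func_fresh() are made up for this illustration.

```python
def func_shared(a=[]):
    # The same list object is reused on every call that omits `a`
    a.append(1)
    return a

def func_fresh(a=None):
    # None acts as a sentinel: a new list is built on each call
    if a is None:
        a = []
    a.append(1)
    return a

print(func_shared())  # [1]
print(func_shared())  # [1, 1]
print(func_fresh())   # [1]
print(func_fresh())   # [1]
```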


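- When a function stores a mutable argument, it may be appropriate to make a copy of it first. For example, the following code appends the same dict object to LOGS three times, so every record ends up showing the last name, “Charlie”: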

In [None]:
LOGS = []

def log(action):
    LOGS.append(action)
    
data = {"name": None}
for n in ["Alice", "Bob", "Charlie"]:
    data["name"] = n
    ...  # do something with `data`
    log(data)  # keep a record of what we did

- Making a deep copy inside log() keeps each record independent of later changes to data:

In [None]:
import copy

def log(action):
    copied_action = copy.deepcopy(action)
    LOGS.append(copied_action)
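
- The difference between the two versions of log() can be seen side by side in the sketch below. To keep it self-contained, the log list is passed in explicitly instead of using the global LOGS, and the helper names run(), log_by_reference(), and log_by_copy() are made up for this illustration.

```python
import copy

def log_by_reference(logs, action):
    logs.append(action)                 # stores the same dict every time

def log_by_copy(logs, action):
    logs.append(copy.deepcopy(action))  # stores an independent snapshot

def run(log_fn):
    logs = []
    data = {"name": None}
    for n in ["Alice", "Bob", "Charlie"]:
        data["name"] = n
        log_fn(logs, data)
    return [entry["name"] for entry in logs]

print(run(log_by_reference))  # ['Charlie', 'Charlie', 'Charlie']
print(run(log_by_copy))       # ['Alice', 'Bob', 'Charlie']
```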

- The other technique to avoid bugs is not to reinvent the wheel. In Python, we have a lot of nice containers and optimized operations. You should never try to create a stack data structure yourself since a list supports append() and pop(). 

- Similarly, if you need a queue, we have deque in the collections module from the standard library. 

- Python doesn’t come with a balanced search tree or linked list. But the dictionary is highly optimized, and we should consider using the dictionary whenever possible.

- The standard library has a json module, so we shouldn’t write our own JSON parser.

- If we need a numerical algorithm, we should first check whether NumPy already provides one.
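
- As a quick illustration of these built-in tools, the sketch below uses a list as a stack, a deque as a queue, and the json module for serialization. The variable names and the sample data are arbitrary.

```python
from collections import deque
import json

# A list works as a stack: last in, first out
stack = []
stack.append(1)
stack.append(2)
top = stack.pop()        # 2

# A deque works as a queue: first in, first out
queue = deque()
queue.append(1)
queue.append(2)
first = queue.popleft()  # 1

# Round-trip a plain data structure through JSON
record = {"stack": stack, "queue": list(queue)}
text = json.dumps(record)
restored = json.loads(text)

print(top, first)          # 2 1
print(restored == record)  # True
```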

- Another way to avoid bugs is to use better logic. An algorithm with a lot of loops and branches would be hard to follow and may even confuse ourselves. 

- It would be easier to spot errors if we could make our code clearer. For example, a function that checks whether the upper triangular part of a matrix contains any negative number could look like this:


In [None]:
def neg_in_upper_tri(matrix):
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    for i in range(n_rows):
        for j in range(n_cols):
            if i > j:
                continue  # we are not in upper triangular
            if matrix[i][j] < 0:
                return True
    return False

- But we can also use a Python generator to break this into two functions:

In [None]:
def get_upper_tri(matrix):
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    for i in range(n_rows):
        for j in range(n_cols):
            if i > j:
                continue  # we are not in upper triangular
            yield matrix[i][j]

def neg_in_upper_tri(matrix):
    for element in get_upper_tri(matrix):
        if element < 0:  # the generator already yields single elements
            return True
    return False

- If the function is more complicated, separating the nested loop into a generator may help make the code more maintainable.
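
- Once the traversal lives in a generator, the check itself can shrink further. The sketch below is one possible variation, not the only way to write it: the inner loop starts at the diagonal instead of skipping with continue, and the built-in any() replaces the explicit loop.

```python
def get_upper_tri(matrix):
    # Yield the elements on or above the main diagonal, row by row
    for i, row in enumerate(matrix):
        for j in range(i, len(row)):
            yield row[j]

def neg_in_upper_tri(matrix):
    # any() stops at the first negative element it sees
    return any(element < 0 for element in get_upper_tri(matrix))

print(neg_in_upper_tri([[1, 2], [-3, 4]]))  # False: -3 is below the diagonal
print(neg_in_upper_tri([[1, -2], [3, 4]]))  # True
```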

- Finally, consider adopting a coding style for your project. Having a consistent way to write code is the first step in offloading some of your mental burdens later when you read what you have written. 

## Multiprocessing in Python

- Multiprocessing can make a program substantially more efficient by running multiple tasks in parallel instead of sequentially. A similar term is multithreading, but they are different.

- A process is a program loaded into memory to run and does not share its memory with other processes. A thread is an execution unit within a process. Multiple threads run in a process and share the process’s memory space with each other.

- In the standard CPython build, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, which means multithreading brings no speedup for work that runs in the interpreter itself. This is what gives multiprocessing an upper hand over threading in Python.

- Multiple processes can be run in parallel because each process has its own interpreter that executes the instructions allocated to it. 

- Also, the OS would see your program in multiple processes and schedule them separately, i.e., your program gets a larger share of computer resources in total. So, multiprocessing is faster when the program is CPU-bound. 

### Basic multiprocessing 

- Let’s look at this function, task(), that sleeps for 0.5 seconds and prints before and after the sleep:

In [27]:
import time

def task():
    print('Sleeping for 0.5 seconds')
    time.sleep(0.5)
    print('Finished sleeping')

- To create a process, we simply say so using the multiprocessing module:

In [29]:
import multiprocessing
p1 = multiprocessing.Process(target=task)
p2 = multiprocessing.Process(target=task)

- The target argument to Process() specifies the function that the process runs. These processes do not run until we start them, and join() then waits for them to finish:

In [31]:
p1.start()
p2.start()
p1.join()
p2.join()

- A complete concurrent program would be as follows:

In [None]:
import multiprocessing
import time

def task():
    print('Sleeping for 0.5 seconds')
    time.sleep(0.5)
    print('Finished sleeping')

if __name__ == "__main__": 
    start_time = time.perf_counter()
    processes = []

    # Creates 10 processes then starts them
    for i in range(10):
        p = multiprocessing.Process(target = task)
        p.start()
        processes.append(p)
    
    # Joins all the processes 
    for p in processes:
        p.join()

    finish_time = time.perf_counter()

    print(f"Program finished in {finish_time-start_time} seconds")

- We must fence our main program under if __name__ == "__main__", or otherwise the multiprocessing module may complain. On platforms where a child process starts by importing the main module, this guard keeps the child from running the process-creating code again.

- We need to call join() on every process so that they all finish before the elapsed time is printed.

- This is because the main process runs alongside the child processes. With two children p1 and p2, for example, three processes are going on: p1, p2, and the main process. The main process is the one that keeps track of the time and prints the time taken to execute.

- The join() function makes the calling process wait until the process on which join() was called is complete.

- The line that sets finish_time should therefore run no earlier than the moment all child processes are finished.
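
- What join() guarantees can be observed through the process object itself. In the sketch below, the helper run_and_join() is a made-up name; is_alive() and exitcode are attributes of multiprocessing.Process.

```python
import multiprocessing
import time

def task():
    time.sleep(0.2)

def run_and_join():
    p = multiprocessing.Process(target=task)
    p.start()
    p.join()  # block here until the child process has exited
    return p.is_alive(), p.exitcode

if __name__ == "__main__":
    alive, exitcode = run_and_join()
    print(alive, exitcode)  # False 0
```

- After join() returns, the child is no longer alive and its exit code is available, which is why any timing taken after join() includes the child's full run.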

### Multiprocessing for Real Use

- Starting a new process and then joining it back to the main process is how multiprocessing works in Python (as in many other languages).

- The reason we want to use multiprocessing is usually to execute many tasks concurrently for speed.

- Let’s consider a function:

In [None]:
def cube(x):
    return x**3

- If we want to run it with arguments 1 to 999, we can create 999 processes and run them in parallel:

In [None]:
import multiprocessing

def cube(x):
    return x**3

if __name__ == "__main__":
    # this does not work
    processes = [multiprocessing.Process(target=cube, args=(x,)) for x in range(1,1000)]
    [p.start() for p in processes]
    result = [p.join() for p in processes]
    print(result)

- However, this will not work. You probably have only a handful of cores in your computer, so running close to 1,000 processes creates too much overhead and can overwhelm the capacity of your OS or exhaust your memory. In addition, p.join() always returns None, so the values returned by cube() are never collected into result.

- The better way is to run a process pool to limit the number of processes that can be run at a time:

In [None]:
import multiprocessing
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    pool = multiprocessing.Pool(3)
    start_time = time.perf_counter()
    processes = [pool.apply_async(cube, args=(x,)) for x in range(1,1000)]
    result = [p.get() for p in processes]
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

- The argument for multiprocessing.Pool() is the number of processes to create in the pool. If omitted, Python will make it equal to the number of cores you have in your computer.

- We use the apply_async() function to pass the arguments to the function cube in a list comprehension. This will create tasks for the pool to run. 

- It is called “async” (asynchronous) because we do not wait for the task to finish, and the main process may continue to run. Therefore, apply_async() does not return the result itself but an object whose get() method we can call to wait for the task to finish and retrieve the result.

- Since we get the result in a list comprehension, the order of the result corresponds to the arguments we created in the asynchronous tasks. However, this does not mean the processes are started or finished in this order inside the pool.

- If you think writing lines of code to start processes and join them is too explicit, you can consider using map() instead:

In [None]:
import multiprocessing
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    pool = multiprocessing.Pool(3)
    start_time = time.perf_counter()
    result = pool.map(cube, range(1,1000))
    finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

- We don’t have the start and join here because they are hidden behind the pool.map() function. It splits the iterable range(1,1000) into chunks and runs each chunk in the pool. The map function is a parallel version of the list comprehension.

- But the modern-day alternative is to use map from concurrent.futures, as follows:

In [None]:
import concurrent.futures
import time

def cube(x):
    return x**3

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor(3) as executor:
        start_time = time.perf_counter()
        result = list(executor.map(cube, range(1,1000)))
        finish_time = time.perf_counter()
    print(f"Program finished in {finish_time-start_time} seconds")
    print(result)

- This code is running the multiprocessing module under the hood. The beauty of doing so is that we can change the program from multiprocessing to multithreading by simply replacing ProcessPoolExecutor with ThreadPoolExecutor.

- Of course, you have to consider whether the global interpreter lock is an issue for your code.
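
- The swap mentioned above might look like the sketch below. Only the executor class changes compared with the process-based version; because threads share the interpreter, no speedup should be expected for a CPU-bound function like cube(), so this mainly shows the interface.

```python
import concurrent.futures

def cube(x):
    return x**3

# Same call pattern as with ProcessPoolExecutor, but with 3 worker threads
with concurrent.futures.ThreadPoolExecutor(3) as executor:
    result = list(executor.map(cube, range(1, 1000)))

print(result[:5])  # [1, 8, 27, 64, 125]
```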

### Using joblib

- The package joblib is a set of tools to make parallel computing easier. It is a common third-party library for multiprocessing. 

- It also provides caching and serialization functions. 

- We can convert our previous example into the following to use joblib:

In [None]:
import time
from joblib import Parallel, delayed

def cube(x):
    return x**3

start_time = time.perf_counter()
result = Parallel(n_jobs=3)(delayed(cube)(i) for i in range(1,1000))
finish_time = time.perf_counter()
print(f"Program finished in {finish_time-start_time} seconds")
print(result)

- It is fairly intuitive to see what this does. The delayed() function wraps another function to make a “delayed” version of the function call, which means the function is not executed immediately when it is called.

- Then we call the delayed function multiple times with different sets of arguments we want to pass to it. For example, when we give integer 1 to the delayed version of the function cube, instead of computing the result, we produce a tuple, (cube, (1,), {}) for the function object, the positional arguments, and keyword arguments, respectively.

- We created the engine instance with Parallel(). When it is invoked like a function with the list of tuples as an argument, it will actually execute the job as specified by each tuple in parallel and collect the result as a list after all jobs are finished. Here we created the Parallel() instance with n_jobs=3, so there will be three processes running in parallel.

- We can also write the tuples directly. Hence the code above can be rewritten as:

In [None]:
result = Parallel(n_jobs=3)((cube, (i,), {}) for i in range(1,1000))
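
- To see why the delayed() form and the tuple form are interchangeable, the idea can be imitated in plain Python. The function my_delayed() below is a hypothetical, simplified stand-in written for this illustration; it is not joblib's actual implementation.

```python
def my_delayed(fn):
    # Hypothetical stand-in for joblib.delayed:
    # capture the call instead of performing it
    def capture(*args, **kwargs):
        return fn, args, kwargs
    return capture

def cube(x):
    return x**3

job = my_delayed(cube)(3)
print(job == (cube, (3,), {}))  # True: nothing has been computed yet

# The engine would later unpack the tuple and make the real call
fn, args, kwargs = job
print(fn(*args, **kwargs))      # 27
```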

- The benefit of using joblib is that we can run the code with multiple threads instead by simply adding an argument:

In [None]:
result = Parallel(n_jobs=3, prefer="threads")(delayed(cube)(i) for i in range(1,1000))

- And this hides all the details of running functions in parallel. We simply use a syntax not too much different from a plain list comprehension.