<a href="https://colab.research.google.com/github/noviaayup/novice1/blob/master/Bonus%20Exercise%20-%20Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We will use the Colab Python computational environment for the experiments in this subject. Refer to [this nice introduction](https://colab.research.google.com/notebooks/intro.ipynb#) for a comprehensive overview of the environment as well as further references on doing machine learning with Python in Colab.

This notebook gets us familiar with using Python to process data. We will dive into an example of building predictors for a dataset. We will explain nontrivial Python features as we run into them, and most of the Python code is either self-explanatory or commented for beginners. If you are confused about the Python language at some point and are looking for a more "linear" introduction, check out (and bookmark) [this tutorial](https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/python-colab.ipynb) as a reference to the language.

HINT: Because you are now viewing a local copy of this notebook in your Google Drive, feel free to edit and play with the notes and code. This is __your__ notebook. _Double Click Here_ to see how to take notes with this convenient tool, including embedding [links](https://www.markdowntutorial.com/) and inserting math symbols like $x$, $y$ or a function $f: \mathbf X \rightarrow Y$.

# A taste of building a predictor from data

We start with our old friend, the Iris dataset. (You should have encountered it in a prerequisite data analytics subject. If you have not, don't panic: this example will walk you through the data and the task.) We are given a set of iris flower samples, each described by four attributes that capture the shape of the flower. The task is to determine the species of the flower from these shape attributes.

## A small but practical dataset: Iris flowers

The dataset is a textbook example and is widely used in machine learning. Instead of downloading it from the Internet and performing pre-processing, we will use a prepared version of the dataset provided by `scikit-learn`, a machine learning library for Python.

> The Python backend of Colab (called a "kernel") is a comprehensive version with many useful libraries installed, including `scikit-learn`. Be aware that the library is known within Python as `sklearn`: the package installer and the Python `import` statement "know" the same library by different names.
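
The difference between the two names can be checked from Python itself. The sketch below is only an illustration: it uses the standard-library function `importlib.util.find_spec`, which reports whether a name can be found for import without actually importing it. The helper name `can_import` is made up for this example.

```python
import importlib.util

def can_import(module_name):
    """Return True if Python can find a module under this name."""
    return importlib.util.find_spec(module_name) is not None

# "math" ships with Python, so it is always found.
print("math:", can_import("math"))

# "sklearn" is the import name. This should print True on Colab, where the
# library is pre-installed, and False on a machine where it is missing.
print("sklearn:", can_import("sklearn"))

# "scikit-learn" is the name used when installing the library. It contains a
# hyphen, so `import scikit-learn` is not even valid Python syntax, and no
# module is normally found under that name.
print("scikit-learn:", can_import("scikit-learn"))
```

If the second line prints False on your own machine, the library is probably not installed in that Python environment.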



In [None]:
# You can tell Python to use a library by importing it.
# Try
print("Hello, world.")
print("The ratio of a circle's circumference to its diameter is", pi)
# And try to interpret why you got an error.

In [None]:
# Now use the math library, as it knows pi
import math
print("The ratio of a circle's circumference to its diameter is", math.pi)
# Note the "math." before pi, since the identifier pi is defined within 
# "math"'s namespace. You can try to uncomment the following statement
# and re-execute this code block to see the difference.

# print("The ratio of a circle's circumference to its diameter is", pi)
#


In [None]:
# Alternatively, we can import only a part of a library into the main namespace.
# This is often useful when the library is big and complex.
# 
# E.g. if you had tried the previous direct use of `pi` and got an error, try
# the following:
from math import pi
print(pi)
# This is what we will do in the next block to use the Iris data preparation 
# function of the scikit-learn library.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
# `iris` is a dictionary-like object: a container with a "key-value"
# structure. The value stored under a key in a dictionary d is accessed by
# d[KEY]
# where KEY is the corresponding key.
# -- How on earth do I know what `load_iris` will produce?
# -- By reading the documentation.
# Fortunately Colab provides a handy way to read it: add a new code block,
# type
# load_iris?
# and execute it (do not miss the question mark "?").

X = iris['data']   # key names are strings, which can be quoted using either
y = iris['target'] # double or single quotation marks. This makes things
# easier when the string itself contains a quotation mark, as in the earlier
# statement that printed *... circle's ...*.

# HINT: the above is equivalent to (and more often written as)
# X, y = iris['data'], iris['target']
# search for "tuple" and "unpacking" in Python for further ref.

# Now let us check the data -- in Python you can `print` almost anything,
# so when you encounter an unknown object, the first step of investigation
# is often to print it out.
print("Data Attributes")
print(X)
print("Targets")
print(y)

# HINT: print-play in Python
# `print` even works for objects that you might not consider printable.
# Try putting the following, uncommented, in an extra code block:
#
# print(load_iris)
#
# The printed text is not very forthcoming, so then try
#
# print(dir(load_iris))
#
# Here we have a long list of mysteriously named attributes of the
# `load_iris` object. An often informative one is `__doc__`, the
# documentation string of an object; by convention the developer puts
# information that she wants the users to know here.
#
# print(load_iris.__doc__)
#
# Does the print-out look familiar?
#
# One more thing: there is also a `__call__` in the list. It just means that
# this object is "callable" and thus a "function" in the common sense:
# you can "call" it by writing "load_iris" followed by "()".
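
The same print-play works on any function, including one you write yourself. The following is a minimal sketch with a made-up function `circle_area`; `callable`, `dir` and `type` are Python built-ins.

```python
import math

def circle_area(radius):
    """Return the area of a circle with the given radius."""
    return math.pi * radius ** 2

print(callable(circle_area))            # True: a function can be called
print(callable(3.14))                   # False: a number cannot
print("__call__" in dir(circle_area))   # True: this is what makes it callable
print(circle_area.__doc__)              # the documentation string written above
print(type(circle_area))                # <class 'function'>
```

The documentation string printed here is the text placed in triple quotes right under the `def` line, which is the same place library authors put the text shown by `load_iris?`.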

## Working with data arrays

In addition to the `dir` function, we can also investigate an object in Python by looking at its type.

In [None]:
print(type(X)) # numpy.ndarray

We found that the type of the data is `numpy.ndarray`, an n-dimensional array managed by the library `numpy`. `numpy` has been imported internally by `scikit-learn`. We can also import it explicitly to use its functionality by
`import numpy`, or with a shortcut:

In [None]:
import numpy as np

NumPy wraps a set of highly efficient low-level numerical libraries and provides a high-level API to them, so the basic operations provided by `numpy` are very useful in dealing with data. Similar array-style interfaces are found in many other high-performance computing tools, including R, PyTorch, TensorFlow and Matlab. You can find a general introduction to `numpy` in the Python tutorial linked at the beginning of this notebook.
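
To give a feel for what "high-level API" means here, the sketch below computes the same mean twice: once with a plain Python loop and once with `numpy`. The numbers in `lengths` are made up for illustration and are not taken from the Iris data.

```python
import numpy as np

# Hypothetical measurements (not taken from the Iris data).
lengths = [5.0, 4.5, 6.0, 5.5]

# Pure Python: visit the elements one by one.
total = 0.0
for value in lengths:
    total += value
loop_mean = total / len(lengths)

# numpy: convert the list once, then let the library do the work.
arr = np.array(lengths)
numpy_mean = arr.mean()

print(loop_mean, numpy_mean)   # both are 5.25
print(arr * 2)                 # arithmetic applies to every element at once
```

On four numbers the two approaches are equally fast; the difference tends to show up on large arrays, where the `numpy` version avoids running a Python-level loop.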

In [None]:
# Let us check the "shape" of our data array, a basic piece of
# information about the data
print(np.shape(X))
# `.shape` is also available as an attribute of the numpy array object itself
# (not to be confused with a data attribute!)
print(X.shape) # this form is more common

Our data space $\mathcal X$ consists of all objects that are described by 4 numbers; we call it $\mathbb R^4$. A dataset from this data space is a 2-D array (table) of $n$ rows (the number of samples) and 4 columns. Arrays are ubiquitous in data science. In Python, numerical arrays are usually managed by numerical computation libraries, in this case `numpy`. You can inspect the data via the array operations provided by `numpy`.
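
As a small illustration of "a table of $n$ rows and 4 columns", the sketch below builds such an array by hand. The three samples in `toy_X` are invented values, not rows of the Iris data.

```python
import numpy as np

# Three hypothetical samples, each described by 4 numbers.
toy_X = np.array([
    [5.0, 3.5, 1.5, 0.2],
    [6.0, 3.0, 4.5, 1.5],
    [7.0, 2.5, 6.0, 2.5],
])

print(toy_X.ndim)          # 2: a 2-D array (table)
print(toy_X.shape)         # (3, 4): n = 3 rows, 4 columns
n, d = toy_X.shape         # unpack the shape into two variables
print(n, d)                # 3 4
print(toy_X.mean(axis=0))  # one mean per column, i.e. per attribute
```

`axis=0` tells `numpy` to average down the rows, so the result has one entry per column.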



In [None]:
# By further checking the shape of the targets,
# we confirm that there are 150 samples (with 4 attributes per sample).
print(y.shape)

As a running example, we will simplify the problem by 
1. using only the first two attributes of the data, and 
2. keeping only the samples that belong to the first two classes, which makes the problem a binary classification one: to classify setosa and versicolor (y==0 or y==1).


In [None]:
# Indexing-1: Basic

# Elements in an array are accessed as follows:
print(y[0]) # the first element, Python (and numpy) arrays start with element-0
print(y[3]) # the 4-th element
print(X[2, 1]) # by 2: take the 3-rd row, then by 1: take the 2nd element
# check the results by 
# 1. print out the entire X and y, 
# 2. copy-and-paste the text to a separate file
# 3. compare the element values

In [None]:
# Indexing-2: slicing

print(y[1:4]) # 3 elements, the element-1, 2, 3 (non-inclusive of 4)
print(y[0:3]) # 3 elements, the element-0, 1, 2 (non-inclusive of 3)
print(y[:3])  # the same 3 elements, you can omit 0 if starting from there
print(X[1, :2]) # the first 2 elements in row-1 (2nd row) of X
print(X[1, 2:4]) # the element 2, 3 in row-1 of X
print(X[1, 2:]) # the same two elements, you can omit the "end" (like 0 at the start)
print(X[1, :]) # if you omit both, you mean the entire row
print(X[1]) # the same as above, the entire 2nd-row (row-1)
print(X.shape) # (150, 4)
print(X[1].shape) # one row of X: a 1-dimensional array, (4,)
# An intuitive way to understand this is to think of one dimension as one
# degree of freedom to move about in an array -- in a 2-D array you can go in
# the direction of either rows or columns, while in a 1-D array you can only
# move in one direction -- by fixing one dimension to a specific value, the
# degrees of freedom reduce by one.
print(X[5:10, 0]) # the first attribute (column) of sample (rows) 5, 6, 7, 8, 9
print(X[5:10, 0].shape) # one-D array, 1-D array has no meaningful col/row
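
The "degrees of freedom" idea can be seen directly in the shapes. The sketch below uses a made-up 3 x 4 array `A` instead of the Iris data: indexing with a single integer removes a dimension, while a slice of length one keeps it.

```python
import numpy as np

# A hypothetical 3 x 4 array holding the numbers 0..11.
A = np.arange(12).reshape(3, 4)

row_as_1d = A[1]      # integer index: the row dimension disappears
row_as_2d = A[1:2]    # slice of length 1: the row dimension is kept

print(row_as_1d.shape)   # (4,)
print(row_as_2d.shape)   # (1, 4)
print(A[:, 0].shape)     # (3,)   one column as a 1-D array
print(A[:, 0:1].shape)   # (3, 1) the same column, still 2-D
```

Which form is preferable depends on what the next step expects; some functions require a 2-D table even when it has a single row or column.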


In [None]:
# Indexing-3: "advanced"

print(X[0, [0, 1, 2]]) # the 0, 1, 2 elements in row-0
print(X[0, [0, 2]]) # the 0, 2 elements in row-0 -- now you can cherry-pick
# This comes at a cost; refer to "advanced indexing" in the numpy docs for more info.
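
One part of that cost is copying: basic slicing returns a view onto the original memory, whereas advanced indexing builds a new array. The sketch below shows the difference on a made-up array `A`, using `np.shares_memory` to check.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)   # hypothetical data

sliced = A[0, 0:3]        # basic slicing: a view onto A's memory
picked = A[0, [0, 1, 2]]  # advanced indexing: a new array with copied data

print(np.shares_memory(A, sliced))  # True
print(np.shares_memory(A, picked))  # False

sliced[0] = 100   # writing through the view changes A
picked[1] = -1    # writing to the copy does not
print(A[0])       # first element is now 100, the rest are unchanged
```

A practical consequence is that modifying a slice can silently change the original data; calling `.copy()` on the slice avoids this when it is not wanted.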

In [None]:
# Indexing-4: Binary
X_tmp = X[:6] # first 6 rows
print(X_tmp)
a0_tmp = X_tmp[:, 0] # first attribute of the 6 samples
print("a0", a0_tmp)
print("a0 shape", a0_tmp.shape)

binary_index = np.array([True, False, False, True, True, True]) 
print(a0_tmp[binary_index]) # the binary index has the same length as the
# array dimension being indexed

print(X_tmp[binary_index]) # X_tmp has 6 rows, so its rows can also be
# selected with the same binary array.
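
In practice a binary index is rarely typed by hand; it usually comes from a comparison. A minimal sketch, with invented values in `a0` standing in for one attribute of six samples:

```python
import numpy as np

# Hypothetical values standing in for one attribute of six samples.
a0 = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4])

mask = a0 >= 5.0     # element-wise comparison gives a binary (boolean) array
print(mask)          # True where the condition holds, False elsewhere
print(a0[mask])      # keeps only the entries where mask is True
print(mask.sum())    # 3: True counts as 1, so this counts the matches
```

The threshold 5.0 is arbitrary here; any condition that yields one True/False value per element can be used the same way.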

With this, let us take the first two attributes from the data:

In [None]:
X0 = X[:, 0]
# [Exercise] Insert your code here
# Define X1 as the 2nd attribute of the flowers.

As we have discussed, one important condition for adopting learning-based modelling is that there must be some relationship between the observed attributes and the targets to be predicted. To get a first impression, we use the visualisation library `matplotlib` to display a scatter plot of the samples (scattered w.r.t. the first two attributes) with colours showing their classes.

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X0, X1, c=y) # use the y (target) values as colours. Which specific
# colour is assigned to class 0/1/2 is determined internally; the colours
# serve only as tags to distinguish the classes and carry no real meaning.
# You can specify the colour correspondence yourself: refer to the `matplotlib`
# documentation and search for "colormap" if interested.

Next, we do the 2nd simplification: keeping only the first two classes of flowers.

In [None]:
print(y == 0) # a numpy array treats "==" as an element-wise comparison
# and returns a binary (boolean) array

two_class_index = np.logical_or(y == 0, y == 1) # the logical OR of
# two binary arrays; it is the same as
#
# (y == 0) + (y == 1) # which is simpler to write
#
# as numpy defines "+" for binary arrays as element-wise logical OR.

# Now we are ready to prepare our simplified task:
simple_X = X[two_class_index]
simple_X = simple_X[:, 0:2]
simple_y = y[two_class_index]
print(simple_X.shape, simple_y.shape)
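
There are several equivalent ways to write the "class 0 or class 1" selection. The sketch below compares them on a short made-up label array `labels`; `np.isin` and the `|` operator are standard `numpy` alternatives that the cell above does not use.

```python
import numpy as np

labels = np.array([0, 1, 2, 1, 0, 2])   # hypothetical class labels

with_function = np.logical_or(labels == 0, labels == 1)
with_plus = (labels == 0) + (labels == 1)
with_pipe = (labels == 0) | (labels == 1)   # element-wise OR operator
with_isin = np.isin(labels, [0, 1])         # membership test

print(with_function)
print(np.array_equal(with_function, with_plus))   # True
print(np.array_equal(with_function, with_pipe))   # True
print(np.array_equal(with_function, with_isin))   # True
```

Note the parentheses around each comparison: `|` binds more tightly than `==`, so leaving them out changes the meaning of the expression.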

## Building learning-based models from scratch

Now we are ready to go. Let us try building a predictor $\mathcal X \mapsto \mathcal Y$. First, let us examine our assumption (again) that the link between the observable data attributes and the target does exist.

In [None]:
plt.scatter(simple_X[:, 0], simple_X[:, 1], c=simple_y)

In Python, the mechanism that produces an output from an input is implemented as a function.



In [None]:
# An example function
def predict_flower_class(x0, x1):
    # Just by looking at the figure above, I come up with some rules, such as:
    # if x0 > 6.5, the flower is versicolor
    if x0 > 6.5:
        predicted_class = 1
    else:
        predicted_class = 0
    return predicted_class
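
The function above handles one flower at a time. As a sketch only, the same threshold rule could also be applied to a whole array in one step with `np.where`; the values in `x0_values` are made up for illustration.

```python
import numpy as np

# Hypothetical first-attribute values for five flowers.
x0_values = np.array([5.0, 6.6, 7.0, 6.5, 4.8])

# The same rule as above, "class 1 if x0 > 6.5, otherwise class 0",
# evaluated for every sample at once.
predicted = np.where(x0_values > 6.5, 1, 0)
print(predicted)   # [0 1 1 0 0]
```

The value 6.5 itself falls into class 0, because the rule uses a strict "greater than".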

In [None]:
# A direct check of the performance: print each prediction next to the true label
for i in range(len(simple_y)):
    x0 = simple_X[i, 0]
    x1 = simple_X[i, 1]
    print(predict_flower_class(x0, x1), simple_y[i])

In [None]:
# [Exercise] Make your own predictor and compute its accuracy
def predict_flower_class_2(x0, x1):
    # Insert your code here
    return predicted_class

# Compute accuracy
corr_num = 0
for i in range(100):
    x0 = simple_X[i, 0]
    x1 = simple_X[i, 1]
    if predict_flower_class_2(x0, x1) == simple_y[i]:
        # Insert your code here
print("Accuracy", corr_num/100)