# INFO411 Lab 1 - Getting Started with Numpy and Jupyter

## Part A. Introduction

This lab is for us to get familiar with NumPy scripting and the Jupyter notebook environment.

NumPy makes it easy to manipulate arrays and matrices, and the Matplotlib plotting utilities are very useful too. 

We are using Jupyter Notebook, an interactive environment based on IPython - a high-performance command shell for Python. If you want to install IPython on your own computer, please choose the Python 3.7 version from 
[Anaconda](https://www.anaconda.com/distribution/). 

To proceed, read along and run the code cells along the way. If you want to add comments, either click into a text cell, or use the "+" button under the menu bar and change the cell format to "Raw NBConvert", or "Markdown". Here is a [cheatsheet for Markdown](https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed). 

Complete the scripts (for "tasks") and verify their outcome. Submit your notebook (named as Lab1_YourName.ipynb) by emailing jeremiah.deng@otago.ac.nz, with a subject line "INFO411 Lab1 submission". 

In [None]:
print("Hello World!")

The Python shell, functioning as a command-line interpreter, will respond. The following two lines should be rather straightforward:

In [None]:
# Change a to 11, 111, 1111, ... in turn; each time, run this cell and then the next one
a = 1

In [None]:
a**2      # or a*a

Or, we can use a "for" loop to iterate through a list and generate the outcome:

In [None]:
# iterate through a list
for a in [1, 11, 111, 1111, 11111, 111111]:
    print(a*a)

In [None]:
# Or, using "range(9)" to cover 0,1,2,...,8
a=1
for i in range(9):
    print(a*a)
    a = 10*a + 1 

There seems to be an interesting pattern, doesn't there? 
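The squares of 1, 11, 111, ... can also be checked programmatically. The sketch below is not part of the original lab, and the helper name `repunit` is made up for illustration. It builds the numbers made of one to nine ones and tests whether each square reads the same forwards and backwards.

```python
def repunit(n):
    """Return the integer written as n ones, e.g. repunit(3) == 111."""
    return int("1" * n)

# Squares of 1, 11, 111, ..., 111111111 (one to nine ones)
squares = [repunit(n) ** 2 for n in range(1, 10)]

# A number is a palindrome if its digit string equals its reverse
is_palindrome = [str(s) == str(s)[::-1] for s in squares]

for s, ok in zip(squares, is_palindrome):
    print(s, ok)
```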

### Arrays

Next, we try out number sequences called 'arrays' in NumPy. First we import the 'numpy' library under the conventional short alias 'np':

In [None]:
import numpy as np

We can then use 'np' to refer to various numpy functions. Now we define an array "v":

In [None]:
v=np.array([1, 2, -1, -1])

We can access array elements by their indices, just like using a Python list:

In [None]:
v[1]         # 2nd element

In [None]:
v[-1]        # last element, same as in standard Python

Some simple operations on the array are straightforward:

In [None]:
sum(v)

In [None]:
v+2

In [None]:
v*2
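For comparison, here is a small self-contained sketch of the same operations, set against what a plain Python list does. The variable names `shifted`, `doubled`, `total` and `as_list` are only for illustration.

```python
import numpy as np

v = np.array([1, 2, -1, -1])

shifted = v + 2      # the scalar is added to every element
doubled = v * 2      # every element is multiplied by 2
total = np.sum(v)    # NumPy's own sum; v.sum() is equivalent

# A plain list behaves differently: "* 2" repeats the list
as_list = [1, 2, -1, -1] * 2

print(shifted, doubled, total)
print(as_list)
```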

As you can see, manipulating arrays is quite easy in NumPy. 
How about this multiplication between v and a new array x?

In [None]:
x=np.array([1, 1, -1, -1])

In [None]:
x*v      #element-wise product

and run the next line too:

In [None]:
sum(x*v)      

Indeed, the last operation gives us the dot product of the two vectors $\mathbf{x}$ and $\mathbf{v}$, i.e., $\mathbf{x}^T\mathbf{v}$.  In NumPy, this is implemented by the following line (see if it gives the same outcome as above):

In [None]:
np.dot(x, v)
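As a self-contained check, the sketch below computes the dot product of the same two vectors in three ways: summing the element-wise product, calling `np.dot`, and using Python's `@` operator, which NumPy arrays also support. All three should agree.

```python
import numpy as np

x = np.array([1, 1, -1, -1])
v = np.array([1, 2, -1, -1])

by_hand = sum(x * v)        # element-wise product [1, 2, 1, 1], then summed
with_dot = np.dot(x, v)     # NumPy's dot product
with_operator = x @ v       # the @ operator does the same for 1-D arrays

print(by_hand, with_dot, with_operator)
```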

Next, let us generate a 2D array (2 rows x 3 columns) using a random number generator and try out some simple operations:

In [None]:
a=np.random.random((2,3))

In [None]:
a       # check out the content of "a"

Simple operations such as +,-,*,/ can be easily carried out across all elements, e.g.

In [None]:
a+5

Now that we have a 2-D array, we can operate along one axis or the other, e.g., a vertical addition (summing down each column):

In [None]:
np.sum(a, axis = 0)

and a horizontal addition (summing across each row):

In [None]:
np.sum(a, axis = 1)
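Because "a" is random, the effect of `axis` can be easier to see on a small fixed array. The sketch below uses a hypothetical 2x3 array of the numbers 0 to 5 instead of the random one.

```python
import numpy as np

# A fixed 2x3 array: [[0, 1, 2],
#                     [3, 4, 5]]
a = np.arange(6).reshape(2, 3)

col_totals = np.sum(a, axis=0)   # collapse the rows: one total per column
row_totals = np.sum(a, axis=1)   # collapse the columns: one total per row

print(col_totals)   # three values, one for each column
print(row_totals)   # two values, one for each row
```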

In [None]:
a[:,0]

The following line stacks the 1st and 3rd columns of "a" side by side to form a new array:

In [None]:
np.vstack((a[:,0], a[:,2])).T

NumPy also provides np.c_ and np.r_ for similar operations. Note that they are used with square brackets, not called like functions, e.g.:

In [None]:
np.c_[np.array([1,2,3]), np.array([4,5,6])]
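The sketch below puts these approaches next to each other on a small fixed array (a stand-in for the random "a"). It also shows selecting columns with a list of indices, which is another common way to pick the 1st and 3rd columns.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

stacked = np.vstack((a[:, 0], a[:, 2])).T   # stack as rows, then transpose
columns = np.c_[a[:, 0], a[:, 2]]           # place 1-D arrays side by side as columns
picked = a[:, [0, 2]]                       # index with a list of column numbers

# np.r_ joins arrays along the first axis; for 1-D arrays this is end to end
joined = np.r_[np.array([1, 2, 3]), np.array([4, 5, 6])]

print(stacked)
print(joined)
```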

### Matrices

Sometimes it is easier to work with matrices. NumPy has very good support for matrix operations, and the code often looks almost the same as the mathematical notation.

In [None]:
am = np.matrix(a)

Let's create a 3x2 random matrix (note the size is specified by a "(,)" tuple):

In [None]:
bm = np.matrix(np.random.random((3, 2)))

In [None]:
bm

The transpose of the matrix above is 

In [None]:
bm.T

Matrix multiplication is simply another "*" operation. Note this is different from the element-wise array multiplication we just experimented with. Sizes should match, i.e. in this case a 2x3 matches a 3x2 and the multiplication gives a 2x2 matrix:

In [None]:
A = am * bm
A

In [None]:
invA = np.linalg.inv(A)

The following multiplies A by its inverse, and the outcome of course is an *identity matrix* (diagonal elements are 1; all others 0), up to floating-point rounding:

In [None]:
A * invA
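The same computation can be written with ordinary arrays and the `@` operator. The NumPy documentation no longer recommends `np.matrix` for new code. The sketch below uses small fixed arrays (made up for illustration, not the random ones above) so that the result is reproducible.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # 2x3
b = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # 3x2

A = a @ b                    # matrix product, 2x2 (a * b would fail: shapes differ)
invA = np.linalg.inv(A)      # inverse of the 2x2 product
product = A @ invA

# Floating-point rounding means we compare with a tolerance
is_identity = np.allclose(product, np.eye(2))
print(A)
print(is_identity)
```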

Let's see how matrices can be used for geometry. Suppose we have a vector $\mathbf{c}$ with coordinates (1,0): 

In [None]:
c = np.array([1, 0])

Now we rotate the vector anti-clockwise by an angle $\theta$. The $2\times 2$ rotation matrix is:
$A=\begin{bmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{bmatrix}$.

So for $\theta=\pi/4$, we have

In [None]:
A = np.matrix([[np.cos(np.pi/4), -np.sin(np.pi/4)], [np.sin(np.pi/4), np.cos(np.pi/4)]])
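Before using the matrix, it can be reassuring to check two properties every rotation matrix should have: its transpose is its inverse (so lengths are preserved), and its determinant is 1. The sketch below does this with a plain array named `R`, a name chosen here to avoid clashing with "A".

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# R^T R should be the identity, up to rounding
preserves_length = np.allclose(R.T @ R, np.eye(2))

# The determinant cos^2(theta) + sin^2(theta) should be 1
determinant = np.linalg.det(R)

print(preserves_length, determinant)
```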

**Task A1**. Now use a simple multiplication A\*c to get the new coordinates. 

Transform c into a column matrix (first make it a matrix, then transpose). Multiply A and the column matrix:

In [None]:
# Type in Task A1 code here 


**Task A2**: Write a function for rotating any data point (x, y) by any angle "theta" (given in degrees) and test it out. Complete the following cell:

In [None]:
def rot(p, theta):
    x, y = p                     # Unpack the point; Python 3 does not allow "(x,y)" as a parameter
    v=np.matrix([x,y]).T         # Get the coordinates and make a column vector
    rad=theta*np.pi/180          # Convert degree number into radian: 180 deg = pi
    #...
    #...

## Part B. First attempts on a dataset

Now we are ready to play with a dataset and put our array manipulation skills into good use. First, we load in a dataset adapted from the UCI email spambase data. The text file has 57 attributes, plus a class ID at the end for each line. The delimiter symbol is space. Try this:

In [None]:
d=np.loadtxt('./spambase_s.txt')

Check on the shape of the array 'd' and see if it is right:

In [None]:
d.shape
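If the data file is not at hand, the behaviour of `np.loadtxt` can be tried on a few made-up lines. The sketch below feeds three space-delimited rows (invented values, not taken from spambase) through `io.StringIO`, which `np.loadtxt` accepts in place of a file name.

```python
import io
import numpy as np

# Three made-up rows: three attributes followed by a class ID
text = io.StringIO(
    "0.1 0.0 3.5 1\n"
    "0.0 0.2 1.0 0\n"
    "0.4 0.0 2.2 1\n"
)

demo = np.loadtxt(text)   # whitespace is the default delimiter

print(demo.shape)         # (rows, columns)
print(demo.dtype)         # values are read as floats by default
```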

Question: How many data items are read into 'd'?
The ground truth, i.e. the class labels, is loaded into column no.57 (note the array index starts from 0, just like a list) and we convert the labels from float to integer:

In [None]:
label = d[:,-1].astype('int')           

To find out how many data points belong to class '0', this will do (Python uses -1 to indicate the last element in a list/array, i.e. it's equivalent to 57 in this case):

In [None]:
sum(label==0)      # or sum(d[:,-1]==0)
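The reason this works is that the comparison produces an array of True/False values, and True counts as 1 when summed. The sketch below shows this on a tiny made-up label array (`demo_label` is a hypothetical name) and also uses `np.count_nonzero`, which does the same count.

```python
import numpy as np

demo_label = np.array([0, 1, 0, 0, 1])   # made-up labels for illustration

mask = demo_label == 0          # array([ True, False,  True,  True, False])
n_sum = sum(mask)               # True counts as 1, False as 0
n_count = np.count_nonzero(mask)

print(mask)
print(n_sum, n_count)
```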

**Task B**: Write a line of code to find out how many data points are from class '1'.

We can easily cut out some portions of the large dataset and generate a smaller one. E.g. if we only need entries no.600 to 699, and the first 10 attributes:

In [None]:
din=d[600:700,0:10]

Also cut out the class labels of the selected entries:

In [None]:
dout=label[600:700].reshape((100,1))             # reshape() is to arrange it as a 2-D array

Now we can concatenate the input and output (label) parts, column-wise, into a smaller dataset d2:

In [None]:
d2=np.concatenate((din,dout),axis=1)

In [None]:
d2.shape            # check on the shape of the new dataset
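The slicing and concatenation steps can be rehearsed on a small stand-in array. The sketch below uses a made-up 8x5 array in place of the dataset; `reshape(-1, 1)` lets NumPy work out the number of rows itself, and `np.hstack` is shown as an equivalent to concatenating along axis 1.

```python
import numpy as np

# Stand-in data: 8 rows, 5 columns, last column plays the role of the label
demo = np.arange(40, dtype=float).reshape(8, 5)
demo_label = demo[:, -1].astype(int)

din = demo[2:5, 0:3]                     # rows 2..4, first 3 attributes
dout = demo_label[2:5].reshape(-1, 1)    # matching labels as a column (3x1)

d2 = np.concatenate((din, dout), axis=1) # join column-wise: 3 rows, 4 columns
same = np.array_equal(d2, np.hstack((din, dout)))

print(d2.shape, same)
```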

## Part C. Plot them out

Here is our first attempt to visualize the dataset. To do this we need to import the Matplotlib/pylab package:

In [None]:
import pylab as pl

We can then choose two columns (i.e. attributes) and do a scatter plot. In this case we choose attributes 1 and 4, and plot the coordinates using a blue dot:

In [None]:
pl.plot(d[:,1], d[:,4],'b.')

Now you should see a scatterplot above.

We can improve the visualization by incorporating the class labels. For example, how about plotting all class-0 items with a blue dot, and all class-1 items with a red cross? We can do this by repeatedly calling the plot() function:

In [None]:
for i in range(d.shape[0]):             # iterate through all rows
    if d[i,-1]==0:
        pl.plot(d[i,1],d[i,4],'b.')    # either plot a blue dot
    else:
        pl.plot(d[i,1],d[i,4],'rx')    # or a red cross

It looks better, doesn't it?

In Python, unfortunately, looping through array items one by one is very slow. To speed things up, we should always try to process arrays as a whole. 
In this case, we can construct two Boolean masks, one for each class:

In [None]:
set0 = d[:,-1]==0
set1 = d[:,-1]==1

And then we plot out the two subsets separately:

In [None]:
pl.plot(d[set1,1], d[set1,4], 'rx')
pl.plot(d[set0,1], d[set0,4], 'b.')
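To see what the masks select without plotting anything, the sketch below applies the same indexing to a tiny made-up array (`demo`, four rows, class ID in the last column). Indexing with a Boolean mask keeps only the rows where the mask is True.

```python
import numpy as np

# Made-up data: two attributes and a class ID per row
demo = np.array([[0.1, 5.0, 0],
                 [0.3, 7.0, 1],
                 [0.2, 6.0, 0],
                 [0.9, 8.0, 1]])

set0 = demo[:, -1] == 0      # True for class-0 rows
set1 = demo[:, -1] == 1      # True for class-1 rows

class0_col1 = demo[set0, 1]  # column 1 of the class-0 rows only
class1_col1 = demo[set1, 1]  # column 1 of the class-1 rows only

print(class0_col1, class1_col1)
```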

You may have been wondering why we picked attributes 1 and 4 for display. There is no particular reason for this choice, but to make sense of the scatterplot, we need to know what the chosen columns stand for! 

**Task C**: Have a look at the dataset description at the UCI repository: https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names and explore alternative choices instead of 1 and 4; find a good pair to regenerate the scatterplot.

In [None]:
# Task C code here....


## Markov Chain

Let's have some more practice on matrices. 

Sociologists often assume a simplified social class transition model expressed by a Markov chain. E.g., a son's occupation depends only on his father's (but not on his grandfather's). The following example transition probability matrix is taken from Taylor & Karlin's book *An Introduction to Stochastic Modeling*:

In [None]:
P = np.matrix('0.7 0.2 0.1;  0.2 0.6 0.2; 0.1 0.4 0.5')

We print 'P' nicely:

In [None]:
print(P)

The row stands for the father's class (Lower/Middle/Upper), and the column stands for the son's class (Lower/Middle/Upper). Note that the values in each row add up to 1. 

How do we interpret $P$? For instance, the probability of a next-generation transition from "Middle" (2nd row) to "Upper" (3rd column) is 0.2. Likewise, the transition probability from "Upper" (3rd row) to "Lower" (1st column) is 0.1.  
What will be the transition probabilities over two generations? We have $P^2=P\times P$:

In [None]:
P*P
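As a sanity check on the two-generation matrix, the sketch below recomputes it with a plain array and `np.linalg.matrix_power`, and confirms that each row still sums to 1. It also reads off one entry: the Middle-to-Upper probability over two generations sums over the three possible intermediate classes, $0.2\times0.1 + 0.6\times0.2 + 0.2\times0.5 = 0.24$.

```python
import numpy as np

P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

P2 = np.linalg.matrix_power(P, 2)   # same as P @ P for arrays

row_sums = P2.sum(axis=1)           # each row should still sum to 1
middle_to_upper = P2[1, 2]          # 2nd row, 3rd column (indices start at 0)

print(P2)
print(row_sums, middle_to_upper)
```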

**Task D**: Use a "for" loop to calculate $P^2$, $P^4$, $P^8$, $P^{16}$, and $P^{32}$. What does the result suggest?

In [None]:
# your code here

Your Comments: 

######  End of Lab 1.

Remember to submit your work by Thursday 11pm 25/7/2019. 