<a href="https://colab.research.google.com/github/markbriers/data-science-jupyter/blob/main/week2_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python fundamentals 1 (Week 2)

## Module: Learning outcomes

* Describe the six stages of a data processing pipeline (using CRISP-DM)

* Demonstrate an understanding of the Python programming language through the production of an elementary data analysis programme

* Analyse at least three different data sources by applying at least one Python data processing library to extract and explore pertinent features

* Be able to design a set of data requirements for a specified business problem

* Describe and apply (using the Python programming language) the main approaches to supervised learning for a given classification problem

* Understand the use cases of Big Data technology (in particular Spark)

* Produce a report including appropriate data visualisations covering the analysis of a business problem using a data science based approach

## Week 2: Learning outcome

* At the end of week 2, you will have a foundational level knowledge of Python. You will be able to write code and text using Markdown.

## An introduction to Markdown

Let's start with the text interface, which allows Markdown to be written. Markdown is a lightweight markup language (similar in purpose to HTML for webpages, but much simpler) that allows us to write formatted text and embed images and mathematics. Combined with the code cells of a notebook, this lets us keep explanations, results, and interactive visualisations in one document, which supports reproducibility, sharing, and decision making.

Example:

# Title 1

## Title 2

### Title 3

This is a line of text.

* This is a bullet
* This is a second bullet
* This is a bullet with _italic_ text
* This is a bullet with **bold** text

This is an equation written in LaTeX:
\begin{equation}
f(x) = x^2
\end{equation}

This is an equation $x^2$ written inline.

Further details on Markdown and LaTeX can be found in the references below.

Mandatory exercises:
- [ ] Read this reference: https://colab.research.google.com/notebooks/markdown_guide.ipynb

Optional exercises:
- [ ] Use this cheat sheet when writing Markdown for your own work: https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed
- [ ] Read this reference: https://www.math.ubc.ca/~pwalls/math-python/jupyter/latex/

## Mathematical foundations

### Notation

* $4\times4\times4 = 4^3 = 64$
* $x\times x\times x = x^3$ (e.g. when $x=4$, $x^3 = 64$)
* In the expression above, $x$ is a _variable_

In [None]:
4 * 4 * 4

64

This can be equivalently written as:

In [None]:
4 ** 3

64

We can introduce variables, such as $x = 4$

In [None]:
x = 4

We can then manipulate variable x, rather than specific values.

In [None]:
x * x * x

64

In [None]:
y = 2

In [None]:
x ** y

16

We can then change the values of x and y and rerun the same expressions, which maximises code reuse.
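
As a small sketch of this kind of reuse, the same expression can be evaluated after reassigning the variables. The names `base` and `exponent` are illustrative only and are not part of the original notebook.

```python
# The expression base ** exponent stays the same; only the values change.
base = 4
exponent = 2
first_result = base ** exponent
print(first_result)   # 16

base = 2
exponent = 10
second_result = base ** exponent
print(second_result)  # 1024
```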

* We could generalise further and say $f(x) = x^2$, where "$f$ is a function of the variable $x$" (e.g. $f(4) = 16$)
* A function is a relation between sets that associates to every element of a first set exactly one element of the second set
* A set is a well defined collection of distinct items. e.g. $\{apple,orange,banana\}$

Let's look at how we do this in Python.

In Python, a function is defined using the def keyword:

In [None]:
def print_hello():
  print("Hello world")

# this is a comment
x = 4
print(x)

4


In [None]:
def print_hello_user(user):
  print("Hello "+user)

In [None]:
print_hello_user("Mark")

Hello Mark


In [None]:
print_hello()

Hello world
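
The `+` operator in `print_hello_user` only joins strings, so passing a number would raise a `TypeError`. One possible alternative is an f-string, sketched below. The function name `greet` is hypothetical, and it returns the greeting instead of printing it so that the result can be reused.

```python
def greet(user):
    # An f-string converts the value to text before inserting it,
    # so this also works when user is a number.
    return f"Hello {user}"

print(greet("Mark"))  # Hello Mark
print(greet(42))      # Hello 42
```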


We can pass values (or variables holding them) as arguments to functions, and return a result:

In [None]:
def square(x):
  return x ** 2

We can use this function several times, in order to reduce the amount of code that we need to write and to maximise code sharing.

In [None]:
x = 3
print(x)
print(x == 4)

3
False


In [None]:
square(5)

25
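
The earlier definition of a function as a relation between sets can be illustrated with `square`. The sketch below is only an illustration: the names `inputs`, `mapping`, and `outputs` are assumptions introduced here, and a dictionary is used to record which output is associated with each input.

```python
def square(x):
    return x ** 2

# A set holds distinct items, so the repeated 2 is stored only once.
inputs = {1, 2, 2, 3, 4}

# Associate each element of the first set with exactly one output value.
mapping = {value: square(value) for value in inputs}

# The second set contains the values that the function produces.
outputs = set(mapping.values())

print(len(inputs))               # 4
print(mapping[4])                # 16
print(outputs == {1, 4, 9, 16})  # True
```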

We can also define recursive functions, that is, functions that call themselves (e.g. factorial(5) $= 5! = 5 \times 4 \times 3 \times 2 \times 1$)

In [None]:
def factorial(n):
  # Base case: 0! and 1! both equal 1, which stops the recursion
  if n <= 1:
    return 1
  else:
    return n * factorial(n - 1)

In [None]:
factorial(5)

120

In [None]:
factorial(3)

6
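
A recursive function can be cross-checked against a trusted implementation. The sketch below assumes a base case of `n <= 1`, so that `factorial(0)` also terminates, and compares the result with `math.factorial` from the standard library for a few small inputs.

```python
import math

def factorial(n):
    # Base case: 0! and 1! are both 1, which stops the recursion.
    if n <= 1:
        return 1
    # Recursive case: n! = n * (n - 1)!
    return n * factorial(n - 1)

# Compare against the standard library for n = 0, 1, ..., 5.
for n in range(6):
    assert factorial(n) == math.factorial(n)

print(factorial(5))  # 120
```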

We can perform operations with different Python native datatypes (Boolean, Integer, Float, List, String).

A Boolean takes one of two values, True or False. In certain places, such as the condition of an if statement, Python expects an expression that evaluates to a Boolean value. These places are called Boolean contexts.

In [None]:
x > 10

False

In [None]:
x == 2

False

In [None]:
if x < 4:
  print("x<4")
else:
  print("x>=4")

x<4
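
To make the idea of a Boolean context more concrete, the sketch below shows that a comparison produces a Boolean value and how Python treats some other values when a Boolean is expected. The names `value` and `is_small` are illustrative only.

```python
value = 3

# A comparison produces a Boolean value.
is_small = value < 4
print(is_small)        # True
print(type(is_small))  # <class 'bool'>

# In a Boolean context, zero and empty containers count as False,
# while non-zero numbers and non-empty containers count as True.
print(bool(0), bool(7))       # False True
print(bool(""), bool("abc"))  # False True
print(bool([]), bool([0]))    # False True

# Comparisons can be combined with and, or, not.
print(value > 0 and value < 4)  # True
print(not is_small)             # False
```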


In [None]:
x = 10

A float (a floating-point approximation of a real number) is defined by writing a number with a decimal point, or explicitly with `float()`:

In [None]:
pi = 3.14

In [None]:
print(pi)

3.14


In [None]:
float(2)

2.0
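
The remaining native datatypes named above (Integer, List, String) behave in a similar way. The following is a minimal sketch with illustrative variable names, not an exhaustive tour of each type.

```python
count = 7           # Integer
price = 2.5         # Float
name = "data"       # String
values = [1, 2, 3]  # List

print(type(count).__name__)  # int
print(type(price).__name__)  # float
print(count * price)         # 17.5 (an int combined with a float gives a float)
print(count / 2)             # 3.5 (true division returns a float)
print(count // 2)            # 3 (floor division of two ints returns an int)

print(name.upper())          # DATA
print(len(name))             # 4

values.append(4)              # lists can grow in place
print(values)                 # [1, 2, 3, 4]
print(values[0], values[-1])  # 1 4
```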

## Libraries

The power of Python (with respect to data science) is its extensibility. We can import libraries and use functions and data types from such libraries. NumPy (https://numpy.org/) is the basis of many data science libraries. We import it as follows:

In [None]:
import numpy as np

We can use NumPy to define _vectors_ and _matrices_, and to perform mathematical operations:

In [None]:
# Create a vector as a row
vector_row = np.array([1, 2, 3])

In [None]:
print(vector_row)

[1 2 3]


In [None]:
# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

In [None]:
print(vector_column)

[[1]
 [2]
 [3]]
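
The difference between the two vectors is visible in their `shape` attribute: the first is a one-dimensional array, while the second has three rows and one column. The sketch below also uses `reshape` to convert one into the other; the name `as_column` is illustrative.

```python
import numpy as np

vector_row = np.array([1, 2, 3])
vector_column = np.array([[1], [2], [3]])

print(vector_row.shape)     # (3,)
print(vector_column.shape)  # (3, 1)

# reshape(-1, 1) turns the one-dimensional array into a single column;
# -1 asks NumPy to work out the number of rows.
as_column = vector_row.reshape(-1, 1)
print(np.array_equal(as_column, vector_column))  # True
```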


Let's call the _linspace_ function to produce an array of equally spaced values (here, 10 values from $-\pi$ to $\pi$, with both endpoints included):

In [None]:
# Build array/vector:
x = np.linspace(-np.pi, np.pi, 10)
print(x)

[-3.14159265 -2.44346095 -1.74532925 -1.04719755 -0.34906585  0.34906585
  1.04719755  1.74532925  2.44346095  3.14159265]
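
As a quick check that the values really are equally spaced, the sketch below computes the gaps between neighbouring elements with `np.diff`. The names `grid`, `steps`, and `expected_step` are illustrative, chosen so that the notebook's `x` is left untouched.

```python
import numpy as np

grid = np.linspace(-np.pi, np.pi, 10)

# 10 points give 9 gaps, so the expected step is 2*pi / 9.
steps = np.diff(grid)
expected_step = 2 * np.pi / 9

print(len(grid))                          # 10
print(len(steps))                         # 9
print(np.allclose(steps, expected_step))  # True
```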


In [None]:
print(x[0])  # first element
print(x[2])  # third element
print(x[-1]) # last element
print(x[-2]) # second to last element

-3.141592653589793
-1.7453292519943295
3.141592653589793
2.443460952792061


In [None]:
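print(x[1:4])     # second to fourth element; the element at index 4 (the fifth) is not included
print(x[0:-1:2])  # every other element, starting at index 0 and stopping before the last
print(x[:])       # print the whole vector
print(x[-1:0:-1]) # reversed, but stops before index 0; x[::-1] would reverse the whole vector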

[-2.44346095 -1.74532925 -1.04719755]
[-3.14159265 -1.74532925 -0.34906585  1.04719755  2.44346095]
[-3.14159265 -2.44346095 -1.74532925 -1.04719755 -0.34906585  0.34906585
  1.04719755  1.74532925  2.44346095  3.14159265]
[ 3.14159265  2.44346095  1.74532925  1.04719755  0.34906585 -0.34906585
 -1.04719755 -1.74532925 -2.44346095]


Suppose we want only the elements of the vector where x > 2:

In [None]:
print(x > 2)
y = x[x > 2]
print(y)
y[0]

[False False False False False False False False  True  True]
[2.44346095 3.14159265]


2.443460952792061
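
Boolean masks can also be combined. The sketch below uses a small hand-written array (an assumption made for readability, not data from the notebook) and the element-wise operators `&` and `|`.

```python
import numpy as np

values = np.array([-3.0, -1.5, 0.0, 1.5, 3.0])

# Combine conditions with & (and) and | (or); each condition needs parentheses.
inside = values[(values > -2) & (values < 2)]
outside = values[(values < -2) | (values > 2)]

print(inside.tolist())   # [-1.5, 0.0, 1.5]
print(outside.tolist())  # [-3.0, 3.0]

# A mask can also be counted, because True is treated as 1.
positive_count = int(np.sum(values > 0))
print(positive_count)    # 2
```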

In the above, we have been programmatically manipulating arrays. A mathematical representation of an array is known as a vector.
* Informally, a *vector* is an ordered list of $n$ real numbers, $\vec{x}=(x_{1}~x_{2}~\ldots~x_{n})\in\mathbb{R}^n$

* Let:
\begin{eqnarray}
   \vec{a} = \left[\begin{matrix} 
   a_{11} \\
   a_{21} \\
   a_{31} \\
   \end{matrix}\right], & ~~~ &
   \vec{b} = \left[\begin{matrix} 
   b_{11} \\
   b_{21} \\
   b_{31} \\
   \end{matrix}\right]
\end{eqnarray}

* Then:
\begin{equation}
\vec{a}+\vec{b} = \left[\begin{matrix} 
   a_{11}+b_{11} \\
   a_{21}+b_{21} \\
   a_{31}+b_{31} \\
   \end{matrix}\right]
  \end{equation}
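
The element-wise definition of vector addition above corresponds directly to the `+` operator on NumPy arrays. The following minimal sketch uses example values chosen for illustration and compares the built-in result with one computed element by element.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Element-wise addition, matching the definition above.
total = a + b
print(total.tolist())  # [11, 22, 33]

# The same result computed one element at a time.
manual = np.array([a[i] + b[i] for i in range(len(a))])
print(np.array_equal(total, manual))  # True
```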

We will return to this when we delve deeper into modelling, later in the course.

## Matrix

* A matrix is a two-dimensional representation of numbers (a grid)
* In data science, we use matrices to represent linear maps from an input space to an output space and to store neural network weights, while features are stored as vector inputs
* This allows us to have compact mathematical representations of the models, and to exploit computational resources (such as GPUs) in order to perform extremely fast processing
* Example:
\begin{equation}
A =
   \left[\begin{matrix} 
   11 & 14 & 18 & 9 \\
   32 & 14 & 24 & 17 \\
   10 & 7 & 6 & 28 \\
   \end{matrix}\right]
\end{equation}
* We can define operations between matrices, or between matrices and vectors

In [None]:
# Create a matrix
matrix = np.array([[11, 14, 18, 9],
                   [32, 14, 24, 17],
                   [10, 7, 6, 28]])

In [None]:
matrix[0,2]

18

In [None]:
matrix.size

12

In [None]:
matrix.shape

(3, 4)

In [None]:
matrix.T

array([[11, 32, 10],
       [14, 14,  7],
       [18, 24,  6],
       [ 9, 17, 28]])
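
The attributes above are related to one another, as the following sketch illustrates. It also selects a whole row and a whole column with the `:` slice; the names `rows` and `columns` are illustrative.

```python
import numpy as np

matrix = np.array([[11, 14, 18, 9],
                   [32, 14, 24, 17],
                   [10, 7, 6, 28]])

# size is the total number of elements: rows times columns.
rows, columns = matrix.shape
print(rows * columns == matrix.size)  # True

print(matrix[1, :].tolist())  # second row: [32, 14, 24, 17]
print(matrix[:, 2].tolist())  # third column: [18, 24, 6]

# Transposing swaps the two axes, so element (i, j) moves to (j, i).
print(matrix.T.shape)                  # (4, 3)
print(matrix.T[2, 0] == matrix[0, 2])  # True
```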

We can perform algebraic operations on matrices:

In [None]:
# Create matrix
matrix_a = np.array([[1, 1, 1],
                     [1, 1, 1],
                     [1, 1, 2]])

# Create matrix
matrix_b = np.array([[1, 3, 1],
                     [1, 3, 1],
                     [1, 3, 8]])

# Add two matrices
np.add(matrix_a, matrix_b)

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])

Or equivalently:

In [None]:
matrix_a + matrix_b

array([[ 2,  4,  2],
       [ 2,  4,  2],
       [ 2,  4, 10]])

In [None]:
matrix_a - matrix_b

array([[ 0, -2,  0],
       [ 0, -2,  0],
       [ 0, -2, -6]])
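
Addition and subtraction act element by element. Operations between matrices, or between a matrix and a vector, usually refer to the matrix product instead, which NumPy writes with the `@` operator. The sketch below contrasts the two; the vector `v` and the result names are assumptions introduced for illustration.

```python
import numpy as np

matrix_a = np.array([[1, 1, 1],
                     [1, 1, 1],
                     [1, 1, 2]])

matrix_b = np.array([[1, 3, 1],
                     [1, 3, 1],
                     [1, 3, 8]])

v = np.array([1, 2, 3])

# Element-wise product: multiplies entries in matching positions.
elementwise = matrix_a * matrix_b
print(elementwise.tolist())  # [[1, 3, 1], [1, 3, 1], [1, 3, 16]]

# Matrix product: each entry is a row of matrix_a dotted with a column of matrix_b.
product = matrix_a @ matrix_b
print(product.tolist())      # [[3, 9, 10], [3, 9, 10], [4, 12, 18]]

# Matrix-vector product: each entry is a row of matrix_a dotted with v.
matrix_vector = matrix_a @ v
print(matrix_vector.tolist())  # [6, 6, 9]
```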

## Exercises

Mandatory exercises:
- [ ] Write a function that computes $x+2$ for any value of $x$. Call the function `addTwo`.
- [ ] Replicate all of the code in this notebook.


Advanced (optional) exercises (for students with existing Python knowledge):
- [ ] By reading the NumPy documentation, create a 3-dimensional array (a _tensor_)
- [ ] Create your own simple Python class for a Matrix, that is, an object based representation of a two-dimensional array