# COM761 Machine Learning

### Teaching staff
- Prof Jane Zheng `h.zheng@ulster.ac.uk` (Module Coordinator)
- Dr Glenn Hawe `gi.hawe@ulster.ac.uk`

### Lab Tutors
- Conor Clare `clare-c@ulster.ac.uk` 1-1 hour: Each TUESDAY 3-4 pm
- Isaac Ampomah `ampomah-i@ulster.ac.uk` 1-1 hour: Each FRIDAY 2-3 pm
- Mohammad Saedi `saedi-m@ulster.ac.uk` 1-1 hour: Each TUESDAY 3-4 pm


## Weekly Schedule

**Each Wednesday**

- **2:15 - 3:15** LECTURE (RECORDED)
- **3:15 - 3:30** *BREAK* (NOT RECORDED)
- **3:30 - 4:30** LECTURE (RECORDED)
- **4:30 - 4:40** LAB INTRO (RECORDED)
- **4:40 - 5:00** *BREAK*
- **5:00 - 8:00** LABS (NOT RECORDED)

Feel free to take some time for your dinner, but please try to make the most of the lab sessions, i.e. aim to attend at least 2 hours.

The lectures will be recorded and made available on Blackboard shortly afterwards.


## Teaching Plan (Weeks 1-5)  

**Lecturer: Dr Glenn Hawe**

**1. Introduction to machine learning** 

- 1.1: Introduction to Machine Learning
- 1.2: Matrices and NumPy
- Labs: Exercises involving NumPy and pandas


**2. Linear regression** 

- 2.1: Multiple linear regression; non-linear responses; cross-validation; maximum likelihood
- 2.2: Alternative cost functions; regularisation; diagnostics
- Labs: Exercises in linear regression


**3. Optimization for machine learning**

- 3.1: Zero-order optimization: Grid search; Random Search; Coordinate Search / Descent
- 3.2: First-order optimization: Derivatives; Gradient Descent
- Labs: Optimization exercises; evolutionary algorithms

**4. The Bayesian approach to machine learning** 

- 4.1: Bayesian thinking (priors, likelihood, posterior)
- 4.2: Bayesian linear regression
- Labs: Probabilistic Programming and PyMC3

**5. Advanced Bayesian machine learning** 

- 5.1: Markov Chain Monte Carlo (MCMC)
- 5.2: Gaussian Process regression and Bayesian Optimization 
- Labs: MCMC diagnostics; GP regression models and Bayesian optimization

## Teaching Plan (Weeks 6-10)  

**Lecturer: Prof Jane Zheng**

**6. Classification I**: process and algorithms

**7. Classification II**: algorithms and assessment

**8. Unsupervised Learning I**: concept and algorithms

**9. Unsupervised Learning II**: algorithms and assessment

**10. Ethical AI, Explainable AI (XAI) & ML**

## Assessment

This module is assessed entirely by coursework. There are two pieces of coursework:

#### Coursework 1 [40%]

- A Jupyter notebook containing four short questions
- Covers core material from Weeks 1-3
- Will be released in Week 3
- Deadline: end of Week 5
- Submission will be via Blackboard
- An individual assignment


#### Coursework 2 [60%]

- Three longer questions covering material from Weeks 4-9
- An individual assignment
- Submission will be via Blackboard.
- Deadline: end of Week 10

#### Late submissions

- If you are unable to submit due to extenuating circumstances, please submit an EC1 form to the school office.
- Otherwise, an assessment submitted late will score a mark of zero.




#### Plagiarism
- Both pieces of coursework are *individual* assignments.
- You should not share your solutions with other students for any reason.
- If you rely heavily on some source to answer a question, then it should be cited in your code as a comment. 


## Recommended Texts

![Books](Images/books3.png)

### Useful blogs / websites

- https://machinelearningmastery.com/
- https://www.kdnuggets.com/
- https://distill.pub/
- https://towardsdatascience.com/ (paid subscription via Medium)

### Software

- We will be using Python
- The majority of teaching material will be delivered via Jupyter notebooks
- To open and use these Jupyter notebooks you should use either:
    - JupyterLab (Installed locally on your machine): 
       - https://www.anaconda.com/products/individual
    - Google Colab (Cloud-based approach; no need to install anything):
       - https://colab.research.google.com/
- Ask in Labs today if you need help getting set up


### What is machine learning and how does it fit into AI (and what is deep learning?)

![Venn](Images/AI_Venn.png)

## Definitions of machine learning

**Arthur Samuel** defined machine learning in 1959 as:

Machine Learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.

**Tom Mitchell** gave the following popular definition in 1997:

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.



### An Illustrative Example

- This example is taken from Chapter 1 of *Machine Learning Refined*.
- It will allow us to informally introduce the terminology of machine learning.

Suppose we need to teach a computer how to distinguish between pictures of *cats* and *dogs*. 

- How does a child learn the difference between a cat and a dog? They learn by *example*.
- After seeing many cats and dogs, and being told which is which by e.g. their parents (a *supervisor*), the child *learns* how to distinguish between cats and dogs.
- How do we know when a child can successfully distinguish between cats and dogs?  When they encounter new cats and dogs and can correctly identify each new example.
  - i.e. when they can *generalize* what they have learned to new, previously unseen, examples.

Computers can be taught how to perform this task in a similar manner.

Relating this back to Tom Mitchell's definition of machine learning:
- The task $T$ is *classification*: distinguishing between different *classes* of object (in this case, cats and dogs).
- The experience $E$ is a set of images of cats and dogs that a supervisor has labelled and the learner has seen.
- The performance measure $P$ is the accuracy with which new, previously unseen examples are labelled.

We now summarize the main steps involved in this classification task.

##### 1. Data Collection

Collect a set of labelled images of cats and dogs.

![MLR1_01](Images/MLR_Fig1_01.jpg)

##### 2. Feature design

- How do we (humans) tell the difference between cats and dogs?
- We use color, size, the shape of the ears or nose, and so on to distinguish between the two.
    - In machine learning, these are called *features*.
- In order to train a computer to perform this task (or more generally, any machine learning task), we need to provide it with properly designed *features*.
  - *Representation learning* is a form of machine learning concerned with learning features. 
  - *Deep learning* is a form of representation learning concerned with learning a hierarchy of features (of varying degrees of abstraction).
- For our toy problem of distinguishing cats from dogs, suppose we use two features: 
  - *size of nose* (relative to the size of the head), ranging from small to large.
  - *shape of ears* ranging from round to pointy.

![MLR1_02](Images/MLR_Fig1_02b.png)

#### Representation of the experience (dataset)

One common way of describing a dataset is with a **design matrix**. This is a matrix where:
- each row contains a different example
- each column represents a different feature

Most learning algorithms operate on a design matrix representing the dataset. 

- Therefore to understand machine learning, you should be comfortable using matrices.
- Today we will introduce the Python library NumPy for working with matrices.
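
As a minimal sketch of a design matrix in NumPy, the toy cats-and-dogs dataset could be stored as below. The feature values are made up for illustration, and scaling each feature to the range 0 to 1 is an assumption of this sketch.

```python
import numpy as np

# Hypothetical feature values, each scaled to the range [0, 1]:
#   column 0 = nose size  (0 = small, 1 = large)
#   column 1 = ear shape  (0 = round, 1 = pointy)
X = np.array([
    [0.2, 0.9],   # example 1 (a cat)
    [0.3, 0.8],   # example 2 (a cat)
    [0.8, 0.2],   # example 3 (a dog)
    [0.7, 0.3],   # example 4 (a dog)
])

# Labels are stored separately, one per row of X: 0 = cat, 1 = dog
y = np.array([0, 0, 1, 1])

n_examples, n_features = X.shape
print(n_examples, n_features)   # 4 2
print(X[2])                     # all features of the third example (a row)
print(X[:, 0])                  # nose size of every example (a column)
```

Indexing a row gives one example, while indexing a column gives one feature across all examples.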

##### 3. Model Training

With our feature representation of the training data, the machine learning problem of distinguishing cats and dogs is a geometric one: have the computer find a line or a curve that separates the cats from the dogs in our carefully designed feature space.

- If we are to fit a straight line $y=w_0 + w_1x$, then we must find the *best* values for its two parameters:
  - $w_1$ the gradient
  - $w_0$ the intercept on the vertical axis
- This is an **optimization** problem.  **Optimization is key to machine learning, and we will spend Week 3 on this topic.**
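
To make the geometric picture concrete, here is a rough sketch of how a straight line can act as a classifier: a point is labelled according to which side of the line it falls on. The function name `classify`, the feature values and the hand-picked parameters `w0=0.0`, `w1=1.0` are all illustrative assumptions; in practice the parameters would be found by optimization.

```python
import numpy as np

def classify(nose_size, ear_shape, w0, w1):
    """Label points by which side of the line ear_shape = w0 + w1 * nose_size they lie on."""
    line_height = w0 + w1 * nose_size
    # Above the line (pointy ears relative to nose size) -> cat, otherwise dog
    return np.where(ear_shape > line_height, "cat", "dog")

# Made-up feature values, each scaled to the range [0, 1]
nose = np.array([0.2, 0.3, 0.8, 0.7])
ears = np.array([0.9, 0.8, 0.2, 0.3])

labels = classify(nose, ears, w0=0.0, w1=1.0)
print(labels)   # ['cat' 'cat' 'dog' 'dog']
```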

![MLR1_03](Images/MLR_Fig1_03b.png)

##### 4. Model Validation

How can we test that we have successfully learnt how to distinguish between cats and dogs?

We cannot rely on just classifying the examples in the training set: the line was chosen to fit exactly those examples, so here we would get them all correct, even though our model is surely not perfect.

Instead we need to test our model using some *unseen* images of cats and dogs, i.e. images that were not used in training the model. Such a set is called a **validation** set.
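
One simple way to obtain a validation set is to hold back part of the labelled data before training. The sketch below assumes a made-up dataset of 10 examples, a hold-out of 3 examples, and a hand-picked line; the helper `predict` is hypothetical and not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: 10 examples, 2 features (nose size, ear shape); 0 = cat, 1 = dog
X = np.array([[0.2, 0.9], [0.3, 0.8], [0.1, 0.7], [0.4, 0.9], [0.3, 0.6],
              [0.8, 0.2], [0.7, 0.3], [0.9, 0.4], [0.6, 0.1], [0.5, 0.6]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Shuffle the row indices, then hold out the last 3 examples for validation
n_val = 3
indices = rng.permutation(len(X))
train_idx, val_idx = indices[:-n_val], indices[-n_val:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

def predict(X, w0, w1):
    # 1 (dog) if the point lies below the line ear = w0 + w1 * nose, else 0 (cat)
    return (X[:, 1] < w0 + w1 * X[:, 0]).astype(int)

# w0 and w1 would normally be chosen using X_train and y_train only;
# here they are hand-picked to keep the sketch short
w0, w1 = 0.0, 1.0

# Accuracy = fraction of validation examples that are labelled correctly
val_accuracy = np.mean(predict(X_val, w0, w1) == y_val)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

The key point is that the rows in `X_val` play no part in choosing the parameters, so the accuracy measured on them estimates how well the model generalizes.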

![MLR1_04](Images/MLR_Fig1_04b.png)

![MLR1_05](Images/MLR_Fig1_05b.png)

![MLR1_06](Images/MLR_Fig1_06.jpg)

##### Summary

- Machine learning involves learning how to perform a **task** from **experience**.
- Experience comes in the form of a collection of **examples**, known as a **dataset**.
- Each example in the dataset is defined by its **features**. 


### Supervised vs unsupervised learning

- In our toy problem, each example that we learnt from had, in addition to its features, a label (in this case cat/dog).
- If every example we learn from has a label, and the task is to learn how the label depends on the values of the features (so that we can then predict the label for new, unseen, unlabelled examples), then the learning is said to be **supervised**. 
  - When the label is a discrete class, then the task is called **classification**; classification will be covered in Weeks 6-7. 
  - When the label is continuous (a real number), then the task is called **regression**; we will focus on it until Week 5. 
- If no example has a label (i.e. we only have feature values), then the task is usually to learn useful properties of the structure of this dataset. This is called **unsupervised learning**. Unsupervised learning will be covered in Weeks 8-9. The most common tasks in unsupervised learning are *dimensionality reduction* and *clustering*.

## Regression

**Regression** is the task of predicting how a continuous-valued target depends on the features.

![reg_task](Images/reg_task.jpg)

### Simple linear regression

We will cover linear regression in more depth next week, after we have covered matrices. 

Today, we will restrict ourselves to *simple* linear regression: fitting a straight line through the data when we have a single feature.

A straight line is defined by the equation $y = w_0 + w_1x$.
- Here, $y$ is the label (or target)
- $x$ is our single feature
- $w_0$ is the intercept on the $y$-axis
- $w_1$ is the gradient of the line


The parameters of our model are $w_0$ and $w_1$. 

#### Defining a good model

Given a set of data $\{(x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\}$, training a simple linear regression model $y = w_0 + w_1x$ involves finding the *best* values of $w_0$ and $w_1$.

Intuitively, we would expect the *best* line to pass as closely as possible to *all* of the data points.

A common way of measuring how closely a line fits a set of points is:
1. Measure each of the residuals (residual = true value - predicted value)
   - At $x_i$ the predicted value is $w_0 + w_1x_i$, so the residual is $y_i - (w_0 + w_1x_i)$
2. Square each residual (to make them all non-negative)
3. Find the mean (i.e. average) of the squares of all of the residuals

This gives us the following **loss function**, known as the *mean squared error*:

$g(w_0,w_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - (w_0 + w_1x_i))^2$

The values of $w_0$ and $w_1$ that minimize $g(w_0,w_1)$ give rise to the *least squares* model.
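
The loss $g(w_0,w_1)$ translates almost line for line into NumPy. The function name `mse_loss` and the small dataset below are illustrative assumptions; the data are chosen to lie exactly on the line $y = 1 + 2x$ so that the results are easy to check by hand.

```python
import numpy as np

def mse_loss(w0, w1, x, y):
    """Mean of the squared residuals for the line y = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)      # true value - predicted value
    return np.mean(residuals ** 2)     # square each residual, then average

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])     # exactly y = 1 + 2x

print(mse_loss(1.0, 2.0, x, y))   # 0.0: the line passes through every point
print(mse_loss(0.0, 2.0, x, y))   # 1.0: every residual equals 1
```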

![MLR5_02](Images/MLR_Fig5_02.jpg)

##### Finding the optimal parameter values

- The usual approach to finding the optimal parameters of a model is to use an *optimization* algorithm that searches efficiently through parameter space to minimize the loss function.

- We will look at algorithms for optimizing loss functions in Week 3.
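
As a taste of what searching parameter space means, the sketch below tries every pair $(w_0, w_1)$ on a coarse grid and keeps the pair with the smallest loss. This brute-force grid search is only one simple (and inefficient) option, used here for illustration; the data, the grid range of -5 to 5 and the step of 0.1 are all assumptions of this sketch.

```python
import numpy as np

def mse_loss(w0, w1, x, y):
    """Mean of the squared residuals for the line y = w0 + w1 * x."""
    return np.mean((y - (w0 + w1 * x)) ** 2)

# Made-up data lying roughly along y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Candidate values for each parameter: -5.0, -4.9, ..., 5.0
w0_grid = np.linspace(-5, 5, 101)
w1_grid = np.linspace(-5, 5, 101)

# Evaluate the loss at every grid point and remember the best one seen so far
best_w0, best_w1, best_loss = None, None, np.inf
for w0 in w0_grid:
    for w1 in w1_grid:
        loss = mse_loss(w0, w1, x, y)
        if loss < best_loss:
            best_w0, best_w1, best_loss = w0, w1, loss

print(f"w0 = {best_w0:.1f}, w1 = {best_w1:.1f}, loss = {best_loss:.4f}")
```

The answer can only be as precise as the grid spacing, and the number of loss evaluations grows rapidly with the number of parameters, which is why more efficient algorithms are needed.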


```python
# Video below from https://github.com/jermwatt/machine_learning_refined
# Shared under the Creative Commons Attribution 4.0 International License (CC BY-NC-SA 4.0)

from IPython.display import HTML

# Embed the local animation file in the notebook output
HTML("""
<video width="1000" height="400" controls loop>
  <source src="Videos/animation_1.mp4" type="video/mp4">
</video>
""")
```

#### Normal equations

- In the case of linear regression, the optimal values can be determined analytically, by differentiating the loss function and solving for when the gradient is equal to zero.

- See pp. 8-14 of *A First Course in Machine Learning* by Simon Rogers and Mark Girolami for details of the derivation; the resulting optimal values $\hat{w}_0$ and $\hat{w}_1$ of $w_0$ and $w_1$ are:

![normeq](Images/simplelinearnormeq2.png)

where a bar over a term means the average value for that term (calculated from the training data).
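
The sketch below computes the closed-form solution in NumPy, assuming the figure shows the standard result $\hat{w}_1 = (\overline{xy} - \bar{x}\,\bar{y}) / (\overline{x^2} - \bar{x}^2)$ and $\hat{w}_0 = \bar{y} - \hat{w}_1\bar{x}$. The function name `fit_simple_linear_regression` and the data are made up for illustration, and `np.polyfit` is used only as an independent cross-check.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least squares estimates (w0_hat, w1_hat) for y = w0 + w1 * x."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    xy_bar = np.mean(x * y)        # average of the products x_i * y_i
    x2_bar = np.mean(x ** 2)       # average of the squares x_i ** 2
    w1_hat = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)
    w0_hat = y_bar - w1_hat * x_bar
    return w0_hat, w1_hat

# Made-up data lying roughly along y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

w0_hat, w1_hat = fit_simple_linear_regression(x, y)
print(f"w0_hat = {w0_hat:.2f}, w1_hat = {w1_hat:.2f}")   # w0_hat = 1.09, w1_hat = 1.94

# Cross-check: np.polyfit with deg=1 returns [gradient, intercept]
w1_np, w0_np = np.polyfit(x, y, deg=1)
print(np.allclose([w0_hat, w1_hat], [w0_np, w1_np]))     # True
```

Note that the denominator is zero when every $x_i$ is identical, in which case no unique best line exists.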