# Module 1: Introduction to Machine Learning

# Introduction

This module covers the basic definitions of machine learning and introduces its main branches. We will explore machine learning from different perspectives and define the essential terminology. In the following modules, we will examine various machine learning algorithms.

In this course, we assume you are new to machine learning. The goal is to introduce you to the concepts and tools needed to implement machine learning applications. We will cover a large number of techniques, from the simplest and most commonly used (such as linear regression) to some of the deep learning techniques that regularly win machine learning competitions.

In this course, we will use two production-ready Python frameworks:

1. **Scikit-Learn** is an easy-to-use framework that implements many machine learning algorithms efficiently, which makes it a great tool for getting started with machine learning.<br><br>

2. **TensorFlow** is a more complex library for distributed numerical computation using data flow graphs. TensorFlow makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially thousands of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale machine learning applications. It was open-sourced in November 2015.

This course favours a hands-on approach, developing an intuitive understanding of machine learning through concrete working examples and just a little bit of theory. We highly recommend you experiment with the code examples from the course textbook (the examples are available online in Jupyter notebooks at https://github.com/ageron/handson-ml).

In addition to [Anaconda](https://www.anaconda.com/distribution/), which is used to view the course's Jupyter notebooks, we can use [Google Colaboratory](https://colab.research.google.com/notebooks/welcome.ipynb) to run code. Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.

# Learning Outcomes

This module introduces the concepts of machine learning. At the end of this module, you should be able to:

* Define machine learning
* Determine when machine learning is applicable
* Identify the types of machine learning
* Recognize the challenges of machine learning
* Begin to apply machine learning tools and techniques

# Reading and Resources

The textbook for this course is *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* by **Aurélien Géron**:

- Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* (2nd ed.). O’Reilly Media. https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/

<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-1:-Introduction-to-Machine-Learning" data-toc-modified-id="Module-1:-Introduction-to-Machine-Learning">Module 1: Introduction to Machine Learning</a></span>
</li>
<li><span><a href="#Introduction" data-toc-modified-id="Introduction">Introduction</a></span>
</li>
<li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes">Learning Outcomes</a></span>
</li>
<li><span><a href="#Reading-and-Resources" data-toc-modified-id="Reading-and-Resources">Reading and Resources</a></span>
</li>
<li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents">Table of Contents</a></span>
</li>
<li><span><a href="#Prerequisites" data-toc-modified-id="Prerequisites">Prerequisites</a></span>
</li>
<li><span><a href="#Machine-Learning-Overview" data-toc-modified-id="Machine-Learning-Overview">Machine Learning Overview</a></span>
<ul class="toc-item">
<li><span><a href="#What-is-Machine-Learning?" data-toc-modified-id="What-is-Machine-Learning?">What is Machine Learning?</a></span>
<ul class="toc-item">
<li><span><a href="#Example:-Spam-Filter" data-toc-modified-id="Example:-Spam-Filter">Example: Spam Filter</a></span>
</li>
</ul>
</li>
<li><span><a href="#Why-use-Machine-Learning?" data-toc-modified-id="Why-use-Machine-Learning?">Why use Machine Learning?</a></span>
</li>
</ul>
</li>
<li><span><a href="#Types-of-Machine-Learning" data-toc-modified-id="Types-of-Machine-Learning">Types of Machine Learning</a></span>
<ul class="toc-item">
<li><span><a href="#Main-branches-of-Machine-Learning" data-toc-modified-id="Main-branches-of-Machine-Learning">Main branches of Machine Learning</a></span>
</li>
<li><span><a href="#Tabular-Datasets" data-toc-modified-id="Tabular-Datasets">Tabular Datasets</a></span>
</li>
<li><span><a href="#Supervised-Learning-Methods" data-toc-modified-id="Supervised-Learning-Methods">Supervised Learning Methods</a></span>
<ul class="toc-item">
<li><span><a href="#Classification" data-toc-modified-id="Classification">Classification</a></span>
<ul class="toc-item">
<li><span><a href="#Classification-example" data-toc-modified-id="Classification-example">Classification example</a></span>
</li>
</ul>
</li>
<li><span><a href="#Regression" data-toc-modified-id="Regression">Regression</a></span>
<ul class="toc-item">
<li><span><a href="#Regression-example" data-toc-modified-id="Regression-example">Regression example</a></span>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Unsupervised-Learning" data-toc-modified-id="Unsupervised-Learning">Unsupervised Learning</a></span>
<ul class="toc-item">
<li><span><a href="#Clustering" data-toc-modified-id="Clustering">Clustering</a></span>
<ul class="toc-item">
<li><span><a href="#Clustering-example" data-toc-modified-id="Clustering-example">Clustering example</a></span>
</li>
</ul>
</li>
<li><span><a href="#Dimensionality-Reduction" data-toc-modified-id="Dimensionality-Reduction">Dimensionality Reduction</a></span>
</li>
</ul>
</li>
<li><span><a href="#Reinforcement-Learning" data-toc-modified-id="Reinforcement-Learning">Reinforcement Learning</a></span>
</li>
<li><span><a href="#Instance-Based-vs.-Model-Based-Learning" data-toc-modified-id="Instance-Based-vs.-Model-Based-Learning">Instance-Based vs. Model-Based Learning</a></span>
<ul class="toc-item">
<li><span><a href="#Instance-based-learning" data-toc-modified-id="Instance-based-learning">Instance-based learning</a></span>
</li>
<li><span><a href="#Model-based-learning" data-toc-modified-id="Model-based-learning">Model-based learning</a></span>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Exercises" data-toc-modified-id="Exercises">Exercises</a></span>
<ul class="toc-item">
<li><span><a href="#Exercise-Solutions" data-toc-modified-id="Exercise-Solutions">Exercise Solutions</a></span>
</li>
</ul>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

# Prerequisites

This course assumes that you have some Python programming experience and that you are familiar with Python’s main scientific libraries &mdash; in particular `numpy`, `pandas`, and `matplotlib`.

Also, if you are interested in the theoretical framework for machine learning, you should have a reasonable understanding of university-level math (calculus, linear algebra, probability, and statistics). If you don’t know Python yet, http://learnpython.org/ is a great place to start. The official tutorial on http://python.org is also quite good. You can also install [Sololearn](https://www.sololearn.com/Course/Python/) on your smartphone for learning Python 3 in a fun way.

# Machine Learning Overview

## What is Machine Learning?

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a task without using explicit instructions &mdash; relying on patterns and inference instead. Machine learning is considered a subset of artificial intelligence (AI). Machine learning algorithms build a mathematical model of sample data, known as **training data**, in order to make predictions or decisions without being explicitly programmed to perform a task.

Arthur Samuel, in 1959, defined machine learning as a "*field of study that gives computers the ability to learn without being explicitly programmed*." (Géron, 2019)

Another famous and more engineering-oriented definition was given by Tom Mitchell in 1997: "A computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$." (Géron, 2019)

### Example: Spam Filter 

Advanced spam filters use machine learning programs which learn to flag spam given examples of spam emails (typically flagged by users) and examples of regular (non-spam) emails. These example emails form a training set (shown in the figure below). Each training example is called a **training instance** or sample. In this case, the task T is to flag new spam emails, the experience E is the training data, and the performance measure P needs to be defined (e.g. a percentage of correctly classified emails).

![image1.png](attachment:image1.png)
*A labeled training set for supervised learning* (Géron, 2019)

**NOTE:** Training a machine learning algorithm is not as simple as storing a large amount of data on a computer. For example, if you just download a copy of Wikipedia, your computer has a lot of data, but it is not suddenly better at any task.

## Why use Machine Learning?

Suppose we want to write a program for filtering spam messages without using machine learning. How would we achieve this? One approach is to use the procedure depicted in the figure below:

1. Analyze some examples of spam emails. You might notice that certain words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to appear frequently in the subject line. Perhaps you notice other patterns in the sender’s name, the email’s body, and so on.<br><br>

2. Write a detection algorithm for each of the identified patterns. The algorithm will flag emails as spam if a certain number of these patterns are detected.<br><br>

3. Test the program and repeat steps 1 and 2 until the results are satisfactory.

![image2.png](attachment:image2.png)
*The traditional approach* (Géron, 2019)
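The traditional procedure above can be sketched in a few lines. This is only an illustration: the pattern list reuses the phrases mentioned in step 1, while the function name `is_spam` and the `threshold` parameter are assumptions made for this sketch.

```python
# Hand-written patterns, reusing the phrases mentioned in the text.
SPAM_PATTERNS = ["4u", "credit card", "free", "amazing"]

def is_spam(subject, threshold=2):
    """Flag a subject line when at least `threshold` patterns appear in it."""
    text = subject.lower()
    hits = sum(1 for pattern in SPAM_PATTERNS if pattern in text)
    return hits >= threshold

print(is_spam("Amazing FREE credit card offer 4U"))  # True
print(is_spam("Minutes from Tuesday's meeting"))     # False
```

Every new spam pattern requires a manual change to `SPAM_PATTERNS`, which is the maintenance burden this section describes.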

Since the problem is not trivial, your program will likely become a long list of complex rules and will be hard to maintain. In contrast, a spam filter based on machine learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words (figure below). This approach results in a program that is much shorter, easier to maintain, and often more accurate.

![image3.png](attachment:image3.png)
*Machine learning approach* (Géron, 2019)
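As a rough sketch of the machine learning approach, the snippet below learns which words indicate spam from labeled examples instead of hard-coding them. The training set is tiny and hand-made, and the scoring scheme (spam count minus regular count per word) is a deliberately simple stand-in chosen for this illustration, not the method used by real spam filters.

```python
from collections import Counter

# Tiny hand-made training set (illustrative only): (subject line, is_spam).
training_set = [
    ("free credit card offer", True),
    ("amazing deal 4u free", True),
    ("win a free prize now", True),
    ("meeting agenda for monday", False),
    ("lunch on friday", False),
    ("project report attached", False),
]

def train(examples):
    """Count how often each word appears in spam and in regular emails."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, is_spam in examples:
        (spam_counts if is_spam else ham_counts).update(text.lower().split())
    # A word's score is positive when it is seen more often in spam.
    words = set(spam_counts) | set(ham_counts)
    return {w: spam_counts[w] - ham_counts[w] for w in words}

def predict(scores, text):
    """Flag the text as spam when its known words lean towards spam overall."""
    total = sum(scores.get(w, 0) for w in text.lower().split())
    return total > 0

scores = train(training_set)
print(predict(scores, "free prize for you"))         # True
print(predict(scores, "agenda for friday meeting"))  # False

# Performance measure P: fraction of training emails classified correctly.
accuracy = sum(predict(scores, t) == y for t, y in training_set) / len(training_set)
print(accuracy)  # 1.0
```

Adding new labeled emails to `training_set` changes the learned scores without any change to the code, which is why this approach is easier to maintain.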

# Types of Machine Learning

Machine learning systems can be categorized by several characteristics:

* **Branches of machine learning**: Whether or not they are trained with human or machine supervision (e.g. supervised, unsupervised, semi-supervised, and reinforcement learning).


* **Model creation process**: Whether or not they can learn incrementally on the fly (i.e. online vs. batch learning).


* **Model characteristics**:  Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model (i.e. instance-based vs. model-based learning).

## Main branches of Machine Learning

There are 3 main branches of machine learning. The data available and its properties will often dictate which method is appropriate.

1. **Unsupervised Learning**: This is when we have a dataset (say, a data table or a dataframe) and we're interested in finding patterns in the data, but not in trying to predict the value in a particular cell based on the other values in the same row. In this case we're interested in any inter-relationships that might exist between columns. The relationships could be a linear correlation, or perhaps something more complex. If we find that some of the columns have even a loose functional relationship between them, it suggests (but certainly doesn't prove) that there might be an underlying cause-and-effect relationship between them, or that both are influenced by some common factor. This may lead us to design additional experiments to better understand whether the relationship is real and, if so, why it exists.<br><br>

2. **Supervised Learning**: Often we are interested in building a *predictive* model where we would like to learn how to predict the value (or label for categorical data) in one of the columns given the other values in the same table row. We call the value we're trying to predict the *target*. Here we want to discover if there is a functional relationship between the values in a row and the target. If so we can hopefully use this function to predict the value of the target when presented with a new observation having values for only the independent variables.<br><br>

3. **Reinforcement Learning**: Whereas unsupervised and supervised learning are about passively learning from data, *reinforcement learning* is about actively learning by trial and error. In reinforcement learning, an "intelligent agent" interacts with the world and learns what works and what doesn't through experience. Reinforcement learning is primarily used in situations where either a software agent or a robot needs to learn a complex task and it's unclear how to program it to do so. For example, a robot that needs to pick up parts of various shapes may discover that it's easier to pick some of them up in pairs squeezed together than individually, whereas that might not be obvious to the robot's programmer. The robot can try random pairs, discovering what actually works well in practice and what doesn't.

## Tabular Datasets

To understand the differences between machine learning approaches, we need to examine the properties of training datasets. Tabular datasets are useful for both supervised and unsupervised learning. However, for reinforcement learning, often the training data cannot be stored as a tabular dataset and other learning strategies are used.

Data can be represented as a tabular dataset if the data can be re-structured to match the format shown in the table below. Regardless of the data type (e.g. data sheets, images, voice, text, etc.), if we are able to convert each sample into a set of features, then the data can be represented as a tabular dataset. For example, an image is essentially a matrix of numerical colour values. Thus, we can convert this matrix into a table of rows, each representing a single pixel with 3 features: x coordinate, y coordinate, and colour value.

![image5.png](attachment:image5.png)
*General structure of a tabular dataset*
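To make the image example concrete, here is a minimal sketch that turns a small matrix of colour values into one row per pixel. The 2x3 greyscale matrix and the column names `x`, `y`, and `colour` are assumptions made for this illustration.

```python
# A tiny 2x3 greyscale "image": each entry is a colour (intensity) value.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

# One row per pixel, with three features: x coordinate, y coordinate, colour value.
rows = [
    {"x": x, "y": y, "colour": value}
    for y, image_row in enumerate(image)
    for x, value in enumerate(image_row)
]

for row in rows:
    print(row)
```

The six resulting rows all share the same three feature columns, which is the structure a tabular dataset requires.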

The `Target` column in the dataset is a predetermined label for each sample. This column is required for **supervised machine learning**. In supervised learning, we are trying to learn a function $f$ which maps inputs (features) to outputs given known input-output pairs. In the image above, the `Feature` columns are the inputs and the `Target` is an already known output value for the sample data. The learning algorithm will use both the features and the target values to learn the function $f$. For example, consider a dataset of housing information. In this dataset, features are values such as the age of the house, the number of bedrooms, or its location. The target values are house prices. This dataset can be used for supervised learning since the target values (prices) are included in the dataset. If the `Target` column contains a fixed number of categorical values (such as True / False, or Blue / Green / Red), then the supervised learning task is called **classification**. However, if the target column contains numerical values, the task is called **regression**.
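The sketch below separates features from the target and applies a rough rule of thumb for telling regression from classification. The housing records, the column names, and the helpers `split_features_target` and `task_type` are all hypothetical. The rule of thumb would also misread classes that happen to be coded as integers, so treat it as an illustration only.

```python
# Hypothetical housing records; the column names are assumptions for this sketch.
houses = [
    {"age": 12, "bedrooms": 3, "price": 350000.0},
    {"age": 40, "bedrooms": 2, "price": 210000.0},
    {"age": 5, "bedrooms": 4, "price": 480000.0},
]

def split_features_target(records, target_name):
    """Separate each record into its feature values and its target value."""
    features = [{k: v for k, v in r.items() if k != target_name} for r in records]
    targets = [r[target_name] for r in records]
    return features, targets

def task_type(targets):
    """Rule of thumb: numerical targets suggest regression, otherwise classification."""
    numeric = all(
        isinstance(t, (int, float)) and not isinstance(t, bool) for t in targets
    )
    return "regression" if numeric else "classification"

X, y = split_features_target(houses, "price")
print(task_type(y))                         # regression
print(task_type(["Red", "Blue", "Green"]))  # classification
print(task_type([True, False, True]))       # classification
```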

Now, consider the situation where we do not have labels. For example, we have collected housing features without house prices and we are looking for some way to group houses which are similar. In this scenario, we can use **unsupervised machine learning**.

## Supervised Learning Methods

### Classification

The figure below depicts the training process for **classification**. Classification is a supervised learning task in which the target is categorical. The following conditions must be met to use classification:

1. Each sample in the training dataset has a label (target value).<br><br>
    
2. The outcomes of the learned function $f$ are categorical.

![image6.png](attachment:image6.png)
*Overview of classification*

#### Classification example

Let's examine a classic machine learning dataset known as the Iris Data Set. In the figure below, there are three types of iris plants (setosa, versicolor, and virginica) and four features collected for each sample: sepal length, sepal width, petal length, and petal width. The full dataset contains 50 records for each type of iris plant. In this example, the goal of classification is to find an estimator function that can predict the type of iris based on only the four features.

![image7.png](attachment:image7.png)
*Classification example. Iris Data Set features with target values.*
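As a hedged illustration of what an estimator function could look like, the sketch below uses a nearest-centroid classifier: it averages the four features per iris type and assigns a new flower to the closest average. The six samples are illustrative values in the style of the Iris Data Set, and nearest-centroid is just one simple choice among many classifiers.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Illustrative samples: (sepal length, sepal width, petal length, petal width) in cm.
training_data = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def fit_centroids(data):
    """Compute the mean feature vector (centroid) of each class."""
    grouped = {}
    for features, label in data:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the class whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = fit_centroids(training_data)
print(predict(centroids, (5.0, 3.4, 1.5, 0.2)))  # setosa
print(predict(centroids, (6.5, 3.0, 5.8, 2.2)))  # virginica
```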

### Regression

The figure below depicts the training process for **regression**. Regression is a supervised learning task in which the target is numerical. The following conditions must be met to use regression:

1. Each sample in the training dataset has a label (target value).<br><br>
    
2. The outcomes of the learned function $f$ are numerical.

![image8.png](attachment:image8.png)
*Overview of regression*

#### Regression example
The dataset below represents housing data. The task is to create an estimator function that predicts house prices (`median_house_value`) for houses that are not included in the dataset, given their features. Since a label/target value is included in the dataset and the target values are numerical, this is an example of regression.

![image9.png](attachment:image9.png)
*An example of regression*
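A minimal sketch of such an estimator, assuming a single feature and ordinary least squares, is shown below. The numbers are made up and deliberately noise-free so the result is easy to check by hand; `median_income` is used here only as a plausible example feature.

```python
# Hypothetical data: median income (feature) and median house value (target).
median_income = [2.0, 3.0, 4.0, 5.0, 6.0]
median_house_value = [110000.0, 150000.0, 190000.0, 230000.0, 270000.0]

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line(median_income, median_house_value)

# Predict the value for a district that is not in the dataset.
predicted = intercept + slope * 4.5
print(round(predicted))  # 210000
```

With real, noisy data the fitted line would not pass through every point, and several features would normally be used together.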

## Unsupervised Learning

### Clustering

The figure below depicts **clustering**. A clustering task learns a function $f$ based on a dataset which does not contain labels. Since there are no labels, this is an unsupervised learning method. The inputs of this function are mapped to output groups (clusters), hence the name clustering.

![image10.png](attachment:image10.png)
*Overview of clustering*

#### Clustering example
We will once again use the Iris Data Set, this time in a clustering exercise. In the figure below, petal length, petal width, sepal length, and sepal width are displayed. In this task, we assume we are not aware of the number of iris types. Thus, we have the freedom to select the number of output clusters, and the number of groups generated could be different than the actual number of iris types. However, rather than select a number of groups at random, there are metrics we can use to decide how many clusters to generate.

![image11.png](attachment:image11.png)
*Clustering example*
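The sketch below shows one common clustering algorithm, k-means, on a few illustrative (petal length, petal width) pairs. The data values are assumptions for this example, and the evenly spaced starting centroids are a simplification chosen for reproducibility; library implementations typically use randomized initialization.

```python
from math import dist

# Illustrative (petal length, petal width) measurements, without labels.
points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
    (4.7, 1.4), (4.5, 1.5), (4.9, 1.5),
    (6.0, 2.5), (5.9, 2.1), (6.1, 2.3),
]

def k_means(points, k, iterations=20):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Deterministic start for reproducibility: k evenly spaced points.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

labels_3, _ = k_means(points, k=3)
labels_2, _ = k_means(points, k=2)
print(labels_3)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(labels_2)  # [0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Running the same data with `k=2` and `k=3` gives different groupings, which illustrates that the number of clusters is a choice made by the analyst rather than something read from labels.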

### Dimensionality Reduction

In machine learning, dimensionality reduction is an unsupervised process for reducing the number of features by obtaining a set of principal variables. Approaches can be divided into feature selection (keeping a subset of the original features) and feature extraction (deriving a smaller set of new features from the original ones). For example, imagine we obtain 100 features for each record. We then have 100 dimensions in our feature space, and probably not all of them are useful for creating a machine learning model. Dimensionality reduction algorithms provide tools for transforming our feature space into a new space with fewer dimensions. This yields benefits such as faster training.
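One very simple form of feature selection is to drop features that barely vary across records, since they carry little information. The sketch below assumes a hypothetical four-feature dataset and a hand-picked variance threshold. Variance depends on each feature's scale, so in practice features are usually rescaled before a rule like this is applied.

```python
from statistics import pvariance

# Hypothetical dataset: each row is a record, each column a feature.
records = [
    [5.1, 1.0, 0.20, 10.0],
    [4.9, 1.0, 0.21, 35.0],
    [6.3, 1.0, 0.19, 20.0],
    [5.8, 1.0, 0.20, 50.0],
]

def select_features(rows, min_variance):
    """Keep only the columns whose variance is at least `min_variance`."""
    columns = list(zip(*rows))
    kept = [i for i, col in enumerate(columns) if pvariance(col) >= min_variance]
    reduced = [[row[i] for i in kept] for row in rows]
    return kept, reduced

kept, reduced = select_features(records, min_variance=0.01)
print(kept)        # [0, 3]
print(reduced[0])  # [5.1, 10.0]
```

Here the feature space shrinks from four dimensions to two: the constant second column and the nearly constant third column are removed.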

## Reinforcement Learning

In reinforcement learning, the learning system, called an **agent** in this context, can both observe the environment and perform actions. When the actions turn out to be favourable, the agent gets a reward. However, if the actions are not favourable, a negative reward or penalty is applied. The agent must learn by itself what is the best strategy to get the most reward over time. This strategy, known as a **policy**, is used to determine which action the agent should choose in a given situation.

![image12.png](attachment:image12.png)
*Reinforcement Learning (Géron, 2019)*

For example, robots can implement reinforcement learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of reinforcement learning. It made headlines in March 2016 when it beat the world champion Lee Sedol at the game of Go (Koch, 2016). AlphaGo learned its winning policy by analyzing millions of games, and then by playing many games against itself. Learning was turned off during the games against Lee Sedol; AlphaGo was simply applying the policy it had already learned.
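The reward-and-penalty loop can be sketched with a deliberately tiny problem: one situation, two actions, and an epsilon-greedy agent that mostly exploits the best-known action and occasionally explores. The environment, the action names, and the reward values are all hypothetical, and this single-state setup is far simpler than systems such as AlphaGo.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

ACTIONS = ["left", "right"]

def environment(action):
    """Hypothetical environment: 'right' is rewarded, 'left' is penalized."""
    return 1.0 if action == "right" else -1.0

# The agent's running estimate of the reward for each action.
value = {action: 0.0 for action in ACTIONS}
counts = {action: 0 for action in ACTIONS}
epsilon = 0.1  # probability of exploring a random action

for _ in range(200):
    if random.random() < epsilon:
        action = random.choice(ACTIONS)       # explore
    else:
        action = max(ACTIONS, key=value.get)  # exploit the best-known action
    reward = environment(action)
    counts[action] += 1
    # Incremental average of the rewards observed for this action.
    value[action] += (reward - value[action]) / counts[action]

# The learned policy: pick the action with the highest estimated reward.
policy = max(ACTIONS, key=value.get)
print(policy)  # right
```

The agent is never told which action is correct. It tries `left` first, receives a penalty, and from then on prefers `right` because its estimated reward is higher.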

## Instance-Based vs. Model-Based Learning

Another way to categorize machine learning systems is by how they generalize. Most machine learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. A good performance measure on the training data is helpful but insufficient. The true goal is to perform well on new instances. There are two main approaches to generalization: **instance-based learning** and **model-based learning**.

### Instance-based learning

One of the most trivial forms of learning is simply to memorize. If we were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users. This is not the worst solution, but certainly not the best.

Now, instead of just flagging emails that are identical to known spam emails, the spam filter could be programmed to flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) measure could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email. This is an example of instance-based learning: the system learns the examples through memorization, then generalizes to new cases using a similarity measure. In the figure below, **k-Nearest Neighbours (k-NN)** is used. We have two features (Feature 1 and Feature 2) and two classes (triangles and squares). Each of the training instances is labeled as a triangle or a square.

X is a new data point which does not exist in the dataset, and its label is unknown. Our task is to assign a suitable label (triangle or square) to this new data point. This is a classification task: the algorithm finds the training examples closest to the new data point and uses their labels to decide. For this reason, all the training examples must be stored by the algorithm. The similarity measure here is the distance between the new data point and each existing training instance. If we choose $k=3$, the algorithm finds the 3 closest neighbours and counts how many of them are triangles vs. squares. Since there are more triangles than squares among these 3 neighbours, X is assigned to the triangle class.

![image13.png](attachment:image13.png)
*Instance-based learning (Géron, 2019)*
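The k-NN procedure described above can be sketched as follows. The coordinates are hypothetical and were chosen so that, with $k=3$, the new point has two triangle neighbours and one square neighbour.

```python
from collections import Counter
from math import dist

# Hypothetical training instances: ((feature 1, feature 2), label).
training_instances = [
    ((1.0, 1.0), "triangle"),
    ((1.5, 2.0), "triangle"),
    ((2.0, 1.5), "triangle"),
    ((4.0, 4.0), "square"),
    ((4.5, 3.5), "square"),
    ((3.0, 2.5), "square"),
]

def knn_predict(instances, new_point, k=3):
    """Label a new point by majority vote among its k nearest training instances."""
    nearest = sorted(instances, key=lambda item: dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training_instances, (2.2, 2.0)))  # triangle
print(knn_predict(training_instances, (4.2, 3.8)))  # square
```

Note that `knn_predict` needs the full list of training instances at prediction time, which is the defining trait of instance-based learning.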

### Model-based learning

Another way to generalize from a set of examples is to build a model, and then use that model to make predictions. This is called model-based learning. The figure below shows the same classification example we used in our discussion of instance-based learning. This time, instead of storing all the instances and using a similarity measure, a model called a **decision boundary** is created to distinguish between classes. This is depicted as the dashed curve in the figure below. After creating this model, there is no need to store the training instances inside the algorithm. A new instance X will be classified based on where it lies relative to the decision boundary.

![image14.png](attachment:image14.png)
*Model-based learning (Géron, 2019)*
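As a sketch of learning a decision boundary, the snippet below trains a perceptron, which finds a straight line separating two classes (a simplification of the curved boundary in the figure). The training points are hypothetical and linearly separable, which is what allows the perceptron to converge.

```python
# Hypothetical training instances: +1 for triangles, -1 for squares.
training_instances = [
    ((1.0, 1.0), 1), ((2.0, 1.0), 1), ((1.0, 2.0), 1),
    ((4.0, 4.0), -1), ((5.0, 4.0), -1), ((4.0, 5.0), -1),
]

def train_perceptron(instances, max_epochs=1000):
    """Learn (w1, w2, b) so that the line w1*x1 + w2*x2 + b = 0 separates the classes."""
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), label in instances:
            if label * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified or on the line
                w1 += label * x1
                w2 += label * x2
                b += label
                mistakes += 1
        if mistakes == 0:
            break  # every training instance is on the correct side
    return w1, w2, b

def classify(model, point):
    """Classify a point by which side of the learned line it falls on."""
    w1, w2, b = model
    return "triangle" if w1 * point[0] + w2 * point[1] + b > 0 else "square"

model = train_perceptron(training_instances)  # only three numbers are kept
print(classify(model, (1.5, 1.5)))  # triangle
print(classify(model, (4.5, 4.5)))  # square
```

Once `model` is computed, `classify` no longer needs `training_instances`: the three learned numbers are the whole model.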

In the figure below, several linear regression models are proposed to model life satisfaction based on GDP. Among these models, it seems that the blue line provides a better generalized model. Instead of storing all the training instances, we just keep the best model (in this case the $\theta_0$ and $\theta_1$ values which represent the blue line).

![image15.png](attachment:image15.png)
*Model-based learning (Géron, 2019)*
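One way to decide which candidate line is "best" is to compare how far each line's predictions fall from the observed values. The sketch below uses mean squared error for that comparison. The data points, the candidate $\theta_0$ and $\theta_1$ values, and the red and green candidates are made up for illustration.

```python
# Hypothetical data: GDP per capita (thousands of USD) and life satisfaction.
gdp = [10.0, 20.0, 30.0, 40.0, 50.0]
satisfaction = [5.1, 5.6, 6.0, 6.6, 7.0]

# Candidate models, each defined only by (theta_0, theta_1).
candidates = {
    "red": (8.0, -0.05),
    "green": (4.0, 0.10),
    "blue": (4.6, 0.05),
}

def mean_squared_error(theta_0, theta_1):
    """Average squared gap between the line's predictions and the observed values."""
    errors = [(theta_0 + theta_1 * x - y) ** 2 for x, y in zip(gdp, satisfaction)]
    return sum(errors) / len(errors)

best = min(candidates, key=lambda name: mean_squared_error(*candidates[name]))
print(best)  # blue
```

Only the two parameters of the winning line need to be kept in order to make predictions for new GDP values.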

# Exercises

**Q1**. Assume we track the number of hours students spend studying for a course along with their pass or fail results. If we use this data for future students to predict if they will pass or fail, this is an example of:

- a) Classification
- b) Regression
- c) Clustering
- d) Reinforcement Learning


**Q2**. Assume we collect the number of hours students study for a course and their final mark in the range 0 to 100.  If we use this data for future students to predict their final mark in the range 0 to 100, this is an example of:

- a) Classification
- b) Regression
- c) Clustering
- d) Reinforcement Learning

**Q3**. Assume we collect the number of hours students study for a course and using only this information we want to categorize them into two groups: Pass or Fail. This is an example of:

- a) Classification
- b) Regression
- c) Clustering
- d) Reinforcement Learning

**Q4**. Assume we develop an algorithm for playing chess, and after each move based on our evaluation of the chess board configuration, we apply a penalty or reward to the algorithm. This is an example of:

- a) Classification
- b) Regression
- c) Clustering
- d) Reinforcement Learning

## Exercise Solutions

**Answer to Q1**:

- (a) is correct. Since we have categorical labels in our dataset (pass/fail) and the task is to predict one of these two categories, this is a classification task.

**Answer to Q2**:

- (b) is correct. Since we have numerical labels in our dataset (final marks) and the task is to predict a numerical target value, this is a regression task.

**Answer to Q3**:

- (c) is correct. Since we do not have predefined labels (numerical or categorical) in our dataset and the task is to group students based on a collected feature (hours spent studying), the task is an example of clustering.

**Answer to Q4**:

- (d) is correct. The algorithm learns from penalties and rewards applied after each move, which is the defining feature of reinforcement learning. In addition, the space of the problem (all the possible games in chess) is so large that it cannot practically be described in a tabular dataset format, so the algorithm learns to evaluate the best move for each chess board configuration through trial and error.

# References

Géron, A. (2019). *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow* (2nd ed.). O’Reilly Media. https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/
 
Koch, C. (2016). How the Computer Beat the Go Master. https://www.scientificamerican.com/article/how-the-computer-beat-the-go-master/

Sololearn. https://www.sololearn.com/Course/Python/