# A Brief Introduction to Machine Learning for image classification

> Written by Dr Daniel Buscombe, Northern Arizona University

> Part of a series of notebooks for image recognition and classification using deep convolutional neural networks


![](figs/Picture4.jpg)

## What constitutes an 'image'?

We use a broad, loose definition of the word "image":

* not just a photograph
* almost any dataset with two or more dimensions that is naturally a raster or is "rasterizable" (e.g. ultrasonic, radar, sonar, seismic, or digital elevation model (DEM) data)

## What is Machine Learning?

* A means of building mathematical models that help us understand data.

* "Learning" is the process of giving these models tunable parameters that can be adapted to observed data; in this way the program can be considered to be "learning" from the data.

* The model learns how the data is structured so that it can make predictions.

* Once these models have been fit to previously seen data, they can be used to predict and understand aspects of newly observed data.

* The fundamental goal is to *generalize*: a program that merely memorizes its observations may not perform its task well on new data.

* The aim is not to fit the training data perfectly. A model that fits the training data closely but does not generalize well is *overfitted*.

* Classification predicts discrete groups; regression predicts a continuous quantity.
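
As a minimal sketch of what "tunable parameters adapted to observed data" can look like, the snippet below fits a straight line by ordinary least squares and then predicts for an unseen input. The data values and the `fit_line` helper are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical observations that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)     # the "learning" step: parameters come from data
prediction = slope * 4.0 + intercept    # apply the fitted model to unseen x = 4
print(slope, intercept, prediction)
```

Here the slope and intercept are the tunable parameters; nothing about the line was written by hand.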

## Categories

![](https://www.mathworks.com/content/mathworks/www/en/discovery/machine-learning/jcr:content/mainParsys3/discoverysubsection_1965078453/mainParsys3/image_2128876021_cop.adapt.full.high.svg/1523365053391.svg)

### 1. Supervised learning 
* Involves modeling the relationship between measured features of data and some label associated with the data
* Once this model is determined, it can be used to apply labels to new, unknown data. 
* This is further subdivided into classification tasks and regression tasks
    * Classification: the labels are discrete categories
    * Regression: the labels are continuous quantities
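
The difference between the two supervised tasks can be sketched with a toy nearest-neighbour model: the same lookup returns a category for classification and a number for regression. The feature values, class names, and depths below are hypothetical.

```python
def nearest_index(train_x, x):
    """Index of the training sample whose feature is closest to x."""
    return min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))

# Hypothetical 1D feature (e.g. mean pixel brightness) for five labelled samples
train_x = [0.10, 0.20, 0.30, 0.80, 0.90]
class_labels = ["water", "water", "water", "sand", "sand"]   # discrete labels
depth_labels = [4.0, 3.5, 3.0, 0.5, 0.2]                     # continuous labels

new_x = 0.75
i = nearest_index(train_x, new_x)
predicted_class = class_labels[i]   # classification: a category
predicted_depth = depth_labels[i]   # regression: a quantity
print(predicted_class, predicted_depth)
```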

![](https://lakshaysuri.files.wordpress.com/2017/03/sup-vs-unsup.png?w=648)

### 2. Unsupervised learning
* Involves modeling the features of a dataset without reference to any label
* These models include tasks such as clustering and dimensionality reduction
    * Clustering algorithms: identify distinct groups of data
    * Dimensionality reduction algorithms: search for more succinct representations of the data
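
As an illustration of clustering without labels, here is a minimal one-dimensional k-means sketch. The data, the starting centers, and the fixed iteration count are assumptions chosen to keep the example small.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Alternate between assigning points to centers and moving the centers."""
    centers = list(centers)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels, centers

# Hypothetical unlabelled measurements forming two groups
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels, centers = kmeans_1d(values, centers=[0.0, 6.0])
print(labels, centers)
```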
    
***    

### A 'typical' supervised learning workflow

![](https://morganpolotan.files.wordpress.com/2015/04/supervised_learning_model2.png)

***

![](https://cdn.intellipaat.com/wp-content/uploads/2015/11/machine-learning-algorithms-460x255.png)

***

### Machine Learning for image analysis

Machine learning is also used for image classification and object localization.

#### 1. Image recognition

For example, recognizing faces ...

![](figs/Picture1.png)

... classifying organisms ... etc

![](figs/Picture2.png)

***

#### 2. Semantic segmentation

Classifying each pixel in each image

![](figs/Picture3.png)
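
A toy sketch of per-pixel classification: every pixel gets its own label, so the output has the same shape as the input. The brightness threshold stands in for a trained classifier, and the image values and class names are hypothetical.

```python
# Hypothetical 3x4 grayscale image with values in [0, 1]
image = [
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.3, 0.7, 0.9],
    [0.2, 0.2, 0.6, 0.8],
]

def classify_pixel(value, threshold=0.5):
    """Toy per-pixel classifier: dark pixels are 'water', bright ones 'sand'."""
    return "sand" if value > threshold else "water"

# One label per pixel, arranged exactly like the image
label_map = [[classify_pixel(v) for v in row] for row in image]
for row in label_map:
    print(row)
```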


#### 3. Others

![](https://cdn-images-1.medium.com/max/1600/1*6ugm_qZgwuWMnIrWJUR3rg.png)

***

## Distinction between Machine and Deep Learning

Machine learning ...
* requires extracting features from data to input to the model
* requires fine-tuning of model architecture
* requires fine-tuning of model hyperparameters
* performance tends to plateau with more data
* lots of different models

Deep learning ...
* automatically extracts features from data
* automatically learns model parameters (weights) from data, although hyperparameters such as the learning rate still need tuning
* performance doesn't tend to plateau with more data
* requires fine-tuning of model architecture
* just one model - the artificial neural network

![](https://images.xenonstack.com/blog/machine-learning-vs-deep-learning.png)

***


## Hyperparameters and model validation

The basic recipe for applying a supervised machine learning model:

* Choose a class of model
* Choose model hyperparameters
* Fit the model to the training data
* Use the model to predict labels for new data

![](https://media.mljar.com/blog/are-hyper-parameters-really-important-in-machine-learning/head/are-hyper-parameters-really-important-in-machine-learning.jpg)


### What is a Model Parameter?
A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.

* They are required by the model when making predictions.
* Their values define the skill of the model on your problem.
* They are estimated or learned from data.
* They are often not set manually by the practitioner.
* They are often saved as part of the learned model.


### What is a Model Hyperparameter?
A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.

* They are often used in processes to help estimate model parameters.
* They are often specified by the practitioner.
* They are often tuned for a given predictive modeling problem.
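
The distinction can be sketched with a one-dimensional ridge regression without intercept, used here purely as an illustration: the penalty `alpha` is a hyperparameter chosen by the practitioner, while the weight `w` is a parameter estimated from the data. The data values are hypothetical.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Return the weight w minimising sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Hypothetical data lying on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 1.0, 14.0):          # hyperparameter: set by hand
    w = fit_ridge_1d(xs, ys, alpha)     # parameter: estimated from data
    print(f"alpha={alpha:5.1f} -> w={w:.3f}")
```

Changing `alpha` changes which `w` the fit returns, but `alpha` itself is never learned from the training data in this sketch.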


![](https://image.slidesharecdn.com/icmltalk-160620125922/95/hyperparameter-optimization-with-approximate-gradient-2-638.jpg?cb=1467797347)

After choosing a model and its hyperparameters, we can estimate how effective it is by applying it to labeled data held out from training (a validation set) and comparing the predictions to the known values.
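
A minimal sketch of this check, assuming a simple hold-out split: part of the labelled data is set aside, and the model is scored on data it did not see during fitting. The data, the 15/5 split, and the 1-nearest-neighbour model are illustrative choices.

```python
import random

def predict_1nn(train, x):
    """Label of the training sample whose feature is closest to x."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]

def accuracy(train, test):
    """Fraction of test samples whose predicted label matches the known label."""
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)

# Hypothetical labelled data: (feature, label)
data = [(i / 20, "water" if i < 10 else "sand") for i in range(20)]

random.seed(0)
random.shuffle(data)
train, validation = data[:15], data[15:]

train_accuracy = accuracy(train, train)            # optimistic: the model has seen these
validation_accuracy = accuracy(train, validation)  # estimate for unseen data
print(train_accuracy, validation_accuracy)
```

A 1-nearest-neighbour model always scores perfectly on its own training samples, which is why the training score alone says little about generalization.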

***

## Bias-Variance Trade-off

If our estimator is underperforming, how should we move forward?

* Use a more complicated/more flexible model
* Use a less complicated/less flexible model
* Gather more training samples
* Gather more data to add features to each sample

Fundamentally, the question of "the best model" is about finding a sweet spot in the tradeoff between bias and variance. 

![](https://cdn-images-1.medium.com/max/1600/1*x8CBE7eAbaifwM15KNHuUA.png)

### Bias
If a model doesn't have enough flexibility to suitably account for all the features in the data, it is said to underfit the data. Another way of saying this is that the model has high bias.

### Variance
A model may have enough flexibility to account nearly perfectly for the fine features in the data, yet its precise form may reflect the particular noise properties of that data rather than the intrinsic properties of whatever process generated it.

Such a model is said to overfit the data: that is, it has so much model flexibility that the model ends up accounting for random errors as well as the underlying data distribution; another way of saying this is that the model has high variance.
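
For squared-error loss, the trade-off is commonly written as the decomposition below. The notation is introduced here for illustration: $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

An inflexible model makes the first term large; an overly flexible one makes the second term large.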


If we imagine that we have some ability to tune the model complexity, we would expect the training and validation error to behave as illustrated in the following figure:

![](https://www.learnopencv.com/wp-content/uploads/2017/02/Bias-Variance-Tradeoff-In-Machine-Learning-1.png)


* The training error is lower than the validation error
    * This means that the model will be a better fit to data it has seen than to data it has not seen.

* For very low model complexity (a high-bias model), the training data is under-fit
    * the model is a poor predictor both for the training data and for any previously unseen data.

* For very high model complexity (a high-variance model), the training data is over-fit
    * the model predicts the training data very well, but fails for any previously unseen data.

* For some intermediate value, the validation error reaches a minimum. This level of complexity indicates a suitable trade-off between bias and variance.
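
The behaviour described above can be reproduced with a small experiment. This sketch uses k-nearest-neighbour regression, where a smaller `k` means a more flexible model; the noisy sine data and all function names are assumptions made for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error of the k-NN model on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)

def sample(n):
    """Hypothetical noisy observations of sin(x) on [0, 6]."""
    points = []
    for _ in range(n):
        x = random.uniform(0.0, 6.0)
        points.append((x, math.sin(x) + random.gauss(0.0, 0.3)))
    return points

random.seed(1)
train, validation = sample(40), sample(40)

# From least flexible (k = 40, a constant prediction) to most flexible (k = 1)
curve = {k: (mse(train, train, k), mse(train, validation, k)) for k in (40, 20, 10, 5, 1)}
for k, (train_err, val_err) in curve.items():
    print(f"k={k:2d}  training MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

At `k = 1` the training error is exactly zero because each training point is its own nearest neighbour, while the validation error is not.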

## Learning curves

A plot of the training/validation error with respect to the size of the training set is known as a learning curve.

The general behavior we would expect from a learning curve, stated in terms of a score where higher is better, is:

* A model of a given complexity will overfit a small dataset
    * the training score will be relatively high, while the validation score will be relatively low.
* A model of a given complexity will underfit a large dataset
    * the training score will decrease, but the validation score will increase.
* A model will never, except by chance, give a better score to the validation set than the training set
    * this means the curves should keep getting closer together but never cross.


![](http://www.yuthon.com/images/typical-learning-curve-for-high-variance.png)

Once you have enough points that a particular model has converged, adding more training data will not help you! The only way to increase model performance in this case is to use another (often more complex) model.
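
A learning curve can be sketched by refitting the same model on growing subsets of the training pool. This example reports mean squared error, so lower is better, which is the opposite sense to a score. The straight-line model, the noisy data, and the subset sizes are assumptions for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def mse(model, points):
    """Mean squared error of a (slope, intercept) model on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

def sample(n):
    """Hypothetical noisy observations of y = 2x + 1."""
    points = []
    for _ in range(n):
        x = random.uniform(0.0, 5.0)
        points.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.5)))
    return points

random.seed(2)
pool, validation = sample(200), sample(200)

learning_curve = []
for n in (2, 5, 20, 50, 200):
    subset = pool[:n]
    model = fit_line([x for x, _ in subset], [y for _, y in subset])
    learning_curve.append((n, mse(model, subset), mse(model, validation)))

for n, train_err, val_err in learning_curve:
    print(f"n={n:3d}  training MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

With only two points the line passes through both, so the training error is essentially zero; as the subset grows the training error rises toward the noise level.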

## Choosing Hyperparameters: Grid Search

Models generally have more than one knob to turn, and thus plots of validation and learning curves change from lines to multi-dimensional surfaces. In these cases, such visualizations are difficult and we would rather simply find the particular model that maximizes the validation score.

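Grid searching allows us to do this: it evaluates every combination of candidate hyperparameter values and keeps the combination with the best validation score.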

![](https://i.stack.imgur.com/02p4O.png)
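
A minimal sketch of a grid search over two hyperparameters of a k-nearest-neighbour classifier: the number of neighbours `k` and the Minkowski distance exponent `p`. The data, the candidate values, and the single validation split are assumptions; library implementations such as scikit-learn's `GridSearchCV` typically use cross-validation instead.

```python
from itertools import product

def knn_classify(train, point, k, p):
    """Majority vote among the k nearest neighbours under the Minkowski-p distance."""
    def distance(a, b):
        return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1.0 / p)
    nearest = sorted(train, key=lambda s: distance(s[0], point))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)

def accuracy(train, test, k, p):
    """Fraction of test samples classified correctly."""
    return sum(knn_classify(train, x, k, p) == y for x, y in test) / len(test)

# Hypothetical 2D features; the last training sample is a deliberately noisy label
train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"), ((0.1, 0.3), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.2), "b"), ((1.1, 0.8), "b"),
         ((0.3, 0.3), "b")]
validation = [((0.1, 0.1), "a"), ((0.3, 0.2), "a"),
              ((1.0, 0.9), "b"), ((0.8, 1.1), "b")]

grid = {"k": [1, 3, 5], "p": [1, 2]}    # candidate hyperparameter values
scores = {
    (k, p): accuracy(train, validation, k, p)
    for k, p in product(grid["k"], grid["p"])
}
best_params = max(scores, key=scores.get)   # first combination with the top score
best_score = scores[best_params]
print(best_params, best_score)
```

In this toy data, `k = 1` is misled by the noisy training label, while larger `k` votes it down.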

***
## Case study

[Hoonhout et al. (2015)](https://www.sciencedirect.com/science/article/pii/S0378383915001313), "An automated method for semantic classification of regions in coastal images", Coastal Engineering 105, 1-12.

Objective: develop a machine learning classifier for semantic segmentation of images of coasts.

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr1.jpg)

Four steps: 
1. a manually annotated dataset of coastal images is oversegmented into superpixels; 
2. for all images in the dataset an extensive set of features is extracted; 
3. a suitable classification model is trained using the manually annotated data; and 
4. the trained model is used to automatically classify future images. 

***

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr2.jpg)

### Oversegmentation

Oversegmentation is the process of subdividing the image into smaller segments of similar pixels, which are called superpixels.

It is used to reduce the number of features (a form of data dimensionality reduction).

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr3.jpg)
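
To show the kind of output oversegmentation produces, the sketch below labels each pixel with the index of a regular square tile. This is only a crude stand-in: real superpixel algorithms group *similar* pixels, and the method used in the paper is not reproduced here. The `block_segments` helper and the image size are hypothetical.

```python
def block_segments(height, width, block):
    """Label each pixel with the index of the block x block tile containing it."""
    tiles_per_row = (width + block - 1) // block   # ceiling division
    return [
        [(r // block) * tiles_per_row + (c // block) for c in range(width)]
        for r in range(height)
    ]

# A 4x6 image (24 pixels) becomes 6 segments of 4 pixels each
segments = block_segments(height=4, width=6, block=2)
n_segments = len({label for row in segments for label in row})
for row in segments:
    print(row)
print(n_segments)
```

Subsequent steps then work with one record per segment instead of one per pixel.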

### Feature extraction

The classification algorithm uses 1727 features from four categories: position, intensity, shape, and texture.

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr5.jpg)
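
A minimal sketch of per-superpixel feature extraction, with one simple feature per category. These five features are illustrative stand-ins, not the paper's feature set, and the image and label map are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical 4x4 grayscale image and a matching superpixel label map
image = [
    [0.1, 0.1, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
    [0.5, 0.5, 0.4, 0.6],
    [0.5, 0.5, 0.6, 0.4],
]
segments = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
]

def superpixel_features(image, segments):
    """Return {label: feature dict} with simple features for each superpixel."""
    pixels = {}
    for r, row in enumerate(segments):
        for c, label in enumerate(row):
            pixels.setdefault(label, []).append((r, c, image[r][c]))
    features = {}
    for label, members in pixels.items():
        values = [v for _, _, v in members]
        features[label] = {
            "row_centroid": mean([r for r, _, _ in members]),   # position
            "col_centroid": mean([c for _, c, _ in members]),   # position
            "mean_intensity": mean(values),                     # intensity
            "area": len(members),                               # shape
            "intensity_variance": pvariance(values),            # crude texture proxy
        }
    return features

features = superpixel_features(image, segments)
print(features[0])
```

Each superpixel becomes one row of a feature table, which is what the classifier is trained on.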

### Support vector machine (SVM) classifier

Given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. 

More info [here](https://machinelearningmastery.com/support-vector-machines-for-machine-learning/)

![](https://docs.opencv.org/2.4/_images/optimal-hyperplane.png)

![](http://www.statsoft.com/textbook/graphics/SVMIntro3.gif)
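
The idea of an "optimal" hyperplane can be sketched without training anything: among hyperplanes that separate the classes, an SVM prefers the one with the largest margin. The code compares two hand-picked candidates on hypothetical data; it does not implement SVM training.

```python
import math

def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin(w, b, samples):
    """Smallest signed distance from any sample to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, x) / norm for x, y in samples)

# Hypothetical linearly separable data with labels -1 / +1
samples = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 3.0), +1), ((5.0, 3.0), +1)]

# Two candidate hyperplanes (w, b) that both separate the classes
candidate_a = ((1.0, 0.0), -3.0)    # vertical line x0 = 3
candidate_b = ((1.0, 1.0), -5.0)    # diagonal line x0 + x1 = 5

margin_a = margin(*candidate_a, samples)
margin_b = margin(*candidate_b, samples)
print(margin_a, margin_b)

# Classify a new example with the wider-margin candidate
new_point = (4.5, 2.5)
predicted = +1 if decision(*candidate_b, new_point) > 0 else -1
print(predicted)
```

Of these two candidates, the diagonal line leaves more room between the classes and would be preferred.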

***

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr7.jpg)

### Results

Aggregated confusion matrix

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr8.jpg)
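
For readers unfamiliar with confusion matrices, here is a minimal sketch of how one is tallied. The class names and labels are hypothetical, and the row/column convention (rows are true classes, columns are predicted classes) is an assumption that may differ from the figure.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count samples per (true class, predicted class) pair; rows are true classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

classes = ["sky", "water", "sand"]   # hypothetical class names
true_labels = ["sky", "sky", "water", "water", "water", "sand", "sand"]
predicted = ["sky", "sky", "water", "sand", "water", "sand", "water"]

matrix = confusion_matrix(true_labels, predicted, classes)
# Correct predictions sit on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(true_labels)

for name, row in zip(classes, matrix):
    print(f"{name:>5}: {row}")
print(accuracy)
```

Off-diagonal entries show which classes are confused with each other, which a single accuracy number hides.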

Best and worst performance

![](https://ars.els-cdn.com/content/image/1-s2.0-S0378383915001313-gr9.jpg)
