# D2L: 1. Introduction


## 1.2 Key Components

### 1.2.1 Data

Data is required to train an ML model. Typically this is a collection of “examples” (also called data points, data instances or samples). Each example typically consists of a set of features (covariates, inputs), which are used to predict the labels (targets).

- Image data might be a set of images, each with a number representing its category (dog, cat…). The image itself is represented by three grids of numerical values giving the brightness in red, green and blue. E.g. a 200px x 200px image would have 200 x 200 x 3 = 120,000 values representing it.
- Medical data might be a list of comorbidities, age, blood pressure, current medications etc. The comorbidities would be encoded as binary values indicating the presence or absence of each individual comorbidity.

In the above examples, the number of features is fixed (fixed-length vectors) and the number of elements in these vectors is termed the dimensionality of the data. Often, in real life, the data is variable length (online reviews, images from the internet, etc). One advantage of deep learning models is the ability to deal elegantly with variable-length data.
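As a quick check on the image example, here is a minimal sketch that builds a toy 200 x 200 RGB image as nested lists and flattens it into one fixed-length feature vector. The all-black image and the variable names are illustrative assumptions, not part of the original notes.

```python
# Hypothetical toy image: 200 x 200 pixels, 3 colour channels, all pixels black.
HEIGHT, WIDTH, CHANNELS = 200, 200, 3

# image[row][col] is an [R, G, B] triple of brightness values.
image = [[[0] * CHANNELS for _ in range(WIDTH)] for _ in range(HEIGHT)]

# Flatten the three colour grids into a single fixed-length feature vector.
features = [value for row in image for pixel in row for value in pixel]

# The dimensionality is the number of elements in that vector.
dimensionality = len(features)
print(dimensionality)
```

Every image of this size yields a vector of the same length, which is what makes it fixed-length data.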

The more data, the better. Much modern success for ML can be attributed to the move from “small” (read: not that small) to big data. 

Garbage in, garbage out. We need the right data, and it must be high quality, especially in more sensitive areas, e.g. training a model for skin cancer recognition on a dataset which omits black patients.

### 1.2.2 Model

The “model” is the computational machinery for ingesting data of one type and outputting a prediction of another type. Deep learning models often consist of multiple transformations of the data chained together. In neural networks, the “model” would be the arrangement of neurones in the network. In Gaussian process regression it would be the Gaussian basis functions and their coefficients (though not necessarily the values associated with those coefficients).

### 1.2.3 Objective function

The objective function is a mathematical formulation which can be used to measure the performance of a model. Since these are usually formulated so that lower numbers are better, they are often called “loss functions”. The squared error is a common loss function, which is easy to optimise against. Other objectives such as the “error rate” (i.e. the fraction of misclassifications for a classifier) are non-differentiable and therefore hard to optimise directly, so a surrogate loss function may be used instead. Typically the loss is measured separately on the data used to train the model (the training dataset) and on a withheld partition of the data known as the test dataset. A model which performs well on the training dataset but poorly on the test dataset is said to be overfitting.
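The two objectives mentioned above can be sketched in a few lines. This is a minimal illustration with made-up predictions and targets; the function names and numbers are assumptions chosen for the example.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference: smooth, so easy to optimise against."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def error_rate(predicted_labels, true_labels):
    """Fraction of misclassified examples: a step-like objective, hard to optimise directly."""
    wrong = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return wrong / len(true_labels)


# Hypothetical model outputs on the training data and on withheld test data.
train_mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 3.5])
test_mse = mean_squared_error([1.0, 2.0], [2.0, 4.0])

# A test loss far above the training loss is the signature of overfitting.
print(train_mse, test_mse)

test_error = error_rate(["cat", "dog", "cat", "dog"], ["cat", "dog", "dog", "dog"])
print(test_error)
```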

### 1.2.4 Algorithm

The algorithm, i.e. the optimisation algorithm, is the method used to modify the parameters of the model in order to reduce the value of the loss function. Commonly used methods are gradient descent methods, which check how a small change in each model parameter would change the loss function, and then move the parameter in the direction which reduces the loss.
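A minimal sketch of gradient descent on a one-parameter model, assuming a toy dataset generated from y = 2x and a squared-error loss. The data, learning rate and step count are illustrative choices, not values from the notes.

```python
# Hypothetical data generated from y = 2x, so the best weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def loss(w):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w):
    """Derivative of the loss with respect to w: mean of 2 * x * (w * x - y)."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


w = 0.0               # initial parameter value
learning_rate = 0.01  # size of each step

for step in range(200):
    # Move the parameter against the gradient, i.e. in the direction that lowers the loss.
    w -= learning_rate * gradient(w)

print(w, loss(w))
```

With this learning rate the distance between `w` and the optimum shrinks by a constant factor each step, so `w` ends up very close to 2.0.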

## 1.3 Kinds of machine learning problems

### 1.3.1 Supervised Learning 

Supervised learning refers to the most common family of machine learning tasks, where a set of labelled examples (i.e. features and labels) is used to train the model. In terms of probability, we are interested in estimating the conditional probability of a label given the set of features. It is common in business because the problem statements are crisp, e.g. predict cancer vs not cancer given some computed tomography images.

There is a diverse set of applications, depending on the formulation of the problem and the data being used as features (size, type, quantity, fixed/variable length features, etc.).

Schematic representation of a supervised learning model. The supervised learning algorithm takes as inputs the “training inputs” and “training labels” and outputs another function: the fully trained model. Finally, we can use the trained model to predict the output (label) of some input example (features) with an unknown label.
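The idea that a learning algorithm takes training inputs and labels and returns another function can be sketched as follows. A 1-nearest-neighbour classifier is used here purely because it is short; it is a swapped-in technique, and the features and labels are hypothetical.

```python
def train_nearest_neighbour(training_inputs, training_labels):
    """Learning algorithm: takes inputs and labels, returns a trained model (a function)."""
    examples = list(zip(training_inputs, training_labels))

    def model(features):
        # Predict the label of the closest stored example (squared Euclidean distance).
        def distance(example):
            stored_features, _ = example
            return sum((a - b) ** 2 for a, b in zip(stored_features, features))

        _, label = min(examples, key=distance)
        return label

    return model


# Hypothetical features: [weight in kg, height in cm].
inputs = [[4.0, 25.0], [5.0, 23.0], [30.0, 60.0], [25.0, 55.0]]
labels = ["cat", "cat", "dog", "dog"]

model = train_nearest_neighbour(inputs, labels)

# Use the trained model on an example whose label is unknown.
prediction = model([28.0, 58.0])
print(prediction)
```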

#### Regression 

What makes something a regression problem is actually the target (i.e. an arbitrary numerical value within some interval). E.g. predicting house prices based on the feature vector [square footage, n. bedrooms, n. bathrooms, walking distance]. Predicting the rating that a user might give to a new movie or TV show, or the length of stay of a patient in a hospital, are also regression problems. Generally, if the question can be phrased as “how many” or “how much” then the problem is a regression problem.

#### Classification

Classification models address the question of “which one” rather than “how many” or “how much”. A commonly used example is digit classification, where an algorithm seeks to identify the handwritten digits in a piece of text and classify these (to convert to a digital representation). Firmly classifying things is challenging to optimise, so these problems are often phrased as probabilities (the model is 90% sure this is a cat, vs an absolute classification of “cat”). When there are only two classes, this is a binary classification, versus a multiclass classification for >2 classes. A common loss function for multiclass classification is the cross-entropy. The most probable class is not always the one you want to use for your final decision making. E.g. if you are training a model to predict whether a mushroom is edible or not, a 20% chance that the mushroom is a death cap indicates that you absolutely shouldn’t eat it. More advanced methods of classification might make use of hierarchies: for example in animals, a model should be penalised more for misclassifying a poodle as a dinosaur than a poodle as a cockapoo.
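The pieces above (class probabilities, cross-entropy, and deciding by expected loss rather than by the most probable class) can be sketched together. The scores, the 80/20 mushroom probabilities and the loss table are all made-up values for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Loss for one example: negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_index])


probabilities = softmax([2.0, 1.0, 0.1])
example_loss = cross_entropy(probabilities, true_index=0)

# Mushroom decision: the most probable class is not the whole story.
class_probabilities = {"edible": 0.8, "death cap": 0.2}
most_probable = max(class_probabilities, key=class_probabilities.get)

# Hypothetical losses[action][true class]: eating a death cap is catastrophic.
losses = {
    "eat": {"edible": 0.0, "death cap": 1_000_000.0},
    "discard": {"edible": 1.0, "death cap": 0.0},
}


def expected_loss(action):
    return sum(p * losses[action][cls] for cls, p in class_probabilities.items())


best_action = min(losses, key=expected_loss)
print(most_probable, best_action)
```

Even though "edible" is the most probable class, the expected loss of eating is far higher than that of discarding, so the sensible action is to discard.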

#### Tagging

Tagging broadly refers to the idea of assigning classes to an example when the classes are not mutually exclusive. For example, when many objects may simultaneously appear in an image, or when assigning categories to scientific literature (chemistry, organic chemistry, machine learning, surfaces could all be valid for one paper) or articles (GPU, machine learning, tagging could all be valid at once). These problems are usually best described in terms of multi-label classification.
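One common way to handle non-exclusive classes is to score each tag independently and keep every tag whose probability clears a threshold. The sketch below assumes per-tag scores are already available; the tag names, scores and the 0.5 threshold are illustrative assumptions.

```python
import math


def sigmoid(score):
    """Squash a score into a probability for a single yes/no question."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_tags(scores_by_tag, threshold=0.5):
    """Each tag is decided independently, so any number of tags can be assigned."""
    return sorted(tag for tag, score in scores_by_tag.items() if sigmoid(score) >= threshold)


# Hypothetical per-tag scores for one paper.
scores = {"chemistry": 2.1, "machine learning": 0.7, "surfaces": -0.3, "astronomy": -4.0}
tags = predict_tags(scores)
print(tags)
```

Unlike a softmax, the per-tag probabilities here do not have to sum to 1.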

#### Search

Broadly speaking, “search” is a problem of ranking web pages according to some specific criteria. PageRank (the original Google algorithm) did this, but its scores did not depend on the query itself: a simple relevance filter picked out the candidate pages for the query, and PageRank then prioritised the more authoritative ones. Most modern search has been enshittified through the use of machine learning algorithms which take user behavioural models into account to come up with query-dependent relevance scores for each user.

#### Recommender Systems

Recommender systems are closely related to search and ranking, but differ in their increased focus on the personalisation of the results to the individual user. While two people searching for “Pompeii” should get broadly the same results from a search engine, two users looking at music in Spotify will receive wildly different recommendations. In some cases, users provide explicit feedback by liking or disliking content. In others, they provide implicit feedback by skipping suggestions, spending more time looking at certain advertisements, etc. Such models often exhibit flaws, even in production settings. One example is the likelihood of encountering feedback loops, where user behaviour is strongly affected by the recommendations of the current algorithm. In such cases, a certain item might be frequently recommended, resulting in more people accessing that item. Because more people have accessed that item, it is recommended even more often, and the loop reinforces itself.
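The feedback loop can be illustrated with a deliberately extreme toy simulation. It assumes a recommender that always promotes the most-clicked item and users who always click what is recommended; both are simplifying assumptions, and the item names are hypothetical.

```python
# Hypothetical catalogue: all items are equally appealing, but "A" starts one click ahead.
clicks = {"A": 1, "B": 0, "C": 0}

for _ in range(100):
    # The recommender promotes whichever item has the most clicks so far...
    recommended = max(clicks, key=clicks.get)
    # ...and in this toy model users always click what is recommended.
    clicks[recommended] += 1

print(clicks)
```

The initial one-click lead is never challenged, so the click counts end up reflecting the recommender's past behaviour rather than any real difference between items.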

#### Sequence Learning

In the previous examples, e.g. house price regression, the assumption has been made that each individual observation or training datapoint is independent of the previous one. If we are instead looking at, say, video data, our prediction for what is happening in the current frame is likely to be much stronger if we also take into account what has been happening in the previous frames. Similarly, if we want to predict the likelihood of a patient dying in the next 24 hours in the ICU, we will be better off building a model which relies heavily on the medical history of that patient. Questions like these are instances of sequence learning, where a model is required to either ingest or output a sequence (or both). When a model both ingests and outputs a sequence, this is an example of sequence-to-sequence learning, e.g. machine translation and speech-to-text transcription.

Special cases of sequence-to-sequence learning:
- Tagging and parsing: annotating a text sequence with attributes, for example identifying which words are nouns, verbs, adjectives.
- Automatic speech recognition: taking a sound recording and producing a text output from this. Highly challenging because the number of text samples is orders of magnitude smaller than the number of audio samples (which could be 44,000 per second for a 44 kHz recording).
- Text to speech: generating audio from text. Another sequence-to-sequence problem.
- Machine translation: poses special problems because of the different order of verbs in one language versus another. Highly different grammar structures make this extremely challenging.
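To put a number on the length mismatch in speech recognition mentioned above, here is a small arithmetic sketch. The 3-second duration and the transcript are made-up values; only the 44 kHz sample rate comes from the notes.

```python
sample_rate_hz = 44_000               # samples per second for a 44 kHz recording
duration_s = 3                        # hypothetical clip length
transcript = "the quick brown fox"    # hypothetical transcript of that clip

audio_samples = sample_rate_hz * duration_s
text_symbols = len(transcript)

# Number of input samples per output character.
ratio = audio_samples / text_symbols
print(audio_samples, text_symbols, round(ratio))
```

The input sequence is thousands of times longer than the output, so there is no one-to-one alignment between audio samples and characters.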


### 1.3.2 Unsupervised and self-supervised learning

In an unsupervised learning model, the data is unlabelled. Ya just get a big chunk of data and “do something with it”. Here are a few of the questions you might approach using unsupervised learning:

- Clustering: Can we find a small number of “prototypes”, i.e. clusters, which accurately summarise the data (e.g. group photos into landscapes and portraits, or group internet users into categories with similar behaviour)?
- Subspace estimation (or principal component analysis if linear): Which features of the data are most important for describing its properties?
- Can we match symbolic properties to the data?
- Is there a description of the root cause of, or causality among, the data?
- Deep generative models: estimate the density of the data in some way, then provide information on the probability of samples, or generate new samples (e.g. the variational autoencoder, and later generative adversarial networks).
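The clustering question in the list above can be sketched with k-means on one-dimensional data. k-means is not named in the notes and is used here only as a familiar example of finding “prototypes”; the points and starting centres are made up.

```python
def k_means_1d(points, centres, iterations=10):
    """Alternate between assigning points to the nearest centre and recomputing centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Each centre moves to the mean of its cluster (unchanged if the cluster is empty).
        centres = [
            sum(cluster) / len(cluster) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
    return centres, clusters


# Hypothetical unlabelled data with two obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centres, clusters = k_means_1d(points, centres=[0.0, 5.0])
print(centres, clusters)
```

No labels are used anywhere: the two centres act as the prototypes that summarise the data.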

Recently, models have also been trained using self-supervised learning methods. For example the BERT method was trained by masking words in large corpora of text data and asking the model to “fill in the blanks”, which requires no manual labelling at all. Models may also be trained by masking areas of an image and being asked to generate some aspect of those images as a prediction. 
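The “fill in the blanks” idea can be sketched by generating training pairs directly from raw text. This simplified version masks every word in turn, whereas real systems typically mask only a random subset of tokens; the function name and mask token are assumptions for illustration.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked sentence, hidden word) pairs; the labels come from the text itself."""
    words = sentence.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), word))
    return examples


examples = make_masked_examples("the cat sat")
for masked_sentence, target in examples:
    print(masked_sentence, "->", target)
```

Each pair looks like a supervised example, but no human had to label anything.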

### 1.3.3 Interacting with an Environment

The examples up until this point have been mostly concerned with offline learning, where a large set of data is assembled first, and training is undertaken in isolation. This is convenient but a bit limiting in real life. Often in reality we are concerned with intelligent agents, which make decisions and take actions in the real world. These agents can actually impact their environment, which must be taken into account. Some questions might be:
- Does the environment remember what happened previously?
- Does the environment want to help us (e.g. a user reading text into a speech recogniser)?
- Does the environment want to beat us (e.g. spammers setting up new spam email)?
- Does the environment have shifting dynamics, i.e. what is true today may not be true a month from now?

### 1.3.4 Reinforcement Learning 

Several famous examples of reinforcement learning have hit the news in recent years, including the deep Q-network and AlphaGo. These build on a very general statement of the problem, in which an agent interacts with an environment. At each step, the agent receives some observation from the environment and must choose an action which is transmitted back to the environment (through what is known as the actuator). After each loop, the agent receives some reward from the environment.

The actions of a reinforcement learning agent are governed by a policy, which is a function which maps from the observations of the environment into actions. The aim of reinforcement learning is essentially to determine good policies.
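The observe, act, reward loop and the notion of a policy can be sketched with a toy environment. The corridor environment, its reward and the hand-written policy below are all hypothetical; nothing is learned here, the point is only the shape of the interaction.

```python
class CorridorEnvironment:
    """Hypothetical toy environment: positions 0..4, reward on reaching position 4."""

    def __init__(self):
        self.position = 0

    def observe(self):
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor.
        self.position = max(0, min(4, self.position + action))
        return 1.0 if self.position == 4 else 0.0


def policy(observation):
    """A fixed policy mapping observations to actions: always move right."""
    return +1


env = CorridorEnvironment()
total_reward = 0.0
steps = 0

while env.observe() != 4:
    action = policy(env.observe())    # the agent picks an action from its observation
    total_reward += env.step(action)  # the environment responds with a reward
    steps += 1

print(steps, total_reward)
```

A reinforcement learning algorithm would replace the hand-written `policy` with one that is improved from the rewards received.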

The idea of reinforcement learning is an extremely generalisable one. For example, categorisation may be recast within the framework of reinforcement learning by making each categorisation an action. We then assign a reward for each action which is effectively the loss function of the original categorisation problem. 

Reinforcement learning can also approach problems which are not tractable with other machine learning methods. In supervised learning, we assume that every sample comes with the correct label, which is not assumed in reinforcement learning. Instead, the agent just receives a reward for each action. The environment may not even tell us which action led to that reward. A good example of this is chess: you only really know if you have done well when you win or lose the game, but many individual actions will have led to that outcome.

Reinforcement learners also have to deal with the idea of partial observability, i.e. that the current observation might not actually tell you everything about your current state. E.g. you observe where you are in the maze but cannot see the whole maze. They must also deal with the decision to exploit the current working best strategy, or explore other strategies, as the current strategy may not necessarily be the best one overall. 

- When the environment is fully observed, the reinforcement learning problem is called a Markov decision process.
- When the state does not depend on the previous actions, this is a contextual bandit problem.
- When there is no state, just a set of available actions with initially unknown rewards, this is the classic multi-armed bandit problem.
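The multi-armed bandit case and the explore/exploit trade-off can be sketched with an epsilon-greedy strategy. Epsilon-greedy is a swapped-in technique not named in the notes, and the arm probabilities, epsilon, step count and seed are illustrative assumptions.

```python
import random


def run_epsilon_greedy(true_means, epsilon=0.1, steps=2000, seed=0):
    """Play a Bernoulli bandit: explore with probability epsilon, otherwise exploit."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)       # how often each arm was pulled
    estimates = [0.0] * len(true_means)  # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))  # explore: pick a random arm
        else:
            arm = max(range(len(true_means)), key=lambda a: estimates[a])  # exploit

        # The reward is 1 with the arm's hidden probability, otherwise 0.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0

        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean

    return counts, estimates


# Hypothetical hidden success probabilities for three arms.
counts, estimates = run_epsilon_greedy([0.2, 0.5, 0.8])
print(counts, estimates)
```

Without the exploration step the agent could lock onto whichever arm it happened to try first, which is exactly the risk of only exploiting the current best strategy.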

## 1.5 Road to Deep Learning

Things that have helped researchers achieve tremendous progress over the past decade:

- New methods for capacity control, such as dropout, have helped mitigate overfitting. E.g. injecting noise into the model during training
- Attention mechanisms solved another problem which has existed for over a century: how to increase the memory and complexity of a system without increasing the number of “learnable parameters”. The learnable pointer structure meant that all that needed to be stored was a pointer to some intermediate state, rather than the entire sequence.
- The transformer structure, built entirely on attention mechanisms, has demonstrated improved scaling behaviour in dataset size, model size, and amount of training compute. 
    - GATO: a general agent: https://arxiv.org/pdf/2205.06175
- Modelling probabilities of text sequences in modern language models, e.g. ChatGPT.
- Multi-stage designs allowed researchers to modify the internal state of a neural network to carry out multiple stages of reasoning
- Generative adversarial networks, in which a generator that produces samples is trained against a discriminator that tries to tell generated samples from real data.
- Diffusion models have begun to replace generative adversarial networks (e.g. DALL-E 2).
- Parallelisation, improving fitting algorithms to be run in larger batches across a greater number of GPUs, bypassing a major weakness of stochastic gradient descent (which needs small mini batches) 
- Reinforcement learning also benefitted from better computing resources as simulators for training were made available
- Better availability of deep learning frameworks, via TensorFlow (Keras), PyTorch, Gluon API, MXNet, JAX, etc.
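The dropout item in the list above (injecting noise into the model during training) can be sketched as follows. This is a minimal version of the common “inverted dropout” formulation, assuming a drop probability below 1; the function signature and values are illustrative, not taken from any particular framework.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Randomly zero activations during training; pass them through unchanged otherwise."""
    if not training or drop_probability == 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability  # assumes drop_probability < 1
    # Surviving activations are scaled up so the expected value stays the same.
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden-layer activations

noisy = dropout(hidden, drop_probability=0.5, rng=rng)                  # training time
clean = dropout(hidden, drop_probability=0.5, rng=rng, training=False)  # prediction time
print(noisy, clean)
```

Because a different random subset is dropped on every pass, the network cannot rely too heavily on any single activation.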