# Introduction

## What is Probability?

Uncertainty involves making decisions with incomplete information, and this is the way we generally operate in the world. Handling uncertainty is typically described using everyday words like chance, luck, and risk. Probability is a field of mathematics that gives us the language and tools to quantify the uncertainty of events and reason in a principled manner.

Probability theory is the mathematics of uncertainty. Uncertainty refers to imperfect or incomplete information. The world is messy and imperfect, and we must make decisions and operate in the face of this uncertainty. For example, we often talk about luck, chance, odds, likelihood, and risk. These are words that we use to interpret and negotiate uncertainty in the world. When making inferences and reasoning in an uncertain world, we need principled, formal methods to express and solve problems. Probability provides the language and tools to handle uncertainty.

The probability, or likelihood, of an event is also commonly referred to as the odds of the event or the chance of the event. These all generally refer to the same notion, although odds often have their own notation of wins to losses, written as w:l; e.g. odds of 1:3 mean 1 win for every 3 losses, which is a 1/4 (25%) probability of a win.
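
The conversion between odds and probability can be sketched in a few lines of Python. The function names `odds_to_probability` and `probability_to_odds` are illustrative choices made here, not part of any library.

```python
from fractions import Fraction

def odds_to_probability(wins: int, losses: int) -> Fraction:
    """Convert odds of wins:losses into the probability of a win."""
    return Fraction(wins, wins + losses)

def probability_to_odds(p: Fraction) -> tuple:
    """Convert a probability of a win back into (wins, losses) odds."""
    return p.numerator, p.denominator - p.numerator

p = odds_to_probability(1, 3)
print(p, float(p))             # 1/4 0.25
print(probability_to_odds(p))  # (1, 3)
```

Using `Fraction` keeps the arithmetic exact, so converting in both directions returns the original odds in lowest terms.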

Probability theory has three important concepts:
* **Event (A)**: An outcome, or set of outcomes, to which a probability is assigned.
* **Sample Space (S)**: The set of all possible outcomes.
* **Probability Function (P)**: The function used to assign a probability to an event.

The likelihood of an event (A) being drawn from the sample space (S) is determined by the probability function (P). The way probability is spread across all events in the sample space is called the probability distribution. Many domains have a familiar shape to the distribution of probabilities over events, such as uniform if all events are equally likely or Gaussian if the likelihoods of the events form a normal, or bell-shaped, curve.
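
As a minimal sketch of these three concepts, consider one roll of a fair six-sided die. The example assumes a uniform distribution and treats an event as a set of outcomes; the names `sample_space` and `probability` are chosen here for illustration.

```python
from fractions import Fraction

# Sample space S: every outcome of one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event: set) -> Fraction:
    """Probability function P for a uniform distribution over sample_space."""
    favourable = event & sample_space  # ignore outcomes outside S
    return Fraction(len(favourable), len(sample_space))

even = {2, 4, 6}                  # event A: the roll is even
print(probability(even))          # 1/2
print(probability({6}))           # 1/6
print(probability(sample_space))  # 1
```

The last line shows that the probability of the whole sample space is 1: some outcome must occur.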

**Two Schools of Probability**

There are two main ways of interpreting or thinking about probability. The perhaps simpler
approach is to consider probability as the actual likelihood of an event, called frequentist
probability. Another approach is to consider probability as a measure of how strongly it is believed
the event will occur, called Bayesian probability. It is not that one approach is correct and the
other is incorrect; instead, they are complementary and both interpretations provide different
and useful techniques.


**Frequentist Probability**

The frequentist approach to probability is objective.
Events are observed and counted, and their
frequencies provide the basis for directly calculating a probability, hence the name frequentist.
Probability theory was originally developed to analyze the frequencies of events.

Methods from frequentist probability include p-values and confidence intervals used in
statistical inference and maximum likelihood estimation for parameter estimation.
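
The idea of counting events to obtain a probability can be illustrated with a small simulation. This is only a sketch: the coin, the `relative_frequency` helper, and the fixed seed are assumptions made for the example.

```python
import random

def relative_frequency(n_trials: int, p_heads: float = 0.5, seed: int = 0) -> float:
    """Estimate P(heads) as the fraction of simulated flips that land heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(n_trials))
    return heads / n_trials

# The estimate tends to settle near the true value as more flips are counted.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

With only a handful of flips the estimate can be far from 0.5; with many flips the observed frequency should approach it.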

**Bayesian Probability**

The Bayesian approach to probability is subjective. Probabilities are assigned to events based on
evidence and personal belief and are centered around Bayes’ theorem, hence the name Bayesian.
This allows probabilities to be assigned to very infrequent events and events that have not been
observed before, unlike frequentist probability.

One big advantage of the Bayesian interpretation is that it can be used to model
our uncertainty about events that do not have long-term frequencies.

Methods from Bayesian probability include Bayes factors and credible intervals for inference
and the Bayes estimator and maximum a posteriori estimation for parameter estimation.

## Uncertainty in Machine Learning

There are many sources of uncertainty
in a machine learning project, including variance in the specific data values, the sample of data
collected from the domain, and in the imperfect nature of any models developed from such data.

Noise in data, incomplete coverage of the domain, and imperfect models provide the three
main sources of uncertainty in machine learning.

Applied machine learning requires getting comfortable with uncertainty. Uncertainty means
working with imperfect or incomplete information. For software engineers and developers,
computers are deterministic. You write a program, and the computer does what you say.
Algorithms are analyzed based on space or time complexity and can be chosen to optimize
whichever is most important to the project, like execution speed or memory constraints.

There are three main sources of uncertainty in machine learning:

**Noise in Observations**

An observation from
the domain is often referred to as an instance or an example and is one row of data. It is what
was measured or what was collected. It is the data that describes the object or subject. It is
the input to a model and the expected output.

Noise refers to variability in the observation. Variability could be natural, such as a larger or
smaller flower than normal. It could also be an error, such as a slip when measuring or a typo
when writing it down. This variability impacts not just the inputs or measurements but also
the outputs; for example, an observation could have an incorrect class label. This means that
although we have observations for the domain, we must expect some variability or randomness.

**Incomplete Coverage of the Domain**

Observations from a domain used to train a model are a sample and incomplete by definition. In
statistics, a random sample refers to a collection of observations chosen from the domain without
systematic bias (e.g. uniformly random). Nevertheless, there will always be some limitation
that will introduce bias. For example, we might choose to measure the size of randomly selected
flowers in one garden. The flowers are randomly selected, but the scope is limited to one garden.
Scope can be increased to gardens in one city, across a country, across a continent, and so on.

An appropriate level of variance and bias in the sample is required such that the sample
is representative of the task or project for which the data or model will be used. We aim to
collect or obtain a suitably representative random sample of observations to train and evaluate
a machine learning model. Often, we have little control over the sampling process. Instead, we
access a database or CSV file and the data we have is the data we must work with. In all cases,
we will never have all of the observations. If we did, a predictive model would not be required.
This means that there will always be some unobserved cases. There will be part of the problem
domain for which we do not have coverage.

This is why we split a dataset into train and test sets or use resampling methods like k-fold
cross-validation. We do this to handle the uncertainty in the representativeness of our dataset
and estimate the performance of a modeling procedure on data not used in that procedure.
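
One way to sketch k-fold cross-validation is as a generator of train and test index lists. The `k_fold_indices` helper below is a hypothetical, standard-library-only illustration; in practice a library implementation would normally be used.

```python
import random

def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle once so folds are random
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, test in enumerate(folds):
        # Every fold except the i-th is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_samples=10, k=5):
    print(len(train), len(test))  # 8 2 on every fold
```

Each observation appears in exactly one test fold, so every row is used once to estimate performance on data the model was not trained on.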

**Imperfect Model of the Problem**

A machine learning model will always have some error. This is often summarized as "all models
are wrong", or more completely in an aphorism by George Box:
"All models are wrong but some are useful."

This applies not just to the model, the artifact, but to the whole procedure used to prepare
it, including the choice and preparation of data, choice of training hyperparameters, and the
interpretation of model predictions. Model error could mean imperfect predictions, such as
predicting a quantity in a regression problem that is quite different to what was expected, or
predicting a class label that does not match what would be expected. This type of error in
prediction is expected given the uncertainty we have about the data that we have just discussed,
both in terms of noise in the observations and incomplete coverage of the domain.

Another type of error is an error of omission. We leave out details or abstract them in order
to generalize to new cases. This is achieved by selecting models that are simpler but more
robust to the specifics of the data, as opposed to complex models that may be highly specialized
to the training data. As such, we might and often do choose a model known to make errors on
the training dataset with the expectation that the model will generalize better to new cases and
have better overall performance.

**How to Manage Uncertainty**

**In terms of noisy observations**, probability and statistics help us to understand and quantify the expected value and variability of variables in our observations from the domain.

**In terms of the incomplete coverage of the domain**, probability helps to understand and quantify the expected distribution and density of observations in the domain.

**In terms of model error**, probability helps to understand and quantify the expected capability and variance in performance of our predictive models when applied to new data.
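
For the first of these points, noisy observations, a minimal sketch of quantifying expected value and variability is shown here. The petal lengths are made-up values, and the two-standard-deviation threshold is just one common rule of thumb, not a prescribed method.

```python
import statistics

# Hypothetical petal lengths in cm; the last value may be natural variation or a typo.
petal_lengths = [1.4, 1.5, 1.3, 1.6, 1.4, 4.1]

mean = statistics.mean(petal_lengths)
stdev = statistics.stdev(petal_lengths)  # sample standard deviation

# Flag observations more than two standard deviations from the mean.
unusual = [x for x in petal_lengths if abs(x - mean) > 2 * stdev]
print(round(mean, 2), round(stdev, 2), unusual)
```

Whether a flagged value is an error or a genuinely large flower cannot be decided from the numbers alone, which is exactly the uncertainty described above.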

But this is just the beginning. Probability also provides the foundation for training many
machine learning models through a framework called maximum likelihood estimation, which underlies
models such as linear regression, logistic regression, artificial neural networks, and many more.
Probability also provides the basis for developing specific algorithms, such as Naive Bayes, as well as entire
subfields of study in machine learning, such as graphical models like the Bayesian Belief Network.

## Why Learn Probability for Machine Learning

* Class Membership Requires Predicting a Probability
* Some Algorithms Are Designed Using Probability
    * `Naive Bayes`, which is constructed using Bayes' theorem with some simplifying assumptions.
    * `Probabilistic Graphical Models (PGM)`, which are designed around Bayes' theorem.
    * `Bayesian Belief Networks` or Bayes Nets, which are capable of capturing the conditional dependencies between variables.
* Models Are Trained Using a Probabilistic Framework

Many machine learning models are trained using an iterative algorithm designed under a
probabilistic framework. Perhaps the most common is the framework of maximum likelihood
estimation, sometimes shortened to MLE. This is a framework for estimating model parameters
(e.g. weights) given observed data. It underlies the ordinary least squares estimate of a
linear regression model. The expectation-maximization algorithm, or EM for short, is an
approach to maximum likelihood estimation often used for unsupervised data clustering,
e.g. fitting a mixture model with k components. The k-means clustering algorithm, which
estimates k means for k clusters, can be viewed as a closely related hard-assignment variant.
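
To make maximum likelihood estimation concrete, the sketch below estimates the heads probability of a coin from a handful of flips. The data and the grid search are illustrative assumptions; real models typically use gradient-based or closed-form optimization instead of a grid.

```python
import math

def log_likelihood(p: float, flips: list) -> float:
    """Log-likelihood of Bernoulli parameter p given 0/1 observations."""
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in flips)

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical data: 7 heads in 10 flips

# Search a grid of candidate parameters for the one that maximizes the likelihood.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: log_likelihood(p, flips))

print(p_hat)                    # 0.7
print(sum(flips) / len(flips))  # 0.7, the closed-form estimate
```

The search lands on the observed fraction of heads, which is the known closed-form maximum likelihood estimate for this simple model.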

For models that predict class membership, maximum likelihood estimation provides the
framework for minimizing the difference or divergence between an observed and a predicted
probability distribution. This is used in classification algorithms like logistic regression as
well as deep learning neural networks. It is common to measure this difference in probability
distributions during training using entropy, e.g. via cross-entropy. Entropy, differences between
distributions measured via KL divergence, and cross-entropy are from the field of information
theory that directly builds upon probability theory. For example, the information of an event
is calculated directly as the negative log of its probability, and entropy is the expected
value of that quantity over a distribution.
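
These information-theoretic quantities can be computed directly from probabilities, as in the following sketch. It assumes discrete distributions given as lists, uses base-2 logarithms (bits), and takes entropy to be the expected negative log probability; the predicted distribution is a made-up model output.

```python
import math

def entropy(p):
    """Expected information of distribution p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average bits needed to encode events from p using a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """Extra bits incurred by using q in place of p: H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

observed = [1.0, 0.0, 0.0]   # one-hot class label
predicted = [0.7, 0.2, 0.1]  # hypothetical model output

print(entropy([0.5, 0.5]))                           # 1.0
print(round(cross_entropy(observed, predicted), 3))  # 0.515
print(round(kl_divergence(observed, predicted), 3))  # 0.515
```

For a one-hot label the entropy of the observed distribution is zero, so cross-entropy and KL divergence coincide, which is why minimizing cross-entropy during training also minimizes the divergence from the labels.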

* Models Can Be Tuned With a Probabilistic Framework
* Probabilistic Measures Are Used to Evaluate Model Skill