# Chapter 4 

### text: Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan

Inferential statistical techniques assign precise measures to our uncertainty about possibilities.   
Uncertainty is measured in terms of probability, and therefore we must establish the
properties of probability before we can make inferences about it.

## 4.1. THE SET OF ALL POSSIBLE EVENTS

Suppose I have a coin that I am going to flip. How likely is it to come up a head?   
How
likely is it to come up a tail?  
How likely is it to come up a torso?   
Notice that when
we contemplate the likelihood of each outcome, we have in mind a set of all possible
outcomes.   
Torso is not one of the possible outcomes.   
Notice also that a single flip of a
coin can result in only one outcome; it cannot be both heads and tails in a single flip.  
The outcomes are mutually exclusive.

Whenever we ask about how likely an outcome is, we always ask with a set of possible
outcomes in mind.   
This set exhausts all possible outcomes, and the outcomes are all
mutually exclusive.   
This set is called the **sample space**. 

Consider the probability that a coin comes up heads when it is flipped.   
If the coin is
fair, it should come up heads in about 50% of the flips.   
If the coin (or its flipping mechanism) is biased, then it will tend to come up heads more than or less than 50% of the
flips.   
The probability of coming up heads can be denoted with parameter label $\theta$ (Greek
letter theta); for example, a coin is fair when $\theta$ = 0.5 (spoken “theta equals point five”).

We can also consider our **degree of belief** that the coin is fair.   
We might know that the coin was manufactured by a government mint, and therefore we have a high degree of belief that the coin is fair.   
Alternatively, we might know that the coin was manufactured
by Acme Magic and Novelty Company, and therefore we have a high degree of belief
that the coin is biased.   

**The degree of belief about a parameter can be denoted $p(\theta)$.**  
If the coin was minted by the federal government, we might have a strong belief that the coin is fair; for example, we might believe that $p(\theta=0.5)$ = 0.99, spoken “the probability that theta equals 0.5 is 99 percent.”   
If the coin was minted by the Novelty company, we might have a strong belief that the coin is biased; for example, we might believe that $p(\theta=0.5)$ = 0.01 and that $p(\theta=0.9)$ = 0.99.

Both the “probability” of a head or tail outcome and the “degree of belief” in biases refer to sample spaces.   
* The sample space for flips of a coin consists of two possible outcomes:
head and tail.  
* The sample space for coin bias consists of a continuum of possible values:
$\theta$ = 0.0, $\theta$ = 0.01, $\theta$ = 0.02, $\theta$ = 0.03, and all values in between, up to $\theta$ = 1.0.    
* When we flip a given coin, we are sampling from the space of head or tail.   
* When we grab a coin at random from a sack of coins, in which each coin may have a different bias, we are sampling from the space of possible biases.
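
The two kinds of sampling can be illustrated with a small simulation. This is only a sketch in Python, not code from the book: the helper names `flip` and `grab_coin` and the contents of `sack_of_biases` are hypothetical, and a finite sack can only stand in for a few points of the continuum of possible biases.

```python
import random

def flip(theta, rng):
    """Sample one outcome from the sample space {"H", "T"} for a coin with bias theta."""
    return "H" if rng.random() < theta else "T"

def grab_coin(sack, rng):
    """Sample one bias from the sack, i.e. from the space of possible biases."""
    return rng.choice(sack)

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hypothetical sack: each entry is the bias (theta) of one coin.
sack_of_biases = [0.1, 0.5, 0.5, 0.5, 0.9]

theta = grab_coin(sack_of_biases, rng)         # sampling from the space of biases
flips = [flip(theta, rng) for _ in range(10)]  # sampling from the space {H, T}
print(theta, "".join(flips))
```

Each flip yields exactly one of the two mutually exclusive outcomes, while each grab yields exactly one bias value.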

# Chapter 5 - Bayes' Rule

On a typical day at your location, what is the probability that it is cloudy?   Suppose you are told it is raining. Now what is the probability that it is cloudy?   
Notice that
those two probabilities are not equal, because we can be pretty sure that $$p(\text{cloudy}) <
p(\text{cloudy}|\text{raining})$$   
Suppose instead you are told that everyone outside is wearing
sunglasses.   
Most likely, it is true that $$p(\text{cloudy}) > p(\text{cloudy}|\text{sunglasses})$$ 

We started with prior credibility allocated
over two possible states of the sky: cloudy or sunny.   
Then we took into account some
other data, namely, that it is raining or that people are wearing sunglasses.   Conditional on the new data, we re-allocated credibility across the possible states of the sky.   
When the data indicated rain, then cloudy was more credible than when we started.   
When the data
instead indicated sunglasses, then cloudy was less credible than when we started.   

**Bayes’ rule is merely the mathematical relation between the prior allocation of credibility and
the posterior reallocation of credibility conditional on data.**

## 5.1. BAYES’ RULE

Thomas Bayes (1702-1761) was a mathematician and Presbyterian minister in England.  
His famous theorem was published posthumously in 1763.  
The simple rule has
vast ramifications for statistical inference.

There is another branch of statistics, called **frequentist**, which does not use Bayes’ rule
for inference and decisions.

### 5.1.1. Derived from definitions of conditional probability

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$ \large p(c|r)= \frac{p(r,c)}{p(r)}$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(c|r)p(r)= p(r,c)$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(r|c)p(c)= p(r,c)$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.3)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(c|r)p(r)=p(r|c)p(c)$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(c|r)= \frac{p(r|c)p(c)}{p(r)}$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.5)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(c|r)= \frac{p(r|c)p(c)}{\sum_{c^*} p(r|c^*) p(c^*)}$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.6)

In Equation 5.6, the $c$ in the numerator is a specific fixed value, whereas the $c^*$ in the denominator is a variable that takes on all possible values.  

**Equations 5.5 and 5.6 are called Bayes’ rule.** 
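
As a numerical check of the derivation, the sketch below computes $p(c|r)$ two ways: directly from the definition in Equation 5.1, and through Bayes' rule in Equation 5.6. The joint probabilities are hypothetical values chosen only for illustration, and all function names are assumptions of this sketch.

```python
import math

# Hypothetical joint probabilities p(r, c); keys are (r, c). They sum to 1.
joint = {
    ("raining", "cloudy"): 0.20,
    ("raining", "sunny"): 0.01,
    ("dry", "cloudy"): 0.20,
    ("dry", "sunny"): 0.59,
}
sky_states = ["cloudy", "sunny"]
rain_states = ["raining", "dry"]

def p_r(r):
    """Marginal p(r): sum the joint over all sky states."""
    return sum(joint[(r, c)] for c in sky_states)

def p_c(c):
    """Marginal p(c): sum the joint over all rain states."""
    return sum(joint[(r, c)] for r in rain_states)

def p_r_given_c(r, c):
    """Conditional p(r|c) = p(r, c) / p(c)."""
    return joint[(r, c)] / p_c(c)

def by_definition(c, r):
    """Equation 5.1: p(c|r) = p(r, c) / p(r)."""
    return joint[(r, c)] / p_r(r)

def by_bayes(c, r):
    """Equation 5.6: p(c|r) = p(r|c) p(c) / sum over c* of p(r|c*) p(c*)."""
    numerator = p_r_given_c(r, c) * p_c(c)
    denominator = sum(p_r_given_c(r, c_star) * p_c(c_star) for c_star in sky_states)
    return numerator / denominator

prior_cloudy = p_c("cloudy")
posterior_def = by_definition("cloudy", "raining")
posterior_bayes = by_bayes("cloudy", "raining")
print(prior_cloudy, posterior_def, posterior_bayes)
```

With these made-up numbers the prior $p(\text{cloudy})$ is 0.40 and both routes give a posterior of about 0.95, so credibility shifts toward cloudy once rain is observed.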

### 5.1.2. Bayes’ rule intuited from a two-way discrete table

Consider Table 5.1, which shows the joint probabilities of a row attribute and a column
attribute, along with their **marginal probabilities**.

<br><br>
<img style="float: left;" src="pic3/05_01.png"  width="600">

<br><br>
<img style="float: left;" src="pic3/05_02.png"  width="600">

<br><br><br>
Finding the conditional probability $p(\text{hair color}|\text{eye color}=\text{Blue})$

<br><br>
<img style="float: left;" src="pic3/05_03.png"  width="600">

<br><br>
<img style="float: left;" src="pic3/05_10.jpg"  width="600">

<br><br><br><br>

### Disease Diagnosis

Consider trying to diagnose a rare disease.   
Suppose that in the general population, the
probability of having the disease is only one in a thousand.   

We denote the true presence
or absence of the disease as the value of a parameter, $\theta$, that can have the value 
* $\theta$ = ☹️ if disease is present in a person, 
* $\theta$ = 🙂 if the disease is absent.   

The base rate of the disease is therefore denoted $p(\theta =$☹️$)$ = 0.001.  
This is our prior belief that a
person selected at random has the disease.

Suppose that there is a test for the disease that has a 99% hit rate, which means that if a
person has the disease, then the test result is positive 99% of the time.   
We denote the test result as 
* T = + if a positive test result 
* T = − if a negative test result

The observed test result is the datum that we will use to modify our belief about the value of the underlying disease
parameter.  
The hit rate is expressed formally as   
$p(T = + | \theta =$☹️$)$ = 0.99.   

Suppose also
that the test has a false alarm rate of 5%.   
This means that 5% of the time when the disease
is absent, the test falsely indicates that the disease is present.   
The false alarm rate is expressed formally as   
$p(T = + | \theta =$🙂$)$ = 0.05.   


Suppose we sample a person at random from the population, administer the test, and
it comes up positive.   
What is the posterior probability that the person has the disease?  
Mathematically expressed, we are asking,   
what is $p(\theta =$☹️$ | T = + )$?   

Before determining
the answer from Bayes’ rule, generate an intuitive answer and see if your intuition matches
the Bayesian answer.   
Most people have an intuition that the probability of having the
disease is near the hit rate of the test (which in this case is 99%).

<br><br>
<img style="float: left;" src="pic3/05_04.png"  width="700">

Table 5.4 shows how to conceptualize disease diagnosis as a case of Bayes’ rule.  
The base rate of the disease is shown in the lower marginal of the table.   
Because the background probability of having the disease is $p(\theta =$☹️$)$ = 0.001, it is the case that the
probability of not having the disease is the complement, $p(\theta =$🙂$)$ = 1 − 0.001 = 0.999.  
Without any information about test results, this lower marginal probability is our prior
belief about a person having the disease.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(\theta =☹️|T=+)= \frac{p(T = + | \theta =☹️) p(\theta =☹️)}{\sum_{\theta} p(T=+|\theta) p(\theta)}$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large = \frac{0.99 \times 0.001}{0.99 \times 0.001+0.005 \times (1-0.001)}$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large = 0.019$ 
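
The same arithmetic can be checked in a few lines of code. This is a minimal sketch using the base rate (0.001), hit rate (0.99), and false alarm rate (5%) stated in the text; the function name `posterior_disease` is hypothetical.

```python
def posterior_disease(prior, hit_rate, false_alarm_rate):
    """p(disease present | T = +) via Bayes' rule with a two-valued parameter."""
    numerator = hit_rate * prior
    # Evidence p(T = +): sum over both values of theta (present, absent).
    evidence = hit_rate * prior + false_alarm_rate * (1.0 - prior)
    return numerator / evidence

p = posterior_disease(prior=0.001, hit_rate=0.99, false_alarm_rate=0.05)
print(round(p, 3))  # 0.019
```

Even after a positive result, the posterior probability of disease is under 2%, because the low base rate means most positive results come from the large disease-free group.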

## 5.2. APPLIED TO PARAMETERS AND DATA

**The key application that makes Bayes’ rule so useful is when the row variable represents
data values and the column variable represents parameter values.**   
A model of data specifies
the probability of particular data values given the model’s structure and parameter values.  
The model also indicates the probability of the various parameter values.   
In other words, a model specifies  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$p$(data values | parameters values)  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;along with the prior, $p$(parameters values)

and we use Bayes’ rule to convert that to what we really want to know, which is how strongly we should believe in the various parameter values, given the data:  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$p$(parameters values | data values)

<br><br>
<img style="float: left;" src="pic3/05_05.png"  width="600">

The factors of Bayes’ rule have specific names that will be used regularly throughout
the book, as indicated here:

<br><br>
<img style="float: left;" src="pic3/05_06.png"  width="600">

where the superscript asterisk in $\theta^*$ is merely a reminder that the denominator’s $\theta^*$ is
distinct from the specific $\theta$ value in the numerator of Equation 5.7.  
* The prior, $p(\theta)$, is the credibility of the $\theta$ values without the data $D$.   
* The posterior, $p(\theta|D)$, is the credibility of $\theta$ values with the data $D$ taken into account.   
* The likelihood, $p(D|\theta)$, is the probability that the data could be generated by the model with parameter value $\theta$.  
* The evidence for the model, $p(D)$, is the overall probability of the data according to the model, determined by averaging across all possible parameter values weighted by the strength of belief in those parameter values.

The denominator of Bayes’ rule, labeled in Equation 5.7 as the evidence for the model, is also called the **marginal likelihood**.   
The term “evidence” is ambiguous.   
The term “marginal likelihood” refers specifically to the
operation of taking the average of the likelihood, $p(D|\theta)$, across all values of $\theta$, weighted
by the prior probability of $\theta$.   
In this book, I will use the terms “evidence” and “marginal
likelihood” interchangeably.
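
To see how the four factors fit together, here is a small sketch for a coin whose bias $\theta$ is restricted to three candidate values. The candidate values, the prior, the single observed flip, and the Bernoulli likelihood (independent flips, each heads with probability $\theta$) are all assumptions made for this illustration, as are the function names.

```python
def bernoulli_likelihood(data, theta):
    """p(D | theta) for a list of 0/1 flips, assuming independent flips."""
    z = sum(data)   # number of heads
    n = len(data)   # number of flips
    return theta**z * (1 - theta)**(n - z)

def bayes_update(thetas, prior, data):
    """Return (posterior, evidence) for discrete candidate values of theta."""
    likelihood = [bernoulli_likelihood(data, t) for t in thetas]
    # Evidence (marginal likelihood): likelihood averaged over theta, weighted by the prior.
    evidence = sum(lik * pri for lik, pri in zip(likelihood, prior))
    posterior = [lik * pri / evidence for lik, pri in zip(likelihood, prior)]
    return posterior, evidence

thetas = [0.25, 0.50, 0.75]   # hypothetical candidate biases
prior = [0.25, 0.50, 0.25]    # hypothetical prior credibility of each bias
data = [1]                    # one flip, which came up heads

posterior, evidence = bayes_update(thetas, prior, data)
print(evidence)   # 0.5
print(posterior)  # [0.125, 0.5, 0.375]
```

A single head shifts credibility away from $\theta = 0.25$ and toward $\theta = 0.75$, while the posterior still sums to 1 because of the division by the evidence.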

Up to this point, Bayes’ rule has been presented only in the context of discrete-valued
variables.   
It also applies to continuous variables, but probability masses become probability densities and sums become integrals.   
For continuous variables, the only change in
Bayes’ rule is that the marginal likelihood changes from the sum in Equation 5.8 to an integral:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\large p(D)= \int p(D|\theta^*)p(\theta^*)d\theta^*$ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5.9)

### 5.2.1. Data-order invariance

Bayes’ rule in Equation 5.7 gets us from a prior belief, $p(\theta)$, to a posterior belief, $p(\theta|D)$,
when we take into account some data $D$.   
Now suppose we observe some more data, which we’ll denote $D'$.   
We can then update our beliefs again, from $p(\theta|D)$ to $p(\theta|D',D)$.  
Here’s the question: Does our final belief depend on whether we update with $D$ first
and $D'$ second, or update with $D'$ first and $D$ second?

The answer is: It depends!   
In particular, it depends on the model function that defines
the likelihood, $p(D|\theta)$.   
In many models, the probability of data, $p(D|\theta)$, does not depend
in any way on other data.   
That is, the joint probability  $p(D,D'|\theta)$ equals $p(D|\theta)$·$p(D'|\theta)$.  
In other words, in this sort of model, the data probabilities are independent. 
Under this condition, the order
of updating has no effect on the final posterior.

This invariance to ordering of the data makes sense intuitively: If the likelihood
function has no dependence on data ordering, then the posterior shouldn’t have any
dependence on data ordering. 
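
The invariance can be checked numerically. The sketch below assumes a Bernoulli likelihood with independent flips and three hypothetical candidate values of $\theta$; it updates with $D$ then $D'$, with $D'$ then $D$, and with all the data at once, and compares the results.

```python
import math

def update(thetas, belief, data):
    """One Bayesian update of a discrete belief over theta, given 0/1 flips."""
    z, n = sum(data), len(data)
    # Independent flips: the likelihood is a product of per-flip probabilities.
    likelihood = [t**z * (1 - t)**(n - z) for t in thetas]
    unnormalized = [lik * b for lik, b in zip(likelihood, belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

thetas = [0.25, 0.50, 0.75]   # hypothetical candidate biases
prior = [1 / 3, 1 / 3, 1 / 3]
d = [1, 1, 0]                 # first batch of flips (D)
d_prime = [0, 1]              # second batch of flips (D')

post_d_first = update(thetas, update(thetas, prior, d), d_prime)
post_d_prime_first = update(thetas, update(thetas, prior, d_prime), d)
post_all_at_once = update(thetas, prior, d + d_prime)

print(post_d_first)
print(post_d_prime_first)
print(post_all_at_once)
```

All three posteriors agree up to floating-point rounding, because the product $p(D|\theta)\,p(D'|\theta)$ is the same regardless of the order of its factors.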

## 5.4. WHY BAYESIAN INFERENCE CAN BE DIFFICULT

Determining the posterior distribution directly from Bayes’ rule involves computing the evidence (a.k.a. marginal likelihood) in Equations 5.8 and 5.9.      
In the usual case of continuous parameters, the integral in Equation 5.9 can be impossible to solve analytically. 

Instead of analytical mathematical approaches, another class of methods involves numerical approximation of the integral.   
When the parameter space is small, one numerical approximation method is to cover the space with a grid of points and
compute the integral by exhaustively summing across that grid.   
This was the approach
we used in Figures 5.2 and 5.3, where the domain of the parameter $\theta$ was represented by a
fine comb of values, and the integral across the continuous parameter $\theta$ was approximated
by a sum over the many discrete representative values.
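
A minimal sketch of grid approximation follows. It is not the code behind Figures 5.2 and 5.3; it assumes a Bernoulli likelihood, a uniform prior density $p(\theta) = 1$ on $[0, 1]$, and a midpoint grid, and it uses hypothetical data of 7 heads in 10 flips. Under those assumptions the integral has the known closed form $z!\,(N-z)!/(N+1)!$, which serves as a check.

```python
import math

def grid_evidence(z, n, n_grid=1000):
    """Approximate p(D) = integral of p(D|theta) p(theta) d(theta) by a sum over a grid."""
    width = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * width                  # midpoint of the i-th grid cell
        likelihood = theta**z * (1 - theta)**(n - z)
        prior_density = 1.0                        # uniform prior on [0, 1]
        total += likelihood * prior_density * width
    return total

def exact_evidence(z, n):
    """Closed form of the same integral for a uniform prior."""
    return math.factorial(z) * math.factorial(n - z) / math.factorial(n + 1)

approx = grid_evidence(z=7, n=10)
exact = exact_evidence(z=7, n=10)
print(approx, exact)
```

With one parameter and 1,000 grid points, the sum agrees closely with the exact integral.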

This method will not work for models with many parameters, however.   
In many realistic models, there are dozens
or even hundreds of parameters.   
If we represent each parameter with a grid
of 1,000 values, then for P parameters there are $1,000^P$ combinations of parameter values.   
When P is even moderately large, there are far too many combinations for even
a modern computer to handle.
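
The growth is easy to tabulate. In the sketch below, the evaluation speed of one billion grid points per second is a purely hypothetical figure used to convert counts into time; the function names are also assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def grid_points(values_per_parameter, n_parameters):
    """Number of parameter-value combinations on a full grid."""
    return values_per_parameter ** n_parameters

def years_to_evaluate(n_points, points_per_second=1e9):
    """Time to visit every grid point at a hypothetical evaluation speed."""
    return n_points / points_per_second / SECONDS_PER_YEAR

for n_parameters in (1, 2, 3, 6):
    n_points = grid_points(1000, n_parameters)
    print(n_parameters, n_points, years_to_evaluate(n_points))
```

Under that assumed speed, six parameters already require roughly 32 years, and every additional parameter multiplies the time by 1,000.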

Another kind of approximation involves randomly sampling a large number of
representative combinations of parameter values from the posterior distribution.   
In recent decades, many such algorithms have been developed, generally referred to as Markov chain Monte Carlo (MCMC) methods.   
What makes these methods so useful is that they can generate representative parameter-value combinations from the posterior distribution of complex models without computing the integral in Bayes’ rule.   
It is the development of these MCMC methods that has allowed Bayesian statistical methods to
gain practical use. 
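
As an illustration of the idea, here is a sketch of one simple MCMC method, the Metropolis algorithm, applied to a single coin bias. It is not taken from the text above: the Bernoulli likelihood, the uniform prior, the data (7 heads in 10 flips), the Gaussian proposal, and all tuning values are assumptions of this example. The point to notice is that only the product of likelihood and prior is evaluated; the evidence cancels in the acceptance ratio and is never computed.

```python
import random

def unnormalized_posterior(theta, z, n):
    """Likelihood times a uniform prior on (0, 1); the evidence p(D) is never computed."""
    if not 0.0 < theta < 1.0:
        return 0.0
    return theta**z * (1.0 - theta)**(n - z)

def metropolis(z, n, n_steps=50_000, burn_in=1_000, proposal_sd=0.2, seed=1):
    """Random-walk Metropolis sampler for the coin bias theta."""
    rng = random.Random(seed)
    theta = 0.5  # arbitrary starting value inside (0, 1)
    samples = []
    for step in range(n_steps):
        proposal = theta + rng.gauss(0.0, proposal_sd)
        # Acceptance ratio uses only unnormalized values, so p(D) cancels out.
        ratio = unnormalized_posterior(proposal, z, n) / unnormalized_posterior(theta, z, n)
        if rng.random() < ratio:
            theta = proposal
        if step >= burn_in:
            samples.append(theta)
    return samples

samples = metropolis(z=7, n=10)
posterior_mean = sum(samples) / len(samples)
print(posterior_mean)
```

For this particular setup the exact posterior is a Beta(8, 4) distribution with mean 8/12, so the sample mean should land near 0.667. In models with many parameters no such exact answer is available, which is where sampling methods earn their keep.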