# Two Schools of Statistics

Frequentist vs. Bayesian

Example: What is the average height of Korean adults?

## Frequentist

For a frequentist, this number is unknown but fixed. This is a natural, intuitive view: if you could go through all Korean adults one by one, measure their heights, and average the list, you would get the actual number.

However, since you do not have access to every Korean adult, you take a sample of, say, a thousand people, measure and average their heights to produce a point estimate, and then estimate the error of that estimate. The point is that the frequentist treats the average height as a single unknown number.
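The frequentist procedure described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the population mean and standard deviation used to simulate the sample are made-up values, and the interval uses the usual normal approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population values, used only to simulate a sample (assumptions).
true_mean_cm, true_sd_cm = 167.0, 9.0
sample = rng.normal(true_mean_cm, true_sd_cm, size=1000)

# Point estimate of the fixed but unknown average height.
point_estimate = sample.mean()

# Standard error of the mean: sample SD divided by sqrt(n).
standard_error = sample.std(ddof=1) / np.sqrt(sample.size)

# Approximate 95% confidence interval using the normal critical value 1.96.
ci_low = point_estimate - 1.96 * standard_error
ci_high = point_estimate + 1.96 * standard_error

print(f"estimate = {point_estimate:.2f} cm, SE = {standard_error:.2f} cm")
print(f"approx. 95% CI = ({ci_low:.2f}, {ci_high:.2f}) cm")
```

The interval describes uncertainty about the procedure (how the estimate would vary across repeated samples), not a probability distribution over the average height itself.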

## Bayesian

A Bayesian statistician, however, would take an entirely different view of the situation. A Bayesian would treat the average height not as a fixed number, but as an uncertain quantity described by a probability distribution (you might imagine here a “bell” shaped normal distribution).

Initially, the Bayesian statistician assumes some basic prior knowledge: for example, that the average height is somewhere between 50 cm and 250 cm.

Then, the Bayesian begins to measure heights of specific Korean citizens, and with each measurement updates the distribution to become a bit more “bell-shaped” around the average height measured so far. As more data is collected, the “bell” becomes sharper and more concentrated around the measured average height.
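One way to see the “bell” sharpen is the textbook normal-normal update for an unknown mean. The sketch below is only illustrative: it assumes the measurement standard deviation is known, replaces the “between 50 cm and 250 cm” prior with a wide normal prior, and uses simulated heights. The function name `update` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumptions for illustration only: known measurement SD and a wide normal prior.
known_sd_cm = 9.0
prior_mean_cm, prior_sd_cm = 150.0, 50.0
heights = rng.normal(167.0, known_sd_cm, size=1000)  # simulated measurements


def update(prior_mean, prior_sd, data, sd):
    """Conjugate normal update for an unknown mean with known SD."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = len(data) / sd**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * np.mean(data)) / post_precision
    return post_mean, np.sqrt(1.0 / post_precision)


posterior_sds = []
for n in (1, 10, 100, 1000):
    post_mean, post_sd = update(prior_mean_cm, prior_sd_cm, heights[:n], known_sd_cm)
    posterior_sds.append(post_sd)
    print(f"n = {n:4d}: posterior mean = {post_mean:6.2f} cm, "
          f"posterior SD = {post_sd:5.2f} cm")
```

The posterior standard deviation shrinks as more measurements are included, which is the sharpening of the distribution described above.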

# Chapter 2 - Introduction: Credibility, Models, and Parameters

### Text: Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan

The goal of this chapter is to introduce the conceptual framework of Bayesian data
analysis.   
Bayesian data analysis has two foundational ideas.   
The first idea is that Bayesian
inference is reallocation of credibility across possibilities.   
The second foundational idea
is that the possibilities, over which we allocate credibility, are parameter values in
meaningful mathematical models.

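```python
import pandas as pd
import numpy as np
#import pymc3 as pm
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

from IPython.display import Image
from matplotlib import gridspec

# IPython/Jupyter magic: render figures inline in the notebook.
%matplotlib inline

# 'seaborn-white' was renamed 'seaborn-v0_8-white' in matplotlib 3.6;
# use whichever name this matplotlib version provides.
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)
color = '#87ceeb'
```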

## 2.1. BAYESIAN INFERENCE IS REALLOCATION OF CREDIBILITY ACROSS POSSIBILITIES

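Sherlock Holmes considered the possible causes of a crime.
Of course, these include causes that seem almost impossible before the evidence is examined.
When Holmes examines the evidence and finds that a cause could not have occurred, he removes it from the set of possible causes.
If all causes but one are eliminated, then even a cause that seemed impossible at first can be concluded to be the actual cause.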

There are just four possible causes of the outcome to be explained.   
We label the
causes A, B, C, and D.   

<br><br>
<img style="float: left;" src="pic3/02_01.png"  width="650">

The heights of the bars in the graphs indicate the credibility
of the candidate causes. (“Credibility” is synonymous with “probability”; here I use
the everyday term “credibility” but later in the book, when mathematical formalisms
are introduced, I will also use the term “probability.”)   

Credibility can range from zero
to one.   
If the credibility of a candidate cause is zero, then the cause is definitely not
responsible.  
If the credibility of a candidate cause is one, then the cause definitely is
responsible.   
Because we assume that the candidate causes are mutually exclusive and
exhaust all possible causes, the total credibility across causes sums to one.

The upper-left panel of Figure 2.1 shows that the **prior** credibilities of the four
candidate causes are equal, all at 0.25.   
Suppose we make new observations that rule out candidate cause A.
For example, if A is a suspect in a crime, we may learn that A was far from the crime
scene at the time.   
Therefore, we must re-allocate credibility to the remaining candidate
causes, B through D, as shown in the lower-left panel of Figure 2.1.   
The re-allocated
distribution of credibility is called the **posterior** distribution because it is what we believe
after taking into account the new observations.   
The posterior distribution gives zero
credibility to cause A, and allocates credibilities of 0.33 (i.e., 1/3) to candidate causes B,
C, and D.
**The posterior distribution then becomes the prior beliefs for subsequent observations.**
Thus, the prior distribution in the upper-middle of Figure 2.1 is the posterior
distribution from the lower left. Suppose now that additional new evidence rules out
candidate cause B. We now must re-allocate credibility to the remaining candidate causes, C and D, as shown in the lower-middle panel of Figure 2.1.

This reallocation of credibility is not only intuitive,
it is also what the exact mathematics of Bayesian inference prescribe, as will be explained
later in the book.
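The elimination steps above can be sketched as a small calculation. This is a minimal illustration of the idea, not the book's code; the helper `rule_out` is a hypothetical name.

```python
import numpy as np

causes = ["A", "B", "C", "D"]
prior = np.array([0.25, 0.25, 0.25, 0.25])  # equal prior credibility


def rule_out(credibility, index):
    """Set one candidate's credibility to zero and renormalize the rest."""
    updated = credibility.copy()
    updated[index] = 0.0
    return updated / updated.sum()


# Evidence rules out A; the posterior becomes the prior for the next step.
posterior_1 = rule_out(prior, causes.index("A"))
# Further evidence rules out B.
posterior_2 = rule_out(posterior_1, causes.index("B"))

print(dict(zip(causes, posterior_1.round(2))))
print(dict(zip(causes, posterior_2.round(2))))
```

After the first step each remaining cause has credibility 1/3, and after the second step C and D share the credibility equally at 1/2 each, because the total must still sum to one.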

<br><br>
<img style="float: left;" src="pic3/02_02.png"  width="650">

### 2.1.1. Data are noisy and inferences are probabilistic

Here is a simplified illustration of Bayesian inference when data are noisy.   
Suppose
there is a manufacturer of inflated bouncy balls, and the balls are produced in four
discrete sizes, namely diameters of 1.0, 2.0, 3.0, and 4.0 (on some scale of distance
such as decimeters).   
The manufacturing process is quite variable, however, because of
randomness in degrees of inflation even for a single size ball.   
Thus, balls of manufactured
size 3 might have diameters of 1.8 or 4.2, even though their average diameter is 3.0.  
Suppose we submit an order to the factory for three balls of size 2.  
We receive three balls
and measure their diameters as best we can, and find that the three balls have diameters
of 1.77, 2.23, and 2.70.   
From those measurements, can we conclude that the factory
correctly sent us three balls of size 2, or did the factory send size 3 or size 1 by mistake,
or even size 4?

<br><br>
<img style="float: left;" src="pic3/02_03.png"  width="650">

Figure 2.3 shows a Bayesian answer to this question.   
The upper graph shows the four possible sizes, with blue bars at positions 1, 2, 3, and 4.   
The prior credibilities of the
four sizes are set equal, at heights of 0.25, representing the idea that the factory received
the order for three balls, but may have completely lost track of which size was ordered,
hence any size is equally possible to have been sent.  

At this point, we must specify the form of random variability in ball diameters.  
For purposes of illustration, we will suppose that ball diameters are centered on their
manufactured size, but could be bigger or smaller depending on the amount of inflation.  
The bell-shaped curves in Figure 2.3 indicate the probability of diameters produced by
each size.   
Thus, the bell-shaped curve centered on size 2 indicates that size-2 balls are
usually about 2.0 units in diameter, but could be much bigger or smaller because of
randomness in inflation.   
The horizontal axis in Figure 2.3 is playing double duty as a
scale for the ball sizes (i.e., blue bars) and for the measured diameters (suggested by the
bell-shaped distributions).

The lower panel of Figure 2.3 shows the three measured diameters plotted as circles
on the horizontal axis.   
You can see that the measured diameters are closest to sizes 2
or 3, but the bell-shaped distributions reveal that even size 1 could sometimes produce
balls of those diameters.   
Intuitively, therefore, we would say that size 2 is most credible,
given the data, but size 3 is also somewhat possible, and size 1 is remotely possible,
but size 4 is rather unlikely.   
These intuitions are precisely reflected by Bayesian analysis, which is shown in the lower panel of Figure 2.3.   
The heights of the blue bars show the
exact reallocation of credibility across the four candidate sizes.   
Given the data, there is
56% probability that the balls are size 2, 31% probability that the balls are size 3, 11%
probability that the balls are size 1, and only 2% probability that the balls are size 4.
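The reallocation for the bouncy-ball example can be sketched with Bayes' rule on the four candidate sizes. The text does not state the standard deviation of the bell-shaped curves, so `noise_sd = 1.16` below is an assumption, picked only because it gives credibilities close to the percentages quoted above; it should not be read as the value used for Figure 2.3.

```python
import numpy as np

sizes = np.array([1.0, 2.0, 3.0, 4.0])     # candidate manufactured sizes
prior = np.full(sizes.size, 0.25)          # equal prior credibility
diameters = np.array([1.77, 2.23, 2.70])   # measured diameters from the text
noise_sd = 1.16                            # ASSUMPTION: not stated in the text

# Log-likelihood of all three measurements under each size, assuming normal
# noise with a shared SD. Constant terms are dropped; they cancel on normalizing.
residuals = diameters[:, None] - sizes[None, :]
log_likelihood = -0.5 * np.sum((residuals / noise_sd) ** 2, axis=0)

# Bayes' rule: posterior is proportional to prior times likelihood.
unnormalized = prior * np.exp(log_likelihood - log_likelihood.max())
posterior = unnormalized / unnormalized.sum()

for size, p in zip(sizes, posterior):
    print(f"size {size:.0f}: posterior credibility = {p:.2f}")
```

With a smaller assumed noise SD the posterior would concentrate more strongly on size 2; with a larger one it would stay closer to the flat prior.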

Inferring the underlying manufactured size of the balls from their “noisy” individual
diameters is analogous to data analysis in real-world scientific research and applications.  
The data are noisy indicators of the underlying generator.   
We hypothesize a range of
possible underlying generators, and from the data we infer their relative credibilities.

As another example, consider testing people for illicit drug use.   
A person is taken
at random from a population and given a blood test for an illegal drug.   
From the result
of the test, we infer whether or not the person has used the drug.   
But, crucially, the
test is not perfect; it is noisy.   
The test has a non-trivial probability of producing false
positives and false negatives.   
And we must also take into account our prior knowledge that the drug is used by only a small proportion of the population.   
Thus, the set of
possibilities has two values: The person uses the drug or does not.   
The two possibilities
have prior credibilities based on previous knowledge of how prevalent drug use is in
the population.   
The noisy datum is the result of the drug test.   
We then use Bayesian
inference to re-allocate credibility across the possibilities.   
As we will see quantitatively
later in the book, the posterior probability of drug use is often surprisingly small even
when the test result is positive, because the prior probability of drug use is small and the
test is noisy.   


This is true not only for tests of drug use, but also for tests of diseases such
as cancer.   
A related real-world application of Bayesian inference is detection of spam
in email.   
Automated spam filters often use Bayesian inference to compute a posterior
probability that an incoming message is spam.
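The drug-test reasoning above can be made concrete with Bayes' rule for two possibilities. All three numbers in this sketch (prevalence, hit rate, and false-positive rate) are hypothetical and chosen only for illustration.

```python
# All three numbers below are hypothetical, chosen only for illustration.
prior_user = 0.05            # prior credibility that a random person uses the drug
p_pos_given_user = 0.95      # probability of a positive test for a user
p_pos_given_nonuser = 0.10   # false-positive rate for a non-user

# Total probability of a positive result across both possibilities.
p_pos = prior_user * p_pos_given_user + (1 - prior_user) * p_pos_given_nonuser

# Bayes' rule: credibility of drug use after observing a positive result.
posterior_user = prior_user * p_pos_given_user / p_pos

print(f"P(user | positive test) = {posterior_user:.3f}")
```

Under these assumed rates, a positive result raises the credibility of drug use from 5% to only about one in three, because most positive results come from the much larger group of non-users.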

In summary, the essence of Bayesian inference is reallocation of credibility across
possibilities.   
The distribution of credibility initially reflects prior knowledge about the
possibilities, which can be quite vague.   
Then new data are observed, and the credibility is
re-allocated.   
Possibilities that are consistent with the data garner more credibility, while
possibilities that are not consistent with the data lose credibility.  
Bayesian analysis is the
mathematics of re-allocating credibility in a logically coherent and precise way.

## 2.2. POSSIBILITIES ARE PARAMETER VALUES IN DESCRIPTIVE MODELS

<br><br>
<img style="float: left;" src="pic3/02_04.png"  width="650">

## 2.3. THE STEPS OF BAYESIAN DATA ANALYSIS

1. Identify the data relevant to the research questions.  
What are the measurement scales
of the data?   
Which data variables are to be predicted, and which data variables are
supposed to act as predictors?

2. Define a descriptive model for the relevant data.   
The mathematical form and its
parameters should be meaningful and appropriate to the theoretical purposes of the
analysis.

3. Specify a prior distribution on the parameters.   
The prior must pass muster with the
audience of the analysis, such as skeptical scientists.

4. Use Bayesian inference to re-allocate credibility across parameter values.   
Interpret
the posterior distribution with respect to theoretically meaningful issues (assuming
that the model is a reasonable description of the data; see next step).

5. Check that the posterior predictions mimic the data with reasonable accuracy (i.e.,
conduct a “posterior predictive check”).   
If not, then consider a different descriptive
model.

Perhaps the best way to explain these steps is with a realistic example of Bayesian data
analysis.   
The discussion that follows is abbreviated for purposes of this introductory
chapter, with many technical details suppressed. 

For this example, suppose we are interested in the relationship between weight and height of people.   
We suspect from
everyday experience that taller people tend to weigh more than shorter people,
but we would like to know by how much people’s weights tend to increase when
height increases, and how certain we can be about the magnitude of the increase.  
In particular, we might be interested in predicting a person’s weight based on their
height.

The first step is identifying the relevant data.   
Suppose we have been able to collect
heights and weights from 57 mature adults sampled at random from a population of
interest.   
Heights are measured on the continuous scale of inches, and weights are
measured on the continuous scale of pounds.  
We wish to predict weight from height.   
A scatter plot of the data is shown in Figure 2.5.

<br><br>
<img style="float: left;" src="pic3/02_05.png"  width="650">

The second step is to define a descriptive model of the data that is meaningful
for our research interest.   
We will describe predicted weight as a multiplier times height plus a
baseline.  
We will denote the predicted weight as $\hat{y}$ (spoken “y hat”), and we will denote
the height as $x$.

$$\hat{y} = \beta_1 x + \beta_0 \qquad (2.1)$$

The coefficient, $β_1$ (Greek letter “beta”), indicates how much the predicted weight
increases when the height goes up by one inch.  
Equation 2.1 is the form of a line, in which $β_1$ is the slope and $β_0$ is the intercept, and this model of trend is often called
linear regression.  

The model is not complete yet, because we have to describe the random variation of
actual weights around the predicted weight.  
For simplicity, we will use the conventional
normal distribution (explained in detail in Section 4.3.2.2), and assume that actual
weights y are distributed randomly according to a normal distribution around the
predicted value $\hat{y}$  and with standard deviation denoted $σ$ (Greek letter “sigma”). This
relation is denoted symbolically as  
$$y \sim \mathrm{normal}(\hat{y}, \sigma) \qquad (2.2)$$

where the symbol “∼” means “is distributed as.”  

The full model, combining Equations 2.1 and 2.2, has three parameters altogether:
the slope, $β_1$, the intercept, $β_0$, and the standard deviation of the “noise,” $σ$.   
Note that
the three parameters are meaningful.   
In particular, the slope parameter tells us how much
the weight tends to increase when height increases by an inch, and the standard deviation
parameter tells us how much variability in weight there is around the predicted value.  
This sort of model, called linear regression, is explained at length in Chapters 15, 17,
and 18.
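To make the roles of the three parameters concrete, the model in Equations 2.1 and 2.2 can be run forward to generate data. The parameter values and the range of heights below are hypothetical, picked only to show the form of the model; they are not estimates from the 57 observations.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameter values, used only to illustrate the model's form.
beta0, beta1, sigma = -100.0, 4.0, 30.0

height_in = rng.uniform(60.0, 75.0, size=57)      # simulated heights (inches)
predicted_weight = beta1 * height_in + beta0      # Equation 2.1: y-hat
weight_lb = rng.normal(predicted_weight, sigma)   # Equation 2.2: noisy weights

# First few (height, weight) pairs generated by the model.
print(np.column_stack([height_in, weight_lb])[:5].round(1))
```

Bayesian inference runs in the opposite direction: given observed heights and weights, it asks which combinations of the slope, intercept, and noise standard deviation are credible.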

The third step in the analysis is specifying a prior distribution on the parameters.  
We
might be able to inform the prior with previously conducted, and publicly verifiable,
research on weights and heights of the target population.   

But for purposes of this example, I will use a vague prior that places
virtually equal prior credibility across a vast range of possible values for the slope and
intercept, both centered at zero.   
I will also place a vague prior on
the noise (standard deviation) parameter, specifically a uniform distribution that extends
from zero to a huge value.   
This choice of prior distribution implies that it has virtually
no biasing influence on the resulting posterior distribution.

The fourth step is interpreting the posterior distribution.   
Bayesian inference has reallocated
credibility across parameter values, from the vague prior distribution, to values
that are consistent with the data.   
The posterior distribution indicates combinations of
$β_0$, $β_1$, and $σ$ that together are credible, given the data.  

The right panel of Figure 2.5
shows the posterior distribution on the slope parameter, $β_1$ (collapsing across the other
two parameters).   
It is important to understand that Figure 2.5 shows a distribution
of parameter values, not a distribution of data.  
The posterior distribution in Figure 2.5 indicates that the most
credible value of the slope is about 4.1.  

One way to summarize the uncertainty is by marking
the span of values that are most credible and cover 95% of the distribution.   
This is
called the **highest density interval (HDI)** and is marked by the black bar on the floor of
the distribution in Figure 2.5.   
Values within the 95% HDI are more credible (i.e., have
higher probability “density”) than values outside the HDI, and the values inside the
HDI have a total probability of 95%.  
Given the 57 data points, the 95% HDI goes from
a slope of about 2.6 pounds per inch to a slope of about 5.7 pounds per inch.  
With more
data, the estimate of the slope would be more precise, meaning that the HDI would be
narrower.
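An HDI can be approximated from posterior draws by finding the narrowest interval that contains the desired share of them. The sketch below shows one common way to do this; the function name `hdi` is hypothetical, and the draws are synthetic stand-ins rather than the actual posterior from Figure 2.5.

```python
import numpy as np


def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    sorted_samples = np.sort(np.asarray(samples))
    n = sorted_samples.size
    n_inside = int(np.ceil(mass * n))  # number of points inside the interval
    # Width of every candidate interval spanning n_inside consecutive points.
    widths = sorted_samples[n_inside - 1:] - sorted_samples[: n - n_inside + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_inside - 1]


rng = np.random.default_rng(3)
# Synthetic stand-in for posterior draws of the slope (NOT the actual posterior).
slope_draws = rng.normal(4.1, 0.8, size=20_000)

low, high = hdi(slope_draws, mass=0.95)
inside = np.mean((slope_draws >= low) & (slope_draws <= high))
print(f"95% HDI = ({low:.2f}, {high:.2f}), fraction inside = {inside:.3f}")
```

For a symmetric, single-peaked distribution the HDI is close to the equal-tailed interval; the two differ when the posterior is skewed.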

The fifth step is to check that the model, with its most credible parameter values,
actually mimics the data reasonably well.  
This is called a “posterior predictive check.”  
There is no single, unique way to ascertain whether the model predictions systematically and meaningfully deviate from the data, because there are innumerable ways in which
to define systematic deviation.   
One approach is to plot a summary of predicted data
from the model against the actual data.   
We take credible values of the parameters, $β_0$, $β_1$, and $σ$, plug them into the model Equations 2.1 and 2.2, and randomly generate
simulated $y$ values (weights) at selected $x$ values (heights).  
We do that for many, many credible parameter values to create representative distributions of what data would look
like according to the model.   
The results of this simulation are shown in Figure 2.6.

<br><br>
<img style="float: left;" src="pic3/02_06.png"  width="650">

The predicted weight values are summarized by vertical bars that show the range of the
95% most credible predicted weight values.   
The dot at the middle of each bar shows
the mean of the predicted weight values.  
By visual inspection of the graph, we can
see that the actual data appear to be well described by the predicted data.   
The actual
data do not appear to deviate systematically from the trend or band predicted from
the model.
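The posterior predictive check described above can be sketched as follows. Several things here are assumptions rather than the book's procedure: the data are simulated, the posterior draws come from the closed-form result for the standard noninformative prior (proportional to $1/σ^2$) instead of MCMC with the uniform prior on $σ$, and the band is a simple equal-tailed 95% interval rather than an HDI.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data standing in for the 57 height/weight pairs (hypothetical values).
n = 57
x = rng.uniform(60.0, 75.0, size=n)
y = rng.normal(4.0 * x - 100.0, 30.0)

# Least-squares quantities needed for the closed-form posterior.
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
dof = n - 2
s2 = resid @ resid / dof

# Posterior draws under the prior proportional to 1/sigma^2 (a stand-in for MCMC).
n_draws = 4000
sigma2 = dof * s2 / rng.chisquare(dof, size=n_draws)         # sigma^2 given y
z = rng.standard_normal((n_draws, 2))
chol = np.linalg.cholesky(XtX_inv)
betas = beta_hat + np.sqrt(sigma2)[:, None] * (z @ chol.T)   # (beta0, beta1) given sigma^2, y

# Posterior predictive draws of weight at each observed height.
mu = betas @ X.T                                             # shape (n_draws, n)
y_rep = rng.normal(mu, np.sqrt(sigma2)[:, None])

# Equal-tailed 95% predictive band and the share of observed weights inside it.
low, high = np.percentile(y_rep, [2.5, 97.5], axis=0)
coverage = np.mean((y >= low) & (y <= high))
print(f"fraction of observed weights inside the 95% predictive band: {coverage:.2f}")
```

If the model describes the data well, most observed weights fall inside the band with no systematic pattern in the ones that fall outside; plotting `low`, `high`, and the observed points against height gives a picture similar in spirit to Figure 2.6.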

If the actual data did appear to deviate systematically from the predicted form, then
we could contemplate alternative descriptive models.   
For example, the actual data might
appear to have a nonlinear trend.   
In that case, we could expand the model to include
nonlinear trends.