# MSDM5058 Tutorial 2 Pattern Recognition and Decision Theory

## Contents
1. Association rules
2. Pattern recognition
3. Decision theory

---

# 1. Association rules

The association between two events $A$ and $C$ can be formulated with an association rule $A\Rightarrow C$, which literally reads as:

<p style="text-align: center;">
"When $A$ has happened, $C$ will happen."
</p>

The two events are named as follows:

- $A$ = the **A**ntecedent 
- $C$ = the **C**onsequence
    
Note that we avoid calling $A$ the "cause" because we only want to measure the predictability of the rule, i.e. how confidently we can say "$C$ will happen" when we see that "$A$ has happened". We do not care about how $A$ is associated with $C$ (e.g. they may simply be found to happen together frequently, without any logical explanation).

Here are some measures for quantifying the predictability:

## 1.1. Usefulness measures

- **Support**: 
 
 $$\mathrm{supp}(A\Rightarrow C) = P(A)$$
 
 i.e. the probability that $A$ occurs. A rule is quite useless if its support is low, because it is rarely applicable.
 
 

 

- **Confidence**: 
 
 $$\mathrm{conf}(A\Rightarrow C) = P(C|A)$$ 

 i.e. the conditional probability that $C$ occurs given that $A$ has occurred. Intuitively, a rule with a high confidence is more trustworthy than one with a low confidence. However, confidence may be high just because $C$ happens a lot, in which case the high confidence is statistically expected.
 


 

- **Rule power factor**: 

 $$\mathrm{RPF}(A\Rightarrow C) = P(A\cap C)P(C|A)$$
 
 Similar to confidence but devised to emphasize the importance of association (by only counting the intersection between $A$ and $C$).  For example, if $C$ occurs frequently, the rule's confidence is trivially pulled up, whereas its RPF remains low.

 


 
## 1.2. Interdependence measures

- **Lift**: 

 $$\mathrm{lift}(A\Rightarrow C) = \frac{P(A\cap C)}{P(A)P(C)}$$

 It measures the dependence between $A$ and $C$ by comparing how often they co-occur with how often they would co-occur if independent. If lift equals $1$, $A$ and $C$ are independent and thus in no way causal. If it is greater than $1$, $A$ and $C$ co-occur more frequently than expected, suggesting some kind of causality. If it is less than $1$, $A$ and $C$ co-occur less frequently than expected, suggesting that one causes the other's complement. Notice its symmetry: $\mathrm{lift}(A\Rightarrow C) = \mathrm{lift}(C\Rightarrow A)$.

- **Leverage**: 

 $$\mathrm{leve}(A\Rightarrow C) = P(A\cap C) - P(A)P(C)$$

 A symmetric measure similar to lift. If it equals $0$, $A$ and $C$ are independent. Otherwise, $A$ and $C$ associate with each other positively or negatively depending on its sign. Still, it differs from lift occasionally: for example, $P(A\cap C)=0$ yields a zero lift but a nonzero leverage.

- **Conviction**: 

 $$\mathrm{conv}(A\Rightarrow C)= \frac{1-P(C)}{1-P(C|A)}$$

 Also similar to lift but asymmetric. The numerator is the probability that the rule fails (i.e. $C$ does not occur) if $A$ and $C$ are in fact independent. The denominator is the probability that the rule fails given that $A$ has occurred. It equals $1$ if $A$ and $C$ are independent; it is greater than $1$ if $A$ makes $C$ more likely (the rule fails less often than under independence), and less than $1$ if $A$ makes $C$'s complement more likely.
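As a rough illustration, the six measures above can be estimated from a log of co-occurrences. The following is only a sketch: the `records` list is hypothetical toy data, and `association_measures` is a helper name introduced here, not part of any library.

```python
# Hypothetical toy log: each record states whether A and C occurred.
records = [
    (True, True), (True, True), (True, False), (False, True),
    (False, True), (False, False), (True, True), (False, True),
    (False, False), (True, True),
]

def association_measures(records):
    """Estimate the measures of Sections 1.1 and 1.2 from (A, C) pairs."""
    n = len(records)
    p_a = sum(a for a, _ in records) / n          # P(A)
    p_c = sum(c for _, c in records) / n          # P(C)
    p_ac = sum(a and c for a, c in records) / n   # P(A and C)
    conf = p_ac / p_a                             # P(C|A)
    return {
        "support": p_a,
        "confidence": conf,
        "rpf": p_ac * conf,
        "lift": p_ac / (p_a * p_c),
        "leverage": p_ac - p_a * p_c,
        # Conviction diverges when the rule never fails (confidence = 1).
        "conviction": (1 - p_c) / (1 - conf) if conf < 1 else float("inf"),
    }

measures = association_measures(records)
for name, value in measures.items():
    print(f"{name:>10}: {value:.3f}")
```

Probabilities are estimated here as plain relative frequencies, so the estimates are only as good as the sample is large.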
 
 
---

# 2. Pattern recognition

## 2.1. Bayesian inference

By the symmetry of Bayes' theorem, we may derive the conditional probability of $A$ given $C$ as 

$$
P(A|C) = \frac{P(C|A)P(A)}{P(C)} \ ,
$$ 

i.e. the probability that $A$ has occurred previously if we observe $C$. The observation of $C$ can mean a lot, since $C$ could be associated with many antecedents. We may therefore rewrite the equation by expanding $P(C)$:

$$
\begin{align*}
P(C)
=& \sum_{A'}P(C\cap A') \\
=& \sum_{A'}P(C| A')P(A') \ ,
\end{align*}
$$

where $\{A'\}$ represents a complete set of _mutually exclusive_ antecedents so that $\sum_{A'}P(A')=1$. This expansion is a simple counting argument: each time $C$ occurs, some antecedent $A'$ must have happened; if we sum up the co-occurrences of $C$ and $A'$ over all possible $A'$, we essentially get the number of occurrences of $C$.

By substitution, the conditional probability $P(A|C)$ becomes

$$
P(A|C)
=\frac{P(C|A)P(A)}{\sum_{A'} P(C|A')P(A')} \ .
$$

You may call this the formula of Bayesian inference. It is also referred to as "Bayesian updating" of our degree of belief, by interpreting the terms as: 

- $P(A)$ = **prior** (probability) of $A$ = Our degree of belief that $A$ has occurred, before seeing the occurrence of $C$;

- $P(A|C)$ = **posterior** (probability) of $A$ = Our updated degree of belief that $A$ has occurred, after seeing the occurrence of $C$;

- $P(C|A)$ = **likelihood** of $C$ given $A$;

- $P(C) = \sum_{A'} P(C|A')P(A')$ = **marginal likelihood** or model evidence.

#### Example: Drug test

A typical example of Bayesian inference is drug tests in sport competitions: how probable is it that an athlete has used drugs if his urine test says so?

**Solution**. The "antecedent-consequence" pair is obvious: 

- $A$ = "The athlete has used drugs." 
- $C$ = "His urine test is positive."

Since $\{A,\bar{A}\}$ is a complete set of mutually exclusive events, i.e. the athlete has definitely either used or not used drugs, we conclude that

$$
P(A|C) = \frac{P(C|A)P(A)}{P(C|A)P(A)+P(C|\bar{A})P(\bar{A})} \ ,
$$

where $P(\bar{A}) = 1-P(A)$. The value of $P(A)$ may be pre-determined in any sensible way, e.g. it may be the past proportion of drug-using athletes. 

What if we do not know this proportion? We may appeal to the principle of indifference and simply set it to $0.5$, assuming each hypothesis is equally probable, so that we are as fair as possible. Different choices of prior of course lead to different values of posterior, but as long as your evidence supports your hypothesis, your posterior belief will exceed your assigned prior.

In fact, this subjectivity coming from the choice of prior has led to much criticism against Bayesian inference (not against Bayes' theorem, though) from another school of statistics, namely frequentist inference, which would simply answer "no conclusion" if the past proportion is unknown.
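To see how strongly the posterior depends on the prior, here is a small sketch of the two-hypothesis formula above. The test characteristics `SENSITIVITY` and `FALSE_POSITIVE` are made-up numbers for illustration and do not describe any real drug test.

```python
def posterior(prior, p_pos_given_user, p_pos_given_clean):
    """P(A|C) for the complete set {A, not A}."""
    evidence = p_pos_given_user * prior + p_pos_given_clean * (1 - prior)
    return p_pos_given_user * prior / evidence

# Hypothetical test characteristics (assumptions, not real data).
SENSITIVITY = 0.95      # P(C | A)
FALSE_POSITIVE = 0.05   # P(C | not A)

# Compare an informed prior (1% of athletes use drugs) with the indifferent prior 0.5.
for prior in (0.01, 0.5):
    post = posterior(prior, SENSITIVITY, FALSE_POSITIVE)
    print(f"prior = {prior:.2f} -> posterior = {post:.3f}")
```

In both cases the posterior exceeds the prior, since a positive test is more likely for a user than for a non-user, but the two posteriors differ widely.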

## 2.2. Bayes classifier

The Bayes classifier is a slightly more sophisticated application of Bayesian inference. The situation is as follows:

- Suppose we are collecting samples and need to divide them into classes. Each sample has $n$ different features, and we have $m$ samples in total. We may pack the values into a table:

||Feature 1|Feature 2|$\cdots$|Feature n|belong to class|
|:---:|:---:|:---:|:---:|:---:|:---:|
|Sample 1|$x_{11}$|$x_{12}$|$\cdots$|$x_{1n}$|$C_1$|
|Sample 2|$x_{21}$|$x_{22}$|$\cdots$|$x_{2n}$|$C_2$|
|$\vdots$|$\vdots$|$\vdots$|$\ddots$|$\vdots$|$\vdots$|
|Sample m|$x_{m1}$|$x_{m2}$|$\cdots$|$x_{mn}$|$C_m$|

- Now we receive a new sample. What class would this sample belong to?

||Feature 1|Feature 2|$\cdots$|Feature n|belong to class|
|:---:|:---:|:---:|:---:|:---:|:---:|
|Sample k|$x_{k1}$|$x_{k2}$|$\cdots$|$x_{kn}$|???|


We can introduce some commonly used notation to make the situation look mathematical:

- A data matrix $\mathbf{X}=(x_{ij})$, in which a row and a column respectively record the information about a sample and a feature. A single sample can then be described as a row vector $\vec{x}=(x_1,\dots,x_n)$.

- A class vector $\vec{C}=(C_i)$, that records the class of (and thus classifies) each sample. 



$$
\mathbf{X} = (x_{ij}) = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn} \\
\end{pmatrix}
\quad \text{and} \quad
\vec{C} = (C_i) =\begin{pmatrix}
C_1 \\ C_2 \\ \vdots \\ C_m
\end{pmatrix}
$$


A Bayes classifier guesses the class of a new sample by choosing the class that has the maximum conditional probability $P(C_{\vec{x}}=a |\vec{x})$, i.e. the probability that the class of sample $\vec{x}$ is $a$, given that we already know the values of its features $\vec{x}$. For a $K$-class problem, i.e. $a\in\{1,2,\dots,K\}$, we can use the Bayesian inference formula to write down

$$
P(C_{\vec{x}}=a |\vec{x})
= \frac{P(\vec{x}| C_{\vec{x}}=a)P(C_{\vec{x}}=a)}
{\sum_{b=1}^K P(\vec{x}| C_{\vec{x}}=b)P(C_{\vec{x}}=b)}
$$

Then we naively predict that

$$
C_{\vec{x}}=\underset{a\in\{1,\dots,K\}}{\arg\max} P(C_{\vec{x}}=a | \vec{x}) \ .
$$


This algorithm is sometimes called a naive Bayes classifier to distinguish it from other more advanced Bayesian methods. Alternatively, this prediction rule is called the **maximum-a-posteriori (MAP)** rule.

The prior probability and likelihood can be deduced from the data matrix: 

- The prior $P(C_{\vec{x}}=a)$ may be defined as the frequency of $a$'s in $\vec{C}$. 

- For the likelihood $P(\vec{x}| C_{\vec{x}}=a)$, we break it down into a product of probabilities of the individual features. For example, **only if the features are independent of each other given the class**, 

    $$
    P(\vec{x}| C_{\vec{x}}=a) = P(x_1 | C_{\vec{x}}=a) P(x_2 | C_{\vec{x}}=a)\cdots P(x_n | C_{\vec{x}}=a)
    $$
 
    Each probability may be taken as the frequency of the value among the samples of class $a$, or from some distribution based on our prior knowledge. In reality, however, features are unlikely to be completely independent (e.g. gender, height, weight...). The construction of this chain of conditional probabilities depends greatly on our model, i.e. how we define the relations between the features. 
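The recipe above (prior from class frequencies, likelihood as a product of per-feature frequencies, then the MAP rule) can be sketched for categorical features as follows. The toy data and the function names `fit` and `predict` are hypothetical, and the sketch uses raw frequencies without any smoothing, so a feature value never seen in a class gives that class a zero score.

```python
from collections import Counter, defaultdict

def fit(X, C):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(C)
    # value_counts[(j, a)][v] = number of class-a samples whose feature j equals v
    value_counts = defaultdict(Counter)
    for row, a in zip(X, C):
        for j, v in enumerate(row):
            value_counts[(j, a)][v] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return the MAP class and the normalised posteriors for one sample x."""
    m = sum(class_counts.values())
    scores = {}
    for a, count in class_counts.items():
        score = count / m                              # prior P(C = a)
        for j, v in enumerate(x):
            score *= value_counts[(j, a)][v] / count   # P(x_j | C = a)
        scores[a] = score
    total = sum(scores.values())                       # marginal likelihood
    posteriors = {a: s / total for a, s in scores.items()} if total > 0 else scores
    return max(posteriors, key=posteriors.get), posteriors

# Hypothetical toy data: features are (outlook, wind), classes are "play" / "stay".
X = [("sunny", "weak"), ("sunny", "strong"), ("rainy", "weak"),
     ("rainy", "strong"), ("sunny", "weak"), ("rainy", "weak")]
C = ["play", "play", "stay", "stay", "play", "play"]

model = fit(X, C)
label, post = predict(("rainy", "strong"), *model)
print(label, post)
```

The product over features is exactly the independence assumption stated above; a model with dependent features would need a different factorisation of the likelihood.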



#### Example: Weather prediction

Whether an outdoor activity will be held depends on whether the weather is sunny ($S=1$) or rainy ($S=0$). Suppose we have some past data, e.g. time of the year $t$, temperature $T$, humidity $H$ and pressure $p$, together with records of whether those days were sunny or not. How can we predict whether tomorrow will be sunny or rainy with a Bayes classifier, given today's data?

**Solution.** Weather data are in fact strongly correlated with each other. Let's say we model (very inaccurately) the relations of the features as follows: 

- Weather is seasonal, so all other features ($T,H,p$) are likely dependent on the time of the year $t$.

- Humidity $H$ is also partly related to temperature $T$ and pressure $p$.

We may present their causal relations in the following diagram:

<figure style="text-align:center">


    
</figure>

- The likelihood $P(t,p,H,T|S)$ can be read from the diagram, with each node and its incoming arrows representing one conditional probability. 

    \begin{align*}
    P(t,p,H,T|S) &= P(H|t,p,T,S)P(t,p,T|S) \\
    &= P(H|t,p,T,S)P(p,T|t,S)P(t|S) \\
    &= P(H|t,p,T,S)P(p|t,S)P(T|t,S)P(t|S)
    \end{align*}

    The first two lines are simply the conditional probability formula, and the last line follows from our assumption that $p$ and $T$ are independent given $t$ and $S$. 
    
- The choice of prior has even greater freedom. For example, we may use the frequency of sunny days in the current month, or the frequency of sunny days on the exact same date every year. Basically, anything reasonable you can think of will do. 

---
# 3. Decision theory

Decision theory is very important as it combines the probabilistic nature of the prediction rule with the cost of our decision, i.e. it aims **to minimize the risk of making a wrong decision**.

## 3.1. Terminology

### 3.1.1. Hypothesis
In the simplest formulation of decision theory, we are given two hypotheses 

- **Null hypothesis** $H_0$ - Usually a negative hypothesis that suggests the absence of the thing we are interested in.

- **Alternative hypothesis** $H_1$ - Usually a positive hypothesis that suggests the presence of the thing.

We denote by $c_{ij}$ the **cost** of believing $H_i$ when $H_j$ is true, and wish to minimize the expected cost per decision.

### 3.1.2. Scenarios

The scenarios $S_{ij}$, i.e. "believing $H_i$ when $H_j$ is true", are named in various ways.

| $S_{11}$ | $S_{10}$ | $S_{01}$ | $S_{00}$ |
|:---:|:---:|:---:|:---:|
| True positive | False positive | False negative | True negative |
|| False alarm | Miss ||
|| Type I error | Type II error ||

The names "**Type I error**" and "**Type II error**" are extremely confusing but common in statistics literature. The rate/probability of each scenario also goes by several names. You may learn more about the conventions on [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix).


| $P(S_{11})$ | $P(S_{10})$ | $P(S_{01})$ | $P(S_{00})$ |
|:---:|:---:|:---:|:---:|
| True positive rate (TPR) | False positive rate (FPR)| False negative rate (FNR) | True negative rate (TNR)|
| Sensitivity | Fall out | Miss rate | Specificity (SPC) |
| Recall ||| Selectivity|
| Power ||||
| $1-\beta$ | $\alpha$ | $\beta$ | $1-\alpha$ |




Let $P(H_i| H_j)$ represent the probability of the scenario $S_{ij}$. Then we have:

- Probability of correct decision: $$P_\mathrm{CD} = P(H_0| H_0)P(H_0) + P(H_1| H_1)P(H_1)$$

- Probability of wrong decision: $$P_\mathrm{WD} = P(H_1| H_0)P(H_0) + P(H_0| H_1)P(H_1)$$

- Total cost per decision: 

 $$
 \begin{align*}
C =&\ c_{00}P(H_0| H_0)P(H_0) + c_{11}P(H_1| H_1)P(H_1) \\[0.5em] &+ c_{10}P(H_1| H_0)P(H_0) + c_{01}P(H_0| H_1)P(H_1)
\end{align*}
$$

### 3.1.3. Signal detection

Suppose we observe some one-dimensional evidence $Z\in \mathbb{R}$ and want to decide accordingly which hypothesis to choose.

- The value of the evidence generally follows some distribution $f_Z(z|H_i)$, which depends on which hypothesis $H_i$ is true. 

- The simplest way to make the decision is to fix one (or multiple) cutoff(s) $Z^*$.


It is possible to rewrite the probability of the scenarios as

$$
P(H_i|H_j) = \int_{R_i} f_Z(z|H_j) \mathrm{d}z
$$

where $R_i$ is the range of $Z$ for choosing the hypothesis $H_i$. For example, if we have only one cutoff:
1. Choose the side, e.g. we believe $H_0$ if $Z<Z^*$ but $H_1$ if $Z>Z^*$, 
2. Then $R_0=(-\infty, Z^*]$ and $R_1=[Z^*, +\infty)$. 

But we can also design multiple cutoffs and have something like $R_0=[Z_1^*, Z_2^*]$ and $R_1=(-\infty,Z_1^*]\cup[Z_2^*,+\infty)$.


In general, the optimal $Z^*$s are chosen to be the value that minimizes the average cost. The criterion for choosing $Z^*$ is called a "detector".
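For the single-cutoff case, the scenario probabilities and the total cost per decision can be evaluated directly. The sketch below assumes two Gaussian evidence distributions and symmetric costs, all chosen purely for illustration, and finds the best cutoff by a brute-force grid search.

```python
from statistics import NormalDist

# Hypothetical evidence distributions (assumptions for illustration only).
f0 = NormalDist(mu=0.0, sigma=1.0)   # f_Z(z | H_0)
f1 = NormalDist(mu=2.0, sigma=1.0)   # f_Z(z | H_1)

def scenario_probabilities(z_star):
    """P(H_i | H_j) for R_0 = (-inf, z*] and R_1 = [z*, +inf)."""
    return {
        (0, 0): f0.cdf(z_star),        # true negative
        (1, 0): 1 - f0.cdf(z_star),    # false alarm
        (0, 1): f1.cdf(z_star),        # miss
        (1, 1): 1 - f1.cdf(z_star),    # true positive
    }

def expected_cost(z_star, cost, p_h0):
    """Total cost per decision; cost[(i, j)] = c_ij, priors (p_h0, 1 - p_h0)."""
    prior = {0: p_h0, 1: 1 - p_h0}
    probs = scenario_probabilities(z_star)
    return sum(cost[ij] * probs[ij] * prior[ij[1]] for ij in probs)

# Assumed costs: correct decisions are free, both errors cost one unit.
cost = {(0, 0): 0.0, (1, 1): 0.0, (1, 0): 1.0, (0, 1): 1.0}
grid = [i / 100 for i in range(-300, 501)]
best = min(grid, key=lambda z: expected_cost(z, cost, p_h0=0.5))
print(f"best cutoff on the grid: {best:.2f}")
```

With equal priors, symmetric costs and equal variances, the minimum falls at the midpoint between the two means; changing the priors or the costs shifts it.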


## 3.2. Bayes detector

### 3.2.1. Formulation 

The most general detector is the Bayes detector. The criterion is formed by the two quantities:

1. **Likelihood ratio** $\Lambda(z)$

$$
\Lambda(z) \equiv \frac{f_Z(z| H_1)}{f_Z(z| H_0)}
$$

2. **The threshold** $\eta$

$$
\eta \equiv \frac{c_{10}-c_{00}}{c_{01}-c_{11}} \cdot \frac{P(H_0)}{P(H_1)}
$$

Its criterion is

$$
\begin{cases}
\Lambda(z) < \eta &\Rightarrow &\text{choose }H_0 \\
\Lambda(z) > \eta &\Rightarrow &\text{choose }H_1 
\end{cases}
$$

The critical value(s) $Z^*$ can be found by solving the equality $\Lambda(Z^*)=\eta$.
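A minimal sketch of the criterion follows, assuming Gaussian densities $N(0,1)$ under $H_0$ and $N(2,1)$ under $H_1$ and made-up costs. For this particular pair the likelihood ratio simplifies to $e^{2z-2}$, which gives the closed-form cutoff used in the code; the helper names are introduced here for illustration.

```python
import math

def bayes_threshold(c00, c01, c10, c11, p_h0):
    """eta = (c10 - c00) / (c01 - c11) * P(H0) / P(H1)."""
    return (c10 - c00) / (c01 - c11) * p_h0 / (1 - p_h0)

def gaussian_pdf(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(z):
    """Lambda(z) for the assumed densities f(z|H0) = N(0, 1), f(z|H1) = N(2, 1)."""
    return gaussian_pdf(z, 2.0, 1.0) / gaussian_pdf(z, 0.0, 1.0)

def decide(z, eta):
    """Return 1 to choose H1, 0 to choose H0."""
    return 1 if likelihood_ratio(z) > eta else 0

# Assumed costs: a miss (c01) is five times as costly as a false alarm (c10).
eta = bayes_threshold(c00=0.0, c01=5.0, c10=1.0, c11=0.0, p_h0=0.8)
z_star = 1.0 + math.log(eta) / 2.0   # solves exp(2z - 2) = eta for this Gaussian pair
print(f"eta = {eta:.3f}, z* = {z_star:.3f}")
print([decide(z, eta) for z in (0.0, 0.8, 1.0, 2.5)])
```

Raising the cost of a miss lowers $\eta$ and therefore moves the cutoff towards smaller $z$, so that $H_1$ is chosen more readily.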


> (Tidying up the lecture note)
>
> **Derivation:**
>
>Recall the cost of decision is 
>
>$$
\begin{align*}
C =&\ c_{00}P(H_0| H_0)P(H_0) + c_{11}P(H_1| H_1)P(H_1) \\[0.2em] &+ c_{10}P(H_1| H_0)P(H_0) + c_{01}P(H_0| H_1)P(H_1)
\end{align*}
$$
>
>Since the two decisions under the same true hypothesis are complementary,
>
>$$
\begin{align*}
P(H_0|H_0) + P(H_1|H_0) &= 1 \\
P(H_0|H_1) + P(H_1|H_1) &= 1
\end{align*}
$$
>
>Substitute these two equalities into $C$, keeping only the $P(H_0|\bigcirc)$ terms so that only integrals over $R_0$ are left. Then rearranging and regrouping the terms,
>
>$$
\begin{align*}
C =&\ c_{00}P(H_0| H_0)P(H_0) + c_{11}\big[1-P(H_0| H_1)\big]P(H_1) \\[0.2em] 
&+ c_{10}\big[1-P(H_0| H_0)\big]P(H_0) + c_{01}P(H_0| H_1)P(H_1) \\[0.5em]
=&\ \big[(c_{00}-c_{10})P(H_0|H_0)+c_{10}\big]P(H_0) + \big[(c_{01}-c_{11})P(H_0|H_1)+c_{11}\big]P(H_1) \\[0.5em]
=&\ \big[c_{10}P(H_0) + c_{11}P(H_1)\big] + (c_{00}-c_{10})P(H_0|H_0)P(H_0) + (c_{01}-c_{11})P(H_0|H_1)P(H_1) \\[0.5em]
=&\ \big[c_{10}P(H_0) + c_{11}P(H_1)\big] + \int_{R_0}\big[ (c_{01}-c_{11})f_Z(z|H_1)P(H_1) - (c_{10}-c_{00})f_Z(z|H_0)P(H_0) \big]\mathrm{d}z \\[0.5em]
=&\ \big[c_{10}P(H_0) + c_{11}P(H_1)\big] + \int_{R_0} \big[I_1 - I_0\big] \mathrm{d}z
\end{align*}
$$
>
>Note that a minus sign is taken out of $(c_{00}-c_{10})$ because $c_{10} > c_{00}$ in general (i.e. a wrong decision costs more). Then all the cost factors are positive. In fact, you can interpret the $c$ terms as 
>
> - $I_0 \sim (c_{10}-c_{00})$ = Potential loss by choosing $H_1$ + Potential gain by choosing $H_0$, if $H_0$ is true.
> - $I_1 \sim (c_{01}-c_{11})$ = Potential loss by choosing $H_0$ + Potential gain by choosing $H_1$, if $H_1$ is true.
>
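> So the integrals of the $I$ terms are the average losses under the two hypotheses. Since the first bracket is a constant, $C$ is minimized by putting into $R_0$ exactly those $z$ at which the integrand $I_1 - I_0$ is negative. Hence it is rational to choose 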
>
> - If $I_1 < I_0$, i.e. potentially lose less/gain more if $H_0$ is true $\Rightarrow$ choose $H_0$.
> - If $I_1 > I_0$, i.e. potentially lose less/gain more if $H_1$ is true $\Rightarrow$ choose $H_1$.
> 
> The rearrangement to these two inequalities can lead us back to the criterion in terms of $\Lambda(z)$ and $\eta$. 
> 
>$$
\Lambda(z) \equiv \frac{f_Z(z| H_1)}{f_Z(z| H_0)}
\ {\overset{H_0}<\atop\underset{H_1}>}\ 
\frac{c_{10}-c_{00}}{c_{01}-c_{11}} \frac{P(H_0)}{P(H_1)}\equiv\eta \ ,
$$
>
> **To emphasize again:** A detector is not about deciding which hypothesis is true, but only a guide for what we should choose to reduce our risk. 

### 3.2.2. Performance of Bayes detector

The detector only tells us how to minimize our risk, but no clues about which hypothesis is really true - we will still make wrong decisions even if we follow the decision from the detector (while the loss is minimized). These two probabilities give the error bounds:

- **Probability of False Alarm** (Type I error, $\alpha$): 

    The detector finds $\Lambda(z)>\eta$, so we choose $H_1$, but it turns out that $H_0$ is true. The probability of false alarm is defined as:

    $$\alpha = P(H_1|H_0) = \int_{R_1} f_Z(z|H_0) \mathrm{d}z = \int^\infty_\eta f_\Lambda(\lambda|H_0)\mathrm{d}\lambda$$ 

- **Probability of Miss** (Type II error, $\beta$): 

    The detector finds $\Lambda(z)<\eta$, so we choose $H_0$, but it turns out that $H_1$ is true. The probability of miss is defined as:

    $$\beta = P(H_0|H_1) = \int_{R_0} f_Z(z|H_1) \mathrm{d}z = \int^\eta_0 f_\Lambda(\lambda|H_1)\mathrm{d}\lambda$$

The last term on each line comes from a change of variables using $\lambda = \Lambda(z)$. For visualization, the errors can be represented as areas under the curves:

<figure style="text-align:center">


    
</figure>

On the other hand, their complements quantify the performance:

- **Probability of detection** = Found = Not missed = $1-\beta$

- **Probability of no detection** = No false alarm = $1-\alpha$

## 3.3. Alternatives to Bayes detector

We may resort to alternatives to Bayes detector when we do not know the exact cost $c_{ij}$.

### 3.3.1. Maximum a posteriori detector

Usually there is no reward for a correct decision, while the cost of a false alarm sometimes equals that of a miss. Hence we may assume $c_{00}=c_{11}=0$ and $c_{01}=c_{10}$ and reduce a Bayes detector to a maximum a posteriori detector:

$$
\frac{f_Z(z| H_1)}{f_Z(z| H_0)}
\ {\overset{H_0}<\atop\underset{H_1}>}\ 
\frac{P(H_0)}{P(H_1)}\ ,
$$

which is, after rearrangement, identical to the MAP rule that we have seen for a naive Bayes classifier. This detector helps minimize the probability of a wrong decision. 

$$
\frac{f_Z(z| H_1)P(H_1)}{f_Z(z| H_0)P(H_0) + f_Z(z| H_1)P(H_1)} 
\ {\overset{H_0}<\atop\underset{H_1}>}\ 
\frac{f_Z(z| H_0)P(H_0)}{f_Z(z| H_0)P(H_0) + f_Z(z| H_1)P(H_1)} 
$$


### 3.3.2. Neyman-Pearson detector

A miss is normally much more costly than a false alarm: while a false alarm of a missile attack creates panic, a miss kills people. Therefore, we sometimes fix a tolerable probability $\alpha$ of false alarm and, within that budget, avoid misses as much as possible. The more false alarms we tolerate, the less likely we are to miss. Practically, we let $R_1=[Z^*, +\infty)$ and solve

$$
\int_{R_1} f_Z(z| H_0)\mathrm{d}z=\alpha
$$

for $Z^*$. This is called the Neyman-Pearson detector. 

Or we may directly adopt the distribution after change of variable by $\lambda = \Lambda(z)$, and solve for a critical threshold $\eta^*$

$$
\int_{\eta^*}^{\infty} f_\Lambda(\lambda\mid H_0)\mathrm{d}\lambda=\alpha
$$
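For a concrete feel of the trade-off, the sketch below solves the first equation for $Z^*$ under an assumed Gaussian $f_Z(z|H_0)$ and reports the resulting miss probability under an assumed Gaussian $f_Z(z|H_1)$. Both distributions and the function name `neyman_pearson_cutoff` are illustrative assumptions; `NormalDist.inv_cdf` from the Python standard library does the inversion.

```python
from statistics import NormalDist

# Hypothetical densities: f(z|H0) = N(0, 1), f(z|H1) = N(2, 1).
f0 = NormalDist(0.0, 1.0)
f1 = NormalDist(2.0, 1.0)

def neyman_pearson_cutoff(alpha):
    """Solve P(Z >= z* | H0) = alpha for z*, with R_1 = [z*, +inf)."""
    return f0.inv_cdf(1 - alpha)

for alpha in (0.01, 0.05, 0.20):
    z_star = neyman_pearson_cutoff(alpha)
    beta = f1.cdf(z_star)   # probability of a miss at this cutoff
    print(f"alpha = {alpha:.2f}: z* = {z_star:.3f}, beta = {beta:.3f}")
```

The printed rows show $\beta$ shrinking as the tolerated $\alpha$ grows, which is the trade-off described above.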

#### Example: a nuclear accident

Let us consider a nuclear accident. After an earthquake, the government suspects that a nuclear plant is leaking radioactive substances and polluting the surrounding environment. Hence, the government decides whether

- $H_0$: the nuclear plant is safe, or
- $H_1$: the nuclear plant is leaking radioactive substances.

On one hand, the government is not rewarded for making a correct decision, so $c_{00} = c_{11} = 0$. On the other hand, the government will be blamed by either the nuclear industry or environmental NGOs for making a wrong decision. Assume the two costs are equal, i.e. $c_{01}=c_{10}$.

Now the government measures the radioactivity of soil near the plant and gets a reading $Z=110$ units. Meanwhile, the government's prior beliefs are $P(H_0)=0.8$ and $P(H_1)=1-P(H_0)=0.2$.

Suppose the government laboratory has previously found that measurements of radioactivity in a non-radioactive / radioactive environment follow the distributions: 

$$
\begin{cases}
f_Z(z| H_0)&=&
\frac{5}{\sqrt{2\pi}\cdot 80}\exp{\left[-\frac{25}2\left(\frac{z-90}{80}\right)^2\right]} \\[0.5em]
f_Z(z|H_1)&=&
\frac{2}{80}\exp{\left(-\frac{4}{80}|z-110|\right)}
\end{cases}
$$

Note that the distribution $f_Z(z|H_1)$ peaks at $z=110$. Should the government declare that there is a leak?

**Solution.** The likelihood ratio is computed by

$$
\begin{align*}
\Lambda(z) &=
\frac{\frac{2}{80} \exp{\left(-\frac{4}{80}|z-110|\right)}}
{\frac{5}{\sqrt{2\pi}\cdot 80}\exp{\left[-\frac{25}{2}\left(\frac{z-90}{80}\right)^2\right]}}\\[0.5em]
&=\frac{2}{5}\sqrt{2\pi}\exp{\left[\frac{\left(z-90\right)^2}{512}-\frac{|z-110|}{20}\right]}\,,
\end{align*}
$$

Solving $\Lambda(z^*)=\eta=\frac{P(H_0)}{P(H_1)}=4$ by hand is possible but tedious. Numerical methods find two solutions, $z^*\approx 40$ and $z^*\approx 122$. (If you are patient with the algebra, you can get their analytic forms.) Two solutions are to be expected because the exponent contains both a $z^2$ term and a $|z|$ term. 

How do we interpret them then? In fact, both solutions are valid and necessary. By testing with numbers like $z\in\{0,80,160\}$, you can find that 

$$
\Lambda(z)\begin{cases}
< 4 & (40<z<122) \\
> 4 & (\text{otherwise})
\end{cases}.
$$ 

However, if it is somehow known that $z$ almost never drops below $40$ (e.g. presence of background radioactivity), the government could just simplify the detector as

$$z \ {\overset{H_0}<\atop\underset{H_1}>}\ 122.$$

So for a reading $Z=110$, the government should prefer $H_0$, i.e. declare that there is no leak.
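The two critical values quoted above can be reproduced with a few lines of root finding. This is only a sketch: the bisection helper `bisect` is written here for illustration, and the bracketing intervals are picked by checking the sign of $\Lambda(z)-4$ at the test points $z\in\{0,80,160\}$ and at $z=110$.

```python
import math

def likelihood_ratio(z):
    """Lambda(z) from the two densities of the nuclear-accident example."""
    return 0.4 * math.sqrt(2 * math.pi) * math.exp((z - 90) ** 2 / 512 - abs(z - 110) / 20)

def bisect(g, lo, hi, tol=1e-10):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) differ in sign."""
    g_lo = g(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (g(mid) > 0) == (g_lo > 0):
            lo = mid    # root lies in the upper half
        else:
            hi = mid    # root lies in the lower half
    return 0.5 * (lo + hi)

ETA = 4.0  # P(H0) / P(H1) = 0.8 / 0.2, with c00 = c11 = 0 and c01 = c10

def excess(z):
    return likelihood_ratio(z) - ETA

z_low = bisect(excess, 0.0, 80.0)      # Lambda > 4 at z = 0, < 4 at z = 80
z_high = bisect(excess, 110.0, 160.0)  # Lambda < 4 at z = 110, > 4 at z = 160
print(f"z* = {z_low:.2f} and {z_high:.2f}")
print(f"Lambda(110) = {likelihood_ratio(110):.3f}, choose H1: {likelihood_ratio(110) > ETA}")
```

Evaluating $\Lambda(110)$ directly, as in the last line, is all the decision itself requires; the roots are only needed to describe the regions $R_0$ and $R_1$.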