**Exercise:** Whenever you survey people about sensitive issues, you have to deal with [social desirability bias](https://en.wikipedia.org/wiki/Social_desirability_bias), which is the tendency of people to adjust their answers to show themselves in the most positive light.
One way to improve the accuracy of the results is [randomized response](https://en.wikipedia.org/wiki/Randomized_response).

As an example, suppose you want to know how many people cheat on their taxes.
If you ask them directly, it is likely that some of the cheaters will lie.
You can get a more accurate estimate if you ask them indirectly, like this: Ask each person to flip a coin and, without revealing the outcome,

* If they get heads, they report YES.

* If they get tails, they honestly answer the question "Do you cheat on your taxes?"

If someone says YES, we don't know whether they actually cheat on their taxes; they might have flipped heads.
Knowing this, people might be more willing to answer honestly.
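
To make the procedure concrete, here is a minimal simulation sketch. It assumes every respondent follows the protocol and answers honestly on tails; the function name `simulate_survey` and its parameters are hypothetical and not part of the exercise.

```python
import numpy as np

def simulate_survey(true_fraction, n_people, seed=0):
    """Simulate randomized-response answers; returns a boolean array (True = YES)."""
    rng = np.random.default_rng(seed)
    cheats = rng.random(n_people) < true_fraction  # hidden truth for each person
    heads = rng.random(n_people) < 0.5             # private coin flip
    # Heads forces a YES; tails gives the honest answer.
    return heads | cheats

answers = simulate_survey(true_fraction=0.6, n_people=100_000)
yes_rate = answers.mean()
# P(YES) = 0.5 + 0.5 * fraction, so the fraction can be recovered by inverting it.
estimate = 2 * yes_rate - 1
print(f"YES rate: {yes_rate:.3f}, estimated fraction of cheaters: {estimate:.3f}")
```

With a large simulated sample, the YES rate should land close to 0.5 + 0.5 × 0.6 = 0.8, which is the same YES rate as in the survey described in this exercise.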

Suppose you survey 100 people this way and get 80 YESes and 20 NOs.  Based on this data, what is the posterior distribution for the fraction of people who cheat on their taxes?  What is the most likely quantity in the posterior distribution?

## First approach

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import binom
import seaborn as sns

# Build a Bayes table. The first update uses the survey result (80 YES out of
# 100); the resulting distribution then acts as the prior for a second update.
hypos = np.linspace(0, 1, 101)  # candidate fractions of people who cheat
bt = pd.DataFrame({
    'hypos': hypos
})
positive, trials = 80, 100
bt['likelihood'] = binom.pmf(positive, trials, hypos)
bt['conjunction'] = bt.hypos * bt.likelihood
bt['prior'] = bt.conjunction / bt.conjunction.sum()

# The coin lands heads or tails with probability .5, so a second likelihood
# is used to reflect that. With 100 people, this assumes that 50 of them got
# tails and the other 50 got heads.
bt['flip_likelihood'] = binom.pmf(50, 100, hypos)
bt['flip_conjunction'] = bt.prior * bt.flip_likelihood
bt['posterior'] = bt.flip_conjunction / bt.flip_conjunction.sum()

# Get the MAP
max_prob = bt.posterior.max()
MAP = bt[bt.posterior == max_prob].hypos.values[0]
print(f'The most probable fraction of people who cheat on their taxes is {MAP}')
print(f'However, the probability of exactly that fraction is {round(max_prob, 2)}')

# Plot the results
sns.lineplot(x=bt.hypos, y=bt.prior, label='prior');
sns.lineplot(x=bt.hypos, y=bt.posterior, label='posterior');

### Getting 95% of the probability
As practice, I calculated the bounds that contain 95% of the probability.

In [None]:
# Find the interval centered on .65 that contains 95% of the probability.
# First, store the CDF of the posterior.
bt['cdf'] = bt.posterior.cumsum()

# Accumulate the posterior from .65 upward and locate the index where it
# reaches .475. Only half of the probability at .65 counts here, because the
# other half belongs to the lower side of the interval.
over_65 = bt.posterior[bt.hypos >= .65]
over_65[65] /= 2 
over_65 = over_65.cumsum()
over_65_index = over_65[over_65 >= .475].index[0]

# now get the distance between .65 index and the .475 index
center_index = bt[bt.hypos == .65].index
d = (over_65_index - center_index).values[0]

# finally calculate the cdf between the indices we got
lower_bound = center_index.values[0] - d
upper_bound = center_index.values[0] + d
limits = (
    (bt.index >= lower_bound) &
    (bt.index <= upper_bound)
)
# Round after scaling to a percentage so float error cannot truncate the value
cum_prob = int(round(bt[limits].posterior.sum() * 100))
lower_prob = bt.at[lower_bound, 'hypos']
upper_prob = bt.at[upper_bound, 'hypos']
print(
    f'The range from {lower_prob} '
    f'to {upper_prob:2.2} accounts for {cum_prob}% of the probability')
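
An alternative that avoids picking a center by hand is an equal-tailed interval read directly off the CDF. The sketch below is one possible way to do it, not the method used above; the helper name `credible_interval` is hypothetical, and it assumes the posterior is already normalized so that the CDF reaches 1.

```python
import numpy as np

def credible_interval(hypos, posterior, level=0.95):
    """Equal-tailed interval from a normalized discrete posterior."""
    cdf = np.cumsum(posterior)
    tail = (1 - level) / 2
    # First grid point at which the CDF reaches each tail probability.
    lower = hypos[np.searchsorted(cdf, tail)]
    upper = hypos[np.searchsorted(cdf, 1 - tail)]
    return lower, upper

# Toy check on a flat posterior over 101 grid points.
grid = np.linspace(0, 1, 101)
flat = np.full(grid.size, 1 / grid.size)
low, high = credible_interval(grid, flat)
print(f"95% equal-tailed interval: {low:.2f} to {high:.2f}")
```

For the table in this notebook, the call would be `credible_interval(bt.hypos.values, bt.posterior.values)`. Note that an equal-tailed interval is generally not symmetric around the MAP, so it can differ from the interval computed above.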

### Conclusion
Although the result is roughly accurate, this approach is not quite right: it builds a prior out of the responses and then updates it with the likelihood of the coin flips. Because it updates twice, it introduces extra information, which makes the credible interval narrower than it should be and therefore somewhat misleading.

## Solution approach
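What each person reports, depending on the coin flip:

| |heads|tails|
|---|---|---|
|cheater|Y|Y|
|not cheater|Y|N|

Probability of each response under three hypotheses:

| |Y|N|
|---|---|---|
|all cheaters|1|0|
|half cheaters|.75|.25|
|no cheaters|.5|.5|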
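
The rows of the probability table follow from one formula: heads forces a Y, and tails yields a Y only from cheaters. As a small sketch (the function name `prob_yes` and the `p_heads` parameter are hypothetical, and a fair coin is assumed by default):

```python
def prob_yes(fraction, p_heads=0.5):
    """Probability of a Y response when `fraction` of people cheat."""
    # Heads -> Y for everyone; tails -> Y only from cheaters.
    return p_heads + (1 - p_heads) * fraction

rows = [("all cheaters", 1.0), ("half cheaters", 0.5), ("no cheaters", 0.0)]
for label, fraction in rows:
    p = prob_yes(fraction)
    print(f"{label:>13}: P(Y) = {p:.2f}, P(N) = {1 - p:.2f}")
```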

In [None]:
# The hypothesis space represents the fraction of people who cheat on their
# taxes, where 0 is nobody and 1 is everyone. Using 101 points gives a grid
# step of exactly .01, matching the grid of the first approach.
hs = np.linspace(0, 1, 101)

outcomes = 80 * 'Y' + 20 * 'N'

# Likelihood of each response as a function of the fraction of cheaters (hs)
likes = {
    # Likelihood of a Y at the two extremes:
    #   * If everyone cheats (hs=1), the likelihood is 1: heads forces a Y,
    #     and tails requires the truth, which is also Y. A cheater therefore
    #     always reports Y, whatever the coin shows.
    #   * If nobody cheats (hs=0), the likelihood is .5: heads forces a Y,
    #     and tails requires the truth, which is N. The response then depends
    #     only on the coin.
    'Y': .5 + .5 * hs,

    # Likelihood of an N at the two extremes:
    #   * If everyone cheats (hs=1), the likelihood is 0: heads forces a Y
    #     and tails yields an honest Y, so a cheater never reports N.
    #   * If nobody cheats (hs=0), the likelihood is .5: heads forces a Y
    #     and tails yields an honest N, so the response again depends only
    #     on the coin.
    'N': (1-hs) / 2
}
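# Start from a prior proportional to hs itself (not a uniform prior), then
# update once per response. After the loop, `prior` holds the unnormalized
# posterior.
prior = hs.copy()

for r in outcomes:
    prior *= likes[r]

posterior = prior / prior.sum()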
sns.lineplot(x=hs, y=posterior);
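
The exercise also asks for the most likely value. The sketch below computes it with a single binomial likelihood instead of a loop over responses; the two are equivalent up to a constant factor that normalization removes. It assumes a uniform prior, which is a choice of this sketch (the cell above starts from `hs.copy()` instead), and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

hs = np.linspace(0, 1, 101)        # candidate fractions of cheaters
p_yes = 0.5 + 0.5 * hs             # probability of a Y under each hypothesis

uniform_prior = np.ones_like(hs)   # assumption: every fraction equally likely
likelihood = binom.pmf(80, 100, p_yes)  # 80 Y responses out of 100
posterior = uniform_prior * likelihood
posterior /= posterior.sum()

map_fraction = hs[np.argmax(posterior)]
print(f"MAP fraction of cheaters: {map_fraction:.2f}")
```

Under the uniform prior, the peak sits where the predicted YES rate matches the observed 80%, that is, at a cheating fraction of 0.6. A prior that favors larger fractions, such as the one used in the cell above, can shift the peak slightly upward.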