In [None]:
import numpy as np
import pandas as pd

## Gluten Sensitivity

In 2015 I read a paper that tested whether people diagnosed with gluten sensitivity (but not celiac disease) were able to distinguish gluten flour from non-gluten flour in a blind challenge
([you can read the paper here](https://onlinelibrary.wiley.com/doi/full/10.1111/apt.13372)).

Out of 35 subjects, 12 correctly identified the gluten flour based on
resumption of symptoms while they were eating it. Another 17 wrongly
identified the gluten-free flour based on their symptoms, and 6 were
unable to distinguish.

The authors conclude, "Double-blind gluten challenge induces symptom
recurrence in just one-third of patients."

This conclusion seems odd to me, because if none of the patients were
sensitive to gluten, we would expect some of them to identify the gluten flour by chance. 
So here's the question: based on this data, how many of the subjects are sensitive to gluten and how many are guessing?

We can use Bayes's Theorem to answer this question, but first we have to make some modeling decisions. I'll assume:

-   People who are sensitive to gluten have a 95% chance of correctly
    identifying gluten flour under the challenge conditions, and

-   People who are not sensitive have a 40% chance of identifying the
    gluten flour by chance (and a 60% chance of either choosing the
    other flour or failing to distinguish).

These particular values are arbitrary, but the results are not sensitive to these choices.
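
As a quick sanity check on these assumptions, we can compute how many subjects we would expect to identify the gluten flour for a given number of sensitive subjects. The sketch below uses only the figures stated above (35 subjects, 95%, and 40%); the helper name `expected_identified` is hypothetical, introduced here for illustration.

```python
# Values taken from the modeling assumptions above
N_SUBJECTS = 35
P_SENSITIVE = 0.95  # chance a sensitive subject identifies the gluten flour
P_GUESSING = 0.40   # chance a non-sensitive subject identifies it by chance


def expected_identified(num_sensitive):
    """Expected number of correct identifications (hypothetical helper)."""
    num_guessing = N_SUBJECTS - num_sensitive
    return P_SENSITIVE * num_sensitive + P_GUESSING * num_guessing


for num_sensitive in (0, 10, 20, 35):
    print(num_sensitive, round(expected_identified(num_sensitive), 2))
```

With no sensitive subjects at all, these assumptions already predict 0.4 × 35 = 14 correct identifications, slightly more than the 12 observed.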

I will solve this problem in two steps. First, assuming that we know how many subjects are sensitive, I will compute the distribution of the data. 
Then, using the likelihood of the data, I will compute the posterior distribution of the number of sensitive patients.

The first is the **forward problem**; the second is the **inverse problem**.
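
To make the two steps concrete, here is a minimal, dependency-free sketch of both. It is not the approach used in the cells that follow, which rely on `scipy` and `empiricaldist`. The function names (`binom_pmf`, `convolve`, `forward`, `inverse`) are hypothetical, and the uniform prior over 0 to 35 sensitive subjects is an assumption of this sketch.

```python
from math import comb


def binom_pmf(n, p):
    """Binomial PMF as a list: entry k is P(k successes out of n)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def convolve(pmf_a, pmf_b):
    """PMF of the sum of two independent counts."""
    out = [0.0] * (len(pmf_a) + len(pmf_b) - 1)
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            out[i + j] += pa * pb
    return out


def forward(num_sensitive, n=35, p_sensitive=0.95, p_guessing=0.40):
    """Forward problem: distribution of how many subjects identify gluten."""
    return convolve(
        binom_pmf(num_sensitive, p_sensitive),
        binom_pmf(n - num_sensitive, p_guessing),
    )


def inverse(num_identified, n=35):
    """Inverse problem: posterior over the number of sensitive subjects.

    Assumes a uniform prior, so the posterior is the normalized likelihood.
    """
    likelihood = [forward(s, n)[num_identified] for s in range(n + 1)]
    total = sum(likelihood)
    return [like / total for like in likelihood]


posterior = inverse(12)
# Most probable number of sensitive subjects under this sketch
print(max(range(36), key=lambda s: posterior[s]))
```

The forward step adds two independent binomial counts, one for the sensitive subjects and one for the guessers; the inverse step reads off the probability of the observed count under each hypothesis and normalizes.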

In [None]:
## Forward problem
# Create two distributions for a population of 35 where 10 are sensitive and
# the others are guessing


# Sum both distributions and plot them


In [None]:
# Extend the previous method to cover all possible hypotheses ranging from 0 to 35


# Plot a few different hypotheses


In [None]:
## Inverse problem
# Compute the posterior for the observed data: 12 subjects identified the gluten flour


# Compute the posterior for another possible outcome (like 20 identified) to compare


# Plot the comparisons


# Compute MAPs for each of the posteriors


## Solution

In [None]:
# Create two distributions for a population of 35 where 10 are sensitive and
# the others are guessing
from scipy.stats import binom
from empiricaldist import Pmf
import matplotlib.pyplot as plt

sensitive = 10  # number of sensitive subjects
non_sensitive = 35 - sensitive

# Sensitive subjects who identify the gluten flour: 0 through 10, each with p = 0.95
qs_s = np.arange(sensitive + 1)
dist_s = Pmf(binom.pmf(qs_s, n=sensitive, p=0.95), qs_s)

# Non-sensitive subjects who identify it by chance: 0 through 25, each with p = 0.4
qs_ns = np.arange(non_sensitive + 1)
dist_ns = Pmf(binom.pmf(qs_ns, n=non_sensitive, p=0.4), qs_ns)

# Sum both distributions and plot them
dist_sum = dist_s.add_dist(dist_ns)
dist_sum.plot(label="10 sensitive")
plt.legend();

In [None]:
# Extend the previous method to cover all possible hypotheses ranging from 0 to 35
df = pd.DataFrame()
for sensitive in range(36):
    non_sensitive = 35 - sensitive
    
    # Create distributions
    dist_s = binom.pmf(np.arange(sensitive+1), n=sensitive, p=.95)
    dist_ns = binom.pmf(np.arange(non_sensitive+1), n=non_sensitive, p=.4)
    
    # Create Pmfs
    dist_s = Pmf(dist_s, np.arange(sensitive+1))
    dist_ns = Pmf(dist_ns, np.arange(non_sensitive+1))
    
    # Sum Pmfs
    dist_sum = dist_s.add_dist(dist_ns)
    
    # Add to df
    df[sensitive] = dist_sum
df.head(5)

In [None]:
# Plot a few different hypotheses
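import matplotlib.pyplot as plt

# Each column of df is the distribution of the data under one hypothesis
df[10].plot(label="10 sensitive")
df[20].plot(label="20 sensitive")
df[30].plot(label="30 sensitive")
plt.legend();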

In [None]:
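# Compute the posterior for the observed data: 12 subjects identified the gluten flour.
# Row 12 of df is the likelihood of that outcome under each hypothesis (0 to 35 sensitive).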
prior = Pmf(1, np.arange(36))
posterior1 = prior * df.loc[12, :]
posterior1.normalize()

In [None]:
# Compute the posterior for another possible outcome (like 20 identified) to compare
posterior2 = prior * df.loc[20, :]
posterior2.normalize()

In [None]:
# Plot the comparisons
import matplotlib.pyplot as plt

posterior1.plot(label="12 identified")
posterior2.plot(label="20 identified")
plt.legend();

In [None]:
# Compute MAPs for each of the posteriors
posterior1.max_prob(), posterior2.max_prob()