Function to compute iso_probability_contours for given MCMCSamples or NestedSamples object #178
I just copied and adjusted some code from
Hi Stefan, for completeness it's probably good to mention that there is of course the option of simply calculating the standard deviation. Beyond that, we need to specify exactly what we mean by confidence intervals (let's go with 68% confidence intervals in my example plots with a skewed Gaussian), where I see two possible interpretations:
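For concreteness, here is a minimal sketch (my addition, assuming the two interpretations are the equal-tailed quantile interval versus the iso-probability / highest-density interval) for the skew-normal example used later in this thread. For a symmetric Gaussian the two coincide; the skewness is what pulls them apart.

```python
# Illustration (not from the thread): two readings of a "68% interval"
# for a skew-normal distribution, computed directly with scipy.
import scipy.stats as sst
import scipy.optimize as sop

dist = sst.skewnorm(a=5)
level = 0.68

# (1) equal-tailed quantile interval: 16th to 84th percentile
equal_tailed = dist.ppf([(1 - level) / 2, (1 + level) / 2])

# (2) iso-probability (highest-density) interval: the shortest interval containing
#     68% of the probability, found by minimising its width over the lower quantile
width = lambda lo: dist.ppf(lo + level) - dist.ppf(lo)
res = sop.minimize_scalar(width, bounds=(0, 1 - level), method='bounded')
iso_probability = dist.ppf([res.x, res.x + level])

print(equal_tailed, iso_probability)  # the two intervals differ for a skewed PDF
```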
Hi Lukas, thanks for the nice explanations! I actually didn't know that there is such an option. Yes, I meant iso-probability intervals. I talked about that with @williamjameshandley recently and realized that quantiles are not very appropriate for the skewed distributions I was working with. Actually, I thought that there are three common options.
But looking at your plot now I realize that the 2nd and 3rd option should be identical. I think about it like this: the 68% interval can only be moved by moving both bounds to the left, or both to the right. However, as the PDF has the same height at both points, when moving the lines, say, to the right, the left line will move into a higher and the right one into a lower PDF region, and therefore move less x-distance. Therefore the x-distance between the lines grows. I'm not 100% sure if I am overlooking something, but if not, would this be a way to compute iso-probability confidence levels (for non-multimodal distributions*) without relying on KDEs or histograms? We could simply solve the optimization problem of finding A such that B(A)-A is minimal, where B(A) is defined such that the interval [A,B] contains 68% of the samples/weights. Edit: I'll test this later and see if it works :)

*For multi-modal peaks, 2nd==3rd should work similarly based on the KDE, but the method I proposed is based on (noisy) samples, which makes multi-modal peaks hard to define in general.
Here is an example showing the prior-volume minimizing contour, and it looks like the iso-probability contour for this skewed distribution:

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sst
import scipy.integrate as sin
import scipy.optimize as sop

def b_of_a(a, f, level):
    # upper bound b such that the PDF f integrates to `level` over [a, b]
    opt = lambda b: level - sin.quad(f, a, b)[0]
    try:
        b = sop.bisect(opt, -4, 4)
        return b
    except ValueError:
        return np.inf

def confidence_interval(pdf, level=0.68):
    # minimise the interval width b(a) - a over the lower bound a
    mini = lambda a: b_of_a(a, pdf, level) - a
    a = sop.minimize(mini, -4)['x'][0]
    b = b_of_a(a, pdf, level)
    return [a, b]

f = lambda x: sst.skewnorm.pdf(x, a=5)
cl = confidence_interval(f)
interval = np.linspace(*cl, 100)
plt.fill_between(interval, f(interval))
x = np.linspace(-4, 4, 1000)
plt.plot(x, f(x))
plt.show()
```

And here the same method, based on samples (without using the PDF):

```python
samples = np.random.uniform(-4, 4, size=100000)
weights = f(samples)

def b_of_a_samples(a, samples, weights, level, bmax):
    # upper bound b such that the weighted fraction of samples in [a, b] equals `level`
    norm = np.sum(weights)
    posterior_volume_target = lambda b: level - np.sum(weights[np.logical_and(samples < b, a < samples)]) / norm
    try:
        b = sop.bisect(posterior_volume_target, a, bmax)
        return b
    except ValueError:
        return np.inf

def CL(samples, weights, level=0.68, xmin=-4, xmax=4):
    # brute-force scan for the lower bound a that minimises the interval width
    prior_volume = lambda a: b_of_a_samples(a, samples, weights, level, xmax) - a
    xtest = np.linspace(xmin, xmax, 1000)
    arg = np.argmin([prior_volume(x) for x in xtest])
    return xtest[arg], b_of_a_samples(xtest[arg], samples, weights, level, xmax)

a, b = CL(samples, weights)
plt.hist(samples, weights=weights, density=True, bins=1000, range=(-4, 4), label='Samples')
plt.axvline(a, color="red", label='68% CL')
plt.axvline(b, color="red")
plt.plot(x, f(x), label='True PDF')
plt.legend()
plt.show()
```

Both are horribly inefficient of course (I think computing and interpolating a CDF from the samples would be much more efficient); and the PDF looks like a simple KDE or "fit to the histogram" would have worked better. But for this method you wouldn't have to choose any bandwidth or bin size, just count samples, so I might try it out with some real-world data later.

Edit: Much faster and nicer version:

```python
import scipy.interpolate as sip
def fastCL(samples, weights, level=0.68):
    # Sort and normalize
    order = np.argsort(samples)
    samples = samples[order]
    weights = weights[order] / np.sum(weights)
    # Compute inverse cumulative distribution function
    CDF = np.append(np.insert(np.cumsum(weights), 0, 0), 1)
    S = np.array([-np.inf, *samples, np.inf])
    #cdf = sip.interp1d(S, CDF)
    invcdf = sip.interp1d(CDF, S)
    # Find smallest interval
    distance = lambda a, level=level: invcdf(a + level) - invcdf(a)
    a = sop.minimize(distance, (1 - level) / 2, bounds=[(0, 1 - level)], method="Nelder-Mead").x[0]
    interval = np.array([invcdf(a), invcdf(a + level)])
    return interval

fastCL(samples, weights)
```

Edit: Small problem here: `invcdf(0)` goes to infinity due to the interpolation bounds; `invcdf(0)` should be `min(x)`:

```python
import scipy.interpolate as sip
def fastCL(samples, weights, level=0.68):
    # Sort and normalize
    order = np.argsort(samples)
    samples = samples[order]
    weights = weights[order] / np.sum(weights)
    # Compute inverse cumulative distribution function
    CDF = np.append(np.insert(np.cumsum(weights), 0, 0), 1)
    S = np.array([np.min(samples), *samples, np.max(samples)])  # invcdf(0) = min(x), invcdf(1) = max(x)
    #cdf = sip.interp1d(S, CDF)
    invcdf = sip.interp1d(CDF, S)
    # Find smallest interval
    distance = lambda a, level=level: invcdf(a + level) - invcdf(a)
    a = sop.minimize(distance, (1 - level) / 2, bounds=[(0, 1 - level)], method="Nelder-Mead").x[0]
    interval = np.array([invcdf(a), invcdf(a + level)])
    return interval

fastCL(samples, weights)
```
Hi Stefan,

EDIT: The perceived lower accuracy for smaller quantiles appears to come down to insufficient samples in the high probability region. Anyhow, for future lookup and completeness I wanted to add this:

Approach making use of the KDE plotting data

The bottleneck in this is the computation of the high accuracy KDE (i.e. high `ncompress`):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize
from scipy import interpolate
from scipy import integrate
from anesthetic import MCMCSamples
# using your example data
# -----------------------
f = lambda x: stats.skewnorm.pdf(x, a=5)
x = np.linspace(-4, 4, 1000)
samples = np.random.uniform(-4, 4, size=100000)
weights = f(samples)
mcmc = MCMCSamples(data=samples, weights=weights, columns=['f'])
# generate the KDE data with anesthetic's `plot_1d` function
# ----------------------------------------------------------
# some remarks:
# * `q=1` ensures that the full data-range is being used
# * `density=True` normalises the posterior (default would be that maximum is 1)
# * `ncompress=100000`: set to sth lower to save time, then increase for accuracy
fig, axes = mcmc.plot_1d(['f'], q=1, density=True, ncompress=100000, label='anesthetic')
# interpolate the KDE data to get an effective PDF
# ------------------------------------------------
pdf_kde = interpolate.interp1d(axes.iloc[0].lines[0].get_xdata(),
                               axes.iloc[0].lines[0].get_ydata(),
                               kind='cubic', bounds_error=False, fill_value=0)

# Compute iso-probability bounds
# ------------------------------
def root_func_for_lower_credibility_bound(z, pdf, ppf, q=0.68):
    # zero when the PDF has equal height at the lower and upper bound
    return pdf(ppf(z)) - pdf(ppf(z + q))
def get_iso_proba_bounds(pdf, q):
    """
    Compute the lower and upper iso-probability bounds that enclose the quantile `q`.

    Parameters
    ----------
    pdf : scipy.interpolate.interpolate.interp1d
        Probability density function of the posterior that you can get from the KDE.
    q : float
        Quantile that should be enclosed by the returned iso-probability bounds.

    Returns
    -------
    lower, upper : (float, float)
        Lower and upper iso-probability bounds given a `pdf` and a quantile `q`.
    """
    if q < 0.5:
        q = 1 - q
    # integrate the PDF to get the CDF and interpolate to invert and get the PPF
    sol = integrate.solve_ivp(lambda t, y: pdf(t),
                              t_span=(pdf.x[0], pdf.x[-1]),
                              y0=[0],
                              t_eval=pdf.x)
    cdf = sol.y[0] / sol.y[0][-1]
    ppf = interpolate.interp1d(cdf, sol.t, kind='cubic')
    # run root finder
    o = optimize.root_scalar(f=root_func_for_lower_credibility_bound,
                             args=(pdf, ppf, q),
                             bracket=[1e-5, 1 - q - 1e-5])
    if o.converged:
        return ppf(o.root), ppf(o.root + q)
    else:
        return o
b68 = get_iso_proba_bounds(pdf_kde, 0.68)
b95 = get_iso_proba_bounds(pdf_kde, 0.95)

# Plot results
# ------------
plt.figure()
plt.hist(samples, weights=weights, density=True, bins=1000, range=(-4, 4),
color='k', alpha=0.3, label='Samples')
plt.plot(x, f(x), c='k', label='true PDF')
plt.plot(x, pdf_kde(x), c='C2', label='anesthetic KDE')
# 68% iso-probability bounds
plt.axvline(b68[0], color='C1', ls='-', label='68% lower bound')
plt.axvline(b68[1], color='C4', ls=':', label='68% upper bound')
plt.axhline(f(b68[0]), color='C1', ls='-')
plt.axhline(f(b68[1]), color='C4', ls=':')
# 95% iso-probability bounds
plt.axvline(b95[0], color='C3', ls='-', label='95% lower bound')
plt.axvline(b95[1], color='C0', ls=':', label='95% upper bound')
plt.axhline(f(b95[0]), color='C3', ls='-')
plt.axhline(f(b95[1]), color='C0', ls=':')
plt.legend(loc='upper left')
plt.show()
```
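As a quick sanity check (an addition of mine, not part of the comment above), the defining property of the returned bounds can be verified directly from `pdf_kde` and `b68`: the KDE should have (nearly) equal height at both bounds, and the enclosed mass should match the level.

```python
# Sanity check: equal PDF height at the bounds and ~68% enclosed mass
print(pdf_kde(b68[0]), pdf_kde(b68[1]))            # should be approximately equal
mass, _ = integrate.quad(pdf_kde, b68[0], b68[1])
print(mass)                                        # should be close to 0.68 (up to KDE normalisation)
```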
Thanks for this @lukashergt, good to note these; I wasn't familiar with those.
Edit: Based on this I wrote a `bestFit` function which might be of interest:

```python
def bestFit(data, key, ncompress=10000, plot=False):
    fig, axes = data.plot_1d([key], q=1, density=True, label='anesthetic', ncompress=ncompress)
    pdf_kde = sip.interp1d(axes.iloc[0].lines[0].get_xdata(),
                           axes.iloc[0].lines[0].get_ydata(),
                           kind='cubic', bounds_error=False, fill_value=0)
    if plot:
        axes[key].hist(data[key], weights=data.weights, density=True, label='anesthetic', range=(0, 0.1), bins=30)
        plt.show()
    else:
        fig.clear()
        plt.close(fig)
    # fid: externally defined dict of fiducial starting values for the minimiser
    return sop.minimize(lambda x: -pdf_kde(x), fid[key], method="Nelder-Mead").x[0]
```

When I have time I'll make a PR for this.

Edit: Updated version to deal with duplicated points in line data (seen with Planck chains):

```python
def bestFit(data, key, ncompress=10000, plot=False, verbose=True):
    fig, axes = data.plot_1d([key], q=1, density=True, label='anesthetic', ncompress=ncompress)
    x = axes.iloc[0].lines[0].get_xdata()
    y = axes.iloc[0].lines[0].get_ydata()
    # drop duplicated x-values, which break the cubic interpolation
    mask = [*(np.diff(x) == 0), False]
    if np.sum(mask) > 0:
        if verbose:
            print("bestFit: Removing", np.sum(mask), "duplicates from kde interpolation function")
        indices = np.where([*(np.diff(x) == 0), False])[0]
        assert np.all(y[indices] == y[indices + 1])
        x = x[np.logical_not(mask)]
        y = y[np.logical_not(mask)]
    pdf_kde = sip.interp1d(x, y, kind='cubic', bounds_error=False, fill_value=0)
    if plot:
        axes[key].hist(data[key], weights=data.weights, density=True, label='anesthetic', range=data.limits[key], bins=50)
    else:
        fig.clear()
        plt.close(fig)
    # fid: externally defined dict of fiducial starting values for the minimiser
    return sop.minimize(lambda x: -pdf_kde(x), fid[key], method="Nelder-Mead").x[0]
```
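For reference, a hypothetical usage sketch (my addition, reusing the `mcmc` object from the KDE example above; `fid` is an assumed dict of starting guesses, here just the weighted sample mean):

```python
# Assumed usage: `fid` provides the starting point for the Nelder-Mead search,
# taken as the weighted mean of the 'f' column of the earlier MCMCSamples object.
fid = {'f': np.average(mcmc['f'], weights=mcmc.weights)}
mode_estimate = bestFit(mcmc, 'f', ncompress=10000)
print(mode_estimate)
```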
I will make a new PR replacing the complicated implementation in #179 with this.

Do we also want my hacky version in there?

Edit: Changelog

```python
import scipy.optimize as sop
import scipy.interpolate as sip

def fastCL(samples, weights, level=0.68):
    # Sort and normalize
    order = np.argsort(samples)
    samples = samples[order]
    weights = weights[order] / np.sum(weights)
    # Compute inverse cumulative distribution function
    CDF = np.append(np.insert(np.cumsum(weights), 0, 0), 1)
    S = np.array([np.min(samples), *samples, np.max(samples)])
    #cdf = sip.interp1d(S, CDF)
    invcdf = sip.interp1d(CDF, S)
    # Find smallest interval
    distance = lambda a, level=level: invcdf(a + level) - invcdf(a)
    a = sop.minimize(distance, (1 - level) / 2, bounds=[(0, 1 - level)], method="SLSQP").x[0]
    interval = np.array([invcdf(a), invcdf(a + level)])
    return interval

fastCL(samples, weights)
```

Edit: Slightly neater version, planning to put this into a PR:

```python
import scipy.optimize as sop
import scipy.interpolate as sip

def fastCL(samples, weights=None, level=0.68, method="iso-probability"):
    # method: one of "iso-probability", "lower-limit", "upper-limit"
    assert level < 1, "Level >= 1!"
    weights = np.ones(len(samples)) if weights is None else weights
    # Sort and normalize
    order = np.argsort(samples)
    samples = samples[order]
    weights = weights[order] / np.sum(weights)
    # Compute inverse cumulative distribution function
    CDF = np.append(np.insert(np.cumsum(weights), 0, 0), 1)
    S = np.array([np.min(samples), *samples, np.max(samples)])
    invcdf = sip.interp1d(CDF, S)
    if method == "iso-probability":
        # Find smallest interval
        distance = lambda a, level=level: invcdf(a + level) - invcdf(a)
        a = sop.minimize_scalar(distance, bounds=(0, 1 - level), method="Bounded")
        interval = np.array([invcdf(a.x), invcdf(a.x + level)])
    elif method == "lower-limit":
        # Get value from which we reach the desired level
        interval = invcdf(1 - level)
    elif method == "upper-limit":
        # Get value to which we reach the desired level
        interval = invcdf(level)
    else:
        assert False, method
    return interval

fastCL(samples, weights)
```
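For illustration (my addition), calling it on the skew-normal `samples` and `weights` from the earlier comments:

```python
# Three interval types from the same weighted samples
print(fastCL(samples, weights, level=0.68, method="iso-probability"))  # shortest 68% interval
print(fastCL(samples, weights, level=0.95, method="lower-limit"))      # one-sided 95% lower limit
print(fastCL(samples, weights, level=0.95, method="upper-limit"))      # one-sided 95% upper limit
```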
Is your feature request related to a problem? Please describe.
It is not trivial how to compute confidence intervals from `MCMCSamples` or `NestedSamples`, even though we have this functionality in the 1d plots (with `facecolor=True`). There is a function `iso_probability_contours_from_samples` which sounds like it should do that (but does not seem to); the documentation does not explain much and the source code looks mostly like `iso_probability_contours`.

Describe the solution you'd like
A function analogous to `plot_1d` which returns confidence intervals, e.g. `MCMCSamples.confidence_1d(key, type='fastkde', confidence_level=[0.68, 0.95])` could return `[[(1.2,1.4), (1.53,1.55)], [(1.1, 1.6)]]`. Possible types would be `hist`, `kde`, and `fastkde`.

Edit: We need to return a list of tuples in case there are multiple peaks.
Describe alternatives you've considered
In the past I just used the samples to compute the kde manually and compute this, but a built-in functionality would surely be beneficial for many users and also encourage using a sensible definition of confidence levels.
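To illustrate why a list of tuples per confidence level is needed (a sketch of my own, not anesthetic's API; the helper name `iso_probability_intervals` and the Gaussian-mixture toy PDF are assumptions): for a multimodal posterior the region above the iso-probability threshold splits into disjoint intervals, which could be collected along these lines:

```python
# Sketch (assumption, not anesthetic API): disjoint intervals where a gridded 1d PDF
# exceeds the iso-probability threshold enclosing `level` of the total mass.
import numpy as np
import scipy.stats as stats
import scipy.optimize as opt

def iso_probability_intervals(x, p, level=0.68):
    dx = np.gradient(x)
    total = np.sum(p * dx)
    # PDF height c such that the mass above the threshold equals `level`
    mass_above = lambda c: np.sum((p * dx)[p >= c]) / total - level
    c = opt.bisect(mass_above, 0, p.max())
    # collect contiguous runs of grid points above the threshold
    above = p >= c
    intervals, start = [], None
    for i in range(len(x)):
        if above[i] and start is None:
            start = x[i]
        elif not above[i] and start is not None:
            intervals.append((start, x[i - 1]))
            start = None
    if start is not None:
        intervals.append((start, x[-1]))
    return intervals

# bimodal toy example: a mixture of two Gaussians
x = np.linspace(-6, 6, 2001)
p = 0.5 * stats.norm.pdf(x, -2, 0.5) + 0.5 * stats.norm.pdf(x, 2, 0.5)
print(iso_probability_intervals(x, p, 0.68))  # two tuples, one per peak
```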