Nonparametric plotting and analysis tool for estimating a one-dimensional data sample

jennyfarmer/PDFAnalyze

================================================================
PDFAnalyze, Version 1.0, September 2022
Jenny Farmer jfarmer@carolina.rr.com
Donald Jacobs djacobs1@uncc.edu
University of North Carolina at Charlotte



================================================================
GENERAL INFORMATION
================================================================

The PDFAnalyze package includes the following high-level MATLAB functions:

1. PDFe.m:  Computes a nonparametric probability density estimate for a multivariate data sample of 1 to 5 variables.
		(Calls EstimatePDFmv.mex function, which can also be called directly)

2. PDFAnalyze.m:  Computes a probability density estimate for a one-dimensional data sample and produces optional plots for analysis.  
			(Calls EstimatePDF.mex function, which can also be called directly)
 




Please cite at least one of these publications if you use this code for your research:

Farmer, J. and D. Jacobs, High throughput nonparametric probability density estimation. PLoS ONE, 2018. 13(5): p. e0196937.

Farmer, Jenny, and Donald J. Jacobs. “MATLAB Tool for Probability Density Assessment and Nonparametric Estimation.” SoftwareX, vol. 18, Elsevier BV, June 2022, p. 101017, doi:10.1016/j.softx.2022.101017.



=================================================================
INSTALLATION FOR MATLAB (2018 or greater)
=================================================================

Installation Steps


1. Prior to installing PDFAnalyze, the MinGW C/C++ compiler for Windows must be installed as a MATLAB Add-On.  
To install, select [Add-Ons/Get Add-Ons] from the HOME menu within MATLAB and search for ‘MinGW’.  Select and install MinGW-w64.

2. Copy all source files into a single folder

3. Run the CompilePDF.m script in MATLAB to create a MATLAB Executable (mex) 

4. Run the CompilePDFmv.m script in MATLAB to create a MATLAB Executable (mex) 

5. (optional) Verify installation by running the example.m script in MATLAB 



The PDFAnalyze package consists of the following files:

PDFAnalyze.m
PlotBeta.m
EstimatePDF.m
FigureSettings.m
GetTargets.m
example.m
CompilePDF.m
CompilePDFmv.m 

EstimatePDF.cpp; EstimatePDF.h
EstimatePDFmv.cpp; EstimatePDFmv.h
JointProbability.cpp; JointProbability.h
Variable.cpp; Variable.h
callPDF.cpp;  callPDF.h
ChebyShev.cpp; ChebyShev.h
InputData.cpp; InputData.h
InputParameters.cpp; InputParameters.h
MinimizeScore.cpp; MinimizeScore.h
Score.cpp; Score.h
ScoreQZ.cpp; ScoreQZ.h
WriteResults.cpp; WriteResults.h
OutputControl.cpp; OutputControl.h

README.txt



=================================================================
PDFe USAGE
=================================================================


[pdfPoints, pdfEst] = PDFe(r);



Input Parameters 

r	random data sample, one column of data for each variable



Output Parameters

pdfPoints	evaluation points for pdfEst; one row for each variable
pdfEst	the joint probability density function; an array with nVariables dimensions: [nGrids x nGrids x ... x nGrids]
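
A minimal sketch of calling PDFe on a two-variable sample, assuming the MEX files have been compiled and the package folder is on the MATLAB path. The sample data, variable names, and the surf plot are illustrative only. The indexing order of pdfEst relative to the two variables is an assumption and should be checked against your own data.

```matlab
% Two correlated variables, one column per variable (illustrative data).
rng(1);                                   % reproducible sample
n = 2000;
a = randn(n, 1);
r = [a, 0.5 * a + randn(n, 1)];

if exist('PDFe', 'file')                  % skip when the package is not on the path
    [pdfPoints, pdfEst] = PDFe(r);

    % Assumption: pdfEst(i, j) is indexed by variable 1, then variable 2.
    % surf expects rows to follow the y axis, hence the transpose.
    surf(pdfPoints(1, :), pdfPoints(2, :), pdfEst');
    xlabel('variable 1'); ylabel('variable 2'); zlabel('density estimate');
end
```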





=================================================================
PDFAnalyze USAGE
=================================================================


[F, XI] = PDFAnalyze(X) Computes the density estimate of data in sample X.  F contains the density estimate at points XI.  
	The number of points and the relative spacing is determined automatically from the features of the data sample


[F, XI, CDF, SQR] = PDFAnalyze(...) also returns the cumulative distribution and the scaled quantile residual for each sample data point.


PDFAnalyze(...) with no output arguments produces a plot of the density estimate.


[...] = PDFAnalyze(..., 'param1', 'val1', 'param2', 'val2', ...) specifies parameter name/value pairs to control the density estimation.  

Valid parameters are as follows:

	Parameter			Value
	'PlotType'			Produces any combination of three plot types:
					'pdf'		probability density function
					'sqr'		scaled quantile residual (see NOTES)
					'combined'	pdf and sqr plotted on one figure

				To request multiple plot types, specify multiple 'PlotType' name/value pairs.

	'EstimationType'	The default estimation method is PDFEstimate.

				Additional KDE methods are available:
				'kde1'		built-in MATLAB function ksdensity
				'kde2'		Zdravko Botev (2020). Kernel Density Estimator 
						(https://www.mathworks.com/matlabcentral/fileexchange/14034-kernel-density-est
						MATLAB Central File Exchange. Retrieved March 17, 2020.

	'distribution'		A two column matrix, [F, XI], representing a distribution to plot on the same figure as the estimate for use with 'pdf' plot type.  
				Useful for comparison to a known distribution.
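
The calling forms above can be sketched as follows, assuming the package is compiled and on the MATLAB path. The sample and variable names are illustrative. The trapz line is only a sanity check added here (a density estimate should integrate to roughly 1) and is not part of the package.

```matlab
rng(2);
X = randn(5000, 1);                       % illustrative one-dimensional sample

if exist('PDFAnalyze', 'file')            % skip when the package is not on the path
    % Four-output form: density, evaluation points, CDF, and SQR.
    [F, XI, CDF, SQR] = PDFAnalyze(X);

    % Sanity check: the estimate should integrate to approximately 1.
    fprintf('Integral of estimate: %.4f\n', trapz(XI, F));

    % Two 'PlotType' pairs request two separate plots.
    PDFAnalyze(X, 'PlotType', 'pdf', 'PlotType', 'sqr');

    % Same sample estimated with the built-in ksdensity method instead.
    PDFAnalyze(X, 'EstimationType', 'kde1');
end
```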



=================================================================
EXAMPLES
=================================================================

Example 1: Plot the estimate for a random sample from a Normal distribution along with the true Normal distribution:

	data = randn(1000, 1);
	x = min(data):0.1:max(data);
	f = normpdf(x);
	d = [x(:), f(:)];
	PDFAnalyze(data, 'distribution', d);


Example 2: Plot the scaled quantile residual (SQR) for an estimate of the Normal distribution, showing confidence thresholds and uncertainty estimates:

	PDFAnalyze(randn(10000, 1), 'PlotType', 'sqr');







=================================================================
EstimatePDFmv USAGE
=================================================================

EstimatePDFmv is invoked from within PDFe.m but can be called directly to customize the resolution


Usage

[jp, x] = EstimatePDFmv(r, nSamples, nVariables, nGrids);



Input Parameters (all required)

r		random data sample, one column of data for each variable
nSamples	the number of rows in r, representing the number of samples per variable
nVariables	the number of columns in r, representing the number of variables
nGrids	the resolution, per variable, for desired output.


Output Parameters

jp		the joint probability density function; an array of size (nGrids)^(nVariables)
x		evaluation points for jp; an array of size (nVariables) * (nGrids)
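
A short sketch of a direct call, assuming the EstimatePDFmv MEX file has been compiled. The sample and the choice of nGrids = 50 are illustrative. Given the [nGrids x ... x nGrids] shape described for PDFe, jp should hold nGrids^nVariables values, so memory grows quickly with the number of variables.

```matlab
rng(3);
r = randn(1500, 3);                       % three variables, one per column
[nSamples, nVariables] = size(r);         % both are required inputs
nGrids = 50;                              % illustrative resolution per variable

if exist('EstimatePDFmv', 'file')         % skip when the MEX file is not built
    [jp, x] = EstimatePDFmv(r, nSamples, nVariables, nGrids);

    % Report the returned sizes next to the expected element counts.
    fprintf('numel(jp) = %d (expected %d)\n', numel(jp), nGrids^nVariables);
    fprintf('numel(x)  = %d (expected %d)\n', numel(x), nVariables * nGrids);
end
```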





=================================================================
EstimatePDF USAGE
=================================================================

EstimatePDF is invoked from within PDFAnalyze.m and can be customized through a collection of advanced input and output options.



Usage

[failed, y, pdf, cdf, sqr, lagrange, score, confidence, SURD] = EstimatePDF(data, parameters)

data		(required) a single vector of random sample data.
parameters	(optional) a MATLAB structure of options listed below



Optional Input Parameters

Name				Default Value
parameters.SURDtarget		[40]      
parameters.SURDmin		[5]
parameters.SURDmax		[100] 
parameters.LagrangeMin		[1]
parameters.LagrangeMax		[200]
parameters.lowBound		[calculated]
parameters.highBound		[calculated]      
parameters.integrationPoints	[calculated]
parameters.debug		[false]
parameters.partition		[1025]      
parameters.scoreType		['QZ']
parameters.outlierCutoff	[7]
parameters.adaptiveDx		[true]



Output Parameters

failed		non-zero if a solution was not found
y		range of values in PDF (independent variable)	
pdf		Probability Density Function (PDF)
cdf		Cumulative Distribution Function (CDF)
sqr		Scaled Quantile Residual
lagrange	Lagrange coefficients
score		Value returned by the score-type selected
confidence	SURD threshold achieved 
SURD		Sample Uniform Random Data            
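
A minimal sketch of calling EstimatePDF with a parameters structure, assuming the MEX file has been compiled. Only fields listed in the table above are set, and the values shown repeat the documented defaults. The sample data and the plot are illustrative.

```matlab
rng(4);
data = randn(2000, 1);                    % illustrative sample, single vector

% Set only the options of interest; the values here repeat the defaults.
parameters = struct();
parameters.SURDtarget  = 40;
parameters.LagrangeMax = 200;
parameters.debug       = false;

if exist('EstimatePDF', 'file')           % skip when the MEX file is not built
    [failed, y, pdf, cdf, sqr, lagrange, score, confidence, SURD] = ...
        EstimatePDF(data, parameters);

    if failed
        warning('EstimatePDF did not find a solution.');
    else
        plot(y, pdf);
        xlabel('y'); ylabel('estimated PDF');
    end
end
```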






=================================================================
NOTES
=================================================================

The following section includes a few brief notes concerning more advanced input and output options available, and how they may affect performance of the estimation.  
For a greater understanding of the methodology used, please see the publications listed in the GENERAL INFORMATION section.


1. SURD Threshold Targets

Sample Uniform Random Data (SURD) loosely correlates with the strength of the solution, with higher thresholds indicating more probable solutions for the PDF.  


2. Scaled Quantile Residual

The equation for Scaled Quantile Residual (SQR) is given by SQR = sqrt(N+2)*(u - uniform-u) where N is the number of data samples. 
SQR plots are very useful as a diagnostic measure because they are sample size invariant and have universal characteristics independent of the true PDF.  
The SQR plot type plots the SQR for each data sample by position, highlighting in red those that fall outside of the expected 98% threshold.
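
The SQR formula can be sketched in a few lines. This is an interpretation, not code from the package: it assumes u holds the sorted CDF values of the sample and that "uniform-u" is the mean of the k-th order statistic of N uniform random numbers, k/(N+1), which is a standard result. Here u is replaced by sorted uniform random numbers as a stand-in for a well-fitted estimate.

```matlab
rng(5);
N = 1000;

% Stand-in for the sorted CDF values of the sample under a good estimate.
u = sort(rand(N, 1));

% Assumed meaning of "uniform-u": mean of the k-th uniform order statistic.
k = (1:N)';
uniformU = k / (N + 1);

% Scaled quantile residual, one value per sorted data point.
sqr = sqrt(N + 2) * (u - uniformU);

plot(uniformU, sqr);
xlabel('position'); ylabel('SQR');
```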


3. Lagrange Coefficients

Each Lagrange multiplier returned as output is an expansion coefficient in the series of orthogonal functions within an exponential. 
The more complex the shape of the distribution, the more Lagrange multipliers are required to accurately define the PDF. 


4. Greater accuracy in numerical integration can be controlled

Increasing the number of integration points will improve the resolution of the PDF, but could increase runtime.  
Decreasing the integration points is not recommended, as it may produce poor solutions.  


5. Failed solutions 

Two safety measures prevent the program from running for an unreasonably long time without finding a solution. The program stops and reports a failure if:

i) progress stalls and the score does not improve significantly after many attempts, or 
ii) the maximum number of Lagrange multipliers has been reached. 

If the maximum number of Lagrange multipliers is reached, the solution has likely not yet converged, and the user can increase the maximum. 
The default maximum of 200 guards against cases that may never converge. 
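
One way to react to a failed solution is sketched below, assuming the EstimatePDF MEX file has been compiled. The bimodal sample and the larger limit of 400 are illustrative choices. The failed flag does not say which of the two safety measures triggered, so raising the limit may not help in every case.

```matlab
rng(6);
% Illustrative bimodal sample with a narrow second peak.
data = [randn(3000, 1); 5 + 0.3 * randn(3000, 1)];

parameters = struct('LagrangeMax', 200);  % documented default

if exist('EstimatePDF', 'file')           % skip when the MEX file is not built
    [failed, y, pdf, ~, ~, ~, ~, ~, ~] = EstimatePDF(data, parameters);

    if failed
        % Retry once with a higher limit (400 is an arbitrary example value).
        parameters.LagrangeMax = 400;
        [failed, y, pdf, ~, ~, ~, ~, ~, ~] = EstimatePDF(data, parameters);
    end
end
```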


6. Parametric maximum entropy method can be used with this program.

If a user desires an exact number of Lagrange multipliers, the LagrangeMin and LagrangeMax parameter options can be set to equal values. 
For example, if the user knows the distribution is a Gaussian, then the user could set both the minimum and maximum Lagrange multipliers to 3.  
In this case, the output will be equivalent to a parametric maximum entropy method, where the number of Lagrange multipliers is known in advance. 
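
The Gaussian case can be sketched as follows, assuming the EstimatePDF MEX file has been compiled. It uses the LagrangeMin and LagrangeMax fields from the Optional Input Parameters table; the sample and its mean and standard deviation are illustrative.

```matlab
rng(7);
data = 2 + 0.5 * randn(4000, 1);          % illustrative Gaussian sample

% Equal minimum and maximum fix the number of Lagrange multipliers at 3.
parameters = struct('LagrangeMin', 3, 'LagrangeMax', 3);

if exist('EstimatePDF', 'file')           % skip when the MEX file is not built
    [failed, y, pdf, ~, ~, lagrange, ~, ~, ~] = EstimatePDF(data, parameters);
    if ~failed
        disp(lagrange);                   % inspect the returned coefficients
    end
end
```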


7. Verbose outputs for debugging. 

For more details on the progress of the program and explanations of possible warnings and outcomes, set the debug parameter option to true.
