pdf-distribution for values <1 #33

Closed
Cils opened this issue May 14, 2016 · 1 comment
Cils (Contributor) commented May 14, 2016

I noticed that values below 1 are not included in the PDF of the data (powerlaw.pdf(), fit.plot_pdf()). For some reason unknown to me, the boundaries of the logarithmically spaced histogram bins are converted to integers.
line 1952 in powerlaw.py: bins=unique(floor(logspace( log_min_size, log_max_size, num=number_of_bins)))
In this way, every bin edge below 1 is floored to 0 and merged by unique, so a potentially unbounded number of bins is eliminated.
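
The effect can be reproduced without the library. The following is a minimal sketch, not the powerlaw code itself: `log_edges` is a hypothetical stand-in for `numpy.logspace` written with the standard library only, and the range of exponents (-2 to 2) is chosen purely for illustration.

```python
from math import floor


def log_edges(log_min, log_max, num):
    """Return `num` values spaced evenly in log10 between the two exponents.

    Hypothetical stand-in for numpy.logspace, for illustration only.
    """
    step = (log_max - log_min) / (num - 1)
    return [10 ** (log_min + i * step) for i in range(num)]


# Nine edges from 0.01 to 100, i.e. two bins per decade.
edges = log_edges(-2, 2, 9)

# Same treatment as unique(floor(...)): floor each edge, drop duplicates.
floored = sorted(set(floor(e) for e in edges))

print(floored)  # -> [0, 1, 3, 10, 31, 100]
```

The four edges below 1 all floor to 0 and collapse into a single edge, so the bins between 0.01 and 1 disappear.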

I modified the function "pdf" to solve the problem. The data are rescaled by dividing them by xmin before the histogram is computed. At the end, the bin boundaries ("edges") are transformed back to the original scale and returned.

The code is below:

def pdf(data, xmin=None, xmax=None, linear_bins=False, **kwargs):
    """
    Returns the probability density function (normalized histogram) of the
    data.

    Parameters
    ----------
    data : list or array
    xmin : float, optional
        Minimum value of the PDF. If None, uses the smallest value in the data.
    xmax : float, optional
        Maximum value of the PDF. If None, uses the largest value in the data.
    linear_bins : bool, optional
        Whether to use linearly spaced bins, as opposed to logarithmically
        spaced bins (recommended for log-log plots).

    Returns
    -------
    bin_edges : array
        The edges of the bins of the probability density function.
    probabilities : array
        The portion of the data that is within the bin. Length 1 less than
        bin_edges, as it corresponds to the spaces between them.
    """
    from numpy import asarray, logspace, histogram, floor, unique
    from math import ceil, log10
    if not xmax:
        xmax = max(data)
    if not xmin:
        xmin = min(data)

    # Rescale the data so that xmin becomes 1. The integer-valued bin edges
    # computed below then also cover data with values below x=1.
    # asarray allows a plain list to be divided by a scalar.
    data2 = asarray(data) / xmin
    xmax = xmax / xmin
    xmin_old = xmin
    xmin = 1

    if linear_bins:
        # Upper limit is the rescaled xmax (originally posted as the
        # undefined name xmax2).
        bins = range(int(xmin), int(xmax))
    else:
        log_min_size = log10(xmin)
        log_max_size = log10(xmax)
        # int() because math.ceil returns a float on Python 2
        number_of_bins = int(ceil((log_max_size - log_min_size) * 10))
        bins = unique(
            floor(
                logspace(
                    log_min_size, log_max_size, num=number_of_bins)))
    hist, edges = histogram(data2, bins, density=True)

    # Transform the bin edges back to the original scale. The density was
    # computed on the rescaled data, so it is divided by xmin_old to stay
    # normalized over the original edges.
    edges = edges * xmin_old
    hist = hist / xmin_old
    return edges, hist

jeffalstott (Owner) commented

Thank you very much for considering this use case and writing a solution
for it!

Can you make this edit as a pull request? It may also be worth testing it
out some more; I can see in the code a reference to xmax2, which is not defined. As
a lower priority, I would also consider if there's a way to do this without
making another copy of the data; some users have very large datasets and
making another copy of the data can get burdensome.
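
One way to avoid the extra copy, sketched here under assumptions rather than taken from the library, is to rescale the bin edges instead of the data: the edges are computed and floored in units of xmin, multiplied back by xmin, and the original data are then binned against them. The helper names `rescaled_log_edges` and `density` are hypothetical, and the binning is done with the standard library so the sketch runs on its own; with NumPy, the same edges could be passed to `numpy.histogram(data, edges, density=True)`.

```python
from bisect import bisect_right
from math import ceil, floor, log10


def rescaled_log_edges(xmin, xmax, bins_per_decade=10):
    """Log-spaced bin edges, floored in units of xmin, on the original scale.

    Hypothetical helper: only the handful of edges is rescaled, never the data.
    """
    log_span = log10(xmax / xmin)
    n = max(int(ceil(log_span * bins_per_decade)), 2)
    step = log_span / (n - 1)
    # Floor in rescaled units (where xmin corresponds to 1) and drop duplicates.
    unit_edges = sorted({floor(10 ** (i * step)) for i in range(n)})
    return [e * xmin for e in unit_edges]


def density(data, edges):
    """Normalized histogram of `data` over `edges`; the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    total = 0
    for x in data:
        if edges[0] <= x <= edges[-1]:
            i = min(bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [c / (total * (edges[k + 1] - edges[k]))
            for k, c in enumerate(counts)]


# Illustrative data with several values below 1.
data = [0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 5.0, 10.0]
edges = rescaled_log_edges(min(data), max(data))
dens = density(data, edges)
```

Because the density is computed directly on the original scale, no correction factor is needed afterwards. Flooring can still pull the last edge slightly below xmax, in which case the largest values fall outside the histogram, as they can in the posted version.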


Cils closed this as completed May 17, 2016
Cils reopened this May 17, 2016