# Probability Density Estimation:

- Probability density is the relationship between observations and their probability.

- Some outcomes of a random variable will have low probability density and other outcomes will have a high probability density.

- The overall shape of the probability density is referred to as a probability distribution.

- The calculation of probabilities for specific outcomes of a random variable is performed by a probability density function (PDF).

# 1. Need

- It is useful to know the probability density function for a sample of data in order to know whether a given observation is unlikely, or so unlikely as to be considered an outlier or anomaly and whether it should be removed. It is also helpful in order to choose appropriate learning methods that require input data to have a specific probability distribution.

# 2. Probability Density:
- A random variable x has a probability distribution p(x).

- The relationship between the outcomes of a random variable and their probabilities is referred to as the probability density, or simply the density.

- If a random variable is continuous, then the probability can be calculated via a probability density function (PDF).

- If a random variable is discrete, then the probability can be calculated via a probability mass function (PMF).

- The shape of the probability density function across the domain for a random variable is referred to as the probability distribution and common probability distributions have names, such as uniform, normal, exponential, etc.

- Given a random variable, we are interested in the density of its probabilities.

For example, given a random sample of a variable, we might want to know things like the shape of the probability distribution, the most likely value, the spread of values, and other properties.

Knowing the probability distribution for a random variable can help to calculate moments of the distribution, like the mean and variance, but can also be useful for other more general considerations, like determining whether an observation is unlikely or very unlikely and might be an outlier or anomaly.
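
To make the PDF/PMF distinction and the idea of moments more concrete, here is a minimal sketch using SciPy's `norm` and `binom` distributions. The parameter values (a normal with mean 50 and standard deviation 5, and 10 fair coin flips) are assumptions chosen only for illustration.

```python
from scipy.stats import binom, norm

# Continuous variable: the PDF returns a density, not a probability.
density_at_mean = norm.pdf(50, loc=50, scale=5)

# Probabilities for a continuous variable are areas under the PDF,
# here the chance of falling within one standard deviation of the mean.
prob_within_one_std = norm.cdf(55, loc=50, scale=5) - norm.cdf(45, loc=50, scale=5)

# Discrete variable: the PMF returns a probability directly,
# here the chance of exactly 5 heads in 10 fair coin flips.
prob_five_heads = binom.pmf(5, 10, 0.5)

# Moments of the distribution: mean and variance.
mean, variance = norm.stats(loc=50, scale=5, moments='mv')

# How unlikely is an observation of 65 or more (3 standard deviations above the mean)?
tail_prob = norm.sf(65, loc=50, scale=5)

print('density at mean=%.4f' % density_at_mean)
print('P(45 <= x <= 55)=%.4f' % prob_within_one_std)
print('P(5 heads)=%.4f' % prob_five_heads)
print('mean=%.1f, variance=%.1f' % (mean, variance))
print('P(x >= 65)=%.5f' % tail_prob)
```

A small tail probability like the last one is the kind of evidence that might lead us to treat an observation as an outlier.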

**The problem is, we may not know the probability distribution for a random variable. We rarely do know the distribution because we don’t have access to all possible outcomes for a random variable. In fact, all we have access to is a sample of observations. As such, we must select a probability distribution.
This problem is referred to as probability density estimation.**

- There are a few steps in the process of density estimation for a random variable.

# 1. Histogram
- The first step is to review the density of observations in the random sample with a simple histogram. From the histogram, we might be able to identify a common and well-understood probability distribution that can be used, such as a normal distribution. If not, we may have to fit a model to estimate the distribution.

- A histogram is a plot that involves first grouping the observations into bins and counting the number of events that fall into each bin. The counts, or frequencies of observations, in each bin are then plotted as a bar graph with the bins on the x-axis and the frequency on the y-axis.
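
The binning and counting step described above can be sketched without any plotting, assuming NumPy's `np.histogram`. The seed and the choice of 10 bins are arbitrary assumptions for the example.

```python
import numpy as np

# Draw a sample from a standard normal distribution (seed chosen arbitrarily).
rng = np.random.default_rng(1)
sample = rng.normal(size=1000)

# Group the observations into 10 equal-width bins and count each bin.
counts, bin_edges = np.histogram(sample, bins=10)

# Print the lower edge, upper edge, and count for every bin.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print('%6.2f to %6.2f: %d' % (left, right, count))
```

These counts are what the bar heights of the histogram plot represent.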

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import normal
# generate a sample
sample = normal(size=1000)
# plot a histogram
plt.hist(sample, bins=10)
plt.show()

We can clearly see the shape of the normal distribution. Note that our results will differ given the random nature of the data sample. Try running this a few times.

In [None]:
plt.hist(sample, bins= 5)
plt.show()

In [None]:
plt.hist(sample, bins= 3)
plt.show()

Reviewing a histogram of a data sample with a range of different numbers of bins will help to identify whether the density looks like a common probability distribution or not.

--> In most cases, you will see a unimodal distribution, such as the familiar bell shape of the normal, the flat shape of the uniform, or the descending or ascending shape of an exponential or Pareto distribution.

--> You might also see complex distributions, such as two peaks that don’t disappear with different numbers of bins, referred to as a bimodal distribution, or more than two peaks, referred to as a multimodal distribution. You might also see a large spike in density for a given value or small range of values indicating outliers, often occurring on the tail of a distribution far away from the rest of the density.

# 2. Parametric Density Estimation

- The common distributions are common because they occur again and again in different and sometimes unexpected domains.

- Get familiar with the common probability distributions as it will help you to identify a given distribution from a histogram.

- Once identified, you can attempt to estimate the density of the random variable with a chosen probability distribution. This can be achieved by estimating the parameters of the distribution from a random sample of data.

- We refer to this process as parametric density estimation.

For example, the normal distribution has two parameters: the mean and the standard deviation. Given these two parameters, the probability density function is fully specified. These parameters can be estimated from data by calculating the sample mean and sample standard deviation.
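
As a rough sketch of this idea, SciPy's `norm.fit` estimates both parameters of a normal distribution from data in one call. The data below is a hypothetical sample (mean 10, standard deviation 2) generated only for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: 1,000 draws from a normal with mean 10 and std 2.
rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=2.0, size=1000)

# Maximum likelihood estimates of the mean (loc) and standard deviation (scale).
mu_hat, sigma_hat = norm.fit(data)

print('Estimated mean=%.3f, estimated std=%.3f' % (mu_hat, sigma_hat))
# For the normal distribution these match np.mean and np.std (with ddof=0).
print('np.mean=%.3f, np.std=%.3f' % (np.mean(data), np.std(data)))
```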

**Once we have estimated the density, we can check if it is a good fit. This can be done in many ways, such as:**

1. Plotting the density function and comparing the shape to the histogram.
2. Sampling the density function and comparing the generated sample to the real sample.
3. Using a statistical test to confirm the data fits the distribution.

We will generate a random sample of 1,000 observations from a normal distribution with a mean of 50 and a standard deviation of 5.

In [None]:
sample = normal(loc=50, scale=5, size=1000)

We can then pretend that we don’t know the probability distribution and maybe look at a histogram and guess that it is normal. Assuming that it is normal, we can then calculate the parameters of the distribution, specifically the mean and standard deviation.

In [None]:
sample_mean = np.mean(sample)
# np.std uses ddof=0 by default; pass ddof=1 for the unbiased sample estimate
sample_std = np.std(sample)

We would not expect the mean and standard deviation to be 50 and 5 exactly given the small sample size and noise in the sampling process.

In [None]:
print('Mean=%.3f, Standard Deviation=%.3f' % (sample_mean, sample_std))

- Then fit the distribution with these parameters, so-called parametric density estimation of our data sample.

In this case, we can use the norm() SciPy function.

In [None]:
from scipy.stats import norm
dist = norm(sample_mean, sample_std)

We can then evaluate the density of this distribution for a range of values in our domain, in this case between 30 and 70.

In [None]:
# evaluate the fitted PDF at the integer values from 30 to 70 inclusive
values = list(range(30, 71))
probabilities = [dist.pdf(value) for value in values]

Finally, we can plot a histogram of the data sample and overlay a line plot of the probabilities calculated for the range of values from the PDF.

In [None]:
# plot the histogram and pdf
plt.hist(sample, bins=10, density=True)
plt.plot(values, probabilities)
plt.show()

The PDF is fit using the estimated parameters, and the histogram of the data with 10 bins is compared to the densities calculated from the PDF for a range of values. We can see that the PDF is a good match for our data.
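
The third check listed earlier, a statistical test, could look something like the sketch below, which assumes SciPy's `kstest` (a Kolmogorov-Smirnov test) and regenerates a sample with an arbitrary seed so that it runs on its own. Because the parameters are estimated from the same sample being tested, the p-value tends to be optimistic, so treat this as a rough check rather than a formal confirmation.

```python
import numpy as np
from scipy.stats import kstest

# Regenerate a sample like the one above (seed chosen arbitrarily).
rng = np.random.default_rng(3)
data = rng.normal(loc=50, scale=5, size=1000)

# Estimate the parameters of the assumed normal distribution.
mu_hat, sigma_hat = np.mean(data), np.std(data)

# Compare the empirical distribution of the data to the fitted normal CDF.
result = kstest(data, 'norm', args=(mu_hat, sigma_hat))
statistic, p_value = result.statistic, result.pvalue

print('KS statistic=%.4f, p-value=%.4f' % (statistic, p_value))
```

A small statistic means the fitted distribution stays close to the data everywhere. A very small p-value would suggest the normal assumption is a poor fit.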

**Note:**
    It is possible that the data does match a common probability distribution, but requires a transformation before parametric density estimation.

For example, you may have outlier values that are far from the mean or center of mass of the distribution. This may have the effect of giving incorrect estimates of the distribution parameters and, in turn, causing a poor fit to the data. These outliers should be removed prior to estimating the distribution parameters.

Another example is the data may have a skew or be shifted left or right. In this case, you might need to transform the data prior to estimating the parameters, such as taking the log or square root, or more generally, using a power transform like the Box-Cox transform.

These types of modifications to the data may not be obvious and effective parametric density estimation may require an iterative process of:

Loop Until Fit of Distribution to Data is Good Enough:
1. Estimating distribution parameters
2. Reviewing the resulting PDF against the data
3. Transforming the data to better fit the distribution
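
As one instance of the transformation step in this loop, the sketch below applies SciPy's `boxcox` to a right-skewed sample and compares the skewness before and after. The log-normal data is a hypothetical example, and Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy.stats import boxcox, skew

# Hypothetical right-skewed, strictly positive data (log-normal).
rng = np.random.default_rng(5)
skewed = rng.lognormal(mean=3.0, sigma=0.8, size=1000)

# Box-Cox picks the power (lambda) that makes the data look most normal.
transformed, lmbda = boxcox(skewed)

print('Estimated lambda=%.3f' % lmbda)
print('Skewness before=%.3f, after=%.3f' % (skew(skewed), skew(transformed)))
```

After the transform, the normal parameters can be estimated on `transformed` and the fit reviewed again.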

For power transformations, you can refer to my notebook: <link>https://www.kaggle.com/mukeshchoudhary/power-transformation</link>

For statistical tests, you can refer to my notebook: <link>https://www.kaggle.com/mukeshchoudhary/statistical-tests</link>

# 3. Non-Parametric Density Estimation

- In some cases, a data sample may not resemble a common probability distribution or cannot be easily made to fit the distribution.

- This is often the case when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution).

- In this case, parametric density estimation is not feasible and alternative methods can be used that do not use a common distribution. Instead, an algorithm is used to approximate the probability distribution of the data without a pre-defined distribution, referred to as a nonparametric method.

- The distributions will still have parameters, but they are not directly controllable in the same way as those of simple probability distributions.

- The most common nonparametric approach for estimating the probability density function of a continuous random variable is called kernel smoothing, or kernel density estimation (KDE).

**Kernel Density Estimation: Nonparametric method for using a dataset to estimate probabilities for new points.**

In this case, a kernel is a mathematical function that returns a density contribution for a given value of a random variable. The kernel effectively smooths or interpolates the density across the range of outcomes for a random variable such that the estimated density integrates to one, a requirement of a well-behaved probability density.

A parameter, called the smoothing parameter or the bandwidth, controls the scope, or window of observations, from the data sample that contributes to estimating the density at a given point.
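
To show what the kernel and bandwidth do, here is a minimal hand-written sketch of a one-dimensional Gaussian KDE in NumPy. The function name `gaussian_kde_1d`, the two-peaked sample, and the bandwidth of 3 are assumptions for illustration. This is not how any particular library implements it.

```python
import numpy as np

def gaussian_kde_1d(data, points, bandwidth):
    """Average of Gaussian kernels centred on each observation."""
    # Scaled distance between every evaluation point and every observation.
    u = (points[:, None] - data[None, :]) / bandwidth
    # Standard normal kernel evaluated at each scaled distance.
    kernel_values = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # Sum the kernels and normalise so the estimate integrates to one.
    return kernel_values.sum(axis=1) / (len(data) * bandwidth)

# Hypothetical two-peaked sample (seed chosen arbitrarily).
rng = np.random.default_rng(11)
data = np.hstack((rng.normal(20, 5, 300), rng.normal(40, 5, 700)))

# Evaluate the estimate on a grid that comfortably covers the data.
grid = np.linspace(-20, 80, 1001)
density = gaussian_kde_1d(data, grid, bandwidth=3.0)

# Approximate the area under the estimated density with a Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print('Area under the estimated density=%.4f' % area)
```

A larger bandwidth widens each kernel, so more observations contribute to the estimate at every point and the curve becomes smoother.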

In [None]:
from numpy import hstack
# generate a sample
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = hstack((sample1, sample2))
# plot the histogram
plt.hist(sample, bins=50)
plt.show()

We have fewer samples with a mean of 20 than samples with a mean of 40, which we can see reflected in the histogram with a larger density of samples around 40 than around 20.

- Data with this distribution does not nicely fit into a common probability distribution, by design. It is a good case for using a nonparametric kernel density estimation method.



The scikit-learn machine learning library provides the KernelDensity class that implements kernel density estimation.

First, the class is constructed with the desired bandwidth (window size) and kernel (basis function) arguments. It is a good idea to test different configurations on your data. In this case, we will try a bandwidth of 3 and a Gaussian kernel.

The class is then fit on a data sample via the fit() function. The function expects the data to have a 2D shape with the form [rows, columns], therefore we can reshape our data sample to have 1,000 rows and 1 column.

In [None]:
from sklearn.neighbors import KernelDensity
# fit density
model = KernelDensity(bandwidth=3, kernel='gaussian')
sample = sample.reshape((len(sample), 1))
model.fit(sample)

We can then evaluate how well the density estimate matches our data by calculating the probabilities for a range of observations and comparing the shape to the histogram, just like we did for the parametric case in the prior section.

In [None]:
# sample probabilities for a range of outcomes
values = np.asarray([value for value in range(1, 60)])
values = values.reshape((len(values), 1))
probabilities = model.score_samples(values)
probabilities = np.exp(probabilities)

Finally, we can create a histogram with normalized frequencies and an overlay line plot of values to estimated probabilities.

In [None]:
plt.hist(sample, bins=50, density=True)
plt.plot(values[:], probabilities)
plt.show()

In this case, we can see that the PDF is a good fit for the histogram. It is not very smooth and could be made more so by setting the “bandwidth” argument to 4 or higher (the bandwidth is expressed in the units of the data). Experiment with different values of the bandwidth and the kernel function.
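
One possible way to structure that experiment is to hold out part of the data and compare the log-likelihood that each bandwidth assigns to it. The sketch below assumes scikit-learn's `KernelDensity.score`, which returns the total log-likelihood of the given points. The candidate bandwidths, the 700/300 split, and the seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Regenerate a bimodal sample like the one above (seed chosen arbitrarily).
rng = np.random.default_rng(13)
data = np.hstack((rng.normal(20, 5, 300), rng.normal(40, 5, 700)))
rng.shuffle(data)

# Fit on 700 observations and evaluate on the remaining 300.
train = data[:700].reshape(-1, 1)
held_out = data[700:].reshape(-1, 1)

# Candidate bandwidths to compare (assumed values).
bandwidths = [0.5, 1.0, 2.0, 3.0, 4.0, 10.0]
scores = {}
for bandwidth in bandwidths:
    model = KernelDensity(bandwidth=bandwidth, kernel='gaussian').fit(train)
    # Total log-likelihood of the held-out points; higher is better.
    scores[bandwidth] = model.score(held_out)
    print('bandwidth=%.1f, held-out log-likelihood=%.1f' % (bandwidth, scores[bandwidth]))

best_bandwidth = max(scores, key=scores.get)
print('Best bandwidth among candidates: %.1f' % best_bandwidth)
```

A bandwidth that is too large blurs the two peaks together and scores poorly on the held-out data, while one that is too small tends to chase noise in the training sample.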

# Thanks....