# Probability Distributions

Probability and statistics are foundational pillars of data science.

## Random Variable

A <b>random variable</b> is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and continuous.

A <b>discrete random variable</b> is one which may take on only a countable number of distinct values. For example, you can define a random variable X to be the number which comes up when you roll a fair die. X can take the values 1, 2, 3, 4, 5 and 6, and is therefore a discrete random variable.

The <b>probability distribution</b> of a discrete random variable is a list of probabilities associated with each of its possible values. It is also sometimes called the <b>probability function</b> or the probability mass function. To state this mathematically, suppose a random variable X may take k different values, with the probability that $X=x_i$ defined to be $P(X=x_i)=p_i$. Then the probabilities $p_i$ must satisfy the following:

$$
 0\leq p_i\leq 1 \quad \text{for each } i
$$
$$
 p_1+p_2+\dots+p_k=1
$$
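
As a small illustration of these two conditions, the sketch below builds the probability mass function of the fair die from the example above and checks both of them. The names `faces` and `pmf` are chosen here for illustration only, and exact fractions are used so that the sum is not affected by floating-point rounding.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
faces = [1, 2, 3, 4, 5, 6]
pmf = {face: Fraction(1, 6) for face in faces}

# Condition 1: every probability lies between 0 and 1.
all_in_range = all(0 <= p <= 1 for p in pmf.values())

# Condition 2: the probabilities sum to exactly 1.
total = sum(pmf.values())

print(all_in_range, total)
```

Both checks should pass for any valid discrete distribution; a table of probabilities that fails either one is not a probability mass function.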

<div class="alert alert-block alert-info">
Some examples of discrete probability distributions are:<br>
    <ul>
        <li> Bernoulli distribution</li>
    <li> Binomial distribution </li>
    <li> Poisson distribution </li>
    </ul>    
</div>

A <b>continuous random variable</b> is one which can take any value within an interval, and therefore has uncountably many possible values. For example, you can define a random variable X to be the height of students in a class. Since a continuous random variable is defined over an interval of values, its probabilities are represented by areas under a curve (that is, by integrals).

The <b>probability distribution</b> of a continuous random variable is described by a <b>probability density function</b>, which is defined over a continuous range of values. The probability of observing any single exact value is equal to 0, since the random variable can take uncountably many values. For example, a random variable X may take all values over an interval of real numbers. Then the probability that X is in the set of outcomes A, written P(A), is defined to be the area above A and under a curve. The curve, which represents a function p(x), must satisfy the following:

<div class="alert alert-block alert-info">
    <ul>
        <li> The curve has no negative values ($p(x)\geq 0$ for all $x$)</li>
    <li> The total area under the curve is equal to $1$ </li>
    </ul>    
</div>


A curve meeting these requirements is often known as a <b>density curve</b>. Some examples of continuous probability distributions are normal distribution, exponential distribution, beta distribution, etc.
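
To make the two density-curve requirements concrete, here is a minimal sketch that checks them numerically for the standard normal distribution. It assumes `scipy.stats.norm` and `scipy.integrate.quad` are available; the grid limits and variable names are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Evaluate the standard normal density on a grid; no value should be negative.
xs = np.linspace(-10, 10, 2001)
is_nonnegative = bool(np.all(norm.pdf(xs) >= 0))

# Integrate the density over the whole real line; the area should be 1.
area, abs_error = integrate.quad(norm.pdf, -np.inf, np.inf)

# The probability of an interval is the area under the curve over that interval.
p_within_one_sd, _ = integrate.quad(norm.pdf, -1, 1)

print(is_nonnegative, round(area, 6), round(p_within_one_sd, 4))
```

The last value is the probability that a standard normal variable falls within one standard deviation of its mean, roughly 0.68.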

<div class="alert alert-block alert-info">
    Another function that often appears in the literature, and which you should know about, is the <b>cumulative distribution function</b>. All random variables (discrete and continuous) have a cumulative distribution function. It is a function giving the probability that the random variable X is less than or equal to x, for every value x. For a discrete random variable, the cumulative distribution function is found by summing up the probabilities of all values less than or equal to x. For a continuous random variable, it is the area under the density curve up to x.
</div>
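
The following sketch illustrates the cumulative distribution function for both cases. It is only an illustration: the discrete part reuses the fair die from earlier, and the continuous part assumes the standard normal distribution from `scipy.stats.norm`.

```python
import numpy as np
from scipy.stats import norm

# Discrete case: the CDF of a fair die is the running sum of its PMF.
faces = np.arange(1, 7)
pmf = np.full(6, 1 / 6)
cdf = np.cumsum(pmf)
p_at_most_4 = cdf[3]  # P(X <= 4), i.e. the entry for face 4

# Continuous case: the CDF is the area under the density curve up to x.
p_below_zero = norm.cdf(0.0)  # P(Z <= 0) for a standard normal Z

print(p_at_most_4, p_below_zero)
```

In both cases the function is non-decreasing and reaches 1 once every possible value has been included.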

In [1]:
# render interactive plots inside the Jupyter notebook
%matplotlib notebook
# statistical plotting
import seaborn as sns
# base plotting library
import matplotlib.pyplot as plt
# for LaTeX equations and for displaying images
from IPython.display import Math, Latex, Image

## Uniform Distribution

Perhaps one of the simplest and most useful distributions is the uniform distribution. The probability density function of the continuous uniform distribution is:

$$
 f(x)=\begin{cases}
    \frac{1}{b-a} & a\leq x\leq b \\
    0             & x<a \text{ or } x>b
    \end{cases}
$$

Since any interval of numbers of equal width has an equal probability of being observed, the curve describing the distribution is a rectangle, with constant height across the interval and 0 height elsewhere. Since the area under the curve must be equal to 1, the length of the interval determines the height of the curve. 

You can visualize the uniform distribution in Python with the help of a random number generator acting over an interval of numbers (a, b). To do so, import uniform from the scipy.stats module.

In [2]:
# import uniform distribution
from scipy.stats import uniform

The uniform object generates a continuous uniform random variable over the interval specified by its loc and scale arguments: the distribution is constant between loc and loc + scale. The size argument sets the number of random variates to draw. To make the results reproducible, pass a random_state argument set to a fixed integer.

In [3]:
# random numbers from uniform distribution
n = 10000
start = 10
width = 20
data_uniform = uniform.rvs(size=n, loc=start, scale=width)
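
As a quick check of how loc and scale behave, the sketch below evaluates the density inside and outside the interval and shows the effect of random_state. The test points 15 and 35 and the seed 42 are arbitrary values picked for this illustration.

```python
import numpy as np
from scipy.stats import uniform

start, width = 10, 20  # same interval as above: [10, 30]

# The density is 1 / scale inside [loc, loc + scale] and 0 outside.
inside = uniform.pdf(15, loc=start, scale=width)
outside = uniform.pdf(35, loc=start, scale=width)

# A fixed random_state gives the same sample on every run.
sample_a = uniform.rvs(size=5, loc=start, scale=width, random_state=42)
sample_b = uniform.rvs(size=5, loc=start, scale=width, random_state=42)

print(inside, outside, np.array_equal(sample_a, sample_b))
```

With a width of 20, the density inside the interval is 1/20, which matches the formula 1/(b-a) given earlier.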

You can use Seaborn’s distplot to plot a histogram of the sample you just created. distplot takes several arguments to customize the plot: bins sets the number of histogram bins, color sets the color, kde toggles the kernel density estimate curve, and hist_kws passes extra options such as linewidth to the histogram. The call returns a plot object ax, on which you can set the x- and y-axis labels using the xlabel and ylabel arguments of ax.set. Note that distplot is deprecated as of seaborn 0.11; histplot and displot are its replacements.

In [4]:
ax = sns.distplot(data_uniform,
                  bins=100,
                  kde=True,
                  color='skyblue',
                  hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Uniform Distribution ', ylabel='Frequency')


[Text(0, 0.5, 'Frequency'), Text(0.5, 0, 'Uniform Distribution ')]
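
The flat shape of the histogram can also be checked numerically, without plotting. The sketch below is an illustration under stated assumptions: it draws a fresh sample with a fixed seed (0, chosen arbitrarily), uses 20 bins instead of 100 to reduce noise, and normalizes the histogram to a density so that bar heights can be compared with 1/width.

```python
import numpy as np
from scipy.stats import uniform

start, width = 10, 20
data = uniform.rvs(size=10000, loc=start, scale=width, random_state=0)

# density=True normalizes the bar heights so that the total bar area is 1.
heights, edges = np.histogram(
    data, bins=20, range=(start, start + width), density=True
)

total_area = np.sum(heights * np.diff(edges))
mean_height = heights.mean()
max_deviation = np.max(np.abs(heights - 1 / width))

print(total_area, mean_height, max_deviation)
```

The bar heights should hover around 1/width = 0.05, with small deviations that shrink as the sample size grows. When a histogram is density-normalized like this, its y-axis shows density rather than raw counts.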

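Some definitions of common summary statistics:

* the <b>mean</b> is the sum of the values divided by the number of values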

$$x_m=\frac{\sum_{i=1}^nx_i}{n}$$

In [6]:
import numpy as np

In [7]:
x=np.arange(10)
np.mean(x)

4.5

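Data often include missing values, which are commonly represented as NaN (not a number). An ordinary mean over data containing NaN is itself NaN, whereas np.nanmean ignores the NaN entries.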

In [8]:
xWithNan=np.hstack((x,np.nan)) # append nan
np.mean(xWithNan)

nan

In [9]:
np.nanmean(xWithNan)

4.5
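
To connect these calls back to the formula for the mean, here is a minimal sketch that computes the mean by hand and reproduces the NaN-ignoring behavior with a boolean mask. The names `x_with_nan` and `mask` are introduced here for illustration.

```python
import numpy as np

x = np.arange(10)
x_with_nan = np.hstack((x, np.nan))  # append a missing value

# Mean straight from the definition: sum of the values divided by their count.
manual_mean = x.sum() / x.size

# Keep only the entries that are not NaN, then average them.
mask = ~np.isnan(x_with_nan)
masked_mean = x_with_nan[mask].mean()

print(manual_mean, masked_mean, np.nanmean(x_with_nan))
```

All three values should agree, since dropping the NaN entry leaves the original ten numbers.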

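* the <b>median</b> is the middle value of the sorted data (for an even number of values, the average of the two middle values)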

In [10]:
np.median(x)

4.5
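
The result 4.5 is not itself one of the ten values, because the data have an even count. The sketch below works through both the even and the odd case by hand; the arrays `even` and `odd` are made up for this illustration.

```python
import numpy as np

even = np.arange(10)  # 0..9, ten values
odd = np.arange(9)    # 0..8, nine values

# Even count: average the two middle values of the sorted data.
s = np.sort(even)
manual_even = (s[len(s) // 2 - 1] + s[len(s) // 2]) / 2

# Odd count: take the single middle value of the sorted data.
manual_odd = np.sort(odd)[len(odd) // 2]

print(manual_even, np.median(even))
print(manual_odd, np.median(odd))
```

In each case the hand-computed value should match what np.median returns.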

* the <b>mode</b> is the most frequently occurring value in a distribution

In [11]:
from scipy import stats
data=[1,3,4,4,7]
stats.mode(data)

ModeResult(mode=array([4]), count=array([2]))
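
The printed form of ModeResult may differ between SciPy versions (newer releases return scalars rather than one-element arrays). As an alternative sketch that does not depend on that, the mode can be found with NumPy alone by counting how often each distinct value occurs. The variable names below are chosen for illustration.

```python
import numpy as np

data = [1, 3, 4, 4, 7]

# Distinct values (sorted) and how often each one occurs.
values, counts = np.unique(data, return_counts=True)

# The mode is the value with the highest count.
# On ties, argmax picks the first maximum, i.e. the smallest such value.
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)
```

For this data the value 4 occurs twice and every other value once, which agrees with the result above.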