## 1. Uniform Distribution:
* In probability theory and statistics, a uniform distribution is a probability distribution where all *outcomes are equally likely within a given range*. This means that if you were to select a random value from this range, any value would be as likely as any other value.
* So whenever a variable is restricted to a range and every value in that range is equally likely, we can model it with a uniform distribution.
* There are two types of uniform distribution:
    1. Discrete
    2. Continuous
* It is denoted by 
       X ~ U(a, b) 
       // where a and b are the parameters of the range (a is the lower bound and b is the upper bound)
* Example of discrete uniform distribution: when we roll a fair die, the possible outcomes are {1, 2, 3, 4, 5, 6}, each with probability 1/6, so the bar graph of the PMF has six bars of equal height.
* Example of continuous uniform distribution:
    * The height of a person randomly selected from a group of individuals whose heights range from 5'6" to 6'0" can be modelled with a continuous uniform distribution, provided every height in that range is equally likely.
    * The weight of a randomly selected apple from a basket of apples that weigh between 100 and 200 grams can be modelled with a continuous uniform distribution, under the same assumption.
    
* Now let's see what the PDF and CDF of a continuous uniform distribution look like. The PDF is a rectangle between a and b and is 0 everywhere else. The width of the rectangle is b-a, and the area under the PDF must be 1, so the height of the rectangle between a and b is 1/(b-a). The CDF is 0 below a, rises linearly from 0 to 1 between a and b, and stays at 1 above b:
![pdf_cdf_of_uni_dis.png](attachment:pdf_cdf_of_uni_dis.png)
* **The skewness of a continuous uniform distribution is 0**, because it is symmetric about its mean (a+b)/2, just like the normal distribution.
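
The PDF and CDF described above can be written as two small functions. This is a minimal sketch in plain Python; the function names `uniform_pdf` and `uniform_cdf` are chosen here for illustration, and the apple-weight range reuses the example from this section.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of U(a, b): constant 1/(b - a) inside [a, b], 0 outside."""
    if b <= a:
        raise ValueError("b must be greater than a")
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x: float, a: float, b: float) -> float:
    """P(X <= x) for U(a, b): rises linearly from 0 at a to 1 at b."""
    if b <= a:
        raise ValueError("b must be greater than a")
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


# Apple weights assumed uniform between 100 g and 200 g.
print(uniform_pdf(150, 100, 200))  # 0.01
print(uniform_cdf(125, 100, 200))  # 0.25
```

The height 0.01 is exactly 1/(b-a) = 1/100, and the CDF value 0.25 says that a quarter of the apples weigh 125 g or less.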

### Application in Machine learning and Data Science:

1. **Random initialization**: In many machine learning algorithms, such as neural networks and k-means clustering, the initial values of the parameters can have a significant impact on the final result. Uniform distribution is often used to randomly initialize the parameters, as it ensures that all values in the range have an equal probability of being selected.
2. **Sampling**: Uniform distribution can also be used for sampling. For example, if you have a dataset with an equal number of samples from each class, you can use uniform distribution to randomly select a subset of the data that is representative of all the classes.

3. **Hyperparameter tuning**: Uniform distribution can also be used in hyperparameter tuning, where you need to search for the best combination of hyperparameters for a machine learning model. By defining a uniform prior distribution for each hyperparameter, you can sample from the distribution to explore the hyperparameter space.
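
As a rough illustration of the hyperparameter tuning use case, the sketch below draws a few candidate configurations from uniform priors. The hyperparameter names (`dropout`, `max_depth`) and their ranges are hypothetical and only serve the example; a real search would also train and score a model for each candidate.

```python
import random


def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one candidate configuration from uniform priors (illustrative ranges)."""
    return {
        "dropout": rng.uniform(0.0, 0.5),  # continuous uniform on [0, 0.5]
        "max_depth": rng.randint(2, 10),   # discrete uniform on {2, ..., 10}
    }


rng = random.Random(42)  # fixed seed so the draws are reproducible
candidates = [sample_hyperparameters(rng) for _ in range(5)]
for candidate in candidates:
    print(candidate)
```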



## 2. Log Normal Distribution:

* In probability theory and statistics, a log-normal distribution is a heavy-tailed (right-skewed) continuous probability distribution of a random variable whose logarithm is normally distributed.
* Not every right-skewed distribution is log-normal: **only distributions that become normal after taking the log are called log-normal distributions**.
* For example, if we have a column `age` and say that it is log-normally distributed, then taking the log of each value gives normally distributed values.

![log_normal_dist.png](attachment:log_normal_dist.png)

* Denoted as:
    X ~ Lognormal(μ, σ)
    // where μ and σ are the mean and standard deviation of log(X), not of X itself

* Examples of log normal distribution:
    * The length of comments posted in Internet discussion forums follows a log-normal distribution, as most comments are short and only a few are very long.
    * Users' dwell time on online articles (jokes, news etc.) follows a log-normal distribution.
    * The length of chess games tends to follow a log-normal distribution.
    * In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally.
    
* CDF of log normal distribution: https://en.wikipedia.org/wiki/Log-normal_distribution

### How to check if a random variable is log normally distributed?

* Let's say we have a random variable x. First we take the log of each value.
* If x is log-normal, the logged values should now be normally distributed.
* We can check with a QQ plot whether the logged values are normally distributed or not.
* If they are, then we can say the random variable x is log-normally distributed.
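
The steps above can be sketched numerically without drawing the plot. The helper `qq_correlation` below is an assumption of this example, not a library function: it computes the correlation between the sorted data and the theoretical normal quantiles, which is close to 1 exactly when the points of a normal QQ plot lie near a straight line. The data is synthetic.

```python
import math
import random
from statistics import NormalDist, mean


def qq_correlation(values):
    """Correlation between sorted data and theoretical normal quantiles."""
    data = sorted(values)
    n = len(data)
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mean_t, mean_d = mean(theory), mean(data)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(theory, data))
    var_t = sum((t - mean_t) ** 2 for t in theory)
    var_d = sum((d - mean_d) ** 2 for d in data)
    return cov / math.sqrt(var_t * var_d)


rng = random.Random(0)
x = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]  # synthetic log-normal sample
log_x = [math.log(v) for v in x]                          # step 1: take the log

r_raw = qq_correlation(x)      # raw data: QQ plot bends away from a line
r_log = qq_correlation(log_x)  # logged data: QQ plot is close to a line
print(f"raw: {r_raw:.3f}, log: {r_log:.3f}")
```

A correlation near 1 for the logged values, and a clearly lower one for the raw values, is what we expect when x is log-normal. In practice a visual QQ plot is still worth looking at, since it shows where the deviation happens.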

## 3. Pareto Distribution:

* The Pareto distribution is a type of probability distribution that is commonly used to model the distribution of wealth, income, and other quantities that exhibit a similar power-law behaviour.
* It is a special case of a power law.

**What is Power Law**

* In mathematics, a power law is a functional relationship between two variables, where one variable is proportional to a power of the other. 

* Specifically, if y and x are two variables related by a power law, then the relationship can be written as:
        y = k * x^α
        // where k is some constant and α is the exponent of the power law

* Power laws are often associated with the 80-20 rule: when data follows a power law, roughly 20% of the items account for about 80% of the total.

* Graph of power law looks like this
    
![pareto.png](attachment:pareto.png)

* In above graph we can see green color is 80% of entire area and yellow one is just 20% of area.
* Vilfredo Pareto originally used this distribution to describe the allocation of wealth among individuals since it seemed to show rather well the way that a larger portion of the wealth of any society is owned by a smaller percentage of the people in that society. He also used it to describe distribution of income. This idea is sometimes expressed more simply as the Pareto principle or the "80-20 rule" which says that 20% of the population controls 80% of the wealth.
* In the above graph we can see that 20% of yellow area has 80% of wealth.
* Another example: about 80% of a cricket match's outcome is decided by about 20% of the team's performance.


* However, the 80-20 rule does not hold for every Pareto distribution. It holds exactly only for α = log₄5 ≈ 1.16, where α is the shape parameter of the Pareto distribution (the other parameter is the scale, i.e. the minimum possible value).
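
To see why α ≈ 1.16 is the special value, we can use the closed form for a Pareto distribution with shape α > 1: the richest fraction p of the population holds a share p^(1 - 1/α) of the total. The sketch below evaluates it; the function name `top_share` is chosen for this example.

```python
import math


def top_share(top_fraction: float, alpha: float) -> float:
    """Share of the total held by the richest `top_fraction` under Pareto(alpha).

    Uses the closed form top_fraction ** (1 - 1/alpha), valid for alpha > 1
    (the mean is infinite otherwise).
    """
    if alpha <= 1:
        raise ValueError("alpha must be > 1 for a finite mean")
    return top_fraction ** (1.0 - 1.0 / alpha)


alpha_80_20 = math.log(5) / math.log(4)  # log_4(5), approximately 1.161
print(round(top_share(0.2, alpha_80_20), 3))  # 0.8
print(round(top_share(0.2, 2.0), 3))          # 0.447
```

With α = 2 the top 20% hold only about 45% of the total, so the 80-20 split is a property of one particular shape, not of every Pareto distribution.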

* The Pareto distribution is right-skewed, as we can see.
* **Interview question**: How to detect if a distribution is Pareto distribution?
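    1. Using log-log plot: Take the values X and their frequencies (or probabilities) Y, and take the log of both. Because y = k * x^α implies log y = log k + α log x, a power law becomes a straight line on log-log axes. So we plot log Y against log X, and if the points fall roughly on a straight line like the one below, the data is consistent with a Pareto distribution: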
    ![Log log plot](https://i.stack.imgur.com/ozsuR.png)
    
    2. Using QQ plot: A QQ plot compares the quantiles of two distributions. Take X as a theoretical Pareto distribution and compare the quantiles of the observed data Y against it. If the points lie close to a straight line, we can say the data follows a Pareto distribution.
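
The log-log check can also be done numerically. The sketch below is one possible version, under the assumption that we look at the empirical survival function P(X > x) rather than a histogram: for Pareto data, log P(X > x) is a straight line in log x with slope -α. The helper `loglog_slope` and the synthetic sample are illustrative.

```python
import math
import random


def loglog_slope(values):
    """Least-squares slope of log(survival) against log(x)."""
    data = sorted(values)
    n = len(data)
    # Empirical survival P(X > x_i); drop the largest point, where it is 0.
    xs = [math.log(v) for v in data[:-1]]
    ys = [math.log(1.0 - (i + 1) / n) for i in range(n - 1)]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)
alpha = 2.0
sample = [rng.paretovariate(alpha) for _ in range(5000)]  # synthetic Pareto data
slope = loglog_slope(sample)
print(f"fitted slope: {slope:.2f} (expected near {-alpha})")
```

A slope close to -α together with a visibly straight line supports the Pareto assumption. A straight line on a log-log plot is a necessary sign rather than proof, so it is usually combined with the QQ plot check.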
    
    
## Transformations:
* If our data is normally distributed, things become easy, because the normal distribution is very well researched and its properties are well known. But many times we do not get normally distributed data.
* Transformations are mathematical functions applied to the data so that the transformed distribution is closer to a normal distribution.
* Some transformations given by sklearn:
  * Function transformer
    * Log transformation: For right skewed data
    * Reciprocal transformation: Here we make large values to small and vice versa
    * Square transformation: Used for left skewed data
    * Square root transformation
  * Power transformer
    * Box-Cox
    * Yeo-Johnson
  * Quantile transformer (not used much)

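* How to know if data is normal
  * Use sns.histplot(kde=True) or sns.kdeplot() to plot the distribution (sns.distplot() is deprecated in recent seaborn versions)
  * Use the pandas skew() method (Series.skew() or DataFrame.skew()); a value close to 0 means the data is symmetric, which is necessary but not sufficient for normality
  * QQ plot (most reliable way)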

Let's study a few of them:

### Log transform: 
* For example, if we have a column `age` that is log-normally distributed, then taking the log of each value gives normally distributed values. With real data the result is usually not exactly normal, but it is closer to normal than before.
* Log transform does not work with zero or negative values, as the logarithm is only defined for positive numbers.
* Applying a log transform to right-skewed data pulls the long right tail in towards the center. So **the log transformation should be applied to right-skewed data**.
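
A quick way to see the effect is to compare the skewness before and after the transform. This is a minimal sketch on synthetic data; the `skewness` helper computes the plain moment-based sample skewness and the variable name `incomes` is only illustrative.

```python
import math
import random


def skewness(values):
    """Sample skewness: third central moment divided by std ** 3."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(7)
incomes = [rng.lognormvariate(10.0, 0.8) for _ in range(5000)]  # right-skewed data
logged = [math.log(v) for v in incomes]  # requires strictly positive values

skew_before = skewness(incomes)
skew_after = skewness(logged)
print(f"skew before: {skew_before:.2f}")
print(f"skew after:  {skew_after:.2f}")
```

The skewness before the transform is clearly positive, while after the transform it is close to 0, which matches the idea that the log pulls the right tail towards the center.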

### Box Cox transformation:
* This is a more general transformation, of which the log (λ = 0), square root (λ = 0.5) and square (λ = 2) transformations are special cases (up to a shift and scale).
* At the core of the Box-Cox transformation is an exponent, lambda (λ), which is typically searched over the range -5 to 5. All values of λ in that range are considered and the optimal value for your data is selected; the “optimal value” is the one which results in the best approximation of a normal distribution curve. Box-Cox requires strictly positive data. The transformation of Y has the form:

![box-cox.png](attachment:box-cox.png)
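
For reference, here is a small sketch of the transformation for a single value, assuming the standard Box-Cox form: (y^λ - 1) / λ when λ ≠ 0, and log(y) when λ = 0. The function name `box_cox` is chosen for this example; it does not search for the optimal λ.

```python
import math


def box_cox(y: float, lam: float) -> float:
    """Box-Cox transform of a single strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)  # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam


print(box_cox(4.0, 0))    # log(4), about 1.386
print(box_cox(4.0, 0.5))  # (sqrt(4) - 1) / 0.5 = 2.0
print(box_cox(4.0, 2))    # (16 - 1) / 2 = 7.5
```

The three calls correspond to the log, square root and square special cases mentioned above.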


Let's study some discrete distribution:

## 4. Bernoulli Distribution:

* Bernoulli distribution is a **probability distribution that models a binary outcome**, where the outcome can be either success (represented by the value 1) or failure (represented by the value 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, who first introduced it in the late 1600s.
* Examples of this distribution are a single coin toss and the label predicted by a binary spam classifier.
* The Bernoulli distribution is characterized by a single parameter, the probability of success, denoted by p.

* The probability mass function (PMF) of the Bernoulli distribution is:
      P(X=x) = pˣ * (1-p)¹⁻ˣ
      // where x is 0 or 1, p is the probability of success and 1-p is the probability of failure
      
* In a fair coin toss, if we consider heads as success (p = 1/2), the probabilities of the values 1 and 0 are:
      P(X=1) = (1/2)^1 * (1-1/2)^0 => 1/2
      P(X=0) = (1/2)^0 * (1-1/2)^1 => 1/2
* To see what its PMF looks like: https://en.wikipedia.org/wiki/Bernoulli_distribution

* The Bernoulli distribution is commonly used in machine learning for modelling binary outcomes i.e. binary classifier, such as whether a customer will make a purchase or not, whether an email is spam or not, or whether a patient will have a certain disease or not.
* So, in binary classification algorithms like Logistic Regression, SVM, Decision Trees, Naive Bayes, etc., the target label of each row can be treated as a draw from a Bernoulli distribution.
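
The Bernoulli PMF from this section, written as a small function. This is a minimal sketch; the name `bernoulli_pmf` is chosen for illustration.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) for a Bernoulli(p) variable; x must be 0 or 1."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return p ** x * (1.0 - p) ** (1 - x)


print(bernoulli_pmf(1, 0.5))   # fair coin, heads: 0.5
print(bernoulli_pmf(0, 0.5))   # fair coin, tails: 0.5
print(bernoulli_pmf(0, 0.25))  # failure when p = 0.25: 0.75
```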


## 5. Binomial Distribution:
* Binomial distribution is a probability distribution that describes the number of successes in a fixed number of **independent Bernoulli trials** with two possible outcomes (often called "success" and "failure"), where the probability of success is constant for each trial. The binomial distribution is characterized by two parameters: the number of trials n and the probability of success p.

* In a Bernoulli experiment we perform the experiment once, but in a binomial experiment we perform it `n` times. For example, a single coin toss can be represented by a Bernoulli distribution, and the number of heads in 5 coin tosses can be represented by a binomial distribution.

* Here the PMF (the distribution is discrete, so it has a PMF rather than a PDF) is:  
      P(X=x) = ⁿCₓ pˣ (1-p)ⁿ⁻ˣ
      
      // where n is the number of trials, p is the probability of success, and x is the number of successes we are interested in (e.g. for 3 heads out of 5 tosses, x is 3)
      
* **Criteria required for Binomial Distributions**:
  1. The process consists of n trials
  2. Only 2 exclusive outcomes are possible, a success and a failure.
  3. P(success) = p and P(failure) = 1-p and it is fixed from trial to trial
  4. The trials are independent.
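
When the criteria above hold, the binomial PMF can be evaluated directly. The sketch below uses `math.comb` for ⁿCₓ and reuses the coin example of 3 heads in 5 tosses; the function name `binomial_pmf` is chosen for illustration.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x): probability of exactly x successes in n independent trials."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


print(binomial_pmf(3, 5, 0.5))  # 0.3125

# The probabilities over all possible counts 0..5 sum to 1.
total = sum(binomial_pmf(x, 5, 0.5) for x in range(6))
print(total)  # 1.0
```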
  

* **Uses in ML**:
1. **Binary classification problems**: In binary classification problems, we often model the outcome for each example as a binomial distribution with a single trial (n = 1, i.e. a Bernoulli distribution). For example, in a spam detection system, we may model whether an email is spam or not spam this way, with p being the probability of spam.

2. **Hypothesis testing**: In statistical hypothesis testing, we use the binomial distribution to calculate the probability of observing a certain number of successes in a given number of trials, assuming a null hypothesis is true. This can be used to make decisions about whether a certain hypothesis is supported by the data or not.


3. **Logistic regression**: Logistic regression is a popular machine learning algorithm used for classification problems. It models the probability of an event happening as a logistic function of a linear combination of the input variables. The output of logistic regression can therefore be thought of as the success probability p of a binomial distribution with a single trial.


4. **A/B testing**: A/B testing is a common technique used to compare two different versions of a product, web page, or marketing campaign. In A/B testing, we randomly assign individuals to one of two groups and compare the outcomes of interest between the groups. Since the individual outcomes are often binary (e.g., whether a user clicked through or converted), the binomial distribution can be used to model the number of successes in each group and test for differences between the groups.
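
As a rough illustration of the hypothesis testing use, the sketch below computes a one-sided p-value from the binomial distribution. The experiment (60 heads in 100 tosses, null hypothesis p = 0.5) is hypothetical, and `binomial_tail` is a helper written for this example rather than a library function.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): a one-sided p-value under the null."""
    return sum(comb(n, x) * p ** x * (1.0 - p) ** (n - x) for x in range(k, n + 1))


# Hypothetical experiment: 60 heads in 100 tosses, null hypothesis p = 0.5.
p_value = binomial_tail(60, 100, 0.5)
print(f"p-value: {p_value:.4f}")
```

If the p-value falls below the chosen significance level (0.05 is a common choice), we would reject the null hypothesis that the coin is fair. The same idea carries over to A/B tests, where the counts are conversions instead of heads.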

## 6. Poisson Distribution:
* A Poisson distribution is a discrete probability distribution. It gives the probability of an event happening a certain number of times (k) within a given interval of time or space.
* We can use a Poisson distribution to predict or explain the number of events occurring within a given interval of time or space. “Events” could be anything from disease cases to customer purchases to meteor strikes. The interval can be any specific amount of time or space, such as 10 days or 5 square inches.

* We can use a Poisson distribution if:
    - Individual events happen at random and independently. That is, the occurrence of one event doesn’t affect the probability of another event.
    - We know the mean number of events occurring within a given interval of time or space. This number is called λ (lambda), and it is assumed to be constant.

* The probability of observing exactly k events is given by the formula:
      P(X=k) = (e^−λ * λ^k) / k!

    where:
    λ is the average number of events per interval,
    k is the number of occurrences,
    e is the base of the natural logarithm (approximately equal to 2.71828),
    k! denotes the factorial of k


* **Example**: Number of Emails Received in an Hour: Suppose you typically receive an average of 5 emails per hour at your office. You can model the number of emails you receive in any given hour using a Poisson distribution with λ=5

**Question**: What is the probability of receiving exactly 3 emails in the next hour?

**Answer**:

Using the Poisson formula:
    P(X=3) = (e^−5 * 5^3) / 3! ≈ 0.006738 × 125 / 6 ≈ 0.1404


So, there is approximately a 14.04% chance that you will receive exactly 3 emails in the next hour.

This model is helpful because it gives you a way to predict the likelihood of different outcomes (e.g., receiving no emails, or more than 10 emails), which can be useful for planning and resource allocation.
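
The email example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `poisson_pmf` is chosen for illustration, and it also evaluates the two other outcomes mentioned above (no emails, and more than 10 emails).

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson variable with mean rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 5.0  # average emails per hour, as in the example
p_exactly_3 = poisson_pmf(3, lam)
p_none = poisson_pmf(0, lam)
# P(X > 10) is 1 minus the probability of receiving 0 to 10 emails.
p_more_than_10 = 1.0 - sum(poisson_pmf(k, lam) for k in range(11))

print(f"P(X = 3)  = {p_exactly_3:.4f}")  # 0.1404
print(f"P(X = 0)  = {p_none:.4f}")       # 0.0067
print(f"P(X > 10) = {p_more_than_10:.4f}")
```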