# Probability

- **Probability distributions:** A probability distribution is a function associated with a random variable that gives the probability of each possible outcome (or, for continuous variables, of each range of outcomes).

## Why Do We Need a Probability Space?

A probability space is a mathematical construct used to model a random experiment. It provides a formal framework for calculating the probabilities of different outcomes. Here's why we need it:

- **Modeling Uncertainty**: A probability space allows us to quantify the uncertainty inherent in a random experiment. It provides a way to assign a numerical probability to each possible outcome.

- **Predicting Outcomes**: By defining a probability space, we can use the laws of probability to predict the outcomes of complex experiments. This is crucial in many fields, including statistics, physics, computer science, and finance.

- **Analyzing Events**: A probability space allows us to analyze the relationships between different events. For example, we can use it to determine whether two events are independent or to calculate the probability of one event given that another event has occurred.

- **Theoretical Consistency**: The concept of a probability space arises naturally from the axioms of probability theory. It ensures that our calculations are theoretically consistent and well-defined.

### Limitations of Classic Boolean Logic

Classic Boolean logic, also known as binary or two-valued logic, is based on the principle that every proposition is either true or false. This makes it a powerful tool for formal reasoning, but it also has certain limitations:

- **Uncertainty**: Boolean logic doesn't handle uncertainty well. In many real-world situations, we don't have complete information, and we need to reason based on probabilities or degrees of belief. Boolean logic doesn't allow for this - a statement is either true or false, with no in-between.

- **Vagueness**: Boolean logic struggles with vagueness or ambiguity. For example, consider the statement "John is tall." Whether this is true or false can depend on how we define "tall", and different people might have different definitions. Boolean logic doesn't provide a way to capture this kind of subjective or context-dependent truth.

- **Inconsistencies**: In classical logic, two contradictory statements (one asserting a proposition, the other denying it) make the whole system trivial: from a contradiction, any statement whatsoever can be derived (this is known as the principle of explosion). In real-world reasoning, we often have to deal with inconsistent information and still find a way to make the best decision based on what is available.


## Probability and Random Variables

- **Probability**: Probability is a mathematical framework for quantifying our uncertainty. It provides a way of summarizing the uncertainty that comes from our laziness and ignorance. It's a way of expressing knowledge or belief that an event will occur or has occurred. In a more formal sense, probability can be seen as a measure on a set of events. If we have a set of events, probability gives a measure to each of these events, and this measure lies between 0 and 1.

- **Random Variables**: A random variable is a function that maps each outcome of a random process to a number. Every outcome gets exactly one value, but different outcomes may share the same value (for example, when counting heads in two coin flips, the outcomes HT and TH both map to 1). We often use random variables to simplify the analysis of complex experiments, because they let us work with numbers instead of raw outcomes.
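
As a minimal sketch of the "function from outcomes to numbers" idea, the snippet below defines a random variable that counts heads in two coin flips. The names `sample_space` and `num_heads` are illustrative choices, not part of any library.

```python
from itertools import product

# Sample space for two coin flips: each outcome is a tuple such as ('H', 'T').
sample_space = list(product("HT", repeat=2))


def num_heads(outcome):
    """Random variable X: maps an outcome to the number of heads it contains."""
    return outcome.count("H")


# Apply X to every outcome. Note that HT and TH map to the same value.
values = {"".join(outcome): num_heads(outcome) for outcome in sample_space}
print(values)
```

The random variable itself is deterministic; the randomness lives entirely in which outcome the experiment produces.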

### Sample Space

- **Intuition**: The sample space gives us a way to list all the things that could possibly happen when we perform a random experiment. It's a way of laying out all the possibilities up front, so we can then start to reason about which outcomes are likely or unlikely, which sets of outcomes constitute events of interest, and so on. The sample space forms the foundation for defining probability measures and random variables.
- **Sample Space**: The sample space, often denoted by `S` or `Ω`, is the set of all possible outcomes of a random experiment. It's the universe in which the experiment takes place. For example, if you roll a six-sided die, the sample space is `{1, 2, 3, 4, 5, 6}`. If you flip a coin twice, the sample space is `{HH, HT, TH, TT}`.
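
The two sample spaces mentioned above can be written down directly as sets. This is only a sketch; the variable names are assumptions made for illustration.

```python
from itertools import product

# Sample space for one roll of a six-sided die.
die_space = set(range(1, 7))

# Sample space for two coin flips, written as strings like "HT".
two_flip_space = {"".join(flips) for flips in product("HT", repeat=2)}

# The same construction scales: n flips give 2**n outcomes.
three_flip_space = {"".join(flips) for flips in product("HT", repeat=3)}

print(sorted(die_space))
print(sorted(two_flip_space))
print(len(three_flip_space))
```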


### Event Space

- **Intuition**: The event space allows us to group outcomes together into events, which we can then assign probabilities to. It's a way of structuring the sample space into meaningful categories that we care about. The event space forms the basis for defining a probability measure.
- **Event Space**: The event space, often denoted by `F` or `ℰ`, is a collection of events, where each event is a set of outcomes from the sample space. For example, if the sample space is `{1, 2, 3, 4, 5, 6}` for a roll of a die, an event could be "rolling an even number", which corresponds to the set `{2, 4, 6}`. The event space must satisfy certain properties (it must be a σ-algebra) to ensure that probabilities can be properly assigned.
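
For a finite sample space, the simplest valid event space is the set of all subsets (the power set). The sketch below builds it for the die example and checks the closure properties a σ-algebra requires; on a finite set, closure under pairwise unions is enough. The helper `power_set` is a hypothetical name introduced here.

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3, 4, 5, 6})


def power_set(s):
    """Return every subset of s as a frozenset."""
    items = sorted(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}


event_space = power_set(omega)
even = frozenset({2, 4, 6})  # the event "rolling an even number"

# Properties of a sigma-algebra, checked by brute force.
contains_omega = omega in event_space
closed_under_complement = all(omega - e in event_space for e in event_space)
closed_under_union = all(
    a | b in event_space for a in event_space for b in event_space
)

print(len(event_space), even in event_space)
print(contains_omega, closed_under_complement, closed_under_union)
```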


### Probability

- **Intuition**: Probability provides a way to quantify uncertainty. It gives us a measure of our confidence in the occurrence of an event. Informally, a probability of 0 means the event will not occur, a probability of 1 means the event is certain to occur, and a probability of 0.5 means the event is equally likely to occur or not occur. (In continuous spaces this holds only "almost surely": every individual value has probability 0, yet some value is always observed.)
- **Probability**: Probability is a mathematical measure that assigns a value between 0 and 1 to an event, indicating how likely that event is to occur. When all outcomes are equally likely, the probability of an event is the ratio of the number of favorable outcomes to the total number of outcomes. For example, for a fair six-sided die, the probability of rolling a 3 is 1/6. When outcomes are not equally likely, the probability of an event is the sum of the probabilities of the outcomes it contains.
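
The following sketch computes event probabilities for a die in two ways: by counting outcomes (valid only when outcomes are equally likely) and by summing per-outcome weights (valid in general). The loaded-die weights are made up for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}


def prob_fair(event):
    """P(event) for a fair die: favorable outcomes over total outcomes."""
    return Fraction(len(event & omega), len(omega))


# Hypothetical loaded die: faces 1-5 have weight 1/10 each, face 6 has 1/2.
loaded = {face: Fraction(1, 10) for face in range(1, 6)}
loaded[6] = Fraction(1, 2)


def prob_loaded(event):
    """P(event) for the loaded die: sum the weights of the outcomes in it."""
    return sum((loaded[face] for face in event & omega), Fraction(0))


print(prob_fair({3}))            # a single face of the fair die
print(prob_fair({2, 4, 6}))      # the event "even number"
print(prob_loaded({6}))          # counting would be wrong here
print(prob_loaded(omega))        # total probability
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to confirm that the total probability is exactly 1.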


### Probability in Machine Learning

- **Intuition**: Think of the probability space as a complex, high-dimensional cloud of points. Each point represents a possible state of the world, and the density of points in different regions represents the probability of different states. In machine learning, we're often not interested in the shape of this cloud in detail. Instead, we're interested in how certain quantities (the targets) vary across the cloud. So we focus on the probabilities of these quantities, rather than the probabilities of the individual points in the cloud.

- **Why we avoid explicitly referring to the probability space**: In machine learning, we often work with high-dimensional data and complex models. The underlying probability space of these can be extremely complex and difficult to work with directly. Moreover, the exact details of this space are often not known or not important for our purposes. What we care about is making good predictions, not modeling the precise details of the underlying randomness.

- **Why we refer to probabilities on quantities of interest (target space)**: In machine learning, our goal is often to make predictions about certain quantities of interest. These might be class labels in a classification problem, real numbers in a regression problem, etc. We use probabilities to quantify our uncertainty about these quantities. This allows us to make probabilistic predictions, which can be more useful and informative than deterministic predictions.
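
To make the contrast between probabilistic and deterministic predictions concrete, here is a small sketch. The labels and numbers are invented; they stand in for the output of some hypothetical classifier on a single input.

```python
# Hypothetical classifier output for one input: a distribution over labels
# in the target space, with no reference to the underlying probability space.
predicted = {"cat": 0.7, "dog": 0.2, "bird": 0.1}

# A valid probabilistic prediction is non-negative and sums to 1.
assert all(p >= 0 for p in predicted.values())
assert abs(sum(predicted.values()) - 1.0) < 1e-9

# A deterministic prediction keeps only the most likely label ...
label = max(predicted, key=predicted.get)

# ... while the probabilistic prediction also says how confident we are.
confidence = predicted[label]

print(label, confidence)
```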


# Statistics 

- Using probability, we can model the uncertainty inherent in our data and make predictions about future events. 
- Statistics, on the other hand, is the science of collecting, analyzing, and interpreting data. It provides a way to summarize and interpret the information contained in our data.

## Discrete and Continuous Probabilities

- **Discrete Probability**: Discrete probability is used to model random variables that take on a finite or countably infinite number of distinct values. For example, the number of heads in a series of coin flips is a discrete random variable, as is the number of cars passing through a toll booth in a given hour.

- **Continuous Probability**: Continuous probability is used to model random variables that can take on any value within a given range. For example, the height of a person, the weight of a package, and the time it takes to complete a task are all continuous random variables.

### Discrete Probability

#### Probability Mass Function (PMF)

- The PMF is a function that gives the probability that a discrete random variable is exactly equal to some value. For example, if we have a random variable that represents the outcome of a roll of a fair six-sided die, the PMF would assign a probability of 1/6 to each of the outcomes 1, 2, 3, 4, 5, and 6.

- **Intuition**: The PMF can be thought of as a map that describes the landscape of a discrete random variable. Each possible outcome of the variable is like a location on the map, and the PMF tells you how high the terrain is at that location, in other words, how likely that outcome is. The heights (the probability masses) always sum to 1, representing the total probability of all possible outcomes.

- **Properties**: The PMF must satisfy certain properties. It must be non-negative (the probability of an outcome can't be negative), and the sum of the probabilities of all possible outcomes must be 1 (the total probability must be 1).
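
As an illustration slightly richer than a single die, the sketch below builds the PMF of the sum of two fair dice by counting outcomes, then checks the two properties listed above. The choice of example and the variable names are assumptions made here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes produce each total, then normalize.
counts = Counter(a + b for a, b in outcomes)
pmf = {total: Fraction(n, len(outcomes)) for total, n in counts.items()}

# PMF properties: non-negative values that sum to exactly 1.
is_non_negative = all(p >= 0 for p in pmf.values())
total_mass = sum(pmf.values())

for total in sorted(pmf):
    print(total, pmf[total])
print(is_non_negative, total_mass)
```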

#### Joint Probability

- **Joint Probability**: The joint probability of two events, A and B, is the probability of both events happening at the same time. It's denoted as P(A ∩ B) or P(A, B). For independent events, the joint probability is the product of the probabilities of the two events, P(A ∩ B) = P(A) P(B). For dependent events, the product rule holds only in its general form, P(A ∩ B) = P(A | B) P(B), so you need to know how the events are related (the conditional probability of one given the other) to calculate the joint probability.

- **Intuition**: You can think of joint probability as a measure of the overlap between two events. If you have two coins and you want to know the probability that both will land on heads, you're asking for a joint probability. If the coins are fair and independent, the joint probability is (1/2) * (1/2) = 1/4, because the outcome of one coin doesn't affect the outcome of the other.
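
The two-coin example can be checked by enumeration. The sketch below also includes a pair of dependent events, chosen here for illustration, where multiplying the individual probabilities gives the wrong answer.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes of flipping two fair coins.
omega = list(product("HT", repeat=2))


def prob(event):
    """P(event), where event is a predicate on outcomes."""
    return Fraction(sum(1 for o in omega if event(o)), len(omega))


def first_heads(o):
    return o[0] == "H"


def second_heads(o):
    return o[1] == "H"


def at_least_one_heads(o):
    return "H" in o


# Independent events: the joint probability equals the product.
p_joint = prob(lambda o: first_heads(o) and second_heads(o))
p_product = prob(first_heads) * prob(second_heads)

# Dependent events: the joint probability differs from the product.
p_dep_joint = prob(lambda o: first_heads(o) and at_least_one_heads(o))
p_dep_product = prob(first_heads) * prob(at_least_one_heads)

print(p_joint, p_product)
print(p_dep_joint, p_dep_product)
```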

#### Marginal Probability

- **Marginal Probability**: The marginal probability of an event is the probability of that event happening, ignoring all other variables. It's called "marginal" because it's often calculated by summing probabilities along the margins of a probability table. For example, if you have a joint probability distribution over two random variables X and Y, the marginal probability that X takes the value x is the sum of the joint probabilities P(X = x, Y = y) over all possible values y of Y.

- **Intuition**: You can think of marginal probability as a kind of summary of the probability distribution over a subset of the variables. It tells you how likely an event is without reference to the outcomes of other variables. For example, if you have a bag of colored balls and you want to know the probability of drawing a red ball, regardless of its size, you're asking for a marginal probability.
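
The colored-ball example can be sketched as a small joint table over (color, size). The probabilities below are invented for illustration, and `marginal` is a hypothetical helper, not a library function.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution over (color, size) of a ball drawn from a bag.
joint = {
    ("red", "small"): Fraction(2, 10),
    ("red", "large"): Fraction(1, 10),
    ("blue", "small"): Fraction(3, 10),
    ("blue", "large"): Fraction(4, 10),
}


def marginal(joint_table, axis):
    """Sum out every variable except the one at position `axis` of the key."""
    result = defaultdict(Fraction)  # Fraction() is zero
    for key, p in joint_table.items():
        result[key[axis]] += p
    return dict(result)


color_marginal = marginal(joint, axis=0)  # ignores size
size_marginal = marginal(joint, axis=1)   # ignores color

print(color_marginal)
print(size_marginal)
```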

#### Conditional Probability

- **Conditional Probability**: The conditional probability of an event A given that another event B has occurred is denoted as P(A | B). It's calculated as the joint probability of A and B divided by the probability of B. This is based on the definition P(A | B) = P(A ∩ B) / P(B), assuming P(B) > 0.

- **Intuition**: Conditional probability can be thought of as updating our belief about an event based on new information. For example, if you have a deck of cards and you want to know the probability that a drawn card is a king given that it's a face card, you're asking for a conditional probability. In a standard 52-card deck, the unconditional probability of a king is 4/52 = 1/13. Knowing that the card is a face card (jack, queen, or king) shrinks the set of possible outcomes to 12 cards, 4 of which are kings, so the conditional probability is 4/12 = 1/3.
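
The card example can be verified by applying the definition P(A | B) = P(A ∩ B) / P(B) to an enumerated deck. This sketch assumes a standard 52-card deck with jack, queen, and king as the face cards; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))  # 52 equally likely (rank, suit) cards


def prob(event):
    """P(event) for one card drawn uniformly at random."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b), defined only when P(b) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("cannot condition on an event with probability 0")
    return prob(lambda card: a(card) and b(card)) / p_b


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in {"J", "Q", "K"}


print(prob(is_king))                  # before learning anything
print(conditional(is_king, is_face))  # after learning it is a face card
```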

### Why Discrete Probability Distributions for Categorical Variables

- **Distinct Outcomes**: Categorical variables have a finite number of distinct categories or outcomes. Discrete probability distributions can model this because they assign probabilities to distinct outcomes.

- **Order Doesn't Matter**: The categories of a (nominal) categorical variable have no inherent order. A discrete probability distribution accommodates this, because it only needs a probability for each outcome and does not require the outcomes to be ordered or numeric.

- **Easy to Understand and Implement**: Discrete probability distributions are relatively simple to understand and implement. They can be easily visualized and computed, which makes them practical for modeling categorical variables.

- **Intuition**: Discrete probability distributions are like a bag of differently colored balls. Each color represents a category, and the proportion of each color in the bag represents the probability of that category. When you draw a ball from the bag (i.e., observe a value of the variable), you get one of the categories, and the color of the ball tells you which one.

- **Example**: Let's say you're studying the pet preferences of a group of people, and the options are "dog", "cat", "bird", or "other". You could model this with a discrete probability distribution where each outcome is one of the four pet categories. If 50% of people prefer dogs, 30% prefer cats, 15% prefer birds, and 5% prefer other pets, then your distribution would assign a probability of 0.5 to "dog", 0.3 to "cat", 0.15 to "bird", and 0.05 to "other".
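
The pet-preference distribution can be sketched as a dictionary and sampled with the standard library, which mirrors the "drawing balls from a bag" intuition. The seed and sample size are arbitrary choices made here so the run is repeatable.

```python
import random
from collections import Counter

# The distribution from the example above.
pet_pmf = {"dog": 0.5, "cat": 0.3, "bird": 0.15, "other": 0.05}

# Draw many independent observations of the categorical variable.
rng = random.Random(0)  # fixed seed, only for reproducibility
samples = rng.choices(
    population=list(pet_pmf),
    weights=list(pet_pmf.values()),
    k=10_000,
)

# Empirical frequencies should be close to the specified probabilities.
counts = Counter(samples)
freq = {pet: counts[pet] / len(samples) for pet in pet_pmf}

for pet in pet_pmf:
    print(f"{pet:>5}: specified {pet_pmf[pet]:.2f}, observed {freq[pet]:.3f}")
```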

## Continuous Probability

### Technicalities of Continuous Spaces

- **Infinite Outcomes**: In a continuous space, there are uncountably many possible outcomes. For example, if you're measuring the height of a person, every real number in some interval is theoretically a possible height, because height is a continuous variable.

- **Probability of a Single Outcome**: In a continuous probability distribution, the probability of any single specific outcome is zero. This is not merely because there are infinitely many outcomes (a discrete distribution over a countably infinite set can still give each outcome positive probability). It is because probability is spread over a continuum and is obtained by integrating a density, and an interval of zero width has zero area. This is why we talk about the probability of a range of outcomes when dealing with continuous variables.

- **Probability Density Function (PDF)**: The PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. The probability of the variable falling within a given range is given by the integral of the variable's PDF over that range.

- **Cumulative Distribution Function (CDF)**: The CDF of a continuous random variable is a function that gives the probability that the variable is less than or equal to a certain value. It's the integral of the PDF from negative infinity up to the given value.
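
One way to see why single outcomes have probability zero is to shrink an interval around a point and watch its probability vanish. The sketch below uses a standard normal distribution purely as an example; `interval_prob` is a hypothetical helper.

```python
from statistics import NormalDist

X = NormalDist(mu=0.0, sigma=1.0)  # example distribution, chosen arbitrarily


def interval_prob(dist, lo, hi):
    """P(lo <= X <= hi), computed as a difference of CDF values."""
    return dist.cdf(hi) - dist.cdf(lo)


# Intervals centered at 0 with shrinking width.
widths = [1.0, 0.1, 0.01, 0.001]
probs = [interval_prob(X, -w / 2, w / 2) for w in widths]

for w, p in zip(widths, probs):
    # For small widths, the probability is roughly density * width.
    print(f"width {w:<6} probability {p:.6f}  pdf(0) * width {X.pdf(0) * w:.6f}")
```

As the width goes to zero, the probability goes to zero with it, even though the density at the point stays fixed.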

### Probability Density Function (PDF)

- The PDF is a function that describes the relative likelihood of a continuous random variable taking values near a given point. Unlike a probability mass function, the PDF doesn't give you the probability of a specific outcome; instead, it gives you the density of probability at that point, which must be non-negative but may exceed 1. The probability of an outcome within a certain range is given by the area under the PDF curve within that range.

- **Intuition**: You can think of the PDF as a landscape, where the height of the landscape at any point represents the density of probability at that point. The total area under the curve is 1, representing the total probability of all possible outcomes.

- **Example**: Suppose you have a random variable that represents the height of adult women in a certain population, and this variable follows a normal distribution (a common type of continuous distribution). The PDF of this distribution is a bell curve, where the peak of the curve represents the average height. The probability of a woman's height being within a certain range (say, between 5'4" and 5'6") is given by the area under the curve between those two values.
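
The height example can be sketched numerically. The mean of 64 inches and standard deviation of 2.5 inches are assumed values chosen for illustration, not measured data. The snippet integrates the PDF between 5'4" (64 in) and 5'6" (66 in) with a simple trapezoidal rule and compares the result with the CDF difference.

```python
from statistics import NormalDist

# Hypothetical parameters, in inches.
heights = NormalDist(mu=64.0, sigma=2.5)


def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Area under the PDF between 64 and 66 inches.
area = integrate(heights.pdf, 64.0, 66.0)
exact = heights.cdf(66.0) - heights.cdf(64.0)
print(f"area under PDF: {area:.6f}, CDF difference: {exact:.6f}")

# A density is not a probability: a narrow distribution has density above 1.
narrow = NormalDist(mu=0.0, sigma=0.1)
print(f"density at the peak of the narrow distribution: {narrow.pdf(0.0):.3f}")
```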

### Cumulative Distribution Function (CDF)

- The CDF of a random variable is a function that gives the probability that the variable is less than or equal to a certain value. It's the integral of the PDF from negative infinity up to the given value.

- **Intuition**: You can think of the CDF as a way to accumulate probabilities. As you move along the x-axis from negative infinity to a specific value, the CDF tells you the total probability that you've accumulated up to that point.

- **Example**: Let's go back to the example of the random variable representing the height of adult women. The CDF at a certain height (say, 5'6") gives the probability that a randomly selected woman is 5'6" or shorter. If the CDF at 5'6" is 0.8, that means there's an 80% chance that a randomly selected woman is 5'6" or shorter.
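
Continuing with an assumed normal model for heights (mean 64 inches, standard deviation 2.5 inches, both made up for illustration), the sketch below evaluates the CDF at 5'6" and inverts it to find the height below which 80% of the population falls.

```python
from statistics import NormalDist

# Hypothetical parameters, in inches.
heights = NormalDist(mu=64.0, sigma=2.5)

# P(height <= 66 inches), i.e. the CDF evaluated at 5'6".
p_at_66 = heights.cdf(66.0)

# The inverse question: which height has 80% of the population at or below it?
h_80 = heights.inv_cdf(0.8)

# The CDF never decreases as we move right along the x-axis.
grid = [56.0 + 0.5 * i for i in range(33)]  # 56 to 72 inches
cdf_values = [heights.cdf(x) for x in grid]
is_monotone = all(a <= b for a, b in zip(cdf_values, cdf_values[1:]))

print(f"CDF at 66 in: {p_at_66:.3f}")
print(f"80th percentile: {h_80:.2f} in")
print(is_monotone)
```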

### Comparing Discrete and Continuous Distributions

- **Discrete Distributions**: Discrete distributions are used when the variable can take on a finite or countably infinite number of distinct values. Each value the variable can actually take has a non-zero probability. Examples include the binomial and Poisson distributions.

- **Continuous Distributions**: Continuous distributions are used when the variable can take on uncountably many values within a certain range. The probability of any single, specific outcome is zero; instead, we talk about the probability of the variable falling within a range of values. Examples include the normal and exponential distributions.

- **Comparison**: The main difference between discrete and continuous distributions lies in the type of variable they represent. Discrete distributions are used for variables that can only take on distinct, separate values (like the number of heads in a series of coin flips), while continuous distributions are used for variables that can take on any value within a certain range (like the height of a person).

### Uniform Distributions

- A uniform distribution is a type of probability distribution in which all outcomes are equally likely (the continuous version is also known as a rectangular distribution). Drawing one card from a well-shuffled deck is an example: each card has the same probability of being drawn.

- **Discrete Uniform Distribution**: This is a symmetric probability distribution whereby a finite number of outcomes are equally likely to happen. An example of a discrete uniform distribution is the roll of a fair die. The outcomes 1, 2, 3, 4, 5, 6 are all equally likely and have an equal probability of 1/6.

- **Continuous Uniform Distribution**: This is a symmetric probability distribution whereby the density is constant over a specified range. An example would be the random selection of a point between 0 and 1. Each individual point has probability zero, but any two subintervals of equal length are equally likely to contain the chosen point.
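
Both kinds of uniform distribution fit in a few lines. In this sketch the function names are illustrative: the discrete case assigns 1/n to each of n values, and the continuous case assigns probability proportional to the length of the overlap with the range.

```python
from fractions import Fraction


def discrete_uniform_pmf(values):
    """PMF that gives every listed value the same probability."""
    values = list(values)
    return {v: Fraction(1, len(values)) for v in values}


def continuous_uniform_prob(lo, hi, a=0.0, b=1.0):
    """P(lo <= X <= hi) for X uniform on [a, b]: overlap length / total length."""
    left = max(lo, a)
    right = min(hi, b)
    return max(0.0, right - left) / (b - a)


die_pmf = discrete_uniform_pmf(range(1, 7))
print(die_pmf[3], sum(die_pmf.values()))

# Equal-length subintervals are equally likely; a single point has probability 0.
print(continuous_uniform_prob(0.2, 0.5))
print(continuous_uniform_prob(0.6, 0.9))
print(continuous_uniform_prob(0.5, 0.5))
```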