# Probability

### What Is Probability and Why It Matters for ML?
Probability is simply a measure of uncertainty. It tells us how likely something is to happen.
Machine Learning relies on probability because models don’t make absolute decisions; they estimate the likelihood of outcomes based on data.

When all outcomes are equally likely, the probability of an event is:

P(Event) = Number of Favorable Outcomes / Total Number of Outcomes

### Key Probability Concepts
- **Sample Space (S)**: The set of all possible outcomes of an experiment.
- **Event (E)**: A subset of the sample space; an outcome or a set of outcomes.
    - **Simple Event**: An event with a single outcome.
    - **Compound Event**: An event with multiple outcomes.
    - **Complementary Events**: The complement of an event E, denoted E', is the event that E does not occur: every outcome in S that is not in E. Together they cover the whole sample space: E ∪ E' = S.
- **Mutually Exclusive Events**: Two events that cannot occur at the same time.
- **Independent Events**: Two events are independent if the occurrence of one does not affect the probability of the other.
- **Conditional Probability**: The probability of an event occurring given that another event has occurred.

### Basic Probability Rules
1. **Range**: Probability values range from 0 to 1: 0 <= P(E) <= 1
   - P(E) = 0 means the event will not occur.
   - P(E) = 1 means the event is certain to occur.
2. **The Complement Rule**: P(E) + P(E') = 1
    - E is the event, E' is the complement of E.
3. **Addition Rule**: For mutually exclusive events A and B, P(A or B) = P(A) + P(B)
4. **Multiplication Rule**: For independent events A and B, P(A and B) = P(A) * P(B)
5. **Conditional Probability**: The probability of an event occurring given that another event has occurred.
   - P(A|B) = P(A and B) / P(B), provided P(B) > 0
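
The rules above can be checked by counting outcomes. The following is a minimal sketch, assuming a fair six-sided die as the experiment; the helper names `prob` and `cond_prob` and the events `even`, `low`, and `high` are made up for illustration.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (all outcomes equally likely).
S = {1, 2, 3, 4, 5, 6}

def prob(event, space=S):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event & space), len(space))

def cond_prob(a, b, space=S):
    """P(A|B) = P(A and B) / P(B); requires P(B) > 0."""
    if prob(b, space) == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b, space) / prob(b, space)

even = {2, 4, 6}
odd = S - even   # complement of `even`
low = {1, 2}
high = {5, 6}    # `low` and `high` share no outcomes: mutually exclusive

# Complement rule: P(E) + P(E') = 1
assert prob(even) + prob(odd) == 1
# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
assert prob(low | high) == prob(low) + prob(high)

print(prob(even))             # 1/2
print(prob(low | high))       # 2/3
print(cond_prob(even, high))  # 1/2
```

Using `Fraction` keeps the results exact, so the rules can be compared with `==` instead of a floating-point tolerance.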

## Conditional Probability and Independence

### Conditional Probability
Conditional probability is the probability of an event occurring given that another event has already occurred. It is denoted as P(A|B), which reads as "the probability of A given B."

Formula: P(A|B) = P(A and B) / P(B), provided P(B) > 0

**Example**: Imagine a class of 100 students. 60 of them like coffee, 50 like tea, and 30 like both coffee and tea. A 2×2 table represents this data:

|               | Likes Tea | Does Not Like Tea | Total |
|---------------|-----------|-------------------|-------|
| Likes Coffee  |    30     |        30         |  60   |
| Does Not Like Coffee |    20     |        20         |  40   |
| Total         |    50     |        50         | 100   |

To find the probability that a student likes tea given that they like coffee, we use the conditional probability formula:
P(Tea|Coffee) = P(Tea and Coffee) / P(Coffee) = 30/60 = 0.5
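
The same calculation can be done directly from the table counts. This is a small sketch; the dictionary layout keyed by `(likes_coffee, likes_tea)` is just one convenient way to store the table, not something the example prescribes.

```python
from fractions import Fraction

# Counts from the 2x2 table above, keyed by (likes_coffee, likes_tea).
counts = {
    (True, True): 30,
    (True, False): 30,
    (False, True): 20,
    (False, False): 20,
}
total = sum(counts.values())

# P(Coffee): add up every cell where likes_coffee is True.
p_coffee = Fraction(sum(n for (coffee, _), n in counts.items() if coffee), total)
# P(Tea and Coffee): the single cell where both are True.
p_tea_and_coffee = Fraction(counts[(True, True)], total)
# Conditional probability formula.
p_tea_given_coffee = p_tea_and_coffee / p_coffee

print(p_coffee)            # 3/5
print(p_tea_and_coffee)    # 3/10
print(p_tea_given_coffee)  # 1/2
```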

### Independence
Two events A and B are independent if the occurrence of one does not affect the probability of the other. In other words, knowing that B has occurred does not change the probability of A.

Formally, A and B are independent if P(A|B) = P(A), provided P(B) > 0. Since P(A|B) = P(A and B) / P(B), this is equivalent to: P(A and B) = P(A) * P(B)

**Example**: Consider rolling a fair six-sided die. Let event A be "rolling an even number" and event B be "rolling a number greater than 4."
- P(A) = 3/6 = 1/2 (even numbers are 2, 4, 6)
- P(B) = 2/6 = 1/3 (numbers greater than 4 are 5, 6)

Counting outcomes directly, only the number 6 is both even and greater than 4, so P(A and B) = 1/6. This matches P(A) * P(B) = (1/2) * (1/3) = 1/6, so A and B are independent.
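
A quick way to test independence is to compare P(A and B) against P(A) * P(B) by counting. The sketch below does this for the die example; the third event `C` ("greater than 3") is an extra event added here only to show a dependent pair.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}  # fair six-sided die

def prob(event):
    return Fraction(len(event), len(S))

def independent(a, b):
    """True when P(A and B) == P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}  # even
B = {5, 6}     # greater than 4
C = {4, 5, 6}  # greater than 3 (extra event, not in the text)

print(independent(A, B))  # True:  1/6 == 1/2 * 1/3
print(independent(A, C))  # False: 1/3 != 1/2 * 1/2
```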

The Naive Bayes classifier assumes that all features are independent given the class. In practice they rarely are, but the classifier often still works surprisingly well.
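
To make the Naive Bayes assumption concrete, here is a toy sketch of how the per-feature probabilities are simply multiplied together. All numbers, the class names (`spam`, `ham`), and the words are hypothetical and chosen only for illustration; a real classifier would estimate them from data.

```python
from math import prod

# Hypothetical numbers for a two-class spam filter (illustration only).
priors = {"spam": 0.4, "ham": 0.6}
# P(word appears | class), treated as independent given the class.
likelihoods = {
    "spam": {"free": 0.6, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def naive_bayes_posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {
        c: priors[c] * prod(likelihoods[c][w] for w in words)
        for c in priors
    }
    evidence = sum(scores.values())  # normalizing constant
    return {c: s / evidence for c, s in scores.items()}

posterior = naive_bayes_posterior(["free"])
print(posterior)  # spam is about 0.8, ham about 0.2
```

The product inside `naive_bayes_posterior` is exactly where the independence assumption enters: each word contributes its own factor, regardless of the other words.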

### Conditional Independence
Two events A and B are conditionally independent given a third event C if, once we know that C has occurred, A and B are independent of each other.

Mathematically, A and B are conditionally independent given C if: P(A|B,C) = P(A|C). That is, B gives no extra information about A once we know C.

**Example**: Suppose we have three events:
- A: Whether a student passes the exam.
- B: Whether they studied with friends.
- C: Total study hours.

If we know the total study hours (event C), then whether they studied alone or with friends (B) adds very little extra info about passing (A). So: P(Pass ∣ StudiedWithFriends, Hours) ≈ P(Pass ∣ Hours)
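
The sketch below builds a small joint distribution in which A and B are independent given C, then checks the definition numerically. The probabilities are hypothetical, and C is simplified to a yes/no event ("studied many hours") so that everything can be enumerated.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical probabilities (illustration only). C = studied many hours,
# A = passes the exam, B = studied with friends.
p_c = {True: F(1, 2), False: F(1, 2)}
p_a_given_c = {True: F(9, 10), False: F(3, 10)}  # P(A=True | C)
p_b_given_c = {True: F(7, 10), False: F(2, 10)}  # P(B=True | C)

def bern(p, value):
    """Probability of `value` for a True/False variable with P(True) = p."""
    return p if value else 1 - p

# Joint distribution built so that A and B are independent given C.
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product([True, False], repeat=3)
}

def p(pred):
    """Total probability of all (a, b, c) outcomes satisfying `pred`."""
    return sum(v for k, v in joint.items() if pred(*k))

p_a_given_bc = p(lambda a, b, c: a and b and c) / p(lambda a, b, c: b and c)
p_a_given_c_only = p(lambda a, b, c: a and c) / p(lambda a, b, c: c)
p_a_given_b = p(lambda a, b, c: a and b) / p(lambda a, b, c: b)
p_a = p(lambda a, b, c: a)

print(p_a_given_bc == p_a_given_c_only)  # True: B adds nothing once C is known
print(p_a_given_b == p_a)                # False: without C, B still hints at A
```

The second check shows why the "given C" part matters: A and B look related on their own, because both depend on C.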

Real-world data often has dependencies. So instead of assuming complete independence, we assume independence given something else (a “context” variable like C).

## Bayes' Theorem & The Base-Rate Effect: How Machines Update Their Beliefs

### Concept
It’s a rule that helps us update our belief when new information appears.
We start with what we already believe — the prior probability.
Then comes new evidence — something we observe.
Bayes’ theorem combines both to produce an updated belief — the posterior probability.

Formula:
P(A|B) = [P(B|A) * P(A)] / P(B)

Where:
- P(A) is the prior probability: the initial probability of event A.
- P(B|A) is the likelihood: the probability of observing event B given that A is true.
- P(B) is the evidence: the total probability of event B.
- P(A|B) is the posterior probability: the probability of event A occurring given that B is true. It's the new belief after seeing the evidence.

### Example: The Rain and Traffic Jam
Mr. Musk is getting ready for work and notices heavy traffic outside the window. He wonders: is it because it’s raining?
We’ll translate this into Bayes terms:
* A: It’s raining.
* B: There’s a traffic jam.
##### Now we go through each part:
- P(A) → Prior:
  Before looking outside, he checks the weather app, which says there’s a 30% chance of rain today. That’s the prior belief: how likely it is to rain before seeing any evidence.
- P(B|A) → Likelihood:
  If it is raining, there’s a 90% chance of a traffic jam, because everyone drives slower and accidents increase.
- P(B) → Evidence:
  How often do traffic jams occur in general, regardless of weather? Let’s say 40% of all mornings have a traffic jam, rain or no rain.
- P(A|B) → Posterior:
  Now that he sees the traffic, what’s the updated chance it’s actually raining?

Using Bayes’ theorem: P(A|B) = (0.9 * 0.3) / 0.4 = 0.675. Seeing the traffic jam raises the probability of rain from 30% to about 68%.
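
The update can be written as a tiny function. This sketch takes P(B) = 0.40 to be the overall probability of a traffic jam on any morning (rain or not); the function name `bayes_posterior` is made up for this example.

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be greater than 0")
    return likelihood * prior / evidence

p_rain = 0.30            # prior P(A)
p_jam_given_rain = 0.90  # likelihood P(B|A)
p_jam = 0.40             # evidence P(B), assumed to be the overall jam rate

p_rain_given_jam = bayes_posterior(p_rain, p_jam_given_rain, p_jam)
print(round(p_rain_given_jam, 3))  # 0.675
```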

So Bayes’ theorem is like asking:
“How much should I trust this new clue, given what I already know about the world?”

### The Base-Rate Effect
The base-rate effect (also called base-rate neglect) occurs when we ignore the general likelihood of an event, the base rate. When we ignore how rare or common something is, we can massively misjudge probabilities, typically overestimating the chance of a rare event.

**Example**: Imagine a disease that affects 1 in 1,000 people (0.1% prevalence). There’s a test for it that is 99% accurate (it correctly identifies 99% of those with the disease and 99% of those without it).
Now, if someone tests positive, what’s the chance they actually have the disease? One might think it’s 99%, but because the disease is so rare, the actual probability is much lower.

Using Bayes’ theorem:
- P(Disease) = (1/1000) = 0.001 (base rate)
- P(Positive|Disease) = 0.99 (likelihood)
- P(Positive|No Disease) = 0.01 (false positive rate)
- P(No Disease) = 0.999 (1 - P(Disease))
- P(Disease|Positive) = [P(Positive|Disease) * P(Disease)] / P(Positive) = ?

Calculating P(Positive): 
P(Positive) = P(Positive|Disease) * P(Disease) + P(Positive|No Disease) * P(No Disease) = (0.99 * 0.001) + (0.01 * 0.999) = 0.01098

Now, plugging back into Bayes’ theorem:
P(Disease|Positive) = (0.99 * 0.001) / 0.01098 ≈ 0.0902 or about 9.02%

So even with a positive test, there’s only about a 9% chance of actually having the disease, because the disease is so rare. Even with an accurate test, ignoring base rates can fool both humans and machines. The base-rate effect reminds us that priors (the overall frequencies) matter.
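
The calculation above can be sketched in a few lines. The function name and parameter names (`prevalence`, `sensitivity`, `specificity`) are choices made for this example; here sensitivity is P(Positive|Disease) and specificity is 1 - P(Positive|No Disease).

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(Disease | Positive) via Bayes' theorem and the law of total probability."""
    true_pos = sensitivity * prevalence                # P(Positive|Disease) * P(Disease)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(Positive|No Disease) * P(No Disease)
    p_positive = true_pos + false_pos                  # P(Positive)
    return true_pos / p_positive

result = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(round(result, 4))  # 0.0902
```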

##### The Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model. It shows the counts of true positive, true negative, false positive, and false negative predictions. For the example above, applied to 1,000,000 people, the confusion matrix would look like this:

|                      | Predicted Positive | Predicted Negative | Total |
|----------------------|--------------------|--------------------|-------|
| Actual Positive (Disease)   |        TP = 990         |       FN = 10          |  1000  |
| Actual Negative (No Disease) |       FP = 9990        |      TN = 989010       | 999000|
| Total                |       10980         |      989020        |1000000|


Where:
- TP (True Positive): The model correctly predicts the positive class (disease present). (1/1000) * 1000000 = 1000 people have the disease, 99% of them are correctly identified, so 990 are true positives.
- TN (True Negative): The model correctly predicts the negative class (disease absent). (999/1000) * 1000000 = 999000 people do not have the disease, 99% of them are correctly identified, so 989010 are true negatives.
- FP (False Positive): The model incorrectly predicts the positive class (disease absent). (1% of 999000) = 9990 people are incorrectly identified as having the disease.
- FN (False Negative): The model incorrectly predicts the negative class (disease present). (1% of 1000) = 10 people are incorrectly identified as not having the disease.
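
The four counts can be derived from the population size, the prevalence, and the two test rates. The helper below is a sketch of that bookkeeping; `confusion_counts` is a made-up name, and rounding to whole people is an assumption that happens to be exact for these numbers.

```python
def confusion_counts(population, prevalence, sensitivity, specificity):
    """Expected TP, FN, FP, TN for a test applied to a whole population."""
    positives = round(population * prevalence)  # people with the disease
    negatives = population - positives          # people without it
    tp = round(positives * sensitivity)
    fn = positives - tp
    tn = round(negatives * specificity)
    fp = negatives - tn
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}

cm = confusion_counts(1_000_000, prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(cm)  # {'TP': 990, 'FN': 10, 'FP': 9990, 'TN': 989010}
```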

## Model Evaluation Metrics

### Sensitivity/Recall
Sensitivity, also known as recall or true positive rate, measures the proportion of actual positives that are correctly identified by the model. It answers the question: out of all real positive cases, how many did we catch?

Sensitivity = TP / (TP + FN)

High sensitivity means we rarely miss real positives.

### Specificity
Specificity, or true negative rate, measures the proportion of actual negatives that are correctly identified. It answers the question: out of all real negative cases, how many did we correctly identify as negative?

Specificity = TN / (TN + FP)

Specificity is like a security guard - it decides who should not enter. High specificity means we rarely misclassify negatives as positives.

### False Positive & False Negative Rates
- **False Positive Rate (FPR)**: The proportion of actual negatives that are incorrectly classified as positives. It equals 1 - Specificity.
  
  FPR = FP / (FP + TN)

- **False Negative Rate (FNR)**: The proportion of actual positives that are incorrectly classified as negatives. It equals 1 - Sensitivity.

  FNR = FN / (FN + TP)

Both should ideally be low, but there’s always a trade-off. Making the model more sensitive (so it catches more positives) often increases false positives too.
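
One way to see this trade-off is to vary the decision threshold of a model that outputs scores. The scores and labels below are hypothetical, and `rates_at` is a helper invented for this sketch.

```python
# Hypothetical (score, true_label) pairs; a higher score means "more likely positive".
scored = [
    (0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1),
    (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0),
]

def rates_at(threshold):
    """Return (sensitivity, FPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    tn = sum(1 for s, y in scored if s < threshold and y == 0)
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.75, 0.50, 0.25):
    sens, fpr = rates_at(t)
    print(f"threshold={t:.2f}  sensitivity={sens:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold from 0.75 to 0.25 raises sensitivity from 0.50 to 1.00 on this toy data, but the false positive rate rises from 0.00 to 0.50 along with it.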

### Positive Predictive Value (PPV) / Precision
Positive Predictive Value, or precision, measures the proportion of positive predictions that are actually correct.

PPV = TP / (TP + FP)

High precision means when the model says "positive," it’s usually right.

### Negative Predictive Value (NPV)
Negative Predictive Value measures the proportion of negative predictions that are actually correct.

NPV = TN / (TN + FN)

High NPV means when the model says "negative," it’s usually right.

### Prevalence
Prevalence refers to the proportion of actual positives in the population. It affects predictive values and is important to consider when interpreting model performance.

Prevalence = (TP + FN) / (TP + TN + FP + FN)

For a fixed sensitivity and specificity, the higher the prevalence, the more likely a positive prediction is to be correct (higher PPV). Conversely, lower prevalence can lead to lower PPV even with a highly accurate model.
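
The effect of prevalence on PPV can be seen by holding the test fixed and sweeping the prevalence. This is a sketch assuming a test with 99% sensitivity and 99% specificity; the prevalence values in the loop are arbitrary picks for illustration.

```python
def ppv(prevalence, sensitivity=0.99, specificity=0.99):
    """Positive predictive value for a given prevalence (Bayes' theorem)."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp)

for prevalence in (0.001, 0.01, 0.1, 0.5):
    print(f"prevalence={prevalence:<6} PPV={ppv(prevalence):.3f}")
```

With the same test, PPV climbs from roughly 9% at a prevalence of 0.1% to 50% at a prevalence of 1%, and to 99% when half the population is positive.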

### Accuracy
Accuracy measures the overall correctness of the model by calculating the proportion of true results (both true positives and true negatives) among the total number of cases examined.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy can be misleading in imbalanced datasets, where one class significantly outnumbers the other. In such cases, a model could achieve high accuracy by simply predicting the majority class.

### F1 Score
The F1 Score is the harmonic mean of precision and recall (sensitivity). It provides a single metric that balances both concerns, especially useful when the class distribution is imbalanced.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

The F1 Score ranges from 0 to 1, with 1 being the best possible score. It is particularly useful when a balance between precision and recall is needed.
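
A short sketch shows why the harmonic mean is used: it stays low unless both precision and recall are high. The input values are the precision and recall from the disease-test example; returning 0 when both inputs are 0 is a convention assumed here to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.0902, 0.99  # values from the disease-test example
print(round(f1_score(precision, recall), 3))  # 0.165
print(round((precision + recall) / 2, 3))     # 0.54 (arithmetic mean, for contrast)
```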

### Balancing Act
In real-world applications, there’s often a trade-off between sensitivity and specificity, and depending on the context, one might be prioritized over the other. In medical diagnostics, high sensitivity is crucial so that as few cases as possible are missed, even if it means more false positives. For autonomous vehicles, high specificity is vital to avoid false alarms that could lead to unnecessary stops. In face recognition systems, high precision is important: when the system identifies someone, it should very likely be correct.

### Summary Table of Metrics
| Metric                     | Formula                          | Interpretation                                  |
|----------------------------|----------------------------------|------------------------------------------------|
| Sensitivity (Recall)       | TP / (TP + FN)                  | Proportion of actual positives correctly identified. |
| Specificity                | TN / (TN + FP)                  | Proportion of actual negatives correctly identified. |
| False Positive Rate (FPR)  | FP / (FP + TN)                  | Proportion of actual negatives incorrectly classified as positives. |
| False Negative Rate (FNR)  | FN / (FN + TP)                  | Proportion of actual positives incorrectly classified as negatives. |
| Positive Predictive Value (PPV) | TP / (TP + FP)              | Proportion of positive predictions that are correct. |
| Negative Predictive Value (NPV) | TN / (TN + FN)              | Proportion of negative predictions that are correct. |
| Prevalence                 | (TP + FN) / (TP + TN + FP + FN) | Proportion of actual positives in the population. |
| Accuracy                    | (TP + TN) / (TP + TN + FP + FN) | Overall correctness of the model.               |
| F1 Score                   | 2 * (Precision * Recall) / (Precision + Recall) | Balance between precision and recall.          |

- Sensitivity -> how well the model catches positives.
- Specificity -> how well it avoids false alarms.
- FPR/FNR -> the types of mistakes it makes.
- PPV/NPV -> how trustworthy its predictions are.

##### Example Calculation
Using the confusion matrix from the previous example:

| Metric                 | Calculation |
|------------------------|--------|
| Sensitivity (Recall)      | 990 / (990 + 10) = 0.99 -> 99% |
| Specificity               | 989010 / (989010 + 9990) = 0.99 -> 99% |
| False Positive Rate (FPR) | 9990 / (9990 + 989010) = 0.01 -> 1% |
| False Negative Rate (FNR) | 10 / (10 + 990) = 0.01 -> 1% |
| Positive Predictive Value (PPV) | 990 / (990 + 9990) = 0.0902 -> 9.02% |
| Negative Predictive Value (NPV) | 989010 / (989010 + 10) = 0.99999 -> 99.999% |  
| Prevalence                | 1000 / 1000000 = 0.001 -> 0.1% |
| Accuracy                   | (990 + 989010) / 1000000 = 0.99 -> 99% |
| F1 Score                   | 2 * (0.0902 * 0.99) / (0.0902 + 0.99) ≈ 0.165 |

This shows that while the model is very good at identifying both positives and negatives (high sensitivity and specificity), the positive predictive value is low due to the rarity of the disease, illustrating the base-rate effect. So the model is not precise when it predicts that someone has the disease, but it is very reliable when it predicts that someone does not. It may seem contradictory that overall accuracy is still high. Accuracy is dominated by the huge negative class, which the model classifies correctly 99% of the time. In fact, a trivial model that predicts "no disease" for everyone would be right for 999,000 of the 1,000,000 people (99.9% accuracy) while catching no cases at all. This is why it’s important to look at all these different measures when evaluating a model.
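
All the metrics from the summary table can be computed from the four confusion-matrix counts. The function below is a minimal sketch (the name `classification_metrics` is made up); it does not guard against empty rows or columns, so it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the summary-table metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
        "ppv": precision,
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=990, fn=10, fp=9990, tn=989010)
for name, value in metrics.items():
    print(f"{name:<12} {value:.5f}")
```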

## Class Imbalance and Its Impact on Interpreting Errors

### Understanding Class Imbalance
Class imbalance occurs when the classes in a dataset are not represented equally. Simply put, one class has many more samples than another. For example, in a medical dataset, if 95% of patients are healthy and only 5% have a disease, the dataset is imbalanced.
This imbalance can significantly impact the performance of machine learning models and the interpretation of evaluation metrics.

### Effects of Class Imbalance
1. **Misleading Accuracy**: In imbalanced datasets, a model can achieve high accuracy by simply predicting the majority class. For example, in a dataset with 95% healthy patients, a model that always predicts "healthy" will have 95% accuracy, but it fails to identify any diseased patients.
2. **Poor Generalization**: Models trained on imbalanced data may not learn the characteristics of the minority class well, leading to poor performance when predicting that class.
3. **Skewed Evaluation Metrics**: Metrics like accuracy can be misleading in imbalanced datasets. More informative metrics such as precision, recall, F1-score, and area under the ROC curve (AUC-ROC) should be used to evaluate model performance in such scenarios.
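
The "always predict healthy" case can be reproduced in a few lines. The dataset below is synthetic: 95 healthy and 5 diseased labels, matching the percentages in the example.

```python
# Synthetic dataset: 95 healthy (0) and 5 diseased (1) patients.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * len(y_true)  # a "model" that always predicts healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

Accuracy looks strong, yet recall shows that not a single diseased patient was found.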

In many imbalanced problems, the minority class is the one we care about most, so false negatives tend to hurt more than false positives. For example, in disease detection, predicting that a healthy patient is diseased (false positive) may lead to unnecessary tests, but missing a diseased patient (false negative) can be life-threatening.

### Strategies to Handle Class Imbalance
1. **Resampling Techniques**:
   - **Oversampling**: Increase the number of samples in the minority class by duplicating existing samples or generating new ones (e.g., using SMOTE - Synthetic Minority Over-sampling Technique).
   - **Undersampling**: Decrease the number of samples in the majority class by randomly removing samples.
2. **Algorithmic Approaches**:
   - Use algorithms that often cope better with class imbalance, such as decision trees or ensemble methods like Random Forest and Gradient Boosting.
   - **Class weighting**: Assign higher weights to the minority class during model training so that misclassifying it is penalized more heavily.
3. **Anomaly Detection**: Treat the minority class as anomalies and use anomaly detection techniques to identify them.
4. **Evaluation Metrics**: Focus on metrics that provide a better understanding of model performance on the minority class, such as precision, recall, F1-score, and AUC-ROC.
5. **Bayes/Calibration**: Use Bayesian methods to adjust predictions based on prior probabilities, especially when dealing with rare events.
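
Two of the strategies above, class weighting and random oversampling, are simple enough to sketch in plain Python. The weighting formula `n_samples / (n_classes * class_count)` is one common heuristic (to my knowledge the same one scikit-learn uses for its "balanced" option); both function names and the stand-in data are assumptions made for this example.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y

labels = [0] * 95 + [1] * 5
samples = list(range(len(labels)))  # stand-in feature values

print(balanced_class_weights(labels))  # {0: 0.526..., 1: 10.0}
_, new_labels = random_oversample(samples, labels)
print(Counter(new_labels))             # Counter({0: 95, 1: 95})
```

If oversampling is used, it is usually applied to the training split only, so that duplicated samples do not leak into the evaluation data.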

**Summary**: 
Imbalance + high accuracy = dangerous illusion of confidence