# CS50 AI with Python

##  Lecture 2

##  Uncertainty

AI rarely knows an answer for certain. Instead, an AI works with some level of **uncertainty** and assigns a **probability** to its answer. For example, weather forecasters do not know for sure what the weather will be, but infer it from other knowledge, so each forecast carries some **probability**.

### What is Probability?

Each possible world is represented by **ω**.

Each of these worlds has a probability, written P(***ω***).

Every probability must be between 0 and 1, inclusive: 0 ≤ P(***ω***) ≤ 1
* 0 means the event is impossible (ex. rolling a 7 on a standard die)
* 1 means the event is certain to happen (ex. rolling a number less than 7 on a die)
* The higher the probability, the more likely the event is to happen

The probabilities of all possible worlds must sum to 1.

The picture below is the formula for exactly that.

![image.png](attachment:image.png)

For example if you roll a die, the probabilities of the worlds are as follows:
* 1: ⅙
* 2: ⅙
* 3: ⅙
* 4: ⅙
* 5: ⅙
* 6: ⅙

When all the probabilities are added together, they equal 1. One of these worlds is written as P(1) = ⅙.

If we roll two dice, each possible world, ***ω***, is one pair of values (first die, second die).

Since each die has 6 values, there are 6 × 6 = 36 possible worlds, and all of them are equally likely.

However, if we look at the sum of the two dice, there are 11 different sums (2 through 12), and they are **not** equally likely. Only 1 of the 36 worlds sums to 12 (6 and 6), while 6 of them sum to 7.

P(**sum to 12**) = 1/36

P(**sum to 7**) = 1/6
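
The counts above can be checked by listing every possible world. This is a minimal sketch using only the standard library; the names `worlds` and `sum_probability` are illustrative and not part of the lecture code.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every ordered pair (die 1, die 2) is one equally likely possible world.
worlds = list(product(range(1, 7), repeat=2))

# Count how many worlds produce each sum, then divide by the number of worlds.
sum_counts = Counter(a + b for a, b in worlds)
sum_probability = {s: Fraction(n, len(worlds)) for s, n in sum_counts.items()}

print(len(worlds))          # 36
print(sum_probability[12])  # 1/36
print(sum_probability[7])   # 1/6
```

Using `Fraction` keeps the results exact, so they match the hand-computed values.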

---

The probability of a specific world in the absence of any evidence is called **unconditional probability**.

*Unconditional Probability* is a degree of belief in a proposition in the absence of any other evidence.

When an AI judges its predictions, it usually relies not on unconditional probability but on **conditional probability**.

*Conditional Probability* is a degree of belief in a proposition given some evidence that has already been revealed.

In terms of notation it looks like this:

P(**a** | **b**)

This notation is read as: the probability of **a** given **b**.

Ex.

P(**route change** | **traffic conditions**)

P(**disease** | **test result**)

---

The formula for conditional probability is as follows:

#### **P(a|b) = P(a∧b) / P(b)**

The probability of **a** given **b** is equal to the probability of **a** and **b** both being true, divided by the probability of **b**.

Using conditional probability we can add new scenarios to our dice problem:

P(**sum 12** | **Die 1 = 6**)

Since the first die is 6, only 6 of the 36 possible worlds remain. Within those, the sum is 12 only when the second die is also 6, which has probability 1/6.

Formula:

P(**sum 12** ∧ **Die 1 = 6**) = 1/36

P(**Die 1 = 6**) = 1/6

P(**sum 12** | **Die 1 = 6**) = (1/36) / (1/6) = 1/6
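
The same calculation can be sketched in code by applying P(a|b) = P(a∧b) / P(b) directly to the 36 possible worlds. The helper names `probability` and `conditional` are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event, given as a predicate over (die1, die2) worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def conditional(a, b):
    """P(a | b) = P(a and b) / P(b)."""
    return probability(lambda w: a(w) and b(w)) / probability(b)

def sum_is_12(w):
    return w[0] + w[1] == 12

def die1_is_6(w):
    return w[0] == 6

print(conditional(sum_is_12, die1_is_6))  # 1/6
```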

*Transforming the formula*

Multiplying both sides of

**P(a|b) = P(a∧b) / P(b)**

by P(b) gives

**P(a ∧ b) = P(b)P(a|b)**

and, because **a ∧ b** is the same event as **b ∧ a**, also

**P(a ∧ b) = P(a)P(b|a)**

---

A *Random variable* is a variable in probability theory with a domain of possible values it can take on.

Ex.

**Roll**

{**1**, **2**, **3**, **4**, **5**, **6**}

**Traffic**

{**none**, **light**, **heavy**}

Example of a **probability distribution**:

P(**Flight** = **on time**) = 0.6

P(**Flight** = **delayed**) = 0.3

P(**Flight** = **cancelled**) = 0.1

A **vector** can represent the probability distribution of a random variable more compactly, with the values listed in a fixed order (here: on time, delayed, cancelled).

Ex.

*P*(**Flight**) = ⟨0.6, 0.3, 0.1⟩

---

*Independence* means that knowing one event occurred does not affect the probability of the other event.

Ex.

The value of the first die does not affect the value of the second.

An example of events that are not independent is rain and clouds: knowing that it is cloudy changes the probability that it is raining.

Formula (holds only when **a** and **b** are independent):

**P(a ∧ b) = P(a)P(b)**

In words, if **a** and **b** are independent, the probability of **a** and **b** is the probability of **a** times the probability of **b**.
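
As a rough check of this rule on the two-dice worlds, the sketch below tests whether P(a ∧ b) equals P(a)P(b) for two pairs of events. The function `independent` is a hypothetical helper, not something from the lecture.

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event, given as a predicate over (die1, die2) worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

def independent(a, b):
    """True when P(a and b) equals P(a) * P(b)."""
    return probability(lambda w: a(w) and b(w)) == probability(a) * probability(b)

def die1_is_6(w):
    return w[0] == 6

def die2_is_6(w):
    return w[1] == 6

def sum_is_12(w):
    return w[0] + w[1] == 12

print(independent(die1_is_6, die2_is_6))  # True
print(independent(die1_is_6, sum_is_12))  # False
```

The second pair fails the test because knowing the first die is 6 raises the probability of a sum of 12 from 1/36 to 1/6.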

---

### Bayes Rule

**P(a ∧ b) = P(b) P(a|b)**

The probability of a and b both taking place is the probability that b takes place times the probability that a takes place, knowing b is true.

**P(a ∧ b) = P(a) P(b|a)**

The probability of a and b both taking place is the probability that a takes place times the probability that b takes place, knowing a is true.

Setting the two right-hand sides equal and dividing by P(a) gives Bayes' Rule:

**P(b|a) = P(b)P(a|b) / P(a)**

The probability of b given a is the probability of a given b times the probability of b, divided by the probability of a.

---

Example problem:

Given that there are clouds in the morning, use the following statistics to find the probability of rain in the afternoon.

* 80% of rainy afternoons start with cloudy mornings.
* 40% of days have cloudy mornings.
* 10% of days have rainy afternoons.

Using Bayes' rule, we can plug in our statistics.

**P(rain|clouds) = P(clouds|rain)P(rain) / P(clouds)**

The probability of rain given clouds is the probability of clouds given rain times the chances of rain divided by the likelihood of clouds.

Plugging in our numbers we get the equation below:

= (.8)(.1)/ .4

= .2

= 20%

Knowing:

**P(cloudy morning | rainy afternoon)**

We can calculate:

**P(rainy afternoon | cloudy morning)**
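
The arithmetic above can be written as a small function. This is only a sketch; the function name `bayes` and its parameter names are chosen for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(a | b) = P(b | a) * P(a) / P(b)."""
    return p_b_given_a * p_a / p_b

# a = rainy afternoon, b = cloudy morning
p_rain_given_clouds = bayes(p_b_given_a=0.8, p_a=0.1, p_b=0.4)
print(round(p_rain_given_clouds, 4))  # 0.2
```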

---

### Joint Probability

| C = cloud   	| C = ¬cloud 	|
|-------------	|------------	|
| 0.4         	| 0.6        	|

| R = rain 	| R = ¬rain 	|
|----------	|-----------	|
| 0.1      	| 0.9       	|

A joint probability distribution:

|            	| R = rain 	| R = ¬rain 	|
|------------	|----------	|-----------	|
| C = cloud  	| 0.08     	| 0.32      	|
| C = ¬cloud 	| 0.02     	| 0.58      	|

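To find the distribution of clouds given rain, **P(C | rain)**, note that the comma in **P(C, rain)** means "and" (the same as ∧):

**P(C | rain) = P(C, rain) / P(rain) = αP(C, rain)**

**= α⟨0.08, 0.02⟩ = ⟨0.8, 0.2⟩**

Here α = 1 / P(rain) is a normalizing constant that makes the resulting values sum to 1.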
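
The normalization step can be sketched as follows, using the rain column of the joint table. The variable names are illustrative.

```python
# Joint probabilities P(C, rain) for C = cloud and C = not cloud.
joint_with_rain = {"cloud": 0.08, "not cloud": 0.02}

# alpha rescales the values so that they sum to 1; it equals 1 / P(rain).
alpha = 1 / sum(joint_with_rain.values())
cloud_given_rain = {c: alpha * p for c, p in joint_with_rain.items()}

print({c: round(p, 4) for c, p in cloud_given_rain.items()})
# {'cloud': 0.8, 'not cloud': 0.2}
```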

---

### Probability Rules

---

#### Negation

**P(¬a) = 1 - P(a)**

"*a*" is an event that has some probability of happening. Here we want the probability that a does *not* happen. Since a either happens or does not, the two probabilities sum to 1, so P(¬a) = 1 - P(a).

---

#### Inclusion-Exclusion

**P(a ∨ b) = P(a) + P(b) - P(a ∧ b).**

An example of this scenario is finding the probability that either *Die 1* or *Die 2* rolls a 6. This is called the Inclusion-Exclusion formula. We add the probability of *a* and the probability of *b*, then subtract the cases we counted twice (the cases where both happen).
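
For the two-dice example, the formula can be compared against direct counting. This is a sketch with a hypothetical `probability` helper.

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))

def probability(event):
    """Probability of an event, given as a predicate over (die1, die2) worlds."""
    return Fraction(sum(1 for w in worlds if event(w)), len(worlds))

p_a = probability(lambda w: w[0] == 6)                     # die 1 is 6
p_b = probability(lambda w: w[1] == 6)                     # die 2 is 6
p_a_and_b = probability(lambda w: w[0] == 6 and w[1] == 6)

by_formula = p_a + p_b - p_a_and_b
by_counting = probability(lambda w: w[0] == 6 or w[1] == 6)

print(by_formula, by_counting)  # 11/36 11/36
```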

---

#### Marginalization  

**P(a) = P(a, b) + P(a, ¬b)**

Marginalization is a rule that finds the probability of *a* using joint probabilities that involve *b*. To do this, we add the probability that both *a* and *b* happen to the probability that *a* happens and *b* does not. This works because the two cases are disjoint and together cover every way *a* can happen.
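
Applied to the cloud and rain joint table from earlier, marginalization might look like the sketch below. The dictionary layout and the `marginal` function are assumptions for illustration, and the bottom-right entry is taken as 0.58 so that the four entries sum to 1.

```python
# Joint distribution P(C, R); keys are (cloud value, rain value).
joint = {
    ("cloud", "rain"): 0.08,
    ("cloud", "no rain"): 0.32,
    ("no cloud", "rain"): 0.02,
    ("no cloud", "no rain"): 0.58,  # chosen so the table sums to 1
}

def marginal(position, value):
    """Sum the joint entries whose variable at `position` equals `value`."""
    return sum(p for key, p in joint.items() if key[position] == value)

print(round(marginal(0, "cloud"), 4))  # 0.4  = P(cloud, rain) + P(cloud, no rain)
print(round(marginal(1, "rain"), 4))   # 0.1  = P(cloud, rain) + P(no cloud, rain)
```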

---

#### Marginalization for Random Variables

When marginalizing over a single event *b*, there are only 2 cases to add up (*b* happening and *b* not happening). For random variables this is not enough: we have to sum over every value the other random variable can take. This rule uses the joint distribution of the two variables.
![](attachment:image-2.png)

---

#### Conditioning  

**P(a) = P(a | b)P(b) + P(a | ¬b)P(¬b)**

This concept is very similar to marginalization. The probability of *a* happening is equal to the probability of *a* *given* *b* multiplied by the probability of *b*, plus the probability of *a* given *¬b* multiplied by the probability of *¬b*.
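
As a small worked check with the cloud and rain numbers used earlier, conditioning on rain recovers P(cloud). The value of P(cloud | ¬rain) is not stated in the notes; here it is derived from the joint table as an assumption.

```python
p_rain = 0.1
p_cloud_given_rain = 0.8

# Assumption: derived from the joint table as P(cloud, no rain) / P(no rain).
p_cloud_given_not_rain = 0.32 / 0.9

# P(cloud) = P(cloud | rain) P(rain) + P(cloud | not rain) P(not rain)
p_cloud = (p_cloud_given_rain * p_rain
           + p_cloud_given_not_rain * (1 - p_rain))

print(round(p_cloud, 4))  # 0.4
```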

---

#### Conditioning for Random Variables

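This works like marginalization for random variables: we sum over every value of the other variable, but each term is a conditional probability multiplied by the probability of that value rather than a joint probability.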

![image-3.png](attachment:image-3.png)

---

### Bayesian Networks

A **Bayesian Network** is a data structure that represents the dependencies among random variables.

**Bayesian Network**
* A directed graph 
* Each node represents a random variable 
* An arrow from X to Y means X is a parent of Y 
* Each node X has a probability distribution **P(X | Parents(X))**

Ex. 

![](attachment:image-4.png)




![image-2.png](attachment:image-2.png)
![](attachment:image.png)

---




### Inference

* Query X: the variable for which to compute the distribution
* Evidence variables E: the variables that have been observed for event e
* Hidden variables Y: the variables that are neither evidence nor query variables
* Goal: calculate P(X | e)

P(Appointment | light, no)

= α P(Appointment, light, no)

= α [P(Appointment, light, no, on time) + P(Appointment, light, no, delayed)]

![image.png](attachment:image.png)
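
The calculation above can be sketched in plain Python, without any library. The probability tables are copied from the train example used in the pomegranate code in these notes; the dictionary layout and function names are assumptions made for this illustration.

```python
RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
MAINTENANCE = {  # P(maintenance | rain)
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
TRAIN = {  # P(train | rain, maintenance)
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},
}
APPOINTMENT = {  # P(appointment | train)
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def joint(rain, maintenance, train, appointment):
    """Joint probability of one fully specified world (chain rule on the network)."""
    return (RAIN[rain] * MAINTENANCE[rain][maintenance]
            * TRAIN[(rain, maintenance)][train] * APPOINTMENT[train][appointment])

def appointment_given(rain, maintenance):
    """P(Appointment | rain, maintenance), summing out the hidden Train variable."""
    unnormalized = {
        a: sum(joint(rain, maintenance, t, a) for t in ("on time", "delayed"))
        for a in ("attend", "miss")
    }
    alpha = 1 / sum(unnormalized.values())
    return {a: alpha * p for a, p in unnormalized.items()}

result = appointment_given("light", "no")
print({a: round(p, 4) for a, p in result.items()})  # {'attend': 0.81, 'miss': 0.19}
```

Train is the only hidden variable here, so the sum has just two terms per Appointment value.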

---

#### Inference by Enumeration

![image-2.png](attachment:image-2.png)

Let's put this inference into code. It does not matter which library you choose; the code below uses pomegranate.

In [31]:
!pip install pomegranate==0.14.8


^C


  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [419 lines of output]
      Compiling pomegranate\BayesClassifier.pyx because it changed.
      Compiling pomegranate\BayesianNetwork.pyx because it changed.
      Compiling pomegranate\FactorGraph.pyx because it changed.
      Compiling pomegranate\MarkovChain.pyx because it changed.
      Compiling pomegranate\MarkovNetwork.pyx because it changed.
      Compiling pomegranate\NaiveBayes.pyx because it changed.
      Compiling pomegranate\base.pyx because it changed.
      Compiling pomegranate\bayes.pyx because it changed.
      Compiling pomegranate\gmm.pyx because it changed.
      Compiling pomegranate\hmm.pyx because it changed.
      Compiling pomegranate\kmeans.pyx because it changed.
      Compiling pomegranate\parallel.pyx because it changed.
      Compiling pomegranate\utils.pyx because it changed.
      Compiling pomegranate/distributions\Bernoull

Collecting pomegranate==0.14.8
  Using cached pomegranate-0.14.8.tar.gz (4.3 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'error'


First, we create our nodes and input the probability distribution for each node.

**(Note: this code is written for pomegranate 0.14.8, which failed to install above, so it does not run in this notebook. Run it from the lecture's source code instead.)**

In [29]:
from pomegranate import *

# Rain node has no parents
rain = Node(DiscreteDistribution({
    "none": 0.7,
    "light": 0.2,
    "heavy": 0.1
}), name="rain")

# Track maintenance node is conditional on rain
maintenance = Node(ConditionalProbabilityTable([
    ["none", "yes", 0.4],
    ["none", "no", 0.6],
    ["light", "yes", 0.2],
    ["light", "no", 0.8],
    ["heavy", "yes", 0.1],
    ["heavy", "no", 0.9]
], [rain.distribution]), name="maintenance")

# Train node is conditional on rain and maintenance
train = Node(ConditionalProbabilityTable([
    ["none", "yes", "on time", 0.8],
    ["none", "yes", "delayed", 0.2],
    ["none", "no", "on time", 0.9],
    ["none", "no", "delayed", 0.1],
    ["light", "yes", "on time", 0.6],
    ["light", "yes", "delayed", 0.4],
    ["light", "no", "on time", 0.7],
    ["light", "no", "delayed", 0.3],
    ["heavy", "yes", "on time", 0.4],
    ["heavy", "yes", "delayed", 0.6],
    ["heavy", "no", "on time", 0.5],
    ["heavy", "no", "delayed", 0.5],
], [rain.distribution, maintenance.distribution]), name="train")

# Appointment node is conditional on train
appointment = Node(ConditionalProbabilityTable([
    ["on time", "attend", 0.9],
    ["on time", "miss", 0.1],
    ["delayed", "attend", 0.6],
    ["delayed", "miss", 0.4]
], [train.distribution]), name="appointment")

1.1.2


NameError: name 'Node' is not defined

Next, we create a model by adding all the nodes we made and describing which node is a parent of which other nodes by adding edges between them.

In [None]:
# Create a Bayesian Network and add states
model = BayesianNetwork()
model.add_states(rain, maintenance, train, appointment)

# Add edges connecting nodes
model.add_edge(rain, maintenance)
model.add_edge(rain, train)
model.add_edge(maintenance, train)
model.add_edge(train, appointment)

# Finalize model
model.bake()

To figure out how probable a specific event is, we run this model with values that describe the event. For example, we ask for the probability that there is no rain, no track maintenance, the train is on time, and we attend the meeting.

In [None]:
# Calculate probability for a given observation
probability = model.probability([["none", "no", "on time", "attend"]])

print(probability)

We could also use our program to give us the probability distributions for all variables given some evidence. In the example below we know the train was delayed. Given this evidence, using our model we can compute the probability distributions of the variables Rain, Maintenance, and Appointment.

In [None]:
# Calculate predictions based on the evidence that the train was delayed
predictions = model.predict_probability({
    "train": "delayed"
})

# Print predictions for each node
for node, prediction in zip(model.states, predictions):
    if isinstance(prediction, str):
        print(f"{node.name}: {prediction}")
    else:
        print(f"{node.name}")
        for value, probability in prediction.parameters[0].items():
            print(f"    {value}: {probability:.4f}")

The examples above used **inference by enumeration**. This way of computing is very inefficient, especially when a model has a large number of variables. An alternative to **exact inference** is **approximate inference**. We lose some accuracy in the generated probabilities, but the difference is often negligible.

### Sampling

Sampling is one of several techniques for approximate inference. In sampling, a value is drawn for each variable according to its probability distribution.

Example:

If we start by sampling the **Rain** variable, the value *none* will be generated with a probability of 0.7, the value *light* with a probability of 0.2, and the value *heavy* with a probability of 0.1.

Suppose the sampled value for **Rain** is *none*. Next we sample **Maintenance**, but only from the probability distribution where **Rain** is equal to *none*, because the **Rain** value has already been sampled. We continue this process for all the other nodes.

Now that we have one sample, we can repeat this process many times to build up a distribution.

If we want to answer a question like what is ***P(Train = on time)***, we can count the number of samples where the variable **Train** has the value *on time*, and divide the result by the total number of samples.

Another type of question we can answer with this method involves conditional probability, such as ***P(Rain = light | Train = on time)***. Here we keep only the samples where **Train** is *on time* and count how many of those have **Rain** equal to *light*.
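
The procedure described above can also be sketched without any library. This is a minimal illustration, not the lecture's implementation: the tables repeat the train example's numbers, while the `draw` helper and the fixed random seed are assumptions.

```python
import random

RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
MAINTENANCE = {
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
TRAIN = {
    ("none", "yes"): {"on time": 0.8, "delayed": 0.2},
    ("none", "no"): {"on time": 0.9, "delayed": 0.1},
    ("light", "yes"): {"on time": 0.6, "delayed": 0.4},
    ("light", "no"): {"on time": 0.7, "delayed": 0.3},
    ("heavy", "yes"): {"on time": 0.4, "delayed": 0.6},
    ("heavy", "no"): {"on time": 0.5, "delayed": 0.5},
}
APPOINTMENT = {
    "on time": {"attend": 0.9, "miss": 0.1},
    "delayed": {"attend": 0.6, "miss": 0.4},
}

def draw(distribution, rng):
    """Draw one value from a {value: probability} mapping."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def generate_sample(rng):
    """Sample every variable in topological order, parents before children."""
    rain = draw(RAIN, rng)
    maintenance = draw(MAINTENANCE[rain], rng)
    train = draw(TRAIN[(rain, maintenance)], rng)
    appointment = draw(APPOINTMENT[train], rng)
    return {"rain": rain, "maintenance": maintenance,
            "train": train, "appointment": appointment}

rng = random.Random(0)  # fixed seed so repeated runs give the same samples
samples = [generate_sample(rng) for _ in range(10_000)]

# P(Train = on time): fraction of all samples where the train is on time.
on_time = [s for s in samples if s["train"] == "on time"]
p_on_time = len(on_time) / len(samples)

# P(Rain = light | Train = on time): only count within the on-time samples.
p_light_given_on_time = sum(s["rain"] == "light" for s in on_time) / len(on_time)

print(round(p_on_time, 3), round(p_light_given_on_time, 3))
```

For comparison, the exact values from the tables are P(Train = on time) = 0.787 and P(Rain = light | Train = on time) ≈ 0.173, so the estimates should land close to those.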

With pomegranate, sampling looks like the following:

In [None]:
import pomegranate

from collections import Counter

from model import model

def generate_sample():

    # Mapping of random variable name to sample generated
    sample = {}

    # Mapping of distribution to sample generated
    parents = {}

    # Loop over all states, assuming topological order
    for state in model.states:

        # If we have a non-root node, sample conditional on parents
        if isinstance(state.distribution, pomegranate.ConditionalProbabilityTable):
            sample[state.name] = state.distribution.sample(parent_values=parents)

        # Otherwise, just sample from the distribution alone
        else:
            sample[state.name] = state.distribution.sample()

        # Keep track of the sampled value in the parents mapping
        parents[state.distribution] = sample[state.name]

    # Return generated sample
    return sample

To compute ***P(Appointment | Train = delayed)***, which is the probability distribution of the Appointment variable given that the train is delayed, we do the following:

In [32]:
# Rejection sampling
# Compute distribution of Appointment given that train is delayed
N = 10000
data = []

# Repeat sampling 10,000 times
for i in range(N):

    # Generate a sample based on the function that we defined earlier
    sample = generate_sample()

    # If Train is delayed in this sample, save the sample's Appointment value.
    # Since we want the distribution of Appointment given that the train is
    # delayed, we discard the samples where the train was on time.
    if sample["train"] == "delayed":
        data.append(sample["appointment"])

# Count how many times each value of Appointment appeared. Dividing the counts
# by the total number of saved samples gives approximate probabilities that
# add up to 1.
print(Counter(data))


NameError: name 'generate_sample' is not defined

In [40]:
!python bayesnet/inference.py

Traceback (most recent call last):
  File "c:\Users\nirva\OneDrive\CS50 AI with Python\Lecture 2 - Uncertainty\bayesnet\inference.py", line 4, in <module>
    predictions = model.predict_proba({
                  ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nirva\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pomegranate\bayesian_network.py", line 441, in predict_proba
    return self._factor_graph.predict_proba(X)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\nirva\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\pomegranate\factor_graph.py", line 315, in predict_proba
    if X.shape[1] != nm:
       ^^^^^^^
AttributeError: 'dict' object has no attribute 'shape'


#### Likelihood Weighting  

In the code and example above we discarded the samples that did not match our evidence. This is inefficient. One way to improve efficiency is likelihood weighting, which uses the following steps:
* Fix the values for evidence variables. 
* Sample the non-evidence variables using conditional probabilities in the Bayesian network.
* Weight each sample by its likelihood (the probability of all the evidence occurring).
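
The steps above can be sketched in plain Python for the query P(Rain | Train = delayed). This is an illustration under assumptions: the tables repeat the train example's numbers, and the function names and random seed are hypothetical.

```python
import random
from collections import defaultdict

RAIN = {"none": 0.7, "light": 0.2, "heavy": 0.1}
MAINTENANCE = {
    "none": {"yes": 0.4, "no": 0.6},
    "light": {"yes": 0.2, "no": 0.8},
    "heavy": {"yes": 0.1, "no": 0.9},
}
# P(Train = delayed | rain, maintenance): the likelihood of the evidence.
P_DELAYED = {
    ("none", "yes"): 0.2, ("none", "no"): 0.1,
    ("light", "yes"): 0.4, ("light", "no"): 0.3,
    ("heavy", "yes"): 0.6, ("heavy", "no"): 0.5,
}

def draw(distribution, rng):
    """Draw one value from a {value: probability} mapping."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]

def rain_given_delayed(n, rng):
    """Estimate P(Rain | Train = delayed) with likelihood weighting."""
    totals = defaultdict(float)
    for _ in range(n):
        # Train is fixed to "delayed"; sample only the non-evidence variables.
        rain = draw(RAIN, rng)
        maintenance = draw(MAINTENANCE[rain], rng)
        # Weight the sample by how likely the evidence is given its parents.
        totals[rain] += P_DELAYED[(rain, maintenance)]
    norm = sum(totals.values())
    return {value: totals[value] / norm for value in RAIN}

estimate = rain_given_delayed(10_000, random.Random(0))
print({value: round(p, 3) for value, p in estimate.items()})
```

Every sample contributes to the estimate, in contrast to rejection sampling, where samples with the train on time are thrown away.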

Traditional probability models don't consider time.

Many real-world tasks require predicting future events, so we introduce time-indexed variables.

To handle such tasks, we use Markov Models, which assume a limited dependency on past states.

#### Markov Assumption
* The current state depends on only a fixed number of past states.
* An example of this is predicting today's weather using only yesterday's data.

#### Markov Chain 
* A sequence of random variables where each depends only on the previous one.
* This requires a transition model which defines probabilities of moving from one state to another.
* Example
  * P(sunny→sunny) = 0.8
  * P(rainy→rainy) = 0.7
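
A transition model like this can be sketched as a nested dictionary. The two probabilities listed above fix the rest, since each row must sum to 1. The starting belief (certainly sunny today) and the `step` function are assumptions for illustration.

```python
# TRANSITION[today][tomorrow] = P(tomorrow | today)
TRANSITION = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}

def step(belief):
    """Advance a distribution over today's weather by one day."""
    return {
        nxt: sum(belief[cur] * TRANSITION[cur][nxt] for cur in belief)
        for nxt in TRANSITION
    }

belief = {"sunny": 1.0, "rainy": 0.0}  # assumed: today is known to be sunny
for day in range(1, 4):
    belief = step(belief)
    print(day, {state: round(p, 3) for state, p in belief.items()})
```

After three days the belief is about 0.65 sunny and 0.35 rainy, showing how certainty about today fades as we predict further ahead.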

#### HMMs(Hidden Markov Models)

* Extend Markov Chains to include hidden states (true states we can't directly observe) and observations (what we can measure).
* Ex.
  * Robot's location (hidden) vs. sensor data (observed)

  * Spoken words (hidden) vs. audio waveform (observed)

  * Weather (hidden) vs. umbrella usage (observed)

#### Sensor (Emission) Model

* The probability of an observation given a hidden state
  * Example: P(umbrella | rain) = 0.9, P(umbrella | sun) = 0.2
* Assumes observations depend only on the current state (sensor Markov assumption).

#### Applications of HMMs

* Filtering: Estimate current state from all past observations.

* Prediction: Estimate future states.

* Smoothing: Estimate past states.

* Most Likely Explanation: Find the most probable sequence of hidden states (e.g., in speech recognition).
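
As one example of these tasks, filtering can be sketched by alternating a prediction step (transition model) with an update step (sensor model). The sketch reuses the transition and emission probabilities from the examples above; the uniform starting belief, the observation sequence, and the function name `filter_step` are assumptions.

```python
TRANSITION = {  # P(tomorrow | today)
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}
EMISSION = {  # P(observation | hidden state)
    "rainy": {"umbrella": 0.9, "no umbrella": 0.1},
    "sunny": {"umbrella": 0.2, "no umbrella": 0.8},
}

def filter_step(belief, observation):
    """Update the belief over the hidden state after one new observation."""
    # Predict: push yesterday's belief through the transition model.
    predicted = {
        s: sum(belief[prev] * TRANSITION[prev][s] for prev in belief)
        for s in TRANSITION
    }
    # Update: weight each state by how well it explains the observation.
    weighted = {s: predicted[s] * EMISSION[s][observation] for s in predicted}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

belief = {"sunny": 0.5, "rainy": 0.5}  # assumed prior before any observation
for observation in ["umbrella", "umbrella", "no umbrella"]:
    belief = filter_step(belief, observation)
    print(observation, {s: round(p, 3) for s, p in belief.items()})
```

After the first umbrella observation, the belief in rain rises from 0.5 to roughly 0.79, since an umbrella is much more likely on a rainy day.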

