<a href="https://colab.research.google.com/github/wingated/cs473/blob/main/labs/cs473_lab_week_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# BYU CS 473 Lab Week 2

## Introduction:
Welcome to your first lab for CS 473, Advanced Machine Learning.

In machine learning, models often predict *unnormalized log probabilities*. These must often be converted into regular probabilities.

In this lab, you will explore the log-sum-exp function, which is described in the text (Sec. 2.5.4).  You will code up several variants of the function, and compare their performance.

# Part 1: Logsumexp
---
## Setup: The Iris Dataset
We'll begin by downloading the Iris dataset. The Iris dataset is a simple but very famous dataset introduced to the world by R. A. Fisher (the "father" of modern statistics) in 1936. The dataset has five columns:
* sepal length (cm)
* sepal width (cm)
* petal length (cm)
* petal width (cm)
* class

In order to get logits to play with, we'll first train a multinomial logistic regression model (Sec. 2.5.3).  This model naturally outputs logits.

In [63]:
import datasets
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

ds = datasets.load_dataset( "scikit-learn/iris" )

df = pd.DataFrame( ds['train'] )

X = np.array( df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']] )
Y = np.array( LabelEncoder().fit_transform( df['Species'] ) )

In [64]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X,Y)

W = model.coef_
b = model.intercept_

b = np.reshape( b, (3,1))

logits = np.dot( W, X.T ) + b

---
## Exercise 1: convert logits to probabilities

Since our model outputs logits, they must be converted. To do this, we'll use the softmax function.
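
For reference, here is the usual textbook form of the softmax for the $d$ class logits $a_1, \dots, a_d$ of a single data point, together with the shift identity that makes a max-subtracted implementation safe. The notation ($a_c$, $m$) is chosen here for illustration and may differ from the course text:

$$
\mathrm{softmax}(\mathbf{a})_c = \frac{e^{a_c}}{\sum_{c'=1}^{d} e^{a_{c'}}} = \frac{e^{a_c - m}}{\sum_{c'=1}^{d} e^{a_{c'} - m}}, \qquad m = \max_{c'} a_{c'}
$$

The second form is algebraically identical (the factor $e^{-m}$ cancels), but every exponent is at most zero, so the exponentials cannot overflow.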

In [140]:
def softmax( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points
    # use equation 2.99 (see also Eq. 2.94)

    # your code here
    m = np.max(logits, axis=0, keepdims=True)
    exp_logits = np.exp(logits - m)
    probs = exp_logits / np.sum(exp_logits, axis=0, keepdims=True)
    return probs


In [141]:
# print out test cases
probs = softmax( logits )
probs[:,0]


array([0.982  , 0.01823, 0.     ], dtype=float16)

In [142]:
probs[:,120]

array([5.543e-06, 2.397e-02, 9.761e-01], dtype=float16)

### test cases
probs = softmax( logits )
probs[:,0]
#### array([9.81803910e-01, 1.81960759e-02, 1.43430317e-08])
probs[:,120]
#### array([5.49519371e-06, 2.38812718e-02, 9.76113233e-01])

---
## Exercise 2: code up the logsumexp function

Now, code up the logsumexp function.  What test cases should you use for this function?

In [143]:
def logsumexp( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points
    # use equation 2.100

    # your code here
    m = np.max(logits, axis=0, keepdims=True)
    sum_exp = np.sum(np.exp(logits - m), axis=0, keepdims=True)
    log_probs = logits - (m + np.log(sum_exp))
    probs = np.exp(log_probs)
    return probs

In [144]:
# test cases
probs = logsumexp( logits )
probs[:,0]

array([0.9805 , 0.01817, 0.     ], dtype=float16)

What should be printed?

The output should match that of the softmax function, differing only in the last few decimal places:

array([0.9805 , 0.01817, 0.     ], dtype=float16)
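
On the question of test cases, one possible approach is to check a log-sum-exp against values that are known in closed form. The sketch below uses a hypothetical helper, `lse_columns`, which returns the log-sum-exp itself (one number per column) rather than probabilities; the inputs are made up for illustration.

```python
import numpy as np

def lse_columns(logits):
    """Column-wise log-sum-exp of a d x N array (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    m = np.max(logits, axis=0, keepdims=True)   # per-column max, shape 1 x N
    # Every exponent is <= 0 after the shift, so exp cannot overflow
    shifted_sum = np.sum(np.exp(logits - m), axis=0, keepdims=True)
    return (m + np.log(shifted_sum)).ravel()

# Candidate test cases with known answers
zeros = np.zeros((3, 1))                  # log(3 * e^0) = log(3)
big = np.array([[1000.0], [1000.0]])      # naive exp(1000) overflows; answer is 1000 + log(2)
x = np.array([[0.5, -2.0], [1.5, 0.0], [-1.0, 3.0]])

lse_zeros = lse_columns(zeros)
lse_big = lse_columns(big)
# Shift property: lse(x + c) should equal lse(x) + c
lse_shift_gap = np.max(np.abs(lse_columns(x + 7.0) - (lse_columns(x) + 7.0)))
print(lse_zeros, lse_big, lse_shift_gap)
```

The three cases cover a hand-computable value, an input that would overflow a naive implementation, and the shift property that the stable form relies on.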


---
## Exercise 3: explore underflow / overflow

First, code up a function that compares two distributions. This can be anything you want; you may consider things like the MSE.

In [145]:
def compare_probs( probs1, probs2 ):
    # your code here
    return np.mean((probs1 - probs2) ** 2)

In [146]:
probs1 = softmax( logits )
probs2 = logsumexp( logits )
compare_probs( probs1, probs2 )

np.float16(3.6e-07)
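
MSE is only one way to compare distributions. As a sketch of two alternatives, the hypothetical helpers below compute the largest elementwise gap and the mean KL divergence across columns. The `eps` guard and the demo arrays are assumptions made for illustration.

```python
import numpy as np

def max_abs_diff(p, q):
    """Largest elementwise gap between two d x N probability arrays."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.max(np.abs(p - q)))

def mean_kl(p, q, eps=1e-12):
    """Mean over columns of KL(p || q); eps (an assumed value) guards log(0)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    kl_per_column = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=0)
    return float(np.mean(kl_per_column))

# Made-up distributions: each column sums to 1
p_demo = np.array([[0.7, 0.2], [0.2, 0.3], [0.1, 0.5]])
q_demo = np.array([[0.6, 0.2], [0.3, 0.3], [0.1, 0.5]])
print(max_abs_diff(p_demo, q_demo), mean_kl(p_demo, q_demo))
```

The maximum gap flags a single badly wrong entry that an average such as MSE can hide, while KL weights errors by how much probability mass they affect.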

Now, see what happens if you add (or subtract) a constant from logits. How big must the constant be before things start going haywire?

In [147]:
constants = [0, 100, 1000, 10000, 1e5, -100, -1000, -10000, -1e5]
for C in constants:
    probs1 = softmax(logits + C)
    probs2 = logsumexp(logits + C)
    mse = compare_probs(probs1, probs2)
    print(f"C={C:6}: MSE = {mse:.2e}")

C=     0: MSE = 3.58e-07
C=   100: MSE = 1.04e-04
C=  1000: MSE = 2.62e-03
C= 10000: MSE = 1.67e-01
C=100000.0: MSE = nan
C=  -100: MSE = 1.04e-04
C= -1000: MSE = 2.62e-03
C=-10000: MSE = 1.67e-01
C=-100000.0: MSE = nan




Now convert the logits to 16-bit precision, and re-run your experiments. Analyze the differences you see (2-3 sentences).

In [148]:
logits = logits.astype( np.float16 )
constants = [0, 100, 1000, 10000, 1e5, -100, -1000, -10000, -1e5]
for C in constants:
    probs1 = softmax(logits + C)
    probs2 = logsumexp(logits + C)
    mse = compare_probs(probs1, probs2)
    print(f"C={C:6}: MSE = {mse:.2e}")

C=     0: MSE = 3.58e-07
C=   100: MSE = 1.04e-04
C=  1000: MSE = 2.62e-03
C= 10000: MSE = 1.67e-01
C=100000.0: MSE = nan
C=  -100: MSE = 1.04e-04
C= -1000: MSE = 2.62e-03
C=-10000: MSE = 1.67e-01
C=-100000.0: MSE = nan




### Analysis


Adding a small constant (up to about $\pm$100) leaves the output probabilities nearly unchanged, but the error grows as the constant gets larger, and at $\pm 10^5$ every result becomes NaN: the shifted logits overflow or underflow, which breaks the softmax.
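
One possible explanation for the printed pattern, offered as a sketch rather than a full diagnosis, is the limited range and resolution of float16. The snippet assumes the constant is added in float16 arithmetic; the specific numbers (10000, 3) are chosen only for illustration.

```python
import numpy as np

f16_max = float(np.finfo(np.float16).max)      # largest finite float16 value

with np.errstate(over="ignore", invalid="ignore"):
    overflowed = np.float16(1e5)               # beyond f16_max, so it becomes inf
    # A small logit difference added to a large offset is rounded away,
    # because neighboring float16 values near 10000 are 8 apart
    base = np.float16(10000.0)
    bumped = base + np.float16(3.0)
    # Once values are inf, subtracting the column max gives inf - inf
    inf_minus_inf = np.float32(np.inf) - np.float32(np.inf)

print(f16_max, overflowed, bumped == base, inf_minus_inf)
```

Under these assumptions, moderate constants degrade the result because the differences between logits are rounded away, while constants beyond the float16 range turn the inputs into infinities, and `inf - inf` is NaN.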

---
## Exercise 4: cleanly compute log probabilities

Sometimes, we want to compute log probabilities (which are different from logits), but we want to do so "cleanly", i.e., while avoiding overflow / underflow. First, work out mathematically what the log of the softmax is (i.e., take the log of eq. 2.99), and then combine it with insights from coding up the logsumexp function. Hint: at the end of the day, you will simply shift each column by a per-column constant!
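
As a sketch of the derivation, write $a_1, \dots, a_d$ for the logits of one data point and $m = \max_{c'} a_{c'}$ (notation chosen here for illustration). Taking the log of the standard softmax gives:

$$
\log \mathrm{softmax}(\mathbf{a})_c = a_c - \log \sum_{c'=1}^{d} e^{a_{c'}} = a_c - \Big( m + \log \sum_{c'=1}^{d} e^{a_{c'} - m} \Big)
$$

The term in parentheses is the log-sum-exp of the column. It does not depend on $c$, which is why the result is just each column shifted by a per-column constant.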

In [149]:
def log_logsumexp( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points

    # your code here
    m = np.max(logits, axis=0, keepdims=True)
    logits_shifted = logits - m
    log_sum_exp = m + np.log(np.sum(np.exp(logits_shifted), axis=0, keepdims=True))
    return logits - log_sum_exp

In [150]:
probs = log_logsumexp( logits )
probs[:,0]

array([ -0.01953,  -4.008  , -18.06   ], dtype=float16)

In [152]:

constants = [0, 100, 1000, 10000, 1e5, -100, -1000, -10000, -1e5]
for C in constants:
    probs1 = logsumexp(logits)
    probs2 = log_logsumexp(logits + C)
    mse = compare_probs(probs1, probs2)
    print(f"C={C:6}: MSE = {mse:.2e}")

logits = logits.astype( np.float16 )
constants = [0, 100, 1000, 10000, 1e5, -100, -1000, -10000, -1e5]
for C in constants:
    probs1 = logsumexp(logits)
    probs2 = log_logsumexp(logits + C)
    mse = compare_probs(probs1, probs2)
    print(f"C={C:6}: MSE = {mse:.2e}")

C=     0: MSE = 5.54e+01
C=   100: MSE = 5.53e+01
C=  1000: MSE = 5.50e+01
C= 10000: MSE = 6.04e+01
C=100000.0: MSE = nan
C=  -100: MSE = 5.53e+01
C= -1000: MSE = 5.50e+01
C=-10000: MSE = 6.04e+01
C=-100000.0: MSE = nan
C=     0: MSE = 5.54e+01
C=   100: MSE = 5.53e+01
C=  1000: MSE = 5.50e+01
C= 10000: MSE = 6.04e+01
C=100000.0: MSE = nan
C=  -100: MSE = 5.53e+01
C= -1000: MSE = 5.50e+01
C=-10000: MSE = 6.04e+01
C=-100000.0: MSE = nan
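
Note that the loop above compares probabilities (from `logsumexp`) with log probabilities (from `log_logsumexp`), which is probably why the MSE is around 55 even at C = 0. A like-for-like check compares log probabilities with log probabilities. The sketch below does this in float64 with a hypothetical helper, `log_softmax_columns`, and made-up logits.

```python
import numpy as np

def log_softmax_columns(logits):
    """Stable column-wise log-softmax of a d x N array (hypothetical helper)."""
    logits = np.asarray(logits, dtype=np.float64)
    m = np.max(logits, axis=0, keepdims=True)
    log_sum_exp = m + np.log(np.sum(np.exp(logits - m), axis=0, keepdims=True))
    return logits - log_sum_exp

demo_logits = np.array([[2.0, -1.0], [0.5, 0.0], [-3.0, 4.0]])   # made-up values
reference = log_softmax_columns(demo_logits)

shifts = [100.0, 1e4, -1e4]
for C in shifts:
    shifted = log_softmax_columns(demo_logits + C)
    # Both arguments are log probabilities, so the MSE is meaningful
    print(f"C={C:9}: MSE = {np.mean((shifted - reference) ** 2):.2e}")

column_sums = np.exp(reference).sum(axis=0)   # exponentiating recovers probabilities
worst_gap = max(
    np.max(np.abs(log_softmax_columns(demo_logits + C) - reference)) for C in shifts
)
```

In float64 the shifted and unshifted log probabilities agree to within rounding error, which is the shift invariance the exercise is probing.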




---
# Part 2: Probability Fundamentals

For the following exercises, you are encouraged to work by hand, in code, or both, whichever makes the most sense.



## Exercise 1a: Joint Probability Distributions

You are given the following two binary variables, X and Y, that can each take on the values 0 or 1. Assuming X and Y are independent, calculate the joint probability table (2x2 table for P(X, Y)). Display as a numpy array.

P(X=0) = 0.6

P(X=1) = 0.4

P(Y=0) = 0.7

P(Y=1) = 0.3


In [165]:
# Your code here
P_X = np.array([0.6, 0.4])  # X=0, X=1
P_Y = np.array([0.7, 0.3])  # Y=0, Y=1
joint = np.outer(P_X, P_Y)
print(joint)


[[0.42 0.18]
 [0.28 0.12]]


Next, compute the following conditional probabilities:

P(X=0|Y=0) = ?

P(X=0|Y=1) = ?

P(Y=0|X=0) = ?

P(Y=0|X=1) = ?

In [175]:
# Your code here

p_x0_y0 = joint[0,0] / joint[:,0].sum()
p_x0_y1 = joint[0,1] / joint[:,1].sum()
p_y0_x0 = joint[0,0] / joint[0,:].sum()
p_y0_x1 = joint[1,0] / joint[1,:].sum()
print("P(X=0|Y=0):", p_x0_y0)
print("P(X=0|Y=1):", p_x0_y1)
print("P(Y=0|X=0):", p_y0_x0)
print("P(Y=0|X=1):", p_y0_x1)

P(X=0|Y=0): 0.6
P(X=0|Y=1): 0.6
P(Y=0|X=0): 0.7
P(Y=0|X=1): 0.7


Compare the result of these conditional probabilities to the original marginal probabilities given. What does this say about the relationship between variable dependence and using conditional probabilities? Write 1-2 sentences.

(Your answer here)

## Exercise 1b: Joint Probability Distributions

Now consider this joint distribution:

|  | $Y = 0$ | $Y = 1$|
| :------- | :------: | -------: |
| $X = 0$  | 0.45  | 0.10  |
| $X = 1$  | 0.25  | 0.20  |

First, compute the marginals from the joint table

In [176]:
joint_prob_table = np.array([[0.45, 0.10],
                             [0.25, 0.20]])

# P(X=0) = ?
# P(X=1) = ?
# P(Y=0) = ?
# P(Y=1) = ?

# Your answer here
joint = np.array([[0.45, 0.10], [0.25, 0.20]])
PX = joint.sum(axis=1)
PY = joint.sum(axis=0)

print("Marginals: P(X):", PX, "P(Y):", PY)

Marginals: P(X): [0.55 0.45] P(Y): [0.7 0.3]


Compute the same conditional probabilities as above:

P(X=0|Y=0) = ?

P(X=0|Y=1) = ?

P(Y=0|X=0) = ?

P(Y=0|X=1) = ?

In [177]:
# P(X=0|Y=0) = ?
# P(X=0|Y=1) = ?
# P(Y=0|X=0) = ?
# P(Y=0|X=1) = ?

# Your answer here
p_x0_y0 = joint[0,0]/PY[0]
p_x0_y1 = joint[0,1]/PY[1]
p_y0_x0 = joint[0,0]/PX[0]
p_y0_x1 = joint[1,0]/PX[1]
print("P(X=0|Y=0):", p_x0_y0)
print("P(X=0|Y=1):", p_x0_y1)
print("P(Y=0|X=0):", p_y0_x0)
print("P(Y=0|X=1):", p_y0_x1)


P(X=0|Y=0): 0.6428571428571429
P(X=0|Y=1): 0.3333333333333333
P(Y=0|X=0): 0.8181818181818181
P(Y=0|X=1): 0.5555555555555556


Check if the independence property $P(X, Y) = P(X)P(Y)$ holds for any cell.

In [180]:
# Your answer here
p_xy = joint[0,1]
px_py = PX[0]*PY[1]
print("Check independence for (X=0,Y=1):", p_xy, "vs", px_py )

Check independence for (X=0,Y=1): 0.1 vs 0.16500000000000004
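
The same check can be run over every cell at once by comparing the joint table with the outer product of its marginals. This is a minimal sketch; the names `joint_demo`, `p_x`, and `p_y` are chosen here so the snippet does not depend on earlier cells.

```python
import numpy as np

joint_demo = np.array([[0.45, 0.10],
                       [0.25, 0.20]])    # the joint table from this exercise
p_x = joint_demo.sum(axis=1)             # marginal of X (sum over Y)
p_y = joint_demo.sum(axis=0)             # marginal of Y (sum over X)
product = np.outer(p_x, p_y)             # what the joint would be under independence

gap = joint_demo - product               # nonzero entries indicate dependence
is_independent = np.allclose(joint_demo, product)
print(product)
print(gap)
print("independent:", is_independent)
```

Independence requires the product rule to hold in every cell, so a single mismatching cell is enough to rule it out.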


Compare P(X=0|Y=0) to P(X=0|Y=1), and discuss what this says about the dependence between these variables (1-2 sentences).

(Your answer here)

P(X=0|Y=0) = 0.6429, while P(X=0|Y=1) = 0.3333.
Because the probability that X=0 changes depending on the value of Y, X and Y are dependent.

---
## Exercise 2: Bayes Theorem

After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive
for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you
have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The
good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you
actually have the disease? (Show your calculations as well as giving the final result.)

*Hint: write out the variables you know, and think about what you'll need to calculate to find the final answer.*

In [188]:
# Your answer/work here
P_D = 1 / 10000           # probability of having disease
P_notD = 1 - P_D          # probability of not having disease
P_Pos_given_D = 0.99      # True positive
P_Neg_given_notD = 0.99   # True negative

P_Pos_given_notD = 1 - P_Neg_given_notD
# law of total probability
P_Pos = P_Pos_given_D * P_D + P_Pos_given_notD * P_notD
# Bayes' theorem
P_D_given_Pos = (P_Pos_given_D * P_D) / P_Pos
print( P_D_given_Pos)
print(f"P(D|Positive) = {P_D_given_Pos:.4f}  ({P_D_given_Pos*100:.2f}%)")




0.009803921568627442
P(D|Positive) = 0.0098  (0.98%)
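
Written out by hand, with $D$ for having the disease and $+$ for a positive test (symbols chosen here for illustration), the same calculation is:

$$
P(D \mid +) = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)} = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999} = \frac{0.000099}{0.010098} \approx 0.0098
$$

So even after a positive result, the chance of actually having the disease is just under 1%, because false positives among the many healthy people far outnumber true positives among the few who are sick.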
