<a href="https://colab.research.google.com/github/wingated/cs473/blob/main/labs/cs473_lab_week_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><p><b>After clicking the "Open in Colab" link, copy the notebook to your own Google Drive before getting started, or it will not save your work</b></p>

# BYU CS 473 Lab Week 2

## Introduction:
Welcome to your first lab for CS 473, Advanced Machine Learning.

In machine learning, models often predict *unnormalized log probabilities*. These must often be converted into regular probabilities.

In this lab, you will explore the log-sum-exp function, which is described in the text (Sec. 2.5.4).  You will code up several variants of the function, and compare their performance.

# Part 1: Logsumexp
---
## Setup: The Iris Dataset
We'll begin by downloading the Iris dataset. The Iris dataset is a simple but very famous dataset introduced by R. A. Fisher (often called the "father" of modern statistics) in 1936. The dataset has five columns:
* sepal length (cm)
* sepal width (cm)
* petal length (cm)
* petal width (cm)
* class

In order to get logits to play with, we'll first train a multinomial logistic regression model (Sec. 2.5.3).  This model naturally outputs logits.

In [1]:
import datasets
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

ds = datasets.load_dataset( "scikit-learn/iris" )

df = pd.DataFrame( ds['train'] )

X = np.array( df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']] )
Y = np.array( LabelEncoder().fit_transform( df['Species'] ) )


In [6]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X,Y)

W = model.coef_ # weight matrix, shape (3, 4): one row of feature weights per class
b = model.intercept_ # per-class bias (intercept), shape (3,)

b = np.reshape( b, (3,1)) # column vector, so it broadcasts across all N data points

logits = np.dot( W, X.T ) + b # W times X transpose, plus the bias: a 3 x N matrix of logits
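
The layout of `logits` (one column of class scores per data point) matters for every exercise below. The following is a minimal sketch on random toy data, not the Iris data, that checks the shape and shows that the column layout is just the transpose of the more common row layout. All names ending in `_toy` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_features, d = 5, 4, 3               # toy sizes, not the Iris data
X_toy = rng.normal(size=(N, n_features))
W_toy = rng.normal(size=(d, n_features))
b_toy = rng.normal(size=d)

# Column layout used in this lab: one column of d logits per data point.
logits_toy = W_toy @ X_toy.T + b_toy.reshape(d, 1)

# Row layout (one row per data point), transposed for comparison.
logits_rows = (X_toy @ W_toy.T + b_toy).T

print(logits_toy.shape)
print(np.allclose(logits_toy, logits_rows))
```

Because the bias is reshaped to `(d, 1)`, NumPy broadcasts it across all `N` columns.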


In [7]:
logits

array([[  7.34220929,   6.9456426 ,   7.47512987,   6.91728918,
          7.48095349,   6.62790711,   7.35010476,   7.03636746,
          7.0616021 ,   6.89800136,   7.15544574,   6.86926983,
          7.09615725,   8.06440693,   8.03052061,   7.48602091,
          7.63590256,   7.23414525,   6.51232571,   7.27102714,
          6.36256726,   7.06666952,   8.65875141,   6.06943342,
          6.11327325,   6.39919425,   6.56824052,   7.04775981,
          7.20346508,   6.71913328,   6.58038908,   6.65043691,
          7.73358535,   7.84646191,   6.89800136,   7.59977687,
          7.42440567,   6.89800136,   7.40989455,   6.99391684,
          7.52859473,   6.5853248 ,   7.60248172,   6.44840603,
          6.15496766,   6.88002918,   7.12709232,   7.26558162,
          7.19789636,   7.19207273,  -3.3659641 ,  -2.7153267 ,
         -4.03186882,  -1.72379102,  -3.39495052,  -2.58721864,
         -3.18864426,   0.71539042,  -3.12497948,  -1.06733   ,
         -0.21623226,  -1.93966419,  -1.

---
## Exercise 1: convert logits to probabilities

Since our model outputs logits, they must be converted. To do this, we'll use the softmax function.

In [20]:
def softmax( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points
    # use equation 2.99 (see also Eq. 2.94)

    # your code here
    m = np.max(logits, axis=0, keepdims=True)  # per-column max, for numerical stability
    e = np.exp(logits - m)
    return e / np.sum(e, axis=0, keepdims=True)



In [19]:
# print out test cases
probs = softmax( logits )
probs[:,120]


array([5.54234723e-06, 2.39030160e-02, 9.76091442e-01])

### test cases
probs = softmax( logits )
probs[:,0]
#### array([9.81803910e-01, 1.81960759e-02, 1.43430317e-08])
probs[:,120]
#### array([5.49519371e-06, 2.38812718e-02, 9.76113233e-01])
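
Beyond comparing against the printed values, two generic sanity checks are worth running on any softmax: every column should sum to 1, and adding a constant to all logits should leave the probabilities unchanged. The sketch below uses a hypothetical helper `softmax_columns` and a small toy matrix so that it runs on its own; it is one possible implementation, not the reference solution.

```python
import numpy as np

def softmax_columns(a):
    # Hypothetical helper: softmax over axis 0 (one column per data point).
    shifted = a - np.max(a, axis=0, keepdims=True)  # per-column shift for stability
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

toy = np.array([[2.0, -1.0],
                [1.0,  0.0],
                [0.0,  3.0]])
p = softmax_columns(toy)

print(p.sum(axis=0))                                 # each entry should be 1, up to rounding
print(np.allclose(p, softmax_columns(toy + 100.0)))  # shift invariance
```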

---
## Exercise 2: convert logits to probabilities

Now, code up the logsumexp function.  What test cases should you use for this function?

In [None]:
def logsumexp( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points
    # use equation 2.100

    # your code here
    pass

In [None]:
# test cases
probs = logsumexp( logits )
probs[:,0]

What should be printed?

(your answer here)
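
As a hint, the standard log-sum-exp identity is $\log\sum_c e^{a_c} = m + \log\sum_c e^{a_c - m}$ with $m = \max_c a_c$. The sketch below compares a naive evaluation with a shifted one on toy data. The function names are hypothetical, and the functions return the per-column log normalizer; depending on how you read the exercise, you may want your own `logsumexp` to go one step further and return `np.exp(logits - lse)`.

```python
import numpy as np

def naive_logsumexp(a):
    # Direct evaluation; np.exp overflows once entries exceed roughly 709 in float64.
    return np.log(np.sum(np.exp(a), axis=0))

def stable_logsumexp(a):
    # Hypothetical name. Subtract the per-column max, then add it back.
    m = np.max(a, axis=0, keepdims=True)
    lse = m + np.log(np.sum(np.exp(a - m), axis=0, keepdims=True))
    return lse.squeeze(axis=0)

small = np.array([[0.0, 1.0],
                  [0.0, 2.0],
                  [0.0, 3.0]])
large = small + 1000.0

print(stable_logsumexp(small))       # first entry is log(3): three equal logits of 0
with np.errstate(over="ignore"):
    print(naive_logsumexp(large))    # overflows to inf
print(stable_logsumexp(large))       # stays finite
```

Useful test cases follow the same pattern: a column of `d` equal logits (the answer is that logit plus `log(d)`), and very large or very small inputs where the naive version fails.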

---
## Exercise 3: explore underflow / overflow

First, code up a function that compares two distributions. This can be anything you want; for example, you might use the mean squared error (MSE).

In [None]:
def compare_probs( probs1, probs2 ):
    # your code here
    pass

In [None]:
probs1 = softmax( logits )
probs2 = logsumexp( logits )
compare_probs( probs1, probs2 )
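
Two simple candidates for the comparison are the mean squared error and the largest absolute difference. The sketch below shows both on made-up probability tables; the function names are hypothetical and either one (or something else entirely) could serve as `compare_probs`.

```python
import numpy as np

def mean_squared_error(p, q):
    # Average squared difference over all entries.
    return float(np.mean((p - q) ** 2))

def max_abs_difference(p, q):
    # Worst-case disagreement between the two tables.
    return float(np.max(np.abs(p - q)))

# Made-up d x N probability tables (each column sums to 1).
p = np.array([[0.7, 0.2], [0.2, 0.3], [0.1, 0.5]])
q = np.array([[0.6, 0.2], [0.3, 0.3], [0.1, 0.5]])

print(mean_squared_error(p, q))
print(max_abs_difference(p, q))
```

Both measures return `nan` if either input contains a `nan`, which is worth keeping in mind for the overflow experiments.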

Now, see what happens if you add (or subtract) a constant from logits. How big must the constant be before things start going haywire?

In [None]:
# your code here
probs1 = softmax( logits + C )
# etc.
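
One way to organize this experiment is to sweep over a list of constants and record when the output stops being finite. The sketch below does this for a deliberately naive softmax (no max subtraction) on a single toy column in float64; `naive_softmax` and the list of constants are illustrative choices, not part of the lab.

```python
import numpy as np

def naive_softmax(a):
    # No max subtraction, so large logits overflow and very negative ones underflow.
    e = np.exp(a)
    return e / np.sum(e, axis=0, keepdims=True)

toy = np.array([[1.0], [2.0], [3.0]])   # one toy column of logits (float64)

with np.errstate(over="ignore", invalid="ignore"):
    for C in [-1000.0, 0.0, 100.0, 700.0, 710.0, 1000.0]:
        p = naive_softmax(toy + C)
        print(C, "contains nan:", bool(np.isnan(p).any()))
```

In float64, `np.exp` overflows to `inf` for arguments a little above 709, and `inf / inf` is `nan`; for very negative arguments every exponential underflows to 0 and `0 / 0` is also `nan`.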

Now convert the logits to 16-bit precision, and re-run your experiments. Analyze the differences you see (2-3 sentences).

In [None]:
logits = logits.astype( np.float16 )

# your code here
probs1 = softmax( logits + C )
# etc.
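
Before re-running the experiments, it can help to look at the range of each floating-point type. The sketch below prints the largest finite value for both precisions and shows that `np.exp` in float16 already overflows for quite small arguments; the array `x16` is a toy input.

```python
import numpy as np

print(np.finfo(np.float16).max)   # largest finite float16 value
print(np.finfo(np.float64).max)   # largest finite float64 value

x16 = np.array([11.0, 12.0], dtype=np.float16)
with np.errstate(over="ignore"):
    e16 = np.exp(x16)             # the result stays in float16
print(e16.dtype, e16)
```

Since the largest float16 value is 65504, exponentials overflow once the argument exceeds about `log(65504)`, which is roughly 11.09.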

### Analysis

(Your analysis here)

---
## Exercise 4: cleanly compute log probabilities

Sometimes, we want to compute log probabilities (which are different from logits), but we want to do so "cleanly", i.e., while avoiding overflow / underflow. First, mathematically figure out what the log of the softmax is (i.e., take the log of eq. 2.99), and then combine it with insights from coding up the logsumexp function. Hint: at the end of the day, you will simply shift each column by a per-column constant!

In [None]:
def log_logsumexp( logits ):
    # logits is a numpy matrix of d x N
    # where
    #   d is the number of classes
    #   N is the number of data points

    # your code here
    pass
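
Taking the log of the softmax gives $\log p_c = a_c - \log\sum_{c'} e^{a_{c'}}$, so the per-column constant in the hint is the log-sum-exp of that column. The sketch below is one possible implementation under that reading, with a hypothetical name and toy inputs chosen so that a naive version would overflow.

```python
import numpy as np

def log_softmax_columns(a):
    # Hypothetical helper: log-probabilities for each column of a (d x N).
    m = np.max(a, axis=0, keepdims=True)
    lse = m + np.log(np.sum(np.exp(a - m), axis=0, keepdims=True))  # per-column constant
    return a - lse

toy = np.array([[1000.0, -5.0],
                [1001.0,  0.0],
                [1002.0,  5.0]])
logp = log_softmax_columns(toy)

print(logp)
print(np.exp(logp).sum(axis=0))   # columns of exp(logp) should sum to 1
```

Note that the probabilities are never formed explicitly, which is what keeps the result finite even for logits near 1000.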

---
# Part 2: Probability Fundamentals

For the following exercises, you are encouraged to work both by hand and in code, whichever makes the most sense.



## Exercise 1a: Joint Probability Distributions

You are given the following two binary variables, X and Y, that can each take on the values 0 or 1. Assuming X and Y are independent, calculate the joint probability table (2x2 table for P(X, Y)). Display as a numpy array.

P(X=0) = 0.6

P(X=1) = 0.4

P(Y=0) = 0.7

P(Y=1) = 0.3


In [None]:
# Your code here
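
When two variables are independent, each cell of the joint table is the product of the corresponding marginals, which is an outer product. The sketch below shows the pattern with made-up marginals that are deliberately different from the ones in the exercise.

```python
import numpy as np

# Illustrative marginals only; these are not the values given in the exercise.
p_x = np.array([0.5, 0.5])   # P(X=0), P(X=1)
p_y = np.array([0.8, 0.2])   # P(Y=0), P(Y=1)

# Under independence, P(X=i, Y=j) = P(X=i) * P(Y=j): an outer product.
joint = np.outer(p_x, p_y)   # rows index X, columns index Y

print(joint)
print(joint.sum())           # a valid joint table sums to 1
```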


Next, compute the following conditional probabilities:

P(X=0|Y=0) = ?

P(X=0|Y=1) = ?

P(Y=0|X=0) = ?

P(Y=0|X=1) = ?

In [None]:
# Your code here
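
A conditional probability is a joint probability divided by the marginal of the conditioning variable, for example $P(X=i \mid Y=j) = P(X=i, Y=j) / P(Y=j)$. The sketch below applies this to a made-up joint table (not one from this lab), normalizing columns for $P(X \mid Y)$ and rows for $P(Y \mid X)$.

```python
import numpy as np

# Illustrative table (rows index X, columns index Y); not a table from the lab.
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

# Divide each column by P(Y=j): entry [i, j] is P(X=i | Y=j).
p_x_given_y = joint / joint.sum(axis=0, keepdims=True)

# Divide each row by P(X=i): entry [i, j] is P(Y=j | X=i).
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

print(p_x_given_y)
print(p_y_given_x)
```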


Compare the result of these conditional probabilities to the original marginal probabilities given. What does this say about the relationship between variable dependence and using conditional probabilities? Write 1-2 sentences.

(Your answer here)

## Exercise 1b: Joint Probability Distributions

Now consider this joint distribution:

|  | $Y = 0$ | $Y = 1$|
| :------- | :------: | -------: |
| $X = 0$  | 0.45  | 0.10  |
| $X = 1$  | 0.25  | 0.20  |

First, compute the marginals from the joint table.

In [None]:
joint_prob_table = np.array([[0.45, 0.10],
                             [0.25, 0.20]])

# P(X=0) = ?
# P(X=1) = ?
# P(Y=0) = ?
# P(Y=1) = ?

# Your answer here
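
Marginalizing means summing the joint table over the variable you want to remove. The sketch below shows the two `sum` calls on a made-up table; with rows indexing X and columns indexing Y, summing along `axis=1` removes Y and summing along `axis=0` removes X.

```python
import numpy as np

# Illustrative table (rows index X, columns index Y); not the exercise's table.
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

p_x = joint.sum(axis=1)   # sum over Y: [P(X=0), P(X=1)]
p_y = joint.sum(axis=0)   # sum over X: [P(Y=0), P(Y=1)]

print(p_x)
print(p_y)
```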


Compute the same conditional probabilities as above:

P(X=0|Y=0) = ?

P(X=0|Y=1) = ?

P(Y=0|X=0) = ?

P(Y=0|X=1) = ?

In [None]:
# P(X=0|Y=0) = ?
# P(X=0|Y=1) = ?
# P(Y=0|X=0) = ?
# P(Y=0|X=1) = ?

# Your answer here


Check if the independence property $P(X, Y) = P(X)P(Y)$ holds for any cell.

In [None]:
# Your answer here
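
One way to run this check is to rebuild the table that independence would imply (the outer product of the marginals) and compare it to the actual joint table cell by cell. The sketch below does so on a made-up table, using `np.isclose` rather than `==` to tolerate floating-point rounding.

```python
import numpy as np

# Illustrative table (rows index X, columns index Y); not the exercise's table.
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

# Table implied by independence: P(X=i) * P(Y=j) for every cell.
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))

print(product)
print(np.isclose(joint, product))    # cell-by-cell comparison
print(np.allclose(joint, product))   # True only if every cell matches
```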


Compare P(X=0|Y=0) to P(X=0|Y=1), and discuss what this says about the dependence between these variables (1-2 sentences).

(Your answer here)

<br>

---
## Exercise 2: Bayes' Theorem

After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive
for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you
have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The
good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you
actually have the disease? (Show your calculations as well as giving the final result.)

*Hint: write out the variables you know, and think about what you'll need to calculate to find the final answer.*

In [None]:
# Your answer/work here
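
The general shape of the calculation is Bayes' rule with the denominator expanded by the law of total probability. The sketch below wraps that formula in a hypothetical helper and evaluates it with made-up numbers that differ from the ones in the problem, so you still need to plug in the stated values yourself.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    # P(D | +) = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | not D) P(not D)]
    true_pos = sensitivity * prior                  # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prior) # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers only; not the values stated in the exercise.
print(posterior_given_positive(prior=0.01, sensitivity=0.95, specificity=0.90))
```

Here `sensitivity` is the probability of testing positive given the disease, and `specificity` is the probability of testing negative given no disease.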
