# STA410 Week 1 Homework (4 points)

## Due 12 PM Jan 17 (before class starts)


1. **Paired or individual assignment.** Work may be shared within pairs without restriction, but collaborations beyond the pairs must be limited to "hints" and may not share complete solutions.

2. You are encouraged to adapt code you find available online into your notebook; however, if you do so, please provide a link to the resource you used. ***If failure to cite such references is identified and confirmed, your mark may be reduced.***  

3. **Library imports are limited** to only libraries imported in the starter code and the [standard Python modules](https://docs.python.org/3/py-modindex.html). Automated code tests that fail because of additional library imports will not receive credit. Unless a problem instructs differently, you may use any functions available from the Python stdlib and the libraries imported in the starter code.

<details><summary><span style="color: blue; text-decoration: underline; cursor: pointer;">Additional Details</span></summary>

> **Do not delete, replace, or rearrange cells.** Doing so erases the `cell ids` upon which automated code tests are based. The "Edit > Undo Delete Cells" option in the notebook editor might be helpful; otherwise, redownload the notebook (so it has the correct required `cell ids`) and repopulate it with your answers (assuming you don't overwrite them when you redownload the notebook). ***You may add cells for scratch work***, but if required answers are not submitted through the provided cells where the answers are requested, your answers may not be marked. Due to potential problems with `cell ids`, **the only environments supported in this class are** [UofT JupyterHub](https://datatools.utoronto.ca/) or [Google Colab](https://colab.research.google.com/).
>
> **No jupyter shortcut commands** such as `! python script.py 10` or `%%timeit` may be included in the final submission as they will cause subsequent automated code tests to fail.
>
> **No cells may have any runtime errors** because this causes subsequent automated code tests to fail and you will not get marks for tests which fail because of previous runtime errors. ***Restart and re-run the cells in your notebook to ensure there are no runtime errors before submitting your work.***

</details>


In [None]:
# Unless otherwise instructed, you may use any functions available 
# from the following library imports
import numpy as np
from scipy import stats
from scipy import integrate
import matplotlib.pyplot as plt
import time

## Student and Contribution

Are you working with a partner to complete this assignment?  
- If not, assign the value of `None` into the variable `Partner`.
- If so, assign the name of the person you worked with into the variable `Partner`.
    - Format the name as `"<First Name> <Last Name>"` as a `str` type, e.g., "Scott Schwartz".

In [None]:
Partner = #None
# This cell will produce a runtime error until you assign a value to this variable

What was your contribution in completing the code for this assignment's problems?  
Assign one of the following into the `Contribution` variable below.

- `"I worked alone"`
- `"I contributed more than my partner"`
- `"My partner and I contributed equally"`
- `"I contributed less than my partner"`
- `"I did not contribute"`

In [None]:
Contribution = #"I worked alone"
# This cell will produce a runtime error until you assign a value to this variable

## Part 1: Monte Carlo integration 

$\displaystyle \int h(x) dx = \int g(x)f(x) dx = E[g(X)] \approx \frac{1}{n}\sum_{i=1}^n g(x_i) \quad \textrm{ for } \quad x_i \sim f(X)$ 
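
As a minimal sketch of the approximation above, the following estimates $\int_0^1 x^2 dx = 1/3$ by taking $f$ to be the Uniform(0,1) density and $g(x) = x^2$. The integrand, the seed, and the names `g_demo`, `n_demo`, `x_demo` are illustrative assumptions and are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, used only for reproducibility
g_demo = lambda x: x**2              # hypothetical integrand, not the assignment's g
n_demo = 100_000

x_demo = rng.uniform(0, 1, n_demo)   # x_i ~ f, where f is the Uniform(0,1) density
values = g_demo(x_demo)              # the values being averaged
estimate = values.mean()             # (1/n) * sum of g(x_i)
std_err = values.std() / np.sqrt(n_demo)  # estimated standard error of the average

print(estimate, std_err)             # the estimate should be close to 1/3
```

The standard error shrinks at the rate $1/\sqrt{n}$, so each additional decimal place of accuracy costs roughly 100 times as many samples.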

### 1D and 2D versions

$\displaystyle \int_0^1 g(u_1) du_1 \quad \textrm{ and } \quad \displaystyle \int_0^1 \int_0^1 1_{[u_2 \leq g(u_1)]}(u_1, u_2) du_1 du_2$
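
The two versions can be compared on a toy problem. This sketch assumes a hypothetical integrand `g_demo(x) = x**2`, which satisfies $0 \leq g \leq 1$ on $[0,1]$ so that the area under it fits in the unit square; all names ending in `_demo`, `_1d`, or `_2d` are illustrative and not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed
g_demo = lambda x: x**2              # hypothetical integrand with 0 <= g <= 1 on [0, 1]
n_demo = 100_000

u_demo = rng.uniform(0, 1, (n_demo, 2))   # columns are u_1 and u_2

# 1D version: average the heights g(u_1)
vals_1d = g_demo(u_demo[:, 0])
est_1d = vals_1d.mean()
se_1d = vals_1d.std() / np.sqrt(n_demo)

# 2D version: average the indicator that (u_1, u_2) falls under the curve
vals_2d = u_demo[:, 1] <= g_demo(u_demo[:, 0])
est_2d = vals_2d.mean()
se_2d = vals_2d.std() / np.sqrt(n_demo)

print(est_1d, se_1d)   # both estimates should be close to 1/3
print(est_2d, se_2d)
```

For this integrand the 2D version averages 0/1 values, which are more variable than the heights averaged by the 1D version, so it has the larger standard error at the same $n$.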




In [None]:
# solving for y in r^2 = x^2 + y^2 
g = lambda x: np.sqrt(1-x**2)
x = np.linspace(0,1,1000)
plt.figure(figsize=(4,4)); plt.title("Quarter Circle")
plt.plot(x,g(x));
# https://en.wikipedia.org/wiki/Monte_Carlo_method

In [None]:
p1q1 = np.pi  # analytical normalizing constant c based on np.pi which makes function c*g a density f
p1q2 = integrate.quad(g, 0, 1)  # tuple of the actual and reported error for the area of normalized density f
# Using the code provided below...
p1q3 = 0  # minimum number of samples MC1 requires to correctly estimate the area under g to 3 decimal places
          # with an estimated standard error based on np.std() smaller than 0.0001
p1q4 = 0  # minimum number of samples MC2 requires to correctly estimate the area under g to 3 decimal places
          # with an estimated standard error based on np.std() smaller than 0.0001
p1q5 = 0.0000 # estimated standard deviation of values MC1 averages at the requested sample sizes above
p1q6 = 0.0000 # estimated standard deviation of values MC2 averages at the requested sample sizes above
p1q7 = "The least to most efficient method for estimating the area under f appears to be "+\
       "integrate.quad -> MC1 -> MC2"  # correctly order in the provided formatting


In [None]:
toc = time.time()
integrate.quad(g, 0, 1)
print(integrate.quad(g, 0, 1))
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410); n = 400000
u1 = stats.uniform().rvs(n)
print(g(u1).mean(),np.pi/4, g(u1).std()/n**0.5)


tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410);  n=400000
u12 = stats.uniform().rvs([n,2])
tic = time.time()
tic-toc

In [None]:
# Cell for scratch work

# You are welcome to add as many new cells into this notebook as you would like.
# Just don't have scratch work cells with runtime errors because 
# notebook cells are run sequentially for automated code testing.

# Any cells included for scratch work that are no longer needed may be deleted so long as 
# - all the required functions are still defined and available when called
# - no cells requiring variable assignments are deleted 
#    - as this causes their `cell ids` to be lost, but these `cell-ids` are required for automated code testing.

In [None]:
# Cell for scratch work


## Part 2: Monte Carlo Integration with Inverse CDF and Rejection Sampling

$\displaystyle \int h(x) dx = \int \tilde h(x) f(x) dx \equiv \int x f(x) dx \approx \bar x \quad \textrm{ for } \quad x_i \sim f(X)$
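
A minimal sketch of the two sampling approaches named in this part, using a hypothetical density $f(x) = 2x$ on $[0,1]$ whose CDF $F(x) = x^2$ has the closed-form inverse $\sqrt{u}$. The density, the proposal, the constant `c_demo`, and all variable names are illustrative assumptions and differ from the assignment's `f`.

```python
import numpy as np

rng = np.random.default_rng(2)       # arbitrary seed
n_demo = 50_000

f_demo = lambda x: 2 * x             # hypothetical density on [0, 1]
F_inv_demo = lambda u: np.sqrt(u)    # inverse of the CDF F(x) = x**2

# Inverse CDF sampling: every uniform draw becomes one draw from f_demo
x_inv = F_inv_demo(rng.uniform(0, 1, n_demo))

# Rejection sampling with a Uniform(0,1) proposal (density 1) and c = 2,
# chosen so that f_demo(x) <= c * 1 for all x in [0, 1]
c_demo = 2.0
proposal = rng.uniform(0, 1, n_demo)
u_accept = rng.uniform(0, 1, n_demo)
x_rej = proposal[u_accept <= f_demo(proposal) / c_demo]

print(x_inv.mean(), x_rej.mean())    # both should be close to E[X] = 2/3
print(len(x_rej) / n_demo)           # acceptance rate, close to 1/c = 0.5
```

In this sketch rejection sampling consumes two uniforms per proposal and keeps only about half of the proposals, whereas inverse CDF sampling uses one uniform per retained sample.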


In [None]:
# PDF f the density of g above analytically normalized based on np.pi
f = lambda x: None

# CDF of f
F = lambda x: None

@np.vectorize
def F_inv(u, F, a=0, b=1, K=50):
    # accurate to approximately 0.5**K
    x_l,x_r = a,b
    F_x_l,F_x_r = 0,1
    for k in range(K):
        x = (x_l+x_r)/2
        F_x = F(x)
        if F_x < u:
            x_l,F_x_l = x,F_x
        else:
            x_r,F_x_r = x,F_x
    return (x_l+x_r)/2
    # confirm with x = F_inv(F(x), F)

try:
    fig,ax = plt.subplots(1,2,figsize=(10,4))
    x = np.linspace(0,1,1000); u=x.copy()
    ax[0].plot(x,F(x)); ax[1].plot(u,F_inv(u,F));
except:
    pass

In [None]:
try:
    h = lambda x: None
    plt.figure(figsize=(4,4)); plt.title("x*f(x)? The function being integrated?")
    plt.plot(x,h(x)); 
except:
    pass

In [None]:
# Cell for scratch work

# You are welcome to add as many new cells into this notebook as you would like.
# Just don't have scratch work cells with runtime errors because 
# notebook cells are run sequentially for automated code testing.

# Any cells included for scratch work that are no longer needed may be deleted so long as 
# - all the required functions are still defined and available when called
# - no cells requiring variable assignments are deleted 
#    - as this causes their `cell ids` to be lost, but these `cell-ids` are required for automated code testing.

In [None]:
# Cell for scratch work


In [None]:
p2q1 = integrate.quad(g, 0, 1)[0]  # expected value of f

# Using the code provided below...
p2q2 = 0  # minimum number of samples MC1 requires to correctly estimate the area under h to 3 decimal places
          # with an estimated standard error based on np.std() smaller than 0.0001
p2q3 = 0  # minimum number of samples MC2 requires to correctly estimate the area under h to 3 decimal places
          # with an estimated standard error based on np.std() smaller than 0.0001
p2q4 = 0.0000  # estimated standard deviation of the values averaged for MC1
p2q5 = 0.0000  # estimated standard deviation of the values averaged for MC2

p2q6 = "MC2 <is|is not> rejection sampling"
p2q7 = "MC2 could be made more efficient by scaling <u1|u2> to be <smaller|larger>"

p2q8 = "F <does|does not> have an analytical inverse based on transcendental functions"
p2q9 = "F_inv implements a <bracketing|illinois|Regula falsi|root finding> method"
p2q10 = "F_inv provides a <numerical|analytical|transcendental> approximation"

# Using the code provided below...
p2q11 = 0  # minimum number of inverse CDF based f(X) samples required for same accuracy as above
           # with an estimated standard error based on np.std() smaller than 0.001
p2q12 = 0  # minimum number of uniform samples required for same accuracy as above based on rejection sampling
           # with an estimated standard error based on np.std() smaller than 0.001
p2q13 = 0  # the number of f(X) samples produced using rejection sampling
p2q14 = 0.0000  # estimated standard deviation of the values averaged for the inverse CDF sampling method
p2q15 = 0.0000  # estimated standard deviation of the values averaged for the rejection sampling method

# correctly order in the provided formatting
p2q16 = "The least to most efficient computation for estimating the area under f appears to be "+\
        "integrate.quad -> MC1 -> MC2 -> inverse CDF sampling -> rejection sampling"  
p2q17 = "The methods requiring the greatest to least number of uniform samples in order are "+\
        "MC1 -> MC2 -> inverse CDF sampling -> rejection sampling"


In [None]:
toc = time.time()
integrate.quad(h, 0, 1)
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410); n = 30000
u1 = stats.uniform().rvs(n)
# MC1 method area under h(x)=x*f(x)
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410);  n = 30000
u12 = stats.uniform().rvs([n,2])
# MC2 method area under h(x)=x*f(x) in unit square
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410); n = 30000
u = stats.uniform().rvs(n)
x = None  # Inverse CDF sampling
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410);  n = 30000
u12 = stats.uniform().rvs([n,2]) 
# u1 is proposal and c=1.3 for rejection sampling
tic = time.time()
tic-toc

In [None]:
# Cell for scratch work

# You are welcome to add as many new cells into this notebook as you would like.
# Just don't have scratch work cells with runtime errors because 
# notebook cells are run sequentially for automated code testing.

# Any cells included for scratch work that are no longer needed may be deleted so long as 
# - all the required functions are still defined and available when called
# - no cells requiring variable assignments are deleted 
#    - as this causes their `cell ids` to be lost, but these `cell-ids` are required for automated code testing.

In [None]:
# Cell for scratch work

In [None]:
try:
    fig,ax = plt.subplots(1,4,figsize=(10,2.5))
    ax[0].hist(MC1_values_being_averaged)
    ax[1].hist(MC2_values_being_averaged.astype(int))
    ax[2].hist(inverseCDF_values_being_averaged)
    ax[3].hist(nonrejection_values_being_averaged)
    plt.tight_layout()
except:
    pass

## Part 3: Importance Sampling

$
\begin{align*}
\theta = E_f[X] = \int x f(x) dx = \displaystyle \int x \frac{f(x)}{q(x)} q(x) dx &\approx{} \frac{1}{n}\sum_{i=1}^n w_i^* x_i \textrm{ for  } x_i \sim q(X) \textrm{ with standard error } \sqrt{\frac{1}{n}\textrm{Var}_q(W^*X)} \\
&\approx{} \sum_{i=1}^n w_i x_i \textrm{ for  } w_i = \frac{w_i^*}{\sum_{i=1}^n w_i^*} \textrm{ and } x_i \sim q(X)\\& {\textrm{ with delta method approximated standard error }} \sqrt{\sum_{i=1}^n w_i^2\left(x_i - \sum_{i=1}^n w_ix_i\right)^2}
\end{align*}
$
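
Both estimators above can be sketched on a toy problem. The target $f(x) = 2x$ and proposal $q(x) = 1.5\sqrt{x}$ on $[0,1]$ are hypothetical choices made here so that $E_f[X] = 2/3$ is known and $q$ can be sampled by inverse CDF; they are not the assignment's `f` and `q`, and every variable name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary seed
n_demo = 50_000

f_demo = lambda x: 2 * x                  # hypothetical target density on [0, 1]
q_demo = lambda x: 1.5 * np.sqrt(x)       # hypothetical proposal density on [0, 1]

# Draw x_i ~ q by inverse CDF: Q(x) = x**1.5, so x = u**(2/3); u lies in (0, 1]
u_demo = 1.0 - rng.random(n_demo)
x_demo = u_demo ** (2 / 3)

w_star = f_demo(x_demo) / q_demo(x_demo)  # unnormalized weights w*_i

# First estimator: plain average of w*_i x_i
values = w_star * x_demo
est_1 = values.mean()
se_1 = values.std() / np.sqrt(n_demo)

# Second estimator: normalized weights w_i with the delta method standard error
w = w_star / w_star.sum()
est_2 = np.sum(w * x_demo)
se_2 = np.sqrt(np.sum(w**2 * (x_demo - est_2) ** 2))

print(est_1, se_1)                        # both estimates should be close to 2/3
print(est_2, se_2)
```

The second estimator only uses the weights through their ratios, so multiplying `f_demo` by any positive constant leaves `est_2` unchanged.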

### Some notes

If $q(X) = \frac{Xf(X)}{\theta}$ then ${E_q \left[\left( \frac{Xf(X)}{q(X)} - \theta \right)^2 \right]} = 0$, because every value being averaged equals $\theta$ exactly. This suggests that $q(X) \propto Xf(X)$ might be a good idea.
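
The zero-variance case can be checked numerically. This sketch assumes a hypothetical target $f(x) = 2x$ on $[0,1]$, for which $\theta = 2/3$ and the ideal proposal is $q(x) = x f(x)/\theta = 3x^2$; the names below are illustrative and not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(4)           # arbitrary seed
n_demo = 1_000

f_demo = lambda x: 2 * x                 # hypothetical target density on [0, 1]
q_ideal = lambda x: 3 * x**2             # q(x) = x * f(x) / theta with theta = 2/3

# Draw x_i ~ q_ideal by inverse CDF: Q(x) = x**3, so x = u**(1/3); u lies in (0, 1]
x_demo = (1.0 - rng.random(n_demo)) ** (1 / 3)

values = x_demo * f_demo(x_demo) / q_ideal(x_demo)   # x f(x) / q(x) for each draw

print(values.mean(), values.std())       # mean is 2/3 and the spread is numerically zero
```

In practice this proposal is unavailable because constructing it requires already knowing $\theta$, which is why the text only suggests making $q$ roughly proportional to $x f(x)$.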

The **normalized weights** $w_i$ are **random variables**, but for *a priori* known weights the ***effective sample size*** $n_{\text{eff}} = \frac{1}{\sum_{i=1}^n w_i^2}$ is maximized at $n$ when $w_i=\frac{1}{n}$ for all $i$. This suggests that $q(X) = f(X)$ might be a good idea, and that $q(X)$ should have heavier tails than $f(X)$ so that $\frac{f(X)}{q(X)}$ does not "explode".

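The ***effective sample size*** $n_{\text{eff}}$ is derived by equating the variance of an equally weighted average of $n_{\text{eff}}$ values with the variance of the weighted average, $\frac{1}{n_{\text{eff}}^2} \sum_{i=1}^{n_{\text{eff}}} \sigma^2 = \frac{\sigma^2}{n_{\text{eff}}} = \sum_{i=1}^n w_i^2\sigma^2$, and solving for $n_{\text{eff}}$.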
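
A small sketch of the effective sample size formula, treating the weights as fixed. The helper `effective_sample_size` and the three example weight vectors are hypothetical and chosen only to show the range of possible values.

```python
import numpy as np

def effective_sample_size(w_star):
    """Return 1 / sum(w_i**2) after normalizing the weights to sum to 1."""
    w = np.asarray(w_star, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w**2)

n_demo = 1000
equal = np.ones(n_demo)                              # all weights identical
uneven = np.linspace(0.01, 1.0, n_demo)              # moderately unequal weights
dominated = np.r_[np.full(n_demo - 1, 1e-6), 1.0]    # one weight dominates the rest

ess_equal = effective_sample_size(equal)
ess_uneven = effective_sample_size(uneven)
ess_dominated = effective_sample_size(dominated)

print(ess_equal)       # equals n when every w_i = 1/n
print(ess_uneven)      # strictly between 1 and n
print(ess_dominated)   # close to 1 when a single weight carries nearly all the mass
```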


In [None]:
p3q1 = "<f(x_i)|q(x_i)|f(x_i)/q(x_i)|q(x_i)/f(x_i)> is the obvious natural choice for w*_i"
p3q2 = 0.0  # expected value of W*_i with respect to q(X) for normalized density f(X)
p3q3 = "f(x) <must|need not> be a normalized density for the second estimation to work"

# Using the code provided below...
p3q4 = 0  # minimum number of q(X) samples required to correctly estimate E_f[x] to 3 decimal places
          # using importance sampling based on w*_i
          # with an estimated standard error based on np.std() smaller than 0.001
p3q5 = 0.0000  # estimated standard deviation of the values averaged to estimate E_f[x] above
p3q6 = 0.0000  # sum of w*_i
p3q7 = 0.0000  # sum of normalized weights w_i
p3q8 = 0  # decimal position where E_f[x] estimate differs if instead done using normalized weights 
p3q9 = 0.0000000  # delta method approximation of the standard error for this alternative estimate
p3q10 = 0.0000  # effective sample size assuming w_i are not random variables

p3q11 = "Effective sample size is <less than|equal to|greater than> the actual sample size "+\
        "which is <why|unrelated to why> the standard error estimates are <equal|unequal>"

p3q12 = "To increase the calculated effective sample size value the importance weights "+\
        "should be made more <homogenous|different> by changing <x|f(x)|q(x)|x*f(x)> "+\
        "to be more like <x|f(x)|q(x)|x*f(x)> and less like <x|f(x)|q(x)|x*f(x)>"

p3q13 = "Increasing the effective sample size <guarantees|may not result in> "+\
        "reduced variability in the importance sampling estimate"

p3q14 = "The variability of interest is that of <x|f(x)|q(x)|x*f(x)|f(x)/q(x)|x*f(x)/q(x)> "+\
        "so what we actually want to ensure is that <x|f(x)|q(x)|x*f(x)|f(x)/q(x)|x*f(x)/q(x)> "+\
        "never gets exceptionally large compared to <x|f(x)|q(x)|x*f(x)|f(x)/q(x)|x*f(x)/q(x)>"

# For the analyses completed above
p3q15 = 0.000  # what fraction of uniform samples were wasted for rejection versus inverse CDF sampling           
p3q16 = 0.000000  # importance sampling required what fraction of the samples of the inverse CDF sampling

In [None]:
# h = lambda x: x
q = stats.beta(2.0,1.4).pdf
fig,ax = plt.subplots(1,2, figsize=(10,4))

x = np.linspace(0,1,1000)
ax[0].plot(x, h(x), label="x*f(x)")
ax[0].plot(x, 2.275*h(x), label="c*x*f(x)")
ax[0].plot(x, q(x), label="q(x)")
ax[0].set_title("q(x) is closely proportional to x*f(x)"); ax[0].legend()

ax[1].plot(x, f(x), label="f(x)")
ax[1].plot(x, q(x), label="q(x)")
start = int(0.3*len(x))
ax[1].plot(x[start:-1],f(x[start:-1])/q(x[start:-1]), label="w* = f(x)/q(x)")
ax[1].set_title("q(x) versus f(x)"); ax[1].legend();


In [None]:
toc = time.time()
integrate.quad(h, 0, 1)
tic = time.time()
tic-toc

In [None]:
toc = time.time()
np.random.seed(410); n = 1000
q = stats.beta(2.0,1.4)  # importance sampling 
tic = time.time()
tic-toc

In [None]:
# Cell for scratch work

# You are welcome to add as many new cells into this notebook as you would like.
# Just don't have scratch work cells with runtime errors because 
# notebook cells are run sequentially for automated code testing.

# Any cells included for scratch work that are no longer needed may be deleted so long as 
# - all the required functions are still defined and available when called
# - no cells requiring variable assignments are deleted 
#    - as this causes their `cell ids` to be lost, but these `cell-ids` are required for automated code testing.

In [None]:
# Cell for scratch work


In [None]:
q = stats.beta(2.0,1.4)
U = stats.uniform()

fig,ax = plt.subplots(3,4,figsize=(10,5))
x = np.linspace(0,1,1000)
ax[0,0].plot(x, h(x), label="x*f(x)")
ax[0,0].plot(x, U.pdf(x), label="Uniform")
ax[0,0].plot(x, f(x), label="f(x)")
ax[0,0].plot(x, q.pdf(x), label="q(x)")
ax[0,0].legend()

n=10000

u = U.rvs(n)
ax[0,1].hist(u, density=True)
ax[0,1].plot(x, U.pdf(x))
ax[0,1].plot(x, h(x))
ax[1,1].hist(h(u))

x_ = F_inv(u, F)
ax[0,2].hist(x_, density=True)
ax[0,2].plot(x, f(x))
ax[0,2].plot(x, h(x))
ax[1,2].hist(x_)

x_ = q.rvs(n)
w_ = 1+x_*0  # not correct
ax[0,3].hist(x_, density=True)
ax[0,3].plot(x, q.pdf(x))
ax[0,3].plot(x, h(x))
ax[0,3].plot(x, f(x), 'k:')
ax[1,3].hist(w_*x_)
ax[2,3].plot(sorted(np.log(w_)))

plt.tight_layout()