# Homework #1 (Due 09/18/2019, 11:59pm)
## Maximum Likelihood Learning and Bayesian Inference

**AM 207: Advanced Scientific Computing**<br>
**Instructor: Weiwei Pan**<br>
**Fall 2019**

**Name: Ian Weaver**

**Students collaborators:**

### Instructions:

**Submission Format:** Use this notebook as a template to complete your homework. Please intersperse text blocks (using Markdown cells) amongst `python` code and results -- format your submission for maximum readability. Your assignments will be graded for correctness as well as clarity of exposition and presentation -- a “right” answer by itself, without an explanation or presented in a difficult-to-follow format, will receive no credit.

**Code Check:** Before submitting, you must do a "Restart and Run All" under "Kernel" in the Jupyter or colab menu. Portions of your submission that contain syntactic or run-time errors will not be graded.

**Libraries and packages:** Unless a problem specifically asks you to implement from scratch, you are welcome to use any `python` library or package in the standard Anaconda distribution.

In [1]:
### Mount working directory
#from google.colab import drive
#drive.mount("/content/drive")
#%cd /content/drive/My Drive/class/am207/HW1

### Import basic libraries
import numpy as np
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from IPython.display import display, HTML

### Plot configs
%matplotlib inline
%config InlineBackend.figure_format = "retina"
sns.set(style="darkgrid", palette="colorblind", color_codes=True)

### Table display configs
CSS = """
.output {
    flex-direction: row;
}
"""
HTML(f"<style>{CSS}</style>")

## Problem Description
In the competitive rubber chicken retail market, the success of a company is built on satisfying the exacting standards of a consumer base with refined and discriminating taste. In particular, customer product reviews are all important. But how should we judge the quality of a product based on customer reviews?

On Amazon, the first customer review statistic displayed for a product is the ***average rating***. The following are the main product pages for two competing rubber chicken products, manufactured by Lotus World and Toysmith respectively:


Lotus World |  Toysmith
- | - 
![alt](lotus1.png) |  ![alt](toysmith1.png)

Clicking on the 'customer review' link on the product pages takes us to a detailed break-down of the reviews. In particular, we can now see the number of times a product is rated a given rating (between 1 and 5 stars).

Lotus World |  Toysmith
- |  - 
![alt](lotus2.png) |  ![alt](toysmith2.png)


In the following, we will ask you to build statistical models to compare these two products using the observed ratings. Larger versions of the images are available in the data set accompanying this notebook.

## Part I: A Maximum Likelihood Model
1. **(Model Building)** Suppose that for each product, we can model the probability of the value of each new rating as the following vector:

\begin{align}
\theta = [\theta_1, \theta_2, \theta_3, \theta_4, \theta_5]
\end{align}

  where $\theta_i$ is the probability that a given customer will give the product $i$ number of stars. That is, each new rating (a value between 1 and 5) has a categorical distribution $Cat(\theta)$. Represent the observed ratings of an Amazon product as a vector $R = [r_1, r_2, r_3, r_4, r_5]$ where, for example, $r_4$ is the number of $4$-star reviews out of a total of $N$ ratings. Write down the likelihood of $R$. That is, what is $p(R| \theta)$?

  **Note:** The observed ratings for each product should be read off the image files included in the dataset.

The Categorical distribution gives the probability of each rating from a single customer. The $N$ ratings of a product are independent draws from this distribution, so the vector of rating counts follows the Multinomial distribution

\begin{align}
    p(R|\theta) = Multi(x \,|\, N, \theta)
    = \boxed{\frac{N!}{\prod_{i=1}^5(x_i!)}\prod_{i=1}^5 \theta_i^{x_i}} \quad,
\end{align}

where $N = \sum_{i=1}^5 x_i$ is the total number of ratings given, $i$ is the rating on the 1-5 Amazon star scale, $x_i$ is the total number of ratings with score $i$, and $\theta_i$ is the probability that a given customer will give the product a score of $i$. 

Note: Below, $r_i = x_i/N$ denotes the relative frequency of rating $i$. Because $N$ is fixed, the relative frequencies and the counts carry the same information, so the likelihood of the observed ratings is just the Multinomial pmf of the counts $x_i = N r_i$; no extra factor of $1/N$ is needed.

In log space this looks like
    
\begin{align}
    \ln p(R | \theta) = \ln(N!) - \sum_{i=1}^5 \ln(x_i!) + \sum_{i=1}^5 x_i\ln\theta_i \quad.
\end{align}
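As a quick sanity check on the log-likelihood, here is a minimal sketch that evaluates the standard Multinomial log-pmf by hand and compares it with `scipy.stats.multinomial.logpmf`. The helper name `multinomial_loglik` and the demo counts are assumptions made for illustration; they are not read from the product pages.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def multinomial_loglik(x, theta):
    """ln p(x | theta) for counts x and category probabilities theta."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    n = x.sum()
    # ln(N!) - sum_i ln(x_i!) + sum_i x_i ln(theta_i), with ln(n!) = gammaln(n + 1)
    return gammaln(n + 1) - gammaln(x + 1).sum() + (x * np.log(theta)).sum()

# hypothetical counts and probabilities, for illustration only
x_demo = np.array([2, 1, 1, 3, 13])
theta_demo = np.array([0.1, 0.05, 0.05, 0.2, 0.6])

manual = multinomial_loglik(x_demo, theta_demo)
reference = stats.multinomial.logpmf(x_demo, n=x_demo.sum(), p=theta_demo)
print(manual, reference)
```

The two numbers should agree to floating-point precision, which confirms the normalizing constant and the $\sum_i x_i\ln\theta_i$ term.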

2. **(Model Fitting)** Find the maximum likelihood estimator of $\theta$ for the Lotus World model; find the MLE of $\theta$ for the Toysmith model. You need to make a reasonably mathematical argument for why your estimate actually maximizes the likelihood (i.e. recall the criteria for a point to be a global optimum of a function).

  *Note:* I recommend deriving the MLE using the general expression of the likelihood. That is, derive the MLE using the variable $R$, then afterwards plug in your specific values of $R$ for each product.

We can optimize this using Lagrange multipliers: the extrema of a function $f(x)$ of a vector of inputs $x$, subject to the constraint $g(x) = 0$, occur at stationary points of the corresponding Lagrange function $\mathcal L(x, \lambda) = f(x) - \lambda\, g(x)$. 

In our case, we want to optimize the function $\ln p(R|\theta)$ given the constraint $\sum_i\theta_i - 1 = 0$, where $\sum_i \equiv \sum_{i=1}^5$. This constraint just enforces that the probabilities of each score for a given user rating have to sum to 1. Applying this, we have

\begin{align}
    \mathcal L(\theta, \lambda) 
    = \ln p(R|\theta) - \lambda\left(\sum_i\theta_i - 1\right)
    = \ln(N!) - \sum_i\ln(x_i!) + \sum_ix_i\ln\theta_i - \lambda\sum_i\theta_i + \lambda
    \quad.
\end{align}

Finding the stationary points next,

\begin{align}
\begin{matrix}
    \frac{\partial\mathcal L}{\partial\theta_i}
    = \frac{x_i}{\theta_i} - \lambda = 0 \\
    \frac{\partial\mathcal L}{\partial\lambda}
    = -\sum_i\theta_i + 1 = 0
\end{matrix}
\quad\Longrightarrow\quad
\begin{matrix}
    \theta_i &= \frac{x_i}{\lambda} \\
    \sum_i\theta_i &= 1
\end{matrix}
\end{align}

Combining these two equations and solving for $\theta_i$ gives

\begin{align}
    \sum_i\frac{x_i}{\lambda} = 1 \quad\Longrightarrow\quad
    \lambda = \sum_i x_i = N \quad\Longrightarrow\quad
    \hat\theta_i = \frac{x_i}{N} = \boxed{r_i} \quad,
\end{align}

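where $\hat\theta_i$ is our MLE of $\theta_i$ and $R \equiv [r_i]$, for $i = 1,\dots,5$.

This stationary point is the global maximum. The only $\theta$-dependent part of $\ln p(R|\theta)$ is $\sum_i x_i\ln\theta_i$ with $x_i \geq 0$, and each term $x_i\ln\theta_i$ is concave in $\theta_i$, so the log-likelihood is concave. The constraint set $\{\theta : \sum_i\theta_i = 1,\ \theta_i > 0\}$ is convex, and a stationary point of a concave function on a convex set is a global maximizer.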
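The result $\hat\theta_i = x_i/N$ can also be checked numerically. The sketch below, which uses hypothetical counts and an assumed helper `loglik_kernel`, compares the $\theta$-dependent part of the log-likelihood at $\hat\theta$ against many random points on the probability simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

def loglik_kernel(x, theta):
    # theta-dependent part of the log-likelihood: sum_i x_i ln(theta_i)
    return np.sum(x * np.log(theta), axis=-1)

# hypothetical counts, for illustration only
x_demo = np.array([2.0, 1.0, 1.0, 3.0, 13.0])
theta_hat = x_demo / x_demo.sum()

# random candidate probability vectors, drawn uniformly from the simplex
candidates = rng.dirichlet(np.ones(5), size=10_000)

best = loglik_kernel(x_demo, theta_hat)
others = loglik_kernel(x_demo, candidates)
print(best, others.max())
```

No random candidate should beat `theta_hat`, consistent with the derivation above. This is a spot check rather than a proof.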

Plugging in the relative rating frequencies for each product, we have

\begin{align}
    \boxed{
    \begin{aligned}
    \hat\theta_\text{Loftus} &= \left[ 0.06, 0.04, 0.06, 0.17, 0.67 \right] \\
    \hat\theta_\text{Toysmith} &= \left[ 0.14, 0.08, 0.07, 0.11, 0.60 \right]
    \end{aligned}} \quad,
\end{align}

which is also tabulated below.

In [2]:
# returns <data> as a Pandas table with rounded values for easy comparison
def make_df(data, colnames=None):
    df = pd.DataFrame(data, index=range(1,6), columns=colnames)
    df.columns.name = "rating"
    df = df.round(2)
    return df

# aggregate review data
MLE_loftus = np.array([0.06, 0.04, 0.06, 0.17, 0.67], ndmin=2).T
MLE_toysmith = np.array([0.14, 0.08, 0.07, 0.11, 0.60], ndmin=2).T
MLE_data = np.c_[MLE_loftus, MLE_toysmith]

# show the MLE for theta
df_MLE = make_df(MLE_data, ["Loftus", "Toysmith"])
df_MLE

rating,Loftus,Toysmith
1,0.06,0.14
2,0.04,0.08
3,0.06,0.07
4,0.17,0.11
5,0.67,0.6


3. **(Model Interpretation)** Based on your MLE of $\theta$'s for both models, do you feel confident deciding if one product is superior to another? Why or why not?

I do not have confidence one way or the other because these MLEs do not have any error bars on them. The sample size for each company is also relatively small. I would feel more confident looking at the bootstrapped PI instead.
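To illustrate what such a bootstrap could look like, here is a minimal sketch for the Loftus ratings. It assumes the counts can be recovered as `round(N * r)` from the frequencies and total used in this notebook, and it uses the fact that resampling $N$ categorical ratings with replacement is the same as one Multinomial draw with probabilities $\hat\theta$. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# counts recovered from the rounded frequencies (an approximation)
r_loftus = np.array([0.06, 0.04, 0.06, 0.17, 0.67])
x = np.round(162 * r_loftus)
n = int(x.sum())
theta_hat = x / n

# each bootstrap replicate resamples n ratings with replacement,
# which is one multinomial draw with probabilities theta_hat
boot = rng.multinomial(n, theta_hat, size=2_000) / n

# 95% percentile interval of the bootstrapped MLE for each rating
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
for star, (lo, hi) in enumerate(zip(lower, upper), start=1):
    print(f"{star} star: [{lo:.2f}, {hi:.2f}]")
```

The width of each interval gives a rough sense of how much the MLE could move under resampling of the same number of reviews.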

## Part II: A Bayesian Model

1. **(Model Building)** Suppose you are told that customer opinions are very polarized in the retail world of rubber chickens, that is, most reviews will be 5 stars or 1 star (with little middle ground). What would be an appropriate $\alpha$ for the Dirichlet prior on $\theta$? Recall that the Dirichlet pdf is given by:

\begin{align}
p_{\Theta}(\theta) = \frac{1}{B(\alpha)} \prod_{i=1}^k \theta_i^{\alpha_i - 1}, \quad B(\alpha) = \frac{\prod_{i=1}^k\Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^k\alpha_i\right)},
\end{align}

where $\theta_i \in (0, 1)$ and $\sum_{i=1}^k \theta_i = 1$, $\alpha_i > 0 $ for $i = 1, \ldots, k$.

We want to describe how customers are more likely to give a 1 or 5 rating, so maybe something like 

\begin{align}
    \alpha = [0.9, 0.1, 0.1, 0.1, 0.9] \quad,
\end{align}

assuming $\alpha = [\alpha_1, \cdots, \alpha_5]$.
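One way to see what this choice of $\alpha$ encodes is to look at the prior mean, $\alpha_i / \sum_j \alpha_j$, and at a few draws from the prior. The sketch below is illustrative only; the seed and number of draws are arbitrary assumptions.

```python
import numpy as np

alpha = np.array([0.9, 0.1, 0.1, 0.1, 0.9])

# prior mean of a Dirichlet: alpha_i / sum(alpha)
prior_mean = alpha / alpha.sum()
print(prior_mean.round(3))  # about [0.429, 0.048, 0.048, 0.048, 0.429]

# a few draws from the prior; most mass should sit on 1 and 5 stars
rng = np.random.default_rng(0)
draws = rng.dirichlet(alpha, size=5)
print(draws.round(2))
```

The prior expects roughly 43% of reviews at each extreme and about 5% at each middle rating. Its total concentration, $\sum_i\alpha_i = 2.1$, is small compared with the hundreds of observed reviews, so it acts as a fairly weak prior.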

2. **(Inference)** Analytically derive the posterior distribution (using the likelihoods you derived in Part I) for each product.

  *Note:* I recommend deriving the posterior using the general expression of a Dirichlet pdf. That is, derive the posterior using the variable $\alpha$, then afterwards plug in your specific values of $\alpha$ when you need to.

Subbing the likelihood and prior into Bayes' theorem (and using $\prod_i \equiv \prod_{i=1}^5$ for convenience),

\begin{align}
    p(\theta|R) &\propto p(R|\theta)p(\theta) \\
    &= \left[ \frac{N!}{\prod_i(x_i!)}\prod_i\theta_i^{x_i} \right]
    \frac{1}{B(\alpha)}\prod_i\theta_i^{\alpha_i - 1} \\
    &= \frac{N!}{B(\alpha)\prod_i(x_i!)}\prod_i\theta_i^{x_i + \alpha_i - 1} \\
    &\propto \prod_i\theta_i^{(x_i + \alpha_i) - 1} \propto Dir(\theta \,|\, x + \alpha) \quad.
\end{align}

Thanks to "our" choice of a Dirichlet prior, our posterior is also Dirichlet, because the Dirichlet is the conjugate prior of the Multinomial distribution! 

We are working in terms of absolute number of counts $(x_i)$ now instead of relative $(r_i)$, as in the MLE case. These absolute counts can be computed with $x_i = \text{round}(r_i*N)$ since we can only have a whole number of counts for each rating. Making this conversion, we have the following posterior:

\begin{align}
    \boxed{p(\theta|R) = Dir \left( \theta \,|\, \text{round}(N R) + \alpha \right)} \quad,
\end{align}

where for each product we have:

Company | $\alpha$ | r | N
:-: | :-: | :-: | :-:
Loftus | [0.9, 0.1, 0.1, 0.1, 0.9] | [0.06, 0.04, 0.06, 0.17, 0.67] | 162
Toysmith | [0.9, 0.1, 0.1, 0.1, 0.9] | [0.14, 0.08, 0.07, 0.11, 0.60] | 410
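The posterior parameters in the table can be spelled out numerically. The sketch below recovers the counts with `round(N * r)` as described above and adds $\alpha$; the dictionary layout and variable names are assumptions made for illustration.

```python
import numpy as np

alpha = np.array([0.9, 0.1, 0.1, 0.1, 0.9])
reviews = {
    "Loftus": (162, np.array([0.06, 0.04, 0.06, 0.17, 0.67])),
    "Toysmith": (410, np.array([0.14, 0.08, 0.07, 0.11, 0.60])),
}

counts, posterior_params = {}, {}
for name, (N, r) in reviews.items():
    counts[name] = np.round(N * r)               # x_i = round(N * r_i)
    posterior_params[name] = counts[name] + alpha  # Dirichlet parameters x + alpha
    print(name, counts[name], posterior_params[name])
```

Because the frequencies are only given to two decimals, the recovered counts are approximate: the rounded Loftus counts sum to 163 rather than 162.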

3. **(The Maximum A Posteriori Estimate)** Analytically or empirically compute the MAP estimate of $\theta$ for each product, using the $\alpha$'s you chose in Problem 1. How do these estimates compare with the MLE? Just for this problem, compute the MAP estimate of $\theta$ for each product using a Dirichlet prior with hyperparameters $\alpha = [1, 1, 1, 1, 1]$. Make a conjecture about the effect of the prior on the difference between the MAP estimates and the MLE's of $\theta$.

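The MAP estimate is the mode of the posterior. For a Dirichlet posterior with parameters $a_i = x_i + \alpha_i > 1$, the mode has the closed form $(a_i - 1)/(\sum_j a_j - 5)$; here we instead estimate it empirically by sampling from our posterior. Because the draws are continuous, repeated values are rare, so the sample mode is only a crude estimate of the true mode.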

In [3]:
# product review data
N_loftus = 162
r_loftus = np.array([0.06, 0.04, 0.06, 0.17, 0.67])
N_toysmith = 410
r_toysmith = np.array([0.14, 0.08, 0.07, 0.11, 0.60])
alpha = np.array([1, 1, 1, 1, 1])

# samples from the Dirichlet posterior Dir(x + alpha), assuming r and alpha are numpy arrays
# returns an (n_samples x 5) array, one column per star rating
def get_post(r, alpha, N, n_samples):
    x = np.round(N * r) # absolute counts for each rating
    alpha = alpha + x # avoids += int, float clash
    return np.random.dirichlet(alpha, n_samples)

# sample posterior
post_loftus = get_post(r_loftus, alpha, N_loftus, 1_000)
post_toysmith = get_post(r_toysmith, alpha, N_toysmith, 1_000)

# compute mode of posterior for each rating
MAP_loftus, _ = stats.mode(post_loftus, axis=0)
MAP_toysmith, _ = stats.mode(post_toysmith, axis=0)
MAP_loftus, MAP_toysmith = MAP_loftus.T, MAP_toysmith.T # convert to column vectors

# display results
MLE_MAP_data = np.c_[MLE_loftus, MAP_loftus, MLE_toysmith, MAP_toysmith]
colnames = ["Loftus MLE", "Loftus MAP", "Toysmith MLE", "Toysmith MAP"]
df_MLE_MAP = make_df(MLE_MAP_data, colnames)
df_MLE_MAP

rating,Loftus MLE,Loftus MAP,Toysmith MLE,Toysmith MAP
1,0.06,0.01,0.14,0.1
2,0.04,0.01,0.08,0.05
3,0.06,0.01,0.07,0.04
4,0.17,0.08,0.11,0.07
5,0.67,0.54,0.6,0.52


From playing around with different values for $\alpha$, it looks like the more uniform the prior, the closer the MAP is to the MLE. This makes sense because a uniform prior ($\alpha = [1, 1, 1, 1, 1]$) is constant in $\theta$, so in log space it contributes only a constant to the log posterior. Maximizing the posterior then reduces to maximizing the ln likelihood function, like in the non-Bayesian case.
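As a cross-check on sampled modes, the Dirichlet mode has a closed form: for $Dir(a)$ with every $a_i > 1$ it is $(a_i - 1)/(\sum_j a_j - k)$. The sketch below applies it to the Loftus posterior for both priors. The helper name `dirichlet_mode` is an assumption, and the counts are again recovered with `round(N * r)`.

```python
import numpy as np

def dirichlet_mode(a):
    """Mode of Dir(a); only valid when every a_i > 1."""
    a = np.asarray(a, dtype=float)
    if np.any(a <= 1):
        raise ValueError("closed-form mode requires all parameters > 1")
    return (a - 1) / (a.sum() - a.size)

r_loftus = np.array([0.06, 0.04, 0.06, 0.17, 0.67])
x = np.round(162 * r_loftus)

# flat prior alpha = [1, 1, 1, 1, 1]: the MAP reduces to x / sum(x)
map_flat = dirichlet_mode(x + np.ones(5))

# polarized prior from Problem 1
alpha_polar = np.array([0.9, 0.1, 0.1, 0.1, 0.9])
map_polar = dirichlet_mode(x + alpha_polar)

print(map_flat.round(3))
print(map_polar.round(3))
```

With the flat prior, the closed-form MAP coincides exactly with the relative frequencies of the recovered counts, which is the conjugate-prior version of the observation that a uniform prior leaves the maximizer of the likelihood unchanged.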

4. **(The Posterior Mean Estimate)** Analytically or empirically compute the posterior mean estimate of $\theta$ for each product, using the $\alpha$'s you chose in Problem 1. How do these estimates compare with the MAP estimates and the MLE?

We can also compute the mean from the sampled posterior distribution. This time we sample with $\alpha=[0.9, 0.1, 0.1, 0.1, 0.9]$ to try and capture the polarized nature of the reviewers.

In [4]:
# sample posterior with alpha from problem 2.1
alpha = np.array([0.9, 0.1, 0.1, 0.1, 0.9])
post_loftus = get_post(r_loftus, alpha, N_loftus, 1_000)
post_toysmith = get_post(r_toysmith, alpha, N_toysmith, 1_000)

# compute mean of posterior for each rating
PM_loftus = np.mean(post_loftus, axis=0, keepdims=True).T
PM_toysmith = np.mean(post_toysmith, axis=0, keepdims=True).T

# display results
MLE_MAP_PM_data = np.c_[MLE_loftus, MAP_loftus, PM_loftus, 
                        MLE_toysmith, MAP_toysmith, PM_toysmith]
colnames = ["MLE Loftus", "MAP Loftus", "PM Loftus",
            "MLE Toysmith", "MAP Toysmith", "PM Toysmith"]
make_df(MLE_MAP_PM_data, colnames)

rating,MLE Loftus,MAP Loftus,PM Loftus,MLE Toysmith,MAP Toysmith,PM Toysmith
1,0.06,0.01,0.07,0.14,0.1,0.14
2,0.04,0.01,0.04,0.08,0.05,0.08
3,0.06,0.01,0.06,0.07,0.04,0.07
4,0.17,0.08,0.17,0.11,0.07,0.11
5,0.67,0.54,0.66,0.6,0.52,0.6
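The posterior mean of a Dirichlet is also available analytically as $a_i/\sum_j a_j$ with $a = x + \alpha$, which gives a way to check the sampled means. A minimal sketch for the Loftus posterior follows, with an arbitrary seed and sample size chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([0.9, 0.1, 0.1, 0.1, 0.9])
r_loftus = np.array([0.06, 0.04, 0.06, 0.17, 0.67])
a_post = np.round(162 * r_loftus) + alpha  # posterior Dirichlet parameters

# closed-form posterior mean
pm_exact = a_post / a_post.sum()

# Monte Carlo estimate from posterior draws
pm_sampled = rng.dirichlet(a_post, size=20_000).mean(axis=0)

print(pm_exact.round(3))
print(pm_sampled.round(3))
```

The two should agree up to Monte Carlo error, which shrinks as the number of draws grows.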


5. **(The Posterior Predictive Estimate)** Sample 1000 rating vectors from the posterior predictive for each product, using the $\alpha$'s you chose in Problem 1. Use the average of the posterior predictive samples to estimate $\theta$. How do these estimates compare with the MAP, MLE, posterior mean estimate of $\theta$?

From Lecture 4, we learned that the posterior predictive can be estimated by taking our samples of $\theta$ from the posterior distribution and plugging each one into the likelihood function for $R$ (i.e. the Multinomial distribution) to simulate a new set of ratings. This will give us samples from the posterior predictive distribution, which can then be averaged to estimate the predicted values of $\theta$.

In [5]:
# allocate space to hold the posterior predictive
N = 1_000 # number of times to sample (matches the number of posterior draws)
post_pred_loftus, post_pred_toysmith = np.empty((N, 5)), np.empty((N, 5))

# plug posterior draws into the likelihood to compute the posterior predictive N times
for i, (theta_loftus, theta_toysmith) in enumerate(zip(post_loftus, post_toysmith)):
    post_pred_loftus[i] = np.random.multinomial(N_loftus, theta_loftus) / N_loftus
    post_pred_toysmith[i] = np.random.multinomial(N_toysmith, theta_toysmith) / N_toysmith
    
# average together posterior predictives
PPM_loftus = np.mean(post_pred_loftus, axis=0, keepdims=True).T
PPM_toysmith = np.mean(post_pred_toysmith, axis=0, keepdims=True).T
MLE_MAP_PM_PPM_data = np.c_[MLE_loftus, MAP_loftus, PM_loftus, PPM_loftus,
                            MLE_toysmith, MAP_toysmith, PM_toysmith, PPM_toysmith]

colnames = ["MLE Loftus", "MAP Loftus", "PM Loftus", "PPM Loftus",
            "MLE Toysmith", "MAP Toysmith", "PM Toysmith", "PPM Toysmith"]
make_df(MLE_MAP_PM_PPM_data, colnames)

rating,MLE Loftus,MAP Loftus,PM Loftus,PPM Loftus,MLE Toysmith,MAP Toysmith,PM Toysmith,PPM Toysmith
1,0.06,0.01,0.07,0.07,0.14,0.1,0.14,0.14
2,0.04,0.01,0.04,0.04,0.08,0.05,0.08,0.08
3,0.06,0.01,0.06,0.06,0.07,0.04,0.07,0.07
4,0.17,0.08,0.17,0.17,0.11,0.07,0.11,0.11
5,0.67,0.54,0.66,0.66,0.6,0.52,0.6,0.6


The MLE, MAP, posterior mean (PM), and posterior predictive mean (PPM) are all shown above for each company. The MLE, PM, and PPM for both products seem to be more similar to each other across all ratings than the MAP estimates.
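The close agreement between the PM and PPM columns is expected: by the law of total expectation, the average of simulated rating fractions equals the posterior mean of $\theta$. The sketch below illustrates this for the Loftus posterior. It uses NumPy's `default_rng` generator with an arbitrary seed, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Loftus posterior parameters x + alpha, with x = round(162 * r)
a_post = np.array([10.9, 6.1, 10.1, 28.1, 109.9])
n_reviews = 162

# step 1: draw theta from the posterior
thetas = rng.dirichlet(a_post, size=5_000)

# step 2: for each theta, simulate a new set of ratings and convert to fractions
ratings = np.array([rng.multinomial(n_reviews, t) for t in thetas]) / n_reviews

ppm = ratings.mean(axis=0)     # posterior predictive mean
pm = a_post / a_post.sum()     # closed-form posterior mean
print(ppm.round(3), pm.round(3))
print(ratings.std(axis=0).round(3), thetas.std(axis=0).round(3))
```

The means match, while the simulated fractions are more spread out than the posterior draws because they add Multinomial sampling noise on top of the uncertainty in $\theta$.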

6. **(Model Evaluation)** Compute the 95% credible interval of $\theta$ for each product (*Hint: compute the 95% credible interval for each $\theta_i$, $i=1, \ldots, 5$*). For which product is the posterior mean and MAP estimate more reliable and why? 

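We can compute the 95% CI for the PM and PPM estimates above by using the same algorithm as in HW 1. In the table below, each entry shows the mean, with the upper and lower bounds of the interval (the 97.5th and 2.5th percentiles themselves, not offsets from the mean) as superscript and subscript. Doing so, we get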

In [6]:
# computes the mean and central c% credible interval of dist, default c = 95%
def get_ci(dist, c=95):
    bound = (100 - c) / 2
    mean = np.mean(dist)
    upper, lower = np.percentile(dist, 100 - bound), np.percentile(dist, bound)
    return mean, upper, lower

# formats (mean, upper, lower) with LaTeX: mean with the upper/lower bounds as super/subscript
def format_ci(ci_data):
    ci_m, ci_u, ci_d = ci_data
    return f"${ci_m:.2f}^{{+{ci_u:.2f}}}_{{-{ci_d:.2f}}}$"

# fill in table data
PM_PPM_CI_data = []
for p_loftus, pp_loftus, p_toysmith, pp_toysmith in zip(
    post_loftus.T, post_pred_loftus.T, post_toysmith.T, post_pred_toysmith.T):
    p_ci_loftus = get_ci(p_loftus)
    pp_ci_loftus = get_ci(pp_loftus)
    p_ci_toysmith = get_ci(p_toysmith)
    pp_ci_toysmith = get_ci(pp_toysmith)
    PM_PPM_CI_data.append([format_ci(p_ci_loftus), 
                           format_ci(pp_ci_loftus),
                           format_ci(p_ci_toysmith),
                           format_ci(pp_ci_toysmith)])
    
# display
column_names = ["PM Loftus", "PPM Loftus", "PM Toysmith", "PPM Toysmith"]
df = pd.DataFrame(PM_PPM_CI_data, columns=column_names, index=range(1,6))
df.columns.name = "rating"
df

rating,PM Loftus,PPM Loftus,PM Toysmith,PPM Toysmith
1,$0.07^{+0.11}_{-0.03}$,$0.07^{+0.13}_{-0.02}$,$0.14^{+0.17}_{-0.11}$,$0.14^{+0.19}_{-0.09}$
2,$0.04^{+0.07}_{-0.01}$,$0.04^{+0.09}_{-0.01}$,$0.08^{+0.11}_{-0.06}$,$0.08^{+0.12}_{-0.05}$
3,$0.06^{+0.10}_{-0.03}$,$0.06^{+0.12}_{-0.02}$,$0.07^{+0.10}_{-0.05}$,$0.07^{+0.11}_{-0.04}$
4,$0.17^{+0.23}_{-0.12}$,$0.17^{+0.26}_{-0.09}$,$0.11^{+0.14}_{-0.08}$,$0.11^{+0.15}_{-0.07}$
5,$0.66^{+0.74}_{-0.59}$,$0.66^{+0.76}_{-0.56}$,$0.60^{+0.65}_{-0.55}$,$0.60^{+0.67}_{-0.53}$


Based on the table above and the one in Problem 2.5, it looks like the estimates are more reliable for the Toysmith company: its PM estimates agree with the MLE for all ratings to two decimals, and its credible intervals are generally narrower relative to the estimates. For 5 stars, for example, the PM interval spans 0.55 to 0.65 for Toysmith but 0.59 to 0.74 for Loftus. This makes me more confident in the Toysmith estimates. I think this is the case because the Toysmith company has more reviewer data points (410 vs. 162), so the data constrain the posterior more tightly.
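The credible intervals can also be computed without sampling, since each marginal of a Dirichlet is a Beta distribution: $\theta_i \sim Beta(a_i, \sum_j a_j - a_i)$. The sketch below uses `scipy.stats.beta.ppf` for this; the helper name `dirichlet_credible_intervals` is an assumption, and the parameters are $x + \alpha$ with counts recovered by `round(N * r)`.

```python
import numpy as np
from scipy import stats

def dirichlet_credible_intervals(a, level=0.95):
    """Central credible interval for each theta_i under Dir(a)."""
    a = np.asarray(a, dtype=float)
    tail = (1 - level) / 2
    rest = a.sum() - a  # second Beta parameter of each marginal
    lower = stats.beta.ppf(tail, a, rest)
    upper = stats.beta.ppf(1 - tail, a, rest)
    return lower, upper

a_loftus = np.array([10.9, 6.1, 10.1, 28.1, 109.9])
a_toysmith = np.array([57.9, 33.1, 29.1, 45.1, 246.9])

lo_l, hi_l = dirichlet_credible_intervals(a_loftus)
lo_t, hi_t = dirichlet_credible_intervals(a_toysmith)

width_loftus, width_toysmith = hi_l - lo_l, hi_t - lo_t
print(width_loftus.round(3))
print(width_toysmith.round(3))
```

Comparing the interval widths directly makes the effect of the larger Toysmith sample easier to see than reading bounds off the table.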

## Part III: Comparison
1. **(Summarizing Customer Ratings)** Recall that on Amazon, the first customer review statistic displayed for a product is the average rating. Name at least one problem with ranking products based on the average customer rating.

One problem with ranking products based on average customer rating could be selection bias. Reviewers who had extreme experiences with the product (e.g. 1 or 5 star) may be more likely to leave a review than someone who just had an average experience with the product. Depending on the relative number of such reviews, this could skew the average towards one of those extremes.

2. **(Comparison of Point Estimates)** Which point estimate (MAP, MLE, posterior mean or posterior predictive estimate) of $\theta$, if any, would you choose to rank the two Amazon products? Why? 

  *Hint: think about which of these estimates are equivalent (if any). If they are not equivalent, what are the special properties of each estimate? What aspect of the data or the model is each estimate good at capturing?*
  
   **Note:** we're not looking for "the correct answer" here. We are looking for a sound decision based on a statistically correct interpretation of your models.

For an infinite number of reviews, the MAP should approach the MLE, since the data overwhelm the prior. I would be more confident taking the posterior predictive mean (PPM) estimate though because, like the PM estimate, it is able to incorporate our prior beliefs on reviewers' bias. Unlike the PM estimate though, the PPM is based on simulated data, so it also reflects the sampling variability of a finite set of reviews. As point estimates the two agree in the table from Problem 2.5; the difference is in the spread of the underlying samples.
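One possible way to turn the posterior into a ranking, sketched here under assumptions, is to compare the expected star rating $\sum_i i\,\theta_i$ of the two products across posterior draws and report how often one exceeds the other. The posterior parameters are $x + \alpha$ from Part II; the seed, sample size, and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
stars = np.arange(1, 6)

# posterior Dirichlet parameters x + alpha for each product
a_loftus = np.array([10.9, 6.1, 10.1, 28.1, 109.9])
a_toysmith = np.array([57.9, 33.1, 29.1, 45.1, 246.9])

# expected star rating under each posterior draw of theta
mean_loftus = rng.dirichlet(a_loftus, size=20_000) @ stars
mean_toysmith = rng.dirichlet(a_toysmith, size=20_000) @ stars

# fraction of draws in which Loftus has the higher expected rating
prob_loftus_higher = np.mean(mean_loftus > mean_toysmith)
print(mean_loftus.mean().round(2), mean_toysmith.mean().round(2))
print(prob_loftus_higher)
```

Unlike a single point estimate, this summary uses the whole posterior, so the comparison accounts for the different number of reviews behind each product.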