# From Scores to Conditional Probabilities  
1.  
$$
E_y[l(\hat{y}y) | x] = P(y = 1 | x) l(f(x)) + P(y = -1 | x)l(-f(x)) = \pi(x)l(f(x)) + (1-\pi(x))l(-f(x))
$$ 

2.
$$
\begin{aligned}
E_y[l(\hat{y}y) | x] &= \pi(x)e^{-f(x)} + (1 - \pi(x))e^{f(x)}  \\
\end{aligned}
$$
differentiate it w.r.t $\hat{y} = f(x)$  
$$
\frac{dE_y[l(\hat{y}y) | x]}{d\hat{y}} = -\pi(x)e^{-f(x)} + (1 - \pi(x))e^{f(x)} = 0
$$  
then we get  
$$
-\pi(x) + (1- \pi(x))e^{2f(x)} = 0
$$  
and  
$$
f^*(x) = \frac{1}{2}\ln\frac{\pi(x)}{1 - \pi(x)}
$$

3.  
$$
E_y[l(\hat{y}y) | x] = \pi(x)\ln(1 + e^{-f(x)}) + (1 - \pi(x))\ln(1 + e^{f(x)})
$$  
differentiation  
$$
\frac{dE_y[l(\hat{y}y) | x]}{d\hat{y}} = -\pi(x)\frac{-e^{-f(x)}}{1 + e^{-f(x)}} + (1- \pi(x))\frac{e^{f(x)}}{1 + e^{f(x)}}=0
$$  
then we get  

$$
f^*(x) = \ln\frac{\pi(x)}{1-\pi(x)}
$$

# logistic Regression  
##  Equivalence of ERM and probabilistic approaches  
$$
\begin{aligned}
\mathrm{NLL}(w) &=-\sum_{i=1}^{n} y_{i}^{\prime} \log \phi\left(w^{T} x_{i}\right)+\left(1-y_{i}^{\prime}\right) \log \left(1-\phi\left(w^{T} x_{i}\right)\right) \\
&=\sum_{i=1}^{n}\left[-y_{i}^{\prime} \log \phi\left(w^{T} x_{i}\right)\right]+\left(y_{i}^{\prime}-1\right) \log \left(1-\phi\left(w^{T} x_{i}\right)\right) \\ 
&= \sum_{i = 1}^n -y_i'\left(-\log \left(1 + e^{w^Tx_i}\right)\right) + (y_i' -1)\left(-w^Tx_i -\log \left(1 + e^{-w^Tx_i} \right) \right) \\
&= \sum_{i = 1}^n w^Tx_i(1- y_i') + \log \left(1 + e^{-w^Tx_i} \right)
\end{aligned}
$$   

if $y_i = 1$, then $y_i' = 1$  
$$
\begin{aligned}
n\hat{R}_n(w) &= \sum_{i = 1}^n\log \left(1 + e^{-w^Tx_i} \right) \\
&= \mathrm{NLL}(w)
\end{aligned}
$$  
else if $y_i = -1$, $y_i' = 0$  

$$
\begin{aligned}
n\hat{R}_n(w) &= \sum_{i = 1}^n\log \left(1 + e^{w^Tx_i} \right) \\
&= \sum_{i = 1}^n \log e^{w^Tx_i} + \log \left(1 + e^{-w^Tx_i} \right) \\ 
&= \sum_{i = 1}^n \log \left(1 + e^{w^Tx_i} \right) = \mathrm{NLL}(w)
\end{aligned}
$$

##  Regularized Logistic Regression   
1. Log-Sum-Exp is convex, norm is convex, the sum of convex is convex  


In [1]:
import numpy as np 
from scipy.optimize import minimize
from functools import partial

In [2]:
def f_objective(theta, X, y, l2_param=1):
    '''
    Args:
        theta: 1D numpy array of size num_features
        X: 2D numpy array of size (num_instances, num_features)
        y: 1D numpy array of size num_instances
        l2_param: regularization parameter

    Returns:
        objective: scalar value of objective function
    '''
    num_instance, num_feature = X.shape
    J = 0
    for i in range(num_instance):
        J += 1/(num_instance) * (np.logaddexp(0, -y[i] * np.dot(theta, X[i])))
    J_logistic = J + l2_param *  np.linalg.norm(theta)
    return J_logistic

In [3]:
def fit_logistic_reg(X, y, objective_function, l2_param = 1):
    '''
    Args:
        X: 2D numpy array of size (num_instances, num_features)
        y: 1D numpy array of size num_instances
        objective_function: function returning the value of the objective
        l2_param: regularization parameter
        
    Returns:
        optimal_theta: 1D numpy array of size num_features
    '''
    partial_objective = partial(objective_function, X = X, y = y, l2_param = l2_param)
    num_instance, num_feature = X.shape
    initial_theta = np.random.randn(num_feature)
    optimal_theta = minimize(partial_objective, initial_theta, method = 'Nelder-Mead')
    return optimal_theta

In [5]:
X_train_init = np.loadtxt('X_train.txt', delimiter = ',')
X_val_init = np.loadtxt('X_val.txt', delimiter = ',')
y_train_init = np.loadtxt('y_train.txt', delimiter = ',')
y_val_int = np.loadtxt('y_val.txt', delimiter = ',')

In [6]:
from sklearn.preprocessing import MinMaxScaler
X_train = MinMaxScaler().fit_transform(X_train_init)
X_val = MinMaxScaler().fit_transform(X_val_init)

In [7]:
X_train

array([[0.41716799, 0.60399767, 0.36840068, ..., 0.63191707, 0.23994726,
        0.42968297],
       [0.56673526, 0.87792192, 0.57637651, ..., 0.60527623, 0.53393452,
        0.43512694],
       [0.27095087, 0.49513585, 0.65234197, ..., 0.41306849, 0.60742099,
        0.40874664],
       ...,
       [0.46845955, 0.60300687, 0.57895224, ..., 0.5880624 , 0.50323627,
        0.42324938],
       [0.56176742, 0.74867496, 0.6004432 , ..., 0.62166153, 0.35907216,
        0.33692692],
       [0.18960615, 0.66065241, 0.85321182, ..., 0.62622847, 0.51578704,
        0.48375912]])

In [8]:
opt_theta = fit_logistic_reg(X_train, y_train_init, f_objective).x

In [9]:
opt_theta

array([ 0.03362034,  0.373663  ,  0.11982482, -0.03353394, -0.23651557,
       -0.06450096,  0.04983136,  0.04466461,  0.10927934,  0.07377231,
        0.06958334, -0.37405481,  0.28128228,  0.1423054 ,  0.13799391,
       -0.33843753,  0.04633243,  0.12222491,  0.05834767,  0.11109797])

In [10]:
prediction = X_val.dot(opt_theta)

In [11]:
new_prediction = prediction - 1
new_prediction[new_prediction >= 0.751] = 1
new_prediction[new_prediction < 0.751] =0

# Bayesian Linear Regression - Implementation  

see [problem.py](./problem.py)

#  Coin Flipping: Maximum Likelihood  
1.  
$$
P(D|\theta) = \theta^2(1-\theta)
$$  

2. we have
$$
(T,H,H),(H,T,H),(H,H,T)
$$  
$$
C_3^2\theta^2(1-\theta)
$$  

3.  
$$
P(D|\theta) = \theta^{n_h}(1-\theta)^{n_t}
$$  

4.Differentiate it w.r.t $\theta$,  
$$
\frac{dP}{d\theta} = n_h\theta^{n_h - 1}(1-\theta)^{n_t}-n_t\theta^{n_h}(1-\theta)^{n_t-1} = 0
$$  
thus  
$$
\hat{\theta}_{MLE} = \frac{n_h}{n_h+n_t}
$$


#  Coin Flipping: Bayesian Approach with Beta Prior   
1.  
$$
P(\theta \mid \mathcal{D}) \propto P(\theta)P(x_1,...,x_n|\theta) = p(\theta) \prod_{i=1}^n p\left(x_{i} \mid \theta\right) = \theta^{h-1+n_{h}}(1-\theta)^{t-1+n_{t}}
$$  

2.  
$$
\theta_{\mathrm{MLE}}=\frac{n_{h}}{n_{h}+n_{t}}
$$  
The MAP estimation $\theta_{MAP}$ should be the mode of posterior distribution:  
$$
\hat{\theta}_{\mathrm{MAP}}=\frac{n_{h}+h-1}{n_{h}+h+n_{t}+t-2}
$$  
The posterior mean of $\theta$ is:  
$$
\hat{\theta}_{\text {POSTERIOR MEAN }}=\frac{n_{h}+h}{n_{h}+h+n_{t}+t}
$$  

3.  
When $n$ approaches inﬁnity, the eﬀect of prior on posterior is negligible. Therefore we expect $\theta_{MAP}$, and $\hat{\theta}_{\text {POSTERIOR MEAN }}$ to converges to $\theta$  

4.  
$$
\mathbb{E}\left[\hat{\theta}_{\mathrm{MLE}}\right]=\mathbb{E}\left[\frac{n_{h}}{n}\right]=\frac{1}{n} \mathbb{E}\left[n_{h}\right]=\frac{n \theta}{n}=\theta
$$  
so MLE is unbiased estimator  
$$
\mathbb{E}\left[\hat{\theta}_{\mathrm{MAP}}\right]=\mathbb{E}\left[\frac{n_{h}+h-1}{n+h+t-2}\right]
$$  
MAP unbiased if $h = t = 1$,   
$$
\mathbb{E}\left[\hat{\theta}_{\text {POSTERIOR MEAN }}\right]=\mathbb{E}\left[\frac{n_{h}+h}{n+h+t}\right]
$$
posterior mean is unbiased if $h = t = 0$  

5.  
Posterior Mean, since 3 times is very unstable if we choose MLE. I will choose $\text{Beta}(2,2)$ as my prior

#  Hierarchical Bayes for Click-Through Rate Estimation  
##  
1.  
$$
P(D_i | \theta_i) = \theta_i^{x_i}(1-\theta_i)^{n_i - x_i}
$$  
2. Since if $\theta_i$ follows $\text{Beta}(a,b)$, then $p(\theta_i) \propto \theta_i^{a-1}(1-\theta_i)^{b-1}$, then by the definition, $$
\int \theta_{i}^{a-1}\left(1-\theta_{i}\right)^{b-1} d \theta_{i}=B(a, b)
$$  
3.  

$$
\begin{aligned}
p\left(\theta_{i} \mid \mathcal{D}_{i}\right) & \propto p\left(\mathcal{D} \mid \theta_{i}\right) p\left(\theta_{i}\right) \\
&=\frac{\theta_{i}^{x_{i}+a-1}\left(1-\theta_{i}\right)^{n-x_{i}+b-1}}{B(a, b)} \\
& \propto \theta_{i}^{x_{i}+a-1}\left(1-\theta_{i}\right)^{n-x_{i}+b-1}
\end{aligned}
$$  
and we know that $\int p(\theta_i|D_i)d\theta_i = 1$, combine with the conclusion before, we must have  
$$
p\left(\theta_{i} \mid \mathcal{D}_{i}\right)=\frac{\theta_{i}^{x_{i}+a-1}\left(1-\theta_{i}\right)^{n-x_{i}+b-1}}{B\left(x_{i}+a, n-x_ i+b\right)}
$$  
4. 
  
$$
\begin{aligned}
p\left(\mathcal{D}_{i}\right) &=\int_{0}^{1} p\left(\mathcal{D}_{i} \mid \theta_{i}\right) p\left(\theta_{i}\right) d \theta_{i} \\
&=\frac{1}{B(a, b)} \int_{0}^{1} \theta^{x_{i}+a-1}(1-\theta)^{n_{i}-x_{i}+b-1} \\
&=\frac{B\left(x_{i}+a, n_{i}-x_{i}+b\right)}{B(a, b)}
\end{aligned}
$$  
5. It may help to think about the integral $p\left(\mathcal{D}_{i}\right)=\int p\left(\mathcal{D}_{i} \mid \theta_{i}\right) p\left(\theta_{i}\right) d \theta_{i}$ as a weighted average of $p\left(\mathcal{D}_{i} \mid \theta_{i}\right)$, where the weights are $p(\theta_i)$. By definition of MLE, we must have $p\left(\mathcal{D}_{i} \mid \theta\right) \leq p\left(\mathcal{D}_{i} \mid \theta_{\mathrm{MLE}}\right)$, then  
$$
\begin{aligned}
p\left(\mathcal{D}_{i}\right) &=\int p\left(\mathcal{D}_{i} \mid \theta_{i}\right) p\left(\theta_{i}\right) d \theta \\
& \leq \int p\left(\mathcal{D}_{i} \mid \theta_{\mathrm{MLE}}\right) p\left(\theta_{i}\right) d \theta \\
&=p\left(\mathcal{D}_{i} \mid \theta_{\mathrm{MLE}}\right) \int p\left(\theta_{i}\right) d \theta \\
&=p\left(\mathcal{D}_{i} \mid \theta_{\mathrm{MLE}}\right)
\end{aligned}
$$  

6.  
If we keep increase the likelihood, the eﬀect of prior on posterior will increase and ﬁnally dominate the posterior distribution.

## Empirical Bayes Using All App Data  

1. 
$$
\begin{aligned}
p(\mathcal{D} \mid a, b) &=\prod_{i=1}^{d} p\left(\mathcal{D}_{i} \mid a, b\right) \\
&=\prod_{i=1}^{d} \frac{B\left(x_{i}+a, n_{i}-x_{i}+b\right)}{B(a, b)}
\end{aligned}
$$  

2.  

$$
p\left(\theta_{i} \mid \mathcal{D}\right)=\frac{p\left(\mathcal{D} \mid \theta_{i}\right) p\left(\theta_{i}\right)}{p(\mathcal{D})}=\frac{p\left(\mathcal{D} \mid \theta_{i}\right) p\left(\theta_{i}\right)}{\prod_{k=1}^{p} p\left(\mathcal{D}_{k}\right)}
$$  
we learn that only $p(D_i|\theta_i)$ is influenced by $\theta_i$, thus  
$$
\begin{aligned}
p\left(\theta_{i} \mid \mathcal{D}\right) &=\frac{p\left(\mathcal{D}_{i} \mid \theta_{i}\right) p\left(\theta_{i}\right) \prod_{k \neq i} p\left(\mathcal{D}_{k}\right)}{\prod_{k \neq i} p\left(\mathcal{D}_{k}\right) p\left(\mathcal{D}_{i}\right)} \\
&=\frac{p\left(\mathcal{D}_{i} \mid \theta_{i}\right) p\left(\theta_{i}\right)}{p\left(\mathcal{D}_{i}\right)} \\
&=p\left(\theta_{i} \mid \mathcal{D}_{i}\right)
\end{aligned}
$$