# Table of Contents
1. [Question 1](#Question-1)
2. [Question 4](#Question-4)

# Question 1
(25pt + 10pt) In this question, we use the file `brader.csv`, which contains data from Brader, Valentino and Suhay (2008). The file includes the following variables for $n=265$ observations:

* the outcome of interest: a four-point scale in response to "Do you think the number of immigrants from foreign countries should be increased or decreased?"
* tone of the story treatment (positive or negative)
* ethnicity of the featured immigrant treatment (Mexican or Russian)
* respondents' age
* respondents' income

Consider the following ordered logit model for an ordered outcome variable with four levels:
        $$ \Pr(Y_i \leq j \mid X_i) \ = \ \frac{\exp(\psi_j - X_i^\top\beta)}
            {1 + \exp(\psi_j - X_i^\top\beta)} $$
for $j = 1,2,3,4$ and $i = 1,\dots,n$, where $\psi_4=\infty$ and $X_i = [\texttt{tone}_i \ \texttt{eth}_i \ \texttt{ppage}_i \ \texttt{ppincimp}_i]^\top$ (i.e., no intercept).
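
As a quick illustration of the model, the cumulative probabilities can be differenced to give the probability of each of the four categories. The sketch below uses made-up numbers: `psi_example` and `eta_example` are hypothetical values, not estimates from `brader.csv`.

```r
# Hypothetical cutpoints; the last one is psi_4 = Inf
psi_example = c(-1, 0.5, 2, Inf)
# Hypothetical value of X_i^T beta for a single respondent
eta_example = 0.3

# Cumulative probabilities Pr(Y_i <= j | X_i); plogis is the inverse logit
cum_prob = plogis(psi_example - eta_example)

# Category probabilities Pr(Y_i = j | X_i): difference consecutive
# cumulative probabilities, taking Pr(Y_i <= 0) = 0
cat_prob = diff(c(0, cum_prob))
```

Because $\psi_4=\infty$ makes the last cumulative probability equal to one, the four entries of `cat_prob` sum to one.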

a) (5pt) Write down the likelihood function.



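Since $\Pr(Y_i = j \mid X_i) = \Pr(Y_i \leq j \mid X_i) - \Pr(Y_i \leq j-1 \mid X_i)$, the likelihood is

$$
l = 
\prod_{i=1}^n \left[
\frac{\exp(\psi_{Y_i} - X_i^\top\beta)}
     {1 + \exp(\psi_{Y_i} - X_i^\top\beta)} -
\frac{\exp(\psi_{Y_i-1} - X_i^\top\beta)}
     {1 + \exp(\psi_{Y_i-1} - X_i^\top\beta)}
\right]
$$

with the convention $\psi_0 = -\infty$, so that the second term vanishes when $Y_i = 1$.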

To simplify, from here on $\phi(x) = \exp(x)/(1+\exp(x))$ denotes the inverse link function and $\phi'$ its derivative. We can then write the likelihood in matrix form as:

$$
\prod_{i=1}^n \sum_{j=1}^4
\tilde M_{i,j}
\phi\left(\psi_{j} - \sum_{k=1}^m X_{ik}\beta_k\right)
$$

where 

$$
\tilde M = MK =  M 
\begin{pmatrix}
  1 &  0 &  0 &  0\\
 -1 &  1 &  0 &  0\\
  0 & -1 &  1 &  0\\
  0 &  0 & -1 &  1\\
\end{pmatrix}
$$

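Here $M$ is the one-hot matrix for the $Y_i$: it has $n$ rows, and row $i$ has a one in column $Y_i$ and zeros everywhere else. Row $i$ of $\tilde M$ therefore has $+1$ in column $Y_i$ and $-1$ in column $Y_i - 1$ (when $Y_i > 1$), which reproduces the difference of cumulative probabilities in the likelihood.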

In practice, however, we are interested in the log-likelihood, which is:

$$
L = \sum_{i=1}^n \log \sum_{j=1}^4
\tilde M_{i,j}
\phi\left(\psi_{j} - \sum_{k=1}^m X_{ik}\beta_k\right)
$$
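
The construction of $\tilde M = MK$ can be checked on a toy example. This is only a sketch: `Y_example` is a hypothetical outcome vector coded $1,\dots,4$, and the variable names are illustrative.

```r
# Hypothetical outcomes coded 1..4 for three respondents
Y_example = c(1, 3, 4)
J = 4

# One-hot matrix M: M[i, j] = 1 if Y_i == j, else 0
M_example = outer(Y_example, 1:J, "==") + 0

# Differencing matrix K: 1 on the diagonal, -1 just below it
K_example = diag(J)
K_example[cbind(2:J, 1:(J - 1))] = -1

# ~M = M K: row i has +1 in column Y_i and -1 in column Y_i - 1
M_tilde_example = M_example %*% K_example
```

Multiplying row $i$ of `M_tilde_example` by the vector of $\phi(\psi_j - X_i^\top\beta)$ then gives $\phi_{Y_i} - \phi_{Y_i - 1}$, which is the term inside the logarithm.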


In [97]:
# Now we do it in R:

# Function log_likelihood
#
# @ Param: Y, the observed Y values, coded 1..4 (an n-vector in our case)
# @ Param: X, the observed X values (an n by m matrix in our case)
# @ Param: beta, the beta vector (an m-vector in our case)
# @ Param: psi, the four psi values we talk about above
#          IMPORTANT: It assumes that the last parameter (~infinity)
#                     IS passed as a part of psi!!!
#
# @ Returns: The log-likelihood L defined above

log_likelihood = function(Y, X, beta, psi){
    J = length(psi)
    n = length(Y)

    M = compute_M(Y, J)
    
    # x[i, j] is psi_j - X_i^T beta
    x = t(matrix(rep(psi, n), J)) - as.vector(X %*% beta)
    
    # plogis(x) equals exp(x)/(1+exp(x)) but returns 1 (not NaN) at x = Inf
    Z = M * plogis(x)
    
    sum(log(rowSums(Z)))
}

#Auxiliary function to compute the ~M matrix defined above
#
# @ Param: Y, the observed Y values, coded 1..m
# @ Param: m, the number of possible Y values
#
# @ Returns: The ~M matrix defined above
#

compute_M = function (Y, m){
    
    #This is just a One Hot encoder:
    
    n = length(Y)
    #This creates an nxm matrix whose columns
    #are 1...1, 2...2, ...
    M1 = t(matrix(rep(c(1:m), n), m))
    
    #This creates an nxm matrix whose columns
    #are all Y (m times the same column, Y)
    M2 = matrix(rep(c(Y), m), n)
    
    #We compare if they are equal, which gives
    #the One Hot encoder
    M = (M1 == M2) + 0
    
    #Create the matrix K (1 on the diagonal, -1 just below it).
    #Note the parentheses: 2:m+1 would be parsed as (2:m)+1
    K = diag(m + 1)
    K = K[2:(m + 1), ] - K[1:m, ]
    K = K[1:m, 2:(m + 1)]
    #Return the product
    M %*% K

}

b) (10pt) Derive the score functions for $\beta$ and $\psi_j$.


Recall the log-likelihood from (a):

$$
L = \sum_{i=1}^n \log \sum_{j=1}^4
\tilde M_{i,j}
\phi\left(\psi_{j} - \sum_{k=1}^m X_{ik}\beta_k\right)
$$


To simplify the computations, let $\phi_{ij}=\phi(\psi_{j} - X_i^\top\beta)$, and $\phi_{ij}'=\phi'(\psi_{j} - X_i^\top\beta)$
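
For the logistic inverse link used in this question, the derivative has a simple closed form, which is the expression that appears in the score functions:

$$
\phi(x) = \frac{\exp(x)}{1+\exp(x)}, \qquad
\phi'(x) = \frac{\exp(x)}{(1+\exp(x))^2} = \phi(x)\,\bigl(1-\phi(x)\bigr)
$$

In particular $\phi'(x) \to 0$ as $x \to \pm\infty$, so the fixed cutpoint $\psi_4 = \infty$ contributes nothing to the derivatives.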

Since derivatives and sums commute freely, we can compute the score easily:

$$
\frac{\partial L}{\partial \beta_k} = -
\sum_{i=1}^n \left (\sum_{j=1}^4 \tilde M_{i,j}\phi_{ij} \right )^{-1}
\left (\sum_{j=1}^4
\tilde M_{i,j}\phi_{ij}'\right )  X_{ik}
$$

For the $\psi$, we can again think of a vector $\psi = (\psi_1,\dots,\psi_4)$ and write $\psi_l = \langle \psi, \vec e_l \rangle$, so that $\partial \psi_l / \partial \psi_j = \delta_{lj}$. Using matrix notation for the derivative, the result is:

$$
\frac{\partial L}{\partial \psi_j} = 
\sum_{i=1}^n \left (\sum_{l=1}^4 \tilde M_{i,l}\phi_{il} \right )^{-1}
\left (\sum_{l=1}^4
\tilde M_{i,l}\phi_{il}'\,\delta_{lj}\right )
$$

Now, it is easy to check that $
\sum_{l=1}^4
\tilde M_{i,l}
\frac{\exp(\psi_{l} - X_i^\top\beta)}
     {(1 + \exp(\psi_{l} - X_i^\top\beta))^2}\,\delta_{lj}
=     
\tilde M_{i,j}
\frac{\exp(\psi_{j} - X_i^\top\beta)}
     {(1 + \exp(\psi_{j} - X_i^\top\beta))^2} 
$ (this is only important regarding the computations). Therefore:


$$
\frac{\partial L}{\partial \psi_j} = 
\sum_{i=1}^n \left (\sum_{l=1}^4
\tilde M_{i,l}
\frac{\exp(\psi_{l} - X_i^\top\beta)}
     {1 + \exp(\psi_{l} - X_i^\top\beta)} 
\right)^{-1}
\tilde M_{i,j}
\frac{\exp(\psi_{j} - X_i^\top\beta)}
     {(1 + \exp(\psi_{j} - X_i^\top\beta))^2} 
$$

for $j = 1,2,3$; the last cutpoint $\psi_4 = \infty$ is fixed and is not estimated.



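Since we will use these names in the code, let $Z_{i,j} = \tilde M_{i,j}\frac{\exp(\psi_{j} - X_i^\top\beta)}
     {1 + \exp(\psi_{j} - X_i^\top\beta)}$ and $W_{i,j} = \tilde M_{i,j}\frac{\exp(\psi_{j} - X_i^\top\beta)}
     {(1 + \exp(\psi_{j} - X_i^\top\beta))^2}$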


In [98]:
# Now we do it in R:

# Function beta_score
#
# @ Param: Y, the observed Y values, coded 1..4 (an n-vector in our case)
# @ Param: X, the observed X values (an n by m matrix in our case)
# @ Param: beta, the beta vector (an m-vector in our case)
# @ Param: psi, the four psi values we talk about above
#          IMPORTANT: It assumes that the last parameter (~infinity)
#                     IS passed as a part of psi!!!
#
# @ Returns: The m-vector of partial derivatives dL/dbeta_k

beta_score = function(Y, X, beta, psi){
    J = length(psi)
    n = length(Y)

    M = compute_M(Y, J)
    
    # x[i, j] is psi_j - X_i^T beta
    x = t(matrix(rep(psi, n), J)) - as.vector(X %*% beta)
    
    # plogis and dlogis are phi and phi'; both stay finite at x = Inf
    Z = M * plogis(x)
    W = M * dlogis(x)
    
    # -sum_i X_ik * (sum_j W_ij) / (sum_j Z_ij), one entry per covariate k
    as.vector(-t(X) %*% (rowSums(W) / rowSums(Z)))

}


# Function psi_score
#
# @ Param: Y, the observed Y values, coded 1..4 (an n-vector in our case)
# @ Param: X, the observed X values (an n by m matrix in our case)
# @ Param: beta, the beta vector (an m-vector in our case)
# @ Param: psi, the four psi values we talk about above
#          IMPORTANT: It assumes that the last parameter (~infinity)
#                     IS passed as a part of psi!!!
#
# @ Returns: The vector of partial derivatives dL/dpsi_j
#            (the last entry, for psi_4 = Inf, is 0)

psi_score = function(Y, X, beta, psi){
    J = length(psi)
    n = length(Y)

    M = compute_M(Y, J)
    
    # x[i, j] is psi_j - X_i^T beta
    x = t(matrix(rep(psi, n), J)) - as.vector(X %*% beta)
    
    Z = M * plogis(x)
    W = M * dlogis(x)
    
    # sum_i W_ij / (sum_l Z_il), one entry per cutpoint j
    as.vector((1 / rowSums(Z)) %*% W)
}
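
One way to sanity-check a derived score is to compare it with finite differences of the log-likelihood. The block below is a minimal, self-contained sketch on hypothetical toy data: `X_toy`, `Y_toy`, `beta_toy`, `psi_toy` and the helper functions are illustrative names, and the $\beta$ score is recomputed directly from the cumulative probabilities rather than through the functions above.

```r
# Hypothetical toy data: 6 respondents, 2 covariates, outcomes coded 1..4
X_toy = cbind(c(0, 1, 0, 1, 1, 0), c(0.2, -1.0, 1.5, 0.3, -0.4, 0.8))
Y_toy = c(1, 2, 4, 3, 2, 4)
beta_toy = c(0.5, -0.25)
psi_toy = c(-1, 0, 1.2)      # free cutpoints; psi_4 = Inf is appended below
idx = seq_along(Y_toy)

# Column j+1 holds f(psi_j - X_i^T beta) for j = 0..4, where f is plogis
# (phi) or dlogis (phi'); the first column is the psi_0 = -Inf case, 0 for both
link_matrix = function(beta, psi, f) {
    eta = as.vector(X_toy %*% beta)
    cbind(0, f(outer(-eta, c(psi, Inf), "+")))
}

loglik_toy = function(beta, psi) {
    P = link_matrix(beta, psi, plogis)
    sum(log(P[cbind(idx, Y_toy + 1)] - P[cbind(idx, Y_toy)]))
}

# Analytic score for beta: -sum_i X_ik (phi'_{Y_i} - phi'_{Y_i-1}) / (phi_{Y_i} - phi_{Y_i-1})
score_beta_toy = function(beta, psi) {
    P = link_matrix(beta, psi, plogis)
    D = link_matrix(beta, psi, dlogis)
    num = D[cbind(idx, Y_toy + 1)] - D[cbind(idx, Y_toy)]
    den = P[cbind(idx, Y_toy + 1)] - P[cbind(idx, Y_toy)]
    as.vector(-t(X_toy) %*% (num / den))
}

# Central finite differences of the log-likelihood in each beta_k
h = 1e-6
numeric_score = sapply(seq_along(beta_toy), function(k) {
    e = replace(numeric(length(beta_toy)), k, h)
    (loglik_toy(beta_toy + e, psi_toy) - loglik_toy(beta_toy - e, psi_toy)) / (2 * h)
})
analytic_score = score_beta_toy(beta_toy, psi_toy)
max_abs_diff = max(abs(numeric_score - analytic_score))
```

If the derivation is right, `max_abs_diff` should be close to zero, up to finite-difference error. The same comparison can be repeated for each $\psi_j$.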


c) (10pt) Using (a) and (b), calculate the maximum likelihood estimates of $\beta$ and $\psi_j$ and their standard errors via the `optim` function in R. Confirm your results by comparing them to outputs from the `polr` function in the `MASS` package.
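
The workflow for this part can be sketched as follows. This is not a solution on `brader.csv`: the data are simulated, all names (`X_ord`, `Y_ord`, `neg_loglik`, the true parameter values) are hypothetical, and for brevity `optim` is left to approximate the gradient numerically instead of being given the score functions from (b).

```r
set.seed(123)                     # arbitrary seed for reproducibility
n_ord = 2000                      # hypothetical sample size
beta_true = c(1, -0.5)            # hypothetical true coefficients
psi_true = c(-1, 0, 1)            # hypothetical true cutpoints (psi_4 = Inf)

# Simulate from the ordered logit model through a latent logistic variable
X_ord = cbind(rbinom(n_ord, 1, 0.5), rnorm(n_ord))
Y_star = as.vector(X_ord %*% beta_true) + rlogis(n_ord)
Y_ord = findInterval(Y_star, psi_true) + 1        # categories 1..4

# Negative log-likelihood in par = (beta_1, beta_2, psi_1, psi_2, psi_3)
neg_loglik = function(par) {
    beta = par[1:2]
    psi = par[3:5]
    # Cutpoints must be increasing; return a large value otherwise
    if (is.unsorted(psi, strictly = TRUE)) return(1e10)
    eta = as.vector(X_ord %*% beta)
    P = cbind(0, plogis(outer(-eta, c(psi, Inf), "+")))
    idx = seq_len(n_ord)
    -sum(log(P[cbind(idx, Y_ord + 1)] - P[cbind(idx, Y_ord)]))
}

fit = optim(c(0, 0, -0.5, 0, 0.5), neg_loglik, method = "BFGS", hessian = TRUE)
est = fit$par
# The Hessian of the negative log-likelihood estimates the observed
# information, so its inverse approximates the covariance matrix
se = sqrt(diag(solve(fit$hessian)))

# Optional comparison with polr, which uses the same sign convention
if (requireNamespace("MASS", quietly = TRUE)) {
    polr_fit = MASS::polr(factor(Y_ord) ~ X_ord, Hess = TRUE)
    print(cbind(optim = est, polr = c(coef(polr_fit), polr_fit$zeta)))
}
```

Minimizing the negative log-likelihood is used here because `optim` minimizes by default. Passing the analytic score through the `gr` argument would replace the numerical gradient.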


# Question 4

Cross Validation for Polynomial Regression. (18 points)
Consider the following four data generating processes:

* DGP 1: $Y = -2* 1_{\{X < -3\}} + 2.55* 1_{\{ X > -2\}} - 2* 1_{\{X>0\}} + 4* 1_{\{X > 2\}} -1* 1_{\{ X > 3\}}+ \epsilon$
* DGP 2: $Y = 6 + 0.4 X - 0.36X^2 + 0.005 X^3 + \epsilon$
* DGP 3: $Y = 2.83 * \sin(\frac{\pi}{2} \times X) +\epsilon $
* DGP 4: $Y = 4 * \sin(3 \pi \times X) * 1_{\{X>0\}}+ \epsilon$

  $X$ is drawn from the uniform distribution on $[-4,4]$ and $\epsilon$ is drawn from a standard normal ($\mu = 0$, $\sigma^2 = 1$).
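
Drawing a sample from one of these processes could look like the sketch below, shown for DGP 2. The seed, the sample size `n_sim` and the names `dgp2_mean` and `sim_data` are assumptions made for illustration.

```r
set.seed(42)                 # arbitrary seed, only for reproducibility
n_sim = 200                  # hypothetical sample size

# Mean function of DGP 2, as given in the list above
dgp2_mean = function(X) 6 + 0.4 * X - 0.36 * X^2 + 0.005 * X^3

x_draw = runif(n_sim, min = -4, max = 4)      # X ~ Uniform[-4, 4]
eps_draw = rnorm(n_sim, mean = 0, sd = 1)     # epsilon ~ N(0, 1)

# Data frame with columns X and Y, ready for model fitting
sim_data = data.frame(X = x_draw, Y = dgp2_mean(x_draw) + eps_draw)
```

The other three processes follow the same pattern with a different mean function.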

In [125]:
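# Each function returns the mean of Y given X (the noise is added separately).
# Operators go at the end of each line: a line starting with "+" would be
# parsed by R as a new expression, and only the last term would be returned.
DGP1 = function(X){
    -2   *(X < -3) +
    2.55 *(X > -2) -
    2    *(X >  0) +
    4    *(X >  2) -
    1    *(X >  3)}
DGP2 = function(X){
    6+0.4*X-0.36*X^2+0.005*X^3
}

DGP3 = function(X){
    2.83*sin(pi/2*X)
}

DGP4 = function(X){
    4*sin(3*pi*X)*(X > 0)
}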


(5 pts.) Write a function to estimate the generalization error of a polynomial by $k$-fold cross-validation. It should take as arguments the data, the degree of the polynomial, and the number of folds $k$. It should return the cross-validation mean squared error.

In [None]:

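# k-fold cross-validation MSE for a polynomial of the given degree.
# Assumes data is a data frame with columns named X and Y.
cv_poly = function(data, degree, k){
    n = nrow(data)
    data = data[sample(n),]                     # shuffle the rows
    folds = cut(seq(1,n),breaks=k,labels=FALSE) # fold label for each row
    mse = numeric(k)
    for(i in 1:k)
    {
        ind = which(folds==i,arr.ind=TRUE)
        test = data[ind, ]
        train = data[-ind, ]
        fit = lm(Y ~ poly(X, degree, raw = TRUE), data = train)
        mse[i] = mean((test$Y - predict(fit, newdata = test))^2)
    }
    mean(mse)
}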