## Question 1

### **Question**

Devise the update rule for minimizing $f(x)=(\max(0,x)-\frac{1}{2})^2$ using gradient descent. Roughly, when will gradient descent succeed or fail (based on where you start and the step size)?

---

### **Answer**

The gradient descent update rule for a function is derived by:
1. Finding the loss function for a point $x_i$
2. Taking the derivative with respect to the model parameters to compute the gradient of the loss function, $\nabla L(w)$
3. Plugging these values into the update rule to form:

$$ w_{\text{new}} = w_{\text{old}} - \eta \nabla L(w_{\text{old}}) $$

##### **Finding Loss Function**
   
Since we are given $f(x)=(\max(0,x)-\frac{1}{2})^2$, we will treat it as the loss function to perform gradient descent on, with $x$ playing the role of the parameter $w$.

##### **Taking the derivatives**

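Since the function contains a $\max$, we compute the gradient separately for the case where the max is $0$ (that is, $x < 0$) and the case where it is $x$ (that is, $x > 0$). At $x = 0$ the one-sided derivatives differ ($0$ from the left, $-1$ from the right), so $f$ is not differentiable there; we take the gradient to be $0$ at that point.

- For $x < 0$, where $f(x)=(0-\frac{1}{2})^2$:
$$ \frac{d}{dx}((0-\frac{1}{2})^2) = 0 $$
- For $x > 0$, where $f(x)=(x-\frac{1}{2})^2$:
$$ \frac{d}{dx}((x-\frac{1}{2})^2) = 2(x-\frac{1}{2})\cdot \frac{d}{dx}(x-\frac{1}{2})$$
$$ = 2(x-\frac{1}{2})\cdot 1 $$
$$ = 2(x-\frac{1}{2}) $$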

##### **Plugging it all in**

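Plugging in the above results, with $x$ as the parameter being updated, we have that when $\max(0,x) = 0$ (that is, $x_{\text{old}} \le 0$):

$$ x_{\text{new}} = x_{\text{old}} - \eta \cdot 0 = x_{\text{old}} $$

And when $\max(0,x) = x$ (that is, $x_{\text{old}} > 0$):

$$ x_{\text{new}} = x_{\text{old}} - \eta \cdot 2(x_{\text{old}}-\frac{1}{2}) $$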

##### **Where will Gradient Descent Succeed and/or Fail**

Gradient descent can fail for several reasons:
- Local minima (non-convex function) may cause the algorithm to not be able to find the global minimum, or lowest loss, for the function
- Too large of a step size parameter may lead to the algorithm oscillating around the minimum or even diverging

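We can look at the second derivative on each piece to check whether the function is convex:

$$ \frac{d}{dx}(2(x-\frac{1}{2})) = 2 \quad (x > 0) $$
$$ \frac{d}{dx}(0) = 0 \quad (x < 0) $$

On each piece the second derivative is non-negative, but that is not enough to conclude convexity, because the first derivative jumps down from $0$ to $-1$ at $x = 0$. In fact the function is not convex: $f(-1) = \frac{1}{4}$ and $f(\frac{1}{2}) = 0$, yet at the midpoint $f(-\frac{1}{4}) = \frac{1}{4} > \frac{1}{8}$. The region $x \le 0$ is a flat plateau where the gradient is zero.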

And we can find the minimum of the function by setting the derivative on the $x > 0$ piece equal to 0:

$$ 2(x-\frac{1}{2}) = 0 $$
$$ x = \frac{1}{2} $$

So the minimizer is $x = \frac{1}{2}$, where $f(\frac{1}{2}) = 0$.

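Thus, in this case the algorithm will fail if it starts at $x \le 0$, as the gradient is zero there and no updates will occur. For $x > 0$ the update gives $x_{\text{new}} - \frac{1}{2} = (1-2\eta)(x_{\text{old}} - \frac{1}{2})$, so the distance to the minimum shrinks whenever $0 < \eta < 1$, as long as the iterates stay positive. With $0 < \eta \le \frac{1}{2}$ the iterates never cross $\frac{1}{2}$, so gradient descent succeeds from any start $x > 0$. With a larger step size the iterates overshoot the minimum, and an overshoot that lands at $x \le 0$ gets stuck on the flat region. So if we start with $x > 0$ and a reasonable step size, it will succeed.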
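The behavior described above can be checked numerically. The following is a minimal sketch, not a reference implementation; the function names `grad` and `gradient_descent`, the step count, and the choice of gradient $0$ at $x = 0$ are assumptions made for illustration.

```python
def grad(x):
    """Derivative of f(x) = (max(0, x) - 1/2)^2.

    Assumption: on the flat region x <= 0 (including the kink at 0)
    we use a gradient of 0.
    """
    return 2.0 * (x - 0.5) if x > 0 else 0.0


def gradient_descent(x0, eta, steps=200):
    """Run a fixed number of gradient descent steps starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x


good_start = gradient_descent(2.0, eta=0.1)    # positive start, small step
flat_start = gradient_descent(-1.0, eta=0.1)   # starts on the flat region
big_step = gradient_descent(2.0, eta=0.9)      # overshoots into x <= 0
```

Under these assumptions, `good_start` ends up close to $\frac{1}{2}$, `flat_start` never moves, and `big_step` jumps past the minimum into the flat region on its first update and stays there.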

## Question 2

### **Question**

Given a data set $A$, under what circumstances will its projection onto its principal components equal its projection onto its right singular vectors and left singular vectors?

### **Answer**

PCA really works on a data set of points, but it first requires organizing the points into a matrix $A$. The question emphasizes that there are two ways of doing this:

1. The points are rows. In this case the principal components are the right singular vectors, the eigenvectors of $A^\top A$.
2. The points are columns. In this case the principal components are the left singular vectors, the eigenvectors of $AA^\top$. This is analogous to (1), except with $A^\top$ and $A$ switched.

In either case, a point equaling its projection onto the first $k$ principal components means that the points in your data set are really $k$-dimensional – they are all linear combinations of just the $k$ principal components. So that’s often something useful to know about a data set.

Brief review of SVD and PCA:

PCA computes the eigen-decomposition of the covariance matrix $A^TA$ (assuming the data are mean-centered, and ignoring the constant scaling factor), which is symmetric and therefore (by the spectral theorem) can be written as the product $QΣQ^T$, where $Q$ is an orthogonal matrix (a square matrix with orthonormal rows and columns) and $Σ$ is diagonal. The entries of $Σ$ are the eigenvalues, and the columns of $Q$ (equivalently, the rows of $Q^T$) are the eigenvectors of the covariance matrix. The covariance matrix is square ($n \times n$), as are $Q$ and $Σ$.

SVD, on the other hand, decomposes any matrix (it does not have to be square, symmetric, etc.) into $USV^T$, where $U$ and $V^T$ are square, orthogonal matrices; however, in contrast to the eigen-decomposition above, if working with an $m \times n$ matrix $A$, $U$ will be $m \times m$, $V^T$ will be $n \times n$, and $S$ will be $m \times n$.

To start the problem, assume $A$ is $m \times n$, where $m$ is the number of data samples and $n$ is the feature dimension. That is, each row is a sample represented by $n$ values.

Compute the SVD $A=USV^T$ and then perform PCA over $A$ by computing the eigen-decomposition of the covariance matrix:

$A^TA=(USV^T)^T(USV^T)=(VS^TU^T)(USV^T)=VS^TSV^T$, where $S^TS$ is the $n \times n$ diagonal matrix of squared singular values. This implies the columns of $V$ (the rows of $V^T$) are the eigenvectors of $A^TA$ as well as the right singular vectors of $A$.

Now consider the case when $A$ is structured as rows of features and columns of samples. That is, $A$ is $n \times m$. Then the covariance matrix is $AA^T$, and substituting the SVD of $A$ gives $AA^T=(USV^T)(VS^TU^T)=USS^TU^T$. Therefore, in this case, the eigenvectors of the covariance matrix are the left singular vectors of the data matrix.

In conclusion, the right singular vectors of a data matrix $A$ are the principal directions when $A$ is structured as rows of samples. On the other hand, when $A$ is structured as columns of samples, the left singular vectors of $A$ are the principal directions.
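The rows-as-samples case can be checked numerically. This is a small sketch using NumPy on made-up random data (the matrix size and seed are arbitrary assumptions); singular vectors and eigenvectors are only defined up to sign, so the comparison uses absolute values of dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))      # 50 samples (rows), 4 features (columns)
A = A - A.mean(axis=0)            # center each feature, as PCA assumes

# Right singular vectors are the rows of Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigen-decomposition of the (unscaled) covariance matrix A^T A.
eigvals, Q = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]           # eigh returns ascending order
eigvals, Q = eigvals[order], Q[:, order]

# |v_i . q_i| should be 1 when the i-th vectors agree up to sign.
alignment = np.abs(np.sum(Vt.T * Q, axis=0))

# For the transposed layout (samples as columns), the left singular
# vectors of A.T play the same role.
U_t, _, _ = np.linalg.svd(A.T, full_matrices=False)
alignment_t = np.abs(np.sum(U_t * Q, axis=0))
```

Here the eigenvalues of $A^TA$ should also match the squared singular values `s**2`.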

A couple of other useful notes previous students discovered about PCA and SVD:

The eigen-decomposition of a symmetric matrix expresses how the matrix scales (but does not rotate) a specific set of vectors. First you apply an invertible linear transformation by multiplying by $Q^T$, then scale via $Σ$, and finally invert the original transformation.

SVD is a rotation, followed by a scaling, followed by another rotation.

## Question 3

### **Question**

Let $S$ be a set of documents and let $T$ be a set of terms. Suppose that $C$ is a binary term-document incidence matrix (so entry ($i$,$j$) is a 1 if term $i$ appears in document $j$ and 0 otherwise). What do the entries of $C^TC$ represent?

### **Answer**

If $C$ is a binary term-document incidence matrix with one row per term in $T$ and one column per document in $S$, then $C^TC$ captures the similarity between pairs of documents (while $CC^T$ would do the same for pairs of terms).

In this case, the dimensions of $C^TC$ are (number of documents) by (number of documents), and the element at $(i,j)$ is the number of terms shared by document $i$ and document $j$. A diagonal element $(i,i)$ is the total number of distinct terms in document $i$.
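A tiny worked example may help make this concrete. The matrix below is made up for illustration (4 terms, 3 documents); it is not from the course material.

```python
import numpy as np

# Rows are terms, columns are documents (hypothetical toy data).
C = np.array([
    [1, 1, 0],   # term 0 appears in documents 0 and 1
    [1, 0, 0],   # term 1 appears in document 0
    [0, 1, 1],   # term 2 appears in documents 1 and 2
    [1, 1, 0],   # term 3 appears in documents 0 and 1
])

G = C.T @ C      # document-by-document matrix

terms_per_doc = np.diag(G)   # diagonal: number of terms in each document
shared_0_1 = G[0, 1]         # off-diagonal: terms shared by documents 0 and 1
```

In this example documents 0 and 1 share terms 0 and 3, so `G[0, 1]` is 2, while documents 0 and 2 share nothing, so `G[0, 2]` is 0.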

## Question 4

### **Question**

How did we derive the equations for simple linear regression from class?

### **Answer**

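**The goal** is to find a vector $w \in \mathbf{R}^n$ that minimizes $||X\cdot w-y||^2_2$, the sum of squared errors over all samples.

For example, given a single sample with features and label:

$$ x^1 = (x^1_1, \dots , x^1_n) \;\;\;\; y^1 $$

its squared error is:

$$ (y^1-(x^1_1 w_1 + \dots +x^1_n w_n))^2 $$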

#### **Formal Problem Statement**

$$ \underset{w}{\min} ||Xw-y||^2_2 $$

#### **How to Find w**

Let's find $w$ by using geometric concepts. $Xw$ is a vector in the span of the columns of $X$. The point $y \in \mathbf{R}^m$ (one entry per sample, where $m$ is the number of rows of $X$) is not necessarily in the span of the columns of $X$.

What point should we pick in the span of $X$ to best approximate $y$, geometrically speaking? We should take the orthogonal projection of $y$ down to the span of $X$, and that is the optimal point. We will call this point $Xw$, and that is the point we should choose.

The vector $y-Xw$ is the line from the point $Xw$ to the point $y$, and it is orthogonal to every column of $X$.

Now we can take the vector $y-Xw$ and, since it is orthogonal to $X$, do the following:

$$ X^T\cdot (y-Xw) = 0 $$
$$ X^Ty-X^TXw = 0 $$
$$ X^Ty = X^TXw $$
$$ (X^TX)^{-1}X^Ty = w $$

Thus, we have solved for $w$, assuming $X^TX$ is invertible (that is, the columns of $X$ are linearly independent).
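The derivation above can be sanity-checked numerically. This is a minimal sketch on random made-up data (sizes and seed are arbitrary assumptions); it solves the normal equations with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 samples, 3 features
y = rng.normal(size=20)

# Normal equations: X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

# The residual y - Xw should be orthogonal to every column of X.
residual = y - X @ w
orthogonality = X.T @ residual

# Cross-check against NumPy's least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The entries of `orthogonality` should be zero up to floating-point error, and `w` should agree with `w_lstsq`.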

## Question 5

### **Question**

Why can you assume without loss of generality that a mistake-bounded learner only updates its state when it makes a mistake?

### **Answer**

Given any mistake-bounded learner $L'$, construct a learner $L$ that behaves like $L'$ but only updates its state on rounds where it makes a mistake. It can be shown that $L$ is also a mistake-bounded learner.

The above reasoning shows that for every $L'$, we can always construct an $L$ that only updates its state when making a mistake. Therefore, without loss of generality, we can always assume a mistake-bounded learner only updates its state when making a mistake.
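One way to picture the construction of $L$ from $L'$ is as a wrapper that forwards an example to the underlying learner only when the prediction was wrong. The sketch below is illustrative only: the `predict`/`update` interface, the class names, and the toy base learner are all assumptions, not something defined in class.

```python
class ConservativeLearner:
    """Wraps a base online learner so its state changes only on mistakes."""

    def __init__(self, base):
        self.base = base       # assumed to expose predict(x) and update(x, y)
        self.mistakes = 0

    def observe(self, x, y):
        """Predict on x, then see the true label y; update only on a mistake."""
        prediction = self.base.predict(x)
        if prediction != y:
            self.mistakes += 1
            self.base.update(x, y)
        return prediction


class LastLabelLearner:
    """Hypothetical toy base learner: predicts the label it was last updated with."""

    def __init__(self):
        self.label = 0
        self.updates = 0

    def predict(self, x):
        return self.label

    def update(self, x, y):
        self.label = y
        self.updates += 1


learner = ConservativeLearner(LastLabelLearner())
stream = [(None, 1), (None, 1), (None, 0), (None, 0), (None, 0)]
for x, y in stream:
    learner.observe(x, y)
```

On this stream the wrapped learner errs on the first and third examples, so it performs exactly two updates and ignores the three correctly predicted examples.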

[Original explanation]

You can always change the ordering of the examples so that the learner makes its $T$ mistakes (where $T$ is the mistake bound) on the very first $T$ examples and then stops making mistakes. In this way, a learner that may also change its state on correctly labeled examples does not improve on the mistake bound $T$.

Detail: Suppose the mistake bound is $T$ for a learner $L$ that only changes its state on a mistake. We can safely reorder the list of training examples so that the $T$ examples on which $L$ errs come first. In this situation, the learner $L$ will behave the same way and make mistakes on the first $T$ training examples (since, before reordering, $L$ does not update on correctly predicted examples). Now consider any other learner $L'$ that may change its state on correctly predicted examples. On the reordered sequence it will behave exactly like $L$, so its mistake bound is still at least $T$, which is no improvement.

## Question 6

### **Question**

You are given a data set with $m$ points and an algorithm that satisfies the weak-learning condition (it always outputs a classifier with accuracy 60%). Each classifier output by the weak-learning algorithm can be encoded using two bits. How can you construct a classifier that can be described by less than $m$ bits and is correct on every data point in the data set (you may assume $m$ is very large)? What is the size of your final classifier?

### **Answer**

The idea is to use boosting as follows. In this case we want a classifier with training error 0. But because the only possible values for training error are $1,(m-1)/m,\dots,1/m,0$ (since it is 0-1 loss and there are only $m$ points), this is equivalent to achieving training error $<1/m$. Now recall that the training error of the AdaBoost hypothesis after $T$ rounds is at most $\exp(-2\gamma^2T)$. Here $\gamma=0.1$ since we have a 0.6-accurate weak learner. Solving $\exp(-2\gamma^2T) < 1/m$ gives $T > \ln(m)/(2\gamma^2) = 50\ln m$, so $T$ only needs to be $\Theta(\log m)$ for the training error to be less than $1/m$, and hence to be 0. After this many rounds, the final classifier classifies the entire training set correctly. And recall again that this final classifier is just the majority of $T$ different classifiers, which each take two bits to describe. Thus we just need $2T = \Theta(\log m)$ bits to describe this classifier.
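To get a feel for the sizes involved, the bound can be evaluated for a concrete $m$. This is a small sketch that only plugs numbers into the bound $\exp(-2\gamma^2 T) < 1/m$ quoted above; the function names are made up for illustration, and it follows the simplification in the answer of counting two bits per weak classifier.

```python
import math


def rounds_needed(m, gamma):
    """Smallest integer T with exp(-2 * gamma^2 * T) < 1/m."""
    return math.floor(math.log(m) / (2 * gamma**2)) + 1


def classifier_bits(m, gamma, bits_per_weak_classifier=2):
    """Bits to describe T weak classifiers at a fixed cost each."""
    return bits_per_weak_classifier * rounds_needed(m, gamma)


m = 10**6
T = rounds_needed(m, gamma=0.1)
bits = classifier_bits(m, gamma=0.1)
```

For $m = 10^6$ and $\gamma = 0.1$ this gives $T = 691$ rounds and $1382$ bits, far fewer than $m$.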

## Question 7

### **Question**

How can you compare Markov’s inequality, Chebyshev’s inequality, and the Chernoff bound?

### **Answer**

Roughly speaking, Markov is the weakest bound here, Chebyshev's a little stronger, and the Chernoff bound is the strongest. This question is a little broad, but it's important to familiarize yourself with exactly when each inequality is applicable. For instance, sometimes Chernoff might not be applicable and Chebyshev might be the best you can do. In particular:

- Markov applies to nonnegative random variables
- Chebyshev applies to random variables for which you have a variance bound
- Chernoff applies to sums of i.i.d. random variables (in fact generally indicator variables)

At a high level, all three bounds describe the probability that a random variable falls far from its expectation. They differ in the assumptions made about the underlying distribution.

Markov

A non-negative random variable cannot be much larger than its mean very often. This bound depends only on the expected value of the random variable and on $X$ being non-negative. No other assumptions are made. For any $\lambda > 0$:

$$ \forall X \ge 0,\quad P(X \ge \lambda) \le \frac{E(X)}{\lambda} $$

Proof: if you define the random variable $Z$ to be $1$ if $X \ge \lambda$ and $0$ otherwise, then it's easy to see:

$$ \lambda Z \le X \implies E(\lambda Z) \le E(X) $$

$$ E(\lambda Z) = \lambda E(Z) = \lambda\,(1 \cdot P(X \ge \lambda) + 0 \cdot P(X < \lambda)) = \lambda P(X \ge \lambda) \le E(X) \implies P(X \ge \lambda) \le \frac{E(X)}{\lambda} $$

Chebyshev

The probability that a random variable is more than $k$ standard deviations from the mean is no more than $\frac{1}{k^2}$. This bound assumes $X$ has non-zero, finite variance $\sigma^2$.

$$ P(|X - E(X)| \ge k\sigma) \le \frac{1}{k^2} $$

Proof: Let $X$ be a random variable over the real numbers and $Y = (X - E(X))^2$. Note that $Y$ is a non-negative random variable. By Markov's inequality we know:

$$ P(Y \ge \lambda) \le \frac{E(Y)}{\lambda} \implies P((X - E(X))^2 \ge \lambda) \le \frac{E((X - E(X))^2)}{\lambda} = \frac{\sigma^2}{\lambda} $$

Set $\lambda = \sigma^2 k^2$ and you have:

$$ P((X - E(X))^2 \ge \sigma^2 k^2) = P(|X - E(X)| \ge k\sigma) \le \frac{\sigma^2}{\sigma^2 k^2} = \frac{1}{k^2} $$

Chernoff Bounds

I found it hard to state this succinctly, but, similar to the above bounds, it bounds the probability of a random variable being far from its mean.

It follows from applying the Markov bound to the random variable $Y = e^{tX}$ for some $t > 0$. Specifically, it says:

$$ P(X \ge \lambda) = P(Y \ge e^{t\lambda}) = P(e^{tX} \ge e^{t\lambda}) \le \frac{E(Y)}{e^{t\lambda}} = \frac{E(e^{tX})}{e^{t\lambda}} $$

This works because $Y \ge 0$ and, for $t > 0$, the map $x \mapsto e^{tx}$ is increasing.

There is a special case I've seen mentioned, which is when $X$ is the sum of $m$ independent Bernoulli RVs, $X_1, \dots, X_m$, where $\forall i$, $P(X_i = 1) = p_i$. Defining $p = \sum_{i=1}^m p_i = E(X)$ (which is $m p_0$ when every $p_i$ equals a common value $p_0$) and $h(\delta) = (1+\delta)\log(1+\delta) - \delta$, then

$$ P(X > (1+\delta)p) \le e^{-h(\delta)p} $$

More on this in textbook appendix B.3, including the case for $X < (1-\delta)p$.
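To compare the three bounds on one concrete case, the sketch below evaluates each of them for a Binomial variable and checks them against the exact tail probability. The specific numbers ($m = 100$ fair coin flips, threshold 75) and the function names are assumptions chosen for illustration; the Chernoff line uses the $h(\delta)$ form quoted above with $p = E(X)$.

```python
import math


def markov_bound(mean, threshold):
    """P(X >= threshold) <= E(X) / threshold for non-negative X."""
    return mean / threshold


def chebyshev_bound(variance, distance):
    """P(|X - E(X)| >= distance) <= variance / distance^2.

    This two-sided bound also upper-bounds the one-sided upper tail.
    """
    return variance / distance**2


def chernoff_bound(mean, delta):
    """P(X > (1 + delta) * mean) <= exp(-h(delta) * mean)."""
    h = (1 + delta) * math.log(1 + delta) - delta
    return math.exp(-h * mean)


def binomial_upper_tail(m, q, threshold):
    """Exact P(X >= threshold) for X ~ Binomial(m, q)."""
    return sum(
        math.comb(m, k) * q**k * (1 - q) ** (m - k)
        for k in range(threshold, m + 1)
    )


m, q, threshold = 100, 0.5, 75
mean = m * q                   # E(X) = 50
variance = m * q * (1 - q)     # Var(X) = 25
delta = threshold / mean - 1   # threshold = (1 + delta) * mean

markov = markov_bound(mean, threshold)
chebyshev = chebyshev_bound(variance, threshold - mean)
chernoff = chernoff_bound(mean, delta)
exact = binomial_upper_tail(m, q, threshold)
```

In this example the bounds come out in the order described at the start of the answer: Markov is the loosest, Chebyshev is tighter, Chernoff is tighter still, and all three sit above the exact tail probability.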