# Solution 

## Understanding Word2Vec

### (b): Computing the Partial Derivative

We have:
- $J_{\text{naive-softmax}}(\mathbf{v}_c, o, \mathbf{U}) = -\log P(O = o | C = c)$
- $P(O = o | C = c) = \frac{\exp(\mathbf{u}_o^T \mathbf{v}_c)}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)} = \hat{y}_o$

#### Step 1: Express the loss in terms of softmax
$$J = -\log \hat{y}_o = -\log \frac{\exp(\mathbf{u}_o^T \mathbf{v}_c)}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)}$$

$$= -\mathbf{u}_o^T \mathbf{v}_c + \log \sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)$$

#### Step 2: Compute the gradient
$$\frac{\partial J}{\partial \mathbf{v}_c} = \frac{\partial}{\partial \mathbf{v}_c}\left[-\mathbf{u}_o^T \mathbf{v}_c + \log \sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)\right]$$

For the first term:
$$\frac{\partial}{\partial \mathbf{v}_c}(-\mathbf{u}_o^T \mathbf{v}_c) = -\mathbf{u}_o$$

For the second term, using chain rule:
$$\frac{\partial}{\partial \mathbf{v}_c}\log \sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c) = \frac{1}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)} \cdot \sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c) \mathbf{u}_w$$

$$= \frac{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c) \mathbf{u}_w}{\sum_{w=1}^{|V|} \exp(\mathbf{u}_w^T \mathbf{v}_c)} = \sum_{w=1}^{|V|} \frac{\exp(\mathbf{u}_w^T \mathbf{v}_c)}{\sum_{k=1}^{|V|} \exp(\mathbf{u}_k^T \mathbf{v}_c)} \mathbf{u}_w$$

$$= \sum_{w=1}^{|V|} \hat{y}_w \mathbf{u}_w$$

#### Step 3: Combine terms
$$\frac{\partial J}{\partial \mathbf{v}_c} = -\mathbf{u}_o + \sum_{w=1}^{|V|} \hat{y}_w \mathbf{u}_w$$
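
The combined expression from Step 3 can be sanity-checked numerically. The sketch below is a minimal check in plain Python, not part of the assignment code: the toy vectors, the index `o`, and the helper names `softmax`, `loss`, and `grad_v_c` are all assumptions chosen for illustration. It compares the analytic gradient against central finite differences.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(v_c, o, outside):
    """Naive-softmax loss -log y_hat[o]; outside[w] plays the role of u_w."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in outside]
    return -math.log(softmax(scores)[o])

def grad_v_c(v_c, o, outside):
    """Analytic gradient from Step 3: -u_o + sum_w y_hat[w] * u_w."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in outside]
    y_hat = softmax(scores)
    return [
        -outside[o][i] + sum(y_hat[w] * outside[w][i] for w in range(len(outside)))
        for i in range(len(v_c))
    ]

# Toy data (hypothetical values): 4 outside vectors of dimension 3.
outside = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.1], [0.0, -0.6, 0.7]]
v_c = [0.1, -0.4, 0.3]
o = 2

analytic = grad_v_c(v_c, o, outside)

# Central finite differences as an independent check of each component.
eps = 1e-6
numeric = []
for i in range(len(v_c)):
    plus = v_c[:]
    plus[i] += eps
    minus = v_c[:]
    minus[i] -= eps
    numeric.append((loss(plus, o, outside) - loss(minus, o, outside)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest analytic/numeric gap:", max_err)
```

If the derivation is right, the reported gap should be tiny compared with the gradient entries themselves.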

#### Step 4: Express in vectorized form using 𝐲, 𝐲̂, and 𝐔

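Note that, with the outside vectors $\mathbf{u}_w$ as the columns of $\mathbf{U}$:
- $\mathbf{u}_o = \mathbf{U} \mathbf{y}$ (since $\mathbf{y}$ is one-hot with $y_o = 1$)
- $\sum_{w=1}^{|V|} \hat{y}_w \mathbf{u}_w = \mathbf{U} \hat{\mathbf{y}}$

Therefore:
$$\boxed{\frac{\partial J}{\partial \mathbf{v}_c} = \mathbf{U}\hat{\mathbf{y}} - \mathbf{U}\mathbf{y} = \mathbf{U}(\hat{\mathbf{y}} - \mathbf{y})}$$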


### ii When is the gradient equal to zero?

The gradient equals zero when:
$$\mathbf{U}(\hat{\mathbf{y}} - \mathbf{y}) = \mathbf{0}$$

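This holds in particular when $\hat{\mathbf{y}} - \mathbf{y} = \mathbf{0}$, i.e., when $\hat{\mathbf{y}} = \mathbf{y}$. (More generally, the gradient also vanishes whenever $\hat{\mathbf{y}} - \mathbf{y}$ lies in the null space of $\mathbf{U}$.)

**Answer**: The gradient is zero when the predicted probability distribution perfectly matches the true distribution (when $\hat{y}_o = 1$ and $\hat{y}_w = 0$ for all $w \neq o$). Since softmax assigns strictly positive probability to every word for finite scores, this is only approached in the limit.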

#### Interpretation of the gradient terms

The gradient can be written as:
$$\frac{\partial J}{\partial \mathbf{v}_c} = \mathbf{U}\hat{\mathbf{y}} - \mathbf{U}\mathbf{y}$$

#### Term 1: $\mathbf{U}\hat{\mathbf{y}}$ (Predicted Context)
- This is a weighted average of all outside word vectors, weighted by their predicted probabilities
- It represents the "expected" outside vector under the current model's predictions
- Because the update subtracts the gradient from $\mathbf{v}_c$, this term moves the center vector away from the outside vectors the model currently assigns high probability

#### Term 2: $-\mathbf{U}\mathbf{y}$ (True Context)
- This is simply $-\mathbf{u}_o$, the negative of the true output word vector
- When subtracted from $\mathbf{v}_c$, the double negative means we add $\mathbf{u}_o$
- This pushes the center vector toward the true output word vector

#### Combined Effect
When we update $\mathbf{v}_c := \mathbf{v}_c - \alpha \frac{\partial J}{\partial \mathbf{v}_c}$:

1. **Attraction**: The center vector is pulled toward the true output word vector ($\mathbf{u}_o$)
2. **Repulsion**: The center vector is pushed away from the probability-weighted average of all outside word vectors, most strongly from words that receive high predicted probability but are not the true outside word

This creates the desired effect: making the center vector more similar to the true context word and less similar to incorrect words that the model incorrectly predicts with high probability.
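
The attraction and repulsion described above can be seen in a single update step. The following is a small sketch with made-up two-dimensional vectors and an arbitrarily chosen step size `alpha`; the helper name `probs` is hypothetical. It applies one step of $\mathbf{v}_c := \mathbf{v}_c - \alpha \frac{\partial J}{\partial \mathbf{v}_c}$ and compares the probability of the true outside word before and after.

```python
import math

def probs(v_c, outside):
    """Softmax probabilities y_hat over the outside vectors for a center vector."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in outside]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy setup (hypothetical values): three outside vectors, true outside word o = 1.
outside = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
v_c = [0.2, 0.1]
o = 1
alpha = 0.5  # step size picked for illustration only

y_hat = probs(v_c, outside)

# Repulsion term: probability-weighted average of all outside vectors (U y_hat).
expected = [sum(y_hat[w] * outside[w][i] for w in range(len(outside))) for i in range(2)]
# Gradient = U y_hat - u_o; the attraction term is the subtracted u_o.
grad = [expected[i] - outside[o][i] for i in range(2)]

# One gradient-descent step on the center vector.
v_new = [v_c[i] - alpha * grad[i] for i in range(2)]

p_before = y_hat[o]
p_after = probs(v_new, outside)[o]
print("P(o | c) before:", p_before, "after:", p_after)
```

For this toy configuration the probability of the true outside word goes up after the step; with a much larger `alpha` that is no longer guaranteed.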


## (c): When L2 Normalization Takes Away Useful Information

L2 normalization removes useful information when the **magnitude** of word embeddings carries semantic meaning that's relevant to the classification task.

Consider the hint: if $\mathbf{u}_x = \alpha\mathbf{u}_y$ where $\alpha > 0$, then after L2 normalization:
- $\frac{\mathbf{u}_x}{||\mathbf{u}_x||_2} = \frac{\alpha\mathbf{u}_y}{||\alpha\mathbf{u}_y||_2} = \frac{\alpha\mathbf{u}_y}{|\alpha|||\mathbf{u}_y||_2} = \frac{\mathbf{u}_y}{||\mathbf{u}_y||_2}$

So normalized $\mathbf{u}_x$ and normalized $\mathbf{u}_y$ become **identical**.

This is problematic when words have the same semantic direction but different **intensity**. For example:
- "good" vs "excellent": If $\mathbf{u}_{\text{excellent}} = 3\mathbf{u}_{\text{good}}$, the magnitude difference captures that "excellent" is more positive than "good"
- "bad" vs "terrible": Similarly, if $\mathbf{u}_{\text{terrible}} = 2\mathbf{u}_{\text{bad}}$, the magnitude indicates stronger negativity

In the classification task, the phrase "This movie is excellent" should contribute more positive signal than "This movie is good". But after normalization, both words contribute equally, losing the intensity information encoded in the original magnitudes.
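
A tiny numeric illustration of this loss of intensity, under stated assumptions: the two embeddings and the weight vector `w_positive` below are invented for the example (they are not taken from any trained model), and the "classifier" is just a dot product with that weight vector.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical embeddings with the same direction but different magnitude.
u_good = [0.6, 0.8]
u_excellent = [3 * x for x in u_good]

# Hypothetical linear "positivity" direction used to score a word.
w_positive = [1.0, 1.0]

# Before normalization the stronger word yields the larger score.
raw_good = dot(w_positive, u_good)
raw_excellent = dot(w_positive, u_excellent)

# After normalization both words map to the same unit vector.
norm_good = dot(w_positive, l2_normalize(u_good))
norm_excellent = dot(w_positive, l2_normalize(u_excellent))

print("raw scores:", raw_good, raw_excellent)
print("normalized scores:", norm_good, norm_excellent)
```

The raw scores differ by the factor $\alpha = 3$, while the normalized scores coincide up to floating-point rounding.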

### When L2 Normalization Doesn't Take Away Useful Information

L2 normalization preserves useful information when only the **direction** (semantic meaning) matters, not the magnitude.

This occurs when:

1. **Magnitude differences are noise**: If embedding magnitudes reflect training artifacts, word frequency, or other non-semantic factors rather than semantic intensity, normalization removes this noise.

2. **Words are semantically equivalent**: When $\mathbf{u}_x = \alpha\mathbf{u}_y$ represents true semantic equivalence (like synonyms "happy" and "joyful"), the magnitude difference might be spurious, and normalization correctly treats them equally.

3. **Downstream task is direction-sensitive only**: If the classification boundary depends only on the overall semantic direction of the phrase embedding (sum of individual embeddings), not its magnitude, then normalization can actually improve performance by removing irrelevant scale variations.

### Summary

- **Normalization hurts** when magnitude encodes semantic intensity (degree of positivity/negativity)
- **Normalization helps** when magnitude represents noise or when only semantic direction matters for the task

The key insight is that L2 normalization fundamentally changes the contribution weighting scheme from magnitude-dependent to purely directional, which may or may not align with the semantic structure relevant to your downstream task.

## (d): Computing the partial derivatives

The loss can be written as:
$$J = -\log \hat{y}_o = -\log P(O=o|C=c)$$

Taking the partial derivative with respect to $\mathbf{u}_w$:

$$\frac{\partial J}{\partial \mathbf{u}_w} = -\frac{1}{\hat{y}_o} \frac{\partial \hat{y}_o}{\partial \mathbf{u}_w}$$

Now I need to compute $\frac{\partial \hat{y}_o}{\partial \mathbf{u}_w}$.

Since $\hat{y}_o = \frac{\exp(\mathbf{u}_o^T \mathbf{v}_c)}{\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)}$, using the quotient rule:

$$\frac{\partial \hat{y}_o}{\partial \mathbf{u}_w} = \frac{\frac{\partial}{\partial \mathbf{u}_w}[\exp(\mathbf{u}_o^T \mathbf{v}_c)] \cdot \sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c) - \exp(\mathbf{u}_o^T \mathbf{v}_c) \cdot \frac{\partial}{\partial \mathbf{u}_w}[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)]}{[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)]^2}$$

### Case 1: When $w = o$

$$\frac{\partial}{\partial \mathbf{u}_o}[\exp(\mathbf{u}_o^T \mathbf{v}_c)] = \exp(\mathbf{u}_o^T \mathbf{v}_c) \mathbf{v}_c$$

$$\frac{\partial}{\partial \mathbf{u}_o}[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)] = \exp(\mathbf{u}_o^T \mathbf{v}_c) \mathbf{v}_c$$

Therefore:
$$\frac{\partial \hat{y}_o}{\partial \mathbf{u}_o} = \frac{\exp(\mathbf{u}_o^T \mathbf{v}_c) \mathbf{v}_c \cdot \sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c) - \exp(\mathbf{u}_o^T \mathbf{v}_c) \cdot \exp(\mathbf{u}_o^T \mathbf{v}_c) \mathbf{v}_c}{[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)]^2}$$

$$= \frac{\exp(\mathbf{u}_o^T \mathbf{v}_c) \mathbf{v}_c [\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c) - \exp(\mathbf{u}_o^T \mathbf{v}_c)]}{[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)]^2}$$

$$= \hat{y}_o \mathbf{v}_c (1 - \hat{y}_o)$$

So: $$\frac{\partial J}{\partial \mathbf{u}_o} = -\frac{1}{\hat{y}_o} \cdot \hat{y}_o \mathbf{v}_c (1 - \hat{y}_o) = -\mathbf{v}_c (1 - \hat{y}_o) = \mathbf{v}_c(\hat{y}_o - 1)$$

Since $y_o = 1$:
$$\boxed{\frac{\partial J}{\partial \mathbf{u}_o} = \mathbf{v}_c(\hat{y}_o - y_o)}$$

### Case 2: When $w \neq o$

$$\frac{\partial}{\partial \mathbf{u}_w}[\exp(\mathbf{u}_o^T \mathbf{v}_c)] = 0$$

$$\frac{\partial}{\partial \mathbf{u}_w}[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)] = \exp(\mathbf{u}_w^T \mathbf{v}_c) \mathbf{v}_c$$

Therefore:
$$\frac{\partial \hat{y}_o}{\partial \mathbf{u}_w} = \frac{0 - \exp(\mathbf{u}_o^T \mathbf{v}_c) \cdot \exp(\mathbf{u}_w^T \mathbf{v}_c) \mathbf{v}_c}{[\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)]^2}$$

$$= -\frac{\exp(\mathbf{u}_o^T \mathbf{v}_c)}{\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)} \cdot \frac{\exp(\mathbf{u}_w^T \mathbf{v}_c)}{\sum_{k} \exp(\mathbf{u}_k^T \mathbf{v}_c)} \mathbf{v}_c$$

$$= -\hat{y}_o \hat{y}_w \mathbf{v}_c$$

So: $$\frac{\partial J}{\partial \mathbf{u}_w} = -\frac{1}{\hat{y}_o} \cdot (-\hat{y}_o \hat{y}_w \mathbf{v}_c) = \hat{y}_w \mathbf{v}_c$$

Since $y_w = 0$ for $w \neq o$:
$$\boxed{\frac{\partial J}{\partial \mathbf{u}_w} = \mathbf{v}_c(\hat{y}_w - y_w) \text{ for } w \neq o}$$

## Final Answer

For both cases, we can write the unified expression:

$$\boxed{\frac{\partial J}{\partial \mathbf{u}_w} = \mathbf{v}_c(\hat{y}_w - y_w) \text{ for all } w}$$

where:
- When $w = o$: $y_w = y_o = 1$
- When $w \neq o$: $y_w = 0$

The gradient with respect to the full matrix **U** follows from the per-vector derivatives above: it is obtained by arranging those column-vector derivatives side by side.

Since **U** is a matrix where each column is an outside word vector:
$$\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{|\text{Vocab}|}]$$

The partial derivative with respect to **U** will have the same structure, where each column is the partial derivative with respect to the corresponding column vector of **U**.

From the previous part, I found that:
$$\frac{\partial J}{\partial \mathbf{u}_w} = \mathbf{v}_c(\hat{y}_w - y_w) \text{ for all } w$$

Therefore:

$$\boxed{\frac{\partial J(\mathbf{v}_c, o, \mathbf{U})}{\partial \mathbf{U}} = \left[\frac{\partial J(\mathbf{v}_c, o, \mathbf{U})}{\partial \mathbf{u}_1}, \frac{\partial J(\mathbf{v}_c, o, \mathbf{U})}{\partial \mathbf{u}_2}, \ldots, \frac{\partial J(\mathbf{v}_c, o, \mathbf{U})}{\partial \mathbf{u}_{|\text{Vocab}|}}\right]}$$

This can be written more compactly as:

$$\boxed{\frac{\partial J(\mathbf{v}_c, o, \mathbf{U})}{\partial \mathbf{U}} = \mathbf{v}_c(\hat{\mathbf{y}} - \mathbf{y})^T}$$

**Explanation:** The gradient matrix has the same dimensions as **U**. Each column corresponds to the gradient with respect to the corresponding outside word vector. The compact form shows that we can compute this as the outer product of the center vector $\mathbf{v}_c$ with the difference between predicted and true probability distributions $(\hat{\mathbf{y}} - \mathbf{y})$.
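
The outer-product form can be checked entry by entry. This is a minimal sketch in plain Python, assuming a small made-up matrix `U` stored as a $d \times |V|$ nested list (columns are outside vectors) and hypothetical helper names `softmax`, `loss`, and `grad_U`; it compares $\mathbf{v}_c(\hat{\mathbf{y}} - \mathbf{y})^T$ with central finite differences.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scores_for(U, v_c):
    """Scores u_w^T v_c for every column w of U (U is d x |V|)."""
    return [sum(U[i][w] * v_c[i] for i in range(len(v_c))) for w in range(len(U[0]))]

def loss(U, v_c, o):
    """Naive-softmax loss -log y_hat[o]."""
    return -math.log(softmax(scores_for(U, v_c))[o])

def grad_U(U, v_c, o):
    """Outer product v_c (y_hat - y)^T, with the same d x |V| shape as U."""
    y_hat = softmax(scores_for(U, v_c))
    y = [1.0 if w == o else 0.0 for w in range(len(y_hat))]
    return [[v_c[i] * (y_hat[w] - y[w]) for w in range(len(y_hat))] for i in range(len(v_c))]

# Toy data (hypothetical values): d = 3, |V| = 4, true outside word o = 2.
U = [[0.2, 0.5, -0.3, 0.0],
     [-0.1, 0.3, 0.8, -0.6],
     [0.4, -0.2, 0.1, 0.7]]
v_c = [0.1, -0.4, 0.3]
o = 2

analytic = grad_U(U, v_c, o)

# Perturb one entry of U at a time and compare with the analytic entry.
eps = 1e-6
max_err = 0.0
for i in range(len(U)):
    for w in range(len(U[0])):
        original = U[i][w]
        U[i][w] = original + eps
        up = loss(U, v_c, o)
        U[i][w] = original - eps
        down = loss(U, v_c, o)
        U[i][w] = original  # restore the entry before moving on
        max_err = max(max_err, abs((up - down) / (2 * eps) - analytic[i][w]))

print("largest analytic/numeric gap:", max_err)
```

Column $w$ of `analytic` is $\mathbf{v}_c(\hat{y}_w - y_w)$, matching the per-vector result.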

## Machine Learning 

### a-i 

Looking at the momentum mechanism in Adam optimization, the key insight is that $\mathbf{m}_{t+1}$ is an exponentially weighted moving average of past gradients, so the update does not depend on the current minibatch gradient alone.

This rolling average has the effect of smoothing out the gradient updates by dampening oscillations and noise that occur from minibatch to minibatch. When gradients consistently point in the same direction across multiple steps, the average preserves that direction, while when gradients fluctuate randomly (due to noisy minibatches or local curvature), the fluctuations largely cancel in the average.

This lower variance in updates is helpful for learning because it allows the optimizer to make more consistent progress toward the optimum without getting sidetracked by noisy or conflicting gradient signals. The momentum helps the optimizer "remember" the general direction it should be moving and prevents it from making erratic updates that could slow convergence or cause the training to become unstable.
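
The smoothing effect can be illustrated with a toy gradient sequence. The sketch below assumes the usual moving-average rule $m \leftarrow \beta_1 m + (1 - \beta_1) g$ with $\beta_1 = 0.9$ (the value and the rule are assumptions here, since the update equations are not restated in this answer) and feeds it a constant "true" gradient of 1.0 plus alternating noise of $\pm 2.0$.

```python
beta1 = 0.9  # assumed momentum coefficient

# Synthetic gradients: true signal 1.0 plus noise that flips sign every step.
grads = [1.0 + (2.0 if t % 2 == 0 else -2.0) for t in range(200)]

m = 0.0
averaged = []
for g in grads:
    # Exponentially weighted moving average of the gradients seen so far.
    m = beta1 * m + (1 - beta1) * g
    averaged.append(m)

# Ignore the first 100 steps so the start-up transient from m = 0 has decayed.
tail_raw = grads[100:]
tail_avg = averaged[100:]

spread_raw = max(tail_raw) - min(tail_raw)
spread_avg = max(tail_avg) - min(tail_avg)
mean_avg = sum(tail_avg) / len(tail_avg)

print("spread of raw gradients:", spread_raw)
print("spread of averaged gradients:", spread_avg)
print("mean of averaged gradients:", mean_avg)
```

The averaged values stay in a narrow band around the underlying signal, while the raw gradients swing across the full noise range.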

### a-ii

Looking at the Adam update rule, the parameters that will get **larger updates** are those with **smaller values of $\sqrt{\mathbf{v}_{t+1}}$**.

Since $\mathbf{v}_{t+1}$ tracks a rolling average of the squared gradients (a proxy for gradient magnitude), parameters that have historically had **small gradients** will have small values in $\mathbf{v}_{t+1}$, leading to small values of $\sqrt{\mathbf{v}_{t+1}}$, and therefore **larger effective learning rates** when we divide by $\sqrt{\mathbf{v}_{t+1}}$.

Conversely, parameters that have historically had **large gradients** will have large values in $v_{t+1}$, leading to **smaller effective learning rates**.

This adaptive mechanism helps with learning because it addresses the problem of having a single global learning rate for all parameters. In many neural networks, different parameters naturally have very different gradient scales - some parameters might consistently receive tiny gradients while others receive large gradients. With a fixed learning rate, parameters with small gradients would update too slowly (potentially getting stuck), while parameters with large gradients might update too aggressively (potentially overshooting).

By giving larger effective learning rates to parameters with historically small gradients and smaller effective learning rates to parameters with historically large gradients, Adam automatically balances the update magnitudes across different parameters, allowing for more efficient and stable learning across the entire parameter space.
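
A small sketch of this balancing effect, under explicit assumptions: it uses the update $m \leftarrow \beta_1 m + (1-\beta_1) g$, $v \leftarrow \beta_2 v + (1-\beta_2) g^2$, step $= \alpha\, m / \sqrt{v}$ without bias correction or an epsilon term, with illustrative values for $\alpha$, $\beta_1$, $\beta_2$, and two parameters that always receive constant gradients of very different size. The helper `run` is hypothetical.

```python
import math

alpha, beta1, beta2 = 0.001, 0.9, 0.99  # assumed hyperparameters

def run(grad, steps=2000):
    """Feed a constant gradient and return (effective learning rate, step size)."""
    m = 0.0
    v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * grad          # rolling average of gradients
        v = beta2 * v + (1 - beta2) * grad * grad   # rolling average of squared gradients
    effective_lr = alpha / math.sqrt(v)
    return effective_lr, effective_lr * m

lr_small, step_small = run(0.01)   # parameter with historically tiny gradients
lr_large, step_large = run(10.0)   # parameter with historically large gradients

print("effective learning rates:", lr_small, lr_large)
print("resulting step sizes:", step_small, step_large)
```

The parameter with small gradients ends up with the much larger effective learning rate, and the two resulting step sizes come out nearly equal even though the raw gradients differ by a factor of 1000.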


### b-i

### b-ii


## Neural Transition-Based Dependency Parsing

## a

| Stack | Buffer | New dependency | Transition |
|-------|--------|----------------|------------|
| [ROOT] | [I, presented, my, findings, at, the, NLP, conference] |  | Initial Configuration |
| [ROOT, I] | [presented, my, findings, at, the, NLP, conference] |  | SHIFT |
| [ROOT, I, presented] | [my, findings, at, the, NLP, conference] |  | SHIFT |
| [ROOT, presented] | [my, findings, at, the, NLP, conference] | presented→I | LEFT-ARC |
| [ROOT, presented, my] | [findings, at, the, NLP, conference] |  | SHIFT |
| [ROOT, presented, my, findings] | [at, the, NLP, conference] |  | SHIFT |
| [ROOT, presented, findings] | [at, the, NLP, conference] | findings→my | LEFT-ARC |
| [ROOT, presented, findings, at] | [the, NLP, conference] |  | SHIFT |
| [ROOT, presented, findings, at, the] | [NLP, conference] |  | SHIFT |
| [ROOT, presented, findings, at, the, NLP] | [conference] |  | SHIFT |
| [ROOT, presented, findings, at, the, NLP, conference] | [] |  | SHIFT |
| [ROOT, presented, findings, at, the, conference] | [] | conference→NLP | LEFT-ARC |
| [ROOT, presented, findings, at, conference] | [] | conference→the | LEFT-ARC |
| [ROOT, presented, findings, at] | [] | at→conference | RIGHT-ARC |
| [ROOT, presented, findings] | [] | findings→at | RIGHT-ARC |
| [ROOT, presented] | [] | presented→findings | RIGHT-ARC |
| [ROOT] | [] | ROOT→presented | RIGHT-ARC |
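
The transition column of the table can be replayed mechanically. The sketch below is a minimal stack/buffer simulator written for this check; the function name `parse` and the `(head, dependent)` tuple format are assumptions, not the assignment's parser class. It applies the transitions exactly as listed in the table and records the arcs they produce.

```python
def parse(words, transitions):
    """Replay SHIFT / LEFT-ARC / RIGHT-ARC transitions; return final stack, buffer, arcs."""
    stack = ["ROOT"]
    buffer = list(words)
    arcs = []  # (head, dependent) pairs in the order they are created
    for t in transitions:
        if t == "SHIFT":
            # Move the first buffer word onto the stack.
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":
            # Second-from-top becomes a dependent of the top and is removed.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif t == "RIGHT-ARC":
            # Top becomes a dependent of the item below it and is removed.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {t}")
    return stack, buffer, arcs

words = "I presented my findings at the NLP conference".split()
transitions = [
    "SHIFT", "SHIFT", "LEFT-ARC",
    "SHIFT", "SHIFT", "LEFT-ARC",
    "SHIFT", "SHIFT", "SHIFT", "SHIFT",
    "LEFT-ARC", "LEFT-ARC",
    "RIGHT-ARC", "RIGHT-ARC", "RIGHT-ARC", "RIGHT-ARC",
]

stack, buffer, arcs = parse(words, transitions)
print("final stack:", stack, "final buffer:", buffer)
for head, dependent in arcs:
    print(f"{head} -> {dependent}")
```

The replay ends with only ROOT on the stack and an empty buffer, and the number of transitions is twice the number of words.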


## b

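A sentence with $n$ words is parsed in $2n$ steps.

Explanation: every word is moved from the buffer to the stack exactly once ($n$ SHIFT transitions), and every word is removed from the stack exactly once as the dependent of a LEFT-ARC or RIGHT-ARC ($n$ arc transitions, the last one attaching the head of the sentence to ROOT).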


## e-i

I'll compute the derivative of $\mathbf{h} = \text{ReLU}(\mathbf{x} \mathbf{W} + \mathbf{b}_1)$ with respect to $\mathbf{x}$.

Let me break this down step by step.

First, let's establish the setup:
- $\mathbf{x} \in \mathbb{R}^{dm}$ is the input vector
- $\mathbf{W} \in \mathbb{R}^{dm \times h}$ is the weight matrix (where $h$ is the hidden layer size)  
- $\mathbf{b}_1 \in \mathbb{R}^h$ is the bias vector
- $\mathbf{h} \in \mathbb{R}^h$ is the hidden layer output

The computation proceeds as:
1. $\mathbf{z} = \mathbf{x} \mathbf{W} + \mathbf{b}_1$ (linear transformation)
2. $\mathbf{h} = \text{ReLU}(\mathbf{z})$ (element-wise ReLU activation)

For the $i$-th component of $\mathbf{h}$:
$$h_i = \text{ReLU}(z_i) = \text{ReLU}\left(\sum_{k=1}^{dm} x_k W_{k,i} + b_{1,i}\right)$$

where $z_i = \sum_{k=1}^{dm} x_k W_{k,i} + b_{1,i}$.

Now, to find $\frac{\partial h_i}{\partial x_j}$:

Using the chain rule:
$$\frac{\partial h_i}{\partial x_j} = \frac{\partial h_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial x_j}$$

Computing each part:

1. $\frac{\partial z_i}{\partial x_j} = W_{j,i}$ (since $z_i = \sum_{k=1}^{dm} x_k W_{k,i} + b_{1,i}$)

2. $\frac{\partial h_i}{\partial z_i} = \frac{\partial}{\partial z_i} \text{ReLU}(z_i) = \begin{cases} 
   1 & \text{if } z_i > 0 \\
   0 & \text{if } z_i < 0
   \end{cases}$

Therefore:
$$\frac{\partial h_i}{\partial x_j} = \begin{cases} 
W_{j,i} & \text{if } z_i > 0 \\
0 & \text{if } z_i < 0
\end{cases}$$

This can also be written more compactly as:
$$\frac{\partial h_i}{\partial x_j} = W_{j,i} \cdot \mathbf{1}[z_i > 0]$$

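where $\mathbf{1}[z_i > 0]$ is the indicator function that equals 1 when $z_i > 0$ and 0 otherwise. (ReLU is not differentiable at $z_i = 0$; the indicator form adopts the common convention of using 0 there.)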

The key insight is that the derivative depends on whether the pre-activation $z_i$ is positive or negative. When $z_i > 0$, the ReLU is in its linear region and the gradient flows through with weight $W_{j,i}$. When $z_i < 0$, the ReLU saturates at 0 and blocks the gradient completely.
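
The entry-wise formula can be verified on a small example. This sketch uses made-up values for `x`, `W`, and `b` (chosen so that one hidden unit is active and one is not) and hypothetical helper names `hidden` and `jacobian`; it builds the matrix with entries $\partial h_i / \partial x_j = W_{j,i} \cdot \mathbf{1}[z_i > 0]$ and compares it with central finite differences.

```python
def hidden(x, W, b):
    """Return pre-activations z = xW + b and activations h = ReLU(z)."""
    z = [sum(x[k] * W[k][i] for k in range(len(x))) + b[i] for i in range(len(b))]
    return z, [max(0.0, zi) for zi in z]

def jacobian(x, W, b):
    """Matrix J with J[i][j] = dh_i/dx_j = W[j][i] if z_i > 0 else 0."""
    z, _ = hidden(x, W, b)
    return [[W[j][i] if z[i] > 0 else 0.0 for j in range(len(x))] for i in range(len(b))]

# Toy values (hypothetical): 3 inputs, 2 hidden units.
x = [1.0, -2.0, 0.5]
W = [[0.3, -0.7],
     [0.1, 0.4],
     [-0.5, 0.2]]
b = [0.2, 0.1]

J = jacobian(x, W, b)

# Central finite differences on each input coordinate.
eps = 1e-6
max_err = 0.0
for j in range(len(x)):
    plus = x[:]
    plus[j] += eps
    minus = x[:]
    minus[j] -= eps
    _, h_plus = hidden(plus, W, b)
    _, h_minus = hidden(minus, W, b)
    for i in range(len(b)):
        numeric = (h_plus[i] - h_minus[i]) / (2 * eps)
        max_err = max(max_err, abs(numeric - J[i][j]))

print("Jacobian:", J)
print("largest analytic/numeric gap:", max_err)
```

For these values the first hidden unit has a positive pre-activation, so its row equals the first column of `W`, while the second unit is inactive and its row is all zeros.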

## e-ii

I'll compute the partial derivative of the cross-entropy loss $CE(\mathbf{y}, \hat{\mathbf{y}})$ with respect to $\mathbf{l}_i$.

Given:
- $\mathbf{l} \in \mathbb{R}^3$ (logits)
- $\hat{\mathbf{y}} \in \mathbb{R}^3$ (predictions from softmax)
- $\mathbf{y} \in \mathbb{R}^3$ (true one-hot label)
- True label is $c$ (so $\mathbf{y}_c = 1$ and $\mathbf{y}_j = 0$ for $j \neq c$)

The cross-entropy loss is:
$$CE(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{3} \mathbf{y}_j \log \hat{\mathbf{y}}_j = -\mathbf{y}_c \log \hat{\mathbf{y}}_c = -\log \hat{\mathbf{y}}_c$$

The softmax function gives us:
$$\hat{\mathbf{y}}_j = \frac{\exp(\mathbf{l}_j)}{\sum_{k=1}^{3} \exp(\mathbf{l}_k)}$$

Using the chain rule as suggested in the hint:
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = \sum_{j=1}^{3} \frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{\mathbf{y}}_j} \frac{\partial \hat{\mathbf{y}}_j}{\partial \mathbf{l}_i}$$

First, let's compute $\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{\mathbf{y}}_j}$:

Since $CE(\mathbf{y}, \hat{\mathbf{y}}) = -\log \hat{\mathbf{y}}_c$:
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{\mathbf{y}}_j} = \begin{cases}
-\frac{1}{\hat{\mathbf{y}}_c} & \text{if } j = c \\
0 & \text{if } j \neq c
\end{cases}$$

This confirms the hint that $\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{\mathbf{y}}_j} = 0$ if $j \neq c$.

Therefore, the sum reduces to:
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = \frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \hat{\mathbf{y}}_c} \frac{\partial \hat{\mathbf{y}}_c}{\partial \mathbf{l}_i} = -\frac{1}{\hat{\mathbf{y}}_c} \frac{\partial \hat{\mathbf{y}}_c}{\partial \mathbf{l}_i}$$

Now I need to compute $\frac{\partial \hat{\mathbf{y}}_c}{\partial \mathbf{l}_i}$ using the softmax derivative:

For softmax $\hat{\mathbf{y}}_j = \frac{\exp(\mathbf{l}_j)}{\sum_{k=1}^{3} \exp(\mathbf{l}_k)}$:

$$\frac{\partial \hat{\mathbf{y}}_c}{\partial \mathbf{l}_i} = \begin{cases}
\hat{\mathbf{y}}_c(1 - \hat{\mathbf{y}}_c) & \text{if } i = c \\
-\hat{\mathbf{y}}_c \hat{\mathbf{y}}_i & \text{if } i \neq c
\end{cases}$$

Substituting back:

**Case 1: $i = c$**
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = -\frac{1}{\hat{\mathbf{y}}_c} \cdot \hat{\mathbf{y}}_c(1 - \hat{\mathbf{y}}_c) = -(1 - \hat{\mathbf{y}}_c) = \hat{\mathbf{y}}_c - 1$$

**Case 2: $i \neq c$**
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = -\frac{1}{\hat{\mathbf{y}}_c} \cdot (-\hat{\mathbf{y}}_c \hat{\mathbf{y}}_i) = \hat{\mathbf{y}}_i$$

Therefore:
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = \begin{cases}
\hat{\mathbf{y}}_c - 1 & \text{if } i = c \\
\hat{\mathbf{y}}_i & \text{if } i \neq c
\end{cases}$$

This can be written more compactly as:
$$\frac{\partial CE(\mathbf{y}, \hat{\mathbf{y}})}{\partial \mathbf{l}_i} = \hat{\mathbf{y}}_i - \mathbf{y}_i$$

where $\mathbf{y}_i$ is the $i$-th component of the one-hot true label vector.
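
The compact result $\hat{\mathbf{y}}_i - \mathbf{y}_i$ can be checked numerically for three logits. The sketch below uses arbitrary example logits and an arbitrary true class `c`; the helper names `softmax` and `cross_entropy` are assumptions made for this check.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, c):
    """CE(y, y_hat) = -log y_hat[c] for a one-hot label with true class c."""
    return -math.log(softmax(logits)[c])

# Example values (hypothetical): three logits, true class c = 0.
logits = [1.0, -0.5, 2.0]
c = 0

y_hat = softmax(logits)
y = [1.0 if i == c else 0.0 for i in range(len(logits))]

# Analytic gradient from the derivation: y_hat - y.
analytic = [y_hat[i] - y[i] for i in range(len(logits))]

# Central finite differences on each logit.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    plus = logits[:]
    plus[i] += eps
    minus = logits[:]
    minus[i] -= eps
    numeric.append((cross_entropy(plus, c) - cross_entropy(minus, c)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print("analytic gradient:", analytic)
print("largest analytic/numeric gap:", max_err)
```

The entry for the true class is negative and the other entries are positive, and the entries sum to zero because both $\hat{\mathbf{y}}$ and $\mathbf{y}$ sum to one.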