# Assignment 3: Neural networks in natural language processing

### Due Date: Oct 30 (both sections)

### Grade (100 pts, 10%)

#### Your Name:

#### Your EID:

*Note: This assignment covers material from the recording, notes, demo, and suggested readings from Lecture 08.*

---

## Questions

### 1. Dropout (50 pts)

Dropout is a regularization technique that randomly sets units in each activation layer, $a \in \mathbb{R}^{D}$, to zero and then multiplies the resultant vector elementwise by a constant $\gamma$ according to:

$$a_{dropout} \leftarrow  \gamma H \odot a$$

where $\odot$ represents the element-wise product operator and $H \in \{0, 1\}^D$ is a mask whose entries $H_i$ are drawn independently from

$$H_i = \begin{cases} 0 & \text{with probability } p_{dropout} \\ 1 & \text{with probability } 1 - p_{dropout} \end{cases}$$

Select a scaling factor ${\gamma}$ that ensures the expected value over the activation layer remains invariant to the above operation, $E\big[ a_{dropout} \big] = E\big[ a \big]$, and provide rationale for your selection.

*Hint: You want to show that, in expectation over the mask $H$,*

$$
\sum_{i=1}^D a_i = \gamma \sum_{i=1}^D E\big[ H_i \big] \, a_i
$$
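
The dropout operation defined above can be sketched in a few lines of NumPy. This is only an illustration: `apply_dropout` is a hypothetical helper, not code from the lecture, and it takes $\gamma$ as an argument rather than assuming any particular value for it.

```python
import numpy as np

def apply_dropout(a, p_dropout, gamma, rng):
    """Return gamma * H * a for a freshly sampled mask H (hypothetical helper)."""
    # rng.random draws from [0, 1), so H_i = 0 with probability p_dropout, else 1
    H = (rng.random(a.shape) >= p_dropout).astype(a.dtype)
    return gamma * H * a

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.random(8)
a_dropout = apply_dropout(a, p_dropout=0.5, gamma=1.0, rng=rng)
print(a)
print(a_dropout)  # each entry is either 0 or gamma * a_i
```

Passing `gamma` explicitly makes it easy to compare the mean of `a_dropout` against the mean of `a` for different candidate scaling factors.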

In [1]:
# Your answer goes here
import numpy as np

size = 10000

gamma_values = {}

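# p is the dropout probability in percent: 0, 10, ..., 90
for p in range(0, 100, 10):
    gamma_values[p] = []
    # 1000 independent trials per dropout probability
    for i in range(1000):
        # Initial activation vector a
        a = np.random.rand(size)

        # H: binary mask the same length as a; each entry is 0 with probability p/100, else 1
        dropouts = np.random.choice([0, 1], size=size, p=[p/100, 1-(p/100)])

        # a after random dropout, before any rescaling (H * a)
        a_dropout = np.multiply(a, dropouts)

        # Scaling factor that restores the original mean for this draw
        gamma = np.mean(a) / np.mean(a_dropout)
        gamma_values[p].append(gamma)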



print("Average estimated scaling factor for each value of p")
for k, v in gamma_values.items():
    p = k/100
    print("P: ", p)
    print("\t1/(1-p): ", 1/(1 - p))
    print("\tGamma: ", np.mean(v))

print("\nBased on this Monte Carlo-esque simulation, it seems that gamma is roughly = 1/(1-p) if gamma = mean[a]/mean[a_dropout]")

Average estimated scaling factor for each value of p
P:  0.0
	1/(1-p):  1.0
	Gamma:  1.0
P:  0.1
	1/(1-p):  1.1111111111111112
	Gamma:  1.1109693886685041
P:  0.2
	1/(1-p):  1.25
	Gamma:  1.2500664578933312
P:  0.3
	1/(1-p):  1.4285714285714286
	Gamma:  1.4280009706105377
P:  0.4
	1/(1-p):  1.6666666666666667
	Gamma:  1.666721001199416
P:  0.5
	1/(1-p):  2.0
	Gamma:  2.0008511050356224
P:  0.6
	1/(1-p):  2.5
	Gamma:  2.4988443213182854
P:  0.7
	1/(1-p):  3.333333333333333
	Gamma:  3.334257964853906
P:  0.8
	1/(1-p):  5.000000000000001
	Gamma:  4.994702730783005
P:  0.9
	1/(1-p):  10.000000000000002
	Gamma:  10.003381229897204

Based on this Monte Carlo-esque simulation, it seems that gamma is roughly = 1/(1-p) if gamma = mean[a]/mean[a_dropout]
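
The simulation result can also be cross-checked analytically. A short derivation, assuming $a$ is held fixed (or is independent of the mask) and each $H_i$ follows the distribution given in the question:

$$
E\big[ a_{dropout, i} \big] = E\big[ \gamma H_i a_i \big] = \gamma \, a_i \, E\big[ H_i \big] = \gamma \, a_i \big( 0 \cdot p_{dropout} + 1 \cdot (1 - p_{dropout}) \big) = \gamma \, (1 - p_{dropout}) \, a_i
$$

Requiring $E\big[ a_{dropout, i} \big] = a_i$ for every $i$ then gives

$$
\gamma = \frac{1}{1 - p_{dropout}}, \qquad p_{dropout} < 1,
$$

which agrees with the $1/(1-p)$ column in the output above.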


### 2. Convolutions (50 pts)

Consider a sequence of $T$ token embeddings, $Z \in \mathbb{R}^{T \times D}$, for which $D=3$:

In [2]:
import numpy as np

Z = np.array([
    [1.3,   0.4, -0.2],
    [-3.1,  1.1,  2.1],
    [0.9,   2.8, -1.5],
    [1.3,   2.4,  0.1],
    [1.0,   1.0,  0.5],
    [3.0,  -1.4, -0.2],
    [-0.7,  1.8,  1.3]
])

and a set of convolutional filters, $W=\{ w^{(1)}, w^{(2)} \}$, and corresponding filter widths $S=\{ s^{(1)}, s^{(2)}  \}$:

In [3]:
w1 = np.array([
    [1, 1, 1],
    [1, 1, 1]
])

w2 = np.array([
    [2, 2, 2],
    [2, 2, 2],
    [2, 2, 2]
])

W = [w1, w2]

S = [2, 3]

In Lecture 08 we discussed a set of operations that maps $Z \in \mathbb{R}^{T \times D}$ onto $Z' \in \mathbb{R}^{N_F D}$ (in this problem $N_F = 2$). This involved three steps:

1. **Convolution**: The convolutional operation produces $N_F$ feature maps, $B^{(n)} \in \mathbb{R}^{(T - s^{(n)} + 1) \times D}$, where $n \in \{1, \dots, N_F\}$, according to:

$$
\forall_{t \in \{ 1, \dots, T - s^{(n)} + 1 \} } \; B^{(n)}_{t,j} = \sum_{t'=1}^{s^{(n)}} w^{(n)}_{t',j} \; Z_{t+t'-1, \ j}
$$

2. **Max pooling**: The max pooling operation computes the max over the sequence dimension in each feature map, $B_{maxpool}^{(n)} \in \mathbb{R}^D$, according to:

$$
B_{maxpool, j}^{(n)} = \underset{1 \leq t' \leq T - s^{(n)} + 1 }{\max} B^{(n)}_{t', j}
$$

3. **Concatenation**: The resultant set of $N_F$ feature vectors is then concatenated into a single vector $Z'$ according to:

$$
Z' = \big[ B_{maxpool}^{(1)}, \dots, B_{maxpool}^{(n)}, \dots,  B_{maxpool}^{(N_F)}  \big] \in \mathbb{R}^{D \cdot N_F}
$$
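
As a concrete illustration of the index convention in step 1, here is one worked entry using the $Z$ and $w^{(1)}$ defined above (1-based indices, as in the formula). For $n = 1$, $t = 1$, $j = 1$:

$$
B^{(1)}_{1,1} = \sum_{t'=1}^{2} w^{(1)}_{t',1} \; Z_{t', \ 1} = 1 \cdot 1.3 + 1 \cdot (-3.1) = -1.8
$$

Since the formula involves no padding and $T = 7$, the feature maps have shapes $B^{(1)} \in \mathbb{R}^{6 \times 3}$ and $B^{(2)} \in \mathbb{R}^{5 \times 3}$.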

In the cell below, perform these three operations to produce $Z' \in \mathbb{R}^6$ and print it.

*Hint: The max pooling operation computes the maximum over each column in $B^{(n)}$*

In [4]:
# No padding: the feature-map shape (T - s + 1, D) in step 1 implies only valid window positions.
# Each embedding dimension (column j) is convolved independently along the sequence axis.
T, D = Z.shape

# Step 1: Convolution
feature_maps = []
for w, s in zip(W, S):
    # One row per valid window position: T - s + 1 rows, D columns
    B = np.zeros((T - s + 1, D))
    for t in range(T - s + 1):
        # Element-wise product of the filter with rows t..t+s-1, summed over the window
        B[t] = np.sum(w * Z[t:t + s], axis=0)
    feature_maps.append(B)

conv_1, conv_2 = feature_maps
print("B1: ", conv_1)
print("B2: ", conv_2)

# Step 2: Max pooling (maximum over the sequence dimension, i.e. over each column)

b1_max = np.amax(conv_1, axis=0)
b2_max = np.amax(conv_2, axis=0)
print("B1 Max: ", b1_max)
print("B2 Max: ", b2_max)

# Step 3: Concatenation

Z_prime = np.concatenate((b1_max, b2_max))
print("Z': ", Z_prime)

B1:  [[-1.8  1.5  1.9]
 [-2.2  3.9  0.6]
 [ 2.2  5.2 -1.4]
 [ 2.3  3.4  0.6]
 [ 4.  -0.4  0.3]
 [ 2.3  0.4  1.1]]
B2:  [[-1.8  8.6  0.8]
 [-1.8 12.6  1.4]
 [ 6.4 12.4 -1.8]
 [10.6  4.   0.8]
 [ 6.6  2.8  3.2]]
B1 Max:  [4.  5.2 1.9]
B2 Max:  [10.6 12.6  3.2]
Z':  [ 4.   5.2  1.9 10.6 12.6  3.2]
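
As a cross-check, the three steps can also be written in vectorized form, following the formulas in steps 1 to 3 literally (no padding, each column convolved independently along the sequence axis). This is a sketch rather than the lecture's implementation: `conv_maxpool_concat` is a hypothetical helper, and `sliding_window_view` assumes NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

Z = np.array([
    [1.3,   0.4, -0.2],
    [-3.1,  1.1,  2.1],
    [0.9,   2.8, -1.5],
    [1.3,   2.4,  0.1],
    [1.0,   1.0,  0.5],
    [3.0,  -1.4, -0.2],
    [-0.7,  1.8,  1.3],
])
w1 = np.ones((2, 3))      # same values as w1 in the assignment
w2 = 2 * np.ones((3, 3))  # same values as w2 in the assignment

def conv_maxpool_concat(Z, filters):
    """Hypothetical helper: apply steps 1-3 for a list of (s, D) filters."""
    pooled = []
    for w in filters:
        s = w.shape[0]  # filter width, taken from the filter itself
        # windows[t, j, k] == Z[t + k, j]; shape (T - s + 1, D, s)
        windows = sliding_window_view(Z, window_shape=s, axis=0)
        # Step 1: B[t, j] = sum_k w[k, j] * Z[t + k, j]
        B = np.einsum("tjk,kj->tj", windows, w)
        # Step 2: max over the sequence dimension
        pooled.append(B.max(axis=0))
    # Step 3: concatenate the pooled vectors
    return np.concatenate(pooled)

Z_prime = conv_maxpool_concat(Z, [w1, w2])
print(Z_prime.shape)
print(Z_prime)
```

Reading the filter width from `w.shape[0]` keeps the widths and the filters from drifting out of sync, which is why the helper does not take `S` as a separate argument.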
