<a href="https://colab.research.google.com/github/jackychen08/Self_Supervised_Learning/blob/main/jacky_spring2023_homework2_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center> <h1> CSCI 601.471/671 NLP: Self-supervised Models </h1> </center>

<center> <h2> Homework 2: Classifying Text with Word Embeddings </h2> </center>

In this homework, we will build a sentiment classifier using different representational choices, providing an opportunity to apply concepts learned in class in a hands-on manner. 

**After this assignment you will be comfortable with:**
- the Softmax classifier and its gradients. 
- building classifiers and optimizing them with gradient descent. 

# Setup

For this and other assignments, we will use Google Colab for both code and descriptive questions. Your task is to finish all the questions in the Colab notebook and then upload a PDF version of the notebook, and a viewable link, on Gradescope.


### Submission

Before you start working on this homework do the following steps:

1. Press __File > Save a copy in Drive...__. This gives you your own copy that you can edit.
2. Follow all the steps in this Colaboratory file and write / change / uncomment code as necessary.
3. Do not forget to occasionally press __File > Save__ to save your progress.
4. After all the changes are done and your progress is saved, press the __Share__ button (top right corner of the page), press __get a shareable link__, and make sure the option __Anyone with the link can view__ is selected. Copy the link and paste it in the box below.
5. After completing the notebook, press __File > Download .ipynb__ to download a local copy on your computer, and then upload the file to Gradescope.


__Paste your notebook link in the box below.__ 

In [None]:
# Paste your Colab notebook link here 


Let's get started! Run the following cell to load the packages you will need.

In [None]:
!python3 --version # to check the Python version. We used Python 3.8.10 when preparing this assignment 

!pip install --upgrade pip # updates your pip tool to its latest version 
!pip install nltk==3.8.1 # for tokenizing text 
!pip install numpy==1.24.1 # contains many useful numerical tools 
!pip install gensim==3.6.0 # to download word embeddings 
!pip install tqdm==4.64.1 # a nice library for visualizing progress bar 
!pip install datasets==2.9.0 # huggingface's library of datasets 

import random # for randomizing data 
import nltk
from nltk.tokenize import word_tokenize  # for tokenization 
import numpy as np # for numerical operators 
from tqdm import tqdm # progress bar 
import matplotlib.pyplot as plt # for plotting 
import gensim.downloader # for downloading word embeddings 

# Tokenization and cleaning
The first step is to **tokenize** text! 
Tokenization is the process of converting a string of text into tokens (smaller sub-strings). It is a common task in NLP and is often the first step in processing text data. Tokens can be words, phrases, or other meaningful elements of the text, depending on the application. Tokenization helps to simplify text data, making it easier to process and analyze.
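
To make the idea concrete, here is a minimal sketch that contrasts naive whitespace splitting with a simple regular-expression tokenizer. The `simple_tokenize` helper and its pattern are illustrative assumptions, not what NLTK does internally; NLTK's tokenizer handles many more cases.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a run of word characters becomes one token,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

text = 'Who loves "word embeddings" in 2023? We do!!!'

whitespace_tokens = text.split()       # punctuation stays glued to words
regex_tokens = simple_tokenize(text)   # punctuation is split off

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting leaves tokens such as `2023?` and `do!!!`, which would rarely match an entry in an embedding vocabulary; splitting off punctuation makes lookups more likely to succeed.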

Take this sentence for example: 

In [None]:
sentence = 'Who ❤️ "word embeddings" in 2023? We do!!!'

Next, use NLTK's tokenization engine to split the sentence into individual tokens.

In [None]:
nltk.download('punkt')  # download pre-trained Punkt tokenizer files for English

print(f'Initial string:  {sentence}')
data = nltk.word_tokenize(sentence)
print(f'After tokenization:  {data}')  # print the token list, not the original string

As you can see, this function in `NLTK` turns a sentence into a list of tokens. 


Now try it out yourself with your own sentence.

In [None]:
data = nltk.word_tokenize("YOUR SENTENCE HERE!!")
print(f'Result after tokenization:  {list(data)}')

# Reading the dataset 

For this exercise, we will use [this dataset](https://huggingface.co/datasets/imdb), which contains movie reviews that are manually annotated as positive (`label=1`) or negative (`label=0`). Spend a few minutes reading a few examples on Huggingface🤗 to get a better sense of what this dataset looks like. 


Next, we will use Huggingface🤗's `datasets` library to download this dataset locally. 

In [None]:
from datasets import load_dataset

# download dataset
dataset = load_dataset("imdb")
dataset = dataset.shuffle() # shuffle the data 

As you can see on Huggingface🤗's website, the dataset by default comes with `test` and `train` splits. 



In [None]:
train_dataset = dataset['train']
test_dataset = dataset['test']

print(" -------  an example from the train set -------")
print(train_dataset[0])

print(" -------  an example from the test set -------")
print(test_dataset[0])

The training set is used to train the model, while the test set is used to evaluate the model's performance. Since we don't want to overfit the test set, we will evaluate on it only a few times, once we are done with model training. **This is very important**!!

We will also set aside a subset of the training set as a development (dev) set. Dev sets are used in machine learning to evaluate the model's performance during the training process, providing an intermediate check on the model's accuracy before it is evaluated on the test set. 

Dev sets help detect overfitting during training. Overfitting occurs when a model fits the training data too closely, for example because it is too complex, leading to poor generalization to new data. The development set lets us monitor the model's performance on data it has not seen during training, which helps to avoid overfitting.

We will also cap our train and test sets at 20k and 1k examples, respectively, to make training and evaluation faster, at the cost of some accuracy:

In [None]:
dev_dataset = train_dataset[:1000]
train_dataset = train_dataset[1000:21000]
test_dataset = test_dataset[:1000]

To load all the inputs, we can simply use the `text` key. 

In [None]:
train_dataset['text']

Similarly, to extract all the outputs, we can use the `label` key: 

In [None]:
train_dataset['label']

Since this is a binary classification, let's check the count of each label: 

In [None]:
print(f"Number of positive instances: {len([x for x in train_dataset['label'] if x == 1])}")
print(f"Number of negative instances: {len([x for x in train_dataset['label'] if x == 0])}")

As you can see, the labels are almost balanced. So we don't need to worry about any label imbalance issues. 

**Aside:** Label imbalance in machine learning is a challenge because it can result in biased models, misleading performance metrics, and degraded performance due to over/under-sampling. A model trained on imbalanced data may be more accurate for the majority class and less accurate for the minority class. Common metrics such as accuracy may not be suitable for imbalanced data sets. To address label imbalance, techniques such as over-sampling, under-sampling, or class-specific cost metrics may need to be used, which can impact the model's performance.
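
A small sketch of why accuracy can mislead on imbalanced data: on a hypothetical label set with 95 negatives and 5 positives (made-up numbers, not the IMDB data), a predictor that always outputs the majority class already looks strong on accuracy while never finding a single positive.

```python
from collections import Counter

# Hypothetical, heavily imbalanced labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# A predictor that always outputs the majority label is correct exactly
# on the majority-class examples and wrong on all the others.
baseline_accuracy = 100.0 * majority_count / len(labels)
minority_found = 0  # it never predicts the minority class

print(f"Majority label: {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy}%")
print(f"Minority examples found: {minority_found} / {counts[1]}")
```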


# Building the feature vectors 
Here we will turn the input sentences into continuous vectors. 
To do so, we will start by loading a vector representation. 


In [None]:
embeddings = gensim.downloader.load('glove-twitter-50')

For each sentence, we will tokenize it and create a feature vector that is the mean of the embeddings of its tokens. Here is an example: 

In [None]:
vectors = []
for word in nltk.word_tokenize(sentence):
    # look up the embedding of each word; the lookup raises KeyError if the word is not in the vocabulary
    try:
        vectors.append(embeddings[word])
    except KeyError:
        pass
features = np.array(vectors).mean(axis=0)  # average the vectors

# append a bias term; np.append returns a new array rather than modifying `features` in place
np.append(features, 1.0)

Note that this feature vector has dimension 51, which is the same as the original embedding dimensions plus one. The added one is usually called a bias term. Since we are going to use a linear classifier, the bias term allows the model to shift the decision boundary away from the origin, so it can better fit the data and improve classification accuracy. 
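
The same computation on a toy scale may make the shapes easier to follow. The sketch below assumes a hypothetical three-dimensional embedding table (`toy_embeddings`) in place of the real GloVe vectors; the out-of-vocabulary token is simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (the real GloVe vectors used here have 50 dimensions).
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}
tokens = ["good", "movie", "zzzz"]  # "zzzz" is out of vocabulary

# Keep only the tokens that have an embedding, then average them.
vectors = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
mean_vector = np.mean(vectors, axis=0)

# Append the constant 1.0 that acts as the bias feature.
toy_features = np.append(mean_vector, 1.0)
print(toy_features, toy_features.shape)
```

The result has one more entry than the embedding dimension, mirroring the 50 + 1 = 51 dimensions above.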

Next, we will build a function to featurize all of our data. 

In [None]:
def featurizer(data, embeddings):
    # determine the dimensionality of vectors
    feature_length = embeddings['the'].shape[0] + 1 # plus 1 for the bias term 

    features = np.zeros((len(data), feature_length)) # a feature vector for the whole data 
    empty_count = 0
    for senId, sentence in tqdm(enumerate(data)):
        vectors = []
        for word in nltk.word_tokenize(sentence) :
            # look up the embedding of each word; the lookup raises KeyError if the word is not in the vocabulary
            try:
                vectors.append(embeddings[word])
            except KeyError:
                pass
        if len(vectors) > 0:
            feature_vector = np.array(vectors).mean(axis=0)  # average the vectors
            features[senId] = np.append(feature_vector, 1.0) # bias term 
        else:
            empty_count += 1
    print("Number of samples with no words found: %s / %s" % (empty_count, len(data)))
    return features



# bag of embeddings features 
train_input_embeddings = featurizer(train_dataset['text'], embeddings=embeddings)
dev_input_embeddings = featurizer(dev_dataset['text'], embeddings=embeddings)
test_input_embeddings = featurizer(test_dataset['text'], embeddings=embeddings)

## **Question 1:** Explain what the `featurizer` function does (no more than 5 sentences). 

In [None]:
# Your answer here. 

Let's also pull out the labels

In [None]:
train_labels = np.array(train_dataset['label'])
dev_labels = np.array(dev_dataset['label'])
test_labels = np.array(test_dataset['label'])

# Softmax function
Remember the Softmax function, which turns real-valued scores into probabilities:

$$
   \sigma(\mathbf{z})_j = \frac{\exp(z_j)}{\sum_k \exp(z_k)}
$$


## **Question 2:** Implement the softmax function to turn any vector $\mathbf{z}$ (numpy array) into probabilities:

In [None]:
def softmax(x):
    """
    Compute the softmax function for each row of the input x.

    It is crucial that this function is optimized for speed because
    it will be used frequently in later code.
    You might find numpy functions np.exp, np.sum, np.reshape,
    np.max, and numpy broadcasting useful for this task. (numpy
    broadcasting documentation:
    http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)

    You should also make sure that your code works for one
    dimensional inputs (treat the vector as a row), you might find
    it helpful for your later problems.

    You must implement the optimization in problem 4.1 ("pro tip") of the 
    written assignment!
    """

    ### YOUR CODE HERE (should be 2-4 lines)
    ### END YOUR CODE
    
    return x

If your implementation is correct, it should pass the following test cases. 

In [None]:
print("Running basic tests...")
test1 = softmax(np.array([1,2]))
print(test1)
assert np.amax(np.fabs(test1 - np.array([0.26894142,  0.73105858]))) <= 1e-6

test2 = softmax(np.array([[1001,1002],[3,4]]))
print(test2)
assert np.amax(np.fabs(test2 - np.array([[0.26894142, 0.73105858], [0.26894142, 0.73105858]]))) <= 1e-6

test3 = softmax(np.array([[-1001,-1002]]))
print(test3)
assert np.amax(np.fabs(test3 - np.array([0.73105858, 0.26894142]))) <= 1e-6

print(" >> If you got here, the tests passed! GREAT!!")

Now, let's experiment with the `softmax()` to understand its function. We will design a function to visualize the input/output of softmax: 

In [None]:
def visualize_softmax(x):
    y = softmax(x)

    plt.bar(range(len(x)), x, alpha=0.4)
    plt.bar(range(len(y)), y, color='r', alpha=0.3)

    plt.legend(['Input to Softmax', 'Output of Softmax'])
    plt.show()


Here are a couple of inputs and their visualizations: 

In [None]:
visualize_softmax(np.array([0.5, 0.5]))
visualize_softmax(np.array([0.1, 0.1]))
visualize_softmax(np.array([0.9, 0.9]))
visualize_softmax(np.array([0.26894142, 0.73105858]))
visualize_softmax(2 * np.array([0.26894142, 0.73105858]))
visualize_softmax(0.5 * np.array([0.26894142, 0.73105858]))

## **Question 3:** Interpret the behavior of the `softmax` function with the help of the above examples (no more than 4 sentences). 

In [None]:
# your answer here

# Turning Softmax into a classifier 

We will build a classifier now. 
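Let's remember the structure of the softmax linear classifier: the input vector $\mathbf{x}$ is transformed into a **logit score** vector $\mathbf{z}$ using a weight matrix $W$. The bias vector $\mathbf{b}$ does not appear separately, because it is folded into $W$ through the constant 1 appended to every feature vector: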

$$
    \mathbf{z} = \mathbf{x} W 
$$

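This logit score has one element per class, so the weight matrix must have a size $(d, c)$, where $c$ is the number of classes (output labels) and $d$ is the number of dimensions of the input space (features, including the constant bias feature). The bias has $c$ elements (one per class); here it corresponds to the last row of $W$, which multiplies the constant 1 in each feature vector.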
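
As a sanity check on the bias term, the following sketch (with made-up random values, and variable names that are not part of the assignment code) compares an explicit bias vector against a constant-1 feature with the bias stored as an extra row of the weight matrix. The two logit vectors should agree.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 3, 2                        # toy feature dimension and number of classes

x = rng.normal(size=d)             # raw features, without the constant 1
W_raw = rng.normal(size=(d, c))    # weights for the raw features
b = rng.normal(size=c)             # explicit bias, one entry per class

# Fold the bias into the weights: append 1 to x and b as the last row of W.
x_aug = np.append(x, 1.0)          # shape (d + 1,)
W_aug = np.vstack([W_raw, b])      # shape (d + 1, c)

z_explicit = x @ W_raw + b
z_folded = x_aug @ W_aug
print(z_explicit, z_folded)
```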

The logit score is turned into probabilities using the **softmax** operator:

$$
    \hat{y}_j = P(\text{class = j}) = \frac{\exp(z_j)}{\sum_k \exp(z_k)}
$$

Let's initialize a random weight matrix first and classify our instances:

In [None]:
num_classes = 2 # number of output classes 
dimVectors = train_input_embeddings[0].shape[0] # look up the feature dimension 
weights = 0.1 * np.random.randn(dimVectors, num_classes) # a random weight matrix 

prob = softmax(train_input_embeddings.dot(weights))
pred = np.argmax(prob, axis=1)

prob, prob.shape, pred 

You should be able to see that: 
 - the probabilities produced by this classifier are all between 0 and 1. 
 - `prob` has two columns (one for each class), and each row sums to 1.0. 
 - `prob` has 20k rows, one for each of our training instances. 
 - `pred` contains the class index with the highest probability. 


How good is this classifier? Let's build a function that measures the accuracy of this classifier, i.e., the rate at which the classifier output equals the gold label. 


## **Question 4:** Complete the following function for accuracy: 

In [None]:
def accuracy(y, yhat):
    assert (y.shape == yhat.shape)
    
    # for checking equality between numpy arrays we can use the == operator 
    # note: the tests below expect the accuracy as a percentage (0-100) 
    # for summing values we can use np.sum() function 

    ## START CODE HERE (hint: 1-2 lines of code)
    ## END CODE HERE 
    return accuracy

If your implementation is right, it should pass the following tests

In [None]:
# perfect prediction --> 100% 
assert np.abs(
    np.fabs(accuracy(np.array([0, 3, 1, 2, 5]), np.array([0, 3, 1, 2, 5])) - 
            np.array(100.0))) < 1e-6

# all incorrect --> 0% 
assert np.abs(
    np.fabs(accuracy(np.array([0, 1, 1, 0, 0]), np.array([1, 0, 0, 1, 1])) - 
            np.array(0.0))) < 1e-6

print("if you got here, your implementation for accuracy is probably correct! ")

Now let's evaluate our predictions 

In [None]:
accuracy(train_labels, pred)

How bad is this number? Pretty bad! Given that our data is balanced, a random coin toss would result in 50% performance, which is the performance floor for any classifier we build. To improve our classifier, we will run an optimization algorithm on its parameters. 

# Optimizing our classifier 

We will start by defining an objective function that measures "goodness" for our classifier. 
A common choice for classification is [categorical cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy) or negative log-likelihood. 

A discussion or derivation of cross-entropy loss is beyond the scope of this class but a good introduction to it can be [found here](https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/). A discussion of what makes it superior to MSE for classification can be found [here](https://jamesmccaffrey.wordpress.com/2013/11/05/why-you-should-use-cross-entropy-error-instead-of-classification-error-or-mean-squared-error-for-neural-network-classifier-training/).  We will just focus on its properties instead.

Letting $y_i$ denote the ground truth value of class $i$, and $\hat{y}_i$ be our prediction of class $i$, the cross-entropy loss is defined as:

$$ CE(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i $$

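If the number of classes is 2 (which is the case here), we can expand this, writing $y \in \{0, 1\}$ for the label and $\hat{y}$ for the predicted probability of class 1:

$$ CE(y, \hat{y}) = -\big(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})\big) $$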

Notice that as our probability for predicting the correct class approaches 1, the cross-entropy approaches 0. For example, if $y=1$, then as $\hat{y}\rightarrow 1$, $CE(y, \hat{y}) \rightarrow 0$. If our probability for the correct class approaches 0 (the exact wrong prediction), e.g. if $y=1$ and $\hat{y} \rightarrow 0$, then $CE(y, \hat{y}) \rightarrow \infty$.

This is true in the more general $M$-class cross-entropy loss as well, $CE(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i $: if our prediction is very close to the true label, the cross-entropy loss is close to 0, whereas the more dissimilar the prediction is to the true class, the higher the loss.

**Practical tip:** in practice, a very small $\epsilon$ is added to the log, e.g. $\log(\hat{y}+\epsilon)$ to avoid $\log 0$ which is undefined.
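
These properties are easy to check numerically. The sketch below assumes a hypothetical helper `binary_cross_entropy` with $\epsilon = 10^{-12}$ added inside each log, as in the tip above; the value of $\epsilon$ is an arbitrary choice for illustration.

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Hypothetical helper: y is the true label (0 or 1) and y_hat is the
    # predicted probability of class 1. eps keeps log() away from log(0).
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

# True label is 1: the loss shrinks as the predicted probability grows.
losses = {y_hat: binary_cross_entropy(1, y_hat) for y_hat in [0.1, 0.5, 0.9]}
for y_hat, loss in losses.items():
    print(f"y=1, y_hat={y_hat}: loss={loss:.4f}")

# Exactly wrong prediction: large but finite thanks to eps.
worst_case = binary_cross_entropy(1, 0.0)
print(f"y=1, y_hat=0.0: loss={worst_case:.4f}")
```

Without `eps`, the last call would fail because `math.log(0.0)` raises a `ValueError`.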


Let's compute the cross-entropy loss for the predictions of our randomly initialized classifier: 

In [None]:
N = train_input_embeddings.shape[0]
cost = np.sum(-np.log(prob[range(N), train_labels])) / N

# also add a regularization term 
regularization = 0.5 # the hyperparameter 
cost += 0.5 * regularization * np.sum(weights ** 2)
 
print(f"Overall loss for our initial weights: {cost}")

Note that the latter half adds an L2 regularization term $\frac{r}{2}\|W\|^2_2$ to the overall objective, where $r$ is a hyperparameter we will tune. It helps to prevent overfitting by adding a term to the loss function that discourages large weights. The result is a model that is less sensitive to the specific training data and can generalize better to new data. You can read more about regularization [here](https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization).



To optimize this objective, we will compute its gradient with respect to the parameters $W$. 


Before doing that, let's restate the cross-entropy loss plus regularization in vector form, for all classes: 
$$
    \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) =   - \frac{1}{N} \sum_{i} \big[ \mathbf{y}_i  \cdot \log \hat{\mathbf{y}}_i \big] + \frac{r}{2}\|W\|^2_2,
$$
where the summation is over the $N$ input-output instances, $\mathbf{y}_i \in \{0, 1\}^{c}$ is a one-hot encoding of the class label, and $\hat{\mathbf{y}}_i \in [0, 1]^{c}$ is the vector of probabilities assigned to each class.
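
Because $\mathbf{y}_i$ is one-hot, the dot product $\mathbf{y}_i \cdot \log \hat{\mathbf{y}}_i$ simply picks out the log-probability of the correct class. The sketch below, using made-up probabilities and labels, checks that the one-hot form matches the indexing form `prob[range(N), labels]`.

```python
import numpy as np

# Made-up probabilities for 3 instances and 2 classes (each row sums to 1).
toy_prob = np.array([[0.7, 0.3],
                     [0.2, 0.8],
                     [0.5, 0.5]])
toy_labels = np.array([0, 1, 1])
n = toy_prob.shape[0]

# One-hot form: -1/N * sum_i  y_i . log(y_hat_i)
toy_one_hot = np.eye(toy_prob.shape[1])[toy_labels]
loss_one_hot = -np.sum(toy_one_hot * np.log(toy_prob)) / n

# Indexing form: pick the probability of the correct class directly.
loss_indexing = -np.sum(np.log(toy_prob[np.arange(n), toy_labels])) / n

print(loss_one_hot, loss_indexing)
```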


After doing the derivations, we obtain the gradient with respect to $W$:
$$
 \frac{\partial \mathcal{L} }{\partial W} = \frac{1}{N} \sum_{i} \big[ \mathbf{x}_i^\top (\hat{\mathbf{y}}_i - \mathbf{y}_i) \big] + r W.
$$
Verify the correctness of this gradient in your own time! :)  Note that because $W$ is a $(d, c)$ matrix, $\frac{\partial \mathcal{L} }{\partial W}$ is a $(d, c)$ matrix too. Each term $\mathbf{x}_i^\top(\hat{\mathbf{y}}_i - \mathbf{y}_i)$ is therefore the **outer product** between the input vector $\mathbf{x}_i$ ($d$ elements) and the error vector $\hat{\mathbf{y}}_i - \mathbf{y}_i$ ($c$ elements).

For more efficiency, we can write the above expression in matrix form:
$$
 \frac{\partial \mathcal{L} }{\partial W} = \frac{1}{N}  \mathbf{X}^\top (\hat{\mathbf{Y}} - \mathbf{Y}) + r W,
$$
where $\mathbf{X} \in \mathbb{R}^{N\times d}$ is the feature matrix of all our instances,   
$\mathbf{Y} \in \{0, 1\}^{N\times c}$ is the matrix of gold labels (one-hot vector for each instance), 
and $\hat{\mathbf{Y}} \in [0, 1]^{N\times c}$ is the matrix of probabilities for all of our instances.  
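
One common way to verify a gradient formula is a finite-difference check: perturb each weight slightly and compare the resulting change in the objective with the analytic gradient. The sketch below uses a hypothetical helper, `numerical_gradient`, and applies it only to the regularization term $\frac{r}{2}\|W\|^2_2$, whose gradient is $rW$; the same helper could be pointed at any scalar function of $W$.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Central finite-difference estimate of df/dW (hypothetical helper)."""
    W = W.astype(float).copy()  # work on a copy so the caller's array is untouched
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + h
        f_plus = f(W)
        W[idx] = original - h
        f_minus = f(W)
        W[idx] = original  # restore the entry before moving on
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

r = 0.5  # regularization constant
W_toy = np.array([[1.0, -2.0],
                  [0.5, 3.0]])

def l2_penalty(W):
    return 0.5 * r * np.sum(W ** 2)

analytic_grad = r * W_toy
numeric_grad = numerical_gradient(l2_penalty, W_toy)
max_abs_diff = np.max(np.abs(analytic_grad - numeric_grad))
print(max_abs_diff)
```

A very small difference suggests the analytic gradient is right; a large one usually points to a sign or scaling error.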


Now, let's put all these together and implement a function for our softmax classifier: 

In [None]:
def softmax_classifier(features, labels, weights, regularization=1.0, nopredictions=False):
    """ Multi-class softmax classifier
        Implement softmax regression with weight regularization.
    
    Inputs:
    - features: feature vectors, each row is a feature vector
    - labels: labels corresponding to the feature vectors
    - weights: weights of the regressor
    - regularization: L2 regularization constant
    - nopredictions: if True, do not compute predictions (used in q4_sgd)
        
    Output:
    - cost: cost of the regressor
    - grad: gradient of the regressor cost with respect to its weights
    - pred: label predictions of the regressor (you might find np.argmax helpful)
    - prob: label probabilities of the regressor  
    """

    prob = softmax(features.dot(weights))
    pred = np.argmax(prob, axis=1)
    
    if len(features.shape) > 1:
        N = features.shape[0]
    else:
        N = 1
    # A vectorized implementation of    1/N * sum(cross_entropy(x_i, y_i)) + r/2 * |W|^2
    cost = np.sum(-np.log(prob[range(N), labels])) / N
    cost += 0.5 * regularization * np.sum(weights ** 2)

    one_hot = np.zeros_like(prob)
    one_hot[range(N), labels] = 1

    # Code the gradient computation 
    # Consider using np.dot() for dot product and .T for transpose 
    ### YOUR CODE HERE (hint: 1-2 lines)
    ### END YOUR CODE

    if nopredictions:
        return cost, grad
    else:
        return cost, grad, pred, prob


## **Question 5:** Implement the gradient calculation for our classifier according to the matrix form given earlier. 

If your implementation is correct, the following tests should pass: 



In [None]:
weights = np.array(
    [
        [1, 1], 
        [1, 1], 
        [1, 1]
     ]
    )
features = np.array(
      [
        [1, 1, 1], 
        [0, 0, 0], 
        [0.5, 1, 0]
     ]
    )
labels = np.array([1, 0, 1])
cost, grad = softmax_classifier(features, labels, weights, regularization=1.0, nopredictions=True)

assert np.abs(np.fabs(cost - 3.693147180)) < 1e-6
assert np.sum(np.abs(np.fabs(grad - np.array([[1.25      , 0.75      ],
                                    [1.33333333, 0.66666667],
                                    [1.16666667, 0.83333333]])))) < 1e-4

print(" >>> If no error occurred, you did it right! Hurray!!!")

Now let's try running gradient descent and see whether the cost goes down: 

In [None]:
num_classes = 2 # number of output classes 
dimVectors = train_input_embeddings[0].shape[0] # look up the feature dimension 
weights = 0.1 * np.random.randn(dimVectors, num_classes) # a random weight matrix 

initial_step_size = 0.1 # kept separately so the plot title reports the initial value
step_size = initial_step_size
step_size_decay = 0.95
num_steps = 50

dev_cost_per_step = []
dev_accuracy_per_step = []
best_dev_cost = np.inf
best_weights = None
for step in range(num_steps):
    cost, grad = softmax_classifier(train_input_embeddings, train_labels, weights, nopredictions=True)

    # update weights with the gradients
    weights -= step_size * grad

    # decay the step size
    step_size *= step_size_decay

    # cost and accuracy on the dev (validation) set
    dev_cost, _, pred, _ = softmax_classifier(dev_input_embeddings, dev_labels, weights)
    dev_cost_per_step.append(dev_cost)
    dev_accuracy = accuracy(dev_labels, pred)
    dev_accuracy_per_step.append(dev_accuracy)


# create a plot with two subplots: one for validation cost and the other for validation accuracy
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.plot(dev_cost_per_step)
ax1.set_title("Validation Cost")
ax1.set_xlabel("Step")
ax1.set_ylabel("Cost")
ax2.plot(dev_accuracy_per_step)
ax2.set_title("Validation Accuracy")
ax2.set_xlabel("Step")
ax2.set_ylabel("Accuracy")
fig.suptitle("Step Size Decay: %s, Initial Step Size: %s" % (step_size_decay, initial_step_size))
plt.show()

You should be able to see that the cost is going down as a function of training iterations, which shows that our algorithm is working. Yay! 

However, if you repeat this, you'd notice that the accuracy on the dev set is sensitive to random initializations. It is also sensitive to the choice of hyperparameters such as `step_size`. These highlight the importance of hyperparameter tuning! 

# Hyperparameters 

Hyperparameter tuning is the process of selecting optimal values for the hyperparameters of a machine learning model to improve its performance on a given task. It's important because the performance of a model can be significantly influenced by the choice of hyperparameters, and finding the best set of hyperparameters can improve the accuracy, reduce overfitting, and increase the generalization ability of the model.

To do hyperparameter tuning, let's put a wrapper around our training function and expose its parameters: 

In [None]:
def trainer(
        step_size_decay, initial_step_size, num_classes, num_steps,
        train_input_embeddings, train_labels, dev_input_embeddings, dev_labels, **kwargs
):
    '''
    Trains a softmax classifier on the data
    :param step_size_decay: step size decay
    :param initial_step_size: initial step size
    :param num_classes: number of classes
    :param num_steps: number of steps
    :param train_input_embeddings: training input embeddings
    :param train_labels: training labels
    :param dev_input_embeddings: development input embeddings
    :param dev_labels: development labels
    :return: per-step dev costs, per-step dev accuracies, the best weights, and their corresponding cost on the dev set 
    '''

    # initialize weight vector randomly
    dimVectors = train_input_embeddings[0].shape[0]
    model_weights = 0.1 * np.random.randn(dimVectors, num_classes)

    step_size = initial_step_size

    dev_cost_per_step = []
    dev_accuracy_per_step = []
    best_dev_cost = np.inf
    best_weights = None
    for step in range(num_steps):
        cost, grad, _, _ = softmax_classifier(train_input_embeddings, train_labels, model_weights)

        # update weights with the gradients
        model_weights -= step_size * grad

        # decay the step size
        step_size *= step_size_decay

        # cost and predictions on the dev (validation) set
        dev_cost, _, pred, _ = softmax_classifier(dev_input_embeddings, dev_labels, model_weights)
        if dev_cost < best_dev_cost:
            best_dev_cost = dev_cost
            # copy, because model_weights is updated in place on the next step
            best_weights = model_weights.copy()

        dev_cost_per_step.append(dev_cost)

        dev_accuracy = accuracy(dev_labels, pred)
        dev_accuracy_per_step.append(dev_accuracy)

    assert best_weights is not None, "best_weights is None"
    return dev_cost_per_step, dev_accuracy_per_step, best_weights, best_dev_cost

First, let's try to better understand the effect of the step size and its decay. 







In [None]:
dimVectors = train_input_embeddings[0].shape[0] # look up the feature dimension 
weights = 0.1 * np.random.randn(dimVectors, num_classes) # a random weight matrix 

features = {
    'train_input_embeddings': train_input_embeddings, 
    'train_labels': train_labels, 
    'dev_input_embeddings': dev_input_embeddings, 
    'dev_labels': dev_labels
}

decay = 0.95

for step in [0.002, 0.2, 2.0]:
  dev_cost_per_step, dev_accuracy_per_step, _, _ = trainer(decay, step, 
                                                           num_classes=2, 
                                                           num_steps=50, **features)
  
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))  
  # plot the cost
  ax1.plot(dev_cost_per_step)
  ax1.set_title("Validation Cost")
  ax1.set_xlabel("Step")
  ax1.set_ylabel("Cost")
  # plot the accuracy per step
  ax2.plot(dev_accuracy_per_step)
  ax2.set_title("Validation Accuracy")
  ax2.set_xlabel("Step")
  ax2.set_ylabel("Accuracy")
  # set title
  fig.suptitle(
      "Step Size Decay: %s, Initial Step Size: %s" % (decay, step))
  plt.show()


## **Question 6:** Explain the above plots in the context of the following plot shown in our lectures (no more than 5 sentences): 

![step-size.png](https://self-supervised.cs.jhu.edu/sp2023/files/step-size.png)

In [None]:
# your answer here

Let's also look into the effect of the **decay** parameter. 

In [None]:
dimVectors = train_input_embeddings[0].shape[0] # look up the feature dimension 
weights = 0.1 * np.random.randn(dimVectors, num_classes) # a random weight matrix 

features = {
    'train_input_embeddings': train_input_embeddings, 
    'train_labels': train_labels, 
    'dev_input_embeddings': dev_input_embeddings, 
    'dev_labels': dev_labels
}

step = 0.2

for decay in [0.8, 0.95, 1.0]:
  dev_cost_per_step, dev_accuracy_per_step, _, _ = trainer(decay, step, 
                                                           num_classes=2, 
                                                           num_steps=50, **features)
  
  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))  
  # plot the cost
  ax1.plot(dev_cost_per_step)
  ax1.set_title("Validation Cost")
  ax1.set_xlabel("Step")
  ax1.set_ylabel("Cost")
  # plot the accuracy per step
  ax2.plot(dev_accuracy_per_step)
  ax2.set_title("Validation Accuracy")
  ax2.set_xlabel("Step")
  ax2.set_ylabel("Accuracy")
  # set title
  fig.suptitle(
      "Step Size Decay: %s, Initial Step Size: %s" % (decay, step))
  plt.show()

As we can see in the results, reducing the learning rate over time via the decay parameter helps the model converge more efficiently and avoid getting stuck in suboptimal solutions. In the last plot, for example, where there is no decay, the optimization is stuck and oscillating. 
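
To see what the decay parameter does to the step size itself, the sketch below lists the step size used at each iteration of a loop shaped like the training loop above. The helper `step_size_schedule` is hypothetical and only mirrors the `step_size *= step_size_decay` update.

```python
def step_size_schedule(initial_step_size, decay, num_steps):
    # Hypothetical helper: record the step size used at each iteration,
    # multiplying by `decay` after every update as in the training loop.
    sizes = []
    step_size = initial_step_size
    for _ in range(num_steps):
        sizes.append(step_size)
        step_size *= decay
    return sizes

with_decay = step_size_schedule(0.2, 0.8, 50)
no_decay = step_size_schedule(0.2, 1.0, 50)

print("decay=0.8, first and last step size:", with_decay[0], with_decay[-1])
print("decay=1.0, first and last step size:", no_decay[0], no_decay[-1])
```

With a decay of 0.8 the updates become tiny after a few dozen steps, so late iterations barely move the weights; with a decay of 1.0 every update is as large as the first, which is consistent with the oscillation seen in the last plot.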

---

Okay, now that we know these hyperparameters matter, let's write a function for trying out a range of hyperparameters.  



In [None]:
# hyper-parameter sweep
def run_hyperparameter_tuning(features, log=True):
    best_selected_weights = None
    best_selected_cost = np.inf

    # hyper-parameter sweep
    for step in [0.02, 0.2, 2.0]:
        for decay in [0.8, 0.90, 0.95]:
            dev_cost_per_step, dev_accuracy_per_step, best_weights, best_dev_cost = \
                trainer(step_size_decay=decay, initial_step_size=step, num_classes=2, num_steps=100, **features)

            if best_selected_cost > best_dev_cost:
                if log: 
                  print(f" -> Revising the best cost on dev from {best_selected_cost} to {best_dev_cost} ")
                best_selected_cost = best_dev_cost
                best_selected_weights = best_weights
                

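    test_labels = features['test_labels']
    test_input_embeddings = features['test_input_embeddings']

    # dev-set predictions of the selected model (read the data from `features`, not from globals)
    _, _, best_pred, _ = softmax_classifier(features['dev_input_embeddings'], features['dev_labels'], best_selected_weights)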

    # evaluate the best model on the test set
    test_cost, _, pred, pred_prob = softmax_classifier(test_input_embeddings, test_labels, best_selected_weights)
    test_accuracy = accuracy(test_labels, pred)
    print("Test Cost: %s" % test_cost)
    print("Test Accuracy: %s" % test_accuracy)
    return test_accuracy
    

Let's do hyperparameter tuning on our data! 

In [None]:
features = {
    'train_input_embeddings': train_input_embeddings, 
    'train_labels': train_labels, 
    'dev_input_embeddings': dev_input_embeddings, 
    'dev_labels': dev_labels, 
    'test_input_embeddings': test_input_embeddings, 
    'test_labels': test_labels, 
}

run_hyperparameter_tuning(features)

Awesome, this is the performance of our classifier on the test set of this task. Notice that 
 - this performance is much higher than the random-chance performance of 50%, which is great news!
 - we have evaluated on the test set only once thus far. We should always minimize the number of times we see/evaluate the test set to avoid overfitting it.  

# Effect of word embeddings 

What happens if we use different word embeddings? We will look into this question now.


In [None]:
# note: this might take 15-30 minutes 

for embedding_type in ["glove-twitter-50", "glove-twitter-100", "glove-twitter-200", "word2vec-google-news-300"]: 
  print(f" ============= Embedding: {embedding_type} ============ ")
  embeddings = gensim.downloader.load(embedding_type)

  # re-extract the features using this particular embedding 
  train_input_embeddings = featurizer(train_dataset['text'], embeddings=embeddings)
  dev_input_embeddings = featurizer(dev_dataset['text'], embeddings=embeddings)
  test_input_embeddings = featurizer(test_dataset['text'], embeddings=embeddings)

  # package the embeddings for hyperparameter tuning 
  features = {
      'train_input_embeddings': train_input_embeddings, 
      'train_labels': train_labels, 
      'dev_input_embeddings': dev_input_embeddings, 
      'dev_labels': dev_labels, 
      'test_input_embeddings': test_input_embeddings, 
      'test_labels': test_labels, 
  }

  run_hyperparameter_tuning(features, log=False)

  


## **Question 7:** What are your takeaways from the above experiment on trying different embeddings? 


In [None]:
# your answer here

### Congratulations!

You've come to the end of this assignment. Here are the main points you should remember:

- Using pre-trained **word embeddings** from the internet is often a good way to get started. 
- The **Softmax classifier** is a simple, yet effective tool for multi-class problems. It is often used in conjunction with cross-entropy loss. 
- To make your implementation more efficient, use **matrix/vector operations** rather than scalar operators.  
- Careful **hyperparameter tuning** is necessary for any statistical learning problem. 
- It is essential to evaluate the test set as minimally as you can. Instead, use the held-out development set. 

Congratulations on finishing the graded portions of this notebook! 
