# Week 3 Class Exercises: Training Logistic Regression

This week, we are breaking down how training (fitting) a logistic regression model (our mystery function) works, using our running example of probing bias in word vectors.


In [0]:
import torchtext.vocab as vocab
import numpy as np
import requests
import zipfile
import io
# Download class resources...
r = requests.get("http://web.stanford.edu/class/cs21si/resources/unit2_resources.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

np.random.seed(42)

VEC_SIZE = 300
glove = vocab.GloVe(name='6B', dim=VEC_SIZE)


Here's what we introduced the first week. Review it and make sure you still understand it!

In [0]:
def get_word_vector(word):
    return glove.vectors[glove.stoi[word]].numpy()
  
def compute_cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
  
man = get_word_vector('man')
woman = get_word_vector('woman')
gender_vector = woman - man

def compute_linear_regression(word):
    word_vector = get_word_vector(word)
    return np.dot(gender_vector, word_vector)
  
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))
  
def compute_logistic_regression(word, weights, bias):
    word_vector = get_word_vector(word)
    return sigmoid(np.dot(weights, word_vector) + bias)


## Part 1

We need to calculate the loss. As mentioned in lecture, the loss is a way to quantify how badly we are doing at making predictions. Once we can measure how well we are doing, we can fit a model that minimizes the loss. Finish the function below for calculating the loss, which is a function of $y$ and $\hat{y}$:

In [0]:
def get_loss(y, y_hat):
  ### YOUR CODE HERE ###
  return -1 * (1 - y) * np.log(1 - y_hat) - y * np.log(y_hat)
  ### END CODE ###
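
To build intuition for the loss above, $L(y, \hat{y}) = -(1 - y)\log(1 - \hat{y}) - y\log(\hat{y})$, here is a small standalone sketch that evaluates it for a few predictions when the true label is 1. The name `binary_cross_entropy` is only an illustrative stand-in for `get_loss`, and the prediction values are made up for the example.

```python
import numpy as np

def binary_cross_entropy(y, y_hat):
    # Same formula as get_loss above, repeated so this sketch runs on its own.
    return -(1 - y) * np.log(1 - y_hat) - y * np.log(y_hat)

# True label is 1 in all three cases; only the prediction changes.
confident_right = binary_cross_entropy(1, 0.9)
unsure = binary_cross_entropy(1, 0.5)
confident_wrong = binary_cross_entropy(1, 0.1)

print("y_hat = 0.9 -> loss %.3f" % confident_right)  # 0.105
print("y_hat = 0.5 -> loss %.3f" % unsure)           # 0.693
print("y_hat = 0.1 -> loss %.3f" % confident_wrong)  # 2.303
```

The loss grows quickly as a prediction becomes confidently wrong. Note that a prediction of exactly 0 or 1 would make one of the logarithms infinite, so this sketch assumes $\hat{y}$ stays strictly between 0 and 1.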

## Part 2

We need to calculate the gradients of the loss with respect to the weights and bias so that we can move the weights and bias in the direction opposite the gradient (which decreases the loss). This is called gradient descent. Finish the functions below for calculating the gradients. Note that you might not need all of $y$, $\hat{y}$, and *word* to calculate both gradients.

In [0]:
def get_weight_gradient(y, y_hat, word):
  ### YOUR CODE HERE ###
  return get_word_vector(word) * (y_hat - y)
  ### END CODE ###
  
def get_bias_gradient(y, y_hat, word):
  ### YOUR CODE HERE ###
  return y_hat - y
  ### END CODE ###
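
One way to gain confidence in the gradient formulas above, $\partial L / \partial w = (\hat{y} - y)\,x$ and $\partial L / \partial b = \hat{y} - y$, is to compare them against finite-difference estimates. The sketch below does this on a small random vector rather than a real GloVe vector; `loss_fn`, the 5-dimensional input, and the step size `eps` are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def loss_fn(w, b, x, y):
    # Logistic regression loss for one example (hypothetical helper).
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(1 - y) * np.log(1 - y_hat) - y * np.log(y_hat)

rng = np.random.default_rng(0)
x = rng.normal(size=5)  # stand-in for a word vector
w = rng.normal(size=5)
b = 0.1
y = 1

# Analytic gradients, using the same formulas as the functions above.
y_hat = sigmoid(np.dot(w, x) + b)
analytic_dw = x * (y_hat - y)
analytic_db = y_hat - y

# Central finite differences: nudge one parameter at a time.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric_dw[i] = (loss_fn(w + step, b, x, y) - loss_fn(w - step, b, x, y)) / (2 * eps)
numeric_db = (loss_fn(w, b + eps, x, y) - loss_fn(w, b - eps, x, y)) / (2 * eps)

max_dw_error = np.max(np.abs(analytic_dw - numeric_dw))
db_error = abs(analytic_db - numeric_db)
print("largest weight-gradient difference:", max_dw_error)
print("bias-gradient difference:", db_error)
```

Both differences should be tiny. A large difference would suggest a sign error or a missing term in the analytic gradient.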

## Part 3

How do we actually train *weights* and *bias*? For our first and only "random guess", we can initialize *weights* randomly and *bias* as 0. For each training example, we then move both in the direction opposite the gradient of the loss with respect to each. Since the gradient points in the direction of steepest 'upward slope', a small step against it decreases the loss. Note that we loop over the training set *NUM_EPOCHS*=1000 times so that the model has more time to learn the training set.
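
As a rough illustration of a single update, the sketch below takes one gradient step on one made-up example and compares the loss before and after. The random 4-dimensional vector stands in for a word vector, and `example_loss` is a hypothetical helper, not part of the exercise code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def example_loss(w, b, x, y):
    # Loss for a single (x, y) pair (hypothetical helper).
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(1 - y) * np.log(1 - y_hat) - y * np.log(y_hat)

rng = np.random.default_rng(0)
x = rng.normal(size=4)  # stand-in for a word vector
y = 1
w = rng.normal(size=4)  # random initial weights
b = 0.0                 # bias starts at 0
learning_rate = 0.001

loss_before = example_loss(w, b, x, y)

# Gradients for this one example.
y_hat = sigmoid(np.dot(w, x) + b)
dw = x * (y_hat - y)
db = y_hat - y

# Step against the gradient.
w_new = w - learning_rate * dw
b_new = b - learning_rate * db

loss_after = example_loss(w_new, b_new, x, y)
print("loss before: %f, loss after: %f" % (loss_before, loss_after))
```

With a small learning rate, the loss on this example should drop slightly after the step. Training repeats this step for every example in every epoch.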

You will need *np.random.randn* (see previous cell for usage example), *np.log* (to calculate loss), and our helper functions. 

Don't worry if you don't finish this in class; we expect that!

**Some hints:**

Initialize *weights* using *np.random.randn* and bias to 0.

You are first computing the prediction *pred* (or *y_hat*), then using this to compute the loss, then computing the gradients, then using them to make weight updates.

When computing loss, accumulate loss over each epoch using the '+=' operator, so the final loss printed per epoch is the sum of the losses for each training example.

Notation note: dw and db are short for the partial derivatives of the loss with respect to w and b, respectively.

Use the *LEARNING_RATE* provided to update the parameters.


In [0]:
def fit_logistic_regression(training_data, NUM_EPOCHS=1000, LEARNING_RATE=0.001):
    np.random.seed(42)
    # YOUR CODE HERE - initialize weights and bias
    weights = np.random.randn(VEC_SIZE) 
    bias = 0
    # END CODE
    
    for epoch in range(NUM_EPOCHS):
        loss = 0
        for example in training_data:
            x, y = example
            # YOUR CODE HERE
            y_hat = compute_logistic_regression(x, weights, bias)
            loss += get_loss(y, y_hat)
            
            db = get_bias_gradient(y, y_hat, x)
            dw = get_weight_gradient(y, y_hat, x)
            
            weights -= LEARNING_RATE * dw
            bias -= LEARNING_RATE * db
            # END CODE
        if epoch % 100 == 0:
            print("Epoch %d, loss = %f" % (epoch, loss))   
    return weights, bias

By looping through each training example and computing gradients for each one individually, we are performing what is called Stochastic Gradient Descent (SGD). We'll learn more about this in the coming weeks.

Test your implementation to see if its results match ours:

In [0]:
toy_examples = [('boy', 0), ('girl', 1)]
weights, bias = fit_logistic_regression(toy_examples)
print("First value in weights is %f" % weights[0])
print("Bias is %f" % bias)

Epoch 0, loss = 3.843859
Epoch 100, loss = 1.420667
Epoch 200, loss = 1.006021
Epoch 300, loss = 0.816709
Epoch 400, loss = 0.680780
Epoch 500, loss = 0.578991
Epoch 600, loss = 0.500879
Epoch 700, loss = 0.439593
Epoch 800, loss = 0.390541
Epoch 900, loss = 0.350585
First value in weights is 0.396410
Bias is 0.083128


**Expected output:**

Epoch 0, loss = 3.843859

Epoch 100, loss = 1.420667

Epoch 200, loss = 1.006021

Epoch 300, loss = 0.816709

Epoch 400, loss = 0.680780

Epoch 500, loss = 0.578991

Epoch 600, loss = 0.500879

Epoch 700, loss = 0.439593

Epoch 800, loss = 0.390541

Epoch 900, loss = 0.350585

First value in weights is 0.396410

Bias is 0.083128

You've just built a working logistic regression model from scratch. This is actually equivalent to a 1-layer neural network! See that this produces the same results as before when trained on our full data:

In [0]:
def read_train_examples():
    with open('unit2_resources/train.txt', 'r') as f:
        raw_text = f.read()
        lines = raw_text.split('\n')
        # Skip blank lines (e.g. from a trailing newline) so line[0] below never fails
        examples = [line.split() for line in lines if line.strip()]
        examples = [(line[0], int(line[1])) for line in examples]
    return examples
  

examples = read_train_examples()  
weights, bias = fit_logistic_regression(examples)

Epoch 0, loss = 127.387626
Epoch 100, loss = 21.641045
Epoch 200, loss = 8.226585
Epoch 300, loss = 4.346964
Epoch 400, loss = 2.740866
Epoch 500, loss = 1.956233
Epoch 600, loss = 1.521469
Epoch 700, loss = 1.250360
Epoch 800, loss = 1.065492
Epoch 900, loss = 0.931076


In [0]:
def print_test_output(test_examples, weights, bias):
    for test_example in test_examples:
        pred = compute_logistic_regression(test_example, weights, bias)
        print("%s is %s" % (test_example, 'male' if pred < .5 else 'female'))
        
print_test_output(['nurse', 'homemaker', 'carpenter', 'surgeon', 'doctor', 'artist', 
                   'engineer', 'entrepreneur', 'genius', 'intellectual', 'chef', 'cook', 
                   'maid', 'teacher', 'boss', 'manager', 'founder'], weights, bias)

nurse is female
homemaker is female
carpenter is male
surgeon is male
doctor is male
artist is female
engineer is male
entrepreneur is male
genius is male
intellectual is male
chef is male
cook is female
maid is female
teacher is female
boss is male
manager is male
founder is male


Try playing around with NUM_EPOCHS and LEARNING_RATE to see how this affects the loss values printed during training. These values, which are not learned during training but impact the performance of the model, are called hyperparameters. We'll see more examples of hyperparameters in coming weeks.
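
If you want to experiment without waiting on the full training set, a sketch along these lines compares a few learning rates on two synthetic examples. Everything here is an assumption made for illustration: `final_epoch_loss` is a hypothetical, condensed version of the training loop, the 10-dimensional random vectors replace GloVe vectors, and the three learning rates and 200 epochs are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def final_epoch_loss(data, num_epochs, learning_rate, dim=10):
    # Condensed SGD loop; returns the loss summed over the last epoch.
    rng = np.random.default_rng(42)  # same starting weights for every run
    weights = rng.normal(size=dim)
    bias = 0.0
    for _ in range(num_epochs):
        loss = 0.0
        for x, y in data:
            y_hat = sigmoid(np.dot(weights, x) + bias)
            loss += -(1 - y) * np.log(1 - y_hat) - y * np.log(y_hat)
            weights -= learning_rate * x * (y_hat - y)
            bias -= learning_rate * (y_hat - y)
    return loss

# Two synthetic training examples, one per class.
data_rng = np.random.default_rng(0)
synthetic_data = [(data_rng.normal(size=10), 0), (data_rng.normal(size=10), 1)]

results = {}
for lr in (0.0001, 0.001, 0.01):
    results[lr] = final_epoch_loss(synthetic_data, num_epochs=200, learning_rate=lr)
    print("LEARNING_RATE=%g, final loss = %f" % (lr, results[lr]))
```

On this toy problem, the larger learning rates should reach a lower loss in the same number of epochs. That will not hold in general: a learning rate that is too large can make the loss oscillate or grow.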

Congratulations on completing the notebook! You learned how to train a complete 1-layer neural network. For homework, we'll see that the tools you just built are useful beyond word vectors; your code naturally extends to a social good problem in a completely different domain.