Week1_Class_Exercises_Solutions_Final.ipynb

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Week1_Class_Exercises_Solutions.ipynb",
      "version": "0.3.2",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "WoEfNTZKN-rL",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "# Week 1 Exercises\n",
        "\n",
        "Today we will explore gender bias in natural language processing. We will learn about our first models to probe gender bias in word vectors. As a reminder, word vectors are a machine's representation of a word, learned from reading a large corpus of text to understand the context that words are used in. For example, since the words \"good\" and \"great\" are used in similar contexts, they have similar word vectors!\n",
        "\n",
        "These kinds of word vectors are used in everything from Google Search to Spotify recommendations, so if they are biased, this is a major problem.\n",
        "\n",
        "Today we will be using GloVe vectors, which are a standard type of word vector used in a variety of real-world applications. These word vectors were trained on 6 billion word tokens, sourced from Wikipedia 2014 + Gigaword5. If you're interested you can find more information [here](https://nlp.stanford.edu/projects/glove/).\n",
        "\n",
        "Run the below cell by highlighting it and typing Shift+Enter. This will import the required packages and download the GloVe vectors, which will take a few minutes."
      ]
    },
    {
      "metadata": {
        "id": "W8-lxbAqN-rT",
        "colab_type": "code",
        "outputId": "124b5f83-196b-4f30-dccc-ef37271498fd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "cell_type": "code",
      "source": [
        "import torchtext.vocab as vocab\n",
        "import numpy as np\n",
        "np.random.seed(42)\n",
        "\n",
        "VEC_SIZE = 300\n",
        "glove = vocab.GloVe(name='6B', dim=VEC_SIZE)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            ".vector_cache/glove.6B.zip: 862MB [03:05, 4.65MB/s]                           \n",
            "100%|█████████▉| 399433/400000 [00:44<00:00, 8886.22it/s]"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "metadata": {
        "id": "gVD3d9d2N-rj",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Part 1\n",
        "Below we have included a short helper function that retrieves the word vector for a given word."
      ]
    },
    {
      "metadata": {
        "id": "dFWlGpoPN-ro",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "def get_word_vector(word):\n",
        "    return glove.vectors[glove.stoi[word]].numpy()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "XzL9mqgAN-r0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Observe the results of this helper function below. Notice that we are outputting a numpy array of dimensionality (300,). This means that the output is a 300-dimensional vector."
      ]
    },
    {
      "metadata": {
        "id": "UDHpptWmN-r5",
        "colab_type": "code",
        "outputId": "07dfcad0-524e-4907-876e-290eee024ac7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "cell_type": "code",
      "source": [
        "good = get_word_vector('good') # get the word vector for 'good'\n",
        "print(good.shape)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "(300,)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "l7t29BCSN-sF",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Below, use the above vector and the word vector for 'great' to determine the cosine similarity between 'good' and 'great'. Do the same for 'good' and 'human' (two words that are less similar). You'll need *np.linalg.norm* and *np.dot*."
      ]
    },
    {
      "metadata": {
        "id": "GXF-fJeqN-sJ",
        "colab_type": "code",
        "outputId": "fe64661e-39f6-4ff5-f0ce-6bb871094716",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "cell_type": "code",
      "source": [
        "great = get_word_vector('great') # YOUR CODE HERE\n",
        "human = get_word_vector('human') # YOUR CODE HERE\n",
        "def compute_cosine_similarity(a, b):\n",
        "    # YOUR CODE HERE:\n",
        "    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n",
        "    # END CODE\n",
        "\n",
        "print(\"Good-great similarity %f\" % compute_cosine_similarity(good, great))\n",
        "print(\"Good-human similarity %f\" % compute_cosine_similarity(good, human))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Good-great similarity 0.641005\n",
            "Good-human similarity 0.313640\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "AEGzfSzoN-sT",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Expected output:**\n",
        "\n",
        "Good-great similarity 0.641005\n",
        "\n",
        "Good-human similarity 0.313640\n",
        "\n",
        "Now, use our helper function to retrieve the \"gender vector\", or the vector representing 'woman' minus the vector representing 'man' (woman - man). "
      ]
    },
    {
      "metadata": {
        "id": "WREnsVcoN-sW",
        "colab_type": "code",
        "outputId": "298323b4-9833-485a-e041-3c6dc7356295",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "cell_type": "code",
      "source": [
        "# YOUR CODE HERE\n",
        "man = get_word_vector('man')\n",
        "woman = get_word_vector('woman')\n",
        "gender_vector = woman - man\n",
        "# END CODE\n",
        "print('First value of gender vector: %f ' % gender_vector[0])\n",
        "print('Shape of gender vector: ', gender_vector.shape)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "First value of gender vector: -0.220370 \n",
            "Shape of gender vector:  (300,)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "RDkIMIGAN-sc",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Expected output:**\n",
        "\n",
        "First value of gender vector: -0.220370 \n",
        "\n",
        "Shape of gender vector:  (300,)"
      ]
    },
    {
      "metadata": {
        "id": "_RvDXEQRN-se",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Now fill in the below function that computes linear regression on any word. Use the gender_vector to provide weights (*w*), and do not use a bias term (*b*). You'll need our helper function *get_word_vector* and *np.dot*."
      ]
    },
    {
      "metadata": {
        "id": "BqW9r-02N-sg",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "def compute_linear_regression(word):\n",
        "    # YOUR CODE HERE\n",
        "    word_vector = get_word_vector(word)\n",
        "    return np.dot(gender_vector, word_vector)\n",
        "    # END CODE"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "M00UPbiIN-sm",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Check to make sure your model matches the expected output for 'programmer':"
      ]
    },
    {
      "metadata": {
        "id": "TidgH9aGN-so",
        "colab_type": "code",
        "outputId": "cee4fe44-11bf-4365-cf3a-873fb71c311a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "cell_type": "code",
      "source": [
        "compute_linear_regression('programmer')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "-1.0347012"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "metadata": {
        "id": "sdUSqEoyN-ss",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Expected output:**\n",
        "-1.0347012"
      ]
    },
    {
      "metadata": {
        "id": "zy5XK6a3N-st",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Feel free to play around with the model by changing the input word in the below cell! How does the score for 'programmer' compare to the score for 'nurse'? For 'homemaker'? What does this tell us about our word vectors?"
      ]
    },
    {
      "metadata": {
        "id": "I8n4qzRoN-sv",
        "colab_type": "code",
        "outputId": "3e7b6609-a54e-495f-df77-5eb58dd8fc90",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "cell_type": "code",
      "source": [
        "compute_linear_regression('nurse')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "8.855267"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 13
        }
      ]
    },
    {
      "metadata": {
        "id": "iKMNBactN-sy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "## Part 2\n",
        "\n",
        "Now we will build a more sophisticated logistic regression model to make predictions on our word vectors. Eventually, this model will actually learn the weights (*w*) and bias (*b*) by itself! It will also output a probability between 0 and 1 that a word is associated with females. (1 represents a word that is very 'female' according to our word vectors, 0 represents a word that is not) \n",
        "\n",
        "All we will do is tell this model that 1 represents female and 0 represents male, and, alarmingly, bias from our word vectors will transfer to our model.\n",
        "\n",
        "For these in-class exercises, you will build this model and learn how to train it. For homework, you will actually train it and see the results."
      ]
    },
    {
      "metadata": {
        "id": "CncvkY9qN-s0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "As a warmup, fill in the below function to calculate the sigmoid of a scalar value. Use *np.exp* instead of python's built-in."
      ]
    },
    {
      "metadata": {
        "id": "em_XviO0N-s1",
        "colab_type": "code",
        "outputId": "436853f1-ac28-4655-d63e-a7cb63d80f2d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "cell_type": "code",
      "source": [
        "def sigmoid(z):\n",
        "    # YOUR CODE HERE\n",
        "    return 1.0 / (1 + np.exp(-z))\n",
        "    # END CODE\n",
        "print(\"sigmoid(0.5) is %f\" % sigmoid(0.5))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "sigmoid(0.5) is 0.622459\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "eC3t0MeEN-s5",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Expected output:**\n",
        "\n",
        "sigmoid(0.5) is 0.622459\n",
        "\n",
        "Next, fill in the below function to compute logistic regression on a word given weights and bias. Note that you are not training the model yet, just computing what is known as the \"forward pass\". This should look similar to your *compute_linear_regression* function, but you are using the weights and bias given instead of the *gender_vector*, and you are using sigmoid to produce the final output."
      ]
    },
    {
      "metadata": {
        "id": "PW5__i7gN-s6",
        "colab_type": "code",
        "outputId": "48cbad17-8554-4f84-b60e-7aaaa1f4e12a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "cell_type": "code",
      "source": [
        "def compute_logistic_regression(word, weights, bias):\n",
        "    # YOUR CODE HERE\n",
        "    word_vector = get_word_vector(word)\n",
        "    return sigmoid(np.dot(weights, word_vector) + bias)\n",
        "    # END CODE\n",
        "\n",
        "np.random.seed(42)\n",
        "rand_weights = np.random.randn(VEC_SIZE)\n",
        "rand_bias = np.random.rand()\n",
        "print(\"Predicted output: %f\" % compute_logistic_regression('hello', rand_weights, rand_bias))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Predicted output: 0.000089\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "1U1zCmIoN-s-",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "**Expected output:**\n",
        "\n",
        "Predicted output: 0.000089\n",
        "\n",
        "Don't read too much into this output, it was randomly generated. \n",
        "\n"
      ]
    },
    {
      "metadata": {
        "id": "oHSlhEhli031",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        "Congratulations on completing the first set of class exercises!\n",
        "\n",
        "For homework, we will use the functions you just wrote to show that bias transfers from the word vectors to models trained on them."
      ]
    },
    {
      "metadata": {
        "id": "REEggTtOHvFh",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}