# Intro to NLP

## What is NLP?

> Natural language processing, or NLP, is the use of human languages by a computer.

With language being the primary mode of human communication, there is an abundance of use cases for NLP.

Examples include:
- Sentiment analysis
- Question answering
- Translation
- Generating text

_Note that speech data is not language: it is numerical audio waveforms that represent language, and it can be turned into language (text data) by transcription._

## What makes text processing different and hard?

- Language has lots of nuance
    - Words can have different meanings in different contexts
- Language data is not naturally numerical, and so can't immediately be processed mathematically

## Essential Tools and Concepts within NLP

### What is a corpus?

A corpus is a body of text that represents your data. One classic example would be the [Gutenberg corpus](https://zenodo.org/record/2422561#.Y8NpV-zP06E), which contains the text of over 50000 books.

### What is a token?

A token is an atomic unit of text. In most cases, you can think of tokens as individual words. In other cases, a token may be a common part of a word, like a suffix, or even an individual character.

Example tokens:
- the word "probable"
- the character "h"
    - Character-level tokens are less common, as most NLP is done at the word or subword level
- the sequence of characters "ing", which is a common word suffix
    - You will commonly see this represented as a token "##ing", where the "##" indicates that this token is preceded by other characters in the same word
    - Similarly, a tokenisation scheme can mark tokens which appear at the start of words, like "pre##", or in the middle of words, like "##ab##"
    - The tokens you end up with depend on how you turn your raw text into tokens, through the process known as _tokenisation_

Going forward, you can think about tokens as individual words, which is what they are in most cases. Note that you will probably see the words "token" and "word" used interchangeably.

### What is a tokeniser?

A tokeniser is a function that takes in raw text and turns it into a sequence of tokens.
This process is called _tokenisation_.
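
As a rough illustration, a tokeniser can be sketched in a few lines. The sketch below assumes a tiny hand-written vocabulary (`TOY_VOCAB` is hypothetical, not BERT's real vocab) and splits each word greedily into the longest pieces it can find, in the style of WordPiece. Real tokenisers handle punctuation, casing, and unknown characters with more care.

```python
import re

# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"how", "does", "this", "get", "token", "##ise", "##ised", "##d", "[UNK]"}


def split_word(word, vocab):
    """Greedily split one word into the longest pieces found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: treat the whole word as unknown.
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def tokenise(text, vocab=TOY_VOCAB):
    """Lowercase, keep runs of letters as words, then split each word into pieces."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens = []
    for word in words:
        tokens.extend(split_word(word, vocab))
    return tokens


print(tokenise("How does this get tokenised?"))
# ['how', 'does', 'this', 'get', 'token', '##ised']
```

Note how the word "tokenised", which is not in the toy vocabulary as a whole, is still representable as two known pieces.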

## TODO diagram

### What is a vocab?

A vocab (short for vocabulary) is an assignment of an integer index to each distinct token.
If you imagine a list of tokens, the index of each token is the position of that token in the list.
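
A minimal sketch of this idea, assuming the tokens are already available as a Python list (the function name `build_vocab` and the example sentence are made up for illustration):

```python
def build_vocab(tokens):
    """Assign each distinct token the index of its first appearance."""
    token_to_index = {}
    for token in tokens:
        if token not in token_to_index:
            token_to_index[token] = len(token_to_index)
    return token_to_index


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = build_vocab(tokens)

# The reverse mapping is just the list of tokens in index order.
index_to_token = list(vocab)

# Any sequence of tokens can now be written as a sequence of integers.
indices = [vocab[token] for token in tokens]

print(vocab)           # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(indices)         # [0, 1, 2, 3, 0, 4]
print(index_to_token)  # ['the', 'cat', 'sat', 'on', 'mat']
```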

## TODO diagram

## How do we represent words?

### Word indexes

As indicated by the vocab, we can represent each token mathematically by assigning it an integer index.

An alternative way to represent that index is with a vector that is as long as the number of tokens in your vocab, and that contains zeros everywhere except for a one in the position given by the token's index.

# TODO !(1-hot vector)[1-hot vector.png]

We call this vector a _1-hot vector_, or a _one-hot encoding_ of the token.
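
For illustration, a 1-hot vector can be built from an index like this. The five-token vocab is a made-up example, and plain Python lists stand in for the arrays or tensors you would use in practice.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single one at the given index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


# Hypothetical toy vocab.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

print(one_hot(vocab["sat"], len(vocab)))  # [0, 0, 1, 0, 0]
```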

### Why does the 1-hot representation make more sense mathematically than the index?

The one-hot vector makes more sense mathematically than the index because the index implies that the tokens sit on the same number line, which they do not. The token with index 2 is not necessarily between tokens 1 and 3, and the token with index 100 is not bigger than the token with index 1.

### The problem with 1-hot encodings

#### Similar words do not have similar representations

- Every pair of distinct 1-hot vectors is orthogonal: their dot product is zero
- So the vectors for two related words are no more similar than the vectors for two unrelated words

#### The length of your 1-hot encodings increases with every new token
- 1-hot encodings contain an element for every possible word, so for larger corpora, with more tokens, they are longer
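
Both problems can be checked on a small made-up vocab. In the sketch below the dot product is used as the similarity measure; the words and helper names are illustrative assumptions.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single one at the given index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vocab: two similar words and one opposite.
vocab = {"good": 0, "great": 1, "terrible": 2}
vectors = {word: one_hot(index, len(vocab)) for word, index in vocab.items()}

print(dot(vectors["good"], vectors["great"]))     # 0
print(dot(vectors["good"], vectors["terrible"]))  # 0
print(dot(vectors["good"], vectors["good"]))      # 1

# Adding one token to the vocab makes every vector one element longer.
vocab["awful"] = len(vocab)
print(len(one_hot(vocab["good"], len(vocab))))    # 4
```

As far as these vectors are concerned, "good" is exactly as unrelated to "great" as it is to "terrible".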

# TODO diagram of 1-hot word vectors

> Overall, we want to avoid using 1-hot encodings to represent our words and try something else: word embeddings

### Word embeddings

Word embeddings are vector representations of tokens that capture what each token means, so that tokens with similar meanings have similar vectors.

# TODO diagram

Where 1-hot encodings are "sparse", containing mostly zeros, word embeddings are "dense": most elements are non-zero, and the vectors are usually far shorter than the size of the vocab.
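
To make the contrast concrete, here is a sketch with hand-picked dense vectors. The numbers in `toy_embeddings` are invented for illustration and were not learnt from any data. Unlike 1-hot vectors, dense vectors can be more or less similar to each other, which cosine similarity measures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, chosen by hand.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [1.0, 0.7, 0.2],
    "terrible": [-0.8, -0.9, 0.1],
}

similar = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
opposite = cosine_similarity(toy_embeddings["good"], toy_embeddings["terrible"])

print(f"good vs great:    {similar:.2f}")
print(f"good vs terrible: {opposite:.2f}")
```

The first similarity comes out close to 1 and the second is negative, which is the kind of structure a learnt embedding is meant to provide.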

# TODO diagram

## Word embeddings can be learnt in a number of ways

### Directly maximise the vector similarity between words that appear in similar context - The Word2Vec algorithm

Word2vec, introduced in 2013, was one of the first widely adopted algorithms for creating meaningful vector representations of words.

It is based on the assumption that similar words appear in similar contexts.

The famous 1957 quote from the linguist J. R. Firth that captures this assumption is: "you shall know a word by the company it keeps".

At a high level, this is how it works:
- Initialise random embeddings
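- Take pairs of words that appear close together (within a threshold distance)
- Calculate the similarity of their vectors (for example, their cosine similarity)
- Adjust the embeddings by gradient descent so that this similarity increases for words that appear together (in practice, randomly sampled pairs are also pushed apart, so that the vectors do not all collapse to the same point)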
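
The steps above can be sketched in plain Python. This is a toy in the spirit of skip-gram with negative sampling, not a faithful word2vec implementation. It assumes a twelve-word made-up corpus, scores a pair with the dot product passed through a sigmoid rather than cosine similarity, and samples negatives uniformly. All hyperparameter values are arbitrary.

```python
import math
import random

random.seed(0)

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = {token: i for i, token in enumerate(dict.fromkeys(corpus))}
ids = [vocab[token] for token in corpus]

dim, window, n_negatives, lr = 8, 2, 2, 0.05

# Two tables: one vector per word as a centre word, one as a context word.
centre_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
context_vecs = [[0.0] * dim for _ in vocab]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_pair(centre, context, label):
    """One gradient step on the cross-entropy loss of a pair; returns the loss."""
    v, u = centre_vecs[centre], context_vecs[context]
    score = sum(a * b for a, b in zip(v, u))
    prediction = sigmoid(score)
    error = prediction - label  # gradient of the loss w.r.t. the score
    for k in range(dim):
        v_k = v[k]
        v[k] -= lr * error * u[k]
        u[k] -= lr * error * v_k
    return -math.log(prediction if label == 1 else 1.0 - prediction)


epoch_losses = []
for epoch in range(200):
    total, count = 0.0, 0
    for position, centre in enumerate(ids):
        lo = max(0, position - window)
        hi = min(len(ids), position + window + 1)
        for other in range(lo, hi):
            if other == position:
                continue
            # Pull a real (centre, context) pair together...
            total += train_pair(centre, ids[other], 1)
            # ...and push a few randomly sampled pairs apart.
            for _ in range(n_negatives):
                total += train_pair(centre, random.randrange(len(vocab)), 0)
            count += 1
    epoch_losses.append(total / count)

print(f"mean loss: first epoch {epoch_losses[0]:.3f}, last epoch {epoch_losses[-1]:.3f}")
```

After training, the rows of `centre_vecs` are the learnt embeddings. On a corpus this small they are not meaningful, but the falling loss shows the mechanism at work.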

# TODO diagram

### They can be learnt for a specific problem

Alternatively, word representations can be learnt from scratch by solving a specific downstream task, such as sentiment analysis.

In this setup, the embeddings are simply a part of the model parameters, like the other weights and biases. The input to the model is the integer indexes of the tokens. The first layer of the model is an embedding layer, which uses each index to look up a row of the embedding matrix to use as that token's word embedding. The output is whatever is required for the task, such as a classification for sentiment analysis.
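
A minimal sketch of such a model, written in plain Python so that the lookup and the gradient flow are visible. The three-token vocab, the two training examples, the averaging of embeddings, and the learning rate are all assumptions made for illustration. In practice you would use a framework's embedding layer instead.

```python
import math
import random

random.seed(0)

# Hypothetical toy vocab and labelled examples (1 = positive, 0 = negative).
vocab = {"good": 0, "bad": 1, "film": 2}
examples = [(["good", "film"], 1), (["bad", "film"], 0)]

dim = 4
# Trainable parameters: one embedding row per token, plus a linear classifier.
embedding_matrix = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
weights = [random.uniform(-0.1, 0.1) for _ in range(dim)]
bias = 0.0


def embed_and_pool(token_ids):
    """Embedding layer: look up one row per index, then average the rows."""
    rows = [embedding_matrix[i] for i in token_ids]
    return [sum(column) / len(rows) for column in zip(*rows)]


def predict(token_ids):
    """Probability that the input is positive."""
    pooled = embed_and_pool(token_ids)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def train_step(token_ids, label, lr=0.5):
    """One gradient step on the cross-entropy loss, updating every parameter."""
    global bias
    pooled = embed_and_pool(token_ids)
    error = predict(token_ids) - label  # gradient of the loss w.r.t. the logit
    for k in range(dim):
        grad_pooled = error * weights[k]
        weights[k] -= lr * error * pooled[k]
        for i in token_ids:
            # Only the rows of tokens in this input receive a gradient.
            embedding_matrix[i][k] -= lr * grad_pooled / len(token_ids)
    bias -= lr * error


for _ in range(200):
    for words, label in examples:
        train_step([vocab[w] for w in words], label)

for words, label in examples:
    print(words, label, round(predict([vocab[w] for w in words]), 3))
```

The embeddings for "good" and "bad" start out as random noise and end up pointing in different directions, purely because that helps the classifier separate the two examples.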

## TODO diagram

Learning word representations in this way can produce problem-specific representations. For example, words might have different representations in a 

Note that however you learn embeddings, what they represent will be determined by the data they are learnt from.

One widely used model is BERT, which learns representations of words from a domain-agnostic language modelling problem.

## Pre-trained word embeddings

Learning meaningful word representations can take a lot of time and compute. 
Thankfully, we can take the embeddings learnt by others straight off the shelf.

One of the most influential machine learning models 

It's not important to understand BERT at this point, but for now:
- BERT stands for Bidirectional Encoder Representations from Transformers
- It is trained to fill in masked-out words in text (masked language modelling)
- It contains the word embeddings within its first layer's parameters. These BERT embeddings are widely used as a good starting point for word embeddings.

We will talk about BERT more later, but we can already start using it.

Let's start by downloading the model.

In [5]:
!pip install transformers



In [6]:
from transformers import BertModel

model_name = 'bert-base-uncased'
model = BertModel.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


You can see the layers that the model contains by printing the model.

In [7]:
print(model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(

You can see that the very first layer is the embedding layer. Its `word_embeddings` parameters are the embeddings for the 30522 tokens which BERT recognises, each a vector of length 768.

Now, let's get those embeddings.

In [8]:
n_embeddings = 30000

embedding_matrix = model.embeddings.word_embeddings.weight.detach()

embedding_matrix = embedding_matrix[:n_embeddings]
# print(embedding_matrix)
print("Embedding shape:", embedding_matrix.shape)

Embedding shape: torch.Size([30000, 768])


Now we have the embedding matrix, but we don't know which word each of those embeddings corresponds to. This is where we need to use the vocab to map from the index of the word (its row in the embedding matrix) to the word itself.

In HuggingFace, the vocab is accessible through the tokeniser. In the same way that we loaded in a pre-trained BERT model, we can load in the corresponding tokeniser.

Check out the docs here.

In [9]:
from transformers import BertTokenizer

bert_tokenizer = BertTokenizer.from_pretrained(model_name)
sentence = "How does this sentence get tokenised?"
tokens = bert_tokenizer.encode(sentence)

print(tokens)

[101, 2129, 2515, 2023, 6251, 2131, 19204, 5084, 1029, 102]


In [10]:
from torch.utils.tensorboard import SummaryWriter
from time import time


def create_embedding_labels():
    # Each entry adds one metadata column; add new label functions here
    label_functions = {
        "Length": lambda word: len(word),
        "# vowels": lambda word: len([char for char in word if char in "aeiou"]),
        "is number": lambda word: word.isdigit(), # boolean label for numbers
        # "is preposition": lambda word: word in prepositions
    }
    labels = [
        [
            word,
            *[label_function(word) for label_function in label_functions.values()]
        ]
        for word in list(bert_tokenizer.ids_to_tokens.values())[:n_embeddings]
    ]

    label_names = ["Word", *list(label_functions.keys())]

    return labels, label_names


def visualise_embeddings(embeddings, labels=None, label_names="Label"):
    print("Embedding")

    # With no log_dir argument, SummaryWriter writes under ./runs
    writer = SummaryWriter()
    start = time()
    writer.add_embedding(
        mat=embeddings,
        metadata=labels,
        metadata_header=label_names
    )
    writer.close()  # flush everything to disk before opening TensorBoard
    print("Total time:", time() - start)

    print("Embedding done")

labels, label_names = create_embedding_labels()
visualise_embeddings(embedding_matrix, labels, label_names)

Embedding
Total time: 37.8265860080719
Embedding done


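Now, open TensorBoard by running the cell below. `SummaryWriter()` was created without a `log_dir`, so it writes to the default `runs` directory, which is where `--logdir` needs to point.


In [2]:
%load_ext tensorboard
%tensorboard --logdir runs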

_Note that this is a 3D projection of much higher dimensional embeddings, so most information is lost when we visualise it._

## Fine-tuning word representations

Word representations can be taken off the shelf and plugged into an otherwise untrained model before being further updated by training that model from end to end.

# TODO move to another notebook - great RNN example


## Overview of the text processing pipeline so far
- create tokeniser and vocab
- turn each word into a word embedding


## TODO image


## Where are we on the map of our journey to ChatGPT?
- we now know how to represent text data numerically

## TODO image