# Introductory Lab

### Aims
* Install the required libraries and refamiliarise yourself with Python and Jupyter notebooks if you need to
* Understand regular expressions
* Carry out tokenisation steps

### Outline

* Getting started: libraries, how to install them, Jupyter notebooks introduction
* Acquiring dialogue data
* Regular expressions
* Tokenisation

### How To Complete This Lab

Read the text and the code then look for 'TODOs' that instruct you to complete some missing code. You don't have to stick rigidly to the lab -- feel free to explore other methods and data to help you understand what's going on or to go beyond this lab. 

Aim to work through the lab during the scheduled lab hours. You can also post your questions to our Team's general channel throughout the week.

The labs *will not be marked*. However, they will prepare you for the coursework, so try to keep up with the weekly labs and have fun with the exercises!
      
### Additional Exercises

If you would like to do more lab exercises or would like an alternative explanation, please see Chapters 1-3 in the NLTK book, which goes into more detail than we do here. https://www.nltk.org/book/ 

## 1. Getting Started

This lab assumes you have used Python and Jupyter Notebooks before. For an introduction or refresher on Python, see the [Introduction to Python lab](https://github.com/UoB-COMS21202/lab_sheets_public/tree/master/lab_1) or the University of Bristol [Beginning Python](https://milliams.gitlab.io/beginning_python/) course. If you are a beginner with Python, you might also like to look at Chapter 1 in the NLTK book, which also provides a guide for "getting started with Python": https://www.nltk.org/book/ 

You will need to use Python 3, not Python 2. Specifically Python 3.6 or newer is recommended.

The following libraries will be used in this lab. You will also need to know how to install new packages using conda or pip (a virtual environment is recommended), because further packages will come up in later labs.

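- [Transformers](https://huggingface.co/transformers/index.html).
- [Datasets](https://huggingface.co/docs/datasets/loading_datasets.html), used to load the dialogue data in Section 2.
- [NLTK](https://www.nltk.org/) (optional) OR [Spacy](https://spacy.io/).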

The libraries above have good documentation, which you can use to learn about other features or to find answers and examples. The documentation is available either online (links above) or via Python itself, e.g. `help(nltk.word_tokenize)` in the Python interpreter.

As an example, to install nltk in a new conda environment, run
```
$ conda create -n DN_labs
$ conda activate DN_labs 
$ conda install nltk
```
For further help, see the installation guides in the libraries' documentation.

**Feel free to skip the next part if you're already confident with Jupyter notebooks.**

## Jupyter Notebook

The labs will be run on [Jupyter Notebook](http://jupyter.org/), an interactive coding environment embedded in a webpage supporting various programming languages (Python, R, Lua, etc.) through the concept of kernels.  

It allows you to enrich your code with complex comments formatted in Markdown and $\LaTeX$, as well as to place the results of your computation right below your code.

Notebooks are organised in cells which can contain either code (in our case, this will be Python code) or text, which can be easily and nicely formatted using the Markdown notation. 

To edit an already existing cell simply double-click on it. You can use the toolbar to insert new cells, edit and delete them (or use keyboard shortcuts which are very handy to speed up coding). 

Cells can be run by hitting `shift+enter` when editing a cell or by clicking on the `Run` button at the top. Running a Markdown cell will simply display the formatted text, while running a code cell will execute the commands contained in it. 

**Note**: when you run a code cell, all the variables it creates, functions it defines and libraries it imports will then be available to every other code cell. However, it is commonly assumed that cells will be run sequentially, so that each cell's prerequisites have already been executed. To reset all variables and functions (for debugging) simply click `Kernel > Restart` from the Jupyter menu.

#### A bit on Markdown language (and a bit of LaTeX and HTML) if you're interested

Markdown cells allow you to write fancy and simple comments: all of this is written in Markdown - double click on this cell to see the source. Introduction to Markdown syntax can be found [here](https://daringfireball.net/projects/markdown/syntax).

As Markdown is translated to HTML when it is displayed, it also allows you to use pure HTML: more details are available [here](https://daringfireball.net/projects/markdown/syntax#html).

Finally, you can also display simple $\LaTeX$ equations in Markdown thanks to `MathJax` support. For inline equations wrap your equation between `$` symbols; for display mode equations use `$$`.

# 2. Doc2Dial Dataset

This is a recent 'shared task' that involves building a dialog system. The goal is to respond to a user by first retrieving some information from a document, then using it to formulate a response. More on the task here:
https://doc2dial.github.io/workshop2021/shared.html

The raw data is available [here](https://doc2dial.github.io/file/doc2dial_v1.0.1.zip) but we will use a data loader class from the [HuggingFace datasets library](https://huggingface.co/docs/datasets/loading_datasets.html) to load it. 

In [1]:
from datasets import load_dataset

split = "train"
cache_dir = "./data_cache"

dataset = load_dataset(
    "doc2dial",
    name="dialogue_domain",  # this is the name of the dataset for the second subtask, dialog generation
    split=split,
    ignore_verifications=True,
    cache_dir=cache_dir,
)

Reusing dataset doc2dial (./data_cache/doc2dial/dialogue_domain/1.0.1/c15afdf53780a8d6ebea7aec05384432195b356f879aa53a4ee39b740d520642)


In [2]:
print(f'The dataset has {len(dataset)} instances')

print('An example instance: ')
print(dataset[2342])

The dataset has 3474 instances
An example instance: 
{'dial_id': '92db4f3c68ab3fb2851f4a559d9c2d1e', 'doc_id': 'Temporary Disability Rating After Surgery Or Cast | Veterans Affairs#1_0', 'domain': 'va', 'turns': [{'turn_id': 1, 'role': 'user', 'da': 'query_condition', 'references': [{'sp_id': '28', 'label': 'precondition'}], 'utterance': 'Hello I need information on How do I get these disability rating benefits?'}, {'turn_id': 2, 'role': 'agent', 'da': 'respond_solution', 'references': [{'sp_id': '29', 'label': 'solution'}, {'sp_id': '30', 'label': 'solution'}], 'utterance': 'To get these benefits, you must file a disability compensation claim. Find out how to file a disability compensation claim'}, {'turn_id': 3, 'role': 'agent', 'da': 'query_condition', 'references': [{'sp_id': '36', 'label': 'precondition'}], 'utterance': 'How much do you have a qualified injury of 30% or more that has worsened?'}, {'turn_id': 4, 'role': 'user', 'da': 'response_negative', 'references': [{'sp_id': '3

Notice how we can access the different fields in a single data sample as if reading from a Python dictionary.
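
As a minimal sketch of this dictionary-style access, the snippet below uses a hand-written, shortened stand-in for a Doc2Dial instance rather than the real dataset. The name `example` and its values are illustrative assumptions; only the field names follow the instance printed above.

```python
# A hand-written, shortened stand-in for one instance (illustrative values only).
example = {
    'dial_id': 'example-dialogue',
    'domain': 'va',
    'turns': [
        {'turn_id': 1, 'role': 'user', 'utterance': 'How do I get these benefits?'},
        {'turn_id': 2, 'role': 'agent', 'utterance': 'You must file a disability compensation claim.'},
    ],
}

# Top-level fields are read with square brackets, like any dictionary.
print(example['domain'])

# 'turns' is a list of dictionaries, one per turn in the conversation.
for turn in example['turns']:
    print(turn['role'], ':', turn['utterance'])

# Keep only the turns written by the user.
user_utterances = [t['utterance'] for t in example['turns'] if t['role'] == 'user']
print(user_utterances)
```

The same indexing pattern should carry over to the instances returned by the data loader, since each one is printed above as a nested dictionary.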

For our lab this week, we will need some examples of dialogue written by a user. Let's get some from the training set of Doc2Dial.

TODO 2.1: get a list of user utterances from 100 different conversations. Name the list 'utterances'.

In [3]:
### WRITE YOUR ANSWER HERE
utterances = []
num_of_examples = 100
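for data_entry in dataset:
    if len(utterances) >= num_of_examples:
        break  # stop scanning the dataset once we have enough utterances
    for turn in data_entry['turns']:
        if len(utterances) >= num_of_examples:
            break
        # Note: this keeps both user and agent turns (as in the output below);
        # add a check on turn['role'] == 'user' to keep user turns only.
        if 'utterance' in turn:
            utterances.append(turn['utterance'])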
###
utterances.append('we are going to Canada canada tomorrow.')
print(len(utterances))
print(utterances[:10])

101
['Hello, I forgot o update my address, can you help me with that?', 'hi, you have to report any change of address to DMV within 10 days after moving. You should do this both for the address associated with your license and all the addresses associated with all your vehicles.', 'Can I do my DMV transactions online?', 'Yes, you can sign up for MyDMV for all the online transactions needed.', 'Thanks, and in case I forget to bring all of the documentation needed to the DMV office, what can I do?', "This happens often with our customers so that's why our website and MyDMV are so useful for our customers. Just check if you can make your transaction online so you don't have to go to the DMV Office.", 'Ok, and can you tell me again where should I report my new address?', "Sure. Any change of address must be reported to the DMV, that's for the address associated with your license and any of your vehicles.", 'Can you tell me more about Traffic points and their cost?', 'Traffic points is the 

# 3. Regular Expressions

Next, we are going to experiment with building a simple chatbot using regular expressions. The aims are to get familiar with this important NLP tool and to see some limitations of rule-based approaches.

Many text processing systems make use of regular expressions, which are a language for specifying sets of strings. We can use regular expressions to define a pattern to search for in a corpus of text and retrieve all the occurrences of that pattern. We can also use regular expressions to replace one text pattern with another. Regular expressions are therefore used in various NLP systems, e.g., to implement tokenisation or extract features for a classifier by looking for specific patterns. They can often be used to build a simple baseline for tasks like text classification before developing a machine learning solution. 

## 3.1 Search

We can start by finding occurrences of the word "can":

In [4]:
import re

def find_re(word_to_find, utterances, print_set=True):
    all_matches = []
    for utterance in utterances:
        matches = re.findall(word_to_find, utterance)
        if len(matches):  # if it found something
            all_matches.extend(matches)
    if print_set:
        print(set(all_matches))  # use a set to get the unique instances in the list
        print(len(all_matches))  # length of the list of matches
    return all_matches
        
    
word_to_find = r'can'
all_matches = find_re(word_to_find, utterances)


{'can'}
22


Just searching for the word itself does not really use the power of regular expressions. Let's use the *disjunction* capabilities of REs to find both capitalised and lower case occurrences. The disjunction of two or more characters is written using square brackets. To match either 'C' or 'c', we can use the following:

In [5]:
word_to_find = r'[Cc]an'
all_matches = find_re(word_to_find, utterances)    


{'Can', 'can'}
32


Our current search does not consider word boundaries, so it will retrieve occurrences of "can" that are part of a longer word. In Python regular expressions, we can match word boundaries using '\b'. This matches the empty string at the beginning or end of a word. 
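
To illustrate how '\b' behaves, here is a small sketch on a made-up sentence. It uses the word "do" rather than "can", so it is not a solution to the exercise below; the sentence and variable names are illustrative assumptions.

```python
import re

# A made-up sentence for illustration; it is not taken from the dataset.
text = "What do I do with this document? Do not undo it."

no_boundary = re.findall(r'[Dd]o', text)          # also matches inside 'document' and 'undo'
left_boundary = re.findall(r'\b[Dd]o', text)      # must start a word, so 'undo' is excluded
both_boundaries = re.findall(r'\b[Dd]o\b', text)  # whole word only

print(len(no_boundary), len(left_boundary), len(both_boundaries))
print(both_boundaries)
```

Each added '\b' removes one kind of unwanted match: the left boundary rules out "undo", and the right boundary rules out "document".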

TODO: Write a new regular expression to search for "can" that excludes words like "canal" and "arcane" using '\b':

In [6]:
word_to_find = r'\b[Cc]an\b'
all_matches = find_re(word_to_find, utterances)    



{'Can', 'can'}
28


There are lots of other special characters besides '\b'. For a complete list of special characters, see https://docs.python.org/3/library/re.html#regular-expression-syntax. For the next exercise, the ones you will need are:
   * Match any lower case letter: '[a-z]'
   * Repetition: Match zero or more repetitions of the preceding RE: '\*'
   * Disjunction between longer expressions: Match either the RE on the left-hand side OR the RE on the right-hand side: '\|'
   * Groups: '(...)'. Parentheses encapsulate *groups*, which are nested regular expressions within a larger RE. They are useful because you can apply special characters such as \* and \| to expressions inside a group. If we specify an RE containing N groups (with N greater than one), each match returned by findall will be a tuple of length N where each item corresponds to a group.

These special characters can implement two other ways to find both upper and lower case occurrences of 'can': r'Can|can' or r'(C|c)(an)'. The second will divide each match up into two *groups*, where each group matches one of the expressions inside the parentheses.

In [7]:
all_matches = find_re(r'Can|can', utterances)
all_matches = find_re(r'(C|c)(an)', utterances)

{'Can', 'can'}
32
{('c', 'an'), ('C', 'an')}
32


We have to concatenate the groups of characters back together to retrieve the complete matches:

In [8]:
matches = [m[0] + m[1] for m in all_matches]
print(set(matches))  # set() extracts the unique items in the list

{'Can', 'can'}


Make sure you understand what's happening in the cell above.
* Why do we need the parentheses around 'an' in the RE r'(C|c)(an)'? 
* What happens if we remove the parentheses?
* On the first line in the cell above, what do the square brackets '[...]' do? Hint: *list comprehension* 

Now, let's use the special characters described above to retrieve all the words starting with 'can'. 

TODO: Write a new regular expression that returns all complete words starting with "can".

In [9]:
### WRITE YOUR CODE HERE
all_matches = find_re(r'\b(C|c)(an)([a-z]*)', utterances)

{('c', 'an', 'ada'), ('c', 'an', ''), ('c', 'an', 'cel'), ('C', 'an', 'ada'), ('c', 'an', 'celing'), ('C', 'an', '')}
32


This is starting to seem more useful -- we've retrieved a set of related words with a common substring. What we really want is to extract the whole context of these words, i.e., the sentences or phrases they are contained in. For this we need a few more special characters:
   * Any character except newline: '.'
   * Complement, match any character except the specified ones: '[^A]'
   * New line: '\\n'
   * Escape: e.g., '\\?', '\\.'. Using the backslash in front of special characters means that they are not interpreted as special characters but are treated literally, in this case as a question mark or full stop. 
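
The special characters listed above can be tried out on a short made-up string before tackling the exercise. This is only a sketch: the string and the variable name `phrases` are illustrative assumptions.

```python
import re

# A made-up string for illustration.
text = "Can I pay online? Yes. Go to MyDMV!"

# An unescaped '.' matches any character, so this finds both characters in 'a?'.
print(re.findall(r'.', 'a?'))

# '\?' matches only a literal question mark.
print(re.findall(r'\?', text))

# '[^.?!\n]+' matches a run of characters that are NOT sentence-ending
# punctuation or newlines; the final '[.?!]' then matches the punctuation mark.
# (Inside square brackets, '.' and '?' are already literal, so escaping is optional.)
phrases = re.findall(r'[^.?!\n]+[.?!]', text)
print(phrases)
```

Note that the second and third phrases keep their leading space, because the complement set only excludes punctuation and newlines.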

TODO: the code below retrieves all phrases including 'can', starting from the preceding punctuation mark or newline, until the following punctuation mark or newline. Modify it so it does not retrieve words like 'Canada' that contain 'can'. 

In [10]:
all_matches = find_re(r'([\.\!\?\n;:,])([^\.\?!;:,\n]*)([Cc]an)([^\.\?!;:,\n]*)([\.\!\?\n;:,])', utterances, print_set=False)
        
all_matches = [m[1] + m[2] + m[3] for m in all_matches]
for match in set(all_matches):
    print(match)
# print(set(all_matches))  # use a set to get the unique instances in the list
print(len(all_matches))  # length of the list of matches

 what can I do
 you can sign up for MyDMV for all the online transactions needed
 you could check on the DMV website to see if you can do your transaction online
 The best you can do is to check our website to see if you can do your transaction online so you don't have to go to the DMV Office
 before going to the DMV office you should check our website to see if you can do your transaction online
 What can I do
 can you help me with that
 in thşs case you can miss a suspensin order
 Before going to a DMV Office you should see if your transaction can be performed online
 we recommend that you go first to our website to see if you can do your transaction online
 This way you can avoid going to a DMV Office
 Can you explain more
 and can you tell me again where should I report my new address
 Just check if you can make your transaction online so you don't have to go to the DMV Office
14


## 3.2 Substitution

Finding and replacing patterns is an important application of regular expressions. For example, we can use it to clean up text by removing line breaks, which often serve only as formatting and carry little information.

In Python, we can use the re.sub() function to replace every match of a regular expression with another pattern. re.sub() takes three arguments: 
* The first argument specifies the expression to match
* The second defines the pattern we should replace it with
* The third is the text to apply the substitution to. 

The *groups* matched by the first argument can be referred to by number in the second argument: \1 for the first group, \2 for the second group, and so on. 

For instance, `re.sub(r'\n', ' ', text)` replaces every line break in `text` with a space.
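
A minimal sketch of both ideas, removing line breaks and reusing a group in the replacement, on made-up strings (the strings and variable names are illustrative assumptions, not taken from the dataset):

```python
import re

# Made-up strings for illustration; they are not taken from the dataset.
text = "Hello,\nI forgot to update\nmy address."

# Replace each line break with a single space.
one_line = re.sub(r'\n', ' ', text)
print(one_line)

# The group (.*) captures everything after 'I am '; \1 reuses it in the replacement.
reply = re.sub(r'I am (.*)', r'Why are you \1?', 'I am worried about my licence')
print(reply)
```

Writing the replacement as a raw string (r'...') keeps Python from interpreting '\1' as an escape sequence before re.sub() sees it.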

Let's use regular expression substitutions to create our first dialogue system! A famous chatbot that uses regular expression substitutions is ELIZA [1], which mimicked a Rogerian psychotherapist.

[1] Weizenbaum, J. (1966). ELIZA – A computer program for the study of natural language communication between man and machine. CACM 9(1), 36–45.

In our next task, we are going to use regular expression substitution to respond to the phrases that we extracted from the dataset. You might need a bit of imagination here: imagine that each phrase we extracted has been typed into a chatbot and the chatbot must respond suitably.

Consider the example below, which uses `re.search()` to decide which rule applies and `re.sub()` to generate the response.

In [11]:
for m in all_matches:   # all_matches is the list of phrases generated in your previous code cell
    
    # pretend that each match is an utterance from a user. The dialogue system must generate a response.
    print('DIALOGUE: ' + m)    
    
    # generate responses
    search_re = r'.*[Cc]an you'
    if re.search(search_re, m):  # respond to phrases that contain 'can you'
        subbed = re.sub(search_re + r'(.*)', r'Yes, I can\1', m)
    elif re.search(r'.*[Yy]ou can', m):
        subbed = re.sub(r'.*[Yy]ou can' + r'(.*)', r'Can I\1?', m)
    else:  # respond to other lines
        subbed = 'I do not understand.'
    print('CHATBOT RESPONSE: ' + subbed)

DIALOGUE:  can you help me with that
CHATBOT RESPONSE: Yes, I can help me with that
DIALOGUE:  you can sign up for MyDMV for all the online transactions needed
CHATBOT RESPONSE: Can I sign up for MyDMV for all the online transactions needed?
DIALOGUE:  what can I do
CHATBOT RESPONSE: I do not understand.
DIALOGUE:  Just check if you can make your transaction online so you don't have to go to the DMV Office
CHATBOT RESPONSE: Can I make your transaction online so you don't have to go to the DMV Office?
DIALOGUE:  and can you tell me again where should I report my new address
CHATBOT RESPONSE: Yes, I can tell me again where should I report my new address
DIALOGUE:  The best you can do is to check our website to see if you can do your transaction online so you don't have to go to the DMV Office
CHATBOT RESPONSE: Can I do your transaction online so you don't have to go to the DMV Office?
DIALOGUE:  you could check on the DMV website to see if you can do your transaction online
CHATBOT RESPO

TODO: Choose some more patterns to respond to, so that the chatbot says 'I do not understand' less often, or otherwise improve the generated responses.
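
One possible way to organise additional patterns, sketched here as an assumption rather than the intended solution, is a list of (pattern, template) pairs that are tried in order. The names `RULES` and `respond` and the third rule are hypothetical; the first two rules mirror the cell above.

```python
import re

# Hypothetical rule table: (pattern, response template) pairs, tried in order.
RULES = [
    (r'.*[Cc]an you(.*)', r'Yes, I can\1'),
    (r'.*[Yy]ou can(.*)', r'Can I\1?'),
    (r'.*\b[Ww]hat can I do.*', r'You can check our website.'),  # made-up extra rule
]

def respond(utterance, rules=RULES, default='I do not understand.'):
    """Return the response of the first rule whose pattern matches."""
    for pattern, template in rules:
        if re.search(pattern, utterance):
            return re.sub(pattern, template, utterance)
    return default

for phrase in [' can you help me with that', ' what can I do', ' hello there']:
    print('DIALOGUE: ' + phrase)
    print('CHATBOT RESPONSE: ' + respond(phrase))
```

Keeping the rules in a list means a new pattern only needs a new entry, and the order of the list decides which rule wins when several match.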

# 4. Tokenisation

Up to now we have been able to work directly with the raw text. However, for most text processing tasks we will need to perform a number of steps to transform the raw text to a suitable format for a model such as a classifier or dialogue system. Here we will try out the key step of word tokenisation.

Let's start with a naïve approach: splitting the sentences based on whitespace. 

The re module provides the re.split() function, which takes a regular expression as its argument and splits the text when it finds a match. The special character '\s' is used to match whitespace characters.
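
As a quick sketch of how re.split() treats whitespace, the made-up string below contains a double space, a tab and a trailing newline (the string is an illustrative assumption). It compares '\s', '\s+' and Python's built-in str.split().

```python
import re

# A made-up string with a double space, a tab and a trailing newline.
text = "new  york\tcity \n"

single = re.split(r'\s', text)    # splits at every single whitespace character
runs = re.split(r'\s+', text)     # treats a run of whitespace as one separator
builtin = text.split()            # str.split() also drops empty strings at the ends

print(single)
print(runs)
print(builtin)
```

Empty strings appear wherever two separators are adjacent or the text ends with a separator, which is worth keeping in mind when inspecting your own tokens.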

TODO: use re.split() to split the raw text into tokens on whitespace characters. Save the sequence of tokens to a new variable called tokens.

In [12]:
# The dataset has already stripped out most punctuation, so here's a made-up example:
raw = "If I want to register my vehicle here in new york, I was forewarned that out-of-state insurance can't be accepted? "

### WRITE YOUR OWN CODE HERE
tokens = re.split(r'\s', raw)  # '\s' matches any whitespace character, not just spaces
###
print(tokens)

['If', 'I', 'want', 'to', 'register', 'my', 'vehicle', 'here', 'in', 'new', 'york,', 'I', 'was', 'forewarned', 'that', 'out-of-state', 'insurance', "can't", 'be', 'accepted?', '']


Whitespace tokenisation doesn't handle things like punctuation very well. For example, the comma in 'york,' and the question mark in 'accepted?' stay attached to the preceding word. To see this, run the following code to print the tokens that contain non-alphanumeric characters. 

In [13]:
for tok in tokens:
    if re.search(r'[^a-zA-Z0-9]', tok):
        print(tok)

york,
out-of-state
can't
accepted?


If we start to split the tokens based on any non-letter characters, we can encounter further issues. The punctuation may be informative, so we should not throw it away. Hyphenated words may need to be kept together while contractions like "don't" might need to be split.

Luckily, we can make use of existing rule-based tokenizers that deal with these issues:
* Spacy: https://spacy.io/api/tokenizer
* NLTK: https://www.kite.com/python/docs/nltk.word_tokenize 

For some domains and languages, tokenisation is not so easy and we may need to construct a regular-expression-based approach.
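
A minimal sketch of such a regular-expression-based tokeniser is shown below. The pattern is an illustrative assumption, not the rule set used by Spacy or NLTK: it keeps hyphenated words and contractions together and emits each punctuation mark as its own token. The `(?:...)` syntax is a group that findall does not return separately.

```python
import re

raw = "If I want to register my vehicle here in new york, I was forewarned that out-of-state insurance can't be accepted? "

# Either: a word, optionally extended by hyphen- or apostrophe-joined parts
# (keeps 'out-of-state' and "can't" together),
# or: any single character that is neither a word character nor whitespace.
token_re = r"\w+(?:[-']\w+)*|[^\w\s]"
tokens_regex = re.findall(token_re, raw)
print(tokens_regex)
```

Using findall to describe what a token looks like, rather than split to describe the separators, avoids the empty strings produced by whitespace splitting.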

TODO: refer to the documentation linked above for Spacy or NLTK's word tokeniser, and apply one of them to the raw text. Compare the output to the whitespace tokeniser. Save the tokens to a variable called 'tokens_rulebased'.

In [14]:
### WRITE YOUR OWN CODE HERE
import nltk
# nltk.download('punkt')  # needs to be downloaded the first time you run this
tokens_rulebased = nltk.word_tokenize(raw)
print(tokens_rulebased)

['If', 'I', 'want', 'to', 'register', 'my', 'vehicle', 'here', 'in', 'new', 'york', ',', 'I', 'was', 'forewarned', 'that', 'out-of-state', 'insurance', 'ca', "n't", 'be', 'accepted', '?']


TODO: Run the code below to see how NLTK has handled the non-letter characters. What does it do with most punctuation marks? When does it include punctuation marks in a token with letters? When does it not split tokens based on punctuation?

In [15]:
for tok in tokens_rulebased:
    if re.search(r'[^a-zA-Z0-9]', tok):
        print(tok)

,
out-of-state
n't
?


In the textbook, we also encountered subword tokenization methods, including byte-pair encoding (BPE). We can test this out using the implementation from HuggingFace's Transformers library:

https://huggingface.co/transformers/tokenizer_summary.html

In [19]:
from transformers import GPT2Tokenizer
bpe_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens_bpe = bpe_tokenizer.tokenize(raw)

TODO: Print out some of the tokens and see if you can find any subwords. 

There will be some strange symbols that encode whitespaces, which are treated as part of the following word. See if you can work out what they represent.

In [17]:
print(raw)
tokens_bpe

If I want to register my vehicle here in new york, I was forewarned that out-of-state insurance can't be accepted? 


['If',
 'ĠI',
 'Ġwant',
 'Ġto',
 'Ġregister',
 'Ġmy',
 'Ġvehicle',
 'Ġhere',
 'Ġin',
 'Ġnew',
 'Ġy',
 'ork',
 ',',
 'ĠI',
 'Ġwas',
 'Ġfore',
 'warn',
 'ed',
 'Ġthat',
 'Ġout',
 '-',
 'of',
 '-',
 'state',
 'Ġinsurance',
 'Ġcan',
 "'t",
 'Ġbe',
 'Ġaccepted',
 '?',
 'Ġ']
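
The 'Ġ' symbol in the output above stands for the space that precedes a token. As a small sketch in plain Python, using a few tokens copied from that output (the variable names are illustrative), we can undo the encoding and group the subword tokens by word:

```python
# A few tokens copied from the output above.
tokens = ['Ġnew', 'Ġy', 'ork', ',', 'ĠI', 'Ġwas', 'Ġfore', 'warn', 'ed']

# Joining the tokens and turning each 'Ġ' back into a space recovers the text.
text = ''.join(tokens).replace('Ġ', ' ')
print(repr(text))

# Group the tokens: a token starting with 'Ġ' begins a new group, and tokens
# without it (subwords, but also punctuation) attach to the previous group.
groups = []
for tok in tokens:
    if tok.startswith('Ġ') or not groups:
        groups.append([tok])
    else:
        groups[-1].append(tok)
print(groups)
```

Groups with more than one alphabetic token, such as 'Ġfore', 'warn', 'ed', are the words that BPE has broken into subwords.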

You may also have heard of the BERT model. It uses a subword tokenisation method similar to BPE, called WordPiece. We can also test that out using the HuggingFace Transformers library. 

TODO: Use the code below to see if you can find some differences between BERT's wordpiece method and BPE.

In [18]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize(raw)

['if',
 'i',
 'want',
 'to',
 'register',
 'my',
 'vehicle',
 'here',
 'in',
 'new',
 'york',
 ',',
 'i',
 'was',
 'fore',
 '##war',
 '##ned',
 'that',
 'out',
 '-',
 'of',
 '-',
 'state',
 'insurance',
 'can',
 "'",
 't',
 'be',
 'accepted',
 '?']
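
In the output above, pieces that continue a word carry a '##' prefix. A minimal sketch, in plain Python and on a few tokens copied from that output (the variable names are illustrative), of gluing such pieces back together:

```python
# A few tokens copied from the output above.
pieces = ['i', 'was', 'fore', '##war', '##ned', 'that']

words = []
for piece in pieces:
    if piece.startswith('##') and words:
        words[-1] += piece[2:]   # continuation piece: append without the '##' marker
    else:
        words.append(piece)      # a piece without '##' starts a new word
print(words)
```

Comparing how 'forewarned' is split here with the BPE output earlier is one place to start when looking for differences between the two methods.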