
# **What! is? a. Token?!**


In this notebook we return to the problem of tokenizing a text and defining what counts as a word. So far we have been doing this with the `.split()` method, which has worked relatively well for us. There is one issue, however: splitting on whitespace means that punctuation is sometimes included with our words.

For example, running `.split()` on the example below will retain commas and exclamation marks as part of the words:






In [None]:
turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

turtles.split()

Therefore, we might want to perform some operations on this text *before* we start processing it for linguistic information. These operations normalize and standardize the text so that noise is removed. This is called preprocessing. Preprocessing can take many forms: you could remove just punctuation, or convert everything to lowercase, or remove very frequent words, or remove words that are not in the dictionary, or remove words that only occur one time, and so on. Different algorithms and approaches to NLP include their own preprocessing methods and steps, which are tied to the goals of the analysis.
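
As a rough illustration of a few of these options, the sketch below lowercases a short sentence, drops a small set of very frequent words, and then drops words that occur only once. The sample sentence and the `frequent_words` set are made up for this demonstration and are not a standard list.

```python
from collections import Counter

# hypothetical sample text, invented for this illustration
sample = "The cat saw the dog and the dog saw the bird"

# option 1: convert everything to lowercase, then split on whitespace
lowered = sample.lower().split()

# option 2: remove very frequent words (a tiny hand-picked set, assumed here)
frequent_words = {"the", "and"}
no_frequent = [word for word in lowered if word not in frequent_words]

# option 3: remove words that occur only one time
counts = Counter(no_frequent)
no_singletons = [word for word in no_frequent if counts[word] > 1]

print(no_frequent)
print(no_singletons)
```

Each step throws information away, so which steps make sense depends on what you plan to measure afterwards.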

For now, let's focus on the issue of punctuation in the turtles text.

## **Cleaning punctuation**

Our problem above with `turtles` was caused by the use of punctuation and `.split()`. What could we do? Well, we *could* remove all of the punctuation before splitting the text, and this would provide a satisfactory solution (for now).

Based on what we know now about Python, how could we remove all of the punctuation from a text? We can actually do this quite simply and quickly using a list comprehension.

We would want to set up a conditional test which inspects each character in a string, and as long as that character is *not* a punctuation mark, keep it.

Here is some pseudocode that expresses our goal:


```
[character for character in string if character not punctuation]
```

To execute this code, we'd need to tell Python what we mean by "punctuation". One way is to define a string containing all the punctuation marks we don't want.

We will also lowercase each character within the same expression.


In [None]:
# define a string containing punctuation we don't like, in this case just commas and exclamation marks
punctuation = ',!'

If you run the cell below, you will see that the punctuation has been removed, but unfortunately the output is a list of characters, not words!

In [None]:
# write a list comprehension that only keeps characters that aren't in punctuation
[character.lower() for character in turtles if character not in punctuation]

### **`string.join()`**

The list comprehension has returned a list of *characters*, but we wanted to retain the whitespace and other properties of the texts as a series of words. No worries, we can use the handy `string.join()` function to join a list of characters back into one string!

`string.join()` is sort of the bizarre cousin of `string.split()`. Because `.join()` is actually a string method, you need to attach a string to the front part of the `.join()`. The string that you attach to `.join()` is the separator: the text that gets placed between the items being joined. Much like `.split()`, you can choose whatever you like to join stuff with.

For example `''.join()` will join using an empty string, `' '.join()` will join using a space, and `'HELLO'.join()` would join everything with the string `HELLO`.

It is a bit confusing at first, but basically we use the part in front of `.join()` to determine *how* the characters should be glued back together.


In [None]:
example_string = ['n', 'i','n','j','a',' ', 't','u','r','t', 'l', 'e']

In [None]:
# join with an empty string
''.join(example_string)

In [None]:
# join with a space
' '.join(example_string)

In [None]:
# join with HELLO
'HELLO'.join(example_string)

So, if we simply wanted to glue back together a list of characters *without* making any other changes, we would then attach an empty string to `.join()`, indicated with two string delimiters: `''`, in which case we would type `''.join()`.

Then, the thing that you want to join goes inside the `()` part of `''.join()`.

```
''.join([list of characters])
```



In [None]:
# we just wrap the whole list comprehension in ''.join
remove_punctuation = ''.join([character.lower() for character in turtles if character not in punctuation])

In [None]:
# the newlines now show up as \n, but this is our original text, lowercased and without punctuation
remove_punctuation

How else could we rejoin our cleaned text without using `.join()`?

One way would be to write a loop which analyses each word in a text, removing punctuation from that word, and then puts that word into a list. This is made slightly difficult because strings are `immutable`, meaning that we cannot remove or replace individual elements of a string.


In [None]:
# this returns an error because we cannot modify strings in place
'string'[0] = 'b'

One way to do this is scan through each character and then reconstruct the string as we go, only including characters that pass a conditional test. This method iteratively creates a new string by first creating an empty string `output` and then adding each character to it in a sequence using string concatenation.



In [None]:
# create an output container
output = ''

# loop through each character in the whole string
for character in turtles:
  # if the character is NOT in this list:
  if character not in [',', '!']:
    # add the lowercased version of the character to the output string
    output = output + character.lower()

# results are identical to the ''.join() method above
output
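
The word-by-word loop mentioned earlier can be sketched in a similar way. This version is only an illustration: it relies on `str.strip()`, which removes the given characters from the two ends of a word, so punctuation in the middle of a word would be left alone.

```python
# the same text used throughout this notebook
turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

# create an output container (a list this time)
cleaned_words = []

# loop through each whitespace-separated word
for word in turtles.split():
    # strip commas and exclamation marks from both ends, then lowercase
    cleaned_words.append(word.strip(',!').lower())

cleaned_words
```

Because the words end up in a list, there is no need to rebuild a string or to call `.split()` again afterwards.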

### **Using a regular expression**

Another way is to use a `regular expression` to clean the string. A regular expression is a method of defining complex string patterns using abstract symbols. Using regular expressions, we can quickly search and replace strings for specific patterns. Here we will use a simple pattern and a function to replace the pattern.

We will need to import the library for regular expressions, `re`:

In [None]:
#import the regular expression library
import re

We can now use the `re.sub` function, which replaces every match of a pattern in a string with a replacement string. The syntax for `re.sub` is:

`re.sub(pattern, replacement, string)`

So you first type the pattern that you want to search for, then what you would like the pattern replaced with, and then finally the string that you are targeting - you are telling `re.sub` to replace THIS with THAT over THERE.

And, if we say that the replacement should be an empty string, then the replacement will be nothing, meaning that you are effectively removing the pattern from the string. For example:

In [None]:
# remove all the 'a' from the string 'banana'
re.sub(pattern = 'a', repl = '', string = 'banana')

Using this same logic, we can remove all of the punctuation from a string. Be sure to save the results as a variable, otherwise the replacements will not be saved.


In [None]:
# original string
exclamation = 'too! many! exclamation! points!'
exclamation

In [None]:
# substitute out the exclamation marks and make a new string
exclamation = re.sub(pattern = '!', repl = '', string = exclamation)

In [None]:
# a cleaned string
exclamation

Now, if we want to remove more than one punctuation mark, we can define a pattern which says "any one of these characters." To do so, write a string whose contents are wrapped in square brackets, and put every character you want removed inside those brackets, like this:

```
punctuation = '[,!]'
```

Then use that pattern in your `re.sub` call to replace those punctuation marks.

In [None]:
# original version of turtles
turtles

In [None]:
# cleaned version of turtles (not saved to a variable)
punctuation = '[,!]'
re.sub(pattern = punctuation, repl = '', string = turtles)
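
If you would rather not list punctuation marks by hand, one possible extension is to build the bracket pattern from `string.punctuation`, the standard library's string of ASCII punctuation characters. This is a sketch rather than part of the original recipe. It passes the characters through `re.escape()` because some of them, such as `]` and `\`, have a special meaning inside brackets.

```python
import re
import string

# the same text used throughout this notebook
turtles = """teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            teenage mutant ninja turtles,
            heroes in a halfshell, turtle power!"""

# wrap the escaped ASCII punctuation characters in square brackets
all_punctuation = '[' + re.escape(string.punctuation) + ']'

# replace every punctuation character with an empty string and save the result
cleaned = re.sub(pattern=all_punctuation, repl='', string=turtles)
cleaned
```

Note that `string.punctuation` only covers ASCII characters, so curly quotes and other non-ASCII marks would not be removed by this pattern.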

# **Using `nltk.word_tokenize()` for better tokenization**

Now we understand how to preprocess text so that `.split()` will return a more normalised set of tokens.

Having learned this, we now need to ask: what if we want to *retain* punctuation? Do you think it would be important to know the difference between words that come before and after punctuation? Could punctuation tell us something about the syntax of a sentence or the tone of a piece of writing? These are questions without clear answers, but they are worthy of consideration. A more practical reason to retain punctuation is that punctuation marks can help with segmenting strings into words and sentences. For this reason, we will stop using `.split()` to create word tokens, and we will think more carefully about whether punctuation is needed.

> Please keep in mind that learning how to clean the strings and using `.split()` is still useful, so it was not for naught. You may still find that using `.split()` and some cleaning is helpful for various subtasks you might want to perform.

Anyhow, let's look at the NLTK segmentation functions, which are improvements upon `.split()`. These functions are `nltk.word_tokenize()` and `nltk.sent_tokenize()`. They convert raw strings into word tokens or sentences, respectively.


First import the NLTK library and download the necessary `punkt` resource. The `punkt` resource holds the pretrained Punkt models that NLTK uses to identify sentence boundaries. `nltk.word_tokenize()` needs it too, because it splits the text into sentences before splitting each sentence into words.

Edit: as of mid-2024, this has been replaced with [`punkt_tab`](https://github.com/nltk/nltk/issues/3293).

In [None]:
import nltk
nltk.download('punkt_tab')

To tokenize with NLTK, we use the `nltk.word_tokenize()` function with a string as input.

In the cells below, compare the difference between using `.split()` and `nltk.word_tokenize()`:

In [None]:
# What is the difference between using `.split()` and `nltk.word_tokenize()`?
pretzels = 'These pretzels are making me thirsty!'

split_tokens = pretzels.split()
nltk_tokens = nltk.word_tokenize(pretzels)

print(f"Using .split(): \n{split_tokens}\n\nUsing nltk: \n{nltk_tokens}")

The NLTK tokenizer has treated the punctuation as a separate word - so it is smart enough to recognise that words should be separated from punctuation. It does this using a set of additional rules as well as some splitting. This makes perfect sense for punctuation which occurs after words, such as commas, full stops, exclamation marks, and so on.
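
To get a feel for what "rules plus splitting" can mean, here is a deliberately simplified stand-in written with a single regular expression. It is not NLTK's actual rule set, which is considerably more detailed, and it does not handle contractions or abbreviations. It only shows how a pattern can peel punctuation away from words.

```python
import re

pretzels = 'These pretzels are making me thirsty!'

# \w+ matches a run of letters, digits, or underscores (a word)
# [^\w\s] matches any single character that is neither a word character nor whitespace
simple_tokens = re.findall(r"\w+|[^\w\s]", pretzels)
simple_tokens
```

On this sentence the pattern returns the six words followed by `'!'` as its own token.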

What's going on in the cell below?

In [None]:
# What is different about these tokens?
nltk.word_tokenize('I can\'t even.')

The word "can't" was split into two tokens, `ca` and `n't`! Why is that? Well, if we think about it, "can't" actually stands for *two* words - "can" and "not." The tokenizer has an additional set of rules to search for these contractions and split them accordingly. Using `.split()`, on the other hand, would result in "can't" being stored as a single word. Moreover, removing the punctuation *before* tokenization would turn "can't" into "cant", and then `nltk.word_tokenize()` would treat "cant" as a single word. Is this an issue? Well, considering that "cant" is a word in its own right, with a meaning separate from "can't", it certainly could be.


The point is that the order of pre-processing and normalisation steps is important, as are the different things you might want to do to a text. Many modern NLP libraries perform pre-processing automatically, and some analyses now actually do not need any preprocessing at all! It is nonetheless fundamental to understand how your data is being normalised in order to use these functions properly.
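
The effect of ordering can be seen without NLTK at all. The sketch below uses only `re` and `.split()`, with an example sentence chosen for illustration, and compares stripping punctuation before splitting with splitting first and then trimming punctuation from the ends of each word.

```python
import re

sentence = "I can't even."

# order A: remove every punctuation character first, then split on whitespace
stripped_first = re.sub(r"[^\w\s]", "", sentence).split()

# order B: split on whitespace first, then trim punctuation from the ends of each word
split_first = [word.strip(".,!?") for word in sentence.split()]

print(stripped_first)
print(split_first)
```

Order A produces `cant`, while order B keeps `can't` intact, so the same two steps give different tokens depending on which one runs first.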

As a general rule, using `nltk.word_tokenize()` is preferred to `.split()`, because with `word_tokenize()` you retain the punctuation as separate tokens, which you can then choose to use or not use in your analysis.
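
Once punctuation is stored as separate tokens, choosing whether to use it is a simple filter. In the sketch below the token list is written out by hand as an assumed example of a punctuation-aware tokenizer's output, and a token counts as punctuation if every character in it appears in `string.punctuation`.

```python
import string

# token list written out by hand; assumed to match a punctuation-aware tokenizer's output
tokens = ['These', 'pretzels', 'are', 'making', 'me', 'thirsty', '!']

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark
    return all(character in string.punctuation for character in token)

# keep only the words
words_only = [token for token in tokens if not is_punctuation(token)]

# or keep only the punctuation
punctuation_only = [token for token in tokens if is_punctuation(token)]

print(words_only)
print(punctuation_only)
```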

# **Creating sentences with `nltk.sent_tokenize()`**

We can also obtain full sentences from texts using the `nltk.sent_tokenize()` function, which operates in the same way. The output here is a list of sentences.

In [None]:
# all the sentences in turtles
nltk.sent_tokenize(turtles)

The turtles text was already organised by newlines. Here is a two-sentence string on a single line, showing how `sent_tokenize()` handles that case:

In [None]:
# a string not separated by newlines is still split into sentences.
nltk.sent_tokenize("Give a man a fire and he's warm for a day. Set fire to him and he's warm for the rest of his life.")

In [None]:
# we can see how reliant it is on punctuation:
nltk.sent_tokenize("Give a man a fire and he's warm for a day Set fire to him and he's warm for the rest of his life.")
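
The reliance on punctuation is easy to reproduce with a naive splitter of our own. The function below, `naive_sentences`, is a hypothetical helper written for this illustration. It is not how `sent_tokenize()` works internally, since Punkt is a trained model that also copes with cases such as abbreviations. It simply splits wherever whitespace follows a full stop, exclamation mark, or question mark.

```python
import re

with_period = "Give a man a fire and he's warm for a day. Set fire to him and he's warm for the rest of his life."
without_period = "Give a man a fire and he's warm for a day Set fire to him and he's warm for the rest of his life."

def naive_sentences(text):
    # split at any whitespace that directly follows ., ! or ?
    return re.split(r"(?<=[.!?])\s+", text)

print(len(naive_sentences(with_period)))     # 2
print(len(naive_sentences(without_period)))  # 1
```

Without the full stop after "day" there is nothing for the pattern to latch onto, so the whole string comes back as one sentence.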