<a href="https://colab.research.google.com/github/moO0lk/LING227/blob/main/08_more_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **More preprocessing**

This notebook walks through some further considerations when preprocessing text. We have already considered punctuation and tokenization. Other forms of preprocessing include lowercasing, removing excess whitespace, and handling so-called stopwords.


# **Lexical diversity and preprocessing**

Let's start by considering how preprocessing affects a measure we've already explored: lexical diversity. Compare what capitalization does to the type-token ratio (TTR) of these two texts:

In [None]:
# create two texts that only differ based on capitalization
version1 = ['Soda', 'soda', 'Onion', 'onion']
version2 = ['soda', 'soda', 'onion', 'onion']

In [None]:
# calculate their TTR (unique tokens divided by total tokens)
version1_ttr = len(set(version1)) / len(version1)
version2_ttr = len(set(version2)) / len(version2)

Compare the two: version 1 has a higher TTR than version 2, even though they contain effectively the "same" words:

In [None]:
version1_ttr

In [None]:
version2_ttr

We clearly would not want to think that `version1` is more lexically diverse than `version2`, unless we have strong reason to believe the capitalization results in a fundamentally different word.

Hence, normalization is needed to address these issues. If we convert all of the tokens to lowercase, we now obtain the same TTR measures:






In [None]:
# use a list comprehension to lowercase all the tokens
version1_lower = [token.lower() for token in version1]
version2_lower = [token.lower() for token in version2]

In [None]:
# calculate new TTR scores
version1_lower_ttr = len(set(version1_lower)) / len(version1_lower)
version2_lower_ttr = len(set(version2_lower)) / len(version2_lower)

We see that both texts now have a TTR of 0.5 (50%):

In [None]:
version1_lower_ttr

In [None]:
version2_lower_ttr

While this is helpful, we need to be careful. In some cases you will want to retain capitalization. In English, at least, capitalization can signal names and proper nouns. And in languages whose writing systems have no letter case, such as Chinese and Japanese, lowercasing has no effect at all!
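
The comparison above can be wrapped in a small helper so that lowercasing becomes an explicit choice rather than a separate step. This is only a sketch: the function name `type_token_ratio` and its `lowercase` parameter are made up for illustration and are not part of NLTK or the earlier notebooks.

```python
def type_token_ratio(tokens, lowercase=False):
    """Return the number of unique tokens divided by the total number of tokens."""
    if lowercase:
        # normalise a copy of the tokens; the caller's list is left untouched
        tokens = [token.lower() for token in tokens]
    if not tokens:
        # avoid dividing by zero on an empty text
        return 0.0
    return len(set(tokens)) / len(tokens)


version1 = ['Soda', 'soda', 'Onion', 'onion']

raw_ttr = type_token_ratio(version1)                    # capitalization kept
lower_ttr = type_token_ratio(version1, lowercase=True)  # capitalization ignored
print(raw_ttr, lower_ttr)
```

Keeping `lowercase=False` as the default means capitalization is only discarded when you ask for it, which suits texts where names and proper nouns matter.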

# **Cleaning whitespace**

Text is messy, and sometimes there is extra whitespace that should be removed. The built-in string methods to do so are:

- **`str.rstrip()`**  
  Removes trailing whitespace or specified characters from the right end of a string.

- **`str.lstrip()`**  
  Removes leading whitespace or specified characters from the left end of a string.

- **`str.strip()`**  
  Removes both leading and trailing whitespace or specified characters from both ends of a string.


Consider the following two strings, one with an extra space at the end and one with an extra space at the start:

In [None]:
endspace = 'These pretzels are making me thirsty '

In [None]:
startspace = ' These pretzels are making me thirsty'

We can apply the `strip()` family of methods to the entire string before tokenization. This is a quick and effective way to clean up an entire string/text before doing something with it:

In [None]:
endspace.rstrip()

In [None]:
startspace.lstrip()

In [None]:
endspace.strip()

In [None]:
startspace.strip()
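
Note that the `strip()` methods only touch the ends of a string. The sketch below uses a made-up example string to show two further cases: collapsing runs of whitespace *inside* a string, and passing specific characters to `strip()`. The `' '.join(text.split())` idiom is a common approach, not something specific to this notebook.

```python
# a made-up string with extra whitespace at the ends and in the middle
messy = '  These   pretzels are\tmaking me   thirsty \n'

# strip() trims both ends, but the internal runs of whitespace remain
trimmed = messy.strip()

# split() with no argument splits on any run of whitespace, so joining the
# pieces with a single space collapses the internal runs as well
collapsed = ' '.join(messy.split())

# passing characters to strip() removes those characters from both ends
# instead of whitespace
unquoted = '"thirsty!"'.strip('"!')

print(repr(trimmed))
print(repr(collapsed))
print(repr(unquoted))
```

Using `repr()` when printing makes leading, trailing, and tab characters visible, which helps when checking whether a cleaning step actually worked.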

# **Stopwords**

Another form of preprocessing is to remove so-called stopwords. Stopwords are frequently occurring **function** words, such as determiners and articles (e.g., *the*, *an*), prepositions (e.g., *over*, *in*, *at*), and so on. Contrast these with content words, such as nouns (e.g., *dog*), verbs (e.g., *run*), and adjectives (e.g., *quick*), and you should begin to see the difference.

The logic of removing stopwords is driven by an assumption that stopwords generally contribute very little to the meaning of a text. Stopwords are also not good for distinguishing among texts, because they are so common and used in every text.

However, as NLP algorithms have advanced in recent years, removing stopwords can sometimes be counterproductive. But before we consider why, let's first look at how we could remove stopwords from a text, and the effect this has on the text. The NLTK module has a built-in list of stopwords. Run the cells below to see it.



In [None]:
# download the stopwords list and the tokenizer models used later in this notebook
import nltk
nltk.download(['stopwords', 'punkt_tab'])

In [None]:
# import the entire stopwords resource
from nltk.corpus import stopwords

# loop through all of the English stopwords
[word for word in stopwords.words('english')]

Have a look through the list above. You can see that there are a lot of words and pieces of words identified as stopwords. You can use this list to remove stopwords via a list comprehension.

To do so, we can include a conditional test that retains words only if they are *not* a stopword. Let's observe with a sample sentence:

In [None]:
# bring in a quote to use:
hitchiker = """Far Out in the uncharted backwaters of the unfashionable end
of the Western Spiral arm of the galaxy lies a small unregarded yellow sun"""

In the following list comprehension, I test whether the lowercased version of each token is in the stopwords list. Note how this does not transform the original token, but instead uses the lowercased version only for the conditional test. This means I can have the best of both worlds: make the decision based on a normalised version of the token while retaining the raw version of the token:

In [None]:
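# store the stopwords in a set once, rather than rebuilding the list for every token
english_stopwords = set(stopwords.words('english'))
# remove any token for which the lowercased version is in the NLTK stopwords resource
[token for token in nltk.word_tokenize(hitchiker) if token.lower() not in english_stopwords]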

What effect has removing stopwords had on the text? Can you still understand it?

Let's try this out on a longer text:

In [None]:
!wget 'https://raw.githubusercontent.com/scskalicky/LING-226-vuw/refs/heads/main/sample-texts/marine_biologist.txt'
# read the whole file into one string; the with-block closes the file afterwards
with open('marine_biologist.txt') as f:
  mb = f.read()

Let's get an idea of the text - I'll use `nltk.sent_tokenize()` on the text and print out some sentences from near the end:

In [None]:
# print sentences 350 to 372, near the end of the text
for sent in nltk.sent_tokenize(mb)[350:373]:
  print(sent)

Now let's see what happens when we remove stopwords. Note that I use `nltk.word_tokenize()` to split each sentence into tokens, filter the tokens in a list comprehension, and then use `' '.join()` to glue the remaining tokens back together.


What do you think of the output? Can you still understand the 'gist' of the texts?


In [None]:
for sent in nltk.sent_tokenize(mb)[350:373]:
  print(' '.join([token for token in nltk.word_tokenize(sent) if token.lower() not in stopwords.words('english')]))

The same logic works for *any* set of words that you want removed from a text, not just "stopwords". For example, what if I wanted to remove the names of the characters in the script? I could easily define my own list of stopwords and use it in the test:

In [None]:
# create a list of target names to remove:
stopnames = ['ELAINE', 'JERRY', 'GEORGE', 'KRAMER']

In [None]:
# remove any of the names
for sent in nltk.sent_tokenize(mb)[350:373]:
  print(' '.join([token for token in nltk.word_tokenize(sent) if token not in stopnames]))
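
Both filters above follow one pattern: keep a token unless it appears in some list. A minimal sketch of that pattern as a reusable function follows. The name `remove_stopwords`, the `ignore_case` parameter, and the toy stoplist and tokens are all hypothetical, chosen so the example runs without NLTK.

```python
def remove_stopwords(tokens, stoplist, ignore_case=True):
    """Return the tokens that do not appear in stoplist, in their original form."""
    if ignore_case:
        # compare lowercased versions, but return the raw tokens
        stopset = {word.lower() for word in stoplist}
        return [token for token in tokens if token.lower() not in stopset]
    # case-sensitive comparison, as with the character names above
    stopset = set(stoplist)
    return [token for token in tokens if token not in stopset]


# toy data, made up for illustration
toy_stoplist = ['the', 'of', 'a', 'in']
tokens = ['Far', 'out', 'in', 'the', 'uncharted', 'backwaters', 'of', 'The', 'Galaxy']

kept = remove_stopwords(tokens, toy_stoplist)
kept_case_sensitive = remove_stopwords(tokens, toy_stoplist, ignore_case=False)
print(kept)
print(kept_case_sensitive)
```

Converting the stoplist to a `set` once means each membership test is fast regardless of how long the stoplist is. The case-sensitive option matters for lists like the character names, where `'JERRY'` in the script should be removed but an ordinary word that happens to match in lowercase should not.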

# **More punctuation removal**

One way we have explored to remove punctuation is to define a string of target punctuation marks and then remove anything in that string using a conditional test. This works well enough if we define all of the punctuation we are interested in removing, although that can be annoying! One nice thing, for English at least, is the `string` module in Python's standard library. It stores ready-made constants, such as the ASCII letters and punctuation marks, which you can access for convenience. First import `string`:


In [None]:
import string

We can see that there isn't much here, but the constants `punctuation` and the various collections of letters might be useful! We can access these using `string.X`:

In [None]:
dir(string)

In [None]:
# look at all the uppercase letters
string.ascii_uppercase

In [None]:
# all letters
string.ascii_letters

In [None]:
# all punctuation
string.punctuation

Don't confuse the `string` module with the `str` type in Python; they are not the same thing. The `string` module is really just a helper for quickly accessing letters and punctuation marks, which you can use as needed.
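
As a sketch of how `string.punctuation` can stand in for a hand-typed string of punctuation marks, the cell below removes punctuation from a sample sentence in two ways: the conditional test described above, and the built-in `str.translate()` method. The variable names are only for illustration.

```python
import string

text = 'Life, uh, finds a way.'

# conditional test: keep each character only if it is not in string.punctuation
no_punct = ''.join(ch for ch in text if ch not in string.punctuation)

# same result using a translation table that deletes every punctuation mark
table = str.maketrans('', '', string.punctuation)
no_punct_translate = text.translate(table)

print(no_punct)
print(no_punct_translate)

# string.punctuation only covers ASCII punctuation, so characters such as
# curly quotes are not in it and would be left in the text
print('“' in string.punctuation)
```

Both approaches work on individual characters, so they also remove punctuation that sits inside a word, such as the apostrophe in *don't*.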

## using **`.isalpha()` and `.isalnum()`**

There is another way to remove punctuation, or put differently, to retain only strings that are alphabetic or alphanumeric. You can do so using the built-in string methods `.isalpha()` and `.isalnum()`. Read the explanation for each one and note how they both refer to a check on the **entire string**.

In [None]:
help(str.isalpha)

In [None]:
help(str.isalnum)

So a string in which *any* character violates the test will return `False` for the entire string:

In [None]:
# False because the ! is not alphabetic
'hi!'.isalpha()

In [None]:
# False because the 1 is not alphabetic
'hi1'.isalpha()

In [None]:
# True because 'hi' is alphabetic and '1' is numeric
'hi1'.isalnum()

Using this method, we can drop any string that fails this test. When combined with a function such as `nltk.word_tokenize()`, we can run this test on the entire token. Why? Because `nltk.word_tokenize()` will separate out the punctuation into its own separate token!

In [None]:
# target string
jp = 'Life, uh, finds a way.'

In [None]:
# check tokenization - the comma and full stop are their own tokens
nltk.word_tokenize(jp)

Use `.isalnum()` to "clean" out the punctuation from the list of tokens:

In [None]:
[token for token in nltk.word_tokenize(jp) if token.isalnum()]
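
Because `.isalnum()` checks the entire token, it also drops tokens that mix letters with punctuation, such as hyphenated words. The sketch below compares the strict test with a more lenient one that keeps any token containing at least one alphanumeric character. The token list is written by hand as an assumption about what a tokenizer might return, so that the example runs without NLTK.

```python
# hand-written tokens, standing in for tokenizer output (an assumption)
tokens = ['well-known', 'do', "n't", 'CO2', ',', '...', 'way', '.']

# strict: every character in the token must be alphanumeric
strict = [token for token in tokens if token.isalnum()]

# lenient: at least one character in the token must be alphanumeric
lenient = [token for token in tokens if any(ch.isalnum() for ch in token)]

print(strict)
print(lenient)
```

Which version is appropriate depends on the analysis: the strict test gives cleaner tokens, while the lenient test avoids silently losing words like *well-known*.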

## **a simple regex pattern to remove all punctuation**

Finally, here is another method for removing punctuation in a text that can be applied to an entire string before tokenization. This method is nice because it does not use a loop and thus does not require any joining at the end. The downside is that it is slightly more complex to understand, because it uses a somewhat opaque regular expression pattern.

The pattern is:

```
[^\w\s]
```

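This pattern contains two character classes:

- `\w` matches any word character: letters, digits, and the underscore. For ASCII text this is equivalent to `[a-zA-Z0-9_]`; in Python 3 it also matches letters and digits from other scripts.
- `\s` matches any whitespace character (for ASCII text, equivalent to `[ \t\n\r\f\v]`)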

The `^` is a metacharacter which negates the set when it is the first character inside the square brackets `[]`. Here it says: match anything **but** what is listed inside these brackets.

So, with a bit of reverse logic, the pattern below means replace anything that is *not* a word character or whitespace character with nothing, effectively cleaning out punctuation:

```
re.sub(r'[^\w\s]', '', text)
```

It's okay if this is a bit over your head, but remember this pattern and come back to copy this cell when you need to remove punctuation in the future! Also keep in mind there might be instances where you want to more carefully control the removal of specific characters, so be careful!

In [None]:
import re

In [None]:
# an easy way to remove punctuation from a string
re.sub(r'[^\w\s]', '', jp)
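
A few edge cases of this pattern are worth checking before relying on it. The sketch below wraps the pattern in a hypothetical helper, `strip_punct`, and tries it on made-up strings. The variant patterns shown are one possible way to adjust the behaviour, not the only one.

```python
import re

# compile the pattern once so it can be reused
PUNCT = re.compile(r'[^\w\s]')


def strip_punct(text):
    """Delete every character that is neither a word character nor whitespace."""
    return PUNCT.sub('', text)


plain = strip_punct('Life, uh, finds a way.')

# the hyphen is removed (joining the two words), but the underscore is kept
# because \w treats it as a word character
joined = strip_punct('well-known snake_case!')

# accented letters count as word characters in Python 3, so they survive
accented = strip_punct('café, naïve!')

# with the re.ASCII flag, \w is limited to [a-zA-Z0-9_], so the é is removed too
ascii_only = re.sub(r'[^\w\s]', '', 'café!', flags=re.ASCII)

# adding |_ to the pattern removes underscores as well
no_underscore = re.sub(r'[^\w\s]|_', '', 'snake_case!')

print(plain, joined, accented, ascii_only, no_underscore, sep='\n')
```

The hyphen case is the one most likely to matter in practice: if hyphenated words should stay separate, replace punctuation with a space instead of an empty string and then clean up the extra whitespace.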