# Text Preprocessing

This notebook demonstrates a simple text preprocessing pipeline using the [Natural Language Toolkit (NLTK)](https://www.nltk.org/index.html). 

Make sure you first follow the [instructions on Wattle](https://wattlecourses.anu.edu.au/mod/page/view.php?id=2943340) to set up your environment for this lab.

In [None]:
import nltk
import string
from collections import Counter

Raw text from [this Wikipedia page](https://en.wikipedia.org/wiki/Australia).

In [None]:
raw_text = "Australia, officially the Commonwealth of Australia, is a sovereign country comprising the mainland of the Australian continent, the island of Tasmania, and numerous smaller islands. With an area of 7,617,930 square kilometres (2,941,300 sq mi), Australia is the largest country by area in Oceania and the world's sixth-largest country. Australia is the oldest, flattest, and driest inhabited continent, with the least fertile soils. It is a megadiverse country, and its size gives it a wide variety of landscapes and climates, with deserts in the centre, tropical rainforests in the north-east, and mountain ranges in the south-east.\nIndigenous Australians have inhabited the continent for approximately 65,000 years. The European maritime exploration of Australia commenced in the early 17th century with the arrival of Dutch explorers. In 1770, Australia's eastern half was claimed by Great Britain and initially settled through penal transportation to the colony of New South Wales from 26 January 1788, a date which became Australia's national day."

In [None]:
# print(raw_text)

## Sentence splitting

Splitting text into sentences.
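To see why a trained sentence splitter is used here rather than a simple rule, the sketch below splits on whitespace that follows `.`, `!` or `?`, using only the standard library. The example text and the `naive_split` helper are illustrative assumptions, and this is not how `sent_tokenize` works internally.

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

example = "Dr. Smith arrived at 5 p.m. He left early."
pieces = naive_split(example)
print(pieces)
```

The example contains two sentences, but the rule yields three pieces because it treats the full stop in the abbreviation `Dr.` as a sentence boundary.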

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
# Uncomment the line below to see the documentation of `sent_tokenize`
# sent_tokenize?

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')  # package name looked up by recent NLTK releases

In [None]:
sentences = sent_tokenize(raw_text)

In [None]:
print(f'There are {len(sentences)} sentences')

## Tokenisation

Dividing a string into a list of tokens.
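As a rough illustration of how much the tokenisation rule matters, the sketch below compares two simple rules using only the standard library: splitting on whitespace, and a regular expression that separates runs of word characters from punctuation. The example sentence and variable names are assumptions for illustration, and neither rule is what `word_tokenize` does.

```python
import re

example = "Australia's eastern half, in 1770."

# Splitting on whitespace leaves punctuation attached to words.
whitespace_tokens = example.split()

# Runs of word characters, or single non-space punctuation characters.
regex_tokens = re.findall(r"\w+|[^\w\s]", example)

print(whitespace_tokens)
print(regex_tokens)
```

The two rules disagree on the possessive and on trailing punctuation, which is the kind of difference to look for when comparing tokenisers.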

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# word_tokenize?

In [None]:
tokens_list = [word_tokenize(s) for s in sentences]

In [None]:
# tokens_list

The top-10 most common tokens.

In [None]:
Counter([w for x in tokens_list for w in x]).most_common(10)

### Exercise

Try [other tokenisers provided by NLTK](https://www.nltk.org/api/nltk.tokenize.html) (e.g. `RegexpTokenizer`, `WhitespaceTokenizer`, `WordPunctTokenizer`) and compare their outputs.

In [None]:
# from nltk.tokenize import WhitespaceTokenizer

### Question 

What are the differences and how can we choose the best tokeniser for a task?

## Removing punctuation and stop words

Stop words and punctuation contribute little to many IR tasks, and removing them reduces the number of tokens we need to process.
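The filtering step can be sketched with set lookups on a small hand-written example. The miniature stop word list `toy_stopwords` and the token list below are hypothetical and only stand in for NLTK's English list and the notebook's real tokens.

```python
import string

# Hypothetical miniature stop word list; NLTK's English list is much longer.
toy_stopwords = {"the", "of", "and"}
punctuation = set(string.punctuation)

tokens = ["the", "island", "of", "tasmania", ",", "and",
          "numerous", "smaller", "islands", "."]

# Keep a token only if it is neither a stop word nor a punctuation character.
kept = [t for t in tokens if t not in toy_stopwords and t not in punctuation]

print(kept)
print(f"{len(tokens)} tokens before, {len(kept)} after")
```

Using sets keeps each membership test cheap, which matters once the token list grows beyond a toy example.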

In [None]:
from nltk.corpus import stopwords

In [None]:
nltk.download('stopwords')

In [None]:
stopwords_en = set(stopwords.words('english'))

In [None]:
# stopwords_en

In [None]:
punctuation = set(string.punctuation)  # set membership: only single punctuation characters match
tokens_list[:] = [[w for w in x if w not in punctuation and w not in stopwords_en] for x in tokens_list]

In [None]:
# tokens_list

The top-10 most common tokens.

In [None]:
Counter([w for x in tokens_list for w in x]).most_common(10)

### Question

Will we get a different set of tokens if we lowercase all words before removing stop words? What are the potential problems with doing that?

## Stemming

Turning words into stems.
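To make the idea of a stem concrete, here is a toy suffix-stripping sketch in plain Python. The suffix list and the `toy_stem` helper are assumptions for illustration; this is a far cruder rule set than the Porter algorithm used in this notebook.

```python
# Suffixes are tried in this order; the first match is stripped.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip one common suffix if at least `min_stem` characters remain (toy rule set)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

words = ["islands", "claimed", "comprising", "ranges", "is"]
stems = [toy_stem(w) for w in words]
print(list(zip(words, stems)))
```

Stems such as `compris` and `rang` are not dictionary words; a stem only needs to be shared by related word forms.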

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
tokens_stem = [stemmer.stem(w) for x in tokens_list for w in x]

In [None]:
# tokens_stem

In [None]:
Counter(tokens_stem).most_common(10)

### Exercise

Try other NLTK stemmers (e.g. `SnowballStemmer`, `RegexpStemmer`). You may need to download additional data packages; see https://www.nltk.org/data.html.

In [None]:
# from nltk.stem import SnowballStemmer, RegexpStemmer

## Lemmatisation

Turning words into lemmas (entries in a dictionary). This requires knowledge of the context, typically the intended part of speech (POS) of a word in that context.
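The dependence on part of speech can be illustrated with a tiny lookup table keyed by word form and POS. The table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical; in this notebook WordNet plays the role of the dictionary.

```python
# Hypothetical lookup table keyed by (word form, POS).
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("better", "a"): "good",
    ("islands", "n"): "island",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("saw", pos="v"))
print(toy_lemmatize("saw", pos="n"))
print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better"))
```

The same surface form maps to different lemmas depending on the POS supplied, and an unlisted combination is returned unchanged.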

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

POS tagging for lemmatisation.

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # package name looked up by recent NLTK releases
tags_list = nltk.pos_tag_sents(tokens_list)
# tags_list

A heuristic to convert POS tags to the [four syntactic categories that WordNet recognises (i.e. **noun**, **verb**, **adj** and **adv**)](https://wordnet.princeton.edu/):
- `n` for nouns
- `v` for verbs
- `a` for adjectives
- `r` for adverbs

In [None]:
# tags_list

In [None]:
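# Input is the lowercased first letter of a POS tag: 'j' -> 'a'; 'n', 'v', 'r' unchanged; anything else -> 'n'
wordnet_tag = lambda t: 'a' if t == 'j' else (t if t in ['n', 'v', 'r'] else 'n')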
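As a worked illustration of this heuristic, the sketch below applies the same first-letter rule to a few Penn Treebank style tags. The helper name `penn_to_wordnet` and the sample tags are assumptions for illustration; the tags your tagger emits may differ.

```python
def penn_to_wordnet(tag):
    """Map a POS tag to a WordNet category via its first letter (hypothetical helper)."""
    first = tag[0].lower()
    if first == "j":              # adjectives: JJ, JJR, JJS
        return "a"
    if first in ("n", "v", "r"):  # nouns, verbs, adverbs keep their letter
        return first
    return "n"                    # everything else falls back to noun

sample_tags = ["NNS", "VBD", "JJ", "RB", "IN", "CD"]
mapped = {tag: penn_to_wordnet(tag) for tag in sample_tags}
print(mapped)
```

Because only the first letter is inspected, a tag such as `RP` (particle) is also treated as an adverb, and prepositions and numbers are lemmatised as if they were nouns.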

Lemmatising

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
tokens_lemma = [lemmatizer.lemmatize(w.lower(), pos=wordnet_tag(t[0].lower())) for x in tags_list for (w, t) in x]

In [None]:
# tokens_lemma

In [None]:
Counter(tokens_lemma).most_common(10)

### Question

Compare the results of stemming and lemmatisation. Can you see the differences and the potential problems with stemming and lemmatisation?