# Text Processing the Bughunt Corpus
This notebook follows the process of taking the manually cleaned Bughunt corpus and creating a frequency distribution of the different bug words.

NB: This notebook does not actually process the whole corpus -- that is done by the script `insect-freq-unigram.py`. The examples here are a walk-through and explanation of the code using a single file.

We will use the Natural Language Toolkit (NLTK), a Python library that provides many ready-made text mining functions. More information can be found at http://www.nltk.org/

## Corpus Files

We already have the corpus **split into files by decade**. Here is a list of them:

In [2]:
import os
from pathlib import Path
data_path = Path('..', 'corpora', 'bughunt', '2-clean-by-decade')
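# os.walk does not guarantee an order, so sort the paths to list the decades chronologically
files = sorted(Path(root, filename) for root, _, filenames in os.walk(data_path) for filename in filenames)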
files

[PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1800.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1810.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1820.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1830.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1840.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1850.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1860.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1870.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1880.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1890.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1900.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1910.txt')]
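
Each filename encodes its decade, which is handy when results need to be labelled by decade later on. Here is a minimal sketch of extracting it, assuming every file follows the `bughunt-clean-YYYY.txt` pattern shown above. The helper name `decade_of` is hypothetical and not part of the project's scripts.

```python
from pathlib import Path


def decade_of(path: Path) -> int:
    """Return the decade encoded at the end of a corpus filename."""
    # 'bughunt-clean-1810.txt' -> stem 'bughunt-clean-1810' -> '1810'
    return int(path.stem.rsplit('-', 1)[-1])


example = Path('..', 'corpora', 'bughunt', '2-clean-by-decade', 'bughunt-clean-1810.txt')
print(decade_of(example))
```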

## Preparing to Process
Before we are ready to process these files, we need to gather together some resources.

### Bug Words
We have our list of **simple bug words** as a text file. Here it is:

In [3]:
wordlist = Path('..', 'wordlists', 'insect-wordlist.txt')
with open(wordlist) as reader:
    bug_words = reader.read().splitlines()
bug_words

['ant',
 'bee',
 'beetle',
 'butterfly',
 'cockroach',
 'cricket',
 'dragonfly',
 'earwig',
 'flea',
 'fly',
 'gnat',
 'grasshopper',
 'ladybird',
 'louse',
 'mosquito',
 'moth',
 'spider',
 'termite',
 'wasp']

We also have a list of the **stems** of bug words. **Stemming** is a form of word normalisation. It means reducing a word to its root, eliminating plurals and other inflections. Stems may not be actual words. 

In [4]:
stemlist = Path('..', 'wordlists', 'insect-wordstems.txt')
with open(stemlist) as reader:
    bug_stems = reader.read().splitlines()
bug_stems

['ant',
 'bee',
 'beetl',
 'butterfli',
 'cockroach',
 'cricket',
 'dragonfli',
 'earwig',
 'flea',
 'fli',
 'gnat',
 'grasshopp',
 'ladybird',
 'lous',
 'mosquito',
 'moth',
 'spider',
 'termit',
 'wasp']

As you can see in the list above, several of the stems, such as 'butterfli', 'dragonfli' and 'fli', are not real words.

This contrasts with **lemmatisation** where the reduced word, the **lemma**, is a proper word in the language; in fact, it is the canonical or dictionary form.
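
The idea of a lemma can be illustrated with a toy lookup table. This is only a sketch: the table `TOY_LEMMAS` and the function `toy_lemmatise` are hypothetical and hand-written for this example, whereas a real lemmatiser relies on a full dictionary and usually on part-of-speech information.

```python
# Hand-written, hypothetical lookup table: inflected form -> dictionary form.
TOY_LEMMAS = {
    'butterflies': 'butterfly',
    'flies': 'fly',
    'lice': 'louse',
}


def toy_lemmatise(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for word in ['butterflies', 'flies', 'lice', 'wasp']:
    print(word, '->', toy_lemmatise(word))
```

Every result here is a real English word, unlike stems such as 'butterfli' in the list above.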

### English Stopwords
We are not interested in common words in English that carry little meaning, such as 'the', 'a' and 'its'. There is no definitive list of stopwords, but a commonly-used list is provided by the Natural Language Toolkit (NLTK).

In [5]:
import nltk
nltk.download('stopwords', download_dir=Path('..', 'nltk_data'))
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
sorted(english_stops)[:20]

[nltk_data] Downloading package stopwords to ../nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been']

## Tokenising the Corpus
Tokenising means splitting a text into meaningful elements, such as words, sentences, or symbols.

To do this we use a simple facility provided by the NLTK to read in the files and a function to do the tokenising for us. The code example below takes a single corpus file and tokenises it. 

In [6]:
nltk.download('punkt', download_dir=Path('..', 'nltk_data'))

from nltk.corpus.reader import PlaintextCorpusReader
reader = PlaintextCorpusReader('.', '')
file_1810 = os.path.join(data_path, 'bughunt-clean-1810.txt')
text = reader.raw(file_1810)

from nltk import word_tokenize
tokens = word_tokenize(text)
tokens[:20]

[nltk_data] Downloading package punkt to ../nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['CONTENTS',
 '.',
 'CHAP',
 '.',
 'I',
 '.',
 'A',
 'young',
 'Bee',
 ',',
 'deceived',
 'by',
 'fine',
 'weather',
 ',',
 'leave*',
 'the',
 'Hive',
 'too',
 'early']

There are a number of problems with these tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.
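
To make the idea of tokenising concrete, here is a minimal sketch that uses only a regular expression. It is not how `word_tokenize` works internally, and its output differs in places (for instance, it splits `leave*` into two tokens), but it shows the basic idea of separating words from punctuation. The function name `simple_tokenize` is made up for this example.

```python
import re


def simple_tokenize(text: str) -> list:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


sample = 'A young Bee, deceived by fine weather, leave* the Hive too early'
print(simple_tokenize(sample))
```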

### Normalising to Lowercase
Normalising all words in a corpus to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. we want 'Gnat', 'gnat' and 'GNAT' to be recognised as the same word.

However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

In our case, we will lowercase the whole corpus immediately before tokenising it:

In [7]:
tokens = word_tokenize(text.lower())
tokens[:20]

['contents',
 '.',
 'chap',
 '.',
 'i.',
 'a',
 'young',
 'bee',
 ',',
 'deceived',
 'by',
 'fine',
 'weather',
 ',',
 'leave*',
 'the',
 'hive',
 'too',
 'early',
 ',']
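
The effect of lowercasing on counting can be seen in a small sketch that uses `collections.Counter` from the standard library on a made-up token list:

```python
from collections import Counter

tokens_mixed = ['Gnat', 'gnat', 'GNAT', 'Bee', 'bee']

# Without normalisation each spelling is counted separately.
counts_mixed = Counter(tokens_mixed)
print(counts_mixed)

# After lowercasing, the variants collapse into one entry per word.
counts_lower = Counter(token.lower() for token in tokens_mixed)
print(counts_lower)
```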

### Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words "termite" and "termite," might be considered to be different words.

This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask.

It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string ''.

In [8]:
import string
table = str.maketrans('', '', string.punctuation)
tokens_nopunct = [token.translate(table) for token in tokens]
tokens_nopunct[:20]

['contents',
 '',
 'chap',
 '',
 'i',
 'a',
 'young',
 'bee',
 '',
 'deceived',
 'by',
 'fine',
 'weather',
 '',
 'leave',
 'the',
 'hive',
 'too',
 'early',
 '']
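
The translation table deletes every character listed in `string.punctuation`, which contains ASCII punctuation only. The following sketch, using made-up tokens, shows the effect on a few cases, including a typographic apostrophe that is left untouched:

```python
import string

# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', string.punctuation)

examples = ['termite,', "aren't", '.', 'leave*', 'bee\u2019s']
for token in examples:
    print(repr(token), '->', repr(token.translate(table)))
```

Tokens consisting solely of punctuation become empty strings, and `'bee’s'` keeps its curly apostrophe because U+2019 is not in `string.punctuation`.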

### Removing Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens `''` and tokens that may be chapter numbers:

In [9]:
tokens_empty = [word for word in tokens_nopunct if not word.isalpha()]
tokens_empty[:10]

['', '', '', '', '', '', '', '', '', '']

In [10]:
tokens_nonwords = [word for word in tokens_nopunct if word.isnumeric()]
tokens_nonwords[:10]

['1', '6', '1', '5', '1', '1', '1', '1', '1', '1']

We can remove both of these by keeping only the tokens that are purely alphabetic:

In [11]:
words = [word for word in tokens_nopunct if word.isalpha()]
words[:20]

['contents',
 'chap',
 'i',
 'a',
 'young',
 'bee',
 'deceived',
 'by',
 'fine',
 'weather',
 'leave',
 'the',
 'hive',
 'too',
 'early',
 'and',
 'contrary',
 'to',
 'the',
 'advice']
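
`str.isalpha()` returns `True` only for non-empty strings made up entirely of letters, which is why a single test removes both empty and numeric tokens. A quick sketch with made-up tokens:

```python
candidates = ['', '1810', 'chap2', 'bee', 'café']

for token in candidates:
    print(repr(token), token.isalpha())

# Keep only the purely alphabetic tokens.
kept = [token for token in candidates if token.isalpha()]
print(kept)
```

Note that accented letters such as 'é' count as alphabetic, while mixed tokens such as 'chap2' are dropped entirely rather than trimmed.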

### Removing Stopwords
We are now ready to remove the stopwords we prepared earlier and thereby create a list of only meaningful words. Because punctuation has already been stripped from the tokens, we first strip it from the stopwords as well (so that, for example, "aren't" becomes 'arent'), ensuring that the two match.

In [17]:
english_stops_nopunct = {stopword.translate(table) for stopword in english_stops}
words_nostops = [word for word in words if word not in english_stops_nopunct]
words_nostops[:20]

['contents',
 'chap',
 'young',
 'bee',
 'deceived',
 'fine',
 'weather',
 'leave',
 'hive',
 'early',
 'contrary',
 'advice',
 'commands',
 'mother',
 'sufferings',
 'close',
 'confinement',
 'result',
 'disobe',
 'dience']
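
The reason for stripping punctuation from the stopwords can be shown with a small sketch. The stopword set here is a hand-picked subset used only for illustration, not the full NLTK list.

```python
import string

table = str.maketrans('', '', string.punctuation)

# Small illustrative subset of stopwords.
stops = {'the', 'are', "aren't"}
words = ['the', 'bees', 'arent', 'here']  # tokens after punctuation removal

# With the raw stopwords, 'arent' slips through because "aren't" != 'arent'.
unfiltered = [w for w in words if w not in stops]
print(unfiltered)

# Stripping punctuation from the stopwords makes the two forms match.
stops_nopunct = {stop.translate(table) for stop in stops}
filtered = [w for w in words if w not in stops_nopunct]
print(filtered)
```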

### Stemming the Tokens
Stemming the tokens ensures that plurals and other inflected forms are reduced to the same stem and can be counted as the same word. For example, 'fly' and 'flies' will both be normalised to 'fli', but so too will the verb form 'flying', which may or may not be desirable. Because the stemmer works by stripping suffixes, irregular forms are not handled: the plural 'lice', for instance, is not reduced to 'lous'.

To do this we use another facility provided by the NLTK called a **stemmer**. There are many different ways to stem words, but we will use the Porter Stemmer. (The Porter Stemmer is one of the earliest and most widely used stemmers, first developed in 1979. It is simple and speedy, but has some important limitations.)

In [13]:
from nltk import PorterStemmer
porter = PorterStemmer()
stems = [porter.stem(word) for word in words_nostops]
stems[:20]

['content',
 'chap',
 'young',
 'bee',
 'deceiv',
 'fine',
 'weather',
 'leav',
 'hive',
 'earli',
 'contrari',
 'advic',
 'command',
 'mother',
 'suffer',
 'close',
 'confin',
 'result',
 'disob',
 'dienc']

## Creating a Frequency Distribution
At last, we are ready to create a frequency distribution. We will use another NLTK facility called `FreqDist` to count the frequency of each unique word in the corpus, and then create a relative frequency value between `0` and `1`.

First, we create a frequency distribution:

In [14]:
from nltk.probability import FreqDist
freqdist = FreqDist(stems)

Here are the 20 most frequent stems (the numbers are absolute counts):

In [15]:
freqdist.most_common(20)

[('one', 432),
 ('butterfli', 404),
 ('bee', 310),
 ('would', 308),
 ('friend', 284),
 ('return', 232),
 ('littl', 206),
 ('could', 200),
 ('wing', 190),
 ('said', 184),
 ('much', 182),
 ('never', 174),
 ('though', 170),
 ('time', 170),
 ('see', 160),
 ('flower', 160),
 ('place', 150),
 ('think', 142),
 ('joe', 142),
 ('look', 140)]
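
Conceptually, a frequency distribution is a tally of how often each item occurs. A minimal sketch of the same idea, using `collections.Counter` from the standard library on a made-up list of stems:

```python
from collections import Counter

toy_stems = ['bee', 'flower', 'bee', 'butterfli', 'wing', 'bee', 'butterfli']

counts = Counter(toy_stems)

# most_common(n) returns the n highest counts as (item, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```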

We are not interested in a lot of these words, so the next thing to do is filter out all the words that are not in our list of bugs. Once we have done this we have a dictionary of stems and their relative frequencies.

In [16]:
from nltk.corpus.reader import WordListCorpusReader
insect_words = WordListCorpusReader('.', [stemlist])

insect_freq = {word: freqdist.freq(word) for word in insect_words.words()}
insect_freq

{'ant': 0.00022559104854719364,
 'bee': 0.0069933225049630034,
 'beetl': 9.023641941887746e-05,
 'butterfli': 0.009113878361306624,
 'cockroach': 0.0,
 'cricket': 0.00022559104854719364,
 'dragonfli': 0.0,
 'earwig': 0.0,
 'flea': 0.0001353546291283162,
 'fli': 0.0027522107922757625,
 'gnat': 4.511820970943873e-05,
 'grasshopp': 4.511820970943873e-05,
 'ladybird': 0.0,
 'lous': 0.0001353546291283162,
 'mosquito': 0.0,
 'moth': 0.0002707092582566324,
 'spider': 9.023641941887746e-05,
 'termit': 0.0,
 'wasp': 0.00022559104854719364}
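
A relative frequency is the count of a stem divided by the total number of stems in the file. Using the figures above, 'bee' occurs 310 times with a relative frequency of about 0.00699, which implies roughly 44,328 stems in total after stopword removal. The sketch below reproduces the calculation with `collections.Counter` on made-up data; stems that never occur get `0.0`, as 'cockroach' does above.

```python
from collections import Counter

toy_stems = ['bee', 'flower', 'bee', 'butterfli', 'wing', 'bee', 'butterfli', 'friend']
wanted = ['bee', 'butterfli', 'cockroach']

counts = Counter(toy_stems)
total = sum(counts.values())  # total number of stems, here 8

# A Counter returns 0 for missing keys, so absent stems get 0.0.
relative = {stem: counts[stem] / total for stem in wanted}
print(relative)
```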

## What's Next
In the script `insect-freq-unigram.py` the process above is applied to each of the corpus files in turn, and the results are output as a CSV file `insect-stem-freq-unigram.csv`. We will use this file in the next notebook `3-visualising-data.ipynb` to create some visualisations of the data.
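
As a rough idea of what writing such a CSV might involve, here is a sketch using the standard library's `csv` module. It is not the contents of `insect-freq-unigram.py`: the column layout (`decade`, `stem`, `frequency`), the function name `write_frequencies` and the sample values are all assumptions made for illustration.

```python
import csv
import io


def write_frequencies(results, handle):
    """Write {decade: {stem: relative_frequency}} as one CSV row per pair."""
    writer = csv.writer(handle)
    writer.writerow(['decade', 'stem', 'frequency'])  # assumed column names
    for decade in sorted(results):
        for stem, freq in sorted(results[decade].items()):
            writer.writerow([decade, stem, freq])


# Made-up values, written to an in-memory buffer instead of a file.
sample_results = {1810: {'bee': 0.007, 'ant': 0.0002}, 1800: {'bee': 0.003}}
buffer = io.StringIO()
write_frequencies(sample_results, buffer)
print(buffer.getvalue())
```

A "long" layout like this, with one row per decade and stem, tends to be convenient for plotting libraries, but the real file may well be organised differently.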