# Text Processing the Bughunt Corpus
This notebook takes the manually cleaned Bughunt corpus and builds a frequency distribution of the insect words it contains. The frequency distribution is written out in CSV format, which the next notebook uses to create the visualisation.

We will use the Natural Language Toolkit (NLTK), a Python library that provides many ready-made text mining functions. More information can be found here: http://www.nltk.org/
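As a rough sketch of where the notebook is heading, the following counts bug words in a short made-up sentence and writes the counts as CSV. The sample text, the column names and the use of `collections.Counter` are assumptions for illustration only; the notebook itself works on the decade files and may rely on NLTK's own tools instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample text and word list, for illustration only
sample_text = "the ant met a bee and the bee met a wasp"
sample_bug_words = {"ant", "bee", "wasp"}

# Count only the tokens that appear in the word list
counts = Counter(token for token in sample_text.split() if token in sample_bug_words)

# Write the counts as CSV to an in-memory buffer (a real run would open a file)
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["word", "count"])  # assumed column names
for word, count in sorted(counts.items()):
    writer.writerow([word, count])

csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a `StringIO` buffer keeps the sketch free of side effects; swapping it for an opened file would produce the CSV on disk.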

## Corpus Files

We already have the corpus **split into files by decade**. Here is a list of them:

In [36]:
import os
from pathlib import Path
data_path = Path('..', 'corpora', 'bughunt', '2-clean-by-decade')
# Walk the directory tree and collect every file, sorted so the decades appear in order
files = sorted(Path(root, name) for root, _, names in os.walk(data_path) for name in names)
files

[PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1800.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1810.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1820.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1830.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1840.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1850.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1860.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1870.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1880.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1890.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1900.txt'),
 PosixPath('../corpora/bughunt/2-clean-by-decade/bughunt-clean-1910.txt')]

## Preparing to Process
Before we are ready to process these files, we need to gather together some resources.

### Bug Words
We have our list of **simple bug words** as a text file. Here it is:

In [37]:
wordlist = Path('..', 'wordlists', 'insect-wordlist.txt')
with open(wordlist) as reader:
    bug_words = reader.read().splitlines()
bug_words

['ant',
 'bee',
 'beetle',
 'butterfly',
 'cockroach',
 'cricket',
 'dragonfly',
 'earwig',
 'flea',
 'fly',
 'gnat',
 'grasshopper',
 'ladybird',
 'louse',
 'mosquito',
 'moth',
 'spider',
 'termite',
 'wasp']

We also have a list of the **stems** of bug words. Stemming means reducing a word to its stem by stripping plurals and other inflections. A stem need not be a dictionary word: 'butterfly' is reduced to 'butterfli', for example.

In [42]:
stemlist = Path('..', 'wordlists', 'insect-wordstems.txt')
with open(stemlist) as reader:
    bug_stems = reader.read().splitlines()
bug_stems

['ant',
 'bee',
 'beetl',
 'butterfli',
 'cockroach',
 'cricket',
 'dragonfli',
 'earwig',
 'flea',
 'fli',
 'gnat',
 'grasshopp',
 'ladybird',
 'lous',
 'mosquito',
 'moth',
 'spider',
 'termit',
 'wasp']
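The stems in this list look like the output of NLTK's Porter stemmer, although which stemmer produced the file is an assumption here. As a minimal sketch under that assumption, the following stems a few inflected forms (the example words are hypothetical) so they can be compared against the list above.

```python
from nltk.stem import PorterStemmer

# Assumption: the stem list was produced with NLTK's Porter stemmer
stemmer = PorterStemmer()

# Hypothetical inflected forms, chosen for illustration
examples = ['beetles', 'butterflies', 'flies', 'termites']
example_stems = [stemmer.stem(word) for word in examples]
example_stems
```

Matching on stems rather than on the simple word list means that plural forms such as 'beetles' are counted together with 'beetle'.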

### English Stopwords
We are not interested in common English words that carry little meaning, such as 'the', 'a' and 'its'. There is no definitive list of these stopwords, but the Natural Language Toolkit (NLTK) provides a commonly used one.

In [44]:
import nltk
nltk.download('stopwords', download_dir=Path('..', 'nltk_data'))
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
english_stops

[nltk_data] Downloading package stopwords to ../nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

### Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in when we split the text into tokens, "termite" and "termite," might be treated as two different words.

This is a complicated matter, however, and what you choose to do will depend on the nature of your corpus and the questions you wish to ask.
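One simple approach, shown here only as a sketch, is to delete every ASCII punctuation character before splitting on whitespace. The sample sentence is hypothetical, and this is not necessarily the method used later in the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

sample = 'The termite, unlike the ant, eats wood.'  # hypothetical sentence
tokens = sample.lower().translate(remove_punct).split()
tokens
```

With the commas and full stop removed, "termite," becomes the plain token "termite". The trade-off is that apostrophes and hyphens are deleted too, so "don't" becomes "dont", and `string.punctuation` does not cover non-ASCII marks such as curly quotes.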