# Steps 2 and 3: Cleaning and Exploring Text

---
---
Before we can analyse text we have to clean and prepare it.

## Tokenising the Text
_Tokenising_ means splitting a text into meaningful elements, such as words, sentences, or symbols.

To do this we are going to use [spaCy](https://spacy.io), a free and open-source Natural Language Processing (NLP) package. If you're interested in learning how to work with spaCy more broadly for a variety of NLP tasks I recommend the tutorial [Natural Language Processing with spaCy in Python](https://realpython.com/natural-language-processing-spacy-python/).
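
To make the idea concrete before bringing in spaCy, here is a minimal sketch of tokenising with Python's built-in `re` module. This is not how spaCy works internally; the regular expression and the sample sentence are assumptions chosen only for illustration.

```python
import re

# A sample sentence, invented for illustration
sample = "Sing, O goddess, the anger of Achilles!"

# `\w+` matches a run of letters or digits (a word);
# `[^\w\s]` matches a single character that is neither a word character nor whitespace
simple_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(simple_tokens)
# ['Sing', ',', 'O', 'goddess', ',', 'the', 'anger', 'of', 'Achilles', '!']
```

Even this simple approach treats punctuation marks as tokens in their own right, which is also what we will see in spaCy's output later in this notebook.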

---
### spaCy and NLTK
[NLTK](https://www.nltk.org/) was the first open-source Python library for Natural Language Processing (NLP), originally released in 2001, and it is still a valuable tool for teaching and research. Much of the literature uses NLTK code in its examples, and researchers publish models, code and data sets for less widely used languages that are designed for use with NLTK.

If you are interested in using NLTK for the tasks presented in these notebooks instead of spaCy, please see last year's notebooks: https://github.com/mchesterkadwell/intro-to-text-mining-with-python

NLTK has been overtaken in efficiency and ease of use by more modern libraries, such as [spaCy](https://spacy.io/), which uses machine learning. spaCy is designed to use less computer memory and to split workloads across multiple processor cores (or even computers) so that it can handle very large corpora easily. It also has excellent documentation. However, out of the box, spaCy only supports a limited set of modern languages. For other languages, or for specialised kinds of text, you may need to train your own models.

I am using spaCy in these notebooks to cut down on the amount of code I am presenting. The idea is to reduce the complexity so that beginners can concentrate on first principles.

---

---
---
## Loading Data from a File

Firstly, we need to reload into memory the file that was saved at the end of the notebook [3-choosing-and-collecting-text](3-choosing-and-collecting-text.ipynb).

---
### Opening and Reading Text Files
We don't have enough time in this course to cover opening, reading and closing text files. Fortunately, it is not necessary to understand the next block of code to understand the rest of the notebook.

If you would like to learn more about this by yourself, try this guide [Reading and Writing Files in Python](https://realpython.com/read-write-files-python/).

---

In [None]:
# Import a module that helps with filepaths
from pathlib import Path

# Create a filepath for the file
text_file = Path('data', 'PREPPED-2199-0.txt')

# Open the file, read it and store the text with the name `iliad`
with open(text_file, encoding='utf-8') as file:
    iliad = file.read()

iliad[0:200]

---
---
## Tokenising with spaCy

First we import and load an English language model provided by spaCy (`en_core_web_sm`) and give it the name `nlp`, ready to do the work on the text.

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

Then we pass the text as an argument to the function `nlp` and spaCy does the rest. spaCy processes the text into a _document_ that contains a lot of information about the text.

This may take a while as the book is long. Watch until the `In [*]:` to the left of the code has finished and turned into an output number.

> **Important**: If you are running this notebook on **Binder**, you will need to modify the next line to something like `document = nlp(iliad[0:200000])` so that less of the file is processed at once. Binder has a memory (RAM) limit of around 2GB, and the *Iliad* is big and takes 2-3GB to process. If Binder goes over its memory limit, the kernel dies. You can keep an eye on how much memory you are using in the top right-hand corner of the menu bar, at the top of the page. If you are running the notebooks locally on your own machine this shouldn't be an issue.

In [None]:
document = nlp(iliad)
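
If you do need to process only part of the text, a plain slice such as `iliad[0:200000]` may cut the last word in half. The sketch below shows one possible way to cut at the nearest earlier whitespace instead. The function `truncate_at_word_boundary` is a hypothetical helper written for this illustration, not part of spaCy or of these notebooks, and the sample sentence is invented.

```python
def truncate_at_word_boundary(text, limit):
    """Return at most `limit` characters of `text` without splitting a word."""
    # Nothing to do if the text is already short enough
    if len(text) <= limit:
        return text
    chunk = text[:limit]
    # If the character just after the cut is whitespace, the cut is already clean
    if text[limit].isspace():
        return chunk
    # Otherwise walk backwards to the last whitespace character inside the chunk
    for index in range(len(chunk) - 1, -1, -1):
        if chunk[index].isspace():
            return chunk[:index]
    # No whitespace at all: fall back to the plain slice
    return chunk

sample = "Sing, O goddess, the anger of Achilles"
print(sample[:12])                            # Sing, O godd
print(truncate_at_word_boundary(sample, 12))  # Sing, O
```

With the real text this could be used as `document = nlp(truncate_at_word_boundary(iliad, 200000))`.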

We print out the document text stored in `document.text` just to check that spaCy has correctly parsed it.

In [None]:
document.text[0:500]

The document can be treated like a list of word tokens (and information about those tokens), which we can print out using a list comprehension:

In [None]:
tokens = [word.text for word in document]
tokens

In [None]:
# Write code here to get just the first 20 tokens

spaCy has split the text into sentences for us too. The document stores these sentences in `document.sents` and we can also print them out using a list comprehension.

In [None]:
sentences = [sent.text for sent in document.sents]
sentences

There are a number of problems with the word tokens: the capitalisation of the words has been preserved, and some of the tokens have unwanted special characters or comprise single items of punctuation.

---
---
## Normalising to Lowercase
Normalising all words to lowercase ensures that the same word in different cases can be recognised as the same word, e.g. we want 'Shield', 'shield' and 'SHIELD' to be recognised as the same word.

However, whether you choose to do this depends on the nature of your corpus and the questions you are investigating. For example, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

How can we lowercase all the tokens in the list of tokens `tokens`? By using the string method `lower()` and a list comprehension.

In [None]:
tokens_lower = [token.lower() for token in tokens]
tokens_lower[0:20]
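
To see why lowercasing matters when counting words, here is a small sketch using a made-up list of tokens (`sample_tokens` is an assumption for illustration, not taken from the *Iliad* data). It uses `Counter` from Python's standard library to count each token before and after lowercasing.

```python
from collections import Counter

# A made-up list of tokens for illustration
sample_tokens = ['Shield', 'shield', 'SHIELD', 'spear']

# Count the tokens exactly as they are
counts_original = Counter(sample_tokens)

# Count the tokens after normalising each one to lowercase
counts_lower = Counter(token.lower() for token in sample_tokens)

print(counts_original['shield'])  # 1
print(counts_lower['shield'])     # 3
```

Before lowercasing, the three spellings are counted as three different words; afterwards they are counted together.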

---
---
## Removing Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a corpus. For example, if punctuation is left in, the words "poet" and "poet," might be considered to be different words.

This is a complicated matter, however, and what you choose to do would vary depending on the nature of your corpus and what questions you wish to ask. It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string `''`. (You do not need to understand this code fully.)

In [None]:
# Import a module that helps with string processing
import string

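# Make a table that maps every punctuation character to `None` (which deletes it)
table = str.maketrans('', '', string.punctuation)
# The table is keyed by character code, so convert the keys back to characters to make it readable
punc_table = {chr(key):value for (key, value) in table.items()}
punc_table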

In [None]:
tokens_nopunct = [token.translate(table) for token in tokens_lower]
tokens_nopunct[0:20]
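
The effect of the translation table is easier to see on a handful of strings. The sketch below uses made-up sample strings (an assumption for illustration). Note two side effects: a token that was only punctuation becomes an empty string, and `string.punctuation` covers only ASCII punctuation, so characters such as curly quotation marks are left in place.

```python
import string

# Build the same kind of table as above: every ASCII punctuation character is deleted
table = str.maketrans('', '', string.punctuation)

# Made-up sample strings for illustration
samples = ['poet,', 'well-greaved', ',', '“sing”']

cleaned = [sample.translate(table) for sample in samples]
print(cleaned)
# ['poet', 'wellgreaved', '', '“sing”']
```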

---
---
## Removing Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens (`''`) and newline characters (`\r`, `\n`).

We can try a filter condition for the empty tokens. The operator `!=` is the inequality ("not equal to") operator, so `if token != ''` means "if token is _not_ equal to the empty string".

In [None]:
tokens_notempty = [token for token in tokens_nopunct if token != '']
tokens_notempty[0:10]

The operator `==` is the equality operator. If you just want a list of empty tokens, write the list comprehension replacing the `!=` with `==`.

In [None]:
# Write code here to get a list of empty tokens

Finally, we can remove all the newline characters by adding a condition that keeps only tokens made up entirely of alphabetic characters. The string method to use is `isalpha()`.

In [None]:
tokens = [token for token in tokens_notempty if token.isalpha()]
tokens
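
To check what `isalpha()` keeps and what it throws away, here is a short sketch on made-up strings (the list `samples` is an assumption for illustration). The method returns `True` only if the string has at least one character and every character is a letter, so it also rejects empty strings, digits and tokens that mix letters with anything else.

```python
# Made-up sample tokens for illustration
samples = ['achilles', '\n', '\r\n', '', '1', 'book1', 'ægis']

# Keep only the tokens made up entirely of letters
kept = [sample for sample in samples if sample.isalpha()]
print(kept)
# ['achilles', 'ægis']
```

Letters outside the basic English alphabet, such as `æ`, count as alphabetic, but a token like `book1` is removed entirely rather than having its digit stripped.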

---
---
## Saving Data to a File

Now we need to save the clean tokens into a file (`CLEAN-2199-0.txt`) and place it under the `data` folder. This is the reverse of the process we used to load the data from a file at the beginning of this notebook.

In [None]:
# Import a module that helps with filepaths
from pathlib import Path

# Create a filepath for the file
tokens_file = Path('data', 'CLEAN-2199-0.txt')

# Open a file and save the tokens inside it as one string, separated by spaces
with open(tokens_file, 'w', encoding='utf-8') as file:
    file.write(' '.join(tokens))

You should inspect the file now to see what it looks like.
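
Because the tokens are saved as a single string separated by spaces, they can be recovered later by reading the file and splitting on whitespace. The sketch below shows this round trip on a made-up list of tokens, writing to a temporary folder so that it does not touch the real `data` folder. The file name `sample-tokens.txt` and the list `sample_tokens` are assumptions for illustration.

```python
from pathlib import Path
import tempfile

# A made-up list of clean tokens for illustration
sample_tokens = ['sing', 'o', 'goddess']

# Work in a temporary folder that is deleted automatically afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir, 'sample-tokens.txt')

    # Save the tokens as one space-separated string
    with open(sample_file, 'w', encoding='utf-8') as file:
        file.write(' '.join(sample_tokens))

    # Read the string back and split it into a list of tokens again
    with open(sample_file, encoding='utf-8') as file:
        reloaded = file.read().split()

print(reloaded == sample_tokens)  # True
```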

---
---
## Summary

In this notebook we have: 

* **Loaded** text from a file.
* **Tokenised** the text into words and sentences with **spaCy**.
* **Normalised** the words to **lowercase** and **removed non-word** tokens (punctuation and empty tokens).
* **Saved** the clean tokens to file.