# Basic Text-Mining Concepts with Python

---
---

## Introduction
This notebook introduces some basic concepts of text wrangling and natural language processing (NLP) that you need for this course. With some familiarity with these topics you will better understand the named entity recognition examples in the following notebooks. There is undoubtedly more to learn about each concept, so treat this as a **brief overview** of:

* The text-mining pipeline
* Tokenising
* Normalising
* Stopwords
* Word stems and lemmatization
* Part-of-speech (POS) tagging
* Syntactic dependency parsing

---
---

## The Text-Mining Pipeline: 5 Steps of Text-Mining
There is no set way to do text-mining, but typically a workflow will involve steps like these:
1. Choosing and collecting text
2. Cleaning and preparing text
3. Exploring data
4. Analysing data
5. Presenting results

![Text-mining pipeline](assets/pipeline.png)

You may go through these steps more than once to refine your data and results, and frequently steps may be merged together. The important thing to realise is that steps 1-2 are critical in ensuring your data is capable of actually addressing the goals of your project. You are likely to spend significant time on cleaning and preparing your text.

For our purposes, I have saved a copy of Homer's _Iliad_ for us to work with in the [`data`](data) folder as [`data/iliad-butler-2199-0-prepped.txt`](data/iliad-butler-2199-0-prepped.txt). This copy comes from [Project Gutenberg](http://www.gutenberg.org/): [Homer's *Iliad*, translated by Samuel Butler in 1898](http://www.gutenberg.org/ebooks/2199). Therefore, we will start at step 2, with **cleaning and preparing** this text.

---
---
## Loading Data from a File

First, we need to open the file containing the _Iliad_ text to play with.

### Open and Read Text Files
To open and read a text file we use the `open()` function and pass the filepath of the text file as an argument.

> I am skipping over some of the detail of how exactly opening a file works. If you would like to learn more about opening and reading text files, try this guide: [Reading and Writing Files in Python](https://realpython.com/read-write-files-python/).

In [None]:
# Import a module that helps with filepaths
from pathlib import Path

# Create a filepath for the file
text_file = Path('data', 'iliad-butler-2199-0-prepped.txt')

# Open the file, read it and store the text with the name `iliad`
with open(text_file, encoding="utf-8") as file:
    iliad = file.read()

iliad[0:200]

---
---
## Tokenising with spaCy

Before we can do anything with a text we have to transform it into a form that can be manipulated by a computer. 

**Tokenising** means splitting a text into its individual elements, such as words, sentences, or symbols. Without this, the computer would just 'see' long strings of characters and have no idea what characters might form meaningful groups.
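
To see why tokenising takes more than splitting on spaces, here is a minimal sketch in plain Python. It is not how spaCy works internally; the sample sentence and the regular expression are assumptions chosen only to illustrate the idea.

```python
import re

sample = "Achilles smiled as he heard this, and was pleased."

# Naive approach: split on whitespace only.
# Punctuation stays glued to the neighbouring word.
naive_tokens = sample.split()
print(naive_tokens)

# Slightly better: treat runs of word characters and single
# punctuation marks as separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(regex_tokens)
```

In the first list the comma is stuck to `'this,'`, while the second list separates it out as its own token. A real tokeniser also has to handle abbreviations, hyphens, contractions and so on, which is why we rely on a library.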

---

To do some tokenising we are using [spaCy](https://spacy.io), a free and open-source Natural Language Processing (NLP) package. We will also use this same package for named entity recognition (NER) in the rest of the notebooks in this series, when we'll learn a lot more about it. For now, we will just use some of its basic functions.

---

First we import and load an English language model provided by spaCy (`en_core_web_sm`) and give it the name `nlp`, ready to do the work on the text.

In [None]:
import en_core_web_sm
nlp = en_core_web_sm.load()

The book is so long that we might have to process only part of it, depending on how much memory your computer has. To be on the safe side, we will slice the book up to character 400000.

In [None]:
iliad_small = iliad[0:400000]

Then we pass the text as an argument to the function `nlp` and spaCy does the rest. spaCy processes the text into a **document** that contains a lot of information about the text.

This may take a while as the book is long. Watch until the `In [*]:` to the left of the code has finished and turned into an output number.

In [None]:
document = nlp(iliad_small)

We print out the start of the text stored in `document.text` just to check that the document contains the text we passed in.

In [None]:
document.text[0:500]

The document can be treated like a list of **word tokens**, which we can print out using a list comprehension:

In [None]:
tokens = [word.text for word in document]
tokens

In [None]:
# Write code here to get just the first 20 tokens

> **EXERCISE**: What problems can you spot with these word tokens?

spaCy has also split the text into **sentence tokens**. The document stores these sentences in the attribute `sents` and again we can print them out using a list comprehension.

In [None]:
sentences = [sent.text for sent in document.sents]
sentences

---
---
## Basic Text Processing
When spaCy runs its various text-mining functions it takes care of many things for you. For example:

* spaCy understands that titlecase, uppercase and lowercase versions of a word may be the same word.
* spaCy realises that generally punctuation is not part of the beginning or end of a word.
* spaCy knows that words like "the" and "a" carry little meaning on their own in English.

Nevertheless, you might not always be using spaCy or another powerful library. Even if you are, it's important to understand how basic text processing works as a foundation for more advanced techniques.

### Normalise to Lowercase
Normalising all words to lowercase ensures that the same word in different cases can be recognised as the same word.

For example, we might want 'Shield', 'shield' and 'SHIELD' to be recognised as the same word. Though, in another case, you may not want the word 'Conservative' to be conflated with the word 'conservative'.

How can we lowercase all the tokens in the list of tokens `tokens`? By using the string method `lower()` and a list comprehension.

In [None]:
tokens_lower = [token.lower() for token in tokens]
tokens_lower[160:180]
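
As a small illustration of what lowercasing buys us, this sketch counts the distinct words in a hand-made list before and after normalising. The list `sample_tokens` is invented for the example and is not taken from the _Iliad_ tokens.

```python
sample_tokens = ["Shield", "shield", "SHIELD", "spear"]

# Without normalising, every casing counts as a distinct word
unique_raw = set(sample_tokens)

# After lowercasing, the three variants of 'shield' collapse into one
unique_lower = set(token.lower() for token in sample_tokens)

print(len(unique_raw), len(unique_lower))
```

The same collapsing would also merge 'Conservative' with 'conservative', so whether to lowercase depends on your project.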

### Remove Punctuation
Punctuation such as commas, full stops and apostrophes can complicate processing a dataset. For example, if punctuation is left in, a word count may not be accurate.

This is a complicated matter, however, and what you choose to do would vary depending on your project. It may be appropriate to remove punctuation at different stages of processing. In our case we are going to remove it *after* the text has been tokenised.

We will replace *all* punctuation with the empty string `''`. (You do not need to understand this code fully.)

In [None]:
# Import a module that helps with string processing
import string

# Make a table that translates all punctuation to an empty value (`None`)
table = str.maketrans('', '', string.punctuation)
punc_table = {chr(key):value for (key, value) in table.items()}

# Print the punctuation translation table to inspect it
punc_table

In [None]:
# Perform the translation
tokens_nopunct = [token.translate(table) for token in tokens_lower]
tokens_nopunct[160:180]
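
If the translation step feels opaque, it may help to run it on a handful of tokens. This is a minimal sketch using the same `str.maketrans()` and `translate()` calls as above; the list `sample_tokens` is made up for illustration.

```python
import string

sample_tokens = ["achilles", ",", "don't", "well-made", "."]

# Same kind of table as above: every punctuation character maps to None,
# which tells `translate()` to delete it
sample_table = str.maketrans('', '', string.punctuation)

sample_nopunct = [token.translate(sample_table) for token in sample_tokens]
print(sample_nopunct)
```

Notice two side effects: tokens that were only punctuation become empty strings, and words with internal punctuation are squashed together (`"don't"` becomes `'dont'`).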

### Remove Non-Word Tokens

We are still left with some problematic tokens that are not useful words, such as empty tokens (`''`) and newline characters (`\r`, `\n`).

We can try a filter condition for the empty tokens. The operator `!=` is the inequality ("not equal to") operator, so `if token != ''` means "if token is _not_ equal to the empty string".

In [None]:
tokens_notempty = [token for token in tokens_nopunct if token != '']
tokens_notempty[140:160]

The operator `==` is the equality operator. If you just want a list of empty tokens, write the list comprehension replacing the `!=` with `==`.

In [None]:
# Write code here to get a list of empty tokens

Finally, we can remove the newline characters by adding a condition that keeps only tokens made up entirely of alphabetic characters. The string method to use is `isalpha()`. Note that this also removes any tokens containing digits.

In [None]:
tokens = [token for token in tokens_notempty if token.isalpha()]
tokens
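
To check exactly what `isalpha()` lets through, here is a short sketch on a few hand-picked strings. The `samples` list is an assumption for illustration only.

```python
samples = ["achilles", "\n", "\r\n", "1898", "book1", ""]

# `isalpha()` is True only when the string is non-empty and
# every character in it is a letter
results = {repr(s): s.isalpha() for s in samples}
print(results)

# Only the purely alphabetic string survives the filter
kept = [s for s in samples if s.isalpha()]
print(kept)
```

Because the empty string also fails the test, `isalpha()` on its own would have removed the empty tokens too.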

### English Stopwords
Let's say we are interested in a frequency analysis of words in this book. In other words, we want to find out what are the most common words in order to get an idea about its contents.

But not all words are equally interesting. Some common words in English carry little meaning, such as "the", "a" and "its". These are called **stopwords**. There is no definitive list of stopwords, but most Python packages used for Natural Language Processing provide one as a starting point, and spaCy is no exception.

In [None]:
# Import the spaCy standard stopwords list
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = [stop for stop in STOP_WORDS]

# Sort the stopwords in alphabetical order to make them easier to inspect
sorted(stopwords)

In [None]:
# Write code here to count the number of stopwords

> **EXERCISE**: What do you notice about these stopwords?

For your own projects you would need to consider which stopwords are most appropriate:
* Will standard stopword lists for modern languages be suitable for that language written 10, 50, 200 years ago?
* Are there special stopwords specific to the topic or style of literature?
* How might you find or create your own stopword list?
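
One possible approach to the last question is to start from a standard list and extend it with words specific to your texts. The sketch below assumes a tiny stand-in for a standard list and a few archaic words picked by hand; both sets are hypothetical and only illustrate the mechanics.

```python
# A tiny stand-in for a standard stopword list (hypothetical; a real
# list is much longer)
base_stopwords = {"the", "a", "and", "of", "you"}

# Archaic words chosen by hand for illustration; which words belong here
# depends on your text and your research question
archaic_stopwords = {"thee", "thou", "thy", "hath"}

# Sets give fast membership tests and remove duplicates automatically
custom_stopwords = base_stopwords | archaic_stopwords

sample_tokens = ["thou", "hast", "the", "shield", "of", "achilles"]
filtered = [token for token in sample_tokens if token not in custom_stopwords]
print(filtered)
```

Here 'hast' slips through because nobody added it, which shows why a custom list usually needs a few rounds of inspection and refinement.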

Now we can filter out the tokens that match spaCy's stopword list:

In [None]:
tokens_nostops = [token for token in tokens if token not in stopwords]
tokens_nostops

### Create a Frequency Distribution
Just to demonstrate how nice and clean our tokens are now, we will create a frequency distribution by counting the frequency of each unique word in the text.

First, we create a frequency distribution:

In [None]:
# Import a module that helps with counting
from collections import Counter

# Count the frequency of words
word_freq = Counter(tokens_nostops)
word_freq

This `Counter` maps each word to the number of times it appears in the text, e.g. `'coward': 17`. By scrolling down the list you can inspect what look like common and infrequent words.

Now we can get precisely the 10 most common words using the `Counter` method `most_common()`:

In [None]:
common_words = word_freq.most_common(10)
common_words
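
It is worth looking at the shape of what `most_common()` returns, since later code may need to unpack it. This is a minimal sketch on an invented list of tokens, not on the real _Iliad_ counts.

```python
from collections import Counter

sample_tokens = ["ships", "achilles", "ships", "hector", "achilles", "ships"]
sample_freq = Counter(sample_tokens)

# `most_common(n)` returns a list of (word, count) tuples,
# sorted from the highest count to the lowest
top_two = sample_freq.most_common(2)
print(top_two)  # [('ships', 3), ('achilles', 2)]
```

Each item is a `(word, count)` pair, so the words and the counts can be separated with a list comprehension when needed.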

### Visualise Results
Visualising results can be very useful during text processing to review how well things are going.

There are many options for displaying simple charts, and very complex data, in Jupyter notebooks. We are going to use the most well-known library called [Matplotlib](https://matplotlib.org/), although it is perhaps not the easiest to use compared with some others.

We don't need to dwell on details of this code as we won't be using Matplotlib again in this course.

Let's display our results as a simple line plot:

In [None]:
# Display the plot inline in the notebook with interactive controls
# Comment out this line if you are running the notebook in Deepnote
%matplotlib notebook

# Import the matplotlib plot function
import matplotlib.pyplot as plt

# Get a list of the most common words
words = [word for word,_ in common_words]

# Get a list of the frequency counts for these words
freqs = [count for _,count in common_words]

# Set titles, labels, ticks and gridlines
plt.title("Top 10 Words used in Homer's Iliad in English translation")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(range(len(words)), [str(s) for s in words], rotation=90)
plt.grid(visible=True, which='major', color='#333333', linestyle='--', alpha=0.2)

# Plot the frequency counts
plt.plot(freqs)

# Show the plot
plt.show()

---
---
## Word Stems and Lemmatization
One form of normalisation we have not yet done is to make sure that different **inflections** of the same word are counted together. In English, words are modified to express quantity, tense, etc. (i.e. **declension** and **conjugation**).

For example, 'fish', 'fishes', 'fishy' and 'fishing' are all formed from the root 'fish'.

There are two main ways to normalise for inflection:

* **Stemming** is reducing a word to a stem by removing endings (a **stem** may not be an actual word).
* **Lemmatization** is reducing a word to its meaningful base or dictionary form using its context (a **lemma** is typically a proper word in the language).
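
To make the idea of a stem concrete, here is a deliberately crude sketch that strips a few suffixes. The function `toy_stem` and its suffix list are invented for this illustration; real stemmers, such as the Porter stemmer, use far more careful rules.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in ("ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["fish", "fishes", "fishy", "fishing", "horses"]
stems = [toy_stem(word) for word in words]
print(stems)
```

The four 'fish' words all reduce to `'fish'`, but `'horses'` becomes `'hors'`, which is not an English word. A lemmatizer would aim to return 'horse' instead.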

We will only cover lemmas here.

---
### Lemmatization with spaCy
When we first processed the _Iliad_ text with spaCy [above](#Tokenising-with-spaCy) it created lemmas for all the tokens automatically.

In [None]:
lemmas = [(token.text, token.lemma_) for token in document if token.text.isalpha()]
lemmas_interesting = [lemma for lemma in lemmas if lemma[0] != lemma[1] and lemma[1] != '-PRON-']
lemmas_interesting[-20:]

---
---
## Part-of-Speech (POS) Tagging
Another important natural language processing (NLP) task is marking up a word according to its particular **part of speech**. A part of speech is broadly defined as a category of words with similar grammatical properties. In English the following parts of speech are commonly recognised: **noun**, **verb**, **article**, **adjective**, **preposition**, **pronoun**, **adverb**, **conjunction**, and **interjection**.

However, in computational linguistics many more sub-categories are recognised.

> spaCy follows the [Universal Dependencies scheme](https://universaldependencies.org/u/pos/) and a version of the [Penn Treebank tag set](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). To read about the full set of POS tags see the spaCy documentation: [Part-of-speech tagging](https://spacy.io/usage/linguistic-features/#pos-tagging).

Again, spaCy has already POS tagged the text. We just need to look at the document:

In [None]:
pos_tags = [(token.text, token.pos_, token.tag_) for token in document if token.text.isalpha()]
pos_tags[250:270]
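
Once tokens carry POS tags, the tags can be used to count or filter words. The sketch below works on a few hand-written tuples in the same `(text, pos, tag)` shape as `pos_tags`; the tags are illustrative assumptions, not output copied from spaCy.

```python
from collections import Counter

# Hand-written (text, coarse tag, fine tag) tuples in the same shape as
# `pos_tags` above; these are illustrative, not actual spaCy output
sample_tags = [
    ("Achilles", "PROPN", "NNP"),
    ("smiled", "VERB", "VBD"),
    ("he", "PRON", "PRP"),
    ("heard", "VERB", "VBD"),
    ("comrades", "NOUN", "NNS"),
]

# Count how often each coarse-grained tag occurs
pos_counts = Counter(pos for _, pos, _ in sample_tags)
print(pos_counts.most_common())

# Keep only the words tagged as verbs
verbs = [text for text, pos, _ in sample_tags if pos == "VERB"]
print(verbs)
```

The same pattern should work on the real `pos_tags` list, for example to pull out every proper noun in the text.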

---
---
## Syntactic Dependency Parsing
As well as tagging tokens relatively independently from one another, a further form of NLP is tagging tokens according to their context and **relations with other tokens in a sentence**. 

For example, an adjective (e.g. "dearest") of a particular noun (e.g. "comrades") might be tagged as being an "adjectival modifier" of that _particular_ noun. Parsing a full sentence results in a **tree structure** of how every word in the sentence is related to every other word.
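
One way to picture the tree structure is as a set of links from each word to the word it depends on (its **head**). This sketch uses hand-written triples for the phrase "his dearest comrades"; the labels are assumptions made for illustration rather than output copied from spaCy.

```python
# Hand-written (token, relation, head) triples for the phrase
# "his dearest comrades"; illustrative only, not actual spaCy output
sample_deps = [
    ("his", "poss", "comrades"),
    ("dearest", "amod", "comrades"),
]

# Group each token under its head to recover the tree structure
children = {}
for token, relation, head in sample_deps:
    children.setdefault(head, []).append((token, relation))

print(children)
```

Here 'comrades' is the head and both 'his' and 'dearest' hang off it, with `amod` standing for "adjectival modifier".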

> Read more about the syntactic dependency labels used by spaCy in the documentation at [Syntactic Dependency Parsing](https://spacy.io/usage/linguistic-features/#dependency-parse).

Once more, spaCy has already done this for us. But rather than show you yet another list, this time we can use a nice visualiser called **displaCy** to see this in action.

> To play with an online version of syntactic dependency parsing with displaCy see the [displaCy Dependency Visualizer](https://explosion.ai/demos/displacy).

To prepare something a little more manageable than the whole _Iliad_ we will take an excerpt and create a new spaCy document first.

In [None]:
excerpt = "Achilles smiled as he heard this, and was pleased with Antilochus, who was one of his dearest comrades."

nlp2 = en_core_web_sm.load()
document2 = nlp2(excerpt)

In [None]:
# Import the displacy package
from spacy import displacy

# Add some options to display it nicely
options = {"compact": True, "distance": 100, "color": "brown"}

# Pass the excerpt document to displacy to display
displacy.render(document2, options=options)

> **EXERCISE**: Try visualising the parse tree of other sentences and see if you can understand the tags and relations.

---
---
## Summary

Here's what we have covered in this notebook:

* For any project a typical **text-mining pipeline** will involve choosing and collecting text, cleaning and preparing text, exploring data, analysing data and presenting results.
* **Tokenising** is splitting a text into its individual elements, such as words, sentences, or symbols.
* Basic text processing can include: **normalising** tokens to lowercase, removing punctuation, removing non-word tokens and **stopwords**.
* **Lemmatization** is reducing a word to its meaningful base or dictionary form to give its **lemma**.
* **Part-of-speech (POS) tagging** is labelling words with the part of speech they represent, for example, noun, verb or adjective.
* **Syntactic dependency parsing** is tagging tokens according to their grammatical relations with other tokens in a sentence. Parsing a full sentence results in a **tree structure** of how every word in the sentence is related to every other word.
* The NLP library **spaCy** automatically does many of these tasks when you feed it a text to process.

In the [next notebook](2-named-entity-recognition-of-henslow-data.ipynb) we will start our case study of another text-mining method: **named entity recognition** with letters from the Henslow Correspondence Project.