# Preparing and Processing Text using Python

## 5 Steps of Text-Mining
There is no set way to do text-mining, but typically a workflow will involve steps like these:
1. Choosing and collecting your data
2. Cleaning and preparing your data
3. Exploring your data
4. Analysing your data
5. Presenting the results of your analysis

You may go through these steps more than once to refine your data and results, and frequently steps may be merged together. The important thing to realise is that steps 1-2 are critical in ensuring your data is capable of actually answering your research questions. You are likely to spend significant time on cleaning and preparing your data.

> **Rubbish in = rubbish out**

This notebook covers steps 1-4. The next notebook `3-visualising-results.ipynb` will show step 5.

## More Python Basics

Before we start in earnest to code again, we need to cover a few more Python basics.

### Imports

Python has a lot of amazing capabilities built into the language itself, like being able to manipulate strings. However, in any Python project you are likely to want to use Python code written by someone else to go beyond the built-in capabilities. Code 'written by someone else' comes in the form of a file (or files) separate from the one you are currently working on.

An external Python file (or sometimes a group of files) is called a **module**, and in order to use it in your code you need to **import** it.

This is a simple process using a **keyword** called `import` and the name of the module. Just make sure that you `import` something _before_ you want to use it!

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


Obviously, that is a trivial example. It simply prints out the philosophy of the Python programming language.

You can also `import` modules and then use them:

In [4]:
import math
math.pi

3.141592653589793

In [5]:
import locale
locale.getlocale()

('en_GB', 'UTF-8')
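
The `import` keyword has a couple of other common forms that you are likely to meet in other people's code. The sketch below is only an illustration using the built-in `math` module; the alias `m` and the variable names are arbitrary choices made for this example.

```python
# Import a whole module, then reach inside it with a dot
import math
circumference = 2 * math.pi * 1.0   # circle of radius 1

# Import just one name from a module, so no 'math.' prefix is needed
from math import sqrt
root = sqrt(16)

# Import a module under a shorter alias ('m' is an arbitrary choice)
import math as m
rounded_up = m.ceil(2.1)

print(circumference, root, rounded_up)
```

All three forms give you access to the same code; they differ only in how much you have to type when you use it.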

---
#### Going Further: The Python Standard Library

If you are interested in delving into everything that the Python language has to offer, you can browse the [Python standard library](https://docs.python.org/3/library/index.html) and try some of the modules there by importing them.

---
---
#### Going Further: Python Package Index

I have completely glossed over how you get hold of modules and libraries from other sources. The answer to this is by using the [Python Package Index](https://pypi.org/), known as PyPI, but that is out of the scope of this workshop. Feel free to go and learn about this yourself with the tutorial [What Is Pip? A Guide for New Pythonistas](https://realpython.com/what-is-pip/) from RealPython.

---

As you will see in the sections below, we will `import` the Natural Language Toolkit (or parts of it), which is a massive **library** of modules dedicated to working with natural language. A library is simply a *collection of modules*.

### Functions
A function is a _reusable block of code_ that has been wrapped up and given a _name_. In order to run the code, we use the name followed by `()` parentheses. We have already seen this earlier. Here are all the functions (or methods) we have run so far:

In [6]:
# 'lower()' is the function
my_sentence = 'Butterflies are important as pollinators.'
my_sentence.lower()

'butterflies are important as pollinators.'

In [7]:
# 'upper()' is the function
my_sentence.upper()

'BUTTERFLIES ARE IMPORTANT AS POLLINATORS.'

In [8]:
# 'isalpha()' is the function
my_sentence.isalpha()

False

In [9]:
# 'getlocale()' is the function
locale.getlocale()

('en_GB', 'UTF-8')

Essentially, you can think of a function as a box. You put an input into the box, the box does something with it, and then the box gives you back an output. You generally don't need to worry _how_ the function does what it does (unless you really want to, in which case you can look at its code). You just know that it works.

---
#### Going Further: Functions and Methods
There is a technical difference between functions and methods. You don't need to worry about the distinction for our workshop. We will treat all functions and methods as the same.

If you are interested in learning more about functions and methods try this [Datacamp Python Functions Tutorial](https://www.datacamp.com/community/tutorials/functions-python-tutorial).
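
As a rough illustration of the difference (a sketch, not a full explanation): a function is called on its own, with the value passed in between the parentheses, whereas a method is attached to a value with a dot. The sentence is the same sample used earlier.

```python
my_sentence = 'Butterflies are important as pollinators.'

# len() is a function: the string goes inside the parentheses
number_of_characters = len(my_sentence)

# upper() is a method: it is called 'on' the string using a dot
shouted = my_sentence.upper()

print(number_of_characters)
print(shouted)
```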

---

#### Functions that Take Arguments
If we need to pass particular information to a function, we put that information _in between_ the `()`. Like this:

In [11]:
# Calculate the square root of 25
math.sqrt(25)

5.0

The `25` is the value we want to pass to the `sqrt()` function so it can do its work. This value is called an **argument** to the function. A function may take no arguments, one argument, or several, depending on how it was written.
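
To make the idea of 'more than one argument' concrete, here is a small sketch using only built-in functions. The variable names are made up for this example.

```python
import math

# One argument: the number to take the square root of
square_root = math.sqrt(25)

# Two arguments, separated by a comma:
# the number to round, and how many decimal places to keep
rounded_pi = round(math.pi, 2)

# No arguments at all: the parentheses are still needed
lowered = 'POLLINATORS'.lower()

print(square_root)   # 5.0
print(rounded_pi)    # 3.14
print(lowered)       # pollinators
```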

Here is another function with an argument:

In [25]:
# Get the text of a webpage (but only the first 270 characters)
import requests
r = requests.get('https://www.wikipedia.org/')
r.text[0:270]

'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n'

The string `'https://www.wikipedia.org/'` is the argument we want to pass to the `get()` function for it to open the webpage and read it for us.

Why not try your own URL? What explains the strange appearance of this text?

## Step 1: Choosing and Collecting Your Data
No matter your research subject, you need to be aware of the many issues of electronic data collection. We cannot cover them all here, but you should ask yourself some questions as you start to collect data, such as:
* What sort of data do I need to answer my research questions?
* What data is available?
* What is the quality of the data?
* How can I get the data?
* Am I allowed to use it for text-mining?

### A Simple Example: Top Words Used in Homer's Iliad

Our research question will be:

> What are the top 10 words used in Homer's Iliad in English translation?

#### What sort of data do I need to answer my research questions?

I need a copy of Homer's Iliad in English translation. In this instance, I am not bothered by which translation.

#### What data is available?

[Project Gutenberg](http://www.gutenberg.org/) was the first provider of free electronic books and offers over 58,000 of them. "You will find the world's great literature here, with focus on older works for which U.S. copyright has expired. Thousands of volunteers digitized and diligently proofread the eBooks, for enjoyment and education."

Here is Homer's Iliad, translated by Alexander Pope, in an edition published in 1899: http://www.gutenberg.org/ebooks/6130

#### What is the quality of the data?

Variable. Some books have been digitised by OCR ([Optical Character Recognition](https://en.wikipedia.org/wiki/Optical_character_recognition)) and remain uncorrected by their volunteers, but a quick look at this file shows that it is excellent quality.

#### How can I get the data?

Project Gutenberg clearly states in their [Terms of Use](http://www.gutenberg.org/wiki/Gutenberg:Terms_of_Use) that their website is 'intended for human users only'. If you want to use code to get their data you must use one of their [mirror sites](http://www.gutenberg.org/MIRRORS.ALL), and you should pick the one that is nearest to your location.

We will be using the text file at http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt

#### Am I allowed to use it for text-mining?

Project Gutenberg says in their [Permission: How To](http://www.gutenberg.org/wiki/Gutenberg:Permission_How-To) that "The vast majority of Project Gutenberg eBooks are in the public domain in the US." However, since UK copyright is different from US copyright, we still have to check for ourselves. This is a complicated area, but broadly we can say that UK copyright expires 70 years after the death of the author. For a translation, the person who matters here is the translator. Since [Alexander Pope](https://en.wikipedia.org/wiki/Alexander_Pope) died in 1744, we are probably OK to use his work.

### Getting a Copy of the Homer's Iliad Text
We saw above that we can use a Python library called `requests` to get webpages! We can therefore get a copy of the text file like this:

In [46]:
r = requests.get('http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt')
iliad = r.text
# Print a short section of the text (characters 18007 up to, but not including, 18500)
iliad[18007:18500]

'It is on the coast, at\r\nsome distance from the city, northward, and appears to have been an open\r\ntemple of Cybele, formed on the top of a rock. The shape is oval, and in\r\nthe centre is the image of the goddess, the head and an arm wanting. She\r\nis represented, as usual, sitting. The chair has a lion carved on each\r\nside, and on the back. The area is bounded by a low rim, or seat, and\r\nabout five yards over. The whole is hewn out of the mountain, is rude,\r\nindistinct, and probably of the '

We can find out how many characters the file has by using the `len()` function.

In [39]:
len(iliad)

1201763
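
The square brackets used above, as in `iliad[18007:18500]`, take a *slice* of the string. The sketch below shows slicing and `len()` on a very short string so that you can check the results by eye; the string is only a stand-in for the full text.

```python
# A short stand-in for the full text of the book
sample = 'The Iliad of Homer'

# Characters are counted from 0, and the slice stops just before the second number
title_word = sample[4:9]

# len() counts every character, including the spaces
total_characters = len(sample)

print(title_word)         # Iliad
print(total_characters)   # 18
```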

## Steps 2 and 3: Cleaning and Exploring Your Data
We are going to combine these two steps in this workshop.
### Inspecting and Preparing the Text
The first thing to do is inspect the text and see what might need sorting out. Looking again at the text by eye (http://www.mirrorservice.org/sites/ftp.ibiblio.org/pub/docs/books/gutenberg/6/1/3/6130/6130-8.txt) you can see that the book starts with a load of front matter we don't want.

The book actually starts after the text "`***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`":

In [45]:
# Book content starts at character 553
iliad[553:700]

'\r\n\r\n\r\n\r\n\r\n\r\nThe Iliad of Homer\r\n\r\n\r\nTranslated by Alexander Pope,\r\n\r\nwith notes by the\r\nRev. Theodore Alois Buckley, M.A., F.S.A.\r\n\r\nand\r\n\r\nFlaxman'

There is also unwanted matter at the end after "`***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***`" that we should get rid of too.
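
One possible way to trim the front and back matter in code is to search for the two marker lines and keep only what lies between them. This is a minimal sketch, not necessarily how the workshop file was prepared: the `raw` string is a tiny made-up stand-in for the downloaded file, and the variable names are chosen for this example.

```python
# A tiny stand-in for the downloaded file (the real one is over a million characters)
raw = (
    'Front matter we do not want\r\n'
    '***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***\r\n'
    'The Iliad of Homer\r\n'
    '***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***\r\n'
    'Licence text we do not want\r\n'
)

start_marker = '***START OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'
end_marker = '***END OF THE PROJECT GUTENBERG EBOOK THE ILIAD OF HOMER***'

# find() returns the position where the marker begins, or -1 if it is missing
start = raw.find(start_marker)
end = raw.find(end_marker)
if start == -1 or end == -1:
    raise ValueError('Could not find both Gutenberg markers')

# Keep only the text after the start marker and before the end marker,
# then strip the leftover blank lines at either end
book = raw[start + len(start_marker):end].strip()

print(book)   # The Iliad of Homer
```

Searching for the markers is more robust than hard-coding a position such as 553, because the position changes from book to book.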

---
#### Going Further: OCR Errors
We are very fortunate that this text does not suffer from common OCR errors, where the OCR process has 'transcribed' the text incorrectly. We won't be covering what to do about this in this workshop, but if you are curious you can read more about how the British Library has dealt with this in a blog post [Dealing with Optical Character Recognition errors in Victorian newspapers](https://blogs.bl.uk/digital-scholarship/2016/07/dealing-with-optical-character-recognition-errors-in-victorian-newspapers.html).

---

### Creating and Preparing a Local Copy

It is not very efficient to keep making web requests to Project Gutenberg, especially with a very large corpus. I have therefore downloaded a copy for us and placed it in our project. We will use this local copy instead from now on.

I have also taken some steps to prepare the file on your behalf, to save us some time. In the spirit of full transparency and documentation here is what I have done:

* Removed the unwanted Gutenberg-related matter at the front and back of the book
* Converted the character encoding from 'ISO 8859-1' to 'UTF-8'

You don't need to worry about the details of _character encoding_ for this workshop. You only need to know that Python works most easily with UTF-8 files and so we must have the file in that encoding to avoid problems.
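
If you are curious, converting a file from one encoding to another can be done in a few lines. The sketch below is an assumption about one way to do it, not a record of how the workshop file was actually converted. It creates its own small sample file in a temporary folder, and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

# Work in a temporary folder so the example does not touch any real files
folder = Path(tempfile.mkdtemp())
latin1_file = folder / 'sample-latin1.txt'   # hypothetical file names
utf8_file = folder / 'sample-utf8.txt'

# Create a small ISO 8859-1 file to stand in for the downloaded book
latin1_file.write_text('naïve café', encoding='iso-8859-1')

# Read the text using the encoding it was saved in...
text = latin1_file.read_text(encoding='iso-8859-1')

# ...and write it back out as UTF-8
utf8_file.write_text(text, encoding='utf-8')

# The text is the same, but the bytes stored on disk are different
print(utf8_file.read_text(encoding='utf-8'))
print(len(latin1_file.read_bytes()), len(utf8_file.read_bytes()))
```

The key point is that you must tell Python which encoding to use when reading, otherwise it has to guess.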

---
#### Going Further: Character Encoding
Character encoding is a very important topic, but it is not an easy one. If you end up dealing with a lot of text files in building up your corpus you will have to be aware that dealing with files that have different, or unknown, character encodings can get very messy. If you don't know, or wrongly assume the character encoding of a file you can end up with this sort of thing: ࡻࢅ࢖
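
You can reproduce this kind of garbling yourself. The sketch below saves a word as UTF-8 bytes and then deliberately reads those bytes back with the wrong encoding; the word is just sample data.

```python
original = 'café'

# Turn the text into UTF-8 bytes, as if saving it to a file
utf8_bytes = original.encode('utf-8')

# Read those bytes back while wrongly assuming ISO 8859-1
garbled = utf8_bytes.decode('iso-8859-1')
print(garbled)     # cafÃ©

# Reading the bytes back with the correct encoding recovers the text
recovered = utf8_bytes.decode('utf-8')
print(recovered)   # café
```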


---



### Step 3: Exploring Your Data

Cleaning/preparing and exploring often go together, because you need to explore your text to understand how best to clean it. The topics for this part are:

* Cleaning: normalisation (case, punctuation), stopwords, tokenising
* Input/output
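
The cleaning steps named above can be sketched in a few lines. This example uses plain Python rather than the Natural Language Toolkit so that it runs without any downloads; it is an illustration under that assumption, not the workshop's own method. The sample sentence and the tiny stopword list are made up for this example.

```python
import string

# A short made-up sample standing in for the full text
text = 'The wrath of Achilles, and the will of Zeus!'

# Normalise case: make everything lowercase
lowered = text.lower()

# Normalise punctuation: delete every punctuation character
no_punctuation = lowered.translate(str.maketrans('', '', string.punctuation))

# Tokenise: split the text into a list of words at whitespace
tokens = no_punctuation.split()

# Remove stopwords (very common words that carry little meaning)
stopwords = {'the', 'of', 'and'}   # tiny hand-picked list
words = [token for token in tokens if token not in stopwords]

print(words)   # ['wrath', 'achilles', 'will', 'zeus']
```

Real stopword lists are much longer, and splitting at whitespace is the simplest possible tokeniser; a dedicated library handles cases such as apostrophes and hyphens more carefully.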

## Step 4: Analysing Your Data with Frequency Analysis

In this step we analyse the data by counting words and building a frequency distribution, which records how many times each word appears.
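
As a minimal sketch of counting words, the example below uses `Counter` from Python's standard library as a stand-in for a dedicated frequency distribution tool. The list of tokens is made up for this example.

```python
from collections import Counter

# A made-up list of cleaned tokens standing in for the whole book
tokens = ['wrath', 'achilles', 'gods', 'achilles', 'hector', 'gods', 'achilles']

# Count how many times each word appears
frequencies = Counter(tokens)

# Ask for the most frequent words, highest count first
top_two = frequencies.most_common(2)

print(top_two)   # [('achilles', 3), ('gods', 2)]
```

Changing the `2` to `10` would give the top 10 words, which is the shape of answer our research question asks for.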



