
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/5%20-%20Corpus%20Selection.ipynb)

# 5 Corpus Selection


## Text Mining for Historians (with Python)
## A Gentle Introduction to Working with Textual Data in Python

### Created by Kaspar Beelen and Luke Blaxill

### For the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">






## 5.1 Introduction

When confronted with large collections of text, being able to find and select the relevant documents is a crucial skill for the digital historian.

Selecting information from digital archives is a critical part of the research process. In this Notebook, we demonstrate various procedures for creating meaningful subsamples (i.e. subsamples that are more relevant for a particular research question) from a large collection of text.

In both digital and analogue research, corpus creation, i.e. finding those documents that may merit closer inspection, is the first step.
By selecting and filtering data, we can bring together otherwise disparate elements in one subcorpus.

In most scenarios filtering documents is based on a combination of **metadata** and **content** criteria:
- Metadata criteria: this involves selecting documents that fall within a certain date range, or are produced in a specific geography, or by a political party. Such information is often encoded in the document metadata. In our case studies, we will mainly use the filenames as metadata.
- Content criteria: this involves selecting documents based on the words they contain. In this Notebook, we have a look at regular expressions, a powerful query technique that allows you to select documents based on complex patterns. We won't have time to go into details, but we discuss a relevant example in which you query multiple tokens at once.


At the end of this Notebook, you'll be able to:
- Iterate over a collection of files
- Create a control flow `if else` for selecting documents
- Write simple functions

## 5.2 Unit of Analysis

Before we create a subcorpus, we have to define the units of our collection: should these be whole documents, paragraphs, sentences or even ngrams?

For studying specific keywords we don't require the whole document, and sentences would suffice. In other words: what contexts do we want to include for our analysis? This depends on the question of course and we will explore different scenarios.

For example, you could approximately split a text into paragraphs by splitting a string on blank lines (i.e. two consecutive hard returns).

In this cell we download "Oliver Twist" from gutenberg.org and get the text from the first sentence onwards.

In [None]:
import requests
text  = requests.get('https://www.gutenberg.org/files/730/730-0.txt').content.decode('utf-8') # get oliver twist
content = text.split(' CHAPTER I.')[1] # get the string from the first sentence onwards

In [None]:
content[:2000] # print the first 2000 characters

Inspecting the special characters in the string, you'll notice the sequence **"\r\n\r\n"** marking the boundary between paragraphs (approximately). We use this sequence to split `content` and store the result in `paragraphs`.

In [None]:
paragraphs = content.split('\r\n\r\n')
len(paragraphs)

In [None]:
print(paragraphs[10])
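Text files from other sources may use `\n` instead of `\r\n` as line ending, in which case splitting on `'\r\n\r\n'` finds nothing. As a minimal sketch (the sample string and the helper name `split_paragraphs` are made up for illustration), a regular expression can accept both:

```python
import re

def split_paragraphs(text):
    """Split text on one or more blank lines, accepting LF or CRLF line endings."""
    parts = re.split(r'(?:\r?\n){2,}', text)
    # drop empty strings and strip surrounding whitespace
    return [p.strip() for p in parts if p.strip()]

# invented sample with mixed line endings
sample = "First paragraph,\r\nstill first.\r\n\r\nSecond paragraph.\n\nThird paragraph."
paragraphs_demo = split_paragraphs(sample)
print(len(paragraphs_demo))
```

A single line break inside a paragraph is left alone; only two or more consecutive line breaks count as a paragraph boundary.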

Another option is to split a text into **sentences**. You can use the NLTK function `sent_tokenize`...

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

In [None]:
sentences = sent_tokenize(content)
len(sentences)

In [None]:
sentences[100]

... or rely on SpaCy. Running the cell below could take a while.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm") # Load English model
doc = nlp(content)

In [None]:
sentences = []
for s in doc.sents:
    sentences.append(s)
len(sentences)

### --- Exercise

Download another book from gutenberg.org (search for any book, select the "Plain Text UTF-8" version and use the URL as a string in `requests.get`). Then compute the average number of sentences per paragraph (i.e. count the number of sentences and divide this by the number of paragraphs).

In [None]:
# Enter code here

### --- Exercise

What is the average sentence length (in tokens) of Oliver Twist (i.e. divide the number of tokens by the number of sentences)?

In [None]:
# Enter code here

### --- Exercise

What is the length of the longest sentence in Oliver Twist? 

Tip: 
- Create an empty list
- Iterate over the sentences and append the sentence length (with `len`) to this list.
- Apply `max()` to this list; this will return the maximum value in the list.

In [None]:
# Enter code here

## 5.3 Filtering Based on Metadata

After selecting the textual unit of our corpus, we proceed with defining other criteria for data selection. Here we focus on aspects related to the document's metadata, especially filtering by time.

In the examples, we use articles from Heritage Made Digital newspapers. Please note that this corpus is already a sample (because the whole dataset was too large to share). We selected articles containing the word **"slavery"**. The exercises below demonstrate different techniques that enable you to create subsamples that zoom in on specific periods and newspapers.

### 5.3.1 Paths

First, we show how to exploit information encoded in file names as metadata and use it for filtering documents; then we have a closer look at XML documents where metadata appears in the document's markup.

**[Important]** Please run the following cells, which download and extract the data needed in the remainder of the Notebook. If you are using Colab and you need to restart the Kernel/Runtime (or it restarted by itself), please run these cells again.

In [None]:
!mkdir working_data
!mkdir working_data/hmd

In [None]:
!wget -O working_data/articles.zip https://github.com/kasparvonbeelen/ghi_python/raw/main/data/hmd_data/articles.zip
!unzip -o working_data/articles.zip -d working_data/hmd

We use `pathlib`, a module from Python's standard library, to make working with files and directories a bit easier.

We need to import the `Path` object from this library first.

In [None]:
from pathlib import Path

Before we continue, let's inspect where (and how) the articles are stored. We use the bash command `ls .` to list all documents in the current directory. To differentiate between bash and Python code, the former always starts with an exclamation mark (`!`).

In [None]:
!ls .

Now we list all folders in `working_data/hmd`. Each newspaper in the collection has its own folder. The folder names are (NLP) IDs:

- **0002088**: Liverpool standard and general commercial advertiser
- **0002194**: The Sun (London) 
- **0002643**: The British Press; or, Morning Literary Advertiser
- **0002644**: National Register
- **0002646**: The Star
- **0002647**: The Statesman

In [None]:
!ls working_data/hmd

The command below lists all files in `0002644` (**National Register**). You'll notice that filenames have a particular structure. The `_` characters separate the different parts of the metadata.

In [None]:
!ls working_data/hmd/0002644

Using `pathlib` we can collect paths to all the files in our HMD collection. The code below may look a bit obscure at first but (explained in human language) it does the following:
- define the location where the data are stored (path is provided as a string)
- convert the string to a `Path` object, this allows us to use the functions and methods provided by the `pathlib` library
- we apply `.glob()` to the `Path` object, which returns the paths to all files that match a specific query pattern. We use `"**/*.txt"` as the query, which finds all `.txt` files that are descendants of `hmd` in `working_data`. See the folder structure below:
```
working_data
|___ hmd
	|___ 0002643
	|        |__ 0002643_18030128_art0012.txt
	|        |__ ...
	|
	|___ 0002194
	|        |__ ...
	|___ ....
```
- Lastly, we convert the output of `.glob()` to a list (for a minor technical reason we don't have to discuss now) and print the number of paths we collected.

In [None]:
path_to_hmd = Path('working_data/hmd') # tell where data is stored and return a Path object
path_to_files = path_to_hmd.glob("**/*.txt") # find all .txt files saved in working_data/hmd
path_to_files = list(path_to_files) # convert generator to list
len(path_to_files) # print number of paths

We could write this more concisely:

In [None]:
path_to_files = list(Path('working_data/hmd').glob("**/*.txt"))
len(path_to_files)
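To see what the `**/` part of the pattern does, the following sketch builds a small throwaway folder structure in a temporary directory and compares a recursive with a non-recursive query. The folder and file names are invented for the demonstration and are not part of the HMD data.

```python
import tempfile
from pathlib import Path

# build a small throwaway folder structure (names are invented for the demo)
root = Path(tempfile.mkdtemp()) / 'hmd'
for folder, fname in [('0000001', '0000001_18070101_art0001.txt'),
                      ('0000001', '0000001_18070102_art0002.txt'),
                      ('0000002', '0000002_18070315_art0001.txt'),
                      ('0000002', 'notes.md')]:
    (root / folder).mkdir(parents=True, exist_ok=True)
    (root / folder / fname).write_text('demo', encoding='utf-8')

txt_paths = sorted(root.glob('**/*.txt'))  # recursive: all .txt files below root
direct = list(root.glob('*.txt'))          # non-recursive: only files directly in root
print(len(txt_paths), len(direct))
```

The recursive pattern finds the three `.txt` files in the subfolders and ignores `notes.md`, while the non-recursive pattern finds nothing because no `.txt` file sits directly in `hmd`.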

We can print the path to the first file in our collection (and the `.stem` attribute, i.e. the file name without its extension).

In [None]:
path_to_first_file = path_to_files[0] # get the path to the first file
print(path_to_first_file) # print the path of path_to_first_file
print(type(path_to_first_file)) # print the data type of path_to_first_file
print(path_to_first_file.stem) # print the file name of path_to_first_file
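Besides `.stem`, a path object exposes a few other attributes that are handy for metadata. The sketch below uses `PurePosixPath` (a variant of `Path` that never touches the file system) on a path taken from the folder structure shown earlier, so it runs even without the data.

```python
from pathlib import PurePosixPath

# a path that follows the naming pattern of the collection
demo_path = PurePosixPath('working_data/hmd/0002643/0002643_18030128_art0012.txt')
print(demo_path.name)         # file name with extension
print(demo_path.stem)         # file name without extension
print(demo_path.suffix)       # the extension
print(demo_path.parent.name)  # name of the enclosing folder, here the newspaper ID
```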

You'll notice that the file names follow a pattern `{newspaper ID}_{date}_{article ID}`. We can use this information to filter articles by date. In the scenario below, we want to select only those articles published between the 1st of January and the 25th of March 1807, to inspect the press coverage in the months before the Abolition Act received royal assent. We first show how to apply a filter to one file, but scaling up to the whole collection is straightforward.

We take an arbitrary path as the working example:

In [None]:
example_path = path_to_files[100] # select a pathname 
print(example_path)

From this path, we get the `.stem` attribute ...

In [None]:
file_name = example_path.stem
file_name

... and use `str.split()` to get the individual components of the file name as a Python list.

In [None]:
file_name.split('_')

The date appears in the second position, but remember that in Python we start counting from 0. To fetch the date from the list we need to use `[1]`

In [None]:
date = file_name.split('_')[1] # split file name by _ and get second element in the resulting list
date

The first four characters of the `date` string refer to the year of publication. We can select those characters using slice notation, i.e. `[:4]`.

In [None]:
year_str = date[:4] # get first four characters
print(year_str, type(year_str))

In the last step, we convert the string to an integer (this is called typecasting in Python).

In [None]:
year = int(year_str) # convert string to integer
print(year, type(year)) 

Now we can put everything together and make the code more elegant by making use of multiple assignment (see example below).

In [None]:
t = '1_2_3'
print(t.split("_"))
one, two, three = t.split("_")
print(one, two, three)

In [None]:
example_path = path_to_files[100] # select a pathname 
newspaper_id, date, art_id = example_path.stem.split("_")
print(newspaper_id, date, art_id)
year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
print(year,month,day)

Now that we have extracted the different elements from the file name and parsed the date string, we can convert it to a proper Python time-stamp (i.e. a `datetime` object).

In [None]:
from datetime import datetime
ts_1 = datetime(year,month,day) # create datetime object from integers representing year, month and day
ts_1
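As an alternative to slicing the string by hand, `datetime.strptime` can parse a date string directly when given its format. The sketch below assumes the date part always has the form `YYYYMMDD`; the example string is hypothetical.

```python
from datetime import datetime

date_str = '18070215'  # hypothetical date string in YYYYMMDD format
# option 1: slice the string and convert each part to an integer
ts_sliced = datetime(int(date_str[:4]), int(date_str[4:6]), int(date_str[6:]))
# option 2: let strptime parse the string (%Y = year, %m = month, %d = day)
ts_parsed = datetime.strptime(date_str, '%Y%m%d')
print(ts_sliced == ts_parsed)
```

Both routes produce the same `datetime` object; which one reads better is largely a matter of taste.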

This allows us to compare two dates, for example to check if one date is earlier or later than a target date. For this we can use the `>` (greater than) and `<` (smaller than) operators.

`<` and `>` are **comparison** operators: they return a **boolean** value, i.e. `True` or `False`.

In [None]:
ts_2 = datetime(1821,7,1)
print(ts_2 > ts_1)

Another comparison operator we'll encounter later on is the **equal to** operator (`==`).

Please note that this is different from an assignment statement, which only uses one `=`:

In [None]:
x = 'Hello World' # assign the string "Hello World" to x
print(x == 'Hello World') # check for equality, this should return True
print(x == 'Hello World!') # check for equality, this should return False because of the ! at the end

We can also check if two dates are equal:

In [None]:
ts_1 = datetime(1821,7,1)
ts_2 = datetime(1821,7,1)
ts_1 == ts_2

Lastly, we test for a range, i.e. test if a time-stamp falls within a particular date range.
We first decide on the lower and upper boundary and test if a `target_date` falls within the selected period (i.e. is greater than the lower boundary and smaller than the upper boundary).

In [None]:
lower_b = datetime(1807,1,1)
upper_b = datetime(1807,3,25)

In [None]:
target_date = datetime(1807,2,15)
lower_b < target_date < upper_b

In [None]:
target_date = datetime(1806,2,15)
lower_b < target_date < upper_b

In [None]:
target_date = datetime(1808,2,15)
lower_b < target_date < upper_b
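Note that `<` excludes the boundaries themselves: an article published exactly on the upper boundary is not selected. The sketch below, using its own illustrative variable names, contrasts the strict test with an inclusive one (`<=`) and spells out what the chained comparison stands for.

```python
from datetime import datetime

lower_demo = datetime(1807, 1, 1)
upper_demo = datetime(1807, 3, 25)
on_boundary = datetime(1807, 3, 25)  # a date exactly on the upper boundary

strict = lower_demo < on_boundary < upper_demo       # boundaries excluded
inclusive = lower_demo <= on_boundary <= upper_demo  # boundaries included
# a chained comparison is shorthand for two comparisons joined by `and`
spelled_out = (lower_demo < on_boundary) and (on_boundary < upper_demo)
print(strict, inclusive, spelled_out)
```

Whether the boundaries should be included depends on the research question, so it is worth deciding this explicitly.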

## `Breakout`:
- [Boolean operators](break_out/conditions.ipynb)

We can package these steps together in one function, which takes a path, a lower boundary and an upper boundary as arguments, and returns a boolean value (i.e. `True` or `False`).

Functions are ideal for grouping several statements (that you need repeatedly) and giving them a name. Below we reuse the previous code for converting a path to a date, and evaluate whether it falls within the date range set by the lower and upper boundary. We give this sequence of operations the name `in_daterange`. For each path in our collection, we can call the function `in_daterange` to check if we should select it for our subsample.

Don't forget to run the code cell below, otherwise you won't be able to use the `in_daterange()` function.

## `Breakout`:
- [Functions](break_out/functions.ipynb)

In [None]:
def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

Before applying the function to the whole collection of paths, we test it on a few examples.

In [None]:
lower_b = datetime(1807,1,1)
upper_b = datetime(1807,3,25)

In [None]:
path_to_files[700]  

In [None]:
path = path_to_files[100]  
print(path)
in_daterange(path,lower_b,upper_b)
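`in_daterange` assumes that every file name follows the `{newspaper ID}_{date}_{article ID}` pattern and raises an error otherwise. If a folder might contain other files, a more defensive variant could look like the sketch below. The name `in_daterange_safe` and the example paths are hypothetical, and this is only one possible way of handling unexpected names.

```python
from datetime import datetime
from pathlib import PurePosixPath

def in_daterange_safe(path, lower_b, upper_b):
    """Like in_daterange, but return False for file names that don't fit the pattern."""
    parts = path.stem.split('_')
    if len(parts) != 3:  # not of the form id_date_article
        return False
    try:
        target_date = datetime.strptime(parts[1], '%Y%m%d')
    except ValueError:   # the middle part is not a valid date
        return False
    return lower_b < target_date < upper_b

lower_demo, upper_demo = datetime(1807, 1, 1), datetime(1807, 3, 25)
print(in_daterange_safe(PurePosixPath('0002644/0002644_18070210_art0003.txt'), lower_demo, upper_demo))
print(in_daterange_safe(PurePosixPath('0002644/readme.txt'), lower_demo, upper_demo))
```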

Now we are almost ready to iterate over the whole corpus. We only need to discuss one more element of Python syntax: conditions. With the `for` loop you can iterate over a corpus, but we'd like to have a bit more control, for example by treating documents inside our date range differently from others.

Conditional statements are helpful here. We only have a closer look at the simplest form, the `if else` statement. The following mock code shows how this works in Python:

```
if condition is True:
	execute code
else:
	execute code
```
A practical example will make this more understandable. We write a program to check if a number is greater than 10. Change the variable `i` to see how the program changes its behaviour depending on whether the condition evaluates to `True` or `False`. In this case we use the greater than operator.

In [None]:
print(4 > 10)
print(100 > 10)

Please note the use of indentation after a line that ends with a colon.

In [None]:
i = 4
if i > 10: # check if i is larger than 10; if so, run the indented line below
    print(i,f'is bigger than 10 because {i} > 10 evaluates to', i > 10)
else: # otherwise, run the indented line below else
    print(f'{i} is not bigger than 10 because {i} > 10 evaluates to', i > 10)

The breakout will provide a bit more information about `if else`. At this point, please remember that when the condition following `if` evaluates to `True`, we execute the indented code below it; otherwise we skip this part and go straight to the code under `else`.
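When there are more than two cases to distinguish, Python offers `elif` (short for "else if"). The sketch below is a small illustration with a made-up function name, `compare_to_ten`, that treats the case where the number equals 10 separately.

```python
def compare_to_ten(i):
    """Return a short description of how i relates to 10."""
    if i > 10:
        return 'bigger than 10'
    elif i == 10:  # only checked when the first condition is False
        return 'equal to 10'
    else:          # reached when none of the conditions above is True
        return 'smaller than 10'

for number in [4, 10, 100]:
    print(number, 'is', compare_to_ten(number))
```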

Please remember that the function we wrote earlier, `in_daterange`, also returns a boolean value. In the small program below, we
- create an empty list where we store the paths that match the condition defined in line 6
- iterate over all paths and check if the date of the article falls within the period we defined by setting a lower and upper boundary



In [None]:
lower_b = datetime(1807,1,1) # create start date of target period
upper_b = datetime(1807,3,25) # create end date of target period

selected_paths = [] # create a new variable referring to an empty list
for p in path_to_files: # iterate over all the paths
    if in_daterange(p,lower_b,upper_b): # check if the date of the article is within the boundaries of the target period
        selected_paths.append(p) # if the above evaluates to True, append this path to the list
    else: # else...
        pass # ... do nothing
print(len(selected_paths)) # print the number of selected paths
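Once the loop is familiar, the same selection can be written as a list comprehension, which combines the iteration and the condition in one line. The sketch below is self-contained: it repeats the logic of `in_daterange` under the name `in_daterange_demo` and uses a handful of hypothetical paths in place of `path_to_files`.

```python
from datetime import datetime
from pathlib import PurePosixPath

def in_daterange_demo(path, lower_b, upper_b):
    # same logic as in_daterange, repeated here so the example runs on its own
    newspaper_id, date, art_id = path.stem.split('_')
    return lower_b < datetime(int(date[:4]), int(date[4:6]), int(date[6:])) < upper_b

# hypothetical paths standing in for path_to_files
demo_paths = [PurePosixPath(f'hmd/0002644/0002644_{d}_art0001.txt')
              for d in ['18061231', '18070105', '18070301', '18070401']]
lower_demo, upper_demo = datetime(1807, 1, 1), datetime(1807, 3, 25)

# keep only those paths for which the condition evaluates to True
selected_demo = [p for p in demo_paths if in_daterange_demo(p, lower_demo, upper_demo)]
print(len(selected_demo))
```

The explicit loop with `if` and `else` is easier to extend with extra steps, so both forms have their place.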

If you want, you could continue with close reading these articles.

In [None]:
print(open(selected_paths[1]).read())

## `Breakout`:
- [Conditions and control flow](break_out/conditions.ipynb)

### --- Exercise

Repeat the selection above for another date range.

### --- Exercise

Apply the same selection steps to another corpus.

### 5.3.2 XML

**[Under construction]**

## 5.4 Filtering Based on Content

Let's now explore techniques for selecting articles based on their content. We will touch on a new topic (but only in passing): regular expressions, a rich query language that enables you to search for complex textual patterns. It is outside the scope of this tutorial to discuss regular expressions in depth, but we show a useful example that allows you to search for multiple words at once.

We'd like to know the extent to which articles discussing slavery make mention of political concepts, such as "freedom" and "democracy".

Using regular expressions often follows this procedure:
- import the `re` module (line 1; doing this once suffices)
- define the pattern (line 2)
- compile the pattern (line 3)
- apply the pattern to a string (line 4)

In [None]:
import re # import re module
pattern = r'\bfreedom\b|\bdemocracy\b' # define pattern, search for word freedom and democracy
query = re.compile(pattern) # compile this pattern
query.findall('Can there be freedom without democracy?') # apply the pattern

We'll skip the technicalities, since there are many excellent introductions to regular expressions (the [NLTK handbook](https://www.nltk.org/book/ch03.html) is a good starting point), but we can explain some of the syntax here, so you can adapt the code to other queries of interest.

- `|`: 'OR' separator
- `\b`: word boundary

If we remove the word boundary character, our query becomes more inclusive: it will also match substrings. For example, the code below still matches "democracy", even though it only appears as a substring of "ddemocracys", whereas "dfreedom" is not matched because the word boundaries around "freedom" are still in place.

In [None]:
pattern = r'\bfreedom\b|democracy'
query = re.compile(pattern)
query.findall('can there be dfreedom, without ddemocracys?')

We can easily extend the query with the `OR` separator. Below we search for the tokens "freedom", "democracy" and words starting with the substring "equal".
Please note that
- the word boundary only appears at the left-hand side of "equal"
- this may match more words than you'd think (here both "equality" and "equally"), so be careful!

In [None]:
pattern = r'\bfreedom\b|\bdemocracy\b|\bequal'
query = re.compile(pattern)
query.findall("can there be freedom, without democracy? What equality, that's equally important")
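When the list of search terms grows, typing the pattern by hand becomes error-prone. The sketch below builds the pattern from a Python list instead. The keyword list is a hypothetical example, `re.escape` guards against characters with a special meaning in regular expressions, and `re.IGNORECASE` is used here as an alternative to lowercasing the text first.

```python
import re

keywords = ['freedom', 'democracy', 'liberty']  # hypothetical list of search terms
# escape each keyword, wrap it in word boundaries, then join with the OR separator
pattern_demo = '|'.join(r'\b' + re.escape(word) + r'\b' for word in keywords)
query_demo = re.compile(pattern_demo, flags=re.IGNORECASE)  # match regardless of case
print(pattern_demo)
print(query_demo.findall('Freedom and liberty, but no democracies.'))
```

Because of the word boundary on both sides, "democracies" is not matched by "democracy"; dropping the right-hand `\b` would not help here either, since the plural changes the final letter.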

Let's now apply this technique to our corpus. Most of the code should look familiar by now; only line 7 needs a bit of explanation. `query.findall` returns a list with all the substrings that match the given regular expression. If it found more than `0` matches (line 7), we add the path to `selected_paths`.

Running the code may take a minute or two since we have to process the content of quite a few files.

In [None]:
query = re.compile(r'\bfreedom\b|\bdemocracy\b') # define and compile the query
selected_paths = [] # create empty variable where we'll store the results of the iteration
for p in path_to_files: # iterate over all the files
    txt = open(p).read() # open and read the file
    txt_lower = txt.lower() # lowercase the text, save in new variable
    results = query.findall(txt_lower) # query lowercased texts
    if len(results) > 0: # check if query returned any results
        selected_paths.append(p) # if True, add this path to selected_paths
print(len(selected_paths)) # print number of collected files

In Python, an empty `list` (or dictionary) evaluates to `False` when used as a condition; a list that contains one or more items evaluates to `True`.

In [None]:
# the code below will not print the message after if
empty_list = []
if empty_list:
    print('condition is True')

In [None]:
# the code below will print the message after if
list_with_content = [1,2,3]
if list_with_content:
    print('condition is True')

We could therefore make the check in line 7 of the earlier code cell a bit more concise. (The query below also adds words starting with "abolit".)

In [None]:
import re
query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b|\babolit)')
selected_files = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results:
        selected_files.append(p)
print(len(selected_files))
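If we only need to know *whether* a document contains a match, and not which words were found, `search` is an alternative to `findall`: it stops at the first match and returns a match object, or `None` when nothing is found. A minimal sketch with two invented strings:

```python
import re

query_demo = re.compile(r'\bfreedom\b|\bdemocracy\b')
texts = ['freedom and democracy and freedom again', 'nothing relevant here']

for t in texts:
    match = query_demo.search(t)      # a match object, or None if nothing is found
    all_hits = query_demo.findall(t)  # a list with every match
    print(bool(match), len(all_hits))
```

A match object evaluates to `True` in an `if` statement and `None` evaluates to `False`, so `if query.search(txt_lower):` would work in the same way as `if results:`.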

## 5.5 Putting Everything Together

By combining metadata and content criteria, you can navigate a corpus and select relevant documents. The last code cell of this section merges the previous examples.

The crucial difference is line 20, where the `if` statement contains **two** conditions that both have to evaluate to `True` (since we use the `and` operator).

In [None]:
True and True

In [None]:
True and False

In [None]:
True or False

In [None]:
if True and True:
    print('!')

In [None]:
if True and False:
    print('!')

In [None]:
import re
from datetime import datetime

def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

lower_b = datetime(1830,1,1)
upper_b = datetime(1831,1,1)

query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b)')

selected_files = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results and in_daterange(p,lower_b,upper_b):
        selected_files.append(p)
print(len(selected_files))
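Python evaluates `and` from left to right and stops as soon as one condition is `False`. Checking the date is cheap, while reading and searching a file is slow, so on a large collection it can pay off to test the date first. The sketch below illustrates the idea on three throwaway files with invented names and contents; it is one possible arrangement, not a required one.

```python
import re
import tempfile
from datetime import datetime
from pathlib import Path

# throwaway files with invented names and contents, for the demo only
demo_dir = Path(tempfile.mkdtemp())
for name, text in [('0000001_18300512_art0001.txt', 'A speech on Freedom.'),
                   ('0000001_18300601_art0002.txt', 'Shipping news.'),
                   ('0000001_18120101_art0003.txt', 'Freedom and democracy.')]:
    (demo_dir / name).write_text(text, encoding='utf-8')

lower_demo, upper_demo = datetime(1830, 1, 1), datetime(1831, 1, 1)
query_demo = re.compile(r'\bfreedom\b|\bdemocracy\b')

selected_demo, files_read = [], 0
for p in sorted(demo_dir.glob('*.txt')):
    file_date = datetime.strptime(p.stem.split('_')[1], '%Y%m%d')
    if not (lower_demo < file_date < upper_demo):
        continue  # outside the period: skip without opening the file
    files_read += 1
    if query_demo.search(p.read_text(encoding='utf-8').lower()):
        selected_demo.append(p)
print(len(selected_demo), files_read)
```

Only the two files from 1830 are opened, and only one of them mentions a search term.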

## 5.6 Saving the output

While selecting articles is useful for creating a specific subcorpus, you'd probably want to spend some time close-reading the results. Below we show how to export all the selected documents as tabular data, an Excel file in this case. Part II of this course will have a closer look at working with tabular data.

In [None]:
import re
from datetime import datetime
import pandas as pd

def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

lower_b = datetime(1830,1,1)
upper_b = datetime(1831,1,1)

query = re.compile(r'\bfreedom\b|\bdemocracy\b')

rows = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results and in_daterange(p,lower_b,upper_b):
        row = [p.stem,'; '.join(results),txt]
        rows.append(row)

df = pd.DataFrame(rows, columns=['file_name','matches','text']) # one row per selected article, with named columns
df.to_excel('working_data/corpus_hmd.xlsx')
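If an Excel file is not required, rows of this kind can also be written to a CSV file using only the standard library. The sketch below uses invented rows and an assumed header (`file_name`, `matches`, `text`), and writes to a temporary folder so that it does not touch `working_data`.

```python
import csv
import tempfile
from pathlib import Path

# invented rows in the same shape as above: file name, matches, full text
demo_rows = [['0000001_18300512_art0001', 'freedom', 'A speech on Freedom.'],
             ['0000001_18300801_art0007', 'freedom; democracy', 'Freedom and democracy.']]

out_path = Path(tempfile.mkdtemp()) / 'corpus_demo.csv'
with open(out_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['file_name', 'matches', 'text'])  # header row
    writer.writerows(demo_rows)                        # one line per article

# read the file back to check what was written
with open(out_path, newline='', encoding='utf-8') as f:
    print(len(list(csv.reader(f))))
```

CSV files open in Excel and most other spreadsheet programs, and the `csv` module takes care of quoting text that contains commas or line breaks.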

## Fin.