# 5 Corpus Selection


## 5.1 Introduction

Creating research corpora is often a critical part of the historian's research. In this Notebook, we demonstrate how Python is a very powerful tool for filtering text data. When confronted with a huge collection, the first step usually involves source selection: finding those documents that possibly merit closer inspection.

By selecting and filtering data, we can bring together otherwise disparate elements in one subcorpus.

In most scenarios we'd like to filter documents based on a combination of **metadata** and **content** criteria:
- Metadata criteria: for example, retaining only documents within a certain date range, produced in a specific geography, or published by a political party. This information is often encoded in the document metadata; in our case study below we will use the file names as metadata.
- Content criteria: this involves selecting documents based on the words (or, more broadly, patterns of tokens) they contain. In this Notebook we have a look at regular expressions, a powerful query technique that allows you to select documents based on complex patterns. We won't have time to go into details, but we discuss a rather powerful example in which you search for multiple patterns at once.

Lastly, Python gives us the tools to save the output of the selection process in formats convenient for further analysis.

## 5.2 Unit of Analysis

Before we start creating our own subcorpora, we have to define which units of a document we need for our analysis (depending on what type of access we have). We should ask the question: what is the unit of analysis we'd prefer to work with? Are these whole documents, paragraphs, sentences or even ngrams?

For studying specific keywords we don't require the whole document, and sentences would suffice. In other words: what contexts do we want to include? All this depends on the tasks and we will explore different scenarios later.

But it is good to know the different options. For example, you could approximately split a text into paragraphs by splitting on blank lines (two consecutive hard returns).


In [14]:
import requests
text  = requests.get('https://www.gutenberg.org/files/730/730-0.txt').content.decode('utf-8')
content = text.split(' CHAPTER I.')[1]

In [15]:
content[:2000]

'\r\nTREATS OF THE PLACE WHERE OLIVER TWIST WAS BORN AND OF THE\r\nCIRCUMSTANCES ATTENDING HIS BIRTH\r\n\r\n\r\nAmong other public buildings in a certain town, which for many reasons\r\nit will be prudent to refrain from mentioning, and to which I will\r\nassign no fictitious name, there is one anciently common to most towns,\r\ngreat or small: to wit, a workhouse; and in this workhouse was born; on\r\na day and date which I need not trouble myself to repeat, inasmuch as\r\nit can be of no possible consequence to the reader, in this stage of\r\nthe business at all events; the item of mortality whose name is\r\nprefixed to the head of this chapter.\r\n\r\nFor a long time after it was ushered into this world of sorrow and\r\ntrouble, by the parish surgeon, it remained a matter of considerable\r\ndoubt whether the child would survive to bear any name at all; in which\r\ncase it is somewhat more than probable that these memoirs would never\r\nhave appeared; or, if they had, that being comp

In [16]:
paragraphs = content.split('\r\n\r\n')
len(paragraphs)

3997

In [17]:
print(paragraphs[10])

The surgeon deposited it in her arms. She imprinted her cold white lips
passionately on its forehead; passed her hands over her face; gazed
wildly round; shuddered; fell back—and died. They chafed her breast,
hands, and temples; but the blood had stopped forever. They talked of
hope and comfort. They had been strangers too long.
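
Splitting on `'\r\n\r\n'` works here because this Project Gutenberg file uses Windows-style line endings. For texts from other sources a more tolerant splitter may be useful. The sketch below is one possible approach and not part of the original workflow; the helper name `split_paragraphs` and the sample string are made up for illustration. Because it collapses runs of blank lines and drops empty chunks, it would report a different paragraph count than the cell above.

```python
import re

def split_paragraphs(text):
    """Split text on blank lines, whatever the line-ending style."""
    # Two or more consecutive line breaks mark a paragraph boundary
    chunks = re.split(r'(?:\r?\n){2,}', text)
    # Strip surrounding whitespace and drop empty chunks
    return [chunk.strip() for chunk in chunks if chunk.strip()]

sample = "First paragraph,\r\nstill first.\r\n\r\n\r\nSecond paragraph.\n\nThird paragraph."
paragraphs_demo = split_paragraphs(sample)
print(len(paragraphs_demo))
```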


Another option is to split a text into sentences.

In [25]:
from nltk.tokenize import sent_tokenize

In [24]:
sentences = sent_tokenize(content)
len(sentences)

6668

In [23]:
sentences[100]

'I have come out\r\nmyself to take him there.'
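
The ngrams mentioned at the start of this section can be produced with plain Python. The following is a minimal sketch: the helper `ngrams` is a hypothetical name, and the whitespace tokenisation is deliberately crude (a proper tokenizer would handle punctuation better).

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = 'I have come out myself to take him there.'
# Crude tokenisation: lowercase, drop the final full stop, split on whitespace
tokens = sentence.lower().rstrip('.').split()
bigrams = ngrams(tokens, 2)
print(bigrams[:3])
```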

### --- Exercise

Compute the average number of sentences per paragraph for another book.

### --- Exercise

Compare the average sentence length of two books by Charles Dickens.

### --- Exercise

Find the longest sentence in a book.

## 5.3 Filtering Based on Metadata

After deciding which units to extract, you can proceed to defining other criteria for data selection. Here we focus on aspects related to the documents' metadata: first on source and time, then on other criteria.

In the examples below we use articles from Heritage Made Digital newspapers. This corpus is already a selection (because the whole dataset was too large to share), since we retained only articles containing the word **"slavery"**. The exercises below demonstrate different techniques that enable you to zoom in on specific aspects of press coverage on this topic.

### 5.3.1 Paths

First, we show how paths and file names encode specific information about the document. As in the previous examples, we use the `pathlib` library to make working with files and directories a bit easier.

In [35]:
from pathlib import Path

Let's first inspect where (and how) the articles are stored. For this we use the bash command `ls .` to list all documents in the current directory. To differentiate between bash and Python code, the former always starts with an exclamation mark!

In [38]:
!ls .

1 - Introduction.ipynb
2 - Values and Variables.ipynb
3 - Text and String Methods.ipynb
4 -  Processing texts.ipynb
5 - Corpus Selection.ipynb
6 - Corpus Exploration.ipynb
7 - Trends over time.ipynb
8 - Advanced Topics - Classification and Topic Modelling with SKLearn.ipynb
8 - Trends over time II.ipynb
LICENSE
README.md
break_out
data
example_data
imgs
old_notebooks
utils


Now we list all folders in `data/hmd_data/plaintext/`. Each newspaper in the collection has its own folder. The names are IDs:

- 0002088:
- 0002194:
- 0002643:
- 0002644:
- 0002646:
- 0002647: 


In [30]:
!ls data/hmd_data/plaintext/

0002088 0002194 0002643 0002644 0002646 0002647


The command below lists all files in `0002643`. You'll notice that filenames have a particular structure. Underscores (`_`) separate the different parts of the metadata.

In [31]:
!ls data/hmd_data/plaintext/0002643

0002643_18030128_art0012.txt 0002643_18240124_art0011.txt
0002643_18030201_art0004.txt 0002643_18240131_art0006.txt
0002643_18030303_art0014.txt 0002643_18240131_art0015.txt
0002643_18030308_art0018.txt 0002643_18240202_art0009.txt
0002643_18030318_art0016.txt 0002643_18240204_art0018.txt
0002643_18030325_art0029.txt 0002643_18240205_art0020.txt
0002643_18030326_art0001.txt 0002643_18240207_art0038.txt
0002643_18030425_art0023.txt 0002643_18240209_art0017.txt
0002643_18030531_art0005.txt 0002643_18240211_art0023.txt
0002643_18030604_art0022.txt 0002643_18240211_art0024.txt
0002643_18030610_art0012.txt 0002643_18240218_art0037.txt
0002643_18030613_art0015.txt 0002643_18240219_art0012.txt
0002643_18030613_art0018.txt 0002643_18240223_art0026.txt
0002643_18030621_art0019.txt 0002643_18240223_art0033.txt
0002643_18030630_art0026.txt 0002643_18240225_art0011.txt
0002643_18040515_art0016.txt 0002643_18240225_art0016.txt
0002643_18040522_art0013.txt 0002643_18240227_art0012.tx

Using `pathlib` we can quickly collect paths to all files in the HMD collection. The code below may look a bit obscure at first, but it performs the following steps:
- define the location where the data are stored as a string
- convert the string to a `Path` object, which allows us to use the convenient functions and methods of the `pathlib` library
- apply `.glob()` to the `Path` object, which returns the paths to all files that match a specific query pattern. `"**/*.txt"` will find all `.txt` files that are descendants of `plaintext`. See the folder structure below:
```
plaintext
|___ 0002643
|        |__ 0002643_18030128_art0012.txt
|        |__ ...
|
|___ 0002194
|        |__ ...
|___ ....
```
- Lastly, we convert the output of `.glob()` to a list: `.glob()` returns a generator, which has no length and can only be iterated over once.

In [39]:
path_to_hmd = Path('data/hmd_data/plaintext')
path_to_files = path_to_hmd.glob("**/*.txt")
path_to_files = list(path_to_files)
len(path_to_files)

7856

We could write this more concisely:

In [None]:
path_to_files = list(Path('data/hmd_data/plaintext').glob("**/*.txt"))
len(path_to_files)

We can print the path to the first file in our collection (and the `.stem` attribute, i.e. the file name without its extension).

In [43]:
print(path_to_files[0])
print(path_to_files[0].stem)

data/hmd_data/plaintext/0002643/0002643_18210719_art0021.txt
0002643_18210719_art0021
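
To experiment with `.glob()` without the HMD data, the sketch below rebuilds a tiny version of the folder structure in a temporary directory. The dummy file contents and the extra `notes.md` file are invented for illustration. Note that `.glob()` makes no promise about the order of its results, which is why the first path above is not the earliest article; wrapping the call in `sorted()` is one way to get a reproducible order.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'plaintext'
    # Recreate a miniature version of the folder structure shown above
    for name in ['0002643_18030128_art0012', '0002194_18070224_art0008']:
        folder = root / name.split('_')[0]
        folder.mkdir(parents=True, exist_ok=True)
        (folder / f'{name}.txt').write_text('dummy text')
    (root / 'notes.md').write_text('not an article')  # ignored by the pattern

    found = sorted(root.glob('**/*.txt'))
    stems = [p.stem for p in found]           # file name without extension
    names = [p.name for p in found]           # file name with extension
    folders = [p.parent.name for p in found]  # name of the enclosing folder

print(stems)
print(folders)
```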


You'll notice that the file name follows the pattern `{newspaper ID}_{date}_{article ID}`, which we can use to filter articles by date. In the scenario below we want to select only those articles published between 1st January and 15th March 1807. Again, as is hopefully familiar by now, we show how to do this for one file, after which scaling up to the whole collection is rather straightforward.

The first cell below shows string slicing combined with typecasting on its own. In the second cell, lines 2 and 4 make use of multiple assignment, and line 4 also combines string slicing with typecasting.

In [50]:
date = "18210629"
year_str = date[:4]
print(year_str, type(year_str))
year = int(date[:4])
print(year, type(year))

1821 <class 'str'>
1821 <class 'int'>


In [47]:
example_path = path_to_files[100] # select a pathname 
newspaper_id, date, art_id = example_path.stem.split("_")
print(newspaper_id, date, art_id)
year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
print(year,month,day)

0002643 18210629 art0009
1821 6 29


Now that we have extracted the different elements from the date string, we can convert them to a proper Python time-stamp (i.e. a `datetime` object).

In [52]:
from datetime import datetime
ts_1 = datetime(year,month,day)
ts_1

datetime.datetime(1821, 6, 29, 0, 0)
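
As an alternative to slicing the string by hand, the standard library can parse the date in one step with `datetime.strptime`. This is a sketch of a different route to the same result, not the approach used in the rest of this Notebook. The format string `'%Y%m%d'` reflects the year-month-day layout seen in the file names.

```python
from datetime import datetime

date = '18210629'
ts = datetime.strptime(date, '%Y%m%d')  # parse year, month and day in one go
print(ts)

# An impossible date (31 June) raises a ValueError instead of passing silently
try:
    datetime.strptime('18210631', '%Y%m%d')
    valid = True
except ValueError:
    valid = False
print(valid)
```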

This allows us to compare two dates, for example to check whether one date is before or after another. For this we can use the `>` (greater than) and `<` (less than) operators.

In [54]:
ts_2 = datetime(1821,7,1)
print(ts_2 > ts_1)

True


We can also test for a range, i.e. test if a time-stamp falls within a particular date range.
We first decide on the lower and upper boundary, and then evaluate for different target dates whether they fall within the selected period.

In [56]:
lower_b = datetime(1807,1,1)
upper_b = datetime(1807,3,15)

In [58]:
target_date = datetime(1807,2,15)
lower_b < target_date < upper_b

True

In [59]:
target_date = datetime(1806,2,15)
lower_b < target_date < upper_b

False

In [60]:
target_date = datetime(1808,2,15)
lower_b < target_date < upper_b

False
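
One detail worth checking is what happens exactly on a boundary. With `<` the boundaries themselves are excluded, so an article published on 1st January 1807 would not be selected. The short sketch below contrasts this with `<=`; which variant is appropriate depends on your research question.

```python
from datetime import datetime

lower_b = datetime(1807, 1, 1)
upper_b = datetime(1807, 3, 15)
target_date = datetime(1807, 1, 1)  # exactly on the lower boundary

strict = lower_b < target_date < upper_b       # boundaries excluded
inclusive = lower_b <= target_date <= upper_b  # boundaries included
print(strict, inclusive)
```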

We can package these steps together in one function that takes a path and a lower and upper boundary as arguments, and returns a boolean (i.e. `True` or `False`) value.

Don't forget to run the code cell below, otherwise you won't be able to use the `in_daterange()` function.

In [61]:
def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

Before applying the function, we test it on a few examples.

In [None]:
lower_b = datetime(1807,1,1)
upper_b = datetime(1807,3,15)

In [72]:
path_to_files[700]  

PosixPath('data/hmd_data/plaintext/0002643/0002643_18240818_art0008.txt')

In [70]:
path = path_to_files[100]  
print(path)
in_daterange(path,lower_b,upper_b)

data/hmd_data/plaintext/0002643/0002643_18210629_art0009.txt


False
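
`in_daterange()` assumes that every file name has exactly three parts separated by underscores and a valid date in the middle. If a folder might contain other files, a more defensive variant could return `False` instead of raising an error. The function below is a hypothetical sketch (the name `in_daterange_safe` is ours) and parses the date with `datetime.strptime` rather than slicing.

```python
from datetime import datetime
from pathlib import Path

def in_daterange_safe(path, lower_b, upper_b):
    """Like in_daterange(), but return False for names that do not fit the pattern."""
    parts = path.stem.split('_')
    if len(parts) != 3:
        return False  # not a {newspaper ID}_{date}_{article ID} file name
    try:
        target_date = datetime.strptime(parts[1], '%Y%m%d')
    except ValueError:
        return False  # the middle part is not a valid date
    return lower_b < target_date < upper_b

lower_b = datetime(1807, 1, 1)
upper_b = datetime(1807, 3, 15)
print(in_daterange_safe(Path('0002643_18070130_art0013.txt'), lower_b, upper_b))
print(in_daterange_safe(Path('README.md'), lower_b, upper_b))
```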

Now we are ready to iterate over the whole corpus. Note that this small program follows a structure similar to what we have seen earlier. In the first line we define an empty list variable, and then fill it with items (using `.append()`) that match the condition defined in line 3. Then we print the number of files we found.

In [75]:
selected_files = []
for p in path_to_files:
    if in_daterange(p,lower_b,upper_b):
        selected_files.append(p)
print(len(selected_files))

15
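
The same loop can be written more compactly as a list comprehension. The sketch below is self-contained, so it repeats the function and uses a handful of file names taken from the listings above (`demo_paths`, a made-up stand-in for `path_to_files`).

```python
from datetime import datetime
from pathlib import Path

def in_daterange(path, lower_b, upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year, month, day = int(date[:4]), int(date[4:6]), int(date[6:])
    return lower_b < datetime(year, month, day) < upper_b

# A few file names standing in for path_to_files
demo_paths = [
    Path('0002643_18030128_art0012.txt'),
    Path('0002643_18070130_art0013.txt'),
    Path('0002194_18070224_art0008.txt'),
    Path('0002643_18240124_art0011.txt'),
]
lower_b = datetime(1807, 1, 1)
upper_b = datetime(1807, 3, 15)

# Keep every path for which the condition holds
selected_demo = [p for p in demo_paths if in_daterange(p, lower_b, upper_b)]
print(len(selected_demo))
```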


In [76]:
selected_files

[PosixPath('data/hmd_data/plaintext/0002643/0002643_18070130_art0013.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070302_art0022.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070131_art0009.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070224_art0009.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070211_art0011.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070224_art0006.txt'),
 PosixPath('data/hmd_data/plaintext/0002643/0002643_18070228_art0012.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070224_art0008.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070226_art0015.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070207_art0017.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070206_art0005.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070206_art0004.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18070228_art0007.txt'),
 PosixPath('

If you want, you could continue by close reading these articles.

In [78]:
print(open(selected_files[1]).read())

IA ECON CI IrCU IT.

•

The Honourable George Harding..
Atael Moymy, Esq.
Cardiff—Tuesday, March 24.
precru fueolay, March 31.
Predieigne-o f ussdan April 7.
CHESFER CIRCUIT.
kobert Delhi>, EN.
Francis Burton, E.‘q.
Montgnmeryihire—Thursday, March at Pool.
Denbighshire-Wednesday, Aiuil t, at ktuthol.
Fliotshire--Tuimlay, April 7, at Mold.
Cogitate—Monday, Apia 13, at the Castle of Chester.
NORTH WALES SPRING CIRCUIT, 11107.
Hush Ltyc,ster, laq.
Thomas Musser, Esq.
hferiemethshire—Thlimlay, Arra 7, at Sala.
Carstamonsolre--Wellnestlay, Aped t, at Camarvon.
Anglesey—Tutmlay, Match 26, at Be mataris.
A "co.-The F.sighso Circuits have been already given in Taw
Btlilita Pawls, with the oirtivvion of the Awns of Hunting-
dun, which n fixed loor 7th March, and Cambridge the 10th.

The following is a statement of the distribution of
nor Naval Force, up to this day:—At sea, eighty-six
sail of the line ; seven ships from 50 to 44 guns; 115
frigates, 152 sloops, and 1132 gun-brigs and smaller
ves

### 5.3.2 XML

**[Under construction]**

## 5.4 Filtering Based on Content

Let's now explore techniques for selecting articles based on their content. We will briefly touch on a new topic, regular expressions (REs), a rich query language that enables you to search for complex textual patterns. It is outside the scope of this tutorial to discuss REs in depth, but we show a useful example that allows you to search for multiple words at once.

Using regular expressions often follows this procedure:
- import the `re` module (line 1); doing this once suffices
- define the pattern (line 2)
- compile the pattern (line 3)
- apply the pattern (line 4)

In [87]:
import re
pattern = r'\bfreedom\b|\bdemocracy\b'
query = re.compile(pattern)
query.findall('can there be freedom, without democracy?')

['freedom', 'democracy']

In [88]:
query.findall('can there be freedom, without democracyy?')

['freedom']

We'll skip the technicalities, as there are plenty of good introductions to working with regular expressions (the [NLTK handbook](https://www.nltk.org/book/ch03.html) is a good starting point), but we can explain the two special symbols used in the pattern above:

- `|`: 'OR' separator
- `\b`: word boundary

The word boundary matters: without `\b`, the pattern also matches inside longer strings, as the next cell shows.

In [89]:
import re
pattern = r'freedom|democracy'
query = re.compile(pattern)
query.findall('can there be dfreedomd, without ddemocracys?')

['freedom', 'democracy']
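
When the list of search words grows, typing `\b` around every word becomes tedious. One possible approach, sketched below, is to assemble the pattern from a Python list. The list `keywords` is just an example; `re.escape` makes sure that any special characters in a keyword are treated literally.

```python
import re

keywords = ['freedom', 'democracy']
# Join the escaped keywords with '|' and wrap the whole group in word boundaries
pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
query = re.compile(pattern)

exact = query.findall('can there be freedom, without democracyy?')
embedded = query.findall('can there be dfreedomd, without ddemocracys?')
print(exact, embedded)
```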

In [90]:
import re
query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b|\babolit)')
selected_files = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results:
        selected_files.append(p)
print(len(selected_files))

4230
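
Note that the last alternative, `\babolit`, has no closing `\b`: it works as a prefix and matches "abolition", "abolitionist" and so on, but `findall()` only returns the matched part, `abolit`. If you would rather see the whole word, you can extend the pattern with `\w*`. The sketch below also uses the `re.IGNORECASE` flag as an alternative to lowercasing the text first; the example sentence is invented.

```python
import re

sentence = 'The Abolitionists demanded freedom; abolition followed.'

# Pattern as used above, applied to lowercased text
prefix_only = re.compile(r'\bfreedom\b|\bdemocracy\b|\babolit')
# Variant: \w* extends the match to the end of the word, and the flag ignores case
whole_word = re.compile(r'\bfreedom\b|\bdemocracy\b|\babolit\w*', flags=re.IGNORECASE)

prefix_hits = prefix_only.findall(sentence.lower())
word_hits = whole_word.findall(sentence)
print(prefix_hits)
print(word_hits)
```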


## 5.5 Putting Everything Together

By combining metadata and content criteria you can rigorously navigate corpora and select relevant information. The final code cell in this section merges the previous examples.

The important difference is in line 20 of that cell, where the `if` statement contains two conditions joined by `and`, and both have to evaluate to `True`. The two short cells that come first illustrate how `and` behaves.

In [94]:
if True and True:
    print('!')

!


In [93]:
if True and False:
    print('!')

In [91]:
import re
from datetime import datetime

def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

lower_b = datetime(1830,1,1)
upper_b = datetime(1831,1,1)

query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b|\babolit)')

selected_files = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results and in_daterange(p,lower_b,upper_b):
        selected_files.append(p)
print(len(selected_files))

228
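
The loop above reads every file before checking its date. Since the date check only needs the file name, testing it first avoids opening files that fall outside the period, which can save time on large collections. The sketch below illustrates this ordering on a made-up miniature corpus written to a temporary directory; the article texts are invented. It also uses `query.search()`, which is sufficient when we only need to know whether there is a match at all.

```python
import re
import tempfile
from datetime import datetime
from pathlib import Path

def in_daterange(path, lower_b, upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year, month, day = int(date[:4]), int(date[4:6]), int(date[6:])
    return lower_b < datetime(year, month, day) < upper_b

query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b|\babolit)')
lower_b, upper_b = datetime(1830, 1, 1), datetime(1831, 1, 1)

# Made-up miniature corpus: file name -> text
demo_corpus = {
    '0002194_18300309_art0045.txt': 'A petition for the abolition of slavery.',
    '0002194_18300604_art0037.txt': 'Shipping news and prices of corn.',
    '0002643_18210629_art0009.txt': 'Freedom of the press was debated.',
}

with tempfile.TemporaryDirectory() as tmp:
    for name, text in demo_corpus.items():
        (Path(tmp) / name).write_text(text)

    selected_demo = []
    for p in sorted(Path(tmp).glob('*.txt')):
        # Cheap check first: skip files outside the period without opening them
        if not in_daterange(p, lower_b, upper_b):
            continue
        if query.search(p.read_text().lower()):
            selected_demo.append(p.name)

print(selected_demo)
```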


In [92]:
selected_files

[PosixPath('data/hmd_data/plaintext/0002194/0002194_18300309_art0045.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301124_art0005.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301216_art0028.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301018_art0004.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301211_art0018.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18300604_art0037.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301116_art0005.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18300608_art0019.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301123_art0008.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301001_art0016.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301123_art0022.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18300902_art0003.txt'),
 PosixPath('data/hmd_data/plaintext/0002194/0002194_18301013_art0030.txt'),
 PosixPath('

## 5.6 Saving the Output

While selecting articles is useful for creating a specific subcorpus, th

Here we dump the selection in an Excel file. Part II focusses more on tabular data.

In [98]:
import re
from datetime import datetime
import pandas as pd

def in_daterange(path,lower_b,upper_b):
    newspaper_id, date, art_id = path.stem.split("_")
    year,month,day = int(date[:4]),int(date[4:6]),int(date[6:])
    target_date = datetime(year,month,day)
    return lower_b < target_date < upper_b

lower_b = datetime(1830,1,1)
upper_b = datetime(1831,1,1)

query = re.compile(r'(?:\bfreedom\b|\bdemocracy\b|\babolit)')

rows = []
for p in path_to_files:
    txt = open(p).read()
    txt_lower = txt.lower()
    results = query.findall(txt_lower)
    if results and in_daterange(p,lower_b,upper_b):
        row = [p.stem,'; '.join(results),txt]
        rows.append(row)

        

In [100]:
df = pd.DataFrame(rows)
df.to_excel('../test.xlsx')
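
The resulting spreadsheet has numbered rather than named columns; `pd.DataFrame` accepts a `columns` argument if you want to label them. As a lightweight alternative to Excel, the selection can also be written to a CSV file with Python's built-in `csv` module. The sketch below uses made-up rows in the same shape as `rows` above, and the column names `file`, `matches` and `text` are our own choice.

```python
import csv
import tempfile
from pathlib import Path

# Made-up rows in the same shape as `rows` above: file name, matches, full text
demo_rows = [
    ['0002194_18300309_art0045', 'abolit; freedom', 'Text of the first article'],
    ['0002194_18301124_art0005', 'freedom', 'Text of the second article'],
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'selection.csv'
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['file', 'matches', 'text'])  # header row
        writer.writerows(demo_rows)

    # Read the file back to check what was written
    with open(out_path, newline='', encoding='utf-8') as f:
        written = list(csv.reader(f))

print(written[0])
print(len(written))
```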

## Fin.