
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kasparvonbeelen/ghi_python/blob/main/6%20-%20Corpus%20Exploration.ipynb)



# 6 Corpus Exploration


## Text Mining for Historians (with Python)
## A Gentle Introduction to Working with Textual Data in Python

### Created by Kaspar Beelen and Luke Blaxill

### For the German Historical Institute, London

<img align="left" src="https://www.ghil.ac.uk/typo3conf/ext/wacon_ghil/Resources/Public/Images/institute_icon_small.png">




This Notebook explores various tools for analysing and comparing texts at the corpus level. As such, these are your first ventures into "macro-analysis" with Python. The methods described here are particularly powerful in combination with the techniques for content selection explained in Notebook 5 **Corpus Creation**.

More specifically, we will have a closer look at:

- **Keyword in Context Analysis**: Explore context of words, similar to concordance in AntConc
- **Collocations**: Compute which tokens tend to co-occur
- **Feature selection**: Compute which tokens are distinctive for a subset of texts

## 6.1 Keyword in Context

Computers are excellent at indexing, organizing and retrieving information. However, interpreting information (especially natural language) is still a difficult task. Keyword-in-Context (KWIC) analysis brings together the best of both worlds: the retrieval power of machines and the close-reading skills of the historian. KWIC (or concordance) centres a corpus on a specific query term, with `n` words (or characters) to the left and the right.
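
To make the idea concrete, here is a minimal sketch of a KWIC display in plain Python. The function name `kwic` and the toy sentence are our own, invented for illustration; later in this Notebook we rely on NLTK's concordance tool instead.

```python
def kwic(tokens, query, n=3):
    """Return each occurrence of query with up to n tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == query:
            left = ' '.join(tokens[max(0, i - n):i])   # up to n tokens before the hit
            right = ' '.join(tokens[i + 1:i + 1 + n])  # up to n tokens after the hit
            lines.append((left, token, right))
    return lines

# toy sentence (invented for illustration, not taken from the corpus)
example = 'the poor law guardians visited the poor district of the borough'.split()
for left, keyword, right in kwic(example, 'poor', n=2):
    print(f'{left:>20} | {keyword} | {right}')
```

Each hit is printed on its own line with the query term in the middle, which is what makes a concordance easy to scan.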

In this section, we investigate reports of the London Medical Officers of Health, the [London's Pulse corpus](https://wellcomelibrary.org/moh/). 

> The reports were produced each year by the Medical Officer of Health (MOH) of a district and set out the work done by his public health and sanitary officers. The reports provided vital data on birth and death rates, infant mortality, incidence of infectious and other diseases, and a general statement on the health of the population. 

Source: https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/

We start by importing the necessary libraries. Some of the code is explained in previous Notebooks, so we won't discuss it in detail here.

The tools we need are:
- `nltk`: the Natural Language Toolkit, for tokenization and concordance
- `pathlib`: a library for managing files and folders

In [1]:
import nltk # import natural language toolkit
nltk.download('stopwords')
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import wordpunct_tokenize function from nltk.tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kbeelen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
!ls data/MOH/ # list all files in data/MOH/

antconc    python.zip


In [7]:
!unzip data/MOH/python.zip -d data/MOH/

Archive:  data/MOH/python.zip
   creating: data/MOH/python/
  inflating: data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt  
  inflating: data/MOH/python/CityofWestminster.1932.b18247945.txt  
  inflating: data/MOH/python/CityofWestminster.1921.b18247830.txt  
  inflating: data/MOH/python/PoplarandBromley.1900.b18245754.txt  
  inflating: data/MOH/python/Poplar.1919.b18120878.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt  
  inflating: data/MOH/python/CityofWestminster.1907.b18247726.txt  
  inflating: data/MOH/python/CityofWestminster.1906.b18247714.txt  
  inflating: data/MOH/python/CityofWestminster.1903.b18247684.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1903.b1824578x.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1938.b18246102.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1960.b18246321.txt  
  inflating: data/MO

  inflating: data/MOH/python/Westminster.1891.b2005709x.txt  
  inflating: data/MOH/python/Westminster.1857.b18248342.txt  
  inflating: data/MOH/python/CityofWestminster.1933.b18247957.txt  
  inflating: data/MOH/python/Poplar.1899.b18222894.txt  
  inflating: data/MOH/python/CityofWestminster.1944.b18248068.txt  
  inflating: data/MOH/python/CityofWestminster.1909.b1824774x.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1946.b18246187.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1931.b18246035.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1951.b18246230.txt  
  inflating: data/MOH/python/Westminster.1857.b18248354.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1904.b18245791.txt  
  inflating: data/MOH/python/CityofWestminster.1960.b18248226.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1961.b18246333.txt  
  inflating: data/MOH/python/PoplarMetropolitanBorough.1939.b18246114.txt  
  inflating: da

The data are stored in the following folder structure:

```
data
|___ MOH
     |___ python
          |____ CityofWestminster.1901.b18247660.txt
          |____ ...
```

The code below:
- harvests all paths to `.txt` files in `data/MOH/python`
- converts the result to a `list`

In [20]:
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python

We can print the paths to ten documents with list slicing: `[:10]` means "get the documents at index positions `0` to `9`" (i.e. the first ten items).

In [21]:
print(moh_reports_paths[:10]) # print the first ten items

[PosixPath('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'), PosixPath('data/MOH/python/CityofWestminster.1932.b18247945.txt'), PosixPath('data/MOH/python/CityofWestminster.1921.b18247830.txt'), PosixPath('data/MOH/python/PoplarandBromley.1900.b18245754.txt'), PosixPath('data/MOH/python/Poplar.1919.b18120878.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt'), PosixPath('data/MOH/python/CityofWestminster.1907.b18247726.txt'), PosixPath('data/MOH/python/CityofWestminster.1906.b18247714.txt'), PosixPath('data/MOH/python/CityofWestminster.1903.b18247684.txt'), PosixPath('data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt')]


Once we know where all the files are located we can create a corpus.

To do this, we apply the following steps:

- create an empty list variable where we will store the tokens of the corpus (line 3)
- iterate over the collected paths (line 5)
- read the text file (line 6)
- lowercase the text (line 6)
- tokenize the string (line 7): this converts the string to a list of tokens
- iterate over tokens (line 8)
- test if a token contains only alphabetic characters (line 9)
- add a token to the list if line 9 evaluates to True (line 10)

The general flow of the program is similar to what we've seen before: we create an empty list where we store information from our text collection, in this case, all alphabetic tokens.

We use one more Notebook functionality `%%time` to print how long the cell took to run.

It could take a few seconds for the cell to run, so please be a bit patient:

In [None]:
%%time

corpus = [] # initialize an empty list where we will store the tokens of the MOH reports

for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths 
    text_lower = open(p).read().lower() # read the text files and lowercase the string
    tokens = wordpunct_tokenize(text_lower) # tokenize the string
    for token in tokens: # iterate over the tokens
        if token.isalpha(): # test if token only contains alphabetic characters
            corpus.append(token) # if the above test evaluates to True, append token to the corpus list
print('collected', len(corpus),'tokens')

While this program works perfectly fine, it's not the most efficient code. The example below is a bit better, especially if you're confronted with lots of text files.

- the `with open` statement is a convenient way of handling both the opening **and** the closing of files: it makes sure each file is closed once it has been read, so you don't keep lots of files open, which could slow down the execution of your program
- line 8 shows a list comprehension, which is similar to a `for` loop but faster and more concise.
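
The first point can be checked with a small, self-contained sketch. It writes a temporary file (the file name and its content are invented for this illustration) and inspects the `.closed` attribute of the file object inside and after the `with` block.

```python
import tempfile
from pathlib import Path

# write a small temporary file so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = Path(tmp_dir) / 'example.txt'
    example_path.write_text('Report of the Medical Officer of Health')

    with open(example_path) as in_doc:
        text = in_doc.read()
        print(in_doc.closed)  # False: the file is still open inside the block
    print(in_doc.closed)  # True: the file was closed when the block ended
```
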

We won't spend too much time discussing list comprehensions. The example below should suffice for now. We write a small program that collects odd numbers. First, we generate a list of numbers with `range(10)`...

In [None]:
# see the output of range(10)
list(range(10))

... we test for divisibility by 2. `%` is the **modulus operator**, which "returns the remainder after dividing the left-hand operand by right-hand operand". `n % 2` evaluates to `0` if a number `n` can be divided by `2`. In Python `0` is treated as `False` in a test, meaning that if `n % 2` evaluates to `0`/`False` we won't append the number to `odd`. If it evaluates to any other integer, we'll append `n` to `odd`.

In [None]:
print(10%2)
print(15%2)

In [None]:
%%time
# program for finding odd numbers
numbers = range(10) # get numbers 0 to 9
odd = [] # empty list where we store odd numbers
for k in numbers: # iterate over numbers
    if k % 2: # test if the remainder after dividing by 2 is non-zero (i.e. the number is odd)
        odd.append(k) # if True append
print(odd) # print the odd numbers collected

The same can be achieved with just one line of code using a list comprehension.

In [None]:
%%time
odd = [k for k in range(10) if k % 2]
print(odd)

### -- Exercise

To see differences in performance, do the following:

- Remove the `print()` statement
- Increase the size of the list, i.e. change `range(10)` to `range(1000000)`.
- Compare the **Wall time** of these cells

Now returning to our example: run the slightly more efficient code and observe that it produces the same output, just faster!

In [None]:
%%time

corpus = [] # initialize an empty list where we will store the tokens of the MOH reports

for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths 
    with open(p) as in_doc: # make sure to close the document after opening it
        tokens = wordpunct_tokenize(in_doc.read().lower())
        corpus.extend([t for t in tokens if t.isalpha()]) # list comprehension    
print('collected', len(corpus),'tokens') # print number of tokens collected

After collecting all tokens in a `list` we can convert this to another data type: an NLTK `Text` object. The cell below shows the results of the conversion.

In [None]:
print(type(corpus))
nltk_corpus = nltk.text.Text(corpus) # convert the list of tokens to a nltk.text.Text object
print(type(nltk_corpus))

Why is this useful? Well, the NLTK `Text` object comes with many useful methods for corpus exploration. To inspect all the tools attached to a `Text` object, apply the `help()` function to `nltk_corpus` (`help(nltk.text.Text)` does the same trick). You have to scroll down a bit (ignore all methods starting with `__`) to inspect the class methods.

In [None]:
help(nltk_corpus) # show methods attached to the nltk.text.Text object or nltk_corpus variable

Let's have a closer look at `.concordance()`. According to the official documentation this method 
> Prints a concordance for ``word`` with the specified context window. Word matching is not case-sensitive.

It takes multiple arguments:
- `word`: the query term
- `width`: the context window, i.e. the number of characters printed per line
- `lines`: the number of lines to show (i.e. KWIC examples)

The first line of the output states the total number of hits for the query term (`Displaying * of * matches:`)

The example code below prints the context of the word **"poor"**.

In [None]:
nltk_corpus.concordance('poor',width=100,lines=10) # print the context of poor, window = 100 characters

### -- Exercise

Use KWIC analysis to compare the word "poor" in MOH reports from the City of Westminster and Poplar, using everything you learned in the previous Notebook:
- Create two subcorpora, one with Westminster reports and one with Poplar reports
- Tokenize the texts and convert the list of tokens to an NLTK `Text` object
- Use concordance to analyse the context of the word "poor"

In [None]:
# Enter code here

## 6.2 Collocations

While KWIC analysis is useful for investigating the context of words, it is a method that doesn't scale well: it helps with the close reading of around 100 words, but when examples run into the thousands it becomes more difficult. Collocations can help quantify the semantics of a term, or how the meaning of words differs between corpora or subsamples of a corpus.

Collocations, as explained in the AntConc section, are often multi-word expressions containing tokens that tend to co-occur, such as "New York City" (the span between words can be longer; they don't have to appear next to each other).

The NLTK `Text` object has a `collocations()` method. Below we print and explain the documentation.

> collocations(self, num=20, window_size=2)
    Print collocations derived from the text, ignoring stopwords.
    
It has the following parameters:
> `:param num:` The maximum number of collocations to print.

The number of collocations to print (if not specified it will print 20)

> `:param window_size:` The number of tokens spanned by a collocation (default=2)

If `window_size=2`, collocations will only include bigrams (words occurring next to each other). But sometimes we wish to include longer intervals, to make the co-occurrence of words within a broader window more visible. This allows us to go beyond multiword expressions and study the distribution of words in a corpus more generally. For example, we could check whether "men" and "women" are discussed in each other's context (within a span of 10), even if they don't appear next to each other.
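
The effect of the window size can be illustrated with a small sketch in plain Python. The helper `pairs_within_window` and the toy sentence are assumptions made for this illustration; they mimic the idea of a window, not necessarily NLTK's exact implementation.

```python
from collections import Counter

def pairs_within_window(tokens, window_size=2):
    """Count ordered pairs (a, b) where b follows a within window_size tokens."""
    counts = Counter()
    for i, a in enumerate(tokens):
        # with window_size=2 we only look at the next token (a bigram)
        for b in tokens[i + 1:i + window_size]:
            counts[(a, b)] += 1
    return counts

# toy sentence (invented for illustration)
example = 'men and women and children'.split()
print(pairs_within_window(example, window_size=2)[('men', 'women')])  # 0: not adjacent
print(pairs_within_window(example, window_size=3)[('men', 'women')])  # 1: found in the wider window
```

A wider window finds more pairs, but it also produces many more candidates to count, which is one reason larger windows take longer to compute.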

In [None]:
%%time
nltk_corpus.collocations(window_size=2)

In [None]:
%%time 
nltk_corpus.collocations(window_size=5)

While the `.collocations()` method provides a convenient tool for obtaining collocations from a corpus, its functionality remains rather limited. Below we will inspect the collocation functions of NLTK in more detail, giving you more power as well as precision.

Before we start we import all the required tools that `nltk.collocations` provides. This is handled by the `import *`, similar to a wildcard, it matches and loads all functions from `nltk.collocations`.

In [None]:
import nltk
from nltk.collocations import *

We have to select an association measure to compute the "strength" with which two tokens are "attracted" to each other. In general, collocations are words that are likely to appear together (within a specific context or window size). This explains why "red wine" is a strong collocation and "the wine" less so.

NLTK provides us with different measures, which you can print and investigate in more detail. Many of the functions refer to the classic NLP Handbook of Manning and Schütze, ["Foundations of statistical natural language processing"](https://nlp.stanford.edu/fsnlp/).

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [None]:
help(bigram_measures)

In our example we use pointwise mutual information (pmi) to compute collocations.

In [None]:
help(bigram_measures.pmi)



![pmi](https://miro.medium.com/max/930/1*OoI8_cZQwYGJEUjzozBOCw.png)

`pmi` is a rather straightforward metric. In the case of bigrams (i.e. collocations of length two and window size two):
- compute the total number of tokens in a corpus; assume this is `n` (3435)
- compute the probability of `a` and `b` appearing as a bigram. If the bigram `(a,b)` occurs 10 times, `P(a,b)` = 10/3435 = 0.0029
- compute the probability of observing `a` and `b` across the whole corpus. For example, if `a` appears `30` times and `b` `45` times, their respective probabilities are `P(a)` = 30/3435 = 0.0087 and `P(b)` = 45/3435 = 0.0131. We then multiply `P(a)` and `P(b)` to obtain the denominator 0.0087 `*` 0.0131 = 0.00011
- next we divide 0.0029 / 0.00011 = 25.44 (using the unrounded values) and take the base-2 logarithm of this value: log2(25.44) = 4.67

In [None]:
from numpy import log2
num = 10/3435 # P(a,b): probability of the bigram
denom = (30/3435) * (45/3435) # P(a) * P(b)
pmi = log2(num/denom) # pointwise mutual information
pmi

To get collocations by their `pmi` scores, we apply the `.from_words()` method to the `nltk_corpus` (or any list of tokens). The result of this operation is stored in a `finder` object which we can subsequently use to rank and print collocations.

Note that the results below look somewhat strange, these aren't very meaningful collocates.

In [None]:
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.nbest(bigram_measures.pmi, 10) 

These results are rather spurious. Why? If, for example, `a` and `b` both appear only once **and** next to each other, the `pmi` score will be high. But such pairs aren't meaningful collocations; they are rare artefacts of the data.
To solve this problem, we filter by ngram frequency, in our case removing all bigrams that appear fewer than 3 times with the `.apply_freq_filter()` method.
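
A small sketch in plain Python shows why rare pairs dominate. The function `pmi_scores` and the toy token list are invented for this illustration, and the calculation follows the simplified recipe described earlier in this section rather than NLTK's internals.

```python
from collections import Counter
from math import log2

def pmi_scores(tokens, min_freq=1):
    """Compute PMI for adjacent token pairs that occur at least min_freq times."""
    n = len(tokens)
    token_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (a, b), freq in bigram_counts.items():
        if freq < min_freq:
            continue  # skip rare pairs
        p_ab = freq / n
        p_a = token_counts[a] / n
        p_b = token_counts[b] / n
        scores[(a, b)] = log2(p_ab / (p_a * p_b))
    return scores

# toy token list (invented): 'public health' is frequent, 'zymotic diarrhoea' occurs once
tokens = ('public health ' * 3 + 'the report the health the public zymotic diarrhoea').split()

scores = pmi_scores(tokens)
best = max(scores, key=scores.get)
print(best)  # the pair that occurs only once gets the highest score

filtered = pmi_scores(tokens, min_freq=3)
print(max(filtered, key=filtered.get))  # after filtering, the frequent pair remains
```
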

In [None]:
help(finder.apply_freq_filter)

In [None]:
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

Now many names appear. We can even be more strict and use a higher threshold for filtering.

In [None]:
finder.apply_freq_filter(20)
finder.nbest(bigram_measures.pmi, 10)

It is also possible to change the window size, but the larger the window size, the longer the computation takes.

In [None]:
finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)

Lastly, you can focus on collocations that contain a specific token, for example all collocations with the token "poor". To do this, we have to pass a function to `.apply_ngram_filter()`. At this point, you shouldn't worry about the code, only understand how to adapt it (see exercise below).

In [None]:
def token_filter_poor(*w): # w contains the tokens of one ngram
    return 'poor' not in w # True for ngrams without 'poor', which are then removed

finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.apply_freq_filter(3)
finder.apply_ngram_filter(token_filter_poor)
finder.nbest(bigram_measures.pmi, 10)
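
To see what the filter function does, it helps to call it directly on a few bigrams. `.apply_ngram_filter()` removes every ngram for which the function returns `True`, so the function has to return `True` for bigrams *without* the query term. The example bigrams below are invented for illustration.

```python
def token_filter_poor(*w):
    # w is a tuple with the tokens of one ngram, e.g. ('poor', 'law')
    return 'poor' not in w

print(token_filter_poor('poor', 'law'))       # False: the bigram contains 'poor' and is kept
print(token_filter_poor('public', 'health'))  # True: the bigram lacks 'poor' and is removed
```
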

### -- Exercise
Copy-paste the above code and create a program that prints the first 10 collocations with the word "women".
- change the frequency threshold
- explore other association measures: to what extent do your results change?

In [None]:
# Enter code here

## 6.3 Feature selection

In the last section of this Notebook, we explore computational methods for finding words that characterize a collection: we try to select tokens (more generally features) that distinguish a particular set of documents vis-a-vis another corpus. 

Such comparisons help us determine what type of language use was distinctive for a particular group (such as a political party), period or location. We continue with the example of the MOH reports, but compare the language of different boroughs: the affluent Westminster with the industrial, and considerably poorer, Poplar.

The code below should look familiar, but we made a few changes.
- to make sure the cell can be run on its own, we import the libraries and collect the paths to the reports again
- we create two empty lists `corpus` and `labels`. In the former we store our text documents (each item in the list is one text file/string), the latter contains labels, `0` for Poplar and `1` for Westminster. We collect these labels in parallel with the text, i.e. if the first item in `corpus` is a text from Westminster, the first label in `labels` is `1`.
- we use `with open` to automatically close each document after opening it (line 12)
- lines 15 - 18 contain an `if else` statement: if the string `westminster` appears in the file name we add `1` to `labels`, otherwise `0`.

In [18]:
%%time
import nltk # import natural language toolkit
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import wordpunct_tokenize function from nltk.tokenize

moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python

corpus = [] # save corpus here
labels = [] # save labels here

for r in moh_reports_paths: # iterate over documents
    with open(r) as in_doc: # open the document (and make sure it is closed afterwards)
        corpus.append(in_doc.read().lower()) # append the lowercased document to corpus

        if 'westminster' in r.name.lower(): # check if westminster appears in the file name
            labels.append(1) # if so, append 1 to labels
        else: # if not
            labels.append(0) # append 0 to labels

CPU times: user 242 ms, sys: 61.7 ms, total: 303 ms
Wall time: 368 ms


Each document should correspond to one label. The lists `labels` and `corpus` should have equal length.

In [11]:
print(len(labels),len(corpus))

0 0


In [None]:
print(len(labels) == len(corpus))

As said earlier, we collect labels for each document, `1` for Westminster and `0` for Poplar (it could also be the reverse, of course!). It is important that each label corresponds correctly with a text file in `corpus`.

In [None]:
print(labels[:10])
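
One way to verify the pairing is to look at labels and file names side by side. The sketch below uses three file names taken from the listing earlier in this Notebook; the helper `label_for` is our own naming and simply repeats the rule used in the loop above.

```python
from collections import Counter
from pathlib import Path

def label_for(path):
    """Return 1 for Westminster reports and 0 for all other (Poplar) reports."""
    return 1 if 'westminster' in path.name.lower() else 0

example_paths = [
    Path('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'),
    Path('data/MOH/python/CityofWestminster.1932.b18247945.txt'),
    Path('data/MOH/python/Westminster.1891.b2005709x.txt'),
]
example_labels = [label_for(p) for p in example_paths]

for p, label in zip(example_paths, example_labels):
    print(label, p.name)  # label next to the file it belongs to
print(Counter(example_labels))  # number of documents per label
```

The same `zip` pattern should work on the full `moh_reports_paths` and `labels` lists, since both were filled in the same order.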

We can check this by printing the first hundred characters of the first document (labelled as `0`)...

Note that `corpus[0]` returns the first document, from which we slice the first hundred characters with `[:100]`.

In [None]:
corpus[0][:100]

... and the second document (labelled as `1`, assuming your files are listed in the same order as in the output printed earlier)

In [None]:
corpus[1][:100]

Checking your code by eyeballing the output is always good practice. Even if your code runs, it could still contain bugs, which are commonly referred to as "semantic errors".

To obtain the most distinctive words (for reports from both Westminster and Poplar) we use an external library, [`TextFeatureSelection`](https://pypi.org/project/TextFeatureSelection/). Python has a very rich and fast-evolving ecosystem. If you have a problem, it's very likely someone wrote a library to help you with it. We first have to install this package (it's not yet part of Colab).

In [None]:
!pip install TextFeatureSelection

Now we can apply the `TextFeatureSelection` library. The documentation is available [here](https://pypi.org/project/TextFeatureSelection/).

Computing the features requires only a few lines of code. You only need to provide 
- a corpus for the `input_doc_list` parameter
- a list of labels for the `target` parameter

`TextFeatureSelection` then uses various metrics to compute the extent to which words are associated with a label. The output of this process is a `pandas.DataFrame`. Working with tabular data and data frames will be extensively discussed in Part II of this course. For now, we show you how to sort information and get the most distinctive words or features.

In [None]:
from TextFeatureSelection import TextFeatureSelection # import TextFeatureSelection
help(TextFeatureSelection) # show the documentation of the class

In [None]:
fsOBJ=TextFeatureSelection(target=labels,input_doc_list=corpus) # compute features
df=fsOBJ.getScore() # get features as a dataframe
df

A `pandas.DataFrame` is similar to an Excel spreadsheet. It contains several columns which we can use for selecting and sorting information. In fact, if you are familiar with Excel, you can export the data frame and open it as a spreadsheet. The code below takes care of this.


In [None]:
df.to_excel('working_data/result_features.xlsx')

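We use the following columns to select and rank words:
- **word occurence count**: How often a term occurs in the corpus (the column name is spelled this way in the data frame)
- **Proportional Difference**: "It helps find unigrams that occur mostly in one class of documents or the other."
- **Mutual Information**: The discriminatory power of a word.
- **Information Gain** and **Chi Square**: two further scores in the data frame, which the cells below use to rank words.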
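
To give an intuition for the proportional difference, the sketch below uses one common definition: the difference between the number of documents containing a word in each class, divided by their sum. This is an assumption made for illustration, with invented counts and a hypothetical helper name; the library may compute its score differently, so check its documentation before relying on the details.

```python
def proportional_difference(docs_with_word_a, docs_with_word_b):
    """(A - B) / (A + B): +1 if the word only occurs in class A, -1 if only in class B."""
    total = docs_with_word_a + docs_with_word_b
    if total == 0:
        return 0.0  # the word is absent from both classes
    return (docs_with_word_a - docs_with_word_b) / total

# invented document counts, for illustration only
print(proportional_difference(30, 10))  # 0.5: the word leans towards class A
print(proportional_difference(0, 25))   # -1.0: the word only occurs in class B
```

Under this definition the sign tells you which class a word leans towards, which is how the filters on `Proportional Difference` in the cells below separate the two boroughs.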

In [None]:
westminster_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] > 0 )]
westminster_df.sort_values('Information Gain',ascending=False)[:10]

In [None]:
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Information Gain',ascending=False)[:10]

In [None]:
poplar_df = df[(df['word occurence count'] > 20 ) & (df['Proportional Difference'] < 0 )]
poplar_df.sort_values('Chi Square',ascending=False)[:10]

## Fin.