# 6 Corpus Exploration

This Notebook explores various tools for analysing and comparing texts at the corpus level. As such, these are your first ventures into "macro-analysis" with Python. The methods described here are particularly powerful in combination with the techniques for content selection explained in Notebook 5 **Corpus Creation**.

More specifically, we will have a closer look at:

- **Keyword in Context Analysis**: Explore context of words, similar to concordance in AntConc
- **Collocations**: Compute which tokens tend to co-occur together
- **Feature selection**: Compute which tokens are distinctive for a subset of texts

## 6.1 Keyword in Context

Computers are excellent at indexing, organizing and retrieving information. However, interpreting information (especially natural language) is still a difficult task. Keyword-in-Context (KWIC) analysis brings together the best of both worlds: the retrieval power of machines and the close-reading skills of the historian. KWIC (or concordance) centres a corpus on a specific query term, with `n` words (or characters) to the left and the right.
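To make the idea concrete before we turn to NLTK, here is a minimal sketch of KWIC in plain Python. The function name `kwic` and the example sentence are our own inventions for illustration; this is not how NLTK implements its concordance.

```python
def kwic(tokens, query, n=3):
    """Return (left, keyword, right) tuples for every hit of `query` in `tokens`."""
    hits = []
    for i, token in enumerate(tokens):
        if token == query:
            left = tokens[max(0, i - n):i]   # up to n tokens before the hit
            right = tokens[i + 1:i + 1 + n]  # up to n tokens after the hit
            hits.append((' '.join(left), token, ' '.join(right)))
    return hits

# made-up example sentence, not taken from the MOH reports
example_tokens = 'relief was given to the poor of the district and the poor law guardians'.split()

for left, keyword, right in kwic(example_tokens, 'poor', n=3):
    print(f'{left:>20} [{keyword}] {right}')  # right-align the left context
```

Each hit is centred on the query term, which makes it easy to scan the contexts by eye.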

In this section, we investigate reports of the London Medical Officers of Health, the [London's Pulse corpus](https://wellcomelibrary.org/moh/). 

> The reports were produced each year by the Medical Officer of Health (MOH) of a district and set out the work done by his public health and sanitary officers. The reports provided vital data on birth and death rates, infant mortality, incidence of infectious and other diseases, and a general statement on the health of the population. 

Source: https://wellcomelibrary.org/moh/about-the-reports/about-the-medical-officer-of-health-reports/

**[Important]** Before you continue, run the cell to download and extract the data we need in the exercises.

We start by importing the necessary libraries. Some of the code is explained in previous Notebooks, so we won't discuss it in detail here.

The tools we need are:
- `nltk`: Natural Language Toolkit, for tokenization and concordance
- `pathlib`: a library for managing files and folders

In [None]:
import nltk # import natural language toolkit
from pathlib import Path # import Path object from pathlib
from nltk.tokenize import wordpunct_tokenize # import wordpunct_tokenize function from nltk.tokenize

In [None]:
!ls data/MOH/python/ # list all files in data/MOH/python/

The data are stored in the following folder structure:

```
data
|___ MOH
     |___ python
          |____ CityofWestminster.1901.b18247660.txt
          |____ ...
```

The code below:
- harvests all paths to `.txt` files in `data/MOH/python`
- converts the result to a `list`

In [None]:
moh_reports_paths = list(Path('data/MOH/python').glob('*.txt')) # get all txt files in data/MOH/python
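If you want to try this pattern without touching the MOH data, the sketch below builds a throwaway folder and collects its `.txt` files. The file names are hypothetical. We also wrap the result in `sorted()`, which is our own suggestion: `glob` does not promise a particular order, so sorting keeps the list stable between runs.

```python
import tempfile
from pathlib import Path

# build a temporary folder with hypothetical file names, just for illustration
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ['b_report.txt', 'a_report.txt', 'notes.csv']:
        (folder / name).write_text('placeholder')

    # glob yields the matching paths; sorted() returns them as a list in a stable order
    txt_paths = sorted(folder.glob('*.txt'))
    txt_names = [p.name for p in txt_paths]  # keep only the file names

print(txt_names)
```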

We can print the paths to the first ten documents with list slicing: `[:10]` means: get the documents at index positions `0` to `9` (i.e. the first ten items).

In [None]:
print(moh_reports_paths[:10]) # print the first ten items

Once we know where all the files are located, we can apply the following steps:
- create an empty list variable where we will store the tokens of the corpus (line 3)
- iterate over the collected paths (line 5)
- read the text file (line 6)
- lowercase the text (line 6)
- tokenize the string (line 7): this converts the string to a list of tokens
- iterate over tokens (line 8)
- test if a token contains only alphabetic characters (line 9)
- add token to the list if line 9 evaluates to True (line 10)

The general flow of the program is similar to what we've seen before: we create an empty list (or other object) where we store specific information from a text collection, in this case all alphabetic tokens.

We use one more notebook feature here:
- `%%time` prints how long the cell took to run

It could take a few seconds for the cell to run, so please be a bit patient:

In [None]:
%%time

corpus = [] # initialize an empty list where we will store the tokens of the MOH reports

for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    text_lower = open(p).read().lower() # read the text file and lowercase the string
    tokens = wordpunct_tokenize(text_lower) # tokenize the string
    for token in tokens: # iterate over the tokens
        if token.isalpha(): # test if token only contains alphabetic characters
            corpus.append(token) # if the above test evaluates to True, append token to the corpus list
print('collected', len(corpus), 'tokens')
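To see what the tokenization and filtering steps do to a single sentence, the sketch below mimics them with the standard library only. We assume here that `wordpunct_tokenize` splits text with the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation), which is our reading of the NLTK documentation. The function name `simple_tokenize` and the sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    # assumption: approximates wordpunct_tokenize with the pattern \w+|[^\w\s]+
    return re.findall(r"\w+|[^\w\s]+", text)

example = "The Poor-Law Guardians met in 1901."  # made-up sentence
tokens = simple_tokenize(example.lower())        # lowercase, then tokenize

alpha_tokens = []                                # keep only alphabetic tokens
for token in tokens:
    if token.isalpha():                          # drops '-', '1901' and '.'
        alpha_tokens.append(token)

print(tokens)
print(alpha_tokens)
```

Note that the year and the punctuation marks are separate tokens after tokenization, and that the `.isalpha()` test removes them.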

While this small program works perfectly fine, it's not the most efficient code. The example below is a bit more efficient, which matters especially if you're confronted with lots of text files.

- the `with open` statement is a convenient way of handling the opening and closing of files: it closes each file as soon as the indented block ends, so you don't keep many files open at once, which could slow down the execution of your program
- line 8 shows a list comprehension, which is similar to a for loop, but faster and more concise.

We won't spend too much time discussing list comprehensions; the examples below should suffice for now. We write a small program that collects odd numbers. First we generate a list of numbers with `range(10)`...

In [None]:
# see the output of range(10)
list(range(10))

... then we test for division by 2: `%` is the modulus operator, "which returns the remainder after dividing the left-hand operand by right-hand operand". `n % 2` evaluates to `0` if a number `n` can be divided by `2`. In Python `0` counts as `False`, meaning that if `n % 2` evaluates to `0` we won't append the number to `odd`.

In [None]:
%%time
# program for finding odd numbers
numbers = range(10) # get numbers 0 to 9
odd = [] # empty list where we store odd numbers
for k in numbers: # iterate over numbers
    if k % 2: # test if the remainder is non-zero, i.e. the number is odd
        odd.append(k) # if True append
print(odd) # print the odd numbers

The same can be achieved with just one line of code using a list comprehension.

In [None]:
%%time
odd = [k for k in range(10) if k % 2]
print(odd)

### -- Exercise

To see differences in performance, do the following:

- remove the `print()` statement
- crank up the size of the list, i.e. change `range(10)` to `range(1000000)`
- compare the **Wall time** of these cells

Now returning to the actual example: Run the slightly better code and observe that it produces the same output, just faster!

In [None]:
%%time

corpus = [] # initialize an empty list where we will store the tokens of the MOH reports

for p in moh_reports_paths: # iterate over the paths to MOH reports, p will take the value of each item in moh_reports_paths
    with open(p) as in_doc: # the with statement closes the document once the block ends
        tokens = wordpunct_tokenize(in_doc.read().lower()) # read, lowercase and tokenize the text
        corpus.extend([t for t in tokens if t.isalpha()]) # list comprehension
print('collected', len(corpus), 'tokens') # print number of tokens collected
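If you want to convince yourself that the `with` statement really closes the file, the small sketch below checks the `.closed` attribute inside and after the block. The file name and its content are hypothetical and live in a temporary folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / 'example.txt'  # hypothetical file, for illustration only
    example_path.write_text('Some Example Text')

    with open(example_path) as in_doc:
        text = in_doc.read().lower()          # read and lowercase while the file is open
        closed_inside = in_doc.closed         # False: still open inside the block

    closed_after = in_doc.closed              # True: closed as soon as the block ended

print(text, closed_inside, closed_after)
```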

After collecting all tokens in a `list` we can convert it to another data type, an NLTK `Text` object. The cell below shows the results of the conversion.

In [None]:
print(type(corpus))
nltk_corpus = nltk.text.Text(corpus) # convert the list of tokens to a nltk.text.Text object
print(type(nltk_corpus))

Why is this useful? Well, the `Text` object comes with many useful methods for corpus exploration. To inspect all the tools attached to a `Text` object, apply the `help()` function to `nltk_corpus` (`help(nltk.text.Text)` would do the same trick). You have to scroll down a bit (ignore all methods starting with `__`).

In [None]:
help(nltk_corpus) # show methods attached to the nltk.text.Text object or nltk_corpus variable
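The output of `help()` can be long. As a shorter alternative, the sketch below lists only the names of the public methods of an object. `public_methods` is a hypothetical helper we define here; it is not part of NLTK. We demonstrate it on a plain list, but in the notebook you could call `public_methods(nltk_corpus)` in the same way.

```python
def public_methods(obj):
    """Names of the callable attributes of `obj` that don't start with an underscore."""
    return [name for name in dir(obj)
            if not name.startswith('_') and callable(getattr(obj, name))]

example = ['a', 'b']            # a plain list as stand-in object
print(public_methods(example))  # method names such as 'append' and 'extend'
```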

Let's have a closer look at `.concordance()`. According to the official documentation this method 
> Prints a concordance for ``word`` with the specified context window. Word matching is not case-sensitive.

It takes multiple arguments:
- `word`: the query term
- `width`: the context window, i.e. the number of characters printed per line
- `lines`: the number of lines (i.e. KWIC examples) to print

The first line of the output states the total number of hits for the query term (`Displaying * of * matches:`).

The example code below prints the context of the word **"poor"**.

In [None]:
nltk_corpus.concordance('poor', width=100, lines=10) # print the context of poor, window = 100 characters

### -- Exercise

Compare "poor" between City of Westminster and Poplar. 

**[TO DO: explain exercise]**

## 6.2 Collocations

While KWIC analysis is useful for investigating the context of words, it is a method that doesn't scale well: it helps with the close reading of around 100 words, but when examples run into the thousands it becomes more difficult. Collocations can help to quantify the semantics of a term, or how the meaning of words differs between corpora or subsamples of a corpus.

Collocations, as explained in the AntConc section, are multi-word expressions containing words that tend to co-occur.

The NLTK `Text` object has a `collocations()` method. Below we print and explain the documentation.

> collocations(self, num=20, window_size=2)
    Print collocations derived from the text, ignoring stopwords.
    
It has the following parameters:
> `:param num:` The maximum number of collocations to print.

The number of collocations to print (if not specified it will print 20)

> `:param window_size:` The number of tokens spanned by a collocation (default=2)

If `window_size=2`, collocations will only include bigrams (words occurring next to each other). But sometimes we wish to include longer intervals, to make the co-occurrence of words within a broader window more visible. This allows us to go beyond multiword expressions and study the distribution of words in a corpus more generally. For example, we could check whether "men" and "women" are discussed in each other's context (within a span of 10), even if they don't appear next to each other.
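The effect of the window size can be illustrated with a few lines of plain Python. The sketch below counts ordered pairs of tokens that occur within a given window. It reflects our understanding of what a window means here, and it is not NLTK's actual implementation; the function name `window_pairs` and the example phrase are made up.

```python
from collections import Counter

def window_pairs(tokens, window_size=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 within the window."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # with window_size=2 this only looks at the directly following token
        for w2 in tokens[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts

example_tokens = 'men and women and children'.split()  # made-up example

bigrams = window_pairs(example_tokens, window_size=2)  # adjacent tokens only
wider = window_pairs(example_tokens, window_size=3)    # one token may sit in between

print(bigrams[('men', 'women')], wider[('men', 'women')])
```

With a window of 2 the pair ("men", "women") is never counted because "and" sits between the two words; widening the window to 3 picks it up.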

In [None]:
nltk_corpus.collocations(window_size=2)

In [None]:
nltk_corpus.collocations(window_size=5)

While the `.collocations()` method is an easy tool for quickly computing collocations, its functionality remains rather limited. The cells below inspect the collocation functions of NLTK in a bit more detail, giving you a bit more power and precision.

Before we start, we import all the tools `nltk.collocations` provides. This is handled by `import *`: similar to a wildcard, it matches and loads everything in `nltk.collocations`.

In [None]:
import nltk
from nltk.collocations import *

Next we have to select an association measure to compute the "strength" with which two tokens are attracted to each other. In general, collocations are words that frequently appear together (within a certain window size), but are unlikely to appear in general (outside this window size). This explains why "the wine" is not a collocation while "red wine" is.

NLTK provides us with different measures, which you can print and investigate in more detail. Many of the functions refer to the classic NLP Handbook of Manning and Schütze, ["Foundations of statistical natural language processing"](https://nlp.stanford.edu/fsnlp/).

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [None]:
help(bigram_measures)

In [None]:
help(bigram_measures.pmi)

`pmi` (pointwise mutual information) is a rather straightforward metric. In the case of bigrams:
- compute the total number of tokens in a corpus; assume this is `n` (here 3435)
- compute the probability of `a` and `b` appearing as a bigram. If the bigram (a,b) occurs 10 times, the probability P(a,b) is 10/3435
- compute the probability of observing `a` and `b` independently of each other. For example, if `a` appears 30 times and `b` 45 times, this becomes (30/3435) * (45/3435)
- divide the first probability by the second and take the logarithm (base 2) of this value

![pmi](https://miro.medium.com/max/930/1*OoI8_cZQwYGJEUjzozBOCw.png)

In [None]:
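from numpy import log2
nom = 10/3435 # P(a,b): probability of the bigram (a,b)
denom = (30/3435) * (45/3435) # P(a) * P(b): probability expected if a and b were independent
pmi = log2(nom/denom) # pointwise mutual information
pmi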

To rank collocations by their PMI scores, we apply the `.from_words()` method to the `nltk_corpus` (or any list of tokens). The result of this operation is stored in `finder`, which we can subsequently use for printing collocations. Note that the results below look somewhat strange: these aren't very meaningful collocates.

In [None]:
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.nbest(bigram_measures.pmi, 10) 

These results are rather spurious. If, for example `a` and `b` both appear only once and next to each other, the PMI score will be very high, but this is not necessarily a very meaningful collocation, more a rare artefact of the data.
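A small calculation shows why rare pairs end up at the top of the ranking. The sketch below computes PMI from raw counts, reusing the corpus size and the counts from the worked example above for a frequent pair, and compares it with a pair whose two tokens each occur exactly once. The function name `pmi_from_counts` is our own; it only restates the formula and is not an NLTK function.

```python
from math import log2

def pmi_from_counts(count_ab, count_a, count_b, n):
    """PMI of a bigram (a, b) from raw counts in a corpus of n tokens."""
    return log2((count_ab / n) / ((count_a / n) * (count_b / n)))

n = 3435                                        # corpus size from the worked example
frequent_pair = pmi_from_counts(10, 30, 45, n)  # counts from the worked example
rare_pair = pmi_from_counts(1, 1, 1, n)         # a and b each occur once, side by side

print(round(frequent_pair, 2), round(rare_pair, 2))
```

For the rare pair the score reduces to `log2(n)`, the highest value any bigram can reach in a corpus of this size, even though a single co-occurrence is very weak evidence.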

We filter by ngram frequency, removing in our case all bigrams that appear fewer than 3 times, with the `.apply_freq_filter()` method.

In [None]:
help(finder.apply_freq_filter)

In [None]:
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

In [None]:
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)

It is also possible to change the window size, but the larger the window size, the longer the computation takes.

In [None]:
finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 5)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)

Lastly, you can focus on collocations that contain a specific token, for example get all collocations with the token "poor".

In [None]:
# apply_ngram_filter removes every bigram for which the function returns True,
# so we return True for bigrams that do NOT contain 'poor'
def token_filter(*w):
    return 'poor' not in w

finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.apply_ngram_filter(token_filter)
finder.nbest(bigram_measures.pmi, 10)

## 6.3 Feature selection

The last section of this Notebook aims at contrasting corpora and finding tokens (or word patterns) that distinguish one set of documents from another. This may help us discover what is particular about the language of a specific group (such as a political party) or period. We continue with the example of the MOH reports, but compare the language of different boroughs: the affluent Westminster and the industrial, and considerably poorer, Poplar.

The code below should look familiar but we made a few changes.



In [None]:
corpus = [] # save corpus here
labels = [] # save labels here


for r in moh_reports_paths: # iterate over the paths to the documents
    with open(r) as in_doc: # open document (and take care of closing it afterwards)
        if 'westminster' in r.name.lower(): # check if westminster appears in the file name
            labels.append(1) # if so, append 1 to labels
        else: # if not
            labels.append(0) # append 0 to labels

        corpus.append(in_doc.read().lower()) # append the lowercase document to corpus
  

Check that the number of labels and the number of documents are equal.

In [None]:
print(len(labels),len(corpus))

In [None]:
# process text: lemmatize, keep only adjectives and nouns

In [None]:
# install external library

In [None]:
!pip install TextFeatureSelection

In [None]:
# apply library

In [None]:
from TextFeatureSelection import TextFeatureSelection
fsOBJ = TextFeatureSelection(target=labels, input_doc_list=corpus) # pass the labels and the documents
result_df = fsOBJ.getScore() # compute the feature selection scores for each token
result_df

In [None]:
# inspect results

In [None]:
result_df[result_df['word occurence count'] > 5].sort_values('Chi Square',ascending=False)[:20]
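The cell above ranks tokens by their `Chi Square` score. To give an intuition for what such a score expresses, the sketch below computes the textbook chi-square statistic for a 2x2 table of document counts. This is an illustration under our own assumptions: the function name `chi_square_2x2` and all counts are made up, and the library may compute its score differently.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table of document counts.

    a: label-1 documents containing the token   b: label-0 documents containing it
    c: label-1 documents without the token      d: label-0 documents without it
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator if denominator else 0.0

# made-up counts for two hypothetical tokens, 20 documents per label
distinctive = chi_square_2x2(18, 2, 2, 18)      # appears mostly in label-1 documents
evenly_spread = chi_square_2x2(10, 10, 10, 10)  # appears equally often in both groups

print(distinctive, evenly_spread)
```

The more unevenly a token is distributed over the two groups, the higher its score; a token that is spread evenly scores zero.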

In [None]:
help(result_df.sort_values)

## Fin.

### Appendix With Sklearn

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer 

In [None]:
vectorizer = CountVectorizer(min_df=5)
X = vectorizer.fit_transform(corpus)
feature_names = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2


In [None]:
ch2 = SelectKBest(chi2, k=10)
X = ch2.fit_transform(X, labels)


In [None]:
selected = [(feature_names[i],ch2.scores_[i])for i
                    in ch2.get_support(indices=True)]
selected